Building a scoring product on top of real dispute data creates a specific temptation: the numbers look good enough that you start reaching for the most impressive version of every claim.
We did that. We caught it. Here's the full account.
Our alert engine generates natural-language explanations using Gemini. The explainer prompt had access to a field called `claude_confidence` — the arbitration model's self-confidence score on a 0–1 scale.
The problem: Gemini didn't know what `claude_confidence` meant. So it invented an interpretation. Alerts went out saying things like:
"Claude's confidence at 99% and a Yes price of only $0.13 — high conviction against his nomination."
That's wrong. 99% was not the probability of any outcome. It was the model's confidence in its own classification — a completely different thing. Five alerts went out with this framing before we caught it.
Fix: Stripped `claude_confidence` from the LLM payload entirely. Added a field glossary telling the model that `dispute_risk` is a dispute probability, not a YES/NO outcome probability.
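As a rough illustration, here's a minimal sketch of that fix in Python, assuming the alert payload is a flat dict. The helper names and any fields other than `claude_confidence` and `dispute_risk` are hypothetical, not our exact production code.

```python
# Internal fields that must never reach the explainer prompt.
LLM_EXCLUDED_FIELDS = {"claude_confidence"}

# Glossary prepended to the prompt so the model doesn't invent meanings.
FIELD_GLOSSARY = (
    "Field definitions:\n"
    "- dispute_risk: probability (0-1) that this market's resolution will be "
    "disputed. It is NOT a probability of the YES or NO outcome.\n"
)

def build_explainer_payload(market: dict) -> dict:
    """Return only the fields the explainer model is allowed to see."""
    return {k: v for k, v in market.items() if k not in LLM_EXCLUDED_FIELDS}

def build_explainer_prompt(market: dict) -> str:
    payload = build_explainer_payload(market)
    return f"{FIELD_GLOSSARY}\nMarket data: {payload}\nExplain the alert in one sentence."
```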
The alert engine has a filter that blocks malformed market titles — things like raw SHA-256 hashes stored in the database instead of real question text. The filter worked for pure hex strings. It didn't work for hex strings with a trailing ?. One alert went out with a headline that was literally a hash followed by a question mark.
Fix: Stripped trailing punctuation before the hex check. Added a regression test.
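A sketch of what that check looks like, assuming the leaked hashes are 64-character SHA-256 hex strings; the exact punctuation set and function names here are assumptions.

```python
import re

# Raw SHA-256 hashes are 64 hex characters with no spaces.
HEX_TITLE = re.compile(r"^[0-9a-f]{64}$", re.IGNORECASE)

def is_malformed_title(title: str) -> bool:
    """True if the title is a raw hash rather than real question text."""
    # Strip trailing punctuation first, so a hash followed by "?" is still caught.
    cleaned = title.strip().rstrip("?!.,;:")
    return bool(HEX_TITLE.match(cleaned))

# Regression test for the alert that slipped through.
def test_hex_title_with_trailing_question_mark_is_blocked():
    assert is_malformed_title("a" * 64 + "?")
```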
An alert fired on "Will Bitcoin reach $100,000 by December 31, 2025?" in April 2026. That market closed months earlier. The alert engine was treating zero-volume closed markets as live candidates.
Fix: Added a `close_time`-in-past check to all three alert detectors. Markets past their close date are skipped regardless of their risk score.
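Roughly, the guard looks like this, assuming `close_time` is stored as a timezone-aware datetime; the field access and function name are illustrative.

```python
from datetime import datetime, timezone

def is_live_candidate(market: dict, now: datetime | None = None) -> bool:
    """Closed markets are never alert candidates, regardless of risk score."""
    now = now or datetime.now(timezone.utc)
    close_time = market.get("close_time")  # assumed timezone-aware datetime
    if close_time is not None and close_time <= now:
        return False
    return True
```

Each detector runs this check before it even looks at the risk score, so a market like the Bitcoin one above drops out the moment its close date passes.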
The landing page said "AUC-ROC 0.80." We ran a proper calibration this week on 139,484 resolved Polymarket markets.
The actual AUC is 0.624.
The 0.80 figure came from an early benchmark run on a small, curated subset. It didn't hold at scale. We pulled it. The landing page now leads with Brier score 0.022 — which measures what actually matters: are the scores well-calibrated probabilities?
The answer is: mostly yes on average, weaker at the tails. Predictions near 0 or 1 should be read as directional signals, not exact probabilities. We say that on the site now.
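For reference, both headline numbers come from standard metrics. A sketch of the computation with scikit-learn, assuming arrays of model scores and UMA dispute outcomes (1 = disputed); the function name is illustrative.

```python
from sklearn.metrics import brier_score_loss, roc_auc_score

def calibration_summary(scores, disputed):
    """scores: predicted dispute probabilities in [0, 1]; disputed: 0/1 outcomes."""
    return {
        "auc_roc": roc_auc_score(disputed, scores),   # ranking quality
        "brier": brier_score_loss(disputed, scores),  # calibration + sharpness
    }
```

Brier score is just the mean squared error between the predicted probability and the 0/1 outcome, so lower is better.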
Run on 139,484 resolved Polymarket markets with UMA dispute ground truth:
| Risk Bucket | Markets | Disputed | Rate | vs Baseline |
|---|---|---|---|---|
| Clean (0–10%) | 88,410 | 702 | 0.8% | 0.7x |
| Medium (10–25%) | 44,469 | 548 | 1.2% | 1.1x |
| High (25–50%) | 6,392 | 324 | 5.1% | 4.4x |
| Extreme (50%+) | 213 | 20 | 9.4% | 8.2x |
Baseline dispute rate: 1.14%. The 8.2x separation in the extreme bucket is real and holds on 139K markets. The model isn't a perfect dispute predictor — but it separates risk meaningfully.
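The table above is just predicted-risk bins versus realized dispute rates. A sketch of how it's derived, assuming a pandas DataFrame with hypothetical `risk` and `disputed` columns:

```python
import pandas as pd

# Bucket edges match the table above.
BUCKETS = [(0.00, 0.10, "Clean"), (0.10, 0.25, "Medium"),
           (0.25, 0.50, "High"), (0.50, 1.01, "Extreme")]

def bucket_table(df: pd.DataFrame) -> pd.DataFrame:
    baseline = df["disputed"].mean()  # overall dispute rate (~1.14% here)
    rows = []
    for lo, hi, name in BUCKETS:
        sub = df[(df["risk"] >= lo) & (df["risk"] < hi)]
        rate = sub["disputed"].mean() if len(sub) else 0.0
        rows.append({
            "bucket": name,
            "markets": len(sub),
            "disputed": int(sub["disputed"].sum()),
            "rate": rate,
            "vs_baseline": rate / baseline,
        })
    return pd.DataFrame(rows)
```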
The prediction market community is small and technically sharp. Inflated claims get spotted fast, and getting caught overclaiming costs more credibility than the inflated metric ever bought.
More importantly: if you're using OracleMangle to make position decisions, you deserve to know what the model actually does and where it falls short. The tail calibration caveat matters. If we say a market is 95% dispute risk, that's a strong directional signal — but don't treat it as a precise probability.
We'll keep publishing calibration updates as the dataset grows.
Free dispute risk checks on Telegram.
Every open Polymarket market scored. Refreshed every 30 minutes.