Building a scoring product on top of real dispute data creates a specific temptation: the numbers look good enough that you start reaching for the most impressive version of every claim.
We did that. We caught it. Here's the full account.
Our alert engine generates natural-language explanations using Gemini. The explainer prompt had access to a field called `claude_confidence` — the arbitration model's self-confidence score on a 0–1 scale.
The problem: Gemini didn't know what `claude_confidence` meant. So it invented an interpretation. Alerts went out saying things like:
"Claude's confidence at 99% and a Yes price of only $0.13 — high conviction against his nomination."
That's wrong. 99% was not the probability of any outcome. It was the model's confidence in its own classification — a completely different thing. Five alerts went out with this framing before we caught it.
Fix: Stripped `claude_confidence` from the LLM payload entirely. Added a field glossary telling the model that `dispute_risk` is a dispute probability, not a YES/NO outcome probability.
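As a rough illustration, here's a minimal sketch of that fix in Python, assuming the alert payload is a flat dict. The helper names and any fields other than `claude_confidence` and `dispute_risk` are hypothetical, not our exact production code.

```python
# Internal fields that must never reach the explainer prompt.
LLM_EXCLUDED_FIELDS = {"claude_confidence"}

# Glossary prepended to the prompt so the model doesn't invent meanings.
FIELD_GLOSSARY = (
    "Field definitions:\n"
    "- dispute_risk: probability (0-1) that this market's resolution will be "
    "disputed. It is NOT a probability of the YES or NO outcome.\n"
)

def build_explainer_payload(market: dict) -> dict:
    """Return only the fields the explainer model is allowed to see."""
    return {k: v for k, v in market.items() if k not in LLM_EXCLUDED_FIELDS}

def build_explainer_prompt(market: dict) -> str:
    payload = build_explainer_payload(market)
    return f"{FIELD_GLOSSARY}\nMarket data: {payload}\nExplain the alert in one sentence."
```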
The alert engine has a filter that blocks malformed market titles — things like raw SHA-256 hashes stored in the database instead of real question text. The filter worked for pure hex strings. It didn't work for hex strings with a trailing ?. One alert went out with a headline that was literally a hash followed by a question mark.
Fix: Stripped trailing punctuation before the hex check. Added a regression test.
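A sketch of what that check looks like, assuming the leaked hashes are 64-character SHA-256 hex strings; the exact punctuation set and function names here are assumptions.

```python
import re

# Raw SHA-256 hashes are 64 hex characters with no spaces.
HEX_TITLE = re.compile(r"^[0-9a-f]{64}$", re.IGNORECASE)

def is_malformed_title(title: str) -> bool:
    """True if the title is a raw hash rather than real question text."""
    # Strip trailing punctuation first, so a hash followed by "?" is still caught.
    cleaned = title.strip().rstrip("?!.,;:")
    return bool(HEX_TITLE.match(cleaned))

# Regression test for the alert that slipped through.
def test_hex_title_with_trailing_question_mark_is_blocked():
    assert is_malformed_title("a" * 64 + "?")
```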
An alert fired on "Will Bitcoin reach $100,000 by December 31, 2025?" in April 2026. That market closed months earlier. The alert engine was treating zero-volume closed markets as live candidates.
Fix: Added a `close_time`-in-past check to all three alert detectors. Markets past their close date are skipped regardless of their risk score.
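Roughly, the guard looks like this, assuming `close_time` is stored as a timezone-aware datetime; the field access and function name are illustrative.

```python
from datetime import datetime, timezone

def is_live_candidate(market: dict, now: datetime | None = None) -> bool:
    """Closed markets are never alert candidates, regardless of risk score."""
    now = now or datetime.now(timezone.utc)
    close_time = market.get("close_time")  # assumed timezone-aware datetime
    if close_time is not None and close_time <= now:
        return False
    return True
```

Each detector runs this check before it even looks at the risk score, so a market like the Bitcoin one above drops out the moment its close date passes.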
The landing page said "AUC-ROC 0.80." We ran a proper calibration this week on 139,484 resolved Polymarket markets.
The actual AUC is 0.624.
The 0.80 figure came from an early benchmark run on a small, curated subset. It didn't hold at scale. We pulled it. The landing page now leads with Brier score 0.022 — which measures what actually matters: are the scores well-calibrated probabilities?
The answer is: mostly yes on average, weaker at the tails. Predictions near 0 or 1 should be read as directional signals, not exact probabilities. We say that on the site now.
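For reference, both headline numbers come from standard metrics. A sketch of the computation with scikit-learn, assuming arrays of model scores and UMA dispute outcomes (1 = disputed); the function name is illustrative.

```python
from sklearn.metrics import brier_score_loss, roc_auc_score

def calibration_summary(scores, disputed):
    """scores: predicted dispute probabilities in [0, 1]; disputed: 0/1 outcomes."""
    return {
        "auc_roc": roc_auc_score(disputed, scores),   # ranking quality
        "brier": brier_score_loss(disputed, scores),  # calibration + sharpness
    }
```

Brier score is just the mean squared error between the predicted probability and the 0/1 outcome, so lower is better.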
Run on 139,484 resolved Polymarket markets with UMA dispute ground truth:
| Risk Bucket | Markets | Disputed | Rate | vs Baseline |
|---|---|---|---|---|
| Clean (0–10%) | 88,410 | 702 | 0.8% | 0.7x |
| Medium (10–25%) | 44,469 | 548 | 1.2% | 1.1x |
| High (25–50%) | 6,392 | 324 | 5.1% | 4.4x |
| Extreme (50%+) | 213 | 20 | 9.4% | 8.2x |
Baseline dispute rate: 1.14%. The 8.2x separation in the extreme bucket is real and holds on 139K markets. The model isn't a perfect dispute predictor — but it separates risk meaningfully.
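The table above is just predicted-risk bins versus realized dispute rates. A sketch of how it's derived, assuming a pandas DataFrame with hypothetical `risk` and `disputed` columns:

```python
import pandas as pd

# Bucket edges match the table above.
BUCKETS = [(0.00, 0.10, "Clean"), (0.10, 0.25, "Medium"),
           (0.25, 0.50, "High"), (0.50, 1.01, "Extreme")]

def bucket_table(df: pd.DataFrame) -> pd.DataFrame:
    baseline = df["disputed"].mean()  # overall dispute rate (~1.14% here)
    rows = []
    for lo, hi, name in BUCKETS:
        sub = df[(df["risk"] >= lo) & (df["risk"] < hi)]
        rate = sub["disputed"].mean() if len(sub) else 0.0
        rows.append({
            "bucket": name,
            "markets": len(sub),
            "disputed": int(sub["disputed"].sum()),
            "rate": rate,
            "vs_baseline": rate / baseline,
        })
    return pd.DataFrame(rows)
```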
The prediction market community is small and technically sharp. Inflated claims get spotted fast, and getting caught overclaiming costs more credibility than the inflated metric ever bought.
More importantly: if you're using OracleMangle to make position decisions, you deserve to know what the model actually does and where it falls short. The tail calibration caveat matters. If we say a market is 95% dispute risk, that's a strong directional signal — but don't treat it as a precise probability.
We'll keep publishing calibration updates as the dataset grows.
Free dispute risk checks on Telegram.
Every open Polymarket market scored. Refreshed every 30 minutes.