Method

How it works.

A forecasting model, pre-committed against the prediction markets and graded the way a trading desk grades itself: on whether the price moves toward it, and on whether its probabilities are calibrated. Not a tip sheet. A track record built in the open.

The model

Each team carries a strength rating from a World Football Elo built on ~49,000 international matches, blended with squad market value and corrected for confederation-level bias (ratings in weaker regions inflate against weak regional opposition; an empirical-Bayes shrinkage re-levels them). That rating difference feeds a goal model: Poisson scoring rates with a Dixon-Coles low-score correction, which yields a full scoreline distribution for every fixture and, from it, the win / draw / loss probabilities.
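A minimal sketch of that goal model, under stated assumptions. The Dixon-Coles adjustment on the four low-score cells is the standard published form; the link from rating difference to scoring rates (`rates_from_rating_diff`, with its `base` and `scale` values) and the dependence parameter `rho = -0.1` are hypothetical placeholders, not the project's fitted values.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals at scoring rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles correction factor; differs from 1 only for 0-0, 0-1, 1-0, 1-1."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def rates_from_rating_diff(diff: float, base: float = 1.35, scale: float = 0.0025):
    """Hypothetical link: log scoring rates shift linearly with the rating gap."""
    return base * math.exp(scale * diff), base * math.exp(-scale * diff)

def scoreline_matrix(lam: float, mu: float, rho: float = -0.1, max_goals: int = 10):
    """grid[x][y] = P(team A scores x, team B scores y), renormalised after truncation."""
    grid = [
        [dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
         for y in range(max_goals + 1)]
        for x in range(max_goals + 1)
    ]
    total = sum(map(sum, grid))
    return [[p / total for p in row] for row in grid]

def win_draw_loss(grid):
    """Collapse the scoreline grid into win / draw / loss for team A."""
    win = sum(p for x, row in enumerate(grid) for y, p in enumerate(row) if x > y)
    draw = sum(grid[i][i] for i in range(len(grid)))
    return win, draw, 1.0 - win - draw

lam, mu = rates_from_rating_diff(100.0)  # team A rated 100 points higher
print(win_draw_loss(scoreline_matrix(lam, mu)))
```

With a negative `rho`, the correction shifts probability toward 0-0 and 1-1 and away from 0-1 and 1-0, which is why it is usually described as a low-score draw fix.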

The tournament itself is a joint Monte-Carlo simulation: tens of thousands of simulated tournaments, coherent by construction (the advance probabilities sum to the right number of teams, the bracket reconciles). That's where the advance, group-winner, reach-round and champion numbers come from.
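To illustrate what "coherent by construction" means, here is a toy sketch of a single four-team group. Everything in it is an assumption for illustration: the `strength` values, the `base` scoring rate, the exponential link, the tiebreak order, and a simplified rule that exactly the top two teams advance. Because each simulated group picks its qualifiers from one shared table, the advance probabilities sum to the number of slots.

```python
import math
import random
from itertools import combinations

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for small scoring rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_group(strength: dict, n_sims: int = 20000, seed: int = 7, base: float = 1.3):
    """Return each team's probability of finishing in the top two (toy rule)."""
    rng = random.Random(seed)
    teams = list(strength)
    advance = {t: 0 for t in teams}
    for _ in range(n_sims):
        pts = {t: 0 for t in teams}
        gd = {t: 0 for t in teams}
        gf = {t: 0 for t in teams}
        for a, b in combinations(teams, 2):
            # Hypothetical link from strength gap to scoring rates.
            ga = sample_poisson(base * math.exp(strength[a] - strength[b]), rng)
            gb = sample_poisson(base * math.exp(strength[b] - strength[a]), rng)
            gf[a] += ga
            gf[b] += gb
            gd[a] += ga - gb
            gd[b] += gb - ga
            if ga > gb:
                pts[a] += 3
            elif gb > ga:
                pts[b] += 3
            else:
                pts[a] += 1
                pts[b] += 1
        # Rank on points, goal difference, goals scored, then a random tiebreak.
        ranked = sorted(teams, key=lambda t: (pts[t], gd[t], gf[t], rng.random()),
                        reverse=True)
        for t in ranked[:2]:
            advance[t] += 1
    return {t: advance[t] / n_sims for t in teams}

probs = simulate_group({"A": 0.4, "B": 0.1, "C": -0.1, "D": -0.4})
print(probs, "sum =", sum(probs.values()))
```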

Predictions are pre-committed

Every forecast is timestamped to an append-only ledger before the match is played. Nothing is fit in hindsight. That's the difference between a forecast and a backtest: when the grade comes, it's honest, because the call can't be edited after the fact.
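One way to make a ledger tamper-evident is to chain each entry to the hash of the one before it. The sketch below shows that idea only; the field names (`logged_at`, `forecast`, `prev_hash`, `hash`) and the use of SHA-256 chaining are assumptions for illustration, not a description of the project's actual ledger format.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash for the first entry

def _digest(body: dict) -> str:
    """Stable SHA-256 of an entry body (keys sorted so the hash is reproducible)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_forecast(ledger: list, forecast: dict, now: datetime = None) -> dict:
    """Append a timestamped forecast that commits to the previous entry's hash."""
    body = {
        "logged_at": (now or datetime.now(timezone.utc)).isoformat(),
        "forecast": forecast,
        "prev_hash": ledger[-1]["hash"] if ledger else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    ledger.append(entry)
    return entry

def verify(ledger: list) -> bool:
    """True only if no entry was edited, removed, or reordered."""
    prev = GENESIS
    for entry in ledger:
        body = {k: entry[k] for k in ("logged_at", "forecast", "prev_hash")}
        if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

ledger = []
append_forecast(ledger, {"match": "A v B", "p_home": 0.52, "p_draw": 0.26, "p_away": 0.22})
append_forecast(ledger, {"match": "C v D", "p_home": 0.31, "p_draw": 0.28, "p_away": 0.41})
print(verify(ledger))
```

Editing any logged probability after the fact changes that entry's digest, so `verify` fails for the edited entry.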

It learns from the tournament

The model is an updating one. As each group game resolves, the simulation conditions on the actual result: that game is fixed, only the rest is simulated, on ratings that have absorbed what happened. Once the group stage is done, the knockout simulation runs on the real bracket and updated ratings. It's Bayesian-style evidence updating, not a one-shot prediction.
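The conditioning step can be sketched as follows, assuming the ratings absorb results through a plain Elo update. The K-factor of 60 and the omission of any goal-difference or home-advantage adjustment are simplifying assumptions, and `condition_on_results` is a hypothetical helper name. Played fixtures are consumed as facts and move the ratings; only the remainder is handed back to be simulated.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 60.0):
    """Zero-sum update; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

def condition_on_results(ratings: dict, fixtures: list, results: dict):
    """Return (ratings after played games, fixtures still to simulate)."""
    ratings = dict(ratings)  # leave the caller's ratings untouched
    remaining = []
    for a, b in fixtures:
        if (a, b) in results:
            ga, gb = results[(a, b)]
            score_a = 1.0 if ga > gb else 0.5 if ga == gb else 0.0
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
        else:
            remaining.append((a, b))
    return ratings, remaining

ratings = {"A": 1800.0, "B": 1600.0, "C": 1700.0}
fixtures = [("A", "B"), ("A", "C"), ("B", "C")]
updated, todo = condition_on_results(ratings, fixtures, {("A", "B"): (0, 1)})
print(updated, todo)
```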

How it's graded

Closing-Line Value (CLV). The cleanest test of skill that doesn't wait for the result: did the market price drift toward the model after the forecast was logged? Positive CLV is a skill signal independent of who eventually wins, and it is the same metric sharp bettors and desks live by. It needs scale to be conclusive, though: the established bar is a few hundred resolved forecasts before consistent positive CLV is more than suggestive, and the forecasts logged here are correlated across markets, so the early numbers are a signal forming, not a verdict.
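As a rough sketch, CLV for a single forecast can be measured in probability points: take the side the model favoured relative to the price when the forecast was logged, then see how far the closing price moved in that direction. This sign convention and the function names are assumptions; the project's exact definition may differ.

```python
def clv(model_p: float, entry_price: float, closing_price: float) -> float:
    """Price move toward the model, in probability points (positive = toward)."""
    if model_p > entry_price:
        side = 1.0   # model thought the contract was underpriced
    elif model_p < entry_price:
        side = -1.0  # model thought the contract was overpriced
    else:
        side = 0.0   # no disagreement, so no CLV either way
    return side * (closing_price - entry_price)

def mean_clv(records):
    """Average CLV over (model_p, entry_price, closing_price) tuples."""
    values = [clv(*r) for r in records]
    return sum(values) / len(values), len(values)

# Hypothetical records: (model probability, price at log time, closing price)
records = [(0.60, 0.50, 0.55), (0.30, 0.40, 0.42), (0.45, 0.45, 0.50)]
print(mean_clv(records))
```

Because forecasts on related markets move together, a naive standard error on this mean would overstate the evidence, which is the reason for the caution about early numbers.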

Calibration. When the model says 30%, does it happen about 30% of the time? Graded with proper scoring rules (Brier, log-loss) against real outcomes as markets resolve. A model can be confident and wrong; calibration is the check that it isn't.
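Both scoring rules are short enough to write out. The sketch below uses the binary form (did the event happen or not); the example probabilities are made up.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between forecast probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps: float = 1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# A confident miss costs more than a hedged one under both rules.
print(brier_score([0.9], [0]), brier_score([0.6], [0]))
print(log_loss([0.9], [0]), log_loss([0.6], [0]))
```

Lower is better for both. Log-loss punishes confident misses much more steeply than Brier does.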

First read (group stage, 72 games). Of every forecaster scored on these match outcomes, the market is the best-calibrated: 23% Brier skill over the no-information baseline, with a reliability slope of 1.07 (1.0 is perfect). It edges the pre-committed model (21% skill, slope 0.87). A real-money market being better-calibrated than a transparent fundamental model is exactly the premise of this project.
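The two headline numbers can be sketched as follows. Two assumptions are made here because the text does not spell them out: the no-information baseline is taken to be the observed base rate, and the reliability slope is taken to be the least-squares slope of outcomes on forecast probabilities (below 1 means forecasts are too extreme, above 1 too timid). The data is synthetic.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill(probs, outcomes, ref_probs):
    """Fractional improvement over a reference forecast (1 = perfect, 0 = no better)."""
    return 1.0 - brier(probs, outcomes) / brier(ref_probs, outcomes)

def reliability_slope(probs, outcomes):
    """OLS slope of outcome on forecast probability (assumed definition)."""
    n = len(probs)
    mean_p = sum(probs) / n
    mean_y = sum(outcomes) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(probs, outcomes))
    var = sum((p - mean_p) ** 2 for p in probs)
    return cov / var

# Synthetic example: 20% calls hit 1 in 5, 80% calls hit 4 in 5.
outcomes = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
calibrated = [0.2] * 5 + [0.8] * 5
overconfident = [0.1] * 5 + [0.9] * 5
base_rate = [sum(outcomes) / len(outcomes)] * len(outcomes)

print(brier_skill(calibrated, outcomes, base_rate), reliability_slope(calibrated, outcomes))
print(reliability_slope(overconfident, outcomes))
```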

Reliability: predicted probability vs what happened

72 group-stage matches · points sized by sample · closer to the dashed 45° line = better calibrated

pre-committed model · slope 0.87 | market · slope 1.07 | perfect calibration

The market curve hugs the diagonal; the model bows below it at the high end, backing favorites a shade too hard, the favorite-longshot signature. Both are close and both skillful, but the real-money price wins the grade.

The data engine

Alongside the model, a second pillar runs live: a cross-venue microstructure pipeline that records both prediction markets (Kalshi and Polymarket) tick by tick during matches, then studies how price actually forms. An always-on server collects; a laptop does the heavy parsing. Raw recordings are disposable, the small per-match summaries are the source of truth, and every pooled result is rebuilt from those summaries.

Server · 24/7

Schedule & capture. 35 minutes before each kickoff, record a 3-hour tape of both venues' full order books and every trade (about 1.3 GB per match).

▼ laptop pulls each settled tape, every 2 hours

Laptop · 2h

Parse once, three ways. Load the tape a single time (about 7 GB in memory) and feed all three analyses from the one pass.

Lead-lag. Which venue's price moves first on a goal.

Overreaction. Does the price overshoot a goal, then revert?

Order flow. Does flow on one venue predict the other's next move?

▼ one summary per match, pooled across every match

Laptop

Pool, render, discard. Rebuild the pooled result from the per-match summaries, render the graphic, and delete the raw tape once all three summaries exist.

Each analysis outputs a pooled statistic with its sample size and significance, not a headline. The final result, reported straight: Polymarket's price leads Kalshi's by a median +600 ms on goals, moving first in 72% of decisive events and carrying an 81.0% Gonzalo-Granger information share across 63 cointegrated matches, a lead that survives a cluster-robust bootstrap. Order-flow imbalance moves price on both venues exactly as the microstructure literature predicts. Whether that price lead is tradeable once you cross the spread is a separate, disclosed forward test, and the answer is no: the book collapses to ~0.5% depth at the goal. The detail lives in the research note.
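For readers unfamiliar with a cluster-robust bootstrap, the idea is to resample whole matches rather than individual goal events, since events within a match are not independent. The sketch below applies it to the median lead. The data, the sign convention (positive means Polymarket moved first, in seconds), and the percentile interval are illustrative assumptions, not the project's pipeline.

```python
import random
import statistics

def cluster_bootstrap_median(leads_by_match: dict, n_boot: int = 2000,
                             seed: int = 11, alpha: float = 0.05):
    """Point estimate and percentile interval for the pooled median lead.

    leads_by_match maps a match id to the list of per-goal lead times
    observed in that match. Matches are the resampling unit.
    """
    rng = random.Random(seed)
    matches = list(leads_by_match)
    medians = []
    for _ in range(n_boot):
        # Draw matches with replacement, keeping each match's events together.
        sample = rng.choices(matches, k=len(matches))
        pooled = [x for m in sample for x in leads_by_match[m]]
        medians.append(statistics.median(pooled))
    medians.sort()
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    point = statistics.median([x for m in matches for x in leads_by_match[m]])
    return point, lo, hi

# Hypothetical lead times in seconds, grouped by match.
toy = {"m1": [0.4, 0.6, 0.8], "m2": [0.5, 0.7], "m3": [0.2, 0.9, 0.6]}
print(cluster_bootstrap_median(toy))
```

If the lower end of the interval stays above zero, the lead is not driven by a handful of unusual matches.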

The honesty principle

The premise is pro-market: these markets are very hard to beat, so the model is held to the price, not graded against a strawman. Wins and misses are reported the same way. A mediocre result reported straight is worth more than a cherry-picked one. The rigor is the point, and it's the only thing that makes a public track record mean anything.

Full methodology, code, and pre-registration: github.com/Thesavagecoder7784/xResidual. Paper only, no real-money trading.