How Accurate Is FPL Tactics?
We’ve backtested the planner against three full Premier League seasons under walk-forward cross-validation — the same out-of-sample testing discipline a quant fund would use before trusting a strategy with real money. This page covers every headline result, the experiments that didn’t work, and what three test seasons can and can’t tell us.
For context: a top-10k FPL finish usually lands in the 2400–2500 range, top-1k around 2500–2600, and an elite season clears 2700. Steady’s mean from a top-15 opener (2380 ± 30) and Climber’s mean from the same start (2378 ± 31) both put a manager within touching distance of that top-10k band. What most people will actually feel is Climber’s lift from a weaker starting squad.
What walk-forward testing means
The numbers on this page come from walk-forward cross-validation. The idea is simple: when we test the model on a season, the model isn’t allowed to have seen that season during training. It only knows what was available before.
- Testing on 2022/23 → the projection model was trained on 2019/20, 2020/21, and 2021/22 only.
- Testing on 2023/24 → training adds 2022/23.
- Testing on 2024/25 → training adds 2023/24.
Earlier versions of this page used a single set of model parameters fit on every available season, then tested on those same seasons. That’s a known way to flatter a model — it gets to peek at the answer before grading itself. Walk-forward closes that gap: each test season is held strictly out of training.
The trade-off is that the model has less data in the earlier folds (2024/25 trains on five seasons; 2022/23 only on three). That’s the cost of an honest test, and we report the per-fold numbers below so you can see whether a finding holds across all three years or only on average.
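To make the fold construction concrete, here is a minimal Python sketch. The season lists mirror the bullets above; the loop body is a placeholder, since the real pipeline re-fits the projection model and replays the season inside each fold.

```python
# Strict-past folds: a test season is only ever scored by a model trained on
# seasons that finished before it.
SEASONS = ["2019/20", "2020/21", "2021/22", "2022/23", "2023/24", "2024/25"]
TEST_SEASONS = ["2022/23", "2023/24", "2024/25"]

def walk_forward_folds(seasons, test_seasons):
    """Yield (train_seasons, test_season) pairs under a strict-past split."""
    for test in test_seasons:
        cutoff = seasons.index(test)
        yield seasons[:cutoff], test

for train, test in walk_forward_folds(SEASONS, TEST_SEASONS):
    # Real pipeline: fit the projection model on `train`, then replay `test`.
    print(f"test {test}: train on {len(train)} seasons -> {train}")
```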
Headline strategy backtest
We replayed three full Premier League seasons end to end: 2022/23, 2023/24, and 2024/25. The simulator enforces every FPL rule that matters — the £100m budget cap, the three-per-club limit, free transfers, the −4 hit cost, the half-rise sell mechanic, bench autosubs. Each strategy makes its own transfers, captain calls, and bench decisions gameweek by gameweek, against the real historical fixtures and the real prices players had at the time. Re-running any (strategy, season, starting squad) combination gives exactly the same result — there’s no randomness in the planner.
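As an illustration of the kind of legality checks involved, here is a hedged sketch of the budget, club-limit, and hit-cost rules. The data shapes and function names are ours for the example, not the simulator's actual internals.

```python
from collections import Counter
from dataclasses import dataclass

BUDGET_CAP = 100.0   # £m
MAX_PER_CLUB = 3
HIT_COST = 4         # points per transfer beyond the free allowance

@dataclass
class Pick:
    name: str
    club: str
    price: float     # buy price in £m

def squad_is_legal(picks: list[Pick]) -> bool:
    """Budget cap and three-per-club limit. Positions, squad size, the
    half-rise sell mechanic and autosubs are separate checks in the
    real simulator."""
    if sum(p.price for p in picks) > BUDGET_CAP:
        return False
    club_counts = Counter(p.club for p in picks)
    return all(count <= MAX_PER_CLUB for count in club_counts.values())

def hit_cost(transfers_made: int, free_transfers: int) -> int:
    """Points deducted for transfers beyond the free allowance."""
    return HIT_COST * max(0, transfers_made - free_transfers)
```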
Two strategy archetypes ship, each tested on the starting condition it was built for.
From a weak GW1 squad (bottom 15 players by GW1–3 form, fitted within the £100m budget):
| Strategy | Horizon | Threshold | Hits cap | Mean | Per-season | Transfers / yr | Hits taken / yr |
|---|---|---|---|---|---|---|---|
| Climber | h = 6 | T = 2.5 | up to 2/GW | 2178 ± 49 | 2135 / 2247 / 2151 | 37.3 | 6.0 |
| Steady | h = 6 | T = 2.5 | 0 | 2074 ± 29 | 2037 / 2079 / 2107 | 35.0 | 0 |
| Climber − Steady | — | — | — | +103 ± 51 | +98 / +168 / +44 | +2.3 | +6.0 |
From a strong GW1 squad (top 15 by GW1–3 form, same budget):
| Strategy | Horizon | Threshold | Hits cap | Mean | Per-season | Transfers / yr | Hits taken / yr |
|---|---|---|---|---|---|---|---|
| Climber | h = 6 | T = 2.5 | up to 2/GW | 2378 ± 31 | 2421 / 2351 / 2363 | 29.3 | 0.3 |
| Steady | h = 6 | T = 2.5 | 0 | 2380 ± 30 | 2421 / 2350 / 2368 | 29.3 | 0 |
Means across 2022/23, 2023/24, 2024/25 under walk-forward training. Transfers and hits are per-season averages across the three folds.
The strong-start row is the cleanest way to see why Steady is the right pick for an established squad. Climber’s hit budget self-regulates: from a top-15 opener, the MILP almost never takes a −4 (0.3 hits per season, all in one fold), because nothing on an elite XV is worth one. With the hit budget effectively unused, Climber and Steady make the same number of transfers (29.3 each) on the same gameweeks, and the per-season totals differ by at most 5 points in any fold. Steady’s “no hits” guarantee comes with zero performance cost from a strong start because the guarantee isn’t binding. From a weak start the picture is completely different: Climber spends 6 hits per season in every fold, takes ~2 extra transfers, and gains the headline +103 points over Steady.
Two things to take from this:
- From a weak start, Climber beat Steady in every test season — by +98, +168, and +44 points. The mean is +103, the std is 51, and there’s no fold where Steady wins. The narrowest margin (2024/25, +44) is also the season that introduced the CBIT defensive-contribution rule, which the walk-forward model wasn’t trained on; we cover that in the caveats.
- From a strong start, Climber and Steady finish in a tie. Both land around 2380. The “hits allowed” permission collapses to “no hits taken in practice” because nothing on a strong XV is worth a −4 to swap out. Steady is the right pick if you want the no-hits guarantee in writing.
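One way to see the self-regulation both bullets describe: assume, purely for illustration, that the threshold T acts as a minimum projected-gain bar per move over the planning horizon (that reading of T is our simplification, not a confirmed description of the MILP). A −4 is then only attractive when a swap clears the bar plus the hit cost, and on a strong XV nothing does.

```python
HIT_COST = 4.0

def hit_is_worth_it(projected_gain_over_horizon: float, threshold: float = 2.5) -> bool:
    """Simplified reading: pay for an extra transfer only if its projected
    gain over the horizon clears the threshold plus the -4 hit."""
    return projected_gain_over_horizon > threshold + HIT_COST

print(hit_is_worth_it(9.0))  # True  -> a weak squad has upgrades this big
print(hit_is_worth_it(3.0))  # False -> a strong squad usually doesn't
```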
Starting-condition results
The two tables above test the archetypes on the conditions they were designed for. To get a wider view, we ran the full strategy grid twice — once from a weak start, once from a strong one — sweeping every combination of horizon (4, 5, 6, 7), threshold (2.0, 2.5, 3.0, 3.5), and max-hits (0, 1, 2). That’s 4 × 4 × 3 = 48 cells per starting condition.
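The grid is small enough to enumerate directly; a sketch of the cell layout (the labels match the H/T/X shorthand used below):

```python
from itertools import product

HORIZONS = [4, 5, 6, 7]
THRESHOLDS = [2.0, 2.5, 3.0, 3.5]
MAX_HITS = [0, 1, 2]

grid = list(product(HORIZONS, THRESHOLDS, MAX_HITS))
assert len(grid) == 48  # 4 x 4 x 3 cells per starting condition

labels = [f"H{h} T{t} X{x}" for h, t, x in grid]
# Each cell is replayed over all three walk-forward folds, once from the
# weak (bottom-15) start and once from the strong (top-15) start.
```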
Two patterns came out clearly:
Max-hits matters in the obvious direction. From a weak start, every one of the eight worst-performing cells had max-hits set to 0 — you can’t climb out of a bad squad without spending hits. From a strong start, the top two cells had max-hits set to 0 — a strong XV has no high-conviction hit waiting, so allowing hits adds nothing and occasionally hurts.
The shipped pair lives in the top tier from both starts. Climber (H6 T2.5 X2) finished 7th of 48 from a weak start and 9th of 48 from a strong start. Steady (H6 T2.5 X0) finished 33rd from a weak start (expected — it can’t escape a hole without hits) and 7th from a strong start.
We didn’t pick the absolute best cell from a strong start. That was H6 T3.0 X0, which posted a higher mean (2437) but more than double the season-to-season spread (std 68 vs Steady’s 30). The shipped Steady ties Climber’s strong-start mean within a point, has half the spread of the top cell, and pairs symmetrically with Climber — same horizon, same threshold, only the hits toggle differs. Trading 57 points of mean for half the season-to-season swing felt like the right call, especially since the symmetry makes the two-archetype product story coherent.
xGI ensemble validation
Separately from the strategy backtest, we validated the projection model itself on multi-season holdouts. We compared the xGI + Elo ensemble against an Elo-only baseline on three metrics:
| Metric | Result | What it means |
|---|---|---|
| Bias reduction | 30 – 50% | The ensemble’s projections are systematically less off-centre than Elo alone. Players the model used to over-rate drift down; under-rated players drift up. |
| RMSE reduction | 5 – 9% | The ensemble’s projection is closer to realised points on average. |
| MAE | Flat | Mean absolute error is roughly unchanged. The ensemble pays a small noise tax to get a much better bias profile — exactly the trade we wanted. |
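The three metrics are the standard definitions; a small self-contained sketch of how they compare two sets of projections against realised points (the numbers are toy values, not holdout data):

```python
import numpy as np

def bias(projected, actual):
    """Mean signed error: positive means systematic over-projection."""
    return float(np.mean(projected - actual))

def rmse(projected, actual):
    """Root mean squared error: punishes large misses hardest."""
    return float(np.sqrt(np.mean((projected - actual) ** 2)))

def mae(projected, actual):
    """Mean absolute error: average size of the miss, sign ignored."""
    return float(np.mean(np.abs(projected - actual)))

actual = np.array([2.0, 6.0, 1.0, 9.0])
elo_only = np.array([3.5, 4.0, 2.5, 6.0])
ensemble = np.array([2.8, 4.8, 2.1, 7.2])

for name, proj in [("elo-only", elo_only), ("ensemble", ensemble)]:
    print(f"{name:9s} bias={bias(proj, actual):+.2f} "
          f"rmse={rmse(proj, actual):.2f} mae={mae(proj, actual):.2f}")
```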
Component-Elo v2d (which added the defensive-contribution component for the 2025/26 CBIT rule) originally shipped at T = 3.0 under the previous archetype. The walk-forward redesign re-tuned the threshold down to T = 2.5 alongside the horizon move from h=5 to h=6.
Experiments that failed
An accuracy page that only lists wins is a marketing page. The experiments below were run, didn’t work, and changed how we built the planner.
Differential strategy — failed
We tested two ways of leaning into differentials. The first was a gated multiplier that boosted projections for already-top-25%-rated players who also had low ownership. The second was a bold-differential finder that applied a concentrated 1.5× boost to the top-K candidates per gameweek, ranked by projection × (1 − ownership). Neither beat the baseline. Both lost 64–131 points per season versus the h=4 baseline they were tested against, and 180–248 versus the h=5 no-hits configuration that was Steady at the time.
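For concreteness, the two mechanisms looked roughly like this. The 1.5× boost and the projection × (1 − ownership) ranking are from the experiment as described; the gated-multiplier constants (boost size, rating and ownership gates) are illustrative stand-ins.

```python
def gated_multiplier(projection, rating_percentile, ownership,
                     boost=1.10, rating_gate=0.75, ownership_gate=0.10):
    """Variant 1 (sketch): nudge up already-top-quartile players who are
    also lowly owned. Gate and boost values here are illustrative."""
    if rating_percentile >= rating_gate and ownership <= ownership_gate:
        return projection * boost
    return projection

def bold_differentials(players, k=3, boost=1.5):
    """Variant 2 (sketch): rank by projection * (1 - ownership) and apply a
    concentrated boost to the top-K candidates for the gameweek."""
    ranked = sorted(players, key=lambda p: p["proj"] * (1 - p["own"]), reverse=True)
    chosen = {p["name"] for p in ranked[:k]}
    return {p["name"]: p["proj"] * (boost if p["name"] in chosen else 1.0)
            for p in players}
```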
The diagnosis: ownership is carrying signal we aren’t modelling. The most-owned players are most-owned because they’re the highest-projected ones, and boosting low-ownership picks trades the template players the model already likes for “better-than-projected” alternatives that don’t actually deliver. The idea might still work against a different objective — owning a low-ownership hauler moves you up the league standings even if it doesn’t maximise raw points — but against a points-EV objective, it loses.
Elo seed/retention tuning — failed
We tested lowering the initial Elo for new players and resetting the state more aggressively at the start of each season. Two variants (V4: median seeds plus reset; V6: seeds only) produced visibly better rank correlation with real outcomes, but lost 50–75 points per season in the strategy backtest.
The diagnosis: high seed Elos are load-bearing for how the optimiser thinks about unknowns. A newly transferred-in player needs to look attractive enough for the MILP to take a position on him, even when his underlying Elo is uncertain. Lower seeds give a more honest rating but starve the optimiser of plausible targets. Reverted.
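In sketch form, with made-up numbers standing in for the real seeds and reset factors:

```python
OPTIMISTIC_SEED = 1500.0   # illustrative: the "load-bearing" high seed we kept
MEDIAN_SEED = 1350.0       # illustrative: the lower seed tested in V4/V6

def seed_elo(variant: str) -> float:
    """Starting rating for a player with no Elo history. V4 and V6 both
    tested the lower seed; the shipped model keeps the high one."""
    return MEDIAN_SEED if variant in ("V4", "V6") else OPTIMISTIC_SEED

def season_reset(elo: float, baseline: float = 1500.0, retention: float = 0.5) -> float:
    """V4's more aggressive season-start reset: pull every rating part-way
    back toward a baseline so last season's form carries less weight."""
    return baseline + retention * (elo - baseline)
```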
S6 hold-3 + injury-react — superseded
An earlier sweep flagged a hold-3 + injury-react trigger combination as the first robustly winning strategy we’d found — +5% mean, 3/3 wins. Under the current ensemble, the same strategy loses 3.2%. The squad-alerts banner that originally shipped on that finding still exists, but its rationale is stale: the alert is harmless and adds no measurable lift.
Shorter-window h=4 archetype — retired
For most of the model’s evolution we shipped a shorter-window (h=4) variant, marketed as a tighter, more predictable alternative to the longer-window default. The stratified-start sweep retired it: across 16 different starting squads it lost to Climber 15 times, and it couldn’t even back up its “tighter range” pitch — variance was equal or worse. The walk-forward results agree. From a weak start, the best h=4 cell sits at rank 24 of 48; from a strong start it gets to rank 5, but still well behind H6 T2.5 X0. h=4 simply doesn’t earn a slot.
Climber breakout state machine — failed
When we first identified the Climber profile, an obvious worry was that allowing hits would start leaking points once the squad had actually climbed out of the hole. We built a state machine to flip Climber back to Steady’s parameters once the squad’s projected starting XI total crossed a quality threshold, and tested thresholds from 50 to 70 points per gameweek.
Nothing worked. Low thresholds (50, 55) fired too early and lost 20–30 points by cutting transfers short before they’d paid off. High thresholds (≥65) never fired across three seasons — by the state machine’s measure, Climber’s squad stayed “still climbing” for the entire season. The hits-allowed Climber stays positive-EV all the way through; there’s no point where it locks into bad decisions and needs to be rescued.
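For the record, the retired state machine had roughly this shape. The 60-point default below is just the midpoint of the 50–70 range we swept, and the archetype dictionaries stand in for whatever Climber and Steady were configured to at the time.

```python
CLIMBER = {"horizon": 6, "threshold": 2.5, "max_hits": 2}
STEADY = {"horizon": 6, "threshold": 2.5, "max_hits": 0}

class BreakoutSwitch:
    """Retired experiment: run Climber until the squad looks 'climbed out',
    then latch permanently onto Steady's no-hit parameters."""

    def __init__(self, breakout_points_per_gw: float = 60.0):
        self.breakout = breakout_points_per_gw
        self.flipped = False

    def params(self, projected_xi_points_per_gw: float) -> dict:
        if projected_xi_points_per_gw >= self.breakout:
            self.flipped = True  # one-way flip; it never switches back
        return STEADY if self.flipped else CLIMBER
```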
Old (h=5) archetypes — superseded by walk-forward redesign
The previous shipped archetypes were Climber = h=5 / T=3.5 / X=2 and Steady = h=5 / T=3.0 / X=0. They were chosen from a 144-run stratified-start sweep using model parameters fit on the same seasons the strategies were being tested on — that’s the “peek at the answer” problem walk-forward is designed to fix.
When we re-ran those exact cells under proper walk-forward, they slipped: old Climber ranks 14 / 48 from a weak start (mean 2162), and old Steady ranks 32 / 48 (mean 2080). Still competitive, but beaten on both ends by H6 T2.5 X2 and H6 T2.5 X0. The redesign moved the horizon from 5 to 6 and the threshold from 3.5/3.0 down to 2.5, which had the bonus of unifying the two archetypes around shared (h, T) values — the hits toggle is now the only thing that differs between them.
Limitations and caveats
Three test seasons is a small sample. Climber beats Steady in all three folds from a weak start (+98, +168, +44), but the margin compresses sharply in 2024/25 and we shouldn’t read three folds as a tight confidence interval. The cross-fold std is roughly ±50 points from a weak start and ±30 from a strong one. As more seasons accumulate we’ll re-run the numbers and tighten the confidence intervals.
The 2024/25 fold is doing something different. From a weak start, Climber wins 2022/23 by +98 points and 2023/24 by +168, but only +44 in 2024/25. The most plausible cause is the CBIT defensive-contribution rule, which was introduced in 2024/25. Walk-forward training only saw the pre-CBIT seasons (2019–2023), so the model went into 2024/25 with no idea the new scoring signal existed. The production model, which is trained on everything available, picks it up. The compression doesn’t flip the result under the shipped archetypes — Climber still wins 2024/25 — but it’s a real margin contraction worth flagging, and we’ll know more once 2025/26 is in the test window.
No chip play. The strategy backtest doesn’t model Wildcard, Free Hit, Bench Boost, or Triple Captain decisions. A separate study is on the roadmap. Our prior is that the optimal chip windows don’t shift much under the ensemble, but we haven’t proven it.
The h=6 vs h=5 finding only holds inside this three-season test window. An earlier six-season Elo-only study had longer-horizon configurations losing one season in six, so deeper windows do carry tail risk in principle. The current ensemble may or may not carry that risk — we’d need more seasons of xGI training data to know for sure.
What we want to test next
Three open threads we plan to run before the next archetype review:
- Comparison against top-tier human managers. The numbers on this page tell you how the archetypes perform against each other. They don’t tell you how the planner compares to the actual top 10k, top 1k, or top overall finishers from the seasons we tested. Pulling those season totals from the FPL API and putting the planner’s three-season results next to them is the comparison most managers actually want to see.
- Stratified-condition replication under walk-forward. Two starting conditions (bottom-15 and top-15) is a much narrower test than the 16-squad stratified sweep we ran under production-fit training. Re-running the stratified starts under walk-forward would tell us whether the cross-condition pattern from that sweep (Climber wins below median, Steady wins above) holds up, or whether the regime difference changes the picture.
- Chip-aware planning. None of the numbers on this page include Wildcard, Free Hit, Bench Boost, or Triple Captain. A chip-aware MILP extension is on the roadmap.
What we tested at the end of the 2025/26 season
All the work below ran during the closing weeks of the 2025/26 Premier League season. Pinning each finding to a specific calendar day would oversell how clean the process was — most of these threads ran in parallel for weeks before the numbers settled. Treating them as one cluster of “end-of-season research” is more honest than a tidy timeline.
Walk-forward validation initiative. Built the validation protocol, the strict-past training discipline, and a tag-column separation between the production model fit and the research model fit. Ran an Understat backfill to unlock 2019–2022 as training seasons (the FPL feed only inlined per-match xG from 2022/23 onward), then fixed a data-attribution bug in the historical backfill that had been silently scrambling player identities across seasons. Then fixed a related stub-row issue that had been dropping 923 historic Premier League regulars from training — Kane, KDB, Sterling, Vardy, that whole tier. With those fixes in, we ran a pair of 48-cell strategy grids (bottom-15 and top-15 starts) under strict walk-forward and used the results to redesign the archetypes.
Climber/Steady redesign. The shipped Climber moved from h=5 / T=3.5 / hits ≤ 2 to h=6 / T=2.5 / hits ≤ 2. The shipped Steady moved from h=5 / T=3.0 / no-hits to h=6 / T=2.5 / no-hits. The two archetypes now share horizon and threshold; the hits toggle is the only thing that differs. From a weak start, Climber beats Steady by +103 ± 51 points/season and wins all three folds; from a strong start they tie within noise.
Differential sweep. Two differential mechanisms tested — gated multiplier, bold-differential finder. Both lost across all three test seasons. A previously offered “Balanced” middle option was retired off the back of this work.
Component-Elo v2d. The Fantasy Rating gained a fifth Elo component for defensive contribution — the new 2025/26 CBIT rule that earns defenders and midfielders 2 points for hitting the 10-event threshold. Won 3/3 seasons at every threshold we tried in {2.5, 3.0, 3.5, 4.0, 4.5} under production-fit testing. The walk-forward redesign re-tuned the threshold to T = 2.5 alongside the horizon move from h=5 to h=6.
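In code, the scoring rule the new component tracks is simple (a sketch of the rule as described above; the production change feeds this signal into a fifth Elo component rather than awarding points directly):

```python
def defensive_contribution_points(position: str, defensive_actions: int,
                                  threshold: int = 10) -> int:
    """CBIT rule as described above: defenders and midfielders earn a flat
    2 points once their defensive-action count for the match reaches the
    threshold; there is no extra credit for going further past it."""
    if position in ("DEF", "MID") and defensive_actions >= threshold:
        return 2
    return 0
```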
Streak-boost K-multiplier. Low-seed players who are over-performing their cohort now get an accelerated Elo update. Improved the visible player ratings; cost about 10 points per season in the strategy backtest — an accepted trade for more accurate UI numbers.
DNP-weight Elo. Sub-21-minute appearances now apply partial decay rather than a full Elo update, so a player returning from injury for a 12-minute cameo doesn’t get penalised as if he’d played a full 90.
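A combined sketch of these two Elo-update tweaks. The 21-minute cut-off is from the change above; the base K, the boost factor, the decay rate, and the decay-toward-baseline form are illustrative assumptions, not the production constants.

```python
BASE_K = 32.0         # illustrative base update rate
STREAK_BOOST = 1.5    # illustrative accelerated-K multiplier
DNP_RETENTION = 0.98  # illustrative: how much rating a short cameo keeps

def k_factor(low_seed: bool, outperforming_cohort: bool) -> float:
    """Streak boost: low-seed players beating their cohort update faster,
    so a genuinely good budget pick climbs the visible ratings sooner."""
    return BASE_K * STREAK_BOOST if (low_seed and outperforming_cohort) else BASE_K

def update_elo(elo: float, expected: float, actual: float,
               minutes: int, k: float, baseline: float = 1500.0) -> float:
    """DNP weighting: a sub-21-minute appearance gets a mild decay toward a
    baseline instead of a full performance update, so an injury-return
    cameo isn't judged as if the player had played 90 minutes."""
    if minutes < 21:
        return baseline + DNP_RETENTION * (elo - baseline)
    return elo + k * (actual - expected)
```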
Simulation infrastructure. Wall time on the full 48-cell × 3-fold walk-forward sweep is around 18 minutes on a 12-core machine, after a top-80 player pre-filter and a persistent worker pool. Speed was made a first-class concern so future experiments don’t get discouraged by slow infrastructure.
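The harness pattern is a standard one; a sketch of the shape (worker count, pre-filter size, and job layout as described above, everything else illustrative):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

FOLDS = ["2022/23", "2023/24", "2024/25"]
GRID = list(product([4, 5, 6, 7], [2.0, 2.5, 3.0, 3.5], [0, 1, 2]))  # 48 cells

def run_cell(job):
    """Placeholder for one (horizon, threshold, max_hits) x fold replay.
    The real worker pre-filters to roughly the top 80 projected players
    before handing the season to the MILP."""
    (h, t, x), fold = job
    return (h, t, x, fold, 0)  # the season total would go here

if __name__ == "__main__":
    jobs = [(cell, fold) for cell in GRID for fold in FOLDS]  # 48 x 3 = 144
    # One persistent pool for the whole sweep avoids re-paying process
    # start-up and model-load cost for every cell.
    with ProcessPoolExecutor(max_workers=12) as pool:
        results = list(pool.map(run_cell, jobs))
```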