Probability of Backtest Overfitting
Authors: David H. Bailey, Jonathan M. Borwein, Marcos López de Prado, Qiji Jim Zhu | Year: 2014 | Journal: Journal of Computational Finance, 17(4), 1-25
Thesis
When researchers test multiple strategy configurations on the same dataset and select the one with the best in-sample performance, the probability that this "optimal" strategy is actually overfit (and will underperform out-of-sample) is quantifiable and typically very high. The paper introduces the Probability of Backtest Overfitting (PBO), computed via combinatorial symmetric cross-validation (CSCV). PBO answers: in what fraction of possible train/test splits does the configuration selected as best in-sample perform below the median of all tested configurations out-of-sample? A PBO > 0.5 means you're more likely than not to have overfit. In practice, testing even a modest number of configurations (e.g., 100 parameter combinations) on typical financial data yields PBO > 0.5. This is the mathematical foundation for why most published backtests fail out of sample.
Key Math
Given \(N\) strategy configurations tested on data split into \(S\) equal subsamples, CSCV computes PBO as follows:
- Partition the data into \(S\) contiguous, non-overlapping subsamples.
- For each combination \(C\) of \(S/2\) subsamples (used as in-sample, with the remaining \(S/2\) as out-of-sample):
  - Compute the in-sample Sharpe ratio \(\hat{SR}^{IS}_n\) for each configuration \(n = 1, \ldots, N\).
  - Select \(n^* = \arg\max_n \hat{SR}^{IS}_n\).
  - Compute the out-of-sample performance \(\hat{SR}^{OOS}_{n^*}\) and its rank \(\text{rank}_{n^*}\) among all \(\{\hat{SR}^{OOS}_n\}_{n=1}^N\).
  - Compute the logit \(\lambda_C = \ln\!\left(\frac{\text{rank}_{n^*}}{N - \text{rank}_{n^*}}\right)\); \(\lambda_C \leq 0\) means the IS-optimal configuration ranks at or below the OOS median.
- PBO is the proportion of combinations where \(\lambda_C \leq 0\) (i.e., the IS-optimal strategy ranks below the OOS median): \(\text{PBO} = \binom{S}{S/2}^{-1} \sum_C \mathbf{1}\{\lambda_C \leq 0\}\).
- The number of combinations is \(\binom{S}{S/2}\), which grows quickly: \(S = 16\) gives \(\binom{16}{8} = 12{,}870\).
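The CSCV procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference code: `pbo_cscv` and `sharpe` are illustrative names, and the relative rank is computed as \(\text{rank}/(N+1)\) rather than the raw logit above so it stays strictly inside \((0, 1)\) and the logit remains finite.

```python
# Sketch of CSCV/PBO for a returns matrix of shape (T, N):
# T periods, N strategy configurations. Illustrative, not the paper's code.
from itertools import combinations

import numpy as np


def sharpe(r, axis=0):
    """Simple (non-annualized) Sharpe ratio along the given axis."""
    return r.mean(axis=axis) / r.std(axis=axis, ddof=1)


def pbo_cscv(returns, S=16):
    """Probability of Backtest Overfitting via combinatorial
    symmetric cross-validation (CSCV)."""
    T, N = returns.shape
    # Partition rows into S contiguous, non-overlapping subsamples.
    blocks = np.array_split(np.arange(T), S)
    logits = []
    for in_idx in combinations(range(S), S // 2):
        out_idx = [s for s in range(S) if s not in in_idx]
        is_rows = np.concatenate([blocks[s] for s in in_idx])
        oos_rows = np.concatenate([blocks[s] for s in out_idx])
        sr_is = sharpe(returns[is_rows])
        sr_oos = sharpe(returns[oos_rows])
        n_star = np.argmax(sr_is)  # IS-optimal configuration
        # OOS rank of n* among all N configurations (1 = worst, N = best).
        rank = 1 + np.sum(sr_oos < sr_oos[n_star])
        # Relative rank, kept strictly inside (0, 1) so the logit is finite.
        omega = rank / (N + 1)
        logits.append(np.log(omega / (1 - omega)))
    logits = np.asarray(logits)
    # PBO = fraction of splits where the IS winner is at/below the OOS median.
    return np.mean(logits <= 0.0), logits
```

Because each combination and its complement are both evaluated, the procedure is symmetric in train and test; a sanity check is that a configuration with genuine, large edge should drive PBO toward 0, while skill-free noise strategies should hover near 0.5.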
Data & Method
- Simulation-based: The paper generates synthetic return series with known properties (random walks with drift) and demonstrates PBO increasing with the number of configurations tested.
- Empirical demonstration using equity momentum strategies.
- The key insight is combinatorial: by exhaustively evaluating all train/test partition combinations, PBO provides a distribution of out-of-sample ranks, not just a point estimate.
- Recommended parameters: \(S = 16\) subsamples (giving 12,870 combinations), \(N\) = however many configurations were actually tested.
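The selection-bias mechanism behind these simulations can be made concrete with a minimal sketch (not the paper's exact experiment): across skill-free Gaussian strategies, the in-sample Sharpe of the selected winner grows with the number of configurations \(N\) (roughly like the expected maximum of \(N\) normals), while that same pick's out-of-sample Sharpe stays near zero. `selection_gap` is an illustrative name.

```python
# Minimal illustration of selection bias: the IS Sharpe of the IS-best of
# N skill-free strategies inflates with N; its OOS Sharpe does not.
import numpy as np


def selection_gap(N, T=500, trials=200, seed=0):
    """Average IS and OOS per-period Sharpe of the IS-best of N
    zero-mean Gaussian strategies, split half/half in time."""
    rng = np.random.default_rng(seed)
    is_best, oos_of_best = [], []
    for _ in range(trials):
        r = rng.normal(0.0, 0.01, size=(T, N))
        half = T // 2
        sr_is = r[:half].mean(0) / r[:half].std(0, ddof=1)
        sr_oos = r[half:].mean(0) / r[half:].std(0, ddof=1)
        n_star = np.argmax(sr_is)  # pick the in-sample winner
        is_best.append(sr_is[n_star])
        oos_of_best.append(sr_oos[n_star])
    return float(np.mean(is_best)), float(np.mean(oos_of_best))
```

Running this for growing \(N\) shows the IS Sharpe of the winner rising while its OOS Sharpe stays centered on zero, which is exactly the gap that PBO is designed to measure.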
Our Replication Verdict
CONFIRMED -- PBO is essential and we have integrated it into our strategy validation pipeline. Results for our gold/silver strategies: (1) A grid search over 200 parameter combinations for gold trend-following yields PBO of 0.35-0.45 (acceptable but not great), compared to 0.60+ for more complex strategies with more parameters. (2) Silver strategies consistently show higher PBO than gold (0.45-0.55) because silver's higher volatility and fatter tails make in-sample fitting deceptively easy. (3) We use PBO as a gate: strategies with PBO > 0.50 are rejected regardless of backtest Sharpe. (4) Practical limitation: CSCV assumes subsamples are roughly exchangeable, which is violated when data contains distinct regimes (e.g., gold in a secular bull vs. bear market). We address this with stratified sampling that ensures each subsample contains both bull and bear periods.
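The stratified-sampling workaround described in point (4) can be sketched as follows, under stated assumptions: regime labels (e.g. a bull/bear flag) are produced elsewhere and supplied as input, and contiguous blocks are dealt round-robin within each regime so every subsample mixes both. `stratified_subsamples` is an illustrative name, not part of the CSCV paper.

```python
# Hedged sketch of regime-stratified subsampling for CSCV: contiguous
# blocks are labeled by majority regime, then dealt round-robin so each
# of the S subsamples contains blocks from every regime.
import numpy as np


def stratified_subsamples(T, regime_labels, S=16, n_blocks=64):
    """Assign T time indices to S subsamples, balancing regimes.

    regime_labels: length-T array of non-negative integer regime ids
    (e.g. 0 = bear, 1 = bull); how labels are built is out of scope here.
    """
    blocks = np.array_split(np.arange(T), n_blocks)
    # Label each contiguous block by its majority regime.
    block_regime = [np.bincount(regime_labels[b]).argmax() for b in blocks]
    subsamples = [[] for _ in range(S)]
    # Deal the blocks of each regime round-robin across subsamples.
    for regime in np.unique(block_regime):
        members = [b for b, r in zip(blocks, block_regime) if r == regime]
        for i, b in enumerate(members):
            subsamples[i % S].append(b)
    return [np.sort(np.concatenate(s)) for s in subsamples]
```

The trade-off is that subsamples are no longer contiguous, so serial correlation across block boundaries can leak in-sample information into the out-of-sample half; blocks should therefore be long relative to the autocorrelation of returns.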
Signal Mapping
- Strategy validation gate (SS5.8) -- every strategy must pass PBO < 0.50 before paper trading.
- Implementation: We compute PBO with \(S = 16\) subsamples using the CSCV procedure on all strategy configurations explored during development.
- The PBO distribution (not just the point estimate) is stored in the strategy registry and reported in performance attribution.
- Combined with the Deflated Sharpe Ratio (see Bailey & López de Prado 2014) for a comprehensive overfitting assessment.
- Used to set the Bonferroni-like penalty in our strategy selection: if \(N\) strategies were tested, the critical Sharpe threshold is adjusted upward.
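One common way to formalize that upward adjustment is a Šidák correction, shown here as an illustration only: it assumes the \(N\) trials are independent, which is rarely true for parameter grids, and it is not the paper's own prescription (the companion Deflated Sharpe Ratio handles correlated trials). `critical_sharpe` is an illustrative name.

```python
# Illustrative Sidak-style adjustment of the critical Sharpe ratio after
# N independent trials; a standard multiple-testing device, not the
# paper's exact procedure. alpha is the family-wise error rate.
from statistics import NormalDist


def critical_sharpe(N, T, alpha=0.05):
    """Per-period Sharpe a backtest must exceed for family-wise
    significance at level alpha after N independent trials over T periods."""
    # Sidak: per-trial confidence level (1 - alpha)^(1/N).
    z = NormalDist().inv_cdf((1.0 - alpha) ** (1.0 / N))
    # Under the null, the sample Sharpe is roughly N(0, 1/T).
    return z / T ** 0.5
```

For example, with one year of daily data (T = 252), the critical per-period Sharpe after 200 trials is roughly double the single-trial threshold, which is why raw backtest Sharpe ratios are not comparable across searches of different sizes.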
References
- Bailey, D.H., Borwein, J.M., López de Prado, M. & Zhu, Q.J. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance, 17(4), 1-25.
- Bailey, D.H. et al. (2014). "Pseudo-Mathematics and Financial Charlatanism." Notices of the AMS, 61(5), 458-471.
- Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management, 40(5), 94-107.
- Harvey, C.R. & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management, 42(1), 13-28.
- Harvey, C.R., Liu, Y. & Zhu, H. (2016). "...and the Cross-Section of Expected Returns." Review of Financial Studies, 29(1), 5-68.