Alpha Validation Gate — root cause, fix, and promotion policy
Date: 2026-06-06
Branch: feat/alpha-validation-gate
Mandate: Find why the paper book is flat/negative, prove it, fix the backtest
wiring so strategies are genuinely exercised on real history, and ensure only
out-of-sample-validated edge receives capital — even in paper.
Scope guardrail: PAPER only. Nothing here enables live trading or changes the capital mode. The daemon's active set is now derived from validated evidence instead of a hand-maintained list.
1. Root cause of the 0.00-Sharpe cohort (proven)
backtest_results/SUMMARY.md (2020-2024) showed 0 of 36 strategies PASS, with
most at Sharpe 0.00 / CAGR 0.0% and a handful badly negative. Two distinct
failure modes, now separated by evidence:
1a. Macro / altdata sleeves were starved of their inputs (the 0.00 cohort)
The backtest harness computed price features only. The ~30 macro/altdata
sleeves read columns that the harness never joined in — dxy, dfii10,
t10yie, vix, COT net positioning, futures curve, options IV, etc. Starved of
inputs, every one of them hits its own _empty(...) guard and returns zero
signals → zero trades → Sharpe exactly 0.00.
Direct proof (dxy_gold, run on real GLD history):
| Features fed to dxy_gold.generate_signals | Signals | empty_reason |
|---|---|---|
| price-only (what the old harness produced) | 0 | insufficient_dxy_history |
| price + dxy macro column (the fix) | processes normally | residual_zscore_below_entry_threshold (legit "no extreme dislocation today") |
The 0.00 rows were never a strategy verdict — they were a data-wiring bug.
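The starvation mechanism can be illustrated with a toy sleeve. This is a minimal sketch, not the real dxy_gold code: SignalResult, the min_history and entry_z parameters, and the plain z-score rule are assumptions made for illustration (the real sleeve uses a residual z-score whose details are not shown here). Only the _empty guard pattern and the two empty_reason strings come from the evidence above.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Optional


@dataclass
class SignalResult:
    """Hypothetical result container: signals plus the reason for emitting none."""
    signals: List[dict] = field(default_factory=list)
    empty_reason: Optional[str] = None


def _empty(reason: str) -> SignalResult:
    # Zero signals, but the reason is recorded so starvation is diagnosable.
    return SignalResult(signals=[], empty_reason=reason)


def generate_signals(features: Dict[str, List[float]],
                     min_history: int = 20,
                     entry_z: float = 2.0) -> SignalResult:
    # A price-only feature set has no 'dxy' column, so the guard fires here.
    dxy = features.get("dxy", [])
    if len(dxy) < min_history:
        return _empty("insufficient_dxy_history")

    # Simplified stand-in for the real rule: z-score of the latest DXY print.
    history, latest = dxy[:-1], dxy[-1]
    sigma = pstdev(history)
    z = 0.0 if sigma == 0 else (latest - mean(history)) / sigma
    if abs(z) < entry_z:
        return _empty("residual_zscore_below_entry_threshold")

    # Toy convention: a strong dollar is treated as bearish for gold.
    side = "short" if z > 0 else "long"
    return SignalResult(signals=[{"symbol": "GLD", "side": side, "z": z}])


closes = [180.0 + 0.1 * i for i in range(30)]
calm_dxy = [100.0 + 0.05 * (i % 5) for i in range(30)]

starved = generate_signals({"close": closes})                  # old harness
fed = generate_signals({"close": closes, "dxy": calm_dxy})     # fixed harness
spiked = generate_signals({"close": closes, "dxy": calm_dxy[:-1] + [105.0]})
```

In this sketch both the starved run and the calm-market run return zero signals, but only the empty_reason tells them apart, which is why a bare trade count of zero could hide the wiring bug.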
Why it persisted: the only data paths were Alpaca (bars, needs keys) and a
FRED API-key cache (scripts/populate_fred_cache.py). With no keys and
nothing committed, the harness either aborted ("NO DATA AVAILABLE") or silently
ran price-only. That is why checked-in OOS artifacts existed for exactly one
strategy (mvg_v1).
1b. The non-zero strategies have genuinely negative post-cost edge
The strategies that did trade (pure price sleeves: carver_trend,
gold_tsmom, gold_multi_ma_trend, gold_seasonal_demand, gold_day_of_week,
gold_silver_ratio, gold_platinum) are genuinely negative after costs on the
commodity-ETF universe — carver_trend at Sharpe −1.63 / −94% max DD is a daily
EWMAC book over 46 ETFs including 2x/-2x leveraged names (UCO, SCO, BOIL,
KOLD): a turnover-and-decay bloodbath. These are real losers, correctly killed.
Net: capital was spread across ~46 "active" PM strategies, ~30 of which never actually traded in validation and a handful of which had proven-negative edge. That is exactly how a paper book goes flat/negative and why "resetting" doesn't help — the allocation itself was the problem.
2. What was fixed in the backtest wiring
- Keyless, reproducible data path: new scripts/backtest_data.py builds the two caches the harness needs from sources that need no paid keys:
  - bars_cache.parquet: adjusted daily OHLCV for the ETF universe via yfinance.
  - fred_macro.parquet: market-based macro (VIX, DXY, nominal Treasury yields) via Yahoo, plus FRED's keyless CSV endpoint for real yields/breakevens/balance-sheet when reachable. Columns are written under the exact names the strategies read (dfii10, t10yie, dxy, vix, dgs2, t10y2y, …) plus canonical aliases.

  Now anyone (CI, an operator, a fresh clone) can regenerate honest backtest inputs and exercise the macro sleeves on real history.
- Engine made decade-scale-runnable, and faithful to live: BacktestEngine now (a) precomputes an O(1) (symbol, timestamp) → ret_1d lookup instead of filtering the full frame per signal, and (b) supports a bounded max_history_days window. The harness sets it to TRADING_LOOKBACK_DAYS (800), which is exactly the lookback the live daemon feeds strategies, so the backtest mirrors production and the full registry evaluates in minutes, not hours. Default is None (unbounded) so every existing caller/test is byte-for-byte unchanged.
- Artifacts for the whole registry: scripts/backtest_all.py now records daily_returns_count in each validation.json and, at the end of a run, derives the promotion gate (below) into backtest_results/promotion.json.
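The two engine changes can be sketched in plain Python. This is an illustration under assumptions, not BacktestEngine itself: the Row tuple layout and both function names are hypothetical, and the window is measured in calendar days here because the source does not say whether max_history_days counts calendar or trading days.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, date, float]  # (symbol, timestamp, ret_1d), assumed layout


def build_return_lookup(rows: List[Row]) -> Dict[Tuple[str, date], float]:
    """Precompute (symbol, timestamp) -> ret_1d once, so each later lookup
    is a dict access instead of a scan over the full frame."""
    return {(symbol, ts): ret for symbol, ts, ret in rows}


def bounded_history(rows: List[Row], as_of: date,
                    max_history_days: Optional[int] = None) -> List[Row]:
    """Rows visible at as_of, optionally trimmed to a trailing window.
    None keeps the unbounded behaviour, matching the stated default."""
    visible = [r for r in rows if r[1] <= as_of]
    if max_history_days is None:
        return visible
    cutoff = as_of - timedelta(days=max_history_days)
    return [r for r in visible if r[1] > cutoff]


start = date(2020, 1, 1)
rows = [("GLD", start + timedelta(days=i), i / 1000.0) for i in range(1000)]
as_of = start + timedelta(days=999)

lookup = build_return_lookup(rows)
ret_day_5 = lookup[("GLD", start + timedelta(days=5))]

unbounded = bounded_history(rows, as_of)
windowed = bounded_history(rows, as_of, max_history_days=800)
```

Keeping None as the default is what lets existing callers run unchanged, while the harness opts in to the 800-day window explicitly.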
3. The promotion gate (capital eligibility = f(evidence))
qgtm_portfolio/promotion.py turns OOS artifacts into a binary capital verdict.
A strategy is promoted (eligible for capital, even paper) only if every
bar passes:
| Bar | Default | Source |
|---|---|---|
| post-cost annualized Sharpe | ≥ 0.40 | GAP-002 policy |
| PBO (prob. backtest overfit) | < 0.50 | PBO_REJECT_THRESHOLD |
| Deflated-Sharpe confidence | ≥ 0.95 | DSR_ACCEPT_CONFIDENCE |
| worst historical stress-window return | ≥ −25% | new |
| out-of-sample days | ≥ 252 | new |
| trades executed | ≥ 30 | new — auto-rejects the data-starved 0-trade cohort |
Strategies in PROVEN_NEGATIVE_STRATEGIES (the audited hard kill-list) are
quarantined regardless of evidence.
Output: backtest_results/promotion.json — {promoted, quarantined, decisions,
thresholds}, every decision carrying its failed-check reasons.
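A minimal sketch of how such a gate could be expressed follows. The defaults mirror the table above; everything else is assumed for illustration: the field names on PromotionThresholds, the Evidence record, the decide function, the reason strings, and the three example strategies with their numbers are hypothetical and are not taken from qgtm_portfolio/promotion.py.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class PromotionThresholds:
    # Defaults copied from the table above; field names are assumptions.
    min_sharpe: float = 0.40
    max_pbo: float = 0.50            # exclusive bound: PBO must be < 0.50
    min_dsr_confidence: float = 0.95
    min_stress_return: float = -0.25
    min_oos_days: int = 252
    min_trades: int = 30


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-strategy summary read from OOS artifacts."""
    strategy_id: str
    sharpe: float
    pbo: float
    dsr_confidence: float
    worst_stress_return: float
    oos_days: int
    trades: int


def decide(ev: Evidence,
           th: PromotionThresholds = PromotionThresholds(),
           kill_list: FrozenSet[str] = frozenset()) -> Dict[str, object]:
    """Promote only if every bar passes; collect a reason per failed check."""
    reasons: List[str] = []
    if ev.strategy_id in kill_list:
        reasons.append("proven_negative")
    if ev.sharpe < th.min_sharpe:
        reasons.append("sharpe_below_min")
    if ev.pbo >= th.max_pbo:
        reasons.append("pbo_too_high")
    if ev.dsr_confidence < th.min_dsr_confidence:
        reasons.append("dsr_below_confidence")
    if ev.worst_stress_return < th.min_stress_return:
        reasons.append("stress_return_too_low")
    if ev.oos_days < th.min_oos_days:
        reasons.append("insufficient_oos_days")
    if ev.trades < th.min_trades:
        reasons.append("insufficient_trades")
    return {"strategy_id": ev.strategy_id,
            "promoted": not reasons,
            "reasons": reasons}


# Illustrative inputs only; none of these are real registry results.
near_miss = decide(Evidence("example_near_miss", 0.38, 0.38, 0.23, -0.05, 1500, 346))
starved = decide(Evidence("example_starved", 0.0, 1.0, 0.0, 0.0, 1500, 0))
clean = decide(Evidence("example_clean", 0.60, 0.20, 0.97, -0.10, 1500, 400))
listed = decide(Evidence("example_clean", 0.60, 0.20, 0.97, -0.10, 1500, 400),
                kill_list=frozenset({"example_clean"}))
```

Collecting every failed check, rather than stopping at the first, is what allows each decision to carry its full list of reasons. The last call shows the kill-list overriding otherwise passing evidence.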
4. The live registry is now DERIVED from the validated set
The daemon skips any strategy_id in qgtm_core.constants.QUARANTINED_STRATEGIES
(daemon.py::_is_quarantined). That constant is no longer hand-maintained:
constants.py reads backtest_results/promotion.json at import (defensively:
any parse error falls back to the hard kill-list and never raises, so a bad file
cannot crash the daemon). Therefore the daemon's active set follows the gate's
output rather than a hand-maintained list.
Re-deriving the gate (e.g. after new artifacts or a threshold change) is one
command: python scripts/backtest_promote.py. No hand-editing of enablement
lists.
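The defensive import-time read could look roughly like the sketch below. It assumes the quarantined key holds a list of strategy ids and that the result is unioned with the hard kill-list; the function name load_quarantined and the placeholder id are hypothetical, and the real constants.py may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import FrozenSet

# Placeholder id; the real PROVEN_NEGATIVE_STRATEGIES contents are not shown here.
HARD_KILL_LIST: FrozenSet[str] = frozenset({"example_proven_negative"})


def load_quarantined(path: Path,
                     hard_kill_list: FrozenSet[str] = HARD_KILL_LIST) -> FrozenSet[str]:
    """Read the gate output; on any problem fall back to the hard kill-list."""
    try:
        payload = json.loads(Path(path).read_text(encoding="utf-8"))
        quarantined = payload["quarantined"]
        if not isinstance(quarantined, list) or not all(
                isinstance(s, str) for s in quarantined):
            raise ValueError("quarantined must be a list of strategy ids")
        return frozenset(quarantined) | hard_kill_list
    except Exception:
        # Missing file, bad JSON, or wrong shape: never raise at import time.
        return hard_kill_list


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "promotion.json"
    good.write_text(json.dumps({"promoted": [], "quarantined": ["example_sleeve"]}),
                    encoding="utf-8")
    bad = Path(tmp) / "broken.json"
    bad.write_text("{not json", encoding="utf-8")

    from_good = load_quarantined(good)
    from_bad = load_quarantined(bad)
    from_missing = load_quarantined(Path(tmp) / "absent.json")
```

Catching the broad Exception is deliberate in this sketch: the stated requirement is that a bad file can never crash the daemon, so the fallback has to cover every failure mode, not just JSON decode errors.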
5. Leaderboard & recommended allocation
Full registry re-run on reproducible data (yfinance bars + VIX/DXY/yields), liquid universe (22 ETFs, ≥500k ADV), 2019-01-01 → 2025-12-31, 10 bps round-trip cost, 800-day live-matched lookback.
Result: 0 of 52 promoted. 17 strategies actually traded; 35 are data-starved (they need FRED real-rate/breakeven/curve, COT net positioning, COMEX warehouse, the futures curve, options IV, or real ETF flows — none reproducibly sourceable here, so they return zero signals and cannot be validated, let alone promoted).
Top of the leaderboard (the only strategies that traded and came closest):
| Strategy | Post-cost Sharpe | Trades | PBO | DSR conf | Max DD | Verdict |
|---|---|---|---|---|---|---|
| gold_lbma_fix_anomaly | 0.38 | 346 | 0.38 ✓ | 0.23 ✗ | −1.4% | Quarantine — Sharpe < 0.40 and DSR fails; tiny capacity (346 trades) |
| overnight_gold | 0.28 | 2184 | 0.62 ✗ | 0.12 ✗ | −7.8% | Quarantine (also proven-negative) |
| xsmom | −0.13 | 17590 | 1.00 ✗ | 0.01 ✗ | −91.8% | Quarantine — negative, near-certain overfit |
| vix_haven | −0.14 | 236 | 1.00 ✗ | 0.02 ✗ | −8.4% | Quarantine |
| gold_tsmom | −0.14 | 1054 | 1.00 ✗ | 0.01 ✗ | −17.5% | Quarantine |
| tsmom | −1.24 | — | 1.00 ✗ | 0.01 ✗ | — | Quarantine |
| vol_risk_parity | −1.11 | — | 1.00 ✗ | — | — | Quarantine |
Every daily-rebalanced sleeve is dragged negative by turnover × cost on the
commodity-ETF universe. The macro fix is working (e.g. dxy_gold now produces
a real −0.62 instead of a data-starved 0.00) — the strategies are genuinely being
exercised; they simply have no validated edge as currently configured.
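The turnover × cost drag is simple arithmetic, sketched here. The 10 bps round-trip cost is the figure used in this run; the assumption that half the book turns over at each rebalance is purely illustrative and is not a measured turnover for any sleeve.

```python
def annual_cost_drag(round_trips_per_year: float,
                     round_trip_cost_bps: float = 10.0) -> float:
    """Annual return given up to costs, as a fraction of capital."""
    return round_trips_per_year * round_trip_cost_bps / 10_000.0


# Illustrative assumption: 50% of the book is turned over at each rebalance.
TURNOVER_PER_REBALANCE = 0.5

daily_drag = annual_cost_drag(252 * TURNOVER_PER_REBALANCE)    # daily rebalance
monthly_drag = annual_cost_drag(12 * TURNOVER_PER_REBALANCE)   # monthly rebalance
```

Under that assumption the daily schedule gives up about 12.6% a year to costs against about 0.6% for the monthly one, a 21x difference driven by rebalance frequency alone.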
Recommended allocation: $0 to the current registry. recommended_allocation()
returns {} because the promoted set is empty. Concentrating capital on
"validated edge only" here means deploying no capital until something clears
the bar — which is the correct, honest outcome, not a failure of the gate.
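Why an empty promoted set yields an empty allocation can be shown with a small stand-in. The equal-weight rule and the signature are assumptions for illustration; the source only states that recommended_allocation() returns {} when nothing is promoted, not how it weights a non-empty set.

```python
from typing import Dict, Iterable


def recommended_allocation(promoted: Iterable[str]) -> Dict[str, float]:
    """Equal-weight the promoted set (assumed rule); nothing promoted
    means nothing allocated, rather than a fallback to the old list."""
    ids = sorted(set(promoted))
    if not ids:
        return {}
    weight = 1.0 / len(ids)
    return {strategy_id: weight for strategy_id in ids}


current = recommended_allocation([])                       # today's registry
hypothetical = recommended_allocation(["sleeve_a", "sleeve_b"])
```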
Smallest robust starting set (evidence-based, NOT yet promoted)
The trend premium is not gone — it is being destroyed by daily rebalancing. An indicative low-turnover backtest (monthly rebalance, 12-month TSMOM, liquid ETFs, 10 bps turnover cost, 2014-2025, 132 months):
| Sleeve | Sharpe | CAGR |
|---|---|---|
| 12m TSMOM, long-only, 12 liquid ETFs | +0.47 | +7.6% |
| 12m TSMOM long/short, 12 liquid ETFs | +0.38 | +3.8% |
| 12m TSMOM long/short, core (GLD,GDX,DBC,USO) | +0.33 | +3.9% |
A monthly-rebalanced, long-biased 12-month TSMOM sleeve on the liquid ETF basket is the smallest defensible starting point. This is an indicative result (simplified single-pass backtest, not the full PBO/DSR/walk-forward harness) — so the recommendation is to implement it as a low-turnover strategy variant and put it through this same gate before it receives capital. Do not fabricate edge by promoting it on this indicative number alone.
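A simplified single-pass version of that monthly sleeve is sketched below, on synthetic prices. It assumes equal weights across the basket, a sign-of-trailing-return signal, and cost charged per unit of weight turnover; the function name and the two synthetic series are hypothetical, and this is not the code that produced the table above.

```python
from typing import Dict, List


def tsmom_monthly_returns(prices: Dict[str, List[float]],
                          lookback: int = 12,
                          cost_bps: float = 10.0,
                          long_only: bool = True) -> List[float]:
    """Net monthly returns of an equal-weight time-series momentum sleeve.

    prices maps symbol -> month-end closes. The signal at month t uses only
    data up to t, and the position earns the return from t to t+1.
    """
    symbols = sorted(prices)
    n_months = min(len(series) for series in prices.values())
    unit = 1.0 / len(symbols)
    prev_w = {s: 0.0 for s in symbols}
    net_returns: List[float] = []

    for t in range(lookback, n_months - 1):
        weights = {}
        for s in symbols:
            trailing = prices[s][t] / prices[s][t - lookback] - 1.0
            if trailing > 0:
                weights[s] = unit
            elif trailing < 0 and not long_only:
                weights[s] = -unit
            else:
                weights[s] = 0.0
        # Cost is charged on the absolute change in weights at each rebalance.
        turnover = sum(abs(weights[s] - prev_w[s]) for s in symbols)
        gross = sum(weights[s] * (prices[s][t + 1] / prices[s][t] - 1.0)
                    for s in symbols)
        net_returns.append(gross - turnover * cost_bps / 10_000.0)
        prev_w = weights
    return net_returns


# Synthetic data: one series rising 1% a month, one falling 1% a month.
synthetic = {
    "UP": [100.0 * 1.01 ** i for i in range(26)],
    "DOWN": [100.0 * 0.99 ** i for i in range(26)],
}
long_only_rets = tsmom_monthly_returns(synthetic, long_only=True)
long_short_rets = tsmom_monthly_returns(synthetic, long_only=False)
```

On trending synthetic data the weights are set once and then held, so cost is paid only in the first month. A variant like this would still need to go through the full PBO/DSR/walk-forward gate before its numbers mean anything.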
6. Operator decisions still required
- Missing reproducible data feeds. Real rates/breakevens (FRED key), COT net positioning, COMEX warehouse stocks, the GC/SI futures curve, options IV surfaces, and real ETF creation/redemption flows are not reproducibly available to the validation harness from this host. Sleeves that depend on them cannot be validated offline and are therefore not promotable until a point-in-time source lands. This is a data-procurement decision, not a code bug.
- Threshold calibration. The DSR ≥ 0.95 bar (deflated across 52 trials) is deliberately strict. If the honest gate promotes too few strategies, the operator can either (a) fund the missing data feeds to validate more sleeves, or (b) consciously relax a documented threshold in PromotionThresholds. The gate is explicit so the trade-off is visible, not buried.