
Alpha Validation Gate — root cause, fix, and promotion policy

Date: 2026-06-06
Branch: feat/alpha-validation-gate
Mandate: Find why the paper book is flat/negative, prove it, fix the backtest wiring so strategies are genuinely exercised on real history, and ensure only out-of-sample-validated edge receives capital, even in paper.

Scope guardrail: PAPER only. Nothing here enables live trading or changes the capital mode. The daemon's active set is now derived from validated evidence instead of a hand-maintained list.


1. Root cause of the 0.00-Sharpe cohort (proven)

backtest_results/SUMMARY.md (2020-2024) showed 0 of 36 strategies PASS, with most at Sharpe 0.00 / CAGR 0.0% and a handful badly negative. Two distinct failure modes, now separated by evidence:

1a. Macro / altdata sleeves were starved of their inputs (the 0.00 cohort)

The backtest harness computed price features only. The ~30 macro/altdata sleeves read columns that the harness never joined in — dxy, dfii10, t10yie, vix, COT net positioning, futures curve, options IV, etc. Starved of inputs, every one of them hits its own _empty(...) guard and returns zero signals → zero trades → Sharpe exactly 0.00.

Direct proof (dxy_gold, run on real GLD history):

| Features fed to dxy_gold.generate_signals | Signals | empty_reason |
| --- | --- | --- |
| price-only (what the old harness produced) | 0 | insufficient_dxy_history |
| price + dxy macro column (the fix) | processes normally | residual_zscore_below_entry_threshold (legit "no extreme dislocation today") |

The 0.00 rows were never a strategy verdict — they were a data-wiring bug.
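
The starvation path can be illustrated with a minimal sketch. Only `_empty`, `generate_signals`, and the two `empty_reason` strings come from the text above; `SignalResult`, the `min_history` and `entry_z` parameters, and the toy residual model are assumptions made for illustration, not the actual dxy_gold logic.

```python
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class SignalResult:
    """Hypothetical container: the signals plus the reason the list is empty."""
    signals: list = field(default_factory=list)
    empty_reason: Optional[str] = None


def _empty(reason: str) -> SignalResult:
    # No signals, but the reason is recorded so the starvation is diagnosable.
    return SignalResult(signals=[], empty_reason=reason)


def generate_signals(features: pd.DataFrame, min_history: int = 60,
                     entry_z: float = 2.0) -> SignalResult:
    # Guard: a price-only frame has no dxy column, so the sleeve exits here.
    if "dxy" not in features.columns or features["dxy"].notna().sum() < min_history:
        return _empty("insufficient_dxy_history")

    # Toy residual (an assumption): gold return plus dollar return, which stays
    # near zero while the two move inversely.
    residual = (features["close"].pct_change(fill_method=None)
                + features["dxy"].pct_change(fill_method=None)).dropna()
    spread = residual.std()
    z = 0.0 if not spread > 0 else (residual.iloc[-1] - residual.mean()) / spread
    if abs(z) < entry_z:
        return _empty("residual_zscore_below_entry_threshold")
    return SignalResult(signals=[{"side": "long" if z < 0 else "short", "z": float(z)}])
```

With a price-only frame the function returns at the first guard on every bar, which is how a whole backtest ends with zero trades. Once the `dxy` column is joined in, an empty result means the model looked and found nothing extreme, which is a legitimate verdict.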

Why it persisted: the only data paths were Alpaca (bars, needs keys) and a FRED API-key cache (scripts/populate_fred_cache.py). With no keys and nothing committed, the harness either aborted ("NO DATA AVAILABLE") or silently ran price-only. That is why checked-in OOS artifacts existed for exactly one strategy (mvg_v1).

1b. The non-zero strategies have genuinely negative post-cost edge

The strategies that did trade (pure price sleeves: carver_trend, gold_tsmom, gold_multi_ma_trend, gold_seasonal_demand, gold_day_of_week, gold_silver_ratio, gold_platinum) are genuinely negative after costs on the commodity-ETF universe. The clearest case is carver_trend (Sharpe −1.63, −94% max DD): it is a daily EWMAC book over 46 ETFs, including 2x/−2x leveraged names (UCO, SCO, BOIL, KOLD), so it pays daily turnover costs on top of leveraged-ETF decay. These are real losers, correctly killed.

Net: capital was spread across ~46 "active" PM strategies, ~30 of which never actually traded in validation and a handful of which had proven-negative edge. That is exactly how a paper book goes flat/negative and why "resetting" doesn't help — the allocation itself was the problem.


2. What was fixed in the backtest wiring

  1. Keyless, reproducible data path. New scripts/backtest_data.py builds the two caches the harness needs from sources that need no paid keys:
     • bars_cache.parquet: adjusted daily OHLCV for the ETF universe via yfinance.
     • fred_macro.parquet: market-based macro (VIX, DXY, nominal Treasury yields) via Yahoo, plus FRED's keyless CSV endpoint for real yields/breakevens/balance-sheet when reachable. Columns are written under the exact names the strategies read (dfii10, t10yie, dxy, vix, dgs2, t10y2y, …) plus canonical aliases.
     Now anyone (CI, an operator, a fresh clone) can regenerate honest backtest inputs and exercise the macro sleeves on real history.

  2. Engine made decade-scale-runnable, and faithful to live. BacktestEngine now (a) precomputes an O(1) (symbol, timestamp) → ret_1d lookup instead of filtering the full frame per signal, and (b) supports a bounded max_history_days window. The harness sets it to TRADING_LOOKBACK_DAYS (800), which is exactly the lookback the live daemon feeds strategies, so the backtest mirrors production and the full registry evaluates in minutes, not hours. Default is None (unbounded), so every existing caller/test is byte-for-byte unchanged.

  3. Artifacts for the whole registry. scripts/backtest_all.py now records daily_returns_count in each validation.json and, at the end of a run, derives the promotion gate (below) into backtest_results/promotion.json.
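
The two engine changes described in the list above can be sketched as follows. This is a minimal illustration, not the BacktestEngine code: the function names, the long-format `bars` frame with `symbol`, `timestamp`, and `ret_1d` columns, and the reading of `max_history_days` as calendar days are all assumptions.

```python
import pandas as pd


def build_return_lookup(bars: pd.DataFrame) -> dict:
    """Precompute (symbol, timestamp) -> ret_1d once, so each signal costs one dict hit."""
    return dict(zip(zip(bars["symbol"], bars["timestamp"]), bars["ret_1d"]))


def history_window(bars: pd.DataFrame, as_of: pd.Timestamp,
                   max_history_days=None) -> pd.DataFrame:
    """Rows visible at `as_of`, optionally limited to a trailing window."""
    window = bars[bars["timestamp"] <= as_of]       # never look ahead
    if max_history_days is not None:                # None keeps the old unbounded behavior
        cutoff = as_of - pd.Timedelta(days=max_history_days)
        window = window[window["timestamp"] > cutoff]
    return window
```

The dictionary replaces a full-frame filter per signal with a single hash lookup, and the bounded window keeps the per-step feature cost constant as the backtest moves through a decade of history.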


3. The promotion gate (capital eligibility = f(evidence))

qgtm_portfolio/promotion.py turns OOS artifacts into a binary capital verdict. A strategy is promoted (eligible for capital, even paper) only if every bar passes:

| Bar | Default | Source |
| --- | --- | --- |
| post-cost annualized Sharpe | ≥ 0.40 | GAP-002 policy |
| PBO (prob. backtest overfit) | < 0.50 | PBO_REJECT_THRESHOLD |
| Deflated-Sharpe confidence | ≥ 0.95 | DSR_ACCEPT_CONFIDENCE |
| worst historical stress-window return | ≥ −25% | new |
| out-of-sample days | ≥ 252 | new |
| trades executed | ≥ 30 | new; auto-rejects the data-starved 0-trade cohort |

Strategies in PROVEN_NEGATIVE_STRATEGIES (the audited hard kill-list) are quarantined regardless of evidence.

Output: backtest_results/promotion.json, containing {promoted, quarantined, decisions, thresholds}, with every decision carrying its failed-check reasons.
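
As a rough sketch of the all-bars-must-pass rule, the gate could look like the code below. The class name `PromotionThresholds` appears in this document, but its field names, the `evaluate` function, and the metric dictionary keys are hypothetical; only the default values are taken from the bars listed above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class PromotionThresholds:
    # Field names are hypothetical; the defaults are the documented bars.
    min_sharpe: float = 0.40
    max_pbo: float = 0.50            # strict: PBO must be below this
    min_dsr_confidence: float = 0.95
    min_stress_return: float = -0.25
    min_oos_days: int = 252
    min_trades: int = 30


def evaluate(strategy_id: str, m: Dict[str, float],
             t: PromotionThresholds = PromotionThresholds(),
             proven_negative: FrozenSet[str] = frozenset()) -> Tuple[bool, List[str]]:
    """Return (promoted, failed-check reasons); promotion needs an empty reason list."""
    reasons: List[str] = []
    if strategy_id in proven_negative:
        reasons.append("on the proven-negative kill-list")
    if m["sharpe"] < t.min_sharpe:
        reasons.append(f"sharpe {m['sharpe']:.2f} < {t.min_sharpe:.2f}")
    if m["pbo"] >= t.max_pbo:
        reasons.append(f"pbo {m['pbo']:.2f} >= {t.max_pbo:.2f}")
    if m["dsr_confidence"] < t.min_dsr_confidence:
        reasons.append(f"dsr {m['dsr_confidence']:.2f} < {t.min_dsr_confidence:.2f}")
    if m["stress_return"] < t.min_stress_return:
        reasons.append(f"stress return {m['stress_return']:.1%} < {t.min_stress_return:.0%}")
    if m["oos_days"] < t.min_oos_days:
        reasons.append(f"oos days {m['oos_days']} < {t.min_oos_days}")
    if m["trades"] < t.min_trades:
        reasons.append(f"trades {m['trades']} < {t.min_trades}")
    return (not reasons, reasons)
```

Collecting every failed check, rather than stopping at the first, is what lets each decision in promotion.json explain itself. A zero-trade sleeve fails the trades bar even if its other metrics happen to look harmless.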


4. The live registry is now DERIVED from the validated set

The daemon skips any strategy_id in qgtm_core.constants.QUARANTINED_STRATEGIES (daemon.py::_is_quarantined). That constant is no longer hand-maintained:

QUARANTINED_STRATEGIES  =  PROVEN_NEGATIVE_STRATEGIES  ∪  (registry − promoted)

constants.py reads backtest_results/promotion.json at import (defensively — any parse error falls back to the hard kill-list and never raises, so a bad file cannot crash the daemon). Therefore:

daemon active set  =  registry − QUARANTINED  =  promoted − proven_negative  =  validated set

Re-deriving the gate (e.g. after new artifacts or a threshold change) is one command: python scripts/backtest_promote.py. No hand-editing of enablement lists.
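
The derivation and its defensive fallback can be sketched like this. It is an assumption-laden illustration, not the contents of constants.py: `load_promoted` and `derive_quarantined` are hypothetical names, and the file is assumed to hold a top-level `promoted` list.

```python
import json
from pathlib import Path
from typing import FrozenSet, Iterable, Optional


def load_promoted(path) -> Optional[FrozenSet[str]]:
    """Read the promoted set; return None instead of raising on any bad file."""
    try:
        payload = json.loads(Path(path).read_text())
        return frozenset(str(s) for s in payload["promoted"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or unexpected shape: signal "no evidence".
        return None


def derive_quarantined(registry: Iterable[str], proven_negative: Iterable[str],
                       promoted: Optional[FrozenSet[str]]) -> FrozenSet[str]:
    """QUARANTINED = PROVEN_NEGATIVE ∪ (registry − promoted)."""
    if promoted is None:
        # Fallback described above: the hard kill-list only.
        return frozenset(proven_negative)
    return frozenset(proven_negative) | (frozenset(registry) - promoted)
```

For example, with registry {a, b, c}, kill-list {c}, and promoted {a, c}, the quarantined set is {b, c} and the active set is {a}: a strategy on the kill-list stays out even when the evidence file lists it as promoted.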


Full registry re-run on reproducible data (yfinance bars + VIX/DXY/yields), liquid universe (22 ETFs, ≥500k ADV), 2019-01-01 → 2025-12-31, 10 bps round-trip cost, 800-day live-matched lookback.

Result: 0 of 52 promoted. 17 strategies actually traded; 35 are data-starved (they need FRED real-rate/breakeven/curve, COT net positioning, COMEX warehouse, the futures curve, options IV, or real ETF flows — none reproducibly sourceable here, so they return zero signals and cannot be validated, let alone promoted).

Top of the leaderboard (the only strategies that traded and came closest):

| Strategy | Post-cost Sharpe | Trades | PBO | DSR conf | Max DD | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| gold_lbma_fix_anomaly | 0.38 | 346 | 0.38 ✓ | 0.23 ✗ | −1.4% | Quarantine: Sharpe < 0.40 and DSR fails; tiny capacity (346 trades) |
| overnight_gold | 0.28 | 2184 | 0.62 ✗ | 0.12 ✗ | −7.8% | Quarantine (also proven-negative) |
| xsmom | −0.13 | 17590 | 1.00 ✗ | 0.01 ✗ | −91.8% | Quarantine: negative, near-certain overfit |
| vix_haven | −0.14 | 236 | 1.00 ✗ | 0.02 ✗ | −8.4% | Quarantine |
| gold_tsmom | −0.14 | 1054 | 1.00 ✗ | 0.01 ✗ | −17.5% | Quarantine |
| tsmom | −1.24 | | 1.00 ✗ | 0.01 ✗ | | Quarantine |
| vol_risk_parity | −1.11 | | 1.00 ✗ | | | Quarantine |

Every daily-rebalanced sleeve is dragged negative by turnover × cost on the commodity-ETF universe. The macro fix is working (e.g. dxy_gold now produces a real −0.62 instead of a data-starved 0.00) — the strategies are genuinely being exercised; they simply have no validated edge as currently configured.
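
The turnover × cost drag is simple arithmetic, sketched below. Only the 10 bps round-trip cost comes from this run; the 25% share of the book replaced per rebalance is a purely hypothetical figure chosen to show the scaling, and `annual_cost_drag` is an illustrative helper, not part of the harness.

```python
def annual_cost_drag(book_fraction_traded: float, rebalances_per_year: int,
                     round_trip_cost: float = 0.0010) -> float:
    """Annual return lost to costs: round trips per year times cost per round trip."""
    return book_fraction_traded * rebalances_per_year * round_trip_cost


# Hypothetical churn: 25% of the book replaced at every rebalance.
daily = annual_cost_drag(0.25, 252)     # rebalanced every trading day
monthly = annual_cost_drag(0.25, 12)    # same churn, but once a month
print(f"daily: {daily:.2%} per year, monthly: {monthly:.2%} per year")
```

Under these assumptions the daily schedule gives up about 6.3% a year to costs against about 0.3% for the monthly one, a 21x difference coming from rebalance frequency alone.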

Recommended allocation: $0 to the current registry. recommended_allocation() returns {} because the promoted set is empty. Concentrating capital on "validated edge only" here means deploying no capital until something clears the bar — which is the correct, honest outcome, not a failure of the gate.

Smallest robust starting set (evidence-based, NOT yet promoted)

The trend premium is not gone — it is being destroyed by daily rebalancing. An indicative low-turnover backtest (monthly rebalance, 12-month TSMOM, liquid ETFs, 10 bps turnover cost, 2014-2025, 132 months):

| Sleeve | Sharpe | CAGR |
| --- | --- | --- |
| 12m TSMOM, long-only, 12 liquid ETFs | +0.47 | +7.6% |
| 12m TSMOM long/short, 12 liquid ETFs | +0.38 | +3.8% |
| 12m TSMOM long/short, core (GLD,GDX,DBC,USO) | +0.33 | +3.9% |

A monthly-rebalanced, long-biased 12-month TSMOM sleeve on the liquid ETF basket is the smallest defensible starting point. This is an indicative result (simplified single-pass backtest, not the full PBO/DSR/walk-forward harness) — so the recommendation is to implement it as a low-turnover strategy variant and put it through this same gate before it receives capital. Do not fabricate edge by promoting it on this indicative number alone.
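
For reference, a single-pass monthly TSMOM backtest of the kind described here can be sketched as below. This is not the code behind the indicative numbers: the function name, the equal weighting across assets, and charging 10 bps per unit of weight traded are assumptions, and the input is expected to be a frame of month-end prices with one column per ETF.

```python
import numpy as np
import pandas as pd


def tsmom_monthly(prices: pd.DataFrame, lookback: int = 12,
                  cost_per_turnover: float = 0.0010,
                  long_only: bool = True) -> pd.Series:
    """Net monthly returns of an equal-weight time-series-momentum sleeve."""
    rets = prices.pct_change(fill_method=None)
    trailing = prices.pct_change(lookback, fill_method=None)
    signal = np.sign(trailing)                  # +1 / -1 by sign of the 12-month return
    if long_only:
        signal = signal.clip(lower=0)           # flat instead of short
    weights = signal / prices.shape[1]          # NaN until the lookback is filled
    held = weights.shift(1)                     # decide at month-end, earn next month
    gross = (held * rets).sum(axis=1, min_count=1)
    # Weight traded at each month-end, charged against the following month.
    turnover = weights.fillna(0).diff().abs().sum(axis=1).shift(1)
    return (gross - cost_per_turnover * turnover).dropna()
```

The one-month lag between signal and return is what keeps the sketch free of look-ahead. The first `lookback + 1` rows are dropped as warm-up, so twelve years of monthly prices would yield somewhat fewer than 144 return observations.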


6. Operator decisions still required

  • Missing reproducible data feeds. Real rates/breakevens (FRED key), COT net positioning, COMEX warehouse stocks, the GC/SI futures curve, options IV surfaces, and real ETF creation/redemption flows are not reproducibly available to the validation harness from this host. Sleeves that depend on them cannot be validated offline and are therefore not promotable until a point-in-time source lands. This is a data-procurement decision, not a code bug.
  • Threshold calibration. The DSR ≥ 0.95 bar (deflated across 52 trials) is deliberately strict. If the honest gate promotes too few strategies, the operator can either (a) fund the missing data feeds to validate more sleeves, or (b) consciously relax a documented threshold in PromotionThresholds — the gate is explicit so the trade-off is visible, not buried.