# 04 - Macro / Regime & Statistical-Arbitrage / Pairs: Research & Specification
Author: Institutional quant research (read-only audit + design)
Date: 2026-06-06
Scope: Two families implementable in /Users/admin/qgtmai/trading (Python 3.12, Alpaca paper, ETFs/options).
Status: RESEARCH + SPEC ONLY. No production code, no git, no repo edits. One file written: this one.
Universe (Alpaca-tradable): GLD, IAU, SLV, GDX, GDXJ, SIL, SIVR (PM core) + SPY, TLT, IEF, UUP, HYG, LQD, VIXY (cross-asset for regime/macro). PPLT/PALL are tradable but thin — capacity-gated, see §B.4.
## 0. Executive summary & verdict
Two related families are specified:
- (A) Macro / Regime — a proper regime classifier (growth × inflation × risk quadrants from PIT macro), a regime-conditioned allocation overlay that replaces the current decorative one, and a point-in-time real-rate-residual gold signal.
- (B) Statistical arbitrage / Pairs — cointegration-gated, cost-aware, beta-neutral dynamic-hedge pairs (GLD/SLV, GDX/GLD, GDXJ/GDX) with z-score bands and half-life holding, designed specifically to avoid the four mechanical failures that killed the existing pairs.
Headline diagnosis (evidence in §1):
- The unwired classifier sleeve (`regime_classifier_pm`) is orphaned for three concrete reasons: it is not in the daemon registry, it demands feature columns that are never built, and it is architecturally miscast (a classifier emitting directional GLD trades instead of a regime label the allocator can consume).
- The quarantined pairs did not fail because "pairs don't work". They failed because of mechanical bugs (hedge ratio computed but never used → legs are not beta-neutral → residual directional metal exposure), a cost model that overcharges multi-day holds ~10–20×, and a feature-plumbing gap that produced 0.00-Sharpe "no-trade" backtests. These are fixable; the signal idea is partly salvageable, the implementations are not.
Salvage verdict (full detail §5):
| Existing strategy | Class | Backtest evidence | Verdict |
|---|---|---|---|
| `real_rate_gold` | macro | 0.00 Sharpe = no trades (macro features unplumbed in backtest) | Salvage: re-spec PIT + re-backtest; keep live |
| `gold_real_rate_residual` | macro | 0.00 = no trades in backtest | Salvage → replace with §A.3 (PIT residual) |
| `dxy_gold` | macro | 0.00 = no trades | Salvage: fold into §A.3 as a second residual leg |
| `breakeven_inflation_gold` | macro | 0.00 = no trades; uses revisable WALCL w/o vintage | Salvage w/ PIT fix (§A.1 feeds it) |
| `regime_classifier_pm` (HMM+BOCPD) | regime | unwired; non-identifiable labels; in-sample leak | Kill as a strategy; rebuild as a classifier (§A.1) |
| `gold_bitcoin_regime` | "regime" | wired but it's a directional corr filter, not regime | Keep as niche tilt; not a regime classifier |
| `gold_silver_ratio` | pairs | Sharpe −0.46, PBO 0.88 | Dead as-is → replace with §B.1 |
| `miners_vs_metal` | pairs | Sharpe −0.16, PBO 0.63 | Dead as-is → replace with §B.2 (beta-neutral) |
| `gold_platinum` | cross-commodity pair | Sharpe −0.47…−1.57, PBO 1.00 | Stay dead (no stable cointegration; capacity-poor) |
| `kalman_pairs` | pairs | enabled, not quarantined, but trades the innovation (white noise) | Re-spec: trades wrong object (§B.1 fixes it) |
| `pairs_mr` | pairs | enabled; naive log-spread, no cointegration gate, no beta | Re-spec or retire in favor of §B.1/§B.2 |
Prioritized shortlist (build order, §4): (1) Fix the backtester cost/feature gaps — nothing below is trustworthy until then; (2) regime_classifier_v2 (PIT quadrant) + real allocation overlay; (3) gold_real_rate_residual_v2 (PIT); (4) statarb_metal_v2 (GLD/SLV beta-neutral cointegration); (5) miners_metal_v2 (beta-neutral). Cross-commodity (gold_platinum) stays retired.
## 1. System diagnosis (why the current sleeves fail)

### 1.1 The regime classifier is unwired: three root causes
There are two regime systems in the tree and only the simple one is connected:
- `MarketRegimeDetector` (`qgtm_strategies/regime_detector.py`): a 4-state price-based vane (RISK_ON / RISK_OFF / CRISIS / TRANSITION) from realized vol, breadth, correlation. It is **wired**: the daemon constructs it and the portfolio allocator consumes it.

```113:139:trading/qgtm_portfolio/allocator.py
    def allocate_regime_aware(
        self,
        strategy_signals: dict[str, list[Signal]],
        regime: Regime,
        regime_confidence: float,
        strategy_performance: dict[str, dict[str, float]] | None = None,
    ) -> list[Signal]:
        ...
        # 3. Get regime sector weights
        regime_weights = MarketRegimeDetector.get_regime_weights(regime)
```

- `RegimeClassifierStrategy` (`qgtm_strategies/regime_classifier_pm.py`) — the HMM+BOCPD ensemble producing 5 states (risk_on/off, inflation, deflation, crisis). It is **orphaned**.
**Root cause 1 — not registered.** The daemon's canonical registry has 52 sleeves across 7 categories; `regime_classifier_pm` is in none of them (only `gold_bitcoin_regime` and the `MarketRegimeDetector` import appear). With no registry entry it is never instantiated, never called, never routed. Confirmed in `qgtm_live/daemon.py::PM_STRATEGY_CATEGORIES` (lines 185–252) and `qgtm_core/strategy_state.py::DEFAULT_STRATEGY_STATES` (no key for it).
**Root cause 2 — feature-contract mismatch.** It requires ≥3 of `["real_rate","dxy","vix","gold_vol","curve_slope"]`:
```141:147:trading/qgtm_strategies/regime_classifier_pm.py
        available_cols = [c for c in REGIME_FEATURE_COLS if c in features.columns]
        if len(available_cols) < 3:
            return signals
        if len(features) < self.min_data_points:
            return signals
```

But the feature build produces `dfii10`/`real_10y` (not `real_rate`), aliases `gvz`→`gold_vol`, and builds `t10y2y` (not `curve_slope`). So even if registered it would routinely return `[]`.
**Root cause 3: architecturally miscast (the deep reason).** A regime classifier should emit a label + probabilities that the allocator uses to tilt sleeves. This one emits directional GLD/IAU `Signal`s via a hard-coded `_regime_to_signal` map, i.e. it is a thinly disguised directional gold strategy. Worse, its labels are not identifiable: it runs k-means each call and maps cluster index → `REGIME_NAMES` positionally (cluster 0 ≡ "risk_on"), which is arbitrary and unstable run-to-run; and it standardizes with full-sample mean/std `mu = np.mean(X, axis=0)` (lines 157–160), an in-sample leak because the last row's z-score uses statistics computed including that row and future rows in the window. Net: it cannot be trusted as a label source even if plumbed.
**Design implication:** do not "wire up" `regime_classifier_pm`. Rebuild it as a label producer (§A.1) consumed by a real overlay (§A.2). Keep `MarketRegimeDetector` as the fast risk vane; add the slow macro-quadrant classifier as a second, orthogonal input.
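The standardization leak named in root cause 3 can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the repo; the causal variant standardizes each row with statistics from strictly earlier rows only, which is one conservative reading of "data ≤ t".

```python
from statistics import mean, pstdev


def full_sample_zscores(xs):
    """Leaky variant: every row is standardized with whole-sample statistics."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]


def causal_zscores(xs, min_obs=3):
    """Causal variant: row t uses only rows strictly before t (None until warm)."""
    out = []
    for t, x in enumerate(xs):
        hist = xs[:t]
        sd = pstdev(hist) if len(hist) >= min_obs else 0.0
        out.append((x - mean(hist)) / sd if sd > 0 else None)
    return out


calm = [1.0, 2.0, 3.0, 4.0, 5.0]
shock = [1.0, 2.0, 3.0, 4.0, 500.0]  # only the final (future) row differs

# The leaky z-score of row 3 changes when a later row changes; the causal one does not.
print(full_sample_zscores(calm)[3], full_sample_zscores(shock)[3])
print(causal_zscores(calm)[3], causal_zscores(shock)[3])
```

A check of this kind (perturb future rows, assert past outputs are unchanged) is one cheap way to build the "PIT lint" mentioned later in this spec.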
### 1.2 Why the existing pairs were quarantined: mechanical, not conceptual
Quarantine list with the platform's own recorded numbers:
```98:111:trading/qgtm_core/constants.py
QUARANTINED_STRATEGIES: frozenset[str] = frozenset(
    {
        "gold_silver_ratio",  # Sharpe -0.46, PBO 0.88
        "gold_platinum",  # Sharpe -1.57, PBO 1.00
        "miners_vs_metal",  # Sharpe -0.16, PBO 0.63
        "seasonality_pm",  # Sharpe -1.74, PBO 1.00
        "overnight_gold",  # Sharpe -0.24, PBO 0.88
```

Reading the code, four concrete defects explain the negative Sharpe:
**Defect A — the dynamic hedge ratio is decorative (the killer).** `gold_silver_ratio` computes a Kalman β then **never uses it**: both legs are sized `+w / -w` (equal dollar), independent of β.
```251:279:trading/qgtm_strategies/gold_silver_ratio.py
        if z > self.entry_z:
            # Ratio too high -> silver cheap -> long SLV, short GLD
            w = float(np.clip(abs(z) / self.stop_z * self.max_weight, 0.05, self.max_weight))
            signals.extend(
                [
                    Signal( ... symbol="SLV", side=Side.LONG, weight=w, ...),
                    Signal( ... symbol="GLD", side=Side.SHORT, weight=-w, ...),
                ],
            )
```

Equal-dollar legs are **not** market-neutral when β≠1. The GLD/SLV daily-vol ratio is ~0.5 (silver ~2× gold vol), so an equal-dollar GLD/SLV book carries a large net metal-beta tilt in the direction of its silver leg. The "pair" is therefore a disguised directional metal bet: it wins or loses on metal direction, not on convergence. `miners_vs_metal` has the identical bug (computes a Kalman β for the spread, sizes legs `+w/−w` regardless). With GDX β-to-GLD ≈ 1.5–2.0, an equal-dollar GDX/GLD book is net-long gold beta whenever it is long GDX and net-short whenever it is short GDX. This single defect is sufficient to produce negative post-cost Sharpe in a trending-gold tape.
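To make the residual exposure concrete, here is a small arithmetic sketch. The helper names are hypothetical, and β = 2.0 is an illustrative value loosely motivated by "silver ~2× gold vol", not an estimate from the repo.

```python
def net_beta_exposure(w_a, w_b, beta_a_on_b):
    """Net exposure in units of asset b when asset a moves beta_a_on_b times asset b."""
    return w_a * beta_a_on_b + w_b


def beta_neutral_weights(beta, gross_cap, long_a=True):
    """Legs proportional to (1, -beta), scaled so |w_a| + |w_b| equals gross_cap."""
    sign = 1.0 if long_a else -1.0
    scale = gross_cap / (1.0 + abs(beta))
    return sign * scale, -sign * beta * scale


beta_slv_on_gld = 2.0  # illustrative assumption

# Equal-dollar book as sized by the quarantined sleeve: long SLV +w, short GLD -w.
equal_dollar_net = net_beta_exposure(0.20, -0.20, beta_slv_on_gld)

# Beta-neutral book with the same gross (0.40).
w_slv, w_gld = beta_neutral_weights(beta_slv_on_gld, gross_cap=0.40)
neutral_net = net_beta_exposure(w_slv, w_gld, beta_slv_on_gld)

print(equal_dollar_net)  # leftover gold-equivalent exposure of the equal-dollar book
print(neutral_net)       # approximately zero by construction
```

Under these assumptions the equal-dollar book is left with a gold-equivalent exposure equal to half its gross, while the `(1, −β)` book nets to zero.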
**Defect B: trading the wrong object.** `kalman_pairs` enters on the innovation (one-step prediction error) standardized by its own innovation variance:

```155:162:trading/qgtm_strategies/kalman_pairs.py
        z_score = current_spread / current_std  # current_spread == innovation

        # No trade if within entry band or beyond stop
        if abs(z_score) < self.entry_threshold:
            return self._empty("spread_zscore_below_entry_threshold")
```

If the filter is even roughly well-specified, the standardized innovation is ~N(0,1) **white noise**, so `|z|>2` fires ≈4.6% of the time (≈2.3% per tail) essentially at random, with no mean-reverting structure to harvest. Pairs trading must z-score the **level of the spread** `s_t = log P_a − (α_t + β_t·log P_b)` against its own trailing distribution, not the filter's surprise. `kalman_pairs` also feeds **raw price levels** (non-stationary) into the measurement equation, so α/β drift to absorb the trend and the "spread" is dominated by level drift.
**Defect C — `gold/silver` traded as a raw price ratio, not a tested spread.** `gold_silver_ratio` z-scores `GSR = P_GLD/P_SLV` directly. The ratio is not mean-stationary in modern data: the 2010–2025 GLD/SLV relationship **fails standard cointegration over the full sample** (Engle–Granger residuals are long-memory; equilibrium detected only in the post-break sub-period) because silver is ~50–60% industrial demand and decouples on the manufacturing cycle (CESifo WP 12559, 2025; CME 2025). Z-scoring a non-stationary ratio guarantees the "reversion" trade is often a structural-break trade — exactly the "trapped long silver 2011–2015" failure mode (SpeedwayMedia 2026; goldsilver.com 2026).
**Defect D — cross-commodity with no stable equilibrium and poor capacity.** `gold_platinum` (PBO 1.00) trades GLD vs PPLT; platinum is a supply-concentrated auto-catalyst metal with idiosyncratic shocks (diesel collapse, South-African supply) — no robust cointegration with gold and PPLT ADV is thin. PBO 1.00 = every IS-best config had negative OOS Sharpe → pure overfit. This one is correctly dead.
### 1.3 The backtester is the confound — two flaws inflate "no edge"
The mandate says the backtester is being repaired; two specific flaws make *every* number above suspect and must be fixed before any promote/kill decision is final.
**Flaw 1 — macro features are not plumbed into the backtest.** The live daemon enriches features with FRED/COT via `_enrich_features` (and aliases `real_10y→dfii10`, etc.), but the batch harness (`scripts/backtest_all.py` → `qgtm_backtest/engine.py`) feeds a price/`ret_1d` frame only. Result: every macro sleeve (`real_rate_gold`, `dxy_gold`, `breakeven_inflation_gold`, `gold_real_rate_residual`, `cot_precious`, …) hits its `_empty("insufficient_dfii10_history")` guard, trades **zero** times, scores **Sharpe 0.00**, and is auto-quarantined. That is why 30 of 36 sleeves in `backtest_results/SUMMARY.md` show identical `0.00 / 0.0% / 0.0%`. **These are false negatives (a data gap), not evidence of no alpha.**
**Flaw 2 — costs are charged on notional every day, not on turnover.** The engine charges the full round-trip bps on `|weight|` *each day a position is held*:
```269:274:trading/qgtm_backtest/engine.py
cost = (self.config.commission_bps + self.config.slippage_bps) / 10000
scaled_weight = float(sig.weight) * weight_scale
net_ret = asset_ret * scaled_weight - cost * abs(scaled_weight)
daily_ret += net_ret
num_trades += 1
```

A macro sleeve holding GLD at w=0.45 for 21 days is charged 10bps × 0.45 × 21 ≈ 94 bps, when the real cost is ~10bps × 0.45 × 2 ≈ 9 bps (entry+exit). That is a ~10–20× cost overcharge for any multi-day hold, and num_trades/turnover_annual are meaningless (they count symbol-days, not round-trips). Costs must be applied to Δweight (turnover), not held notional.
**Mandatory pre-work (P0 in §4):** (i) inject the PIT macro panel into the backtest feature frame; (ii) replace per-day notional cost with per-trade turnover cost `cost·Σ|w_t − w_{t-1}|`; (iii) re-run the 2015–2025 GAP-002 batch. Only then are promote/kill verdicts trustworthy.
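The gap between the two cost conventions can be reproduced with a short sketch. The function names are hypothetical, and `cost_bps` is treated as a one-way cost per unit of weight traded, matching the "10bps × 0.45 × 2" arithmetic above.

```python
def notional_cost_bps(weights, cost_bps):
    """Engine-style charge: cost on |w| for every day a position is held."""
    return sum(cost_bps * abs(w) for w in weights)


def turnover_cost_bps(weights, cost_bps, initial_weight=0.0):
    """Turnover-style charge: cost on |w_t - w_{t-1}| only."""
    total, prev = 0.0, initial_weight
    for w in weights:
        total += cost_bps * abs(w - prev)
        prev = w
    return total


# 21-day GLD hold at w = 0.45, then flat on day 22.
path = [0.45] * 21 + [0.0]

held = notional_cost_bps(path, cost_bps=10.0)    # charged every held day
traded = turnover_cost_bps(path, cost_bps=10.0)  # charged on entry and exit only
print(held, traded, held / traded)
```

For this path the held-notional charge is 94.5 bps against 9 bps on turnover, a ratio of about 10.5×, and the ratio grows linearly with the holding period.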
## 2. Research grounding (with citations)
Real rates → gold (the dominant macro channel). Gold is a zero-coupon, long-duration real asset; its opportunity cost is the long-term real yield. Theory and data give a strong negative level relationship: the Chicago Fed estimates a 1 pp rise in the 10y real rate lowers the real gold price ≈13% (Chicago Fed Letter 464, 2021), confirmed at annual, quarterly-innovation, and daily-difference frequencies. Erb & Harvey (2013, The Golden Dilemma) and Jermann (2021, Wharton) model gold off a real-rate term structure; 10y TIPS (FRED DFII10) is the canonical proxy. Caveat — state dependence: post-COVID the link weakened to non-significance as central-bank buying / safe-haven / geopolitical demand dominated (Univ. of Malta OAR 2024; ECB FSR 2025). This is why a regime overlay and a residual (not level) formulation matter.
Dollar → gold. Gold is USD-priced, so a weaker dollar mechanically lifts it for foreign buyers; the inverse DXY–gold correlation runs ≈ −0.4 to −0.6 and is regime-dependent (Capie, Mills & Wood 2005; Joy 2011). Use the broad trade-weighted dollar (FRED DTWEXBGS) rather than the futures DXY for ETF-relevant exposure.
Regime / Markov-switching allocation. Macro regimes partition asset returns; the standard practitioner frame is the growth × inflation quadrant (Goldilocks / Reflation / Stagflation / Deflation), often with a third risk axis. Hidden-Markov / state-dependent weighting improves OOS risk-adjusted return and drawdown vs static benchmarks after costs (Kritzman, Page & Turkington 2012 Regime Shifts; Ang & Bekaert 2002/2004; Nystrup et al.; Shu, Yu & Mulvey 2024 Downside risk reduction using regime-switching signals; ICPM 2024 regime-based SAA over 1973–2023). Two hard lessons from the literature: (i) build regimes from macroeconomic data, not noisy returns, and (ii) lag the regime signal for publication delay or the backtest is fiction.
Pairs / cointegration (the canon). Gatev, Goetzmann & Rouwenhorst (2006, RFS) — distance method, 12-month formation, enter at 2σ, ~11%/yr excess 1962–2002, robust to conservative costs. But: no cointegration test in the distance method → up to ~32% of selected pairs never converge (Do & Faff 2010); and profitability decayed to ~0.24%/mo by 2003–2009 and is largely unprofitable after 2002 once realistic costs are applied (Do & Faff 2010, 2012, FAJ/JFR). Modern replication still finds a positive but modest distance-pairs Sharpe ≈1.3 in equities with refined matching (Zhu, Yale 2024). For spread construction, the state-space / Kalman dynamic hedge ratio is the standard upgrade over rolling OLS (Chan 2013 Algorithmic Trading; Feng & Palomar 2016; QuantStart EWA/EWC).
Why naive gold/silver fails post-cost. (1) Costs: each round trip is two spreads + two commissions; modest ETF edges per reversion are eaten when held/round-tripped frequently (Do & Faff 2012). (2) Structure: silver's ~50–60% industrial demand decouples it from monetary gold over multi-year windows; cointegration is not stable over 2010–2025 (CESifo WP 12559, 2025; CME 2025 Four drivers of the GSR). (3) Slow/again-diverging reversion: the GSR can stay "extreme" for years (sub-40 in 2011–2015; >100 in 2020 and 2025), so a 2σ entry with a finite stop is a structural-break generator, not an arbitrage (SpeedwayMedia 2026; goldsilver.com 2026). Implication: trade GLD/SLV only when a rolling cointegration test currently passes and a measured half-life is short; size beta-neutral; budget costs on turnover; expect to be flat much of the time.
## 3. Strategy specifications
Each spec uses this template: Rationale & persistence · Signal formulas · Data + wired? + PIT/revision · Entry/Exit/Sizing · Expected gross/net Sharpe, turnover, capacity · Correlation to trend/carry/vol · Failure modes & decay · OOS validation.
PIT reference points already in the tree: pit_join(..., knowledge_time=...) (qgtm_data/pit.py), COT_REPORT_LAG_DAYS = 3 (qgtm_core/constants.py), join_asof(strategy="backward") in _enrich_features. The Strategy contract (generate_signals(features, universe, timestamp) -> list[Signal], weights ∈[−1,1]) is in qgtm_strategies/base.py.
### A. MACRO / REGIME FAMILY

#### A.1 `regime_classifier_v2`: PIT growth × inflation × risk quadrant (label producer)
Rationale & persistence. Macro regimes are persistent (months–quarters) because growth and inflation trends and policy cycles are persistent; that persistence is what makes a conditioning signal survive costs (you re-tilt rarely). This is a classifier, not a trade: it outputs a regime label + state probabilities for the overlay (§A.2) and for sleeve gating. It replaces the orphaned regime_classifier_pm and complements the fast price vane MarketRegimeDetector.
Signal formulas. Build three standardized macro axes from PIT monthly/weekly data, each as a 3-month change vs trailing 5y mean/std (z-scores computed on data strictly ≤ t):
```
GROWTH_t    = z(Δ3m INDPRO) + z(Δ3m MANEMP) + z(level of T10Y2Y)      # rising curve ⇒ growth ahead
INFLATION_t = z(Δ3m T10YIE) + z(Δ3m T5YIE) + z(Δ3m CPI_yoy)          # market + realized inflation
RISK_t      = z(VIXCLS) + z(MOVE) + z(HY_OAS Δ) − z(SPY 63d return)  # stress axis
REAL_RATE_t = level(DFII10)                                          # carried as a control
quadrant = (sign(GROWTH_t), sign(INFLATION_t)) ∈
    {(+,−):Goldilocks, (+,+):Reflation, (−,+):Stagflation, (−,−):Deflation}
crisis_override = RISK_t > +2 → label = "crisis"
```
Estimate state probabilities with a Gaussian HMM on the 3-axis vector (use a proper library, e.g. hmmlearn, fit on an expanding PIT window), but anchor identifiability by labeling states from their posterior mean macro coordinates (sign of mean GROWTH/INFLATION), never by k-means cluster index. Smooth with the HMM filtered (not smoothed) probabilities — filtered uses only data ≤ t. Persistence damp: require a state to hold ≥2 observations before switching (mirrors the existing MarketRegimeDetector 0.8 confidence dampener).
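The quadrant rule and the identifiability anchor can be sketched as follows. This is a minimal illustration with hypothetical names: `state_means` stands for the per-state mean vectors of whatever HMM is fitted (for example the `means_` array of a fitted hmmlearn `GaussianHMM`), ordered as (growth, inflation, risk). Treating a z-score of exactly zero as positive, and applying the crisis threshold to a state's mean risk coordinate, are assumptions of this sketch.

```python
QUADRANTS = {
    (1, -1): "goldilocks",
    (1, 1): "reflation",
    (-1, 1): "stagflation",
    (-1, -1): "deflation",
}


def sign(x):
    """Sign with zero mapped to +1 (an arbitrary tie-break for this sketch)."""
    return 1 if x >= 0 else -1


def quadrant_label(growth_z, inflation_z, risk_z, crisis_threshold=2.0):
    """Growth x inflation quadrant, overridden by 'crisis' when the risk axis is extreme."""
    if risk_z > crisis_threshold:
        return "crisis"
    return QUADRANTS[(sign(growth_z), sign(inflation_z))]


def label_states_from_means(state_means, crisis_threshold=2.0):
    """Name each HMM state from its mean macro coordinates, never from its index."""
    return {
        k: quadrant_label(g, i, r, crisis_threshold)
        for k, (g, i, r) in enumerate(state_means)
    }


# The same three states in two different index orders receive the same names.
means_run_1 = [(0.8, -0.5, 0.1), (-0.6, 0.9, 0.4), (-1.2, -0.3, 2.6)]
means_run_2 = [means_run_1[2], means_run_1[0], means_run_1[1]]
print(label_states_from_means(means_run_1))
print(label_states_from_means(means_run_2))
```

Because labels follow coordinates rather than cluster order, a refit that permutes state indices does not rename regimes. Two states can map to the same label; how to merge their probabilities is left open here.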
Data + wired? + PIT/revision.
- DFII10, DTWEXBGS, T10YIE, T5YIE, T10Y2Y, VIXCLS, MOVE — wired (FRED provider qgtm_data/fred.py; all are daily market-priced series → published next business day, no meaningful revisions → PIT-safe with a 1-day lag).
- INDPRO, MANEMP, CPI, TCU — partially wired (INDPRO, MANEMP in FRED_SERIES) but these are monthly, revised, and lagged 2–6 weeks. The current provider hits the standard /series/observations endpoint = latest revised value = look-ahead. Required fix: use ALFRED real-time vintages (realtime_start/realtime_end) or apply a conservative release lag (INDPRO/MANEMP: +6 weeks via pit_join(knowledge_time=release_date)).
- HY_OAS (BAMLH0A0HYM2) — not wired; add to FRED_SERIES. Daily, ~1-day lag.
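The conservative release-lag option can be sketched with the standard library alone. This is a generic stand-in, not the signature of the repo's `pit_join`; the function names, the 42-day lag, and the index values are illustrative assumptions.

```python
from bisect import bisect_right
from datetime import date, timedelta


def build_pit_rows(observations, release_lag_days):
    """observations: iterable of (reference_date, value).

    Returns rows (knowledge_date, reference_date, value) sorted by knowledge date,
    where knowledge_date = reference_date + release lag.
    """
    return sorted(
        (ref + timedelta(days=release_lag_days), ref, val) for ref, val in observations
    )


def value_known_at(pit_rows, asof):
    """Latest value whose knowledge date is on or before `asof`, else None."""
    knowledge_dates = [k for k, _, _ in pit_rows]
    i = bisect_right(knowledge_dates, asof)
    return pit_rows[i - 1][2] if i else None


# Hypothetical monthly prints, made visible 6 weeks after the reference date.
rows = build_pit_rows(
    [(date(2024, 1, 1), 100.0), (date(2024, 2, 1), 101.0)],
    release_lag_days=42,
)
print(value_known_at(rows, date(2024, 2, 11)))  # before the first release
print(value_known_at(rows, date(2024, 3, 13)))  # January print only
print(value_known_at(rows, date(2024, 3, 14)))  # February print becomes visible
```

A fixed lag only addresses publication delay. Revisions still require vintage data, since the value stored here is whatever the provider returned.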
Entry/Exit/Sizing. N/A — emits no orders. Output object: {label, p_goldilocks, p_reflation, p_stagflation, p_deflation, p_crisis, growth_z, inflation_z, risk_z, real_rate, asof} for the overlay and for should_trade() gating.
Expected contribution. Not a standalone Sharpe. Literature attributes most of the value to drawdown reduction and turning sleeves off in hostile regimes (Kritzman 2012; Shu-Mulvey 2024). Target: overlay lifts blended net Sharpe by ~0.1–0.3 and cuts max-DD by 20–35% vs un-conditioned, after costs.
Correlation to trend/carry/vol. The classifier is orthogonal by construction; its downstream tilts will correlate with whatever sleeves it up-weights (in Reflation: long-gold/real-rate; in Crisis: haven/vol).
Failure modes & decay. (1) Regime mislabel at turning points (HMM lag) — mitigate with the BOCPD-style changepoint flag only to widen uncertainty (flatten tilts), never to flip direction. (2) Revision look-ahead if ALFRED not used — the single biggest validity risk; gate the whole classifier behind a PIT lint. (3) Few historical crisis observations → unstable crisis emission; cap crisis tilt magnitude. Decay: slow (macro relationships are sticky); re-fit HMM annually, monitor label stability.
OOS validation. Walk-forward with anchored expanding window, embargo ≥ 21 trading days (covers monthly-data release lag). Validate labels against an independent NBER/recession and CPI-surprise tape (not against returns — avoids circularity). Then validate the overlay (§A.2) returns with CPCV/PBO. Acceptance: regime-stratified Sharpe positive in ≥2 regimes (qgtm_backtest/validation.py::regime_stratified_sharpe).
#### A.2 `regime_overlay_v2`: regime-conditioned allocation (replaces the decorative one)
Rationale & persistence. The current overlay blends each signal toward a static sector weight map (REGIME_WEIGHTS in regime_detector.py) keyed only on the 4-state price vane, and the blend is a linear interp on confidence (allocator.py lines 134–169). It ignores the macro quadrant entirely and applies sector weights even when no sleeve has a view there. Replace with a gating + tilt overlay driven by §A.1.
Signal formulas (overlay logic, not a trade).
```
for each sleeve s with regime_tags R_s:
    gate_s   = 1 if label ∈ R_s (or R_s empty) else 0          # hard off in hostile regime
    tilt_s   = w_base_s · (1 + κ · p(label) · sign_s(label))   # soft tilt by state prob
final_weight_symbol = Σ_s gate_s · tilt_s · signal_weight_{s,symbol}
crisis: scale gross by 0.5 and force net-PM ≤ cap              # de-risk, don't reverse
```
Data + wired? + PIT. Consumes §A.1 output (PIT by construction) + existing per-sleeve signals. No new market data.
Sizing. Keep the platform's downstream vol-target (DEFAULT_VOL_TARGET=0.12) and per-name caps (RiskLimits); the overlay only sets relative sleeve weights and a gross scalar.
Expected contribution / correlation / failure / OOS: as §A.1 (they are validated jointly). Key failure mode unique here: gating churn if labels flip — enforce the ≥2-observation hold and a per-sleeve re-tilt cooldown to keep turnover (hence cost) low.
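The gate + tilt logic of this section can be sketched in a few lines. The `Sleeve` container, the parameter names, and κ = 0.5 are hypothetical; the net-PM cap and the re-tilt cooldown are deliberately left out of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Sleeve:
    name: str
    base_weight: float
    signals: dict  # symbol -> signal weight in [-1, 1]
    regime_tags: set = field(default_factory=set)    # empty = allowed in every regime
    regime_sign: dict = field(default_factory=dict)  # label -> +1 / -1 tilt direction


def overlay_weights(sleeves, label, p_label, kappa=0.5, crisis_gross_scale=0.5):
    """Sum gated, probability-tilted sleeve signals into per-symbol weights."""
    final = {}
    for s in sleeves:
        gate = 1.0 if (not s.regime_tags or label in s.regime_tags) else 0.0
        tilt = s.base_weight * (1.0 + kappa * p_label * s.regime_sign.get(label, 0))
        for symbol, w in s.signals.items():
            final[symbol] = final.get(symbol, 0.0) + gate * tilt * w
    if label == "crisis":
        # De-risk by scaling gross; never flip direction.
        final = {symbol: w * crisis_gross_scale for symbol, w in final.items()}
    return final


sleeves = [
    Sleeve("macro_gold", 0.5, {"GLD": 1.0}, {"reflation"}, {"reflation": 1}),
    Sleeve("metal_pair", 0.5, {"GLD": -0.5, "SLV": 0.5}, {"goldilocks"}),
]
weights = overlay_weights(sleeves, label="reflation", p_label=0.8)
print(weights)
```

In this example the pair sleeve is gated off in Reflation and contributes nothing, while the macro sleeve is tilted up from 0.5 to 0.7 by the state probability.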
#### A.3 `gold_real_rate_residual_v2`: PIT cointegration-residual gold (the flagship macro alpha)
Rationale & persistence. Gold and the 10y real yield are economically linked (§2). The level sleeve (real_rate_gold) only fires when DFII10 is in the bottom tercile and falling — it misses the common regime where rates and gold trend together and only trades the residual deviation. A residual formulation (gold rich/cheap vs the real-rate-implied fair value) captures the persistent, economically-anchored dislocation and is far less directional than a level rule. This supersedes gold_real_rate_residual (whose math is right but PIT/units are loose) and folds in the dxy_gold residual as a secondary axis.
Signal formulas.
```
Fair value (rolling, window W=252, computed on data ≤ t):
    log(GLD_t) = a_t + b_r · DFII10_t + b_d · log(DTWEXBGS_t) + ε_t
Residual z-score:
    z_t = (ε_t − mean(ε_{t-W..t})) / std(ε_{t-W..t})          # NB: causal window only
Half-life gate (OU on ε):  HL = −ln2 / φ  from  Δε = c + φ·ε_{t-1} + u_t,  require 5 ≤ HL ≤ 126
Cointegration gate: ADF(ε) p < 0.10 on the trailing window (else stand aside)
```

Trade rule: go long GLD when z_t ≤ −1.0 (gold cheap vs macro fair value), exit toward |z|<0.5, hard stop at z ≤ −3.5 (model break). Optional symmetric short only in non-Crisis, non-Reflation regimes (gold has a structural upward drift from CB buying, so be cautious shorting).
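The half-life gate can be sketched with a single-regressor OLS in plain Python. The function names are hypothetical. Note that `−ln2/φ` is the continuous-time approximation of the exact discrete AR(1) half-life `−ln2/ln(1+φ)`; the two are close when φ is small.

```python
import math


def ou_half_life(series):
    """Half-life from the OLS slope phi in  d_eps = c + phi * eps_lag + u."""
    lag = series[:-1]
    diff = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(lag)
    mean_lag, mean_diff = sum(lag) / n, sum(diff) / n
    sxx = sum((x - mean_lag) ** 2 for x in lag)
    if sxx == 0:
        return math.inf  # constant series: no measurable reversion
    sxy = sum((x - mean_lag) * (y - mean_diff) for x, y in zip(lag, diff))
    phi = sxy / sxx
    if phi >= 0:
        return math.inf  # no mean reversion detected
    return -math.log(2) / phi


def passes_half_life_gate(series, lo=5.0, hi=126.0):
    """True when the estimated half-life lies inside the tradeable band."""
    return lo <= ou_half_life(series) <= hi


# Noise-free AR(1) with coefficient 0.9, so phi is exactly -0.1.
decaying = [0.9 ** t for t in range(200)]
trending = [float(t) for t in range(50)]  # pure drift, phi = 0

print(ou_half_life(decaying), passes_half_life_gate(decaying))
print(ou_half_life(trending), passes_half_life_gate(trending))
```

In use, `series` would be the trailing window of residuals ending at t, so the estimate stays causal.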
Data + wired? + PIT/revision. GLD close — wired (Alpaca). DFII10, DTWEXBGS — wired, daily market series, PIT-safe at +1 business-day lag (no ALFRED needed; these are essentially non-revised). This is the cleanest macro sleeve for PIT — use it as the reference implementation. Use b_d·log(dollar) with the broad index (DTWEXBGS), not futures DXY.
Entry/Exit/Sizing. Entry z≤−1.0; scale-in linearly to z=−2.5; exit |z|<0.5 or HL-clock expiry (hold ≤ 2×HL days); stop z≤−3.5. Size w = clip(|z|/2.5, 0, 1) · max_weight, max_weight≈0.30, then hand to vol-target. Rebalance daily but trade only on band cross (keeps turnover ≈ entries/yr × 2).
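The long-side sizing rule can be written as a small pure function. The function name and defaults are hypothetical, chosen to mirror the numbers in the paragraph above. It returns the entry-side target only: the `|z|<0.5` exit band and the HL-clock need position state and are not modeled here.

```python
def residual_position_weight(z, max_weight=0.30, entry_z=-1.0, stop_z=-3.5, full_z=2.5):
    """Long-only target weight for a residual z-score, before vol-targeting.

    Flat above the entry band, scaled linearly as clip(|z| / full_z, 0, 1) * max_weight
    inside it, and flat again at or beyond the hard stop (model break).
    """
    if z > entry_z or z <= stop_z:
        return 0.0
    return min(abs(z) / full_z, 1.0) * max_weight


for z in (-0.5, -1.0, -2.0, -2.5, -3.0, -3.5):
    print(z, residual_position_weight(z))
```

Because the weight is `|z|/2.5` rather than a ramp starting from zero at the band, the position opens at 40% of `max_weight` as soon as z crosses −1.0 and is full size from z = −2.5 down to the stop.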
Expected gross/net Sharpe, turnover, capacity. Realistic standalone gross Sharpe 0.4–0.7, net 0.3–0.5 (post 10–20bps round trip; turnover is low). Expected ~8–20 round trips/yr → annual turnover ≈ 3–6×. Capacity is large — GLD ADV > $1–2bn; at 1% ADV participation and the capacity_calibration model this clears tens of $M easily (validation.py capacity check). The edge is diversification + drawdown timing, not a high standalone Sharpe.
Correlation to trend/carry/vol. Low-to-moderately-negative vs gold trend (it fades dislocations); ~0 vs carry; mildly long vol (cheap-gold episodes cluster with risk-off). This diversifies a trend book — its main portfolio value.
Failure modes & decay. (1) Structural beta break (post-2022 rates+gold rose together) → the rolling window + ADF gate + MOVE-style bond-vol down-weight (already present in real_rate_gold) handle this; stand aside when ADF fails. (2) CB-buying drift makes shorts dangerous → default long-only/asymmetric. (3) Look-ahead via non-causal window — the _ols_residuals in the existing code regresses on the whole trailing window then takes the last residual, which is acceptable (causal) only if the window excludes future rows; current code is causal — preserve that. Decay: slow; monitor 6m/12m Sharpe via DecayMetrics (base.py).
OOS validation. CPCV (6 groups, 2 test) + DSR with num_trials = total sleeves tested (Harvey-Liu-Zhu honest count) + anchored & rolling walk-forward + regime-stratified Sharpe + block bootstrap (all already in qgtm_backtest/validation.py::run_full_validation). Embargo ≥5d (daily data). Stress: 2013 gold crash, 2020 COVID, 2022 rate shock (the platform's stress set). Promote only if PBO<0.5, DSR>0.95, WF aggregate Sharpe>0, positive in ≥2 regimes.
### B. STATISTICAL ARBITRAGE / PAIRS FAMILY
General design rules (derived from §1.2 + §2) that every pair below obeys — these are the antidotes to the quarantine:
- **Cointegration gate, evaluated rolling.** Trade only when Engle–Granger/ADF on the trailing window currently rejects unit-root at p<0.10 (Johansen optional). No gate ⇒ up to ~32% of selected pairs never converge (Do & Faff 2010).
- **Trade the spread level, not the innovation.** `s_t = log P_a − (α_t + β_t·log P_b)`, z-scored on a causal window. (Fixes `kalman_pairs` Defect B.)
- **Beta-neutral sizing.** Leg notionals `(1, −β_t)` normalized, not `(+w, −w)`. (Fixes Defect A, the killer.)
- **Half-life holding + time stop.** Measure OU half-life; require `min ≤ HL ≤ max`; hold ≤ 2×HL then exit regardless. (Avoids structural-break traps.)
- **Turnover-budgeted costs.** Net edge per reversion must exceed `2×(spread+commission)×Σ legs`; pre-screen on it. (Do & Faff 2012.)
- **Regime gate.** OFF in Crisis (convergence breaks; divergence risk spikes).
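The cost pre-screen in the rules above can be sketched as a comparison in basis points of portfolio. All names and inputs are hypothetical. The edge term assumes the spread reverts fully from the entry band to the exit band, which is a best case rather than an expectation.

```python
def round_trip_cost_bps(spread_bps, commission_bps, leg_weights):
    """Cost of one entry plus exit: 2 x (spread + commission) x sum of |leg weights|."""
    return 2.0 * (spread_bps + commission_bps) * sum(abs(w) for w in leg_weights)


def reversion_edge_bps(entry_z, exit_z, spread_std, unit_leg_weight):
    """Gross P&L if the log-spread moves from |entry_z| to |exit_z| sigmas.

    spread_std is the standard deviation of the log-spread; unit_leg_weight is the
    portfolio weight on the leg that enters the spread with coefficient 1.
    """
    return (abs(entry_z) - abs(exit_z)) * spread_std * 1e4 * abs(unit_leg_weight)


def passes_cost_screen(entry_z, exit_z, spread_std, leg_weights, spread_bps, commission_bps):
    """True when the best-case edge per reversion exceeds the round-trip cost."""
    edge = reversion_edge_bps(entry_z, exit_z, spread_std, leg_weights[0])
    return edge > round_trip_cost_bps(spread_bps, commission_bps, leg_weights)


legs = (0.10, -0.20)  # (unit leg, hedge leg) for beta = 2, gross 0.30
print(passes_cost_screen(2.0, 0.5, 0.02, legs, spread_bps=2.0, commission_bps=1.0))
print(passes_cost_screen(2.0, 0.5, 0.001, legs, spread_bps=2.0, commission_bps=1.0))
```

A pair whose log-spread volatility is too small relative to its trading costs fails the screen regardless of how clean its cointegration looks.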
#### B.1 `statarb_metal_v2`: GLD/SLV beta-neutral cointegration (replaces `gold_silver_ratio`, `kalman_pairs`)
Rationale & persistence. GLD/SLV share a monetary-metal factor; when they are cointegrated the spread mean-reverts on a tradeable horizon. Persistence is conditional — present in monetary regimes, absent when silver's industrial cycle dominates (§2). So the edge is "trade only when the test says you can," which is exactly what the existing ratio sleeve fails to do.
Signal formulas.
```
β_t, α_t via Kalman state-space on LOG prices:  log SLV_t = α_t + β_t·log GLD_t + v_t
spread   s_t = log SLV_t − (α_t + β_t·log GLD_t)
z_t      = (s_t − mean_W(s)) / std_W(s),  W = 126 (causal)
HL       = −ln2/φ  from  Δs = c + φ·s_{t−1};  require 5 ≤ HL ≤ 60
coint    ADF(s_t, trailing 252) p < 0.10  (gate)
hedge leg sizing: SLV notional = +1 unit; GLD notional = −β_t units; scale to gross cap
```

Trade rule: enter at |z|≥2.0, exit |z|≤0.5, stop |z|≥3.5 or ADF gate fails or HL-clock (2×HL) expires.
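The dynamic hedge ratio can be sketched as a two-state Kalman filter with a random-walk state, in the style of the Chan and QuantStart treatments cited in §2. The function name, `obs_var`, and the initial covariance are assumptions of this sketch, not repo values. With a small δ the coefficients adapt slowly, which is what leaves a spread level to z-score on the trailing window.

```python
import numpy as np


def kalman_hedge(log_y, log_x, delta=1e-4, obs_var=1e-3):
    """Filtered alpha_t, beta_t for  log_y = alpha + beta * log_x + v  (random-walk state).

    Returns (alphas, betas, spreads); every value at t uses observations up to t only.
    """
    n = len(log_y)
    theta = np.zeros(2)                     # state estimate [alpha, beta]
    P = np.eye(2)                           # state covariance
    Q = delta / (1.0 - delta) * np.eye(2)   # process noise: how fast the state may drift
    alphas, betas, spreads = np.empty(n), np.empty(n), np.empty(n)
    for t in range(n):
        H = np.array([1.0, log_x[t]])       # observation row
        P = P + Q                           # predict step (state mean is unchanged)
        innovation = log_y[t] - H @ theta
        S = H @ P @ H + obs_var             # innovation variance
        K = P @ H / S                       # Kalman gain
        theta = theta + K * innovation      # update step
        P = P - np.outer(K, H) @ P
        alphas[t], betas[t] = theta
        spreads[t] = log_y[t] - H @ theta   # spread level at the filtered coefficients
    return alphas, betas, spreads


# Synthetic, noise-free check: log_y = 0.5 + 2.0 * log_x.
t_idx = np.arange(500)
demo_x = 1.0 + np.sin(t_idx / 10.0)
demo_y = 0.5 + 2.0 * demo_x
a, b, s = kalman_hedge(demo_y, demo_x)
print(a[-1], b[-1])
```

The returned `spreads` series is what the z-score, half-life, and ADF gates in the formulas above would consume, and `betas[-1]` is the hedge ratio used to size the GLD leg.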
Data + wired? + PIT. GLD, SLV closes: wired (Alpaca). No macro dependency, fully PIT (only past prices). Use log prices (existing code uses the raw ratio; change this). Borrow/short feasibility: SLV and GLD are easy-to-borrow, deeply liquid ETFs on Alpaca paper, so shorting either leg is realistic.
Entry/Exit/Sizing. As formulas; max_gross ≈ 0.40 split beta-neutral across legs; per-leg cap from RiskLimits. Trade only on band cross; expect long flat periods (feature, not bug).
Expected gross/net Sharpe, turnover, capacity. When the cointegration gate is respected, realistic gross Sharpe 0.6–1.0 in qualifying windows, but blended net 0.2–0.5 because the sleeve is flat much of the year and pays two spreads per round trip (Do & Faff 2012; Zhu 2024 ≈1.3 is equities with 100s of pairs). Turnover moderate (10–25 round trips/yr when active). Capacity high (GLD/SLV ADV in $100Ms). Honest expectation: this is a low-net-Sharpe diversifier, not a standalone winner — size it accordingly.
Correlation to trend/carry/vol. Near-zero to trend and carry (market-neutral by construction); slightly short vol (convergence bets lose in dislocation spikes). Genuine diversifier if truly beta-neutral.
Failure modes & decay. (1) Cointegration break (silver industrial decoupling): handled by the rolling ADF gate; (2) beta estimation lag in fast moves: Kalman δ (process noise) tuned by EM/grid (QuantStart/Chan), default δ≈1e-4; (3) crowding/decay: distance-style pairs returns have decayed since 2002 (Do & Faff), so expect modest net and monitor DecayMetrics; (4) execution, two-leg slippage: net-and-cross within PM wrappers via the existing PRIMARY_BY_ALIAS netting. Decay: medium; re-estimate gates continuously.
OOS validation. run_full_validation with regime labels from §A.1 (must be positive in ≥2 regimes), CPCV/PBO<0.5, DSR>0.95, WF anchored+rolling, block bootstrap CI lower>0. Critical embargo: purge ≥ HL around test blocks so a straddling open trade can't leak. Stress 2011 silver spike, 2020, 2025 GSR collapse. Promote only if net (turnover-cost) Sharpe>0.3 OOS.
#### B.2 `miners_metal_v2`: GDX/GLD & GDXJ/GDX beta-neutral (replaces `miners_vs_metal`)
Rationale & persistence. Miners are levered, operationally-geared equity on the metal (Tufano 1998); GDX should track GLD with β≈1.5–2.0 plus equity-beta and idiosyncratic cost noise. Mispricings in the beta-adjusted spread mean-revert. Persistence is real but the leg is equity — it carries SPY beta and AISC/cost shocks the metal does not, so neutralization must be explicit.
Signal formulas. Same engine as B.1 on log GDX ~ log GLD (and a second pair log GDXJ ~ log GDX for junior-vs-senior). Crucially, size legs (1, −β_t) (beta-neutral), not (+w,−w) — this is the exact fix for the quarantined version's Defect A. Add a market-beta neutralizer: regress spread residual on SPY return and subtract, or cap when GDX–SPY 63d β is extreme (don't trade a disguised equity-beta bet).
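The market-beta neutralizer can be sketched as a one-regressor OLS on spread changes. The function names and the `max_abs_beta` threshold are hypothetical; the threshold here is applied to the spread's own SPY beta as an illustration, which is a simpler stand-in for the GDX–SPY 63d β cap mentioned above.

```python
def ols_beta(y, x):
    """Slope of y on x with an intercept (plain-Python OLS)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return 0.0
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx


def spy_neutral_spread_changes(spread_changes, spy_returns, max_abs_beta=0.5):
    """Strip the SPY-explained part from spread changes.

    Returns (beta, residual_changes, tradeable), where tradeable is False when the
    spread's SPY beta is too large to treat the position as a metal-vs-miner bet.
    """
    beta = ols_beta(spread_changes, spy_returns)
    residual = [ds - beta * r for ds, r in zip(spread_changes, spy_returns)]
    return beta, residual, abs(beta) <= max_abs_beta


spy = [0.010, -0.020, 0.015, -0.005, 0.000]
idio = [0.001, 0.002, -0.001, 0.000, -0.002]
changes = [0.3 * r + e for r, e in zip(spy, idio)]

beta, residual, tradeable = spy_neutral_spread_changes(changes, spy)
print(beta, tradeable)
print(ols_beta(residual, spy))  # approximately zero after neutralization
```

To stay causal, both inputs would be the trailing window ending at t, with β re-estimated as the window rolls.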
Data + wired? + PIT. GDX, GDXJ, GLD, SIL, SLV — wired (Alpaca). Optionally enrich with AISC margin (qgtm_data/aisc.py exists) as a fundamental gate (don't fade a spread that AISC justifies). PIT: prices are clean; AISC is quarterly with reporting lag → pit_join(knowledge_time=filing_date).
Entry/Exit/Sizing / Expected / Correlation / Failure / OOS. Entry/exit/stop as B.1. Expected gross 0.5–0.9 / net 0.2–0.4; turnover moderate; capacity good (GDX ADV in $100Ms; GDXJ thinner → capacity-gate). Correlation: low to metal trend if beta-neutral, but residual SPY beta is the main leakage risk → the neutralizer is mandatory. Failure: junior-miner idiosyncratic blowups (single-name weight in GDXJ), takeover gaps → cap leg, widen stop logic to HL-clock not just z-stop. OOS as B.1 with SPY-beta-stratified check added.
#### B.3 `pairs_mr` / cross-commodity basket: re-spec or retire
pairs_mr (mean_reversion.py) trades 7 hard-coded log-spreads (incl. USO/BNO, CORN/WEAT, URA/URNM, CPER/COPX) with no cointegration gate, no beta, no half-life — it is the distance method with all of Do & Faff's documented weaknesses. Recommendation: retire the metal pairs from it (covered by B.1/B.2); keep only pairs that pass the §B general rules, run through the B.1 engine. Energy/ag/uranium/copper-miner pairs are out of the PM mandate and capacity-thin — defer.
#### B.4 `gold_platinum` and other cross-commodity: stay dead
PBO 1.00, no stable cointegration, thin PPLT/PALL ADV (capacity fails the capacity_calibration floor after spread). Cross-commodity metal pairs (gold–platinum, gold–copper) have idiosyncratic supply shocks and no robust long-run equilibrium. Do not resurrect. If a platinum view is wanted, express it as a small standalone tilt gated by the regime classifier, not as a "pair."
4. Prioritized shortlist (build order)
| # | Item | Why first | Gate to next |
|---|---|---|---|
| P0 | Repair backtester: (a) inject PIT macro panel into backtest features; (b) cost on turnover Σ\|Δw\|, not held notional; (c) re-run 2015–2025 GAP-002 | Every verdict above is unreliable until macro sleeves can trade and costs are realistic | Macro sleeves produce non-zero trade counts; cost ≈ entries×2×bps |
| 1 | regime_classifier_v2 (§A.1): PIT quadrant + HMM probabilities, ALFRED vintages | Unlocks the overlay + sleeve gating; orthogonal value | Labels validated vs NBER/CPI tape; PIT lint passes |
| 2 | regime_overlay_v2 (§A.2): gate + tilt, replace decorative blend | Converts labels into drawdown reduction | Blended net Sharpe ↑, max-DD ↓ vs un-conditioned |
| 3 | gold_real_rate_residual_v2 (§A.3): cleanest PIT macro alpha | Highest-confidence standalone macro edge; reference PIT impl | PBO<0.5, DSR>0.95, ≥2 regimes |
| 4 | statarb_metal_v2 (§B.1): GLD/SLV beta-neutral cointegration | Replaces 2 quarantined/weak sleeves with one correct engine | Net (turnover-cost) Sharpe>0.3 OOS |
| 5 | miners_metal_v2 (§B.2): GDX/GLD beta+SPY-neutral | Reuses B.1 engine; diversifies | Same as #4 + SPY-beta-neutral check |
| - | Cross-commodity (gold_platinum), pairs_mr metal legs | Negative evidence, capacity-poor | Retire |
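The P0 cost fix (charge on turnover Σ|Δw|, not on held notional) can be illustrated with a short sketch. The function names and the 5 bps rate are assumptions for the example and do not reflect the actual qgtm_backtest engine code.

```python
import numpy as np

def cost_on_turnover(weights, bps):
    """Per-bar cost = bps x sum_i |w_t,i - w_{t-1},i|, starting from a flat book."""
    w = np.asarray(weights, dtype=float)
    prev = np.vstack([np.zeros((1, w.shape[1])), w[:-1]])
    return np.abs(w - prev).sum(axis=1) * bps * 1e-4

def cost_on_held_notional(weights, bps):
    """The defect: bps charged on gross held notional on every bar."""
    w = np.asarray(weights, dtype=float)
    return np.abs(w).sum(axis=1) * bps * 1e-4

# One pair trade: enter (1, -1.8), hold 20 bars, then go flat.
weights = np.vstack([np.tile([1.0, -1.8], (20, 1)), [[0.0, 0.0]]])
turnover_cost = cost_on_turnover(weights, bps=5).sum()      # entry + exit only
held_cost = cost_on_held_notional(weights, bps=5).sum()     # charged 20 times
print(held_cost / turnover_cost)
```

For a single round trip the turnover charge is one entry plus one exit on gross notional, which is the "entries×2×bps" check in the P0 gate. The held-notional charge instead grows linearly with holding time, so in this example a 20-bar hold is overcharged by a factor of 10.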
5. Salvage verdict (quarantined macro/pairs: live vs dead)
| Strategy | Recorded evidence | Real cause | Verdict | Action |
|---|---|---|---|---|
| real_rate_gold | 0.00 (no trades) | backtest macro-feature gap | Salvageable / keep live | re-backtest after P0; tighten PIT |
| gold_real_rate_residual | 0.00 (no trades) | feature gap; loose units | Salvageable → supersede | replace with §A.3 |
| dxy_gold | 0.00 (no trades) | feature gap | Salvageable | fold dollar leg into §A.3 |
| breakeven_inflation_gold | 0.00 (no trades); WALCL no vintage | feature gap + revision look-ahead | Salvageable w/ PIT fix | ALFRED WALCL/INDPRO; re-test |
| regime_classifier_pm | unwired | miscast + non-identifiable + in-sample leak | Kill as strategy | rebuild as §A.1 classifier |
| gold_silver_ratio | −0.46, PBO 0.88 | β unused (not neutral) + raw ratio + no coint gate | Dead as-is | replace with §B.1 |
| miners_vs_metal | −0.16, PBO 0.63 | β unused (not neutral) + residual SPY beta | Dead as-is | replace with §B.2 |
| kalman_pairs | enabled (not quarantined) | trades innovation (white noise) + raw-price state | Re-spec | fold into §B.1 (trade spread level) |
| pairs_mr | enabled | distance method, no gates | Re-spec/retire | metal legs → §B.1/§B.2; rest defer |
| gold_platinum | −0.47…−1.57, PBO 1.00 | no stable cointegration; capacity-poor | Stay dead | do not resurrect |
| gold_bitcoin_regime | wired | it's a corr filter, not a regime classifier | Keep as niche tilt | don't treat as regime source |
One-line bottom line: the macro family is mostly false-negative-quarantined by a broken backtester and is salvageable once features and costs are fixed; the pairs family is correctly flagged but for mechanical reasons (non-neutral hedges, wrong traded object, missing cointegration gates) — rebuild them on one correct beta-neutral, cointegration-gated, half-life-bounded engine and keep cross-commodity pairs retired.
6. References
- Barsky & Summers (1988) "Gibson's Paradox and the Gold Standard," JPE.
- Capie, Mills & Wood (2005) "Gold as a hedge against the dollar," J. Int. Fin. Markets.
- Joy (2011) "Gold and the US dollar: Hedge or haven?" Finance Research Letters.
- Erb & Harvey (2013) "The Golden Dilemma," FAJ.
- Jermann (2021) "Gold's Value as an Investment," Wharton/Rodney White.
- Chicago Fed (2021) "What drives gold prices?" Chicago Fed Letter 464.
- Univ. of Malta OAR (2024) real yields vs gold, pre/post-COVID; ECB FSR (2025).
- Hamilton (1989) "A New Approach… Business Cycle," Econometrica; Ang & Bekaert (2002, 2004) regime switches.
- Kritzman, Page & Turkington (2012) "Regime Shifts: Implications for Dynamic Strategies," FAJ.
- Shu, Yu & Mulvey (2024) "Downside risk reduction using regime-switching signals," J. Asset Management; ICPM (2024) Regime-Based SAA.
- Gatev, Goetzmann & Rouwenhorst (2006) "Pairs Trading: Performance of a Relative-Value Arbitrage Rule," RFS 19(3).
- Do & Faff (2010) "Does Simple Pairs Trading Still Work?" FAJ; (2012) "Are Pairs Trading Profits Robust to Trading Costs?" JFR.
- Krauss (2017) "Statistical Arbitrage Pairs Trading Strategies: Review and Outlook," J. Economic Surveys.
- Zhu (2024) "Examining Pairs Trading," Yale.
- Vidyamurthy (2004) Pairs Trading: Quantitative Methods and Analysis; Chan (2013) Algorithmic Trading; Feng & Palomar (2016) state-space pairs.
- CME Group (2025) "Four Major Drivers of the Gold-Silver Price Ratio"; CESifo WP 12559 (2025) gold–silver cointegration instability 2010–2025.
- Tufano (1998) "The Determinants of Stock Price Exposure: Lessons from the Gold Mining Industry," J. Finance 53(3).
- Bailey & López de Prado (2014) "The Deflated Sharpe Ratio"; Bailey, Borwein, López de Prado & Zhu (2014) "Probability of Backtest Overfitting"; Harvey, Liu & Zhu (2016) "…and the Cross-Section of Expected Returns"; López de Prado (2018) AFML.
In-tree evidence cited above: qgtm_strategies/{regime_classifier_pm,regime_detector,gold_silver_ratio,miners_vs_metal,kalman_pairs,real_rate_gold,gold_real_rate_residual,dxy_gold,base}.py, qgtm_portfolio/allocator.py, qgtm_backtest/{engine,validation}.py, qgtm_core/{constants,strategy_state,types}.py, qgtm_data/{fred,cftc,cot_reports,pit}.py, qgtm_live/daemon.py, backtest_results/{SUMMARY,RESULTS}.md.