
QGTM Trading — God-Tier Audit (2026-05-17)

Ten parallel agents audited every dimension. Below is the master synthesis.


Tier 0 — Silent killers (real-money risk, ship now)

| # | Finding | File:line | Action |
|---|---|---|---|
| 0.1 | DrawdownManager cool-off is per-call, not per-day. Called every 30s → 5-day cool-off completes in 2.5 minutes, full re-leverage in ~5 min after 15% DD. With live $: flash-crash → FLATTEN → re-leverage straight into recovery chop. | qgtm_portfolio/signal_aggregator.py:860-872 | Gate decrement on now.date() > _last_step_date + regression test |
| 0.2 | Intraday rebalance bypasses compliance entirely. Wash-sale, PDT, restricted-list, sector caps: none enforced for intraday. Macro path has it; intraday doesn't. | qgtm_live/daemon.py:3500-3504 | Add same compliance gate as macro |
| 0.3 | PositionLimitChecker effectively dead in production. Wiring exists but daemon.py:3039 and :3828 call check_order without portfolio_state/proposed_size → cap never evaluated. Per-symbol caps (VIXY=10%) silently inactive. | qgtm_live/daemon.py:3039,3828 | Build PortfolioState before check |
| 0.4 | Factor exposures computed with EQUAL weights, not portfolio weights. API surfaces decorative numbers, not real exposures. False sense of monitoring. | qgtm_live/daemon.py:2820 | Map asset_syms[i] to portfolio.positions[i]/equity |
| 0.5 | 20% per-order cap conflicts with 25% per-position cap. Strategies sizing to 25% get killed at 20%. | qgtm_risk/manager.py:174 | Replace with limits.effective_cap_for(symbol) |
| 0.6 | gold_etf_flow signal is structurally broken. Uses Yahoo daily VOLUME as shares-outstanding. Strategy z-scores trading noise, not creations/redemptions. | qgtm_data/etf_flows.py:141-164 | Switch to issuer-published shares-outstanding (SPDR API for GLD) |
| 0.7 | _fetch_cot_history uses to_thread but NO wait_for. Same DMS-trip shape as 2026-04-16 / 05-13. | qgtm_live/daemon.py:1910 | Wrap with asyncio.wait_for(..., 45) |
| 0.8 | No SIGTERM handler on daemon. systemctl restart mid-rebalance orphans Alpaca orders, misses S3 audit flush. | qgtm_live/daemon.py:3937-3942 + infra/systemd/qgtm-daemon.service:13 | Register signal handler that awaits cycle completion |
| 0.9 | HighErrorRate Prometheus alert firing every cycle since 2026-04-26 (3+ weeks straight). Rule is wrong: it fires when nobody makes requests. Mo's trust eroder, exactly the pattern PR #49 was meant to prevent. | /etc/prometheus/alerts.yml on droplet | Delete or rewrite as actual 5xx-rate |
| 0.10 | CI has been RED for 6+ days. Every PR merged on red. deploy-api-self-hosted.yml has NO CI gate → ships to prod regardless. deploy.yml IS gated → been skipped every run. | .github/workflows/deploy-api-self-hosted.yml (no workflow_run) | Add CI gate + fix root-cause of failure |
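
For 0.1, the per-day gate could look roughly like the sketch below. CoolOffGate, trigger and step are hypothetical names, not the actual DrawdownManager API; only _last_step_date and the now.date() > _last_step_date condition come from the action column above. The sketch counts calendar days, which is an assumption; a trading-calendar check may fit better.

```python
from datetime import date, datetime
from typing import Optional


class CoolOffGate:
    """Hypothetical stand-in for the cool-off counter inside DrawdownManager."""

    def __init__(self, cool_off_days: int = 5) -> None:
        self.cool_off_days = cool_off_days
        self.days_remaining = 0
        self._last_step_date: Optional[date] = None

    def trigger(self, now: datetime) -> None:
        """Start (or restart) the cool-off on a drawdown breach."""
        self.days_remaining = self.cool_off_days
        self._last_step_date = now.date()

    def step(self, now: datetime) -> int:
        """Decrement at most once per calendar day, however often it is called."""
        today = now.date()
        if (
            self.days_remaining > 0
            and self._last_step_date is not None
            and today > self._last_step_date
        ):
            self.days_remaining -= 1
            self._last_step_date = today
        return self.days_remaining
```

Calling step every 30s within one day leaves the counter untouched; that is the behavior the regression test should pin down.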

Tier 1 — Theatre (advertised but not running)

| # | Finding | Files | Action |
|---|---|---|---|
| 1.1 | Self-learning loop is Potemkin. SelfLearningOrchestrator, AutoRetrainLoop, AutoPostmortemEngine, ModelRegistry, DecayMonitor are all never instantiated outside tests. README+gap-ledger+Notion all claim it's done. Loop never runs. Models never retrain. | qgtm_live/self_learning.py, auto_retrain.py, auto_postmortem.py, qgtm_core/model_registry.py | Either wire to scheduler.py post-close OR delete the claim |
| 1.2 | TCA engine wired to NOTHING. 480 LOC analyzer + 433 LOC advanced algos (Iceberg/Sniper/POV/IS): zero call sites outside tests. Slippage we're paying is unknown. | qgtm_execution/tca.py, algos.py | Hook into oms.py after fills, persist via audit log |
| 1.3 | AlpacaBroker never retrieves filled_avg_price/qty/at. Slippage cannot be measured. Macro Order.filled_avg_price always None. | qgtm_execution/broker.py:78-144 | Copy fill fields from Alpaca response; ideally subscribe to trade_updates ws |
| 1.4 | ml_ensemble + meta_labeller_pm not in production strategy registry. Notion claims "shadow strategies"; no shadow execution path exists. ML contributes 0 to live PnL. | qgtm_live/daemon.py:165-232 PM_STRATEGY_CATEGORIES | Wire or delete |
| 1.5 | shap declared in pyproject (0.46) but NEVER imported. Maintenance lie. | pyproject.toml:32 | Wire shap.TreeExplainer into ml_ensemble OR remove dep |
| 1.6 | Sentiment computed and largely ignored. Only event_drift_pm.py reads sentiment_score (FOMC-specific). gold_geopolitical_risk claims sentiment-driven, doesn't read the column. Single RSS feed (reuters_commodities via Google News) is fragile single-source. | qgtm_altdata/news_sentiment.py:90-94 | Multi-source + LLM scoring (Llama via existing Ollama) |
| 1.7 | 11 dashboard metrics emit nothing. Per-strategy P&L, rolling Sharpe, win-rate, turnover, VaR, factor exposures, factor PnL attribution: all panels show "No data". | qgtm_api/metrics.py:1-232 vs /tmp/qgtm-dashboard.json | Define + emit |
| 1.8 | node_exporter not running. Zero host visibility: CPU/mem/disk/fd/network all invisible. | Droplet host.docker.internal:9100 | Install systemd unit |
| 1.9 | OTLP tracing wired but no collector. Spans created, vanish into void. | qgtm_core/telemetry.py + daemon.py:420 | Deploy Tempo container OR remove setup |
| 1.10 | No log aggregation. Correlation IDs threaded throughout but logs only queryable via journalctl + grep. | All daemon/api code | Deploy Loki (50MB RAM next to Grafana) |
| 1.11 | OptionsRiskManager completely orphaned: zero non-test imports. Options "strategies" (gold_options_skew, gamma_scalp_pm) consume options data but trade ETF shares. Margin/greeks aggregation: none. | qgtm_risk/options_risk.py | Delete OR wire if real options coming |
| 1.12 | EVT / factor model / stress are dashboards, not gates. None feed sizing. 5-sigma tail spike → RiskManager does nothing. | qgtm_risk/evt.py, pm_factor_model.py, stress.py | Wire size-haircut on EVT breach OR document as dashboard-only |
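
Once fill fields are copied from the broker response (1.3), per-fill slippage for TCA (1.2) reduces to a signed difference against a reference price. A minimal sketch, assuming an arrival-price benchmark; the Fill type and slippage_bps are hypothetical and not taken from qgtm_execution. Decimal is used here to stay clear of the float-money lint.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Fill:
    """Hypothetical fill record; field names mirror the audit text where possible."""

    side: str                           # "buy" or "sell"
    arrival_price: Decimal              # assumed benchmark: price at order submission
    filled_avg_price: Optional[Decimal]  # None until the broker reports a fill
    filled_qty: Decimal


def slippage_bps(fill: Fill) -> Optional[Decimal]:
    """Signed slippage in basis points; positive means the fill cost us money."""
    if fill.filled_avg_price is None or fill.arrival_price <= 0:
        return None  # cannot measure without a fill price or a valid benchmark
    sign = Decimal(1) if fill.side == "buy" else Decimal(-1)
    diff = fill.filled_avg_price - fill.arrival_price
    return sign * diff / fill.arrival_price * Decimal(10000)
```

A buy filled above arrival and a sell filled below arrival both come out positive, so the values can be summed across sides without further sign handling.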

Tier 2 — Async/IO violations (next DMS trip in waiting)

  • daemon.py:1910 _fetch_cot_history: to_thread without wait_for (Tier 0.7)
  • daemon.py:1666-1688 futures _fetch_ticker: asyncio.gather with no return_exceptions/outer timeout; one hung Yahoo call stalls everything
  • daemon.py:2073-2100 FRED loop: serial; 10 series × 30s = 5min worst case (right at the 300s DMS limit, so barely safe at best)
  • daemon.py:1986-1995 _fetch_etf_flows_history — sequential 5×, no per-call timeout, 150s worst case
  • Strategies fetching sync HTTP on event loop: gold_phys_premium.py:96, gold_lbma_fix_anomaly.py:93, gold_sge_premium.py:73, gold_mint_sales.py:93. Daily-only so cold-start risk only.
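
The common fix for the items above is a bounded wait around every blocking fetch, plus a gather that collects failures instead of stalling. A minimal sketch, assuming Python 3.9+; run_blocking and fetch_all are hypothetical helper names, and the 45s default is taken from the Tier 0.7 action.

```python
import asyncio
from typing import Any, Callable, List, Sequence


async def run_blocking(fn: Callable[[], Any], timeout: float = 45.0) -> Any:
    """Run a blocking callable in a worker thread, bounded by a timeout.

    On timeout the awaiting coroutine is released, but the worker thread
    itself keeps running until fn returns; threads cannot be killed.
    """
    return await asyncio.wait_for(asyncio.to_thread(fn), timeout=timeout)


async def fetch_all(fns: Sequence[Callable[[], Any]], timeout: float) -> List[Any]:
    """Run all fetches concurrently, each with its own timeout.

    return_exceptions=True means one hung or failing call shows up as an
    exception object in the result list rather than aborting the others.
    """
    calls = [run_blocking(fn, timeout=timeout) for fn in fns]
    return await asyncio.gather(*calls, return_exceptions=True)
```

Callers then inspect each result with isinstance(result, Exception) and fall back to cached data for the failed series. Running the FRED series through fetch_all would also turn the serial 10 × 30s worst case into a single 30s bound, assuming the upstream tolerates concurrent requests.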

Tier 3 — Deploy / infrastructure fragility

| # | Finding | Action |
|---|---|---|
| 3.1 | deploy-api-self-hosted.yml no CI gate (Tier 0.10) | Add gate |
| 3.2 | Daemon restarted BEFORE health check; rollback only restarts API, not daemon | Restart daemon AFTER deep-health pass; include daemon in rollback |
| 3.3 | Health check is /health only (shallow) | Curl /api/v1/daemon/telemetry; assert heartbeat fresh + DMS not tripped |
| 3.4 | git clean -fd wipes audit_data/, data/cache/ on every deploy | Add -e audit_data -e data/cache -e secrets |
| 3.5 | Concurrency groups differ (deploy-${{ref}} vs deploy-api); race possible | Unify on deploy-api |
| 3.6 | systemctl restart qgtm-daemon 2>/dev/null \|\| true swallows failures | Drop \|\| true |
| 3.7 | chmod 600 .env AFTER write; interrupt → world-readable | umask 077 first or install -m 600 |
| 3.8 | pip install ... \|\| pip install ... masks first-attempt failure | Capture rc, fail explicitly |
| 3.9 | Redis no persistence (--save "", no AOF); every restart wipes IC tracker, daily returns, killed-symbols, operator queue | --appendonly yes --appendfsync everysec + bind-mount volume |
| 3.10 | Heartbeat-file healthcheck stat /app/.heartbeat/alive; file NEVER written | Either remove healthcheck OR have Heartbeat.beat() touch() the file |
| 3.11 | memory: 768M cgroup limit < observed 910MB peak (May incident) | Bump to 2G (have 8G total) |
| 3.12 | Watchdog runs in SAME event loop it's protecting (docstring claims two-process) | Split to separate process OR thread with own loop |
| 3.13 | Daemon-side Telegram alerts have NO dedup (only watchdog has it) | Apply PR #49 dedup pattern to qgtm_bot/telegram.py:send_alert |
| 3.14 | 14 dead/stale workflows clutter Actions UI (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, etc.) | Delete or move to infra/scripts/ |
| 3.15 | DR plan assumes K8s (infra/scripts/failover_drill.sh) but prod is systemd | Rewrite for systemd OR document RTO/RPO |
| 3.16 | deploy.yml deploy-api job duplicates deploy-api-self-hosted.yml; both wired to push:main | Pick one canonical, delete the other |
| 3.17 | deploy-web-direct.yml + -self-hosted.yml + deploy.yml::deploy-web: three-way split | Canonical = self-hosted; delete direct |
| 3.18 | Self-hosted runners co-resident with daemon; CI contention during long backtest runs | Document; consider separate runner host long-term |
| 3.19 | No runner heartbeat; if a runner dies, deploys silently queue | Scheduled probe via gh api runners |
| 3.20 | Notifications: deploy success/fail only in GH UI | Telegram step at end of both deploy workflows |
| 3.21 | SECRETS.md missing: NOTION_TOKEN, NOTION_PARENT_PAGE_ID, PR_LOGGER_GITHUB_WEBHOOK_SECRET, DO_API_TOKEN, DISCORD_WEBHOOK_URL, GRAFANA_ADMIN_PASS | Update |
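
For 3.10, the "touch the file" option is small. A minimal sketch, assuming Heartbeat can be handed the file path; the constructor signature, is_alive and the 120s freshness window are hypothetical, and only Heartbeat.beat() and the /app/.heartbeat/alive path appear in the audit.

```python
import time
from pathlib import Path


class Heartbeat:
    """Hypothetical shape of the daemon heartbeat; only beat() is named in the audit."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def beat(self) -> None:
        # Create the directory on first use, then refresh the file's mtime.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.touch()


def is_alive(path: Path, max_age_s: float = 120.0) -> bool:
    """Healthcheck side: the file must exist and have a recent mtime."""
    try:
        age = time.time() - path.stat().st_mtime
    except FileNotFoundError:
        return False
    return age <= max_age_s
```

Checking mtime freshness rather than mere existence matters: a plain stat keeps passing after the daemon hangs, because the file from the last good beat is still there.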

Tier 4 — Strategy hygiene

  • 10+ strategy files have bare return [] instead of self._empty(reason), so the daemon has zero visibility into why they returned nothing: gold_silver_ratio_intraday.py (7), vwap_reversion_intraday.py, xsmom.py, vol_risk_premium_pm.py, cot_precious.py, seasonality_pm.py (quarantined but still rotting)
  • 9 strategy files have strategy_id but aren't registered in daemon.PM_STRATEGY_CATEGORIES: comex_warehouse, cot_positioning, inventory_surprise, regime_classifier_pm (the macro-regime classifier!), ml_ensemble, meta_labeller_pm, si_term_structure, fix_dislocation, precious_metals (id gold_macro_regime). Either wire OR delete.
  • mvg_v1 half-quarantined — quarantined by its own backtest, only protected by QGTM_ENABLE_MVG=False. Misconfigured deploy → on. Add to QUARANTINED_STRATEGIES frozenset.
  • cot_precious and hedging_pressure are mathematical inverses of the same COT report. Likely -0.8 to -0.9 correlated, probably net-cancel. Consolidate.
  • Trend over-weighting: tsmom, gold_tsmom, carver_trend, gold_multi_ma_trend — 4x exposure when gold trends. Add ensemble correlation cap.
  • vix_haven vs gold_vix_haven — same Baur-Lucey edge, likely 0.7+ corr. Pick one or combine.
  • seasonality_pm vs gold_seasonal_demand — overlap. Delete the quarantined one.
  • AISC quarterly JSON last updated 2026-02-25, ~80 days stale. Quarantines at 120d. ~40 days to trip.
  • gold_seasonal_demand hardcodes CNY dates — must update annually. Verify 2027 in table.
  • backtest_results/ has only mvg_v1/. Zero artifacts for 47 active strategies. No checked-in provenance.
  • 5 quarantined strategies (gold_silver_ratio, gold_platinum, miners_vs_metal, seasonality_pm, overnight_gold): recommend delete gold_platinum (PBO 1.0) + seasonality_pm (overlaps); revive-with-fix gold_silver_ratio (try z_entry=2.5), overnight_gold (long-only); keep-frozen miners_vs_metal for 30d.
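
The bare return [] fix from the first bullet above could follow a pattern like this. A minimal sketch: the audit only names self._empty(reason), so the base class, the logger name, last_empty_reason and the example strategy are all hypothetical.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("qgtm.strategy")  # hypothetical logger name


class StrategyBase:
    """Hypothetical base class; the real one may record reasons differently."""

    strategy_id = "base"

    def __init__(self) -> None:
        self.last_empty_reason: Optional[str] = None

    def _empty(self, reason: str) -> list:
        """Return no signals, but leave a trace of why."""
        self.last_empty_reason = reason
        logger.info("strategy=%s empty_signal reason=%s", self.strategy_id, reason)
        return []


class ExampleRatioStrategy(StrategyBase):
    """Illustrative strategy, not one from the repo."""

    strategy_id = "example_ratio"
    min_history = 20

    def generate(self, prices: List[float]) -> list:
        if len(prices) < self.min_history:
            return self._empty("insufficient_history")  # was: return []
        if prices[-1] <= 0:
            return self._empty("non_positive_price")
        return [("example_signal", prices[-1])]
```

With reasons recorded, the daemon can expose a per-strategy "last empty reason", which separates "no edge today" from "data feed broken".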

Tier 5 — Missing edges (high-leverage, need backtest)

| Edge | Sharpe est. | Effort | Data |
|---|---|---|---|
| Gold-Oil ratio mean-reversion | 0.4-0.7 | 1-2d | BNO/USO already plumbed |
| GDX/GDXJ junior-senior rotation | 0.5 | 2-3d | Existing universe |
| PBOC monthly reserve announcements | 0.3 (uncorrelated) | 3-5d | SAFE press release scrape |
| Gold/Copper macro regime tilt | 0.3-0.5 | 2d | CPER in universe, unused |
| LBMA AM/PM fix-to-fix overnight diff | 0.4 | 1d | LBMA data already live |
| Refiner-vault delta | 0.3 | ~1wk | Not feasible cheaply |
| UK/Asia ETF coverage (SGLN.L, GBS.L, 2840.HK) | medium | 0.5d | Yahoo works |
| CME gold options OI/skew (real, not just Alpaca equity options) | high | 1d | CME Datamine $0-$500/mo |
| India gold import duty + festival demand | medium | 1d | CBIC notifications |

Tier 6 — ML/AI stack upgrades

| Item | Why | Effort |
|---|---|---|
| Add CatBoost as 3rd ensemble model | SOTA on tabular finance May'26 | 1d |
| Add Optuna for hyperparam tuning | Hand-tuned today, 10-30% IC lift | 1d |
| Wire shap into ml_ensemble | Already a dep, never used; explainability | 0.5d |
| Replace hand-rolled KS test (2 copies) with scipy.stats.ks_2samp | Correctness + speed | 0.25d |
| Replace hand-rolled EMA/SMA/RSI with pandas-ta | ~50× speedup | 0.25d |
| Replace hand-rolled HMM with hmmlearn GaussianHMM | Real Baum-Welch vs k-means approx | 0.5d |
| Replace FinBERT with Llama-3.1-8B via existing Ollama | Ollama already running on droplet, free | 1d |
| Add evidently for drift detection | Replaces 800+ LOC hand-rolled | 1d |
| Add MLflow for model registry | Self-hosted on droplet :5000 | 2d |
| Add darts/neuralforecast for PatchTST forecasting | No statistical/neural forecasting today | 1wk |

Tier 7 — Code quality / hygiene

  • Delete 5,569 LOC of one-shot Notion scripts in qgtm_integrations/ (notion_*.py) — zero in-repo callers, ~80 mypy errors. Move to scripts/notion_legacy/ if audit trail needed.
  • Fix coverage scope: pyproject.toml:200 tool.coverage.run.source excludes qgtm_live, qgtm_api, qgtm_signals, qgtm_bot, qgtm_features, qgtm_altdata — highest-risk modules invisible.
  • Coverage gate mismatch: CI --fail-under=50 vs pyproject fail_under=70. Pick one.
  • daemon.py:_rebalance is 829 LOC, CC≈70. Highest-risk function in repo. Refactor to RebalanceCycle class with explicit phases — but high risk, needs shadow comparison.
  • mypy strict whitelisted only 12 files in qgtm_risk/ — daemon, strategies, API all unchecked.
  • daemon.py:3382 null-deref risk: DataFrame | None → DataFrame unsafe assignment.
  • 240 except Exception: catches — many legitimate fail-soft; 20% in _rebalance / order-routing could hide bugs.
  • 104 [QGTM-FLOAT-MONEY] lint violations — your own lint, ignored.
  • tools/lints/check_invariants.py exists, not wired to pre-commit.
  • ml_ensemble.py:235 — potential lookahead (.shift(-...) with no # intentional).
  • 13 F401 + 7 RUF100 — auto-fixable with ruff check --fix.
  • 76 outdated packages: mypy/pandas/redis/xgboost all majors behind.

Tier 8 — Backtest infrastructure (Followup B)

qgtm_backtest/ has god-tier per-strategy suite (CPCV+PBO+DSR+HLZ+bootstrap+capacity). Missing: portfolio engine. 1,400 LOC, ~10-14 hr dedicated session. Without it, PR #52's leverage bumps unvalidated against historical aggregate.

Pre-build cleanup:

  • backtest_results/ only has mvg_v1: nobody ran backtest_all.py. Confirm registry path works.
  • Strategy-count drift: registry says 29, memory says 47 active. Resolve.
  • Engine hard-codes gross<=2.0 (line 255), production now 3.0. Single-line fix.
  • Two walk_forward methods (engine.py:372 + :705): pick one, delete other.


Execution plan

Wave 2A — P0 safe wins (no behavior change, ship immediately)

  • Fix HighErrorRate alert (T0.9)
  • Install node_exporter (T1.8)
  • Add CI gate to deploy-api-self-hosted (T0.10)
  • Wrap _fetch_cot_history in wait_for (T0.7)
  • Add SIGTERM handler (T0.8)
  • Bump cgroup memory 768M→2G (T3.11)
  • Fix Dockerfile healthcheck (T3.10)
  • Add -e audit_data to git clean (T3.4)
  • Drop || true from systemctl restart (T3.6)
  • Pre-create .env with mode 600 (T3.7)
  • Fix pip install error masking (T3.8)
  • Unify concurrency groups (T3.5)
  • Add deep health check to deploy (T3.3)
  • Restart daemon AFTER health check + include in rollback (T3.2)
  • Delete dead workflows (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, deploy-web-direct) (T3.14)
  • Update SECRETS.md (T3.21)
  • Add Telegram alert dedup (T3.13)
  • Fix CI vs pyproject coverage mismatch (T7)
  • Add qgtm_live/api/signals to coverage scope (T7)
  • ruff check --fix cleanup (T7)
  • Delete dead Notion script files (T7)
  • Add qgtm-backup.service health alert
  • Bare return [] → self._empty(reason) mass fix (T4)
  • Add mvg_v1 to QUARANTINED_STRATEGIES (T4)
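
The SIGTERM item (T0.8) in the list above is the one with the most room for error, so here is a rough shape for it. A minimal sketch, assuming the daemon runs a single asyncio loop on Unix; the Daemon class and its method names are hypothetical stand-ins, and the sleep stands in for a real rebalance cycle.

```python
import asyncio
import signal


class Daemon:
    """Hypothetical skeleton: finish the in-flight cycle, flush, then exit."""

    def __init__(self) -> None:
        self._stopping = False
        self.cycles_completed = 0
        self.flushed = False

    def request_stop(self) -> None:
        # Runs inside the event loop, so setting a plain flag is safe.
        self._stopping = True

    async def _rebalance_cycle(self) -> None:
        await asyncio.sleep(0.05)  # stand-in for order routing
        self.cycles_completed += 1

    async def _flush_audit(self) -> None:
        self.flushed = True  # stand-in for the S3 audit flush

    async def run(self) -> None:
        # The flag is only checked between cycles, so a cycle is never cut short.
        while not self._stopping:
            await self._rebalance_cycle()
        await self._flush_audit()


async def main() -> None:
    daemon = Daemon()
    loop = asyncio.get_running_loop()
    # add_signal_handler is Unix-only and must be called from the main thread.
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, daemon.request_stop)
    await daemon.run()
```

systemd sends SIGKILL once TimeoutStopSec expires after the SIGTERM, so that value in qgtm-daemon.service needs to exceed the longest expected cycle plus the flush, or the handler will not get to finish.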

Wave 2B — P0 silent killer bug

  • DrawdownManager cool-off per-day fix + regression test (T0.1) — HIGHEST PRIORITY

Wave 2C — P0 strategy hygiene

  • Delete 9 dead strategy files OR wire if ready (T4)
  • Strategy correlation audit + ensemble fix (T4 — cot_precious/hedging_pressure consolidation, trend cap)

Wave 3 — P1 stage as PRs for Mo's review

  • Wire TCA + retrieve fill prices (T1.2 + T1.3)
  • Wire intraday compliance gate (T0.2)
  • Fix PositionLimitChecker (T0.3)
  • Fix factor exposures to use portfolio weights (T0.4)
  • Fix 20% vs 25% cap mismatch (T0.5)
  • Fix gold_etf_flow shares-outstanding bug (T0.6)
  • Wire self-learning loop OR delete (T1.1)
  • Add CatBoost + Optuna + shap (T6)
  • Replace FinBERT with Ollama Llama (T6)
  • Add Redis persistence (T3.9) — needs care, ops change
  • Multi-source sentiment with circuit breakers (T1.6)
  • Revive dormant feeds: comex_inventory via Nasdaq Data Link, central_banks.py wiring (T data)

Needs Mo's call (escalate)

  • Build portfolio backtest engine (dedicated session, ~14hr)
  • Split daemon._rebalance refactor (high risk)
  • DTBP-aware sizing (Followup A, 1-2hr but live-risk)
  • Single droplet SPoF — add standby?
  • Deploy Loki + Tempo
  • IBKR backup broker (real, not scaffolding)
  • CI red root-cause investigation