QGTM Trading — God-Tier Audit (2026-05-17)

Ten parallel agents audited every dimension. Below is the master synthesis.

Tier 0 — Silent killers (real-money risk, ship now)

#	Finding	File:line	Action
0.1	DrawdownManager cool-off is per-call, not per-day. Called every 30s → 5-day cool-off completes in 2.5 minutes, full re-leverage in ~5 min after 15% DD. With live $: flash-crash → FLATTEN → re-leverage straight into recovery chop.	`qgtm_portfolio/signal_aggregator.py:860-872`	Gate decrement on `now.date() > _last_step_date` + regression test
0.2	Intraday rebalance bypasses compliance entirely. Wash-sale, PDT, restricted-list, sector caps — none enforced for intraday. Macro path has it; intraday doesn't.	`qgtm_live/daemon.py:3500-3504`	Add same compliance gate as macro
0.3	PositionLimitChecker effectively dead in production. Wiring exists but `daemon.py:3039` and `:3828` call `check_order` without `portfolio_state`/`proposed_size` → cap never evaluated. Per-symbol caps (VIXY=10%) silently inactive.	`qgtm_live/daemon.py:3039,3828`	Build PortfolioState before check
0.4	Factor exposures computed with EQUAL weights, not portfolio weights. API surfaces decorative numbers, not real exposures. False sense of monitoring.	`qgtm_live/daemon.py:2820`	Map asset_syms[i] to portfolio.positions[i]/equity
0.5	20% per-order cap conflicts with 25% per-position cap. Strategies sizing to 25% get killed at 20%.	`qgtm_risk/manager.py:174`	Replace with `limits.effective_cap_for(symbol)`
0.6	`gold_etf_flow` signal is structurally broken. Uses Yahoo daily VOLUME as shares-outstanding. Strategy z-scores trading noise, not creations/redemptions.	`qgtm_data/etf_flows.py:141-164`	Switch to issuer-published shares-outstanding (SPDR API for GLD)
0.7	`_fetch_cot_history` uses `to_thread` but NO `wait_for`. Same DMS-trip shape as 2026-04-16 / 05-13.	`qgtm_live/daemon.py:1910`	Wrap with `asyncio.wait_for(..., 45)`
0.8	No SIGTERM handler on daemon. `systemctl restart` mid-rebalance orphans Alpaca orders, misses S3 audit flush.	`qgtm_live/daemon.py:3937-3942` + `infra/systemd/qgtm-daemon.service:13`	Register signal handler that awaits cycle completion
0.9	`HighErrorRate` Prometheus alert firing every cycle since 2026-04-26 (3+ weeks straight). Rule is wrong: it fires when nobody makes requests. This erodes Mo's trust in alerts, exactly the pattern PR #49 was meant to prevent.	`/etc/prometheus/alerts.yml` on droplet	Delete or rewrite as actual 5xx-rate
0.10	CI has been RED for 6+ days. Every PR merged on red. `deploy-api-self-hosted.yml` has NO CI gate → ships to prod regardless. `deploy.yml` IS gated → been skipped every run.	`.github/workflows/deploy-api-self-hosted.yml` (no `workflow_run`)	Add CI gate + fix root-cause of failure
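
For finding 0.4, a minimal sketch of position-weighted factor exposure, where exposure to factor k is the sum over held symbols of w_i × beta_i,k with w_i = position value / equity. The function names, the dict-based inputs, and the example betas are all hypothetical; the real code in `daemon.py` indexes by `asset_syms[i]`, so treat this as an illustration of the arithmetic only.

```python
from typing import Mapping


def portfolio_factor_exposures(
    betas: Mapping[str, Mapping[str, float]],
    position_values: Mapping[str, float],
    equity: float,
) -> dict[str, float]:
    """Sum of w_i * beta_i,k per factor k, with w_i = position value / equity."""
    if equity <= 0:
        raise ValueError("equity must be positive")
    exposures: dict[str, float] = {}
    for symbol, factor_betas in betas.items():
        # Symbols with betas but no position get weight 0 (not held).
        weight = position_values.get(symbol, 0.0) / equity
        for factor, beta in factor_betas.items():
            exposures[factor] = exposures.get(factor, 0.0) + weight * beta
    return exposures


def equal_weight_exposures(betas: Mapping[str, Mapping[str, float]]) -> dict[str, float]:
    """The behavior the finding describes: every symbol weighted 1/N."""
    n = len(betas)
    exposures: dict[str, float] = {}
    for factor_betas in betas.values():
        for factor, beta in factor_betas.items():
            exposures[factor] = exposures.get(factor, 0.0) + beta / n
    return exposures


# Illustrative numbers only (not taken from the audit).
example_betas = {"GLD": {"usd": -0.5}, "VIXY": {"usd": 0.1}}
example_values = {"GLD": 9_000.0, "VIXY": 1_000.0}
weighted = portfolio_factor_exposures(example_betas, example_values, equity=10_000.0)
equal = equal_weight_exposures(example_betas)
```

With a 90/10 book the weighted `usd` exposure comes out near -0.44 while the equal-weight figure is near -0.20, which is the gap between a real exposure and a decorative one.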

Tier 1 — Theatre (advertised but not running)

#	Finding	Files	Action
1.1	Self-learning loop is Potemkin. `SelfLearningOrchestrator`, `AutoRetrainLoop`, `AutoPostmortemEngine`, `ModelRegistry`, `DecayMonitor` are all never instantiated outside tests. README+gap-ledger+Notion all claim it's done. Loop never runs. Models never retrain.	`qgtm_live/self_learning.py`, `auto_retrain.py`, `auto_postmortem.py`, `qgtm_core/model_registry.py`	Either wire to scheduler.py post-close OR delete the claim
1.2	TCA engine wired to NOTHING. 480 LOC analyzer + 433 LOC advanced algos (Iceberg/Sniper/POV/IS) — zero call sites outside tests. Slippage we're paying is unknown.	`qgtm_execution/tca.py`, `algos.py`	Hook into oms.py after fills, persist via audit log
1.3	AlpacaBroker never retrieves filled_avg_price/qty/at. Slippage cannot be measured. Macro Order.filled_avg_price always None.	`qgtm_execution/broker.py:78-144`	Copy fill fields from Alpaca response; ideally subscribe to trade_updates ws
1.4	ml_ensemble + meta_labeller_pm not in production strategy registry. Notion claims "shadow strategies"; no shadow execution path exists. ML contributes 0 to live PnL.	`qgtm_live/daemon.py:165-232` PM_STRATEGY_CATEGORIES	Wire or delete
1.5	shap declared in pyproject (0.46) but NEVER imported. Maintenance lie.	`pyproject.toml:32`	Wire `shap.TreeExplainer` into ml_ensemble OR remove dep
1.6	Sentiment computed and largely ignored. Only `event_drift_pm.py` reads `sentiment_score` — FOMC-specific. `gold_geopolitical_risk` claims sentiment-driven, doesn't read the column. Single RSS feed (`reuters_commodities` via Google News) is fragile single-source.	`qgtm_altdata/news_sentiment.py:90-94`	Multi-source + LLM scoring (Llama via existing Ollama)
1.7	11 dashboard metrics emit nothing. Per-strategy P&L, rolling Sharpe, win-rate, turnover, VaR, factor exposures, factor PnL attribution — all panels show "No data".	`qgtm_api/metrics.py:1-232` vs `/tmp/qgtm-dashboard.json`	Define + emit
1.8	node_exporter not running. Zero host visibility — CPU/mem/disk/fd/network all invisible.	Droplet `host.docker.internal:9100`	Install systemd unit
1.9	OTLP tracing wired but no collector. Spans created, vanish into void.	`qgtm_core/telemetry.py` + daemon.py:420	Deploy Tempo container OR remove setup
1.10	No log aggregation. Correlation IDs threaded throughout but logs only queryable via `journalctl + grep`.	All daemon/api code	Deploy Loki (50MB RAM next to Grafana)
1.11	OptionsRiskManager completely orphaned — zero non-test imports. Options "strategies" (`gold_options_skew`, `gamma_scalp_pm`) consume options data but trade ETF shares. Margin/greeks aggregation: none.	`qgtm_risk/options_risk.py`	Delete OR wire if real options coming
1.12	EVT / factor model / stress are dashboards, not gates. None feed sizing. 5-sigma tail spike → RiskManager does nothing.	`qgtm_risk/evt.py`, `pm_factor_model.py`, `stress.py`	Wire size-haircut on EVT breach OR document as dashboard-only
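
For 1.2 and 1.3, once fill fields are copied from the broker response, per-order slippage reduces to one line of arithmetic. The sketch below assumes an arrival-price benchmark (the price when the order was submitted); that benchmark, the function name, and the `side` strings are assumptions, and the existing analyzer in `tca.py` may define cost differently.

```python
def slippage_bps(side: str, arrival_price: float, filled_avg_price: float) -> float:
    """Signed execution cost in basis points versus the arrival price.

    Positive means the fill was worse than arrival (paid more on a buy,
    received less on a sell); negative means price improvement.
    """
    if arrival_price <= 0:
        raise ValueError("arrival_price must be positive")
    sign = {"buy": 1.0, "sell": -1.0}[side]  # KeyError on an unknown side
    return sign * (filled_avg_price - arrival_price) / arrival_price * 1e4
```

While `filled_avg_price` is always None, this number cannot be computed at all, which is why 1.3 has to land before 1.2 is useful.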

Tier 2 — Async/IO violations (next DMS trip in waiting)

daemon.py:1910 _fetch_cot_history — to_thread without wait_for (Tier 0.7)
daemon.py:1666-1688 futures _fetch_ticker — asyncio.gather with no return_exceptions/outer timeout, one hung Yahoo call stalls everything
daemon.py:2073-2100 FRED loop: serial; 10 series × 30s = 300s worst case, which equals the 300s DMS threshold and leaves no margin
daemon.py:1986-1995 _fetch_etf_flows_history — sequential 5×, no per-call timeout, 150s worst case
Strategies fetching sync HTTP on event loop: gold_phys_premium.py:96, gold_lbma_fix_anomaly.py:93, gold_sge_premium.py:73, gold_mint_sales.py:93. Daily-only so cold-start risk only.
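
The first four items share one fix shape: run each blocking fetch in a worker thread, bound it with a timeout, and gather with `return_exceptions=True` so one hung call cannot stall the rest. A minimal sketch follows; the helper names and the `fetchers` dict are hypothetical, and the timeouts are placeholders. Note that `asyncio.wait_for` only stops waiting: the worker thread itself keeps running until the blocking call returns.

```python
import asyncio
from typing import Any, Callable


async def call_blocking_with_timeout(
    func: Callable[..., Any], *args: Any, timeout_s: float
) -> Any:
    """Run a blocking callable in a worker thread, bounded by a timeout."""
    return await asyncio.wait_for(asyncio.to_thread(func, *args), timeout=timeout_s)


async def fetch_all(
    fetchers: dict[str, Callable[[], Any]], per_call_timeout_s: float
) -> tuple[dict[str, Any], dict[str, BaseException]]:
    """Fetch concurrently; return (successes, failures) keyed by name."""
    names = list(fetchers)
    results = await asyncio.gather(
        *(
            call_blocking_with_timeout(fetchers[name], timeout_s=per_call_timeout_s)
            for name in names
        ),
        return_exceptions=True,  # a timeout or error becomes a value, not a raise
    )
    ok: dict[str, Any] = {}
    failed: dict[str, BaseException] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed[name] = result
        else:
            ok[name] = result
    return ok, failed
```

Run concurrently, the worst case for N fetches is roughly one timeout rather than N of them, which also addresses the serial FRED and ETF-flow loops.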

Tier 3 — Deploy / infrastructure fragility

#	Finding	Action
3.1	`deploy-api-self-hosted.yml` no CI gate (Tier 0.10)	Add gate
3.2	Daemon restarted BEFORE health check; rollback only restarts API, not daemon	Restart daemon AFTER deep-health pass; include daemon in rollback
3.3	Health check is `/health` only (shallow)	Curl `/api/v1/daemon/telemetry`; assert heartbeat fresh + DMS not tripped
3.4	`git clean -fd` wipes `audit_data/`, `data/cache/` on every deploy	Add `-e audit_data -e data/cache -e secrets`
3.5	Concurrency groups differ (`deploy-${{ref}}` vs `deploy-api`) — race possible	Unify on `deploy-api`
3.6	`systemctl restart qgtm-daemon 2>/dev/null || true` swallows failures	Drop `|| true`
3.7	`chmod 600 .env` AFTER write — interrupt → world-readable	`umask 077` first or `install -m 600`
3.8	`pip install ... || pip install ...` masks first-attempt failure	Capture rc, fail explicitly
3.9	Redis no persistence (`--save ""`, no AOF) — every restart wipes IC tracker, daily returns, killed-symbols, operator queue	`--appendonly yes --appendfsync everysec` + bind-mount volume
3.10	Heartbeat-file healthcheck `stat /app/.heartbeat/alive` — file NEVER written	Either remove healthcheck OR have `Heartbeat.beat()` `touch()` the file
3.11	`memory: 768M` cgroup limit < observed 910MB peak (May incident)	Bump to 2G (have 8G total)
3.12	Watchdog runs in SAME event loop it's protecting (docstring claims two-process)	Split to separate process OR thread with own loop
3.13	Daemon-side Telegram alerts have NO dedup (only watchdog has it)	Apply PR #49 dedup pattern to `qgtm_bot/telegram.py:send_alert`
3.14	14 dead/stale workflows clutter Actions UI (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, etc.)	Delete or move to `infra/scripts/`
3.15	DR plan assumes K8s (`infra/scripts/failover_drill.sh`) but prod is systemd	Rewrite for systemd OR document RTO/RPO
3.16	`deploy.yml` `deploy-api` job duplicates `deploy-api-self-hosted.yml` — both wired to push:main	Pick one canonical, delete the other
3.17	`deploy-web-direct.yml` + `-self-hosted.yml` + `deploy.yml::deploy-web` — three-way split	Canonical = self-hosted; delete direct
3.18	Self-hosted runners co-resident with daemon — CI contention during long backtest runs	Document; consider separate runner host long-term
3.19	No runner heartbeat — if a runner dies, deploys silently queue	Scheduled probe via `gh api runners`
3.20	Notifications: deploy success/fail only in GH UI	Telegram step at end of both deploy workflows
3.21	SECRETS.md missing: NOTION_TOKEN, NOTION_PARENT_PAGE_ID, PR_LOGGER_GITHUB_WEBHOOK_SECRET, DO_API_TOKEN, DISCORD_WEBHOOK_URL, GRAFANA_ADMIN_PASS	Update
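
For 3.13, the PR #49 pattern itself is not reproduced in this audit, so the following is a generic cooldown-based dedup sketch rather than that implementation. The class name, the key scheme, and the 300s cooldown in the usage note are assumptions.

```python
import time
from typing import Callable


class AlertDeduper:
    """Suppress repeats of the same alert key inside a cooldown window."""

    def __init__(
        self, cooldown_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_sent: dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False  # same alert sent recently; drop this one
        # Only sent alerts update the timestamp, so the window is measured
        # from the last delivered message, not the last suppressed one.
        self._last_sent[key] = now
        return True
```

A caller such as `send_alert` would check `deduper.should_send(alert_key)` before posting, with something like `AlertDeduper(cooldown_s=300)` held at module level.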

Tier 4 — Strategy hygiene

10+ strategy files have bare return [] instead of self._empty(reason), so the daemon has zero visibility into why they emitted nothing: gold_silver_ratio_intraday.py (7), vwap_reversion_intraday.py, xsmom.py, vol_risk_premium_pm.py, cot_precious.py, seasonality_pm.py (quarantined but still rotting)
9 strategy files have strategy_id but aren't registered in daemon.PM_STRATEGY_CATEGORIES: comex_warehouse, cot_positioning, inventory_surprise, regime_classifier_pm (the macro-regime classifier!), ml_ensemble, meta_labeller_pm, si_term_structure, fix_dislocation, precious_metals (id gold_macro_regime). Either wire OR delete.
mvg_v1 half-quarantined — quarantined by its own backtest, only protected by QGTM_ENABLE_MVG=False. Misconfigured deploy → on. Add to QUARANTINED_STRATEGIES frozenset.
cot_precious ↔ hedging_pressure are mathematical inverses of same COT report. Likely -0.8 to -0.9 correlated, probably net-cancel. Consolidate.
Trend over-weighting: tsmom, gold_tsmom, carver_trend, gold_multi_ma_trend — 4x exposure when gold trends. Add ensemble correlation cap.
vix_haven vs gold_vix_haven — same Baur-Lucey edge, likely 0.7+ corr. Pick one or combine.
seasonality_pm vs gold_seasonal_demand — overlap. Delete the quarantined one.
AISC quarterly JSON last updated 2026-02-25, ~80 days stale. Quarantines at 120d. ~40 days to trip.
gold_seasonal_demand hardcodes CNY dates — must update annually. Verify 2027 in table.
backtest_results/ has only mvg_v1/. Zero artifacts for 47 active strategies. No checked-in provenance.
5 quarantined strategies (gold_silver_ratio, gold_platinum, miners_vs_metal, seasonality_pm, overnight_gold): recommend delete gold_platinum (PBO 1.0) + seasonality_pm (overlaps); revive-with-fix gold_silver_ratio (try z_entry=2.5), overnight_gold (long-only); keep-frozen miners_vs_metal for 30d.

Tier 5 — Missing edges (high-leverage, need backtest)

Edge	Sharpe est.	Effort	Data
Gold-Oil ratio mean-reversion	0.4-0.7	1-2d	BNO/USO already plumbed
GDX/GDXJ junior-senior rotation	0.5	2-3d	Existing universe
PBOC monthly reserve announcements	0.3 (uncorrelated)	3-5d	SAFE press release scrape
Gold/Copper macro regime tilt	0.3-0.5	2d	`CPER` in universe, unused
LBMA AM/PM fix-to-fix overnight diff	0.4	1d	LBMA data already live
Refiner-vault delta	0.3	~1wk	Not feasible cheaply
UK/Asia ETF coverage (SGLN.L, GBS.L, 2840.HK)	medium	0.5d	Yahoo works
CME gold options OI/skew (real, not just Alpaca equity options)	high	1d	CME Datamine $0-$500/mo
India gold import duty + festival demand	medium	1d	CBIC notifications

Tier 6 — ML/AI stack upgrades

Item	Why	Effort
Add CatBoost as 3rd ensemble model	SOTA on tabular finance May'26	1d
Add Optuna for hyperparam tuning	Hand-tuned today, 10-30% IC lift	1d
Wire shap into ml_ensemble	Already a dep, never used; explainability	0.5d
Replace hand-rolled KS test (2 copies) with scipy.stats.ks_2samp	Correctness + speed	0.25d
Replace hand-rolled EMA/SMA/RSI with pandas-ta	~50× speedup	0.25d
Replace hand-rolled HMM with hmmlearn GaussianHMM	Real Baum-Welch vs k-means approx	0.5d
Replace FinBERT with Llama-3.1-8B via existing Ollama	Ollama already running on droplet, free	1d
Add evidently for drift detection	Replaces 800+ LOC hand-rolled	1d
Add MLflow for model registry	Self-hosted on droplet :5000	2d
Add darts/neuralforecast for PatchTST forecasting	No statistical/neural forecasting today	1wk

Tier 7 — Code quality / hygiene

Delete 5,569 LOC of one-shot Notion scripts in qgtm_integrations/ (notion_*.py) — zero in-repo callers, ~80 mypy errors. Move to scripts/notion_legacy/ if audit trail needed.
Fix coverage scope: pyproject.toml:200 tool.coverage.run.source excludes qgtm_live, qgtm_api, qgtm_signals, qgtm_bot, qgtm_features, qgtm_altdata — highest-risk modules invisible.
Coverage gate mismatch: CI --fail-under=50 vs pyproject fail_under=70. Pick one.
daemon.py:_rebalance is 829 LOC, CC≈70. Highest-risk function in repo. Refactor to RebalanceCycle class with explicit phases — but high risk, needs shadow comparison.
mypy strict whitelisted only 12 files in qgtm_risk/ — daemon, strategies, API all unchecked.
daemon.py:3382 null-deref risk — DataFrame | None → DataFrame unsafe assignment.
240 except Exception: catches — many legitimate fail-soft; 20% in _rebalance / order-routing could hide bugs.
104 [QGTM-FLOAT-MONEY] lint violations — your own lint, ignored.
tools/lints/check_invariants.py exists, not wired to pre-commit.
ml_ensemble.py:235 — potential lookahead (.shift(-...) with no # intentional).
13 F401 + 7 RUF100 — auto-fixable with ruff check --fix.
76 outdated packages: mypy/pandas/redis/xgboost all majors behind.

Tier 8 — Backtest infrastructure (Followup B)

qgtm_backtest/ has a god-tier per-strategy suite (CPCV+PBO+DSR+HLZ+bootstrap+capacity). Missing: a portfolio-level engine, estimated at 1,400 LOC and a ~10-14 hr dedicated session. Without it, PR #52's leverage bumps remain unvalidated against the historical aggregate.

Pre-build cleanup:
- backtest_results/ only has mvg_v1: nobody ran backtest_all.py. Confirm registry path works.
- Strategy-count drift: registry says 29, memory says 47 active. Resolve.
- Engine hard-codes gross<=2.0 (line 255), production now 3.0. Single-line fix.
- Two walk_forward methods (engine.py:372 + :705): pick one, delete other.

Execution plan

Wave 2A — P0 safe wins (no behavior change, ship immediately)

Fix HighErrorRate alert (T0.9)
Install node_exporter (T1.8)
Add CI gate to deploy-api-self-hosted (T0.10)
Wrap _fetch_cot_history in wait_for (T0.7)
Add SIGTERM handler (T0.8)
Bump cgroup memory 768M→2G (T3.11)
Fix Dockerfile healthcheck (T3.10)
Add -e audit_data to git clean (T3.4)
Drop || true from systemctl restart (T3.6)
Pre-create .env with mode 600 (T3.7)
Fix pip install error masking (T3.8)
Unify concurrency groups (T3.5)
Add deep health check to deploy (T3.3)
Restart daemon AFTER health check + include in rollback (T3.2)
Delete dead workflows (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, deploy-web-direct) (T3.14)
Update SECRETS.md (T3.21)
Add Telegram alert dedup (T3.13)
Fix CI vs pyproject coverage mismatch (T7)
Add qgtm_live/api/signals to coverage scope (T7)
ruff check --fix cleanup (T7)
Delete dead Notion script files (T7)
Add qgtm-backup.service health alert
Bare return [] → self._empty(reason) mass fix (T4)
Add mvg_v1 to QUARANTINED_STRATEGIES (T4)
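
For the SIGTERM item (T0.8), one possible shape is a stop event set by the signal handler, checked only between cycles so the in-flight cycle is never cancelled. This is a sketch under assumptions: `run_daemon` and `run_cycle` are hypothetical names, the 30s interval mirrors the cadence mentioned in T0.1, and `loop.add_signal_handler` is Unix-only.

```python
import asyncio
import signal
from typing import Awaitable, Callable


async def run_daemon(
    run_cycle: Callable[[], Awaitable[None]], interval_s: float = 30.0
) -> None:
    """Loop until SIGTERM/SIGINT, always letting the current cycle finish."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)  # handler only sets a flag

    while not stop.is_set():
        await run_cycle()  # not wrapped in a cancel scope: runs to completion
        try:
            # Sleep between cycles, but wake immediately on a stop request.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
    # Shutdown hooks (e.g. the audit flush) would run here, after the last cycle.
```

For this to help under `systemctl restart`, the unit's `TimeoutStopSec` has to exceed the worst-case cycle length; otherwise systemd follows up with SIGKILL before the cycle completes.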

Wave 2B — P0 silent killer bug

DrawdownManager cool-off per-day fix + regression test (T0.1) — HIGHEST PRIORITY
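
A minimal sketch of the date-gated decrement and its regression test, following the action in T0.1 (`now.date() > _last_step_date`). The `CoolOff` class is a hypothetical stand-in for the relevant part of DrawdownManager. Whether "day" means calendar day or trading session, and in which timezone, is left open here and needs a decision in the real fix.

```python
from datetime import date, datetime, timedelta
from typing import Optional


class CoolOff:
    """Cool-off counter that advances at most once per calendar date."""

    def __init__(self, days_remaining: int) -> None:
        self.days_remaining = days_remaining
        self._last_step_date: Optional[date] = None

    def step(self, now: datetime) -> int:
        today = now.date()
        if self._last_step_date is None:
            # First call starts the clock without consuming a day.
            self._last_step_date = today
        elif today > self._last_step_date:
            if self.days_remaining > 0:
                self.days_remaining -= 1
            self._last_step_date = today
        return self.days_remaining


def simulate(days_remaining: int = 5) -> tuple[int, int]:
    """Regression scenario: 30s polling for a day, then one call per day."""
    cool = CoolOff(days_remaining)
    start = datetime(2026, 5, 18, 9, 30)
    for i in range(20):  # 20 calls x 30s = 10 minutes, all on the same date
        cool.step(start + timedelta(seconds=30 * i))
    after_same_day = cool.days_remaining
    for d in range(1, days_remaining + 1):
        cool.step(start + timedelta(days=d))
    return after_same_day, cool.days_remaining
```

The first value returned by `simulate()` is the regression check: under the per-call bug it would already be 0 after five calls, while the gated version keeps the full cool-off until the date changes.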

Wave 2C — P0 strategy hygiene

Delete 9 dead strategy files OR wire if ready (T4)
Strategy correlation audit + ensemble fix (T4 — cot_precious/hedging_pressure consolidation, trend cap)

Wave 3 — P1 stage as PRs for Mo's review

Wire TCA + retrieve fill prices (T1.2 + T1.3)
Wire intraday compliance gate (T0.2)
Fix PositionLimitChecker (T0.3)
Fix factor exposures to use portfolio weights (T0.4)
Fix 20% vs 25% cap mismatch (T0.5)
Fix gold_etf_flow shares-outstanding bug (T0.6)
Wire self-learning loop OR delete (T1.1)
Add CatBoost + Optuna + shap (T6)
Replace FinBERT with Ollama Llama (T6)
Add Redis persistence (T3.9) — needs care, ops change
Multi-source sentiment with circuit breakers (T1.6)
Revive dormant feeds: comex_inventory via Nasdaq Data Link, central_banks.py wiring (T data)

Needs Mo's call (escalate)

Build portfolio backtest engine (dedicated session, ~14hr)
Split daemon._rebalance refactor (high risk)
DTBP-aware sizing (Followup A, 1-2hr but live-risk)
Single droplet SPoF — add standby?
Deploy Loki + Tempo
IBKR backup broker (real, not scaffolding)
CI red root-cause investigation