QGTM Trading — God-Tier Audit (2026-05-17)
Ten parallel agents audited every dimension. Below is the master synthesis.
Tier 0 — Silent killers (real-money risk, ship now)
| # | Finding | File:line | Action |
|---|---|---|---|
| 0.1 | DrawdownManager cool-off is per-call, not per-day. Called every 30s → 5-day cool-off completes in 2.5 minutes, full re-leverage in ~5 min after 15% DD. With live $: flash-crash → FLATTEN → re-leverage straight into recovery chop. | qgtm_portfolio/signal_aggregator.py:860-872 | Gate decrement on now.date() > _last_step_date + regression test |
| 0.2 | Intraday rebalance bypasses compliance entirely. Wash-sale, PDT, restricted-list, sector caps: none enforced for intraday. Macro path has it; intraday doesn't. | qgtm_live/daemon.py:3500-3504 | Add same compliance gate as macro |
| 0.3 | PositionLimitChecker effectively dead in production. Wiring exists but daemon.py:3039 and :3828 call check_order without portfolio_state/proposed_size → cap never evaluated. Per-symbol caps (VIXY=10%) silently inactive. | qgtm_live/daemon.py:3039,3828 | Build PortfolioState before check |
| 0.4 | Factor exposures computed with EQUAL weights, not portfolio weights. API surfaces decorative numbers, not real exposures. False sense of monitoring. | qgtm_live/daemon.py:2820 | Map asset_syms[i] to portfolio.positions[i]/equity |
| 0.5 | 20% per-order cap conflicts with 25% per-position cap. Strategies sizing to 25% get killed at 20%. | qgtm_risk/manager.py:174 | Replace with limits.effective_cap_for(symbol) |
| 0.6 | gold_etf_flow signal is structurally broken. Uses Yahoo daily VOLUME as shares-outstanding. Strategy z-scores trading noise, not creations/redemptions. | qgtm_data/etf_flows.py:141-164 | Switch to issuer-published shares-outstanding (SPDR API for GLD) |
| 0.7 | _fetch_cot_history uses to_thread but NO wait_for. Same DMS-trip shape as 2026-04-16 / 05-13. | qgtm_live/daemon.py:1910 | Wrap with asyncio.wait_for(..., 45) |
| 0.8 | No SIGTERM handler on daemon. systemctl restart mid-rebalance orphans Alpaca orders, misses S3 audit flush. | qgtm_live/daemon.py:3937-3942 + infra/systemd/qgtm-daemon.service:13 | Register signal handler that awaits cycle completion |
| 0.9 | HighErrorRate Prometheus alert firing every cycle since 2026-04-26 (3+ weeks straight). Rule is wrong: fires when nobody makes requests. Mo's trust eroder, exactly the pattern PR #49 was meant to prevent. | /etc/prometheus/alerts.yml on droplet | Delete or rewrite as actual 5xx-rate |
| 0.10 | CI has been RED for 6+ days. Every PR merged on red. deploy-api-self-hosted.yml has NO CI gate → ships to prod regardless. deploy.yml IS gated → been skipped every run. | .github/workflows/deploy-api-self-hosted.yml (no workflow_run) | Add CI gate + fix root-cause of failure |
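
Finding 0.1 comes down to counting calls instead of calendar days. The sketch below shows one way the proposed gate could look. It is an illustration under stated assumptions, not the code in `signal_aggregator.py`: the `CoolOff` class and its constructor are hypothetical, only `now.date()` and `_last_step_date` come from the action column, and it counts calendar days rather than trading days.

```python
from datetime import date, datetime, timedelta


class CoolOff:
    """Hypothetical stand-in for the cool-off state inside DrawdownManager."""

    def __init__(self, days: int, triggered_on: date) -> None:
        self.days_remaining = days
        # Start at the trigger date so the first decrement happens the next day.
        self._last_step_date = triggered_on

    def step(self, now: datetime) -> int:
        """Decrement at most once per calendar day, however often it is called."""
        today = now.date()
        if self.days_remaining > 0 and today > self._last_step_date:
            self.days_remaining -= 1
            self._last_step_date = today
        return self.days_remaining


# Regression check: ten minutes of 30-second polling must not burn the cool-off.
start = datetime(2026, 5, 18, 9, 30)
cool_off = CoolOff(days=5, triggered_on=start.date())
for i in range(20):
    cool_off.step(start + timedelta(seconds=30 * i))
remaining_after_polling = cool_off.days_remaining

# Each later calendar day consumes exactly one unit, even with repeated calls.
for day in range(1, 6):
    for i in range(3):
        cool_off.step(start + timedelta(days=day, seconds=30 * i))
remaining_after_five_days = cool_off.days_remaining
```

The datetimes here are naive. A real fix would need to settle on a single timezone for `now.date()` so the day boundary does not shift between hosts.
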
Tier 1 — Theatre (advertised but not running)
| # | Finding | Files | Action |
|---|---|---|---|
| 1.1 | Self-learning loop is Potemkin. SelfLearningOrchestrator, AutoRetrainLoop, AutoPostmortemEngine, ModelRegistry, DecayMonitor are all never instantiated outside tests. README+gap-ledger+Notion all claim it's done. Loop never runs. Models never retrain. | qgtm_live/self_learning.py, auto_retrain.py, auto_postmortem.py, qgtm_core/model_registry.py | Either wire to scheduler.py post-close OR delete the claim |
| 1.2 | TCA engine wired to NOTHING. 480 LOC analyzer + 433 LOC advanced algos (Iceberg/Sniper/POV/IS); zero call sites outside tests. Slippage we're paying is unknown. | qgtm_execution/tca.py, algos.py | Hook into oms.py after fills, persist via audit log |
| 1.3 | AlpacaBroker never retrieves filled_avg_price/qty/at. Slippage cannot be measured. Macro Order.filled_avg_price always None. | qgtm_execution/broker.py:78-144 | Copy fill fields from Alpaca response; ideally subscribe to trade_updates ws |
| 1.4 | ml_ensemble + meta_labeller_pm not in production strategy registry. Notion claims "shadow strategies"; no shadow execution path exists. ML contributes 0 to live PnL. | qgtm_live/daemon.py:165-232 PM_STRATEGY_CATEGORIES | Wire or delete |
| 1.5 | shap declared in pyproject (0.46) but NEVER imported. Maintenance lie. | pyproject.toml:32 | Wire shap.TreeExplainer into ml_ensemble OR remove dep |
| 1.6 | Sentiment computed and largely ignored. Only event_drift_pm.py reads sentiment_score (FOMC-specific). gold_geopolitical_risk claims sentiment-driven, doesn't read the column. Single RSS feed (reuters_commodities via Google News) is fragile single-source. | qgtm_altdata/news_sentiment.py:90-94 | Multi-source + LLM scoring (Llama via existing Ollama) |
| 1.7 | 11 dashboard metrics emit nothing. Per-strategy P&L, rolling Sharpe, win-rate, turnover, VaR, factor exposures, factor PnL attribution: all panels show "No data". | qgtm_api/metrics.py:1-232 vs /tmp/qgtm-dashboard.json | Define + emit |
| 1.8 | node_exporter not running. Zero host visibility: CPU/mem/disk/fd/network all invisible. | Droplet host.docker.internal:9100 | Install systemd unit |
| 1.9 | OTLP tracing wired but no collector. Spans created, vanish into void. | qgtm_core/telemetry.py + daemon.py:420 | Deploy Tempo container OR remove setup |
| 1.10 | No log aggregation. Correlation IDs threaded throughout but logs only queryable via journalctl + grep. | All daemon/api code | Deploy Loki (50MB RAM next to Grafana) |
| 1.11 | OptionsRiskManager completely orphaned: zero non-test imports. Options "strategies" (gold_options_skew, gamma_scalp_pm) consume options data but trade ETF shares. Margin/greeks aggregation: none. | qgtm_risk/options_risk.py | Delete OR wire if real options coming |
| 1.12 | EVT / factor model / stress are dashboards, not gates. None feed sizing. 5-sigma tail spike → RiskManager does nothing. | qgtm_risk/evt.py, pm_factor_model.py, stress.py | Wire size-haircut on EVT breach OR document as dashboard-only |
Tier 2 — Async/IO violations (next DMS trip in waiting)
- `daemon.py:1910` `_fetch_cot_history`: `to_thread` without `wait_for` (Tier 0.7)
- `daemon.py:1666-1688` `futures_fetch_ticker`: `asyncio.gather` with no `return_exceptions`/outer timeout; one hung Yahoo call stalls everything
- `daemon.py:2073-2100` FRED loop: serial; 10 series × 30s = 300s (5min) worst case, right at the DMS 300s threshold, so only barely safe
- `daemon.py:1986-1995` `_fetch_etf_flows_history`: sequential 5×, no per-call timeout, 150s worst case
- Strategies fetching sync HTTP on event loop: `gold_phys_premium.py:96`, `gold_lbma_fix_anomaly.py:93`, `gold_sge_premium.py:73`, `gold_mint_sales.py:93`. Daily-only so cold-start risk only.
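
The first two items share one shape: a blocking call pushed to a thread with nothing bounding how long the event loop waits for it. A minimal sketch of the two standard-library guards, `asyncio.wait_for` around `asyncio.to_thread` and `asyncio.gather(..., return_exceptions=True)`, follows. `slow_fetch`, `fetch_with_timeout`, and `fetch_all` are hypothetical stand-ins, not functions from `daemon.py`, and the delays are shortened for the demo.

```python
import asyncio
import time


def slow_fetch(delay: float) -> str:
    """Stand-in for a blocking HTTP call (hypothetical, not the real fetcher)."""
    time.sleep(delay)
    return f"ok after {delay}s"


async def fetch_with_timeout(delay: float, timeout: float) -> str:
    # wait_for bounds how long the event loop waits. The worker thread cannot
    # be cancelled and keeps running until the blocking call itself returns.
    return await asyncio.wait_for(asyncio.to_thread(slow_fetch, delay), timeout)


async def fetch_all(delays: list[float], timeout: float) -> list[object]:
    # return_exceptions=True keeps one failure from discarding the other results.
    return await asyncio.gather(
        *(fetch_with_timeout(d, timeout) for d in delays),
        return_exceptions=True,
    )


# One fast call and one call that outlives the timeout.
results = asyncio.run(fetch_all([0.01, 0.5], timeout=0.1))
```

Because the timed-out thread keeps running in the background, the blocking HTTP client should also carry its own socket-level timeout; `wait_for` alone only protects the event loop.
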
Tier 3 — Deploy / infrastructure fragility
| # | Finding | Action |
|---|---|---|
| 3.1 | deploy-api-self-hosted.yml no CI gate (Tier 0.10) | Add gate |
| 3.2 | Daemon restarted BEFORE health check; rollback only restarts API, not daemon | Restart daemon AFTER deep-health pass; include daemon in rollback |
| 3.3 | Health check is /health only (shallow) | Curl /api/v1/daemon/telemetry; assert heartbeat fresh + DMS not tripped |
| 3.4 | git clean -fd wipes audit_data/, data/cache/ on every deploy | Add -e audit_data -e data/cache -e secrets |
| 3.5 | Concurrency groups differ (deploy-${{ref}} vs deploy-api): race possible | Unify on deploy-api |
| 3.6 | systemctl restart qgtm-daemon 2>/dev/null \|\| true swallows failures | Drop \|\| true |
| 3.7 | chmod 600 .env AFTER write: interrupt → world-readable | umask 077 first or install -m 600 |
| 3.8 | pip install ... \|\| pip install ... masks first-attempt failure | Capture rc, fail explicitly |
| 3.9 | Redis no persistence (--save "", no AOF): every restart wipes IC tracker, daily returns, killed-symbols, operator queue | --appendonly yes --appendfsync everysec + bind-mount volume |
| 3.10 | Heartbeat-file healthcheck stat /app/.heartbeat/alive: file NEVER written | Either remove healthcheck OR have Heartbeat.beat() touch() the file |
| 3.11 | memory: 768M cgroup limit < observed 910MB peak (May incident) | Bump to 2G (have 8G total) |
| 3.12 | Watchdog runs in SAME event loop it's protecting (docstring claims two-process) | Split to separate process OR thread with own loop |
| 3.13 | Daemon-side Telegram alerts have NO dedup (only watchdog has it) | Apply PR #49 dedup pattern to qgtm_bot/telegram.py:send_alert |
| 3.14 | 14 dead/stale workflows clutter Actions UI (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, etc.) | Delete or move to infra/scripts/ |
| 3.15 | DR plan assumes K8s (infra/scripts/failover_drill.sh) but prod is systemd | Rewrite for systemd OR document RTO/RPO |
| 3.16 | deploy.yml deploy-api job duplicates deploy-api-self-hosted.yml; both wired to push:main | Pick one canonical, delete the other |
| 3.17 | deploy-web-direct.yml + -self-hosted.yml + deploy.yml::deploy-web: three-way split | Canonical = self-hosted; delete direct |
| 3.18 | Self-hosted runners co-resident with daemon — CI contention during long backtest runs | Document; consider separate runner host long-term |
| 3.19 | No runner heartbeat — if a runner dies, deploys silently queue | Scheduled probe via gh api runners |
| 3.20 | Notifications: deploy success/fail only in GH UI | Telegram step at end of both deploy workflows |
| 3.21 | SECRETS.md missing: NOTION_TOKEN, NOTION_PARENT_PAGE_ID, PR_LOGGER_GITHUB_WEBHOOK_SECRET, DO_API_TOKEN, DISCORD_WEBHOOK_URL, GRAFANA_ADMIN_PASS | Update |
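
Finding 3.7 is a window between creating `.env` and tightening its mode. The deploy step itself is shell (`umask 077` or `install -m 600`, as the action says); the same idea in Python, for any tooling that writes secrets, is to request mode 600 at creation time. This is a sketch for Unix hosts, and `write_secret_file` is a hypothetical helper name.

```python
import os
import stat
import tempfile


def write_secret_file(path: str, content: str) -> None:
    """Open the file with mode 0o600 from the start, then write into it."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    # The mode argument applies only when the file is created, and the process
    # umask can remove bits from it but never add any.
    fd = os.open(path, flags, 0o600)
    # Tighten a pre-existing file too, before any secret bytes are written.
    os.fchmod(fd, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)


# Demo in a throwaway directory with a placeholder value, not a real secret.
demo_dir = tempfile.mkdtemp()
env_path = os.path.join(demo_dir, ".env")
write_secret_file(env_path, "API_KEY=placeholder\n")
mode = stat.S_IMODE(os.stat(env_path).st_mode)
```
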
Tier 4 — Strategy hygiene
- 10+ strategy files have bare `return []` instead of `self._empty(reason)`, so the daemon has zero visibility into why they fired silent: `gold_silver_ratio_intraday.py` (7), `vwap_reversion_intraday.py`, `xsmom.py`, `vol_risk_premium_pm.py`, `cot_precious.py`, `seasonality_pm.py` (quarantined but still rotting)
- 9 strategy files have `strategy_id` but aren't registered in `daemon.PM_STRATEGY_CATEGORIES`: `comex_warehouse`, `cot_positioning`, `inventory_surprise`, `regime_classifier_pm` (the macro-regime classifier!), `ml_ensemble`, `meta_labeller_pm`, `si_term_structure`, `fix_dislocation`, `precious_metals` (id `gold_macro_regime`). Either wire OR delete.
- `mvg_v1` half-quarantined: quarantined by its own backtest, only protected by `QGTM_ENABLE_MVG=False`. Misconfigured deploy → on. Add to `QUARANTINED_STRATEGIES` frozenset.
- `cot_precious` ↔ `hedging_pressure` are mathematical inverses of same COT report. Likely -0.8 to -0.9 correlated, probably net-cancel. Consolidate.
- Trend over-weighting: `tsmom`, `gold_tsmom`, `carver_trend`, `gold_multi_ma_trend` give 4x exposure when gold trends. Add ensemble correlation cap.
- `vix_haven` vs `gold_vix_haven`: same Baur-Lucey edge, likely 0.7+ corr. Pick one or combine.
- `seasonality_pm` vs `gold_seasonal_demand`: overlap. Delete the quarantined one.
- AISC quarterly JSON last updated 2026-02-25, ~80 days stale. Quarantines at 120d. ~40 days to trip.
- `gold_seasonal_demand` hardcodes CNY dates; must update annually. Verify 2027 in table.
- `backtest_results/` has only `mvg_v1/`. Zero artifacts for 47 active strategies. No checked-in provenance.
- 5 quarantined strategies (`gold_silver_ratio`, `gold_platinum`, `miners_vs_metal`, `seasonality_pm`, `overnight_gold`): recommend delete `gold_platinum` (PBO 1.0) + `seasonality_pm` (overlaps); revive-with-fix `gold_silver_ratio` (try z_entry=2.5), `overnight_gold` (long-only); keep-frozen `miners_vs_metal` for 30d.
Tier 5 — Missing edges (high-leverage, need backtest)
| Edge | Sharpe est. | Effort | Data |
|---|---|---|---|
| Gold-Oil ratio mean-reversion | 0.4-0.7 | 1-2d | BNO/USO already plumbed |
| GDX/GDXJ junior-senior rotation | 0.5 | 2-3d | Existing universe |
| PBOC monthly reserve announcements | 0.3 (uncorrelated) | 3-5d | SAFE press release scrape |
| Gold/Copper macro regime tilt | 0.3-0.5 | 2d | CPER in universe, unused |
| LBMA AM/PM fix-to-fix overnight diff | 0.4 | 1d | LBMA data already live |
| Refiner-vault delta | 0.3 | ~1wk | Not feasible cheaply |
| UK/Asia ETF coverage (SGLN.L, GBS.L, 2840.HK) | medium | 0.5d | Yahoo works |
| CME gold options OI/skew (real, not just Alpaca equity options) | high | 1d | CME Datamine $0-$500/mo |
| India gold import duty + festival demand | medium | 1d | CBIC notifications |
Tier 6 — ML/AI stack upgrades
| Item | Why | Effort |
|---|---|---|
| Add CatBoost as 3rd ensemble model | SOTA on tabular finance May'26 | 1d |
| Add Optuna for hyperparam tuning | Hand-tuned today, 10-30% IC lift | 1d |
| Wire shap into ml_ensemble | Already a dep, never used; explainability | 0.5d |
| Replace hand-rolled KS test (2 copies) with scipy.stats.ks_2samp | Correctness + speed | 0.25d |
| Replace hand-rolled EMA/SMA/RSI with pandas-ta | ~50× speedup | 0.25d |
| Replace hand-rolled HMM with hmmlearn GaussianHMM | Real Baum-Welch vs k-means approx | 0.5d |
| Replace FinBERT with Llama-3.1-8B via existing Ollama | Ollama already running on droplet, free | 1d |
| Add evidently for drift detection | Replaces 800+ LOC hand-rolled | 1d |
| Add MLflow for model registry | Self-hosted on droplet :5000 | 2d |
| Add darts/neuralforecast for PatchTST forecasting | No statistical/neural forecasting today | 1wk |
Tier 7 — Code quality / hygiene
- Delete 5,569 LOC of one-shot Notion scripts in `qgtm_integrations/` (`notion_*.py`): zero in-repo callers, ~80 mypy errors. Move to `scripts/notion_legacy/` if audit trail needed.
- Fix coverage scope: `pyproject.toml:200` `tool.coverage.run.source` excludes `qgtm_live, qgtm_api, qgtm_signals, qgtm_bot, qgtm_features, qgtm_altdata`, so the highest-risk modules are invisible.
- Coverage gate mismatch: CI `--fail-under=50` vs pyproject `fail_under=70`. Pick one.
- `daemon.py:_rebalance` is 829 LOC, CC≈70. Highest-risk function in repo. Refactor to `RebalanceCycle` class with explicit phases, but high risk, needs shadow comparison.
- mypy strict whitelisted only 12 files in `qgtm_risk/`; daemon, strategies, API all unchecked.
- `daemon.py:3382` null-deref risk: `DataFrame | None → DataFrame` unsafe assignment.
- 240 `except Exception:` catches: many legitimate fail-soft; 20% in `_rebalance` / order-routing could hide bugs.
- 104 `[QGTM-FLOAT-MONEY]` lint violations: your own lint, ignored.
- `tools/lints/check_invariants.py` exists, not wired to pre-commit.
- `ml_ensemble.py:235`: potential lookahead (`.shift(-...)` with no `# intentional`).
- 13 F401 + 7 RUF100: auto-fixable with `ruff check --fix`.
- 76 outdated packages: mypy/pandas/redis/xgboost all majors behind.
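
On the `[QGTM-FLOAT-MONEY]` item: assuming, as the name suggests, that the lint flags binary floats used for monetary amounts, the underlying problem can be shown in a few lines. This illustrates why such a rule exists; it is not taken from the lint itself.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
float_matches = float_total == 0.3

# Decimal values built from strings keep exact decimal arithmetic.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_matches = decimal_total == Decimal("0.30")
```
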
Tier 8 — Backtest infrastructure (Followup B)
qgtm_backtest/ has god-tier per-strategy suite (CPCV+PBO+DSR+HLZ+bootstrap+capacity). Missing: portfolio engine. 1,400 LOC, ~10-14 hr dedicated session. Without it, PR #52's leverage bumps unvalidated against historical aggregate.
Pre-build cleanup:
- backtest_results/ only has mvg_v1 — nobody ran backtest_all.py. Confirm registry path works.
- Strategy-count drift: registry says 29, memory says 47 active. Resolve.
- Engine hard-codes gross<=2.0 (line 255), production now 3.0. Single-line fix.
- Two walk_forward methods (engine.py:372 + :705) — pick one, delete other.
Execution plan
Wave 2A — P0 safe wins (no behavior change, ship immediately)
- Fix HighErrorRate alert (T0.9)
- Install node_exporter (T1.8)
- Add CI gate to deploy-api-self-hosted (T0.10)
- Wrap `_fetch_cot_history` in wait_for (T0.7)
- Add SIGTERM handler (T0.8)
- Bump cgroup memory 768M→2G (T3.11)
- Fix Dockerfile healthcheck (T3.10)
- Add `-e audit_data` to git clean (T3.4)
- Drop `|| true` from systemctl restart (T3.6)
- Pre-create .env with mode 600 (T3.7)
- Fix `pip install` error masking (T3.8)
- Unify concurrency groups (T3.5)
- Add deep health check to deploy (T3.3)
- Restart daemon AFTER health check + include in rollback (T3.2)
- Delete dead workflows (dns-workaround, sni-probe, cf-proxy-on, fix-firewall, deploy-web-direct) (T3.14)
- Update SECRETS.md (T3.21)
- Add Telegram alert dedup (T3.13)
- Fix CI vs pyproject coverage mismatch (T7)
- Add qgtm_live/api/signals to coverage scope (T7)
- `ruff check --fix` cleanup (T7)
- Delete dead Notion script files (T7)
- Add `qgtm-backup.service` health alert
- Bare `return []` → `self._empty(reason)` mass fix (T4)
- Add `mvg_v1` to QUARANTINED_STRATEGIES (T4)
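
For the SIGTERM item in this wave (T0.8), one common asyncio pattern is to let the signal set a flag and have the main loop check it only between cycles, so a restart never cuts a rebalance in half. The sketch below is Unix-only and must run in the main thread; `run_daemon` and the event strings are hypothetical, and the self-sent signal exists only to make the demo deterministic.

```python
import asyncio
import os
import signal


async def run_daemon(cycle_seconds: float) -> list[str]:
    """Run cycles until SIGTERM, always finishing the cycle in progress."""
    events: list[str] = []
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # The handler only sets a flag; it never interrupts a running cycle.
    loop.add_signal_handler(signal.SIGTERM, stop.set)

    cycle = 0
    while not stop.is_set():
        cycle += 1
        events.append(f"cycle {cycle} start")
        if cycle == 1:
            # Demo only: simulate `systemctl restart` arriving mid-cycle.
            os.kill(os.getpid(), signal.SIGTERM)
        await asyncio.sleep(cycle_seconds)  # stands in for the rebalance work
        events.append(f"cycle {cycle} done")

    events.append("flush audit log and exit")  # placeholder for real cleanup
    loop.remove_signal_handler(signal.SIGTERM)
    return events


events = asyncio.run(run_daemon(0.05))
```

This only helps if systemd waits long enough: the unit's stop timeout has to exceed the longest cycle, otherwise the process is killed before the cycle and the audit flush complete.
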
Wave 2B — P0 silent killer bug
- DrawdownManager cool-off per-day fix + regression test (T0.1) — HIGHEST PRIORITY
Wave 2C — P0 strategy hygiene
- Delete 9 dead strategy files OR wire if ready (T4)
- Strategy correlation audit + ensemble fix (T4: `cot_precious`/`hedging_pressure` consolidation, trend cap)
Wave 3 — P1 stage as PRs for Mo's review
- Wire TCA + retrieve fill prices (T1.2 + T1.3)
- Wire intraday compliance gate (T0.2)
- Fix PositionLimitChecker (T0.3)
- Fix factor exposures to use portfolio weights (T0.4)
- Fix 20% vs 25% cap mismatch (T0.5)
- Fix gold_etf_flow shares-outstanding bug (T0.6)
- Wire self-learning loop OR delete (T1.1)
- Add CatBoost + Optuna + shap (T6)
- Replace FinBERT with Ollama Llama (T6)
- Add Redis persistence (T3.9) — needs care, ops change
- Multi-source sentiment with circuit breakers (T1.6)
- Revive dormant feeds: comex_inventory via Nasdaq Data Link, central_banks.py wiring (T data)
Needs Mo's call (escalate)
- Build portfolio backtest engine (dedicated session, ~14hr)
- Split daemon._rebalance refactor (high risk)
- DTBP-aware sizing (Followup A, 1-2hr but live-risk)
- Single droplet SPoF — add standby?
- Deploy Loki + Tempo
- IBKR backup broker (real, not scaffolding)
- CI red root-cause investigation