QGTM Operator Runbook
Last updated: 2026-05-14
Audience: anyone with SSH or API-key access to api.qgtmai.com.
This is the playbook for the situations that actually happen — dead-man trips, Cloudflare WAF outages, deploy failures, kill-switch escalations. Keep this open next to your terminal during incidents.
0. Quick reference
| What | Endpoint / command |
|---|---|
| Health (no auth) | curl http://142.93.1.195:8000/health |
| Daemon telemetry (owner) | curl -H "X-QGTM-API-Key: $KEY" .../api/v1/daemon/telemetry |
| Reset dead-man | POST /api/v1/daemon/reset-dead-man (PR #43) |
| Emergency flatten | POST /api/v1/daemon/flatten |
| Restart daemon | ssh root@142.93.1.195 'systemctl restart qgtm-daemon' |
| Tail logs | ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' |
If the UI is challenge-walled, bypass the Cloudflare WAF by hitting the droplet directly:
curl -sk -H "Host: api.qgtmai.com" \
-H "X-QGTM-API-Key: $KEY" \
http://142.93.1.195:8000/api/v1/daemon/telemetry
1. Dead-man's switch tripped
Symptom
- Telegram alert DEAD_MAN_SWITCH_TRIPPED or DEAD_MAN_STILL_TRIPPED
- All positions flatten within ~2 seconds of the trip
- /api/v1/daemon/telemetry returns "daemon_alive": false
- dead_man_switch.state == "TRIPPED"
Decide first: re-enter or stay flat?
Look at telemetry before restarting. If:
- Equity drop > 0.5% on the day → stay flat for the session
- Reconciliation was clean before the trip → re-enter is safer
- Multiple trips this week → stay flat, root-cause first
The PR-1 phase timeouts mean the next trip would only come from a >300 s synchronous block; if you don't know why this one tripped, the next one is likely too.
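The stay-flat or re-enter rules above can be written down as a small helper. This is a minimal sketch, not part of the QGTM codebase: the function name, the reading of "multiple trips" as two or more, and the conservative default when no rule matches are all assumptions.

```python
def decide_after_trip(equity_drop_pct: float,
                      recon_clean_before_trip: bool,
                      trips_this_week: int) -> str:
    """Return "stay_flat" or "re_enter" using the runbook's rules of thumb."""
    # Equity drop above 0.5% on the day: stay flat for the session.
    if equity_drop_pct > 0.5:
        return "stay_flat"
    # Multiple trips this week (assumed to mean two or more): root-cause first.
    if trips_this_week >= 2:
        return "stay_flat"
    # A clean reconciliation before the trip makes re-entry the safer call.
    if recon_clean_before_trip:
        return "re_enter"
    # Assumption: anything the rules do not cover defaults to staying flat.
    return "stay_flat"


print(decide_after_trip(equity_drop_pct=0.2,
                        recon_clean_before_trip=True,
                        trips_this_week=1))
```

The checks run in order of severity, so a large equity drop or repeated trips always wins over a clean reconciliation.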
Step-by-step reset
1. Confirm the trip via the direct droplet (Cloudflare may be in the way), using the bypass curl from the quick reference.
2. Pull the daemon log lines from around the trip (the last 300 lines of the past 15 minutes):
   ssh root@142.93.1.195 \
     "journalctl -u qgtm-daemon --since '15 minutes ago' --until 'now' | tail -300"
   Look for: bars_batch_timeout, enrichment_gather_timeout, no_signals_generated_from_any_strategy, redis_unavailable, uncaught exception tracebacks.
3. Decide stay-flat or re-enter (see above).
4. If staying flat for the session: leave the daemon down; it's already flat. Resume next session by following step 5.
5. If re-entering, do a clean restart (the "Restart daemon" command in the quick reference). This (PR #40) runs a mandatory startup reconciliation against the broker before opening any positions. Watch the daemon logs while it starts.
6. Verify the daemon is healthy after ~60 seconds:
   curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
     http://142.93.1.195:8000/api/v1/daemon/telemetry | \
     jq '.heartbeat.is_alive, .risk.kill_tier, .reconciliation.is_clean'
   Expect: true, "NORMAL", true. Anything else: stop, ask why.
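The post-restart check can also be scripted, so that a deviation names the field that is off. The sketch below assumes the telemetry JSON has the three paths used in the jq filter (`heartbeat.is_alive`, `risk.kill_tier`, `reconciliation.is_clean`); the function name and the sample payload are illustrative only.

```python
import json

# Values the runbook expects after a clean restart.
EXPECTED = {"is_alive": True, "kill_tier": "NORMAL", "is_clean": True}


def post_restart_problems(telemetry: dict) -> list:
    """Return one message per field that deviates from the expected values."""
    observed = {
        "is_alive": telemetry.get("heartbeat", {}).get("is_alive"),
        "kill_tier": telemetry.get("risk", {}).get("kill_tier"),
        "is_clean": telemetry.get("reconciliation", {}).get("is_clean"),
    }
    return [
        f"{name}: expected {EXPECTED[name]!r}, got {value!r}"
        for name, value in observed.items()
        if value != EXPECTED[name]
    ]


# Hypothetical payload: the daemon is alive but the kill tier is still WARN.
sample = json.loads(
    '{"heartbeat": {"is_alive": true},'
    ' "risk": {"kill_tier": "WARN"},'
    ' "reconciliation": {"is_clean": true}}'
)
for problem in post_restart_problems(sample):
    print(problem)
```

An empty list means all three values match; anything else is the "stop, ask why" case.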
Step-by-step reset without systemd (operator endpoint)
When PR #43's operator-reset endpoint is wired in:
curl -X POST -H "X-QGTM-API-Key: $KEY" \
-H "Content-Type: application/json" \
-d '{"approver": "naz", "reason": "post-trip investigation complete"}' \
https://api.qgtmai.com/api/v1/daemon/reset-dead-man
2. Cloudflare WAF challenging API traffic
Symptom
- curl https://api.qgtmai.com/health returns the "Just a moment..." HTML
- The web dashboard at qgtmai.com is empty or shows stale data
- Daemon itself may still be healthy (check via droplet IP)
Step-by-step
1. Confirm it's a WAF issue, not a daemon issue, by running the no-auth health check against the droplet IP (see the quick reference). If this returns {"status": "ok"} then the daemon is fine and only Cloudflare is blocking.
2. In the Cloudflare dashboard:
   - Security → WAF → Custom rules
   - Find any rule matching api.qgtmai.com and /api/v1/*
   - Either add a bypass for hits carrying the X-QGTM-API-Key header, or temporarily lower the security level for /api/v1/* to "Essentially Off"
   - The historical fix has been to lower from Bot Fight Mode → Off for the /api/v1/* path
3. Verify the fix by re-running curl https://api.qgtmai.com/health. Expect: {"status": "ok", ...} with no challenge HTML.
4. Long-term: migrate browser auth to Cloudflare Access (PR 2). Then the WAF can be permissive on /api/v1/* because CF Access provides the auth layer.
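Telling a healthy response from a challenge page is a matter of inspecting the body. A minimal sketch follows, assuming the health endpoint returns JSON with `"status": "ok"` and that the challenge page contains the phrase "Just a moment", as described in the symptoms; the function name is hypothetical.

```python
import json


def classify_health_response(body: str) -> str:
    """Return "ok", "waf_challenge", or "unknown" for a /health response body."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML challenge page.
        payload = None
    if isinstance(payload, dict) and payload.get("status") == "ok":
        return "ok"
    if "Just a moment" in body:
        return "waf_challenge"
    return "unknown"


print(classify_health_response('{"status": "ok"}'))
print(classify_health_response("<html><title>Just a moment...</title></html>"))
```

If the droplet response classifies as "ok" while the response through api.qgtmai.com classifies as "waf_challenge", the daemon is fine and only Cloudflare is blocking.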
3. Kill-switch escalated above NORMAL
Tiers (qgtm_risk.manager.KillTier)
| Tier | What it means | Auto-action |
|---|---|---|
| NORMAL | All clear | None |
| WARN | Threshold approaching | Logged, alerted |
| THROTTLE | Reduce sizes 50% | Positions still open; new orders scaled down |
| NO_NEW | Block opens | Only closes/reductions allowed |
| FLATTEN | Emergency | Close everything via market orders |
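To make the tier semantics concrete, here is a sketch of how the tiers could be modeled as an ordered enum. It is not the actual `qgtm_risk.manager.KillTier` definition: the integer values, the ordering (taken from the table order), and the helper `new_open_size_multiplier` are assumptions for illustration.

```python
from enum import IntEnum


class KillTier(IntEnum):
    # Assumed ordering: a higher value means a more severe tier.
    NORMAL = 0
    WARN = 1
    THROTTLE = 2
    NO_NEW = 3
    FLATTEN = 4


def new_open_size_multiplier(tier: KillTier) -> float:
    """Scale factor applied to new opening orders at a given tier."""
    if tier <= KillTier.WARN:
        return 1.0   # NORMAL and WARN do not change sizing
    if tier == KillTier.THROTTLE:
        return 0.5   # sizes reduced by 50%
    return 0.0       # NO_NEW and FLATTEN block new opens entirely


for tier in KillTier:
    print(tier.name, new_open_size_multiplier(tier))
```

Using an ordered enum lets escalation checks be written as comparisons, for example `tier >= KillTier.NO_NEW`.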
Causes that escalate to WARN
- Reconciliation: critical-discrepancy count > 0 (consecutive)
- Empty-signal cycles ≥ 2 (PR-1 fix — was silently flattening before)
- Cost gate rejecting 100% of orders for ≥ 3 cycles (PR-1 fix)
- Redis unavailable > 30 s
- Drawdown soft breach (5% from peak)
- Startup reconciliation critical (PR #40 fix)
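The WARN triggers listed above are all simple threshold checks. The sketch below encodes them for reference while reading `risk.kill_history`. The `CycleStats` fields and reason strings are hypothetical names, "consecutive" is simplified to a plain count, and the drawdown check assumes "5% from peak" means 5% or more.

```python
from dataclasses import dataclass


@dataclass
class CycleStats:
    # Hypothetical counters; the real manager may track these differently.
    recon_critical_count: int = 0
    empty_signal_cycles: int = 0
    full_cost_gate_reject_cycles: int = 0
    redis_down_seconds: float = 0.0
    drawdown_from_peak_pct: float = 0.0
    startup_recon_critical: bool = False


def warn_reasons(stats: CycleStats) -> list:
    """Return the names of every WARN trigger that currently fires."""
    reasons = []
    if stats.recon_critical_count > 0:
        reasons.append("reconciliation_critical")
    if stats.empty_signal_cycles >= 2:
        reasons.append("empty_signals")
    if stats.full_cost_gate_reject_cycles >= 3:
        reasons.append("cost_gate_rejecting_all")
    if stats.redis_down_seconds > 30:
        reasons.append("redis_unavailable")
    if stats.drawdown_from_peak_pct >= 5.0:  # assumed inclusive threshold
        reasons.append("drawdown_soft_breach")
    if stats.startup_recon_critical:
        reasons.append("startup_reconciliation_critical")
    return reasons


print(warn_reasons(CycleStats(empty_signal_cycles=2, redis_down_seconds=45)))
```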
What to do at WARN
- Run telemetry and check risk.kill_history for the last 5 escalations
- Match against the triggers above; risk.kill_history[].reason is verbose
- Most WARN triggers self-clear: a clean reconcile, signals returning, etc. give the manager an explicit _de_escalate next cycle
- If WARN persists > 30 minutes, restart the daemon (forces startup recon)
Manual reset
or via API:
curl -X POST -H "X-QGTM-API-Key: $KEY" \
-H "Content-Type: application/json" \
-d '{"approver":"naz","reason":"investigated, all clear"}' \
https://api.qgtmai.com/api/v1/risk/kill-tier/reset
4. Deploy + rollback
Standard deploy (via GitHub Actions on push to main)
- Open PR, run CI, merge to main
- .github/workflows/deploy.yml runs:
  - Web: builds Next.js → uploads to Cloudflare Pages
  - API: SSHes to droplet, git pull, pip install, systemctl restart qgtm-daemon
  - Health-check loop hits http://127.0.0.1:8000/health for up to 60 s
  - If the health check fails, the workflow rolls back to the previous commit and restarts again
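The health-check loop in the workflow is a poll with a deadline. The actual logic lives in deploy.yml and is not shown here, so the following is only a sketch of the pattern: the function names, the 2-second polling interval, and the injectable `sleep`/`clock` parameters (there to make the loop testable) are assumptions.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://127.0.0.1:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or HTTP error status.
        return False


def wait_for_health(probe: Callable[[], bool],
                    timeout_s: float = 60.0,
                    interval_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would run `wait_for_health(http_probe)` after the restart and trigger the rollback path when it returns False.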
Manual deploy (if Actions are unavailable)
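ssh root@142.93.1.195 << 'EOF'
# Stop at the first failing command so a bad pull or install never reaches the restart
set -e
cd /srv/qgtm
git fetch origin
git checkout main
git pull --ff-only
source .venv/bin/activate
pip install -e .
systemctl restart qgtm-daemon
# Give the daemon time to start, then fail loudly if /health is not OK
sleep 30
curl -fsS http://127.0.0.1:8000/health
EOF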
Emergency rollback (deploy broke production)
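ssh root@142.93.1.195 << 'EOF'
set -e
cd /srv/qgtm
# Second-newest non-WIP commit among the last 20 (assumes HEAD is the broken deploy)
LAST_GOOD=$(git log --oneline -20 | grep -v 'WIP' | head -2 | tail -1 | cut -d' ' -f1)
git checkout "$LAST_GOOD"
# Install into the project venv, as in the manual deploy
source .venv/bin/activate
pip install -e .
systemctl restart qgtm-daemon
EOF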
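The LAST_GOOD pipeline is dense, so it helps to see what it selects. The sketch below mirrors the same selection in Python on the text output of `git log --oneline -20`. The function name is hypothetical, and it differs from the shell in one respect: with fewer than two non-WIP commits it returns None, whereas the shell pipeline would fall back to the newest one.

```python
from typing import Optional


def pick_last_good(oneline_log: str) -> Optional[str]:
    """Pick the second-newest non-WIP commit hash from `git log --oneline` output."""
    lines = [ln for ln in oneline_log.splitlines() if ln.strip()]
    # Same filter as `grep -v 'WIP'` over the first 20 log lines.
    candidates = [ln for ln in lines[:20] if "WIP" not in ln]
    if len(candidates) < 2:
        return None
    # The first candidate is assumed to be the broken deploy; take the next one.
    return candidates[1].split(" ", 1)[0]


log = (
    "a1b2c3d Fix recon edge case\n"
    "9f8e7d6 WIP: tune thresholds\n"
    "1234567 Add reset endpoint\n"
    "7654321 Older change\n"
)
print(pick_last_good(log))
```

Note the built-in assumption: the newest non-WIP commit is the one that broke production. If the breakage came from an older commit, pick the rollback target by hand.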
Then in GitHub:
…to keep main consistent with the deployed state.
5. Cloudflare Access (CF Access) setup: when you're ready to retire the frontend API key
PR 2 added backend support for CF Access JWTs. To turn it on:
One-time setup (Cloudflare dashboard)
- Cloudflare → Zero Trust → Access → Applications → Add an application
- Type: Self-hosted
- Application name: "QGTM API"
- Application domain: api.qgtmai.com
- Path: /api/v1/* (leave /health open for monitoring)
- Identity providers: enable Email OTP + Google + GitHub
- Add policies:
  - Policy 1 "Founders": Include emails naz@qgtmai.com, mo@qgtmai.com
- Save. Note the AUD tag from the app config (long hex string)
- Note your team subdomain (e.g. qgtmai → qgtmai.cloudflareaccess.com)
Deploy backend with CF Access enabled
ssh root@142.93.1.195 << 'EOF'
cd /srv/qgtm
cat >> .env <<ENV
QGTM_CF_ACCESS_TEAM_DOMAIN=qgtmai
QGTM_CF_ACCESS_APP_AUD=<paste-AUD-from-cloudflare>
ENV
systemctl restart qgtm-daemon
systemctl restart qgtm-api # if separate
EOF
Verify
# Without auth → should redirect to CF Access login
curl -I https://api.qgtmai.com/api/v1/risk
# With X-QGTM-API-Key (legacy path) → should still work
curl -H "X-QGTM-API-Key: $KEY" https://api.qgtmai.com/api/v1/risk
Browse to https://api.qgtmai.com/api/v1/risk — you should hit a CF
Access login page, authenticate, and get the JSON.
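For context on what the two environment values are used for: Cloudflare Access issues a JWT whose issuer is the team domain and whose audience contains the application's AUD tag, with signing keys published under `/cdn-cgi/access/certs`. The sketch below shows only the claim checks and assumes the token signature has already been verified against those keys. It is not the PR 2 implementation, and the function names are hypothetical.

```python
def cf_access_urls(team_domain: str) -> dict:
    """Build the issuer and public-key URLs for a team name such as "qgtmai"."""
    base = f"https://{team_domain}.cloudflareaccess.com"
    return {"issuer": base, "certs": f"{base}/cdn-cgi/access/certs"}


def claims_match(claims: dict, team_domain: str, app_aud: str, now: float) -> bool:
    """Check aud, iss, and exp on claims from an ALREADY signature-verified token."""
    aud = claims.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]
    return (
        app_aud in aud
        and claims.get("iss") == cf_access_urls(team_domain)["issuer"]
        and claims.get("exp", 0) > now
    )


print(cf_access_urls("qgtmai")["certs"])
```

If the AUD pasted into .env is wrong, the audience check is the part that fails, which typically shows up as authenticated browser sessions still being rejected by the backend.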
Switch the frontend (final step)
Once CF Access works:
- Remove NEXT_PUBLIC_QGTM_API_KEY from .github/workflows/deploy.yml
- Update qgtm_web/src/lib/api.ts to drop the X-QGTM-API-Key header
- Rebuild + redeploy
- Rotate the old API key on the server: qgtm_api_key=$(openssl rand -hex 32)
- Update server-to-server callers (daemon ingest, monitoring) with the new key
The legacy X-QGTM-API-Key path stays available for programmatic use
(CI, scripts, the daemon's own state ingest).
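If the rotation is scripted rather than run by hand, Python's standard library can produce a key of the same shape as `openssl rand -hex 32`. This is a small sketch; the function name is hypothetical, and how the key is stored on the server is out of scope.

```python
import secrets


def new_api_key() -> str:
    """Return 32 cryptographically random bytes as a 64-character hex string."""
    return secrets.token_hex(32)


key = new_api_key()
print(len(key))
```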
6. Pattern Day Trader (PDT) flag
Symptom
- /api/v1/account shows "pattern_day_trader": true
- daytrade_count ≥ 2
Why it matters
Alpaca classifies an account as PDT after 4 or more day trades within 5 trading days. Below $25k equity this hard-restricts trading. You're at $95k so still operational, but PDT-flagged accounts are watched more closely and some intraday tactics incur higher costs through forced T+1 settlement.
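The thresholds in this section can be summarized in one function. This is a sketch under stated assumptions: the status labels are invented for illustration, and treating a count of 2 or 3 as an early warning comes from this runbook's symptom list, not from any broker rule.

```python
def pdt_status(day_trades_last_5_days: int, equity_usd: float) -> str:
    """Classify the account's PDT exposure from day-trade count and equity."""
    if day_trades_last_5_days < 4:
        # Below the PDT threshold; 2 or more is this runbook's early-warning level.
        return "early_warning" if day_trades_last_5_days >= 2 else "clear"
    # Four or more day trades in the window means the account gets flagged.
    if equity_usd < 25_000:
        return "flagged_restricted"
    return "flagged_operational"


print(pdt_status(day_trades_last_5_days=4, equity_usd=95_000))
```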
What to do
- Audit recent intraday/flatten activity; the flatten cascade we saw on 2026-05-13 is the main contributor
- Reduce intraday strategy weight (currently capped at 10% via INTRADAY_CAPITAL_FRACTION)
- Avoid emergency flattens (the PR-1 work targets this directly)
- Alpaca will reset the PDT flag after a few weeks of no day-trades
7. Useful one-liners
# 14 days of P&L by day (from order history, requires X-QGTM-API-Key)
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
"http://142.93.1.195:8000/api/v1/orders?limit=400" | \
jq '[.orders[] | select(.status=="filled") | {date: (.submittedAt[:10]), pnl: (...)}] | group_by(.date)'
# Count emergency flatten events in last 30 days (orders with bare-UUID
# clientOrderId clusters, ≥ 4 within 60s)
curl ... | python3 scripts/count_flatten_clusters.py
# Tail daemon logs filtered to escalations only
ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' | \
grep -E 'KILL_SWITCH_|DEAD_MAN_|RECONCILIATION_CRITICAL|DRAWDOWN_'
# Direct droplet curl bypassing Cloudflare
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
http://142.93.1.195:8000/$ENDPOINT
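The flatten-cluster heuristic from the one-liners (orders with a bare-UUID clientOrderId, at least 4 within 60 s) can be sketched as follows. This is not the contents of scripts/count_flatten_clusters.py, which is not shown here. It assumes each order is a dict with `clientOrderId` and an ISO 8601 `submittedAt`, and it does not apply the 30-day filter.

```python
import re
from datetime import datetime, timedelta

# A "bare" UUID: nothing before or after the 8-4-4-4-12 hex groups.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def count_flatten_clusters(orders, min_orders=4, window_s=60):
    """Count groups of >= min_orders bare-UUID orders submitted within window_s."""
    times = sorted(
        datetime.fromisoformat(o["submittedAt"].replace("Z", "+00:00"))
        for o in orders
        if UUID_RE.match(o.get("clientOrderId") or "")
    )
    window = timedelta(seconds=window_s)
    clusters, i = 0, 0
    while i < len(times):
        # Extend the group while orders stay within the window of the first one.
        j = i
        while j + 1 < len(times) and times[j + 1] - times[i] <= window:
            j += 1
        if j - i + 1 >= min_orders:
            clusters += 1
            i = j + 1  # skip past this cluster
        else:
            i += 1
    return clusters
```

Each window is anchored at the first order of a candidate group, so a long burst counts as one cluster, not several overlapping ones.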
8. Long-term — what to set up next
- [ ] SSH read-only user for Claude: adm group, sudo limited to systemctl status/restart qgtm-daemon. Removes blast risk while enabling on-call diagnostics.
- [ ] Staging droplet: same code, fake Alpaca account, lets risky changes (shadow-portfolio refactor, OMS routing) be validated against real-shaped data before touching production.
- [ ] Cloudflare Access (see §5): replaces frontend API key.
- [ ] Sentry integration: daemon errors should hit a tool, not bury in journalctl.
- [ ] DigitalOcean Snapshots automation: daily snapshot of the droplet so we can revert in under 30 s for any reason.