QGTM Operator Runbook

Last updated: 2026-05-14

Audience: anyone with SSH or API-key access to api.qgtmai.com.

This is the playbook for the situations that actually happen — dead-man trips, Cloudflare WAF outages, deploy failures, kill-switch escalations. Keep this open next to your terminal during incidents.

0. Quick reference

What	Endpoint / command
Health (no auth)	`curl http://142.93.1.195:8000/health`
Daemon telemetry (owner)	`curl -H "X-QGTM-API-Key: $KEY" .../api/v1/daemon/telemetry`
Reset dead-man	`POST /api/v1/daemon/reset-dead-man` (PR #43)
Emergency flatten	`POST /api/v1/daemon/flatten`
Restart daemon	`ssh root@142.93.1.195 'systemctl restart qgtm-daemon'`
Tail logs	`ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f'`

If the UI is challenge-walled, bypass the Cloudflare WAF by hitting the droplet directly:

curl -sk -H "Host: api.qgtmai.com" \
  -H "X-QGTM-API-Key: $KEY" \
  http://142.93.1.195:8000/api/v1/daemon/telemetry

1. Dead-man's switch tripped

Symptom

Telegram alert DEAD_MAN_SWITCH_TRIPPED or DEAD_MAN_STILL_TRIPPED
All positions flatten within ~2 seconds of the trip
/api/v1/daemon/telemetry returns "daemon_alive": false
dead_man_switch.state == "TRIPPED"

Decide first: re-enter or stay flat?

Look at telemetry before restarting. If:

- Equity drop > 0.5% on the day → stay flat for the session
- Reconciliation was clean before the trip → re-enter is safer
- Multiple trips this week → stay flat, root-cause first

The PR-1 phase timeouts mean the next trip would only come from a >300 s synchronous block. If you don't know why this one tripped, assume the same cause will trip it again.
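The three rules above can be written down as a small decision helper. This is only a sketch: `TripContext` and `decide_after_trip` are hypothetical names, the inputs are values an operator would read off telemetry by hand, and the fallback to staying flat when no rule clearly applies is an assumption, not something the runbook states.

```python
from dataclasses import dataclass


@dataclass
class TripContext:
    """Values read off telemetry after a dead-man trip (hypothetical shape)."""
    equity_drop_pct: float          # day's equity drop in percent, e.g. 0.7 means -0.7%
    reconciliation_was_clean: bool  # was reconciliation clean before the trip?
    trips_this_week: int            # dead-man trips this week, including this one


def decide_after_trip(ctx: TripContext) -> str:
    """Return "stay_flat" or "re_enter" using the three runbook rules."""
    if ctx.equity_drop_pct > 0.5:
        return "stay_flat"   # equity drop > 0.5% on the day
    if ctx.trips_this_week > 1:
        return "stay_flat"   # multiple trips this week: root-cause first
    if ctx.reconciliation_was_clean:
        return "re_enter"    # clean reconciliation makes re-entry safer
    return "stay_flat"       # assumption: default to the conservative choice
```

The ordering puts the two stay-flat rules first so that a clean reconciliation never overrides a large equity drop or repeated trips.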

Step-by-step reset

Confirm the trip via direct droplet (Cloudflare may be in the way):

curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  http://142.93.1.195:8000/api/v1/daemon/telemetry | jq '.dead_man_switch, .heartbeat'

Pull the daemon log for the last 15 minutes (capped at the final 300 lines) to see what preceded the trip:
```
ssh root@142.93.1.195 \
  "journalctl -u qgtm-daemon --since '15 minutes ago' --until 'now' | tail -300"
```
Look for: bars_batch_timeout, enrichment_gather_timeout, no_signals_generated_from_any_strategy, redis_unavailable, uncaught exception tracebacks.
Decide stay-flat or re-enter (see above).
If staying flat for the session: leave the daemon down, since positions are already flat. Resume next session with the clean restart described in the next step.
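When the log excerpt is long, counting the markers named in the log-pulling step can be faster than reading it by eye. The sketch below assumes the log has been saved or piped as plain text lines; `count_markers` is a hypothetical helper, and matching on the literal word `Traceback` is an assumed stand-in for "uncaught exception tracebacks".

```python
from collections import Counter
from typing import Iterable

# Markers named in the runbook. "Traceback" is an assumption: it is the first
# line Python prints for an uncaught exception.
MARKERS = (
    "bars_batch_timeout",
    "enrichment_gather_timeout",
    "no_signals_generated_from_any_strategy",
    "redis_unavailable",
    "Traceback",
)


def count_markers(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each marker."""
    counts: Counter = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Calling `count_markers(open("trip.log"))` on a saved excerpt and printing `.most_common()` gives a quick ranking of which failure mode dominated the minutes before the trip.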

If re-entering, do a clean restart:

ssh root@142.93.1.195 'systemctl restart qgtm-daemon'

The restart runs a mandatory startup reconciliation against the broker (added in PR #40) before opening any positions. Watch:

ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' | \
  grep -E 'startup_reconciliation|RECONCILIATION_CRITICAL|signal_'

Verify the daemon is healthy after ~60 seconds:

curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  http://142.93.1.195:8000/api/v1/daemon/telemetry | \
  jq '.heartbeat.is_alive, .risk.kill_tier, .reconciliation.is_clean'

Expect: true, "NORMAL", true. Anything else: stop, ask why.
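The same three-value check can be scripted against a telemetry payload that has already been fetched and parsed. The field paths come from the `jq` filter above; `post_restart_problems` is a hypothetical helper, and treating a missing field as a failure is an assumption.

```python
from typing import Any


def post_restart_problems(telemetry: dict[str, Any]) -> list[str]:
    """Return a list of deviations from the expected (true, "NORMAL", true)."""
    problems = []
    if telemetry.get("heartbeat", {}).get("is_alive") is not True:
        problems.append("heartbeat.is_alive is not true")
    tier = telemetry.get("risk", {}).get("kill_tier")
    if tier != "NORMAL":
        problems.append(f"risk.kill_tier is {tier!r}, expected 'NORMAL'")
    if telemetry.get("reconciliation", {}).get("is_clean") is not True:
        problems.append("reconciliation.is_clean is not true")
    return problems
```

An empty list means the daemon looks healthy; any entry is a reason to stop and investigate before trading resumes.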

Step-by-step reset without systemd (operator endpoint)

When PR #43's operator-reset endpoint is wired in:

curl -X POST -H "X-QGTM-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"approver": "naz", "reason": "post-trip investigation complete"}' \
  https://api.qgtmai.com/api/v1/daemon/reset-dead-man

2. Cloudflare WAF challenging API traffic

Symptom

curl https://api.qgtmai.com/health returns the "Just a moment..." HTML
The web dashboard at qgtmai.com is empty or shows stale data
Daemon itself may still be healthy (check via droplet IP)

Step-by-step

Confirm it's a WAF issue, not a daemon issue:
```
curl -sk -H "Host: api.qgtmai.com" http://142.93.1.195:8000/health
```
If this returns {"status": "ok"} then the daemon is fine and only Cloudflare is blocking.
In the Cloudflare dashboard:
Security → WAF → Custom rules
Find any rule matching api.qgtmai.com and /api/v1/*
Either add a bypass for hits carrying the X-QGTM-API-Key header, or temporarily lower the security level for /api/v1/* to "Essentially Off"
The historical fix has been to lower from Bot Fight Mode → Off for the /api/v1/* path
Verify the fix:
```
curl https://api.qgtmai.com/health
```
Expect: {"status": "ok", ...} — no challenge HTML.
Long-term: migrate browser auth to Cloudflare Access (PR 2). Then the WAF can be permissive on /api/v1/* because CF Access provides the auth layer.
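The "WAF or daemon?" question comes down to comparing two response bodies: one from the droplet IP and one through Cloudflare. A minimal sketch follows, assuming both bodies were already fetched as strings; the function names and return labels are hypothetical.

```python
import json


def classify_health_response(body: str) -> str:
    """Return "ok", "waf_challenge", or "unknown" for a /health response body."""
    if "Just a moment..." in body:
        return "waf_challenge"   # Cloudflare interstitial HTML
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if isinstance(payload, dict) and payload.get("status") == "ok":
        return "ok"
    return "unknown"


def diagnose(direct_body: str, public_body: str) -> str:
    """Combine the droplet-direct and public responses into one verdict."""
    direct = classify_health_response(direct_body)
    public = classify_health_response(public_body)
    if direct != "ok":
        return "daemon_problem"        # the droplet itself is not healthy
    if public == "waf_challenge":
        return "cloudflare_blocking"   # daemon fine, only Cloudflare in the way
    return "healthy" if public == "ok" else "unknown"
```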

3. Kill-switch escalated above NORMAL

Tiers (`qgtm_risk.manager.KillTier`)

Tier	What it means	Auto-action
`NORMAL`	All clear	None
`WARN`	Threshold approaching	Logged, alerted
`THROTTLE`	Reduce sizes 50%	Positions still open; new orders scaled down
`NO_NEW`	Block opens	Only closes/reductions allowed
`FLATTEN`	Emergency	Close everything via market orders
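To make the ordering of the tiers concrete, here is an illustrative enum. It is not the real `qgtm_risk.manager.KillTier`: the integer values and the `size_multiplier` helper are assumptions that only mirror the table above.

```python
from enum import IntEnum


class KillTier(IntEnum):
    """Illustrative ordering only; the real class lives in qgtm_risk.manager."""
    NORMAL = 0
    WARN = 1
    THROTTLE = 2
    NO_NEW = 3
    FLATTEN = 4


def size_multiplier(tier: KillTier) -> float:
    """Scale factor applied to a new opening order under the given tier."""
    if tier >= KillTier.NO_NEW:
        return 0.0   # NO_NEW and FLATTEN: no new opens
    if tier == KillTier.THROTTLE:
        return 0.5   # THROTTLE: reduce sizes 50%
    return 1.0       # NORMAL and WARN: sizes unchanged
```

Using an ordered enum lets "at or above NO_NEW" be a single comparison instead of a list of tier names.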

Causes that escalate to WARN

Reconciliation: critical-discrepancy count > 0 (consecutive)
Empty-signal cycles ≥ 2 (PR-1 fix — was silently flattening before)
Cost gate rejecting 100% of orders for ≥ 3 cycles (PR-1 fix)
Redis unavailable > 30 s
Drawdown soft breach (5% from peak)
Startup reconciliation critical (PR #40 fix)
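The triggers above can be read as a set of independent threshold checks. The sketch below is a hedged restatement, not the risk manager's code: `CycleStats` and its field names are hypothetical, and treating the drawdown breach as "at or above 5%" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CycleStats:
    """Hypothetical snapshot of the counters behind the WARN triggers."""
    critical_discrepancies: int = 0        # reconciliation critical-discrepancy count
    empty_signal_cycles: int = 0           # consecutive cycles with no signals
    cost_gate_full_reject_cycles: int = 0  # consecutive cycles rejecting 100% of orders
    redis_down_seconds: float = 0.0
    drawdown_from_peak_pct: float = 0.0
    startup_recon_critical: bool = False


def warn_reasons(s: CycleStats) -> list[str]:
    """Return the runbook triggers that the given counters satisfy."""
    reasons = []
    if s.critical_discrepancies > 0:
        reasons.append("reconciliation critical discrepancies")
    if s.empty_signal_cycles >= 2:
        reasons.append("empty-signal cycles >= 2")
    if s.cost_gate_full_reject_cycles >= 3:
        reasons.append("cost gate rejecting 100% for >= 3 cycles")
    if s.redis_down_seconds > 30:
        reasons.append("redis unavailable > 30 s")
    if s.drawdown_from_peak_pct >= 5.0:   # assumption: breach is inclusive at 5%
        reasons.append("drawdown soft breach")
    if s.startup_recon_critical:
        reasons.append("startup reconciliation critical")
    return reasons
```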

What to do at WARN

Run telemetry and check risk.kill_history for the last 5 escalations
Match against the triggers above — risk.kill_history[].reason is verbose
Most WARN triggers self-clear: once the condition recovers (a clean reconcile, signals returning, and so on), the manager issues an explicit _de_escalate on the next cycle
If WARN persists > 30 minutes, restart the daemon (forces startup recon)

Manual reset

# Telegram bot command (preferred)
/kill_reset NORMAL "manual investigation complete - naz"

or via API:

curl -X POST -H "X-QGTM-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"approver":"naz","reason":"investigated, all clear"}' \
  https://api.qgtmai.com/api/v1/risk/kill-tier/reset

4. Deploy + rollback

Standard deploy (via GitHub Actions on push to main)

Open PR, run CI, merge to main
.github/workflows/deploy.yml runs:
Web: builds Next.js → uploads to Cloudflare Pages
API: SSHes to droplet, git pull, pip install, systemctl restart qgtm-daemon
Health-check loop hits http://127.0.0.1:8000/health for up to 60 s
If the health check fails, the workflow rolls back to the previous commit and restarts again
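The health-check loop in the workflow is a poll-until-deadline pattern. The sketch below shows that pattern in isolation; it is not the contents of `deploy.yml`. The 2-second interval, the function name, and the injectable `clock` and `sleep` parameters (there so the loop can be exercised without waiting) are all assumptions.

```python
import time
from typing import Callable


def wait_for_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # a failing probe (e.g. connection refused) means "not yet healthy"
        if clock() >= deadline:
            return False   # caller should roll back at this point
        sleep(interval_s)
```

A real probe could be `lambda: urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=5).status == 200`; a `False` return is the signal to roll back to the previous commit.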

Manual deploy (if Actions are unavailable)

ssh root@142.93.1.195 << 'EOF'
set -e  # stop at the first failing step instead of restarting on a half-applied update
cd /srv/qgtm
git fetch origin
git checkout main
git pull --ff-only
source .venv/bin/activate
pip install -e .
systemctl restart qgtm-daemon
sleep 30
curl -fsS http://127.0.0.1:8000/health
EOF

Emergency rollback (deploy broke production)

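ssh root@142.93.1.195 << 'EOF'
set -e  # abort on the first failing command
cd /srv/qgtm
# Second non-WIP commit in the last 20: normally the commit just before HEAD
LAST_GOOD=$(git log --oneline -20 | grep -v 'WIP' | head -2 | tail -1 | cut -d' ' -f1)
git checkout "$LAST_GOOD"
source .venv/bin/activate  # same virtualenv the manual deploy uses
pip install -e .
systemctl restart qgtm-daemon
EOF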

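Then, from a local clone with main checked out, revert the bad commit so that main stays consistent with the deployed state:

git revert HEAD --no-edit
git push origin main

If the bad commit is a merge commit, git revert also needs -m 1 to select the mainline parent.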
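It helps to know exactly which commit the `LAST_GOOD` pipeline in the rollback script selects. The sketch below mirrors that pipeline on the text output of `git log --oneline`; `pick_last_good` is a hypothetical name, and returning `None` where the shell version would yield an empty string is a deliberate difference.

```python
from typing import Optional


def pick_last_good(oneline_log: str) -> Optional[str]:
    """Mirror: git log --oneline -20 | grep -v 'WIP' | head -2 | tail -1 | cut -d' ' -f1"""
    lines = oneline_log.splitlines()[:20]            # -20
    kept = [ln for ln in lines if "WIP" not in ln]   # grep -v 'WIP'
    if not kept:
        return None
    chosen = kept[:2][-1]                            # head -2 | tail -1
    return chosen.split(" ", 1)[0]                   # cut -d' ' -f1
```

The result is the second non-WIP entry, which is "the commit before the current deploy" only when HEAD itself is not a WIP commit. If HEAD is a WIP commit, the pipeline skips one extra non-WIP commit, so it is worth eyeballing `git log --oneline -5` before trusting the value.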

5. Cloudflare Access (CF Access) setup — when you're ready to retire the frontend API key

PR 2 added backend support for CF Access JWTs. To turn it on:

One-time setup (Cloudflare dashboard)

Cloudflare → Zero Trust → Access → Applications → Add an application
Type: Self-hosted
Application name: "QGTM API"
Application domain: api.qgtmai.com
Path: /api/v1/* (leave /health open for monitoring)
Identity providers: enable Email OTP + Google + GitHub
Add policies:
Policy 1 "Founders": Include emails naz@qgtmai.com, mo@qgtmai.com
Save. Note the AUD tag from the app config (long hex string)
Note your team subdomain (e.g. qgtmai → qgtmai.cloudflareaccess.com)

Deploy backend with CF Access enabled

ssh root@142.93.1.195 << 'EOF'
cd /srv/qgtm
cat >> .env <<ENV
QGTM_CF_ACCESS_TEAM_DOMAIN=qgtmai
QGTM_CF_ACCESS_APP_AUD=<paste-AUD-from-cloudflare>
ENV
systemctl restart qgtm-daemon
systemctl restart qgtm-api  # if separate
EOF
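The two settings correspond to what a backend needs to check on a Cloudflare Access JWT: the team domain determines the issuer and the public-key URL, and the AUD tag must appear in the token's `aud` claim. The sketch below covers only the claim checks on an already-decoded payload and deliberately omits signature verification, which a real backend must do against the keys at the certs URL. How the QGTM backend actually consumes these variables is not described in this runbook, so the function names here are hypothetical.

```python
import time
from typing import Any, Optional


def certs_url(team_domain: str) -> str:
    """URL of the public signing keys for a Cloudflare Access team."""
    return f"https://{team_domain}.cloudflareaccess.com/cdn-cgi/access/certs"


def claims_problems(
    claims: dict[str, Any],
    team_domain: str,
    app_aud: str,
    now: Optional[float] = None,
) -> list[str]:
    """Check aud, iss, and exp on decoded claims. Does NOT verify the signature."""
    now = time.time() if now is None else now
    problems = []
    aud = claims.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]   # tolerate a single string as well as a list
    if app_aud not in aud:
        problems.append("aud does not contain the application AUD tag")
    if claims.get("iss") != f"https://{team_domain}.cloudflareaccess.com":
        problems.append("iss does not match the team domain")
    if float(claims.get("exp", 0)) <= now:
        problems.append("token is expired")
    return problems
```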

Verify

# Without auth → should redirect to CF Access login
curl -I https://api.qgtmai.com/api/v1/risk

# With X-QGTM-API-Key (legacy path) → should still work
curl -H "X-QGTM-API-Key: $KEY" https://api.qgtmai.com/api/v1/risk

Browse to https://api.qgtmai.com/api/v1/risk — you should hit a CF Access login page, authenticate, and get the JSON.

Switch the frontend (final step)

Once CF Access works:

Remove NEXT_PUBLIC_QGTM_API_KEY from .github/workflows/deploy.yml
Update qgtm_web/src/lib/api.ts to drop the X-QGTM-API-Key header
Rebuild + redeploy
Rotate the old API key on the server — qgtm_api_key=$(openssl rand -hex 32)
Update server-to-server callers (daemon ingest, monitoring) with the new key

The legacy X-QGTM-API-Key path stays available for programmatic use (CI, scripts, the daemon's own state ingest).

6. Pattern Day Trader (PDT) flag

Symptom

/api/v1/account shows "pattern_day_trader": true
daytrade_count ≥ 2

Why it matters

Alpaca flags an account as a pattern day trader after 4 or more day trades within 5 trading days. Below $25k of account equity the flag hard-restricts trading. You're at $95k, so still operational, but PDT-flagged accounts are watched more closely and some intraday tactics incur higher costs through forced T+2 settlement.
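The two thresholds in that paragraph combine into three practical states. A small sketch follows, with hypothetical names and labels; it only restates the 4-day-trade and $25k figures given above and does not query Alpaca.

```python
PDT_DAY_TRADE_THRESHOLD = 4       # day trades within 5 trading days
PDT_MIN_EQUITY_USD = 25_000.0     # below this, a flagged account is restricted


def pdt_status(daytrade_count: int, equity_usd: float) -> str:
    """Classify an account from its day-trade count and equity."""
    if daytrade_count < PDT_DAY_TRADE_THRESHOLD:
        return "below_threshold"
    if equity_usd < PDT_MIN_EQUITY_USD:
        return "flagged_restricted"
    return "flagged_operational"


def day_trades_remaining(daytrade_count: int) -> int:
    """How many more day trades fit before the threshold is reached."""
    return max(0, PDT_DAY_TRADE_THRESHOLD - daytrade_count)
```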

What to do

Audit recent intraday/flatten activity — the flatten cascade we saw on 2026-05-13 is the main contributor
Reduce intraday strategy weight (currently capped at 10% via INTRADAY_CAPITAL_FRACTION)
Avoid emergency flattens (the PR-1 work targets this directly)
Alpaca will reset the PDT flag after a few weeks of no day-trades

7. Useful one-liners

# 14 days of P&L by day (from order history, requires X-QGTM-API-Key)
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  "http://142.93.1.195:8000/api/v1/orders?limit=400" | \
  jq '[.orders[] | select(.status=="filled") | {date: (.submittedAt[:10]), pnl: (...)}] | group_by(.date)'

# Count emergency flatten events in last 30 days (orders with bare-UUID
# clientOrderId clusters, ≥ 4 within 60s)
curl ... | python3 scripts/count_flatten_clusters.py

# Tail daemon logs filtered to escalations only
ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' | \
  grep -E 'KILL_SWITCH_|DEAD_MAN_|RECONCILIATION_CRITICAL|DRAWDOWN_'

# Direct droplet curl bypassing Cloudflare
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  "http://142.93.1.195:8000/$ENDPOINT"
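The flatten-cluster heuristic mentioned in the comments above (orders with a bare-UUID `clientOrderId`, at least 4 within 60 s) can be sketched as follows. This is not the contents of `scripts/count_flatten_clusters.py`, which are not shown in this runbook. The field names come from the `jq` filter above; the second-resolution timestamp parsing and the assumption that all timestamps share one timezone are simplifications.

```python
import re
from datetime import datetime, timedelta
from typing import Any

BARE_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def parse_ts(ts: str) -> datetime:
    """Parse the leading YYYY-MM-DDTHH:MM:SS part; fractions and zone are dropped."""
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")


def count_flatten_clusters(
    orders: list[dict[str, Any]], window_s: int = 60, min_orders: int = 4
) -> int:
    """Count groups of >= min_orders bare-UUID orders submitted within window_s."""
    times = sorted(
        parse_ts(o["submittedAt"])
        for o in orders
        if BARE_UUID.match(o.get("clientOrderId") or "")
    )
    window = timedelta(seconds=window_s)
    clusters = 0
    i = 0
    while i < len(times):
        j = i
        # Extend while the next order is within the window of the cluster's first order
        while j + 1 < len(times) and times[j + 1] - times[i] <= window:
            j += 1
        if j - i + 1 >= min_orders:
            clusters += 1
            i = j + 1   # skip past the whole cluster
        else:
            i += 1      # slide the window start forward by one order
    return clusters
```

Sliding the start forward by one order when a window is too small avoids missing a cluster that begins just after an isolated order.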

8. Long-term — what to set up next

[ ] SSH read-only user for Claude — adm group, sudo limited to systemctl status/restart qgtm-daemon. Removes blast risk while enabling on-call diagnostics.
[ ] Staging droplet — same code, fake Alpaca account, lets risky changes (shadow-portfolio refactor, OMS routing) be validated against real-shaped data before touching production.
[ ] Cloudflare Access (see §5) — replaces frontend API key.
[ ] Sentry integration — daemon errors should hit a tool, not bury in journalctl.
[ ] DigitalOcean Snapshots automation — daily snapshot of the droplet so we can revert in under 30 s for any reason.
