
QGTM Operator Runbook

Last updated: 2026-05-14
Audience: anyone with SSH or API-key access to api.qgtmai.com.

This is the playbook for the situations that actually happen — dead-man trips, Cloudflare WAF outages, deploy failures, kill-switch escalations. Keep this open next to your terminal during incidents.


0. Quick reference

What                     | Endpoint / command
Health (no auth)         | curl http://142.93.1.195:8000/health
Daemon telemetry (owner) | curl -H "X-QGTM-API-Key: $KEY" .../api/v1/daemon/telemetry
Reset dead-man           | POST /api/v1/daemon/reset-dead-man (PR #43)
Emergency flatten        | POST /api/v1/daemon/flatten
Restart daemon           | ssh root@142.93.1.195 'systemctl restart qgtm-daemon'
Tail logs                | ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f'

If the UI is challenge-walled, bypass the Cloudflare WAF by hitting the droplet directly:

curl -sk -H "Host: api.qgtmai.com" \
  -H "X-QGTM-API-Key: $KEY" \
  http://142.93.1.195:8000/api/v1/daemon/telemetry
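
If you run the direct-droplet call often, a small shell function saves retyping during an incident. The sketch below is an assumption, not part of the QGTM tooling: the names qgtm_direct and QGTM_ORIGIN are hypothetical, and it expects $KEY to be set as in the commands above.

```shell
# Hypothetical wrapper around the direct-droplet call shown above.
# QGTM_ORIGIN is a name introduced here; it defaults to the droplet address.
QGTM_ORIGIN="${QGTM_ORIGIN:-http://142.93.1.195:8000}"

# Usage: qgtm_direct <path> [extra curl args...]
qgtm_direct() {
  local path="$1"
  shift
  # Send the public Host header so the origin routes the request as usual;
  # fail loudly if KEY is not set rather than sending an empty key.
  curl -sk \
    -H "Host: api.qgtmai.com" \
    -H "X-QGTM-API-Key: ${KEY:?set KEY to your API key}" \
    "$@" \
    "${QGTM_ORIGIN}${path}"
}
```

Example: qgtm_direct /api/v1/daemon/telemetry | jq '.dead_man_switch'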


1. Dead-man's switch tripped

Symptom

  • Telegram alert DEAD_MAN_SWITCH_TRIPPED or DEAD_MAN_STILL_TRIPPED
  • All positions flatten within ~2 seconds of the trip
  • /api/v1/daemon/telemetry returns "daemon_alive": false
  • dead_man_switch.state == "TRIPPED"

Decide first: re-enter or stay flat?

Look at telemetry before restarting. If:

  • Equity drop > 0.5% on the day → stay flat for the session
  • Reconciliation was clean before the trip → re-enter is safer
  • Multiple trips this week → stay flat, root-cause first

The PR-1 phase timeouts mean the next trip would only come from a >300 s synchronous block. If you don't know why this one tripped, expect it to trip again.
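
The decision rules above can be written down as a small shell helper, which is handy as a checklist when you are under pressure. This is only an illustration: the function name decide_reentry and its three arguments are assumptions, the inputs still have to be read from telemetry by hand, and the conservative default when no rule matches is a choice made here, not something the daemon enforces.

```shell
# Hypothetical helper: encodes the stay-flat / re-enter rules from this section.
# Args: <equity_drop_pct_today> <recon_was_clean: true|false> <trips_this_week>
decide_reentry() {
  local drop_pct="$1" recon_clean="$2" trips="$3"
  # Equity drop > 0.5% on the day: stay flat for the session.
  if awk -v d="$drop_pct" 'BEGIN { exit !(d > 0.5) }'; then
    echo "stay_flat"
    return
  fi
  # More than one trip this week (counting this one): root-cause first.
  if [ "$trips" -gt 1 ]; then
    echo "stay_flat"
    return
  fi
  # Reconciliation was clean before the trip: re-entering is the safer case.
  if [ "$recon_clean" = "true" ]; then
    echo "re_enter"
    return
  fi
  # No rule matched: default to the conservative choice (an assumption).
  echo "stay_flat"
}
```

Example: decide_reentry 0.2 true 1 prints re_enter, while decide_reentry 0.7 true 1 prints stay_flat.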

Step-by-step reset

  1. Confirm the trip via direct droplet (Cloudflare may be in the way):

    curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
      http://142.93.1.195:8000/api/v1/daemon/telemetry | jq '.dead_man_switch, .heartbeat'
    

  2. Pull the last 300 daemon log lines from the 15 minutes leading up to now:

    ssh root@142.93.1.195 \
      "journalctl -u qgtm-daemon --since '15 minutes ago' --until 'now' | tail -300"
    
    Look for: bars_batch_timeout, enrichment_gather_timeout, no_signals_generated_from_any_strategy, redis_unavailable, uncaught exception tracebacks.

  3. Decide stay-flat or re-enter (see above).

  4. If staying flat for the session: leave the daemon down — it's already flat. Resume next session by following step 5.

  5. If re-entering, do a clean restart:

    ssh root@142.93.1.195 'systemctl restart qgtm-daemon'
    
    Since PR #40, the restart runs a mandatory startup reconciliation against the broker before opening any positions. Watch for it with:

    ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' | \
      grep -E 'startup_reconciliation|RECONCILIATION_CRITICAL|signal_'
    

  6. Verify the daemon is healthy after ~60 seconds:

    curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
      http://142.93.1.195:8000/api/v1/daemon/telemetry | \
      jq '.heartbeat.is_alive, .risk.kill_tier, .reconciliation.is_clean'
    
    Expect: true, "NORMAL", true. Anything else: stop, ask why.
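
The step 6 check can be made mechanical by piping the jq output into a small verifier. The sketch below is hypothetical (check_health_triplet is a name introduced here) and assumes the three values arrive one per line in the order of the jq filter in step 6, with the tier still JSON-quoted because jq is run without -r.

```shell
# Hypothetical helper: reads the three jq output lines from step 6 on stdin
# and succeeds only if they are exactly: true, "NORMAL", true.
check_health_triplet() {
  local alive tier clean
  IFS= read -r alive || true
  IFS= read -r tier || true
  IFS= read -r clean || true
  if [ "$alive" = "true" ] && [ "$tier" = '"NORMAL"' ] && [ "$clean" = "true" ]; then
    echo "healthy"
  else
    # Print what was seen so the operator knows which field to chase.
    echo "unhealthy: is_alive=$alive kill_tier=$tier is_clean=$clean"
    return 1
  fi
}
```

Append "| check_health_triplet" to the step 6 command; a non-zero exit status means stop and ask why.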

Step-by-step reset without systemd (operator endpoint)

When PR #43's operator-reset endpoint is wired in:

curl -X POST -H "X-QGTM-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"approver": "naz", "reason": "post-trip investigation complete"}' \
  https://api.qgtmai.com/api/v1/daemon/reset-dead-man


2. Cloudflare WAF challenging API traffic

Symptom

  • curl https://api.qgtmai.com/health returns the "Just a moment..." HTML
  • The web dashboard at qgtmai.com is empty or shows stale data
  • Daemon itself may still be healthy (check via droplet IP)

Step-by-step

  1. Confirm it's a WAF issue, not a daemon issue:

    curl -sk -H "Host: api.qgtmai.com" http://142.93.1.195:8000/health
    
    If this returns {"status": "ok"} then the daemon is fine and only Cloudflare is blocking.

  2. In the Cloudflare dashboard:

       • Security → WAF → Custom rules
       • Find any rule matching api.qgtmai.com and /api/v1/*
       • Either add a bypass for requests carrying the X-QGTM-API-Key header, or temporarily lower the security level for /api/v1/* to "Essentially Off"
       • The historical fix has been to lower from Bot Fight Mode → Off for the /api/v1/* path

  3. Verify the fix:

    curl https://api.qgtmai.com/health
    
    Expect: {"status": "ok", ...} with no challenge HTML.

  4. Long-term: migrate browser auth to Cloudflare Access (PR 2). Then the WAF can be permissive on /api/v1/* because CF Access provides the auth layer.
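
The triage in step 1 compares two responses: one through Cloudflare and one straight from the droplet. A minimal sketch of that comparison follows. The function name classify_outage is hypothetical, and the matching is deliberately crude: it only looks for the "Just a moment" challenge text and a "status": "ok" field, as described in this section.

```shell
# Hypothetical helper: classify an outage from two /health response bodies.
#   $1 = body returned via https://api.qgtmai.com/health (through Cloudflare)
#   $2 = body returned via the droplet IP (step 1)
classify_outage() {
  local edge_body="$1" origin_body="$2"
  local origin_ok=no edge_challenged=no
  # The origin counts as healthy if its JSON reports status "ok".
  if printf '%s' "$origin_body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'; then
    origin_ok=yes
  fi
  # The Cloudflare challenge page carries the "Just a moment..." text.
  if printf '%s' "$edge_body" | grep -q 'Just a moment'; then
    edge_challenged=yes
  fi
  if [ "$origin_ok" = no ]; then
    echo "daemon"   # the origin itself is not answering correctly
  elif [ "$edge_challenged" = yes ]; then
    echo "waf"      # daemon is fine, only Cloudflare is blocking
  else
    echo "ok"
  fi
}
```

Example: classify_outage "$(curl -s https://api.qgtmai.com/health)" "$(curl -sk -H 'Host: api.qgtmai.com' http://142.93.1.195:8000/health)"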


3. Kill-switch escalated above NORMAL

Tiers (qgtm_risk.manager.KillTier)

Tier     | What it means         | Auto-action
NORMAL   | All clear             | None
WARN     | Threshold approaching | Logged, alerted
THROTTLE | Reduce sizes 50%      | Positions still open; new orders scaled down
NO_NEW   | Block opens           | Only closes/reductions allowed
FLATTEN  | Emergency             | Close everything via market orders
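
When scripting around telemetry it helps to treat the tiers as an ordered scale. The sketch below mirrors the table only. It is not the qgtm_risk.manager.KillTier implementation, the function names are hypothetical, and it assumes severity increases in the order the table lists the tiers.

```shell
# Illustrative only: maps each tier name from the table to a severity rank.
tier_rank() {
  case "$1" in
    NORMAL)   echo 0 ;;
    WARN)     echo 1 ;;
    THROTTLE) echo 2 ;;
    NO_NEW)   echo 3 ;;
    FLATTEN)  echo 4 ;;
    *)        echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Exit 0 if the tier still permits opening positions (per the table, NO_NEW
# blocks opens and FLATTEN closes everything); exit 2 on an unknown tier.
tier_allows_opens() {
  local rank
  rank="$(tier_rank "$1")" || return 2
  [ "$rank" -lt 3 ]
}
```

Example: tier_allows_opens THROTTLE succeeds (orders are scaled down but still allowed), while tier_allows_opens NO_NEW fails.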

Causes that escalate to WARN

  • Reconciliation: critical-discrepancy count > 0 (consecutive)
  • Empty-signal cycles ≥ 2 (PR-1 fix — was silently flattening before)
  • Cost gate rejecting 100% of orders for ≥ 3 cycles (PR-1 fix)
  • Redis unavailable > 30 s
  • Drawdown soft breach (5% from peak)
  • Startup reconciliation critical (PR #40 fix)

What to do at WARN

  1. Run telemetry and check risk.kill_history for the last 5 escalations
  2. Match against the triggers above — risk.kill_history[].reason is verbose
  3. Most WARN triggers self-clear: a clean reconcile, signals returning, etc. give the manager an explicit _de_escalate next cycle
  4. If WARN persists > 30 minutes, restart the daemon (forces startup recon)

Manual reset

# Telegram bot command (preferred)
/kill_reset NORMAL "manual investigation complete - naz"

or via API:

curl -X POST -H "X-QGTM-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"approver":"naz","reason":"investigated, all clear"}' \
  https://api.qgtmai.com/api/v1/risk/kill-tier/reset


4. Deploy + rollback

Standard deploy (via GitHub Actions on push to main)

  1. Open PR, run CI, merge to main
  2. .github/workflows/deploy.yml runs:
       • Web: builds Next.js → uploads to Cloudflare Pages
       • API: SSHes to droplet, git pull, pip install, systemctl restart qgtm-daemon
       • Health-check loop hits http://127.0.0.1:8000/health for up to 60 s
       • If the health check fails, the workflow rolls back to the previous commit and restarts again
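
For manual deploys, the same "poll /health until a deadline, otherwise roll back" pattern can be reproduced with a short loop. This is a sketch of the pattern, not the contents of deploy.yml. The function name wait_for_health and its arguments are assumptions.

```shell
# Hypothetical sketch of a health-check loop with a deadline.
# Usage: wait_for_health <timeout_seconds> <interval_seconds> <probe command...>
wait_for_health() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    # Succeed as soon as the probe command exits 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Example: wait_for_health 60 2 curl -fsS http://127.0.0.1:8000/health || echo "health check failed, roll back"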

Manual deploy (if Actions are unavailable)

ssh root@142.93.1.195 << 'EOF'
set -e  # abort on the first failing step instead of restarting on a half-applied update
cd /srv/qgtm
git fetch origin
git checkout main
git pull --ff-only
source .venv/bin/activate
pip install -e .
systemctl restart qgtm-daemon
sleep 30
curl -fsS http://127.0.0.1:8000/health
EOF

Emergency rollback (deploy broke production)

ssh root@142.93.1.195 << 'EOF'
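set -e  # abort on the first failing step
cd /srv/qgtm
# Second-newest non-WIP commit among the last 20 (assumes the newest one is the bad deploy)
LAST_GOOD=$(git log --oneline -20 | grep -v 'WIP' | head -2 | tail -1 | cut -d' ' -f1)
git checkout "$LAST_GOOD"
source .venv/bin/activate  # same virtualenv the manual deploy uses
pip install -e .
systemctl restart qgtm-daemon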
EOF

Then, from a local checkout, revert the bad commit so that main stays consistent with the deployed state:

git revert HEAD --no-edit
git push origin main
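
The LAST_GOOD pipeline in the rollback script is terse, so here is the same selection logic pulled out into a function you can dry-run against pasted git log --oneline output before checking anything out. The name pick_last_good is hypothetical. Like the original pipeline, it assumes the newest non-WIP commit is the broken one and returns the one just before it.

```shell
# Hypothetical helper: same selection as the LAST_GOOD pipeline, reading
# `git log --oneline` output on stdin instead of calling git directly.
pick_last_good() {
  # Drop WIP commits, keep the two newest survivors, take the older of the
  # two, then print only its abbreviated hash.
  grep -v 'WIP' | head -2 | tail -1 | cut -d' ' -f1
}
```

Example: git log --oneline -20 | pick_last_good. One caveat: if only a single non-WIP commit is in the window, the pipeline returns that commit itself, so check the hash before running git checkout.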


5. Cloudflare Access (CF Access) setup — when you're ready to retire the frontend API key

PR 2 added backend support for CF Access JWTs. To turn it on:

One-time setup (Cloudflare dashboard)

  1. Cloudflare → Zero Trust → Access → Applications → Add an application
       • Type: Self-hosted
       • Application name: "QGTM API"
       • Application domain: api.qgtmai.com
       • Path: /api/v1/* (leave /health open for monitoring)
       • Identity providers: enable Email OTP + Google + GitHub
  2. Add policies:
       • Policy 1 "Founders": Include emails naz@qgtmai.com, mo@qgtmai.com
  3. Save. Note the AUD tag from the app config (long hex string)
  4. Note your team subdomain (e.g. qgtmai.cloudflareaccess.com)

Deploy backend with CF Access enabled

ssh root@142.93.1.195 << 'EOF'
cd /srv/qgtm
cat >> .env <<ENV
QGTM_CF_ACCESS_TEAM_DOMAIN=qgtmai
QGTM_CF_ACCESS_APP_AUD=<paste-AUD-from-cloudflare>
ENV
systemctl restart qgtm-daemon
systemctl restart qgtm-api  # if separate
EOF
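
Because the snippet above appends to .env with >>, it is easy to restart with the AUD placeholder still in place or with duplicate entries from a second run. A quick check before restarting might look like the sketch below. The function name check_cf_access_env is hypothetical, and it assumes plain KEY=value lines with no quoting.

```shell
# Hypothetical pre-restart check: both CF Access variables are present and
# neither still contains a <placeholder>.
check_cf_access_env() {
  local env_file="$1" key value
  for key in QGTM_CF_ACCESS_TEAM_DOMAIN QGTM_CF_ACCESS_APP_AUD; do
    # Inspect the last assignment, since the deploy snippet appends with >>.
    value="$(grep -E "^${key}=" "$env_file" | tail -1 | cut -d= -f2-)"
    if [ -z "$value" ]; then
      echo "missing: $key"
      return 1
    fi
    case "$value" in
      *'<'*|*'>'*)
        echo "placeholder not replaced: $key"
        return 1
        ;;
    esac
  done
  echo "ok"
}
```

Example: check_cf_access_env /srv/qgtm/.env && systemctl restart qgtm-daemon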

Verify

# Without auth → should redirect to CF Access login
curl -I https://api.qgtmai.com/api/v1/risk

# With X-QGTM-API-Key (legacy path) → should still work
curl -H "X-QGTM-API-Key: $KEY" https://api.qgtmai.com/api/v1/risk

Browse to https://api.qgtmai.com/api/v1/risk — you should hit a CF Access login page, authenticate, and get the JSON.

Switch the frontend (final step)

Once CF Access works:

  1. Remove NEXT_PUBLIC_QGTM_API_KEY from .github/workflows/deploy.yml
  2. Update qgtm_web/src/lib/api.ts to drop the X-QGTM-API-Key header
  3. Rebuild + redeploy
  4. Rotate the old API key on the server: qgtm_api_key=$(openssl rand -hex 32)
  5. Update server-to-server callers (daemon ingest, monitoring) with the new key

The legacy X-QGTM-API-Key path stays available for programmatic use (CI, scripts, the daemon's own state ingest).


6. Pattern Day Trader (PDT) flag

Symptom

  • /api/v1/account shows "pattern_day_trader": true
  • daytrade_count ≥ 2

Why it matters

Alpaca classifies an account as PDT after 4 day-trades in 5 trading days. Below $25k this hard-restricts trading. You're at $95k so still operational, but PDT-flagged accounts are watched more closely and some intraday tactics incur higher costs through forced T+1 settlement.
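
The rule above reduces to two comparisons, sketched here so the thresholds are explicit. The function name pdt_status and its output labels are hypothetical. The 4-day-trade and $25k thresholds are taken from the paragraph above, and the inputs are assumed to be the rolling 5-trading-day count and current account equity.

```shell
# Illustrative encoding of the PDT rule described above.
# Args: <day_trades_in_last_5_trading_days> <account_equity_usd>
pdt_status() {
  local daytrades_5d="$1" equity_usd="$2"
  if [ "$daytrades_5d" -lt 4 ]; then
    echo "not_flagged"
  elif awk -v e="$equity_usd" 'BEGIN { exit !(e < 25000) }'; then
    # Flagged and under $25k: trading is hard-restricted.
    echo "flagged_restricted"
  else
    # Flagged but at or above $25k: still operational.
    echo "flagged_operational"
  fi
}
```

Example: pdt_status 4 95000 prints flagged_operational, which matches the current situation described above.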

What to do

  • Audit recent intraday/flatten activity — the flatten cascade we saw on 2026-05-13 is the main contributor
  • Reduce intraday strategy weight (currently capped at 10% via INTRADAY_CAPITAL_FRACTION)
  • Avoid emergency flattens (the PR-1 work targets this directly)
  • Alpaca will reset the PDT flag after a few weeks of no day-trades

7. Useful one-liners

# 14 days of P&L by day (from order history, requires X-QGTM-API-Key)
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  "http://142.93.1.195:8000/api/v1/orders?limit=400" | \
  jq '[.orders[] | select(.status=="filled") | {date: (.submittedAt[:10]), pnl: (...)}] | group_by(.date)'

# Count emergency flatten events in last 30 days (orders with bare-UUID
# clientOrderId clusters, ≥ 4 within 60s)
curl ... | python3 scripts/count_flatten_clusters.py

# Tail daemon logs filtered to escalations only
ssh root@142.93.1.195 'journalctl -u qgtm-daemon -f' | \
  grep -E 'KILL_SWITCH_|DEAD_MAN_|RECONCILIATION_CRITICAL|DRAWDOWN_'

# Direct droplet curl bypassing Cloudflare
curl -sk -H "Host: api.qgtmai.com" -H "X-QGTM-API-Key: $KEY" \
  http://142.93.1.195:8000/$ENDPOINT
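
The "≥ 4 within 60s" cluster rule used for counting flatten events can be illustrated with a few lines of awk. This is not scripts/count_flatten_clusters.py, whose contents are not shown here. It is a sketch that assumes you have already extracted one submission time per line as epoch seconds, skips the bare-UUID clientOrderId filter, and measures each 60 s window from the first order in the cluster.

```shell
# Hypothetical sketch: count clusters of >= 4 timestamps within 60 seconds.
# Input: one epoch-seconds timestamp per line on stdin (any order).
count_clusters() {
  sort -n | awk '
    NR == 1          { start = $1; n = 1; next }   # first order opens a cluster
    $1 - start <= 60 { n++; next }                 # still inside the 60 s window
    { if (n >= 4) clusters++; start = $1; n = 1 }  # window closed, open a new one
    END { if (n >= 4) clusters++; print clusters + 0 }
  '
}
```

Example: printf '%s\n' 0 10 20 30 100 | count_clusters prints 1, because the first four orders fall within 60 s of each other and the fifth does not.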

8. Long-term — what to set up next

  • [ ] SSH read-only user for Claude: adm group, sudo limited to systemctl status/restart qgtm-daemon. Reduces blast risk while enabling on-call diagnostics.
  • [ ] Staging droplet — same code, fake Alpaca account, lets risky changes (shadow-portfolio refactor, OMS routing) be validated against real-shaped data before touching production.
  • [ ] Cloudflare Access (see §5) — replaces frontend API key.
  • [ ] Sentry integration: daemon errors should land in a dedicated tool, not get buried in journalctl.
  • [ ] DigitalOcean Snapshots automation — daily snapshot of the droplet so we can revert in under 30 s for any reason.