Chapter 9: Operating the System
Two people run this platform. One builds it, one watches the dashboard. This chapter is the operating manual for both.
Reading time: 20 min | Difficulty: Advanced | Prerequisites: Chapters 5-8
Two Operators, Two Interfaces
This system has two operators with fundamentally different responsibilities:
| Role | Codename | Access Level | Primary Tool | Time Commitment |
|---|---|---|---|---|
| System Architect | OpA | Full SSH, code, config | Terminal + Grafana | 30 min/day + on-call |
| Business Partner | OpB | Read-only dashboard | Phone browser | 5 min/day glance |
Why two roles matter: No single person should be able to both modify the system AND authorize risk changes. This is the two-person rule borrowed from nuclear launch protocols and institutional fund management. OpA proposes changes; OpB confirms critical actions like kill-switch resets.
OpA: The Daily Workflow
Morning (6:45 AM ET, before US pre-market)
This takes 10-15 minutes. Do it the same way every day.
Step 1: Check overnight health.
# SSH into the server, pull the morning summary
qgtm status --summary
# Expected output:
# ╔══════════════════════════════════════════╗
# ║ QGTM MORNING STATUS — 2026-04-12 ║
# ╠══════════════════════════════════════════╣
# ║ Daemon: RUNNING (uptime 14d 7h) ║
# ║ Heartbeat: 12s ago ║
# ║ Strategies: 29/29 loaded ║
# ║ Broker: CONNECTED (Alpaca) ║
# ║ Positions: 7 open, $247,300 notional ║
# ║ P&L (1d): +$1,412 (+0.57%) ║
# ║ Drawdown: -2.3% from peak ║
# ║ Signals: 3 pending (GLD, SLV, GDX) ║
# ║ Risk: ALL GREEN ║
# ╚══════════════════════════════════════════╝
Step 2: Review pending signals. The system generated signals overnight from Asian session data and pre-market futures. Review them in Grafana or the terminal.
Each signal shows: strategy source, confidence score (0-100), meta-label probability, position size recommendation, and the pre-trade compliance result (PASS/FAIL with reason).
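The review described above can be sketched as a simple gate over each pending signal's fields. This is an illustrative sketch, not the system's actual schema: the function name `approve_signal`, the field names, and the thresholds (confidence 60, meta-label probability 0.55) are all assumptions.

```python
def approve_signal(signal: dict, min_confidence: int = 60,
                   min_meta_prob: float = 0.55) -> tuple[bool, str]:
    """Return (approved, reason) for one pending signal."""
    # Compliance is a hard gate: a FAIL is never overridden by confidence.
    if signal["compliance"] != "PASS":
        return False, f"compliance: {signal.get('compliance_reason', 'FAIL')}"
    if signal["confidence"] < min_confidence:
        return False, f"confidence {signal['confidence']} below {min_confidence}"
    if signal["meta_prob"] < min_meta_prob:
        return False, f"meta-label prob {signal['meta_prob']:.2f} below {min_meta_prob}"
    return True, "ok"

pending = [
    {"symbol": "GLD", "confidence": 74, "meta_prob": 0.61, "compliance": "PASS"},
    {"symbol": "SLV", "confidence": 58, "meta_prob": 0.63, "compliance": "PASS"},
    {"symbol": "GDX", "confidence": 81, "meta_prob": 0.66, "compliance": "FAIL",
     "compliance_reason": "position limit"},
]
for s in pending:
    ok, why = approve_signal(s)
    print(s["symbol"], "APPROVE" if ok else f"REJECT ({why})")
```

In this sketch, GLD is approved while SLV (low confidence) and GDX (compliance fail) are rejected with a reason string the operator can read at a glance.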
Step 3: Check data freshness. Stale data is the silent killer of systematic trading.
qgtm data --freshness
# Shows last-updated timestamp for every feed:
# OHLCV (Alpaca): 2026-04-11 20:00 UTC ✓ fresh
# COT (CFTC): 2026-04-08 (weekly) ✓ on schedule
# FRED macro: 2026-04-10 ✓ fresh
# Options vol: 2026-04-11 21:00 UTC ✓ fresh
# Sentiment (Reddit): 2026-04-12 06:30 UTC ✓ fresh
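A freshness check like the one above reduces to comparing each feed's last update against its expected cadence. The cadences below mirror the feeds shown in the output; the function name, dictionary structure, and exact slack values are assumptions.

```python
from datetime import datetime, timedelta

# Expected maximum age per feed; COT gets a day of slack past its
# weekly release, and these exact values are assumptions.
CADENCE = {
    "ohlcv": timedelta(days=1),
    "cot": timedelta(days=8),
    "fred": timedelta(days=3),
    "options_vol": timedelta(days=1),
    "sentiment": timedelta(hours=12),
}

def feed_status(feed: str, last_update: datetime, now: datetime) -> str:
    """Classify a feed as fresh or stale by age vs. its cadence."""
    return "fresh" if now - last_update <= CADENCE[feed] else "stale"

now = datetime(2026, 4, 12, 11, 0)
print(feed_status("cot", datetime(2026, 4, 8), now))              # → fresh
print(feed_status("sentiment", datetime(2026, 4, 12, 6, 30), now))  # → fresh
```

The point of per-feed cadences is that "stale" means different things for a weekly COT release than for an intraday sentiment scrape; a single global threshold would either miss stale daily feeds or constantly flag healthy weekly ones.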
Step 4: Scan Grafana dashboards. Three panels matter:
- System Health — CPU, memory, API rate limits, queue depth
- Strategy Performance — Rolling Sharpe by strategy, regime indicator
- Risk Dashboard — Current drawdown, VaR utilization, correlation heatmap
If everything is green, you are done. Total time: 10-15 minutes.
The Sunday Night Decision Screen
Sunday at 8 PM ET, before Asian markets open, OpA runs the weekly review, which produces a structured report covering:
- Week-in-review: P&L attribution by strategy, largest winners/losers, execution quality (slippage vs. estimate)
- Regime assessment: Current HMM state (risk-on / risk-off / crisis), BOCPD change-point probability, vol regime (low / normal / elevated / crisis)
- Allocation recommendation: The optimizer's suggested weight shifts for the coming week based on updated skill scores and regime probabilities
- Calendar risks: FOMC, CPI, NFP, options expiry dates for the week ahead
- Action items: Data feed renewals, strategy parameter reviews due, any code deployments queued
OpA reviews this, decides whether to accept the optimizer's allocation recommendation or override (with logged justification), and confirms. The system applies new weights at Sunday 9 PM ET, two hours before CME Globex opens.
OpB: The Phone Dashboard
OpB opens a single URL on their phone. The dashboard shows:
The Health Dot
A single colored circle in the top-left corner. This is the most important indicator on the screen.
| Color | Meaning | Action Required |
|---|---|---|
| Green | All systems normal, P&L within expectations | None. Check back tomorrow. |
| Yellow | Minor issue — one data feed stale, or drawdown approaching soft cap | Read the one-line description below the dot. No immediate action needed. |
| Orange | Elevated concern — drawdown past soft cap, or two+ feeds stale, or a strategy disabled | Read the detail panel. Text/call OpA if no response in 2 hours. |
| Red | Kill-switch triggered, or broker disconnected, or critical system failure | The system has already halted trading. Contact OpA immediately. If OpA unreachable for 30 min, follow the dead-man's switch protocol (Chapter 6). |
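The table above is a priority ordering: red conditions are checked first, then orange, then yellow. A minimal sketch of that logic, with illustrative field names that are assumptions about the health payload:

```python
def health_dot(state: dict) -> str:
    """Map system state to the dashboard dot color, worst condition first."""
    if (state["killswitch_triggered"] or not state["broker_connected"]
            or state["critical_failure"]):
        return "red"
    if (state["drawdown_past_soft_cap"] or state["stale_feeds"] >= 2
            or state["strategies_disabled"] > 0):
        return "orange"
    if state["stale_feeds"] == 1 or state["drawdown_near_soft_cap"]:
        return "yellow"
    return "green"

nominal = dict(killswitch_triggered=False, broker_connected=True,
               critical_failure=False, drawdown_past_soft_cap=False,
               stale_feeds=0, strategies_disabled=0,
               drawdown_near_soft_cap=False)
print(health_dot(nominal))                                  # → green
print(health_dot({**nominal, "stale_feeds": 1}))            # → yellow
print(health_dot({**nominal, "broker_connected": False}))   # → red
```

Ordering matters: a kill-switch trigger with one stale feed must show red, not yellow, so the checks cascade from most to least severe.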
What OpB Sees Below the Dot
- Account value (updated every 60s during market hours)
- Today's P&L in dollars and percent
- Drawdown from peak as a visual bar (green/yellow/red zones)
- Number of open positions and total notional
- Last signal with timestamp, direction, and symbol
- System uptime in days
When to Worry (OpB Decision Tree)
Dot is green? → Close the app. Everything is fine.
Dot is yellow? → Read the description. Probably a stale feed.
If still yellow tomorrow morning, mention it to OpA.
Dot is orange? → Read the detail panel.
Text OpA: "Dashboard orange — [one-line reason]"
Dot is red? → Trading is halted. Call OpA.
If no answer in 30 min: follow the dead-man's switch protocol (Chapter 6).
The system is already in safe mode. No money is at risk.
Starting the Daemon
When OpA starts (or restarts) the system, here is exactly what happens:
Boot Sequence (15-30 seconds)
[T+0s] Daemon process starts
→ Load configuration from qgtm_config.toml
→ Validate all API keys (Alpaca, FRED, data vendors)
[T+2s] Strategy initialization
→ Load all 29 strategy instances
→ Each strategy loads its trained model/parameters
→ Verify parameter checksums match last deployment
[T+5s] Broker connection
→ Connect to Alpaca via WebSocket
→ Authenticate, verify account status
→ Pull current positions and open orders
[T+8s] Reconciliation
→ Compare broker positions vs. internal state
→ If mismatch: LOG WARNING, do NOT auto-correct
→ OpA must manually reconcile any discrepancy
[T+12s] Data pipeline warm-up
→ Fetch last 252 trading days of OHLCV (cached locally)
→ Pull latest COT, FRED, sentiment data
→ Calculate all features for current bar
[T+18s] Risk system activation
→ Load current drawdown state
→ Initialize VaR calculator
→ Set kill-switch thresholds from config
[T+20s] Heartbeat begins
→ First heartbeat sent to monitoring
→ Dashboard health dot turns green
→ System is live
Critical Rule: Reconciliation Never Auto-Corrects
If the daemon finds a position mismatch on startup (e.g., broker shows 500 GLD shares but internal state says 400), it does NOT automatically adjust. It logs the discrepancy, sends an alert, and waits for OpA to investigate. This prevents the scenario where a partial fill or manual override gets steamrolled by automation.
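The report-but-never-correct rule above is simple to state in code. This sketch is illustrative: the function name and return format are assumptions, but the invariant it demonstrates (mismatches are surfaced, never mutated) is the one the text requires.

```python
def reconcile(broker: dict[str, int], internal: dict[str, int]) -> list[str]:
    """Return human-readable discrepancies; the caller must NOT auto-correct."""
    issues = []
    for sym in sorted(set(broker) | set(internal)):
        b, i = broker.get(sym, 0), internal.get(sym, 0)
        if b != i:
            issues.append(f"{sym}: broker={b} internal={i} (delta {b - i})")
    return issues

# The GLD example from the text: broker shows 500, internal state says 400.
broker_state = {"GLD": 500, "SLV": 1200}
internal_state = {"GLD": 400, "SLV": 1200}
for issue in reconcile(broker_state, internal_state):
    print("WARNING position mismatch:", issue)
# The daemon stops here and waits for OpA; it never adjusts either side.
```

Note that the function is pure: it takes two snapshots and returns strings. Keeping reconciliation side-effect-free makes "do NOT auto-correct" a structural property rather than a convention.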
Monitoring
Grafana Dashboards
Four pre-built dashboards ship with the system:
1. System Health (/grafana/d/system-health)
- Daemon uptime and heartbeat interval
- CPU, memory, disk usage
- API rate limit consumption (Alpaca: 200/min limit)
- Data pipeline latency (target: < 5s from market tick to feature calculation)
2. Strategy Performance (/grafana/d/strategy-perf)
- Rolling 63-day Sharpe ratio by strategy (sorted, colored by regime)
- Signal count heatmap (strategy x day)
- Win rate and profit factor by strategy (trailing 90 days)
- Skill score evolution (the adaptive weight the system assigns each strategy)
3. Risk Dashboard (/grafana/d/risk)
- Current portfolio drawdown with soft/hard cap lines
- VaR utilization (actual vs. limit)
- Cross-strategy correlation matrix (warns when strategies cluster)
- Regime indicator (HMM state probabilities as stacked area chart)
4. Execution Quality (/grafana/d/exec-quality)
- Slippage distribution (arrival price vs. fill price)
- Order latency (signal generation to order submission)
- Fill rate and partial fill frequency
- Cost attribution (spread + slippage + commission per trade)
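The slippage metric in the panel above is the signed difference between fill and arrival price. A small sketch, where the sign convention (positive = cost to us, for both buys and sells) is an assumption:

```python
def slippage_bps(side: str, arrival: float, fill: float) -> float:
    """Slippage in basis points; positive means the fill was worse than arrival."""
    # Buying higher than arrival costs money; selling lower than arrival costs money.
    signed = (fill - arrival) if side == "buy" else (arrival - fill)
    return 10_000 * signed / arrival

print(round(slippage_bps("buy", 100.00, 100.05), 2))   # → 5.0 (cost)
print(round(slippage_bps("sell", 100.00, 100.05), 2))  # → -5.0 (price improvement)
```

Normalizing to basis points of the arrival price lets the dashboard pool trades across instruments with very different price levels into one distribution.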
Signal Dashboard API
The web frontend pulls live data from the signal API:
GET /api/signals/active → Currently open signals with live P&L
GET /api/signals/pending → Signals awaiting market open
GET /api/signals/history → Historical signals with outcomes
GET /api/health → System health summary (powers the health dot)
GET /api/performance → Account-level performance metrics
Terminal Panels
For OpA, the terminal interface provides rapid access without opening a browser:
# Live P&L ticker (updates every second during market hours)
qgtm watch --pnl
# Strategy signal monitor (shows signals as they generate)
qgtm watch --signals
# Risk dashboard in terminal (curses-based)
qgtm watch --risk
# Combined "mission control" layout (tmux-based, 4 panels)
qgtm mission-control
The Kill-Switch
The kill-switch is the most important safety mechanism in the system. It exists to prevent catastrophic loss when something goes wrong — and in markets, things go wrong.
Three Trigger Levels
| Level | Trigger | Automatic Action | Manual Action Required |
|---|---|---|---|
| Soft Cap | Drawdown hits -8% from peak | Reduce position sizes by 50%. No new positions. | OpA reviews. Can override with logged justification. |
| Hard Cap | Drawdown hits -12% from peak | Flatten ALL positions at market. System enters cool-off. | OpA + OpB must both confirm to re-enable trading. |
| Dead-Man's Switch | No heartbeat for 15 minutes | Flatten all positions. Alert both operators. | Full system restart and reconciliation required. |
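The three levels in the table reduce to one ordered check. The thresholds come straight from the table (-8%, -12%, 15 minutes); the function name and the string action codes are illustrative assumptions.

```python
SOFT_CAP = -0.08            # -8% drawdown from peak
HARD_CAP = -0.12            # -12% drawdown from peak
HEARTBEAT_LIMIT_S = 15 * 60  # dead-man's switch: 15 minutes without heartbeat

def killswitch_check(drawdown: float, heartbeat_age_s: float) -> str:
    """Evaluate kill-switch state, most severe condition first."""
    if heartbeat_age_s > HEARTBEAT_LIMIT_S:
        return "DEAD_MAN: flatten all positions, alert both operators"
    if drawdown <= HARD_CAP:
        return "HARD_CAP: flatten all positions, enter cool-off"
    if drawdown <= SOFT_CAP:
        return "SOFT_CAP: halve position sizes, no new positions"
    return "OK"

print(killswitch_check(-0.023, 12))   # the morning-status example → OK
print(killswitch_check(-0.09, 12))    # past soft cap
print(killswitch_check(-0.13, 12))    # past hard cap
```

Checking the dead-man's condition first is deliberate: if the daemon has gone silent, the reported drawdown may itself be stale, so the heartbeat failure dominates.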
What Happens When the Hard Cap Triggers
- Immediate: All open orders cancelled. Market orders sent to flatten every position.
- T+1 minute: Confirmation that all positions are flat (or alert if any fill failed).
- T+5 minutes: Automated incident report generated — what positions were open, what the drawdown was, which strategies contributed most.
- Cool-off period: System will not trade for 24 hours minimum.
- Re-entry: After cool-off, OpA proposes a re-entry plan with reduced position sizing (50% of normal). OpB confirms via the dashboard (two-person rule). The system gradually scales back to full size over 5 trading days.
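One way to implement the 5-day scale-back is a linear ramp. The 50% starting size and 5-day span come from the text; the linear shape and function name are assumptions (the real system might step up discretely or gate each increase on daily P&L).

```python
def reentry_fraction(days_since_reset: int, ramp_days: int = 5) -> float:
    """Fraction of normal position size, ramping 50% → 100% over ramp_days."""
    if days_since_reset >= ramp_days:
        return 1.0
    return 0.5 + 0.5 * days_since_reset / ramp_days

for d in range(7):
    print(f"day {d}: {reentry_fraction(d):.0%} of normal size")
```

Day 0 sizes at 50%, day 5 and beyond at 100%, so the portfolio never jumps straight back to full exposure immediately after a hard-cap event.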
The Two-Person Reset
After a hard cap trigger, re-enabling trading requires:
- OpA runs `qgtm killswitch --request-reset` (generates a reset token)
- OpB receives the token on their phone dashboard
- OpB enters the token on the confirmation screen
- System re-enables with cool-off constraints
Neither operator can reset alone. This prevents emotional decision-making after a drawdown.
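The text does not specify the token scheme, so the following is only one plausible sketch: an HMAC over a timestamped nonce, with a shared secret provisioned out of band. All names and the token format are assumptions; what the sketch demonstrates is the "neither side alone" property, since OpA's request and OpB's entry are both required to validate.

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"shared-killswitch-secret"   # hypothetical; provisioned out of band

def request_reset() -> tuple[str, str]:
    """OpA side: produce (nonce, token); the token goes to OpB's dashboard."""
    nonce = f"{int(time.time())}:{secrets.token_hex(4)}"
    token = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:8]
    return nonce, token

def confirm_reset(nonce: str, entered_token: str) -> bool:
    """Server side: the token OpB enters must match the HMAC for OpA's nonce."""
    expected = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:8]
    return hmac.compare_digest(expected, entered_token)

nonce, token = request_reset()
print(confirm_reset(nonce, token))   # → True: both operators acted
```

`hmac.compare_digest` is used rather than `==` so the comparison runs in constant time, which is standard practice for token checks even on an internal tool.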
Maintenance
Weekly Digest (Automated, Sunday 6 PM ET)
The system auto-generates and emails a PDF to both operators:
- Performance summary (1-week, 1-month, YTD)
- Strategy-level attribution
- Risk metrics snapshot
- Data quality report (any gaps, stale feeds, interpolated values)
- Upcoming calendar risks (economic events, options expiry, roll dates)
Monthly Tasks (OpA)
- Strategy health review: Are any strategies consistently underperforming their historical Sharpe? The skill scoring system handles this automatically, but OpA should verify the weights make sense.
- Data audit: Run `qgtm data --audit` to check for gaps, outliers, or feed changes from vendors.
- Dependency updates: Security patches for Python packages, Docker images. Run in staging first.
Quarterly Architecture Review
Every 90 days, both operators sit down for a structured review:
| Topic | Duration | Deliverable |
|---|---|---|
| Performance vs. benchmark | 30 min | Written assessment |
| Strategy additions/removals | 30 min | Proposal document if changes needed |
| Risk parameter review | 20 min | Updated config if thresholds need adjusting |
| Infrastructure review | 20 min | Upgrade plan if capacity limits approaching |
| Disaster recovery test | 20 min | Verify backup restore works, failover functions |
The quarterly review is also when you ask: "Is the system still doing what we designed it to do, or has it drifted?" Drift is the slow death of systematic trading — parameters that made sense two years ago may be stale today.
Emergency Procedures
Broker Disconnection
If the Alpaca WebSocket drops:
1. System retries 3 times over 30 seconds
2. If reconnection fails, all pending orders are cancelled via REST API
3. Existing positions are left in place (they are hedged or sized to survive)
4. OpA is alerted. Manual reconnection and reconciliation required.
Data Feed Failure
If a primary data source goes down:
1. System switches to cached data (last known good values)
2. Strategies that depend on the missing feed are paused
3. Other strategies continue operating
4. If the feed is down for > 4 hours during market hours, OpA should consider reducing overall exposure
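The pause rule above amounts to intersecting each strategy's declared feed dependencies with the set of feeds currently down. A sketch, where the strategy names, feed names, and dependency table are all made up for illustration:

```python
# Hypothetical dependency declarations; the real system's registry
# and strategy names are not specified in the text.
DEPENDS_ON = {
    "gold_trend": {"ohlcv"},
    "cot_positioning": {"ohlcv", "cot"},
    "macro_regime": {"ohlcv", "fred"},
}

def active_strategies(feeds_down: set[str]) -> dict[str, bool]:
    """A strategy stays active only if none of its feeds are down."""
    return {name: not (deps & feeds_down)
            for name, deps in DEPENDS_ON.items()}

print(active_strategies({"fred"}))
# gold_trend and cot_positioning keep trading; macro_regime is paused
```

Declaring dependencies up front, rather than letting each strategy discover stale inputs at runtime, is what makes the partial-degradation behavior (steps 2 and 3 above) possible.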
Server Crash / Power Loss
The daemon writes state to disk every 60 seconds. On restart:
1. Load last-known state from disk
2. Reconcile with broker (who always has ground truth)
3. Resume from reconciled state
4. Any signals that were pending but not executed are re-evaluated with fresh data
Checklist: First Week of Live Operation
Day 1: Start daemon. Verify all 29 strategies load.
Confirm broker connection. Place one test order (1 share of GLD).
Verify the order appears in both Alpaca and internal state.
Cancel the test order.
Day 2: Let the system run overnight. Check morning status.
Verify all data feeds updated correctly.
Review any signals generated (should be few — system is conservative).
Day 3: If any signals triggered, verify execution quality.
Compare fill prices to arrival prices.
Check that position sizing matches the config.
Day 4: Intentionally trigger the soft cap in paper trading mode.
Verify position reduction happens automatically.
Verify the alert reaches both OpA and OpB.
Day 5: Intentionally trigger the dead-man's switch (stop the daemon).
Verify positions are flattened within 15 minutes.
Restart and verify recovery sequence.
Day 6: Run a full reconciliation drill.
Manually adjust a position count in the broker.
Restart daemon. Verify it catches the mismatch and does NOT auto-correct.
Day 7: Sunday night review. Generate weekly digest.
Confirm both operators receive it.
If everything checks out: you are operational.