Chapter 9: Operating the System
Two people run this platform. One builds it, one watches the dashboard. This chapter is the operating manual for both.
Reading time: 20 min | Difficulty: Advanced | Prerequisites: Chapters 5-8
Two Operators, Two Interfaces
This system has two operators with fundamentally different responsibilities:
| Role | Codename | Access Level | Primary Tool | Time Commitment |
|---|---|---|---|---|
| System Architect | OpA | Full SSH, code, config | Terminal + Grafana | 30 min/day + on-call |
| Business Partner | OpB | Read-only dashboard | Phone browser | 5 min/day glance |
Why two roles matter: No single person should be able to both modify the system AND authorize risk changes. This is the two-person rule borrowed from nuclear launch protocols and institutional fund management. OpA proposes changes; OpB confirms critical actions like kill-switch resets.
OpA: The Daily Workflow
Morning (6:45 AM ET, before US pre-market)
This takes 10-15 minutes. Do it the same way every day.
Step 1: Check overnight health.
# SSH into the server, pull the morning summary
qgtm status --summary
# Expected output:
# ╔══════════════════════════════════════════╗
# ║ QGTM MORNING STATUS — 2026-04-12 ║
# ╠══════════════════════════════════════════╣
# ║ Daemon: RUNNING (uptime 14d 7h) ║
# ║ Heartbeat: 12s ago ║
# ║ Strategies: 29/29 loaded ║
# ║ Broker: CONNECTED (Alpaca) ║
# ║ Positions: 7 open, $247,300 notional ║
# ║ P&L (1d): +$1,412 (+0.57%) ║
# ║ Drawdown: -2.3% from peak ║
# ║ Signals: 3 pending (GLD, SLV, GDX) ║
# ║ Risk: ALL GREEN ║
# ╚══════════════════════════════════════════╝
Step 2: Review pending signals. The system generated signals overnight from Asian session data and pre-market futures. Review them in Grafana or the terminal.
Each signal shows: strategy source, confidence score (0-100), meta-label probability, position size recommendation, and the pre-trade compliance result (PASS/FAIL with reason).
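The review described above can be sketched as a simple gate over each pending signal's fields. This is an illustrative sketch, not the system's actual schema: the function name `approve_signal`, the field names, and the thresholds (confidence 60, meta-label probability 0.55) are all assumptions.

```python
def approve_signal(signal: dict, min_confidence: int = 60,
                   min_meta_prob: float = 0.55) -> tuple[bool, str]:
    """Return (approved, reason) for one pending signal."""
    # Compliance is a hard gate: a FAIL is never overridden by confidence.
    if signal["compliance"] != "PASS":
        return False, f"compliance: {signal.get('compliance_reason', 'FAIL')}"
    if signal["confidence"] < min_confidence:
        return False, f"confidence {signal['confidence']} below {min_confidence}"
    if signal["meta_prob"] < min_meta_prob:
        return False, f"meta-label prob {signal['meta_prob']:.2f} below {min_meta_prob}"
    return True, "ok"

pending = [
    {"symbol": "GLD", "confidence": 74, "meta_prob": 0.61, "compliance": "PASS"},
    {"symbol": "SLV", "confidence": 58, "meta_prob": 0.63, "compliance": "PASS"},
    {"symbol": "GDX", "confidence": 81, "meta_prob": 0.66, "compliance": "FAIL",
     "compliance_reason": "position limit"},
]
for s in pending:
    ok, why = approve_signal(s)
    print(s["symbol"], "APPROVE" if ok else f"REJECT ({why})")
```

In this sketch, GLD is approved while SLV (low confidence) and GDX (compliance fail) are rejected with a reason string the operator can read at a glance.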
Step 3: Check data freshness. Stale data is the silent killer of systematic trading.
qgtm data --freshness
# Shows last-updated timestamp for every feed:
# OHLCV (Alpaca): 2026-04-11 20:00 UTC ✓ fresh
# COT (CFTC): 2026-04-08 (weekly) ✓ on schedule
# FRED macro: 2026-04-10 ✓ fresh
# Options vol: 2026-04-11 21:00 UTC ✓ fresh
# Sentiment (Reddit): 2026-04-12 06:30 UTC ✓ fresh
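A freshness check like the one above reduces to comparing each feed's last update against its expected cadence. The cadences below mirror the feeds shown in the output; the function name, dictionary structure, and exact slack values are assumptions.

```python
from datetime import datetime, timedelta

# Expected maximum age per feed; COT gets a day of slack past its
# weekly release, and these exact values are assumptions.
CADENCE = {
    "ohlcv": timedelta(days=1),
    "cot": timedelta(days=8),
    "fred": timedelta(days=3),
    "options_vol": timedelta(days=1),
    "sentiment": timedelta(hours=12),
}

def feed_status(feed: str, last_update: datetime, now: datetime) -> str:
    """Classify a feed as fresh or stale by age vs. its cadence."""
    return "fresh" if now - last_update <= CADENCE[feed] else "stale"

now = datetime(2026, 4, 12, 11, 0)
print(feed_status("cot", datetime(2026, 4, 8), now))              # → fresh
print(feed_status("sentiment", datetime(2026, 4, 12, 6, 30), now))  # → fresh
```

The point of per-feed cadences is that "stale" means different things for a weekly COT release than for an intraday sentiment scrape; a single global threshold would either miss stale daily feeds or constantly flag healthy weekly ones.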
Step 4: Scan Grafana dashboards. Three panels matter:
- System Health — CPU, memory, API rate limits, queue depth
- Strategy Performance — Rolling Sharpe by strategy, regime indicator
- Risk Dashboard — Current drawdown, VaR utilization, correlation heatmap
If everything is green, you are done. Total time: 10-15 minutes.
The Sunday Night Decision Screen
Sunday at 8 PM ET, before Asian markets open, OpA runs the weekly review, which produces a structured report covering:
- Week-in-review: P&L attribution by strategy, largest winners/losers, execution quality (slippage vs. estimate)
- Regime assessment: Current HMM state (risk-on / risk-off / crisis), BOCPD change-point probability, vol regime (low / normal / elevated / crisis)
- Allocation recommendation: The optimizer's suggested weight shifts for the coming week based on updated skill scores and regime probabilities
- Calendar risks: FOMC, CPI, NFP, options expiry dates for the week ahead
- Action items: Data feed renewals, strategy parameter reviews due, any code deployments queued
OpA reviews this, decides whether to accept the optimizer's allocation recommendation or override (with logged justification), and confirms. The system applies new weights at Sunday 9 PM ET, two hours before CME Globex opens.
OpB: The Phone Dashboard
OpB opens a single URL on their phone. The dashboard shows:
The Health Dot
A single colored circle in the top-left corner. This is the most important indicator on the screen.
| Color | Meaning | Action Required |
|---|---|---|
| Green | All systems normal, P&L within expectations | None. Check back tomorrow. |
| Yellow | Minor issue — one data feed stale, or drawdown approaching soft cap | Read the one-line description below the dot. No immediate action needed. |
| Orange | Elevated concern — drawdown past soft cap, or two+ feeds stale, or a strategy disabled | Read the detail panel. Text/call OpA if no response in 2 hours. |
| Red | Kill-switch triggered, or broker disconnected, or critical system failure | The system has already halted trading. Contact OpA immediately. If OpA unreachable for 30 min, follow the dead-man's switch protocol (Chapter 6). |
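The table above is a priority ordering: red conditions are checked first, then orange, then yellow. A minimal sketch of that logic, with illustrative field names that are assumptions about the health payload:

```python
def health_dot(state: dict) -> str:
    """Map system state to the dashboard dot color, worst condition first."""
    if (state["killswitch_triggered"] or not state["broker_connected"]
            or state["critical_failure"]):
        return "red"
    if (state["drawdown_past_soft_cap"] or state["stale_feeds"] >= 2
            or state["strategies_disabled"] > 0):
        return "orange"
    if state["stale_feeds"] == 1 or state["drawdown_near_soft_cap"]:
        return "yellow"
    return "green"

nominal = dict(killswitch_triggered=False, broker_connected=True,
               critical_failure=False, drawdown_past_soft_cap=False,
               stale_feeds=0, strategies_disabled=0,
               drawdown_near_soft_cap=False)
print(health_dot(nominal))                                  # → green
print(health_dot({**nominal, "stale_feeds": 1}))            # → yellow
print(health_dot({**nominal, "broker_connected": False}))   # → red
```

Ordering matters: a kill-switch trigger with one stale feed must show red, not yellow, so the checks cascade from most to least severe.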
What OpB Sees Below the Dot
- Account value (updated every 60s during market hours)
- Today's P&L in dollars and percent
- Drawdown from peak as a visual bar (green/yellow/red zones)
- Number of open positions and total notional
- Last signal with timestamp, direction, and symbol
- System uptime in days
When to Worry (OpB Decision Tree)
Dot is green? → Close the app. Everything is fine.
Dot is yellow? → Read the description. Probably a stale feed.
If still yellow tomorrow morning, mention it to OpA.
Dot is orange? → Read the detail panel.
Text OpA: "Dashboard orange — [one-line reason]"
Dot is red? → Trading is halted. Call OpA.
If no answer in 30 min: follow the dead-man's switch protocol (Chapter 6).
The system is already in safe mode. No money is at risk.
Starting the Daemon
When OpA starts (or restarts) the system, here is exactly what happens:
Boot Sequence (15-30 seconds)
[T+0s] Daemon process starts
→ Load configuration from qgtm_config.toml
→ Validate all API keys (Alpaca, FRED, data vendors)
[T+2s] Strategy initialization
→ Load all 29 strategy instances
→ Each strategy loads its trained model/parameters
→ Verify parameter checksums match last deployment
[T+5s] Broker connection
→ Connect to Alpaca via WebSocket
→ Authenticate, verify account status
→ Pull current positions and open orders
[T+8s] Reconciliation
→ Compare broker positions vs. internal state
→ If mismatch: LOG WARNING, do NOT auto-correct
→ OpA must manually reconcile any discrepancy
[T+12s] Data pipeline warm-up
→ Fetch last 252 trading days of OHLCV (cached locally)
→ Pull latest COT, FRED, sentiment data
→ Calculate all features for current bar
[T+18s] Risk system activation
→ Load current drawdown state
→ Initialize VaR calculator
→ Set kill-switch thresholds from config
[T+20s] Heartbeat begins
→ First heartbeat sent to monitoring
→ Dashboard health dot turns green
→ System is live
Critical Rule: Reconciliation Never Auto-Corrects
If the daemon finds a position mismatch on startup (e.g., broker shows 500 GLD shares but internal state says 400), it does NOT automatically adjust. It logs the discrepancy, sends an alert, and waits for OpA to investigate. This prevents the scenario where a partial fill or manual override gets steamrolled by automation.
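The report-but-never-correct rule above is simple to state in code. This sketch is illustrative: the function name and return format are assumptions, but the invariant it demonstrates (mismatches are surfaced, never mutated) is the one the text requires.

```python
def reconcile(broker: dict[str, int], internal: dict[str, int]) -> list[str]:
    """Return human-readable discrepancies; the caller must NOT auto-correct."""
    issues = []
    for sym in sorted(set(broker) | set(internal)):
        b, i = broker.get(sym, 0), internal.get(sym, 0)
        if b != i:
            issues.append(f"{sym}: broker={b} internal={i} (delta {b - i})")
    return issues

# The GLD example from the text: broker shows 500, internal state says 400.
broker_state = {"GLD": 500, "SLV": 1200}
internal_state = {"GLD": 400, "SLV": 1200}
for issue in reconcile(broker_state, internal_state):
    print("WARNING position mismatch:", issue)
# The daemon stops here and waits for OpA; it never adjusts either side.
```

Note that the function is pure: it takes two snapshots and returns strings. Keeping reconciliation side-effect-free makes "do NOT auto-correct" a structural property rather than a convention.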
Monitoring
Grafana Dashboards
Four pre-built dashboards ship with the system:
1. System Health (/grafana/d/system-health)
- Daemon uptime and heartbeat interval
- CPU, memory, disk usage
- API rate limit consumption (Alpaca: 200/min limit)
- Data pipeline latency (target: < 5s from market tick to feature calculation)
2. Strategy Performance (/grafana/d/strategy-perf)
- Rolling 63-day Sharpe ratio by strategy (sorted, colored by regime)
- Signal count heatmap (strategy x day)
- Win rate and profit factor by strategy (trailing 90 days)
- Skill score evolution (the adaptive weight the system assigns each strategy)
3. Risk Dashboard (/grafana/d/risk)
- Current portfolio drawdown with soft/hard cap lines
- VaR utilization (actual vs. limit)
- Cross-strategy correlation matrix (warns when strategies cluster)
- Regime indicator (HMM state probabilities as stacked area chart)
4. Execution Quality (/grafana/d/exec-quality)
- Slippage distribution (arrival price vs. fill price)
- Order latency (signal generation to order submission)
- Fill rate and partial fill frequency
- Cost attribution (spread + slippage + commission per trade)
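The slippage metric in the panel above is the signed difference between fill and arrival price. A small sketch, where the sign convention (positive = cost to us, for both buys and sells) is an assumption:

```python
def slippage_bps(side: str, arrival: float, fill: float) -> float:
    """Slippage in basis points; positive means the fill was worse than arrival."""
    # Buying higher than arrival costs money; selling lower than arrival costs money.
    signed = (fill - arrival) if side == "buy" else (arrival - fill)
    return 10_000 * signed / arrival

print(round(slippage_bps("buy", 100.00, 100.05), 2))   # → 5.0 (cost)
print(round(slippage_bps("sell", 100.00, 100.05), 2))  # → -5.0 (price improvement)
```

Normalizing to basis points of the arrival price lets the dashboard pool trades across instruments with very different price levels into one distribution.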
Signal Dashboard API
The web frontend pulls live data from the signal API:
GET /api/signals/active → Currently open signals with live P&L
GET /api/signals/pending → Signals awaiting market open
GET /api/signals/history → Historical signals with outcomes
GET /api/health → System health summary (powers the health dot)
GET /api/performance → Account-level performance metrics
Terminal Panels
For OpA, the terminal interface provides rapid access without opening a browser:
# Live P&L ticker (updates every second during market hours)
qgtm watch --pnl
# Strategy signal monitor (shows signals as they generate)
qgtm watch --signals
# Risk dashboard in terminal (curses-based)
qgtm watch --risk
# Combined "mission control" layout (tmux-based, 4 panels)
qgtm mission-control
The Kill-Switch
The kill-switch is the most important safety mechanism in the system. It exists to prevent catastrophic loss when something goes wrong — and in markets, things go wrong.
Three Trigger Levels
| Level | Trigger | Automatic Action | Manual Action Required |
|---|---|---|---|
| Soft Cap | Drawdown hits -8% from peak | Reduce position sizes by 50%. No new positions. | OpA reviews. Can override with logged justification. |
| Hard Cap | Drawdown hits -12% from peak | Flatten ALL positions at market. System enters cool-off. | OpA + OpB must both confirm to re-enable trading. |
| Dead-Man's Switch | No heartbeat for 15 minutes | Flatten all positions. Alert both operators. | Full system restart and reconciliation required. |
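The three levels in the table reduce to one ordered check. The thresholds come straight from the table (-8%, -12%, 15 minutes); the function name and the string action codes are illustrative assumptions.

```python
SOFT_CAP = -0.08            # -8% drawdown from peak
HARD_CAP = -0.12            # -12% drawdown from peak
HEARTBEAT_LIMIT_S = 15 * 60  # dead-man's switch: 15 minutes without heartbeat

def killswitch_check(drawdown: float, heartbeat_age_s: float) -> str:
    """Evaluate kill-switch state, most severe condition first."""
    if heartbeat_age_s > HEARTBEAT_LIMIT_S:
        return "DEAD_MAN: flatten all positions, alert both operators"
    if drawdown <= HARD_CAP:
        return "HARD_CAP: flatten all positions, enter cool-off"
    if drawdown <= SOFT_CAP:
        return "SOFT_CAP: halve position sizes, no new positions"
    return "OK"

print(killswitch_check(-0.023, 12))   # the morning-status example → OK
print(killswitch_check(-0.09, 12))    # past soft cap
print(killswitch_check(-0.13, 12))    # past hard cap
```

Checking the dead-man's condition first is deliberate: if the daemon has gone silent, the reported drawdown may itself be stale, so the heartbeat failure dominates.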
What Happens When the Hard Cap Triggers
- Immediate: All open orders cancelled. Market orders sent to flatten every position.
- T+1 minute: Confirmation that all positions are flat (or alert if any fill failed).
- T+5 minutes: Automated incident report generated — what positions were open, what the drawdown was, which strategies contributed most.
- Cool-off period: System will not trade for 24 hours minimum.
- Re-entry: After cool-off, OpA proposes a re-entry plan with reduced position sizing (50% of normal). OpB confirms via the dashboard (two-person rule). The system gradually scales back to full size over 5 trading days.
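One way to implement the 5-day scale-back is a linear ramp. The 50% starting size and 5-day span come from the text; the linear shape and function name are assumptions (the real system might step up discretely or gate each increase on daily P&L).

```python
def reentry_fraction(days_since_reset: int, ramp_days: int = 5) -> float:
    """Fraction of normal position size, ramping 50% → 100% over ramp_days."""
    if days_since_reset >= ramp_days:
        return 1.0
    return 0.5 + 0.5 * days_since_reset / ramp_days

for d in range(7):
    print(f"day {d}: {reentry_fraction(d):.0%} of normal size")
```

Day 0 sizes at 50%, day 5 and beyond at 100%, so the portfolio never jumps straight back to full exposure immediately after a hard-cap event.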
The Two-Person Reset
After a hard cap trigger, re-enabling trading requires:
- OpA runs `qgtm killswitch --request-reset` (generates a reset token)
- OpB receives the token on their phone dashboard
- OpB enters the token on the confirmation screen
- System re-enables with cool-off constraints
Neither operator can reset alone. This prevents emotional decision-making after a drawdown.
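The text does not specify the token scheme, so the following is only one plausible sketch: an HMAC over a timestamped nonce, with a shared secret provisioned out of band. All names and the token format are assumptions; what the sketch demonstrates is the "neither side alone" property, since OpA's request and OpB's entry are both required to validate.

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"shared-killswitch-secret"   # hypothetical; provisioned out of band

def request_reset() -> tuple[str, str]:
    """OpA side: produce (nonce, token); the token goes to OpB's dashboard."""
    nonce = f"{int(time.time())}:{secrets.token_hex(4)}"
    token = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:8]
    return nonce, token

def confirm_reset(nonce: str, entered_token: str) -> bool:
    """Server side: the token OpB enters must match the HMAC for OpA's nonce."""
    expected = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:8]
    return hmac.compare_digest(expected, entered_token)

nonce, token = request_reset()
print(confirm_reset(nonce, token))   # → True: both operators acted
```

`hmac.compare_digest` is used rather than `==` so the comparison runs in constant time, which is standard practice for token checks even on an internal tool.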
Maintenance
Weekly Digest (Automated, Sunday 6 PM ET)
The system auto-generates and emails a PDF to both operators:
- Performance summary (1-week, 1-month, YTD)
- Strategy-level attribution
- Risk metrics snapshot
- Data quality report (any gaps, stale feeds, interpolated values)
- Upcoming calendar risks (economic events, options expiry, roll dates)
Monthly Tasks (OpA)
- Strategy health review: Are any strategies consistently underperforming their historical Sharpe? The skill scoring system handles this automatically, but OpA should verify the weights make sense.
- Data audit: Run `qgtm data --audit` to check for gaps, outliers, or feed changes from vendors.
- Dependency updates: Security patches for Python packages, Docker images. Run in staging first.
Quarterly Architecture Review
Every 90 days, both operators sit down for a structured review:
| Topic | Duration | Deliverable |
|---|---|---|
| Performance vs. benchmark | 30 min | Written assessment |
| Strategy additions/removals | 30 min | Proposal document if changes needed |
| Risk parameter review | 20 min | Updated config if thresholds need adjusting |
| Infrastructure review | 20 min | Upgrade plan if capacity limits approaching |
| Disaster recovery test | 20 min | Verify backup restore works, failover functions |
The quarterly review is also when you ask: "Is the system still doing what we designed it to do, or has it drifted?" Drift is the slow death of systematic trading — parameters that made sense two years ago may be stale today.
Emergency Procedures
Broker Disconnection
If the Alpaca WebSocket drops:
1. System retries 3 times over 30 seconds
2. If reconnection fails, all pending orders are cancelled via REST API
3. Existing positions are left in place (they are hedged or sized to survive)
4. OpA is alerted. Manual reconnection and reconciliation required.
Data Feed Failure
If a primary data source goes down:
1. System switches to cached data (last known good values)
2. Strategies that depend on the missing feed are paused
3. Other strategies continue operating
4. If the feed is down for > 4 hours during market hours, OpA should consider reducing overall exposure
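The pause rule above amounts to intersecting each strategy's declared feed dependencies with the set of feeds currently down. A sketch, where the strategy names, feed names, and dependency table are all made up for illustration:

```python
# Hypothetical dependency declarations; the real system's registry
# and strategy names are not specified in the text.
DEPENDS_ON = {
    "gold_trend": {"ohlcv"},
    "cot_positioning": {"ohlcv", "cot"},
    "macro_regime": {"ohlcv", "fred"},
}

def active_strategies(feeds_down: set[str]) -> dict[str, bool]:
    """A strategy stays active only if none of its feeds are down."""
    return {name: not (deps & feeds_down)
            for name, deps in DEPENDS_ON.items()}

print(active_strategies({"fred"}))
# gold_trend and cot_positioning keep trading; macro_regime is paused
```

Declaring dependencies up front, rather than letting each strategy discover stale inputs at runtime, is what makes the partial-degradation behavior (steps 2 and 3 above) possible.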
Server Crash / Power Loss
The daemon writes state to disk every 60 seconds. On restart:
1. Load last-known state from disk
2. Reconcile with broker (who always has ground truth)
3. Resume from reconciled state
4. Any signals that were pending but not executed are re-evaluated with fresh data
Checklist: First Week of Live Operation
Day 1: Start daemon. Verify all 29 strategies load.
Confirm broker connection. Place one test order (1 share of GLD).
Verify the order appears in both Alpaca and internal state.
Cancel the test order.
Day 2: Let the system run overnight. Check morning status.
Verify all data feeds updated correctly.
Review any signals generated (should be few — system is conservative).
Day 3: If any signals triggered, verify execution quality.
Compare fill prices to arrival prices.
Check that position sizing matches the config.
Day 4: Intentionally trigger the soft cap in paper trading mode.
Verify position reduction happens automatically.
Verify the alert reaches both OpA and OpB.
Day 5: Intentionally trigger the dead-man's switch (stop the daemon).
Verify positions are flattened within 15 minutes.
Restart and verify recovery sequence.
Day 6: Run a full reconciliation drill.
Manually adjust a position count in the broker.
Restart daemon. Verify it catches the mismatch and does NOT auto-correct.
Day 7: Sunday night review. Generate weekly digest.
Confirm both operators receive it.
If everything checks out: you are operational.