Dead-Man's Switch Drill — 2026-04-12
GAP-005 Closure
Drill Summary
| Item | Result |
|---|---|
| Date | 2026-04-12 |
| Operator | Agent DR (automated) |
| Test file | tests/test_watchdog_drill.py |
| Tests run | 26 |
| Tests passed | 26 |
| Tests failed | 0 |
| Duration | 0.42s |
What Was Tested
- Heartbeat emitter (daemon side)
- Initial state is alive
- Beat updates timestamp and counter
- Metadata captured per beat
- Staleness detection at 3x interval
-
Status report structured correctly
-
Dead-man's switch (watchdog side)
- Initial state is ARMED
- Heartbeat reception keeps state ARMED
- WARNING state at 50% of timeout
- TRIPPED state when timeout exceeded
- Trip fires only once (idempotent)
- Heartbeats ignored after trip (no auto-recovery)
- Detection within one check cycle after timeout (120s default)
- Manual reset re-arms the switch
-
Reset requires named approver
-
Watchdog loop (async integration)
- flatten_callback called when switch trips
- No flatten when heartbeats are fresh
- Flatten called exactly once (not repeated)
-
Loop survives flatten_callback exceptions
-
Daemon recovery
- Fresh switch starts armed on restart
- Stale state detected immediately after crash
- Reset + re-arm flow works
-
Re-trip possible after reset and second failure
-
Independence
- Heartbeat and DeadManSwitch are separate objects
- Communication only via receive_heartbeat()
- Daemon crash does not affect watchdog process
Findings
- The dead-man's switch correctly detects stale heartbeats within one check cycle.
- The two-process architecture (Heartbeat in daemon, DeadManSwitch in watchdog) ensures crashes are detected even if the daemon process is completely dead.
- Once tripped, the switch requires human reset — the daemon cannot un-trip itself.
- The watchdog loop is resilient to flatten_callback failures (logs exception, continues monitoring).
- Default timeout of 120s with 10s check interval means worst-case detection is 130s after last heartbeat.
Gaps Remaining
- Redis integration: In production, heartbeats flow through Redis. Tests use direct
receive_heartbeat()calls. The pattern is correct; the transport layer (Redis) is not tested here. - Process isolation: Tests verify object independence but not actual OS process separation. In production, the watchdog runs as a separate systemd service.
Verdict
GAP-005: CLOSED. The dead-man's switch detects stale heartbeats, triggers emergency flatten, operates independently from the daemon, and handles recovery correctly.