Dead-Man's Switch Drill — 2026-04-12

GAP-005 Closure

Drill Summary

Item          Result
Date          2026-04-12
Operator      Agent DR (automated)
Test file     tests/test_watchdog_drill.py
Tests run     26
Tests passed  26
Tests failed  0
Duration      0.42s

What Was Tested

  1. Heartbeat emitter (daemon side)
     • Initial state is alive
     • Beat updates timestamp and counter
     • Metadata captured per beat
     • Staleness detection at 3x interval
     • Status report structured correctly

  2. Dead-man's switch (watchdog side)
     • Initial state is ARMED
     • Heartbeat reception keeps state ARMED
     • WARNING state at 50% of timeout
     • TRIPPED state when timeout exceeded
     • Trip fires only once (idempotent)
     • Heartbeats ignored after trip (no auto-recovery)
     • Detection within one check cycle after timeout (120s default)
     • Manual reset re-arms the switch
     • Reset requires named approver

  3. Watchdog loop (async integration)
     • flatten_callback called when switch trips
     • No flatten when heartbeats are fresh
     • Flatten called exactly once (not repeated)
     • Loop survives flatten_callback exceptions

  4. Daemon recovery
     • Fresh switch starts armed on restart
     • Stale state detected immediately after crash
     • Reset + re-arm flow works
     • Re-trip possible after reset and second failure

  5. Independence
     • Heartbeat and DeadManSwitch are separate objects
     • Communication only via receive_heartbeat()
     • Daemon crash does not affect watchdog process
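
The behaviours listed above can be pictured with a minimal sketch of the two objects. This is not the project's implementation: only the names Heartbeat, DeadManSwitch, receive_heartbeat() and the ARMED/WARNING/TRIPPED states come from this report. Every other attribute, the injectable clock, the check() and reset() signatures, and the choice of ">=" at the 3x and 50% boundaries are assumptions made for illustration.

```python
import time
from enum import Enum


class SwitchState(Enum):
    ARMED = "armed"
    WARNING = "warning"
    TRIPPED = "tripped"


class FakeClock:
    """Hypothetical test helper: a clock that only moves when told to."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Heartbeat:
    """Daemon-side emitter. Attribute names are assumptions."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self._clock = clock
        self.count = 0
        self.last_beat = clock()  # a fresh emitter counts as alive
        self.metadata = {}

    def beat(self, **metadata):
        self.last_beat = self._clock()
        self.count += 1
        self.metadata = metadata

    def is_stale(self):
        # Stale once 3x the beat interval has elapsed (boundary is assumed).
        return self._clock() - self.last_beat >= 3 * self.interval_s


class DeadManSwitch:
    """Watchdog-side switch. It never touches a Heartbeat object directly."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_heartbeat = clock()
        self.state = SwitchState.ARMED
        self.reset_by = None

    def receive_heartbeat(self):
        if self.state is SwitchState.TRIPPED:
            return  # no auto-recovery: only reset() can re-arm
        self._last_heartbeat = self._clock()
        self.state = SwitchState.ARMED

    def check(self):
        """Return True only on the transition into TRIPPED."""
        if self.state is SwitchState.TRIPPED:
            return False  # already tripped, so the trip does not fire again
        age = self._clock() - self._last_heartbeat
        if age > self.timeout_s:
            self.state = SwitchState.TRIPPED
            return True
        warn = age >= 0.5 * self.timeout_s
        self.state = SwitchState.WARNING if warn else SwitchState.ARMED
        return False

    def reset(self, approver):
        if not approver or not approver.strip():
            raise ValueError("reset requires a named approver")
        self.reset_by = approver
        self._last_heartbeat = self._clock()
        self.state = SwitchState.ARMED


if __name__ == "__main__":
    clock = FakeClock()
    switch = DeadManSwitch(timeout_s=120.0, clock=clock)
    clock.advance(121)
    print(switch.check(), switch.state)
```

Passing the clock in as a parameter lets a test move time forward without sleeping, which would be one way for a suite of this kind to finish in well under a second.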

Findings

  • The dead-man's switch correctly detects stale heartbeats within one check cycle of the timeout expiring.
  • The two-process architecture (Heartbeat in daemon, DeadManSwitch in watchdog) ensures crashes are detected even if the daemon process is completely dead.
  • Once tripped, the switch requires a human reset; the daemon cannot un-trip itself.
  • The watchdog loop is resilient to flatten_callback failures (logs exception, continues monitoring).
  • With the default 120s timeout and 10s check interval, worst-case detection is 130s after the last heartbeat (120s + 10s).
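
The loop resilience and the 130s figure can be illustrated with a small sketch. Only flatten_callback and the 120s/10s defaults come from this report. The watchdog_loop signature, the check callable, max_cycles (added so the example terminates), and worst_case_detection_s are hypothetical names, and the bound ignores scheduling drift and the time spent inside the callback.

```python
import asyncio
import logging

log = logging.getLogger("watchdog")


def worst_case_detection_s(timeout_s, check_interval_s):
    # A check that runs just before the timeout expires still sees a valid
    # heartbeat, so the trip is observed up to one full interval later.
    return timeout_s + check_interval_s


async def watchdog_loop(check, flatten_callback,
                        check_interval_s=10.0, max_cycles=None):
    """Poll check() forever (or max_cycles times) and flatten on a trip.

    check() is assumed to return True exactly once, on the transition into
    the tripped state, which is what keeps the flatten from repeating.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        if check():
            try:
                await flatten_callback()
            except Exception:
                # Log and keep monitoring; a failed flatten must not kill
                # the watchdog itself.
                log.exception("flatten_callback failed; still monitoring")
        cycles += 1
        await asyncio.sleep(check_interval_s)
    return cycles


async def demo():
    calls = []
    trips = iter([False, True, False, False])  # trip on the second check

    async def failing_flatten():
        calls.append("flatten")
        raise RuntimeError("simulated flatten failure")

    cycles = await watchdog_loop(lambda: next(trips), failing_flatten,
                                 check_interval_s=0, max_cycles=4)
    return calls, cycles


if __name__ == "__main__":
    print(asyncio.run(demo()))
    print(worst_case_detection_s(120, 10))
```

In the demo the callback raises on its only invocation, yet the loop completes all four cycles, which mirrors the "logs exception, continues monitoring" finding.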

Gaps Remaining

  • Redis integration: In production, heartbeats flow through Redis, but the tests call receive_heartbeat() directly. They validate the switch logic, not the Redis transport layer.
  • Process isolation: Tests verify object independence but not actual OS process separation. In production, the watchdog runs as a separate systemd service.
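
One possible way to approach the transport gap is to put a narrow seam between the daemon and the watchdog, so that the same contract test can run against an in-memory double and, later, a Redis-backed class. The sketch below is hypothetical throughout: HeartbeatTransport, InMemoryTransport, HeartbeatPump and the sequence-number scheme are illustrative names, not part of the tested code, and no Redis calls are shown.

```python
from typing import Callable, Optional, Protocol


class HeartbeatTransport(Protocol):
    """Hypothetical seam between the daemon and the watchdog."""

    def publish(self, seq: int) -> None: ...

    def latest(self) -> Optional[int]: ...


class InMemoryTransport:
    """Test double. A Redis-backed class would expose the same two methods."""

    def __init__(self) -> None:
        self._seq: Optional[int] = None

    def publish(self, seq: int) -> None:
        self._seq = seq

    def latest(self) -> Optional[int]:
        return self._seq


class HeartbeatPump:
    """Watchdog-side reader that forwards each new beat exactly once."""

    def __init__(self, transport: HeartbeatTransport,
                 receive_heartbeat: Callable[[], None]) -> None:
        self._transport = transport
        self._receive = receive_heartbeat
        self._last_seen: Optional[int] = None

    def poll(self) -> bool:
        seq = self._transport.latest()
        if seq is None or seq == self._last_seen:
            # Nothing new: a value left behind by a dead daemon must not
            # refresh the switch, so it keeps ageing toward its timeout.
            return False
        self._last_seen = seq
        self._receive()
        return True


if __name__ == "__main__":
    received = []
    transport = InMemoryTransport()
    pump = HeartbeatPump(transport, lambda: received.append("beat"))
    transport.publish(1)
    print(pump.poll(), pump.poll(), received)
```

Comparing sequence numbers instead of just reading the latest value is a deliberate choice here: a key that a crashed daemon wrote earlier would otherwise look like a fresh heartbeat on every poll.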

Verdict

GAP-005: CLOSED. The dead-man's switch detects stale heartbeats, triggers emergency flatten, operates independently from the daemon, and handles recovery correctly.