{
  "id": "2026-04-27-circuit-breaker-design-e410fc128d",
  "scope": "redkey",
  "source_of_truth": "repo",
  "source_path": "docs/specs/2026-04-27-circuit-breaker-design.md",
  "source_kind": "markdown",
  "visibility": "internal",
  "renderer_id": "design_doc.dreamborn-forge.generated.v1",
  "design_system": "dreamborn-design-system:forge",
  "generated_at": "2026-05-09T13:00:55.678Z",
  "artifact_type": "design_doc",
  "schema_version": "design_doc.generated.v1",
  "title": "Circuit Breaker — Worker On/Off Control",
  "summary": "Circuit Breaker — Worker On/Off Control Date: 2026 04 27 Branch: feature/marketing agents Status: Approved for implementation Overview Add a circuit breaker to every worker agent — a flag that stops the runner from executing without stopping the systemd timer. Combined with a global kill switch that terminates all running processes immediately. Two mechanism...",
  "format_source": "markdown",
  "sections": [
    {
      "title": "Circuit Breaker — Worker On/Off Control",
      "level": 1,
      "body": "**Date:** 2026-04-27  \n**Branch:** feature/marketing-agents  \n**Status:** Approved for implementation\n\n---"
    },
    {
      "title": "Overview",
      "level": 2,
      "body": "Add a circuit breaker to every worker agent — a flag that stops the runner from executing without stopping the systemd timer. Combined with a global kill switch that terminates all running processes immediately.\n\nTwo mechanisms:\n- **Per-agent circuit breaker** — graceful pause, takes effect on next timer tick\n- **Global kill switch** — immediate process termination + all breakers open\n\n---"
    },
    {
      "title": "Storage",
      "level": 2,
      "body": "**Migration:** `supabase/migrations/034_circuit_breaker.sql`\n\nAdd one column to `agent_state`:\n\n```sql\nALTER TABLE agent_state\n  ADD COLUMN circuit_breaker TEXT NOT NULL DEFAULT 'closed'\n  CHECK (circuit_breaker IN ('open', 'closed'));\n```\n\n`'closed'` = normal (agent runs). `'open'` = paused (agent skips). Default `'closed'` — existing agents unaffected after migration.\n\nNo separate global flag. \"All paused\" = all rows have `circuit_breaker = 'open'`. Resume = patch rows back to `'closed'`.\n\n---"
    },
    {
      "title": "Runner Check",
      "level": 2,
      "body": "In `agents/shared/runner.py`, add `_check_circuit_breaker(sb)` called at the very top of `_run()` — before `fetch_inbox`, `fetch_work_items`, or any other logic.\n\nBehavior:\n- If `circuit_breaker = 'open'`: log `\"circuit breaker open — skipping\"`, set `agent_state.status = 'paused'`, return immediately\n- If Supabase is unreachable: fail to `'closed'` (agents keep running — don't block the platform on a DB hiccup)\n- If no row exists for this agent: treat as `'closed'`\n\n```python\ndef _check_circuit_breaker(self, sb: Client) -> bool:\n    \"\"\"Returns True if OK to run (closed), False if paused (open).\"\"\"\n    try:\n        row = sb.table(\"agent_state\").select(\"circuit_breaker\") \\\n                .eq(\"agent\", self.AGENT_ID).limit(1).execute()\n        if row.data:\n            return row.data[0].get(\"circuit_breaker\", \"closed\") == \"closed\"\n    except Exception as e:\n        log.warning(\"circuit_breaker check failed — defaulting to closed: %s\", e)\n    return True\n```\n\n`_run()` calls this first:\n\n```python\ndef _run(self, sb: Client, creds: dict) -> None:\n    if not self._check_circuit_breaker(sb):\n        log.info(\"%s circuit breaker open — skipping\", self.AGENT_ID)\n        self._update_agent_state(sb, \"paused\")\n        return\n    # ... rest of existing _run logic\n```\n\n---"
    },
    {
      "title": "Kill Endpoint",
      "level": 2,
      "body": "Add `POST /kill-agents` to `vps/hcs-post-server.js`.\n\nTwo-step execution:\n1. Patch all rows in `agent_state` → `circuit_breaker = 'open'`\n2. `pkill -f \"agents.shared.runner\"` — terminates all running runner processes\n\nResponse:\n\n```json\n{ \"ok\": true, \"killed\": true }\n```\n\n`pkill` exit code 0 = processes killed, 1 = nothing was running — both are valid. Either way, breakers are open and nothing restarts.\n\n---"
    },
    {
      "title": "CLI Script",
      "level": 2,
      "body": "`scripts/circuit-breaker.js` — run with `node --env-file=.env scripts/circuit-breaker.js`.\n\n```"
    },
    {
      "title": "Pause one agent (graceful — takes effect on next timer tick)",
      "level": 1,
      "body": "node scripts/circuit-breaker.js --agent quinn --open"
    },
    {
      "title": "Resume one agent",
      "level": 1,
      "body": "node scripts/circuit-breaker.js --agent quinn --close"
    },
    {
      "title": "Resume all agents (no process start — just clears all breakers)",
      "level": 1,
      "body": "node scripts/circuit-breaker.js --resume-all"
    },
    {
      "title": "Global kill (terminates all running workers NOW + opens all breakers)",
      "level": 1,
      "body": "node scripts/circuit-breaker.js --kill\n```\n\n- `--open` / `--close`: patch `agent_state` via Supabase (`REDKEY_SUPABASE_URL` + `REDKEY_SUPABASE_SECRET_KEY`)\n- `--kill`: POST to `http://87.99.154.64:3001/kill-agents`\n- `--resume-all`: patch all `agent_state` rows to `circuit_breaker = 'closed'`\n- Missing `--agent` with `--open`/`--close`: print usage and exit 1\n\n---"
    },
    {
      "title": "Per-agent toggle",
      "level": 3,
      "body": "In each agent row, add a toggle button next to the status dot:\n- Pause icon when `circuit_breaker = 'closed'` (agent is running normally)\n- Play icon when `circuit_breaker = 'open'` (agent is paused)\n- Clicking PATCHes `agent_state.circuit_breaker` for that agent via Supabase\n- When `open`, show a small orange \"paused\" badge on the agent row"
    },
    {
      "title": "Header buttons",
      "level": 3,
      "body": "Two buttons in the cockpit header:\n\n| Button | Color | Action |\n|--------|-------|--------|\n| Resume All | Green/neutral | Patches all `agent_state.circuit_breaker = 'closed'` |\n| Kill All | Red | Confirmation dialog → POST `/kill-agents` |\n\nConfirmation dialog text: *\"This will terminate all running workers immediately. Continue?\"*\n\nOn Kill All success: cockpit reflects all agents as paused (orange badges). Resume All clears them.\n\n---"
    },
    {
      "title": "Behavior Summary",
      "level": 2,
      "body": "| Action | Mechanism | Effect on running session | Effect on next tick |\n|--------|-----------|--------------------------|---------------------|\n| Per-agent pause | Supabase patch | None — finishes naturally | Runner exits at startup |\n| Per-agent resume | Supabase patch | N/A | Runner runs normally |\n| Global kill | POST /kill-agents | SIGTERM immediately | All runners skip |\n| Resume All | Supabase patch | N/A | All runners run normally |\n\n---"
    },
    {
      "title": "Implementation Tasks",
      "level": 2,
      "body": "1. Migration `034_circuit_breaker.sql` — add column to `agent_state`\n2. `runner.py` — add `_check_circuit_breaker()` + call at top of `_run()`\n3. `vps/hcs-post-server.js` — add `POST /kill-agents` route\n4. `scripts/circuit-breaker.js` — CLI script (4 modes)\n5. Cockpit — per-agent toggle button + Kill All / Resume All header buttons\n6. Deploy: SCP hcs-post-server.js + runner.py to VPS, restart services"
    }
  ],
  "html_path": "artifacts/2026-04-27-circuit-breaker-design-e410fc128d.html",
  "json_path": "artifacts/2026-04-27-circuit-breaker-design-e410fc128d.json"
}