Iris — Governance Scratch Pad
Agent: iris | Topic: 0.0.8718898 | Status: defined, not yet implemented
Living doc. Log observed failure modes, design notes, and governance scenarios here before formal spec work begins.
---

## Failure Modes Observed in Production

### 1. Zombie Task (observed 2026-04-24)
What happened: Quinn claimed task fa1fc353 on roles.developer. Claude session exited with code 1 (crash/max_turns) after ~1 minute. Runner only posts task.complete on exit code 0 — on failure, nothing is posted. Task remained permanently stuck: task.claim on HCS, no task.complete or task.blocked ever follows.
Effect: Task is unclaimed from anyone else's perspective. Agent can't re-claim it (it's in the completed set). Humans have no visibility without digging through logs.
Gaps:
- Runner doesn't post `task.blocked` on non-zero exit (runner-level fix: patch `base.py`)
- No external sweep detects "claimed but unresolved past timeout"
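The runner-level fix could look like the sketch below. The hook name `post_message` and the message shape are assumptions, not the actual `base.py` API:

```python
import time


def handle_exit(exit_code: int, task_id: str, post_message) -> None:
    """Resolve the claimed task on every exit, not just exit code 0.

    post_message is the runner's HCS publish hook (assumed signature:
    takes one dict). Hypothetical sketch of the base.py patch.
    """
    if exit_code == 0:
        post_message({"type": "task.complete", "task_id": task_id,
                      "ts": time.time()})
    else:
        # Previously nothing was posted here, leaving a zombie claim.
        post_message({"type": "task.blocked", "task_id": task_id,
                      "reason": f"runner: non-zero exit ({exit_code})",
                      "ts": time.time()})
```

With this in place the zombie case above (exit code 1) would at least leave a `task.blocked` on the topic instead of silence.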
Iris's role: Sweep role topics periodically. For any task.claim with no subsequent task.complete or task.blocked past a timeout (e.g., 30 minutes), post task.blocked with reason "governance: claim timeout — no completion observed" and escalate to roles.exec.
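A minimal sketch of that sweep, assuming messages are dicts with `type`, `task_id`, and `ts` (epoch seconds); the message shape is an assumption, not the actual HCS envelope:

```python
import time

CLAIM_TIMEOUT_S = 30 * 60  # e.g. 30 minutes; should be per-role (see table below)


def find_zombie_claims(messages, now=None):
    """Return task_ids with a task.claim but no subsequent task.complete
    or task.blocked, where the claim is older than the timeout."""
    now = time.time() if now is None else now
    claims, resolved = {}, set()
    for m in messages:
        if m["type"] == "task.claim":
            claims[m["task_id"]] = m["ts"]
        elif m["type"] in ("task.complete", "task.blocked"):
            resolved.add(m["task_id"])
    return [t for t, ts in claims.items()
            if t not in resolved and now - ts > CLAIM_TIMEOUT_S]
```

Iris would then post `task.blocked` for each returned task_id and escalate to roles.exec.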
---
What happened: Cancelled workflows leave task.available messages on role topics. Agents keep claiming them, burning cycles, hitting max_turns, looping forever.
Effect: Agents waste resources on tasks that will never produce useful output.
Iris's role: On workflow cancellation, sweep role topics for task.available messages from that workflow_instance_id with no subsequent claim or blocked status. Post task.blocked with reason "governance: parent workflow cancelled".
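A sketch of that cancellation sweep; field names (`workflow_instance_id`, etc.) are assumptions:

```python
def find_orphaned_tasks(messages, cancelled_instance_id):
    """task.available messages from the cancelled workflow instance with
    no subsequent claim or blocked status."""
    available, superseded = {}, set()
    for m in messages:
        if (m["type"] == "task.available"
                and m.get("workflow_instance_id") == cancelled_instance_id):
            available[m["task_id"]] = m
        elif m["type"] in ("task.claim", "task.blocked"):
            superseded.add(m["task_id"])
    return [t for t in available if t not in superseded]


def block_orphans(messages, cancelled_instance_id, post_message):
    # post_message: assumed HCS publish hook taking one dict
    for task_id in find_orphaned_tasks(messages, cancelled_instance_id):
        post_message({"type": "task.blocked", "task_id": task_id,
                      "reason": "governance: parent workflow cancelled"})
```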
---
What happened: Agent hits max_turns, exits without posting task.complete or task.blocked. Runner restarts, re-claims the same task (winner-is-me verify_claim logic), runs again, hits max_turns again. Infinite loop.
Effect: Burning API spend and HBAR with no progress.
Iris's role: Detect same task_id claimed 3+ times by the same agent with no completion between claims. Post task.blocked, escalate to roles.exec: "Agent X has claimed task Y 3 times without completing."
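The loop detection could be sketched as below; the threshold and message fields are assumptions:

```python
from collections import Counter

REPEAT_CLAIM_THRESHOLD = 3


def find_claim_loops(messages):
    """Return (agent, task_id) pairs where the same agent claimed the same
    task 3+ times with no completion observed."""
    claims = Counter()
    completed = set()
    for m in messages:
        if m["type"] == "task.claim":
            claims[(m["agent"], m["task_id"])] += 1
        elif m["type"] == "task.complete":
            completed.add(m["task_id"])
    return [(agent, task) for (agent, task), n in claims.items()
            if n >= REPEAT_CLAIM_THRESHOLD and task not in completed]
```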
---
What: Agent posts working status to agents.<id> but never transitions to idle or complete past the expected duration for the task type.
Iris's role: Monitor agent_state for agents stuck in working past threshold. Alert to roles.exec.
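A sketch of that check; the `agent_state` shape is an assumption about what Iris can read:

```python
def find_stuck_agents(agent_state, thresholds, now):
    """Return agents that have been in 'working' longer than the threshold
    for their role.

    agent_state: {agent_id: {"status": ..., "role": ..., "since": epoch_s}}
    thresholds:  {role: max_seconds} (roles absent from the map, e.g.
                 roles.exec, are never flagged)
    """
    stuck = []
    for agent_id, s in agent_state.items():
        limit = thresholds.get(s["role"])
        if (s["status"] == "working" and limit is not None
                and now - s["since"] > limit):
            stuck.append(agent_id)
    return stuck
```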
---
- Iris never does work herself — she detects, blocks, and escalates
- All Iris actions originate on HCS (`task.blocked`, escalation messages to `roles.exec`) — full audit trail
- Iris should be idempotent: posting `task.blocked` for an already-blocked task is a no-op
- Timeout thresholds should be configurable per role/task type (architect tasks take longer than BA tasks)
- Escalations to `roles.exec` should include: task_id, agent, how long stuck, what was last observed
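One possible shape for an escalation carrying those fields; the envelope is not yet specified, so this is a sketch with assumed names:

```python
import json

escalation = {
    "type": "governance.escalation",  # hypothetical message type
    "task_id": "fa1fc353",
    "agent": "quinn",
    "stuck_for_s": 2700,              # how long the task has been stuck
    "last_observed": "task.claim",    # last message seen for this task
}
payload = json.dumps(escalation)      # posted to roles.exec as JSON
```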
| Role | Expected max duration | Iris timeout |
|---|---|---|
| roles.developer | 20 min | 45 min |
| roles.architect | 15 min | 30 min |
| roles.ba | 10 min | 20 min |
| roles.coordinator | 5 min | 15 min |
| roles.exec | no timeout | no timeout |
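The table above could be carried as a simple config mapping; the structure is a sketch, not a settled format:

```python
# Iris timeout per role topic, in minutes (None = never time out)
IRIS_TIMEOUTS_MIN = {
    "roles.developer":   45,
    "roles.architect":   30,
    "roles.ba":          20,
    "roles.coordinator": 15,
    "roles.exec":        None,
}


def timeout_seconds(role):
    mins = IRIS_TIMEOUTS_MIN.get(role)
    return None if mins is None else mins * 60
```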
- Does Iris run on a fixed cron (e.g., every 5 min) or event-driven?
- Should Iris post directly to role topics or only to `roles.exec`?
- How does Iris know a workflow was cancelled? Reads `workflow_instances.status` from Supabase?
- Should Iris have her own HCS topic for governance events, or use `agents.iris`?