Iris — Governance Scratch Pad
Agent: iris | Topic: 0.0.8718898 | Status: defined, not yet implemented
Living doc. Log observed failure modes, design notes, and governance scenarios here before formal spec work begins.
---

## Failure Modes Observed in Production

### 1. Zombie Task (observed 2026-04-24)
What happened: Quinn claimed task fa1fc353 on roles.developer. Claude session exited with code 1 (crash/max_turns) after ~1 minute. Runner only posts task.complete on exit code 0 — on failure, nothing is posted. Task remained permanently stuck: task.claim on HCS, no task.complete or task.blocked ever follows.
Effect: Task is unclaimed from anyone else's perspective. Agent can't re-claim it (it's in the completed set). Humans have no visibility without digging through logs.
Gaps:
- Runner doesn't post `task.blocked` on non-zero exit (runner-level fix: patch `base.py`)
- No external sweep detects "claimed but unresolved past timeout"
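The runner-level fix could look like the sketch below. The hook name `post_message` and the message shape are assumptions, not the actual `base.py` API:

```python
import time


def handle_exit(exit_code: int, task_id: str, post_message) -> None:
    """Resolve the claimed task on every exit, not just exit code 0.

    post_message is the runner's HCS publish hook (assumed signature:
    takes one dict). Hypothetical sketch of the base.py patch.
    """
    if exit_code == 0:
        post_message({"type": "task.complete", "task_id": task_id,
                      "ts": time.time()})
    else:
        # Previously nothing was posted here, leaving a zombie claim.
        post_message({"type": "task.blocked", "task_id": task_id,
                      "reason": f"runner: non-zero exit ({exit_code})",
                      "ts": time.time()})
```

With this in place the zombie case above (exit code 1) would at least leave a `task.blocked` on the topic instead of silence.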
Iris's role: Sweep role topics periodically. For any task.claim with no subsequent task.complete or task.blocked past a timeout (e.g., 30 minutes), post task.blocked with reason "governance: claim timeout — no completion observed" and escalate to roles.exec.
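A minimal sketch of that sweep, assuming messages are dicts with `type`, `task_id`, and `ts` (epoch seconds); the message shape is an assumption, not the actual HCS envelope:

```python
import time

CLAIM_TIMEOUT_S = 30 * 60  # e.g. 30 minutes; should be per-role (see table below)


def find_zombie_claims(messages, now=None):
    """Return task_ids with a task.claim but no subsequent task.complete
    or task.blocked, where the claim is older than the timeout."""
    now = time.time() if now is None else now
    claims, resolved = {}, set()
    for m in messages:
        if m["type"] == "task.claim":
            claims[m["task_id"]] = m["ts"]
        elif m["type"] in ("task.complete", "task.blocked"):
            resolved.add(m["task_id"])
    return [t for t, ts in claims.items()
            if t not in resolved and now - ts > CLAIM_TIMEOUT_S]
```

Iris would then post `task.blocked` for each returned task_id and escalate to roles.exec.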
---
What happened: Cancelled workflows leave task.available messages on role topics. Agents keep claiming them, burning cycles, hitting max_turns, looping forever.
Effect: Agents waste resources on tasks that will never produce useful output.
Iris's role: On workflow cancellation, sweep role topics for task.available messages from that workflow_instance_id with no subsequent claim or blocked status. Post task.blocked with reason "governance: parent workflow cancelled".
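A sketch of that cancellation sweep; field names (`workflow_instance_id`, etc.) are assumptions:

```python
def find_orphaned_tasks(messages, cancelled_instance_id):
    """task.available messages from the cancelled workflow instance with
    no subsequent claim or blocked status."""
    available, superseded = {}, set()
    for m in messages:
        if (m["type"] == "task.available"
                and m.get("workflow_instance_id") == cancelled_instance_id):
            available[m["task_id"]] = m
        elif m["type"] in ("task.claim", "task.blocked"):
            superseded.add(m["task_id"])
    return [t for t in available if t not in superseded]


def block_orphans(messages, cancelled_instance_id, post_message):
    # post_message: assumed HCS publish hook taking one dict
    for task_id in find_orphaned_tasks(messages, cancelled_instance_id):
        post_message({"type": "task.blocked", "task_id": task_id,
                      "reason": "governance: parent workflow cancelled"})
```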
---
What happened: Agent hits max_turns, exits without posting task.complete or task.blocked. Runner restarts, re-claims the same task (winner-is-me verify_claim logic), runs again, hits max_turns again. Infinite loop.
Effect: Burning API spend and HBAR with no progress.
Iris's role: Detect same task_id claimed 3+ times by the same agent with no completion between claims. Post task.blocked, escalate to roles.exec: "Agent X has claimed task Y 3 times without completing."
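The loop detection could be sketched as below; the threshold and message fields are assumptions:

```python
from collections import Counter

REPEAT_CLAIM_THRESHOLD = 3


def find_claim_loops(messages):
    """Return (agent, task_id) pairs where the same agent claimed the same
    task 3+ times with no completion observed."""
    claims = Counter()
    completed = set()
    for m in messages:
        if m["type"] == "task.claim":
            claims[(m["agent"], m["task_id"])] += 1
        elif m["type"] == "task.complete":
            completed.add(m["task_id"])
    return [(agent, task) for (agent, task), n in claims.items()
            if n >= REPEAT_CLAIM_THRESHOLD and task not in completed]
```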
---
What: Agent posts working status to agents.<id> but never transitions to idle or complete past the expected duration for the task type.
Iris's role: Monitor agent_state for agents stuck in working past threshold. Alert to roles.exec.
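A sketch of that check; the `agent_state` shape is an assumption about what Iris can read:

```python
def find_stuck_agents(agent_state, thresholds, now):
    """Return agents that have been in 'working' longer than the threshold
    for their role.

    agent_state: {agent_id: {"status": ..., "role": ..., "since": epoch_s}}
    thresholds:  {role: max_seconds} (roles absent from the map, e.g.
                 roles.exec, are never flagged)
    """
    stuck = []
    for agent_id, s in agent_state.items():
        limit = thresholds.get(s["role"])
        if (s["status"] == "working" and limit is not None
                and now - s["since"] > limit):
            stuck.append(agent_id)
    return stuck
```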
---
- Iris never does work herself — she detects, blocks, and escalates
- All Iris actions originate on HCS (`task.blocked`, escalation messages to `roles.exec`) — full audit trail
- Iris should be idempotent: posting `task.blocked` for an already-blocked task is a no-op
- Timeout thresholds should be configurable per role/task type (architect tasks take longer than BA tasks)
- Escalations to `roles.exec` should include: task_id, agent, how long stuck, what was last observed
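One possible shape for an escalation carrying those fields; the envelope is not yet specified, so this is a sketch with assumed names:

```python
import json

escalation = {
    "type": "governance.escalation",  # hypothetical message type
    "task_id": "fa1fc353",
    "agent": "quinn",
    "stuck_for_s": 2700,              # how long the task has been stuck
    "last_observed": "task.claim",    # last message seen for this task
}
payload = json.dumps(escalation)      # posted to roles.exec as JSON
```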
| Role | Expected max duration | Iris timeout |
|---|---|---|
| roles.developer | 20 min | 45 min |
| roles.architect | 15 min | 30 min |
| roles.ba | 10 min | 20 min |
| roles.coordinator | 5 min | 15 min |
| roles.exec | no timeout | no timeout |
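The table above could be carried as a simple config mapping; the structure is a sketch, not a settled format:

```python
# Iris timeout per role topic, in minutes (None = never time out)
IRIS_TIMEOUTS_MIN = {
    "roles.developer":   45,
    "roles.architect":   30,
    "roles.ba":          20,
    "roles.coordinator": 15,
    "roles.exec":        None,
}


def timeout_seconds(role):
    mins = IRIS_TIMEOUTS_MIN.get(role)
    return None if mins is None else mins * 60
```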
- Does Iris run on a fixed cron (e.g., every 5 min) or event-driven?
- Should Iris post directly to role topics or only to `roles.exec`?
- How does Iris know a workflow was cancelled? Reads `workflow_instances.status` from Supabase?
- Should Iris have her own HCS topic for governance events, or use `agents.iris`?