colibri/docs/guide/architecture/controlplane.md
Sam & Claude 95c487546d
docs(guide): port 39 procedural docs from clawdie-ai to colibri
New docs/guide/ tree — canonical home for operator-facing procedural docs.
Starlight frontmatter added to all files. 0.12 alignment fixes applied:

- v0.11.0 → v0.12.0 throughout
- PI_TUI_PROVIDER/MODEL → DEEPSEEK_API_KEY
- Headless Codex login → Agent runtime setup (zot + RPC mode)
- /login and auth.json references removed
- pi → zot in provider-fallback spawn reference
- colibri-provider-verify (was pi-provider-smoke)
- Language cleanup: smoke test → verification, fake → test,
  can't self-fix → requires operator intervention,
  broken → unresponsive, Fix anything broken → Verify all checks pass

Two-tree model: docs/wiki/ (decisions) + docs/guide/ (procedural).
Single source of truth in colibri. clawdie-ai docs/public/ to be retired.
2026-06-26 09:16:43 +02:00


Control Plane

Starting with v0.10.0, Clawdie has a built-in multi-agent control plane. The agent named after your install (e.g. "Clawdie" or "Atlas") becomes the orchestrator of her own computer — with a Sysadmin, DBA, and Git Admin working under her.

This is not a separate service or jail. It runs inside the existing clawdie service on the host.


What It Is

A lightweight orchestration layer baked into Clawdie that gives her:

  • Org chart — Orchestrator + Sysadmin + DBA + Git Admin, each with a defined scope
  • Task queue — work items assigned to agents, created by Telegram or by the orchestrator herself
  • Token budgets — daily limits per agent, hard stops, operator approval for expensive ops
  • Activity log — immutable audit trail of every decision and skill execution
  • Heartbeat scheduling — Sysadmin wakes daily for health checks; others wake on demand
  • Agentic harness — a terminal-first operator UI with extensions, safety gates, and live status

You are the human operator. You approve expensive operations, review the activity log, and can create tasks directly via the HTTP API or Telegram.


Architecture

Single clawdie service (host):
  ├── Telegram intake          — existing
  ├── HTTP API (port 3100)     — new: /api/controlplane/...
  ├── Unified scheduler        — 30s ticks, Telegram + heartbeats
  ├── Agent executor           — spawn("pi", ...) with CONTROLPLANE_* env
  ├── Agentic harness (TUI)    — extensions, safety gates, live status
  └── Shared hostd access      — privileged ops (bastille, zfs, pf)
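
The scheduler's tick-and-heartbeat behaviour can be pictured with a small sketch. This is not the project's scheduler code: the `AgentSchedule` shape and the `dueHeartbeats` function are hypothetical names introduced for illustration, and only the 30s tick and the daily Sysadmin heartbeat come from this page.

```typescript
// Hypothetical shape: how often an agent wakes on its own (null = on demand only).
interface AgentSchedule {
  name: string;
  heartbeatMs: number | null;
  lastRunMs: number; // epoch milliseconds of the last heartbeat run
}

const TICK_MS = 30000; // scheduler tick interval from the architecture diagram
const DAY_MS = 24 * 60 * 60 * 1000;

// Return the names of agents whose heartbeat interval has elapsed at `nowMs`.
function dueHeartbeats(agents: AgentSchedule[], nowMs: number): string[] {
  return agents
    .filter((a) => a.heartbeatMs !== null && nowMs - a.lastRunMs >= a.heartbeatMs)
    .map((a) => a.name);
}

const agents: AgentSchedule[] = [
  { name: "sysadmin", heartbeatMs: DAY_MS, lastRunMs: 0 },
  { name: "dba", heartbeatMs: null, lastRunMs: 0 }, // on demand, never due by itself
];

// One tick before the 24h mark nothing is due; at the mark only the Sysadmin wakes.
const beforeMark = dueHeartbeats(agents, DAY_MS - TICK_MS);
const atMark = dueHeartbeats(agents, DAY_MS);
```

A real scheduler would call a check like this on every tick and also drain Telegram-originated work in the same pass.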

Agents run on the host via the pi CLI. Each agent gets:

  • A system prompt from their identity file (SYSADMIN_AGENT.md, DB_ADMIN_AGENT.md, etc.)
  • A persistent session in data/sessions/{agent}.jsonl
  • Access to the skills catalog in data/skills/
  • CONTROLPLANE_* env vars pointing at the local HTTP API

All agent spawns pass --no-skills to disable pi's built-in skill discovery; skills are instead injected from the catalog via --append-system-prompt.
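
As a rough illustration of how such a spawn might be assembled, the sketch below builds the command, arguments, and environment without executing anything. It assumes that --append-system-prompt takes the prompt text as its next argument and that the shared secret is handed to the agent through the environment. The `CONTROLPLANE_API_URL` variable name, the `SpawnPlan` type, and `buildSpawnPlan` are hypothetical.

```typescript
// Sketch only: describes the spawn call without running it.
interface SpawnPlan {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// `skillsPrompt` is the catalog text to inject into the agent's system prompt.
// `apiUrl` and the CONTROLPLANE_API_URL name are assumptions for this example.
function buildSpawnPlan(
  skillsPrompt: string,
  sharedSecret: string,
  apiUrl: string = "http://127.0.0.1:3100",
): SpawnPlan {
  return {
    command: "pi",
    // --no-skills turns off built-in discovery; the catalog text is appended instead.
    args: ["--no-skills", "--append-system-prompt", skillsPrompt],
    env: {
      CONTROLPLANE_API_URL: apiUrl, // hypothetical variable name
      CONTROLPLANE_SHARED_SECRET: sharedSecret,
    },
  };
}

const plan = buildSpawnPlan("Skill: jail-status ...", "example-secret");
```

With Node.js, a plan like this could be handed to `child_process.spawn(plan.command, plan.args, { env: { ...process.env, ...plan.env } })`.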

API authentication requires CONTROLPLANE_SHARED_SECRET — a Bearer token that all agents and API clients must present.
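
A client-side sketch of presenting that token follows. The route suffix is left as a parameter because this page does not list the individual endpoints. The `127.0.0.1` host, the `buildRequest` helper, and the `"example"` suffix are placeholders rather than documented values.

```typescript
interface ControlPlaneRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body?: string;
}

// Build a request description for a route under /api/controlplane/.
// `subpath` is supplied by the caller; no real endpoint names are assumed here.
function buildRequest(
  subpath: string,
  sharedSecret: string,
  payload?: unknown,
): ControlPlaneRequest {
  const request: ControlPlaneRequest = {
    url: `http://127.0.0.1:3100/api/controlplane/${subpath}`,
    method: payload === undefined ? "GET" : "POST",
    // The shared secret is presented as a Bearer token.
    headers: { Authorization: `Bearer ${sharedSecret}` },
  };
  if (payload !== undefined) {
    request.headers["Content-Type"] = "application/json";
    request.body = JSON.stringify(payload);
  }
  return request;
}

const readRequest = buildRequest("example", "example-secret");
const writeRequest = buildRequest("example", "example-secret", { note: "placeholder" });
```

The resulting object maps directly onto `fetch(request.url, { method, headers, body })`.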


Default System

Setup auto-provisions a default system named after your AGENT_NAME:

| Agent | Role | Heartbeat | Budget |
| --- | --- | --- | --- |
| Orchestrator | Primary decision-maker, delegator | On-demand | 80% |
| Sysadmin | Jails, ZFS, PF, services | Daily (24h) | 10% |
| DBA | PostgreSQL ops | On-demand | 5% |
| Git Admin | Merges, releases, mirrors | On-demand | 5% |

Budget is token-based. Default: 100,000 tokens/day. Hard stops enforced before every spawn.
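
To make the arithmetic concrete, here is a minimal sketch of the per-agent split and a pre-spawn check. The percentages and the 100,000-token default come from the table above. The rule that a spawn is refused when used plus estimated tokens would exceed the allowance is an assumption about how the hard stop might work, and `dailyAllowance` and `canSpawn` are hypothetical names.

```typescript
const DAILY_TOKENS = 100000;

// Budget shares in whole percent, taken from the default system table.
const SHARE_PERCENT: Record<string, number> = {
  orchestrator: 80,
  sysadmin: 10,
  dba: 5,
  "git-admin": 5,
};

// Tokens an agent may spend per day under the default split.
function dailyAllowance(agent: string, dailyTokens: number = DAILY_TOKENS): number {
  const percent = SHARE_PERCENT[agent];
  if (percent === undefined) throw new Error(`unknown agent: ${agent}`);
  return Math.floor((dailyTokens * percent) / 100);
}

// Assumed hard-stop rule: refuse the spawn if it could exceed today's allowance.
function canSpawn(agent: string, usedToday: number, estimatedTokens: number): boolean {
  return usedToday + estimatedTokens <= dailyAllowance(agent);
}

const sysadminAllowance = dailyAllowance("sysadmin"); // 10,000 tokens
const dbaMayRun = canSpawn("dba", 4500, 600); // false: 5,100 exceeds the 5,000 allowance
```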


Dual-Layer Decision Model

Every agent queries two systems before acting:

1. Control plane API  → "What's my task? What's my budget?"
2. Local session      → "What did I do last time? What skills do I have?"

Then: pattern-match task → skill → execute (deterministic, low cost)
      no match          → escalate to orchestrator or request operator approval

Most work is skill execution (300 to 1,200 tokens). Reasoning is reserved for genuinely ambiguous situations.
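
The match-or-escalate step can be sketched as a pure function. The regular expressions below are invented for illustration and are not the project's actual matching rules. Only the skill names are taken from this page.

```typescript
type Decision =
  | { kind: "execute"; skill: string }
  | { kind: "escalate"; reason: string };

interface SkillPattern {
  skill: string;
  pattern: RegExp;
}

// Hypothetical trigger patterns; a real catalog would define its own.
const PATTERNS: SkillPattern[] = [
  { skill: "jail-status", pattern: /\bjail\b.*\brunning\b/i },
  { skill: "disk-usage", pattern: /\bdisk\b/i },
  { skill: "backup-db", pattern: /\bback ?up\b.*\bdatabase\b/i },
];

// Deterministic path first: run the first matching skill, otherwise escalate.
function decide(task: string, patterns: SkillPattern[] = PATTERNS): Decision {
  const hit = patterns.find((p) => p.pattern.test(task));
  if (hit) return { kind: "execute", skill: hit.skill };
  return { kind: "escalate", reason: "no matching skill" };
}

const routine = decide("Check if db jail is running"); // executes jail-status
const ambiguous = decide("Should we move to a new datacenter?"); // escalates
```

Keeping the matcher deterministic is what keeps the common case cheap: the model is only consulted when no pattern fires.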


Skills

Agents use a catalog of operational skills sourced from agent/library.yaml.

Skills are discoverable via tags and the skills_search extension tool. The control plane can route tasks to the right specialist without depending on the LLM to “remember” what exists.

| Skill | Agent | Trigger example |
| --- | --- | --- |
| jail-status | Sysadmin | "Check if db jail is running" |
| disk-usage | Sysadmin | "How much free disk?" |
| system-stats | Sysadmin | "CPU and memory load?" |
| service-restart | Sysadmin | "Restart nginx service" |
| backup-db | DBA | "Back up the database" |
| db-vacuum | DBA | "Run vacuum on system_brain" |
| db-migrate | DBA | "Apply pending migrations" |
| git-merge | Git Admin | "Merge PR #42 into main" |
| git-release-tag | Git Admin | "Tag version v0.12.0" |

The catalog evolves over time. For the authoritative current list, send /skills in Telegram or run "just skill-list" on the host.

Agents also have access to the skills_search extension tool, which queries the skills catalog at runtime to find relevant skills without consuming session tokens.
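
A tag-based lookup of this kind might look like the sketch below. It is not the skills_search implementation: the `SkillEntry` shape, the tags attached to each skill, and the scoring rule (count of query terms found in the tags or the name) are all assumptions made for the example.

```typescript
interface SkillEntry {
  name: string;
  agent: string;
  tags: string[];
}

// Skill names and owners follow the table above; the tags are hypothetical.
const CATALOG: SkillEntry[] = [
  { name: "jail-status", agent: "Sysadmin", tags: ["jail", "status"] },
  { name: "db-vacuum", agent: "DBA", tags: ["postgres", "maintenance"] },
  { name: "backup-db", agent: "DBA", tags: ["postgres", "backup"] },
];

// Rank entries by how many query terms appear in their tags or name.
function searchSkills(query: string, catalog: SkillEntry[] = CATALOG): SkillEntry[] {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  return catalog
    .map((entry) => ({
      entry,
      score: terms.filter((t) => entry.tags.includes(t) || entry.name.includes(t)).length,
    }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((r) => r.entry);
}

// Both DBA skills match "postgres"; backup-db also matches "backup" and ranks first.
const matches = searchSkills("postgres backup").map((s) => s.name);
```

Because the lookup runs outside the model, the agent pays tokens only for the short result list rather than for the whole catalog.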


Implementation Progress

Built in 7 phases. Each phase adds one module and turns its test todos green.

| Phase | Module | Status |
| --- | --- | --- |
| 1 | DB schema + provisioning (setup/controlplane.ts) | |
| 2 | HTTP API routes (src/controlplane-api.ts) | |
| 3 | Control plane runner (src/controlplane-runner.ts) | |
| 4 | Budget enforcement (src/controlplane-budget.ts) | |
| 5 | Session persistence (src/agent-session.ts) | |
| 6 | Skills discovery (src/skills-discovery.ts) | |
| 7 | Scheduler integration (src/task-scheduler.ts) | |

Setup

just setup-controlplane

# Output:
# ✓ Creating control plane tables...
# ✓ Hiring orchestrator agent...
# ✓ Hiring Sysadmin agent (heartbeat: 24h)...
# ✓ Hiring DBA agent...
# ✓ Hiring Git Admin agent...
# ✓ Copying 15 operational skills to data/skills/...
# ✓ Operator account created: clawdie
# ✓ Harness: run in terminal (no browser dashboard)

Runtime Observability

Every agent run (orchestrator main chat or specialist heartbeat) records three provider/model values in agent_activity.payload:

| Field | Meaning |
| --- | --- |
| configured_* | What provider.env says (DEEPSEEK_API_KEY) |
| effective_* | What was actually passed to pi (after fallback swap) |
| actual_* | What pi reports having used (parsed from session JSONL) |

configured_* and effective_* differ when provider fallback is active (cooldown is live, runtime is using the operator's chosen fallback). actual_* should match effective_* for a successful run; a divergence suggests pi rewrote the model selection internally.
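
The comparison described above can be expressed as a small classifier. The nested `configured` / `effective` / `actual` objects are a simplification of the payload fields, and the provider and model strings are placeholders. The real payload layout may differ.

```typescript
interface ProviderModel {
  provider: string;
  model: string;
}

// Simplified view of the three provider/model values recorded per run.
interface RunPayload {
  configured: ProviderModel;
  effective: ProviderModel;
  actual: ProviderModel;
}

type RunStatus = "normal" | "fallback-active" | "divergent";

const sameSelection = (a: ProviderModel, b: ProviderModel): boolean =>
  a.provider === b.provider && a.model === b.model;

// Divergence (effective vs actual) is checked first because it signals a problem;
// a configured vs effective difference only means the fallback is in use.
function classifyRun(p: RunPayload): RunStatus {
  if (!sameSelection(p.effective, p.actual)) return "divergent";
  return sameSelection(p.configured, p.effective) ? "normal" : "fallback-active";
}

const primary: ProviderModel = { provider: "primary", model: "model-a" };
const fallback: ProviderModel = { provider: "fallback", model: "model-b" };

const duringCooldown = classifyRun({ configured: primary, effective: fallback, actual: fallback });
```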

/budgetreport and /tokens surface these values; /policy shows the fallback cooldown line when one is active.

References

  • SOUL.md, SYSADMIN_AGENT.md, DB_ADMIN_AGENT.md, GIT_ADMIN_AGENT.md — agent identity files
  • Provider Fallback — automatic provider switching when the primary hits a usage cap
  • Structured Reports — operator-facing report family + free-text routing
  • Colibri Architecture — the Rust control plane replacing this TypeScript implementation