colibri/docs/wiki/a2a-complexity-audit.md


A2A Complexity Audit

Question: Does A2A reduce Colibri's code complexity, or is it additive?

Date: 27.jun.2026

Referenced from: hive-pane.md, hive-routing.md

Current protocol surface area

Colibri speaks 5 protocols today:

| Protocol | Where | Lines | Purpose |
|---|---|---|---|
| Custom JSON wire | crates/colibri-daemon/src/socket.rs + crates/colibri-client/src/lib.rs | 1,981 | Local daemon control (spawn, status, snapshot, tasks, skills) |
| MCP JSON-RPC | crates/colibri-mcp/src/lib.rs | 570 | Editor integration + external MCP host |
| MCP-over-SSH | packaging/mother/ (3 files) | 437 | Mother hive entrypoint (forced-command allowlist + node register) |
| JSONL | crates/colibri-glasspane/src/lib.rs | 1,186 | Agent subprocess stdout events |
| SQL | crates/colibri-ledger/src/lib.rs + crates/colibri-ledger/src/schema.rs | 1,150 | Local coordination (tasks, agents, skills, tenants) |

Total protocol surface: ~5,324 lines.


What A2A would replace

1. Mother MCP-over-SSH bridge → A2A HTTP endpoint

Today's mother entrypoint:

USB node → SSH (authorized_keys forced-command) → colibri-mcp-ssh → colibri-mcp → PostgreSQL
                                                                       └─ node-register-mcp (embedded psql)
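To make the "forced-command allowlist" step concrete, here is a minimal sketch of the check such a wrapper performs. When `authorized_keys` pins a forced command, sshd passes the command the client actually asked for in the `SSH_ORIGINAL_COMMAND` environment variable. The allowlist entries below are assumptions taken from the diagram, and `is_allowed` is a hypothetical name; the real `colibri-mcp-ssh` script may differ.

```rust
/// Commands a node may request through the forced-command wrapper.
/// ASSUMPTION: both names are taken from the diagram above; the real
/// allowlist in colibri-mcp-ssh may contain different entries.
const ALLOWED_COMMANDS: &[&str] = &["colibri-mcp", "node-register-mcp"];

/// Returns true only when the requested command matches an allowlist
/// entry exactly. `original_command` mirrors the value sshd places in
/// SSH_ORIGINAL_COMMAND. Exact matching rejects trailing arguments and
/// shell metacharacters such as `;` or `&&`.
pub fn is_allowed(original_command: &str) -> bool {
    let requested = original_command.trim();
    ALLOWED_COMMANDS.iter().any(|allowed| *allowed == requested)
}
```

A wrapper built on this would read `std::env::var("SSH_ORIGINAL_COMMAND")`, call `is_allowed`, and exit non-zero on a mismatch. This is the piece of per-node policy that disappears when the bridge moves to HTTPS.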

With A2A:

USB node → HTTPS → mother A2A endpoint → PostgreSQL
                    └─ /a2a (task exchange)
                    └─ /.well-known/agent.json (discovery)
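The two paths in the diagram suggest a very small routing surface on mother. The sketch below shows that dispatch with the standard library only. The `Route` enum and `route` function are hypothetical names, and the HTTP methods (GET for discovery, POST for task exchange) are assumptions rather than something this audit specifies.

```rust
/// The routes a mother A2A endpoint would need to serve.
#[derive(Debug, PartialEq)]
pub enum Route {
    /// Discovery document (Agent Card).
    AgentCard,
    /// Task exchange between agents.
    TaskExchange,
    /// Anything else.
    NotFound,
}

/// Map an HTTP method and path to a route.
/// ASSUMPTION: discovery is a GET and task exchange is a POST.
pub fn route(method: &str, path: &str) -> Route {
    match (method, path) {
        ("GET", "/.well-known/agent.json") => Route::AgentCard,
        ("POST", "/a2a") => Route::TaskExchange,
        _ => Route::NotFound,
    }
}
```

Everything outside these two routes returns `NotFound`, which plays the role the forced-command allowlist plays today: a closed set of permitted entrypoints.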

Removed:

  • colibri-mcp-ssh (32 lines) — SSH forced-command allowlist wrapper
  • node-register-mcp (88 lines) — Custom MCP tool with embedded psql
  • SSH key management in setup-mother.sh (~40 lines of key distribution logic)

Removed total: ~160 lines.

Added:

  • A2A HTTP endpoint on mother (~200 lines)
  • A2A client library integration on USB node (~150 lines)
  • mTLS/TLS termination for auth (~30 lines)

Added total: ~380 lines.

Net delta: +220 lines. Not a code reduction. But operational complexity drops significantly:

  • No SSH key distribution to USB nodes (key lives on seed partition → no longer needed on mother)
  • No forced-command allowlist to maintain
  • Standard HTTPS is easier to firewall, audit, and monitor than SSH forced-command
  • Agent Card URL is discoverable without manual external MCP registry entries

2. External MCP server discovery → Agent Card

Today: external MCP registry config — manual JSON listing third-party MCP servers:

{
  "servers": [
    {
      "name": "filesystem",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
      "env": {}
    }
  ]
}

With A2A: third-party tools that speak A2A (not MCP) publish an Agent Card. Colibri discovers them via the well-known Agent Card URL instead of manual JSON config files.
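As a rough illustration of what replaces the manual registry entry, discovery reduces to deriving one URL per peer from its base address. The helper name `agent_card_url` and the example host are hypothetical. The path is the one used elsewhere in this document.

```rust
/// Build the Agent Card URL for a peer from its base URL.
/// The path matches the one shown in the mother endpoint diagram;
/// a trailing slash on the base is tolerated.
pub fn agent_card_url(base: &str) -> String {
    format!("{}/.well-known/agent.json", base.trim_end_matches('/'))
}
```

A client would fetch this URL and read the peer's capabilities from the response instead of carrying a `command`/`args` entry per server. Fetching and parsing are out of scope for this sketch.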

Reality check: No third-party tools speak A2A yet. The protocol was just announced (April 2025). MCP has ~2 years of ecosystem maturity. This is a future replacement, not a current one.

Verdict: A2A discovery doesn't reduce code today. External MCP stays for tool access.

3. Ad-hoc cost data format → Typed A2A part

Today: cost data is embedded in the daemon's heartbeat logic — unstructured:

info!(task_id = %task_id, cost = u.cost(), "task cost captured");

With A2A: cost data is a typed message part (application/json+cost). The format is standardized, not ad-hoc.

Code impact: ~10 lines, added rather than saved (the info! log stays; the A2A part is new code).

Verdict: Negligible code impact. The value is interop, not complexity reduction.
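For a sense of scale, a typed cost part could look like the sketch below. The `CostPart` struct, its field names, and the JSON layout are assumptions for illustration. Only the media type string and the `task_id`/`cost` pair come from this section. It uses the standard library alone, so the JSON is formatted by hand.

```rust
/// A cost record carried as a typed message part.
/// ASSUMPTION: the field set mirrors the two values in the log line above.
pub struct CostPart {
    pub task_id: String,
    pub cost: f64,
}

impl CostPart {
    /// Media type named in this section.
    pub const MEDIA_TYPE: &'static str = "application/json+cost";

    /// Serialize to a JSON object, or None when the cost is NaN or
    /// infinite (neither is representable in JSON).
    pub fn to_json(&self) -> Option<String> {
        if !self.cost.is_finite() {
            return None;
        }
        Some(format!(
            r#"{{"task_id":"{}","cost":{}}}"#,
            escape(&self.task_id),
            self.cost
        ))
    }
}

/// Escape quotes, backslashes, and control characters for a JSON string.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

The point of the typed part is that a receiving agent can dispatch on `MEDIA_TYPE` rather than scraping a log line. A real implementation would more likely use a JSON library than hand formatting.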


What A2A does NOT replace

| Component | Why A2A doesn't touch it | Lines saved |
|---|---|---|
| Unix socket wire protocol (crates/colibri-daemon/src/socket.rs) | A2A is cross-node HTTP. Local daemon control needs IPC: a Unix socket is faster, auth-free (filesystem permissions), and doesn't need a network stack. | 0 |
| Spawner (crates/colibri-daemon/src/spawner.rs) | A2A routes tasks to existing agents. Colibri creates agents by spawning subprocesses. A2A has no process lifecycle concept. | 0 |
| Glasspane (crates/colibri-glasspane/src/lib.rs) | A2A doesn't watch subprocess stdout. Glasspane is a PTY observer that reads JSONL from child processes. A2A operates one layer above. | 0 |
| Store (crates/colibri-ledger/src/lib.rs) | A2A doesn't replace local SQLite coordination. Each node needs local persistence for task board, agents, and skills. A2A is the transport, not the database. | 0 |
| MCP editor bridge | A2A is agent-to-agent. MCP is human-to-tool. Different protocols for different directions. They coexist. | 0 |
| Contracts schemas (crates/colibri-contracts/src/lib.rs) | A2A uses JSON Schema for input validation. Colibri's contracts are already compatible, so no change is needed. | 0 |

Total irreplaceable: ~5,000 lines. A2A doesn't reduce this at all.


Net complexity analysis

                         BEFORE      AFTER A2A
                         ──────      ─────────
Unix socket protocol      1,981       1,981        (unchanged)
MCP bridge                  570         570        (unchanged)
Mother MCP-over-SSH         437           0        (REMOVED)
A2A endpoint                  0         380        (NEW)
Glasspane JSONL           1,186       1,186        (unchanged)
SQLite store              1,150       1,150        (unchanged)
Contracts schemas           200         200        (unchanged)
                         ──────      ──────
TOTAL                     5,524       5,467
                         ──────      ──────

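Net delta: −57 lines, assuming the entire 437-line packaging/mother/ bridge is deleted. Section 1 itemizes only ~160 of those lines as removable; on that count the delta is +220 lines. Either way, the code moves around; it doesn't shrink.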
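The totals in the table above can be re-derived mechanically. The sketch below copies the per-component figures from the table and sums them. The `ROWS` constant and `totals` function are illustrative names, not part of the codebase.

```rust
/// (component, lines before, lines after A2A); figures copied from the table.
pub const ROWS: &[(&str, i64, i64)] = &[
    ("Unix socket protocol", 1_981, 1_981),
    ("MCP bridge", 570, 570),
    ("Mother MCP-over-SSH", 437, 0),
    ("A2A endpoint", 0, 380),
    ("Glasspane JSONL", 1_186, 1_186),
    ("SQLite store", 1_150, 1_150),
    ("Contracts schemas", 200, 200),
];

/// Returns (total before, total after, after minus before).
pub fn totals(rows: &[(&str, i64, i64)]) -> (i64, i64, i64) {
    let before: i64 = rows.iter().map(|r| r.1).sum();
    let after: i64 = rows.iter().map(|r| r.2).sum();
    (before, after, after - before)
}
```

`totals(ROWS)` yields `(5524, 5467, -57)`, matching the TOTAL row. The result depends on the "Mother MCP-over-SSH" row dropping to zero. If only part of that directory is removed, the delta turns positive.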


The real trade-off

A2A is not a complexity reduction play. It's an interoperability and operational simplicity play:

| Metric | MCP-over-SSH (current) | A2A (proposed) |
|---|---|---|
| Lines of code | ~5,524 (spread across 6 crates + 3 shell scripts) | ~5,467 (SSH scripts gone, A2A handler added) |
| Protocol count | 5 | 6 (A2A adds one) |
| Operational complexity | SSH keys × N nodes, forced-command allowlists, peer auth setup | One HTTPS endpoint, mTLS certs, well-known URL |
| Discoverability | Manual external MCP registry entries | Agent Card at well-known URL |
| Interoperability | Colibri-only | Any A2A client |
| Debuggability | ssh -v, psql, jq | curl, browser devtools, standard HTTP tooling |
| Ecosystem maturity | N/A (Colibri-specific) | Protocol announced April 2025, zero adoption |
| When it pays off | Works today for 4 nodes | Pays off at 10+ nodes, or when 3rd-party tools ship A2A |

Recommendation: Later, not now

The right window for A2A is when one of these becomes true:

  1. We have >10 hive nodes — SSH key distribution becomes painful
  2. A third-party tool ships A2A support — interop value materializes
  3. We want federation — multiple hives discovering each other
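The three triggers above amount to a simple disjunction, sketched here so the rule can be checked against the hive's actual state. `HiveState` and `a2a_window_open` are hypothetical names. The threshold of 10 nodes is taken from the first trigger.

```rust
/// Inputs to the "is it time for A2A?" decision.
pub struct HiveState {
    /// Number of hive nodes currently deployed.
    pub node_count: u32,
    /// True once at least one third-party tool ships A2A support.
    pub third_party_a2a_available: bool,
    /// True if multiple hives need to discover each other.
    pub federation_wanted: bool,
}

/// Any single trigger is enough to open the window.
pub fn a2a_window_open(s: &HiveState) -> bool {
    s.node_count > 10 || s.third_party_a2a_available || s.federation_wanted
}
```

With the 4 nodes mentioned in the trade-off table and both flags false, the function returns `false`, which is the "later, not now" outcome.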

Until then: the current MCP-over-SSH bridge is 437 lines of boring, working code. A2A would add 380 lines for a protocol that has zero adopters. The code savings (~57 lines) don't justify the protocol risk.

Phase 2 (next sprint) should not include A2A. Build the routing engine on the existing MCP bridge. Add A2A as Phase 3 — when the protocol has real-world adoption and Colibri has enough nodes to benefit from discovery.

The A2A section of hive-pane.md is a good north-star design doc. It stays in the wiki as "planned," but it shouldn't drive implementation priority.