diff --git a/docs/CAPABILITY-ROUTING.md b/docs/CAPABILITY-ROUTING.md
new file mode 100644
index 0000000..f47cc79
--- /dev/null
+++ b/docs/CAPABILITY-ROUTING.md
@@ -0,0 +1,78 @@
+# Capability-Based Task Routing
+
+**Principle: a tool that one OS can't support is not a loss; it's a routing
+constraint.** In a multi-agent, multi-OS fleet we don't force every capability onto
+every host. We let each host advertise what it can do, let each task declare what it
+needs, and let the scheduler send the task to a host that qualifies. FreeBSD stays lean;
+the capability simply lives where it's cheap.
+
+This is the operational payoff of the dual-OS survivability model: heterogeneous hosts,
+one task board, automatic placement.
+
+## What Colibri already provides
+
+The matching engine exists today in `colibri-daemon`, so this is wiring, not a rewrite:
+
+- **Agents carry capability tags:** `agents.capabilities` (JSON array) in the store
+  (`colibri-store` schema); registered via the `colibri` client (`--capabilities`).
+- **Tasks declare requirements:** jobs and intake requests carry `required_capabilities`
+  (`colibri intake-task --capabilities `).
+- **The scheduler matches:** `pick_agent(required, agents)` scores each idle/active agent
+  with `capability_match_score` and picks the best fit.
+- **Unmatched = parked, not failed:** if requirements are non-empty and no online agent
+  matches, `pick_agent` returns `None`: the task is created but left **unassigned until a
+  capable agent appears**. This is exactly the behaviour we want: a screenshot task waits for a
+  Linux host rather than failing on FreeBSD.
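
The matching behaviour described in the list above could look roughly like the sketch below. It is a Python model for illustration, not the code in `colibri-daemon`: the `Agent` type, the status strings, the all-or-nothing scoring rule, and the idle-first tie-break are all assumptions, since the document only names `pick_agent` and `capability_match_score` and says that an unmatched task yields `None`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Agent:
    name: str
    status: str                # assumed values: "idle", "active", "offline"
    capabilities: frozenset    # flat string tags, e.g. {"linux", "docker"}


def capability_match_score(required: frozenset, offered: frozenset) -> int:
    """Return 0 if any required tag is missing, else 1 + number of required tags.

    Assumption: matching is all-or-nothing with exact string comparison.
    """
    if not required <= offered:
        return 0
    return 1 + len(required)


def pick_agent(required: Iterable[str], agents: Iterable[Agent]) -> Optional[Agent]:
    """Pick the best-scoring idle/active agent, or None so the task can park."""
    wanted = frozenset(required)
    best: Optional[Agent] = None
    best_key = (0, False)
    for agent in agents:
        if agent.status not in ("idle", "active"):
            continue  # offline agents are never candidates
        score = capability_match_score(wanted, agent.capabilities)
        key = (score, agent.status == "idle")  # assumed tie-break: idle first
        if score > 0 and key > best_key:
            best, best_key = agent, key
    return best


# Tags taken from the fleet example in this document (elided tags omitted).
fleet = [
    Agent("osa", "idle", frozenset({"freebsd", "freebsd-jail", "zfs", "rust"})),
    Agent("debby", "active",
          frozenset({"linux", "docker", "gui", "screenshot", "image-render"})),
]

chosen = pick_agent(["screenshot"], fleet)
print(chosen.name if chosen else "parked")
```

With only `osa` in the pool, the same call returns `None`, which is the signal to leave the task unassigned rather than mark it failed.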
+
+## What we add to realize it
+
+| Piece | Status | Action |
+| ----- | ------ | ------ |
+| Capability vocabulary | tags are free-form (`rust`, `python`, `linux`) | Agree a shared tag set (below) |
+| Agents advertise real capabilities | manual / ad-hoc | Derive from `verify_facts_probe.py`; register at agent start |
+| Skills declare their needs | `SkillManifest` has no requirements field | Add `required_capabilities: Vec`; scheduler reads it |
+| Cross-host agent pool | daemon listens on a **local Unix socket only** | One orchestrator daemon (debby/Hermes); remote agents reach it over Tailscale |
+
+### Cross-host topology (the one real decision)
+
+The daemon's socket is local, so today the agent pool is per-host. To route *across*
+hosts, agents on every host must be visible to one scheduler. Recommended:
+
+- **Central orchestrator daemon on debby (Hermes).** Agents on domedog/osa reach its
+  socket over Tailscale (forwarded via SSH/`socat`). Hermes is already the designated
+  orchestrator, so this matches the agent matrix.
+- Alternative (heavier, deferred): daemon-to-daemon federation.
+
+## Capability vocabulary (initial)
+
+Flat, explicit tags: the matcher does exact string comparison, with no implied hierarchy.
+Sourced from the probe and recorded per host in [`HOST-MATRIX.md`](./HOST-MATRIX.md).
+
+| Category | Tags |
+| -------- | ---- |
+| OS | `linux`, `freebsd` |
+| Isolation | `docker`, `freebsd-jail` |
+| Display | `gui`, `screenshot`, `wayland` |
+| Hardware | `gpu`, `zfs` |
+| Runtime | `python3.12`, `node24`, `rust`, `go` |
+| Media | `ffmpeg`, `pillow`/`image-render` |
+
+Hosts advertise only what they truly have. Example from the current fleet:
+
+- **domedog / debby (Linux):** `linux`, `docker`, `gui`, `screenshot`, `image-render`, …
+- **osa (FreeBSD):** `freebsd`, `freebsd-jail`, `zfs`, `rust`, … (no `screenshot`/`image-render`)
+
+## Worked example: the tmux-screenshot skill
+
+This is why we could drop `py312-pillow` from the FreeBSD ISO without losing the skill:
+
+1. The FreeBSD image drops Pillow and stays lean (`pkg-list` carries only `python312`).
+2. The skill manifest declares `required_capabilities: ["screenshot"]` (or `image-render`).
+3. Only Linux hosts advertise `screenshot` (Pillow is trivial there).
+4. Colibri routes any screenshot task to debby/domedog automatically; if both are offline,
+   the task parks until one returns.
+
+The capability moved hosts. It was never lost.
+
+_See [`AGENTS.md`](../AGENTS.md) for the agent matrix, [`HOST-MATRIX.md`](./HOST-MATRIX.md)
+for per-host facts, and [`TOOLCHAIN.md`](./TOOLCHAIN.md) for runtime versions._
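
Step 4 of the worked example (route if a capable host is online, otherwise park until one returns) can be modelled with a toy task board. This is a minimal sketch under stated assumptions: `TaskBoard`, its method names, and the idea that parked tasks are re-checked when a host comes online are hypothetical and only illustrate the behaviour; the document does not say how Colibri re-evaluates parked tasks.

```python
class TaskBoard:
    """Toy model: tasks park until a host with the required tags is online."""

    def __init__(self):
        self.online = {}    # host name -> set of capability tags
        self.parked = []    # (task_id, required tags) waiting for a capable host
        self.assigned = {}  # task_id -> host name

    def submit(self, task_id, required):
        """Assign the task to a capable online host, or park it and return None."""
        required = set(required)
        for host, tags in self.online.items():
            if required <= tags:  # exact string tags, no hierarchy
                self.assigned[task_id] = host
                return host
        self.parked.append((task_id, required))
        return None

    def host_online(self, host, tags):
        """Register a host, then hand it every parked task it can satisfy."""
        self.online[host] = set(tags)
        still_parked = []
        for task_id, required in self.parked:
            if required <= self.online[host]:
                self.assigned[task_id] = host
            else:
                still_parked.append((task_id, required))
        self.parked = still_parked

    def host_offline(self, host):
        """Remove a host from the pool; already-assigned tasks are left untouched."""
        self.online.pop(host, None)


board = TaskBoard()
board.host_online("osa", ["freebsd", "freebsd-jail", "zfs", "rust"])

# Only the FreeBSD host is up, so the screenshot task parks instead of failing.
print(board.submit("shot-1", ["screenshot"]))

# A Linux host returns and picks up the parked task.
board.host_online("debby", ["linux", "docker", "gui", "screenshot", "image-render"])
print(board.assigned)
```

The point of the model is that `submit` never raises or fails on a missing capability; the task simply stays in `parked` until a later `host_online` call satisfies it.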