docs: add capability-based task routing design

Multi-OS routing: hosts advertise capability tags, tasks declare
required_capabilities, Colibri's scheduler (pick_agent/capability_match_score,
already implemented) places each task on a qualifying host. Documents the
vocabulary, the probe->capability mapping, the SkillManifest.required_capabilities
addition, central-daemon topology, and the tmux-screenshot skill as the worked
example (why dropping FreeBSD Pillow loses no capability).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude (domedog) 2026-06-17 16:06:00 +02:00
parent 04c65e73bc
commit b6bdc829e4


@@ -0,0 +1,78 @@
# Capability-Based Task Routing
**Principle: a tool that one OS can't support is not a loss — it's a routing
constraint.** In a multi-agent, multi-OS fleet we don't force every capability onto
every host. We let each host advertise what it can do, let each task declare what it
needs, and let the scheduler send the task to a host that qualifies. FreeBSD stays lean;
the capability simply lives where it's cheap.
This is the operational payoff of the dual-OS survivability model: heterogeneous hosts,
one task board, automatic placement.
## What Colibri already provides
The matching engine exists today in `colibri-daemon` — this is wiring, not a rewrite:
- **Agents carry capability tags:** `agents.capabilities` (JSON array) in the store
(`colibri-store` schema), registered through the `colibri` client with `--capabilities`.
- **Tasks declare requirements:** jobs and intake requests carry `required_capabilities`
(`colibri intake-task --capabilities <csv>`).
- **The scheduler matches:** `pick_agent(required, agents)` scores each idle/active agent
with `capability_match_score` and picks the best fit.
- **Unmatched = parked, not failed:** if requirements are non-empty and no online agent
matches, `pick_agent` returns `None`, so the task is created but left **unassigned until a
capable agent appears**. This is exactly the behaviour we want: a screenshot task waits for
a Linux host rather than failing on FreeBSD.
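
The matching behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `colibri-daemon`: the `Agent` and `AgentStatus` types are hypothetical, and the scoring rule (an agent qualifies only if it holds every required tag, and fewer surplus tags wins) is an assumption, since this document does not define how `capability_match_score` ranks agents.

```rust
use std::collections::HashSet;

/// Hypothetical agent state; the real store schema may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Idle,
    Active,
    Offline,
}

/// Hypothetical agent record carrying its advertised capability tags.
#[derive(Debug, Clone)]
pub struct Agent {
    pub name: String,
    pub status: AgentStatus,
    pub capabilities: HashSet<String>,
}

impl Agent {
    pub fn new(name: &str, status: AgentStatus, tags: &[&str]) -> Self {
        Agent {
            name: name.to_string(),
            status,
            capabilities: tags.iter().map(|t| t.to_string()).collect(),
        }
    }
}

/// Assumed rule: `None` if any required tag is missing (exact string match).
/// Otherwise the score is the negated count of surplus tags, so a higher
/// score means a tighter fit.
pub fn capability_match_score(required: &[&str], agent: &Agent) -> Option<i64> {
    if !required.iter().all(|tag| agent.capabilities.contains(*tag)) {
        return None;
    }
    let surplus = agent
        .capabilities
        .iter()
        .filter(|c| !required.contains(&c.as_str()))
        .count();
    Some(-(surplus as i64))
}

/// Picks the best-scoring agent that is not offline.
/// `None` means no capable agent is available and the task stays parked.
pub fn pick_agent<'a>(required: &[&str], agents: &'a [Agent]) -> Option<&'a Agent> {
    agents
        .iter()
        .filter(|a| a.status != AgentStatus::Offline)
        .filter_map(|a| capability_match_score(required, a).map(|score| (score, a)))
        .max_by_key(|(score, _)| *score)
        .map(|(_, a)| a)
}
```

In this sketch, a fleet where only `debby` advertises `screenshot` yields `debby` for `pick_agent(&["screenshot"], &fleet)`. If `debby` is offline the call returns `None`, which corresponds to the parked state.
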
## What we add to realize it
| Piece | Status | Action |
| ----- | ------ | ------ |
| Capability vocabulary | tags are free-form (`rust`, `python`, `linux`) | Agree a shared tag set (below) |
| Agents advertise real capabilities | manual / ad-hoc | Derive from `verify_facts_probe.py`; register at agent start |
| Skills declare their needs | `SkillManifest` has no requirements field | Add `required_capabilities: Vec<String>`; scheduler reads it |
| Cross-host agent pool | daemon listens on a **local Unix socket only** | One orchestrator daemon (debby/Hermes); remote agents reach it over Tailscale |
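
As a rough sketch of the `SkillManifest` change in the table above: only the `required_capabilities: Vec<String>` field comes from this design. The `name` field is a placeholder for whatever the real manifest carries, and `task_requirements` is a hypothetical helper that assumes a task may use several skills and needs the union of their tags.

```rust
/// Hypothetical shape: only `required_capabilities` is taken from this
/// design; `name` stands in for the real manifest's existing fields.
#[derive(Debug, Clone, Default)]
pub struct SkillManifest {
    pub name: String,
    /// Tags a host must advertise before the skill may run there.
    /// Assumed: an empty list places no constraint on the host.
    pub required_capabilities: Vec<String>,
}

/// Hypothetical helper: the union of the requirements of every skill a task
/// uses, sorted and deduplicated so the scheduler sees a stable list.
pub fn task_requirements(skills: &[SkillManifest]) -> Vec<String> {
    let mut tags: Vec<String> = skills
        .iter()
        .flat_map(|s| s.required_capabilities.iter().cloned())
        .collect();
    tags.sort();
    tags.dedup();
    tags
}
```

Keeping the field a plain list of strings matches the flat tag model: the scheduler can pass it straight to `pick_agent` without any translation step.
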
### Cross-host topology (the one real decision)
The daemon's socket is local, so today the agent pool is per-host. To route *across*
hosts, agents on every host must be visible to one scheduler. Recommended:
- **Central orchestrator daemon on debby (Hermes).** Agents on domedog/osa reach its
socket over Tailscale (forwarded via SSH/`socat`). Hermes is already the designated
orchestrator, so this matches the agent matrix.
- Alternative (heavier, deferred): daemon-to-daemon federation.
## Capability vocabulary (initial)
Flat, explicit tags — the matcher does exact string comparison, no implied hierarchy.
Sourced from the probe and recorded per host in [`HOST-MATRIX.md`](./HOST-MATRIX.md).
| Category | Tags |
| -------- | ---- |
| OS | `linux`, `freebsd` |
| Isolation | `docker`, `freebsd-jail` |
| Display | `gui`, `screenshot`, `wayland` |
| Hardware | `gpu`, `zfs` |
| Runtime | `python3.12`, `node24`, `rust`, `go` |
| Media | `ffmpeg`, `pillow`/`image-render` |
Hosts advertise only what they truly have. Example from the current fleet:
- **domedog / debby (Linux):** `linux`, `docker`, `gui`, `screenshot`, `image-render`, …
- **osa (FreeBSD):** `freebsd`, `freebsd-jail`, `zfs`, `rust`, … (no `screenshot`/`image-render`)
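
One way the probe-to-tag mapping could look is sketched below. The output format of `verify_facts_probe.py` is not specified here, so `ProbeFacts` and its fields are purely illustrative, as are both function names. The `VOCABULARY` list copies the tags from the table above. Because the matcher compares strings exactly, a check such as `unknown_tags` can catch a misspelt tag (for example `Linux`) before it silently fails to match.

```rust
/// Tags from the initial vocabulary table.
pub const VOCABULARY: &[&str] = &[
    "linux", "freebsd", "docker", "freebsd-jail", "gui", "screenshot", "wayland",
    "gpu", "zfs", "python3.12", "node24", "rust", "go", "ffmpeg", "pillow",
    "image-render",
];

/// Hypothetical probe result; the field names are assumptions.
#[derive(Debug, Default)]
pub struct ProbeFacts {
    pub os: String, // e.g. "linux" or "freebsd"
    pub has_docker: bool,
    pub has_jails: bool,
    pub has_zfs: bool,
    pub has_display: bool,
    pub can_screenshot: bool,
    pub has_pillow: bool,
}

/// Maps observed facts to flat tags. Nothing is implied: a host gets a tag
/// only when the corresponding fact was observed.
pub fn capabilities_from_probe(facts: &ProbeFacts) -> Vec<&'static str> {
    let mut tags = Vec::new();
    match facts.os.as_str() {
        "linux" => tags.push("linux"),
        "freebsd" => tags.push("freebsd"),
        _ => {} // unknown OS: advertise no OS tag rather than guess
    }
    if facts.has_docker {
        tags.push("docker");
    }
    if facts.has_jails {
        tags.push("freebsd-jail");
    }
    if facts.has_zfs {
        tags.push("zfs");
    }
    if facts.has_display {
        tags.push("gui");
    }
    if facts.can_screenshot {
        tags.push("screenshot");
    }
    if facts.has_pillow {
        tags.push("image-render");
    }
    tags
}

/// Returns every tag that is not in the shared vocabulary (exact match).
pub fn unknown_tags<'a>(tags: &[&'a str]) -> Vec<&'a str> {
    tags.iter()
        .copied()
        .filter(|t| !VOCABULARY.contains(t))
        .collect()
}
```
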
## Worked example: the tmux-screenshot skill
This is why we could drop `py312-pillow` from the FreeBSD ISO without losing the skill:
1. FreeBSD image drops Pillow — stays lean (`pkg-list` carries only `python312`).
2. The skill manifest declares `required_capabilities: ["screenshot"]` (or `image-render`).
3. Only Linux hosts advertise `screenshot` (Pillow is trivial there).
4. Colibri routes any screenshot task to debby/domedog automatically; if both are offline
the task parks until one returns.
The capability moved hosts. It was never lost.
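
The four steps can be replayed with a small sketch. It uses only the tags spelled out in the fleet example (the elided ones are left out) and a deliberately simplified filter rather than the real scheduler. `Host`, `fleet`, and `qualifying_hosts` are hypothetical names introduced for illustration.

```rust
/// Hypothetical host record for the example fleet.
pub struct Host {
    pub name: &'static str,
    pub online: bool,
    pub tags: &'static [&'static str],
}

const LINUX_TAGS: &[&str] = &["linux", "docker", "gui", "screenshot", "image-render"];
const FREEBSD_TAGS: &[&str] = &["freebsd", "freebsd-jail", "zfs", "rust"];

/// The three hosts named in this document; `linux_online` toggles both
/// Linux hosts so the parked case can be shown.
pub fn fleet(linux_online: bool) -> Vec<Host> {
    vec![
        Host { name: "domedog", online: linux_online, tags: LINUX_TAGS },
        Host { name: "debby", online: linux_online, tags: LINUX_TAGS },
        Host { name: "osa", online: true, tags: FREEBSD_TAGS },
    ]
}

/// Names of online hosts that advertise every required tag.
/// An empty result corresponds to a parked task.
pub fn qualifying_hosts(required: &[&str], hosts: &[Host]) -> Vec<&'static str> {
    hosts
        .iter()
        .filter(|h| h.online && required.iter().all(|tag| h.tags.contains(tag)))
        .map(|h| h.name)
        .collect()
}
```

With both Linux hosts online, a task requiring `screenshot` qualifies for `domedog` and `debby` only. With both offline the list is empty, so the task waits, while a `zfs` task can still land on `osa`.
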
_See [`AGENTS.md`](../AGENTS.md) for the agent matrix, [`HOST-MATRIX.md`](./HOST-MATRIX.md)
for per-host facts, and [`TOOLCHAIN.md`](./TOOLCHAIN.md) for runtime versions._