Claude (domedog) b6bdc829e4 docs: add capability-based task routing design

Multi-OS routing: hosts advertise capability tags, tasks declare
required_capabilities, Colibri's scheduler (pick_agent/capability_match_score,
already implemented) places each task on a qualifying host. Documents the
vocabulary, the probe->capability mapping, the SkillManifest.required_capabilities
addition, central-daemon topology, and the tmux-screenshot skill as the worked
example (why dropping FreeBSD Pillow loses no capability).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-17 16:06:00 +02:00


Capability-Based Task Routing

Principle: a tool that one OS can't support is not a loss — it's a routing constraint. In a multi-agent, multi-OS fleet we don't force every capability onto every host. We let each host advertise what it can do, let each task declare what it needs, and let the scheduler send the task to a host that qualifies. FreeBSD stays lean; the capability simply lives where it's cheap.

This is the operational payoff of the dual-OS survivability model: heterogeneous hosts, one task board, automatic placement.

What Colibri already provides

The matching engine exists today in colibri-daemon — this is wiring, not a rewrite:

Agents carry capability tags — agents.capabilities (JSON array) in the store (colibri-store schema); registered via colibri client / --capabilities.
Tasks declare requirements — jobs and intake requests carry required_capabilities (colibri intake-task --capabilities <csv>).
The scheduler matches — pick_agent(required, agents) scores each idle/active agent with capability_match_score and picks the best fit.
Unmatched = parked, not failed — if requirements are non-empty and no online agent matches, pick_agent returns None: the task is created but left unassigned until a capable agent appears. Exactly the behaviour we want — a screenshot task waits for a Linux host rather than failing on FreeBSD.
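The matching rules above can be modelled in a few lines. This is a minimal sketch, not the colibri-daemon implementation: the function names pick_agent and capability_match_score come from this document, but the Agent type, the status values, and the scoring formula (fewer surplus tags wins) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    # Hypothetical model of a row in agents; only `capabilities` is from the doc.
    name: str
    capabilities: frozenset[str]
    status: str = "idle"  # assumed states: "idle", "active", "offline"


def capability_match_score(required: set[str], agent: Agent) -> Optional[int]:
    """Return None if the agent lacks any required tag, else a score.

    Assumed scoring: an agent with fewer surplus tags scores higher, so
    specialised hosts are not tied up by generic work.
    """
    if not required <= agent.capabilities:
        return None
    return -len(agent.capabilities - required)


def pick_agent(required: set[str], agents: list[Agent]) -> Optional[Agent]:
    """Pick the best-scoring idle/active agent, or None so the task parks."""
    scored = []
    for agent in agents:
        if agent.status not in ("idle", "active"):
            continue
        score = capability_match_score(required, agent)
        if score is not None:
            scored.append((score, agent))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


fleet = [
    Agent("osa", frozenset({"freebsd", "freebsd-jail", "zfs", "rust"})),
    Agent("debby", frozenset({"linux", "docker", "gui", "screenshot"})),
]
screenshot_host = pick_agent({"screenshot"}, fleet)  # the Linux host
gpu_host = pick_agent({"gpu"}, fleet)  # nobody qualifies, so the task parks
```

Returning None rather than raising is what keeps an unmatched task on the board instead of failing it.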

What we add to realize it

| Piece | Status | Action |
| --- | --- | --- |
| Capability vocabulary | tags are free-form (`rust`, `python`, `linux`) | Agree a shared tag set (below) |
| Agents advertise real capabilities | manual / ad-hoc | Derive from `verify_facts_probe.py`; register at agent start |
| Skills declare their needs | `SkillManifest` has no requirements field | Add `required_capabilities: Vec<String>`; scheduler reads it |
| Cross-host agent pool | daemon listens on a local Unix socket only | One orchestrator daemon (debby/Hermes); remote agents reach it over Tailscale |
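The manifest change in the third row can be sketched as follows. The real SkillManifest is a Rust struct whose other fields are not described here, so this Python model is illustrative only: the name field, the dict-based loader, and the rule that a missing field means "no requirements" are assumptions. Only required_capabilities and the --capabilities CSV argument come from this document.

```python
from dataclasses import dataclass, field


@dataclass
class SkillManifest:
    name: str  # hypothetical; stands in for the manifest's existing fields
    # Proposed addition. An empty list is assumed to mean "runs anywhere".
    required_capabilities: list[str] = field(default_factory=list)


def manifest_from_dict(raw: dict) -> SkillManifest:
    """Build a manifest, tolerating files written before the field existed."""
    return SkillManifest(
        name=raw["name"],
        required_capabilities=list(raw.get("required_capabilities", [])),
    )


def intake_args(manifest: SkillManifest) -> list[str]:
    """Render the requirements as the CSV that --capabilities expects."""
    if not manifest.required_capabilities:
        return []
    return ["--capabilities", ",".join(manifest.required_capabilities)]


shot = manifest_from_dict(
    {"name": "tmux-screenshot", "required_capabilities": ["screenshot"]}
)
legacy = manifest_from_dict({"name": "legacy-skill"})
```

Defaulting the field to an empty list would let existing manifests load unchanged, which is one way to keep the change to wiring rather than migration.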

Cross-host topology (the one real decision)

The daemon's socket is local, so today the agent pool is per-host. To route across hosts, agents on every host must be visible to one scheduler. Recommended:

Central orchestrator daemon on debby (Hermes). Agents on domedog/osa reach its socket over Tailscale (forwarded via SSH/socat). Hermes is already the designated orchestrator, so this matches the agent matrix.
Alternative (heavier, deferred): daemon-to-daemon federation.

Capability vocabulary (initial)

Flat, explicit tags: the matcher does exact string comparison with no implied hierarchy, so a host advertising python3.12 does not satisfy a task that requires python. Tags are sourced from the probe and recorded per host in HOST-MATRIX.md.

| Category | Tags |
| --- | --- |
| OS | `linux`, `freebsd` |
| Isolation | `docker`, `freebsd-jail` |
| Display | `gui`, `screenshot`, `wayland` |
| Hardware | `gpu`, `zfs` |
| Runtime | `python3.12`, `node24`, `rust`, `go` |
| Media | `ffmpeg`, `pillow`/`image-render` |

Hosts advertise only what they truly have. Example from the current fleet:

domedog / debby (Linux): linux, docker, gui, screenshot, image-render, …
osa (FreeBSD): freebsd, freebsd-jail, zfs, rust, … (no screenshot/image-render)
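Deriving those tag lists from probe output could look roughly like this. The output format of verify_facts_probe.py is not described in this document, so every fact key below (os, docker, jails, display, and so on) is a hypothetical placeholder. Only the tags themselves are taken from the vocabulary table.

```python
# Hypothetical probe fact keys mapped onto tags from the shared vocabulary.
FACT_TO_TAG = {
    "docker": "docker",
    "jails": "freebsd-jail",
    "display": "gui",
    "gpu": "gpu",
    "zfs": "zfs",
    "ffmpeg": "ffmpeg",
    "pillow": "image-render",
}


def capabilities_from_facts(facts: dict) -> list[str]:
    """Turn probe facts into a sorted tag list; absent or false facts add nothing."""
    tags = set()
    if facts.get("os") in ("linux", "freebsd"):
        tags.add(facts["os"])
    for key, tag in FACT_TO_TAG.items():
        if facts.get(key) is True:
            tags.add(tag)
    return sorted(tags)


def capabilities_csv(facts: dict) -> str:
    """CSV form, suitable for passing to --capabilities at agent start."""
    return ",".join(capabilities_from_facts(facts))


osa_csv = capabilities_csv(
    {"os": "freebsd", "jails": True, "zfs": True, "pillow": False}
)
debby_csv = capabilities_csv(
    {"os": "linux", "docker": True, "display": True, "pillow": True}
)
```

Because a tag is emitted only when the probe reports the fact as true, a host cannot advertise a capability it lacks just by being in the fleet.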

Worked example: the tmux-screenshot skill

This is why we could drop py312-pillow from the FreeBSD ISO without losing the skill:

FreeBSD image drops Pillow — stays lean (pkg-list carries only python312).
The skill manifest declares required_capabilities: ["screenshot"] (or image-render).
Only Linux hosts advertise screenshot (Pillow is trivial to install there).
Colibri routes any screenshot task to debby/domedog automatically; if both are offline the task parks until one returns.
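The park-and-resume behaviour in the last step can be simulated with a toy task board. This is a sketch under stated assumptions: the Board and Task types are hypothetical, placement is first-fit rather than scored, and re-checking parked tasks whenever an agent comes online is an assumed mechanism. The document only states that the task stays unassigned until a capable agent appears.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    name: str
    required_capabilities: set[str]
    assigned_to: Optional[str] = None  # None means the task is parked


@dataclass
class Board:
    online: dict[str, set[str]] = field(default_factory=dict)
    tasks: list[Task] = field(default_factory=list)

    def _place(self, task: Task) -> None:
        # First online host whose tags cover every requirement (exact strings).
        for host, tags in self.online.items():
            if task.required_capabilities <= tags:
                task.assigned_to = host
                return

    def submit(self, task: Task) -> Task:
        self.tasks.append(task)
        self._place(task)
        return task

    def agent_online(self, host: str, tags: set[str]) -> None:
        self.online[host] = set(tags)
        # Assumed: parked tasks are retried when the pool changes.
        for task in self.tasks:
            if task.assigned_to is None:
                self._place(task)


board = Board()
board.agent_online("osa", {"freebsd", "freebsd-jail", "zfs", "rust"})
shot = board.submit(Task("tmux-screenshot", {"screenshot"}))
parked_first = shot.assigned_to is None  # only FreeBSD is online
board.agent_online("debby", {"linux", "docker", "gui", "screenshot"})
```

After debby registers, the parked task is assigned to it without being resubmitted.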

The capability moved hosts. It was never lost.

See AGENTS.md for the agent matrix, HOST-MATRIX.md for per-host facts, and TOOLCHAIN.md for runtime versions.
