docs: post-Phase-3 wiki accuracy + task-dispatch-flow page
Some checks failed
CI / rust (pull_request) Has been cancelled
CI / markdown (pull_request) Has been cancelled
CI / port (pull_request) Has been cancelled
CI / agent-jail-pkgs (pull_request) Has been cancelled

- model-selection-and-eval: status Design → Phases 1–3 shipped (#264/#280/#285);
  mark Phase 2/3 deliverables, add 3a scope note, fix stale routing-gap row.
- hive-routing: status → partially shipped; scheduler row reflects pick_agent +
  select_model.
- README + index: model-selection row reflects shipped, not "design".
- New task-dispatch-flow.md: the verified queued→claim→spawn→register→dispatch→
  cost chain with code anchors + "why a task stalls" (stale build, not RPC mode,
  registration linkage). Indexed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Sam & Claude 2026-06-28 18:53:09 +02:00
parent f10b369c1d
commit 04370dd869
5 changed files with 111 additions and 32 deletions

View file

@ -67,7 +67,7 @@ Key pages:
| [glasspane](docs/wiki/glasspane.md) | 5-state machine, attention system, TUI |
| [hive-routing](docs/wiki/hive-routing.md) | Fleet routing, capability matrix, node registration |
| [cost-dashboard](docs/wiki/cost-dashboard.md) | Per-task cost tracking, proof_text, live dashboard |
| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | T2.x eval harness + model selection design |
| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | Eval harness + eval-driven model selection (Ph 13) |
| [mother-hive](docs/wiki/mother-hive.md) | Mother node coordination (PostgreSQL + MCP over SSH) |
| [pull-requests](docs/wiki/pull-requests.md) | Branch naming, commit format, review workflow |
| [quality-gates](docs/wiki/quality-gates.md) | CI pipeline, fmt/clippy/test/wiki-lint gates |

View file

@ -1,6 +1,7 @@
# Hive Member Tracking & Cost-Aware Routing
**Status:** 📋 Design
**Status:** Partially shipped — capability matching + eval-driven model
selection live (#285); stable node UUID and hive-level cost aggregation pending.
**Date:** 24.jun.2026
**Driven by:** T1.5 per-task cost tracking (shipped) → T2.x routing
@ -10,15 +11,15 @@
## What Exists Today
| Component | State | Gap |
| ---------------------------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| `mother_schema.sql` | `hive_nodes` table with `hw_profile` + `capabilities` JSONB | No stable node UUID; hostname is the key |
| `derive_capabilities()` trigger | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile | Only GPU/VRAM heuristics — doesn't probe running services |
| `clawdie-system-probe` | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status | No ollama/llama.cpp probing |
| `node-register-mcp` | UPSERTs hw_profile into `hive_nodes` on join | No UUID generation at join time |
| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching stubs | No cost-aware routing, no hive awareness |
| `colibri-ledger` | Local SQLite `agents` table with UUID (v4 random) | UUID is session-local, not hive-stable |
| T1.5 cost tracking | Per-task cost captured in local SQLite | No hive-level cost aggregation |
| Component | State | Gap |
| ---------------------------------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
| `mother_schema.sql` | `hive_nodes` table with `hw_profile` + `capabilities` JSONB | No stable node UUID; hostname is the key |
| `derive_capabilities()` trigger | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile | Only GPU/VRAM heuristics — doesn't probe running services |
| `clawdie-system-probe` | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status | No ollama/llama.cpp probing |
| `node-register-mcp` | UPSERTs hw_profile into `hive_nodes` on join | No UUID generation at join time |
| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching (`pick_agent`), eval-driven `select_model` | Selection is per-host; no cross-hive awareness yet |
| `colibri-ledger` | Local SQLite `agents` table with UUID (v4 random) | UUID is session-local, not hive-stable |
| T1.5 cost tracking | Per-task cost captured in local SQLite | No hive-level cost aggregation |
## Design Goals
@ -380,7 +381,7 @@ The capability matrix, stable UUIDs, and local LLM probes are the foundation —
| `colibri_query_hive_capabilities` MCP tool | colibri-mcp |
| `colibri_dispatch_to_node` MCP tool | colibri-mcp |
| `hive-routing` skill | `.agent/skills/` |
| `Task.routing` JSONB field in colibri-ledger | colibri-ledger |
| `Task.routing` JSONB field in colibri-ledger | colibri-ledger |
| Mother-side routing score as PostgreSQL function (optional — only if agent-driven routing proves insufficient) | mother_schema.sql |
---

View file

@ -46,6 +46,7 @@ warning.
| Page | What it covers |
| --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| [agent-harness](./agent-harness.md) | The zot (agent) + Colibri (control plane) split; autospawn + RPC driver |
| [task-dispatch-flow](./task-dispatch-flow.md) | End-to-end task path: queued → claimed → spawn → register → dispatch → cost; why a task stalls |
| [agent-events-reference](./agent-events-reference.md) | Per-harness zot event reference, Glasspane mappings, and verified transcript fields |
| [cost-model](./cost-model.md) | Byte-stable prefixes, cache-hit metering, auto-escalation, T14 compaction |
| [glasspane](./glasspane.md) | Agent state machine, JSONL streaming, AgentRuntime taxonomy, snapshot API |
@ -57,14 +58,14 @@ warning.
| [hive-pane](./hive-pane.md) | Glasspane for the hive — multi-node cost observability, A2A discovery, and operator board |
| [cost-dashboard](./cost-dashboard.md) | Mother-side cost observability — human gallery + JSON, screenshot proof linked from cost rows |
| [a2a-complexity-audit](./a2a-complexity-audit.md) | A2A code complexity impact — 6-protocol surface audit, when A2A pays off |
| [model-selection-and-eval](./model-selection-and-eval.md) | T2.x design: model selection (tier arbitrage) + evaluation harness (task success measurement) |
| [model-selection-and-eval](./model-selection-and-eval.md) | Eval harness + eval-driven model selection — Phases 13 shipped, Phase 4 (cloud eval) planned |
| [naming-decisions](./naming-decisions.md) | Ledger of harness-neutral / architecture renames — shipped and in-flight |
| [daemon-not-demon](./daemon-not-demon.md) | Why we say daemon (helper spirit) not demon (bad spirit) — English + Slovenian |
| [layered-soul](./layered-soul.md) | How Colibri consumes the layered-soul reviewed-context repo today vs planned |
| [task-board](./task-board.md) | Capability match scoring, cron scheduling, intake drain, SQLite backing |
| [pull-requests](./pull-requests.md) | PR workflow — branching, review, gates, merge conventions |
| [quality-gates](./quality-gates.md) | `ci-checks.sh` as the pre-merge gate; why drift reached `main` before |
| [contracts](./contracts.md) | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests |
| [contracts](./contracts.md) | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests |
| [store-schema](./store-schema.md) | SQLite coordination schema and migration discipline |
| [external-mcp](./external-mcp.md) | MCP bridge for editors + external stdio MCP host; read/write/external-call gates |
| [operator-cli](./operator-cli.md) | The `colibri` CLI as a thin typed Unix-socket client over the daemon API |
@ -74,4 +75,4 @@ warning.
| [skills-catalog](./skills-catalog.md) | Read-only runtime consumer for reviewed skill artifacts |
| [vault-provision](./vault-provision.md) | Vaultwarden-driven env-file provisioning into jails after agent spawn |
| [deployment](./deployment.md) | Host installer (clawdie): ZFS layout, rc.d/systemd service, dry-run safety |
| [rust-glossary](./rust-glossary.md) | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros |
| [rust-glossary](./rust-glossary.md) | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros |

View file

@ -1,9 +1,16 @@
# T2.x: Model Selection & Evaluation Harness
**Status:** 📋 Design
**Status:** Phases 13 shipped (#264, #280, #285). Phase 4 (cloud eval) planned.
**Date:** 25.jun.2026
**Driven by:** T1.5 per-task cost tracking (shipped) → T2.x model selection + eval
Scope note (Phase 3, 3a): selection is **model-only** — success rate is the
primary signal, cost is a tiebreaker, `cost_mode` owns token spend. Per-
`task_type` routing is deferred to Phase 4 (no task_type data yet). The
selector reads [`Store::model_success_rates`] and runs at agent spawn
(long-lived harnesses, not per-task dispatch); gated by
`COLIBRI_MODEL_SELECTION`, off by default.
> **Companion doc:** [hive-routing](./hive-routing.md) — the capability matrix,
> machine identity, and routing engine. This doc covers what the routing engine
> _optimizes for_ (model selection) and how it knows if it's winning (eval harness).
@ -14,7 +21,7 @@
| ------------------------- | ---------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| `task_costs` (PostgreSQL) | Per-task cost rows with `provider`, `model`, `cost_usd`, `success` | `success` is boolean — agent process exited 0 or not |
| `hive_nodes.capabilities` | JSONB with `has_gpu`, `can_run_local_llm`, `ollama_models` | No success-rate history per model per node |
| Cost tiers (T0T3) | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | No routing decision uses them yet |
| Cost tiers (T0T3) | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | Phase 3 `select_model` now factors cost into routing |
| Agent harness | Spawns zot/pi, tracks session usage | No quality measurement beyond "did it exit 0?" |
## The Problem
@ -344,7 +351,7 @@ task arrives at scheduler
**What this gives us:** Eval infrastructure is in place. We're collecting quality scores from agent self-report. This is the minimum viable eval.
### Phase 2 — Local LLM Eval (3 days)
### Phase 2 — Local LLM Eval (shipped — PR #280)
**Goal:** Independent eval via local LLM.
@ -354,13 +361,13 @@ task arrives at scheduler
| Local eval: spawn local LLM with eval prompt | colibri-daemon | ~60 |
| Fallback logic: self-report → local → cloud → skipped | colibri-daemon | ~40 |
| Eval job scheduler (async, fire-and-forget) | colibri-daemon | ~30 |
| Eval result merge into task_eval | colibri-ledger | ~20 |
| Eval result merge into task_eval | colibri-ledger | ~20 |
**Total:** ~180 lines, 3 days.
**What this gives us:** Independent eval for most tasks. Self-report is still the default, but local LLM eval can verify or override.
### Phase 3 — Model Selection (3 days)
### Phase 3 — Model Selection (shipped — PR #285)
**Goal:** Data-driven routing decisions.
@ -383,7 +390,7 @@ task arrives at scheduler
| Deliverable | Where | Lines |
| --------------------------------------------------- | -------------- | ----- |
| Cloud eval: call Claude/DeepSeek with eval prompt | colibri-daemon | ~50 |
| Cost accounting: eval_cost_usd added to task_eval | colibri-ledger | ~10 |
| Cost accounting: eval_cost_usd added to task_eval | colibri-ledger | ~10 |
| Feedback loop: eval results → routing weight update | colibri-daemon | ~30 |
| Eval aggregation: 5-minute rollup of success rates | colibri-mcp | ~25 |
@ -402,20 +409,21 @@ task arrives at scheduler
- Daemon writes eval result ✅
- Query API for eval data ✅
### Phase 2 — Local LLM Eval
### Phase 2 — Local LLM Eval (shipped — PR #280)
- Eval prompt template (JSON schema)
- Local eval job (spawn local LLM)
- Fallback logic (self-report → local → cloud → skipped)
- Async eval scheduler
- Eval prompt template
- Local eval job (spawn local LLM via ollama) ✅
- Fallback chain self-report → local ✅ (→ cloud lands in Phase 4)
- Async eval (background `spawn_blocking`) ✅
### Phase 3 — Model Selection
### Phase 3 — Model Selection (shipped — PR #285)
- `select_model()` function
- Query eval success rates per (model, task_type)
- Decision rationale logging
- Configurable weights
- Integration with task dispatch
- `select_model()` function ✅
- Query eval success rates per model — `Store::model_success_rates`
(per `task_type` deferred to Phase 4)
- Decision rationale logging ✅
- Configurable weights (`COLIBRI_MODEL_SELECTION_WEIGHT_*`) ✅
- Integration at agent spawn (`recommend_model` → autospawn env) ✅
### Phase 4 — Cloud Eval + Feedback

View file

@ -0,0 +1,69 @@
# Task dispatch flow (queued → processed → cost)
← [index](./index.md)
## What this is
The end-to-end path a task takes from submission to a running agent and back.
This page exists because the chain spans three modules ([task-board](./task-board.md)
scheduling, [agent-harness](./agent-harness.md) spawning, and the daemon poll
loop), and it's a recurring source of "why won't the agent pick up my task?"
confusion. Every stage below is in `main` today.
## The chain
```
operator submits intake-task (socket) cmd_intake_task
→ task row (queued) ───────────────────────────────► store.create_task
scheduler tick (~30s) pick best-fit agent Scheduler::tick
→ claim_task ──────────────────────────────────────► pick_agent + claim_task
autospawn (once) spawn `zot rpc` (stdin piped) autospawn_agent_if_configured
→ register agent ───────────────────────────────────► register_agent (name = spawn id)
daemon poll loop task text → agent stdin poll_tasks
→ send_prompt ──────────────────────────────────────► rpc_sender().send_prompt(task)
→ status: Started │
agent works, emits JSONL → glasspane state → completion → set_task_cost
→ write_task_eval (self-report) + background local eval → push_cost_to_mother → dashboard
```
## Stages
| Stage | What happens | Code |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------- |
| **Submit** | A task row is created in the store with status `queued`. | [`cmd_intake_task`](../../crates/colibri-daemon/src/socket.rs) |
| **Claim** | Each tick, the scheduler picks the best-fit agent by capability and claims the task (`queued → claimed`). | [`Scheduler::tick`](../../crates/colibri-daemon/src/scheduler.rs) → `pick_agent`, `claim_task` |
| **Spawn** | Autospawn starts the harness as **`zot rpc`** (provider `local`), so stdin is piped and an `RpcSender` is available. | [`autospawn_agent_if_configured`](../../crates/colibri-daemon/src/socket.rs), `default_agent_args` |
| **Register** | The spawned agent is registered in the store; its `name` column holds the live spawn-handle id used in `state.agents`. | `register_agent` (store row `name` = spawn id) |
| **Dispatch** | The poll loop resolves the spawn handle from the claimed task's `agent_id`, gets `rpc_sender()`, and writes the task text to the agent's stdin, transitioning the task to `Started`. | [`poll_tasks`](../../crates/colibri-daemon/src/daemon.rs) → `send_prompt` |
| **Process + cost** | The agent works and emits JSONL ([glasspane](./glasspane.md)). On completion, cost and an eval record are written, then pushed to mother for the [dashboard](./cost-dashboard.md). | `set_task_cost`, `write_task_eval`, `push_cost_to_mother` |
## Why a task can stall (and what it is _not_)
The dispatch logic above is all in `main` — a stalled task is almost never
missing code. The usual causes, in order:
1. **Stale deployed build.** The host is running a colibri binary older than
the current `poll_tasks` dispatch or agent-registration fixes. Check
`git rev-parse HEAD` on the host against `origin/main`; reset, rebuild,
restart the daemon.
2. **Agent not in RPC mode.** If the process is `zot --mode json` (not
`zot rpc`), stdin isn't piped, `rpc_sender()` is `None`, and no dispatch
happens. Confirm with `ps`.
3. **Registration linkage broken.** Dispatch needs the store agent row's
`name` to equal the live spawn id in `state.agents`. A mismatch (older
build) means `poll_tasks` can't find the sender.
If you're told "merge branch X to enable dispatch," verify X against `main`
first — the chain is already merged, and re-pushing an auto-deleted branch
hits the [branch recreation hazard](./pull-requests.md).
## See also
- [task-board](./task-board.md) — scheduler internals (capability scoring, intake drain)
- [agent-harness](./agent-harness.md) — zot/Colibri split, autospawn, RPC driver
- [glasspane](./glasspane.md) — how agent stdout becomes observable state
- [cost-dashboard](./cost-dashboard.md) — where cost lands after completion