docs: post-Phase-3 wiki accuracy + task-dispatch-flow page

- model-selection-and-eval: status Design → Phases 1–3 shipped (#264/#280/#285); mark Phase 2/3 deliverables, add 3a scope note, fix stale routing-gap row. - hive-routing: status → partially shipped; scheduler row reflects pick_agent + select_model. - README + index: model-selection row reflects shipped, not "design". - New task-dispatch-flow.md: the verified queued→claim→spawn→register→dispatch→ cost chain with code anchors + "why a task stalls" (stale build, not RPC mode, registration linkage). Indexed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 18:53:09 +02:00 · 2026-06-28 18:53:09 +02:00 · 04370dd869
commit 04370dd869
parent f10b369c1d
5 changed files with 111 additions and 32 deletions
--- a/README.md
+++ b/README.md
@ -67,7 +67,7 @@ Key pages:
 | [glasspane](docs/wiki/glasspane.md)                               | 5-state machine, attention system, TUI               |
 | [hive-routing](docs/wiki/hive-routing.md)                         | Fleet routing, capability matrix, node registration  |
 | [cost-dashboard](docs/wiki/cost-dashboard.md)                     | Per-task cost tracking, proof_text, live dashboard   |
-| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | T2.x eval harness + model selection design           |
+| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | Eval harness + eval-driven model selection (Ph 1–3)  |
 | [mother-hive](docs/wiki/mother-hive.md)                           | Mother node coordination (PostgreSQL + MCP over SSH) |
 | [pull-requests](docs/wiki/pull-requests.md)                       | Branch naming, commit format, review workflow        |
 | [quality-gates](docs/wiki/quality-gates.md)                       | CI pipeline, fmt/clippy/test/wiki-lint gates         |
--- a/docs/wiki/hive-routing.md
+++ b/docs/wiki/hive-routing.md
@ -1,6 +1,7 @@
 # Hive Member Tracking & Cost-Aware Routing

-**Status:** 📋 Design
+**Status:** Partially shipped — capability matching + eval-driven model
+selection live (#285); stable node UUID and hive-level cost aggregation pending.
 **Date:** 24.jun.2026
 **Driven by:** T1.5 per-task cost tracking (shipped) → T2.x routing

@ -10,15 +11,15 @@

 ## What Exists Today

-| Component                                | State                                                                                   | Gap                                                       |
-| ---------------------------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------- |
-| `mother_schema.sql`                      | `hive_nodes` table with `hw_profile` + `capabilities` JSONB                             | No stable node UUID; hostname is the key                  |
-| `derive_capabilities()` trigger          | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile | Only GPU/VRAM heuristics — doesn't probe running services |
-| `clawdie-system-probe`                       | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status                        | No ollama/llama.cpp probing                               |
-| `node-register-mcp`                      | UPSERTs hw_profile into `hive_nodes` on join                                            | No UUID generation at join time                           |
-| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching stubs                                  | No cost-aware routing, no hive awareness                  |
-| `colibri-ledger`                          | Local SQLite `agents` table with UUID (v4 random)                                       | UUID is session-local, not hive-stable                    |
-| T1.5 cost tracking                       | Per-task cost captured in local SQLite                                                  | No hive-level cost aggregation                            |
+| Component                                | State                                                                                       | Gap                                                       |
+| ---------------------------------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------- |
+| `mother_schema.sql`                      | `hive_nodes` table with `hw_profile` + `capabilities` JSONB                                 | No stable node UUID; hostname is the key                  |
+| `derive_capabilities()` trigger          | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile     | Only GPU/VRAM heuristics — doesn't probe running services |
+| `clawdie-system-probe`                   | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status                            | No ollama/llama.cpp probing                               |
+| `node-register-mcp`                      | UPSERTs hw_profile into `hive_nodes` on join                                                | No UUID generation at join time                           |
+| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching (`pick_agent`), eval-driven `select_model` | Selection is per-host; no cross-hive awareness yet        |
+| `colibri-ledger`                         | Local SQLite `agents` table with UUID (v4 random)                                           | UUID is session-local, not hive-stable                    |
+| T1.5 cost tracking                       | Per-task cost captured in local SQLite                                                      | No hive-level cost aggregation                            |

 ## Design Goals

@ -380,7 +381,7 @@ The capability matrix, stable UUIDs, and local LLM probes are the foundation —
 | `colibri_query_hive_capabilities` MCP tool                                                                     | colibri-mcp       |
 | `colibri_dispatch_to_node` MCP tool                                                                            | colibri-mcp       |
 | `hive-routing` skill                                                                                           | `.agent/skills/`  |
-| `Task.routing` JSONB field in colibri-ledger                                                                    | colibri-ledger     |
+| `Task.routing` JSONB field in colibri-ledger                                                                   | colibri-ledger    |
 | Mother-side routing score as PostgreSQL function (optional — only if agent-driven routing proves insufficient) | mother_schema.sql |

 ---
--- a/docs/wiki/index.md
+++ b/docs/wiki/index.md
@ -46,6 +46,7 @@ warning.
 | Page                                                      | What it covers                                                                                                  |
 | --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
 | [agent-harness](./agent-harness.md)                       | The zot (agent) + Colibri (control plane) split; autospawn + RPC driver                                         |
+| [task-dispatch-flow](./task-dispatch-flow.md)             | End-to-end task path: queued → claimed → spawn → register → dispatch → cost; why a task stalls                  |
 | [agent-events-reference](./agent-events-reference.md)     | Per-harness zot event reference, Glasspane mappings, and verified transcript fields                             |
 | [cost-model](./cost-model.md)                             | Byte-stable prefixes, cache-hit metering, auto-escalation, T14 compaction                                       |
 | [glasspane](./glasspane.md)                               | Agent state machine, JSONL streaming, AgentRuntime taxonomy, snapshot API                                       |
@ -57,14 +58,14 @@ warning.
 | [hive-pane](./hive-pane.md)                               | Glasspane for the hive — multi-node cost observability, A2A discovery, and operator board                       |
 | [cost-dashboard](./cost-dashboard.md)                     | Mother-side cost observability — human gallery + JSON, screenshot proof linked from cost rows                   |
 | [a2a-complexity-audit](./a2a-complexity-audit.md)         | A2A code complexity impact — 6-protocol surface audit, when A2A pays off                                        |
-| [model-selection-and-eval](./model-selection-and-eval.md) | T2.x design: model selection (tier arbitrage) + evaluation harness (task success measurement)                   |
+| [model-selection-and-eval](./model-selection-and-eval.md) | Eval harness + eval-driven model selection — Phases 1–3 shipped, Phase 4 (cloud eval) planned                   |
 | [naming-decisions](./naming-decisions.md)                 | Ledger of harness-neutral / architecture renames — shipped and in-flight                                        |
 | [daemon-not-demon](./daemon-not-demon.md)                 | Why we say daemon (helper spirit) not demon (bad spirit) — English + Slovenian                                  |
 | [layered-soul](./layered-soul.md)                         | How Colibri consumes the layered-soul reviewed-context repo today vs planned                                    |
 | [task-board](./task-board.md)                             | Capability match scoring, cron scheduling, intake drain, SQLite backing                                         |
 | [pull-requests](./pull-requests.md)                       | PR workflow — branching, review, gates, merge conventions                                                       |
 | [quality-gates](./quality-gates.md)                       | `ci-checks.sh` as the pre-merge gate; why drift reached `main` before                                           |
-| [contracts](./contracts.md)                               | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests                              |
+| [contracts](./contracts.md)                               | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests                             |
 | [store-schema](./store-schema.md)                         | SQLite coordination schema and migration discipline                                                             |
 | [external-mcp](./external-mcp.md)                         | MCP bridge for editors + external stdio MCP host; read/write/external-call gates                                |
 | [operator-cli](./operator-cli.md)                         | The `colibri` CLI as a thin typed Unix-socket client over the daemon API                                        |
@ -74,4 +75,4 @@ warning.
 | [skills-catalog](./skills-catalog.md)                     | Read-only runtime consumer for reviewed skill artifacts                                                         |
 | [vault-provision](./vault-provision.md)                   | Vaultwarden-driven env-file provisioning into jails after agent spawn                                           |
 | [deployment](./deployment.md)                             | Host installer (clawdie): ZFS layout, rc.d/systemd service, dry-run safety                                      |
-| [rust-glossary](./rust-glossary.md)                       | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros                      |
+| [rust-glossary](./rust-glossary.md)                       | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros                    |
--- a/docs/wiki/model-selection-and-eval.md
+++ b/docs/wiki/model-selection-and-eval.md
@ -1,9 +1,16 @@
 # T2.x: Model Selection & Evaluation Harness

-**Status:** 📋 Design
+**Status:** Phases 1–3 shipped (#264, #280, #285). Phase 4 (cloud eval) planned.
 **Date:** 25.jun.2026
 **Driven by:** T1.5 per-task cost tracking (shipped) → T2.x model selection + eval

+Scope note (Phase 3, 3a): selection is **model-only** — success rate is the
+primary signal, cost is a tiebreaker, `cost_mode` owns token spend. Per-
+`task_type` routing is deferred to Phase 4 (no task_type data yet). The
+selector reads [`Store::model_success_rates`] and runs at agent spawn
+(long-lived harnesses, not per-task dispatch); gated by
+`COLIBRI_MODEL_SELECTION`, off by default.
+
 > **Companion doc:** [hive-routing](./hive-routing.md) — the capability matrix,
 > machine identity, and routing engine. This doc covers what the routing engine
 > _optimizes for_ (model selection) and how it knows if it's winning (eval harness).
@ -14,7 +21,7 @@
 | ------------------------- | ---------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
 | `task_costs` (PostgreSQL) | Per-task cost rows with `provider`, `model`, `cost_usd`, `success`                             | `success` is boolean — agent process exited 0 or not |
 | `hive_nodes.capabilities` | JSONB with `has_gpu`, `can_run_local_llm`, `ollama_models`                                     | No success-rate history per model per node           |
-| Cost tiers (T0–T3)        | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | No routing decision uses them yet                    |
+| Cost tiers (T0–T3)        | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | Phase 3 `select_model` now factors cost into routing |
 | Agent harness             | Spawns zot/pi, tracks session usage                                                            | No quality measurement beyond "did it exit 0?"       |

 ## The Problem
@ -344,7 +351,7 @@ task arrives at scheduler

 **What this gives us:** Eval infrastructure is in place. We're collecting quality scores from agent self-report. This is the minimum viable eval.

-### Phase 2 — Local LLM Eval (3 days)
+### Phase 2 — Local LLM Eval (shipped — PR #280)

 **Goal:** Independent eval via local LLM.

@ -354,13 +361,13 @@ task arrives at scheduler
 | Local eval: spawn local LLM with eval prompt          | colibri-daemon | ~60   |
 | Fallback logic: self-report → local → cloud → skipped | colibri-daemon | ~40   |
 | Eval job scheduler (async, fire-and-forget)           | colibri-daemon | ~30   |
-| Eval result merge into task_eval                      | colibri-ledger  | ~20   |
+| Eval result merge into task_eval                      | colibri-ledger | ~20   |

 **Total:** ~180 lines, 3 days.

 **What this gives us:** Independent eval for most tasks. Self-report is still the default, but local LLM eval can verify or override.

-### Phase 3 — Model Selection (3 days)
+### Phase 3 — Model Selection (shipped — PR #285)

 **Goal:** Data-driven routing decisions.

@ -383,7 +390,7 @@ task arrives at scheduler
 | Deliverable                                         | Where          | Lines |
 | --------------------------------------------------- | -------------- | ----- |
 | Cloud eval: call Claude/DeepSeek with eval prompt   | colibri-daemon | ~50   |
-| Cost accounting: eval_cost_usd added to task_eval   | colibri-ledger  | ~10   |
+| Cost accounting: eval_cost_usd added to task_eval   | colibri-ledger | ~10   |
 | Feedback loop: eval results → routing weight update | colibri-daemon | ~30   |
 | Eval aggregation: 5-minute rollup of success rates  | colibri-mcp    | ~25   |

@ -402,20 +409,21 @@ task arrives at scheduler
 - Daemon writes eval result ✅
 - Query API for eval data ✅

-### Phase 2 — Local LLM Eval
+### Phase 2 — Local LLM Eval (shipped — PR #280)

- Eval prompt template (JSON schema)
- Local eval job (spawn local LLM)
- Fallback logic (self-report → local → cloud → skipped)
- Async eval scheduler
+- Eval prompt template ✅
+- Local eval job (spawn local LLM via ollama) ✅
+- Fallback chain self-report → local ✅ (→ cloud lands in Phase 4)
+- Async eval (background `spawn_blocking`) ✅

-### Phase 3 — Model Selection
+### Phase 3 — Model Selection (shipped — PR #285)

- `select_model()` function
- Query eval success rates per (model, task_type)
- Decision rationale logging
- Configurable weights
- Integration with task dispatch
+- `select_model()` function ✅
+- Query eval success rates per model — `Store::model_success_rates` ✅
+  (per `task_type` deferred to Phase 4)
+- Decision rationale logging ✅
+- Configurable weights (`COLIBRI_MODEL_SELECTION_WEIGHT_*`) ✅
+- Integration at agent spawn (`recommend_model` → autospawn env) ✅

 ### Phase 4 — Cloud Eval + Feedback

--- a/docs/wiki/task-dispatch-flow.md
+++ b/docs/wiki/task-dispatch-flow.md
@ -0,0 +1,69 @@
+# Task dispatch flow (queued → processed → cost)
+
+← [index](./index.md)
+
+## What this is
+
+The end-to-end path a task takes from submission to a running agent and back.
+This page exists because the chain spans three modules ([task-board](./task-board.md)
+scheduling, [agent-harness](./agent-harness.md) spawning, and the daemon poll
+loop), and it's a recurring source of "why won't the agent pick up my task?"
+confusion. Every stage below is in `main` today.
+
+## The chain
+
+```
+operator submits        intake-task (socket)            cmd_intake_task
+   → task row (queued) ───────────────────────────────► store.create_task
+                                                              │
+scheduler tick (~30s)   pick best-fit agent              Scheduler::tick
+   → claim_task ──────────────────────────────────────► pick_agent + claim_task
+                                                              │
+autospawn (once)        spawn `zot rpc` (stdin piped)    autospawn_agent_if_configured
+   → register agent ───────────────────────────────────► register_agent (name = spawn id)
+                                                              │
+daemon poll loop        task text → agent stdin          poll_tasks
+   → send_prompt ──────────────────────────────────────► rpc_sender().send_prompt(task)
+   → status: Started                                          │
+                                                              ▼
+agent works, emits JSONL → glasspane state → completion → set_task_cost
+   → write_task_eval (self-report) + background local eval → push_cost_to_mother → dashboard
+```
+
+## Stages
+
+| Stage              | What happens                                                                                                                                                                         | Code                                                                                               |
+| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------- |
+| **Submit**         | A task row is created in the store with status `queued`.                                                                                                                             | [`cmd_intake_task`](../../crates/colibri-daemon/src/socket.rs)                                     |
+| **Claim**          | Each tick, the scheduler picks the best-fit agent by capability and claims the task (`queued → claimed`).                                                                            | [`Scheduler::tick`](../../crates/colibri-daemon/src/scheduler.rs) → `pick_agent`, `claim_task`     |
+| **Spawn**          | Autospawn starts the harness as **`zot rpc`** (provider `local`), so stdin is piped and an `RpcSender` is available.                                                                 | [`autospawn_agent_if_configured`](../../crates/colibri-daemon/src/socket.rs), `default_agent_args` |
+| **Register**       | The spawned agent is registered in the store; its `name` column holds the live spawn-handle id used in `state.agents`.                                                               | `register_agent` (store row `name` = spawn id)                                                     |
+| **Dispatch**       | The poll loop resolves the spawn handle from the claimed task's `agent_id`, gets `rpc_sender()`, and writes the task text to the agent's stdin, transitioning the task to `Started`. | [`poll_tasks`](../../crates/colibri-daemon/src/daemon.rs) → `send_prompt`                          |
+| **Process + cost** | The agent works and emits JSONL ([glasspane](./glasspane.md)). On completion, cost and an eval record are written, then pushed to mother for the [dashboard](./cost-dashboard.md).   | `set_task_cost`, `write_task_eval`, `push_cost_to_mother`                                          |
+
+## Why a task can stall (and what it is _not_)
+
+The dispatch logic above is all in `main` — a stalled task is almost never
+missing code. The usual causes, in order:
+
+1. **Stale deployed build.** The host is running a colibri binary older than
+   the current `poll_tasks` dispatch or agent-registration fixes. Check
+   `git rev-parse HEAD` on the host against `origin/main`; reset, rebuild,
+   restart the daemon.
+2. **Agent not in RPC mode.** If the process is `zot --mode json` (not
+   `zot rpc`), stdin isn't piped, `rpc_sender()` is `None`, and no dispatch
+   happens. Confirm with `ps`.
+3. **Registration linkage broken.** Dispatch needs the store agent row's
+   `name` to equal the live spawn id in `state.agents`. A mismatch (older
+   build) means `poll_tasks` can't find the sender.
+
+If you're told "merge branch X to enable dispatch," verify X against `main`
+first — the chain is already merged, and re-pushing an auto-deleted branch
+hits the [branch recreation hazard](./pull-requests.md).
+
+## See also
+
+- [task-board](./task-board.md) — scheduler internals (capability scoring, intake drain)
+- [agent-harness](./agent-harness.md) — zot/Colibri split, autospawn, RPC driver
+- [glasspane](./glasspane.md) — how agent stdout becomes observable state
+- [cost-dashboard](./cost-dashboard.md) — where cost lands after completion