From 04370dd869a3e96dfad67ed77c8b1cd7fceabc3e Mon Sep 17 00:00:00 2001 From: Sam & Claude Date: Sun, 28 Jun 2026 18:53:09 +0200 Subject: [PATCH] docs: post-Phase-3 wiki accuracy + task-dispatch-flow page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - model-selection-and-eval: status Design → Phases 1–3 shipped (#264/#280/#285); mark Phase 2/3 deliverables, add 3a scope note, fix stale routing-gap row. - hive-routing: status → partially shipped; scheduler row reflects pick_agent + select_model. - README + index: model-selection row reflects shipped, not "design". - New task-dispatch-flow.md: the verified queued→claim→spawn→register→dispatch→ cost chain with code anchors + "why a task stalls" (stale build, not RPC mode, registration linkage). Indexed. Co-Authored-By: Claude Opus 4.8 --- README.md | 2 +- docs/wiki/hive-routing.md | 23 ++++----- docs/wiki/index.md | 7 +-- docs/wiki/model-selection-and-eval.md | 42 +++++++++------- docs/wiki/task-dispatch-flow.md | 69 +++++++++++++++++++++++++++ 5 files changed, 111 insertions(+), 32 deletions(-) create mode 100644 docs/wiki/task-dispatch-flow.md diff --git a/README.md b/README.md index 6e0fcc8..f7b5ecb 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ Key pages: | [glasspane](docs/wiki/glasspane.md) | 5-state machine, attention system, TUI | | [hive-routing](docs/wiki/hive-routing.md) | Fleet routing, capability matrix, node registration | | [cost-dashboard](docs/wiki/cost-dashboard.md) | Per-task cost tracking, proof_text, live dashboard | -| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | T2.x eval harness + model selection design | +| [model-selection-and-eval](docs/wiki/model-selection-and-eval.md) | Eval harness + eval-driven model selection (Ph 1–3) | | [mother-hive](docs/wiki/mother-hive.md) | Mother node coordination (PostgreSQL + MCP over SSH) | | [pull-requests](docs/wiki/pull-requests.md) | Branch naming, commit format, review workflow | | [quality-gates](docs/wiki/quality-gates.md) | CI pipeline, fmt/clippy/test/wiki-lint gates | diff --git a/docs/wiki/hive-routing.md b/docs/wiki/hive-routing.md index f68f3b7..fe042c5 100644 --- a/docs/wiki/hive-routing.md +++ b/docs/wiki/hive-routing.md @@ -1,6 +1,7 @@ # Hive Member Tracking & Cost-Aware Routing -**Status:** 📋 Design +**Status:** Partially shipped — capability matching + eval-driven model +selection live (#285); stable node UUID and hive-level cost aggregation pending. **Date:** 24.jun.2026 **Driven by:** T1.5 per-task cost tracking (shipped) → T2.x routing @@ -10,15 +11,15 @@ ## What Exists Today -| Component | State | Gap | -| ---------------------------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------- | -| `mother_schema.sql` | `hive_nodes` table with `hw_profile` + `capabilities` JSONB | No stable node UUID; hostname is the key | -| `derive_capabilities()` trigger | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile | Only GPU/VRAM heuristics — doesn't probe running services | -| `clawdie-system-probe` | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status | No ollama/llama.cpp probing | -| `node-register-mcp` | UPSERTs hw_profile into `hive_nodes` on join | No UUID generation at join time | -| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching stubs | No cost-aware routing, no hive awareness | -| `colibri-ledger` | Local SQLite `agents` table with UUID (v4 random) | UUID is session-local, not hive-stable | -| T1.5 cost tracking | Per-task cost captured in local SQLite | No hive-level cost aggregation | +| Component | State | Gap | +| ---------------------------------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------- | +| `mother_schema.sql` | `hive_nodes` table with `hw_profile` + `capabilities` JSONB | No stable node UUID; hostname is the key | +| `derive_capabilities()` trigger | Auto-computes `has_gpu`, `gpu_vendor`, `can_run_local_llm`, `max_model` from hw_profile | Only GPU/VRAM heuristics — doesn't probe running services | +| `clawdie-system-probe` | Collects GPU, RAM, CPU, disks, ZFS, WiFi, Vulkan, Colibri status | No ollama/llama.cpp probing | +| `node-register-mcp` | UPSERTs hw_profile into `hive_nodes` on join | No UUID generation at join time | +| `crates/colibri-daemon/src/scheduler.rs` | Cron/interval/one-shot jobs, capability matching (`pick_agent`), eval-driven `select_model` | Selection is per-host; no cross-hive awareness yet | +| `colibri-ledger` | Local SQLite `agents` table with UUID (v4 random) | UUID is session-local, not hive-stable | +| T1.5 cost tracking | Per-task cost captured in local SQLite | No hive-level cost aggregation | ## Design Goals @@ -380,7 +381,7 @@ The capability matrix, stable UUIDs, and local LLM probes are the foundation — | `colibri_query_hive_capabilities` MCP tool | colibri-mcp | | `colibri_dispatch_to_node` MCP tool | colibri-mcp | | `hive-routing` skill | `.agent/skills/` | -| `Task.routing` JSONB field in colibri-ledger | colibri-ledger | +| `Task.routing` JSONB field in colibri-ledger | colibri-ledger | | Mother-side routing score as PostgreSQL function (optional — only if agent-driven routing proves insufficient) | mother_schema.sql | --- diff --git a/docs/wiki/index.md b/docs/wiki/index.md index f99b10b..8c2d768 100644 --- a/docs/wiki/index.md +++ b/docs/wiki/index.md @@ -46,6 +46,7 @@ warning. | Page | What it covers | | --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | | [agent-harness](./agent-harness.md) | The zot (agent) + Colibri (control plane) split; autospawn + RPC driver | +| [task-dispatch-flow](./task-dispatch-flow.md) | End-to-end task path: queued → claimed → spawn → register → dispatch → cost; why a task stalls | | [agent-events-reference](./agent-events-reference.md) | Per-harness zot event reference, Glasspane mappings, and verified transcript fields | | [cost-model](./cost-model.md) | Byte-stable prefixes, cache-hit metering, auto-escalation, T14 compaction | | [glasspane](./glasspane.md) | Agent state machine, JSONL streaming, AgentRuntime taxonomy, snapshot API | @@ -57,14 +58,14 @@ warning. | [hive-pane](./hive-pane.md) | Glasspane for the hive — multi-node cost observability, A2A discovery, and operator board | | [cost-dashboard](./cost-dashboard.md) | Mother-side cost observability — human gallery + JSON, screenshot proof linked from cost rows | | [a2a-complexity-audit](./a2a-complexity-audit.md) | A2A code complexity impact — 6-protocol surface audit, when A2A pays off | -| [model-selection-and-eval](./model-selection-and-eval.md) | T2.x design: model selection (tier arbitrage) + evaluation harness (task success measurement) | +| [model-selection-and-eval](./model-selection-and-eval.md) | Eval harness + eval-driven model selection — Phases 1–3 shipped, Phase 4 (cloud eval) planned | | [naming-decisions](./naming-decisions.md) | Ledger of harness-neutral / architecture renames — shipped and in-flight | | [daemon-not-demon](./daemon-not-demon.md) | Why we say daemon (helper spirit) not demon (bad spirit) — English + Slovenian | | [layered-soul](./layered-soul.md) | How Colibri consumes the layered-soul reviewed-context repo today vs planned | | [task-board](./task-board.md) | Capability match scoring, cron scheduling, intake drain, SQLite backing | | [pull-requests](./pull-requests.md) | PR workflow — branching, review, gates, merge conventions | | [quality-gates](./quality-gates.md) | `ci-checks.sh` as the pre-merge gate; why drift reached `main` before | -| [contracts](./contracts.md) | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests | +| [contracts](./contracts.md) | Stable JSON schemas (run-manifest, runtime-inventory, provider-test), fixture tests | | [store-schema](./store-schema.md) | SQLite coordination schema and migration discipline | | [external-mcp](./external-mcp.md) | MCP bridge for editors + external stdio MCP host; read/write/external-call gates | | [operator-cli](./operator-cli.md) | The `colibri` CLI as a thin typed Unix-socket client over the daemon API | @@ -74,4 +75,4 @@ warning. | [skills-catalog](./skills-catalog.md) | Read-only runtime consumer for reviewed skill artifacts | | [vault-provision](./vault-provision.md) | Vaultwarden-driven env-file provisioning into jails after agent spawn | | [deployment](./deployment.md) | Host installer (clawdie): ZFS layout, rc.d/systemd service, dry-run safety | -| [rust-glossary](./rust-glossary.md) | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros | +| [rust-glossary](./rust-glossary.md) | Quick reference for Rust terms in the codebase: serde, Result/Option, Arc/Box, derive macros | diff --git a/docs/wiki/model-selection-and-eval.md b/docs/wiki/model-selection-and-eval.md index 0e295d0..8b35e70 100644 --- a/docs/wiki/model-selection-and-eval.md +++ b/docs/wiki/model-selection-and-eval.md @@ -1,9 +1,16 @@ # T2.x: Model Selection & Evaluation Harness -**Status:** 📋 Design +**Status:** Phases 1–3 shipped (#264, #280, #285). Phase 4 (cloud eval) planned. **Date:** 25.jun.2026 **Driven by:** T1.5 per-task cost tracking (shipped) → T2.x model selection + eval +Scope note (Phase 3, 3a): selection is **model-only** — success rate is the +primary signal, cost is a tiebreaker, `cost_mode` owns token spend. Per- +`task_type` routing is deferred to Phase 4 (no task_type data yet). The +selector reads [`Store::model_success_rates`] and runs at agent spawn +(long-lived harnesses, not per-task dispatch); gated by +`COLIBRI_MODEL_SELECTION`, off by default. + > **Companion doc:** [hive-routing](./hive-routing.md) — the capability matrix, > machine identity, and routing engine. This doc covers what the routing engine > _optimizes for_ (model selection) and how it knows if it's winning (eval harness). @@ -14,7 +21,7 @@ | ------------------------- | ---------------------------------------------------------------------------------------------- | ---------------------------------------------------- | | `task_costs` (PostgreSQL) | Per-task cost rows with `provider`, `model`, `cost_usd`, `success` | `success` is boolean — agent process exited 0 or not | | `hive_nodes.capabilities` | JSONB with `has_gpu`, `can_run_local_llm`, `ollama_models` | No success-rate history per model per node | -| Cost tiers (T0–T3) | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | No routing decision uses them yet | +| Cost tiers (T0–T3) | Defined in hive-routing.md: local ($0), DeepSeek ($0.27/1M), Gemini ($0.15/1M), Claude ($3/1M) | Phase 3 `select_model` now factors cost into routing | | Agent harness | Spawns zot/pi, tracks session usage | No quality measurement beyond "did it exit 0?" | ## The Problem @@ -344,7 +351,7 @@ task arrives at scheduler **What this gives us:** Eval infrastructure is in place. We're collecting quality scores from agent self-report. This is the minimum viable eval. -### Phase 2 — Local LLM Eval (3 days) +### Phase 2 — Local LLM Eval (shipped — PR #280) **Goal:** Independent eval via local LLM. @@ -354,13 +361,13 @@ task arrives at scheduler | Local eval: spawn local LLM with eval prompt | colibri-daemon | ~60 | | Fallback logic: self-report → local → cloud → skipped | colibri-daemon | ~40 | | Eval job scheduler (async, fire-and-forget) | colibri-daemon | ~30 | -| Eval result merge into task_eval | colibri-ledger | ~20 | +| Eval result merge into task_eval | colibri-ledger | ~20 | **Total:** ~180 lines, 3 days. **What this gives us:** Independent eval for most tasks. Self-report is still the default, but local LLM eval can verify or override. -### Phase 3 — Model Selection (3 days) +### Phase 3 — Model Selection (shipped — PR #285) **Goal:** Data-driven routing decisions. @@ -383,7 +390,7 @@ task arrives at scheduler | Deliverable | Where | Lines | | --------------------------------------------------- | -------------- | ----- | | Cloud eval: call Claude/DeepSeek with eval prompt | colibri-daemon | ~50 | -| Cost accounting: eval_cost_usd added to task_eval | colibri-ledger | ~10 | +| Cost accounting: eval_cost_usd added to task_eval | colibri-ledger | ~10 | | Feedback loop: eval results → routing weight update | colibri-daemon | ~30 | | Eval aggregation: 5-minute rollup of success rates | colibri-mcp | ~25 | @@ -402,20 +409,21 @@ task arrives at scheduler - Daemon writes eval result ✅ - Query API for eval data ✅ -### Phase 2 — Local LLM Eval +### Phase 2 — Local LLM Eval (shipped — PR #280) -- Eval prompt template (JSON schema) -- Local eval job (spawn local LLM) -- Fallback logic (self-report → local → cloud → skipped) -- Async eval scheduler +- Eval prompt template ✅ +- Local eval job (spawn local LLM via ollama) ✅ +- Fallback chain self-report → local ✅ (→ cloud lands in Phase 4) +- Async eval (background `spawn_blocking`) ✅ -### Phase 3 — Model Selection +### Phase 3 — Model Selection (shipped — PR #285) -- `select_model()` function -- Query eval success rates per (model, task_type) -- Decision rationale logging -- Configurable weights -- Integration with task dispatch +- `select_model()` function ✅ +- Query eval success rates per model — `Store::model_success_rates` ✅ + (per `task_type` deferred to Phase 4) +- Decision rationale logging ✅ +- Configurable weights (`COLIBRI_MODEL_SELECTION_WEIGHT_*`) ✅ +- Integration at agent spawn (`recommend_model` → autospawn env) ✅ ### Phase 4 — Cloud Eval + Feedback diff --git a/docs/wiki/task-dispatch-flow.md b/docs/wiki/task-dispatch-flow.md new file mode 100644 index 0000000..6540165 --- /dev/null +++ b/docs/wiki/task-dispatch-flow.md @@ -0,0 +1,69 @@ +# Task dispatch flow (queued → processed → cost) + +← [index](./index.md) + +## What this is + +The end-to-end path a task takes from submission to a running agent and back. +This page exists because the chain spans three modules ([task-board](./task-board.md) +scheduling, [agent-harness](./agent-harness.md) spawning, and the daemon poll +loop), and it's a recurring source of "why won't the agent pick up my task?" +confusion. Every stage below is in `main` today. + +## The chain + +``` +operator submits intake-task (socket) cmd_intake_task + → task row (queued) ───────────────────────────────► store.create_task + │ +scheduler tick (~30s) pick best-fit agent Scheduler::tick + → claim_task ──────────────────────────────────────► pick_agent + claim_task + │ +autospawn (once) spawn `zot rpc` (stdin piped) autospawn_agent_if_configured + → register agent ───────────────────────────────────► register_agent (name = spawn id) + │ +daemon poll loop task text → agent stdin poll_tasks + → send_prompt ──────────────────────────────────────► rpc_sender().send_prompt(task) + → status: Started │ + ▼ +agent works, emits JSONL → glasspane state → completion → set_task_cost + → write_task_eval (self-report) + background local eval → push_cost_to_mother → dashboard +``` + +## Stages + +| Stage | What happens | Code | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------- | +| **Submit** | A task row is created in the store with status `queued`. | [`cmd_intake_task`](../../crates/colibri-daemon/src/socket.rs) | +| **Claim** | Each tick, the scheduler picks the best-fit agent by capability and claims the task (`queued → claimed`). | [`Scheduler::tick`](../../crates/colibri-daemon/src/scheduler.rs) → `pick_agent`, `claim_task` | +| **Spawn** | Autospawn starts the harness as **`zot rpc`** (provider `local`), so stdin is piped and an `RpcSender` is available. | [`autospawn_agent_if_configured`](../../crates/colibri-daemon/src/socket.rs), `default_agent_args` | +| **Register** | The spawned agent is registered in the store; its `name` column holds the live spawn-handle id used in `state.agents`. | `register_agent` (store row `name` = spawn id) | +| **Dispatch** | The poll loop resolves the spawn handle from the claimed task's `agent_id`, gets `rpc_sender()`, and writes the task text to the agent's stdin, transitioning the task to `Started`. | [`poll_tasks`](../../crates/colibri-daemon/src/daemon.rs) → `send_prompt` | +| **Process + cost** | The agent works and emits JSONL ([glasspane](./glasspane.md)). On completion, cost and an eval record are written, then pushed to mother for the [dashboard](./cost-dashboard.md). | `set_task_cost`, `write_task_eval`, `push_cost_to_mother` | + +## Why a task can stall (and what it is _not_) + +The dispatch logic above is all in `main` — a stalled task is almost never +missing code. The usual causes, in order: + +1. **Stale deployed build.** The host is running a colibri binary older than + the current `poll_tasks` dispatch or agent-registration fixes. Check + `git rev-parse HEAD` on the host against `origin/main`; reset, rebuild, + restart the daemon. +2. **Agent not in RPC mode.** If the process is `zot --mode json` (not + `zot rpc`), stdin isn't piped, `rpc_sender()` is `None`, and no dispatch + happens. Confirm with `ps`. +3. **Registration linkage broken.** Dispatch needs the store agent row's + `name` to equal the live spawn id in `state.agents`. A mismatch (older + build) means `poll_tasks` can't find the sender. + +If you're told "merge branch X to enable dispatch," verify X against `main` +first — the chain is already merged, and re-pushing an auto-deleted branch +hits the [branch recreation hazard](./pull-requests.md). + +## See also + +- [task-board](./task-board.md) — scheduler internals (capability scoring, intake drain) +- [agent-harness](./agent-harness.md) — zot/Colibri split, autospawn, RPC driver +- [glasspane](./glasspane.md) — how agent stdout becomes observable state +- [cost-dashboard](./cost-dashboard.md) — where cost lands after completion