feat: T2.x eval harness + RPC task dispatch #264

Merged

clawdie merged 4 commits from feat/rpc-eval-combined into main

2026-06-28 08:43:34 +02:00

clawdie commented

2026-06-28 08:43:10 +02:00

Owner

Combined PR: T2.x Phase 1 eval harness + RPC task dispatch.

Eval harness (store):

- New task_evals table with CHECK constraints
- write_task_eval / read_task_eval / list_all_task_evals / list_task_evals_by_agent
- eval_summary() aggregation for Phase 3 routing
- colibri_get_task_eval MCP tool
- 8 new store tests

RPC task dispatch (daemon):

- Route claimed tasks to running RPC agents instead of spawning new processes
- try_lock with proper WouldBlock/Poisoned handling
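
As a rough illustration of the try_lock point above, here is a minimal sketch of non-blocking lock handling with `std::sync::Mutex`. The `DispatchOutcome` enum, the `try_dispatch` function, and the `Vec<String>` outbox are hypothetical names for this example, and recovering the guard on a poisoned lock is an assumption: the daemon may well choose to skip or log instead.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical outcome type for this sketch; not taken from the PR.
#[derive(Debug, PartialEq)]
enum DispatchOutcome {
    Sent,
    Busy,
    RecoveredFromPoison,
}

/// Try to queue a prompt without ever blocking the caller.
fn try_dispatch(outbox: &Mutex<Vec<String>>, prompt: &str) -> DispatchOutcome {
    match outbox.try_lock() {
        Ok(mut guard) => {
            guard.push(prompt.to_owned());
            DispatchOutcome::Sent
        }
        // Another thread holds the lock: give up for this tick instead of stalling.
        Err(TryLockError::WouldBlock) => DispatchOutcome::Busy,
        // A previous holder panicked. This sketch takes the guard anyway
        // (assumption); a real daemon might skip the dispatch instead.
        Err(TryLockError::Poisoned(poisoned)) => {
            let mut guard = poisoned.into_inner();
            guard.push(prompt.to_owned());
            DispatchOutcome::RecoveredFromPoison
        }
    }
}
```

Matching both `TryLockError` variants explicitly keeps a busy lock (retry next tick) distinct from a poisoned one (a holder panicked), which a bare `unwrap()` or `if let Ok(..)` would conflate.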

Refactor:

- record_task_completion() extracted from heartbeat for future RPC-completion paths
- TaskCompletion struct consolidates params

Also:

- clippy: map_or → is_some_and in capability probe
- cargo fmt

8 files, +659/-40. Reviewed and approved.


clawdie added 4 commits 2026-06-28 08:43:16 +02:00

feat(store): T2.x Phase 1 eval harness — agent self-report 89e47363ef

Schema + store + daemon hook for the eval harness (Phase 1 of T2.x).

Per docs/wiki/t2x-eval-harness.md, the eval harness records multi-dimensional
success measurement per task — beyond the boolean 'did it exit 0?' that T1.5
already captures. Phase 1 uses agent self-report (exit code → quality 1.0 or
0.0). Phases 2/3/4 will layer on local-llm eval, cloud-llm eval, and
model-selection routing.

Schema (colibri-store):
- New task_evals table: task_id, agent_id, eval_mode, completion_status,
  quality_score, correctness_check, eval_provider, eval_latency_ms,
  eval_cost_usd, evaluated_at. CHECK constraints enforce the enum fields.
  Intentionally no FK to tasks — we don't want DELETE CASCADE to destroy
  eval history and we don't want a missing task row to block eval writes.
- task_costs gets quality_score and eval_mode columns for dashboard display.
- Migrations use IF NOT EXISTS / try-block pattern for idempotent reopens.
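
To make the role of the CHECK constraints concrete, the sketch below mirrors them in plain Rust. It assumes the eval_mode values are the four strings named in the upgrade path (skip, agent, local-llm, cloud-llm) and that quality_score is bounded to 0.0..=1.0; the actual constraint text is not shown in this commit, and `EvalMode` and `check_eval_row` are hypothetical names.

```rust
/// Assumed set of eval modes, taken from the upgrade path in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvalMode {
    Skip,
    Agent,
    LocalLlm,
    CloudLlm,
}

impl EvalMode {
    fn parse(raw: &str) -> Option<EvalMode> {
        match raw {
            "skip" => Some(EvalMode::Skip),
            "agent" => Some(EvalMode::Agent),
            "local-llm" => Some(EvalMode::LocalLlm),
            "cloud-llm" => Some(EvalMode::CloudLlm),
            _ => None,
        }
    }
}

/// Rejects rows the same way a CHECK constraint would reject an INSERT.
/// The 0.0..=1.0 bound on quality_score is an assumption of this sketch.
fn check_eval_row(eval_mode: &str, quality_score: Option<f64>) -> Result<EvalMode, String> {
    let mode = EvalMode::parse(eval_mode)
        .ok_or_else(|| format!("CHECK failed: eval_mode = {:?}", eval_mode))?;
    if let Some(score) = quality_score {
        if !(0.0..=1.0).contains(&score) {
            return Err(format!("CHECK failed: quality_score = {}", score));
        }
    }
    Ok(mode)
}
```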

Store API:
- write_task_eval: INSERT OR REPLACE — same task_id can be upgraded
  (e.g. skip → agent → local-llm → cloud-llm)
- read_task_eval
- list_task_evals_by_agent
- list_all_task_evals
- eval_summary(window_hours): aggregated rollup for Phase 3 routing
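
The upgrade semantics of write_task_eval and the rollup in eval_summary can be sketched with an in-memory map standing in for the SQLite table. Everything here is illustrative: the `EvalStore`, `TaskEval`, and `AgentSummary` types, the field subset, and the explicit `now` parameter (added so the example is deterministic) are assumptions, not the store's real API.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct TaskEval {
    task_id: String,
    agent_id: String,
    eval_mode: String,
    quality_score: f64,
    evaluated_at: u64, // unix seconds (assumed representation)
}

impl TaskEval {
    fn new(task_id: &str, agent_id: &str, eval_mode: &str, quality_score: f64, evaluated_at: u64) -> TaskEval {
        TaskEval {
            task_id: task_id.to_owned(),
            agent_id: agent_id.to_owned(),
            eval_mode: eval_mode.to_owned(),
            quality_score,
            evaluated_at,
        }
    }
}

#[derive(Debug, PartialEq)]
struct AgentSummary {
    evals: usize,
    mean_quality: f64,
}

#[derive(Default)]
struct EvalStore {
    rows: HashMap<String, TaskEval>, // keyed by task_id, like a primary key
}

impl EvalStore {
    /// INSERT OR REPLACE: a later write for the same task_id overwrites the row.
    fn write_task_eval(&mut self, eval: TaskEval) {
        self.rows.insert(eval.task_id.clone(), eval);
    }

    fn read_task_eval(&self, task_id: &str) -> Option<&TaskEval> {
        self.rows.get(task_id)
    }

    /// Per-agent count and mean quality over the last `window_hours`.
    fn eval_summary(&self, now: u64, window_hours: u64) -> HashMap<String, AgentSummary> {
        let cutoff = now.saturating_sub(window_hours * 3600);
        let mut totals: HashMap<String, (usize, f64)> = HashMap::new();
        for eval in self.rows.values().filter(|e| e.evaluated_at >= cutoff) {
            let entry = totals.entry(eval.agent_id.clone()).or_insert((0, 0.0));
            entry.0 += 1;
            entry.1 += eval.quality_score;
        }
        totals
            .into_iter()
            .map(|(agent, (n, sum))| (agent, AgentSummary { evals: n, mean_quality: sum / n as f64 }))
            .collect()
    }
}
```

Keying on task_id means each task holds exactly one eval row at a time, so a later local-llm or cloud-llm evaluation replaces the agent self-report rather than accumulating next to it.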

Daemon integration:
- New TaskCompletion struct consolidates what used to be 8 args to an
  inline cost-capture closure. The struct is a stable API that future
  eval modes (local-llm, cloud-llm) can populate with eval_provider,
  eval_latency_ms, eval_cost_usd without touching the hook signature.
- record_task_completion(state, &TaskCompletion): single atomic hook now
  writes both task_costs AND task_evals. Called from heartbeat's poll_exit
  path; designed so RPC-completion and periodic-snapshot paths (the gap
  flagged in feat/rpc-task-dispatch for persistent RPC agents) can call
  the same function.
- Hardcoded eval_mode='agent' in Phase 1 — future phases pass different
  values; the function itself is mode-agnostic.
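
A minimal sketch of the shape described above, under stated assumptions: the field names on `TaskCompletion`, the `State` and `EvalRow` types, and the `quality_from_exit` helper are all hypothetical, and the two `Vec`s stand in for the task_costs and task_evals tables. Atomicity of the real hook (for example via a transaction) is not modelled here.

```rust
/// Hypothetical field set; only the three eval_* names appear in the commit message.
#[derive(Debug, Clone)]
struct TaskCompletion {
    task_id: String,
    agent_id: String,
    exit_code: i32,
    cost_usd: f64,
    eval_provider: Option<String>,
    eval_latency_ms: Option<u64>,
    eval_cost_usd: Option<f64>,
}

#[derive(Debug, PartialEq)]
struct EvalRow {
    task_id: String,
    agent_id: String,
    eval_mode: &'static str,
    quality_score: f64,
}

/// Stand-in for daemon state: two vectors instead of two tables.
#[derive(Default)]
struct State {
    task_costs: Vec<(String, f64)>,
    task_evals: Vec<EvalRow>,
}

/// Phase 1 self-report: exit code 0 maps to quality 1.0, anything else to 0.0.
fn quality_from_exit(exit_code: i32) -> f64 {
    if exit_code == 0 { 1.0 } else { 0.0 }
}

/// Single hook that records both the cost row and the eval row.
fn record_task_completion(state: &mut State, done: &TaskCompletion) {
    state.task_costs.push((done.task_id.clone(), done.cost_usd));
    state.task_evals.push(EvalRow {
        task_id: done.task_id.clone(),
        agent_id: done.agent_id.clone(),
        eval_mode: "agent", // hardcoded in Phase 1, per the commit message
        quality_score: quality_from_exit(done.exit_code),
    });
}
```

Because the hook takes a struct by reference, a later eval mode can populate the optional eval_* fields without changing the function signature or its call sites.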

MCP tool:
- colibri_get_task_eval(task_id): returns the eval record for a task.

Client:
- Client::get_task_eval() async method.

Tests:
- 6 new store tests: roundtrip, insert-or-replace upgrade path,
  list-by-agent filter, eval_summary aggregation, CHECK constraint
  enforcement, export_json integration.
- tool_dispatch test updated for new tool count (20 → 21).

All gates green: cargo fmt, clippy -D warnings, cargo test workspace,
wiki-lint --strict (187/0).

Sam & Claude

feat(daemon): dispatch claimed tasks to running RPC agents 5227b2cd25

Adds RPC dispatch to poll_tasks() — when a claimed task has an
agent_id matching a running autospawned agent (zot rpc), the daemon
sends the task description via the existing RPC channel and
transitions the task to 'started'.

Key changes:
  - Resolves store row ID → spawn handle ID via get_agent().name
  - Falls back to spawn-per-task path if no RPC agent found
  - Uses existing send_prompt() on RpcSender
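
The lookup-then-fallback flow in these key changes might look roughly like the sketch below. The `Daemon`, `RpcSender`, and `Dispatch` types and the `dispatch_claimed_task` function are simplified stand-ins invented for this example; in particular, the real send_prompt() presumably writes to an RPC channel and may return an error, which is not modelled here.

```rust
use std::collections::HashMap;

/// Stand-in for the real RpcSender; it only records what was sent.
#[derive(Default)]
struct RpcSender {
    sent: Vec<String>,
}

impl RpcSender {
    fn send_prompt(&mut self, prompt: &str) {
        self.sent.push(prompt.to_owned());
    }
}

#[derive(Debug, PartialEq)]
enum Dispatch {
    /// Task was handed to a running RPC agent and can move to 'started'.
    RpcStarted,
    /// No running RPC agent matched; caller uses the spawn-per-task path.
    SpawnPerTask,
}

#[derive(Default)]
struct Daemon {
    /// Store row ID -> agent name (what get_agent().name would return).
    agent_names: HashMap<i64, String>,
    /// Spawn handle ID (the agent name) -> live RPC channel.
    rpc_agents: HashMap<String, RpcSender>,
}

fn dispatch_claimed_task(daemon: &mut Daemon, agent_row_id: i64, description: &str) -> Dispatch {
    // Step 1: resolve the store row ID to the spawn handle ID.
    let name = match daemon.agent_names.get(&agent_row_id) {
        Some(name) => name.clone(),
        None => return Dispatch::SpawnPerTask,
    };
    // Step 2: send over the existing channel if that agent is running.
    match daemon.rpc_agents.get_mut(&name) {
        Some(sender) => {
            sender.send_prompt(description);
            Dispatch::RpcStarted
        }
        None => Dispatch::SpawnPerTask,
    }
}
```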

Pipeline verified end-to-end:
  intake-task → queued → scheduler tick → claimed
  → poll_tasks RPC dispatch → started ✅

Remaining: persistent RPC agents don't exit after one task, so
the current poll_exit-based cost capture (triggered by process exit)
doesn't fire. Periodic pane-usage snapshot needed for long-running
RPC agents.

fix(clippy): map_or → is_some_and in ollama probe fallback 514105b44d

Clippy 1.94 lint: unnecessary_map_or on socket.rs:931.
Part of the eval harness probe_capabilities fallback (#260).
Combined PR: eval harness Phase 1 + RPC task dispatch.
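
For readers unfamiliar with the lint, the rewrite is of this general form. The function names and the `Option<&str>` input are hypothetical; the actual expression at socket.rs:931 is not shown in this PR page.

```rust
/// Before: the form flagged by clippy's unnecessary_map_or lint.
fn has_models_before(tags: Option<&str>) -> bool {
    tags.map_or(false, |t| !t.is_empty())
}

/// After: the same predicate expressed with Option::is_some_and.
fn has_models_after(tags: Option<&str>) -> bool {
    tags.is_some_and(|t| !t.is_empty())
}
```

Both forms return false for `None` and otherwise apply the closure, so the change is behavior-preserving.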

style: cargo fmt on RPC dispatch method chains ed35e3ffb0