feat: T2.x eval harness + RPC task dispatch #264

Merged
clawdie merged 4 commits from feat/rpc-eval-combined into main 2026-06-28 08:43:34 +02:00

Combined PR: T2.x Phase 1 eval harness + RPC task dispatch.

Eval harness (store):

  • New task_evals table with CHECK constraints
  • write_task_eval / read_task_eval / list_all / list_by_agent
  • eval_summary() aggregation for Phase 3 routing
  • colibri_get_task_eval MCP tool
  • 8 new store tests

RPC task dispatch (daemon):

  • Route claimed tasks to running RPC agents instead of spawning new processes
  • try_lock with proper WouldBlock/Poisoned handling
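The `try_lock` handling mentioned above can be sketched with the standard library's `Mutex::try_lock`, whose error type has exactly the two variants named in the bullet. This is a minimal illustration, not the daemon's code: `Dispatch`, `try_dispatch`, and the `Vec<String>` outbox are hypothetical stand-ins for whatever the daemon actually guards.

```rust
use std::sync::{Mutex, TryLockError};

/// Outcome of one non-blocking dispatch attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Dispatch {
    Sent,
    Busy,
    Poisoned,
}

/// Hypothetical helper: queue `prompt` on a mutex-guarded outbox without blocking.
pub fn try_dispatch(outbox: &Mutex<Vec<String>>, prompt: &str) -> Dispatch {
    match outbox.try_lock() {
        Ok(mut queue) => {
            queue.push(prompt.to_owned());
            Dispatch::Sent
        }
        // Another holder has the lock: skip this tick instead of stalling the poll loop.
        Err(TryLockError::WouldBlock) => Dispatch::Busy,
        // A previous holder panicked: report it instead of unwrapping and panicking too.
        Err(TryLockError::Poisoned(_)) => Dispatch::Poisoned,
    }
}
```

Matching both variants explicitly lets the caller tell "retry on the next tick" apart from "the shared state may be corrupt", which a bare `unwrap()` or `.ok()` would not.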

Refactor:

  • record_task_completion() extracted from heartbeat for future RPC-completion paths
  • TaskCompletion struct consolidates params

Also:

  • clippy: map_or → is_some_and in capability probe
  • cargo fmt

8 files, +659/-40. Reviewed and approved.

clawdie added 4 commits 2026-06-28 08:43:16 +02:00
Schema + store + daemon hook for the eval harness (Phase 1 of T2.x).

Per docs/wiki/t2x-eval-harness.md, the eval harness records multi-dimensional
success measurement per task — beyond the boolean 'did it exit 0?' that T1.5
already captures. Phase 1 uses agent self-report (exit code → quality 1.0 or
0.0). Phases 2/3/4 will layer on local-llm eval, cloud-llm eval, and
model-selection routing.

Schema (colibri-store):
- New task_evals table: task_id, agent_id, eval_mode, completion_status,
  quality_score, correctness_check, eval_provider, eval_latency_ms,
  eval_cost_usd, evaluated_at. CHECK constraints enforce the enum fields.
  Intentionally no FK to tasks — we don't want DELETE CASCADE to destroy
  eval history and we don't want a missing task row to block eval writes.
- task_costs gets quality_score and eval_mode columns for dashboard display.
- Migrations use IF NOT EXISTS / try-block pattern for idempotent reopens.
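One way to mirror a CHECK-constrained enum column on the Rust side is a closed enum with a fallible parser, so an invalid value is rejected before it reaches the database. The sketch below is illustrative only: it assumes the eval_mode values are the four modes this commit message names (skip, agent, local-llm, cloud-llm), and the `EvalMode` type itself is hypothetical.

```rust
use std::str::FromStr;

/// Hypothetical Rust-side mirror of the eval_mode column.
/// The four values are assumed from the modes named in this commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvalMode {
    Skip,
    Agent,
    LocalLlm,
    CloudLlm,
}

impl EvalMode {
    /// The string form that would be stored in the column.
    pub fn as_str(self) -> &'static str {
        match self {
            EvalMode::Skip => "skip",
            EvalMode::Agent => "agent",
            EvalMode::LocalLlm => "local-llm",
            EvalMode::CloudLlm => "cloud-llm",
        }
    }
}

impl FromStr for EvalMode {
    type Err = String;

    /// Rejects any value outside the closed set, as a CHECK constraint would.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "skip" => Ok(EvalMode::Skip),
            "agent" => Ok(EvalMode::Agent),
            "local-llm" => Ok(EvalMode::LocalLlm),
            "cloud-llm" => Ok(EvalMode::CloudLlm),
            other => Err(format!("invalid eval_mode: {other}")),
        }
    }
}
```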

Store API:
- write_task_eval: INSERT OR REPLACE — same task_id can be upgraded
  (e.g. skip → agent → local-llm → cloud-llm)
- read_task_eval
- list_task_evals_by_agent
- list_all_task_evals
- eval_summary(window_hours): aggregated rollup for Phase 3 routing
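The upgrade semantics of `write_task_eval` and the idea of a rollup can be sketched with an in-memory map in place of SQLite. Everything here is an assumption made for illustration: the `TaskEval` field types, the `EvalStore` wrapper, and `summary_by_agent`, which groups by agent and ignores the time window that the real `eval_summary(window_hours)` takes.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical eval record (the real table has more columns).
#[derive(Debug, Clone, PartialEq)]
pub struct TaskEval {
    pub task_id: String,
    pub agent_id: String,
    pub eval_mode: String,
    pub quality_score: f64,
}

/// In-memory stand-in for the task_evals table, keyed by task_id.
#[derive(Default)]
pub struct EvalStore {
    rows: HashMap<String, TaskEval>,
}

impl EvalStore {
    /// Mirrors INSERT OR REPLACE: a later write for the same task_id wins.
    pub fn write_task_eval(&mut self, eval: TaskEval) {
        self.rows.insert(eval.task_id.clone(), eval);
    }

    pub fn read_task_eval(&self, task_id: &str) -> Option<&TaskEval> {
        self.rows.get(task_id)
    }

    /// Per-agent (eval count, mean quality score); no time window in this sketch.
    pub fn summary_by_agent(&self) -> HashMap<String, (usize, f64)> {
        let mut acc: HashMap<String, (usize, f64)> = HashMap::new();
        for eval in self.rows.values() {
            let entry = acc.entry(eval.agent_id.clone()).or_insert((0, 0.0));
            entry.0 += 1;
            entry.1 += eval.quality_score;
        }
        // Turn each running sum into a mean.
        for value in acc.values_mut() {
            value.1 /= value.0 as f64;
        }
        acc
    }
}
```

Keying on task_id means an upgraded evaluation replaces the earlier one, so the rollup counts each task once.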

Daemon integration:
- New TaskCompletion struct consolidates what used to be 8 args to an
  inline cost-capture closure. The struct is a stable API that future
  eval modes (local-llm, cloud-llm) can populate with eval_provider,
  eval_latency_ms, eval_cost_usd without touching the hook signature.
- record_task_completion(state, &TaskCompletion): single atomic hook now
  writes both task_costs AND task_evals. Called from heartbeat's poll_exit
  path; designed so RPC-completion and periodic-snapshot paths (the gap
  flagged in feat/rpc-task-dispatch for persistent RPC agents) can call
  the same function.
- Hardcoded eval_mode='agent' in Phase 1 — future phases pass different
  values; the function itself is mode-agnostic.
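The shape of the completion hook described above might look roughly like the following. Only `TaskCompletion`, `record_task_completion`, and the three `eval_*` field names come from this commit message. The remaining fields, the `State` stand-in, the `&mut` receiver, and `self_report_quality` are assumptions made for the sketch.

```rust
/// Parameter bundle for the completion hook. Fields other than the three
/// eval_* ones are guesses at what the former closure arguments carried.
#[derive(Debug, Clone, Default)]
pub struct TaskCompletion {
    pub task_id: String,
    pub agent_id: String,
    pub exit_code: i32,
    pub eval_mode: String,
    pub eval_provider: Option<String>,
    pub eval_latency_ms: Option<u64>,
    pub eval_cost_usd: Option<f64>,
}

/// Phase 1 self-report: exit code 0 maps to quality 1.0, anything else to 0.0.
pub fn self_report_quality(exit_code: i32) -> f64 {
    if exit_code == 0 { 1.0 } else { 0.0 }
}

/// Hypothetical stand-in for daemon state; two vectors play the two tables.
#[derive(Default)]
pub struct State {
    pub task_costs: Vec<String>,
    pub task_evals: Vec<(String, String, f64)>,
}

/// One hook writes both records, so every completion path stays consistent.
pub fn record_task_completion(state: &mut State, done: &TaskCompletion) {
    state.task_costs.push(done.task_id.clone());
    state.task_evals.push((
        done.task_id.clone(),
        done.eval_mode.clone(),
        self_report_quality(done.exit_code),
    ));
}
```

Because later eval modes only fill in the optional fields, callers that build the struct with `..Default::default()` keep compiling when new fields are added.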

MCP tool:
- colibri_get_task_eval(task_id): returns the eval record for a task.

Client:
- Client::get_task_eval() async method.

Tests:
- 6 new store tests: roundtrip, insert-or-replace upgrade path,
  list-by-agent filter, eval_summary aggregation, CHECK constraint
  enforcement, export_json integration.
- tool_dispatch test updated for new tool count (20 → 21).

All gates green: cargo fmt, clippy -D warnings, cargo test workspace,
wiki-lint --strict (187/0).

Sam & Claude
Adds RPC dispatch to poll_tasks() — when a claimed task has an
agent_id matching a running autospawned agent (zot rpc), the daemon
sends the task description via the existing RPC channel and
transitions the task to 'started'.

Key changes:
  - Resolves store row ID → spawn handle ID via get_agent().name
  - Falls back to spawn-per-task path if no RPC agent found
  - Uses existing send_prompt() on RpcSender
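The lookup-then-fallback flow in the key changes can be sketched as below. The `RpcSender` here is a mock that only records prompts, and `Route`, `dispatch`, and both maps are hypothetical. The sketch assumes, as the first bullet suggests, that the agent's name doubles as the spawn handle ID.

```rust
use std::collections::HashMap;

/// Where a claimed task ended up (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Route {
    Rpc(String),
    SpawnPerTask,
}

/// Mock of an RPC channel; the real RpcSender is not shown in this PR text.
#[derive(Default)]
pub struct RpcSender {
    pub sent: Vec<String>,
}

impl RpcSender {
    pub fn send_prompt(&mut self, prompt: &str) {
        self.sent.push(prompt.to_owned());
    }
}

/// Resolve store row ID to agent name, then to a running RPC channel.
/// Any miss along the way falls back to the spawn-per-task path.
pub fn dispatch(
    agent_names: &HashMap<i64, String>,
    running: &mut HashMap<String, RpcSender>,
    agent_row_id: i64,
    description: &str,
) -> Route {
    let Some(name) = agent_names.get(&agent_row_id) else {
        return Route::SpawnPerTask;
    };
    match running.get_mut(name) {
        Some(sender) => {
            sender.send_prompt(description);
            Route::Rpc(name.clone())
        }
        None => Route::SpawnPerTask,
    }
}
```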

Pipeline verified end-to-end:
  intake-task → queued → scheduler tick → claimed
  → poll_tasks RPC dispatch → started 

Remaining: persistent RPC agents don't exit after one task, so
the current poll_exit-based cost capture (triggered by process exit)
doesn't fire. Periodic pane-usage snapshot needed for long-running
RPC agents.
Clippy 1.94 lint: unnecessary_map_or on socket.rs:931.
Part of the eval harness probe_capabilities fallback (#260).
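For context, the lint rewrite has the following shape. The function and its arguments are hypothetical and are not the code at socket.rs:931; only the `map_or` to `is_some_and` change reflects the commit.

```rust
/// Hypothetical capability check: is the probed version at least `min`?
pub fn meets_minimum(probed_version: Option<u32>, min: u32) -> bool {
    // Before (flagged by clippy::unnecessary_map_or):
    //     probed_version.map_or(false, |v| v >= min)
    probed_version.is_some_and(|v| v >= min)
}
```

Both forms return false for `None`; `is_some_and` states that intent directly instead of encoding it in a default argument.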
Combined PR: eval harness Phase 1 + RPC task dispatch.
style: cargo fmt on RPC dispatch method chains
ed35e3ffb0
clawdie merged commit 274652a9fb into main 2026-06-28 08:43:34 +02:00
Reference: clawdie/colibri#264