colibri/crates
Sam & Claude 89e47363ef feat(store): T2.x Phase 1 eval harness — agent self-report
Schema + store + daemon hook for the eval harness (Phase 1 of T2.x).

Per docs/wiki/t2x-eval-harness.md, the eval harness records multi-dimensional
success measurement per task — beyond the boolean 'did it exit 0?' that T1.5
already captures. Phase 1 uses agent self-report (exit code → quality 1.0 or
0.0). Phases 2/3/4 will layer on local-llm eval, cloud-llm eval, and
model-selection routing.

Schema (colibri-store):
- New task_evals table: task_id, agent_id, eval_mode, completion_status,
  quality_score, correctness_check, eval_provider, eval_latency_ms,
  eval_cost_usd, evaluated_at. CHECK constraints enforce the enum fields.
  Intentionally no FK to tasks — we don't want DELETE CASCADE to destroy
  eval history and we don't want a missing task row to block eval writes.
- task_costs gets quality_score and eval_mode columns for dashboard display.
- Migrations use IF NOT EXISTS / try-block pattern for idempotent reopens.

Store API:
- write_task_eval: INSERT OR REPLACE — same task_id can be upgraded
  (e.g. skip → agent → local-llm → cloud-llm)
- read_task_eval
- list_task_evals_by_agent
- list_all_task_evals
- eval_summary(window_hours): aggregated rollup for Phase 3 routing

Daemon integration:
- New TaskCompletion struct consolidates what used to be 8 args to an
  inline cost-capture closure. The struct is a stable API that future
  eval modes (local-llm, cloud-llm) can populate with eval_provider,
  eval_latency_ms, eval_cost_usd without touching the hook signature.
- record_task_completion(state, &TaskCompletion): single atomic hook now
  writes both task_costs AND task_evals. Called from heartbeat's poll_exit
  path; designed so RPC-completion and periodic-snapshot paths (the gap
  flagged in feat/rpc-task-dispatch for persistent RPC agents) can call
  the same function.
- Hardcoded eval_mode='agent' in Phase 1 — future phases pass different
  values; the function itself is mode-agnostic.

MCP tool:
- colibri_get_task_eval(task_id): returns the eval record for a task.

Client:
- Client::get_task_eval() async method.

Tests:
- 6 new store tests: roundtrip, insert-or-replace upgrade path,
  list-by-agent filter, eval_summary aggregation, CHECK constraint
  enforcement, export_json integration.
- tool_dispatch test updated for new tool count (20 → 21).

All gates green: cargo fmt, clippy -D warnings, cargo test workspace,
wiki-lint --strict (187/0).

Sam & Claude
2026-06-28 08:23:05 +02:00
..
clawdie
colibri-client feat(store): T2.x Phase 1 eval harness — agent self-report 2026-06-28 08:23:05 +02:00
colibri-contracts feat: per-task cost tracking across all crates (T1.5) 2026-06-27 12:12:51 +02:00
colibri-daemon feat(store): T2.x Phase 1 eval harness — agent self-report 2026-06-28 08:23:05 +02:00
colibri-deepseek refactor: rename smoke→test across provider contracts and docs 2026-06-27 11:54:30 +02:00
colibri-deploy feat(deploy): add colibri-deploy crate + MCP tools 2026-06-27 18:57:55 +02:00
colibri-glasspane fix: remove legacy references — Rust source + agent skills (5 files) (#249) 2026-06-28 00:10:50 +02:00
colibri-glasspane-tui docs: fold glasspane TUI design into wiki/tui.md, delete scratch 2026-06-26 22:03:12 +02:00
colibri-mcp feat(store): T2.x Phase 1 eval harness — agent self-report 2026-06-28 08:23:05 +02:00
colibri-pf feat: add colibri-zfs, colibri-pf crates + MCP tools + wiki tools 2026-06-27 14:49:46 +02:00
colibri-runtime
colibri-skills fix(skills): correct source-of-truth — colibri, not clawdie-ai 2026-06-26 21:43:08 +02:00
colibri-store feat(store): T2.x Phase 1 eval harness — agent self-report 2026-06-28 08:23:05 +02:00
colibri-vault
colibri-zfs fix: clippy lint — map_or→is_none_or in wiki tools, unused _line in zfs test 2026-06-27 17:05:56 +02:00