colibri

Sam & Claude 89e47363ef feat(store): T2.x Phase 1 eval harness — agent self-report

Schema + store + daemon hook for the eval harness (Phase 1 of T2.x).

Per docs/wiki/t2x-eval-harness.md, the eval harness records multi-dimensional success measurement per task, beyond the boolean 'did it exit 0?' that T1.5 already captures. Phase 1 uses agent self-report (exit code → quality 1.0 or 0.0). Phases 2/3/4 will layer on local-llm eval, cloud-llm eval, and model-selection routing.

Schema (colibri-store):
- New task_evals table: task_id, agent_id, eval_mode, completion_status, quality_score, correctness_check, eval_provider, eval_latency_ms, eval_cost_usd, evaluated_at. CHECK constraints enforce the enum fields. Intentionally no FK to tasks: we don't want DELETE CASCADE to destroy eval history and we don't want a missing task row to block eval writes.
- task_costs gets quality_score and eval_mode columns for dashboard display.
- Migrations use IF NOT EXISTS / try-block pattern for idempotent reopens.

Store API:
- write_task_eval: INSERT OR REPLACE; the same task_id can be upgraded (e.g. skip → agent → local-llm → cloud-llm)
- read_task_eval
- list_task_evals_by_agent
- list_all_task_evals
- eval_summary(window_hours): aggregated rollup for Phase 3 routing

Daemon integration:
- New TaskCompletion struct consolidates what used to be 8 args to an inline cost-capture closure. The struct is a stable API that future eval modes (local-llm, cloud-llm) can populate with eval_provider, eval_latency_ms, eval_cost_usd without touching the hook signature.
- record_task_completion(state, &TaskCompletion): single atomic hook now writes both task_costs AND task_evals. Called from heartbeat's poll_exit path; designed so RPC-completion and periodic-snapshot paths (the gap flagged in feat/rpc-task-dispatch for persistent RPC agents) can call the same function.
- Hardcoded eval_mode='agent' in Phase 1; future phases pass different values, and the function itself is mode-agnostic.

MCP tool:
- colibri_get_task_eval(task_id): returns the eval record for a task.

Client:
- Client::get_task_eval() async method.

Tests:
- 6 new store tests: roundtrip, insert-or-replace upgrade path, list-by-agent filter, eval_summary aggregation, CHECK constraint enforcement, export_json integration.
- tool_dispatch test updated for new tool count (20 → 21).

All gates green: cargo fmt, clippy -D warnings, cargo test workspace, wiki-lint --strict (187/0).

Sam & Claude	2026-06-28 08:23:05 +02:00
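
The Phase 1 flow described in the commit message (exit code mapped to a quality score, one completion struct passed to a single hook, and a write keyed by task_id that later eval modes can overwrite) can be sketched roughly as below. This is not the code from the commit: the field names follow the columns listed in the message, but the types, the `EvalMode` enum, the `exit_code` field, and the in-memory `EvalStore` (a `HashMap` standing in for the SQLite table and its INSERT OR REPLACE) are assumptions made for illustration. The sketch also covers only the task_evals half of the hook, not the task_costs write.

```rust
use std::collections::HashMap;

/// Eval modes named in the commit message (skip, agent, local-llm, cloud-llm).
/// Hypothetical enum; the real schema enforces this with a CHECK constraint.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EvalMode {
    Skip,
    Agent,
    LocalLlm,
    CloudLlm,
}

/// Assumed shape of the struct handed to the completion hook.
/// Optional fields are the ones later eval modes are expected to populate.
#[derive(Debug, Clone, PartialEq)]
pub struct TaskCompletion {
    pub task_id: String,
    pub agent_id: String,
    pub exit_code: i32,
    pub eval_mode: EvalMode,
    pub eval_provider: Option<String>,
    pub eval_latency_ms: Option<u64>,
    pub eval_cost_usd: Option<f64>,
}

/// Assumed subset of a task_evals row.
#[derive(Debug, Clone, PartialEq)]
pub struct TaskEval {
    pub task_id: String,
    pub agent_id: String,
    pub eval_mode: EvalMode,
    pub quality_score: f64,
}

/// Phase 1 self-report: exit code 0 means quality 1.0, anything else 0.0.
pub fn quality_from_exit_code(exit_code: i32) -> f64 {
    if exit_code == 0 { 1.0 } else { 0.0 }
}

/// In-memory stand-in for the store; one row per task_id.
#[derive(Default)]
pub struct EvalStore {
    rows: HashMap<String, TaskEval>,
}

impl EvalStore {
    /// Mimics INSERT OR REPLACE: a second write for the same task_id
    /// overwrites the earlier row instead of adding a new one.
    pub fn write_task_eval(&mut self, eval: TaskEval) {
        self.rows.insert(eval.task_id.clone(), eval);
    }

    pub fn read_task_eval(&self, task_id: &str) -> Option<&TaskEval> {
        self.rows.get(task_id)
    }

    pub fn len(&self) -> usize {
        self.rows.len()
    }
}

/// Simplified hook: derives the eval row from the completion and stores it.
/// The mode comes from the caller, so the function stays mode-agnostic.
pub fn record_task_completion(store: &mut EvalStore, done: &TaskCompletion) {
    store.write_task_eval(TaskEval {
        task_id: done.task_id.clone(),
        agent_id: done.agent_id.clone(),
        eval_mode: done.eval_mode,
        quality_score: quality_from_exit_code(done.exit_code),
    });
}
```

Calling `record_task_completion` twice for the same task_id, first with `EvalMode::Agent` and later with `EvalMode::LocalLlm`, leaves a single row carrying the later mode, which is the upgrade path the store API list describes.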
..
clawdie
colibri-client	feat(store): T2.x Phase 1 eval harness — agent self-report	2026-06-28 08:23:05 +02:00
colibri-contracts	feat: per-task cost tracking across all crates (T1.5)	2026-06-27 12:12:51 +02:00
colibri-daemon	feat(store): T2.x Phase 1 eval harness — agent self-report	2026-06-28 08:23:05 +02:00
colibri-deepseek	refactor: rename smoke→test across provider contracts and docs	2026-06-27 11:54:30 +02:00
colibri-deploy	feat(deploy): add colibri-deploy crate + MCP tools	2026-06-27 18:57:55 +02:00
colibri-glasspane	fix: remove legacy references — Rust source + agent skills (5 files) (#249)	2026-06-28 00:10:50 +02:00
colibri-glasspane-tui	docs: fold glasspane TUI design into wiki/tui.md, delete scratch	2026-06-26 22:03:12 +02:00
colibri-mcp	feat(store): T2.x Phase 1 eval harness — agent self-report	2026-06-28 08:23:05 +02:00
colibri-pf	feat: add colibri-zfs, colibri-pf crates + MCP tools + wiki tools	2026-06-27 14:49:46 +02:00
colibri-runtime
colibri-skills	fix(skills): correct source-of-truth — colibri, not clawdie-ai	2026-06-26 21:43:08 +02:00
colibri-store	feat(store): T2.x Phase 1 eval harness — agent self-report	2026-06-28 08:23:05 +02:00
colibri-vault
colibri-zfs	fix: clippy lint — map_or→is_none_or in wiki tools, unused _line in zfs test	2026-06-27 17:05:56 +02:00