diff --git a/docs/COLIBRI-TOKENOMICS-TRIFECTA.md b/docs/COLIBRI-TOKENOMICS-TRIFECTA.md
index e0fb114..a31d4f9 100644
--- a/docs/COLIBRI-TOKENOMICS-TRIFECTA.md
+++ b/docs/COLIBRI-TOKENOMICS-TRIFECTA.md
@@ -22,11 +22,11 @@ measure-everything agent runtime. The "trifecta" is our north star.
 
 ## The Trifecta
 
-| Axis        | What it means for agents                          | Colibri surface                       |
-|-------------|---------------------------------------------------|---------------------------------------|
-| Performance | Did the agent get it right? Task success rate     | Task outcomes, eval harness (T1.6)    |
-| Speed       | Tokens/second, cache-hit ratio, latency           | `colibri-deepseek` cache probe, T1.4  |
-| Cost        | Dollars per task. Not per token — per *result*    | `cost.rs` CostMode, escalation, metering |
+| Axis        | What it means for agents                       | Colibri surface                          |
+| ----------- | ---------------------------------------------- | ---------------------------------------- |
+| Performance | Did the agent get it right? Task success rate  | Task outcomes, eval harness (T1.6)       |
+| Speed       | Tokens/second, cache-hit ratio, latency        | `colibri-deepseek` cache probe, T1.4     |
+| Cost        | Dollars per task. Not per token — per _result_ | `cost.rs` CostMode, escalation, metering |
 
 You cannot optimize one without understanding impact on the other two.
 A cheap model that needs 5 retries is more expensive than a capable model
@@ -43,7 +43,7 @@ strategy:
 1. **Maximize cache-hit surface**: byte-stable system prefix, skills,
    tool definitions, agent identity — warm once, reuse thousands of times
 2. **Spend where it counts**: conversation turns, tool results, novel
-   context — these are unavoidable, so make them *useful* (VSpecs,
+   context — these are unavoidable, so make them _useful_ (VSpecs,
    rich context, HTML plans)
 3. **Trim where it doesn't**: auto-compaction, summarization, tool result
    truncation — Colibri's 3-region model already does this
@@ -96,6 +96,7 @@ Cost ████████░░ $0.047 avg/task (target: <$0.05)
 ### Model selection arbitrage
 
 Given a task, Colibri should be able to answer:
+
 - Can this task be handled by a cheap model (DeepSeek V3, Gemini Flash)?
 - Is the cache-hit ratio high enough that the premium model is actually cheaper?
 - What's the cost delta between models for this specific task type?
@@ -134,17 +135,17 @@ text+image token budgets.
 
 ## Integration with Existing Work
 
-| Colibri component            | Trifecta role                           | Status  |
-|------------------------------|-----------------------------------------|---------|
-| `colibri-deepseek`           | Cache probe, hit metering               | ✅ done |
-| `colibri-daemon/cost.rs`     | CostMode, budget enforcement            | ✅ done |
-| `colibri-daemon/session.rs`  | 3-region prompt, compaction             | ✅ done |
-| Cache warming (T1.4 PR3b)    | Pre-warm stable prefix                  | ✅ done |
-| Prompt discipline (T1.4)     | Byte-stable assembly, cost-aware trim   | 🔧 WIP  |
-| Trifecta dashboard (T1.5)    | Per-task cost/speed/perf metrics        | 📋 plan |
-| Eval harness (T1.6)          | Task success measurement                | 📋 plan |
-| Model selection (T2.x)       | Arbitrage engine, cost-aware routing    | 📋 plan |
-| VSpec support (T2.x)         | Image tokens in prompt assembly         | 📋 plan |
+| Colibri component           | Trifecta role                         | Status  |
+| --------------------------- | ------------------------------------- | ------- |
+| `colibri-deepseek`          | Cache probe, hit metering             | ✅ done |
+| `colibri-daemon/cost.rs`    | CostMode, budget enforcement          | ✅ done |
+| `colibri-daemon/session.rs` | 3-region prompt, compaction           | ✅ done |
+| Cache warming (T1.4 PR3b)   | Pre-warm stable prefix                | ✅ done |
+| Prompt discipline (T1.4)    | Byte-stable assembly, cost-aware trim | 🔧 WIP  |
+| Trifecta dashboard (T1.5)   | Per-task cost/speed/perf metrics      | 📋 plan |
+| Eval harness (T1.6)         | Task success measurement              | 📋 plan |
+| Model selection (T2.x)      | Arbitrage engine, cost-aware routing  | 📋 plan |
+| VSpec support (T2.x)        | Image tokens in prompt assembly       | 📋 plan |
 
 ## Reference
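
Review note: the doc touched by this diff claims that a cheap model needing 5 retries can cost more than a capable model, and that cost should be counted per result rather than per token. A minimal sketch of that arithmetic follows, assuming independent retries until success. `ModelProfile`, `cost_per_result`, and `cheaper_per_result` are hypothetical names for illustration and are not taken from `cost.rs`; the dollar figures are placeholders, not real provider prices.

```rust
/// Hypothetical per-model numbers; not part of Colibri's `cost.rs`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ModelProfile {
    /// Dollars spent on a single attempt at the task (placeholder value).
    cost_per_attempt: f64,
    /// Probability that one attempt succeeds, in the range (0.0, 1.0].
    success_rate: f64,
}

impl ModelProfile {
    /// Expected dollars per successful result.
    /// Assumes attempts are independent and retried until one succeeds,
    /// so the expected number of attempts is 1 / success_rate.
    fn cost_per_result(&self) -> f64 {
        self.cost_per_attempt / self.success_rate
    }
}

/// Returns whichever profile has the lower expected cost per result.
fn cheaper_per_result(a: ModelProfile, b: ModelProfile) -> ModelProfile {
    if a.cost_per_result() <= b.cost_per_result() {
        a
    } else {
        b
    }
}

fn main() {
    // A cheap model that succeeds one time in five (about 5 attempts per result).
    let cheap = ModelProfile { cost_per_attempt: 0.01, success_rate: 0.2 };
    // A pricier model that succeeds on the first attempt.
    let capable = ModelProfile { cost_per_attempt: 0.04, success_rate: 1.0 };

    println!("cheap:   ${:.3} per result", cheap.cost_per_result());
    println!("capable: ${:.3} per result", capable.cost_per_result());
    println!("pick:    {:?}", cheaper_per_result(cheap, capable));
}
```

With these placeholder numbers the per-attempt price favors the cheap model four to one, yet the per-result cost favors the capable one. A real router would also need measured success rates per task type, which is presumably what the planned eval harness (T1.6) would supply.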