docs: Colibri Tokenomics — trifecta framework (performance/speed/cost) #15

Merged
clawdie merged 3 commits from docs/tokenomics-trifecta-v2 into main 2026-06-02 17:56:08 +02:00
Showing only changes of commit 0d80bb161d

@@ -22,11 +22,11 @@ measure-everything agent runtime. The "trifecta" is our north star.
## The Trifecta
-| Axis | What it means for agents | Colibri surface |
-|-------------|---------------------------------------------------|---------------------------------------|
-| Performance | Did the agent get it right? Task success rate | Task outcomes, eval harness (T1.6) |
-| Speed | Tokens/second, cache-hit ratio, latency | `colibri-deepseek` cache probe, T1.4 |
-| Cost | Dollars per task. Not per token — per *result* | `cost.rs` CostMode, escalation, metering |
+| Axis | What it means for agents | Colibri surface |
+| ----------- | ---------------------------------------------- | ---------------------------------------- |
+| Performance | Did the agent get it right? Task success rate | Task outcomes, eval harness (T1.6) |
+| Speed | Tokens/second, cache-hit ratio, latency | `colibri-deepseek` cache probe, T1.4 |
+| Cost | Dollars per task. Not per token — per _result_ | `cost.rs` CostMode, escalation, metering |
You cannot optimize one without understanding impact on the other two.
A cheap model that needs 5 retries is more expensive than a capable model
@@ -43,7 +43,7 @@ strategy:
1. **Maximize cache-hit surface**: byte-stable system prefix, skills,
tool definitions, agent identity — warm once, reuse thousands of times
2. **Spend where it counts**: conversation turns, tool results, novel
-context — these are unavoidable, so make them *useful* (VSpecs,
+context — these are unavoidable, so make them _useful_ (VSpecs,
rich context, HTML plans)
3. **Trim where it doesn't**: auto-compaction, summarization, tool result
truncation — Colibri's 3-region model already does this
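
A minimal sketch of what "byte-stable" could mean in practice, assuming the stable prefix is assembled from four parts in a fixed order and that unordered collections (skills, tool definitions) are sorted before serialization. The `StablePrefix` type, its field names, and the use of an FNV-1a fingerprint are hypothetical illustrations, not Colibri's actual `session.rs` API. The idea is that two logically identical prefixes always serialize to the same bytes, so a provider-side prefix cache keeps hitting.

```rust
/// FNV-1a 64-bit hash: small, dependency-free, and deterministic across
/// runs and platforms, which is what a prefix fingerprint needs.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical container for the cache-stable part of a prompt.
struct StablePrefix {
    system: String,
    skills: Vec<String>,
    tool_definitions: Vec<String>,
    agent_identity: String,
}

impl StablePrefix {
    /// Serialize the parts in a fixed order. Skills and tool definitions
    /// are sorted first so that registration order cannot change the bytes.
    fn assemble(&self) -> String {
        let mut skills = self.skills.clone();
        skills.sort();
        let mut tools = self.tool_definitions.clone();
        tools.sort();

        let mut out = String::new();
        out.push_str(&self.system);
        out.push('\n');
        for skill in &skills {
            out.push_str(skill);
            out.push('\n');
        }
        for tool in &tools {
            out.push_str(tool);
            out.push('\n');
        }
        out.push_str(&self.agent_identity);
        out.push('\n');
        out
    }

    /// Fingerprint of the assembled bytes. A changed fingerprint between
    /// two turns signals that the warmed cache entry will no longer match.
    fn fingerprint(&self) -> u64 {
        fnv1a64(self.assemble().as_bytes())
    }
}
```

Comparing `fingerprint()` before and after a configuration change is one cheap way to notice an accidental cache invalidation, for example a timestamp leaking into the system text.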
@@ -96,6 +96,7 @@ Cost ████████░░ $0.047 avg/task (target: <$0.05)
### Model selection arbitrage
Given a task, Colibri should be able to answer:
- Can this task be handled by a cheap model (DeepSeek V3, Gemini Flash)?
- Is the cache-hit ratio high enough that the premium model is actually cheaper?
- What's the cost delta between models for this specific task type?
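
The three questions above can be sketched as a cost-per-result comparison. This is an illustration under stated assumptions, not the planned arbitrage engine: `ModelProfile`, `TaskShape`, and all function names are hypothetical, prices are placeholders rather than real provider rates, and retries are assumed independent, so the expected number of attempts per success is `1 / success_rate`.

```rust
/// Hypothetical per-model inputs. Prices are in dollars per million tokens.
struct ModelProfile {
    name: &'static str,
    input_price_per_mtok: f64,
    cached_input_price_per_mtok: f64,
    output_price_per_mtok: f64,
    /// Probability that a single attempt succeeds, in (0, 1].
    success_rate: f64,
    /// Fraction of input tokens served from the prefix cache, in [0, 1].
    cache_hit_ratio: f64,
}

/// Hypothetical token footprint of one attempt at a task type.
struct TaskShape {
    input_tokens: u64,
    output_tokens: u64,
}

/// Dollar cost of one attempt, splitting input into cached and uncached tokens.
fn cost_per_attempt(m: &ModelProfile, t: &TaskShape) -> f64 {
    let input = t.input_tokens as f64;
    let cached = input * m.cache_hit_ratio;
    let uncached = input - cached;
    (cached * m.cached_input_price_per_mtok
        + uncached * m.input_price_per_mtok
        + t.output_tokens as f64 * m.output_price_per_mtok)
        / 1_000_000.0
}

/// Expected dollars per successful result: attempt cost divided by the
/// success rate. Returns None for a success rate outside (0, 1] or NaN.
fn cost_per_result(m: &ModelProfile, t: &TaskShape) -> Option<f64> {
    if !(m.success_rate > 0.0 && m.success_rate <= 1.0) {
        return None;
    }
    Some(cost_per_attempt(m, t) / m.success_rate)
}

/// Pick whichever model is cheaper per result for this task shape.
fn cheaper_per_result<'a>(
    a: &'a ModelProfile,
    b: &'a ModelProfile,
    t: &TaskShape,
) -> Option<&'a ModelProfile> {
    let cost_a = cost_per_result(a, t)?;
    let cost_b = cost_per_result(b, t)?;
    Some(if cost_a <= cost_b { a } else { b })
}
```

With placeholder numbers, a model that costs $0.012 per attempt but succeeds one time in five comes out at $0.06 per result, while a model that costs $0.0117 per attempt at a 90% cache-hit ratio and succeeds every time stays at $0.0117. The per-token price alone would have picked the wrong one.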
@@ -134,17 +135,17 @@ text+image token budgets.
## Integration with Existing Work
-| Colibri component | Trifecta role | Status |
-|------------------------------|-----------------------------------------|---------|
-| `colibri-deepseek` | Cache probe, hit metering | ✅ done |
-| `colibri-daemon/cost.rs` | CostMode, budget enforcement | ✅ done |
-| `colibri-daemon/session.rs` | 3-region prompt, compaction | ✅ done |
-| Cache warming (T1.4 PR3b) | Pre-warm stable prefix | ✅ done |
-| Prompt discipline (T1.4) | Byte-stable assembly, cost-aware trim | 🔧 WIP |
-| Trifecta dashboard (T1.5) | Per-task cost/speed/perf metrics | 📋 plan |
-| Eval harness (T1.6) | Task success measurement | 📋 plan |
-| Model selection (T2.x) | Arbitrage engine, cost-aware routing | 📋 plan |
-| VSpec support (T2.x) | Image tokens in prompt assembly | 📋 plan |
+| Colibri component | Trifecta role | Status |
+| --------------------------- | ------------------------------------- | ------- |
+| `colibri-deepseek` | Cache probe, hit metering | ✅ done |
+| `colibri-daemon/cost.rs` | CostMode, budget enforcement | ✅ done |
+| `colibri-daemon/session.rs` | 3-region prompt, compaction | ✅ done |
+| Cache warming (T1.4 PR3b) | Pre-warm stable prefix | ✅ done |
+| Prompt discipline (T1.4) | Byte-stable assembly, cost-aware trim | 🔧 WIP |
+| Trifecta dashboard (T1.5) | Per-task cost/speed/perf metrics | 📋 plan |
+| Eval harness (T1.6) | Task success measurement | 📋 plan |
+| Model selection (T2.x) | Arbitrage engine, cost-aware routing | 📋 plan |
+| VSpec support (T2.x) | Image tokens in prompt assembly | 📋 plan |
## Reference