Colibri Tokenomics — The Trifecta Framework
Source: Indie Devdan, "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" (https://www.youtube.com/watch?v=o4KZH_KSqYQ)
Date: 01.jun.2026
Status: Strategic vision, maps to existing T1.4/T1.5 work
Scope: This applies to the full Colibri control plane.
Core Thesis
More useful tokens > fewer useful tokens
Cost per intelligence > cost per token
If you don't measure, you can't improve
The video validates what Colibri is already building: a cache-first, measure-everything agent runtime. The "trifecta" is our north star.
The Trifecta
| Axis | What it means for agents | Colibri surface |
|---|---|---|
| Performance | Did the agent get it right? Task success rate | Task outcomes, eval harness (T1.6) |
| Speed | Tokens/second, cache-hit ratio, latency | colibri-deepseek cache probe, T1.4 |
| Cost | Dollars per task. Not per token — per result | cost.rs CostMode, escalation, metering |
Optimize each dimension with full awareness of its impact on the other two. A cheap model that needs 5 retries is more expensive than a capable model that gets it right in one shot.
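The retry arithmetic can be made concrete with a minimal sketch. The function names and the dollar figures used in the usage note are illustrative assumptions, not values from Colibri or from any provider's price list.

```rust
/// Total spend for one task when every attempt is billed, including failures.
/// `retries` counts the extra attempts after the first one.
fn task_cost(cost_per_attempt_usd: f64, retries: u32) -> f64 {
    cost_per_attempt_usd * (f64::from(retries) + 1.0)
}

/// True when the model with the higher per-attempt price is still the
/// cheaper one per finished task. Each tuple is (cost per attempt, retries).
fn capable_model_wins(cheap: (f64, u32), capable: (f64, u32)) -> bool {
    task_cost(capable.0, capable.1) < task_cost(cheap.0, cheap.1)
}
```

With a hypothetical $0.01 model that needs 5 retries, `task_cost(0.01, 5)` comes to about $0.06, which is more than a hypothetical $0.04 model that succeeds on the first attempt.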
Token Arbitrage (the "golden line")
Arbitrage tokens for maximum value. Every token served from cache costs about a tenth of a fresh one, so design prompts to maximize the cache-hit prefix.
Cache-hit tokens cost ~10% of fresh tokens (DeepSeek pricing), so every token in the stable prefix that hits cache is about 90% cheaper. The arbitrage strategy:
- Maximize cache-hit surface: byte-stable system prefix, skills, tool definitions, agent identity — warm once, reuse thousands of times
- Spend where it counts: conversation turns, tool results, novel context — these are unavoidable, so make them useful (VSpecs, rich context, HTML plans)
- Trim where it doesn't: auto-compaction, summarization, tool result truncation — Colibri's 3-region model already does this
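A minimal sketch of the pricing arithmetic behind this strategy, assuming the ~10% cache-hit rate quoted above. The constant, the function names, and the `price_per_mtok` parameter (fresh input price in dollars per million tokens) are assumptions made for illustration, not Colibri's actual metering code.

```rust
/// Assumed discount: cache-hit tokens are billed at ~10% of the fresh price.
const CACHE_HIT_FACTOR: f64 = 0.10;

/// Input cost in dollars for one request, given the cache-hit / fresh split.
/// `price_per_mtok` is a hypothetical fresh-token price per million tokens.
fn input_cost(cache_hit: u64, fresh: u64, price_per_mtok: f64) -> f64 {
    let billable = cache_hit as f64 * CACHE_HIT_FACTOR + fresh as f64;
    billable * price_per_mtok / 1_000_000.0
}

/// Dollars saved by the cache compared with paying full price for every token.
fn cache_savings(cache_hit: u64, price_per_mtok: f64) -> f64 {
    cache_hit as f64 * (1.0 - CACHE_HIT_FACTOR) * price_per_mtok / 1_000_000.0
}
```

Under these assumptions, moving a token from the fresh bucket to the cache-hit bucket removes 90% of its cost, which is why growing the stable prefix matters more than shaving a few tokens off a conversation turn.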
Existing Colibri arbitrage infrastructure
T1.4 Prompt Discipline (code present, integration in progress):
Region 1: STABLE_SYSTEM_PREFIX → cache-hit (90% cheaper)
Region 2: conversation log (compacted) → fresh tokens
Region 3: volatile scratch (empty) → zero cost
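The three regions above can be sketched as a simple concatenation in which Region 1 always comes first and never changes between requests. This is an illustrative sketch only: the prefix text, `assemble_prompt`, and `shared_prefix_len` are hypothetical and do not reflect the real code in session.rs.

```rust
/// Hypothetical stand-in for the stable prefix (system prompt, skills, tools).
const STABLE_SYSTEM_PREFIX: &str = "You are a Colibri agent.\n[skills]\n[tool definitions]\n";

/// Assemble a prompt as Region 1 + Region 2 + Region 3, in that fixed order.
fn assemble_prompt(compacted_log: &[&str], scratch: &str) -> String {
    // Region 1: byte-stable prefix, identical on every request.
    let mut prompt = String::from(STABLE_SYSTEM_PREFIX);
    // Region 2: compacted conversation log, billed as fresh tokens.
    for turn in compacted_log {
        prompt.push_str(turn);
        prompt.push('\n');
    }
    // Region 3: volatile scratch, normally empty.
    prompt.push_str(scratch);
    prompt
}

/// Number of leading bytes two prompts have in common.
fn shared_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}
```

Because nothing volatile is written ahead of Region 1, any two prompts built this way share at least `STABLE_SYSTEM_PREFIX.len()` leading bytes, which is the property a prefix cache depends on.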
CostMode escalation (Fast → Smart → Max):
Fast: 500K budget, compact tool results, 5 turns
Smart: 2M budget, keep tool results, 20 turns ← default
Max: 8M budget, full context, 100 turns
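The escalation ladder above could be modeled roughly as follows. This is a sketch, not the contents of cost.rs: the method names are hypothetical, and it assumes the budgets are counted in tokens, which the listing does not state explicitly.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CostMode {
    Fast,
    Smart,
    Max,
}

impl CostMode {
    /// Budget per mode (assumed to be a token count).
    fn token_budget(self) -> u64 {
        match self {
            CostMode::Fast => 500_000,
            CostMode::Smart => 2_000_000,
            CostMode::Max => 8_000_000,
        }
    }

    /// Maximum number of turns allowed in this mode.
    fn max_turns(self) -> u32 {
        match self {
            CostMode::Fast => 5,
            CostMode::Smart => 20,
            CostMode::Max => 100,
        }
    }

    /// Next rung on the ladder, or None when already at Max.
    fn escalate(self) -> Option<CostMode> {
        match self {
            CostMode::Fast => Some(CostMode::Smart),
            CostMode::Smart => Some(CostMode::Max),
            CostMode::Max => None,
        }
    }
}

impl Default for CostMode {
    /// Smart is marked as the default in the listing above.
    fn default() -> Self {
        CostMode::Smart
    }
}
```

Returning `Option` from `escalate` forces the caller to decide what happens when a task exhausts the Max budget instead of silently looping.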
Cache warming (T1.4 PR3b, merged):
Pre-warm STABLE_SYSTEM_PREFIX on daemon startup
Re-warm every N hours (configurable)
~3,500 tokens per warm cycle → pays off in ~7 agent tasks
What We Still Need (Trifecta Dashboard)
The video's core message: observability isn't optional for production agents. Colibri already captures the raw data. What's missing is the trifecta view:
Per-task cost tracking
task_id: "abc123"
model: "deepseek-v4-flash"
tokens_in: 45,230 (12,100 cache-hit, 33,130 fresh)
tokens_out: 2,847
cost: $0.047 (cache savings: $0.012)
latency: 8.3s
success: true
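One possible shape for such a record is sketched below. The struct, its field names, and `sample_record` are hypothetical; only the sample values are taken from the record above.

```rust
#[derive(Debug, Clone)]
struct TaskRecord {
    task_id: String,
    model: String,
    tokens_in_cache_hit: u64,
    tokens_in_fresh: u64,
    tokens_out: u64,
    cost_usd: f64,
    latency_secs: f64,
    success: bool,
}

impl TaskRecord {
    /// Total input tokens, cached plus fresh.
    fn tokens_in(&self) -> u64 {
        self.tokens_in_cache_hit + self.tokens_in_fresh
    }

    /// Fraction of input tokens served from cache (0.0 when there is no input).
    fn cache_hit_ratio(&self) -> f64 {
        let total = self.tokens_in();
        if total == 0 {
            0.0
        } else {
            self.tokens_in_cache_hit as f64 / total as f64
        }
    }
}

/// The sample record from the text, expressed as a value.
fn sample_record() -> TaskRecord {
    TaskRecord {
        task_id: "abc123".to_string(),
        model: "deepseek-v4-flash".to_string(),
        tokens_in_cache_hit: 12_100,
        tokens_in_fresh: 33_130,
        tokens_out: 2_847,
        cost_usd: 0.047,
        latency_secs: 8.3,
        success: true,
    }
}
```

Storing the cache-hit and fresh counts separately, and deriving the total, keeps the record internally consistent: for the sample above the two parts sum to 45,230 and the cache-hit ratio is roughly 27%.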
Trifecta balance sheet
Performance ████████░░ 82% task success (rolling 24h)
Speed ██████░░░░ 61% cache-hit ratio
Cost ████████░░ $0.047 avg/task (target: <$0.05)
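A rough sketch of how the three numbers and the ten-cell bars might be computed from a window of task samples. All names here are hypothetical, and the sketch assumes the cache-hit ratio is token-weighted across the window instead of averaged per task.

```rust
struct TaskSample {
    success: bool,
    cache_hit_tokens: u64,
    input_tokens: u64,
    cost_usd: f64,
}

struct Trifecta {
    success_rate: f64,
    cache_hit_ratio: f64,
    avg_cost_usd: f64,
}

/// Aggregate a window of samples; returns None for an empty window.
fn summarize(samples: &[TaskSample]) -> Option<Trifecta> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let successes = samples.iter().filter(|s| s.success).count() as f64;
    let hits: u64 = samples.iter().map(|s| s.cache_hit_tokens).sum();
    let input: u64 = samples.iter().map(|s| s.input_tokens).sum();
    let cost: f64 = samples.iter().map(|s| s.cost_usd).sum();
    Some(Trifecta {
        success_rate: successes / n,
        cache_hit_ratio: if input == 0 { 0.0 } else { hits as f64 / input as f64 },
        avg_cost_usd: cost / n,
    })
}

/// Render a fraction in [0, 1] as a ten-cell bar, rounded to the nearest cell.
fn bar(fraction: f64) -> String {
    let filled = (fraction.clamp(0.0, 1.0) * 10.0).round() as usize;
    format!("{}{}", "█".repeat(filled), "░".repeat(10 - filled))
}
```

With this rounding, `bar(0.82)` and `bar(0.61)` yield eight and six filled cells, matching the Performance and Speed rows shown above.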
Model selection arbitrage
Given a task, Colibri should be able to answer:
- Can this task be handled by a cheap model (DeepSeek V3, Gemini Flash)?
- Is the cache-hit ratio high enough that the premium model is actually cheaper?
- What's the cost delta between models for this specific task type?
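These questions can be framed as a comparison of expected input cost per successful task. The sketch below is a simplified illustration under stated assumptions: cached tokens are billed at ~10% of the fresh price, output tokens are ignored, and `ModelProfile`, its fields, and both functions are hypothetical names, not an existing Colibri API.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ModelProfile {
    name: &'static str,
    /// Fresh input price in USD per million tokens (hypothetical figure).
    price_per_mtok: f64,
    /// Observed fraction of input tokens served from cache for this task type.
    cache_hit_ratio: f64,
    /// Observed task success rate for this task type.
    success_rate: f64,
}

/// Expected input cost per *successful* task, or None if the success rate
/// is not in (0, 1]. Assumes cached tokens cost 10% of fresh ones.
fn expected_cost(m: &ModelProfile, input_tokens: u64) -> Option<f64> {
    if !(m.success_rate > 0.0 && m.success_rate <= 1.0) {
        return None;
    }
    let hit = m.cache_hit_ratio.clamp(0.0, 1.0);
    let billable = input_tokens as f64 * (1.0 - 0.9 * hit);
    Some(billable * m.price_per_mtok / 1_000_000.0 / m.success_rate)
}

/// Pick the model with the lowest expected cost per successful task.
fn cheapest_per_success(models: &[ModelProfile], input_tokens: u64) -> Option<&ModelProfile> {
    models
        .iter()
        .filter_map(|m| expected_cost(m, input_tokens).map(|c| (m, c)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(m, _)| m)
}
```

Under these assumptions a premium model with a warm cache and a high success rate can undercut a nominally cheaper model with a cold cache and frequent failures. The difference between two `expected_cost` results is the cost delta for that task type.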
Visual Specs (VSpecs) — Future Input Modality
The video introduces "VSpecs": plans with embedded images generated by GPT Image 2. Multimodal models (Gemini 3.5 Flash, GPT-5) read these images as "useful tokens" — a UI mockup is worth 1000 words of text description.
For Colibri: this means the prompt assembly pipeline should eventually support image tokens in Region 2 (conversation log). NOT for T1.4 — this is T2.x territory. But the cost model should be ready for mixed text+image token budgets.
Golden Rules (from the video, adapted for Colibri)
- Measure everything. Every tool call, every token, every dollar. Colibri's glasspane architecture already captures the event stream; the trifecta dashboard makes it actionable.
- Arbitrage cache vs spend. The stable prefix is free money. Maximize its size, minimize its churn.
- Cost per intelligence, not per token. Compare cost-per-successful-task, not raw model prices in isolation. A $0.05 task that works is infinitely cheaper than a $0.01 task that fails.
- Trade-offs are engineering. There is no "best" model. There is only the right model for THIS task, under THESE constraints.
- Closed loop: measure → analyze → improve. The trifecta dashboard isn't a report; it's a feedback loop. Every task feeds back into model selection, prompt design, and cache strategy.
Integration with Existing Work
| Colibri component | Trifecta role | Status |
|---|---|---|
| colibri-deepseek | Cache probe, hit metering | ✅ done |
| colibri-daemon/cost.rs | CostMode, budget enforcement | ✅ done |
| colibri-daemon/session.rs | 3-region prompt, compaction | ✅ done |
| Cache warming (T1.4 PR3b) | Pre-warm stable prefix | ✅ done |
| Prompt discipline (T1.4) | Byte-stable assembly, cost-aware trim | 🔧 WIP |
| Trifecta dashboard (T1.5) | Per-task cost/speed/perf metrics | 📋 plan |
| Eval harness (T1.6) | Task success measurement | 📋 plan |
| Model selection (T2.x) | Arbitrage engine, cost-aware routing | 📋 plan |
| VSpec support (T2.x) | Image tokens in prompt assembly | 📋 plan |
Reference
- Video: "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" https://www.youtube.com/watch?v=o4KZH_KSqYQ
- Colibri T1.4 Prompt Discipline: docs/T1.4-PROMPT-DISCIPLINE-PLAN.md
- Colibri Glasspane Design: docs/COLIBRI-GLASSPANE-DESIGN.md