colibri/docs/COLIBRI-TOKENOMICS-TRIFECTA.md
Sam & Claude b233aa8d9e
2026-06-24 16:43:41 +02:00

Colibri Tokenomics — The Trifecta Framework

Source: Indie Devdan, "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" (https://www.youtube.com/watch?v=o4KZH_KSqYQ)
Date: 01.jun.2026
Status: Strategic vision; maps to existing T1.4/T1.5 work

Scope: This applies to the full Colibri control plane.

Core Thesis

More useful tokens > fewer useful tokens
Cost per intelligence > cost per token
If you don't measure, you can't improve

The video validates what Colibri is already building: a cache-first, measure-everything agent runtime. The "trifecta" is our north star.

The Trifecta

| Axis | What it means for agents | Colibri surface |
|---|---|---|
| Performance | Did the agent get it right? Task success rate | Task outcomes, eval harness (T1.6) |
| Speed | Tokens/second, cache-hit ratio, latency | colibri-deepseek cache probe, T1.4 |
| Cost | Dollars per task (per result, not per token) | cost.rs CostMode, escalation, metering |

Optimize each dimension with full awareness of its impact on the other two. A cheap model that needs 5 retries can end up more expensive, and is certainly slower, than a capable model that gets it right in one shot.
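The retry argument above is simple arithmetic, sketched below. The function name and the per-attempt prices are illustrative assumptions, not measured Colibri data or real model prices.

```rust
/// Total spend in USD to obtain one successful result when a task is
/// retried until it succeeds. `retries` counts the failed runs that come
/// before the final successful one, so the task runs `1 + retries` times.
/// (Hypothetical helper, not part of cost.rs.)
fn cost_per_success(cost_per_attempt_usd: f64, retries: u32) -> f64 {
    cost_per_attempt_usd * f64::from(1 + retries)
}

fn main() {
    // Illustrative prices only.
    let cheap = cost_per_success(0.01, 5); // six runs at $0.01 each
    let capable = cost_per_success(0.04, 0); // one run at $0.04
    println!("cheap: ${cheap:.3}  capable: ${capable:.3}");
    if cheap > capable {
        println!("the cheap model costs more per successful task");
    }
}
```

With these placeholder numbers the cheap model loses on cost and on latency, since it also pays for six round trips instead of one.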

Token Arbitrage (the "golden line")

Arbitrage tokens for maximum value. Every token served from cache costs roughly a tenth of a fresh one, so design prompts to maximize cache-hit prefixes.

Cache-hit tokens cost ~10% of fresh tokens (DeepSeek pricing), which makes every stable-prefix token that hits cache about 90% cheaper. The arbitrage strategy:

  1. Maximize cache-hit surface: byte-stable system prefix, skills, tool definitions, agent identity — warm once, reuse thousands of times
  2. Spend where it counts: conversation turns, tool results, novel context — these are unavoidable, so make them useful (VSpecs, rich context, HTML plans)
  3. Trim where it doesn't: auto-compaction, summarization, tool result truncation — Colibri's 3-region model already does this
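To make the arbitrage concrete, here is a minimal sketch of the input-cost calculation behind the strategy above. The ~10% cached-token fraction is the figure quoted in this document; the function names and the price of $1.00 per million fresh tokens are placeholders chosen for illustration.

```rust
/// Fraction of the fresh-token price charged for a cache-hit token.
/// ~10% per the DeepSeek pricing cited in this document; treat as configurable.
const CACHED_PRICE_FRACTION: f64 = 0.10;

/// Input cost in USD for one request.
/// `fresh_price_per_mtok` is a hypothetical price per million fresh input tokens.
fn input_cost_usd(cache_hit_tokens: u64, fresh_tokens: u64, fresh_price_per_mtok: f64) -> f64 {
    let per_token = fresh_price_per_mtok / 1_000_000.0;
    let hit = cache_hit_tokens as f64 * per_token * CACHED_PRICE_FRACTION;
    let fresh = fresh_tokens as f64 * per_token;
    hit + fresh
}

/// Savings in USD relative to paying the fresh price for every input token.
fn cache_savings_usd(cache_hit_tokens: u64, fresh_price_per_mtok: f64) -> f64 {
    let per_token = fresh_price_per_mtok / 1_000_000.0;
    cache_hit_tokens as f64 * per_token * (1.0 - CACHED_PRICE_FRACTION)
}

fn main() {
    let price = 1.0; // placeholder USD per million fresh tokens
    // A 3,500-token stable prefix plus 1,500 tokens of conversation.
    let cold = input_cost_usd(0, 5_000, price); // nothing cached yet
    let warm = input_cost_usd(3_500, 1_500, price); // prefix served from cache
    println!("cold: ${cold:.5}  warm: ${warm:.5}");
    println!("saved per request: ${:.5}", cache_savings_usd(3_500, price));
}
```

The saving scales with the size of the stable prefix and with how often it is reused, which is why prefix churn matters as much as prefix size.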

Existing Colibri arbitrage infrastructure

T1.4 Prompt Discipline (code present, integration in progress):
  Region 1: STABLE_SYSTEM_PREFIX          → cache-hit (90% cheaper)
  Region 2: conversation log (compacted)  → fresh tokens
  Region 3: volatile scratch (empty)      → zero cost

CostMode escalation (Fast → Smart → Max):
  Fast:    500K budget, compact tool results, 5 turns
  Smart:   2M budget, keep tool results, 20 turns  ← default
  Max:     8M budget, full context, 100 turns

Cache warming (T1.4 PR3b, merged):
  Pre-warm STABLE_SYSTEM_PREFIX on daemon startup
  Re-warm every N hours (configurable)
  ~3,500 tokens per warm cycle → pays off in ~7 agent tasks
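The CostMode ladder above could be modeled roughly as follows. This is a sketch under assumptions, not the actual contents of cost.rs: the method names are hypothetical, and the budgets are assumed to be counted in tokens.

```rust
/// Escalation ladder from the listing above: Fast -> Smart -> Max.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CostMode {
    Fast,
    Smart,
    Max,
}

impl CostMode {
    /// Budget per task (assumed to be tokens).
    fn token_budget(self) -> u64 {
        match self {
            CostMode::Fast => 500_000,
            CostMode::Smart => 2_000_000,
            CostMode::Max => 8_000_000,
        }
    }

    /// Maximum number of agent turns allowed in this mode.
    fn max_turns(self) -> u32 {
        match self {
            CostMode::Fast => 5,
            CostMode::Smart => 20,
            CostMode::Max => 100,
        }
    }

    /// Next mode up the ladder, or `None` when already at `Max`.
    fn escalate(self) -> Option<CostMode> {
        match self {
            CostMode::Fast => Some(CostMode::Smart),
            CostMode::Smart => Some(CostMode::Max),
            CostMode::Max => None,
        }
    }
}

impl Default for CostMode {
    /// Smart is marked as the default in the listing above.
    fn default() -> Self {
        CostMode::Smart
    }
}

fn main() {
    // Walk the ladder from the cheapest mode to the most expensive one.
    let mut mode = Some(CostMode::Fast);
    while let Some(m) = mode {
        println!("{:?}: {} budget, {} turns", m, m.token_budget(), m.max_turns());
        mode = m.escalate();
    }
}
```

Returning `Option` from `escalate` forces the caller to decide what happens when a task still fails at `Max`, instead of silently looping at the top tier.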

What We Still Need (Trifecta Dashboard)

The video's core message: observability isn't optional for production agents. Colibri already captures the raw data. What's missing is the trifecta view:

Per-task cost tracking

task_id: "abc123"
model: "deepseek-v4-flash"
tokens_in: 45,230   (12,100 cache-hit, 33,130 fresh)
tokens_out: 2,847
cost: $0.047         (cache savings: $0.012)
latency: 8.3s
success: true
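One possible in-memory shape for the record above, as a hedged sketch: the struct and field names are assumptions derived from the example, not an existing Colibri type. Storing the cache-hit and fresh counts separately lets the total and the hit ratio be derived rather than duplicated.

```rust
/// Per-task cost record mirroring the example above (hypothetical type).
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct TaskCostRecord {
    task_id: String,
    model: String,
    tokens_in_cache_hit: u64,
    tokens_in_fresh: u64,
    tokens_out: u64,
    cost_usd: f64,
    cache_savings_usd: f64,
    latency_secs: f64,
    success: bool,
}

impl TaskCostRecord {
    /// Total input tokens, cached and fresh combined.
    fn tokens_in(&self) -> u64 {
        self.tokens_in_cache_hit + self.tokens_in_fresh
    }

    /// Share of input tokens served from cache (0.0 when there was no input).
    fn cache_hit_ratio(&self) -> f64 {
        match self.tokens_in() {
            0 => 0.0,
            total => self.tokens_in_cache_hit as f64 / total as f64,
        }
    }
}

/// The sample values from the record shown above.
fn sample_record() -> TaskCostRecord {
    TaskCostRecord {
        task_id: "abc123".to_string(),
        model: "deepseek-v4-flash".to_string(),
        tokens_in_cache_hit: 12_100,
        tokens_in_fresh: 33_130,
        tokens_out: 2_847,
        cost_usd: 0.047,
        cache_savings_usd: 0.012,
        latency_secs: 8.3,
        success: true,
    }
}

fn main() {
    let record = sample_record();
    println!(
        "{}: {} input tokens, {:.1}% cache-hit",
        record.task_id,
        record.tokens_in(),
        record.cache_hit_ratio() * 100.0
    );
}
```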

Trifecta balance sheet

Performance  ████████░░  82% task success (rolling 24h)
Speed        ██████░░░░  61% cache-hit ratio
Cost         ████████░░  $0.047 avg/task (target: <$0.05)
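The balance sheet above is an aggregation over per-task records. A minimal sketch, assuming the caller has already filtered tasks to the reporting window (for example the last 24 hours); all type and function names here are hypothetical.

```rust
/// One finished task inside the reporting window (hypothetical shape).
struct Outcome {
    success: bool,
    cache_hit_tokens: u64,
    input_tokens: u64,
    cost_usd: f64,
}

/// Aggregated trifecta figures for a window of tasks.
struct Summary {
    success_rate: f64,
    cache_hit_ratio: f64,
    avg_cost_usd: f64,
}

/// Aggregate a window of outcomes; `None` when the window is empty.
fn summarize(outcomes: &[Outcome]) -> Option<Summary> {
    if outcomes.is_empty() {
        return None;
    }
    let n = outcomes.len() as f64;
    let successes = outcomes.iter().filter(|o| o.success).count() as f64;
    let hits: u64 = outcomes.iter().map(|o| o.cache_hit_tokens).sum();
    let inputs: u64 = outcomes.iter().map(|o| o.input_tokens).sum();
    let cost: f64 = outcomes.iter().map(|o| o.cost_usd).sum();
    Some(Summary {
        success_rate: successes / n,
        cache_hit_ratio: if inputs == 0 { 0.0 } else { hits as f64 / inputs as f64 },
        avg_cost_usd: cost / n,
    })
}

/// Render a ratio in [0, 1] as a ten-cell bar, rounded to the nearest cell.
fn bar(ratio: f64) -> String {
    let filled = (ratio.clamp(0.0, 1.0) * 10.0).round() as usize;
    format!("{}{}", "█".repeat(filled), "░".repeat(10 - filled))
}

/// Made-up sample data for illustration.
fn sample_window() -> Vec<Outcome> {
    vec![
        Outcome { success: true, cache_hit_tokens: 600, input_tokens: 1_000, cost_usd: 0.04 },
        Outcome { success: true, cache_hit_tokens: 500, input_tokens: 1_000, cost_usd: 0.05 },
        Outcome { success: false, cache_hit_tokens: 700, input_tokens: 1_000, cost_usd: 0.06 },
        Outcome { success: true, cache_hit_tokens: 640, input_tokens: 1_000, cost_usd: 0.05 },
    ]
}

fn main() {
    let s = summarize(&sample_window()).expect("window is not empty");
    println!("Performance  {}  {:.0}% task success", bar(s.success_rate), s.success_rate * 100.0);
    println!("Speed        {}  {:.0}% cache-hit ratio", bar(s.cache_hit_ratio), s.cache_hit_ratio * 100.0);
    println!("Cost         ${:.3} avg/task", s.avg_cost_usd);
}
```

The cache-hit ratio is weighted by tokens rather than averaged per task, so one large cold request counts for more than several small warm ones. Whether that matches the intended dashboard semantics is a design decision left open here.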

Model selection arbitrage

Given a task, Colibri should be able to answer:

  • Can this task be handled by a cheap model (DeepSeek V3, Gemini Flash)?
  • Is the cache-hit ratio high enough that the premium model is actually cheaper?
  • What's the cost delta between models for this specific task type?
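The second question has a closed-form answer under a simple model: if cached tokens cost a fraction `f` of the fresh price `p`, the effective input price at hit ratio `h` is `p * (1 - h * (1 - f))`. The sketch below solves for the hit ratio at which a premium model matches a cheaper model's effective price. It covers input tokens only and ignores output tokens, retries, and success rates; all names and prices are illustrative assumptions.

```rust
/// Effective price per input token at a given cache-hit ratio.
/// `cached_fraction` is the share of the fresh price paid for a cached token.
fn effective_price(fresh_price: f64, hit_ratio: f64, cached_fraction: f64) -> f64 {
    fresh_price * (1.0 - hit_ratio * (1.0 - cached_fraction))
}

/// Smallest hit ratio at which the premium model's effective input price
/// drops to the cheap model's effective price. Returns `None` when even a
/// 100% hit ratio is not enough.
/// Preconditions: `premium_price > 0.0` and `cached_fraction < 1.0`.
fn break_even_hit_ratio(
    premium_price: f64,
    cheap_effective_price: f64,
    cached_fraction: f64,
) -> Option<f64> {
    if cheap_effective_price >= premium_price {
        return Some(0.0); // the premium model is already no more expensive
    }
    let h = (1.0 - cheap_effective_price / premium_price) / (1.0 - cached_fraction);
    if h <= 1.0 { Some(h) } else { None }
}

fn main() {
    // Placeholder prices: premium costs 1.0 per unit of input, the cheap
    // model 0.5 with a cold cache. Cached tokens cost ~10% of fresh ones.
    match break_even_hit_ratio(1.0, 0.5, 0.10) {
        Some(h) => println!("premium wins above a {:.1}% hit ratio", h * 100.0),
        None => println!("premium never wins on input cost alone"),
    }
}
```

A real router would combine this with the success-rate and latency data from the eval harness, since input price is only one leg of the trifecta.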

Visual Specs (VSpecs) — Future Input Modality

The video introduces "VSpecs": plans with embedded images generated by GPT Image 2. Multimodal models (Gemini 3.5 Flash, GPT-5) read these images as "useful tokens" — a UI mockup is worth 1000 words of text description.

For Colibri: this means the prompt assembly pipeline should eventually support image tokens in Region 2 (conversation log). NOT for T1.4 — this is T2.x territory. But the cost model should be ready for mixed text+image token budgets.

Golden Rules (from the video, adapted for Colibri)

  1. Measure everything. Every tool call, every token, every dollar. Colibri's glasspane architecture already captures the event stream; the trifecta dashboard makes it actionable.

  2. Arbitrage cache vs spend. The stable prefix is the cheapest context available (~10% of the fresh-token price once warm). Maximize its size, minimize its churn.

  3. Cost per intelligence, not per token. Compare cost-per-successful-task, not raw model prices in isolation. A $0.05 task that works is infinitely cheaper than a $0.01 task that fails.

  4. Trade-offs are engineering. There is no "best" model. There is only the right model for THIS task, under THESE constraints.

  5. Closed loop: measure → analyze → improve. The trifecta dashboard isn't a report — it's a feedback loop. Every task feeds back into model selection, prompt design, and cache strategy.

Integration with Existing Work

| Colibri component | Trifecta role | Status |
|---|---|---|
| colibri-deepseek | Cache probe, hit metering | done |
| colibri-daemon/cost.rs | CostMode, budget enforcement | done |
| colibri-daemon/session.rs | 3-region prompt, compaction | done |
| Cache warming (T1.4 PR3b) | Pre-warm stable prefix | done |
| Prompt discipline (T1.4) | Byte-stable assembly, cost-aware trim | 🔧 WIP |
| Trifecta dashboard (T1.5) | Per-task cost/speed/perf metrics | 📋 plan |
| Eval harness (T1.6) | Task success measurement | 📋 plan |
| Model selection (T2.x) | Arbitrage engine, cost-aware routing | 📋 plan |
| VSpec support (T2.x) | Image tokens in prompt assembly | 📋 plan |

Reference

  • Video: "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" https://www.youtube.com/watch?v=o4KZH_KSqYQ
  • Colibri T1.4 Prompt Discipline: docs/T1.4-PROMPT-DISCIPLINE-PLAN.md
  • Colibri Glasspane Design: docs/COLIBRI-GLASSPANE-DESIGN.md