clawdie/colibri

Sam & Claude b233aa8d9e

docs: normalize prose dates to DD.mon.YYYY (AGENTS.md rule)

Convert US/ISO prose dates (2026-06-21) to EU format (21.jun.2026) across colibri
docs + wiki. Left as-is (data, not prose): the captured JSON "time" timestamp in
AGENT-EVENTS-REFERENCE and the rustc/cargo version strings in
CLAWDIE-INSTALLER-HANDOFF — ISO is correct for machine timestamps/filenames.

Gates: wiki-lint --strict clean; markdown format clean.

2026-06-24 16:43:41 +02:00

Colibri Tokenomics — The Trifecta Framework

Source: Indie Devdan, "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" (https://www.youtube.com/watch?v=o4KZH_KSqYQ)
Date: 01.jun.2026
Status: Strategic vision (maps to existing T1.4/T1.5 work)

Scope: This applies to the full Colibri control plane.

Core Thesis

More useful tokens > fewer useful tokens
Cost per intelligence > cost per token
If you don't measure, you can't improve

The video validates what Colibri is already building: a cache-first, measure-everything agent runtime. The "trifecta" is our north star.

The Trifecta

| Axis | What it means for agents | Colibri surface |
| --- | --- | --- |
| Performance | Did the agent get it right? (task success rate) | Task outcomes, eval harness (T1.6) |
| Speed | Tokens/second, cache-hit ratio, latency | `colibri-deepseek` cache probe, T1.4 |
| Cost | Dollars per task: per result, not per token | `cost.rs` CostMode, escalation, metering |

Optimize each dimension with full awareness of its impact on the other two. A cheap model that needs 5 retries (6 attempts) costs more than a capable model that gets it right in one shot, unless its per-attempt price is under a sixth of the capable model's.
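The retry arithmetic can be made concrete with a minimal sketch. It assumes attempts are independent and each succeeds with a fixed probability, so the expected number of attempts is 1 / success rate. The function name `cost_per_success` is hypothetical and not part of Colibri.

```rust
/// Expected dollar cost of one *successful* task.
///
/// Assumption: every attempt costs `cost_per_attempt` and succeeds
/// independently with probability `success_rate`, so the expected
/// number of attempts is 1 / success_rate.
fn cost_per_success(cost_per_attempt: f64, success_rate: f64) -> Option<f64> {
    // Reject 0, negatives, values above 1, and NaN: a model that never
    // succeeds has no finite cost per success.
    if !(success_rate > 0.0 && success_rate <= 1.0) {
        return None;
    }
    Some(cost_per_attempt / success_rate)
}
```

With made-up prices, a $0.01 attempt that succeeds 20% of the time works out to about $0.05 per success, which is more than a $0.03 attempt that always succeeds.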

Token Arbitrage (the "golden line")

Arbitrage tokens for maximum value. Every token that hits cache costs roughly a tenth of a fresh one, so design prompts to maximize cache-hit prefixes.

Cache-hit tokens cost ~10% of fresh tokens (DeepSeek pricing), which makes the cached part of the stable prefix about 90% cheaper. The arbitrage strategy:

Maximize cache-hit surface: byte-stable system prefix, skills, tool definitions, agent identity — warm once, reuse thousands of times
Spend where it counts: conversation turns, tool results, novel context — these are unavoidable, so make them useful (VSpecs, rich context, HTML plans)
Trim where it doesn't: auto-compaction, summarization, tool result truncation — Colibri's 3-region model already does this
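The pricing claim above can be sketched as a small cost function. This is an illustration under stated assumptions: the 10% factor is the figure quoted above, while the function names and the per-million-token price parameter are hypothetical and do not come from `colibri-deepseek`.

```rust
/// Fraction of the fresh-token price charged for a cache-hit token.
/// ~10% is the figure quoted in the text; treat it as a tunable assumption.
const CACHE_HIT_PRICE_FACTOR: f64 = 0.10;

/// Input cost in dollars for one request.
/// `fresh_price_per_mtok` is a caller-supplied price per million fresh
/// input tokens (hypothetical parameter, not a real price list).
fn input_cost(cache_hit_tokens: u64, fresh_tokens: u64, fresh_price_per_mtok: f64) -> f64 {
    let per_token = fresh_price_per_mtok / 1_000_000.0;
    let hit = cache_hit_tokens as f64 * per_token * CACHE_HIT_PRICE_FACTOR;
    let fresh = fresh_tokens as f64 * per_token;
    hit + fresh
}

/// Dollars saved compared with paying the fresh price for every cached token.
fn cache_savings(cache_hit_tokens: u64, fresh_price_per_mtok: f64) -> f64 {
    let per_token = fresh_price_per_mtok / 1_000_000.0;
    cache_hit_tokens as f64 * per_token * (1.0 - CACHE_HIT_PRICE_FACTOR)
}
```

Adding `cache_savings` back to `input_cost` gives the all-fresh price, a quick sanity check for any metering code built along these lines.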

Existing Colibri arbitrage infrastructure

T1.4 Prompt Discipline (code present, integration in progress):
  Region 1: STABLE_SYSTEM_PREFIX          → cache-hit (90% cheaper)
  Region 2: conversation log (compacted)  → fresh tokens
  Region 3: volatile scratch (empty)      → zero cost

CostMode escalation (Fast → Smart → Max):
  Fast:    500K budget, compact tool results, 5 turns
  Smart:   2M budget, keep tool results, 20 turns  ← default
  Max:     8M budget, full context, 100 turns

Cache warming (T1.4 PR3b, merged):
  Pre-warm STABLE_SYSTEM_PREFIX on daemon startup
  Re-warm every N hours (configurable)
  ~3,500 tokens per warm cycle → pays off in ~7 agent tasks
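The CostMode tiers listed above can be sketched as a small enum. This is a hypothetical mirror built only from the numbers in the listing: it assumes "budget" means a token budget, and the real types in `cost.rs` may look different.

```rust
/// Hypothetical mirror of the three tiers; not the actual `cost.rs` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CostMode {
    Fast,
    Smart,
    Max,
}

/// Limits per tier. Assumption: the budget is counted in tokens.
struct Limits {
    token_budget: u64,
    max_turns: u32,
}

impl CostMode {
    fn limits(self) -> Limits {
        match self {
            CostMode::Fast => Limits { token_budget: 500_000, max_turns: 5 },
            CostMode::Smart => Limits { token_budget: 2_000_000, max_turns: 20 },
            CostMode::Max => Limits { token_budget: 8_000_000, max_turns: 100 },
        }
    }

    /// Next tier up (Fast -> Smart -> Max), or None when already at Max.
    fn escalate(self) -> Option<CostMode> {
        match self {
            CostMode::Fast => Some(CostMode::Smart),
            CostMode::Smart => Some(CostMode::Max),
            CostMode::Max => None,
        }
    }
}

impl Default for CostMode {
    /// Smart is marked as the default tier in the listing.
    fn default() -> Self {
        CostMode::Smart
    }
}

/// Escalate from `start` until a tier fits both requirements, or give up.
fn mode_for(start: CostMode, tokens_needed: u64, turns_needed: u32) -> Option<CostMode> {
    let mut mode = start;
    loop {
        let limits = mode.limits();
        if tokens_needed <= limits.token_budget && turns_needed <= limits.max_turns {
            return Some(mode);
        }
        mode = mode.escalate()?;
    }
}
```

Escalating one tier at a time keeps the cheapest fitting tier as the answer; a request that exceeds even Max returns `None` rather than silently overspending.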

What We Still Need (Trifecta Dashboard)

The video's core message: observability isn't optional for production agents. Colibri already captures the raw data. What's missing is the trifecta view:

Per-task cost tracking

task_id: "abc123"
model: "deepseek-v4-flash"
tokens_in: 45,230   (12,100 cache-hit, 33,130 fresh)
tokens_out: 2,847
cost: $0.047         (cache savings: $0.012)
latency: 8.3s
success: true
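One possible shape for such a record, as a sketch only: the struct and field names below are hypothetical and simply mirror the example fields, with the cache-hit ratio derived rather than stored.

```rust
/// Hypothetical per-task record mirroring the example fields above.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct TaskRecord {
    task_id: String,
    model: String,
    cache_hit_tokens: u64,
    fresh_tokens: u64,
    tokens_out: u64,
    cost_usd: f64,
    latency_secs: f64,
    success: bool,
}

impl TaskRecord {
    /// Total input tokens: cache-hit plus fresh.
    fn tokens_in(&self) -> u64 {
        self.cache_hit_tokens + self.fresh_tokens
    }

    /// Share of input tokens served from cache, in [0, 1].
    fn cache_hit_ratio(&self) -> f64 {
        let total = self.tokens_in();
        if total == 0 {
            0.0
        } else {
            self.cache_hit_tokens as f64 / total as f64
        }
    }
}

/// The example record from the text, expressed as data.
fn example_record() -> TaskRecord {
    TaskRecord {
        task_id: "abc123".to_string(),
        model: "deepseek-v4-flash".to_string(),
        cache_hit_tokens: 12_100,
        fresh_tokens: 33_130,
        tokens_out: 2_847,
        cost_usd: 0.047,
        latency_secs: 8.3,
        success: true,
    }
}
```

For the example values, the derived cache-hit ratio is about 27% (12,100 of 45,230 input tokens).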

Trifecta balance sheet

Performance  ████████░░  82% task success (rolling 24h)
Speed        ██████░░░░  61% cache-hit ratio
Cost         ████████░░  $0.047 avg/task (target: <$0.05)
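A balance sheet like this could be computed from per-task outcomes roughly as follows. The types and function names are hypothetical, the rolling 24h window is left to the caller (pass in only the outcomes inside the window), and the mapping of dollar cost onto a bar is not specified in the text, so only ratio bars are rendered.

```rust
/// Minimal outcome of one task (hypothetical shape).
struct Outcome {
    success: bool,
    cache_hit_tokens: u64,
    input_tokens: u64,
    cost_usd: f64,
}

/// Aggregated view over a window of outcomes.
struct Trifecta {
    success_rate: f64,
    cache_hit_ratio: f64,
    avg_cost_usd: f64,
}

/// Aggregate a window of outcomes; returns None for an empty window.
fn summarize(outcomes: &[Outcome]) -> Option<Trifecta> {
    if outcomes.is_empty() {
        return None;
    }
    let n = outcomes.len() as f64;
    let successes = outcomes.iter().filter(|o| o.success).count() as f64;
    let hits: u64 = outcomes.iter().map(|o| o.cache_hit_tokens).sum();
    let inputs: u64 = outcomes.iter().map(|o| o.input_tokens).sum();
    let cost: f64 = outcomes.iter().map(|o| o.cost_usd).sum();
    Some(Trifecta {
        success_rate: successes / n,
        cache_hit_ratio: if inputs == 0 { 0.0 } else { hits as f64 / inputs as f64 },
        avg_cost_usd: cost / n,
    })
}

/// Render a ratio in [0, 1] as a ten-cell bar, rounding to the nearest cell.
fn bar(ratio: f64) -> String {
    let filled = (ratio.clamp(0.0, 1.0) * 10.0).round() as usize;
    "█".repeat(filled) + &"░".repeat(10 - filled)
}
```

The cache-hit ratio is weighted by tokens rather than averaged per task, so one large task with a cold cache is not hidden by many small warm ones.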

Model selection arbitrage

Given a task, Colibri should be able to answer:

Can this task be handled by a cheap model (DeepSeek V3, Gemini Flash)?
Is the cache-hit ratio high enough that the premium model is actually cheaper?
What's the cost delta between models for this specific task type?
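These three questions reduce to comparing expected cost per successful task across candidate models. The sketch below is one way to express that, not Colibri's routing logic: `ModelProfile`, its fields, and `cheapest` are hypothetical, every number comes from the caller, and output-token cost is ignored for brevity.

```rust
/// Hypothetical per-model profile for one task type.
struct ModelProfile {
    name: &'static str,
    /// Dollars per million fresh input tokens.
    fresh_price_per_mtok: f64,
    /// Dollars per million cache-hit input tokens.
    cache_hit_price_per_mtok: f64,
    /// Observed share of input tokens served from cache, 0..=1.
    cache_hit_ratio: f64,
    /// Observed task success rate, 0..=1.
    success_rate: f64,
}

impl ModelProfile {
    /// Expected input cost per successful task, for `input_tokens` per attempt.
    /// Assumes independent attempts, so expected attempts = 1 / success_rate.
    fn cost_per_success(&self, input_tokens: u64) -> Option<f64> {
        if !(self.success_rate > 0.0 && self.success_rate <= 1.0) {
            return None;
        }
        let hit = self.cache_hit_ratio.clamp(0.0, 1.0);
        let blended_price =
            hit * self.cache_hit_price_per_mtok + (1.0 - hit) * self.fresh_price_per_mtok;
        let per_attempt = input_tokens as f64 * blended_price / 1_000_000.0;
        Some(per_attempt / self.success_rate)
    }
}

/// Pick the model with the lowest expected cost per success, skipping
/// models whose success rate is unusable.
fn cheapest(models: &[ModelProfile], input_tokens: u64) -> Option<&ModelProfile> {
    models
        .iter()
        .filter_map(|m| m.cost_per_success(input_tokens).map(|cost| (m, cost)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(m, _)| m)
}
```

With invented numbers, a premium model at 80% cache hits and a 100% success rate can undercut a model with a five times lower list price that runs cold and succeeds only one time in four.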

Visual Specs (VSpecs) — Future Input Modality

The video introduces "VSpecs": plans with embedded images generated by GPT Image 2. Multimodal models (Gemini 3.5 Flash, GPT-5) read these images as "useful tokens" — a UI mockup is worth 1000 words of text description.

For Colibri: this means the prompt assembly pipeline should eventually support image tokens in Region 2 (conversation log). NOT for T1.4 — this is T2.x territory. But the cost model should be ready for mixed text+image token budgets.

Golden Rules (from the video, adapted for Colibri)

Measure everything. Every tool call, every token, every dollar. Colibri's glasspane architecture already captures the event stream; the trifecta dashboard makes it actionable.
Arbitrage cache vs spend. The stable prefix is close to free money: once warm, it costs ~10% of fresh tokens. Maximize its size, minimize its churn.
Cost per intelligence, not per token. Compare cost-per-successful-task, not raw model prices in isolation. A $0.05 task that works beats a $0.01 task that fails, because the failed task buys nothing and has to be rerun.
Trade-offs are engineering. There is no "best" model. There is only the right model for THIS task, under THESE constraints.
Closed loop: measure → analyze → improve. The trifecta dashboard isn't a report — it's a feedback loop. Every task feeds back into model selection, prompt design, and cache strategy.

Integration with Existing Work

| Colibri component | Trifecta role | Status |
| --- | --- | --- |
| `colibri-deepseek` | Cache probe, hit metering | ✅ done |
| `colibri-daemon/cost.rs` | CostMode, budget enforcement | ✅ done |
| `colibri-daemon/session.rs` | 3-region prompt, compaction | ✅ done |
| Cache warming (T1.4 PR3b) | Pre-warm stable prefix | ✅ done |
| Prompt discipline (T1.4) | Byte-stable assembly, cost-aware trim | 🔧 WIP |
| Trifecta dashboard (T1.5) | Per-task cost/speed/perf metrics | 📋 plan |
| Eval harness (T1.6) | Task success measurement | 📋 plan |
| Model selection (T2.x) | Arbitrage engine, cost-aware routing | 📋 plan |
| VSpec support (T2.x) | Image tokens in prompt assembly | 📋 plan |

Reference

Video: "Agent Specs: The Unreasonable Effectiveness of Useful Tokens" https://www.youtube.com/watch?v=o4KZH_KSqYQ
Colibri T1.4 Prompt Discipline: docs/T1.4-PROMPT-DISCIPLINE-PLAN.md
Colibri Glasspane Design: docs/COLIBRI-GLASSPANE-DESIGN.md
