# Cost model ← [index](./index.md) ## What this is Colibri tracks every token that passes through an agent session and meters cost against a configurable budget. The key insight: **cache-hit tokens cost 10× less** than fresh tokens on DeepSeek — so the prompt prefix is engineered to be byte-stable across requests, maximizing cache hits. Three cost modes (fast, smart, max) represent different points on the speed/cost trade-off, and the model auto-escalates when a cheaper mode can't keep up. ## Decisions ### Byte-stable prompt prefix → cache-hit metering The system prompt and early context blocks are **byte-for-byte identical** across consecutive requests to the same DeepSeek endpoint. DeepSeek's cache-hit pricing discounts these by ~90%. Colibri's `colibri-deepseek` probe determines the exact token-count split between cached and fresh tokens per request, and the cost tracker records both so the session budget reflects the **actual** discounted cost, not the nominal token count. **Why not just count tokens**: token counting with an offline tokenizer gives you an upper bound but not the real cost. DeepSeek's API sometimes re-caches and sometimes doesn't — the probe measures what actually happened. The discount is too large (10×) to leave unmeasured. → [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md), [`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md), [`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs) ### Three cost modes (fast → smart → max) | Mode | Budget (tokens) | Behavior | | ----- | --------------- | -------------------------------------------------------------------------- | | Fast | 16K | Maximum cache-hits, minimum fresh tokens. Rejects large expansions early. | | Smart | 64K | Default. Balances cache reuse with room for follow-up turns. | | Max | 256K | Almost never hits budget. For one-shot deep tasks where cost is secondary. | The daemon **auto-escalates** when a session exhausts its budget in a lower mode: fast → smart → max. Escalation is one-way (never downgrades mid-session). **Why three modes, not a continuous slider**: simplicity wins here. Three well-understood points cover the space — operators pick by risk appetite, not by fine-tuning a number. The escalation chain means "start cheap, pay more only if it works." → [`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md), [`crates/colibri-daemon/src/cost.rs`](../../crates/colibri-daemon/src/cost.rs) ### T14 compaction (budget trim, not truncate) When a session is about to exceed its budget, Colibri compacts the tool results in the volatile region — it sends them through the headroom sidecar for summarization, then trims the oldest volatile blocks until the prompt fits within budget. The **prefix** (system prompt, static context) is never trimmed — only the volatile suffix. If compaction is insufficient and auto-escalation is enabled, the mode steps up before truncating. **Why not just truncate**: truncating mid-conversation loses context the agent needs to continue. Compaction preserves the semantic content at lower token cost. The headroom sidecar is optional (off by default); without it, the fallback is simple truncation. → [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md), [`crates/colibri-daemon/src/session.rs`](../../crates/colibri-daemon/src/session.rs) ### Cache-hit probe (DeepSeek-specific) The `colibri-deepseek` crate sends a preflight request with a known prompt to the DeepSeek API and parses the response headers to determine the cache-hit split (prompt_cache_hit_tokens / prompt_cache_miss_tokens). This is provider-specific — DeepSeek is the only provider that exposes this granularity. The probe runs once per session configuration change, not per request. **Why a probe and not a hook**: middleware that intercepts every API response would couple cost tracking to the HTTP layer. A probe decouples it — the cost tracker asks "what was the cache ratio?" and the probe answers, independently of how the request was made. → [`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs) ## See also - [task-board](./task-board.md) — the scheduler that dispatches tasks within session budgets - [mother-hive](./mother-hive.md) — MCP architecture (different cost domain) - [quality-gates](./quality-gates.md) — the gate that validates cost-mode parsing