colibri/docs/wiki/cost-model.md

# Cost model

← [index](./index.md)

## What this is

Colibri tracks every token that passes through an agent session and meters cost
against a configurable budget. The key insight: **cache-hit tokens cost roughly
one-tenth as much** as fresh tokens on DeepSeek, so the prompt prefix is
engineered to be byte-stable across requests, maximizing cache hits. Three cost
modes (fast, smart, max) represent different points on the speed/cost
trade-off, and the daemon auto-escalates when a session exhausts its budget in
a cheaper mode.

## Decisions

### Byte-stable prompt prefix → cache-hit metering

The system prompt and early context blocks are **byte-for-byte identical**
across consecutive requests to the same DeepSeek endpoint. DeepSeek's cache-hit
pricing discounts these by ~90%. Colibri's `colibri-deepseek` probe determines
the exact token-count split between cached and fresh tokens per request, and
the cost tracker records both so the session budget reflects the **actual**
discounted cost, not the nominal token count.
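One way to guard the byte-stability property is to compare the prefix bytes of consecutive prompts before sending. The sketch below is illustrative only: `PromptParts`, `prefix_is_stable`, and `prefix_fingerprint` are hypothetical names, not taken from the Colibri crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A prompt split into the cacheable prefix and the per-turn remainder.
/// Hypothetical type for illustration.
pub struct PromptParts {
    /// System prompt plus static context; must not change between requests.
    pub prefix: String,
    /// Volatile content (tool results, recent turns); free to change.
    pub volatile: String,
}

/// True when both prompts share a byte-for-byte identical prefix.
pub fn prefix_is_stable(previous: &PromptParts, current: &PromptParts) -> bool {
    previous.prefix.as_bytes() == current.prefix.as_bytes()
}

/// Short fingerprint of the prefix bytes, handy for logging drift.
/// Equal prefixes always produce equal fingerprints.
pub fn prefix_fingerprint(parts: &PromptParts) -> u64 {
    let mut hasher = DefaultHasher::new();
    parts.prefix.as_bytes().hash(&mut hasher);
    hasher.finish()
}
```

Anything that varies per request (a timestamp, a request ID) belongs in the volatile part; placed in the prefix, it would make `prefix_is_stable` return `false` on every request.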

**Why not just count tokens**: token counting with an offline tokenizer gives
you an upper bound but not the real cost. DeepSeek does not serve every repeated
prefix from cache, so the probe measures what actually happened. The discount is
too large (10×) to leave unmeasured.
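The gap between the nominal upper bound and the discounted cost can be sketched as follows. This is a minimal illustration, assuming a flat 10× discount and an arbitrary integer price per fresh token; `Usage`, `nominal_cost`, and `effective_cost` are hypothetical names, and real prices are not modeled.

```rust
/// Token usage for one request, split into cached and fresh tokens.
/// Hypothetical type for illustration.
#[derive(Debug, Clone, Copy)]
pub struct Usage {
    pub cache_hit_tokens: u64,
    pub cache_miss_tokens: u64,
}

/// Assumption: a cache-hit token costs one tenth of a fresh token.
pub const CACHE_HIT_DIVISOR: u64 = 10;

/// Upper bound: every prompt token billed at the fresh-token price.
pub fn nominal_cost(usage: Usage, price_per_fresh_token: u64) -> u64 {
    (usage.cache_hit_tokens + usage.cache_miss_tokens) * price_per_fresh_token
}

/// Discounted cost: fresh tokens at full price, cached tokens at one tenth.
pub fn effective_cost(usage: Usage, price_per_fresh_token: u64) -> u64 {
    let fresh = usage.cache_miss_tokens * price_per_fresh_token;
    let cached = usage.cache_hit_tokens * price_per_fresh_token / CACHE_HIT_DIVISOR;
    fresh + cached
}
```

With 9,000 cached and 1,000 fresh tokens at 10 price units per fresh token, the nominal cost is 100,000 units while the effective cost is 19,000 units. A budget metered on the nominal figure would overcharge that request by more than 5×.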

→ [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md),
[`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md),
[`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs)

### Three cost modes (fast → smart → max)

| Mode  | Budget (tokens) | Behavior                                                                   |
| ----- | --------------- | -------------------------------------------------------------------------- |
| Fast  | 16K             | Maximum cache-hits, minimum fresh tokens. Rejects large expansions early.  |
| Smart | 64K             | Default. Balances cache reuse with room for follow-up turns.               |
| Max   | 256K            | Almost never hits budget. For one-shot deep tasks where cost is secondary. |

The daemon **auto-escalates** when a session exhausts its budget in a lower
mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).
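The table and the escalation rule above can be sketched as a small enum. This is not the code in `cost.rs`: the names are hypothetical, "K" is assumed to mean 1,000 tokens, and "exhausts" is assumed to mean usage strictly greater than the budget.

```rust
/// The three cost modes, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CostMode {
    Fast,
    Smart,
    Max,
}

impl CostMode {
    /// Token budget per mode (assumes 1K = 1,000 tokens).
    pub fn budget_tokens(self) -> u64 {
        match self {
            Self::Fast => 16_000,
            Self::Smart => 64_000,
            Self::Max => 256_000,
        }
    }

    /// Next mode in the one-way chain, or `None` at the top.
    pub fn escalate(self) -> Option<CostMode> {
        match self {
            Self::Fast => Some(Self::Smart),
            Self::Smart => Some(Self::Max),
            Self::Max => None,
        }
    }
}

/// Steps the mode up until the usage fits or the chain ends.
/// Never returns a mode cheaper than `start`.
pub fn mode_for_usage(start: CostMode, used_tokens: u64) -> CostMode {
    let mut mode = start;
    while used_tokens > mode.budget_tokens() {
        match mode.escalate() {
            Some(next) => mode = next,
            None => break, // already at Max; nothing left to escalate to
        }
    }
    mode
}
```

Because `escalate` has no inverse, a session that starts in smart mode stays at smart or above even when its usage would fit in fast.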

**Why three modes, not a continuous slider**: simplicity wins here. Three
well-understood points cover the space; operators pick by risk appetite, not
by fine-tuning a number. The escalation chain means "start cheap, pay more only
when the cheaper mode runs out."

→ [`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md),
[`crates/colibri-daemon/src/cost.rs`](../../crates/colibri-daemon/src/cost.rs)

### T14 compaction (budget trim, not truncate)

When a session is about to exceed its budget, Colibri compacts the tool results
in the volatile region — it sends them through the headroom sidecar for
summarization, then trims the oldest volatile blocks until the prompt fits
within budget. The **prefix** (system prompt, static context) is never trimmed
— only the volatile suffix.

If compaction is insufficient and auto-escalation is enabled, the mode steps up
before truncating.
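The trimming step can be sketched as below, leaving out the sidecar summarization and the escalation check. `VolatileBlock`, `SessionPrompt`, and `trim_to_budget` are hypothetical names chosen for this illustration, not the types in `session.rs`.

```rust
use std::collections::VecDeque;

/// One block in the volatile region (for example a tool result).
pub struct VolatileBlock {
    pub tokens: u64,
    pub text: String,
}

/// A prompt with a fixed prefix and a queue of volatile blocks.
pub struct SessionPrompt {
    /// Token count of the system prompt and static context; never trimmed.
    pub prefix_tokens: u64,
    /// Volatile blocks, oldest at the front.
    pub volatile: VecDeque<VolatileBlock>,
}

impl SessionPrompt {
    pub fn total_tokens(&self) -> u64 {
        self.prefix_tokens + self.volatile.iter().map(|b| b.tokens).sum::<u64>()
    }
}

/// Drops the oldest volatile blocks until the prompt fits the budget.
/// Returns the number of blocks dropped. The prefix is left untouched,
/// so the prompt may still exceed the budget if the prefix alone is too big.
pub fn trim_to_budget(prompt: &mut SessionPrompt, budget: u64) -> usize {
    let mut dropped = 0;
    while prompt.total_tokens() > budget {
        if prompt.volatile.pop_front().is_none() {
            break; // only the prefix remains
        }
        dropped += 1;
    }
    dropped
}
```

A caller would check `total_tokens()` after trimming; if the prompt is still over budget, that is the point where escalation (when enabled) would apply.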

**Why not just truncate**: truncating mid-conversation loses context the agent
needs to continue. Compaction preserves the semantic content at lower token cost.
The headroom sidecar is optional (off by default); without it, the fallback is
simple truncation.

→ [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md),
[`crates/colibri-daemon/src/session.rs`](../../crates/colibri-daemon/src/session.rs)

### Cache-hit probe (DeepSeek-specific)

The `colibri-deepseek` crate sends a preflight request with a known prompt to
the DeepSeek API and parses the `usage` object in the response to determine the
cache-hit split (`prompt_cache_hit_tokens` / `prompt_cache_miss_tokens`). This
is provider-specific: the probe depends on the hit/miss counts that DeepSeek
reports. The probe runs once per session configuration change, not per request.

**Why a probe and not a hook**: middleware that intercepts every API response
would couple cost tracking to the HTTP layer. A probe decouples it — the cost
tracker asks "what was the cache ratio?" and the probe answers, independently of
how the request was made.
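The decoupling can be sketched with a trait: the tracker depends only on "something that reports a cache split", not on the HTTP layer. All names here (`CacheProbe`, `CacheSplit`, `FixedProbe`, `CostTracker`) are hypothetical, and cost is expressed in units where a cache-hit token costs 1 and a fresh token costs 10.

```rust
/// Cache-hit split observed by a probe.
#[derive(Debug, Clone, Copy)]
pub struct CacheSplit {
    pub hit_tokens: u64,
    pub miss_tokens: u64,
}

/// Anything that can answer "what was the cache ratio?".
pub trait CacheProbe {
    fn cache_split(&self) -> CacheSplit;
}

/// Stand-in probe that returns a fixed split; a real probe would
/// derive the split from the provider's reported usage.
pub struct FixedProbe(pub CacheSplit);

impl CacheProbe for FixedProbe {
    fn cache_split(&self) -> CacheSplit {
        self.0
    }
}

/// Cost tracker that knows nothing about how requests are sent.
pub struct CostTracker<P: CacheProbe> {
    probe: P,
    spent_units: u64,
}

impl<P: CacheProbe> CostTracker<P> {
    pub fn new(probe: P) -> Self {
        Self { probe, spent_units: 0 }
    }

    /// Applies the probed hit ratio to `prompt_tokens` and returns the
    /// cost of this request in units (hit = 1, fresh = 10).
    pub fn record(&mut self, prompt_tokens: u64) -> u64 {
        let split = self.probe.cache_split();
        let probed_total = split.hit_tokens + split.miss_tokens;
        // With no probe data, treat every token as fresh.
        let hit = if probed_total == 0 {
            0
        } else {
            prompt_tokens * split.hit_tokens / probed_total
        };
        let miss = prompt_tokens - hit;
        let cost = miss * 10 + hit;
        self.spent_units += cost;
        cost
    }

    pub fn spent_units(&self) -> u64 {
        self.spent_units
    }
}
```

Swapping `FixedProbe` for a provider-backed implementation changes nothing in `CostTracker`, which is the property the probe design is meant to preserve.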

→ [`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs)

## See also

- [task-board](./task-board.md) — the scheduler that dispatches tasks within session budgets
- [mother-hive](./mother-hive.md) — MCP architecture (different cost domain)
- [quality-gates](./quality-gates.md) — the gate that validates cost-mode parsing