# Cost model

← [index](./index.md)

## What this is

Colibri tracks every token that passes through an agent session and meters cost
against a configurable budget. The key insight: **cache-hit tokens cost about
one-tenth as much** as fresh tokens on DeepSeek, so the prompt prefix is
engineered to be byte-stable across requests, maximizing cache hits. Three cost
modes (fast, smart, max) represent different points on the speed/cost trade-off,
and the daemon auto-escalates when a session exhausts the budget of a cheaper
mode.

## Decisions

### Byte-stable prompt prefix → cache-hit metering

The system prompt and early context blocks are **byte-for-byte identical**
across consecutive requests to the same DeepSeek endpoint. DeepSeek's cache-hit
pricing discounts these tokens by ~90%. Colibri's `colibri-deepseek` probe
determines the exact token-count split between cached and fresh tokens per
request, and the cost tracker records both so the session budget reflects the
**actual** discounted cost, not the nominal token count.

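The byte-stability property can be checked by comparing two rendered prompts and measuring how many leading bytes they share. The sketch below is illustrative only: `render` and `common_prefix_len` are hypothetical helpers, not functions from the Colibri crates, and the newline-joined layout is an assumption.

```rust
/// Render a prompt with the stable blocks first and the volatile blocks last.
/// (Hypothetical layout: one block per line.)
pub fn render(stable: &[&str], volatile: &[&str]) -> String {
    let mut out = String::new();
    for block in stable.iter().chain(volatile.iter()) {
        out.push_str(block);
        out.push('\n');
    }
    out
}

/// Length in bytes of the longest prefix shared by two rendered prompts.
/// Only this shared prefix can be served from a prefix cache.
pub fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes()
        .zip(b.bytes())
        .take_while(|(x, y)| x == y)
        .count()
}
```

If anything that changes per request (a timestamp, a request id) is placed in the stable region, the shared prefix collapses to the bytes before that value, which is why such content belongs in the volatile suffix.
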
**Why not just count tokens**: token counting with an offline tokenizer gives
you an upper bound but not the real cost. DeepSeek's API does not guarantee a
cache hit for a repeated prefix; the probe measures what actually happened. The
discount is too large (10×) to leave unmeasured.

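To make the gap between the nominal upper bound and the billed cost concrete, here is a minimal sketch. The `PromptUsage` type, its field names, and the flat 0.1 price factor are assumptions for illustration (the factor follows the ~90% discount stated above); this is not the code in `colibri-deepseek`.

```rust
/// Assumed price factor: a cache-hit token is billed at one tenth of a fresh token.
pub const CACHE_HIT_PRICE_FACTOR: f64 = 0.1;

/// Prompt-token split for one request (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PromptUsage {
    pub cache_hit_tokens: u64,
    pub cache_miss_tokens: u64,
}

impl PromptUsage {
    /// Upper bound: every prompt token billed at the fresh price.
    pub fn nominal_cost(&self, fresh_price_per_token: f64) -> f64 {
        (self.cache_hit_tokens + self.cache_miss_tokens) as f64 * fresh_price_per_token
    }

    /// Billed cost: cache hits are discounted, cache misses are not.
    pub fn discounted_cost(&self, fresh_price_per_token: f64) -> f64 {
        let billable = self.cache_hit_tokens as f64 * CACHE_HIT_PRICE_FACTOR
            + self.cache_miss_tokens as f64;
        billable * fresh_price_per_token
    }
}
```

Under these assumptions, a request with 9,000 cached and 1,000 fresh prompt tokens is billed like 1,900 fresh tokens rather than 10,000, so a budget based on the nominal count would overstate spend by more than 5×.
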
→ [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md),
[`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md),
[`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs)

### Three cost modes (fast → smart → max)

| Mode  | Budget (tokens) | Behavior                                                                   |
| ----- | --------------- | -------------------------------------------------------------------------- |
| Fast  | 16K             | Maximum cache hits, minimum fresh tokens. Rejects large expansions early.  |
| Smart | 64K             | Default. Balances cache reuse with room for follow-up turns.               |
| Max   | 256K            | Almost never hits budget. For one-shot deep tasks where cost is secondary. |

The daemon **auto-escalates** when a session exhausts its budget in a lower
mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).

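The mode table and the one-way escalation chain can be sketched as a small enum. This is an illustration under stated assumptions, not the code in `cost.rs`: the names `CostMode`, `budget_tokens`, `escalate`, and `mode_for` are hypothetical, and "16K" is read as 16,000 tokens (it could equally mean 16,384).

```rust
/// The three cost modes, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CostMode {
    Fast,
    Smart,
    Max,
}

impl Default for CostMode {
    /// Smart is the default mode per the table.
    fn default() -> Self {
        CostMode::Smart
    }
}

impl CostMode {
    /// Token budget per mode (assumption: K = 1,000).
    pub fn budget_tokens(self) -> u64 {
        match self {
            CostMode::Fast => 16_000,
            CostMode::Smart => 64_000,
            CostMode::Max => 256_000,
        }
    }

    /// Next mode up the chain; `None` once already at Max.
    pub fn escalate(self) -> Option<CostMode> {
        match self {
            CostMode::Fast => Some(CostMode::Smart),
            CostMode::Smart => Some(CostMode::Max),
            CostMode::Max => None,
        }
    }
}

/// Step up until the budget covers `used_tokens`, or Max is reached.
/// The result is never lower than the starting mode.
pub fn mode_for(mut mode: CostMode, used_tokens: u64) -> CostMode {
    while used_tokens >= mode.budget_tokens() {
        match mode.escalate() {
            Some(next) => mode = next,
            None => break,
        }
    }
    mode
}
```

Because `mode_for` only ever calls `escalate`, a session that has reached `Max` stays there even if its usage later looks small, which matches the one-way rule.
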
**Why three modes, not a continuous slider**: simplicity wins here. Three
well-understood points cover the space; operators pick by risk appetite, not
by fine-tuning a number. The escalation chain means "start cheap, pay more only
if it works."

→ [`COLIBRI-TOKENOMICS-TRIFECTA.md`](../COLIBRI-TOKENOMICS-TRIFECTA.md),
[`crates/colibri-daemon/src/cost.rs`](../../crates/colibri-daemon/src/cost.rs)

### T14 compaction (budget trim, not truncate)

When a session is about to exceed its budget, Colibri compacts the tool results
in the volatile region: it sends them through the headroom sidecar for
summarization, then trims the oldest volatile blocks until the prompt fits
within budget. The **prefix** (system prompt, static context) is never trimmed;
only the volatile suffix is.

If compaction is insufficient and auto-escalation is enabled, the mode steps up
before truncating.

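The trim step described above can be sketched as follows. It assumes summarization by the sidecar has already happened and models only the "drop oldest volatile blocks until it fits" part. `Block` and `trim_to_budget` are hypothetical names, not the types in `session.rs`.

```rust
use std::collections::VecDeque;

/// One volatile block (for example a summarized tool result) and its token count.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Block {
    pub tokens: u64,
}

/// Drop the oldest volatile blocks until prefix + volatile fits within `budget`.
/// The prefix is never touched. Returns `false` when even an empty volatile
/// region does not fit, which is the point where a caller could escalate.
pub fn trim_to_budget(prefix_tokens: u64, volatile: &mut VecDeque<Block>, budget: u64) -> bool {
    let mut total = prefix_tokens + volatile.iter().map(|b| b.tokens).sum::<u64>();
    while total > budget {
        match volatile.pop_front() {
            // The front of the queue holds the oldest block.
            Some(oldest) => total -= oldest.tokens,
            // Nothing left to trim: the prefix alone exceeds the budget.
            None => return false,
        }
    }
    true
}
```

A `false` return is where the escalation rule would apply: the caller can step the mode up and retry against the larger budget instead of cutting into the prefix.
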
**Why not just truncate**: truncating mid-conversation loses context the agent
needs to continue. Compaction preserves the semantic content at lower token cost.
The headroom sidecar is optional (off by default); without it, the fallback is
simple truncation.

→ [`HEADROOM-SIDECAR.md`](../HEADROOM-SIDECAR.md),
[`crates/colibri-daemon/src/session.rs`](../../crates/colibri-daemon/src/session.rs)

### Cache-hit probe (DeepSeek-specific)

The `colibri-deepseek` crate sends a preflight request with a known prompt to
the DeepSeek API and parses the `usage` object in the response body to determine
the cache-hit split (`prompt_cache_hit_tokens` / `prompt_cache_miss_tokens`).
This is provider-specific: the two field names are DeepSeek's own, so the probe
does not carry over to other providers unchanged. The probe runs once per
session configuration change, not per request.

**Why a probe and not a hook**: middleware that intercepts every API response
would couple cost tracking to the HTTP layer. A probe decouples it: the cost
tracker asks "what was the cache ratio?" and the probe answers, independently of
how the request was made.

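The decoupling argument can be illustrated with a trait boundary. This is a sketch under assumptions, not the API of `colibri-deepseek`: `CacheProbe`, `CacheSplit`, `CostTracker`, and `FixedProbe` are hypothetical names chosen for the example.

```rust
/// Cache-hit split as reported by a probe (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheSplit {
    pub hit_tokens: u64,
    pub miss_tokens: u64,
}

impl CacheSplit {
    /// Fraction of prompt tokens served from cache; `None` for an empty prompt.
    pub fn hit_ratio(&self) -> Option<f64> {
        let total = self.hit_tokens + self.miss_tokens;
        if total == 0 {
            None
        } else {
            Some(self.hit_tokens as f64 / total as f64)
        }
    }
}

/// The only thing the cost tracker knows about: something that can report a split.
pub trait CacheProbe {
    fn last_split(&self) -> Option<CacheSplit>;
}

/// Cost tracker that depends on the trait, not on any HTTP client.
pub struct CostTracker<P: CacheProbe> {
    probe: P,
}

impl<P: CacheProbe> CostTracker<P> {
    pub fn new(probe: P) -> Self {
        Self { probe }
    }

    /// "What was the cache ratio?" answered by whatever probe is plugged in.
    pub fn cache_ratio(&self) -> Option<f64> {
        self.probe.last_split().and_then(|split| split.hit_ratio())
    }
}

/// Stand-in probe returning a fixed result, useful in tests.
pub struct FixedProbe(pub Option<CacheSplit>);

impl CacheProbe for FixedProbe {
    fn last_split(&self) -> Option<CacheSplit> {
        self.0
    }
}
```

With this shape, the tracker can be tested against `FixedProbe` without any network access, and a real probe can change how it obtains the split without touching the tracker.
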
→ [`crates/colibri-deepseek/src/lib.rs`](../../crates/colibri-deepseek/src/lib.rs)

## See also

- [task-board](./task-board.md): the scheduler that dispatches tasks within session budgets
- [mother-hive](./mother-hive.md): MCP architecture (different cost domain)
- [quality-gates](./quality-gates.md): the gate that validates cost-mode parsing