perf(anthropic): fix cost double-count, tighten caching, correct catalog

The status bar was showing about 2x the real cost. Anthropic's SSE
stream sends cumulative usage totals on both message_start AND
message_delta, and our code was adding each payload with +=. Cache
tokens, the biggest cost component on multi-turn sessions, were
therefore counted twice on every single API call.

Fix: assign instead of accumulate within one Stream() invocation.
Cross-call accumulation still happens correctly in
core.CostTracker.Add(). Verified end-to-end: a truly fresh "read
sample.ts on desktop" session that used to report $0.15 now reports
$0.07 with the same cache-hit rate.
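
A minimal sketch of the difference, assuming a simplified usage struct.
The names usage, sumUsage and latestUsage are illustrative only and are
not the identifiers used in internal/provider/anthropic.go:

```go
package main

import "fmt"

// usage is a simplified stand-in for the per-stream token counters.
type usage struct {
	Input, Output, CacheRead, CacheWrite int
}

// sumUsage reproduces the old behaviour: every event is added with +=.
func sumUsage(events []usage) usage {
	var u usage
	for _, e := range events {
		u.Input += e.Input
		u.Output += e.Output
		u.CacheRead += e.CacheRead
		u.CacheWrite += e.CacheWrite
	}
	return u
}

// latestUsage keeps the most recent non-zero value of each field, so a
// later event that only carries output tokens leaves the rest untouched.
func latestUsage(events []usage) usage {
	var u usage
	for _, e := range events {
		if e.Input > 0 {
			u.Input = e.Input
		}
		if e.Output > 0 {
			u.Output = e.Output
		}
		if e.CacheRead > 0 {
			u.CacheRead = e.CacheRead
		}
		if e.CacheWrite > 0 {
			u.CacheWrite = e.CacheWrite
		}
	}
	return u
}

func main() {
	// Hypothetical message_start and message_delta payloads for one call.
	events := []usage{
		{Input: 12, Output: 1, CacheRead: 4000, CacheWrite: 900},
		{Input: 12, Output: 250, CacheRead: 4000, CacheWrite: 900},
	}
	fmt.Println(sumUsage(events).CacheWrite)    // 1800
	fmt.Println(latestUsage(events).CacheWrite) // 900
}
```

The token counts are made up for illustration; only the 2x ratio
between the two results matters.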

While chasing that, audited and corrected the rest of the request
pipeline so the cache actually hits cleanly.

Provider layer (internal/provider/anthropic.go):
  - cache_control on the Claude Code identity line (was uncached),
    giving Anthropic a first stable checkpoint independent of the
    user system prompt. Turns a cold start from R=0 (no cache reads)
    into R>0 for any subsequent fresh session within the cache TTL.
  - tool_result blocks go in their OWN new user message instead of
    merging into the preceding user message. Merging was mutating
    the prior user message's content array between turns, busting
    byte-identical prefix match in Anthropic's cache.
  - tagLastUserCache: cache_control on only the most recent user
    message (was the last two user messages), so identity +
    sysprompt + last-tool + last-user fits Anthropic's 4-breakpoint
    budget exactly.
  - user-agent dropped its "(external, cli)" suffix to match the
    canonical Claude Code string exactly.
  - ZOT_DEBUG_ANTHROPIC=<path> env hook appends each outgoing
    request body (one JSON object per line) to that file. Off by
    default; for debugging cache / cost issues in the field.
  - Usage field handling now correctly assigns the latest value
    from each SSE event instead of summing.
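
Why merging tool_result blocks breaks the prefix can be sketched as
follows. This is a simplified model, not the real buildRequest: the msg
type, convert and stablePrefix are hypothetical, and the history mirrors
the "read sample.ts" example from the code comment, where a tool result
directly follows a user message:

```go
package main

import (
	"fmt"
	"reflect"
)

// msg is a simplified message: a role plus a list of content blocks.
type msg struct {
	Role   string // "user", "assistant", or "tool"
	Blocks []string
}

// convert maps internal roles onto the two wire roles. With merge=true a
// tool result is folded into a directly preceding user message (the old
// behaviour); with merge=false it always becomes its own user message.
func convert(history []msg, merge bool) []msg {
	var out []msg
	for _, m := range history {
		role := m.Role
		if role == "tool" {
			role = "user"
			if n := len(out); merge && n > 0 && out[n-1].Role == "user" {
				out[n-1].Blocks = append(out[n-1].Blocks, m.Blocks...)
				continue
			}
		}
		out = append(out, msg{Role: role, Blocks: append([]string(nil), m.Blocks...)})
	}
	return out
}

// stablePrefix reports whether every message in prev reappears unchanged
// at the same position in next.
func stablePrefix(prev, next []msg) bool {
	if len(prev) > len(next) {
		return false
	}
	for i := range prev {
		if !reflect.DeepEqual(prev[i], next[i]) {
			return false
		}
	}
	return true
}

func main() {
	turnN := []msg{{Role: "user", Blocks: []string{"read sample.ts"}}}
	turnN1 := []msg{
		{Role: "user", Blocks: []string{"read sample.ts"}},
		{Role: "tool", Blocks: []string{"tool_result=..."}},
	}
	fmt.Println(stablePrefix(convert(turnN, true), convert(turnN1, true)))   // false
	fmt.Println(stablePrefix(convert(turnN, false), convert(turnN1, false))) // true
}
```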

Core (internal/core/tool.go):
  - Registry.Specs() now sorts tools alphabetically. Go map
    iteration order is randomized per call; randomized tool arrays
    were breaking Anthropic's byte-level prefix match on every
    single call within a session.
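
The ordering fix follows the usual Go pattern of collecting the map
keys and sorting them before iterating. A small sketch, assuming a
registry keyed by tool name (the sortedNames helper and the map's value
type are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedNames returns the keys of a name-keyed registry in a stable,
// alphabetical order, independent of Go's map iteration order.
func sortedNames(reg map[string]string) []string {
	names := make([]string, 0, len(reg))
	for name := range reg {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := map[string]string{
		"write": "write a file",
		"bash":  "run a shell command",
		"read":  "read a file",
		"edit":  "edit a file",
	}
	fmt.Println(sortedNames(reg)) // [bash edit read write]
}
```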

System prompt (internal/agent/systemprompt.go):
  - Restored a substantial default prompt with structured tools +
    operating guidelines sections. The earlier aggressive trim
    dropped us under Anthropic's 1024-token minimum cacheable
    prefix: prefixes below 1024 tokens are silently NOT cached by
    Anthropic, so every fresh session started cold with R=0 no
    matter what else we did.
  - Current default is ~1040 tokens on its own; with identity and
    tools it's ~1400, comfortably above the 1024 floor.
  - --system-prompt, --append-system-prompt, and
    $ZOT_HOME/SYSTEM.md escape hatches all still work and take
    precedence.

Model catalog (internal/provider/models.go):
  - claude-opus-4-5: 1M ctx / 128k max -> 200k ctx / 64k max. I had
    over-extrapolated; 1M context is a 4.6+ feature.
  - gpt-5.4: 400k -> 272k. Canonical value on both the OpenAI
    direct API and the ChatGPT Codex OAuth backend.
  - gpt-5.1, gpt-5.2, gpt-5.3, gpt-5.4-mini: pinned to 272k.
    OpenAI advertises 400k on the direct API, while Codex caps at
    272k. zot serves both from one catalog row per id, so we pin to
    the smaller number to keep the context-usage meter honest under
    subscription auth. Direct-API users see a conservative estimate
    instead of an inflated one.
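
The pinning rule amounts to taking the minimum over the routes that
share a catalog row. A sketch of that rule, with a hypothetical
pinnedContext helper (the catalog itself just hard-codes the result):

```go
package main

import "fmt"

// pinnedContext returns the smallest context window among the routes
// that share one catalog row, or 0 when no routes are given.
func pinnedContext(routeWindows ...int) int {
	if len(routeWindows) == 0 {
		return 0
	}
	smallest := routeWindows[0]
	for _, w := range routeWindows[1:] {
		if w < smallest {
			smallest = w
		}
	}
	return smallest
}

func main() {
	// 400k on the direct API, 272k on the Codex backend (figures from above).
	fmt.Println(pinnedContext(400000, 272000)) // 272000
}
```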

README:
  - Tiny capitalization touch-up on the opening line.
patriceckhart 2026-04-19 18:57:18 +02:00
parent 05d0df91b8
commit f371687654
5 changed files with 177 additions and 69 deletions


@@ -2,7 +2,7 @@
 # zot
-yet another coding agent harness, lightweight and written (vibe-slopped) in go.
+Yet another coding agent harness, lightweight and written (vibe-slopped) in go.
 - one static binary.
 - two providers atm (anthropic, openai/codex).

@@ -6,10 +6,10 @@ import (
 "time"
 )
-// ToolSummary is a name+one-line description. Kept for backwards
-// compatibility with callers that still pass tool summaries in; the
-// built-in system prompt no longer lists tools by name, since the
-// tool schemas themselves already reach the model.
+// ToolSummary is a name+one-line description used when rendering the
+// "available tools" section of the system prompt. Passed from the
+// resolved registry so the prompt can list every tool the agent has,
+// including any extension-contributed ones.
 type ToolSummary struct {
 Name string
 Description string
@@ -26,21 +26,25 @@ type SystemPromptOpts struct {
 // BuildSystemPrompt constructs the system prompt.
 //
-// Design note: the prompt is intentionally tiny. Every byte here
-// is re-sent on every request (cached after the first, but still
-// counts toward cache-write on turn 1 and live context throughout).
-// We avoid:
+// Design note on size: the prompt is sized deliberately. Anthropic's
+// prompt cache has a 1024-token minimum for the cached prefix on
+// Opus-tier models; anything smaller is NOT cached. An under-1024
+// prompt loses every fresh-session turn to "R=0" because nothing
+// about the prefix persists across invocations. Ours sits comfortably
+// above the floor so the identity + tools + this template together
+// are cached Anthropic-side across every zot invocation that shares
+// the same tool set.
 //
-// - Listing the tool names and descriptions (the provider sends
-// the tool schemas separately; duplicating them costs tokens
-// for zero benefit, the model already sees the tools).
-// - Repeating generic coding-assistant advice the frontier models
-// already internalise ("always read before editing", "prefer
-// minimal diffs", "don't apologize"). These were free tokens
-// on older models; they are pure overhead now.
+// What stays in: the identity + the tools section + concrete
+// operating guidelines that bias behaviour in ways the raw tool
+// schemas don't capture (read-before-edit, exact-match uniqueness,
+// non-interactive shell, show-don't-tell summaries). What stays out:
+// trivia and disclaimers the model already internalises.
 //
-// Anything the user explicitly needs can still be added via
+// Anything the user needs beyond that can be added via
 // --system-prompt, --append-system-prompt, or $ZOT_HOME/SYSTEM.md.
+// A user-provided Custom fully replaces the default; --append-system-
+// prompt is additive.
 func BuildSystemPrompt(o SystemPromptOpts) string {
 if o.Now.IsZero() {
 o.Now = time.Now()
@@ -57,6 +61,10 @@ func BuildSystemPrompt(o SystemPromptOpts) string {
 sb.WriteString(o.Custom)
 } else {
 sb.WriteString(defaultIdentity)
+sb.WriteString("\n\n")
+sb.WriteString(renderToolsSection(o.Tools))
+sb.WriteString("\n")
+sb.WriteString(defaultGuidelines)
 }
 for _, a := range o.Append {
@@ -71,4 +79,39 @@ func BuildSystemPrompt(o SystemPromptOpts) string {
 return sb.String()
 }
-const defaultIdentity = `You are zot, a lightweight terminal coding agent. Be concise, act on the user's request directly, and reply with a short summary when done.`
+// renderToolsSection lists tool names + one-line descriptions. The
+// duplication against the provider's tools array is deliberate: a
+// natural-language mention of each tool name in the system prompt
+// improves reliability of tool invocation on first-turn requests,
+// and the extra tokens help cross Anthropic's 1024-token cache floor.
+func renderToolsSection(tools []ToolSummary) string {
+if len(tools) == 0 {
+return "No tools are available in this session. Reply in plain text."
+}
+var sb strings.Builder
+sb.WriteString("You have the following tools available:\n")
+for _, t := range tools {
+fmt.Fprintf(&sb, "- %s: %s\n", t.Name, t.Description)
+}
+return sb.String()
+}
+const defaultIdentity = `You are zot, a lightweight terminal coding agent. You help a developer by reading files, writing files, editing files, running shell commands, and calling any extension tools that are available in this session.
+You operate inside a terminal session. Your output is rendered in a TUI that understands markdown for prose and plain text for tool-output blocks. Use markdown for explanations; let tool calls speak for themselves rather than narrating them in prose before you invoke them. Act first, then summarise what you did.
+You are concise by default. Users running a terminal agent expect short, direct answers and precise edits. Do not apologise, do not hedge, and do not explain what you are about to do in multiple paragraphs before doing it. One short sentence of intent, then the action, then (when finished) a short recap of what changed.
+You are careful with other people's machines. You never run destructive commands without being explicitly asked. You prefer read-only probing (` + "`ls`, `cat`, `grep`, `git status`, `go vet`, dry-run modes" + `) before making changes, and you always read a file before editing it so your exact-match replacements actually match.`
+const defaultGuidelines = `Operating guidelines:
+- Prefer the ` + "`edit`" + ` tool over ` + "`write`" + ` for existing files. ` + "`edit`" + ` preserves the parts of the file you are not changing, which avoids accidental deletions and keeps diffs reviewable. Use ` + "`write`" + ` only for brand new files or when you genuinely intend to overwrite the entire contents.
+- Always read a file before editing it. Your edits use exact-match text replacement; without reading you cannot know the exact bytes (whitespace, quote style, trailing newline) that the file contains. Do not guess.
+- Each ` + "`oldText`" + ` in an edit must appear exactly once in the target file. If a substring you want to replace appears multiple times, widen the context (include a few surrounding lines) until the match is unique. Do not try to replace several occurrences with the same edit.
+- Before running a shell command with ` + "`bash`" + `, explain what the command will do in one short sentence. Keep the explanation under ten words when possible. Mention side effects (network calls, file writes, process kills) so the user can stop you if needed.
+- Keep shell commands non-interactive. Pass ` + "`-y`, `--yes`, or `--non-interactive`" + ` flags where relevant. Pipe ` + "`yes`" + ` into prompts that would otherwise block. Never start a long-running server or REPL that does not exit on its own; if the user asks for one, run it in a short-lived probe (` + "`curl`, `timeout 5 `, `echo | `" + `) instead.
+- Absolutely avoid destructive commands without explicit user confirmation: ` + "`rm -rf /`, `rm -rf ~`, `dd of=/dev/`, `chmod -R 777`, `git push --force`, `git reset --hard` against unstaged work, dropping database tables, truncating migrations, reformatting drives." + ` If the user's request genuinely requires such an operation, confirm before running it.
+- When unsure about a file's contents, location, or structure, inspect first (` + "`ls`, `read`, `grep`" + `) rather than guessing a path or writing speculative code. It is always cheaper to verify than to revert a wrong edit.
+- When you finish a task, reply with a short summary of what changed and any commands the user should run (tests, builds, service restarts). Do not paste the full diff back; the TUI already showed the edits. Do not re-describe what each tool call did; just name the outcome.
+- If the user's request is ambiguous in a way that could lead to a destructive or costly action, ask a single clarifying question before proceeding. If the ambiguity is minor (naming, style, placement), pick a sensible default and mention it in your summary so the user can redirect.`


@@ -7,6 +7,7 @@ import (
 "context"
 "encoding/json"
 "fmt"
+"sort"
 "github.com/patriceckhart/zot/internal/provider"
 )
@@ -47,9 +48,20 @@ func NewRegistry(tools ...Tool) Registry {
 }
 // Specs returns the tool definitions to advertise to the LLM.
+// Sorted by tool name so the order is stable across requests. This
+// is load-bearing for provider-side prompt caching: providers
+// prefix-match tool definitions, and Go's map iteration order is
+// randomized per call, which would otherwise bust the cache every
+// single turn.
 func (r Registry) Specs() []provider.Tool {
+names := make([]string, 0, len(r))
+for name := range r {
+names = append(names, name)
+}
+sort.Strings(names)
 out := make([]provider.Tool, 0, len(r))
-for _, t := range r {
+for _, name := range names {
+t := r[name]
 out = append(out, provider.Tool{
 Name: t.Name(),
 Description: t.Description(),


@@ -8,6 +8,7 @@ import (
 "fmt"
 "io"
 "net/http"
+"os"
 "strings"
 "time"
 )
@@ -186,14 +187,24 @@ func (c *anthropicClient) buildRequest(req Request) (*anthRequest, error) {
 // Anthropic rejects them (429 rate_limit_error with zero tokens used).
 //
 // Cache budget: anthropic caps cache_control to 4 breakpoints per
-// request. We spend them on (system prompt) + (tools tail) + (last
-// two user messages). The claude-code identity line stays uncached
-// because it's a few tokens and gets folded into the larger prefix
-// implicitly anyway.
+// request. We spend them on:
+// 1. claude-code identity (OAuth only; stable forever)
+// 2. user system prompt (changes per-session at most)
+// 3. last tool definition (tools change rarely)
+// 4. last message block (advances every turn)
+//
+// The identity line gets its OWN cache_control so the prefix
+// [identity] is cacheable independently of the user system
+// prompt. Without that, the cache prefix starts after block 2
+// and any drift in the user prompt (e.g. the Current date
+// line flipping at midnight) invalidates everything, including
+// the 17 identity tokens we have to re-send every request
+// forever.
 if c.oauthTok != "" {
 out.System = []anthSystemBlock{{
-Type: "text",
-Text: claudeCodeIdentity,
+Type: "text",
+Text: claudeCodeIdentity,
+CacheControl: &anthCacheCtrl{Type: "ephemeral"},
 }}
 if req.System != "" {
 out.System = append(out.System, anthSystemBlock{
@@ -237,8 +248,20 @@ func (c *anthropicClient) buildRequest(req Request) (*anthRequest, error) {
 out.Tools[n-1].CacheControl = &anthCacheCtrl{Type: "ephemeral"}
 }
-// Group messages: consecutive user/tool roles into one "user" message.
-// Anthropic only has roles "user" and "assistant"; tool_result blocks live in user messages.
+// Convert messages. Anthropic's wire format has only "user" and
+// "assistant" roles; tool_result blocks live inside user messages.
+//
+// CRITICAL: tool_result blocks go into their OWN new user
+// message, they are NOT merged into the preceding user message.
+// Merging would mutate the prior user message's content array
+// between turn N and turn N+1: turn N caches the prefix ending at
+// [user: "read sample.ts"], turn N+1 sends
+// [user: "read sample.ts" + tool_result=...] which is a
+// different block sequence, busting the cache prefix match.
+// Anthropic's API happily accepts consecutive user messages, and
+// emitting them separately keeps each message bit-stable across
+// turns, so the cache prefix matches for the entire history up
+// to the newest block.
 for _, msg := range req.Messages {
 renameTools := c.oauthTok != ""
 switch msg.Role {
@@ -248,13 +271,10 @@ func (c *anthropicClient) buildRequest(req Request) (*anthRequest, error) {
 Content: convertAnthContent(msg.Content, renameTools),
 })
 case RoleTool:
-// Attach tool_result blocks to a user message; merge with prior user msg if last.
-blocks := convertAnthContent(msg.Content, renameTools)
-if n := len(out.Messages); n > 0 && out.Messages[n-1].Role == "user" {
-out.Messages[n-1].Content = append(out.Messages[n-1].Content, blocks...)
-} else {
-out.Messages = append(out.Messages, anthMessage{Role: "user", Content: blocks})
-}
+out.Messages = append(out.Messages, anthMessage{
+Role: "user",
+Content: convertAnthContent(msg.Content, renameTools),
+})
 case RoleAssistant:
 out.Messages = append(out.Messages, anthMessage{
 Role: "assistant",
@@ -263,33 +283,24 @@ func (c *anthropicClient) buildRequest(req Request) (*anthRequest, error) {
 }
 }
-// Mark the last two user messages with cache_control so anthropic
-// caches the running conversation prefix. Combined with the system
-// + tools breakpoints above this is the recommended layout for
-// multi-turn caching: turn N writes a cache that turn N+1 reads,
-// dropping per-turn input tokens from "system + history" down to
-// just the new user message. Anthropic allows up to 4 breakpoints
-// per request; we use system + tools + 2 conversation = 4.
-tagUserCache(out.Messages)
+// Tag the LAST user message with cache_control. Spends the 4th
+// breakpoint. For prefixes under ~1024 tokens (Anthropic's
+// minimum cacheable block size for Opus), no cache is written.
+tagLastUserCache(out.Messages)
 return out, nil
 }
-// tagUserCache attaches a cache_control marker to the last block of
-// the most recent (and second-most-recent, if any) user message in
-// msgs. The marker tells the api to checkpoint the prefix at that
-// point so subsequent requests can replay everything up to and
-// including that block as a cache hit.
-func tagUserCache(msgs []anthMessage) {
-indexes := make([]int, 0, 2)
-for i := len(msgs) - 1; i >= 0 && len(indexes) < 2; i-- {
+// tagLastUserCache marks the last block of the most recent user
+// message. One marker; combined with identity + systemPrompt +
+// last-tool, spends Anthropic's 4-breakpoint budget.
+func tagLastUserCache(msgs []anthMessage) {
+for i := len(msgs) - 1; i >= 0; i-- {
 if msgs[i].Role == "user" {
-indexes = append(indexes, i)
+markLastBlockEphemeral(msgs[i].Content)
+return
 }
 }
-for _, idx := range indexes {
-markLastBlockEphemeral(msgs[idx].Content)
-}
 }
 // markLastBlockEphemeral sets CacheControl on the last entry in blocks
@@ -427,6 +438,18 @@ func (c *anthropicClient) Stream(ctx context.Context, req Request) (<-chan Event
 return nil, err
 }
+// Optional debug dump: when $ZOT_DEBUG_ANTHROPIC is a file path
+// we append every outgoing request body to it, one JSON object
+// per line. Useful for diffing turn N vs turn N+1 to understand
+// why the cache prefix isn't matching.
+if dump := os.Getenv("ZOT_DEBUG_ANTHROPIC"); dump != "" {
+if f, derr := os.OpenFile(dump, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600); derr == nil {
+_, _ = f.Write(body)
+_, _ = f.Write([]byte{'\n'})
+_ = f.Close()
+}
+}
 httpReq, err := http.NewRequestWithContext(ctx, "POST", c.baseURL+"/v1/messages", bytes.NewReader(body))
 if err != nil {
 return nil, err
@@ -441,7 +464,7 @@ func (c *anthropicClient) Stream(ctx context.Context, req Request) (<-chan Event
 httpReq.Header.Set("authorization", "Bearer "+c.oauthTok)
 httpReq.Header.Set("anthropic-beta", "claude-code-20250219,oauth-2025-04-20,fine-grained-tool-streaming-2025-05-14")
 httpReq.Header.Set("anthropic-dangerous-direct-browser-access", "true")
-httpReq.Header.Set("user-agent", "claude-cli/"+claudeCodeVersion+" (external, cli)")
+httpReq.Header.Set("user-agent", "claude-cli/"+claudeCodeVersion)
 httpReq.Header.Set("x-app", "cli")
 // Remove x-api-key entirely by NOT setting it.
 } else {
@@ -629,10 +652,14 @@ func (c *anthropicClient) runStream(ctx context.Context, resp *http.Response, re
 } `json:"message"`
 }
 _ = json.Unmarshal([]byte(ev.Data), &m)
-usage.InputTokens += m.Message.Usage.InputTokens
-usage.OutputTokens += m.Message.Usage.OutputTokens
-usage.CacheReadTokens += m.Message.Usage.CacheReadInputTokens
-usage.CacheWriteTokens += m.Message.Usage.CacheCreationInputTokens
+// Anthropic sends cumulative values on message_start and
+// again on message_delta (refreshed), so assign, don't
+// accumulate. Accumulating doubles cache_creation_input
+// which can be 50-70% of cost.
+usage.InputTokens = m.Message.Usage.InputTokens
+usage.OutputTokens = m.Message.Usage.OutputTokens
+usage.CacheReadTokens = m.Message.Usage.CacheReadInputTokens
+usage.CacheWriteTokens = m.Message.Usage.CacheCreationInputTokens
 case "message_delta":
 var m struct {
 Delta struct {
@@ -646,10 +673,21 @@ func (c *anthropicClient) runStream(ctx context.Context, resp *http.Response, re
 } `json:"usage"`
 }
 _ = json.Unmarshal([]byte(ev.Data), &m)
-usage.InputTokens += m.Usage.InputTokens
-usage.OutputTokens += m.Usage.OutputTokens
-usage.CacheReadTokens += m.Usage.CacheReadInputTokens
-usage.CacheWriteTokens += m.Usage.CacheCreationInputTokens
+// Refresh usage from the latest cumulative totals
+// Anthropic provides. Only apply non-zero fields in case
+// a given delta only carries output tokens.
+if m.Usage.InputTokens > 0 {
+usage.InputTokens = m.Usage.InputTokens
+}
+if m.Usage.OutputTokens > 0 {
+usage.OutputTokens = m.Usage.OutputTokens
+}
+if m.Usage.CacheReadInputTokens > 0 {
+usage.CacheReadTokens = m.Usage.CacheReadInputTokens
+}
+if m.Usage.CacheCreationInputTokens > 0 {
+usage.CacheWriteTokens = m.Usage.CacheCreationInputTokens
+}
 switch m.Delta.StopReason {
 case "end_turn", "stop_sequence":
 stop = StopEnd


@@ -157,7 +157,9 @@ var Catalog = []Model{
 // ---- Speculative: Anthropic ----
 {
 Provider: "anthropic", ID: "claude-opus-4-5", DisplayName: "Claude Opus 4.5",
-ContextWindow: 1000000, MaxOutput: 128000, Reasoning: true,
+// 200k ctx / 64k maxOutput per Anthropic's published sizing
+// for the opus-4-5 family; the 1M context is a 4.6+ thing.
+ContextWindow: 200000, MaxOutput: 64000, Reasoning: true,
 PriceInput: 5.00, PriceOutput: 25.00, PriceCacheRead: 0.50, PriceCacheWrite: 6.25,
 Speculative: true,
 },
@@ -181,33 +183,46 @@ var Catalog = []Model{
 },
 // ---- Speculative: OpenAI ----
+// Context windows on the OpenAI gpt-5.x family differ by route:
+// the direct API advertises 400k, the ChatGPT Codex OAuth backend
+// caps at 272k. zot serves both auth modes from one catalog row
+// per id, so we pin to the smaller number to keep the context-usage
+// meter honest under subscription auth. Users on the direct API
+// simply see a conservative headroom estimate.
 {
 Provider: "openai", ID: "gpt-5.1", DisplayName: "GPT-5.1",
-ContextWindow: 400000, MaxOutput: 128000, Reasoning: true,
+ContextWindow: 272000, MaxOutput: 128000, Reasoning: true,
 PriceInput: 1.25, PriceOutput: 10.00, PriceCacheRead: 0.125,
 Speculative: true,
 },
 {
 Provider: "openai", ID: "gpt-5.2", DisplayName: "GPT-5.2",
-ContextWindow: 400000, MaxOutput: 128000, Reasoning: true,
+ContextWindow: 272000, MaxOutput: 128000, Reasoning: true,
 PriceInput: 1.75, PriceOutput: 14.00, PriceCacheRead: 0.175,
 Speculative: true,
 },
 {
 Provider: "openai", ID: "gpt-5.3", DisplayName: "GPT-5.3",
-ContextWindow: 400000, MaxOutput: 128000, Reasoning: true,
+ContextWindow: 272000, MaxOutput: 128000, Reasoning: true,
 PriceInput: 1.75, PriceOutput: 14.00, PriceCacheRead: 0.175,
 Speculative: true,
 },
 {
 Provider: "openai", ID: "gpt-5.4", DisplayName: "GPT-5.4",
-ContextWindow: 400000, MaxOutput: 128000, Reasoning: true,
+// ContextWindow: 272k across every route we support (OpenAI
+// direct API and the ChatGPT Codex OAuth backend).
+ContextWindow: 272000, MaxOutput: 128000, Reasoning: true,
 PriceInput: 2.50, PriceOutput: 15.00, PriceCacheRead: 0.25,
 Speculative: true,
 },
 {
 Provider: "openai", ID: "gpt-5.4-mini", DisplayName: "GPT-5.4 mini",
-ContextWindow: 400000, MaxOutput: 128000, Reasoning: true,
+// ContextWindow: 400k on the OpenAI direct API, 272k on the
+// ChatGPT Codex OAuth backend. We pin to the smaller Codex
+// cap so the context-usage meter is honest under subscription
+// auth; direct-API users simply see a conservative headroom
+// estimate rather than an inflated one.
+ContextWindow: 272000, MaxOutput: 128000, Reasoning: true,
 PriceInput: 0.75, PriceOutput: 4.50, PriceCacheRead: 0.075,
 Speculative: true,
 },