docs: integrate operator observability + provider fallback work

Brings the public docs in line with what shipped on multitenant over the
last few days. Three new operator-facing pages, four updates to existing
ones, and a CHANGELOG batch.

New pages (docs/public/operate/):
- operator-commands.md — single reference for all Telegram slash
  commands, grouped by purpose (status, structured reports, runtime,
  sessions, admin actions) with auth gating per command. Previously only
  in-bot /help text.
- provider-fallback.md — operator guide for the cooldown layer: env vars,
  how cooldowns are detected and tracked, /policy surfacing,
  /clearcooldown for manual release, the configured/effective/actual
  observability triple. Includes a "path convention note" flagging that
  the cooldown file still uses the legacy $CLAWDIE_VAR_DIR resolution
  while test/build status files have moved to repo tmp/ — divergence to
  harmonize later in code.
- structured-reports.md — explains the Observed/Interpretation/Operator
  Notes pattern, lists the six structured reports, documents the
  test/build pipeline contract (status JSON schema + new
  $AGENT_STATUS_DIR → $CLAWDIE_VAR_DIR → tmp/status precedence Codex
  landed in 1389e17), and covers free-text routing
  (classifyReportIntent + isOpsFlavored).

Updates:
- monitoring.md: appended "Operator-Facing Reports" section pointing at
  the new structured-reports page, and "Provider Fallback Health"
  pointing at the fallback page.
- operate/index.md: added the three new pages to the runbook list.
- architecture/controlplane.md: added "Runtime Observability" section
  documenting the configured/effective/actual triple and linking to the
  new operate pages.
- README.md: expanded the Telegram Commands table (was 10 rows, missing
  every structured report, /policy, /clearcooldown, /budgetreset) and
  added a pointer to operator-commands.md as the full reference. Also
  noted free-text routing.
- CHANGELOG.md: appended an "operator observability + provider fallback,
  apr.2026" batch under [Unreleased] covering provider fallback, the
  reports family, the test/build wrapper pipeline, free-text routing,
  /clearcooldown, the observability triple, the Telegram setMyCommands
  menu, and the new "Verify Before Claiming Remote State" rule in
  AGENTS.md.

No code changes. Slovenian sl/ mirror left untouched (out of
localization scope).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---
Build: pass | Tests: FAIL — Tests 8 failed | 1940 passed (1948)

---
Build: pass | Tests: FAIL — Tests 2 failed | 1949 passed (1951)
2026-04-26 12:58:44 +02:00 · 2026-04-26 12:58:44 +02:00 · 3828e5ce83
commit 3828e5ce83
parent 1389e17ec4
8 changed files with 544 additions and 12 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -54,6 +54,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `README.md`, `CLAWDIE-ISO.md`, `AGENTS.md` synced to mention the agent-CLI prereq gate and the npm-globals bundle path
 - `AGENTS.md` + nginx/freebsd-admin skills updated with controlplane dashboard build notes (Paperclip UI) and Tailscale proxy/PF pointers

+### Added (operator observability + provider fallback, apr.2026)
+
+- **Provider fallback layer** (`src/provider-fallback.ts`) — automatically swaps the configured LLM provider for an operator-defined fallback when the primary hits a usage cap. Detects `429 Usage limit reached` from pi stderr/stdout, parses `Your limit will reset at YYYY-MM-DD HH:MM:SS`, and marks a cooldown until the reset timestamp passes. Cooldowns are in-memory plus persisted to `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default `$HOME/.clawdie/state/`) so a restart inside the cap window does not re-trip the cap. Wired into `agent-runner.ts` (main chat) and `controlplane-heartbeat.ts` (specialists). Per-chat overrides (`group.jailConfig.provider`) are unchanged — only the spawn-time effective values are swapped while the cooldown is live.
+- `LLM_FALLBACK_PROVIDER`, `LLM_FALLBACK_MODEL`, `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` config — operator picks the fallback (e.g. `openrouter` + a free-tier model). Default cooldown (3600s) is used only when the cap message has no parseable reset stamp.
+- `getLlmKeyForProvider(provider)` (`src/env.ts`) — provider-aware secret resolution so the right API key is injected when fallback swaps providers; falls back to first-available when the requested key is absent.
+- Startup validation: when `LLM_FALLBACK_PROVIDER` is set, the matching API key is now in the `criticalConfig` warn list. Warns separately when `LLM_FALLBACK_PROVIDER` is set without `LLM_FALLBACK_MODEL`.
+- `/clearcooldown` admin command (ops-chat-gated) — lists active cooldowns when called without args; takes `<provider>` or `all`. Persists immediately so cleared state survives restart.
+- `/policy` now shows a `Provider cooldown: <provider> until <iso> → fallback <provider/model>` line for each active cooldown.
+- Activity payload now records `effective_provider` / `effective_model` next to `actual_*` so for any run you can read configured vs effective vs actual.
+- **Structured operator reports family** with consistent `Observed` / `Interpretation` / `Operator Notes` sections — `src/reports/{system,disk,tasks,budget,publish,test}-report.ts`. Each report is a pure builder + renderer fed by raw inputs (DB rows, command output, JSON status files), tested independently of the wiring layer.
+- `/report`, `/disk`, `/tasks`, `/budgetreport`, `/publishreport`, `/testreport` Telegram commands — the structured-report surfaces.
+- **Test/build status pipeline** — `scripts/write-test-build-status.sh` runs the project's `npm run build` and `npx vitest run --reporter=json --outputFile=...`, then writes `build-status.json` and `test-status.json` to the status directory: `$AGENT_STATUS_DIR` (primary) → `$CLAWDIE_VAR_DIR` (legacy) → `<project-root>/tmp/status` (default). `/testreport` reads these files; missing or stale (>6h) files degrade to `unknown` with an action note rather than fabricating success. Pre-commit/post-commit hooks append the latest status to commit messages so reviewers see what was passing at commit time.
+- **Free-text ops routing** (`src/report-intent.ts`) — bot-addressed phrasings like "disk usage", "are the tests passing", "what tasks do we have", "budget report" are classified by `classifyReportIntent()` and routed to the matching structured builder instead of the LLM path. Keeps memory/narrative recall from overriding a fresh probe.
+- `isOpsFlavored()` — broader pattern matcher used to suppress stale memory injection on ops-flavored prompts so the LLM answers from live tools rather than narrative recall.
+- **Specialist capability gate** (`src/agent-capabilities.ts`) — pre-flight check that compares the requested skill (and task description) against the assigned jail's installed tools, refusing the run with a clear reason when the agent cannot perform it.
+- Telegram bot now publishes a proper command menu via `setMyCommands` with separate command lists for private chats vs the ops chat (`src/channels/telegram.ts`).
+- `AGENTS.md` § "Verify Before Claiming Remote State" — convention requiring `git fetch` before reporting on any remote ref. Born from a real two-agent confusion on 26.apr where stale `origin/multitenant` refs in two worktrees produced contradictory "no new remote work" claims.
+
+### Changed (operator observability)
+
+- Many Telegram commands moved from `requireRegistered(ctx)` gate to direct chat resolution; per-handler `requireAdmin` / `requireOpsChat` still enforce auth. Effect: admins can run read-only ops commands from any chat without registering it first.
+- `/status` ZFS section caps at 8 lines with a "… N more dataset(s) hidden" footer.
+- `parseBastilleList` consolidated to use the shared `bastille-list.ts` parser. `summarizeZfsRows` extracted as a pure exportable helper.
+
+### Fixed (operator observability)
+
+- `/report` controlplane probe: when `CONTROLPLANE_BIND_HOST=0.0.0.0`, `getControlplaneProbeHost()` now derives a reachable host from `BETTER_AUTH_URL` instead of probing the wildcard address. Previously the report would say "controlplane unreachable" even when controlplane was healthy.
+- Test artifacts now write to repo-local `tmp/` instead of system `/tmp` (per `AGENTS.md` § "Temporary File Storage").
+
 ## [0.10.0] - 2026-04-07

 ### Paperclip Control Plane Integration
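
Reviewer's sketch, not part of this commit: the status-directory precedence and the ">6h degrades to `unknown`" rule from the "Test/build status pipeline" changelog entry could look roughly like this. The names `resolveStatusDir`, `effectiveStatus`, and the `writtenAtMs` field are hypothetical; the real status JSON schema is documented in structured-reports.md and is not reproduced here.

```typescript
type Env = Record<string, string | undefined>;

// Treat unset and blank variables the same way.
function nonEmpty(value: string | undefined): string | undefined {
  return value !== undefined && value.trim() !== "" ? value : undefined;
}

// Hypothetical helper: $AGENT_STATUS_DIR (primary), then $CLAWDIE_VAR_DIR
// (legacy), then <project-root>/tmp/status (default).
function resolveStatusDir(env: Env, projectRoot: string): string {
  return (
    nonEmpty(env.AGENT_STATUS_DIR) ??
    nonEmpty(env.CLAWDIE_VAR_DIR) ??
    `${projectRoot}/tmp/status`
  );
}

// The ">6h" staleness threshold named in the changelog entry.
const STALE_AFTER_MS = 6 * 60 * 60 * 1000;

type Reported = "pass" | "fail" | "unknown";

// Hypothetical status shape (assumption, not the documented schema).
interface StatusFile {
  status: "pass" | "fail";
  writtenAtMs: number;
}

// Missing or stale files degrade to "unknown" instead of reporting success.
function effectiveStatus(file: StatusFile | null, nowMs: number): Reported {
  if (file === null) return "unknown";
  if (nowMs - file.writtenAtMs > STALE_AFTER_MS) return "unknown";
  return file.status;
}
```

The point of the second function is that an old `pass` is never reported as a current `pass`; the reader has to re-run the wrapper to get a fresh answer.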
--- a/README.md
+++ b/README.md
@@ -450,18 +450,36 @@ From the main channel (your self-chat), you can manage groups and tasks:

 ## Telegram Commands

-| Command       | Description                                        | Auth   |
-| ------------- | -------------------------------------------------- | ------ |
-| `/status`     | System status: jails, ZFS, PF, budget, model       | anyone |
-| `/usage`      | Per-agent token budget breakdown                   | anyone |
-| `/compact`    | Compact session (summarize old, keep recent turns) | admin  |
-| `/new`        | Hard reset session, start fresh                    | admin  |
-| `/resume`     | Unpause a budget-paused chat                       | admin  |
-| `/stop`       | Kill running agent mid-response                    | admin  |
-| `/tts`        | Toggle voice replies (on/off/status/default)       | admin  |
-| `/activation` | Set trigger mode (always/mention)                  | admin  |
-| `/whoami`     | Show your Telegram identity                        | anyone |
-| `/help`       | List available commands                            | anyone |
+A short selection — for the full reference (status, structured reports,
+runtime, sessions, admin actions, free-text routing) see
+[Operator Commands](docs/public/operate/operator-commands.md).
+
+| Command          | Description                                                    | Auth      |
+| ---------------- | -------------------------------------------------------------- | --------- |
+| `/status`        | System summary: jails, ZFS, PF, budget, model                  | anyone    |
+| `/report`        | Structured system + auth report                                | admin     |
+| `/disk`          | Structured ZFS pool + snapshot report                          | admin     |
+| `/tasks`         | Structured controlplane task report                            | admin     |
+| `/budgetreport`  | Structured budget + token analytics                            | admin     |
+| `/publishreport` | Structured tenant publish/content report                       | admin     |
+| `/testreport`    | Structured build + test status (from wrapper-written JSON)     | admin     |
+| `/policy`        | Default runtime, per-chat overrides, fallback cooldowns        | anyone    |
+| `/usage`         | Per-agent token budget breakdown                               | anyone    |
+| `/clearcooldown` | Clear a [provider fallback](docs/public/operate/provider-fallback.md) cooldown | ops chat |
+| `/budgetreset`   | Reset agent token budget                                       | ops chat  |
+| `/compact`       | Compact session (summarize old, keep recent turns)             | admin     |
+| `/new`           | Hard reset session, start fresh                                | admin     |
+| `/resume`        | Unpause a budget-paused chat                                   | admin     |
+| `/stop`          | Kill running agent mid-response                                | admin     |
+| `/tts`           | Toggle voice replies (on/off/status/default)                   | admin     |
+| `/activation`    | Set trigger mode (always/mention)                              | admin     |
+| `/whoami`        | Show your Telegram identity                                    | anyone    |
+| `/help`          | List available commands                                        | anyone    |
+
+The bot also routes **free-text ops phrasings** ("disk usage", "are the
+tests passing", "task report", etc.) to the matching structured report
+instead of the LLM path — see
+[Structured Reports → Free-Text Routing](docs/public/operate/structured-reports.md#free-text-routing).

 ### Session Compaction

--- a/docs/public/architecture/controlplane.md
+++ b/docs/public/architecture/controlplane.md
@@ -136,9 +136,30 @@ just setup-controlplane

 ---

+## Runtime Observability
+
+Every agent run (orchestrator main chat or specialist heartbeat) records
+three provider/model values in `agent_activity.payload`:
+
+| Field          | Meaning                                                   |
+| -------------- | --------------------------------------------------------- |
+| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`)     |
+| `effective_*`  | What was actually passed to pi (after fallback swap)      |
+| `actual_*`     | What pi reports having used (parsed from session JSONL)   |
+
+`configured_*` and `effective_*` differ when [provider fallback](../operate/provider-fallback/)
+is active (cooldown is live, runtime is using the operator's chosen
+fallback). `actual_*` should match `effective_*` for a successful run; a
+divergence suggests pi rewrote the model selection internally.
+
+`/budgetreport` and `/tokens` surface these values; `/policy` shows the
+fallback cooldown line when one is active.
+
 ## References

 - `doc/CONTROLPLANE-ARCHITECTURE.md` — detailed service layout
 - `doc/CONTROLPLANE-MESSAGE-CONTRACT.md` — API contracts (what agents query and post)
 - `doc/CONTROLPLANE-AGENT-ROLES.md` — role definitions, skill mappings, budgets
 - `SOUL.md`, `SYSADMIN_AGENT.md`, `DB_ADMIN_AGENT.md`, `GIT_ADMIN_AGENT.md` — agent identity files
+- [Provider Fallback](../operate/provider-fallback/) — automatic provider switching when the primary hits a usage cap
+- [Structured Reports](../operate/structured-reports/) — operator-facing report family + free-text routing
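
Reviewer's sketch, not part of this commit: one way to read the configured / effective / actual triple described in the "Runtime Observability" section. Only `effective_provider` and `effective_model` are spelled out in the changelog; the other four field names are assumed from the `configured_*` / `actual_*` pattern, and `readTriple` is a hypothetical helper.

```typescript
// Assumed payload shape, following the configured_* / effective_* / actual_*
// naming pattern from the table above.
interface ActivityPayload {
  configured_provider: string;
  configured_model: string;
  effective_provider: string;
  effective_model: string;
  actual_provider: string;
  actual_model: string;
}

type TripleReading = "normal" | "fallback-active" | "actual-diverged";

// Classify a run: a mismatch between effective and actual is reported first,
// because it suggests pi changed the model selection internally.
function readTriple(p: ActivityPayload): TripleReading {
  const actualMatches =
    p.actual_provider === p.effective_provider &&
    p.actual_model === p.effective_model;
  if (!actualMatches) return "actual-diverged";

  const swapped =
    p.configured_provider !== p.effective_provider ||
    p.configured_model !== p.effective_model;
  return swapped ? "fallback-active" : "normal";
}
```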
--- a/docs/public/operate/index.md
+++ b/docs/public/operate/index.md
@@ -7,5 +7,8 @@ Runbooks for day-to-day operation and recovery.

 - [Security](./security/)
 - [Monitoring](./monitoring/)
+- [Operator Commands](./operator-commands/)
+- [Structured Reports](./structured-reports/)
+- [Provider Fallback](./provider-fallback/)
 - [DB disaster recovery](./db-disaster-recovery/)
 - [Git storage](./git-storage/)
--- a/docs/public/operate/monitoring.md
+++ b/docs/public/operate/monitoring.md
@@ -145,3 +145,37 @@ Bastille monitor and Clawdie doctor solve different problems:
 - **Clawdie doctor** — application, pipeline, and control plane health

 Use both; don't confuse them.
+
+## Operator-Facing Reports
+
+Beyond the runtime health files above, the agent exposes a family of
+**structured reports** for operator inspection on demand. Each report has a
+matching Telegram slash command and follows the same `Observed` /
+`Interpretation` / `Operator Notes` template — see
+[Structured Reports](./structured-reports/) for the design and the full list.
+
+| Report     | Command           | What it answers                                     |
+| ---------- | ----------------- | --------------------------------------------------- |
+| System     | `/report`         | Are services + jails + controlplane healthy?        |
+| Disk       | `/disk`           | What is consuming ZFS pool space and snapshots?     |
+| Tasks      | `/tasks`          | What is in the controlplane task queue?             |
+| Budget     | `/budgetreport`   | Token budgets and burn analytics                    |
+| Publish    | `/publishreport`  | Tenant publish/content state                        |
+| Test/Build | `/testreport`     | Was the last build/test run green?                  |
+
+`/testreport` is fed by `scripts/write-test-build-status.sh`, not by the
+running process — invoke the wrapper from CI, a hook, or by hand to refresh
+its status files. The pre-commit and post-commit hooks run it automatically
+so each commit message footer reflects what was passing at commit time.
+
+For the full operator command reference (status, sessions, admin actions,
+free-text routing), see [Operator Commands](./operator-commands/).
+
+## Provider Fallback Health
+
+When the configured LLM provider is in cooldown (e.g. zAI usage cap), the
+agent transparently routes to the operator-defined fallback. Active
+cooldowns are visible in `/policy` and as structured `logger.warn` lines on
+every fallback-active run. See [Provider Fallback](./provider-fallback/) for
+configuration, manual release (`/clearcooldown`), and the
+configured / effective / actual observability triple.
--- a/docs/public/operate/operator-commands.md
+++ b/docs/public/operate/operator-commands.md
@@ -0,0 +1,118 @@
+---
+title: 'Operator Commands'
+description: Reference for the Telegram slash commands operators use to inspect and control the running agent.
+---
+
+The agent exposes its operational surface as Telegram slash commands. This
+page is the single reference for what each command does, who can run it,
+and which underlying surface it inspects. The Telegram bot also publishes a
+native command menu via `setMyCommands` — start typing `/` in any chat for
+the live in-app list.
+
+## Authorization Layers
+
+Three layers gate the commands. A command may pass through one, two, or all
+three:
+
+| Gate                | Where                                  | Effect                                        |
+| ------------------- | -------------------------------------- | --------------------------------------------- |
+| `requireAdmin`      | Per-handler                            | Only operators on the admin allow-list run it |
+| `requireOpsChat`    | Per-handler (write/destructive only)   | Only the configured ops chat may invoke it    |
+| Per-chat overrides  | `group.jailConfig` (registered groups) | Per-chat model/provider overrides             |
+
+Read-only report commands (`/report`, `/disk`, `/testreport`, etc.) are
+admin-gated but not ops-chat-gated; admins can run them from any chat.
+Destructive commands (`/budgetreset`, `/clearcooldown`) require the ops chat.
+
+## Status & Identity
+
+| Command       | Purpose                                                       | Surface                                       |
+| ------------- | ------------------------------------------------------------- | --------------------------------------------- |
+| `/ping`       | Confirm the bot process is responsive                         | Direct reply                                  |
+| `/chatid`     | Print the current chat's JID                                  | Useful for `.env` registration                |
+| `/whoami`     | Show your Telegram identity                                   | Confirms admin-allowlist match                |
+| `/status`     | Compact system summary (jails, ZFS pools, PF, budget)         | `src/system-state.ts` snapshot                |
+
+## Structured Reports
+
+All structured reports follow the same `Observed` / `Interpretation` /
+`Operator Notes` template. See [Structured Reports](./structured-reports/) for
+the design pattern.
+
+| Command          | Report                                                | Source                                                                  |
+| ---------------- | ----------------------------------------------------- | ----------------------------------------------------------------------- |
+| `/report`        | System & auth — services, jails, PF, controlplane     | `hostd` probes + `probeControlplaneAuth()`                              |
+| `/disk`          | ZFS pools and snapshots                               | `zpool list -H` + `zfs list -H -o name,usedsnap`                        |
+| `/tasks`         | Controlplane task queue                               | `getAllTasks()` (Postgres)                                              |
+| `/budgetreport`  | Token budgets and burn analytics                      | `getAllBudgets()` + `getAgentTokenAnalytics()`                          |
+| `/publishreport` | Tenant publish/content state                          | `loadTenantRegistry()` + webroot inspection                             |
+| `/testreport`    | Build and test pass/fail                              | `tmp/status/build-status.json` + `tmp/status/test-status.json`          |
+
+`/testreport` is fed by `scripts/write-test-build-status.sh` — see
+[Structured Reports](./structured-reports/#test-build-pipeline) for the
+write/read contract.
+
+## Runtime & Policy
+
+| Command           | Purpose                                                                   |
+| ----------------- | ------------------------------------------------------------------------- |
+| `/policy`         | Active runtime policy (default model, overrides, cooldowns, budget state) |
+| `/budget`         | Alias for `/policy`                                                       |
+| `/usage`          | Token budget per agent                                                    |
+| `/tokens`         | Runtime token burn per agent (last-N analytics)                           |
+| `/model`          | Set provider/model for this chat (per-chat override)                      |
+| `/activation`     | Set trigger mode (always-respond vs mention-only)                         |
+| `/tts`            | Toggle voice replies (`on` / `off` / `status`)                            |
+
+`/policy` shows the [Provider fallback](./provider-fallback/) cooldown line
+when one is active.
+
+## Sessions
+
+| Command       | Purpose                                                            |
+| ------------- | ------------------------------------------------------------------ |
+| `/new`        | Reset this chat's session                                          |
+| `/compact`    | Compact the session (summarize old, keep recent)                   |
+| `/stop`       | Stop a running agent for this chat                                 |
+| `/resume`     | Resume a budget-paused chat                                        |
+
+## Admin Actions (Ops-Chat Only)
+
+| Command                | Purpose                                                                  |
+| ---------------------- | ------------------------------------------------------------------------ |
+| `/budgetreset <id>`    | Reset an agent's token budget. `all` requires `confirm` second arg.      |
+| `/clearcooldown [id]`  | Clear a [provider fallback](./provider-fallback/) cooldown               |
+| `/audit`               | Platform ownership audit (which jail/dataset/service belongs to which)   |
+| `/snapshots [dataset]` | List ZFS snapshots                                                       |
+| `/scrub <pool> [op]`   | ZFS scrub controls (`status` / `start` / `stop`)                         |
+| `/updates`             | FreeBSD base + ports update status                                       |
+| `/schedule`            | Manage scheduled agent tasks (list / add / cancel / done)                |
+
+## Free-Text Routing
+
+The bot recognizes **bot-addressed** ops-flavored phrasings without requiring
+a slash command. Examples that route to structured reports instead of the LLM
+path:
+
+| Phrase                                  | Routed to       |
+| --------------------------------------- | --------------- |
+| `disk usage`, `how much disk`           | `/disk`         |
+| `task report`, `active tasks`           | `/tasks`        |
+| `budget report`, `how many tokens`      | `/budgetreport` |
+| `are the tests passing`, `build status` | `/testreport`   |
+| `system report`, `report please`        | `/report`       |
+
+This keeps memory or narrative recall from drifting into a stale answer when
+fresh structured data is available. The full pattern set lives in
+`classifyReportIntent()` in `src/report-intent.ts`.
+
+A broader `isOpsFlavored()` matcher also suppresses memory injection on any
+ops-flavored prompt (services, jails, deploy, auth, controlplane terms),
+even when no specific report matches — so the LLM answers from live tools
+rather than narrative recall.
+
+## Help
+
+`/help` prints the in-bot command list. The list is generated from the same
+constants that drive the Telegram menu publication, so it reflects whatever
+is currently registered.
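
Reviewer's sketch, not part of this commit: a minimal illustration of the free-text routing table in operator-commands.md. The real pattern set lives in `classifyReportIntent()` in `src/report-intent.ts` and is not shown in this diff; the regexes below are assumptions built only from the example phrases in the table, and `routeFreeText` is a hypothetical name.

```typescript
type ReportCommand =
  | "/disk"
  | "/tasks"
  | "/budgetreport"
  | "/testreport"
  | "/report";

// Order matters: specific reports are checked before the generic "/report",
// so "budget report please" does not fall through to the system report.
const ROUTES: ReadonlyArray<{ pattern: RegExp; command: ReportCommand }> = [
  { pattern: /\bdisk usage\b|\bhow much disk\b/i, command: "/disk" },
  { pattern: /\btask report\b|\bactive tasks\b/i, command: "/tasks" },
  { pattern: /\bbudget report\b|\bhow many tokens\b/i, command: "/budgetreport" },
  { pattern: /\bare the tests passing\b|\bbuild status\b/i, command: "/testreport" },
  { pattern: /\bsystem report\b|\breport please\b/i, command: "/report" },
];

// Returns the matching report command, or null to fall back to the LLM path.
function routeFreeText(text: string): ReportCommand | null {
  for (const { pattern, command } of ROUTES) {
    if (pattern.test(text)) return command;
  }
  return null;
}
```

A `null` result means no structured report matched, so the message would continue down the normal LLM path.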
--- a/docs/public/operate/provider-fallback.md
+++ b/docs/public/operate/provider-fallback.md
@@ -0,0 +1,141 @@
+---
+title: 'Provider Fallback'
+description: Automatic LLM provider switching when the primary provider hits a usage cap.
+---
+
+When the primary LLM provider returns a "usage cap reached" error, the agent
+keeps replying instead of looping on 429s — it transparently switches to a
+configured fallback until the cap window passes, then automatically returns to
+the primary.
+
+## In Plain Language
+
+- Some LLM providers (notably zAI) impose rolling 5-hour usage caps. When you
+  hit one, every request fails until the reset.
+- Without fallback, the bot would retry the capped provider on every message
+  and stay broken for hours.
+- Fallback puts the capped provider in a "cooldown" until the reset timestamp,
+  routes new runs through your operator-chosen alternative (e.g. OpenRouter
+  with a free-tier model), and resumes the primary the moment the cooldown
+  expires.
+- The cooldown survives a process restart so a quick service bounce inside the
+  cap window does not re-trip the cap.
+
+## Configuration
+
+Set in `.env`:
+
+| Variable                                | Required                                 | Example                                      |
+| --------------------------------------- | ---------------------------------------- | -------------------------------------------- |
+| `LLM_FALLBACK_PROVIDER`                 | yes (when fallback is desired)           | `openrouter`                                 |
+| `LLM_FALLBACK_MODEL`                    | recommended                              | `meta-llama/llama-3.3-70b-instruct:free`     |
+| `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` | optional (default `3600`)                | `1800`                                       |
+
+The default cooldown is used **only** when the cap message has no parseable
+reset stamp. Real zAI cap errors include the reset timestamp, so the cooldown
+matches the reset exactly.
+
+The fallback provider's API key (`OPENROUTER_API_KEY` for openrouter,
+`ZAI_API_KEY` for zai, etc.) must also be set. The agent verifies this at
+startup and warns in the logs if it is missing — the warning is the only
+notice you will get before the fallback fails for real.
+
+## How Cooldowns Work
+
+1. A run fails with `429 Usage limit reached for 5 hour. Your limit will reset
+   at YYYY-MM-DD HH:MM:SS`.
+2. The runner parses the reset timestamp (treated as local time) and stores
+   `{ provider: 'zai', until: <reset>, reason: <message> }` in memory and on
+   disk.
+3. Every subsequent run consults the cooldown map *before* spawning pi. If the
+   configured provider is in cooldown, the spawn args swap to the fallback
+   provider/model.
+4. The cooldown auto-expires at the reset timestamp. The next run uses the
+   primary again.
+
+The cooldown file lives at `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default
+`$HOME/.clawdie/state/provider-cooldowns.json`). Expired entries are dropped
+on load.
+
+> **Path convention note.** The cooldown file currently uses the legacy
+> `$CLAWDIE_VAR_DIR` / `$HOME/.clawdie/state/` resolution. The newer
+> [test/build status files](./structured-reports/#test-build-pipeline)
+> moved to repo-local `tmp/` to align with `AGENTS.md` § "Temporary File
+> Storage". A future code change should harmonize provider-fallback to the
+> same precedence (`AGENT_STATUS_DIR` → `CLAWDIE_VAR_DIR` → `tmp/state/`).
+> Until then, if you set `AGENT_STATUS_DIR`, also set `CLAWDIE_VAR_DIR` to
+> the same path so both subsystems agree.
+
+## Inspecting State
+
+`/policy` shows active cooldowns under the runtime line:
+
+```
+Default runtime: zai / glm-4.6
+Provider cooldown: zai until 2026-04-25T19:00:59 → fallback openrouter/meta-llama/llama-3.3-70b-instruct:free
+```
+
+When no cooldowns are active, the line is omitted — runtime looks normal.
+
+Logs include structured warnings on every fallback-active run:
+
+```
+{ originalProvider: 'zai', fallbackProvider: 'openrouter', cooldownUntil: '...' } Provider fallback active — preferred provider is in cooldown
+```
+
+And on the run that *trips* the cooldown:
+
+```
+{ provider: 'zai', until: '2026-04-25T19:00:59', reason: '429 Usage limit reached; resets ...' } Provider cap detected — marking cooldown
+```
+
+## Manual Release
+
+If you know the cap was lifted early or want to retry the primary before the
+parsed reset time, clear the cooldown manually:
+
+```
+/clearcooldown                # lists active cooldowns and prints usage
+/clearcooldown zai            # clears one
+/clearcooldown all            # clears every active cooldown
+```
+
+The command is admin-only and ops-chat-gated. It persists immediately so the
+cleared state survives restart.
+
+## Observability Triple
+
+Every agent activity row now records three provider/model values:
+
+| Field                | Meaning                                                  |
+| -------------------- | -------------------------------------------------------- |
+| `configured_*`       | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`)    |
+| `effective_*`        | What was actually passed to pi (after fallback swap)     |
+| `actual_*`           | What pi reports having used (parsed from session JSONL)  |
+
+When fallback is active, `configured_*` and `effective_*` differ.
+`actual_*` should match `effective_*` for a successful run; a divergence
+suggests pi rewrote the model selection internally.
+
+## Behavior That Stays The Same
+
+- **Per-chat overrides** (`group.jailConfig.provider` / `.model`) are not
+  touched by the cooldown layer. If you have explicitly set a chat to a
+  specific provider, only that provider's cooldowns affect it.
+- **Cap detection is conservative** — the parser only matches the specific
+  zAI cap signature, not generic 429s, transport errors, or rate-limit
+  responses from other providers. This is intentional to avoid false
+  positives. If you need the same behavior for another provider, the
+  pattern lives in `parseProviderCapError()` in `src/provider-fallback.ts`.
+
+## When Fallback Is Not Configured
+
+If a primary provider hits its cap and `LLM_FALLBACK_PROVIDER` is unset:
+
+- The cooldown is still tracked.
+- Runs continue to use the primary and continue to fail until reset.
+- Logs include a clear warning: `Provider in cooldown but no fallback configured; passing through`.
+- `/policy` will show the cooldown line without a fallback target.
+
+This is intentional — the fallback is opt-in. Without it, you fail visibly
+rather than silently routing to a wrong provider.
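
Reviewer's sketch, not part of this commit: the four steps under "How Cooldowns Work", plus the pass-through behavior when no fallback is configured, could be modeled as below. The real parser is `parseProviderCapError()` in `src/provider-fallback.ts` and is not shown in this diff; the regexes, type names, and function names here are assumptions based on the cap message quoted in the doc.

```typescript
interface Runtime {
  provider: string;
  model: string;
}

interface Cooldown {
  provider: string;
  until: Date;
  reason: string;
}

// Assumed patterns, modeled on the quoted "429 Usage limit reached ..." message.
const CAP_SIGNATURE = /429 Usage limit reached/;
const RESET_STAMP =
  /reset\s+at\s+(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/;

// Returns a cooldown when `output` carries the cap signature, else null.
function parseCapErrorSketch(
  output: string,
  provider: string,
  now: Date,
  defaultCooldownSeconds = 3600,
): Cooldown | null {
  if (!CAP_SIGNATURE.test(output)) return null;
  const m = RESET_STAMP.exec(output);
  const until = m
    ? // The reset stamp is treated as local time, as the doc states.
      new Date(
        Number(m[1]), Number(m[2]) - 1, Number(m[3]),
        Number(m[4]), Number(m[5]), Number(m[6]),
      )
    : // No parseable stamp: use the default cooldown instead.
      new Date(now.getTime() + defaultCooldownSeconds * 1000);
  return { provider, until, reason: output.trim() };
}

// Picks the runtime to spawn with. The fallback is used only while the
// configured provider has a live cooldown and a fallback is configured.
function effectiveRuntimeSketch(
  configured: Runtime,
  fallback: Runtime | null,
  cooldowns: Map<string, Cooldown>,
  now: Date,
): Runtime {
  const cooldown = cooldowns.get(configured.provider);
  const live =
    cooldown !== undefined && cooldown.until.getTime() > now.getTime();
  if (!live) return configured;
  return fallback ?? configured; // no fallback configured: pass through
}
```

Expiry is checked at read time in this sketch, so an expired entry needs no cleanup step before the primary is used again.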
--- a/docs/public/operate/structured-reports.md
+++ b/docs/public/operate/structured-reports.md
@@ -0,0 +1,168 @@
+---
+title: 'Structured Reports'
+description: The Observed / Interpretation / Operator Notes pattern, the report family, and the free-text routing layer.
+---
+
+The agent's operator-facing reports follow a single template so an operator
+or a peer agent can read any of them at a glance and know what is observed
+fact, what is interpretation, and what action (if any) is suggested.
+
+## In Plain Language
+
+- A **structured report** is a deterministic snapshot of one slice of the
+  system (disk, services, tasks, budget, publish state, build/test status).
+- Reports are built from **raw inputs** — DB rows, command output, JSON
+  status files — by a **pure builder function**. The builder has no side
+  effects and is unit-tested independently of how the report is delivered.
+- The result is rendered to HTML for Telegram and could equally be rendered
+  to JSON for a dashboard or to plain text for a CLI.
+- When the agent answers an ops question, it reads the structured report
+  rather than narrating from memory. This matters because memory drifts;
+  ZFS pool capacity does not.
+
+## The Three-Section Template
+
+Every structured report has the same three top-level sections:
+
+### Observed
+
+What the report measured, with no interpretation. For example: ZFS shows
+pool A at 87% capacity; the build status file says `status: "fail"`; the
+last task in the queue was created at 10:23.
+
+This section is the source of truth for the rest of the report. If
+`Observed` is empty, the underlying probe failed and the report says so.
+
+### Interpretation
+
+A handful of `findings` extracted from `Observed`, each tagged `info`,
+`warn`, or `error`. For example: "Pool A is at 87% capacity." "Tests last
+run failed. 12 failing tests." "No active controlplane tasks are queued
+right now."
+
+Findings are short, factual, and avoid recommending action. Their job is
+to reduce a wall of data to the few signals that matter.
+
+### Operator Notes
+
+Conditional suggestions, each labeled `note` or `action`. For example:
+"Largest snapshot: `tank/data@2026-04-20-weekly` (4.2 GB). Remove only if
+that rollback point is no longer needed." "Re-run the test wrapper before
+relying on this as evidence the branch is green."
+
+Notes are *suggestions*, not commands. They include the **conditional**
+that makes the action correct ("only if X"), so an operator can decide
+without re-deriving the context.
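+
+The three sections map naturally onto a typed object. The following is a
+sketch of one possible shape, not the interfaces in `src/reports/`: the
+type names, the field names other than `findings`, the `null` convention
+for a failed probe, and the severity and label chosen for the example are
+all assumptions.
+
+```typescript
+// Illustrative types only; the real interfaces may differ in names and fields.
+type FindingSeverity = "info" | "warn" | "error";
+type NoteKind = "note" | "action";
+
+interface Finding {
+  severity: FindingSeverity;
+  message: string;
+}
+
+interface OperatorNote {
+  kind: NoteKind;
+  message: string; // carries its own conditional ("only if X")
+}
+
+interface StructuredReport<Observed> {
+  observed: Observed | null; // assumed: null when the underlying probe failed
+  findings: Finding[];
+  operatorNotes: OperatorNote[];
+}
+
+const exampleReport: StructuredReport<{ pool: string; capacityPct: number }> = {
+  observed: { pool: "A", capacityPct: 87 },
+  // Severity "warn" is assumed for illustration.
+  findings: [{ severity: "warn", message: "Pool A is at 87% capacity." }],
+  operatorNotes: [
+    {
+      kind: "action", // label assumed for illustration
+      message:
+        "Largest snapshot: tank/data@2026-04-20-weekly (4.2 GB). " +
+        "Remove only if that rollback point is no longer needed.",
+    },
+  ],
+};
+```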
+
+## The Report Family
+
+| Report     | Module                                | Slash command    | Source                                                   |
+| ---------- | ------------------------------------- | ---------------- | -------------------------------------------------------- |
+| System     | `src/reports/system-report.ts`        | `/report`        | `hostd` probes + controlplane auth probe                 |
+| Disk       | `src/reports/disk-report.ts`          | `/disk`          | `zpool list -H` + `zfs list -H -o name,usedsnap`         |
+| Tasks      | `src/reports/tasks-report.ts`         | `/tasks`         | `getAllTasks()` (Postgres)                               |
+| Budget     | `src/reports/budget-report.ts`        | `/budgetreport`  | `getAllBudgets()` + `getAgentTokenAnalytics()`           |
+| Publish    | `src/reports/publish-report.ts`       | `/publishreport` | tenant registry + webroot inspection                     |
+| Test/Build | `src/reports/test-report.ts`          | `/testreport`    | `tmp/status/build-status.json` + `test-status.json`      |
+
+Each module exports two functions:
+
+```ts
+buildXxxReport(inputs)  // pure: takes raw inputs, returns a typed report
+renderXxxReport(report) // pure: takes the report, returns an HTML string
+```
+
+The split lets you unit-test the analysis without touching IO and reuse
+the builder against a JSON sink later.
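+
+A minimal sketch of the builder/renderer split follows. It is not one of
+the real report modules: `buildDemoReport`, `renderDemoReport`, the input
+shape, and the 80% threshold are assumptions chosen for illustration.
+
+```typescript
+// Hypothetical example; input shape, threshold, and wording are assumptions.
+interface PoolInput {
+  name: string;
+  capacityPct: number;
+}
+interface DemoFinding {
+  severity: "info" | "warn" | "error";
+  message: string;
+}
+interface DemoReport {
+  observed: PoolInput[];
+  findings: DemoFinding[];
+}
+
+const WARN_THRESHOLD_PCT = 80; // assumed for illustration
+
+// Pure: the same inputs always produce the same report, and no IO happens.
+function buildDemoReport(pools: PoolInput[]): DemoReport {
+  const findings = pools
+    .filter((p) => p.capacityPct >= WARN_THRESHOLD_PCT)
+    .map((p): DemoFinding => ({
+      severity: "warn",
+      message: `Pool ${p.name} is at ${p.capacityPct}% capacity.`,
+    }));
+  return { observed: pools, findings };
+}
+
+function escapeHtml(s: string): string {
+  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
+}
+
+// Pure: report in, HTML string out.
+function renderDemoReport(report: DemoReport): string {
+  const lines = report.findings.map(
+    (f) => `[${f.severity}] ${escapeHtml(f.message)}`,
+  );
+  return ["<b>Interpretation</b>", ...lines].join("\n");
+}
+```
+
+Because neither function touches the database or the shell, a unit test
+can call `buildDemoReport()` with synthetic pools and compare the result
+directly.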
+
+## Test/Build Pipeline
+
+`/testreport` is the only report whose source of truth is a file the agent
+does not write itself. The contract:
+
+1. `scripts/write-test-build-status.sh` runs `npm run build` and
+   `npx vitest run --reporter=json --outputFile=...`, or only one of them
+   when called with a `build` or `tests` argument.
+2. The wrapper writes two JSON files into the **status directory**:
+   - `<status-dir>/build-status.json`
+   - `<status-dir>/test-status.json`
+
+   The status directory resolves with this precedence (matched by both the
+   wrapper and `getDefaultStatusDir()` in `src/reports/test-report.ts`):
+
+   1. `$AGENT_STATUS_DIR` if set
+   2. `$CLAWDIE_VAR_DIR` if set (legacy)
+   3. `<project-root>/tmp/status` (default)
+
+   Per `AGENTS.md` § "Temporary File Storage", artifact paths under repo
+   `tmp/` are the preferred default — point `$AGENT_STATUS_DIR` elsewhere
+   only if you have a reason to.
+
+3. `/testreport` reads both files, builds the report, renders it.
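+
+The status-directory precedence in step 2 can be sketched as below. This
+is an illustration of the documented order, not the body of
+`getDefaultStatusDir()`: the name `resolveStatusDir` and its signature are
+hypothetical, and treating an empty variable as unset is an assumption.
+
+```typescript
+import * as path from "node:path";
+
+// Hypothetical helper mirroring the documented precedence.
+function resolveStatusDir(
+  env: Record<string, string | undefined>,
+  projectRoot: string,
+): string {
+  const explicit = env["AGENT_STATUS_DIR"];
+  if (explicit) return explicit; // 1. explicit override
+  const legacy = env["CLAWDIE_VAR_DIR"];
+  if (legacy) return legacy; // 2. legacy location
+  return path.join(projectRoot, "tmp", "status"); // 3. default under repo tmp/
+}
+```
+
+A caller would typically pass `process.env` and the repository root, for
+example `resolveStatusDir(process.env, process.cwd())`.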
+
+The schema for each file is intentionally narrow. `status` is one of
+`"ok"`, `"fail"`, or `"unknown"`; the example shows a passing run:
+
+```json
+{
+  "status": "ok",
+  "completedAt": "2026-04-26T10:00:00Z",
+  "command": "npx vitest run",
+  "exitCode": 0,
+  "durationMs": 12345,
+  "totalTests": 1934,
+  "failingTests": 0,
+  "skippedTests": 0,
+  "failingTestNames": ["..."],
+  "summary": "..."
+}
+```
+
+Only `status` and `completedAt` are required; everything else degrades
+gracefully. Files older than 6 hours surface as `stale` with a warn finding.
+Missing or malformed files surface as `status: "unknown"` with an action
+note rather than fabricating success.
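+
+The degradation rules above can be sketched as a small pure function. It
+is not the reader in `src/reports/test-report.ts`: `interpretStatusFile`,
+`LoadedStatus`, and the boolean `stale` flag are hypothetical, and the
+sketch takes the file contents as a string so it stays free of IO.
+
+```typescript
+// Hypothetical reader; names beyond the documented schema are assumptions.
+type Status = "ok" | "fail" | "unknown";
+
+interface LoadedStatus {
+  status: Status;
+  completedAt?: string;
+  stale: boolean;
+}
+
+const STALE_AFTER_MS = 6 * 60 * 60 * 1000; // 6 hours, per the text above
+
+function isStatus(value: unknown): value is Status {
+  return value === "ok" || value === "fail" || value === "unknown";
+}
+
+// raw is the file contents, or null when the file is missing.
+function interpretStatusFile(raw: string | null, now: Date): LoadedStatus {
+  const unknownResult: LoadedStatus = { status: "unknown", stale: false };
+  if (raw === null) return unknownResult;
+  let parsed: unknown;
+  try {
+    parsed = JSON.parse(raw);
+  } catch {
+    return unknownResult; // malformed JSON never becomes a fake success
+  }
+  if (typeof parsed !== "object" || parsed === null) return unknownResult;
+  const { status, completedAt } = parsed as Record<string, unknown>;
+  // Only the two required fields are validated; the rest is optional.
+  if (!isStatus(status) || typeof completedAt !== "string") return unknownResult;
+  const completedMs = Date.parse(completedAt);
+  if (Number.isNaN(completedMs)) return unknownResult;
+  return {
+    status,
+    completedAt,
+    stale: now.getTime() - completedMs > STALE_AFTER_MS,
+  };
+}
+```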
+
+The pre-commit and post-commit hooks call this wrapper so commit messages
+include a `Build: pass | Tests: 12 failed | 1936 passed (1948)` footer
+visible in `git log`.
+
+## Free-Text Routing
+
+When the agent receives a bot-addressed message, `classifyReportIntent()` in
+`src/report-intent.ts` checks a set of conservative regexes and routes to
+the matching structured report instead of the LLM path. This means an
+operator typing "how much disk?" gets a fresh `/disk` snapshot, not a
+half-remembered narrative from a session three days ago.
+
+The routing rules are intentionally **narrow** (false negatives are fine,
+false positives are not). For broader detection of "this prompt smells
+operational", a separate `isOpsFlavored()` matcher casts a wider net over
+phrasings (services, jails, deploy, controlplane terms, etc.). It is
+used to **suppress memory injection** on those prompts so the LLM answers
+from live tools rather than narrative recall.
+
+| Function                     | Use                                                                  |
+| ---------------------------- | -------------------------------------------------------------------- |
+| `classifyReportIntent(text)` | Hard route → structured report. Only fires on confident phrasings.   |
+| `isOpsFlavored(text)`        | Soft signal → drop memory injection. Wider net, lower bar.           |
+
+Both ignore slash-command messages (those are routed by grammY), and
+`@assistant` mentions are stripped before matching.
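+
+The difference between the hard route and the soft signal can be sketched
+as follows. The real patterns in `src/report-intent.ts` are not reproduced
+here: every regex, the `ReportKind` type, and the `*Sketch` function names
+are hypothetical stand-ins.
+
+```typescript
+// Hypothetical patterns and names; the real matchers are not shown here.
+type ReportKind = "disk" | "tasks";
+
+// Narrow: anchored patterns that only fire on confident phrasings.
+const HARD_ROUTES: Array<[RegExp, ReportKind]> = [
+  [/^how much disk\b/i, "disk"],
+  [/^(show|list) (the )?tasks\b/i, "tasks"],
+];
+
+// Wide: any operational keyword anywhere in the message.
+const OPS_HINTS = /\b(disk|services?|jails?|deploy|controlplane)\b/i;
+
+function normalize(text: string): string {
+  return text.replace(/@assistant\b/gi, "").trim();
+}
+
+function classifyIntentSketch(text: string): ReportKind | null {
+  if (text.startsWith("/")) return null; // slash commands are routed elsewhere
+  const cleaned = normalize(text);
+  for (const [pattern, kind] of HARD_ROUTES) {
+    if (pattern.test(cleaned)) return kind;
+  }
+  return null; // a false negative falls through to the LLM path
+}
+
+function isOpsFlavoredSketch(text: string): boolean {
+  if (text.startsWith("/")) return false;
+  return OPS_HINTS.test(normalize(text));
+}
+```
+
+In this sketch "is the deploy healthy?" does not hard-route to a report
+but still counts as ops-flavored, which is the gap the soft signal is
+meant to cover.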
+
+## Why Pure Builders
+
+The pure builder pattern was a deliberate choice over a one-shot
+"render-to-HTML now" approach. Three reasons:
+
+- **Testable** — unit tests exercise the analysis logic with synthetic
+  inputs, no Postgres or pi running.
+- **Reusable** — the same `buildDiskReport()` could feed a dashboard widget
+  or a daily email digest later. We are not committed to Telegram as the
+  only sink.
+- **Inspectable** — when an operator asks "why did the report flag this?",
+  the answer is a `findings[]` array with explicit codes, not opaque text
+  generation.
+
+If you add a new report, follow the same shape: a `Report` interface
+carrying `findings: XxxReportFinding[]` and
+`operatorNotes: XxxReportOperatorNote[]`, a `buildXxxReport()` function
+that returns it, a `renderXxxReport()` HTML renderer, and a `*.test.ts`
+covering the builder independently.