docs: integrate operator observability + provider fallback work

Brings the public docs in line with what shipped on multitenant over the
last few days. Three new operator-facing pages, four updates to existing
ones, and a CHANGELOG batch.

New pages (docs/public/operate/):
- operator-commands.md — single reference for all Telegram slash commands,
  grouped by purpose (status, structured reports, runtime, sessions, admin
  actions) with auth gating per command. Previously only in-bot /help text.
- provider-fallback.md — operator guide for the cooldown layer: env vars,
  how cooldowns are detected and tracked, /policy surfacing, /clearcooldown
  for manual release, the configured/effective/actual observability triple.
  Includes a "path convention note" flagging that the cooldown file still
  uses the legacy $CLAWDIE_VAR_DIR resolution while test/build status
  files have moved to repo tmp/ — divergence to harmonize later in code.
- structured-reports.md — explains the Observed/Interpretation/Operator
  Notes pattern, lists the six structured reports, documents the
  test/build pipeline contract (status JSON schema + new $AGENT_STATUS_DIR
  → $CLAWDIE_VAR_DIR → tmp/status precedence Codex landed in 1389e17),
  and covers free-text routing (classifyReportIntent + isOpsFlavored).

Updates:
- monitoring.md: appended "Operator-Facing Reports" section pointing at
  the new structured-reports page, and "Provider Fallback Health" pointing
  at the fallback page.
- operate/index.md: added the three new pages to the runbook list.
- architecture/controlplane.md: added "Runtime Observability" section
  documenting the configured/effective/actual triple and linking to the
  new operate pages.
- README.md: expanded the Telegram Commands table (was 10 rows, missing
  every structured report, /policy, /clearcooldown, /budgetreset) and
  added a pointer to operator-commands.md as the full reference. Also
  noted free-text routing.
- CHANGELOG.md: appended an "operator observability + provider fallback,
  apr.2026" batch under [Unreleased] covering provider fallback, the
  reports family, the test/build wrapper pipeline, free-text routing,
  /clearcooldown, the observability triple, the Telegram setMyCommands
  menu, and the new "Verify Before Claiming Remote State" rule in
  AGENTS.md.

No code changes. Slovenian sl/ mirror left untouched (out of localization
scope).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---
Build: pass | Tests: FAIL — Tests  8 failed | 1940 passed (1948)

---
Build: pass | Tests: FAIL — Tests  2 failed | 1949 passed (1951)
Operator & claude 2026-04-26 12:58:44 +02:00 committed by Test
parent 1389e17ec4
commit 3828e5ce83
8 changed files with 544 additions and 12 deletions

@@ -54,6 +54,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `README.md`, `CLAWDIE-ISO.md`, `AGENTS.md` synced to mention the agent-CLI prereq gate and the npm-globals bundle path
- `AGENTS.md` + nginx/freebsd-admin skills updated with controlplane dashboard build notes (Paperclip UI) and Tailscale proxy/PF pointers
### Added (operator observability + provider fallback, apr.2026)
- **Provider fallback layer** (`src/provider-fallback.ts`) — automatically swaps the configured LLM provider for an operator-defined fallback when the primary hits a usage cap. Detects `429 Usage limit reached` from pi stderr/stdout, parses `Your limit will reset at YYYY-MM-DD HH:MM:SS`, and marks a cooldown until the reset timestamp passes. Cooldowns are in-memory plus persisted to `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default `$HOME/.clawdie/state/`) so a restart inside the cap window does not re-trip the cap. Wired into `agent-runner.ts` (main chat) and `controlplane-heartbeat.ts` (specialists). Per-chat overrides (`group.jailConfig.provider`) are unchanged — only the spawn-time effective values are swapped while the cooldown is live.
- `LLM_FALLBACK_PROVIDER`, `LLM_FALLBACK_MODEL`, `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` config — operator picks the fallback (e.g. `openrouter` + a free-tier model). Default cooldown (3600s) is used only when the cap message has no parseable reset stamp.
- `getLlmKeyForProvider(provider)` (`src/env.ts`) — provider-aware secret resolution so the right API key is injected when fallback swaps providers; falls back to first-available when the requested key is absent.
- Startup validation: when `LLM_FALLBACK_PROVIDER` is set, the matching API key is now in the `criticalConfig` warn list. Warns separately when `LLM_FALLBACK_PROVIDER` is set without `LLM_FALLBACK_MODEL`.
- `/clearcooldown` admin command (ops-chat-gated) — lists active cooldowns when called without args; takes `<provider>` or `all`. Persists immediately so cleared state survives restart.
- `/policy` now shows a `Provider cooldown: <provider> until <iso> → fallback <provider/model>` line for each active cooldown.
- Activity payload now records `effective_provider` / `effective_model` next to `actual_*` so for any run you can read configured vs effective vs actual.
- **Structured operator reports family** with consistent `Observed` / `Interpretation` / `Operator Notes` sections — `src/reports/{system,disk,tasks,budget,publish,test}-report.ts`. Each report is a pure builder + renderer fed by raw inputs (DB rows, command output, JSON status files), tested independently of the wiring layer.
- `/report`, `/disk`, `/tasks`, `/budgetreport`, `/publishreport`, `/testreport` Telegram commands — the structured-report surfaces.
- **Test/build status pipeline** (`scripts/write-test-build-status.sh`): runs the project's `npm run build` and `npx vitest run --reporter=json --outputFile=...`, then writes `build-status.json` and `test-status.json` to the status directory: `$AGENT_STATUS_DIR` (primary) → `$CLAWDIE_VAR_DIR` (legacy) → `<project-root>/tmp/status` (default). `/testreport` reads these files; missing or stale (>6h) files degrade to `unknown` with an action note rather than fabricating success. Pre-commit/post-commit hooks append the latest status to commit messages so reviewers see what was passing at commit time.
- **Free-text ops routing** (`src/report-intent.ts`) — bot-addressed phrasings like "disk usage", "are the tests passing", "what tasks do we have", "budget report" are classified by `classifyReportIntent()` and routed to the matching structured builder instead of the LLM path. Keeps memory/narrative recall from overriding a fresh probe.
- `isOpsFlavored()` — broader pattern matcher used to suppress stale memory injection on ops-flavored prompts so the LLM answers from live tools rather than narrative recall.
- **Specialist capability gate** (`src/agent-capabilities.ts`) — pre-flight check that compares the requested skill (and task description) against the assigned jail's installed tools, refusing the run with a clear reason when the agent cannot perform it.
- Telegram bot now publishes a proper command menu via `setMyCommands` with separate command lists for private chats vs the ops chat (`src/channels/telegram.ts`).
- `AGENTS.md` § "Verify Before Claiming Remote State" — convention requiring `git fetch` before reporting on any remote ref. Born from a real two-agent confusion on 26.apr where stale `origin/multitenant` refs in two worktrees produced contradictory "no new remote work" claims.
### Changed (operator observability)
- Many Telegram commands moved from `requireRegistered(ctx)` gate to direct chat resolution; per-handler `requireAdmin` / `requireOpsChat` still enforce auth. Effect: admins can run read-only ops commands from any chat without registering it first.
- `/status` ZFS section caps at 8 lines with a "… N more dataset(s) hidden" footer.
- `parseBastilleList` consolidated to use the shared `bastille-list.ts` parser. `summarizeZfsRows` extracted as a pure exportable helper.
### Fixed (operator observability)
- `/report` controlplane probe: when `CONTROLPLANE_BIND_HOST=0.0.0.0`, `getControlplaneProbeHost()` now derives a reachable host from `BETTER_AUTH_URL` instead of probing the wildcard address. Previously the report would say "controlplane unreachable" even when controlplane was healthy.
- Test artifacts now write to repo-local `tmp/` instead of system `/tmp` (per `AGENTS.md` § "Temporary File Storage").
## [0.10.0] - 2026-04-07
### Paperclip Control Plane Integration

@@ -450,18 +450,36 @@ From the main channel (your self-chat), you can manage groups and tasks:
## Telegram Commands
| Command | Description | Auth |
| ------------- | -------------------------------------------------- | ------ |
| `/status` | System status: jails, ZFS, PF, budget, model | anyone |
| `/usage` | Per-agent token budget breakdown | anyone |
| `/compact` | Compact session (summarize old, keep recent turns) | admin |
| `/new` | Hard reset session, start fresh | admin |
| `/resume` | Unpause a budget-paused chat | admin |
| `/stop` | Kill running agent mid-response | admin |
| `/tts` | Toggle voice replies (on/off/status/default) | admin |
| `/activation` | Set trigger mode (always/mention) | admin |
| `/whoami` | Show your Telegram identity | anyone |
| `/help` | List available commands | anyone |
A short selection — for the full reference (status, structured reports,
runtime, sessions, admin actions, free-text routing) see
[Operator Commands](docs/public/operate/operator-commands.md).
| Command | Description | Auth |
| ---------------- | -------------------------------------------------------------- | --------- |
| `/status` | System summary: jails, ZFS, PF, budget, model | anyone |
| `/report` | Structured system + auth report | admin |
| `/disk` | Structured ZFS pool + snapshot report | admin |
| `/tasks` | Structured controlplane task report | admin |
| `/budgetreport` | Structured budget + token analytics | admin |
| `/publishreport` | Structured tenant publish/content report | admin |
| `/testreport` | Structured build + test status (from wrapper-written JSON) | admin |
| `/policy` | Default runtime, per-chat overrides, fallback cooldowns | anyone |
| `/usage` | Per-agent token budget breakdown | anyone |
| `/clearcooldown` | Clear a [provider fallback](docs/public/operate/provider-fallback.md) cooldown | ops chat |
| `/budgetreset` | Reset agent token budget | ops chat |
| `/compact` | Compact session (summarize old, keep recent turns) | admin |
| `/new` | Hard reset session, start fresh | admin |
| `/resume` | Unpause a budget-paused chat | admin |
| `/stop` | Kill running agent mid-response | admin |
| `/tts` | Toggle voice replies (on/off/status/default) | admin |
| `/activation` | Set trigger mode (always/mention) | admin |
| `/whoami` | Show your Telegram identity | anyone |
| `/help` | List available commands | anyone |
The bot also routes **free-text ops phrasings** ("disk usage", "are the
tests passing", "task report", etc.) to the matching structured report
instead of the LLM path — see
[Structured Reports → Free-Text Routing](docs/public/operate/structured-reports.md#free-text-routing).
### Session Compaction

@@ -136,9 +136,30 @@ just setup-controlplane
---
## Runtime Observability
Every agent run (orchestrator main chat or specialist heartbeat) records
three provider/model values in `agent_activity.payload`:
| Field | Meaning |
| -------------- | --------------------------------------------------------- |
| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
| `effective_*` | What was actually passed to pi (after fallback swap) |
| `actual_*` | What pi reports having used (parsed from session JSONL) |
`configured_*` and `effective_*` differ when [provider fallback](../operate/provider-fallback/)
is active (cooldown is live, runtime is using the operator's chosen
fallback). `actual_*` should match `effective_*` for a successful run; a
divergence suggests pi rewrote the model selection internally.
`/budgetreport` and `/tokens` surface these values; `/policy` shows the
fallback cooldown line when one is active.
## References
- `doc/CONTROLPLANE-ARCHITECTURE.md` — detailed service layout
- `doc/CONTROLPLANE-MESSAGE-CONTRACT.md` — API contracts (what agents query and post)
- `doc/CONTROLPLANE-AGENT-ROLES.md` — role definitions, skill mappings, budgets
- `SOUL.md`, `SYSADMIN_AGENT.md`, `DB_ADMIN_AGENT.md`, `GIT_ADMIN_AGENT.md` — agent identity files
- [Provider Fallback](../operate/provider-fallback/) — automatic provider switching when the primary hits a usage cap
- [Structured Reports](../operate/structured-reports/) — operator-facing report family + free-text routing

@@ -7,5 +7,8 @@ Runbooks for day-to-day operation and recovery.
- [Security](./security/)
- [Monitoring](./monitoring/)
- [Operator Commands](./operator-commands/)
- [Structured Reports](./structured-reports/)
- [Provider Fallback](./provider-fallback/)
- [DB disaster recovery](./db-disaster-recovery/)
- [Git storage](./git-storage/)

@@ -145,3 +145,37 @@ Bastille monitor and Clawdie doctor solve different problems:
- **Clawdie doctor** — application, pipeline, and control plane health
Use both; don't confuse them.
## Operator-Facing Reports
Beyond the runtime health files above, the agent exposes a family of
**structured reports** for operator inspection on demand. Each report has a
matching Telegram slash command and follows the same `Observed` /
`Interpretation` / `Operator Notes` template — see
[Structured Reports](./structured-reports/) for the design and the full list.
| Report | Command | What it answers |
| ---------- | ----------------- | --------------------------------------------------- |
| System | `/report` | Are services + jails + controlplane healthy? |
| Disk | `/disk` | What is consuming ZFS pool space and snapshots? |
| Tasks | `/tasks` | What is in the controlplane task queue? |
| Budget | `/budgetreport` | Token budgets and burn analytics |
| Publish | `/publishreport` | Tenant publish/content state |
| Test/Build | `/testreport` | Was the last build/test run green? |
`/testreport` is fed by `scripts/write-test-build-status.sh`, not by the
running process — invoke the wrapper from CI, a hook, or by hand to refresh
its status files. The pre-commit and post-commit hooks run it automatically
so each commit message footer reflects what was passing at commit time.
For the full operator command reference (status, sessions, admin actions,
free-text routing), see [Operator Commands](./operator-commands/).
## Provider Fallback Health
When the configured LLM provider is in cooldown (e.g. zAI usage cap), the
agent transparently routes to the operator-defined fallback. Active
cooldowns are visible in `/policy` and as structured `logger.warn` lines on
every fallback-active run. See [Provider Fallback](./provider-fallback/) for
configuration, manual release (`/clearcooldown`), and the
configured / effective / actual observability triple.

@@ -0,0 +1,118 @@
---
title: 'Operator Commands'
description: Reference for the Telegram slash commands operators use to inspect and control the running agent.
---
The agent exposes its operational surface as Telegram slash commands. This
page is the single reference for what each command does, who can run it,
and which underlying surface it inspects. The Telegram bot also publishes a
native command menu via `setMyCommands` — start typing `/` in any chat for
the live in-app list.
## Authorization Layers
Three layers gate the commands. A command may pass through one, two, or all
three:
| Gate | Where | Effect |
| ------------------- | -------------------------------------- | --------------------------------------------- |
| `requireAdmin` | Per-handler | Only operators on the admin allow-list run it |
| `requireOpsChat` | Per-handler (write/destructive only) | Only the configured ops chat may invoke it |
| Per-chat overrides | `group.jailConfig` (registered groups) | Per-chat model/provider overrides |
Read-only commands (`/status`, `/disk`, `/report`, `/testreport`, etc.) are
admin-gated but not ops-chat-gated — admins can run them from any chat.
Destructive commands (`/budgetreset`, `/clearcooldown`) require the ops chat.
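As a rough illustration of how the two per-handler gates compose, here is a minimal sketch. The `Ctx` and `AuthConfig` shapes and every `*Sketch` name are hypothetical stand-ins for illustration, not the handlers in the codebase, and the assumption that destructive commands pass through both gates is taken from the description of `/clearcooldown` as admin-only and ops-chat-gated.

```typescript
// Hypothetical shapes; the real handler context will differ.
interface Ctx {
  userId: number;
  chatId: number;
}

interface AuthConfig {
  adminIds: ReadonlySet<number>;
  opsChatId: number;
}

// A gate returns null when the caller may proceed, or a refusal reason.
type Gate = (ctx: Ctx, cfg: AuthConfig) => string | null;

const requireAdminSketch: Gate = (ctx, cfg) =>
  cfg.adminIds.has(ctx.userId) ? null : "admin only";

const requireOpsChatSketch: Gate = (ctx, cfg) =>
  ctx.chatId === cfg.opsChatId ? null : "ops chat only";

// Run the gates in order and stop at the first refusal.
function checkGates(gates: Gate[], ctx: Ctx, cfg: AuthConfig): string | null {
  for (const gate of gates) {
    const refusal = gate(ctx, cfg);
    if (refusal !== null) return refusal;
  }
  return null;
}

// Read-only report: admin gate only. Destructive command: both gates (assumed).
const readOnlyGates: Gate[] = [requireAdminSketch];
const destructiveGates: Gate[] = [requireAdminSketch, requireOpsChatSketch];
```

Returning a reason string instead of a boolean lets the handler tell the caller which gate refused the command.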
## Status & Identity
| Command | Purpose | Surface |
| ------------- | ------------------------------------------------------------- | --------------------------------------------- |
| `/ping` | Confirm the bot process is responsive | Direct reply |
| `/chatid` | Print the current chat's JID | Useful for `.env` registration |
| `/whoami` | Show your Telegram identity | Confirms admin-allowlist match |
| `/status` | Compact system summary (jails, ZFS pools, PF, budget) | `src/system-state.ts` snapshot |
## Structured Reports
All structured reports follow the same `Observed` / `Interpretation` /
`Operator Notes` template. See [Structured Reports](./structured-reports/) for
the design pattern.
| Command | Report | Source |
| ---------------- | ----------------------------------------------------- | ----------------------------------------------------------------------- |
| `/report` | System & auth — services, jails, PF, controlplane | `hostd` probes + `probeControlplaneAuth()` |
| `/disk` | ZFS pools and snapshots | `zpool list -H` + `zfs list -H -o name,usedsnap` |
| `/tasks` | Controlplane task queue | `getAllTasks()` (Postgres) |
| `/budgetreport` | Token budgets and burn analytics | `getAllBudgets()` + `getAgentTokenAnalytics()` |
| `/publishreport` | Tenant publish/content state | `loadTenantRegistry()` + webroot inspection |
| `/testreport` | Build and test pass/fail | `tmp/status/build-status.json` + `tmp/status/test-status.json` |
`/testreport` is fed by `scripts/write-test-build-status.sh` — see
[Structured Reports](./structured-reports/#test-build-pipeline) for the
write/read contract.
## Runtime & Policy
| Command | Purpose |
| ----------------- | ------------------------------------------------------------------------- |
| `/policy` | Active runtime policy (default model, overrides, cooldowns, budget state) |
| `/budget` | Alias for `/policy` |
| `/usage` | Token budget per agent |
| `/tokens` | Runtime token burn per agent (last-N analytics) |
| `/model` | Set provider/model for this chat (per-chat override) |
| `/activation` | Set trigger mode (always-respond vs mention-only) |
| `/tts` | Toggle voice replies (`on` / `off` / `status`) |
`/policy` shows the [Provider fallback](./provider-fallback/) cooldown line
when one is active.
## Sessions
| Command | Purpose |
| ------------- | ------------------------------------------------------------------ |
| `/new` | Reset this chat's session |
| `/compact` | Compact the session (summarize old, keep recent) |
| `/stop` | Stop a running agent for this chat |
| `/resume` | Resume a budget-paused chat |
## Admin Actions (Ops-Chat Only)
| Command | Purpose |
| ---------------------- | ------------------------------------------------------------------------ |
| `/budgetreset <id>` | Reset an agent's token budget. `all` requires `confirm` second arg. |
| `/clearcooldown [id]` | Clear a [provider fallback](./provider-fallback/) cooldown |
| `/audit` | Platform ownership audit (which jail/dataset/service belongs to which) |
| `/snapshots [dataset]` | List ZFS snapshots |
| `/scrub <pool> [op]` | ZFS scrub controls (`status` / `start` / `stop`) |
| `/updates` | FreeBSD base + ports update status |
| `/schedule` | Manage scheduled agent tasks (list / add / cancel / done) |
## Free-Text Routing
The bot recognizes **bot-addressed** ops-flavored phrasings without requiring
a slash command. Examples that route to structured reports instead of the LLM
path:
| Phrase | Routed to |
| -------------------------------------- | --------------- |
| `disk usage`, `how much disk` | `/disk` |
| `task report`, `active tasks` | `/tasks` |
| `budget report`, `how many tokens` | `/budgetreport` |
| `are the tests passing`, `build status`| `/testreport` |
| `system report`, `report please` | `/report` |
This keeps memory or narrative recall from drifting into a stale answer when
fresh structured data is available. The full pattern set lives in
`classifyReportIntent()` in `src/report-intent.ts`.
A broader `isOpsFlavored()` matcher also suppresses memory injection on any
ops-flavored prompt (services, jails, deploy, auth, controlplane terms),
even when no specific report matches — so the LLM answers from live tools
rather than narrative recall.
## Help
`/help` prints the in-bot command list. The list is generated from the same
constants that drive the Telegram menu publication, so it reflects whatever
is currently registered.

@@ -0,0 +1,141 @@
---
title: 'Provider Fallback'
description: Automatic LLM provider switching when the primary provider hits a usage cap.
---
When the primary LLM provider returns a "usage cap reached" error, the agent
keeps replying instead of looping on 429s — it transparently switches to a
configured fallback until the cap window passes, then automatically returns to
the primary.
## In Plain Language
- Some LLM providers (notably zAI) impose rolling 5-hour usage caps. When you
hit one, every request fails until the reset.
- Without fallback, the bot would retry the capped provider on every message
and stay broken for hours.
- Fallback puts the capped provider in a "cooldown" until the reset timestamp,
routes new runs through your operator-chosen alternative (e.g. OpenRouter
with a free-tier model), and resumes the primary the moment the cooldown
expires.
- The cooldown survives a process restart so a quick service bounce inside the
cap window does not re-trip the cap.
## Configuration
Set in `.env`:
| Variable | Required | Example |
| --------------------------------------- | ---------------------------------------- | -------------------------------------------- |
| `LLM_FALLBACK_PROVIDER` | yes (when fallback is desired) | `openrouter` |
| `LLM_FALLBACK_MODEL` | recommended | `meta-llama/llama-3.3-70b-instruct:free` |
| `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` | optional (default `3600`) | `1800` |
The default cooldown is used **only** when the cap message has no parseable
reset stamp. Real zAI cap errors include the reset timestamp, so the cooldown
matches the reset exactly.
The fallback provider's API key (`OPENROUTER_API_KEY` for openrouter,
`ZAI_API_KEY` for zai, etc.) must also be set. The agent verifies this at
startup and warns in the logs if it is missing — the warning is the only
notice you will get before the fallback fails for real.
## How Cooldowns Work
1. A run fails with `429 Usage limit reached for 5 hour. Your limit will reset
at YYYY-MM-DD HH:MM:SS`.
2. The runner parses the reset timestamp (treated as local time) and stores
`{ provider: 'zai', until: <reset>, reason: <message> }` in memory and on
disk.
3. Every subsequent run consults the cooldown map *before* spawning pi. If the
configured provider is in cooldown, the spawn args swap to the fallback
provider/model.
4. The cooldown auto-expires at the reset timestamp. Next run uses the primary
again.
The cooldown file lives at `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default
`$HOME/.clawdie/state/provider-cooldowns.json`). Expired entries are dropped
on load.
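The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/provider-fallback.ts`: the names `parseResetStamp`, `markCooldownSketch`, and `pickProviderSketch` are hypothetical, the regex is inferred from the error text quoted in step 1, and persistence to disk is left out.

```typescript
// Hypothetical sketch of the cooldown flow; not the shipped implementation.
interface Cooldown {
  provider: string;
  until: Date;
  reason: string;
}

// Assumed pattern, inferred from the error text quoted on this page.
const RESET_RE =
  /Your limit will reset at\s+(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/;

// Returns the reset time, interpreted as local time, or null if absent.
function parseResetStamp(message: string): Date | null {
  const m = RESET_RE.exec(message);
  if (!m) return null;
  return new Date(
    Number(m[1]), Number(m[2]) - 1, Number(m[3]),
    Number(m[4]), Number(m[5]), Number(m[6]),
  );
}

// Steps 1-2: on a cap error, store a cooldown. The default duration is used
// only when the message carries no parseable reset stamp.
function markCooldownSketch(
  map: Map<string, Cooldown>,
  provider: string,
  message: string,
  now: Date,
  defaultSeconds: number,
): Cooldown | null {
  if (!message.includes("429 Usage limit reached")) return null;
  const until =
    parseResetStamp(message) ?? new Date(now.getTime() + defaultSeconds * 1000);
  const entry: Cooldown = { provider, until, reason: message };
  map.set(provider, entry);
  return entry;
}

// Steps 3-4: consult the map before spawning; expired entries are dropped.
function pickProviderSketch(
  map: Map<string, Cooldown>,
  configured: string,
  fallback: string | undefined,
  now: Date,
): string {
  const entry = map.get(configured);
  if (!entry) return configured;
  if (entry.until.getTime() <= now.getTime()) {
    map.delete(configured);
    return configured;
  }
  // No fallback configured: pass through to the capped provider.
  return fallback ?? configured;
}
```

Passing `now` in as a parameter, rather than reading the clock inside the functions, keeps both helpers deterministic and easy to test.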
> **Path convention note.** The cooldown file currently uses the legacy
> `$CLAWDIE_VAR_DIR` / `$HOME/.clawdie/state/` resolution. The newer
> [test/build status files](./structured-reports/#test-build-pipeline)
> moved to repo-local `tmp/` to align with `AGENTS.md` § "Temporary File
> Storage". A future code change should harmonize provider-fallback to the
> same precedence (`AGENT_STATUS_DIR` → `CLAWDIE_VAR_DIR` → `tmp/state/`).
> Until then, if you set `AGENT_STATUS_DIR`, also set `CLAWDIE_VAR_DIR` to
> the same path so both subsystems agree.
## Inspecting State
`/policy` shows active cooldowns under the runtime line:
```
Default runtime: zai / glm-4.6
Provider cooldown: zai until 2026-04-25T19:00:59 → fallback openrouter/meta-llama/llama-3.3-70b-instruct:free
```
When no cooldowns are active, the line is omitted — runtime looks normal.
Logs include structured warnings on every fallback-active run:
```
{ originalProvider: 'zai', fallbackProvider: 'openrouter', cooldownUntil: '...' } Provider fallback active — preferred provider is in cooldown
```
And on the run that *trips* the cooldown:
```
{ provider: 'zai', until: '2026-04-25T19:00:59', reason: '429 Usage limit reached; resets ...' } Provider cap detected — marking cooldown
```
## Manual Release
If you know the cap was lifted early or want to retry the primary before the
parsed reset time, clear the cooldown manually:
```
/clearcooldown # lists active cooldowns and prints usage
/clearcooldown zai # clears one
/clearcooldown all # clears every active cooldown
```
The command is admin-only and ops-chat-gated. It persists immediately so the
cleared state survives restart.
## Observability Triple
Every agent activity row now records three provider/model values:
| Field | Meaning |
| -------------------- | -------------------------------------------------------- |
| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
| `effective_*` | What was actually passed to pi (after fallback swap) |
| `actual_*` | What pi reports having used (parsed from session JSONL) |
When fallback is active, `configured_*` and `effective_*` differ.
`actual_*` should match `effective_*` for a successful run; a divergence
suggests pi rewrote the model selection internally.
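A small sketch of how the triple might be read when auditing a run. The field names follow the `configured_*` / `effective_*` / `actual_*` convention above, but the exact payload shape, the `RunProviders` interface, and the `readTriple` helper are assumptions made for illustration.

```typescript
// Assumed payload shape; only the prefix convention comes from the docs.
interface RunProviders {
  configured_provider: string;
  configured_model: string;
  effective_provider: string;
  effective_model: string;
  actual_provider: string;
  actual_model: string;
}

interface TripleReading {
  swapped: boolean; // configured differs from effective (e.g. fallback active)
  diverged: boolean; // effective differs from actual (pi chose something else)
}

function readTriple(p: RunProviders): TripleReading {
  const swapped =
    p.configured_provider !== p.effective_provider ||
    p.configured_model !== p.effective_model;
  const diverged =
    p.effective_provider !== p.actual_provider ||
    p.effective_model !== p.actual_model;
  return { swapped, diverged };
}
```

For a healthy fallback run, `swapped` would be true and `diverged` false; `diverged` being true is the case worth investigating.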
## Behavior That Stays The Same
- **Per-chat overrides** (`group.jailConfig.provider` / `.model`) are not
touched by the cooldown layer. If you have explicitly set a chat to a
specific provider, only that provider's cooldowns affect it.
- **Cap detection is conservative** — the parser only matches the specific
zAI cap signature, not generic 429s, transport errors, or rate-limit
responses from other providers. This is intentional to avoid false
positives. If you need the same behavior for another provider, the
pattern lives in `parseProviderCapError()` in `src/provider-fallback.ts`.
## When Fallback Is Not Configured
If a primary provider hits its cap and `LLM_FALLBACK_PROVIDER` is unset:
- The cooldown is still tracked.
- Runs continue to use the primary and continue to fail until reset.
- Logs include a clear warning: `Provider in cooldown but no fallback configured; passing through`.
- `/policy` will show the cooldown line without a fallback target.
This is intentional — the fallback is opt-in. Without it, you fail visibly
rather than silently routing to a wrong provider.

@@ -0,0 +1,168 @@
---
title: 'Structured Reports'
description: The Observed / Interpretation / Operator Notes pattern, the report family, and the free-text routing layer.
---
The agent's operator-facing reports follow a single template so an operator
or a peer agent can read any of them at a glance and know what is observed
fact, what is interpretation, and what action (if any) is suggested.
## In Plain Language
- A **structured report** is a deterministic snapshot of one slice of the
system (disk, services, tasks, budget, publish state, build/test status).
- Reports are built from **raw inputs** — DB rows, command output, JSON
status files — by a **pure builder function**. The builder has no side
effects and is unit-tested independently of how the report is delivered.
- The result is rendered to HTML for Telegram and could equally be rendered
to JSON for a dashboard or to plain text for a CLI.
- When the agent answers an ops question, it reads the structured report
rather than narrating from memory. This matters because memory drifts;
ZFS pool capacity does not.
## The Three-Section Template
Every structured report has the same three top-level sections:
### Observed
What the report measured, with no interpretation. ZFS shows pool A at 87%
capacity. Build status file says `status: "fail"`. The last task in the
queue was created at 10:23.
This section is the source of truth for the rest of the report. If
`Observed` is empty, the underlying probe failed and the report says so.
### Interpretation
A handful of `findings` extracted from `Observed`, each tagged `info`,
`warn`, or `error`. "Pool A is at 87% capacity." "Tests last run failed.
12 failing tests." "No active controlplane tasks are queued right now."
Findings are short, factual, and avoid recommending action. Their job is
to reduce a wall of data to the few signals that matter.
### Operator Notes
Suggestions, conditional and labeled `note` or `action`. "Largest
snapshot: `tank/data@2026-04-20-weekly` (4.2 GB). Remove only if that
rollback point is no longer needed." "Re-run the test wrapper before
relying on this as evidence the branch is green."
Notes are *suggestions*, not commands. They include the **conditional**
that makes the action correct ("only if X"), so an operator can decide
without re-deriving the context.
## The Report Family
| Report | Module | Slash command | Source |
| ---------- | ------------------------------------- | ---------------- | -------------------------------------------------------- |
| System | `src/reports/system-report.ts` | `/report` | `hostd` probes + controlplane auth probe |
| Disk | `src/reports/disk-report.ts` | `/disk` | `zpool list -H` + `zfs list -H -o name,usedsnap` |
| Tasks | `src/reports/tasks-report.ts` | `/tasks` | `getAllTasks()` (Postgres) |
| Budget | `src/reports/budget-report.ts` | `/budgetreport` | `getAllBudgets()` + `getAgentTokenAnalytics()` |
| Publish | `src/reports/publish-report.ts` | `/publishreport` | tenant registry + webroot inspection |
| Test/Build | `src/reports/test-report.ts` | `/testreport` | `tmp/status/build-status.json` + `test-status.json` |
Each module exports two functions:
```ts
buildXxxReport(inputs) // pure: takes raw inputs, returns a typed report
renderXxxReport(report) // pure: takes the report, returns an HTML string
```
The split lets you unit-test the analysis without touching IO and reuse
the builder against a JSON sink later.
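To make the builder/renderer split concrete, here is a toy example for a pool-capacity report. It is a sketch only: `PoolInput`, `buildPoolReport`, `renderPoolReport`, the finding codes, and the 80% threshold are all hypothetical and do not mirror `src/reports/disk-report.ts`.

```typescript
type Level = "info" | "warn" | "error";

interface Finding { code: string; level: Level; message: string; }
interface OperatorNote { kind: "note" | "action"; message: string; }
interface PoolInput { name: string; capacityPct: number; }

interface PoolReport {
  observed: PoolInput[];
  findings: Finding[];
  operatorNotes: OperatorNote[];
}

// Pure builder: no IO; the output depends only on the inputs.
// The 80% threshold is illustrative, not a documented value.
function buildPoolReport(pools: PoolInput[]): PoolReport {
  const findings: Finding[] = [];
  const operatorNotes: OperatorNote[] = [];
  if (pools.length === 0) {
    findings.push({ code: "no_data", level: "error", message: "No pool data observed." });
  }
  for (const p of pools) {
    if (p.capacityPct >= 80) {
      findings.push({
        code: "pool_high",
        level: "warn",
        message: `Pool ${p.name} is at ${p.capacityPct}% capacity.`,
      });
      operatorNotes.push({
        kind: "action",
        message: `Review snapshots on ${p.name}; remove only those no longer needed.`,
      });
    }
  }
  return { observed: pools.slice(), findings, operatorNotes };
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Pure renderer: turns the typed report into an HTML string.
function renderPoolReport(r: PoolReport): string {
  const lines: string[] = ["<b>Observed</b>"];
  for (const p of r.observed) lines.push(`${escapeHtml(p.name)}: ${p.capacityPct}%`);
  lines.push("<b>Interpretation</b>");
  for (const f of r.findings) lines.push(`[${f.level}] ${escapeHtml(f.message)}`);
  lines.push("<b>Operator Notes</b>");
  for (const n of r.operatorNotes) lines.push(`[${n.kind}] ${escapeHtml(n.message)}`);
  return lines.join("\n");
}
```

Because the builder returns plain data, a test can assert on `findings` directly without parsing the rendered HTML.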
## Test/Build Pipeline
`/testreport` is the only report whose source-of-truth is a file the agent
does not write itself. The contract:
1. `scripts/write-test-build-status.sh` runs `npm run build` and
`npx vitest run --reporter=json --outputFile=...` (or one of them via
`build` / `tests` argument).
2. The wrapper writes two JSON files into the **status directory**:
- `<status-dir>/build-status.json`
- `<status-dir>/test-status.json`
The status directory resolves with this precedence (matched by both the
wrapper and `getDefaultStatusDir()` in `src/reports/test-report.ts`):
1. `$AGENT_STATUS_DIR` if set
2. `$CLAWDIE_VAR_DIR` if set (legacy)
3. `<project-root>/tmp/status` (default)
Per `AGENTS.md` § "Temporary File Storage", artifact paths under repo
`tmp/` are the preferred default — point `$AGENT_STATUS_DIR` elsewhere
only if you have a reason to.
3. `/testreport` reads both files, builds the report, renders it.
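The precedence in step 2 can be sketched as a small pure function. This is an illustration, not `getDefaultStatusDir()` itself: `resolveStatusDirSketch` is a hypothetical name, the environment is passed in as a plain record, and treating an empty string as "not set" is an assumption.

```typescript
type EnvLike = Record<string, string | undefined>;

// Resolve the status directory: AGENT_STATUS_DIR, then CLAWDIE_VAR_DIR
// (legacy), then <project-root>/tmp/status. Empty strings count as unset
// (an assumption of this sketch).
function resolveStatusDirSketch(env: EnvLike, projectRoot: string): string {
  const primary = env["AGENT_STATUS_DIR"];
  if (primary) return primary;
  const legacy = env["CLAWDIE_VAR_DIR"];
  if (legacy) return legacy;
  return `${projectRoot}/tmp/status`;
}
```

In a Node process the first argument would typically be `process.env`; passing it explicitly keeps the function testable.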
The schema for each file is intentionally narrow (`status` is one of `"ok"`,
`"fail"`, or `"unknown"`):
```json
{
"status": "ok",
"completedAt": "2026-04-26T10:00:00Z",
"command": "npx vitest run",
"exitCode": 0,
"durationMs": 12345,
"totalTests": 1934,
"failingTests": 0,
"skippedTests": 0,
"failingTestNames": ["..."],
"summary": "..."
}
```
Only `status` and `completedAt` are required; everything else degrades
gracefully. Files older than 6 hours surface as `stale` with a warn finding.
Missing or malformed files surface as `status: "unknown"` with an action
note rather than fabricating success.
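The degrade-gracefully rules can be sketched as follows. `interpretStatusSketch` is a hypothetical helper, not the reader in `src/reports/test-report.ts`; it assumes file age is measured from `completedAt` and that a missing file is represented by `null`.

```typescript
type Status = "ok" | "fail" | "unknown";

interface StatusReading {
  status: Status;
  stale: boolean;
  note?: string;
}

const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

function isStatus(v: unknown): v is Status {
  return v === "ok" || v === "fail" || v === "unknown";
}

// raw is the file content, or null when the file is missing.
function interpretStatusSketch(raw: string | null, now: Date): StatusReading {
  const unknown = (note: string): StatusReading => ({ status: "unknown", stale: false, note });
  if (raw === null) return unknown("Status file missing; run the wrapper.");
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return unknown("Status file is not valid JSON; re-run the wrapper.");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return unknown("Status file has an unexpected shape.");
  }
  const rec = parsed as Record<string, unknown>;
  const status = rec["status"];
  const completedAt = rec["completedAt"];
  // Only status and completedAt are required.
  if (!isStatus(status) || typeof completedAt !== "string") {
    return unknown("Required fields missing.");
  }
  const finished = Date.parse(completedAt);
  if (Number.isNaN(finished)) return unknown("completedAt is not a valid timestamp.");
  const stale = now.getTime() - finished > SIX_HOURS_MS;
  return stale
    ? { status, stale, note: "Status is older than 6 hours; re-run the wrapper." }
    : { status, stale };
}
```

Every failure path returns `unknown` with a note, so a broken or absent file can never be rendered as a passing build.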
The pre-commit and post-commit hooks call this wrapper so commit messages
include a `Build: pass | Tests: 12 failed | 1936 passed (1948)` footer
visible in `git log`.
## Free-Text Routing
When the agent receives a bot-addressed message, `classifyReportIntent()` in
`src/report-intent.ts` checks a set of conservative regexes and routes to
the matching structured report instead of the LLM path. This means an
operator typing "how much disk?" gets a fresh `/disk` snapshot, not a
half-remembered narrative from a session three days ago.
The routing rules are intentionally **narrow** (false negatives are fine,
false positives are not). For broader detection of "this prompt smells
operational", a separate `isOpsFlavored()` matcher catches a wider net of
phrasings (services, jails, deploy, controlplane terms, etc.) — and is
used to **suppress memory injection** on those prompts so the LLM answers
from live tools rather than narrative recall.
| Function | Use |
| ---------------------------- | -------------------------------------------------------------------- |
| `classifyReportIntent(text)` | Hard route → structured report. Only fires on confident phrasings. |
| `isOpsFlavored(text)` | Soft signal → drop memory injection. Wider net, lower bar. |
Both ignore slash-command messages (those are routed by grammY), and
`@assistant` mentions are stripped before matching.
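A minimal sketch of the two-tier matching described above. The regexes here are illustrative guesses built from the example phrasings in the docs, not the patterns in `src/report-intent.ts`, and `routeIntentSketch` / `looksOpsFlavoredSketch` are hypothetical names.

```typescript
type ReportKind = "disk" | "tasks" | "budget" | "test" | "system";

// Narrow routes: each pattern should fire only on confident phrasings.
const ROUTES: Array<[RegExp, ReportKind]> = [
  [/\b(disk usage|how much disk)\b/, "disk"],
  [/\b(task report|active tasks)\b/, "tasks"],
  [/\b(budget report|how many tokens)\b/, "budget"],
  [/\b(are the tests passing|build status)\b/, "test"],
  [/\b(system report|report please)\b/, "system"],
];

// Wider net: any operational vocabulary at all (illustrative word list).
const OPS_TERMS = /\b(service|services|jail|jails|deploy|auth|controlplane)\b/;

// Slash commands are handled elsewhere; mentions are stripped before matching.
function normalize(text: string): string | null {
  const trimmed = text.trim();
  if (trimmed.startsWith("/")) return null;
  return trimmed.replace(/@\w+/g, " ").toLowerCase();
}

// Hard route: returns the matching report kind, or null for the LLM path.
function routeIntentSketch(text: string): ReportKind | null {
  const t = normalize(text);
  if (t === null) return null;
  for (const [pattern, kind] of ROUTES) {
    if (pattern.test(t)) return kind;
  }
  return null;
}

// Soft signal: true means "skip memory injection for this prompt".
function looksOpsFlavoredSketch(text: string): boolean {
  const t = normalize(text);
  if (t === null) return false;
  return routeIntentSketch(text) !== null || OPS_TERMS.test(t);
}
```

Keeping the hard routes as an ordered list of pattern and target pairs makes it straightforward to review exactly which phrasings bypass the LLM.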
## Why Pure Builders
The pure builder pattern was a deliberate choice over a one-shot
"render-to-HTML now" approach. Three reasons:
- **Testable** — unit tests exercise the analysis logic with synthetic
inputs, no Postgres or pi running.
- **Reusable** — the same `buildDiskReport()` could feed a dashboard widget
or a daily email digest later. We are not committed to Telegram as the
only sink.
- **Inspectable** — when an operator asks "why did the report flag this?",
the answer is a `findings[]` array with explicit codes, not opaque text
generation.
If you add a new report, follow the same shape: a `Report` interface, a
`buildXxxReport()` function with `findings: XxxReportFinding[]` and
`operatorNotes: XxxReportOperatorNote[]`, a `renderXxxReport()` HTML
renderer, and a `*.test.ts` covering the builder independently.