docs: integrate operator observability + provider fallback work

Brings the public docs in line with what shipped on multitenant over the
last few days. Three new operator-facing pages, four updates to existing
ones, and a CHANGELOG batch.

New pages (docs/public/operate/):
- operator-commands.md — single reference for all Telegram slash commands,
grouped by purpose (status, structured reports, runtime, sessions, admin
actions) with auth gating per command. Previously only in-bot /help text.
- provider-fallback.md — operator guide for the cooldown layer: env vars,
how cooldowns are detected and tracked, /policy surfacing, /clearcooldown
for manual release, the configured/effective/actual observability triple.
Includes a "path convention note" flagging that the cooldown file still
uses the legacy $CLAWDIE_VAR_DIR resolution while test/build status
files have moved to repo tmp/ — divergence to harmonize later in code.
- structured-reports.md — explains the Observed/Interpretation/Operator
Notes pattern, lists the six structured reports, documents the
test/build pipeline contract (status JSON schema + new $AGENT_STATUS_DIR
→ $CLAWDIE_VAR_DIR → tmp/status precedence Codex landed in 1389e17),
and covers free-text routing (classifyReportIntent + isOpsFlavored).

Updates:
- monitoring.md: appended "Operator-Facing Reports" section pointing at
the new structured-reports page, and "Provider Fallback Health" pointing
at the fallback page.
- operate/index.md: added the three new pages to the runbook list.
- architecture/controlplane.md: added "Runtime Observability" section
documenting the configured/effective/actual triple and linking to the
new operate pages.
- README.md: expanded the Telegram Commands table (was 10 rows, missing
every structured report, /policy, /clearcooldown, /budgetreset) and
added a pointer to operator-commands.md as the full reference. Also
noted free-text routing.
- CHANGELOG.md: appended an "operator observability + provider fallback,
apr.2026" batch under [Unreleased] covering provider fallback, the
reports family, the test/build wrapper pipeline, free-text routing,
/clearcooldown, the observability triple, the Telegram setMyCommands
menu, and the new "Verify Before Claiming Remote State" rule in
AGENTS.md.

No code changes. Slovenian sl/ mirror left untouched (out of localization
scope).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 8 failed | 1940 passed (1948)
---
Build: pass | Tests: FAIL — Tests 2 failed | 1949 passed (1951)
Parent: 1389e17ec4
Commit: 3828e5ce83
8 changed files with 544 additions and 12 deletions
CHANGELOG.md (29 lines changed)
@@ -54,6 +54,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `README.md`, `CLAWDIE-ISO.md`, `AGENTS.md` synced to mention the agent-CLI prereq gate and the npm-globals bundle path
- `AGENTS.md` + nginx/freebsd-admin skills updated with controlplane dashboard build notes (Paperclip UI) and Tailscale proxy/PF pointers

### Added (operator observability + provider fallback, apr.2026)

- **Provider fallback layer** (`src/provider-fallback.ts`) — automatically swaps the configured LLM provider for an operator-defined fallback when the primary hits a usage cap. Detects `429 Usage limit reached` from pi stderr/stdout, parses `Your limit will reset at YYYY-MM-DD HH:MM:SS`, and marks a cooldown until the reset timestamp passes. Cooldowns are in-memory plus persisted to `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default `$HOME/.clawdie/state/`) so a restart inside the cap window does not re-trip the cap. Wired into `agent-runner.ts` (main chat) and `controlplane-heartbeat.ts` (specialists). Per-chat overrides (`group.jailConfig.provider`) are unchanged — only the spawn-time effective values are swapped while the cooldown is live.
- `LLM_FALLBACK_PROVIDER`, `LLM_FALLBACK_MODEL`, `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` config — operator picks the fallback (e.g. `openrouter` + a free-tier model). Default cooldown (3600s) is used only when the cap message has no parseable reset stamp.
- `getLlmKeyForProvider(provider)` (`src/env.ts`) — provider-aware secret resolution so the right API key is injected when fallback swaps providers; falls back to first-available when the requested key is absent.
- Startup validation: when `LLM_FALLBACK_PROVIDER` is set, the matching API key is now in the `criticalConfig` warn list. Warns separately when `LLM_FALLBACK_PROVIDER` is set without `LLM_FALLBACK_MODEL`.
- `/clearcooldown` admin command (ops-chat-gated) — lists active cooldowns when called without args; takes `<provider>` or `all`. Persists immediately so cleared state survives restart.
- `/policy` now shows a `Provider cooldown: <provider> until <iso> → fallback <provider/model>` line for each active cooldown.
- Activity payload now records `effective_provider` / `effective_model` next to `actual_*` so for any run you can read configured vs effective vs actual.
- **Structured operator reports family** with consistent `Observed` / `Interpretation` / `Operator Notes` sections — `src/reports/{system,disk,tasks,budget,publish,test}-report.ts`. Each report is a pure builder + renderer fed by raw inputs (DB rows, command output, JSON status files), tested independently of the wiring layer.
- `/report`, `/disk`, `/tasks`, `/budgetreport`, `/publishreport`, `/testreport` Telegram commands — the structured-report surfaces.
- **Test/build status pipeline** — `scripts/write-test-build-status.sh` runs the project's `npm run build` and `npx vitest run --reporter=json --outputFile=...`, then writes `build-status.json` and `test-status.json` to the status directory: `$AGENT_STATUS_DIR` (primary) → `$CLAWDIE_VAR_DIR` (legacy) → `<project-root>/tmp/status` (default). `/testreport` reads these files; missing or stale (>6h) files degrade to `unknown` with an action note rather than fabricating success. Pre-commit/post-commit hooks append the latest status to commit messages so reviewers see what was passing at commit time.
- **Free-text ops routing** (`src/report-intent.ts`) — bot-addressed phrasings like "disk usage", "are the tests passing", "what tasks do we have", "budget report" are classified by `classifyReportIntent()` and routed to the matching structured builder instead of the LLM path. Keeps memory/narrative recall from overriding a fresh probe.
- `isOpsFlavored()` — broader pattern matcher used to suppress stale memory injection on ops-flavored prompts so the LLM answers from live tools rather than narrative recall.
- **Specialist capability gate** (`src/agent-capabilities.ts`) — pre-flight check that compares the requested skill (and task description) against the assigned jail's installed tools, refusing the run with a clear reason when the agent cannot perform it.
- Telegram bot now publishes a proper command menu via `setMyCommands` with separate command lists for private chats vs the ops chat (`src/channels/telegram.ts`).
- `AGENTS.md` § "Verify Before Claiming Remote State" — convention requiring `git fetch` before reporting on any remote ref. Born from a real two-agent confusion on 26.apr where stale `origin/multitenant` refs in two worktrees produced contradictory "no new remote work" claims.

### Changed (operator observability)

- Many Telegram commands moved from `requireRegistered(ctx)` gate to direct chat resolution; per-handler `requireAdmin` / `requireOpsChat` still enforce auth. Effect: admins can run read-only ops commands from any chat without registering it first.
- `/status` ZFS section caps at 8 lines with a "… N more dataset(s) hidden" footer.
- `parseBastilleList` consolidated to use the shared `bastille-list.ts` parser. `summarizeZfsRows` extracted as a pure exportable helper.

### Fixed (operator observability)

- `/report` controlplane probe: when `CONTROLPLANE_BIND_HOST=0.0.0.0`, `getControlplaneProbeHost()` now derives a reachable host from `BETTER_AUTH_URL` instead of probing the wildcard address. Previously the report would say "controlplane unreachable" even when controlplane was healthy.
- Test artifacts now write to repo-local `tmp/` instead of system `/tmp` (per `AGENTS.md` § "Temporary File Storage").

## [0.10.0] - 2026-04-07

### Paperclip Control Plane Integration

README.md (42 lines changed)
@@ -450,18 +450,36 @@ From the main channel (your self-chat), you can manage groups and tasks:

## Telegram Commands

| Command | Description | Auth |
| ------------- | -------------------------------------------------- | ------ |
| `/status` | System status: jails, ZFS, PF, budget, model | anyone |
| `/usage` | Per-agent token budget breakdown | anyone |
| `/compact` | Compact session (summarize old, keep recent turns) | admin |
| `/new` | Hard reset session, start fresh | admin |
| `/resume` | Unpause a budget-paused chat | admin |
| `/stop` | Kill running agent mid-response | admin |
| `/tts` | Toggle voice replies (on/off/status/default) | admin |
| `/activation` | Set trigger mode (always/mention) | admin |
| `/whoami` | Show your Telegram identity | anyone |
| `/help` | List available commands | anyone |

A short selection — for the full reference (status, structured reports,
runtime, sessions, admin actions, free-text routing) see
[Operator Commands](docs/public/operate/operator-commands.md).

| Command | Description | Auth |
| ---------------- | -------------------------------------------------------------- | --------- |
| `/status` | System summary: jails, ZFS, PF, budget, model | anyone |
| `/report` | Structured system + auth report | admin |
| `/disk` | Structured ZFS pool + snapshot report | admin |
| `/tasks` | Structured controlplane task report | admin |
| `/budgetreport` | Structured budget + token analytics | admin |
| `/publishreport` | Structured tenant publish/content report | admin |
| `/testreport` | Structured build + test status (from wrapper-written JSON) | admin |
| `/policy` | Default runtime, per-chat overrides, fallback cooldowns | anyone |
| `/usage` | Per-agent token budget breakdown | anyone |
| `/clearcooldown` | Clear a [provider fallback](docs/public/operate/provider-fallback.md) cooldown | ops chat |
| `/budgetreset` | Reset agent token budget | ops chat |
| `/compact` | Compact session (summarize old, keep recent turns) | admin |
| `/new` | Hard reset session, start fresh | admin |
| `/resume` | Unpause a budget-paused chat | admin |
| `/stop` | Kill running agent mid-response | admin |
| `/tts` | Toggle voice replies (on/off/status/default) | admin |
| `/activation` | Set trigger mode (always/mention) | admin |
| `/whoami` | Show your Telegram identity | anyone |
| `/help` | List available commands | anyone |

The bot also routes **free-text ops phrasings** ("disk usage", "are the
tests passing", "task report", etc.) to the matching structured report
instead of the LLM path — see
[Structured Reports → Free-Text Routing](docs/public/operate/structured-reports.md#free-text-routing).

### Session Compaction

@@ -136,9 +136,30 @@ just setup-controlplane

---

## Runtime Observability

Every agent run (orchestrator main chat or specialist heartbeat) records
three provider/model values in `agent_activity.payload`:

| Field | Meaning |
| -------------- | --------------------------------------------------------- |
| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
| `effective_*` | What was actually passed to pi (after fallback swap) |
| `actual_*` | What pi reports having used (parsed from session JSONL) |

`configured_*` and `effective_*` differ when [provider fallback](../operate/provider-fallback/)
is active (cooldown is live, runtime is using the operator's chosen
fallback). `actual_*` should match `effective_*` for a successful run; a
divergence suggests pi rewrote the model selection internally.

`/budgetreport` and `/tokens` surface these values; `/policy` shows the
fallback cooldown line when one is active.

## References

- `doc/CONTROLPLANE-ARCHITECTURE.md` — detailed service layout
- `doc/CONTROLPLANE-MESSAGE-CONTRACT.md` — API contracts (what agents query and post)
- `doc/CONTROLPLANE-AGENT-ROLES.md` — role definitions, skill mappings, budgets
- `SOUL.md`, `SYSADMIN_AGENT.md`, `DB_ADMIN_AGENT.md`, `GIT_ADMIN_AGENT.md` — agent identity files
- [Provider Fallback](../operate/provider-fallback/) — automatic provider switching when the primary hits a usage cap
- [Structured Reports](../operate/structured-reports/) — operator-facing report family + free-text routing

@@ -7,5 +7,8 @@ Runbooks for day-to-day operation and recovery.

- [Security](./security/)
- [Monitoring](./monitoring/)
- [Operator Commands](./operator-commands/)
- [Structured Reports](./structured-reports/)
- [Provider Fallback](./provider-fallback/)
- [DB disaster recovery](./db-disaster-recovery/)
- [Git storage](./git-storage/)

@@ -145,3 +145,37 @@ Bastille monitor and Clawdie doctor solve different problems:
- **Clawdie doctor** — application, pipeline, and control plane health

Use both; don't confuse them.

## Operator-Facing Reports

Beyond the runtime health files above, the agent exposes a family of
**structured reports** for operator inspection on demand. Each report has a
matching Telegram slash command and follows the same `Observed` /
`Interpretation` / `Operator Notes` template — see
[Structured Reports](./structured-reports/) for the design and the full list.

| Report | Command | What it answers |
| ---------- | ----------------- | --------------------------------------------------- |
| System | `/report` | Are services + jails + controlplane healthy? |
| Disk | `/disk` | What is consuming ZFS pool space and snapshots? |
| Tasks | `/tasks` | What is in the controlplane task queue? |
| Budget | `/budgetreport` | Token budgets and burn analytics |
| Publish | `/publishreport` | Tenant publish/content state |
| Test/Build | `/testreport` | Was the last build/test run green? |

`/testreport` is fed by `scripts/write-test-build-status.sh`, not by the
running process — invoke the wrapper from CI, a hook, or by hand to refresh
its status files. The pre-commit and post-commit hooks run it automatically
so each commit message footer reflects what was passing at commit time.
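
The CHANGELOG entry in this commit describes where the wrapper writes its status files and when `/testreport` stops trusting them. A minimal TypeScript sketch of that precedence and the six-hour staleness rule follows. It is an illustration only, not the wrapper or report code: `resolveStatusDir`, `effectiveStatus`, and the `writtenAtMs` field are hypothetical names, and treating an empty variable as unset is an assumption.

```typescript
// Hypothetical sketch: function names and the `writtenAtMs` field are illustrative.
type Env = Record<string, string | undefined>;

// $AGENT_STATUS_DIR (primary) -> $CLAWDIE_VAR_DIR (legacy) -> <project-root>/tmp/status (default).
// Assumption: an empty string counts as unset.
function resolveStatusDir(env: Env, projectRoot: string): string {
  return env["AGENT_STATUS_DIR"] || env["CLAWDIE_VAR_DIR"] || `${projectRoot}/tmp/status`;
}

const STALE_AFTER_MS = 6 * 60 * 60 * 1000; // status older than 6h is not trusted

interface StatusFile {
  status: "pass" | "fail";
  writtenAtMs: number; // hypothetical timestamp field (epoch milliseconds)
}

// Missing or stale files degrade to "unknown" instead of reporting success.
function effectiveStatus(
  file: StatusFile | null,
  nowMs: number,
): "pass" | "fail" | "unknown" {
  if (file === null) return "unknown";
  if (nowMs - file.writtenAtMs > STALE_AFTER_MS) return "unknown";
  return file.status;
}
```

Keeping both functions free of file IO means the precedence and the staleness rule can be checked without a real status directory.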

For the full operator command reference (status, sessions, admin actions,
free-text routing), see [Operator Commands](./operator-commands/).

## Provider Fallback Health

When the configured LLM provider is in cooldown (e.g. zAI usage cap), the
agent transparently routes to the operator-defined fallback. Active
cooldowns are visible in `/policy` and as structured `logger.warn` lines on
every fallback-active run. See [Provider Fallback](./provider-fallback/) for
configuration, manual release (`/clearcooldown`), and the
configured / effective / actual observability triple.

docs/public/operate/operator-commands.md (new file, 118 lines)
@@ -0,0 +1,118 @@
---
title: 'Operator Commands'
description: Reference for the Telegram slash commands operators use to inspect and control the running agent.
---

The agent exposes its operational surface as Telegram slash commands. This
page is the single reference for what each command does, who can run it,
and which underlying surface it inspects. The Telegram bot also publishes a
native command menu via `setMyCommands` — start typing `/` in any chat for
the live in-app list.

## Authorization Layers

Three layers gate the commands. A command may pass through one, two, or all
three:

| Gate | Where | Effect |
| ------------------- | -------------------------------------- | --------------------------------------------- |
| `requireAdmin` | Per-handler | Only operators on the admin allow-list run it |
| `requireOpsChat` | Per-handler (write/destructive only) | Only the configured ops chat may invoke it |
| Per-chat overrides | `group.jailConfig` (registered groups) | Per-chat model/provider overrides |

Read-only report commands (`/disk`, `/report`, `/testreport`, etc.) are
admin-gated but not ops-chat-gated, so admins can run them from any chat.
Destructive commands (`/budgetreset`, `/clearcooldown`) require the ops chat.

## Status & Identity

| Command | Purpose | Surface |
| ------------- | ------------------------------------------------------------- | --------------------------------------------- |
| `/ping` | Confirm the bot process is responsive | Direct reply |
| `/chatid` | Print the current chat's JID | Useful for `.env` registration |
| `/whoami` | Show your Telegram identity | Confirms admin-allowlist match |
| `/status` | Compact system summary (jails, ZFS pools, PF, budget) | `src/system-state.ts` snapshot |

## Structured Reports

All structured reports follow the same `Observed` / `Interpretation` /
`Operator Notes` template. See [Structured Reports](./structured-reports/) for
the design pattern.

| Command | Report | Source |
| ---------------- | ----------------------------------------------------- | ----------------------------------------------------------------------- |
| `/report` | System & auth — services, jails, PF, controlplane | `hostd` probes + `probeControlplaneAuth()` |
| `/disk` | ZFS pools and snapshots | `zpool list -H` + `zfs list -H -o name,usedsnap` |
| `/tasks` | Controlplane task queue | `getAllTasks()` (Postgres) |
| `/budgetreport` | Token budgets and burn analytics | `getAllBudgets()` + `getAgentTokenAnalytics()` |
| `/publishreport` | Tenant publish/content state | `loadTenantRegistry()` + webroot inspection |
| `/testreport` | Build and test pass/fail | `tmp/status/build-status.json` + `tmp/status/test-status.json` |

`/testreport` is fed by `scripts/write-test-build-status.sh` — see
[Structured Reports](./structured-reports/#test-build-pipeline) for the
write/read contract.

## Runtime & Policy

| Command | Purpose |
| ----------------- | ------------------------------------------------------------------------- |
| `/policy` | Active runtime policy (default model, overrides, cooldowns, budget state) |
| `/budget` | Alias for `/policy` |
| `/usage` | Token budget per agent |
| `/tokens` | Runtime token burn per agent (last-N analytics) |
| `/model` | Set provider/model for this chat (per-chat override) |
| `/activation` | Set trigger mode (always-respond vs mention-only) |
| `/tts` | Toggle voice replies (`on` / `off` / `status`) |

`/policy` shows the [Provider fallback](./provider-fallback/) cooldown line
when one is active.

## Sessions

| Command | Purpose |
| ------------- | ------------------------------------------------------------------ |
| `/new` | Reset this chat's session |
| `/compact` | Compact the session (summarize old, keep recent) |
| `/stop` | Stop a running agent for this chat |
| `/resume` | Resume a budget-paused chat |

## Admin Actions (Ops-Chat Only)

| Command | Purpose |
| ---------------------- | ------------------------------------------------------------------------ |
| `/budgetreset <id>` | Reset an agent's token budget. `all` requires `confirm` second arg. |
| `/clearcooldown [id]` | Clear a [provider fallback](./provider-fallback/) cooldown |
| `/audit` | Platform ownership audit (which jail/dataset/service belongs to which) |
| `/snapshots [dataset]` | List ZFS snapshots |
| `/scrub <pool> [op]` | ZFS scrub controls (`status` / `start` / `stop`) |
| `/updates` | FreeBSD base + ports update status |
| `/schedule` | Manage scheduled agent tasks (list / add / cancel / done) |

## Free-Text Routing

The bot recognizes **bot-addressed** ops-flavored phrasings without requiring
a slash command. Examples that route to structured reports instead of the LLM
path:

| Phrase | Routed to |
| --------------------------------------- | --------------- |
| `disk usage`, `how much disk` | `/disk` |
| `task report`, `active tasks` | `/tasks` |
| `budget report`, `how many tokens` | `/budgetreport` |
| `are the tests passing`, `build status` | `/testreport` |
| `system report`, `report please` | `/report` |

This keeps memory or narrative recall from drifting into a stale answer when
fresh structured data is available. The full pattern set lives in
`classifyReportIntent()` in `src/report-intent.ts`.

A broader `isOpsFlavored()` matcher also suppresses memory injection on any
ops-flavored prompt (services, jails, deploy, auth, controlplane terms),
even when no specific report matches — so the LLM answers from live tools
rather than narrative recall.
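
The routing table above can be approximated with an ordered list of patterns. The sketch below is a hypothetical illustration and not the contents of `src/report-intent.ts`: `classifyReportIntentSketch`, `ROUTES`, and the exact regular expressions are assumptions built only from the example phrases listed in the table.

```typescript
// Hypothetical sketch of phrase-to-report routing; the real pattern set may differ.
type ReportCommand = "/disk" | "/tasks" | "/budgetreport" | "/testreport" | "/report";

// Order matters: specific reports are checked before the generic system report,
// so "budget report please" resolves to /budgetreport rather than /report.
const ROUTES: Array<[RegExp, ReportCommand]> = [
  [/\bdisk usage\b|\bhow much disk\b/i, "/disk"],
  [/\btask report\b|\bactive tasks\b/i, "/tasks"],
  [/\bbudget report\b|\bhow many tokens\b/i, "/budgetreport"],
  [/\bare the tests passing\b|\bbuild status\b/i, "/testreport"],
  [/\bsystem report\b|\breport please\b/i, "/report"],
];

// Returns the matching report command, or null to fall through to the LLM path.
function classifyReportIntentSketch(text: string): ReportCommand | null {
  for (const [pattern, command] of ROUTES) {
    if (pattern.test(text)) return command;
  }
  return null;
}
```

Returning `null` rather than a default report keeps ordinary conversation on the LLM path; only phrasings that match a known pattern are diverted.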

## Help

`/help` prints the in-bot command list. The list is generated from the same
constants that drive the Telegram menu publication, so it reflects whatever
is currently registered.

docs/public/operate/provider-fallback.md (new file, 141 lines)
@@ -0,0 +1,141 @@
---
title: 'Provider Fallback'
description: Automatic LLM provider switching when the primary provider hits a usage cap.
---

When the primary LLM provider returns a "usage cap reached" error, the agent
keeps replying instead of looping on 429s — it transparently switches to a
configured fallback until the cap window passes, then automatically returns to
the primary.

## In Plain Language

- Some LLM providers (notably zAI) impose rolling 5-hour usage caps. When you
  hit one, every request fails until the reset.
- Without fallback, the bot would retry the capped provider on every message
  and stay broken for hours.
- Fallback puts the capped provider in a "cooldown" until the reset timestamp,
  routes new runs through your operator-chosen alternative (e.g. OpenRouter
  with a free-tier model), and resumes the primary the moment the cooldown
  expires.
- The cooldown survives a process restart so a quick service bounce inside the
  cap window does not re-trip the cap.

## Configuration

Set in `.env`:

| Variable | Required | Example |
| --------------------------------------- | ---------------------------------------- | -------------------------------------------- |
| `LLM_FALLBACK_PROVIDER` | yes (when fallback is desired) | `openrouter` |
| `LLM_FALLBACK_MODEL` | recommended | `meta-llama/llama-3.3-70b-instruct:free` |
| `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` | optional (default `3600`) | `1800` |

The default cooldown is used **only** when the cap message has no parseable
reset stamp. Real zAI cap errors include the reset timestamp and the cooldown
matches the reset exactly.

The fallback provider's API key (`OPENROUTER_API_KEY` for openrouter,
`ZAI_API_KEY` for zai, etc.) must also be set. The agent verifies this at
startup and warns in the logs if it is missing — the warning is the only
notice you will get before the fallback fails for real.

## How Cooldowns Work

1. A run fails with `429 Usage limit reached for 5 hour. Your limit will reset
   at YYYY-MM-DD HH:MM:SS`.
2. The runner parses the reset timestamp (treated as local time) and stores
   `{ provider: 'zai', until: <reset>, reason: <message> }` in memory and on
   disk.
3. Every subsequent run consults the cooldown map *before* spawning pi. If the
   configured provider is in cooldown, the spawn args swap to the fallback
   provider/model.
4. The cooldown auto-expires at the reset timestamp. Next run uses the primary
   again.

The cooldown file lives at `$CLAWDIE_VAR_DIR/provider-cooldowns.json` (default
`$HOME/.clawdie/state/provider-cooldowns.json`). Expired entries are dropped
on load.
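
The detection and expiry steps above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the code in `src/provider-fallback.ts`: `CooldownEntry`, `parseCapSketch`, and `isInCooldown` are hypothetical names, and the regular expressions are inferred from the message format quoted in step 1.

```typescript
// Hypothetical shape; the persisted entry in the real module may differ.
interface CooldownEntry {
  provider: string;
  until: Date;
  reason: string;
}

// Cap signature and reset stamp as quoted in the steps above.
const CAP_RE = /429 Usage limit reached/;
const RESET_RE =
  /Your limit will reset\s+at\s+(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/;

// Returns a cooldown entry for a cap error, or null for any other output.
function parseCapSketch(
  provider: string,
  output: string,
  now: Date,
  defaultCooldownSeconds: number = 3600,
): CooldownEntry | null {
  if (!CAP_RE.test(output)) return null; // generic 429s and other errors are ignored
  const m = RESET_RE.exec(output);
  const until = m
    ? // The reset stamp is treated as local time.
      new Date(
        Number(m[1]), Number(m[2]) - 1, Number(m[3]),
        Number(m[4]), Number(m[5]), Number(m[6]),
      )
    : // No parseable stamp: fall back to the default cooldown.
      new Date(now.getTime() + defaultCooldownSeconds * 1000);
  return { provider, until, reason: output.trim() };
}

// A cooldown is live only while its reset time is still in the future.
function isInCooldown(entry: CooldownEntry | undefined, now: Date): boolean {
  return entry !== undefined && entry.until.getTime() > now.getTime();
}
```

Filtering a loaded list of entries with `isInCooldown` would reproduce the "expired entries are dropped on load" behavior.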

> **Path convention note.** The cooldown file currently uses the legacy
> `$CLAWDIE_VAR_DIR` / `$HOME/.clawdie/state/` resolution. The newer
> [test/build status files](./structured-reports/#test-build-pipeline)
> moved to repo-local `tmp/` to align with `AGENTS.md` § "Temporary File
> Storage". A future code change should harmonize provider-fallback to the
> same precedence (`AGENT_STATUS_DIR` → `CLAWDIE_VAR_DIR` → `tmp/state/`).
> Until then, if you set `AGENT_STATUS_DIR`, also set `CLAWDIE_VAR_DIR` to
> the same path so both subsystems agree.

## Inspecting State

`/policy` shows active cooldowns under the runtime line:

```
Default runtime: zai / glm-4.6
Provider cooldown: zai until 2026-04-25T19:00:59 → fallback openrouter/meta-llama/llama-3.3-70b-instruct:free
```

When no cooldowns are active, the line is omitted — runtime looks normal.

Logs include structured warnings on every fallback-active run:

```
{ originalProvider: 'zai', fallbackProvider: 'openrouter', cooldownUntil: '...' } Provider fallback active — preferred provider is in cooldown
```

And on the run that *trips* the cooldown:

```
{ provider: 'zai', until: '2026-04-25T19:00:59', reason: '429 Usage limit reached; resets ...' } Provider cap detected — marking cooldown
```

## Manual Release

If you know the cap was lifted early or want to retry the primary before the
parsed reset time, clear the cooldown manually:

```
/clearcooldown # lists active cooldowns and prints usage
/clearcooldown zai # clears one
/clearcooldown all # clears every active cooldown
```

The command is admin-only and ops-chat-gated. It persists immediately so the
cleared state survives restart.

## Observability Triple

Every agent activity row now records three provider/model values:

| Field | Meaning |
| -------------------- | -------------------------------------------------------- |
| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
| `effective_*` | What was actually passed to pi (after fallback swap) |
| `actual_*` | What pi reports having used (parsed from session JSONL) |

When fallback is active, `configured_*` and `effective_*` differ.
`actual_*` should match `effective_*` for a successful run; a divergence
suggests pi rewrote the model selection internally.
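
Reading the triple comes down to two comparisons. The sketch below illustrates them; the `RuntimeTriple` shape is an assumption that expands the `configured_*` / `effective_*` / `actual_*` pattern into provider and model fields, and `readTriple` is a hypothetical helper, not an existing function.

```typescript
// Assumed payload shape: the *_provider / *_model expansion is illustrative.
interface RuntimeTriple {
  configured_provider: string;
  configured_model: string;
  effective_provider: string;
  effective_model: string;
  actual_provider: string;
  actual_model: string;
}

interface TripleReading {
  fallbackActive: boolean; // configured differs from effective
  diverged: boolean; // effective differs from actual
}

// Compare configured vs effective, then effective vs actual.
function readTriple(t: RuntimeTriple): TripleReading {
  const fallbackActive =
    t.configured_provider !== t.effective_provider ||
    t.configured_model !== t.effective_model;
  const diverged =
    t.effective_provider !== t.actual_provider ||
    t.effective_model !== t.actual_model;
  return { fallbackActive, diverged };
}
```

The two flags are independent: a run can be on the fallback and still report a different model than the one it was given.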

## Behavior That Stays The Same

- **Per-chat overrides** (`group.jailConfig.provider` / `.model`) are not
  touched by the cooldown layer. If you have explicitly set a chat to a
  specific provider, only that provider's cooldowns affect it.
- **Cap detection is conservative** — the parser only matches the specific
  zAI cap signature, not generic 429s, transport errors, or rate-limit
  responses from other providers. This is intentional to avoid false
  positives. If you need the same behavior for another provider, the
  pattern lives in `parseProviderCapError()` in `src/provider-fallback.ts`.

## When Fallback Is Not Configured

If a primary provider hits its cap and `LLM_FALLBACK_PROVIDER` is unset:

- The cooldown is still tracked.
- Runs continue to use the primary and continue to fail until reset.
- Logs include a clear warning: `Provider in cooldown but no fallback configured; passing through`.
- `/policy` will show the cooldown line without a fallback target.

This is intentional — the fallback is opt-in. Without it, you fail visibly
rather than silently routing to a wrong provider.
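
The spawn-time decision, including the pass-through case described above, might look like the following. This is a sketch under assumptions rather than the runner's actual logic: `Runtime`, `Resolution`, and `resolveEffectiveRuntime` are hypothetical names, and the fallback is modeled as a complete provider/model pair for simplicity.

```typescript
// Hypothetical types for illustration only.
interface Runtime {
  provider: string;
  model: string;
}

interface Resolution {
  effective: Runtime;
  fallbackActive: boolean;
  warning?: string;
}

// cooldownUntil maps provider name to reset time in epoch milliseconds.
function resolveEffectiveRuntime(
  configured: Runtime,
  cooldownUntil: Map<string, number>,
  fallback: Runtime | undefined,
  nowMs: number,
): Resolution {
  const until = cooldownUntil.get(configured.provider);
  const live = until !== undefined && until > nowMs;
  if (!live) {
    // No live cooldown for this provider: use it unchanged.
    return { effective: configured, fallbackActive: false };
  }
  if (fallback === undefined) {
    // Opt-in fallback is absent: pass through and fail visibly.
    return {
      effective: configured,
      fallbackActive: false,
      warning: "Provider in cooldown but no fallback configured; passing through",
    };
  }
  // Cooldown is live and a fallback exists: swap the spawn-time values only.
  return { effective: fallback, fallbackActive: true };
}
```

Because the function returns a new effective value and never mutates `configured`, stored per-chat overrides stay untouched, which matches the behavior described in the previous section.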
docs/public/operate/structured-reports.md (new file, 168 lines)
@@ -0,0 +1,168 @@
---
|
||||
title: 'Structured Reports'
|
||||
description: The Observed / Interpretation / Operator Notes pattern, the report family, and the free-text routing layer.
|
||||
---
|
||||
|
||||
The agent's operator-facing reports follow a single template so an operator
|
||||
or a peer agent can read any of them at a glance and know what is observed
|
||||
fact, what is interpretation, and what action (if any) is suggested.
|
||||
|
||||
## In Plain Language
|
||||
|
||||
- A **structured report** is a deterministic snapshot of one slice of the
|
||||
system (disk, services, tasks, budget, publish state, build/test status).
|
||||
- Reports are built from **raw inputs** — DB rows, command output, JSON
|
||||
status files — by a **pure builder function**. The builder has no side
|
||||
effects and is unit-tested independently of how the report is delivered.
|
||||
- The result is rendered to HTML for Telegram and could equally be rendered
|
||||
to JSON for a dashboard or to plain text for a CLI.
|
||||
- When the agent answers an ops question, it reads the structured report
|
||||
rather than narrating from memory. This matters because memory drifts;
|
||||
ZFS pool capacity does not.
|
||||
|
||||
## The Three-Section Template
|
||||
|
||||
Every structured report has the same three top-level sections:

### Observed

What the report measured, with no interpretation. ZFS shows pool A at 87%
capacity. Build status file says `status: "fail"`. The last task in the
queue was created at 10:23.

This section is the source of truth for the rest of the report. If
`Observed` is empty, the underlying probe failed and the report says so.

### Interpretation

A handful of `findings` extracted from `Observed`, each tagged `info`,
`warn`, or `error`. "Pool A is at 87% capacity." "Tests last run failed.
12 failing tests." "No active controlplane tasks are queued right now."

Findings are short, factual, and avoid recommending action. Their job is
to reduce a wall of data to the few signals that matter.

### Operator Notes

Conditional suggestions, each labeled `note` or `action`. "Largest
snapshot: `tank/data@2026-04-20-weekly` (4.2 GB). Remove only if that
rollback point is no longer needed." "Re-run the test wrapper before
relying on this as evidence the branch is green."

Notes are *suggestions*, not commands. They include the **conditional**
that makes the action correct ("only if X"), so an operator can decide
without re-deriving the context.

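As a rough illustration, the three sections could map onto a type like the one below. This is a sketch only: the names `StructuredReport`, `Finding`, `OperatorNote`, and every field name are assumptions made for this example, not the project's actual types.

```ts
// Hypothetical shape of a three-section report; all names are illustrative.
type Severity = "info" | "warn" | "error";
type NoteKind = "note" | "action";

interface Finding {
  code: string; // stable identifier for the finding
  severity: Severity;
  message: string; // short and factual, no recommendation
}

interface OperatorNote {
  kind: NoteKind;
  message: string; // carries the condition that makes the action correct
}

interface StructuredReport<Observed> {
  observed: Observed | null; // null when the underlying probe failed
  findings: Finding[]; // the Interpretation section
  operatorNotes: OperatorNote[]; // the Operator Notes section
}

// Example instance built from the disk examples quoted above.
const exampleDiskReport: StructuredReport<{ pool: string; capacityPct: number }> = {
  observed: { pool: "A", capacityPct: 87 },
  findings: [
    { code: "pool-capacity-high", severity: "warn", message: "Pool A is at 87% capacity." },
  ],
  operatorNotes: [
    {
      kind: "note",
      message:
        "Largest snapshot: tank/data@2026-04-20-weekly (4.2 GB). " +
        "Remove only if that rollback point is no longer needed.",
    },
  ],
};
```

Keeping `observed` separate from `findings` means a reader can always check an interpretation against the raw measurement it was derived from.
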
## The Report Family

| Report     | Module                          | Slash command    | Source                                              |
| ---------- | ------------------------------- | ---------------- | --------------------------------------------------- |
| System     | `src/reports/system-report.ts`  | `/report`        | `hostd` probes + controlplane auth probe            |
| Disk       | `src/reports/disk-report.ts`    | `/disk`          | `zpool list -H` + `zfs list -H -o name,usedsnap`    |
| Tasks      | `src/reports/tasks-report.ts`   | `/tasks`         | `getAllTasks()` (Postgres)                          |
| Budget     | `src/reports/budget-report.ts`  | `/budgetreport`  | `getAllBudgets()` + `getAgentTokenAnalytics()`      |
| Publish    | `src/reports/publish-report.ts` | `/publishreport` | tenant registry + webroot inspection                |
| Test/Build | `src/reports/test-report.ts`    | `/testreport`    | `tmp/status/build-status.json` + `test-status.json` |

Each module exports two functions:

```ts
buildXxxReport(inputs)   // pure: takes raw inputs, returns a typed report
renderXxxReport(report)  // pure: takes the report, returns an HTML string
```

The split lets you unit-test the analysis without touching IO and reuse
the builder against a JSON sink later.

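A minimal sketch of the builder/renderer split for a disk-style report follows. The function names, the input shape, the 80% warning threshold, and the HTML layout are all assumptions chosen for illustration; the real `src/reports/disk-report.ts` may differ.

```ts
// Hypothetical input and report shapes; not the project's actual types.
interface PoolInput {
  name: string;
  capacityPct: number;
}

interface DiskFinding {
  code: string;
  severity: "info" | "warn" | "error";
  message: string;
}

interface DiskReportSketch {
  observed: PoolInput[];
  findings: DiskFinding[];
}

// Pure builder: no IO, the output depends only on the arguments.
// The 80% default threshold is an assumption made for this example.
function buildDiskReportSketch(pools: PoolInput[], warnAtPct = 80): DiskReportSketch {
  const findings = pools
    .filter((p) => p.capacityPct >= warnAtPct)
    .map(
      (p): DiskFinding => ({
        code: "pool-capacity-high",
        severity: "warn",
        message: `Pool ${p.name} is at ${p.capacityPct}% capacity.`,
      }),
    );
  return { observed: pools, findings };
}

// Escape the characters that matter inside HTML text nodes.
function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Pure renderer: turns the typed report into an HTML string.
function renderDiskReportSketch(report: DiskReportSketch): string {
  const observed = report.observed
    .map((p) => `${escapeHtml(p.name)}: ${p.capacityPct}%`)
    .join("\n");
  const findings = report.findings
    .map((f) => `[${f.severity}] ${escapeHtml(f.message)}`)
    .join("\n");
  return `<b>Observed</b>\n${observed}\n\n<b>Interpretation</b>\n${findings}`;
}
```

Because both functions are pure, a unit test can call `buildDiskReportSketch()` with synthetic pools and assert on `findings` without ever running `zpool`.
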
## Test/Build Pipeline

`/testreport` is the only report whose source of truth is a file the agent
does not write itself. The contract:

1. `scripts/write-test-build-status.sh` runs `npm run build` and
   `npx vitest run --reporter=json --outputFile=...` (or only one of them,
   via the `build` or `tests` argument).
2. The wrapper writes two JSON files into the **status directory**:
   - `<status-dir>/build-status.json`
   - `<status-dir>/test-status.json`

   The status directory resolves with this precedence (matched by both the
   wrapper and `getDefaultStatusDir()` in `src/reports/test-report.ts`):

   1. `$AGENT_STATUS_DIR` if set
   2. `$CLAWDIE_VAR_DIR` if set (legacy)
   3. `<project-root>/tmp/status` (default)

   Per `AGENTS.md` § "Temporary File Storage", artifact paths under repo
   `tmp/` are the preferred default; point `$AGENT_STATUS_DIR` elsewhere
   only if you have a reason to.

3. `/testreport` reads both files, builds the report, and renders it.

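The precedence in step 2 can be sketched as a small pure function. This is an illustration only, not the body of `getDefaultStatusDir()`: the name `resolveStatusDirSketch`, passing the environment in as a parameter, treating an empty string as unset, and using `$CLAWDIE_VAR_DIR` directly (rather than a subdirectory of it) are all assumptions.

```ts
// Environment passed in explicitly so the function stays pure and testable.
type Env = Record<string, string | undefined>;

// Hypothetical helper mirroring the documented precedence:
// $AGENT_STATUS_DIR, then $CLAWDIE_VAR_DIR (legacy), then <project-root>/tmp/status.
function resolveStatusDirSketch(env: Env, projectRoot: string): string {
  // Assumption: an empty string counts as "not set".
  if (env.AGENT_STATUS_DIR) return env.AGENT_STATUS_DIR;
  if (env.CLAWDIE_VAR_DIR) return env.CLAWDIE_VAR_DIR;
  return `${projectRoot}/tmp/status`;
}
```

Whatever the real implementation looks like, the wrapper script and the report reader need to apply the same rule, otherwise `/testreport` would look for status files in a directory the wrapper never wrote to.
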
The schema for each file is intentionally narrow. `status` is one of
`"ok"`, `"fail"`, or `"unknown"`:

```json
{
  "status": "ok",
  "completedAt": "2026-04-26T10:00:00Z",
  "command": "npx vitest run",
  "exitCode": 0,
  "durationMs": 12345,
  "totalTests": 1934,
  "failingTests": 0,
  "skippedTests": 0,
  "failingTestNames": ["..."],
  "summary": "..."
}
```

Only `status` and `completedAt` are required; everything else degrades
gracefully. Files older than 6 hours surface as `stale` with a warn finding.
Missing or malformed files surface as `status: "unknown"` with an action
note rather than fabricating success.

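The degradation rules above can be sketched as a pure parser. The names `parseStatusFileSketch` and `ParsedStatusSketch`, the `null`-for-missing-file convention, and representing staleness as a boolean are assumptions for illustration; only the two required fields and the 6-hour window come from the text.

```ts
type Status = "ok" | "fail" | "unknown";

interface ParsedStatusSketch {
  status: Status;
  completedAt: string | null;
  stale: boolean;
}

const STALE_AFTER_MS = 6 * 60 * 60 * 1000; // 6 hours, per the text above

function isStatus(value: unknown): value is Status {
  return value === "ok" || value === "fail" || value === "unknown";
}

// raw is the file content, or null when the file is missing (assumed convention).
// now is injected so the staleness check stays deterministic in tests.
function parseStatusFileSketch(raw: string | null, now: Date): ParsedStatusSketch {
  const unknownResult: ParsedStatusSketch = { status: "unknown", completedAt: null, stale: false };
  if (raw === null) return unknownResult;

  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return unknownResult; // malformed JSON never turns into a fabricated "ok"
  }
  if (typeof data !== "object" || data === null) return unknownResult;

  const { status, completedAt } = data as Record<string, unknown>;
  if (!isStatus(status) || typeof completedAt !== "string") return unknownResult;

  const completedMs = Date.parse(completedAt);
  if (Number.isNaN(completedMs)) return unknownResult;

  return { status, completedAt, stale: now.getTime() - completedMs > STALE_AFTER_MS };
}
```

Every failure path returns `unknown`, so a missing or corrupt file can never be mistaken for a green build.
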
The pre-commit and post-commit hooks call this wrapper so commit messages
include a `Build: pass | Tests: 12 failed | 1936 passed (1948)` footer
visible in `git log`.

## Free-Text Routing

When the agent receives a bot-addressed message, `classifyReportIntent()` in
`src/report-intent.ts` checks a set of conservative regexes and routes to
the matching structured report instead of the LLM path. This means an
operator typing "how much disk?" gets a fresh `/disk` snapshot, not a
half-remembered narrative from a session three days ago.

The routing rules are intentionally **narrow** (false negatives are fine,
false positives are not). For broader detection of "this prompt smells
operational", a separate `isOpsFlavored()` matcher casts a wider net over
phrasings (services, jails, deploy, controlplane terms, etc.). It is
used to **suppress memory injection** on those prompts so the LLM answers
from live tools rather than narrative recall.

| Function                     | Use                                                                |
| ---------------------------- | ------------------------------------------------------------------ |
| `classifyReportIntent(text)` | Hard route → structured report. Only fires on confident phrasings. |
| `isOpsFlavored(text)`        | Soft signal → drop memory injection. Wider net, lower bar.         |

Both ignore slash-command messages (those are routed by grammy).
`@assistant` mentions are stripped before matching.

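The two-tier idea can be sketched as follows. The regexes, the `ReportKind` values, and the function names ending in `Sketch` are invented for illustration and are not the patterns in `src/report-intent.ts`; only the narrow-versus-wide split, the slash-command exclusion, and the `@assistant` stripping come from the text above.

```ts
type ReportKind = "disk" | "tasks";

// Narrow, hypothetical patterns: fire only on confident phrasings.
const REPORT_PATTERNS: Array<[ReportKind, RegExp]> = [
  ["disk", /\bhow much disk\b|\bdisk (usage|space)\b/i],
  ["tasks", /\b(what|which) tasks are (queued|running)\b/i],
];

// Wider, hypothetical pattern: anything that smells operational.
const OPS_PATTERN = /\b(disk|services?|jails?|deploy\w*|controlplane)\b/i;

// Strip @assistant mentions; return null for slash commands, which are routed elsewhere.
function normalizeSketch(text: string): string | null {
  const stripped = text.replace(/@assistant\b/gi, "").trim();
  return stripped.startsWith("/") ? null : stripped;
}

// Hard route: returns a report kind, or null to fall through to the LLM path.
function classifyReportIntentSketch(text: string): ReportKind | null {
  const t = normalizeSketch(text);
  if (t === null) return null;
  for (const [kind, re] of REPORT_PATTERNS) {
    if (re.test(t)) return kind;
  }
  return null;
}

// Soft signal: true means "skip memory injection for this prompt".
function isOpsFlavoredSketch(text: string): boolean {
  const t = normalizeSketch(text);
  return t !== null && OPS_PATTERN.test(t);
}
```

With these example patterns, "is the deploy healthy?" is not routed to any report but still counts as ops-flavored, which is the asymmetry the table describes.
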
## Why Pure Builders

The pure builder pattern was a deliberate choice over a one-shot
"render-to-HTML now" approach. Three reasons:

- **Testable**: unit tests exercise the analysis logic with synthetic
  inputs, no Postgres or pi running.
- **Reusable**: the same `buildDiskReport()` could feed a dashboard widget
  or a daily email digest later. We are not committed to Telegram as the
  only sink.
- **Inspectable**: when an operator asks "why did the report flag this?",
  the answer is a `findings[]` array with explicit codes, not opaque text
  generation.

If you add a new report, follow the same shape: a `Report` interface, a
`buildXxxReport()` function whose result carries
`findings: XxxReportFinding[]` and `operatorNotes: XxxReportOperatorNote[]`,
a `renderXxxReport()` HTML renderer, and a `*.test.ts` covering the builder
independently.