---
title: 'Monitoring Model'
---

Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".

## Runtime Health Files

The running process writes state into:

- `data/health/host.json`
- `data/health/pipeline.json`
- `data/health/jail.json`

Inspected by `just doctor`.

## Monitoring Layers

### 1. Host Health

Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?

### 2. Pipeline Health

Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

### 3. Jail Health (Warden)

Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

### 4. Watchdog

The `Watchdog` class in `src/watchdog.ts` runs two timers:

- **health timer (60s)** — reads free memory (`sysctl vm.stats.vm.v_free_count`), throttles queue concurrency to 1 if below threshold
- **control plane timer (5 min)** — runs `runControlPlaneChecks()`, stores the latest `ControlPlaneReport`

The watchdog listens on a Unix socket at `${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock` for IPC. Override via `AGENT_TMP_DIR` if needed.

Send `{"cmd":"status"}\n` to get current mode, throttle state, memory, active/queued jails, and the latest control plane report.

### 5. Control Plane

`src/controlplane.ts` checks service jails and system state. Runs at startup (before `initDatabase()`) and every 5 minutes via the watchdog timer.

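
The two watchdog timers, including the periodic control plane run, can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/watchdog.ts`: `WatchdogDeps`, `healthTick`, `startWatchdogTimers`, the threshold parameter, and the choice to restore the normal limit once memory recovers are all assumptions. Only the two intervals, the throttle-to-1 rule, and the `runControlPlaneChecks` name come from the text.

```typescript
// Sketch only: names other than runControlPlaneChecks are hypothetical.
const HEALTH_INTERVAL_MS = 60_000;            // health timer (60s)
const CONTROL_PLANE_INTERVAL_MS = 5 * 60_000; // control plane timer (5 min)

interface WatchdogDeps<Report> {
  readFreeBytes: () => number;                  // e.g. free page count * page size
  setConcurrency: (limit: number) => void;      // hook into the job queue
  runControlPlaneChecks: () => Promise<Report>; // stand-in for the real check runner
}

// One health tick: throttle to 1 when free memory is below the threshold,
// otherwise use the normal limit (restoring it is an assumption).
function healthTick(freeBytes: number, thresholdBytes: number, normalLimit: number): number {
  return freeBytes < thresholdBytes ? 1 : normalLimit;
}

function startWatchdogTimers<Report>(
  deps: WatchdogDeps<Report>,
  thresholdBytes: number,
  normalLimit: number,
): { latestReport: () => Report | undefined; stop: () => void } {
  let latest: Report | undefined;

  const health = setInterval(() => {
    deps.setConcurrency(healthTick(deps.readFreeBytes(), thresholdBytes, normalLimit));
  }, HEALTH_INTERVAL_MS);

  const control = setInterval(() => {
    deps
      .runControlPlaneChecks()
      .then((report) => {
        latest = report; // store the latest report for status queries
      })
      .catch(() => {
        // keep the previous report if a run fails
      });
  }, CONTROL_PLANE_INTERVAL_MS);

  return {
    latestReport: () => latest,
    stop: () => {
      clearInterval(health);
      clearInterval(control);
    },
  };
}
```

Keeping the throttle decision in a pure function (`healthTick`) makes it testable without waiting on real timers, which is the main reason for the split in this sketch.
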
Checks:

| Check                  | Method                         | Fix if failing                             |
| ---------------------- | ------------------------------ | ------------------------------------------ |
| hostd reachable        | TCP connect to socket          | none (can't self-fix)                      |
| Data Service available | `jls -q name` or host DB probe | `hostd('bastille-start')` when jail-backed |
| Git Service running    | `jls -q name`                  | `hostd('bastille-start')`                  |
| Web Service running    | `jls -q name`                  | `hostd('bastille-start')`                  |
| PF enabled             | `pfctl -s info`                | `hostd('pf-enable')`                       |

Severity:

- Jail failures → `fail` (db down = cannot start)
- PF disabled, hostd unreachable → `warn` (agent can run, degraded)

## Metrics Endpoints

When the metrics server is enabled, it exposes two lightweight HTTP endpoints:

- `/metrics` — Prometheus text-format counters and gauges for scraping
- `/healthz` — minimal liveness probe that returns `ok`

Use them for different purposes:

- `/healthz` answers: is the metrics listener up?
- `/metrics` answers: what counters and gauges is the runtime exposing?
- `just doctor` answers: is the system actually healthy?

Do not treat `/healthz` as a replacement for `just doctor`. A live metrics listener does not guarantee that the pipeline, jails, control plane, or service checks are healthy.

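
To make the difference between the two endpoints concrete, here is a small sketch that interprets their response bodies. It assumes the caller has already fetched the responses; the function names, the HTTP 200 expectation for `/healthz`, and the sample metric names are hypothetical and not taken from the Clawdie source. The `# TYPE <name> <type>` line shape is part of the standard Prometheus text format.

```typescript
// Sketch only: helper names and sample metric names are made up.

// /healthz: the listener is up when it answers with body `ok`
// (treating HTTP 200 as required is an assumption).
function isListenerUp(httpStatus: number, body: string): boolean {
  return httpStatus === 200 && body.trim() === 'ok';
}

// /metrics: collect metric names and declared types from `# TYPE` lines.
function listMetricTypes(metricsBody: string): Record<string, string> {
  const found: Record<string, string> = {};
  for (const rawLine of metricsBody.split('\n')) {
    const match = /^# TYPE (\S+) (\S+)$/.exec(rawLine.trim());
    if (match) {
      const [, metricName, metricKind] = match;
      if (metricName !== undefined && metricKind !== undefined) {
        found[metricName] = metricKind;
      }
    }
  }
  return found;
}

// Hypothetical scrape output; real metric names depend on the runtime.
const sampleScrape = [
  '# HELP example_messages_total Messages routed (made-up metric).',
  '# TYPE example_messages_total counter',
  'example_messages_total 42',
  '# TYPE example_active_jails gauge',
  'example_active_jails 1',
].join('\n');

const metricTypes = listMetricTypes(sampleScrape);
console.log(isListenerUp(200, 'ok\n'), metricTypes);
```

Neither helper says anything about pipeline, jail, or control plane health; that remains the job of `just doctor`.
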
## Doctor Command

```sh
just doctor
```

Reports (in order):

- overall status
- latest host heartbeats
- latest Telegram and pipeline activity
- latest jail success/failure
- Stripe status
- watchdog mode, memory, active/queued jails
- control plane check results per service
- dnsmasq service/listener state and per-host DNS resolution through both loopback and gateway resolvers
- TLS certificate expiry for the public Clawdie and docs certificates
- acme.sh renewal cron presence (`ACME_RENEWAL_CRON`)
- morning report scheduler state, cron expression, next run, last run, and latest task log
- Layered Memory Fabric DB availability and row counts

Exit codes:

- `STATUS: ok` → exit 0
- `STATUS: warn` → exit 0 (degraded but running)
- `STATUS: error` → exit 1 (action required)

Note: the built-in knowledge artifact is committed in `bootstrap/skills-memory/`. If it is missing or the database has not imported the current artifact version, `just doctor` reports that as a warning so the runtime can continue while the operator refreshes or imports the artifact.

Session safety:

- Pi sessions (`groups//sessions/*.jsonl`) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
- Use `AGENT_SESSION_MAX_BYTES` to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
- Additional prompt guardrails limit resource abuse:
  - `AGENT_MAX_INBOUND_CHARS` — truncates inbound messages exceeding this length
  - `AGENT_MAX_BACKLOG_MESSAGES` — caps the number of historical messages included in a prompt
  - `AGENT_MAX_BACKLOG_CHARS` — caps total character count of the backlog
  - `AGENT_MAX_PROMPT_CHARS` — hard limit on the final assembled prompt size

Timestamps are printed in European format (`DD.mmm.YYYY HH:MM`).

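
The four prompt guardrails listed under session safety can be pictured as successive caps applied while a prompt is assembled. The sketch below is one plausible reading, not the runtime's implementation: the `PromptLimits` type, the `buildPrompt` function, the order of the steps, the newest-first backlog policy, and the choice to keep the tail of an oversized prompt are all assumptions. Only the four variable names and their stated purposes come from the text.

```typescript
// Sketch only: PromptLimits and buildPrompt are hypothetical names.
interface PromptLimits {
  maxInboundChars: number;    // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number;    // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number;     // AGENT_MAX_PROMPT_CHARS
}

function buildPrompt(inbound: string, backlog: string[], limits: PromptLimits): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, limits.maxInboundChars);

  // 2. Keep at most N of the most recent backlog messages.
  const recent = limits.maxBacklogMessages > 0 ? backlog.slice(-limits.maxBacklogMessages) : [];

  // 3. Walk from newest to oldest while the character budget allows (assumed policy).
  const kept: string[] = [];
  let usedChars = 0;
  for (let i = recent.length - 1; i >= 0; i--) {
    const entry = recent[i];
    if (entry === undefined || usedChars + entry.length > limits.maxBacklogChars) break;
    kept.unshift(entry);
    usedChars += entry.length;
  }

  // 4. Assemble, then enforce the hard cap, keeping the tail so the
  //    inbound message survives (assumed policy).
  const assembled = kept.concat(message).join('\n');
  return assembled.length > limits.maxPromptChars
    ? assembled.slice(assembled.length - limits.maxPromptChars)
    : assembled;
}

const exampleLimits: PromptLimits = {
  maxInboundChars: 10,
  maxBacklogMessages: 2,
  maxBacklogChars: 8,
  maxPromptChars: 100,
};
console.log(buildPrompt('hello world!!', ['aaaa', 'bbbb', 'cccc', 'dddd'], exampleLimits));
```

In a real deployment the limits would be read from the environment variables named above; the sketch takes them as plain numbers to stay self-contained.
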
## Why This Split Exists

A running PID can hide real failures:

- Telegram intake dead
- scheduler stalled
- jail execution failing
- service jails down
- PF disabled (no public web traffic)

Each layer catches a different class of failure.

## Bastille's Role

Bastille monitor and Clawdie doctor solve different problems:

- **Bastille monitor** — jail service watchdog at the OS level
- **Clawdie doctor** — application, pipeline, and control plane health

Use both; don't confuse them.

## Operator-Facing Reports

Beyond the runtime health files above, the agent exposes a family of **structured reports** for operator inspection on demand. Each report has a matching Telegram slash command and follows the same `Observed` / `Interpretation` / `Operator Notes` template — see [Structured Reports](./structured-reports/) for the design and the full list.

| Report     | Command          | What it answers                                 |
| ---------- | ---------------- | ----------------------------------------------- |
| System     | `/report`        | Are services + jails + controlplane healthy?    |
| Disk       | `/disk`          | What is consuming ZFS pool space and snapshots? |
| Tasks      | `/tasks`         | What is in the controlplane task queue?         |
| Budget     | `/budgetreport`  | Token budgets and burn analytics                |
| Publish    | `/publishreport` | Tenant publish/content state                    |
| Test/Build | `/testreport`    | Was the last build/test run green?              |

`/testreport` is fed by `scripts/write-test-build-status.sh`, not by the running process — invoke the wrapper from CI, a hook, or by hand to refresh its status files. The pre-commit and post-commit hooks run it automatically so each commit message footer reflects what was passing at commit time.

For the full operator command reference (status, sessions, admin actions, free-text routing), see [Operator Commands](./operator-commands/).

## Provider Fallback Health

When the configured LLM provider is in cooldown (e.g. zAI usage cap), the agent transparently routes to the operator-defined fallback. Active cooldowns are visible in `/policy` and as structured `logger.warn` lines on every fallback-active run.

See [Provider Fallback](./provider-fallback/) for configuration, manual release (`/clearcooldown`), and the configured / effective / actual observability triple.
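
The routing rule described above (use the fallback while the configured provider is in cooldown) can be illustrated with a small sketch. Everything here is hypothetical: the `ProviderState` shape, the `effectiveProvider` function, the way cooldowns are stored, and the fallback provider name are assumptions for illustration and do not describe the agent's actual code.

```typescript
// Sketch only: types, names, and cooldown storage are hypothetical.
interface ProviderState {
  configured: string;                      // provider named in configuration
  fallback: string;                        // operator-defined fallback
  cooldownUntilMs: Record<string, number>; // provider -> epoch ms when its cooldown ends
}

function effectiveProvider(
  state: ProviderState,
  nowMs: number,
): { provider: string; fallbackActive: boolean } {
  const until = state.cooldownUntilMs[state.configured];
  const inCooldown = until !== undefined && nowMs < until;
  return inCooldown
    ? { provider: state.fallback, fallbackActive: true }
    : { provider: state.configured, fallbackActive: false };
}

const exampleState: ProviderState = {
  configured: 'zAI',
  fallback: 'example-fallback', // made-up provider name
  cooldownUntilMs: { zAI: 1_000 },
};

console.log(effectiveProvider(exampleState, 500));   // during cooldown
console.log(effectiveProvider(exampleState, 2_000)); // after cooldown
```

A `fallbackActive` flag like the one returned here is the kind of signal that would drive a warning log line on fallback-active runs.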