clawdie-ai/docs/public/operate/monitoring.md
Sam & Claude faf060e0ce docs: introduce Layered Memory Fabric terminology (Sam & Codex)
2026-06-13 21:32:50 +02:00

Monitoring Model

Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".

Runtime Health Files

The running process writes state into:

  • data/health/host.json
  • data/health/pipeline.json
  • data/health/jail.json

Inspect them with just doctor.
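As a rough illustration of how an external check might notice that these files have stopped updating, here is a minimal sketch that judges freshness by file modification time. The helper names (`classifyAge`, `checkHealthFile`) and the two-minute budget are assumptions for illustration only; the JSON schema of the files is not described here, so the sketch does not parse them.

```typescript
import { statSync } from "node:fs";

type Freshness = "fresh" | "stale" | "missing";

// Pure classification: compare a last-write time against an age budget.
function classifyAge(lastWriteMs: number | undefined, nowMs: number, maxAgeMs: number): Freshness {
  if (lastWriteMs === undefined) return "missing";
  return nowMs - lastWriteMs > maxAgeMs ? "stale" : "fresh";
}

// Look at one health file on disk; a missing file is reported, not thrown.
function checkHealthFile(path: string, maxAgeMs: number, nowMs: number = Date.now()): Freshness {
  const stats = statSync(path, { throwIfNoEntry: false });
  return classifyAge(stats?.mtimeMs, nowMs, maxAgeMs);
}

// Paths are the ones listed above; the 2-minute budget is an assumed value.
const HEALTH_FILES = [
  "data/health/host.json",
  "data/health/pipeline.json",
  "data/health/jail.json",
];
for (const file of HEALTH_FILES) {
  console.log(`${file}: ${checkHealthFile(file, 120000)}`);
}
```

A file that exists but is stale is the interesting case: the process may still be running while no longer writing state.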

Monitoring Layers

1. Host Health

Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?

2. Pipeline Health

Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

3. Jail Health (Warden)

Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

4. Watchdog

The Watchdog class in src/watchdog.ts runs two timers:

  • health timer (60s) — reads free memory (sysctl vm.stats.vm.v_free_count), throttles queue concurrency to 1 if below threshold
  • control plane timer (5 min) — runs runControlPlaneChecks(), stores the latest ControlPlaneReport
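The health timer's throttle decision can be sketched as a small pure function. This is not the code in src/watchdog.ts; it assumes the threshold is expressed in bytes and that the page size is read separately (for example from `hw.pagesize`), since `vm.stats.vm.v_free_count` reports a page count. The names `MemorySample` and `chooseConcurrency` are hypothetical.

```typescript
interface MemorySample {
  freePages: number;     // value of vm.stats.vm.v_free_count
  pageSizeBytes: number; // assumed to come from hw.pagesize
}

// Convert the free page count into bytes.
function freeBytes(sample: MemorySample): number {
  return sample.freePages * sample.pageSizeBytes;
}

// Return the queue concurrency to apply: 1 when free memory is below the
// threshold, otherwise the normal value.
function chooseConcurrency(
  sample: MemorySample,
  thresholdBytes: number,
  normalConcurrency: number,
): number {
  return freeBytes(sample) < thresholdBytes ? 1 : normalConcurrency;
}

// Example: 1000 free pages of 4 KiB is about 4 MB, far below a 512 MiB threshold.
const lowMemory: MemorySample = { freePages: 1000, pageSizeBytes: 4096 };
console.log(chooseConcurrency(lowMemory, 512 * 1024 * 1024, 4));
```

Keeping the decision separate from the sysctl call makes the threshold logic testable without a FreeBSD host.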

The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock; set AGENT_TMP_DIR to relocate it. Send {"cmd":"status"} followed by a newline to get the current mode, throttle state, memory, active/queued jails, and the latest control plane report.
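A minimal client sketch for the status command follows. The request format and socket path pattern come from the description above; the reply framing (one newline-terminated JSON document) and the function names are assumptions, so treat the parsing as illustrative.

```typescript
import { createConnection } from "node:net";

// Build the socket path from AGENT_TMP_DIR and the agent name.
function watchdogSocketPath(tmpDir: string, agent: string): string {
  return `${tmpDir}/ipc/${agent}-watchdog.sock`;
}

// The request is a single JSON object terminated by a newline.
function encodeStatusRequest(): string {
  return JSON.stringify({ cmd: "status" }) + "\n";
}

// Send the status command and resolve with the parsed reply.
// Assumption: the watchdog answers with one newline-terminated JSON document.
function queryWatchdog(socketPath: string, timeoutMs = 2000): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const socket = createConnection({ path: socketPath });
    let buffer = "";
    socket.setEncoding("utf8");
    socket.setTimeout(timeoutMs, () => socket.destroy(new Error("watchdog status timed out")));
    socket.once("connect", () => socket.write(encodeStatusRequest()));
    socket.on("data", (chunk) => {
      buffer += String(chunk);
      const newline = buffer.indexOf("\n");
      if (newline === -1) return; // keep reading until a full line arrives
      socket.end();
      try {
        resolve(JSON.parse(buffer.slice(0, newline)));
      } catch (err) {
        reject(err);
      }
    });
    socket.once("error", reject);
  });
}

// Example (needs a running watchdog; "clawdie" and "/tmp" are placeholder values):
// queryWatchdog(watchdogSocketPath(process.env.AGENT_TMP_DIR ?? "/tmp", "clawdie")).then(console.log);
```

The reply is left as `unknown` on purpose, because the exact fields of the status payload are not specified here.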

5. Control Plane

src/controlplane.ts checks service jails and system state. It runs once at startup (before initDatabase()) and then every 5 minutes via the watchdog's control plane timer.

Checks:

| Check | Method | Fix if failing |
| --- | --- | --- |
| hostd reachable | TCP connect to socket | none (can't self-fix) |
| Data Service available | jls -q name or host DB probe | hostd('bastille-start') when jail-backed |
| Git Service running | jls -q name | hostd('bastille-start') |
| Web Service running | jls -q name | hostd('bastille-start') |
| PF enabled | pfctl -s info | hostd('pf-enable') |

Severity:

  • Jail failures → fail (db down = cannot start)
  • PF disabled, hostd unreachable → warn (agent can run, degraded)
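The severity rules above can be sketched as a lookup plus a worst-of reduction. The check identifiers and the idea that the report takes the worst individual severity are assumptions for illustration; the actual ControlPlaneReport shape is not shown in this document.

```typescript
type Severity = "ok" | "warn" | "fail";
type CheckName = "hostd" | "data-service" | "git-service" | "web-service" | "pf";

interface CheckResult {
  name: CheckName; // hypothetical identifiers for the checks in the table
  passed: boolean;
}

// Severity assigned when a given check fails, following the rules above:
// jail-backed services fail hard, PF and hostd only degrade.
const FAILURE_SEVERITY: Record<CheckName, Severity> = {
  "hostd": "warn",
  "data-service": "fail",
  "git-service": "fail",
  "web-service": "fail",
  "pf": "warn",
};

const RANK: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };

function severityOf(result: CheckResult): Severity {
  return result.passed ? "ok" : FAILURE_SEVERITY[result.name];
}

// Assumed aggregation: the report is as bad as its worst check.
function overallSeverity(results: CheckResult[]): Severity {
  return results
    .map(severityOf)
    .reduce<Severity>((worst, s) => (RANK[s] > RANK[worst] ? s : worst), "ok");
}

console.log(overallSeverity([
  { name: "pf", passed: false },
  { name: "git-service", passed: true },
]));
```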

Metrics Endpoints

When the metrics server is enabled, it exposes two lightweight HTTP endpoints:

  • /metrics — Prometheus text-format counters and gauges for scraping
  • /healthz — minimal liveness probe that returns ok

Use them for different purposes:

  • /healthz answers: is the metrics listener up?
  • /metrics answers: what counters and gauges is the runtime exposing?
  • just doctor answers: is the system actually healthy?

Do not treat /healthz as a replacement for just doctor. A live metrics listener does not guarantee that the pipeline, jails, control plane, or service checks are healthy.
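For ad hoc inspection of /metrics output, a small parser for the Prometheus text format can be enough. The sketch below is deliberately simplified: it ignores optional timestamps, does not handle the special values `+Inf` and `-Inf`, and keys each sample by its name plus label set. The metric names in the example are invented, and the listener address is not stated here, so fetching the body is left to the caller.

```typescript
// Parse Prometheus text exposition into a map of "name{labels}" -> value.
function parsePrometheusText(body: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const rawLine of body.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip HELP/TYPE comments
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{.*\})?)\s+(\S+)/.exec(line);
    if (!match) continue;
    const [, key, value] = match;
    if (key === undefined || value === undefined) continue;
    samples.set(key, Number(value));
  }
  return samples;
}

// Example with invented metric names.
const exampleBody = [
  "# TYPE demo_messages_total counter",
  "demo_messages_total 3",
  'demo_queue_depth{layer="host"} 1.5',
].join("\n");
console.log(parsePrometheusText(exampleBody));
```

A successful parse only shows that the listener is serving counters; it says nothing about whether the values indicate a healthy system.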

Doctor Command

just doctor

Reports (in order):

  • overall status
  • latest host heartbeats
  • latest Telegram and pipeline activity
  • latest jail success/failure
  • Stripe status
  • watchdog mode, memory, active/queued jails
  • control plane check results per service
  • dnsmasq service/listener state and per-host DNS resolution through both loopback and gateway resolvers
  • TLS certificate expiry for the public Clawdie and docs certificates
  • acme.sh renewal cron presence (ACME_RENEWAL_CRON)
  • morning report scheduler state, cron expression, next run, last run, and latest task log
  • Layered Memory Fabric DB availability and row counts

Exit codes:

  • STATUS: ok → exit 0
  • STATUS: warn → exit 0 (degraded but running)
  • STATUS: error → exit 1 (action required)
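Because warn and ok both exit 0, automation that wants to alert on a degraded system cannot rely on the exit code alone. The sketch below shows the mapping and one way to recover the status from captured output; it assumes the output contains a line beginning with `STATUS:`, and the function names are hypothetical.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Exit code per the list above: only "error" is non-zero.
function exitCodeFor(status: DoctorStatus): number {
  return status === "error" ? 1 : 0;
}

// Pull the status out of captured doctor output.
// Assumption: some line starts with "STATUS: <value>".
function parseStatusLine(output: string): DoctorStatus | undefined {
  const match = /^STATUS:\s*(ok|warn|error)\b/m.exec(output);
  return match ? (match[1] as DoctorStatus) : undefined;
}

const captured = "checks complete\nSTATUS: warn\n";
const status = parseStatusLine(captured);
console.log(status, status ? exitCodeFor(status) : "unknown");
```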

Note: the built-in knowledge artifact is committed in bootstrap/skills-memory/. If it is missing or the database has not imported the current artifact version, just doctor reports that as a warning so the runtime can continue while the operator refreshes or imports the artifact.

Session safety:

  • Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
  • Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
  • Additional prompt guardrails limit resource abuse:
    • AGENT_MAX_INBOUND_CHARS — truncates inbound messages exceeding this length
    • AGENT_MAX_BACKLOG_MESSAGES — caps the number of historical messages included in a prompt
    • AGENT_MAX_BACKLOG_CHARS — caps total character count of the backlog
    • AGENT_MAX_PROMPT_CHARS — hard limit on the final assembled prompt size
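One way the guardrails above could compose is sketched below. Only the environment variable names come from this document; the order of operations, the choice to keep the newest backlog messages, and the choice to keep the tail of an over-long prompt are assumptions, not a description of the runtime.

```typescript
interface PromptLimits {
  maxInboundChars: number;    // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number;    // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number;     // AGENT_MAX_PROMPT_CHARS
}

// Session cap (AGENT_SESSION_MAX_BYTES): start fresh once the file exceeds it.
function shouldStartFreshSession(sessionBytes: number, maxBytes: number): boolean {
  return sessionBytes > maxBytes;
}

// Keep the newest messages that fit both the count cap and the character cap.
function capBacklog(backlog: string[], limits: PromptLimits): string[] {
  const kept: string[] = [];
  let chars = 0;
  for (const message of [...backlog].reverse()) {
    if (kept.length >= limits.maxBacklogMessages) break;
    if (chars + message.length > limits.maxBacklogChars) break;
    chars += message.length;
    kept.unshift(message); // restore chronological order
  }
  return kept;
}

// Truncate the inbound message, prepend the capped backlog, then apply the
// hard limit by keeping the tail (assumed, so the newest content survives).
function assemblePrompt(inbound: string, backlog: string[], limits: PromptLimits): string {
  const trimmedInbound = inbound.slice(0, limits.maxInboundChars);
  const joined = [...capBacklog(backlog, limits), trimmedInbound].join("\n");
  return joined.length > limits.maxPromptChars
    ? joined.slice(joined.length - limits.maxPromptChars)
    : joined;
}
```

Each limit bounds a different input, so a single oversized message, a long history, and a large assembled prompt are all caught independently.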

Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
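For tooling that needs to produce matching timestamps, a formatter might look like the sketch below. It assumes that `mmm` means a lowercase three-letter English month abbreviation and it formats in UTC for determinism; the doctor output may use a different abbreviation style or the local time zone.

```typescript
// Assumed month abbreviations for the "mmm" part of DD.mmm.YYYY HH:MM.
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"];

function formatTimestamp(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const day = pad(date.getUTCDate());
  const month = MONTHS[date.getUTCMonth()];
  const year = date.getUTCFullYear();
  return `${day}.${month}.${year} ${pad(date.getUTCHours())}:${pad(date.getUTCMinutes())}`;
}

console.log(formatTimestamp(new Date(Date.UTC(2026, 5, 13, 21, 32))));
```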

Why This Split Exists

A running PID can hide real failures:

  • Telegram intake dead
  • scheduler stalled
  • jail execution failing
  • service jails down
  • PF disabled (no public web traffic)

Each layer catches a different class of failure.

Bastille's Role

Bastille monitor and Clawdie doctor solve different problems:

  • Bastille monitor — jail service watchdog at the OS level
  • Clawdie doctor — application, pipeline, and control plane health

Use both; don't confuse them.

Operator-Facing Reports

Beyond the runtime health files above, the agent exposes a family of structured reports for operator inspection on demand. Each report has a matching Telegram slash command and follows the same Observed / Interpretation / Operator Notes template — see Structured Reports for the design and the full list.

| Report | Command | What it answers |
| --- | --- | --- |
| System | /report | Are services + jails + controlplane healthy? |
| Disk | /disk | What is consuming ZFS pool space and snapshots? |
| Tasks | /tasks | What is in the controlplane task queue? |
| Budget | /budgetreport | Token budgets and burn analytics |
| Publish | /publishreport | Tenant publish/content state |
| Test/Build | /testreport | Was the last build/test run green? |

/testreport is fed by scripts/write-test-build-status.sh, not by the running process — invoke the wrapper from CI, a hook, or by hand to refresh its status files. The pre-commit and post-commit hooks run it automatically so each commit message footer reflects what was passing at commit time.

For the full operator command reference (status, sessions, admin actions, free-text routing), see Operator Commands.

Provider Fallback Health

When the configured LLM provider is in cooldown (e.g. zAI usage cap), the agent transparently routes to the operator-defined fallback. Active cooldowns are visible in /policy and as structured logger.warn lines on every fallback-active run. See Provider Fallback for configuration, manual release (/clearcooldown), and the configured / effective / actual observability triple.
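The routing rule described above can be sketched as follows. This models only the configured and effective parts of the triple; the data shapes, the function names, and the provider identifiers in the example are assumptions, and the real behaviour is defined in Provider Fallback.

```typescript
interface ProviderState {
  configured: string;                   // provider named in configuration
  fallback: string | undefined;         // operator-defined fallback, if any
  cooldownUntilMs: Map<string, number>; // provider -> cooldown expiry (epoch ms)
}

function inCooldown(state: ProviderState, provider: string, nowMs: number): boolean {
  const until = state.cooldownUntilMs.get(provider);
  return until !== undefined && nowMs < until;
}

// Effective provider: the configured one, unless it is cooling down and a
// fallback has been defined.
function effectiveProvider(state: ProviderState, nowMs: number): string {
  if (inCooldown(state, state.configured, nowMs) && state.fallback !== undefined) {
    return state.fallback;
  }
  return state.configured;
}

// Manual release, in the spirit of /clearcooldown.
function clearCooldown(state: ProviderState, provider: string): void {
  state.cooldownUntilMs.delete(provider);
}

const example: ProviderState = {
  configured: "zai",            // placeholder identifier
  fallback: "backup-provider",  // placeholder identifier
  cooldownUntilMs: new Map([["zai", 2000]]),
};
console.log(effectiveProvider(example, 1000)); // cooldown active at t=1000
```

The actual provider for a run is whatever ends up serving the request, which can differ from the effective one if the fallback itself fails; that case is not modelled here.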