| title |
|---|
| Monitoring Model |
Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".
Runtime Health Files
The running process writes state into:
- data/health/host.json
- data/health/pipeline.json
- data/health/jail.json
Inspected by just doctor.
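A minimal sketch of inspecting these files by hand, assuming each one holds a single JSON document. The function name readHealthFiles and the null-on-error behavior are illustrative and not part of Clawdie; the fields inside each file are not specified here, so the parsed content is left untyped.

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

// File names come from the list above; everything else is hypothetical.
const HEALTH_FILES = ["host.json", "pipeline.json", "jail.json"] as const;

type HealthSnapshot = Record<string, unknown>;

// Reads each health file and returns its parsed content, or null when the
// file is missing or does not contain valid JSON.
function readHealthFiles(baseDir = "data/health"): HealthSnapshot {
  const snapshot: HealthSnapshot = {};
  for (const name of HEALTH_FILES) {
    try {
      snapshot[name] = JSON.parse(readFileSync(join(baseDir, name), "utf8"));
    } catch {
      snapshot[name] = null;
    }
  }
  return snapshot;
}
```

A null entry only says the file could not be read or parsed; just doctor remains the supported way to interpret the contents.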
Monitoring Layers
1. Host Health
Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.
Answers: is the main process alive and making progress?
2. Pipeline Health
Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.
Answers: are messages actually flowing?
3. Jail Health (Warden)
Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.
Answers: is the isolated executor working?
4. Watchdog
The Watchdog class in src/watchdog.ts runs two timers:
- health timer (60s): reads free memory (sysctl vm.stats.vm.v_free_count), throttles queue concurrency to 1 if below threshold
- control plane timer (5 min): runs runControlPlaneChecks(), stores the latest ControlPlaneReport
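The health timer's decision can be sketched as a pure function. This assumes the threshold is expressed in free pages, the unit reported by vm.stats.vm.v_free_count. The names chooseConcurrency, parseFreePages, minFreePages, and normalConcurrency are hypothetical, and the real Watchdog class may be structured differently.

```typescript
// Hypothetical helper: returns the queue concurrency for one health tick.
function chooseConcurrency(
  freePages: number,
  minFreePages: number,
  normalConcurrency: number,
): number {
  // Below the threshold, throttle to a single concurrent jail run.
  return freePages < minFreePages ? 1 : normalConcurrency;
}

// Parses sysctl output in either form:
//   "vm.stats.vm.v_free_count: 12345"  (default output)
//   "12345"                            (output of sysctl -n)
function parseFreePages(sysctlOutput: string): number {
  const raw = sysctlOutput.includes(":")
    ? sysctlOutput.slice(sysctlOutput.lastIndexOf(":") + 1)
    : sysctlOutput;
  const pages = Number.parseInt(raw.trim(), 10);
  if (Number.isNaN(pages)) {
    throw new Error(`unexpected sysctl output: ${sysctlOutput}`);
  }
  return pages;
}
```

Keeping the decision separate from the sysctl call makes the throttle rule easy to test without a FreeBSD host.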
The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock.
Set AGENT_TMP_DIR to change the base directory. Send {"cmd":"status"}\n to get the current mode, throttle state, memory,
active/queued jails, and the latest control plane report.
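A status query over this socket might look like the sketch below. It assumes the reply is a single newline-terminated JSON document; the reply schema is not documented here, so the result is returned untyped. The function names watchdogSocketPath and queryWatchdogStatus and the timeout default are hypothetical.

```typescript
import { createConnection } from "node:net";

// Path layout taken from the text above; `agent` is the agent name.
function watchdogSocketPath(tmpDir: string, agent: string): string {
  return `${tmpDir}/ipc/${agent}-watchdog.sock`;
}

// Sends {"cmd":"status"}\n and resolves with the first line of the reply,
// parsed as JSON. Rejects on socket errors, timeout, or an early close.
function queryWatchdogStatus(socketPath: string, timeoutMs = 2000): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const socket = createConnection(socketPath);
    let buffer = "";
    socket.setTimeout(timeoutMs, () => {
      socket.destroy(new Error("watchdog status timed out"));
    });
    socket.on("connect", () => socket.write('{"cmd":"status"}\n'));
    socket.on("data", (chunk) => {
      buffer += String(chunk);
      const newline = buffer.indexOf("\n");
      if (newline === -1) return; // wait for the rest of the line
      socket.end();
      try {
        resolve(JSON.parse(buffer.slice(0, newline)));
      } catch (err) {
        reject(err);
      }
    });
    socket.on("error", reject);
    // No effect if the promise has already settled.
    socket.on("close", () => reject(new Error("socket closed before a reply arrived")));
  });
}
```

For example, queryWatchdogStatus(watchdogSocketPath("/tmp/clawdie", "main")) would query an agent named main under a hypothetical /tmp/clawdie base directory.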
5. Control Plane
src/controlplane.ts checks service jails and system state. Runs at startup
(before initDatabase()) and every 5 minutes via the watchdog timer.
Checks:
| Check | Method | Fix if failing |
|---|---|---|
| hostd reachable | TCP connect to socket | none (can't self-fix) |
| Data Service available | jls -q name or host DB probe | hostd('bastille-start') when jail-backed |
| Git Service running | jls -q name | hostd('bastille-start') |
| Web Service running | jls -q name | hostd('bastille-start') |
| PF enabled | pfctl -s info | hostd('pf-enable') |
Severity:
- Jail failures → fail (db down = cannot start)
- PF disabled, hostd unreachable → warn (agent can run, degraded)
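The severity rules above can be sketched as a small mapping. The CheckResult shape is hypothetical (the real ControlPlaneReport may differ), and taking the worst individual result as the overall severity is an assumption rather than documented behavior.

```typescript
type Severity = "ok" | "warn" | "fail";

// Hypothetical shape of one control plane check result.
interface CheckResult {
  name: string;
  passed: boolean;
  kind: "jail" | "pf" | "hostd";
}

// Jail failures are fatal; PF and hostd failures only degrade the agent.
function severityOf(check: CheckResult): Severity {
  if (check.passed) return "ok";
  return check.kind === "jail" ? "fail" : "warn";
}

// Assumption: the overall severity is the worst individual result.
function overallSeverity(checks: CheckResult[]): Severity {
  const rank: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };
  return checks
    .map(severityOf)
    .reduce<Severity>((worst, s) => (rank[s] > rank[worst] ? s : worst), "ok");
}
```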
Metrics Endpoints
When the metrics server is enabled, it exposes two lightweight HTTP endpoints:
- /metrics: Prometheus text-format counters and gauges for scraping
- /healthz: minimal liveness probe that returns ok
Use them for different purposes:
- /healthz answers: is the metrics listener up?
- /metrics answers: what counters and gauges is the runtime exposing?
- just doctor answers: is the system actually healthy?
Do not treat /healthz as a replacement for just doctor. A live metrics
listener does not guarantee that the pipeline, jails, control plane, or service
checks are healthy.
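For ad hoc inspection of a /metrics response, a simplified parser for the Prometheus text format is sketched below. It is not a full implementation of the format: labels stay part of the key, comment lines and timestamps are dropped, and special values such as +Inf are not handled. The metric names in the usage note are invented for illustration.

```typescript
// Parses Prometheus text-format samples into a map of sample key -> value.
function parseMetrics(text: string): Map<string, number> {
  const samples = new Map<string, number>();
  // name, optional {labels}, value, optional integer timestamp
  const sampleLine =
    /^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)(?:\s+-?\d+)?$/;
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const match = sampleLine.exec(trimmed);
    if (match) samples.set(match[1], Number(match[2]));
  }
  return samples;
}
```

Given a body containing the hypothetical line clawdie_example_total 3, parseMetrics(body).get("clawdie_example_total") would return 3. A non-empty result still says nothing about pipeline or jail health.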
Doctor Command
just doctor
Reports (in order):
- overall status
- latest host heartbeats
- latest Telegram and pipeline activity
- latest jail success/failure
- Stripe status
- watchdog mode, memory, active/queued jails
- control plane check results per service
- dnsmasq service/listener state and per-host DNS resolution through both loopback and gateway resolvers
- TLS certificate expiry for the public Clawdie and docs certificates
- acme.sh renewal cron presence (ACME_RENEWAL_CRON)
- morning report scheduler state, cron expression, next run, last run, and latest task log
- Layered Memory Fabric DB availability and row counts
Exit codes:
- STATUS: ok → exit 0
- STATUS: warn → exit 0 (degraded but running)
- STATUS: error → exit 1 (action required)
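The mapping from status to exit code can be sketched as follows. It assumes the status appears on its own line in the form STATUS: value; the function names are hypothetical and this is not the actual just doctor implementation.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Extracts the status from a line such as "STATUS: warn", or returns null.
function parseStatusLine(line: string): DoctorStatus | null {
  const match = /^STATUS:\s*(ok|warn|error)$/.exec(line.trim());
  return match ? (match[1] as DoctorStatus) : null;
}

// Only "error" produces a non-zero exit; "warn" means degraded but running.
function exitCodeFor(status: DoctorStatus): 0 | 1 {
  return status === "error" ? 1 : 0;
}
```

Because warn exits 0, a script that needs to react to degraded states has to read the status line rather than rely on the exit code alone.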
Note: the built-in knowledge artifact is committed in bootstrap/skills-memory/. If it is missing or the database has not imported the current artifact version, just doctor reports that as a warning so the runtime can continue while the operator refreshes or imports the artifact.
Session safety:
- Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
- Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when the cap is exceeded (silent by default).
- Additional prompt guardrails limit resource abuse:
  - AGENT_MAX_INBOUND_CHARS: truncates inbound messages exceeding this length
  - AGENT_MAX_BACKLOG_MESSAGES: caps the number of historical messages included in a prompt
  - AGENT_MAX_BACKLOG_CHARS: caps total character count of the backlog
  - AGENT_MAX_PROMPT_CHARS: hard limit on the final assembled prompt size
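How the four prompt guardrails could interact is sketched below. The order of operations, the choice to keep the newest backlog entries, and the choice to trim the final prompt from the front are all assumptions made for illustration; the names PromptLimits and assemblePrompt are hypothetical and the runtime's real assembly logic is not shown in this document.

```typescript
// Each field mirrors one environment variable from the list above.
interface PromptLimits {
  maxInboundChars: number; // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number; // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number; // AGENT_MAX_PROMPT_CHARS
}

function assemblePrompt(backlog: string[], inbound: string, limits: PromptLimits): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, limits.maxInboundChars);

  // 2. Keep the most recent backlog entries, bounded by count and characters.
  const recent =
    limits.maxBacklogMessages > 0 ? backlog.slice(-limits.maxBacklogMessages) : [];
  const kept: string[] = [];
  let used = 0;
  for (const entry of recent.reverse()) {
    if (used + entry.length > limits.maxBacklogChars) break;
    kept.unshift(entry); // restore chronological order
    used += entry.length;
  }

  // 3. Hard cap on the final prompt; trims from the front so the inbound
  //    message at the end survives.
  const prompt = [...kept, message].join("\n");
  return prompt.length > limits.maxPromptChars
    ? prompt.slice(prompt.length - limits.maxPromptChars)
    : prompt;
}
```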
Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
Why This Split Exists
A running PID can hide real failures:
- Telegram intake dead
- scheduler stalled
- jail execution failing
- service jails down
- PF disabled (no public web traffic)
Each layer catches a different class of failure.
Bastille's Role
Bastille monitor and Clawdie doctor solve different problems:
- Bastille monitor — jail service watchdog at the OS level
- Clawdie doctor — application, pipeline, and control plane health
Use both; don't confuse them.
Operator-Facing Reports
Beyond the runtime health files above, the agent exposes a family of
structured reports for operator inspection on demand. Each report has a
matching Telegram slash command and follows the same Observed /
Interpretation / Operator Notes template — see
Structured Reports for the design and the full list.
| Report | Command | What it answers |
|---|---|---|
| System | /report | Are services + jails + controlplane healthy? |
| Disk | /disk | What is consuming ZFS pool space and snapshots? |
| Tasks | /tasks | What is in the controlplane task queue? |
| Budget | /budgetreport | Token budgets and burn analytics |
| Publish | /publishreport | Tenant publish/content state |
| Test/Build | /testreport | Was the last build/test run green? |
/testreport is fed by scripts/write-test-build-status.sh, not by the
running process — invoke the wrapper from CI, a hook, or by hand to refresh
its status files. The pre-commit and post-commit hooks run it automatically
so each commit message footer reflects what was passing at commit time.
For the full operator command reference (status, sessions, admin actions, free-text routing), see Operator Commands.
Provider Fallback Health
When the configured LLM provider is in cooldown (e.g. zAI usage cap), the
agent transparently routes to the operator-defined fallback. Active
cooldowns are visible in /policy and as structured logger.warn lines on
every fallback-active run. See Provider Fallback for
configuration, manual release (/clearcooldown), and the
configured / effective / actual observability triple.