Systematic review of all doc/, docs/internal/, docs/public/, ARCHITECTURE.md,
and README.md against recent codebase changes. 16 files updated:
Cross-cutting fixes (multiple files):
- Model references: anthropic/claude-3-5-sonnet → zai/glm-5-turbo (4 files)
- Port references: hardcoded 3100 → CONTROLPLANE_API_PORT (3 files)
- Skills mechanism: --no-skills + --append-system-prompt + skills_search (6 files)
- CONTROLPLANE_SHARED_SECRET: documented in security, architecture, install (5 files)
- Prompt guardrails: AGENT_MAX_INBOUND_CHARS etc. added to 3 files
- controlplane is NOT a jail — runs on host (3 files corrected)
- git jail added to layouts and IP tables (3 files)
- npm run → just (2 files)
Specific fixes:
- .env.example: AGENT_SESSION_MAX_BYTES session rollover hint
- README.md: fix IP layout (git=.6 not .4), add run-*.sh generation note
- ARCHITECTURE.md: add config vars, recipe count update, --no-skills
- doc/CONTROLPLANE-AGENT-ROLES.md: fix model, remove deleted file ref
- doc/CONTROLPLANE-ARCHITECTURE.md: port params, security, guardrails section
- doc/CONTROLPLANE-MESSAGE-CONTRACT.md: auth header, skills catalog rewrite
- doc/SESSION-HANDOFF-2026-04-18.md: fix Telegram (plain text not Markdown)
- doc/THREE-BIRD-ARCHITECTURE.md: fix 5 broken STRAPI-FREEBSD-GOTCHA refs
- doc/HANDOFF-PHASE7.md: mark sysprompt cleanup as done
- docs/internal/DOCUMENTATION.md: just CLI, tracked hooks, parameterized paths
- docs/internal/HEARTBEAT.md: add controlplane heartbeat reference, fix setup step
- docs/public/architecture/controlplane.md: phases 2-7 all ✅ DONE
- docs/public/architecture/freebsd-jail-implementation.md: git jail, Forgejo
- docs/public/architecture/warden.md: controlplane=host, git jail added
- docs/public/operate/monitoring.md: just doctor, all guardrail vars
- docs/public/operate/security.md: API auth, shell injection, guardrails
Build: pass | Tests: not run (Linux) (Sam & Claude)
| title |
|---|
| Monitoring Model |
Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".
Runtime Health Files
The running process writes state into:
- data/health/host.json
- data/health/pipeline.json
- data/health/jail.json
Inspected by just doctor.
Monitoring Layers
1. Host Health
Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.
Answers: is the main process alive and making progress?
2. Pipeline Health
Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.
Answers: are messages actually flowing?
3. Jail Health (Warden)
Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.
Answers: is the isolated executor working?
4. Watchdog
The Watchdog class in src/watchdog.ts runs two timers:
- health timer (60s): reads free memory (sysctl vm.stats.vm.v_free_count), throttles queue concurrency to 1 if below threshold
- control plane timer (5 min): runs runControlPlaneChecks(), stores the latest ControlPlaneReport
The watchdog listens on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock
for IPC. Override via AGENT_TMP_DIR if needed. Send {"cmd":"status"}\n to get current mode, throttle state, memory,
active/queued jails, and the latest control plane report.
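The status request described above can be sketched as a small client. This is a minimal illustration, not the project's own code: the function names are hypothetical, and the reply is assumed to be a single newline-terminated JSON line, which the real watchdog may or may not follow.

```typescript
import * as net from "node:net";

// Builds the socket path described above:
// ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock
export function watchdogSocketPath(tmpDir: string, agent: string): string {
  return `${tmpDir}/ipc/${agent}-watchdog.sock`;
}

// Sends {"cmd":"status"}\n and resolves with the first reply line parsed as JSON.
// Assumption: the watchdog answers with one newline-terminated JSON document.
export function queryWatchdogStatus(
  socketPath: string,
  timeoutMs = 2000,
): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection({ path: socketPath });
    let buffer = "";

    socket.setEncoding("utf8");
    // Fail the request if the watchdog stays silent for too long.
    socket.setTimeout(timeoutMs, () => {
      socket.destroy(new Error("watchdog status request timed out"));
    });

    socket.on("connect", () => {
      socket.write(JSON.stringify({ cmd: "status" }) + "\n");
    });

    socket.on("data", (chunk: Buffer | string) => {
      buffer += String(chunk);
      const newline = buffer.indexOf("\n");
      if (newline === -1) return; // reply not complete yet
      socket.end();
      try {
        resolve(JSON.parse(buffer.slice(0, newline)));
      } catch (err) {
        reject(err);
      }
    });

    socket.on("error", reject);
  });
}
```

With an agent named clawdie (a placeholder), a caller would pass watchdogSocketPath(process.env.AGENT_TMP_DIR ?? "/tmp", "clawdie") to queryWatchdogStatus and log the resolved object.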
5. Control Plane
src/controlplane.ts checks service jails and system state. Runs at startup
(before initDatabase()) and every 5 minutes via the watchdog timer.
Checks:
| Check | Method | Fix if failing |
|---|---|---|
| hostd reachable | TCP connect to socket | none (can't self-fix) |
| {agent}-db running | jls -q name | hostd('bastille-start') |
| {agent}-git running | jls -q name | hostd('bastille-start') |
| {agent}-cms running | jls -q name | hostd('bastille-start') |
| PF enabled | pfctl -s info | hostd('pf-enable') |
Severity:
- Jail failures → fail (db down = cannot start)
- PF disabled, hostd unreachable → warn (agent can run, degraded)
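The severity rules can be expressed as a small pure function. The sketch below is illustrative only: the CheckResult shape and all function names are assumptions, not the types used in src/controlplane.ts, and the jail check assumes that jls -q name prints one jail name per line.

```typescript
type Severity = "ok" | "warn" | "fail";

// Hypothetical result shape; the real ControlPlaneReport may look different.
interface CheckResult {
  name: string; // e.g. "hostd", "pf", or a jail name such as "clawdie-db"
  kind: "jail" | "pf" | "hostd";
  passed: boolean;
}

// True when `jail` is listed in the output of `jls -q name`
// (assumed to be one jail name per line).
function jailIsRunning(jlsOutput: string, jail: string): boolean {
  return jlsOutput
    .split("\n")
    .map((line) => line.trim())
    .some((line) => line === jail);
}

// Jail failures are fatal; PF and hostd failures only degrade the agent.
function severityOf(check: CheckResult): Severity {
  if (check.passed) return "ok";
  return check.kind === "jail" ? "fail" : "warn";
}

// The overall severity is the worst severity among all checks.
function overallSeverity(checks: CheckResult[]): Severity {
  const rank: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };
  return checks
    .map(severityOf)
    .reduce<Severity>((worst, s) => (rank[s] > rank[worst] ? s : worst), "ok");
}
```

Taking the worst result across checks means a single stopped jail reports fail even when PF and hostd are healthy.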
Doctor Command
just doctor
Reports (in order):
- overall status
- latest host heartbeats
- latest Telegram and pipeline activity
- latest jail success/failure
- Stripe status
- watchdog mode, memory, active/queued jails
- control plane check results per service
- split-brain DB availability and row counts
Exit codes:
- STATUS: ok → exit 0
- STATUS: warn → exit 0 (degraded but running)
- STATUS: error → exit 1 (action required)
Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.
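The exit-code mapping is small enough to show directly. This is a sketch under assumptions: the function names are hypothetical, and the parser assumes the doctor output contains a line of the form "STATUS: <value>" as in the list above.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Only "error" produces a non-zero exit; "warn" means degraded but running.
function exitCodeFor(status: DoctorStatus): number {
  return status === "error" ? 1 : 0;
}

// Extracts the status from a line such as "STATUS: warn".
// Returns undefined when no such line is present.
function parseStatusLine(output: string): DoctorStatus | undefined {
  const match = /^STATUS:\s*(ok|warn|error)\s*$/m.exec(output);
  return match ? (match[1] as DoctorStatus) : undefined;
}
```

Because warn exits 0, a wrapper script that wants to alert on degraded state has to inspect the status line instead of relying on the exit code alone.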
Session safety:
- Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
- Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
- Additional prompt guardrails limit resource abuse:
  - AGENT_MAX_INBOUND_CHARS: truncates inbound messages exceeding this length
  - AGENT_MAX_BACKLOG_MESSAGES: caps the number of historical messages included in a prompt
  - AGENT_MAX_BACKLOG_CHARS: caps total character count of the backlog
  - AGENT_MAX_PROMPT_CHARS: hard limit on the final assembled prompt size
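One way the guardrails could compose is sketched below. It is not the runtime's implementation: the Guardrails interface, buildPrompt, and needsRollover are hypothetical names, and the choices to keep the newest backlog messages and to cut the final prompt at its tail are assumptions made for illustration.

```typescript
// Hypothetical container for the documented environment variables.
interface Guardrails {
  maxInboundChars: number; // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number; // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number; // AGENT_MAX_PROMPT_CHARS
}

// Assembles a prompt from backlog plus inbound message, applying each limit in turn.
function buildPrompt(inbound: string, backlog: string[], limits: Guardrails): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, limits.maxInboundChars);

  // 2. Keep only the newest N backlog messages (slice(-0) would keep all, so guard it).
  const recent =
    limits.maxBacklogMessages > 0 ? backlog.slice(-limits.maxBacklogMessages) : [];

  // 3. Walk from newest to oldest and stop once the character budget is exhausted.
  const kept: string[] = [];
  let used = 0;
  for (const entry of [...recent].reverse()) {
    if (used + entry.length > limits.maxBacklogChars) break;
    kept.unshift(entry);
    used += entry.length;
  }

  // 4. Hard limit on the assembled prompt (tail truncation is an assumption).
  return [...kept, message].join("\n").slice(0, limits.maxPromptChars);
}

// Session rollover check for AGENT_SESSION_MAX_BYTES.
function needsRollover(sessionBytes: number, maxBytes: number): boolean {
  return sessionBytes > maxBytes;
}
```

Applying the limits in this order keeps each stage cheap: the per-message and backlog caps bound the input before the final hard limit acts as a backstop.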
Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
Why This Split Exists
A running PID can hide real failures:
- Telegram intake dead
- scheduler stalled
- jail execution failing
- service jails down
- PF disabled (no public web traffic)
Each layer catches a different class of failure.
Bastille's Role
Bastille monitor and Clawdie doctor solve different problems:
- Bastille monitor — jail service watchdog at the OS level
- Clawdie doctor — application, pipeline, and control plane health
Use both; don't confuse them.