clawdie-ai/docs/public/operate/monitoring.md at a521ec77ffd17b8caf7bf7516a54cc6b3d1437a1

Clawdie AI a521ec77ff docs: comprehensive doc audit — update 16 files for consistency with codebase


2026-04-18 22:15:59 +02:00

Monitoring Model

Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".

Runtime Health Files

The running process writes state into:

data/health/host.json
data/health/pipeline.json
data/health/jail.json

The just doctor command inspects these files.
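For scripts that need the raw state rather than the doctor summary, the health files can be read directly. The sketch below is a minimal illustration only: the JSON schema of each file is not documented here, so it parses the content without assuming any fields, and `readHealthFiles` and `HealthFile` are hypothetical names.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// The three files named above; their internal schema is not assumed here.
const HEALTH_FILES = ["host.json", "pipeline.json", "jail.json"] as const;

interface HealthFile {
  name: string;
  // False when the file is missing or does not contain valid JSON.
  readable: boolean;
  // Parsed JSON content, or null when the file could not be read.
  data: unknown;
}

// Hypothetical helper: reads each health file under `dir` (e.g. "data/health").
function readHealthFiles(dir: string): HealthFile[] {
  return HEALTH_FILES.map((name) => {
    try {
      const raw = fs.readFileSync(path.join(dir, name), "utf8");
      return { name, readable: true, data: JSON.parse(raw) as unknown };
    } catch {
      // Report the problem instead of throwing, so one bad file
      // does not hide the state of the other two.
      return { name, readable: false, data: null };
    }
  });
}
```

A file that is missing or unparsable is itself a useful signal: it may mean the process has not started or has not yet written its first heartbeat.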

Monitoring Layers

1. Host Health

Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?

2. Pipeline Health

Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

3. Jail Health (Warden)

Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

4. Watchdog

The Watchdog class in src/watchdog.ts runs two timers:

health timer (60s) — reads free memory (sysctl vm.stats.vm.v_free_count), throttles queue concurrency to 1 if below threshold
control plane timer (5 min) — runs runControlPlaneChecks(), stores the latest ControlPlaneReport
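The health timer's throttle decision can be sketched as a pure function. This is an illustration under stated assumptions, not the code in src/watchdog.ts: the threshold is taken to be in bytes, the page size is assumed to come from sysctl hw.pagesize, and `MemorySample` and `chooseConcurrency` are hypothetical names.

```typescript
// Hypothetical inputs for one health-timer tick.
interface MemorySample {
  freePages: number;     // value of sysctl vm.stats.vm.v_free_count (a page count)
  pageSizeBytes: number; // assumed to come from sysctl hw.pagesize
}

// Returns the queue concurrency to use for the next interval:
// 1 when free memory is below the threshold, otherwise the normal value.
function chooseConcurrency(
  sample: MemorySample,
  thresholdBytes: number,    // assumption: threshold expressed in bytes
  normalConcurrency: number, // assumption: configured elsewhere
): number {
  const freeBytes = sample.freePages * sample.pageSizeBytes;
  return freeBytes < thresholdBytes ? 1 : normalConcurrency;
}
```

Keeping the decision separate from the sysctl call makes it easy to test on a non-FreeBSD machine.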

The watchdog listens on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock for IPC. Override via AGENT_TMP_DIR if needed. Send {"cmd":"status"}\n to get current mode, throttle state, memory, active/queued jails, and the latest control plane report.
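A status query over the socket might look like the following sketch. It assumes the watchdog replies with a single newline-terminated JSON line, which the text above does not confirm; `watchdogSocketPath` and `queryWatchdogStatus` are hypothetical helper names, and the reply is left untyped because its exact shape is not specified here.

```typescript
import * as net from "node:net";

// Builds the socket path from the pattern described above.
function watchdogSocketPath(tmpDir: string, agent: string): string {
  return `${tmpDir}/ipc/${agent}-watchdog.sock`;
}

// Sends {"cmd":"status"}\n and resolves with the parsed first line of the reply.
// Assumption: the reply is one JSON document terminated by a newline.
function queryWatchdogStatus(socketPath: string, timeoutMs = 2000): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection(socketPath);
    let buffer = "";
    socket.setEncoding("utf8");
    socket.setTimeout(timeoutMs, () => {
      socket.destroy(new Error("watchdog status request timed out"));
    });
    socket.on("connect", () => {
      socket.write('{"cmd":"status"}\n');
    });
    socket.on("data", (chunk: Buffer | string) => {
      buffer += String(chunk);
      const newline = buffer.indexOf("\n");
      if (newline !== -1) {
        socket.end();
        try {
          resolve(JSON.parse(buffer.slice(0, newline)));
        } catch (err) {
          reject(err);
        }
      }
    });
    socket.on("error", reject);
    // If the peer closes before a full line arrives, fail instead of hanging.
    // This is a no-op when the promise has already settled.
    socket.on("close", () => reject(new Error("socket closed before a reply was received")));
  });
}
```

For example, `queryWatchdogStatus(watchdogSocketPath("/tmp/clawdie", "clawdie"))` would target `/tmp/clawdie/ipc/clawdie-watchdog.sock`; both argument values are placeholders.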

5. Control Plane

src/controlplane.ts checks service jails and system state. Runs at startup (before initDatabase()) and every 5 minutes via the watchdog timer.

Checks:

| Check | Method | Fix if failing |
| --- | --- | --- |
| hostd reachable | TCP connect to socket | none (can't self-fix) |
| `{agent}-db` running | `jls -q name` | `hostd('bastille-start')` |
| `{agent}-git` running | `jls -q name` | `hostd('bastille-start')` |
| `{agent}-cms` running | `jls -q name` | `hostd('bastille-start')` |
| PF enabled | `pfctl -s info` | `hostd('pf-enable')` |

Severity:

Jail failures → fail (db down = cannot start)
PF disabled, hostd unreachable → warn (agent can run, degraded)
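The severity rules above amount to a small mapping plus a worst-of rollup. The sketch below illustrates that logic only; the actual shape of ControlPlaneReport is not shown in this document, so `CheckResult`, `severityOf`, and `overallSeverity` are hypothetical.

```typescript
type CheckKind = "hostd" | "jail" | "pf";
type Severity = "ok" | "warn" | "fail";

// Hypothetical result record for one control plane check.
interface CheckResult {
  label: string;
  kind: CheckKind;
  passed: boolean;
}

// Jail failures are fatal; PF and hostd failures only degrade the agent.
function severityOf(check: CheckResult): Severity {
  if (check.passed) return "ok";
  return check.kind === "jail" ? "fail" : "warn";
}

// The overall severity is the worst severity among all checks.
function overallSeverity(checks: CheckResult[]): Severity {
  const rank: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };
  return checks
    .map(severityOf)
    .reduce<Severity>((worst, s) => (rank[s] > rank[worst] ? s : worst), "ok");
}
```

With this rollup, a disabled PF alongside healthy jails yields warn, while a stopped db jail yields fail regardless of the other checks.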

Doctor Command

just doctor

Reports (in order):

overall status
latest host heartbeats
latest Telegram and pipeline activity
latest jail success/failure
Stripe status
watchdog mode, memory, active/queued jails
control plane check results per service
split-brain DB availability and row counts

Exit codes:

STATUS: ok → exit 0
STATUS: warn → exit 0 (degraded but running)
STATUS: error → exit 1 (action required)
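Because warn also exits 0, a wrapper that checks only the exit code will not notice a degraded system. The sketch below shows one way a wrapper could recover the status from the doctor output; it assumes the output contains a line of the form `STATUS: <value>`, and `parseStatusLine` and `exitCodeFor` are hypothetical names.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Mirrors the exit-code table above: only "error" is non-zero.
function exitCodeFor(status: DoctorStatus): number {
  return status === "error" ? 1 : 0;
}

// Finds the first "STATUS: ..." line in the captured output.
// Assumption: the status appears on its own line in exactly this form.
function parseStatusLine(output: string): DoctorStatus | null {
  const match = /^STATUS:\s*(ok|warn|error)\s*$/m.exec(output);
  return match ? (match[1] as DoctorStatus) : null;
}
```

A monitoring job could then alert on `parseStatusLine(output) === "warn"` even though the process exit code was 0.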

Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.

Session safety:

Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
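The rollover rule reduces to a size comparison. The sketch below is an assumption-laden illustration, not the runtime's code: it treats an unset, non-numeric, or non-positive AGENT_SESSION_MAX_BYTES as "no cap", and `parseMaxBytes` and `shouldRollOver` are hypothetical names.

```typescript
// Parses the raw environment value; returns undefined when it is not a usable number.
function parseMaxBytes(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : undefined;
}

// True when the session file has grown past the cap and a fresh session should start.
// Assumption: a missing or non-positive cap disables rollover.
function shouldRollOver(sessionBytes: number, maxBytes: number | undefined): boolean {
  if (maxBytes === undefined || maxBytes <= 0) return false;
  return sessionBytes > maxBytes;
}
```

In a real check, `sessionBytes` would come from the size of the current file under groups/<group>/sessions/.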
Additional prompt guardrails limit resource abuse:
- AGENT_MAX_INBOUND_CHARS — truncates inbound messages exceeding this length
- AGENT_MAX_BACKLOG_MESSAGES — caps the number of historical messages included in a prompt
- AGENT_MAX_BACKLOG_CHARS — caps total character count of the backlog
- AGENT_MAX_PROMPT_CHARS — hard limit on the final assembled prompt size
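How the four limits could compose is sketched below. The order of application, the choice to keep the newest backlog entries, and the newline join are all assumptions made for illustration; the runtime may trim differently. `Guardrails` and `assemblePrompt` are hypothetical names.

```typescript
interface Guardrails {
  maxInboundChars: number;    // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number;    // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number;     // AGENT_MAX_PROMPT_CHARS
}

// `backlog` is assumed to be ordered oldest first.
function assemblePrompt(inbound: string, backlog: string[], g: Guardrails): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, g.maxInboundChars);

  // 2. Keep at most the N most recent backlog messages.
  const recent = g.maxBacklogMessages > 0 ? backlog.slice(-g.maxBacklogMessages) : [];

  // 3. Walk from newest to oldest, stopping once the character budget is spent.
  const kept: string[] = [];
  let used = 0;
  for (const entry of [...recent].reverse()) {
    if (used + entry.length > g.maxBacklogChars) break;
    kept.unshift(entry); // restore oldest-first order
    used += entry.length;
  }

  // 4. Apply the hard limit to the assembled prompt.
  return [...kept, message].join("\n").slice(0, g.maxPromptChars);
}
```

Each limit bounds a different input, so a single oversized message or a long history cannot by itself inflate the prompt.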

Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
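As an illustration of that format, the sketch below renders a date as DD.mmm.YYYY HH:MM. It assumes that "mmm" means a three-letter English month abbreviation and uses UTC so the result is deterministic; the doctor command may use local time and a different month style. `formatTimestamp` is a hypothetical name.

```typescript
// Assumption: "mmm" is a three-letter English month abbreviation.
const MONTH_ABBREVIATIONS = [
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
];

// Formats a date as DD.mmm.YYYY HH:MM using UTC fields.
function formatTimestamp(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const month = MONTH_ABBREVIATIONS[date.getUTCMonth()] ?? "???";
  return (
    `${pad(date.getUTCDate())}.${month}.${date.getUTCFullYear()} ` +
    `${pad(date.getUTCHours())}:${pad(date.getUTCMinutes())}`
  );
}
```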

Why This Split Exists

A running PID can hide real failures:

Telegram intake dead
scheduler stalled
jail execution failing
service jails down
PF disabled (no public web traffic)

Each layer catches a different class of failure.

Bastille's Role

Bastille monitor and Clawdie doctor solve different problems:

Bastille monitor — jail service watchdog at the OS level
Clawdie doctor — application, pipeline, and control plane health

Use both; don't confuse them.
