clawdie-ai/docs/public/operate/monitoring.md
Clawdie AI a521ec77ff docs: comprehensive doc audit — update 16 files for consistency with codebase
Build: pass | Tests: not run (Linux) (Sam & Claude)
2026-04-18 22:15:59 +02:00

Monitoring Model

Clawdie monitoring is split into distinct layers so "process is running" is not confused with "system is healthy".

Runtime Health Files

The running process writes state into:

  • data/health/host.json
  • data/health/pipeline.json
  • data/health/jail.json

These files are read and summarized by just doctor.

Monitoring Layers

1. Host Health

Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?
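One way to turn heartbeats into an "alive and making progress" answer is to treat any heartbeat older than a cutoff as stale. The sketch below is illustrative only: the field names messageLoopHeartbeat and schedulerHeartbeat, the ISO 8601 timestamp format, and the maxAgeMs cutoff are assumptions, not the documented schema of data/health/host.json.

```typescript
// Hypothetical shape; the real data/health/host.json schema may differ.
type HostHealth = {
  messageLoopHeartbeat: string | null; // assumed ISO 8601 timestamp
  schedulerHeartbeat: string | null;   // assumed ISO 8601 timestamp
};

// Returns the names of heartbeats that are missing, unparseable,
// or older than maxAgeMs relative to nowMs.
function staleHeartbeats(health: HostHealth, nowMs: number, maxAgeMs: number): string[] {
  const entries: Array<[string, string | null]> = [
    ["messageLoopHeartbeat", health.messageLoopHeartbeat],
    ["schedulerHeartbeat", health.schedulerHeartbeat],
  ];
  const stale: string[] = [];
  for (const [name, value] of entries) {
    const ts = value === null ? NaN : Date.parse(value);
    if (isNaN(ts) || nowMs - ts > maxAgeMs) {
      stale.push(name);
    }
  }
  return stale;
}
```

A process whose PID exists but whose message loop heartbeat is stale would show up here even though the OS still reports it as running.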

2. Pipeline Health

Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

3. Jail Health (Warden)

Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

4. Watchdog

The Watchdog class in src/watchdog.ts runs two timers:

  • health timer (60s) — reads free memory (sysctl vm.stats.vm.v_free_count) and throttles queue concurrency to 1 if free memory is below the threshold
  • control plane timer (5 min) — runs runControlPlaneChecks(), stores the latest ControlPlaneReport
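The throttle decision made by the health timer can be sketched as a pure function. This is not the code in src/watchdog.ts: the parameter names, the threshold value, and the normal concurrency level are hypothetical. The one fixed fact is that vm.stats.vm.v_free_count is a page count, so it has to be multiplied by the page size (hw.pagesize) to get bytes.

```typescript
// v_free_count is a number of pages, not bytes.
function freeBytes(freePages: number, pageSizeBytes: number): number {
  return freePages * pageSizeBytes;
}

// Hypothetical decision rule: drop to a single concurrent jail run
// while free memory is below minFreeBytes, otherwise use the normal limit.
function targetConcurrency(
  freePages: number,
  pageSizeBytes: number,
  minFreeBytes: number,
  normalConcurrency: number
): number {
  return freeBytes(freePages, pageSizeBytes) < minFreeBytes ? 1 : normalConcurrency;
}
```

For example, with 4096-byte pages and an assumed 256 MiB threshold, 10000 free pages (about 41 MB) would throttle to 1, while 200000 free pages (about 819 MB) would leave concurrency unchanged.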

The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock; set AGENT_TMP_DIR to move the socket directory. Send {"cmd":"status"}\n to get the current mode, throttle state, memory, active/queued jails, and the latest control plane report.
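A client for the status command needs two small pieces: a way to build the newline-terminated request, and a way to reassemble a reply that arrives in several chunks. The sketch below covers only that framing. It assumes the reply is also newline-delimited JSON, which this document does not confirm, and the names encodeCommand and LineBuffer are hypothetical.

```typescript
// Builds the request documented above, e.g. {"cmd":"status"}\n
function encodeCommand(cmd: string): string {
  return JSON.stringify({ cmd }) + "\n";
}

// Accumulates socket chunks and returns only complete lines.
// Assumption: the watchdog terminates each reply with "\n".
class LineBuffer {
  private pending = "";

  push(chunk: string): string[] {
    this.pending += chunk;
    const parts = this.pending.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on "\n").
    this.pending = parts.pop() ?? "";
    return parts.filter((line) => line.length > 0);
  }
}
```

In a Node.js client this would typically sit behind net.createConnection(socketPath): write encodeCommand("status") to the socket, feed each "data" chunk to LineBuffer.push, and JSON.parse every line it returns.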

5. Control Plane

src/controlplane.ts checks service jails and system state. Runs at startup (before initDatabase()) and every 5 minutes via the watchdog timer.

Checks:

| Check | Method | Fix if failing |
|---|---|---|
| hostd reachable | TCP connect to socket | none (can't self-fix) |
| {agent}-db running | jls -q name | hostd('bastille-start') |
| {agent}-git running | jls -q name | hostd('bastille-start') |
| {agent}-cms running | jls -q name | hostd('bastille-start') |
| PF enabled | pfctl -s info | hostd('pf-enable') |

Severity:

  • Jail failures → fail (db down = cannot start)
  • PF disabled, hostd unreachable → warn (agent can run, degraded)
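The severity rules above can be expressed as a small mapping from check results to a severity level. This is a sketch, not the contents of src/controlplane.ts: the CheckResult shape and the "worst result wins" aggregation are assumptions made for illustration.

```typescript
type Severity = "ok" | "warn" | "fail";

// Hypothetical result shape for one control plane check.
interface CheckResult {
  name: string;                   // e.g. "{agent}-db running"
  kind: "hostd" | "jail" | "pf";  // which row family of the table it belongs to
  passed: boolean;
}

// Jail failures are fatal; hostd and PF failures only degrade the agent.
function severityOf(check: CheckResult): Severity {
  if (check.passed) return "ok";
  return check.kind === "jail" ? "fail" : "warn";
}

// Assumed aggregation: the report takes the worst severity of any check.
function overallSeverity(checks: CheckResult[]): Severity {
  const rank: Record<Severity, number> = { ok: 0, warn: 1, fail: 2 };
  let worst: Severity = "ok";
  for (const check of checks) {
    const s = severityOf(check);
    if (rank[s] > rank[worst]) worst = s;
  }
  return worst;
}
```

Under these assumptions, a report with PF disabled and all jails up is warn, while the same report with the db jail down is fail.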

Doctor Command

just doctor

Reports (in order):

  • overall status
  • latest host heartbeats
  • latest Telegram and pipeline activity
  • latest jail success/failure
  • Stripe status
  • watchdog mode, memory, active/queued jails
  • control plane check results per service
  • split-brain DB availability and row counts

Exit codes:

  • STATUS: ok → exit 0
  • STATUS: warn → exit 0 (degraded but running)
  • STATUS: error → exit 1 (action required)
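Because warn exits 0, a wrapper that only looks at the exit code cannot tell ok from warn. The sketch below shows the documented mapping plus a hypothetical helper that pulls the status out of captured output. It assumes the output contains a line of the form "STATUS: ok", as the list above suggests; the function names are illustrative.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Documented mapping: only "error" produces a non-zero exit code.
function exitCodeFor(status: DoctorStatus): number {
  return status === "error" ? 1 : 0;
}

// Assumption: the doctor output includes a standalone "STATUS: <value>" line.
// Returns null when no such line is found.
function parseStatusLine(output: string): DoctorStatus | null {
  const match = /^STATUS:\s*(ok|warn|error)\s*$/m.exec(output);
  return match ? (match[1] as DoctorStatus) : null;
}
```

A monitoring job that wants to alert on degraded operation would check the parsed status rather than rely on the exit code alone.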

Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.

Session safety:

  • Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
  • Use AGENT_SESSION_MAX_BYTES to cap session size; when a session file exceeds the cap, the runtime starts a fresh session automatically (silently by default).
  • Additional prompt guardrails limit resource abuse:
    • AGENT_MAX_INBOUND_CHARS — truncates inbound messages exceeding this length
    • AGENT_MAX_BACKLOG_MESSAGES — caps the number of historical messages included in a prompt
    • AGENT_MAX_BACKLOG_CHARS — caps total character count of the backlog
    • AGENT_MAX_PROMPT_CHARS — hard limit on the final assembled prompt size
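How the four prompt guardrails could combine is sketched below. This is an illustration under assumptions, not the runtime's implementation: the Guardrails interface, the buildPrompt name, the choice to keep the newest backlog messages, and the newline joining are all hypothetical. Only the four variable names and their stated purposes come from the list above.

```typescript
interface Guardrails {
  maxInboundChars: number;    // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number;    // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number;     // AGENT_MAX_PROMPT_CHARS
}

// backlog is ordered oldest first; inbound is the new message.
function buildPrompt(inbound: string, backlog: string[], g: Guardrails): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, g.maxInboundChars);

  // 2. Keep at most maxBacklogMessages of the newest messages (assumed policy).
  //    Guard against slice(-0), which would return the whole array.
  const recent = g.maxBacklogMessages > 0 ? backlog.slice(-g.maxBacklogMessages) : [];

  // 3. Walk from newest to oldest until the character budget is used up.
  const kept: string[] = [];
  let used = 0;
  for (const item of recent.slice().reverse()) {
    if (used + item.length > g.maxBacklogChars) break;
    kept.unshift(item);
    used += item.length;
  }

  // 4. Apply the hard cap to the assembled prompt.
  return kept.concat([message]).join("\n").slice(0, g.maxPromptChars);
}
```

Each limit bounds a different stage, so a single oversized message, a long history, and a large assembled prompt are each caught independently.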

Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
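A formatter for this layout might look like the following sketch. It reads "mmm" as a three-letter English month abbreviation and formats in UTC so the output is deterministic; both are assumptions, and the doctor command may use local time or a different month spelling.

```typescript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Zero-pads a number to two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Formats a Date as DD.mmm.YYYY HH:MM using UTC fields (assumed time zone).
function formatTimestamp(d: Date): string {
  return pad2(d.getUTCDate()) + "." + MONTHS[d.getUTCMonth()] + "." + d.getUTCFullYear() +
    " " + pad2(d.getUTCHours()) + ":" + pad2(d.getUTCMinutes());
}
```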

Why This Split Exists

A running PID can hide real failures:

  • Telegram intake dead
  • scheduler stalled
  • jail execution failing
  • service jails down
  • PF disabled (no public web traffic)

Each layer catches a different class of failure.

Bastille's Role

Bastille monitor and Clawdie doctor solve different problems:

  • Bastille monitor — jail service watchdog at the OS level
  • Clawdie doctor — application, pipeline, and control plane health

Use both; don't confuse them.