diff --git a/docs/guide/architecture/control-plane-bridge.md b/docs/guide/architecture/control-plane-bridge.md
new file mode 100644
index 0000000..8372ba4
--- /dev/null
+++ b/docs/guide/architecture/control-plane-bridge.md
@@ -0,0 +1,93 @@
+---
+title: 'Control-Plane Bridge'
+description: Reaching the Colibri control plane across hosts over the Tailscale mesh.
+---
+
+Each host runs `colibri-daemon` listening on a **Unix domain socket** (local
+only). The control-plane bridge exposes that socket as a **TCP port on the
+Tailscale interface** so other mesh hosts can drive the control plane — create
+tasks, register agents, watch terminals — without the socket ever being
+reachable from the public internet.
+
+```
+operator / peer host                    bridged host
+  nc <tailnet-ip> 9190 ──tailscale0──▶  socat TCP-LISTEN:9190
+                                          │  (bind = this host's tailnet IP)
+                                          ▼
+                                        UNIX-CONNECT /run/colibri/colibri.sock
+                                          │
+                                          ▼
+                                        colibri-daemon
+```
+
+## Implementations
+
+The bridge is a thin `socat` front-end, supervised by the host's service
+manager. Both sides are shipped in the repo:
+
+| Host | Service | Packaging |
+| --- | --- | --- |
+| FreeBSD | rc.d `colibri_bridge` | `packaging/freebsd/colibri_bridge.in` |
+| Linux | systemd `colibri-bridge.service` | `packaging/linux/` (unit + env + nft + README) |
+
+Both effectively run:
+
+```
+socat TCP-LISTEN:9190,bind=<tailnet-ip>,fork,reuseaddr \
+  UNIX-CONNECT:/run/colibri/colibri.sock
+```
+
+The Linux unit adds `freebind` so socat can bind the tailnet address before
+`tailscaled` has finished bringing it up, avoiding a boot-order race. The
+`bind=<tailnet-ip>` option keeps the listener off every other interface even if the
+firewall is later changed — defence in depth, not the primary gate.
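
To make the difference between the two service variants concrete, here is a minimal Python sketch that assembles the `socat` argument list described above. The helper name `bridge_argv` and its parameters are hypothetical and are not part of the shipped packaging; the real rc.d script and systemd unit may compose the command differently.

```python
def bridge_argv(bind_ip, socket_path="/run/colibri/colibri.sock",
                port=9190, freebind=False):
    """Build the socat argv for the bridge (illustrative helper, not shipped code).

    bind_ip     -- this host's tailnet address; the listener binds only there
    socket_path -- the daemon's Unix socket the bridge connects to
    freebind    -- True mirrors the Linux unit, which may start before
                   tailscaled has brought the address up
    """
    listen_opts = [f"bind={bind_ip}", "fork", "reuseaddr"]
    if freebind:
        listen_opts.append("freebind")
    return [
        "socat",
        f"TCP-LISTEN:{port}," + ",".join(listen_opts),
        f"UNIX-CONNECT:{socket_path}",
    ]


# Linux-style invocation (freebind on, default /run socket path).
linux = bridge_argv("100.64.0.1", freebind=True)
# FreeBSD-style invocation (socket under /var/run, no freebind).
freebsd = bridge_argv("100.64.0.1", socket_path="/var/run/colibri/colibri.sock")
```

The address `100.64.0.1` is only an example value; a real deployment would substitute the output of `tailscale ip -4`.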
+
+## Network gate
+
+The bridge port is opened **only on the Tailscale interface**, in the host's
+native firewall:
+
+- **FreeBSD (pf):** `pass in quick on tailscale0 proto tcp to port 9190 keep state`
+- **Linux (ufw):** `ufw allow in on tailscale0 to any port 9190 proto tcp`
+
+On a default-deny host (e.g. ufw), the public side is already blocked, so only
+the interface-scoped *allow* is needed. The `packaging/linux/colibri-bridge.nft`
+ruleset is provided for Linux hosts that do **not** run ufw (a default-accept
+input chain); under ufw it is redundant.
+
+## Security model — the tailnet boundary is the auth
+
+The control-plane socket has **no authentication of its own**. Once it is
+bridged, any peer that can reach the host over the tailnet can issue the full
+command set (`spawn-agent`, `kill-agent`, `intake-task`, `terminal-*`, …). That
+makes the **Tailscale boundary the access control**:
+
+- Scope the port to named peers with a **Tailscale ACL** on `:9190` rather than
+  relying on the firewall allow alone.
+- Treat any bridged host as granting control-plane authority to the whole
+  tailnet unless an ACL narrows it.
+
+## Configuration notes
+
+- **No real tailnet IPs in git.** Config templates ship the placeholder
+  `TAILSCALE_IP_REQUIRED`; the operator fills the host's own address at deploy
+  time (`tailscale ip -4`). The FreeBSD rc.d defaults likewise refuse to start
+  until the address is set.
+- **Socket-path parity.** The bridge connects to `/run/colibri/colibri.sock`
+  (FreeBSD: `/var/run/colibri/colibri.sock`); the daemon must be started with a
+  matching `COLIBRI_DAEMON_SOCKET`. The daemon's default lives under
+  `$XDG_DATA_HOME`, which a sandboxed unit (`ProtectHome=yes`) cannot reach —
+  point both at the `/run` path.
+- The bridge user must be in the daemon socket's group (the socket is `0770`,
+  owner + group).
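
The configuration notes above amount to a few checks that can be run before the bridge starts. The following Python sketch shows one way to express them. The function `preflight` and its messages are hypothetical, and the range check assumes the bind address is an IPv4 tailnet address, which Tailscale allocates from the CGNAT block `100.64.0.0/10`. It is an illustration of the checks, not the logic the rc.d script or systemd unit actually uses.

```python
import ipaddress

PLACEHOLDER = "TAILSCALE_IP_REQUIRED"
# Tailscale hands out IPv4 addresses from the CGNAT range.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def preflight(bind_addr, bridge_socket, daemon_socket):
    """Return a list of human-readable problems; an empty list means OK.

    bind_addr     -- value configured for the listener's bind address
    bridge_socket -- path the bridge will UNIX-CONNECT to
    daemon_socket -- path the daemon listens on (COLIBRI_DAEMON_SOCKET)
    """
    problems = []
    if not bind_addr or bind_addr == PLACEHOLDER:
        problems.append("bind address is not set")
    else:
        try:
            addr = ipaddress.ip_address(bind_addr)
        except ValueError:
            problems.append(f"bind address {bind_addr!r} is not an IP address")
        else:
            # Only IPv4 is range-checked here; IPv6 is passed through.
            if addr.version == 4 and addr not in TAILNET_V4:
                problems.append(f"{bind_addr} is outside {TAILNET_V4}")
    if bridge_socket != daemon_socket:
        problems.append("bridge and daemon socket paths differ")
    return problems


SOCK = "/run/colibri/colibri.sock"
unset = preflight(PLACEHOLDER, SOCK, SOCK)
good = preflight("100.100.1.2", SOCK, SOCK)
bad = preflight("192.168.1.5", SOCK, "/home/colibri/.local/share/colibri.sock")
```

A check like this only catches misconfiguration on the host; it does not replace the Tailscale ACL described in the security model.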
+
+## Verify
+
+From another tailnet host:
+
+```sh
+printf '{"cmd":"status"}\n' | nc -w2 <tailnet-ip> 9190
+```
+
+A healthy bridge returns the daemon's status JSON (including the daemon's
+`host`), confirming reachability end to end over the mesh.
diff --git a/docs/guide/architecture/index.md b/docs/guide/architecture/index.md
index 147add6..40f9a95 100644
--- a/docs/guide/architecture/index.md
+++ b/docs/guide/architecture/index.md
@@ -40,5 +40,6 @@ Shared platform services:
 - [FreeBSD jail implementation](./freebsd-jail-implementation/)
 - [Bastille lifecycle](./bastille/)
 - [Control Plane](./controlplane/)
+- [Control-Plane Bridge](./control-plane-bridge/)
 - [Colibri](./colibri/)
 - [Admin Panel](./admin-panel/)
diff --git a/docs/guide/operate/index.md b/docs/guide/operate/index.md
index 22fe47a..be18b8f 100644
--- a/docs/guide/operate/index.md
+++ b/docs/guide/operate/index.md
@@ -7,6 +7,7 @@ Runbooks for day-to-day operation and recovery.
 
 - [Security](./security/)
 - [Monitoring](./monitoring/)
+- [Terminal Capture & Signature Triage](./terminal-capture/)
 - [Operator Commands](./operator-commands/)
 - [Structured Reports](./structured-reports/)
 - [Provider Fallback](./provider-fallback/)
diff --git a/docs/guide/operate/terminal-capture.md b/docs/guide/operate/terminal-capture.md
new file mode 100644
index 0000000..9eb4dcb
--- /dev/null
+++ b/docs/guide/operate/terminal-capture.md
@@ -0,0 +1,79 @@
+---
+title: 'Terminal Capture & Signature Triage'
+description: Deduplicated tmux pane history with edge-triggered failure alerts.
+---
+
+Terminal capture is the screen-scraping half of Glasspane. Where the rest of
+Glasspane derives agent state from structured JSONL events, this layer records
+the **actual terminal text** of a pane and triages it against known patterns —
+so Colibri can both *remember* what a terminal showed and *speak up* the moment
+something it recognises goes wrong.
+
+It lives in `colibri-glasspane` (`terminal.rs`, `signatures.rs`) and is driven
+by the `colibri-daemon` poll loop.
+
+## How it works
+
+- **Content-hash framing.** A frame's id is `SHA-256(stripped_text)[:12]`.
+  Identical screens produce identical ids.
+- **Deduplicated history.** The recorder drops any frame whose hash equals the
+  previous one, so polling a near-static pane every few seconds collapses into a
+  compact log of *actual* state transitions, not thousands of duplicates. The
+  history is a bounded ring buffer per pane.
+- **Signature triage.** Each captured frame is scanned by a `SignatureSet`.
+  A signature carries a severity (`error`/`warn`/`info`/`ok`), a plain-language
+  `next_action`, and an optional `invoke` (a skill to run to remediate). Matches
+  are classified into `failures` / `warnings` / `info` / `healthy`.
+- **Edge-triggered alerts.** A failure/warning is reported only on the frame
+  where it *first appears* — not on every subsequent frame that still shows it.
+  When the condition clears and later recurs, it fires again. This is what keeps
+  a persistent error from spamming alerts.
+
+The signature set is per-OS configuration: `SignatureSet::linux_default()` ships
+a small, high-value starter set (systemd unit failures, OOM-killer, disk full,
+Docker non-zero exits, IP-forwarding, firewall posture). Other hosts load a
+different set; the matcher is shared.
+
+> PNG rendering of a capture is **not** part of this layer — that stays with the
+> `tmux-screenshot` skill for human viewing. Colibri owns capture, history, and
+> triage; the text and signatures are the diagnostic artifacts.
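
The framing, deduplication, and edge-triggering described under "How it works" can be sketched in a few lines of Python. This is an illustration of the technique only, not the Rust code in `terminal.rs` or `signatures.rs`. The class `PaneRecorder`, the `SIGNATURES` table, and its substring patterns are hypothetical, and the sketch assumes the 12-character id is taken from the hex digest. Severity, `next_action`, and `invoke` are left out to keep the example short.

```python
import hashlib
from collections import deque

# Hypothetical signatures: name -> substring to look for in the pane text.
SIGNATURES = {
    "oom-killer": "Out of memory: Killed process",
    "disk-full": "No space left on device",
}


def frame_id(stripped_text):
    """First 12 hex digits of SHA-256 over the already-stripped pane text."""
    return hashlib.sha256(stripped_text.encode("utf-8")).hexdigest()[:12]


class PaneRecorder:
    """Per-pane history with dedup and edge-triggered alerts (sketch)."""

    def __init__(self, capacity=256):
        self.history = deque(maxlen=capacity)  # bounded ring buffer
        self.firing = set()                    # signatures currently latched

    def capture(self, stripped_text):
        fid = frame_id(stripped_text)
        # Dedup: a frame identical to the previous one is not recorded.
        if self.history and self.history[-1][0] == fid:
            return {"status": "unchanged", "new_alerts": []}
        matched = {name for name, needle in SIGNATURES.items()
                   if needle in stripped_text}
        # Edge trigger: report only signatures absent from the previous frame.
        new_alerts = sorted(matched - self.firing)
        self.firing = matched
        self.history.append((fid, stripped_text))
        return {"status": "recorded", "new_alerts": new_alerts}


rec = PaneRecorder(capacity=3)
results = [
    rec.capture("$ make\nok"),
    rec.capture("$ make\nok"),                              # same screen again
    rec.capture("cp: write error: No space left on device"),
    rec.capture("cp: write error: No space left on device\n$ df -h"),
    rec.capture("$ make\nok"),                              # condition clears
    rec.capture("cp: write error: No space left on device"),
]
```

Replacing `firing` with the set matched on the latest recorded frame is what lets a signature fire again after it has cleared, while a persistent match stays silent. The `deque` with `maxlen` discards the oldest frame once the buffer is full.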
+
+## Configuration
+
+Set on the daemon's environment (off by default):
+
+| Variable | Purpose | Default |
+| --- | --- | --- |
+| `COLIBRI_TERMINAL_CAPTURE` | Enable the poll loop (`1`/`true`/`yes`/`on`) | off |
+| `COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS` | Seconds between captures of each watched pane | `5` |
+| `COLIBRI_TERMINAL_WATCH` | Comma-separated tmux targets to watch from startup | _(none)_ |
+| `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` | Route edge-triggered alerts to Telegram | _(unset → log only)_ |
+
+When the bot token/chat id are unset, alerts degrade cleanly to a daemon log
+line — the feature is safe to leave enabled without Telegram configured.
+
+## Control-plane commands
+
+Over the Colibri socket (newline-delimited JSON):
+
+| Command | Effect |
+| --- | --- |
+| `{"cmd":"terminal-watch","target":"clawdie:0"}` | Start recording a tmux target (session / `session:window` / `%pane`) |
+| `{"cmd":"terminal-unwatch","target":"clawdie:0"}` | Stop recording and drop the pane's history |
+| `{"cmd":"terminal-list"}` | Watched panes with frame counts and currently-firing alerts |
+| `{"cmd":"terminal-history","target":"clawdie:0","limit":20}` | Recent recorded frames (text + detection) for a pane |
+| `{"cmd":"terminal-poll","target":"clawdie:0"}` | Capture now instead of waiting for the tick (`target` optional → all) |
+
+`terminal-poll` returns, per pane, whether the frame was `recorded` or
+`unchanged` (deduped) and any `new_alerts` that fired on this capture.
+
+## Operating notes
+
+- Watch the panes that matter (a build pane, the daemon pane, a status window),
+  not every pane — the value is in signal, not volume.
+- Alerts are edge-triggered, so a failing service alerts once and re-alerts only
+  after it recovers and breaks again; use `terminal-list` to see what is
+  currently latched as firing.
+- When a failure signature carries an `invoke`, that names the skill to run to
+  remediate — the alert is "here's what broke and how to fix it", not just
+  "something happened".
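
The commands listed under "Control-plane commands" are plain newline-delimited JSON, so a small client is enough to drive them from a script. The Python sketch below shows the framing. The helper names `encode_command` and `send_command` are hypothetical, and the client assumes the daemon replies with one JSON document per line over a TCP endpoint such as the bridge port; the exact reply shape is not specified here.

```python
import json
import socket


def encode_command(cmd, **fields):
    """Serialise one command as a single JSON line (newline-delimited JSON)."""
    payload = {"cmd": cmd, **fields}
    return (json.dumps(payload, separators=(",", ":")) + "\n").encode("utf-8")


def send_command(host, port, cmd, timeout=2.0, **fields):
    """Send one command over TCP and return the first reply line as JSON.

    Assumption: the daemon answers with one JSON document terminated by a
    newline. An empty reply raises json.JSONDecodeError.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command(cmd, **fields))
        with conn.makefile("r", encoding="utf-8") as reply:
            return json.loads(reply.readline())


watch = encode_command("terminal-watch", target="clawdie:0")
history = encode_command("terminal-history", target="clawdie:0", limit=20)
```

Against a reachable endpoint, a call such as `send_command(tailnet_ip, 9190, "terminal-poll", target="clawdie:0")` would force a capture, where `tailnet_ip` stands for the bridged host's address.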