Merge pull request 'docs(guide): add Terminal Capture & Signature Triage page' (#209) from docs/guide-port into main

Reviewed-on: #209
2026-06-26 09:35:16 +02:00 · dc9f7f06e7
commit dc9f7f06e7
parent 13303e88b3 5e2692c063
4 changed files with 174 additions and 0 deletions
--- a/docs/guide/architecture/control-plane-bridge.md
+++ b/docs/guide/architecture/control-plane-bridge.md
@@ -0,0 +1,93 @@
+---
+title: 'Control-Plane Bridge'
+description: Reaching the Colibri control plane across hosts over the Tailscale mesh.
+---
+
+Each host runs `colibri-daemon` listening on a **Unix domain socket** (local
+only). The control-plane bridge exposes that socket as a **TCP port on the
+Tailscale interface** so other mesh hosts can drive the control plane — create
+tasks, register agents, watch terminals — without the socket ever being
+reachable from the public internet.
+
+```
+operator / peer host                      bridged host
+  nc <tailnet-ip> 9190  ──tailscale0──▶  socat TCP-LISTEN:9190
+                                              │ (bind = this host's tailnet IP)
+                                              ▼
+                                         UNIX-CONNECT /run/colibri/colibri.sock
+                                              │
+                                              ▼
+                                         colibri-daemon
+```
+
+## Implementations
+
+The bridge is a thin `socat` front-end, supervised by the host's service
+manager. Both implementations ship in the repo:
+
+| Host | Service | Packaging |
+| --- | --- | --- |
+| FreeBSD | rc.d `colibri_bridge` | `packaging/freebsd/colibri_bridge.in` |
+| Linux | systemd `colibri-bridge.service` | `packaging/linux/` (unit + env + nft + README) |
+
+Both effectively run:
+
+```
+socat TCP-LISTEN:9190,bind=<this-host-tailnet-ip>,fork,reuseaddr \
+      UNIX-CONNECT:/run/colibri/colibri.sock
+```
+
+The Linux unit adds `freebind` so socat can bind the tailnet address before
+`tailscaled` has finished bringing it up, avoiding a boot-order race. The
+`bind=<tailnet-ip>` keeps the listener off every other interface even if the
+firewall is later changed — defence in depth, not the primary gate.
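+
+To confirm that the listener really is confined to the tailnet address, the
+output of `ss` can be checked on a Linux host. This is a minimal sketch, not
+something shipped in the repo: `check_bridge_bind` is a hypothetical helper,
+and it assumes the `ss -ltnH` column layout where field 4 is the local
+`address:port`.
+
+```sh
+# Hypothetical helper: reads `ss -ltnH` output on stdin and fails if any
+# listener on the given port is bound to an address other than the expected one.
+check_bridge_bind() {
+  expected_ip="$1"
+  port="${2:-9190}"
+  awk -v want="${expected_ip}:${port}" -v suffix=":${port}" '
+    # Field 4 of `ss -ltn` is "Local Address:Port".
+    {
+      n = length($4) - length(suffix)
+      if (n >= 0 && substr($4, n + 1) == suffix && $4 != want) {
+        print "unexpected listener: " $4
+        bad = 1
+      }
+    }
+    END { exit bad }
+  '
+}
+```
+
+On the bridged host this could be run as
+`ss -ltnH | check_bridge_bind "$(tailscale ip -4)"`; a non-zero exit status
+means something is listening on the port outside the tailnet address.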
+
+## Network gate
+
+The bridge port is opened **only on the Tailscale interface**, in the host's
+native firewall:
+
+- **FreeBSD (pf):** `pass in quick on tailscale0 proto tcp to port 9190 keep state`
+- **Linux (ufw):** `ufw allow in on tailscale0 to any port 9190 proto tcp`
+
+On a default-deny host (e.g. one running ufw), the public side is already
+blocked, so only the interface-scoped *allow* is needed. The
+`packaging/linux/colibri-bridge.nft` ruleset is for Linux hosts that do **not**
+run ufw and have a default-accept input chain; under ufw it is redundant.
+
+## Security model — the tailnet boundary is the auth
+
+The control-plane socket has **no authentication of its own**. Once it is
+bridged, any peer that can reach the host over the tailnet can issue the full
+command set (`spawn-agent`, `kill-agent`, `intake-task`, `terminal-*`, …). That
+makes the **Tailscale boundary the access control**:
+
+- Scope the port to named peers with a **Tailscale ACL** on `:9190` rather than
+  relying on the firewall allow alone.
+- Treat any bridged host as granting control-plane authority to the whole
+  tailnet unless an ACL narrows it.
+
+## Configuration notes
+
+- **No real tailnet IPs in git.** Config templates ship the placeholder
+  `TAILSCALE_IP_REQUIRED`; the operator fills the host's own address at deploy
+  time (`tailscale ip -4`). With its defaults untouched, the FreeBSD rc.d
+  service likewise refuses to start until the address is set.
+- **Socket-path parity.** The bridge connects to `/run/colibri/colibri.sock`
+  (FreeBSD: `/var/run/colibri/colibri.sock`); the daemon must be started with a
+  matching `COLIBRI_DAEMON_SOCKET`. The daemon's default lives under
+  `$XDG_DATA_HOME`, which a sandboxed unit (`ProtectHome=yes`) cannot reach —
+  point both at the `/run` path.
+- The bridge user must be in the daemon socket's group (the socket is mode
+  `0770`, accessible to owner and group only).
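+
+Filling the placeholder at deploy time can be scripted. The following is a
+sketch under stated assumptions: `fill_tailnet_ip` and the `BIND_ADDR` key in
+the usage line are hypothetical, and the actual template file names in
+`packaging/` are not assumed here.
+
+```sh
+# Hypothetical helper: replaces the TAILSCALE_IP_REQUIRED placeholder in a
+# config template read from stdin and writes the result to stdout.
+fill_tailnet_ip() {
+  ip="$1"
+  case "$ip" in
+    # Reject empty input or anything that is not digits and dots.
+    ''|*[!0-9.]*) echo "fill_tailnet_ip: expected an IPv4 address" >&2; return 1 ;;
+  esac
+  sed "s/TAILSCALE_IP_REQUIRED/${ip}/g"
+}
+```
+
+For example, `printf 'BIND_ADDR=TAILSCALE_IP_REQUIRED\n' | fill_tailnet_ip "$(tailscale ip -4)"`
+prints the line with the host's own address substituted, which keeps real
+tailnet IPs out of the committed template.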
+
+## Verify
+
+From another tailnet host:
+
+```sh
+printf '{"cmd":"status"}\n' | nc -w2 <host-tailnet-ip> 9190
+```
+
+A healthy bridge returns the daemon's status JSON (including the daemon's
+`host`), confirming reachability end to end over the mesh.
--- a/docs/guide/architecture/index.md
+++ b/docs/guide/architecture/index.md
@@ -40,5 +40,6 @@ Shared platform services:
 - [FreeBSD jail implementation](./freebsd-jail-implementation/)
 - [Bastille lifecycle](./bastille/)
 - [Control Plane](./controlplane/)
+- [Control-Plane Bridge](./control-plane-bridge/)
 - [Colibri](./colibri/)
 - [Admin Panel](./admin-panel/)
--- a/docs/guide/operate/index.md
+++ b/docs/guide/operate/index.md
@@ -7,6 +7,7 @@ Runbooks for day-to-day operation and recovery.

 - [Security](./security/)
 - [Monitoring](./monitoring/)
+- [Terminal Capture & Signature Triage](./terminal-capture/)
 - [Operator Commands](./operator-commands/)
 - [Structured Reports](./structured-reports/)
 - [Provider Fallback](./provider-fallback/)
--- a/docs/guide/operate/terminal-capture.md
+++ b/docs/guide/operate/terminal-capture.md
@@ -0,0 +1,79 @@
+---
+title: 'Terminal Capture & Signature Triage'
+description: Deduplicated tmux pane history with edge-triggered failure alerts.
+---
+
+Terminal capture is the screen-scraping half of Glasspane. Where the rest of
+Glasspane derives agent state from structured JSONL events, this layer records
+the **actual terminal text** of a pane and triages it against known patterns —
+so Colibri can both *remember* what a terminal showed and *speak up* the moment
+something it recognises goes wrong.
+
+It lives in `colibri-glasspane` (`terminal.rs`, `signatures.rs`) and is driven
+by the `colibri-daemon` poll loop.
+
+## How it works
+
+- **Content-hash framing.** A frame's id is `SHA-256(stripped_text)[:12]`.
+  Identical screens produce identical ids.
+- **Deduplicated history.** The recorder drops any frame whose hash equals the
+  previous one, so polling a near-static pane every few seconds collapses into a
+  compact log of *actual* state transitions, not thousands of duplicates. The
+  history is a bounded ring buffer per pane.
+- **Signature triage.** Each captured frame is scanned by a `SignatureSet`.
+  A signature carries a severity (`error`/`warn`/`info`/`ok`), a plain-language
+  `next_action`, and an optional `invoke` (a skill to run to remediate). Matches
+  are classified into `failures` / `warnings` / `info` / `healthy`.
+- **Edge-triggered alerts.** A failure/warning is reported only on the frame
+  where it *first appears* — not on every subsequent frame that still shows it.
+  When the condition clears and later recurs, it fires again. This is what keeps
+  a persistent error from spamming alerts.
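+
+The framing, dedup, and edge-trigger rules above can be sketched in a few
+lines of shell. This is an illustration only, not the Rust code in
+`terminal.rs` / `signatures.rs`: the function names, the `LAST_ID` and
+`FRAME_STATUS` variables, and the signature names used in the usage note are
+hypothetical, and the id is assumed to be the first 12 hex characters of the
+digest.
+
+```sh
+# Frame id: first 12 hex characters of SHA-256 over the stripped text.
+# Assumes GNU coreutils `sha256sum` is available.
+frame_id() {
+  printf '%s' "$1" | sha256sum | cut -c1-12
+}
+
+# Dedup: record a frame only when its id differs from the previous one.
+# LAST_ID holds the id of the most recently recorded frame;
+# FRAME_STATUS is set to "recorded" or "unchanged".
+LAST_ID=""
+record_frame() {
+  id="$(frame_id "$1")"
+  if [ "$id" = "$LAST_ID" ]; then
+    FRAME_STATUS="unchanged"
+  else
+    LAST_ID="$id"
+    FRAME_STATUS="recorded"
+  fi
+}
+
+# Edge trigger: print only signatures that are firing now but were not
+# firing on the previous frame. Both arguments are space-separated names.
+new_alerts() {
+  prev=" $1 "
+  for sig in $2; do
+    case "$prev" in
+      *" $sig "*) ;;                 # still firing: stay silent
+      *) printf '%s\n' "$sig" ;;     # newly appeared: report
+    esac
+  done
+}
+```
+
+Calling `record_frame` twice with the same text leaves `FRAME_STATUS` at
+`unchanged` the second time. `new_alerts "disk-full" "disk-full oom-killer"`
+prints only `oom-killer`, and once `disk-full` drops out of the previous set
+it is reported again on its next appearance.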
+
+The signature set is per-OS configuration: `SignatureSet::linux_default()` ships
+a small, high-value starter set (systemd unit failures, OOM-killer, disk full,
+Docker non-zero exits, IP-forwarding, firewall posture). Other hosts load a
+different set; the matcher is shared.
+
+> PNG rendering of a capture is **not** part of this layer — that stays with the
+> `tmux-screenshot` skill for human viewing. Colibri owns capture, history, and
+> triage; the text and signatures are the diagnostic artifacts.
+
+## Configuration
+
+Set these in the daemon's environment (capture is off by default):
+
+| Variable | Purpose | Default |
+| --- | --- | --- |
+| `COLIBRI_TERMINAL_CAPTURE` | Enable the poll loop (`1`/`true`/`yes`/`on`) | off |
+| `COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS` | Seconds between captures of each watched pane | `5` |
+| `COLIBRI_TERMINAL_WATCH` | Comma-separated tmux targets to watch from startup | _(none)_ |
+| `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` | Route edge-triggered alerts to Telegram | _(unset → log only)_ |
+
+When the bot token/chat id are unset, alerts degrade cleanly to a daemon log
+line — the feature is safe to leave enabled without Telegram configured.
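+
+As an illustration, a daemon environment that enables capture without
+Telegram might look like the fragment below. The variable names come from the
+table; the values are examples only (`clawdie:1` is a made-up second target),
+and how the environment reaches the daemon (for instance a systemd
+`EnvironmentFile=`) depends on the host.
+
+```sh
+# Example values only; variable names are taken from the table above.
+COLIBRI_TERMINAL_CAPTURE=1
+COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS=10
+COLIBRI_TERMINAL_WATCH=clawdie:0,clawdie:1
+# TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are left unset, so alerts go to
+# the daemon log only.
+```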
+
+## Control-plane commands
+
+Over the Colibri socket (newline-delimited JSON):
+
+| Command | Effect |
+| --- | --- |
+| `{"cmd":"terminal-watch","target":"clawdie:0"}` | Start recording a tmux target (session / `session:window` / `%pane`) |
+| `{"cmd":"terminal-unwatch","target":"clawdie:0"}` | Stop recording and drop the pane's history |
+| `{"cmd":"terminal-list"}` | Watched panes with frame counts and currently-firing alerts |
+| `{"cmd":"terminal-history","target":"clawdie:0","limit":20}` | Recent recorded frames (text + detection) for a pane |
+| `{"cmd":"terminal-poll","target":"clawdie:0"}` | Capture now instead of waiting for the tick (`target` optional → all) |
+
+`terminal-poll` returns, per pane, whether the frame was `recorded` or
+`unchanged` (deduped) and any `new_alerts` that fired on this capture.
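+
+A small shell wrapper makes these commands easier to issue by hand. This is a
+sketch, not part of Colibri: the helper names are hypothetical, `socat` is
+assumed to be installed, and the socket path falls back to
+`/run/colibri/colibri.sock` when `COLIBRI_DAEMON_SOCKET` is unset. Targets are
+inserted verbatim, so they must not contain quotes or backslashes.
+
+```sh
+# Hypothetical helpers that build request lines from the table above.
+terminal_watch_req() {
+  printf '{"cmd":"terminal-watch","target":"%s"}\n' "$1"
+}
+terminal_history_req() {
+  # Second argument is the frame limit; defaults to 20.
+  printf '{"cmd":"terminal-history","target":"%s","limit":%d}\n' "$1" "${2:-20}"
+}
+
+# Send request lines from stdin to the daemon socket and print the reply.
+colibri_send() {
+  socat -t2 - "UNIX-CONNECT:${COLIBRI_DAEMON_SOCKET:-/run/colibri/colibri.sock}"
+}
+```
+
+For example, `terminal_watch_req clawdie:0 | colibri_send` starts recording,
+and `terminal_history_req clawdie:0 5 | colibri_send` fetches the last five
+recorded frames.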
+
+## Operating notes
+
+- Watch the panes that matter (a build pane, the daemon pane, a status window),
+  not every pane — the value is in signal, not volume.
+- Alerts are edge-triggered, so a failing service alerts once and re-alerts only
+  after it recovers and breaks again; use `terminal-list` to see what is
+  currently latched as firing.
+- When a failure signature carries an `invoke`, that names the skill to run to
+  remediate — the alert is "here's what broke and how to fix it", not just
+  "something happened".