Merge pull request 'docs(guide): add Terminal Capture & Signature Triage page' (#209) from docs/guide-port into main

Reviewed-on: #209
clawdie 2026-06-26 09:35:16 +02:00
commit dc9f7f06e7
4 changed files with 174 additions and 0 deletions


@@ -0,0 +1,93 @@
---
title: 'Control-Plane Bridge'
description: Reaching the Colibri control plane across hosts over the Tailscale mesh.
---
Each host runs `colibri-daemon` listening on a **Unix domain socket** (local
only). The control-plane bridge exposes that socket as a **TCP port on the
Tailscale interface** so other mesh hosts can drive the control plane — create
tasks, register agents, watch terminals — without the socket ever being
reachable from the public internet.
```
operator / peer host                        bridged host
  nc <tailnet-ip> 9190   ──tailscale0──▶   socat TCP-LISTEN:9190
                                             │  (bind = this host's tailnet IP)
                                           UNIX-CONNECT /run/colibri/colibri.sock
                                           colibri-daemon
```
## Implementations
The bridge is a thin `socat` front-end, supervised by the host's service
manager. Implementations for both platforms ship in the repo:
| Host | Service | Packaging |
| --- | --- | --- |
| FreeBSD | rc.d `colibri_bridge` | `packaging/freebsd/colibri_bridge.in` |
| Linux | systemd `colibri-bridge.service` | `packaging/linux/` (unit + env + nft + README) |
Both effectively run the same command:
```
socat TCP-LISTEN:9190,bind=<this-host-tailnet-ip>,fork,reuseaddr \
UNIX-CONNECT:/run/colibri/colibri.sock
```
The Linux unit adds `freebind` so socat can bind the tailnet address before
`tailscaled` has finished bringing it up, avoiding a boot-order race. The
`bind=<tailnet-ip>` keeps the listener off every other interface even if the
firewall is later changed — defence in depth, not the primary gate.
## Network gate
The bridge port is opened **only on the Tailscale interface**, in the host's
native firewall:
- **FreeBSD (pf):** `pass in quick on tailscale0 proto tcp to port 9190 keep state`
- **Linux (ufw):** `ufw allow in on tailscale0 to any port 9190 proto tcp`
On a default-deny host (e.g. one running ufw), the public side is already
blocked, so only the interface-scoped *allow* is needed. The
`packaging/linux/colibri-bridge.nft` ruleset is for Linux hosts that do **not**
run ufw (i.e. hosts with a default-accept input chain); under ufw it is redundant.
## Security model — the tailnet boundary is the auth
The control-plane socket has **no authentication of its own**. Once it is
bridged, any peer that can reach the host over the tailnet can issue the full
command set (`spawn-agent`, `kill-agent`, `intake-task`, `terminal-*`, …). That
makes the **Tailscale boundary the access control**:
- Scope the port to named peers with a **Tailscale ACL** on `:9190` rather than
relying on the firewall allow alone.
- Treat any bridged host as granting control-plane authority to the whole
tailnet unless an ACL narrows it.
## Configuration notes
- **No real tailnet IPs in git.** Config templates ship the placeholder
  `TAILSCALE_IP_REQUIRED`; the operator fills in the host's own address at
  deploy time (`tailscale ip -4`). With its default settings, the FreeBSD rc.d
  service likewise refuses to start until the address is set.
- **Socket-path parity.** The bridge connects to `/run/colibri/colibri.sock`
(FreeBSD: `/var/run/colibri/colibri.sock`); the daemon must be started with a
matching `COLIBRI_DAEMON_SOCKET`. The daemon's default lives under
`$XDG_DATA_HOME`, which a sandboxed unit (`ProtectHome=yes`) cannot reach —
point both at the `/run` path.
- The bridge user must be in the daemon socket's group (the socket is `0770`,
owner + group).
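The placeholder rule above can be sketched as a small pre-flight check. This is an illustration only, not the shipped rc.d or systemd logic: `build_socat_argv` and the constant names are hypothetical, and the range check assumes the bridge binds an IPv4 tailnet address from the `100.64.0.0/10` block that Tailscale assigns from.

```python
import ipaddress

PLACEHOLDER = "TAILSCALE_IP_REQUIRED"                # placeholder shipped in the templates
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")   # CGNAT block used for tailnet IPv4


def build_socat_argv(bind_ip: str,
                     socket_path: str = "/run/colibri/colibri.sock",
                     port: int = 9190) -> list[str]:
    """Return the socat argv for the bridge, or raise ValueError on a bad bind address."""
    if not bind_ip or bind_ip == PLACEHOLDER:
        raise ValueError("bind address not set; fill in the output of `tailscale ip -4`")
    addr = ipaddress.ip_address(bind_ip)              # raises ValueError if malformed
    if addr.version != 4 or addr not in TAILNET_V4:
        raise ValueError(f"{bind_ip} is not a tailnet IPv4 address")
    return [
        "socat",
        f"TCP-LISTEN:{port},bind={bind_ip},fork,reuseaddr",
        f"UNIX-CONNECT:{socket_path}",
    ]
```

On FreeBSD the same sketch would be called with `socket_path="/var/run/colibri/colibri.sock"`. Failing before `socat` starts keeps a forgotten placeholder from turning into a listener on an unintended address.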
## Verify
From another tailnet host:
```sh
printf '{"cmd":"status"}\n' | nc -w2 <host-tailnet-ip> 9190
```
A healthy bridge returns the daemon's status JSON (including the daemon's
`host`), confirming reachability end to end over the mesh.
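The same probe can be scripted. The sketch below assumes the daemon answers each request with a single newline-terminated JSON line; that framing is inferred from the `nc` example, and the helper names `exchange` and `send_command` are hypothetical.

```python
import json
import socket


def exchange(conn: socket.socket, command: dict) -> dict:
    """Send one JSON command line on a connected socket and parse the first reply line."""
    conn.sendall((json.dumps(command, separators=(",", ":")) + "\n").encode("utf-8"))
    buf = b""
    while b"\n" not in buf:
        chunk = conn.recv(4096)
        if not chunk:                      # peer closed before finishing a line
            break
        buf += chunk
    line = buf.split(b"\n", 1)[0]
    if not line:
        raise ConnectionError("bridge closed the connection without replying")
    return json.loads(line)


def send_command(host: str, port: int, command: dict, timeout: float = 2.0) -> dict:
    """Open a TCP connection to the bridge, run one command, and close the connection."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return exchange(conn, command)
```

From a peer host this would be called as `send_command("<host-tailnet-ip>", 9190, {"cmd": "status"})`, with the placeholder replaced by the bridged host's tailnet address.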


@@ -40,5 +40,6 @@ Shared platform services:
- [FreeBSD jail implementation](./freebsd-jail-implementation/)
- [Bastille lifecycle](./bastille/)
- [Control Plane](./controlplane/)
- [Control-Plane Bridge](./control-plane-bridge/)
- [Colibri](./colibri/)
- [Admin Panel](./admin-panel/)


@@ -7,6 +7,7 @@ Runbooks for day-to-day operation and recovery.
- [Security](./security/)
- [Monitoring](./monitoring/)
- [Terminal Capture & Signature Triage](./terminal-capture/)
- [Operator Commands](./operator-commands/)
- [Structured Reports](./structured-reports/)
- [Provider Fallback](./provider-fallback/)


@@ -0,0 +1,79 @@
---
title: 'Terminal Capture & Signature Triage'
description: Deduplicated tmux pane history with edge-triggered failure alerts.
---
Terminal capture is the screen-scraping half of Glasspane. Where the rest of
Glasspane derives agent state from structured JSONL events, this layer records
the **actual terminal text** of a pane and triages it against known patterns —
so Colibri can both *remember* what a terminal showed and *speak up* the moment
something it recognises goes wrong.
It lives in `colibri-glasspane` (`terminal.rs`, `signatures.rs`) and is driven
by the `colibri-daemon` poll loop.
## How it works
- **Content-hash framing.** A frame's id is `SHA-256(stripped_text)[:12]`.
Identical screens produce identical ids.
- **Deduplicated history.** The recorder drops any frame whose hash equals the
previous one, so polling a near-static pane every few seconds collapses into a
compact log of *actual* state transitions, not thousands of duplicates. The
history is a bounded ring buffer per pane.
- **Signature triage.** Each captured frame is scanned by a `SignatureSet`.
A signature carries a severity (`error`/`warn`/`info`/`ok`), a plain-language
`next_action`, and an optional `invoke` (a skill to run to remediate). Matches
are classified into `failures` / `warnings` / `info` / `healthy`.
- **Edge-triggered alerts.** A failure/warning is reported only on the frame
where it *first appears* — not on every subsequent frame that still shows it.
When the condition clears and later recurs, it fires again. This is what keeps
a persistent error from spamming alerts.
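Content-hash framing and deduplicated history can be sketched in a few lines. This is an illustration in Python, not the code in `terminal.rs`: the class name `PaneHistory`, the capacity of 256, and the reading of "stripped" as whitespace-trimmed are all assumptions.

```python
import hashlib
from collections import deque


def frame_id(stripped_text: str) -> str:
    """First 12 hex characters of SHA-256 over the stripped pane text."""
    return hashlib.sha256(stripped_text.encode("utf-8")).hexdigest()[:12]


class PaneHistory:
    """Bounded per-pane history that records a frame only when its hash changes."""

    def __init__(self, capacity: int = 256):      # capacity is an illustrative value
        self.frames = deque(maxlen=capacity)       # oldest frames drop off when full
        self._last_id = None                       # id of the previous capture

    def record(self, raw_text: str) -> str:
        text = raw_text.strip()                    # assumption: "stripped" = trimmed
        fid = frame_id(text)
        if fid == self._last_id:
            return "unchanged"                     # same screen as last time: skip
        self._last_id = fid
        self.frames.append((fid, text))
        return "recorded"
```

Because only the previous hash is compared, a screen that flips A, B, A is stored three times, which matches the idea of logging state transitions rather than distinct screens.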
The signature set is per-OS configuration: `SignatureSet::linux_default()` ships
a small, high-value starter set (systemd unit failures, OOM-killer, disk full,
Docker non-zero exits, IP-forwarding, firewall posture). Other hosts load a
different set; the matcher is shared.
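A shared matcher over a swappable signature set might look like the sketch below. The three entries are hypothetical examples, not the contents of `SignatureSet::linux_default()`; regular-expression matching and the skill name `disk-cleanup` are likewise assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signature:
    name: str
    pattern: str                   # regular expression searched in the frame text
    severity: str                  # "error" | "warn" | "info" | "ok"
    next_action: str               # plain-language hint for the operator
    invoke: Optional[str] = None   # skill to run to remediate, if any


# Hypothetical starter entries for illustration only.
EXAMPLE_SET = [
    Signature("oom-killer", r"Out of memory: Killed process", "error",
              "Find the killed process and review memory limits."),
    Signature("disk-full", r"No space left on device", "error",
              "Free space on the affected filesystem.", invoke="disk-cleanup"),
    Signature("unit-active", r"Active: active \(running\)", "ok",
              "Nothing to do."),
]

BUCKET = {"error": "failures", "warn": "warnings", "info": "info", "ok": "healthy"}


def triage(text: str, signatures=EXAMPLE_SET) -> dict:
    """Group the names of matching signatures into the four result buckets."""
    result = {bucket: [] for bucket in BUCKET.values()}
    for sig in signatures:
        if re.search(sig.pattern, text):
            result[BUCKET[sig.severity]].append(sig.name)
    return result
```

Swapping the per-OS set then means passing a different list to `triage`; the matching loop itself does not change.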
> PNG rendering of a capture is **not** part of this layer — that stays with the
> `tmux-screenshot` skill for human viewing. Colibri owns capture, history, and
> triage; the text and signatures are the diagnostic artifacts.
## Configuration
Set these variables in the daemon's environment (capture is off by default):
| Variable | Purpose | Default |
| --- | --- | --- |
| `COLIBRI_TERMINAL_CAPTURE` | Enable the poll loop (`1`/`true`/`yes`/`on`) | off |
| `COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS` | Seconds between captures of each watched pane | `5` |
| `COLIBRI_TERMINAL_WATCH` | Comma-separated tmux targets to watch from startup | _(none)_ |
| `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` | Route edge-triggered alerts to Telegram | _(unset → log only)_ |
When the bot token/chat id are unset, alerts degrade cleanly to a daemon log
line — the feature is safe to leave enabled without Telegram configured.
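How these variables combine can be sketched as follows. The parsing details are assumptions rather than the daemon's documented behaviour: truthy values are matched case-insensitively, whitespace around watch targets is trimmed, and Telegram routing requires both variables to be set. The function name `load_capture_config` is hypothetical.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def load_capture_config(env=os.environ) -> dict:
    """Read the capture settings from an environment mapping, applying the defaults."""
    enabled = env.get("COLIBRI_TERMINAL_CAPTURE", "").strip().lower() in TRUTHY
    interval = int(env.get("COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS", "5"))
    watch = [t.strip()
             for t in env.get("COLIBRI_TERMINAL_WATCH", "").split(",")
             if t.strip()]
    # Assumption: alerts go to Telegram only when both values are present.
    telegram = bool(env.get("TELEGRAM_BOT_TOKEN")) and bool(env.get("TELEGRAM_CHAT_ID"))
    return {
        "enabled": enabled,
        "interval_secs": interval,
        "watch": watch,
        "alert_sink": "telegram" if telegram else "log",
    }
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check, for example `load_capture_config({})` yields a disabled loop with a 5-second interval and log-only alerts.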
## Control-plane commands
Over the Colibri socket (newline-delimited JSON):
| Command | Effect |
| --- | --- |
| `{"cmd":"terminal-watch","target":"clawdie:0"}` | Start recording a tmux target (session / `session:window` / `%pane`) |
| `{"cmd":"terminal-unwatch","target":"clawdie:0"}` | Stop recording and drop the pane's history |
| `{"cmd":"terminal-list"}` | Watched panes with frame counts and currently-firing alerts |
| `{"cmd":"terminal-history","target":"clawdie:0","limit":20}` | Recent recorded frames (text + detection) for a pane |
| `{"cmd":"terminal-poll","target":"clawdie:0"}` | Capture now instead of waiting for the tick (`target` optional → all) |
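The requests in the table can be built and sent from a script on the same host. This is a minimal sketch: the helper names are hypothetical, and reading a single reply line is an assumption about the response framing.

```python
import json
import socket

SOCKET_PATH = "/run/colibri/colibri.sock"   # FreeBSD: /var/run/colibri/colibri.sock


def terminal_command(cmd: str, target=None, limit=None) -> bytes:
    """Encode one terminal-* request as a newline-terminated JSON line."""
    message = {"cmd": cmd}
    if target is not None:
        message["target"] = target
    if limit is not None:
        message["limit"] = limit
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def send_local(line: bytes, path: str = SOCKET_PATH, timeout: float = 2.0) -> bytes:
    """Write one request to the daemon's Unix socket and return the first reply line."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.settimeout(timeout)
        conn.connect(path)
        conn.sendall(line)
        buf = b""
        while b"\n" not in buf:
            chunk = conn.recv(4096)
            if not chunk:                   # daemon closed the connection
                break
            buf += chunk
    return buf.split(b"\n", 1)[0]
```

For example, `send_local(terminal_command("terminal-watch", "clawdie:0"))` would start recording a target, and `terminal_command("terminal-poll")` omits `target` to poll every watched pane.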
`terminal-poll` returns, per pane, whether the frame was `recorded` or
`unchanged` (deduped) and any `new_alerts` that fired on this capture.
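The `new_alerts` value follows the edge-triggered rule: a condition is reported only on the capture where it first appears. A minimal sketch of that latch, with the hypothetical class name `EdgeTrigger` and conditions represented as plain strings:

```python
class EdgeTrigger:
    """Report a condition only on the capture where it first appears."""

    def __init__(self):
        self.firing = set()   # conditions currently latched as firing

    def update(self, matched: set) -> set:
        """Take the conditions matched on this frame; return those that just appeared."""
        new_alerts = matched - self.firing
        self.firing = set(matched)   # conditions no longer matched are cleared
        return new_alerts
```

Feeding it `{"disk-full"}` twice returns the alert once and then an empty set; after a frame with no match, the next `{"disk-full"}` fires again. The `firing` set corresponds to what `terminal-list` would show as currently firing.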
## Operating notes
- Watch the panes that matter (a build pane, the daemon pane, a status window),
not every pane — the value is in signal, not volume.
- Alerts are edge-triggered, so a failing service alerts once and re-alerts only
after it recovers and breaks again; use `terminal-list` to see what is
currently latched as firing.
- When a failure signature carries an `invoke`, that names the skill to run to
remediate — the alert is "here's what broke and how to fix it", not just
"something happened".