Merge pull request 'docs(guide): add Terminal Capture & Signature Triage page' (#209) from docs/guide-port into main
Reviewed-on: #209
commit dc9f7f06e7
4 changed files with 174 additions and 0 deletions
93
docs/guide/architecture/control-plane-bridge.md
Normal file
@ -0,0 +1,93 @@
---
title: 'Control-Plane Bridge'
description: Reaching the Colibri control plane across hosts over the Tailscale mesh.
---

Each host runs `colibri-daemon` listening on a **Unix domain socket** (local
only). The control-plane bridge exposes that socket as a **TCP port on the
Tailscale interface** so other mesh hosts can drive the control plane (create
tasks, register agents, watch terminals) without the socket ever being
reachable from the public internet.

```
operator / peer host                       bridged host
nc <tailnet-ip> 9190   ──tailscale0──▶   socat TCP-LISTEN:9190
                                            │  (bind = this host's tailnet IP)
                                            ▼
                                         UNIX-CONNECT /run/colibri/colibri.sock
                                            │
                                            ▼
                                         colibri-daemon
```

## Implementations

The bridge is a thin `socat` front-end, supervised by the host's service
manager. Both sides are shipped in the repo:

| Host | Service | Packaging |
| --- | --- | --- |
| FreeBSD | rc.d `colibri_bridge` | `packaging/freebsd/colibri_bridge.in` |
| Linux | systemd `colibri-bridge.service` | `packaging/linux/` (unit + env + nft + README) |

Both effectively run:

```
socat TCP-LISTEN:9190,bind=<this-host-tailnet-ip>,fork,reuseaddr \
  UNIX-CONNECT:/run/colibri/colibri.sock
```

The Linux unit adds `freebind` so socat can bind the tailnet address before
`tailscaled` has finished bringing it up, avoiding a boot-order race. The
`bind=<tailnet-ip>` keeps the listener off every other interface even if the
firewall is later changed; this is defence in depth, not the primary gate.

## Network gate

The bridge port is opened **only on the Tailscale interface**, in the host's
native firewall:

- **FreeBSD (pf):** `pass in quick on tailscale0 proto tcp to port 9190 keep state`
- **Linux (ufw):** `ufw allow in on tailscale0 to any port 9190 proto tcp`

On a default-deny host (e.g. ufw), the public side is already blocked, so only
the interface-scoped *allow* is needed. The `packaging/linux/colibri-bridge.nft`
ruleset is provided for Linux hosts that do **not** run ufw (a default-accept
input chain); under ufw it is redundant.

## Security model: the tailnet boundary is the auth

The control-plane socket has **no authentication of its own**. Once it is
bridged, any peer that can reach the host over the tailnet can issue the full
command set (`spawn-agent`, `kill-agent`, `intake-task`, `terminal-*`, …). That
makes the **Tailscale boundary the access control**:

- Scope the port to named peers with a **Tailscale ACL** on `:9190` rather than
  relying on the firewall allow alone.
- Treat any bridged host as granting control-plane authority to the whole
  tailnet unless an ACL narrows it.

## Configuration notes

- **No real tailnet IPs in git.** Config templates ship the placeholder
  `TAILSCALE_IP_REQUIRED`; the operator fills the host's own address at deploy
  time (`tailscale ip -4`). The FreeBSD rc.d defaults likewise refuse to start
  until the address is set.
- **Socket-path parity.** The bridge connects to `/run/colibri/colibri.sock`
  (FreeBSD: `/var/run/colibri/colibri.sock`); the daemon must be started with a
  matching `COLIBRI_DAEMON_SOCKET`. The daemon's default lives under
  `$XDG_DATA_HOME`, which a sandboxed unit (`ProtectHome=yes`) cannot reach, so
  point both at the `/run` path.
- The bridge user must be in the daemon socket's group (the socket is `0770`,
  owner + group).
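
The placeholder rule above can be illustrated with a small preflight check. This is a sketch only, not the shipped rc.d or systemd logic: `build_socat_argv` is a hypothetical helper, and the extra check against the Tailscale IPv4 range (100.64.0.0/10) is an added assumption rather than something the packaging is known to do.

```python
import ipaddress

PLACEHOLDER = "TAILSCALE_IP_REQUIRED"
# Tailscale hands out IPv4 addresses from the CGNAT range 100.64.0.0/10.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def build_socat_argv(tailnet_ip: str, socket_path: str, port: int = 9190) -> list:
    """Return the socat argv for the bridge, or raise if the config is unsafe."""
    if tailnet_ip == PLACEHOLDER:
        raise ValueError("tailnet IP is still the placeholder; set it at deploy time")
    addr = ipaddress.ip_address(tailnet_ip)  # raises ValueError on malformed input
    # Hypothetical extra guard: reject IPv4 addresses outside the tailnet range.
    if addr.version == 4 and addr not in TAILNET_V4:
        raise ValueError(f"{tailnet_ip} is not in the Tailscale IPv4 range")
    return [
        "socat",
        f"TCP-LISTEN:{port},bind={tailnet_ip},fork,reuseaddr",
        f"UNIX-CONNECT:{socket_path}",
    ]
```

Refusing to build the command line at all, rather than falling back to a wildcard bind, keeps a half-configured host from exposing the socket on other interfaces.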

## Verify

From another tailnet host:

```sh
printf '{"cmd":"status"}\n' | nc -w2 <host-tailnet-ip> 9190
```

A healthy bridge returns the daemon's status JSON (including the daemon's
`host`), confirming reachability end to end over the mesh.
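
The same check can be scripted. The sketch below assumes the reply is a single newline-terminated JSON object, which matches the newline-delimited framing but is not confirmed for every command; `exchange` and `send_command` are hypothetical helper names.

```python
import json
import socket


def exchange(conn: socket.socket, command: dict) -> dict:
    """Send one newline-delimited JSON command and parse the first reply line."""
    conn.sendall((json.dumps(command) + "\n").encode("utf-8"))
    buf = b""
    while b"\n" not in buf:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed before completing a line
            break
        buf += chunk
    # json.loads raises if the bridge closed without sending anything.
    return json.loads(buf.split(b"\n", 1)[0])


def send_command(host: str, port: int, command: dict, timeout: float = 2.0) -> dict:
    """Connect to the bridge over TCP and run a single command."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return exchange(conn, command)
```

For example, `send_command("<host-tailnet-ip>", 9190, {"cmd": "status"})` mirrors the `nc` one-liner, with the two-second timeout playing the role of `-w2`.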
@ -40,5 +40,6 @@ Shared platform services:
- [FreeBSD jail implementation](./freebsd-jail-implementation/)
- [Bastille lifecycle](./bastille/)
- [Control Plane](./controlplane/)
- [Control-Plane Bridge](./control-plane-bridge/)
- [Colibri](./colibri/)
- [Admin Panel](./admin-panel/)

@ -7,6 +7,7 @@ Runbooks for day-to-day operation and recovery.
- [Security](./security/)
- [Monitoring](./monitoring/)
- [Terminal Capture & Signature Triage](./terminal-capture/)
- [Operator Commands](./operator-commands/)
- [Structured Reports](./structured-reports/)
- [Provider Fallback](./provider-fallback/)

79
docs/guide/operate/terminal-capture.md
Normal file
@ -0,0 +1,79 @@
---
title: 'Terminal Capture & Signature Triage'
description: Deduplicated tmux pane history with edge-triggered failure alerts.
---

Terminal capture is the screen-scraping half of Glasspane. Where the rest of
Glasspane derives agent state from structured JSONL events, this layer records
the **actual terminal text** of a pane and triages it against known patterns,
so Colibri can both *remember* what a terminal showed and *speak up* the moment
something it recognises goes wrong.

It lives in `colibri-glasspane` (`terminal.rs`, `signatures.rs`) and is driven
by the `colibri-daemon` poll loop.

## How it works

- **Content-hash framing.** A frame's id is `SHA-256(stripped_text)[:12]`.
  Identical screens produce identical ids.
- **Deduplicated history.** The recorder drops any frame whose hash equals the
  previous one, so polling a near-static pane every few seconds collapses into a
  compact log of *actual* state transitions, not thousands of duplicates. The
  history is a bounded ring buffer per pane.
- **Signature triage.** Each captured frame is scanned by a `SignatureSet`.
  A signature carries a severity (`error`/`warn`/`info`/`ok`), a plain-language
  `next_action`, and an optional `invoke` (a skill to run to remediate). Matches
  are classified into `failures` / `warnings` / `info` / `healthy`.
- **Edge-triggered alerts.** A failure/warning is reported only on the frame
  where it *first appears*, not on every subsequent frame that still shows it.
  When the condition clears and later recurs, it fires again. This is what keeps
  a persistent error from spamming alerts.
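
The framing and deduplication steps above can be sketched in a few lines. This is an illustration, not the `terminal.rs` implementation: it assumes "stripped" means ANSI escape sequences and trailing whitespace removed, that the id is the first 12 hex characters of the digest, and that the ring buffer holds 256 frames; `frame_id` and `PaneHistory` are hypothetical names.

```python
import hashlib
import re
from collections import deque

# Assumption: stripping removes ANSI CSI sequences and trailing whitespace.
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def frame_id(raw_text: str) -> str:
    """Content hash of the stripped screen text, truncated to 12 hex chars."""
    lines = ANSI_RE.sub("", raw_text).splitlines()
    stripped = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()[:12]


class PaneHistory:
    """Bounded, deduplicated history for one pane."""

    def __init__(self, capacity: int = 256):
        # deque(maxlen=...) acts as the ring buffer: old frames fall off the left.
        self.frames = deque(maxlen=capacity)

    def record(self, raw_text: str) -> str:
        fid = frame_id(raw_text)
        if self.frames and self.frames[-1][0] == fid:
            return "unchanged"  # same screen as the previous capture
        self.frames.append((fid, raw_text))
        return "recorded"
```

Comparing only against the previous frame, rather than the whole history, is what makes the log a record of transitions: a screen that returns to an earlier state is recorded again.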

The signature set is per-OS configuration: `SignatureSet::linux_default()` ships
a small, high-value starter set (systemd unit failures, OOM-killer, disk full,
Docker non-zero exits, IP-forwarding, firewall posture). Other hosts load a
different set; the matcher is shared.
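
To make the triage and edge-triggering behaviour concrete, here is a minimal sketch. The `Signature` fields follow the description above, but the regular expressions in `DEMO_SIGNATURES` are illustrative stand-ins, not the contents of `SignatureSet::linux_default()`, and `EdgeTrigger` is a hypothetical name.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Signature:
    name: str
    pattern: str                  # regex searched in the frame text
    severity: str                 # "error" | "warn" | "info" | "ok"
    next_action: str
    invoke: Optional[str] = None  # optional skill to run to remediate


# Hypothetical starter patterns, for illustration only.
DEMO_SIGNATURES = [
    Signature("oom", r"Out of memory: Killed process", "error",
              "check memory limits"),
    Signature("disk-full", r"No space left on device", "error",
              "free disk space", invoke="disk-cleanup"),
]


class EdgeTrigger:
    """Report a failure/warning only on the frame where it first appears."""

    def __init__(self, signatures):
        self.signatures = list(signatures)
        self.firing = set()  # names of alerts latched as currently firing

    def scan(self, text: str) -> List[Signature]:
        matched = [s for s in self.signatures if re.search(s.pattern, text)]
        alerting = {s.name for s in matched if s.severity in ("error", "warn")}
        new = [s for s in matched
               if s.name in alerting and s.name not in self.firing]
        # Conditions that cleared drop out here, so they can fire again later.
        self.firing = alerting
        return new
```

Replacing the latched set on every scan is the whole mechanism: an alert is new only if it was absent from the previous frame.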

> PNG rendering of a capture is **not** part of this layer; that stays with the
> `tmux-screenshot` skill for human viewing. Colibri owns capture, history, and
> triage; the text and signatures are the diagnostic artifacts.

## Configuration

Set on the daemon's environment (off by default):

| Variable | Purpose | Default |
| --- | --- | --- |
| `COLIBRI_TERMINAL_CAPTURE` | Enable the poll loop (`1`/`true`/`yes`/`on`) | off |
| `COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS` | Seconds between captures of each watched pane | `5` |
| `COLIBRI_TERMINAL_WATCH` | Comma-separated tmux targets to watch from startup | _(none)_ |
| `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` | Route edge-triggered alerts to Telegram | _(unset → log only)_ |

When the bot token/chat id are unset, alerts degrade cleanly to a daemon log
line, so the feature is safe to leave enabled without Telegram configured.
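
The table can be read as the following parsing rules. This is a sketch of how a reader of these variables might interpret them, not the daemon's own code: case-insensitive matching, whitespace trimming, and the `CaptureConfig` / `load_capture_config` names are assumptions.

```python
import os
from dataclasses import dataclass
from typing import List

TRUTHY = {"1", "true", "yes", "on"}


@dataclass
class CaptureConfig:
    enabled: bool
    interval_secs: int
    watch: List[str]
    telegram: bool  # True only when both token and chat id are set


def load_capture_config(env=os.environ) -> CaptureConfig:
    # Assumption: values are compared case-insensitively after trimming.
    enabled = env.get("COLIBRI_TERMINAL_CAPTURE", "").strip().lower() in TRUTHY
    interval = int(env.get("COLIBRI_TERMINAL_CAPTURE_INTERVAL_SECS", "5"))
    raw_watch = env.get("COLIBRI_TERMINAL_WATCH", "")
    watch = [t.strip() for t in raw_watch.split(",") if t.strip()]
    telegram = bool(env.get("TELEGRAM_BOT_TOKEN")) and bool(env.get("TELEGRAM_CHAT_ID"))
    return CaptureConfig(enabled, interval, watch, telegram)
```

With an empty environment this yields a disabled loop, a 5-second interval, no watched panes, and log-only alerting, matching the defaults column.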

## Control-plane commands

Over the Colibri socket (newline-delimited JSON):

| Command | Effect |
| --- | --- |
| `{"cmd":"terminal-watch","target":"clawdie:0"}` | Start recording a tmux target (session / `session:window` / `%pane`) |
| `{"cmd":"terminal-unwatch","target":"clawdie:0"}` | Stop recording and drop the pane's history |
| `{"cmd":"terminal-list"}` | Watched panes with frame counts and currently-firing alerts |
| `{"cmd":"terminal-history","target":"clawdie:0","limit":20}` | Recent recorded frames (text + detection) for a pane |
| `{"cmd":"terminal-poll","target":"clawdie:0"}` | Capture now instead of waiting for the tick (`target` optional → all) |

`terminal-poll` returns, per pane, whether the frame was `recorded` or
`unchanged` (deduped) and any `new_alerts` that fired on this capture.
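
A client can build these commands with a small encoder. The sketch below covers only the request side, since the exact reply schema is not spelled out here. Treating `target` as required for `terminal-watch`, `terminal-unwatch`, and `terminal-history` is an assumption drawn from the table, and `encode_command` is a hypothetical helper.

```python
import json

# Assumed target rules, inferred from the command table.
TARGET_RULE = {
    "terminal-watch": "required",
    "terminal-unwatch": "required",
    "terminal-history": "required",
    "terminal-poll": "optional",  # omitted target means all watched panes
    "terminal-list": "none",
}


def encode_command(cmd: str, target=None, limit=None) -> bytes:
    """Encode one terminal-* command as a newline-delimited JSON frame."""
    if cmd not in TARGET_RULE:
        raise ValueError(f"unknown command: {cmd}")
    rule = TARGET_RULE[cmd]
    if rule == "required" and target is None:
        raise ValueError(f"{cmd} needs a target")
    if rule == "none" and target is not None:
        raise ValueError(f"{cmd} takes no target")
    message = {"cmd": cmd}
    if target is not None:
        message["target"] = target
    if limit is not None:
        if cmd != "terminal-history":
            raise ValueError("limit only applies to terminal-history")
        message["limit"] = limit
    # Compact separators reproduce the wire form shown in the table.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")
```

For instance, `encode_command("terminal-history", target="clawdie:0", limit=20)` produces the same bytes as the table row followed by a newline.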

## Operating notes

- Watch the panes that matter (a build pane, the daemon pane, a status window),
  not every pane: the value is in signal, not volume.
- Alerts are edge-triggered, so a failing service alerts once and re-alerts only
  after it recovers and breaks again; use `terminal-list` to see what is
  currently latched as firing.
- When a failure signature carries an `invoke`, that names the skill to run to
  remediate: the alert is "here's what broke and how to fix it", not just
  "something happened".