fix(cms): align docs with Colibri v0.12
parent 7da997402a
commit 3b0fe67400
7 changed files with 92 additions and 113 deletions
@@ -35,13 +35,13 @@ Single clawdie service (host):
 └── Shared hostd access — privileged ops (bastille, zfs, pf)
 ```

-Agents run on the host via the `pi` CLI. Each agent gets:
+Agents run on the host via the `zot` binary (v0.12+) or `pi` CLI (v0.10–v0.11). Each agent gets:

 - A system prompt from their identity file (`SYSADMIN_AGENT.md`, `DB_ADMIN_AGENT.md`, etc.)
 - A persistent session in `data/sessions/{agent}.jsonl`
 - Access to the skills catalog in `data/skills/`
 - `CONTROLPLANE_*` env vars pointing at the local HTTP API
-- All agent spawns use `--no-skills` to disable pi's built-in skill discovery; skills are injected via `--append-system-prompt` from the catalog
+- All agent spawns use `--no-skills` to disable zot's built-in skill discovery; skills are injected via `--append-system-prompt` from the catalog

 API authentication requires `CONTROLPLANE_SHARED_SECRET` — a Bearer token that all agents and API clients must present.

@@ -84,17 +84,17 @@ Agents use a catalog of operational skills sourced from `agent/library.yaml`.

 Skills are discoverable via tags and the `skills_search` extension tool. The control plane can route tasks to the right specialist without depending on the LLM to “remember” what exists.

-| Skill | Agent | Trigger example |
-| ----------------- | --------- | --------------------------------- |
-| `jail-status` | Sysadmin | "Check if db jail is running" |
-| `disk-usage` | Sysadmin | "How much free disk?" |
-| `system-stats` | Sysadmin | "CPU and memory load?" |
-| `service-restart` | Sysadmin | "Restart nginx service" |
-| `backup-db` | DBA | "Back up the database" |
-| `db-vacuum` | DBA | "Run vacuum on system_brain" |
-| `db-migrate` | DBA | "Apply pending migrations" |
-| `git-merge` | Git Admin | "Merge PR #42 into main" |
-| `git-release-tag` | Git Admin | "Tag version v0.11.0" |
+| Skill | Agent | Trigger example |
+| ----------------- | --------- | ----------------------------- |
+| `jail-status` | Sysadmin | "Check if db jail is running" |
+| `disk-usage` | Sysadmin | "How much free disk?" |
+| `system-stats` | Sysadmin | "CPU and memory load?" |
+| `service-restart` | Sysadmin | "Restart nginx service" |
+| `backup-db` | DBA | "Back up the database" |
+| `db-vacuum` | DBA | "Run vacuum on system_brain" |
+| `db-migrate` | DBA | "Apply pending migrations" |
+| `git-merge` | Git Admin | "Merge PR #42 into main" |
+| `git-release-tag` | Git Admin | "Tag version v0.12.0" |

 The catalog evolves over time; for the authoritative current list run `/skills` in Telegram or `just skill-list` on the host.

@@ -102,7 +102,15 @@ Agents also have access to the `skills_search` extension tool, which queries the

 ---

-## Implementation Progress
+## Architecture (v0.12 — Colibri)
+
+As of v0.12, the control plane has been rewritten in Rust as
+[Colibri](https://code.smilepowered.org/clawdie/colibri). The TypeScript
+control plane described below is being pruned. See
+[Colibri Wiki](https://code.smilepowered.org/clawdie/colibri/src/branch/main/docs/wiki/index.md)
+for the current architecture.
+
+### Original (TS, v0.10–v0.11)

 Built in 7 phases. Each phase adds one module and turns its test todos green.

@@ -141,25 +149,25 @@ just setup-controlplane
 Every agent run (orchestrator main chat or specialist heartbeat) records
 three provider/model values in `agent_activity.payload`:

-| Field | Meaning |
-| -------------- | --------------------------------------------------------- |
-| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
-| `effective_*` | What was actually passed to pi (after fallback swap) |
-| `actual_*` | What pi reports having used (parsed from session JSONL) |
+| Field | Meaning |
+| -------------- | ------------------------------------------------------------------ |
+| `configured_*` | What `.env` says (`DEEPSEEK_API_KEY` / `COLIBRI_AUTOSPAWN_BINARY`) |
+| `effective_*` | What was actually passed to zot (after fallback swap) |
+| `actual_*` | What zot reports having used (parsed from session JSONL) |

 `configured_*` and `effective_*` differ when [provider fallback](../operate/provider-fallback/)
 is active (cooldown is live, runtime is using the operator's chosen
 fallback). `actual_*` should match `effective_*` for a successful run; a
-divergence suggests pi rewrote the model selection internally.
+divergence suggests zot rewrote the model selection internally.

 `/budgetreport` and `/tokens` surface these values; `/policy` shows the
 fallback cooldown line when one is active.

 ## References

-- `doc/CONTROLPLANE-ARCHITECTURE.md` — detailed service layout
-- `doc/CONTROLPLANE-MESSAGE-CONTRACT.md` — API contracts (what agents query and post)
-- `doc/CONTROLPLANE-AGENT-ROLES.md` — role definitions, skill mappings, budgets
+- [Colibri control plane](https://code.smilepowered.org/clawdie/colibri) — Rust rewrite, current as of v0.12
+- [Colibri Wiki](https://code.smilepowered.org/clawdie/colibri/src/branch/main/docs/wiki/index.md) — architecture decisions
+- [Colibri AGENTS.md](https://code.smilepowered.org/clawdie/colibri/src/branch/main/AGENTS.md) — agent identities, build, handoff protocol
 - `SOUL.md`, `SYSADMIN_AGENT.md`, `DB_ADMIN_AGENT.md`, `GIT_ADMIN_AGENT.md` — agent identity files
 - [Provider Fallback](../operate/provider-fallback/) — automatic provider switching when the primary hits a usage cap
 - [Structured Reports](../operate/structured-reports/) — operator-facing report family + free-text routing

@@ -161,7 +161,7 @@ just setup-cms
 just setup -- --step dns
 just setup -- --step verify
 just doctor # health check, including DNS, TLS/ACME, and scheduled reports
-just pi-config # view / validate runtime config
+just colibri-config # view / validate runtime config
 ```

 ## Related docs

@@ -358,9 +358,8 @@ move chat to a direct provider:

 - **Per chat:** `/model` in Telegram lets you swap provider/model
 for a single chat.
-- **System-wide:** edit `.env` and set `PI_TUI_PROVIDER` /
-`PI_TUI_MODEL` to your preferred provider (zAI, Anthropic,
-OpenAI, Gemini). The OpenRouter key can stay configured as a
+- **System-wide:** edit `.env` and set `DEEPSEEK_API_KEY` to your
+preferred provider's key. The OpenRouter key can stay configured as a
 fallback — see
 [Provider Fallback](../operate/provider-fallback/).

@@ -14,13 +14,13 @@ installation record.
 Record wall-clock timestamps at each stage. On bhyve, the serial console
 shows boot messages with timestamps.

-| Milestone | Command / event | Record |
-| ---------------------- | ------------------------------------- | -------------- |
-| Boot start | First kernel message | `T0` |
-| first-boot setup consumed | `[firstboot] seed loaded` in log | `T1 = T1 - T0` |
-| Firstboot complete | `[firstboot] Complete.` in log | `T2 = T2 - T0` |
-| Desktop ready (Lumina) | `lightdm` login screen visible | `T3 = T3 - T0` |
-| Agent responding | `/ping` on Telegram returns pong | `T4 = T4 - T0` |
+| Milestone | Command / event | Record |
+| ------------------------- | -------------------------------- | -------------- |
+| Boot start | First kernel message | `T0` |
+| first-boot setup consumed | `[firstboot] seed loaded` in log | `T1 = T1 - T0` |
+| Firstboot complete | `[firstboot] Complete.` in log | `T2 = T2 - T0` |
+| Desktop ready (Lumina) | `lightdm` login screen visible | `T3 = T3 - T0` |
+| Agent responding | `/ping` on Telegram returns pong | `T4 = T4 - T0` |

 If the first-boot setup (`setup.txt`) was absent or invalid, the install falls back to the
 interactive TUI wizard at the equivalent of `T1` — record the same
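The milestone table in the hunk above records each stage as an offset from `T0`. A minimal sketch of that arithmetic, assuming the timestamps have already been read off the console and log by hand; the `stamps` values below are hypothetical readings, not measurements from a real install:

```python
from datetime import datetime


def elapsed_since_boot(stamps):
    """Return seconds elapsed since T0 for every later milestone."""
    t0 = stamps["T0"]
    return {
        name: (when - t0).total_seconds()
        for name, when in stamps.items()
        if name != "T0"
    }


# Hypothetical wall-clock readings for one install run.
stamps = {
    "T0": datetime(2026, 4, 25, 10, 0, 0),   # first kernel message
    "T1": datetime(2026, 4, 25, 10, 0, 42),  # [firstboot] seed loaded
    "T2": datetime(2026, 4, 25, 10, 6, 30),  # [firstboot] Complete.
    "T3": datetime(2026, 4, 25, 10, 7, 15),  # login screen visible
    "T4": datetime(2026, 4, 25, 10, 8, 0),   # /ping answered
}

for name, seconds in elapsed_since_boot(stamps).items():
    print(f"{name}: {seconds:.0f}s after boot")
```

Keeping the raw timestamps and deriving the offsets afterwards avoids subtracting by hand while watching the console.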
@@ -41,16 +41,16 @@ sudo bastille list

 Expected jails depend on configuration:

-| Jail | When present |
-| -------------------------- | ------------------------------------------ |
-| Host `clawdie` service | Control plane/runtime; not a jail |
-| `git` (shared, `<subnet>.2`) | Shared across agents — one per host |
-| `cms` (shared, `<subnet>.3`) | Shared across agents — one per host |
-| `ollama` / `llama-cpp` (`<subnet>.4`) | When Local AI Models are enabled |
-| `db` (`<subnet>.5`) | Only when `DB_RUNTIME=jail` |
-| `db-worker` (`<subnet>.211`) | Specialist worker enabled |
-| `git-worker` (`<subnet>.212`) | Specialist worker enabled |
-| `ctrl-worker` (`<subnet>.213`) | Specialist worker enabled |
+| Jail | When present |
+| ------------------------------------- | ----------------------------------- |
+| Host `clawdie` service | Control plane/runtime; not a jail |
+| `git` (shared, `<subnet>.2`) | Shared across agents — one per host |
+| `cms` (shared, `<subnet>.3`) | Shared across agents — one per host |
+| `ollama` / `llama-cpp` (`<subnet>.4`) | When Local AI Models are enabled |
+| `db` (`<subnet>.5`) | Only when `DB_RUNTIME=jail` |
+| `db-worker` (`<subnet>.211`) | Specialist worker enabled |
+| `git-worker` (`<subnet>.212`) | Specialist worker enabled |
+| `ctrl-worker` (`<subnet>.213`) | Specialist worker enabled |

 `<subnet>` means the configured jail subnet base, for example `10.0.1` in the
 repo examples or `192.168.72` on a live host. With `DB_RUNTIME=host` there is
@@ -84,22 +84,22 @@ ls -la /home/atlas/clawdie-ai/.env
 ```

 ```sh
-grep -E '^(AGENT_NAME|AGENT_GENDER|AGENT_DOMAIN|AGENT_INTERNAL_DOMAIN|AGENT_TMP_DIR|PI_TUI_PROVIDER|PI_TUI_MODEL|EMBED_BASE_URL|TELEGRAM_BOT_TOKEN)=' .env
+grep -E '^(AGENT_NAME|AGENT_GENDER|AGENT_DOMAIN|AGENT_INTERNAL_DOMAIN|AGENT_TMP_DIR|DEEPSEEK_API_KEY|COLIBRI_AUTOSPAWN_BINARY|EMBED_BASE_URL|TELEGRAM_BOT_TOKEN)=' .env
 ```

 Verify:

-| Key | Expected |
-| ----------------------- | --------------------------------------------------------------- |
-| `AGENT_NAME` | Lowercase, no spaces (e.g. `clawdie`, `atlas`) |
-| `AGENT_GENDER` | `f`, `m`, or `n` |
-| `AGENT_DOMAIN` | Public domain (e.g. `clawdie.si`) or `{agent}.internal` for VMs |
-| `AGENT_INTERNAL_DOMAIN` | `{agent}.home.arpa` (Tailscale / local DNS) |
-| `AGENT_TMP_DIR` | Writable path, not `/tmp` |
-| `PI_TUI_PROVIDER` | `zai`, `openrouter`, `anthropic`, etc. |
-| `PI_TUI_MODEL` | Valid model for the provider |
-| `EMBED_BASE_URL` | URL ending in `/v1` |
-| `TELEGRAM_BOT_TOKEN` | Non-empty if `FEATURE_TELEGRAM=true` |
+| Key | Expected |
+| -------------------------- | --------------------------------------------------------------- |
+| `AGENT_NAME` | Lowercase, no spaces (e.g. `clawdie`, `atlas`) |
+| `AGENT_GENDER` | `f`, `m`, or `n` |
+| `AGENT_DOMAIN` | Public domain (e.g. `clawdie.si`) or `{agent}.internal` for VMs |
+| `AGENT_INTERNAL_DOMAIN` | `{agent}.home.arpa` (Tailscale / local DNS) |
+| `AGENT_TMP_DIR` | Writable path, not `/tmp` |
+| `DEEPSEEK_API_KEY` | DeepSeek API key (used by default) |
+| `COLIBRI_AUTOSPAWN_BINARY` | `zot` (default) or `pi` for fallback |
+| `EMBED_BASE_URL` | URL ending in `/v1` |
+| `TELEGRAM_BOT_TOKEN` | Non-empty if `FEATURE_TELEGRAM=true` |

 ## 3. Watchdog IPC status

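The `.env` checks listed in the "Verify" table of the hunk above can also be scripted. This is a rough sketch, not part of the project: `parse_env` and `check_env` are hypothetical helpers, and the choice of which keys count as mandatory is an assumption (the table only states expected values, and `COLIBRI_AUTOSPAWN_BINARY` is described as having a default).

```python
# Keys treated as mandatory here; this selection is an assumption.
MANDATORY = [
    "AGENT_NAME", "AGENT_GENDER", "AGENT_DOMAIN",
    "AGENT_INTERNAL_DOMAIN", "AGENT_TMP_DIR", "EMBED_BASE_URL",
]


def parse_env(text):
    """Parse KEY=value lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def check_env(env):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = [f"{key}: missing" for key in MANDATORY if key not in env]
    name = env.get("AGENT_NAME", "")
    if name != name.lower() or any(ch.isspace() for ch in name):
        problems.append("AGENT_NAME: must be lowercase with no spaces")
    if "AGENT_GENDER" in env and env["AGENT_GENDER"] not in {"f", "m", "n"}:
        problems.append("AGENT_GENDER: expected f, m, or n")
    if env.get("AGENT_TMP_DIR") == "/tmp":
        problems.append("AGENT_TMP_DIR: must not be /tmp")
    if "EMBED_BASE_URL" in env and not env["EMBED_BASE_URL"].endswith("/v1"):
        problems.append("EMBED_BASE_URL: must end in /v1")
    if env.get("COLIBRI_AUTOSPAWN_BINARY", "zot") not in {"zot", "pi"}:
        problems.append("COLIBRI_AUTOSPAWN_BINARY: expected zot or pi")
    if env.get("FEATURE_TELEGRAM") == "true" and not env.get("TELEGRAM_BOT_TOKEN"):
        problems.append("TELEGRAM_BOT_TOKEN: required when FEATURE_TELEGRAM=true")
    return problems
```

A typical use would be `check_env(parse_env(open(".env").read()))`, printing each returned problem. The sketch does not test whether `AGENT_TMP_DIR` is actually writable or whether the API key is valid.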
@@ -142,8 +142,8 @@ attention before you treat publishing as healthy.
 ## 5. LLM provider connectivity

 ```sh
-# Quick inference test via pi
-pi --provider "${PI_TUI_PROVIDER}" --model "${PI_TUI_MODEL}" -e "reply with OK"
+# Quick inference test via zot
+zot --no-session --print "Reply with OK"
 ```

 Expected: Model responds. If using ZAI (GLM), verify the API key:

@@ -61,8 +61,8 @@ just install
 │
 [ 2] environment host pkg baseline, bridge, locale REQUIRED
 │
-[ 3] pi-config validate/write pi provider optional ─── warn on missing provider auth
-│ └── pi missing → warn, continue
+[ 3] colibri-config validate/write Colibri daemon config optional ─── warn on missing provider auth
+│ └── zot missing → warn, continue
 [ 4] pf write PF include (NAT egress) REQUIRED
 │ └── 📸 snapshot: post-pf
 [ 5] jails create worker jail (--create) REQUIRED
@@ -169,56 +169,29 @@ The entire infrastructure (PF, jails, PostgreSQL, nginx, ZFS) has zero
 LLM dependency. The key is only consumed when the jail-runner spawns a
 live response. Install and service start succeed without it.

-### Headless Codex login
+### Agent runtime setup

-Use this when `pi` runs on a remote or headless host and you want to
-authenticate with a ChatGPT-backed Codex subscription.
-
-1. Start Pi on the host:
-
-```bash
-pi
-```
-
-2. Run `/login` inside Pi and select `ChatGPT Plus/Pro (Codex)`.
-3. Pi prints a long OpenAI auth URL. If your terminal wraps it across
-multiple lines, copy it into one continuous line before opening it in
-your local browser.
-4. Complete login in the browser. It will redirect to something like
-`http://localhost:1455/auth/callback?...`.
-5. On a headless host, that browser callback usually fails. That is
-expected. Copy the full redirect URL from the browser address bar
-anyway.
-6. Paste that full redirect URL back into Pi at:
-
-```text
-Paste redirect URL below, or complete login in browser:
-```
-
-7. Pi stores the subscription auth in `~/.pi/agent/auth.json`.
+The agent runtime (zot) is configured via `DEEPSEEK_API_KEY` in `.env`.
+Colibri spawns zot as a subprocess in RPC mode (`--rpc`). No separate login
+or auth store is needed — zot reads credentials from its own auth store at
+`$ZOT_HOME/auth.json`, and the Colibri daemon provides the API key via
+the RPC channel.

 Quick verification:

 ```bash
-pi --provider openai-codex --model gpt-5.5 --no-session --print "Reply with exactly: codex-ok"
+zot --no-session --print "Reply with exactly: zot-ok"
 ```

 Expected output:

 ```text
-codex-ok
-```
-
-If you prefer automatic callback handling instead of copy/paste, create
-an SSH tunnel before running `/login`:
-
-```bash
-ssh -L 1455:127.0.0.1:1455 user@server
+zot-ok
 ```

 ### Control plane API auth

-Agent subprocesses (pi, aider) authenticate back to the control plane
+Agent subprocesses (zot, aider) authenticate back to the control plane
 API using a shared secret. Generate one after install:

 ```bash
@@ -255,7 +228,7 @@ but can be disabled), or **optional** (skipped unless explicitly enabled).
 | ---------------- | -------- | ------------------------------------------------------------------ |
 | onboarding | required | — |
 | environment | required | — |
-| pi-config | optional | warn on missing provider auth or missing pi |
+| colibri-config | optional | warn on missing provider auth or missing colibri-daemon |
 | pf | required | — |
 | jails | required | — |
 | git | default | `DB_RUNTIME=host` or `install-from hosts` |

@@ -52,8 +52,7 @@ agents. The design commitments that shape what you need:
 provision a dedicated db jail instead.
 - **LLM provider** of your choice. OpenRouter is the recommended
 bootstrap path; switch post-install to direct provider keys (zAI,
-Anthropic, OpenAI, Gemini), subscription auth such as OpenAI Codex
-via `/login`, or local Ollama by editing `.env`.
+Anthropic, OpenAI, Gemini), or local Ollama by editing `.env`.
 See [Provider Fallback](../operate/provider-fallback/) for the
 cap-detection and fallback behavior.

@@ -25,11 +25,11 @@ the primary.

 Set in `.env`:

-| Variable | Required | Example |
-| --------------------------------------- | ---------------------------------------- | -------------------------------------------- |
-| `LLM_FALLBACK_PROVIDER` | yes (when fallback is desired) | `openrouter` |
-| `LLM_FALLBACK_MODEL` | recommended | `openai/o3` (paid, stable) |
-| `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` | optional (default `3600`) | `1800` |
+| Variable | Required | Example |
+| --------------------------------------- | ------------------------------ | -------------------------- |
+| `LLM_FALLBACK_PROVIDER` | yes (when fallback is desired) | `openrouter` |
+| `LLM_FALLBACK_MODEL` | recommended | `openai/o3` (paid, stable) |
+| `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS` | optional (default `3600`) | `1800` |

 The default cooldown is used **only** when the cap message has no parseable
 reset stamp. Real zAI cap errors include the reset timestamp and the cooldown
@@ -78,11 +78,11 @@ one more cost surface to monitor.
 ## How Cooldowns Work

 1. A run fails with `429 Usage limit reached for 5 hour. Your limit will reset
-at YYYY-MM-DD HH:MM:SS`.
+at YYYY-MM-DD HH:MM:SS`.
 2. The runner parses the reset timestamp (treated as local time) and stores
 `{ provider: 'zai', until: <reset>, reason: <message> }` in memory and on
 disk.
-3. Every subsequent run consults the cooldown map *before* spawning pi. If the
+3. Every subsequent run consults the cooldown map _before_ spawning zot. If the
 configured provider is in cooldown, the spawn args swap to the fallback
 provider/model.
 4. The cooldown auto-expires at the reset timestamp. Next run uses the primary
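The cooldown steps in the hunk above can be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's actual code: the function names are hypothetical, the regular expression is inferred only from the example message in step 1, and the on-disk persistence mentioned in step 2 is left out. The 3600-second default mirrors the documented default of `LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS`.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the example cap message; real messages may differ.
RESET_RE = re.compile(r"reset\s+at\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_reset(message):
    """Return the reset stamp as a naive local datetime, or None if absent."""
    match = RESET_RE.search(message)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def mark_cooldown(cooldowns, provider, message, now, default_seconds=3600):
    """Record a cooldown; fall back to the default window if no stamp parses."""
    until = parse_reset(message) or now + timedelta(seconds=default_seconds)
    cooldowns[provider] = {"provider": provider, "until": until, "reason": message}
    return cooldowns[provider]


def pick_provider(cooldowns, configured, fallback, now):
    """Choose the provider for the next spawn, consulting the cooldown map first."""
    entry = cooldowns.get(configured)
    if entry is None:
        return configured
    if now >= entry["until"]:
        del cooldowns[configured]  # cooldown auto-expires at the reset stamp
        return configured
    return fallback
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the functions, keeps the logic easy to test against fixed timestamps.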
@@ -114,7 +114,7 @@ Logs include structured warnings on every fallback-active run:
 { originalProvider: 'zai', fallbackProvider: 'openrouter', cooldownUntil: '...' } Provider fallback active — preferred provider is in cooldown
 ```

-And on the run that *trips* the cooldown:
+And on the run that _trips_ the cooldown:

 ```
 { provider: 'zai', until: '2026-04-25T19:00:59', reason: '429 Usage limit reached; resets ...' } Provider cap detected — marking cooldown
@@ -138,15 +138,15 @@ cleared state survives restart.

 Every agent activity row now records three provider/model values:

-| Field | Meaning |
-| -------------------- | -------------------------------------------------------- |
-| `configured_*` | What `.env` says (`PI_TUI_PROVIDER` / `PI_TUI_MODEL`) |
-| `effective_*` | What was actually passed to pi (after fallback swap) |
-| `actual_*` | What pi reports having used (parsed from session JSONL) |
+| Field | Meaning |
+| -------------- | ------------------------------------------------------------------ |
+| `configured_*` | What `.env` says (`DEEPSEEK_API_KEY` / `COLIBRI_AUTOSPAWN_BINARY`) |
+| `effective_*` | What was actually passed to zot (after fallback swap) |
+| `actual_*` | What zot reports having used (parsed from session JSONL) |

 When fallback is active, `configured_*` and `effective_*` differ.
 `actual_*` should match `effective_*` for a successful run; a divergence
-suggests pi rewrote the model selection internally.
+suggests zot rewrote the model selection internally.

 ## Behavior That Stays The Same

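The relationship between the three provenance values in the hunk above can be checked mechanically. A minimal sketch, assuming each `*` expands to `_provider` and `_model` suffixes; those exact field names and the `classify_run` helper are assumptions, since the docs only show the wildcard form:

```python
def classify_run(payload):
    """Compare the configured, effective, and actual provider/model pairs."""
    def pair(prefix):
        return (payload[f"{prefix}_provider"], payload[f"{prefix}_model"])

    notes = []
    if pair("configured") != pair("effective"):
        notes.append("fallback-active")  # spawn args were swapped to the fallback
    if pair("actual") != pair("effective"):
        notes.append("model-diverged")   # runtime used something other than requested
    return notes


# Hypothetical activity payload from a run made during a cooldown.
payload = {
    "configured_provider": "zai", "configured_model": "glm",
    "effective_provider": "openrouter", "effective_model": "openai/o3",
    "actual_provider": "openrouter", "actual_model": "openai/o3",
}
print(classify_run(payload))
```

An empty result means the run used exactly what `.env` configured; `model-diverged` on its own is the case the docs flag as worth investigating.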