layered-soul/docs/HOST-MATRIX.md
Sam & Claude d73cd403c3 docs: convert negative patterns to positive actionable instructions
Applied positive-language documentation rewrites across key docs and skills:
- AGENTS.md: converted must-not/never/cannot to positive guidance
- docs/HOST-MATRIX.md: converted never/do-not patterns; preserved probe discipline
- docs/HIVE-ONBOARDING.md: converted cannot/never/avoid to actionable instructions
- skills/systematic-debugging/SKILL.md: converted non-safety negatives; preserved core debugging rules (NO FIXES WITHOUT ROOT CAUSE)
- skills/bootable-usb-images/SKILL.md: converted non-safety negatives; preserved safety-critical rules (never a partition, never silently skip target identification)

Changed negative patterns: never→stay/reference/always, do not→use/prefer/send only, cannot→lacks/leaves intact/requires
2026-06-21 13:57:11 +02:00

# Host & Agent Matrix (shared, fill-as-you-go)
A living inventory of **who runs where** and **what each host actually is**. Any agent
on any host fills in its own row. Source of truth for facts is the probe — not memory.
> **How to fill your row**
>
> ```sh
> cd ~/layered-soul
> python3 scripts/verify_facts_probe.py --os --hardware --storage --network --text
> ```
>
> Copy the verified values into the tables below, set `Probed` to today's UTC date,
> and commit. **Never guess hardware, OS, or IPs** — paste what the probe reports.
> On FreeBSD the probe synthesizes an OS-specific command map; trust its output over
> Linux habits.
>
> **Disk before action:** **always measure** real free space (`df -h /`, or the probe's
> `--storage`) before installing a toolchain or starting a build. Keep the
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
>
> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
>
> **Keep real IPs and bot handles in `fleet.env` (gitignored).** Use `${HOST_TS_IP}` and `${*_BOT}`
> placeholders in committed docs; the real values live in `fleet.env` and can be checked live with
> `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
> prints real IPs; record them in `fleet.env`, not in this table.
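
The placeholder convention above can be sketched as a small resolver. This is a minimal illustration, assuming `fleet.env` holds plain `KEY=value` lines; the helper names `load_env` and `resolve` are hypothetical and not part of the repo.

```python
import re
from pathlib import Path

# Matches ${NAME} placeholders such as ${OSA_TS_IP} or ${ZOT_BOT}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def load_env(path):
    """Parse simple KEY=value lines (assumed format); skip blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(text, values):
    """Replace ${NAME} with its value; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)
```

A local preview could call `resolve(Path("docs/HOST-MATRIX.md").read_text(), load_env("fleet.env"))` and print the result. The resolved text contains real IPs, so it belongs on screen only, not in a commit.
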
---
## 1. Agent placement (who runs where)
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
| ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
| **Mevy** | osa | FreeBSD 15 / host | Hermes Agent (upstream, CLI) | **Consolidated into hermes-osa** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — under hermes-osa** |
| **hermes-osa** | osa | FreeBSD 15 / host | Hermes Agent (FreeBSD fork) | **Orchestrator + board host (always-on VPS): chat + gateway** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — chat + Telegram** |
| Codex | osa | FreeBSD 15 / jail | Codex CLI | ISO builds, validation | — (CLI) | LIVE |
| **domedog-agent** | domedog | Ubuntu 24.04 / host | Colibri board agent | Headless Linux media/compute lane (image-render, ffmpeg, rust/go/py/node) | — | **LIVE — on central board 2026-06-19** |
> **Mevy vs hermes-osa distinction**: Mevy (${HERMES_OSA_BOT} / OSA-bot) has been consolidated into hermes-osa as of 2026-06-17. The Telegram bot token was migrated from the old backup .env. hermes-osa now runs both the local CLI chat and the Telegram gateway (polling mode, tmux session `hermes-gateway`).
>
> **Status key**: `LIVE` = running and validated right now. `INSTALLED` = binary present, not yet validated in role. `PLANNED` = not yet set up. No guessing.
> Notes:
>
> - Provider per agent (DeepSeek / OpenRouter / Z.AI / local) — fill in the per-host table.
> - One Telegram token per running service. **Assign each service its own unique token.**
> - **Orchestrator lives on the always-on host.** **osa is the always-on VPS**: the designated
> hub, hosting the colibri board + orchestrator (hermes-osa). **debby is an intermittent
> laptop** (powers off periodically) that serves as secondary agent + soul backup. The board
> **always stays on osa**; tasks routed to debby queue up and execute when it returns.
> - **Routing**: Colibri has a capability matcher for per-host agent pools, and **cross-host
> routing is LIVE** (2026-06-19): a `socat` bridge exposes osa's colibri-daemon on its
> Tailscale IP (`${OSA_TS_IP}:9190`, tailnet-only); agents on debby/domedog reach the osa
> board over the tailnet, and a poller (2 min) / worker (5 min) loop executes assigned tasks.
> Validated on the debby↔osa lane; colibri PR #83. See [`CAPABILITY-ROUTING.md`](./CAPABILITY-ROUTING.md).
> - **Probe vs identity**: `verify_facts_probe.py` is a required discipline/tool,
> not an automatic startup hook — agents run it when grounding host facts, and HOST-MATRIX
> records the result. OS/hardware facts come from probes and the matrix, not from SOUL.md
> (which carries identity and values).
---
## 2. Host hardware & facts (one row per host)
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
| ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
### Disk discipline (**measure, then act**)
Disk is a first-class fact, same as OS or CPU — **measure with `df -h` and `du` before acting.**
- **Before installing a toolchain or starting a build**, run `df -h /` (Linux) or
`zfs list` / `df -h` (FreeBSD), or the probe's `--storage`. Confirm the headroom is
really there.
- **Keep the `Disk (free)` column above current** when you add or remove anything large.
- **Flag any host past ~85% used.** Reference footprints to budget with: Go SDK ≈ 290 MB,
Rust toolchain (`~/.rustup` + `~/.cargo`) ≈ 1.8 GB, a Node version ≈ 150 MB; build/module
caches grow on top of these.
- **Standing watch:** `debby` runs ~95% full (23 GB free). Treat new installs/builds there
as a deliberate decision, not a default — prefer the host with real headroom.
This is the survivability principle applied to storage: a host that silently fills up is a
host that fails. What you guess will be wrong; what you probe will be right.
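
The measure-then-act rule can be sketched as a pre-install check. This is a minimal illustration, not part of the probe: the function names are hypothetical, the footprints are the reference figures listed above, and the percentage is computed the way `df` reports `Use%` (used over used plus available), which is an assumption about how the ~85% flag is meant to be read.

```python
import shutil

# Reference footprints from the list above, in MB (Rust: ~1.8 GB).
FOOTPRINTS_MB = {"go-sdk": 290, "rust-toolchain": 1800, "node": 150}
THRESHOLD_PCT = 85.0  # flag hosts past ~85% used


def used_pct(used, free):
    """Percent used, computed like df's Use% column: used / (used + available)."""
    return 100.0 * used / (used + free)


def projected_pct(used, free, extra):
    """Percent used after adding `extra` bytes to the same filesystem."""
    return 100.0 * (used + extra) / (used + free)


def check_install(path, tool):
    """Return (current %, projected %, fits under threshold) for one install."""
    usage = shutil.disk_usage(path)
    extra = FOOTPRINTS_MB[tool] * 1000**2
    now = used_pct(usage.used, usage.free)
    after = projected_pct(usage.used, usage.free, extra)
    return now, after, after <= THRESHOLD_PCT
```

Calling `check_install("/", "rust-toolchain")` before an install returns both percentages and a boolean. Build and module caches grow beyond the listed footprints, so the projection is a lower bound. On ZFS the figures reflect the dataset mounted at `path`, so `zfs list` remains the better source there.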
### Cost provenance (invoice/control-panel facts, not guesses)
Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
an IP range, mark it `TBD` until the control panel or invoice confirms it.
| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
| ------------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | --------------- | ------------------------------------- | ------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; **verify provider via invoice/control panel**, not by IP range alone. |
| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
| **ML350p Gen8** (candidate/retire)    | self-hosted hardware                                               | owned hardware                            | local           | ~€53-63/mo @ 460 W high-load estimate | utility bill  | multitenant/build candidate; fallback if TCO beats cloud            | GEN-I + URO tariff research; fan/PSU label, not wall-metered                  | Use as planning band only; measure wall draw before committing tenants.                                                                          |
| **vultr-svc** (Forgejo + Vaultwarden) | Vultr | TBD | TBD (verify EU) | TBD | TBD | git mirror (layered-soul + hermes-soul) + Vaultwarden secrets store | DNS code/vault.smilepowered.org → Vultr (verified 2026-06-20); invoice needed | Off-OVH backup target (good) BUT Forgejo + Vault share one box → SPOF for backups AND secrets; needs own off-box backup + EU-region verify + MFA |
Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
calculate monthly power cost with `watts / 1000 * 24 * 30 * €/kWh`, and **use measured
idle/load wattage and the actual utility €/kWh rate** in every power-cost comparison.
**ML350p Gen8 planning note:** for the multitenant/high-load case, use the visible
fan/PSU-side **460 W** mark as the conservative continuous-load assumption until a wall
meter proves otherwise.
- Monthly energy: `0.460 kW * 24 h * 30.4375 d = ~336 kWh/month`.
- GEN-I regular household ET energy price: `0.13286 EUR/kWh` with VAT → **~€44.6/mo**
energy-only.
- Add URO network-energy ET estimate (`0.01864 EUR/kWh` before VAT, ~`0.02274 EUR/kWh`
with VAT) → **~€52.3/mo** variable electricity + network-energy estimate.
- Practical planning band with smaller per-kWh state charges: **~€53/mo** if 460 W is wall
draw; **~€59-63/mo** if 460 W is output-side load at ~90-85% PSU efficiency.
- Annualized planning band: **~€640-760/year**.
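
The arithmetic in the planning note can be reproduced with a short sketch. The function names are hypothetical, and the rates passed in are the GEN-I and URO figures quoted above rather than values read from an invoice.

```python
# Average month length used in the planning note: 30.4375 days.
HOURS_PER_MONTH = 24 * 30.4375


def monthly_kwh(watts):
    """Energy drawn in an average month at a constant load, in kWh."""
    return watts / 1000 * HOURS_PER_MONTH


def monthly_cost(watts, eur_per_kwh):
    """Monthly electricity cost in EUR at a flat per-kWh rate."""
    return monthly_kwh(watts) * eur_per_kwh


def wall_watts(output_watts, psu_efficiency):
    """Wall draw implied by an output-side load and a PSU efficiency (0-1)."""
    return output_watts / psu_efficiency


energy_only = monthly_cost(460, 0.13286)                 # GEN-I energy price with VAT
with_network = monthly_cost(460, 0.13286 + 0.02274)      # plus URO network-energy estimate
worst_case = monthly_cost(wall_watts(460, 0.85), 0.13286 + 0.02274)
```

The sketch uses the 30.4375-day month from the planning note, so results run about 1.5% higher than the simpler 30-day formula. Swap in wall-metered wattage and the actual billed rate once they are available.
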
---
## 3. Per-host detail (expand as needed)
### domedog (Claude / verification) — probed 2026-06-17 by Claude
- **Identity**: hostname `domedog.pro`, Tailscale `${DOMEDOG_TS_IP}`
- **OS**: Ubuntu 24.04.4 LTS, kernel `6.8.0-117-generic`, x86_64, KVM guest
- **CPU**: AMD EPYC 7543P 32-Core (2 vCPU exposed to guest)
- **Memory**: 7.8 GiB RAM, 2.0 GiB swap
- **Storage**: `/dev/sda1` 96 GB ext4 root, 51 GB free (QEMU HARDDISK). No ZFS.
- **GPU**: none (headless VM)
- **Uptime at probe**: ~3.5 weeks
- **Role here**: Claude Code — verification & review lane. No Telegram bot.
- **Colibri agent (joined central board 2026-06-19)** — the headless Linux media/compute lane:
  - **Capabilities advertised**: `linux`, `python3.12`, `rust`, `go`, `node`, `ffmpeg`,
    `image-render`. **Not** `screenshot`/`gui` (headless VM), not `docker` (absent).
    `image-render`/`ffmpeg` are domedog-only in the fleet — osa dropped Pillow.
  - **Reach**: client shim `colibri-shim.service` (system unit, `User=clawdija`,
    `Restart=always`, reboot-persistent) runs
    `socat UNIX-LISTEN:~/.colibri/colibri.sock → TCP ${OSA_TS_IP}:9190` (osa bridge over
    Tailscale). A system unit, not `--user`: `systemctl --user` has no bus on this host.
  - **Operate**: `~/.colibri/agent.env` holds `COLIBRI_AGENT_ID` + `COLIBRI_SOCKET`; helpers
    in `~/.colibri/`: `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
  - **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
    it → worker marked it `done` (2026-06-19).
  - **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
    no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
    lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
    before relying on autonomous domedog task completion.
### debby (Hermes secondary + Zot — intermittent laptop) — probed 2026-06-17 by Hermes
- **Identity**: hostname `debby`, Tailscale `${DEBBY_TS_IP}`
- **OS**: Debian 13 (Trixie), kernel `6.12.90+deb13.1-amd64`, bare metal (KDE Plasma desktop)
- **CPU**: AMD Ryzen 7 5700U with Radeon Graphics, 8 physical cores, 16 threads
- **Memory**: 15 GiB RAM, 15 GiB swap
- **Storage**: `/dev/nvme0n1p2` 453 GB ext4 root, 23 GB free (95% full). No ZFS.
- **GPU**: AMD Radeon Graphics (integrated, Lucienne)
- **Containers**: Docker 29.5.3 installed (daemon not currently running)
- **Hermes Agent**: v0.16.0 (upstream f9c8d95e), DeepSeek v4 Pro primary provider, OpenRouter for vision/fallback, Z.AI/GLM available
- **Zot RPC**: Go binary at `~/.local/bin/zot`, GLM-5.1 model
- **Telegram**: ${HERMES_BOT} + ${ZOT_BOT} in "My Debby" group
- **Layered soul**: commit `817624c`, 6 curated memories, 9 cross-harness skills
### osa (FreeBSD: hermes-osa orchestrator + board host, always-on VPS; + Mevy + Codex) — probed 2026-06-17 by hermes-osa
- **Identity**: hostname `osa.smilepowered.org`, Tailscale `${OSA_TS_IP}`
- **OS**: FreeBSD `15.0-RELEASE-p10`, kernel `FreeBSD osa.smilepowered.org 15.0-RELEASE-p10 FreeBSD 15.0-RELEASE-p10 releng/15.0-n281064-98258a339269 GENERIC amd64`
- **CPU**: Intel Core Processor (Haswell, no TSX), 6 vCPU
- **Memory**: 11 GiB RAM
- **Storage**: ZFS pool `zroot`, 98.5G ONLINE, 23.4G available
- **Jails**: `cms` and `worker` (Bastille jails); Docker not installed
- **Agents on host**:
  - **hermes-osa** — Hermes Agent v0.16.0 (`hermes-bsd` clean-room MIT fork), FreeBSD local CLI runtime + Telegram gateway. **Status: LIVE — validated local chat + Telegram.** Default provider: DeepSeek direct (`provider: deepseek`, `default: deepseek-chat`). OpenRouter available as fallback/manual lane. Telegram/gateway: LIVE — ${HERMES_OSA_BOT} (Mevy/OSA-bot), polling mode, tmux session `hermes-gateway` on osa. Daemon/rc.d: deferred (Track A).
  - **Mevy** — ${HERMES_OSA_BOT} (OSA-bot) — now consolidated under hermes-osa gateway. Token migrated from old backup .env.
  - **Codex** — `codex-cli 0.117.0`, ISO builds and validation. Runs in a Bastille jail.
  - **Claude Code** — installed (path: `/home/clawdie/.npm-global/bin/claude`), no dedicated role yet.
- **Provider stack** (hermes-osa):
```yaml
provider: deepseek # primary — direct credits, proven DEEPSEEK_OK
default: deepseek-chat
fallback: openrouter # available manually; auto-fallback not configured yet
```
- **Z.AI**: deferred (not configured for hermes-osa; available via OpenRouter if needed)
- **Telegram**: LIVE — ${HERMES_OSA_BOT}, polling mode, connected 2026-06-17
- **Gateway**: LIVE — running in tmux session `hermes-gateway`, manual start (no rc.d yet)
- **Launch command**:
```sh
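tmux new -s hermes-osa                    # start a tmux session for the chat runtime
# run the remaining commands inside that tmux session
cd /home/clawdie/ai/hermes-bsd
export HERMES_HOME=/home/clawdie/.hermes
. venv/bin/activate                       # or: . .venv/bin/activate (`.` is the POSIX form and works in plain sh)
hermes chat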
```
- **Layered soul**: commit `c9c88fd`, 10 skills, 7 curated memories
- **Future tracks (separate, none blocking)**:
  - Track A: daemon/rc.d promotion (hermes_daemon service, dedicated user)
  - Track B: ~~Telegram/gateway integration~~ DONE (2026-06-17) — gateway daemonization (rc.d) still deferred
  - Track C: ~~Colibri cross-host routing~~ **DONE (2026-06-19)** — `socat` bridge on osa `:9190` (Tailscale-only) + poller/worker loop; colibri PR #83 merged. See CAPABILITY-ROUTING.md
  - Track D: old clawdie_glass cleanup
_See [`../AGENTS.md`](../AGENTS.md) for the canonical agent matrix and operating rules._
## 4. Compliance standing constraints
- **EU region only**: All OVHcloud resources in FR/DE/PL. Sidesteps non-EU transfer/SCC burden under GDPR.
- **Off-box backup before any reinstall**: OVH DPA §10 + GTS §6.3/6.5/10.6 — reinstall/termination = irreversible deletion including OVH-side backups, no recovery, OVH not liable. Identity/skills covered by git (layered-soul + hermes-soul on Forgejo). Runtime state (ZFS snapshots, Vaultwarden DB) must be verified backed up outside OVH.
- **Backup independence (verified 2026-06-20):** Forgejo **and** Vaultwarden both run on **Vultr** (the `code` / `vault.smilepowered.org` host) — a _different provider_ than osa/OVH, so an OVH loss does not take the git backup (good). **But Forgejo and Vaultwarden share that one Vultr box**, making it a single point of failure for _both_ the backups _and_ all secrets. → that box needs its _own_ off-box backup (Vaultwarden DB export + Forgejo data to a third location), and **backups are unverified until test-restored** (cost-discipline applies to backups: check, don't assume). Add the Vultr host to the provenance table; apply EU-region (verify) + MFA to it too.
- **MFA on every master-key account**: GTS §2.3/2.4 — operator is liable for fraudulent account use. Enable MFA on **OVH, Vultr, the domain registrar (clawdie.si / smilepowered.org), Forgejo admin, and Vaultwarden** — each is a master key to the fleet. **Auto-renew the domains**: a lapsed domain silently kills `pkg.clawdie.si`, ACME certs, and SSH-by-hostname.
- **Billing hygiene**: provider **auto-renew is on by default** (OVH/Vultr) — disable before the 19th of the month if not renewing. **Commitment Periods lock you in** (full term due, no refund for early cancel/non-use). Act on **price-increase / end-of-life** notices within the 30-day cancel window. Track renewal dates per provider in the provenance table.
- **Continuity plan (contractually required)**: OVH GTS §6.3 makes a recovery plan the Client's obligation, and §4/§10 cap provider liability at service credits — no data-loss or downtime damages. The fleet's **multi-host survivability** (Linux/Docker + FreeBSD/jails, relocatable via layered-soul) **is** the recovery plan; pair it with the off-box backups above.
- **Do not commit OVH contracts/credentials**: GTS §13 makes contract terms confidential. Keep only a compliance summary in public repos: no verbatim DPA/GTS text, NIC handles, or login credentials.
### Multi-tenant GDPR gates (administrative, not technical)
These switch on when the hive goes multi-tenant. None block current internal use:
- [ ] GDPR controller docs (privacy notice, legal basis for processing, ROPA)
- [ ] DPIA only if agents make automated decisions about _individuals_ with legal/significant effect (GDPR Art. 35/22) — the internal agent task scheduler (routing work to machines) does **not** trigger this
- [ ] Pass OVH terms down to customers (GTS §10.6 — sub-licensing)
- [ ] Third-party / "AAA" professional indemnity insurance (§10.6)
- [ ] Customer sanctions screening (GTS §14.3 — denied parties / export controls)
- [ ] Data Processing Agreement with each tenant (DPA §12 — controller→processor chain)
See [`HIVE-ONBOARDING.md §9`](./HIVE-ONBOARDING.md) for the integration checklist.