docs(host-matrix): add infrastructure cost provenance (Sam & Pi)

Track hosting spend as a verified fleet fact alongside disk and hardware, seed TBD rows for osa, domedog, debby, the proposed OVH build capacity, and the ML350p, and update HIVE status now that first-proof blockers are code-complete.

Validation: npx --yes prettier@3 --check docs/HOST-MATRIX.md docs/HIVE-ONBOARDING.md; python3 scripts/layered_soul.py validate .
Sam & Claude 2026-06-20 09:48:12 +02:00
parent 4192574f74
commit 058e4ce926
2 changed files with 71 additions and 44 deletions


@ -2,7 +2,7 @@
**LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
proven (Bastille jails on osa, capability routing, `register-agent`, and the
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The *platform*
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The _platform_
described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
`[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
@ -13,33 +13,36 @@ invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
The four MVP steps (§8) are **code-complete on colibri `main`**:
| MVP step | Status | Landed via |
| -------- | ------ | ---------- |
| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → PR #100 (server-match + serialize) |
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
| 4. `mother` skill | done (draft) | layered-soul |
| MVP step | Status | Landed via |
| --------------------------- | ------------------------- | ------------------------------------------- |
| 1. `colibri-vault` crate    | done; hardening in flight | #85 → #94 → #100 (server-match + serialize) |
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
| 4. `mother` skill | done (draft) | layered-soul |
Supporting pieces merged: `agent-jail-bootstrap.sh` (#96 → #97 version-pin → #104
cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
server-match (#67/#68/#69).
server-match (#67/#68/#69), and the first-proof runbook (#103).
**First proof is *not* code-blocked** — the chain works today via the interim manual
**First proof is _not_ code-blocked** — the chain works today via the interim manual
path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
(colibri). Critical path: merge PR #100 + #103 → run the runbook (scratch jail + test
collection, manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at
`0600` + tenant `active`.
(colibri). Critical path now: operator runs the runbook (scratch jail + test collection,
manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at `0600` + tenant
`active`.
Open work, categorized:
- **Hardening:** colibri PR #100 (closes #95), #92 (path canonicalization/containment).
- **Hardening:** #92 (path canonicalization/containment).
- **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
- **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
section in `pkg-list-jails.txt`).
- **Cost/source-of-truth:** fill `docs/HOST-MATRIX.md` cost provenance rows before buying
or retiring build capacity; compare OVH quotes/invoices against measured self-host power.
**One-line plan:** merge #100 + #103 → run the runbook for the first proof → then land
#101/#102 for CLI driveability, and #92 before promoting past scratch.
**One-line plan:** run the first-proof runbook → then land #101/#102 for CLI driveability,
#92 before promoting past scratch, and fill verified OVH/self-host cost data before buying
or depending on a new mother/build host.
---
@ -50,7 +53,7 @@ primitive**. Promote it from the `clawdie-vault-fetch` shell helper to a first-c
crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:
- **in:** a tenant id (→ a bucket) + a target jail/home
- **out:** a `0600` `.env` materialized *inside the jail*, owned by the jail user
- **out:** a `0600` `.env` materialized _inside the jail_, owned by the jail user
- wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
idempotent, no-op when there is no bucket
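
As a rough illustration of the output contract listed above (a `0600` `.env`, idempotent, no-op without a bucket, fail-closed), here is a minimal Python sketch. It is not the `colibri-vault` crate, which is planned as a Rust crate wrapping `bw`. The function name `materialize_env`, its arguments, and the `KEY=value` layout are assumptions made for this example. Fetching from the vault and chowning to the jail user are deliberately left out.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def materialize_env(secrets: Optional[Mapping[str, str]], target: Path) -> bool:
    """Write `secrets` to `target` as a 0600 KEY=value file.

    Returns True when the file was (re)written, False when nothing changed.
    Hypothetical helper for illustration; not part of colibri.
    """
    if secrets is None:
        # No bucket for this tenant: no-op, leave the jail untouched.
        return False

    for key, value in secrets.items():
        # Fail closed: refuse anything that would corrupt the file layout.
        if not key or "=" in key or "\n" in key or "\n" in value:
            raise ValueError(f"unsafe entry: {key!r}")

    body = "".join(f"{k}={v}\n" for k, v in sorted(secrets.items()))

    # Idempotent: skip the write when content and mode already match.
    if target.is_file():
        same_mode = (target.stat().st_mode & 0o777) == 0o600
        if same_mode and target.read_text() == body:
            return False

    # Write to a temp file in the same directory, then rename atomically,
    # so a reader never sees a partial or world-readable file.
    fd, tmp = tempfile.mkstemp(dir=target.parent, prefix=".env.")
    try:
        os.fchmod(fd, 0o600)  # mkstemp already uses 0600; set it explicitly
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        os.replace(tmp, target)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

Calling it twice with the same mapping writes once and then returns `False`, which is the idempotence the bullet asks for. Passing `None` models the "no bucket" case.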
@ -76,8 +79,8 @@ indirection than that.
On "folder vs bucket":
- **Folders** are personal-vault organization → fine for *Clawdie's own internal* agents.
- **Organization + Collections** give *access-scoped isolation* → the multi-tenant
- **Folders** are personal-vault organization → fine for _Clawdie's own internal_ agents.
- **Organization + Collections** give _access-scoped isolation_ → the multi-tenant
primitive. One customer = one Collection; a scoped credential reads only that collection.
- **Do not** run a separate Vaultwarden instance per customer — Collections are exactly
this feature.
@ -91,7 +94,7 @@ On "folder vs bucket":
the orchestrator, that can read any tenant collection to provision jails.
Everything non-secret — harness, base config, model-routing prefs — **ships in the
clawdie-iso image**. The image is the *body*; the bucket is the *one private nerve*.
clawdie-iso image**. The image is the _body_; the bucket is the _one private nerve_.
## 5. [PLANNED] The mother skill
@ -105,7 +108,7 @@ mother := resolve-identity (layered-soul)
```
- **Narrow:** onboarding — births one working agent from a bare jail.
- **Wide:** self-replication. An agent that *holds* the mother skill can spawn and
- **Wide:** self-replication. An agent that _holds_ the mother skill can spawn and
provision more jails (a queen births workers, each inheriting the mother skill), gated
by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
and `colibri-vault` is how each birth gets its one nerve.
@ -119,7 +122,7 @@ osa/FreeBSD/Bastille is the natural womb — cheap, dense, isolated jails.
> already shipped.
A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
build? Routes to a capable host. The customer pays for *one agent* but stands on a
build? Routes to a capable host. The customer pays for _one agent_ but stands on a
survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
behind one key — **capability routing is the differentiator.**
@ -133,7 +136,7 @@ behind one key — **capability routing is the differentiator.**
**Bootstraps live on the host; jails hold only their resolved secrets.**
- The orchestrator holds the org service-account credential. It fetches a tenant's
collection, writes the resolved `.env` *into* the jail, and the **bootstrap never enters
collection, writes the resolved `.env` _into_ the jail, and the **bootstrap never enters
the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
- Per-tenant blast radius = one collection. Scoped credential, never a master.
- This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
@ -151,14 +154,15 @@ Smallest path that is real:
**First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
throwaway test collection only** — no real tenant data until the path hardening lands
(canonicalize + allowed-root containment, colibri issue #92). The two first-proof blockers
are colibri **#88** (resolve the collection by name) and **#89** (per-call unlock); #92 is
hardening that follows. Tracker state lives on those issues.
(canonicalize + allowed-root containment, colibri issue #92). The former first-proof
blockers — colibri **#88** (resolve the collection by name) and **#89** (per-call unlock)
— are resolved on `main`; the remaining first-proof step is the operator-run scratch
runbook. #92 is hardening that follows before real tenant data.
**Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
+ a Collection is enough to start), billing/metering, a native Bitwarden protocol in Rust,
multi-region control plane, and recursive auto-spawn (gate it off until policy exists).
Those are product layers; the four steps above are the engine.
plus a Collection is enough to start), billing/metering, a native Bitwarden protocol in
Rust, multi-region control plane, and recursive auto-spawn (gate it off until policy
exists). Those are product layers; the four steps above are the engine.
---


@ -19,6 +19,10 @@ on any host fills in its own row. Source of truth for facts is the probe — not
> real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
>
> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
>
> **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
> placeholders; real values live in `fleet.env` (gitignored) and are live via
> `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
@ -28,15 +32,15 @@ on any host fills in its own row. Source of truth for facts is the probe — not
## 1. Agent placement (who runs where)
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
| ----------- | ------- | --------------------------- | ---------------------------- | -------------------------------- | --------------------- | ----------------------------- |
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
| **Mevy** | osa | FreeBSD 15 / host | Hermes Agent (upstream, CLI) | **Consolidated into hermes-osa** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — under hermes-osa** |
| **hermes-osa** | osa | FreeBSD 15 / host | Hermes Agent (FreeBSD fork) | **Orchestrator + board host (always-on VPS): chat + gateway** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — chat + Telegram** |
| Codex | osa | FreeBSD 15 / jail | Codex CLI | ISO builds, validation | — (CLI) | LIVE |
| **domedog-agent** | domedog | Ubuntu 24.04 / host | Colibri board agent | Headless Linux media/compute lane (image-render, ffmpeg, rust/go/py/node) | — | **LIVE — on central board 2026-06-19** |
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
| ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
| **Mevy** | osa | FreeBSD 15 / host | Hermes Agent (upstream, CLI) | **Consolidated into hermes-osa** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — under hermes-osa** |
| **hermes-osa** | osa | FreeBSD 15 / host | Hermes Agent (FreeBSD fork) | **Orchestrator + board host (always-on VPS): chat + gateway** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — chat + Telegram** |
| Codex | osa | FreeBSD 15 / jail | Codex CLI | ISO builds, validation | — (CLI) | LIVE |
| **domedog-agent** | domedog | Ubuntu 24.04 / host | Colibri board agent | Headless Linux media/compute lane (image-render, ffmpeg, rust/go/py/node) | — | **LIVE — on central board 2026-06-19** |
> **Mevy vs hermes-osa distinction**: Mevy (${HERMES_OSA_BOT} / OSA-bot) has been consolidated into hermes-osa as of 2026-06-17. The Telegram bot token was migrated from the old backup .env. hermes-osa now runs both the local CLI chat and the Telegram gateway (polling mode, tmux session `hermes-gateway`).
>
@ -64,11 +68,11 @@ on any host fills in its own row. Source of truth for facts is the probe — not
## 2. Host hardware & facts (one row per host)
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
| ----------- | -------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
| ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
### Disk discipline (check, don't guess)
@ -87,6 +91,25 @@ Disk is a first-class fact, same as OS or CPU — **measure it before you act, d
This is the survivability principle applied to storage: a host that silently fills up is a
host that fails. What you guess will be wrong; what you probe will be right.
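
The "flag any host past ~85%" rule can be sketched in a few lines of Python. This is an illustrative helper, not the fleet probe: the names `used_fraction`, `should_flag`, and `probe` are assumptions for this example. `shutil.disk_usage` may report slightly different numbers than `df -h /` on filesystems with reserved blocks or on ZFS datasets, so treat the probe and `df` as the source of truth.

```python
import shutil
from typing import Tuple

FLAG_THRESHOLD = 0.85  # the ~85% line from the disk note


def used_fraction(total: float, used: float) -> float:
    """Return the used share of a filesystem as a value in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def should_flag(total: float, used: float,
                threshold: float = FLAG_THRESHOLD) -> bool:
    """True when usage is past the threshold and the host needs attention."""
    return used_fraction(total, used) > threshold


def probe(path: str = "/") -> Tuple[float, bool]:
    """Measure the filesystem holding `path` instead of estimating it."""
    usage = shutil.disk_usage(path)
    fraction = used_fraction(usage.total, usage.used)
    return fraction, fraction > FLAG_THRESHOLD
```

Using the debby row from the hardware table (453G with 23G free), `should_flag(453, 453 - 23)` is `True`, since roughly 95% of the disk is in use.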
### Cost provenance (invoice/control-panel facts, not guesses)
Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
an IP range, mark it `TBD` until the control panel or invoice confirms it.
| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
| ---------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | ------ | ----------------- | ------------- | ------------------------------------------------ | ------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; do not treat IP range as proof. |
| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
| **ML350p Gen8** (candidate/retire) | self-hosted hardware | owned hardware | local | power TBD | utility bill | fallback build host only | measured watts + actual €/kWh needed | Do not make critical paths depend on it until reliability and TCO beat cloud. |
Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
calculate monthly power with `watts / 1000 * 24 * 30 * €/kWh` using measured idle/load
wattage and the actual utility rate; do not compare cloud invoices to guessed electricity.
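
The power formula above can be written as a small Python helper. The function name `monthly_power_cost` and the example inputs are assumptions for illustration only. The 120 W and 0.30 €/kWh figures are placeholders, not measurements of any host in this matrix.

```python
def monthly_power_cost(watts: float, eur_per_kwh: float,
                       hours_per_day: float = 24.0, days: int = 30) -> float:
    """Estimate monthly electricity cost in euros.

    Implements watts / 1000 * 24 * 30 * EUR/kWh, with the hours and days
    exposed so an intermittent host can be modelled too.
    """
    if watts < 0 or eur_per_kwh < 0:
        raise ValueError("watts and eur_per_kwh must be non-negative")
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    kwh_per_month = watts / 1000 * hours_per_day * days
    return kwh_per_month * eur_per_kwh


# Placeholder inputs: 120 W measured at the wall, 0.30 EUR/kWh.
# 120 / 1000 * 24 * 30 = 86.4 kWh, times 0.30 is about 25.92 EUR/month.
always_on = monthly_power_cost(120, 0.30)

# The same machine powered only 4 hours a day costs one sixth of that.
intermittent = monthly_power_cost(120, 0.30, hours_per_day=4)
```

Run it once with measured idle wattage and once with measured load wattage to bracket the real cost before comparing against a cloud quote.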
---
## 3. Per-host detail (expand as needed)
@ -113,7 +136,7 @@ host that fails. What you guess will be wrong; what you probe will be right.
in `~/.colibri/`: `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
- **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
it → worker marked it `done` (2026-06-19).
- **Executor pending (decision required)**: domedog *receives* capability-matched tasks, but
- **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
before relying on autonomous domedog task completion.
@ -147,9 +170,9 @@ host that fails. What you guess will be wrong; what you probe will be right.
- **Claude Code** — installed (path: `/home/clawdie/.npm-global/bin/claude`), no dedicated role yet.
- **Provider stack** (hermes-osa):
```yaml
provider: deepseek # primary — direct credits, proven DEEPSEEK_OK
provider: deepseek # primary — direct credits, proven DEEPSEEK_OK
default: deepseek-chat
fallback: openrouter # available manually, not auto-fallback configured yet
fallback: openrouter # available manually, not auto-fallback configured yet
```
- **Z.AI**: deferred (not configured for hermes-osa; available via OpenRouter if needed)
- **Telegram**: LIVE — ${HERMES_OSA_BOT}, polling mode, connected 2026-06-17