docs(host-matrix): add infrastructure cost provenance (Sam & Pi)

Track hosting spend as a verified fleet fact alongside disk and hardware, seed TBD rows for osa/domedom/debby/proposed OVH build capacity/ML350p, and update HIVE status now that first-proof blockers are code-complete.\n\nValidation: npx --yes prettier@3 --check docs/HOST-MATRIX.md docs/HIVE-ONBOARDING.md; python3 scripts/layered_soul.py validate .
This commit is contained in:
Sam & Claude 2026-06-20 09:48:12 +02:00
parent 4192574f74
commit 058e4ce926
2 changed files with 71 additions and 44 deletions

View file

@ -2,7 +2,7 @@
**LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and **LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
proven (Bastille jails on osa, capability routing, `register-agent`, and the proven (Bastille jails on osa, capability routing, `register-agent`, and the
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The *platform* `clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The _platform_
described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
`[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new `[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
invention. Sections are tagged `[LIVE]` / `[PLANNED]`. invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
@ -14,32 +14,35 @@ invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
The four MVP steps (§8) are **code-complete on colibri `main`**: The four MVP steps (§8) are **code-complete on colibri `main`**:
| MVP step | Status | Landed via | | MVP step | Status | Landed via |
| -------- | ------ | ---------- | | --------------------------- | ------------------------- | ------------------------------------------- |
| 1. `colibri-vault` crate | done; hardening in flight | #85#94 PR #100 (server-match + serialize) | | 1. `colibri-vault` crate | done; hardening in flight | #85#94#100 (server-match + serialize) |
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) | | 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) | | 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
| 4. `mother` skill | done (draft) | layered-soul | | 4. `mother` skill | done (draft) | layered-soul |
Supporting pieces merged: `agent-jail-bootstrap.sh` (#96#97 version-pin → #104 Supporting pieces merged: `agent-jail-bootstrap.sh` (#96#97 version-pin → #104
cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
server-match (#67/#68/#69). server-match (#67/#68/#69), and the first-proof runbook (#103).
**First proof is *not* code-blocked** — the chain works today via the interim manual **First proof is _not_ code-blocked** — the chain works today via the interim manual
path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri) path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
(colibri). Critical path: merge PR #100 + #103 → run the runbook (scratch jail + test (colibri). Critical path now: operator runs the runbook (scratch jail + test collection,
collection, manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at `0600` + tenant
`0600` + tenant `active`. `active`.
Open work, categorized: Open work, categorized:
- **Hardening:** colibri PR #100 (closes #95), #92 (path canonicalization/containment). - **Hardening:** #92 (path canonicalization/containment).
- **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant` - **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps. command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
- **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail - **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
section in `pkg-list-jails.txt`). section in `pkg-list-jails.txt`).
- **Cost/source-of-truth:** fill `docs/HOST-MATRIX.md` cost provenance rows before buying
or retiring build capacity; compare OVH quotes/invoices against measured self-host power.
**One-line plan:** merge #100 + #103 → run the runbook for the first proof → then land **One-line plan:** run the first-proof runbook → then land #101/#102 for CLI driveability,
#101/#102 for CLI driveability, and #92 before promoting past scratch. #92 before promoting past scratch, and fill verified OVH/self-host cost data before buying
or depending on a new mother/build host.
--- ---
@ -50,7 +53,7 @@ primitive**. Promote it from the `clawdie-vault-fetch` shell helper to a first-c
crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`: crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:
- **in:** a tenant id (→ a bucket) + a target jail/home - **in:** a tenant id (→ a bucket) + a target jail/home
- **out:** a `0600` `.env` materialized *inside the jail*, owned by the jail user - **out:** a `0600` `.env` materialized _inside the jail_, owned by the jail user
- wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed, - wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
idempotent, no-op when there is no bucket idempotent, no-op when there is no bucket
@ -76,8 +79,8 @@ indirection than that.
On "folder vs bucket": On "folder vs bucket":
- **Folders** are personal-vault organization → fine for *Clawdie's own internal* agents. - **Folders** are personal-vault organization → fine for _Clawdie's own internal_ agents.
- **Organization + Collections** give *access-scoped isolation* → the multi-tenant - **Organization + Collections** give _access-scoped isolation_ → the multi-tenant
primitive. One customer = one Collection; a scoped credential reads only that collection. primitive. One customer = one Collection; a scoped credential reads only that collection.
- **Do not** run a separate Vaultwarden instance per customer — Collections are exactly - **Do not** run a separate Vaultwarden instance per customer — Collections are exactly
this feature. this feature.
@ -91,7 +94,7 @@ On "folder vs bucket":
the orchestrator, that can read any tenant collection to provision jails. the orchestrator, that can read any tenant collection to provision jails.
Everything non-secret — harness, base config, model-routing prefs — **ships in the Everything non-secret — harness, base config, model-routing prefs — **ships in the
clawdie-iso image**. The image is the *body*; the bucket is the *one private nerve*. clawdie-iso image**. The image is the _body_; the bucket is the _one private nerve_.
## 5. [PLANNED] The mother skill ## 5. [PLANNED] The mother skill
@ -105,7 +108,7 @@ mother := resolve-identity (layered-soul)
``` ```
- **Narrow:** onboarding — births one working agent from a bare jail. - **Narrow:** onboarding — births one working agent from a bare jail.
- **Wide:** self-replication. An agent that *holds* the mother skill can spawn and - **Wide:** self-replication. An agent that _holds_ the mother skill can spawn and
provision more jails (a queen births workers, each inheriting the mother skill), gated provision more jails (a queen births workers, each inheriting the mother skill), gated
by capability/policy so it cannot run away. That is "agent swarms with a mother skill," by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
and `colibri-vault` is how each birth gets its one nerve. and `colibri-vault` is how each birth gets its one nerve.
@ -119,7 +122,7 @@ osa/FreeBSD/Bastille is the natural womb — cheap, dense, isolated jails.
> already shipped. > already shipped.
A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
build? Routes to a capable host. The customer pays for *one agent* but stands on a build? Routes to a capable host. The customer pays for _one agent_ but stands on a
survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
behind one key — **capability routing is the differentiator.** behind one key — **capability routing is the differentiator.**
@ -133,7 +136,7 @@ behind one key — **capability routing is the differentiator.**
**Bootstraps live on the host; jails hold only their resolved secrets.** **Bootstraps live on the host; jails hold only their resolved secrets.**
- The orchestrator holds the org service-account credential. It fetches a tenant's - The orchestrator holds the org service-account credential. It fetches a tenant's
collection, writes the resolved `.env` *into* the jail, and the **bootstrap never enters collection, writes the resolved `.env` _into_ the jail, and the **bootstrap never enters
the jail**. A compromised jail cannot re-fetch and cannot reach another tenant. the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
- Per-tenant blast radius = one collection. Scoped credential, never a master. - Per-tenant blast radius = one collection. Scoped credential, never a master.
- This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the - This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
@ -151,14 +154,15 @@ Smallest path that is real:
**First-proof policy.** The first proven end-to-end runs against a **scratch jail + a **First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
throwaway test collection only** — no real tenant data until the path hardening lands throwaway test collection only** — no real tenant data until the path hardening lands
(canonicalize + allowed-root containment, colibri issue #92). The two first-proof blockers (canonicalize + allowed-root containment, colibri issue #92). The former first-proof
are colibri **#88** (resolve the collection by name) and **#89** (per-call unlock); #92 is blockers — colibri **#88** (resolve the collection by name) and **#89** (per-call unlock)
hardening that follows. Tracker state lives on those issues. — are resolved on `main`; the remaining first-proof step is the operator-run scratch
runbook. #92 is hardening that follows before real tenant data.
**Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI **Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
+ a Collection is enough to start), billing/metering, a native Bitwarden protocol in Rust, plus a Collection is enough to start), billing/metering, a native Bitwarden protocol in
multi-region control plane, and recursive auto-spawn (gate it off until policy exists). Rust, multi-region control plane, and recursive auto-spawn (gate it off until policy
Those are product layers; the four steps above are the engine. exists). Those are product layers; the four steps above are the engine.
--- ---

View file

@ -19,6 +19,10 @@ on any host fills in its own row. Source of truth for facts is the probe — not
> real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the > real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below. > **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
> >
> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
>
> **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}` > **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
> placeholders; real values live in `fleet.env` (gitignored) and are live via > placeholders; real values live in `fleet.env` (gitignored) and are live via
> `tailscale status`. Copy `fleet.env.example``fleet.env` to resolve them. The probe > `tailscale status`. Copy `fleet.env.example``fleet.env` to resolve them. The probe
@ -29,7 +33,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
## 1. Agent placement (who runs where) ## 1. Agent placement (who runs where)
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status | | Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
| ----------- | ------- | --------------------------- | ---------------------------- | -------------------------------- | --------------------- | ----------------------------- | | ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) | | Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE | | Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE | | Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
@ -65,7 +69,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
## 2. Host hardware & facts (one row per host) ## 2. Host hardware & facts (one row per host)
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By | | Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
| ----------- | -------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ | | ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude | | **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes | | **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi | | **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
@ -87,6 +91,25 @@ Disk is a first-class fact, same as OS or CPU — **measure it before you act, d
This is the survivability principle applied to storage: a host that silently fills up is a This is the survivability principle applied to storage: a host that silently fills up is a
host that fails. What you guess will be wrong; what you probe will be right. host that fails. What you guess will be wrong; what you probe will be right.
### Cost provenance (invoice/control-panel facts, not guesses)
Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
an IP range, mark it `TBD` until the control panel or invoice confirms it.
| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
| ---------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | ------ | ----------------- | ------------- | ------------------------------------------------ | ------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; do not treat IP range as proof. |
| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
| **ML350p Gen8** (candidate/retire) | self-hosted hardware | owned hardware | local | power TBD | utility bill | fallback build host only | measured watts + actual €/kWh needed | Do not make critical paths depend on it until reliability and TCO beat cloud. |
Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
calculate monthly power with `watts / 1000 * 24 * 30 * €/kWh` using measured idle/load
wattage and the actual utility rate; do not compare cloud invoices to guessed electricity.
--- ---
## 3. Per-host detail (expand as needed) ## 3. Per-host detail (expand as needed)
@ -113,7 +136,7 @@ host that fails. What you guess will be wrong; what you probe will be right.
in `~/.colibri/``colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`. in `~/.colibri/``colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
- **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw - **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
it → worker marked it `done` (2026-06-19). it → worker marked it `done` (2026-06-19).
- **Executor pending (decision required)**: domedog *receives* capability-matched tasks, but - **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
no persistent execution loop runs yet — until one does, routed tasks sit `started` (no no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
lease/reaper). Decide what executes (Claude Code worker / script) and with what authority lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
before relying on autonomous domedog task completion. before relying on autonomous domedog task completion.