docs(host-matrix): add infrastructure cost provenance (Sam & Pi)
Track hosting spend as a verified fleet fact alongside disk and hardware, seed TBD rows for osa/domedog/debby/proposed OVH build capacity/ML350p, and update HIVE status now that first-proof blockers are code-complete.

Validation: npx --yes prettier@3 --check docs/HOST-MATRIX.md docs/HIVE-ONBOARDING.md; python3 scripts/layered_soul.py validate .
parent 4192574f74
commit 058e4ce926
2 changed files with 71 additions and 44 deletions
|
|
@ -2,7 +2,7 @@
|
||||||
|
|
||||||
**LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
|
**LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
|
||||||
proven (Bastille jails on osa, capability routing, `register-agent`, and the
|
proven (Bastille jails on osa, capability routing, `register-agent`, and the
|
||||||
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The *platform*
|
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The _platform_
|
||||||
described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
|
described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
|
||||||
`[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
|
`[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
|
||||||
invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
|
invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
|
||||||
|
|
@ -14,32 +14,35 @@ invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
|
||||||
The four MVP steps (§8) are **code-complete on colibri `main`**:
|
The four MVP steps (§8) are **code-complete on colibri `main`**:
|
||||||
|
|
||||||
| MVP step | Status | Landed via |
|
| MVP step | Status | Landed via |
|
||||||
| -------- | ------ | ---------- |
|
| --------------------------- | ------------------------- | ------------------------------------------- |
|
||||||
| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → PR #100 (server-match + serialize) |
|
| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → #100 (server-match + serialize) |
|
||||||
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
|
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
|
||||||
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
|
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
|
||||||
| 4. `mother` skill | done (draft) | layered-soul |
|
| 4. `mother` skill | done (draft) | layered-soul |
|
||||||
|
|
||||||
Supporting pieces merged: `agent-jail-bootstrap.sh` (#96 → #97 version-pin → #104
|
Supporting pieces merged: `agent-jail-bootstrap.sh` (#96 → #97 version-pin → #104
|
||||||
cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
|
cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
|
||||||
server-match (#67/#68/#69).
|
server-match (#67/#68/#69), and the first-proof runbook (#103).
|
||||||
|
|
||||||
**First proof is *not* code-blocked** — the chain works today via the interim manual
|
**First proof is _not_ code-blocked** — the chain works today via the interim manual
|
||||||
path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
|
path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
|
||||||
(colibri). Critical path: merge PR #100 + #103 → run the runbook (scratch jail + test
|
(colibri). Critical path now: operator runs the runbook (scratch jail + test collection,
|
||||||
collection, manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at
|
manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at `0600` + tenant
|
||||||
`0600` + tenant `active`.
|
`active`.
|
||||||
|
|
||||||
Open work, categorized:
|
Open work, categorized:
|
||||||
|
|
||||||
- **Hardening:** colibri PR #100 (closes #95), #92 (path canonicalization/containment).
|
- **Hardening:** #92 (path canonicalization/containment).
|
||||||
- **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
|
- **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
|
||||||
command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
|
command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
|
||||||
- **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
|
- **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
|
||||||
section in `pkg-list-jails.txt`).
|
section in `pkg-list-jails.txt`).
|
||||||
|
- **Cost/source-of-truth:** fill `docs/HOST-MATRIX.md` cost provenance rows before buying
|
||||||
|
or retiring build capacity; compare OVH quotes/invoices against measured self-host power.
|
||||||
|
|
||||||
**One-line plan:** merge #100 + #103 → run the runbook for the first proof → then land
|
**One-line plan:** run the first-proof runbook → then land #101/#102 for CLI driveability,
|
||||||
#101/#102 for CLI driveability, and #92 before promoting past scratch.
|
#92 before promoting past scratch, and fill verified OVH/self-host cost data before buying
|
||||||
|
or depending on a new mother/build host.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -50,7 +53,7 @@ primitive**. Promote it from the `clawdie-vault-fetch` shell helper to a first-c
|
||||||
crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:
|
crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:
|
||||||
|
|
||||||
- **in:** a tenant id (→ a bucket) + a target jail/home
|
- **in:** a tenant id (→ a bucket) + a target jail/home
|
||||||
- **out:** a `0600` `.env` materialized *inside the jail*, owned by the jail user
|
- **out:** a `0600` `.env` materialized _inside the jail_, owned by the jail user
|
||||||
- wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
|
- wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
|
||||||
idempotent, no-op when there is no bucket
|
idempotent, no-op when there is no bucket
|
||||||
|
|
||||||
|
|
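The `colibri-vault` contract in the hunk above (a `0600` `.env` as output, fail-closed, idempotent, no-op when there is no bucket) can be illustrated with a small sketch. This is not the crate's code: `colibri-vault` is described as a Rust crate wrapping the `bw` CLI, and the function names `write_env_0600` and `is_mode_0600`, the `KEY=value` line format, and the use of `None` to model "no bucket" are all assumptions made for illustration. Fetching from the vault and chowning to the jail user are deliberately left out.

```python
import os
import stat
import tempfile
from typing import Mapping, Optional


def write_env_0600(path: str, secrets: Optional[Mapping[str, str]]) -> bool:
    """Write KEY=value lines to `path` with mode 0600.

    Returns False without touching the filesystem when `secrets` is None
    (hypothetical stand-in for "this tenant has no bucket").
    """
    if secrets is None:
        return False  # no-op: nothing to materialize

    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable and writable by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".env.")
    try:
        os.fchmod(fd, 0o600)  # explicit, in case defaults ever differ
        with os.fdopen(fd, "w") as handle:
            # Sorted keys make repeated runs produce identical bytes.
            for key in sorted(secrets):
                handle.write(f"{key}={secrets[key]}\n")
        # Atomic rename: readers see the old file or the new one, never half.
        os.replace(tmp_path, path)
    except BaseException:
        # Fail closed: leave no partial file behind, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return True


def is_mode_0600(path: str) -> bool:
    """Return True if the permission bits of `path` are exactly 0600."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600
```

`is_mode_0600` is the kind of check the first-proof step "verify `.env` at `0600`" implies. Values containing newlines are not escaped here, so a real implementation would need a stricter serializer.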
@ -76,8 +79,8 @@ indirection than that.
|
||||||
|
|
||||||
On "folder vs bucket":
|
On "folder vs bucket":
|
||||||
|
|
||||||
- **Folders** are personal-vault organization → fine for *Clawdie's own internal* agents.
|
- **Folders** are personal-vault organization → fine for _Clawdie's own internal_ agents.
|
||||||
- **Organization + Collections** give *access-scoped isolation* → the multi-tenant
|
- **Organization + Collections** give _access-scoped isolation_ → the multi-tenant
|
||||||
primitive. One customer = one Collection; a scoped credential reads only that collection.
|
primitive. One customer = one Collection; a scoped credential reads only that collection.
|
||||||
- **Do not** run a separate Vaultwarden instance per customer — Collections are exactly
|
- **Do not** run a separate Vaultwarden instance per customer — Collections are exactly
|
||||||
this feature.
|
this feature.
|
||||||
|
|
@ -91,7 +94,7 @@ On "folder vs bucket":
|
||||||
the orchestrator, that can read any tenant collection to provision jails.
|
the orchestrator, that can read any tenant collection to provision jails.
|
||||||
|
|
||||||
Everything non-secret — harness, base config, model-routing prefs — **ships in the
|
Everything non-secret — harness, base config, model-routing prefs — **ships in the
|
||||||
clawdie-iso image**. The image is the *body*; the bucket is the *one private nerve*.
|
clawdie-iso image**. The image is the _body_; the bucket is the _one private nerve_.
|
||||||
|
|
||||||
## 5. [PLANNED] The mother skill
|
## 5. [PLANNED] The mother skill
|
||||||
|
|
||||||
|
|
@ -105,7 +108,7 @@ mother := resolve-identity (layered-soul)
|
||||||
```
|
```
|
||||||
|
|
||||||
- **Narrow:** onboarding — births one working agent from a bare jail.
|
- **Narrow:** onboarding — births one working agent from a bare jail.
|
||||||
- **Wide:** self-replication. An agent that *holds* the mother skill can spawn and
|
- **Wide:** self-replication. An agent that _holds_ the mother skill can spawn and
|
||||||
provision more jails (a queen births workers, each inheriting the mother skill), gated
|
provision more jails (a queen births workers, each inheriting the mother skill), gated
|
||||||
by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
|
by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
|
||||||
and `colibri-vault` is how each birth gets its one nerve.
|
and `colibri-vault` is how each birth gets its one nerve.
|
||||||
|
|
@ -119,7 +122,7 @@ osa/FreeBSD/Bastille is the natural womb — cheap, dense, isolated jails.
|
||||||
> already shipped.
|
> already shipped.
|
||||||
|
|
||||||
A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
|
A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
|
||||||
build? Routes to a capable host. The customer pays for *one agent* but stands on a
|
build? Routes to a capable host. The customer pays for _one agent_ but stands on a
|
||||||
survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
|
survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
|
||||||
behind one key — **capability routing is the differentiator.**
|
behind one key — **capability routing is the differentiator.**
|
||||||
|
|
||||||
|
|
@ -133,7 +136,7 @@ behind one key — **capability routing is the differentiator.**
|
||||||
**Bootstraps live on the host; jails hold only their resolved secrets.**
|
**Bootstraps live on the host; jails hold only their resolved secrets.**
|
||||||
|
|
||||||
- The orchestrator holds the org service-account credential. It fetches a tenant's
|
- The orchestrator holds the org service-account credential. It fetches a tenant's
|
||||||
collection, writes the resolved `.env` *into* the jail, and the **bootstrap never enters
|
collection, writes the resolved `.env` _into_ the jail, and the **bootstrap never enters
|
||||||
the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
|
the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
|
||||||
- Per-tenant blast radius = one collection. Scoped credential, never a master.
|
- Per-tenant blast radius = one collection. Scoped credential, never a master.
|
||||||
- This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
|
- This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
|
||||||
|
|
@ -151,14 +154,15 @@ Smallest path that is real:
|
||||||
|
|
||||||
**First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
|
**First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
|
||||||
throwaway test collection only** — no real tenant data until the path hardening lands
|
throwaway test collection only** — no real tenant data until the path hardening lands
|
||||||
(canonicalize + allowed-root containment, colibri issue #92). The two first-proof blockers
|
(canonicalize + allowed-root containment, colibri issue #92). The former first-proof
|
||||||
are colibri **#88** (resolve the collection by name) and **#89** (per-call unlock); #92 is
|
blockers — colibri **#88** (resolve the collection by name) and **#89** (per-call unlock)
|
||||||
hardening that follows. Tracker state lives on those issues.
|
— are resolved on `main`; the remaining first-proof step is the operator-run scratch
|
||||||
|
runbook. #92 is hardening that follows before real tenant data.
|
||||||
|
|
||||||
**Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
|
**Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
|
||||||
+ a Collection is enough to start), billing/metering, a native Bitwarden protocol in Rust,
|
plus a Collection is enough to start), billing/metering, a native Bitwarden protocol in
|
||||||
multi-region control plane, and recursive auto-spawn (gate it off until policy exists).
|
Rust, multi-region control plane, and recursive auto-spawn (gate it off until policy
|
||||||
Those are product layers; the four steps above are the engine.
|
exists). Those are product layers; the four steps above are the engine.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -19,6 +19,10 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
||||||
> real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
|
> real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
|
||||||
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
|
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
|
||||||
>
|
>
|
||||||
|
> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
|
||||||
|
> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
|
||||||
|
> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
|
||||||
|
>
|
||||||
> **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
|
> **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
|
||||||
> placeholders; real values live in `fleet.env` (gitignored) and are live via
|
> placeholders; real values live in `fleet.env` (gitignored) and are live via
|
||||||
> `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
|
> `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
|
||||||
|
|
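The "flag any host past ~85%" rule from the note in the hunk above can be sketched as a quick local check. This is an illustration only and does not replace the probe, which the document names as the source of truth. The function names and the `0.85` default are assumptions; `shutil.disk_usage` is from the Python standard library. Its used/total ratio can differ slightly from the percentage `df` prints, and on ZFS the pool-level figure may be the more meaningful one.

```python
import shutil


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report zero size
    return usage.used / usage.total


def past_threshold(used_fraction: float, threshold: float = 0.85) -> bool:
    """True when usage exceeds the flag threshold (~85% by default)."""
    return used_fraction > threshold


if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    label = "FLAG" if past_threshold(fraction) else "ok"
    print(f"/ is {fraction:.0%} used: {label}")
```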
@ -29,7 +33,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
||||||
## 1. Agent placement (who runs where)
|
## 1. Agent placement (who runs where)
|
||||||
|
|
||||||
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
|
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
|
||||||
| ----------- | ------- | --------------------------- | ---------------------------- | -------------------------------- | --------------------- | ----------------------------- |
|
| ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
|
||||||
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
|
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
|
||||||
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
|
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
|
||||||
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
|
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
|
||||||
|
|
@ -65,7 +69,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
||||||
## 2. Host hardware & facts (one row per host)
|
## 2. Host hardware & facts (one row per host)
|
||||||
|
|
||||||
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
|
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
|
||||||
| ----------- | -------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
|
| ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
|
||||||
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
|
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
|
||||||
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
|
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
|
||||||
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
|
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
|
||||||
|
|
@ -87,6 +91,25 @@ Disk is a first-class fact, same as OS or CPU — **measure it before you act, d
|
||||||
This is the survivability principle applied to storage: a host that silently fills up is a
|
This is the survivability principle applied to storage: a host that silently fills up is a
|
||||||
host that fails. What you guess will be wrong; what you probe will be right.
|
host that fails. What you guess will be wrong; what you probe will be right.
|
||||||
|
|
||||||
|
### Cost provenance (invoice/control-panel facts, not guesses)
|
||||||
|
|
||||||
|
Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
|
||||||
|
plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
|
||||||
|
IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
|
||||||
|
an IP range, mark it `TBD` until the control panel or invoice confirms it.
|
||||||
|
|
||||||
|
| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
|
||||||
|
| ---------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | ------ | ----------------- | ------------- | ------------------------------------------------ | ------------------------------------- | -------------------------------------------------------------------------------------------------- |
|
||||||
|
| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; do not treat IP range as proof. |
|
||||||
|
| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
|
||||||
|
| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
|
||||||
|
| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
|
||||||
|
| **ML350p Gen8** (candidate/retire) | self-hosted hardware | owned hardware | local | power TBD | utility bill | fallback build host only | measured watts + actual €/kWh needed | Do not make critical paths depend on it until reliability and TCO beat cloud. |
|
||||||
|
|
||||||
|
Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
|
||||||
|
calculate monthly power with `watts / 1000 * 24 * 30 * €/kWh` using measured idle/load
|
||||||
|
wattage and the actual utility rate; do not compare cloud invoices to guessed electricity.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 3. Per-host detail (expand as needed)
|
## 3. Per-host detail (expand as needed)
|
||||||
|
|
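The monthly power formula from the cost-provenance hunk above, `watts / 1000 * 24 * 30 * €/kWh`, can be written as a small helper. The function name and parameters are assumptions, and the numbers in the usage line are placeholders, not measurements of any host in this matrix; real inputs should be measured wattage and the actual utility rate, as the text requires.

```python
def monthly_power_cost(
    watts: float,
    eur_per_kwh: float,
    hours_per_day: float = 24.0,
    days: int = 30,
) -> float:
    """Monthly electricity cost in euros for a constant draw of `watts`.

    Implements watts / 1000 * hours_per_day * days * eur_per_kwh, which
    reduces to the document's formula with the default 24 h and 30 days.
    """
    if watts < 0 or eur_per_kwh < 0:
        raise ValueError("watts and eur_per_kwh must be non-negative")
    kilowatts = watts / 1000
    kwh_per_month = kilowatts * hours_per_day * days
    return kwh_per_month * eur_per_kwh


if __name__ == "__main__":
    # Placeholder inputs only: 120 W average draw at 0.25 EUR/kWh.
    print(f"{monthly_power_cost(120, 0.25):.2f} EUR/month")
```

Running it once with idle wattage and once with load wattage gives a cost range to set against a cloud quote.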
@ -113,7 +136,7 @@ host that fails. What you guess will be wrong; what you probe will be right.
|
||||||
in `~/.colibri/` — `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
|
in `~/.colibri/` — `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
|
||||||
- **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
|
- **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
|
||||||
it → worker marked it `done` (2026-06-19).
|
it → worker marked it `done` (2026-06-19).
|
||||||
- **Executor pending (decision required)**: domedog *receives* capability-matched tasks, but
|
- **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
|
||||||
no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
|
no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
|
||||||
lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
|
lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
|
||||||
before relying on autonomous domedog task completion.
|
before relying on autonomous domedog task completion.
|
||||||
|
|
|
||||||