docs(host-matrix): add infrastructure cost provenance (Sam & Pi)
Track hosting spend as a verified fleet fact alongside disk and hardware, seed TBD rows for osa/domedom/debby/proposed OVH build capacity/ML350p, and update HIVE status now that first-proof blockers are code-complete.\n\nValidation: npx --yes prettier@3 --check docs/HOST-MATRIX.md docs/HIVE-ONBOARDING.md; python3 scripts/layered_soul.py validate .
This commit is contained in:
parent
4192574f74
commit
058e4ce926
2 changed files with 71 additions and 44 deletions
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
**LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
|
||||
proven (Bastille jails on osa, capability routing, `register-agent`, and the
|
||||
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The *platform*
|
||||
`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The _platform_
|
||||
described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
|
||||
`[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
|
||||
invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
|
||||
|
|
@ -14,32 +14,35 @@ invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
|
|||
The four MVP steps (§8) are **code-complete on colibri `main`**:
|
||||
|
||||
| MVP step | Status | Landed via |
|
||||
| -------- | ------ | ---------- |
|
||||
| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → PR #100 (server-match + serialize) |
|
||||
| --------------------------- | ------------------------- | ------------------------------------------- |
|
||||
| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → #100 (server-match + serialize) |
|
||||
| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
|
||||
| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
|
||||
| 4. `mother` skill | done (draft) | layered-soul |
|
||||
|
||||
Supporting pieces merged: `agent-jail-bootstrap.sh` (#96 → #97 version-pin → #104
|
||||
cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
|
||||
server-match (#67/#68/#69).
|
||||
server-match (#67/#68/#69), and the first-proof runbook (#103).
|
||||
|
||||
**First proof is *not* code-blocked** — the chain works today via the interim manual
|
||||
**First proof is _not_ code-blocked** — the chain works today via the interim manual
|
||||
path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
|
||||
(colibri). Critical path: merge PR #100 + #103 → run the runbook (scratch jail + test
|
||||
collection, manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at
|
||||
`0600` + tenant `active`.
|
||||
(colibri). Critical path now: operator runs the runbook (scratch jail + test collection,
|
||||
manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at `0600` + tenant
|
||||
`active`.
|
||||
|
||||
Open work, categorized:
|
||||
|
||||
- **Hardening:** colibri PR #100 (closes #95), #92 (path canonicalization/containment).
|
||||
- **Hardening:** #92 (path canonicalization/containment).
|
||||
- **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
|
||||
command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
|
||||
- **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
|
||||
section in `pkg-list-jails.txt`).
|
||||
- **Cost/source-of-truth:** fill `docs/HOST-MATRIX.md` cost provenance rows before buying
|
||||
or retiring build capacity; compare OVH quotes/invoices against measured self-host power.
|
||||
|
||||
**One-line plan:** merge #100 + #103 → run the runbook for the first proof → then land
|
||||
#101/#102 for CLI driveability, and #92 before promoting past scratch.
|
||||
**One-line plan:** run the first-proof runbook → then land #101/#102 for CLI driveability,
|
||||
#92 before promoting past scratch, and fill verified OVH/self-host cost data before buying
|
||||
or depending on a new mother/build host.
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -50,7 +53,7 @@ primitive**. Promote it from the `clawdie-vault-fetch` shell helper to a first-c
|
|||
crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:
|
||||
|
||||
- **in:** a tenant id (→ a bucket) + a target jail/home
|
||||
- **out:** a `0600` `.env` materialized *inside the jail*, owned by the jail user
|
||||
- **out:** a `0600` `.env` materialized _inside the jail_, owned by the jail user
|
||||
- wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
|
||||
idempotent, no-op when there is no bucket
|
||||
|
||||
|
|
@ -76,8 +79,8 @@ indirection than that.
|
|||
|
||||
On "folder vs bucket":
|
||||
|
||||
- **Folders** are personal-vault organization → fine for *Clawdie's own internal* agents.
|
||||
- **Organization + Collections** give *access-scoped isolation* → the multi-tenant
|
||||
- **Folders** are personal-vault organization → fine for _Clawdie's own internal_ agents.
|
||||
- **Organization + Collections** give _access-scoped isolation_ → the multi-tenant
|
||||
primitive. One customer = one Collection; a scoped credential reads only that collection.
|
||||
- **Do not** run a separate Vaultwarden instance per customer — Collections are exactly
|
||||
this feature.
|
||||
|
|
@ -91,7 +94,7 @@ On "folder vs bucket":
|
|||
the orchestrator, that can read any tenant collection to provision jails.
|
||||
|
||||
Everything non-secret — harness, base config, model-routing prefs — **ships in the
|
||||
clawdie-iso image**. The image is the *body*; the bucket is the *one private nerve*.
|
||||
clawdie-iso image**. The image is the _body_; the bucket is the _one private nerve_.
|
||||
|
||||
## 5. [PLANNED] The mother skill
|
||||
|
||||
|
|
@ -105,7 +108,7 @@ mother := resolve-identity (layered-soul)
|
|||
```
|
||||
|
||||
- **Narrow:** onboarding — births one working agent from a bare jail.
|
||||
- **Wide:** self-replication. An agent that *holds* the mother skill can spawn and
|
||||
- **Wide:** self-replication. An agent that _holds_ the mother skill can spawn and
|
||||
provision more jails (a queen births workers, each inheriting the mother skill), gated
|
||||
by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
|
||||
and `colibri-vault` is how each birth gets its one nerve.
|
||||
|
|
@ -119,7 +122,7 @@ osa/FreeBSD/Bastille is the natural womb — cheap, dense, isolated jails.
|
|||
> already shipped.
|
||||
|
||||
A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
|
||||
build? Routes to a capable host. The customer pays for *one agent* but stands on a
|
||||
build? Routes to a capable host. The customer pays for _one agent_ but stands on a
|
||||
survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
|
||||
behind one key — **capability routing is the differentiator.**
|
||||
|
||||
|
|
@ -133,7 +136,7 @@ behind one key — **capability routing is the differentiator.**
|
|||
**Bootstraps live on the host; jails hold only their resolved secrets.**
|
||||
|
||||
- The orchestrator holds the org service-account credential. It fetches a tenant's
|
||||
collection, writes the resolved `.env` *into* the jail, and the **bootstrap never enters
|
||||
collection, writes the resolved `.env` _into_ the jail, and the **bootstrap never enters
|
||||
the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
|
||||
- Per-tenant blast radius = one collection. Scoped credential, never a master.
|
||||
- This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
|
||||
|
|
@ -151,14 +154,15 @@ Smallest path that is real:
|
|||
|
||||
**First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
|
||||
throwaway test collection only** — no real tenant data until the path hardening lands
|
||||
(canonicalize + allowed-root containment, colibri issue #92). The two first-proof blockers
|
||||
are colibri **#88** (resolve the collection by name) and **#89** (per-call unlock); #92 is
|
||||
hardening that follows. Tracker state lives on those issues.
|
||||
(canonicalize + allowed-root containment, colibri issue #92). The former first-proof
|
||||
blockers — colibri **#88** (resolve the collection by name) and **#89** (per-call unlock)
|
||||
— are resolved on `main`; the remaining first-proof step is the operator-run scratch
|
||||
runbook. #92 is hardening that follows before real tenant data.
|
||||
|
||||
**Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
|
||||
+ a Collection is enough to start), billing/metering, a native Bitwarden protocol in Rust,
|
||||
multi-region control plane, and recursive auto-spawn (gate it off until policy exists).
|
||||
Those are product layers; the four steps above are the engine.
|
||||
plus a Collection is enough to start), billing/metering, a native Bitwarden protocol in
|
||||
Rust, multi-region control plane, and recursive auto-spawn (gate it off until policy
|
||||
exists). Those are product layers; the four steps above are the engine.
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
|
|
@ -19,6 +19,10 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
|||
> real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
|
||||
> **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
|
||||
>
|
||||
> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
|
||||
> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
|
||||
> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
|
||||
>
|
||||
> **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
|
||||
> placeholders; real values live in `fleet.env` (gitignored) and are live via
|
||||
> `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
|
||||
|
|
@ -29,7 +33,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
|||
## 1. Agent placement (who runs where)
|
||||
|
||||
| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
|
||||
| ----------- | ------- | --------------------------- | ---------------------------- | -------------------------------- | --------------------- | ----------------------------- |
|
||||
| ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
|
||||
| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
|
||||
| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
|
||||
| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
|
||||
|
|
@ -65,7 +69,7 @@ on any host fills in its own row. Source of truth for facts is the probe — not
|
|||
## 2. Host hardware & facts (one row per host)
|
||||
|
||||
| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
|
||||
| ----------- | -------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
|
||||
| ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
|
||||
| **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
|
||||
| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
|
||||
| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
|
||||
|
|
@ -87,6 +91,25 @@ Disk is a first-class fact, same as OS or CPU — **measure it before you act, d
|
|||
This is the survivability principle applied to storage: a host that silently fills up is a
|
||||
host that fails. What you guess will be wrong; what you probe will be right.
|
||||
|
||||
### Cost provenance (invoice/control-panel facts, not guesses)
|
||||
|
||||
Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
|
||||
plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
|
||||
IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
|
||||
an IP range, mark it `TBD` until the control panel or invoice confirms it.
|
||||
|
||||
| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
|
||||
| ---------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | ------ | ----------------- | ------------- | ------------------------------------------------ | ------------------------------------- | -------------------------------------------------------------------------------------------------- |
|
||||
| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; do not treat IP range as proof. |
|
||||
| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
|
||||
| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
|
||||
| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
|
||||
| **ML350p Gen8** (candidate/retire) | self-hosted hardware | owned hardware | local | power TBD | utility bill | fallback build host only | measured watts + actual €/kWh needed | Do not make critical paths depend on it until reliability and TCO beat cloud. |
|
||||
|
||||
Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
|
||||
calculate monthly power with `watts / 1000 * 24 * 30 * €/kWh` using measured idle/load
|
||||
wattage and the actual utility rate; do not compare cloud invoices to guessed electricity.
|
||||
|
||||
---
|
||||
|
||||
## 3. Per-host detail (expand as needed)
|
||||
|
|
@ -113,7 +136,7 @@ host that fails. What you guess will be wrong; what you probe will be right.
|
|||
in `~/.colibri/` — `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
|
||||
- **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
|
||||
it → worker marked it `done` (2026-06-19).
|
||||
- **Executor pending (decision required)**: domedog *receives* capability-matched tasks, but
|
||||
- **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
|
||||
no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
|
||||
lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
|
||||
before relying on autonomous domedog task completion.
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue