diff --git a/docs/HIVE-ONBOARDING.md b/docs/HIVE-ONBOARDING.md
index 5701df6..6231254 100644
--- a/docs/HIVE-ONBOARDING.md
+++ b/docs/HIVE-ONBOARDING.md
@@ -2,7 +2,7 @@

 **LIVE VS PLANNED.** This is a **design/vision** doc. The building blocks are real and
 proven (Bastille jails on osa, capability routing, `register-agent`, and the
-`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The *platform*
+`clawdie-vault-fetch` flow validated end-to-end on domedog 2026-06-19). The _platform_
 described here — `colibri-vault` as a crate, multi-tenant buckets, the mother skill — is
 `[PLANNED]`. The thesis: it is mostly **composition of pieces we already have**, not new
 invention. Sections are tagged `[LIVE]` / `[PLANNED]`.
@@ -13,33 +13,36 @@ invention. Sections are tagged `[LIVE]` / `[PLANNED]`.

 The four MVP steps (§8) are **code-complete on colibri `main`**:

-| MVP step | Status | Landed via |
-| -------- | ------ | ---------- |
-| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → PR #100 (server-match + serialize) |
-| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
-| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
-| 4. `mother` skill | done (draft) | layered-soul |
+| MVP step | Status | Landed via |
+| --------------------------- | ------------------------- | ------------------------------------------- |
+| 1. `colibri-vault` crate | done; hardening in flight | #85 → #94 → #100 (server-match + serialize) |
+| 2. `tenants` table | on `main` | (PR #90 closed as superseded) |
+| 3. spawner → provision hook | done | #91 (root-verify) → #94 (wired) |
+| 4. `mother` skill | done (draft) | layered-soul |

 Supporting pieces merged: `agent-jail-bootstrap.sh` (#96 → #97 version-pin → #104
 cold-cache guard), `provider.env` staging (#69/#99), vault-fetch shell helper
-server-match (#67/#68/#69).
+server-match (#67/#68/#69), and the first-proof runbook (#103).

-**First proof is *not* code-blocked** — the chain works today via the interim manual
+**First proof is _not_ code-blocked** — the chain works today via the interim manual
 path in [`../docs/VAULT-PROVISION-FIRST-PROOF.md`](https://code.smilepowered.org/clawdie/colibri)
-(colibri). Critical path: merge PR #100 + #103 → run the runbook (scratch jail + test
-collection, manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at
-`0600` + tenant `active`.
+(colibri). Critical path now: operator runs the runbook (scratch jail + test collection,
+manual SQLite tenant insert, raw-socket jailed spawn) → verify `.env` at `0600` + tenant
+`active`.

 Open work, categorized:

-- **Hardening:** colibri PR #100 (closes #95), #92 (path canonicalization/containment).
+- **Hardening:** #92 (path canonicalization/containment).
 - **CLI-driveability (post-proof ergonomics, not proof blockers):** #101 (`register-tenant`
   command), #102 (`--jail` on `spawn-agent`) — these replace the runbook's manual steps.
 - **Source-of-truth/naming:** #98 (`npm-node24` vs `npm`), clawdie-iso #70 (agent-jail
   section in `pkg-list-jails.txt`).
+- **Cost/source-of-truth:** fill `docs/HOST-MATRIX.md` cost provenance rows before buying
+  or retiring build capacity; compare OVH quotes/invoices against measured self-host power.

-**One-line plan:** merge #100 + #103 → run the runbook for the first proof → then land
-#101/#102 for CLI driveability, and #92 before promoting past scratch.
+**One-line plan:** run the first-proof runbook → then land #101/#102 for CLI driveability,
+#92 before promoting past scratch, and fill verified OVH/self-host cost data before buying
+or depending on a new mother/build host.

 ---

@@ -50,7 +53,7 @@ primitive**. Promote it from the `clawdie-vault-fetch` shell helper to a first-c
 crate, **`colibri-vault`**, sitting beside `colibri-spawner` / `colibri-store`:

 - **in:** a tenant id (→ a bucket) + a target jail/home
-- **out:** a `0600` `.env` materialized *inside the jail*, owned by the jail user
+- **out:** a `0600` `.env` materialized _inside the jail_, owned by the jail user
 - wraps the `bw` CLI for now (do **not** reimplement the Bitwarden protocol), fail-closed,
   idempotent, no-op when there is no bucket

@@ -76,8 +79,8 @@ indirection than that.

 On "folder vs bucket":

-- **Folders** are personal-vault organization → fine for *Clawdie's own internal* agents.
-- **Organization + Collections** give *access-scoped isolation* → the multi-tenant
+- **Folders** are personal-vault organization → fine for _Clawdie's own internal_ agents.
+- **Organization + Collections** give _access-scoped isolation_ → the multi-tenant
   primitive. One customer = one Collection; a scoped credential reads only that collection.
 - **Do not** run a separate Vaultwarden instance per customer — Collections are exactly this
   feature.
@@ -91,7 +94,7 @@ On "folder vs bucket":
   the orchestrator, that can read any tenant collection to provision jails.

 Everything non-secret — harness, base config, model-routing prefs — **ships in the
-clawdie-iso image**. The image is the *body*; the bucket is the *one private nerve*.
+clawdie-iso image**. The image is the _body_; the bucket is the _one private nerve_.

 ## 5. [PLANNED] The mother skill

@@ -105,7 +108,7 @@ mother := resolve-identity (layered-soul)
 ```

 - **Narrow:** onboarding — births one working agent from a bare jail.
-- **Wide:** self-replication. An agent that *holds* the mother skill can spawn and
+- **Wide:** self-replication. An agent that _holds_ the mother skill can spawn and
   provision more jails (a queen births workers, each inheriting the mother skill), gated
   by capability/policy so it cannot run away. That is "agent swarms with a mother skill,"
   and `colibri-vault` is how each birth gets its one nerve.
@@ -119,7 +122,7 @@ osa/FreeBSD/Bastille is the natural womb — cheap, dense, isolated jails.
 > already shipped.

 A one-key agent on osa needs `image-render`? It routes to a Linux lane (domedog). Needs a
-build? Routes to a capable host. The customer pays for *one agent* but stands on a
+build? Routes to a capable host. The customer pays for _one agent_ but stands on a
 survivable, multi-OS hive. Anyone can run an LLM in a container; few hand you a swarm
 behind one key — **capability routing is the differentiator.**

@@ -133,7 +136,7 @@ behind one key — **capability routing is the differentiator.**
 **Bootstraps live on the host; jails hold only their resolved secrets.**

 - The orchestrator holds the org service-account credential. It fetches a tenant's
-  collection, writes the resolved `.env` *into* the jail, and the **bootstrap never enters
+  collection, writes the resolved `.env` _into_ the jail, and the **bootstrap never enters
   the jail**. A compromised jail cannot re-fetch and cannot reach another tenant.
 - Per-tenant blast radius = one collection. Scoped credential, never a master.
 - This is the same shape the domedog smoke test validated (bootstrap on host, `.env` is the
@@ -151,14 +154,15 @@ Smallest path that is real:

 **First-proof policy.** The first proven end-to-end runs against a **scratch jail + a
 throwaway test collection only** — no real tenant data until the path hardening lands
-(canonicalize + allowed-root containment, colibri issue #92). The two first-proof blockers
-are colibri **#88** (resolve the collection by name) and **#89** (per-call unlock); #92 is
-hardening that follows. Tracker state lives on those issues.
+(canonicalize + allowed-root containment, colibri issue #92). The former first-proof
+blockers — colibri **#88** (resolve the collection by name) and **#89** (per-call unlock)
+— are resolved on `main`; the remaining first-proof step is the operator-run scratch
+runbook. #92 is hardening that follows before real tenant data.

 **Overengineering traps to avoid for now:** a custom Bitwarden web UI (Vaultwarden's own UI
-+ a Collection is enough to start), billing/metering, a native Bitwarden protocol in Rust,
-multi-region control plane, and recursive auto-spawn (gate it off until policy exists).
-Those are product layers; the four steps above are the engine.
+plus a Collection is enough to start), billing/metering, a native Bitwarden protocol in
+Rust, multi-region control plane, and recursive auto-spawn (gate it off until policy
+exists). Those are product layers; the four steps above are the engine.

 ---

diff --git a/docs/HOST-MATRIX.md b/docs/HOST-MATRIX.md
index 9231ec6..7e57722 100644
--- a/docs/HOST-MATRIX.md
+++ b/docs/HOST-MATRIX.md
@@ -19,6 +19,10 @@ on any host fills in its own row. Source of truth for facts is the probe — not
 > real free space (`df -h /`, or the probe's `--storage`) — never estimate. Keep the
 > **Disk (free)** column current and flag any host past ~85%. See _Disk discipline_ below.
 >
+> **Cost before buying:** before purchasing or retiring infrastructure, record provider,
+> plan/SKU, verified monthly cost, and the source of truth (invoice/control panel/utility
+> bill). IP-range guesses are not billing proof. See _Cost provenance_ below.
+>
 > **Never paste real IPs or bot handles here.** Use `${HOST_TS_IP}` and `${*_BOT}`
 > placeholders; real values live in `fleet.env` (gitignored) and are live via
 > `tailscale status`. Copy `fleet.env.example` → `fleet.env` to resolve them. The probe
@@ -28,15 +32,15 @@ on any host fills in its own row. Source of truth for facts is the probe — not

 ## 1. Agent placement (who runs where)

-| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
-| ----------- | ------- | --------------------------- | ---------------------------- | -------------------------------- | --------------------- | ----------------------------- |
-| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
-| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
-| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
-| **Mevy** | osa | FreeBSD 15 / host | Hermes Agent (upstream, CLI) | **Consolidated into hermes-osa** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — under hermes-osa** |
-| **hermes-osa** | osa | FreeBSD 15 / host | Hermes Agent (FreeBSD fork) | **Orchestrator + board host (always-on VPS): chat + gateway** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — chat + Telegram** |
-| Codex | osa | FreeBSD 15 / jail | Codex CLI | ISO builds, validation | — (CLI) | LIVE |
-| **domedog-agent** | domedog | Ubuntu 24.04 / host | Colibri board agent | Headless Linux media/compute lane (image-render, ffmpeg, rust/go/py/node) | — | **LIVE — on central board 2026-06-19** |
+| Agent | Host | OS / Isolation | Harness | Role | Bot / channel | Status |
+| ----------------- | ------- | --------------------- | ---------------------------- | ------------------------------------------------------------------------- | --------------------------- | -------------------------------------- |
+| Hermes | debby | Debian 13 / Docker | Hermes Agent (upstream) | Secondary agent + soul backup (intermittent laptop) | ${HERMES_BOT} | LIVE (intermittent) |
+| Zot | debby | Debian 13 / Docker | Zot RPC | Coding, media workflows | ${ZOT_BOT} | LIVE |
+| Claude | domedog | Ubuntu 24.04 / Docker | Claude Code | Verification, review | — (CLI) | LIVE |
+| **Mevy** | osa | FreeBSD 15 / host | Hermes Agent (upstream, CLI) | **Consolidated into hermes-osa** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — under hermes-osa** |
+| **hermes-osa** | osa | FreeBSD 15 / host | Hermes Agent (FreeBSD fork) | **Orchestrator + board host (always-on VPS): chat + gateway** | ${HERMES_OSA_BOT} (OSA-bot) | **LIVE — chat + Telegram** |
+| Codex | osa | FreeBSD 15 / jail | Codex CLI | ISO builds, validation | — (CLI) | LIVE |
+| **domedog-agent** | domedog | Ubuntu 24.04 / host | Colibri board agent | Headless Linux media/compute lane (image-render, ffmpeg, rust/go/py/node) | — | **LIVE — on central board 2026-06-19** |

 > **Mevy vs hermes-osa distinction**: Mevy (${HERMES_OSA_BOT} / OSA-bot) has been consolidated into hermes-osa as of 2026-06-17. The Telegram bot token was migrated from the old backup .env. hermes-osa now runs both the local CLI chat and the Telegram gateway (polling mode, tmux session `hermes-gateway`).
 >
@@ -64,11 +68,11 @@ on any host fills in its own row. Source of truth for facts is the probe — not

 ## 2. Host hardware & facts (one row per host)

-| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
-| ----------- | -------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
+| Host | Tailscale IP | OS / Kernel | Virt | CPU | vCPU | RAM | Swap | Disk (free) | GPU | Probed | By |
+| ----------- | ---------------- | ---------------------------------- | --------------------- | -------------------------------------- | ---- | ------- | --------------------- | ---------------------------- | ---------------------- | ---------- | ------ |
 | **domedog** | ${DOMEDOG_TS_IP} | Ubuntu 24.04.4 / 6.8.0-117 | KVM | AMD EPYC 7543P (32-core host) | 2 | 7.8 GiB | 2.0 GiB | 100 GB QEMU (51G free) | none (headless) | 2026-06-17 | Claude |
-| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
-| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |
+| **debby** | ${DEBBY_TS_IP} | Debian 13 / 6.12.90+deb13.1-amd64 | bare metal | AMD Ryzen 7 5700U (8-core) | 16 | 15 GiB | 15 GiB | nvme0n1p2 453G (23G free) | Radeon Graphics (iGPU) | 2026-06-17 | Hermes |
+| **osa** | ${OSA_TS_IP} | FreeBSD 15.0-RELEASE-p10 / GENERIC | not reported by probe | Intel Core Processor (Haswell, no TSX) | 6 | 11 GiB | not reported by probe | ZFS pool: zroot (23.4G free) | not reported by probe | 2026-06-17 | Pi |

 ### Disk discipline (check, don't guess)

@@ -87,6 +91,25 @@ Disk is a first-class fact, same as OS or CPU — **measure it before you act, d
 This is the survivability principle applied to storage: a host that silently fills up is a
 host that fails. What you guess will be wrong; what you probe will be right.

+### Cost provenance (invoice/control-panel facts, not guesses)
+
+Hosting spend is a first-class fleet fact, but it must stay non-secret: record provider,
+plan/SKU, region, verified monthly cost, and the proof source. Do **not** commit invoice
+IDs, account numbers, billing addresses, or payment details. If a provider is inferred from
+an IP range, mark it `TBD` until the control panel or invoice confirms it.
+
+| Host / candidate | Provider | Plan / SKU | Region | Monthly cost | Billing cycle | Role paid for | Source / proof | Status / notes |
+| ---------------------------------- | ------------------------------------------------------------------ | ----------------------------------------- | ------ | ----------------- | ------------- | ------------------------------------------------ | ------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| **osa** | TBD (verify; OVHcloud is suspected but not invoice-confirmed here) | TBD | TBD | TBD | TBD | always-on orchestrator + board + Hermes gateway | operator invoice/control panel needed | Existing always-on VPS; do not treat IP range as proof. |
+| **domedog** | TBD | TBD | TBD | TBD | TBD | Linux media/compute lane | operator invoice/control panel needed | Existing Linux VM; cost not tracked yet. |
+| **debby** | self-owned laptop | — | local | utility/power TBD | — | intermittent secondary agent + soul backup | local device + utility rate if needed | Not an always-on hub; power cost only matters when left on. |
+| **mother-build** (candidate) | proposed OVHcloud | TBD: Public Cloud hourly or Eco/dedicated | TBD | TBD | TBD | FreeBSD build host / poudriere / Rust+zot builds | OVH quote needed before purchase | Prefer on-demand if builds are infrequent; dedicated only if build demand justifies standing cost. |
+| **ML350p Gen8** (candidate/retire) | self-hosted hardware | owned hardware | local | power TBD | utility bill | fallback build host only | measured watts + actual €/kWh needed | Do not make critical paths depend on it until reliability and TCO beat cloud. |
+
+Cost discipline mirrors disk discipline: measure before action. For self-hosted hardware,
+calculate monthly power with `watts / 1000 * 24 * 30 * €/kWh` using measured idle/load
+wattage and the actual utility rate; do not compare cloud invoices to guessed electricity.
+
 ---

 ## 3. Per-host detail (expand as needed)
@@ -113,7 +136,7 @@ host that fails. What you guess will be wrong; what you probe will be right.
     in `~/.colibri/` — `colibri_cmd.py` (raw JSON), `colibri_poll.py`, `colibri_task_done.py`.
   - **Validated**: register → scheduler routed an `image-render` task to domedog → poller saw
     it → worker marked it `done` (2026-06-19).
-  - **Executor pending (decision required)**: domedog *receives* capability-matched tasks, but
+  - **Executor pending (decision required)**: domedog _receives_ capability-matched tasks, but
     no persistent execution loop runs yet — until one does, routed tasks sit `started` (no
     lease/reaper). Decide what executes (Claude Code worker / script) and with what authority
     before relying on autonomous domedog task completion.
@@ -147,9 +170,9 @@ host that fails. What you guess will be wrong; what you probe will be right.
 - **Claude Code** — installed (path: `/home/clawdie/.npm-global/bin/claude`), no dedicated role yet.
 - **Provider stack** (hermes-osa):
   ```yaml
-  provider: deepseek # primary — direct credits, proven DEEPSEEK_OK
+  provider: deepseek # primary — direct credits, proven DEEPSEEK_OK
   default: deepseek-chat
-  fallback: openrouter # available manually, not auto-fallback configured yet
+  fallback: openrouter # available manually, not auto-fallback configured yet
   ```
 - **Z.AI**: deferred (not configured for hermes-osa; available via OpenRouter if needed)
 - **Telegram**: LIVE — ${HERMES_OSA_BOT}, polling mode, connected 2026-06-17
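
Review note, not part of the patch: the monthly power formula added under _Cost provenance_ in `docs/HOST-MATRIX.md` (`watts / 1000 * 24 * 30 * €/kWh`) can be sanity-checked with a short sketch like the one below. The function name, its parameters, and the sample wattage and rate are hypothetical and chosen only for illustration; real inputs should be measured watts and the rate from the actual utility bill.

```python
def monthly_power_cost(watts: float, eur_per_kwh: float,
                       hours_per_day: float = 24.0, days: int = 30) -> float:
    """Monthly electricity cost in EUR for a host drawing `watts` on average.

    Mirrors the formula in the doc: watts / 1000 * 24 * 30 * EUR/kWh.
    The defaults (24 h/day, 30 days) are the doc's always-on assumption.
    """
    if watts < 0 or eur_per_kwh < 0:
        raise ValueError("watts and eur_per_kwh must be non-negative")
    # Convert average draw to energy used over the month, in kWh.
    kwh_per_month = watts * hours_per_day * days / 1000.0
    return kwh_per_month * eur_per_kwh


if __name__ == "__main__":
    # Hypothetical inputs: 120 W measured idle draw, 0.30 EUR/kWh utility rate.
    idle_cost = monthly_power_cost(watts=120, eur_per_kwh=0.30)
    print(f"idle: {idle_cost:.2f} EUR/month")
```

Running it once with measured idle wattage and once with measured load wattage gives a cost range that can be set against a cloud quote, which is the comparison the cost table asks for.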