Resolves the collision class where a tenant named `clawdie` would
produce `clawdie_ops` clashing with the platform's shared ops DB.
Two constants instead of one:
- service name / brand / UNIX user: `clawdie` (one of them)
- platform namespace prefix for shared resources: `system`
Shared DBs become `system_ops` / `system_brain` / `system_skills`;
shared dataset becomes `zroot/system-runtime`. `system` joins the
reserved_host_labels list so the same collision cannot reappear at
the FQDN layer.
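The two-constant scheme can be sketched as follows. Only the literals `clawdie`, `system`, and the `_ops` / `_brain` / `_skills` suffixes come from this change; the function names, the `<tenant>_<kind>` pattern for tenant databases, and the shape of the reserved-label set are assumptions made for illustration.

```typescript
// Hypothetical helper names; only the string literals come from the commit.
const SERVICE_NAME = "clawdie";       // service name / brand / UNIX user
const PLATFORM_NAMESPACE = "system";  // prefix for shared platform resources

type DbKind = "ops" | "brain" | "skills";

// Shared platform databases: system_ops, system_brain, system_skills.
function sharedDbName(kind: DbKind): string {
  return `${PLATFORM_NAMESPACE}_${kind}`;
}

// ASSUMPTION: tenant databases follow <tenantId>_<kind>, e.g. clawdie_ops.
function tenantDbName(tenantId: string, kind: DbKind): string {
  return `${tenantId}_${kind}`;
}

// ASSUMPTION: shape of the reserved list; the commit only says `system` joins it.
const RESERVED_HOST_LABELS: ReadonlySet<string> = new Set<string>([PLATFORM_NAMESPACE]);

function isHostLabelAllowed(label: string): boolean {
  return !RESERVED_HOST_LABELS.has(label.toLowerCase());
}
```

With this split, a tenant named after `SERVICE_NAME` still gets `clawdie_ops`, which no longer equals the shared `system_ops`, and `system` itself is refused as a host label.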
Also adds:
- Vocabulary section distinguishing operator account, service
account, service name, platform namespace, assistant display name,
tenant id (six terms, one bug class each)
- Install-paths section formalizing fresh-machine (ISO) vs
existing-host flows; `just install` is the platform install, never
the OS install
- Service-account override field as bootstrap config, not an
onboarding prompt; default stays `clawdie`
- Operator-account treatment: existing-host path checks for it;
Clawdie never renames or recreates it
AGENTS.md "Multitenant Rules" updated to match.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: pass — Tests 2099 passed (2099)
Single source of truth at docs/internal/MULTITENANT.md (~430 lines)
replaces the previous spread across NAMING-POLICY, ARCHITECTURE,
HOST-REALITY, INTERNAL-ROLLOUT, ROADMAP, HANDOFF, AGENT-WORKFLOW, and
PLATFORM-V2-MANIFESTO. Load-bearing content (vision, conceptual model,
naming schema, surfaces, controlplane, publishing, conventions) is
folded in; current-state runbooks, phased migration plans, and
deployment-drift snapshots are dropped — design phase, fresh start.
Identity decision: drop PLATFORM_ID / PLATFORM_SERVICE_NAME /
PLATFORM_RUNTIME_USER. Platform identity is the constant 'clawdie'
baked into code; ASSISTANT_NAME is display-only and never feeds infra
names; TENANT_ID is for additive tenants only. AGENTS.md gains a short
"Multitenant Rules" block carrying the day-to-day do/don't extract.
Cross-references in AGENT-WORKFLOW-CHECKLIST, AGENT-WORKTREE-WORKFLOW,
and the two freebsd-jail-implementation docs updated to point at the
new file.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: pass — Tests 2099 passed (2099)
Earlier version claimed the readiness wait was "in the wrong place" —
only running in the 5-min periodic check. That was wrong:
runControlPlaneChecks() is called at src/index.ts:1087, before
initDatabase / loadState / initMemoryPool. The wait already gates
bootstrap.
Trimmed the doc to the real follow-up scope: swap tcpReachable for
pg_isready, add HOST_DB_READINESS_TIMEOUT_MS env (default 60s),
minimal logging, one timeout-path test. No move, no restructuring.
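A minimal sketch of that follow-up scope, assuming the probe, clock, sleep, and logger are injected so the timeout path can be tested without a real database. `readinessTimeoutMs`, `waitForDbReady`, and the `WaitDeps` shape are hypothetical names; only `HOST_DB_READINESS_TIMEOUT_MS` and the 60s default come from the text above.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_READINESS_TIMEOUT_MS = 60000; // "default 60s"

// Read the override; fall back to the default on missing or invalid values.
function readinessTimeoutMs(env: Env): number {
  const raw = env["HOST_DB_READINESS_TIMEOUT_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_READINESS_TIMEOUT_MS;
}

// Hypothetical dependency bundle; `probe` would wrap pg_isready in real code.
interface WaitDeps {
  probe: () => Promise<boolean>;
  now: () => number;
  sleep: (ms: number) => Promise<void>;
  log: (msg: string) => void;
}

// Poll until the probe succeeds or the deadline passes.
// No probe is issued once the deadline is reached, so the timeout stays crisp.
async function waitForDbReady(
  timeoutMs: number,
  intervalMs: number,
  deps: WaitDeps,
): Promise<boolean> {
  const deadline = deps.now() + timeoutMs;
  let attempts = 0;
  while (deps.now() < deadline) {
    attempts += 1;
    if (await deps.probe()) {
      deps.log(`db ready after ${attempts} probe(s)`);
      return true;
    }
    const remaining = deadline - deps.now();
    if (remaining <= 0) break;
    await deps.sleep(Math.min(intervalMs, remaining));
  }
  deps.log(`db not ready after ${timeoutMs}ms (${attempts} probe(s))`);
  return false;
}
```

Injecting the clock and sleep keeps the single timeout-path test fast and deterministic.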
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 2081 passed (2084)
Improvement over no-wait, but two follow-ups before §E is closed:
- default probe is `tcpReachable` — pg opens its socket during WAL
recovery while rejecting queries with "starting up", so TCP-open
is not the same as accepting connections. Need a SELECT 1 /
pg_isready check.
- wait runs inside the 5-minute periodic controlplane check, not at
Mevy bootstrap. If anything in startup touches DB before the
first tick, the wait does not gate the actual race.
Plus: 30s default may be tight post-incident, no logs during the
wait, no env override, post-deadline extra probe makes timeout
fuzzy, and the "3 failed tests" trailer is still present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 2081 passed (2084)
Implementation review of zai/Codex's "Harden host DB reboot path."
Direction is right but three blockers:
- snapshots are not atomic (two separate `zfs snapshot` calls
reproduce the pgwal/pgdata skew that caused the incident)
- `serviceMaybeStop` swallows real `onestatus` errors as
"already stopped" — can proceed to checkpoint pg with mevy
still running
- committed with 3 failing tests
Plus smells around missing readiness wait (§E), no spawnSync
timeouts, duplicated pool resolution, and an unrelated bonus fix
smuggled into the commit.
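The `serviceMaybeStop` blocker comes down to treating only a positively identified "not running" result as already stopped. A minimal sketch of that classification follows; the result shape, the function names, and above all the exit-code mapping are assumptions for illustration, not the behavior of the reviewed code or of any particular rc.d script.

```typescript
// Hypothetical result shape for one `service <name> onestatus` invocation.
interface StatusProbe {
  exitCode: number | null; // null when the process never produced an exit code
  spawnError?: string;     // set when the command could not be run at all
}

type ServiceState = "running" | "stopped" | "unknown";

// ASSUMPTION: exit 0 means running and `notRunningCode` means cleanly stopped.
// Confirm the real codes on the target host before relying on this mapping.
function classifyStatus(probe: StatusProbe, notRunningCode = 1): ServiceState {
  if (probe.spawnError !== undefined || probe.exitCode === null) return "unknown";
  if (probe.exitCode === 0) return "running";
  if (probe.exitCode === notRunningCode) return "stopped";
  return "unknown";
}

type StopDecision = "stop-then-continue" | "continue" | "abort";

// Anything that is not clearly running or clearly stopped aborts the sequence.
function decideStop(state: ServiceState): StopDecision {
  switch (state) {
    case "running":
      return "stop-then-continue";
    case "stopped":
      return "continue";
    case "unknown":
      return "abort";
  }
}
```

The point of the third state is that a failed status call can no longer be mistaken for "already stopped" before the pg checkpoint.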
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 2 failed | 2080 passed (2082)
Addendum to d456aa4. Three gaps that would have left the plan
implementable-but-unsafe:
- snapshot step now mandates a single recursive ZFS snapshot of the
common parent; two separate snapshots reproduce the pgwal/pgdata skew
that caused the 30.apr.2026 incident
- new §E: Mevy startup must poll for DB readiness (pg_isready or
equivalent); rc.d REQUIRE only orders start invocations, not actual
connect-ability
- §A now specifies failure semantics for the maintenance-reboot op
(each pre-reboot step aborts on failure; reboot only schedules after
all prior steps succeed)
- pg_resetwal explicitly demoted to non-recovery-path
- note that CHECKPOINT before clean stop is belt-and-suspenders
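The snapshot and failure-semantics points can be sketched together. `zfs snapshot -r` is the standard recursive form; the dataset name, the step names, and the `runMaintenanceReboot` helper are hypothetical and only illustrate "abort on first failure, schedule the reboot last".

```typescript
// Build argv for ONE recursive snapshot of the common parent dataset,
// so all child datasets are captured at the same point in time.
function recursiveSnapshotArgs(parentDataset: string, snapName: string): string[] {
  return ["zfs", "snapshot", "-r", `${parentDataset}@${snapName}`];
}

// Hypothetical step shape: `run` returns true on success, false on failure.
interface Step {
  name: string;
  run: () => boolean;
}

interface Outcome {
  rebootScheduled: boolean;
  completed: string[];
  failedStep?: string;
}

// Run pre-reboot steps in order; stop at the first failure.
// The reboot is scheduled only after every prior step has succeeded.
function runMaintenanceReboot(steps: Step[], scheduleReboot: () => void): Outcome {
  const completed: string[] = [];
  for (const step of steps) {
    if (!step.run()) {
      return { rebootScheduled: false, completed, failedStep: step.name };
    }
    completed.push(step.name);
  }
  scheduleReboot();
  return { rebootScheduled: true, completed };
}
```

For example, `recursiveSnapshotArgs("tank/pg", "pre-reboot")` (a made-up dataset) yields a single command covering every child dataset, instead of two calls that can drift apart.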
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: pass — Tests 2080 passed (2080)
---
Build: pass | Tests: FAIL — Tests 2 failed | 2080 passed (2082)
Replaces the decision-tree handoff with a concrete step-by-step test guide
for Codex to run on the live host. Documents what Claude already shipped,
the exact verification commands, the nginx pattern question (direct vs proxy),
and a prioritized simplification assessment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 8 failed | 2009 passed (2017)
- Update ARCHITECTURE.md Prompt Assembly section to document runtime-manifest
as a new context layer injected per-message, explaining it answers
coherence questions: 'what repo/branch/skills do I have?'
- Update docs/internal/AGENT-HARNESS-V2.md Phase 5 to detail both System State
and Runtime Manifest as complementary context blocks, explaining the
coherence gap they solve together
- New docs/internal/RUNTIME-MANIFEST-DESIGN.md: complete specification
  - Why: agents had infrastructure facts but couldn't see them
  - What: machine-generated inventory from .git/library.yaml/artifacts
  - How: fresh per-message, cheap local sources, compact XML-like format
  - Where: injected in system prompt alongside SOUL/IDENTITY files
  - Testing: coverage for git parsing, skills counting, specialist discovery
The three-layer coherence system is now:
1. Hand-written identity (SOUL/USER/IDENTITY/MEMORY) — philosophy, stable
2. Machine-generated manifest (RUNTIME_MANIFEST) — inventory, fresh
3. Live system state (system-state.ts) — operations, current
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 8 failed | 2009 passed (2017)
Cross-repo analysis after 975f37f landed. TypeScript setup layer is
correct in isolation; gaps are all at the ISO firstboot boundary:
no setup.txt reader, pool name mismatch, mode naming divergence,
AGENT_DOMAIN derivation missing, Slovenian locale defaults, and
system.env unknown to the ISO.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 4 failed | 1996 passed (2000)
Round 5 in the handoff doc captures the five agreed adopt-mode
decisions (INSTALL_MODE field, fill-blanks default, identity
mismatch blocks, Telegram identity changes require explicit flag,
fingerprint gate) so they survive into Codex's design doc.
Implementation doc gets an "Adopt Mode (V1.1)" section with the
proposed 4-task split + per-field freeze contract table, plus a
task-4 followup subsection naming the legacy `operators` table
sync gap and the unification plan with Codex's
setup/operator-auth.ts. scripts/set-operator.ts gets a TODO(unify)
header pointing at the same gap.
first-boot.md notes adopt mode is V1.1 and to back up before
reflashing until then.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1972 passed (1975)
Net -206 lines across install docs while making the V1 first-boot
model the recommended path:
- install/index: restructure to put first-boot + ISO as the
recommended path; existing-host install demoted.
- install/iso: collapse to image selection + USB write; defer the
V1 setup.txt flow to first-boot.md (saves ~30 lines).
- install/requirements: drop @Andy/Mac-launchd/personal-config
sections and the duplicated memory/session/task model that lives
in architecture docs (saves ~150 lines).
- install/install: reframe the onboarding step as setup.txt-first
with TUI as the explicit fallback.
- install/fresh-install-checklist: replace bsddialog wizard
milestone with setup.txt seed milestone, note TUI fallback case.
- architecture/deployment-models: ISO model now says
"setup.txt seed, TUI fallback".
- architecture/admin-panel: note planned set-operator menu entry.
- ISO-FIRST-BOOT-IMPLEMENTATION: sharpen task 4 reasoning —
clawdie-admin exists but as a TUI launcher, not a CLI router.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1972 passed (1975)
Lands task 6 in skeleton form: docs/public/install/first-boot.md
covers the four required lines, optional fields (profile, locale,
dashboard credentials, SSH key, headless password), the post-install
set-operator command, and how to switch off OpenRouter. Two
TBD blocks remain: "Where setup.txt lives" (waits on task 5
delivery-mechanism validation) and "Troubleshooting" (waits on real
failure traces from the ISO build).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1972 passed (1975)
Lands task 4 from the ISO first-boot implementation split as a
standalone scripts/set-operator.ts (matches existing scripts/
convention — no clawdie-admin umbrella). Reuses
ensureControlplaneBootstrapOperator() for the Better Auth signUp
path. Prompts password via stdin with echo suppressed; refuses
non-TTY runs; updates OPERATOR_PASSWORD in .env (mode 0600).
First-set only — rotation goes through the dashboard.
Both planning docs updated to drop "notional" references and point
at the real npm run set-operator -- <email> command.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1972 passed (1975)
Drops ROOT_PASSWORD (root locked by default), adds SSH_AUTHORIZED_KEY
as the preferred headless box-access path, adds CLAWDIE_USER_PASSWORD
as fallback only. Parser warns visibly when plaintext passwords are
present in setup.txt. Implementation doc task 1 (parser) and task 5
(delivery validation) extended to cover the new fields.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1958 passed (1961)
Folds in Codex's three reservations on Round 2. Round 2 field list is
now superseded by Round 3 below.
Resolutions:
- "First registration wins" registration window dropped. The
mitigations Round 2 proposed (IP logging, post-install summary)
were detection, not prevention — useless against an attacker on a
shared LAN who registered first. Replaced with Option α: if
dashboard credentials are missing from setup.txt, the dashboard
waits until the operator runs `clawdie-admin set-operator <email>`
post-install. Telegram remains the operator interface in the
meantime. Option β (Telegram CONFIRM flow for registration
requests) documented as the upgrade path if dashboard becomes
load-bearing enough to justify the extra friction.
- PROFILE=balanced moved from "required" to "prefilled." If it
always defaults, calling it required misrepresents the operator's
cognitive load. The line stays in setup.txt as visible
documentation, not as a question the operator must answer.
- ASSISTANT_NAME promoted to "recommended" tier; HOSTNAME demoted
to "optional with derived default." The project currently
conflates two distinct concepts (system-admin hostname vs
emotional assistant identity); for first-boot, the emotional one
is what the operator cares about. HOSTNAME defaults to lowercased
ASSISTANT_NAME.
Round 3 field list (authoritative):
- 3 required (OpenRouter key, Telegram bot token, Telegram admin ID)
- 1 recommended (ASSISTANT_NAME)
- 1 prefilled (PROFILE=balanced)
- 4 optional (TIMEZONE, HOSTNAME, OPERATOR_EMAIL, OPERATOR_PASSWORD)
Cognitive bar before first boot: four lines the operator has to fill
in (the three required fields plus ASSISTANT_NAME). Everything else
has a sensible fallback.
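The Round 3 tiers can be expressed as a small seed-file reader. This is a sketch under stated assumptions: the `KEY=VALUE` line format, `#` comments, and the `parseSetupTxt` name are illustrative, and the three required key names are taken from the Round 2 field list.

```typescript
// Required keys, using the names from the Round 2 field list.
const REQUIRED_KEYS = ["OPENROUTER_API_KEY", "TELEGRAM_BOT_TOKEN", "TELEGRAM_ADMIN_ID"];

interface SeedResult {
  values: Record<string, string>;
  missing: string[]; // required keys that were absent or empty
}

// ASSUMPTION: setup.txt is plain KEY=VALUE lines; blank lines and # comments are skipped.
function parseSetupTxt(text: string): SeedResult {
  const values: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", ignore the line
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim();
    if (value !== "") values[key] = value;
  }
  const missing = REQUIRED_KEYS.filter((key) => values[key] === undefined);

  // Prefilled tier: PROFILE defaults to "balanced" when the line is absent.
  if (values["PROFILE"] === undefined) values["PROFILE"] = "balanced";

  // Optional tier: HOSTNAME defaults to the lowercased ASSISTANT_NAME.
  const assistant = values["ASSISTANT_NAME"];
  if (values["HOSTNAME"] === undefined && assistant !== undefined) {
    values["HOSTNAME"] = assistant.toLowerCase();
  }
  return { values, missing };
}
```

The sketch does not validate that the derived hostname is a legal host label, and it leaves the default for a missing ASSISTANT_NAME open, since the text above does not state one.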
Doc split (Codex's recommendation to extract a V1 onboarding spec
doc plus an implementation task breakdown) acknowledged as the right
next move, but premature — two items remain open (seed delivery
mechanism, clawdie-admin set-operator surface). Split happens once
those resolve.
Section explicitly lists what's now firmly decided vs still open
after Round 3 so future readers don't re-litigate closed questions
or silently commit open ones.
No code changes. Pure planning convergence.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1958 passed (1961)
Captures the converged state after Codex's pushback on the Round 1
take. Three pushbacks accepted with resolutions; one open question
(OPERATOR_EMAIL / OPERATOR_PASSWORD in setup.txt) resolved as a
hybrid; final V1 field list locked.
Resolutions:
- Seed-partition specifics (256 MB FAT32, label "CLAWDIE-SEED")
demoted from "spec" to "direction." Architectural commitment is
file-based seed import; exact mechanism stays open until validated
against the ISO repo and real flashers.
- Auto-wipe of setup.txt on import is dropped. Replaced with an
installer warning to the operator (immediate + post-install
summary) telling them to reformat the media. Keeps multi-machine
reflash working; treats credential hygiene as documented operator
action, not silent destruction.
- PROFILE explicitly sets all three (chat primary, fallback,
compaction) as a coordinated bundle. Splitting them re-creates the
configuration sprawl the profile is supposed to prevent. Advanced
operators drop down to explicit lines that override the profile
mapping.
OPERATOR_EMAIL / OPERATOR_PASSWORD resolution:
- Both optional in setup.txt.
- If both present: installer pre-creates the operator account in
Better Auth on first boot. Unattended-install path.
- If either missing: Better Auth opens a "first registration wins"
window (default 30 min, configurable) for local-network IPs only.
First person to hit /dashboard registers through the normal sign-up
form. Window auto-closes on success or timeout.
- Bound to local-network IPs via existing CONTROLPLANE_AUTH_MODE
semantics; full source IP logged; "operator account registered
from <ip>" surfaced in post-install summary so hijacked
registration is visible immediately.
- Recovery via "clawdie-admin reopen-registration --minutes 30" CLI
if window expires.
Final V1 field list: 5 required (OPENROUTER_API_KEY,
TELEGRAM_BOT_TOKEN, TELEGRAM_ADMIN_ID, PROFILE=balanced,
HOSTNAME=clawdie) + 3 optional (TIMEZONE, OPERATOR_EMAIL,
OPERATOR_PASSWORD). Anything else gets configured from the live
system, not from setup.txt.
Three items explicitly listed as still-open after Round 2 (seed
mechanism, registration window default, post-install summary
delivery channel) so they don't get silently committed.
No code changes. The Claude Take + Round 2 sections are scoped to
lift in the same commit that lands the actual seed-import
implementation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1958 passed (1961)
Answers all six review questions in the handoff doc with a single
recommended V1 design (writable seed partition + profile indirection
+ TUI fallback), two realistic alternatives (post-bootstrap web/SSH
config, two-USB), eight named risks, and a complete eight-field
setup.txt template.
Operator-facing rename folded in: setup.env → setup.txt. The .env
extension is a developer convention; setup.txt opens cleanly in any
text editor on Win/Mac/Linux without configuration, which removes
one of the largest non-technical-operator friction points in the
flow.
Profile indirection (PROFILE=balanced/economy/quality) keeps model
IDs out of operator hands at install time and lets the team change
the validated mapping over time without breaking old setup.txt
files. The installer resolves the profile to actual
PI_TUI_PROVIDER/PI_TUI_MODEL/LLM_FALLBACK_* at install time.
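A minimal sketch of the profile indirection, assuming a static lookup table. The provider and model strings below are placeholders, not the validated mapping, and `resolveProfile` is a hypothetical name; only the three profile names and the target variable names come from the text above.

```typescript
type Profile = "balanced" | "economy" | "quality";

interface ModelEnv {
  PI_TUI_PROVIDER: string;
  PI_TUI_MODEL: string;
  LLM_FALLBACK_PROVIDER: string;
  LLM_FALLBACK_MODEL: string;
}

// PLACEHOLDER values only: the real mapping is maintained by the team
// and may change without touching old setup.txt files.
const PROFILE_MAP: Record<Profile, ModelEnv> = {
  balanced: {
    PI_TUI_PROVIDER: "example-provider",
    PI_TUI_MODEL: "example/balanced-primary",
    LLM_FALLBACK_PROVIDER: "example-provider",
    LLM_FALLBACK_MODEL: "example/balanced-fallback",
  },
  economy: {
    PI_TUI_PROVIDER: "example-provider",
    PI_TUI_MODEL: "example/economy-primary",
    LLM_FALLBACK_PROVIDER: "example-provider",
    LLM_FALLBACK_MODEL: "example/economy-fallback",
  },
  quality: {
    PI_TUI_PROVIDER: "example-provider",
    PI_TUI_MODEL: "example/quality-primary",
    LLM_FALLBACK_PROVIDER: "example-provider",
    LLM_FALLBACK_MODEL: "example/quality-fallback",
  },
};

function isProfile(name: string): name is Profile {
  return Object.prototype.hasOwnProperty.call(PROFILE_MAP, name);
}

// Resolve a PROFILE value from setup.txt into concrete runtime settings.
function resolveProfile(name: string): ModelEnv {
  const key = name.trim().toLowerCase();
  if (!isProfile(key)) {
    throw new Error(`unknown PROFILE "${name}"`);
  }
  return PROFILE_MAP[key];
}
```

Because operators only ever write the profile name, changing a row in the table is enough to move every future install to a new validated model.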
The take also flags the second onboarding cliff (Telegram BotFather
flow, easily underestimated) and the V2 follow-up (web-based setup
wizard) so the seed-file work isn't throwaway when the better UX
ships later.
No code changes. Pure handoff response in
docs/internal/ISO-FIRST-BOOT-SECRETS-HANDOFF.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1958 passed (1961)
Codex's 6-step plan from the 26.apr.2026 chat session (defaults policy
→ state policy → token burn → truth-surface polish → smoke checklist →
ISO dry run) lands in BOOTABLE-ISO-PLAN-V1.md, with six refinements
integrated:
- Already-resolved snapshot section so reading-cold agents do not
re-open closed questions (provider fallback works end-to-end,
cooldown path normalized in 6983415, Token Ledgers in
/budgetreport, Telegram basics stable).
- Step 1 explicitly absorbs the previously-separate Open Questions
list (default primary, default fallback, free-tier policy, identity
wording) so "freeze defaults" actually closes them rather than
parallel-tracking. Identity wording is named as the same root cause
as the fe14fad fixture failures, not a separate concern.
- Step 1 notes the cost-amplification trap in "compaction follows
primary" (when fallback is paid-stable, compaction follows there
too — burn during cooldown can amplify).
- Step 3 (token burn) promoted ahead of polish work because a fresh
ISO install that quietly eats budget leaves a worse first
impression than any of the polish items.
- Step 5 smoke checklist gains per-item triage hints so the dry-run
operator knows what to check first when an item fails.
- Step 6 explicitly notes that a single dry run only catches what
shows up on that one run; two runs on different hardware is the
post-1.0.0 bar, so nobody is surprised when the first prod install
hits something the lone dry run missed.
Each step is now tagged by which agent class can claim it
(decision/docs / code / deploy) so independent claims do not stall on
"only Codex can do this."
Original Goal / Success Criteria / Non-Goals / Operator DoD / Working
Rule preserved verbatim at the top and bottom of the doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1958 passed (1961)
Closes the .env drift Codex flagged on 26.apr.2026: the live deploy
runtime had switched to openrouter/openai/o3 as chat default and
unpinned AGENT_COMPACTION_PROVIDER, but neither change was reflected in
.env.example. Effect: a fresh ISO build or reinstall would have started
in the old (silent-no-reply) configuration.
This commit does not change the current zai/glm-5-turbo example primary
— some operators have working zAI keys with budget — but adds a
clearly-marked "known-stable alternative primary" block that documents
the openrouter/openai/o3 setup the operator validated today, with the
rationale (zAI 5-hour cap → silent no-reply).
The AGENT_COMPACTION_PROVIDER block now explains both modes: unset
(compaction follows chat runtime, including fallback) is the validated
default; pinning decouples compaction from chat fallback for cost or
stability reasons. The previous one-liner left both pieces undocumented.
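The two modes reduce to one decision, sketched here with a hypothetical `resolveCompactionProvider` helper. The provider names in the usage note are the ones mentioned in this message; how the real runtime tracks the active chat provider is not shown and is assumed to be passed in.

```typescript
type Env = Record<string, string | undefined>;

// Unset (or blank) AGENT_COMPACTION_PROVIDER: compaction follows whatever
// provider chat is using right now, including the fallback during a cooldown.
// Set: compaction stays pinned regardless of chat fallback.
function resolveCompactionProvider(env: Env, activeChatProvider: string): string {
  const pinned = (env["AGENT_COMPACTION_PROVIDER"] ?? "").trim();
  return pinned !== "" ? pinned : activeChatProvider;
}
```

With the variable unset and chat currently on `openrouter`, compaction also runs on `openrouter`; with `AGENT_COMPACTION_PROVIDER=zai`, it stays on `zai` even while chat has fallen back.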
provider-fallback.md gets a matching "Compaction interaction" note so
the reading order from the operator guide ends up at the same answer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 3 failed | 1956 passed (1959)
Three findings from the operator's afternoon session, plus review
notes on the ISO plan, captured for Codex to act on or defer. One docs
change and one .env.example block pre-applied; the rest are notes.
Pre-applied in this commit:
- docs/public/operate/provider-fallback.md: example fallback model
changed from meta-llama/llama-3.3-70b-instruct:free to openai/o3 (paid,
stable). New "Choosing a fallback model" subsection warns explicitly
that free-tier models are unsafe as fallback targets — they rate-limit
silently and the failure mode is indistinguishable from "agent dead."
Operator hit this in production today.
- .env.example: LLM_FALLBACK_PROVIDER, LLM_FALLBACK_MODEL,
LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS now documented (were missing
entirely), with the same free-tier warning inline.
New session block in docs/internal/MULTITENANT-HANDOFF.md:
- Finding 1 (V1-blocker): live .env on deploy host needs the same model
swap; consider startup WARN if LLM_FALLBACK_MODEL ends with :free;
decide whether silent rate-limit-no-output should bubble as a visible
Telegram error.
- Finding 2: token ledger views (/usage, /tokens, /policy) are
arithmetically reconcilable but ask operator to mental-diff across
three places. Recommended fix is a "Token Ledgers" section in
/budgetreport showing quota + activity + reset-archived together.
- Finding 3: verify whether the mevy 0→14054 spent_today snapshot was
a reset or a recording-path bug in recordTokenSpend (a73f211).
- Finding 4: review notes on BOOTABLE-ISO-PLAN-V1.md — promote identity
wording from open question to Priority; split Priority A into
regeneratable-status vs persistent-state; add synthetic-cap test path
for fallback verification; add brief risk register.
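The startup WARN suggested in Finding 1 could look roughly like this. It is a sketch only: `fallbackModelWarning` is a hypothetical name, the message wording is invented, and nothing here is implemented in this commit.

```typescript
type Env = Record<string, string | undefined>;

// Return a warning string when the configured fallback is a free-tier model,
// or null when there is nothing to warn about.
function fallbackModelWarning(env: Env): string | null {
  const model = (env["LLM_FALLBACK_MODEL"] ?? "").trim();
  if (model.endsWith(":free")) {
    return `WARN: LLM_FALLBACK_MODEL=${model} is a free-tier model; ` +
      "it may rate-limit silently and look like a dead agent";
  }
  return null;
}
```

Returning the message instead of logging directly keeps the check easy to call from whichever startup path ends up owning it.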
No code changes. Docs and a single .env.example block.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 2 failed | 1949 passed (1951)
Brings the public docs in line with what shipped on multitenant over the
last few days. Three new operator-facing pages, updates to three
existing pages and the README, and a CHANGELOG batch.
New pages (docs/public/operate/):
- operator-commands.md — single reference for all Telegram slash commands,
grouped by purpose (status, structured reports, runtime, sessions, admin
actions) with auth gating per command. Previously only in-bot /help text.
- provider-fallback.md — operator guide for the cooldown layer: env vars,
how cooldowns are detected and tracked, /policy surfacing, /clearcooldown
for manual release, the configured/effective/actual observability triple.
Includes a "path convention note" flagging that the cooldown file still
uses the legacy $CLAWDIE_VAR_DIR resolution while test/build status
files have moved to repo tmp/ — divergence to harmonize later in code.
- structured-reports.md — explains the Observed/Interpretation/Operator
Notes pattern, lists the six structured reports, documents the
test/build pipeline contract (status JSON schema + new $AGENT_STATUS_DIR
→ $CLAWDIE_VAR_DIR → tmp/status precedence Codex landed in 1389e17),
and covers free-text routing (classifyReportIntent + isOpsFlavored).
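The status-directory precedence described for structured-reports.md can be sketched as a pure function. `resolveStatusDir` is a hypothetical name, and the `status` subdirectory under `$CLAWDIE_VAR_DIR` is an assumption; the code in 1389e17 is the authority on the exact paths.

```typescript
type Env = Record<string, string | undefined>;

function nonEmpty(value: string | undefined): string | undefined {
  const trimmed = value === undefined ? "" : value.trim();
  return trimmed === "" ? undefined : trimmed;
}

// Precedence: $AGENT_STATUS_DIR, then $CLAWDIE_VAR_DIR, then repo tmp/status.
function resolveStatusDir(env: Env, repoRoot: string): string {
  const explicit = nonEmpty(env["AGENT_STATUS_DIR"]);
  if (explicit !== undefined) return explicit;

  const varDir = nonEmpty(env["CLAWDIE_VAR_DIR"]);
  // ASSUMPTION: status files sit in a `status` subdirectory of the var dir.
  if (varDir !== undefined) return `${varDir}/status`;

  return `${repoRoot}/tmp/status`;
}
```

Treating blank values as unset avoids an empty `AGENT_STATUS_DIR=` line silently pointing the pipeline at the filesystem root.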
Updates:
- monitoring.md: appended "Operator-Facing Reports" section pointing at
the new structured-reports page, and "Provider Fallback Health" pointing
at the fallback page.
- operate/index.md: added the three new pages to the runbook list.
- architecture/controlplane.md: added "Runtime Observability" section
documenting the configured/effective/actual triple and linking to the
new operate pages.
- README.md: expanded the Telegram Commands table (was 10 rows, missing
every structured report, /policy, /clearcooldown, /budgetreset) and
added a pointer to operator-commands.md as the full reference. Also
noted free-text routing.
- CHANGELOG.md: appended an "operator observability + provider fallback,
apr.2026" batch under [Unreleased] covering provider fallback, the
reports family, the test/build wrapper pipeline, free-text routing,
/clearcooldown, the observability triple, the Telegram setMyCommands
menu, and the new "Verify Before Claiming Remote State" rule in
AGENTS.md.
No code changes. Slovenian sl/ mirror left untouched (out of localization
scope).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: FAIL — Tests 8 failed | 1940 passed (1948)
---
Build: pass | Tests: FAIL — Tests 2 failed | 1949 passed (1951)
Two outputs from this session bundled for the next agent:
- AGENTS.md gains "Verify Before Claiming Remote State" — durable rule
born from the 1e87f34 vs 3d33482 confusion: don't speak about a
remote without a fresh git fetch. When two agents disagree about a
tip, both fetch before debugging.
- MULTITENANT-HANDOFF.md gains a 26.apr session block telling Codex
how to disable the nanoclaw upstream remote in each worktree
without deleting the source code (setup/upstream.ts and the
check_upstream_updates MCP tool both gracefully degrade and stay
useful as a re-enable path).
No code changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
Build: pass | Tests: pass — Tests 1944 passed (1944)
---
Build: pass | Tests: pass — Tests 1944 passed (1944)