docs(host-db): review of b02746c — request revisions before merge

Implementation review of zai/Codex's "Harden host DB reboot path." Direction is right but three blockers: - snapshots are not atomic (two separate `zfs snapshot` calls reproduce the pgwal/pgdata skew that caused the incident) - `serviceMaybeStop` swallows real `onestatus` errors as "already stopped" — can proceed to checkpoint pg with mevy still running - committed with 3 failing tests Plus smells around missing readiness wait (§E), no spawnSync timeouts, duplicated pool resolution, and an unrelated bonus fix smuggled into the commit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --- Build: pass | Tests: FAIL — Tests 2 failed | 2080 passed (2082)
2026-05-01 10:57:13 +02:00 · 2026-05-01 10:57:13 +02:00 · f2358fcb80
commit f2358fcb80
parent 2bc14c7040
1 changed files with 118 additions and 0 deletions
--- a/docs/internal/HOST-DB-REBOOT-REVIEW.md
+++ b/docs/internal/HOST-DB-REBOOT-REVIEW.md
@ -0,0 +1,118 @@
+# Host DB Reboot — Implementation Review (b02746c)
+
+**Date:** 01.may.2026
+**Status:** Review — revision requested
+**Subject:** `b02746c "Harden host DB reboot path"`
+**Plan:** see `HOST-DB-RECOVERY-PLAN.md`
+
+---
+
+## Verdict
+
+Directionally right, ship-blocking gaps. Worth iterating on, **not**
+worth merging as-is.
+
+The implementation maps cleanly onto §A–§D of the recovery plan: a real
+`maintenance-reboot` hostd op, Telegram wired to it, rc.d ordering for
+`DB_RUNTIME=host`, controlplane auto-revive removed, auth registry
+updated, tests added. Scope is good.
+
+Three things make it not-yet-mergeable, plus a handful of smells.
+
+---
+
+## Blockers
+
+### 1. Snapshots are not atomic — the bug we just fixed in the plan
+
+`runMaintenanceReboot` issues two separate `zfs snapshot` calls
+(`src/hostd/privileged-commands.ts:158-178`):
+
+```
+zfs snapshot pool/.../pgdata@pre-reboot-<stamp>
+zfs snapshot pool/.../pgwal@pre-reboot-<stamp>
+```
+
+Between those two calls, anything writing WAL re-introduces the exact
+pgwal/pgdata skew that caused the 30.apr.2026 incident.
+
+ZFS guarantees atomicity when given multiple snapshots in a single
+invocation. Fix:
+
+```
+zfs snapshot pool/.../pgdata@<tag> pool/.../pgwal@<tag>
+```
+
+Recursive `-r` on the parent would over-snapshot siblings (e.g.
+`shared/npm-global`, jails) — don't use it here.
+
+### 2. `serviceMaybeStop` swallows real errors as "already stopped"
+
+`src/hostd/privileged-commands.ts:71-77` treats any non-zero
+`service onestatus` as "stopped, skip stop." But `onestatus` returns
+non-zero for permission errors, missing rc script, broken pidfile —
+not just "not running."
+
+Failure mode: `onestatus` errors for a non-"not running" reason →
+`serviceMaybeStop` returns `ok: true` → op proceeds to checkpoint and
+snapshot postgres while Mevy is **still hammering it**. That is worse
+than the pre-change behavior.
+
+Fix: distinguish "service not running" (specific exit code + output)
+from other failures, and abort on the latter.
+
+### 3. Committed with failing tests
+
+Commit trailer: `Tests: FAIL — 3 failed`. No indication which 3.
+
+Likely candidates: the `controlplane.checkDbJail` host branch no
+longer calls hostd, which probably breaks tests/mocks that asserted
+the old auto-revive behavior.
+
+Either the new tests are flaky on CI or unrelated tests broke. Either
+way: must be resolved before merge. "Build pass / tests fail" should
+not pass review.
+
+---
+
+## Smells
+
+- **Plan §E (Mevy startup readiness wait) not implemented.** controlplane
+  now correctly *reports* failure instead of auto-reviving (good), but
+  nothing was added to make Mevy wait for `pg_isready` before its checks
+  run. Post-reboot, Mevy can still race ahead of pg accepting
+  connections. Plan called this out explicitly.
+- **Mevy stop has no timeout.** `service mevy stop` can hang
+  indefinitely if a worker is stuck. `spawnSync` with no timeout option
+  means the entire reboot op blocks and hostd freezes with it.
+- **`delayMinutes: 1` default.** Operator gets 60s to abort. Fine, but
+  the Telegram confirmation should mention what is about to happen
+  (mevy stop, pg checkpoint, snapshots) — this is now a much heavier
+  op than before.
+- **Pool resolution duplicated three ways.** `buildHostDbDatasets`
+  reads `process.env.ZFS_POOL || 'zroot'` inline; `setup/db.ts`
+  hardcodes `zroot/${ZFS_PREFIX}/...`; `setup/install-identity.ts`
+  uses `systemEnv.zfsPool || 'zroot'`. Three conventions in three
+  files. This op should use one source of truth, otherwise dev/test
+  envs with non-default pools snapshot the wrong dataset.
+- **Bonus fix smuggled in.** `${agentName}_enable` →
+  `${runtime.serviceName}_enable` in `setup/service.ts:205` is a real
+  bug fix, but unrelated and unmentioned in the commit message.
+
+---
+
+## Required revisions before merge
+
+1. Atomic snapshot in one `zfs` invocation (blocker #1).
+2. `serviceMaybeStop` distinguishes "not running" from "broken"
+   (blocker #2).
+3. Resolve the 3 failing tests (blocker #3).
+4. Timeout on the inner `spawnSync` calls so Mevy stop / pg
+   checkpoint cannot hang the op indefinitely.
+
+## Follow-ups (separate PR ok)
+
+5. Mevy-side `pg_isready` wait at startup (plan §E).
+6. Single source of truth for pool/prefix dataset resolution.
+7. Telegram confirmation copy describing what the maintenance reboot
+   actually does.