docs(host-db): review of b02746c — request revisions before merge

Implementation review of zai/Codex's "Harden host DB reboot path."
Direction is right but three blockers:

- snapshots are not atomic (two separate `zfs snapshot` calls
  reproduce the pgwal/pgdata skew that caused the incident)
- `serviceMaybeStop` swallows real `onestatus` errors as
  "already stopped" — can proceed to checkpoint pg with mevy
  still running
- committed with 3 failing tests

Plus smells around missing readiness wait (§E), no spawnSync
timeouts, duplicated pool resolution, and an unrelated bonus fix
smuggled into the commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---
Build: pass | Tests: FAIL — Tests  2 failed | 2080 passed (2082)
This commit is contained in:
Operator & Claude Code 2026-05-01 10:57:13 +02:00
parent 2bc14c7040
commit f2358fcb80

View file

@ -0,0 +1,118 @@
# Host DB Reboot — Implementation Review (b02746c)
**Date:** 01.may.2026
**Status:** Review — revision requested
**Subject:** `b02746c "Harden host DB reboot path"`
**Plan:** see `HOST-DB-RECOVERY-PLAN.md`
---
## Verdict
Directionally right, ship-blocking gaps. Worth iterating on, **not**
worth merging as-is.
The implementation maps cleanly onto §A§D of the recovery plan: a real
`maintenance-reboot` hostd op, Telegram wired to it, rc.d ordering for
`DB_RUNTIME=host`, controlplane auto-revive removed, auth registry
updated, tests added. Scope is good.
Three things make it not-yet-mergeable, plus a handful of smells.
---
## Blockers
### 1. Snapshots are not atomic — the bug we just fixed in the plan
`runMaintenanceReboot` issues two separate `zfs snapshot` calls
(`src/hostd/privileged-commands.ts:158-178`):
```
zfs snapshot pool/.../pgdata@pre-reboot-<stamp>
zfs snapshot pool/.../pgwal@pre-reboot-<stamp>
```
Between those two calls, anything writing WAL re-introduces the exact
pgwal/pgdata skew that caused the 30.apr.2026 incident.
ZFS guarantees atomicity when given multiple snapshots in a single
invocation. Fix:
```
zfs snapshot pool/.../pgdata@<tag> pool/.../pgwal@<tag>
```
Recursive `-r` on the parent would over-snapshot siblings (e.g.
`shared/npm-global`, jails) — don't use it here.
### 2. `serviceMaybeStop` swallows real errors as "already stopped"
`src/hostd/privileged-commands.ts:71-77` treats any non-zero
`service onestatus` as "stopped, skip stop." But `onestatus` returns
non-zero for permission errors, missing rc script, broken pidfile —
not just "not running."
Failure mode: `onestatus` errors for a non-"not running" reason →
`serviceMaybeStop` returns `ok: true` → op proceeds to checkpoint and
snapshot postgres while Mevy is **still hammering it**. That is worse
than the pre-change behavior.
Fix: distinguish "service not running" (specific exit code + output)
from other failures, and abort on the latter.
### 3. Committed with failing tests
Commit trailer: `Tests: FAIL — 3 failed`. No indication which 3.
Likely candidates: the `controlplane.checkDbJail` host branch no
longer calls hostd, which probably breaks tests/mocks that asserted
the old auto-revive behavior.
Either the new tests are flaky on CI or unrelated tests broke. Either
way: must be resolved before merge. "Build pass / tests fail" should
not pass review.
---
## Smells
- **Plan §E (Mevy startup readiness wait) not implemented.** controlplane
now correctly *reports* failure instead of auto-reviving (good), but
nothing was added to make Mevy wait for `pg_isready` before its checks
run. Post-reboot, Mevy can still race ahead of pg accepting
connections. Plan called this out explicitly.
- **Mevy stop has no timeout.** `service mevy stop` can hang
indefinitely if a worker is stuck. `spawnSync` with no timeout option
means the entire reboot op blocks and hostd freezes with it.
- **`delayMinutes: 1` default.** Operator gets 60s to abort. Fine, but
the Telegram confirmation should mention what is about to happen
(mevy stop, pg checkpoint, snapshots) — this is now a much heavier
op than before.
- **Pool resolution duplicated three ways.** `buildHostDbDatasets`
reads `process.env.ZFS_POOL || 'zroot'` inline; `setup/db.ts`
hardcodes `zroot/${ZFS_PREFIX}/...`; `setup/install-identity.ts`
uses `systemEnv.zfsPool || 'zroot'`. Three conventions in three
files. This op should use one source of truth, otherwise dev/test
envs with non-default pools snapshot the wrong dataset.
- **Bonus fix smuggled in.** `${agentName}_enable`
`${runtime.serviceName}_enable` in `setup/service.ts:205` is a real
bug fix, but unrelated and unmentioned in the commit message.
---
## Required revisions before merge
1. Atomic snapshot in one `zfs` invocation (blocker #1).
2. `serviceMaybeStop` distinguishes "not running" from "broken"
(blocker #2).
3. Resolve the 3 failing tests (blocker #3).
4. Timeout on the inner `spawnSync` calls so Mevy stop / pg
checkpoint cannot hang the op indefinitely.
## Follow-ups (separate PR ok)
5. Mevy-side `pg_isready` wait at startup (plan §E).
6. Single source of truth for pool/prefix dataset resolution.
7. Telegram confirmation copy describing what the maintenance reboot
actually does.