docs(host-db): review of b02746c — request revisions before merge
Implementation review of zai/Codex's "Harden host DB reboot path." Direction is right but three blockers: - snapshots are not atomic (two separate `zfs snapshot` calls reproduce the pgwal/pgdata skew that caused the incident) - `serviceMaybeStop` swallows real `onestatus` errors as "already stopped" — can proceed to checkpoint pg with mevy still running - committed with 3 failing tests Plus smells around missing readiness wait (§E), no spawnSync timeouts, duplicated pool resolution, and an unrelated bonus fix smuggled into the commit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --- Build: pass | Tests: FAIL — Tests 2 failed | 2080 passed (2082)
This commit is contained in:
parent
2bc14c7040
commit
f2358fcb80
1 changed files with 118 additions and 0 deletions
118
docs/internal/HOST-DB-REBOOT-REVIEW.md
Normal file
118
docs/internal/HOST-DB-REBOOT-REVIEW.md
Normal file
|
|
@ -0,0 +1,118 @@
|
|||
# Host DB Reboot — Implementation Review (b02746c)
|
||||
|
||||
**Date:** 01.may.2026
|
||||
**Status:** Review — revision requested
|
||||
**Subject:** `b02746c "Harden host DB reboot path"`
|
||||
**Plan:** see `HOST-DB-RECOVERY-PLAN.md`
|
||||
|
||||
---
|
||||
|
||||
## Verdict
|
||||
|
||||
Directionally right, ship-blocking gaps. Worth iterating on, **not**
|
||||
worth merging as-is.
|
||||
|
||||
The implementation maps cleanly onto §A–§D of the recovery plan: a real
|
||||
`maintenance-reboot` hostd op, Telegram wired to it, rc.d ordering for
|
||||
`DB_RUNTIME=host`, controlplane auto-revive removed, auth registry
|
||||
updated, tests added. Scope is good.
|
||||
|
||||
Three things make it not-yet-mergeable, plus a handful of smells.
|
||||
|
||||
---
|
||||
|
||||
## Blockers
|
||||
|
||||
### 1. Snapshots are not atomic — the bug we just fixed in the plan
|
||||
|
||||
`runMaintenanceReboot` issues two separate `zfs snapshot` calls
|
||||
(`src/hostd/privileged-commands.ts:158-178`):
|
||||
|
||||
```
|
||||
zfs snapshot pool/.../pgdata@pre-reboot-<stamp>
|
||||
zfs snapshot pool/.../pgwal@pre-reboot-<stamp>
|
||||
```
|
||||
|
||||
Between those two calls, anything writing WAL re-introduces the exact
|
||||
pgwal/pgdata skew that caused the 30.apr.2026 incident.
|
||||
|
||||
ZFS guarantees atomicity when given multiple snapshots in a single
|
||||
invocation. Fix:
|
||||
|
||||
```
|
||||
zfs snapshot pool/.../pgdata@<tag> pool/.../pgwal@<tag>
|
||||
```
|
||||
|
||||
Recursive `-r` on the parent would over-snapshot siblings (e.g.
|
||||
`shared/npm-global`, jails) — don't use it here.
|
||||
|
||||
### 2. `serviceMaybeStop` swallows real errors as "already stopped"
|
||||
|
||||
`src/hostd/privileged-commands.ts:71-77` treats any non-zero
|
||||
`service onestatus` as "stopped, skip stop." But `onestatus` returns
|
||||
non-zero for permission errors, missing rc script, broken pidfile —
|
||||
not just "not running."
|
||||
|
||||
Failure mode: `onestatus` errors for a non-"not running" reason →
|
||||
`serviceMaybeStop` returns `ok: true` → op proceeds to checkpoint and
|
||||
snapshot postgres while Mevy is **still hammering it**. That is worse
|
||||
than the pre-change behavior.
|
||||
|
||||
Fix: distinguish "service not running" (specific exit code + output)
|
||||
from other failures, and abort on the latter.
|
||||
|
||||
### 3. Committed with failing tests
|
||||
|
||||
Commit trailer: `Tests: FAIL — 3 failed`. No indication which 3.
|
||||
|
||||
Likely candidates: the `controlplane.checkDbJail` host branch no
|
||||
longer calls hostd, which probably breaks tests/mocks that asserted
|
||||
the old auto-revive behavior.
|
||||
|
||||
Either the new tests are flaky on CI or unrelated tests broke. Either
|
||||
way: must be resolved before merge. "Build pass / tests fail" should
|
||||
not pass review.
|
||||
|
||||
---
|
||||
|
||||
## Smells
|
||||
|
||||
- **Plan §E (Mevy startup readiness wait) not implemented.** controlplane
|
||||
now correctly *reports* failure instead of auto-reviving (good), but
|
||||
nothing was added to make Mevy wait for `pg_isready` before its checks
|
||||
run. Post-reboot, Mevy can still race ahead of pg accepting
|
||||
connections. Plan called this out explicitly.
|
||||
- **Mevy stop has no timeout.** `service mevy stop` can hang
|
||||
indefinitely if a worker is stuck. `spawnSync` with no timeout option
|
||||
means the entire reboot op blocks and hostd freezes with it.
|
||||
- **`delayMinutes: 1` default.** Operator gets 60s to abort. Fine, but
|
||||
the Telegram confirmation should mention what is about to happen
|
||||
(mevy stop, pg checkpoint, snapshots) — this is now a much heavier
|
||||
op than before.
|
||||
- **Pool resolution duplicated three ways.** `buildHostDbDatasets`
|
||||
reads `process.env.ZFS_POOL || 'zroot'` inline; `setup/db.ts`
|
||||
hardcodes `zroot/${ZFS_PREFIX}/...`; `setup/install-identity.ts`
|
||||
uses `systemEnv.zfsPool || 'zroot'`. Three conventions in three
|
||||
files. This op should use one source of truth, otherwise dev/test
|
||||
envs with non-default pools snapshot the wrong dataset.
|
||||
- **Bonus fix smuggled in.** `${agentName}_enable` →
|
||||
`${runtime.serviceName}_enable` in `setup/service.ts:205` is a real
|
||||
bug fix, but unrelated and unmentioned in the commit message.
|
||||
|
||||
---
|
||||
|
||||
## Required revisions before merge
|
||||
|
||||
1. Atomic snapshot in one `zfs` invocation (blocker #1).
|
||||
2. `serviceMaybeStop` distinguishes "not running" from "broken"
|
||||
(blocker #2).
|
||||
3. Resolve the 3 failing tests (blocker #3).
|
||||
4. Timeout on the inner `spawnSync` calls so Mevy stop / pg
|
||||
checkpoint cannot hang the op indefinitely.
|
||||
|
||||
## Follow-ups (separate PR ok)
|
||||
|
||||
5. Mevy-side `pg_isready` wait at startup (plan §E).
|
||||
6. Single source of truth for pool/prefix dataset resolution.
|
||||
7. Telegram confirmation copy describing what the maintenance reboot
|
||||
actually does.
|
||||
Loading…
Add table
Reference in a new issue