docs: add host DB recovery plan

--- Build: pass | Tests: FAIL (3 failed, 2077 passed, 2080 total)
2026-05-01 10:17:11 +02:00 · 2026-05-01 10:17:11 +02:00 · d456aa4be1
commit d456aa4be1
parent 54c74a9e22
1 changed files with 229 additions and 0 deletions
--- a/docs/internal/HOST-DB-RECOVERY-PLAN.md
+++ b/docs/internal/HOST-DB-RECOVERY-PLAN.md
@@ -0,0 +1,229 @@
+# Host DB Recovery Plan
+
+**Date:** 01.may.2026
+**Status:** Planning
+**Scope:** Host PostgreSQL for Mevy/controlplane/agent ops
+
+---
+
+## Incident summary
+
+On `30.apr.2026`, a package-upgrade reboot exposed a weakness in the current
+host-DB maintenance path.
+
+The host reboot itself was expected. What failed was the database recovery
+path after reboot:
+
+- host PostgreSQL did not come back cleanly
+- startup failed with:
+  - `PANIC: could not locate a valid checkpoint record`
+- Mevy then entered a restart loop and repeatedly tried to recover the
+  database through hostd
+- secondary fallout appeared after DB recovery:
+  - Better Auth tables/password mismatch
+  - dashboard webroot missing
+  - stale global budget state
+  - chat registration had to be restored in memory
+
+This was not a simple "PostgreSQL needed time to recover" event. The host DB
+required manual recovery work.
+
+---
+
+## Recovery summary
+
+Service was restored by:
+
+1. stopping Mevy to break the restart loop
+2. taking rescue ZFS snapshots of the broken state
+3. rolling back both DB datasets together:
+   - `zroot/mevy-ai/pgdata`
+   - `zroot/mevy-ai/pgwal`
+4. confirming rollback alone was insufficient
+5. running `pg_resetwal` to reset the WAL and control data (a last-resort
+   step that can discard recent transactions)
+6. restarting PostgreSQL successfully
+7. restoring controlplane/auth/bootstrap state
+8. re-importing the built-in knowledge artifact
+9. re-registering chat routing and clearing stale budget fallout
+
+This recovered service, but it was too manual and too fragile for a normal
+upgrade/reboot path.
+
+---
+
+## What failed
+
+### 1. Upgrade/reboot path did not quiesce PostgreSQL
+
+The Telegram upgrade flow currently:
+
+- runs package upgrades
+- offers reboot
+- schedules reboot through hostd
+
+It does **not**:
+
+- stop Mevy first
+- force a PostgreSQL `CHECKPOINT`
+- stop PostgreSQL cleanly
+- take explicit paired DB snapshots before reboot
+
+That is the main operational gap.
+
+### 2. Mevy rc.d service does not express a host-DB dependency
+
+The live service script requires only:
+
+- `NETWORKING`
+- `LOGIN`
+
+It does not require `postgresql` when `DB_RUNTIME=host`, so boot ordering can
+race.
+
+### 3. Mevy tries to revive its own critical DB dependency
+
+When host PostgreSQL is unreachable, controlplane checks currently try to
+start `postgresql` through hostd.
+
+That is acceptable as a manual recovery aid, but it is the wrong default
+behavior during boot or after a serious DB failure. It creates noise and
+blurs ownership of the DB lifecycle.
+
+### 4. Dual runtime support adds branching where ops wants one path
+
+`DB_RUNTIME=host|jail` was useful during transition, but for current
+Mevy/controlplane operations it adds complexity without enough operational
+value.
+
+---
+
+## Architecture decision
+
+For Mevy/controlplane/agent ops:
+
+- **host PostgreSQL is the primary supported path**
+- **db jail remains optional and secondary**
+
+The db jail may still be useful for:
+
+- testing
+- migration
+- isolated non-core service cases
+
+But it is no longer the path we should optimize first for agent operations.
+
+### Rationale
+
+- Mevy cannot operate meaningfully without the DB
+- having Mevy also wake or revive its own critical DB dependency is poor
+  layering
+- host DB gives simpler boot ordering
+- host DB gives simpler recovery
+- host DB reduces hot-path branching in code, docs, and operations
+
+---
+
+## Option 1 goal
+
+Make upgrade-triggered reboot boring and safe without adding major new
+architecture.
+
+The system should:
+
+1. drain/stop Mevy
+2. checkpoint and stop PostgreSQL cleanly
+3. snapshot DB datasets explicitly
+4. reboot
+5. start PostgreSQL first
+6. start Mevy only after DB is ready
+
+---
+
+## Minimal change set
+
+### A. Add a dedicated hostd maintenance reboot operation
+
+Add one higher-level hostd op for upgrade-safe reboot.
+
+Target behavior:
+
+1. `service mevy stop`
+2. run PostgreSQL `CHECKPOINT`
+3. `service postgresql stop`
+4. snapshot:
+   - `zroot/mevy-ai/pgdata`
+   - `zroot/mevy-ai/pgwal`
+5. schedule reboot
+
+This should live in hostd, not in Telegram command glue, so the dangerous
+sequence is centralized and reusable.
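
The target behavior in section A could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the hostd implementation: the function name `maintenance_reboot`, the `DRY_RUN` switch, the `pre-reboot-` snapshot tag, the `postgres` superuser role, and the one-minute reboot delay are all hypothetical. Only the service names and dataset names come from the plan.

```sh
#!/bin/sh
# Hypothetical sketch of an upgrade-safe reboot sequence (not the real hostd op).
# Assumptions: psql is on PATH, the superuser role is "postgres",
# and local socket authentication allows the CHECKPOINT call.

set -eu  # abort the sequence on the first failing step

PG_SUPERUSER="${PG_SUPERUSER:-postgres}"   # assumed role name
DB_DATASETS="zroot/mevy-ai/pgdata zroot/mevy-ai/pgwal"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

maintenance_reboot() {
    tag="pre-reboot-$(date -u +%Y%m%dT%H%M%SZ)"

    run service mevy stop
    run psql -U "$PG_SUPERUSER" -d postgres -c "CHECKPOINT"
    run service postgresql stop

    # Build one argument list so both datasets get the same tag
    # in a single zfs invocation.
    snaps=""
    for ds in $DB_DATASETS; do
        snaps="$snaps $ds@$tag"
    done
    # Word splitting of $snaps is intentional here.
    run zfs snapshot $snaps

    run shutdown -r +1
}
```

Setting `DRY_RUN=1` before calling `maintenance_reboot` prints the five steps in order without touching services or datasets, which is one way to review the sequence before wiring it into hostd.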
+
+### B. Make Telegram upgrade reboot use the maintenance op
+
+The Telegram package-update flow should call the new maintenance reboot path
+instead of a plain `shutdown-reboot`.
+
+This keeps the operator UX unchanged while fixing the dangerous part.
+
+### C. Add host-DB service ordering in rc.d generation
+
+When `DB_RUNTIME=host`, the generated Mevy rc.d service should require:
+
+- `postgresql`
+
+instead of only:
+
+- `NETWORKING LOGIN`
+
+This is a small change, but it removes the boot-ordering race between Mevy
+and PostgreSQL.
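
As a rough illustration of section C, the rc.d generator could choose the `REQUIRE` line from `DB_RUNTIME`. The helper name `mevy_rc_header` and the `PROVIDE: mevy` line are assumptions made for this sketch; the actual generator may be structured differently. `postgresql` is added alongside the existing `NETWORKING LOGIN` requirements rather than replacing them.

```sh
# Hypothetical helper: emit the rcorder(8) header for the Mevy rc.d script.
# An unset DB_RUNTIME is treated as "not host" here; that default is an assumption.
mevy_rc_header() {
    require="NETWORKING LOGIN"
    if [ "${DB_RUNTIME:-}" = "host" ]; then
        # Host DB: make rcorder start postgresql before Mevy.
        require="$require postgresql"
    fi
    printf '# PROVIDE: mevy\n# REQUIRE: %s\n' "$require"
}
```

With `DB_RUNTIME=host`, the second output line is `# REQUIRE: NETWORKING LOGIN postgresql`; with any other value it stays `# REQUIRE: NETWORKING LOGIN`.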
+
+### D. Reduce boot-time DB auto-start thrash
+
+Controlplane checks should stop treating host PostgreSQL startup as their
+default responsibility during boot.
+
+Preferred behavior:
+
+- check DB reachability
+- report failure clearly
+- do not aggressively restart PostgreSQL from inside the normal boot loop
+
+The host/service layer should own DB startup ordering.
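
The preferred behavior in section D (probe, report, do not restart) might look like the sketch below. The function name `check_db`, the `DB_PROBE` override, and the message wording are hypothetical. `pg_isready` is the standard PostgreSQL client utility; the 3-second timeout is an arbitrary choice for illustration.

```sh
# Probe command is overridable so the check can be exercised without a database.
DB_PROBE="${DB_PROBE:-pg_isready -q -t 3}"

# Report DB reachability. Deliberately never starts or restarts postgresql.
check_db() {
    # Word splitting of $DB_PROBE is intentional (command plus flags).
    if $DB_PROBE; then
        echo "db: reachable"
        return 0
    else
        rc=$?
        echo "db: unreachable (probe exit $rc); leaving postgresql startup to the host service layer" >&2
        return 1
    fi
}
```

The caller gets a clear non-zero status and a single log line, and the decision to start PostgreSQL stays with rc.d or an operator.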
+
+---
+
+## What we are not doing in this pass
+
+- moving the DB back into a jail
+- adding heavy self-healing orchestration
+- redesigning backups/PITR
+- removing db-jail support completely
+- rewriting the whole controlplane startup model
+
+This pass is intentionally conservative.
+
+---
+
+## Success criteria
+
+This plan is successful when:
+
+1. package-upgrade reboot stops Mevy first
+2. PostgreSQL receives a clean checkpoint and shutdown before reboot
+3. paired DB snapshots exist before reboot
+4. boot starts PostgreSQL before Mevy
+5. Mevy no longer enters a noisy restart loop on normal post-upgrade boot
+6. operator login/dashboard/chat do not require manual repair after a normal
+   upgrade reboot
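
Criterion 3 (paired DB snapshots) can be checked mechanically. The sketch below is one possible check, assuming that "paired" means the same snapshot tag exists on both datasets; the function name `has_paired_snapshot` is hypothetical. It reads snapshot names from stdin, for example from `zfs list -H -t snapshot -o name`.

```sh
# Succeeds (and prints the shared tags) if at least one snapshot tag exists on
# both zroot/mevy-ai/pgdata and zroot/mevy-ai/pgwal. Input: one name per line.
has_paired_snapshot() {
    awk -F@ '
        $1 == "zroot/mevy-ai/pgdata" { data[$2] = 1 }
        $1 == "zroot/mevy-ai/pgwal"  { wal[$2] = 1 }
        END {
            for (t in data) if (t in wal) { print t; found = 1 }
            exit(found ? 0 : 1)
        }'
}
```

A possible invocation is `zfs list -H -t snapshot -o name | has_paired_snapshot`, run before the reboot is scheduled.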
+
+---
+
+## Follow-up after option 1
+
+Once ordering is fixed, evaluate whether we still need:
+
+- startup reconciliation for dashboard/auth artifacts
+- additional DB restore safeguards
+- deeper removal of `DB_RUNTIME` branching from core ops paths
+
+Those are follow-ups, not prerequisites for this first hardening pass.