docs: add host DB recovery plan

---
Build: pass | Tests: FAIL — Tests  3 failed | 2077 passed (2080)
Operator & Codex 2026-05-01 10:17:11 +02:00
parent 54c74a9e22
commit d456aa4be1


@ -0,0 +1,229 @@
# Host DB Recovery Plan
**Date:** 01.may.2026
**Status:** Planning
**Scope:** Host PostgreSQL for Mevy/controlplane/agent ops
---
## Incident summary
On `30.apr.2026`, a package-upgrade reboot exposed a weakness in the current
host-DB maintenance path.
The host reboot itself was expected. What failed was the database recovery
path after reboot:
- host PostgreSQL did not come back cleanly
- startup failed with:
  - `PANIC: could not locate a valid checkpoint record`
- Mevy then entered a restart loop and repeatedly tried to recover the
  database through hostd
- secondary fallout appeared after DB recovery:
  - Better Auth tables/password mismatch
  - dashboard webroot missing
  - stale global budget state
  - chat registration had to be restored in memory
This was not a simple "PostgreSQL needed time to recover" event. The host DB
required manual recovery work.
---
## Recovery summary
Service was restored by:
1. stopping Mevy to break the restart loop
2. taking rescue ZFS snapshots of the broken state
3. rolling back both DB datasets together:
   - `zroot/mevy-ai/pgdata`
   - `zroot/mevy-ai/pgwal`
4. confirming rollback alone was insufficient
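5. running `pg_resetwal` to reset the write-ahead log, a last-resort step that can leave recent transactions lost or partially applied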
6. restarting PostgreSQL successfully
7. restoring controlplane/auth/bootstrap state
8. re-importing the built-in knowledge artifact
9. re-registering chat routing and clearing stale budget fallout
This recovered service, but it was too manual and too fragile for a normal
upgrade/reboot path.
---
## What failed
### 1. Upgrade/reboot path did not quiesce PostgreSQL
The Telegram upgrade flow currently:
- runs package upgrades
- offers reboot
- schedules reboot through hostd
It does **not**:
- stop Mevy first
- force a PostgreSQL `CHECKPOINT`
- stop PostgreSQL cleanly
- take explicit paired DB snapshots before reboot
That is the main operational gap.
### 2. Mevy rc.d service does not express a host-DB dependency
The live service script requires only:
- `NETWORKING`
- `LOGIN`
It does not require `postgresql` when `DB_RUNTIME=host`, so boot ordering can
race.
### 3. Mevy tries to revive its own critical DB dependency
When host PostgreSQL is unreachable, controlplane checks currently try to
start `postgresql` through hostd.
That is acceptable as a manual recovery aid, but it is the wrong default
behavior during boot or after a serious DB failure. It creates noise and
blurs ownership of the DB lifecycle.
### 4. Dual runtime support adds branching where ops wants one path
`DB_RUNTIME=host|jail` was useful during transition, but for current
Mevy/controlplane operations it adds complexity without enough operational
value.
---
## Architecture decision
For Mevy/controlplane/agent ops:
- **host PostgreSQL is the primary supported path**
- **db jail remains optional and secondary**
The db jail may still be useful for:
- testing
- migration
- isolated non-core service cases
But it is no longer the path we should optimize first for agent operations.
### Rationale
- Mevy cannot operate meaningfully without the DB
- having Mevy also wake or revive its own critical DB dependency is poor
layering
- host DB gives simpler boot ordering
- host DB gives simpler recovery
- host DB reduces hot-path branching in code, docs, and operations
---
## Option 1 goal
Make upgrade-triggered reboot boring and safe without adding major new
architecture.
The system should:
1. drain/stop Mevy
2. checkpoint and stop PostgreSQL cleanly
3. snapshot DB datasets explicitly
4. reboot
5. start PostgreSQL first
6. start Mevy only after DB is ready
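
The last two steps imply a bounded wait for database readiness before Mevy starts. A minimal Python sketch of such a wait follows. The `probe` callable is hypothetical (not an existing hostd or Mevy function) and is assumed to return `True` once PostgreSQL accepts connections. The attempt count and delay are illustrative defaults, not values from this plan.

```python
import time
from typing import Callable


def wait_for_db(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or the attempts run out.

    Returns True as soon as the database is ready, False otherwise.
    The wait is bounded on purpose: it reports failure instead of
    looping forever or trying to restart PostgreSQL itself.
    """
    for attempt in range(attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A caller would start Mevy only when `wait_for_db(...)` returns `True`, and otherwise surface the failure to the operator.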
---
## Minimal change set
### A. Add a dedicated hostd maintenance reboot operation
Add one higher-level hostd op for upgrade-safe reboot.
Target behavior:
1. `service mevy stop`
2. run PostgreSQL `CHECKPOINT`
3. `service postgresql stop`
4. snapshot:
   - `zroot/mevy-ai/pgdata`
   - `zroot/mevy-ai/pgwal`
5. schedule reboot
This should live in hostd, not in Telegram command glue, so the dangerous
sequence is centralized and reusable.
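
The target sequence above can be sketched as an ordered command plan. This is a Python illustration under stated assumptions, not hostd's actual implementation: the function names, the `psql` invocation (a local `postgres` superuser), the snapshot tag format, and the `shutdown -r +1` scheduling are all hypothetical. Only the service names and dataset names come from this plan.

```python
import subprocess
from typing import Callable

# Dataset names are taken from this plan.
DB_DATASETS = ("zroot/mevy-ai/pgdata", "zroot/mevy-ai/pgwal")


def maintenance_reboot_plan(snapshot_tag: str) -> list[list[str]]:
    """Return the ordered commands for an upgrade-safe reboot."""
    snapshots = [f"{dataset}@{snapshot_tag}" for dataset in DB_DATASETS]
    return [
        ["service", "mevy", "stop"],
        # Assumption: a local superuser named `postgres` can run CHECKPOINT.
        ["psql", "-U", "postgres", "-d", "postgres", "-c", "CHECKPOINT"],
        ["service", "postgresql", "stop"],
        # A single `zfs snapshot` call with both targets keeps the pair
        # consistent with each other.
        ["zfs", "snapshot", *snapshots],
        # Assumption: reboot is scheduled one minute out.
        ["shutdown", "-r", "+1"],
    ]


def run_plan(plan: list[list[str]], runner: Callable = subprocess.run) -> None:
    """Run each step in order; `check=True` aborts on the first failure."""
    for command in plan:
        runner(command, check=True)
```

Stopping at the first failed step matters here: if the checkpoint or the clean stop fails, the reboot should not be scheduled.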
### B. Make Telegram upgrade reboot use the maintenance op
The Telegram package-update flow should call the new maintenance reboot path
instead of a plain `shutdown-reboot`.
This keeps the operator UX unchanged while fixing the dangerous part.
### C. Add host-DB service ordering in rc.d generation
When `DB_RUNTIME=host`, the generated Mevy rc.d service should require:
- `NETWORKING LOGIN postgresql`
instead of only:
- `NETWORKING LOGIN`
This is a small change, but it makes rc start PostgreSQL before Mevy at boot.
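
As a rough illustration of the ordering rule, the generator could build the rc.d `# REQUIRE:` line as below. This is a Python sketch: the function name is hypothetical, and it assumes the PostgreSQL rc script provides the name `postgresql`.

```python
def rcd_require_line(db_runtime: str) -> str:
    """Build the `# REQUIRE:` line for the generated Mevy rc.d script."""
    requirements = ["NETWORKING", "LOGIN"]
    if db_runtime == "host":
        # Only the host runtime depends on the host PostgreSQL service.
        requirements.append("postgresql")
    return "# REQUIRE: " + " ".join(requirements)


print(rcd_require_line("host"))  # "# REQUIRE: NETWORKING LOGIN postgresql"
print(rcd_require_line("jail"))  # "# REQUIRE: NETWORKING LOGIN"
```

Note that `REQUIRE` only orders script execution at boot; it does not by itself guarantee that PostgreSQL is already accepting connections.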
### D. Reduce boot-time DB auto-start thrash
Controlplane checks should stop treating host PostgreSQL startup as their
default responsibility during boot.
Preferred behavior:
- check DB reachability
- report failure clearly
- do not aggressively restart PostgreSQL from inside the normal boot loop
The host/service layer should own DB startup ordering.
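
One way to express "check and report, do not restart" is a probe built on `pg_isready`, whose documented exit codes distinguish a starting server from an unreachable one. The Python sketch below is an assumption about how such a check could look, not the current controlplane code. The function names, host, port, and timeout are illustrative.

```python
import subprocess

# Exit codes as documented for pg_isready.
PG_ISREADY_STATUS = {
    0: "accepting connections",
    1: "rejecting connections (for example, still starting up)",
    2: "no response",
    3: "no attempt made (invalid parameters)",
}


def describe_db_status(exit_code: int) -> tuple[bool, str]:
    """Map a pg_isready exit code to (reachable, human-readable detail)."""
    detail = PG_ISREADY_STATUS.get(exit_code, f"unexpected exit code {exit_code}")
    return exit_code == 0, detail


def check_host_db(host: str = "127.0.0.1", port: int = 5432, timeout_s: int = 3) -> tuple[bool, str]:
    """Probe PostgreSQL and report the result; never tries to start it."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port), "-t", str(timeout_s)],
        capture_output=True,
        text=True,
    )
    return describe_db_status(result.returncode)
```

The check returns a status for logging or operator alerts and leaves any start or restart decision to the host/service layer.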
---
## What we are not doing in this pass
- moving the DB back into a jail
- adding heavy self-healing orchestration
- redesigning backups/PITR
- removing db-jail support completely
- rewriting the whole controlplane startup model
This pass is intentionally conservative.
---
## Success criteria
This plan is successful when:
1. package-upgrade reboot stops Mevy first
2. PostgreSQL receives a clean checkpoint and shutdown before reboot
3. paired DB snapshots exist before reboot
4. boot starts PostgreSQL before Mevy
5. Mevy no longer enters a noisy restart loop on normal post-upgrade boot
6. operator login/dashboard/chat do not require manual repair after a normal
upgrade reboot
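
Criterion 3 can be checked mechanically. The Python sketch below assumes snapshot names are collected from `zfs list -H -t snapshot -o name` and treats a snapshot as "paired" when both DB datasets carry the same tag. The function name and the tag convention are hypothetical.

```python
from typing import Iterable

# Dataset names are taken from this plan.
DB_DATASETS = ("zroot/mevy-ai/pgdata", "zroot/mevy-ai/pgwal")


def paired_snapshot_tags(
    snapshot_names: Iterable[str],
    datasets: tuple[str, ...] = DB_DATASETS,
) -> set[str]:
    """Return the snapshot tags that exist on every DB dataset."""
    tags_by_dataset: dict[str, set[str]] = {dataset: set() for dataset in datasets}
    for name in snapshot_names:
        dataset, separator, tag = name.strip().partition("@")
        # Ignore malformed lines and snapshots of unrelated datasets.
        if separator and dataset in tags_by_dataset:
            tags_by_dataset[dataset].add(tag)
    return set.intersection(*tags_by_dataset.values())
```

A pre-reboot check could then require that the tag just created appears in the returned set before the reboot is scheduled.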
---
## Follow-up after option 1
Once ordering is fixed, evaluate whether we still need:
- startup reconciliation for dashboard/auth artifacts
- additional DB restore safeguards
- deeper removal of `DB_RUNTIME` branching from core ops paths
Those are follow-ups, not prerequisites for this first hardening pass.