docs: add host DB recovery plan
--- Build: pass | Tests: FAIL — Tests 3 failed | 2077 passed (2080)
parent
54c74a9e22
commit
d456aa4be1
1 changed files with 229 additions and 0 deletions
229
docs/internal/HOST-DB-RECOVERY-PLAN.md
Normal file
@@ -0,0 +1,229 @@

# Host DB Recovery Plan

**Date:** 01.may.2026
**Status:** Planning
**Scope:** Host PostgreSQL for Mevy/controlplane/agent ops

---

## Incident summary

On `30.apr.2026`, a package-upgrade reboot exposed a weakness in the current
host-DB maintenance path.

The host reboot itself was expected. What failed was the database recovery
path after reboot:

- host PostgreSQL did not come back cleanly
- startup failed with:
  - `PANIC: could not locate a valid checkpoint record`
- Mevy then entered a restart loop and repeatedly tried to recover the
  database through hostd
- secondary fallout appeared after DB recovery:
  - Better Auth tables/password mismatch
  - dashboard webroot missing
  - stale global budget state
  - chat registration had to be restored in memory

This was not a simple "PostgreSQL needed time to recover" event. The host DB
required manual recovery work.

---

## Recovery summary

Service was restored by:

1. stopping Mevy to break the restart loop
2. taking rescue ZFS snapshots of the broken state
3. rolling back both DB datasets together:
   - `zroot/mevy-ai/pgdata`
   - `zroot/mevy-ai/pgwal`
4. confirming that rollback alone was insufficient
5. running `pg_resetwal`
6. restarting PostgreSQL successfully
7. restoring controlplane/auth/bootstrap state
8. re-importing the built-in knowledge artifact
9. re-registering chat routing and clearing stale budget fallout

This recovered service, but it was too manual and too fragile for a normal
upgrade/reboot path.

---

## What failed

### 1. Upgrade/reboot path did not quiesce PostgreSQL

The Telegram upgrade flow currently:

- runs package upgrades
- offers reboot
- schedules reboot through hostd

It does **not**:

- stop Mevy first
- force a PostgreSQL `CHECKPOINT`
- stop PostgreSQL cleanly
- take explicit paired DB snapshots before reboot

That is the main operational gap.

### 2. Mevy rc.d service does not express a host-DB dependency

The live service script requires only:

- `NETWORKING`
- `LOGIN`

It does not require `postgresql` when `DB_RUNTIME=host`, so Mevy can start
before PostgreSQL is up and boot ordering can race.

### 3. Mevy tries to revive its own critical DB dependency

When host PostgreSQL is unreachable, controlplane checks currently try to
start `postgresql` through hostd.

That is acceptable as a manual recovery aid, but it is the wrong default
behavior during boot or after a serious DB failure. It creates noise and
blurs ownership of the DB lifecycle.

### 4. Dual runtime support adds branching where ops wants one path

`DB_RUNTIME=host|jail` was useful during transition, but for current
Mevy/controlplane operations it adds complexity without enough operational
value.

---

## Architecture decision

For Mevy/controlplane/agent ops:

- **host PostgreSQL is the primary supported path**
- **db jail remains optional and secondary**

The db jail may still be useful for:

- testing
- migration
- isolated non-core service cases

But it is no longer the path we should optimize first for agent operations.

### Rationale

- Mevy cannot operate meaningfully without the DB
- having Mevy also wake or revive its own critical DB dependency is poor
  layering
- host DB gives simpler boot ordering
- host DB gives simpler recovery
- host DB reduces hot-path branching in code, docs, and operations

---

## Option 1 goal

Make upgrade-triggered reboot boring and safe without adding major new
architecture.

The system should:

1. drain/stop Mevy
2. checkpoint and stop PostgreSQL cleanly
3. snapshot DB datasets explicitly
4. reboot
5. start PostgreSQL first
6. start Mevy only after DB is ready

---

## Minimal change set

### A. Add a dedicated hostd maintenance reboot operation

Add one higher-level hostd op for upgrade-safe reboot.

Target behavior:

1. `service mevy stop`
2. run PostgreSQL `CHECKPOINT`
3. `service postgresql stop`
4. snapshot:
   - `zroot/mevy-ai/pgdata`
   - `zroot/mevy-ai/pgwal`
5. schedule reboot

This should live in hostd, not in Telegram command glue, so the dangerous
sequence is centralized and reusable.
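
The target behavior above could be sketched as a small shell helper. This is
only an illustration, not the hostd implementation: the `run` wrapper, the
`DRY_RUN` and `SNAP_LABEL` variables, the `postgres` role and database used for
`psql`, and the one-minute `shutdown` delay are all assumptions. `DRY_RUN`
defaults to `1`, so the sketch only prints the commands it would run.

```sh
#!/bin/sh
# Sketch of the upgrade-safe reboot sequence (hypothetical helper, not hostd code).
DRY_RUN="${DRY_RUN:-1}"
SNAP_LABEL="${SNAP_LABEL:-pre-reboot-$(date +%Y%m%d-%H%M%S)}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
maintenance_reboot() {
    run service mevy stop &&
    # Assumption: a local superuser role "postgres" can connect without a password.
    run psql -U postgres -d postgres -c 'CHECKPOINT' &&
    run service postgresql stop &&
    # Both datasets go into one zfs command so they share the same label.
    run zfs snapshot "zroot/mevy-ai/pgdata@${SNAP_LABEL}" "zroot/mevy-ai/pgwal@${SNAP_LABEL}" &&
    run shutdown -r +1
}

maintenance_reboot
```

Chaining the steps with `&&` means a failed checkpoint or a failed PostgreSQL
stop aborts the sequence before any snapshot or reboot is attempted.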

### B. Make Telegram upgrade reboot use the maintenance op

The Telegram package-update flow should call the new maintenance reboot path
instead of a plain `shutdown-reboot`.

This keeps the operator UX unchanged while fixing the dangerous part.

### C. Add host-DB service ordering in rc.d generation

When `DB_RUNTIME=host`, the generated Mevy rc.d service should require:

- `postgresql`

instead of only:

- `NETWORKING LOGIN`

This is a small change, but it removes the boot-ordering race between Mevy
and host PostgreSQL.
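
A minimal sketch of the generator logic, assuming the rc.d script is produced
from shell and that `postgresql` is added alongside the existing requirements
rather than replacing them. The function name `mevy_rc_require` is
hypothetical; the `# REQUIRE:` comment line is the standard way an rc.d script
declares ordering to `rcorder`.

```sh
# Emit the rc.d REQUIRE header for the generated Mevy service.
# Hypothetical helper; the real generator may be structured differently.
mevy_rc_require() {
    db_runtime="$1"
    if [ "$db_runtime" = "host" ]; then
        # Host DB: Mevy must be ordered after the postgresql service.
        echo "# REQUIRE: NETWORKING LOGIN postgresql"
    else
        # Any other runtime keeps the current requirements.
        echo "# REQUIRE: NETWORKING LOGIN"
    fi
}

mevy_rc_require "${DB_RUNTIME:-host}"
```

With `DB_RUNTIME=host` the generated script carries the extra `postgresql`
requirement; any other value keeps the current header unchanged.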

### D. Reduce boot-time DB auto-start thrash

Controlplane checks should stop treating host PostgreSQL startup as their
default responsibility during boot.

Preferred behavior:

- check DB reachability
- report failure clearly
- do not aggressively restart PostgreSQL from inside the normal boot loop
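
The preferred behavior could look roughly like the check below. It is a sketch
under assumptions: the function name `check_host_db`, the `DB_PROBE` override,
the host and port, and the message wording are all illustrative. `pg_isready`
exits with status 0 only when the server is accepting connections.

```sh
# Probe command is overridable so the check can be exercised without a live server.
# Assumption: PostgreSQL listens on 127.0.0.1:5432.
DB_PROBE="${DB_PROBE:-pg_isready -q -h 127.0.0.1 -p 5432 -t 3}"

# Check once, report clearly, and never try to start PostgreSQL from here.
check_host_db() {
    # DB_PROBE is intentionally unquoted so it splits into command and arguments.
    if $DB_PROBE >/dev/null 2>&1; then
        echo "host-db: reachable"
        return 0
    fi
    echo "host-db: UNREACHABLE; not starting postgresql from controlplane, check 'service postgresql status'" >&2
    return 1
}

# Usage: check_host_db || exit 1
```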

The host/service layer should own DB startup ordering.

---

## What we are not doing in this pass

- moving the DB back into a jail
- adding heavy self-healing orchestration
- redesigning backups/PITR
- removing db-jail support completely
- rewriting the whole controlplane startup model

This pass is intentionally conservative.

---

## Success criteria

This plan is successful when:

1. package-upgrade reboot stops Mevy first
2. PostgreSQL receives a clean checkpoint and shutdown before reboot
3. paired DB snapshots exist before reboot
4. boot starts PostgreSQL before Mevy
5. Mevy no longer enters a noisy restart loop on normal post-upgrade boot
6. operator login/dashboard/chat do not require manual repair after a normal
   upgrade reboot
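
Criterion 3 lends itself to a mechanical check. The sketch below is one
possible form, not an existing tool: the function name `has_paired_snapshots`
and the idea of a shared snapshot label are assumptions. It reads snapshot
names on stdin, for example from `zfs list -H -t snapshot -o name`, and
succeeds only if both DB datasets have a snapshot with the given label.

```sh
# Succeeds when both DB datasets carry a snapshot with the given label.
# Reads snapshot names (one per line) on stdin, e.g. from:
#   zfs list -H -t snapshot -o name
has_paired_snapshots() {
    label="$1"
    names=$(cat)
    for ds in zroot/mevy-ai/pgdata zroot/mevy-ai/pgwal; do
        # -x matches the whole line, -F treats the name as a fixed string.
        printf '%s\n' "$names" | grep -qxF "${ds}@${label}" || return 1
    done
}
```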

---

## Follow-up after option 1

Once ordering is fixed, evaluate whether we still need:

- startup reconciliation for dashboard/auth artifacts
- additional DB restore safeguards
- deeper removal of `DB_RUNTIME` branching from core ops paths

Those are follow-ups, not prerequisites for this first hardening pass.