docs: record OSA intake scheduler re-smoke

Sam & Claude 2026-05-27 21:09:26 +02:00
parent d760536fe1
commit 8199e23890

# OSA FreeBSD Intake Scheduler Re-smoke
**Date:** 27 May 2026
**Host:** osa.smilepowered.org
**OS:** FreeBSD 15.0-RELEASE-p9 amd64
**Repo:** `Clawdie/Colibri`
**Base pulled:** `af8c011` — daemon loop wiring landed
**Fix commit:** `d760536` (`fix: avoid scheduler store deadlock on intake drain`)
**Status:** PASS after follow-up fix
## What was checked
Pulled the daemon-loop wiring (`9717ce7`, plus `af8c011`) and ran the full workspace gates:
```sh
cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo build --workspace --release
```
Initial gates were green, but a live `/tmp` FreeBSD re-smoke exposed one more runtime bug.
## Finding during live re-smoke
With `daemon::run_loop` now started, `intake-task` did reach the scheduler tick and inserted a SQLite task, but a concurrent `list-tasks` socket request timed out.
Root cause: `Scheduler::tick` used expressions like:
```rust
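// Simplified: the guard created in the scrutinee is a temporary
// that stays alive until the end of the whole `match`.
match state.store.lock().unwrap().create_task(...) {
    Ok(task) => {
        // Second lock on the same mutex while the first guard is still held.
        state.store.lock().unwrap().list_agents()
    }
}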
```
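The `MutexGuard` temporary created in the `match` scrutinee is not dropped until the end of the whole `match`, so it is still held inside every arm. Relocking `state.store` inside the arm makes the scheduler thread wait on a lock it already holds, which can deadlock it while the store stays locked for everyone else. On FreeBSD this manifested as: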
- SQLite row was created
- socket request blocked on the store lock
- daemon became unresponsive until killed by the smoke harness timeout
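
The lifetime rule behind this finding can be observed without hanging a process. The following is a minimal sketch, not the project's code: `Store` and `guard_held_inside_match_arm` are hypothetical stand-ins, and it assumes the store is guarded by `std::sync::Mutex`. It uses `try_lock` inside the arm, which fails immediately where `lock` would block.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real store; only the locking pattern matters.
struct Store {
    tasks: Vec<String>,
}

impl Store {
    fn create_task(&mut self, title: &str) -> Result<usize, String> {
        self.tasks.push(title.to_string());
        Ok(self.tasks.len())
    }
}

/// Returns true if the store mutex is still held inside the match arm.
fn guard_held_inside_match_arm(store: &Mutex<Store>) -> bool {
    match store.lock().unwrap().create_task("demo") {
        // The guard from the scrutinee is still alive here, so a
        // non-blocking attempt fails instead of hanging like `lock()` would.
        Ok(_) => store.try_lock().is_err(),
        Err(_) => false,
    }
}

fn main() {
    let store = Mutex::new(Store { tasks: Vec::new() });
    let held = guard_held_inside_match_arm(&store);
    println!("guard still held inside the arm: {held}");
}
```

Once the `match` ends the temporary guard is dropped, so a later `try_lock` on the same mutex succeeds again.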
## Fix
`d760536` rewrites scheduler store access so every lock is scoped explicitly and dropped before the next lock:
```rust
let create_result = {
    let store = state.store.lock().unwrap();
    store.create_task(...)
}; // guard dropped here, before any later lock
```
It also adds a regression test:
```text
scheduler::tests::test_scheduler_tick_drains_intake_without_deadlock
```
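
A deadlock regression test needs a way to fail instead of hanging. One common approach, sketched here under assumptions, is to run the tick on a worker thread and wait for its result with a timeout. The names `Store`, `tick`, and `tick_with_timeout` are hypothetical and do not reflect the actual test in the repository.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the scheduler's store.
struct Store {
    tasks: Vec<String>,
}

/// Hypothetical tick: every lock lives in its own scope, mirroring the fix.
fn tick(store: &Mutex<Store>, intake: Vec<String>) -> usize {
    for title in intake {
        let mut guard = store.lock().unwrap();
        guard.tasks.push(title);
        // guard is dropped at the end of each iteration
    }
    let guard = store.lock().unwrap();
    guard.tasks.len()
}

/// Runs `tick` on a worker thread; returns None if it does not finish in time.
fn tick_with_timeout(
    store: Arc<Mutex<Store>>,
    intake: Vec<String>,
    limit: Duration,
) -> Option<usize> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let count = tick(&store, intake);
        // The receiver may have given up already; ignore a failed send.
        let _ = tx.send(count);
    });
    rx.recv_timeout(limit).ok()
}

fn main() {
    let store = Arc::new(Mutex::new(Store { tasks: Vec::new() }));
    let intake = vec!["osa resmoke intake".to_string()];
    let drained = tick_with_timeout(Arc::clone(&store), intake, Duration::from_secs(2));
    assert_eq!(drained, Some(1), "tick should finish well within the limit");
    assert!(store.try_lock().is_ok(), "store must be free after the tick");
}
```

If a lock were ever held across another lock again, `recv_timeout` would return an error and the assertion would fail rather than block the test run.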
## Re-smoke result after fix
Isolated `/tmp` environment:
```text
COLIBRI_DAEMON_DATA_DIR=/tmp/colibri-osa-resmoke-clawdie-1779908417/data
COLIBRI_DAEMON_SOCKET=/tmp/colibri-osa-resmoke-clawdie-1779908417/colibri.sock
COLIBRI_DB_PATH=/tmp/colibri-osa-resmoke-clawdie-1779908417/colibri.sqlite
COLIBRI_HOST=osa-resmoke
```
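
Because the smoke run depends on all paths staying inside `/tmp`, a harness could refuse to start when any of them points elsewhere. The sketch below is one way to do that. The variable names come from the listing above, while `SmokeConfig`, `load_config`, and `is_isolated` are hypothetical and say nothing about how the daemon itself reads its configuration.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical container for the four variables listed above.
#[derive(Debug)]
struct SmokeConfig {
    data_dir: PathBuf,
    socket: PathBuf,
    db_path: PathBuf,
    host: String,
}

impl SmokeConfig {
    /// True when every filesystem path sits under /tmp.
    fn is_isolated(&self) -> bool {
        [&self.data_dir, &self.socket, &self.db_path]
            .iter()
            .all(|path| path.starts_with("/tmp"))
    }
}

/// Builds the config from any lookup function, so tests need not
/// touch the real process environment.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<SmokeConfig, String> {
    let require = |key: &str| get(key).ok_or_else(|| format!("{key} is not set"));
    Ok(SmokeConfig {
        data_dir: PathBuf::from(require("COLIBRI_DAEMON_DATA_DIR")?),
        socket: PathBuf::from(require("COLIBRI_DAEMON_SOCKET")?),
        db_path: PathBuf::from(require("COLIBRI_DB_PATH")?),
        host: require("COLIBRI_HOST")?,
    })
}

fn main() {
    match load_config(|key| env::var(key).ok()) {
        Ok(config) if config.is_isolated() => {
            println!("isolated smoke config for {}", config.host)
        }
        Ok(config) => eprintln!("refusing to run, paths escape /tmp: {config:?}"),
        Err(message) => eprintln!("incomplete environment: {message}"),
    }
}
```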
`intake-task` response:
```json
{"ok":true,"data":{"status":"queued"}}
```
The scheduler drained intake on the 30s tick:
```text
FOUND_ON_POLL=30
```
`list-tasks` then returned the queued task:
```json
{
  "ok": true,
  "data": [
    {
      "agent_id": null,
      "created_at": "2026-05-27T19:00:47.360062420+00:00",
      "description": "prove scheduler loop drains intake",
      "id": "c3dab9df-8a37-47b1-854b-d956fd796d41",
      "status": "queued",
      "title": "osa resmoke intake",
      "updated_at": "2026-05-27T19:00:47.360062420+00:00"
    }
  ]
}
```
SQLite verification:
```text
tasks 1
journal_mode wal
```
Graceful shutdown:
```text
socket exists after stop? no
process remains? no
```
Daemon log included both task loops exiting cleanly:
```text
daemon background loop started ... scheduler_secs=30
received interrupt signal, initiating graceful shutdown
socket server received shutdown signal
daemon loop received shutdown signal
daemon background loop exited
Herdr socket API shut down
colibri-daemon shut down cleanly
```
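
The shutdown sequence in the log (one signal, two loops exiting, socket removed) can be illustrated with a small standard-library sketch. This is an illustration under assumptions, not the daemon's implementation: it uses plain threads and an `AtomicBool` flag in place of whatever runtime and signal handling the daemon uses, a regular file stands in for the Unix socket, and `spawn_worker` and `run_and_stop` are hypothetical names.

```rust
use std::fs;
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawns a hypothetical worker that polls the shared flag and exits once it is set.
fn spawn_worker(name: &'static str, shutdown: Arc<AtomicBool>) -> JoinHandle<&'static str> {
    thread::spawn(move || {
        while !shutdown.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(5));
        }
        name
    })
}

/// Starts both workers, signals shutdown, joins them, then removes the socket file.
/// Returns the names of the workers that exited and whether the file still exists.
fn run_and_stop(socket_path: &Path) -> (Vec<&'static str>, bool) {
    // A plain file stands in for the Unix socket so the sketch stays portable.
    fs::write(socket_path, b"").expect("create placeholder socket file");

    let shutdown = Arc::new(AtomicBool::new(false));
    let socket_server = spawn_worker("socket server", Arc::clone(&shutdown));
    let daemon_loop = spawn_worker("daemon loop", Arc::clone(&shutdown));

    // One flag stops both loops; joining proves each one actually exited.
    shutdown.store(true, Ordering::SeqCst);
    let exited = vec![socket_server.join().unwrap(), daemon_loop.join().unwrap()];

    fs::remove_file(socket_path).expect("remove placeholder socket file");
    (exited, socket_path.exists())
}

fn main() {
    let name = format!("shutdown-sketch-{}.sock", std::process::id());
    let path = std::env::temp_dir().join(name);
    let (exited, socket_remains) = run_and_stop(&path);
    println!("exited: {exited:?}, socket exists after stop? {socket_remains}");
}
```

Joining both handles before removing the file mirrors the checks in the report: neither loop is left running, and no socket path is left behind.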
## Final verdict
The daemon loop wiring is valid after `d760536`.
Validated on FreeBSD:
- daemon starts with `/tmp` data/socket/DB paths
- background daemon loop starts beside the socket server
- `intake-task` over Unix socket becomes a queued SQLite task on the scheduler tick
- `list-tasks` remains responsive after the tick
- SQLite WAL works
- graceful shutdown removes the socket and stops both the socket server and the daemon loop
The smoke directory was removed after recording this report.