docs: record OSA intake scheduler re-smoke
Commit 8199e23890 (parent d760536fe1)
1 changed file with 142 additions and 0 deletions
docs/internal/sessions/2026-05-27-osa-freebsd-intake-resmoke.md (normal file)
@@ -0,0 +1,142 @@
# OSA FreeBSD Intake Scheduler Re-smoke

**Date:** 27 May 2026
**Host:** osa.smilepowered.org
**OS:** FreeBSD 15.0-RELEASE-p9 amd64
**Repo:** `Clawdie/Colibri`
**Base pulled:** `af8c011` (daemon loop wiring landed)
**Fix commit:** `d760536` (`fix: avoid scheduler store deadlock on intake drain`)
**Status:** PASS after follow-up fix

## What was checked

Pulled the daemon-loop wiring (`9717ce7`, plus `af8c011`) and ran the full workspace gates:

```sh
cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo build --workspace --release
```

All four gates passed on the first run, but a live FreeBSD re-smoke under `/tmp` exposed one more runtime bug.

## Finding during live re-smoke

With `daemon::run_loop` now running, the `intake-task` request reached the scheduler tick and a task row was inserted into SQLite, but a concurrent `list-tasks` socket request timed out.

Root cause: `Scheduler::tick` used expressions like:

```rust
match state.store.lock().unwrap().create_task(...) {
    Ok(task) => {
        // The guard from the scrutinee is still held here,
        // so this second lock can block forever.
        state.store.lock().unwrap().list_agents()
    }
}
```

The `MutexGuard` temporary created in the `match` scrutinee is not dropped until the end of the whole `match` statement, so it is still held inside the match arm. Relocking `state.store` inside the arm asks for a lock the scheduler thread already holds, which can deadlock that thread against itself. On FreeBSD this manifested as:

- SQLite row was created
- socket request blocked on the store lock
- daemon became unresponsive until killed by the smoke harness timeout
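The scrutinee lifetime described above can be reproduced in isolation. What follows is a minimal sketch, not the Colibri code: `Store`, `create_task`, and both helper functions are hypothetical stand-ins, and it assumes the store sits behind a `std::sync::Mutex`. It uses `try_lock` instead of a second `lock()` so that a held guard shows up as a boolean rather than a hang.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the scheduler's store; not the real Colibri type.
struct Store {
    tasks: Vec<String>,
}

impl Store {
    fn create_task(&mut self, title: &str) -> Result<usize, String> {
        self.tasks.push(title.to_string());
        Ok(self.tasks.len())
    }
}

/// Returns true if the store mutex is still locked inside the match arm.
fn guard_held_in_match_arm(store: &Mutex<Store>) -> bool {
    // The guard returned by `lock()` is a temporary in the scrutinee, so it
    // is dropped only at the end of the whole `match`.
    match store.lock().unwrap().create_task("demo") {
        // `try_lock` reports the conflict instead of blocking the way a
        // second `lock()` could.
        Ok(_) => store.try_lock().is_err(),
        Err(_) => false,
    }
}

/// Same check, but with the guard confined to an inner block.
fn guard_held_after_scoped_block(store: &Mutex<Store>) -> bool {
    let result = {
        let mut guard = store.lock().unwrap();
        guard.create_task("demo")
    }; // guard dropped here
    match result {
        Ok(_) => store.try_lock().is_err(),
        Err(_) => false,
    }
}

fn main() {
    let store = Mutex::new(Store { tasks: Vec::new() });
    // Prints "held in arm: true".
    println!("held in arm: {}", guard_held_in_match_arm(&store));
    // Prints "held after scoped block: false".
    println!("held after scoped block: {}", guard_held_after_scoped_block(&store));
}
```

Binding the result of the locked call to a local before matching on it is what moves the guard's drop point ahead of the arm.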

## Fix

`d760536` rewrites scheduler store access so each guard lives in its own block and is dropped before the store is locked again:

```rust
let create_result = {
    let store = state.store.lock().unwrap();
    store.create_task(...)
}; // guard dropped here, before any further lock
```

It also adds a regression test:

```text
scheduler::tests::test_scheduler_tick_drains_intake_without_deadlock
```
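The body of that test is not reproduced in this report. As a rough illustration of how a deadlock regression test can fail fast instead of hanging the test run, here is a sketch under stated assumptions: `Store`, `State`, `tick`, and `tick_with_watchdog` are hypothetical names, and the real test may be structured quite differently. The tick runs on a worker thread and the caller waits on a channel with a timeout.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins; the real scheduler state and store are not shown
// in this report.
struct Store {
    tasks: Vec<String>,
    agents: Vec<String>,
}

struct State {
    store: Mutex<Store>,
}

// Stand-in for a tick that drains one intake item using scoped locks.
fn tick(state: &State, title: &str) -> usize {
    let created = {
        let mut store = state.store.lock().unwrap();
        store.tasks.push(title.to_string());
        store.tasks.len()
    }; // first guard dropped before the store is locked again
    let agent_count = {
        let store = state.store.lock().unwrap();
        store.agents.len()
    };
    created + agent_count
}

/// Runs `tick` on a worker thread and returns None if it does not finish
/// within `limit`, so a deadlock fails the check instead of hanging it.
fn tick_with_watchdog(state: Arc<State>, limit: Duration) -> Option<usize> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let result = tick(&state, "intake item");
        let _ = tx.send(result); // the receiver may already have given up
    });
    rx.recv_timeout(limit).ok()
}

fn main() {
    let state = Arc::new(State {
        store: Mutex::new(Store { tasks: Vec::new(), agents: Vec::new() }),
    });
    match tick_with_watchdog(Arc::clone(&state), Duration::from_secs(5)) {
        Some(n) => println!("tick finished, result {n}"),
        None => println!("tick did not finish: possible deadlock"),
    }
}
```

A watchdog like this leaves the stuck worker thread behind if the tick really deadlocks, which is acceptable in a test process that exits right after reporting the failure.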

## Re-smoke result after fix

Isolated `/tmp` environment:

```text
COLIBRI_DAEMON_DATA_DIR=/tmp/colibri-osa-resmoke-clawdie-1779908417/data
COLIBRI_DAEMON_SOCKET=/tmp/colibri-osa-resmoke-clawdie-1779908417/colibri.sock
COLIBRI_DB_PATH=/tmp/colibri-osa-resmoke-clawdie-1779908417/colibri.sqlite
COLIBRI_HOST=osa-resmoke
```

`intake-task` response:

```json
{"ok":true,"data":{"status":"queued"}}
```

The scheduler drained intake on the 30s tick:

```text
FOUND_ON_POLL=30
```

`list-tasks` then returned the queued task:

```json
{
  "ok": true,
  "data": [
    {
      "agent_id": null,
      "created_at": "2026-05-27T19:00:47.360062420+00:00",
      "description": "prove scheduler loop drains intake",
      "id": "c3dab9df-8a37-47b1-854b-d956fd796d41",
      "status": "queued",
      "title": "osa resmoke intake",
      "updated_at": "2026-05-27T19:00:47.360062420+00:00"
    }
  ]
}
```
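A responsiveness check of this kind comes down to a socket request with a read timeout, so that a blocked daemon produces an error instead of a hung client. The sketch below shows that pattern with the standard library only. It is not the smoke harness used here: the newline-delimited framing, the request body, and the `probe` and `serve_once` helpers are all assumptions, and a stub server stands in for the daemon so the example runs on its own.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Sends one request line and waits at most `limit` for one response line.
/// The newline-delimited framing is an assumption made for this sketch.
fn probe(socket: &Path, request: &str, limit: Duration) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket)?;
    stream.set_read_timeout(Some(limit))?;
    stream.write_all(request.as_bytes())?;
    stream.write_all(b"\n")?;
    let mut line = String::new();
    // A read that exceeds the timeout surfaces as an Err here.
    BufReader::new(stream).read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

/// Stub server standing in for the daemon: answers one request, then exits.
fn serve_once(listener: UnixListener, reply: &'static str) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        if let Ok((stream, _)) = listener.accept() {
            let mut reader = BufReader::new(stream);
            let mut request = String::new();
            let _ = reader.read_line(&mut request);
            let _ = reader.get_mut().write_all(reply.as_bytes());
        }
    })
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join(format!("probe-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path);
    let server = serve_once(UnixListener::bind(&path)?, "{\"ok\":true,\"data\":[]}\n");
    // Hypothetical request body; the daemon's real wire format is not shown here.
    let reply = probe(&path, "{\"command\":\"list-tasks\"}", Duration::from_secs(2))?;
    println!("{reply}");
    server.join().expect("stub server panicked");
    std::fs::remove_file(&path)
}
```

With a peer that never answers, `probe` returns an error once the timeout elapses, which is the client-side symptom reported for `list-tasks` before the fix.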

SQLite verification:

```text
tasks 1
journal_mode wal
```

Graceful shutdown:

```text
socket exists after stop? no
process remains? no
```

The daemon log showed both the socket server and the background loop exiting cleanly:

```text
daemon background loop started ... scheduler_secs=30
received interrupt signal, initiating graceful shutdown
socket server received shutdown signal
daemon loop received shutdown signal
daemon background loop exited
Herdr socket API shut down
colibri-daemon shut down cleanly
```

## Final verdict

The daemon loop wiring is valid after `d760536`.

Validated on FreeBSD:

- daemon starts with `/tmp` data/socket/DB paths
- background daemon loop starts beside the socket server
- `intake-task` over the Unix socket becomes a queued SQLite task on the scheduler tick
- `list-tasks` remains responsive after the tick
- SQLite runs in WAL mode (`journal_mode` reports `wal`)
- graceful shutdown removes the socket and stops both the socket server and the loop task

The smoke directory was removed after recording this report.