clawdie-ai/docs/FRESH-INSTALL-CHECKLIST.md

213 lines
5.4 KiB
Markdown
Raw Normal View History

# Fresh Install Checklist
Verification checklist for new Clawdie-AI installations (bare metal, bhyve VM,
or jail-based). Run after firstboot completes. Each check includes the exact
command and expected result.
Designed to work with the tmux-screenshot skill — capture each section for the
installation record.
## Timing milestones
Record wall-clock timestamps at each stage. On bhyve, the serial console
shows boot messages with timestamps.
| Milestone | Command / event | Record |
|-----------|----------------|--------|
| Boot start | First kernel message | `T0` |
| Firstboot wizard shown | `bsddialog` prompt appears | `T1 = T1 - T0` |
| Wizard complete | `[firstboot] Complete.` in log | `T2 = T2 - T0` |
| Desktop ready (Lumina) | `lightdm` login screen visible | `T3 = T3 - T0` |
| Agent responding | `/ping` on Telegram returns pong | `T4 = T4 - T0` |
Check firstboot log for exact timestamps:
```sh
head -5 /var/log/${AGENT_NAME}-firstboot.log
tail -5 /var/log/${AGENT_NAME}-firstboot.log
```
## 1. Jails running
```sh
sudo bastille list
```
Expected (agent name may vary):
```
JID IP Address Hostname Path
{agent}-cont.. 10.0.X.2 {agent}-controlplane /usr/local/bastille/jails/...
db 10.0.X.3 db /usr/local/bastille/jails/...
cms 10.0.X.4 cms /usr/local/bastille/jails/...
llamacpp 10.0.X.5 llamacpp /usr/local/bastille/jails/...
```
All four jails must be present and running. If any are missing:
```sh
grep -i 'fail\|error' /var/log/${AGENT_NAME}-firstboot.log
```
## 2. .env correctness
```sh
grep -E '^(AGENT_NAME|AGENT_GENDER|AGENT_DOMAIN|AGENT_INTERNAL_DOMAIN|AGENT_TMP_DIR|PI_TUI_PROVIDER|PI_TUI_MODEL|EMBED_BASE_URL|TELEGRAM_BOT_TOKEN)=' .env
```
Verify:
| Key | Expected |
|-----|----------|
| `AGENT_NAME` | Lowercase, no spaces (e.g. `clawdie`, `mevy`) |
| `AGENT_GENDER` | `f`, `m`, or `n` |
| `AGENT_DOMAIN` | Public domain (e.g. `clawdie.si`) or `{agent}.internal` for VMs |
| `AGENT_INTERNAL_DOMAIN` | `{agent}.home.arpa` (Tailscale / local DNS) |
| `AGENT_TMP_DIR` | Writable path, not `/tmp` |
| `PI_TUI_PROVIDER` | `zai`, `openrouter`, `anthropic`, etc. |
| `PI_TUI_MODEL` | Valid model for the provider |
| `EMBED_BASE_URL` | URL ending in `/v1` |
| `TELEGRAM_BOT_TOKEN` | Non-empty if `FEATURE_TELEGRAM=true` |
## 3. Watchdog IPC status
```sh
# Check socket exists
ls -la "${AGENT_TMP_DIR:-tmp}/ipc/"
# Query watchdog status
echo '{"cmd":"status"}' | nc -U "${AGENT_TMP_DIR:-tmp}/ipc/${AGENT_NAME}-watchdog.sock"
```
Expected: JSON response with `mode`, `throttle`, `memory`, `activeJails`.
If socket is missing, check if the agent process is running:
```sh
sudo bastille cmd "${AGENT_NAME}-controlplane" service clawdie status
```
## 4. Database connectivity
```sh
# From host — test PostgreSQL in db jail
sudo bastille cmd db service postgresql status
# Test connection (uses .env credentials)
npm run setup -- --step verify
```
Expected: `postgresql is running` and verify step exits 0.
## 5. LLM provider connectivity
```sh
# Quick inference test via pi
pi --provider "${PI_TUI_PROVIDER}" --model "${PI_TUI_MODEL}" -e "reply with OK"
```
Expected: Model responds. If using ZAI (GLM), verify the API key:
```sh
grep '^ZAI_API_KEY=' .env | cut -c1-20
```
## 6. Telegram bot
```sh
# Check bot token is valid (should return bot info)
curl -s "https://api.telegram.org/bot$(grep '^TELEGRAM_BOT_TOKEN=' .env | cut -d= -f2)/getMe" | python3 -m json.tool
```
Expected: `"ok": true` with the bot username.
## 7. Lumina desktop (baremetal only)
```sh
service lightdm status
service dbus status
```
If Lumina fails to start, check:
```sh
# X11 log
tail -30 /var/log/Xorg.0.log
# LightDM log
tail -30 /var/log/lightdm/lightdm.log
# GPU driver loaded?
pciconf -lv | grep -B3 'VGA'
```
## 8. Network and firewall
```sh
# PF rules loaded
sudo pfctl -sr | head -10
# NAT working (from inside the db jail)
sudo bastille cmd db ping -c 1 1.1.1.1
# Bridge healthy
ifconfig warden0 | grep 'inet '
```
## 9. ZFS health
```sh
zpool status -x
zfs list -o name,used,avail -t filesystem | head -20
```
Expected: `all pools are healthy`.
## 10. Screenshot smoke test
Capture the final state as proof of successful install:
```sh
python3 .agent/skills/tmux-screenshot/tmux-screenshot.py \
--session "${AGENT_NAME}" \
--base-url "https://${AGENT_DOMAIN}/screenshots" \
--publish
```
Verify the capture landed:
```sh
ls -la /usr/local/www/${AGENT_NAME}/screenshots/*.png | tail -3
```
## Log paths reference
| Log | Path |
|-----|------|
| Firstboot orchestrator | `/var/log/${AGENT_NAME}-firstboot.log` |
| Firstboot progress | `/var/log/${AGENT_NAME}-firstboot.progress` |
| Agent (production) | `logs/${AGENT_NAME}.log` (relative to project) |
| Watchdog | Same as agent log (pino structured) |
| Preflight run | `logs/preflight-{runstamp}/` |
| LightDM | `/var/log/lightdm/lightdm.log` |
| X11 | `/var/log/Xorg.0.log` |
| PostgreSQL | `/var/log/postgresql.log` (inside db jail) |
| nginx | `/var/log/nginx/error.log` |
## Running the full preflight
The automated version of this checklist:
```sh
# As root (for jail and firewall steps)
sudo npm run preflight
# With onboarding wizard
sudo npm run preflight -- --with-onboarding
# Stop on first failure
sudo npm run preflight -- --fail-fast
```
Results are written to `logs/preflight-{timestamp}/summary.json`.