diff --git a/skills/freebsd-truss-debug/SKILL.md b/skills/freebsd-truss-debug/SKILL.md
new file mode 100644
index 0000000..4d51b75
--- /dev/null
+++ b/skills/freebsd-truss-debug/SKILL.md
@@ -0,0 +1,112 @@
+---
+name: freebsd-truss-debug
+description: Debug FreeBSD process failures with truss by tracing syscalls to find the exact kernel call that fails (EACCES, ENOENT, etc.).
+---
+
+# FreeBSD truss Debugging
+
+`truss` traces every system call a process makes to the kernel. When a command
+works from a shell but fails from a daemon/service, `truss` shows exactly which
+syscall fails and which errno it returns.
+
+## Quick reference
+
+```sh
+# Trace a NEW process (follow children)
+sudo truss -f -o /tmp/trace.out command [args]
+
+# Attach to a RUNNING process
+sudo truss -f -o /tmp/trace.out -p PID
+
+# Common filters (truss prints errors as: ERR#<number> '<message>')
+grep 'ERR#' /tmp/trace.out                  # all errors
+grep -v 'ERR#2 ' /tmp/trace.out             # exclude "No such file" noise
+grep 'fork\|rfork\|execve' /tmp/trace.out   # process creation only
+grep 'ERR#1 \|ERR#13 ' /tmp/trace.out       # permission errors (EPERM=1, EACCES=13)
+```
+
+## When to use
+
+Use `truss` when a command works in one context but not another. Common scenarios:
+
+- A daemon (via `daemon(8)` or rc.d) gets EACCES but the same command works from a shell → PATH issue
+- Permission denied, but `sudo -u USER` works → staging directory ownership
+- "Text file busy" on binary replacement → a process is still holding the file
+- Silent failures with no error message → the syscall trace reveals the hidden error
+
+## Walkthrough: debugging a daemon spawn failure
+
+### 1. Start daemon under truss
+
+```sh
+sudo service daemon_name stop
+sleep 1; sudo rm -f /var/run/socket.sock /tmp/trace.out
+sudo truss -f -o /tmp/trace.out \
+  env COLIBRI_JAIL_PRIV_MODE=sudo \
+  COLIBRI_DAEMON_SOCKET=/var/run/socket.sock \
+  COLIBRI_DAEMON_DATA_DIR=/var/db/app \
+  /usr/local/bin/daemon-binary &
+sleep 3  # wait for socket ready
+```
+
+**Important:** pass the daemon's expected environment variables explicitly so the
+trace captures the real spawn path, not a misconfigured one.
+
+### 2. Trigger the failing operation
+
+```sh
+client-command --socket /var/run/socket.sock trigger-failure
+sleep 2
+```
+
+### 3. Stop and analyze
+
+```sh
+sudo pkill daemon-binary; wait
+wc -l /tmp/trace.out  # expect hundreds to thousands of lines
+
+# Find permission errors (EPERM=1, EACCES=13); the trailing space avoids matching ERR#10-19
+grep 'ERR#1 \|ERR#13 ' /tmp/trace.out
+
+# Find process creation (fork + exec)
+grep 'fork\|rfork\|execve' /tmp/trace.out
+```
+
+### 4. Interpret
+
+| Pattern | Meaning |
+|---------|---------|
+| `fork() ERR#35` | Can't create child process (EAGAIN: process or resource limit reached) |
+| `execve("/path/to/bin") ERR#13` | Binary exists but can't be executed (permissions, MAC) |
+| `execve("sudo") ERR#2` | Bare name: PATH doesn't include `/usr/local/bin` |
+| `open("/path") ERR#13` | File exists but can't be opened (ownership, mode) |
+| `mkdir("/path") ERR#13` | Parent directory not writable |
+| No fork/exec at all | Error happens BEFORE spawn: staging/validation failure |
+
+## Common daemon pitfalls caught by truss
+
+1. **Bare command names**: services started through rc.d or `daemon(8)` often run with a minimal PATH that omits `/usr/local/bin`, so `execvp("sudo")` can't find `/usr/local/bin/sudo`. Fix: use absolute paths or a fixed search list.
+
+2. **Staging directory ownership**: the daemon runs as an unprivileged user but the staging path was created by root. Fix: pre-create it with the correct ownership in the bootstrap script.
+
+3. **Orphaned processes holding socket**: `service stop` killed the supervisor but old background daemons still hold the socket. Fix: run `ps aux | grep 'daemon: name'` to find all supervisors and kill them all before starting.
+
+4. **Capsicum sandboxing**: if `cap_enter()` appears in the trace, the process entered capability mode and subsequent `open()`/`fork()` calls may fail. Fix: do all setup BEFORE `cap_enter()`.
+
+## ktrace / kdump (alternative)
+
+For long-running processes where `truss` output would be too large:
+
+```sh
+# Record (-i also traces children created after this point)
+sudo ktrace -i -f /tmp/ktrace.out -p PID
+# ... trigger the bug ...
+sudo ktrace -C  # stop all tracing
+
+# Read (kdump reports failures as "errno N", not "ERR#N")
+kdump -f /tmp/ktrace.out | less
+kdump -f /tmp/ktrace.out | grep 'fork\|execve\|errno'
+```
+
+`ktrace` writes to a binary file, so it's faster than `truss` for high-throughput
+processes. Use `kdump` to decode. Same syscalls, different capture mechanism and output format.
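
The grep filters used in the walkthrough could be wrapped in two small shell functions for repeated use. This is only a sketch: the function names `truss_err_summary` and `truss_perm_errors` are hypothetical, and both assume the trace uses truss's usual error rendering (`ERR#<number> '<message>'`) and defaults to the `/tmp/trace.out` path from the examples.

```sh
# Hypothetical helpers (names are illustrative, not part of truss).
# Both assume truss renders failures as: ... ERR#<number> '<message>'

# Count failing syscalls per errno number, most frequent first.
truss_err_summary() {
    grep -o 'ERR#[0-9]*' "${1:-/tmp/trace.out}" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'   # strip the padding uniq -c adds
}

# Print only EPERM (1) and EACCES (13) failures.
# The trailing space keeps ERR#1 from also matching ERR#10 through ERR#19.
truss_perm_errors() {
    grep -E 'ERR#(1|13) ' "${1:-/tmp/trace.out}"
}
```

After sourcing these into a shell, `truss_err_summary /tmp/trace.out` gives a quick count per errno, which helps judge how much of the trace is `ERR#2` noise before running `truss_perm_errors /tmp/trace.out` to look at the permission failures themselves.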