Add TLS cert lifecycle audit handoff (Sam & Claude)

The platform's nginx vhosts reference certs at
/usr/local/etc/nginx/ssl/{clawdie,docs}/fullchain.cer that are renewed
by acme.sh's own cron — outside Clawdie's audit, doctor, and morning
report. If acme.sh's cron breaks, fails silently, or a new domain isn't
re-issued, Clawdie has no warning shape until HTTPS dies.

Doc lays out the actual risk surface (it's observability, not renewal),
the on-host investigation steps to establish ground truth, a four-step
build plan starting with doctor checks (mirroring the dnsmasq pattern),
and three open questions that need Sam's input before code lands.

Highest-leverage step is wiring collectTlsIssues() into doctor-checks
so days-remaining and acme.sh cron status surface every morning. Lowest
effort, biggest payoff.

---
Build: FAIL | Tests: FAIL — 16 failed
Operator & Claude Code 2026-05-10 08:13:17 +02:00
parent b6d72d9353
commit 8654c4e145

@ -0,0 +1,196 @@
# TLS cert lifecycle audit + observability — handoff
**Author:** Claude (analysis on dev box, no host access)
**For:** Codex (runs on the FreeBSD host, can inspect cert files, cron, acme.sh state)
**Date:** 2026-05-10
**Branch:** suggested `tls-cert-lifecycle`
## Why this doc exists
The clawdie.si and docs.clawdie.si nginx vhosts reference TLS certs at:
```
/usr/local/etc/nginx/ssl/clawdie/fullchain.cer + clawdie.key
/usr/local/etc/nginx/ssl/docs/fullchain.cer + docs.key
```
Per `html/clawdie/guides/nginx-ssl.html`, these were issued and installed via `acme.sh`. acme.sh sets up its own cron job for renewal — typical Let's Encrypt cadence (renew at ~60 days, certs valid for 90).
**The risk isn't that nothing is renewing the certs** — acme.sh almost certainly is. The risk is that **Clawdie has no visibility into whether acme.sh is still working**. If its cron entry got dropped after a host reboot, if the renewal hook is failing silently, if the disk filled, if a new domain wasn't added to the issuance list — Clawdie's audit, doctor, and morning report all stay green right up to the moment HTTPS dies. There's no warning shape until clients start seeing cert errors.
The fix is observability + scheduled verification, not a new renewal mechanism.
## The actual risk surface
Three failure modes the platform currently can't see:
1. **acme.sh cron silently disabled or broken.** Renewal stops. The cert expires 30 to 90 days later, depending on how recently it was last renewed. Discovery vector today: a user reports the site is broken, or `curl` starts failing with certificate errors.
2. **Reload hook broken.** acme.sh renews the cert file but `service nginx reload` (the `--reloadcmd`) fails or was removed. nginx keeps serving the old cert until restart. Discovery vector today: same as above, but harder to diagnose because the cert file looks new.
3. **New domain added without re-issuance.** Operator adds a new tenant subdomain to nginx, forgets to add it to `acme.sh --issue`. The vhost serves a cert that doesn't cover the new SAN. Discovery vector today: tenant complains about cert mismatch.
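Failure mode 3 is mechanical enough to check in code: compare the hostnames nginx serves against the SAN list on the cert. The following is a minimal sketch, assuming both lists have already been extracted (for example from the vhost config and from `openssl x509 -text`). `sanCovers` and `findUncoveredHosts` are hypothetical names for illustration, not existing platform helpers.

```ts
// Hypothetical helper: true if a single SAN entry covers `host`.
// A wildcard such as *.clawdie.si matches exactly one leftmost label.
function sanCovers(san: string, host: string): boolean {
  const s = san.toLowerCase();
  const h = host.toLowerCase();
  if (!s.startsWith("*.")) return s === h;
  const suffix = s.slice(1); // e.g. ".clawdie.si"
  if (!h.endsWith(suffix)) return false;
  const label = h.slice(0, h.length - suffix.length);
  return label.length > 0 && !label.includes(".");
}

// Hypothetical helper: served hostnames that no SAN on the cert covers.
export function findUncoveredHosts(servedHosts: string[], sans: string[]): string[] {
  return servedHosts.filter((host) => !sans.some((san) => sanCovers(san, host)));
}

// Example: a tenant vhost was added but the cert was never re-issued.
// tenant1.clawdie.si is an illustrative hostname, not one from the registry.
const uncovered = findUncoveredHosts(
  ["clawdie.si", "docs.clawdie.si", "tenant1.clawdie.si"],
  ["clawdie.si", "docs.clawdie.si"],
);
console.log(uncovered);
```

An empty result means every served hostname is covered; anything else is a candidate issue line.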
## What to investigate first (on host)
Spend 15 minutes establishing ground truth before writing any code:
```bash
# 1. Is acme.sh installed and where?
which acme.sh
acme.sh --version
# 2. What certs does it manage?
acme.sh --list
# 3. What's its cron entry look like?
crontab -l -u root | grep -i acme
# also check the system crontab directories (FreeBSD packages use /usr/local/etc/cron.d/)
ls -la /etc/cron.d/ /usr/local/etc/cron.d/ 2>/dev/null | grep -i acme
# 4. When did each cert last get renewed (file mtime)?
ls -la /usr/local/etc/nginx/ssl/clawdie/
ls -la /usr/local/etc/nginx/ssl/docs/
# 5. Read the cert NotAfter field directly
openssl x509 -in /usr/local/etc/nginx/ssl/clawdie/fullchain.cer -noout -dates -subject
openssl x509 -in /usr/local/etc/nginx/ssl/docs/fullchain.cer -noout -dates -subject
# 6. Does the issued cert cover all the SANs we serve?
openssl x509 -in /usr/local/etc/nginx/ssl/clawdie/fullchain.cer -noout -text | grep -A1 'Subject Alternative Name'
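# 7. Inspect the renewal config for the domain (read-only).
#    acme.sh has no --dry-run flag, and --renew --force would perform a real
#    re-issuance, so do not run that as part of the audit.
acme.sh --info -d clawdie.si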
# 8. Check acme.sh's own log for recent activity
ls -la ~/.acme.sh/ 2>/dev/null || ls -la /root/.acme.sh/
tail -50 ~/.acme.sh/acme.sh.log 2>/dev/null || tail -50 /root/.acme.sh/acme.sh.log
```
Record:
- Path to acme.sh binary
- Cron entry (verbatim) — including which user it runs as
- For each cert: `notBefore`, `notAfter`, days remaining, SAN list
- Last successful renewal log line
- Whether `--reloadcmd` is set on each managed cert
That ground truth informs the design.
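The `notBefore` / `notAfter` values recorded above are also what the doctor check will need to parse. A minimal sketch of that parsing, assuming the usual `openssl x509 -noout -dates` output shape (`notAfter=Jun  9 23:59:59 2026 GMT`); `parseOpensslDate`, `parseCertDates`, and `CertValidity` are hypothetical names, and the sample dates are invented for illustration.

```ts
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Parses an OpenSSL date such as "Jun  9 23:59:59 2026 GMT" into epoch ms (UTC).
// Done by hand rather than via Date.parse, which is implementation-defined here.
function parseOpensslDate(text: string): number | null {
  const m = /^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+(\d{4})\s+GMT$/.exec(text.trim());
  if (!m) return null;
  const month = MONTHS.indexOf(m[1]);
  if (month < 0) return null;
  return Date.UTC(Number(m[6]), month, Number(m[2]), Number(m[3]), Number(m[4]), Number(m[5]));
}

export interface CertValidity {
  notBeforeMs: number;
  notAfterMs: number;
  daysRemaining: number; // whole days, negative once expired
}

// Extracts both dates from `openssl x509 -noout -dates` output.
export function parseCertDates(opensslOutput: string, nowMs: number): CertValidity | null {
  const before = /^notBefore=(.+)$/m.exec(opensslOutput);
  const after = /^notAfter=(.+)$/m.exec(opensslOutput);
  if (!before || !after) return null;
  const notBeforeMs = parseOpensslDate(before[1]);
  const notAfterMs = parseOpensslDate(after[1]);
  if (notBeforeMs === null || notAfterMs === null) return null;
  const msPerDay = 24 * 60 * 60 * 1000;
  return { notBeforeMs, notAfterMs, daysRemaining: Math.floor((notAfterMs - nowMs) / msPerDay) };
}

// Example with invented dates; `nowMs` is injected so the result is deterministic.
const sampleOutput = "notBefore=Mar 11 00:00:00 2026 GMT\nnotAfter=Jun  9 23:59:59 2026 GMT\n";
console.log(parseCertDates(sampleOutput, Date.UTC(2026, 4, 10)));
```

Returning `null` on any parse failure keeps the caller's job simple: an unparseable cert is just another issue line, not an exception.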
## What to build, in order
### Step 1: doctor checks for cert health (highest leverage, lowest effort)
Mirror the shape of `src/doctor-checks.ts` exactly. Add `collectTlsIssues()` next to `collectDnsIssues()` and `collectMorningReportIssues()`:
```ts
export const TLS_RENEWAL_WARNING_DAYS = 21;
export const TLS_RENEWAL_CRITICAL_DAYS = 7;

export interface TlsCheckDeps {
  spawnSync?: SpawnSyncText;
  certPaths?: string[]; // absolute paths to fullchain.cer files
  nowMs?: () => number;
}

export async function collectTlsIssues(
  deps: TlsCheckDeps = {},
): Promise<DoctorCheckResult> {
  // For each cert path:
  //   1. Run `openssl x509 -in <path> -noout -enddate -subject` via spawnSync
  //   2. Parse notAfter -> daysRemaining
  //   3. Push line: TLS_<cert-label>: <days-remaining> days (subject=...)
  //   4. Issue if days-remaining < TLS_RENEWAL_CRITICAL_DAYS (error)
  //      or < TLS_RENEWAL_WARNING_DAYS (warning, but still an issue line)
  //   5. If file missing or openssl fails, that's an issue.
}
```
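Steps 3 to 5 of that outline can be sketched separately from the openssl call. The block below is one possible shape, not the final implementation: `SketchCheckResult` is an assumed stand-in because the real `DoctorCheckResult` in `src/doctor-checks.ts` is not shown here, and `CertObservation` and `summarizeCerts` are hypothetical names. The exact issue wording is also an assumption.

```ts
export const TLS_RENEWAL_WARNING_DAYS = 21;
export const TLS_RENEWAL_CRITICAL_DAYS = 7;

// ASSUMPTION: stand-in for DoctorCheckResult; the real type may differ.
interface SketchCheckResult {
  lines: string[];
  issues: string[];
}

// Hypothetical input: one entry per cert, with null when the file was
// missing or openssl failed.
interface CertObservation {
  label: string;
  daysRemaining: number | null;
  subject?: string;
}

export function summarizeCerts(observations: CertObservation[]): SketchCheckResult {
  const result: SketchCheckResult = { lines: [], issues: [] };
  for (const cert of observations) {
    const key = `TLS_${cert.label}`;
    if (cert.daysRemaining === null) {
      // Step 5: unreadable certs are issues, never exceptions.
      result.lines.push(`${key}: unreadable`);
      result.issues.push(`${key}: cert file missing or openssl failed`);
      continue;
    }
    // Step 3: always report the observed state.
    result.lines.push(`${key}: ${cert.daysRemaining} days (subject=${cert.subject ?? "unknown"})`);
    // Step 4: critical takes precedence over warning.
    if (cert.daysRemaining < TLS_RENEWAL_CRITICAL_DAYS) {
      result.issues.push(`${key}: CRITICAL, expires in ${cert.daysRemaining} days`);
    } else if (cert.daysRemaining < TLS_RENEWAL_WARNING_DAYS) {
      result.issues.push(`${key}: warning, expires in ${cert.daysRemaining} days`);
    }
  }
  return result;
}

// Example: one healthy cert and one inside the warning window.
console.log(summarizeCerts([
  { label: "clawdie", daysRemaining: 60, subject: "CN=clawdie.si" },
  { label: "docs", daysRemaining: 14, subject: "CN=docs.clawdie.si" },
]));
```

Keeping the classification pure (no process spawning, no clock) makes the five test cases listed for `src/doctor-checks.test.ts` straightforward to write.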
Default `certPaths` should derive from the live nginx vhost paths the platform commits to:
```
/usr/local/etc/nginx/ssl/clawdie/fullchain.cer
/usr/local/etc/nginx/ssl/docs/fullchain.cer
```
Make `certPaths` overridable via a `TLS_CERT_PATHS` env var (comma-separated) so operators with extra domains can extend the check without code changes.
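A minimal sketch of that env-var handling follows. It assumes `TLS_CERT_PATHS` replaces the defaults when set, which matches how the validation script uses it; if operators should instead append to the defaults, concatenate the two lists. `resolveCertPaths` and `DEFAULT_CERT_PATHS` are hypothetical names.

```ts
const DEFAULT_CERT_PATHS = [
  "/usr/local/etc/nginx/ssl/clawdie/fullchain.cer",
  "/usr/local/etc/nginx/ssl/docs/fullchain.cer",
];

// Parses a comma-separated TLS_CERT_PATHS value.
// ASSUMPTION: a non-empty value replaces the defaults rather than extending them.
export function resolveCertPaths(envValue: string | undefined): string[] {
  const parsed = (envValue ?? "")
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  // Drop duplicates while preserving order.
  const unique = parsed.filter((p, i) => parsed.indexOf(p) === i);
  return unique.length > 0 ? unique : [...DEFAULT_CERT_PATHS];
}

// In the doctor this would be called as resolveCertPaths(process.env.TLS_CERT_PATHS).
console.log(resolveCertPaths(undefined));
console.log(resolveCertPaths("/tmp/test-near-expiry.cer"));
```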
Wire `collectTlsIssues()` into `src/doctor.ts` next to the DNS and morning-report checks.
Tests in `src/doctor-checks.test.ts`:
- happy path: cert with 60 days remaining → no issues
- warning: 14 days remaining → issue raised
- critical: 3 days remaining → issue raised, distinct phrasing
- missing file: issue raised
- openssl failure: issue raised, doesn't crash the doctor
This gives the operator the warning shape the platform is missing today.
### Step 2: surface in the morning report
Once doctor checks land, the morning report task (`morning-report-0800`) should include cert state in its summary. Whoever owns the report-prompt assembly should pull `collectTlsIssues()` output the same way it pulls the platform audit. Estimate: 10 minutes once the data is available.
This means at 08:00 every morning, the operator sees "all certs valid, 47/52 days remaining" or "WARNING: clawdie.si cert expires in 12 days." No surprises.
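The two example lines above could be produced by a small formatter like the one below. This is a sketch under assumptions: `CertDays` and `formatTlsSummary` are hypothetical names, the 21-day default mirrors `TLS_RENEWAL_WARNING_DAYS`, and how the report task actually consumes the string is left open.

```ts
// Hypothetical input shape: one entry per monitored hostname.
interface CertDays {
  host: string;
  daysRemaining: number;
}

// Builds the one-line TLS summary for the morning report.
export function formatTlsSummary(certs: CertDays[], warningDays = 21): string {
  if (certs.length === 0) return "WARNING: no TLS certs were checked";
  const expiring = certs.filter((c) => c.daysRemaining < warningDays);
  if (expiring.length > 0) {
    return expiring
      .map((c) => `WARNING: ${c.host} cert expires in ${c.daysRemaining} days`)
      .join("; ");
  }
  return `all certs valid, ${certs.map((c) => c.daysRemaining).join("/")} days remaining`;
}

console.log(formatTlsSummary([
  { host: "clawdie.si", daysRemaining: 47 },
  { host: "docs.clawdie.si", daysRemaining: 52 },
]));
// prints "all certs valid, 47/52 days remaining"
```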
### Step 3: verify acme.sh cron exists (defensive)
Add a fourth doctor check, `collectAcmeRenewalIssues()`:
- Read `crontab -l -u root` (and optionally `/etc/cron.d/*`)
- Look for an entry containing `acme.sh` or `--cron`
- If absent, raise an issue
- If present, surface the schedule line for visibility
This catches the "renewal cron got dropped after a host reboot" case independently of the cert-expiry check, which would only flag it weeks later, once the remaining validity falls below the 21-day warning threshold.
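The matching rule in the bullets above can be sketched as a pure function over crontab text, which keeps it testable without touching the host. `findAcmeCronLines` is a hypothetical name, and the sample entry is only a typical acme.sh cron line, not the verbatim one from this host.

```ts
// Returns active (non-comment) crontab lines that look like an acme.sh renewal job.
export function findAcmeCronLines(crontabText: string): string[] {
  return crontabText
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Commented-out entries must not count as a working cron job.
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .filter((line) => line.includes("acme.sh") || /(^|\s)--cron(\s|$)/.test(line));
}

// Example crontab: one disabled entry, one active entry, one unrelated job.
const sampleCrontab = [
  '# 5 0 * * * "/root/.acme.sh"/acme.sh --cron --home "/root/.acme.sh" > /dev/null',
  '12 0 * * * "/root/.acme.sh"/acme.sh --cron --home "/root/.acme.sh" > /dev/null',
  "0 3 * * * /usr/local/bin/backup.sh",
].join("\n");

const acmeLines = findAcmeCronLines(sampleCrontab);
console.log(acmeLines.length === 0 ? "ACME_CRON: missing" : `ACME_CRON: ${acmeLines[0]}`);
```

An empty result becomes the issue; a non-empty result is surfaced verbatim so the operator can see the schedule.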
### Step 4 (optional): codify acme.sh setup
Currently `html/clawdie/guides/nginx-ssl.html` is a manual operator runbook. A fresh install requires the operator to step through it by hand. Could convert it to a `setup/tls.ts` step that:
- Installs acme.sh (`pkg install acme.sh`)
- Issues the cert for clawdie.si + www.clawdie.si + docs.clawdie.si (or whatever set the registry declares as public hosts)
- Runs `--install-cert` with the canonical paths and `--reloadcmd "service nginx reload"`
- Confirms acme.sh's own cron entry was created
This is the higher-effort, higher-payoff option. **Recommend skipping it in this initial pass** — observability is the urgent gap; codified setup is a separate slice once we know the failure modes the doctor surfaces in real operation.
## Open questions for Sam
1. **Public-hostname source of truth.** The cert audit needs to know which hostnames Clawdie is supposed to serve over HTTPS. Today they're hardcoded in nginx vhosts (clawdie.si, docs.clawdie.si). If/when more land (mail.clawdie.si, status.clawdie.si, etc.), where should the doctor read the canonical list from? Options: (a) parse nginx vhosts directly, (b) add a `public_hosts:` array to `infra/tenants.yaml`, (c) hardcode in doctor and accept yearly drift. **Recommend (b)** — registry-driven, consistent with the rest of the platform's "tenants.yaml is the source of truth" stance.
2. **Cert ownership classification.** Does the operator want certs to appear in `/publishreport`'s platform audit alongside services and jails? Same shape as the existing audit (observed vs declared, with status). If yes, that's a small `platform-audit-report.ts` extension after the doctor lands.
3. **acme.sh as a registered platform service?** Following the dnsmasq/pf/nginx pattern. If yes, add `acme.sh` to `shared.services` and treat its cron as the lifecycle. Probably NOT worth doing — acme.sh isn't a daemon, it's a cron-driven script. The doctor check (Step 3) covers the same ground without forcing the registry to model non-daemon services.
## Validation script for after work lands
```bash
# Doctor sees current cert state
just doctor | grep TLS_
# Doctor flags an artificially-near-expiry cert (use a test cert with 5d remaining)
TLS_CERT_PATHS=/tmp/test-near-expiry.cer just doctor | grep "expires in"
# Morning report includes TLS line at 08:00
# (or trigger manually via whatever invokes the report task)
# acme.sh cron is verified
just doctor | grep ACME_CRON
```
## Files you'll likely touch
- `src/doctor-checks.ts` — add `collectTlsIssues()`, `collectAcmeRenewalIssues()`
- `src/doctor-checks.test.ts` — tests for both
- `src/doctor.ts` — wire the new checks into the orchestration
- `src/reports/...` (wherever morning-report assembly lives) — pull TLS state into summary
- Optionally `infra/tenants.yaml` — add `public_hosts:` if going with option (b) above
## Deliberate non-goals for this slice
- Replacing acme.sh with another ACME client. Keep what works.
- Auto-rotating certs ourselves. Trust acme.sh, just observe it.
- Touching the existing nginx vhost cert paths. They're the spec; doctor reads from them.
- Re-issuing or renewing certs from doctor. Read-only audit only — anything that mutates cert state belongs in a deliberate operator action, not a periodic check.
## What "done" looks like
- `just doctor` reports days-remaining for each cert and the acme.sh cron status, with named issue lines for warning/critical thresholds.
- 08:00 morning report includes a TLS line so the operator sees cert health every day without thinking about it.
- A test cert with simulated near-expiry triggers the right issue text in the doctor output.
- No change to current renewal behavior — acme.sh keeps doing what it does, the platform now just watches.