Add TLS cert lifecycle audit handoff (Sam & Claude)
The platform's nginx vhosts reference certs at
/usr/local/etc/nginx/ssl/{clawdie,docs}/fullchain.cer that are renewed
by acme.sh's own cron — outside Clawdie's audit, doctor, and morning
report. If acme.sh's cron breaks, fails silently, or a new domain isn't
re-issued, Clawdie has no warning shape until HTTPS dies.
Doc lays out the actual risk surface (it's observability, not renewal),
the on-host investigation steps to establish ground truth, a four-step
build plan starting with doctor checks (mirroring the dnsmasq pattern),
and three open questions that need Sam's input before code lands.
Highest-leverage step is wiring collectTlsIssues() into doctor-checks
so days-remaining and acme.sh cron status surface every morning. Lowest
effort, biggest payoff.
---
Build: FAIL | Tests: FAIL — 16 failed
parent b6d72d9353
commit 8654c4e145
1 changed file with 196 additions and 0 deletions
docs/internal/TLS-CERT-LIFECYCLE-HANDOFF.md (normal file)
@@ -0,0 +1,196 @@
# TLS cert lifecycle audit + observability: handoff

**Author:** Claude (analysis on dev box, no host access)
**For:** Codex (runs on the FreeBSD host, can inspect cert files, cron, acme.sh state)
**Date:** 2026-05-10
**Branch:** suggested `tls-cert-lifecycle`

## Why this doc exists

The clawdie.si and docs.clawdie.si nginx vhosts reference TLS certs at:

```
/usr/local/etc/nginx/ssl/clawdie/fullchain.cer + clawdie.key
/usr/local/etc/nginx/ssl/docs/fullchain.cer + docs.key
```

Per `html/clawdie/guides/nginx-ssl.html`, these were issued and installed via `acme.sh`. acme.sh sets up its own cron job for renewal, on the typical Let's Encrypt cadence (renew at ~60 days, certs valid for 90).

**The risk isn't that nothing is renewing the certs.** acme.sh almost certainly is. The risk is that **Clawdie has no visibility into whether acme.sh is still working**. If its cron entry got dropped after a host reboot, if the renewal hook is failing silently, if the disk filled, or if a new domain wasn't added to the issuance list, Clawdie's audit, doctor, and morning report all stay green right up to the moment HTTPS dies. There's no warning shape until clients start seeing cert errors.

The fix is observability + scheduled verification, not a new renewal mechanism.

## The actual risk surface

Three failure modes the platform currently can't see:

1. **acme.sh cron silently disabled or broken.** Renewal stops. Because acme.sh renews at ~60 days, the cert on disk expires 30 to 90 days later, depending on when it was last renewed. Discovery vector today: a user reports the site is broken, or `curl` starts failing with certificate errors.
2. **Reload hook broken.** acme.sh renews the cert file but `service nginx reload` (the `--reloadcmd`) fails or was removed. nginx keeps serving the old cert until restart. Discovery vector today: same as above, but harder to diagnose because the cert file looks new.
3. **New domain added without re-issuance.** Operator adds a new tenant subdomain to nginx but forgets to include it in `acme.sh --issue`. The vhost serves a cert that doesn't cover the new SAN. Discovery vector today: tenant complains about cert mismatch.
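
Failure mode 3 is mechanical to check once a cert's SAN list has been extracted. A minimal sketch, assuming the SANs are already parsed into an array of DNS names; `sanCovers` and `uncoveredHosts` are hypothetical helpers (not existing platform code), and the wildcard rule (one left-most label only) follows the usual TLS hostname-matching convention:

```typescript
// Hypothetical helper: does one SAN entry cover the given hostname?
// A wildcard such as "*.clawdie.si" matches exactly one left-most label.
function sanCovers(san: string, host: string): boolean {
  const s = san.toLowerCase();
  const h = host.toLowerCase();
  if (s === h) return true;
  if (!s.startsWith("*.")) return false;
  const suffix = s.slice(1); // "*.clawdie.si" -> ".clawdie.si"
  if (!h.endsWith(suffix)) return false;
  const label = h.slice(0, h.length - suffix.length);
  return label.length > 0 && !label.includes(".");
}

// Hypothetical helper: hostnames we serve that no SAN on the cert covers.
function uncoveredHosts(sans: string[], served: string[]): string[] {
  return served.filter((host) => !sans.some((san) => sanCovers(san, host)));
}
```

For example, a cert issued for `clawdie.si` and `www.clawdie.si` leaves a newly added `mail.clawdie.si` vhost uncovered, which is the situation the doctor should flag.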

## What to investigate first (on host)

Spend 15 minutes establishing ground truth before writing any code:

```bash
# 1. Is acme.sh installed and where?
which acme.sh
acme.sh --version

# 2. What certs does it manage?
acme.sh --list

# 3. What's its cron entry look like?
crontab -l -u root | grep -i acme
# also check the system cron directories for non-user crontabs
ls -la /etc/cron.d/ /usr/local/etc/cron.d/ 2>/dev/null | grep -i acme

# 4. When did each cert last get renewed (file mtime)?
ls -la /usr/local/etc/nginx/ssl/clawdie/
ls -la /usr/local/etc/nginx/ssl/docs/

# 5. Read the cert NotAfter field directly
openssl x509 -in /usr/local/etc/nginx/ssl/clawdie/fullchain.cer -noout -dates -subject
openssl x509 -in /usr/local/etc/nginx/ssl/docs/fullchain.cer -noout -dates -subject

# 6. Does the issued cert cover all the SANs we serve?
openssl x509 -in /usr/local/etc/nginx/ssl/clawdie/fullchain.cer -noout -text | grep -A1 'Subject Alternative Name'

# 7. Inspect the stored renewal config (read-only).
#    acme.sh has no --dry-run flag, and `--renew --force` would issue a real cert,
#    so don't force a renewal just to test.
acme.sh --info -d clawdie.si

# 8. Check acme.sh's own log for recent activity (the log may be absent if logging was never enabled)
ls -la ~/.acme.sh/ 2>/dev/null || ls -la /root/.acme.sh/
tail -50 ~/.acme.sh/acme.sh.log 2>/dev/null || tail -50 /root/.acme.sh/acme.sh.log
```

Record:

- Path to acme.sh binary
- Cron entry (verbatim), including which user it runs as
- For each cert: `notBefore`, `notAfter`, days remaining, SAN list
- Last successful renewal log line
- Whether `--reloadcmd` is set on each managed cert

That ground truth informs the design.
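
The "days remaining" figure can be derived directly from the `openssl x509 -dates` output in check 5. A minimal sketch, assuming the usual `notAfter=Mon DD HH:MM:SS YYYY GMT` line format; `parseOpensslDate` and `daysRemaining` are hypothetical names, and the parsing is done by hand rather than relying on `Date.parse` accepting that format:

```typescript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Parse a line such as "notAfter=Jun  9 23:59:59 2026 GMT" out of the openssl
// output and return it as epoch milliseconds, or null if the line is absent.
function parseOpensslDate(output: string, field: "notBefore" | "notAfter"): number | null {
  const pattern = new RegExp(
    `^${field}=(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}):(\\d{2}):(\\d{2})\\s+(\\d{4})\\s+GMT\\s*$`,
    "m",
  );
  const m = pattern.exec(output);
  if (!m) return null;
  const month = MONTHS.indexOf(m[1] ?? "");
  if (month < 0) return null;
  return Date.UTC(Number(m[6]), month, Number(m[2]), Number(m[3]), Number(m[4]), Number(m[5]));
}

// Whole days between "now" and expiry; negative once the cert has expired.
function daysRemaining(notAfterMs: number, nowMs: number): number {
  return Math.floor((notAfterMs - nowMs) / (24 * 60 * 60 * 1000));
}
```

With `notAfter=Jun  9 23:59:59 2026 GMT` and a clock at 2026-05-10 00:00 UTC, `daysRemaining` returns 30. Taking `nowMs` as a parameter keeps the function testable without touching the system clock.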

## What to build, in order

### Step 1: doctor checks for cert health (highest leverage, lowest effort)

Mirror the shape of `src/doctor-checks.ts` exactly. Add `collectTlsIssues()` next to `collectDnsIssues()` and `collectMorningReportIssues()`:

```ts
// SpawnSyncText and DoctorCheckResult are assumed to be the types already
// used by the other checks in src/doctor-checks.ts.
export const TLS_RENEWAL_WARNING_DAYS = 21;
export const TLS_RENEWAL_CRITICAL_DAYS = 7;

export interface TlsCheckDeps {
  spawnSync?: SpawnSyncText;
  certPaths?: string[]; // absolute paths to fullchain.cer files
  nowMs?: () => number;
}

export async function collectTlsIssues(
  deps: TlsCheckDeps = {},
): Promise<DoctorCheckResult> {
  // For each cert path:
  //   1. Run `openssl x509 -in <path> -noout -enddate -subject` via spawnSync
  //   2. Parse notAfter -> daysRemaining
  //   3. Push line: TLS_<cert-label>: <days-remaining> days (subject=...)
  //   4. Issue if days-remaining < TLS_RENEWAL_CRITICAL_DAYS (error)
  //      or < TLS_RENEWAL_WARNING_DAYS (warning, but still an issue line)
  //   5. If file missing or openssl fails, that's an issue.
  throw new Error("collectTlsIssues: not implemented yet"); // placeholder so the stub type-checks
}
```

Default `certPaths` should derive from the live nginx vhost paths the platform commits to:

```
/usr/local/etc/nginx/ssl/clawdie/fullchain.cer
/usr/local/etc/nginx/ssl/docs/fullchain.cer
```

Make `certPaths` overridable via a `TLS_CERT_PATHS` env var (comma-separated) so operators with extra domains can extend the check without code changes.
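
One way the env override could be resolved, as a sketch only. `resolveCertPaths` and `DEFAULT_CERT_PATHS` are hypothetical names, and this version assumes a non-empty `TLS_CERT_PATHS` replaces the defaults rather than appending to them; the raw value is passed in (for example `process.env.TLS_CERT_PATHS`) so the function stays pure:

```typescript
// Defaults taken from the vhost paths listed in this doc.
const DEFAULT_CERT_PATHS: string[] = [
  "/usr/local/etc/nginx/ssl/clawdie/fullchain.cer",
  "/usr/local/etc/nginx/ssl/docs/fullchain.cer",
];

// Split a comma-separated env value into trimmed, non-empty paths.
// Assumption: any non-empty override replaces the defaults entirely.
function resolveCertPaths(raw: string | undefined): string[] {
  const paths = (raw ?? "")
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return paths.length > 0 ? paths : [...DEFAULT_CERT_PATHS];
}
```

If operators should be able to add domains without restating the defaults, concatenating the parsed paths onto `DEFAULT_CERT_PATHS` is the alternative; which behavior is wanted is a design call for whoever implements Step 1.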

Wire `collectTlsIssues()` into `src/doctor.ts` next to the DNS and morning-report checks.

Tests in `src/doctor-checks.test.ts`:

- happy path: cert with 60 days remaining → no issues
- warning: 14 days remaining → issue raised
- critical: 3 days remaining → issue raised, distinct phrasing
- missing file: issue raised
- openssl failure: issue raised, doesn't crash the doctor

This gives the operator the warning shape the platform is missing today.
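
The threshold logic those tests exercise can be kept as a small pure function, separate from the `openssl` call. A sketch under the thresholds given earlier in this step; `classifyDaysRemaining`, `tlsIssueLine`, and the exact issue wording are illustrative assumptions, not the platform's existing phrasing:

```typescript
const TLS_RENEWAL_WARNING_DAYS = 21;
const TLS_RENEWAL_CRITICAL_DAYS = 7;

type TlsSeverity = "ok" | "warning" | "critical";

// Map days remaining onto the two thresholds.
function classifyDaysRemaining(days: number): TlsSeverity {
  if (days < TLS_RENEWAL_CRITICAL_DAYS) return "critical";
  if (days < TLS_RENEWAL_WARNING_DAYS) return "warning";
  return "ok";
}

// Build the issue line for one cert, or null when no issue should be raised.
function tlsIssueLine(label: string, days: number): string | null {
  const severity = classifyDaysRemaining(days);
  if (severity === "ok") return null;
  if (days < 0) return `CRITICAL: ${label} cert expired ${-days} days ago`;
  const prefix = severity === "critical" ? "CRITICAL" : "WARNING";
  return `${prefix}: ${label} cert expires in ${days} days`;
}
```

The three day-count cases from the test list map onto this directly: 60 days gives no line, 14 days gives a `WARNING` line, and 3 days gives a `CRITICAL` line with distinct phrasing.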

### Step 2: surface in the morning report

Once doctor checks land, the morning report task (`morning-report-0800`) should include cert state in its summary. Whoever owns the report-prompt assembly should pull `collectTlsIssues()` output the same way it pulls the platform audit. Estimate: 10 minutes once the data is available.

This means at 08:00 every morning, the operator sees "all certs valid, 47/52 days remaining" or "WARNING: clawdie.si cert expires in 12 days." No surprises.
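
A sketch of how the summary line quoted above could be assembled from per-cert state. `CertState` and `tlsSummaryLine` are hypothetical; the real report assembly may want a different shape, and the 21-day default simply reuses the Step 1 warning threshold:

```typescript
interface CertState {
  host: string;
  daysRemaining: number;
}

// One line for the morning report: either an all-clear with the day counts,
// or one warning per cert that is under the threshold.
function tlsSummaryLine(certs: CertState[], warningDays: number = 21): string {
  if (certs.length === 0) return "no certs configured";
  const expiring = certs.filter((c) => c.daysRemaining < warningDays);
  if (expiring.length === 0) {
    const counts = certs.map((c) => c.daysRemaining).join("/");
    return `all certs valid, ${counts} days remaining`;
  }
  return expiring
    .map((c) => `WARNING: ${c.host} cert expires in ${c.daysRemaining} days`)
    .join("; ");
}
```

Healthy certs collapse into a single short line, so the report only grows when something needs attention.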

### Step 3: verify acme.sh cron exists (defensive)

Add a fourth doctor check, `collectAcmeRenewalIssues()`:

- Read `crontab -l -u root` (and optionally `/etc/cron.d/*` and `/usr/local/etc/cron.d/*`)
- Look for an entry containing `acme.sh` or `--cron`
- If absent, raise an issue
- If present, surface the schedule line for visibility

This catches the "renewal cron got dropped after a host reboot" case independently of the cert-expiry check. On its own, the expiry check stays quiet until days-remaining falls below the 21-day warning threshold, which for a 90-day cert can be up to ~69 days after the cron broke.
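
The crontab scan itself is a few lines once the crontab text is in hand. A minimal sketch, assuming the text comes from `crontab -l -u root` via the injected `spawnSync`; `findAcmeCronEntries` and `acmeCronStatus` are hypothetical names, the `ACME_CRON` label matches the validation script in this doc, and commented-out entries are deliberately treated as absent:

```typescript
// Active (non-comment, non-blank) crontab lines that mention acme.sh.
function findAcmeCronEntries(crontabText: string): string[] {
  return crontabText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .filter((line) => line.includes("acme.sh"));
}

// Doctor-style result: ok flag plus the line to print.
function acmeCronStatus(crontabText: string): { ok: boolean; line: string } {
  const entries = findAcmeCronEntries(crontabText);
  if (entries.length === 0) {
    return { ok: false, line: "ACME_CRON: missing (no active acme.sh entry found)" };
  }
  return { ok: true, line: `ACME_CRON: ${entries[0]}` };
}
```

Matching on `acme.sh` alone keeps the check narrow; a bare `--cron` match could pick up unrelated jobs, so whether to include it is worth deciding against the verbatim cron entry recorded during investigation.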

### Step 4 (optional): codify acme.sh setup

Currently `html/clawdie/guides/nginx-ssl.html` is a manual operator runbook. A fresh install requires the operator to step through it by hand. It could be converted to a `setup/tls.ts` step that:

- Installs acme.sh (`pkg install acme.sh`)
- Issues the cert for clawdie.si + www.clawdie.si + docs.clawdie.si (or whatever set the registry declares as public hosts)
- Runs `--install-cert` with the canonical paths and `--reloadcmd "service nginx reload"`
- Confirms acme.sh's own cron entry was created

This is the higher-effort, higher-payoff option. **Recommend skipping it in this initial pass**: observability is the urgent gap, and codified setup is a separate slice once we know the failure modes the doctor surfaces in real operation.

## Open questions for Sam

1. **Public-hostname source of truth.** The cert audit needs to know which hostnames Clawdie is supposed to serve over HTTPS. Today they're hardcoded in nginx vhosts (clawdie.si, docs.clawdie.si). If/when more land (mail.clawdie.si, status.clawdie.si, etc.), where should the doctor read the canonical list from? Options: (a) parse nginx vhosts directly, (b) add a `public_hosts:` array to `infra/tenants.yaml`, (c) hardcode in doctor and accept yearly drift. **Recommend (b)**: registry-driven, consistent with the rest of the platform's "tenants.yaml is the source of truth" stance.

2. **Cert ownership classification.** Does the operator want certs to appear in `/publishreport`'s platform audit alongside services and jails? Same shape as the existing audit (observed vs declared, with status). If yes, that's a small `platform-audit-report.ts` extension after the doctor lands.

3. **acme.sh as a registered platform service?** Following the dnsmasq/pf/nginx pattern. If yes, add `acme.sh` to `shared.services` and treat its cron as the lifecycle. Probably NOT worth doing: acme.sh isn't a daemon, it's a cron-driven script. The doctor check (Step 3) covers the same ground without forcing the registry to model non-daemon services.

## Validation script for after work lands

```bash
# Doctor sees current cert state
just doctor | grep TLS_

# Doctor flags an artificially-near-expiry cert (self-signed test cert with 5d remaining)
openssl req -x509 -newkey rsa:2048 -nodes -days 5 -subj "/CN=test-near-expiry.invalid" \
  -keyout /tmp/test-near-expiry.key -out /tmp/test-near-expiry.cer
TLS_CERT_PATHS=/tmp/test-near-expiry.cer just doctor | grep "expires in"

# Morning report includes TLS line at 08:00
# (or trigger manually via whatever invokes the report task)

# acme.sh cron is verified
just doctor | grep ACME_CRON
```

## Files you'll likely touch

- `src/doctor-checks.ts`: add `collectTlsIssues()`, `collectAcmeRenewalIssues()`
- `src/doctor-checks.test.ts`: tests for both
- `src/doctor.ts`: wire the new checks into the orchestration
- `src/reports/...` (wherever morning-report assembly lives): pull TLS state into summary
- Optionally `infra/tenants.yaml`: add `public_hosts:` if going with option (b) above

## Deliberate non-goals for this slice

- Replacing acme.sh with another ACME client. Keep what works.
- Auto-rotating certs ourselves. Trust acme.sh, just observe it.
- Touching the existing nginx vhost cert paths. They're the spec; doctor reads from them.
- Re-issuing or renewing certs from doctor. Read-only audit only: anything that mutates cert state belongs in a deliberate operator action, not a periodic check.

## What "done" looks like

- `just doctor` reports days-remaining for each cert and the acme.sh cron status, with named issue lines for warning/critical thresholds.
- 08:00 morning report includes a TLS line so the operator sees cert health every day without thinking about it.
- A test cert with simulated near-expiry triggers the right issue text in the doctor output.
- No change to current renewal behavior: acme.sh keeps doing what it does, and the platform now just watches.