Update system-namespace review with resolved findings

Reflects fixes landed in 8414953, 5c685f1, 7124c1c, 7acf771. Open
section now lists only what still needs attention; resolved items kept
as historical context with the commit that closed each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---
Build: FAIL | Tests: FAIL — 15 failed
This commit is contained in:
Operator & Claude Code 2026-05-03 21:01:53 +02:00
parent 7acf771a3b
commit 6f03f0d5b0

View file

@ -1,112 +1,68 @@
# system-namespace branch review
**Date:** 2026-05-03
**Original audit date:** 2026-05-03
**Last updated:** 2026-05-03 (post-`7acf771`)
**Branch:** `system-namespace`
**Commit range audited:** `aacfa16..16400ea` (19 commits, ~2.4k insertions / ~787 deletions, 58 files)
**Reviewer:** Claude Opus 4.7 (Linux/Node 22 dev box)
**Original commit range audited:** `aacfa16..16400ea` (19 commits)
**Reviewer:** Claude Opus 4.7
**Production target:** FreeBSD + Node 24
This review enumerates blind spots and risks in the recent batch of commits.
It is not a fix list — it is a punch list for a follow-up agent (running on
the FreeBSD host with Node 24) to triage and address.
This doc was a punch list for the FreeBSD/Node 24 follow-up agent.
Most Critical/Important findings have since been addressed by commits
`8414953`, `5c685f1`, `7124c1c`, `7acf771`. The "Open" section below
reflects what still needs attention. Resolved findings are kept in the
"Resolved" section as historical context.
## Context the next agent needs
- Production runs **FreeBSD with Node 24**. The original auditor was on
**Linux Node 22**, which caused one false alarm — see "Ruled out" below.
- `package.json` declares `engines.node: ">=24"`; `.nvmrc` says `22`. The
mismatch is a dev-ergonomics issue, not a prod bug.
- Themes of the commit batch:
- root-admin/tenant bootstrap on fresh installs
(`ae7183e`, `8f7836c`, `3b4fb67`)
- hostd + Telegram auth hardening (`c1560e1`, `843d87b`)
- per-agent controlplane API tokens (`0d6414a`, `b43e318`)
- dashboard password reset via Telegram (`2874dba`)
- CMS jail routing for root domains (`f4cb61b`, `591d96a`)
- New files most relevant to this review:
- `src/controlplane-agent-keys.ts`
**Linux Node 22**. `package.json` `engines.node: ">=24"`; `.nvmrc`
says `22` (dev-ergonomics issue, not prod).
- Test suite is currently **green** on the right runtime (2137 passed,
622 files) per `7acf771`. Trailers like "Tests: FAIL — 26 failed" on
earlier commits in the chain were fixture issues, not real
regressions.
- New files most relevant to credential/auth surface:
- `src/controlplane-agent-keys.ts` (+ test)
- `src/hostd/auth.ts` (+ test)
- `src/operator-auth-reset.ts`
- `src/root-admin-bootstrap.ts` (+ test)
- `src/telegram-newpass-command.test.ts`
- `src/telegram-auth.ts`, `src/telegram-command-registry.ts`,
`src/telegram-types.ts` (introduced by `defc8df` — not audited yet)
## Ruled out (don't re-audit these)
- **`crypto.argon2` API existence.** Verified Node 24 ships
`crypto.argon2` / `crypto.argon2Sync` natively — see Node 24 docs.
The calls at `src/controlplane-db.ts:556` and `:644` only fail on
Node 22 dev boxes. On FreeBSD/Node 24 they're correct.
- **`crypto.argon2` API existence.** Node 24 ships `crypto.argon2` /
`crypto.argon2Sync` natively. Calls in `src/controlplane-db.ts` only
fail on Node 22 dev boxes; correct on FreeBSD/Node 24.
- **Body-size limit coverage.** `readBody` is only called from
`readJsonBody`, and every POST handler in `src/controlplane-api.ts`
routes through `readJsonBody` (verified: lines 971, 1012, 1112, 1242).
The `CONTROLPLANE_MAX_BODY_BYTES` cap (`b43e318`) covers all current
POST endpoints. Re-check if a new handler is added.
`readJsonBody`; every POST handler in `src/controlplane-api.ts`
routes through it. `CONTROLPLANE_MAX_BODY_BYTES` (`b43e318`) covers
all current POST endpoints. Re-check if a new handler is added.
## Findings
### Critical
1. **`getAgentByApiKey` is an O(N) argon2id verify per request — DoS amplifier.**
`src/controlplane-db.ts` (commit `0d6414a`) loads all agents and runs
argon2id verify against each on every authenticated agent call.
Native Node 24 argon2id is still memory-hard (m=64MiB, t=3, p=4 by
default in the helper). Unauthenticated callers can submit any
48-char string to force the full-table scan; cost scales with N.
Body-size limit doesn't help — the cost is in the bearer token
itself.
*Fix direction:* index lookup by an HMAC of the token, or by a
token-id prefix; verify only one row.
2. **`/newpass` flow leaks plaintext + clobbers `.env` non-atomically.**
`src/telegram-commands.ts` `handleNewPassCommand` /
`handleNewPassCallback` (commit `2874dba`):
- Plaintext password echoed back into the Telegram chat (chat history
becomes a credential store).
- Plaintext also written to `.env` via
`src/operator-auth-reset.ts` `writeOperatorAuthEnv`
using `fs.writeFileSync` (no atomic temp+rename, partial-write risk).
- No rate limit on the command — admin can spam reset.
- No audit trail.
- No "log out other sessions."
- Multi-step reset (legacy operators table → migrations →
better-auth user → reset → `.env`) has no transaction. Failure
between DB update and `.env` write desyncs the two; user sees
"failed" while DB has new password and `.env` has old.
3. **Hostd auth uses `!==`, not `timingSafeEqual`.**
`src/hostd/auth.ts` (~line 50): `req.auth.token !== expectedToken`
leaks length / common-prefix via timing.
Worse: `caller` (`operator` vs `tenant-agent`) and `tenantId` come
from the *same untrusted JSON body* that carries the token. Anyone
holding `CONTROLPLANE_SHARED_SECRET` can claim any `tenantId`.
Per-tenant authorization in `authorizeHostdOperation` is the only
barrier; if a tenant agent ever holds the shared secret, it can
impersonate any other tenant by setting the field.
No rotation/revocation primitive — there's only one shared secret.
## Open findings
### Important
4. **No tests for `operator-auth-reset.ts` or `controlplane-agent-keys.ts`.**
Both touch credential state on disk + DB. Specifically untested:
- `writeOperatorAuthEnv` (`.env` clobber, partial-write).
- `resetControlplaneBootstrapOperatorPassword` "existing→reset"
fallback path.
- Agent-key file-vs-DB sync.
`src/telegram-newpass-command.test.ts` mocks
`resetOperatorAuthCredentials`, so it tests the Telegram surface,
not the reset itself.
1. **No direct tests for `operator-auth-reset.ts`.**
`controlplane-agent-keys.ts` got tests in `8414953`, but
`operator-auth-reset.ts` still has no direct test file. Tested only
transitively via `telegram-newpass-command.test.ts` (which mocks
`resetOperatorAuthCredentials`) and indirectly by setup tests.
Specifically untested: `writeOperatorAuthEnv` atomic write +
chmod, `resetControlplaneBootstrapOperatorPassword`
"existing→reset" fallback, multi-step rollback semantics.
5. **Agent API key file persistence.**
2. **Agent API key file write race.**
`src/controlplane-agent-keys.ts:5058`:
- `fs.writeFileSync` then `chmodSync` to `0o600` — file briefly
visible at default umask between create and chmod (observable on
a multi-user host).
- No atomic rename.
visible at default umask between create and chmod.
- No atomic temp+rename. (Note: `operator-auth-reset.ts` was fixed
in `5c685f1` to use temp+rename; the agent-key writer wasn't.)
- No rotation/revocation path other than DB edit + `fs.unlink`.
- In-memory `syncedApiKeys` map (line ~20) caches the key forever
per process.
- In-memory `syncedApiKeys` map caches the key forever per process.
6. **Bootstrap idempotency / partial-failure gaps.**
3. **Bootstrap idempotency / partial-failure gaps.**
- `decideRootAdminBootstrapGroup` (`src/root-admin-bootstrap.ts`)
keys off `registeredChatJids.length === 0`. No concurrency guard
if `main()` runs twice (systemd restart race).
@ -115,33 +71,41 @@ the FreeBSD host with Node 24) to triage and address.
recreate (`controlplane-db.ts:~576`). Operator-edited skill files
are silently nuked on rerun.
7. **`TENANT_ID` trim inconsistency.**
4. **`TENANT_ID` trim inconsistency.**
`src/controlplane-agent-keys.ts` trims `TENANT_ID`;
`src/index.ts:~1131` does not before passing to
`decideRootAdminBootstrapGroup`. A misconfigured `' '` (whitespace)
gets treated as non-root in one path and root-ish in another.
8. **mevy fixture drift in tests.**
`infra/tenants.yaml` removed mevy (commit `39d3752`), but
`src/tenant-registry.test.ts` still uses mevy as the canonical
fixture (lines 3392, 131, 202, 226235, …). Tests pass but reach
into a synthetic tenant that doesn't match the real registry shape.
5. **`telegram-auth.ts` / `telegram-command-registry.ts` not yet audited.**
Introduced by `defc8df` (Telegram refactor). Worth confirming the
admin-check used to gate `/newpass` is centralized there and not
bypassable, and that command registration doesn't expose any
handler without an auth guard.
### Nits
- `parsePasswordHash` (`controlplane-db.ts:586`) rejects `p<2` / `t<2`
single-core hosts or future tuning that legitimately uses `p=1`
silently fail verification rather than reporting "outdated params."
- `parsePasswordHash` (`controlplane-db.ts:~586`) rejects `p<2` / `t<2`
single-core hosts or future tuning that legitimately uses `p=1`
silently fails verification rather than reporting "outdated params."
- `getDefaultAgents` (commit `8f7836c`) silently substitutes
`SERVICE_NAME` when `agentName` is empty. A log line on substitution
would help debug mis-typed env.
## Files most worth a careful human pass
## Resolved findings
- `src/controlplane-db.ts` — agent-by-key scan, password-hash parser
- `src/controlplane-agent-keys.ts` — no tests, no rotation, atomic-write
- `src/operator-auth-reset.ts` — no tests, multi-step non-transactional, `.env` clobber
- `src/telegram-commands.ts` — newpass handlers (rate limit, audit, plaintext echo)
- `src/hostd/auth.ts` — timing-safe compare, tenantId trust
- `src/tenant-registry.test.ts` — mevy fixture drift
- `package.json` + `.nvmrc` — engines vs nvmrc mismatch
| # | Finding | Resolved by |
|---|---------|-------------|
| C1 | `getAgentByApiKey` O(N) argon2id scan — DoS amplifier | `8414953` — added `api_key_lookup_hash` (sha256 base64url) column + unique index; lookup is single-row + one verify. Backfill via `syncControlplaneAgentApiKeysFromDisk` on boot. |
| C2 | `/newpass` plaintext echo + non-atomic `.env` clobber + no rate limit | `5c685f1``OPERATOR_PASSWORD` removed from `.env` entirely; remaining `.env` writes use temp+rename+chmod; private DM delivery only with self-deleting message; 60s per-admin cooldown; graceful pre-DB error path. |
| C3 | hostd `!==` token compare + tenantId trusted from request body | `7124c1c``crypto.timingSafeEqual` over sha256 digests; `caller` restricted to `'operator'` only (tenant-agent path removed); `tenantId` hardcoded to `''`. Behavioral change: tenant agents can no longer call hostd directly — confirm no caller relied on this. |
| I1a | No tests for `controlplane-agent-keys.ts` | `8414953` — added `controlplane-agent-keys.test.ts` (~124 lines). |
| I8 | mevy fixture drift in `tenant-registry.test.ts` | `7acf771` — formalized as synthetic test fixture (`src/test-fixtures/platform-registry.ts` + explicit `MEVY_TENANT_BLOCK` constant). Tests now decoupled from real `infra/tenants.yaml` shape. |
## Files most worth a careful human pass (open items)
- `src/controlplane-agent-keys.ts` — atomic-write race, no rotation
- `src/operator-auth-reset.ts` — no direct test file
- `src/root-admin-bootstrap.ts` — concurrency on systemd restart race
- `src/telegram-auth.ts` + `src/telegram-command-registry.ts` — not yet audited
- `package.json` + `.nvmrc` — engines `>=24` vs nvmrc `22` mismatch