Operator & Codex ba33a349cc Document UI-TARS adoption direction

---
Build: pass | Tests: pass — 2382 passed (175 files)

2026-05-11 12:33:13 +02:00


Browser Jail Handoff

From: Claude (design) → Codex (implementation)
Date: 11 May 2026
Status: PHASE 0.5 COMPLETE, READY FOR PHASE 1 REVIEW

Design reference: docs/internal/BROWSER-JAIL.md
Vision spike: docs/internal/VISION-GROUNDING-FINDINGS.md (in progress)
UI-TARS direction: docs/internal/UI-TARS-ADOPTION.md

Scope

Repo: codeberg.org:Clawdie/Clawdie-AI.git
Branch: new feature branch off main (suggested: browser-jail)
Runtime: Node v22+ on FreeBSD — no Bun anywhere in this work
Build/test stamping: new trailer hook from 17746bb already lives on main; full-suite test footer on each commit
Coordination: Codex's controlplane-heartbeat refactor runs in parallel on a separate branch; no expected conflicts (different file surfaces)
UI-TARS alignment: do not implement a parallel Clawdie-specific GUI-agent loop, action grammar, or prediction parser unless the UI-TARS-compatible operator path is proven unworkable. The browser jail remains the execution backend; the controlplane or external client owns the model loop.

Pre-flight

Before starting Phase 0.5:

Read docs/internal/BROWSER-JAIL.md end-to-end. Especially the threat model and the implementation notes on FreeBSD Chromium / CDP choice.
Skim setup/ollama.ts and infra/packages/ollama-jail.txt as reference for the jail-bootstrap pattern. The browser jail follows the same shape.
Confirm the host has the recent post-17746bb trailer hook so commit footers reflect single, accurate full-suite runs.

Phase 0.5 — FreeBSD viability spike

Status: COMPLETE by Codex on 11 May 2026. See docs/internal/BROWSER-JAIL-FREEBSD-VIABILITY.md.

Goal: before any clawdie-side code, prove the FreeBSD substrate supports headless Chromium + CDP.

Steps:

On the FreeBSD host, create a throwaway bastille jail:

sudo bastille create browser-spike 15.0-RELEASE 10.0.0.X
sudo bastille pkg browser-spike install -y chromium

From inside the jail, run Chromium headless with the debugging port:
```
chromium \
  --headless=new \
  --no-sandbox \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/spike-profile \
  about:blank
```
Note: --no-sandbox is for the spike only. Production runs with the Chromium sandbox enabled inside a bastille jail; bastille is the outer sandbox.

From a separate shell in the jail, install Node 22 and a CDP client:

pkg install -y node22 npm-node22
npm init -y && npm install puppeteer-core@latest

Write a 20-line Node script that:
- Connects to http://127.0.0.1:9222 via puppeteer-core's connect() (browserURL option).
- Opens a new page, navigates to data:text/html,<h1>hello</h1>.
- Takes a PNG screenshot.
- Reads the <h1> text via page.evaluate(() => document.querySelector('h1').textContent).
- Prints OK and exits.
If puppeteer-core has FreeBSD-specific issues, retry with chrome-remote-interface as the fallback CDP client.

Output: a viability note at docs/internal/BROWSER-JAIL-FREEBSD-VIABILITY.md covering:

Which pkg Chromium version was used.
Which CDP client worked (puppeteer-core vs chrome-remote-interface).
Any FreeBSD-specific flags or environment that were needed.
A copy of the working spike script.

Gate: passed. puppeteer-core connected to system-pkg Chromium over CDP, read DOM text, wrote a PNG screenshot, and re-ran the deterministic renderer.

Commit: one commit, the viability note + the spike script under scripts/browser-jail-spike/. Trailer must show full-suite test pass.

Phase 1 — MVP browser jail + MCP proxy

Only proceed after Phase 0.5 passes.

Phase 1A — Jail-side HTTP service

Slice into small commits, each with full-suite tests passing.

Jail scaffolding.
- infra/packages/browser-jail.txt listing packages (chromium, node22).
- setup/browser-jail.ts mirroring setup/ollama.ts shape.
- Mount the host pkg cache with mountPkgCacheInJail(jailName) before package installation.
- Set initial ZFS quotas: 10 GiB for the jail dataset, 20 GiB for the screenshot/session dataset (tune later with telemetry).
- rc.d service file for Chromium with the right flags.
- Document the bastille jail config in docs/internal/BROWSER-JAIL-OPS.md (mirrors LOCAL-LLM.md shape).
HTTP server skeleton.
- Node v22+ HTTP server on :8090, binds to the internal jail IP.
- GET /health returns {"status": "ok"}.
- Routes scaffolded for /sessions, /screenshot, /click, /type, /scroll, /navigate, /read_dom — return 501 for now.
- Unit tests for routing and the health endpoint.
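A minimal sketch of the skeleton described above, assuming only Node's built-in http module. The names route and createBrowserJailServer, the error strings, and the 405 for non-GET /health are illustrative assumptions, not existing Clawdie code. The routing decision is kept as a pure function so the routing and health tests do not need a socket.

```typescript
import { createServer, type IncomingMessage, type Server, type ServerResponse } from "node:http";

// Routes named in the handoff; all are scaffolded to 501 until implemented.
const SCAFFOLDED_ROUTES = new Set([
  "/sessions", "/screenshot", "/click", "/type", "/scroll", "/navigate", "/read_dom",
]);

interface RouteResult {
  status: number;
  body: Record<string, unknown>;
}

// Pure routing decision, separate from the socket so it can be unit-tested.
export function route(method: string, path: string): RouteResult {
  if (path === "/health") {
    return method === "GET"
      ? { status: 200, body: { status: "ok" } }
      : { status: 405, body: { error: "method_not_allowed" } }; // assumption
  }
  // Match on the first path segment so /sessions/:id is covered too.
  const base = "/" + (path.split("/")[1] ?? "");
  if (SCAFFOLDED_ROUTES.has(base)) {
    return { status: 501, body: { error: "not_implemented" } };
  }
  return { status: 404, body: { error: "not_found" } };
}

// Wraps route() in an HTTP server; the caller decides where to listen.
export function createBrowserJailServer(): Server {
  return createServer((req: IncomingMessage, res: ServerResponse) => {
    const path = new URL(req.url ?? "/", "http://jail.invalid").pathname;
    const { status, body } = route(req.method ?? "GET", path);
    res.writeHead(status, { "content-type": "application/json" });
    res.end(JSON.stringify(body));
  });
}
```

Binding would then be createBrowserJailServer().listen(8090, jailIp), where jailIp is the internal jail address supplied by setup.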
Session lifecycle.
- POST /sessions accepts {record: "off" | "transient" | "audit"} (immutable for session life; default "transient"). Spawn fresh BrowserContext, return {session_id, record}.
- DELETE /sessions/:id → close context, delete profile dir.
- In-memory map of session_id → {BrowserContext, record, ring_buffer}.
- Test: open + close, double-close returns 404, max sessions per jail enforced, record echoed in response, attempting to change record mid-session returns 400.
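The lifecycle rules above can be sketched as an in-memory store. This is an illustration under assumptions: SessionStore, Outcome, the counter-based ids, and the 201/204 success codes are hypothetical, and the real entry would also hold the BrowserContext and ring buffer. The 400, 404, and 429 outcomes follow the tests listed in this handoff.

```typescript
type RecordMode = "off" | "transient" | "audit";
const RECORD_MODES: readonly string[] = ["off", "transient", "audit"];

interface Session {
  readonly id: string;
  readonly record: RecordMode; // fixed at creation, never reassigned
}

// HTTP-style outcome: status code plus either a session or an error tag.
interface Outcome {
  status: number;
  session?: Session;
  error?: string;
}

class SessionStore {
  private readonly sessions = new Map<string, Session>();
  private readonly maxSessions: number;
  private nextId = 1; // sketch only; real ids should be unguessable

  constructor(maxSessions = 10) {
    this.maxSessions = maxSessions;
  }

  open(record: string = "transient"): Outcome {
    if (!RECORD_MODES.includes(record)) {
      return { status: 400, error: "invalid_record_mode" };
    }
    if (this.sessions.size >= this.maxSessions) {
      return { status: 429, error: "too_many_sessions" };
    }
    const session: Session = { id: `s${this.nextId++}`, record: record as RecordMode };
    this.sessions.set(session.id, session);
    return { status: 201, session };
  }

  // Any attempt to change the record mode of a live session is rejected.
  changeRecord(id: string): Outcome {
    return this.sessions.has(id)
      ? { status: 400, error: "record_mode_is_immutable" }
      : { status: 404, error: "unknown_session" };
  }

  // Map.delete returns false for an unknown id, which gives double-close its 404.
  close(id: string): Outcome {
    return this.sessions.delete(id)
      ? { status: 204 }
      : { status: 404, error: "unknown_session" };
  }
}
```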
navigate.
- POST /navigate {session_id, url} → page.goto, return {final_url, title}.
- Test: 200 OK page, 404, redirect, network error.
screenshot.
- POST /screenshot {session_id, full_page?} → PNG, return base64 + dimensions + timestamp + persisted_path?.
- Default viewport-only; full_page: true is explicit.
- Persistence is governed by the session's record mode (not per-call): off writes nothing; transient writes to the ring buffer and FIFO-evicts at N=50; audit writes all with 7d retention.
- Test: full page vs viewport, multiple sequential screenshots, ring buffer eviction at N+1, record:"off" returns no persisted_path.
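One way to express the per-session persistence policy is a small log object keyed on the record mode. ScreenshotLog and its method names are hypothetical; the sketch tracks paths only and leaves file I/O and the 7d audit GC to the caller.

```typescript
type RecordMode = "off" | "transient" | "audit";

interface PersistResult {
  persistedPath?: string; // undefined when the mode writes nothing
  evicted: string[];      // paths the caller should delete from disk
}

class ScreenshotLog {
  private readonly paths_: string[] = [];
  private readonly mode: RecordMode;
  private readonly capacity: number;

  constructor(mode: RecordMode, capacity = 50) {
    this.mode = mode;
    this.capacity = capacity;
  }

  persist(path: string): PersistResult {
    // off: nothing is written, so there is no persisted_path to report.
    if (this.mode === "off") return { evicted: [] };

    this.paths_.push(path);
    const evicted: string[] = [];
    // transient: FIFO ring buffer, oldest entries leave first.
    // audit: keep everything; time-based retention is handled elsewhere.
    if (this.mode === "transient") {
      while (this.paths_.length > this.capacity) {
        evicted.push(this.paths_.shift()!);
      }
    }
    return { persistedPath: path, evicted };
  }

  paths(): string[] {
    return [...this.paths_];
  }
}
```

With the default capacity of 50, the 51st screenshot of a transient session evicts the first.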
click and type.
- POST /click accepts either {x,y} or {selector}.
- POST /type accepts {text, selector?} (uses focused element if no selector).
- after_screenshot: true bundles a fresh screenshot into the response.
- Test: click by coords, click by selector, type into focused vs selector, after_screenshot behavior.
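The two accepted shapes for /click and the optional selector for /type could be validated up front, roughly as below. The function names and the choice to reject a body that carries both coordinates and a selector are assumptions made for the sketch; the handoff only says either form is accepted.

```typescript
type ClickTarget =
  | { kind: "coords"; x: number; y: number }
  | { kind: "selector"; selector: string }
  | { kind: "invalid"; reason: string };

function parseClickTarget(body: Record<string, unknown>): ClickTarget {
  const x = body["x"];
  const y = body["y"];
  const selector = body["selector"];

  if (typeof x === "number" && typeof y === "number" && Number.isFinite(x) && Number.isFinite(y)) {
    if (selector !== undefined) {
      return { kind: "invalid", reason: "give {x, y} or {selector}, not both" };
    }
    return { kind: "coords", x, y };
  }
  if (typeof selector === "string" && selector.length > 0) {
    if (x !== undefined || y !== undefined) {
      return { kind: "invalid", reason: "give {x, y} or {selector}, not both" };
    }
    return { kind: "selector", selector };
  }
  return { kind: "invalid", reason: "expected {x, y} or {selector}" };
}

interface TypeRequest {
  text: string;
  selector: string | null; // null means: type into the focused element
}

function parseTypeRequest(body: Record<string, unknown>): TypeRequest | undefined {
  const text = body["text"];
  const selector = body["selector"];
  if (typeof text !== "string") return undefined;
  if (selector !== undefined && typeof selector !== "string") return undefined;
  return { text, selector: selector ?? null };
}
```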
scroll and read_dom.
- POST /scroll {session_id, delta_y, selector?}.
- POST /read_dom {session_id, selector?, max_chars?} → outerHTML truncated.
- Tests as above.
Resource limits and retention.
- Per-session deadline (default 300s).
- Max concurrent sessions per jail (default 10).
- record:"transient" ring buffer: N=50 per session, FIFO-evicted as new screenshots arrive (no time component).
- record:"audit" retention: 7d, GC'd by controlplane cron (configurable per-tenant in Phase 2).
- Dataset-level FIFO eviction under screenshot quota pressure (80% high-water alert, 100% evicts transient sessions first, then audit). Eviction never crosses tenant boundaries.
- Tests: deadline kills a stuck session; over-limit returns 429; transient ring buffer evicts at N+1; audit GC removes screenshots past 7d; quota-pressure eviction prefers transient over audit data.
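The quota-pressure thresholds and eviction order above can be sketched as two pure functions. Assumptions: the SessionUsage fields are hypothetical, pressure is attributed to a single tenant (which is how "never crosses tenant boundaries" is read here), and within a record mode the least recently active session goes first.

```typescript
interface SessionUsage {
  sessionId: string;
  tenantId: string;
  record: "transient" | "audit";
  bytes: number;
  lastActiveAt: number; // epoch ms
}

type Pressure = "ok" | "alert" | "evict";

// 80% is the high-water alert; 100% triggers eviction.
function pressure(usedBytes: number, quotaBytes: number): Pressure {
  const ratio = usedBytes / quotaBytes;
  if (ratio >= 1) return "evict";
  if (ratio >= 0.8) return "alert";
  return "ok";
}

// Returns session ids to evict, in order, until bytesToFree is covered.
function planEviction(sessions: SessionUsage[], tenantId: string, bytesToFree: number): string[] {
  const rank = (s: SessionUsage): number => (s.record === "transient" ? 0 : 1);
  const candidates = sessions
    .filter((s) => s.tenantId === tenantId) // other tenants are never touched
    .sort((a, b) => rank(a) - rank(b) || a.lastActiveAt - b.lastActiveAt);

  const victims: string[] = [];
  let freed = 0;
  for (const s of candidates) {
    if (freed >= bytesToFree) break;
    victims.push(s.sessionId);
    freed += s.bytes;
  }
  return victims;
}
```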

Phase 1B — Controlplane MCP proxy

MCP server scaffolding.
- New src/browser-jail-mcp.ts.
- HTTP/SSE transport (one canonical transport in MVP).
- Tool definitions matching the browser.* surface in BROWSER-JAIL.md.
- All tools return 501 initially.
- Unit tests: tool list, schema validation.
Auth wiring.
- Reuse better-auth session check at the MCP boundary.
- Resolve tenant_id + operator_id from session.
- Test: unauthenticated request → 401, valid session → forwarded.
Audit log table + write path.
- Migration: audit.browser_session_events per the schema in BROWSER-JAIL.md.
- Proxy writes a row before forwarding to the jail.
- browser.type params redacted (length + redacted flag only).
- Tests: row written, redaction applied, retries don't duplicate.
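A sketch of the redaction and no-duplicate-on-retry behaviour, assuming an idempotency key per event. The eventId field, the AuditLog class, and the length/redacted key names are hypothetical; the in-memory Map stands in for a unique constraint on the real table, whose schema lives in BROWSER-JAIL.md.

```typescript
interface AuditEvent {
  eventId: string; // hypothetical idempotency key, stable across retries
  sessionId: string;
  tool: string;
  params: Record<string, unknown>;
}

// browser.type params are reduced to a length and a redacted flag; the typed
// text itself never reaches the audit row.
function redactParams(tool: string, params: Record<string, unknown>): Record<string, unknown> {
  if (tool !== "browser.type") return params;
  const raw = params["text"];
  const length = typeof raw === "string" ? raw.length : 0;
  return { length, redacted: true };
}

class AuditLog {
  private readonly rows = new Map<string, AuditEvent>();

  // Returns true when a new row was written, false when a retry was ignored.
  write(event: AuditEvent): boolean {
    if (this.rows.has(event.eventId)) return false;
    this.rows.set(event.eventId, {
      ...event,
      params: redactParams(event.tool, event.params),
    });
    return true;
  }

  get(eventId: string): AuditEvent | undefined {
    return this.rows.get(eventId);
  }

  get size(): number {
    return this.rows.size;
  }
}
```

The proxy would call write() before forwarding to the jail, so a retried forward reuses the same eventId and leaves a single row.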
Per-tool forwarding.
- One commit per tool, each: route → audit → HTTP call to jail → return MCP response.
- Tests: happy path, jail-down, tenant mismatch.
Claude Desktop integration smoke test.
- docs/internal/BROWSER-JAIL-CLAUDE-DESKTOP-SETUP.md showing the claude_desktop_config.json snippet that points at the proxy.
- Manual smoke: open session, navigate to a test page, click, screenshot, close session. Document outcome in the same doc.

Phase 1C — Network policy

PF rules on the browser jail.
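- Block egress to RFC1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8, ::1), link-local (169.254.0.0/16, fe80::/10), ULA (fc00::/7), and explicitly the cloud metadata address 169.254.169.254.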
- Allow public egress.
- Ingress on :8090 only from controlplane IP.
- Tests: PF rule lint + an integration test that asserts a navigate to an RFC1918 URL fails as expected.
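For the integration test, a helper that mirrors the blocked ranges can decide which target addresses are expected to fail. This is a test-side sketch only, assuming Node's net.BlockList; PF remains the enforcement point, and the name isBlockedEgress is hypothetical. Hostnames must be resolved before calling it.

```typescript
import { BlockList, isIP } from "node:net";

// Mirror of the egress ranges the PF rules are meant to block.
const blockedEgress = new BlockList();
blockedEgress.addSubnet("10.0.0.0", 8, "ipv4");     // RFC1918
blockedEgress.addSubnet("172.16.0.0", 12, "ipv4");  // RFC1918
blockedEgress.addSubnet("192.168.0.0", 16, "ipv4"); // RFC1918
blockedEgress.addSubnet("127.0.0.0", 8, "ipv4");    // loopback
blockedEgress.addSubnet("169.254.0.0", 16, "ipv4"); // link-local, covers 169.254.169.254
blockedEgress.addAddress("::1", "ipv6");            // loopback
blockedEgress.addSubnet("fe80::", 10, "ipv6");      // link-local
blockedEgress.addSubnet("fc00::", 7, "ipv6");       // ULA

export function isBlockedEgress(address: string): boolean {
  const family = isIP(address); // 4, 6, or 0 when not an IP literal
  if (family === 0) {
    throw new TypeError(`not an IP address: ${address}`);
  }
  return blockedEgress.check(address, family === 6 ? "ipv6" : "ipv4");
}
```

A navigate to any address for which isBlockedEgress returns true should fail from inside the jail.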

Phase 1 stop point

Browser jail is reachable only from controlplane.
Controlplane MCP proxy presents the browser.* tools to Claude Desktop.
A manual end-to-end smoke (Claude Desktop → MCP → jail → page action) works.
All commits green on full-suite tests.

Do not proceed to controlplane integration (Phase 2) until Phase 1 is demoed and reviewed by Claude/Sam.

Phase 2 — Controlplane integration

Spec only — implement after Phase 1 review.

New controlplane task type: browser-session.
ZFS snapshot before session start; destroy on clean exit; retain on error. Hook reuses the existing hostd-side snapshot helpers from the sudo-elimination work.
PF rule hardening: confirm rules survive jail restart, integrate with the existing PF management surface.
Audit retention: 7d default, configurable per-tenant.
Screenshot GC cron: hourly sweep of /var/db/browser-jail for sessions past their retention window.
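The hourly sweep reduces to a pure selection step, sketched below. The StoredSession fields and the per-tenant override map are hypothetical; only the 7d default and per-tenant configurability come from this spec. Discovering sessions under /var/db/browser-jail and deleting their files is left to the caller.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface StoredSession {
  sessionId: string;
  tenantId: string;
  endedAt: number; // epoch ms
}

// Returns ids of sessions strictly past their retention window.
function expiredSessions(
  sessions: StoredSession[],
  now: number,
  retentionDaysByTenant: Record<string, number> = {},
  defaultRetentionDays = 7,
): string[] {
  return sessions
    .filter((s) => {
      const days = retentionDaysByTenant[s.tenantId] ?? defaultRetentionDays;
      return now - s.endedAt > days * DAY_MS;
    })
    .map((s) => s.sessionId);
}
```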

Validation per commit

Run the full test suite (not just browser-jail / controlplane tests). Reference: 7dca928's "2369 passed (703 files)" footer is the canonical shape.
Commit footer must show single, accurate stamping (the 17746bb-era hook).
No commit may introduce imports from src/controlplane-api.ts into the browser-jail modules — DAG must stay acyclic.
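The import rule could be backed by a cheap source-text check in the test suite. This is a heuristic sketch, not a parser: the regex and the name findForbiddenImports are assumptions, and a dependency-graph tool would be more thorough.

```typescript
// Matches static imports, side-effect imports, dynamic import(), and require()
// whose specifier ends in controlplane-api (with or without .ts/.js).
const FORBIDDEN_IMPORT =
  /(?:from\s+|import\s+|import\s*\(\s*|require\s*\(\s*)["'][^"']*controlplane-api(?:\.(?:ts|js))?["']/g;

// Returns every forbidden import found in one module's source text.
function findForbiddenImports(source: string): string[] {
  return source.match(FORBIDDEN_IMPORT) ?? [];
}
```

A test would read each browser-jail module (for example src/browser-jail-mcp.ts) and assert that findForbiddenImports returns an empty array.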

Open questions / escalation triggers

Escalate to Claude/Sam before deciding; don't guess:

Phase 0.5 viability spike fails with both puppeteer-core and chrome-remote-interface.
A clean better-auth session token isn't available for MCP clients (Claude Desktop in particular; check its auth-passing model before Phase 1B-2, auth wiring).
Chromium's BrowserContext isolation turns out leakier than documented during Phase 1A-3 (session lifecycle) testing.
ZFS snapshot lifecycle decisions (retention, restore semantics) for Phase 2.

Out of scope

Electron + nut-js desktop shell.
Vision/grounding inside the browser jail. The jail ships pixels; the client agent grounds them.
Multi-jail scale-out beyond one shared Chromium per jail.
File downloads/uploads (Phase 3+).
