clawdie-ai/scripts/browser-jail-validation
Operator & Codex 6c549e7ad0 Rename browser validation assets
---
Build: pass | Tests: pass — 2383 passed (175 files)
2026-05-11 17:32:22 +02:00

Browser Jail — Vision Grounding Validation

Self-contained workspace for the vision-grounding experiment described in ../../docs/internal/VISION-GROUNDING-FINDINGS.md.

The question: can a vision-capable model translate "click the X button" into pixel coordinates on a screenshot reliably enough to drive browser.click(x, y) in the browser jail's MCP tool surface?

What's in here

fixtures/      synthetic HTML pages with known target elements
render.mjs     loads each fixture in headless Chromium via CDP,
               screenshots to PNG, extracts ground-truth bboxes from the DOM
screenshots/   PNGs produced by render.mjs (committed for reproducibility)
results/       per-fixture ground truth + per-model predictions + scores
predict-openai-compat.mjs   harness for any OpenAI-compatible vision API
score.mjs      computes pass rate at 30 px tolerance, in-bbox rate, and
               distance stats (mean / median / max)

The fixtures are deterministic: the ground truth is the set of bounding boxes extracted from the DOM, not human labels. Re-running render.mjs against the same HTML with the same Chromium version produces byte-identical screenshots and JSON ground truth; a different Chromium version may render differently.
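To make "DOM-extracted bounding boxes" concrete, here is a minimal sketch of how a ground-truth record could be derived from an element's bounding rectangle. The helper name toGroundTruth and the record shape are assumptions for illustration; render.mjs may store its ground truth differently.

```javascript
// Hypothetical helper: turn a DOMRect-like object into a ground-truth record.
// The field names (name, bbox, center) are illustrative, not the repo's schema.
function toGroundTruth(name, rect) {
  return {
    name,
    bbox: { x: rect.x, y: rect.y, width: rect.width, height: rect.height },
    center: { x: rect.x + rect.width / 2, y: rect.y + rect.height / 2 },
  };
}

// With puppeteer-core, the rect would be read inside the page, for example:
//   const rect = await page.$eval(selector, (el) => {
//     const r = el.getBoundingClientRect();
//     return { x: r.x, y: r.y, width: r.width, height: r.height };
//   });
//   const record = toGroundTruth("submit-button", rect);
```

Because the rectangle comes straight from layout, the same HTML rendered at the same viewport size yields the same record, which is what makes the fixtures reproducible.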

Running

1) Re-render (only needed if fixtures change or you want to verify the CDP path)

npm install
chromium --headless=new --no-sandbox \
  --remote-debugging-port=9222 \
  --user-data-dir=/var/tmp/validation-profile about:blank &
node render.mjs

On FreeBSD, install Chromium with pkg install chromium. puppeteer-core does not download or bundle a browser; it connects to an already-running Chromium, which is why this approach works on FreeBSD.
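Since Chromium is started in the background, the debugging port may not be listening yet when the next command runs. The sketch below polls Chromium's /json/version endpoint until it returns a webSocketDebuggerUrl. The function name waitForCdp and its options are hypothetical, and render.mjs may already handle this on its own.

```javascript
// Hypothetical helper: wait until the Chromium remote-debugging port answers.
// fetchImpl is injectable so the function can be exercised without a browser.
async function waitForCdp(port = 9222, { retries = 20, delayMs = 250, fetchImpl = fetch } = {}) {
  const url = `http://127.0.0.1:${port}/json/version`;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const res = await fetchImpl(url);
      if (res.ok) {
        const info = await res.json();
        if (info.webSocketDebuggerUrl) return info.webSocketDebuggerUrl;
      }
    } catch {
      // Port not listening yet; fall through to the delay and retry.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`CDP endpoint on port ${port} did not come up after ${retries} attempts`);
}

// Possible usage with puppeteer-core:
//   const browserWSEndpoint = await waitForCdp(9222);
//   const browser = await puppeteer.connect({ browserWSEndpoint });
```

The default fetchImpl relies on the global fetch available in Node 18 and later.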

2) Run a model column

The harness expects an OpenAI-compatible chat-completions endpoint.

# GPT-4o (OpenAI direct)
VISION_BASE_URL=https://api.openai.com/v1 \
VISION_API_KEY=$OPENAI_API_KEY \
VISION_MODEL=gpt-4o \
VISION_LABEL=gpt-4o \
  node predict-openai-compat.mjs

# GLM-4V (via z.ai)
VISION_BASE_URL=https://api.z.ai/api/paas/v4 \
VISION_API_KEY=$ZAI_API_KEY \
VISION_MODEL=glm-4v-flash \
VISION_LABEL=glm-4v \
  node predict-openai-compat.mjs

# UI-TARS via vLLM (if available)
VISION_BASE_URL=http://<host>:8000/v1 \
VISION_API_KEY=dummy \
VISION_MODEL=ui-tars-7b \
VISION_LABEL=ui-tars-7b \
  node predict-openai-compat.mjs

Each run produces results/predictions-<label>.json.
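For orientation, the sketch below shows what a request to an OpenAI-compatible chat-completions endpoint with an inline screenshot can look like, plus a tolerant parser for the model's reply. The function names buildRequest and parsePoint, the instruction text, and the {"x": ..., "y": ...} reply format are assumptions; the actual prompt and parsing live in predict-openai-compat.mjs.

```javascript
// Hypothetical request builder: one user message with text plus a PNG data URL.
function buildRequest({ model, instruction, pngBase64 }) {
  return {
    model,
    temperature: 0,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: instruction },
          { type: "image_url", image_url: { url: `data:image/png;base64,${pngBase64}` } },
        ],
      },
    ],
  };
}

// Hypothetical reply parser: pull the first flat {...} object out of the text
// and accept it only if it carries numeric x and y. Returns null otherwise,
// so refusals and malformed JSON can be counted as failures rather than crash.
function parsePoint(text) {
  const match = text.match(/\{[^{}]*\}/);
  if (!match) return null;
  try {
    const obj = JSON.parse(match[0]);
    return Number.isFinite(obj.x) && Number.isFinite(obj.y) ? { x: obj.x, y: obj.y } : null;
  } catch {
    return null;
  }
}

// Possible usage (Node 18+), reading the same environment variables as above:
//   const res = await fetch(`${process.env.VISION_BASE_URL}/chat/completions`, {
//     method: "POST",
//     headers: {
//       "Content-Type": "application/json",
//       Authorization: `Bearer ${process.env.VISION_API_KEY}`,
//     },
//     body: JSON.stringify(buildRequest({ model: process.env.VISION_MODEL, instruction, pngBase64 })),
//   });
//   const point = parsePoint((await res.json()).choices[0].message.content);
```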

3) Score

node score.mjs results/predictions-gpt-4o.json

Outputs a per-target table and a summary JSON at results/score-predictions-<label>.json.
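As a rough illustration of the reported metrics, the sketch below scores predicted clicks against ground-truth bboxes. It assumes distance is measured from the prediction to the bbox center and that a target passes when that distance is at most 30 px. The names scoreTarget and summarize are hypothetical, and score.mjs may define these metrics differently.

```javascript
// Hypothetical per-target score: distance to bbox center, in-bbox flag, pass flag.
function scoreTarget(pred, bbox, tolerancePx = 30) {
  const cx = bbox.x + bbox.width / 2;
  const cy = bbox.y + bbox.height / 2;
  const dist = Math.hypot(pred.x - cx, pred.y - cy);
  const inBbox =
    pred.x >= bbox.x && pred.x <= bbox.x + bbox.width &&
    pred.y >= bbox.y && pred.y <= bbox.y + bbox.height;
  return { dist, inBbox, pass: dist <= tolerancePx };
}

// Hypothetical summary over a non-empty list of per-target scores.
function summarize(scores) {
  const dists = scores.map((s) => s.dist).sort((a, b) => a - b);
  const n = dists.length;
  const mid = Math.floor(n / 2);
  const median = n % 2 ? dists[mid] : (dists[mid - 1] + dists[mid]) / 2;
  return {
    passRate: scores.filter((s) => s.pass).length / n,
    inBboxRate: scores.filter((s) => s.inBbox).length / n,
    mean: dists.reduce((a, b) => a + b, 0) / n,
    median,
    max: dists[n - 1],
  };
}
```

Reporting the in-bbox rate alongside the tolerance-based pass rate is useful because a click can land within 30 px of the center and still miss a small element, or land inside a large element while being far from its center.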

Reporting results

Append a new "## Results — <model>" section to docs/internal/VISION-GROUNDING-FINDINGS.md with:

  • pass rate at 30 px tolerance
  • mean / median / max pixel distance
  • in-bbox rate
  • notable failure modes (e.g., model refused, malformed JSON, off-by-resolution)
  • approximate cost per full run

Don't tune the prompt per model: the comparison has to stay apples-to-apples. If a model needs a different prompt to perform, record that as an integration cost, not as a different score.

Existing results

  • Claude Opus 4.7 (baseline): 17/17 PASS at 30 px tolerance, mean 1 px, max 8 px. See results/score-predictions-claude-opus-4-7.json.