Browser Jail — Vision Grounding Validation
Self-contained workspace for the vision-grounding experiment described in
../../docs/internal/VISION-GROUNDING-FINDINGS.md.
The question: can a vision-capable model translate "click the X button"
into pixel coordinates on a screenshot reliably enough to drive
browser.click(x, y) in the browser jail's MCP tool surface?
What's in here
- fixtures/: synthetic HTML pages with known target elements
- render.mjs: loads each fixture in headless Chromium via CDP, screenshots it to PNG, and extracts ground-truth bboxes from the DOM
- screenshots/: PNGs produced by render.mjs (committed for reproducibility)
- results/: per-fixture ground truth, per-model predictions, and scores
- predict-openai-compat.mjs: harness for any OpenAI-compatible vision API
- score.mjs: computes pass rate at 30 px tolerance, in-bbox rate, and distance statistics
The fixtures are deterministic — DOM-extracted bounding boxes are the
ground truth, not human labels. Re-running render.mjs against the same
HTML produces byte-identical screenshots and JSON ground truth (modulo
Chromium version differences).
Running
1) Re-render (only needed if fixtures change or you want to verify the CDP path)
npm install
chromium --headless=new --no-sandbox \
--remote-debugging-port=9222 \
--user-data-dir=/var/tmp/validation-profile about:blank &
node render.mjs
On FreeBSD, install Chromium with pkg install chromium.
puppeteer-core connects to a running Chromium — it doesn't bundle one,
which is why this approach works on FreeBSD.
2) Run a model column
The harness expects an OpenAI-compatible chat-completions endpoint.
# GPT-4o (OpenAI direct)
VISION_BASE_URL=https://api.openai.com/v1 \
VISION_API_KEY=$OPENAI_API_KEY \
VISION_MODEL=gpt-4o \
VISION_LABEL=gpt-4o \
node predict-openai-compat.mjs
# GLM-4V (via z.ai)
VISION_BASE_URL=https://api.z.ai/api/paas/v4 \
VISION_API_KEY=$ZAI_API_KEY \
VISION_MODEL=glm-4v-flash \
VISION_LABEL=glm-4v \
node predict-openai-compat.mjs
# UI-TARS via vLLM (if available)
VISION_BASE_URL=http://<host>:8000/v1 \
VISION_API_KEY=dummy \
VISION_MODEL=ui-tars-7b \
VISION_LABEL=ui-tars-7b \
node predict-openai-compat.mjs
Each run produces results/predictions-<label>.json.
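For orientation, here is a minimal sketch of what a request to an OpenAI-compatible chat-completions endpoint with an inline screenshot can look like. It is not taken from predict-openai-compat.mjs: the function name `buildGroundingRequest`, the prompt wording, and the `temperature: 0` setting are all assumptions made for illustration. Only the message shape (a `text` part plus an `image_url` part carrying a data URL) follows the standard chat-completions format.

```javascript
// Hypothetical helper: builds the JSON body for one screenshot + instruction.
// The prompt text and the temperature value are assumptions, not the harness's own.
function buildGroundingRequest({ model, instruction, pngBase64 }) {
  return {
    model,
    temperature: 0, // ask for output that is as repeatable as the endpoint allows
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `${instruction}\nReply with JSON only: {"x": <int>, "y": <int>} in screenshot pixels.`,
          },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${pngBase64}` },
          },
        ],
      },
    ],
  };
}

// Example body; it would be POSTed to `${VISION_BASE_URL}/chat/completions`
// with an `Authorization: Bearer ${VISION_API_KEY}` header.
const body = buildGroundingRequest({
  model: "gpt-4o",
  instruction: 'Click the "Submit" button.',
  pngBase64: "iVBORw0KGgo=", // placeholder, not a real screenshot
});
console.log(JSON.stringify(body, null, 2));
```

Keeping the body builder separate from the HTTP call makes it easy to confirm that every model column receives exactly the same prompt.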
3) Score
node score.mjs results/predictions-gpt-4o.json
Outputs a per-target table and a summary JSON at results/score-predictions-<label>.json.
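The metrics named in this README (pass rate at 30 px tolerance, in-bbox rate, distance statistics) can be sketched as follows. This is an illustration, not the code in score.mjs. It assumes that distance is measured from the predicted point to the center of the ground-truth bbox and that a prediction passes when that distance is at most the tolerance; the `{ x, y, width, height }` bbox shape and the function names are also assumptions.

```javascript
// Assumed shapes: bbox = { x, y, width, height }, pred = { x, y }, all in screenshot pixels.
// Assumption: distance is taken to the bbox center; score.mjs may define it differently.
function scoreOne(bbox, pred, tolerancePx = 30) {
  const cx = bbox.x + bbox.width / 2;
  const cy = bbox.y + bbox.height / 2;
  const dist = Math.hypot(pred.x - cx, pred.y - cy);
  const inBbox =
    pred.x >= bbox.x && pred.x <= bbox.x + bbox.width &&
    pred.y >= bbox.y && pred.y <= bbox.y + bbox.height;
  return { dist, inBbox, pass: dist <= tolerancePx };
}

// Aggregates per-target rows into the summary figures listed under "Reporting results".
function summarize(rows) {
  if (rows.length === 0) throw new Error("no rows to summarize");
  const dists = rows.map((r) => r.dist).sort((a, b) => a - b);
  const mid = Math.floor(dists.length / 2);
  const median = dists.length % 2 ? dists[mid] : (dists[mid - 1] + dists[mid]) / 2;
  return {
    n: rows.length,
    passRate: rows.filter((r) => r.pass).length / rows.length,
    inBboxRate: rows.filter((r) => r.inBbox).length / rows.length,
    meanDist: dists.reduce((sum, d) => sum + d, 0) / dists.length,
    medianDist: median,
    maxDist: dists[dists.length - 1],
  };
}

const target = { x: 100, y: 200, width: 80, height: 40 }; // center is (140, 220)
const hit = scoreOne(target, { x: 143, y: 224 });  // { dist: 5, inBbox: true, pass: true }
const miss = scoreOne(target, { x: 190, y: 220 }); // { dist: 50, inBbox: false, pass: false }
const summary = summarize([hit, miss]);
console.log(summary); // passRate 0.5, inBboxRate 0.5, meanDist 27.5, medianDist 27.5, maxDist 50
```

Reporting the in-bbox rate alongside the tolerance-based pass rate is useful because a click can land inside a large element while still being far from its center, and the reverse can happen for very small targets.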
Reporting results
Append a new "## Results — <model>" section to
docs/internal/VISION-GROUNDING-FINDINGS.md with:
- pass rate at 30 px tolerance
- mean / median / max pixel distance
- in-bbox rate
- notable failure modes (e.g., model refused, malformed JSON, off-by-resolution)
- approximate cost per full run
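To make the failure-mode tally repeatable, raw model replies can be sorted into a few buckets before scoring. The sketch below is one possible approach, not part of the harness. The function name `parsePrediction`, the reason labels, and the expected `{"x": ..., "y": ...}` reply shape are assumptions.

```javascript
// Hypothetical classifier for a raw model reply. `size` is the screenshot size in pixels.
function parsePrediction(reply, size) {
  // Take the first flat {...} object, so prose or code fences around it are tolerated.
  const match = reply.match(/\{[^{}]*\}/);
  if (!match) return { ok: false, reason: "no-json" }; // e.g. a refusal
  let obj;
  try {
    obj = JSON.parse(match[0]);
  } catch {
    return { ok: false, reason: "malformed-json" };
  }
  const { x, y } = obj;
  if (!Number.isFinite(x) || !Number.isFinite(y)) {
    return { ok: false, reason: "missing-coordinates" };
  }
  // Coordinates outside the image often point to an off-by-resolution answer
  // (for example, a model replying in a rescaled coordinate space).
  if (x < 0 || y < 0 || x > size.width || y > size.height) {
    return { ok: false, reason: "out-of-bounds", x, y };
  }
  return { ok: true, x, y };
}

const size = { width: 1280, height: 800 }; // illustrative viewport, not the fixtures' actual size
console.log(parsePrediction('Sure!\n{"x": 143, "y": 224}', size)); // ok
console.log(parsePrediction("I can't help with that.", size));     // no-json
console.log(parsePrediction('{"x": 143, "y": }', size));           // malformed-json
console.log(parsePrediction('{"x": 2860, "y": 448}', size));       // out-of-bounds
```

Counting the `reason` values per model gives a concrete number for each failure mode instead of an anecdote.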
Don't tune the prompt per model: the comparison is only apples-to-apples when every model gets the same prompt. If a model needs a different prompt to perform, note it as an integration cost, not as a different score.
Existing results
- Claude Opus 4.7 (baseline): 17/17 PASS at 30 px tolerance,
mean 1 px, max 8 px. See
results/score-predictions-claude-opus-4-7.json.