clawdie-ai/docs/internal/UI-TARS-ADOPTION.md
Operator & Codex 50513681b4 Add post-install setup bootstrap flow
2026-05-12 11:37:55 +02:00

UI-TARS Adoption Direction

Date: 11 May 2026
Status: DIRECTION (use UI-TARS as the computer-use reference design)

Clawdie should stop inventing a parallel GUI-agent stack where UI-TARS already provides a polished, working design. The goal is not to replace Clawdie's FreeBSD/controlplane substrate; it is to make Clawdie's browser computer-use layer look like a UI-TARS operator backend.

Scope: UI-TARS is the GUI/browser agent loop reference only. Clawdie's controlplane, setup, auth, tenant policy, install, audit, Telegram, and jail orchestration remain Clawdie-owned.

Decision

Use UI-TARS as the reference for:

  • the agent loop shape: screenshot → model prediction → execute → repeat,
  • the operator abstraction: screenshot() and execute(prediction),
  • action-space prompt style,
  • status lifecycle: init/running/end/max-loop,
  • recent screenshot context rather than unlimited screenshot history,
  • streamed/delta status updates to the user.
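
The loop bookkeeping in this list can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not UI-TARS source code: the names LoopState and advance are hypothetical, the window size of five follows the screenshots.slice(-5) convention from the UI-TARS docs, and the model and operator calls are deliberately left out.

```typescript
// Hypothetical names throughout; this mirrors the loop shape only.
type LoopStatus = "init" | "running" | "end" | "max-loop";

interface LoopState {
  status: LoopStatus;
  steps: number;
  screenshots: string[]; // base64 frames, oldest first
}

// Assumed window size, following screenshots.slice(-5).
const RECENT_SCREENSHOTS = 5;

function initialState(): LoopState {
  return { status: "init", steps: 0, screenshots: [] };
}

// Record one completed screenshot -> predict -> execute iteration.
function advance(
  state: LoopState,
  frame: string,
  actionType: string,
  maxLoop: number,
): LoopState {
  // Terminal states do not change any more.
  if (state.status === "end" || state.status === "max-loop") {
    return state;
  }
  const steps = state.steps + 1;
  // Keep only the most recent frames as model context.
  const screenshots = state.screenshots.concat(frame).slice(-RECENT_SCREENSHOTS);
  let status: LoopStatus = "running";
  if (actionType === "finished") {
    status = "end";
  } else if (steps >= maxLoop) {
    status = "max-loop";
  }
  return { status, steps, screenshots };
}

// Usage: seven iterations, the last one being the finished action.
let state = initialState();
["click", "type", "click", "scroll", "click", "type", "finished"].forEach(
  (action, i) => {
    state = advance(state, `frame-${i}`, action, 10);
  },
);
```

Each intermediate state is what would be streamed to the user as a delta; Clawdie's disk recording can still keep every frame, because the window only limits what the model sees.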

Keep Clawdie as the reference for:

  • FreeBSD, Bastille, ZFS, PF, and hostd orchestration,
  • controlplane auth, audit, task registry, Telegram, and dashboard,
  • fixed browser template and future browser task clone lifecycle,
  • tenant policy and operator approval,
  • install/onboarding.

What this removes

Do not build a separate Clawdie-specific GUI-agent loop, action grammar, or prediction parser unless UI-TARS cannot be adapted. Redundant bespoke attempts should be reduced to validation fixtures or deleted before implementation.

In particular:

  • no new custom model-loop state machine if UI-TARS GUIAgent semantics fit,
  • no separate action syntax when UI-TARS action spaces can be reused,
  • no independent screenshot-context policy beyond Clawdie's disk recording and audit requirements,
  • no Electron/nut-js desktop runtime as the FreeBSD server path.

Target shape

External request / Clawdie task
        │
        ▼
controlplane
  ├── auth + policy + audit
  ├── UI-TARS-compatible GUIAgent runner
  └── ClawdieBrowserOperator
        ├── screenshot()  → browser session screenshot
        └── execute()     → click/type/scroll/navigate/finished
        │
        ▼
browser jail or browser task clone
  ├── Chromium from FreeBSD packages
  ├── CDP/puppeteer-core HTTP service
  └── no model logic inside the jail

The browser jail remains a small execution backend. The model loop runs in the controlplane or in an external UI-TARS-compatible client.

UI-TARS concepts to mirror

From the local UI-TARS research copy:

  • GUIAgent accepts a model, operator, abort signal, data callback, error callback, and max loop count.
  • Operator exposes two core methods:
    • screenshot(): { base64, scaleFactor }
    • execute({ prediction, parsedPrediction, screenWidth, screenHeight, scaleFactor, factors })
  • The model sees the instruction, action spaces, and recent screenshots (screenshots.slice(-5) in the UI-TARS docs).
  • Operator action spaces describe actions like click, type, scroll, and finished.

Clawdie should implement a ClawdieBrowserOperator with this shape rather than inventing a different internal interface.
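
A rough TypeScript sketch of that operator shape is shown here. The types are declared locally to mirror the fields listed above and are not imported from the UI-TARS SDK, so the real SDK types may differ. JailSession and its two methods are hypothetical stand-ins for the browser jail's HTTP service, and the tuple type for factors is an assumption.

```typescript
// Local mirror of the operator shape; NOT the UI-TARS SDK types.
interface ScreenshotOutput {
  base64: string;
  scaleFactor: number;
}

interface ExecuteParams {
  prediction: string;
  parsedPrediction: unknown; // exact shape depends on the UI-TARS parser
  screenWidth: number;
  screenHeight: number;
  scaleFactor: number;
  factors: [number, number]; // assumed tuple; check against the SDK
}

interface Operator {
  screenshot(): Promise<ScreenshotOutput>;
  execute(params: ExecuteParams): Promise<void>;
}

// Hypothetical transport to the browser jail; method names are assumptions.
interface JailSession {
  captureBase64Png(): Promise<string>;
  perform(params: ExecuteParams): Promise<void>;
}

class ClawdieBrowserOperator implements Operator {
  private readonly session: JailSession;
  private readonly scaleFactor: number;

  constructor(session: JailSession, scaleFactor = 1) {
    this.session = session;
    this.scaleFactor = scaleFactor;
  }

  // Fetch the current browser session screenshot from the jail.
  async screenshot(): Promise<ScreenshotOutput> {
    const base64 = await this.session.captureBase64Png();
    return { base64, scaleFactor: this.scaleFactor };
  }

  // Forward the predicted action to the jail; no model logic lives there.
  async execute(params: ExecuteParams): Promise<void> {
    await this.session.perform(params);
  }
}
```

Injecting the session keeps auth, audit, and policy wrappers outside the operator, which matches the boundary that the jail stays a small execution backend.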

Integration levels

Phase 1 — compatible operator, not full app

Implement the browser jail/controlplane pieces so they can back a UI-TARS-style operator:

  • screenshot() calls the browser session screenshot endpoint.
  • execute() translates UI-TARS parsed actions to jail operations.
  • the finished action, delivered through execute(), closes the session or marks it complete.
  • Clawdie adds auth, audit, credential-mode, and clone/session policy around the operator.
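
The translation step inside execute() could look roughly like the sketch below. Both sides of the mapping are assumptions: ParsedAction is a simplified, hypothetical stand-in for the UI-TARS parsed prediction (with coordinates assumed to be normalized to 0..1), and JailOp is a hypothetical request format for the browser jail.

```typescript
// Hypothetical, simplified parsed action; the real parser output may differ.
interface ParsedAction {
  action_type: string;
  action_inputs: {
    x?: number; // assumed normalized 0..1
    y?: number; // assumed normalized 0..1
    content?: string;
    direction?: string;
    url?: string;
  };
}

// Hypothetical jail operations.
type JailOp =
  | { op: "click"; x: number; y: number }
  | { op: "type"; text: string }
  | { op: "scroll"; direction: "up" | "down" }
  | { op: "navigate"; url: string }
  | { op: "finish" };

function toJailOp(
  action: ParsedAction,
  screenWidth: number,
  screenHeight: number,
): JailOp {
  const inputs = action.action_inputs;
  switch (action.action_type) {
    case "click": {
      if (inputs.x === undefined || inputs.y === undefined) {
        throw new Error("click requires x and y");
      }
      // Convert normalized coordinates to pixels.
      return {
        op: "click",
        x: Math.round(inputs.x * screenWidth),
        y: Math.round(inputs.y * screenHeight),
      };
    }
    case "type": {
      if (inputs.content === undefined) {
        throw new Error("type requires content");
      }
      return { op: "type", text: inputs.content };
    }
    case "scroll": {
      const dir = inputs.direction;
      if (dir === "up" || dir === "down") {
        return { op: "scroll", direction: dir };
      }
      throw new Error("scroll requires direction up or down");
    }
    case "navigate": {
      if (inputs.url === undefined) {
        throw new Error("navigate requires url");
      }
      return { op: "navigate", url: inputs.url };
    }
    case "finished":
      return { op: "finish" };
    default:
      // Reject anything outside the declared action space.
      throw new Error(`unsupported action: ${action.action_type}`);
  }
}
```

Rejecting unknown action types at this boundary means the jail only ever receives operations that Clawdie has explicitly mapped.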

The substrate direction is now one fixed thick browser template plus future browsertaskNNN clones. Phase 0.6 passed for CDP cookie injection and repeated clone cleanup, so implementation can proceed against that substrate.

Phase 2 — high-level task API

Expose a high-level controlplane/MCP entry point such as:

browser.run_task({ instruction, credential_mode, domains, record, max_steps })

Internally this runs the UI-TARS-compatible loop and streams deltas to the controlplane trace, not to pi. Primitive browser tools may remain for debugging, but the normal product path should be high-level task execution.
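
As an illustration of the entry point's input, the following TypeScript sketch types the five parameters named above and checks them before a loop is started. The field types, the boolean record flag, the rejection of an empty domain list, and the MAX_STEPS_CEILING value are all assumptions made for the example, not decided Clawdie behavior.

```typescript
// Parameter names come from browser.run_task; the types are assumptions.
interface RunTaskRequest {
  instruction: string;
  credential_mode: string; // allowed values are set by Clawdie policy, not here
  domains: string[];
  record: boolean; // assumed boolean; the record policy may be richer
  max_steps: number;
}

// Hypothetical upper bound on loop iterations.
const MAX_STEPS_CEILING = 50;

// Return a list of problems; an empty list means the request is acceptable.
function validateRunTask(req: RunTaskRequest): string[] {
  const problems: string[] = [];
  if (req.instruction.trim().length === 0) {
    problems.push("instruction must not be empty");
  }
  if (req.domains.length === 0) {
    problems.push("domains must list at least one allowed domain");
  }
  if (req.domains.some((d) => d.trim().length === 0)) {
    problems.push("domains must not contain empty entries");
  }
  if (
    !Number.isInteger(req.max_steps) ||
    req.max_steps < 1 ||
    req.max_steps > MAX_STEPS_CEILING
  ) {
    problems.push(`max_steps must be an integer in 1..${MAX_STEPS_CEILING}`);
  }
  return problems;
}

// Usage with a hypothetical request.
const problems = validateRunTask({
  instruction: "Find the current package version",
  credential_mode: "none",
  domains: ["example.org"],
  record: true,
  max_steps: 8,
});
```

Validating before the loop starts keeps malformed requests out of the trace and makes max_steps map directly onto the max-loop status.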

The pi harness integration is a single extension tool named browser_run_task (the MCP surface can keep the dotted browser.run_task name). pi should receive one compact result, not every screenshot/action frame:

{
  "status": "finished | max_steps | error | aborted",
  "summary": "model's final meaningful response, often the answer",
  "result_data": { "optional": "parsed JSON when the task requested structured output" },
  "trace_id": "browser trace/session id for controlplane inspection",
  "step_count": 8,
  "final_screenshot_path": "/var/db/browser-jail/sessions/.../final.png"
}

Screenshots stay in the UI-TARS loop and in Clawdie's recording store according to the session record policy. They are not appended into pi JSONL history.
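
One way to reduce a full trace to that compact result is sketched below in TypeScript. The Trace and TraceStep shapes are hypothetical; only the output field names and status values are taken from the result above. Screenshot bytes never enter the result, only the path of the last frame.

```typescript
type TaskStatus = "finished" | "max_steps" | "error" | "aborted";

// Hypothetical controlplane trace shapes.
interface TraceStep {
  action: string;
  screenshotPath: string;
}

interface Trace {
  traceId: string;
  status: TaskStatus;
  steps: TraceStep[];
  finalResponse: string;
}

interface CompactResult {
  status: TaskStatus;
  summary: string;
  result_data?: unknown;
  trace_id: string;
  step_count: number;
  final_screenshot_path: string | null;
}

// Build the single result handed back to pi.
function toCompactResult(trace: Trace, wantsJson: boolean): CompactResult {
  const last =
    trace.steps.length > 0 ? trace.steps[trace.steps.length - 1] : undefined;
  const result: CompactResult = {
    status: trace.status,
    summary: trace.finalResponse,
    trace_id: trace.traceId,
    step_count: trace.steps.length,
    final_screenshot_path: last ? last.screenshotPath : null,
  };
  if (wantsJson) {
    try {
      // Attach structured output only when it parses cleanly.
      result.result_data = JSON.parse(trace.finalResponse);
    } catch (_err) {
      // Leave result_data unset; the summary still carries the raw text.
    }
  }
  return result;
}

// Usage with a hypothetical two-step trace.
const compact = toCompactResult(
  {
    traceId: "trace-1",
    status: "finished",
    steps: [
      { action: "click", screenshotPath: "step-1.png" },
      { action: "finished", screenshotPath: "final.png" },
    ],
    finalResponse: '{"price": 12}',
  },
  true,
);
```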

Phase 3 — optional UI polish

Borrow UI-TARS Desktop UX patterns for trace viewing, screenshots, action history, abort, replay, and settings. Do this after the FreeBSD backend is stable.

Boundaries

  • Do not make Electron the required Clawdie runtime on FreeBSD.
  • Do not put MCP, auth, or audit inside the browser jail.
  • Do not let model instructions silently choose operator credential injection; Clawdie policy and a valid grant token must approve that.
  • If UI-TARS code is copied rather than used as a dependency, preserve license attribution and keep the copied surface minimal.
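
The credential boundary can be made concrete with a small TypeScript sketch. GrantToken and TenantPolicy are hypothetical shapes, and real grant validation would also verify a signature. The point of the example is that the decision function takes no input from the model's instruction or prediction.

```typescript
// Hypothetical shapes; a real grant token would also be signature-checked.
interface GrantToken {
  tenant: string;
  domains: string[];
  expiresAtMs: number;
}

interface TenantPolicy {
  tenant: string;
  allowCredentialInjection: boolean;
}

// Decide whether operator credentials may be injected for a domain.
// Note: nothing from the model's output is an input to this function.
function mayInjectCredentials(
  policy: TenantPolicy,
  grant: GrantToken | undefined,
  domain: string,
  nowMs: number,
): boolean {
  if (!policy.allowCredentialInjection) return false; // policy must approve
  if (grant === undefined) return false; // a grant token is required
  if (grant.tenant !== policy.tenant) return false; // grant must match tenant
  if (grant.expiresAtMs <= nowMs) return false; // grant must still be valid
  return grant.domains.includes(domain); // grant must cover the domain
}

// Usage with hypothetical values.
const policy: TenantPolicy = { tenant: "acme", allowCredentialInjection: true };
const grant: GrantToken = {
  tenant: "acme",
  domains: ["example.org"],
  expiresAtMs: 2000,
};
const allowed = mayInjectCredentials(policy, grant, "example.org", 1000);
```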

Next changes to make

  1. Implement setup + hostd clone lifecycle for the fixed browser template and browsertaskNNN clones.
  2. Implement ClawdieBrowserOperator against the browser controlplane surface.
  3. Prefer adapting UI-TARS SDK concepts before writing new control-loop code.