UI-TARS Adoption Direction
Date: 11 May 2026
Status: DIRECTION (use UI-TARS as the computer-use reference design)
Clawdie should stop inventing a parallel GUI-agent stack where UI-TARS already has a polished, working shape. The adoption target is not to replace Clawdie's FreeBSD/controlplane substrate; it is to make Clawdie's browser-computer-use layer look like a UI-TARS operator backend.
Scope: UI-TARS is the GUI/browser agent loop reference only. Clawdie's controlplane, setup, auth, tenant policy, install, audit, Telegram, and jail orchestration remain Clawdie-owned.
Decision
Use UI-TARS as the reference for:
- the agent loop shape: screenshot → model prediction → execute → repeat,
- operator abstraction: screenshot() and execute(prediction),
- action-space prompt style,
- status lifecycle: init/running/end/max-loop,
- recent screenshot context rather than unlimited screenshot history,
- streamed/delta status updates to the user.
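The loop shape, status lifecycle, and bounded screenshot context listed above can be sketched as follows. This is a minimal illustration, not the UI-TARS implementation: the types Shot, Prediction, Operator, and Predict, the function runLoop, and the contextSize parameter are hypothetical names chosen for this note.

```typescript
// Hypothetical minimal types; the real UI-TARS SDK types may differ.
type Status = "init" | "running" | "end" | "max-loop";

interface Shot { base64: string; scaleFactor: number }
interface Prediction { action: string; finished: boolean }

interface Operator {
  screenshot(): Promise<Shot>;
  execute(prediction: Prediction): Promise<void>;
}

// Stand-in for the model call: instruction plus recent screenshots in, one prediction out.
type Predict = (instruction: string, recent: Shot[]) => Promise<Prediction>;

async function runLoop(
  instruction: string,
  operator: Operator,
  predict: Predict,
  maxLoop: number,
  onStatus: (status: Status, step: number) => void,
  contextSize = 5,
): Promise<Status> {
  const shots: Shot[] = [];
  onStatus("init", 0);
  for (let step = 1; step <= maxLoop; step++) {
    onStatus("running", step);
    shots.push(await operator.screenshot());
    // Only the most recent screenshots go to the model, not the full history.
    const prediction = await predict(instruction, shots.slice(-contextSize));
    await operator.execute(prediction);
    if (prediction.finished) {
      onStatus("end", step);
      return "end";
    }
  }
  // The loop budget ran out before the model reported completion.
  onStatus("max-loop", maxLoop);
  return "max-loop";
}
```

The onStatus callback is where streamed status updates would be emitted; the full screenshot list can still be written to disk for audit while the model only sees the trailing window.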
Keep Clawdie as the reference for:
- FreeBSD, Bastille, ZFS, PF, and hostd orchestration,
- controlplane auth, audit, task registry, Telegram, and dashboard,
- fixed browser template and future browser task clone lifecycle,
- tenant policy and operator approval,
- install/onboarding.
What this removes
Do not build a separate Clawdie-specific GUI-agent loop, action grammar, or prediction parser unless UI-TARS cannot be adapted. Redundant bespoke attempts should be reduced to validation fixtures or deleted before implementation.
In particular:
- no new custom model-loop state machine if UI-TARS GUIAgent semantics fit,
- no separate action syntax when UI-TARS action spaces can be reused,
- no independent screenshot-context policy beyond Clawdie's disk recording and audit requirements,
- no Electron/nut-js desktop runtime as the FreeBSD server path.
Target shape
External request / Clawdie task
        │
        ▼
controlplane
  ├── auth + policy + audit
  ├── UI-TARS-compatible GUIAgent runner
  └── ClawdieBrowserOperator
        ├── screenshot() → browser session screenshot
        └── execute()    → click/type/scroll/navigate/finished
        │
        ▼
browser jail or browser task clone
  ├── Chromium from FreeBSD packages
  ├── CDP/puppeteer-core HTTP service
  └── no model logic inside the jail
The browser jail remains a small execution backend. The model loop runs in the controlplane or in an external UI-TARS-compatible client.
UI-TARS concepts to mirror
From the local UI-TARS research copy:
- GUIAgent accepts a model, operator, abort signal, data callback, error callback, and max loop count.
- Operator exposes two core methods:
  - screenshot(): { base64, scaleFactor }
  - execute({ prediction, parsedPrediction, screenWidth, screenHeight, scaleFactor, factors })
- The model sees the instruction, action spaces, and recent screenshots (screenshots.slice(-5) in the UI-TARS docs).
- Operator action spaces describe actions like click, type, scroll, and finished.
Clawdie should implement a ClawdieBrowserOperator with this shape rather than
inventing a different internal interface.
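As a rough sketch of that shape, the operator can be a thin adapter over whatever client talks to the browser jail. The parameter names in ExecuteParams come from the list above, but their types are assumptions, as are the action_type/action_inputs field names, the JailClient interface, and the fixed scale factor of 1; none of this is taken from the UI-TARS source.

```typescript
// Field names follow the UI-TARS concepts listed above; the exact types are assumptions.
interface ScreenshotOutput { base64: string; scaleFactor: number }

interface ParsedPrediction {
  action_type: string;
  action_inputs: Record<string, string>;
}

interface ExecuteParams {
  prediction: string;
  parsedPrediction: ParsedPrediction;
  screenWidth: number;
  screenHeight: number;
  scaleFactor: number;
  factors: [number, number];
}

// Hypothetical client for the browser jail's HTTP service.
interface JailClient {
  screenshot(sessionId: string): Promise<string>; // base64-encoded image
  perform(sessionId: string, action: ParsedPrediction): Promise<void>;
}

class ClawdieBrowserOperator {
  private readonly client: JailClient;
  private readonly sessionId: string;

  constructor(client: JailClient, sessionId: string) {
    this.client = client;
    this.sessionId = sessionId;
  }

  async screenshot(): Promise<ScreenshotOutput> {
    const base64 = await this.client.screenshot(this.sessionId);
    // Assumes the jail browser renders at device scale 1.
    return { base64, scaleFactor: 1 };
  }

  async execute(params: ExecuteParams): Promise<void> {
    // Forward only the parsed action; the raw prediction text stays in the controlplane.
    await this.client.perform(this.sessionId, params.parsedPrediction);
  }
}
```

Injecting the client keeps the operator testable without a running jail and leaves auth, audit, and policy in the controlplane code that constructs it.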
Integration levels
Phase 1 — compatible operator, not full app
Implement the browser jail/controlplane pieces so they can back a UI-TARS-style operator:
- screenshot() calls the browser session screenshot endpoint.
- execute() translates UI-TARS parsed actions to jail operations.
- finished() closes or marks the session complete.
- Clawdie adds auth, audit, credential-mode, and clone/session policy around the operator.
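The translation step in execute() could look roughly like the pure function below. The JailOp union, the ParsedAction input shape, and the name toJailOp are hypothetical; the supported actions are the ones named in the target shape (click, type, scroll, navigate, finished). Anything else is rejected rather than passed through.

```typescript
// Hypothetical jail-side operations; the real HTTP service may use different names.
type JailOp =
  | { op: "click"; x: number; y: number }
  | { op: "type"; text: string }
  | { op: "scroll"; direction: "up" | "down" }
  | { op: "navigate"; url: string }
  | { op: "finish" };

// Assumed shape of an already-parsed model action.
interface ParsedAction {
  type: string;
  inputs: { x?: string; y?: string; text?: string; direction?: string; url?: string };
}

function toJailOp(action: ParsedAction): JailOp {
  const { type, inputs } = action;
  switch (type) {
    case "click": {
      const x = Number(inputs.x);
      const y = Number(inputs.y);
      // Missing or non-numeric coordinates become NaN and are rejected here.
      if (!Number.isFinite(x) || !Number.isFinite(y)) {
        throw new Error("click needs numeric x and y");
      }
      return { op: "click", x, y };
    }
    case "type":
      return { op: "type", text: inputs.text ?? "" };
    case "scroll":
      // Anything other than "up" is treated as "down".
      return { op: "scroll", direction: inputs.direction === "up" ? "up" : "down" };
    case "navigate": {
      const url = inputs.url;
      if (!url) throw new Error("navigate needs a url");
      return { op: "navigate", url };
    }
    case "finished":
      return { op: "finish" };
    default:
      // Unknown actions never reach the jail.
      throw new Error("unsupported action: " + type);
  }
}
```

Keeping the mapping as a closed union means a new model action has to be added deliberately before the jail will execute it.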
The substrate direction is now one fixed thick browser template plus future
browsertaskNNN clones. Phase 0.6 passed for CDP cookie injection and repeated
clone cleanup, so implementation can proceed against that substrate.
Phase 2 — high-level task API
Expose a high-level controlplane/MCP entry point such as:
browser.run_task({ instruction, credential_mode, domains, record, max_steps })
Internally this runs the UI-TARS-compatible loop and streams deltas to the controlplane trace, not to pi. Primitive browser tools may remain for debugging, but the normal product path should be high-level task execution.
The pi harness integration is a single extension tool named browser_run_task
(the MCP surface can keep the dotted browser.run_task name). pi should receive
one compact result, not every screenshot/action frame:
{
  "status": "finished | max_steps | error | aborted",
  "summary": "model's final meaningful response, often the answer",
  "result_data": { "optional": "parsed JSON when the task requested structured output" },
  "trace_id": "browser trace/session id for controlplane inspection",
  "step_count": 8,
  "final_screenshot_path": "/var/db/browser-jail/sessions/.../final.png"
}
Screenshots stay in the UI-TARS loop and in Clawdie's recording store according
to the session record policy. They are not appended into pi JSONL history.
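One way to produce that compact result is to fold the controlplane trace into it at the end of the run. The sketch below assumes a hypothetical TraceFrame record and a structured flag meaning the task asked for JSON output; it also returns null for final_screenshot_path when no frame exists, which the result shape above does not specify.

```typescript
type FinalStatus = "finished" | "max_steps" | "error" | "aborted";

// Hypothetical per-step trace record kept by the controlplane.
interface TraceFrame {
  action: string;
  screenshotPath: string;
  message?: string; // model text for this step, if any
}

interface CompactResult {
  status: FinalStatus;
  summary: string;
  result_data?: unknown;
  trace_id: string;
  step_count: number;
  final_screenshot_path: string | null;
}

function toCompactResult(
  traceId: string,
  status: FinalStatus,
  frames: TraceFrame[],
  structured: boolean,
): CompactResult {
  // The summary is the last non-empty model message in the trace.
  let summary = "";
  for (const frame of frames) {
    if (frame.message) summary = frame.message;
  }
  const last = frames.length > 0 ? frames[frames.length - 1] : undefined;
  const result: CompactResult = {
    status,
    summary,
    trace_id: traceId,
    step_count: frames.length,
    final_screenshot_path: last ? last.screenshotPath : null,
  };
  if (structured) {
    try {
      result.result_data = JSON.parse(summary);
    } catch {
      // Not valid JSON: leave result_data unset and keep the text summary.
    }
  }
  return result;
}
```

Only a path to the final screenshot crosses into the result; the image data and the intermediate frames stay behind the trace_id.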
Phase 3 — optional UI polish
Borrow UI-TARS Desktop UX patterns for trace viewing, screenshots, action history, abort, replay, and settings. Do this after the FreeBSD backend is stable.
Boundaries
- Do not make Electron the required Clawdie runtime on FreeBSD.
- Do not put MCP, auth, or audit inside the browser jail.
- Do not let model instructions silently choose operator credential injection; Clawdie policy and a valid grant token must approve that.
- If UI-TARS code is copied rather than used as a dependency, preserve license attribution and keep the copied surface minimal.
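The credential-injection boundary can be illustrated with a small policy gate. Everything here is hypothetical: the CredentialMode values, the Grant fields, and the function name are placeholders, and a real grant check would also verify the token's authenticity rather than only its tenant and expiry. The requested mode is assumed to come from the task request, never from model output.

```typescript
type CredentialMode = "none" | "operator"; // hypothetical values

// Hypothetical grant record; expiresAt is epoch milliseconds.
interface Grant { tenant: string; expiresAt: number }

interface Decision { allowed: boolean; reason: string }

function authorizeCredentialMode(
  requested: CredentialMode,
  tenant: string,
  tenantAllowsInjection: boolean,
  grant: Grant | undefined,
  now: number,
): Decision {
  if (requested === "none") {
    return { allowed: true, reason: "no credentials requested" };
  }
  // Both tenant policy and a valid grant are required; either one alone is not enough.
  if (!tenantAllowsInjection) {
    return { allowed: false, reason: "tenant policy forbids injection" };
  }
  if (!grant) {
    return { allowed: false, reason: "missing grant token" };
  }
  if (grant.tenant !== tenant) {
    return { allowed: false, reason: "grant issued for another tenant" };
  }
  if (grant.expiresAt <= now) {
    return { allowed: false, reason: "grant expired" };
  }
  return { allowed: true, reason: "policy and grant approve injection" };
}
```

Returning a reason alongside the decision gives the audit log something concrete to record for each refusal.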
Next changes to make
- Implement setup + hostd clone lifecycle for the fixed browser template and browsertaskNNN clones.
- Implement ClawdieBrowserOperator against the browser controlplane surface.
- Prefer adapting UI-TARS SDK concepts before writing new control-loop code.