Niranjan
Essay XI · Apr 2025 · 5 min read

An Operator of our own

OpenAI shipped Operator in January. We built a smaller one inside Mockingjay, using the accessibility tree where we can and falling back to vision when it comes up empty.

A year and change after I wrote about letting GPT drive a single test step, OpenAI shipped Operator. A polished browser agent built on a Computer-Using Agent model: vision-based, GPT-4o looking at screenshots and clicking pixels. The numbers on WebArena were good and the demos looked clean. I watched the launch in January, and right away I was arguing with myself about whether to build one, since Mockingjay was already neck-deep in browsers.

It was never a product decision. We had containerized browsers. The recorder already spoke a structured action language. And since February there had been an AI-driven action runner in the supervisor, for test steps the regular recorder could not capture cleanly. The pieces for a workflow-driving agent were already in the box, mostly because we needed them for tests anyway. So we built a small Operator. Same idea as OpenAI's, much narrower scope: a goal, a browser, a loop, and a chat thread you can interrupt.

The Jan 2024 post ended with me wondering whether the accessibility tree might be a better signal than the HTML we were stuffing into the context window. A year later, the answer is yes, mostly. The supervisor pulls the Chrome accessibility tree, not the DOM. The role and the accessible name come from the browser itself, with elements that are not interactive collapsed out. The same page is roughly an order of magnitude smaller this way. The agent sees buttons, links, inputs, and what they are for, instead of fifteen layers of wrapper divs.
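
In sketch form, the extraction looks something like this. The CDP call is the real one Chrome exposes; the role set and the serialization are illustrative, not Mockingjay's actual internals.

```ts
import { chromium } from "playwright";

// Roles we treat as actionable. The real set is an assumption here.
const INTERACTIVE = new Set([
  "button", "link", "textbox", "combobox", "checkbox",
  "radio", "menuitem", "tab", "switch", "slider",
]);

async function interactiveNodes(url: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Pull the full accessibility tree over CDP: browser-computed role
  // and accessible name per node, the same signal a screen reader gets.
  const cdp = await page.context().newCDPSession(page);
  const { nodes } = await cdp.send("Accessibility.getFullAXTree");

  // Keep role + accessible name for interactive nodes; wrapper divs
  // and other non-interactive nodes fall away.
  const lines = nodes
    .filter((n) => !n.ignored && INTERACTIVE.has(String(n.role?.value)))
    .map((n) => `${n.role!.value} "${n.name?.value ?? ""}" [ax:${n.nodeId}]`);

  await browser.close();
  return lines; // e.g. `button "Place order" [ax:42]`
}
```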

Real websites are worse than the spec assumes. ARIA roles are missing, or wrong, or copy-pasted from the example in the documentation onto an element that does something entirely different. A <div> with an onclick handler and no role is still common enough that the a11y tree has nothing useful to say about it, and the agent ends up picking the wrong element.
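
Concretely, with made-up markup:

```ts
// Made-up markup, not from any customer app. The first div is the
// failure mode above; the second is what the spec wanted.
const bad  = `<div class="btn" onclick="submitOrder()">Place order</div>`;
const good = `<div class="btn" role="button" tabindex="0"
                   onclick="submitOrder()">Place order</div>`;

// Chrome's computed accessibility tree, roughly:
//   bad  -> generic ""          (the text is a StaticText child; an
//                                interactive-only filter drops the div)
//   good -> button "Place order"
```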

When the tree-based plan returns nothing useful, or the agent's confidence in the chosen element is low, the supervisor sends a screenshot and asks the model to ground the action against pixels. We do this sparingly, because vision is slow and expensive and the latency is felt by anyone watching the run. The vast majority of steps resolve against the tree. The vision turns kick in on the kind of page where a human would also have had to look.
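
The escalation policy, reduced to a sketch. The helper names and the confidence threshold are stand-ins for illustration, not the supervisor's real API:

```ts
import type { Page } from "playwright";

type Action =
  | { kind: "click"; axNodeId: string }        // resolved against the tree
  | { kind: "clickAt"; x: number; y: number }; // grounded against pixels

// Assumed helpers, named for illustration only:
declare function resolveAgainstTree(
  step: string, page: Page,
): Promise<{ axNodeId: string; confidence: number } | null>;
declare function groundAgainstPixels(
  step: string, screenshot: Buffer,
): Promise<{ x: number; y: number }>;

const CONFIDENCE_FLOOR = 0.8; // threshold is an assumption

async function resolveTarget(step: string, page: Page): Promise<Action> {
  // Cheap path: resolve the step against the accessibility tree.
  const hit = await resolveAgainstTree(step, page);
  if (hit && hit.confidence >= CONFIDENCE_FLOOR) {
    return { kind: "click", axNodeId: hit.axNodeId };
  }
  // Expensive path: ship a screenshot, ask the model for coordinates.
  const screenshot = await page.screenshot();
  const { x, y } = await groundAgainstPixels(step, screenshot);
  return { kind: "clickAt", x, y };
}
```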

The loop also changed. In Jan 2024 each step was one-shot, with a toBeCalledAgain flag if the model wanted more turns; the model usually said yes. The Operator loop runs the other way around. The agent emits an action, the supervisor runs it, the next observation comes back, and the agent has a chance to notice that the action did not do what it expected and correct course. The conversation column on the workflow execution row holds the whole thread for a run, with screenshots and tree snapshots inlined where they were used. Langfuse, which replaced Helicone in March, traces every step and is the only reason we can debug a long run that went sideways at turn nineteen.
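
The shape of that loop, as a sketch. The agent and executor interfaces are stand-ins, and the real thread also carries screenshots and tree snapshots:

```ts
type Message = { role: "user" | "assistant"; content: string };
type AgentAction =
  | { kind: "done"; summary: string }
  | { kind: string; [k: string]: unknown };

// Assumed interfaces for the model and the action executor:
declare const agent: { nextAction(thread: Message[]): Promise<AgentAction> };
declare const executor: { run(action: AgentAction): Promise<string> };

async function runWorkflow(goal: string, maxTurns = 30): Promise<AgentAction> {
  const thread: Message[] = [{ role: "user", content: goal }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const action = await agent.nextAction(thread);
    if (action.kind === "done") return action;

    // The supervisor executes the action; the observation comes back as
    // the next user turn, so the agent can notice a miss and correct.
    const observation = await executor.run(action);
    thread.push(
      { role: "assistant", content: JSON.stringify(action) },
      { role: "user", content: observation },
    );
  }
  throw new Error(`no result after ${maxTurns} turns`);
}
```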

Operator is running on Claude 3.7 Sonnet via Bedrock, with GPT-4o-mini for the smaller classification calls. We started on 3.5 Sonnet, the October version; 3.7 handled multi-step plans better. OpenAI's Operator runs on a model that was reinforcement-learned for the task. We did not have anything like that, so we ran an off-the-shelf instruct model in the same loop everyone else was writing. The one advantage we had was running it inside the supervisor process, where the agent's tools were just the existing action primitives the test executor already used.
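
For reference, a single planning turn against Bedrock looks roughly like this. The model ID is the published Bedrock identifier for Claude 3.7 Sonnet; some regions require a cross-region inference profile prefix such as "us.":

```ts
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// One planning turn via the Converse API.
async function planTurn(prompt: string): Promise<string> {
  const res = await client.send(new ConverseCommand({
    modelId: "anthropic.claude-3-7-sonnet-20250219-v1:0",
    messages: [{ role: "user", content: [{ text: prompt }] }],
    inferenceConfig: { maxTokens: 1024, temperature: 0 },
  }));
  return res.output?.message?.content?.[0]?.text ?? "";
}
```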

The obvious objection is the name. OpenAI shipped a product called Operator in January and we have a feature called Operator in our product, and no, we are not competing with OpenAI. The feature is for someone who is already using Mockingjay to test their app and wants a goal-driven runner alongside the regular recorder. "Book the cheapest flight to NYC next Tuesday" is not what we are pitching. "Run this workflow against a staging environment and tell me whether it succeeded" is closer.

A real-world enterprise app with a custom design system and aggressive React rerendering can still defeat the agent. The tree is wrong, the vision fallback is slow, and the run takes long enough that the customer gives up and falls back to the recorder. There are also apps where Operator handles entire flows end-to-end and looks like magic. We ship the same code to both.

In early 2024 I thought a single-shot AI Action might replace the recorder. That was wrong. The current shape, where the agent is one tool among several and is allowed to fumble, is closer to right. Whether that holds at the next model release, or the one after, is not something I plan to predict. We point the next Sonnet at it when one lands and read the diff.

Thanks for reading. Questions, disagreements, or corrections welcome.