Niranjan
Essay · Jan 2024 · 4 min read

Letting GPT drive

We let GPT-4 pick selectors and run actions on a real page during a test. The demo went great. Then we met a virtualized list.

A few months after the test-case generation feature shipped, we did the obvious next thing. We let GPT drive the browser.

In November I shipped what we called an AI Action. The shape was: at run time, hand GPT-4 the page HTML, tell it what the user wants to do in plain English, and let it return a structured tool call. The tool was performActionOnPage, with a selector, an actionType of "input", "click", "dblclick", "keypress", or "scroll", an optional value, an optional keyValue for things like Enter or Shift, scroll direction and amount for scrolls, and a toBeCalledAgain flag if the model wanted another turn. The supervisor took the result, ran it on the real page through the existing executor, and either kept going or asked the model again.
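For concreteness, here is the returned tool call as a TypeScript type. The selector, the actionType values, value, keyValue, and toBeCalledAgain match the list above; the scroll field names are a reconstruction.

type AIAction = {
  selector: string;                 // CSS selector GPT picked out of the page HTML
  actionType: "input" | "click" | "dblclick" | "keypress" | "scroll";
  value?: string;                   // text to type for "input" actions
  keyValue?: string;                // e.g. "Enter" or "Shift" for "keypress"
  scrollDirection?: "up" | "down";  // field name assumed; the prose says only "direction"
  scrollAmount?: number;            // field name assumed; the prose says only "amount"
  toBeCalledAgain?: boolean;        // the model asks the supervisor for another turn
};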

The prompt was five sentences. It is committed and I am looking at it as I type this.

You are a test automation bot that will run the actions provided
by the user on a HTML Web page. Output only in json. This is the
HTML of the webpage. Generat selectors based on this HTML. Ensure
the selectors are resilient.

"Generat" is missing an 'e'. I noticed it months later and left it. Same reason as the prompt in prompts.ts from earlier in the year: by the time I saw the typo, the prompt was the part that worked.

A few weeks after AI Action, I shipped a sibling. validateSingleElementOnPage: the user describes a check in English, GPT returns a JavaScript validate(variables, attributes, innerText) function as a string, and the supervisor evaluates it against the real element. The prompt had three or four worked examples. The output was usually a one or two line function.
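A minimal sketch of the supervisor side, assuming the returned string was a complete function expression; the Function-constructor trick is for illustration, not necessarily what shipped:

// GPT might return:
//   "function validate(variables, attributes, innerText) {
//      return Number(attributes.value) > 13;
//    }"
function evaluateValidation(
  fnSource: string,
  variables: Record<string, string>,
  attributes: Record<string, string>,
  innerText: string,
): boolean {
  // Wrapping the source in parentheses lets both `function` expressions
  // and arrow functions evaluate. Running model-written code like this is
  // exactly where the did-not-compile failures show up.
  const validate = new Function(`return (${fnSource});`)() as (
    v: Record<string, string>,
    a: Record<string, string>,
    t: string,
  ) => boolean;
  return validate(variables, attributes, innerText);
}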

The first demos went the way the test-case generator's demos had gone. A user types "click the Submit button at the bottom", GPT returns a sensible CSS selector, the supervisor clicks it, the test moves on. The validation flow worked too. "Check that the price is greater than thirteen" came back as Number(attributes.value) > 13 and ran clean. Customers found it impressive. The mechanic felt like the future for about three weeks.

Then it had to run for real, on the kinds of pages our customers actually tested. Take a virtualized list: the model picks a selector for a row, the user scrolls, and the same selector now points at different content. Or a custom dropdown built out of divs with no semantic role, where options only appear on hover and the selector the model returned belonged to an element that had already collapsed by the time the supervisor tried to use it. Every interaction changed the page, and the model was being asked to re-pick selectors against a target that was already moving when its previous answer arrived.
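To make the virtualized-list case concrete, a sketch in Playwright-flavored TypeScript, with an invented selector and invented row contents; the essay does not say which executor API sat underneath:

import type { Page } from "playwright";

async function staleSelectorDemo(page: Page) {
  // The model picks a positional selector for "the row with Invoice #1042".
  const row = page.locator(".list > div:nth-child(3)");
  console.log(await row.innerText()); // "Invoice #1042", correct right now

  // The user scrolls. The virtualizer recycles the same DOM nodes
  // and pours new data into them.
  await page.mouse.wheel(0, 600);

  // Same selector, same slot, different content: the next action lands
  // on whatever row now occupies position 3.
  console.log(await row.innerText()); // e.g. "Invoice #1057"
}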

The other failure mode was self-inflicted. The toBeCalledAgain flag meant the model could ask for another turn, and on long flows it would. Every turn was the full HTML again. Token bills climbed faster than I had budgeted for. The validate functions GPT wrote sometimes did not compile. When they did, they occasionally referenced an attribute that was not on the element, or hardcoded a value the user had explicitly written as a variable.
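The loop, roughly, with hypothetical names for the plumbing and the AIAction type from earlier; the real part is that every turn resent the full HTML:

// Hypothetical helpers standing in for the real supervisor plumbing.
declare function getPageHTML(): Promise<string>;
declare function askGPT(instruction: string, html: string): Promise<AIAction>;
declare const executor: { perform(action: AIAction): Promise<void> };

async function runAIAction(instruction: string): Promise<void> {
  let another = true;
  while (another) {
    const html = await getPageHTML();               // the entire page, every turn
    const action = await askGPT(instruction, html); // token cost scales with page size times turns
    await executor.perform(action);                 // run it on the real page
    another = action.toBeCalledAgain ?? false;      // the model decides whether to go again
  }
}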

By December the tweaks had started. We dropped the temperature and added more examples to the prompts. None of this fixed the underlying thing, which is that putting the model in the execution path made every test as deterministic as the model was, and the model was not.
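The temperature tweak was roughly one line; the call shape below assumes the OpenAI Node SDK, and the placeholders are just that:

import OpenAI from "openai";

const openai = new OpenAI();

// Placeholders for the assembled prompt and the sanitized page.
declare const promptWithExamples: { role: "system" | "user"; content: string }[];
declare const pageHTML: string;

const completion = await openai.chat.completions.create({
  model: "gpt-4",
  temperature: 0, // dropped; even zero is not determinism
  messages: [
    ...promptWithExamples,               // system prompt plus the worked examples
    { role: "user", content: pageHTML }, // the full HTML, again
  ],
});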

A few months earlier, the test-case generator had given me a tidy lesson: LLM features are forgiving of code quality. With the model now inside the execution loop, I was getting the second half of that lesson, which was less tidy. LLM features are unforgiving of the things code is good at. A regular click in our recorder is as flaky as the selector and the page. Once GPT was picking the selector, the same click was also as flaky as the model and the prompt and the temperature and the example mix and how much of the HTML fit in the context window today. I had spent a decade writing software that did the same thing twice when given the same input. I missed it more than I expected to.

GPT-4V is out, and I have spent more than a few evenings wondering whether a vision model would understand the page better than the structural representation we are handing it. I am also starting to doubt whether selectors are the right primitive at all. Even with the HTML we send heavily sanitized, with attributes stripped and decorative tags removed, data-heavy pages still fill most of the context window before the prompt and the tool definitions have room to land. The accessibility tree is the other shape I keep circling. It is flatter than the DOM and describes each element by what it is for. A framework's layout scaffolding mostly disappears.
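For a sense of the gap, a sketch of the comparison I keep running in my head, using Playwright's accessibility snapshot; an experiment shape, not something we shipped:

import { chromium } from "playwright";

async function compareRepresentations(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const html = await page.content();                // full DOM serialization: wrappers, classes, scaffolding
  const tree = await page.accessibility.snapshot(); // roles and names; layout divs mostly vanish

  // On framework-heavy pages the difference is usually dramatic.
  console.log("HTML chars:", html.length);
  console.log("A11y tree chars:", JSON.stringify(tree).length);

  await browser.close();
}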

We kept the AI Action and the AI validation in the product. They were better than nothing on flows the regular recorder could not handle. The November version of the story, where this was going to be the new shape of the product, has settled by now into something more measured. Plenty more AI work is still to come. I have just stopped expecting any single piece of it to replace the deterministic parts of the system.

Thanks for reading. Questions, disagreements, or corrections are welcome.