Niranjan
Essay · XII · Oct 2025 · 6 min read

Drawing the map

We wanted an agent to map a web application so the next agent could write tests against it. The transcript was the wrong shape for that, so we put the graph where it belongs and the agents around it.

A goal lurking behind Mockingjay since the start: tests should write themselves. The recorder is a workaround for the fact that they cannot. The Operator we shipped in April was a fun side project, but it left a thing on the table we kept poking at. If an agent can run a workflow we describe in English, can it explore an app we point it at and come back with enough of a map that the next agent can write tests against it?

That was the work for September and October.

The first instinct was the wrong one. We started writing a bigger Operator. One agent, a richer prompt, a longer tool list, and a JSON object on the side that we updated as the agent went. The agent was now expected to navigate the page, classify the page, decide what to do next, and remember everything it had seen well enough not to revisit it. By turn fifteen the JSON object was lying about the shape of the app. The agent was lying about what it was looking at, and the loop kept going for another forty turns because there was no stopping condition the agent could not talk itself out of. A linear conversation cannot describe an application that doubles back on itself.

A web application is a graph. Nodes are pages, edges are the ways one page leads to another. The memory of an exploration is a partially-built map of that graph, with the agent standing on one node and looking at the next few edges available to it. We pulled the map out of the agent's transcript and into a real directed graph. Nodes and edges have schemas. A node carries the URL, the page type the model thinks it is looking at (login, dashboard, form, listing, error, and a handful more), whether it requires authentication, the count of interactive elements, the count of forms, and a discovery timestamp. An edge is typed: direct_link, form_submission, button_click, menu_navigation, ajax_call, redirect, workflow_step, conditional_access. The edge also carries the element that triggered it, a weight, and a traversal count. The graph lives in Postgres on the workflow execution row, the same place the Operator's conversation lives.
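In TypeScript terms, the shape is roughly this. The field names are approximate and the page-type enum is trimmed, so read it as a sketch of the schema rather than the schema:

```typescript
// Rough rendering of the node and edge schemas described above.
// Field names are approximate; the page types past the five named
// ones are placeholders, not the real enum.

type PageType =
  | "login"
  | "dashboard"
  | "form"
  | "listing"
  | "error"
  | "unknown"; // the real enum has "a handful more" entries

type EdgeType =
  | "direct_link"
  | "form_submission"
  | "button_click"
  | "menu_navigation"
  | "ajax_call"
  | "redirect"
  | "workflow_step"
  | "conditional_access";

interface PageNode {
  id: string;                      // stable node id, e.g. a canonicalized-URL hash
  url: string;
  pageType: PageType;              // what the model thinks it is looking at
  requiresAuth: boolean;
  interactiveElementCount: number;
  formCount: number;
  discoveredAt: string;            // ISO-8601 discovery timestamp
}

interface PageEdge {
  from: string;                    // source node id
  to: string;                      // target node id
  type: EdgeType;
  triggerElement: string;          // selector/description of the element that fired it
  weight: number;
  traversalCount: number;
}

interface ApplicationGraph {
  nodes: Record<string, PageNode>;
  edges: PageEdge[];
}
```

The Postgres column on the workflow execution row is then presumably a serialization of something like this.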

The next thing that did not fit in a single agent was the work itself. There are at least four problems on the table at any given step: figuring out what is on the page, picking the next move, performing it cleanly, and writing the result into the graph without duplicating a node we have already seen. They are different shapes, and a single prompt asked to do all four does each of them worse than a focused prompt that does one. So the Explorer has five agents.

A coordinator runs the loop. It does not see the page; it sees only the other agents' outputs and the graph. Its tools are the other agents, four of them, plus update_application_graph, complete_exploration, and abort_exploration. On every new page the coordinator is required to call the analysis agent first, before anything else. We added that guardrail in the third or fourth week, after the coordinator kept skipping analysis when it thought it already knew what was on the page.
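Whether that kind of rule lives in the prompt or in code around the loop, the code-side version is small. A sketch with illustrative names:

```typescript
// "Analysis first on every new page," enforced in code rather than
// trusted to the prompt. Tool and state names are illustrative.

type CoordinatorTool =
  | "navigation_agent"
  | "analysis_agent"
  | "decision_agent"
  | "graph_builder_agent"
  | "update_application_graph"
  | "complete_exploration"
  | "abort_exploration";

interface LoopState {
  currentNodeId: string | null;  // node the browser is currently on
  analyzedNodeIds: Set<string>;  // nodes the analysis agent has already assessed
}

// If the coordinator is on a page the analysis agent has not seen,
// override whatever tool it picked and force the analysis call.
function guardToolCall(state: LoopState, requested: CoordinatorTool): CoordinatorTool {
  const onUnanalyzedPage =
    state.currentNodeId !== null &&
    !state.analyzedNodeIds.has(state.currentNodeId);
  if (onUnanalyzedPage && requested !== "analysis_agent") {
    return "analysis_agent";
  }
  return requested;
}
```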

The navigation agent does browser actions. Navigate to a URL, click an element, type into an input, wait for an element to appear. One atomic action per call. The analysis agent takes a screenshot and returns a structured assessment: page type, authentication state, the interactive elements it can see, error messages if any, navigation structure. The decision agent picks the next move and a strategy for the run, choosing among breadth_first, depth_first, targeted, and mixed, and scoring candidate targets with a priority between zero and one. The graph builder is the agent that actually writes to the graph, with tools to add or update nodes, add or update edges, query the existing graph, and validate its integrity. The others can suggest a change, but only the graph builder commits it. That was the constraint that stopped the graph from drifting, more so than the agent count.
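The structured outputs are the contract between these agents, and their rough shapes fall out of the description above. Approximate names, not the literal schemas:

```typescript
// Approximate shapes for the analysis and decision agents' outputs,
// inferred from the prose.

type PageType = string;            // stands in for the enum in the node sketch above

interface PageAssessment {
  pageType: PageType;
  authenticated: boolean;
  interactiveElements: { selector: string; kind: string; label?: string }[];
  errorMessages: string[];
  navigation: string[];            // e.g. visible menu structure
}

type Strategy = "breadth_first" | "depth_first" | "targeted" | "mixed";

interface ExplorationDecision {
  strategy: Strategy;
  candidates: { selector: string; priority: number }[]; // priority in [0, 1]
  chosen: string;                  // selector of the next element to act on
}
```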

The whole thing runs on Sonnet 4.5, which shipped on September 29; we moved everything to it in the first week of October. Sonnet 4 had been in the Operator path since May, and 4.5 was a free upgrade at the same price, so there was nothing to decide. The agents share the model and the Bedrock provider; the prompts and the tool sets are what make them different.
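Concretely, the configuration reduces to one model id and a prompt plus tool list per agent. A sketch; the Bedrock model id here follows AWS's naming scheme for Sonnet 4.5 but should be verified against the current model list, and the prompts and tool names are placeholders:

```typescript
// One model, many prompts. The model id is an assumption about
// Bedrock's naming; everything else is a placeholder.

const MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0";

interface AgentConfig {
  name: string;
  systemPrompt: string;  // the real differentiator between agents
  tools: string[];       // tool names exposed to this agent
}

const agents: AgentConfig[] = [
  {
    name: "coordinator",
    systemPrompt: "Run the exploration loop...",
    tools: [
      "navigation_agent", "analysis_agent", "decision_agent", "graph_builder_agent",
      "update_application_graph", "complete_exploration", "abort_exploration",
    ],
  },
  {
    name: "navigation",
    systemPrompt: "Perform exactly one browser action per call...",
    tools: ["navigate", "click", "type", "wait_for_element"],
  },
  // analysis, decision, and graph builder follow the same pattern
];
```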

We did try Claude Opus 4.1 on the coordinator specifically, for a couple of weeks after it shipped in August. Opus was noticeably better at the long-horizon coordination: fewer wandering runs, fewer redundant analyses. At several times the price, it was not worth keeping as the default. Sonnet on the coordinator is a little worse than Opus and much cheaper, and that was the trade we made.

Two objections, both fair.

The first is that five agents is over-engineering. One agent with the right tools could do this. I half-agree. The split came out of debugging, not architecture astronaut work; each agent is here because the single-prompt version was visibly worse at that piece. The roles are a compression of weeks of looking at traces. If a future model can hold all four jobs at once without dropping any of them, the agents collapse back into one and we will not miss them. We are not married to five.

The second is that this is a sitemap crawler with model calls bolted on. Mostly fair on a static site, where a crawler would do most of what we are doing. The model earns its keep where the crawler cannot: deciding that a <div> with an onclick handler is the same kind of thing as a button somewhere else, that a modal counts as a different node than the page it sits on, that this form is the same form we filled out two hops ago even though the URL has a new id in it. The bookkeeping a crawler does on a static site is easy. The bookkeeping on a real SPA is the entire problem.
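The mechanical half of that last judgment is sketchable: collapse id-like URL segments before comparing nodes, so two detail pages with different ids canonicalize to the same node. A heuristic like this handles the easy cases; the model is there for the ones it misses:

```typescript
// Collapse id-like URL segments so "/orders/1234/edit" and
// "/orders/5678/edit" resolve to the same node. A heuristic sketch
// only, not our actual dedup logic.

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  const path = url.pathname
    .split("/")
    .map((seg) => (/^\d+$/.test(seg) || UUID_RE.test(seg) ? ":id" : seg))
    .join("/");
  return `${url.origin}${path}`;
}

// Both calls return "https://app.example.com/orders/:id/edit":
// canonicalizeUrl("https://app.example.com/orders/1234/edit")
// canonicalizeUrl("https://app.example.com/orders/5678/edit")
```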

Where the Explorer still falls over is, irritatingly, the kind of app where the map would have been most useful. A dense, data-heavy enterprise application with hundreds of pages, fifty of them behind permission gates, ten variants of the same screen depending on which customer profile is loaded, and three modes of navigation that share none of the same patterns. The decision agent looks at the candidates and the priority scores cluster too closely to choose well. The run wanders. The graph at hour two has the shallow surface of the app mapped pretty well and almost none of the depth. Coverage stalls. We watch the trace for a while, then shut the run down.
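Shutting such a run down by hand suggests the obvious automation: abort when the scores go flat and coverage stops moving. A sketch with illustrative thresholds:

```typescript
// Abort when the decision agent's priority scores are too flat to
// discriminate and no new nodes have appeared for a while. The window
// and spread thresholds are illustrative, not shipped behavior.

interface TurnStats {
  priorities: number[];     // candidate scores from the decision agent, in [0, 1]
  nodesDiscovered: number;  // cumulative node count after this turn
}

function shouldAbort(
  history: TurnStats[],
  window = 10,    // turns to look back
  spread = 0.05,  // "too close to choose well": max - min under this
): boolean {
  if (history.length < window) return false;
  const recent = history.slice(-window);
  const latest = recent[recent.length - 1];
  if (latest.priorities.length === 0) return false;
  const flat =
    Math.max(...latest.priorities) - Math.min(...latest.priorities) < spread;
  const stalled = latest.nodesDiscovered === recent[0].nodesDiscovered;
  return flat && stalled;
}
```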

Tests do not write themselves yet. The map is the prerequisite, and it is partial on the apps where it would have mattered most. Even where the map is good, we have not built the next agent yet, the one that reads the graph and writes the cases. We will. Whether the shape we have now survives the next model release, or whether the whole thing wants to be one agent again with a graph as its scratchpad, is not something I have a confident answer to. We point Sonnet 4.5 at it and read the failures.

Thanks for reading. Questions, disagreements, or corrections are welcome.