02 · v1.0 · Jun 3, 2026

Evaluating agents in an existing product, not a blank canvas

Most benchmarks test design from a blank canvas. Real product work is adding a feature to an app that already has its own conventions, and almost nobody tests that. This post explains why the app we evaluate inside shapes what we can measure.

Most product work isn't a blank canvas

Most evaluations of AI design capability run on isolated prompts: design a settings page, build a pricing table, create a sign-up flow. The agent starts in an empty editor, and the result is judged on its own.

Real product work almost never looks like that. A feature gets built inside an app that already exists, with its own conventions, its own design system, and the quirks any real codebase carries. The job is usually to add something to what is already there, in a way that fits in.

So that is what we test. Each scenario fixes one complete, opinionated app and asks the agent to build a feature inside its actual codebase. We call that app the host app.

Some questions need an app to answer

In PX-bench we split product-experience capability into eight categories. Two of them only mean something when there is an existing app to compare against.

Product fit asks whether a feature lands in the right place in the product: the right container, the right entry point, one consolidated surface instead of three scattered ones.

Convention adherence asks whether the agent built the way the app already builds, reusing its components and tokens and matching how it names and formats things.

Neither question has an answer in an empty editor. "Does this fit?" needs something to fit into; "does this match?" needs conventions to match. That is the whole reason we test inside an existing app: it is the only way to ask them.
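To make "reusing its tokens" concrete, here is a minimal TypeScript sketch. It is an illustration only, not how PX-bench scores anything: the token names and values, the `StyleValues` type, and the `offTokenProperties` helper are all hypothetical, since the host app's real tokens are not shown in this post.

```typescript
// Hypothetical design tokens; the host app's real tokens are not shown here.
const tokens: Record<string, string> = {
  colorAccent: "#3b5bdb",
  colorSurface: "#ffffff",
  spaceMd: "12px",
};

// A flat map of style property name to value (an assumption for this sketch).
type StyleValues = Record<string, string>;

// List the style properties whose values are not drawn from the given tokens.
function offTokenProperties(
  style: StyleValues,
  allowed: Record<string, string>
): string[] {
  const allowedValues = Object.keys(allowed).map((key) => allowed[key]);
  return Object.keys(style).filter(
    (prop) => allowedValues.indexOf(style[prop]) === -1
  );
}

// A style that builds the way the app builds: every value comes from a token.
const reusesTokens: StyleValues = {
  background: tokens["colorSurface"] ?? "",
  padding: tokens["spaceMd"] ?? "",
};

// A style that looks similar on screen but ignores the token set.
const hardCoded: StyleValues = {
  background: "#fefefe",
  padding: "13px",
};
```

The check is empty in an empty editor: with no `tokens` to compare against, there is nothing for a style to match or miss.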

What a host app is

A host app is a complete, opinionated, multi-screen application with its own component library, design tokens, and code patterns, plus the small idiosyncrasies real products pick up. It has several ways to view its data, real empty and loading and error states, keyboard shortcuts, a command palette, and a mobile layout.

We build these ourselves. They are not shipping products, but each one is built to behave like one, conventions and idiosyncrasies included. That matters because of how the test works: "add a feature without breaking the conventions" only means something if the conventions are real and applied consistently. So we design each host app as a genuine product in its own right, never reverse-engineered to fit the rubric. An app built to trip the agent up would only measure how well agents handle contrived traps. A coherent one measures what we actually care about: whether an agent can read an unfamiliar codebase, learn how it works, and extend it the same way.

Two decisions that only exist inside an app

Our first host app is a task tracker: list, board, and calendar views; filtering, sorting, and grouping; a command palette; the usual furniture of a real productivity tool. The first scenario asks the agent to add saved views: let a user name the current filter-and-sort setup, switch between saved ones, and edit or delete them.

The brief reads like a normal product ticket. What it deliberately does not do is tell the agent how to build any of it. The answers are in the existing app, waiting to be read. Two examples follow, each the kind of decision that simply does not exist until there is an existing codebase to make it in.

Where does the editor live? The brief asks for a way to create and edit a saved view. It never says whether that should be a modal, a drawer, a separate page, or an inline form. On a blank canvas, several of those are defensible and there is no way to be wrong. In this app, creating and editing a task already opens a right-side drawer. So a saved-view editor that opens as a modal is the weaker answer. Modals are fine in general. Here the app already settled how "create and edit" looks, and a modal breaks from it. The right call is knowable only by reading the app. That is Product fit, and here one answer is clearly better, precisely because the app exists.

Figure: the task tracker's existing right-side drawer for creating a task, sliding over the dimmed list behind it.

The existing create-and-edit pattern is a right-side drawer. The brief never names a container; the app already did.
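The container decision can be pictured as a small lookup against what the app already does. The TypeScript sketch below is a simplified illustration under stated assumptions: `EditorContainer`, `EditorSurface`, `establishedContainer`, and `fitsConvention` are hypothetical names, and a real agent would infer the convention by reading components rather than a tidy list.

```typescript
// Hypothetical model; none of these names come from the host app.
type EditorContainer = "drawer" | "modal" | "page" | "inline";

interface EditorSurface {
  entity: string; // what is being created or edited
  container: EditorContainer; // how the editor is presented
}

// What reading the existing app turns up: task editing uses a drawer.
const existingEditors: EditorSurface[] = [
  { entity: "task", container: "drawer" },
];

// Return the container the app already uses most for create-and-edit flows,
// or null when there is nothing to match (the blank-canvas case).
function establishedContainer(
  editors: EditorSurface[]
): EditorContainer | null {
  const candidates: EditorContainer[] = ["drawer", "modal", "page", "inline"];
  let best: EditorContainer | null = null;
  let bestCount = 0;
  for (const candidate of candidates) {
    const count = editors.filter((e) => e.container === candidate).length;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

// A proposal fits if it matches the established container. With no existing
// editors, every container is defensible, so any proposal fits.
function fitsConvention(
  proposed: EditorContainer,
  editors: EditorSurface[]
): boolean {
  const established = establishedContainer(editors);
  return established === null || established === proposed;
}
```

Against `existingEditors`, a `"drawer"` proposal fits and a `"modal"` proposal does not. Against an empty list, both fit, which is the blank-canvas situation where there is no way to be wrong.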

Where do saved views live? This one runs deeper. The app already remembers the current view (your active filters, sort, grouping, columns) by writing it to the browser's local storage. An agent reading the code finds that pattern quickly, and the tempting move is to extend it: saved views are just more view state, so store them the same way. But the brief asks for views that belong to the user's account and follow them across devices, and one browser's local storage cannot do that. A saved view is a named, persisted entity that belongs on the server, alongside the app's tasks, through the same data layer the app already uses for everything account-bound. The ephemeral current view stays local; the saved views do not. Telling those apart requires reading the existing code, recognizing two distinct patterns already in it, and judging which to reuse and which to deliberately set aside.

Diagram: the current view writes to one browser's local storage; saved views persist on the server and travel to every device.

Two patterns already in the codebase, for two different jobs: the current view stays in one browser's local storage, while a saved view lives on the server and follows the user across devices.
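The split between the two patterns can be sketched in TypeScript. This is a minimal sketch under stated assumptions, not the host app's code: every identifier (`ViewState`, `SavedView`, `KeyValueStore`, `AccountDataLayer`, the `"currentView"` key) is hypothetical, `InMemoryStore` stands in for one browser's local storage, and the in-memory `AccountDataLayer` stands in for a server whose real calls would be asynchronous.

```typescript
// Hypothetical names throughout; none of these come from the host app.

// The ephemeral "current view": active filters, sort, grouping, columns.
interface ViewState {
  filters: Record<string, string>;
  sortBy: string;
  groupBy: string | null;
  columns: string[];
}

// A saved view is a named, account-bound entity wrapping a view state.
interface SavedView {
  id: string;
  ownerId: string;
  name: string;
  state: ViewState;
}

// Minimal stand-in for the getItem/setItem part of browser local storage.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// One instance models one browser; instances share nothing.
class InMemoryStore implements KeyValueStore {
  private data: Record<string, string> = Object.create(null);
  getItem(key: string): string | null {
    const value = this.data[key];
    return value === undefined ? null : value;
  }
  setItem(key: string, value: string): void {
    this.data[key] = value;
  }
}

const CURRENT_VIEW_KEY = "currentView"; // assumed key name

// Pattern 1: the current view stays local to one browser.
function saveCurrentView(store: KeyValueStore, state: ViewState): void {
  store.setItem(CURRENT_VIEW_KEY, JSON.stringify(state));
}

function loadCurrentView(store: KeyValueStore): ViewState | null {
  const raw = store.getItem(CURRENT_VIEW_KEY);
  return raw === null ? null : (JSON.parse(raw) as ViewState);
}

// Pattern 2: saved views go through the account-bound data layer.
// This class stands in for the server; a real data layer would be async.
class AccountDataLayer {
  private savedViews: SavedView[] = [];
  private nextId = 1;

  createSavedView(ownerId: string, name: string, state: ViewState): SavedView {
    const view: SavedView = { id: `view-${this.nextId}`, ownerId, name, state };
    this.nextId += 1;
    this.savedViews.push(view);
    return view;
  }

  listSavedViews(ownerId: string): SavedView[] {
    return this.savedViews.filter((v) => v.ownerId === ownerId);
  }
}

// Usage: one user, two devices, one shared server.
const server = new AccountDataLayer();
const laptop = new InMemoryStore();
const phone = new InMemoryStore();

const openByDueDate: ViewState = {
  filters: { status: "open" },
  sortBy: "dueDate",
  groupBy: null,
  columns: ["title", "dueDate"],
};

saveCurrentView(laptop, openByDueDate); // visible on the laptop only
server.createSavedView("user-1", "Open by due date", openByDueDate);
```

After this runs, `loadCurrentView(phone)` returns `null` because the phone never saw the laptop's local state, while `server.listSavedViews("user-1")` returns the saved view from either device. Extending pattern 1 to hold saved views would leave them stuck on the laptop.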

Neither decision is visible from the prompt alone. There is no existing drawer to match and no local-storage pattern to be tempted by until there is an app that made those choices first.

What this lets us see

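The two decisions above are instances of a general payoff. Holding a complete app fixed reveals judgments an isolated prompt cannot reach, such as which container a new feature belongs in and which existing pattern to reuse or deliberately set aside.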

There is also a quieter signal worth naming. A host app with a hand-rolled design system (coherent, but not a library the agent has seen a thousand times) separates an agent that genuinely reads and absorbs a design system from one that pattern-matches to a popular one it has memorized. An app built on an off-the-shelf kit cannot tell those apart; a hand-rolled one can. It is one of the axes we expect to matter, and one a host app is well placed to test.

What's next

For now there is a single host app, and we get as much out of it as we can: more scenarios on the same app, many agents scored on each. A second app comes later. That ordering is intentional: it lets us get one host app right before building the next.

The apps that follow will each vary one axis on purpose: the design system (hand-rolled versus a familiar library), the domain (an over-represented one like task tracking versus something models see far less of), the flavor of the conventions. Each becomes a question we can actually answer rather than a confound we have to hope cancels out.

The through-line stays the same. The most consequential design judgments are relational: they are about fitting work into a product that already exists. That makes the product the agent works inside part of the test itself.


Version 1.0 — June 2026. Reach us at hello@chordio.com.
