07 · v1.0 · Jun 11, 2026

What gets measured in AI design evaluation today, and what doesn't

Benchmarks can tell you whether an agent's output looks nice, and whether it matches a mockup. Almost none can tell you whether it made the right product decisions. This is a map of what the field measures, and the gap it leaves.

AI is getting good at design, depending on what the word design is doing in the sentence.

Sometimes it means visual polish: hierarchy, rhythm, typography, enough taste that a designer would rather ship it than another tool's version. Sometimes it means implementation fidelity: given a screenshot or Figma frame, the model produces code that renders close to the reference. Sometimes it means product judgment: given an existing product and a feature request, the agent decides where the feature belongs, how it should behave, which conventions to preserve, and what has to be finished before the work is shippable.

Those are different capabilities. A tool can win a visual-preference benchmark and still put a feature in the wrong place. A model can reproduce a screenshot closely and have no ability to decide whether the screenshot was the right thing to build. A coding agent can make a feature function and still leave the product worse than it found it.

The measurement landscape covers the first two meanings and mostly skips the third. The asymmetry is easy to lose track of, because every result gets read as a verdict on "design" in general: a tool tops a visual-preference leaderboard, the takeaway becomes "this model is good at design," and a number about one capability gets spent as if it covered all three.

So this is a map of the questions: which benchmark measures which meaning of design, and which meaning still has no benchmark at all.

The short map

As of June 2026, the public measurement that bears on AI design capability clusters into five families, plus the empty slot they leave between them.

| Family | Representative work | Measures well | Mostly does not measure |
| --- | --- | --- | --- |
| Aesthetic preference | UI-Bench, WebDev Arena, Design Arena | Which generated output raters prefer visually | Product fit inside an existing app, convention adherence, whether the result works in use |
| Design-to-code fidelity | Design2Code, DesignBench, WebSight, Web2Code, WebCode2M, WebUIBench | Whether generated code matches a provided visual reference | Whether the reference was the right product decision |
| Front-end task execution | FrontendBench, WebGen-Bench, WebCoderBench | Whether code satisfies specified front-end behavior under tests | Higher-level product judgment, app-specific design conventions |
| Feature work in real codebases | SWE-bench, SWE-Lancer | Whether a change to an existing codebase resolves the issue and passes its tests | The product experience of the result |
| Informal tool comparisons | Blog posts, vendor comparisons, demos | Market signal and fast qualitative impressions | Reproducible methodology |
| Product in context (the gap) | PX-bench (our work) | Whether an agent makes the right product decisions adding a feature to a real, opinionated app | |

The first four families each score one question well: preference, fidelity to a reference, behavior under test, functional resolution. The last row is the gap, and it is where our own work sits; the rest of this piece maps why it is empty.

Aesthetic preference

The most rigorous example is UI-Bench, which has expert raters compare AI text-to-app tools pairwise: 10 tools, 30 prompts, 300 generated sites, more than 4,000 blinded judgments, ranked with a TrueSkill-derived model. The public leaderboard is live. Crowd-voted arenas run the same head-to-head format at consumer scale: LMArena's WebDev Arena and Design Arena rank models from public votes, trading expert raters for volume. Academic instruments circle the same territory: UIClip scores a UI's design quality from a screenshot, AesBench tests whether multimodal models perceive aesthetics at all, G-FOCUS judges which of two UI variants persuades better, validated against A/B tests.
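To make the pairwise format concrete, here is a minimal sketch of how head-to-head judgments can be turned into a ranking. It uses a plain Elo-style update as a simpler stand-in; it is not UI-Bench's TrueSkill-derived model, and the tool names, the starting rating, and the `k` factor are all assumptions chosen for illustration.

```python
from collections import defaultdict


def elo_rank(judgments, k=16.0, base=1000.0):
    """Rank tools from (winner, loser) pairs with a simple Elo-style update.

    This is an illustrative stand-in, not the rating model any benchmark
    named in the text actually uses.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in judgments:
        # Probability the eventual winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Surprising wins move ratings more than expected ones.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical blinded judgments: each tuple is (preferred tool, other tool).
judgments = [
    ("tool_a", "tool_b"),
    ("tool_a", "tool_c"),
    ("tool_b", "tool_c"),
    ("tool_a", "tool_b"),
]

ranking = elo_rank(judgments)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Every input here is a statement of the form "a rater preferred this output over that one." Nothing in the data describes an existing product, so the ranking can only ever be a ranking of preference.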

This family answers a question teams actually ask: given the same prompt, whose output would a professional ship? UI-Bench's client-delivery framing is especially good because it avoids the weakest version of aesthetic voting, the bare "which one do you like more?"

UI-Bench also states its own boundary clearly: it "intentionally ignores UX metrics such as load time, accessibility, or code quality." The scope is right for a visual-preference benchmark, and the boundary is the one that matters for the next wave of evaluation. All 30 of its prompts start from a blank canvas (a marketing site, a portfolio, a storefront, built fresh from a brief). With no design system to honor, no navigation structure, no existing pattern for create and edit flows, the evaluator cannot ask "does this match the app?" or "did the feature land in the right place?" There is nothing for the result to match.

That is not a flaw in UI-Bench. It is the shape of the capability it measures.

Design-to-code fidelity

The second family asks: given a reference image or design, can a model produce code that renders close to it? Design2Code is the clean example: 484 real-world webpages, screenshots converted to HTML and CSS, scored with automatic metrics and human evaluation. WebSight, Web2Code, WebCode2M, DesignBench, and WebUIBench vary the recipe (datasets at scale, sub-capabilities such as UI perception and HTML understanding), but the center of gravity is the same: a reference exists, and the model is judged on how well it implements it. That is a real capability, since a design-to-code system is only useful if it preserves what the designer decided.
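A toy version of a fidelity score shows what this kind of measurement takes as input. The sketch below assumes both the reference and the rendered output have already been reduced to equal-sized grayscale grids with values from 0 to 255. The `fidelity` function and the sample grids are hypothetical, and the metric is far cruder than the automatic metrics the benchmarks above actually use.

```python
def fidelity(reference, rendered):
    """Return 1.0 for identical grids, falling toward 0.0 as pixels diverge.

    Both arguments are lists of rows of grayscale values in the range 0..255.
    """
    same_shape = len(reference) == len(rendered) and all(
        len(ref_row) == len(out_row) for ref_row, out_row in zip(reference, rendered)
    )
    if not same_shape:
        raise ValueError("grids must have the same shape")
    diffs = [
        abs(ref_px - out_px)
        for ref_row, out_row in zip(reference, rendered)
        for ref_px, out_px in zip(ref_row, out_row)
    ]
    if not diffs:
        raise ValueError("grids must not be empty")
    # Mean absolute difference, normalized to the 0..1 range and inverted.
    return 1.0 - sum(diffs) / (255 * len(diffs))


# Hypothetical 2x3 renders: one exact copy, one with a single pixel flipped.
reference = [[0, 0, 255], [0, 255, 255]]
exact_copy = [[0, 0, 255], [0, 255, 255]]
near_copy = [[0, 0, 255], [0, 255, 0]]

scores = {
    "exact_copy": fidelity(reference, exact_copy),
    "near_copy": fidelity(reference, near_copy),
}
print(scores)
```

The function's signature is the point: it cannot run without a `reference`, and it has no way to say whether that reference was worth building.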

A reproduction task asks "did you copy it right?" Product work asks "what should be built here?" The hardest decisions in product experience happen before there is a mockup to match. If a feature request asks for saved views in a task tracker, the agent has to decide whether the editor is a modal, drawer, page, or inline form; where saved views sit relative to filters, grouping, and the command palette; which state belongs to one device and which should follow a signed-in user everywhere. None of that can be scored against a reference screenshot, because the screenshot would already have made the decisions.

Front-end task execution

A third family moves from appearance to behavior. FrontendBench runs generated code in a sandbox against predefined test scripts. WebGen-Bench sends a browser agent through each generated site to execute its test cases. WebCoderBench scores generated apps against real user requirements with automated metrics spanning code quality, content, performance, and accessibility. This work matters because visual comparison misses whether the product works: a button can look right and do nothing, a form can match a mockup and fail validation.

But testable behavior is still not the whole experience. A test can check that clicking "delete" removes an item. It cannot decide whether deletion should have required confirmation, whether the confirmation should reuse the app's existing dialog, or whether the action also belongs in the two other surfaces where the same entity appears. Those are the calls a senior designer or product engineer makes when adding a feature to an existing product, and they require product context.
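The delete example can be sketched in a few lines. Everything here is hypothetical (the `TaskList` store, the two delete variants, the always-yes confirmation callback); it is meant only to show that a behavioral check of this kind passes for both product decisions and so cannot choose between them.

```python
class TaskList:
    """Hypothetical in-memory stand-in for an app's task store."""

    def __init__(self, items):
        self.items = list(items)

    def delete(self, item):
        self.items.remove(item)


def delete_immediately(tasks, item):
    # Variant A: removes the item the moment "delete" is clicked.
    tasks.delete(item)
    return {"asked_for_confirmation": False}


def delete_with_confirmation(tasks, item, confirm=lambda: True):
    # Variant B: asks first. The stand-in dialog here always answers yes.
    if confirm():
        tasks.delete(item)
    return {"asked_for_confirmation": True}


def behavior_test(delete_fn):
    """The kind of check a test script can make: the item is gone afterwards."""
    tasks = TaskList(["draft", "review", "ship"])
    delete_fn(tasks, "review")
    return tasks.items == ["draft", "ship"]


results = {
    fn.__name__: behavior_test(fn)
    for fn in (delete_immediately, delete_with_confirmation)
}
print(results)
```

Both variants pass. Whether confirmation was the right call, and whether it should have reused the app's existing dialog, is decided somewhere this test never looks.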

Feature work in real codebases

The nearest neighbors to the missing measurement sit outside design evaluation entirely. Coding-agent benchmarks hand an agent an existing codebase and a real task: SWE-bench draws tasks from GitHub issues and scores a patch by running the project's test suite; SWE-Lancer draws them from paid freelance jobs, many user-facing, graded by end-to-end tests.

These benchmarks have the one ingredient every design benchmark above lacks: a product that already exists, with structure and conventions the change either respects or breaks. They score none of it. A patch can pass every test while the feature lands in the wrong surface, duplicates a pattern the product already had, and ships without its empty, error, and loading states. The tests confirm the feature functions; nothing examines what it did to the product around it.

So the two halves of the missing measurement already exist, in separate fields: design benchmarks have experience criteria and no product context, coding benchmarks have product context and no experience criteria. Nothing public puts the halves together.

Informal industry comparisons

The last family comes from the market rather than the research community. Agencies and reviewers compare v0, Bolt, Lovable, Replit, Figma Make, and other AI interface builders, asking sensible practical questions: which gets closest to a Figma reference, which produces cleaner code, which a designer would actually use. These are useful artifacts, not reproducible measurement: prompts vary, tools receive different amounts of human help, and the criteria shift from section to section. But the demand signal is real. Teams already need to know which agents can produce product work they would trust.

The gap

The missing measurement is narrower and more concrete than "design" in the broadest sense: whether an AI agent produces good product experience when it adds a feature to a product that already exists. Visual craft is part of it. So are:

Some of these can be partly tested in a blank-canvas prompt. Most cannot. Product fit and convention adherence are relational: "does this match?" needs something to match, and "does this fit?" needs a product structure to fit into or break from. That is the methodological gap underneath the capability gap.

This is the gap our benchmark, PX-bench, is built for, and its shape follows from the diagnosis: extension, not generation. The agent receives a feature request and a host application, a complete product with its own design system, routes, components, data model, and accumulated idiosyncrasies. The brief says what the user needs without prescribing the UI pattern, so the agent has to read the product before it can build into it. That makes questions scoreable that no blank canvas can pose:

None of these are exotic traps; a team reviewing an AI-generated pull request would ask them by reflex. They are also the questions that decide adoption. Companies are not mostly asking agents to design greenfield landing pages. They are asking agents to change products that already exist, without making them more fragmented, less accessible, or harder to understand.

Different things, not worse things

Existing benchmarks measure different things, each genuinely useful for its question. If you are choosing an AI website builder for first-pass visual concepts, UI-Bench and the arenas answer your question. If you are building a Figma-to-code pipeline, Design2Code-style fidelity does. If you are testing whether generated components work, FrontendBench and its relatives do. If you are evaluating coding agents on functional resolution, SWE-bench and its descendants do. If you want to know whether an agent can take a product ticket, enter an unfamiliar codebase, preserve the product's conventions, and ship a complete feature, each of those answers part of the question, and none answers it whole.

Product-in-context evaluation is the next layer, beside the others rather than replacing them. PX-bench is our attempt at it; it should not be the only one.

That is the open question for the field now: not whether AI design capability should be measured, but which design capability a given number is actually about.


Version 1.0 — June 2026. Reach us at hello@chordio.com.