Mapping the AI design evaluation landscape

More than a dozen public benchmarks now grade AI on design, and most say it's getting good. They don't measure the same thing when they say it.

Sometimes it means visual polish: hierarchy, rhythm, typography, enough taste that a designer would rather ship it than another tool's version. Sometimes it means implementation fidelity: given a screenshot or Figma frame, the model produces code that renders close to the reference. Sometimes it means product judgment: given an existing product and a feature request, the agent decides where the feature belongs, how it should behave, which conventions to preserve, and what has to be finished before the work is shippable.

Those are different capabilities. A tool can win a visual-preference benchmark and still put a feature in the wrong place. A model can reproduce a screenshot closely and have no ability to decide whether the screenshot was the right thing to build. A coding agent can make a feature function and still leave the product worse than it found it.

The three meanings are measured by different families of benchmarks, and a score rarely says which family it came from, so every result gets read as a verdict on "design" in general. A tool tops a visual-preference leaderboard, the takeaway becomes "this model is good at design," and a number about one capability gets spent as if it covered all three.

So we mapped the field: which family measures which meaning of design, which meanings it leaves to the others, and which to reach for when.

The short map

As of June 2026, we group the public measurement that bears on AI design capability into six families.

Family	Representative work	Measures well	Mostly does not measure
Aesthetic preference	UI-Bench, WebDev Arena, Design Arena	Which generated output raters prefer visually	Product fit inside an existing app, convention adherence, whether the result works in use
Design-to-code fidelity	Design2Code, DesignBench, WebSight, Web2Code, WebCode2M, WebUIBench	Whether generated code matches a provided visual reference	Whether the reference was the right product decision
Front-end task execution	FrontendBench, WebGen-Bench, WebCoderBench	Whether code satisfies specified front-end behavior under tests	Higher-level product judgment, app-specific design conventions
Feature work in real codebases	SWE-bench, SWE-Lancer	Whether a change to an existing codebase resolves the issue and passes its tests	The product experience of the result
Informal tool comparisons	Blog posts, vendor comparisons, demos	Market signal and fast qualitative impressions	Reproducible methodology
Product in context	PX-bench (our work)	Whether an agent makes the right product decisions adding a feature to a real, opinionated app	Aesthetic taste, fidelity to a given mockup, general coding ability

The first four families each score one question well: preference, fidelity to a reference, behavior under test, functional resolution. The fifth is market signal rather than measurement. The sixth is ours and the youngest; the rest of this piece walks the families in turn.

Aesthetic preference

The most rigorous example is UI-Bench, which has expert raters compare AI text-to-app tools pairwise: 10 tools, 30 prompts, 300 generated sites, more than 4,000 blinded judgments, ranked with a TrueSkill-derived model. The public leaderboard is live. Crowd-voted arenas run the same head-to-head format at consumer scale: LMArena's WebDev Arena and Design Arena rank models from public votes, trading expert raters for volume. Academic instruments circle the same territory: UIClip scores a UI's design quality from a screenshot, AesBench tests whether multimodal models perceive aesthetics at all, G-FOCUS judges which of two UI variants persuades better, validated against A/B tests.
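To make the step from pairwise judgments to a ranking concrete, here is a minimal sketch. It swaps in a plain Elo update, not the TrueSkill-derived model UI-Bench uses, and the tool names, the K-factor, and the starting rating are illustrative assumptions rather than anything taken from a real leaderboard.

```python
from collections import defaultdict


def elo_ratings(judgments, k=32.0, base=1500.0):
    """Return a rating per tool from ordered (winner, loser) judgments.

    Plain Elo, used here only as a simple stand-in for a
    TrueSkill-style model. k and base are assumed values.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in judgments:
        # Probability the winner was expected to win, given current ratings.
        gap = (ratings[loser] - ratings[winner]) / 400.0
        expected_win = 1.0 / (1.0 + 10 ** gap)
        # An unexpected win moves both ratings more than an expected one.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blinded judgments: each tuple is (preferred tool, other tool).
judgments = [
    ("tool_a", "tool_b"),
    ("tool_a", "tool_c"),
    ("tool_b", "tool_c"),
    ("tool_a", "tool_b"),
]

ratings = elo_ratings(judgments)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Whatever the rating model, the input is only "which output did the rater prefer," so the resulting order carries a preference signal and nothing about fit inside an existing product.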

This family answers a question teams actually ask: given the same prompt, whose output would a professional ship? UI-Bench's client-delivery framing is especially good because it avoids the weakest version of aesthetic voting, the bare "which one do you like more?"

UI-Bench also states its own boundary clearly: it "intentionally ignores UX metrics such as load time, accessibility, or code quality." The scope is right for a visual-preference benchmark, and the boundary is the one that matters for the next wave of evaluation. All 30 of its prompts start from a blank canvas (a marketing site, a portfolio, a storefront, built fresh from a brief). With no design system to honor, no navigation structure, no existing pattern for create and edit flows, the evaluator cannot ask "does this match the app?" or "did the feature land in the right place?" There is nothing for the result to match.

That is not a flaw in UI-Bench. It is the shape of the capability it measures.

Design-to-code fidelity

The second family asks: given a reference image or design, can a model produce code that renders close to it? Design2Code is the clean example: 484 real-world webpages, screenshots converted to HTML and CSS, scored with automatic metrics and human evaluation. WebSight, Web2Code, WebCode2M, DesignBench, and WebUIBench vary the recipe (datasets at scale, sub-capabilities such as UI perception and HTML understanding), but the center of gravity is the same. A reference exists, and the model is judged on how well it implements it. That is a real capability, since a design-to-code system is only useful if it preserves what the designer decided.
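As a rough illustration of what "renders close to the reference" can mean numerically, the sketch below compares two tiny grayscale grids pixel by pixel. It is a deliberately simple stand-in, not one of the metrics Design2Code or its relatives use, and the function name and sample grids are hypothetical.

```python
def pixel_similarity(reference, rendered):
    """Similarity in [0, 1] between two equally sized grayscale grids.

    Each grid is a list of rows of integers in 0..255. A score of 1.0
    means the grids are identical. Illustrative only: real fidelity
    metrics are considerably richer than this.
    """
    same_rows = len(reference) == len(rendered)
    if not same_rows or any(len(a) != len(b) for a, b in zip(reference, rendered)):
        raise ValueError("grids must have the same dimensions")
    count = sum(len(row) for row in reference)
    if count == 0:
        raise ValueError("grids must not be empty")
    # Total absolute difference across every pixel pair.
    total = sum(
        abs(p - q)
        for row_a, row_b in zip(reference, rendered)
        for p, q in zip(row_a, row_b)
    )
    return 1.0 - total / (255.0 * count)


# Hypothetical 2x2 "screenshots": the second differs in one cell.
reference = [[0, 255], [255, 0]]
one_cell_off = [[0, 255], [255, 255]]

exact_score = pixel_similarity(reference, reference)
partial_score = pixel_similarity(reference, one_cell_off)
```

Any metric of this shape needs the reference as an input, so it can only report how well the reference was copied.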

A reproduction task asks "did you copy it right?" Product work asks "what should be built here?" The hardest decisions in product experience happen before there is a mockup to match. If a feature request asks for saved views in a task tracker, the agent has to decide whether the editor is a modal, drawer, page, or inline form; where saved views sit relative to filters, grouping, and the command palette; which state belongs to one device and which should follow a signed-in user everywhere. None of that can be scored against a reference screenshot, because the screenshot would already have made the decisions.

Front-end task execution

A third family moves from appearance to behavior. FrontendBench runs generated code in a sandbox against predefined test scripts. WebGen-Bench sends a browser agent through each generated site to execute its test cases. WebCoderBench scores generated apps against real user requirements with automated metrics spanning code quality, content, performance, and accessibility. This work matters because visual comparison misses whether the product works. A button can look right and do nothing; a form can match a mockup and fail validation.

But testable behavior is still not the whole experience. A test can check that clicking "delete" removes an item. It cannot decide whether deletion should have required confirmation, whether the confirmation should reuse the app's existing dialog, or whether the action also belongs in the two other surfaces where the same entity appears. Those are the calls a senior designer or product engineer makes when adding a feature to an existing product, and they require product context.
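A minimal sketch of the delete example shows where the boundary falls. The `TaskList` class is a hypothetical in-memory stand-in for an app's data layer, not code from any benchmark named here.

```python
class TaskList:
    """Hypothetical in-memory stand-in for an app's task collection."""

    def __init__(self, titles):
        # Assign sequential ids starting at 1.
        self._items = {i: title for i, title in enumerate(titles, start=1)}

    def delete(self, item_id):
        # Raises KeyError if the id does not exist.
        self._items.pop(item_id)

    def titles(self):
        return list(self._items.values())


def test_delete_removes_item():
    tasks = TaskList(["Write brief", "Review mockup"])
    tasks.delete(1)
    # The behavior under test: the deleted item is gone, the other remains.
    assert tasks.titles() == ["Review mockup"]
    # Nothing here can assert whether a confirmation should have been
    # required, or whether it should reuse the app's existing dialog.


test_delete_removes_item()
remaining = TaskList(["Write brief", "Review mockup"])
remaining.delete(2)
```

The test passes whether or not deletion was the right interaction to ship, which is the gap the paragraph above describes.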

Feature work in real codebases

The nearest neighbors to product-in-context measurement sit outside design evaluation entirely. Coding-agent benchmarks hand an agent an existing codebase and a real task: SWE-bench draws tasks from GitHub issues and scores a patch by running the project's test suite; SWE-Lancer draws them from paid freelance jobs, many user-facing, graded by end-to-end tests.

These benchmarks have the one ingredient every design benchmark above lacks: a product that already exists, with structure and conventions the change either respects or breaks. They score none of it. A patch can pass every test while the feature lands in the wrong surface, duplicates a pattern the product already had, and ships without its empty, error, and loading states. The tests confirm the feature functions; nothing examines what it did to the product around it.

So the two halves of product-in-context measurement grew up in separate fields. Design benchmarks have experience criteria and no product context; coding benchmarks have product context and no experience criteria. The sixth family puts the halves together.

Informal tool comparisons

The fifth family comes from the market rather than the research community. Agencies and reviewers compare v0, Bolt, Lovable, Replit, Figma Make, and other AI interface builders, asking sensible practical questions: which gets closest to a Figma reference, which produces cleaner code, which a designer would use. These are useful artifacts, not reproducible measurement. Prompts vary, tools receive different amounts of human help, and the criteria shift from section to section. But the demand signal is real. Teams already need to know which agents can produce product work they would trust.

Product in context

This family measures something narrower and more concrete than "design" in the broadest sense: whether an AI agent produces good product experience when it adds a feature to a product that already exists. Visual craft is part of it. So are:

Product fit. Did the feature land in the right structural place, using the right pattern for the job?
Convention adherence. Did the agent reuse the app's components, tokens, naming, and behavioral conventions?
Pathway completeness. Did it finish the cancel, back, undo, error-recovery, loading, empty, and pending states?
Content and language. Do labels, empty states, and error messages use the product's own terms and help the user move forward?
Resilience. Does the result hold up with long content, small viewports, failed requests, and realistic data volume?
Accessibility. Can people use the feature with a keyboard, assistive technology, and sufficient contrast? Scanners catch pieces of this in isolation; nothing checks it where new features land.

Some of these can be partly tested in a blank-canvas prompt. Most cannot. Product fit and convention adherence are relational: "does this match?" needs something to match, and "does this fit?" needs a product structure to fit into or break from. That is why this family needs a different task shape from the others.

PX-bench, our benchmark, is built to exactly that shape: extension into a product that already exists, rather than generation from a blank canvas. The agent receives a feature request and a host application, a complete product with its own design system, routes, components, data model, and accumulated idiosyncrasies. The brief says what the user needs without prescribing the UI pattern, so the agent has to read the product before it can build into it. That makes questions scoreable that no blank canvas can pose:

Did the agent discover the existing create-and-edit pattern, or invent a parallel one?
Did it extend every surface where the feature belongs, or only the first one it saw?
Did it store data in the layer that matches the requirement, or copy a nearby pattern meant for something else?
Did it preserve the product's vocabulary, or introduce a new term for an existing concept?
Did it handle the edge states the product already handles elsewhere?

None of these are exotic traps; a team reviewing an AI-generated pull request would ask them by reflex. They are also the questions that decide adoption. Companies are not mostly asking agents to design greenfield landing pages. They are asking agents to change products that already exist, without making them more fragmented, less accessible, or harder to understand.

Each benchmark is a different lens

Existing benchmarks are different lenses on capability, each genuinely useful for its question. If you are choosing an AI website builder for first-pass visual concepts, UI-Bench and the arenas answer your question. If you are building a Figma-to-code pipeline, Design2Code-style fidelity does. If you are testing whether generated components work, FrontendBench and its relatives do. If you are evaluating coding agents on functional resolution, SWE-bench and its descendants do. If you want to know whether an agent can take a product ticket, enter an unfamiliar codebase, preserve the product's conventions, and ship a complete feature, product-in-context evaluation does.

Product-in-context evaluation sits beside the others rather than replacing them. PX-bench is our attempt at it; it should not be the only one.

The field has settled that AI design capability should be measured. The open question now is which design capability a given number is about.

Version 1.0 — June 2026. Reach us at hello@chordio.com.