Chordio
Real PX-bench grid result — GPT-5.5, one scenario, 5 epochs. Shown as a demonstration of the report format.
Product-experience performance report · Confidential

GPT-5.5

Scenario: Task tracker / add saved views · Evaluated 2026-06-17 · 5 epochs · Build cost 766k tokens · ~$1.90

GPT-5.5 builds the right feature and makes it robust, then reaches for its own components instead of the ones the app already ships.

Its biggest gap is product fit: it builds a custom popover for create and edit instead of reusing the app's drawer, and it skips a real error path on API failure. Accessibility and content trail where the feature leans on its own UI rather than the app's. Intent, resilience, and code conventions are at or near the comparative reference.

82 / 100
± 4 across 5 epochs
Strong
GPT-5.5 · Comparative SOTA model

Executive summary

The 60-second read. Everything below is evidence for these claims.

The read

GPT-5.5 clears the hard part. All four saved-view operations work, persistence is server-backed and account-bound, and the feature is responsive and survives an API failure without taking the page down. On intent fidelity (99) and resilience (92) it matches the comparative reference; visual craft (86) and code conventions (85) sit just behind it.

The leaks are about fitting into this app and finishing the unhappy paths, not capability. For create and edit, GPT-5.5 builds its own popover instead of reusing the app's Drawer (product fit 73 vs the reference's 97), and on a 500 the saved-views list falls through to the empty state with no inline error or retry (error-state ≈ 2/100, every epoch). That same custom container is why keyboard accessibility is the most run-to-run-volatile score.

Highest-leverage fix: bias the agent to import the app's components before building its own. In 1 of the 5 epochs it did reuse the drawer and product fit jumped to 97, so this is reachable, not a ceiling.

Capability profile

GPT-5.5 vs a comparative SOTA model (│)

Cat-1 Intent fidelity: 99
Cat-2 Product fit: 73
Cat-3 Visual craft: 86
Cat-4 Convention adherence: 85
Cat-5 Pathway completeness: 73
Cat-6 Content & language: 73
Cat-7 Resilience: 92
Cat-8 Accessibility: 75

Recommended next investments

1. Import before build. Feed the host app's component inventory in and route create/edit through the existing drawer. Moves product fit and inherits the focus trap for free.
2. Make the error path first-class. Require an explicit error / empty / loading branch that reuses the app's InlineError. Closes the sharpest, most consistent defect.
3. AA contrast + keyboard gate. Run axe and a keyboard pass before the agent reports done, and lift muted text to the AA token. Steadies the most volatile score.

Capability profile

Eight categories, the locked PX-bench taxonomy. Each score is the mean across 5 epochs, shown against a comparative SOTA model run on the same scenario and harness (the tick).

Cat-1 Intent fidelity: 99. Did it build what was asked, working in the happy path?
Cat-2 Product fit: 73. Does the feature attach to the app in the right place and shape?
Cat-3 Visual craft: 86. Is the surface composed into a clear visual hierarchy?
Cat-4 Convention adherence: 85. Does it work in the app's house style: reuse, tokens, naming?
Cat-5 Pathway completeness: 73. Are all paths and states present and reachable?
Cat-6 Content & language: 73. Are the words right: labels, errors, empty copy, voice?
Cat-7 Resilience: 92. Does it hold together under long content, small screens, failure?
Cat-8 Accessibility: 75. Keyboard, contrast, labels, axe violations.

Bar = GPT-5.5 (mean of 5 epochs). │ = comparative SOTA model. Color reflects band: green ≥ 75, amber 50–74, red < 50.
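
The banding rule in the legend can be written down as a small function. This is a minimal sketch, not the report's own implementation: the name `bandFor` and the `Band` type are hypothetical, and it assumes scores are compared directly against the thresholds without any rounding step.

```typescript
type Band = "green" | "amber" | "red";

// Thresholds from the legend: green >= 75, amber 50-74, red < 50.
// Assumption: no rounding is applied before the comparison.
function bandFor(score: number): Band {
  if (score >= 75) return "green";
  if (score >= 50) return "amber";
  return "red";
}

// Category means as reported above.
const categoryMeans: Record<string, number> = {
  "Intent fidelity": 99,
  "Product fit": 73,
  "Visual craft": 86,
  "Convention adherence": 85,
  "Pathway completeness": 73,
  "Content & language": 73,
  "Resilience": 92,
  "Accessibility": 75,
};

for (const [category, score] of Object.entries(categoryMeans)) {
  console.log(`${category}: ${score} (${bandFor(score)})`);
}
```

Under this reading, the three categories at 73 fall in the amber band and accessibility at 75 sits exactly on the green threshold.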

Findings

The patterns behind the scores, ranked by user-facing impact. Each pairs a concrete failure with the fix.

The pattern

For create and edit, GPT-5.5 builds a custom absolutely-positioned popover (a <div role="dialog"> in SavedViewsControl.tsx) rather than importing the app's right-side <Drawer> — the exact container new-task and edit-task use. It does reliably reuse Toast (for the undo) and ConfirmDialog (for delete), but it rebuilds the container, the loading line, and the inline error. That parallel container is the single biggest reason product fit lands at 73 against the comparative reference's 97.

Why it matters

A hand-rolled container drifts from the rest of the app and is the visible 'this feature feels different' tell. It also drags keyboard accessibility down, because the custom popover doesn't inherit the drawer's focus trap (see the accessibility finding). In 1 of the 5 epochs GPT-5.5 did reuse the drawer and product fit jumped to 97, so this is reachable, not a capability ceiling.

Evidence

Create / edit live in a custom popover instead of the app's drawer:

// app/features/tasks/SavedViewsControl.tsx
- <div role="dialog" className="absolute ..."> // custom popover, no focus trap
+ import { Drawer } from '@/components/Drawer' // app uses Drawer for new/edit task
+ <Drawer open={open} onOpenChange={setOpen}> ... </Drawer>
Recommendation

Feed the host app's component inventory into the agent and bias it toward import-before-build. Specifically, route create/edit through the existing <Drawer>; that one change moves product fit and inherits the focus trap for free.

Rubric items
R-2.1 · R-4.1 · R-8.2 · G-2 · G-15
Confidence

High. Component reuse is a deterministic check, and all three scorers independently flagged the create/edit container as a custom popover rather than the app's drawer.
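
To illustrate why component reuse can be checked deterministically, here is a rough sketch of such a check. It is not the PX-bench script: `checkReuse`, `ReuseReport`, and the inventory map are hypothetical, only the `@/components/Drawer` path comes from the evidence above (the other import paths are assumed), and the string and regex matching is deliberately naive where a real check would more likely parse the AST.

```typescript
interface ReuseReport {
  reused: string[];          // inventory components the file imports
  missing: string[];         // inventory components the file does not import
  handRolledDialog: boolean; // a <div role="dialog"> built in place of an app container
}

// Hypothetical inventory: component name -> import path.
// Only the Drawer path appears in the report; the rest are assumptions.
const inventory: Record<string, string> = {
  Drawer: "@/components/Drawer",
  Toast: "@/components/Toast",
  ConfirmDialog: "@/components/ConfirmDialog",
  InlineError: "@/components/InlineError",
};

function checkReuse(source: string, inv: Record<string, string>): ReuseReport {
  const reused: string[] = [];
  const missing: string[] = [];
  for (const [name, path] of Object.entries(inv)) {
    // Naive match on the import specifier, in either quote style.
    const imported =
      source.includes(`from '${path}'`) || source.includes(`from "${path}"`);
    (imported ? reused : missing).push(name);
  }
  // Flags a custom dialog container such as <div role="dialog" ...>.
  const handRolledDialog = /<div[^>]*role=["']dialog["']/.test(source);
  return { reused, missing, handRolledDialog };
}

// Example input shaped like the pattern described in this finding.
const sample = `
import { Toast } from '@/components/Toast'
import { ConfirmDialog } from '@/components/ConfirmDialog'
export function SavedViewsControl() {
  return <div role="dialog" className="absolute">...</div>
}
`;

console.log(checkReuse(sample, inventory));
```

On the sample input the check reports Toast and ConfirmDialog as reused, Drawer and InlineError as missing, and a hand-rolled dialog, which is the same shape as the pattern this finding describes.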

The pattern

Saved views load through a failable useQuery, but the code never handles isError — it does savedViews = data ?? [], so a 500 produces an empty array and the list renders the empty state, 'No saved views yet.' Where an error message does appear it is a bare line of text with no retry and not the app's InlineError. This is the sharpest defect in the run: error-state scored about 2 out of 100, in every one of the 5 epochs.

Why it matters

A failed load looks identical to 'you have no saved views' — the worst kind of failure, because it quietly implies data loss and offers no way forward. Real users hit 500s eventually, and right now the feature has no honest answer for them.

Evidence

On a 500 the list falls through to the empty state instead of an error:

// app/features/tasks/SavedViewsControl.tsx
- const savedViews = savedViewsQuery.data ?? [] // 500 → [] → "No saved views yet"
+ if (savedViewsQuery.isError)
+ return <InlineError onRetry={() => savedViewsQuery.refetch()} />
Recommendation

Handle isError explicitly, reuse the app's InlineError component, and add a retry. Make an explicit empty / loading / error branch a required step before the agent reports done.
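
One way to make the explicit branch hard to skip is to resolve the query into a single tagged state before rendering. The sketch below is framework-free and hedged: `QueryLike`, `ListState`, and `resolveListState` are hypothetical names, and `QueryLike` only mirrors the `isError` and `data` fields quoted above plus an assumed `isLoading` flag.

```typescript
// Hypothetical stand-in for the fields of a query result used here.
interface QueryLike<T> {
  isLoading: boolean;
  isError: boolean;
  data?: T[];
}

type ListState<T> =
  | { kind: "loading" }
  | { kind: "error" }
  | { kind: "empty" }
  | { kind: "ready"; items: T[] };

function resolveListState<T>(query: QueryLike<T>): ListState<T> {
  // Check the error first, so a failed load can never fall through to "empty".
  if (query.isError) return { kind: "error" };
  if (query.isLoading || query.data === undefined) return { kind: "loading" };
  if (query.data.length === 0) return { kind: "empty" };
  return { kind: "ready", items: query.data };
}

// A 500 now resolves to "error" rather than an empty list.
console.log(resolveListState<string>({ isLoading: false, isError: true }).kind);
// A successful load with no rows is the only path to "empty".
console.log(resolveListState<string>({ isLoading: false, isError: false, data: [] }).kind);
```

A component can then switch on `kind`, rendering the app's InlineError with a retry for `error` and the 'No saved views yet.' copy only for `empty`.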

Rubric items
R-5.6 · R-7.3 · G-8
Confidence

High. All three scorers, every epoch — the most consistent finding in the run. The page itself survives the 500 (resilience R-7.3 = 99); it just shows no error UI.

Category detail

Every rubric item, what it tested, and its score (mean across 5 epochs). Each AUTO-AGENT item lists the three independent scorers (Anthropic, OpenAI, Google), with each provider's mean for the item, representative reasoning quoted from one epoch, and where they agreed or split.

How this was measured

The eval

Scenario: Add a “saved views” feature to a working 8-screen task tracker with its own design system.
What's tested: Whether the agent designs a complete product experience: intent, fit, craft, conventions, pathways, content, resilience, accessibility.
Scoring: Three tiers. 4 deterministic script checks, 26 cross-provider LLM-judge items, 2 human-panel items.
Cross-provider: Every judge item is scored independently by Anthropic, OpenAI, and Google models. The reported score is their median; we publish the spread.
Comparison: A comparative SOTA model run on the same scenario and harness (6 epochs), shown as the reference tick on each bar. Shown without naming the specific model.
Samples: 5 epochs of GPT-5.5; category and overall scores are means, with the spread shown as ± on the overall.
Build cost: 722k in + 44k out tokens, ~$1.90 per run. The agent's own spend producing the feature, not the scorers'.
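
The two aggregation steps described above (median across the three scorers, then mean across epochs) can be sketched as follows. The numbers are made up for illustration and are not taken from this run, and the helper names are hypothetical. The report does not say how the ± spread is computed, so the sample standard deviation used here is an assumption.

```typescript
// One judge item in one epoch: [Anthropic, OpenAI, Google] scores.
type ScorerTriple = [number, number, number];

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumption: spread as sample standard deviation (n - 1 denominator).
function sampleStdDev(values: number[]): number {
  const m = mean(values);
  const variance =
    values.reduce((sum, v) => sum + (v - m) ** 2, 0) / (values.length - 1);
  return Math.sqrt(variance);
}

// Illustrative data only: three epochs of one judge item.
const itemByEpoch: ScorerTriple[] = [
  [80, 85, 90],
  [70, 75, 95],
  [88, 82, 84],
];

// Step 1: per-epoch item score is the median of the three scorers.
const perEpoch = itemByEpoch.map((triple) => median(triple));
// Step 2: the reported score is the mean across epochs.
const reported = mean(perEpoch);

console.log(perEpoch, reported.toFixed(1), sampleStdDev(perEpoch).toFixed(1));
```

Taking the median first keeps one outlying scorer (the 95 in the second epoch) from pulling the item score, while the mean across epochs still reflects run-to-run variation.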

Coverage & confidence

Items scored: 29 of 32
Pending human: 2 (R-2.4, R-4.6)
No-signal: 1 (R-4.4), shipped a text loader, no skeleton to judge
Overall spread: ± 4 points across 5 epochs

No-signal items are excluded from the mean, not counted as zero. Each category shows its own coverage so a partial category never reads as complete.
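
The exclusion rule can be sketched as a category mean that skips unscored items and reports coverage alongside the score. This is an illustration under stated assumptions: `categoryResult` and its types are hypothetical, the item scores are invented, `R-4.2` is a placeholder ID that does not appear in this report, and treating pending-human items the same way as no-signal items is an assumption consistent with the 29 of 32 count rather than something the report states.

```typescript
// null marks an item without a score (no-signal or pending human review).
type ItemScore = number | null;

interface CategoryResult {
  mean: number | null; // null when nothing in the category was scored
  scored: number;      // items that contributed to the mean
  total: number;       // items defined for the category
}

function categoryResult(items: Record<string, ItemScore>): CategoryResult {
  const all = Object.values(items);
  const scored = all.filter((v): v is number => v !== null);
  const mean =
    scored.length === 0
      ? null
      : scored.reduce((sum, v) => sum + v, 0) / scored.length;
  return { mean, scored: scored.length, total: all.length };
}

// Invented scores; R-4.2 is a placeholder ID for illustration.
const conventionItems: Record<string, ItemScore> = {
  "R-4.1": 90,
  "R-4.2": 80,
  "R-4.4": null, // no-signal
  "R-4.6": null, // pending human
};

const result = categoryResult(conventionItems);
console.log(`${result.mean} (${result.scored} of ${result.total} items scored)`);
```

With these invented inputs the mean is 85 over 2 of 4 items; counting the two unscored items as zero would instead give 42.5, which is the distortion the exclusion rule avoids.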

Ground truth set by

Senior product designer A · Senior product designer B · Senior product designer C

Human-panel items (R-2.4, R-4.6) reported with inter-rater agreement; a scorer earns its place by matching designer judgment at ≥ 0.7.

Provenance & reproducibility

Run set: reportable-gpt-5-5 · 2026-06-17
Subject model: openai/gpt-5.5-2026-04-23
Scorers: Anthropic + OpenAI + Google, cross-provider consensus
Scenario: 01-task-tracker / add-saved-views
Harness: Inspect AI (AISI), sandboxed container

Every score traces to its run, rubric version, and scorer version. The harness, scenario, and rubric are pinned per run; the run is reproducible.

What this does and doesn't tell you

· One scenario. This measures product experience on a feature-add to an existing app. It is one slice of capability, not a verdict on everything GPT-5.5 builds.
· Out of scope. We do not score speed, animation polish, or end-to-end team workflows.
· Coverage conditioning. One item reports no-signal: GPT-5.5 shipped a text loading line, not the app's skeleton, so there was nothing to score for loading-convention match. It is excluded from the mean, not zeroed.
· Demonstration. This page shows a real grid result in the report format we share with customers. The comparison reference is a frontier model run on the same scenario, shown without naming it.
· Phase-0 footnote. Visual craft is scored from code in this run; render-based scoring lands next phase.