Image description: A laptop showing an AI assistant generating alt text for a photograph, with bounding boxes around detected objects — the visual marker for AI-and-alt-text reporting.

Engineering primer · AI + alt text

AI and alt text: where the technology actually delivers in 2026

An engineering primer on the state of AI-generated alt text in 2026. We tested GPT-4o, Claude 3.7 Sonnet, Gemini 2.0, Llama-Vision-3, and Pixtral against four image categories and documented exactly where the technology delivers and where it still fabricates.

Vision-language models can now describe an informative photo with a fluency that would have looked impossible in 2022. They still hallucinate text on screenshots, mis-identify visibly disabled subjects and their mobility aids, and invent brand names that were never in the frame. This primer maps the line between the two.

5 vision models benchmarked · 4 image categories tested · approx. 62% first-pass usability ceiling
11 min read · Updated May 2026

1. The shape of the problem in 2026

The requirement in WCAG 2.2 Success Criterion 1.1.1 has not changed in substance since WCAG 2.0 was published in 2008. Every non-text image that conveys meaning needs a text alternative; every decorative image needs to be marked decorative. What has changed, between the version of this article we would have written in 2022 and the version we are writing in May 2026, is that generating a plausible-sounding sentence from a pixel array is no longer the bottleneck. Generating a sentence that is correct, contextually appropriate, and free of fabricated detail still is.

The shift matters because most production CMS platforms in 2026 ship an “auto alt-text” button. The button calls a vision-language model behind a vendor API and writes the result straight into the alt attribute. The accessibility consequence is direct: if the button is right, an image that previously shipped with an empty alt is now described to a screen-reader user. If the button is wrong, the screen-reader user receives a confidently worded sentence about something that is not in the image.

This primer is for the engineers who own that button. It surveys the five vision models that account for the overwhelming majority of vendor integrations in 2026, tests each one against the four canonical image categories, documents the recurring failure modes, and ends with a hybrid workflow that we believe is the only defensible default until the underlying behaviour shifts.

approx. 41% of images on a representative crawl of 500 large US e-commerce pages ship with a missing or empty alt attribute (DW internal scan, March 2026).
approx. 18% of remaining alts are auto-generated filenames or default phrases like “image” or “product”: present, but useless to a screen-reader user.
approx. 11% of alts are AI-generated and unedited, identifiable by their characteristic three-clause hedged sentence structure (DW internal classifier).

What we mean by “delivers”

An AI alt-text candidate “delivers” if a human reviewer would accept it as is, or accept it with a one-token edit. Anything requiring a rewrite is a miss. This is a stricter bar than the academic CIDEr or BLEU metric a model might cite — it is the bar that a CMS button has to clear.
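
The “one-token edit” bar can be made mechanical. The sketch below is one possible reading of it, not the procedure used for this primer: it assumes tokens are whitespace-separated words and that an insertion, deletion, or substitution each count as one edit. The function names `token_edit_distance` and `delivers` are illustrative.

```python
def token_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over whitespace-separated tokens."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from empty prefix of x to each prefix of y
    for i, tok_x in enumerate(x, start=1):
        curr = [i]
        for j, tok_y in enumerate(y, start=1):
            cost = 0 if tok_x == tok_y else 1
            curr.append(min(prev[j] + 1,          # delete tok_x
                            curr[j - 1] + 1,      # insert tok_y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def delivers(candidate: str, final_text: str) -> bool:
    """True if the reviewer kept the candidate as is or changed at most one token."""
    return token_edit_distance(candidate, final_text) <= 1
```

Comparing the AI candidate with the text the editor finally published gives a per-image hit or miss without asking reviewers for an extra label.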

2. The model landscape in 2026

Five vision-language models dominate the integrations we see in production: two closed frontier models (GPT-4o vision, Claude 3.7 Sonnet vision), one closed model used heavily inside Google products and downstream Workspace add-ons (Gemini 2.0), and two open-weights models that ship in self-hosted CMS plugins where data residency rules out the closed APIs (Llama-Vision-3, Pixtral). Each has a distinct profile on the four-category test below.

The model cards below capture the practical behaviour we observed across the 600-image test set in March and April 2026, not the marketing claims. Costs are per image at typical resolution as of May 2026 and exclude vendor markup.

GPT-4o vision
OpenAI · gpt-4o (May 2026 build)
Most common closed-API default in mid-market CMS
Strength: Informative photos, scene composition
Weakness: Hallucinates on-screen text
Approx. cost / image: 0.004 USD

Claude 3.7 Sonnet vision
Anthropic · claude-3-7-sonnet
Common in enterprise CMS where editorial review is part of the workflow
Strength: Refuses to invent text it cannot read; charts
Weakness: Verbose; needs explicit length prompt
Approx. cost / image: 0.005 USD

Gemini 2.0
Google · gemini-2.0-pro vision mode
Default in Workspace add-ons, Google-adjacent CMS
Strength: Screenshots, UI element identification
Weakness: Mis-identifies mobility aids, fabricates brand names
Approx. cost / image: 0.003 USD

Llama-Vision-3
Meta · 90B vision, open weights
Self-hosted CMS plugins, EU data-residency deployments
Strength: Photos, decorative classification
Weakness: Charts; will guess at axis values
Approx. cost / image: self-hosted inference cost

Pixtral
Mistral · pixtral-large, open weights
European self-hosted; smaller-model plugins
Strength: Concise outputs; respects length budget
Weakness: Lower scene-composition recall on complex photos
Approx. cost / image: self-hosted inference cost

3. The four-category test

WCAG decision-tree guidance for non-text content collapses, in practice, to four categories: informative photos (a person, a scene, an object that carries meaning); charts and diagrams (a bar chart, a flow diagram, an annotated map); screenshots and UI (a dashboard, an error state, a settings panel); and decorative (a hero gradient, a divider, a stock-illustration filler). We assembled a 600-image test set sampling 150 images per category from disability-news contexts, charity reports, software documentation, and editorial filler. Each model produced one alt candidate per image; three human reviewers labelled each candidate as accept, edit, or reject. The matrix below reports the accept rate.

The numbers are not designed to crown a winner. They are designed to tell you which category is the riskiest place to ship an AI candidate without review.
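
For teams that want to reproduce an accept rate on their own images, the aggregation can be sketched as below. The primer does not state how the three reviewer labels were combined, so the majority-vote rule here, and the fallback to “edit” when all three reviewers disagree, are assumptions; `consensus` and `accept_rate` are illustrative names.

```python
from collections import Counter


def consensus(labels):
    """Majority label among reviewers; falls back to 'edit' on a three-way split (assumption)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "edit"


def accept_rate(reviews):
    """reviews: one tuple of reviewer labels ('accept' | 'edit' | 'reject') per image."""
    if not reviews:
        raise ValueError("no reviews to aggregate")
    accepted = sum(1 for labels in reviews if consensus(labels) == "accept")
    return accepted / len(reviews)
```

Running `accept_rate` once per model and category yields one cell of a matrix like the one below.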

Model | Informative photos | Charts & diagrams | Screenshots & UI | Decorative (correctly null)
GPT-4o vision | 71% | 34% | 52% | 41%
Claude 3.7 Sonnet vision | 68% | 49% | 61% | 58%
Gemini 2.0 | 66% | 38% | 64% | 44%
Llama-Vision-3 (90B) | 62% | 21% | 47% | 53%
Pixtral large | 57% | 26% | 42% | 48%

The two columns to watch

Charts & diagrams is the weakest column for every model. Decorative (correctly null) is the second weakest for the three closed models; for Llama-Vision-3 and Pixtral, screenshots & UI scores lower still. The chart column fails because the model invents values it cannot read; the decorative column fails because the model writes a sentence when the correct answer is silence. Both errors are invisible to a sighted reviewer who only spot-checks the photo column.
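
The per-model ranking can be read straight off the matrix. The sketch below copies the accept rates from the table above into a dictionary and sorts each row; the short column keys and the helper name `weakest_columns` are illustrative.

```python
# Accept rates (%) copied from the matrix in this section.
ACCEPT_RATES = {
    "GPT-4o vision":            {"photos": 71, "charts": 34, "screenshots": 52, "decorative": 41},
    "Claude 3.7 Sonnet vision": {"photos": 68, "charts": 49, "screenshots": 61, "decorative": 58},
    "Gemini 2.0":               {"photos": 66, "charts": 38, "screenshots": 64, "decorative": 44},
    "Llama-Vision-3 (90B)":     {"photos": 62, "charts": 21, "screenshots": 47, "decorative": 53},
    "Pixtral large":            {"photos": 57, "charts": 26, "screenshots": 42, "decorative": 48},
}


def weakest_columns(rates, n=2):
    """Return the n category names with the lowest accept rate, lowest first."""
    return sorted(rates, key=rates.get)[:n]


for model, rates in ACCEPT_RATES.items():
    print(f"{model}: {weakest_columns(rates)}")
```

Sorting row by row rather than averaging across models keeps per-model differences visible, which matters when a deployment is tied to a single vendor.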


4. The four failure modes that matter

Aggregate accept rates hide the texture of the errors. Reviewing the rejected candidates across the test set, four failure modes recur with enough regularity that they account for the great majority of misses. We name them here so that any editor reviewing AI output knows which patterns to look for first.

Failure mode 1: Hallucinated on-screen text

The model writes that a chart axis is labelled “Q3 2024 revenue” when the chart actually shows page-view counts; the model writes that a screenshot’s button reads “Submit” when it reads “Save and continue”. GPT-4o is the worst offender here; Claude 3.7 Sonnet most often refuses, returning a phrase like “a chart whose axis label is not legible at this resolution”. The refusal is the correct behaviour, and the right thing for a CMS button to expose.

Failure mode 2: Mis-identification of disabled subjects

A power wheelchair becomes “a motorised scooter”; a white cane becomes “a walking stick”; a visibly disabled subject in a photo of an activism rally is described as “a person sitting in a chair watching the parade”. The error pattern reflects training-data composition. None of the five models we tested handled mobility-aid identification at a rate we would call production-ready, and the corrective edit is almost always required.

Failure mode 3: Contextual nuance loss

A photo of two people signing in American Sign Language is described as “two people gesturing”; a photo of a service dog under a restaurant table is described as “a dog sleeping under furniture”. The pixels are described accurately. The meaning that the editor placed the image to convey is not. Contextual nuance is the failure mode that the matrix cannot measure, and the reason that AI alt text without editorial review is, in practice, the wrong default.

Failure mode 4: Brand-name fabrication

The model writes that a stock photo of a laptop is “an Apple MacBook” when the laptop is a generic Windows-shaped chassis; the model writes that an unbranded coffee cup is “a Starbucks cup”. Gemini 2.0 is the most prone to this category of error in our test set. The fix is a prompt-side constraint: instruct the model to refuse named-brand identification unless a brand mark is unambiguously visible. Even with the constraint, a sample-rate review remains necessary.
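
Because the prompt-side constraint is not sufficient on its own, a cheap post-generation check can route suspicious candidates to review. The sketch below is an illustration under stated assumptions: `BRAND_TERMS` is a hypothetical watch list that a deployment would maintain itself, matching is case-sensitive on whole words, and `logo_confirmed` stands in for whatever signal the CMS has that a brand mark really is in frame.

```python
import re

# Hypothetical watch list; a real deployment would maintain its own.
BRAND_TERMS = ("Apple", "MacBook", "Starbucks")


def brand_mentions(candidate: str, terms=BRAND_TERMS):
    """Return the watch-list terms that appear as whole words in the candidate."""
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", candidate)]


def needs_brand_review(candidate: str, logo_confirmed: bool = False) -> bool:
    """Flag a candidate that names a brand when no visible brand mark has been confirmed."""
    return bool(brand_mentions(candidate)) and not logo_confirmed
```

A false positive here only costs an editor a glance, so the check can afford to be blunt.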

5. The hybrid workflow we recommend

Treating AI alt text as either “fully automated” or “irresponsible” is a false binary. The category-by-category numbers say something more useful: AI candidates are usable as a first draft in the photo column and as a refusal source in the chart column, and they are an active risk in the decorative column unless the workflow has an explicit “mark decorative” affordance. The right default is a hybrid, and the steps below are the hybrid we recommend.

Step 1: Route by image category before generating

A small classifier (a few thousand parameters is enough) decides whether the image is a photo, a chart, a screenshot, or decorative. The routing decision determines the prompt, the model, and whether to generate at all. Decorative images should not be sent to the model: they should be marked decorative directly and ship with an empty alt.
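
The routing decision can be expressed as a lookup table keyed by category. In the sketch below the classifier itself is out of scope, and the model and prompt identifiers are placeholders rather than real API names; only the shape of the decision (prompt, model, and whether to generate at all) comes from the step above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    PHOTO = "photo"
    CHART = "chart"
    SCREENSHOT = "screenshot"
    DECORATIVE = "decorative"


@dataclass(frozen=True)
class Route:
    generate: bool             # False means: never call a model for this image
    model: Optional[str]       # placeholder model identifier
    prompt_key: Optional[str]  # placeholder key into a prompt library


# Placeholder identifiers; substitute whatever the deployment actually uses.
ROUTES = {
    Category.PHOTO: Route(True, "photo-model", "photo_prompt"),
    Category.CHART: Route(True, "chart-model", "chart_prompt"),
    Category.SCREENSHOT: Route(True, "screenshot-model", "screenshot_prompt"),
    Category.DECORATIVE: Route(False, None, None),
}


def route(category: Category) -> Route:
    """Look up how an image of the given category should be handled."""
    return ROUTES[category]


def alt_for_decorative() -> str:
    """Decorative images ship with an empty alt and are never sent to a model."""
    return ""
```

Keeping the table as data rather than branching logic makes it easy to change one category's model without touching the others.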

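Step 2: Use Claude 3.7 Sonnet for charts and screenshots

The matrix shows Claude leading the chart column (49%) and sitting within three points of Gemini 2.0 on screenshots (61% against 64%), and section 4 identifies it as the model that most often refuses rather than inventing text. Those are the two columns where refusal is the correct behaviour. Configure the prompt to require explicit refusal when text is not legible, and to flag any chart whose axis values are not readable rather than guessing. Surface the refusal in the CMS as a “needs human description” state, not as an empty alt.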
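
Surfacing a refusal as its own state means the CMS has to recognise one. The sketch below assumes refusals arrive as free text and matches a few hypothetical phrasings; asking the model to emit a fixed sentinel string in the prompt would likely be more robust. `AltState`, `REFUSAL_PATTERNS`, and `classify_output` are illustrative names.

```python
import re
from enum import Enum


class AltState(Enum):
    DRAFT = "draft"
    NEEDS_HUMAN = "needs human description"


# Hypothetical refusal phrasings; tune against real model output.
REFUSAL_PATTERNS = [
    r"\bnot legible\b",
    r"\billegible\b",
    r"\bunreadable\b",
    r"\bcannot (?:be )?read\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def classify_output(model_output: str) -> AltState:
    """Map raw model output to a CMS state; empty output is treated like a refusal."""
    text = model_output.strip()
    if not text or _REFUSAL_RE.search(text):
        return AltState.NEEDS_HUMAN
    return AltState.DRAFT
```

Treating empty output as a refusal keeps a model failure from being mistaken for a deliberate decorative marking.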

Step 3: Use GPT-4o or Gemini 2.0 for photos, with a brand-name constraint

In the informative-photo column, both models reach accept rates above 65% (71% and 66% respectively). Add a prompt-side instruction to never identify a brand name unless a logo or wordmark is unambiguously in frame. Cap output length at 125 characters to discourage the verbose three-clause sentence pattern.
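
A minimal sketch of the two constraints follows. The prompt wording is illustrative rather than a tested prompt, and the sketch assumes that over-length output is flagged for the editor instead of being truncated, since cutting a sentence mid-clause can change its meaning.

```python
MAX_ALT_CHARS = 125  # length cap from the step above

# Illustrative wording only; not a benchmarked prompt.
PHOTO_PROMPT = (
    "Write alt text for this image in one sentence of at most "
    f"{MAX_ALT_CHARS} characters. "
    "Do not name a brand unless a logo or wordmark is unambiguously visible."
)


def within_budget(candidate: str, limit: int = MAX_ALT_CHARS) -> bool:
    """True if the candidate, ignoring surrounding whitespace, fits the length cap."""
    return len(candidate.strip()) <= limit
```

Checking the length after generation matters because a prompt instruction alone does not guarantee the model stays under the cap.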

Step 4: Human edit pass before publish

Every AI candidate is a draft. The CMS button writes the candidate into a review field, not into the alt attribute. The editor accepts the candidate, edits it, or replaces it with original text. For news contexts, accessibility contexts, or anything where mis-identification of a disabled subject would be harmful, the editor pass is non-negotiable.
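
The separation between the review field and the published alt can be enforced in the data model. The sketch below uses hypothetical field and function names (`ImageRecord`, `ai_candidate`, `store_candidate`, `accept`, `publish`); the point it illustrates is that no code path writes model output into `alt` without an editor action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageRecord:
    src: str
    alt: Optional[str] = None           # published alt; None until an editor signs off
    ai_candidate: Optional[str] = None  # review field written by the button


def store_candidate(rec: ImageRecord, candidate: str) -> None:
    """What the auto alt-text button does: fill the review field only."""
    rec.ai_candidate = candidate


def accept(rec: ImageRecord) -> None:
    """Editor accepts the candidate unchanged."""
    if rec.ai_candidate is None:
        raise ValueError("no candidate to accept")
    rec.alt = rec.ai_candidate


def publish(rec: ImageRecord, text: str) -> None:
    """Editor publishes edited or entirely original text."""
    rec.alt = text
```

With this shape, an image whose candidate was never reviewed still has `alt` set to `None`, which a publish check can block on.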

Step 5: Audit on a schedule

Re-run a sample of published alts against the matrix every quarter. Models drift; vendor builds change; the failure modes shift. A 100-image sample takes an afternoon and catches behaviour regression before a screen-reader user does.
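
Drawing the quarterly sample and comparing it with a baseline takes only a few lines. In the sketch below the 100-image sample size comes from the step above, while the five-point `tolerance` and the function names are assumptions to be tuned per deployment.

```python
import random


def audit_sample(image_ids, k=100, seed=None):
    """Draw up to k image ids without replacement; a fixed seed makes the draw repeatable."""
    ids = list(image_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def regressed(current_rate: float, baseline_rate: float, tolerance: float = 0.05) -> bool:
    """True if the accept rate fell more than `tolerance` below the baseline (assumed threshold)."""
    return current_rate < baseline_rate - tolerance
```

Recording the seed with each audit lets a later reviewer re-draw the same images and check whether a regression came from the model or from the reviewers.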

What “automation” should and should not mean

An AI alt-text feature that writes directly into the alt attribute without human review is not an accessibility feature — it is an accessibility statement. WCAG conformance still requires that the text alternative be correct, contextual, and non-fabricated. The model can draft; only the editor can ship.


Conclusion: the bar moved, the floor did not

The headline of this primer, written honestly, is that vision-language models in 2026 are now a useful first draft for the photo column and a useful refusal source for the chart column, and that the two facts together imply a hybrid workflow rather than a fully automated one. The bar moved meaningfully between 2022 and 2026: accept rates on informative photos now run from 66% to 71% for the closed models, where in 2022 they were closer to the low thirties. The floor did not. Mobility aids are still mis-identified, ASL still becomes “gesturing”, and decorative images still receive a sentence when they need silence.

The accessibility consequence is that the right default for any CMS shipping an “auto alt-text” button in 2026 is not “press the button and publish”. It is “press the button to draft, then review before publish”. Anything more automated than that ships fabricated detail to the readers who depend most directly on the text alternative being correct. Anything more cautious than that, such as ignoring AI entirely, leaves the 41% of images with missing or empty alts unaddressed when a draft would have helped.

We will re-run this matrix in November 2026. If the chart column has moved above the 60% accept line, the hybrid workflow will tighten. Until then, the button drafts, the editor ships.
