Generative AI and Screen-Reader Prompts

Image description: A smartphone resting on a wooden desk showing an AI chat interface with headphones plugged in — the visual marker for screen-reader-friendly AI prompt design.

Reading Time: 9 minutes

A new design discipline has crystallised inside the accessibility community over the past eighteen months, and it does not yet have a settled name. Some teams call it “AT-aware prompt engineering”; others call it “screen-reader-shaped system prompts”; the practitioners who came up through voice-UI design tend to call it “the speech-output layer of an LLM.” Whatever the label, the craft is the same: writing system prompts and output-shaping rules that make generative AI assistants — ChatGPT, Claude, Gemini, Copilot, Be My AI — useful for the roughly approx. 253 million people worldwide who reach those products through a screen reader.

The problem is concrete and the failure mode is loud. An LLM trained on the public web produces, by default, prose decorated with em-dashes, nested markdown lists, code fences, headings that exist only because the model felt the answer was “structured”, and decorative emoji. Read aloud by NVDA, JAWS, VoiceOver or TalkBack, that output becomes a stream of “dash dash” interjections, “bullet bullet bullet” enumeration without any sense of where one item ends, “heading level two” announcements that interrupt a sentence, and emoji name-strings (“smiling face with sunglasses”) between every other clause. The information is in there. The user cannot extract it without rewinding three times. This piece is a primer on what the discipline is asking of model builders, what the products have shipped so far, and the open UX problems nobody has solved yet.

The new discipline — what it actually consists of

Screen-reader-aware prompt design is not a single rule. It is a small set of constraints that, together, produce output a synthesiser can pronounce intelligibly and a screen-reader navigation key can move through. The constraints fall into four buckets.

Concise responses with semantic structure. Default LLM output is too long for spoken delivery — a 600-word answer that reads fine in a sighted user’s browser becomes a four-minute monologue that the screen-reader user has no way to skim. The discipline asks for shorter answers, but more importantly for structured shorter answers: an opening one-sentence summary that the user can stop at, followed by structure the screen reader can navigate by heading or by list-item.

Avoid em-dashes and other punctuation that synthesisers mispronounce. The em-dash, the en-dash, the parenthetical, the slash-as-conjunction, the ASCII-art separator — all of these are read aloud as either silence, a literal “dash”, or a confusing pause that breaks a clause in half. The convention emerging across the major models is: prefer the comma and the full stop; use the colon for the one place it really earns its keep; never use em-dashes in spoken-context responses; never use ASCII rules to separate sections.

Declare what is a list, what is a heading, what is code. Synthesised speech has no visual hierarchy. A heading needs to be announced as “heading”, a list needs to be announced as “list with N items, item one”, code needs to be announced as “code”, and the model needs to either output structures the screen reader recognises (HTML, proper markdown the rendering surface converts to ARIA) or to verbally narrate the structure itself (“Here are three options. Option one: …”).

No markdown soup. Markdown is fine when the rendering surface converts it to semantic HTML. Markdown is hostile when the surface displays the raw asterisks and underscores, because the screen reader then announces “asterisk asterisk” before every bold word. The discipline is to detect the rendering context — chat UI with markdown rendering versus terminal versus screen-reader-driven voice interface — and to shape the output accordingly. The same model needs to produce different surface representations of the same answer.

What screen readers actually need from AI

To make the constraints above concrete, it helps to look at the actual behaviour of the four screen-reader / OS combinations that dominate the field: JAWS on Windows, NVDA on Windows, VoiceOver on macOS and iOS, and TalkBack on Android. They are not interchangeable, and a prompt that produces great output for one can be unreadable on another.

Navigation by heading. All four readers expose a heading-navigation key (H in JAWS and NVDA, Rotor in VoiceOver, the reading-control toggle in TalkBack). For a long AI answer to be navigable, the model has to emit real semantic headings — either through a markdown rendering pipeline that converts to <h2>/<h3> with proper level nesting, or via the chat surface’s own structured-response API. A model that “structures” its answer by bolding the first three words of each paragraph has produced something that looks structured visually and is completely flat to a screen reader.

Navigation by list. Lists are useful in spoken output precisely because the screen reader announces the count (“list with seven items”) and lets the user step through with the list-item navigation key (I in NVDA, L in JAWS). But this only works if the list is a real <ul> or <ol>. A “list” produced by emitting bullet characters at the start of each line, with no list wrapper, is read as ordinary prose with an unexplained “black circle” or “bullet” interjection on every line.

Skip-by-section. Long-form AI answers — explanations, comparisons, code-and-commentary, multi-step instructions — need a way for the screen-reader user to skip to the section they care about without listening through the preamble. This is the single hardest piece to design well, because the model has to produce a navigable structure and the chat surface has to render it in a way the OS exposes to the assistive technology, and the screen reader has to be configured to use the heading-navigation key in that surface. All three things fail in the wild; usually it is the middle one.

Pronunciation hints. Synthetic voices stumble on technical terms, acronyms with mixed letters, URLs, code identifiers, mathematical notation, and non-English names. A well-designed model will, for screen-reader-context responses, spell out acronyms on first use (“WCAG, the Web Content Accessibility Guidelines”), expand initialisms the synthesiser cannot pronounce, and avoid embedding raw URLs inside flowing prose where the synth will read the slashes aloud. None of the major products do this consistently in 2026.

How the products are handling it

As of mid-2026, the major generative AI products have taken visibly different positions on screen-reader-aware output. None of them have nailed it. The progression is faster than it was twelve months ago, but the gap between the best and the worst is still wide.

ChatGPT (OpenAI). The web client now ships with a “concise mode” toggle that shortens default responses and reduces markdown decoration. The voice mode introduced in 2024 — and substantially upgraded in 2025 — is the closest any major product has come to a screen-reader-native interface, because it bypasses the visual chat entirely and delivers a spoken answer with a stop, replay and “say that again” gesture. The custom-instructions field allows screen-reader users to declare their preferences once and have them apply across sessions, which is the user-driven workaround the community has settled on. The remaining gaps: GPT models still default to em-dash-heavy prose unless instructed otherwise, and the heading-level emitted in markdown does not always map cleanly to ARIA in the chat surface.

Claude (Anthropic). Claude’s system-prompt discipline has moved closest to the conventions described above. The model is noticeably less em-dash-prone than the GPT line in 2026, defaults to shorter answers, and responds well to system-prompt instructions like “you are speaking to a screen-reader user; use no em-dashes, prefer short paragraphs, and use real headings or numbered lists when structure is needed.” The Claude.ai chat surface renders markdown to semantic HTML with proper heading levels, which makes the heading-navigation key work. Voice output through third-party integrations exists but is less developed than ChatGPT’s first-party voice mode.

Gemini (Google). Tight integration with TalkBack on Android is Gemini’s structural advantage; the model can hand off to the OS-level screen reader through Android’s accessibility services in a way the iOS and web competitors cannot. The “Hey Google, ask Gemini…” flow on accessible Android devices is, for some users, the most natural AI-plus-screen-reader experience available. The remaining gaps: the web interface still over-decorates responses, the heading hierarchy in Gemini’s web answers is inconsistent, and the model is more prone to producing decorative emoji than its competitors.

Be My AI (Be My Eyes + OpenAI). This is the most narrowly scoped of the four — a visual-description assistant that uses GPT-4-class vision models to describe images and surroundings for blind and low-vision users. It is also the only product in this list designed from day one for a screen-reader user as the primary audience. Be My AI’s prompt design is the field’s clearest demonstration of what AT-aware output looks like in practice: descriptions open with a one-sentence summary the user can stop at, follow with structured detail only if asked, and avoid spatial language (“on the left”, “above”) that requires sighted context to interpret. The product remains, in 2026, the closest the field has to a reference implementation.

The cross-cutting observation is that the four products have made progress on the easy parts — shorter answers, fewer em-dashes, a custom-instructions field — and have barely begun on the hard parts. The hard parts are below.

Open UX problems nobody has solved

The screen-reader-aware prompt-design literature converges on four open UX problems where the right answer is not yet known. None of them are model-capability problems; all of them are interaction-design problems that sit between the LLM, the chat surface, the OS, and the screen reader.

Interrupt-ability. A sighted user can scan an LLM response in approx. two seconds and decide whether to read it. A screen-reader user cannot. If the answer is wrong or off-target, the user has to listen through enough of it to know that, then interrupt. Voice modes have a stop button. Text modes generally do not — the response streams in and the screen reader announces it as new content as it arrives, and the user has no clean way to say “stop generating, this is not what I asked.” The Be My AI app handles this best; the web chat clients handle it worst.

Repeat-last-answer with selectable granularity. Asking a screen reader to re-read the last response is easy if the answer is short. It is unusable if the answer is six paragraphs and the user only wants to hear the third paragraph again. The interaction the community is asking for is “repeat the last list item”, “repeat the last heading section”, “repeat the last code block.” That requires the chat surface to expose the structure to the screen reader in a way the screen reader’s own re-read commands can address. In 2026, none of the major products do this; the user has to use the screen reader’s own line-by-line navigation, which is laborious.

Navigate-by-section in spoken output. Voice modes do not have a heading-navigation key. The user listens to a four-minute answer linearly, with no way to skip from the “overview” section to the “specifics” section without rewinding by time. The interaction designs being prototyped — a spoken “section list” the user can navigate with arrow keys, a “go to section three” voice command, a “give me the headings only” mode — are early. The Be My AI app’s “more detail on the colours” follow-up is the closest functioning version of this in a shipping product.

The AT-handoff question — when does the AI speak versus read content aloud? This is the deepest design question. If a screen-reader user opens an AI assistant on a webpage, who is speaking — the AI’s own voice (TTS layer), or the user’s installed screen reader reading the AI’s text output? The two voices have different settings, different speaking rates, different pronunciation hints, different stop-and-replay gestures. Two systems trying to speak the same content at the same time produces nothing usable. The convention emerging is: voice-mode interactions use the AI’s own TTS and explicitly suppress the screen reader; text-mode interactions emit semantic HTML and let the screen reader do the speaking. But the boundary between the two modes is not always clean — image-description, code-generation, mathematical notation, and multi-modal answers all sit awkwardly between voice and text — and that boundary is where most of the live UX problems live.

Where it goes next

The discipline is roughly where web accessibility was in approx. 2002 — past the “is this a real problem?” phase, past the “is anyone responsible?” phase, into the “what are the actual rules?” phase. Three things are likely to happen across 2026 and 2027.

First, the model builders will codify their internal screen-reader prompts and publish them, the way Anthropic publishes Claude’s system prompts in VPAT-style accessibility statements and OpenAI has begun documenting GPT’s behavioural defaults. The community is asking for the equivalent of a model card — a “screen-reader output card” — that names the conventions a given model has been trained or system-prompted to follow.

Second, the chat surfaces — web clients, mobile apps, IDE integrations — will gain proper semantic-HTML rendering pipelines and proper ARIA exposure for chat history, with the navigation keys mapped to the OS-level screen reader. This is unglamorous work, and it is the work that will move the needle most for daily users.

Third, the screen-reader vendors themselves — Vispero (JAWS), NV Access (NVDA), Apple (VoiceOver), Google (TalkBack) — will start shipping AI-aware features: native heading-navigation inside AI chat surfaces, a standardised “stop generating” gesture, smarter re-read commands that know about LLM response structure. NVDA’s open-source add-on ecosystem is already producing early versions of these. The proprietary readers are slower but the direction is the same.

The deeper observation is that screen-reader-aware prompt design has stopped being a niche concern of a handful of blind developers and has become a baseline expectation of every AI product team that wants to ship into regulated markets. The European Accessibility Act applies to “interactive self-service terminals” and “consumer terminal equipment with interactive computing capability” — a category that almost certainly captures a major AI assistant on a phone. The AT-aware output layer is not a feature any more; it is procurement-binding. The teams that figure out the rules now will ship the products that survive 28 June 2025 and onwards. The teams that treat it as an afterthought will be the next round of EAA enforcement cases.

Final thoughts

The craft is small, the stakes are large, and the rules are still being written. If you build with LLMs and you have not yet had a conversation with a screen-reader user about what your product actually sounds like when they use it, that is the next thing on the list. Most of what is wrong with AI for screen-reader users in 2026 is not a model-capability problem; it is a prompt-and-surface design problem that any product team can fix in a sprint, if they decide to.

The community has been generous with its time, its testing, and its patience. It is also losing patience faster than it used to, because the products are now mainstream and the excuse of “we are still figuring it out” has run out. The discipline is here. The conventions are converging. The next eighteen months will sort the teams that listened from the teams that did not.