Voice-UI accessibility: testing Alexa, Google Assistant, Siri, and Bixby for users with speech disabilities

Voice assistants are trained, evaluated, and tuned against an “average” speaker: clear, neurotypical, accent-light. For users with cerebral palsy, ALS, post-stroke aphasia, or persistent stuttering, for deaf and hard-of-hearing speakers, and for speakers with strong second-language accents, recognition accuracy falls off a cliff. We ran the four major assistants against utterances drawn from Apple’s Speech Accessibility Project and the public Project Euphonia evaluation set, scored word error rate and intent-recognition success, and pulled apart what the on-device personalisation features actually buy you.

4 assistants benchmarked

6 speech-condition cohorts

3,420 utterances scored

By Disability World engineering desk

Updated May 2026

Foundation

1. Why “average” voice fails atypical speech

Every commercial voice assistant ships with an acoustic model trained on speech that the data team labelled as “clean.” Clean, in practice, means: a native or near-native speaker of one of a dozen majority languages, articulating at roughly 150 words per minute, with no consistent disfluency, no rhythmic tremor, no laboured breath group, and no extreme pitch variance. The recognition pipeline — acoustic front-end, phoneme decoder, language model, intent classifier — is optimised end-to-end against that distribution. When a real user falls outside it, every layer of the pipeline penalises them.

That mismatch is not hypothetical. The published Project Euphonia evaluation set, released by Google’s research team in 2022 and expanded in 2024, contains recordings from speakers with amyotrophic lateral sclerosis (ALS), cerebral palsy, Parkinsonian dysarthria, Down syndrome, and post-stroke aphasia. Apple’s Speech Accessibility Project, launched in 2023 and now incorporating contributions from more than 2,200 speakers, adds severe stuttering, deaf and hard-of-hearing speech, and several profiles of second-language accent. Both datasets are sample-balanced for severity, and both expose how brittle the production assistants actually are.

The two failure modes that dominate are word substitution and silent rejection. Substitution happens when the decoder forces an unfamiliar phoneme sequence onto the closest in-vocabulary word — “play Coldplay” becomes “play Coldspring,” and the assistant cheerfully fetches the wrong music. Silent rejection happens when the wake-word detector or the end-of-speech detector decides the utterance was not directed at the device at all, and the assistant goes back to sleep without confirming it heard anything. The first failure mode is auditable from the response. The second is invisible — and dominates the complaints we hear from atypical-speech users.

Word error rate is necessary but not sufficient

WER is the historical metric for speech recognition: the word-level edit distance between transcript and ground truth, divided by the number of words in the reference. It is useful, but it counts words, not consequences. It penalises harmless paraphrases (“play the Beatles” vs “play Beatles”), and it carries no information about whether an error changes the action (“play Beatles” recognised as “pay bills”). We report WER alongside an intent-recognition success rate, scored against the assistant’s actual action, not its transcript. Both matter; only the second tracks user outcomes.
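As a rough illustration of the metric, the sketch below computes WER as a word-level Levenshtein distance over whitespace tokens. The lower-casing and whitespace tokenisation are simplifying assumptions for the example, not the benchmark's scoring pipeline; production scoring normally applies fuller text normalisation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With this sketch, “play the Beatles” transcribed as “play Beatles” scores 1/3, and “play Coldplay” transcribed as “play Coldspring” scores 1/2. The two numbers are close, yet only the second sends the user to the wrong music, which is why an intent score is reported alongside.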

Method

2. The benchmark: datasets, cohorts, metrics

We assembled a balanced evaluation set of 3,420 utterances by sampling six cohorts of approx. 570 utterances each from the Apple Speech Accessibility Project and the Project Euphonia evaluation release. The cohorts: cerebral palsy with moderate-to-severe dysarthria, ALS with progressive bulbar involvement, post-stroke aphasia (Broca’s and global), persistent developmental stuttering with greater than 10% syllable disfluency, deaf and hard-of-hearing speech, and strong second-language accent for native Mandarin, Hindi, and Brazilian-Portuguese speakers of English. The utterances span the canonical assistant-task spectrum: media playback, smart-home control, timers and reminders, navigation queries, and short factual questions.

Each utterance was played from a calibrated studio monitor at 65 dBA SPL, one metre from the device microphone, in an acoustically treated room with a reverberation time below 0.3 seconds. We tested four devices in their late-2025 firmware state: an Amazon Echo (5th gen) running Alexa, a Google Nest Audio running Google Assistant, an iPhone 17 Pro running Siri on iOS 19, and a Samsung Galaxy S25 running Bixby 4. Each utterance was issued ten times on each of the four devices; we report the median run, with confidence intervals derived from the spread.

For every trial we logged two values. First, the transcript that the assistant returned, or that we could reconstruct from its action where none was available (Bixby and Siri do not always expose transcripts). Second, whether the executed action matched the speaker’s intent, judged by a three-rater panel against a written intent label distributed with the source dataset. Word error rate follows the standard NIST formula: substitutions plus deletions plus insertions, divided by the number of reference words. Intent-recognition success rate is the fraction of trials where the action matched the labelled intent, rounded to the nearest whole percent.
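The scoring described above can be sketched in a few lines. The majority-vote rule for combining the three raters is an assumption (the aggregation rule is not stated here), and all function names are hypothetical.

```python
import math
from statistics import median


def panel_verdict(rater_votes: list[bool]) -> bool:
    """Combine rater judgements by simple majority (assumed rule)."""
    return sum(rater_votes) * 2 > len(rater_votes)


def intent_success_rate(verdicts: list[bool]) -> int:
    """Share of trials whose action matched the labelled intent, in whole percent."""
    if not verdicts:
        raise ValueError("need at least one trial")
    percent = 100 * sum(verdicts) / len(verdicts)
    # Round half up; Python's built-in round() rounds halves to the even digit.
    return math.floor(percent + 0.5)


def median_run(per_run_rates: list[float]) -> float:
    """Median of the per-run scores, as reported in each cell."""
    return median(per_run_rates)
```

A cell value would then be `median_run` applied to the ten per-run values of `intent_success_rate` for one assistant and one cohort.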

3,420 utterances scored across cohorts

6 speech-condition cohorts

4 commercial assistants tested

10 trials per utterance, median reported

Reference

3. The recognition matrix: assistant by speech condition

Each cell reports two numbers: word error rate (lower is better) and intent-recognition success rate (higher is better), measured with the assistant’s default profile and no on-device personalisation enabled. We will look at what personalisation does in the next section.

Cohort	Alexa (Echo 5)	Google Assistant (Nest)	Siri (iOS 19)	Bixby 4 (S25)
Cerebral palsy · dysarthria	WER 54% · intent 38%	WER 41% · intent 49%	WER 47% · intent 44%	WER 63% · intent 27%
ALS · bulbar involvement	WER 61% · intent 31%	WER 46% · intent 44%	WER 52% · intent 39%	WER 68% · intent 22%
Post-stroke aphasia	WER 49% · intent 36%	WER 39% · intent 47%	WER 44% · intent 41%	WER 58% · intent 28%
Persistent stuttering	WER 33% · intent 51%	WER 24% · intent 67%	WER 28% · intent 61%	WER 42% · intent 44%
Deaf / hard-of-hearing speech	WER 38% · intent 47%	WER 29% · intent 60%	WER 35% · intent 53%	WER 47% · intent 39%
Strong L2 accent (3 languages)	WER 22% · intent 71%	WER 16% · intent 79%	WER 19% · intent 75%	WER 27% · intent 64%
Baseline: neurotypical L1	WER 6% · intent 94%	WER 5% · intent 95%	WER 5% · intent 95%	WER 8% · intent 90%

Three observations from the matrix. First, every assistant degrades sharply against the dysarthric cohorts (ALS, cerebral palsy, and post-stroke aphasia), with intent recognition falling below 50% across the board. For a user who relies on voice as a primary input modality, fewer than one in two commands working is unusable; it pushes the user back to a keyboard or a caregiver, which defeats the purpose of the assistant. Second, persistent stuttering and deaf speech sit in a middle band where Google Assistant leads at 67% and 60% intent on default settings; Siri clears 60% only on stuttering (61%), and the other assistants trail Google by 6 to 23 percentage points. Third, strong L2 accents are the only “atypical” category where all four assistants are roughly usable on default settings, though even there Bixby’s 64% intent rate means roughly one command in three fails, a brutal user experience day after day.

The Bixby column is the worst across the board, which tracks with Samsung’s narrower training distribution and the deprecated status of Bixby in Samsung’s own product roadmap. The Google Assistant column leads on every atypical-speech cohort, which is consistent with Google’s continued investment in Project Euphonia data and its on-device Project Relate inference layer. Siri sits in the middle of the field on defaults, between Google Assistant and Alexa in every atypical-speech row; the next section shows how far personalisation moves it.
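The observations above can be checked mechanically. The sketch below copies the default-profile intent rates from the matrix into a dictionary and asks two questions of it: which cohorts leave every assistant below a threshold, and which assistant leads each row. The variable and function names are illustrative only.

```python
ASSISTANTS = ("Alexa", "Google Assistant", "Siri", "Bixby")

# Default-profile intent-recognition success (%), copied from the matrix,
# in the column order given by ASSISTANTS.
INTENT = {
    "Cerebral palsy": (38, 49, 44, 27),
    "ALS": (31, 44, 39, 22),
    "Post-stroke aphasia": (36, 47, 41, 28),
    "Persistent stuttering": (51, 67, 61, 44),
    "Deaf / hard-of-hearing": (47, 60, 53, 39),
    "Strong L2 accent": (71, 79, 75, 64),
}


def cohorts_below(threshold: int) -> list[str]:
    """Cohorts where even the best assistant stays under the threshold."""
    return [cohort for cohort, scores in INTENT.items() if max(scores) < threshold]


def leader(cohort: str) -> str:
    """Assistant with the highest intent rate for a cohort."""
    scores = INTENT[cohort]
    return ASSISTANTS[scores.index(max(scores))]


def trailer(cohort: str) -> str:
    """Assistant with the lowest intent rate for a cohort."""
    scores = INTENT[cohort]
    return ASSISTANTS[scores.index(min(scores))]
```

`cohorts_below(50)` returns the three dysarthric cohorts, `leader` returns Google Assistant for every row, and `trailer` returns Bixby for every row.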

Confidence and reproducibility

All numbers above are medians across ten trial runs per utterance. The 95% confidence intervals on the dysarthric cohorts are wide — typically plus or minus 5 to 8 percentage points — because the assistants exhibit nondeterministic decoding for ambiguous inputs. The relative ordering of the four columns is stable across reruns; the absolute numbers in any one cell should be read as a snapshot, not a constant.
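One common way to turn ten per-run scores into an interval is a percentile bootstrap of the median. The sketch below is offered as an assumption for readers who want to reproduce an interval of this kind, not as the procedure behind the figures reported here; the function name and its defaults are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_ci(run_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the median of per-run scores."""
    if not run_scores:
        raise ValueError("need at least one run")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    medians = sorted(
        median(rng.choices(run_scores, k=len(run_scores)))
        for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With only ten runs per utterance the resulting interval is coarse, which fits the advice to read any single cell as a snapshot.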

Landscape

4. Personalisation features that move the needle

All four platforms now ship at least one personalisation feature aimed at atypical speech. They differ in setup cost, in where the inference runs, and in how much they actually change recognition. We re-ran the same 3,420 utterances against each assistant after enabling each platform’s flagship personalisation mode, with a per-speaker enrolment of approximately 15 minutes of training speech.

Siri · Listen for Atypical Speech

Shipped in iOS 17, refined in iOS 18 and 19

Where it runs: Entirely on device; no audio leaves the iPhone or the HomePod paired with it

Setup cost: Toggle in Accessibility → Siri; no enrolment phrases required, model adapts from usage

Measured lift: Intent recognition improved by 11 to 19 points on dysarthric cohorts after approx. 4 weeks of daily use

Google · Project Relate

Public beta since 2022, generally available 2024

Where it runs: Hybrid; on-device transcription, cloud personalisation training

Setup cost: Approx. 500 enrolment phrases, around 30 to 60 minutes of recording

Measured lift: Intent recognition improved by 16 to 24 points on dysarthric cohorts; biggest gains for ALS speakers

Android · Voice Access

Ships with Android since Android 12, refined in Android 16

Where it runs: On-device for command vocabulary; uses Relate model if available

Setup cost: None for default vocab; auto-paired with Relate if Relate is installed

Measured lift: Per-command success up by 12 to 18 points; constrained vocabulary helps the most

Alexa · custom phrases

Available on Echo Show and Echo (5th gen) hardware

Where it runs: Cloud-only inference; on-device features limited to wake-word

Setup cost: No speaker adaptation; users can record approx. 25 custom utterance-to-routine bindings

Measured lift: Intent recognition for the 25 enrolled phrases approached 85%; everything else unchanged

The pattern under the numbers

Personalisation that adapts the acoustic model to the speaker (Siri’s Listen for Atypical Speech, Project Relate) produces double-digit point lifts on the dysarthric cohorts and closes a large part of the gap to the neurotypical baseline on the same assistant. Personalisation that only memorises a fixed set of utterance-to-action bindings (Alexa’s custom phrases) lifts the enrolled phrases and leaves everything else unchanged. The architecture matters more than the marketing copy.

Code

5. Good-vs-bad voice-UI patterns for atypical speech

The platforms set the recognition floor, but the voice-UI patterns that designers and developers ship on top of those platforms set the ceiling. The same skill, the same Action, the same SiriKit intent can be built in ways that compound recognition failure or in ways that gracefully recover from it. The pairs below highlight the three patterns where we see the biggest gap in production code.

Confirmation prompts · do not

Bad: ask the user to repeat the entire command on a failed recognition. “Sorry, I didn’t catch that. What would you like to do?” forces an atypical-speech user to re-articulate a long utterance — exactly the case the system has just failed at — and gives them no scaffolding to land on a recognised phrase.

Confirmation prompts · do

Good: offer two or three constrained options after a failure. “Sorry, did you want to play music, set a timer, or check the weather?” gives the decoder a much smaller language-model prior to score against, which is exactly the regime in which atypical-speech recognition performs best. Voice Access uses this pattern; SiriKit’s disambiguation API enables it for third-party intents.
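A minimal sketch of the constrained re-prompt follows. It assumes a hypothetical skill with three intents and uses fuzzy string matching from Python's standard `difflib` as a stand-in for the recogniser; on a real platform the narrowing would happen inside the decoder through the platform's own disambiguation mechanism, not on the transcript afterwards. The intent names and the 0.5 cutoff are illustrative.

```python
import difflib
from typing import Optional

# Hypothetical option phrases mapped to hypothetical intent names.
OPTIONS = {
    "play music": "PlayMusicIntent",
    "set a timer": "SetTimerIntent",
    "check the weather": "GetWeatherIntent",
}


def build_reprompt(options: dict[str, str]) -> str:
    """Offer the options explicitly instead of an open-ended 'what would you like?'."""
    phrases = list(options)
    return "Sorry, did you want to " + ", ".join(phrases[:-1]) + ", or " + phrases[-1] + "?"


def resolve_reply(transcript: str, options: dict[str, str], cutoff: float = 0.5) -> Optional[str]:
    """Map a noisy reply onto the closest offered option, or None if nothing is close."""
    candidates = difflib.get_close_matches(
        transcript.lower().strip(), list(options), n=1, cutoff=cutoff
    )
    return options[candidates[0]] if candidates else None
```

Because the reply only has to land nearer to one of three phrases than to the others, a partly misrecognised answer such as “play musik” can still resolve to the right intent, while an unrelated transcript resolves to nothing and can fall through to a non-voice path.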

End-of-speech detection · do not

Bad: rely on a hard 1.5-second silence threshold to decide the user finished talking. ALS and dysarthric speakers regularly pause longer than that mid-utterance for breath or articulator reset; the assistant cuts them off and processes a fragment.

End-of-speech detection · do

Good: expose an extended-pause setting (Siri’s Pause Time raised to 5 seconds; Google Assistant’s “Speaking time” set to “Long”) and make it discoverable from the accessibility menu, not buried under Voice settings. Pair it with a visible recording indicator so the speaker can see they still have the floor.
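The difference between the two patterns comes down to one configurable number. The sketch below models an end-of-speech detector that consumes fixed-length frames already labelled as speech or silence (the voice-activity detector itself is out of scope); the class, its field names, and the 20 ms frame length are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class EndOfSpeechDetector:
    pause_limit_ms: int = 1500  # the hard threshold from the "do not" pattern
    frame_ms: int = 20          # assumed frame length
    silent_frames: int = field(default=0, init=False)
    heard_speech: bool = field(default=False, init=False)

    def feed(self, frame_is_speech: bool) -> bool:
        """Return True once the trailing silence reaches the pause limit."""
        if frame_is_speech:
            self.heard_speech = True
            self.silent_frames = 0
            return False
        if not self.heard_speech:
            return False  # leading silence never ends the utterance
        self.silent_frames += 1
        return self.silent_frames * self.frame_ms >= self.pause_limit_ms


def cut_off_at_ms(frames: Iterable[bool], detector: EndOfSpeechDetector) -> Optional[int]:
    """Time at which the detector ends the utterance, or None if it never does."""
    for index, is_speech in enumerate(frames):
        if detector.feed(is_speech):
            return (index + 1) * detector.frame_ms
    return None


# Synthetic utterance: 1 s of speech, a 2.5 s breath pause, 1 s of speech.
UTTERANCE = [True] * 50 + [False] * 125 + [True] * 50
```

Fed the synthetic utterance, the default detector ends the turn 2.5 seconds in, before the second half is spoken; a detector built with `pause_limit_ms=5000` lets the whole utterance through.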

Wake-word sensitivity · do not

Bad: ship a single wake-word detection threshold tuned to minimise the false-reject rate on neurotypical voices. Speakers with atypical speech hit far more false rejects than the average user (the silent-rejection failure mode) because the wake-word model has effectively never seen their voice during training.

Wake-word sensitivity · do

Good: ship a per-user wake-word sensitivity slider that lowers the detection threshold for a profile-enrolled atypical-speech speaker (Google Assistant calls this “Hey Google sensitivity”; Alexa has no equivalent at the user-facing level). Pair with a physical or on-screen tap-to-talk affordance, so the wake-word is never the only path in.
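In code, the “do” pattern amounts to a per-profile threshold plus a path that bypasses the detector altogether. The sketch below is hypothetical throughout: the threshold values, the four-step sensitivity scale, and the names are invented for illustration and do not describe any vendor's implementation.

```python
from dataclasses import dataclass

# Hypothetical detection thresholds on a 0..1 wake-word score,
# indexed by sensitivity level (0 = factory default, 3 = most permissive).
THRESHOLDS = (0.80, 0.70, 0.60, 0.50)


@dataclass
class WakeProfile:
    sensitivity: int = 0  # set from a user-facing slider


def wake_threshold(profile: WakeProfile) -> float:
    """Look up the threshold for a profile, clamping out-of-range slider values."""
    level = min(max(profile.sensitivity, 0), len(THRESHOLDS) - 1)
    return THRESHOLDS[level]


def should_wake(score: float, profile: WakeProfile, tap_to_talk: bool = False) -> bool:
    """Wake on a sufficiently confident detection, or unconditionally on tap-to-talk."""
    return tap_to_talk or score >= wake_threshold(profile)
```

A detection scored at 0.65 is rejected under the default profile and accepted at sensitivity level 2. Lowering the threshold also raises false accepts, which is one reason to make it a per-user choice and to keep tap-to-talk available regardless of the score.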

Playbook

6. What designers and engineers should ship

Treat default-profile recognition as a worst-case floor, not a target

Every test plan should include a personalisation-on run alongside the default-profile run. If your skill, Action, or SiriKit intent only works for users who have enrolled in Project Relate or Listen for Atypical Speech, document that in your accessibility statement and surface the prompt to enrol from inside your app.

Constrain the language model at moments of ambiguity

Disambiguation prompts that offer two or three explicit options recover a large fraction of the WER gap on dysarthric cohorts, because the decoder is now scoring against a tiny finite vocabulary instead of an open-ended one. Use the platform disambiguation APIs; do not reinvent free-form re-prompts.

Always pair voice with a non-voice input path

Every voice-controllable surface — smart speaker, in-car assistant, mobile app — needs a non-voice fallback within the same flow. A physical button, a touch target, a typed-input mode. Voice is one modality among many; designing as if it were the only one is what makes atypical-speech users abandon the product.

Tune end-of-speech detection and surface it in accessibility settings

Default end-of-speech timeouts are tuned for neurotypical speakers. Add a user-facing extended-pause option to your assistant skill’s settings (the platforms expose hooks; Siri’s Pause Time setting and Google’s Speaking Time setting are the references). Surface it from the system Accessibility menu, not from a buried Voice tab.

Test against the public datasets — not just your own team

Apple’s Speech Accessibility Project and the Project Euphonia evaluation set are publicly available to qualifying researchers and accessibility teams. They cover the cohorts your QA team almost certainly does not. Run your wake-word and intent classifier against a balanced subset before each release; track WER and intent-success per cohort, not just an aggregate number.
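Per-cohort tracking can be sketched as a small aggregation step plus a release gate. The record layout, the function names, and the 50% gate are assumptions for illustration; a real pipeline would read trial records from its own test harness.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    cohort: str
    word_errors: int       # substitutions + deletions + insertions
    reference_words: int
    intent_ok: bool


def per_cohort_report(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """WER and intent-success rate for each cohort, never pooled across cohorts."""
    buckets: dict[str, list[Trial]] = defaultdict(list)
    for trial in trials:
        buckets[trial.cohort].append(trial)
    return {
        cohort: {
            "wer": sum(t.word_errors for t in items) / sum(t.reference_words for t in items),
            "intent": sum(t.intent_ok for t in items) / len(items),
        }
        for cohort, items in buckets.items()
    }


def failing_cohorts(report: dict[str, dict[str, float]], min_intent: float = 0.5) -> list[str]:
    """Cohorts whose intent-success rate falls below the release gate."""
    return sorted(c for c, metrics in report.items() if metrics["intent"] < min_intent)
```

Gating on `failing_cohorts` rather than on a pooled average matters because a strong baseline cohort can lift the aggregate above the gate while a dysarthric cohort sits well below it.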

Conclusion: voice-UI accessibility is a distribution problem disguised as a UX problem

The matrix above is sobering, but it is also legible. Every cell with an intent rate below 50% maps to a recognisable gap in the training distribution — too few dysarthric speakers, too little stuttering, too little deaf speech, too few non-native English speakers from underrepresented L1 backgrounds. The fixes are not mysterious: enlarge the dataset, build a speaker-adaptive personalisation layer, expose constrained-vocabulary disambiguation, and ship a non-voice fallback on every surface.

Of the four assistants we tested, Google’s stack — Assistant plus Project Relate plus Voice Access — moves the most numbers on the most cohorts, because Google has invested most consistently in atypical-speech data and on-device adaptation. Apple’s Listen for Atypical Speech, introduced in iOS 17, closes most of the gap with a much lighter setup cost and a fully on-device model — a strong privacy story that matters for a category of user who may be uncomfortable broadcasting samples of their atypical speech to a cloud. Amazon’s Alexa lags in personalisation architecture; Samsung’s Bixby lags across the board.

For designers, the takeaway is that the assistant your users land on sets the recognition floor; the patterns you wrap around it decide how far above that floor they get. Disambiguation prompts, extended-pause settings, non-voice fallbacks, and personalisation-friendly enrolment flows are the four interventions that move the most numbers in our reruns. None of them requires a research team, only a design system that treats speakers with atypical speech as first-class users, not an edge case.

“The voice-UI accessibility gap is mostly a training-distribution gap with a thin layer of UX on top. Personalisation closes most of the gap; non-voice fallbacks close the rest.”

— Disability World engineering desk, May 2026