Live-caption accuracy benchmark — six services, one panel, one professional CART writer in the back of the room
We ran six live-captioning services through three 60-minute test sessions: Otter.ai, Google Meet captions, Zoom captions, Microsoft Teams captions, Cisco Webex captions, and StreamText (operator-driven). Each session carried the same prepared script — eight panel speakers with mixed accents (American, British, Indian English, Bulgarian, Singaporean, French), seventeen named entities including five deliberately code-named products, two passages of dense engineering jargon, and three minutes of scripted crosstalk. Every session was simultaneously captioned by a professional CART writer at 220+ WPM, whose transcript served as the gold standard. Measured composite word-error rate (WER) ranged from 3.1% (human CART) to 14.8% (the worst-performing automated service). Median end-to-end latency ranged from 0.9 s to 5.6 s. Two services hit the SAS-LIVE certification floor on jargon recall. Most did not.
What the benchmark reveals
- 01 · 2.4×
The least-accurate automated service posted roughly 2.4 times the WER of the most-accurate one
Otter.ai posted a composite WER of approx. 6.2% across the three sessions. Cisco Webex posted approx. 14.8%. That is not a marginal spread — that is the difference between a transcript a Deaf participant can follow in real time and a transcript that requires post-meeting reconstruction.
- 02 · 3.1%
A human CART writer still outperforms every automated service by a wide margin
Our control CART writer (certified RPR, 240 WPM sustained) posted a composite WER of approx. 3.1% — roughly half the error rate of the best automated service and a fifth of the worst. The gap widens further on named entities and overlapping speech, where the human paraphrases gracefully and the machine guesses.
- 03 · 0.9 s
Median latency between speech and on-screen caption varied from under one second to nearly six
Google Meet posted the fastest median latency at approx. 0.9 s. Microsoft Teams ran at approx. 1.4 s. Webex sat at approx. 2.7 s. StreamText (operator-driven) averaged approx. 3.8 s. Zoom’s cloud-side captions, on a non-US region, hit approx. 5.6 s — slow enough that a Deaf participant trying to ask a clarifying question is already two utterances behind.
- 04 · 47%
Code-named entities were recovered correctly less than half the time across the automated services
Of the five deliberately code-named products in the script (e.g. “Halcyon”, “Bramble”, “Crosshatch”), the automated services as a group recovered the correct spelling in approx. 47% of utterances. The human CART writer recovered them in 96% of utterances — because we briefed her with the glossary in advance. Three of the six services accept a custom vocabulary; the other three do not.
- 05 · 2 of 6
Only two of the six services announce caption updates to assistive technology via a proper ARIA live region
Otter.ai’s web client and Google Meet’s caption pane both expose updates through aria-live="polite" regions that a screen-reader user can subscribe to. Zoom, Teams, Webex, and StreamText render captions in DOM nodes that are not announced, meaning a Deaf-blind user on a braille display gets no signal that new text has appeared.
- 06 · 5.4×
Crosstalk degrades accuracy more than accent or jargon do
During the three-minute scripted crosstalk passage, the average automated WER jumped from approx. 7.9% (single-speaker baseline) to approx. 42.6% — a 5.4× degradation. Accent variation alone moved WER by 1.8×; jargon density by 2.1×. Two-speaker overlap is the failure mode that no commercial automated service has yet solved.
- 07 · 3
Three providers carry SAS-LIVE certification; only one of them topped our accuracy ranking
SAS-LIVE (the Speech-Accessibility Standard for live captioning, ratified 2024) certifies providers against a published WER floor of 8% on a curated corpus. Otter.ai, StreamText, and one Microsoft Teams configuration carry the certification at the time of writing. Otter.ai topped our composite ranking; StreamText placed third; the certified Teams configuration placed fourth.
Source — Three 60-minute test sessions recorded 4–6 May 2026 with eight scripted panel speakers, identical script across sessions, simultaneous human CART control. Audio routed via Loopback into each platform’s native captioning path. Transcripts diffed against the CART control using NIST sclite for WER.
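The multipliers quoted in the cards above can be rechecked from the percentages in the same cards. A minimal sketch of that arithmetic follows; the helper name `ratio` is ours, and the inputs are only the figures already quoted.

```python
def ratio(worse: float, better: float) -> float:
    """How many times larger `worse` is than `better`."""
    return worse / better


# Composite WER figures quoted in the cards above, in percent.
CART, OTTER, WEBEX = 3.1, 6.2, 14.8
# Average automated WER: single-speaker baseline vs scripted crosstalk.
BASELINE, CROSSTALK = 7.9, 42.6

print(f"worst vs best automated: {ratio(WEBEX, OTTER):.1f}x")         # 2.4x
print(f"worst automated vs CART: {ratio(WEBEX, CART):.1f}x")          # 4.8x
print(f"best automated vs CART:  {ratio(OTTER, CART):.1f}x")          # 2.0x
print(f"crosstalk vs baseline:   {ratio(CROSSTALK, BASELINE):.1f}x")  # 5.4x
```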
Methodology and test conditions
A live-captioning benchmark stands or falls on the control. We commissioned three identical 60-minute sessions on three separate days. Each session followed the same prepared script: a moderator opening, four scripted speaker turns of approximately seven minutes each, two open-discussion passages totalling eleven minutes, a three-minute scripted crosstalk passage with two and occasionally three speakers overlapping, and a closing wrap.
Eight remote panellists read from the script. They were briefed on cadence but not on the purpose of the test. Accents represented: General American (two speakers), Received Pronunciation (one), Indian English (one), Bulgarian-accented English (one), Singaporean English (one), French-accented English (one), Scottish English (one). The script included seventeen named entities: twelve real (UN agencies, statute citations, publicly known product names) and five fictional code-names invented for this benchmark.
Each session was simultaneously captioned through all six services. Audio was routed via a Loopback aggregate device into each platform’s native captioning path; no third-party speech-recognition layer was inserted. The professional CART writer joined as a participant on a hidden line and her transcript was time-stamped against the same audio. Word-error rate was computed against the CART transcript using NIST sclite with case-insensitive scoring and standard substitution/insertion/deletion weights.
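For readers who have not met the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below illustrates that definition with a plain unit-cost edit distance over lower-cased, whitespace-split tokens. It is not sclite and does not reproduce sclite's alignment or text normalisation; the names `ErrorCounts` and `word_errors` are ours.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def tokenize(text: str) -> list[str]:
    # Case-insensitive scoring: lower-case, then split on whitespace.
    # Punctuation handling is deliberately left out of this sketch.
    return text.lower().split()


def word_errors(reference: str, hypothesis: str) -> ErrorCounts:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)      # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)      # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match, no cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitute, delete, insert)
    _, s, d, n = dp[-1][-1]
    return ErrorCounts(s, d, n, len(ref))


counts = word_errors("Halcyon ships in the third quarter",
                     "halcyon chips in third quarter")
print(counts)                 # 1 substitution, 1 deletion, 0 insertions, 6 reference words
print(f"{counts.wer:.3f}")    # 0.333
```

The example sentence is invented for illustration and is not taken from the benchmark script.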
The composite ranking
Composite WER is the unweighted mean of per-session WER across the three sessions, scored against the CART control. The headline ranking, lowest WER first:
The choice between two enterprise-grade conferencing platforms can mean the difference between a 6% and a 15% word-error rate. That is not a tooling difference. That is an inclusion difference.
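Because the composite is defined as an unweighted mean of per-session WER, each session counts equally regardless of how many words it contains. The sketch below contrasts that with a pooled figure (total errors over total reference words). The per-session counts are hypothetical and chosen only to make the difference visible; with three sessions of near-identical length, as in this benchmark, the two figures would be close.

```python
from statistics import mean


def composite_wer(session_wers: list[float]) -> float:
    """Unweighted mean of per-session WER: every session counts equally."""
    if not session_wers:
        raise ValueError("need at least one session")
    return mean(session_wers)


def pooled_wer(session_counts: list[tuple[int, int]]) -> float:
    """For comparison only: total word errors over total reference words."""
    errors = sum(e for e, _ in session_counts)
    words = sum(n for _, n in session_counts)
    return errors / words


# Hypothetical (word errors, reference words) per session, NOT benchmark data.
counts = [(300, 6000), (420, 6000), (270, 3000)]
per_session = [e / n for e, n in counts]          # 0.05, 0.07, 0.09

print(f"{composite_wer(per_session):.3f}")        # 0.070
print(f"{pooled_wer(counts):.3f}")                # 0.066
```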
WER by speaker condition
Composite WER hides the texture. To see where each service breaks, we partitioned the audio into four conditions: clean single-speaker American English, mixed-accent single-speaker, jargon-dense passages, and scripted crosstalk. The same six services on the same audio, broken out by condition:
The chart compresses the headline finding into a single image: accent variation is a real penalty, jargon is a larger one, and overlapping speech is a cliff. In the crosstalk passage, the worst-performing automated service dropped to a WER above 60% — at which point the transcript is, in the polite phrase of the SAS-LIVE rubric, “not communicatively reliable.”
Commercial speech-recognition pipelines assume one acoustic stream per speaker. Modern systems use diarisation to assign chunks of audio to speaker IDs, but diarisation runs after segmentation — and during overlap, segmentation itself fails. The result is a single output channel into which two utterances are merged, producing a transcript that is grammatical but factually wrong about who said what. A human CART writer solves this by paraphrasing one of the overlapping speakers and prefixing the other with a name tag. No deployed automated service does this in 2026.
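To score overlap separately, one first has to know where it occurs. The sketch below shows one way to find the time ranges in which two or more speakers are active, given a reference timeline of speaker turns. The `Segment` type and the sample timeline are hypothetical, and the code assumes a single speaker's own segments never overlap each other.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap_regions(segments: list[Segment]) -> list[tuple[float, float]]:
    """Return merged intervals during which at least two speakers are active."""
    events = []
    for seg in segments:
        events.append((seg.start, +1))
        events.append((seg.end, -1))
    # Tuples sort by time, then by delta; -1 sorts before +1, so a turn that
    # ends at the instant another begins is not counted as overlap.
    events.sort()

    regions: list[tuple[float, float]] = []
    active = 0
    start = None
    for t, delta in events:
        before = active
        active += delta
        if before < 2 <= active:
            start = t                                  # overlap begins
        elif active < 2 <= before:                     # overlap ends
            if regions and regions[-1][1] == start:
                regions[-1] = (regions[-1][0], t)      # extend a touching region
            elif t > start:
                regions.append((start, t))
    return regions


timeline = [
    Segment("A", 0.0, 5.0),
    Segment("B", 3.0, 8.0),
    Segment("C", 5.0, 9.0),
    Segment("A", 12.0, 15.0),
]
print(overlap_regions(timeline))   # [(3.0, 8.0)]
```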
Latency on the wire
Latency was measured as the elapsed time between the waveform peak of a spoken syllable and the appearance of the corresponding token in the platform’s caption DOM, captured via a high-frame-rate screen recording aligned to the audio waveform. Median latency across the three sessions:
Latency matters because conversational repair has a window. The Deaf-studies literature on real-time captioning converges on a usable ceiling of roughly two seconds — beyond that, a Deaf participant cannot ask a clarifying question while it is still relevant. By that test, Google Meet, Teams, and Otter clear the bar; Webex sits at the edge; StreamText and Zoom do not.
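The latency figure itself is simple to derive once each spoken syllable has been paired with the moment its token appears on screen. A minimal sketch, assuming both timestamps are on the same clock; the function names, the sample timestamps, and the hard 2.0 s constant are illustrative, and the ceiling in the literature is approximate rather than a sharp cut-off.

```python
from statistics import median

ROUGH_CEILING_S = 2.0  # "roughly two seconds", treated here as a hard number


def median_latency(pairs: list[tuple[float, float]]) -> float:
    """pairs: (speech_peak_s, caption_shown_s) timestamps on one shared clock."""
    latencies = [shown - spoken for spoken, shown in pairs]
    if not latencies:
        raise ValueError("need at least one aligned pair")
    if any(lat < 0 for lat in latencies):
        raise ValueError("caption appears before speech: check the alignment")
    return median(latencies)


def within_repair_window(median_s: float, ceiling_s: float = ROUGH_CEILING_S) -> bool:
    return median_s <= ceiling_s


# Hypothetical aligned timestamps for one platform, not benchmark data.
pairs = [(10.0, 10.9), (12.0, 12.8), (15.0, 16.0)]
print(f"{median_latency(pairs):.2f} s")   # 0.90 s

# Applied to two of the medians quoted in this article:
print(within_repair_window(0.9))          # True  (Google Meet)
print(within_repair_window(5.6))          # False (Zoom, non-US region)
```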
StreamText’s higher latency is partly architectural (it is operator-driven, so a human keystroke is in the loop) and partly the price of its lower WER on jargon. Zoom’s latency in our setup is harder to defend. On a US region with cloud captioning enabled, prior published benchmarks have reported sub-three-second medians, so the 5.6 s median in our European-region tests most likely reflects regional infrastructure rather than the platform’s ceiling.
Names, jargon, and the glossary problem
Of the seventeen named entities in the script, five were code-names invented for this benchmark. They were chosen to be plausible product names that do not appear as product names in any public corpus: Halcyon, Bramble, Crosshatch, Sandstorm, Verity. The first three are common English words; the last two are less common. We expected even the best automated services to struggle on the rare-vocabulary cases, and they did.
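Code-name recall can be scored per utterance: for every reference utterance that mentions a glossary term, check whether the caption contains the same spelling. The sketch below is one way to do that, assuming case-insensitive whole-word matching (in line with the case-insensitive WER scoring). The function names and sample utterances are ours and are not taken from the benchmark script.

```python
import re

GLOSSARY = ["Halcyon", "Bramble", "Crosshatch", "Sandstorm", "Verity"]


def contains_term(text: str, term: str) -> bool:
    # Whole-word, case-insensitive match, so "cross hatch" does not count.
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def entity_recall(pairs: list[tuple[str, str]], glossary: list[str]) -> float:
    """pairs: (reference utterance, caption utterance), already aligned."""
    hits = total = 0
    for reference, caption in pairs:
        for term in glossary:
            if contains_term(reference, term):
                total += 1
                hits += contains_term(caption, term)
    if total == 0:
        raise ValueError("no glossary term occurs in the reference")
    return hits / total


# Invented utterances for illustration only.
pairs = [
    ("Halcyon ships next week", "halcyon ships next week"),
    ("Bramble depends on Crosshatch", "Bramble depends on cross hatch"),
    ("Verity is delayed", "Variety is delayed"),
]
print(entity_recall(pairs, GLOSSARY))   # 0.5
```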
The lesson is operational. Custom vocabulary is the single largest accuracy lever a meeting organiser controls. The three services that accept a pre-loaded glossary (Otter, Teams, and the Azure-backed cloud configurations of Webex that we did not test) reliably outperform those that do not. Where the audience includes Deaf or hard-of-hearing participants and the meeting involves jargon or proper nouns, the absence of a custom-vocabulary slot is a meaningful accessibility limitation, not a missing convenience feature.
SAS-LIVE certifies a captioning provider against a published corpus and a published WER floor (8% at the time of writing). Certification is meaningful as a floor — it means the provider has demonstrated that its pipeline can clear 8% on the certifying audio — but it is not a ceiling. Our benchmark used a different corpus (mixed-accent panel speech with crosstalk), and the certified services ranged from 6.2% (Otter) to 9.6% (Teams) on our audio. Treat SAS-LIVE as a procurement filter, not a substitute for testing on the audio your organisation actually produces.
Assistive-tech integration
WER measures whether the transcript is correct. AT integration measures whether a user with a screen reader, braille display, or low-vision magnifier can actually consume the transcript in real time. The two are not the same. A perfectly accurate transcript rendered into a DOM node with no aria-live attribute is invisible to a Deaf-blind user on a braille display, because the assistive technology never receives the signal that new text has appeared.
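A first-pass check for the live-region property can be run against a saved snapshot of the caption pane's markup. The sketch below uses Python's standard `html.parser` to list elements that either set `aria-live` to `polite` or `assertive`, or carry a role with implicit live-region semantics in WAI-ARIA (`alert`, `status`, `log`). It is a static check only and an assumption about how such an audit could start; whether a screen reader or braille display actually announces updates still has to be verified with the assistive technology itself. The sample markup is invented.

```python
from html.parser import HTMLParser

# Roles that WAI-ARIA defines with an implicit aria-live value.
IMPLICIT_LIVE_ROLES = {"alert": "assertive", "status": "polite", "log": "polite"}


class LiveRegionFinder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.regions: list[tuple[str, str]] = []   # (tag, politeness)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        live = (attr.get("aria-live") or "").lower()
        role = (attr.get("role") or "").lower()
        # An explicit aria-live value (including "off") wins over the role default.
        politeness = live or IMPLICIT_LIVE_ROLES.get(role, "")
        if politeness in ("polite", "assertive"):
            self.regions.append((tag, politeness))


def find_live_regions(html: str) -> list[tuple[str, str]]:
    finder = LiveRegionFinder()
    finder.feed(html)
    finder.close()
    return finder.regions


announced = '<div id="captions" aria-live="polite"><span>Hello</span></div>'
silent = '<div id="captions"><span>Hello</span></div>'

print(find_live_regions(announced))   # [('div', 'polite')]
print(find_live_regions(silent))      # []
```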
We audited each platform’s caption pane for four AT-integration properties: live-region announcement, transcript export at end of meeting, focusable controls, and keyboard shortcut to toggle captions. The matrix:
The AT-integration column reorders the ranking in interesting ways. Otter remains in first place, but Teams, which placed fourth on WER, climbs to a tie for second on AT integration. Webex sits at the bottom on both axes. A Deaf-blind user on a braille display is best served by Otter or Google Meet in the current generation of products.
What the human CART writer still does better
The control CART writer outperformed every automated service on every measured axis. WER 3.1% versus the best automated 6.2%. Code-name recall 96% versus the best automated 71%. Crosstalk WER approximately 9% — a number no automated service came within thirty points of.
But the human advantage is not only mechanical. Several editorial behaviours are still uniquely human. The CART writer paraphrased speakers who stumbled, preserving meaning at the expense of verbatim fidelity, where automated services either drop the stumbled phrase or render it as nonsense. She tagged every change of speaker with a name prefix, where automated services interleave turns without attribution. She inserted a clarifying note in square brackets when a speaker referenced a slide the captioned audience could not see. None of these moves shows up in a WER score, but each is part of why a professionally CART-captioned meeting feels accessible in a way that an automated one rarely does.
The benchmark in context
The headline finding is not that one service won. It is that the spread between best and worst is wide enough that platform choice is itself an accessibility decision. An organisation that defaulted to Webex because it was already in the procurement stack will deliver a transcript with more than twice the error rate of an organisation that defaulted to Otter — for the same speaker, the same script, the same audio. That is not a marginal difference.
The second finding is that automated captioning is not yet a substitute for a human CART writer in conditions where accuracy actually matters: legal proceedings, medical consultations, board meetings, classroom instruction. The 3.1% / 6.2% gap looks small on a sheet of numbers and feels large to a Deaf participant trying to follow a fast-moving conversation. Where the stakes warrant the cost, a human CART writer is still the gold standard, and the SAS-LIVE certification framework explicitly preserves that hierarchy.
The third finding is operational. Custom vocabulary is the most under-used accessibility lever in meeting operations. Three of the six services we tested accept a pre-loaded glossary. Almost none of the organisations we spoke to during the design of this benchmark were using that feature, even where it was available on the tier they had already paid for. Loading the meeting’s proper nouns and product names into the captioning service before the meeting is a five-minute task that closes most of the named-entity gap.