Editorial · Benchmark dossier · Live captioning

Live-caption accuracy benchmark — six services, one panel, one professional CART writer in the back of the room

We ran six live-captioning services through three 60-minute test sessions: Otter.ai, Google Meet captions, Zoom captions, Microsoft Teams captions, Cisco Webex captions, and StreamText (operator-driven). Each session carried the same prepared script: eight panel speakers across seven accent backgrounds (American, British (RP), Scottish, Indian English, Bulgarian, Singaporean, French), seventeen named entities including five deliberately code-named products, two passages of dense engineering jargon, and three minutes of scripted crosstalk. Every session was simultaneously captioned by a professional CART writer sustaining 240 WPM, whose transcript served as the gold standard. Measured composite word-error rate (WER) ranged from 3.1% (human CART) to 14.8% (the worst-performing automated service). Median end-to-end latency ranged from 0.9 s to 5.6 s. Two automated services came in under the SAS-LIVE certification floor of 8% WER on our audio. Most did not.

Findings · Case file LC-BENCH-2607 entries · derived from 3 sessions × 6 services + 1 human CART control

What the benchmark reveals

01 · 2.4×
The least-accurate automated service posted roughly 2.4 times the WER of the most accurate
Otter.ai posted a composite WER of approx. 6.2% across the three sessions. Cisco Webex posted approx. 14.8%. That is not a marginal spread — that is the difference between a transcript a Deaf participant can follow in real time and a transcript that requires post-meeting reconstruction.
02 · 3.1%
A human CART writer still outperforms every automated service by a wide margin
Our control CART writer (certified RPR, 240 WPM sustained) posted a composite WER of approx. 3.1% — roughly half the error rate of the best automated service and a fifth of the worst. The gap widens further on named entities and overlapping speech, where the human paraphrases gracefully and the machine guesses.
03 · 0.9 s
Median latency between speech and on-screen caption varied from under one second to nearly six
Google Meet posted the fastest median latency at approx. 0.9 s. Microsoft Teams ran at approx. 1.4 s and Otter.ai at approx. 1.9 s. Webex sat at approx. 2.7 s. StreamText (operator-driven) came in at approx. 3.8 s. Zoom’s cloud-side captions, on a non-US region, hit approx. 5.6 s, slow enough that a Deaf participant trying to ask a clarifying question is already two utterances behind.
04 · 47%
Code-named entities were recovered correctly less than half the time across the automated services
Of the five deliberately code-named products in the script (e.g. “Halcyon”, “Bramble”, “Crosshatch”), the automated services as a group recovered the correct spelling in approx. 47% of utterances. The human CART writer recovered them in 96% of utterances — because we briefed her with the glossary in advance. Three of the six services accept a custom vocabulary; the other three do not.
05 · 2 of 6
Only two of the six services announce caption updates to assistive technology via a proper ARIA live region
Otter.ai’s web client and Google Meet’s caption pane both expose updates through aria-live="polite" regions, which a screen reader announces as new text arrives. Zoom, Teams, Webex, and StreamText render captions in DOM nodes that are not announced, meaning a Deaf-blind user on a braille display gets no signal that new text has appeared.
06 · 5.4×
Crosstalk degrades accuracy more than accent or jargon do
During the three-minute scripted crosstalk passage, the average automated WER jumped from approx. 7.9% (single-speaker baseline) to approx. 42.6%, a 5.4× degradation. Measured against the clean US-English baseline of approx. 4.1%, accent variation alone moved WER by 1.8× and jargon density by 2.1×, while crosstalk moved it by 10.4×. Two-speaker overlap is the failure mode that no commercial automated service has yet solved.
07 · 3
Three providers carry SAS-LIVE certification; only one of them stayed under the 8% floor on our audio
SAS-LIVE (the Speech-Accessibility Standard for live captioning, ratified 2024) certifies providers against a published WER floor of 8% on a curated corpus. Otter.ai, StreamText, and one Microsoft Teams configuration carry the certification at the time of writing. Otter.ai topped our composite ranking; StreamText placed third; the certified Teams configuration placed fourth.

Source — Three 60-minute test sessions recorded 4–6 May 2026 with eight scripted panel speakers, identical script across sessions, simultaneous human CART control. Audio routed via Loopback into each platform’s native captioning path. Transcripts diffed against the CART control using NIST sclite for WER.

In this report

01 Methodology and test conditions
02 The composite ranking
03 WER by speaker condition
04 Latency on the wire
05 Names, jargon, and the glossary problem
06 Assistive-tech integration
07 What the human CART writer still does better
08 The benchmark in context

Methodology and test conditions

A live-captioning benchmark stands or falls on the control. We commissioned three identical 60-minute sessions on three separate days. Each session followed the same prepared script: a moderator opening, four scripted speaker turns of approximately seven minutes each, two open-discussion passages totalling eleven minutes, a three-minute scripted crosstalk passage with two and occasionally three speakers overlapping, and a closing wrap.

Eight remote panellists read from the script. They were briefed on cadence but not on the test purpose. Accents represented: General American (two speakers), Received Pronunciation (one), Indian English (one), Bulgarian-accented English (one), Singaporean English (one), French-accented English (one), Scottish English (one). The script included seventeen named entities: twelve real (UN agencies, statute citations, publicly known product names) and five fictional code-names invented for this benchmark.

Each session was simultaneously captioned through all six services. Audio was routed via a Loopback aggregate device into each platform’s native captioning path; no third-party speech-recognition layer was inserted. The professional CART writer joined as a participant on a hidden line and her transcript was time-stamped against the same audio. Word-error rate was computed against the CART transcript using NIST sclite with case-insensitive scoring and standard substitution/insertion/deletion weights.
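As a rough illustration of what the scoring step computes, the sketch below implements word-level WER with unit substitution, insertion, and deletion costs and case-insensitive matching. It is a simplified stand-in, not sclite itself: the function name `word_error_rate` and the sample sentences are hypothetical, and sclite's own alignment and text-normalisation rules are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: "dashboard" is captioned as "dash board".
reference = "please open the Halcyon dashboard now"
hypothesis = "please open the halcyon dash board now"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

In the example, one substitution plus one insertion against six reference words gives 2/6. Because the denominator is the reference length, a hypothesis with many insertions can push WER above 100%.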

01 Script lock: Identical 60-minute script across three sessions, panellists not told what was being measured.

02 Audio routing: Loopback aggregate device fed each platform’s native captioning path simultaneously.

03 Human control: RPR-certified CART writer joined hidden, sustained 240 WPM, served as gold standard.

04 Scoring: NIST sclite, case-insensitive, standard weights. Latency measured by waveform-to-DOM timestamp.

3 test sessions

8 panel speakers

17 named entities

180 total caption-minutes per service

The composite ranking

Composite WER is the unweighted mean of per-session WER across the three sessions, scored against the CART control. The headline ranking, lowest WER first:

Otter.ai (Pro tier, custom vocabulary loaded) · SAS-LIVE certified · web client · approx. 6.2% composite WER

Google Meet captions (workspace business) · Not SAS-LIVE certified · approx. 7.9% composite WER

StreamText (operator-driven, human-corrected) · SAS-LIVE certified · approx. 8.4% composite WER

Microsoft Teams (with custom-vocabulary enabled) · SAS-LIVE certified configuration · approx. 9.6% composite WER

Zoom (cloud captioning, non-US region) · Not SAS-LIVE certified · approx. 11.7% composite WER

Cisco Webex captions (default configuration) · Not SAS-LIVE certified · approx. 14.8% composite WER

The composite ranking spans a 2.4× spread between best and worst automated service, wide enough that the choice of platform is itself an accessibility decision, not a procurement detail. The human CART control at 3.1% sets the gold standard; the worst automated service posted 4.8× that error rate. Only Otter.ai and Google Meet came in under the SAS-LIVE 8% certification floor.
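The ratios implied by the published composite figures can be recomputed directly. The sketch below uses only the numbers from the ranking; the variable names are illustrative, and treating "under the floor" as strictly below 8% is an assumption.

```python
# Composite WER (%) as published in the ranking.
composite_wer = {
    "Otter.ai": 6.2,
    "Google Meet": 7.9,
    "StreamText": 8.4,
    "Microsoft Teams": 9.6,
    "Zoom": 11.7,
    "Cisco Webex": 14.8,
}
HUMAN_CART_WER = 3.1
SAS_LIVE_FLOOR = 8.0  # assumption: "under the floor" means strictly below 8%

best = min(composite_wer, key=composite_wer.get)
worst = max(composite_wer, key=composite_wer.get)

automated_spread = composite_wer[worst] / composite_wer[best]
worst_vs_human = composite_wer[worst] / HUMAN_CART_WER
best_vs_human = composite_wer[best] / HUMAN_CART_WER
under_floor = [name for name, wer in composite_wer.items() if wer < SAS_LIVE_FLOOR]

print(f"{worst} vs {best}: {automated_spread:.1f}x")
print(f"{worst} vs human CART: {worst_vs_human:.1f}x")
print(f"{best} vs human CART: {best_vs_human:.1f}x")
print("Under the SAS-LIVE floor on this audio:", ", ".join(under_floor))
```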

The choice between two enterprise-grade conferencing platforms can mean the difference between a 6% and a 15% word-error rate. That is not a tooling difference. That is an inclusion difference.

WER by speaker condition

Composite WER hides the texture. To see where each service breaks, we partitioned the audio into four conditions: clean single-speaker American English, mixed-accent single-speaker, jargon-dense passages, and scripted crosstalk. The same six services on the same audio, broken out by condition:

AVERAGE WER BY SPEAKER CONDITION — AUTOMATED SERVICES POOLED

Clean US-English · approx. 4.1%

Mixed-accent · approx. 7.4%

Jargon-dense · approx. 8.6%

Crosstalk (2–3 speakers) · approx. 42.6%

The chart compresses the headline finding into a single image: accent variation is a real penalty, jargon is a larger one, and overlapping speech is a cliff. In the crosstalk passage, the worst-performing automated service dropped to a WER above 60% — at which point the transcript is, in the polite phrase of the SAS-LIVE rubric, “not communicatively reliable.”

4.1% · WER on clean US-English single-speaker, automated average

42.6% · WER on scripted crosstalk, automated average

10.4× · degradation factor, clean to crosstalk
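The degradation factors follow from dividing each pooled condition WER by the clean US-English baseline. A minimal sketch using the published pooled figures (the dictionary keys are just labels):

```python
# Pooled automated WER (%) by speaker condition, as published above.
pooled_wer = {
    "Clean US-English": 4.1,
    "Mixed-accent": 7.4,
    "Jargon-dense": 8.6,
    "Crosstalk (2-3 speakers)": 42.6,
}

baseline = pooled_wer["Clean US-English"]
degradation = {condition: wer / baseline for condition, wer in pooled_wer.items()}

for condition, factor in degradation.items():
    print(f"{condition:<26} {factor:4.1f}x")
```

Rounded to one decimal place, the factors come out at 1.8× for mixed accents, 2.1× for jargon, and 10.4× for crosstalk.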

Why crosstalk breaks every automated service

Commercial speech-recognition pipelines assume one active speaker per acoustic stream at any given moment. Modern systems use diarisation to assign chunks of audio to speaker IDs, but diarisation runs after segmentation, and during overlap segmentation itself fails. The result is a single output channel into which two utterances are merged, producing a transcript that is grammatical but factually wrong about who said what. A human CART writer solves this by paraphrasing one of the overlapping speakers and prefixing the other with a name tag. None of the automated services we tested does this in 2026.
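To put a number on how much of a session is exposed to this failure mode, overlapped time can be estimated from speaker-turn timestamps. The sketch below is a hypothetical illustration, not part of the benchmark tooling: it assumes turns are available as (speaker, start, end) tuples in seconds, for example from a time-stamped reference transcript, and the turn values are invented.

```python
def overlapped_seconds(turns):
    """Total time during which two or more speakers are active at once."""
    events = []
    for _speaker, start, end in turns:
        events.append((start, +1))  # a speaker starts talking
        events.append((end, -1))    # a speaker stops talking
    # Tuples sort by time, then by delta, so an end (-1) is processed before
    # a start (+1) at the same instant: back-to-back turns are not overlap.
    events.sort()

    active = 0
    overlap = 0.0
    previous_time = None
    for time, delta in events:
        if previous_time is not None and active >= 2:
            overlap += time - previous_time
        active += delta
        previous_time = time
    return overlap


# Invented turns: two panellists talk over the end of the moderator's turn.
turns = [
    ("moderator", 0.0, 10.0),
    ("panellist_a", 8.0, 12.0),
    ("panellist_b", 9.0, 11.0),
    ("moderator", 12.0, 15.0),
]
print(f"{overlapped_seconds(turns):.1f} s of overlapping speech")
```

Two or more voices are active from 8 s to 11 s, so the function returns 3.0 seconds for these turns.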

Latency on the wire

Latency was measured as the elapsed time between the waveform peak of a spoken syllable and the appearance of the corresponding token in the platform’s caption DOM, captured via a high-frame-rate screen recording aligned to the audio waveform. Median latency across the three sessions:

MEDIAN END-TO-END LATENCY — LOWER IS BETTER

Google Meet · approx. 0.9 s

Microsoft Teams · approx. 1.4 s

Otter.ai · approx. 1.9 s

Webex · approx. 2.7 s

StreamText · approx. 3.8 s

Zoom (non-US region) · approx. 5.6 s

Latency matters because conversational repair has a window. The Deaf-studies literature on real-time captioning converges on a usable ceiling of roughly two seconds; beyond that, a Deaf participant cannot ask a clarifying question while it is still relevant. By that test, Google Meet (0.9 s), Teams (1.4 s), and Otter (1.9 s) clear the bar; Webex (2.7 s) sits just past it; StreamText (3.8 s) and Zoom (5.6 s) do not.
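The measurement and the two-second test can be sketched as follows. The frame rate comes from the methodology note; the observation pairs are invented, the helper `caption_latency` is hypothetical, and reading "roughly two seconds" as exactly 2.0 s is an assumption.

```python
from statistics import median

FRAME_RATE_HZ = 120     # screen-recording rate given in the methodology note
REPAIR_CEILING_S = 2.0  # assumption: "roughly two seconds" taken as exactly 2.0 s


def caption_latency(speech_peak_s: float, first_frame_index: int) -> float:
    """Seconds between a syllable's waveform peak and the first frame showing its token.

    Assumes the screen recording and the audio share a common time origin.
    """
    return first_frame_index / FRAME_RATE_HZ - speech_peak_s


# Invented observations: (waveform peak in seconds, first frame index showing the token).
observations = [(1.00, 228), (4.25, 630), (9.10, 1188), (12.40, 1620), (15.75, 1998)]
latencies = [caption_latency(peak, frame) for peak, frame in observations]
median_latency = median(latencies)
print(f"median latency over {len(latencies)} tokens: {median_latency:.2f} s")

# Published medians (seconds), checked against the assumed ceiling.
published_median_s = {
    "Google Meet": 0.9,
    "Microsoft Teams": 1.4,
    "Otter.ai": 1.9,
    "Webex": 2.7,
    "StreamText": 3.8,
    "Zoom (non-US region)": 5.6,
}
within_ceiling = [name for name, s in published_median_s.items() if s <= REPAIR_CEILING_S]
print("Within the ceiling:", ", ".join(within_ceiling))
```

At 120 frames per second, each latency reading is quantised to about 8.3 ms, which is small next to the sub-second differences between services.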

StreamText’s higher latency is partly architectural (it is operator-driven, so a human keystroke is in the loop) and partly the price of its lower WER on jargon. Zoom’s latency in our setup is harder to interpret: on a US region with cloud captioning enabled, prior published benchmarks have reported sub-three-second medians, so the 5.6 s median in our European-region tests probably reflects regional infrastructure rather than the platform’s ceiling.

Names, jargon, and the glossary problem

Of the seventeen named entities in the script, five were code-names invented for this benchmark. The five were chosen to be plausible product names but not present in any public corpus: Halcyon, Bramble, Crosshatch, Sandstorm, Verity. The first three are common English words; the last two are less common. We expected even the best automated services to struggle on the rare-vocabulary cases, and they did.

Human CART writer (briefed with glossary) · 96% correct recall of code-named entities

Otter.ai (custom vocabulary loaded) · 71% correct recall; custom vocabulary made the difference

Microsoft Teams (custom vocabulary loaded) · 59% correct recall

StreamText (operator-driven, no advance glossary) · 52% correct recall

Google Meet (no custom-vocabulary option) · 38% correct recall

Zoom + Webex (no custom-vocabulary option) · approx. 24% correct recall pooled; guessed phonetic homophones

The lesson is operational. Custom vocabulary is the single largest accuracy lever a meeting organiser controls. Three services accept a pre-loaded glossary: Otter, Teams, and Webex (only in Azure-backed cloud configurations, which we did not test). The two we tested with a glossary loaded, Otter and Teams, posted the highest code-name recall of any automated service. Where the audience includes Deaf or hard-of-hearing participants and the meeting involves jargon or proper nouns, the absence of a custom-vocabulary slot is a meaningful accessibility limitation, not a missing convenience feature.
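One plausible way to score code-name recall is per utterance: of the reference utterances that contain the code-name, count those whose caption contains the same spelling. The sketch below assumes that definition with case-insensitive, whole-word matching; the function `entity_recall` and the utterance pairs are hypothetical, not the benchmark's own scoring script.

```python
import re


def tokens(text: str) -> list[str]:
    """Lower-cased word tokens with punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def entity_recall(utterance_pairs, entity: str) -> float:
    """Share of reference utterances naming `entity` whose caption spells it correctly."""
    target = entity.lower()
    scored = [(ref, hyp) for ref, hyp in utterance_pairs if target in tokens(ref)]
    if not scored:
        raise ValueError(f"no reference utterance mentions {entity!r}")
    hits = sum(1 for _ref, hyp in scored if target in tokens(hyp))
    return hits / len(scored)


# Invented (reference, caption) pairs for illustration.
pairs = [
    ("Halcyon ships in June.", "Halcyon ships in June"),
    ("We migrated to Halcyon last week.", "we migrated to hal sion last week"),
    ("Bramble is unaffected.", "bramble is unaffected"),
    ("Is Halcyon ready?", "is halcyon ready"),
]
print(f"Halcyon recall: {entity_recall(pairs, 'Halcyon'):.0%}")
```

Under this definition a phonetic near-miss such as "hal sion" counts as a miss, which is the behaviour a pre-loaded glossary is meant to correct.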

A note on the SAS-LIVE certification

SAS-LIVE certifies a captioning provider against a published corpus and a published WER floor (8% at the time of writing). Certification is meaningful as a floor — it means the provider has demonstrated that its pipeline can clear 8% on the certifying audio — but it is not a ceiling. Our benchmark used a different corpus (mixed-accent panel speech with crosstalk), and the certified services ranged from 6.2% (Otter) to 9.6% (Teams) on our audio. Treat SAS-LIVE as a procurement filter, not a substitute for testing on the audio your organisation actually produces.

Assistive-tech integration

WER measures whether the transcript is correct. AT integration measures whether a user with a screen reader, braille display, or low-vision magnifier can actually consume the transcript in real time. The two are not the same. A perfectly accurate transcript rendered into a DOM node with no aria-live attribute is invisible to a Deaf-blind user on a braille display, because the assistive technology never receives the signal that new text has appeared.
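A static check for the live-region property can be sketched with the standard-library HTML parser. The sketch assumes well-formed markup and a known caption element id; both markup snippets are invented and do not reproduce any vendor's actual DOM. It treats an element as announced if it, or its nearest ancestor that sets the property, carries aria-live="polite" or "assertive", or one of the ARIA roles that imply a live region (alert, log, status). It is no substitute for testing with a real screen reader or braille display.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
IMPLICIT_LIVE_ROLES = {"alert", "log", "status"}


class LiveRegionAudit(HTMLParser):
    """Records whether the element with `caption_id` sits inside a live region."""

    def __init__(self, caption_id: str):
        super().__init__()
        self.caption_id = caption_id
        self.open_elements = []  # one flag per open element: inside a live region?
        self.announced = None    # stays None if the caption element is never seen

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        inherited = self.open_elements[-1] if self.open_elements else False
        explicit = attributes.get("aria-live")
        if explicit is not None:
            in_live_region = explicit in ("polite", "assertive")
        elif attributes.get("role") in IMPLICIT_LIVE_ROLES:
            in_live_region = True
        else:
            in_live_region = inherited
        if attributes.get("id") == self.caption_id:
            self.announced = in_live_region
        if tag not in VOID_TAGS:
            self.open_elements.append(in_live_region)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_elements:
            self.open_elements.pop()


def captions_are_announced(html: str, caption_id: str) -> bool:
    audit = LiveRegionAudit(caption_id)
    audit.feed(html)
    audit.close()
    if audit.announced is None:
        raise LookupError(f"no element with id={caption_id!r}")
    return audit.announced


# Invented caption panes for illustration.
announced_pane = '<section><div aria-live="polite"><p id="captions">Halcyon ships in June.</p></div></section>'
silent_pane = '<section><div class="pane"><p id="captions">Halcyon ships in June.</p></div></section>'

print(captions_are_announced(announced_pane, "captions"))
print(captions_are_announced(silent_pane, "captions"))
```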

We audited each platform’s caption pane for four AT-integration properties: live-region announcement, transcript export at end of meeting, focusable controls, and keyboard shortcut to toggle captions. The matrix:

Otter.ai web client · All four: aria-live polite · export · focusable · keyboard toggle · 4 of 4

Google Meet · aria-live polite · no native export · focusable · keyboard toggle · 3 of 4

Microsoft Teams · No aria-live · export available · focusable · keyboard toggle · 3 of 4

StreamText embed · No aria-live · export available · partial focus · no keyboard toggle · 2 of 4

Zoom desktop client · No aria-live · export available · partial focus · keyboard toggle · 2 of 4

Cisco Webex · No aria-live · export available · not focusable · no keyboard toggle · 1 of 4

The AT-integration column reorders the ranking in interesting ways. Otter remains in first place; but Teams, which placed fourth on WER, climbs to a tie for second on AT integration. Webex sits at the bottom on both axes. A Deaf-blind user on a braille display is best served by Otter or Google Meet in the current generation of products.

What the human CART writer still does better

The control CART writer outperformed every automated service on every measured axis. WER 3.1% versus the best automated 6.2%. Code-name recall 96% versus the best automated 71%. Crosstalk WER approximately 9% — a number no automated service came within thirty points of.

But the human advantage is not only mechanical. Several editorial behaviours are still uniquely human. The CART writer paraphrased speakers who stumbled, preserving meaning at the expense of a verbatim record, where automated services either drop the stumbled phrase or render it as nonsense. She tagged speaker turns with a name prefix on every change of speaker, where automated services interleave without attribution. She inserted a clarifying note in square brackets when a speaker referenced a slide the captioned audience could not see. None of these moves shows up in a WER score, but each is part of why a professionally CART-captioned meeting feels accessible in a way that an automated one rarely does.

CART writer, post-session debrief

The hardest moment in a panel like this is not a thick accent or a technical term. It is two people speaking at once and a third coming in to laugh. I will paraphrase one, queue the other, and tag the laughter. The machine cannot decide which voice to drop, so it drops both into the same line. That line is then technically captioned and practically useless.

— CART writer, session 02 debrief, 5 May 2026

The benchmark in context

The headline finding is not that one service won. It is that the spread between best and worst is wide enough that platform choice is itself an accessibility decision. An organisation that defaulted to Webex because it was already in the procurement stack will deliver a transcript with more than twice the error rate of an organisation that defaulted to Otter — for the same speaker, the same script, the same audio. That is not a marginal difference.

The second finding is that automated captioning is not yet a substitute for a human CART writer in conditions where accuracy actually matters: legal proceedings, medical consultations, board meetings, classroom instruction. The 3.1% / 6.2% gap looks small on a sheet of numbers and feels large to a Deaf participant trying to follow a fast-moving conversation. Where the stakes warrant the cost, a human CART writer is still the gold standard, and the SAS-LIVE certification framework explicitly preserves that hierarchy.

The third finding is operational. Custom vocabulary is the most under-used accessibility lever in meeting operations. Three of the six services we tested accept a pre-loaded glossary. Almost none of the organisations we spoke to during the design of this benchmark were using that feature, even where it was available on the tier they had already paid for. Loading the meeting’s proper nouns and product names into the captioning service before the meeting is a five-minute task that closes most of the named-entity gap.

Methodology and data: Three 60-minute test sessions recorded on 4, 5, and 6 May 2026. Eight scripted panel speakers across seven accent backgrounds. Identical script across sessions, including a three-minute scripted crosstalk passage. Audio routed via Loopback aggregate device into each platform’s native captioning path simultaneously. Professional CART writer (RPR-certified, 240 WPM sustained) joined hidden as session control. WER computed against the CART control using NIST sclite with case-insensitive scoring and standard substitution / insertion / deletion weights. Latency measured by waveform-to-DOM-render timestamp on screen recordings sampled at 120 frames per second. AT-integration audit conducted using NVDA 2026.1, VoiceOver on macOS 14.5, and BrailleBack on a Focus 40 Blue display.

Standards context: SAS-LIVE (Speech-Accessibility Standard for live captioning) was ratified in 2024 and establishes a WER floor of 8% on a curated corpus as the threshold for certification. The standard does not certify latency, named-entity recall, or AT integration — those are separate procurement questions. WCAG 2.2 SC 1.2.4 (Captions, Live) requires captions for live audio in synchronised media but does not specify accuracy thresholds.

What this article is not: A vendor procurement recommendation. The benchmark reflects three sessions on a specific script in a specific acoustic environment. A production deployment will produce different numbers on different audio, and any organisation buying captioning for a Deaf or hard-of-hearing audience should run its own test on its own speakers before signing a contract. This article is not legal advice and does not establish any particular WER as a regulatory floor under the ADA, EAA, AODA, or any national equivalent.