Captions — accessibility glossary

Captions are a text representation of all meaningful audio content in a video: dialogue, speaker identification, sound effects, and music cues. They exist primarily for deaf and hard-of-hearing viewers, but they are also widely used by hearing viewers in noisy environments, by people learning the spoken language, and by some autistic viewers who find text easier to process than spoken audio.

Captions vs subtitles

These two terms get conflated constantly. The operative distinction:

Captions are for deaf and hard-of-hearing viewers. They transcribe all audio: dialogue, plus speaker labels (“[NARRATOR]:”), plus sound effects (“[door slams]”), plus music cues (“[suspenseful music]”). They’re in the same language as the original audio.
Subtitles are for hearing viewers who don’t understand the language of the original audio. They translate dialogue only (no sound effects, no music cues) into a language different from the original audio.

Streaming services have muddled this by labelling everything “subtitles” or “CC” without distinction. For accessibility, what matters is whether the text content is captions-style (includes all meaningful audio) or subtitles-style (dialogue-translation only).
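As a rough illustration of that test, the sketch below guesses whether a track is captions-style or subtitles-style by looking for speaker labels and bracketed non-speech cues. The function name classify_track and both regular expressions are hypothetical, and the heuristic assumes the bracket conventions used in the examples above; a track that marks sound effects some other way would be misclassified.

```python
import re

# Assumed conventions (not a standard): speaker labels look like
# "[NARRATOR]:" or "NARRATOR:", and non-speech cues look like "[door slams]".
SPEAKER_LABEL = re.compile(r"^\s*(?:\[[^\]]+\]|[A-Z][A-Z .'-]+):")
NON_SPEECH_CUE = re.compile(r"\[[^\]]+\](?!:)")


def classify_track(cue_texts):
    """Guess whether a list of cue strings is captions-style or subtitles-style."""
    has_speaker_label = any(SPEAKER_LABEL.search(text) for text in cue_texts)
    has_non_speech_cue = any(NON_SPEECH_CUE.search(text) for text in cue_texts)
    if has_speaker_label or has_non_speech_cue:
        return "captions-style"
    return "subtitles-style"
```

A result of "subtitles-style" only means no such markers were found; it is a prompt for human review, not a verdict.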

Closed vs open captions

Closed captions are stored as a separate text track that the user can enable or disable. This is the standard approach for web video (typically WebVTT files) and for broadcast TV (for example, CEA-708 in US digital television).
Open captions are burned into the video pixels themselves and cannot be disabled. Used when no separate caption track is supported (some social-media platforms, some legacy contexts).

WCAG accepts either; closed captions are operationally preferable because they leave the original video unmodified.
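To make the "separate text track" concrete, here is a minimal sketch that writes a WebVTT document from a list of cues. The helper names format_timestamp and build_webvtt and the sample cues are illustrative assumptions; only the file layout (a WEBVTT header, then timed cues separated by blank lines) comes from the WebVTT format itself.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        if end <= start:
            raise ValueError(f"cue does not end after it starts: {text!r}")
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


# Hypothetical sample cues: speaker label, sound effect, music cue.
sample_cues = [
    (0.0, 2.5, "[NARRATOR]: The house had been empty for years."),
    (2.5, 4.0, "[door slams]"),
    (4.0, 7.25, "[suspenseful music]"),
]
vtt_text = build_webvtt(sample_cues)
```

On a web page, a file like this is attached to the video with a track element whose kind attribute is "captions", which is what lets the player switch it on and off without touching the video itself.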

What WCAG requires

1.2.2 Captions (Prerecorded), Level A: captions for all prerecorded audio content in synchronised media.
1.2.4 Captions (Live), Level AA: captions for all live audio content in synchronised media.
1.2.6 Sign Language (Prerecorded), Level AAA: sign-language interpretation for prerecorded audio, in addition to captions.

Criterion 1.2.2 exempts a “media alternative for text” (a video that exists only as an alternative presentation of text already on the page), provided it is clearly labelled as such. Those cases are rare.

What goes wrong in production

Auto-generated captions shipped untouched. YouTube and most video platforms generate captions automatically. Accuracy drops on accented speech, technical vocabulary, and background noise, with word accuracy typically in the 85-95% range. There is no single statutory accuracy threshold, but 99% is the benchmark commonly applied to professional captioning, and captions well below it are unlikely to be accepted as accessible. Auto-generated captions are a starting point, not a shipping product.
Missing speaker identification. Two-person dialogue with no labels: deaf viewers can’t tell who’s saying what.
No sound-effect cues. The plot turns on a sound the deaf viewer has no way to know happened.
Bad timing. Captions appearing on screen 2 seconds after the dialogue, or staying up after the speaker has moved on. Timing precision matters.
Low-contrast captions. White text on bright video with no background or shadow. Effectively invisible.
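The low-contrast failure can be checked numerically. The sketch below computes the WCAG 2.x contrast ratio between a caption colour and a sampled background colour. Using the 4.5:1 figure from WCAG 1.4.3 as a yardstick for caption text is an assumption made here for illustration, and the function names and sample colours are hypothetical.

```python
def _linearise(channel_8bit):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) colour with 0-255 channels."""
    r, g, b = (_linearise(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio between two colours, from 1.0 up to about 21.0."""
    lum_a, lum_b = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)
BRIGHT_SKY = (200, 220, 255)   # hypothetical pixel sampled behind the caption
BLACK_BOX = (0, 0, 0)          # opaque background box behind the text

unboxed_ok = contrast_ratio(WHITE, BRIGHT_SKY) >= 4.5   # white text on bright video
boxed_ok = contrast_ratio(WHITE, BLACK_BOX) >= 4.5      # same text on a black box
```

Because the video behind the text changes from frame to frame, a solid or semi-opaque background box is the simplest way to make the result independent of the footage.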

The minimum quality bar is broadcast-standard captions: at least 99% word accuracy, cues timed to within ±50 ms of the audio, full speaker identification, and sound cues.
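The accuracy and timing figures in that bar can be measured against a human-verified reference. The sketch below is one way to do it, assuming word accuracy is defined as 1 minus the word error rate (word-level edit distance divided by reference length) and that timing is judged by the largest gap between caption onset and speech onset. The function names, the punctuation-stripping normalisation, and the default thresholds passed to meets_quality_bar are illustrative choices, not a published measurement method.

```python
import string


def _words(text):
    """Lower-case, split on whitespace, and strip surrounding punctuation."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in stripped if word]


def word_accuracy(reference, hypothesis):
    """Return 1 - WER, where WER = word-level edit distance / reference length."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the captions
                current[j - 1] + 1,      # extra word in the captions
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return max(0.0, 1 - previous[-1] / len(ref))


def max_timing_offset_ms(caption_starts_ms, speech_starts_ms):
    """Largest absolute gap between paired caption and speech onsets, in ms."""
    if not caption_starts_ms or len(caption_starts_ms) != len(speech_starts_ms):
        raise ValueError("need two non-empty onset lists of equal length")
    return max(abs(c - s) for c, s in zip(caption_starts_ms, speech_starts_ms))


def meets_quality_bar(accuracy, offset_ms, min_accuracy=0.99, max_offset_ms=50):
    """Apply the accuracy and timing thresholds from the quality bar."""
    return accuracy >= min_accuracy and offset_ms <= max_offset_ms
```

Speaker identification and sound cues are not covered by either metric and still need a human check. A single wrong word in a 100-word passage already puts the track exactly at the 99% line, which is why unreviewed automatic captions rarely pass.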