Producing Audio Textbooks: DAISY to AI Narration

By Disability World | Reading time: 10 minutes

Image description: A professional studio microphone beside an open textbook with headphones and audio controls — the visual marker for audio-textbook production.

A textbook is not a podcast. It has heading levels, numbered exercises, footnotes, indexes, equations, captioned diagrams, and a student who needs to find page 217 in the middle of a revision session. Producing it as audio means producing all of that — not just the prose. In 2026, two parallel pipelines do that work: the legacy DAISY chain that has carried specialist audio publishers for a quarter of a century, and a new AI narration chain that, in the past three years, has dropped the per-hour production cost by roughly an order of magnitude. They are not interchangeable. Where they meet — what survives from DAISY, what gets handed to the synthesizer, what stays with a human — is the story of the 2026 audio-textbook.

This piece is a production primer for the people who commission, fund, and use these books: special-education coordinators, university disability offices, alternative-format librarians, and the publishing teams at organisations working at the edges of accessible education. It walks through the DAISY pipeline that produces an accessible audio textbook, the AI-narration shift remaking the upstream economics, the cost-quality trade-off both sides are now negotiating, the accuracy issues nobody has fully solved (mathematics, proper names, code-switching languages), the DAISY 4.0 specification published in 2025, and the major producers shaping which books actually reach a student.

What “DAISY” actually means

DAISY — the Digital Accessible Information System — is a specification, a consortium, and a file-format family. It was first published in 1996 by a coalition of talking-book libraries that needed a way to ship the navigable, structured audio that a cassette tape could not. The two specifications that still anchor the format are DAISY 2.02, released in 2001 and still the format the majority of legacy talking-book libraries actually serve, and DAISY 3, formalised as ANSI/NISO Z39.86 in 2002 and revised in 2012 and again in 2024. The 2024 update — Z39.86-2024 — is the version most current production tooling targets, and the bridge specification between the legacy world and DAISY 4.0.

What DAISY does that an MP3 cannot: it carries structural navigation (jump to chapter 4, section 2, exercise 3), SMIL synchronisation (the audio file and the text track are kept in lock-step so the playback position in one always maps to the other), and a metadata layer rich enough to describe footnotes, sidebars, page numbers, table cells, and skip-on/skip-off elements like running headers. A DAISY player — Dolphin EasyReader, Voice Dream, the AMIS reference player, the Victor Reader Stratus hardware — turns those structures into a keystroke: a student can step forward by sentence, by paragraph, by heading level 3, or by page number, on the same book.
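As a rough illustration of what "step forward by heading level or by page number" means in data terms, the sketch below models a book as a flat list of navigation points with audio offsets. The `NavPoint` class, the `next_point` helper, and the sample book are all hypothetical and are not taken from any DAISY player; real players read these structures from the package's navigation files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavPoint:
    kind: str        # "heading" or "page"
    level: int       # heading depth (1-6); 0 for page markers
    label: str
    offset_s: float  # where this point starts in the audio, in seconds


def next_point(points, position_s, kind, level=None):
    """Return the first nav point after position_s that matches kind
    (and level, when one is given), or None if there is none."""
    for point in sorted(points, key=lambda p: p.offset_s):
        if (point.offset_s > position_s and point.kind == kind
                and (level is None or point.level == level)):
            return point
    return None


# Hypothetical sample book, for illustration only.
BOOK = [
    NavPoint("heading", 1, "Chapter 4", 0.0),
    NavPoint("heading", 2, "4.1 Reaction rates", 12.5),
    NavPoint("page", 0, "217", 40.0),
    NavPoint("heading", 3, "Exercise 3", 95.0),
    NavPoint("heading", 2, "4.2 Equilibrium", 180.0),
]

if __name__ == "__main__":
    # The listener is 20 seconds in and presses "next level-2 heading".
    print(next_point(BOOK, 20.0, "heading", level=2).label)
    # The same listener presses "next page" instead.
    print(next_point(BOOK, 20.0, "page").label)
```

The same list answers both kinds of request, which is the property a plain MP3 lacks: the audio has no notion of where a heading or a page begins.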

The legacy DAISY production pipeline

Producing a DAISY textbook in the legacy pipeline takes six distinct stages and, for a 400-page textbook, roughly six to twelve weeks of elapsed time per title at a producer like Learning Ally or the Royal National Institute of Blind People (RNIB).

Stage 1 — source preparation. The publisher supplies a print PDF or, increasingly, an EPUB. Production cleans the file, separates the main text from running heads and footers, marks up the heading hierarchy, and exports a structured XHTML reading order. Diagrams and equations are flagged for separate handling.
Stage 2 — narration. A trained human narrator records the prose in a studio session. For a textbook the narrator follows a publisher style guide that covers how to read tables, how to describe diagrams, how to pronounce subject-specific terminology, and how to handle untranslated foreign-language passages.
Stage 3 — editing and quality assurance. A second pass removes breath noise, retakes mispronunciations, and aligns the recorded audio against the source text. A QA reader listens against the print for accuracy.
Stage 4 — SMIL synchronisation. Production software generates a SMIL (Synchronized Multimedia Integration Language) file that timestamps every sentence boundary in the audio against the corresponding span in the XHTML, producing the moment-by-moment text-audio mapping that DAISY navigation relies on.
Stage 5 — packaging. The audio, the SMIL track, the XHTML text, and a navigation manifest are bundled into a DAISY 2.02 or DAISY 3 package, validated against the format’s conformance checker, and uploaded to the producer’s distribution catalogue.
Stage 6 — distribution. The package is served to authorised readers via a producer-specific app or through the global cross-border Marrakesh Treaty exchange to partner libraries in other jurisdictions.
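The synchronisation step in Stage 4 can be sketched as follows. This is a simplified illustration, assuming sentence boundaries have already been measured and that the text file carries one fragment identifier per sentence. The element and attribute names follow SMIL 2.0 as used by DAISY 3, but the output is not a complete, conformance-checked DAISY SMIL file (it omits the head metadata a real package needs), and `build_smil`, `fmt_clock`, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/2001/SMIL20/"


def fmt_clock(seconds: float) -> str:
    """Format a time in seconds as an h:mm:ss.mmm clock value."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_smil(sentences, text_file, audio_file):
    """Build a minimal SMIL document.

    sentences: list of (fragment_id, begin_seconds, end_seconds), one per
    sentence, in reading order.
    """
    smil = ET.Element("smil", xmlns=SMIL_NS)
    seq = ET.SubElement(ET.SubElement(smil, "body"), "seq")
    for frag_id, begin, end in sentences:
        # Each <par> pairs one text fragment with one audio clip.
        par = ET.SubElement(seq, "par", id=f"par_{frag_id}")
        ET.SubElement(par, "text", src=f"{text_file}#{frag_id}")
        ET.SubElement(par, "audio", src=audio_file,
                      clipBegin=fmt_clock(begin), clipEnd=fmt_clock(end))
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    timings = [("s1", 0.0, 3.25), ("s2", 3.25, 7.9)]
    print(build_smil(timings, "chapter4.xml", "chapter4.mp3"))
```

Each `par` element is one row of the text-audio mapping: a player that knows the current playback time can find the sentence being read, and a player that knows the current sentence can find where to seek.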

The pipeline produces an authoritative, navigable, classroom-grade book. It is also expensive. In the legacy human-narrated DAISY chain, the cost per finished hour of audio sits at roughly 45 to 75 US dollars across the major producers, a figure largely unchanged since the mid-2010s and driven almost entirely by studio time, narrator fees, and editorial QA.

The AI-narration pipeline

The change that has moved the audio-textbook conversation in 2024–26 is the arrival of neural text-to-speech voices that are, for the first time, close enough to a human narrator that the question of whether to use them is no longer answered automatically with “no”. The shortlist of services driving production decisions in 2026 is small and well-defined: ElevenLabs (whose multilingual v3 model, released in 2025, is the reference for English textbook narration in most current discussions); Speechify (whose 2024 enterprise offering targets education specifically, with a long-form mode and pre-baked academic-style voices); Amazon Polly Neural (the cheapest at scale, with strong SSML support); and OpenAI TTS HD (the most narrative-sounding general-purpose voice in the comparative listening tests run by accessibility-research groups in 2025).

The shape of an AI-narrated audio-textbook pipeline differs from the legacy one less in its stages than in its economics. Source preparation, structure markup, and packaging all remain. Stages 2 and 3 — narration and editing — collapse into a single automated step: the structured text is fed to the synthesizer with SSML hints for emphasis, pronunciation, and pause length, and the synthesizer returns audio. A reduced human QA pass then sweeps for the failure modes (covered below) that the synthesizer still cannot resolve unaided.
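To make "structured text with SSML hints" concrete, here is a minimal sketch that turns a list of headings and paragraphs into a single SSML string. The `speak`, `p`, `emphasis`, and `break` elements are standard SSML, but support varies by provider and voice, so treat the exact markup as an assumption to check against the synthesizer in use. The `to_ssml` function, the block format, and the pause lengths are hypothetical.

```python
from xml.sax.saxutils import escape


def to_ssml(blocks, heading_pause_ms=800, paragraph_pause_ms=400):
    """Convert (kind, text) blocks into one SSML document.

    kind is "heading" or "paragraph". Headings are emphasised and followed
    by a longer pause so the listener hears the structure.
    """
    parts = ["<speak>"]
    for kind, text in blocks:
        body = escape(text)  # &, < and > must not leak into the markup
        if kind == "heading":
            parts.append(
                f'<p><emphasis level="strong">{body}</emphasis></p>'
                f'<break time="{heading_pause_ms}ms"/>'
            )
        else:
            parts.append(f'<p>{body}</p><break time="{paragraph_pause_ms}ms"/>')
    parts.append("</speak>")
    return "".join(parts)


if __name__ == "__main__":
    chapter = [
        ("heading", "4.1 Acids & bases"),
        ("paragraph", "An acid donates a proton; a base accepts one."),
    ]
    print(to_ssml(chapter))
```

The point of the sketch is that the structure marked up in source preparation is what drives the hints: without a reliable heading hierarchy there is nothing to hang the pauses and emphasis on.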

The cost change is the headline. Where the legacy chain produces a finished hour at roughly 45 to 75 dollars, AI narration at production scale lands between roughly 3 and 7 dollars per hour at the major providers in 2026, a reduction of roughly an order of magnitude. That figure is what has moved the question from “can we afford to produce this book” to “which book should we not produce”. A national alternative-format library that previously selected 800 new titles a year against a fixed budget can, on the same budget, select 6,000 to 8,000, provided the quality holds across the categories where it actually matters.
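The arithmetic behind the order-of-magnitude claim can be checked directly from the figures quoted above. The sketch below uses only those figures; the variable names are illustrative, and the midpoint comparison is one reasonable reading rather than the article's own calculation.

```python
# Per-finished-hour cost ranges quoted in the text, in US dollars.
LEGACY_COST = (45.0, 75.0)
AI_COST = (3.0, 7.0)

# Most conservative comparison: cheapest legacy hour vs dearest AI hour.
lowest_ratio = LEGACY_COST[0] / AI_COST[1]
# Most generous comparison: dearest legacy hour vs cheapest AI hour.
highest_ratio = LEGACY_COST[1] / AI_COST[0]
# Midpoint of each range.
midpoint_ratio = (sum(LEGACY_COST) / 2) / (sum(AI_COST) / 2)

# The library example: 800 titles a year becoming 6,000 to 8,000.
TITLES_BEFORE = 800
TITLES_AFTER = (6_000, 8_000)
implied_ratio = tuple(after / TITLES_BEFORE for after in TITLES_AFTER)

if __name__ == "__main__":
    print(f"narration cost ratio: {lowest_ratio:.1f}x to {highest_ratio:.1f}x "
          f"(midpoints: {midpoint_ratio:.1f}x)")
    print(f"ratio implied by the library example: "
          f"{implied_ratio[0]:.1f}x to {implied_ratio[1]:.1f}x")
```

The narration-only ratio spans roughly 6x to 25x, with the midpoints at 12x, while the library example implies 7.5x to 10x more titles. Both sit within the same order of magnitude.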

The cost-quality trade-off

“Quality” in audio-textbook production is not a single dimension. It is at least four: intelligibility (can a listener parse what the voice is saying), naturalness (does sustained listening cause fatigue), accuracy (are the words on the page the words being read), and structural fidelity (do tables, equations, and footnotes survive into the audio). Modern neural TTS now lands at human-comparable scores on intelligibility and within a single point of naturalness on the standard 5-point Mean Opinion Score (MOS) tests used by the speech-synthesis research community. Where the gap remains visible is on accuracy and structural fidelity.

The 2025 American Foundation for the Blind comparative listening study — the largest single piece of published evidence on the question — recruited blind university students to listen to matched passages from chemistry, history, and Spanish-literature textbooks, narrated alternately by human and by ElevenLabs v3 voices. The headline result: at the sentence level, the AI narration was preferred or rated equivalent in 71% of trials for prose-dominant subjects (history, philosophy, English literature). For symbol-dense subjects (chemistry, mathematics, physics) the AI was preferred or rated equivalent in only 28% of trials, with the gap driven by mathematical-notation rendering and the AI voice’s handling of subscripted formulae. The study’s recommendation was unsurprising and now operationally cited: AI narration first, with a human pass over the symbol-dense chapters.

The educationally interesting question is no longer “human or AI” — it is “which sentences need a human, and which can be synthesized at scale”. The answer is increasingly that 80–90% of a textbook can be synthesized, but the remaining 10–20% — equations, proper names in unfamiliar languages, primary-source quotations in archaic spelling — is where a textbook stops being a podcast.

Mathematics, proper names, and the code-switching problem

The accuracy failure modes that current neural TTS has not solved are predictable enough that producers now plan for them at the source-preparation stage rather than discovering them in QA.

Mathematics. Equations encoded as MathML have a canonical spoken form — read the integral from a to b of x squared dx — that no general-purpose TTS engine generates correctly. Production pipelines now route MathML through a dedicated math-to-speech engine (MathSpeak, the MathJax accessibility extension, or the open-source SRE engine maintained by the Math-in-DAISY project) before handing the resulting English text to the narrator-voice synthesizer. The DAISY 4.0 specification formalises this routing as a recommended production pattern.
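The routing idea can be illustrated with a deliberately tiny MathML-to-words converter. This is a toy, not MathSpeak or any of the engines named above: it handles only identifiers, numbers, a few operators, superscripts, and fractions, and the spoken phrasing is an assumption chosen for readability. The function names and the `OPERATORS` table are hypothetical.

```python
import xml.etree.ElementTree as ET

# A handful of operators; a real engine covers hundreds of symbols.
OPERATORS = {"+": "plus", "-": "minus", "\u2212": "minus", "=": "equals"}


def speak(node):
    """Recursively turn one MathML element into spoken English text."""
    tag = node.tag.split("}")[-1]  # drop an XML namespace if present
    kids = [speak(child) for child in node]
    if not kids:  # leaf elements such as mi, mn, mo
        text = (node.text or "").strip()
        return OPERATORS.get(text, text) if tag == "mo" else text
    if tag == "msup" and len(kids) == 2:
        base, exponent = kids
        if exponent == "2":
            return f"{base} squared"
        return f"{base} to the power {exponent}"
    if tag == "mfrac" and len(kids) == 2:
        numerator, denominator = kids
        # Explicit start/end words keep the fraction's extent unambiguous.
        return f"start fraction {numerator} over {denominator} end fraction"
    # math, mrow and anything unrecognised: read the children in order.
    return " ".join(kid for kid in kids if kid)


def mathml_to_speech(markup):
    return speak(ET.fromstring(markup))


if __name__ == "__main__":
    equation = ("<math><mi>y</mi><mo>=</mo>"
                "<msup><mi>x</mi><mn>2</mn></msup>"
                "<mo>+</mo><mn>1</mn></math>")
    print(mathml_to_speech(equation))  # y equals x squared plus 1
```

Even this toy shows why the step is separate from narration: the converter needs the tree structure of the equation, which is lost once the formula has been flattened into a string for a general-purpose voice. It also shows the limits of a naive approach, since a superscript on a bracketed expression would be read ambiguously here.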

Proper names. Personal names, place names, organisation names, and subject-specific terminology are mispronounced in predictable ways. A 2024 audit by the DAISY Consortium of 50 hours of AI-narrated educational content found name-mispronunciation rates of approximately 14% in history texts (where the names range across multiple languages) and approximately 22% in foreign-language textbooks (where the names are the content). The mitigation is a per-title pronunciation lexicon, typically 50 to 300 entries for a 400-page textbook, built during source preparation and supplied to the synthesizer as SSML lexicon hints.
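One way to apply such a lexicon is to wrap each listed term in an SSML `phoneme` element before synthesis. The sketch below assumes the lexicon is a simple mapping from written form to an IPA transcription and that the target synthesizer honours `phoneme` with `alphabet="ipa"`, which should be checked per provider. The `apply_lexicon` function and the two sample entries (anglicised readings) are illustrative, not taken from any producer's lexicon.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical two-entry lexicon; a real title might carry 50 to 300 entries.
LEXICON = {
    "Cicero": "ˈsɪsəroʊ",
    "Pushkin": "ˈpʊʃkɪn",
}


def apply_lexicon(text, lexicon):
    """Return SSML-safe text with every lexicon term wrapped in <phoneme>."""
    if not lexicon:
        return escape(text)
    # Longest terms first, so a full name wins over a surname inside it.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    out, last = [], 0
    for match in pattern.finditer(text):
        term = match.group(1)
        out.append(escape(text[last:match.start()]))
        out.append(
            f"<phoneme alphabet=\"ipa\" ph={quoteattr(lexicon[term])}>"
            f"{escape(term)}</phoneme>"
        )
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)


if __name__ == "__main__":
    print(apply_lexicon("Cicero & Pushkin are quoted in chapter 2.", LEXICON))
```

Matching is case-sensitive and whole-word only, which suits proper names; a production lexicon would also need to handle inflected forms and names that double as common words.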

Code-switching languages. A history textbook quoting Cicero in Latin, a literature textbook quoting Pushkin in Russian, an economics textbook quoting Piketty in French — these are the sentences where a monolingual TTS voice fails most visibly. ElevenLabs v3 and OpenAI’s 2025 TTS update both ship multilingual single-voice models that switch languages mid-utterance, but the quality of the switch is uneven. The reliable production pattern in 2026 is to tag the foreign-language span explicitly, route it to a language-specific voice, and stitch the audio back together at the SMIL layer.
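The tag, route, and stitch pattern can be sketched in a few lines. The example assumes the foreign-language spans were tagged with a language code during source preparation. The voice identifiers, the `plan_jobs` and `stitch` helpers, and the sample durations are all hypothetical, and no real TTS service is called here.

```python
# Hypothetical mapping from language code to a synthesizer voice ID.
VOICES = {"en": "narrator-en", "la": "voice-la", "ru": "voice-ru", "fr": "voice-fr"}


def plan_jobs(spans, default_lang="en"):
    """Assign each (lang, text) span to a voice.

    Spans in a language with no dedicated voice fall back to the default
    narrator and are flagged for human review.
    """
    jobs = []
    for index, (lang, text) in enumerate(spans):
        voice = VOICES.get(lang)
        jobs.append({
            "index": index,
            "lang": lang,
            "text": text,
            "voice": voice or VOICES[default_lang],
            "needs_review": voice is None,
        })
    return jobs


def stitch(durations_s):
    """Turn per-span audio durations into (begin, end) offsets on one timeline."""
    offsets, cursor = [], 0.0
    for duration in durations_s:
        offsets.append((cursor, cursor + duration))
        cursor += duration
    return offsets


if __name__ == "__main__":
    sentence = [
        ("en", "As Cicero put it,"),
        ("la", "O tempora, o mores!"),
        ("en", "which the author renders freely."),
    ]
    for job in plan_jobs(sentence):
        print(job["voice"], "->", job["text"])
    # Durations would come back from the synthesizer; these are made up.
    print(stitch([2.0, 1.5, 3.0]))
```

The offsets returned by `stitch` are the values that would feed the clip begin and end times at the SMIL layer, so the listener still navigates one continuous sentence even though several voices produced it.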

DAISY 4.0: what the 2025 specification changes

DAISY 4.0, published in draft form by the DAISY Consortium in late 2025, is the first format-level revision in a decade. Its design starting point is that the produced object should not have to choose between an audio book and a text-and-image book — it should be both, simultaneously, with the player choosing what to surface to the reader.

Four changes matter most for textbook production. First, EPUB 3 alignment: DAISY 4.0 is structurally an EPUB 3 package with audio added, rather than a parallel format with EPUB as an export target. A producer that maintains an EPUB 3 textbook can produce its DAISY 4.0 audio edition by adding tracks, not by converting files. Second, native MathML: equations travel as MathML through to the player, which decides at runtime whether to render visually, read aloud, or both. Third, multi-voice provenance metadata: a DAISY 4.0 package can carry mixed human-narrated, AI-narrated, and math-engine-rendered spans, with each span attributed in metadata to its production method. An emerging set of national procurement rules is beginning to require this kind of transparency. Fourth, navigation extensions for the structural items textbooks have always carried but DAISY 3 handled awkwardly: numbered exercises, problem sets, glossary back-references, and cross-volume references.

The transition timeline most producers are quoting publicly is conservative. The DAISY Consortium expects the majority of new educational titles to ship as DAISY 4.0 by 2027–28, with the legacy DAISY 2.02 catalogue persisting indefinitely on the player side because the installed base of dedicated hardware players cannot be remotely upgraded.

The major producers and what they produce

Learning Ally, the US-based non-profit founded in 1948 as Recording for the Blind, holds the largest English-language audio-textbook catalogue in the world — approximately 80,000 titles as of 2026 — and remains substantially human-narrated, with a volunteer narrator network of roughly 1,000 active voices. Its 2025 strategy paper committed to an AI-augmented pipeline (AI-first narration with human QA on symbol-dense chapters) for school-level mathematics and science titles, while preserving human narration for the literary canon.

Bookshare, operated by Benetech, ships an EPUB-first catalogue — over 1.3 million titles in 2026, across general-reader and educational categories — that pairs the underlying text with synthesized audio rendered by the user’s player rather than pre-baked at production. The model is the cheapest at scale and the one most aligned with DAISY 4.0’s player-decides architecture.

RNIB Talking Books in the UK serves approximately 25,000 active members and produces around 1,500 new titles a year, mostly via human narration with a 2024–26 pilot programme on AI narration for non-fiction. Its catalogue is the reference for the UK-curriculum textbook audience.

The IFLA Libraries Serving Persons with Print Disabilities (LPD) Section coordinates the global producer network, and the WIPO-led Accessible Books Consortium (ABC) runs the cross-border catalogue under the Marrakesh Treaty, the mechanism by which a book produced in one signatory country can be lent across borders to authorised readers in another. ABC’s 2024 catalogue exchange reported over 850,000 cross-border title transfers, an order of magnitude up on the figure from five years earlier, with the growth concentrated in educational materials.

What this means for the student in 2026

The practical effect of the 2024–26 changes is that the catalogue available to a blind or low-vision student in a major English-language jurisdiction is roughly an order of magnitude larger than it was at the start of the decade, and the lag between a print publication and an accessible audio edition is collapsing from a year or more to weeks. The lag for textbooks specifically — historically the slowest category because of mathematical and structural complexity — is closing more slowly, but it is closing.

What has not changed is the floor of acceptable quality. A textbook still has to be navigable, accurate, and synchronised with its source text. DAISY 4.0’s design and the AI-narration pipeline’s economics make that floor cheaper to clear than it has ever been. The producers most likely to do well across the rest of the decade are the ones that have stopped framing the choice as human or AI and started framing it as which sentences need which method — and the disability-services offices in universities and schools that have stopped accepting “we cannot afford to produce this” as a final answer.
