LLC research demonstrated that captioning and transcription based on speech recognition were technically feasible, but revealed that commercially available dictation systems were unsuitable for these novel applications. The Consortium works collaboratively to develop specialized technologies designed to improve access to educational content.

Key Technology Challenges
Using speech recognition technology to caption and transcribe spontaneous speech is an extremely complex task because of the implementation, pedagogical, and technology variables involved. A number of core challenges drive ongoing research and development:

  1. Commercially available “dictation” systems provide core technology, but their underlying statistically derived acoustic and language models were created using a speech corpus based on samples of read and written language. Classroom speech, such as a lecture, is acoustically and linguistically distinct: it is characterized by pauses, disfluencies, and incomplete sentences. To maximize performance for captioning applications that attempt to recognize spontaneous speech, the underlying statistical models should, in theory, be built from a corpus of spontaneous speech.
  2. Speech recognition systems are available in only a limited number of languages and dialects. Given that accessibility is a global imperative, robust systems supporting a variety of languages must be developed.
  3. Dictation systems are typically speaker dependent, meaning they require training and the creation of a personal voice profile. Many busy instructors are unable or unwilling to engage in time-consuming training. This reality reinforces the need to construct easy-to-use, speaker-independent systems.
  4. Digitized text from dictation systems is typically displayed as a continuous stream of words when used to caption speech, which significantly reduces readability.
  5. Some real-time processing methods generate lengthy display lags that affect readability. Optimally, a captioning system displays text as quickly as possible after words are spoken.
  6. While most dictation systems can save the source media to allow playback and error correction, they usually use proprietary formats that prevent easy dissemination of synchronized multimedia. Providing synchronized multimedia in accessible formats would create greater flexibility.
  7. Recognition accuracy depends on many interrelated factors, including the underlying speech engine and statistical models, computer hardware, microphone quality, ambient noise, the instructor’s speech characteristics and presentation skills, and the domain or topic. Improving recognition accuracy (conversely, decreasing word error rates) is a longstanding goal for the scientific community regardless of the application in question.
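
To make the accuracy measure in item 7 concrete: word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the Consortium's own scoring tool. The function name and the sample sentences are hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption about text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; real scoring tools
    # usually also normalize punctuation, numbers, and so on.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    # prev[j] is the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical lecture sentence and recognizer output:
# one substitution ("controls" -> "control") and one deletion ("the").
spoken = "the cell membrane controls what enters the cell"
recognized = "the cell membrane control what enters cell"
print(word_error_rate(spoken, recognized))  # 2 errors / 8 words = 0.25
```

Because insertions count as errors, the rate can exceed 1.0 when the output is much longer than the reference, so it is not a percentage of words that were "wrong".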


Interested in systems used to caption live speech? Visit Real Time Captioning

Interested in systems used to transcribe educational media offline? Visit Hosted Transcription