Regardless of the task, Speech Recognition (SR) systems make recognition errors. To evaluate technology performance, the Consortium conducts ongoing evaluations and benchmarking of various SR-powered captioning and transcription systems.

Using SR to caption and transcribe live or digitized speech is a complex task, significantly different from using SR for dictation. The spontaneous speech typical of traditional lectures and presentations is acoustically, linguistically, and structurally different from read speech or speech used to create written documents. Recognition accuracy for this unique task is driven by a complex interaction of technical and non-technical variables, including the underlying speech engine, statistical models, computing hardware, microphones, audio quality, individual speech characteristics, and myriad other environmental factors.

The Consortium evaluates both “speaker dependent” and “speaker independent” systems and strives to understand and quantify the variables affecting accuracy. Speaker dependent, or trained, systems are typically characterized by the creation of a personal voice profile that contains data about an individual’s speech characteristics and language usage. For these systems, the quantity and quality of training data are highly correlated with recognition accuracy. Speaker independent systems generally do not retain data about individual users and are designed to work without prior knowledge of an individual speaker.

Evaluations focus on calculating recognition Accuracy and Word Error Rate. Accuracy is generally defined as the number of correctly transcribed words divided by the total number of words spoken. Conversely, Word Error Rate is the number of incorrect (misrecognized) words divided by the total number of words spoken; both are typically multiplied by 100 and expressed as percentages.

For example:

  • 100 words spoken, 80 words transcribed correctly, 20 words transcribed incorrectly
  • Accuracy = 80%
  • Word Error Rate= 20%
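The arithmetic in the example above can be sketched in a few lines of Python; the function names and word counts here are illustrative, not part of the Consortium's tooling:

```python
# Minimal sketch of the accuracy and Word Error Rate arithmetic
# described above, expressed as percentages.

def accuracy(correct_words: int, total_words: int) -> float:
    """Percentage of spoken words transcribed correctly."""
    return correct_words / total_words * 100

def word_error_rate(incorrect_words: int, total_words: int) -> float:
    """Percentage of spoken words misrecognized."""
    return incorrect_words / total_words * 100

# The worked example: 100 words spoken, 80 correct, 20 incorrect.
print(accuracy(80, 100))         # 80.0
print(word_error_rate(20, 100))  # 20.0
```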

SR systems are evaluated by transcribing recordings from a “Lecture Test Set” that includes over 120,000 words, sampled from different speakers chosen from a wide range of academic disciplines. Separate test sets are used to benchmark performance for various dialects (e.g., North American, UK, and Australian English). Word Error Rates are then calculated by comparing an edited reference transcript to the raw, unedited transcript generated by the system being evaluated. Word errors are also classified into categories in an effort to understand the factors that affect system performance:

  • Substitutions:  A spoken word is recognized as a different word or words
  • Insertions:  The SR system interprets a sound artifact as a word and inserts text; this often occurs when breathing is interpreted as a word
  • Deletions:  A word is spoken, but no word is recognized
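A common way to count these three error types is to align the reference transcript against the system's output with the standard edit-distance dynamic program. The sketch below is an illustration of that general technique, not the Consortium's actual scoring tool; the function name and example sentences are hypothetical:

```python
def wer_breakdown(reference: str, hypothesis: str):
    """Align a reference transcript against an SR hypothesis and
    count substitutions, insertions, and deletions (edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    R, H = len(ref), len(hyp)
    # cost[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i
    for j in range(1, H + 1):
        cost[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(cost[i - 1][j - 1],  # substitution
                                     cost[i][j - 1],      # insertion
                                     cost[i - 1][j])      # deletion
    # Backtrack through the table to classify each error.
    subs = ins = dels = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                    # correct word
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1         # substitution
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ins += 1; j -= 1                       # insertion
        else:
            dels += 1; i -= 1                      # deletion
    wer = (subs + ins + dels) / R * 100
    return subs, ins, dels, wer

# Hypothetical example: one word inserted by the recognizer.
print(wer_breakdown("hello world", "hello brave world"))  # (0, 1, 0, 50.0)
```

The total of the three counts divided by the reference length gives the Word Error Rate, which is why WER can in principle exceed 100% when a system inserts many spurious words.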

Evaluations to date include analysis of SR systems from IBM, Nuance, Google (YouTube’s automatic captioning feature), Adobe (Premiere Pro CS6), Microsoft’s speech engine, and Koemei.