A longstanding objective is to improve accuracy and reduce word error rates for speech recognition-based captioning and transcription. Speech recognition accuracy for any application is affected by a number of interrelated variables, including speaker characteristics, speaker dependence or independence, computer hardware, microphone quality, ambient noise, the recognition task, and the speech recognition engine and its associated statistical models.
The Consortium’s first generation Hosted Transcription technologies were configured for general transcription tasks, and their statistical models were developed from a broadcast news corpus. While this corpus provided a large data pool for creating robust, speaker-independent speech recognition systems, the statistical models were not tuned to the Consortium’s primary task/domain of lecture transcription. Accuracy should therefore increase when the speech recognition engine uses acoustic and language models matched to the recognition task/domain.
The Consortium is actively creating new statistical models for the lecture domain. To support model development, Consortium members compiled lecture data including audio/video recordings, verbatim transcripts with specialized editing conventions, and timing data. Our current database includes hundreds of hours of transcribed lecture recordings from North America, Europe, and Australia. An objective is to archive more than 1,000 hours by 2012.
Another challenge is the development of new language capability. Current speech recognition applications support a limited set of languages, including U.S. English, Mandarin Chinese, Spanish, and Arabic. However, other languages and dialects represented by Consortium members (e.g., German, Italian, Australian English, UK English) remain undeveloped or underdeveloped. As model development methodologies evolve, the Consortium will be able to create new language capability on demand.
The first generation “Lecture” models for North American English speakers were created and evaluated by calculating accuracy on a lecture test set. Results to date indicate that domain-specific data does improve recognition accuracy for lecture transcription tasks.
K. Bain, C. MacEachern. Developing speech recognition acoustic and language models for post-secondary lectures. July 2011.
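As a minimal sketch of how such an evaluation is typically scored, the word error rate (WER) compares a recognizer’s output against a verbatim reference transcript using word-level edit distance: WER = (substitutions + deletions + insertions) / number of reference words. The function and example sentences below are illustrative assumptions, not the Consortium’s actual scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("models" -> "model") in a 5-word reference: WER = 0.2
print(word_error_rate("the lecture covers statistical models",
                      "the lecture covers statistical model"))
```

Recognition accuracy is then commonly reported as 1 − WER over the whole test set.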
The Consortium also conducted experiments to measure the effect of adding text-only transcripts to the lecture model. Early results indicate that adding domain-specific text to language models does improve recognition accuracy for lecture transcription tasks.
K. Bain, C. MacEachern. Early effects of adding text transcripts to a speech recognition language model for post-secondary lectures. July 2011.
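One standard way to fold domain text into an existing language model, sketched below under simplifying assumptions, is linear interpolation: estimates from the domain transcripts are mixed with the background model, P(w) = λ·P_domain(w) + (1 − λ)·P_background(w). The unigram estimates and toy sentences here are hypothetical; production systems interpolate full n-gram models, but the principle is the same.

```python
from collections import Counter

def unigram_probs(text: str) -> dict:
    """Maximum-likelihood unigram estimates from a training text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(domain: dict, background: dict, lam: float) -> dict:
    """Linear interpolation: lam * P_domain + (1 - lam) * P_background."""
    vocab = set(domain) | set(background)
    return {w: lam * domain.get(w, 0.0) + (1 - lam) * background.get(w, 0.0)
            for w in vocab}

# Toy corpora standing in for broadcast news vs. lecture transcripts
background = unigram_probs("the news report covers the economy")
domain = unigram_probs("the lecture covers the theorem proof")
mixed = interpolate(domain, background, lam=0.7)
# Lecture-specific terms such as "lecture" and "theorem" gain probability
# mass relative to the background-only model.
```

The mixing weight λ is normally tuned on held-out domain data, e.g. by minimizing perplexity.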
To improve accuracy for a wider range of potential users, including speakers with discernible accents, the Consortium created a new Multilingual model. This model included data from a wider variety of speakers and was augmented with acoustic data from German and French corpora.
Planned Activities 2012-2013
1) Implementation of statistical models into new Hosted Transcription platforms
2) Adaptation of language models using text data
3) Supervised/Unsupervised adaptation experiments
4) New baseline acoustic/language model build and evaluation