Recent years have seen an increase in smart devices, such as phones, watches and speakers, that we use on a regular basis. The most natural way of interacting with these devices is through speech. As a consequence, speech technology, which enables this interaction, is having an ever-increasing impact on our everyday lives. It simplifies many tasks for most people, and it also makes some tasks accessible to people who, because of disabilities, would otherwise be excluded. Yet, for many real-life situations, current technology is not sufficiently advanced to be really useful. Spontaneous conversational speech, ambient noise and overlapping speech are among the conditions in which current speech technology still does not perform satisfactorily. Moreover, commercial solutions usually do not work as well for small languages such as Norwegian as they do for English and other languages spoken by larger populations.
SCRIBE’s objective is to improve speech technology in Norwegian by developing a speech-to-text transcription system for multi-party conversations in realistic recording conditions. To attain this goal, research and technology development beyond the state of the art is needed within several key areas. These include language-universal issues as well as issues specific to the Norwegian language. We will develop models that are robust to the disfluencies typical of spontaneous conversational speech, that can cope with turn-taking, and that take advantage of the context in the dialogue. The models will also support spoken dialects and different orthographies (Bokmål, Nynorsk, or dialect-specific). Our goal is that these advances will make it possible for speech technology to reach its potential in Norwegian and have a beneficial impact on Norwegian society.
Over the last three years, the SCRIBE project has contributed to the following results: i) substantial data collections and annotations that are essential to speech research for the Norwegian language, even beyond the scope of the project; ii) support for the development of state-of-the-art speech recognition systems based on adapting available general models for speech representation trained on huge amounts of (multilingual) speech; iii) development of new semantic evaluation metrics for the quality of automatic transcriptions that align better with human judgment than the current metrics, which treat all transcription errors as equally important; iv) analysis of automatic speech recognition systems with respect to Norwegian dialects; and v) studies on human perception of Norwegian dialects. Furthermore, in collaboration with related projects, SCRIBE has contributed to the development of speech recognition and pronunciation assessment for child speech, and to the collection of a unique first- and second-language child speech corpus for Norwegian.
SCRIBE will develop a Norwegian speech-to-text transcription system capable of transcribing multi-party conversations. Speech technology has made remarkable progress over the last decade, largely due to the evolution of deep learning combined with the availability of massive amounts of speech and language data and high-performance computational resources. Although the amount of language data required to develop high-performance speech technology is similar for all languages, irrespective of the number of speakers, products and services that enable spoken communication with computers have become available even for smaller languages such as Norwegian. Examples include devices like Google Home, services like Siri and Google Voice Search, and voice command and dictation capabilities in recent versions of Windows and OS X.
Yet, for many real-life situations, current technology is not sufficiently advanced to be really useful. Spontaneous conversational speech, ambient noise and overlapping speech are among the conditions in which current speech technology still does not perform satisfactorily. For Norwegian, existing speech corpora are moderate in size compared to those of other languages, and mainly contain read and non-conversational speech. Matters are complicated further by large dialectal variation. The problem is that these phenomena occur in a wide variety of situations where automated solutions would be of great use.
The system we will develop in SCRIBE will fill this gap in current speech recognition systems for Norwegian. It will be robust to the disfluencies typical of spontaneous conversational speech, and will support the spoken and written dialectal variation of the Norwegian language. It will also be assessed on metrics that are more closely related to the semantic content of the transcription than a simple count of misrecognized words.