What are the "atoms" that constitute the basic building blocks of speech? Every human is equipped with the same physiological apparatus for speech production, regardless of language. In current automatic speech recognizers, we start out with a top-down hierarchy, where the top level is a sentence or phrase. A sentence is composed of words, and the words are composed of phonemes, which are the smallest units distinguishing meaning in a language. The units are assumed to appear as beads on a string, such that a sentence can be decomposed into a sequence of non-overlapping phonemes. Traditional speech recognition finds which phoneme sequence is the most likely for a recorded acoustic waveform, that corresponds to a legal sequence of words.
In this project, we investigate an alternative to traditional, phoneme-based speech recognition by turning the hierarchy upside down. We propose an approach to speech recognition based on detecting features that are pertinent to speech production and articulation, and thus universal. The detection forms the basis for machine learning, from limited amounts of speech data, of the basic patterns (the "atoms") that all speech is composed of. This set of basic units should provide the bridge between the highly variable acoustic speech signal and invariable, meaningful symbols, and thus enable more robust and reliable speech recognition.
In the initial phase of the project we utilized our existing system for predicting the probability that specific phonetic features are active in a speech segment, so-called soft decisions. With this as a foundation, we have developed two methods for discovering acoustic units that may provide the basis for speech recognition. One of the methods finds units that are linguistically defined (i.e. phoneme-like units), while the other finds units defined solely by their acoustic properties. We have trained statistical models for both methods. Results from phone recognition experiments demonstrate that the new units perform with approximately the same accuracy as traditional units on this task.
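The soft-decision idea can be illustrated with a minimal sketch. The feature names and logit values below are invented for illustration; in the actual system the scores come from trained phonetic-feature detectors.

```python
import math

def sigmoid(x):
    """Squash a detector score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical detector scores (logits) for one speech segment.
logits = {"voiced": 2.1, "nasal": -1.4, "fricative": 0.3}

# Soft decision: keep the probability that each phonetic feature is
# active, rather than committing to an early hard yes/no.
soft = {feat: sigmoid(z) for feat, z in logits.items()}

# A hard decision can still be derived later if needed.
hard = {feat: p > 0.5 for feat, p in soft.items()}
print(hard)  # {'voiced': True, 'nasal': False, 'fricative': True}
```

Keeping the probabilities instead of thresholding immediately lets downstream unit-discovery and recognition stages weigh uncertain evidence.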
Deep Neural Networks (DNN) have over the past few years had a breakthrough as a tool for a number of applications in prediction and pattern recognition. Speech recognition is one of these applications. DNNs are central to several tasks in the project. We have investigated the performance of DNN-based phone recognizers on comparable Norwegian and American-English data, primarily to examine whether there is a significant difference in the level of difficulty of fundamental speech recognition between the two languages. The experimental results show that it is generally a little more difficult to achieve good phone recognition for Norwegian than for English. Further investigations, where the phone recognition builds on DNN-based detection of phonetic features, show that this approach does not result in significant differences between the two languages.
Different DNN architectures have different properties. We have investigated some prominent DNN architectures on a speech classification task, using a small and controlled database. The results indicate that for this task, a simple feed-forward architecture performs better than more complex architectures. Including context, by letting the input be a sequence of frames centered on the frame to be classified, gave a significant improvement over using only that frame as input.
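The context-window input described above can be sketched as a simple frame-stacking step. The window width and the edge-padding convention (repeating the edge frame) are assumptions for illustration, not the project's exact configuration.

```python
def stack_context(frames, center, width=2):
    """Build the classifier input for one frame: the frame itself plus
    `width` frames of left and right context, concatenated into one
    flat vector. Utterance edges are padded by repeating the edge
    frame (one common convention; an assumption here).
    """
    padded = [frames[0]] * width + list(frames) + [frames[-1]] * width
    window = padded[center : center + 2 * width + 1]
    return [x for frame in window for x in frame]

# Toy 1-dimensional "feature vectors" for a 4-frame utterance.
frames = [[0.0], [1.0], [2.0], [3.0]]
print(stack_context(frames, center=0))  # [0.0, 0.0, 0.0, 1.0, 2.0]
print(stack_context(frames, center=3))  # [1.0, 2.0, 3.0, 3.0, 3.0]
```

The network then classifies the center frame from the whole stacked vector, which is how the surrounding acoustic context enters a feed-forward model.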
Speaking-rate variation changes how sounds are realized. We have studied the impact of this variation on the classification of short speech segments, and which speech features are most robust to it.
In our initial system, estimation of phonetic features that indicate how the sounds are generated was based on a traditional spectral representation of the information content of speech. This is not necessarily optimal. If we can represent the actual movements of the most important articulators (tongue, lips, etc.), we will have a more realistic picture of how the sounds of language are actually produced. We have developed new methods for acoustic-articulatory inversion, i.e. estimation of the articulator movements from the acoustic speech waveform, with improved accuracy and robustness. Physiological differences make the relationship between the acoustic speech signal and a spatial description of the articulator movements speaker specific. Our new approach can predict articulator movements nearly as accurately as speaker-specific systems, without requiring training data or other information from the current speaker. Moreover, the systems are very robust to varying levels and types of ambient noise, which is very promising for the use of articulatory information in real-life speech technology applications.
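At its core, acoustic-articulatory inversion is a regression from acoustic features to articulator positions, which the following minimal sketch illustrates with frame-wise linear least squares on synthetic data. The dimensions (13 acoustic coefficients, 2 articulator coordinates) and the linear mapping are assumptions for illustration; the project's systems use DNNs trained on recorded articulography data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_acoustic, n_artic = 200, 13, 2

# Synthetic "ground truth": articulator positions (e.g. tongue-tip
# x/y) are a hidden linear function of the acoustic features, plus
# a small amount of noise.
W_true = rng.normal(size=(n_acoustic, n_artic))
X = rng.normal(size=(n_frames, n_acoustic))           # acoustic features
Y = X @ W_true + 0.01 * rng.normal(size=(n_frames, n_artic))

# Fit the inversion mapping by least squares on the training frames.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict articulator trajectories for unseen acoustic frames.
X_test = rng.normal(size=(10, n_acoustic))
Y_pred = X_test @ W_hat
print(Y_pred.shape)  # (10, 2)
```

Speaker independence, in this framing, amounts to learning a mapping that generalizes across speakers whose acoustic-to-articulatory relationships differ, which is what makes the problem hard in practice.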
The results, especially on methods for and use of acoustic-articulatory inversion (AAI), have contributed new knowledge to the international research arena. AAI is a variant of a larger class of inversion problems, and the methods developed in the project can be exploited in other application areas. Project results will be exploited in further research work at NTNU and by our collaboration partners.
Two research fellows are in the process of completing their PhD degrees.
The work in the project's postdoc position contributed to a permanent appointment to a scientific position at NTNU Gjøvik.
The collaboration with Prof. Siniscalchi (Univ. Enna, Italy) has been strengthened and is now being formalized: Prof. Siniscalchi was appointed to a Professor II position at the Institutt for elektroniske systemer, NTNU, in the autumn of 2020.
Traditional speech recognition systems are based on a top-down approach where the sub-word units are pre-defined, usually on the basis of linguistic theory. In order to build robust statistical models of these units, massive amounts of data are required. Yet, this approach is sensitive to mismatch between the imposed model and real-world data at all levels. The recognition problem is framed as finding the most likely sequence of units that matches a legal sequence of words, as defined by the lexicon and the language model.
Instead of relying on top-down decoding, we propose a paradigm based on bottom-up detection and information extraction. Instead of learning statistical models of pre-defined units, we aim to develop an approach to ASR based on learning the 'optimal' set of units that can be used to map from variable acoustic data to invariable, meaningful symbols in a bottom-up information extraction procedure. These units must capture the structure in the speech signal that is imposed by the constraints of the articulatory system, i.e., the structure that encodes the linguistic information. At the same time, the units must be flexible and adaptive, so that they can be used for understanding unknown speakers in arbitrary acoustic backgrounds. Last but not least, it must be possible to learn the units from limited amounts of annotated speech.
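One hedged sketch of such data-driven unit discovery is to cluster acoustic feature frames so that each cluster acts as a candidate unit. The 2-D synthetic data and the plain k-means procedure below are illustrative assumptions; the project's methods operate on detected phonetic events, not raw toy vectors.

```python
import numpy as np

# Two well-separated synthetic "sound classes" in a 2-D feature space.
rng = np.random.default_rng(1)
data = np.concatenate([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.1, size=(50, 2)),
])

def kmeans(data, k, iters=20):
    """Plain k-means: each resulting center is a data-driven 'unit'."""
    centers = data[:k].copy()  # deterministic init for the sketch
    for _ in range(iters):
        # Assign each frame to its nearest unit center.
        dists = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Re-estimate each unit as the mean of its assigned frames.
        centers = np.array([data[labels == c].mean(axis=0) for c in range(k)])
    return centers, labels

centers, labels = kmeans(data, k=2)
print(np.sort(centers[:, 0]))  # roughly 0 and 3
```

The open research questions are exactly those the sketch glosses over: what to cluster, how many units to allow, and how to make the units transferable across speakers and acoustic conditions.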
The core paradigm will be investigated through exploring and verifying the following supporting hypotheses:
- The salient information of the speech signal can be represented by detecting a small number of acoustic-phonetic events.
- The set of sub-word units can be discovered from the detected events by machine learning approaches.
- The relationship between sub-word units and linguistic units can be learnt from (possibly labeled) data.
- The dependence of the sub-word units on language and speaker can be explored by employing them for automatic language identification and for speaker recognition.