In the past decades, language models have increasingly moved from manually constructed, deductive rule-based systems towards systems based on statistics and machine learning from authentic data. The construction of such language models presupposes very large amounts of suitable, annotated language data for every language in question. Current approaches are often called "corpus-based" due to their reliance on text or speech corpora, but other, derived language resources such as lexicons, wordnets and termbanks play an important role as well. Even newer approaches aim at building hybrid models that combine rule-based and data-based knowledge sources, e.g. weighted finite-state transducers.
CLARA has interdisciplinary relevance. The project is primarily situated in the humanities, because the use of language sources, which are increasingly digitized, is pervasive in all humanities disciplines. CLARA will also have relevance for psychological and social science approaches to language, including the study of mental language processes and social dynamics of language groups. The project is also relevant beyond the Humanities and Social Sciences, since the traditional language sciences must be complemented by relevant knowledge from information theory, statistics, computer science, cognitive science and artificial intelligence, to name just a few. Thus, the next generation of language researchers will need a new combination of training components which most universities and research institutions cannot offer by themselves.
The technological applications of CLARA are intersectoral. Language models have, for instance, a huge potential in the educational sector, which currently faces the gap between advanced research in Intelligent Computer-Assisted Language Learning (ICALL) and the actual needs and practice of current Foreign Language Teaching (FLT). In the IT sector, localization, search engines and information retrieval constitute huge application area