FORINFRA-Nasj.sats. forskn.infrastrukt

Present-day approaches to the study of language, linguistic processes and applied language technologies are crucially dependent on large 'treebanks'. These are collections of authentic sentences, annotated with detailed linguistic analyses at the syntactic and often also the semantic level. Since there is currently no suitable treebank for Norwegian, and such a treebank is essential for progress, one will be constructed and made accessible in this project. INESS will be a research infrastructure providing access to detailed, high-quality treebanks for Norwegian and other languages. The methods and tools from the pilot project TREPIL will be used to achieve this goal. A Norwegian reference corpus of 500,000 words will be automatically analyzed and manually disambiguated. This material will serve as the basis for the automatic annotation of 500 million words. This annotation effort is the largest task in the project. INESS will allow researchers to search for syntactic and semantic patterns in actual language data. This information will enrich our knowledge of the language and will be important for theoretical linguistics, historical linguistics, language teaching, and the development of language-based applications such as machine translation, information retrieval and human-computer interaction. The next generation of IT systems that understand language will depend on linguistic insights gained from treebanks. INESS will differ from existing treebanks in that it will not only provide complex data but also make this information easy to access through intelligent and powerful interactions. The infrastructure will expand and evolve over time, and users will be able to experiment with the data in different ways and even build their own treebanks interactively.