Even though today's large language models, such as ChatGPT, have become extremely good at understanding linguistic meaning, it is still possible to trick them, for example with clever use of negation ('not', 'without', etc.) and other small words that change the meaning of otherwise frequent patterns in their training material. One way to make the models more robust is to create meaning representations based on logic. In this project, we have tried to create such representations by starting from syntactic representations, which show the structure of sentences, and translating them into semantic representations, which show the meaning of sentences. The syntactic representations we have used follow a formalism called Universal Dependencies (UD), a standard format that has been applied to more than 100 languages. This makes it possible to create meaning representations for a wide array of languages.
The project has created a rule-based system that translates UD to logic, but we have also trained language models to create logical representations directly. It turns out that language models are very good at creating logical representations for short sentences (up to 10 words), but make many errors when the sentences get longer. The rule-based system performs less well than the language models overall, but is more robust to longer sentences. There is reason to believe that better results could be obtained by combining rules and neural networks, but this has so far proved difficult. So-called 'neurosymbolic AI' is, however, an active research area, and things may change in the future.
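To give a flavour of what a rule-based UD-to-logic translation does, the sketch below is a deliberately simplified, hypothetical example rather than the project's actual pipeline: it reads a toy dependency tree for a transitive clause and emits a neo-Davidsonian-style formula.

    # Hypothetical sketch of rule-based UD-to-logic translation.
    # It only covers a verb with an nsubj and an obj dependent.

    def ud_to_logic(tokens):
        """tokens: list of dicts with 'id', 'lemma', 'upos', 'head', 'deprel'."""
        root = next(t for t in tokens if t["deprel"] == "root")
        args = {t["deprel"]: t["lemma"] for t in tokens if t["head"] == root["id"]}
        conjuncts = [f'{root["lemma"]}(e)']
        if "nsubj" in args:
            conjuncts.append(f'agent(e, {args["nsubj"]})')
        if "obj" in args:
            conjuncts.append(f'patient(e, {args["obj"]})')
        return "exists e. " + " & ".join(conjuncts)

    # "The dog chases a cat" (determiners omitted for brevity)
    toy_tree = [
        {"id": 1, "lemma": "dog",   "upos": "NOUN", "head": 2, "deprel": "nsubj"},
        {"id": 2, "lemma": "chase", "upos": "VERB", "head": 0, "deprel": "root"},
        {"id": 3, "lemma": "cat",   "upos": "NOUN", "head": 2, "deprel": "obj"},
    ]
    print(ud_to_logic(toy_tree))  # exists e. chase(e) & agent(e, dog) & patient(e, cat)

A full system of course needs rules for many more constructions (determiners, quantifiers, negation, coordination, embedded clauses), which is where much of the engineering effort lies.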
The project has also studied so-called presuppositions, i.e. meaning that language users take for granted when using particular expressions. For example, the sentence "I realized that Trump was American" presupposes that Trump is American. Presuppositions are special in that they "survive" under negation: "I did not realize that Trump was American" also presupposes that Trump is American. The project has constructed a large dataset of such sentences from the English Wikipedia. We then tried to train large language models to predict presuppositional meaning. Neither BERT nor ChatGPT was able to do so, showing that there are still big challenges in making language models understand linguistic meaning.
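As a rough illustration of how test items for presupposition projection can be constructed (hypothetical code, not the project's actual data-construction scripts), the snippet below builds plain and negated variants of factive-verb sentences; if a model handles presupposition projection, the complement clause should be entailed in both cases.

    # Hypothetical sketch: building plain/negated test pairs for factive
    # presuppositions, labelled as entailment in both cases.

    FACTIVE_FRAMES = [
        ("{subj} realized that {comp}", "{subj} did not realize that {comp}"),
        ("{subj} knew that {comp}",     "{subj} did not know that {comp}"),
    ]

    def make_pairs(subj, comp):
        """Return (premise, hypothesis, label) triples for one subject/complement."""
        items = []
        for plain, negated in FACTIVE_FRAMES:
            for premise in (plain, negated):
                items.append((premise.format(subj=subj, comp=comp), comp, "entailment"))
        return items

    for premise, hypothesis, label in make_pairs("I", "Trump was American"):
        print(f"{premise}  =>  {hypothesis}  [{label}]")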
Project pipeline software and new datasets are available at https://github.com/Universal-NLU and will later be deposited in an institutional archive (likely TROLLing, The Tromsø Repository of Language and Linguistics).
This project will use techniques from Glue semantics to derive semantic representations from UD syntax trees. It will build a software pipeline that maps text to meaning by combining a machine-learning approach to syntactic parsing with a largely rule-driven interface to deep, logic-based semantic representations that improve considerably on the current state of the art. Moreover, the central part of the system will be based exclusively on information in the UD tree, which means that it can be used for any language that has a UD treebank (currently more than 70 languages). In addition, the project will develop tools for post-compositional enrichment of English and Norwegian meaning representations based on lexical knowledge encoded in resources available for those languages. Improved natural language understanding has the potential to help numerous computational tasks, from web search to human-robot interaction, so the potential impact of the project is very large.
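For readers unfamiliar with Glue semantics, the following is a minimal, hypothetical sketch of its core mechanism: lexical entries pair a meaning with a linear-logic "glue" type over syntactic resources, and combining an entry of type g -o f with one of type g amounts to applying the meaning as a function.

    # Hypothetical Glue-semantics-style derivation for "Anna sleeps".
    # Meanings are represented as strings/functions; glue types as tuples.

    anna   = ("anna", "g")                                  # meaning : resource g
    sleeps = (lambda x: f"sleep({x})", ("g", "-o", "f"))    # g -o f (consumes g, yields f)

    def apply_glue(fun_entry, arg_entry):
        fun, (src, arrow, tgt) = fun_entry
        arg, res = arg_entry
        assert arrow == "-o" and res == src, "glue types do not match"
        return (fun(arg), tgt)

    meaning, resource = apply_glue(sleeps, anna)
    print(meaning, ":", resource)   # sleep(anna) : f

Here g and f stand for the subject and the clause; in a UD-based setting such resources would naturally correspond to nodes in the dependency tree.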