IKTPLUSS-IKT og digital innovasjon

WeSearch: Language Technology for the Web

Awarded: NOK 10.6 mill.

The project works to enable the automatic semantic analysis of on-line content in natural language form, specifically so-called user-generated content. In scavenging the Internet to harvest such texts, one needs to separate relevant (linguistic) content from irrelevant context (e.g. navigation elements, banner ads, meta information, and of course mark-up).

Main results of the project include:
(a) three new releases of the English Resource Grammar (ERG), with greatly improved grammatical coverage across varied genres and domains;
(b) a massively enlarged collection of gold-standard syntacto-semantic analyses;
(c) a new methodology for documenting the semantic interface, and an emerging on-line 'encyclopedia' of ERG semantic analyses (in joint work with Stanford University and the University of Washington; published at LREC);
(d) significant improvements over the state of the art in sentence segmentation and tokenization, using supervised machine learning (published at COLING and CICLing);
(e) a generalization of supertagging, dubbed ubertagging, integrated with PET, offering speed-ups of up to an order of magnitude at marginal losses in accuracy (published at EMNLP);
(f) a head-to-head comparison with state-of-the-art data-driven dependency parsers (published at IWPT), showing superior accuracy and resilience to domain and genre variation for the ERG parser;
(g) a collaboration with the University of Washington on using ERG-derived semantic analyses for the resolution of negation scope (published at ACL);
(h) an international alliance, involving DFKI gGmbH, the University of Prague, Linköping University, and the National Institute of Informatics of Japan, which has been invited to organize a system competition on semantic dependency parsing at SemEval 2014 and 2015;
(i) collaboration with leading NLP research centers, the US-based Common Crawl Foundation, and the Nordic e-Infrastructure Collaboration on establishing a shared center for Web-scale Natural Language Processing in Northern Europe; and
(j) an alliance of stakeholders across frameworks who work to compare and harmonize semantic target representations for the set of phenomena and exemplars compiled by the project (including, in addition to the SemEval collaborators, Johan Bos of the University of Groningen; Dick Crouch of Nuance; Alex Lascarides of the University of Edinburgh; Martha Palmer of the University of Colorado; and Alexander Koller of the University of Potsdam).

The project sets out to enable next-generation Web services in the realm of social networking, as characterized by user-centric information sharing and on-line collaboration. Here, a key element is so-called user-generated content (UGC), which already accounts for a large proportion of Internet traffic. The vast majority of UGC is cast in human language. The project develops so-called semantic parsing technology, an automated process that allows IT systems to 'make sense' of human language. While semantic parsing systems exist for at least a few languages, current technology does not scale to the size of the Web, nor can it cope with the linguistic complexity and diversity of typical types of UGC. Large-scale semantic parsing technology is prohibitively expensive for a single player to build. Therefore, a long-term perspective, collaborative development, a focus on task-, genre-, and domain-adaptable approaches, and the reuse of knowledge and resources are prerequisites to broader use of parsing in next-generation ICT solutions.

Parsing technology has matured to a point where its large-scale application to Web content is now within reach. However, important scientific and technological challenges need to be addressed to actually reach this goal: scalability (i.e. primarily parser efficiency), robustness (to out-of-scope or ill-formed inputs), and precision (of output representations). Finally, it is necessary to define adequate, task- and application-independent output representations for semantic parsing (abstractly, a linguistic API), and such standardization for use in applications needs to be approached in close cooperation with key international players. Project results will be showcased through a novel Web service, a search interface based on semantic relations between concepts, which is applied to a large selection of diverse Web content from the domain of information technology.

Funding scheme:

IKTPLUSS-IKT og digital innovasjon