
IKTPLUSS-IKT og digital innovasjon

SANT: Sentiment Analysis for Norwegian Text

Alternative title: SANT: Sentimentanalyse for Norsk Tekst

Awarded: NOK 10.1 mill.

Natural Language Processing (NLP) is a central sub-field of Artificial Intelligence (AI) concerned with enabling machines to 'make sense' of human language. A particular application of NLP that has gained widespread use in recent years, both scientifically and commercially, is Sentiment Analysis (SA). In SA, the task is to automatically identify opinions, attitudes, emotions, or other similar subjective information in text. This technology has been successfully applied to market analysis, political opinion analysis, news and social media monitoring, and much more. The main objective of this project has been to provide open and publicly available resources for sentiment analysis for the Norwegian language, something which was previously lacking. Coordinated by the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo, the project has created an ecosystem of resources, most importantly annotated datasets based on NoReC, the Norwegian Review Corpus. NoReC contains thousands of reviews across different domains, collected from Norwegian news sources and shared by the project's media partners Schibsted, Aller Media, and NRK, which are among Norway's largest media companies. The annotated datasets enable training and evaluation of machine learning models, and the project has also made pre-trained models available. So-called Language Models are a cornerstone of current NLP. Rather than starting from scratch when training models for specific applications like sentiment analysis, we build on the knowledge already embedded in Language Models pre-trained on vast amounts of raw text. An important contribution of the SANT project has been the development of the first Language Models for Norwegian, based on the well-known BERT architecture. Datasets, code, and models are openly shared on GitHub and HuggingFace.
For more information about the various resources, please see the project webpages: https://www.mn.uio.no/ifi/english/research/projects/sant/

Sentiment analysis (SA) is an application of NLP with a wide range of use cases, both in research and for commercial purposes, from market analysis and political opinion analysis to news and social media monitoring. Coordinated by the Language Technology Group (LTG) at UiO, the SANT project has created a rich ecosystem of openly available resources for sentiment analysis of Norwegian text, something that was previously lacking. We have created a number of large-scale manually annotated datasets, which have subsequently been used for training and evaluating machine-learned SA models. The project has attracted considerable external interest, leading to collaborations with scholars from several different fields, ranging from political science to healthcare. Several new research projects that build on and expand the outcomes of SANT have already been funded, and we anticipate more to come. The SANT project has played an important role in advancing the state of Norwegian NLP, not only with respect to sentiment analysis but also more generally through the development of the (then) first Norwegian transformer-based language models, based on both the BERT and T5 architectures. These models are still in widespread use in both industry and research. The project was also instrumental in developing the NorBench evaluation suite, which includes several SANT datasets and will continue to be important for testing and comparing future generations of Norwegian LMs. Beyond the contributions to Norwegian NLP, we have also made language-independent methodological advances, e.g. showing how Structured Sentiment Analysis can be approached using graph-based neural modeling adapted from semantic parsing, establishing new state-of-the-art results for the task.
Our datasets are also included in standardized data collections, like the SemEval 2022 Shared Task on Structured SA, ensuring continued use and visibility in the international NLP and ML community. SANT's role in co-organizing the SemEval Shared Task and subsequent workshop is also an example of the international collaborations spawned by the project. All resources, whether data, annotation guidelines, code, or models, are made publicly available under an open license. Importantly, this allows not only free use for research purposes but also commercial applications, e.g. of derived models. Making the resources available via both HuggingFace and GitHub facilitates both discoverability and ease of access. Finally, the project has also had a large impact on building local competence in deep learning-based NLP, high-performance computing, and data creation, with well over twenty people involved in different capacities, from student research assistants to PhD and postdoctoral research fellows and researchers.

Language Technology (LT) is a sub-field of Artificial Intelligence (AI) concerned with enabling machines to 'make sense' of human language. A particular application of LT that has gained widespread use in recent years, both scientifically and commercially, is Opinion Mining or Sentiment Analysis (SA). The task of an SA system is to automatically identify the opinions, attitudes, or emotions expressed in text. This technology has been successfully applied to market analysis, political opinion analysis, reputation tracking, customer relationship management, news and social media monitoring, and much more. The main objective of this project is to provide open and publicly available resources for sentiment analysis for the Norwegian language, something which is currently lacking. The project will take advantage of a peculiarity of the way reviews and critiques are typically summarized in Norwegian arts and consumer journalism, namely an explicit rating on a scale of 1 to 6, represented as a throw of a die. We propose to use this feature to semi-automatically compile a polarity-labeled text collection, which can then be used to train and evaluate machine-learned models for document-level sentiment analysis. For some applications it is necessary to have models that can make more granular predictions at the sentence level and identify the targets and holders of the opinions ('who thinks what about whom'). To enable such models, a subset of the reviews will therefore be manually annotated with fine-grained in-sentence polarity information. In the field of AI in general, and LT in particular, the use of many-layered artificial neural networks (so-called Deep Learning) has recently seen a great revival with many successful applications, including sentiment analysis. The classifiers developed in this project will seek to push the state of the art in large-scale sentiment analysis using deep neural architectures.
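As a rough illustration of the semi-automatic labeling idea described above, the following Python sketch derives document-level polarity labels from die ratings. The thresholds (1-2 negative, 5-6 positive, 3-4 left unlabeled), the `Review` class, and the function names are illustrative assumptions, not the procedure actually used in the project.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed cut-offs, chosen only for illustration.
NEGATIVE_MAX = 2  # die faces 1-2 are treated as negative
POSITIVE_MIN = 5  # die faces 5-6 are treated as positive


@dataclass
class Review:
    """Hypothetical container for a review and its die rating."""
    text: str
    rating: int  # die face, 1 to 6


def polarity_from_rating(rating: int) -> Optional[str]:
    """Map a die rating to a polarity label, or None for mid-scale ratings."""
    if not 1 <= rating <= 6:
        raise ValueError(f"rating must be between 1 and 6, got {rating}")
    if rating <= NEGATIVE_MAX:
        return "negative"
    if rating >= POSITIVE_MIN:
        return "positive"
    return None  # ratings 3-4 are considered too ambiguous to label here


def build_polarity_dataset(reviews: List[Review]) -> List[Tuple[str, str]]:
    """Keep only reviews with a clear polarity and pair each text with its label."""
    labeled = []
    for review in reviews:
        label = polarity_from_rating(review.rating)
        if label is not None:
            labeled.append((review.text, label))
    return labeled


if __name__ == "__main__":
    sample = [
        Review("Fantastisk film", 6),
        Review("Helt grei", 3),
        Review("Skuffende", 1),
    ]
    for text, label in build_polarity_dataset(sample):
        print(f"{label}\t{text}")
```

Dropping the mid-scale ratings is one possible design choice for obtaining cleaner binary labels; a three-class or full six-class setup would keep them instead.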

Funding scheme:

IKTPLUSS-IKT og digital innovasjon