IKTPLUSS-IKT og digital innovasjon

Language technology is a central part of the field of Artificial Intelligence (AI) and is concerned with enabling machines to understand human languages. One application of language technology that has found a great many uses in recent years is sentiment analysis (SA). The task of an SA system is to automatically identify opinions, attitudes, or similar subjective information in text. This technology has found applications in market analysis, news monitoring, analysis of cultural and political trends, and much more. Previously, however, no freely available technology for sentiment analysis existed for Norwegian. The main goal of SANT has been to remedy this. Led by the Language Technology Group at the Department of Informatics at UiO, the project has made a range of resources available, first and foremost manually annotated datasets based on NoReC, the Norwegian Review Corpus, which contains thousands of reviews from Norwegian newspapers, shared by the media partners in SANT: Schibsted, Aller, and NRK. The most important role of the annotated datasets is the training and evaluation of machine learning models, and the project has also released trained models that are ready for use. Language models are today an important cornerstone of most applications of language technology. Instead of starting from scratch when training machine learning models for specific applications, such as sentiment analysis, we build on the knowledge already embedded in language models trained on enormous amounts of raw text. An important contribution of the SANT project has been the development of the first language models for Norwegian, based on the well-known BERT architecture. All datasets, code, and models are openly shared on GitHub and HuggingFace. For more information about the various resources, please see the project website: https://www.mn.uio.no/ifi/english/research/projects/sant/

Sentiment analysis (SA) is an application of NLP that has proved to have a wide range of use cases, both within research and for commercial purposes, ranging from market analysis and political opinion analysis to news and social media monitoring and much more. Coordinated by the Language Technology Group (LTG) at UiO, the SANT project has created a rich ecosystem of openly available resources for sentiment analysis of Norwegian text, something that was previously lacking. We have created a number of large-scale manually annotated datasets, which have subsequently been used for training and evaluating machine-learned SA models. The project has attracted a lot of external interest, spawning collaborations with scholars from several different fields of study, ranging from political science to healthcare. Several new research projects that build on and expand the outcomes of SANT have already been funded, and we anticipate more to come. The SANT project has played an important role in advancing the state of Norwegian NLP, not only with respect to sentiment analysis but also more generally through the development of the (then) first Norwegian transformer-based language models, based on both the BERT and T5 architectures. These models are still in widespread use within both industry and research. The project was also instrumental in developing the NorBench evaluation suite, which includes several SANT datasets and will continue to be important for testing and comparing future generations of Norwegian LMs. Beyond the importance for Norwegian NLP, we have also made language-independent methodological advancements, e.g. showing how Structured Sentiment Analysis can be approached using graph-based neural modeling adapted from semantic parsing, establishing new state-of-the-art results for the task.
Our datasets are also included in standardized data collections, like the SemEval 2022 Shared Task on Structured SA, ensuring continued use and visibility in the international NLP and ML community. SANT's role in co-organizing the SemEval Shared Task and subsequent workshop also serves as an example of the international collaborations spawned by the project. All resources, whether in the form of data, annotation guidelines, code, or models, are made publicly available under an open license. Importantly, this allows not only free use for research purposes but also commercial applications, e.g. of derived models. Making the resources available via both HuggingFace and GitHub facilitates both discoverability and ease of access. Finally, the project has also had a huge impact in terms of building competence locally in deep learning-based NLP, high-performance computing, and data creation, with well over twenty people having been involved in different capacities, from student research assistants to PhD and postdoctoral research fellows and researchers.
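To make the idea of Structured Sentiment Analysis more concrete, the following minimal sketch shows one way an opinion could be represented as a tuple of holder, target, polar expression, and polarity. The class name, field names, helper function, and example sentence are hypothetical and chosen for illustration only; the released datasets use their own annotation format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Opinion:
    """One opinion tuple (hypothetical layout, not the datasets' own format)."""
    holder: Optional[str]   # who expresses the opinion; may be implicit
    target: Optional[str]   # what the opinion is about; may be implicit
    expression: str         # the words that carry the sentiment
    polarity: str           # "positive" or "negative"


# Invented example sentence with two opinions from the same holder.
sentence = "Anmelderen roser skuespillerne, men misliker manuset."
opinions = [
    Opinion("Anmelderen", "skuespillerne", "roser", "positive"),
    Opinion("Anmelderen", "manuset", "misliker", "negative"),
]


def targets_by_polarity(ops: List[Opinion], polarity: str) -> List[str]:
    """Return the explicit targets of all opinions with the given polarity."""
    return [o.target for o in ops if o.polarity == polarity and o.target is not None]


negative_targets = targets_by_polarity(opinions, "negative")
```

A single sentence can carry several such tuples with different polarities, which is why a sentence-level or document-level label alone cannot capture this information.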

Language Technology (LT) is a sub-field of Artificial Intelligence (AI) concerned with enabling machines to 'make sense' of human language. A particular application of LT that has gained widespread use in recent years, both for scientific and commercial purposes, is Opinion Mining or Sentiment Analysis (SA). The task of an SA system is to automatically identify the opinions, attitudes or emotions that are expressed by subjective information in text. This technology has been successfully applied for market analysis, political opinion analysis, reputation tracking, customer relationship management, news and social media monitoring, and much more. The main objective of this project is to provide open and publicly available resources for sentiment analysis for the Norwegian language, something which is currently lacking. The project will take advantage of a peculiarity of the way reviews and critiques are typically summarized in Norwegian arts journalism and consumer journalism, viz. by an explicit rating on a scale of 1 to 6, represented as a throw of a die. We here propose to use this feature for semi-automatically compiling a polarity-labeled text collection. We can then use this to train and evaluate machine-learned models for sentiment analysis at the document level. For some applications it is necessary to have models that can make more granular predictions at the sentence level and identify the targets and holders of the opinions ('who thinks what about whom'). To enable such models, a subset of the reviews will therefore be manually annotated with fine-grained in-sentence polarity information. In the field of AI in general, and LT in particular, the use of many-layered artificial neural networks (so-called Deep Learning) has recently seen a great revival with many successful applications, including sentiment analysis. The classifiers developed in this project will seek to push the state of the art in large-scale sentiment analysis using deep neural architectures.
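As a rough illustration of how die ratings could be turned into document-level polarity labels, consider the sketch below. The cut-offs (1-2 negative, 5-6 positive, 3-4 left unlabeled), the function names, and the toy reviews are assumptions made for this example, not the labeling scheme actually used in the project.

```python
from typing import List, Optional, Tuple


def dice_to_polarity(rating: int) -> Optional[str]:
    """Map a die rating (1-6) to a coarse polarity label.

    Assumed cut-offs for illustration: 1-2 -> "negative", 5-6 -> "positive",
    3-4 -> None (treated here as too mixed to label).
    """
    if not 1 <= rating <= 6:
        raise ValueError(f"die rating must be between 1 and 6, got {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 5:
        return "positive"
    return None


def label_reviews(reviews: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Attach polarity labels to (text, rating) pairs, dropping unlabeled ones."""
    labeled = []
    for text, rating in reviews:
        polarity = dice_to_polarity(rating)
        if polarity is not None:
            labeled.append((text, polarity))
    return labeled


# Invented toy reviews, each paired with a die rating.
reviews = [("En fantastisk film.", 6), ("Helt grei plate.", 4), ("Skuffende bok.", 2)]
labeled = label_reviews(reviews)
```

Dropping the middle ratings trades corpus size for cleaner labels; a setup that keeps all six ratings as separate classes would be an equally reasonable choice.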

Budget purpose:

IKTPLUSS-IKT og digital innovasjon

NOK 2.6 billion awarded in total during the programme period. 658 projects have received funding during the programme period. 8 sources have funded the programme.

Funding sources

Kunnskapsdepartement, Justis- og beredskap, Kommunal-og distrikt, Samferdselsdeparteme, Diverse, Nærings- og fiskerid, Forsvarsdepartemente, Digitaliserings- og

SANT: Sentiment Analysis for Norwegian Text

Alternative title: SANT: Sentimentanalyse for Norsk Tekst

Awarded: NOK 10.1 million

Publications retrieved from Cristin

NorBench – A Benchmark for Norwegian Language Models

Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis

A Diagnostic Dataset for Sentiment and Negation Modeling for Norwegian

Direct parsing to sentiment graphs

SemEval 2022 Task 10: Structured Sentiment Analysis

Entity-Level Sentiment Analysis (ELSA): An Exploratory Task Survey

Large-Scale Contextualised Language Modelling for Norwegian

Using Gender- and Polarity-informed Models to Investigate Bias
