IKTPLUSS-IKT og digital innovasjon

Many public and private organisations struggle to manage the personal data they gather or produce. This data may relate to patients, customers, welfare recipients, or even defendants in court cases. Such databases are often highly valuable, both for the organisations themselves and for society at large. For instance, patient records are essential for biomedical research. Similarly, court cases constitute an important resource for the legal profession, and customer data can be used to improve a company's services and user experience.

However, data that may reveal personal information about individuals must also comply with privacy and data protection laws, such as the General Data Protection Regulation (GDPR) newly introduced in Europe. In particular, personal data cannot be distributed to third parties (or even used for secondary purposes) without legal ground, such as the consent of the individuals to whom the data refers. One solution is to use anonymisation techniques to protect the privacy of the registered individuals. However, current anonymisation methods do not work for unstructured formats such as text documents. This is unfortunate, as text documents constitute a large part of many data management systems (for instance, patient records consist mostly of text). Text anonymisation is therefore mostly done manually. This process, however, is very costly, prone to human errors and inconsistencies, and difficult to scale to large numbers of texts.

The CLEANUP project seeks to address this technological gap and develop new machine learning models to automatically anonymise text documents. The project also designs new methods to evaluate the quality of text anonymisation techniques and connect these metrics to legal requirements. Finally, the project investigates how these technical solutions can be integrated into organisational processes, in particular how to perform quality control and adapt the anonymisation to the specific needs of the data owner.

During the first two years of the project, we have worked on several fronts. We have collected various data sources, including court decisions and patient records. We have focused strongly on developing a new corpus called TAB (Text Anonymisation Benchmark), which has been annotated manually by law students. The corpus comes with new evaluation methods that can be used to automatically assess anonymisation quality. We have also worked on analysing and comparing existing methods, and on developing new anonymisation models that do not depend on annotated data.
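
As a rough illustration of what an automatic assessment of anonymisation quality can look like, the sketch below masks character spans in a text and compares the masked characters with manually annotated spans. It is a simplified, hypothetical example and not the evaluation framework released with TAB: the function names, the placeholder string and the character-level recall and precision are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def mask_text(text: str, spans: List[Span], placeholder: str = "***") -> str:
    """Replace each span with a placeholder.

    Spans are assumed not to overlap; they are processed right to left
    so that earlier offsets remain valid after each replacement.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


def covered_chars(spans: List[Span]) -> Set[int]:
    """Return the set of character positions covered by the spans."""
    return {i for start, end in spans for i in range(start, end)}


def char_recall_precision(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Character-level recall and precision of predicted spans against gold spans."""
    gold_chars = covered_chars(gold)
    pred_chars = covered_chars(predicted)
    overlap = len(gold_chars & pred_chars)
    recall = overlap / len(gold_chars) if gold_chars else 1.0
    precision = overlap / len(pred_chars) if pred_chars else 1.0
    return recall, precision


# Invented example sentence with hypothetical annotations.
text = "John Smith was treated in Oslo in 2019."
gold = [(0, 10), (26, 30), (34, 38)]   # name, place, year marked by an annotator
predicted = [(0, 10), (26, 30)]        # spans masked by some anonymisation model

masked = mask_text(text, predicted)
recall, precision = char_recall_precision(gold, predicted)
print(masked)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In a setting like this, recall is usually the more privacy-relevant figure, since every missed identifier remains in the text, while precision gives an indication of how much non-sensitive content was masked unnecessarily.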

The project sets out to develop new computational models and processing techniques to automatically anonymise unstructured data containing personal information, with a specific focus on text documents. The project's key idea is to combine approaches from natural language processing and data privacy to design a new generation of text anonymisation techniques that simultaneously:
- Take advantage of state-of-the-art natural language processing techniques (based on deep neural architectures) to derive fine-grained records of the individuals referred to in a given document;
- Connect these individual records to principled measures of disclosure risk and data utility, with the goal of modifying text documents in a way that prevents the disclosure of personal information while preserving as closely as possible the internal coherence and semantic content of the documents.

The project will also design dedicated evaluation methods to assess the empirical performance of text anonymisation mechanisms, and examine how these metrics are to be interpreted from a legal perspective, in particular with respect to how privacy risk assessments should be conducted on large amounts of text data. Finally, the project will investigate how these technological solutions can be integrated into organisational processes, in particular how quality control can be performed in direct interaction with text anonymisation tools, and how the level and type of anonymisation can be parametrised to meet the specific needs of the data owner.

To achieve these objectives, the project brings together a consortium of researchers with expertise in machine learning, natural language processing, computational privacy, statistical modelling, health informatics and IT law. In addition, external partners from the public and private sector (covering the fields of insurance, welfare, healthcare and legal publishing) will also contribute to the research objectives with their data and domain knowledge.

Funding scheme:

IKTPLUSS-IKT og digital innovasjon

NOK 2.6 billion in total funding in the programme period; 658 projects have received funding in the programme period; 8 sources have financed the programme.

Funding Sources

Kunnskapsdepartement; Justis- og beredskap; Kommunal-og distrikt; Samferdselsdeparteme; Diverse; Nærings- og fiskerid; Forsvarsdepartemente; Digitaliserings- og

Thematic Areas and Topics

Policy and administrative areas: Research; Digitalisation and use of ICT: eScience; Policy and administrative areas: Digitalisation; Policy and administrative areas: Public administration and governance; Digitalisation and use of ICT: Private sector; ICT research area: Artificial intelligence, machine learning and data analysis; Industries and sectors; Basic research; LTP3 ICT and digital transformation; Policy and administrative areas; Internationalisation: International project cooperation; LTP3 Research communities and talents; ICT research area: Humans, society and technology; Applied research; Portfolio: Innovation; Portfolio: Ground-breaking research; LTP3 Societal security and preparedness; LTP3 Enhanced competitiveness and innovation capacity; LTP3 High quality and accessibility; Digitalisation and use of ICT: Public sector; LTP3 Enabling and industrial technologies; Portfolio: Enabling technologies; Digitalisation and use of ICT; Internationalisation; ICT research area; ICT research area: Digital security; LTP3 A knowledge-intensive business sector throughout the country; Industries and sectors: ICT industry; Societal security; Portfolio: The research system; LTP3 Societal security, vulnerability and conflict; Portfolio: Democracy and global development; ICT

Machine Learning for the Anonymisation of Unstructured Personal Data

Alternative title: Maskinlæring for anonymisering av ustrukturert persondata

Awarded: NOK 16.0 mill.

Popular Science Description

Summary

Publications from Cristin

A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning

Generation of Replacement Options in Text Sanitization

The GDPR and Unstructured Data: Is Anonymisation Possible?

Bootstrapping Text Anonymization Models with Distant Supervision

Automatic Evaluation of Disclosure Risks of Text Anonymization Methods

Neural Text Sanitization with Explicit Measures of Privacy Risk

The text anonymization benchmark (TAB): A dedicated corpus and evaluation framework for text anonymization

Hva er universell utforming?

Utviklere av kunstig intelligens ber om klare rammer

Innspillsmøte om fremvoksende teknologier

Episode 5: Hva er språkteknologi (eller NLP)? Med Pierre Lison

Episode 6: Kan språkteknologi virkelig forstå språk? Med Ingrid Lossius Falkum og Pierre Lison

Panelsamtale om regulatoriske sandkasser som verktøy for digitalisering

Kan kunstig intelligens "forstå" språk?

Kunstig intelligens og personvern: et (u)lykkelig ekteskap?

Publishing Judgments in Europe: Publicity vs Privacy

Anonymisering av ustrukturerte data og utvikling av språkmodeller

Anonymization of sensitive information

Hva er egentlig kunstig intelligens – og hvor er fallgruvene?

Hva er egentlig maskinlæring – og kan robotene ta over jobbene våre?

Hvilket fremmedspråk bør man lære seg i Google-oversettelsenes tidsalder?
