IKTPLUSS-IKT og digital innovasjon

Machine Learning for the Anonymisation of Unstructured Personal Data

Alternative title: Maskinlæring for anonymisering av ustrukturert persondata

Awarded: NOK 16.0 mill.

Many public and private organisations struggle to manage the personal data they gather or produce. This data may relate to patients, customers, welfare recipients, or even defendants in court cases. Such databases are often highly valuable, both for the organisations themselves and for society at large. For instance, patient records are essential for biomedical research. Similarly, court cases constitute an important resource for the legal professions, and customer data can be used to improve a company's services and user experience. However, data that may reveal personal information about individuals must also comply with privacy and data protection laws, such as the General Data Protection Regulation (GDPR) newly introduced in Europe. In particular, personal data cannot be distributed to third parties (or even used for secondary purposes) without legal ground, such as the consent of the individuals to whom the data refers.

One solution is to use anonymisation techniques to protect the privacy of the registered individuals. However, current anonymisation methods do not work for unstructured formats such as text documents. This is unfortunate, as text documents constitute a large part of many data management systems (for instance, patient records consist mostly of text). Text anonymisation is therefore mostly done manually. This process, however, is very costly, prone to human errors and inconsistencies, and difficult to scale to large numbers of texts. The CLEANUP project seeks to address this technological gap and develop new machine learning models to automatically anonymise text documents. The project also designs new methods to evaluate the quality of text anonymisation techniques and connect these metrics to legal requirements. Finally, the project investigates how these technical solutions can be integrated into organisational processes, in particular how to perform quality control and adapt the anonymisation to the specific needs of the data owner.

During the first two years of the project, we worked on several fronts. We have collected various data sources, including court decisions and patient records. We have had a strong focus on the development of a new corpus called TAB (Text Anonymisation Benchmark), which has been annotated manually by law students. The corpus comes with new evaluation methods that can be used to automatically assess anonymisation quality. We have also worked on analysing and comparing existing methods, and on developing new anonymisation models that do not depend on annotated data.

The project sets out to develop new computational models and processing techniques to automatically anonymise unstructured data containing personal information, with a specific focus on text documents. The project's key idea is to combine approaches from natural language processing and data privacy to design a new generation of text anonymisation techniques that simultaneously:

- Take advantage of state-of-the-art natural language processing techniques (based on deep neural architectures) to derive fine-grained records of the individuals referred to in a given document;
- Connect these individual records to principled measures of disclosure risk and data utility, with the goal of modifying text documents in a way that prevents the disclosure of personal information while preserving as closely as possible the internal coherence and semantic content of the documents.

The project will also design dedicated evaluation methods to assess the empirical performance of text anonymisation mechanisms, and examine how these metrics are to be interpreted from a legal perspective, in particular with respect to how privacy risk assessments should be conducted on large amounts of text data. Finally, the project will investigate how these technological solutions can be integrated into organisational processes, in particular how quality control can be performed in direct interaction with text anonymisation tools, and how the level and type of anonymisation can be parametrised to meet the specific needs of the data owner.

To achieve these objectives, the project brings together a consortium of researchers with expertise in machine learning, natural language processing, computational privacy, statistical modelling, health informatics and IT law. In addition, external partners from the public and private sector (covering the fields of insurance, welfare, healthcare and legal publishing) will also contribute to the research objectives with their data and domain knowledge.

Publications from Cristin

No publications found

Funding scheme:

IKTPLUSS-IKT og digital innovasjon