WebData will make research on the Norwegian web possible, in the same way that research on physical material can be conducted using services like Nettbiblioteket and DH-lab from the National Library of Norway. WebData will thus give researchers access to data that is not readily available in a research infrastructure today, but that is sorely needed.
The National Library of Norway has harvested data from the Norwegian web since the late 1990s, resulting in vast amounts of data. However, researchers have not been able to access this material due to strict legal requirements on the processing of personal information in such archives. At the same time, public discourse has to a large degree moved to the web, which makes access to data from the web more important. Furthermore, web data plays a crucial role in artificial intelligence and large language models. If Norwegian and Sámi are to survive in the digital age, we need substantial amounts of high-quality data to support language technology.
WebData therefore aims to build a platform for research on material from the web, within the current legal framework. We will grant open access to material published by public entities, and research access to material both with and without an editor-in-chief. We will apply existing knowledge and tools to identify and sanitize text containing personal identifiers. In this way, we will be able to build a system for secure access.
The platform will be built in close cooperation with the research community. An initial needs assessment will lay the foundation for the platform's functionality (e.g., visualization of language use over time, event extraction) and, in turn, inform how the data should be annotated. One of the project's goals is to strengthen the representation of Sámi languages. We will carry out a representation study to map the coverage of Sámi in the web archive and apply measures to increase the harvesting of Sámi-language content.
The National Library of Norway, a major cultural heritage institution in Norway, joins forces with some of the most prominent research communities for language technology in Norway to create WebData, a national research infrastructure for web data. WebData will offer researchers access to the Norwegian Web Archive, hosted on-premises at the National Library of Norway. The infrastructure will first and foremost consist of a data platform featuring a general-purpose search interface for web data from the Norwegian web (the .no top-level domain) from the last 25 years, allowing researchers to search and explore web pages, documents, transcribed audio/video and images.
The platform will implement a model of layered access, using automatic categorization and identification of personal information to open parts of the collection that would otherwise be closed due to regulatory policies. Doing this reliably is a major R&D challenge and one of the main reasons why the material is not available today. The web archive will further be scaled up according to the needs of researchers and underrepresented communities, e.g. by allowing for quantitative analysis of web data and by increasing the coverage of Sámi web content. The infrastructure will contribute to research on Norwegian and Sámi language and culture and produce language resources, e.g. corpora for large language models, that help prevent domain loss for these languages.
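To make the idea of layered access concrete, the following is a minimal Python sketch, not the project's actual design. It assumes that an upstream classifier has already flagged whether a page comes from a public entity, and it stands in for the identification of personal information with two illustrative regular expressions (e-mail addresses and 11-digit numbers, the shape of a Norwegian national identity number). The names `Page`, `Access`, `sanitize` and `assign_access` are hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"          # available to everyone
    RESEARCH = "research"  # available to approved researchers only


# Illustrative patterns only (an assumption of this sketch): real detection
# of personal information needs far more than regular expressions.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
ELEVEN_DIGITS = re.compile(r"\b\d{11}\b")


@dataclass
class Page:
    url: str
    text: str
    public_entity: bool  # hypothetical flag set by an upstream classifier


def sanitize(text: str) -> tuple[str, int]:
    """Mask matched identifiers; return the masked text and the hit count."""
    text, n_email = EMAIL.subn("[EMAIL]", text)
    text, n_id = ELEVEN_DIGITS.subn("[ID]", text)
    return text, n_email + n_id


def assign_access(page: Page) -> tuple[Access, str]:
    """Pick an access layer from the publisher category and mask the text."""
    masked, _hits = sanitize(page.text)
    layer = Access.OPEN if page.public_entity else Access.RESEARCH
    return layer, masked


if __name__ == "__main__":
    page = Page("https://example.no/", "Kontakt ola@example.no.", True)
    print(assign_access(page))
```

In this sketch the access layer depends only on the publisher category, while masking is applied to every page regardless of layer. A real system would likely combine several signals and keep the unmasked original under stricter control.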
The project is highly interdisciplinary and is aimed primarily at researchers in the social sciences and humanities, but it will also be relevant to fields such as computer science, medicine and law.