The project's main goal is to create an AI-based system that can automatically transcribe historical handwriting by Norwegian writers, including writers the system has not seen during its training phase. This goal is closely aligned with the Norwegian National Library's role as a hub for handwriting recognition for the Norwegian GLAM (Galleries, Libraries, Archives and Museums) sector.
Despite major advances in artificial intelligence, computational linguistics and neural networks, no such general system of acceptable quality exists for Norwegian. The existing systems are specialized: they recognize handwriting with sufficient quality only for writers included in the training set.
Intermediate goals are to further improve the specialized recognition for writers in the training set, to increase the number of writers in the training set, and to automate the training process as much as possible.
The following steps will be used to achieve the goals:
-Building on existing systems, develop a robust layout analysis system (i.e. one that finds text lines) that can adapt to a new writer's style.
-Use and adapt state-of-the-art neural network technology for character recognition.
-Apply advanced linguistics for historical Norwegian to improve the recognition.
-Incorporate novel techniques, such as generating artificial documents that mimic a writer's handwriting (using generative adversarial networks, GANs) but have known content, so that they can be used for training without any manual effort. Also use a trainable feature-based method ("zero-shot word spotting") to recognize words and augment the results of the other processing steps.
-Generate a large training set covering a diverse range of writing styles, while minimizing the manual effort needed for transcription.
The project will place great emphasis on testing and on analysing test results, feeding the findings back into development in order to track progress and identify issues that need special attention.
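One common way to quantify transcription quality when analysing test results is the character error rate (CER): the edit distance between the reference transcript and the system output, divided by the length of the reference. The project description does not state which metric is used, so the following is only a minimal sketch that assumes CER; the function names are hypothetical and chosen for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning reference into hypothesis."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Computed per writer or per document type, a figure like this would make it possible to compare writers inside and outside the training set, which is the distinction the project goals turn on.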
The Work Plan for 2021 has been completed and includes the above as planned.
A digitized document is essentially a visual representation that can be read only by humans. To permit computational analysis, the information in the document must be made machine-readable. For printed documents this is standard procedure during digitization, using Optical Character Recognition (OCR). Even though today's automatic handwritten text recognition (HTR) systems can produce transcripts usable for further processing, such as indexing or natural language processing, they are still not part of standard digitization procedures. The reason is that creating samples representative of the vast diversity of documents and handwriting styles would require annotating unrealistically large numbers of documents, even for relatively small collections.
The overall aim of the HUGIN-MUNIN project is to develop technological solutions that will enable the use of HTR without requiring massive manual annotation and model training. The solutions developed will go beyond traditional supervised machine learning by using ideas from active learning, unsupervised learning, transfer learning, and zero-shot learning. They will also leverage natural language processing resources recently developed for Norwegian.
The impact of the project could be very significant, as the National Library acts as a digitization hub for the Norwegian LAM sector. The project will significantly increase the scope and variety of sources available for data-driven research on Norwegian culture and society. It will also democratize access to knowledge by enabling the public to read documents that have so far been accessible mainly to domain experts and scholars.
The project involves close interdisciplinary collaboration, both nationally and internationally. This will expand Norwegian experience and competence in AI and autonomous systems and enhance the innovative potential of the Norwegian LAM sector.