BIA-Brukerstyrt innovasjonsarena

Lumex Optical Character Recognition

Alternative title: Lumex automatisk tekstgjenkjenning

Awarded: NOK 4.1 million

There is a demand for converting large volumes of historical documents into searchable, editable text. This goes beyond digitization, which in its simplest form only means creating digital images of the documents. The conversion process is called Optical Character Recognition (OCR). OCR of historical documents is difficult because of low contrast, ink and paper that have deteriorated with age, and the use of many special fonts. Post-correction by dictionary lookup is a widely used method for improving character recognition. It too is more demanding for historical documents, which may be written in older forms of a language with special words and spellings.

An additional problem in many historical documents is that non-textual elements, or "clutter", may cover letters. Examples include stamps, free-hand lines, form lines that are misplaced relative to the text, and microfilm scratches. Clutter creates great problems for text detection, layout analysis, and recognition of the characters in the text.

Through this project Lumex finds solutions to this problem. Lumex's basic algorithms have been further developed to detect text hidden behind clutter, including lookup in special dictionaries. The Lumex recognition enhancement algorithms are self-adaptive and use input from an initial recognition to build models of the characters in the document. The model initiation algorithms have been modified so that clutter does not introduce noise into the models. Character segmentation (i.e. the correct division of a word into separate characters) poses special challenges in cluttered documents. Advanced segmentation routines have been developed, also using more advanced linguistic methods, including lookup on the internet. An important element of the processing is the detection and precise localization of the clutter. The Norwegian Computing Centre (NCC) has developed algorithms for this purpose.
The effectiveness of these algorithms has been tested, both directly, where the clutter location is exactly known, and through its influence on layout analysis (where the text lines are found and the text structure is extracted) and on character recognition, both in commercial OCR software and in the Lumex recognition enhancement. The NCC clutter localization routines have been continuously improved while their execution time has been reduced.

Advanced testing software has been developed, including the generation of synthetic documents with specified noise and clutter levels. Ground truth (e.g. manually proofread text) has been produced for a number of real documents. Test tools that include processing with state-of-the-art commercial OCR software (FineReader) have been made. The initial result from FineReader is enhanced with the Lumex recognition software adapted for clutter. The location of the clutter is estimated using algorithms and software developed by NCC. This setup, in which commercial OCR software is enhanced with clutter detection and character recognition software, could be used in industry workflows.

Once the clutter has been detected, it is possible to try to remove it. This has been done in order to improve layout analysis by commercial OCR software. Tools for testing the quality of the layout analysis performed by commercial OCR software have been made. The influence of clutter, and the improvement obtained with different uses of the NCC clutter localization routines optimized for layout analysis, have been documented.

It is possible to use linguistic methods such as dictionary lookup and searches in a phrase database to recognize words even if some of the letters are completely hidden by clutter. A report has been written that shows good test results with linguistic methods alone and thereby measures the effectiveness of these methods. NCC has written a report on the use and effectiveness of its clutter localization routines for layout analysis and recognition.
NCC has also made a version of the clutter localization routines for use by Lumex, which is part of the program suite for dealing with cluttered documents.
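
The idea of recognizing a word by dictionary lookup when some of its letters are hidden by clutter can be illustrated with a minimal sketch. This is not the Lumex implementation. The function name `candidates`, the `?` placeholder for a hidden letter, and the small word list with frequency counts are all assumptions made for illustration.

```python
import re

def candidates(pattern, lexicon, unknown="?"):
    """Return lexicon words that fit a partially read word.

    `pattern` is the OCR reading, with `unknown` marking each letter
    hidden by clutter. `lexicon` maps words to frequency counts
    (hypothetical data). Matches are ranked by descending frequency.
    """
    # Each hidden letter may be any single character; known letters must match exactly.
    regex = re.compile(
        "".join("." if ch == unknown else re.escape(ch) for ch in pattern)
    )
    matches = [word for word in lexicon if regex.fullmatch(word)]
    return sorted(matches, key=lambda word: -lexicon[word])

# Hypothetical lexicon with invented frequency counts.
lexicon = {"kirke": 120, "kirken": 95, "korke": 3, "klokke": 40, "prest": 60}

print(candidates("k?rke", lexicon))
```

With this toy lexicon, the reading `k?rke` yields two candidates, and the more frequent word is ranked first. A real system would presumably combine such a ranking with the recognizer's own confidence scores and with surrounding context, for example the phrase database mentioned above.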

Our main objective is the development of software for performing OCR on documents containing stamps, ink stains, underlinings, etc. Such clutter is very often present in heritage documents. It is not handled well by existing OCR software and constitutes a large problem in OCR of these documents. We will establish a ground truth database and, using this as a reference, develop new algorithms for detection and precise localization of clutter, as well as a measure of the quality of the clutter detection. We will also address the problem of document layout analysis in the presence of clutter. Knowing the position of the clutter, we will develop new OCR methods for use on the text affected by it. Finally, we will develop linguistic approaches to post-correction for use in text sections containing clutter.

There are several significant research challenges in this project. It is critical to know where the clutter is located. Surprisingly little research has been performed in this field, and completely new approaches must be developed. Knowing the position of the clutter, one can tailor the character and word recognition mechanisms to the effect the clutter has on the document; however, this remains very challenging. We will also develop linguistic methods tailored to the post-correction of errors introduced by the clutter. Finally, we address the analysis of the document layout in the presence of clutter.

There is a huge need for improved methods for OCR of heritage documents. A large effort is currently under way to digitize cultural heritage documents, but the full potential of this effort cannot be reached before good OCR results can be obtained. When such results are available, they will facilitate documentation and research and increase public awareness of cultural heritage. The market is significant, and the economic incentive for performing research in this domain is strong.
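
The project text does not say which measure is used for the quality of clutter detection. As one possible illustration, the sketch below compares a detected clutter mask with a ground truth mask pixel by pixel and reports precision, recall and intersection over union. The function name `mask_scores` and the tiny example masks are assumptions, not part of the project's software.

```python
def mask_scores(detected, truth):
    """Compare two binary masks given as equally sized lists of rows.

    A value of 1 marks a clutter pixel and 0 a non-clutter pixel.
    Returns (precision, recall, intersection over union).
    """
    tp = fp = fn = 0
    for detected_row, truth_row in zip(detected, truth):
        for d, t in zip(detected_row, truth_row):
            if d and t:
                tp += 1  # clutter pixel correctly detected
            elif d:
                fp += 1  # detected, but not clutter in the ground truth
            elif t:
                fn += 1  # clutter in the ground truth that was missed
    # Empty denominators are treated as a perfect score by convention here.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, iou

# Hypothetical 2 x 4 masks: the ground truth clutter is a 2 x 2 block.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
detected = [[0, 1, 1, 1],
            [0, 0, 1, 0]]

print(mask_scores(detected, truth))
```

In this example three of the four clutter pixels are found, with one false detection and one miss. A pixel-wise measure like this needs ground truth masks, which is consistent with the plan to establish a ground truth database as a reference.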

Funding scheme:

BIA-Brukerstyrt innovasjonsarena