EUROSTARS-EUROSTARS

E!11454 ZoneMaster Advanced Document Layout Analysis

Alternative title: ZoneMaster avansert layoutanalyse

Awarded: NOK 6.0 mill.

Project Number:

277041

Project Period:

2017 - 2020

In Optical Character Recognition (OCR), an image of a text is converted to a searchable and editable text file. The result of the conversion can also contain formatting, as in an MS Word file. Layout analysis, which includes finding text lines and sections in a text image, is an essential part of the OCR process. Although OCR technology has been under development for more than 50 years, layout analysis is still not satisfactorily resolved. In many documents, text lines are incorrectly linked or fragmented, which puts the words of the text in the wrong order. Even more serious are cases where text is not located at all, or where non-text items, such as noise or graphics, are taken as text. The layout analysis should also find the text flow from section to section, which can be very demanding. Commercial OCR software has problems with layout analysis for historical documents with a lot of noise, unusual fonts, curved text lines and tight columns, but also for modern documents with complex layouts. In general, layout analysis of tables in documents is also not satisfactorily resolved.
Because the outcome of layout analysis from commercial software is uncertain, a lot of manual correction is needed. This is costly and time-consuming, but if it is omitted, the OCR result will be less useful. There is therefore a good market for better layout analysis. The partners in this project, Lumex AS, Skilja GmbH (Germany) and PRImA Research (UK), have long experience with OCR in general and layout analysis in particular. In the project, new layout analysis software will be developed that will solve the problems mentioned above and minimize the need for manual correction. More traditional methods of layout analysis will be combined with modern technology such as deep learning, and text analysis will be used to find the text flow. The project has made promising prototypes that already perform on par with or better than commercially available alternatives.
Further improvements to the algorithms are planned to raise performance. A special solution to the problem of curved text lines has been made. The solution can also straighten the text lines, which improves character recognition, as OCR engines generally perform poorly on curved lines. A solution that is robust to "clutter" such as underlinings and strikeouts has also been made. Such clutter can occur because the content of a form or a table is offset with regard to the frames, or because there are handwritten markings. The results of the layout analysis are the input to character and word recognition, and it is important that this input is given in a way that yields the best possible recognition. Adaptations of the input for the two most used OCR engines for historical documents have therefore been made.
The project is now successfully completed. Internal and external tests have shown that the software developed in the project, ZoneMaster, generally performs better at layout analysis on historical documents than the best commercially available solutions. A result of the project that is very interesting from both a technical and a commercial viewpoint is the pairing of ZoneMaster with the open source OCR program Tesseract4. Tesseract4 uses deep learning to achieve very high precision in character recognition, but the quality of Tesseract4's internal layout analysis is insufficient in many cases. The combination of ZoneMaster and Tesseract4 should be a very attractive commercial product.

Achieved outcomes:

A fully automatic layout analysis software prototype that has superior performance on historical documents compared to the best commercially available alternatives.

A very interesting integration of the layout software with the open source deep learning OCR software Tesseract4 that achieves very good OCR results.

Potential outcomes:

A commercially successful layout tool integrated with open source OCR software.

With further development, superior layout analysis performance for all kinds of documents, including modern ones.

Integration with Lumex OCR-enhancing software to increase recognition further.

Trillions of documents are digitized every year through a process in which one phase, layout analysis or ZONING, still relies on manual intervention. Zoning, which precedes OCR and all content classification in the digitization process, is imperative for the result to be usable at all. No reliable tool exists today to automatically zone various document types, meaning that costly human intervention is always required. This consortium proposes a software concept which will resolve this problem.

Funding scheme:

EUROSTARS-EUROSTARS