Since its inception, Lumex AS has been run as a privately financed research project with the aim of expanding the boundaries of OCR technology. So far, more than NOK 50 million has been invested. The project has resulted in a number of ground-breaking new algorithms that could change the market's view of what capture software should be able to do.
Below is a list of conditions where Lumex delivers significant improvement over state-of-the-art technology:
- Typewritten material of poor quality: all kinds of old and degraded documents produced by typewriters.
- Documents with blackletter fonts such as Gothic/Fraktur.
- Documents in languages with special letters, such as ‘æ’, ‘ø’ and ‘å’ in Danish and Norwegian.
- Documents where artifacts such as stamps, watermarks, handwritten text or manual underlining mask the text.
- Digitization of documents stored on microfilm/microfiche.
- Improved tolerance for other practical issues that contribute to poor results with current OCR technologies.
Below is an example of corrected characters in an old newspaper. The red boxes show the corrected letters.
Adaptive character models
The Lumex technology builds and verifies adaptive character models based on an initial recognition, possibly from a standard OCR engine. This approach requires only self-similarity within the document. It works on all printed and typed fonts and is very tolerant of noise, low-quality print and errors in the initial recognition.
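As a rough illustration of the idea (not the Lumex implementation), adaptive models can be sketched as averaging the glyph images that share an initial label and then re-labelling every glyph by its closest template. The tiny 3x3 bitmaps, the function names and the squared-difference distance below are all assumptions chosen for the example.

```python
from collections import defaultdict

def build_templates(glyphs, labels):
    """Average all glyph bitmaps that share an initial label into one template."""
    groups = defaultdict(list)
    for glyph, label in zip(glyphs, labels):
        groups[label].append(glyph)
    templates = {}
    for label, members in groups.items():
        count = len(members)
        size = len(members[0])
        templates[label] = [sum(m[i] for m in members) / count for i in range(size)]
    return templates

def distance(glyph, template):
    """Sum of squared pixel differences (an assumed, very simple metric)."""
    return sum((g - t) ** 2 for g, t in zip(glyph, template))

def reclassify(glyphs, templates):
    """Assign each glyph the label of its nearest template."""
    return [min(templates, key=lambda lab: distance(g, templates[lab])) for g in glyphs]

# Flattened 3x3 bitmaps (1 = ink); hypothetical shapes for 'i' and 'l'.
I = [0, 1, 0, 0, 1, 0, 0, 1, 0]
L = [1, 0, 0, 1, 0, 0, 1, 1, 1]

glyphs = [I, I, I, L, L, L, I]
initial = ["i", "i", "i", "l", "l", "l", "l"]  # last label is an OCR error

templates = build_templates(glyphs, initial)
corrected = reclassify(glyphs, templates)
print(corrected)  # ['i', 'i', 'i', 'l', 'l', 'l', 'i']
```

Because each template is an average over many samples from the document itself, a single wrong initial label only blurs the template slightly, and the mislabelled glyph still ends up closest to the correct class.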
Smart dictionary and phrase look-up
By analyzing word frequency and using fast look-up in large phrase databases and dictionaries (more than 13 million words), the Lumex smart dictionary look-up improves recognition. Special dictionaries can be created and modified from text input. Direct internet look-up using a patented method is also possible.
Words are verified by gap/overlap filtering, optimal difference analysis and dictionary and phrase look-up (tunable according to document quality).
New character classes can be found iteratively by combining the results of dictionary/phrase look-up with template analysis.
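A frequency-weighted dictionary look-up can be sketched as follows. This is a minimal illustration, not the Lumex method: it assumes the recognizer supplies a short list of alternative characters for each position, and `WORD_FREQ` is a tiny hypothetical frequency table standing in for a real dictionary.

```python
from itertools import product

def best_word(alternatives, word_freq):
    """Pick the most frequent dictionary word among all character combinations.

    alternatives: one string per character position, holding the candidate
    characters for that position (an assumed input format).
    """
    candidates = ("".join(chars) for chars in product(*alternatives))
    known = [word for word in candidates if word in word_freq]
    if not known:
        return None  # nothing verified; leave the decision to other checks
    return max(known, key=lambda word: word_freq[word])

# Hypothetical word counts.
WORD_FREQ = {"oil": 120, "ail": 15, "all": 900}

# Position 1 is 'o' or 'a', position 2 is 'i' or 'l', position 3 is 'l'.
print(best_word(["oa", "il", "l"], WORD_FREQ))  # all
```

Enumerating every combination is only practical for short words with few alternatives; a production system would need an indexed look-up structure for dictionaries of millions of words.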
Optimal difference analysis
Lumex's patented difference analysis subtracts aligned template images to automatically find where confusion alternatives differ (as in the example figure for an ‘i’ and an ‘l’). This is used to find optimal difference criteria.
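The subtraction step can be illustrated with a small sketch. It assumes the two templates are already aligned and stored as flattened bitmaps of equal size; the function names, the threshold and the 3x4 shapes are hypothetical and are not taken from the patented method.

```python
def difference_mask(template_a, template_b, threshold=0.5):
    """Mark the pixels where two aligned templates differ noticeably."""
    return [abs(a - b) >= threshold for a, b in zip(template_a, template_b)]

def decide(glyph, template_a, template_b, label_a, label_b, threshold=0.5):
    """Choose between two confusable labels using only the differing pixels."""
    mask = difference_mask(template_a, template_b, threshold)
    err_a = sum((g - a) ** 2 for g, a, m in zip(glyph, template_a, mask) if m)
    err_b = sum((g - b) ** 2 for g, b, m in zip(glyph, template_b, mask) if m)
    return label_a if err_a <= err_b else label_b

# Flattened 3x4 bitmaps (3 columns, 4 rows; 1 = ink).
TEMPLATE_I = [0, 1, 0,  0, 0, 0,  0, 1, 0,  0, 1, 0]  # dot, gap, stem
TEMPLATE_L = [0, 1, 0,  0, 1, 0,  0, 1, 0,  0, 1, 0]  # unbroken stem

# A noisy 'i': two stray ink pixels, but the gap under the dot is intact.
noisy = [1, 1, 0,  0, 0, 0,  0, 1, 1,  0, 1, 0]

mask = difference_mask(TEMPLATE_I, TEMPLATE_L)
differing = [index for index, flag in enumerate(mask) if flag]
print(differing)  # [4]
print(decide(noisy, TEMPLATE_I, TEMPLATE_L, "i", "l"))  # i
```

Restricting the comparison to the pixels that actually distinguish the two classes means that noise elsewhere in the glyph does not influence the decision.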
Detecting partly hidden characters
Lumex has a patented method for finding partly obscured (hidden) characters. The occlusion can be a result of stamps, manual underlining or doodling, watermarks, paper creases or microfilm scratches.
The figure below shows detected characters (correct ones in green) partially hidden behind a “COPY” stamp.
Detection of overlapping characters
Overlapping characters, whether deliberate (as in ligatures) or caused by printing or typing errors, can be detected by Lumex's patented method of combining templates.
Enhanced Word Recognition and Correlation Metric
One of the most common errors in a standard OCR process is the incorrect splitting of words into letters. Lumex Enhanced Word Recognition eliminates this splitting problem by using smart template analysis with gap/overlap filtering to identify the most likely recognition alternatives.
Lumex's patented method for dewarping text lines can be used to straighten text lines in photographed text, in scans of non-planar surfaces, or where the lines are curved in the original image.
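One simple way to picture dewarping is as a per-column vertical shift. The sketch below is a generic illustration, not the patented method. It assumes a baseline row has already been estimated for every pixel column (the hypothetical `baseline` list) and moves each column so that all baselines land on the same row.

```python
def dewarp_columns(image, baseline):
    """Shift each pixel column vertically so its baseline ends up on one row.

    image: list of rows, each a list of 0/1 pixels (1 = ink).
    baseline: estimated baseline row index for every column (assumed given).
    """
    height, width = len(image), len(image[0])
    target = max(baseline)  # align everything to the lowest baseline
    out = [[0] * width for _ in range(height)]
    for x in range(width):
        shift = target - baseline[x]
        for y in range(height):
            new_y = y + shift
            if 0 <= new_y < height:  # pixels shifted off the image are dropped
                out[new_y][x] = image[y][x]
    return out

# A three-column "text line" that bulges upwards in the middle.
curved = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
straight = dewarp_columns(curved, baseline=[1, 0, 1])
print(straight)  # [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
```

Estimating the baseline itself, for example by fitting a smooth curve through the bottoms of the detected characters, is the harder part and is left out here.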
Local adaptive binarization
Lumex has a patented method for local adaptive binarization that works even with large contrast variations, as shown in the figure below.
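To show why a local threshold helps when contrast varies across the page, here is a minimal sketch using plain local-mean thresholding. This is a well-known generic technique substituted for illustration, not the Lumex method; the `radius` and `offset` parameters and the sample grey values are assumptions.

```python
def local_mean_binarize(image, radius=1, offset=10):
    """Mark a pixel as ink (1) if it is clearly darker than its neighbourhood mean.

    image: list of rows of grey values (0 = black, 255 = white).
    radius: half-width of the square window around each pixel.
    offset: margin that keeps flat background areas from turning into ink.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in rows for i in cols]
            mean = sum(window) / len(window)
            out[y][x] = 1 if image[y][x] < mean - offset else 0
    return out

# One scan line whose background darkens from left (220) to right (100).
# Ink values are 120 on the bright side and 30 on the dark side, so no
# single global threshold separates ink from background.
scan_line = [[220, 120, 220, 160, 100, 30, 100]]
binary = local_mean_binarize(scan_line, radius=1, offset=10)
print(binary)  # [[0, 1, 0, 0, 0, 1, 0]]
```

In this sample the ink on the bright side (120) is lighter than the background on the dark side (100), which is exactly the situation where a global threshold fails and a locally computed one succeeds.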
The core technologies used in our products are covered by an extensive portfolio of 7 patent families, with some 38 individual patents in many countries around the world. New ideas and developments will add to the portfolio, and our OCR know-how will drive further innovation.
- The patent portfolio has broad technical applicability, and can support entry into many different markets.
- Non-core patents in our portfolio will be spun off. We are currently seeking partners for one such patent, and will use the patent portfolio actively in the partnering process.