Google Promotes Open Source OCR Library
By Scott M. Fulton, III | Published September 5, 2006, 5:43 PM
"You might wonder," reads a Google corporate blog post this morning, "why Google is interested in [optical character recognition]." Indeed, you might wonder that if you didn't already know that Google has been deeply involved with an on-again/off-again project to produce a digital library of the world's literary material.
Although the future of the project remains up in the air, work continues on one of the technical prerequisites to making such a library possible: a project called Tesseract, begun in 1985 at the University of Nevada at Las Vegas. The school worked with HP to construct a reliable OCR system that works with all manner of printed text.
As the World Wide Web started to take root, Tesseract began losing ground, perhaps mainly due to the reorganization of HP from a research company to a consumer products firm. In 2005, Google apparently made a successful case for UNLV to release Tesseract into open source.
With Google contributing some of its resources toward updates and corrections, the company sponsored the release of a new version of Tesseract last month. But software announcements being what they are in the modern era, they sometimes need to be re-announced, which is why Google stepped up its efforts this morning to make developers aware of Tesseract's availability.
What isn't obvious at first glance is that Tesseract is an application of a neural networking library. Specifically, it implements a system called Aspirin/MIGRAINES, developed by long-time neural network simulator engineer Russell Leighton, and licensed for free although not open-sourced.
For years, neural networks have been regarded as among the most effective pattern recognition systems, and have thus been applied to OCR. Because so few people understand what neural networking truly is, many of the applications that use it -- including financial analysis -- don't admit up front to doing so.
For the Aspirin system, Leighton implemented a back-propagation network, which learns to recognize patterns through repeated presentation of examples, analysis of the result, then trial-and-error correction. MIGRAINES serves as the visualization environment for Aspirin developers.
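To make the idea of back-propagation concrete, here is a minimal sketch in Python. It is not code from Aspirin/MIGRAINES or from Tesseract; the class name TinyNet, the helper functions, the network size, the learning rate, and the XOR training data are all assumptions chosen for illustration. The network is shown each example repeatedly, compares its output with the expected answer, and nudges its weights to shrink the error.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """Hypothetical one-hidden-layer network trained with back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden unit has n_in weights plus a bias (last entry).
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        # The single output unit has n_hidden weights plus a bias.
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out[:-1], hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, rate=0.5):
        hidden, out = self.forward(x)
        # Error signal at the output (squared error, sigmoid derivative).
        delta_out = (out - target) * out * (1.0 - out)
        # Propagate the error back to each hidden unit.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        # Update output weights and bias.
        for j, h in enumerate(hidden):
            self.w_out[j] -= rate * delta_out * h
        self.w_out[-1] -= rate * delta_out
        # Update hidden weights and biases.
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                ws[i] -= rate * delta_hidden[j] * v
            ws[-1] -= rate * delta_hidden[j]


def mean_squared_error(net, samples):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# XOR: a classic pattern that a network without a hidden layer cannot learn.
SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def train(epochs=5000, seed=0):
    net = TinyNet(n_in=2, n_hidden=4, seed=seed)
    before = mean_squared_error(net, SAMPLES)
    for _ in range(epochs):
        for x, target in SAMPLES:
            net.train_step(x, target)
    after = mean_squared_error(net, SAMPLES)
    return net, before, after
```

Calling train() returns the trained network along with its mean squared error before and after training, so the effect of the repeated trial-and-error passes can be compared directly. A real OCR classifier would take pixel or feature data as input and have one output per character class, but the learning loop would follow the same pattern.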
The dependency of Tesseract on Aspirin may make it difficult for open-source developers to sublicense the products of their work to other developers. Aspirin is not licensed under the usual Apache terms; its terms are stated separately.
This isn't much of a problem for Google, though, which for now is mainly interested in seeing developers help perfect Tesseract for its own purposes. To that end, it has put out a call for OCR engineers to join the company.
For now -- even after 21 years -- the Tesseract project appears plagued by the same problem that has baffled OCR engineers working with neural networks since the beginning. Judging from comments on SourceForge, even though Tesseract remains the best-performing OCR system ever developed, according to UNLV metrics, it still has trouble with diacritical marks such as accents and umlauts.
Typically, once diacriticals become a part of text, they impair the analytical system's ability not only to distinguish accented characters from non-accented ones, but ordinary characters from one another. During much of the 1990s, tests on OCR systems were conducted using English-language text, which is most often umlaut-free.
Good news! One of the things the free software world badly needs.
Score: 0
|Google owns.
Score: 0
|Hey .. Google ... if you REALLY want to foster this, you need to drop a Win32 out there, something with the quality of Picasa 2, that provides this functionality.
Most users (you know, the people you probably REALLY want to target) have no use for your API.
Score: 0
|Would be great if someone could come up with a Windows GUI-based OCR engine that just runs OCR against a set of images and outputs the results to the same folder or another one.
A dumb OCR without the bells and whistles that can produce 5 - 10k an hour.
Would be very useful to our banking and financial industry.
Score: 0
|http://www.adobe.com/products/server/docgen.html
The banking and financial industry has the money: pony up.
Score: 0
|Current version of the source fails to compile in VC++2k5. Too many errors I don't know how to deal with.
Score: 0
|FWIW, it did not compile in gcc/g++4.0, but did in gcc/g++3.4. I think there is a bit of messy code in there, perhaps some typecasts that shouldn't be made...
Score: 0
|There's a couple of tweaks you need to do for VS2005. For the PINT8/INT8 lines replace them with the code here:
http://sourceforge.net/f...506&forum_id=534360
You'll also find the code for the pow conversion error but you basically need to explicitly cast the operands.
You'll also be missing getopt.cpp and mfcpch.cpp in the filesystem. They should both just contain:
#include "mfcpch.h"
#include "getopt.h"
Lastly, change the project properties, C/C++, Precompiled Headers, Create/Use Precompiled Header to /Yc
Score: 0