DOCUMENT IMAGING REPORT
September 28th, 2012
There is no question that ISVs are currently trying to go where no capture software has gone before in applying automatic recognition technology to documents. Over the past couple of issues, we’ve covered topics like artificial intelligence, semantics, and advanced analytics, all designed to take applications beyond the capabilities of current capture software. However, an appropriately named start-up out of Germantown, TN (near Memphis), may have beaten many established capture players to the mark.
Using a combination of advanced pattern recognition and computer vision, BeyondRecognition recently completed a project in which it successfully indexed 2.3 billion images that originally contained no metadata. Yes, working as a contractor for an energy company, BeyondRecognition was presented with 27,000 CDs and DVDs full of images that covered a timeframe of roughly 90 years and were created in different locations across the world.
“There were no boundaries separating the scanned images,” said John Martin, founder and CEO of BeyondRecognition. “The only thing we knew was that the documents on the discs in the front of each box were scanned before the documents on the discs in the back. Our client wanted to be able to mine the data on all these documents.”
According to Martin, he looked at everything available on the market for accomplishing the task at hand. “I looked at traditional OCR applications, but they were a bust—even with voting engines, which would just have made the process five or six times longer,” he said. “Even if we could have applied full-text OCR, conventional search engines could not do the things we wanted. In addition, traditional relational databases couldn’t support the millions of many-to-many relationships we had to set up.”
To build the application that eventually became the cornerstone of BeyondRecognition, Martin applied a process he called “negative learning.” “Basically, I started with no preconceived ideas about how automatic recognition was being done,” he said. “Instead, I tried to take the position that if I were to build a recognition solution with tools available today, how would I approach it?”
The guts of the solution
Martin started with the basic premise that computers are good at working with numbers. “Based on that, we were able to identify one of the key flaws of traditional OCR—it attempts to read characters like humans do,” he said. “It goes left to right, top to bottom, first page to last.
“And it attempts to recognize each character individually. Think about that from a statistical standpoint. On each page, a single lowercase character might appear 40 to 50 times. So, on a million pages, that character could show up as many as 50 million times. Basically, with traditional OCR, you’re giving software 50 million chances to get it wrong. Statistically, that means it’s certainly going to make at least one mistake.”
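Martin's statistical point can be made concrete with a short calculation. The sketch below assumes every occurrence of a character is read independently with the same accuracy. The 99.99% per-character accuracy is a hypothetical figure chosen for illustration, not one quoted by BeyondRecognition.

```python
import math


def prob_at_least_one_error(per_char_accuracy: float, occurrences: int) -> float:
    """P(at least one misread) = 1 - accuracy**occurrences, assuming independent reads."""
    # Work in log space: accuracy**occurrences underflows for large counts,
    # and -expm1(x) computes 1 - exp(x) without losing precision.
    log_all_correct = occurrences * math.log(per_char_accuracy)
    return -math.expm1(log_all_correct)


def expected_errors(per_char_accuracy: float, occurrences: int) -> float:
    """Average number of misreads over the given number of occurrences."""
    return (1.0 - per_char_accuracy) * occurrences


ACCURACY = 0.9999            # hypothetical per-character accuracy
OCCURRENCES = 50 * 1_000_000  # up to 50 appearances per page on a million pages

if __name__ == "__main__":
    print(prob_at_least_one_error(ACCURACY, OCCURRENCES))
    print(expected_errors(ACCURACY, OCCURRENCES))
```

Under those assumptions, an engine that is right 9,999 times out of 10,000 would still average about 5,000 misreads of that one character across the set, which is the sense in which at least one mistake is a practical certainty.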
BeyondRecognition puts each image through a process it calls “scraping.” “We literally rip the images apart into glyphs,” said Martin. “Those glyphs include not only the characters on a page, but also things like logos, staple holes, check boxes, signatures, and even specks of dirt. On average, we produce about 1,500 glyphs per page.
“We then run the glyphs through a normalization process before grouping them. The normalization involves accounting for orientation by rotating each glyph 720 times in half-degree increments. This way, direction doesn’t matter, and neither does size. After it’s normalized, if a glyph is 99% similar or greater to other glyphs, they are placed in the same cluster.”
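The grouping step can be pictured with a minimal sketch. It assumes glyphs have already been scaled to a common size and rotated to a canonical orientation, and it uses a simple pixel-match ratio with a greedy grouping rule. Both are stand-ins chosen for illustration; the article does not say how BeyondRecognition measures similarity or forms clusters, and every name below is hypothetical.

```python
from typing import List, Sequence

# A glyph is a binary bitmap (rows of 0/1 pixels), assumed already normalized.
Glyph = Sequence[Sequence[int]]

# 720 rotations in half-degree increments cover one full turn: 0.0 to 359.5 degrees.
ROTATION_ANGLES = [step * 0.5 for step in range(720)]


def similarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that match between two same-sized bitmaps."""
    total = 0
    same = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            same += int(pixel_a == pixel_b)
    return same / total


def cluster_glyphs(glyphs: Sequence[Glyph], threshold: float = 0.99) -> List[List[int]]:
    """Greedy grouping: a glyph joins the first cluster whose representative
    (its first member) it matches at or above the threshold."""
    clusters: List[List[int]] = []
    for index, glyph in enumerate(glyphs):
        for members in clusters:
            if similarity(glyphs[members[0]], glyph) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([index])
    return clusters
```

Each cluster is returned as a list of indexes into the input, so a glyph's original position on the page can still be looked up after grouping.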
What happens next is a bit confusing, but it basically involves identifying these clusters as sets of characters. This is accomplished at least partially by identifying the glyph in a cluster that most exactly resembles a known character and then plugging that character into a word that is checked against a global dictionary. There are also statistical formulas incorporated regarding how often a particular character should show up in a set of documents.
“We average the results of all that, and if it comes back above a 99% confidence rating that the glyph represents a specific character, we presume it to be true,” said Martin. “Then, because our software tracks the location of each glyph it creates, we can identify all the glyphs in that particular cluster as being that particular character.”
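A rough sketch of that decision rule follows. It assumes each cluster carries one candidate character, a list of confidence scores from separate checks, and the page locations of its member glyphs. The data layout and names are hypothetical; only the averaging, the 99% threshold, and the propagation by location come from Martin's description.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple

Location = Tuple[int, int, int]  # (page, x, y) of one glyph; layout is assumed


@dataclass
class Cluster:
    locations: List[Location]  # where every member glyph was found
    candidate: str             # best-guess character for the cluster
    scores: List[float]        # e.g. shape match, dictionary check, frequency check


def accept_label(cluster: Cluster, threshold: float = 0.99) -> Optional[str]:
    """Return the candidate character only if the averaged score is above the threshold."""
    return cluster.candidate if mean(cluster.scores) > threshold else None


def propagate(clusters: List[Cluster]) -> Dict[Location, str]:
    """Assign an accepted character to every glyph location in its cluster."""
    labeled: Dict[Location, str] = {}
    for cluster in clusters:
        label = accept_label(cluster)
        if label is not None:
            for location in cluster.locations:
                labeled[location] = label
    return labeled
```

The point of the design is that one accepted decision labels every member of the cluster at once, rather than each occurrence being judged separately.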
Of course, not every glyph comes back at a 99% confidence level. To account for this, Martin showed us a process called “Word QC.” In the example he showed, “lockbox” was not recognized as a valid word in the global dictionary, so it was highlighted on the screen. The statistics said it was one of several million suspect words (in a large set of documents). Merely confirming that “lockbox” was a valid word had a cascading effect that helped validate other glyphs as characters. The result was that with a single keystroke, 900,000 suspect words were eliminated. “That’s the type of stuff offshore keyers are being paid to correct on a word-by-word basis,” said Martin.
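One way to picture the cascade is with a toy model. It assumes every character in a suspect word points back to a glyph cluster, and that a word stops being suspect once it is in the dictionary and all of its clusters are confirmed. This is an illustrative guess at the mechanics, not BeyondRecognition's documented design, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class SuspectWord:
    text: str                     # tentative reading of the word
    cluster_ids: Tuple[int, ...]  # one glyph cluster per character


def is_resolved(word: SuspectWord, dictionary: Set[str], confirmed: Set[int]) -> bool:
    """A word is cleared when it is a known word built only from confirmed clusters."""
    return word.text in dictionary and all(c in confirmed for c in word.cluster_ids)


def confirm_word(
    text: str,
    suspects: List[SuspectWord],
    dictionary: Set[str],
    confirmed: Set[int],
) -> List[SuspectWord]:
    """Accept one word as valid and return the suspects that remain afterward."""
    dictionary.add(text)
    for word in suspects:
        if word.text == text:
            # Confirming the word also confirms the clusters behind its characters.
            confirmed.update(word.cluster_ids)
    return [w for w in suspects if not is_resolved(w, dictionary, confirmed)]
```

In this model, confirming “lockbox” also confirms the clusters for l, o, c, k, b, and x, so a suspect “block” built from the same clusters clears without anyone reviewing it.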
BeyondRecognition has the ability to output searchable PDF files, as well as what it calls an “XPDF” file. “Basically, this is a cross-reference file, which includes a coordinate point for every word and numeric sequence pulled off a page,” said Martin. “It’s a great tool for redaction applications, for example.
“We have a customer using it to redact expressions like Social Security and phone numbers. They can achieve this at a rate of 600,000 redactions per hour.”
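A coordinate cross-reference of this kind lends itself to pattern-based redaction. The sketch below assumes each entry holds a word or numeric sequence with its page and coordinate point, and it matches US-style Social Security and phone number formats with regular expressions. The entry layout and the patterns are assumptions made for illustration; the article does not describe the actual XPDF format or the customer's rules.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    text: str  # one word or numeric sequence pulled off the page
    page: int
    x: int     # coordinate point of the entry on the page (units assumed)
    y: int


# Illustrative patterns only; real redaction rules would need to be broader.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")
PHONE = re.compile(r"(?:\(\d{3}\)|\d{3})[-.]?\d{3}[-.]\d{4}")


def find_redactions(entries: List[Entry]) -> List[Entry]:
    """Return the entries whose whole text matches a sensitive pattern."""
    return [e for e in entries if SSN.fullmatch(e.text) or PHONE.fullmatch(e.text)]
```

Because each match already carries its page and coordinates, a redaction tool can black out the region without searching the image again.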
Because BeyondRecognition works with glyphs, it is able to handle multiple languages—even mixed within a single document set.
It also has the ability to do document clustering based on the layout and content of images. “This is a great tool for automatically routing files to the right process,” said Martin. “We have a BPO that utilizes that element of our technology to help it allocate its resources more effectively.
“In the discovery space, document clustering can help users eliminate a lot of documents they’re not going to need. For example, in a labor case, it enables them to quickly identify tax and marketing documents that won’t be relevant.”
This type of document classification also has obvious potential in markets like mortgage and patient records processing, where several types of documents are often mixed in a single file.
Rules can be set up within BeyondRecognition’s application for extracting specific data fields and tables. Extraction can be applied to structured and semi-structured documents.
Rules can also be set up around the glyphs to eliminate background noise such as watermarks. Martin showed an example of image enhancement being applied to documents created through carbon-paper duplication.
Martin said the speed of BeyondRecognition’s software depends on the number of CPUs being utilized. “A 30-core server can process up to a million pages per day,” he said. “An 80-core can do 5 million, and a 160-core, about 10 million.”
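The quoted figures can be put on a per-core basis with a little arithmetic. The sketch below takes Martin's "up to" numbers at face value and, purely as an assumption, treats each of the 2.3 billion images in the energy project as one page. The article does not say what hardware that project actually ran on.

```python
import math

# Cores -> pages per day, as quoted by Martin (upper bounds, not measurements).
QUOTED_THROUGHPUT = {30: 1_000_000, 80: 5_000_000, 160: 10_000_000}


def pages_per_core_per_day(cores: int, pages_per_day: int) -> float:
    """Average daily page throughput contributed by each core."""
    return pages_per_day / cores


def days_to_process(total_pages: int, pages_per_day: int) -> int:
    """Whole days needed to work through a backlog at a steady daily rate."""
    return math.ceil(total_pages / pages_per_day)


if __name__ == "__main__":
    for cores, rate in QUOTED_THROUGHPUT.items():
        print(cores, round(pages_per_core_per_day(cores, rate)))
    print(days_to_process(2_300_000_000, QUOTED_THROUGHPUT[160]))
```

On these numbers, the 80-core and 160-core servers both work out to 62,500 pages per core per day, against roughly 33,333 for the 30-core server. At 10 million pages per day, a 2.3 billion page backlog would take about 230 days.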
To date, BeyondRecognition has offered its technology solely as a service. “We’ve done several dozen projects, including several different types,” said Martin. “We’re currently working on developing an appliance that can be run behind a customer’s firewall.”
Martin said that BeyondRecognition’s technology belongs entirely to his company. “There are some patents around it, as well as some we’re applying for,” he said. “We have 15 to 16 man-years’ worth of development in this.”
Martin concluded by echoing a theme we thought was prevalent at the recent Harvey Spencer Associates Capture Conference regarding next-generation document capture. “This is a solution for big data,” he said. “If you don’t know what you have, you certainly can’t decide what’s relevant.”