Monday, June 2, 2014

Superior Document Services announces that two team members have become Relativity Certified Administrators

Superior Document Services is proud to announce that two members of our team, Daniel Attaway and Bryant Overgard, have recently become Relativity Certified Administrators! The Relativity Certified Administrator (RCA) program ensures that case administrators fully understand Relativity's capabilities, allowing you to maximize the software's flexibility and provide an intuitive interface for end users. To obtain certification, you must earn a score of 80% or higher on the RCA exam, which contains both written and practical elements.
By completing the Relativity Certified Administrator (RCA) program, these individuals have shown a thorough understanding of Relativity and how to best leverage the platform to our clients’ advantage. This is just part of our ongoing goal to provide truly Superior service to all of our clients.

Thursday, May 8, 2014

"Predictive" Coding and the Naked Emperor

I've always been suspicious of the claims of the "predictive coding" zealots. Every year it seems that there is a new buzzword in the field of e-discovery. Technology Assisted Review - Check! Information Governance - Check! Cloud Computing - Check! Big Data - Check!

My friend John Martin did a magnificent job of distilling some of my concerns in his blog entry that can be found right here.

The Emperor has No Clothes - and PC Can't See Image-Only Documents

There are several parallels between predictive coding (AKA technology assisted review) and Hans Christian Andersen's tale, "The Emperor's New Clothes." In the story, two weavers tell the emperor they will make him a suit of clothes that will be invisible to those people who are unfit for their position, stupid, or incompetent. None of the emperor's subjects want to admit to those deficiencies, so the emperor parades around with no clothes on until a child states the obvious - the emperor has no clothes.
In the case of predictive coding, its advocates have touted the efficacy of their approaches in white papers, blogs, and social media postings, and have practically created a separate industry to host conferences promoting the wonders of predictive coding. Few people want to ruin the moment or buck the trend by pointing out what is obvious when one considers the technology underlying predictive coding - it is completely dependent on having text to analyze. It will absolutely fail to analyze documents for which there is no text, and will do a miserable job where the text is of poor quality.
This might be just an esoteric debating point if virtually all documents had associated text. However, in some industries like oil & gas, half or more of some collections will be engineering drawings and schematics that were output to image-only PDF for distribution and use by those who don't have the software licenses needed to view the documents in their original file formats.
In practically all industries it is common practice to develop documents in one application like Word and then, once finalized, distribute them as image-only PDF so they can be viewed on a variety of devices and so recipients can't easily change the content. In one collection we analyzed, only 20% of the PDFs had associated text. Even if predictive coding were 100% effective, the most it could classify would be 20%, because it literally cannot "see" the 80% without text. If in fact predictive coding has a recall rate of 70-80% on what it can see, that would mean it would have identified only 14 to 16% of the total PDFs (70% x 20% = 14%; 80% x 20% = 16%). By contrast, BR's visual classification technology classified 100% of them.
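The effective-recall arithmetic above can be written out directly. The 20% text fraction and 70-80% recall figures are the ones cited in this post; the helper function is purely illustrative:

```python
def effective_recall(text_fraction, recall_on_text):
    """Fraction of the TOTAL collection a text-dependent tool can
    identify: it can only recall documents that have text at all."""
    return text_fraction * recall_on_text

# Only 20% of the PDFs in the collection had associated text.
# Assuming 70-80% recall on the text it can see:
low = effective_recall(0.20, 0.70)   # 14% of all PDFs
high = effective_recall(0.20, 0.80)  # 16% of all PDFs
```

The point of the calculation: the text-only ceiling (20%) dominates the result no matter how good the recall figure is.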
PDFs will potentially be among the most relevant file types in a collection because that is the format used to distribute information within and among groups of people within an organization, and among organizations. Note that even if predictive coding is acceptable in some unique e-discovery settings, its dependence on text will be fatal for broader information governance purposes.
So... if you're going to use predictive coding, at the very least measure what PC doesn't "see." If you're planning on using PC for information governance purposes, make sure that the organization doesn't mind not classifying a potentially significant percentage of its documents.

Thursday, April 3, 2014

What BeyondRecognition Brings to Document Management

I found this article about BeyondRecognition, written by Mimi Dionne, which does an excellent job of explaining in plain English how BR can benefit every business with large unstructured data collections.
You can read the entire article right here
Ever heard of BeyondRecognition? If not, the time to learn is now. The Chantilly, Va.-based "document textnology" software provider offers document managers an alternative to optical character recognition (OCR), while delivering results with accuracy and speed.

How BeyondRecognition Works

BeyondRecognition (“BR”) may be a young innovation, but it is a viable alternative to OCR. It utilizes glyphs - letters or characters formed by pixels that are sufficiently different in color from the document's background to be identifiable. BR groups like glyphs into clusters at the character and word level, then converts one glyph per cluster to text as appropriate.
While OCR continuously decides what each glyph is, BeyondRecognition’s single instance technology need only recognize one glyph per cluster to form a catalog of letters or characters. The advantage: the return on investment of using single instance recognition technology is much higher with a smaller data set — a faster processing speed and better accuracy rate — which shortens the Records Management program’s work breakdown structure significantly.
Because BeyondRecognition software is glyph dependent — not text — it is more versatile:
  • BR is language agnostic. It currently recognizes over forty languages.
  • BR is symbology agnostic. It can recognize and relate non-text elements.
  • BR clusters visual similarities. It works on all kinds of documents.
  • BR is over ninety-nine percent accurate.
  • BR scales. It can analyze millions of pages per day per BeyondRecognition server.
BeyondRecognition’s zonal attribute extraction permits subject matter experts to extract attributes from document classifications by clicking and dragging zones on one document per document type cluster.

Again, for more of Mimi's article, click here.

Information Governance Lessons from the Six Blind Men and the Elephant


Posted by Rich E. Davis
Mar 30, 2014 11:35:00 PM

Most of us have heard about the parable of the six blind men and the elephant - it may actually be the first recorded instance of faceted classification. Six blind men touched different parts of an elephant and each described a completely different thing based on their own perspective or “view” of the elephant: The one who felt a tusk reported it as a pipe, the one who felt an ear reported a fan, the belly was reported as a wall, the trunk as a branch, a leg was reported as a pillar and the tail was described as being a rope.
This story illustrates several important information governance lessons:
Different Stakeholders Have Different Views & Needs. People’s views of and information needs from any given corpus of documents will vary according to where they are in an organization and the functions they perform. People with different roles in a company will naturally be interested in different attributes of the documents in the corpus and may well use different descriptors when describing or trying to find some of them. While some document attributes are common to all stakeholders, others, namely those which enable an individual or group to perform their job function within an organization, are probably not.
Here is an example of how different roles will be interested in different attributes:
The Offshore Power Plant
  • Stakeholders in the tax department need to know whether an expenditure on a sub-sea turbine, a critical component on a key project, can be categorized as an operating or capital expense in the jurisdiction where the project’s work is being performed.
  • The plant maintenance department needs to know:
    • When the warranty period kicks in.
    • Part numbers, nomenclature and service level.
  • Environmental Safety & Health wants to know that the team and all the contractors associated with the project are properly qualified and sanctioned to install the turbine to the engineering design specifications and to operate it within the specified tolerances.
  • RIM and Compliance wants to track the locations of all relevant project documents for information lifecycle management, disposition and regulatory reasons.
  • IT needs to ensure that business critical documents are backed up for disaster recovery and business continuity purposes.
  • InfoSec wants to know that the IT group has the requisite information to ensure that the IP associated with the project is properly secured and that the people who access the content have the proper authorization to do so.
Some or all of the information required by the stakeholders above will be objectively evident on the face of individual documents. Other “subjective” attributes may have to be assigned (e.g., “project lead engineer”) by knowledge workers with specific domain expertise, and other more granular data elements (e.g., installation location) may have to be assigned by linking attributes from other authoritative data sources or systems of record.
Just preserving documents without having a systematic, dynamically updatable and holistic view created by assimilating other interrelated data points will result in an incomplete picture of a project or process. Without a holistic way of assembling and viewing all the extracted document attributes of interest to the various stakeholders, the overarching information governance needs of the organization will never be met. There will be incomplete, ambiguous, erroneous and superfluous data points.
Limited Data Points Means Incomplete or Distorted Pictures. As the elephant parable illustrates, having only one or a few attributes available results in having an incomplete or distorted picture of what is being managed. The blind men’s picture is so distorted in fact, that when word of an elephant rumbling through cane fields destroying them in search of food reaches their ears, they have no adequate description for the sum of the parts, and thus no way of applying the individually assimilated knowledge in a holistic fashion. The more uniform, accurate and persistent the document attributes or facets that are available, the greater the ability of the organization to assimilate seemingly disparate information to form a more accurate picture of present and future state projects.
Duplicated data sets. Without a holistic enterprise content plan, each stakeholder starts keeping their own copies of documents so they can extract the attributes they are interested in. The result is multiple copies of the same documents, multiple expenditures to extract the same attributes, and inconsistencies in ways that the same data is extracted and stored.


The challenge described above is endemic. It exists across all types of businesses in every jurisdiction. Corporations of all sizes are dealing with big data symptoms and are stymied when it comes to finding a cure that has not been available from prior technology.
Standing apart from the herd is Continuum Advisors. At Continuum, we believe in using powerful emerging technology to help our clients address their most daunting data management challenges. To that end, we have incorporated BeyondRecognition (“BR”) in our services matrix for IG, legal, information security, RIM and a host of other initiatives that require powerful, scalable data analytics.
BR is a radically new, data-driven information governance technology that meets the IG needs of multiple stakeholders in any enterprise, public or private. Continuum has implemented BR technology at multiple Fortune 500 clients with great success.
The highly experienced CA team chose to align with BR as it is the only technology in the world that automatically classifies electronic files or scanned paper documents based on their visual characteristics – and without having to waste time writing rules to identify each type of document or designating exemplars for each document type. This is tremendously important because accurate, consistent classification is the bedrock upon which all IG initiatives are built. BR solves this long-standing, previously intractable problem.
Subject matter experts can quickly determine how to classify all the documents in a document cluster by examining one or two documents per cluster. They can also associate a document type name with the cluster based on their organization’s document classification tree, and assign retention periods based on the classification.
Our subject matter experts in energy, financial services, and pharmaceuticals work with corporate knowledge workers to extract multiple attributes from each document classification by “painting,” i.e., clicking and dragging boxes, on an image of a document from each cluster. BR then automatically extracts the specified attributes and associates each attribute with the attribute or field names assigned by the subject matter experts. The extracted data can then be loaded into the appropriate content management system.
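Mechanically, the "painting" step amounts to applying zone coordinates defined on one exemplar to every page in the same visual cluster. A minimal sketch, with made-up coordinates, field names, and page data:

```python
def extract_zone(page_words, zone):
    """Return the words whose (x, y) position falls inside a painted
    zone, given as (x0, y0, x1, y1) rectangle coordinates."""
    x0, y0, x1, y1 = zone
    return [w for (w, x, y) in page_words if x0 <= x <= x1 and y0 <= y <= y1]

# One page from a cluster of visually similar invoices
# (word, x, y) -- all values hypothetical:
page = [("INVOICE", 100, 50), ("No.", 400, 50), ("7731", 460, 50),
        ("Total", 100, 700), ("$1,250.00", 200, 700)]

# Zones painted once on an exemplar, then reused for the whole cluster:
attributes = {
    "invoice_number": extract_zone(page, (420, 30, 520, 70)),
    "total":          extract_zone(page, (150, 680, 300, 720)),
}
```

Because every page in a cluster shares the same layout, the same zone coordinates can be replayed across millions of pages without per-document rules.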
The various attributes enable the BR-processed documents to be associated with management control systems, e.g., pipeline planning and maintenance, or capital asset acquisition, or ESH inspections. The various attributes serve to provide multiple views into the document collection.
The extracted attribute values can be normalized prior to loading into the target system or the extracted values can be used to update and validate existing field authority lists.
For more information, please contact Rich E. Davis.

Tuesday, April 1, 2014

BeyondRecognition Denies Plans to Acquire EMC or Kofax

Independent information governance technology provider allays concerns it will seek to acquire market share through acquisitions.

Germantown TN – April 1, 2014. John Martin, CEO and Founder of BeyondRecognition, LLC, a Memphis-based technology company providing data-driven information governance technology to Fortune 500 companies, today denied trade rumors that BR had plans in place to acquire the stock or assets of either Kofax or EMC. According to Martin, “While both Kofax and EMC presently have respectable revenue numbers, we have no plans to acquire them. We believe that our organic growth will permit us to capture a significant share of their document capture and business process automation business.” 
To support his view of BR’s growth potential, Martin noted that BR had signed master service agreements (MSAs) with six Fortune 100 clients in Q1 2014.
Martin went on to explain that BR’s information governance technology was based on visual similarity, enabling it to automatically classify documents without the client having to develop upfront document classification rules or select multiple exemplars for each classification. “This greatly compresses the time frame required to launch projects like content migration or file share remediation. The fact that BR classifies native electronic documents as well as scanned paper documents is also a huge competitive advantage.”
About BeyondRecognition
BeyondRecognition (“BR”) provides enterprise-scale information governance technology to Fortune 500 clients. BR’s core technology classifies electronic and scanned paper documents based on their visual similarity. Other components of BR’s offerings include zonal attribute extraction, visual deduping, and glyph recognition. BR technology enables content migration, file remediation, and other IG tasks as well as powering document-intensive business processes. BR’s clients enjoy rapid project start-up and improved accuracy in coding or extracting document attributes, and they particularly appreciate being able to finish projects in months that had originally been scheduled to take years.
For more information about BeyondRecognition, visit the BR website at, or contact Joe Howie, VP, Corporate Communications, at, or 918-894-6943. This release is valid only on April 1, 2014 – think about it and have a great day.
You can also follow BR on Twitter @BeyondRecog or join the BeyondRecognition group on LinkedIn.
Credits: Globe in graphic obtained under Creative Commons license from, "Modern Globe Blue and Green Connection Vector Illustration.jpg"

Thursday, October 11, 2012

ISV Takes Fresh Look at Recognition

September 28th, 2012

There is no question that ISVs are currently trying to go where no capture software has gone before—in terms of applying automatic recognition technology to documents. Over the past couple of issues, we’ve covered topics like artificial intelligence, semantics, and advanced analytics, all designed to take applications beyond the capabilities of current capture software. However, an appropriately named start-up out of Germantown, TN (near Memphis), may have beaten many established capture players to the mark.
Using a combination of advanced pattern recognition and computer vision, BeyondRecognition recently completed a project in which it successfully indexed 2.3 billion images, which originally contained no metadata. Yes, working as a contractor for an energy company, BeyondRecognition was presented with 27,000 CDs and DVDs full of images that covered a timeframe of roughly 90 years and were created in different locations across the world.
“There were no boundaries separating the scanned images,” said John Martin, founder and CEO of BeyondRecognition. “The only thing we knew was that the documents on the discs in the front of each box were scanned before the documents on the discs in the back. Our client wanted to be able to mine the data on all these documents.”
According to Martin, he looked at everything available on the market for accomplishing the task at hand. “I looked at traditional OCR applications, but they were a bust—even with voting engines, which would just have made the process five or six times longer,” he said. “Even if we could have applied full-text OCR, conventional search engines could not do the things we wanted. In addition, traditional relational databases couldn’t support the millions of many-to-many relationships we had to set up.”
To build the application that eventually became the cornerstone of BeyondRecognition, Martin applied a process he called “negative learning.” “Basically, I started with no preconceived ideas about how automatic recognition was being done,” he said. “Instead, I tried to take the position that if I were to build a recognition solution with tools available today, how would I approach it?”
The guts of the solution
Martin started with the basic premise that computers are good at working with numbers. “Based on that, we were able to identify one of the key flaws of traditional OCR—it attempts to read characters like humans do,” he said. “It goes left to right, top to bottom, first page to last.
“And it attempts to recognize each character individually. Think about that from a statistical standpoint. On each page, a single lowercase character might appear 40 to 50 times. So, on a million pages, that character could show up as many as 50 million times. Basically, with traditional OCR, you’re giving software 50 million chances to get it wrong. Statistically, that means it’s certainly going to make at least one mistake.”
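Martin's statistical point - that even a tiny per-character error rate compounds over millions of independent recognition attempts - can be illustrated with a quick calculation. The 0.01% error rate below is an assumed figure for illustration, not one from the article:

```python
def p_at_least_one_error(per_char_error_rate, occurrences):
    """Probability of making at least one mistake across many
    independent recognition attempts."""
    return 1 - (1 - per_char_error_rate) ** occurrences

# Even a 99.99%-accurate engine, reading one character 50 million
# times, is virtually guaranteed to err at least once:
p = p_at_least_one_error(0.0001, 50_000_000)  # effectively 1.0
```

This is why recognizing each cluster once, instead of each occurrence separately, changes the statistics so dramatically.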
BeyondRecognition puts each image through a process it calls “scraping.” “We literally rip the images apart into glyphs,” said Martin. “Those glyphs include not only the characters on a page, but also things like logos, staple holes, check boxes, signatures, and even specks of dirt. On average, we produce about 1,500 glyphs per page.
“We then run the glyphs through a normalization process before grouping them. The normalization involves accounting for orientation by rotating each glyph 720 times in half-degree increments. This way, direction doesn’t matter, and neither does size. After it’s normalized, if a glyph is 99% similar or greater to other glyphs, they are placed in the same cluster.”
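As described, the core operation is grouping normalized glyphs into clusters by visual similarity. Here is a deliberately toy sketch of that idea: glyphs are tiny binary grids, only 90-degree rotations are tested (the real system uses half-degree increments and also normalizes for size), and while the 99% threshold comes from the article, the glyph data is made up:

```python
def rotations(grid):
    """Yield the grid rotated 0, 90, 180, and 270 degrees clockwise."""
    g = grid
    for _ in range(4):
        yield g
        g = [list(row) for row in zip(*g[::-1])]  # rotate 90 degrees

def similarity(a, b):
    """Best fraction of matching pixels over all tested rotations of b."""
    cells = len(a) * len(a[0])
    best = 0.0
    for r in rotations(b):
        if len(r) == len(a) and len(r[0]) == len(a[0]):
            same = sum(x == y for ra, rb in zip(a, r) for x, y in zip(ra, rb))
            best = max(best, same / cells)
    return best

def cluster(glyphs, threshold=0.99):
    """Greedy clustering: each glyph joins the first cluster whose
    representative it matches at or above the threshold, else it
    starts a new cluster of its own."""
    clusters = []  # list of (representative, members) pairs
    for g in glyphs:
        for rep, members in clusters:
            if similarity(rep, g) >= threshold:
                members.append(g)
                break
        else:
            clusters.append((g, [g]))
    return clusters

# Two glyphs that are rotations of one another, plus an unrelated one:
a = [[1, 0], [1, 1]]
b = [[1, 1], [1, 0]]   # `a` rotated
c = [[0, 0], [0, 0]]
groups = cluster([a, b, c])  # -> 2 clusters
```

Once clustered, only one representative per cluster ever needs to be recognized - the single-instance idea described in the next step.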
What happens next is a bit confusing, but it basically involves identifying these clusters as sets of characters. This is accomplished at least partially by identifying the glyph in a cluster that most exactly resembles a known character and then plugging that character into a word that is checked against a global dictionary. There are also statistical formulas incorporated regarding how often a particular character should show up in a set of documents.
“We average the results of all that, and if it comes back above a 99% confidence rating that the glyph represents a specific character, we presume it to be true,” said Martin. “Then, because our software tracks the location of each glyph it creates, we can identify all the glyphs in that particular cluster as being that particular character.”
Of course, not every glyph comes back at a 99% confidence level. To account for this, Martin showed us a process called “Word QC.” In the example he showed, “lockbox” was not recognized as a valid word in the global dictionary, so it was highlighted on the screen. The statistics said it was one of several million suspect words (in a large set of documents). Merely confirming that “lockbox” was a valid word had a cascading effect that helped validate other glyphs as characters. The result was that with a single keystroke, 900,000 suspect words were eliminated. “That’s the type of stuff offshore keyers are being paid to correct on a word-by-word basis,” said Martin.
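The cascading effect of Word QC might be sketched like this. All words and glyph-cluster ids here are hypothetical; the idea is that confirming one suspect word validates the glyph clusters it is built from, which in turn clears every other suspect word whose clusters are now all validated - one keystroke, many corrections:

```python
def confirm_word(word, suspects, validated):
    """Confirm `word` as valid; its glyph clusters become validated.
    Returns the other suspect words this cascades to clear."""
    validated.update(suspects.pop(word))
    cleared = [w for w, clusters in suspects.items()
               if set(clusters) <= validated]
    for w in cleared:
        del suspects[w]
    return cleared

# Hypothetical suspect words mapped to the glyph clusters they use:
suspects = {
    "lockbox": {"g-l", "g-o", "g-c", "g-k", "g-b", "g-x"},
    "block":   {"g-b", "g-l", "g-o", "g-c", "g-k"},
    "zephyr":  {"g-z", "g-e", "g-p", "g-h", "g-y", "g-r"},
}
validated = set()
cleared = confirm_word("lockbox", suspects, validated)
# "block" is built entirely from now-validated clusters, so it is
# cleared too; "zephyr" remains suspect.
```

At the scale Martin describes, one confirmation clearing hundreds of thousands of suspects is just this cascade applied to a much larger suspect list.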
BeyondRecognition has the ability to output searchable PDF files, as well as what it calls an “XPDF” file. “Basically, this is a cross-reference file, which includes a coordinate point for every word and numeric sequence pulled off a page,” said Martin. “It’s a great tool for redaction applications, for example.
“We have a customer using it to redact expressions like Social Security numbers and phone numbers. They can achieve this at a rate of 600,000 redactions per hour.”
Because BeyondRecognition works with glyphs, it is able to handle multiple languages—even mixed within a single document set.
It also has the ability to do document clustering based on the layout and content of images. “This is a great tool for automatically routing files to the right process,” said Martin. “We have a BPO that utilizes that element of our technology to help it allocate its resources more effectively.
“In the discovery space, document clustering can help users eliminate a lot of documents they’re not going to need. For example, in a labor case, it enables them to quickly identify tax and marketing documents that won’t be relevant.”
This type of document classification also has obvious potential in markets like mortgage and patient records processing, where several types of documents are often mixed in a single file.
Rules can be set up within BeyondRecognition’s application for extracting specific data fields and tables. Extraction can be applied to structured and semi-structured documents.
Rules can also be set up around the glyphs to eliminate background noise such as watermarks. Martin showed an example of image enhancement being applied to documents created through carbon-paper duplication.
Martin said the speed of BeyondRecognition’s software depends on the number of CPUs being utilized. “A 30-core server can process up to a million pages per day,” he said. “An 80-core can do 5 million, and a 160-core, about 10 million.”
To date, BeyondRecognition has offered its technology solely as a service. “We’ve done several dozen projects, including several different types,” said Martin. “We’re currently working on developing an appliance that can be run behind a customer’s firewall.”
Martin said that BeyondRecognition’s technology belongs entirely to his company. “There are some patents around it, as well as some we’re applying for,” he said. “We have 15-16 man years worth of development in this.”
Martin concluded by echoing the theme we thought was prevalent at the recent Harvey Spencer Associates Capture Conference, regarding next generation document capture. “This is a solution for big data,” he said. “If you don’t know what you have, you certainly can’t decide what’s relevant.”
For more information:

Sunday, September 2, 2012

Willy Wonka, Big Data, Pure Imagination

Hold your breath, Make a wish, Count to Three

Come with me and you'll be
In a world of pure imagination
Take a look and you'll see
Into your imagination

We'll begin with a spin
Traveling in the world of my creation
What we'll see will defy
Explanation

Maybe you thought that BeyondRecognition was merely the greatest automatic “Visual Document Clustering” tool for Big Data. Well, you'd be correct, yet wrong. BeyondRecognition's powerful graphical engine can also provide never-before-dreamed-of levels of image enhancement.

BeyondRecognition's Visual-Similarity Clustering automatically processes and groups documents together for document boundary detection and document type classification, regardless of source and format — seamlessly processing native electronic files and scanned documents.
Automatic Visual Document Clustering means that visually similar pages are grouped based on their graphical, rather than textual, content. This avoids the errors normally encountered in extracted or generated text and leverages non-text graphical elements such as logos, form elements and other objects to greatly improve accuracy.

About BeyondRecognition

BeyondRecognition is a "textnology" company that has developed unique character, word and document attribute recognition and extraction capabilities for analyzing image-based documents. Disclosure of further details is being deferred until one or more patents on the process are filed. BeyondRecognition is working with a select number of companies in the electronic discovery and document management industries. 
For more information, visit

About Focus Data Management
FDM is the sales and marketing arm of BeyondRecognition. With offices on both coasts, FDM is available to help you customize your Big Document Solutions using the power of BeyondRecognition. Contact us at 804.690.0010 or 562.822.7141.