Wednesday, May 16, 2012

Document Clustering Using Facial Recognition Principles & Post-Clustering Data Extraction By John Martin

…the new document textnology

John Martin turns technology-assisted review and post-clustering data extraction upside down using facial recognition principles:

Companies achieve many benefits when they can quickly, accurately and inexpensively cluster like documents. For example, 500-page home loan files can be scanned on high-speed scanners, the pages programmatically grouped into documents, and like document types clustered or categorized across all the loan files without significant operator intervention. This permits the company to check whether the mortgage files contain the expected types of documents, e.g. deeds, inspection reports, insurance binders, lien releases, etc.
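To make the completeness check concrete, here is a minimal Python sketch. It assumes the clustering step has already labeled each document in a loan file with a type. The type labels, the loan file IDs and the function name are all hypothetical and are not taken from BR's product.

```python
# Hypothetical document-type labels; a real project would define its own list.
EXPECTED_TYPES = {"deed", "inspection report", "insurance binder", "lien release"}


def missing_document_types(loan_files, expected=EXPECTED_TYPES):
    """Map each loan file ID to the sorted list of expected types it lacks."""
    report = {}
    for file_id, found_types in loan_files.items():
        missing = expected - set(found_types)
        if missing:
            report[file_id] = sorted(missing)
    return report


# Illustrative input: the document types found in each loan file after clustering.
loan_files = {
    "loan-001": ["deed", "inspection report", "insurance binder", "lien release"],
    "loan-002": ["deed", "insurance binder"],
}

print(missing_document_types(loan_files))
```

Only the files with gaps appear in the report, so a reviewer can go straight to the loan files that need attention.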

Furthermore, with precise clustering of the same document types, data that is unique to each
document can be extracted and compared to expected values. To continue the loan file example,
the form of the grantee’s name or the property description can be compared across all documents or
against a loan tracking system to make sure all the values match.
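The comparison step might look something like the following Python sketch. It assumes the grantee name has already been extracted from each document and that the expected value comes from a loan tracking system. The normalization rule (ignore case, punctuation and extra spaces) and every name in the code are illustrative assumptions, not a description of BR's matching logic.

```python
import re


def normalize(value):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(cleaned.split())


def find_mismatches(extracted, expected):
    """Return (document, extracted value) pairs that differ from the expected value."""
    target = normalize(expected)
    return [
        (doc_id, value)
        for doc_id, value in extracted.items()
        if normalize(value) != target
    ]


# Hypothetical grantee names extracted from three documents in one loan file.
grantee_names = {
    "deed": "Jane Q. Public",
    "insurance binder": "JANE Q PUBLIC",
    "lien release": "Jane Public",
}

# The expected value would come from the loan tracking system.
print(find_mismatches(grantee_names, "Jane Q. Public"))
```

Here only the lien release is flagged, because its form of the name drops the middle initial. How strict the normalization should be is a project decision.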

New technology from BeyondRecognition (“BR”) provides this type of functionality. BR examines the
“faces” or images of documents and, using between roughly a thousand and fifteen hundred points of
comparison, clusters or groups those documents that have a high degree of similarity. By contrast, facial
recognition of pictures of human faces typically uses fewer than 200 points of comparison.
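The general idea of grouping documents by comparing many measured points can be illustrated with a toy Python sketch. This is not BR's algorithm, which is not described here. It assumes each document image has already been reduced to a list of numbers (four per document in this example, standing in for the thousand or more points mentioned above) and uses cosine similarity with a simple greedy grouping rule. The threshold and all names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(features, threshold=0.95):
    """Greedy grouping: a document joins the first cluster whose first
    member is at least `threshold` similar; otherwise it starts a new one."""
    clusters = []  # each cluster is a list of document IDs
    for doc_id, vector in features.items():
        for cluster in clusters:
            leader = features[cluster[0]]
            if cosine_similarity(vector, leader) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


# Hypothetical feature vectors standing in for points of comparison.
features = {
    "deed-a": [1.0, 0.9, 0.0, 0.1],
    "deed-b": [0.9, 1.0, 0.1, 0.0],
    "binder-a": [0.0, 0.1, 1.0, 0.8],
    "memo": [0.5, 0.0, 0.0, 1.0],
}

print(cluster_documents(features))
```

The two deeds land in one cluster while the binder and the memo each stand alone. A production system would use far richer features and a more robust clustering method than this first-match rule.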

One of the most subtle but profound consequences of programmatic document typing or clustering is that
documents are arranged in clusters before you have to decide what to name them or what data elements to
extract from them or how long to retain them.

Post-Clustering Data Element Extraction

Anyone who has ever tried to write coding rules or document processing instructions for the handling of
heterogeneous document collections knows that one of the largest obstacles is that you don’t know what’s
in the collection until you’ve gone through it all. The typical scenario is to sample documents, write the
best manual you can up front and then be prepared for many updates and revisions as the project unfolds.

The graphical interface used by BR for creating data extraction rules permits rapid, extremely accurate
coding of the various document clusters. Rules use cluster-specific location coordinates and other textual as well
as nontextual markers to extract the data elements most pertinent to each document type or cluster.
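A rule based on cluster-specific location coordinates can be pictured with the Python sketch below. It assumes OCR has produced a list of words with page coordinates and that each cluster has a set of rectangular zones, one per data element. The `Word` and `Zone` classes, the coordinate values and the field name are hypothetical and do not reflect BR's rule format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


@dataclass
class Zone:
    name: str  # data element this zone captures
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word):
        """True if the word's top-left corner falls inside the rectangle."""
        return (self.left <= word.x <= self.right
                and self.top <= word.y <= self.bottom)


def extract_fields(words, zones):
    """For each zone, join the words inside it in top-to-bottom, left-to-right order."""
    result = {}
    for zone in zones:
        hits = sorted((w for w in words if zone.contains(w)),
                      key=lambda w: (w.y, w.x))
        result[zone.name] = " ".join(w.text for w in hits)
    return result


# Hypothetical rule for a "deed" cluster: the grantee name sits in one rectangle.
deed_zones = [Zone("grantee", left=100, top=200, right=400, bottom=220)]

# Hypothetical OCR output for one page in that cluster.
page_words = [
    Word("GRANTEE:", 20, 205),
    Word("Jane", 110, 205),
    Word("Public", 160, 205),
    Word("Recorded", 110, 500),
]

print(extract_fields(page_words, deed_zones))
```

Because every document in a cluster shares the same layout, one set of zones can be applied to the whole cluster. The sketch covers only the coordinate-based part of a rule and leaves out the textual and nontextual markers.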

In fact, it will typically take an experienced clustering consultant less time to actually create and apply the
rules after the documents have been programmatically clustered than it would for a project manager to write a comprehensive coding manual before or during the project implementation phase for manual coding.

With the breakthrough graphical interface, the clustering consultant can also see immediately what
data is extracted across an entire cluster as the clusters are being analyzed. This greatly reduces
the time required to write and debug data element extraction rules. Without the interactive feature,
testing would have to be done on a batch basis, which is inherently more time-consuming and permits far
fewer iterations.

Clustering also readily identifies anomalous documents, ones that are truly unique within the
collection and for which individual rules need not be developed.

Anyone who has had to work within the limitations of writing data extraction rules for documents based solely upon the textual representation of those documents will be pleased to know that BR’s extraction rules can be based on non-textual graphical
elements, e.g. logos or vertical lines. This ability to use non-text graphical elements as data
extraction flags or cues greatly expands the flexibility and power of BR’s data element extraction capability.
It enables BR to achieve an accuracy and thoroughness rivaling, if not in many cases exceeding,
that of human reviewers or coders, and at a fraction of the cost and time, to say nothing of
the security benefits of not having legions of people reviewing potentially sensitive corporate documents
and trade secrets just to have indices prepared.

If you are not familiar with this technology, please visit

Alternatively, to learn how you can put BeyondRecognition to work solving your document problems

Contact John Martin at or
