EDD: Issues, Law, and Solutions: Document Clustering Using Facial Recognition Principles & Post-Clustering Data Extraction By John Martin

Wednesday, May 16, 2012

…the new document textnology

John Martin turns technology-assisted review and post-clustering data extraction upside down using facial recognition principles:

Companies achieve many benefits when they can quickly, accurately and inexpensively cluster like documents. For example, 500-page home loan files can be scanned on high-speed scanners, the pages programmatically grouped into documents, and like document types clustered or categorized across all the loan files without significant operator intervention. The company can then check whether the mortgage files contain the expected types of documents, e.g. deeds, inspection reports, insurance binders and lien releases.
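
As a rough illustration of the completeness check described above, here is a minimal Python sketch. It assumes clustering has already assigned a document-type label to every document in each loan file. The checklist, the labels and the function name are hypothetical and are not taken from BR's product.

```python
# Hypothetical checklist of document types every loan file should contain.
EXPECTED_TYPES = {"deed", "inspection report", "insurance binder", "lien release"}


def missing_document_types(loan_files):
    """Map each incomplete loan file ID to the sorted list of types it lacks.

    loan_files: dict of loan file ID -> iterable of document-type labels
    assigned to that file's documents after clustering.
    """
    report = {}
    for loan_id, doc_types in loan_files.items():
        missing = EXPECTED_TYPES - set(doc_types)
        if missing:
            report[loan_id] = sorted(missing)
    return report


if __name__ == "__main__":
    files = {
        "loan-001": ["deed", "inspection report", "insurance binder", "lien release"],
        "loan-002": ["deed", "insurance binder"],
    }
    print(missing_document_types(files))
    # {'loan-002': ['inspection report', 'lien release']}
```

Complete files are left out of the report, so an empty result means every file had every expected type.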

Furthermore, with precise clustering of the same document types, data that is unique to each
document can be extracted and compared to expected values. To continue the loan file example,
the form of the grantee’s name or the property description can be compared across all documents or
against a loan tracking system to make sure all the values match.
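
The comparison step can be sketched along the following lines, assuming the extracted values and the loan tracking system's values are both available as plain dictionaries. The field names, the data layout and the normalization rule are illustrative assumptions only.

```python
def normalize(value):
    """Collapse whitespace and case so trivial formatting differences are ignored.

    This normalization is an assumption; a stricter check on the exact form
    of a name could compare the raw strings instead.
    """
    return " ".join(value.split()).casefold()


def find_mismatches(extracted, expected):
    """Return (doc_id, field, extracted_value, expected_value) for each disagreement.

    extracted: dict of doc_id -> dict of field -> value pulled from that document
    expected:  dict of field -> value recorded in the loan tracking system
    """
    mismatches = []
    for doc_id, fields in extracted.items():
        for field, value in fields.items():
            if field in expected and normalize(value) != normalize(expected[field]):
                mismatches.append((doc_id, field, value, expected[field]))
    return mismatches


if __name__ == "__main__":
    expected = {"grantee": "Jane Q. Smith"}
    extracted = {
        "deed": {"grantee": "JANE Q.  SMITH"},
        "lien release": {"grantee": "Jane Smith"},
    }
    print(find_mismatches(extracted, expected))
    # [('lien release', 'grantee', 'Jane Smith', 'Jane Q. Smith')]
```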

New technology from BeyondRecognition (“BR”) provides this type of functionality. BR examines the
“faces” or images of documents and, using between roughly a thousand and fifteen hundred points of
comparison, clusters or groups those documents that have a high degree of similarity. By contrast, facial
recognition of pictures of human faces typically uses fewer than 200 points of comparison.
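
BR's actual comparison points and clustering method are not described here, so the following is only a generic sketch of similarity-based grouping. It assumes each document image has already been reduced to a numeric feature vector (three values here instead of a thousand or more) and uses cosine similarity with a simple greedy threshold rule. The threshold, the vectors and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(features, threshold=0.95):
    """Greedy grouping: a document joins the first cluster whose representative
    (that cluster's first member) is at least `threshold` similar to it;
    otherwise it starts a new cluster.

    features: dict of doc_id -> list of floats
    Returns a list of clusters, each a list of doc_ids.
    """
    clusters = []  # list of (representative_vector, [doc_ids])
    for doc_id, vector in features.items():
        for representative, members in clusters:
            if cosine_similarity(vector, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((vector, [doc_id]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    features = {
        "deed_a": [1.0, 0.0, 0.2],
        "deed_b": [0.9, 0.1, 0.2],
        "binder_a": [0.0, 1.0, 0.1],
    }
    print(cluster_by_similarity(features))
    # [['deed_a', 'deed_b'], ['binder_a']]
```

The greedy rule is chosen for brevity. Its result can depend on the order in which documents are processed, which a production clustering method would need to address.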

One of the most subtle but profound consequences of programmatic document typing or clustering is that
documents are arranged in clusters before you have to decide what to name them or what data elements to
extract from them or how long to retain them.

Post-Clustering Data Element Extraction

Anyone who has ever tried to write coding rules or document processing instructions for the handling of
heterogeneous document collections knows that one of the largest obstacles is that you don’t know what’s
in the collection until you’ve gone through it all. The typical scenario is to sample documents, write the
best manual you can up front and then be prepared for many updates and revisions as the project unfolds.

The graphical interface used by BR for creating data extraction rules permits rapid, extremely accurate
coding of the various document clusters. Rules use cluster-specific location coordinates and other textual
as well as non-textual markers to extract those data elements most pertinent to each document type or cluster.
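
To make the idea of a cluster-specific, coordinate-based rule concrete, here is a minimal sketch. It assumes each page is available as a list of words with page coordinates (for example from OCR) and that a rule is simply a named rectangle. The `Word` and `ZoneRule` classes and the coordinates are hypothetical and do not represent BR's rule format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


@dataclass
class ZoneRule:
    """Extract the text whose position falls inside a rectangle.

    One rule set would be defined per cluster, since documents in the same
    cluster are assumed to share a layout.
    """
    field: str
    left: float
    top: float
    right: float
    bottom: float

    def extract(self, words):
        inside = [
            w for w in words
            if self.left <= w.x <= self.right and self.top <= w.y <= self.bottom
        ]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
        return " ".join(w.text for w in inside)


def apply_rules(rules, words):
    """Run every rule for a cluster against one document's words."""
    return {rule.field: rule.extract(words) for rule in rules}


if __name__ == "__main__":
    deed_rules = [ZoneRule("grantee", left=140, top=90, right=400, bottom=110)]
    words = [
        Word("Grantee:", 50, 100),
        Word("Jane", 150, 100),
        Word("Smith", 200, 100),
        Word("Lot", 150, 300),
    ]
    print(apply_rules(deed_rules, words))
    # {'grantee': 'Jane Smith'}
```

Applying the same rule set to every document in a cluster and viewing the results side by side is one way to spot a badly placed zone quickly.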

In fact, it will typically take an experienced clustering consultant less time to create and apply the
rules after the documents have been programmatically clustered than it would take a project manager to write a comprehensive coding manual for manual coding before or during the project implementation phase.

With the graphical interface, the clustering consultant can also see immediately what
data is extracted across an entire cluster as the clusters are being analyzed. This greatly reduces
the time required to write and debug data element extraction rules. Without the interactive feature,
testing would have to be done on a batch basis, which is inherently more time-consuming and permits far
fewer iterations.

Clustering also readily identifies anomalous documents, ones that are truly unique within the
collection and for which individual rules need not be developed.

Anyone who has had to work within the limitations of writing data extraction rules for documents based solely upon the textual representation of those documents will be pleased to know that BR’s extraction rules can be based on non-textual graphical
elements, e.g. logos or vertical lines. This ability to use non-text graphical elements as data
extraction flags or cues greatly expands the flexibility and power of BR’s data element extraction capability.
It enables BR to achieve an accuracy and thoroughness rivaling, if not in many cases exceeding,
that of human reviewers or coders, and at a fraction of the cost and time, to say nothing of the
security benefits of not having legions of people reviewing potentially sensitive corporate documents
and trade secrets just to have indices prepared.

If you are not familiar with this technology, please visit http://www.beyondrecognition.net/.

Alternatively, to learn how you can put BeyondRecognition to work solving your document problems,
contact John Martin at John@beyondrecognition.net or Kriss@focusdata-mgt.com.

Sedona Principles 2nd ed.

1. Electronically stored information is potentially discoverable under Fed. R. Civ. P. 34 or its state equivalents. Organizations must properly preserve electronically stored information that can reasonably be anticipated to be relevant to litigation.
2. When balancing the cost, burden, and need for electronically stored information, courts and parties should apply the proportionality standard embodied in Fed. R. Civ. P. 26(b)(2)(C) and its state equivalents, which require consideration of the technological feasibility and realistic costs of preserving, retrieving, reviewing, and producing electronically stored information, as well as the nature of the litigation and the amount in controversy.
3. Parties should confer early in discovery regarding the preservation and production of electronically stored information when these matters are at issue in the litigation and seek to agree on the scope of each party’s rights and responsibilities.
4. Discovery requests for electronically stored information should be as clear as possible, while responses and objections to discovery should disclose the scope and limits of the production.
5. The obligation to preserve electronically stored information requires reasonable and good faith efforts to retain information that may be relevant to pending or threatened litigation. However, it is unreasonable to expect parties to take every conceivable step to preserve all potentially relevant electronically stored information.
6. Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.
7. The requesting party has the burden on a motion to compel to show that the responding party’s steps to preserve and produce relevant electronically stored information were inadequate.
8. The primary source of electronically stored information for production should be active data and information. Resort to disaster recovery backup tapes and other sources of electronically stored information that are not reasonably accessible requires the requesting party to demonstrate need and relevance that outweigh the costs and burdens of retrieving and processing the electronically stored information from such sources, including the disruption of business and information management activities.
9. Absent a showing of special need and relevance, a responding party should not be required to preserve, review, or produce deleted, shadowed, fragmented, or residual electronically stored information.
10. A responding party should follow reasonable procedures to protect privileges and objections in connection with the production of electronically stored information.
11. A responding party may satisfy its good faith obligation to preserve and produce relevant electronically stored information by using electronic tools and processes, such as data sampling, searching, or the use of selection criteria, to identify data reasonably likely to contain relevant information.
12. Absent party agreement or court order specifying the form or forms of production, production should be made in the form or forms in which the information is ordinarily maintained or in a reasonably usable form, taking into account the need to produce reasonably accessible metadata that will enable the receiving party to have the same ability to access, search, and display the information as the producing party where appropriate or necessary in light of the nature of the information and the needs of the case.
13. Absent a specific objection, party agreement or court order, the reasonable costs of retrieving and reviewing electronically stored information should be borne by the responding party, unless the information sought is not reasonably available to the responding party in the ordinary course of business. If the information sought is not reasonably available to the responding party in the ordinary course of business, then, absent special circumstances, the costs of retrieving and reviewing such electronic information may be shared by or shifted to the requesting party.
14. Sanctions, including spoliation findings, should be considered by the court only if it finds that there was a clear duty to preserve, a culpable failure to preserve and produce relevant electronically stored information, and a reasonable probability that the loss of the evidence has materially prejudiced the adverse party.
Copyright © 2007 The Sedona Conference®. All Rights Reserved.
Reprinted courtesy of The Sedona Conference®.
Go to http://www.thesedonaconference.org/ to download a free copy of the complete document for your personal use only.