I've always been suspicious of the claims of the "predictive coding" zealots. Every year it seems that there is a new buzzword in the field of e-discovery. Technology Assisted Review - Check! Information Governance - Check! Cloud Computing - Check! Big Data - Check!
My friend John Martin did a magnificent job of distilling some of my concerns in his blog entry that can be found right here.
The Emperor has No Clothes - and PC Can't See Image-Only Documents
There are several parallels between predictive coding (AKA technology assisted review) and Hans Christian Andersen's tale, "The Emperor's New Clothes." In the story, two weavers tell the emperor they will make him a suit of clothes that will be invisible to anyone who is unfit for his position, stupid, or incompetent. None of the emperor's subjects wants to admit to those deficiencies, so the emperor parades around with no clothes on until a child states the obvious - the emperor has no clothes.
In the case of predictive coding, its advocates have touted the efficacy of their approaches in white papers, blogs, and social media postings, and have practically created a separate industry to host conferences promoting the wonders of predictive coding. Few people want to ruin the moment or buck the trend by pointing out what is obvious when one considers the technology underlying predictive coding - it is completely dependent on having text to analyze. It will absolutely fail to analyze documents for which there is no text, and will do a miserable job where the text is of poor quality.
This might be just an esoteric debating point if virtually all documents had associated text. However, in some industries like oil & gas, half or more of some collections will be engineering drawings and schematics that were output to image-only PDF for distribution and use by those who don't have the software licenses needed to view the documents in their original file formats.
In practically all industries it is common practice to develop documents in one application like Word and then, once finalized, distribute them as image-only PDF so they can be viewed on a variety of devices and so recipients can't easily change the content. In one collection we analyzed, only 20% of the PDFs had associated text. Even if predictive coding were 100% effective, the most it could classify would be that 20%, because it literally cannot "see" the 80% without text. If in fact predictive coding has a recall rate of 70-80% on what it can see, it would identify only 14-16% of the total PDFs (70% × 20% = 14%; 80% × 20% = 16%). By contrast, BR's visual classification technology classified 100% of them.
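The arithmetic above can be sketched in a few lines. This is a minimal illustration using the numbers from the collection described in the post (20% text-bearing PDFs, an assumed 70-80% recall on text-bearing documents); the function name is mine, not a real product's API:

```python
# Effective recall of a text-dependent classifier when only a fraction
# of the collection has extractable text. A document with no text can
# never be found, so overall recall is capped by the text fraction.
def effective_recall(text_fraction, recall_on_text):
    """Fraction of ALL documents the tool can correctly identify."""
    return text_fraction * recall_on_text

text_fraction = 0.20              # 20% of the PDFs had associated text
for recall in (0.70, 0.80):       # assumed recall range on text-bearing docs
    overall = effective_recall(text_fraction, recall)
    print(f"{recall:.0%} recall on text -> {overall:.0%} of the total collection")
```

Note that the ceiling holds no matter how good the classifier gets: with perfect recall on text-bearing documents, this collection still tops out at 20%.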
PDFs will potentially be among the most relevant file types in a collection because that is the format used to distribute information within and among groups of people in an organization, and between organizations. Note that even if predictive coding is acceptable in some narrow e-discovery settings, its text dependency will be fatal for broader information governance purposes.
So... if you're going to use predictive coding, at the very least measure what PC doesn't "see." If you're planning on using PC for information governance purposes, make sure that the organization doesn't mind not classifying a potentially significant percentage of its documents.