EDD: Issues, Law, and Solutions: Can Hash values leave your E-discovery Team with Egg on its Face?

Sunday, October 10, 2010

A lot of rhetoric has been focused lately on the use of the MD5 (Message-Digest algorithm 5) hash code to uniquely identify one item and defensibly compare it to another. All of this stems from the fact that there has been at least one instance where two different documents resulted in the same MD5 hash. So how troublesome is this when including or (more importantly) excluding items from case review?

MD5 hashes are widely used today on countless file servers and P2P networks, as well as to guarantee file integrity in the e-discovery realm. An MD5 hash is popularly known as a “digital fingerprint.” Hashing is the bedrock of e-discovery because the digital fingerprint guarantees the authenticity of data and protects it against alteration, whether negligent or intentional. Hashing also allows for the identification of particular files and the easy filtering of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.
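To make deduplication concrete, here is a minimal Python sketch of grouping files by their MD5 value. The file names and contents are made up for illustration, and a real processing platform will do a good deal more than this (this is not any vendor's actual method), but the core idea is the same: identical bytes produce identical hashes.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal digits."""
    return hashlib.md5(data).hexdigest()


def group_by_hash(documents: dict) -> dict:
    """Map each MD5 value to the list of document names that share it."""
    groups = defaultdict(list)
    for name, content in documents.items():
        groups[md5_hex(content)].append(name)
    return dict(groups)


# Hypothetical documents: two are byte-for-byte identical, one differs.
documents = {
    "memo_v1.txt": b"Quarterly results attached.",
    "memo_copy.txt": b"Quarterly results attached.",
    "memo_v2.txt": b"Quarterly results attached!",
}

groups = group_by_hash(documents)
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)  # [['memo_v1.txt', 'memo_copy.txt']]
```

Only one copy from each duplicate group would need to go to review. The whole approach rests on the assumption that two different files never share a hash, which is exactly what the collision debate is about.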

According to a report from security researcher Dan Kaminsky, the MD5 cryptographic algorithm may be at risk. This means that files, applications and programs supposedly authenticated and verified by MD5 could potentially be compromised.
According to Kaminsky, this makes them blind to any signature embedded within MD5 collisions.

In a research paper titled "MD5 To Be Considered Harmful Someday," Kaminsky expanded on the theoretical work done by Chinese security researchers Xiaoyun Wang, Dengguo Feng, Xuejia Lai and Hongbo Yu on "Collisions for MD5 Hash Functions." Kaminsky released a tool, Stripwire, to demonstrate some of the attacks he describes.

A hash collision essentially means that two different inputs produce the same output from a hash function. An algorithm for which such collisions can be found is no longer considered cryptographically secure and can be attacked. In August 2004, French researcher Antoine Joux presented an unpublished paper at the Crypto 2004 conference similar to the original Chinese research that Kaminsky expanded upon.
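A toy example may help show what a collision looks like. The Python sketch below is not the attack from the research above. It deliberately weakens MD5 by keeping only the first four hex digits (16 bits) of each hash, so that a brute-force search finds two different messages with the same shortened value in a fraction of a second. The function names and the "message-N" inputs are invented for illustration.

```python
import hashlib
from itertools import count


def truncated_md5(data: bytes, hex_digits: int = 4) -> str:
    """Return only the first hex_digits characters of the MD5 hex digest."""
    return hashlib.md5(data).hexdigest()[:hex_digits]


def find_toy_collision(hex_digits: int = 4):
    """Try messages until two different ones share a truncated hash."""
    seen = {}  # truncated hash -> first message that produced it
    for i in count():
        message = f"message-{i}".encode("ascii")
        tag = truncated_md5(message, hex_digits)
        if tag in seen:
            return seen[tag], message, tag
        seen[tag] = message


first, second, tag = find_toy_collision()
print(first, second, "share the truncated hash", tag)
print(hashlib.md5(first).hexdigest())
print(hashlib.md5(second).hexdigest())
```

The two full 128-bit hashes printed at the end still differ. Finding two inputs that agree on all 128 bits is what took researchers years of dedicated work.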

The 128-bit (16-byte) MD5 hashes (also termed message digests) are typically represented as a sequence of 32 hexadecimal digits. The following demonstrates a 43-byte ASCII input and the corresponding MD5 hash:

MD5("The quick brown fox jumps over the lazy dog")
= 9e107d9d372bb6826bd81d3542a419d6

Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the avalanche effect. For example, adding a period to the end of the sentence:

MD5("The quick brown fox jumps over the lazy dog.")
= e4d909c290d0fb1ca068ffaddf22cbd0

As you can see, the simple addition of a period completely changes the hash value.
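If you want to reproduce the two values above yourself, a short Python sketch using the standard hashlib module will do it. The bit count at the end is my own addition, just to put a number on how different the two hashes are.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of an ASCII string as 32 hex digits."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


original = "The quick brown fox jumps over the lazy dog"
with_period = original + "."

h1 = md5_hex(original)
h2 = md5_hex(with_period)
print(h1)  # 9e107d9d372bb6826bd81d3542a419d6
print(h2)  # e4d909c290d0fb1ca068ffaddf22cbd0

# Count how many of the 128 output bits differ between the two hashes.
diff_bits = bin(int(h1, 16) ^ int(h2, 16)).count("1")
print(diff_bits, "of 128 bits differ")
```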

I found an article written by Matt Harnish of Fios that states:

Since 1997, it has been known that, theoretically, two different items could produce the same MD5 hash. Only recently has it been shown with tangible data that two sets of data (“vectors” in this case) that were different produced the same MD5 hash code. Interestingly, it has taken a lot of dedicated effort over a long period of time to produce evidence of this single collision within a type of e-discovery data that is not very relevant to everyday loose data and email.
Also, remember that most of the studies have been with vectors looking at “multi-collision attacks” and “forced expansion failures” – typically with authentication. Not something your data deals with? I didn’t think so. And, yes, there has been a recent discovery of a short phrase that could be manipulated in such a way as to produce the same MD5 hash. Is this indicative of the types of documents and email that would be relevant in litigation? Highly doubtful. Even if it was, the metadata surrounding the items was not considered or factored into the offending MD5 hash. Any e-discovery providers worth their salt will consider both the data itself as well as the metadata in some proportions during processing. As such, the examples causing concern and controversy are not based on anything that resembles the data that is dealt with during e-discovery, and they do not include the parameters that legal professionals today incorporate.
Sure, there are those out there who raise scary-sounding issues (e.g., if two files have the same hash, then two files appended with the same data also have the same hash); however, the fact remains that the likelihood of duplication over two different sets of data using an MD5 hash is 1 in 2^128 (or, in more common terms, 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 or 340 billion billion billion billion or 340 undecillion). That is an incredibly low probability of occurrence, especially in light of the fact that the likelihood of two humans having the same fingerprint is somewhere between 1 in 6.4 billion (Galton study) and 1 in 100 billion billion (Osterburg study). Regardless of which fingerprint study you prefer, the MD5 hash is certainly much more unique than human fingerprints – this is why the hash code is sometimes referred to as data’s “digital fingerprint” – yet human fingerprints are often considered to be one of the almost “irrefutable” pieces of evidence in criminal matters.
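The arithmetic in that passage is easy to check. The Python sketch below uses the fingerprint figures exactly as quoted above, and then adds one assumption of my own: a hypothetical collection of one billion documents, to see how likely any accidental match is across the whole set rather than for a single pair.

```python
from fractions import Fraction

# Number of possible 128-bit MD5 values.
space = 2 ** 128
print(f"{space:,}")  # 340,282,366,920,938,463,463,374,607,431,768,211,456

# Fingerprint odds as quoted in the article ("1 in N").
galton = 6_400_000_000               # 1 in 6.4 billion
osterburg = 100 * 10 ** 9 * 10 ** 9  # 1 in 100 billion billion

# How many times larger the MD5 space is than each fingerprint figure.
print(f"{space / galton:.2e}")
print(f"{space / osterburg:.2e}")

# Assumption for illustration: a collection of one billion documents.
n = 10 ** 9
pairs = n * (n - 1) // 2  # number of distinct document pairs

# Each pair matches by chance with probability 1 in 2^128, so the chance
# that any pair matches is at most pairs / 2^128.
bound = float(Fraction(pairs, space))
print(f"{bound:.2e}")
```

These numbers describe accidental matches between ordinary files. They say nothing about inputs that someone has deliberately constructed to collide, which is the scenario the research papers address.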

The Federal Identity Credentialing Committee (FICC) put it nicely when they reported on March 8th that an MD5 "compromise is not seen as a major impact to the security product and services industry". See page three of the FPKI minutes: http://www.idmanagement.gov/fpkipa/

In other words, carry on ... for the time being, anyway.
