Sunday, October 10, 2010

Can Hash Values Leave Your E-discovery Team with Egg on Its Face?




A lot of rhetoric has been focused lately on the use of the MD5 (Message-Digest algorithm 5) hash code to uniquely identify one item and defensibly compare it to another. All of this stems from the fact that there has been at least one instance where two different documents resulted in the same MD5 hash. So how troublesome is this when including or (more importantly) excluding items from case review?

MD5 hashes are widely used today on countless file servers and P2P networks, and in the e-discovery realm as a way to guarantee file integrity. A hash value is popularly known as a “digital fingerprint.” Hashing is the bedrock of e-discovery because the digital fingerprint guarantees the authenticity of data and protects it against alteration, whether negligent or intentional. Hashing also allows for the identification of particular files and the easy filtering of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.
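To make the deduplication idea concrete, here is a minimal Python sketch that groups items by the MD5 of their content. The file names and contents are hypothetical, and real processing tools typically factor in more than the raw bytes, so treat this as an illustration rather than any vendor's method.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def deduplicate(items: dict[str, bytes]) -> dict[str, list[str]]:
    """Group item names by the MD5 digest of their content."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, content in items.items():
        groups[md5_hex(content)].append(name)
    return dict(groups)


# Hypothetical collection: two byte-identical memos and one different email.
collection = {
    "memo_copy1.txt": b"Quarterly results attached.",
    "memo_copy2.txt": b"Quarterly results attached.",
    "email_0001.txt": b"Please review before Friday.",
}

for digest, names in deduplicate(collection).items():
    print(digest, names)
```

Any group holding more than one name is a set of duplicates, so only one copy from that group would need to go to review.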

According to a report from security researcher Dan Kaminsky, the MD5 cryptographic algorithm may be at risk. This means that files, applications and programs supposedly authenticated and verified by MD5 could potentially be compromised.
According to Kaminsky, this makes them blind to any signature embedded within MD5 collisions.

In a research paper titled "MD5 To Be Considered Harmful Some Day," Kaminsky expanded on the theoretical work done by Chinese security researchers Xiaoyun Wang, Dengguo Feng, Xuejia Lai and Hongbo Yu on "Collisions for MD5 Hash Functions." Kaminsky also released a tool, Stripwire, to demonstrate some of the attacks he describes.

A hash collision essentially means that two different inputs produce identical outputs from a hash function. When collisions can be found in practice, the algorithm is no longer considered cryptographically secure and can be attacked. In August, French researcher Antoine Joux presented an unpublished paper at the Crypto 2004 conference similar to the original Chinese research that Kaminsky expanded upon.
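A quick way to see what a collision looks like is to shrink the hash. The Python sketch below keeps only the first 16 bits (four hex digits) of each MD5 digest, which is an assumption made purely for illustration, and searches for two different inputs that share that truncated value. It is not an attack on full MD5 and is unrelated to the techniques in the papers above.

```python
import hashlib
from itertools import count


def truncated_md5(data: bytes, hex_chars: int = 4) -> str:
    """First hex_chars hex digits of the MD5 digest (4 digits = 16 bits)."""
    return hashlib.md5(data).hexdigest()[:hex_chars]


def find_truncated_collision(hex_chars: int = 4) -> tuple[bytes, bytes, str]:
    """Search for two different inputs that share a truncated digest."""
    seen: dict[str, bytes] = {}
    for i in count():
        message = str(i).encode("ascii")
        digest = truncated_md5(message, hex_chars)
        if digest in seen:
            # Two distinct messages, one truncated digest: a collision.
            return seen[digest], message, digest
        seen[digest] = message


first, second, shared = find_truncated_collision()
print(first, second, shared)
```

With only 65,536 possible truncated values the search must succeed after at most 65,537 inputs. The full 128-bit digest leaves vastly more room, which is why real MD5 collisions took dedicated research to produce.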


The 128-bit (16-byte) MD5 hashes (also termed message digests) are typically represented as a sequence of 32 hexadecimal digits. The following demonstrates a 43-byte ASCII input and the corresponding MD5 hash:

MD5("The quick brown fox jumps over the lazy dog")
= 9e107d9d372bb6826bd81d3542a419d6

Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the avalanche effect. For example, adding a period to the end of the sentence:

MD5("The quick brown fox jumps over the lazy dog.")
= e4d909c290d0fb1ca068ffaddf22cbd0

As you can see, the simple addition of a period completely changes the hash value.
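The two values above can be reproduced with Python's standard `hashlib` module. The helper names `md5_hex` and `differing_positions` are mine, introduced only for this sketch.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 digest of an ASCII string, as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count the hex positions at which two equal-length digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


sentence = "The quick brown fox jumps over the lazy dog"
without_period = md5_hex(sentence)
with_period = md5_hex(sentence + ".")

print(len(sentence.encode("ascii")))  # 43 bytes of input
print(without_period)  # 9e107d9d372bb6826bd81d3542a419d6
print(with_period)     # e4d909c290d0fb1ca068ffaddf22cbd0
print(differing_positions(without_period, with_period), "of 32 hex digits differ")
```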

I found an article written by Matt Harnish of Fios that states:

Since 1997, it has been known that, theoretically, two different items could produce the same MD5 hash. Only recently has it been shown with tangible data that two sets of data (“vectors” in this case) that were different produced the same MD5 hash code. Interestingly, it has taken a lot of dedicated effort over a long period of time to produce evidence of this single collision within a type of e-discovery data that is not very relevant to everyday loose data and email.
Also, remember that most of the studies have been with vectors looking at “multi-collision attacks” and “forced expansion failures” – typically with authentication. Not something your data deals with? I didn’t think so. And, yes, there has been a recent discovery of a short phrase that could be manipulated in such a way as to produce the same MD5 hash. Is this indicative of the types of documents and email that would be relevant in litigation? Highly doubtful. Even if it was, the metadata surrounding the items was not considered or factored into the offending MD5 hash. Any e-discovery providers worth their salt will consider both the data itself as well as the metadata in some proportions during processing. As such, the examples causing concern and controversy are not based on anything that resembles the data that is dealt with during e-discovery, and they do not include the parameters that legal professionals today incorporate.
Sure, there are those out there who raise scary-sounding issues (e.g., if two files have the same hash, then two files appended with the same data also have the same hash); however, the fact remains that the likelihood of duplication over two different sets of data using an MD5 hash is 1 in 2^128 (or, in more common terms, 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 or 340 billion billion billion billion or 340 undecillion). That is an incredibly low probability of occurrence, especially in light of the fact that the likelihood of two humans having the same fingerprint is somewhere between 1 in 6.4 billion (Galton study) and 1 in 100 billion billion (Osterburg study). Regardless of which fingerprint study you prefer, the MD5 hash is certainly much more unique than human fingerprints – this is why the hash code is sometimes referred to as data’s “digital fingerprint” – yet human fingerprints are often considered to be one of the almost “irrefutable” pieces of evidence in criminal matters.
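The arithmetic behind that figure is easy to check. The short Python sketch below assumes an idealized hash whose 128-bit outputs are spread evenly, so it describes an accidental match between two specific items, not a collision that someone has deliberately constructed. The fingerprint odds are the two figures quoted above.

```python
digest_bits = 128

# Number of distinct values a 128-bit digest can take.
possible_digests = 2 ** digest_bits
print(f"{possible_digests:,}")
# 340,282,366,920,938,463,463,374,607,431,768,211,456

# Chance that two specific, different items share a digest (idealized hash).
accidental_match = 1 / possible_digests
print(accidental_match)

# Fingerprint odds as quoted in the article.
galton_odds = 64 * 10 ** 8     # 1 in 6.4 billion
osterburg_odds = 10 ** 20      # 1 in 100 billion billion

# How many times larger the digest space is than each fingerprint figure.
print(f"{possible_digests // galton_odds:,}")
print(f"{possible_digests // osterburg_odds:,}")
```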


The Federal Identity Credentialing Committee (FICC) put it nicely when they reported on March 8th that an MD5 "compromise is not seen as a major impact to the security product and services industry." See page three of the FPKI minutes: http://www.idmanagement.gov/fpkipa/

In other words, carry on... for the time being, anyway.
