Monday, September 27, 2010

eDiscovery Searching 101: Don’t Get “Wild” with Wildcards

I just stumbled across an article by Doug Austin at the brand-new eDiscovery Daily blog. The blog looks interesting, and this article hit home given my own experience with deficient keyword search terms. To expand on this:

At present, litigants most commonly search repositories of electronic data for documents containing defined search terms (keyword searches) or search terms appearing in a specified relation to one another (Boolean searches). These search technologies have been in use for years, but they are not perfect solutions. They identify only those electronic documents containing the precise terms, and they miss documents using words that are close, but not identical, to the specified search terms, such as abbreviations, synonyms, nicknames, initials and, perhaps most importantly, misspelled words.

On the other hand, using more search terms may reduce the risk that an electronic search will miss a relevant document, but only at the price of increasing -- often quite dramatically -- the number of irrelevant documents found in the search.

Evidently weary of deficient keyword searches, U.S. Magistrate Judge Andrew J. Peck recently issued a self-styled "wake-up call" to members of the bar in the Southern District of New York. Instead of attorneys designing keywords without adequate information "by the seat of their pants," Peck appealed for keyword formulations based on careful thought, quality control, testing and cooperation.

Here's part of Doug Austin's blog entry (the full post is on the eDiscovery Daily blog):

Several months ago, I provided search strategy assistance to a client that had already agreed upon several searches with opposing counsel. One search related to mining activities, so the attorney decided to use a wildcard of “min*” to retrieve variations like “mine”, “mines” and “mining”.

That one search retrieved over 300,000 files with hits.

Why? Because there are 269 words in the English language that begin with the letters “min”. Words like “mink”, “mind”, “mint” and “minion” were all being retrieved in this search for files related to “mining”. We ultimately had to go back to opposing counsel and negotiate a revised search that was more appropriate.
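The over-matching is easy to reproduce on a small scale. The sketch below is only an illustration: it assumes a tiny made-up word list (`SAMPLE_WORDS`) in place of a real document index, and it uses the `fnmatchcase` function from Python's standard `fnmatch` module to stand in for a search tool's wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical sample vocabulary; a real collection would hold far more terms.
SAMPLE_WORDS = ["mine", "mines", "mining", "mink", "mind", "mint", "minion", "copper"]

# The variations the search was actually meant to retrieve.
INTENDED = {"mine", "mines", "mining"}


def wildcard_hits(pattern, words):
    """Return the words that match a shell-style wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


hits = wildcard_hits("min*", SAMPLE_WORDS)
unintended = [w for w in hits if w not in INTENDED]

print(hits)        # every sample word that starts with "min"
print(unintended)  # matches that have nothing to do with mining
```

Even in this eight-word sample, more than half of the wildcard hits are unrelated to mining.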

How do you ensure that you’re retrieving all variations of your search term?

Stem Searches

One way to capture the variations is with stem searching. Applications that support stem searching let you enter the root word (e.g., mine) and will locate that word and its variations. Stem searching lets you find the variations of a word without having to use wildcards.
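To make the idea concrete, here is a deliberately naive sketch of stem matching. The suffix list and the helper names (`naive_stem`, `stem_search`) are assumptions made for illustration, not the behavior of any particular review platform; real tools generally use more careful stemming rules (the Porter stemmer is a well-known example).

```python
# Suffixes to strip, longest first. This list is an illustrative assumption,
# not the rule set of any particular eDiscovery product.
SUFFIXES = ("ing", "es", "ed", "s", "e")
MIN_STEM_LENGTH = 3  # never strip a word down to fewer than three letters


def naive_stem(word):
    """Strip the first matching suffix to approximate the word's root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem_search(root, words):
    """Return the words whose naive stem equals the stem of the root word."""
    target = naive_stem(root)
    return [w for w in words if naive_stem(w) == target]


words = ["mine", "mines", "mining", "mined", "mink", "mind", "mint", "minion", "minute"]
matches = stem_search("mine", words)
print(matches)
```

With this sample list, the search for "mine" keeps "mine", "mines", "mining" and "mined" and skips words such as "mink" and "minion", which share the letters "min" but not the root.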

Other Methods

If your application doesn’t support stem searches, an online word list can show you the words that begin with your search string (e.g., all 269 words beginning with “min”); simply substitute other characters for “min” to see the words that start with those characters. Choose the variations you want and incorporate them into the search instead of the wildcard, i.e., use “(mine or mines or mining)” instead of “min*” to retrieve a more relevant result set.
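That substitution can be sketched in a few lines: list the candidate words for a prefix, keep only the relevant ones, and join them into an explicit query. The function names and the sample vocabulary below are hypothetical, and the lowercase "or" simply mirrors the example above; check your own application's query syntax before reusing the output.

```python
def expand_prefix(prefix, vocabulary):
    """List every distinct vocabulary word that starts with the prefix."""
    prefix = prefix.lower()
    return sorted({w.lower() for w in vocabulary if w.lower().startswith(prefix)})


def build_or_query(terms):
    """Join the chosen terms into a parenthesized 'or' expression."""
    return "(" + " or ".join(terms) + ")"


# Hypothetical vocabulary standing in for a word list or an index's term list.
vocabulary = ["Mining", "mine", "mink", "mines", "mind", "mint", "minion", "ore"]

candidates = expand_prefix("min", vocabulary)

# In practice a reviewer picks the relevant variations by hand;
# here the selection is hard-coded for the example.
chosen = [w for w in candidates if w in {"mine", "mines", "mining"}]

query = build_or_query(chosen)
print(candidates)
print(query)
```

Reviewing the candidate list before building the query is the point of the exercise: the irrelevant words are visible up front, before the search is run or negotiated with opposing counsel.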

Some applications let you select the wildcard variations you wish to use. As a result, you can avoid all of the non-relevant variations and limit the search to the relevant hits.
