Population Analysis: Litigation Support’s New Best Friend

by Sandra E. Serkes


There is a perfect storm happening in large-scale litigation. The prolonged down economy, combined with the explosive growth in the number (volume) of litigation documents, both paper and electronic, has contributed to a new era in large-scale document management. Today, much legal work is outsourced to lower-cost providers, and the legal community is at last (if reluctantly) joining the rest of the world in utilizing predictive sampling to reduce complexity and cost for its clients.

Enter the notion of Population Analysis[1] (a.k.a. early case assessment, pre-conversion, first pass review, and so on), an approach to large-scale document litigation that provides detailed, predictive information about the documents in order for the litigation team to make smart, informed choices about how best to proceed.

Think Before You Act
Population Analysis (“PA”) is a growing field of research and vendor service that evaluates documents based on their content, context, and metadata information. Using sophisticated pattern-matching algorithms, semantic analyses, and computational linguistics, providers of these services are changing the way case teams manage their data.

PA pulls together vast amounts of information about each document and about the population in the aggregate. Typically the process works like this:

1. Data is collected from various sources. Paper documents, electronic files, voicemail archives, etc. all become input “data.”
2. The entire “population” is amassed and put into a software system (typically consisting of a database, analytical tools, reporting system, and a user interface to operate and view everything).
3. Software runs over the data population and grabs information from each unit of available information. (Units include document images, text, metadata, formatting, proximity, and more.)
4. Software then analyzes its results and reports back its findings. The generated report outlines what is known about the population, how that knowledge might be used, and how best to apply the knowledge in a cost- and time-effective way.
The process from data input (step 2) to reporting (step 4) takes between a few hours and a few days, depending upon the population size and complexity of content.
Below is a sample PA Report from Valora Technologies, Inc., a pioneer in automated document assessment. Astute readers will notice that this PA assessment covers both paper and electronic documents, although many PA techniques consider only native electronic populations.


Typically, PA reporting tries to answer the basic questions most litigation teams have about their data populations: What’s in there? Is it germane? How much of it matters? And, of course, how much cost and trouble will it be to get ahold of? The sidebar below this paragraph details some of the specific questions PA can answer.

What PA Analysis & Reporting Should Tell You
In general, PA Analysis and Reporting answers the following questions about an otherwise “blind” document population (data set). Each of the questions below is answered for particular documents, as well as for the population as a whole.

WHAT
What information is contained herein? (What and in what proportion, i.e., how much?)
What kinds of documents are present? (Types, languages, length)
What do they say?
What is relevant, what is not?

WHO
Who is mentioned in this material?
Who provided or authored it?
For whom was each piece of information intended?
Where did this information come from?

WHEN
What timeframe is covered by this information?
Is any of it inside/outside the window I’m concerned with?
How did things change over time?

WHY
Why was this material created?
What was the original intent?

WORTH (HOW MUCH?)
How much information is here?
How much will it cost to manage? (Prepare, process, host, review, etc.)


What’s In There?
Good PA Reporting provides detailed information about which kinds of documents are present, and in what density. This is helpful because it goes far beyond simple file extensions. To know you have 37,000 PDFs in your population is of limited use. A PDF file could contain just about anything, including images of scanned paper! Far more useful would be a breakdown of the actual document types, such as “Q3 2007 Income Statements” or “Hospital Admission Forms.” Sophisticated PA engines go into the documents themselves to determine what they are and what they contain.

PA Reporting should also tell you the relative distribution of documents by type. A common approach to communicating this information is to present a pie chart of the types of documents present, with each pie slice indicating the relative “share” of the population of each type. Reports can be as broad or as granular as needed per the specific population data. Below are examples of broad distribution reporting versus granular (specific) distribution reporting for the same document population.



PA Reporting for Document Type Distribution: Simple Vs. Detailed Reporting
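
As a rough illustration of simple versus detailed distribution reporting, the sketch below computes each type's share of a small, made-up population, first at the granular level and then rolled up into broad categories. The tag values and the BROAD_CATEGORY mapping are hypothetical, not output from any actual PA engine.

```python
from collections import Counter

# Hypothetical roll-up from specific document types to broad categories.
BROAD_CATEGORY = {
    "Income Statement": "Financial",
    "Cash Flow Statement": "Financial",
    "Email": "Correspondence",
    "Letter": "Correspondence",
}

def distribution(labels):
    """Return (label, count, share of population) tuples, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

# Hypothetical type tags for a ten-document population.
doc_types = (["Email"] * 4 + ["Income Statement"] * 3
             + ["Cash Flow Statement"] * 2 + ["Hospital Admission Form"])

detailed = distribution(doc_types)
simple = distribution(BROAD_CATEGORY.get(t, "Other") for t in doc_types)

for label, n, share in simple:
    print(f"{label:<16} {n:>2} {share:.0%}")
```

Each tuple corresponds to one pie slice: the same ten documents yield four slices in the detailed view and three in the simple view.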


PA also reports which languages are present in the population, where the documents came from (source), their length, condition, and suitability for various purposes, and whether potentially privileged, confidential, or classified materials are present.

What’s Needed? What’s Useful?
One of the first things a PA system will do is “tag” documents as its software engines run over them. Tagging means that software will evaluate each document, extracting information on a variety of attributes. These attributes (the tags) are then stored in a back-end database for later analysis and reporting.

Simple PA systems create tags from electronic file metadata and then mirror that information back in a report. These simple tags are fields that the originating software system created when the files were first created. Typical metadata information includes:
• the size of the file (typically in KB or MB)
• the type of application used to create it (such as Microsoft Excel or Outlook)
• the filename (e.g., letter.doc)
• the date the file was first created on the system
and so on. The actual content of the files is generally skipped, as is any type of analysis of the document itself, such as what it might actually be or whether or not its contents may be relevant.
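
A minimal sketch of this kind of metadata-only tagging is shown here, using only the Python standard library. The function name, the tag names, and the extension-to-application mapping are assumptions made for illustration. Note also that file creation time is not available portably, so this sketch records the last-modified time as a stand-in.

```python
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical mapping from file extension to the likely creating application.
APPLICATION_BY_EXTENSION = {
    ".doc": "Microsoft Word",
    ".xls": "Microsoft Excel",
    ".msg": "Microsoft Outlook",
}

def metadata_tags(path):
    """Build simple tags from file-system metadata only; content is never read."""
    info = os.stat(path)
    name = os.path.basename(path)
    extension = os.path.splitext(name)[1].lower()
    return {
        "filename": name,
        "application": APPLICATION_BY_EXTENSION.get(extension, "Unknown"),
        "size_kb": round(info.st_size / 1024, 1),
        # Last-modified time, used here in place of a true creation date.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }

# Demonstrate on a throwaway 2 KB file.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "letter.doc")
    with open(sample_path, "wb") as handle:
        handle.write(b"x" * 2048)
    sample_tags = metadata_tags(sample_path)

print(sample_tags)
```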

More sophisticated PA systems go inside the documents’ content, extracting many more attributes and providing analysis of the relative merits of the document. Sophisticated document tags typically include:
• the type of document (such as “Cash Flow Statement” or “Certificate of Service”)
• the date on the document itself (such as the byline date of an article)
• who wrote the document, and for whom it was intended
• any important names, themes or key words present
• how many pages the document has
• whether other documents are attached or packaged with it (such as an email and its attachments)
• whether the document is related to other documents by subject matter, conversation or association, or formatting style
• whether the document is potentially privileged or confidential in nature
• whether the document is relevant or responsive
and so on. There are limitless possibilities to a sophisticated PA engine. In Valora’s experience, the more our clients use these tools and techniques, the more in-depth their tagging requests become. As a general rule, if you can explain what you are looking for in your documents, PA engines can be configured to find it.
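
To make content-based tagging concrete, here is a deliberately simple sketch that assigns a document type and a privilege flag from patterns in the text. Real PA engines rely on far richer techniques than keyword patterns; the rules, tag names, and function below are illustrative assumptions only.

```python
import re

# Hypothetical content rules: the first matching pattern decides the document type.
TYPE_RULES = {
    "Cash Flow Statement": re.compile(r"cash flows? from operating activities", re.I),
    "Certificate of Service": re.compile(r"certificate of service", re.I),
}
# Hypothetical phrases suggesting a document may be privileged.
PRIVILEGE_HINTS = re.compile(r"attorney[- ]client|privileged (and|&) confidential", re.I)

def content_tags(text):
    """Tag a document by looking inside its text rather than at file metadata."""
    doc_type = next(
        (name for name, pattern in TYPE_RULES.items() if pattern.search(text)),
        "Unclassified",
    )
    return {
        "doc_type": doc_type,
        "potentially_privileged": bool(PRIVILEGE_HINTS.search(text)),
    }

filing = content_tags("CERTIFICATE OF SERVICE. I hereby certify that a copy was served.")
memo = content_tags("Privileged & Confidential: attorney-client communication.")
print(filing)
print(memo)
```

Adding a new tag in this sketch means adding a new rule, which mirrors the point above: if you can describe what you are looking for, a rule can be written to look for it.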

PA also aggregates data across the population. Typical PA reporting on a population basis includes:
• how many documents of each type are present (the pie slices)
• how many duplicates and near duplicates are present
• how the documents distribute over time
• how the documents fit together, on the basis of subject matter, conversation association, formatting style and packaging
• how much of the information is likely to be privileged or confidential in nature
• who are the most frequent creators or recipients of information
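
The sketch below shows how a few of these population-level figures might be derived once per-document tags exist. The records are invented, and exact duplicates are detected by hashing the text, which is one common approach and not necessarily what any given PA product does.

```python
import hashlib
from collections import Counter, defaultdict

# Hypothetical tagged documents.
docs = [
    {"id": 1, "author": "alice", "year": 2007, "text": "Q3 results attached."},
    {"id": 2, "author": "bob", "year": 2007, "text": "Q3 results attached."},
    {"id": 3, "author": "alice", "year": 2008, "text": "Please call me."},
]

def exact_duplicate_groups(documents):
    """Group document ids whose text is byte-for-byte identical."""
    groups = defaultdict(list)
    for doc in documents:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        groups[digest].append(doc["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

duplicates = exact_duplicate_groups(docs)
by_year = Counter(doc["year"] for doc in docs)
top_authors = Counter(doc["author"] for doc in docs).most_common(1)

print(duplicates, dict(by_year), top_authors)
```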

PA as a Data Culling Tool
Because PA is aware of the contents of each of the documents and their relevance to specific requests, smart clients are using PA to perform an early culling process on their oversized document populations. By seeking tags that either include documents or exclude them (and sometimes a clever use of both), populations can be custom-tailored to specific needs.

For example, if a litigation team is interested in producing documents to the other side, they often wish to identify and restrict their privileged documents from the production. Using PA, they can construct sophisticated rules that identify and tag the potentially privileged material, prior to processing. By removing (or quarantining) the potentially privileged material, they save immediately on processing costs, and later on hosting and review costs. Every document removed from the queue is one less document to pay for.
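
A culling rule of this kind can be sketched as a pair of include and exclude tag sets. The tag names and the cull function are hypothetical; the point is only that excluded documents are set aside (quarantined) before any processing cost is incurred.

```python
def cull(documents, include=frozenset(), exclude=frozenset()):
    """Split documents into (kept, quarantined) using include/exclude tags.

    A document is quarantined if it carries any excluded tag, or if an
    include set is given and the document carries none of those tags.
    """
    kept, quarantined = [], []
    for doc in documents:
        tags = set(doc["tags"])
        if tags & exclude or (include and not tags & include):
            quarantined.append(doc)
        else:
            kept.append(doc)
    return kept, quarantined

# Hypothetical tagged population.
population = [
    {"id": 1, "tags": ["responsive"]},
    {"id": 2, "tags": ["responsive", "potentially_privileged"]},
    {"id": 3, "tags": ["irrelevant"]},
]

kept, quarantined = cull(
    population,
    include={"responsive"},
    exclude={"potentially_privileged"},
)
print([d["id"] for d in kept], [d["id"] for d in quarantined])
```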

A similar technique is often employed to restrict documents by date ranges, key words or custodian/source. Other people restrict exact duplicates. All of these are good techniques and all can be expanded upon and tested out using PA before committing to a definitive answer. There is nothing worse than running a large population through an e-discovery file conversion process, only to learn you had the wrong parameters at the start! PA permits, and frankly encourages, multiple iterations through the populations to help assess which set of filters and rules produces the optimal output for the case requirements.

Projections & Recommendations from PA
Experienced users of PA learn to use the technique in an iterative way. Iterations may occur on document sub-populations, additional incoming documents or later in time. Because the PA process itself is largely automated, and in some cases provided free of charge, it encourages “experimentation” among litigation teams. Not finding enough issues? Add some more terms to the list and try again. Still too much to review? Try relaxing the Near Duplicates level, so more documents fall into Duplicate Groups. More data coming from another source? Run that against itself and/or against the greater population to see how it compares. Does it contain what you expected?
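
To see why relaxing the near-duplicate level puts more documents into duplicate groups, consider this small sketch. It measures similarity as Jaccard overlap of three-word shingles and groups greedily against each group's first member; both choices are assumptions made for illustration, not a description of any vendor's method.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def near_duplicate_groups(texts, threshold):
    """Greedily group texts whose similarity to a group's first member meets the threshold."""
    sets = [shingles(t) for t in texts]
    groups = []
    for i, current in enumerate(sets):
        for group in groups:
            if jaccard(current, sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

texts = [
    "the quarterly report is attached for your review",
    "the quarterly report is attached for your approval",
    "lunch on friday at noon works for me",
]

strict = near_duplicate_groups(texts, threshold=0.9)
relaxed = near_duplicate_groups(texts, threshold=0.7)
print(strict)
print(relaxed)
```

The first two texts share five of seven distinct shingles (about 0.71), so they stay separate at the strict level and merge into one group at the relaxed level, leaving fewer items to review.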

By iterating and tweaking the PA parameters, people test out different theories of what lies inside the documents. Since the PA is predictive in nature, smaller sub-populations can be assessed and the results forecasted upward to the entire population. This is a technique known as statistical sampling. By sampling a small, yet statistically valid, portion of the documents, clients learn plenty and spend little.
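
The arithmetic behind forecasting from a sample can be sketched as follows. The figures are invented: suppose 400 randomly drawn documents are assessed and 80 turn out to be responsive, in a population of 100,000. The interval uses the standard normal approximation for a proportion and ignores any finite-population correction, which is a simplifying assumption.

```python
import math

def estimate_share(sample_flags, z=1.96):
    """Point estimate and approximate 95% interval for a yes/no tag from a random sample."""
    n = len(sample_flags)
    p = sum(sample_flags) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical sample results: 80 of 400 sampled documents were responsive.
sample_flags = [True] * 80 + [False] * 320
population_size = 100_000

share, low, high = estimate_share(sample_flags)
projected_low, projected, projected_high = (
    round(x * population_size) for x in (low, share, high)
)
print(f"Estimated responsive share: {share:.1%} ({low:.1%} to {high:.1%})")
print(f"Projected responsive documents: {projected:,} ({projected_low:,} to {projected_high:,})")
```

The sample must be drawn at random for the projection to hold; a sample taken from a single custodian or date range would not represent the whole population.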

Finally, better PA Analysis includes predictions not just about the documents themselves, but about their usage in the litigation process. Good, predictive PA can pinpoint how much of a document population is likely to be privileged and/or responsive to a request. Given this, the PA report can predict likely review needs, such as how many staff to hire, the expected per-hour rates of people ultimately looking at the material, and the cost to look at any subset or category of documents, based on their tags. PA is increasingly becoming a resource allocation tool, in addition to a first-glance one.
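
A back-of-the-envelope version of such a staffing projection might look like the sketch below. Every input (document count, review speed, hourly rate, hours available per reviewer) is a hypothetical figure chosen for easy arithmetic, not a benchmark.

```python
import math

def review_plan(doc_count, docs_per_hour, hourly_rate, hours_per_reviewer):
    """Estimate review hours, cost, and headcount for a tagged subset of documents."""
    hours = doc_count / docs_per_hour
    return {
        "hours": hours,
        "cost": hours * hourly_rate,
        "reviewers": math.ceil(hours / hours_per_reviewer),
    }

# Hypothetical inputs: 20,000 documents tagged "responsive", reviewed at
# 50 documents per hour by reviewers billing $60 per hour, each with
# 80 hours available before the deadline.
plan = review_plan(doc_count=20_000, docs_per_hour=50, hourly_rate=60.0, hours_per_reviewer=80)
print(plan)
```

Running the same function over different tag subsets shows what each category of documents would cost to review.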

Cost-Benefit Analysis
There are many areas of cost, time and effort savings using Population Analysis. Here are the most common:

1. Reduced processing costs for ESI file conversion. By performing analysis first, prior to any conversion work, charges are typically split between data coming in for analysis and data ultimately being processed (going out) as a result of the analysis. Almost universally, the outgoing document volume is much lower than the volume initially being assessed. Sometimes the resultant data is only 10% of the original volume. Thus the “out” charge is applied on only 10% of the population, often saving 50% or more on the total ESI processing charges.
2. Reduced hosting costs for document storage and review. Since PA indicates which documents matter and which can be safely discarded (culled), set aside (quarantined) or pushed lower in priority, smart litigation teams will only host those documents that are needed. If PA helps reduce the usable set of documents down to a small enough volume, then only a small portion need to be hosted. Often in-house document management programs are sufficient.
3. Reduced review costs and time. The real bang for the buck with PA, though, is the savings in document review. Widely regarded as the single biggest discovery expense, document review is ripe for taking advantage of PA and its ability to slice and dice document information. By segmenting documents into useful groups (such as groups of duplicate documents or near duplicates, conversation threads from emails or IMs, by topics and content, and by associated people and dates), review occurs much more rapidly than in a simple, “blind,” doc-by-doc process. Consider also that PA segments out those documents that are irrelevant, potentially privileged or out of scope, and the total “to be reviewed” population diminishes altogether. Combined, the net yield on document review is typically a reduction in document scope (number of docs to be reviewed) by 75% or more, and an increase in document review rate by 200% or more.
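
The savings in items 1 and 3 can be worked through with simple arithmetic. In the sketch below, the per-gigabyte rates, volumes, and review speed are all hypothetical; only the 10% surviving volume, the 75% scope reduction, and the 200% rate increase come from the text above, and "an increase of 200%" is read here as a tripling of review speed.

```python
def split_processing_cost(volume_gb, in_rate, out_rate, surviving_fraction):
    """Cost when the 'in' charge covers everything and the 'out' charge covers only what survives analysis."""
    return volume_gb * in_rate + volume_gb * surviving_fraction * out_rate

def review_hours(doc_count, docs_per_hour):
    """Hours needed to review a set of documents at a given speed."""
    return doc_count / docs_per_hour

# Item 1: hypothetical 100 GB population and hypothetical per-GB rates.
flat_cost = 100 * 500.0                                       # convert everything at $500/GB
split_cost = split_processing_cost(100, 150.0, 350.0, 0.10)   # $150/GB in, $350/GB out on 10%
processing_saving = 1 - split_cost / flat_cost

# Item 3: hypothetical 100,000 documents reviewed at 50 documents per hour.
hours_before = review_hours(100_000, 50)
hours_after = review_hours(100_000 * 0.25, 50 * 3)            # 75% fewer docs, 3x the rate
review_saving = 1 - hours_after / hours_before

print(f"Processing: ${flat_cost:,.0f} -> ${split_cost:,.0f} ({processing_saving:.0%} saved)")
print(f"Review: {hours_before:,.0f} h -> {hours_after:,.0f} h ({review_saving:.0%} saved)")
```

Under these assumptions the two review effects multiply: a quarter of the documents at three times the speed takes one-twelfth of the original review hours.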

Cost Calculator: Putting PA to the Test
To evaluate the specific savings in both population volume and review effort, many PA providers have some form of Cost Savings Calculator. An example of Valora’s calculator is given below. You can download the live version from our website: www.valoratech.com.


Summary
Population Analysis is changing the way we think about documents, large populations, document review, and ESI processing. Today we know more than ever about what documents contain and how and why they might be useful to litigation teams. Knowing this information as soon as possible and for as little cost as possible is invaluable and can often mean the difference between case success and failure.

To keep doing “business as usual” is costly, time-consuming and unfair to clients. The litigation support community owes it to itself and its customers to be cognizant of and fluent in this capability. Population Analysis and Reporting are great tools, available now, that can provide almost real-time, detailed information, reduce population scope, group and tag documents for analysis, and perform a first pass of review of the documents themselves.

Remember the fitting words of Benjamin Franklin: “An investment in knowledge pays the best interest.”

For more information on Population Analysis and early case assessment, see:

http://www.valoratech.com/firstlook.html

http://www.allbusiness.com/services/business-services/4332126-1.html

http://library.findlaw.com/2005/May/6/186420.html





Sandra E. Serkes is President & CEO of Valora Technologies, Inc., a leading provider of automated document analysis and reporting. Valora’s patent-pending systems have been automatically tagging and classifying documents of all sorts for over eight years. Ms. Serkes is an expert on document analysis and management of large-scale document populations for litigation and records retention matters. She holds an MBA from Harvard Business School and a Bachelor of Science from MIT. She can be reached at sserkes@valoratech.com or (781) 229-2265.



