Ten Essential Best Practices in Predictive Coding

May 16, 2013

April/May 2013

Warwick Sharp

This paper outlines emerging best practices in the application of predictive coding to e-discovery. Having moved past its early experimental stages, this technology is now approved by multiple courts in the United States and abroad and has become the most talked about topic in e-discovery worldwide.

The classification technologies that underlie predictive coding applications in e-discovery have been used in a broad range of industrial and scientific settings for decades. Some of the best practices that have developed in these settings are analogous to those in the e-discovery setting. However, due to stringent defensibility requirements in e-discovery, it has been necessary to develop and define best practices that address the application of predictive coding technology specifically to e-discovery.

(1) Choose the expert with due consideration. Despite the marketing hype, predictive coding is not magic. It is a smart piece of software, one that can be trained to imitate the criteria used by a human being to evaluate a document’s relevance. In a sense, the software is encoding the intelligence and knowledge of an experienced attorney. But here’s the catch: predictive coding is a garbage-in, garbage-out application. With quality input, predictive coding applications can generate outstanding results.

But the technology will also encode “incorrect” guidance from the trainer. Thus, the first best practice is to give due consideration to the choice of the trainer. Commonly referred to as “the expert,” this person needs to be a knowledgeable attorney, with the authority to make review decisions that are likely to have a significant impact on the conduct and outcome of the case.

(2) Begin with collaborative training. Collaborative training uses a team of two or three experts to train the system together. The collaborative approach is typically used for the first 500 or 1,000 documents. The rule is that the attorneys training the system are required to reach consensus on the relevance designation for each document. Tests using this approach show that at the outset of the training process, the attorneys will disagree on well over 50 percent of the documents. After a few hours, disagreement rates are close to zero. The process taking place here is that the group is progressively refining the concept of relevance that underlies the case. The distillation of a well-defined, well-bounded concept of document relevance helps ensure the quality of the system training.

(3) Tag by “application-accessible” data. Most predictive coding systems in the e-discovery market focus their analysis on document content, rather than external metadata, such as date or custodian. It is important to instruct the expert to tag documents based on the data that can be accessed by the system. For example, if the predictive coding system captures only document content, the expert should be instructed to tag documents based on document content only, regardless of metadata.

Consider, for example, a document whose content is apparently responsive, but which falls outside the relevant date range of the case. Were this document to be tagged as not-relevant, the predictive coding application would be misled into “thinking” that the document content itself is not relevant.

The best approach is to cull by metadata prior to training the predictive coding application, thus avoiding any potential tagging errors.
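The pre-training metadata cull described above can be sketched as a simple filter. This is a minimal illustration in Python; the record fields and the date range are hypothetical, not taken from any particular product.

```python
from datetime import date

# Hypothetical document records: content plus external metadata.
documents = [
    {"id": 1, "content": "merger terms", "date": date(2011, 6, 1)},
    {"id": 2, "content": "lunch plans", "date": date(2012, 3, 15)},
    {"id": 3, "content": "pricing memo", "date": date(2015, 1, 9)},
]

# Relevant date range for the (hypothetical) case.
range_start, range_end = date(2011, 1, 1), date(2013, 12, 31)

# Cull by metadata first, so the expert never has to tag a
# content-relevant document as not-relevant merely because it
# falls outside the date range.
training_pool = [d for d in documents
                 if range_start <= d["date"] <= range_end]

print([d["id"] for d in training_pool])  # → [1, 2]
```

Because the out-of-range document never reaches the expert, its content cannot mislead the training process.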

(4) Understand the distinction between the “super-issue” and individual or sub-issues. The super-issue, or master issue, relates to whether the document is relevant or not to the case, and is used to construct the review set. Documents with scores above the relevance cut-off score for the super-issue are passed on to review, while documents with scores below the cut-off are culled. Within the set of review documents, the individual issue scores are used solely to organize the review set and assign documents to the relevant review teams. For example, if the super-issue is “Navigation,” the individual issues might be North, South, East and West, with separate review teams specializing in each individual issue.
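The two roles of the super-issue and the individual issues can be sketched as follows. The cut-off value and issue names echo the Navigation example above; the document scores are invented for illustration.

```python
# Hypothetical scored documents: a super-issue score plus per-issue scores.
docs = [
    {"id": "a", "super": 80, "issues": {"North": 90, "South": 10}},
    {"id": "b", "super": 15, "issues": {"North": 5, "South": 20}},
    {"id": "c", "super": 60, "issues": {"North": 20, "South": 75}},
]

CUTOFF = 25  # relevance cut-off on the super-issue

# The super-issue constructs the review set: above the cut-off is
# reviewed, below is culled.
review_set = [d for d in docs if d["super"] > CUTOFF]
culled = [d for d in docs if d["super"] <= CUTOFF]

# Within the review set, individual issue scores only organize the
# review: route each document to the team for its top-scoring issue.
assignments = {d["id"]: max(d["issues"], key=d["issues"].get)
               for d in review_set}
print(assignments)  # → {'a': 'North', 'c': 'South'}
```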

(5) Separate control documents from training documents. To ensure the statistical validity and defensibility of the predictive coding process, it is critical that the “control” documents be kept separate from the “training” documents.

The control set comprises a random, representative sample of documents from the collection. The expert tags the control documents as relevant or not. The control set then serves as the gold standard against which the results of the predictive coding system are tested.

(6) Build control before training. The control set serves as an independent yardstick for measuring the performance of the predictive coding system. It is important to create the control set prior to training. This approach is in contrast with earlier methodologies, where the control set was created after training.

The “control-first” approach facilitates use of the control set as an independent tool for monitoring of training. Rather than relying on intuition or arbitrary measures, the control-first strategy equips the user with an objective, concrete measure of the training process, along with a clear indication of when training can be terminated.
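The control set's role as an objective yardstick can be made concrete with standard precision and recall measurements. This is a generic sketch, not any vendor's implementation; the tags, scores and cut-off are hypothetical.

```python
def control_metrics(control_tags, predictions, cutoff):
    """Measure the system against the expert-tagged control set.

    control_tags: dict doc_id -> True if the expert tagged it relevant.
    predictions:  dict doc_id -> system relevance score (0-100).
    """
    tp = fp = fn = 0
    for doc_id, is_relevant in control_tags.items():
        predicted = predictions[doc_id] >= cutoff
        if predicted and is_relevant:
            tp += 1
        elif predicted and not is_relevant:
            fp += 1
        elif not predicted and is_relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical control set and system scores after a round of training.
tags = {"d1": True, "d2": False, "d3": True, "d4": False}
scores = {"d1": 90, "d2": 30, "d3": 10, "d4": 20}
metrics = control_metrics(tags, scores, cutoff=25)
print(metrics)  # → (0.5, 0.5)
```

Re-running this measurement after each training round gives the objective, concrete view of progress described above: when the metrics plateau, training can be terminated.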

(7) Use manual training mode. Automatic training refers to a training technique in which document relevance tags, generated by a team of reviewers in a standard review process, are fed into the predictive coding system. For example, for a collection of one million documents, review may have been completed for 10,000 documents. Automatic training would use the relevance tags from these 10,000 documents to train the system to assess relevance in the remaining documents.

In manual training scenarios, a reviewer is assigned to train the predictive coding system as part of an intensive, dedicated training effort. As opposed to the designations from a standard review, manual training tends to yield better quality input for predictive coding training. This is because the “expert” is very aware of the training process and the significance of each document to the overall outcome. In addition, training input from a single senior reviewer often tends to be more consistent than input from a large review team. Nonetheless, automatic mode can be a useful option in certain situations, such as plaintiff scenarios or internal investigations, where there is no onus to produce documents to opposing counsel. However, for document productions, where defensibility is paramount, the preferred approach is manual training.

(8) Track training consistency. The need to track the consistency of the expert’s training input derives from the garbage-in garbage-out risk discussed earlier. Ideally, the predictive coding application monitors the expert’s input across various dimensions, to verify that input is consistent. For example, if two very similar documents are encountered in training and the expert tags one as relevant and one as not, best practice would dictate that this potential inconsistency be flagged for verification.
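One way such a consistency check could work is to compare the text of oppositely tagged training documents and flag near-duplicates. This sketch uses Python's standard-library string similarity as a stand-in for whatever similarity measure a real system employs; the documents and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def flag_inconsistencies(tagged_docs, similarity_threshold=0.9):
    """Flag pairs of very similar documents given opposite tags.

    tagged_docs: list of (text, is_relevant) pairs from the expert.
    Returns index pairs worth sending back for verification.
    """
    flags = []
    for i in range(len(tagged_docs)):
        for j in range(i + 1, len(tagged_docs)):
            text_i, tag_i = tagged_docs[i]
            text_j, tag_j = tagged_docs[j]
            if tag_i != tag_j:
                sim = SequenceMatcher(None, text_i, text_j).ratio()
                if sim >= similarity_threshold:
                    flags.append((i, j))
    return flags

tagged = [
    ("quarterly pricing memo for the merger", True),
    ("quarterly pricing memo for the mergers", False),  # near-duplicate, opposite tag
    ("cafeteria lunch menu", False),
]
flags_found = flag_inconsistencies(tagged)
print(flags_found)  # → [(0, 1)]
```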


(9) Use graduated relevance scores. Predictive coding systems typically generate, for each document, either a graduated relevance score (for example, on a scale of 0 through 100) or a binary designation (relevant or non-relevant). The graduated relevance score has a number of advantages. Most importantly, the graduated approach enables the user to intelligently control the volume of documents to be passed on to review. This is a key e-discovery business decision, which needs to be based on criteria of reasonableness and proportionality. These criteria vary from case to case, reflecting the strategic and financial significance of the case and the mix of risk and cost that the client is willing to bear.

In addition, the graduated scores enable the implementation of new models in e-discovery, such as prioritized review (starting with the most relevant documents and working back) and stratified review (where high scoring documents are assigned for in-house review, and low scoring documents are assigned for low-cost contract review).
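Both review models fall out naturally from graduated scores, as this sketch shows. The document IDs, scores and cut-off values are illustrative assumptions only.

```python
# Hypothetical documents with graduated relevance scores (0-100).
scored_docs = [("m1", 95), ("m2", 40), ("m3", 72), ("m4", 12), ("m5", 55)]

# Prioritized review: highest-scoring documents first.
prioritized = sorted(scored_docs, key=lambda d: d[1], reverse=True)

# Stratified review: high scorers go to in-house counsel, the rest of
# the review set to low-cost contract review (cut-offs are illustrative).
REVIEW_CUTOFF, IN_HOUSE_CUTOFF = 24, 70
in_house = [d for d, s in scored_docs if s > IN_HOUSE_CUTOFF]
contract = [d for d, s in scored_docs if REVIEW_CUTOFF < s <= IN_HOUSE_CUTOFF]
culled = [d for d, s in scored_docs if s <= REVIEW_CUTOFF]

print(in_house, contract, culled)  # → ['m1', 'm3'] ['m2', 'm5'] ['m4']
```

Adjusting `REVIEW_CUTOFF` is precisely the reasonableness-and-proportionality lever described above: a binary relevant/not-relevant output offers no such dial.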

(10) Validate results. As emphasized in Judge Andrew Peck’s opinion in the Da Silva Moore case, quality assurance is a key component of the predictive coding process. The objective of quality assurance is to provide transparent validation of the results generated by the application. One key test is to verify culling decisions. For example, using the distribution of relevance scores, the user may decide that documents with scores above 24 will be submitted for review, and documents with scores of 24 and below will be culled.

An emerging best practice is to “test the rest” – that is, test documents below the cut-off line to double-check that the cull zone does in fact contain a very low prevalence of relevant documents. The expert will review and tag a representative random sample from the cull zone and, based on the results, confirm or modify the cut-off point accordingly.

In conclusion, it should be noted that predictive coding is a dynamic, rapidly developing arena. The best practices described here will undoubtedly continue to evolve. In documenting them as they have taken shape over the past year or so, the intention is not to define a universal textbook template for the generic predictive coding project, but to provide a platform from which it will be possible to develop, refine and create new and better practices as the e-discovery industry continues to assimilate the game-changing technology of predictive coding.

Warwick Sharp is a co-founder of Equivio, where he leads marketing and business development.
 

4 Comments

  1. Gerard Britton

    May 7, 2013 at 1:51 pm

    Warwick,

    Great to see some pieces on practices in PC. I find clients in particular have to be educated about #3. I’d add that in addition to distinguishing metadata from content, reviewers have to mind “impliedly” content-relevant documents, i.e. documents with innocuous content that the reviewer realizes is relevant by association with facts of which the reviewer is aware. A judgment has to be made whether such a document being coded relevant is worth the noise that it produces in the predicted output.

    One question: under 9, you refer to graduated relevance scores. Are you referring to scores that indicate how likely it is that a document is relevant, or a score that actually estimates level of relevance?

    If the latter, how does the system estimate this without accompanying changes to the coding schema? If not, you may want to re-consider nomenclature.

    Thanks
    Gerard

  2. Warwick Sharp

    May 14, 2013 at 8:44 pm

    Gerard,

    Thanks for your comments and questions.

    Regarding the “impliedly relevant” issue – this is a good point. Our standard approach is to keep things simple. It’s true that apparently innocuous content can be relevant by association, and that it can create some noise – but the alternative is to create guidelines for the “expert” that start to become overly complex. This results in errors. As you rightly note, the potential price is over-inclusiveness, but in e-discovery most practitioners prefer over-inclusiveness to the alternative in this case, which is under-inclusiveness.

    Regarding relevance scores – the scores represent ordinal rankings. While the scores are not probabilities as such, a document with a higher score is more likely to be relevant than a document with a lower score. The ability to generate comparable relevance scores is critical to delivering on the promise of predictive coding – proportionate culling, prioritized review and stratified review. This is obviously not possible with a clustering-based solution where a document’s relevance score represents its “distance” from a seed document (aka epicenter). In a clustering approach, the scores are not comparable because, in the absence of changes to the coding schema (which would require the user to code for “degree of relevance”), the system has no way of knowing the extent of a given seed document’s relevance. This contrasts with other classification technologies which deconstruct the document into its composite attributes, and which calculate the weighted contribution of each attribute to a document’s relevance. As such, using a standard relevant/not-relevant coding schema, this latter approach is able to generate rankable relevance scores because each document’s score is simply the aggregate of its weighted attributes.
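The "aggregate of weighted attributes" idea in the reply above resembles a simple linear scoring model over document features. A minimal sketch, with entirely hypothetical term weights standing in for whatever a trained system would learn:

```python
# Hypothetical learned weights: each term's contribution to relevance.
weights = {"merger": 3.0, "pricing": 2.5, "lunch": -2.0, "memo": 0.5}

def relevance_score(text):
    """Score a document as the sum of its weighted attribute
    contributions. Scores are directly comparable across documents,
    so they can be ranked, even though the training tags were only
    relevant/not-relevant."""
    return sum(weights.get(word, 0.0) for word in text.lower().split())

docs = ["Merger pricing memo", "Lunch memo", "Weekend plans"]
ranked = sorted(docs, key=relevance_score, reverse=True)
print(ranked)  # → ['Merger pricing memo', 'Weekend plans', 'Lunch memo']
```

This is why, as the reply notes, rankable scores emerge from a binary coding schema: the ordering comes from the aggregated weights, not from the expert grading degrees of relevance.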

    I hope this has helped clarify the point. Please don’t hesitate if there are follow-up questions. If you like, we can get on a call and discuss in more detail.

    Regards,

    Warwick

  3. Pingback: Electronic Discovery Best Practices Update | e-Discovery Team ®

  4. Pingback: Predictive Coding minus the hype – webinar on classification technologies from Equivio | eDisclosure Information Project Updates
