Foundations of Machine Learning-Based Contract Review Software

Written by: Noah Waisberg


The past several posts in the Contract Review Software Buyer’s Guide have gone into detail on how manual rule and comparison powered systems actually find contract provisions, and included a case study on a well-funded vendor’s experience with manual rules. Manual rule (aka keyword) and comparison automated contract abstraction systems are appealing to vendors because they are technically easy to build and to augment with new provision models. Adding a new provision can be as simple as typing a few regular expressions or Boolean search strings. The problem is that keyword and comparison systems are unlikely to work well on unfamiliar agreements and poor quality scans. And most large-scale contract review work, whether for legal due diligence or contract management database population, is done on scanned copies of unfamiliar agreements.
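To make the brittleness concrete, here is a minimal sketch of a keyword rule. The regular expression, the sample clauses, and the names `CHANGE_OF_CONTROL_RULE` and `rule_matches` are all hypothetical illustrations, not taken from any vendor’s system. The rule fires on a cleanly worded clause but misses both a poorly scanned copy of the same clause and a clause that expresses the same idea in unfamiliar wording.

```python
import re

# Hypothetical hand-written keyword rule for a "change of control" provision.
CHANGE_OF_CONTROL_RULE = re.compile(r"change\s+(of|in)\s+control", re.IGNORECASE)

def rule_matches(clause: str) -> bool:
    """Return True if the keyword rule fires anywhere in the clause."""
    return CHANGE_OF_CONTROL_RULE.search(clause) is not None

# Made-up sample clauses: one clean, one with simulated OCR errors,
# and one that never uses the expected keywords.
clauses = {
    "clean": "Upon a Change of Control, this Agreement may be terminated.",
    "bad_scan": "Upon a Chanqe of Contro1, this Agreement may be terminated.",
    "unfamiliar": "This Agreement terminates if a third party acquires "
                  "a majority of the voting shares of Licensee.",
}

results = {name: rule_matches(text) for name, text in clauses.items()}
for name, fired in results.items():
    print(f"{name}: rule fired = {fired}")
```

A rule author could keep patching the pattern for each new miss, but every patch covers only the cases someone has already seen.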

Machine learning approaches have helped solve other difficult data-based problems. Here’s a description (by eDiscovery company CEO John Tredennick) of one success:

Have you heard how they finally taught computers to translate? For many years, humans tried to create more and more complicated rules to govern the translation of grammar in different languages. Microsoft and many others struggled with the problem, finding they could only get so far with this largely human-based approach. The resulting translations were sometimes passable but more often comical, using the humans to articulate the rules of the road.

Franz Josef Och, a research scientist at Google, tried a different approach. Rather than try to define language through rules and grammar, he simply tossed a couple billion translations at the computer to see what would happen. The result was a huge leap forward in the accuracy of computerized translation and a model that most other companies (including Microsoft) follow today.

This post gives background on machine learning. The next post will focus on how machine learning algorithms can be used to build automated contract abstraction models. Some readers may be familiar with the eDiscovery space, where machine learning based systems have come to be better regarded than keyword approaches (see, for example, this piece by Southern District of New York Magistrate Judge Andrew Peck on the subject). For them, a third post will describe some distinctions between machine learning systems in eDiscovery and automated contract abstraction.

What Is Machine Learning

Machine learning is an approach in which computer systems themselves learn how to behave, instead of being directly instructed by a programmer. In place of if-then rules and the like, special algorithms (an “algorithm” is basically a procedure a computer follows) get fed data, and build models of how to act from these examples. Machine learning has three core advantages over manual rules:

  1. Rules are brittle and don’t scale. They can be great for straightforward and predictable systems but don’t work well for complex unpredictable systems.
  2. Computers have vast processing power and maintain their focus. While a human building a rule can keep a few examples in mind, a computer learning how to behave can reconcile millions of data points.
  3. Machine learning has generated success where manual rules failed. It actually works. Whether or not you understand the science of how, self-driving cars and computer-written news articles are now becoming real. These problems were not solvable with rules.

Another spot where machine learning makes an impact is spam email filtering. Remember when your email inbox was overrun with spam? Along came machine learning-based spam filtering. Here, machine learning algorithms use examples of real spam to build filtering models that weed new spam out. Do you still get spam? There’s still lots of it being sent (spam was 68% of email traffic in 2012) but it now rarely makes it to our inboxes, even when our email addresses are freely available on the internet. Note that rule-based spam filtering (like rule-based automated contract metadata extraction) exists. It just appears to have been largely abandoned in favor of machine learning-based systems.

Supervised and Unsupervised Machine Learning

Machine learning breaks down into supervised and unsupervised forms (with important hybrids in between). In supervised machine learning, the system learns from explicitly labeled examples. So, for example, to build a model for automated contract provision extraction in a supervised way, we feed our system a bunch of clauses and tell it which are change of control clauses and which are not. From this, it learns a model of what a change of control clause looks like (more on this in the next post).
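As a rough illustration of the supervised idea, here is a toy sketch that uses a naive Bayes text classifier. This is only one of many possible algorithms, chosen here for brevity, and it is not a description of how any particular product works. The six training clauses, the labels, and the function names (`tokenize`, `train`, `predict`) are made up for this example, and a real system would need far more training data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per label from (clause_text, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total_examples = sum(label_counts.values())
    # Words never seen in training carry no evidence, so they are skipped.
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {}
    for label, n_examples in label_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(n_examples / total_examples)
        for token in tokens:
            # Add-one smoothing avoids zero probability for a known word
            # that happens not to appear under this label.
            score += math.log((word_counts[label][token] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled examples: a human tells the system which is which.
COC, OTHER = "change_of_control", "other"
training_data = [
    ("This Agreement may be terminated upon a change of control of the Company.", COC),
    ("Consent is required if a third party acquires a majority of the voting shares.", COC),
    ("A merger or sale of substantially all assets of the Licensee requires prior written consent.", COC),
    ("This Agreement shall be governed by the laws of the State of New York.", OTHER),
    ("The Licensee shall pay all fees within thirty days of the invoice date.", OTHER),
    ("Each party shall keep the confidential information of the other party secret.", OTHER),
]
model = train(training_data)

unseen_clause = "If another entity acquires control of a majority of the shares, the agreement ends."
prediction = predict(model, unseen_clause)
print(prediction)
```

Nobody wrote a rule saying that “acquires” or “majority” signals a change of control clause. The model picked that up from the labeled examples.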

In contrast, an unsupervised machine learning technique would not require us to tell it which clauses are change of control and which are not. It would learn this difference completely on its own. For some, unsupervised machine learning sounds too far-fetched to actually work. However, unsupervised learning has been shown to yield strong results in limited areas (though these techniques are typically used in conjunction with supervised learning). Serious players are putting real effort into further work here.
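For a feel of what “on its own” can mean, here is a deliberately simple sketch that groups unlabeled clauses by word overlap (Jaccard similarity). It is a toy stand-in for real unsupervised techniques. The clauses, the stopword list, the 0.3 threshold, and the function names are all assumptions made up for illustration.

```python
import re

# Small, hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"a", "by", "of", "the", "upon"}

def content_words(text):
    """Return the set of non-stopword words in the text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_clauses(clauses, threshold=0.3):
    """Greedily group clause indexes; no labels are supplied."""
    word_sets = [content_words(c) for c in clauses]
    clusters = []
    for i, words in enumerate(word_sets):
        for cluster in clusters:
            # Join the first group containing a sufficiently similar clause.
            if any(jaccard(words, word_sets[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            # No similar group found, so start a new one.
            clusters.append([i])
    return clusters

unlabeled_clauses = [
    "Termination upon change of control of the Company.",
    "Governed by the laws of New York.",
    "Change of control of the Company requires consent.",
    "Governed by the laws of Delaware.",
]
groups = group_clauses(unlabeled_clauses)
print(groups)
```

The program separates the clauses into two groups without being told what either group means. A person (or a supervised step) would still need to name the groups, which is one reason these techniques are typically paired with supervised learning.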

Machine learning approaches stack up very well compared with humans building manual rule based models. Machines are able to keep far more examples in mind than a person building a keyword rule; where a person could perhaps integrate five examples simultaneously, a computer system can consider tens of millions. More importantly, machine learning works. While commentators write dismissively of rule based search, many of today’s most impressive software systems are based on machine learning technology (e.g., eDiscovery technology-assisted review systems, Netflix’s or Amazon’s recommendation engines, fraud detection, Google Translate, self-driving cars, modern optical character recognition systems, software that writes news articles, voice recognition software).

Is machine learning technology necessary for accurate contract provision extraction on unfamiliar agreements? Sure, it’s needed for voice recognition and autonomously driven vehicles, but is contract provision extraction so hard that manual rules don’t work? Our experience has been that getting software to accurately find contract provisions is a very difficult problem. The next post in the Contract Review Software Buyer’s Guide will go into more detail on how we have used machine learning technology (and a lot of hard work) to solve it.

