Machine Learning Based Automated Contract Provision Extraction

Written by: Noah Waisberg

10 minute read

Put contract provision examples in, get provision models out.

If machine learning is so powerful, why would anyone build a contract provision extraction system any other way? Machine learning techniques can generate robust, scalable data models; models may include details a human might not be able to notice; and—most importantly—these techniques actually work, underlying many of today’s most impressive software systems (e.g., eDiscovery technology-assisted review systems, Google Translate, self-driving cars, software that writes news articles, voice recognition software). In automated contract abstraction, the principal alternative approach to building provision models is via human-written rules. Manual rules have been seriously criticized by researchers and commentators (one pair of researchers saying: “A full-text retrieval system [aka, manual rules-based search] does not give us something for nothing. Full-text searching is one of those things, as Samuel Johnson put it so succinctly, that ‘. . . is never done well, and one is surprised to see it done at all’”). Yet some automated contract review vendors build their systems using manual rules instead of machine learning technology. There’s a reason why. This post explains how machine learning techniques can be used to build contract provision extraction models. It also gives details on the big minuses of machine learning-based automated contract abstraction.

How Machine Learning Technology is Used to Build Automated Contract Provision Extraction Models

While this Contract Review Software Buyer’s Guide series generally refrains from discussing how any individual vendor does things, we use machine learning techniques to power our system, so I will use how our system works as an example.

We build our automated contract provision extraction models using both supervised and unsupervised machine learning techniques. For simplicity’s sake, this post will be limited to a basic explanation of one way we use supervised learning to build contract provision models. Follow along with Figure 1, which shows a simplified version of how we built our “change of control” provision model.

We start with provisions from real contracts (anywhere from tens to over a thousand). Every provision in our training dataset has been categorized by an experienced lawyer.

We feed these provision examples into our training system, sometimes provide human guidance, and let the system consider the examples. The system takes this data and learns language that is relevant to a given provision concept (and language that is not). On the basis of this classification, the system generates probabilistic provision models. Our training system can do this work autonomously: we feed provisions in and get models out.
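The details of our training system are beyond the scope of this post, but the general pattern of “provisions in, probabilistic model out” is easy to illustrate. Here is a minimal sketch using scikit-learn and a handful of made-up labeled examples; it is not our actual pipeline, features, or algorithm, just a toy version of the idea.

```python
# Minimal sketch of supervised provision-model training (illustrative only;
# not the vendor's actual pipeline). Labeled provision examples in,
# probabilistic provision model out.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical lawyer-labeled training data: 1 = change of control, 0 = other.
texts = [
    "If any material change occurs in the management or control of the Supplier, the Customer may terminate this Agreement.",
    "Any change in control of either party shall require the prior written consent of the other party.",
    "This Agreement shall be governed by the laws of the State of New York.",
    "Either party may assign this Agreement to an affiliate upon written notice.",
]
labels = [1, 1, 0, 0]

def train_provision_model(texts, labels):
    """Fit a probabilistic classifier over simple word and bigram features."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)
    return model

model = train_provision_model(texts, labels)
# The trained model returns a probability that a new, unseen passage is the provision.
print(model.predict_proba(["A change in the control of the Business requires consent."]))
```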

We test completed models on a large pool of annotated agreements the system is unfamiliar with.

Using machine learning techniques to build contract provision models gives several benefits:

Accurate Models, Even on Unfamiliar Documents.

Based on testing, we know our system finds 90% or more of the instances in unfamiliar documents of nearly every substantive provision it covers. This 90% number is our system’s recall; its precision differs from provision to provision but is high enough to give generally manageable results. Our accuracy tests are on a large and diverse pool of real contracts that the system was not trained on. Most contract review—whether for due diligence or contract management purposes—is on unfamiliar documents, so it matters that we test with contracts that are fresh to the system; that is testing for the problem the system is built to solve. All contracts in our testing pool were reviewed by experienced lawyers to identify what the system should find in them, and this manual review was generally further supplemented by electronic review; many agreements were reviewed multiple times.
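For readers unfamiliar with the terms: recall is the share of true provision instances the system finds, and precision is the share of passages the system flags that really are the provision. A small sketch of how these numbers can be computed against a lawyer-annotated, held-out test pool (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical results on agreements the model never saw during training:
# y_true = the experienced lawyers' annotations, y_pred = the system's output.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1]

print("recall:   ", recall_score(y_true, y_pred))     # share of real instances found
print("precision:", precision_score(y_true, y_pred))  # share of flags that are correct
```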

Even Form Agreements can Become Unfamiliar Documents if they are Poor Quality Scans.

Some documents in contract review are of poor enough quality that OCR gives mixed results. Documents with imperfect OCR results become like unfamiliar agreements; even though they may be written off a company form, manual rules tailored to that form could miss them. Would a manual rules-based system (not specifically set up for this wording) pick up a change of control provision written like this?

Mengesnorter iigernent or Control

tf-any-material change occurs in the management or control of the Supplieror_the_Business,save accordance-with-the provisions of this Agreement.

Maybe, maybe not. But our machine learning based system did. And we certainly never trained it on change of control provisions that were worded quite like this!
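To make the contrast concrete, here is a small illustration (not how our system, or any particular rules-based system, actually works): a hypothetical pattern rule keyed to the clean wording misses the garbled clause, while character n-gram features of the kind a learned model can draw on still see substantial overlap with clean change of control language.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

clean = ("If any material change occurs in the management or control of the "
         "Supplier or the Business, save in accordance with the provisions of "
         "this Agreement.")
garbled = ("tf-any-material change occurs in the management or control of the "
           "Supplieror_the_Business,save accordance-with-the provisions of this "
           "Agreement.")

# A hypothetical manual rule written against the clean form of the clause.
rule = re.compile(r"\bchange\b.*\bmanagement or control of the supplier\b", re.I)
print("rule matches clean text:  ", bool(rule.search(clean)))    # True
print("rule matches garbled text:", bool(rule.search(garbled)))  # False: OCR noise breaks it

# Character n-gram features degrade more gracefully under OCR noise, so a
# learned model has a fighting chance of still recognizing the clause.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform([clean, garbled])
print("character n-gram similarity:", round(cosine_similarity(X[0], X[1])[0, 0], 2))
```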

Known Accuracy on Unknown Agreements.

If (i) a contract provision extraction system is meant to work on unfamiliar agreements and (ii) you would like to have an idea how accurate it will be on unfamiliar agreements, it needs to be tested on unfamiliar agreements. As I wrote when describing how we test our contract provision extraction models:

One way to measure accuracy would be to test on the same documents used to build provision models. However, this is analogous to giving students the answers to a test beforehand. You can know that high-scoring students are good at memorizing, but you cannot know if they’ve really learned the material. Computers are particularly good at memorizing, and thus you should never test on the training data to determine the accuracy of a learned model (unless what you are trying to evaluate is whether a system can find already-seen instances, which might be the case for an automated contract review system intended only to work on known documents like company forms).

This requirement to test on “unseen” data is particularly difficult to meet for systems that use manual rules (i.e., human-created rules, such as those built using Boolean search strings). If using a manual rules-based system, the only way to obtain truly unbiased accuracy results is to keep testing documents secret from the people building the rules. This requires a great deal of discipline; it is very tempting to look at where rules are going wrong. When testing a model built with machine learning, on the other hand, it is easier to make sure the computer does not improperly peek at the test questions!

Another potential pitfall can come from testing on a fixed set of testing data. It might be tempting to set aside a portion (e.g., 20%) of the total training data to be used as testing data. Testing on a small and static test set raises the risk of biasing models to perform well on that set; final model accuracies may reflect accuracy on the test set and not reality. To avoid this, the test set should be rotated across the training data. The technical term for this technique is cross-validation (commonly, k-fold cross-validation).

A final thing to beware of is training data diversity. No clever accuracy testing technique can make up for training data that is itself not a good reflection of reality.

Because we test on unseen data, we have a pretty good idea how our system will perform in real life. It is possible, though harder, to test a manual rules-based system on unseen data that is diverse and rotated over time.
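On the machine learning side, the held-out-test-set and cross-validation discipline described above is largely mechanical. A rough sketch of both steps, again using scikit-learn and made-up data rather than our actual tooling:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical lawyer-labeled provision examples (1 = change of control, 0 = other).
texts = [
    "Any change in control of the Supplier requires the Customer's prior written consent.",
    "A change in the management or control of either party is an event of default.",
    "If control of the Company changes, the Licensor may terminate on notice.",
    "Any merger or sale resulting in a change of control shall require consent.",
    "This Agreement is governed by the laws of England and Wales.",
    "All notices must be delivered in writing to the addresses set out above.",
    "Either party may assign this Agreement to an affiliate.",
    "The parties shall keep the terms of this Agreement confidential.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 1. Never score on the training data: hold out a test set the model never sees.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 2. Don't trust one small, fixed test set: rotate it with k-fold cross-validation.
scores = cross_val_score(model, texts, labels, cv=4)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```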

Minuses of Machine Learning Approaches to Automated Contract Abstraction

If machine learning is such a great way to build automated contract abstraction provision models, and rule-based approaches a mediocre one (except perhaps for searching clean scans of agreements written off a form you already have), why do any vendors build provision extraction models using manual rules? Since we use machine learning techniques ourselves, we can only speculate why others don’t. Our best guess is that machine learning-based automated contract abstraction takes a lot of work to work well, and even then it might not. Here are the core problems:

Hard to assemble provisions for training.

Machine learning contract provision extraction models get trained on real contract provisions. Some provisions—essentially, ones that get expressed a lot of different ways (e.g., change of control, exclusivity)—require a lot of data to build accurate provision models. Someone has to gather that data. The someone gathering provision examples really needs to know what they are doing; garbage in, garbage out. A further post in the Contract Review Software Buyer’s Guide will cover this issue in more detail.

Because it can be hard to assemble provision examples, it also takes effort to add new provision models to a machine learning system.* Nearly all of that work tends to be in gathering new provision examples (since our core machine learning technology does most of the rest), and there can be ways to gather provision examples relatively quickly. Nonetheless, it is a challenge. The reward for putting in the extra effort of collecting provision examples, though, is more accurate and robust provision models, which should outperform manual rules-based models on unfamiliar agreements and poor quality scans (and match them on clean copies of agreements the system was trained on).

Hard to get the machine learning technology right.

When we started building the DiligenceEngine automated contract review system (in late 2010/early 2011), a number of technologists we spoke with thought we could use pre-existing machine learning techniques and algorithms to build a well-performing system in about four months. Many different machine learning approaches and algorithms existed, and the academic literature described good test results with limited training data. Five or six months later (software development inevitably runs behind schedule), it turned out that this technology let us build a pretty accurate governing law detector but a middling one for tougher provisions like change of control or assignment. While finding governing law in documents was kind of neat, we knew lawyers—who we were building our system for—would be unlikely to be interested in our system at the accuracy numbers we were notching on critical provisions. So we put our heads down and kept working. About a year and a half in, our chief technologist had a breakthrough and our numbers jumped to acceptable levels on hard provisions. And we kept pushing. With another six months or so of hard work, our numbers pushed up significantly further. Figure 2 shows our progress on change of control and governing law accuracy over time. Now that we have an accurate system, I actually think we were lucky to get it to work as well as it does in as little time as it did; it could have been far worse. Perhaps we were just unskilled or lazy at this work. Maybe, but (1) our chief technologist (who leads our machine learning efforts) has a Ph.D. in computer science from a top program, and (2) speaking as someone who was in a Biglaw corporate department during the boom years of 2006–07, I can say we worked hard at this.

One implication of the difficulty of building machine learning systems is that—while machine learning can be used to create accurate and robust contract provision extraction models—it does not necessarily follow that all machine learning-based automated contract abstraction systems are equal. Our machine learning based system is far more accurate today than it was a year ago (when we were still the only vendor in the space to advertise our accuracy).

*At least we think it’s harder. We do not add new substantive provision models to our system unless we think they are quite accurate. As described in the Contract Review Buyer’s Guide post on manual rules-based automated contract abstraction, it is hard for manual rules system vendors to know the accuracy of their provision models. It is possible that it would not take us a large number of examples to build provision models with accuracy equivalent to those used in manual rules systems.

