A Dataset and an Examination of Identifying Passages for Due Diligence

We present and formalize the due diligence problem, where lawyers extract data from legal documents to assess risk in a potential merger or acquisition, as an information retrieval task. Furthermore, we describe the creation and annotation of a document collection for the due diligence problem that will foster research in this area. This dataset comprises 50 topics over 4,412 documents and ~15 million sentences and is a subset of our own internal training data.

Using this dataset, we present what we have found to be the state of the art for information extraction in the due diligence problem. In particular, we find that when treating documents as sequences of labelled and unlabelled sentences, Conditional Random Fields significantly and substantially outperform other techniques, both sequence-based (Hidden Markov Models) and non-sequence-based (logistic regression). We also analyze what we perceive to be the major failure cases when extraction is performed based upon sentence labels.
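
To make the idea of extraction from sentence labels concrete, the sketch below groups consecutive relevant sentences into passages. It is only an illustration under stated assumptions: the function name `passages_from_labels`, the 0/1 label encoding, and the rule of merging adjacent positive sentences are hypothetical and are not taken from the paper, and the labels would in practice come from a sequence model such as a Conditional Random Field.

```python
from typing import List, Tuple


def passages_from_labels(labels: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive relevant sentences into passages.

    `labels` holds one entry per sentence in document order, with 1 for
    "relevant to the topic" and 0 otherwise (an assumed encoding).
    Returns (start, end) sentence indices, with `end` inclusive.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the current passage began, if any
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new passage opens here
        elif label != 1 and start is not None:
            spans.append((start, i - 1))  # the passage closed at the previous sentence
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))  # passage runs to the end of the document
    return spans


# Hypothetical per-sentence predictions for a six-sentence document.
print(passages_from_labels([0, 1, 1, 0, 0, 1]))  # [(1, 2), (5, 5)]
```

Under this grouping rule, a single sentence label flipped inside a run changes the passage boundaries, which is one reason per-sentence accuracy and passage-level quality can diverge.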

Authors:

Adam Roegiest
Alexander K. Hudek
Anne McNulty

Publication Date:

July 8, 2018

Patent Number:

US9645988B1

Conference:

SIGIR 2018
