How We Benchmark Clause Extraction Accuracy Against Attorney Review

Inside our internal accuracy evaluation methodology — how we define extraction p

Every contract AI vendor publishes accuracy numbers. The methodologies behind those numbers are rarely disclosed. That matters, because accuracy in clause extraction is not a single dimension -- it depends on what you are extracting, from what document types, measured against what ground truth, and using what definition of "correct." This article explains how we define and measure accuracy at Clauseflint, what our current benchmarks are across the provision categories we cover, and where our numbers are still improving.

Defining the Problem: What "Accurate Extraction" Actually Means

Clause extraction accuracy has two distinct components borrowed from information retrieval: precision and recall. Precision is the fraction of extracted items that are correct -- if the system returns 100 extracted indemnification cap provisions and 91 of them are accurate, precision is 91%. Recall is the fraction of all relevant items in the document set that were found -- if there are 50 indemnification cap provisions in the corpus and the system finds 47 of them, recall is 94%.
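The two definitions above can be checked with a few lines of arithmetic. This is a minimal sketch using the counts from the examples in the preceding paragraph; the function names are illustrative and are not part of any Clauseflint API.

```python
def precision(correct_extracted: int, total_extracted: int) -> float:
    """Fraction of extracted items that are correct."""
    if total_extracted == 0:
        return 0.0  # nothing extracted; reported as 0.0 here by convention
    return correct_extracted / total_extracted


def recall(found_relevant: int, total_relevant: int) -> float:
    """Fraction of all relevant items in the corpus that were found."""
    if total_relevant == 0:
        return 0.0  # nothing to find; reported as 0.0 here by convention
    return found_relevant / total_relevant


# 91 of 100 extracted indemnification caps are accurate.
p = precision(91, 100)
# 47 of the 50 indemnification caps in the corpus were found.
r = recall(47, 50)
print(f"precision={p:.0%} recall={r:.0%}")  # prints "precision=91% recall=94%"
```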

These metrics trade off against each other in ways that matter for legal use cases. A system optimized purely for precision will decline to extract any provision it is uncertain about, achieving high precision at the cost of recall -- it will miss real provisions. A system optimized purely for recall will extract aggressively, including low-confidence items, driving up attorney review burden. The right operating point depends on the use case: for M&A diligence where missing a critical provision is a material risk, we prioritize recall on high-risk clause categories. For initial corpus triage where attorney time is the constraint, precision matters more.
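One way to see the trade-off is to sweep a confidence cutoff over a scored set of candidate extractions. The sketch below assumes each candidate carries a confidence score and a correct/incorrect label from review; the data, the function name, and the cutoffs are all made up for illustration and do not describe our production settings.

```python
from typing import List, Tuple


def precision_recall_at(
    candidates: List[Tuple[float, bool]],
    total_relevant: int,
    threshold: float,
) -> Tuple[float, float]:
    """Keep candidates at or above the threshold, then score what was kept."""
    kept = [is_correct for confidence, is_correct in candidates if confidence >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical candidates: (confidence, was the extraction correct?)
candidates = [
    (0.98, True), (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.55, False), (0.40, True), (0.30, False),
]
total_relevant = 5  # real provisions present in this toy corpus

for threshold in (0.9, 0.5, 0.0):
    p, r = precision_recall_at(candidates, total_relevant, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

In this toy data the strict cutoff keeps only correct items but finds three of five provisions, while the permissive cutoff finds all five at the cost of three incorrect items for an attorney to discard.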

We add a third dimension that most accuracy discussions omit: completeness of extraction. An indemnification clause may be correctly identified at the section level but incompletely extracted -- the cap amount is captured, but the exclusion for gross negligence that appears in a subordinate clause is missed. Partial extraction is not the same as missing extraction, but it is also not correct. Our evaluation protocol counts partial extractions separately from full misses and from correct complete extractions.
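The three-way count can be sketched as follows. Two assumptions are made here that the text does not spell out: a partial extraction still counts as "found" for recall, and the complete extraction rate is the share of ground-truth provisions extracted in full. The names are illustrative.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    COMPLETE = "complete"  # provision found and all qualifying language captured
    PARTIAL = "partial"    # provision found, but some qualifying language missed
    MISSED = "missed"      # provision not found at all


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Tally one outcome per ground-truth provision into separate buckets."""
    counts = Counter(outcomes)
    total = len(outcomes)
    found = counts[Outcome.COMPLETE] + counts[Outcome.PARTIAL]
    return {
        # Assumption: partial extractions count toward recall.
        "recall": found / total if total else 0.0,
        # Assumption: denominator is all ground-truth provisions.
        "complete_rate": counts[Outcome.COMPLETE] / total if total else 0.0,
        "partial": counts[Outcome.PARTIAL],
        "missed": counts[Outcome.MISSED],
    }


# Hypothetical run: 8 complete, 1 partial, 1 missed.
outcomes = [Outcome.COMPLETE] * 8 + [Outcome.PARTIAL, Outcome.MISSED]
print(summarize(outcomes))
```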

The Ground Truth Problem

Any accuracy benchmark is only as meaningful as its ground truth. We use attorney-annotated ground truth across our evaluation corpus, which creates its own complication: attorney review is not perfectly consistent. When we had the same 80-document set reviewed independently by two senior M&A attorneys, their agreement on all material provision identifications was 91% -- not 100%. That 9% gap represents the inherent ambiguity in legal interpretation, not error in either reviewer.
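An agreement figure like this can be computed as simple percent agreement over paired annotations. The sketch below assumes both reviewers labeled the same ordered list of provision slots; the labels are invented for illustration and the snippet does not reproduce our 80-document study.

```python
from typing import Sequence


def percent_agreement(reviewer_a: Sequence[str], reviewer_b: Sequence[str]) -> float:
    """Share of items on which two reviewers assigned the same label."""
    if len(reviewer_a) != len(reviewer_b):
        raise ValueError("both reviewers must annotate the same items")
    if not reviewer_a:
        return 0.0
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a)


# Hypothetical labels for ten provision slots; the reviewers differ on one.
reviewer_a = ["cap", "cap", "none", "carve-out", "cap", "none", "cap", "none", "cap", "cap"]
reviewer_b = ["cap", "cap", "none", "carve-out", "cap", "none", "cap", "cap", "cap", "cap"]
print(f"agreement={percent_agreement(reviewer_a, reviewer_b):.0%}")
```

Percent agreement does not correct for agreement that would occur by chance; a chance-corrected statistic such as Cohen's kappa is a common complement.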

Our evaluation protocol handles this by using a consensus annotation: for provisions where attorney reviewers disagree, we convene a third reviewer and adopt the majority position. Items that remain contested after three reviews are flagged as "ambiguous" and excluded from the precision/recall calculation, though we track them separately for model improvement purposes. Approximately 4% to 6% of the provisions in our evaluation corpus fall into the ambiguous category, concentrated in complex cross-referenced provisions and foreign-law governed agreements.
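The consensus rule can be sketched as a small majority vote. This assumes each reviewer assigns one label per provision and treats a three-way split as "still contested"; the function name, labels, and that interpretation of "contested" are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Optional

AMBIGUOUS = "ambiguous"


def consensus(first: str, second: str, third: Optional[str] = None) -> str:
    """Return the agreed label, the majority label, or AMBIGUOUS."""
    if first == second:
        return first  # no third review needed
    if third is None:
        raise ValueError("reviewers disagree; a third review is required")
    label, votes = Counter([first, second, third]).most_common(1)[0]
    return label if votes >= 2 else AMBIGUOUS


def scorable(ground_truth: Dict[str, str]) -> Dict[str, str]:
    """Drop ambiguous items so they stay out of precision/recall."""
    return {pid: label for pid, label in ground_truth.items() if label != AMBIGUOUS}


# Hypothetical provision IDs and labels.
ground_truth = {
    "doc1-s8.2": consensus("cap", "cap"),
    "doc1-s9.1": consensus("cap", "none", "none"),
    "doc2-s4.3": consensus("cap", "none", "carve-out"),
}
print(ground_truth)
print(scorable(ground_truth))
```

Ambiguous items are filtered out of scoring but remain in `ground_truth`, which is one way to keep tracking them separately.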

We also distinguish between document types in our benchmarks. An accuracy number measured against standard ABA-model agreement templates is not the same as an accuracy number measured against bespoke negotiated agreements or foreign-language documents. Our primary evaluation corpus covers US-governed commercial agreements; we maintain separate benchmarks for UK-law agreements, EU-law agreements, and highly negotiated bespoke structures. The numbers differ, and we report them separately rather than aggregating them into a single headline figure.
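Reporting by segment amounts to keeping separate tallies per document type instead of pooling them. A minimal sketch, assuming per-document counts of true positives, false positives, and false negatives are already available; the segment names and counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (segment, true positives, false positives, false negatives)
Record = Tuple[str, int, int, int]


def per_segment(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Return (precision, recall) for each segment, never pooled across segments."""
    totals = defaultdict(lambda: [0, 0, 0])
    for segment, tp, fp, fn in records:
        totals[segment][0] += tp
        totals[segment][1] += fp
        totals[segment][2] += fn
    results = {}
    for segment, (tp, fp, fn) in totals.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[segment] = (precision, recall)
    return results


# Hypothetical counts for two segments.
records = [
    ("US-governed", 90, 10, 10),
    ("US-governed", 90, 10, 10),
    ("UK-law", 40, 10, 10),
]
for segment, (p, r) in per_segment(records).items():
    print(f"{segment}: precision={p:.0%} recall={r:.0%}")
```

Pooling these toy counts would give a single figure dominated by the larger segment, which is the effect that separate reporting avoids.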

Current Benchmark Results: Provision-Level Breakdown

The following figures reflect our most recent evaluation pass on a corpus of 1,400 annotated agreements, conducted in Q1 2025. The corpus spans SaaS customer agreements, M&A acquisition documents (purchase agreements, ancillary agreements, disclosure schedules), IP licenses, real property leases, and employment agreements.

Provision Category                   | Precision | Recall | Complete Extraction Rate
-------------------------------------|-----------|--------|-------------------------
Governing Law                        | 98.2%     | 97.6%  | 96.8%
Limitation of Liability Cap          | 94.1%     | 93.7%  | 88.4%
Indemnification Obligations          | 91.8%     | 94.2%  | 84.7%
Change-of-Control Provisions         | 92.3%     | 91.5%  | 87.1%
Termination-for-Convenience Rights   | 95.6%     | 95.1%  | 93.2%
IP Ownership and Assignment          | 89.4%     | 87.9%  | 81.3%
Non-Compete / Non-Solicitation       | 93.7%     | 90.6%  | 86.9%
Representations and Warranties (M&A) | 88.2%     | 92.4%  | 79.8%

The lower complete extraction rates on complex categories like IP ownership and reps and warranties reflect provisions that span multiple sections with cross-references and condition clauses. Our system identifies the provision but may not capture all qualifying language in a single extraction unit. These extractions receive a confidence score below our 90% threshold and are flagged for attorney review.
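The review flag reduces to a comparison against the 90% threshold. This is a minimal sketch; the `Extraction` record and its field names are hypothetical, and only the threshold value comes from the text.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.90  # the 90% confidence threshold mentioned above


@dataclass
class Extraction:
    # Hypothetical record shape, for illustration only.
    category: str
    text: str
    confidence: float


def needs_attorney_review(extraction: Extraction) -> bool:
    """Flag any extraction whose confidence falls below the threshold."""
    return extraction.confidence < REVIEW_THRESHOLD


extractions = [
    Extraction("Governing Law", "This Agreement is governed by...", 0.97),
    Extraction("IP Ownership and Assignment", "Subject to Section 7.3...", 0.74),
]
flagged = [e.category for e in extractions if needs_attorney_review(e)]
print(flagged)  # prints ['IP Ownership and Assignment']
```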

What Counts as an Error: Our Classification Scheme

Not all extraction errors have the same consequences. We classify extraction errors into four severity tiers, which determine how they are handled in the output and how they factor into model retraining priority.

Critical errors are cases where the system extracts a provision with the wrong material content -- for example, extracting an indemnification cap of $5 million when the actual cap is $500,000, or identifying a termination right as unilateral when it requires mutual consent. These errors carry the highest retraining weight and trigger a quality alert in the output.

Structural errors occur when the system correctly identifies a provision type but extracts the wrong section as the operative text -- for example, extracting the definition of "Indemnified Party" rather than the indemnification obligation itself. The provision type is correct; the extracted text is not. These are significant errors that would produce incorrect summaries.

Completeness errors occur when the extraction is correct as far as it goes but omits qualifying language that changes the legal interpretation. These are the hardest to detect and carry the most risk of propagating silently into a diligence memo.

Boundary errors occur when the extraction captures the correct provision but includes too much or too little surrounding text, without affecting the substantive content. These are the lowest-severity category and typically do not require re-extraction.
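The four tiers can be represented as an ordered enumeration so that errors sort by retraining priority. The ordering below follows the order of presentation above, with critical highest and boundary lowest; the relative rank of structural versus completeness errors is an assumption, as are the names and the example records.

```python
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    BOUNDARY = 1      # too much or too little surrounding text, substance intact
    COMPLETENESS = 2  # correct as far as it goes, qualifying language omitted
    STRUCTURAL = 3    # right provision type, wrong operative text
    CRITICAL = 4      # wrong material content


def triggers_quality_alert(severity: Severity) -> bool:
    """Only critical errors raise a quality alert in the output."""
    return severity is Severity.CRITICAL


def by_retraining_priority(errors: List[Tuple[str, Severity]]) -> List[Tuple[str, Severity]]:
    """Order errors from highest to lowest severity."""
    return sorted(errors, key=lambda item: item[1], reverse=True)


# Hypothetical error log: (extraction ID, severity)
errors = [
    ("ext-101", Severity.BOUNDARY),
    ("ext-102", Severity.CRITICAL),
    ("ext-103", Severity.STRUCTURAL),
]
for extraction_id, severity in by_retraining_priority(errors):
    print(extraction_id, severity.name, triggers_quality_alert(severity))
```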

Where We Are Still Improving

Our weakest areas are provisions that are deliberately fragmented across a document -- drafters who want to obscure a limitation of liability, for example, will sometimes spread the operative language across a definition, a main clause, and an exhibit. Cross-document provisions are also challenging: a representation in an asset purchase agreement that is qualified by a disclosure schedule requires reading both documents in conjunction to evaluate correctly.

We are also below our target performance on certain foreign-law governed documents, particularly Japanese-law and German-law agreements, where structural conventions differ significantly from common law drafting practice. Our current coverage for those document types is limited and we are building out the training corpus accordingly.

We publish these numbers because we believe transparency about methodology is what separates a credible accuracy claim from a marketing headline. If you are conducting an evaluation and want to test our extraction against a representative sample of your deal corpus, we will run the comparison and give you the actual precision and recall numbers measured on your document types rather than on our benchmark corpus. Contact us at [email protected] to set up a pilot evaluation.