
Precision and Recall in Contract Review: What the Numbers Actually Mean

By Margaret Sullivan — November 10, 2025 — 10 min read


When in-house legal teams and deal counsel evaluate contract review tools, the first question is often about accuracy. How accurate is it? The problem is that "accuracy" as a single number tells you almost nothing useful about a tool's performance in legal review contexts. Precision and recall — terms borrowed from information retrieval research — tell a much more honest story about what you're actually risking when you use a tool to review contracts.

This article explains how precision and recall apply to contract review, why the cost asymmetry between false negatives and false positives makes recall the dominant concern, how those metrics relate to senior associate baseline performance, and where the trade-off becomes difficult.

Precision and Recall Defined for Legal Contexts

In information retrieval, precision is the fraction of flagged items that are actually relevant. Recall is the fraction of all relevant items that were actually flagged. In contract review, the "items" are clause instances — a specific change-of-control provision in a specific agreement, a non-standard indemnity cap, an exclusivity term that restricts post-acquisition go-to-market flexibility.

High precision means that when the review process flags a clause, it's almost always worth an attorney's attention. Low precision means the flagged list is noisy — many false positives, many clauses that on reading turn out to be standard boilerplate with no material deviation from market norms. The cost of low precision is review time: attorneys spend time reading and dismissing irrelevant flags instead of focusing on genuinely material provisions.

High recall means that the review process surfaces most of the relevant clauses in the document set. Low recall means relevant clauses are being missed — false negatives, in the statistical vocabulary. The cost of low recall is exposure: material provisions that weren't flagged don't receive attorney scrutiny, and the risk they represent doesn't get assessed before signing.
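As a minimal sketch of the two definitions above, the following computes precision and recall from a set of flagged clause instances and a set of clauses that are actually material. The clause identifiers and the function name `precision_recall` are hypothetical, chosen only for illustration.

```python
def precision_recall(flagged: set[str], material: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged clause IDs."""
    true_positives = len(flagged & material)
    # Precision: of everything flagged, how much was actually material?
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of everything material, how much was actually flagged?
    recall = true_positives / len(material) if material else 0.0
    return precision, recall


# Hypothetical data room: five material clauses exist, six flags were raised.
material = {"msa-12.3", "msa-14.1", "sched-b-4", "addendum-2-7", "order-9-3"}
flagged = {"msa-12.3", "msa-14.1", "sched-b-4", "msa-2.1", "msa-8.4", "nda-5"}

precision, recall = precision_recall(flagged, material)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.60
```

In this toy example, half of the flags are noise (the precision cost), and two material clauses in an addendum and an order form were never surfaced (the recall cost).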

The difference in cost between a false positive and a false negative in legal review is not symmetric. A false positive costs an hour of attorney time, maybe less. A false negative costs whatever the clause was worth — which in M&A diligence can run to millions in unconsented change-of-control triggers, post-close liability, or renegotiation pressure applied by a counterparty who knows you missed their clause.
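The asymmetry can be made concrete with a rough cost model. Every figure below is an assumption for illustration only: the hourly rate and the exposure per missed clause are not taken from any real matter, and the one-hour figure is simply the upper bound mentioned above.

```python
HOURLY_RATE = 600        # assumed attorney cost per hour (illustrative)
FP_HOURS = 1.0           # about an hour to read and dismiss a false positive
FN_COST = 2_000_000      # assumed exposure per missed material clause (illustrative)


def error_cost(false_positives: int, false_negatives: int) -> float:
    """Total cost of review errors under the assumed unit costs."""
    return false_positives * FP_HOURS * HOURLY_RATE + false_negatives * FN_COST


# A noisy, recall-oriented review: many false flags, nothing missed.
noisy = error_cost(false_positives=200, false_negatives=0)
# A quiet, precision-oriented review: few false flags, one material miss.
quiet = error_cost(false_positives=10, false_negatives=1)

# Number of false positives that cost as much as a single miss.
break_even = FN_COST / (FP_HOURS * HOURLY_RATE)

print(f"noisy review: ${noisy:,.0f}")   # noisy review: $120,000
print(f"quiet review: ${quiet:,.0f}")   # quiet review: $2,006,000
print(f"one miss costs as much as {break_even:,.0f} false flags")
```

Under these assumed numbers, a review would need to generate thousands of spurious flags before the noise cost as much as one missed clause. Different assumptions shift the break-even point, but not the direction of the asymmetry.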

Why Accuracy Is the Wrong Metric

Accuracy, defined as total correct classifications divided by total items, produces a high number almost automatically in contract review because the classes are heavily imbalanced. In a typical M&A diligence package, the vast majority of clause-level review tasks will be "this section contains no material deviation from standard language." A system that classifies everything as non-material will achieve high accuracy simply because most sections are non-material. It will also miss every clause that actually is material, achieving 0% recall while showing 90%+ accuracy on the aggregate metric.
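A short sketch of that failure mode, assuming a hypothetical document set in which 50 of 1,000 sections are material (the 5% rate is an assumption, not a measured figure):

```python
# Hypothetical ground truth: 50 material sections out of 1,000.
labels = [True] * 50 + [False] * 950

# A degenerate "classifier" that marks every section as non-material.
predictions = [False] * len(labels)

correct = sum(pred == truth for pred, truth in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(pred and truth for pred, truth in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")  # accuracy=95% recall=0%
```

The headline accuracy is 95% even though not a single material clause reached an attorney.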

This is not a theoretical concern. Document classification tools in legal technology often report accuracy numbers in press materials and product one-pagers. Those numbers tell you how often the tool agreed with the reference classification across all document sections. They don't tell you what happened to the material clauses specifically — which is the only question that matters.

The relevant question is: of all the material change-of-control clauses, non-standard indemnity caps, and problematic assignment restrictions in this document set, what fraction were surfaced for attorney review? That's a recall question. The secondary question is: of everything that was surfaced, how much of it was noise that required attorney time to dismiss? That's a precision question. Neither is captured by a single accuracy number.

Senior Associate Baseline Performance

What does a strong human reviewer actually achieve in a time-constrained diligence context? This is difficult to measure precisely because deal teams don't systematically track clause-level miss rates — there's no post-close audit of "clauses present in the data room that weren't flagged during diligence" except in cases where the missed clause becomes a material post-close issue.

What we can say with reasonable confidence from practitioner experience: a senior associate reviewing a 300-document data room over a 21-day diligence period, with a specific focus on change-of-control, indemnity, and assignment provisions, will reliably flag the high-visibility instances of those provisions in the documents they review carefully. The challenge is the documents at the margins: the schedule attachments, the addenda signed years after the master agreement, the service orders that modify base agreement terms, the consent letters that extend certain rights. These documents are present in the data room but often don't get read on the initial pass through the folder structure.

In a scenario involving a growing medical device distribution company (roughly 180 employees, being acquired by a strategic buyer in the surgical equipment sector), a post-close review of the data room identified three change-of-control provisions in schedule attachments to master supply agreements that had not been flagged during the compressed four-week diligence period. Two were notification-only provisions with no practical consequence. The third was a consent requirement that gave a critical supplier the right to renegotiate pricing on change of control. The renegotiation happened; the acquirer ultimately agreed to a pricing adjustment. Whether that outcome was avoidable with more thorough initial review is a counterfactual, but the pattern is a well-documented diligence gap: material clauses in schedule attachments do not receive the same scrutiny as provisions in master agreement bodies.

The Precision-Recall Trade-Off in Practice

Raising recall typically comes at the cost of precision. A review process calibrated to surface every possible instance of change-of-control language will include many instances of standard triggering definitions with no practical risk — flagging them for attorney review adds noise without adding value. A review process calibrated tightly to flag only clear, non-standard provisions will have higher precision but lower recall — some material but less-obvious provisions will fall below the flagging threshold.
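The trade-off can be sketched by sweeping a flagging threshold over scored clauses. This assumes the review process assigns each clause a numeric score, which is an assumption about how a tool might work rather than a description of any particular product; the scores and labels are hypothetical.

```python
# (score, is_material) pairs for ten hypothetical clause instances.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True), (0.60, False),
    (0.55, True), (0.40, False), (0.35, False), (0.30, True), (0.10, False),
]


def precision_recall_at(threshold: float) -> tuple[float, float]:
    """Precision and recall when every clause scoring >= threshold is flagged."""
    flagged = [is_material for score, is_material in scored if score >= threshold]
    true_positives = sum(flagged)
    total_material = sum(is_material for _, is_material in scored)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_material
    return precision, recall


for threshold in (0.85, 0.50, 0.20):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.85 precision=1.00 recall=0.40
# threshold=0.50 precision=0.67 recall=0.80
# threshold=0.20 precision=0.56 recall=1.00
```

Lowering the threshold surfaces the less obvious material clauses, and the price is a growing share of flags that an attorney has to read and dismiss.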

The right calibration depends on the review context. In M&A diligence on a large data room with a concentrated set of material contracts, the priority is recall — missing a material consent requirement is a worse outcome than spending attorney time dismissing a flag. In ongoing commercial contract review where an in-house team is screening a high volume of routine vendor agreements, precision matters more — a tool that flags 40% of inbound contracts for attorney review, mostly on boilerplate provisions, will quickly be abandoned in favor of no tool at all.

We're not saying that high recall should always be the objective, regardless of context. We're saying that any evaluation of a contract review tool — whether a vendor-provided system or an internal process — should specify what recall target is appropriate for the use case before assessing whether the tool meets it. An undifferentiated "accuracy" number can't answer that question.

F1 Score and Why It Misleads in Legal Contexts

F1 score is the harmonic mean of precision and recall — it produces a single number that weights both metrics equally. In information retrieval applications where precision and recall have symmetric costs, F1 is a useful summary. In legal contract review, where the cost of a false negative (missed clause) is typically much higher than the cost of a false positive (false alarm), F1 systematically underweights the recall dimension that matters most.

A tool with 85% precision and 70% recall would achieve an F1 score of approximately 0.77. A tool with 70% precision and 85% recall would achieve the same F1 score. But for most M&A diligence applications, the second tool is substantially preferable: a 15-percentage-point improvement in recall, at the cost of 15 percentage points of precision, means materially better coverage of the clause landscape in exchange for more attorney time spent dismissing false flags.

The point is not that F1 is a bad metric in general — it's that applying it without adjusting for cost asymmetry produces evaluation criteria that don't reflect how legal teams actually bear risk. Legal tech vendors who report F1 scores in product materials are answering the wrong question.
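One standard way to adjust for the asymmetry is the F-beta score, which generalizes F1 by treating recall as beta times as important as precision. It is offered here as an illustration, not as the method any particular vendor uses, and the choice of beta = 2 is an assumption. The sketch below checks the two tools from the example above.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


tool_a = (0.85, 0.70)  # higher precision, lower recall
tool_b = (0.70, 0.85)  # lower precision, higher recall

f1_a, f1_b = f_beta(*tool_a), f_beta(*tool_b)
f2_a, f2_b = f_beta(*tool_a, beta=2.0), f_beta(*tool_b, beta=2.0)

print(f"F1: tool A={f1_a:.2f}, tool B={f1_b:.2f}")  # F1: tool A=0.77, tool B=0.77
print(f"F2: tool A={f2_a:.2f}, tool B={f2_b:.2f}")  # F2: tool A=0.73, tool B=0.82
```

F1 cannot tell the two tools apart, while a recall-weighted score separates them in the direction a diligence team would expect. The right beta depends on how lopsided the costs are for the use case.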

What to Ask When Evaluating a Review Tool

Rather than asking "what's your accuracy?" the more useful questions are: what is the recall rate for change-of-control clauses specifically — across both master agreement bodies and schedule attachments? What is the recall rate for non-standard indemnity provisions — where "non-standard" is defined against the counterparty's base template, not against a generic market baseline? How does the tool handle clauses embedded in cross-reference definitions rather than numbered clause headings?

These questions probe the specific failure modes that matter in diligence contexts: schedule attachments, definitional embedding, cross-reference structures. A tool that achieves 94% recall on clearly headed change-of-control sections but 62% recall on change-of-control language embedded in definitions or schedules has a recall profile that looks good in aggregate but fails in exactly the edge cases where material clauses actually hide.
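To see how an aggregate figure can hide the weak stratum, the sketch below breaks recall out by clause location. The 94% and 62% rates come from the example above; the clause counts behind them are hypothetical, picked only so the arithmetic is easy to follow.

```python
# location -> (material clauses present, material clauses flagged)
# Counts are hypothetical and chosen to reproduce the 94% and 62% rates.
strata = {
    "headed sections": (200, 188),
    "definitions and schedules": (50, 31),
}

recall_by_stratum = {
    name: found / present for name, (present, found) in strata.items()
}

total_present = sum(present for present, _ in strata.values())
total_found = sum(found for _, found in strata.values())
aggregate_recall = total_found / total_present

for name, value in recall_by_stratum.items():
    print(f"{name}: recall={value:.0%}")
print(f"aggregate: recall={aggregate_recall:.1%}")
# headed sections: recall=94%
# definitions and schedules: recall=62%
# aggregate: recall=87.6%
```

With these assumed counts, the aggregate sits near 88% because headed sections dominate the sample, yet 19 of the 50 embedded clauses were missed. Asking for the per-location breakdown is what exposes that gap.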

Clauseflint's internal benchmarks are measured on a per-clause-type basis, including schedule attachments and definitional embedding, because those are the measurement criteria that matter. We're not publishing specific benchmark numbers in this article — that level of disclosure requires a controlled benchmark methodology and a defined reference corpus, and we'd rather show you the extraction output on your own documents than ask you to accept our self-reported numbers. But we'd ask the same questions of any other tool, and we think you should too.

Implications for Diligence Process Design

If recall is the dominant concern in M&A diligence, the process design implication is clear: the review methodology should explicitly include schedule attachments, addenda, order forms, and consent letters as clause extraction targets — not just master agreement bodies. Many standard diligence checklists treat "Material Contracts" as a single category and assign review to the lead associate who reads the master agreement. The schedule attachments may be reviewed by a more junior team member under time pressure, with lower systematic coverage of the clause types that matter.

Clauseflint processes all documents in a defined VDR folder or upload set with the same clause extraction methodology: master agreements and schedule attachments receive the same pass. That's a recall-oriented design choice. The trade-off is that some flags will land on standard boilerplate; the attorney's job is to dismiss them efficiently. In our experience working with deal teams, that trade-off is consistently the right one. The post-close conversation you want to avoid is the one that starts with "we missed a clause in Schedule B."