AI Contract Review Explained: How Legal-Domain Models Differ from General LLMs
When a legal team evaluates an AI contract review tool, the marketing copy often reads the same across vendors: fast extraction, high accuracy, attorney-grade results. The practical differences only emerge when a tool encounters a non-standard indemnification cap buried in a schedule, a change-of-control clause tucked into Exhibit B, or an ISDA annex with negotiated carve-outs. That is where the distinction between legal-domain models and general-purpose large language models becomes consequential. We have seen this difference play out repeatedly, and it is worth explaining precisely why it exists.
What General-Purpose LLMs Actually Do With Contracts
General-purpose language models are trained on a broad corpus of text from the internet, books, and publicly available data. They are very good at tasks like summarization, question answering, and drafting. When you ask one to extract the governing law clause from a commercial agreement, it will usually succeed -- the governing law clause tends to appear in a predictable location and follows a recognizable template.
The failure modes emerge with ambiguity. Legal drafting is full of provisions that function differently depending on context: a cure period clause means something different in a technology license than in a real property lease. An exclusivity provision may be absolute, conditional, or subject to carve-outs that appear three clauses earlier. A general model lacks the trained framework to resolve those dependencies without explicit prompting, and even with prompting, its output is inconsistent across document batches.
In our testing against a corpus of 400 M&A-related agreements, general-purpose models achieved high precision on standard clause categories like governing law, notice provisions, and counterparts clauses. Precision dropped measurably on conditional obligations, cross-referenced definitions, and provisions where the operative language appeared across multiple sections. The drop was not catastrophic -- but in a deal context, a missed obligation or a mischaracterized condition is not a minor inconvenience. It is a risk item that either propagates into a diligence memo or gets caught by an attorney who then has to re-review the document from scratch.
How Legal-Domain Models Are Trained Differently
A legal-domain model is trained on contract corpora and annotated against attorney review. The annotation step is the critical variable. General models learn language patterns. Legal-domain models learn legal function -- the difference between a representation and a covenant, between a condition precedent and a closing deliverable, between a carve-out and an exception.
The training pipeline for a legal-domain model typically involves three components that general models lack. First, a structured taxonomy of contract provisions matched to the legal significance of each type: an indemnification cap is not just a number, it is a ceiling on a party's liability exposure, and the model needs to surface it in the context of the overall risk profile. Second, annotated examples of non-standard drafting: what does an atypical limitation of liability look like when the drafter has tried to obscure it in a definition section? Third, cross-document reasoning: for M&A diligence, many relevant provisions only become material in the context of other documents in the same deal corpus -- a change-of-control provision in a software license becomes a deal risk only if the target company's revenue is concentrated in that license.
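The cross-document point can be illustrated with a minimal sketch. It assumes each license has already been reduced to a record with a change-of-control flag and an annual revenue figure; the `LicenseRecord` type, the `flag_change_of_control_risks` function, and the 25% concentration threshold are all hypothetical, chosen only to show the shape of the check rather than how any particular model performs it.

```python
from dataclasses import dataclass


@dataclass
class LicenseRecord:
    name: str
    has_change_of_control: bool  # assumed output of clause extraction
    annual_revenue: float        # assumed to come from the target's financial data


def flag_change_of_control_risks(records, concentration_threshold=0.25):
    """Return names of licenses that have a change-of-control clause AND
    account for at least `concentration_threshold` of total revenue.

    The threshold is a hypothetical value, not a market standard.
    """
    total = sum(r.annual_revenue for r in records)
    if total <= 0:
        return []
    return [
        r.name
        for r in records
        if r.has_change_of_control
        and r.annual_revenue / total >= concentration_threshold
    ]


# Hypothetical deal corpus: only License A is both restricted and concentrated.
records = [
    LicenseRecord("License A", True, 600.0),
    LicenseRecord("License B", True, 50.0),
    LicenseRecord("License C", False, 350.0),
]
flagged = flag_change_of_control_risks(records)
print(flagged)
```

The clause in License B is extracted all the same; it simply does not rise to a deal risk under the assumed threshold, which is the distinction the paragraph above draws.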
The result is a model that is narrower in scope but substantially more reliable within that scope. It does not write poems or answer history questions. It extracts clause categories with the trained precision of an attorney who has reviewed 50,000 M&A contracts.
Precision and Recall in Practice: What the Metrics Mean
When vendors cite accuracy figures, they are typically citing precision (the fraction of extracted clauses that are correct) or recall (the fraction of all relevant clauses that were found). Both metrics matter, but they tend to pull in opposite directions: tuning a system to flag more provisions raises recall and usually lowers precision, and the reverse. Which trade-off is acceptable depends on the use case.
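The two definitions can be made concrete with a short sketch. It assumes that both the tool's output and the attorney annotations are represented as sets of (document ID, clause category) pairs; that representation and the `precision_recall` helper are illustrative assumptions, not a description of any vendor's evaluation harness.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for extracted items against attorney-labelled items.

    Both arguments are sets of (document_id, clause_category) pairs.
    """
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical attorney annotations: five relevant provisions.
gold = {
    ("doc1", "governing_law"),
    ("doc1", "indemnification_cap"),
    ("doc2", "change_of_control"),
    ("doc2", "termination_for_convenience"),
    ("doc3", "limitation_of_liability"),
}

# Hypothetical tool output: four items, three of them correct.
extracted = {
    ("doc1", "governing_law"),
    ("doc1", "indemnification_cap"),
    ("doc2", "change_of_control"),
    ("doc3", "exclusivity"),
}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four extracted items are correct (precision 0.75), but only three of the five relevant provisions were found (recall 0.60), so a single "accuracy" headline could describe either number.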
For NDA review, high recall is paramount. A missed confidentiality carve-out or an undetected perpetual obligation is a material risk. For an initial pass on a large diligence corpus, high precision matters more -- you want extracted items to be reliable enough that attorneys do not have to re-read the source document for every flagged provision.
Our internal benchmarks compare extraction outputs against attorney review on the same document set. We target a recall floor of 94% on critical provision categories (indemnification, limitation of liability, change-of-control, termination for convenience) and a precision floor of 91% across all categories. Those floors are not marketing figures -- they are quality gates that trigger retraining when we fall below them on a new document class. The honest acknowledgment is that no extraction system achieves perfect recall on novel contract structures. The question is how the system handles uncertainty: does it surface a low-confidence flag for attorney review, or does it silently omit the provision?
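A quality gate of this kind can be sketched in a few lines. The 94% and 91% floors and the four critical categories come from the paragraph above; everything else is an assumption made for illustration, including the `CategoryMetrics` type, the `gate_failures` function, the category identifiers, and the choice to apply the precision floor per category rather than in aggregate.

```python
from dataclasses import dataclass

# Floors and critical categories as stated in the text.
RECALL_FLOOR_CRITICAL = 0.94
PRECISION_FLOOR_ALL = 0.91
CRITICAL_CATEGORIES = {
    "indemnification",
    "limitation_of_liability",
    "change_of_control",
    "termination_for_convenience",
}


@dataclass
class CategoryMetrics:
    category: str
    precision: float
    recall: float


def gate_failures(metrics):
    """Return human-readable descriptions of every floor that was missed.

    Assumption: the precision floor is checked per category; the recall
    floor is checked only for critical categories.
    """
    failures = []
    for m in metrics:
        if m.precision < PRECISION_FLOOR_ALL:
            failures.append(f"{m.category}: precision {m.precision:.2f} < {PRECISION_FLOOR_ALL:.2f}")
        if m.category in CRITICAL_CATEGORIES and m.recall < RECALL_FLOOR_CRITICAL:
            failures.append(f"{m.category}: recall {m.recall:.2f} < {RECALL_FLOOR_CRITICAL:.2f}")
    return failures


# Hypothetical benchmark results on a new document class.
results = [
    CategoryMetrics("indemnification", precision=0.95, recall=0.92),
    CategoryMetrics("governing_law", precision=0.99, recall=0.90),
    CategoryMetrics("notice", precision=0.88, recall=0.97),
]
failures = gate_failures(results)
for line in failures:
    print(line)
```

In this hypothetical run the gate reports two misses: recall on indemnification and precision on notice provisions. Governing law passes despite its 0.90 recall because it is not on the critical list.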
What Attorneys Should Look for When Evaluating Tools
Three questions worth asking any vendor:
- What is the training corpus? A model trained primarily on publicly available contracts (EDGAR filings, template libraries) will perform differently from one trained on a curated corpus of negotiated agreements. Ask for specificity about document types and deal sizes.
- How does the system handle uncertainty? A well-designed legal AI system distinguishes between high-confidence extraction and low-confidence extraction. The output format should make that distinction visible -- a provision extracted with 96% confidence should look different from one extracted with 74% confidence.
- Can you test it on your actual document types? Vendor benchmarks are conducted on vendor-selected documents. Ask to run a pilot on a representative sample of your deal corpus -- a mix of negotiated agreements, foreign-law governed contracts, and non-standard structures.
The answers to these questions will tell you more about actual performance than any accuracy headline in a product brochure.
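To make the uncertainty question concrete, here is a minimal sketch of confidence-based routing. The `Extraction` type, the `route_by_confidence` function, and the 0.85 review threshold are hypothetical; the 96% and 74% confidence values echo the example in the second question above. The point is only that low-confidence items go to an attorney queue instead of being dropped silently.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; a real system would tune this


@dataclass
class Extraction:
    category: str
    text: str
    confidence: float  # assumed to be a score in [0, 1] reported by the model


def route_by_confidence(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into (accepted, needs_review) without discarding any."""
    accepted, needs_review = [], []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


# Hypothetical extractions mirroring the 96% vs. 74% example.
extractions = [
    Extraction("governing_law", "This Agreement is governed by the laws of ...", 0.96),
    Extraction("limitation_of_liability", "In no event shall ...", 0.74),
]
accepted, needs_review = route_by_confidence(extractions)
print([e.category for e in accepted])
print([e.category for e in needs_review])
```

Every input item ends up in exactly one of the two lists, so nothing is omitted. When piloting a tool, it is worth confirming that its output preserves the same property.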
The Attorney-in-the-Loop Principle
Legal-domain AI is most valuable when it is positioned as a first-pass system, not a final reviewer. The extraction engine surfaces clause categories, flags deviations from market standard, and produces a structured issues list. An attorney reviews the flags, makes judgment calls on materiality and deal context, and signs off on the diligence output.
This division is not a limitation of current AI -- it is the correct workflow. AI can process 2,000 documents in the time it takes an attorney to review 20. But the judgment call about whether a non-standard limitation of liability is a deal-stopper or a negotiating point belongs to a human being who understands the transaction context, the client's risk appetite, and the counterparty's history.
The tools that earn sustained adoption in legal practice are the ones that make attorneys faster and more thorough, not the ones that attempt to replace attorney judgment. That is the design principle behind every decision we make at Clauseflint -- the AI does the extraction and the flagging; counsel makes the call.
If you are evaluating AI contract review tools for your firm or legal department, we are happy to walk through our methodology in detail. Reach out to [email protected] or request access to see the platform against your own documents.