Legal AI Evaluation Metrics: Accuracy, Recall & Risk

Legal AI evaluation metrics are practical checks that show whether AI‑assisted work is usable and defensible — whether it's supported by the right sources, whether it missed something material, and whether it introduced new risk.

Most legal AI problems are not really "AI problems". They are supervision problems.

When a partner signs off a clause, they are not signing off tone or style. They are signing off the footing underneath it: the position, the authority, the commercial intent, and the risk that sits behind the words. That is why "it sounds right" is never a serious standard in legal work.

If you only remember one thing, make it this:

Accuracy is "would I accept this without rebuilding it."

Recall is "did it miss something I would normally check."

Risk is "did it invent, drift jurisdictions, or rely on the wrong footing."

A simple scorecard you can hold in your head

If you find yourself rewriting most suggestions, accuracy is low even if the draft reads smoothly. A polished paragraph that doesn't hold up under review is still a time cost.

If issues keep appearing late (during escalation, negotiation, or final sign‑off), recall is low. In legal work, missed issues are rarely cheap; they surface when it's hardest to fix them.

If you can't clearly show what a change is based on, risk is high. If the footing can't be verified, you can't supervise it — and unsupervised legal AI is risk, not efficiency.
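
To make the scorecard concrete without building infrastructure, you can log each reviewed suggestion and tally the three signals. Here is a minimal Python sketch over a hypothetical review log; the field names (accepted_without_rebuild, footing_verified, and the issue checklists) are illustrative, not taken from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewEntry:
    """One AI suggestion as logged during review (illustrative schema)."""
    accepted_without_rebuild: bool  # accepted as-is or with light edits
    footing_verified: bool          # did the cited basis support the change?
    issues_surfaced: set[str] = field(default_factory=set)
    issues_expected: set[str] = field(default_factory=set)

def scorecard(entries: list[ReviewEntry]) -> dict[str, float]:
    """Tally the three signals over a batch of reviewed suggestions."""
    if not entries:
        raise ValueError("no review entries logged")
    n = len(entries)
    expected = sum(len(e.issues_expected) for e in entries)
    found = sum(len(e.issues_surfaced & e.issues_expected) for e in entries)
    return {
        "accuracy": sum(e.accepted_without_rebuild for e in entries) / n,
        "recall": found / expected if expected else 1.0,
        "risk_rate": sum(not e.footing_verified for e in entries) / n,
    }
```

Even a couple of dozen entries gathered during normal work give you a baseline you can defend, which is more than most vendor demos offer.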

Accuracy: would a competent lawyer accept this?

Accuracy is not "does it sound convincing." Accuracy is whether the change is usable under real supervision.

In practice, accuracy shows up when the reviewer can accept a suggestion as‑is, or with light edits, without rebuilding the clause to make it correct. If a tool regularly produces outputs that need a full rewrite, the tool may look helpful, but it is not reducing work; it is rearranging it.

Recall: did it miss something important?

A suggestion can be correct and still be incomplete.

Recall (often called "coverage") is the question lawyers care about most in practice: did the workflow bring the right issues into view for this clause, in this jurisdiction, in this document context? Misses tend to look familiar: an absent carve‑out, a missing definition, a schedule that changes the effect, or a jurisdiction nuance that shifts the drafting position.

When recall is poor, the problem usually doesn't appear immediately. It appears later as a surprise during partner review or negotiation: the moment you least want surprises.

Risk: did it introduce exposure?

Most legal‑AI risk is not dramatic. It is subtle.

Risk shows up when an output looks plausible but doesn't hold up: a confident statement with weak support; a source that doesn't actually say what the draft claims; a drift into the wrong jurisdiction; an outdated position treated as current.

The simplest habit for risk‑bearing clauses is also the oldest legal habit: check the footing before you accept.

A quick example: where the metrics show up in real clauses

Take a limitation of liability clause. Accuracy is whether the suggested wording is usable without you rebuilding the clause. Recall is whether the workflow brings the usual "gotchas" into view: the carve‑outs, the interaction with indemnities, and the definitions that change the effect. Risk is whether the suggestion stays anchored to the right footing, rather than importing a position from the wrong market standard.

Now take a data protection clause. Accuracy is whether the language is usable for the deal you're doing. Recall is whether it surfaces the issues you expect to check (like transfers, security, and responsibilities). Risk is where tools often fail quietly: drifting jurisdictions, relying on outdated footing, or asserting obligations that don't hold up when checked.

These are not academic tests. They're the everyday realities of supervision.

How do you measure hallucinations in legal AI?

Answer: treat any legal claim that fails a quick source check as a hallucination. If it happens at all on risk‑bearing clauses, it's a serious red flag.

In practice, "hallucination" in legal work often means "this sounds authoritative, but you can't support it." The most damaging form is not nonsense text; it's an unsupported position that slips into a draft because it reads well.

What does recall mean in legal RAG?

Answer: recall means coverage; it asks whether the workflow surfaces the authorities and issues a competent lawyer would expect to check for the clause, jurisdiction, and context.

RAG is often described as "look it up before you write." In legal workflows, that translates to a simple expectation: the relevant basis should be discoverable in the moment you're reviewing, not as a separate research scavenger hunt afterwards.
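
One way to hold a tool to that expectation is a per‑clause checklist: the issues and authorities you would check anyway, counted against what the workflow surfaced. A sketch with illustrative checklist items:

```python
def coverage(surfaced: set[str], checklist: set[str]) -> float:
    """Recall as coverage: expected issues the workflow actually surfaced."""
    if not checklist:
        return 1.0
    return len(surfaced & checklist) / len(checklist)

# Illustrative checklist for a limitation of liability clause.
expected = {"carve-outs", "indemnity interaction", "cap definitions"}
surfaced = {"carve-outs", "cap definitions"}
print(f"{coverage(surfaced, expected):.0%}")  # 67%: one expected issue was missed
```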

What is jurisdiction drift in contract drafting?

Answer: jurisdiction drift is when an output quietly assumes the wrong legal framework, so the drafting sounds right but rests on the wrong footing.

In practice, drift looks like imported concepts, wrong regulator assumptions, or clause language that belongs in a different market standard. It's rarely obvious on first read, which is why it shows up as a risk pattern during evaluation.
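
Because drift rarely announces itself, even a blunt first‑pass filter helps during evaluation. A minimal sketch, assuming each suggestion carries a jurisdiction label that can be compared against the contract's governing law; how that label is produced (declared by the tool, or inferred from its cited sources) is up to your workflow:

```python
def drift_flags(suggestions: list[dict], governing_law: str) -> list[dict]:
    """Flag suggestions whose assumed jurisdiction differs from governing law.

    Each suggestion is a dict like {"clause": "...", "jurisdiction": "E&W"};
    both fields are illustrative. A flag is a prompt for human review, not
    a verdict: some cross-border references are deliberate.
    """
    return [s for s in suggestions if s.get("jurisdiction") != governing_law]
```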

What is "RAG evaluation" in legal AI?

Answer: it asks whether the tool brings the right supporting sources into view, and whether the output stays true to those sources.

You may see vendors describe this as "RAG evaluation". Ignore the acronym.

In practice, it comes down to two questions:

Did the workflow bring the right supporting sources into view?

Did the suggestion stay true to what those sources actually say?

That is just legal supervision expressed in modern language.
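
Expressed as a tally, both questions fit in a few lines. A sketch assuming you record a reviewer's yes/no per suggestion; the judgments stay human, and the code only counts:

```python
def rag_eval(records: list[dict]) -> dict[str, float]:
    """Tally the two supervision questions over a set of reviewed outputs.

    Each record is a human judgment, e.g.
        {"right_sources_in_view": True, "faithful_to_sources": False}
    (field names are illustrative). Assumes a non-empty list.
    """
    n = len(records)
    return {
        "retrieval": sum(r["right_sources_in_view"] for r in records) / n,
        "faithfulness": sum(r["faithful_to_sources"] for r in records) / n,
    }
```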

How to benchmark legal AI tools without turning it into a project

When legal teams talk about benchmarking, validation, or quality assurance (QA), they usually mean something simple: a decision process that stands up to scrutiny.

Call it benchmarking, a scorecard, or evaluation criteria; the point is the same: can you supervise AI‑assisted work in a way you'd be comfortable explaining to a partner or a client?

A useful way to compare tools is to watch where effort actually goes. Does the tool reduce rewrite burden, or does it create it? Do issues surface early, or do they surface late? Can the reviewer see and verify what a change is based on, or does verification become extra work?

The best tools make the "why" visible during review. The weakest tools force you to verify after the fact, which is exactly when people stop verifying because the workflow is too slow.
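
If you want to compare tools side by side without a project plan, run the same small clause set through each and tally where the effort went. A hedged sketch with illustrative record fields, filled in by the reviewing lawyer:

```python
def compare_tools(results: dict[str, list[dict]]) -> None:
    """Quick effort comparison over a shared clause set.

    results maps a tool name to per-clause records like
        {"rewritten": True, "late_issue": False, "extra_verification": True}
    (illustrative fields, not any tool's native output).
    """
    for tool, rows in results.items():
        n = len(rows)
        print(f"{tool}: "
              f"rewrite burden {sum(r['rewritten'] for r in rows) / n:.0%}, "
              f"late issues {sum(r['late_issue'] for r in rows) / n:.0%}, "
              f"verification overhead {sum(r['extra_verification'] for r in rows) / n:.0%}")
```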

Why Qanooni is designed to be easier to evaluate

Evaluation becomes easier when supervision happens inside the workflow.

Qanooni is built for review inside Microsoft Word, because that is where legal supervision actually happens. Suggestions are designed to be evidence‑linked, so the basis for a change is visible at the moment you are deciding whether to accept it.

Qanooni is designed so the question "what is this based on?" is answerable without leaving the draft.

During review, the lawyer sees the suggested change and its linked basis in the same place they are already editing. They can open it, confirm it supports the position, and then accept or edit the language in context.

That changes the evaluation experience. You are not judging outputs in isolation. You are judging them the way legal work is normally judged: in context, during review, with the footing visible.

If you can't show what a clause is based on, you can't supervise it, and unsupervised legal AI is risk, not efficiency.



Frequently Asked Questions

What are legal AI evaluation metrics in plain English?

They are practical checks that show whether AI‑assisted work is usable and defensible: would you accept it, did it miss anything material, did it introduce risk, and can you explain what it is based on.

Why isn't accuracy enough?

Because a change can be technically correct and still miss the point. Legal work needs completeness and risk control as well as correctness.

What does recall mean in contract review?

It means coverage: did the workflow surface the issues you would normally expect to check for that clause, in that jurisdiction, in that context?

What is the simplest risk test?

For risk‑bearing clauses: verify the footing before accepting. If you can't support a change when checked, it shouldn't be treated as reliable.


Author: Qanooni Editorial Team

Last updated: 2025-12-19