Just pick one thing

Why most legal AI pilots fail to move out of the pilot stage, by Taariq Ismail

The real reason pilots stall

A head of legal ops described something I’ve been hearing in variations all year. Her team had run an AI pilot for six months. Broad scope, covering several contract types, with a well-known copilot tool. The lawyers used it to get through work faster, but when the CFO asked what it had saved them, she couldn’t point to a number. The team was processing roughly the same volume of contracts with roughly the same headcount on roughly the same timelines. Needless to say, it didn’t move out of the pilot stage.

The lawyers used it to get through work faster, but when the CFO asked what it had saved them, she couldn’t point to a number.

She was talking to us because she wanted to try again, differently. The question she kept coming back to was: how do you scope a deployment so that the result is actually legible to the business?

MIT’s (now infamous) research puts a number on why that question matters so much: 95% of generative AI projects deliver no measurable value. That is an extraordinary figure given the billions flowing into the space, and I think it deserves more examination than it usually gets, because the failure mode isn’t what most people assume. The technology worked in her pilot. The scoping didn’t.

I find that failed legal AI deployments almost always trace back to one of two errors. They look like opposites, but they share a common origin.

Error 1: Choosing a problem that doesn’t matter

The first is choosing a problem that doesn’t matter enough. The team picks something safe: summarising meeting notes, answering basic policy questions, generating first drafts of internal memos nobody was waiting for. These are real tasks that consume real time, but they aren’t tasks the business cares about. Nobody in the C-suite is measuring how long it takes to produce a meeting summary. The pilot succeeds on its own terms and fails to produce any result that anyone outside the legal team notices.

Error 2: Choosing a scope too broad to measure

The second error is the inverse: choosing a scope so broad that the deployment can’t demonstrate clear results within a timeframe that matters. “We’re going to use AI across our contracting function” is a strategy statement, not a deployment plan. Contracting involves triage, template selection, review, redlining, negotiation, escalation, and execution. Each of those steps has different data requirements, different playbook logic, and different supervision needs. Trying to address all of them simultaneously in a pilot means none of them gets the depth of configuration required to produce output that lawyers actually trust. The pilot runs in a permanent state of “promising but not quite there.” Nobody signs off on the next phase.

“We’re going to use AI across our contracting function” is a strategy statement, not a deployment plan.

The root cause

The common origin, I think, is a reluctance to commit. Choosing a low-stakes problem avoids the discomfort of handing something important to a machine. Choosing an impossibly broad scope avoids the discomfort of deciding what specifically to bet on. Both are ways of deferring the hard decision, which is: what is the single most valuable thing this technology could do for us in the next 90 or so days, and are we willing to actually let it?

The advice that follows from this, “pick one specific problem and solve it brilliantly,” runs directly against how most legal leaders think about transformation, and I think the psychology of that resistance is worth examining.

A GC looking at their team’s workload sees dozens of processes that could benefit from automation. They see the NDA backlog, the triage chaos, the outside counsel spend on routine reviews, the knowledge that lives in people’s heads rather than in any system. The instinct is to address all of it at once, because the problem is genuinely systemic. Picking a single narrow workflow feels like an admission that you aren’t thinking big enough.

Choosing a low-stakes problem avoids the discomfort of handing something important to a machine. Choosing an impossibly broad scope avoids the discomfort of deciding what specifically to bet on.

But the narrowing is the strategy, not a concession to it. A legal ops leader at a large industrial company described their vision to me recently using the phrase “engine room”: agents handle the 80% of work that is high-volume and low-judgment, while lawyers use “precision tools” for the complex work that falls out. That’s the right end state. The mistake is trying to build the entire engine room in one go.

What actually works

What I’ve seen work, consistently, is something more like this. A team identifies the single workflow that is simultaneously high-volume, clearly rule-governed, and politically visible enough that improving it would be noticed by the business. Often this is NDAs or supplier agreements: most enterprise legal teams handle hundreds or thousands of these per year, the review logic is well understood even if it has never been formally documented, and the business users who send them (sales, procurement, partnerships) have strong opinions about turnaround time. Sometimes it’s incoming triage, because a shared legal inbox receiving 60 to 80 requests per day, with lawyers spending 30 to 45 minutes each morning just sorting what needs attention, is a problem that everybody on the team feels and that has an obvious before-and-after metric.

A team identifies the single workflow that is simultaneously high-volume, clearly rule-governed, and politically visible enough that improving it would be noticed by the business.

The team deploys an agent on that single workflow. Not as a proof of concept but as a production system, with real requests flowing through it, real supervision, and real measurement from the first week. The agent gets its own email address. Business users send their requests to it exactly as they would email a colleague. Within two to four weeks the agent is processing live work. After that there is a measurable result: requests handled, turnaround time, hours redirected, escalation rate. That result is what funds the expansion, not the technology itself.
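For teams that want that measurement to be concrete from week one, here is a minimal sketch of the kind of log that produces those four numbers. It is an illustration only, not any particular product’s schema; field names like baseline_minutes are assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# One row per request the agent touched. Field names are illustrative
# assumptions for this sketch, not any vendor's actual schema.
@dataclass
class Request:
    received: datetime
    resolved: datetime
    escalated: bool          # True if a lawyer had to take over
    baseline_minutes: float  # estimated lawyer time this request took pre-agent

def weekly_metrics(log: list[Request]) -> dict:
    """Compute the four numbers named above from a simple request log."""
    handled = len(log)
    turnaround_hours = [
        (r.resolved - r.received).total_seconds() / 3600 for r in log
    ]
    escalations = sum(r.escalated for r in log)
    # Hours redirected: baseline lawyer time on requests the agent closed alone.
    redirected = sum(r.baseline_minutes for r in log if not r.escalated) / 60
    return {
        "requests_handled": handled,
        "median_turnaround_hours": round(median(turnaround_hours), 1) if handled else 0.0,
        "escalation_rate": round(escalations / handled, 3) if handled else 0.0,
        "lawyer_hours_redirected": round(redirected, 1),
    }
```

The specifics don’t matter; what matters is that each metric is computable from data the deployment generates as a side effect, so the three-month review is a report rather than a reconstruction.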

What the 95% have in common

I keep coming back to the head of legal ops who came to us after her pilot stalled. Nobody had decided, at the outset, what specific result would constitute success in terms concrete enough that a CFO could act on them. The organisations that land in the 5% aren’t necessarily more sophisticated. They’re the ones that were willing to pick a specific, important problem, hand it to an agent, measure what happens, and make a decision based on the evidence. The ones that land in the 95% hedged, either by choosing something too small to matter or something too large to measure, because committing to a specific bet felt riskier than keeping things exploratory.

The organisations that land in the 5% aren’t necessarily more sophisticated. They’re the ones that were willing to pick a specific, important problem, hand it to an agent, measure what happens, and make a decision based on the evidence.

The GC who can say “we deployed an agent on NDA triage three months ago, it has processed 400 requests with a 97% first-pass approval rate, and we’ve redirected 120 hours of lawyer time elsewhere” is having a fundamentally different conversation from the GC who says “we’ve been piloting AI across our contracting function and the team finds it useful.” The CFO can do something with the first.

And once you’ve proven one use case, the subsequent ones are much easier to deploy.

Thanks for reading! Subscribe for free to receive new insights, and check out flank.ai for more information on how you can deploy agents in your enterprise.