Fluent and wrong
On the failure modes of single-shot LLM critique in legal work, the synthesiser-and-auditor architecture, and where domain expertise binds.
In April 2025, in R (Ayinde) v London Borough of Haringey [2025] EWHC 1040 (Admin), Ritchie J found that the grounds for judicial review cited five authorities, none of which existed; he made wasted costs orders of £2,000 each against counsel personally and the instructing solicitors. Two months later, the Divisional Court (Sharp P and Johnson J) returned to the same facts in Ayinde and Al-Haroun [2025] EWHC 1383 (Admin), where the companion judgment in Al-Haroun v Qatar National Bank went further: of forty-five citations, eighteen were fabricated, and many of the real cases contained none of the quotations attributed to them. The Divisional Court referred the lawyers to their regulators and held the contempt threshold to have been met. Two months later still, in MS (Bangladesh) [2025] UKUT 305 (IAC), an immigration barrister handed up a ChatGPT print-out as a judgment of the Court of Appeal; given a copy of Ayinde and asked to reconsider, he asked ChatGPT to verify the citation, and ChatGPT obligingly confirmed the fake. The public tracker of UK judgments containing AI-fabricated authorities is updated continuously, and new cases arrive most weeks. The Bar Council updated its generative AI guidance in late November 2025 to keep pace.
The pattern of fluent prose, well-formatted citations and internally coherent argument, none of it tied to anything real, is a predictable failure mode of large language models, not an aberration. The clinical word “hallucination” is misleading; what the model is doing is closer to confabulation: producing content that fits the local statistics of plausible legal writing without any anchor in retrieved fact. Retrieval helps, but does not solve the problem. The Stanford RegLab study by Magesh and colleagues, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, found Lexis+ AI hallucinating on roughly 17% of queries, Westlaw’s AI-Assisted Research on about 33%, and a vanilla GPT-4 baseline on around 43%. At somewhere between one in six and one in three queries for the leading retrieval-augmented commercial tools, this is not something a barrister can rely on without independent verification. The sharper failure mode is critique: asked to attack a document, a current frontier model is exceptionally good at producing critique-shaped text (internally coherent, fluently expressed, structurally well-formed) with no grounding whatsoever in the underlying facts. The recent literature on multi-agent systems, including Wynn, Satija and Hadfield’s Talk Isn’t Always Cheap and Princeton’s AI Agents That Matter, has been mapping these failure modes in detail.
The conclusion the literature converges on is not that LLMs cannot be made reliable, but that reliability has to be engineered into the structure of how they are used. The single-shot critique pattern — paste in the document, ask for criticisms, read the output — has no such structure. It is the worst possible configuration for legal work: high-stakes, fact-dense, retrieval-poor, with the model rewarded for sounding right and punished for nothing.
What actually works
The architecture that does work, in the systems I have built and seen built, is some version of a synthesiser-and-auditor pattern. The names vary; the structural commitments are the same.
The first commitment is information asymmetry between roles. The agent that drafts or critiques does not also verify. The agent that verifies does not see the drafting agent’s output unfiltered; it sees a decomposition of that output into checkable propositions. This is the design at the heart of the MARCH framework — Multi-Agent Reinforced Self-Check — where a Solver generates, a Proposer decomposes the output into atomic claims, and a Checker validates each claim against retrieved evidence in isolation, deprived of the Solver’s original wording. The information asymmetry is the point: it breaks the self-confirmation loop in which a verifier rationalises the generator’s mistakes because it has been shown the generator’s reasoning.
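A minimal sketch of that asymmetry, in Python. It is not the MARCH implementation: the function names, the sentence-splitting Proposer, the substring-matching Checker and the one-line corpus are all stand-ins invented for illustration, where a real system would make model and retrieval calls. What it shows is structural: the Checker’s signature has no parameter through which the Solver’s draft could reach it.

```python
from dataclasses import dataclass
from typing import Callable

Retriever = Callable[[str], list[str]]


@dataclass(frozen=True)
class AtomicClaim:
    claim_id: int
    text: str  # one proposition, restated on its own


def propose(draft: str) -> list[AtomicClaim]:
    """Placeholder Proposer: one claim per sentence of the Solver's draft."""
    sentences = [s.strip() for s in draft.split(".") if s.strip()]
    return [AtomicClaim(i, s) for i, s in enumerate(sentences)]


def check(claim: AtomicClaim, retrieve: Retriever) -> str:
    """Placeholder Checker: sees one claim and the evidence retrieved for it.
    The draft is not an argument, so it cannot leak into the verdict."""
    evidence = retrieve(claim.text)
    if not evidence:
        return "no_evidence"
    hit = any(claim.text.lower() in passage.lower() for passage in evidence)
    return "supported" if hit else "unsupported"


def run_pipeline(draft: str, retrieve: Retriever) -> dict[int, str]:
    """Solver output goes in; only decomposed claims ever reach the Checker."""
    return {c.claim_id: check(c, retrieve) for c in propose(draft)}


# Toy evidence base and retriever (invented for the example).
CORPUS = ["The limitation period for the claim is six years."]


def toy_retrieve(query: str) -> list[str]:
    terms = {w for w in query.lower().split() if len(w) > 4}
    return [p for p in CORPUS if terms & set(p.lower().rstrip(".").split())]


draft = ("The limitation period for the claim is six years. "
         "The defendant admitted liability in writing.")
verdicts = run_pipeline(draft, toy_retrieve)
print(verdicts)
```

The second sentence of the toy draft comes back as "no_evidence" rather than as a guess, which is the behaviour the asymmetry is meant to protect.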
The second is atomic claim decomposition. Critique that says “the section on causation is weak” is not checkable. Critique that says “the proposition on page twelve, paragraph four — that Caparo requires foreseeability of the type of damage rather than the extent — misstates the authority” is checkable: the source says what it says. The architectural commitment is that no critique leaves the system in a form that is not reducible to a list of atomic, individually verifiable propositions, each annotated with the source it would stand or fall on. The PROClaim work — a courtroom-style framework with progressive retrieval — took this approach to its logical conclusion, achieving 81.7% accuracy on the Check-COVID benchmark, around ten points above standard multi-agent debate, with most of the gain — about 7.5 points — attributable to the progressive-retrieval component rather than to the debate itself.
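One way to enforce that commitment at the system boundary is a record type with a mandatory source field and a gate that holds back anything incomplete. The sketch below is an assumption about how such a gate might look, not a description of PROClaim or any other published system; the `Proposition` type and `partition` function are hypothetical names. It tests form only: whether a proposition is true remains a matter for the auditor and the human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    locator: str     # where in the document the point is made
    statement: str   # the single proposition being asserted
    depends_on: str  # the source the proposition stands or falls on


def partition(items: list[Proposition]) -> tuple[list[Proposition], list[Proposition]]:
    """Separate critique that may leave the system from critique that may not.
    An item is releasable only if all three fields are filled in."""
    releasable, held_back = [], []
    for item in items:
        complete = all(f.strip() for f in (item.locator, item.statement, item.depends_on))
        (releasable if complete else held_back).append(item)
    return releasable, held_back


critique = [
    # Vague critique: no locator, no source, so nothing to check it against.
    Proposition("", "The section on causation is weak", ""),
    # Specific critique: located, single-pointed, tied to an authority.
    Proposition("page 12, paragraph 4",
                "The skeleton misstates what Caparo requires on foreseeability",
                "Caparo"),
]
releasable, held_back = partition(critique)
print(len(releasable), len(held_back))
```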
The third is heterogeneous tools and models. If the same model both generates and verifies, you have not introduced a check; you have introduced a second sample from a correlated distribution. The Tool-MAD framework makes this explicit: distinct agents are bound to distinct external tools — one to a vector-indexed corpus of authorities, another to a live legal search API — and their disagreements are surfaced rather than averaged away. The principle generalises. Where it is feasible, use different model families for the synthesiser and the auditor, and bind each to different evidence sources.
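The surfacing of disagreement can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Tool-MAD’s code: `make_checker` and `cross_check` are hypothetical names, and each toy checker stands in for an agent bound to its own evidence source. The design point is that a split verdict is reported as "disputed" with both verdicts attached, never resolved by vote or average.

```python
from typing import Callable

Checker = Callable[[str], str]  # claim -> "supported" or "unsupported"


def make_checker(known: set[str]) -> Checker:
    """Build a toy checker bound to one evidence source (a set of known claims)."""
    return lambda claim: "supported" if claim in known else "unsupported"


def cross_check(claim: str, checkers: dict[str, Checker]) -> dict:
    """Ask every checker independently and keep each verdict.
    Any disagreement becomes status 'disputed'; nothing is averaged away."""
    verdicts = {name: fn(claim) for name, fn in checkers.items()}
    distinct = set(verdicts.values())
    status = distinct.pop() if len(distinct) == 1 else "disputed"
    return {"claim": claim, "status": status, "verdicts": verdicts}


# Two sources with different coverage (contents invented for the example).
checkers = {
    "authority_index": make_checker({"claim A", "claim B"}),
    "live_search": make_checker({"claim A"}),
}
for claim in ("claim A", "claim B", "claim C"):
    print(cross_check(claim, checkers))
```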
The fourth is bounded debate. The empirical finding in Talk Isn’t Always Cheap is that more rounds of inter-agent discussion do not help and frequently hurt. Two design choices follow. First, cap interaction at one or two structured exchanges, with the loop terminated by a deterministic procedure rather than by the agents declaring agreement. Agreement is not evidence of truth in this regime; agreement is what sycophancy produces. Second, limit each agent’s memory to a single turn: each iteration receives the current state and nothing else, with no carry-over of the agent’s own prior turns. Accumulating context across rounds is the mechanism by which agents drift into self-consistent but incorrect positions, which the literature calls context pollution.
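Both choices fit in a short loop. The sketch below is a toy under my own assumptions (the `bounded_exchange` function, the status strings and the two stand-in agents are hypothetical): the round cap alone ends the exchange, and each agent is a pure function of the current state, so there is no transcript for it to accumulate.

```python
from typing import Callable

State = dict[str, str]            # proposition -> current status
Agent = Callable[[State], State]  # a pure function of the current state


def bounded_exchange(state: State, agents: list[Agent], max_rounds: int = 2) -> State:
    """Run a fixed number of rounds. The cap ends the loop; agents cannot
    extend it or end it early by announcing that they agree."""
    for _ in range(max_rounds):
        for agent in agents:
            # Each call gets a fresh copy of the state and no transcript.
            state = agent(dict(state))
    return state


VERIFIABLE = {"A"}  # toy stand-in for what the auditor can confirm


def synthesiser(state: State) -> State:
    # Re-asserts anything the auditor contested.
    return {k: "asserted" if v == "contested" else v for k, v in state.items()}


def auditor(state: State) -> State:
    # Rules only on freshly asserted propositions.
    return {k: ("supported" if k in VERIFIABLE else "contested") if v == "asserted" else v
            for k, v in state.items()}


final = bounded_exchange({"A": "asserted", "B": "asserted"}, [synthesiser, auditor])
print(final)
```

The two toy agents never agree about proposition B, and the loop ends anyway with B still marked as contested, which is the state a human reviewer should see.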
The fifth is human prioritisation as the gate. The final move in the pipeline is not the auditor’s verdict; it is a ranked list of contested propositions, each with the source it depends on, presented to a human reviewer for decision. The system never autonomously concludes that a piece of critique is correct. It produces a worklist: here are the points the auditor cannot verify, here is the source it would need to verify them, here is the rank order in which a human should look. Everything downstream is a barrister or solicitor reading authorities and exercising judgement, with the tool functioning as a search-and-shortlist accelerator rather than a deliverer of opinions.
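The worklist itself is a simple data structure. In the sketch below the `Finding` type, the status strings and the ranking key are assumptions made for illustration; in particular, ranking by how many other points depend on a proposition is one hypothetical heuristic, and the reviewer remains free to reorder. Supported items are dropped, and nothing in the output records a conclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    proposition: str
    status: str         # "supported", "disputed" or "no_evidence"
    source_needed: str  # what a human would have to read to settle it
    dependents: int     # hypothetical weight: points that rest on this one


def build_worklist(findings: list[Finding]) -> list[Finding]:
    """Keep only what the auditor could not settle, most load-bearing first.
    Ties are broken alphabetically so the order is deterministic."""
    open_items = [f for f in findings if f.status != "supported"]
    return sorted(open_items, key=lambda f: (-f.dependents, f.proposition))


findings = [
    Finding("Limitation expired before issue", "no_evidence",
            "claim form and date of issue", 3),
    Finding("Notice was served on the landlord", "supported",
            "certificate of service", 1),
    Finding("The quoted passage appears in the cited judgment", "disputed",
            "the judgment itself", 5),
]
worklist = build_worklist(findings)
for rank, f in enumerate(worklist, start=1):
    print(rank, f.status, f.proposition, "->", f.source_needed)
```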
The fourth and fifth commitments work in tandem as bias controls. Bounded debate with one-turn memory limits convergence toward agreement within the loop; the human gate stops the auditor’s per-iteration biases (verbosity preference, retrieval gaps, judge-style positional bias) from compounding into the final output.
These five commitments are sufficient to produce a system whose output is inspectable. A reviewer can see, for any claim the system has made, which atomic propositions it rests on, which sources support them, where the auditor disagreed, where it abstained. There is no opaque “the AI thinks this is wrong” step. The trade-off is that the system is slower, more expensive per query, and less impressive in demonstration. It is also the only configuration I am willing to put in front of a barrister with a duty to the court.
Where the law is already code
The auditor’s job becomes far easier wherever law has already been expressed in machine-readable form. This is the project of an emerging field sometimes called computational law or Rules-as-Code. The Catala project at Inria — used in pilots with CNAF and DGFiP to express French family-benefits and income-tax computations as executable code derived directly from the underlying legislation — and the DC Council’s GitHub-versioned Code of the District of Columbia are the most developed examples. Spain’s legalize-es project mirrors every reform to Spanish legislation as a discrete commit. In each case the artefact is queryable, version-controlled, and diffable. A claim about what the statute provides can be tested against the statute as code, not against a model’s recollection of its prose. Where this infrastructure exists, the architectural advantages of the synthesiser-and-auditor pattern compound: the auditor’s evidence base becomes precise, deterministic and inexpensive to query.
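To make the contrast concrete, here is a toy rule written as code, in Python rather than Catala. The rule, its figures and every name in it are invented for the example and correspond to no real statute. Once a provision exists in this form, auditing a claim about it is a function call with a deterministic answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    annual_income: int
    children: int


# Invented parameters for a made-up benefit rule.
INCOME_CEILING = 20_000
AMOUNT_PER_CHILD = 500


def benefit(h: Household) -> int:
    """Made-up rule: nothing is payable without children or at or above
    the income ceiling; otherwise a flat amount per child."""
    if h.children == 0 or h.annual_income >= INCOME_CEILING:
        return 0
    return AMOUNT_PER_CHILD * h.children


def audit_claim(h: Household, asserted_amount: int) -> bool:
    """Test a claim about the rule against the rule itself."""
    return benefit(h) == asserted_amount


household = Household(annual_income=18_000, children=2)
print(audit_claim(household, 1_000))  # the claim matches the rule
print(audit_claim(household, 1_500))  # the claim does not
```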
The honest part of this story is that the territory in which law is amenable to such treatment is small, and the political economy resists its expansion. The parts of law that already function as decision procedures — tax, social benefits, regulatory thresholds — translate cleanly. Common-law reasoning, statutory interpretation in genuinely contested cases, and the constitutional reasoning that determines how rules apply in novel situations do not. They are not failures of formalisation waiting to be solved; they are domains in which meaning is determined precisely by the unresolved interpretive contest, and reducing them to a default-logic representation would change what they are. Beyond the technical limit there is a structural one. Vagueness in legislation is often the load-bearing element of the political compromise that produced it. Earmarks, drafting ambiguities, undefined terms and discretionary thresholds are not bugs; they are the mechanism by which competing interests sign on. A regime in which every statute compiles to executable code, and every amendment is a public commit traceable to its proposer, would make those compromises legible — and the political economy that produces statutes selects against legibility. Computational law will continue to advance in the corners of the legal system that already function algorithmically. It is unlikely to extend much beyond them, and the auditor architecture must be designed not to depend on its extension.
Where the lawyer fits, and what juniors learn
The architectural commitments above only matter if a human is the final consumer. Each commitment surfaces a specific kind of decision for the practitioner: which contested propositions to investigate first, which auditor disagreements indicate genuine error rather than retrieval failure, which atomic claims rest on authorities the auditor’s corpus does not cover. This is the human-in-the-loop stage, and it is where domain expertise binds. The system is upstream of the lawyer, not downstream.
Two skills become structurally important. The first is prioritisation. An auditor will produce more flagged claims than any practitioner has time to chase, and the worklist is only useful if the reviewer at the gate can quickly distinguish the citation that turns the appeal from the citation that is technically correct but tangential. This is judgement, and it does not transfer from the model. The second is debugging the model’s reasoning. When the auditor disagrees with the synthesiser, the failure is sometimes in the synthesiser, sometimes in the auditor’s retrieved evidence, sometimes in the decomposition step that produced a malformed atomic claim. Working out which requires the practitioner to read the chain of evidence and recognise where it breaks. Both skills rest on the same judgement a junior develops by reading authorities under supervision.
This is the relevant point for the training pipeline. The pessimistic story about AI absorbing the work that historically trained juniors assumes the architecture is autonomous: a system that produces a polished memo replaces the junior who used to draft it. The synthesiser-and-auditor architecture is structurally different. It does not produce a polished memo; it produces a worklist for a human reviewer. The work of prioritising the worklist, checking the contested authorities, and deciding which points survive is exactly the work that builds a competent senior. A junior whose role is to triage, verify and escalate the auditor’s findings is doing legal work — possibly more directly than a junior whose former role was to draft the first cut from scratch. The pipeline survives if the architecture is deployed to surface findings to juniors, rather than to replace them with autonomous output.
Domain expert knowledge does not become less valuable in this regime; it becomes the binding input. The system can decompose a skeleton into atomic claims and check each against retrieved authority. It cannot decide which claim is doctrinally load-bearing, which line of authority the Court of Appeal is most likely to develop, which argument is worth the time of a leading silk. Those decisions remain with practitioners, and the value of being a practitioner who can make them well goes up rather than down.