Issue 01 · Data discipline

The case for synthetic data in legal reasoning.

A model fine-tuned on synthetic legal reasoning never sees a real client’s confidential matter. Privilege survives at the training layer — not just at the inference layer.

By Eve-Legal·May 19, 2026·6 min read

The most common procurement question we get is also the most consequential one: do you train your models on our matter data? The answer is no. The harder and more interesting question is the one underneath it: how do you train your models without our data? The answer is Eve-Genesis (Law Edition), a proprietary synthetic reasoning corpus, 100% synthetic by construction, that fine-tunes the Phi-4-derived legal reasoner inside Eve-Legal F5/reasoner.

Synthetic data in legal AI has had a credibility problem. The phrase suggests, to a litigator, fabrication — and litigators are right to be wary of anything that sounds like fabrication when it touches their work product. The credibility problem is worth resolving carefully, because the alternative — training on real client matter data — has a much larger problem.

What "synthetic" means here and what it does not

Synthetic does not mean fictional. The reasoning patterns Eve-Genesis encodes are real reasoning patterns — the patterns a senior litigator applies when reading a record, deciding a demand, or preparing an expert. The legal authorities cited in production demand letters are real authorities; they are not in Eve-Genesis. They are resolved at inference time against CourtListener, the public legal-research corpus, by the four-pass pipeline’s citation verifier.

What Eve-Genesis constructs is the example. The fact pattern of the matter, the shape of the medical chronology, the procedural posture, the names of the parties, the geography — all of it engineered for training. The constructed examples are calibrated to the workload the production pipeline runs: four-pass demand-letter construction, multi-provider medical chronology, jurisdiction-aware damages modeling, deposition preparation, lien orchestration. Coverage is engineered, not accidental.

The four properties only synthetic data delivers

Privilege survives at the training layer. A model fine-tuned on synthetic legal reasoning never sees a real client’s confidential matter, never memorizes a real attorney’s work product, and never risks reproducing a real privileged exchange in a future inference. The attorney-client relationship is intact at the training layer, not just at the inference layer where most vendors stop talking about privilege. This is a structural property of the corpus, not a policy promise.

Quality is engineered. Real legal documents in any plausible scraped corpus are dominated by the most common matter shapes: simple soft-tissue PI, routine contract disputes, uncontested employment terminations. A model trained on that distribution gets very good at the easy cases and predictably bad at the consequential ones. Eve-Genesis is constructed to over-represent the matter shapes that matter — surgical-recommendation cases, multi-impact matters, treatment-gap patterns, IME contradictions. The training distribution is the production distribution, by design.
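One way to picture an engineered distribution is as a quota plan: each matter shape gets a weight, and the number of constructed examples per shape follows from the weights. The sketch below is illustrative only. The function name `allocate_quota`, the weights, and the corpus size are assumptions made for the example, not the actual Eve-Genesis construction process.

```python
def allocate_quota(weights, total_examples):
    """Split total_examples across matter shapes in proportion to integer weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    # Floor share for every shape, using integer arithmetic only.
    counts = {
        shape: total_examples * w // weight_sum for shape, w in weights.items()
    }
    # Hand leftover slots to the shapes with the largest remainders.
    leftover = total_examples - sum(counts.values())
    by_remainder = sorted(
        weights,
        key=lambda shape: (total_examples * weights[shape]) % weight_sum,
        reverse=True,
    )
    for shape in by_remainder[:leftover]:
        counts[shape] += 1
    return counts


# Hypothetical weights: consequential shapes are deliberately over-represented.
weights = {
    "surgical-recommendation": 3,
    "multi-impact": 3,
    "treatment-gap": 2,
    "IME-contradiction": 2,
    "simple soft-tissue": 1,
}

plan = allocate_quota(weights, 100)
```

In a scraped corpus the simple soft-tissue shape would dominate; in this plan it receives the smallest quota because the weights say so, which is the sense in which coverage is a design decision.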

Bias is auditable. When the corpus is constructed, the demographics, jurisdictions, injury types, and case shapes that compose it are documented. We know what the reasoner has been shown. We know what it has not been shown. When a customer asks "is your model biased toward a particular kind of plaintiff," the answer is not a hand-wave — it is the corpus composition document.
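A corpus composition document can be thought of as a set of tallies over the metadata attached to each constructed example. The sketch below shows the idea under stated assumptions: the field names (`jurisdiction`, `injury_type`, `matter_shape`) and the four sample records are hypothetical, chosen for illustration, and do not describe the real Eve-Genesis schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:
    # Hypothetical metadata recorded when an example is constructed.
    jurisdiction: str
    injury_type: str
    matter_shape: str


def composition(examples, field):
    """Return the share of the corpus held by each value of `field`."""
    counts = Counter(getattr(ex, field) for ex in examples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def missing_values(examples, field, expected):
    """Return the expected values of `field` that no example covers."""
    seen = {getattr(ex, field) for ex in examples}
    return sorted(set(expected) - seen)


# Illustrative records only; not real corpus contents.
corpus = [
    SyntheticExample("CA", "spinal", "surgical-recommendation"),
    SyntheticExample("CA", "soft-tissue", "treatment-gap"),
    SyntheticExample("TX", "spinal", "multi-impact"),
    SyntheticExample("NY", "soft-tissue", "IME-contradiction"),
]

shares = composition(corpus, "jurisdiction")
gaps = missing_values(corpus, "jurisdiction", ["CA", "TX", "NY", "FL"])
```

The first function answers "what has the reasoner been shown"; the second answers "what has it not been shown". Both are only possible because every example carries metadata from the moment it is constructed.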

Coverage scales with the editions. Eve-Genesis is a family — Law, Education, Clinical, Theology — each calibrated for its product’s workload. The Law Edition itself extends as new editions ship: WC, MM, MT, IB, EL each contribute their own reasoning patterns back into the corpus. A customer who comes on under PI today gets every benefit of the Law Edition’s subsequent calibration as future editions ship.

The honest limitation

Synthetic data does not magically know the law. Cited authorities (case law, statutes, regulations) are not part of the training corpus. They are resolved at inference time against external sources: CourtListener for case law, the relevant state codes and agencies for statutory and regulatory text, and the Federal Register for federal regulatory developments.

The four-pass demand-letter pipeline’s citation verifier — the same pipeline documented elsewhere in this issue — is the mechanism that turns the synthetic reasoning corpus into citation-accurate output. The reasoner knows how to construct a four-pass demand letter; the pipeline checks every cited authority against the real public record before the letter ships. The two halves of the architecture are not interchangeable. Together they produce the work product. Apart, they would not.
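The division of labor described above, where the reasoner drafts and the verifier gates, can be sketched as a simple check that blocks release if any cited authority cannot be found. This is a minimal illustration, not the production verifier: `verify_citations`, `VerificationReport`, and the injected `lookup` function are hypothetical names, and the in-memory set stands in for a query against an external source such as CourtListener, whose API is deliberately not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    text: str  # the authority exactly as cited in the draft letter


@dataclass(frozen=True)
class VerificationReport:
    verified: tuple
    unverified: tuple

    @property
    def may_ship(self) -> bool:
        # The letter is released only when every citation was found.
        return not self.unverified


def verify_citations(citations, lookup):
    """Check each citation with `lookup`, a caller-supplied function that
    returns True when the authority exists in the external source."""
    verified, unverified = [], []
    for citation in citations:
        target = verified if lookup(citation.text) else unverified
        target.append(citation)
    return VerificationReport(tuple(verified), tuple(unverified))


# Stand-in for an external source; a real lookup would query a public corpus.
KNOWN_AUTHORITIES = {"Palsgraf v. Long Island R.R. Co., 248 N.Y. 339 (1928)"}

draft = [
    Citation("Palsgraf v. Long Island R.R. Co., 248 N.Y. 339 (1928)"),
    Citation("Doe v. Nobody, 999 X.Y.Z. 1 (2099)"),  # deliberately made up
]

report = verify_citations(draft, lambda text: text in KNOWN_AUTHORITIES)
```

Passing `lookup` in as a parameter keeps the gate independent of any one source, so the same check could sit in front of case law, statutory, or regulatory lookups.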

The procurement-grade version

A procurement officer reviewing a legal-AI vendor’s data posture should ask three questions. First: is the training data composed of customer matter data, scraped public records, or constructed synthetic examples? Second: if synthetic, what does the corpus composition look like — what matter shapes, what jurisdictions, what demographics? Third: how do cited authorities get into the output, and how are they verified?

The vendor that cannot answer the first question is selling AI that may have been trained on someone’s privileged data. The vendor that answers "scraped public records" is selling a model trained on the most common cases and optimized for the ones that matter least to the firm. The vendor that answers "synthetic, with documented corpus composition, and authorities resolved at inference time against public corpora" is selling the architecture that protects the firm’s privilege, calibrates to the matters that matter, and verifies every citation against a source the litigator can independently check.