Annual report · Inaugural benchmark

The year in agentic legal AI.

What shipped. What shifted. What the next twelve months will bring. The first edition lays the measurement framework; subsequent editions track the change.

By Eve-Legal·May 19, 2026·12 min read

This is the inaugural edition of the JustineAI™ annual review of agentic AI in legal practice. The framing matters: this is a benchmark, not a marketing surface. The goal of the publication is to lay a measurement framework that subsequent editions can use to track real change in the category, year over year, with the same metrics applied consistently. If a measure that mattered last year does not appear here, that is the editorial decision — not an oversight. If a measure that should matter is missing, the inaugural framework is wrong and we will say so in the second edition.

What shipped

The single most consequential capability to land in production over the past year is long-context reasoning at scale. Meta Llama 4 Scout’s 10-million-token context window arrived during the review window, and a handful of legal-AI vendors, JustineAI™ among them, incorporated it into shipping product within the same period. The change is one of kind, not degree: workflow patterns that previously required chunked retrieval, and were therefore prone to cross-document contradictions, can now run as one-shot reads of the full matter.

Frontier-model capability climbed in two specific dimensions that matter for legal practice: structured-reasoning fidelity and refusal calibration. The latter is the quieter shift. Models that two years ago refused to engage with legal-reasoning questions, because they had been calibrated for a liability-conservative posture, now engage productively when the workflow makes clear that the user is a licensed attorney and the work product is attorney-attested. This shift is what made the four-pass demand-letter pipeline possible in production.

The supervisor pattern — a named Digital Employee coordinating stage-specialized sub-agents — moved from research to production. The pattern is now visibly the architectural posture that distinguishes serious agentic-legal-AI products from chatbot wrappers. The companies that adopted it have a structurally different ceiling on what their products can do than the companies that did not.
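For readers who have not seen the supervisor pattern in code, the following is a minimal sketch of the shape described above: one named coordinator that owns the matter state and hands it to stage-specialized sub-agents in turn. It does not describe any vendor's implementation. The class name, the stage names (intake, draft, review), and the dictionary-based matter state are all hypothetical, and the sub-agents here are plain functions standing in for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumption: a sub-agent is modelled as a function that takes the matter
# state and returns an updated copy. Real sub-agents would wrap model calls.
SubAgent = Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class Supervisor:
    """Hypothetical named coordinator that runs sub-agents stage by stage."""
    name: str
    stages: List[Tuple[str, SubAgent]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def register(self, stage: str, agent: SubAgent) -> None:
        """Attach a sub-agent to a named stage; stages run in registration order."""
        self.stages.append((stage, agent))

    def run(self, matter: Dict[str, str]) -> Dict[str, str]:
        """Pass the matter through every stage and record which stages ran."""
        state = dict(matter)  # work on a copy so the caller's input is untouched
        for stage, agent in self.stages:
            state = agent(state)
            self.log.append(stage)
        return state


# Hypothetical stage-specialized sub-agents (stand-ins for model-backed agents).
def intake(state: Dict[str, str]) -> Dict[str, str]:
    return {**state, "facts": f"facts extracted from {state['documents']}"}


def draft(state: Dict[str, str]) -> Dict[str, str]:
    return {**state, "draft": f"draft based on {state['facts']}"}


def review(state: Dict[str, str]) -> Dict[str, str]:
    return {**state, "status": "ready for attorney review"}


supervisor = Supervisor(name="example-supervisor")
supervisor.register("intake", intake)
supervisor.register("draft", draft)
supervisor.register("review", review)

result = supervisor.run({"documents": "matter file"})
```

The point of the sketch is the division of labor: the supervisor holds the state and the ordering, while each sub-agent only knows its own stage. A chatbot wrapper, by contrast, has no equivalent of `stages` or `log`.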

What shifted

The conversation about training data shifted decisively away from "we don’t train on customer data" as a tagline and toward "we don’t train on customer data, and here is the corpus we do train on." The vendors who can answer the second question — by pointing to a documented synthetic corpus, a vendor-disclosed scraping discipline, or a public-source-only training posture — are the vendors whose procurement conversations are now meaningfully shorter. The vendors who can only answer the first question are losing deals to that opacity.

Procurement-side expectations of legal-AI vendors moved toward the security posture general enterprise software has carried for years. Tenant isolation as an architectural commitment, not a configuration option. Customer-managed keys on enterprise tiers. Documented subprocessor lists with materiality thresholds. DPAs that the firm’s outside counsel can mark up. A year ago these were nice-to-have. By the end of the review window they were the gate.

The price shape shifted. Per-seat pricing, with optional volume tiers, became the dominant model for self-serve and mid-market legal-AI products — replacing the per-matter and per-output models that dominated the prior period. Per-seat aligns with how firms actually budget for software, and it removes the perverse incentive to under-use the product to save on per-output fees.

What the next twelve months will bring

The capability ceiling will rise again. The frontier labs’ roadmap implies another step-change in structured-reasoning fidelity, particularly for matters that require multi-source synthesis. Whether that step-change is genuinely categorical or incremental will depend on details we do not yet have visibility into.

Mass-tort and class-action workflows will receive specific product attention from multiple vendors. The supervisor-pattern architecture that PI products demonstrated this year scales naturally to per-plaintiff sub-agent coordination. The shape of the work, one supervisor coordinating thousands of sub-agents in a single reasoning context, is well-matched to the architecture. JustineAI™’s MT (mass-tort) edition is one of the products we expect to ship during the next review window. Others will follow.

Procurement attestation requirements will harden. SOC 2 Type II for the product-specific control surface (not just the inherited cloud-platform surface) will move from procurement nice-to-have to procurement gate over the next twelve months. The vendors that have committed to the attestation work are at varying stages; the vendors that have not committed will be visibly absent from procurement-led deals by the end of the next review window.

Geographic expansion will accelerate, particularly into the UK, Canada, Australia, and New Zealand, with calibrated editions that respect the significantly different statutory and procedural landscapes. The architecture patterns documented in this issue are the ones that make this expansion structurally tractable: the supervisor pattern, the compositional fabric with provider-swappable frontier slots, and the synthetic-data substrate that permits jurisdiction-specific calibration without privilege exposure.

The measurement framework

Subsequent annual reviews will measure: capability ceiling (the hardest matter shape any agentic-legal-AI product can produce production-grade output for); provider concentration (the percentage of category-level workload that depends on a single foundation-model provider); architecture posture (the share of new deployments that adopt the supervisor pattern); training-data discipline (the share of new deployments that disclose corpus composition); attestation posture (the share of vendors holding product-specific SOC 2 Type II at the end of the year); and geographic distribution (the count of jurisdictions with at least one shipping calibrated agentic-legal-AI product).
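Several of these measures are simple shares, and it may help to see how they could be computed from a set of observed deployments. The sketch below is illustrative only: the `Deployment` record, its field names, and the sample values are hypothetical, not baseline data. It also assumes one particular reading of provider concentration, namely the largest single provider's share of total workload; other readings are possible.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Deployment:
    """One new deployment observed in a review window (hypothetical record shape)."""
    provider: str           # foundation-model provider the deployment depends on
    workload: float         # any consistent workload unit, e.g. matters processed
    uses_supervisor: bool   # adopts the supervisor pattern
    discloses_corpus: bool  # discloses training-corpus composition


def share(flags: Iterable[bool]) -> float:
    """Fraction of True values in the sample; 0.0 for an empty sample."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def provider_concentration(deployments: List[Deployment]) -> float:
    """Largest single provider's share of total workload (assumed reading)."""
    totals: Dict[str, float] = {}
    for d in deployments:
        totals[d.provider] = totals.get(d.provider, 0.0) + d.workload
    grand_total = sum(totals.values())
    return max(totals.values()) / grand_total if grand_total else 0.0


# Made-up sample purely to exercise the functions; not measured data.
sample = [
    Deployment("provider_a", 60.0, True, True),
    Deployment("provider_b", 30.0, True, False),
    Deployment("provider_a", 10.0, False, False),
    Deployment("provider_c", 40.0, True, True),
]

concentration = provider_concentration(sample)                        # 70 / 140
architecture_posture = share(d.uses_supervisor for d in sample)       # 3 of 4
training_data_discipline = share(d.discloses_corpus for d in sample)  # 2 of 4
```

Whatever definitions the framework settles on, fixing them as explicit functions like these is what lets the same metric be applied consistently from one edition to the next.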

The inaugural-edition baseline values for these measures will appear in a companion data appendix in the next revision of this document. They are not in this edition; we want the framework to be reviewed and refined by readers — and by the colleagues at other companies in the category — before the baseline locks. The intent is for the annual review to be the kind of publication other companies cite when they make their own claims about the state of the category.

A note on tone

The publication is not adversarial. The state of agentic AI in legal practice is materially improving year over year, and the work of the companies in the category — including the ones whose architectural choices we disagree with — is, in aggregate, advancing the category in ways that benefit the attorneys we all ultimately serve. The annual review will name capability decisions and architectural postures we find unhelpful; it will not name companies in unflattering ways, because the goal is to elevate the category, not to compete inside it through criticism.

Subsequent editions will land in May of each year. The second edition will lay the measured baselines. The third will track the year-over-year change against them. By the time the publication reaches its fifth or sixth edition, it should be the canonical reference for the state of agentic AI in legal practice. That is the goal we’re writing toward.