And why we need shared infrastructure strategy to overcome these choke points….

By Margaret Hagan, first published on Legal Design and Innovation

So many people in the legal help field are excited about AI’s potential. Legal aid organizations, courts, law libraries, and technologists are launching chatbots, intake tools, document assistants, and triage systems.

But from my conversations & experiences over the past few months, a striking pattern keeps emerging: projects that should take weeks take months. Prototypes that demo well in controlled settings fall apart when real users arrive. Tools that work for one organization cannot be transplanted to another without rebuilding from scratch.

How can we get from promising ideas to high-quality, well-executed, safe pilots?

Why? The answer is not primarily about model capabilities or funding (though both matter). The answer is a set of structural choke points: recurring technical, institutional, and knowledge barriers that slow R&D teams down because each team runs into them alone — and then has to solve them on its own.

This piece presents an initial array of 10 common choke points I have heard about from colleagues, experienced in our own Legal Design Lab work, and observed among student projects. As I lay them out, I also explain why they derail legal AI R&D. And then I point to what we could be doing (as a legal help community) to overcome them together.

In particular, I go back to what I’ve been talking about for the past several years — a shared infrastructure strategy to take on & solve challenges together, so they don’t block every team forever. We’ve been crystallizing that strategy into the Legal Help Commons — a coordinated effort to build the shared resources, reference architectures, toolkits, benchmarks, and community infrastructure that the field needs, so that every team isn’t rebuilding the same foundations from scratch. (More on the Commons at this write-up from a few days ago…)

The big takeaway of this essay, though, is:

Legal AI R&D is slow not because the foundation models are weak at our teams’ tasks, but because the surrounding infrastructure — data, legal logic, compliance, and classification — often isn’t prepped to work with the models, and we’re not clear about how to reach the level of performance needed to go live. Our domain is also highly regulated and high-stakes for users, with big risks for providers and consumers. We don’t have much common knowledge or tooling for mitigating these risks, reaching great and consistent performance, and getting the models to perform the way we want them to. Shared infrastructure could turn months of rework, frustration, misdirections, and other R&D detours into a more straightforward sprint.

These are recurring R&D blocks — but we can get past them if we work together.

Why Does It Take So Long to Get to High-Performing, Pilot-Ready Tools?

Before diving into individual choke points, it’s worth understanding the structural reasons that legal AI R&D moves slowly. Four forces compound to make every project harder than it looks from the outside.

Attorney Logic Is Unwritten

The most valuable knowledge in legal help — how an experienced attorney actually thinks through a case, what questions they ask, what red flags they watch for, what judgment calls they make — is overwhelmingly tacit. It lives in practitioners’ heads, not in documents. You cannot scrape it from a website or extract it from a statute. Even where the ‘law’ is written down, it is often not clear enough, accurate enough, or detailed enough to tell us what a person’s options are, what they need to do, and what their best strategy should be. The only reliable way to surface it is through extensive, structured conversations and iterative testing with subject-matter experts. This is slow, expensive, and doesn’t scale the way most technologists expect.

Source Data Is Messy and Unpredictable

Where does this knowledge live? It exists, but it’s not in good or accessible shape. Legal help content — the guides, forms, rules, eligibility criteria, and service directories that AI tools need to draw on — exists in dozens of formats across hundreds of organizations.

  • PDFs with no machine-readable structure.
  • Drupal sites with inconsistent tagging.
  • Spreadsheets maintained by a single person who left the organization.
  • Intake forms with fields that mean different things in different counties.

Every AI project begins with a data-wrangling phase that is far longer and more painful than anyone budgets for.

Unexpected Edge Cases Don’t Get Documented

Related to both the unwritten logic and the messy data — there are big risks and trouble in getting unusual or less frequent cases helped correctly. Legal problems are inherently complex and come in many strange forms. A person facing eviction may also have a disability accommodation claim, a domestic violence protection order, unpaid utility liens, and immigration status concerns — all intersecting. These edge cases are where AI tools most often fail, and they’re also the cases where failure carries the highest stakes. But edge cases are, by definition, poorly documented. They surface only through real usage, and the field lacks systematic ways to capture, share, and learn from them.

Detailed Tasks Are Harder Than They Look

Even tasks that sound straightforward — pulling data from a case record, analyzing income documents, drafting a form response — turn out to involve many sub-decisions that require legal judgment. Is this income source countable? Does this record indicate an active case or a closed one? Which form version applies in this courthouse? Each sub-decision is a potential failure point, and each requires domain expertise to resolve correctly. It often takes teams many, many different attempts to get an LLM to do these tasks correctly and consistently.


The Ten Choke Points We Should Be Addressing

Based on talking to teams working on legal help AI projects across many jurisdictions, I have pulled out 10 recurring choke points that slow or block development.

  • Confidentiality, PII Masking, and Privilege
  • Conflicts Checking
  • Income Verification and Eligibility Determination
  • Legal Logic and Expert Reasoning (The Non-Documentable)
  • Records Pulls and System Integration
  • Tech Compliance, Data Sovereignty, and Business Agreements
  • Issue and Problem Classification
  • Edge Cases and the Documentation Gap
  • The Practitioner Empowerment Gap
  • The Perfection Trap: Unrealistic Accuracy Expectations

Each one is a problem that many teams have encountered independently — and that we could dramatically reduce through shared infrastructure.

1. Confidentiality, PII Masking, and Privilege

Legal AI tools handle some of the most sensitive personal information imaginable: immigration status, domestic violence histories, income details, criminal records, health conditions. Every team building an AI tool must figure out how to handle PII — how to mask it during development, how to protect it in production, and how to ensure that attorney-client privilege is not inadvertently waived when data flows through third-party models or services.

All the private info people share (even if you ask them not to), or that lawyers need to gather to process a case, becomes a huge hindrance to developing an effective solution.

Why PII and privacy obligations stall R&D:

  • No shared PII detection and masking libraries tuned for legal help contexts (names in court filings, SSNs in income forms, addresses in safety-sensitive cases). Teams are rebuilding this or trying to figure it out
  • Lack of clear guidance on when privilege attaches in AI-assisted interactions and what data flows can safely involve cloud-based models
  • Each team builds ad hoc masking scripts, often missing edge cases (e.g., names embedded in narrative text, addresses in exhibits)
  • Testing with realistic data is nearly impossible without robust de-identification, forcing teams to test with artificial scenarios that don’t reveal real failure modes

What would help teams protect data and reduce PII exposure:

  • A shared, open-source PII detection and masking toolkit purpose-built for legal help documents — court filings, intake forms, case notes — combining rule-based pipelines with trained NER models for legal-specific entity types
  • Cutting-edge privacy by design mechanisms should be the default, not an afterthought. This means architectures where sensitive data never leaves the secure perimeter in the first place: differential privacy for aggregate analytics, federated learning so models can improve without centralizing client data, synthetic data generation for realistic testing without real PII, and confidential computing or trusted execution environments for the most sensitive processing steps. On-device or on-premises processing for safety-critical intake flows can eliminate entire categories of cloud-based privacy risk
  • Model privilege and confidentiality guidance documents, developed with ethics experts and state bar associations, covering common AI deployment architectures — including when privilege attaches in AI-assisted interactions and what data flows are safe
  • PII data flow diagrams included in every reference architecture, so teams start with a compliant design rather than bolting on privacy later
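
To make the first item above concrete, here is a minimal sketch of what a rule-based layer of a shared PII masking toolkit might look like. The regex patterns (especially the docket-number format) are illustrative assumptions, not a real library’s rules; a production toolkit would combine rules like these with trained NER models for names and addresses.

```python
import re

# Illustrative regex patterns for common PII in legal help documents.
# The CASE_NO docket format is hypothetical; real formats vary by court.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "CASE_NO": re.compile(r"\b\d{2}[A-Z]{2}\d{5,}\b"),
}

def mask_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders like [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

masked = mask_pii("Tenant SSN 123-45-6789, reach at 555-123-4567 or jane@example.org.")
```

Typed placeholders (rather than blanket redaction) let teams test downstream logic with realistic document structure while keeping the sensitive values out of the pipeline.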

2. Conflicts Checking

Legal aid organizations (and other firms) have strict ethical obligations to check for conflicts of interest before providing services. When an AI tool screens a potential client, gathers information, or provides guidance, it may create a conflict that prevents the organization from later representing that person’s adversary — or vice versa. This seemingly simple compliance requirement has deep implications for system architecture, data sharing, and multi-organization collaboration.

Why conflicts checking stalls R&D:

  • AI intake tools often gather enough information to trigger conflict obligations before the organization realizes it
  • No shared technical patterns for conflict-aware intake architecture (when to check, what data to hold vs. discard, how to handle warm handoffs between organizations)
  • Multi-organization triage systems (e.g., statewide referral tools) face compound conflict risks that no single organization can resolve alone
  • Lack of conflict-checking APIs or integration patterns with case management systems (LegalServer, Legal Files, etc.)

What would help do conflict checking efficiently:

  • Reference architecture patterns for conflict-aware intake, including decision trees for when AI-gathered information triggers a conflict check
  • Direct API integration with case management systems — LegalServer, Legal Files, and others — so that AI tools can query the organization’s existing conflict database in real time before collecting detailed case information, rather than building parallel conflict-detection logic
  • Model data retention and handoff protocols for multi-organization triage platforms, specifying what information can be shared during warm handoffs and what must be discarded
  • A living resource documenting emerging case law and ethics opinions on AI-assisted intake and conflict obligations
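
A sketch of the conflict-aware intake pattern described above: check the adverse party against the organization’s existing client list before collecting detailed case facts. The `existing_parties` set is a stand-in for a real-time query against a case management system like LegalServer, whose actual API will differ.

```python
# Minimal sketch: conflict check BEFORE detailed intake, so the tool
# never gathers information it would then have to discard.

def normalize(name: str) -> str:
    # Naive normalization; real systems also need fuzzy/alias matching.
    return " ".join(name.lower().split())

def conflict_check(adverse_party: str, existing_parties: set) -> bool:
    """Return True if the adverse party matches a current or former client."""
    return normalize(adverse_party) in {normalize(p) for p in existing_parties}

def intake_step(adverse_party: str, existing_parties: set) -> str:
    if conflict_check(adverse_party, existing_parties):
        return "refer_out"  # warm handoff to an unconflicted organization
    return "proceed_with_intake"

existing = {"Acme Property Management", "John Q. Landlord"}
decision = intake_step("acme property management", existing)  # matches despite casing
```

The design choice worth noting: the check happens at the earliest possible point in the flow, which is exactly what the reference architecture decision trees above would specify.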

3. Income Verification and Eligibility Determination

Most legal aid services are means-tested: to qualify, a person must demonstrate that their income falls below a certain threshold (often tied to Federal Poverty Guidelines).

Courts need to do means-testing as well. When they give low-income people the option to waive fees to file eviction answers or debt collection responses, they require a form explaining that person’s income and verifying its accuracy. The same goes when they consider whether to reduce a person’s traffic ticket fines based on ‘ability to pay’, or whether to waive past court costs based on income or disability.

Income verification sounds like a simple data check, but in practice it is one of the most time-consuming, discouraging, and error-prone steps in legal help delivery. Income comes from many sources, is reported differently in different documents, and is calculated differently across programs. For a person, proving they qualify for a given service or policy feels like doing their taxes: lots of complicated fields and proof required, and lots of thinking and research to fill it all in correctly.

Why this income/status verification stalls R&D:

  • Every organization builds its own income/eligibility calculation logic, often with subtle bugs (e.g., annualizing hourly wages, handling irregular income, counting household vs. individual income)
  • Eligibility thresholds vary by program, court, funder, and jurisdiction — there is no single lookup table. Same thing for court’s fee waivers or ability to pay determinations
  • Document verification (pay stubs, tax returns, benefits letters) requires extraction from varied formats with no shared tooling
  • Calculating a given household or individual income level requires many different fields and questions
  • AI tools that skip or simplify income checks risk enrolling ineligible clients (compliance failure) or turning away eligible ones (access failure)

What would help overcome this verification choke point:

  • Direct data connections to authoritative eligibility databases rather than rebuilding verification logic from scratch. In many states, systems of record already exist — CalSAWS/CalFresh in California, state SNAP and TANF databases, SSA benefit verification services, Medicaid enrollment systems. The technical challenge is about getting clean, permissioned query access to the systems that already hold the answer to eligibility (rather than having the person calculate their own eligibility). Mapping these authoritative data sources, building adapter patterns for querying them, and advocating for API access where it doesn’t yet exist would eliminate the single most error-prone step in legal aid intake
  • A shared eligibility rules engine for cases where direct database access isn’t available, encoding income thresholds, household size calculations, and program-specific variations as configurable, reusable logic
  • Document extraction templates for common income verification documents (pay stubs, W-2s, SSA letters, benefits statements) that AI tools can reuse across projects
  • Test suites with edge cases: irregular income, self-employment, mixed households, benefits cliff scenarios
  • Partnership with courts, LSC and IOLTA programs to maintain a canonical, machine-readable eligibility threshold database
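
A minimal sketch of what the shared eligibility rules engine could look like. The 2024 federal poverty guideline figures below are illustrative values for the 48 contiguous states; a real engine would load maintained, program-specific threshold tables rather than hard-coding them, and 125% of FPL is just one common standard among many.

```python
# Illustrative 2024 poverty guideline figures (48 contiguous states).
FPL_2024_BASE = 15_060       # household of one
FPL_2024_PER_PERSON = 5_380  # each additional household member

def fpl_threshold(household_size: int, percent: float) -> float:
    """Annual income ceiling at a given percentage of the poverty line."""
    guideline = FPL_2024_BASE + FPL_2024_PER_PERSON * (household_size - 1)
    return guideline * (percent / 100)

def annualize_hourly(hourly_rate: float, hours_per_week: float) -> float:
    # A common subtle bug in homegrown logic: mixing weekly,
    # monthly, and annual figures. Centralize it once.
    return hourly_rate * hours_per_week * 52

def is_income_eligible(annual_income: float, household_size: int,
                       percent_of_fpl: float = 125) -> bool:
    """Many LSC-funded programs use 125% of the poverty guidelines."""
    return annual_income <= fpl_threshold(household_size, percent_of_fpl)

# Household of 3 with one earner at $15/hr, 30 hrs/week:
income = annualize_hourly(15, 30)
eligible = is_income_eligible(income, 3)
```

Encoding the thresholds, household-size math, and annualization in one shared, tested module is what lets every organization stop re-implementing (and re-breaking) the same arithmetic.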

4. Legal Logic and Expert Reasoning (The Non-Documentable)

This is the deepest and most consequential choke point, that takes up so much time of R&D teams. The knowledge that makes a legal aid attorney effective — the ability to synthesize facts, weigh risks, anticipate complications, judge credibility, and make strategic choices — is overwhelmingly tacit. It is not written down in any guide, form, or statute. It lives in the practiced judgment of experienced practitioners, and it varies by jurisdiction, judge, courthouse culture, and case type.

So much knowledge is not written down anywhere!

Why this unwritten lawyer knowledge stalls R&D:

  • AI tools built only on written legal content miss the most important layer: the strategic and practical reasoning that experienced attorneys apply
  • Extracting this knowledge requires structured interviews, scenario walkthroughs, and iterative testing with SMEs — all expensive and slow
  • Even when captured, expert reasoning is often conditional and probabilistic (“usually the judge will…”, “in my experience…”) which is hard to encode faithfully
  • Without this knowledge layer, AI tools give technically correct but practically useless advice — like following the recipe but never having tasted the food

What would help overcome this knowledge gap:

  • Structured knowledge elicitation protocols (interview guides, scenario-based walkthroughs, think-aloud methods) published as reusable toolkits so every project doesn’t reinvent the process
  • A practitioner knowledge contribution framework — structured ways for attorneys to annotate, correct, and enrich AI tool outputs that feed back into shared knowledge bases
  • Journey-aware content models (like the Basics / Process / Complications framework) as standard chunking approaches that preserve the practical reasoning layer rather than flattening expert knowledge into flat FAQ pairs
  • Funded “expert sprint” sessions where practitioners from multiple jurisdictions walk through scenarios together, generating reusable decision logic and edge case libraries
  • Evaluation rubrics that specifically test for practical reasoning quality, not just legal accuracy

5. Records Pulls and System Integration

This one is about getting the authoritative data out of key databases, to make accurate determinations, craft strategies, and fill in forms/draft documents. It’s also about getting new filings and documentation into the authoritative systems.

Many legal AI tasks require information that lives in external systems: court records, property records, benefits databases, criminal history repositories, vital records. Pulling this information programmatically is essential for automation — and almost always harder than expected.

Why this database access stalls R&D:

  • Most court and government record systems lack modern APIs; data access requires screen-scraping, SFTP drops, or manual lookup
  • Record formats vary dramatically across jurisdictions (docket entries, case indexes, property records), with no common schema
  • Access permissions, authentication requirements, and acceptable use policies differ by system and often require formal agreements that take months to execute
  • Real-time access is rarely possible, meaning AI tools must work with stale data and handle the resulting uncertainty

What would help to build these points of database access:

  • A field-wide map of record systems and access pathways for high-priority use cases (eviction records, court calendars, benefits verification) across pilot states, published as navigable guides so that every new project doesn’t have to rediscover who to call and what format to expect
  • Partnership with CourtStack and court technology vendors to advocate for standardized record APIs, starting with case status and calendar endpoints — the same pattern that transformed healthcare interoperability (FHIR) and could do the same for court systems
  • Adapter libraries for common record system patterns (e.g., Odyssey, Tyler Supervise, ICMS) that projects can reuse rather than building from scratch
  • Fallback architecture patterns for when API access is unavailable — including cached snapshots, polling strategies, and graceful degradation designs that handle stale data transparently
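
The last pattern above, a cached snapshot with transparent staleness, can be sketched as follows. The cache store and the optional live-fetch function are placeholders; the point is that the tool always reports how fresh its data is instead of silently serving stale records.

```python
import time

MAX_AGE_SECONDS = 24 * 3600  # tolerate day-old court record snapshots

def get_case_status(case_id: str, cache: dict, fetch=None):
    """Return (status, is_stale) so callers can surface data age to users."""
    now = time.time()
    if fetch is not None:
        try:
            status = fetch(case_id)       # live pull when an API exists
            cache[case_id] = (status, now)
            return status, False
        except Exception:
            pass                          # fall through to the snapshot
    entry = cache.get(case_id)
    if entry is not None:
        status, fetched_at = entry
        return status, (now - fetched_at) > MAX_AGE_SECONDS
    return None, True                     # no data at all: degrade gracefully

# Two-day-old snapshot, no live API available:
cache = {"24CV001": ("active", time.time() - 2 * 24 * 3600)}
status, stale = get_case_status("24CV001", cache)
```

Returning a staleness flag alongside the data is what makes "handle stale data transparently" a concrete interface contract rather than a vague aspiration.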

6. Tech Compliance, Data Sovereignty, and Business Agreements

Another big one is demonstrating privacy compliance, which is harder than it looks. Along with privacy rules, the solution must also follow other regulations that require ethical compliance.

Before a legal AI tool can go live, the deploying organization must navigate a thicket of compliance requirements: data processing agreements with cloud providers, BAAs for tools that touch health information, state-specific data residency rules, ADA accessibility requirements, terms of service for upstream AI model providers, and often funder-specific technology policies (e.g., a philanthropy’s regulations on technology use with grant funds).

Why this compliance work stalls R&D:

  • Each organization negotiates these agreements independently, with limited legal and technical capacity to evaluate complex terms
  • Model provider terms of service change frequently and may include training-on-input clauses that conflict with confidentiality obligations
  • State-level data sovereignty requirements (especially for courts) are poorly documented and inconsistently applied
  • Accessibility compliance (WCAG 2.1 AA) is often treated as an afterthought rather than a design requirement, leading to costly retrofits
  • The result: months of legal review before any code ships

What would help groups navigate this compliance:

  • Template agreements (DPAs, BAAs, model provider addenda) pre-negotiated for common legal aid deployment patterns, so organizations can adopt rather than draft from scratch
  • A regulatory landscape tracker covering state-by-state AI and data requirements for legal services, maintained as a living resource
  • An accessibility compliance checklist and testing protocol specific to legal help AI interfaces
  • Pre-approved technology standards developed with LSC, IOLTA programs, and major funders that organizations can reference in compliance documentation
  • A procurement and compliance working group that pools expertise and shares reviewed vendor assessments, so that one organization’s diligence benefits everyone

7. Issue and Problem Classification

Legal problems rarely arrive pre-labeled. But many teams’ solutions rely on correct classification of a person’s scenario, to get it slotted and connected correctly. This is the classification problem.

A person contacts a legal aid hotline and says, “My landlord is trying to kick me out and I haven’t been able to work because of my injury.” That single sentence may implicate housing (eviction defense), employment (wrongful termination or disability accommodation), public benefits (workers’ compensation, SSI/SSDI), and possibly immigration or family law depending on context. Correctly classifying the legal issues is foundational to triage, routing, and service delivery — and it is surprisingly hard to automate.

Why this classification problem stalls R&D:

  • Each organization relies on its own internal taxonomy, similar to others’ but just different enough that it has to be maintained and deployed one organization at a time. These taxonomies are not cross-walked to common ones.
  • Existing taxonomies (like LIST) provide excellent coverage for some areas, but require human judgment to apply to real-world narratives that rarely map cleanly to a single code. There are no common, easily available tools for doing LIST or other classification.
  • Multi-issue cases are the norm, not the exception — but most AI systems are built to classify to a single primary issue
  • Classification accuracy drops sharply for uncommon issue types, non-English speakers, and cases involving intersecting legal domains
  • Without reliable classification, every downstream step — routing, content retrieval, eligibility determination — is compromised

What would help overcome the classification trouble:

  • A shared problem/taxonomy classifier that anyone can use to correctly spot and label the issues present in a person’s problem scenario, documents, narratives, and more.
  • More groups using the same LIST taxonomy and contributing to it, rather than creating their own.
  • Crosswalks between LIST and other taxonomies in active use (LSC problem codes, state-specific codes, NSLA categories) so that classification outputs are interoperable across organizations and systems. Even for those who want to use their own taxonomy, crosswalk it over.
  • Gold-standard classification test sets using LIST codes, with realistic multi-issue scenarios, non-English examples, and edge cases — maintained as a shared benchmark
  • Multi-label classification approaches (not single-label) as the default pattern in reference architectures
  • Iterative refinement of classification models incorporating real intake data (de-identified) to improve accuracy over time
  • Published classification accuracy benchmarks by issue area and language, so organizations can make informed decisions about where AI classification is reliable enough to deploy
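
To illustrate the multi-label and crosswalk ideas together, here is a toy sketch. The keyword rules stand in for a trained classifier, and the internal-label-to-LIST mapping uses made-up LIST-style codes; a real crosswalk would be a maintained, shared artifact.

```python
# Hypothetical keyword cues standing in for a trained classifier.
KEYWORD_RULES = {
    "housing.eviction": ["kick me out", "evict", "landlord"],
    "employment.injury": ["injury", "hurt at work", "workers comp"],
    "benefits.disability": ["disability", "ssi", "ssdi"],
}

# Hypothetical crosswalk from internal labels to LIST-style codes.
CROSSWALK = {
    "housing.eviction": "HO-02-00-00-00",
    "employment.injury": "EM-04-00-00-00",
    "benefits.disability": "PB-03-00-00-00",
}

def classify(narrative: str) -> list:
    """Return ALL matching labels: multi-issue cases are the norm."""
    text = narrative.lower()
    return [label for label, cues in KEYWORD_RULES.items()
            if any(cue in text for cue in cues)]

def to_list_codes(labels: list) -> list:
    return [CROSSWALK[l] for l in labels if l in CROSSWALK]

labels = classify("My landlord is trying to kick me out and I "
                  "haven't been able to work because of my injury.")
codes = to_list_codes(labels)   # interoperable output for other systems
```

Note that the hotline sentence from the section above produces two labels, not one: building `classify` to return a list is the multi-label default the reference architectures should enforce.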

8. Edge Cases and the Documentation Gap

Edge cases are where legal AI tools are most likely to fail and where failure matters most. A tenant who is both a victim of domestic violence and an undocumented immigrant facing eviction needs different guidance than a straightforward non-payment case. A person with a cognitive disability navigating a court self-help center needs a different interaction pattern than someone who is tech-savvy and literate. These edge cases are poorly documented, poorly tested, and poorly handled by current tools.

Why this lack of edge case documentation stalls R&D:

  • Edge cases are discovered in production, when the tool is already far down the development journey or maybe even in pilot — by the time they surface, real people may have been harmed.
  • There is no systematic mechanism for organizations to share the edge cases they discover with other teams building similar tools
  • Test suites focus on common scenarios because edge case data is scarce; this creates a false sense of confidence in model performance
  • The highest-stakes edge cases (safety, immigration consequences, mandatory reporting triggers) are also the most sensitive to document and share

What Edge Case documentation would help:

  • A shared, de-identified edge case library organized by issue area, with structured fields for the scenario, the failure mode, the correct handling, and the lessons learned
  • Edge case discovery protocols built into every reference architecture — structured ways for pilot teams to identify, document, and escalate unexpected scenarios
  • Safety-specific test suites for high-risk situations like mandatory reporting triggers, imminent harm indicators, and immigration red flags that every tool should pass before deployment
  • A cross-organization learning loop where teams regularly share anonymized edge case reports and collectively develop response patterns
  • Edge case coverage as a required dimension in all evaluation rubrics — a tool cannot score well without demonstrating handling of non-standard scenarios
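
The structured fields suggested for the shared edge case library could be sketched as a simple record type. The field names here are illustrative, not a published schema; the point is that a shared, typed structure is what makes cross-organization sharing and test-suite generation possible.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeCaseRecord:
    issue_area: str                  # e.g. a LIST code or internal label
    scenario: str                    # de-identified description of the situation
    failure_mode: str                # what the tool did wrong
    correct_handling: str            # what should have happened
    lessons: list = field(default_factory=list)
    safety_critical: bool = False    # flags records for required safety test suites

record = EdgeCaseRecord(
    issue_area="housing.eviction",
    scenario="Tenant facing eviction also discloses domestic violence",
    failure_mode="Tool suggested standard non-payment defenses only",
    correct_handling="Surface DV-specific protections and safety resources first",
    lessons=["A DV disclosure must reroute the guidance flow"],
    safety_critical=True,
)
```

Records flagged `safety_critical` are exactly the ones the safety-specific test suites above would draw from.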

9. The Practitioner Empowerment Gap

Another big choke point around legal help AI R&D is that we don’t have enough people working on it. If a lawyer, paralegal, or other team member has a great idea for transforming their workflow, they often feel like they have to wait for a tech/design team to help them build it. Where is the innovation team to help them? The legal experts’ ideas stall out while they wait.

The innovation teams are so oversubscribed, and there is so much to do, that the ideas lose momentum and don’t move into serious development or testing.

The people who understand legal help delivery best — attorneys, paralegals, court clerks, legal aid managers, navigators — are overwhelmingly not the ones building the AI tools. They are waiting for technologists to build something, then reacting to it. This dynamic is backwards.

Legal experts hold the domain knowledge that determines whether a tool actually works, but they rarely feel empowered to drive the transformation themselves. They don’t see their daily work reflected in system-level workflow diagrams. They aren’t given tools or frameworks that let them participate in design, specification, or evaluation at a meaningful level. Instead, they are positioned as reviewers of someone else’s interpretation of their work — and by the time they see a prototype, the foundational assumptions are already locked in.

Why this expert practitioner gap stalls R&D:

  • Legal experts are treated as consultants to technology projects rather than co-designers or owners of the transformation
  • Workflow mapping and systems thinking are rarely part of legal training, so practitioners don’t see their expertise as relevant to technology design — even though it is the most critical input
  • Technology teams build from their own assumptions about how legal help works, encoding misunderstandings that practitioners could have caught in the first hour
  • The field loses the practitioners who are best positioned to drive change, because they don’t see a path from their current role to meaningful participation in AI development
  • Organizations wait for external vendors or grant-funded tech teams to “bring AI to them” rather than building internal capacity to shape and lead their own AI strategy

What would help more legal help teams get empowered with AI R&D:

  • Reference architectures and playbooks designed so that legal experts — not just technologists — can use them to specify, scope, and evaluate AI tools for their own contexts
  • Practitioner-facing workflow mapping tools and templates that let legal staff document their own processes in a format that directly feeds AI design (bridging the gap between “how I do my job” and “what the system should do”)
  • Legal practitioners embedded as co-leads and co-designers in build sprints, not as advisors brought in after architecture decisions are made. This could also be through taking experts off their regular teams for 6 weeks or 6 months to see a project through.
  • Training programs built around empowering legal professionals to lead AI projects, not just understand them
  • Low-code and no-code specification frameworks (e.g., structured scenario templates, decision tree builders) that let practitioners author the logic without writing code
  • A narrative shift from “technologists building for legal aid” to “legal experts building with technology support” — and funding structures that match

10. The Perfection Trap: Unrealistic Accuracy Expectations

This last one is further down the R&D road — where an idea has been built out. It might even have gone through many, many rounds of refinement. But then the choke point emerges: leaders say it cannot be put into pilot until it is (nearly) perfect.

(Even though this is not the standard for human legal teams… )

Projects stall or get taken offline because stakeholders demand sky-high accuracy across every dimension before greenlighting a pilot. A triage tool must classify every issue correctly. A document assistant must never produce a flawed draft. An intake chatbot must handle every possible scenario without error.

Too-high standards can mean things can never go to pilot.

This all-or-nothing standard sounds responsible, but it actually prevents responsible deployment by conflating genuinely high-risk functions with lower-risk ones that could safely launch with good-enough performance and human oversight.

It ignores what the current baseline is. Often this baseline is no services at all — sending people to DIY or use free, non-specialized AI tools or Reddit or informal advice to solve their problem. These are likely to have much, much lower performance scores. Or the baseline is human legal teams, which likely will have higher quality performance scores — but not 99% accuracy and safety.

Why this Perfection Gap stalls R&D:

  • Decision-makers apply a single accuracy bar across all functions, rather than differentiating by actual risk level — a wrong answer about courthouse hours is not the same as a wrong answer about a filing deadline
  • The comparison baseline is implicit perfection, not the current reality — where people routinely get no help at all, or get wrong information from overwhelmed staff, or miss deadlines because they couldn’t reach anyone
  • Fear of liability and bad press creates institutional paralysis: no one wants to be the organization that deployed an AI tool that gave wrong legal advice, even if the alternative is that thousands of people get no advice
  • Evaluation frameworks rarely distinguish between “high-risk, must-be-right” functions (safety screening, mandatory reporting triggers, deadline calculations) and “lower-risk, value-even-if-imperfect” functions (general orientation, resource finding, form field explanations)
  • Pilot proposals die in committee because reviewers focus on the 5% failure cases rather than the 95% improvement over the status quo of no help

What would help teams responsibly deal with this perfection gap:

  • A risk-tiered evaluation framework that categorizes AI functions by actual consequence of error — distinguishing safety-critical functions (where near-perfect accuracy is genuinely required) from informational and navigational functions (where good-enough performance with human backup is a massive improvement over no service)
  • “Compared to what?” baselines for common legal help scenarios: what is the current accuracy, completeness, and timeliness of help that people actually receive today? Make the real comparison explicit so that decision-makers evaluate AI tools against the actual alternative, not against an imagined perfect system
  • Graduated deployment models: start with low-risk functions (information, orientation, resource finding), demonstrate safety and value, then expand to higher-risk functions as confidence and evidence accumulate
  • Model governance frameworks that show how to combine AI assistance with human review at calibrated levels — heavy oversight for high-risk functions, lighter touch for lower-risk ones — so that organizations can deploy responsibly without requiring perfection
  • Risk-tier mapping as a standard step in every implementation playbook, so that teams and their stakeholders explicitly agree on which functions require what level of accuracy before development begins
  • Advocacy with funders and regulators for evidence-based accuracy standards rather than zero-tolerance policies, using data from pilots to demonstrate that imperfect AI plus human oversight outperforms no AI at all
  • Insurance products that can make consumers whole again if there is a problem — but that still allow a team to move forward if their solution meets the agreed-upon standard.
  • Evaluation teams in well-resourced, public interest organizations who can maintain more realistic standards and also help teams carry out reliable, right-sized performance evaluations, so teams have more accurate knowledge of how their tool is actually performing.
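To make risk-tier mapping concrete, here is a minimal sketch of what an agreed-upon tier map might look like in practice. The tier names, function labels, and accuracy thresholds below are purely hypothetical illustrations, not proposed standards:

```python
# Illustrative risk-tier map for AI functions in a legal help tool.
# All tiers, functions, and thresholds are hypothetical examples.

RISK_TIERS = {
    "safety_critical": {"min_accuracy": 0.99, "oversight": "human review of every output"},
    "consequential":   {"min_accuracy": 0.95, "oversight": "spot-checks plus escalation path"},
    "informational":   {"min_accuracy": 0.85, "oversight": "periodic audits"},
}

FUNCTION_TIERS = {
    "deadline_calculation":        "safety_critical",
    "mandatory_reporting_trigger": "safety_critical",
    "triage_classification":       "consequential",
    "form_field_explanation":      "informational",
    "courthouse_hours":            "informational",
}

def passes_bar(function: str, measured_accuracy: float) -> bool:
    """Check a function's measured accuracy against its tier's required bar."""
    tier = RISK_TIERS[FUNCTION_TIERS[function]]
    return measured_accuracy >= tier["min_accuracy"]

# The same 92% accuracy clears the bar for one function and not another:
print(passes_bar("courthouse_hours", 0.92))      # True
print(passes_bar("deadline_calculation", 0.92))  # False
```

The point of a table like this is that the accuracy bar is set per function before development begins, so a 92%-accurate resource finder can go to pilot under light oversight while the same score rightly blocks a deadline calculator.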

The Shared Infrastructure Strategy to Overcome These 10 Choke Points

Each of these choke points, encountered individually, is enough to stall or kill a project. Encountered together — which is what happens to every team — they explain why legal AI R&D feels so much harder than it should be.

The Legal Help Commons is designed specifically to address this pattern. Rather than letting every team independently solve the same problems, the Commons creates shared resources that any team can draw on — organized across three pillars:

JusticeBench: Project inventory, benchmarks, test suites, classification datasets, edge case libraries, regulatory landscape tracker

Implementation Library: Reference architectures, PII and privacy toolkits, eligibility engines, template agreements, adapter libraries

Cohorts & Community: Power Groups, expert sprints, edge case sharing loops, working groups, cross-org learning

The key insight is that these choke points are not unique to any one project. They are field-level problems that require field-level solutions. No single organization should have to negotiate its own data privacy agreements from scratch, build its own PII masking pipeline, map its own path to court record APIs, or discover its own edge cases through trial and error. Shared infrastructure makes these solved problems rather than ongoing obstacles.

What Changes If We Deal With These R&D Blocks

If the field successfully builds shared infrastructure around these ten choke points, the impact compounds:

  • Project timelines compress from months to weeks as teams start from working reference architectures instead of blank pages
  • Quality improves as edge case libraries and evaluation rubrics capture hard-won lessons across organizations
  • Costs drop as privacy toolkits, eligibility engines, and template agreements eliminate duplicated work
  • Equity improves as smaller organizations gain access to the same infrastructure as well-resourced ones
  • Trust grows as the field demonstrates shared standards for safety, accuracy, and accountability
  • Practitioners lead as workflow mapping tools and low-code frameworks give legal experts direct authorship over AI system design

What We Need From the Field

Shared infrastructure requires shared participation. To build these resources, we need specific contributions, especially from those who have been on R&D journeys or have supported them.

  • Practitioner time: Attorneys and legal aid staff willing to participate in expert sprints, annotate AI outputs, and share edge cases
  • Pilot partners: Organizations willing to test reference architectures in real settings and report back on what works and what doesn’t
  • Technical contributors: Developers, data scientists, and researchers who can build and maintain open-source tooling
  • Funder alignment: Grants and contracts that support shared infrastructure, not just point-to-point tool development
  • Honest reporting: Teams willing to share failures and difficulties, not just successes — because the failures are where the learning lives

Legal help AI R&D gets stuck not because the technology is inadequate, but because every team hits the same structural barriers alone. The Legal Help Commons is building the shared infrastructure to clear those barriers once — so that the field’s energy goes into innovation, not rework.
