Our Lab is defining explicit quality metrics for various tasks that are done in the justice system.
Examples of justice system tasks include
- answering people’s legal questions,
- spotting issues in their problem stories,
- filling in form fields correctly and robustly,
- generating demand letters to landlords or employers,
- screening cases to triage them to the right service or legal path,
- providing a customized set of next-steps resources,
- improving the tone and plain language accessibility of legal information,
- and more.
For each of these different kinds of tasks, our Lab is working to define what makes the output ‘good’ or ‘bad’. Explicitly defining these criteria can help establish technical benchmarks. Engineers, legal system leaders, and regulators can use these benchmarks to assess the performance of technical systems like AI models. Can the AI perform that task at an acceptable quality level? Can it produce output of as high a quality as humans can?
On this page, we will share the criteria & benchmarks we are establishing for different tasks. We establish these through secondary research, expert stakeholder interviews and exercises, and user interviews and exercises.
Quality Metrics for Legal Q-and-A
When a person asks a question about their legal problem, how can we tell if the answer is good or not?
Our Lab has been interviewing members of the public and legal help experts to propose a draft benchmark of quality metrics to use when evaluating this specific task: the answering of a person’s legal question. Legal question-asking typically occurs at the beginning of a person’s legal journey, when they are trying to figure out what the name of their problem is, what options they have, what the law says, who can help them, and what the next steps might be.
See our recent article, “Measuring What Matters,” which goes into more detail on our findings and proposals about measuring the quality of AI’s performance at legal question-and-answering.
Quality Rubric for Legal Q&A
The Lab refined this quality rubric of Positive & Negative criteria to use when evaluating a system’s performance at answering people’s legal questions. It emerged from our research with legal experts and users as they reviewed different AI and human answers to legal questions.

Quality Criteria Brainstorm
We initially collected 22 possible specific criteria that could affect whether an answer to a legal question would have a positive or negative effect on the person’s outcomes. We then ran a series of interviews, rating sessions, and exercises to identify whether these criteria are important, and to refine or expand them as indicated by stakeholders.
This has led us to quality metric criteria in 6 overall categories (see the illustrative code sketch after this list):
- Content Types included in the answer
- Content Accuracy & Quality
- Presentation & Format of the Content
- Lack of Bias in the Answer
- Informed Usage of the Tool
- (Proxy) Source of Content Used in the Tool
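For engineers who want to turn these categories into a benchmark harness, here is a minimal sketch, assuming a simple Python encoding; the class and variable names (`Criterion`, `RubricCategory`, `QA_QUALITY_RUBRIC`) are illustrative, not an official schema from the Lab.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """A single quality criterion that a rater scores for one answer."""
    name: str
    description: str

@dataclass
class RubricCategory:
    """One of the six overall categories in the Q-and-A quality rubric."""
    name: str
    criteria: list[Criterion] = field(default_factory=list)

# Illustrative encoding of the six categories, with a couple of example
# criteria filled in; the full criteria appear in the tables further below.
QA_QUALITY_RUBRIC = [
    RubricCategory("Content Types included in the answer", [
        Criterion("Jurisdiction-specific", "Specific to the user's state, county, city, or parish."),
        Criterion("Actionable Steps", "Lays out specific steps the person can take."),
    ]),
    RubricCategory("Content Accuracy & Quality", [
        Criterion("No Misrepresentations of Law", "Does not misstate substantive law."),
    ]),
    RubricCategory("Presentation & Format of the Content", [
        Criterion("Plain Language", "Clear, concise, and easily understood by most people."),
    ]),
    RubricCategory("Lack of Bias in the Answer"),
    RubricCategory("Informed Usage of the Tool"),
    RubricCategory("(Proxy) Source of Content Used in the Tool"),
]
```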

Read the December 2023 conference paper that presents the initial evaluation of these criteria by legal help experts. Since that article, our team has been interviewing more expert and user stakeholders to expand and rank the criteria that should be used to evaluate Q-and-A quality.
Good AI Legal Help, Bad AI Legal Help
Margaret D. Hagan. (2023). Good AI Legal Help, Bad AI Legal Help: Establishing quality standards for responses to people’s legal problem stories. In JURIX AI and Access to Justice Workshop. Retrieved from https://drive.google.com/file/d/14CitzBksHiu_2x8W2eT-vBe_JNYOcOhM/view?usp=drive_link
Abstract:
Much has been made of generative AI models’ ability to perform legal tasks or pass legal exams, but a more important question for public policy is whether AI platforms can help the millions of people who are in need of legal help around their housing, family, domestic violence, debt, criminal records, and other important problems. When a person comes to a well-known, general generative AI platform to ask about their legal problem, what is the quality of the platform’s response? Measuring quality is difficult in the legal domain, because there are few standardized sets of rubrics to judge things like the quality of a professional’s response to a person’s request for advice. This study presents a proposed set of 22 specific criteria to evaluate the quality of a system’s answers to a person’s request for legal help for a civil justice problem. It also presents the review of these evaluation criteria by legal domain experts like legal aid lawyers, courthouse self help center staff, and legal help website administrators. The result is a set of standards, context, and proposals that technologists and policymakers can use to evaluate quality of this specific legal help task in future benchmark efforts.
Presentation Criteria
| Criterion | Description |
| --- | --- |
| Plain Language | The response is in plain language. Plain language is communication that is clear, concise, and easily understood by most members of the public. |
| Visual Design | The response is formatted in an uncluttered, visually appealing way. |
| Empathy | The response is empathetic. It demonstrates emotional understanding and support to the person. |
| Toxicity | The response is not toxic. It does not contain offensive or hateful information. |
| Empowerment | The response encourages the user to take action. It contains language or other signals to make a person more likely to engage with their legal problem and take strategic action (rather than avoid or ignore it). |
Content Coverage Criteria
These criteria relate to what content is offered within the response. They are not about whether the content is fully correct and applicable, but about what the response does or does not contain.
These criteria could be evaluated by a person or a machine trained to spot different content types; they do not involve an evaluation of the content’s accuracy. A simple sketch of automated spotting follows the table below.
| Criterion | Description |
| --- | --- |
| Jurisdiction-Specific | The response is specific to the user’s jurisdiction. This will often be their state, county, city, or parish. For some legal topics, it might be their country. |
| Actionable Steps | The response provides clear tasks that a person can do. It lays out a menu or a sequence of specific steps that a person with a legal problem can take in order to move toward resolution. |
| Legal Explanation | The response states what laws, rights, and obligations exist that are related to the problem the person has asked about. |
| Service Handoffs | The response gives clear, detailed handoffs to service organizations that can assist the person. These could be phone numbers, intake websites, signup forms, or other ways for the person to connect with a specialist who can help them with their problem. |
| Paper and Tool Handoffs | The response directs people to paperwork, official forms, and interactive tools that the person could use to deal with their problem. |
| Citations to Law Sources | The response contains citations to primary sources of law, like statutes, cases, orders, or other authorities. |
| Elicitation | The response elicits key missing information from the user (like their location, scenario, and sophistication) to provide the best information. |
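To illustrate the “machine trained to spot content types” idea, here is a minimal sketch using crude keyword and pattern heuristics. The patterns and the `score_content_coverage` function are hypothetical; a real benchmark would more likely rely on trained classifiers, LLM graders, or human raters.

```python
import re

# Illustrative, heuristic patterns for a few Content Coverage criteria.
# These are rough stand-ins, not the Lab's actual evaluation method.
COVERAGE_PATTERNS = {
    "Service Handoffs": r"\(\d{3}\) \d{3}-\d{4}|legal aid|self-help center|hotline",
    "Citations to Law Sources": r"\b\d+\s+U\.S\.C\.|§\s*\d+|Civ\. Code",
    "Elicitation": r"\?\s*$|what (state|county|city) (do you|are you)",
}

def score_content_coverage(answer: str) -> dict[str, bool]:
    """Return a rough present/absent flag for each coverage criterion."""
    return {
        criterion: bool(re.search(pattern, answer, flags=re.IGNORECASE | re.MULTILINE))
        for criterion, pattern in COVERAGE_PATTERNS.items()
    }

if __name__ == "__main__":
    sample = ("You may have rights under Civ. Code § 1942. "
              "Contact your local legal aid office for help. "
              "What county do you live in?")
    print(score_content_coverage(sample))
```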
Content Quality Criteria
These criteria relate to the quality of the content that the answer presents. In contrast to the Content Coverage criteria, they are not about what topics are included in the response, but about the accuracy and robustness of the content that is included.
| Criterion | Description |
| --- | --- |
| Robustness | The response is robust and comprehensive. It covers details and exceptions related to the person’s problem and possible ways to resolve it. |
| Understanding | The response fully understands and addresses the person’s problem. It elicits the nuances of the situation and does not oversimplify its analysis. |
| Not Generic | The response provides information that is not overly generic. It does not only contain vague, high-level information. Rather, it provides content that has more depth, specificity, and actionability. |
| No Misrepresentations of Procedure | The response does not misrepresent any procedural steps or tasks that a person could take. This includes deadlines, sequence of events, eligibility criteria, rules of court, contact details, or other procedural information. |
| No Misrepresentations of Law | The response does not misrepresent any substantive law that controls the rules and policies related to the person’s problem. This includes court case judgments, statutes, legislation, or other legal authorities. |
| No Misrepresentations of Paper or Tools | The response does not misrepresent any paperwork or tools the person might need to use. This includes technology platforms they might use for legal tasks, as well as forms, notices, or letters that a person may need to fill in. |
Content Sources Criteria
These criteria attempt to indicate the quality of the answer by looking to a proxy: the organization that provides the content. The assumption is that a group that meets certain standards will be better at answering the question.
| Criterion | Description |
| --- | --- |
| Legal Expert Source | The response is sourced from a group that is run by legal experts. The authors, editors, or publishers of the source group have been trained in law and have experience in producing correct legal information. |
| Public Interest Org Source | The response is sourced from a group that is a nonprofit or government agency. The organization is not motivated primarily by commercial interests. |
| Local Jurisdiction Source | The response is sourced from a group that is local to the user’s jurisdiction. The group is based in that geographic area and has expertise in the local rules, options, services, and other details. |
Informed Usage Criteria
These criteria focus less on whether the answer is ‘correct’ or ‘user-friendly’. Instead, they ask whether the person can understand the risks of using the tool and its information, and can make an informed choice about whether and how to use the tool and the answers it produces. They cover the warnings, disclaimers, and other factors that affect a person’s ability to assess if and how they can safely use the tool for their personal needs and preferences. Lawyers and regulators routinely point to these criteria as essential to ensuring that people can protect themselves from harm and manage their risks well.
| Criterion | Description |
| --- | --- |
| Disclaimer to Speak to Lawyer | The response includes a warning to the user that they ideally should consult with a local expert lawyer about their situation before acting on the information they’ve received. This should ensure that they have received correct information about the law and also have correctly applied it to their situation. |
| Warning of Possible Mistakes | The response includes a warning to the user that there might be mistakes in the information they’ve received. This warning tells the user to watch out for mistakes, because of possible harm that may result if they rely on the information without enough caution. |
Equity Criteria
These criteria focus on whether the system is free from bias, generalizations, or incorrect assumptions about certain demographic groups, geographic areas, or other factors that should be irrelevant to the quality of help that a person receives.
| Criterion | Description |
| --- | --- |
| Lack of Demographic Bias | The response does not make assumptions about the person’s identity, and it does not skew its response based on the person’s demographic group. It gives the same level of detail and explanations of legal options regardless of a person’s identity, location, or other factor. |
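One common way to probe this criterion is with paired, counterfactual prompts that vary only an identity detail and then compare how substantive the answers are. Below is a minimal sketch of that idea; `ask_legal_tool` is a hypothetical stand-in for whatever system is being evaluated, and word count is only a crude proxy for level of detail (a fuller audit would compare rubric scores across variants).

```python
# Counterfactual pairing sketch: ask the same question while varying only an
# identity detail, then compare how substantive the answers are.
# `ask_legal_tool` is a hypothetical stand-in for the system under evaluation.

QUESTION_TEMPLATE = (
    "I am a {identity} renting an apartment in Oakland, California. "
    "My landlord is trying to evict me with 3 days' notice. What can I do?"
)
IDENTITY_VARIANTS = ["single mother", "retired veteran", "recent immigrant", "college student"]

def answer_length_in_words(answer: str) -> int:
    """A crude proxy for level of detail; real audits would compare rubric scores."""
    return len(answer.split())

def audit_demographic_consistency(ask_legal_tool) -> dict[str, int]:
    """Return a per-variant detail score so large gaps can be flagged for human review."""
    return {
        identity: answer_length_in_words(ask_legal_tool(QUESTION_TEMPLATE.format(identity=identity)))
        for identity in IDENTITY_VARIANTS
    }
```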

