Quality Metrics for AI at Justice Tasks

Our Lab is defining explicit quality metrics for the various tasks performed in the justice system.

Examples of justice system tasks include:

  • answering people’s legal questions,
  • spotting issues in their problem stories,
  • filling in form fields correctly and robustly,
  • generating demand letters to landlords or employers,
  • screening cases to triage them to the right service or legal path,
  • providing a customized set of next-steps resources,
  • improving the tone and plain language accessibility of legal information,
  • and more.

For each of these different kinds of tasks, our Lab is working to define what makes the output ‘good’ or ‘bad’. Explicitly defining these criteria can help establish technical benchmarks that engineers, legal system leaders, and regulators can use to assess the performance of technical systems like AI models. Can the AI perform the task at an acceptable quality level? Can it produce output of as high a quality as a human can?
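
To make this concrete, here is a minimal sketch of what such a benchmark comparison might look like in code. It is an illustration only, assuming a per-task quality score between 0 and 1 and a made-up acceptability threshold; the names, task IDs, and numbers are hypothetical and do not come from the Lab's actual benchmarks.

  # Illustrative sketch only: scores, threshold, and task IDs are hypothetical,
  # not the Lab's actual benchmark implementation.
  from dataclasses import dataclass
  from statistics import mean

  @dataclass
  class TaskResult:
      task_id: str
      ai_score: float     # quality score for the AI's output, 0.0 to 1.0
      human_score: float  # quality score for a human expert's output, 0.0 to 1.0

  ACCEPTABLE_QUALITY = 0.8  # assumed threshold; a real benchmark would set this empirically

  def summarize(results: list[TaskResult]) -> dict:
      """Answer the two benchmark questions: is the AI acceptable, and is it at human parity?"""
      ai_avg = mean(r.ai_score for r in results)
      human_avg = mean(r.human_score for r in results)
      return {
          "ai_average": ai_avg,
          "human_average": human_avg,
          "meets_acceptable_quality": ai_avg >= ACCEPTABLE_QUALITY,
          "at_human_parity": ai_avg >= human_avg,
      }

  # Example usage with made-up scores:
  print(summarize([
      TaskResult("answer-legal-question-001", ai_score=0.72, human_score=0.85),
      TaskResult("fill-form-fields-014", ai_score=0.91, human_score=0.88),
  ]))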

On this page, we will share the criteria & benchmarks we are establishing for different tasks. We develop these through secondary research, along with interviews and exercises with both expert stakeholders and everyday users.

Quality Metrics for Legal Q-and-A

When a person asks a question about their legal problem, how can we tell if the answer is good or not?

Our Lab has been interviewing members of the public and legal help experts to propose a draft benchmark of quality metrics to use when evaluating this specific task: the answering of a person’s legal question. Legal question-asking typically occurs at the beginning of a person’s legal journey, when they are trying to figure out what the name of their problem is, what options they have, what the law says, who can help them, and what the next steps might be.

We initially collected 22 specific candidate criteria that could affect whether an answer to a legal question would have a positive or negative effect on the person’s outcomes. We then ran a series of interviews, rating sessions, and exercises to determine whether these criteria are important, and to refine or expand them based on stakeholder feedback.

This has led us to quality metric criteria in 6 overall categories (an illustrative sketch of how they might be scored as a rubric follows the list):

  1. Content Types included in the answer
  2. Content Accuracy & Quality
  3. Presentation & Format of the Content
  4. Lack of Bias in the Answer
  5. Informed Usage of the Tool
  6. (Proxy) Source of Content Used in the Tool
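
As a purely illustrative sketch, these six categories could be encoded as a weighted rubric that an evaluator (human or automated) fills in for each answer. The category keys, equal weights, 0-to-5 rating scale, and aggregation below are assumptions for demonstration, not the Lab's published methodology.

  # Hypothetical rubric encoding of the six categories; weights and scale are
  # illustrative assumptions, not the Lab's actual evaluation protocol.
  CATEGORY_WEIGHTS = {
      "content_types": 1.0,        # Content Types included in the answer
      "accuracy_quality": 1.0,     # Content Accuracy & Quality
      "presentation_format": 1.0,  # Presentation & Format of the Content
      "lack_of_bias": 1.0,         # Lack of Bias in the Answer
      "informed_usage": 1.0,       # Informed Usage of the Tool
      "source_of_content": 1.0,    # (Proxy) Source of Content Used in the Tool
  }

  def rubric_score(ratings: dict[str, int]) -> float:
      """Aggregate per-category ratings (0 to 5) into a weighted 0.0-to-1.0 score."""
      if set(ratings) != set(CATEGORY_WEIGHTS):
          raise ValueError("Every category must be rated exactly once.")
      total_weight = sum(CATEGORY_WEIGHTS.values())
      weighted = sum(CATEGORY_WEIGHTS[c] * (r / 5) for c, r in ratings.items())
      return weighted / total_weight

  # Example: one evaluator's ratings for a single AI-generated answer.
  print(rubric_score({
      "content_types": 4,
      "accuracy_quality": 3,
      "presentation_format": 5,
      "lack_of_bias": 5,
      "informed_usage": 2,
      "source_of_content": 3,
  }))  # -> approximately 0.733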

Read our initial December 2023 conference paper, which presents legal help experts’ first evaluation of these criteria. Please note that since that paper was published, our team has been interviewing additional expert and user stakeholders to expand and rank the criteria that should be used to evaluate Q-and-A quality.

Good AI Legal Help, Bad AI Legal Help

Margaret D. Hagan. (2023). Good AI Legal Help, Bad AI Legal Help: Establishing quality standards for responses to people’s legal problem stories. In JURIX AI and Access to Justice Workshop. Retrieved from https://drive.google.com/file/d/14CitzBksHiu_2x8W2eT-vBe_JNYOcOhM/view?usp=drive_link 

Abstract:

Much has been made of generative AI models’ ability to perform legal tasks or pass legal exams, but a more important question for public policy is whether AI platforms can help the millions of people who are in need of legal help around their housing, family, domestic violence, debt, criminal records, and other important problems. When a person comes to a well-known, general generative AI platform to ask about their legal problem, what is the quality of the platform’s response? Measuring quality is difficult in the legal domain, because there are few standardized sets of rubrics to judge things like the quality of a professional’s response to a person’s request for advice. This study presents a proposed set of 22 specific criteria to evaluate the quality of a system’s answers to a person’s request for legal help for a civil justice problem. It also presents the review of these evaluation criteria by legal domain experts like legal aid lawyers, courthouse self help center staff, and legal help website administrators. The result is a set of standards, context, and proposals that technologists and policymakers can use to evaluate quality of this specific legal help task in future benchmark efforts.