Art by Ellie Gabriel.
Over the last two decades, predictive risk assessment tools have been used to determine the fates of millions of individuals in the criminal justice system, deciding whether a defendant will be detained pretrial or released on parole based on an algorithmic calculation of risk. The technology has been embraced by courts and policymakers alike, with one Congressional bill going as far as to call for the implementation of risk assessment systems in every federal prison. But, in 2018, researchers Julia Dressel and Hany Farid published a surprising result: a commonly used risk assessment tool named COMPAS was correct only about two-thirds of the time. With the accuracy rate of COMPAS only a few percentage points higher than that of humans with no previous judicial experience, some judges were left wondering whether they would be better off not using algorithms at all.
When Stanford graduate student Zhiyuan Lin heard about Dressel and Farid’s study, he was equally surprised at its findings—although for a different reason than the public. As a computer scientist in Stanford’s Computational Policy Lab, Lin had encountered dozens of studies demonstrating that algorithms performed better than humans, and he was puzzled why Dressel and Farid had found otherwise. Together with a team of researchers from Stanford and Berkeley, Lin decided to see whether he could fill in the missing pieces to understand what was going on.
Lin and his colleagues began by attempting to replicate the 2018 study, giving over six hundred participants the same set of profiles that Dressel and Farid used and asking them to predict whether the defendants would recidivate. When they provided participants with immediate feedback after each response, they found that the participants guessed correctly sixty-four percent of the time, compared to the sixty-two percent accuracy rate reported in the 2018 study. The COMPAS algorithm’s accuracy rate of sixty-five percent matched that of the 2018 study.
Next, the researchers investigated whether these results would hold if they modified the experiment to resemble the real world more closely. They did so in three ways: providing the respondents with more detailed criminal profiles, lowering the average recidivism rates to reflect the rate of violent crime, and most significantly, not telling the respondents whether they were right or wrong. “Receiving this kind of immediate feedback is something that rarely happens in reality, because when the judges are making bail decisions, they don’t find out whether a defendant will recidivate until two years later,” Lin said. “More often than not, they don’t see the outcome at all.”
Under these new conditions, the algorithm performed substantially better than humans. This accuracy gap was especially pronounced in the case of violent crime, for which the study participants consistently overestimated the risk of recidivism. When feedback was present, the participants adjusted their predictions to reflect the lower recidivism rate, but, when they didn’t receive feedback, they continued to guess incorrectly forty percent of the time. In comparison, the algorithm was correct eighty-nine percent of the time. Lin noted that this percentage may have been skewed by the low recidivism rates, since a simple algorithm that guessed “no” each time could have achieved the same score. But even under a different measure that accounted for variations in the base recidivism rates, the algorithm still performed better than humans by achieving sixty-seven percent accuracy. The researchers were able to replicate their results with various risk assessment tools including their own statistical model, suggesting that these improvements in performance were not unique to the COMPAS algorithm.
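The pitfall Lin describes can be illustrated with a small sketch. The numbers below are invented for illustration (a hypothetical population where eleven of one hundred defendants recidivate, roughly mirroring a low base rate), and balanced accuracy is used here as one example of a measure that accounts for base rates; the study's own adjusted measure may differ.

```python
# Illustrative sketch: why raw accuracy misleads when recidivism base
# rates are low, and why a base-rate-adjusted measure is more informative.

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true outcomes.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    # Average of per-class recall; a constant predictor scores 0.5.
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(p == 1 for p in pos) / len(pos)  # true positive rate
    tnr = sum(p == 0 for p in neg) / len(neg)  # true negative rate
    return (tpr + tnr) / 2

# Hypothetical population: 11 recidivists out of 100 defendants.
y_true = [1] * 11 + [0] * 89

# A trivial "algorithm" that always predicts no recidivism.
always_no = [0] * 100

print(accuracy(y_true, always_no))           # 0.89 -- looks impressive
print(balanced_accuracy(y_true, always_no))  # 0.50 -- no better than chance
```

Under raw accuracy, the do-nothing baseline scores 89 percent; under the balanced measure, it drops to chance level, which is why a 67 percent adjusted score still represents a genuine improvement over humans.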
For researchers like Dressel, however, Lin’s findings emphasize just how limited algorithms can be. Accuracy rates under seventy percent are still “really low,” she said, given that “the consequences of making mistakes is so high.” Dressel also expressed concerns about racial bias, citing a 2016 ProPublica study which found that COMPAS predicted false positives for black defendants at almost twice the rate of white defendants. “A fundamental principle of machine learning is that the future will look like the past, so it’s not surprising that the predictions being made are reinforcing inequalities in the system,” she said.
Lin acknowledged the shortcomings of algorithms, but he said that humans exhibit bias too—and that the biases now embedded in algorithms initially arose from humans themselves. Since people often make decisions in an inconsistent manner, even imperfect algorithms could inject a degree of objectivity into an arbitrary criminal justice system. Lin emphasized that these algorithms should only be used for their intended purpose—risk assessment—and that judges should consider other factors when making their final decision. “There’s this dichotomy of whether we should rely only on humans or only on artificial intelligence, and that’s not really how things work around here,” Lin said. “Algorithms should be complementary tools that help people make better decisions.”
In order to ensure that algorithms are being used correctly, Lin believes that policymakers must be aware of how they work. With its black-box formulas that are protected as intellectual property, the COMPAS software has not been conducive to fostering this kind of understanding. However, developing transparent and interpretable algorithms is very much possible. In another study, Lin demonstrated that transparency need not come at the cost of accuracy: he created an algorithm in the form of an eight-step checklist that prosecutors can score by hand, making it clear exactly how a risk score is calculated. The checklist is far simpler than many traditional machine learning models, yet it performs just as well in real-life situations.
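A checklist model of this kind can be sketched in a few lines. The items and point values below are invented for illustration; they are not the actual checklist from Lin's study, only an example of the general technique of a hand-scorable, point-based risk model.

```python
# Hypothetical point-based risk checklist, in the spirit of interpretable
# models like the one Lin describes. Items and weights are invented.
CHECKLIST = [
    ("prior felony conviction", 2),
    ("prior misdemeanor conviction", 1),
    ("pending charge at time of arrest", 1),
    ("prior failure to appear", 2),
    ("age under 25", 1),
]

def risk_score(defendant):
    # Sum the points for each item that applies; a person can compute
    # the same total with pen and paper, so the result is auditable.
    return sum(points for item, points in CHECKLIST if defendant.get(item))

defendant = {"prior felony conviction": True, "age under 25": True}
print(risk_score(defendant))  # 3
```

Because every point in the total traces back to a named item, a prosecutor or judge can see exactly why a defendant received a given score, which is the transparency black-box tools like COMPAS lack.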
But given that neither algorithms nor humans are perfect predictors of recidivism, Dressel suggests that our focus should not be on developing better tools, but rather on reducing our reliance on them. New York’s bail reform law, enacted this January, makes risk assessment essentially obsolete for a large class of cases: defendants arrested for nonviolent crimes are released without posting bail, regardless of perceived risk. According to a report by the Center for Court Innovation, the reform could decrease the number of pretrial detainees by forty-three percent, which is especially significant given that eighty-five percent of them are Hispanic or black—by no coincidence the same races overrepresented in algorithmic predictions of high-risk individuals. “I think what New York did is great,” Dressel said. “The decisions we’re making in a pretrial context shouldn’t be based on someone’s risk. We shouldn’t sacrifice anyone’s liberty until they’ve had a fair trial.”
Still, many researchers believe there’s a place for algorithms within the criminal justice system. “It’s a bit premature to be using these kinds of algorithms now, but I think we will be seeing more of them in the future,” said Nisheeth Vishnoi, a computer science professor and founder of the Computation and Society Initiative at Yale. “It’s good that people are scrutinizing them, because what that is doing is creating new dialogue around these issues.” A proper application of machine learning algorithms, he says, will require learning in all directions—from policymakers, scientists, and each other.
Dressel, J. & Farid, H. (2018). The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1). https://doi.org/10.1126/sciadv.aao5580
Lin, Z., Jung, J., Goel, S., & Skeem, J. (2020). The limits of human predictions of recidivism. Science Advances, 6(7). https://doi.org/10.1126/sciadv.aaz0652
Lin, Z., Chohlas-Wood, A., & Goel, S. (2019). Guiding prosecutorial decisions with an interpretable statistical model. In AAAI/ACM Conference on AI, Ethics, and Society (AIES ’19). https://doi.org/10.1145/3306618.3314235