People Are Going To Prison Thanks To DNA Software — But How It Works Is Secret
A new breed of software claims it can find DNA matches for forensic cases with unprecedented accuracy. But if it’s sending people to prison, should its secret source code be revealed to the accused?
Two days after Christmas 1977, police found Shelley H. dead in her apartment in Long Beach, California. The 17-year-old had been sexually assaulted and strangled: She was lying on the end of her bed, her feet touching the ground, with an electrical wire tied around her neck.
Vaginal swabs were taken during her autopsy, but at the time, there was no DNA testing. So the samples went into storage, and the case went cold for the next three decades.
Then in 2011, a private DNA lab matched the samples to a man who had lived in Long Beach at the time, Martell Chubbs. The DNA, according to his attorney, is the only evidence linking him to the victim.
And these DNA samples were particularly tough to read, tangled with the genetic traces of one or two people in addition to Shelley. What were the chances that the match was correct? Dr. Mark Perlin, the CEO of a Pittsburgh company called Cybergenetics, said his computer program could figure it out.
In pop culture, DNA is often portrayed as a magical piece of evidence that links perpetrators to crimes and exonerates the innocent. But deciphering it is usually much messier in real life, especially in cases like Shelley’s that involve samples of more than one person’s DNA. Over the last decade or so, forensic scientists have come to realize that traditional methods for interpreting these “mixed samples” are often less reliable than previously thought.
Now, Cybergenetics and a handful of other companies are selling a solution: software that claims to interpret mixed DNA with a high degree of accuracy. These companies point to several peer-reviewed studies that describe the underlying mathematical concepts of their programs, as well as results of mock testing in which they correctly interpreted known samples.
But here’s the problem, according to some attorneys and geneticists: These companies claim that the details of how the computer programs carry out their calculations, spelled out in their source code, are trade secrets. So there’s no way to independently verify that the programs are identifying the real criminal, critics say.
“It’s a black box,” said Angelyn Gates, an attorney based in Pasadena who’s defending Chubbs, now 56, from a murder charge for the 1977 crime. That trial is expected to happen this year in Los Angeles County, and she’s trying to prevent the prosecution from introducing Cybergenetics’ software, called TrueAllele, into court.
“You have a defendant’s right to cross-examine and determine, ‘How are you saying this is the result in my case?’” Gates said. In her view, “Perlin says, ‘Who cares about your constitutional rights? I want my money.’”
Since 2009, TrueAllele has been used in more than 500 cases and helped convict robbers, child molesters, murderers, and rapists across the U.S. and the U.K. Another program, STRmix, has been used in thousands of cases in Australia and New Zealand since 2012. Just last month, in two double homicide cases near Pittsburgh, judges rejected both defense attorneys’ requests to examine TrueAllele’s source code. In December, a Michigan judge allowed STRmix to be used in a case, marking one of the first times it has been admitted into a U.S. court.
Apart from how well the software works, privacy advocates say it exemplifies an even bigger problem: the growing and often unchecked influence of secret algorithms on society, from Facebook’s News Feed experiment on users’ emotions to the 11 million Volkswagen cars programmed to cheat emissions tests.
“In so many things we use now, algorithms are being used to make determinations about us,” said Caitriona Fitzgerald of the Electronic Privacy Information Center, which is filing public records requests to reveal TrueAllele’s source code. “Those algorithms are not available to people.”
Given the billions of DNA letters in each human genome, there are myriad ways that one person’s code is different from the next.
But such differences are tricky to pin down in a sample with DNA from two or more people, as is often the case with a gun, shirt, or cell phone from a crime scene. Depending on how many people came into contact with the item, and for how long, their unique genetic traces may be present in large or small amounts. Sometimes a person’s genetic markers, or strings of DNA code, partially overlap with others’. And in small or decaying samples, bits of these markers can fade away, and others can sneak in through contamination.
Scientists have known these potential limits for at least a decade. Last spring, the FBI notified crime labs that there were errors in the data they’ve used for years to calculate the chances of a match between DNA evidence and a suspect. That spurred Texas to flag thousands of past cases for potential issues involving mixed DNA samples.
In a 2013 survey that has not yet been published, the National Institute of Standards and Technology (NIST) asked 108 labs to interpret a made-up sample with four people in it, and as a test, provided the DNA profile of a fake suspect who wasn’t in the sample. Seventy percent of the labs pinned the fake suspect.
“At the moment, there are really no national standards as far as interpreting mixed DNA,” said Michael Coble, a research biologist at NIST who helped conduct the survey.
The traditional approach for analyzing DNA samples, and attempting to account for extraneous code that’s in the sample due to contamination or degradation, involves throwing out bits of the DNA sequence that appear as either large or small outliers in the sample. This approach works well for samples that include just one person’s DNA. But in mixed samples, it runs the risk of discarding information that could be crucial to making a correct analysis.
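To make the risk concrete, here is a minimal sketch of cutoff-based filtering. It is not any lab's actual protocol: the marker names, signal values, and the `low` and `high` cutoffs are all invented for illustration, and `apply_cutoffs` is a hypothetical helper.

```python
# Hypothetical signal strengths for markers detected in one mixed sample.
# Names and numbers are invented for illustration only.
signals = {"marker_1": 1200, "marker_2": 950, "marker_3": 60, "marker_4": 45}


def apply_cutoffs(signals, low=100, high=5000):
    """Keep only signals inside [low, high]; everything else is discarded."""
    return {name: value for name, value in signals.items() if low <= value <= high}


kept = apply_cutoffs(signals)
# Markers that fell outside the cutoffs and were thrown away.
discarded = sorted(set(signals) - set(kept))
```

In this made-up sample the two faint markers are dropped. If they came from a second person who left only a small trace, that person's contribution disappears from the analysis entirely.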
TrueAllele’s approach, in contrast, considers everything. “We have a method that’s objective, that uses all the data,” Perlin said.
TrueAllele runs through up to hundreds of thousands of potential scenarios that might have produced the DNA code in a mixed sample, and calculates the probability of each. Once all the probabilities are calculated, they’re compared with the DNA of a suspect or suspects. Finally, TrueAllele spits out a ratio: What are the chances that a DNA “match” to a suspect is actually just a random coincidence? One in 100? One in a million?
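The general idea of weighing scenarios can be sketched with a toy example. This is not TrueAllele's method, whose implementation is confidential. It is a simplified textbook-style likelihood ratio at a single made-up marker, assuming two contributors, no missing or stray DNA, and invented population frequencies (`FREQS`). All function names are hypothetical.

```python
from itertools import combinations_with_replacement, product

# Invented frequencies of four variants ("alleles") at one made-up marker.
FREQS = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}

# Every possible pair of variants a single person could carry.
GENOTYPES = list(combinations_with_replacement(sorted(FREQS), 2))


def genotype_prob(genotype):
    """Chance a random person carries this pair (Hardy-Weinberg assumption)."""
    a, b = genotype
    return FREQS[a] ** 2 if a == b else 2 * FREQS[a] * FREQS[b]


def prob_evidence(observed, known, n_unknown):
    """P(observed variants | the known people plus n_unknown random people)."""
    total = 0.0
    # Each "scenario" is one assignment of genotypes to the unknown people.
    for unknowns in product(GENOTYPES, repeat=n_unknown):
        alleles = set()
        for genotype in list(known) + list(unknowns):
            alleles.update(genotype)
        # A scenario explains the sample only if it yields exactly what was seen.
        if alleles == observed:
            weight = 1.0
            for genotype in unknowns:
                weight *= genotype_prob(genotype)
            total += weight
    return total


def likelihood_ratio(observed, suspect, n_contributors=2):
    """How much better the sample is explained if the suspect is in it."""
    with_suspect = prob_evidence(observed, [suspect], n_contributors - 1)
    without_suspect = prob_evidence(observed, [], n_contributors)
    return with_suspect / without_suspect


lr = likelihood_ratio({"A", "B", "C"}, ("A", "B"))
```

With these invented numbers the ratio comes out to about 6.25, meaning the sample is roughly six times more probable if the suspect contributed to it than if two strangers did. Real casework multiplies such ratios across many markers, which is how the figures reach the millions and beyond.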
Differing ratios may not always change jurors’ minds, like when one method claims a 1 in 5 million chance of being wrong and another claims 1 in 81 billion (as was the case with a rapist in Pennsylvania). But errors in how these ratios are calculated can really matter when two methods end up with wildly different results, like 1 in 420 versus 1 in 18 billion (as was the case in a fatal 2008 shooting).
Ratios can also be contentious in investigations that hinge largely or entirely on DNA evidence, as in the case of Shelley H.’s unsolved murder.
In 2011, in an attempt to restart that investigation, Sorenson Forensics, a private DNA testing company in Utah, analyzed her autopsy sample and detected three people: two sources of sperm (one major, one minor), and the victim herself, according to court documents. Shelley had had a partner, but he wasn’t the major source, the lab found.
Long Beach police say they ran the material through a DNA database of local, state, and federal crime labs, then arrested Chubbs upon finding a match. Chubbs was in the system because he was a registered sex offender in California for past crimes that included rape by force, oral copulation with a minor, and sodomy with force on a minor.
The chances of the main sperm DNA profile matching an unrelated black person other than Chubbs were roughly 1 in 10,000, according to Sorenson. But when Los Angeles County prosecutors sent the samples to Cybergenetics for further testing to prepare for trial, TrueAllele put the chances at 1 in 1.62 quintillion — a number with 16 zeros. It’s these vastly smaller odds of misidentification that prosecutors may cite in their case against him.
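For a sense of scale, the gap between the two estimates can be checked with simple arithmetic. The sketch below uses only the two figures quoted above; the variable names are ours.

```python
# Odds reported by each analysis, written as "1 in N".
sorenson_odds = 10_000
trueallele_odds = 1_620_000_000_000_000_000  # 1.62 quintillion

# Count the zeros at the end of the larger number.
digits = str(trueallele_odds)
trailing_zeros = len(digits) - len(digits.rstrip("0"))

# How many times larger the TrueAllele figure is than Sorenson's.
ratio = trueallele_odds // sorenson_odds
```

The TrueAllele figure does end in 16 zeros, and it is 162 trillion times larger than Sorenson's estimate.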
But TrueAllele isn’t only used by prosecuting attorneys, its maker Perlin says. He points to cases like Darryl Pinkins, an Indiana man who is serving a prison sentence for raping a woman in 1989. A TrueAllele analysis suggested that his DNA wasn’t in a sample taken from the scene. So now lawyers at the Wrongful Conviction Clinic at Indiana University are trying to use the results to exonerate him.
Although Cybergenetics describes TrueAllele’s mathematical concepts in peer-reviewed papers, it keeps confidential the details of how those basic equations get translated into software. As proof that the software works, Perlin points to studies like one with the crime laboratory in Kern County, California. An independent party created 40 mixed samples with DNA traces of two to five people, and lab scientists ran them through both TrueAllele and manual analysis methods. TrueAllele generated the correct results almost every time, while the humans struggled to figure out how many people were in the mix, former lab director Kevin Miller told BuzzFeed News.
“I know that if I give it known samples, it works as expected,” Miller said, “so when I give it unknown samples, I have no reason to believe it wouldn’t work the same way.”
Perlin says his company lets anyone try out a free trial, that critics who want to see all 170,000 lines of source code are missing the point, and that revealing it would expose him to copycats. Even though the technology is patented, he says his 10-person company lacks the money to fight a patent dispute.
Five crime labs in the U.S. use the TrueAllele software and hardware, which cost $60,000, according to Perlin. Other agencies prefer STRmix, including the California Department of Justice, which found that it had “more reproducible and sensitive results with fewer re-analyses” compared to TrueAllele. The department has used STRmix in about two dozen cases, a spokesperson told BuzzFeed News.
A trio of Australian and New Zealand forensic scientists began working on STRmix around 2009. They, too, have published its mathematical models in scientific journals.
Developer John Buckleton told BuzzFeed News by email that “the source code is made available when requested under appropriate supervision conditions” but wouldn’t describe what those circumstances might be. The company declined to reveal the cost, but said its more than 50 customers include the FBI and the U.S. Army Criminal Investigation Laboratory.
In December 2014, STRmix learned of a coding error that led the company to redo 22 calculations. The error, fixed within a week, did not significantly change any of the results, STRmix told BuzzFeed News.
“No software can claim to be completely error-free,” STRmix spokesperson Stephen Corbett said.
Such incidents make more than a few forensic and legal experts wary.
Dan Krane, a Wright State University biologist who has testified against TrueAllele, told BuzzFeed News the code would help answer questions about how the software weighs various factors, like how it discerns between signal and noise. “Those dozen or so different questions are individually things that people have been debating and arguing quite vigorously for the last 10 to 15 years.”
If a program says the chances of a match are 1 in a million, “how do you truly know that is the right number rather than 100 million or 10 million?” asked William Thompson, a criminology and law professor at the University of California at Irvine. “Are you going to run it a trillion times or a million times and see how often you get a false result?”
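Thompson's question has a rough statistical answer. The sketch below is a standard back-of-the-envelope calculation, not something the article's sources provided. It assumes independent test runs and a 95 percent confidence level, and `trials_needed` is a hypothetical helper.

```python
import math


def trials_needed(error_rate, confidence=0.95):
    """Error-free test runs needed before a true error rate this high
    would be unlikely to have produced zero errors."""
    # With a true rate p, the chance of n clean runs is (1 - p) ** n.
    # Solve (1 - p) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log1p(-error_rate))


one_in_a_million = trials_needed(1e-6)
one_in_100_million = trials_needed(1e-8)
```

Under those assumptions, backing up a 1-in-a-million claim takes roughly 3 million clean test runs, and a 1-in-100-million claim takes roughly 300 million. Figures in the quintillions are far beyond anything that could be checked this way, which is why they rest on the model rather than on direct testing.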
Still, not everyone thinks raw code is helpful. “I’m more interested in validation: Does the thing work?” Greg Hampikian, the director of the Idaho Innocence Project, who works with Cybergenetics on wrongful conviction cases, told BuzzFeed News. “The results are clear from the studies I’ve seen.”
Fitzgerald of the Electronic Privacy Information Center suggests a compromise: that lawyers and experts could review a program’s code if they signed agreements to not reveal or copy it.
Another alternative is free open-source software like LRmix Studio, created by data scientist Hinda Haned. Haned and her colleagues constantly receive feedback from users of the software, one of whom recently flagged a coding mistake.
“I don’t think you have to be open-source to be good software,” Haned said, “but it’s difficult to interact with other scientists when there is this layer of secrecy.”
Chubbs has pleaded not guilty to Shelley’s murder. Gates, his attorney, had requested to have TrueAllele’s source code disclosed, but last year a court of appeal said she didn’t sufficiently demonstrate why she needed it. Now she is asking prosecutors to prove that TrueAllele is accepted in the scientific community in order to introduce it in court.
TrueAllele’s report on the swab samples makes the DNA match to Chubbs look all but ironclad. The public believes DNA is largely infallible, and to hear Cybergenetics and similar companies tell it, that’s still true.
“I have a problem with the courts or any judge or anybody else,” Gates said, “putting Perlin’s financial gain over a person spending the rest of their life in prison.”