Saturday, September 19, 2015

Labs must correct wrong DNA mixture analyses, learn when not to analyze 'crap'

Yesterday for work I attended a Forensic Science Commission committee meeting in Dallas on DNA mixtures, where the agenda had suggested they'd be parsing prosecutor disclosure obligations and mapping out a path toward reviewing old cases. Instead, the committee couldn't muster a quorum, so four scientists brought in to advise them were left to conduct a lengthy panel discussion/Q&A which clarified some issues and, on others, only emphasized how muddy much of this remains.

Terri Langford at the Texas Tribune was the only reporter there; here's her story. In general, she correctly summarized:
experts tried to temper the expectations about DNA testing that were built over more than a decade.
 
"One of the problems was DNA was called the gold standard," Bruce Budowle, director of the University of Texas Health Science Center's Institute of Applied Genetics, said. "Big mistake."
Budowle said DNA deserved gold-standard status when it came to a single DNA sample compared to a single suspect, or even in rape kits where there are two samples and one (the woman's) is known. But when analyzing mixed DNA samples where no one is definitively known, or even where labs can't tell precisely how many DNA contributors there are, analysts engage in interpretation which has not always been informed by best practices. Cutting-edge science takes too many years to trickle down from the research labs to the crime-lab work bench, the panelists repeatedly emphasized.

Budowle said the 2009 National Academy of Sciences report "gave DNA a pass" and it shouldn't have - interpretation of DNA mixtures has a subjective human element just like other comparative forensics.

We learned a bit more about how all this came up: When Galveston DA Jack Roady asked for DNA results to be reinterpreted in one of his cases, the probability the DNA matched their defendant went from more than one in a billion to one in 38.

But that was DPS, which by then had already corrected its method.* Yesterday we learned more about recent changes in DPS' DNA mixture interpretations. Again, from Langford:
Crime labs have recently adopted the new “mixed DNA” standard. The DPS switched to it on Aug. 10. The move has prompted prosecutors like [Inger] Chandler to resend evidence in pending cases to the lab to have the data analyzed using the new standard. In Houston's Harris County, that's about 500 pending cases where DNA evidence will be introduced at trial.

In addition, DAs are notifying defendants who are already convicted about the new standard. For example, Harris County prosecutors have already notified those convicted of capital murder and awaiting execution. It is not known how many of the 253 inmates on Texas’ death row were convicted with mixed DNA. Of the 253 inmates on Texas death row, 90 are from Harris County.
The new standard at DPS deserves further elaboration, because the expert panelists universally agreed that the old method was wrong and produced improper interpretations.

After yesterday, I understood for the first time (perhaps it was said before and didn't penetrate my notes/consciousness/thick skull) that DPS' DNA labs had not changed their protocols until this issue came up while your correspondent was on vacation last month. And the details of the change were significant.

First, a bit of background. DNA test results track two things: which alleles are present at various loci, and the quantity of DNA detected at each spot, shown as peak height. (The latter is complicated by allele drop-in, drop-out, and stacking, terms I'm only beginning to understand.) When examining those peak heights, DPS' old method did not impose a "stochastic" threshold, which as near as I can tell is akin to the mathematical sin of interpreting a poll without ensuring a random sample. (The word "stochastic" was tossed around blithely as though everyone knew what it meant.) Basically, DPS did not discard data which did not appear in sufficient quantity; their new threshold is more than triple the old one.
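For the non-scientists following along (myself included), here's a rough sketch of what imposing that kind of threshold means in practice. All the numbers below are hypothetical, invented for illustration; real labs set their thresholds through validation studies, and the actual software is far more involved than this.

```python
# Illustrative sketch only: how a stochastic (peak-height) threshold screens
# allele calls before any statistic gets calculated. The threshold value and
# peak heights are made-up numbers, not DPS' actual figures.

STOCHASTIC_THRESHOLD_RFU = 200  # hypothetical cutoff, in relative fluorescence units

# Hypothetical peaks observed at one locus of a mixed sample: (allele, peak height)
peaks = [("12", 1450), ("14", 980), ("15", 160), ("17", 90)]

reliable = [(a, h) for a, h in peaks if h >= STOCHASTIC_THRESHOLD_RFU]
uncertain = [(a, h) for a, h in peaks if h < STOCHASTIC_THRESHOLD_RFU]

print("Use for interpretation:", reliable)                  # the two tall peaks
print("Too weak to trust (possible drop-out):", uncertain)  # the two short peaks
```

As I understood the panel, the old approach let low-level peaks like those last two be treated as solid data; a higher threshold pushes more of them into the "too weak to trust" bin, or takes a questionable locus out of the calculation entirely.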

That new methodology could change probability ratios for quite a few other cases, the panel predicted. One expert showed slides demonstrating how four different calculation methods could generate wildly different results, to my mind calling into question how accurate any of them are if they're all considered valid. Applying the stochastic threshold in one real-world case which he included as an example reduced the probability of a match from one in 1.40 x 10^9 to one in 38.6. You can see where a jury might view those numbers differently.
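To get a feel for how the same evidence can yield such different numbers, here's a toy version of the inclusion-style statistic ("what fraction of the population could be a contributor?") using invented allele frequencies. None of these figures come from the case discussed; the point is only that dropping questionable loci makes the reported number far less impressive.

```python
# Toy combined-probability-of-inclusion (CPI) arithmetic with made-up allele
# frequencies -- not any lab's actual protocol, just the shape of the math.

# Each locus: population frequencies of the alleles observed in the mixture.
all_loci = [[0.10, 0.08, 0.12], [0.05, 0.20], [0.15, 0.07, 0.09],
            [0.11, 0.06], [0.08, 0.13, 0.10], [0.09, 0.14]]

# Suppose a stochastic threshold disqualifies the last three loci (possible drop-out).
clean_loci = all_loci[:3]

def one_in(loci):
    """Return the '1 in N' inclusion statistic: 1 / product of (sum of freqs)^2."""
    cpi = 1.0
    for freqs in loci:
        cpi *= sum(freqs) ** 2
    return 1.0 / cpi

print(f"All six loci used:     1 in {one_in(all_loci):,.0f}")    # roughly 1 in 12.6 million
print(f"Three clean loci only: 1 in {one_in(clean_loci):,.0f}")  # roughly 1 in 1,850
```

The swing in the real case was far more dramatic, but the mechanism is the same: once you stop counting data that shouldn't have been counted, the statistic gets weaker.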

Not every calculation will change that much, and some will change in the other direction. The application of an improper statistical method generates all types of error, not just those which benefit defendants. There may be folks who were excluded whose results become undetermined, or undetermined results may come to implicate a suspect when they're recalculated. The panel seemed to doubt there were examples where a positive association would flip all the way to excluded, but acknowledged it was mathematically possible.

DPS has identified nearly 25,000 cases where they've analyzed DNA mixtures. Since they typically represent about half the state's caseload, it was estimated, the total statewide may be double that when it's all said and done. Not all of those are problematic and in some cases the evidence wasn't used in court. But somebody has to check. Ch. 64 of the Code of Criminal Procedure grants a right to counsel for purposes of seeking a DNA test, including when, "although previously subjected to DNA testing, [the evidence] can be subjected to testing with newer testing techniques that provide a reasonable likelihood of results that are more accurate and probative than the results of the previous test." So there's a certain inevitability about the need to recalculate those numbers.

Making the situation even more complex, next year DPS will abandon the updated method and shift to "probabilistic genotyping," which has the benefit of using more of the DNA data but asks a mathematically different question than the old method. Instead of estimating how many people in the population could be included as contributors to the sample, the new method asks, e.g., how much more likely the observed mixture is if the suspect and an unknown person contributed than if two random, unrelated people did.
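As best I can tell, the difference boils down to something like the following toy, single-locus comparison: two contributors, no drop-out, and made-up allele frequencies. It's a textbook-style simplification, not whatever software DPS ultimately adopts, but it shows the two different questions being asked.

```python
# Toy single-locus comparison of the two kinds of statistics, with made-up
# allele frequencies. Real probabilistic genotyping also models drop-out,
# peak heights, and dozens of loci; this is only the shape of the arithmetic.

# The mixture shows four alleles A, B, C, D; the suspect's genotype is (A, B).
pA, pB, pC, pD = 0.10, 0.15, 0.08, 0.12

# Old-style question: what share of random people carry only alleles seen in
# the mixture, and so "could be included" as a contributor at this locus?
inclusion_prob = (pA + pB + pC + pD) ** 2
print(f"Random person could be included: about 1 in {1 / inclusion_prob:.1f}")

# New-style question (a likelihood ratio): how much more probable is this
# exact four-allele mixture if the suspect plus one unknown contributed,
# versus two unknown people?  (Standard no-drop-out formulas.)
p_if_suspect_contributed = 2 * pC * pD        # the unknown must be (C, D)
p_if_two_unknowns = 24 * pA * pB * pC * pD    # two unknowns showing exactly A, B, C, D
likelihood_ratio = p_if_suspect_contributed / p_if_two_unknowns
print(f"Evidence is about {likelihood_ratio:.1f}x more probable if the suspect contributed")
```

The two numbers answer different questions, which is why, as discussed below, the new statistic can't simply be read as confirming or refuting the old one.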

That's a subtle difference, but it means the new DPS method is not a direct refutation of the old one, prosecutors exasperatedly realized upon questioning the panel. Going forward, it's probably best to shift to probabilistic genotyping until something else comes along, they were told. For older cases, though, labs would probably need to calculate both. That stickies the wicket quite a bit - they can't just wait and issue results under the new method in old cases, as some labs had been advising. They'll have to recalculate them using the new stochastic threshold.

Another interesting side note: the old method always generates the same result. Because it relies on statistical simulation, probabilistic genotyping will get a slightly different result every time (presumably within a valid range of error). That made me wonder about the wisdom of moving to a system where results are not entirely replicable. That's an issue for the courts, one supposes, which will ultimately need to decide which approach they prefer. All this will end up before the Texas Court of Criminal Appeals sooner rather than later, most observers agreed.
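To see why run-to-run variation happens at all, here's a deliberately trivial Monte Carlo example (estimating the area of a quarter circle, nothing to do with DNA): the quantity being estimated never changes, but each run's answer wobbles around it depending on the random draws, unless the random seed is fixed and reported.

```python
# Toy illustration of why simulation-based (Monte Carlo) methods return
# slightly different numbers each run. The true answer here is pi/4 ~= 0.7854;
# real probabilistic genotyping estimates far more complicated probabilities.
import random

def monte_carlo_estimate(n_samples, seed):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits / n_samples

for seed in (1, 2, 3):
    print(f"Run with seed {seed}: {monte_carlo_estimate(100_000, seed):.4f}")
# Each run lands near 0.7854, but not exactly on it and not exactly where the
# previous run did -- which is the replicability wrinkle flagged above.
```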

Even when labs shift to a new method, though, the software implementing these models cannot be treated as a black box, the panel emphasized. There's inherently an interpretation element and without understanding the different statistical methodologies, they warned, crime labs could still get into trouble, a likelihood which became increasingly apparent as the hours-long session progressed. "All models are wrong but some are useful," one panelist quipped. Each is a different tool, and one uses different tools for different things.

One final takeaway: Labs not only need to update their methods for performing statistical calculations; just as importantly, they need to create standards for when they should make no calculation at all. One panelist gave an anecdote from a 2013 study: 108 labs were given a sample he'd created using four DNA sources, but for context he told them the names of three people, only two of whom were actual sources. Amazingly, 75 percent of labs mistakenly said the sample came from three people and included the person who wasn't a source. Only 20 percent said they couldn't make a calculation. If that's not a red flag, I don't know what is!

Budowle, who for 26 years worked for the FBI and was their lead expert on these topics, said that when there are too many DNA sources to make an interpretation, as is increasingly the case with touch DNA samples, the scientific term for what one sees in the test results is "crap." They'd operated in the past on the assumption that examiners could recognize crap, he said, but it's becoming apparent guidance needs to be developed because people are busily applying these statistical models in invalid and problematic ways. All the other panelists agreed.

Finally, everyone agreed, this is not at all just a Texas issue but is a national and even international problem. Everywhere DNA analysis is used for crime fighting, courts and labs eventually must grapple with these issues, and many jurisdictions have yet to do so. Texas crime labs weren't acting in bad faith on this; this isn't a drama with a villain. As science advanced, past errors became known, it's nothing nefarious, however problematic it may be for the justice system to have relied on unproven science. Texas is just confronting the issue first, in large part because of leadership from the Forensic Science Commission. Their executive director Lynn Garcia has ably pieced together stakeholders and generated a meaningful, high-level conversation among decision makers, even if few decisions have been made yet.

The committee will meet again before the next Forensic Science Commission meeting Oct. 2, perhaps the day before, to take up the agenda they didn't get to yesterday in Dallas. Fascinating stuff. What a mess!

* CORRECTION: A commenter correctly noted Roady's sample was retested by DPS, not the FBI. See here.

16 comments:

Anonymous said...

I was also at this meeting. I have one correction, and a couple of comments.

The correction: The Galveston case that brought attention to this issue was not analyzed by the FBI lab, but by DPS.

The comments: There was not, to my mind, a full enough discussion of the issues surrounding not interpreting complicated mixtures. The concern of both prosecutors and defense attorneys I have spoken to is that if a suspect can be reliably excluded as being a contributor to a sample, then they want to know that. For a laboratory to not interpret a complicated sample, and as a result fail to correctly exclude a suspect who could be reliably excluded, would not be seen as an acceptable policy decision by most in the legal community.

In comparing current calculation methods to the down-the-road probabilistic genotyping methods, the characterization that the former always generate the same result and the latter will give a different result every time misses an important point. The current methods (which are frequentist-based approaches in statistical jargon) give the same result every time only if the same population database is used and if the profile interpretation rules are the same. Which may be true within a single lab (or maybe not), but will generally be different between laboratories. So the same data analyzed in different labs using different databases and different interpretation procedures will give different statistical results even with the current calculation methods. So although it might not be generally acknowledged, there actually is variation in the current calculation approaches, and in that respect there is no obvious virtue in avoiding the move to probabilistic genotyping methods (Bayesian methods, not frequentist methods), which make fuller use of the available data.

Gritsforbreakfast said...

I double-checked and you're right re: Roady's test. Fixed it. I apparently misstated it in my notes, thanks for catching it.

On the matter of not interpreting complicated mixtures, perhaps there's a middle ground where they can be used to exclude but not include if the DNA sample is "crap." It's one thing to exclude based on questionable data, another to accuse someone.

Re: the lack of standardization, I'm not sure that justifies using a non-replicable method but you're right the issue hasn't been fully vetted yet. Scientific results being replicable by others is historically considered a key indicator of validity. I suppose in the end it will be up to the courts to decide what's acceptable.

Anonymous said...

3:12 here:

The probabilistic genotyping methods that were referenced yesterday use a widely accepted approach to statistical modeling called the Markov Chain Monte Carlo method. The approach is used in numerous areas of science to extract meaningful information from complex sets of data. A search of Google scholar for scientific publications with the keywords "markov chain monte carlo" returns more than 175,000 citations.

Gritsforbreakfast said...

I don't doubt its validity, 3:12/6:31, I just wonder whether judges, when posed with two methods which are both considered valid, might prefer the one which will be the same today as it would be twenty years from now (if calculated using the same method and baseline). I can't read their minds or predict the future, it's just something I flagged in my notes from the discussion.

Gritsforbreakfast said...

BTW, 3:12/6:31, following your suggested search terms, I found this article evaluating the method. Notably, it listed replicability as one of the challenges.

I'm not taking a position yet, just starting to think about these issues for the first time. But if we assume one uses different tools for different purposes ("the right tool for the job," as carpenters say), one wonders whether a tool for which results are not predictably replicable is the right tool for use in the judicial system?

It strikes me that the new method has been developed by scientists and statisticians, but the nine members of the Texas Court of Criminal Appeals are neither of those. Their practical requirements and points of emphasis may differ from those of Bayesian statisticians. ¿Quien sabe? Time will tell.

Anonymous said...

If I were king, DNA would only be admissible in court if tested double-blind in five independent labs.

Anonymous said...

"...Texas crime labs weren't acting in bad faith on this; this isn't a drama with a villain. As science advanced, past errors became known, it's nothing nefarious..."

Please, Grits. This is not new science. Statistics has been around a lot longer than DNA testing. The forensic community does not get a free pass on this one. As Budowle stated, he's been doing this for 26 years. The problem has been present ever since DNA testing became available. Peerwani, Barnard, Eisenberg and the other FSC members knew this was a sticky issue...each having their own DNA testing facilities.

DPS and ASCLD-LAB gave Texas crime labs "accreditation" status declaring their protocols (in addition to the statistical analysis) were sound because, well, the accreditation agencies were given payments from the crime labs (taxpayer money). If these accrediting agencies were actually doing their jobs, this problem would have been solved 20+ years ago. But since there is no penalty for being lazy or incompetent, today we are stuck with addressing the problems.

Anonymous said...

These DNA calculations become ultra-critical, especially if your crime lab looks like this...

https://sliterchewspens.files.wordpress.com/2013/03/slide132.jpg

Multiple DNA contributors on evidence was a regular occurrence.

-SCP

Anonymous said...

The reality is that the "rules" for calculating CPI for complex DNA mixtures have never been published and the closest thing to it would be the John Butler book only published at the end of 2014. These "rules" have not made it through the various SWGDAM or other committees that develop the guidelines so there were no specific guidelines to follow. Labs acted in good faith to develop methods over the years to apply CPI. The stochastic threshold part published in the 2010 SWGDAM guidelines only scratches the surface and is an oversimplification of the issues surrounding this process. Until the TFSC or SWGDAM or ASCLD/LAB or some body publishes what the "rules" are, variation from lab to lab will continue.

Gritsforbreakfast said...

@4:37, I tend to agree with 1:40 on this. From all I've seen, they surely knew mixture analysis was complex, but the old methods weren't exposed as inadequate until new ones supplanted them. I haven't yet seen evidence this was intentionally elided. The folks at the crime lab work bench aren't automatically apprised of every update from high-end researchers.

Folks in the press, this blog included, who discussed DNA evidence as "gold standard" forensics also contributed to the problem. DNA was given a pass in these debates when, as Budowle said, it clearly shouldn't have been.

Anonymous said...

"...acted in good faith..."

This is what parents say when they use asbestos as insulation or let their children play with lawn darts.

Criminal negligence is a misfeasance where the fault lies in the failure to foresee and so allow otherwise avoidable dangers to manifest (e.g., a 1-in-a-billion versus 1-in-38 probability). In some cases this failure can rise to the level of willful blindness, where the individual(s) intentionally avoids adverting to the reality of a situation.

The labs knew of the problems, but did nothing to fix them. How many plea bargains were obtained based on faulty math?

EveryWhichWayButLoose said...
This comment has been removed by the author.
Anonymous said...

Grits for Breakfast, you seem to have a reasoned perspective on this for someone who has never worked in a crime lab as far as I can tell. Some, understandably, don't understand this complicated issue scientifically. In a nutshell, like all sciences, it comes down to when new guidelines were published and when they were implemented. And what qualifies as a guideline? Only ones required for accreditation? All other guidelines or publications are optional. If one cardiologist publishes on a technique, are all other cardiologists immediately expected to stop the way they were doing things and follow this single publication? I am sure you would say no. Forensics, like other sciences, relies on a scientific body to review the current state of techniques and publications and develop standards for how to apply the scientific techniques. Eventually these become documents required for accreditation, but it has not reached this level in this situation. Labs did not intentionally or negligently implement methods that were in violation of accreditation standards, and they were clear in their SOPs which methods they used for deduction and calculations, which are available for review by all parties in criminal cases. But forensics, like all sciences, updates its techniques as new data or information becomes available. The work the oncologist did 10 years ago is not invalidated by new methods available today.

Anonymous said...

The Scientific Working Group on DNA Analysis Methods made their guidelines for mixture interpretation available in 2010. This includes the recommendation for a stochastic threshold. Unfortunately, after five years, these are still not required for accreditation nor are they ever audited against by any accrediting body. They are simply "guidelines". They can be found at the website below.

https://www.fbi.gov/about-us/lab/biometric-analysis/codis/swgdam.pdf

Anonymous said...

9:21:

Since the DNA Identification Act of 1994, authority for establishing standards for analysis and interpretation in CODIS-participating forensic DNA laboratories has rested by statute with the FBI director. The SWGDAM 2010 document has never been authorized by the FBI director as a mandated requirement for DNA testing laboratories. The SWGDAM guidelines are a true set of guidelines. They were developed by a small group of practitioners who did not fully represent the field. SWGDAM could have taken the guidelines through the ASTM standards development process, which would have involved broader participation by practitioners and academics. But that was not done. Maybe it will occur through the new OSAC process. But, at the moment, each laboratory is still responsible for interpreting the primary research literature for itself. The SWGDAM 2010 guideline document is useful, but it is no silver bullet that cures all differences of opinion about interpretation.

Anonymous said...

The 2010 SWGDAM doc is only a small portion of what needs to be done to meet the "current" standard being proposed. Stochastic threshold is just a part of the story. The John Butler 2014 book is a much better compendium of best practices today (not saying that people should have known all that stuff in 2005, for example). But it is only one book, so if it is going to be codified as the authoritative text on this then some governing body needs to do so.