Thursday, October 08, 2015

In the Replication Funhouse

Eli was moaning with a quack today about Eli's health issues (he is an old bunny) and the issue of replication of scientific studies came up.  Of course this was a hot thing about two months ago when Science published a paper showing that only 39 of 100 experiments published in hot psych journals could be replicated.

Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results.
It is hard to disagree with the conclusion:
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
Rabett Run would like to add some things to this.  A test to reject the null hypothesis (OK you Bayesians, sit down, you can have your turn in the barrel comments) at P < .05 is asking for trouble.  That means, roughly speaking, that one out of twenty times a true null will come up significant just on the basis of statistics.

P < .05 is not a strong test.  In an experiment unconstrained by underlying theory or previous work, it is a dangerous place to be, especially in the environment of glamour magazine publishing, where as the authors point out novelty and press releases are the game.
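A quick way to feel what that one-in-twenty means is to simulate experiments where the null is true by construction.  A minimal sketch in plain Python (the sample size and experiment count below are arbitrary choices, not from any study):

```python
import math
import random

random.seed(0)

def p_two_sided(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

n_experiments = 10_000
n = 30                     # observations per "experiment"
false_positives = 0

for _ in range(n_experiments):
    # The null is true by construction: pure noise, mean 0, known sigma 1,
    # so this is an exact z-test of the sample mean against zero.
    data = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(data) / n) / (1 / math.sqrt(n))
    if p_two_sided(z) < 0.05:
        false_positives += 1

print(false_positives / n_experiments)  # hovers around 0.05
```

Roughly one "discovery" in twenty, out of data that contains nothing at all.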

CERN required P < 3×10⁻⁷ (the five sigma standard) before claiming the discovery of the Higgs boson, and they had theory on their side.  They also had a boatload of money.

Eli's second point is that one (journal editors to the front please) should establish a sliding scale of acceptable P values, with P < .05 used only for cases where there is ironclad (as in gravity and the greenhouse effect) theoretical backing for the outcome of the experiment.  Where the theory is novel, a smaller P value should be required.  Experiments (or surveys) that refer to previous experimental results to establish reasonableness should also require smaller P values.

Blue sky territory and stuff that says that Newton had it all wrong should only be put forward as teasers, not claims, unless the results are at least in 3 and higher sigma land.  Then, of course, there is the issue of the number of tails on your beast.  Yes, there is an element of art here, but us experimentalists ARE artists.
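Converting between sigma land and P values takes nothing fancier than the complementary error function.  A small sketch for a standard-normal test statistic, showing both one- and two-tailed beasts:

```python
import math

def p_one_tailed(sigma):
    """Upper-tail probability beyond `sigma` standard deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

def p_two_tailed(sigma):
    """Probability of landing beyond plus-or-minus `sigma`, either tail."""
    return math.erfc(sigma / math.sqrt(2))

print(f"{p_one_tailed(5):.2e}")  # ≈ 2.87e-07, the five sigma discovery standard
print(f"{p_one_tailed(3):.2e}")  # ≈ 1.35e-03, the three sigma teaser
print(f"{p_two_tailed(2):.3f}")  # ≈ 0.046, two sigma barely clears P < .05
```

Note how much work the tail choice does: the same two sigma result is roughly P = .023 one-tailed but .046 two-tailed.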


Unknown said...

I don't think Fisher would be happy with a fixed threshold of 0.05 either, he wrote:

"It is open to the experimenter to be more or less exacting in respect to the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. ... It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results."

In other words 0.05 is a convenient threshold, but that doesn't mean it is universally appropriate as the "Null ritual" would have it. The self-skepticism of "willing to admit that his observations have demonstrated a positive result" is spot-on!

It is also interesting that he seems to regard this as a means of eliminating hypotheses too easily explained by random noise, but does not "accept" those where significance is achieved, nor conclude that a significant result owes nothing to random variation.

Choosing an appropriate threshold implements part of the purpose of a Bayesian prior. It seems somewhat ironic that some of the "subjectivity" that frequentism was supposed to exclude is actually still present in the test of significance, but it is there implicitly, and frequently ignored (which is why the tests can give "wrong" answers).

Fernando Leanme said...

CERN could be very demanding because they had a huge budget, and knew they could keep blasting away until enough rubble bounced. However, experimental criteria should be flexible and liberal when the results don't get people killed or cost a lot of money. If they do get people killed then we have to factor in the cost of a human life. This of course depends on whose life it is.

Unknown said...

David Colquhoun mentioned this problem in a post a couple of days ago. He published a (free access) paper on it last year. The important part that the p value does not capture is the 'power' of the test to detect an effect if it does exist. If this is low, then the proportion of positive results actually due to a real effect would correspond very well with the quoted reproducibility figures.
It's the same problem as screening for rare medical conditions: even a very good test will give results dominated by false positives.
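The screening arithmetic behind that point is short enough to sketch. The prevalence and power numbers below are illustrative assumptions, not Colquhoun's own figures:

```python
# Of all hypotheses tested, suppose only 1% describe a real effect,
# and suppose the test has modest power. (Assumed numbers, for illustration.)
prior = 0.01   # fraction of tested hypotheses that are actually true
power = 0.8    # chance the test detects a real effect when one is present
alpha = 0.05   # significance threshold

true_positives = prior * power           # real effects, correctly detected
false_positives = (1 - prior) * alpha    # true nulls that come up "significant"

# Positive predictive value: of all "significant" results, how many are real?
ppv = true_positives / (true_positives + false_positives)
print(f"{ppv:.2f}")  # about 0.14: most "significant" results are false alarms
```

Even with a perfectly honest P < .05 test, the base rate of true hypotheses dominates the outcome, which is exactly the rare-disease screening problem.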

Kevin O'Neill said...

CERN also had 4200 systematic uncertainty parameters to consider. I've always wondered why physics had such a stringent requirement for 'belief' in the results. But if I had to model 4200 uncertainty parameters for a test I'd probably not even bother - with the test or the uncertainty calculation :)

Victor Venema said...

If someone was 10% sure of something big, I would not mind a publication about it, to give more scientists who are willing to take the 10% gamble that they are looking at noise the opportunity to understand something interesting.

Even in other sciences I do not really understand why they complain that replications are not publishable. When you build on other people's work, you immediately replicate it. You can do the same experiment together with a simplification of it, or you can do the same experiment and take the next step. What is the problem? And if in doing this you find you cannot reproduce the result, then you have found something new, which would be publishable, if only you were sure you did it right.

The main lesson is probably that we should get rid of the single-study syndrome in the mass media. Which is a problem because the media focusses on single studies because the "new study" is the event that makes something "newsworthy" in their community. Personally I really do not mind reading an article about the last 5 years of science studying phenomenon X. Let's not pretend that this bunny knew all the studies before this "new" "landmark" study and had already put them all in perspective.

jrkrideau said...

As someone who was trained in psychology I thought we did fairly well. :) Not good, of course, but people seem to forget the clinical trials studies of a few years ago.

Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers.

We actually seem to be doing as well or better than clinical trials; I do not find this encouraging.

It seems the problem of reproducibility may apply across a lot of areas in scientific research.

John Farley said...

One of the problems with clinical trials is that they are supposed to be double-blind, placebo-controlled. Alas, in many studies some of the participants can figure out whether they are getting the real drug or the placebo sugar pill. How do they figure it out? Because the real drug often produces dry mouth, headaches, dizziness, or other side effects, while the placebo pills don't.

For more details, check out the book by R. Barker Bausell, Snake Oil Science: The Truth about Complementary and Alternative Medicine.

John Farley said...

About the 5% cutoff: if you thought that an airplane or taxicab had only a 5% chance of crashing, would you get on board?

Anonymous said...

As someone who has actually been through the exercise of digging out a signal in particle physics there are a couple of things I'd like to point out:

1) Most important, particle physics is looking for very rare processes against a background of millions, billions or even trillions of events. Even before any event is recorded, it must pass one or more triggers that make it more likely to be "of interest". These triggers are motivated by the physics of the processes one is interested in, but nonetheless they introduce systematic errors.

2) Even then, the overwhelming majority of events recorded are "noise". One attempts to bring out the signal one is looking for against this noise by applying various physics-motivated criteria. With modern computing techniques, the number of plots one can generate is limited only by the phase space of variables and the creativity and integrity of the researcher.

3) Once one has a signal, one then attempts to "subtract" the background and assess the statistical significance of the resulting "bump" on the background. To do this, you have to model the background--and again there is plenty of room for introducing systematic error.

Given all this, if they did not demand extraordinary significance in the signal, no one would believe the result.

Thus, while I agree that a universal significance criterion of 0.05 is absurd, comparing the rest of science to particle physics is also absurd.
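Point 2 above, that the number of plots is limited only by the researcher's creativity, is the multiple-comparisons (look-elsewhere) trap, and a toy simulation makes it vivid. The bin counts and trial counts below are assumptions chosen for illustration:

```python
import random

random.seed(1)

def frac_with_3sigma_bump(n_bins, trials):
    """Fraction of pure-noise 'spectra' (n_bins independent standard-normal
    bins) whose largest upward fluctuation exceeds 3 sigma."""
    hits = 0
    for _ in range(trials):
        if max(random.gauss(0, 1) for _ in range(n_bins)) > 3:
            hits += 1
    return hits / trials

one_bin = frac_with_3sigma_bump(1, 2000)
many_bins = frac_with_3sigma_bump(500, 2000)
print(one_bin)    # rare: a 3 sigma bump in one predeclared bin
print(many_bins)  # common: look in 500 places and a 3 sigma bump is near a coin flip
```

A 3 sigma local excess means little once you have searched hundreds of mass bins, which is part of why a global five sigma standard is demanded before anyone claims discovery.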