Eli was moaning with a quack today about Eli's health issues (he is an old bunny), and the issue of replication of scientific studies came up. Of course, this was a hot thing about two months ago when Science published a paper showing that only 39 of 100 experiments published in hot psych journals could be replicated.
Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results. It is hard to disagree with the conclusion
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.

Rabett Run would like to add some things to this. A test that rejects the null hypothesis (OK, you Bayesians, sit down, you can have your turn in the comments) at P < .05 is not a strong test. In an experiment unconstrained by underlying theory or previous work, it is a dangerous place to be, especially in the environment of glamour magazine publishing where, as the authors point out, novelty and press releases are the game.
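To see why P < .05 alone is weak protection, here is a minimal simulation sketch. The numbers are illustrative assumptions, not figures from the Science paper: suppose only 10% of tested hypotheses are actually true, studies have 50% power, and only "significant" results get published. Then nearly half of the published findings are false positives, which would look a lot like a replication crisis.

```python
import random

random.seed(1)

# Assumed, illustrative parameters (not from the Science paper):
base_rate = 0.10   # fraction of tested hypotheses that are actually true
power = 0.50       # chance a true effect reaches P < .05
alpha = 0.05       # chance a null effect reaches P < .05 anyway

n = 100_000
published_true = published_false = 0
for _ in range(n):
    if random.random() < base_rate:
        # A real effect: detected with probability = power
        if random.random() < power:
            published_true += 1
    else:
        # A null effect: "significant" with probability = alpha
        if random.random() < alpha:
            published_false += 1

frac_false = published_false / (published_true + published_false)
print(f"fraction of 'significant' findings that are false: {frac_false:.2f}")
```

With these assumptions the analytic answer is 0.045 / (0.050 + 0.045) ≈ 0.47, so the simulation lands near there; crank the base rate or power down and it gets worse.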
CERN required a P < 3×10⁻⁷ (five sigma) before claiming the discovery of the Higgs boson, and they had theory on their side. They also had a boat load of money.
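For calibration, sigma thresholds and P values are two sides of the same coin: CERN's figure is the one-tailed tail area of a standard normal at five sigma. A quick sketch using only the standard library:

```python
import math

def one_tailed_p(z):
    """One-tailed P value: area of the standard normal beyond z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_tailed_p(z):
    """Two-tailed P value: area beyond +/- z sigma."""
    return math.erfc(z / math.sqrt(2))

for z in (2, 3, 5):
    print(f"{z} sigma: one-tailed P = {one_tailed_p(z):.2e}, "
          f"two-tailed P = {two_tailed_p(z):.2e}")
```

Note that P < .05 corresponds to only about two sigma (two-tailed), while five sigma one-tailed is about 2.9×10⁻⁷, hence the number of tails on your beast matters.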
Eli's second point is that one (journal editors to the front, please) should establish a sliding scale of acceptable P values, with P < .05 used only for cases where there is ironclad theoretical backing (as in gravity and the greenhouse effect) for the outcome of the experiment. Where the theory is novel, a smaller P value should be required. Experiments (or surveys) that refer to previous experimental results to establish reasonableness should also require smaller P values.
Blue sky territory, and stuff that says Newton had it all wrong, should only be offered as teasers, not claims, unless the results are at least in three-sigma-and-higher land. Then, of course, there is the issue of the number of tails on your beast. Yes, there is an element of art here, but us experimentalists ARE artists.