It is often said that there are lies, damned lies, and statistics, a phrase popularized by Mark Twain. We often do have the problem that we are bombarded with various collections of numbers to try to motivate what we buy, who we vote for, etc. There are methods to the madness, including methods to make statistics lie (also see here). Combine that with our own misunderstanding of probability, such as with the Monty Hall problem, and we fool ourselves.

Scientists are not immune to this, and statistics get misinterpreted all the time. In the course of research, many decisions are made, each of which can have its own justification, but which may bias the results. (Journals also have a bias toward publishing positive results, so we are more likely to see false positives in the news while the study that showed the lack of correlation is in some cabinet in Connecticut.) False positives can come about by chance, and statistics has a tool to deal with that, though it has been misunderstood even by many who have used it.

The main statistical tool you will see is something called the p-value. What it measures is the probability of finding a correlation at least as strong as the one you found if the null hypothesis is true, that is, if nothing is really going on and only chance is at work. A small p-value means your results would be surprising if chance alone were responsible. This has often been misunderstood to mean that (1-p) is the probability that your hypothesis is true; the p-value only tells you how easily chance could produce a result like yours, not how likely it is that chance is the explanation, and it says nothing about causality.
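One way to see what a p-value actually counts is a permutation test. The sketch below is only an illustration, not a method taken from any of the studies discussed here: the function names and the made-up data are my own assumptions. It shuffles one variable to destroy any real relationship, then asks how often chance alone produces a correlation at least as strong as the observed one.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Estimate P(|r| at least as large as observed | no real relationship)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # shuffling simulates the null hypothesis
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed ordering itself, so the estimate is never 0.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical data: y depends weakly on x, plus noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(30)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
print(permutation_p_value(xs, ys))
```

Note what the number is conditioned on: every shuffle assumes there is no relationship. The result is a statement about the data given chance, not about chance given the data.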

Another point with the p-value is statistical significance. A lot of papers will call something statistically significant if p is less than 0.05 (less than 5%). That line is rather arbitrary, and it means that when there is no real effect, about 1 in 20 studies will still cross it and report a false positive, which is rather terrible; in physics the thresholds are much stricter (the Higgs boson wasn’t declared discovered until the p-value was less than one in a million, and now it is more like one in a billion). But there is also confusion about what statistical significance means. It does not mean a result is significant in the sense that it is important, just that it would be improbable as a fluke, by an arbitrary standard. You can have a statistically significant result that makes almost no difference to reality; perhaps it is statistically significant that egg whites affect your cholesterol levels, but the effect can be minute.
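The 1-in-20 figure is easy to check with a simulation. In this rough sketch, every "study" compares two groups drawn from the same distribution, so any significant result is a false positive by construction. The group size, the number of studies, and the simplifying assumption of a known standard deviation of 1 (which allows a plain z-test) are my own choices for illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value


def two_sample_p(a, b):
    """Two-sided z-test p-value for a difference in means (assumes sigma = 1)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(n_studies=4000, n_per_group=30, alpha=0.05, seed=42):
    """Fraction of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Both groups come from the same population: the null is true.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_p(a, b) < alpha:
            hits += 1
    return hits / n_studies


print(f"false positive rate: {false_positive_rate():.3f}")
```

The printed rate should land close to 0.05, and lowering `alpha` lowers it in step, which is the trade physics makes with its much stricter thresholds.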

But even if you get past all that confusion, there can be issues with how researchers get their results just below the p = 0.05 mark. One study of psychology research published this year showed that there is a disproportionate number of studies with p-values just under 0.05. The graph below shows the expectation line from theory, and the data.

You can see a prominent peak at just about the 0.05 mark. That suggests there have been some adjustments to the data collected to make a paper publishable. (Some discussion here and here.) It doesn’t mean scientists are faking their results, just that they make small decisions that nudge the results past that arbitrary line. Because the false positive rate is already too high for my tastes, and journals are biased toward publishing positive results, that can become a problem.
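To get a feel for how an innocent-looking decision can do this, here is a sketch of one such habit, often called optional stopping: checking the p-value as data come in and stopping as soon as it dips below 0.05. This is only one hypothetical example of a biasing decision, not necessarily what happened in the studies above, and the sample sizes and the known-variance z-test are assumptions made for simplicity.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(sample):
    """Two-sided z-test p-value for mean = 0 (assumes sigma = 1)."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate(n_studies=2000, start=10, step=5, max_n=50, alpha=0.05, seed=1):
    """Return (fixed-n rate, peeking rate) of false positives under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_studies):
        # The true mean is 0, so every 'significant' result is a false positive.
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        # Honest analysis: test once, at the planned sample size.
        if two_sided_p(data) < alpha:
            fixed_hits += 1
        # Peeking: test after every batch and stop at the first p < alpha.
        looks = range(start, max_n + 1, step)
        if any(two_sided_p(data[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_studies, peeking_hits / n_studies


fixed_rate, peeking_rate = simulate()
print(f"test once: {fixed_rate:.3f}   peek and stop: {peeking_rate:.3f}")
```

Nobody in this simulation fabricates anything, yet the peeking strategy should report a noticeably higher false positive rate than the nominal 5%, because each extra look is another chance to cross the line.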

Now it looks like a solution is being worked out to find out whether results have been fiddled with. In a paper just published in *PLOS ONE*, a simulation has been done to give an idea of what fiddled-with data can look like, allowing both researchers and editors to test the hypothesis that statistical significance was reached through biased means. The authors note that for such a test to reach statistical significance itself, it will probably need a lot of results. So, it seems we can use statistics to police statistics.

Again, this is not the same as detecting fraud in academic work. A fraudster can probably manufacture their data to conform to a pattern. The best way to discover whether a result is false or fabricated is still replication, all the more so if the results seem to go against other scientific results (such as the Bem experiments from the last year or so). That’s science, and we have to do this if we want to make progress. You don’t want to build on a faulty foundation, so I’m glad to see our error detection keeps improving, with more and more sophisticated tools.