One of my favorite studies of all time took the 50 most common ingredients from a cookbook and searched the literature for a connection to cancer: 72% had a study linking them to increased or decreased risk of cancer. (Here’s the link for the interested.)
Meta-analyses (studies examining multiple studies) quashed the effect pretty seriously, but how many of those single studies were probably reported on in multiple media outlets, permanently causing changes in readers’ dietary habits? (We know from studying juries that people are often unable to “forget” things that are subsequently proven false or misleading — misleading data is sticky.)
The phrase “statistically significant” is one of the more unfortunately misleading ones of our time. The word significant in the statistical sense (meaning distinguishable from random chance) does not carry the meaning it has in common parlance, where we use it for something that actually matters. Don’t worry, we will get to what that means.
Confusing the two is at the heart of a lot of misleading headlines, and it’s worth a brief look at why they don’t mean the same thing, so you can stop being scared that everything you eat or do is giving you cancer.
***
The term statistical significance is used to denote when an effect is found to be extremely unlikely to have occurred by chance. To make that determination, we have to propose a null hypothesis to be rejected. Let’s say we propose that eating an apple a day reduces the incidence of colon cancer. The “null hypothesis” here would be that eating an apple a day does nothing to the incidence of colon cancer: we’d be equally likely to get colon cancer whether or not we ate that daily apple.
When we analyze the data of our study, we’re technically not looking to say “Eating an apple a day prevents colon cancer”; that’s a bit of a misconception. What we’re actually doing is an inversion: we want the data to provide us with sufficient weight to reject the idea that apples have no effect on colon cancer.
And even when that happens, it’s not an all-or-nothing determination. What we’re actually saying is, “It would be extremely unlikely for the data we have, which shows a daily apple reduces colon cancer by 50%, to have popped up by chance. Not impossible, but very unlikely.” The world does not quite allow us to have absolute conviction.
How unlikely? The currently accepted standard in many fields is 5%: there is a less than 5% chance the data would come up this way randomly. That immediately tells you that even when an effect is pure chance, roughly 1 in every 20 tests will still clear the bar, but alas that is where we’re at. (The problem with the 5% threshold, and the associated problem of p-hacking, have been the subject of some intense debate, but we won’t deal with that here.)
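If you like seeing the machinery, here’s a minimal sketch in Python of what that 5% cutoff means, using made-up numbers for the apple example: simulate a world where the null hypothesis is true, then count how often a result at least as extreme as ours shows up anyway. The group sizes, the 5% cancer incidence, and the 1.5-point observed difference are all hypothetical.

```python
# A minimal sketch of the p-value logic, with made-up numbers: simulate many
# studies in a world where apples do nothing, and ask how often a gap as big
# as the one we "observed" appears by chance alone.
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 2000
base_rate = 0.05          # hypothetical cancer incidence when apples have no effect
observed_diff = 0.015     # pretend our study saw 1.5 points fewer cases in the apple group

sims = 100_000
control = rng.binomial(n_per_group, base_rate, sims) / n_per_group
apples = rng.binomial(n_per_group, base_rate, sims) / n_per_group

# Fraction of null-world studies that look at least as good as ours:
p_value = np.mean((control - apples) >= observed_diff)
print(f"Simulated p-value: {p_value:.4f}")   # comes out below 0.05 here
```

The fraction of null-world simulations that beat our observed result is, roughly, the p-value; if it falls under 5%, we call the result statistically significant.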
We’ll get to why “significance can be insignificant,” and why that’s so important, in a moment. But let’s make sure we’re fully on board with the importance of sorting chance events from real ones with another illustration, this one outlined by Jordan Ellenberg in his wonderful book How Not to Be Wrong. Pay close attention:
Suppose we’re in null hypothesis land, where the chance of death is exactly the same (say, 10%) for the fifty patients who got your drug and the fifty who got [a] placebo. But that doesn’t mean that five of the drug patients die and five of the placebo patients die. In fact, the chance that exactly five of the drug patients die is about 18.5%; not very likely, just as it’s not very likely that a long series of coin tosses would yield precisely as many heads as tails. In the same way, it’s not very likely that exactly the same number of drug patients and placebo patients expire during the course of the trial. I computed:
13.3% chance equally many drug and placebo patients die
43.3% chance fewer placebo patients than drug patients die
43.3% chance fewer drug patients than placebo patients die
Seeing better results among the drug patients than the placebo patients says very little, since this isn’t at all unlikely, even under the null hypothesis that your drug doesn’t work.
But things are different if the drug patients do a lot better. Suppose five of the placebo patients die during the trial, but none of the drug patients do. If the null hypothesis is right, both classes of patients should have a 90% chance of survival. But in that case, it’s highly unlikely that all fifty of the drug patients would survive. The first of the drug patients has a 90% chance; now the chance that not only the first but also the second patient survives is 90% of that 90%, or 81%–and if you want the third patient to survive as well, the chance of that happening is only 90% of that 81%, or 72.9%. Each new patient whose survival you stipulate shaves a little off the chances, and by the end of the process, where you’re asking about the probability that all fifty will survive, the slice of probability that remains is pretty slim:
(0.9) x (0.9) x (0.9) x … fifty times! … x (0.9) x (0.9) = 0.00515 …
Under the null hypothesis, there’s only one chance in two hundred of getting results this good. That’s much more compelling. If I claim I can make the sun come up with my mind, and it does, you shouldn’t be impressed by my powers; but if I claim I can make the sun not come up, and it doesn’t, then I’ve demonstrated an outcome very unlikely under the null hypothesis, and you’d best take notice.
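If you want to check Ellenberg’s arithmetic for yourself, here’s a short Python snippet (assuming SciPy is available) that reproduces the numbers in the quote. The two arms of fifty patients and the 10% death rate are the hypothetical setup from the quote, not real data.

```python
# Reproducing the numbers from Ellenberg's example: two arms of 50 patients,
# each patient with a 10% chance of dying under the null hypothesis.
from scipy.stats import binom

n, p = 50, 0.10
pmf = [binom.pmf(k, n, p) for k in range(n + 1)]   # P(exactly k deaths in one arm)

p_equal = sum(q * q for q in pmf)        # both arms see the same number of deaths
p_one_side = (1 - p_equal) / 2           # by symmetry, each arm "wins" equally often

print(f"P(exactly 5 of 50 die)       = {binom.pmf(5, n, p):.3f}")   # ~0.185
print(f"P(equal deaths in both arms) = {p_equal:.3f}")              # ~0.133
print(f"P(fewer drug deaths)         = {p_one_side:.3f}")           # ~0.433
print(f"P(all 50 drug patients live) = {0.9 ** 50:.5f}")            # ~0.00515, about 1 in 200
```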
So you see, all this null hypothesis stuff is pretty important, because what you want to know is whether an effect is really “showing up” or whether it just popped up by chance.
A final illustration should make it clear:
Imagine you were flipping coins with a particular strategy for getting more heads, and after 30 flips you had 18 heads and 12 tails. Would you call it a miracle? Probably not — you’d realize immediately that it’s perfectly possible for an 18/12 ratio to happen by chance. You wouldn’t write an article in U.S. News and World Report proclaiming you’d figured out coin-flipping.
Now let’s say instead you flipped the coin 30,000 times and got 18,000 heads and 12,000 tails… well, then your case for statistical significance would be pretty tight. It would be all but impossible to get that result by chance — your strategy must have something to it. The null hypothesis of “My coin flipping technique is no better than the usual one” would be easy to reject! (The p-value here would be orders of magnitude less than 5%, by the way.)
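To put actual p-values on those two scenarios, here’s a quick check using SciPy’s exact binomial test (the coin-flipping “strategy” is of course hypothetical; we’re just asking how surprising each heads count would be for a fair coin).

```python
# How surprising is each result if the coin is actually fair?
from scipy.stats import binomtest

small = binomtest(18, 30, p=0.5, alternative="greater")         # 18 heads in 30 flips
large = binomtest(18_000, 30_000, p=0.5, alternative="greater") # 18,000 heads in 30,000 flips

print(f"p-value for 18 of 30:         {small.pvalue:.3f}")   # ~0.18, easily chance
print(f"p-value for 18,000 of 30,000: {large.pvalue:.3e}")   # vanishingly small
```

A fair coin produces 18-of-30 all the time; it essentially never produces 18,000-of-30,000.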
That’s what this whole business is about.
***
Now that we’ve got this idea down, we come to the big question that statistical significance cannot answer: Even if the result is distinguishable from chance, does it actually matter?
Statistical significance cannot tell you whether the result is worth paying attention to — even if you get the p-value down to a minuscule number, increasing your confidence that what you saw was not due to chance.
In How Not to Be Wrong, Ellenberg provides a perfect example:
A 1995 study published in a British journal indicated that a new birth control pill doubled the risk of venous thrombosis (a potentially deadly blood clot) in its users. Predictably, 1.5 million British women freaked out, and some meaningfully large percentage of them stopped taking the pill. In 1996, 26,000 more babies were born than the previous year, and there were 13,600 more abortions. Whoops!
So what, right? Lots of mothers’ lives were saved, right?
Not really. The initial probability of a woman getting a venous thrombosis on any old birth control pill was about 1 in 7,000, or roughly 0.014%. That means the “Killer Pill,” even if it was indeed doubling thrombosis risk, only increased that risk to 2 in 7,000, or roughly 0.03%! Is that worth rearranging your life for? Probably not.
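For the numbers-minded, here is the same point in a few lines of Python; the 1-in-7,000 baseline is the figure quoted above, and the rest is just arithmetic.

```python
# Relative vs. absolute risk, using the rough numbers from the pill scare.
baseline = 1 / 7000      # risk of venous thrombosis on the old pill (~0.014%)
new = 2 / 7000           # risk on the "killer" pill (~0.029%)

relative_increase = new / baseline   # 2.0 -- "the risk has doubled!"
absolute_increase = new - baseline   # the extra risk each woman actually takes on

print(f"Relative risk:               {relative_increase:.1f}x")
print(f"Absolute risk increase:      {absolute_increase:.5%}")           # ~0.014 percentage points
print(f"Extra cases per 7,000 women: {absolute_increase * 7000:.0f}")    # 1
```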
Ellenberg makes the excellent point that, at least in the case of health, the null hypothesis is unlikely to be exactly right in most cases! The body is a complex system — of course what we put in it affects how it functions in some direction or another. The true effect is unlikely to be absolute zero.
But numerical and scale-based thinking, indispensable for anyone looking to not be a sucker, tells us that we must distinguish between small and meaningless effects (like the connection between almost all individual foods and cancer so far) and real ones (like the connection between smoking and lung cancer).
And now we arrive at the problem of “significance” — even if an effect is really happening, it still may not matter! We must learn to be wary of “relative” statistics (e.g., “the risk has doubled”), and look to favor “absolute” statistics, which tell us whether the thing is worth worrying about at all.
So we have two important ideas:
A. Just like coin flips, many results are perfectly possible by chance. We use the concept of “statistical significance” to figure out how likely it is that the effect we’re seeing is real and not just a random illusion, like seeing 18 heads in 30 coin tosses.
B. Even if it is really happening, it still may be unimportant – an effect so insignificant in real terms that it’s not worth our attention.
Together, these two ideas should raise our level of skepticism when we hear about groundbreaking new studies! (A third and equally important issue is that correlation is not causation, a common problem in many fields of science, including nutritional epidemiology. Just because x is associated with y does not mean that x is causing y.)
Tread carefully and keep your thinking cap on.
***
Still Interested? Read Ellenberg’s great book to get your head working correctly, and check out our posts on Bayesian updating, another very useful statistical tool, and learn a little about how we distinguish science from pseudoscience.