# Statistics and experimentation


Posted by: stak
Posted on: 2011-01-14 18:19:53

[This blog post is adapted from part of a presentation I gave in class a few days ago, on a topic I've been thinking about for a while.]

Recently there was a big brouhaha because of some research by Dr. Bem at Cornell. He's a psychologist who ran a series of nine experiments, eight of which appeared to show that precognition is possible. He's a respected researcher, and the paper was peer-reviewed and is scheduled to be published soon in a prominent journal. A lot of people disagree with his conclusions for obvious reasons, and there's been a lot of discussion about how to interpret his results and whether his methodology or analysis was flawed or biased. I particularly like this rebuttal (PDF) of his approach.

Another example of a similar nature is the book Good Calories, Bad Calories, which I was reading not too long ago (but didn't finish). It looks at a lot of studies in the field of nutrition and rips apart a lot of them. Most of these studies are not reproducible or even contradict each other, and often have conclusions that are not supported by the data.

The point that I'm trying to make is that when it comes to using statistics to analyze data, there is almost no consensus on how to do it correctly, despite the fact that we've been doing it for decades. It's pretty absurd, if you ask me. There are all sorts of pitfalls that people regularly fall into, such as Simpson's Paradox, simply because it's unclear which of the variables changed in the experiment are relevant to the outcome and which are not.
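To make Simpson's Paradox concrete, here's a small sketch using the classic kidney-stone treatment numbers (Charig et al.): treatment A has a better success rate within each subgroup, yet a worse rate when the subgroups are pooled, because the subgroup sizes are lopsided.

```python
# Simpson's Paradox: A beats B in every subgroup, yet B "wins" overall.
cases = {
    # (treatment, stone size): (successes, trials)
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(treatment, size=None):
    """Success rate, optionally restricted to one subgroup."""
    keys = [k for k in cases if k[0] == treatment and (size is None or k[1] == size)]
    successes = sum(cases[k][0] for k in keys)
    trials = sum(cases[k][1] for k in keys)
    return successes / trials

for size in ("small", "large"):
    print(f"{size}: A={rate('A', size):.0%}  B={rate('B', size):.0%}")
print(f"overall: A={rate('A'):.0%}  B={rate('B'):.0%}")

# A is better for small stones AND for large stones...
assert rate("A", "small") > rate("B", "small")
assert rate("A", "large") > rate("B", "large")
# ...yet pooling the data makes B look better.
assert rate("B") > rate("A")
```

Whether "subgroup" or "overall" is the right view depends entirely on whether stone size is a relevant variable, which is exactly the thing the raw numbers don't tell you.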

Take a simple example - that of the boiling point of water. The value of the boiling point is a function of a number of factors, like the atmospheric pressure and salinity of the water. However, it's not a function of other things, such as the heat source that's used to heat up the water. If you aren't aware of which variables affect the results and which do not, you might do something like run a few trials at sea level and run a few trials on top of a mountain, and then average (or more generally, statistically analyze) the results to get a final answer.

But of course, if you average measurements taken at different atmospheric pressures, you get a value that's garbage. It reflects neither the boiling point at sea level nor the boiling point on the mountain. It corresponds to the boiling point at some pressure in between, but only because the boiling point is a monotonic function of pressure. If it were some other kind of function, the average would truly be just a nonsense number, even though it looks like a real result.
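Here's a rough sketch of that, using the Antoine equation for water (coefficients for temperature in °C and pressure in mmHg, valid roughly from 1 to 100 °C); the specific mountain pressure is just an illustrative value:

```python
import math

# Antoine equation coefficients for water (T in deg C, P in mmHg).
A, B, C = 8.07131, 1730.63, 233.426

def boiling_point_c(pressure_mmhg):
    """Temperature at which water's vapor pressure equals ambient pressure."""
    return B / (A - math.log10(pressure_mmhg)) - C

sea_level = boiling_point_c(760)  # ~100 deg C at 1 atm
mountain = boiling_point_c(525)   # ~90 deg C at a rough mountain pressure
avg = (sea_level + mountain) / 2  # ~95 deg C: the "result" of pooling trials

# The average corresponds to some intermediate pressure where no
# measurement was actually taken -- it describes neither location.
print(f"sea level: {sea_level:.1f}, mountain: {mountain:.1f}, average: {avg:.1f}")
```

The averaged number looks perfectly plausible, which is exactly what makes it dangerous.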

This is a trivial example and has very few variables. But a lot of the sciences that deal with human subjects do this all the time. Examples abound in psychology, medicine, nutrition, and of course, software engineering. For example, consider the classic software experiment to find out if technology A is better than technology B. You get a bunch of programmers, make sure they are trained equally on A and B, and have them sit down and do a task. Then you average the results from A and average the results from B and compare the two, and conclude that A (or B) is better. But the huge flaw in any experiment of this kind is that the thing you're measuring (the final code produced) is a function of both the technology (A or B) and the mind of the programmer. And the programmer's mind is a HUGE variable, a function of all sorts of things like education and experience and social influence and genetics.
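A toy simulation (all the numbers here are invented) shows how badly this can go. Suppose technology A genuinely saves an hour on a task, but programmer skill varies by about ten hours; with small samples, the skill variance easily swamps the technology effect, and the experiment's verdict is close to a coin flip:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def task_hours(tech, skill):
    """Hours to finish the task: dominated by skill, nudged by technology."""
    base = 40 - skill                # skill varies by ~10 hours across people
    bonus = 1 if tech == "A" else 0  # the 1-hour effect we want to detect
    return base - bonus + random.gauss(0, 2)  # plus day-to-day noise

def run_experiment(n=5):
    """One A-vs-B study with n randomly assigned programmers per group."""
    a = [task_hours("A", random.gauss(0, 10)) for _ in range(n)]
    b = [task_hours("B", random.gauss(0, 10)) for _ in range(n)]
    return sum(a) / n < sum(b) / n   # did A "win" this study?

wins = sum(run_experiment() for _ in range(1000))
# A is genuinely better, yet it wins only slightly more than half the time.
print(f"A wins {wins} of 1000 small experiments")
```

Each individual study here looks like a clean result, and roughly half of them reach the wrong conclusion.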

In the boiling water example, it doesn't make sense to average two measurements from different pressures. Instead, it's better to state the result as a function that takes pressure as input and returns the boiling point as the output. Similarly, I think that for the software experiments, it doesn't make sense to just average the results from different programmers. Instead, a better (although currently infeasible) approach would be to represent the programmer as a vector of traits, and to give a function that takes as input such a vector and returns as output whether A or B is better. The vector would have to include every trait that we determine to be relevant to the software engineering process (that is, whether it affects the code that programmers write), so determining exactly what traits should be included is probably impossible. However, if even a few of the main traits can be isolated, we can start getting results that approximate something meaningful, rather than just being nonsense that looks like a real result.
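The shape of that proposal can be sketched in code. The trait names and the decision rule below are entirely made up; the point is only the form of the answer: not a single winner, but a mapping from programmer traits to a winner.

```python
from typing import Dict

def better_technology(traits: Dict[str, float]) -> str:
    """Hypothetical: which technology a programmer with these traits does better with.

    `traits` maps invented trait names (e.g. normalized functional-programming
    experience) to scores in [0, 1]. The rule is a placeholder, standing in
    for whatever an actual trait-aware analysis would discover.
    """
    if traits.get("fp_experience", 0.0) > 0.5:
        return "A"  # imagine the data showed FP-heavy programmers do better with A
    return "B"

print(better_technology({"fp_experience": 0.8, "years": 0.3}))  # "A"
print(better_technology({"fp_experience": 0.2, "years": 0.9}))  # "B"
```

The hard part, of course, is not writing this function but discovering which traits belong in the vector at all.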

(c) Kartikaya Gupta, 2004-2024. User comments owned by their respective posters. All rights reserved.