March 09, 2005

Statistics Term of the Day: Measures of Variability

Variety is the spice of life, and variability is the essence of statistics. Why crunch numbers on anything? Why not just assume everyone is the same? Because we know they're not the same, but we don't necessarily know just how different everyone is. That's where variability comes in. Variability, in a statistical sense, is a quantitative measure of how close together - or spread out - a distribution of scores is. In our last lesson, we discovered ways to understand where the representative score in a distribution lies, but while the mean, median, and mode tell us something about the most representative point in the data, they tell us nothing about how the scores vary around that representative point.

Thus, measures of variability (or spread) go hand in hand with measures of central tendency, and you need at least these two measures to get a picture of what a distribution actually looks like.

Let's go from simplest to most complex. First, there's the range - crude, but easy to calculate. With observations as whole numbers, the range is (highest score - lowest score) + 1. (With non-integers as observations, you have to be concerned with upper and lower limits, but we'll skip that for now.) Note that the following two groups have the same range, but the distributions are very different:

Group 1 - 10, 10, 10, 10, 2
Group 2 - 10, 8, 6, 4, 2
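A quick sketch of that calculation in Python, using the two groups above (the function name is just mine):

```python
# Range for whole-number scores, as defined above: (highest - lowest) + 1.
def whole_number_range(scores):
    return max(scores) - min(scores) + 1

group1 = [10, 10, 10, 10, 2]
group2 = [10, 8, 6, 4, 2]

print(whole_number_range(group1))  # 9
print(whole_number_range(group2))  # 9 - same range, very different spread
```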

The range can be divvied up. The interquartile range is the (75th percentile score) - (25th percentile score), answering the question of what the spread is in the middle of the data (useful when there are outliers).
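Here's one way to compute it with Python's standard library. The scores are made up; note that `statistics.quantiles` supports a couple of percentile conventions, and the "inclusive" one is used here:

```python
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data with one outlier

# quantiles(n=4) returns the three quartile cut points;
# the IQR is the third cut point minus the first.
quartiles = statistics.quantiles(scores, n=4, method="inclusive")
iqr = quartiles[2] - quartiles[0]
print(iqr)  # 4.0 - the outlier (50) doesn't budge it
```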

What you'll most often see used to describe variability is the standard deviation of a distribution. This is a nonnegative quantity that approximates the average distance of scores from the mean. So the mean is the representative value, and the standard deviation is the representative distance of any one point in the distribution from the mean.

Let's skip back to just the term deviation. Let's say we have a distribution with a mean of 100. You have a score of 90. Your deviation from the mean is thus -10. Your friend, with a score of 105, has a deviation of +5. If I add up the deviations of everyone in the distribution from the mean, I'll get zero (that's part of the definition of the mean, in fact.) So adding these deviations up doesn't get us anywhere, yet.
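You can check this with a few hypothetical scores whose mean is 100:

```python
import statistics

scores = [90, 105, 100, 95, 110]  # hypothetical scores; the mean is 100
mu = statistics.mean(scores)

deviations = [x - mu for x in scores]
print(deviations)       # [-10, 5, 0, -5, 10]
print(sum(deviations))  # 0 - deviations from the mean always cancel out
```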

Let's get rid of the + and - signs by squaring every deviation. If we add those squared deviations up, we have the sum of squares (a very important concept in both descriptive and inferential statistics). If we divide by the number of observations in our distribution, we get what's called the variance. (For a sample, you'd usually divide by one less than the number of observations, but we'll skip that detail here.) The variance gives us the representative squared distance from the mean, which is not that useful for descriptive statistics.

So take the square root of the variance, and you get the standard deviation (often abbreviated SD, or just s). It's in the original units of whatever your distribution was, so it's easy to interpret. If a distribution has a mean of 100 points and a standard deviation of 5, then the representative deviation from the mean in that distribution is 5 points.
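The whole chain - deviations, sum of squares, variance, standard deviation - is short enough to write out by hand. The scores are made up, and the check against `statistics.pstdev` (the library's population standard deviation) is just a sanity test:

```python
import math
import statistics

scores = [95, 100, 100, 105, 100]  # hypothetical distribution; mean is 100
n = len(scores)
mu = sum(scores) / n

sum_of_squares = sum((x - mu) ** 2 for x in scores)  # squared deviations, added up
variance = sum_of_squares / n                        # representative squared distance
sd = math.sqrt(variance)                             # back to the original units

print(variance)  # 10.0
print(math.isclose(sd, statistics.pstdev(scores)))  # True
```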

Because the standard deviation is an average, it's affected by outliers - those extreme scores on either tail of the distribution. This means that when you have a distribution for which the mean isn't appropriate - like income, or number of children - the standard deviation won't be too useful either. The interquartile range, on the other hand, nicely complements the median in these situations. Just as with measures of central tendency, the fact that you can compute the standard deviation for skewed data doesn't mean you should.
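To see the effect, compare the two measures on some made-up incomes with a single extreme earner:

```python
import statistics

incomes = [30, 35, 40, 45, 50, 1000]  # hypothetical incomes, in thousands

sd = statistics.pstdev(incomes)
quartiles = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = quartiles[2] - quartiles[0]

print(sd)   # enormous - dominated by the one outlier
print(iqr)  # modest - the middle of the data is still tightly packed
```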

(You can also calculate the average absolute deviation and the median absolute deviation - the average of the absolute, unsquared deviations from the mean, and the median of the absolute deviations, usually taken around the median. These are less affected by outliers than the standard deviation. Thanks to Raina for pointing me down this path.)
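Both are easy to sketch with the standard library (the data is made up, and the MAD here is taken around the median, per the usual convention):

```python
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data with one outlier
mu = statistics.mean(scores)

# Average absolute deviation from the mean:
aad = sum(abs(x - mu) for x in scores) / len(scores)

# Median absolute deviation, taken around the median:
med = statistics.median(scores)
mad = statistics.median(abs(x - med) for x in scores)

print(aad)  # under 9, even with the 50 in there
print(mad)  # 2 - barely notices the outlier
```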

And again, always look at your data (image borrowed from Dr. Gaten's online course):


Distributions A and B have the same mean, but different standard deviations. B's spread is smaller, so its variance and standard deviation are smaller, too. A and C have the same variability, but different means. Note that A and C overlap, so some people in A have higher scores than some in C, even though C's mean is greater than A's.

If you understand all of this, you're ahead of some of the people who criticized Harvard President Larry Summers's infamous comments about male and female scientific ability (link goes to a site that defends Summers). Many of Summers's critics immediately assumed that he was saying all men are smarter than all women, or that no women have the ability to become scientists and engineers. These statements could only be made by people who do not understand distributions, or even basic statistics.

As Slate so nicely described Summers's comments:

It isn't a claim about overall intelligence. Nor is it a justification for tolerating discrimination between two people of equal ability or accomplishment. Nor is it a concession that genetic handicaps can't be overcome. Nor is it a statement that girls are inferior at math and science: It doesn't dictate the limits of any individual, and it doesn't entail that men are on average better than women at math or science. It's a claim that the distribution of male scores is more spread out than the distribution of female scores—a greater percentage at both the bottom and the top. Nobody bats an eye at the overrepresentation of men in prison. But suggest that the excess might go both ways, and you're a pig.

I don't know what the population distributions look like for the abilities Summers was describing. But if they looked like B (women) and A (men), and there were more men in fields that required this ability, it would be clear why.

Posted by kswygert at March 9, 2005 06:03 PM