March 11, 2005

Statistics Term of the Day: Populations vs. Samples

How many kitties are there in the United States? According to the Humane Society, 60 million live in homes, and National Geographic says that 70 million feral cats are roaming the streets. If we assume all cats are either in homes or on the street, and toss in an extra 2 million for those temporarily housed in shelters, then we have a population of cats in the United States that's around 132 million, give or take a few fuzzbutts.

In statistics, population has a very specific meaning - it's the entire collection of scores/observations/cats of interest in a particular study. It can be very large, or it can be very small (if I were interested in studying just cats who live on my block, and not the entire US). But it's whatever I'm interested in, and it's whatever I'd like to be able to generalize to with my sample.

Samples are just subsets of the population. Samples are intended to represent the population; usually it's the sample on which you crunch all your numbers. Samples can also range from very large to very small. All else being equal, larger is better; the closer the sample size gets to the population size, the more likely your sample statistics will be close to the corresponding population values, and the better an inference you can make from your sample to the population.
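
To see the "larger is better" idea in action, here is a rough Python sketch. The population is entirely made up (100,000 simulated cat weights, with a mean and spread I picked out of thin air), so treat the numbers as an illustration rather than real kitty data. It draws random samples of several sizes and checks how far, on average, the sample mean lands from the population mean.

```python
import random
import statistics

rng = random.Random(2005)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 cat weights in pounds (invented numbers).
population = [rng.gauss(10, 2.5) for _ in range(100_000)]
population_mean = statistics.fmean(population)

def average_miss(sample_size, trials=100):
    """Average absolute gap between a sample mean and the population mean."""
    misses = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # simple random sample
        misses.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(misses)

# Try progressively larger samples from the same population.
results = {n: average_miss(n) for n in (10, 100, 1000, 10000)}
for n, miss in results.items():
    print(f"n = {n:>6}: sample mean misses by about {miss:.3f} lb on average")
```

The average miss should shrink as the sample grows, which is the whole point of wanting a bigger sample.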

Values used to describe populations are called parameters, while values used to describe samples are called statistics. When we calculate descriptive statistics on a sample, sometimes we are interested in just that sample, but more often we are interested in making inferences about the population parameters.

Perhaps what you want to know about American cats is their weight, their eye color, or their number of stripes. You can't possibly take measurements of every cat, but you can take a sample of cats that is as large and representative as possible. If you're interested in knowing weights, you'd want to be sure your sample included spayed and non-spayed cats, old and young cats, and cats of both sexes - or you might want individual samples of all these groups. Perhaps kitties in Arizona have more stripes than those in New York; you'd want to make sure you got samples across geographic regions.

This representative sample problem is one that you often see in studies of education and testing. Earlier this week, we saw a columnist try to infer from a sample (of Bates College) that the SAT was not useful for the population (of universities in the United States). Not only is that not a large enough sample, but one could argue that, even if every small, private, liberal-arts college found the same results, the results do not generalize to big state schools.

No matter how representative a sample is (as long as it's not equal to the population), the measurements you obtain from it will not likely be the same as what you'd get from measuring the entire population. That difference between sample statistics and population parameters is called sampling error. Sampling error is affected by sample size and characteristics of the sample, and can be random or systematic. One way to combat systematic sampling error is to use random sampling, in which each observation in the population has an equal chance of being selected for the sample.
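
Here is a small Python sketch of the difference between random and systematic sampling error. The cat counts are the ones from the top of this post, scaled down to one simulated cat per thousand real ones; the weight distributions for house, feral, and shelter cats are pure invention on my part, chosen only so that the groups differ. One set of samples is drawn at random from everybody, and the other is drawn only from house cats (imagine recruiting at a vet's office).

```python
import random
import statistics

rng = random.Random(311)  # fixed seed so the run is repeatable

# Hypothetical population, one simulated cat per 1,000 real ones.
# The weight distributions below are invented for illustration only.
house = [rng.gauss(11.0, 2.5) for _ in range(60_000)]
feral = [rng.gauss(8.5, 2.0) for _ in range(70_000)]
shelter = [rng.gauss(9.5, 2.0) for _ in range(2_000)]
population = house + feral + shelter
population_mean = statistics.fmean(population)

def sampling_errors(pool, sample_size=1000, trials=50):
    """Sample mean minus population mean, for repeated samples drawn from pool."""
    return [
        statistics.fmean(rng.sample(pool, sample_size)) - population_mean
        for _ in range(trials)
    ]

random_errors = sampling_errors(population)  # every cat is equally likely to be picked
house_only_errors = sampling_errors(house)   # feral and shelter cats can never be picked

print(f"random sampling: average error {statistics.fmean(random_errors):+.3f} lb")
print(f"house cats only: average error {statistics.fmean(house_only_errors):+.3f} lb")
```

Under these made-up assumptions, the random samples miss in both directions and the misses roughly cancel out, while the house-cat-only samples miss in the same direction every time. Taking a bigger house-cat sample would not fix that.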

Our last topic is about bias in estimation. Let's say I have a magic wand that makes 1000 random kitties from all over the US appear in my laboratory. I can weigh each one, and calculate a mean and standard deviation of the weights. My goal is to make an inference about what the mean and standard deviation of weights are for all the kitties in the US.

When I calculate the mean of my sample, I have what's called an unbiased estimate of the mean of my population. This means that my sample mean does not consistently over- or under-estimate my population mean, and thus the sampling error is more likely to be random. Variability, though, is different. The formula I provided here is for the standard deviation of a population. However, if I were to use that formula on a sample, I'd get a measure that is biased, and that will systematically underestimate our population standard deviation.

So we correct for that bias by modifying our formula for the sample standard deviation. We still subtract each kitty's weight from the mean weight, square those deviations, and sum those up. But instead of dividing by the total number of kitties in our sample, we'll divide by the number of kitties minus 1. This decreases the denominator of our formula, resulting in an increased variance that no longer systematically underestimates the population variance (and, when we take the square root of that, an increased sd that comes much closer to the population standard deviation).
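
If you'd like to watch the correction work, here is a Python sketch under the same kind of made-up assumptions as before (a simulated population of 100,000 cat weights, with numbers I invented). It draws lots of very small samples, computes the variance both ways, and compares the long-run average of each to the true population variance. The sample size of 5 is deliberately tiny, because the bias is easiest to see when the sample is small.

```python
import random
import statistics

rng = random.Random(1000)  # fixed seed so the run is repeatable

# Hypothetical population of cat weights in pounds (invented numbers).
population = [rng.gauss(10, 2.5) for _ in range(100_000)]
population_variance = statistics.pvariance(population)

def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by the given denominator."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values) / denominator

n = 5            # kitties per sample, kept tiny so the bias is easy to see
trials = 20_000  # number of repeated samples
divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = rng.sample(population, n)
    divide_by_n.append(variance(sample, n))              # population formula
    divide_by_n_minus_1.append(variance(sample, n - 1))  # corrected sample formula

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)
print(f"population variance:         {population_variance:.3f}")
print(f"average when dividing by n:  {avg_n:.3f}")
print(f"average when dividing by n-1: {avg_n_minus_1:.3f}")
```

Dividing by n should come out noticeably too small on average, while dividing by n - 1 should land close to the population variance. Python's standard library offers both versions: `statistics.pvariance` and `statistics.pstdev` divide by n, and `statistics.variance` and `statistics.stdev` divide by n - 1.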

Now if you'll excuse me, I have a lot of kitties to feed (999, to be exact).

Posted by kswygert at March 11, 2005 02:08 PM