April 20, 2005

Statistics term of the day: probabilities

Bet you thought I'd forgotten all about this, didn't you?

Let's talk probabilities, but first, let's talk about why they matter. Probability is a way of connecting populations with samples. Knowing the population distribution gives you some idea of what a sample will look like; if you walk into a room with 10 black cats and 1 white cat, and grab a cat at random, the probability is high that the cat will be black. Probabilities thus form a link between populations and samples, a link that we'll come back to later when we're going in the opposite direction. When we grab a cat from a room and it's a white cat, what inference can we make about the population of the room? That's where inferential statistics come in, and it's the probability link between samples and populations that allows us to make such inferences.

We'll save a lot of the heavy stuff for later and just talk about the basic terms today. The probability of a given outcome, in a situation in which more than one outcome is possible, is the fraction:

probability of X = (number of outcomes that are X) / (total number of possible outcomes)

This fraction or proportion is easy to calculate, and easy to understand. If you roll a fair six-sided die, the probability that it will show a 4 is 1/6, or about .17. (Here, 1/6 is the fraction, and .17 is the proportion. To get percentage form, you'd multiply .17 by 100 to get 17%. All forms are okay, but the proportion form is most often used.) If there are 8 cats in a room, and only two are black, the probability that one cat chosen at random is non-black is 6/8, or .75. It's also correct to phrase this question as, "What proportion of the cats in the room are non-black?"
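If you'd like to check the arithmetic yourself, here is a minimal Python sketch of the formula above. The function name `probability` is just something I made up for illustration, and it assumes every outcome is equally likely.

```python
from fractions import Fraction


def probability(favorable, total):
    """P(X) = (outcomes that are X) / (total possible outcomes).

    Hypothetical helper; assumes all outcomes are equally likely.
    """
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need 0 <= favorable <= total and total > 0")
    return Fraction(favorable, total)


# The two examples from the text: a 4 on a fair die, and a non-black cat
# in a room of 8 cats where 2 are black.
examples = [
    ("rolling a 4", probability(1, 6)),
    ("non-black cat", probability(6, 8)),
]

for label, p in examples:
    # Show the same value as a fraction, a proportion, and a percentage.
    print(f"{label}: fraction {p}, proportion {float(p):.2f}, percentage {float(p):.0%}")
```

Note that `Fraction` reduces 6/8 to 3/4 automatically; it's the same probability either way.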

Notice that I've been tossing the phrase "at random" in here quite a bit. That's because the formula above depends on the assumption that the die is fair, or that the coin you're tossing is fair, or that you are choosing cats in a random fashion. The formula above assumes that each observation in the population has an equal chance of being chosen, and that if you're taking more than one observation at a time, the probabilities stay constant from one selection to the next.

If a room has two black cats and 10 white ones, and I choose a cat at random, then the probability of choosing a black one is 2/12, or about .17. But if all the white cats are being quite loud, and I allow their persistent meows to sway me into choosing them, then the .17 probability won't be accurate, because my choices won't be completely random.
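To see how much a non-random choice can throw things off, here's a rough simulation sketch in Python. The 3-to-1 weighting toward the noisy white cats is purely an assumption for illustration (the example above doesn't say how strongly the meowing sways me), and the seed and number of trials are arbitrary.

```python
import random

rng = random.Random(2005)  # fixed seed so reruns match; the value is arbitrary

cats = ["black"] * 2 + ["white"] * 10
trials = 100_000

# Truly random choice: every cat has an equal chance of being grabbed.
fair_draws = [rng.choice(cats) for _ in range(trials)]
fair_share = fair_draws.count("black") / trials

# Swayed choice: ASSUME each loud white cat is 3 times as likely to be picked.
weights = [3 if cat == "white" else 1 for cat in cats]
swayed_draws = rng.choices(cats, weights=weights, k=trials)
swayed_share = swayed_draws.count("black") / trials

print(f"black cats, random choice: {fair_share:.3f}")
print(f"black cats, swayed choice: {swayed_share:.3f}")
```

Under that assumed weighting, the long-run share of black cats should settle near 2/32 rather than 2/12, so the simple formula no longer describes what actually happens.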

Another thing to consider when sampling is whether or not you are replacing observations in the population. If I reach into a cabinet that has 10 cans of tuna-flavored cat food and 5 cans of chicken-flavored, the probabilities are:

P(randomly choosing one can of tuna) = 10/15 or .67
P(randomly choosing one can of chicken) = 5/15 or .33

But suppose I reach in, select one can, then reach in and select another can. The probability of the second can now depends on what I took out on the first random draw, because I'm now sampling without replacement. If my first can is tuna, there are then 9 tuna and 5 chicken cans remaining, and my probabilities on the next random selection are:

P(randomly choosing one can of tuna) = 9/14 or .64
P(randomly choosing one can of chicken) = 5/14 or .36

The probability of choosing tuna just decreased from before (because we're short one) and the probability of choosing chicken just increased (because the 5 chicken cans are now a larger proportion of the population). However, if I sample with replacement - I select a can, then put it back, then select another can - then the probabilities stay constant.
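The cat food cabinet can be worked through in a few lines of Python. This is only a sketch; `draw_probabilities` is a hypothetical helper name, and it assumes each can in the cabinet is equally likely to be picked.

```python
from fractions import Fraction


def draw_probabilities(tuna, chicken):
    """Probability of each flavor on the next random draw (hypothetical helper)."""
    total = tuna + chicken
    return {"tuna": Fraction(tuna, total), "chicken": Fraction(chicken, total)}


first_draw = draw_probabilities(10, 5)

# Without replacement: the first can was tuna and stays out of the cabinet.
second_without = draw_probabilities(10 - 1, 5)

# With replacement: the can goes back, so the cabinet is unchanged.
second_with = draw_probabilities(10, 5)

for label, probs in [
    ("first draw", first_draw),
    ("second draw, without replacement", second_without),
    ("second draw, with replacement", second_with),
]:
    print(f"{label}: tuna {float(probs['tuna']):.2f}, chicken {float(probs['chicken']):.2f}")
```

The second draw only matches the first when the can goes back in the cabinet.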

Posted by kswygert at April 20, 2005 01:05 PM