Bet you thought I'd forgotten all about this, didn't you?
Let's talk probabilities, but first, let's talk about why they matter. Probability is a way of connecting populations with samples. Knowing the population distributions gives you some idea of what a sample will look like; if you walk into a room with 10 black cats and 1 white cat, and grab a cat at random, the probability is high that the cat will be black. Probabilities thus form a link between populations and samples, a link that we'll come back to later when we're going in the opposite direction. When we grab a cat from a room and it's a white cat, what inference can we make about the population of the room? That's where inferential statistics come in, and it's the probability link between samples and populations that allows us to make such inferences.
We'll save a lot of the heavy stuff for later and just talk about the basic terms today. The probability of a given outcome, in a situation in which more than one outcome is possible, is the fraction:
probability of X = (outcomes that are X) / (total number of possible outcomes)
This fraction or proportion is easy to calculate, and easy to understand. If you roll a fair six-sided die, the probability that it will show a 4 is 1/6, or about .17. (Here, 1/6 is the fraction, and .17 is the proportion. To get percentage form, you'd need to multiply .17 by 100 to get 17%. All forms are okay, but the proportion form is most often used.) If there are 8 cats in a room, and only two are black, your probability of one cat chosen at random being non-black is 6/8, or .75. It's also correct to phrase this question as, "What proportion of the cats in the room are non-black?"
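Here's a minimal Python sketch of that fraction, assuming every outcome is equally likely. The function name probability is just an illustrative choice, and the numbers are the die and cat examples from above.

```python
from fractions import Fraction

def probability(favorable, total):
    """Probability of an outcome when every outcome is equally likely."""
    return Fraction(favorable, total)

p_four = probability(1, 6)       # rolling a 4 on a fair six-sided die
p_non_black = probability(6, 8)  # 6 of the 8 cats are not black

print(p_four, round(float(p_four), 2))   # 1/6 0.17
print(p_non_black, float(p_non_black))   # 3/4 0.75
```

Keeping the value as a fraction until the last step avoids rounding too early; the proportion is just the fraction written as a decimal.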
Notice that I've been tossing the phrase "at random" in here quite a bit. That's because the formula above depends on the assumption that the die is fair, or that the coin you're tossing is fair, or that you are choosing cats in a random fashion. The formula above assumes that each observation in the population has an equal chance of being chosen, and that if you're taking more than one observation at a time, there's a constant probability of each selection.
If a room has two black cats and 10 white ones, and I choose a cat at random, then the probability of choosing a black one is 2/12, or about .17. But if all the white cats are being quite loud, and I allow their persistent meows to sway me into choosing them, then the .17 probability won't be accurate, because my choices won't be completely random.
Another thing to consider when sampling is whether or not you are replacing observations in the population. If I reach into a cabinet that has 10 cans of tuna-flavored cat food and 5 cans of chicken-flavored, the probabilities are:
P(randomly choosing one can of tuna) = 10/15 or .67
P(randomly choosing one can of chicken) = 5/15 or .33
But suppose I reach in, select one can, then reach in and select another can. The probability of the second can now depends on what I took out on the first random draw, because I'm now sampling without replacement. If my first can is tuna, there are then 9 tuna and 5 chicken cans remaining, and my probabilities on the next random selection are:
P(randomly choosing one can of tuna) = 9/14 or .64
P(randomly choosing one can of chicken) = 5/14 or .36
The probability of choosing tuna just decreased from before (because we're short one) and the probability of choosing chicken just increased (because the 5 chicken cans are now a larger proportion of the population). However, if I sample with replacement - I select a can, then put it back, then select another can - then the probabilities stay constant.
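The cabinet example can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the text (random draws, first can drawn is tuna); draw_probabilities is a made-up helper name.

```python
from fractions import Fraction

def draw_probabilities(tuna, chicken):
    """Chance of each flavor on the next random draw."""
    total = tuna + chicken
    return {"tuna": Fraction(tuna, total), "chicken": Fraction(chicken, total)}

first = draw_probabilities(10, 5)
# Without replacement: the first can was tuna and it stays out of the cabinet.
second_without = draw_probabilities(9, 5)
# With replacement: the can goes back, so the cabinet is unchanged.
second_with = draw_probabilities(10, 5)

for label, probs in [("first draw", first),
                     ("second, without replacement", second_without),
                     ("second, with replacement", second_with)]:
    print(label, {flavor: round(float(p), 2) for flavor, p in probs.items()})
```

The second line of output shows tuna dropping to .64 and chicken rising to .36, while the third line matches the first.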
Kudos to all of you who knew kurtosis was next on the list! Kurtosis is the fourth moment of the distribution, and is the peakedness (that's three syllables, not two) of the distribution. From the Risk Glossary we get these lovely graphs:

The distribution on the right has greater kurtosis - more peaked, less flat - but it's possible that it has about the same SD as the graph on the left, which is more spread out but is thinner at the tails. Normal distributions have a skew of 0 and a kurtosis of 3. The graph on the right is more likely to be leptokurtic (defined as a kurtosis value of greater than 3), while the graph on the left is platykurtic (kurtosis value less than 3).
You now know the first four moments of the distribution (mean, SD, skew, and kurtosis), which come in very handy for describing a set of scores. If a test score distribution has a mean of 75, an SD of 5, zero skew, but a kurtosis of 4, it might look very much like the right graph above. This would suggest a test on which most examinees score very close to the mean, with some out on the fat tails, and no real floor or ceiling effect (i.e., examinees aren't bunching up on the high or low end).
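If you'd like to see the four moments computed, here is a rough sketch. It assumes the simple "divide by n" versions of each formula and the non-excess kurtosis scale on which a normal distribution scores 3; the two score lists are made up for illustration.

```python
import math

def moments(scores):
    """Return (mean, sd, skew, kurtosis) using divide-by-n formulas."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    skew = sum((x - mean) ** 3 for x in scores) / n / sd ** 3
    kurt = sum((x - mean) ** 4 for x in scores) / n / sd ** 4
    return mean, sd, skew, kurt

# Made-up test scores: both sets are symmetric around 75.
peaked = [70, 75, 75, 75, 75, 75, 75, 75, 80]  # most scores right at the mean
flat = [71, 72, 73, 74, 75, 76, 77, 78, 79]    # scores spread evenly

for name, scores in [("peaked", peaked), ("flat", flat)]:
    mean, sd, skew, kurt = moments(scores)
    print(name, round(mean, 2), round(sd, 2), round(skew, 2), round(kurt, 2))
# peaked has kurtosis of about 4.5 (leptokurtic); flat about 1.77 (platykurtic)
```

Statistical packages often report "excess" kurtosis (this value minus 3) and apply small-sample corrections, so their numbers may not match this sketch exactly.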
This is the funniest graph I've found to help you remember lepto (peaked) vs. platy (flat) in kurtosis:

Skewness is simply a measure of the non-symmetry of your distribution; thus, we can add it to measures of central tendency and variability in describing our distributions of scores. When distributions are skewed, this means that scores within that distribution are piled up on one end (or tail) more than on the other.
Skew can be negative or positive. To remember direction of skew, think of the positive/negative number scale, with the "negative" being to the left, and "positive" being to the right. The "skew" part is actually the skinny part of the distribution, not the end where all the numbers pile up. Thus, this distribution (which is what you'd see when measuring personal income, or number of children per family) is positively skewed:

Whereas this distribution of scores (such as you might see on a very easy exam) is negatively skewed:

You'd think that direction of skew would be easy to remember, but my time spent tutoring and teaching entry-level statistics suggests otherwise. Remember SK = "skew" = "skinny part". Graphing your data will make skew evident, but most statistical packages also calculate a statistic that quantifies the amount of skew in your data. In a skewed distribution, the mean, median, and mode may differ markedly from one another, so understanding the skew is crucial when describing your data (and in performing inferential statistics, as we'll discuss later on). A nice discussion of the skew as the third moment of the distribution (with pretty graphs, too) can be found here.
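To check the direction of skew numerically, a small sketch like the following may help. The incomes and exam scores are invented, and the skew function uses the plain divide-by-n formula, which is one of several conventions.

```python
import statistics

def skew(scores):
    """Third standardized moment (divide-by-n version)."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return sum((x - mean) ** 3 for x in scores) / len(scores) / sd ** 3

# Invented data: incomes in thousands, with one very high earner.
incomes = [20, 25, 30, 30, 35, 40, 200]
# Invented data: an easy exam, with one very low score.
easy_exam = [40, 85, 90, 90, 95, 95, 100]

for name, scores in [("incomes", incomes), ("easy exam", easy_exam)]:
    print(name,
          "skew sign:", "+" if skew(scores) > 0 else "-",
          "mean:", round(statistics.mean(scores), 1),
          "median:", statistics.median(scores))
```

In both cases the mean gets dragged toward the skinny tail, which is why it lands above the median for the incomes and below it for the exam.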
And that's all, folks. Have a very Good Friday and a lovely Easter, and to avoid eating too many Peeps, try blasting them for a change.
The head's still pretty clogged, folks, so this one will be a quickie.
Now that we've learned some wonderful things about measures of spread and measures of central tendency, let's see a way we can use those to determine where someone's score lies within a distribution.
Let's say you have twin fifth-graders (bless your heart), enrolled in two different math classes. Bonnie comes home with a 75 on her math test, while Clyde has an 80. Clyde's score is higher, sure, but can you really compare the two? And how can you tell which of your kids is doing better within their class?
That's where z-scores (or standardized scores) come in. Z-scores tell you precisely where an observation lies within a distribution (they also tell you something about how representative a sample is of a population, but we'll get to that later).
Every score in a distribution has a corresponding z-score, or standardized score. All you need to calculate the z-score of an observation in a population is (a) the observed, or raw score, (b) the population mean, and (c) the population standard deviation. Let's consider Bonnie's class to be population #1, and Clyde's class to be population #2.
Bonnie's class has a mean of 71 and an SD of 2, while Clyde's class has a mean of 77 and an SD of 3. Hmm. We can tell already just from this information that both kids are above the mean in their class, so we know they're doing better than average. But how much better? To find the answer to this, we simply subtract the class mean from each score, and divide by the class's standard deviation.
Bonnie's z-score = (75-71)/2 = 4/2 = +2
Clyde's z-score = (80-77)/3 = 3/3 = +1
These scores of +2 and +1 are on the z-score metric, which has a mean of 0 and a standard deviation of 1. The positive sign tells us that Bonnie and Clyde's scores both fall above the mean, and their scores are in the standard deviation metric - Bonnie's score is 2 SDs above the class mean, while Clyde's is 1 SD above his class's mean. So, even though Clyde's raw score is higher, Bonnie is actually doing better in her class.
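The same arithmetic as a tiny Python sketch, using the class means and SDs given above (z_score is just an illustrative function name):

```python
def z_score(raw, mean, sd):
    """How many standard deviations a raw score sits above (+) or below (-) the mean."""
    return (raw - mean) / sd

bonnie = z_score(75, mean=71, sd=2)
clyde = z_score(80, mean=77, sd=3)

print(bonnie)  # 2.0
print(clyde)   # 1.0
print("Bonnie" if bonnie > clyde else "Clyde", "is further above the class mean")
```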
Can we really compare the two in this way? Yes, we can. This is one of the things for which standardized scores are very useful. Without standardization, comparing Bonnie and Clyde's scores is like comparing apples to oranges.
One caveat to remember here. You can standardize any score, as long as you know the mean and SD. However, it's not as useful to standardize scores that come from a non-normal population, because the z-score transformation assumes that the underlying distribution is normal. For many psychological measures, it is, but it is important to be aware of any non-normality in the population. The assumption of normality doesn't matter so much for small comparisons like this; it matters a great deal when we get to probability and the use of z-scores to find areas underneath the curve.
One common error that people make is to misremember what standardization actually does. Standardization does not turn non-normal distributions into normal ones - a z-score distribution will always be the same as the underlying raw score distribution. The math2.org website gives this graphic to remind you of what the z-score distribution is assumed to be:

The experimental method is, basically, a method for manipulating variables in order to observe change in other variables. Let's say we have two methods of teaching a cat to roll over (yes, I realize that in real life, this will only happen when and if the cat feels like it, but work with me here). What's the simplest way that we could see if one method works better than the other?
Let's take that random sample of 1000 kitties that we obtained last week. First, we create our independent variable, or the variable that we are going to manipulate. In this case, our independent variable (or IV) is method of teaching. The IV needs to have at least two values, but it can be continuous (where an infinite number of possible values may fall in between observed values) or discrete (where values are separate and indivisible). In this case, we have two levels of a discrete variable (method #1 and method #2).
The IV is usually pretty easy to figure out, but the dependent variable - the variable in which you hope to see change after manipulating the IV - is trickier. The reason for this is that it all depends on how you define success. In our case, we're going to manipulate cat roll teaching method, but how do we show that one is "better" than the other? As measured by cat satisfaction with the roll? Owner satisfaction? Time to complete the roll? Total number of rolls completed in a minute? Flair and style in rolling over? Not scratching innocent bystanders while rolling? Deciding upon a specific DV takes some thought, because your research question should determine your DV, and what you measure with your DV will limit the research questions you can answer.
Let's say our DV is time to complete the roll. We'd want to measure it before we try any method, so that we can measure the change in time at the end; this is a repeated measure.
So what's the simplest experiment we could do? We could take that random sample of 1000 kitties, assign one random half to method #1, the other random half to method #2, and then try to hold all the other variables constant while we teach them. This means, perhaps, that we have the same woman teaching both sets of cats, and she teaches them at the same time of day on alternate days, using the same room, the same treats for rewards, and the same tone of voice for commands. In real life we usually can't control all the confounding variables that we'd like to, but it's best to limit confounds as much as possible. If a woman teaches one group of cats and a man teaches the other, the group that does better might be responding to the lower voice instead of to the teaching method.
We could also divide the cats randomly into three groups (with an extra kitty in one to balance out the total) and have a control group for comparison. This is a group to which you do nothing, but you still measure them before and after to see if they change on the DV. We all know that untrained cats aren't going to waste much time rolling over on command, so we don't expect to see much change here. Control groups are essential in the health sciences, though, when it's useful to compare treatment methods to subjects who either haven't been treated, or who have received placebos.
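Random assignment itself is easy to sketch in Python. Everything here is hypothetical: the cat IDs are placeholders, random_groups is an illustrative helper, and the seed is an arbitrary value chosen only so the split can be repeated.

```python
import random

def random_groups(units, k, seed=42):
    """Shuffle the units, then deal them into k groups like a deck of cards."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    shuffled = list(units)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

cats = [f"cat_{i:04d}" for i in range(1000)]  # placeholder IDs for the 1000 kitties

method_1, method_2 = random_groups(cats, 2)
print(len(method_1), len(method_2))    # 500 500

three_groups = random_groups(cats, 3)  # two methods plus a control group
print([len(g) for g in three_groups])  # [334, 333, 333]
```

Dealing after a shuffle means every cat has the same chance of landing in any group, and the leftover kitty simply ends up in the first group.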
So good luck with the experiment. If you get any result other than your cat sleepily ignoring you, let me know.

How many kitties are there in the United States? According to the Humane Society, 60 million live in homes, and National Geographic says that 70 million feral cats are roaming the streets. If we assume all cats are either in homes or the street, and toss in an extra 2 million for those temporarily housed in shelters, then we have a population of cats in the United States that's around 132 million, give or take a few fuzzbutts.
In statistics, population has a very specific meaning - it's the entire collection of scores/observations/cats of interest in a particular study. It can be very large, or it can be very small (if I were interested in studying just cats who live on my block, and not the entire US). But it's whatever I'm interested in, and it's whatever I'd like to be able to generalize to with my sample.
Samples are just subsets of the population. Samples are intended to represent the population; usually it's the sample on which you crunch all your numbers. Samples can also range from very large to very small. Larger is better; the closer the sample size gets to the population size, the more likely your sample statistics will be representative of the population statistics, and the better an inference you can make from your sample to the population.
Values used to describe populations are called parameters, while values used to describe samples are called statistics. When we calculate descriptive statistics on a sample, sometimes we are interested in just that sample, but more often we are interested in making inferences about the population parameters.
Perhaps what you want to know about American cats are their weights, their eye colors, or their numbers of stripes. You can't possibly take measurements of every cat, but you can take a sample of cats that is as large and representative as possible. If you're interested in knowing weights, you'd want to be sure your sample included spayed and non-spayed cats, old and young cats, and cats of both sexes - or you might want individual samples of all these groups. Perhaps kitties in Arizona have more stripes than those in New York; you'd want to make sure you got samples across geographic regions.
This representative sample problem is one that you often see in relation to studies related to education and testing. Earlier this week, we saw a columnist try to infer from a sample (of Bates College) that the SAT was not useful for the population (of universities in the United States). Not only is that not a large enough sample, but one could argue that, even if every small, private, liberal-arts college found the same results, the results do not generalize to big state schools.
No matter how representative a sample is (as long as it's not equal to the population), the measurements you obtain from it will not likely be the same as what you'd get from measuring the entire population. That difference between sample statistics and population parameters is called sampling error. Sampling error is affected by sample size and characteristics of the sample, and can be random or systematic. One way to combat systematic sampling error is to use random sampling, in which each observation in the population has an equal chance of being selected for the sample.
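A quick simulation can make sampling error concrete. This is only a sketch: the "population" is 20,000 made-up cat weights, the seed is arbitrary, and average_sampling_error is an illustrative helper that repeats the sampling many times.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary fixed seed so runs are repeatable

# Made-up population of cat weights in pounds (not real data).
population = [rng.gauss(10, 2) for _ in range(20_000)]
mu = statistics.mean(population)  # the population parameter

def average_sampling_error(n, repeats=200):
    """Typical gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # simple random sample
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)

small = average_sampling_error(10)
large = average_sampling_error(1000)
print(f"n=10:   typical error {small:.3f}")
print(f"n=1000: typical error {large:.3f}")
```

The sample mean never matches the population mean exactly, but the typical gap shrinks as the sample gets bigger.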
Our last topic is about bias in estimation. Let's say I have a magic wand that makes 1000 random kitties from all over the US appear in my laboratory. I can weigh each one, and calculate a mean and standard deviation of the weights. My goal is to make an inference about what the mean and standard deviation of weights are for all the kitties in the US.
When I calculate the mean of my sample, I have what's called an unbiased estimate of the mean of my population. This means that my sample mean does not consistently over- or under-estimate my population mean, and thus the sampling error is more likely to be random. Variability, though, is different. The formula I provided here is for the standard deviation of a population. However, if I were to use that formula on a sample, I'd get a measure that is biased, and that will systematically underestimate our population standard deviation.
So we correct for that bias by modifying our formula for the sample standard deviation. We still subtract the mean weight from each kitty's weight, square those deviations, and sum those up. But instead of dividing by the total number of kitties in our sample, we'll divide by the number of kitties minus 1. This decreases the denominator of our formula, resulting in an increased variance (and when we take the square root of that, an increased sd) that corrects for the systematic underestimation.
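Here's a small sketch of the two formulas side by side, using five invented kitty weights. Python's statistics module already has both versions (pstdev divides by n, stdev divides by n - 1), so the hand calculation can be checked against it.

```python
import math
import statistics

weights = [8, 9, 10, 11, 12]  # invented sample of kitty weights, in pounds

n = len(weights)
mean = sum(weights) / n
sum_of_squares = sum((w - mean) ** 2 for w in weights)

sd_population_formula = math.sqrt(sum_of_squares / n)    # divide by n
sd_sample_formula = math.sqrt(sum_of_squares / (n - 1))  # divide by n - 1

print(round(sd_population_formula, 3), round(statistics.pstdev(weights), 3))  # 1.414 1.414
print(round(sd_sample_formula, 3), round(statistics.stdev(weights), 3))       # 1.581 1.581
```

The n - 1 version is always the larger of the two, and the gap matters most when the sample is small.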
Now if you'll excuse me, I have a lot of kitties to feed (999, to be exact).
Today we'll cover something simple: Scales of measurement. They're not tricky, but they're important, especially when it comes to deciding what inferential statistics can be used, and what conclusions can be made.
And we'll mix some catblogging in here as well.
First, there's the nominal, or categorical, scale. This really isn't a quantitative scale at all, but a qualitative grouping. A survey item that asks a community, "What different kinds of cats do you own?" is nominal; the responses might be "tabby," "Siamese," "Maine Coon," etc. The correct descriptives here are counts and the mode; when we get to inferential statistics, you'll learn about non-parametric analyses such as chi square that are suitable for categorical data (for example, a chi square test could help you answer the question, "Is type of cat owned independent of college major?").
Next, there's the ordinal scale. Think "order" when you hear ordinal, because that's what this scale preserves. Class rank, movie ratings, the "AmIHotOrNot" ten-point attractiveness scale - all of these group observations and preserve the order of observations (a perfect 10 is cuter than a 6, a movie that gets 5 stars is better than one with 3 stars), but you don't know how much cuter, or better, observations with the higher values are. With ordinal scales, the mode and median are useful, along with the inter-quartile range.
In the photo below (taken tonight at the intake center), the kitties are in position 1 (top), 2 (middle), and 3 (bottom), but just knowing their value on an ordinal scale doesn't tell you how far apart they are:

Next on the list there's the interval scale. This scale groups observations, preserves the order, and tells you how far apart each observation is. Each point on an interval scale represents the same magnitude on the trait being measured, no matter where on the scale you are. The classic example of a true interval scale is temperature in Fahrenheit. When it's 30 degrees out, it's 10 degrees warmer than when it's 20 degrees; when it's 80, it's 5 degrees cooler than when it's 85.
However, there isn't an absolute zero on the Fahrenheit scale, which is what keeps it from being the next level of measurement - ratio. Ratio scales have a true zero point, so not only does one unit's difference mean the same thing across the scale, but you can also say that 4 units on a ratio scale is twice as high as 2 units. 30 degrees F is not twice as hot as 15 degrees F, but 300 degrees Kelvin IS twice as hot as 150 degrees Kelvin, because the Kelvin scale starts from absolute zero.
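To see why ratios of Fahrenheit readings mislead, you can convert them to Kelvin first. A minimal sketch, using the standard conversion formula (the function name is my own):

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32) * 5 / 9 + 273.15

naive_ratio = 30 / 15  # treats Fahrenheit as if it had a true zero
true_ratio = fahrenheit_to_kelvin(30) / fahrenheit_to_kelvin(15)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.032
```

On a scale with a true zero, 30 degrees F turns out to be only about 3% "hotter" than 15 degrees F, not twice as hot.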
Many measurements made from direct observation, or in the hard sciences, are on the ratio scale. Time in finishing a race is ratio - the person who finished in 20 minutes took half as long as the person finishing in 40 minutes. If I have 100 bucks in my pocket and you have 150, I have 50 dollars less than you, and the person with $200 has twice as much as me (and let me tell you, I know all about the true zero point when it comes to income).
"Number of stripes on this kitty" is on the ratio scale. If he has 25 stripes, he has half as many stripes as another cat with 50:

Note that each scale builds on the one before, and has the qualities of all previous scales. Ratio scales allow you to group observations, rank order, add or subtract scale values, and multiply and divide scale values.
Psychological and educational measurements - such as IQ, SAT scores, or personality measures - are not ratio, and not really interval, although they are often treated as such. The issue rests on whether you can say that measures of latent traits really have equal intervals across the scale. Someone with an IQ score of 180 is smarter than a person with a 150 (so the order is preserved), but does that difference of 30 points mean the same thing when we're talking about two people who have scores of 120 and 90, respectively? What about an anxiety scale? Does a 5-point difference at the bottom mean the same thing as a five-point difference at the top?
If we have these concerns, why do we often assume psychological/educational scales are interval? Basically, we do it so that we can use descriptive statistics like the mean and standard deviation, and use powerful parametric statistics to make inferences. However, we take our chances in doing this; the appropriateness of our analyses rests on the assumptions that we make about the underlying scale, and the more incorrect we are in our assumptions, the less confidence we'll have that our analyses - and our inferences - are correct.
Variety is the spice of life, and variability is the essence of statistics. Why crunch numbers on anything? Why not just assume everyone is the same? Because we know they're not the same, but we don't necessarily know just how different everyone is. That's where variability comes in. Variability, in a statistical sense, is a quantitative measure of how close together - or spread out - a distribution of scores is. In our last lesson, we discovered ways to understand where the representative score in a distribution lies, but while the mean, median, and mode tell us something about the most representative point in the data, they tell us nothing about how all the scores vary around that representative point.
Thus, measures of variability (or spread) go hand in hand with measures of central tendency, and you need at least these two measures to get a picture of what a distribution actually looks like.
Let's go from simplest to most complex. First, there's the range - crude, but easy to calculate. With observations as whole numbers, the range is (highest score - lowest score) + 1. (With non-integers as observations, you have to be concerned with upper and lower limits, but we'll skip that for now.) Note that the following two groups have the same range, but the distributions are very different:
Group 1 - 10, 10, 10, 10, 2
Group 2 - 10, 8, 6, 4, 2
The range can be divvied up. The interquartile range is the (75th percentile score) - (25th percentile score), answering the question of what the spread is in the middle of data (useful for when there are outliers).
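The two groups above make a handy test case. This sketch uses the whole-number range formula from the text and Python's statistics.quantiles for the quartiles; note that software packages use several slightly different percentile conventions, so IQR values can differ a bit from tool to tool.

```python
import statistics

group_1 = [10, 10, 10, 10, 2]
group_2 = [10, 8, 6, 4, 2]

def inclusive_range(scores):
    """Range for whole-number scores: (highest - lowest) + 1."""
    return max(scores) - min(scores) + 1

def iqr(scores):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # three quartile cut points
    return q3 - q1

for name, scores in [("Group 1", group_1), ("Group 2", group_2)]:
    print(name, "range:", inclusive_range(scores), "IQR:", iqr(scores))
```

Both groups have a range of 9, but the IQRs differ, which hints at how differently the scores are spread inside that range.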
What you'll most often see to describe variability is the standard deviation of a distribution. This is a quantity (so it can't be less than zero) that approximates the average distance from the mean. So the mean is the representative value, and the standard deviation is the representative distance of any one point in the distribution from the mean.
Let's skip back to just the term deviation. Let's say we have a distribution with a mean of 100. You have a score of 90. Your deviation from the mean is thus -10. Your friend, with a score of 105, has a deviation of +5. If I add up the deviations of everyone in the distribution from the mean, I'll get zero (that's part of the definition of the mean, in fact.) So adding these deviations up doesn't get us anywhere, yet.
Let's get rid of the + and - signs by squaring every deviation. If we add those squared deviations up, we have sums of squares (a very important concept in both descriptive and inferential statistics). If we divide by the number of observations in our distribution, we get what's called the variance. The variance gives us the representative squared distance from the mean, which is not that useful for descriptive statistics.
So take the square root of the variance, and you get the standard deviation (which is also sometimes abbreviated sd, or just s). It's in the original unit of whatever your distribution was, so it's easy to interpret. If a distribution has a mean of 100 points and a standard deviation of 5, then the representative deviation from the mean in that distribution is 5 points.
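The whole chain (deviations, sum of squares, variance, standard deviation) fits in a few lines. The five scores here are invented so that the mean comes out to 100, and the divisor is the number of observations, as in the description above.

```python
import math

scores = [90, 105, 100, 95, 110]  # invented scores with a mean of 100

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]         # -10, +5, 0, -5, +10
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / len(scores)         # average squared deviation
sd = math.sqrt(variance)                        # back in the original units

print(sum(deviations))   # 0.0
print(sum_of_squares)    # 250.0
print(variance)          # 50.0
print(round(sd, 2))      # 7.07
```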
Because the standard deviation is an average, it's affected by outliers - those extreme scores on either tail of the distribution. This means when you have a distribution for which the mean isn't appropriate - like income, or number of children - the standard deviation won't be too useful either. The interquartile range, on the other hand, nicely complements the median in these situations. Just like with measures of central tendency, just because you can compute the standard deviation for skewed data, doesn't mean you should.
(You can also calculate the average absolute deviation and the median absolute deviation, which are just what they sound like - the average or median of the absolute unsquared deviations from the mean. These are less affected by outliers than the standard deviation. Thanks to Raina for pointing me down this path.)
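A rough comparison of these measures on made-up data, with and without an outlier. Following the description above, the deviations here are taken from the mean; be aware that many references define the median absolute deviation around the median instead.

```python
import statistics

def spread_measures(scores):
    """SD plus the mean and median of the absolute deviations from the mean."""
    mean = statistics.mean(scores)
    abs_devs = [abs(x - mean) for x in scores]
    return {
        "sd": statistics.pstdev(scores),
        "mean_abs_dev": statistics.mean(abs_devs),
        "median_abs_dev": statistics.median(abs_devs),
    }

tidy = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]  # same scores, but one extreme value

for name, scores in [("tidy", tidy), ("with outlier", with_outlier)]:
    print(name, {k: round(v, 2) for k, v in spread_measures(scores).items()})
```

With the outlier, the standard deviation balloons the most, the mean absolute deviation less so, and the median absolute deviation least of all.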
And again, always look at your data (image borrowed from Dr. Gaten's online course):

Distributions A and B have the same mean, but different standard deviations. B's variability is smaller, so its variance and standard deviation are smaller, too. A and C have the same variability, but different means. Note that A and C overlap, so some people in A have higher scores than some in C, although C's mean is greater than A's.
If you understand all of this, you're ahead of some of the people who were criticizing Harvard President Larry Summers's infamous comments about male and female scientific ability (link goes to a site that defends Summers). Many of Summers's critics immediately assumed that he was saying all men are smarter than all women, or that no women have the ability to become scientists and engineers. These statements could only be made by people who do not understand distributions, or even basic statistics.
As Slate so nicely described Summers's comments:
It isn't a claim about overall intelligence. Nor is it a justification for tolerating discrimination between two people of equal ability or accomplishment. Nor is it a concession that genetic handicaps can't be overcome. Nor is it a statement that girls are inferior at math and science: It doesn't dictate the limits of any individual, and it doesn't entail that men are on average better than women at math or science. It's a claim that the distribution of male scores is more spread out than the distribution of female scores—a greater percentage at both the bottom and the top. Nobody bats an eye at the overrepresentation of men in prison. But suggest that the excess might go both ways, and you're a pig.
I don't know what the population distributions look like for the abilities that Summers was describing. But if they looked like B (women) and A (men), and there were more men in fields that required this ability, it would be clear as to why.
I'm already getting good feedback from Devoted Readers about this new feature, so I guess I'll continue. I'd also like to put my $.02 in here and say that, hands down, the best textbook I've found for teaching undergraduate statistics in the social sciences is Gravetter & Wallnau's Statistics for the Behavioral Sciences. I have the third edition; two more editions are currently out.
Also, I'll try to add in statistical notation images later when I get home and have the software for it.
Let's step back from percentiles and talk about measures of central tendency. Central tendency of what, you might ask? Of a distribution - a word many non-statisticians consider intimidating, but which just means a collection of observations; in testing, distributions are often sets of scores.
Measures of central tendency are one kind of descriptive statistics (measures that describe or summarize a distribution of scores.) Any measure of central tendency identifies a single score as something that is representative of the entire distribution. Needless to say, the correct measure of central tendency - the one that is most likely to give you a representative score - will depend on the type of data and the shape of the distribution.
The three most-often used measures of central tendency are the mode, median, and mean. The mode couldn't be simpler - it's the most commonly occurring score in the distribution, and it's suitable for use on discrete (not "discreet"!) and continuous data, and any scale of measurement (ratio, interval, ordinal, and nominal). If more students score a 1060 on the SAT than any other score, that score is the mode; if more students respond "Yes" to a survey than "No" or "No opinion," "Yes" is the mode.
The median is, as we discussed yesterday, the 50th percentile of the distribution - the point at which half the observations are above and half are below. Medians are good for skewed or open-ended observations. If you're asking local families how many children live in each house, a median will give you a better representative number than the mean, which could be skewed too high by that little old lady who lives in the shoe. Medians are also good for ordinal data; i.e., data that are ranked in order but that do not have equal intervals between scores. Rankings in a race are on an ordinal scale - you know you got 2nd place, but you might have been only a second behind the person in first, whereas you might have been an hour faster than the person in 3rd place. Ordinal scales preserve only the order of observations, and not the distances between them.
Finally, there's the mean, aka the average. Add up all the scores in the distribution and divide by the number of scores. You can thus also think of the mean as the score each individual would get if the total of the scores were divided equally among the population (this is what anti-testing types would like to see happen, to combat the "unfairness" of unequal scores). The mean is also a balance point of the distribution, but not with 50% of scores on each side. Instead, it's more like a seesaw, where a score way on one end is balanced out by a lot of lower scores on the other end.
With standardized tests, means, medians, and modes are all useful to know. If the distribution of the test scores is normal (e.g. the "bell curve"), the mean, median, and mode should be roughly equal; this equality is in fact part of the definition of a normal distribution.
If you add/subtract a constant value to each score, the mean will shift by that constant.
If you multiply/divide each score by a constant value, you can multiply/divide the old mean by that same constant, and you'll get the new mean.
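All three measures, plus the two shortcut rules for the mean, can be checked with Python's built-in statistics module. The data are invented: the number of children in five houses, one of which is the shoe.

```python
import statistics

children = [1, 2, 2, 3, 12]  # invented data; the 12 is the old lady in the shoe

print(statistics.mean(children))    # 4
print(statistics.median(children))  # 2
print(statistics.mode(children))    # 2

# Add a constant to every score: the mean shifts by that constant.
plus_five = [x + 5 for x in children]
print(statistics.mean(plus_five))   # 9

# Multiply every score by a constant: the mean is multiplied too.
times_three = [x * 3 for x in children]
print(statistics.mean(times_three)) # 12
```

The one big household pulls the mean up to 4, while the median and mode both stay at 2.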
Last but not least - be suspicious when the wrong central tendency measure is used (mean income of a group and median income of a group could be very different), and be very wary of measures of central tendency that are provided in the absence of measures of spread (that's tomorrow's term).
Update: Devoted Reader Doug S. notes the following:
You might wish to discuss situations when none of these measures is appropriate. Specifically, datasets with multiple strong local maxima make any measure of the central tendency of the whole data set much less useful. Lack of utility doesn't seem to correlate with a reduced frequency of use though. After all, if I have a dataset, the mean (or median, or mode) must mean something (so to speak).
Good point, and something I should have said first thing when discussing descriptive statistics. Important Rule # 1: Always look at your data. Graphing data is ideal; non-normality, skewness, kurtosis - all those will show up with graphing. Descriptive statistics provide one kind of look, but there's nothing like a nifty graphical representation of quantitative data.
Here's one example of what Doug is talking about (graph cheerfully stolen from a discussion of what "normality" means for the atmosphere, by Chuck Doswell):

This is a bimodal distribution - something like what you might see if you combined men's and women's heights into one distribution of measures. Often, bimodal distributions are an indication that you have more than one distribution of scores going on in your sample, and might need to separate them. Sometimes, though, this is what one population actually looks like. There can be any number of modes, or any number of bumps in the distribution's curve (those would be the local maxima).
You can calculate a mean and median for this distribution, sure - but they won't be very descriptive as a representative of your data. Important Rule # 2 - Just because you can calculate a statistic doesn't mean it's correct for your data.
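Here's a small sketch of that problem in Python. The heights are invented for illustration (two clumps, one near 64 inches and one near 70), and statistics.multimode needs Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Made-up heights in inches: one cluster around 64, another around 70.
heights = [63, 64, 64, 64, 65, 69, 70, 70, 70, 71]

center_mean = mean(heights)
center_median = median(heights)
modes = multimode(heights)  # every value tied for most frequent

print("mean:", center_mean)
print("median:", center_median)
print("modes:", modes)
print("anyone actually at the mean?", center_mean in heights)
```

The mean and median both come out to 67, a height that nobody in this made-up sample has. The two modes, 64 and 70, tell you far more about what the data look like.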
Welcome to my new feature - the Statistics Term of the Day. I'll try to start by focusing on terms that you non-statisticians out there are likely to see on your - or your child's - test score report. I'll try to keep the order somewhat logical but these early terms might be a tad out of order. I'll put stats terms that I've either covered, or (since this is the first installment), plan to cover, in bold.
If it gets heinously boring, tell me.
Today the term is percentile. Put simply, the percentile is a value ranging from 1 to 100 (so it looks like a percentage) that indicates the percent of the distribution that lies below it. Often test scores are accompanied by a percentile; e.g., "This student's reading score is at the 98th percentile." That value is the percent of examinees in some reference group (often, the examinees who have taken the exam in the years prior) below the given score.
If you are at the 98th percentile, your percentile rank is 98. You have scored higher than 98 percent of some reference group of examinees. The College Board provides percentiles so that examinees can see how they did on the SAT as compared to examinees who took it in previous years.
Another phrase you might see with percentiles is cumulative frequency distribution or cumulative percentage. Cumulative in this case means increasing by successive additions; a cumulative frequency distribution is created by ranking the values in a distribution and, for each value, adding up the frequencies of all the values up to that point. Cumulative percentages are used to create percentiles; in order to know that score X is at the 98th percentile, the percents of all scores below score X had to be summed.
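The summing-up described above can be sketched in a few lines of Python. The function name percentile_ranks and the ten reference scores are hypothetical, and this sketch uses the "percent strictly below" definition from this post; some testing programs count scores at or below instead, so real score reports may differ slightly.

```python
from collections import Counter

def percentile_ranks(scores):
    """Map each distinct score to the percent of scores strictly below it."""
    counts = Counter(scores)   # frequency of each score
    total = len(scores)
    ranks = {}
    below = 0                  # running (cumulative) count of lower scores
    for value in sorted(counts):
        ranks[value] = 100 * below / total
        below += counts[value]
    return ranks

# A made-up reference group of ten examinees.
reference = [400, 450, 450, 500, 500, 500, 550, 600, 650, 700]
ranks = percentile_ranks(reference)

for score, rank in ranks.items():
    print(score, rank)
```

In this made-up group, a score of 500 has three scores below it out of ten, so it sits at the 30th percentile, and the top score of 700 sits at the 90th.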
The 50th percentile is also known as the median, which is the measure of central tendency that divides the sample (or distribution) in half. Fifty percent of observations lie above the median, and fifty percent below. The median is not the same as the mean, which is the mathematical average or center of a sample or distribution. When a distribution is non-normal or skewed, the median is often the correct measure of central tendency. This is why you often see median, not mean, incomes reported, so that the few millionaires in the bunch don't confuse the analyses.