## Confidence intervals

### From OPOSSEM

# Objectives

- Understand why sample statistics carry uncertainty about the population parameters they estimate.
- Construct and interpret confidence intervals for means, both when the population standard deviation is known and when it is not.
- Construct and interpret confidence intervals for proportions.

# Introduction

It is not unusual to see headlines in a newspaper that say something like, "Romney pulls ahead of Obama in latest survey!" As you read the article, you realize that 47% of people chose Romney in the new poll and 46% of people chose Obama. It is certainly the case that Romney is ahead of Obama among those who answered the survey; in other words, Romney is ahead of Obama in the sample. The question we want to ask ourselves is: does this mean Romney is ahead of Obama among the group the sample is meant to represent? In other words, is Romney ahead of Obama in the population?

Researchers almost always use samples to examine a population, but using a sample will always introduce some uncertainty about our results. Even if a sample is drawn perfectly, there is a chance that, simply due to random chance, it will not be representative of the population.

In this chapter we introduce the concept of a *confidence interval*, which can be used to infer the range of likely values of a population parameter given a sample statistic. Confidence intervals are used when we want to find what values of a population parameter (such as a mean or proportion) are the most likely to have resulted in the sample we observe.

For example, when the media and political pundits are observing elections, they conduct surveys to try to figure out which candidate is more likely to win, or whether a referendum is likely to pass or be defeated. As we discussed in the chapter on survey research, public opinion polls do not usually sample the whole population and so are going to have some degree of sampling error. They can use confidence intervals to get an idea of the likelihood of their desired outcome coming to pass given the results of their survey.

# When the population standard deviation is known

Assume for the moment that we know the standard deviation of the population, <math>\sigma_y</math>. If that is the case, then based on the central limit theorem we can find the *standard error* of the sample mean as follows:

<math>\sigma_{\mu_y} = \frac{\sigma_y}{\sqrt{n}}</math>

In this formula, <math>n</math> represents the size of the sample. The standard error is a measure of the degree of uncertainty in our estimate of a particular parameter due to the use of sampling; in general, the larger the sample, the more likely it is that our sample statistic is close to the true population parameter. Hence, as the sample size <math>n</math> increases, the standard error becomes smaller.
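As a quick illustration, here is a minimal Python sketch of this formula (the value σ = 15 and the function name `standard_error` are purely illustrative), showing how the standard error shrinks as the sample grows:

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Quadrupling the sample size cuts the standard error in half.
for n in (25, 100, 400):
    print(n, standard_error(15, n))   # 3.0, 1.5, 0.75
```

Note the diminishing returns: because the sample size appears under a square root, each halving of the standard error requires four times as many observations.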

The *confidence interval* for the mean is then given by the following formula:

<math>\bar{y} \pm Z_{\text{crit}} \sigma_{\mu_y}</math>

In this formula, <math>Z_{\text{crit}}</math> is the appropriate *Z score* for the chosen confidence level. As we saw in the previous chapter, the most common confidence levels in social scientific research are 90% and 95%, corresponding to alpha levels of 0.10 and 0.05, respectively.

To find the correct Z score for a confidence level, we need to identify the value of Z where the area under the standard normal curve is equivalent to our confidence level. Since the standard normal distribution is symmetric, tables of the normal distribution in textbooks typically include only the positive Z scores (those to the right of the mean), so finding the correct Z score takes a little care.

Since we are actually looking for the Z score such that the area under the curve between -Z and Z is equal to the desired confidence level, and the standard normal curve is symmetric around 0, it follows that the area under the normal curve from 0 to Z should be equal to *half* of the desired confidence level. So, for example, to find the Z value for a 90% confidence level, we would want to find Z such that the area under the curve between 0 and Z is 45% (or, in terms of proportions, 0.45) of the total area under the curve; similarly, to find the Z value for a 95% confidence level, we would seek a Z such that the area between 0 and Z is 47.5% (0.475) of the total area under the curve.

If we look up these values in a table of the standard normal distribution, we find the following values of Z for the given confidence levels:

| Confidence level | Alpha | Z |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
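These table values can also be computed directly. The sketch below uses Python's standard-library `statistics.NormalDist` to recover the critical Z for each confidence level; the function name `z_critical` is just for illustration:

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical Z: leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_critical(level):.3f}")
    # 90%: Z = 1.645; 95%: Z = 1.960; 99%: Z = 2.576
```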

Of course, the confidence interval for a mean only makes sense if our variable is measured on an interval or ratio scale (i.e. it is continuous); if we have ordinal or nominal data, another method such as the confidence intervals for proportions is appropriate. In the case of ordinal data, one could also use the interquartile range of the distribution as a representation of a "50% confidence interval" for the median.

**Example:** A researcher conducts an IQ test of 100 freshman (9th grade) students at a public high school in Tacoma, Washington. The sample mean IQ is 105. Assuming, for the sake of argument, that these students are a random sample of 9th graders in the United States, and the standard deviation of IQ scores is 15, we would like to determine the *standard error of the mean* and the values of the population mean (i.e. the mean of all 9th graders in the USA) that most likely led to the sample mean we discovered.
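One way to work this example, sketched in Python with only the standard library (variable names are illustrative): with <math>n = 100</math>, <math>\bar{y} = 105</math>, and <math>\sigma_y = 15</math>, the standard error is <math>15/\sqrt{100} = 1.5</math>, and the 95% interval runs from about 102.1 to 107.9.

```python
import math
from statistics import NormalDist

n, ybar, sigma = 100, 105, 15        # sample size, sample mean, population SD
se = sigma / math.sqrt(n)            # standard error of the mean: 1.5
z = NormalDist().inv_cdf(0.975)      # critical Z for 95% confidence (~1.96)
lower, upper = ybar - z * se, ybar + z * se
print(f"SE = {se}, 95% CI = ({lower:.2f}, {upper:.2f})")
# SE = 1.5, 95% CI = (102.06, 107.94)
```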

# When the population standard deviation is unknown

You may have already noticed a small problem with the preceding analysis. Recall that when you learned about the standard deviation, you needed to know the mean in order to calculate it. So if we have already been able to calculate the population's standard deviation, *by definition* we must already know the population mean as well, and inferring the mean would be a rather pointless exercise.

So, what do we do in the more common case where finding the population mean and standard deviation is impossible or at least impractical? The solution is to rely on the *sample* standard deviation. However, the sample standard deviation is only an *estimate* of the population standard deviation (just as the sample mean is an estimate of the population mean), so we must account for the additional error that results from using this sample statistic.

Thankfully, this problem was solved for us long ago, and from what may seem a rather unlikely source. In the early 20th century, William Sealy Gosset, a British chemist and statistician, was employed by the Guinness Brewery in Dublin, Ireland. He was responsible for helping to ensure that the brewery's products were of a consistently high quality, which meant making sure all of the raw ingredients were consistent and the brewing process stayed the same from day to day despite variations in weather, the specific production lines in use, and the like. He therefore needed to be able to take samples of both the finished product and the inputs and be confident that everything was in order.

Gosset discovered, as we have, that the sample standard deviation was not always very close to the population standard deviation; in fact, the smaller the sample, the more likely it is that due to sampling error the two quantities will be substantially different. This causes the standard error of the mean to be underestimated and the confidence intervals to be narrower than they should be.

Accordingly Gosset proposed an adjustment to the normal distribution to compensate for the higher uncertainty. However, Guinness—fearing that competitors might use information about how Guinness worked to improve their own brewing processes—was reluctant to allow Gosset to publish his findings, even though the statistical adjustment was not a particularly valuable "trade secret." Eventually he was able to persuade Guinness to allow him to publish the finding in the statistics journal *Biometrika*, but only under a pseudonym: "Student." Accordingly Gosset's distribution became known as "Student's t distribution," and the name has stuck even though we now know Gosset was the inventor.<ref>Wikipedia contributors (20 June 2011). "William Sealy Gosset". http://en.wikipedia.org/wiki/William_Sealy_Gosset. Retrieved 8 July 2011.</ref>

As suggested above, the formula for the standard error is essentially the same, just with different symbols; the only difference is that instead of using the population standard deviation <math>\sigma_y</math>, we now are using the sample standard deviation <math>s_y</math>.

<math>s_\bar{y} = \frac{s_y}{\sqrt{n}}</math>

The confidence interval formula is also essentially identical to the formula for using the population parameter; instead of the critical value of the normal distribution (Z), we use a critical value of Student's t distribution, which is obtained from the table for Student's t distribution.

<math>\bar{y} \pm t_{\text{crit}} s_{\bar{y}}</math>

However, the t distribution is actually a *family* of related distributions that are adjusted by the appropriate number of *degrees of freedom,* which depends on the statistical test being done. In the case of the confidence interval formula, the appropriate number of degrees of freedom is one fewer than the sample size, or <math>\text{df} = n-1</math>. Hence to look up the appropriate value of t in the table, you need to know beforehand both the confidence level you wish to use, as well as the sample size.

**Example:** Another researcher conducts an IQ test of 82 freshman (9th grade) students at a public high school in Biloxi, Mississippi. The sample mean IQ is 110, and the sample standard deviation is 17. Assuming these students are a random sample of 9th graders at the school, we would like to determine the *standard error of the mean* and the values of the mean among all 9th graders that most likely led to the sample mean we discovered.
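Sketching this example in Python: the critical value <math>t \approx 1.990</math> for 81 degrees of freedom is taken from a t table (there is no t-distribution function in Python's standard library, so we hard-code it here), giving a standard error of about 1.88 and a 95% interval of roughly 106.3 to 113.7.

```python
import math

n, ybar, s = 82, 110, 17            # sample size, sample mean, sample SD
t_crit = 1.990                      # from a t table: df = 81, 95% confidence
se = s / math.sqrt(n)               # standard error of the mean (~1.877)
lower, upper = ybar - t_crit * se, ybar + t_crit * se
print(f"SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# SE = 1.877, 95% CI = (106.26, 113.74)
```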

# Confidence intervals for proportions

Another form of data that is commonly found in the social sciences involves proportions or percentages. For example, in educational research one of the key concerns of policymakers is to minimize the *dropout rate,* the percentage of students who leave high school without finishing their education with a diploma, while in political science we are often interested in the likely outcomes of elections based on polls. Since in these cases the data has no real "population mean" to speak of, we have to look at the proportion—the share of the total cases that are in the category of interest—instead. Studying proportions particularly lends itself to analysis of nominal or ordinal data where we don't have a valid mean.

Calculating the standard error of a sample proportion is relatively straightforward. If we take <math>p</math> to represent the sample proportion found in our category of interest, the standard error of the proportion is given as follows:

<math>\sigma_\pi = \frac{\sqrt{p(1-p)}}{\sqrt{n}}</math>

Similarly, the confidence interval formula is very straightforward as well; except for interchanging a few symbols (the population proportion π appears instead of the population mean μ, for example), the formula is the same as we've already seen for when we use the mean.

<math>p \pm Z_{\text{crit}} \sigma_\pi</math>

As discussed above, public opinion polling is a common application for using confidence intervals of proportions. You may have seen media reports of polls in which the results are presented with a *margin of error*; this is simply the standard error of the proportion multiplied by the value of Z corresponding to the desired confidence level (normally 95%, the equivalent to α=0.05). As we have previously seen, the appropriate value of Z for a 95% confidence level is 1.96.

There is no direct "t distribution" equivalent for constructing confidence intervals for proportions when we have a small sample (<math>n < 100</math> or so); instead, researchers use the *binomial distribution* when they have a small sample. Getting into the details of that approach is beyond the scope of this module; however, the principles are very similar and most statistical programs for computers will easily calculate the needed statistics, as discussed in the following module on single sample tests of means and proportions.

**Example:** A survey of 729 people in Laredo, Texas, indicates that 54% of residents support constructing a new baseball stadium in the city while 46% are undecided or opposed. Based on these figures:

- What is the standard error of the proportion?
- What is the 95% confidence interval for the proportion of residents who support the stadium?
- Can we be confident that a majority of residents favor building a baseball stadium?
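A sketch of the arithmetic for this example in Python (standard library only; variable names are illustrative):

```python
import math

n, p = 729, 0.54                             # sample size, sample proportion in favor
se = math.sqrt(p * (1 - p)) / math.sqrt(n)   # standard error of the proportion
z = 1.96                                     # critical Z for 95% confidence
lower, upper = p - z * se, p + z * se
print(f"SE = {se:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
# SE = 0.0185, 95% CI = (0.5038, 0.5762)
# The lower bound exceeds 0.50, so at the 95% confidence level
# the survey is consistent with majority support for the stadium.
```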

# Advanced Topics

## Margins of Error in Survey Research

The percentage margin of error for a public opinion poll in survey research, assuming all of the error is due to sampling error and a 95% confidence level, can be estimated by taking the worst case <math>p = 0.5</math> (which maximizes <math>p(1-p)</math>), so that <math>1.96\sqrt{p(1-p)} = 0.98</math>. In percentage terms:

<math>e = \pm \frac{98\%}{\sqrt{n}}</math>

So, for example, a survey with 600 respondents would be expected to have a margin of error of <math>98/\sqrt{600} \approx 98/24.5 \approx 4.0</math> percentage points.
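This rule of thumb is easy to wrap in a function; a minimal sketch (the function name is illustrative):

```python
import math

def margin_of_error(n):
    """Rule-of-thumb 95% margin of error, in percentage points: 98 / sqrt(n)."""
    return 98 / math.sqrt(n)

print(round(margin_of_error(600), 1))   # 4.0
```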

## The Finite Population Correction

When the sample size is 5% or more of the underlying population, the standard error can be adjusted to reflect the decreased sampling error under these circumstances. The adjusted standard error is the ordinary standard error multiplied by the finite population correction, where <math>N</math> is the size of the population:

<math>\text{FPC} = \sqrt{\frac{N-n}{N-1}}</math>
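A minimal Python sketch of applying the correction (function names are illustrative): the ordinary standard error <math>\sigma_y/\sqrt{n}</math> is multiplied by the standard correction factor <math>\sqrt{(N-n)/(N-1)}</math>, where <math>N</math> is the population size.

```python
import math

def fpc(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def corrected_se(sigma, n, N):
    """Standard error of the mean, shrunk for sampling a large share of a finite population."""
    return (sigma / math.sqrt(n)) * fpc(N, n)

# Sampling 200 of 1,000 people (20% of the population):
print(round(fpc(1000, 200), 3))               # 0.895
print(round(corrected_se(15, 200, 1000), 3))  # 0.949
```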

## Comparing confidence intervals for multiple groups

It is often tempting to compare the confidence intervals for two groups as a way to determine whether or not the two groups' means are the same. This approach is technically incorrect and can lead to erroneous inferences. The correct approach is to use the difference-of-means test discussed later in the book if you have two groups, or analysis of variance (ANOVA) if you have more than two groups to compare.

# References
