Hypothesis testing





In the social sciences, hypothesis testing is a set of approaches used to determine whether or not our theories about the real world are borne out in practice. While there are a number of ways to test hypotheses, most empirical research (whether quantitative or qualitative in nature) relies on the same basic approach to seeing whether our theories work.

From a Theory to a Testable Hypothesis

To review, a theory is a relatively general, and testable, expectation about political or social reality. For example:

  • Countries with greater political freedom will have more economic growth than those that are less free.
  • Supreme Court justices' decisions on cases heard before the Court are influenced by the justices' individual partisan leanings.
  • College students who join a fraternity or sorority have more friends than those students who do not join Greek organizations.
  • Evaluations of the government's handling of the economy will influence vote choice.

Elements of a Testable Hypothesis

Taking a theory and turning it into one or more testable hypotheses requires us to ground the concepts and relationships expressed in the theory in a set of specific expectations about the relationships between measurable variables. Specifically, a testable hypothesis has three elements:

  1. It describes an expected relationship between two distinct concepts.
  2. It states the expected direction of the relationship.
  3. It must be falsifiable.

First, a hypothesis is an argument that two distinct concepts are related to each other in some way (in other words, a hypothesis is not tautological). Differences or changes in one concept are believed to affect the other concept in some way.

Second, the hypothesis argues that certain values of one concept correspond to certain values of the other concept. This statement can either be very general ("higher values of X lead to lower values of Y") or very specific ("when X is greater than 2, Y will be observed to be true", or "when X is 2, Y will be 4; however when X is 3, Y will be 6").

Finally, it must be possible to show that the statement is untrue. This does not mean that the statement must be untrue under some circumstances; it just means that if the statement is untrue it must be capable of being shown to be untrue. Most statements about morality and religion are not falsifiable, and thus cannot be tested using the scientific method. For example, statements such as "downloading videos and music from the Internet without paying for it is wrong" or "the best system of government would be a dictatorship run by philosopher-kings" are not testable hypotheses.

Operationalizing a Theory

The first theory discussed above, relating political freedom and economic growth, is pretty general; we need to operationalize concepts like "political freedom" and "economic growth" in a measurable way in order to produce a hypothesis. We might, for example, measure "political freedom" in a number of different ways; we could use a very rough qualitative scale, such as describing countries as "free," "partially free," or "not free," or we could use a more quantitative measure such as the scores compiled by activists and scholars. We could also produce our own measure if we did not think these existing measures captured the concept very well.

One commonly used measure of political freedom is compiled by Freedom House, and is known as the "civil liberties index." Every year the organization examines the record of most independent states and assigns a score between 1 and 7 to each country based on its respect for civil liberties in practice, with lower values indicating more freedom than higher values. Thus a totalitarian state such as North Korea would be coded as "7," indicating the lowest possible level of freedom, while most advanced industrialized democracies are typically scored "1" or "2" (depending on political events in the given year), indicating a high level of respect for individual liberties.<ref>Freedom House (2011). "Freedom In The World, 2011". Retrieved 7 July 2011. </ref>

Thankfully there is (somewhat) less dispute about how we might measure economic growth; organizations such as the World Bank compile data on the size of national economies, so we can measure that concept fairly easily by just comparing the size of the economy over time.

So, we might operationalize the first theory above as follows:

  • Countries with lower scores on the Freedom House civil liberties index will have higher values of per capita GDP change than countries with higher Freedom House scores.

Exercise: Produce one or more testable hypotheses for the other three theoretical statements mentioned above.

The Research and Null Hypotheses

Once we have derived a testable hypothesis, we can refer to it as our research hypothesis; it is also commonly referred to as the alternative hypothesis. This hypothesis is what we expect to find evidence for (or against) in our research.

The research hypothesis is commonly referred to when using mathematical notation as <math>H_A</math> ("H-sub-A").

Writing a Hypothesis as a Mathematical Statement

We might want to rewrite our research hypothesis in mathematical terms to be able to test it. To do that, we should think about how we can best restate the hypothesis as a "math statement."

To continue our example of economic growth and political freedom, one way to think about this hypothesis is that we are going to compare countries that are more free to those that are less free. We can think of these countries as being two groups; the "more free" group we'll call "group 1" and the "less free" group can be called "group 2." Since we think that freer countries (i.e. those in group 1) will have more economic growth than less free countries (those in group 2), we'd expect that the average economic growth in group 1 countries would be higher than that in group 2.

So in mathematical terms we can say our research hypothesis is:

<math>H_A : \mu_{{\Delta\text{GDP}}_1} > \mu_{{\Delta\text{GDP}}_2}</math>

In this statement, μ (the Greek letter mu) represents the population mean (as you've seen earlier), and Δ (the Greek letter "delta") is a common symbol meaning "change" or "difference," so essentially our statement is "the average GDP change for countries in group 1 is expected to be higher than the average GDP change for countries in group 2."
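As a rough illustration of the two-group setup, the sketch below splits a handful of countries into "more free" and "less free" groups and compares their sample means. The country labels, scores, growth figures, and the cutoff of 3 are all made up for illustration; none of them come from Freedom House or World Bank data.

```python
# Hypothetical (made-up) records:
# (country label, civil liberties score from 1 to 7, % change in per capita GDP)
countries = [
    ("A", 1, 2.9),
    ("B", 2, 3.4),
    ("C", 2, 1.8),
    ("D", 5, 1.1),
    ("E", 6, 2.0),
    ("F", 7, -0.5),
]

# Assumed cutoff: scores of 1-3 count as "more free" (group 1),
# scores of 4-7 count as "less free" (group 2).
CUTOFF = 3

group1 = [growth for _, score, growth in countries if score <= CUTOFF]
group2 = [growth for _, score, growth in countries if score > CUTOFF]

# The sample means are our estimates of the population means in H_A.
mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)

print(f"group 1 mean GDP change: {mean1:.2f}")
print(f"group 2 mean GDP change: {mean2:.2f}")
print("sample difference points in the direction H_A predicts:", mean1 > mean2)
```

A difference between sample means in the expected direction is only a starting point; whether it tells us anything about the population means is what a hypothesis test is meant to address.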

Exercise: Rewrite each hypothesis from the previous exercise in mathematical terms.

The Null Hypothesis

For rather boring reasons, it turns out that statistical tests of hypotheses generally don't involve testing the research hypothesis directly. Instead, statistical tests look for the absence of a relationship between the concepts. So we must transform our research hypothesis into its logical inverse--you can think of this as its evil twin--known as the null hypothesis, or in mathematical terms, <math>H_0</math> ("H-sub-zero").

The null hypothesis is an assertion that the independent variable has no effect on the dependent variable; in an experimental design, we could think of the null as an assertion that the treatment had no effect on the outcome.

In terms of our example above, our null hypothesis would be <math>H_0 : \mu_{{\Delta\text{GDP}}_1} = \mu_{{\Delta\text{GDP}}_2}</math>.

Exercise: Determine a reasonable null hypothesis for each of the research hypotheses from the previous exercise.

Testing Hypotheses

Testing the Null

You'll note that our hypotheses were stated in terms of a population parameter, <math>\mu</math>. When we test a hypothesis, we typically use data from our sample to decide whether or not the null hypothesis is likely to be true in the population of interest.

Of course, as discussed earlier in this text, our ability to use our sample to test hypotheses about the population requires us to have a random sample of the population of interest to make valid inferences about the population. If we do not have a random sample our conclusions may be incorrect due to bias resulting from systematic sampling error.

There are two possible outcomes of a hypothesis test:

  • We can reject the null hypothesis if we conclude, based on our analysis, that the null hypothesis is untrue.
    • This provides evidence in favor of our research hypothesis and theory (in the population).
  • Or, we can fail to reject the null hypothesis if the findings are not sufficient to conclude the null hypothesis is untrue.
    • However, this does not necessarily mean our research hypothesis is wrong!

Most statistical tests follow the same two steps. First you calculate a test statistic; each statistical technique has its own method for calculating a test statistic, although there are some things that various statistics have in common that you may recognize as you learn more of these methods. Once you have calculated this test statistic, you normally then compare it to a critical value for that statistic, often found in a published table of values, that allows you to decide whether to reject or fail to reject the null hypothesis.
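The two steps can be sketched for the running GDP example as follows. This is a minimal illustration that assumes a large-sample z test (normal approximation) and uses made-up summary statistics; the function name `two_sample_z` is ours, and the critical value comes from Python's standard library rather than a published table.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Step 1: test statistic for H0: mu1 = mu2 (large-sample approximation)."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical summary statistics for the two groups of countries.
z = two_sample_z(mean1=2.7, sd1=2.0, n1=60, mean2=1.9, sd2=2.2, n2=55)

# Step 2: compare the statistic to a critical value.
# H_A says group 1's mean is *greater*, so all of alpha goes in the upper tail.
alpha = 0.05
critical_value = NormalDist().inv_cdf(1 - alpha)

reject_null = z > critical_value
print(f"z = {z:.2f}, critical value = {critical_value:.2f}")
print("reject the null" if reject_null else "fail to reject the null")
```

With small samples a t statistic and t critical values would normally be used instead; the logic of comparing a statistic to a cutoff is the same.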

Regardless of the outcome, it is important to remember that even though we may have found evidence for (or against) our research hypothesis, that is not definitive proof of whether or not it is actually true. The accumulation of evidence over time from a series of hypothesis tests (from different samples over time, for example) can indicate that a hypothesis is likely to be true, but will never definitively prove its truth.

A good example from the history of science would be the theory of gravity advanced by Galileo and Sir Isaac Newton in the 17th century; while, to this day, experiments with gravity usually find evidence in support of the hypotheses derived from this theory, nonetheless Albert Einstein and other scientists discovered in the 20th century that Newtonian gravity is not an accurate description of gravity's effects at very large and very small scales. Hence it would be wrong to say that centuries of experiments "proved" the theory of gravity to be correct, as ultimately it was shown to be an incomplete description of reality.

Errors in Hypothesis Testing

It is possible, even through no fault of our own, that our test will lead us to an incorrect conclusion about the presence or absence of the hypothesized relationship in the underlying population of interest. As we have discussed before, samples are not perfect representations of the underlying population, and so it is possible our sample will not be sufficiently representative of the population to arrive at the correct conclusion. It is also possible the test itself may be incorrect in some circumstances.

There are two sorts of incorrect outcomes that can come about when testing hypotheses:

  • We can incorrectly reject the null when we shouldn't have. In other words, the null is actually true in the population, but we conclude it is not true. We call this Type I Error.
  • We can incorrectly fail to reject the null when we should have. In other words, the null is actually false in the population, but we do not reject it. This is called Type II Error.

We can illustrate the relationship between Type I error, Type II error, and correct conclusions as follows:

| | Null is false in population (<math>H_A</math> is true) | Null is true in population (<math>H_A</math> is false) |
|---|---|---|
| Reject null hypothesis | Correct inference | Type I Error (False Positive) |
| Fail to reject null hypothesis | Type II Error (False Negative) | Correct inference |

In general, there is a trade-off between these two types of error: reducing the chances of Type I error increases the chances of Type II error, and vice versa. However, the severity of making one type of error may not be the same as the other; thus, sometimes we may want to minimize the probability of making a Type I error (and consequently increase the odds of a "false negative"), while other times we may want to minimize Type II error, increasing the chances of a "false positive."
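The trade-off can be made concrete with a small calculation. The sketch below assumes a one-tailed z test and a true effect equal to two standard errors (an arbitrary choice for illustration); `type_ii_error` is a hypothetical helper name, not a library function.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(alpha, shift):
    """Probability of failing to reject H0 in a one-tailed z test when the
    true effect is `shift` standard errors above the null value."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(critical_value - shift)


SHIFT = 2.0  # assumed true effect, measured in standard errors

# Tightening the Type I error rate raises the Type II error rate.
for alpha in (0.10, 0.05, 0.01):
    print(f"Type I rate {alpha:.2f} -> Type II rate {type_ii_error(alpha, SHIFT):.3f}")
```

Holding the data and the true effect fixed, demanding a smaller chance of a false positive pushes the critical value outward, so more real effects fall short of it.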

Some examples of these trade-offs:

  • Issuing visas to tourist visitors: is it better to let someone in the country who shouldn't be here than it is to deny entry to someone who should?
  • Drug testing of athletes or employees: is it better to accuse someone of using drugs who is innocent or allow people who use drugs to escape detection?

Exercise: Come up with an example from politics or society where we would want to minimize false positives (Type I errors), and another example where it would be desirable to minimize false negatives (Type II errors).

Statistical Significance and Power

Statistical Significance and alpha

Social scientists, when testing hypotheses, want to be able to measure or describe mathematically the chances of reaching the wrong conclusion. This is particularly true when the evidence suggests that their research hypothesis is untrue. By quantifying the probability that they have reached the wrong conclusion, scholars are able to report how confident they are that the results they find are truly valid.

The statistical significance of a test is defined as the probability of committing a Type I error, that is, of rejecting the null hypothesis when it is actually true in the population; in other words, it is the probability of a false positive result. The degree of statistical significance is also referred to as the alpha level, α. In general, the lower the alpha level selected by the researcher, the less likely it is that a test will reject a null hypothesis that is in fact true.

The value of α is normally given as a probability between 0 and 1. Researchers in the social sciences sometimes disagree about what level of α they will accept; the most common value among researchers is a 1-in-20 chance, or α=0.05; however, 1-in-10 (α=0.10) is also relatively common, and in experiments based on small sample sizes researchers will sometimes use alpha levels that are substantially higher.

Another term that is quite commonly used by researchers is the "confidence level"; this is the probability that the test will not produce a Type I error when the null hypothesis is true, and is usually given as a percentage (rather than a proportion). An alpha level of 0.05 corresponds to a confidence level of 95% (<math>100(1-0.05) = 100(0.95) = 95\%</math>), while an alpha level of 0.10 corresponds to a confidence level of 90%.
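One way to see what an alpha level means in practice is to simulate many studies in which the null hypothesis really is true and count how often a test rejects it anyway. The sketch below is an illustration under simplifying assumptions: both groups are drawn from the same normal distribution with a known standard deviation of 1, the test is a two-tailed z test, and the trial count, group size, and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(alpha, trials=4000, n=30, seed=1):
    """Share of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    standard_error = sqrt(1 / n + 1 / n)  # known sd of 1 in both groups
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population, so H0 holds by design.
        group1 = [rng.gauss(0, 1) for _ in range(n)]
        group2 = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(group1) / n - sum(group2) / n) / standard_error
        if abs(z) > critical_value:
            rejections += 1
    return rejections / trials


rate = false_positive_rate(alpha=0.05)
print(f"share of true nulls rejected: {rate:.3f}")
print(f"share of true nulls retained: {1 - rate:.3f}")
```

The rejection share should land near the chosen alpha, and the retained share near the corresponding confidence level, with some simulation noise.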

Statistical power and beta

There is also a corresponding way to quantify the chances of a statistical test resulting in a false negative or Type II error. The probability of a hypothesis test arriving at a Type II error is known as beta, β. The statistical power (or sensitivity) of the test is the complement of this probability, 1 − β; the greater the power of a test, the less likely we are to conclude there is no effect in the population when we shouldn't do so. In general, the larger the sample size, the greater the power a test will have.
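The link between sample size and power can be sketched numerically, using the standard definitions of β as the probability of a Type II error and power as 1 − β. The example assumes a one-tailed, two-group z test with equal group sizes and a known common standard deviation; the true difference of 0.5 and standard deviation of 2.0 are made-up values, and `power_one_tailed` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_tailed(true_difference, sd, n_per_group, alpha=0.05):
    """Power (1 - beta) of a one-tailed two-group z test with equal group sizes."""
    standard_error = sd * sqrt(2 / n_per_group)
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    # beta: chance the test statistic falls below the cutoff despite a real effect.
    beta = STD_NORMAL.cdf(critical_value - true_difference / standard_error)
    return 1 - beta


# Same true difference and spread, increasingly large samples.
for n in (25, 100, 400):
    print(f"n per group = {n:3d}: power = {power_one_tailed(0.5, 2.0, n):.3f}")
```

Larger samples shrink the standard error, so the same true difference sits further from the null value in standard-error units and is detected more often.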

Substantive versus statistical significance

When we conduct a statistical test, even if we can reject the null hypothesis at a given alpha level, that doesn't necessarily mean that the actual difference in the population is large or important. A common mistake many new (and even experienced!) researchers make is believing that statistically significant results are automatically meaningful. Researchers should be conscious that substantive significance is usually at least as important as statistical significance.

For example, a researcher might (hypothetically) be interested in studying disparities in grades between white and black students at a major university. The researcher might have access to thousands of student records, and find a statistically significant difference between the average GPA of white and black students, but that the difference was only 0.02 grade points. Even though the difference is statistically significant—in other words, we can be confident there is a difference in the average GPAs of the two groups—the substantive significance of the finding is extremely low, as there is no real, meaningful difference between the two groups' averages.

How can this come about? The standard error of an estimate shrinks as the sample size grows, so with very large samples (where the sample size is larger than 10,000 or so), most statistical tests will find “significant” differences even for small deviations between groups.
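The GPA example can be worked through with rough numbers. The 0.02 difference comes from the example above; the standard deviation of 0.5 grade points and the group sizes are assumptions chosen purely for illustration, and the test is a two-tailed z test.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(difference, sd, n_per_group):
    """Two-tailed p value for a difference in means (two-group z test)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


GPA_DIFFERENCE = 0.02  # difference in mean GPA from the example
ASSUMED_SD = 0.5       # hypothetical spread of GPAs within each group

# Standardized effect size (difference in sd units); unaffected by sample size.
effect_size = GPA_DIFFERENCE / ASSUMED_SD

p_small = two_tailed_p(GPA_DIFFERENCE, ASSUMED_SD, n_per_group=100)
p_large = two_tailed_p(GPA_DIFFERENCE, ASSUMED_SD, n_per_group=10_000)

print(f"effect size: {effect_size:.2f} standard deviations")
print(f"p value with 100 students per group:    {p_small:.3f}")
print(f"p value with 10,000 students per group: {p_large:.4f}")
```

The effect size is identical in both cases; only the sample size changes, and with it the verdict on statistical significance.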

The bottom line: researchers should apply their own judgment to decide truly how important a “statistically significant” finding is.

Directional and Non-Directional Tests

In many cases, researchers will have expectations about the relationship between two concepts that include directionality; the hypothesis not only suggests that the two concepts are related, but also that "more" of one concept will lead to "more" or "less" of the other. Notably, our earlier example involving economic growth and political freedom suggested such a relationship.

Previously we said that the research hypothesis for this relationship was <math>H_A : \mu_{{\Delta\text{GDP}}_1} >\mu_{{\Delta\text{GDP}}_2}</math> and that the corresponding null hypothesis was <math>H_0 : \mu_{{\Delta\text{GDP}}_1} =\mu_{{\Delta\text{GDP}}_2}</math>. However, we can also propose a directional null hypothesis as follows:

<math>H_0 : \mu_{{\Delta\text{GDP}}_1} \leq \mu_{{\Delta\text{GDP}}_2}</math>

Note that instead of asserting equality, this null hypothesis states that group 1's average GDP change is less than or equal to that of group 2.

Directional tests are also referred to as "one-tailed" tests, while non-directional tests are sometimes called "two-tailed" tests.

Directional tests are somewhat controversial among social scientists, particularly when multiple hypotheses are being tested simultaneously, in part because it is more likely you will arrive at a statistically significant finding with a directional test than when using the equivalent non-directional test. Also, some statistical tests (such as the <math>\chi^2</math> and <math>F</math> tests) are inherently non-directional.
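The difference between the two kinds of test shows up in the p values computed from the same test statistic. The sketch below assumes a z statistic of 1.8 (a made-up value) and a research hypothesis that predicts a positive difference.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one-tailed, two-tailed) p values for a z statistic,
    where the one-tailed version assumes H_A predicts a positive difference."""
    one_tailed = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values(1.8)
print(f"one-tailed p = {one_tailed:.4f}")
print(f"two-tailed p = {two_tailed:.4f}")
```

When the observed difference is in the predicted direction, the one-tailed p value is half the two-tailed one, which is why the same data can clear a 0.05 threshold in a directional test and miss it in a non-directional test.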

Hypothesis Testing Using p values

Although the "textbook" approach to hypothesis testing involves calculating test statistics and comparing them to critical values, as discussed above, computer-based statistical software packages like R, SAS, SPSS, and Stata test hypotheses differently. After calculating the test statistic, stats programs automatically calculate a statistic known as the p value, which corresponds to the smallest alpha level at which the null hypothesis could be rejected.

For example, if Stata says that the p value for a test is 0.03, a researcher using any alpha level greater than 0.03 (such as the common 0.05 and 0.10 levels) can reject the null hypothesis, while a researcher who requires an alpha of 0.01 (or 0.001) would be unable to reject the null in this case.
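The decision rule in this example is simple enough to write down directly. The sketch below uses the common convention of rejecting when the p value is less than or equal to alpha; `decide` is a hypothetical helper, not part of any statistics package.

```python
def decide(p_value, alpha):
    """Apply the p value decision rule at a given alpha level."""
    return "reject" if p_value <= alpha else "fail to reject"


P_VALUE = 0.03  # the p value from the example

# The same p value leads to different decisions at different alpha levels.
decisions = {alpha: decide(P_VALUE, alpha) for alpha in (0.10, 0.05, 0.01, 0.001)}

for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision} the null hypothesis")
```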

Moreover it is common for researchers to simply report the p values of a statistical test (or set of statistical tests), or to mark statistically significant findings with asterisks or other symbols, rather than explicitly comparing the test statistic to a critical value, or even reporting the test statistic itself.

Interpreting p values from statistical software output

P values, when shown directly in text, are conventionally reported using notation such as "p ≈ 0.003"; an equal sign is sometimes used instead of the "approximately equal to" sign.

The p values reported by statistical software are often either rounded to 0 or expressed using scientific notation (e.g. something like 1.65e-04); in text, it is conventional to report these p values as being less than a (small) alpha level, typically something like "p < 0.001".
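A small formatting helper can apply these reporting conventions consistently. This is only a sketch: the function name `format_p`, the floor of 0.001, and the choice of three decimal places are assumptions, and journals or style guides may require something different.

```python
def format_p(p_value, floor=0.001):
    """Format a p value for text: very small values are reported as being
    below the floor rather than as 0 or in scientific notation."""
    if p_value < floor:
        return f"p < {floor}"
    return f"p ≈ {p_value:.3f}"


# Values as software might print them, including scientific notation.
for raw in (1.65e-04, 0.0031, 0.0479):
    print(f"{raw!r:>10} -> {format_p(raw)}")
```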

Most statistical software packages will, by default, conduct two-tailed tests for t tests and regression coefficients. If you want a one-tailed p value, you will often have to calculate it yourself. Stata's t test procedure is a notable exception; it produces both one-tailed and two-tailed tests.

Other statistical software issues

When stats programs test for the statistical significance of a particular value (such as the intercept or coefficients in a regression, the correlation coefficient, etc.), they typically are testing against the possibility the parameter is zero. While this is usually what you are interested in testing, it is possible your problem is different—for example, you might want to test whether or not the average speed of traffic on a highway is greater than the speed limit. Ultimately your theory, and not the software, should determine what you are testing against.
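The highway example can be sketched as a one-sample test against a non-zero null value. All of the numbers below (a 65 mph limit, 400 observed vehicles, a mean of 67.2 mph, and a standard deviation of 8 mph) are made up for illustration, and the large sample is what justifies using a z test here.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical traffic survey summary.
SPEED_LIMIT = 65.0   # null value: H0 says mean speed <= 65 mph
sample_mean = 67.2
sample_sd = 8.0
n = 400

standard_error = sample_sd / sqrt(n)

# Test against the speed limit, which is what the research question asks about.
z = (sample_mean - SPEED_LIMIT) / standard_error
p_one_tailed = 1 - NormalDist().cdf(z)

# For contrast: the statistic a default "is the parameter zero?" test would use.
# It is enormous but uninformative, since nobody doubts traffic is moving.
z_against_zero = (sample_mean - 0.0) / standard_error

print(f"z against the speed limit: {z:.2f} (one-tailed p = {p_one_tailed:.2g})")
print(f"z against zero:            {z_against_zero:.1f}")
```

The only change from a default test is the value subtracted from the sample mean, but that value has to come from the theory being tested.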


In this chapter we introduced the concepts related to hypothesis testing in the social sciences. And it was good.


<references group=""></references>

Discussion questions


  1. Convert the following theory to a testable hypothesis: "Evaluations of the government's handling of the economy will influence vote choice". What is the null hypothesis?