Contingency Tables

From OPOSSEM


Objectives

  • Understand how two variables can be associated.
  • Introduce crosstabulations.
  • Learn how to make crosstabulations.
  • Learn how to best present a crosstabulation.

Introduction

Although it is important to quantify and describe individual variables, most scholarship is concerned with how two concepts or variables are related. For example, researchers might ask:

  • Do people who get most of their news from television trust other people less than people who get their news primarily from newspapers or the internet?
  • Are Catholics more likely to be pro-life than Protestants?
  • Do democracies suffer less from corruption than non-democracies?
  • Are 18-25 year olds less interested in politics than older adults?
  • Are individuals who view the economy favorably more likely to vote for the incumbent government?

To answer questions like these, we will discuss how to make tables that test hypotheses that two categorical variables are related to each other. These bivariate tables are interchangeably called contingency tables or cross-tabulations (cross-tabs for short). Cross-tabulate can also be used as a verb to describe the process of completing a bivariate analysis.

When two variables are related, differences or changes in the value of one variable coincide with differences or changes in the value of a second variable. By combining the frequency tables of each of the variables into a cross-tabulation, scholars look at the pattern of results to see if the variables are associated. Two variables are said to be associated if knowing the value of one variable helps one accurately guess the value of the other. Measures of association indicate how strongly two variables are associated with each other.

This section will focus on comparing two categorical (nominal or ordinal) variables in a cross-tabulation. After describing how to make and interpret a cross-tabulation, we will discuss how to visually assess whether the variables are associated and provide some suggestions for making effective tables. We will also discuss the impact of recoding variables. In the following section, we will look at how scholars test whether a relationship could be due to random chance and at statistical measures of association.

The relationship between two continuous variables is analyzed in a correlation analysis, which is covered in a later section.

Cross-tabulation

A cross-tabulation takes the frequency table for one variable and combines it with a frequency table for a second variable. A cross-tabulation can be accurately described as a bivariate frequency distribution. One variable's value categories make up the rows (typically the explained or dependent variable) of the new two-dimensional table, while the second variable's value categories make up the columns (typically the explanatory or independent variable). Here is an example:

Example 1: Cross-tabulation of gun ownership and support for gun permits (2010)

Cells show frequencies with column percentages in parentheses.

Favor or Oppose Police    Have Gun in Home
Permit for Guns           Yes            No             Refused        Total
Favor                     242 (61.7%)    683 (81.9%)    18 (46.2%)     943 (74.6%)
Oppose                    150 (38.3%)    151 (18.1%)    21 (53.9%)     322 (25.5%)
Total                     392 (100.0%)   834 (100.0%)   39 (100.0%)    1,265 (100.0%)

The two variables that were cross-tabulated above are:

  • GUNLAW: Would you favor or oppose a law which would require a person to obtain a police permit before he or she could carry a gun?
  • OWNGUN: Do you happen to have in your home any guns or revolvers?

These were taken from the 2010 U.S. General Social Survey dataset, a poll of Americans [1]. The value categories for favoring or opposing a police permit requirement make up the rows, and the value categories for having a gun in the home make up the columns.

In this example, we are using the contingency table to examine whether there is a relationship between owning guns and support for requiring people to receive permission from the police to carry guns. The dependent variable, the one we are trying to explain, is whether people support or oppose requiring permission from the police to carry guns. Our independent variable is whether or not the person owns a gun. We might hypothesize that support for requiring permission from the police to carry guns depends on whether or not a person owns a gun. The null hypothesis is that owning a gun has no effect on support for requiring permission from the police to carry guns.
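The idea of combining two frequency tables into one cross-tabulation can be sketched in a few lines of Python. This is only an illustration using the standard library: the respondent-level list is rebuilt from the published cell counts in Example 1 (the original survey file is not used here), and the variable names are chosen for this sketch.

```python
from collections import Counter

# Cell counts copied from Example 1, keyed by (GUNLAW answer, OWNGUN answer).
published_cells = {
    ("Favor", "Yes"): 242, ("Favor", "No"): 683, ("Favor", "Refused"): 18,
    ("Oppose", "Yes"): 150, ("Oppose", "No"): 151, ("Oppose", "Refused"): 21,
}

# Rebuild one (gunlaw, owngun) pair per respondent. Real survey data would
# normally already come in this one-row-per-respondent form.
respondents = [pair for pair, n in published_cells.items() for _ in range(n)]

# Frequency table for each variable on its own.
gunlaw_freq = Counter(gunlaw for gunlaw, _ in respondents)
owngun_freq = Counter(owngun for _, owngun in respondents)

# Cross-tabulation: count every combination of the two answers.
crosstab = Counter(respondents)

print(gunlaw_freq["Favor"], gunlaw_freq["Oppose"])   # 943 322
print(owngun_freq["Yes"], owngun_freq["No"], owngun_freq["Refused"])  # 392 834 39
print(crosstab[("Favor", "Yes")])                    # 242
```

The two single-variable frequency tables reappear as the row and column totals of the cross-tabulation, while the cells hold the joint counts.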

Components of a contingency table

The row and column labels are called the category labels or value labels. In the top left corner is the variable label for the row variable. Above the column category labels is the variable label for the column variable.

Unless the table says otherwise, the boxes in the middle of the table contain frequencies (the number of observations or cases that fall into each box). Each of these boxes is called a data cell, or cell for short.

Look at the cell in the top left corner of the example above, "Gun Ownership and support for gun permits." The frequency in this cell is 242. This indicates that 242 respondents said that they own a gun and that they support requiring people to get a permit from the police before carrying a gun.

In the same cell, we also find the percentage of gun owners who said they support requiring people to get a permit from the police before carrying a gun (61.7%). Percentages in contingency tables should not exceed one digit past the decimal point. The percentages displayed in a contingency table could be row percentages, which display the percentage of observations in the cell relative to the total number of observations in the row; column percentages, which display the percentage of observations in the cell relative to the total number of observations in the column; and/or total percentages, which display the percentage of observations in the cell relative to the total number of observations in the entire table. This table uses column percentages, which are the most common percentages displayed in contingency tables.
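The three kinds of percentages differ only in their denominator. A minimal Python sketch for the top-left cell of Example 1 follows; the helper name pct is an assumption made for this illustration, and the counts are taken from the table.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

cell = 242          # favor a permit AND have a gun in the home
column_total = 392  # everyone with a gun in the home
row_total = 943     # everyone who favors a permit
grand_total = 1265  # everyone in the table

column_pct = pct(cell, column_total)  # share of gun owners who favor
row_pct = pct(cell, row_total)        # share of those in favor who own a gun
total_pct = pct(cell, grand_total)    # share of all respondents in this cell

print(column_pct, row_pct, total_pct)  # 61.7 25.7 19.1
```

Only the column percentage (61.7%) matches the figure printed in the table, which confirms that the table reports column percentages.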

You can find the total number of observations in each row at the end of the row in the far right column labelled "total." Find the total number of observations in each column at the bottom of the column in the row labelled "total." These numbers are called marginal totals. For example, the total number of respondents, 392, who indicate that they own a gun can be found in the cell in the far left of the bottom row.

You may also see percentages in these cells, which are called marginal percentages. The marginal percentage at the end of the top row in the table above indicates the percentage of respondents who said that they support requiring people to get a permit from the police before carrying a gun. 943 respondents indicated that they support requiring people to get a permit from the police before carrying a gun, 74.6% of the 1,265 total respondents. Because this example uses column percentages, the marginal percentages at the bottom of each column are all 100%, which clearly conveys to the reader that the percentages are column percentages and sum to 100% in each column (it is neither unusual nor a problem if, due to rounding, the percentages in a column sum to slightly more or less than 100%).

The grand total of observations (1,265) can be found in the bottom-right corner. Since this example uses survey data, this is the total number of respondents whom the survey polled (including 39 people who refused to answer the question about whether or not they own any guns). Borders, colors, and underlined or highlighted text are used at the table creator’s discretion for clarity or emphasis.

Comparing bivariate relationships

Percentages help readers more readily view patterns in the data. Including percentages in a contingency table is especially important to assess the relative probability of a value of one variable occurring if the second variable takes on a particular value. These relative probabilities are readily seen in tables when expressed as percentages. In the example of gun ownership and support for requiring police permission to carry guns, we find that 61.7% of gun owners favor requiring police to give permission for people to carry guns, compared to 81.9% of people who do not own guns and 46.2% of those who refused to answer the question. Comparing these percentages tells the reader that people who own guns are less likely to favor requiring police permission to carry guns than those who do not own guns. Those who refused to answer the question are least likely to favor requiring police permission to carry guns.

Which percentages to display

The choice of which percentages to display in a contingency table depends on what you are trying to explain. In this example, we are trying to explain support (or opposition) for requiring police permission to carry guns. Support (or opposition) for requiring police permission to carry guns is our dependent variable. By convention, the dependent variable is most commonly found in the rows of the table, but the dependent variable could go in the columns for aesthetic reasons like long category headings or a large number of categories. If the dependent variable is (as in the example above) found in the rows of the table, then provide column percentages so that readers can compare the categories of the independent variable across each row. Always calculate percentages within the categories of the independent variable, so provide row percentages if you choose to put the independent variable in the rows of the table.

Evaluating the relationship

When comparing the percentages across each row, we look to see if there are large differences between the columns to test the hypothesis that there is a relationship between the two variables. If there are large differences, then we have evidence against the null hypothesis. Recall that the null hypothesis is that owning a gun has no effect on support for requiring permission from the police to carry guns. If we look at the table and compare percentages across the top row, we do find a large, twenty percentage point difference in support for requiring police to grant permits for people to carry guns between gun owners and those who do not own guns. This evidence is inconsistent with the null hypothesis that there is no difference in support for requiring gun permits between gun owners and those who do not own guns. In a later chapter, we will discuss chi-square, which provides a statistical test for these differences to evaluate whether we can confidently reject the null hypothesis.
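The twenty percentage point difference can be reproduced directly from the counts in Example 1. The short Python sketch below does only that arithmetic; it is not a significance test, and the variable names are chosen for illustration.

```python
# Respondents who favor a permit, by whether they have a gun in the home.
favor = {"Yes": 242, "No": 683}
# Column totals for the same two categories.
totals = {"Yes": 392, "No": 834}

# Column percentage in the "Favor" row for each category.
pct_favor = {group: 100 * favor[group] / totals[group] for group in favor}

# Difference in percentage points between non-owners and owners.
gap = pct_favor["No"] - pct_favor["Yes"]

print(round(pct_favor["Yes"], 1))  # 61.7
print(round(pct_favor["No"], 1))   # 81.9
print(round(gap, 1))               # 20.2
```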

If the two variables in a contingency table are related, then scholars say that the two variables are associated. If there is a perfect relationship between two variables, all observations in each column would be found in a unique row. If there is a strong association between the two variables, then knowing the value of the independent variable allows one to guess the value of the dependent variable. A weak association would have differences between categories of the independent variable, but the differences would not be large enough to allow one to guess the value of the dependent variable. No differences between categories of the independent variable would indicate no association between the two variables.

In this case, if you know someone does not have a gun, then you can probably guess that he or she favors requiring people to get permission from the police to carry a gun. If someone does have a gun, then he or she is less likely to support requiring people to get permission to carry a gun, even though a majority of gun owners do.
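One simple way to make the idea of "guessing" concrete is to count how many respondents would be guessed correctly with and without knowledge of the independent variable. The Python sketch below applies this to the counts in Example 1. It is an illustration of the reasoning only, under the assumption that the guess is always the most common answer; it is not one of the formal measures of association discussed in the following section.

```python
# Counts from Example 1, grouped by the independent variable (gun in home).
cells = {
    "Yes":     {"Favor": 242, "Oppose": 150},
    "No":      {"Favor": 683, "Oppose": 151},
    "Refused": {"Favor": 18,  "Oppose": 21},
}

grand_total = sum(sum(column.values()) for column in cells.values())

# Without knowing gun ownership: always guess the most common answer overall.
overall = {}
for column in cells.values():
    for answer, count in column.items():
        overall[answer] = overall.get(answer, 0) + count
correct_blind = max(overall.values())

# Knowing gun ownership: guess the most common answer within each column.
correct_informed = sum(max(column.values()) for column in cells.values())

print(grand_total)       # 1265
print(correct_blind)     # 943
print(correct_informed)  # 946
```

By this rough count, the guess changes only for those who refused to answer, because a majority in both the "Yes" and "No" columns favors the permit requirement. The percentages differ noticeably between columns, but the most likely answer is the same in both.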

Too strong?

There is such a thing as too strong a relationship between two variables. If two variables are nearly perfectly associated, then one must assess whether both variables may be measuring the same, or too similar, things. In such situations, you may want to run a reliability analysis, which is explained in a different section.

Bivariate association and causality

An association seen in a cross-tabulation does not, in and of itself, enable us to claim a causal relationship between the independent variable and the dependent variable. However, if we find a weak relationship or no association, then we can rule out a causal relationship. In the next section, we will discuss how introducing a third variable allows us to investigate whether or not there is a causal relationship.

Example

Advantages of crosstabulations

Contingency tables are useful because little or no understanding of statistical concepts is necessary for interpretation and little technical know-how is necessary to build tables. Readers can easily observe patterns of association and can see if the pattern is weaker across some rows. Crosstabulations are very flexible. Cross-tabs can be done with almost any variable (especially if recoding is done to simplify the presentation), and variables can be put in rows or columns.

Disadvantages of crosstabulations

Some contingency tables, though, can be confusing or misleading. Variables with many categories require large tables that are difficult to read. Categories with few observations can obfuscate the bivariate association.

Both of these problems can be solved by recoding the variables to simplify the table, either by merging categories or by dropping some values. However, recoding can make relationships appear stronger or weaker than they actually are, and should be done judiciously.
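As a sketch of what dropping a value does, the Python snippet below removes the small "Refused" category from the gun ownership table in Example 1 and recomputes the totals. The function name drop_category is an assumption made for this illustration.

```python
# Counts from Example 1 (rows: permit opinion, columns: gun in home).
cells = {
    "Favor":  {"Yes": 242, "No": 683, "Refused": 18},
    "Oppose": {"Yes": 150, "No": 151, "Refused": 21},
}

def drop_category(table, column):
    """Return a copy of the table without the given column category."""
    return {
        row: {col: n for col, n in columns.items() if col != column}
        for row, columns in table.items()
    }

recoded = drop_category(cells, "Refused")

# Totals change once the 39 refusals are excluded.
new_total = sum(sum(columns.values()) for columns in recoded.values())
favor_total = sum(recoded["Favor"].values())
favor_share = round(100 * favor_total / new_total, 1)

print(new_total)    # 1226
print(favor_total)  # 925
print(favor_share)  # 75.4
```

The recoded table is easier to read, but its grand total and marginal percentages no longer describe all respondents, which is one reason recoding decisions should be reported alongside the table.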

Alternatives to crosstabulations

There are some alternatives to crosstabulations that researchers can consider when using ordinal, interval or ratio-level variables.

Two nominal variables

There are few good alternatives to crosstabulations for two nominal variables. Row (or bar) charts showing the frequencies in each category as a cluster can be used. This is just a graphical display of a cross-tabulation, and is best used when the researcher wants to emphasize differences in the dependent variable at different values of the independent variable.

One ordinal or interval/ratio variable and one nominal variable

When the analysis includes one ordinal, interval, or ratio-level variable, especially one with many categories, it may be clearer and more sensible to present the mean or median values of that variable at each category of the second variable in a table. Similarly, a row (or bar) chart can be used to depict changes in the mean or median value across different values of the nominal variable.
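A table of means or medians by category can be produced with a simple grouping step. The Python sketch below uses invented data purely for illustration: the regions, the 0 to 10 "political interest" scores, and all variable names are hypothetical and do not come from any survey cited here.

```python
from statistics import mean, median

# Hypothetical (invented) observations: (nominal category, 0-10 score).
observations = [
    ("North", 3), ("North", 5), ("North", 7),
    ("South", 6), ("South", 8), ("South", 8), ("South", 10),
]

# Group the scores by category of the nominal variable.
scores_by_group = {}
for group, score in observations:
    scores_by_group.setdefault(group, []).append(score)

# One summary row per category: (mean, median).
summary = {
    group: (mean(scores), median(scores))
    for group, scores in scores_by_group.items()
}

for group, (avg, mid) in summary.items():
    print(group, avg, mid)
```

Each category of the nominal variable gets a single summary row, which is far more compact than a cross-tabulation with eleven score categories.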

Two ordinal or interval/ratio variables

For two ordinal or interval/ratio variables, a table showing the mean or median value of one variable at different levels of the second variable can be used. More often, though, a correlation analysis is used, and the researcher would calculate and present Pearson's r coefficient. Alternatively, the data can be plotted as a scatterplot or the mean or median values can be displayed as a column or line chart.

Conclusion

Contingency tables, or crosstabulations, are a powerful but easy-to-use tool to compare two categorical variables. The best contingency tables display frequencies and/or percentages calculated within the categories of the independent variable. These tables tend to be readily and intuitively grasped by many readers, including those with little or no statistical training. Both ordinal and nominal variables can be displayed in contingency tables. Researchers can construct contingency tables using any statistical program and many spreadsheet programs, such as Microsoft Excel. Measures of association like Kendall's tau and Cramér's V can accompany a contingency table, as can tests of significance like chi-square.

References

Discussion questions

Problems

  1. Based on the data below:
  • What percentage of voters supported the incumbent party?
  • How many voters who viewed the economy as having worsened voted for the incumbent party?
  • How many voters who viewed the economy as having improved voted for the incumbent party?
  • Based on the data in the table below, does there appear to be a relationship between retrospective economic evaluations and support for the incumbent party?

Cross-tabulation of retrospective economic evaluations and support for the incumbent government (Canada 2011)

Cells show frequencies with column percentages in parentheses.

                              Retrospective evaluation of national economy
Incumbent vote                Worse            Same             Better           Total
Did not vote for incumbent    384 (73.7%)      791 (67.6%)      454 (45.1%)      1,629 (60.4%)
Voted for incumbent           137 (26.3%)      379 (32.4%)      552 (54.9%)      1,068 (39.6%)
Total                         521 (100.0%)     1,170 (100.0%)   1,006 (100.0%)   2,697 (100.0%)

The two variables that were cross-tabulated above are:

  • CPS11_39 (recoded to remove missing cases): Over the past year Canada's economy has become better, become worse, or stayed about the same?
  • PES11_6 (recoded to remove missing cases): Which party did you vote for?

These were taken from the 2011 Canadian Election Study dataset, available at: www.ces-eec.org

Glossary