Univariate descriptive statistics

From OPOSSEM


Objectives

  • Goal: measuring and describing variables
  • Distributions: frequencies, rates and ratios, percentages and percentiles
  • Measures of central tendency: mode, mean, and median
  • Measures of dispersion: range, interquartile range, variance, and standard deviation

Introduction

A recent survey of Israeli voters, the 2009 Israel National Election Study [1], asked people to provide their opinion of Prime Minister Benjamin Netanyahu. Survey respondents were asked to rate Netanyahu on a scale running from 1 (Hate) to 10 (Love). Imagine someone asked you how you would describe Netanyahu's popularity. By looking at the data, you could easily describe how many people chose each rating on the scale. One hundred seventy-one people rated Netanyahu a "1", the lowest possible rating. Another 87 rated Netanyahu a "2", 75 rated him a "3"... There are three problems, though, with this approach. First, it would take a very long time to complete this description. Second, other people may have a very hard time gauging from this long list whether or not he is popular. Third, it would be very difficult to compare Netanyahu's popularity to that of others, whether the comparison is with one of his opponents whose popularity was measured using the same scale or with another politician. This chapter considers some common ways to describe data like this measure of Netanyahu's popularity so that each of these three problems is addressed by providing a concise, clear description that facilitates comparisons to other results. These methods are called univariate or descriptive statistics.

Univariate statistics are tools used to describe a single variable. Researchers use these tools whenever they want to describe their results after the data has been collected. As a result, these tools are often called descriptive statistics. The goal of descriptive statistics is to provide relevant information about a variable in a concise, easy-to-understand manner by referencing standard ways of summarizing the variation. Using standard ways of summarizing the variable enables others to readily understand your results, and facilitates comparisons to other findings.

Distributions

Frequencies

The most basic way of presenting data is to report all of the individual observations, often in a table or graph. Rather than provide a list of each observation, it is best to organize the observations by response category. When the observations are organized by response category, the number of observations in each category is called a frequency. Frequencies tell us how many times a category appears in the sample. In the example above, 171 people rated Netanyahu a "1", so the frequency of people rating Netanyahu a "1" is 171. The sum of frequencies across all of the categories is the total. A table of all of the responses, organized by category, together with the total is called a frequency table. Sometimes the "cumulative frequency", the running sum of the frequencies, is also displayed. Commonly used for categorical or nominal variables, these tables allow people to quickly see the frequency distribution of a variable.

Example: Frequency Table with Percentages and Cumulative Percentages

The Pew Global Attitudes survey is a large survey of over 26,000 people in 25 countries and territories. The table below displays how survey respondents answered a question about their primary source of news about national and international affairs. The types of media (television, newspapers, internet...) are the response categories. Since this is a nominal variable, the order of the types of media does not matter. The frequencies are listed in the first column of numbers, next to the media type. So, for example, the frequency of getting most news about national and international affairs from the internet is 1,982. The total frequency is at the bottom of the column.


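{| class="wikitable"
|-
! Primary news source !! Frequency !! Percent !! Cum. Percent
|-
| Television || 18,919 || 71.8 || 71.8
|-
| Newspapers || 2,488 || 9.4 || 81.3
|-
| Radio || 2,344 || 8.9 || 90.2
|-
| Magazines || 167 || 0.6 || 90.8
|-
| Internet || 1,982 || 7.5 || 98.3
|-
| Other (Volunteered) || 254 || 1.0 || 99.3
|-
| Don't know || 102 || 0.4 || 99.7
|-
| Refused || 85 || 0.3 || 100.0
|-
| Total || 26,341 || 100.0 ||
|}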


Percentages

Frequencies can often be cumbersome or misleading. In the example above, 1,982 people told the survey researchers that they get their national and international news primarily from the internet. That number is large, but it is not very large compared to the number of people, 18,919, who get their news primarily from television. To facilitate comparisons, frequencies can be expressed as percentages, rates or ratios. These measures use a standard denominator that is the same across time, places, people or data sets.

A common way to transform frequencies to make them easier to compare is to convert them to percentages. Percentages are calculated by dividing the frequency by the total and multiplying the result by 100. Because every percentage is expressed out of the same base of 100, percentages can be readily compared.

In the news consumption example presented above, the number of people who get their news primarily from the internet (1,982) divided by the total number of respondents (26,341) is 1,982/26,341 = 0.075 or 7.5%. With percentages, you can easily compare the number of people who get their news from the internet to the number of people who get their news from television: 7.5% get their news primarily from the internet, compared to 71.8% who get their news primarily from television. You can also readily compare percentages across groups. For example, this survey was conducted in twenty-five countries and territories. Using percentages, you can observe whether more Americans tend to get their news from the internet than Mexicans or Palestinians.

The table presents the percentages in the middle column. In the right column, the table presents cumulative percentages, which are the percentage of the total observations found in that row plus all of the rows above it. So, in the example above, note that 81.3% of all respondents get their news from either newspapers or television.
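As a minimal sketch of the arithmetic behind the table, the Python snippet below recomputes the percentage and cumulative percentage columns from the reported frequencies. The variable names are illustrative choices and are not part of the Pew data set.

```python
# Frequencies copied from the news-source table above.
frequencies = {
    "Television": 18919,
    "Newspapers": 2488,
    "Radio": 2344,
    "Magazines": 167,
    "Internet": 1982,
    "Other (Volunteered)": 254,
    "Don't know": 102,
    "Refused": 85,
}

total = sum(frequencies.values())  # total number of respondents

rows = []
running = 0  # cumulative frequency so far
for source, count in frequencies.items():
    running += count
    percent = 100 * count / total
    cum_percent = 100 * running / total
    rows.append((source, count, round(percent, 1), round(cum_percent, 1)))

for source, count, percent, cum_percent in rows:
    print(f"{source:<20} {count:>7,} {percent:>6.1f} {cum_percent:>6.1f}")
```

Note that the cumulative percentage is computed from the running sum of frequencies rather than by adding the rounded percentages, which is why the second row shows 81.3 even though 71.8 + 9.4 = 81.2.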

Ratios

Another way of transforming frequencies to make them easier to communicate is to convert frequencies to ratios or rates. Ratios express two frequencies as a fraction, allowing one to describe one frequency as a function of the other. The ratio of respondents who primarily get their news from the internet to those who primarily get their news from television is 1,982/18,919. Typically, these numbers are simplified to rounded integers. First, divide both sides of the fraction by the numerator (1,982) to get 1/9.5. The denominator can round to 10. Report the ratio of people who get their news primarily from the internet compared to the number of people who get their news from television using a colon instead of a slash, as 1:10. Ratios can be used to compare categorical or continuous variables. When comparing variables, it is important to always choose the same denominator. So, using television as the common denominator, we can compare the internet news-consumers (1:10) to those who get their news from magazines (1:113).
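The simplification described above can be sketched in a few lines of Python. The helper name `ratio_to_one` is an illustrative choice, and the frequencies are the ones reported in the news-source table.

```python
def ratio_to_one(numerator, denominator):
    """Express numerator:denominator as 1:x, with x rounded to one decimal."""
    return round(denominator / numerator, 1)

television = 18919
internet = 1982
magazines = 167

# Television is the common denominator for both comparisons.
internet_vs_tv = ratio_to_one(internet, television)    # about 9.5, reported as 1:10
magazines_vs_tv = ratio_to_one(magazines, television)  # about 113.3, reported as 1:113

print(f"internet : television  = 1:{internet_vs_tv}")
print(f"magazines : television = 1:{magazines_vs_tv}")
```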

Example: GDP per capita

The most common measure of a country's wealth is gross domestic product (or GDP). It is typically reported as a ratio, using the country's population as the denominator. For example, you can see how over 200 countries are ranked by GDP per capita in the World Factbook.

Rates

Rates are similar to ratios except that the denominator is a standardized unit typically used as a reference, such as a multiple of 100 or a fixed unit of time. In politics and public policy, rates are frequently used to describe aggregate, group characteristics in a way that allows groups to be readily compared even if they are of vastly different sizes. For instance, crime statistics are typically expressed as a rate with a base of 100,000 people. More crimes occur in more populated areas, but that does not necessarily mean that crime is any more common there or that one is any more likely to be a victim of a crime in a well-populated area. Using a rate amounts to asking what would happen if the small town had the same number of people as the big city, which gives a more accurate gauge of whether an area is more or less safe. For example, this table presenting crime data from the U.S. Department of Justice shows crime rates per 100,000 people. There were 1,867,157 burglaries in large cities (Metropolitan Statistical Areas) compared to 164,859 in smaller cities and 167,109 in non-metropolitan areas (small towns and rural areas). Before you leave your door unlocked in smaller cities, consider that when these frequencies are converted to rates per 100,000 people, the burglary rate in large cities is 727.3, but in small cities the rate is 822.6. In small towns and rural areas, the rate is a mere 552.8 per 100,000 people. Similarly, many descriptive statistics for countries, like the number of people who can read, the number of deaths and the number of births, are expressed as rates. Rates are only used to describe continuous variables.
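To make the conversion concrete, here is a small Python sketch. The function name `rate_per` and the two places are hypothetical, with counts and populations invented purely for illustration; they are not taken from the Department of Justice data.

```python
def rate_per(count, population, base=100_000):
    """Return the number of events per `base` people."""
    return count / population * base

# Hypothetical places, invented for illustration only.
big_city = {"burglaries": 5_000, "population": 1_000_000}
small_town = {"burglaries": 50, "population": 5_000}

big_city_rate = rate_per(big_city["burglaries"], big_city["population"])
small_town_rate = rate_per(small_town["burglaries"], small_town["population"])

print(f"big city:   {big_city_rate:.1f} per 100,000")
print(f"small town: {small_town_rate:.1f} per 100,000")
```

In this made-up case the big city has 100 times as many burglaries, yet the small town has the higher burglary rate once both are put on the same base of 100,000 people.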

Percentiles

After ordering the observations from smallest to largest, percentiles divide all of the observations into 100 equal-sized groups. Percentiles are commonly used when the rank of an observation is important, like students' scores on a standardized test or a student's class standing. Determining whose SAT score or grade point average is in the top 10% requires identifying the 90th percentile. Everyone who takes the SAT is told, along with their score, how their performance compares to students in their state and across the USA in the previous year [2]. If a student's score is in the 65th percentile, then that student knows that she did better than 65% of other students. Likewise, managers in both the public and private sector seeking to award performance bonuses or promotions often want to identify which employees are among the top 10% or 25% of all employees.

Some percentiles are used more frequently than others. Quartiles and deciles are the most common percentiles reported. Quartiles divide the observations into four equal groups, with cut points at the 25th, 50th and 75th percentiles. Deciles divide the observations into ten equal groups. The 50th percentile, the middle observation in a sample, is also called the median, which we will discuss in the following section since it is an important measure of central tendency. The interquartile range (or IQR) measures the difference between the 25th and 75th percentiles.
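The following Python sketch shows one way to compute quartiles, the interquartile range and a percentile rank. The scores are hypothetical, and `percentile_rank` is an illustrative helper rather than a standard function. The text does not specify an interpolation rule, so the sketch assumes the "inclusive" method of `statistics.quantiles`; other software may use slightly different conventions and return slightly different cut points.

```python
from statistics import quantiles

# Hypothetical test scores, invented for illustration.
scores = [52, 58, 61, 65, 70, 74, 79, 85, 93]

# Quartiles are the cut points at the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

def percentile_rank(score, data):
    """Percentage of observations in `data` that fall below `score`."""
    below = sum(1 for x in data if x < score)
    return 100 * below / len(data)

print(q1, q2, q3, iqr)
print(round(percentile_rank(79, scores), 1))
```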

Measures of Central Tendency and Dispersion

Frequencies, percentages, rates and ratios often do not provide researchers with enough information about a variable, or provide researchers with too much information about a variable. Both of these problems make comparisons difficult: tables with many values take a long time to read, may take even longer to make, and are hard to compare to other tables. Instead, statistics have been developed that provide an overview of the entire distribution of the observations rather than particular values. These descriptive statistics include:

  • Measures of central tendency provide an indicator of a value that is typical or representative of the entire distribution of observations. Measures of central tendency include the mode, mean and median values.
  • Measures of dispersion or variability provide researchers with a measure of how typical the measure of central tendency is by providing estimates of how different the other values are from the typical value or a known distribution like the normal curve. These measures also describe the extreme ends of the sample distribution. For ordinal and interval level variables, the measures of dispersion measure how spread out the values are.

Descriptive statistics for categorical variables: mode

The simplest measure of central tendency is the mode, the most frequent observation in the data. The mode of the news media consumption variable described above is television, since more people get their news from television than from any other source. Mode can be used for any level of measurement, but it is most appropriate for nominal or categorical variables.

Mode is actually a very good "typical" measure of the news consumption variable because over 70% of all respondents get their news from television. By itself, however, the mode does not provide any indication of how representative it is of the entire distribution. In fact, some variables may have more than one mode, or categories that have nearly as many observations as the mode. To provide an indication of how typical the mode is compared to the rest of the distribution, researchers report either the percentage of observations contained in the mode or the percentage of all observations that do not fit into the modal category. The latter measure is called the "variation ratio." When the percentage of observations in the mode is not very high, or equivalently the variation ratio is high, readers will know that the mode is not a very typical observation. A standardized version of the variation ratio is called [[Wikipedia:Qualitative_variation|QV]], but it is very rarely used.
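As a brief illustration, the Python sketch below finds the mode of the news-source variable and computes the variation ratio as the share of observations outside the modal category. The variable names are illustrative, and the frequencies are those from the table earlier in the chapter.

```python
# Frequencies copied from the news-source table.
frequencies = {
    "Television": 18919,
    "Newspapers": 2488,
    "Radio": 2344,
    "Magazines": 167,
    "Internet": 1982,
    "Other (Volunteered)": 254,
    "Don't know": 102,
    "Refused": 85,
}

total = sum(frequencies.values())

# The mode is the category with the highest frequency.
modal_category = max(frequencies, key=frequencies.get)

# Share of observations in the mode, and the share outside it.
modal_share = frequencies[modal_category] / total
variation_ratio = 1 - modal_share

print(modal_category, round(modal_share, 3), round(variation_ratio, 3))
```

Here roughly 28% of respondents fall outside the modal category, so the mode describes most, but not all, of the sample.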

Measures of Central Tendency for Ordinal and Interval Variables: Mean and Median

The mean, or arithmetic mean, is colloquially called the average. The mean is the sum of all observation values divided by the total number of observations.

<math>\bar{y} = \frac{\sum_{i=1}^n{y_i}}{n}</math>


Mean is the appropriate measure of central tendency for continuous variables (ordinal and interval). However, be careful: because the mean depends on the value of every observation and on the total number of observations, it is sensitive to observations whose values are vastly different from those of the other observations.

An alternative measure of central tendency to the mean is the median. When the observations are put in order from lowest to highest, the value of the middle observation in the sample is the median. Half of all observations have lower values than the median, and half of all observations have higher values. The median is just another name for the 50th percentile. Unlike the mean, the median is not sensitive to the values of the other observations nor to the total number of observations. As a result, it is not sensitive to observations (outliers) that are much higher or much lower than the median value. The median is the most appropriate measure for ordinal variables with relatively few categories and for any distribution with outlying observations.

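With the <math>n</math> observations ordered from lowest to highest as <math>y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}</math>, the median is:

<math>\tilde{y} = \begin{cases} y_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}</math>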
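A minimal Python sketch of the median calculation follows. It assumes the common convention that, when the number of observations is even, the median is the average of the two middle values. The function and sample names are illustrative.

```python
def median(values):
    """Return the middle value of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[middle]
    # Even number of observations: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [7, 3, 9, 5, 1]        # ordered: 1, 3, 5, 7, 9
even_sample = [7, 3, 9, 5, 1, 100]  # ordered: 1, 3, 5, 7, 9, 100

print(median(odd_sample))   # 5
print(median(even_sample))  # 6.0
```

Adding the extreme value 100 moves the median only from 5 to 6, which illustrates why the median is robust to outliers.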

Example: Feelings towards Benjamin Netanyahu

At the start of the chapter, we introduced a survey item measuring the popularity of Israeli Prime Minister Benjamin Netanyahu in 2009. This survey item asked respondents to rate Netanyahu on a scale of 1 to 10. Giving Netanyahu a value of one meant that the respondent hated or rejected him, while his most ardent supporters would give him a ten. The resulting variable is ordinal and is treated here as continuous. Since there are ten categories, it creates a large table that is not easy to compare to other distributions to gauge whether or not Netanyahu is very popular compared to his opponents in Israel or relative to other political leaders. The mean rating for Netanyahu is 5.14 and the median is 5. The leader of the opposition to Netanyahu, the leader of the Kadima Party, Tzipi Livni, received a mean rating of 5.49 and a median rating of 6. Without even looking at the frequency tables, one can conclude using just the mean and/or the median that Livni is a little more popular than Netanyahu (on average). Both the mean and the median indicate that both politicians, on average, are rated in the middle of the rating scale. This suggests that neither politician was very popular with much of the electorate in 2009, nor was either politician especially unpopular (on average). However, without any measures of dispersion, be mindful that these values could be the result of many voters having so-so opinions of Livni and Netanyahu, or they may be the product of many respondents strongly disliking them balanced by similar numbers of respondents strongly favoring them.

How do you think their popularity compares with politicians in other countries? By examining the mean and/or median popularity of other politicians, you can see whether these Israeli politicians are more popular with their own voters than politicians elsewhere are with theirs.

Selecting a measure of central tendency

As in the example above, mean and median often reveal similar information about a distribution. However, this is not always the case. For example, let's say you wanted to know how much money graduates of a political science seminar (with 20 students) were making five years after graduation. Imagine that one of the graduates became a star professional basketball player. While most graduates were making between $30,000 and $50,000, he was earning over $5 million a year. The mean income would approach $300,000, which is clearly not an income typical of graduates of the seminar. In statistical terms, the basketball player is an outlier. In the presence of such outliers, the mean is not a very good measure of central tendency. Instead, choose to report the median. When median and mean provide very different estimates of central tendency, measures of dispersion can provide guidance over which measure is best or, indeed, whether neither measure describes the distribution very well.
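The effect of the outlier can be checked with a short Python sketch. The individual incomes below are hypothetical values chosen to be consistent with the description of the seminar (20 graduates, most earning between $30,000 and $50,000, and one earning $5 million); they are not real data.

```python
from statistics import mean, median

# Hypothetical incomes for the 20 seminar graduates: 19 ordinary
# salaries plus one professional basketball player.
incomes = [20_000] + [30_000] * 4 + [40_000] * 10 + [50_000] * 4 + [5_000_000]

print(len(incomes))     # 20 graduates
print(mean(incomes))    # 287000: pulled far upward by the single outlier
print(median(incomes))  # 40000.0: still a typical graduate's income

# Dropping the outlier shows how much one observation moved the mean.
without_outlier = incomes[:-1]
print(round(mean(without_outlier)))  # 38947
```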

Measures of dispersion: range, mean deviation, variance and standard deviation

To flag whether there are any extreme values in the sample, it is useful to know the dispersion of values from the mean. The more observations with extreme values like the basketball player's income, the greater the dispersion. The simplest measure of dispersion is the range. After ordering the observation values, the range is the lowest value subtracted from the highest value. In the seminar with the pro basketball player, the highest value is $5 million while the lowest value is $20,000. The range, as a result, is 5,000,000 - 20,000 = 4,980,000.

When there are quite a few outliers, the interquartile range (or IQR) is better than the full range. Instead of taking the difference between the highest and the lowest value, the interquartile range takes the difference between the 75th percentile and the 25th percentile. By restricting the range to the half of the observations closest to the median, researchers can ignore extreme values that may not be of interest to them.

Interval variables often have large ranges. Knowing how large the range is relative to the mean is therefore more useful to many researchers than the mean alone, because it lets them gauge whether the mean is indeed typical.

An easy way to measure how far observations are from the mean is to subtract the value of the mean from the value of each observation. While this simple arithmetic works for estimating the difference from the mean of each observation to identify outliers, the measure must be transformed when aggregating the measures for the entire sample. Consider a sample like the popularity of Netanyahu, measured on a scale from 1 to 10, with a mean of 5. An observation at the lowest end of the scale (1) has a difference from the mean of -4 because 1-5=-4. An observation of 9 near the high end of the scale has a difference of 4 because 9-5=4. If added together, the two differences would cancel each other out since -4+4=0. This is exactly the same score as would be found if both observations were 5, exactly the same as the mean. As a result, these differences alone do not provide reliable insights into the dispersion. Fortunately, this problem can be solved by either taking the absolute value of the difference or squaring the difference. Both transformations result in only non-negative numbers, and therefore values from either side of the mean do not cancel each other out.
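The cancellation problem is easy to see in a few lines of Python. The sketch uses two hypothetical ratings, 1 and 9, and computes each difference as the observation minus the mean, following the definition at the start of the paragraph above.

```python
ratings = [1, 9]  # two hypothetical ratings on the 1-10 scale
mean_rating = sum(ratings) / len(ratings)  # 5.0

deviations = [y - mean_rating for y in ratings]  # [-4.0, 4.0]
absolute = [abs(d) for d in deviations]          # [4.0, 4.0]
squared = [d ** 2 for d in deviations]           # [16.0, 16.0]

print(sum(deviations))  # 0.0: the raw differences cancel out
print(sum(absolute))    # 8.0: absolute values do not cancel
print(sum(squared))     # 32.0: squared differences do not cancel
```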

The mean deviation is the average of the absolute values of the distance of each observation from the sample mean. The more similar the observations are to the mean, the lower the mean deviation, reflecting the smaller distances between each observation and the mean. The more extreme outliers like the basketball player are present, the higher the mean deviation: the more future basketball players making millions of dollars there are in the seminar, the greater the dispersion and the higher the mean deviation.

The variance of a distribution is similar to the mean deviation, except that to calculate the variance, researchers square each difference from the mean rather than taking its absolute value. To calculate the variance, subtract the mean from each value and square the differences. Then add the squared differences (or squared deviations) together and divide by the number of observations.
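The two calculations can be sketched in Python as follows. The ratings are hypothetical and the function names are illustrative. The variance here divides by the number of observations, as described above, which matches `statistics.pvariance`; many statistical packages instead divide by n - 1 when estimating the variance from a sample (as `statistics.variance` does), which gives a slightly larger value.

```python
from statistics import pvariance, variance as sample_variance

def mean_deviation(values):
    """Average absolute distance of each observation from the mean."""
    m = sum(values) / len(values)
    return sum(abs(y - m) for y in values) / len(values)

def variance(values):
    """Average squared distance of each observation from the mean."""
    m = sum(values) / len(values)
    return sum((y - m) ** 2 for y in values) / len(values)

ratings = [1, 3, 5, 5, 7, 9]  # hypothetical ratings with a mean of 5

print(mean_deviation(ratings))        # 2.0
print(round(variance(ratings), 2))    # 6.67
print(round(pvariance(ratings), 2))   # 6.67: same divide-by-n definition
print(sample_variance(ratings))       # divides by n - 1 instead
```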

The square root of the variance is the standard deviation. Researchers frequently report the standard deviation because it relates to the normal curve on a standardized scale. For variables that are normally distributed, the interval from the mean to one standard deviation above it includes about 34.1% of all of the observations. So, one standard deviation on either side of the mean covers about 68% of the observations. Two standard deviations on either side account for about 95% of the observations. Knowing these properties gives readers an intuitive understanding of the value of the reported standard deviation. If the standard deviation is very high, then readers know that even the two-thirds of observations closest to the mean are spread over a wide interval around it. Small standard deviations indicate that most of the observations have values very close to the mean.
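As a quick check of these properties, the Python sketch below computes a standard deviation for a set of hypothetical ratings and then uses the standard library's `NormalDist` to find the share of a normal distribution lying within one and two standard deviations of the mean. The data and variable names are illustrative.

```python
from statistics import NormalDist, pstdev

ratings = [1, 3, 5, 5, 7, 9]  # hypothetical ratings with a mean of 5
sd = pstdev(ratings)          # square root of the divide-by-n variance
print(round(sd, 2))           # 2.58

# Share of a normal distribution within 1 and 2 standard deviations.
standard_normal = NormalDist(mu=0, sigma=1)
within_one = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_one, 3))  # 0.683
print(round(within_two, 3))  # 0.954
```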

Conclusion

References

<references group=""></references>

Discussion questions

Problems

  1. After a recent late-night shift, five police officers filed their reports for drivers ticketed for traffic violations. One officer ticketed two drivers, two ticketed three drivers, and two ticketed four drivers.
  • What is the median number of tickets given out by officers on that shift? A) 2; B) 3; or C) 4?
  • What is the mean number of tickets given out by officers on that shift? A) 2.8; B) 3.2; or C) 4.0?
  2. In 2010, 60 members of the U.S. forces fighting in Iraq died. According to icasualties.org, the month-by-month breakdown of fatalities is as follows: Jan. 6, Feb. 6, March 7, April 8, May 6, June 8, July 4, August 3, Sept. 7, Oct. 2, Nov. 2, Dec. 1.
  • What is the mean number of monthly fatalities?
  • What is the median number of monthly fatalities?
  • What is the range?
  • What is the mean deviation?
  • What is the variance? The standard deviation?
  3. In response to a question on the 2010 General Social Survey, 603 respondents said they supported making marijuana legal, and 656 respondents said they opposed making marijuana legal. Which answer is the mode: supporting or opposing the legalization of marijuana?
  4. In the table describing people's media consumption habits, why would it be inappropriate to report the mean or median value of the distribution? A) Because they are so different, neither the mean nor the median is seen as a typical response; B) Because the median is exactly the same as the mode, so one only reports the mode; C) Because the variable is nominal and values cannot be ordered.
  5. Identify the measure of central tendency and dispersion appropriate for each of the following:

Variable 1:

{| class="wikitable"
|-
! Past year economic evaluations !! Frequency !! Percent !! Cumulative
|-
| Became worse || 916 || 21.75 || 21.75
|-
| Stayed about the same || 1830 || 43.46 || 65.21
|-
| Became better || 1465 || 34.79 || 100.00
|-
| Total || 4211 || 100.00 ||
|}

NOTE: Data from the 2011 CES (www.ces-eec.org)

Variable 2:

{| class="wikitable"
|-
! Incumbent vote !! Frequency !! Percent !! Cumulative
|-
| Did not vote incumbent || 1656 || 60.33 || 60.33
|-
| Voted incumbent || 1089 || 39.67 || 100.00
|-
| Total || 2745 || 100.00 ||
|}

NOTE: Data from the 2011 CES (www.ces-eec.org)

Glossary