Difference between revisions of "Univariate descriptive statistics"

From OPOSSEM


Revision as of 09:43, 8 July 2011


Objectives

  • Goal: measuring and describing variables
  • Distributions: frequencies, rates and ratios, percentages and percentiles
  • Measures of central tendency: mode, mean, and median
  • Measures of dispersion: range, inter-quartile range, variance, and standard deviation

Introduction

A recent survey of Israeli voters, the 2009 Israel National Election Study <a href="http://www.ines.tau.ac.il/2009.html">[n]</a>, asked people to give their opinion of Prime Minister Benjamin Netanyahu. Respondents rated Netanyahu on a scale running from 1 (Hate) to 10 (Love). Imagine someone asked you to describe Netanyahu's popularity. Looking at the data, you could simply report how many people chose each rating on the scale: 171 people rated Netanyahu a "1", the lowest possible rating, another 87 rated him a "2", 75 rated him a "3", and so on. There are three problems with this approach. First, it would take a very long time to complete the description. Second, other people may have a hard time gauging from such a long list whether or not he is popular. Third, it would be very difficult to compare Netanyahu's popularity to that of others, whether an opponent whose popularity was measured on the same scale or another politician altogether. This chapter considers some common ways to describe data like this measure of Netanyahu's popularity so that each of these three problems is addressed by a concise, clear description that facilitates comparisons to other results. These methods are called univariate or descriptive statistics.

Univariate statistics are tools used to describe a single variable. Researchers use these tools whenever they want to describe their results after the data has been collected. As a result, these tools are often called descriptive statistics. The goal of descriptive statistics is to provide relevant information about a variable in a concise, easy-to-understand manner by referencing standard ways of summarizing the variation. Using standard ways of summarizing the variable enables others to readily understand your results, and facilitates comparisons to other findings.

Distributions

Frequencies

The most basic way of presenting data is to report all of the individual observations, often in a table or graph. Rather than list each observation, it is best to organize the observations by response category. When the observations are organized this way, the number of observations in each category is called a frequency. Frequencies tell us how many times a category appears in the sample. In the example above, 171 people rated Netanyahu a "1", so the frequency of the rating "1" is 171. The sum of the frequencies across all of the categories is the total. A table of all of the responses, organized by category, together with the total is called a frequency table. Sometimes the "cumulative frequency", the running sum of the frequencies, is also displayed. Commonly used for categorical or nominal variables, these tables allow people to quickly see the frequency distribution of a variable.

Example: Frequency Table with Percentages and Cumulative Percentages

The Pew Global Attitudes survey is a large survey of over 26,000 people in 25 countries and territories. The table below displays how survey respondents answered a question about their primary source of news about national and international affairs. The types of media (television, newspapers, internet...) are the response categories. Since this is a nominal variable, the order of the types of media does not matter. The frequencies are listed in the first column of numbers, next to the media type. So, for example, the frequency of getting most news about national and international affairs from the internet is 1,982. The total frequency is at the bottom of the column.


                       Frequency   Percent   Cum. Percent
Television                18,919      71.8           71.8
Newspapers                 2,488       9.4           81.3
Radio                      2,344       8.9           90.2
Magazines                    167       0.6           90.8
Internet                   1,982       7.5           98.3
Other (Volunteered)          254       1.0           99.3
Don't know                   102       0.4           99.7
Refused                       85       0.3          100.0
Total                     26,341     100.0
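
The percent and cumulative percent columns can be recomputed from the frequencies alone. The following is a minimal Python sketch, not part of the original survey materials; the function name `frequency_table` is an illustrative choice, and the only data used are the frequencies from the table above.

```python
from itertools import accumulate

# Frequencies copied from the Pew Global Attitudes table above.
frequencies = {
    "Television": 18919,
    "Newspapers": 2488,
    "Radio": 2344,
    "Magazines": 167,
    "Internet": 1982,
    "Other (Volunteered)": 254,
    "Don't know": 102,
    "Refused": 85,
}


def frequency_table(freqs):
    """Return (rows, total); each row is (category, frequency, percent, cum. percent)."""
    total = sum(freqs.values())
    running = accumulate(freqs.values())  # cumulative frequencies, row by row
    rows = []
    for (category, count), cum_count in zip(freqs.items(), running):
        percent = round(100 * count / total, 1)
        cum_percent = round(100 * cum_count / total, 1)
        rows.append((category, count, percent, cum_percent))
    return rows, total


rows, total = frequency_table(frequencies)
for category, count, percent, cum_percent in rows:
    print(f"{category:<22}{count:>8,}{percent:>9.1f}{cum_percent:>9.1f}")
print(f"{'Total':<22}{total:>8,}{100.0:>9.1f}")
```

Note that each cumulative percentage is computed from the cumulative frequency rather than by adding rounded percentages, which is why the newspapers row shows 81.3 rather than 71.8 + 9.4 = 81.2.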

Percentages

Frequencies can often be cumbersome or misleading. In the example above, 1,982 people told the survey researchers that they get their national and international news primarily from the internet. That number is large, but it is not very large compared to the number of people, 18,919, who get their news primarily from television. To facilitate comparisons, frequencies can be expressed as percentages, rates or ratios. These measures use a standard denominator that is the same across time, places, people or data sets.

A common way to transform frequencies to make them easier to compare is to convert them to percentages. Percentages are calculated by dividing the frequency by the total and multiplying the result by 100. Since the total divided by itself is always equal to 1, or 100%, every percentage is expressed on the same base and percentages can be readily compared.

In the news consumption example presented above, the number of people who get their news primarily from the internet (1,982) divided by the total number of respondents (26,341) is 1,982/26,341 = 0.075, or 7.5%. With percentages, you can easily compare the number of people who get their news from the internet to the number of people who get their news from television: 7.5% get their news primarily from the internet, compared to 71.8% who get their news primarily from television. You can also readily compare percentages across groups. For example, this survey was conducted in twenty-five countries and territories. Using percentages, you can observe whether Americans are more likely to get their news from the internet than Mexicans or Palestinians.

The table presents the percentages in the middle column. In the right column, the table presents cumulative percentages: the percentage of the total observations found in that row plus all of the rows above it. So, in the example above, note that 81.3% of all respondents get their news primarily from either television or newspapers.

Ratios

Another way of transforming frequencies to make them easier to communicate is to convert frequencies to ratios or rates. Ratios express two frequencies as a fraction, allowing one to describe one frequency relative to the other. The ratio of respondents who primarily get their news from the internet to those who primarily get their news from television is 1,982/18,919. Typically, these numbers are simplified to rounded integers. First, divide both the numerator and the denominator by the numerator (1,982) to get 1/9.5. The denominator can be rounded to 10. Report the ratio of people who get their news primarily from the internet to people who get their news primarily from television using a colon instead of a slash, as 1:10. Ratios can be used to compare categorical or continuous variables. When comparing variables, it is important to always choose the same denominator. So, using television as the common denominator, we can compare the internet news-consumers (1:10) to those who get their news from magazines (1:113).
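
The simplification described above can be sketched in a few lines of Python. This is only an illustration; the helper name `ratio_to_one` is hypothetical, and the frequencies are taken from the news-source table.

```python
def ratio_to_one(numerator, denominator):
    """Divide both terms by the numerator and round, giving the x in 1:x."""
    return round(denominator / numerator)


# Frequencies from the news-source frequency table.
television = 18919
internet = 1982
magazines = 167

# Television is the common denominator for both ratios.
print(f"Internet to television:  1:{ratio_to_one(internet, television)}")
print(f"Magazines to television: 1:{ratio_to_one(magazines, television)}")
```

Because both ratios share television as the denominator, the two results can be compared directly.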

Example: GDP per capita

The most common measure of a country's wealth is <a href="http://en.wikipedia.org/wiki/Gross_domestic_product" class="extiw" title="wikipedia:Gross domestic product">gross domestic product</a> (or GDP). It is typically reported as a ratio, using the country's population as the denominator. For example, you can see how over 200 countries are ranked by GDP per capita in the CIA World Factbook here: <a href="http://www2.fbi.gov/ucr/cius2009/data/table_01.html">[n]</a>

Rates

Rates are similar to ratios except that the denominator is a standardized reference unit, such as 100, 1,000 or 100,000 people, or a unit of time such as an hour. In politics and public policy, rates are frequently used to describe aggregate, group characteristics in a way that allows groups to be readily compared even if they are of vastly different sizes. For instance, crime statistics are typically expressed as a rate with a base of 100,000 people. More crimes occur in more populated areas, but that does not necessarily mean that crime is any more common, or that one is any more likely to be a victim of a crime, in a well-populated area. Using a rate lets us imagine that the small town and the big city had the same number of people, so that we can more accurately gauge which area is safer. See <a href="http://www2.fbi.gov/ucr/cius2009/data/table_02.html">[n]</a>. Similarly, many descriptive statistics for countries, like the number of people who can read, the number of deaths and the number of births, are expressed as rates. Rates are only used to describe continuous variables.
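
As a rough illustration of why rates make places of different sizes comparable, the sketch below converts raw counts to a rate per 100,000 people. The crime counts and populations are hypothetical numbers invented for this example, not figures from the FBI tables, and the function name `rate_per` is an illustrative choice.

```python
def rate_per(events, population, base=100_000):
    """Express a raw count as a rate per `base` people."""
    return events * base / population


# Hypothetical figures, invented for illustration only.
small_town = rate_per(events=40, population=20_000)
big_city = rate_per(events=1_500, population=1_000_000)

print(f"Small town: {small_town:.0f} crimes per 100,000 people")
print(f"Big city:   {big_city:.0f} crimes per 100,000 people")
```

In this made-up case the big city records far more crimes in total, yet its rate is lower than the small town's, which is exactly the kind of comparison a raw frequency would hide.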

Percentiles

After ordering the observations from smallest to largest, percentiles divide all of the observations into 100 equal-sized groups. Percentiles are commonly used when the rank of an observation is important, like students' scores on a standardized test or a student's class standing. Determining whose SAT score or grade point average is in the top 10% requires identifying the 90th percentile. Everyone who takes the SAT is told, along with their score, how their performance compares to that of students in their state and across the USA in the previous year <a href="http://sat.collegeboard.org/scores/understanding-sat-scores">[n]</a>. If a student's score is in the 65th percentile, then that student knows that she did better than 65% of other students. Likewise, managers in both the public and private sector seeking to award performance bonuses or promotions often want to identify which employees are among the top 10% or 25% of all employees.

Some percentiles are used more frequently than others. Quartiles and deciles are the most common percentiles reported. Quartiles divide the observations into four equal groups, separated by the 25th, 50th and 75th percentiles. Deciles divide the observations into ten equal groups. The 50th percentile, the middle observation in a sample, is also called the median, which we will discuss in the following section since it is an important measure of central tendency. The interquartile range (or IQR) measures the difference between the 25th and 75th percentiles.
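
Quartiles and the interquartile range can be computed with Python's standard `statistics` module. The test scores below are hypothetical values invented for illustration. Note also that software packages use slightly different interpolation rules for percentiles, so other tools may report slightly different cut points for the same data; this sketch relies on the default method of `statistics.quantiles`.

```python
import statistics

# Hypothetical test scores, invented for illustration only.
scores = [52, 61, 64, 68, 70, 73, 75, 79, 84, 88, 93]

# n=4 returns the three cut points that split the data into four groups:
# the 25th, 50th and 75th percentiles.
quartiles = statistics.quantiles(scores, n=4)
q1, q2, q3 = quartiles

# The interquartile range is the distance between the 25th and 75th percentiles.
iqr = q3 - q1

print(f"25th percentile: {q1}")
print(f"50th percentile (median): {q2}")
print(f"75th percentile: {q3}")
print(f"Interquartile range: {iqr}")
```

Passing `n=10` instead of `n=4` would return the nine cut points that define the deciles.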
