# Introduction

In the previous chapter we discussed probability theory, which we expressed in terms of a variable $X$. We defined $X$ as a set of realizations of some process, which in turn is governed by rules of probability regarding potential outcomes in the sample space.

The variables we have been discussing are what are called random variables, which means that they have a probability distribution. As we noted before, broadly speaking, there are two kinds of random variables: discrete and continuous.

Discrete variables can take on any one of several distinct, mutually-exclusive values.

• Congressperson's ideology score {0,1,2,3...,100}
• An individual's political affiliation (Democrat, Republican, Independent)
• Whether or not a country is a member of the European Union (true/false)

A continuous variable can take on any value in its range.

• Individual income
• National population

This chapter focuses on a family of continuous distributions that are the most widely used in statistical inference, and are found in a wide variety of contexts, both applied and theoretical. The Normal distribution is the well-known "bell-shaped curve" that most students first encounter in the artificial context of academic testing but, due to a powerful result called the Central Limit Theorem, it also occurs in a wide variety of uncontrolled situations where the value of a random variable is determined by the average effect of a large number of other random variables with any combination of distributions. The $\chi^{2}$, $t$ and $F$ distributions can be derived from various functions of normally-distributed variables, and are used extensively in statistical inference and applied statistics, so it's useful to understand them in some depth.

## Need to do

Philip Schrodt 06:57, 13 July 2011 (PDT)

• Probably need to get most of the probability chapter---which at the moment hasn't been started---written before this one. In particular, will the pdf and cdf be defined there or here?
• Add some of the discrete distributions, particularly the binomial
• Do we add---or link to on another page---the derivation of the mean and standard errors for these: that code is available in CCL on an assortment of places on the web

# The Normal Distribution

We are all used to seeing normal distributions described, and to hearing that something is "normally distributed." We know that a normal distribution is "bell-shaped," and symmetrical, and probably that it has some mean and some standard deviation.

Formally, if $X$ is a normally distributed variate with mean $\mu$ and variance $\sigma^{2}$, then:

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \text{exp} \left( - \frac{(x - \mu)^{2}}{2 \sigma^{2}} \right)$.

We denote this $X \sim N(\mu,\sigma^{2})$, and say $X$ is distributed normally with mean mu and variance sigma squared. The symbol $\phi$ is often used as a shorthand to represent the normal density above:

$X \sim \phi_{\mu, \sigma^{2}}$.

The corresponding normal CDF -- which gives the probability of a normal random variate taking on a value less than or equal to some specified number -- is (as always) the integral of the density from $-\infty$ to $x$. This has no simple closed-form solution, so we typically just write:

$F(x) \equiv \Phi_{\mu, \sigma^{2}}(x) = \int_{-\infty}^{x} \phi_{\mu, \sigma^{2}}(t) \, dt.$

The figure below presents several normal densities with different values of $\mu$ and $\sigma^{2}$.
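Since $\Phi$ has no closed form, numerical routines typically evaluate it through the error function, using the identity $\Phi_{\mu, \sigma^{2}}(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$. A minimal sketch using only the Python standard library:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF Phi(x), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
```

For example, `normal_cdf(1.96)` is approximately 0.975, the familiar two-tailed 5% critical value.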

## Bases for the Normal Distribution

The most common justification for the normal distribution has its roots in the 'central limit theorem'. Consider $i = 1, 2, \ldots, N$ independent, real-valued random variates $X_{i}$, each with finite mean $\mu_{i}$ and variance $\sigma^{2}_{i} > 0$. If we consider a new variable $X$ defined as the sum of these variables:

$X = \sum_{i=1}^{N} X_{i}$

then we know that

$\text{E}(X) = \sum_{i=1}^{N} \mu_{i}$

and

$\text{Var}(X) = \sum_{i=1}^{N} \sigma^{2}_{i}$

The central limit theorem states that, under quite general conditions,

$\frac{X - \text{E}(X)}{\sqrt{\text{Var}(X)}} = \frac{\sum_{i=1}^{N} X_{i} - \sum_{i=1}^{N} \mu_{i}}{\sqrt{\sum_{i=1}^{N} \sigma^{2}_{i}}} \overset{D}{\rightarrow} N(0,1) \quad \text{as } N \rightarrow \infty$

where the notation $\overset{D}{\rightarrow}$ indicates convergence in distribution. That is, as $N$ gets sufficiently large, the distribution of the standardized sum of $N$ independent random variates with finite means and variances converges to a standard normal distribution. As such, we often think of a normal distribution as being appropriate when the observed variable $X$ can take on a range of continuous values, and when the observed value of $X$ can be thought of as the sum (or average) of a large number of relatively small, independent shocks or perturbations.
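A quick simulation illustrates the theorem. Here we sum $N$ Uniform(0,1) draws (an arbitrary choice; any distribution with finite mean and variance would do), standardize each sum by its theoretical mean and standard deviation, and check that the results behave like $N(0,1)$ draws. The seed and sample sizes are made up for illustration:

```python
import random
import statistics

random.seed(42)

N = 200                      # variates in each sum
M = 5000                     # number of simulated sums
mu_i, var_i = 0.5, 1 / 12    # mean and variance of a Uniform(0,1) draw

# Standardized sums: (sum - N*mu) / sqrt(N*var)
sums = [
    (sum(random.random() for _ in range(N)) - N * mu_i) / (N * var_i) ** 0.5
    for _ in range(M)
]

# The sample mean and standard deviation should be near 0 and 1
print(statistics.mean(sums), statistics.stdev(sums))
```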

## Properties of the Normal Distribution

• A normal variate $X$ has support on $\mathbb{R}$.
• The normal is a two-parameter distribution, where $\mu \in (-\infty, \infty)$ and $\sigma^{2} \in (0, \infty)$.
• The normal distribution is always symmetrical ($M_{3} = 0$) and mesokurtic.
• The normal distribution is preserved under a linear transformation. That is, if $X \sim N(\mu,\sigma^{2})$, then $aX + b \sim N(a\mu + b, a^{2} \sigma^{2})$. (Why? Recall our earlier results on $\mu$ and $\sigma^{2}$.)
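The linear-transformation property in the last bullet point is easy to check by simulation; the parameter values and seed below are arbitrary:

```python
import random
import statistics

random.seed(1)
mu, sigma = 2.0, 3.0   # arbitrary choices for illustration
a, b = -0.5, 4.0

x = [random.gauss(mu, sigma) for _ in range(100_000)]
y = [a * xi + b for xi in x]   # should behave like N(a*mu + b, a^2 * sigma^2)

# Theoretical values: mean = -0.5*2 + 4 = 3.0, variance = 0.25*9 = 2.25
print(statistics.mean(y), statistics.variance(y))
```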

## The Standard Normal Distribution

One linear transformation is especially useful:

\begin{align} b & = \frac{-\mu}{\sigma} \\ a & = \frac{1}{\sigma}  \end{align}.

This yields:

\begin{align} aX + b & \sim N(a\mu+b, a^{2} \sigma^{2}) \\ & \sim N(0,1)  \end{align}

This is the standard normal distribution. We often denote its density $\phi(\cdot)$, and say that "$X$ is distributed as standard normal." We can also get this by transforming ("standardizing") the normal variate $X$...

• If $X \sim N(\mu,\sigma^{2})$, then $Z = \frac{(X - \mu)}{\sigma} \sim N(0,1)$.
• The density function then reduces to:

$f(z) \equiv \phi(z) = \frac{1}{\sqrt{2\pi}} \text{exp} \left[ - \frac{z^{2}}{2} \right]$

Similarly, we often write the CDF for the standard normal as $\Phi(\cdot)$.
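Standardizing means a single function (or, historically, a single table) for $\Phi$ suffices for probability calculations about any normal variable. A sketch, with hypothetical values of $\mu$, $\sigma$, and $x$:

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical example: X ~ N(6, 2^2). To find P(X <= 10),
# standardize and then use Phi alone.
mu, sigma, x = 6.0, 2.0, 10.0
z = (x - mu) / sigma          # z = 2.0
p = std_normal_cdf(z)         # P(Z <= 2) equals P(X <= 10)
```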

## Why do we care about the normal distribution?

The normal distribution's importance lies in its relationship to the central limit theorem. As we'll discuss at more length later, the central limit theorem means that as one's sample size increases, the distribution of sample means (or other estimates) approaches a normal distribution.

## Additional points needed on the normal

Philip Schrodt 07:00, 13 July 2011 (PDT)

• More extended discussion of the CLT, and a note that if we are dealing with a data generating process where the "error" is the average (or cumulative) effect of a large number of random variables with a variety of distributions, the CLT tells us that the net effect will be normally distributed. This, in turn, explains why linear models that assume Normally distributed error---regression and ANOVA---have proven to be so robust in practice
• Link to a number of examples of normally distributed data...should be easy to find these on the web. E.g. the classical height. Maybe SAT scores, though these are artificially normal
• ref to the wikipedia article; there is also a nice graphic to snag from there---introductory sidebar---which shows the standard normal
• sidebar on the log-normal?
• something about the bivariate normal and some nice graphics of this?
• sidebar on the issue of fat tails and how these destroyed the economy in 2007?---there is a fairly readable Wired article on this: http://www.wired.com/techbiz/it/magazine/17-03/wp_quant

# The $\chi^{2}$ Distribution

The chi-square ($\chi^{2}$) distribution is a one-parameter distribution defined only over nonnegative values. If $Z \sim N(0,1)$, then $Z^{2} \sim \chi^{2}_{1}$. That is, the square of a $N(0,1)$ variable is chi-squared with one degree of freedom; this also explains why (e.g.) a chi-squared variate is defined only for nonnegative real numbers. If $W_{1},W_{2},...,W_{k}$ are all independent $\chi^{2}_{1}$ variables, then $\sum_{i=1}^{k}W_{i} \sim \chi^{2}_{k}$ (the sum of $k$ independent $\chi^{2}_{1}$ variables is chi-squared with $k$ degrees of freedom). By extension, the sum of the squares of $k$ independent $N(0,1)$ variables is also $\sim \chi^{2}_{k}$.

The $\chi^{2}$ distribution is positively skewed, with $\text{E}(W) = k$ and $\text{Var}(W) = 2k.$
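These moments, and the construction of a $\chi^{2}_{k}$ variate as a sum of $k$ squared standard normals, can be checked by simulation (degrees of freedom, sample size, and seed below are arbitrary):

```python
import random
import statistics

random.seed(7)
k = 5   # degrees of freedom

# Each chi-squared(k) draw is a sum of k squared N(0,1) draws
draws = [
    sum(random.gauss(0, 1) ** 2 for _ in range(k))
    for _ in range(50_000)
]

# Sample moments should be near E(W) = k and Var(W) = 2k
print(statistics.mean(draws), statistics.variance(draws))
```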

The figure below presents five $\chi^{2}$ densities with different values of $k$.

Need to define degrees of freedom here


### Characteristics of the $\chi^{2}$ Distribution

If $W_{j}$ and $W_{k}$ are independent $\chi^{2}_{j}$ and $\chi^{2}_{k}$ variables, respectively, then $W_{j} + W_{k}$ is $\sim \chi^{2}_{j+k}$; this result can be extended to any number of independent chi-squared variables. This in turn implies that the sum of the squares of $k$ independent $N(0,1)$ variables is also $\sim \chi^{2}_{k}$.

## Derivation of the $\chi^{2}$ from Gamma functions

Gill discusses the $\chi^{2}$ distribution as a special case of the gamma PDF. That's fine, but there's actually a much more intuitive way of thinking about it, and one that comports more closely with how it is (most commonly) used in statistics. Formally, a variable $W$ that is distributed as $\chi^{2}$ with $k$ degrees of freedom has a density of:

\begin{align} f(w) &= \frac{1}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} w^{\frac{k}{2}-1} \text{exp} \left[ \frac{-w}{2} \right] \\  &= \frac{w^{\frac{k-2}{2}} \exp(\frac{-w}{2})}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})}  \end{align}

where $\Gamma(k) = \int_{0}^{\infty} t^{k - 1} \text{exp}(-t) \, dt$ is the gamma integral (see, e.g., Gill, p. 222). As with the normal distribution, the CDF has no simple closed-form expression. The corresponding CDF is

$F(w)=\frac{\gamma(k/2,w/2)}{\Gamma(k/2)}$

where $\Gamma(\cdot)$ is as before and $\gamma(\cdot)$ is the [lower incomplete gamma function](http://en.wikipedia.org/wiki/Incomplete_Gamma_function). We write this as $W \sim \chi^{2}_{k}$ (one also occasionally sees $W \sim \chi^{2}(k)$, with the degrees of freedom in parentheses), and say $W$ is distributed as chi-squared with $k$ degrees of freedom.
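For computation, the CDF above can be evaluated through the regularized lower incomplete gamma function $P(s, x) = \gamma(s, x)/\Gamma(s)$. A rough sketch using a standard power-series expansion (adequate for small-to-moderate $w$; production code would use a library routine instead):

```python
import math

def reg_lower_gamma(s, x, terms=200):
    """Regularized lower incomplete gamma P(s, x) = gamma(s, x) / Gamma(s),
    via the power series gamma(s, x) = x^s e^{-x} sum_n x^n / (s(s+1)...(s+n))."""
    if x <= 0:
        return 0.0
    total, term = 0.0, 1.0 / s
    for n in range(1, terms):
        total += term
        term *= x / (s + n)
    # Prefactor x^s e^{-x} / Gamma(s), computed on the log scale for stability
    return total * math.exp(-x + s * math.log(x) - math.lgamma(s))

def chi2_cdf(w, k):
    """CDF of a chi-squared variate with k degrees of freedom."""
    return reg_lower_gamma(k / 2, w / 2)
```

As a sanity check, $\chi^{2}_{2}$ is an exponential distribution with mean 2, so `chi2_cdf(w, 2)` should equal $1 - e^{-w/2}$.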

## Additional points needed on the chi-square

Philip Schrodt 07:00, 13 July 2011 (PDT)

• Probably want to mention the use in contingency tables here, since the connection isn't obvious.
• Agresti and Finlay state this was introduced by Pearson in 1900, apparently in the context of contingency tables---confirm this, any sort of story here?
• As df becomes very large, the chi-square approximates the normal; this is an asymptotic result and, for practical purposes, the normal approximation can be used if df > 50
• Discuss more about the assumption of statistical independence?
• Chi-square as the test for comparing whether an observed frequency fits a known distribution

# Student's $t$ Distribution

For a variable $X$ which is distributed as $t$ with $k$ degrees of freedom, the PDF is:

$f(x) = \frac{\Gamma(\frac{k+1}{2})} {\sqrt{k\pi}\,\Gamma(\frac{k}{2})} \left(1+\frac{x^2}{k} \right)^{-(\frac{k+1}{2})}\!$

where once again $\Gamma(\cdot)$ is the gamma integral. We write $X \sim t_{k}$, and say $X$ is distributed as Student's $t$ with $k$ degrees of freedom. The figure below presents $t$ densities for five different values of $k$, along with a standard normal density for comparison.
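The density above is straightforward to compute with the standard library's gamma function; a sketch for moderate degrees of freedom (`math.gamma` overflows for very large $k$, where the normal approximation applies anyway):

```python
import math

def t_pdf(x, k):
    """Density of Student's t with k degrees of freedom."""
    coef = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return coef * (1 + x ** 2 / k) ** (-(k + 1) / 2)
```

For $k = 1$ this reduces to the Cauchy density, $f(0) = 1/\pi$, and for large $k$ the value at zero approaches the standard normal's $1/\sqrt{2\pi}$.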

The t-distribution is sometimes known as "Student's t", after a then-anonymous student of the statistician Karl Pearson. The story, from Wikipedia:

The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness' industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of stout. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was known to fellow statisticians.

Note a few things about $t$:

• The mean/mode/median of a $t$-distributed variate is zero, and its variance is $\frac{k}{k - 2}$ (for $k > 2$).
• $t$ looks like a standard normal distribution (symmetrical, bell-shaped) but has thicker tails (read: higher probabilities of draws being relatively far from the mean/mode). However...
• ...as $k$ gets larger, $t$ converges to a standard normal distribution; at or above $k = 30$ or so, the two are effectively indistinguishable.

The importance of the $t$ distribution lies in its relationship to the normal and chi-square distributions. In particular, if $Z \sim N(0,1)$ and $W \sim \chi^{2}_{k}$, and $Z$ and $W$ are independent, then

$\frac{Z}{\sqrt{W/k}} \sim t_{k}$

That is, the ratio of a $N(0,1)$ variable to the square root of an independent chi-squared variable divided by its degrees of freedom follows a $t$ distribution, with d.f.\ equal to the number of d.f.\ of the chi-squared variable. Of course, this also means that $\frac{Z^{2}}{W/k} \sim t^{2}_{k}.$

Since we know that $Z^{2} \sim \chi^{2}_{1}$, this means that the square of a $t_{k}$ variate can also be derived as the ratio of a $\chi^{2}_{1}$ variate to an independent $\chi^{2}_{k}$ variate divided by $k$.

## Additional points needed on the t distribution

Philip Schrodt 07:00, 13 July 2011 (PDT)

• May want to note that it is ubiquitous in the inference on regression coefficients
• Might want to note somewhere---this might go earlier in the discussion of df---that in most social science research (e.g. survey research and time-series cross-sections), the sample sizes are well above the point where the t is asymptotically normal. The t is actually important only in very small samples, though these can be found in situations such as small subsamples in survey research (are Hispanic ferret owners in Wyoming more likely to support the Tea Party?), situations where the population itself is small (e.g. state membership in the EU, Latin America, or ECOWAS), and experiments with a small number of subjects or cases (this is commonly found in medical research, for example, and it also motivated Gosset's original development of the test, albeit with yeast and hops---we presume---rather than experimental subjects). In these instances, using the conventional normal approximation to the t---in particular, the rule-of-thumb of looking for coefficient estimates at least twice the size of their standard errors to establish two-tailed 0.05 significance---will be misleading.

# The $F$ Distribution

An $F$ distribution is the ratio of two chi-squared variates, each divided by its degrees of freedom. If $W_{1}$ and $W_{2}$ are independent and $\sim \chi^{2}_{k}$ and $\chi^{2}_{\ell}$, respectively, then $\frac{W_{1}/k}{W_{2}/\ell} \sim F_{k,\ell}$

That is, the ratio of two df-scaled chi-squared variables is distributed as $F$ with d.f.\ equal to the number of d.f.\ in the numerator and denominator variables, respectively.

Formally, if $X$ is distributed as $F$ with $k$ and $\ell$ degrees of freedom, then the PDF of $X$ is:

$f(x) = \frac{\left(\frac{k\,x}{k\,x + \ell}\right)^{k/2} \left(1-\frac{k\,x}{k\,x + \ell}\right)^{\ell/2}}{x\; \mathrm{B}(k/2, \ell/2)}$

where $\mathrm{B}(\cdot)$ is the beta function; that is, $\mathrm{B}(x,y) = \int_0^1t^{x-1}(1-t)^{y-1}\,dt$. We write $X \sim F_{k,\ell}$, and say $X$ is distributed as $F$ with $k$ and $\ell$ degrees of freedom.

The $F$ is a two-parameter distribution, with degrees of freedom parameters (say $k$ and $\ell$), both of which are limited to the positive integers. An $F$ variate $X$ takes values only on the non-negative real line; for $\ell > 2$ it has expected value $\text{E}(X) = \frac{\ell}{\ell - 2},$ which implies that the mean of an $F$-distributed variable converges on 1.0 as $\ell \rightarrow \infty$. Likewise, for $\ell > 4$ it has variance $\text{Var}(X) = \frac{2\,\ell^2\,(k+\ell-2)}{k (\ell-2)^2 (\ell-4)},$ which bears no simple relationship to either $k$ or $\ell$.
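The construction of $F$ as a ratio of df-scaled chi-squared variates gives a quick simulation check of the mean (degrees of freedom, seed, and sample size below are arbitrary):

```python
import random
import statistics

random.seed(3)
k, ell = 4, 10   # arbitrary degrees of freedom

def chi2_draw(df):
    """One chi-squared(df) draw, as a sum of squared N(0,1) draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# F(k, ell) draws as ratios of df-scaled chi-squared variates
f_draws = [(chi2_draw(k) / k) / (chi2_draw(ell) / ell) for _ in range(50_000)]

# Theoretical mean is ell / (ell - 2) = 10 / 8 = 1.25
print(statistics.mean(f_draws))
```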

The $F$ distribution is (generally) positively skewed. Examples of some $F$ densities with different values of $k$ and $\ell$ are presented in the figure below.

If $X \sim F(k, \ell)$, then $\frac{1}{X} \sim F(\ell, k)$ (because $\frac{1}{X} = \frac{W_{2}/\ell}{W_{1}/k}$). In addition, the square of a $t_{k}$-distributed variable is $\sim F(1,k)$ (*why?* -- take the formula for $t$, and square it...).
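The $t^{2}_{k} \sim F(1,k)$ relationship can be verified numerically: by the change-of-variables formula (and the symmetry of $t$), the density of $T^{2}$ is $f_{T}(\sqrt{x})/\sqrt{x}$, which should match the $F(1,k)$ density. A sketch using the PDF formulas from this chapter:

```python
import math

def t_pdf(x, k):
    """Student's t density with k degrees of freedom."""
    return (math.gamma((k + 1) / 2)
            / (math.sqrt(k * math.pi) * math.gamma(k / 2))
            * (1 + x ** 2 / k) ** (-(k + 1) / 2))

def f_pdf(x, k, ell):
    """F(k, ell) density; B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    beta = math.gamma(k / 2) * math.gamma(ell / 2) / math.gamma((k + ell) / 2)
    u = k * x / (k * x + ell)
    return u ** (k / 2) * (1 - u) ** (ell / 2) / (x * beta)

# Density of T^2 for T ~ t_k is f_T(sqrt(x)) / sqrt(x); compare to F(1, k)
k = 7
for x in (0.5, 1.0, 2.5):
    assert abs(t_pdf(math.sqrt(x), k) / math.sqrt(x) - f_pdf(x, 1, k)) < 1e-12
```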

## Additional points needed on the F distribution

Philip Schrodt 10:00, 13 July 2011 (PDT)

• Discovered by Fisher in 1922, hence "F"
• Mention how it will be used for $R^2$ and ANOVA: $F = MS_{\text{between}}/MS_{\text{within}}$
• Square of a $t_k$ statistic is an $F_{1,k}$ statistic

# Summary: Relationships Among Continuous Distributions

The substantive importance of all these distributions will become apparent as we move on to sampling distributions and statistical inference. In the meantime, it is useful to consider the relationships among the four distributions we discussed above.