# Introduction

In the previous chapter we discussed probability theory, which we expressed in terms of a variable $X$. We defined $X$ as a set of realizations of some process, which in turn is governed by rules of probability regarding potential outcomes in the sample space.

### Need to do

Philip Schrodt 06:57, 13 July 2011 (PDT)

• Probably need to get most of the probability chapter---which at the moment hasn't been started---written before this one. In particular, will the pdf and cdf be defined there or here?
• Add some of the discrete distributions, particularly the binomial
• Do we add---or link to on another page---the derivation of the mean and standard errors for these: that code is available in CCL on an assortment of places on the web

The variables we have been discussing are what are called random variables, which means that they have a probability distribution. As we noted before, broadly speaking, there are two kinds of random variables: discrete and continuous.

Discrete variables can take on any one of several distinct, mutually-exclusive values.

• A Congressperson's ideology score {0, 1, 2, ..., 100}
• An individual's political affiliation {Democrat, Republican, Independent}
• Whether or not a country is a member of the European Union (true/false)

A continuous variable can take on any value in its range.

• Individual income
• National population

# The Normal Distribution

We are all used to seeing normal distributions described, and to hearing that something is "normally distributed." We know that a normal distribution is "bell-shaped," and symmetrical, and probably that it has some mean and some standard deviation.

Formally, if $X$ is a normally distributed variate with mean $\mu$ and variance $\sigma^{2}$, then:

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \text{exp} \left( - \frac{(x - \mu)^{2}}{2 \sigma^{2}} \right)$.
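
The density above is straightforward to evaluate directly. Here is a minimal sketch in Python; the function name `normal_pdf` is ours, for illustration only:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x, per the formula above."""
    return (1.0 / math.sqrt(2 * math.pi * sigma2)) * \
        math.exp(-(x - mu) ** 2 / (2 * sigma2))

# The peak of a standard normal is 1 / sqrt(2*pi)
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # → 0.3989
```

Note that the function takes the variance $\sigma^{2}$, not the standard deviation $\sigma$, matching the notation $N(\mu, \sigma^{2})$.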

We denote this $X \sim N(\mu,\sigma^{2})$, and say $X$ is distributed normally with mean $\mu$ and variance $\sigma^{2}$. The symbol $\phi$ is often used as a shorthand to represent the normal density above:

$X \sim \phi_{\mu, \sigma^{2}}$.

The corresponding normal CDF -- which is the probability of a normal random variate taking on a value less than or equal to some specified number $x$ -- is (as always) the integral of the density from $-\infty$ up to $x$. This has no simple closed-form solution, so we typically just write:

$F(x) \equiv \Phi_{\mu, \sigma^{2}}(x) = \int_{-\infty}^{x} \phi_{\mu, \sigma^{2}}(t) \, dt.$
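
Although the integral has no closed form, it can be computed via the error function, since $\Phi_{\mu,\sigma^{2}}(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$. A sketch using Python's standard library (the function name `normal_cdf` is ours):

```python
import math

def normal_cdf(x, mu, sigma2):
    """P(X <= x) for X ~ N(mu, sigma2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * sigma2)))

# By symmetry, half the mass lies below the mean
print(normal_cdf(0.0, 0.0, 1.0))  # → 0.5
```

The familiar critical value 1.96 gives `normal_cdf(1.96, 0, 1)` ≈ 0.975, which is where the "95% within roughly two standard deviations" rule comes from.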

Plotting a number of normal curves with different parameter values shows that $\mu$ shifts the center of the distribution while $\sigma^{2}$ controls its spread.

## Bases for the Normal Distribution

The most common justification for the normal distribution has its roots in the central limit theorem. Consider $i = 1, 2, \ldots, N$ independent, real-valued random variates $X_{i}$, each with finite mean $\mu_{i}$ and variance $\sigma^{2}_{i} > 0$. If we consider a new variable $X$ defined as the sum of these variables:

$X = \sum_{i=1}^{N} X_{i}$

then we know that

$\text{E}(X) = \sum_{i=1}^{N} \mu_{i}$

and

$\text{Var}(X) = \sum_{i=1}^{N} \sigma^{2}_{i}$

The central limit theorem states that:

$\frac{X - \text{E}(X)}{\sqrt{\text{Var}(X)}} \overset{D}{\rightarrow} N(0,1) \text{ as } N \rightarrow \infty$

where the notation $\overset{D}{\rightarrow}$ indicates convergence in distribution. That is, as $N$ gets sufficiently large, the distribution of the (suitably standardized) sum of $N$ independent random variates with finite mean and variance will converge to a normal distribution. As such, we often think of a normal distribution as being appropriate when the observed variable $X$ can take on a range of continuous values, and when the observed value of $X$ can be thought of as the sum of a large number of relatively small, independent shocks or perturbations.
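
The theorem is easy to see in action by simulation. The sketch below (pure Python; the seed and sample sizes are arbitrary choices of ours) sums 12 Uniform(0,1) draws, each with mean 1/2 and variance 1/12, so the sum should behave roughly like $N(6, 1)$:

```python
import random

random.seed(42)

# Sum of N = 12 independent Uniform(0,1) draws: the sum has mean
# N * 1/2 = 6 and variance N * 1/12 = 1, and by the CLT it is roughly N(6, 1).
N, trials = 12, 10_000
sums = [sum(random.random() for _ in range(N)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / (trials - 1)
print(round(mean, 1), round(var, 1))  # ≈ 6.0 1.0, close to the theoretical values
```

A histogram of `sums` would show the familiar bell shape, even though each underlying draw is flat (uniform).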

## Properties of the Normal Distribution

• A normal variate $X$ has support in $\mathbb{R}$.
• The normal is a two-parameter distribution, where $\mu \in (-\infty, \infty)$ and $\sigma^{2} \in (0, \infty)$.
• The normal distribution is always symmetrical ($M_{3} = 0$) and mesokurtic.
• The normal distribution is preserved under a linear transformation. That is, if $X \sim N(\mu,\sigma^{2})$, then $aX + b \sim N(a\mu + b, a^{2} \sigma^{2})$. (Why? Recall our earlier results on $\mu$ and $\sigma^{2}$.)

## The Standard Normal Distribution

One linear transformation is especially useful:

\begin{align} b & = \frac{-\mu}{\sigma} \\ a & = \frac{1}{\sigma} \end{align}

This yields:

\begin{align} aX + b & \sim N(a\mu+b, a^{2} \sigma^{2}) \\ & \sim N(0,1) \end{align}

This is the standard normal distribution. We often denote its density $\phi(\cdot)$, and say that "$X$ is distributed as standard normal." We can also get this by transforming ("standardizing") the normal variate $X$...

• If $X \sim N(\mu,\sigma^{2})$, then $Z = \frac{(X - \mu)}{\sigma} \sim N(0,1)$.
• The density function then reduces to:

$f(z) \equiv \phi(z) = \frac{1}{\sqrt{2\pi}} \text{exp} \left[ - \frac{z^{2}}{2} \right]$

Similarly, we often write the CDF for the standard normal as $\Phi(\cdot)$.
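
Standardizing a batch of data applies the same transformation, $z_{i} = (x_{i} - \bar{x})/s$. A minimal sketch (the function name `standardize` and the example data are ours):

```python
import math

def standardize(xs):
    """Convert data to z-scores: subtract the mean, divide by the standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

# This sample has mean 5 and (population) standard deviation 2
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(round(sum(z) / len(z), 10), round(sum(v * v for v in z) / len(z), 10))  # → 0.0 1.0
```

The standardized values always have mean 0 and variance 1, regardless of the original units, which is what makes tabulated $\Phi(\cdot)$ values usable for any normal variate.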

### Why do we care about the normal distribution?

The normal distribution's importance lies in its relationship to the central limit theorem. As we'll discuss at more length later, the central limit theorem means that as one's sample size increases, the distribution of sample means (or other estimates) approaches a normal distribution.

### Additional points needed on the normal

Philip Schrodt 06:57, 13 July 2011 (PDT)

• More extended discussion of the CLT, and a note that if we are dealing with a data generating process where the "error" is the average (or cumulative) effect of a large number of random variables with a variety of distributions, the CLT tells us that the net effect will be normally distributed. This, in turn, explains why linear models that assume Normally distributed error---regression and ANOVA---have proven to be so robust in practice
• Link to a number of examples of normally distributed data...should be easy to find these on the web
• ref to the wikipedia article

## The $\chi^{2}$ Distribution

The $\chi^{2}$, $t$, and $F$ distributions can all be derived from functions of normally-distributed variables. All three are used extensively in statistical inference and applied statistics, so it's useful to understand them in a bit of depth.

Gill discusses the $\chi^{2}$ distribution as a special case of the gamma PDF. That's fine, but there's actually a much more intuitive way of thinking about it, and one that comports more closely with how it is (most commonly) used in statistics. Formally, a variable $W$ that is distributed as $\chi^{2}$ with $k$ degrees of freedom has a density of:

$f(w) = \frac{w^{\frac{k-2}{2}} \exp(\frac{-w}{2})}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})}$

where $\Gamma(k) = \int_{0}^{\infty} t^{k - 1} \text{exp}(-t) \, dt$ is the gamma integral (see, e.g., Gill, p. 222). As with the normal distribution, the CDF has no simple closed-form solution, and so is written in terms of special functions:

$F(w)=\frac{\gamma(k/2,w/2)}{\Gamma(k/2)}$

where $\Gamma(\cdot)$ is as before and $\gamma(\cdot)$ is the [lower incomplete gamma function](http://en.wikipedia.org/wiki/Incomplete_Gamma_function). We write this as $W \sim \chi^{2}_{k}$ (one also occasionally sees $W \sim \chi^{2}(k)$, with the degrees of freedom in parentheses), and say $W$ is distributed as chi-squared with $k$ degrees of freedom.

The chi-square distribution is a one-parameter distribution defined only on the nonnegative real line, $W \in [0, \infty)$. It is positively skewed, with $\text{E}(W) = k$ and $\text{Var}(W) = 2k$.

Figure \ref{ChiSquares} presents five $\chi^{2}$ densities with different values of $k$.

### Characteristics of the $\chi^{2}$ Distribution

Most importantly, one needs to remember two key things about the chi-square distribution:

• If $Z \sim N(0,1)$, then $Z^{2} \sim \chi^{2}_{1}$. That is, *the square of a $N(0,1)$ variable is chi-squared with one degree of freedom*.
• If $W_{j} \sim \chi^{2}_{j}$ and $W_{k} \sim \chi^{2}_{k}$ are independent, then $W_{j} + W_{k} \sim \chi^{2}_{j+k}$; this result extends to any number of independent chi-squared variables.

The first of these is key, since it points out that the square of a standard normal variate is a one-degree-of-freedom chi-square variable. This explains why (e.g.) a chi-squared variate only has support on the nonnegative real numbers. The second point is also tremendously useful to know, in that it has a number of valuable corollaries. For example, it implies that

• if $W_{1},W_{2},...W_{k}$ are all independent $\chi^{2}_{1}$ variables, then $\sum_{i=1}^{k}W_{i} \sim \chi^{2}_{k}$. (The sum of $k$ independent chi-squared variables is chi-squared with $k$ degrees of freedom).
• By extension, the sum of the squares of $k$ independent $N(0,1)$ variables is also $\sim \chi^{2}_{k}$.

All this means that, if we know a variable to be normally distributed, we can consider its squared, standardized values to be $\chi^{2}_{1}$, and the sum of $k$ such values to be $\chi^{2}_{k}$.
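
These facts are easy to check by simulation: build chi-squared draws as sums of squared standard normals and compare the sample moments to $\text{E}(W) = k$ and $\text{Var}(W) = 2k$. A sketch (pure Python; the seed and sample sizes are arbitrary choices of ours):

```python
import random

random.seed(7)

# A chi-squared_k variate constructed as the sum of k squared N(0,1) draws
k, trials = 5, 20_000
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(trials)]

mean = sum(draws) / trials
var = sum((w - mean) ** 2 for w in draws) / (trials - 1)
print(round(mean, 1), round(var, 1))  # near E(W) = k = 5 and Var(W) = 2k = 10
```

All draws are nonnegative, as they must be: each is a sum of squares, which is why the $\chi^{2}$ has support only on $[0, \infty)$.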

## Student's $t$ Distribution

For a variable $X$ which is distributed as $t$ with $k$ degrees of freedom, the PDF is:

$f(x) = \frac{\Gamma(\frac{k+1}{2})} {\sqrt{k\pi}\,\Gamma(\frac{k}{2})} \left(1+\frac{x^2}{k} \right)^{-(\frac{k+1}{2})}\!$

where once again $\Gamma(\cdot)$ is the gamma integral. We write $X \sim t_{k}$, and say $X$ is distributed as Student's $t$ with $k$ degrees of freedom. The CDF is complicated, so I won't go into it here; Figure \ref{Ts} presents $t$ densities for five different values of $k$, along with a standard normal density for comparison.
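
The density above can be computed with the standard library's log-gamma function ($\Gamma$ itself overflows for large $k$). The sketch below (the function name `t_pdf` is ours) also illustrates two limiting cases: $k = 1$ gives the standard Cauchy, whose density at 0 is $1/\pi$, and large $k$ approaches the standard normal value $1/\sqrt{2\pi} \approx 0.3989$:

```python
import math

def t_pdf(x, k):
    """Student's t density with k degrees of freedom (lgamma avoids overflow for large k)."""
    log_c = math.lgamma((k + 1) / 2) - math.lgamma(k / 2) - 0.5 * math.log(k * math.pi)
    return math.exp(log_c) * (1 + x ** 2 / k) ** (-(k + 1) / 2)

# k = 1 is the standard Cauchy: density at 0 is 1/pi
print(round(t_pdf(0.0, 1), 4))  # → 0.3183
# For large k the density at 0 approaches the standard normal value, 1/sqrt(2*pi)
print(round(t_pdf(0.0, 1000), 4))
```

Evaluating the density across a grid of $x$ values for $k = 1, 5, 30$ would reproduce the "thicker tails converging to normal" pattern described below.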

The t-distribution is sometimes known as "Student's t," after a then-anonymous student of the statistician Karl Pearson. The story, from Wikipedia:

> The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness' industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of stout. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown to fellow statisticians.

Note a few things about $t$:

• The mean/mode/median of a $t$-distributed variate is zero, and its variance is $\frac{k}{k - 2}$ (defined for $k > 2$).
• $t$ looks like a standard normal distribution (symmetrical, bell-shaped) but has thicker tails (read: higher probabilities of draws being relatively far from the mean/mode). However...
• ...as $k$ gets larger, $t$ converges to a standard normal distribution; at or above $k = 30$ or so, the two are effectively indistinguishable.

The importance of the $t$ distribution lies in its relationship to the normal and chi-square distributions. In particular, if $Z \sim N(0,1)$ and $W \sim \chi^{2}_{k}$, and $Z$ and $W$ are independent, then

$\frac{Z}{\sqrt{W/k}} \sim t_{k}$

That is, the ratio of an $N(0,1)$ variable to the square root of a (properly scaled) chi-squared variable follows a $t$ distribution, with d.f. equal to the number of d.f. of the chi-squared variable. Of course, this also means that $\frac{Z^{2}}{W/k} \sim t^{2}_{k}.$

Since we know that $Z^{2} \sim \chi^{2}_{1}$, this means that the square of a $t$-distributed variate can also be derived as a (scaled) ratio of a $\chi^{2}_{1}$ variate to a $\chi^{2}_{k}$ variate. As we'll see in a week or so, that turns out to be quite important, and useful.
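
The construction $Z / \sqrt{W/k}$ can be simulated directly to confirm the moments listed above. A sketch (pure Python; seed and sample sizes are arbitrary choices of ours):

```python
import random, math

random.seed(13)

def t_draw(k):
    """One t_k draw built from the definition: Z ~ N(0,1) over sqrt(W/k),
    where W ~ chi-squared_k is itself a sum of k squared standard normals."""
    z = random.gauss(0, 1)
    w = sum(random.gauss(0, 1) ** 2 for _ in range(k))
    return z / math.sqrt(w / k)

k, trials = 10, 50_000
draws = [t_draw(k) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((t - mean) ** 2 for t in draws) / (trials - 1)
print(round(mean, 2), round(var, 2))  # mean near 0, variance near k/(k-2) = 1.25
```

The sample variance exceeding 1 reflects the thicker tails relative to the standard normal; rerunning with larger $k$ pushes it toward 1.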

## The $F$ Distribution

An $F$ distribution is best understood as the ratio of two scaled chi-squared variates. Formally, if $X$ is distributed as $F$ with $k$ and $\ell$ degrees of freedom, then the PDF of $X$ is:

$f(x) = \frac{\left(\frac{k\,x}{k\,x + \ell}\right)^{k/2} \left(1-\frac{k\,x}{k\,x + \ell}\right)^{\ell/2}}{x\; \mathrm{B}(k/2, \ell/2)}$

where $\mathrm{B}(\cdot)$ is the beta function, $\mathrm{B}(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$. The corresponding CDF is (once again) complicated, so we'll skip it. We write $X \sim F_{k,\ell}$, and say $X$ is distributed as $F$ with $k$ and $\ell$ degrees of freedom.

The $F$ is a two-parameter distribution, with degrees of freedom parameters (say $k$ and $\ell$), both of which are limited to the positive integers. An $F$ variate $X$ has support on the non-negative real line; it has expected value (defined for $\ell > 2$) $\text{E}(X) = \frac{\ell}{\ell - 2},$

which implies that the mean of an $F$-distributed variable converges on 1.0 as $\ell \rightarrow \infty$. Likewise, it has variance (defined for $\ell > 4$) $\text{Var}(X) = \frac{2\,\ell^2\,(k+\ell-2)}{k (\ell-2)^2 (\ell-4)},$

which bears no simple relationship to either $k$ or $\ell$. It is (generally) positively skewed. Examples of some $F$ densities with different values of $k$ and $\ell$ are presented in Figure \ref{Fs}.

As I noted a minute ago, if $W_{1} \sim \chi^{2}_{k}$ and $W_{2} \sim \chi^{2}_{\ell}$ are independent, then $\frac{W_{1}/k}{W_{2}/\ell} \sim F_{k,\ell}$

That is, the ratio of two chi-squared variables, each divided by its degrees of freedom, is distributed as $F$ with d.f. equal to the d.f. of the numerator and denominator variables, respectively. This implies (at least) a couple of interesting things:

• If $X \sim F(k, \ell)$, then $\frac{1}{X} \sim F(\ell, k)$ (because taking the reciprocal simply swaps the numerator and denominator chi-squared variates).
• The square of a $t$-distributed variable is $\sim F(1,k)$ (*why*? -- take the formula for $t$, and square it...)
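
The second point above can be checked numerically: if $T \sim t_{k}$, then $T^{2} \sim F(1, k)$, which means the $F(1,k)$ density at $x$ must equal $f_{T}(\sqrt{x})/\sqrt{x}$. A sketch using only the standard library (the function names `f_pdf` and `t_pdf` are ours, with $\mathrm{B}(a,b)$ computed as $\Gamma(a)\Gamma(b)/\Gamma(a+b)$ via log-gammas):

```python
import math

def f_pdf(x, k, l):
    """F density with k and l d.f., using log B(a,b) = lgamma(a)+lgamma(b)-lgamma(a+b)."""
    log_b = math.lgamma(k / 2) + math.lgamma(l / 2) - math.lgamma((k + l) / 2)
    u = k * x / (k * x + l)
    return u ** (k / 2) * (1 - u) ** (l / 2) / (x * math.exp(log_b))

def t_pdf(x, k):
    """Student's t density with k degrees of freedom."""
    log_c = math.lgamma((k + 1) / 2) - math.lgamma(k / 2) - 0.5 * math.log(k * math.pi)
    return math.exp(log_c) * (1 + x ** 2 / k) ** (-(k + 1) / 2)

# If T ~ t_k then T^2 ~ F(1, k): the F(1, k) density at x equals f_T(sqrt(x)) / sqrt(x)
x, k = 2.0, 10
print(abs(f_pdf(x, 1, k) - t_pdf(math.sqrt(x), k) / math.sqrt(x)) < 1e-12)  # → True
```

The reciprocal property can be checked the same way, since the density of $1/X$ is $f_{X}(1/y)/y^{2}$.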

The substantive importance of all these distributions will become apparent as we move on to sampling distributions, in our quest to (eventually) do statistical inference.
