
Bivariate linear regression

From OPOSSEM


Objectives

Introduction

Measures of association, such as Pearson's r, do a good job of showing whether two variables are related to one another. Pearson's r allows us to say whether the relationship between two variables is positive or negative, and it allows us to determine the strength of the relationship between two variables. However, it does have its limitations.

Pearson's r does not allow you to specify a causal relationship between variables. That means you cannot use the technique to examine the impact of one variable on the other variable. For instance, using Pearson's r, you might be able to tell if attendance is related to final exam grade, but you cannot tell how much of an impact attendance has on final exam grade. In order to test a bivariate causal relationship, you should employ bivariate regression. Like Pearson's r, bivariate regression can only be run on ratio, interval or dummy variables.

Drawing lines

Any regression analysis needs to start with a hypothesis. For this example, we will posit the following:

The more poverty there is in a country, the lower its rate of voter turnout.

We will use data from the Quality of Government dataset <ref>Teorell, Jan, Marcus Samanni, Sören Holmberg and Bo Rothstein (2011). "Quality of Government". The QOG Institute, University of Gothenburg. http://www.qog.pol.gu.se/data/data_1.htm.</ref> to examine the hypothesis using bivariate regression. A bivariate regression analysis is essentially drawing a line through a set of points. First, we want to display the data in a scatterplot. The independent variable always goes on the x (horizontal) axis and the dependent variable always goes on the y (vertical) axis. In the example we're using here, poverty levels will be on the x-axis and participation rates on the y-axis. Here is what the scatterplot looks like:

Scatterplot--participation and poverty.png

Looking at the scatterplot, it seems that our hypothesis is correct; in general, those countries with lower poverty levels--i.e. those on the left side of the graph--have higher turnout, and those countries with higher poverty levels--i.e. those on the right side of the graph--have lower turnout. We can see this just by looking at the scatterplot. What we cannot tell is how much participation decreases as poverty increases. To determine this, we must use bivariate regression.

Drawing a line through the scatterplot gives us the regression line. The line is drawn to minimise the sum of the squared vertical distances between it and the individual points on the plot, known as the sum of squared errors (SSE). The line for the poverty and participation hypothesis is displayed below:

Scotterplot w line.png
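The idea of minimising the SSE can be sketched in a few lines of Python. This is only an illustration: the data points are invented (they are not taken from the Quality of Government dataset), and the function names are hypothetical. The slope and intercept come from the standard least-squares formulas for a single explanatory variable.

```python
# Hypothetical (poverty %, turnout %) pairs, invented purely for illustration.
poverty = [5.0, 10.0, 20.0, 30.0, 40.0]
turnout = [45.0, 40.0, 38.0, 33.0, 32.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = fit_line(poverty, turnout)
best = sse(poverty, turnout, intercept, slope)

# Shifting or tilting the fitted line makes the SSE larger.
shifted = sse(poverty, turnout, intercept + 1.0, slope)
tilted = sse(poverty, turnout, intercept, slope + 0.1)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"SSE fitted={best:.2f}, shifted={shifted:.2f}, tilted={tilted:.2f}")
```

Any other line drawn through these points has a larger SSE than the fitted one, which is what "best fitting" means here.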

Equation of a line

You have probably learned about drawing lines through plots before: many students encounter lines in high school algebra. There, you may have learned that the equation of a line is:

Y=aX+b

where a is the slope of the line and b is the y-intercept. You may encounter different terms to name both Y and X. The most common term for Y is "dependent variable". This term is used because the equation implies that Y is dependent on the slope a, the intercept b, and the value of X. The term is nonetheless problematic since it assumes a strong level of causality of X on Y that is not always appropriate. This is why terms like "outcome variable" or "explained variable" are better choices. Both are more neutral and illustrate quite well the type of relationship we are trying to model. The most common term for X is "independent variable" (do you see a pattern?). The logic is the same as for Y but the problems with this term are even more serious. Not only is causality still implied, which is a problem, but the term "independent variable" also assumes that no other factor or variable has any impact on X. This is highly problematic most of the time. The term "explanatory variable" is a better choice. It is certainly less ambitious but it avoids most (though unfortunately not all) of the potential misunderstandings that emerge when one interprets a bivariate linear regression.

Regression line

The equation for a regression line is the same as the equation for a line, just with different names and labels for things. Instead of the y-intercept, we now have the constant; instead of the slope, we now have the regression coefficient.

Constant

The constant in a regression equation is always represented by <math>\alpha</math>. The constant gives us a baseline for what the dependent variable (Y) would look like if the independent variable (X) were zero. Sometimes the constant is meaningful, particularly if it is plausible that the independent variable could equal zero; other times, the constant just serves as a baseline to build on, particularly if the independent variable is unlikely to ever be 0.

Regression coefficient

The regression coefficient in a bivariate regression equation is represented by <math>\beta</math>. The regression coefficient tells us how much the dependent variable (Y) changes for a one unit change in the independent variable (X). The regression coefficient is the part of the regression equation we focus on, as it tells us about the relationship between the variables; we use the regression coefficient to evaluate whether our hypothesis is correct or not.

The regression equation

These elements come together to form a bivariate regression equation. The regression equation takes this form:

<math>Y = \alpha + \beta X + \epsilon</math>

Poverty and Participation: a regression equation

Using a statistics program, we estimate the regression equation for the above hypothesis:

"The more poverty there is in a country, the lower its rate of voter turnout."

We find that

<math>\alpha</math>=43.35
<math>\beta</math>=-0.29

This means that the regression equation looks like this:

Y=43.35-0.29X

The constant of 43.35 tells us that, if the poverty level were zero, we would predict the participation rate to be 43.35%. The regression coefficient of -0.29 tells us that for every one point increase in the percentage of people living in poverty in a country, the participation rate is expected to decrease by 0.29 percentage points.
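As a quick illustration, the estimates reported above (<math>\alpha</math>=43.35 and <math>\beta</math>=-0.29) can be plugged into the regression equation to produce predicted participation rates. The function name and the poverty rates tried below are hypothetical choices made for this sketch.

```python
ALPHA = 43.35  # constant reported in the text
BETA = -0.29   # regression coefficient reported in the text


def predicted_turnout(poverty_rate):
    """Predicted participation rate (%) for a given poverty rate (%)."""
    return ALPHA + BETA * poverty_rate


# Illustrative poverty rates, not values taken from the dataset.
for poverty_rate in (0, 10, 20):
    print(poverty_rate, round(predicted_turnout(poverty_rate), 2))
```

Each additional point of poverty lowers the prediction by the same 0.29 percentage points, whatever the starting level, which is what makes the relationship linear.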

Interpreting the regression coefficient

The regression coefficient is interpreted as the expected change in the value of the dependent variable when the value of the independent variable increases by one unit.

Statistical significance

You should already be familiar with the concept of hypothesis testing. The goal of hypothesis testing is to determine whether the evidence is strong enough to reject the null hypothesis. For the kind of questions we test with regression analysis, the null hypothesis is usually that the independent variable has no effect on the dependent variable. In the example we've been using here, the null hypothesis is

"Poverty has no impact on the rate of voter turnout"

or

H0: Poverty has no impact on the rate of voter turnout.

In practical terms, if the null hypothesis is true, the regression coefficient is equal to zero. If the regression coefficient is zero, the dependent variable would not change at all as the independent variable increases or decreases.

The goal of testing for statistical significance is to see if your evidence is strong enough to reject the null hypothesis. In the case of regression, we want to be able to see that our regression coefficient is different from zero. But how different from zero does it need to be? Is a regression coefficient of .008 the same as zero? What about a regression coefficient of .0000001? The answer depends on the kind of data that you are looking at.

To tell whether your regression coefficient is far enough away from zero, you test it for statistical significance. Generally, in the social sciences we test for statistical significance at <math>\alpha</math><=.05 (here <math>\alpha</math> denotes the significance level, not the regression constant); in other words, we want there to be no more than a 1 in 20 chance that findings like ours would arise merely from a bad sample if the null hypothesis were true. In order to tell if our regression coefficient is statistically significant at <math>\alpha</math><=.05, we can use three different techniques. Output from statistical programs and presentation of regression analysis in books or journal articles won't always provide you with complete information, so it is important to learn all three techniques.

Determining statistical significance using the p-value

The easiest way to tell if your regression coefficient is statistically significant is to use the p-value. Output from a statistical program, such as SPSS or Stata, will usually include a column labeled 'p'. The 'p' is the probability of seeing a coefficient at least as far from zero as yours through chance alone, such as a bad sample, if the null hypothesis were true; i.e. if p=.50, results like yours would turn up about 50% of the time even if there were no real relationship. Because we want our <math>\alpha</math><=.05, we are looking for p-values<=.05.

In journal articles, there is usually some system of asterisks in the regression table that will tell you what the p-value is. If you look at the bottom of the table, there will usually be a key to tell you what the asterisks mean; generally, this will look like this:

*p<=.05, **p<=.01
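The asterisk key above amounts to a simple lookup, sketched below. The function name is hypothetical, and the thresholds follow the key shown here; individual journals may use different cut-offs or add a third level.

```python
def significance_stars(p_value):
    """Map a p-value to the asterisk convention *p<=.05, **p<=.01."""
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""  # not significant at the .05 level


for p in (0.005, 0.03, 0.30):
    print(p, repr(significance_stars(p)))
```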

Determining statistical significance using the standard error

Most statistical programs, and some journal articles, will report the standard error of a regression coefficient as well as its p-value.
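One common way to use the standard error, sketched here as an illustration rather than as this page's own method, is to build an approximate 95% confidence interval (the coefficient plus or minus 1.96 standard errors) and check whether it contains zero. The 1.96 multiplier is a large-sample assumption, and the function names are hypothetical. The inputs are the slope and intercept estimates from the Conservative vote share example in the Other Examples section of this page.

```python
def confidence_interval(coefficient, standard_error, critical_value=1.96):
    """Approximate 95% confidence interval (large-sample assumption)."""
    margin = critical_value * standard_error
    return coefficient - margin, coefficient + margin


def excludes_zero(interval):
    """True if zero lies outside the interval, i.e. significant at about .05."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Estimates from the Conservative vote share example on this page.
slope_interval = confidence_interval(1.02, 0.02)
intercept_interval = confidence_interval(0.93, 0.89)

print("slope:", slope_interval, excludes_zero(slope_interval))
print("intercept:", intercept_interval, excludes_zero(intercept_interval))
```

The interval for the slope stays well above zero, while the interval for the intercept straddles zero, which matches the conclusions drawn from the p-values in that example.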

Determining statistical significance using the t-score

Journal articles will almost always report the p-value; sometimes they will also include the t-score.
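For the usual test of a coefficient against zero, the t-score is the coefficient divided by its standard error, and in large samples an absolute t-score above roughly 1.96 corresponds to p<=.05. The sketch below computes the t-score and an approximate two-sided p-value. It assumes the sample is large enough for the standard normal distribution to stand in for the t distribution, and the function names are hypothetical. The inputs are again the estimates from the Conservative vote share example in the Other Examples section.

```python
import math


def t_score(coefficient, standard_error):
    """t-score for testing a coefficient against zero."""
    return coefficient / standard_error


def two_sided_p_normal(t):
    """Two-sided p-value, using the standard normal as a large-sample
    approximation to the t distribution (an assumption of this sketch)."""
    return math.erfc(abs(t) / math.sqrt(2))


slope_t = t_score(1.02, 0.02)
intercept_t = t_score(0.93, 0.89)

print("slope: t =", round(slope_t, 2), "p =", two_sided_p_normal(slope_t))
print("intercept: t =", round(intercept_t, 2),
      "p =", round(two_sided_p_normal(intercept_t), 2))
```

The intercept's approximate p-value comes out close to the 0.30 reported in that example, and the slope's t-score is far beyond 1.96.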

Conclusion

Other Examples

Vote support for competitive parties in Canada is relatively stable between elections, even when the distribution of seats changes significantly. There are obviously exceptions -- the NDP surge in Quebec in 2011 is one of them -- but it is safe to say the vote share in the previous election is a powerful predictor of vote share in the subsequent election. In the following example, we look at the relationship between the Conservative Party (CPC) vote share in the 2008 general election and its vote share in the 2011 general election. For this analysis, we include the 307 electoral districts (or ridings) where the Conservative Party had candidates in 2008 and 2011 (the CPC did not compete in Portneuf-Jacques-Cartier).

Among these 307 districts, the CPC managed to win 143 contests (47%) in 2008 and 166 contests (54%) in 2011. Its performance varies widely across districts, though. In some strongholds, the party's vote share reached more than 80% in both elections, while in others the party was not able to get 5% of the vote. On average, 38% of voters gave their support to the Conservatives in 2008 against 40% in 2011. We can thus say that these additional percentage points were gained in the right districts, mostly in Ontario.

Vote share is a continuous variable that is bounded between 0% and 100%. In this particular case, the distribution is not strictly normal, though there is a concentration of values around the mean and the median, and very few cases near the extremes. We have seen in a previous section that linear regression is more powerful when the explained variable (or dependent variable) is normally distributed. That is not the case here, but we are close enough to make this tool useful. Quantitative analysis often requires some compromise, and it is then a question of finding a reasonable balance between theory and practice. In the following figure, we can see how vote share for the Conservatives in 2011 is distributed.

[Figure: distribution of Conservative vote share in 2011]

The distribution of vote share in 2008 (our explanatory variable, or IV) is quite similar and, in any case, has no consequence for the quality of our estimation. But it is important to note that the data in both years have a similar dispersion, with a standard deviation of 17.36 in 2008 against 19.08 in 2011. The following figure gives an idea of the kind of relationship we want to estimate. As you can see, the relationship is positive (when X increases, Y also increases), roughly linear (an increase in X has roughly the same impact on Y regardless of the value of X), and very systematic (there are very few outliers). Unfortunately, this figure does not tell us precisely what the slope of this relationship is. This is where bivariate linear regression becomes appealing.

[Figure: scatterplot of Conservative vote share in 2008 and 2011]

The first step in our estimation is to decide what functional form our model should take. Two decisions must be made. First, we must decide if we want to estimate a linear relationship or not. Remember that a bivariate linear regression is necessarily linear in parameters (the slope and the intercept are constant) but does not necessarily describe a linear relationship. The relationship between X and Y can indeed be linear but it can also be polynomial (e.g. bell-shaped) or dichotomous. A quick look at the figure shows that the relationship is indeed linear (why?). Second, we must decide if we want to include an intercept in our model. Discussions about the intercept are often neglected in political science even though the intercept can provide important information about the nature of the estimated relationship. The easiest way to think about it is to ask yourself "on average, what is the expected value of Y if X is at 0?" In the context of Conservative vote share in Canada, we should thus ask ourselves "on average, how much support should we expect the Conservative Party to get in 2011 in a scenario where it got no support at all in 2008?" This is an empirical question that can be informed by what we know about Canadian politics. To make a long story short, let us just say that since the Conservative Party improved its electoral fortunes between 2008 and 2011, we can assume that, on average, even the worst-performing candidate in 2008 should see their support increase in 2011. That being said, we could also decide to drop the intercept if we believe that non-existent support in 2008 should be associated with non-existent support in 2011. After making these two decisions, we can now estimate the model and get:

<math>\begin{array}{lcr}VoteShare_{2011} = 1.02 \times VoteShare_{2008} + 0.93~~~~~~Adjusted~R^2 = 0.88\\ ~~~~~~~~~~~~~~~~~~~~~~ (0.02)~~~~~~~~~~~~~~~~~~~~~~~(0.89)~~~~~~N = 307\end{array}</math>

Standard errors are in parentheses. The fit of our model is quite high, with an adjusted R-squared of 0.88. The results show that each additional percentage point the Conservative Party got in 2008 is associated, on average, with an additional 1.02 percentage points in 2011. In other words, vote shares in 2008 and 2011 mirror one another. What about the intercept? The results seem to suggest that there is a 0.93 percentage point "bonus" for the Conservative Party in 2011.

Are these estimates statistically different from 0? We have seen in a previous section that the most common way to answer this question is to test the null hypothesis. In other words, we want to test if our estimated slope and intercept are statistically different from 0 at the chosen level of confidence. Most statistical software provides this information in the regression output. P-values are a convenient way to evaluate the evidence against the null hypothesis. A p-value under 0.05 is the usual threshold for rejecting the null hypothesis. The p-value of the slope is smaller than 0.01 (but not 0), so we can conclude that the relationship between the Conservative Party vote share in 2008 and 2011 is different from zero. The p-value for the intercept is much higher (p-value=0.30) and, most importantly, above the 0.05 threshold usually adopted in the political science literature. What should we conclude? The estimated "bonus" for the Conservative Party in 2011 may be due to chance and thus cannot be taken for granted.

Should we drop the intercept then? Not necessarily. A linear regression model without an intercept makes some interesting statistics unavailable without improving the quality of the estimation. Consequently, unless we have a strong and justified preference for a model without an intercept, it is a better idea to keep it. The figure below adds the regression line to the bivariate scatterplot presented above. As you can see, the line follows roughly a 45-degree angle (slope = 1) and most cases are very close to the estimated line.

[Figure: scatterplot of Conservative vote share in 2008 and 2011 with the regression line]
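The point about keeping the intercept can be illustrated with a small sketch that fits the same data with and without an intercept and compares the sums of squared errors. The vote shares below are invented for illustration and are not the actual riding-level results; the function names are hypothetical.

```python
# Hypothetical vote shares (%), invented for illustration only.
share_2008 = [5.0, 20.0, 35.0, 50.0, 65.0]
share_2011 = [7.0, 21.0, 38.0, 51.0, 68.0]


def fit_with_intercept(x, y):
    """Least-squares (intercept, slope) for a model with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_through_origin(x, y):
    """Least-squares slope when the intercept is forced to zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def sse(x, y, intercept, slope):
    """Sum of squared errors of the line intercept + slope * x."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


intercept, slope = fit_with_intercept(share_2008, share_2011)
origin_slope = fit_through_origin(share_2008, share_2011)

sse_with = sse(share_2008, share_2011, intercept, slope)
sse_without = sse(share_2008, share_2011, 0.0, origin_slope)

print(f"with intercept:    {intercept:.2f} + {slope:.3f}x, SSE={sse_with:.2f}")
print(f"without intercept: {origin_slope:.3f}x, SSE={sse_without:.2f}")
```

Because the model through the origin is a special case of the model with an intercept, forcing the intercept to zero can never reduce the SSE on the same data.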

References

<references group=""></references>

Discussion questions

Problems

Glossary
