Simple Linear Regression Analysis

By Bowerman, B.L., O'Connell, R.T., Murphree, E.S.

Edited by Paul Ducham

LINEAR REGRESSION ANALYSIS

The SIMPLE LINEAR REGRESSION MODEL assumes that the relationship between the dependent variable, which is denoted y, and the independent variable, denoted x, can be approximated by a straight line. We can tentatively decide whether there is an approximate straight-line relationship between y and x by making a scatter diagram, or scatter plot, of y versus x. First, data concerning the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted against its corresponding value of x. If the y values tend to increase or decrease in a straight-line fashion as the x values increase, and if there is a scattering of the (x, y) points around the straight line, then it is reasonable to describe the relationship between y and x by using the simple linear regression model. We illustrate this in the following case study.

EXAMPLE 13.1 The Tasty Sub Shop Case: Predicting Yearly Revenue for a Potential Restaurant Site

Part 1: Purchasing a restaurant franchise  Quiznos Sub Shops and other restaurant chains sell franchises to business entrepreneurs. Unlike McDonald’s, Pizza Hut, and certain other chains, Quiznos does not construct a standard, recognizable building to house each of its restaurants. Instead, the entrepreneur wishing to purchase a Quiznos franchise finds a suitable site, which includes a suitable geographical location and suitable store space to rent. Then, when Quiznos approves the site, Quiznos hires an architect and a contractor to remodel the store rental space and thus “build” the Quiznos restaurant. Quiznos will help an entrepreneur evaluate potential sites, will help negotiate leases, and will provide national advertising and other support once a franchise is purchased. However, strict regulations prevent Quiznos (and other chains) from predicting how profitable an entrepreneur’s potential restaurant might be. These regulations exist to prevent restaurant chains from overpredicting profit and thus misleading an entrepreneur into purchasing a franchise that might not be successful. As stated on the Quiznos website:

    There are strict regulations in the franchise industry that limit our ability to estimate how successful your business could be. You need to do this yourself, but we can give some guidance. . . . Your sales primarily depend on the quality of the site, and your skill as an operator. So to estimate what your sales might be, look at other Quiznos restaurants that are in similar sites to the one you are reviewing. Find one with similar demographics (nearby employer and residence counts). . . . Ask that operator what their sales are.

Part 2: The Tasty Sub Shop Sales Data  In this case study, we suppose that there is a restaurant chain—The Tasty Sub Shop—that is similar to Quiznos in the way it sells franchises to business entrepreneurs. We will also suppose that there is an entrepreneur who has found several potential sites for a Tasty Sub Shop restaurant. Similar to most existing Tasty Sub restaurant sites, each of the entrepreneur’s sites is a store rental space located in an outdoor shopping area that is close to one or more residential areas. For a Tasty Sub restaurant built on such a site, yearly revenue is known to partially depend on (1) the number of residents living near the site and (2) the amount of business and shopping near the site. Referring to the number of residents living near a site as population size and to the yearly revenue for a Tasty Sub restaurant built on the site as yearly revenue, the entrepreneur will try to predict the dependent (response) variable yearly revenue (y) on the basis of the independent (predictor) variable population size (x). To predict yearly revenue on the basis of population size, the entrepreneur chooses 10 existing Tasty Sub restaurants that are built on sites similar to the sites that the entrepreneur is considering. The entrepreneur then asks the owner of each existing restaurant what the restaurant’s revenue y was last year and estimates—with the help of the owner and published demographic information—the number of residents, or population size x, living near the site. The values of y (measured in thousands of dollars) and x (measured in thousands of residents) that are obtained are given in Table 13.1. In Figure 13.1 we give an Excel output of a scatter plot of y versus x. This plot shows (1) a tendency for the yearly revenues to increase in a straight-line fashion as the population sizes increase and (2) a scattering of points around the straight line. 
A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.

Part 3: The simple linear regression model  The simple linear regression model relating y to x can be expressed as follows:

y = β0 + β1x + є

This model says that the values of y can be represented by a mean level, µy = β0 + β1x, that changes in a straight-line fashion as x changes, combined with random fluctuations, described by the error term є, that cause the values of y to deviate from the mean level. Here:

1 The mean level µy = β0 + β1x is the mean yearly revenue corresponding to a particular population size x. That is, noting that different Tasty Sub restaurants could potentially be built near different populations of the same size x, the mean level µy = β0 + β1x is the mean of the yearly revenues that would be obtained by all such restaurants. In addition, because µy = β0 + β1x is the equation of a straight line, the mean yearly revenues that correspond to increasing values of the population size x lie on a straight line. For example, Table 13.1 tells us that 32,300 residents live near restaurant 3 and 45,100 residents live near restaurant 6. It follows that the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 32,300 residents is β0 + β1(32.3). Similarly, the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 45,100 residents is β0 + β1(45.1). Figure 13.2 depicts these two mean yearly revenues as triangles that lie on the straight line µy = β0 + β1x, which we call the line of means. The unknown parameters β0 and β1 are the y-intercept and the slope of the line of means. When we estimate β0 and β1, we will be able to estimate mean yearly revenue µy on the basis of the population size x.

2 The y-intercept β0 of the line of means can be understood by considering Figure 13.2. As illustrated in this figure, the y-intercept β0 is the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of zero residents. However, since it is unlikely that a Tasty Sub restaurant would be built near a population of zero residents, this interpretation of β0 is of dubious practical value. There are many regression situations where the y-intercept β0 lacks a practical interpretation. In spite of this, statisticians have found that β0 is almost always an important component of the line of means and thus of the simple linear regression model.

3 The slope β1 of the line of means can also be understood by considering Figure 13.2. As illustrated in this figure, the slope β1 is the change in mean yearly revenue that is associated with a one-unit increase (that is, a 1,000 resident increase) in the population size x.

4 The error term of the simple linear regression model accounts for any factors affecting yearly revenue other than the population size x. Such factors would include the amount of business and shopping near a restaurant and the skill of the owner as an operator of the restaurant. For example, Figure 13.2 shows that the error term for restaurant 3 is positive. Therefore, the observed yearly revenue y = 767.2 for restaurant 3 is above the corresponding mean yearly revenue for all restaurants that have x = 32.3. As another example, Figure 13.2 also shows that the error term for restaurant 6 is negative. Therefore, the observed yearly revenue y = 810.5 for restaurant 6 is below the corresponding mean yearly revenue for all restaurants that have x = 45.1. Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in Figure 13.2 are only hypothetical.

    With the Tasty Sub Shop example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations—each observation consists of an observed value of x and its corresponding value of y. Then:

The Simple Linear Regression Model

The simple linear (or straight line) regression model is: y = β0 + β1x + є. Here:

1 µy = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.

2 β0 is the y-intercept. β0 is the mean value of y when x equals zero.

3 β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.

4 є is an error term that describes the effects on y of all factors other than the value of the independent variable x.

This model is illustrated in Figure 13.3 (note that x0 in this figure denotes a specific value of the independent variable x). The y-intercept β0 and the slope β1 are called regression parameters. In addition, we have interpreted the slope β1 to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the effect of the independent variable x on the dependent variable y. However, we cannot prove that a change in an independent variable causes a change in the dependent variable. Rather, regression can be used only to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable. For instance, regression analysis might be used to establish that as liquor sales have increased over the years, college professors’ salaries have also increased. However, this does not prove that increases in liquor sales cause increases in college professors’ salaries. Rather, both variables are influenced by a third variable: long-run growth in the national economy.
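To make the roles of β0, β1, and є concrete, the following minimal Python sketch simulates many y values at one fixed x and compares their average with the point on the line of means. The parameter values (200, 15, and 60) and the normally distributed error term are illustrative assumptions only; they are not estimates from the Tasty Sub Shop data.

```python
import random

def simulate_y(x, beta0, beta1, sigma, rng):
    """Return one simulated value y = beta0 + beta1*x + error."""
    mean_level = beta0 + beta1 * x      # point on the line of means
    error = rng.gauss(0.0, sigma)       # random fluctuation (assumed normal)
    return mean_level + error

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 200.0, 15.0, 60.0
rng = random.Random(1)

x = 32.3  # population size in thousands of residents
simulated = [simulate_y(x, BETA0, BETA1, SIGMA, rng) for _ in range(5000)]
sample_mean = sum(simulated) / len(simulated)

print(f"line of means at x = {x}: {BETA0 + BETA1 * x:.1f}")
print(f"average of simulated y values: {sample_mean:.1f}")
```

Individual simulated values scatter around the mean level, but their average should land close to β0 + β1x, which is what the model means by calling µy the mean value of y at x.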

The least squares point estimates  Suppose that we have gathered n observations (x1, y1), (x2, y2), . . . , (xn, yn), where each observation consists of a value of an independent variable x and a corresponding value of a dependent variable y. Also, suppose that a scatter plot of the n observations indicates that the simple linear regression model relates y to x. In order to estimate the y-intercept β0 and the slope β1 of the line of means of this model, we could visually draw a line, called an estimated regression line, through the scatter plot. Then, we could read the y-intercept and slope off the estimated regression line and use these values as the point estimates of β0 and β1. Unfortunately, if different people visually drew lines through the scatter plot, their lines would probably differ from each other. What we need is the “best line” that can be drawn through the scatter plot. Although there are various definitions of what this best line is, one of the most useful best lines is the least squares line.

To understand the least squares line, we let

ŷ = b0 + b1x

denote the general equation of an estimated regression line drawn through a scatter plot. Here, since we will use this line to predict y on the basis of x, we call ŷ the predicted value of y when the value of the independent variable is x. In addition, b0 is the y-intercept and b1 is the slope of the estimated regression line. When we determine numerical values for b0 and b1, these values will be the point estimates of the y-intercept β0 and the slope β1. Now consider the ith observation (xi, yi) in a sample of n observations. For this observation, the predicted value of the dependent variable y given by an estimated regression line is ŷi = b0 + b1xi. Furthermore, the difference between the observed and predicted values of y, yi − ŷi, is the residual for the observation, and the sum of squared residuals for all n observations is

SSE = Σ (yi − ŷi)² = Σ (yi − (b0 + b1xi))², where each sum runs over i = 1, . . . , n.

The least squares line is the line that minimizes SSE. To find this line, we find the values of the y-intercept b0 and the slope b1 that give values of ŷi = b0 + b1xi that minimize SSE. These values of b0 and b1 are called the least squares point estimates of β0 and β1. Using calculus, it can be shown that these estimates are calculated as follows:

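b1 = SSxy / SSxx and b0 = ȳ − b1x̄

where

SSxy = Σ (xi − x̄)(yi − ȳ) = Σ xiyi − (Σ xi)(Σ yi)/n

SSxx = Σ (xi − x̄)² = Σ xi² − (Σ xi)²/n

Here x̄ = (Σ xi)/n and ȳ = (Σ yi)/n are the sample means, and every sum runs over i = 1, . . . , n.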
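The standard least squares formulas, b1 = SSxy/SSxx and b0 = ȳ − b1x̄, translate directly into code. The sketch below is one possible Python rendering; the function names and the five-point data set are hypothetical and are not the Tasty Sub Shop data. It also checks that a slightly different line produces a larger SSE.

```python
def least_squares(xs, ys):
    """Return (b0, b1) computed from SSxy and SSxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data (not the Tasty Sub Shop data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(xs, ys)
best = sse(xs, ys, b0, b1)
# A nearby but different line should never have a smaller SSE.
other = sse(xs, ys, b0 + 0.5, b1 - 0.1)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SSE of least squares line = {best:.4f}, SSE of other line = {other:.4f}")
```

Trying other intercept and slope pairs in the last call is a quick way to see that the least squares line is the one with the smallest sum of squared residuals.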

   The following example illustrates how to calculate these point estimates and how to use these point estimates to estimate mean values and predict individual values of the dependent variable. Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also used throughout this article to perform other important calculations.

EXAMPLE 13.2 The Tasty Sub Shop Case

Part 1: Calculating the least squares point estimates  Again consider the Tasty Sub Shop problem. To compute the least squares point estimates of the regression parameters β0 and β1, we first calculate the following preliminary summations:

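[Example 13.2 summations: image not reproduced.] The resulting least squares point estimates are b1 = 15.596 and b0 = 183.31, so the least squares line is ŷ = 183.31 + 15.596x.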

 

Part 2: Estimating a mean yearly revenue and predicting an individual yearly revenue  We define the experimental region to be the range of the previously observed population sizes. Referring to Table 13.2, we see that the experimental region consists of the range of population sizes from 20.8 to 64.6. The simple linear regression model relates yearly revenue y to population size x for values of x that are in the experimental region. For such values of x, the least squares line is the estimate of the line of means. It follows that the point on the least squares line corresponding to a population size of x

ŷ=b0 + b1x

is the point estimate of β0 + β1x, the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of size x. In addition, we predict the error term to be 0. Therefore, ŷ is also the point prediction of an individual value y = β0 + β1x + є, which is the yearly revenue for a single (individual) Tasty Sub restaurant that is built near a population of size x. Note that the reason we predict the error term to be zero is that, because of several regression assumptions, it has a 50 percent chance of being positive and a 50 percent chance of being negative. For example, suppose that one of the business entrepreneur’s potential restaurant sites is near a population of 47,300 residents. Because x = 47.3 is in the experimental region,

ŷ = 183.31 + 15.596(47.3)

= 921.0 (that is, $921,000)

is

1 The point estimate of the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 47,300 residents.

2 The point prediction of the yearly revenue for a single Tasty Sub restaurant that is built near a population of 47,300 residents.

Figure 13.6 illustrates ŷ = 921.0 as a square on the least squares line. Moreover, suppose that the yearly rent and other fixed costs for the entrepreneur’s potential restaurant will be $257,550 and that (according to Tasty Sub corporate headquarters) the yearly food and other variable costs for the restaurant will be 60 percent of the yearly revenue. Because we predict that the yearly revenue for the restaurant will be $921,000, it follows that we predict that the yearly total operating cost for the restaurant will be $257,550 + .6($921,000) = $810,150. In addition, if we subtract this predicted yearly operating cost from the predicted yearly revenue of $921,000, we predict that the yearly profit for the restaurant will be $110,850. Of course, these predictions are point predictions.
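The arithmetic in this part of the example can be retraced in a few lines of Python. This is only a sketch: the variable names are made up for illustration, while the numbers are the ones quoted in the example (b0 = 183.31, b1 = 15.596, x = 47.3, fixed costs of $257,550, and variable costs of 60 percent of revenue).

```python
# Values quoted in the example above.
B0, B1 = 183.31, 15.596        # least squares point estimates
FIXED_COSTS = 257_550          # yearly rent and other fixed costs, in dollars
VARIABLE_COST_RATE = 0.60      # variable costs as a share of yearly revenue

x = 47.3                                   # population size, thousands of residents
y_hat = round(B0 + B1 * x, 1)              # predicted revenue, thousands of dollars
revenue = y_hat * 1000                     # predicted revenue, dollars
total_cost = FIXED_COSTS + VARIABLE_COST_RATE * revenue
profit = revenue - total_cost

print(f"predicted revenue: ${revenue:,.0f}")
print(f"predicted total operating cost: ${total_cost:,.0f}")
print(f"predicted profit: ${profit:,.0f}")
```

Rounding ŷ to one decimal place before converting to dollars mirrors the text, which carries the prediction forward as $921,000.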

  To conclude this example, note that Figure 13.6 illustrates the potential danger of using the least squares line to predict outside the experimental region. In the figure, we extrapolate the least squares line beyond the experimental region to obtain a prediction for a population size of x = 90. As shown in Figure 13.6, for values of x in the experimental region (that is, between 20.8 and 64.6) the observed values of y tend to increase in a straight-line fashion as the values of x increase. However, for population sizes greater than x = 64.6, we have no data to tell us whether the relationship between y and x continues as a straight-line relationship or, possibly, becomes a curved relationship. If, for example, this relationship becomes the sort of curved relationship shown in Figure 13.6, then extrapolating the straight-line prediction equation to obtain a prediction for x = 90 would overestimate mean yearly revenue (see Figure 13.6).

  The previous example illustrates that when we are using a least squares regression line, we should not estimate a mean value or predict an individual value unless the corresponding value of x is in the experimental region—the range of the previously observed values of x. Often the value x=0 is not in the experimental region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the estimate of the mean value of y when x equals 0. For example, consider the Tasty Sub Shop problem. Figure 13.6 illustrates that the population size x = 0 is not in the experimental region. Therefore, it would not be appropriate to use b0 = 183.31 as the point estimate of the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of zero residents. Because it is not meaningful to interpret the y-intercept in many regression situations, we often omit such interpretations.

  We now present a general procedure for estimating a mean value and predicting an individual value:

Point Estimation and Point Prediction in Simple Linear Regression:

Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then ŷ = b0 + b1x0

1 is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0.

2 is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0. Here we predict the error term to be 0.
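One way to build the experimental-region requirement into a calculation is to refuse to extrapolate. The helper below is a hypothetical sketch (the function name and the choice to raise an error are assumptions, not part of the procedure itself); it uses the estimates and the range 20.8 to 64.6 quoted in the Tasty Sub Shop example.

```python
def point_prediction(x0, b0, b1, x_min, x_max):
    """Return y_hat = b0 + b1*x0, refusing to extrapolate.

    x_min and x_max bound the experimental region, that is, the
    range of previously observed x values.
    """
    if not (x_min <= x0 <= x_max):
        raise ValueError(
            f"x0 = {x0} is outside the experimental region [{x_min}, {x_max}]"
        )
    return b0 + b1 * x0

# Estimates and experimental region quoted in the Tasty Sub Shop example.
B0, B1 = 183.31, 15.596
X_MIN, X_MAX = 20.8, 64.6

print(round(point_prediction(47.3, B0, B1, X_MIN, X_MAX), 1))
try:
    point_prediction(90, B0, B1, X_MIN, X_MAX)
except ValueError as exc:
    print("refused:", exc)
```

The same check rejects x0 = 0, which matches the earlier remark that b0 should not be read as an estimate of mean revenue for a population of zero residents.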

table 13.1

figure 13.2

figure 13.3

figure 13.4

table 13.2

figure 13.6

 

LINEAR REGRESSION AND STANDARD ERROR

Model assumptions In order to perform hypothesis tests and set up various types of intervals when using the simple linear regression model

y = β0 + β1x + є

we need to make certain assumptions about the error term є. At any given value of x, there is a population of error term values that could potentially occur. These error term values describe the different potential effects on y of all factors other than the value of x. Therefore, these error term values explain the variation in the y values that could be observed when the independent variable is x. Our statement of the simple linear regression model assumes that µy, the mean of the population of all y values that could be observed when the independent variable is x, is β0 + β1x. This model also implies that є = y − (β0 + β1x), so this is equivalent to assuming that the mean of the corresponding population of potential error term values is 0. In total, we make four assumptions (called the regression assumptions) about the simple linear regression model. These assumptions can be stated in terms of potential y values or, equivalently, in terms of potential error term values. Following tradition, we begin by stating these assumptions in terms of potential error term values:

The Regression Assumptions

1 At any given value of x, the population of potential error term values has a mean equal to 0.

2 Constant Variance Assumption At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote the constant variance as σ2.

3 Normality Assumption At any given value of x, the population of potential error term values has a normal distribution.

4 Independence Assumption  Any one value of the error term є is statistically independent of any other value of є. That is, the value of the error term є corresponding to an observed value of y is statistically independent of the value of the error term corresponding to any other observed value of y.

Taken together, the first three assumptions say that, at any given value of x, the population of potential error term values is normally distributed with mean zero and a variance σ² that does not depend on the value of x. Because the potential error term values cause the variation in the potential y values, these assumptions imply that the population of all y values that could be observed when the independent variable is x is normally distributed with mean β0 + β1x and a variance σ² that does not depend on x. These three assumptions are illustrated in Figure 13.7 in the context of the Tasty Sub Shop problem. Specifically, this figure depicts the populations of yearly revenues corresponding to two values of the population size x: 32.3 and 61.7. Note that these populations are shown to be normally distributed with different means (each of which is on the line of means) and with the same variance (or spread).

  The independence assumption is most likely to be violated when time series data are being utilized in a regression study. For example, the fuel consumption data in Exercise 13.3 are time series data. Intuitively, the independence assumption says that there is no pattern of positive error terms being followed (in time) by other positive error terms, and there is no pattern of positive error terms being followed by negative error terms. That is, there is no pattern of higher-than-average y values being followed by other higher-than-average y values, and there is no pattern of higher-than-average y values being followed by lower-than-average y values.

It is important to point out that the regression assumptions very seldom, if ever, hold exactly in any practical regression problem. However, it has been found that regression results are not extremely sensitive to mild departures from these assumptions. In practice, only pronounced departures from these assumptions require attention.

  When we predict an individual value of the dependent variable, we predict the error term to be 0. To see why we do this, note that the regression assumptions state that, at any given value of the independent variable, the population of all error term values that can potentially occur is normally distributed with a mean equal to 0. Since we also assume that successive error terms (observed over time) are statistically independent, each error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict any particular error term value to be 0.
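The 50 percent figure follows from the symmetry of a normal distribution about its mean of 0. A short simulation can illustrate this; the sketch below draws error terms from a normal distribution with mean 0, using a hypothetical standard deviation of 60 (any positive value would give the same proportions).

```python
import random

rng = random.Random(42)
SIGMA = 60.0   # hypothetical standard deviation of the error term

# Draw many independent error terms from a normal distribution with mean 0.
errors = [rng.gauss(0.0, SIGMA) for _ in range(10_000)]
share_positive = sum(e > 0 for e in errors) / len(errors)
mean_error = sum(errors) / len(errors)

print(f"share of positive error terms: {share_positive:.3f}")
print(f"average error term: {mean_error:.2f}")
```

Roughly half of the simulated error terms should be positive and their average should be near 0, which is why 0 is a reasonable prediction for any single error term.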

The mean square error and the standard error  To present statistical inference formulas, we need to be able to compute point estimates of σ² and σ, the constant variance and standard deviation of the error term populations. The point estimate of σ² is called the mean square error and the point estimate of σ is called the standard error. In the following box, we show how to compute these estimates:

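The mean square error is s² = SSE / (n − 2), and the standard error is s = √s² = √(SSE / (n − 2)), where SSE is the sum of squared residuals.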

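In order to understand these point estimates, recall that σ² is the variance of the population of y values (for a given value of x) around the mean value µy. Because ŷ is the point estimate of this mean, it seems natural to use

SSE = Σ (yi − ŷi)²

to help construct a point estimate of σ². We divide SSE by n − 2 rather than by n because doing so makes s² an unbiased point estimate of σ².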

EXAMPLE 13.3 The Tasty Sub Shop Case

Consider the Tasty Sub Shop situation, and recall that in Table 13.2, we have calculated the sum of squared residuals to be SSE = 30,460.21. It follows, because we have observed n = 10 yearly revenues, that the point estimate of σ² is the mean square error

s² = SSE / (n − 2) = 30,460.21 / 8 = 3,807.53

and the point estimate of σ is the standard error s = √3,807.53 = 61.705.
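The usual formulas for the mean square error, s² = SSE/(n − 2), and the standard error, s = √s², can be checked against the numbers in this example. The following Python sketch uses only the SSE and n quoted above; the variable names are illustrative.

```python
import math

SSE = 30_460.21   # sum of squared residuals quoted in the example
n = 10            # number of observed yearly revenues

s_squared = SSE / (n - 2)    # mean square error, point estimate of sigma^2
s = math.sqrt(s_squared)     # standard error, point estimate of sigma

print(f"s^2 = {s_squared:.2f}, s = {s:.3f}")
```

Because y is measured in thousands of dollars, s is also in thousands of dollars, so a standard error of about 61.7 corresponds to roughly $61,700.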

 

figure 13.7

TESTING FOR SIGNIFICANCE OF SLOPE

Testing the significance of the slope  A simple linear regression model is not likely to be useful unless there is a significant relationship between y and x. In order to judge the significance of the relationship between y and x, we test the null hypothesis

H0: β1 =0

which says that there is no change in the mean value of y associated with an increase in x, versus the alternative hypothesis

Ha: β1 ≠ 0

which says that there is a (positive or negative) change in the mean value of y associated with an increase in x. It would be reasonable to conclude that x is significantly related to y if we can be quite certain that we should reject H0 in favor of Ha.

  In order to test these hypotheses, recall that we compute the least squares point estimate b1 of the true slope β1 by using a sample of n observed values of the dependent variable y. Different samples of n observed y values would yield different values of the least squares point estimate b1. It can be shown that, if the regression assumptions hold, then the population of all possible values of b1 is normally distributed with a mean of β1 and with a standard deviation of

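σb1 = σ / √SSxx

Because the standard error s is the point estimate of σ, the point estimate of σb1 is sb1 = s / √SSxx, which is called the standard error of the estimate b1.

To test H0: β1 = 0 when the regression assumptions hold, we use the test statistic t = b1 / sb1, which has a t distribution with n − 2 degrees of freedom when H0 is true. We reject H0 in favor of Ha: β1 ≠ 0 at level of significance α if |t| > tα/2 or, equivalently, if the p-value is less than α. Here tα/2 is based on n − 2 degrees of freedom.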
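The claim that b1 varies from sample to sample with mean β1 and standard deviation σ/√SSxx can be checked by simulation. The sketch below is a rough illustration under stated assumptions: the true parameters (200, 15, 60) and the ten x values are hypothetical, and the errors are drawn independently from a normal distribution.

```python
import math
import random
import statistics

def slope_estimate(xs, ys):
    """Least squares slope b1 = SSxy / SSxx."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    return ss_xy / ss_xx

# Hypothetical true parameters and fixed x values, for illustration only.
BETA0, BETA1, SIGMA = 200.0, 15.0, 60.0
xs = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]

rng = random.Random(7)
slopes = []
for _ in range(4000):
    # Each pass generates a new sample of y values and re-estimates the slope.
    ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]
    slopes.append(slope_estimate(xs, ys))

x_bar = sum(xs) / len(xs)
ss_xx = sum((x - x_bar) ** 2 for x in xs)
theoretical_sd = SIGMA / math.sqrt(ss_xx)

print(f"mean of simulated b1 values: {statistics.mean(slopes):.3f} (true slope {BETA1})")
print(f"std. dev. of simulated b1 values: {statistics.stdev(slopes):.3f}")
print(f"sigma / sqrt(SSxx): {theoretical_sd:.3f}")
```

The simulated mean and standard deviation should sit close to β1 and σ/√SSxx, and widening the spread of the x values increases SSxx and so shrinks the variability of b1.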

We usually use the two-sided alternative Ha: β1 ≠ 0 for this test of significance. However, sometimes a one-sided alternative is appropriate. For example, in the Tasty Sub Shop problem we can say that if the slope β1 is not 0, then it must be positive. A positive β1 would say that mean yearly revenue increases as the population size x increases. Because of this, it would be appropriate to decide that x is significantly related to y if we can reject H0: β1 = 0 in favor of the one-sided alternative Ha: β1 > 0. Although this test would be slightly more effective than the usual two-sided test, there is little practical difference between using the one-sided or two-sided alternative. Furthermore, computer packages (such as Excel and MINITAB) present results for testing a two-sided alternative hypothesis. For these reasons we will emphasize the two-sided test. It should also be noted that

1 If we can decide that the slope is significant at the .05 significance level, then we have concluded that x is significantly related to y by using a test that allows only a .05 probability of concluding that x is significantly related to y when it is not. This is usually regarded as strong evidence that the regression relationship is significant.

2 If we can decide that the slope is significant at the .01 significance level, this is usually regarded as very strong evidence that the regression relationship is significant.

3 The smaller the significance level α at which H0 can be rejected, the stronger is the evidence that the regression relationship is significant.
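As a quick numerical check of the two-sided test, the standard t statistic for the slope is t = b1/sb1, compared with the tα/2 point based on n − 2 degrees of freedom. The sketch below plugs in the values reported for the Tasty Sub Shop case (b1 = 15.596, sb1 = 1.411, and t.025 = 2.306 for 8 degrees of freedom); the variable names are illustrative.

```python
B1 = 15.596      # least squares point estimate of the slope
S_B1 = 1.411     # standard error of the estimate b1
T_CRIT = 2.306   # t.025 point based on n - 2 = 8 degrees of freedom

t_stat = B1 / S_B1
reject_h0 = abs(t_stat) > T_CRIT   # two-sided test at the .05 level

print(f"t = {t_stat:.2f}, reject H0 at the .05 level: {reject_h0}")
```

A t statistic this far beyond the critical value indicates that H0: β1 = 0 would also be rejected at much smaller significance levels.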

example 13.4

example

EXAMPLE 13.5 The Tasty Sub Shop Case

The Excel and MINITAB outputs in Figure 13.8 tell us that b1 = 15.596 and sb1 = 1.411. Thus, for instance, because t.025 based on n − 2 = 10 − 2 = 8 degrees of freedom equals 2.306, a 95 percent confidence interval for β1 is

[b1 ± t.025 sb1] = [15.596 ± 2.306(1.411)] = [12.342, 18.849]

(where we have used more decimal place accuracy than shown to obtain the final result). This interval says we are 95 percent confident that, if the population size increases by one thousand residents, then mean yearly revenue will increase by at least $12,342 and by at most $18,849. Also, because the 95 percent confidence interval for β1 does not contain 0, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. Note that the 95 percent confidence interval for β1 is given on the Excel output but not on the MINITAB output (see Figure 13.8).
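The interval calculation can be reproduced with a small helper. This is a sketch with an illustrative function name; the inputs are the rounded values quoted in the example, so the last digit of the upper limit may differ slightly from the interval computed with more decimal places.

```python
def slope_confidence_interval(b1, s_b1, t_point):
    """Return (lower, upper) for the interval b1 +/- t_point * s_b1."""
    half_width = t_point * s_b1
    return b1 - half_width, b1 + half_width

# b1, s_b1, and t.025 (8 degrees of freedom) as quoted in the example.
lower, upper = slope_confidence_interval(15.596, 1.411, 2.306)
contains_zero = lower <= 0.0 <= upper

print(f"95% interval for the slope: [{lower:.3f}, {upper:.3f}]")
print(f"interval contains 0: {contains_zero}")
```

Checking whether the interval contains 0 gives the same decision as the two-sided test of H0: β1 = 0 at the matching significance level.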

Testing the significance of the y-intercept  We can also test the significance of the y-intercept β0. We do this by testing the null hypothesis H0: β0 = 0 versus the alternative hypothesis Ha: β0 ≠ 0. If we can reject H0 in favor of Ha by setting the probability of a Type I error equal to α, we conclude that the intercept β0 is significant at the α level. To carry out the hypothesis test, we use the test statistic

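t = b0 / sb0, where sb0 = s √(1/n + x̄² / SSxx)

If the regression assumptions hold, we reject H0 in favor of Ha at level of significance α if |t| > tα/2 (based on n − 2 degrees of freedom) or, equivalently, if the p-value is less than α.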

figure 13.8