The SIMPLE LINEAR REGRESSION MODEL assumes that the relationship between the dependent variable, which is denoted y, and the independent variable, denoted x, can be approximated by a straight line. We can tentatively decide whether there is an approximate straight-line relationship between y and x by making a scatter diagram, or scatter plot, of y versus x. First, data concerning the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted against its corresponding value of x. If the y values tend to increase or decrease in a straight-line fashion as the x values increase, and if there is a scattering of the (x, y) points around the straight line, then it is reasonable to describe the relationship between y and x by using the simple linear regression model. We illustrate this in the following case study.
EXAMPLE 13.1 The Tasty Sub Shop Case: Predicting Yearly Revenue for a Potential Restaurant Site
Part 1: Purchasing a restaurant franchise Quiznos Sub Shops and other restaurant chains sell franchises to business entrepreneurs. Unlike McDonald’s, Pizza Hut, and certain other chains, Quiznos does not construct a standard, recognizable building to house each of its restaurants. Instead, the entrepreneur wishing to purchase a Quiznos franchise finds a suitable site, which includes a suitable geographical location and suitable store space to rent. Then, when Quiznos approves the site, Quiznos hires an architect and a contractor to remodel the store rental space and thus “build” the Quiznos restaurant. Quiznos will help an entrepreneur evaluate potential sites, will help negotiate leases, and will provide national advertising and other support once a franchise is purchased. However, strict regulations prevent Quiznos (and other chains) from predicting how profitable an entrepreneur’s potential restaurant might be. These regulations exist to prevent restaurant chains from overpredicting profit and thus misleading an entrepreneur into purchasing a franchise that might not be successful. As stated on the Quiznos website:
There are strict regulations in the franchise industry that limit our ability to estimate how successful your business could be. You need to do this yourself, but we can give some guidance. . . . Your sales primarily depend on the quality of the site, and your skill as an operator. So to estimate what your sales might be, look at other Quiznos restaurants that are in similar sites to the one you are reviewing. Find one with similar demographics (nearby employer and residence counts). . . . Ask that operator what their sales are.
Part 2: The Tasty Sub Shop Sales Data In this case study, we suppose that there is a restaurant chain—The Tasty Sub Shop—that is similar to Quiznos in the way it sells franchises to business entrepreneurs. We will also suppose that there is an entrepreneur who has found several potential sites for a Tasty Sub Shop restaurant. Similar to most existing Tasty Sub restaurant sites, each of the entrepreneur’s sites is a store rental space located in an outdoor shopping area that is close to one or more residential areas. For a Tasty Sub restaurant built on such a site, yearly revenue is known to partially depend on (1) the number of residents living near the site and (2) the amount of business and shopping near the site. Referring to the number of residents living near a site as population size and to the yearly revenue for a Tasty Sub restaurant built on the site as yearly revenue, the entrepreneur will try to predict the dependent (response) variable yearly revenue (y) on the basis of the independent (predictor) variable population size (x). To predict yearly revenue on the basis of population size, the entrepreneur chooses 10 existing Tasty Sub restaurants that are built on sites similar to the sites that the entrepreneur is considering. The entrepreneur then asks the owner of each existing restaurant what the restaurant’s revenue y was last year and estimates—with the help of the owner and published demographic information—the number of residents, or population size x, living near the site. The values of y (measured in thousands of dollars) and x (measured in thousands of residents) that are obtained are given in Table 13.1. In Figure 13.1 we give an Excel output of a scatter plot of y versus x. This plot shows (1) a tendency for the yearly revenues to increase in a straight-line fashion as the population sizes increase and (2) a scattering of points around the straight line. 
A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.
Part 3: The simple linear regression model The simple linear regression model relating y to x can be expressed as follows:
y = β0 + β1x + є
This model says that the values of y can be represented by a mean level, µy = β0 + β1x, that changes in a straight-line fashion as x changes, combined with random fluctuations, described by the error term є, that cause the values of y to deviate from the mean level. Here:
1 The mean level µy = β0 + β1x is the mean yearly revenue corresponding to a particular population size x. That is, noting that different Tasty Sub restaurants could potentially be built near different populations of the same size x, the mean level µy = β0 + β1x is the mean of the yearly revenues that would be obtained by all such restaurants. In addition, because µy = β0 + β1x is the equation of a straight line, the mean yearly revenues that correspond to increasing values of the population size x lie on a straight line. For example, Table 13.1 tells us that 32,300 residents live near restaurant 3 and 45,100 residents live near restaurant 6. It follows that the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 32,300 residents is β0 + β1(32.3). Similarly, the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 45,100 residents is β0 + β1(45.1). Figure 13.2 depicts these two mean yearly revenues as triangles that lie on the straight line µy = β0 + β1x, which we call the line of means. The unknown parameters β0 and β1 are the y-intercept and the slope of the line of means. When we estimate β0 and β1, we will be able to estimate mean yearly revenue µy on the basis of the population size x.
2 The y-intercept β0 of the line of means can be understood by considering Figure 13.2. As illustrated in this figure, the y-intercept β0 is the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of zero residents. However, since it is unlikely that a Tasty Sub restaurant would be built near a population of zero residents, this interpretation of β0 is of dubious practical value. There are many regression situations where the y-intercept β0 lacks a practical interpretation. In spite of this, statisticians have found that β0 is almost always an important component of the line of means and thus of the simple linear regression model.
3 The slope β1 of the line of means can also be understood by considering Figure 13.2. As illustrated in this figure, the slope β1 is the change in mean yearly revenue that is associated with a one-unit increase (that is, a 1,000-resident increase) in the population size x.
4 The error term є of the simple linear regression model accounts for any factors affecting yearly revenue other than the population size x. Such factors would include the amount of business and shopping near a restaurant and the skill of the owner as an operator of the restaurant. For example, Figure 13.2 shows that the error term for restaurant 3 is positive. Therefore, the observed yearly revenue y = 767.2 for restaurant 3 is above the corresponding mean yearly revenue for all restaurants that have x = 32.3. As another example, Figure 13.2 also shows that the error term for restaurant 6 is negative. Therefore, the observed yearly revenue y = 810.5 for restaurant 6 is below the corresponding mean yearly revenue for all restaurants that have x = 45.1. Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in Figure 13.2 are only hypothetical.
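The roles of the mean level and the error term can be sketched with a short simulation. The parameter values below (beta0 = 100, beta1 = 10, sigma = 50) are hypothetical and chosen only for illustration, since the true β0 and β1 are unknown; drawing the error term from a normal distribution with mean zero is also an assumption of this sketch.

```python
import random

def mean_level(beta0, beta1, x):
    """Mean of y at population size x: the point on the line of means."""
    return beta0 + beta1 * x

def simulate_revenues(beta0, beta1, sigma, x, n, seed=0):
    """Draw n yearly revenues y = beta0 + beta1*x + error for one value of x.

    The error term is drawn from a normal distribution with mean 0 and
    standard deviation sigma (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    mu = mean_level(beta0, beta1, x)
    return [mu + rng.gauss(0.0, sigma) for _ in range(n)]

# Hypothetical parameter values, NOT estimates from Table 13.1.
BETA0, BETA1, SIGMA = 100.0, 10.0, 50.0

revenues = simulate_revenues(BETA0, BETA1, SIGMA, x=32.3, n=5)
mu = mean_level(BETA0, BETA1, 32.3)
for y in revenues:
    # A positive error puts y above the line of means, a negative one below.
    print(f"y = {y:7.1f}, mean level = {mu:.1f}, error = {y - mu:+.1f}")
```

Each simulated restaurant shares the same mean level because x is the same; only the error term differs, which is why the observed revenues scatter around the line of means.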
With the Tasty Sub Shop example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations—each observation consists of an observed value of x and its corresponding value of y. Then:
The Simple Linear Regression Model
The simple linear (or straight line) regression model is: y = β0 + β1x + є. Here:
1 µy = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.
2 β0 is the y-intercept. β0 is the mean value of y when x equals zero.
3 β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.
4 є is an error term that describes the effects on y of all factors other than the value of the independent variable x.
This model is illustrated in Figure 13.3 (note that x0 in this figure denotes a specific value of the independent variable x). The y-intercept β0 and the slope β1 are called regression parameters. In addition, we have interpreted the slope β1 to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the effect of the independent variable x on the dependent variable y. However, we cannot prove that a change in an independent variable causes a change in the dependent variable. Rather, regression can be used only to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable. For instance, regression analysis might be used to establish that as liquor sales have increased over the years, college professors’ salaries have also increased. However, this does not prove that increases in liquor sales cause increases in college professors’ salaries. Rather, both variables are influenced by a third variable: long-run growth in the national economy.
The least squares point estimates Suppose that we have gathered n observations (x1, y1), (x2, y2), . . . , (xn, yn), where each observation consists of a value of an independent variable x and a corresponding value of a dependent variable y. Also, suppose that a scatter plot of the n observations indicates that the simple linear regression model relates y to x. In order to estimate the y-intercept β0 and the slope β1 of the line of means of this model, we could visually draw a line (called an estimated regression line) through the scatter plot. Then, we could read the y-intercept and slope off the estimated regression line and use these values as the point estimates of β0 and β1. Unfortunately, if different people visually drew lines through the scatter plot, their lines would probably differ from each other. What we need is the “best line” that can be drawn through the scatter plot. Although there are various definitions of what this best line is, one of the most useful best lines is the least squares line.
To understand the least squares line, we let
ŷ = b0 + b1x
denote the general equation of an estimated regression line drawn through a scatter plot. Here, since we will use this line to predict y on the basis of x, we call ŷ the predicted value of y when the value of the independent variable is x. In addition, b0 is the y-intercept and b1 is the slope of the estimated regression line. When we determine numerical values for b0 and b1, these values will be the point estimates of the y-intercept β0 and the slope β1. Now consider the ith observation (xi, yi) in a sample of n observations. For this observation, the predicted value of the dependent variable y given by an estimated regression line is ŷi = b0 + b1xi. Furthermore, the difference between the observed and predicted values of y, yi − ŷi, is the residual for the observation, and the sum of squared residuals for all n observations is
SSE = Σ (yi − ŷi)² = Σ (yi − (b0 + b1xi))², where each sum runs over i = 1, . . . , n.
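The least squares line is the line that minimizes SSE. To find this line, we find the values of the y-intercept b0 and the slope b1 that give values of ŷi = b0 + b1xi that minimize SSE. These values of b0 and b1 are called the least squares point estimates of β0 and β1. Using calculus, it can be shown that these estimates are calculated as follows:
b1 = SSxy / SSxx and b0 = ȳ − b1x̄
where SSxy = Σ (xi − x̄)(yi − ȳ) = Σ xiyi − (Σ xi)(Σ yi)/n, SSxx = Σ (xi − x̄)² = Σ xi² − (Σ xi)²/n, ȳ = (Σ yi)/n, and x̄ = (Σ xi)/n, with each sum running over i = 1, . . . , n.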
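The calculus step can be sketched as follows. This is the standard textbook derivation rather than one reproduced from the source; it takes SSxy and SSxx to be the usual centered sums of cross products and squares, and it assumes the x values are not all equal, so that SSxx is nonzero.

```latex
\begin{align*}
SSE(b_0, b_1) &= \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
\frac{\partial SSE}{\partial b_0} &= -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial SSE}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0 \\
\text{substituting } b_0: \quad
0 &= \sum_{i=1}^{n} x_i (y_i - \bar{y}) - b_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = \frac{SS_{xy}}{SS_{xx}}
\end{align*}
```

The last step uses the facts that the deviations yi − ȳ sum to zero and the deviations xi − x̄ sum to zero, so replacing xi by xi − x̄ inside each sum changes nothing.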
The following example illustrates how to calculate these point estimates and how to use these point estimates to estimate mean values and predict individual values of the dependent variable. Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also used throughout this article to perform other important calculations.
EXAMPLE 13.2 The Tasty Sub Shop Case
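Part 1: Calculating the least squares point estimates Again consider the Tasty Sub Shop problem. To compute the least squares point estimates of the regression parameters β0 and β1, we first calculate the preliminary summations Σ yi, Σ xi, Σ xi², and Σ xiyi for the 10 observations. Using these summations to compute SSxy and SSxx gives the least squares point estimates b1 = 15.596 and b0 = 183.31, so the least squares line is ŷ = 183.31 + 15.596x.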
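A minimal Python sketch of this calculation is given below. The five data points are made up for illustration and are not the Tasty Sub Shop observations; the function names `least_squares` and `sse` are likewise hypothetical. The sketch uses the summation forms of SSxy and SSxx.

```python
def least_squares(xs, ys):
    """Return (b0, b1), the least squares point estimates for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    ss_xy = sum_xy - sum_x * sum_y / n      # SSxy
    ss_xx = sum_xx - sum_x ** 2 / n         # SSxx
    if ss_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * (sum_x / n)       # b0 = y-bar - b1 * x-bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data, NOT the values in the chapter's tables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = least_squares(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse(xs, ys, b0, b1):.4f}")

# Any other line gives a larger SSE than the least squares line.
print(sse(xs, ys, b0 + 0.1, b1) > sse(xs, ys, b0, b1))
```

Comparing the SSE of the fitted line with the SSE of a slightly shifted line is a quick sanity check that the estimates really do minimize the sum of squared residuals.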
Part 2: Estimating a mean yearly revenue and predicting an individual yearly revenue We define the experimental region to be the range of the previously observed population sizes. Referring to Table 13.2, we see that the experimental region consists of the range of population sizes from 20.8 to 64.6. The simple linear regression model relates yearly revenue y to population size x for values of x that are in the experimental region. For such values of x, the least squares line is the estimate of the line of means. It follows that the point on the least squares line corresponding to a population size of x
ŷ = b0 + b1x
is the point estimate of β0 + β1x, the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of size x. In addition, we predict the error term to be 0. Therefore, ŷ is also the point prediction of an individual value y = β0 + β1x + є, which is the yearly revenue for a single (individual) Tasty Sub restaurant that is built near a population of size x. Note that the reason we predict the error term to be zero is that, because of several regression assumptions, it has a 50 percent chance of being positive and a 50 percent chance of being negative. For example, suppose that one of the business entrepreneur’s potential restaurant sites is near a population of 47,300 residents. Because x = 47.3 is in the experimental region,
ŷ = 183.31 + 15.596(47.3)
= 921.0 (that is, $921,000) is:
1 The point estimate of the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of 47,300 residents.
2 The point prediction of the yearly revenue for a single Tasty Sub restaurant that is built near a population of 47,300 residents.
Figure 13.6 illustrates ŷ = 921.0 as a square on the least squares line. Moreover, suppose that the yearly rent and other fixed costs for the entrepreneur’s potential restaurant will be $257,550 and that (according to Tasty Sub corporate headquarters) the yearly food and other variable costs for the restaurant will be 60 percent of the yearly revenue. Because we predict that the yearly revenue for the restaurant will be $921,000, it follows that we predict that the yearly total operating cost for the restaurant will be $257,550 + 0.6($921,000) = $810,150. In addition, if we subtract this predicted yearly operating cost from the predicted yearly revenue of $921,000, we predict that the yearly profit for the restaurant will be $921,000 − $810,150 = $110,850. Of course, these predictions are point predictions.
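The revenue, cost, and profit arithmetic in this example can be checked with a few lines of Python. All numbers come from the example itself; the function and constant names are hypothetical, and rounding the predicted revenue to the nearest $100 before computing costs mirrors the rounding used in the text.

```python
# Values taken from the worked example in the text.
B0 = 183.31                 # least squares intercept (thousands of dollars)
B1 = 15.596                 # least squares slope (thousands of dollars per thousand residents)
FIXED_COSTS = 257_550.0     # yearly rent and other fixed costs, in dollars
VARIABLE_COST_RATE = 0.60   # variable costs as a fraction of yearly revenue

def predicted_revenue(population_thousands):
    """Point prediction of yearly revenue in dollars, rounded as in the text."""
    y_hat = B0 + B1 * population_thousands   # in thousands of dollars
    return round(y_hat, 1) * 1000.0

def predicted_cost_and_profit(revenue):
    """Return (total operating cost, profit) in dollars for a given revenue."""
    total_cost = FIXED_COSTS + VARIABLE_COST_RATE * revenue
    return total_cost, revenue - total_cost

revenue = predicted_revenue(47.3)
cost, profit = predicted_cost_and_profit(revenue)
print(f"revenue ${revenue:,.0f}, cost ${cost:,.0f}, profit ${profit:,.0f}")
```

Because the cost is a fixed amount plus 60 percent of revenue, the predicted profit is 40 percent of predicted revenue minus the fixed costs, so any error in the revenue prediction carries through to the profit prediction.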
To conclude this example, note that Figure 13.6 illustrates the potential danger of using the least squares line to predict outside the experimental region. In the figure, we extrapolate the least squares line beyond the experimental region to obtain a prediction for a population size of x = 90. As shown in Figure 13.6, for values of x in the experimental region (that is, between 20.8 and 64.6) the observed values of y tend to increase in a straight-line fashion as the values of x increase. However, for population sizes greater than x = 64.6, we have no data to tell us whether the relationship between y and x continues as a straight-line relationship or, possibly, becomes a curved relationship. If, for example, this relationship becomes the sort of curved relationship shown in Figure 13.6, then extrapolating the straight-line prediction equation to obtain a prediction for x = 90 would overestimate mean yearly revenue (see Figure 13.6).
The previous example illustrates that when we are using a least squares regression line, we should not estimate a mean value or predict an individual value unless the corresponding value of x is in the experimental region—the range of the previously observed values of x. Often the value x=0 is not in the experimental region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the estimate of the mean value of y when x equals 0. For example, consider the Tasty Sub Shop problem. Figure 13.6 illustrates that the population size x = 0 is not in the experimental region. Therefore, it would not be appropriate to use b0 = 183.31 as the point estimate of the mean yearly revenue for all Tasty Sub restaurants that could potentially be built near populations of zero residents. Because it is not meaningful to interpret the y-intercept in many regression situations, we often omit such interpretations.
We now present a general procedure for estimating a mean value and predicting an individual value:
Point Estimation and Point Prediction in Simple Linear Regression:
Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then ŷ = b0 + b1x0
1 is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0.
2 is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0. Here we predict the error term to be 0.
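This procedure, including the requirement that x0 lie inside the experimental region, might be sketched as follows. The function name `point_prediction` is hypothetical, and raising an error for values outside the region is one possible design choice, not something the procedure itself prescribes. The estimates and the region endpoints are those of the Tasty Sub Shop example.

```python
def point_prediction(b0, b1, x0, x_min, x_max):
    """Return y-hat = b0 + b1*x0, refusing to extrapolate outside [x_min, x_max].

    The same number serves as the point estimate of the mean value of y
    and as the point prediction of an individual value of y at x0.
    """
    if not (x_min <= x0 <= x_max):
        raise ValueError(
            f"x0 = {x0} is outside the experimental region [{x_min}, {x_max}]"
        )
    return b0 + b1 * x0

# Least squares estimates and experimental region from the example.
B0, B1 = 183.31, 15.596
X_MIN, X_MAX = 20.8, 64.6

print(round(point_prediction(B0, B1, 47.3, X_MIN, X_MAX), 1))

try:
    point_prediction(B0, B1, 90.0, X_MIN, X_MAX)
except ValueError as err:
    # x = 90 lies beyond the observed population sizes, so no prediction is made.
    print(err)
```

Rejecting x0 = 90 and x0 = 0 reflects the two cautions in the text: the straight-line relationship is supported only by data inside the experimental region, and the y-intercept need not have a meaningful interpretation when x = 0 is outside it.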