# Data Analysis: Testing for Association

Edited by Paul Ducham

COVARIATION AND VARIABLE RELATIONSHIPS

Since we are interested in finding out whether two variables describing our customers are related, the concept of covariation is a very useful idea. Covariation is defined as the amount of change in one variable that is consistently related to a change in another variable of interest. For example, if we know that DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs. Another way of stating the concept of covariation is that it is the degree of association between two items (e.g., the attitude toward Starbucks coffee advertising is more favorable among heavy consumers of Starbucks coffee than it is for light consumers). If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions that will improve decision making about advertising and marketing strategies.

One way of visually describing the covariation between two variables is with the use of a scatter diagram. A scatter diagram plots the relative position of two variables using horizontal and vertical axes to represent the variable values. Exhibits 16.1 through 16.4 show some examples of possible relationships between two variables that might show up on a scatter diagram. In Exhibit 16.1, the best way to describe the visual impression left by the dots representing the values of each variable is probably a circle. That is, there is no particular pattern to the collection of dots. Thus, if you take two or three sample values of variable Y from the scatter diagram and look at the values for X, there is no predictable pattern to the values for X. Knowing the values of Y or X would not tell you very much (maybe nothing at all) about the possible values of the other variable. Exhibit 16.1 suggests that there is no systematic relationship between Y and X and that there is very little or no covariation shared by the two variables. If we measured the amount of covariation shared by these two variables, it would be very close to zero.

In Exhibit 16.2, the two variables present a very different picture from that of Exhibit 16.1. There is a distinct pattern to the dots. As the values of Y increase, so do the values of X. This pattern could be described as a straight line or an ellipse (a circle that has been stretched out from both sides). We could also describe this relationship as positive, because increases in the value of Y are associated with increases in the value of X. That is, if we know the relationship between Y and X is a linear, positive relationship, we would know that the values of Y and X change in the same direction. As the values of Y increase, so do the values of X. Similarly, if the values of Y decrease, the values of X should decrease as well. If we try to measure the amount of covariation shown by the values of Y and X, it would be relatively high. Thus, changes in the value of Y are systematically related to changes in the value of X.

Exhibit 16.3 shows the same type of distinct pattern between the values of Y and X, but the direction of the relationship is the opposite of Exhibit 16.2. There is a linear pattern, but now increases in the values of Y are associated with decreases in the values of X. This type of relationship is known as a negative relationship. The amount of covariation shared between the two variables is still high, because Y and X still change together, though in a direction opposite from that shown in Exhibit 16.2. The concept of covariation refers to the amount of shared movement, not the direction of the relationship between two variables.

Finally, Exhibit 16.4 shows a more complicated relationship between the values of Y and X. This pattern of dots can be described as curvilinear. That is, the relationship between the values of Y and the values of X is different for different values of the variables. Part of the relationship is positive (increases in the small values of Y are associated with increases in the small values of X), but then the relationship becomes negative (increases in the larger values of Y are now associated with decreases in the larger values of X).

This pattern of dots cannot be described as a linear relationship. Many of the statistics marketing researchers use to describe association assume the two variables have a linear relationship. These statistics do not perform well when used to describe a curvilinear relationship. In Exhibit 16.4, we can still say the relationship is strong, or that the covariation exhibited by the two variables is strong. But now we can’t talk very easily about the direction (positive or negative) of the relationship, because the direction changes. To make matters more difficult, many statistical methods of describing relationships between variables cannot be applied to situations where you suspect the relationship is curvilinear.
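The point about linear statistics and curvilinear data can be illustrated numerically. Below is a minimal sketch with made-up values: the Pearson correlation is computed by hand for a perfectly linear pattern and for a symmetric inverted-U pattern. The linear statistic reports the curvilinear relationship as essentially zero even though the relationship is strong:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation: covariance / (sd_x * sd_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # straight-line pattern, like Exhibit 16.2
curved = [9 - v * v for v in x]   # inverted-U pattern, like Exhibit 16.4

print(round(pearson_r(x, linear), 3))   # approx. 1.0: strong and linear
print(round(pearson_r(x, curved), 3))   # 0.0: strong, but not linear
```

The second result is the trap described above: a near-zero Pearson coefficient does not prove "no relationship," only "no linear relationship."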

PEARSON CORRELATION COEFFICIENT USING SPSS

The Santa Fe Grill database can help us better understand the Pearson correlation. Assume the owners anticipate that the relationship between satisfaction and likelihood to recommend the restaurant will be significant and positive. Looking at the database variables, you note that information was collected on Likely to Recommend (variable X24) and Satisfaction (variable X22). With SPSS it is easy to compute a Pearson correlation between these two variables and test this assumption.


Before testing the assumption that satisfaction is related to likelihood of recommending, one consideration is “Should this assumption be tested for both restaurants combined or for each restaurant separately?” Logic suggests that examining the restaurants separately is the best approach, since the Santa Fe Grill owners would like to compare this relationship for their restaurant with Jose’s. To split the sample into two groups to compare the restaurants, you can use the options under the Data pull-down menu. For example, to compare the customers of the Santa Fe Grill and the customers of Jose’s Southwestern Café, the click-through sequence is: DATA→SPLIT FILE. Click on Compare Groups, then highlight the fourth screening question (Favorite Mexican Restaurant), move it into the “Groups Based on:” window, and click OK. Your results will now be computed for each restaurant separately.

Now we can test the assumption that satisfaction is related to likelihood of recommending. To do so, the SPSS click-through sequence is ANALYZE→CORRELATE→BIVARIATE, which leads to a dialog box where you select the variables. Transfer variables X22 and X24 into the Variables box. Note that we will use all three default options shown below: Pearson correlation, two-tailed test of significance, and flag significant correlations. Next open the Options box, click on Means and Standard Deviations, and then click Continue. Finally, when you click on OK at the top right of the dialog box, SPSS will execute the Pearson correlation.

The Pearson correlation results are shown in Exhibit 16.6. As you can see in the Correlations table, the correlation between variable X24—Likely to Recommend and X22—Satisfaction is .837 for Jose’s and .776 for the Santa Fe Grill, and the statistical significance of both correlations is .000. Thus, we have confirmed our hypothesis that satisfaction is positively related to likelihood to recommend. When we examine the means of the two variables, we see that satisfaction (4.54 for the Santa Fe Grill) is somewhat higher than likely to recommend (3.61 for the Santa Fe Grill), but the pattern of responses to these questions is very similar. That is, there is covariation between the responses to the two variables. We also know there is room for improvement in both measures, for both restaurants, because they are measured on a 7-point scale and, except for satisfaction with Jose’s, the means are close to the midpoint.

SUBSTANTIVE SIGNIFICANCE

When the correlation coefficient is strong and significant, you can be confident the two variables are associated in a linear fashion. In our Santa Fe Grill example, we can be reasonably confident that likelihood to recommend is in fact related to satisfaction. When the correlation coefficient is weak, two possibilities must be considered: (1) there is no consistent, systematic relationship between the two variables; or (2) the association exists, but it is not linear, and other types of relationships must be investigated further.

When you square the correlation coefficient, you arrive at the coefficient of determination, or r². This number ranges from .00 to 1.00 and shows the proportion of variation in one variable explained or accounted for by another. In our Santa Fe Grill example, the correlation coefficient was .776. Thus, r² = .602, meaning that approximately 60.2 percent of the variation in likelihood to recommend the Santa Fe Grill is associated with the customer’s satisfaction. The larger the coefficient of determination (the closer to 1.00), the stronger the linear relationship between the two variables being examined. In our example, we have accounted for about 60 percent of the variation in likelihood to recommend by relating it to satisfaction.
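The arithmetic here is simple enough to check directly; a one-line sketch using the correlation reported above:

```python
r = 0.776                     # Pearson correlation, Santa Fe Grill
r_squared = r ** 2            # coefficient of determination
print(round(r_squared, 3))    # 0.602
```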

There is a difference between statistical significance and substantive significance. Thus, you need to assess the substantive significance (that is, do the numbers you calculate provide useful information for management?). Since the statistical significance calculation for correlation coefficients depends partly on sample size, it is possible to find statistically significant correlation coefficients that are too small to be of much practical use. This is possible because large samples result in more confidence that a relationship exists, even if it is weak. For example, if we had correlated satisfaction with the likelihood of recommending the Santa Fe Grill to others, and the correlation coefficient was .20 (significant at the .05 level), the coefficient of determination would be .04. Can we conclude that the results are meaningful? It is unlikely they are since the amount of shared variance is only 4 percent. Remember you must always look at both types of significance (statistical and substantive) before you develop conclusions.
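The dependence of statistical significance on sample size can be sketched with the standard t-test for a correlation, t = r√(n−2)/√(1−r²). The sample sizes below are hypothetical; the point is that the same weak r = .20 fails the test in a small sample and passes it easily in a large one:

```python
import math

def t_for_r(r, n):
    """t-statistic for testing H0: the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Same weak correlation (r = .20), two hypothetical sample sizes
print(round(t_for_r(0.20, 50), 2))    # 1.41: below the ~2.01 critical value, not significant at .05
print(round(t_for_r(0.20, 400), 2))   # 4.07: highly significant, yet r-squared is still only .04
```

Statistical significance improves with n; substantive significance (the 4 percent of shared variance) does not.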

MEASUREMENT SCALES AND CORRELATION ANALYSIS

A common occurrence in marketing research studies is that the questions marketing researchers are most interested in can be measured only with ordinal or nominal scales. For example, if we are interested in learning more about Starbucks coffee consumption, we might consider consumption patterns of female versus male coffee drinkers. In these cases, applying the Pearson correlation coefficient to the data and assuming the gender measures have interval or ratio scale properties (when they do not) will produce misleading or overstated results.

What options are available to the researcher when ordinal scales are used to collect data, or when the data simply cannot be measured with an interval scale or better? The Spearman rank order correlation coefficient is the recommended statistic to use when two variables have been measured using ordinal scales. If either one of the variables is represented by rank order data, the best approach is to use the Spearman rank order correlation coefficient, rather than the Pearson product moment correlation.

In addition to the Spearman rank order correlation coefficient, there are other correlation coefficients that take into consideration the scale properties inherent in the data. For example, if you think the gender of your customers makes a difference in the amount of your product they purchase, it would be possible to correlate customer gender (male/female) with product purchases (dollars) to answer your question. To do so, you would use a point-biserial correlation coefficient, which is designed for a true dichotomy paired with a metric variable. You must use the appropriate statistic to match the characteristics of your data, and there are formulas available to calculate almost any type of correlation coefficient to match the situation.

SPEARMAN RANK ORDER CORRELATION

The Santa Fe Grill customer survey collected data that ranked four restaurant selection factors. These data are represented by variables X26 to X29. Management is interested in knowing whether “Food Quality” is a significantly more important selection factor than is “Service.” Since these are ordinal (ranking) data, the Pearson correlation is not appropriate. The Spearman correlation is the appropriate coefficient to calculate. Variables X27—Food Quality and X29—Service are the variables we will use.

To conduct this analysis, we will use the database for the two restaurants combined. This is based on the logic that the owners are interested in knowing the results for dining customers in general and not for the two restaurants’ customers separately. The SPSS click-through sequence is ANALYZE→CORRELATE→BIVARIATE, which leads to a dialog box where you select the variables. Transfer variables X27 and X29 into the Variables box. You will note the Pearson correlation is the default along with the two-tailed test of significance, and flag significant correlations. “Unclick” the Pearson correlation and click on Spearman. Then click on OK at the top right of the dialog box to execute the program.

The SPSS results for the Spearman correlation are shown in Exhibit 16.7. As you can see in the Correlations table, the correlation between variable X27—Food Quality and X29—Service is –.130, and the significance value is .009. Thus, we have confirmed that there is a statistically significant relationship between the two restaurant selection factors, although the correlation is very small. The negative correlation indicates that a customer who ranks food quality high in importance tends to rank service significantly lower. Thus, these customers select restaurants to eat in more on the basis of food quality than service.
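Conceptually, the Spearman coefficient is a Pearson correlation computed on ranks; for tie-free rankings it reduces to the shortcut formula ρ = 1 − 6Σd²/(n(n²−1)). The sketch below uses made-up, tie-free ranks for five hypothetical cases (not the Exhibit 16.7 data) to reproduce the kind of negative coefficient reported above:

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for tie-free ranks: 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks for five cases on two variables (1 = highest rank)
food_quality = [1, 2, 3, 4, 5]
service      = [5, 4, 1, 2, 3]

print(round(spearman_rho(food_quality, service), 2))   # -0.6: high ranks on one go with low ranks on the other
```

Note that real ranking data usually contain ties, in which case SPSS computes ρ as a Pearson correlation on average ranks rather than with this shortcut.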

CALCULATING MEDIAN RANKINGS USING SPSS

To better understand the Spearman correlation findings, we need to calculate the median rankings of the four selection factors. To do this, the SPSS click-through sequence is ANALYZE→DESCRIPTIVE STATISTICS→FREQUENCIES. Click on variables X26–X29 to highlight them and then on the arrow box for the Variables box to use them in your analysis. We use all four selection factors because this will enable us to examine the overall relative rankings of all the restaurant selection factors. Next, open the Statistics box and click on median and mode, and then Continue. For the Charts and Format options we will use the defaults, so click on OK to execute the program.

The SPSS results for median rankings are shown in the Statistics table in Exhibit 16.8. Recall that medians are descriptive statistics and can only be used to describe respondents. Since the four selection factors were ranked from 1 to 4, with 1 most important and 4 least important, the variable with the lowest median is ranked the highest and is the most important, and the variable with the highest median is the least important. Food quality is ranked as the most important (median = 1.0), while atmosphere and service are the least important. The modal response for atmosphere is 3 and for service it is 2. Thus, since the median is the same for the two variables, we can use the mode to break the median tie and conclude that atmosphere is ranked as less important than service. Moreover, the calculated Spearman rank correlation compared food quality (median = 1.0) with service (median = 3.0), so food quality is significantly more important in restaurant selection than service.
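The tie-breaking logic can be sketched with Python's statistics module on made-up rankings (the values below are hypothetical, not the Exhibit 16.8 data): two variables with the same median, where the mode breaks the tie.

```python
from statistics import median, mode

# Hypothetical importance ranks from five respondents (1 = most important)
atmosphere = [3, 3, 4, 2, 3]
service    = [2, 4, 2, 3, 4]

print(median(atmosphere), mode(atmosphere))   # 3 3
print(median(service), mode(service))         # 3 2 -- same median, lower (more important) mode
```

With tied medians of 3, the lower mode for service (2 vs. 3) supports ranking service as the more important of the two. (When counts are tied, `statistics.mode` returns the first mode encountered in the data.)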

REGRESSION ANALYSIS

A fundamental basis of regression analysis is the assumption of a straight line relationship between the independent and dependent variables. This relationship is illustrated in Exhibit 16.9. The general formula for a straight line is:

Y = a + bX + eᵢ

where

Y = the dependent variable

a = the intercept (the point where the straight line intersects the Y-axis when X = 0)

b = the slope (the change in Y for every 1-unit change in X)

X = the independent variable used to predict Y

eᵢ = the error for the prediction

In regression analysis, we examine the relationship between the independent variable X and the dependent variable Y. To do so, we use the known values of X and Y and the computed values of a and b. The calculations are based on the least squares procedure. The least squares procedure determines the best-fitting line by minimizing the sum of the squared vertical distances of all the points from the line, as shown in Exhibit 16.10. The best-fitting line is the regression line. Any point that does not fall on the line is the result of unexplained variance, or the variance in Y that is not explained by X. This unexplained variance is called error and is represented by the vertical distance between the regression line and the points not on the line. The distances of all the points not on the line are squared and added together to determine the sum of the squared errors, which is a measure of the total error in the regression.
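The least squares estimates have closed forms: b = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)² and a = ȳ − b·x̄. A self-contained sketch with made-up data:

```python
def ols_fit(x, y):
    """Least-squares slope and intercept for the line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.0]        # hypothetical values
a, b = ols_fit(x, y)
print(round(a, 2), round(b, 2))      # 1.09 0.97
```

Here the fitted line predicts that Y rises by about .97 units for each 1-unit increase in X, which is exactly how the SPSS regression coefficients later in this chapter are read.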

After we compute the values of a and b, we must test their statistical significance. The calculated a (intercept) and b (slope) are sample estimates of the true population parameters α (alpha) and β (beta). The t-test is used to determine whether the computed intercept and slope are significantly different from zero. In the SPSS regression examples, the significance of these tests is reported in the Sig. column for each of these coefficients. The a is referred to as the “Constant,” and the b is associated with each independent variable.

In the case of bivariate regression analysis, we are looking at one independent variable and one dependent variable. Managers frequently want to look at the combined influence of several independent variables on one dependent variable. For example, are DVD purchases related only to age, or are they also related to income, ethnicity, gender, geographic location, education level, and so on? Similarly, referring to the Santa Fe Grill database, we might ask whether customer satisfaction is related only to perceptions of the restaurant’s food taste (X18), or is satisfaction also related to perceptions of friendly employees (X12), reasonable prices (X16), and speed of service (X21)? Multiple regression is the appropriate technique to measure these relationships. We discuss bivariate or simple regression analysis before moving on to multiple regression analysis.

ESTIMATING REGRESSION COEFFICIENT

The regression equation examining the relationship between two variables is based on the equation for a straight line. The slope coefficient b tells us how much we can expect Y to change, given a 1-unit change in X. Once this equation is developed from sample data, we can use it to make predictions about Y, given different values of X.

Regression uses an estimation procedure called ordinary least squares (OLS) that guarantees the line it estimates will be the best-fitting line. We said earlier that the best prediction would be one in which the difference between the actual value of Y and the predicted value of Y was the smallest. Ordinary least squares is a statistical procedure that results in equation parameters (a and b) that produce predictions with the lowest sum of squared differences between actual and predicted values.

Error in Regression

The differences between actual and predicted values of Y are represented by e (the error term of the regression equation). If we square these errors for each observation (the difference between actual values of Y and predicted values of Y) and add them up, the total would represent an aggregate or overall measure of the accuracy of the regression equation.

Regression equations calculated using ordinary least squares will always give the lowest squared error totals, and this is why both bivariate and multiple regression analysis are sometimes referred to as OLS regression.

Besides enabling the researcher to evaluate the quality of the regression equation prediction, the error terms also are used to diagnose problems that arise when the regression analysis assumptions are not met. The pattern of errors produced by comparing actual Y values with predicted Y values can tell you whether the errors are normally distributed and/or have equal variances across the range of X values. Exhibits 16.11, 16.12, and 16.13 show several possible patterns of residuals (another term for the error between actual and predicted Y values).

In Exhibit 16.11, there is no discernible pattern to the error terms when you plot the predicted values against the residuals. In Exhibit 16.12, there is an apparent pattern. The predictions made for small values of Y are more precise (the spread is narrower) than the predictions made for large values of Y (the spread is wider). As a result, the regression equation is more accurate for some values of the independent variable X than for others. There are transformation techniques that can be applied to the data, which may reduce this problem. These techniques are beyond the scope of this text.

Exhibit 16.13 portrays a pattern in the error terms that suggests a nonlinear relationship between Y and X. In this case, the researcher’s initial assumption that a straight line would be the best way to describe the potential relationship may need to be changed. The best approach may be a technique based on a nonlinear relationship.

Examination of the error terms and the pattern obtained by comparing the predicted values of Y against the residuals can tell us whether our initial assumptions about the appropriateness of using regression analysis to examine variable relationships are correct. This type of evidence also can suggest other types of analysis given the characteristics of the data. While other statistical techniques may occasionally be more appropriate, regression analysis is widely used, very robust, and often an appropriate statistical technique for analysis of metric data.
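A residual-pattern check of the kind described here can be sketched in a few lines: fit a straight line to deliberately curvilinear (made-up) data and inspect the residuals. The U-shaped sign pattern, positive at the ends and negative in the middle, is the telltale of a nonlinear relationship:

```python
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]            # deliberately curvilinear data

# ordinary least squares fit of a straight line
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(residuals)    # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0] -- a clear U shape
```

If the straight-line assumption held, these residuals would show no systematic pattern; the U shape signals that a curvilinear model would fit better.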

BIVARIATE REGRESSION WITH SPSS

Suppose the owners of the Santa Fe Grill want to know if more favorable perceptions of their prices are associated with higher customer satisfaction. The obvious answer would be “of course they would.” But how much improvement would be expected in customer satisfaction if the owners improved the perceptions of prices? Moreover, does the relationship between prices and satisfaction differ between the Santa Fe Grill and Jose’s Southwestern Café? Bivariate regression provides answers to these questions.

In the Santa Fe Grill database X22 is a measure of customer satisfaction, with 1=Not Satisfied At All and 7=Highly Satisfied. Variable X16 is a measure of whether the respondents perceive the restaurants’ prices as reasonable (1=Strongly Disagree, 7= Strongly Agree). The null hypothesis is there is no relationship between X22—Satisfaction and X16—Reasonable Prices. The alternative hypothesis is that X22 and X16 are significantly related. To complete this analysis, we need to split the sample and analyze customers of the Santa Fe Grill and Jose’s separately, as explained before.

After you split the sample, the SPSS click-through sequence is ANALYZE→ REGRESSION→LINEAR. Click on X22—Satisfaction and move it to the Dependent Variable box. Click on X16—Reasonable Prices and move it to the Independent Variables box. We use the defaults for the other options so click OK to run the bivariate regression.

Exhibit 16.14 contains the results of the bivariate regression analysis. The table labeled Model Summary has three types of “Rs” in it. The R on the far left is the correlation coefficient (.157 for Jose’s; .479 for the Santa Fe Grill). The R-square for Jose’s is very small (.025)—you get it by squaring the correlation coefficient (.157) for this regression model. The R-square for the Santa Fe Grill is moderately strong (.230)—you get it by squaring the correlation coefficient (.479) for this regression model. The R-square shows the percentage of variation in one variable that is accounted for by another variable. In this case, customer perceptions of the Santa Fe Grill’s prices account for 23.0 percent of the total variation in customer satisfaction with the restaurant. We also look at the Std. Error of the Estimate—a measure of the accuracy of the predictions of the regression equation. The smaller the standard error of the estimate, the better the fit of the regression line and therefore the better the predictive power of the regression.

The ANOVA table shows the F-ratio for the regression models and the associated statistical significance. The F-ratio is calculated the same way for regression analysis as it is for the ANOVA techniques. The variance in X22—Customer Satisfaction that is associated with X16—Reasonable Prices is referred to as explained variance. The remainder of the total variance in X22 that is not associated with X16 is referred to as unexplained variance. The F-ratio compares the amount of explained variance to the unexplained variance. The larger the F-ratio, the more variance in the dependent variable that is associated with the independent variable. In our example, the F-ratio for the Santa Fe Grill is 58.127. The statistical significance is .000—the “Sig.” value on the SPSS output—so we can reject the null hypothesis that no relationship exists between the two variables for the Santa Fe Grill. In contrast, there is not a statistically significant relationship (.054) for Jose’s, although it is very close. Even if it were significant, the correlation for the two variables for Jose’s is so small that the relationship is not meaningful.

The Coefficients table shows the regression coefficient for X16 (reasonable prices). The column labeled Unstandardized Coefficients indicates the Santa Fe Grill unstandardized regression coefficient for X16 is .347. The column labeled Sig. shows the statistical significance of the regression coefficient for X16, as measured by the t-test. The t-test examines the question of whether the regression coefficient is different enough from zero to be statistically significant. The t-statistic is calculated by dividing the regression coefficient by its standard error (labeled Std. Error in the Coefficients table). Dividing .347 by .040 gives a t-value of approximately 8.7 (SPSS reports 8.657 because it works with unrounded coefficients), which is significant at the .000 level.

The Coefficients table also shows the result for the Constant component in the regression equation. This item is a term in the equation for a straight line we discussed earlier. It is the Y-intercept, or the value of Y when X is 0. If the independent variable takes on a value of 0, the dependent measure (X22) would have a value of 2.991. Considering only the Santa Fe Grill results, and combining the results of the Coefficients table into a regression equation, we have:

Predicted value of X22 = 2.991 + .347 × (value of X16) ± .881 (standard error of the estimate)

For the Santa Fe Grill, the relationship between customer satisfaction and reasonable prices is positive and moderately strong. The regression coefficient for X16 is interpreted as “For every unit that X16 (the rating of reasonable prices) increases, X22 (satisfaction) will increase by .347 units.” Recall the Santa Fe Grill owners asked: “If the prices in our restaurant are perceived as being reasonable, will this be associated with improved customer satisfaction?” The answer is yes, because the model was significant at the .000 level. But how closely are they related? For every unit increase in X16, X22 goes up .347 units.
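Plugging values into the fitted equation makes this interpretation concrete. A small sketch using the coefficients from Exhibit 16.14 (the input ratings of 4 and 5 are hypothetical):

```python
def predict_satisfaction(x16):
    """Santa Fe Grill model from Exhibit 16.14: predicted X22 = 2.991 + .347 * X16."""
    return 2.991 + 0.347 * x16

print(round(predict_satisfaction(4), 3))   # 4.379
print(round(predict_satisfaction(5), 3))   # 4.726 -- one rating point higher raises prediction by .347
```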

One additional note must be mentioned. The Coefficients table contains a column labeled Standardized Beta Coefficients. In a bivariate regression, this number is the same as the correlation coefficient (similarly, with only one independent variable, a squared standardized beta coefficient is the same as the coefficient of determination). However, when there are several independent variables, as in multiple regression, the standardized coefficients represent the relative contribution of each of the independent variables.

SIGNIFICANCE

Once the statistical significance of the regression coefficients is determined, we have answered the first question about our relationship: “Is there a relationship between our dependent and independent variables?” In this case, the answer is yes. But recall our discussion of statistical versus substantive significance. The logic of that discussion also applies when we evaluate whether regression coefficients are meaningful. A second question to ask is: “How strong is that relationship?” The output of regression analysis includes the coefficient of determination, or r², which describes the amount of variation in the dependent variable associated with the variation in the independent variable. The regression r² also tells you what percentage of the total variation in your dependent variable you can explain by using the independent variable. The r² measure varies between .00 and 1.00, and is calculated by dividing the amount of variation you have been able to explain with your regression equation by the total variation in the dependent variable. In the Santa Fe Grill example that examined the relationship between reasonable prices and satisfaction, the r² was .23. That means approximately 23.0 percent of the variation in customer satisfaction is associated with the variation in respondents’ perceptions of the reasonableness of prices. Remember, we cannot say that perceptions about the reasonableness of prices cause changes in satisfaction, only that changes in perceived prices tend to be reliably associated with changes in satisfaction.

When examining the substantive significance of a regression equation, you should look at the size of the r² for the regression equation and the size of the regression coefficient. The regression coefficient may be statistically significant, but still relatively small, meaning that your dependent measure won’t change very much for a given unit change in the independent measure. In our Santa Fe Grill example, the unstandardized regression coefficient was .347, which is a moderately strong relationship. When regression coefficients are significant but small, we say a relationship is present in our population, but that it is weak. In this case, the Santa Fe Grill owners have confirmed a relationship, but they still need to consider additional independent variables that will help them to better understand and predict customer satisfaction.

STATISTICAL SIGNIFICANCE

After the regression coefficients have been estimated, you must examine the statistical significance of each coefficient. This is done in the same manner as with bivariate regression. Each regression coefficient is divided by its standard error to produce a t-statistic, which is compared against the critical value to determine whether the null hypothesis can be rejected. The basic question is still the same: “What is the probability we would get a coefficient of this size if the real regression coefficient in the population were zero?” You should examine the t-test statistics for each regression coefficient. Many times, not all the independent variables in a regression equation will be statistically significant. If a regression coefficient is not statistically significant, that means the independent variable does not have a reliable relationship with the dependent variable and the slope describing that relationship is essentially flat (i.e., the value of the dependent variable does not change reliably as the value of the statistically insignificant independent variable changes).

When using multiple regression analysis, it is important to examine the overall statistical significance of the regression model. The amount of variation in the dependent variable you have been able to explain with the independent measures is compared with the total variation in the dependent measure. This comparison results in a statistic called a model F-statistic, which is compared against a critical value to determine whether or not to reject the null hypothesis. If the F-statistic is statistically significant, it means the chances of the regression model for your sample producing a large r² when the population r² is actually 0 are acceptably small.
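The model F-statistic can be computed directly from R², the sample size n, and the number of predictors k: F = (R²/k) / ((1 − R²)/(n − k − 1)). In the sketch below, the R² of .23 is taken from the Santa Fe Grill example, but the sample size of 200 is an assumption made purely to illustrate the calculation:

```python
def model_f(r2, n, k):
    """Model F-statistic: explained variance per predictor over unexplained variance per df."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# R-square of .23 as in the Santa Fe Grill example; n = 200 is a hypothetical sample size
print(round(model_f(0.23, 200, 1), 2))   # 59.14
```

A large F relative to the critical value (here, far above roughly 3.9 for these degrees of freedom at the .05 level) leads to rejecting the null hypothesis that the model explains nothing.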

SUBSTANTIVE SIGNIFICANCE

Once we have estimated the regression equation, we need to assess the strength of the association. The multiple r2, or multiple coefficient of determination, describes the strength of the relationship between all the independent variables in our equation and the dependent variable. Like the r2 in correlation analysis, the coefficient of determination is a measure of the amount of variation in the dependent variable associated with the variation in the independent variable. In the case of multiple regression analysis, the r2 measure shows the amount of variation in the dependent variable associated with (or explained by) all of the independent variables considered together.

The larger the r2 measure, the more of the variation in the dependent measure is associated with the independent measures we are using to predict it. For example, if the multiple r2 in our Canon copier example above were .78, it would mean that we can account for, or explain, 78 percent of the change in sales revenue by examining the variables sales force size, advertising budget, and customer attitudes toward our copier products. Higher values for r2 mean stronger relationships between the group of independent variables and the dependent measure. As before, the measure of the strength of the relationship between an individual independent variable and the dependent measure of interest is shown by the regression coefficient or the beta coefficient for that variable.
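A minimal sketch of what the multiple r2 measures, using invented numbers: the share of total variation in the dependent variable that is not left over as prediction error.

```python
# r2 = 1 - (unexplained variation / total variation); data are invented.
def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)                # total variation
    ss_error = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    return 1 - ss_error / ss_total

# Invented example: predictions that track the actual values closely,
# so most of the variation is explained.
print(r_squared([2, 4, 6, 8], [2.5, 3.5, 6.5, 7.5]))
```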

To summarize, the elements of a multiple regression model to examine in determining its significance include the r2; the model F-statistic; the individual regression coefficients for each independent variable; their associated t-statistics; and the individual beta coefficients. The appropriate procedure to follow in evaluating the results of a regression analysis is: (1) assess the statistical significance of the overall regression model using the F-statistic and its associated probability; (2) evaluate the obtained r2 to see how large it is; (3) examine the individual regression coefficients and their t-statistics to see which are statistically significant; and (4) look at the beta coefficients to assess relative influence. Taken together, these elements give you a comprehensive picture of the answers to our three basic questions about the relationships between your dependent and independent variables.

MULTIPLE REGRESSION USING SPSS

Regression can be used to examine the relationship between a single metric dependent variable and one or more metric independent variables. If you examine the Santa Fe Grill database you will note that the first 21 variables are metric independent variables. They include lifestyle variables and perceptions of the two restaurants, measured using a 7-point Likert-type rating scale with 7 representing the positive dimension and 1 the negative dimension. Variables X22, X23, and X24 are metric dependent variables measured on a 7-point Likert-type rating scale. Variables X25—Frequency of Patronage, X30—Distance Driven, X31—Ad Recall, and X32—Gender are nonmetric. Variables X26 to X29 also are nonmetric variables because they are ranking data and cannot therefore be used in regression.

One relationship to examine with multiple regression would be to see if customers’ perceptions of their experience eating at the Santa Fe Grill are related to satisfaction. Variables X12 to X21 in the database measure customer perceptions of dining experiences in the restaurant. As a specific example, let’s use X22—Satisfaction as the single metric dependent variable, and X12—Friendly Employees, X13—Fun Place to Eat, and X15— Fresh Food as three metric independent variables. The null hypothesis would be that there is no relationship between the three independent variables and X22—Satisfaction. The alternative hypothesis would be that X12, X13, and X15 are significantly related to X22— Customer Satisfaction.

To test this hypothesis, let’s look only at the experiences of the Santa Fe Grill customers. To do so, we need to select the customer responses for the Santa Fe Grill from the sample and analyze them separately. This can be done using the “Select Cases” option under the Data pull-down menu. For example, to select customers from only the Santa Fe Grill, the click-through sequence is DATA→SELECT CASES→IF CONDITION IS SATISFIED→IF. Next, highlight x_s4 Favorite Mexican restaurant and move it into the window, click the = sign and then 1. This instructs the SPSS software to select only questionnaires coded 1 in the x_s4 column (the fourth screening question on the survey), which is the Santa Fe Grill. If you wanted to analyze only the Jose’s Southwestern Café respondents, then follow the same steps except after the = sign put a 0.
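Outside of SPSS, the same case selection amounts to filtering rows on the screening variable. A minimal Python sketch with invented records (the field names and values are assumptions for illustration, not the actual database layout):

```python
# Invented respondent records; x_s4 == 1 marks Santa Fe Grill customers.
respondents = [
    {"x_s4": 1, "x22_satisfaction": 6},
    {"x_s4": 0, "x22_satisfaction": 4},   # Jose's Southwestern Cafe customer
    {"x_s4": 1, "x22_satisfaction": 7},
]

# Keep only the Santa Fe Grill cases, as "Select Cases ... If x_s4 = 1" would.
santa_fe = [r for r in respondents if r["x_s4"] == 1]
print(len(santa_fe))  # → 2
```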

After you select the Santa Fe Grill customers, the SPSS click-through sequence to examine this relationship is ANALYZE→REGRESSION→LINEAR. Highlight X22 and move it to the Dependent Variables box. Highlight X12, X13, and X15 and move them to the Independent Variables box. We will use the defaults for the other options so click OK to run the multiple regression.

The SPSS output for the multiple regression is shown in Exhibit 16.15. The Model Summary table shows the R-square for this model is .602. This means that 60.2 percent of the variation in satisfaction (dependent variable) can be explained from the three independent variables. The results in the ANOVA table indicate that the overall model is significantly different from zero (F-ratio = 125.749; probability level (“Sig.”) = .000). This probability level means the chance that these sample results came from a population where the R-square is actually zero is less than .001. That is, there is less than 1 chance in 1,000 that the true R-square is zero.

To determine if one or more of the independent variables is a significant predictor of the dependent variable satisfaction we examine the information provided in the Coefficients table. Looking at the Standardized Coefficients Beta column reveals that X12—Friendly Employees has a beta coefficient of .471 which is significant (.000). Similarly, X15—Fresh Food has a beta coefficient of .722 (Sig = .000). The Beta for X13—Fun Place to Eat is –0.020 and not significant (.655). These findings indicate that we can reject the null hypothesis that the independent variables are not related to X22—Customer Satisfaction with the Santa Fe Grill, at least for two of the variables. Thus, this regression analysis tells us that customer perceptions of the friendliness of employees (X12) and food freshness (X15) in the Santa Fe Grill are predictors of the level of satisfaction with the restaurant. Insignificant Beta coefficients (X13) are not interpreted.
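The standardized Beta coefficients in the Coefficients table are the unstandardized coefficients rescaled so that predictors measured on different scales become comparable. A minimal sketch, with an invented b value and invented data (not the Santa Fe Grill figures):

```python
import statistics

# beta = b * (s_x / s_y): an unstandardized coefficient b rescaled by the
# standard deviations of the predictor (s_x) and dependent variable (s_y).
def standardized_beta(b, x_values, y_values):
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)

# Invented example: b = 2.0, predictor sd = 1.0, dependent sd = 2.0.
print(standardized_beta(2.0, [1, 2, 3], [2, 4, 6]))  # → 1.0
```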

Examination of the SPSS output reveals a good deal of information that we did not discuss. Experts in statistics may use this information, but managers typically do not. One of the challenges for you will be to learn which information from the SPSS output is most important to analyze and present in a report. At this point we recommend that you start with simple problems and learn from there. For example, the next problem you may want to examine is changing the dependent variable from X22—Satisfaction to X23—Likely to Return, and running the regression with the same independent variables. Another possibility is to keep X22—Satisfaction as the dependent variable and use either the lifestyle variables or the other restaurant perceptions as independent variables. By doing this you will learn how to use the SPSS package and also see if any relationships exist between the variables. Have fun!

MULTICOLLINEARITY USING SPSS

The Santa Fe Grill can be used as an example to demonstrate the potential problems of multicollinearity among regression independent variables. First, select only the Santa Fe Grill customers to analyze, as described earlier. After selecting them, the click-through sequence is ANALYZE→REGRESSION→LINEAR. Highlight X22 and move it to the Dependent Variable box. Next, highlight the three food perceptions variables—X15, X18, and X20—and move them to the Independent Variables box. In the Methods box we will keep Enter, which is the default. Next, click on the Statistics button below and keep Estimates in the Regression Coefficients box and Model Fit as defaults (already checked). Now click Collinearity Diagnostics, then Continue, and OK.

The results are shown in Exhibit 16.16. First, note that the R<sup>2</sup> for this regression model is .417 and it is significant at the .000 level. Next, we look at the information in the Coefficients table. Looking at the Standardized Coefficients Beta column reveals that X15— Fresh Food has a beta coefficient of .767 which is significant (.000). Similarly, X18—Food Taste has a beta coefficient of –0.267 (Sig. = .002). The Beta for X20—Food Temperature is 0.096 and not significant (.191). These findings indicate that we can reject the null hypothesis that the independent variables are not related to X22—Customer Satisfaction with the Santa Fe Grill, at least for two of the variables. Thus, this regression analysis tells us that customer perceptions of food freshness (X15) and food taste (X18) at the Santa Fe Grill are predictors of the level of satisfaction with the restaurant. The Beta coefficient for the third independent variable (X20) is not significant (.191) and therefore not interpreted.

A word of caution is needed at this point regarding the interpretation of Beta coefficients in regression. Recall that the size of the individual coefficients shows how strongly each independent variable is related to the dependent variable. The signs (negative or positive) also are important. A positive sign indicates a positive relationship (higher independent variable values are associated with higher dependent variable values). A negative sign indicates a negative relationship. But when multicollinearity is present among the independent variables, the beta coefficients and signs must be interpreted very cautiously.

The negative sign of X18—Excellent Food Taste (–0.267, sig. = .002) suggests that less favorable perceptions of food taste are associated with higher levels of satisfaction. This result is clearly not logical and points out one of the weaknesses of multiple regression. When the independent variables are highly correlated with each other, the signs of the beta coefficients may be reversed in a regression model, which happened in this case.

Because multicollinearity creates problems when using regression, you must always examine the logic of the signs for the regression betas. When a relationship is different from what is anticipated (i.e., wrong sign), one must look at a simple bivariate correlation of the two variables. For example, in Exhibit 16.17 the information clearly shows that satisfaction and food taste (X18) are positively correlated (.393) and the relationship is statistically significant (.000). Moreover, the information in the exhibit also shows that food temperature (X20) is indeed positively and significantly related to satisfaction (.430, sig. = .000). Thus, in addition to reversing the signs of the coefficients, the regression model may indicate that a variable is not significant when in fact it is—if multicollinearity is present.

How can we anticipate and deal with multicollinearity among independent variables? One way is to look at the bivariate correlations, as shown in Exhibit 16.17. Note that the correlations between the three food variables range from .686 to .770. These high correlations (multicollinearity) between the three independent variables are the reason for the problems in interpreting the Betas in this regression model. Using this type of information enables you to confirm the true relationship when multicollinearity is present.

Another way to assess multicollinearity is to request collinearity diagnostics from the SPSS software. Look back at Exhibit 16.16. In the columns labeled Tolerance and VIF, under the heading Collinearity Statistics on the right side of the Coefficients table, there are some measures we need to consider. These are both measures of multicollinearity among the independent variables (VIF stands for variance inflation factor). These measures— Tolerance and VIF—show the degree to which each independent variable is explained by the other independent variables and can be used to assess whether multicollinearity is likely a problem in regression.

Multicollinearity is indicated by the sizes of the Tolerance and VIF values. The rules of thumb or “threshold sizes” for Tolerance and VIF can vary, depending on the sample size and other issues. The following guidelines are based on the author’s experiences in the application of regression models for samples in the range of 200 to 500 observations. For Tolerance, values smaller than .75 generally indicate multicollinearity is likely a problem. In contrast, if the tolerance value is larger than .75, multicollinearity is likely not a problem. These guidelines can be extended to the simple bivariate correlations of the independent variables: if bivariate correlations are .50 or larger, multicollinearity is likely a problem in regression models.

The VIF is the inverse of Tolerance (VIF = 1/Tolerance). Using a similar threshold value as with Tolerance, the maximum threshold for VIF is typically a value of 1.33 (1/.75). Thus, values larger than 1.33 indicate multicollinearity is likely a problem, and values below 1.33 indicate multicollinearity is not a problem. But researchers should use their own judgment in choosing criteria for Tolerance and VIF, as the levels suggested here may differ in other research contexts. In the Santa Fe Grill output, the tolerance for X15 is .371, for X18 it is .337, and X20 tolerance is .438. Moreover, the VIF for all three variables is above 2.0. Since the tolerance value is substantially below .75 and the VIF is much larger than 1.33, we conclude that multicollinearity among the independent variables is definitely a serious problem for this regression model. Therefore, action is necessary if the researcher wants to use several independent variables exhibiting multicollinearity similar to that of the Santa Fe Grill database.
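For the two-predictor case, Tolerance and VIF can be sketched directly from the bivariate correlation between the predictors (with more predictors, each independent variable is regressed on all of the others instead). A minimal illustration with invented ratings:

```python
# Pearson correlation computed from scratch; all data here are invented.
def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def tolerance_and_vif(x1, x2):
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2   # share of x1 NOT explained by x2
    vif = 1 / tolerance      # variance inflation factor = 1 / Tolerance
    return tolerance, vif

# Invented ratings with correlation .80: tolerance .36 and VIF about 2.8,
# which the thresholds discussed above would flag as multicollinearity.
tol, vif = tolerance_and_vif([1, 2, 3, 4], [1, 3, 2, 4])
print(round(tol, 2), round(vif, 2))  # → 0.36 2.78
```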

To avoid the problem of multicollinearity in regression, you should examine the correlations between the independent variables ahead of time. If they are too high (> .50), then you should consider removing one or more of the highly correlated variables. To do so, you can select one of your independent variables as a representative variable (proxy variable) for the highly correlated variables and use only that variable in your regression model. But there are other solutions to consider that overcome the problem of multicollinearity. A second solution is to combine the variables (in our example, variables X15, X18, and X20) into a summated composite variable. The third solution is to apply Factor Analysis to the set of independent variables. These last two solutions are considered better for most regression models.
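The summated composite remedy can be sketched in a few lines: replace the highly correlated predictors with the average of each respondent's ratings and use that single score in the regression. The variable names and ratings below are invented for illustration:

```python
# Average several ratings (e.g., X15, X18, X20) into one composite score
# per respondent; the ratings below are invented.
def summated_composite(*ratings):
    return [sum(vals) / len(vals) for vals in zip(*ratings)]

x15_fresh_food = [7, 6, 5]   # respondents 1, 2, 3
x18_food_taste = [6, 6, 4]
x20_food_temp  = [5, 6, 6]

# One composite "food quality" score per respondent.
print(summated_composite(x15_fresh_food, x18_food_taste, x20_food_temp))  # → [6.0, 6.0, 5.0]
```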

DUMMY VARIABLES IN REGRESSION USING SPSS

To see how multiple regression works with dummy variables, let’s again use the customer responses for the Santa Fe Grill. Let’s examine the relationship between satisfaction of customers and X12—Friendly Employees, X17—Attractive Interior, and the dummy variable Gender (X32). The null hypothesis would be that X22—Customer Satisfaction is not related to X12, X17, or X32.

After you select the Santa Fe Grill customers, the SPSS click-through sequence is ANALYZE→REGRESSION→LINEAR. Click on X22—Satisfaction and move it to the Dependent Variables box. Highlight X12—Friendly Employees, X17—Attractive Interior, and X32—Gender and move them to the Independent Variables box. Now click OK to run the multiple regression.

The SPSS results are shown in Exhibit 16.18. In the Model Summary table, you can see the R<sup>2</sup> for the model is .147. Thus, approximately 14.7 percent of the total variation in X22 is associated with X12—Friendly Employees, X17—Attractive Interior, and X32— Gender. The ANOVA table indicates the regression model is significant. The “Sig.” value indicates a probability level of .000.

The Coefficients table shows that the variable X12—Friendly Employees is a significant predictor of satisfaction, with a beta coefficient of .286. Attractive Interior (X17) with a beta of .080 is not significantly related to customer satisfaction (i.e., probability level of .181). Now, the question of interest is: “Does the satisfaction level of Santa Fe Grill customers differ depending on whether they are male or female?” According to the results in the Coefficients table, the beta coefficient of –.171 for X32—Gender is significant (Sig. level of .005). This means the female and male customers exhibit significantly different levels of satisfaction with the Santa Fe Grill. The negative beta coefficient means that lower numbers for gender are associated with higher values for satisfaction. Since males were coded 0 in our database this means males are more satisfied with the Santa Fe Grill than are females.

It is also possible to use categorical independent variables with more than just two categories. Let’s say you wanted to use consumers’ purchase behavior of Starbucks coffee to help predict their purchase behavior for Maxwell House, and you had separated your sample into nonusers, light users, and heavy users. To use dummy variables in your regression model, you would pick one category as a reference group (nonusers) and add two dummy variables for the remaining categories. Using 0 and 1, the variables would be coded as follows: nonusers (the reference group) are coded 0 on both dummy variables; light users are coded 1 on the first dummy variable and 0 on the second; and heavy users are coded 0 on the first dummy variable and 1 on the second.
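This coding can be sketched mechanically, with nonusers as the reference group (the category labels are assumptions for illustration):

```python
# Map a usage category to (light_user, heavy_user) dummy variables;
# nonusers, the reference group, are coded 0 on both dummies.
def dummy_code(usage):
    return (1 if usage == "light" else 0,
            1 if usage == "heavy" else 0)

print(dummy_code("nonuser"))  # → (0, 0)
print(dummy_code("light"))    # → (1, 0)
print(dummy_code("heavy"))    # → (0, 1)
```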

The use of dummy variables in regression models allows different types of independent variables to be included in prediction efforts. The researcher must keep in mind the difference in the interpretation of the regression coefficient and the identity of the reference category that is represented by the intercept term.