Standardized Coefficients
To determine the relative importance of the significant predictors, look at the standardized
coefficients. Even though Price in thousands has a smaller unstandardized coefficient than Vehicle
type, it actually contributes more to the model because it has the larger absolute standardized
coefficient.
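Outside of SPSS, you can reproduce a standardized coefficient by rescaling the unstandardized coefficient by the ratio of the predictor's standard deviation to the dependent variable's. Here is a minimal Python sketch, using made-up data and hypothetical column names rather than the actual vehicle file:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Made-up stand-in data; the column names are hypothetical.
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["price", "horsepow", "wheelbas"])
df["sales"] = 5 - 0.8 * df["price"] + 0.3 * df["wheelbas"] + rng.normal(size=100)

cols = ["price", "horsepow", "wheelbas"]
fit = sm.OLS(df["sales"], sm.add_constant(df[cols])).fit()

# Standardized coefficient: unstandardized b times s_x / s_y.
betas = fit.params.drop("const") * df[cols].std() / df["sales"].std()
print(betas.sort_values(key=abs, ascending=False))
```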
The tolerance is the percentage of the variance in a given predictor that cannot be explained by the
other predictors. Thus, the small tolerances show that 70%-90% of the variance in a given predictor
can be explained by the other predictors. When the tolerances are close to 0, there is high
multicollinearity and the standard errors of the regression coefficients will be inflated. A variance
inflation factor (the reciprocal of the tolerance) greater than 2 is usually considered problematic,
and the smallest VIF in the table is 3.193. The collinearity diagnostics confirm that there are
serious problems with multicollinearity.
Several eigenvalues are close to 0, indicating that the predictors are highly intercorrelated and that
small changes in the data values may lead to large changes in the estimates of the coefficients.
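If you want to verify the tolerance and VIF values by hand, each predictor's tolerance is 1 minus the R-square obtained when regressing that predictor on all the others. A small Python sketch with fabricated, deliberately collinear data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
base = rng.normal(size=n)
# Made-up predictors: 'price' and 'horsepow' share a common component,
# so each is largely explained by the other.
X = pd.DataFrame({
    "price":    base + 0.2 * rng.normal(size=n),
    "horsepow": base + 0.2 * rng.normal(size=n),
    "length":   rng.normal(size=n),
})

# Tolerance of a predictor = 1 - R^2 from regressing it on the rest;
# the variance inflation factor is its reciprocal.
for col in X.columns:
    r2 = sm.OLS(X[col], sm.add_constant(X.drop(columns=col))).fit().rsquared
    print(f"{col}: tolerance={1 - r2:.3f}, VIF={1 / (1 - r2):.3f}")
```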
The condition indices are computed as the square roots of the ratios of the largest eigenvalue to each
successive eigenvalue. Values greater than 15 indicate a possible problem with collinearity; greater
than 30, a serious problem. Six of these indices are larger than 30, suggesting a very serious problem
with collinearity. Now try to fix the collinearity problems by rerunning the regression using z scores of
the independent variables.
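As an aside, the eigenvalues and condition indices described above are straightforward to reproduce outside SPSS: they come from the cross-products matrix of the design matrix after each column, including the constant, is scaled to unit length. A sketch with fabricated, nearly collinear data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
base = rng.normal(loc=5.0, size=n)
# Made-up design matrix: a constant plus two nearly collinear predictors.
X = np.column_stack([np.ones(n), base, base + 0.05 * rng.normal(size=n)])

# Scale each column to unit length before forming the cross-products matrix.
Xs = X / np.linalg.norm(X, axis=0)
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]   # eigenvalues, descending

# Condition index i = sqrt(largest eigenvalue / eigenvalue i);
# > 15 suggests a possible problem, > 30 a serious one.
print("eigenvalues:      ", np.round(eigvals, 4))
print("condition indices:", np.round(np.sqrt(eigvals[0] / eigvals), 1))
```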
► To run a Linear Regression on the standardized variables, recall the Linear Regression dialog box.
► Select Zscore: Vehicle type through Zscore: Fuel efficiency as independent variables.
► Click OK.
The eigenvalues and condition indices are vastly improved relative to the original model. However,
the collinearity statistics reported in the Coefficients table are unimproved. This is because the z-
score transformation does not change the correlation between two variables. As a multicollinearity
diagnostic, the condition index is useful for flagging datasets that could cause numerical estimation
problems in algorithms that do not internally rescale the independent variables. The z-score
transformation solves this problem, but we need another tactic for reducing the variance inflation.
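That invariance is easy to verify directly; in this throwaway Python example, z-scoring leaves the correlation, and hence the variance inflation, exactly where it was:

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(size=100)
b = a + 0.3 * rng.normal(size=100)   # strongly correlated with a

def z(v):
    # Standard z-score transformation.
    return (v - v.mean()) / v.std()

print(np.corrcoef(a, b)[0, 1])        # correlation of the raw variables
print(np.corrcoef(z(a), z(b))[0, 1])  # identical after z-scoring
```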
Using the Factor Analysis procedure, we can create a set of independent variables that are
uncorrelated and fit the dependent variable as well as the original independent variables.
► To run a Factor Analysis on the standardized variables, open the Factor Analysis dialog box.
► Select Zscore: Vehicle type through Zscore: Fuel efficiency as analysis variables.
► Click Extraction.
► In the Extract group, select Fixed number of factors and type 10 as the number of factors to
extract.
► Click Continue, then click Rotation in the Factor Analysis dialog box.
► In the Method group, select Varimax.
► Click Continue, then click Scores in the Factor Analysis dialog box.
► Select Save as variables.
► Click Continue, then click OK in the Factor Analysis dialog box.
► To run a Linear Regression on the factor scores, recall the Linear Regression dialog box.
► Select REGR factor score 1 for analysis 1 [FAC1_1] through REGR factor score 10 for analysis 1
[FAC10_1] as independent variables.
► Click OK.
As expected, the model fit is the same for the model built using the factor scores as for the model
using the original predictors. Also as expected, the collinearity statistics show that the factor
scores are uncorrelated. Note that since the variability of the coefficient estimates is not
artificially inflated by collinearity, the coefficient estimates are larger, relative to their standard
errors, in this model than in the original model. This means that more of the factors are identified
as statistically significant, which can affect your final results if you want to build a model that
includes only significant effects.
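You can verify both claims outside SPSS. In this sketch, scikit-learn's PCA stands in for the Factor Analysis procedure; extracting as many components as there are predictors leaves the fit unchanged while producing uncorrelated scores:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 100
base = rng.normal(size=n)
noise = rng.normal(size=(n, 3))
# Three correlated predictors sharing the common component 'base'.
X = base[:, None] + 0.3 * noise
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

# Replace the correlated predictors with uncorrelated component scores.
scores = PCA().fit_transform(StandardScaler().fit_transform(X))

fit_orig = sm.OLS(y, sm.add_constant(X)).fit()
fit_pca = sm.OLS(y, sm.add_constant(scores)).fit()
print(round(fit_orig.rsquared, 6), round(fit_pca.rsquared, 6))  # identical fit
print(np.round(np.corrcoef(scores, rowvar=False), 6))           # ~identity matrix
```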
► For example, to run a stepwise Linear Regression on the factor scores, recall the Linear
Regression dialog box.
Note that because stepwise methods select models based solely on statistical merit, they may choose
predictors that have no practical significance. While stepwise methods are a convenient way to
focus on a smaller subset of predictors, you should take care to examine the results to see whether
they make sense.
► Click Statistics.
► Click Continue.
► Click Plots in the Linear Regression dialog box.
► Select Histogram.
► Click Continue.
► Click Save in the Linear Regression dialog box.
► Click Continue, then click OK in the Linear Regression dialog box.
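SPSS's stepwise method enters and removes predictors based on F-statistic probabilities. As a rough stand-in outside SPSS, this sketch uses scikit-learn's forward sequential selection, which picks predictors by cross-validated fit, on synthetic "factor scores":

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 100, 10
# Made-up uncorrelated "factor scores"; only the first three drive sales.
F = rng.normal(size=(n, p))
y = -1.2 * F[:, 0] + 0.8 * F[:, 1] + 0.5 * F[:, 2] + rng.normal(size=n)

# Forward selection by cross-validated fit, a rough stand-in for
# SPSS's F-probability entry/removal criterion.
sel = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
).fit(F, y)
print("selected columns:", np.flatnonzero(sel.get_support()))
```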
The new model's ability to explain sales compares favorably with that of the previous model. Look in
particular at the adjusted R-square statistics, which are nearly identical. A model with extra
predictors will always have at least as large an R-square value, but the adjusted R-square
compensates for model complexity to provide a fairer comparison of model performance.
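For reference, the adjusted R-square is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of cases and p the number of predictors. A quick illustration:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of cases and p the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same raw R-square is penalized more as predictors are added.
print(adjusted_r2(0.60, n=100, p=3))    # ~0.5875
print(adjusted_r2(0.60, n=100, p=10))   # ~0.5551
```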
The stepwise algorithm chooses factor scores 1, 2, 3, 5, and 6 as predictors; in order to interpret
these results, you'll need to look at the rotated component matrix in the Factor Analysis output.
• The first component (factor score 1) loads most strongly on price and horsepower. Since the
regression coefficient is negative for factor score 1, you can conclude that more expensive, higher-
horsepower cars can be expected to have lower sales.
• The second component loads most strongly on wheelbase and length. Since the regression
coefficient is positive for factor score 2, this suggests that larger vehicles are expected to have
higher sales.
• The third component loads most strongly on vehicle type. The positive coefficient for factor score 3
suggests that trucks are expected to have higher sales.
• The sixth component loads most strongly on engine size. Note that engine size also loads almost as
strongly on the first component, so the positive coefficient for factor score 6 partially offsets the
negative association between engine size and sales implied by the negative coefficient for factor
score 1.
• The fifth component loads most strongly on fuel efficiency; the negative component loading
combined with the negative coefficient for factor score 5 suggests that more fuel-efficient cars are
expected to have higher sales, all other things being equal.
Checking Normality
The shape of the histogram follows the shape of the normal curve fairly well, but there are one or two
large negative residuals. For more information on these cases, see the casewise diagnostics.
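The casewise diagnostics flag cases whose standardized residuals fall outside 3 standard deviations by default. The same check is easy to reproduce; this sketch uses synthetic data with one planted outlier:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=(n, 2))
y = x @ np.array([1.0, 0.5]) + rng.normal(size=n)
y[10] -= 6                       # plant one badly "underperforming" case

fit = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = fit.get_influence().resid_studentized_internal

# Flag cases whose standardized residuals fall outside 3 standard
# deviations, mirroring the default casewise-diagnostics cutoff.
print("flagged cases:", np.flatnonzero(np.abs(std_resid) > 3))
```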
The casewise diagnostics table identifies the cases with large negative residuals as the 3000GT and the Cutlass. This
means that, based on the expected sales predicted by the regression model, these two models
underperformed in the market. The Breeze and SW also appear to have underperformed to a
lesser extent. The plot of residuals by predicted values clearly shows the two most
underperforming vehicles. Additionally, you can see that the Breeze and SW are quite close to the
majority of cases. This suggests that the apparent underperformance of the Breeze and SW could
be due to random chance. What is of greater concern in this plot are the clusters of cases far to
the left of the general cluster of cases. While the vehicles in these clusters do not have large
residuals, their distance from the general cluster may have given these cases undue influence in
determining the regression coefficients.
► Select Standardized Residual as the y variable and REGR factor score 1 for analysis 1 as
the x variable.
► Click OK.
The resulting scatterplot reveals that the points in the unusual grouping noted in the residuals by
predicted values scatterplot have large values for factor score 1; that is, they are high-priced
vehicles. Since the distribution of prices is right-skewed, it might be a good idea to use log-
transformed prices in future analyses. By recalling the Chart Builder, you can produce similar plots
for the other factor scores. The charts for factor scores 2 and 3 don't reveal anything interesting,
but the plot of residuals by factor score 5 reveals that the Metro may be an influential point
because it has a much higher fuel efficiency than any other vehicle in the dataset and lies far
outside the main cluster of points. The residuals by factor score 6 chart reveals that the Viper may
also be an influential point because it has an unusually large engine size and lies outside the main
cluster of points.
► To plot Cook's distance against the centered leverage values, recall the Chart Builder.
► Click OK.
The resulting scatterplot shows a few unusual points. The 3000GT has a large Cook's distance, but it
does not have a high leverage value, so while it adds a lot of variability to the regression
estimates, it likely did not affect the slope of the regression equation. The Viper has a high
leverage value, but does not have a large Cook's distance, so it is not likely to have exerted
undue influence on the model. The most worrisome case is the Metro, which has both a high
leverage and a large Cook's distance. The next step would be to run the analysis without this
case, but we will not pursue this here.
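A similar diagnostic plot can be produced outside SPSS with statsmodels, which exposes Cook's distances and the hat-matrix diagonal; subtracting 1/n from the hat diagonal gives the centered leverage that SPSS reports (that relationship is assumed here). A sketch with one planted high-leverage case:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
x[0] = 6.0                       # planted extreme predictor value (high leverage)
y = 2 * x + rng.normal(size=n)
y[0] += 5                        # ...that also misses the regression line

infl = sm.OLS(y, sm.add_constant(x)).fit().get_influence()
cooks_d = infl.cooks_distance[0]            # first element holds the distances
leverage = infl.hat_matrix_diag - 1 / n     # assumed: centered leverage = h - 1/n

plt.scatter(leverage, cooks_d)
plt.xlabel("Centered leverage")
plt.ylabel("Cook's distance")
plt.show()
```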
Summary
Using stepwise methods in Linear Regression, you have selected a "best" model for predicting motor-
vehicle sales. With this model, you found two vehicle models that were underperforming in the
market, while no vehicle was clearly overperforming.
Diagnostic plots of residuals and influence statistics indicated that your regression model may be
adversely affected by the Metro. Removing this case and rerunning the analysis to see the difference
in the results would be a good next step.
The Linear Regression procedure is useful for modeling the relationship between a scale dependent
variable and one or more scale independent variables.
• Use the Correlations procedure to study the strength of relationships between the variables before
fitting a model.
• If you have categorical predictor variables, try the GLM Univariate procedure.
• If you have a lot of predictors and want to reduce their number, use the Factor Analysis procedure.