Mathematical Engineering - Modelli e Metodi dell'Inferenza Statistica

Laboratory 1 - Linear Regression


Laboratory with R - 1
Metodi e Modelli per l'Inferenza Statistica - Ing. Matematica - a.y. 2017-18
28/05/2018

Contents
0. Required packages
1. Linear regression and tests for coefficient significance
   Homework
2. Confidence Intervals and Regions
   Confidence Intervals
   Confidence Regions
3. Diagnostics: detecting influential points
   Leverages
   Standardized Residuals
   Studentized Residuals
   Cook's distance
4. Hypotheses of the model
   Homoscedasticity
   Normality
   Nonlinearity/Collinearity
   Violations of assumptions
   Violation of normality
5. Transformation: Box-Cox
6. Variable Selection: stepwise procedure
7. Prediction

0. Required packages

library( car )
library( ellipse )
library( faraway )
library( leaps )
library( MASS )
library( GGally )
# library( qpcR )
# library( rgl )

1. Linear regression and tests for coefficient significance

1.a Load the faraway library and the dataset savings, an economic dataset on 50 different countries. These data are averages over 1960-1970 (to remove the business cycle and other short-term fluctuations). The recorded covariates are:
• sr is aggregate personal saving divided by disposable income.
• pop15 is the percentage of the population under 15.
• pop75 is the percentage of the population over 75.
• dpi is per-capita disposable income in U.S. dollars (net of taxes).
• ddpi is the [percentage] rate of change of per-capita disposable income (purchasing power: an aggregate economic index, expressed in %).
Moreover, create a summary of the data. How many variables have missing data? Which are quantitative and which are qualitative?

Solution

# import data
data( savings )
# dimensions
dim( savings )
## [1] 50 5

We have 50 observations (50 countries) with 5 attributes each.
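As a complementary check (an addition, not in the original handout), str() gives a compact per-column overview and colSums() counts missing values directly; a minimal sketch:

# compact structure: one line per column with its type and first values
str( savings )
# number of NAs per column (all zeros here, as confirmed just below)
colSums( is.na( savings ) )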
To visualize the first rows:

# overview of the first rows
head( savings )
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43

Look at the main statistics for each covariate:

# a brief description of each column
summary( savings )
## sr pop15 pop75 dpi
## Min. : 0.600 Min. :21.44 Min. :0.560 Min. : 88.94
## 1st Qu.: 6.970 1st Qu.:26.21 1st Qu.:1.125 1st Qu.: 288.21
## Median :10.510 Median :32.58 Median :2.175 Median : 695.66
## Mean : 9.671 Mean :35.09 Mean :2.293 Mean :1106.76
## 3rd Qu.:12.617 3rd Qu.:44.06 3rd Qu.:3.325 3rd Qu.:1795.62
## Max. :21.100 Max. :47.64 Max. :4.700 Max. :4001.89
## ddpi
## Min. : 0.220
## 1st Qu.: 2.002
## Median : 3.000
## Mean : 3.758
## 3rd Qu.: 4.478
## Max. :16.710

If missing values were present, the summary function would have reported them. To check directly:

# observe that there are no missing values
print( sapply( savings, function( x ) any( is.na( x ) ) ) )
## sr pop15 pop75 dpi ddpi
## FALSE FALSE FALSE FALSE FALSE

Finally, we get the data type of each column:

# check the type of each column (integer, double, character, ...)
print( sapply( savings, typeof ) )
## sr pop15 pop75 dpi ddpi
## "double" "double" "double" "double" "double"

1.b Visualize the data and try to fit a complete linear model, in which sr is the outcome of interest. Explore the output of the model.

Solution

To visualize the data, we can plot the pairs. This is also useful for getting an idea of the relationships between the covariates.

pairs( savings[ , c( 'sr', 'pop15', 'pop75', 'dpi', 'ddpi' ) ], pch = 16 )

[Figure: scatterplot matrix of sr, pop15, pop75, dpi and ddpi]

For a nicer visualization of these scatterplots, we can use the package GGally: we can easily visualize the relationship of each pair of variables, their sample correlation and their approximate density functions.

ggpairs( data = savings, title = "Relationships between predictors & response",
         lower = list( continuous = wrap( "points", alpha = 0.5, size = 0.1 ) ) )

[Figure: ggpairs matrix "Relationships between predictors & response", showing pairwise scatterplots, marginal densities and sample correlations: cor(sr, pop15) = -0.456, cor(sr, pop75) = 0.317, cor(sr, dpi) = 0.22, cor(sr, ddpi) = 0.305, cor(pop15, pop75) = -0.908, cor(pop15, dpi) = -0.756, cor(pop15, ddpi) = -0.0478, cor(pop75, dpi) = 0.787, cor(pop75, ddpi) = 0.0253, cor(dpi, ddpi) = -0.129]

Secondly, we can fit the complete linear model and look at a summary of the estimated coefficients.

g = lm( sr ~ pop15 + pop75 + dpi + ddpi, data = savings )
# g = lm( sr ~ ., savings )
summary( g )
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904

gs = summary( g )

To measure the goodness of fit of the model, we look at $R^2$ and $R^2_{adj}$. They take values between 0 and 1 and represent the percentage of variability explained by the regressors, so the closer they are to 1, the better the model explains the dependent variable. Both are low in this case.
The third and last columns of the summary report the univariate statistics and p-values associated with each estimated coefficient. They give the results of the tests of each estimated coefficient being equal to 0; in other words, they tell us whether the confidence interval for $\hat{\beta}_i$ contains 0 or not. Only pop15 and ddpi seem to be significant in this model.
Through the F-statistic, we can investigate whether at least one coefficient among $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ is different from 0. Since the p-value of the F-statistic is very small (0.0007904), the null hypothesis is rejected: there is at least one coefficient different from 0.

names( g ) # this gives you the attributes of the linear model object
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"

We can look through the model's attributes.

g$call # linear model formula
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)

g$coefficients # beta_hat
## (Intercept) pop15 pop75 dpi ddpi
## 28.5660865407 -0.4611931471 -1.6914976767 -0.0003369019 0.4096949279

g$fitted.values # estimated sr for each observation
## Australia Austria Belgium Bolivia Brazil
## 10.566420 11.453614 10.951042 6.448319 9.327191
## Canada Chile China Colombia Costa Rica
## 9.106892 8.842231 9.363964 6.431707 5.654922
## Denmark Ecuador Finland France Germany
## 11.449761 5.995631 12.921086 10.164528 12.730699
## Greece Guatamala Honduras Iceland India
## 13.786168 6.365284 6.989976 7.480582 8.491326
## Ireland Italy Japan Korea Luxembourg
## 7.948869 12.353245 15.818514 10.086981 12.020807
## Malta Norway Netherlands New Zealand Nicaragua
## 12.505090 11.121785 14.224454 8.384445 6.653603
## Panama Paraguay Peru Philippines Portugal
## 7.734166 8.145759 6.160559 6.104992 13.258445
## South Africa South Rhodesia Spain Sweden Switzerland
## 10.656834 12.008566 12.441156 11.120283 11.643174
## Turkey Tunisia United Kingdom United States Venezuela
## 7.795682 5.627920 10.502413 8.671590 5.587482
## Zambia Jamaica Uruguay Libya Malaysia
## 8.809086 10.738531 11.503827 11.719526 7.680869

We could also compute the fitted values of the dependent variable directly:

X = model.matrix( g )
y_hat_man = X %*% g$coefficients # X beta_hat

g$residuals # residuals
## Australia Austria Belgium Bolivia Brazil
## 0.8635798 0.6163860 2.2189579 -0.6983191 3.5528094
## Canada Chile China Colombia Costa Rica
## -0.3168924 -8.2422307 2.5360361 -1.4517071 5.1250782
## Denmark Ecuador Finland France Germany
## 5.4002388 -2.4056313 -1.6810857 2.4754718 -0.1806993
## Greece Guatamala Honduras Iceland India
## -3.1161685 -3.3552838 0.7100245 -6.2105820 0.5086740
## Ireland Italy Japan Korea Luxembourg
## 3.3911306 1.9267549 5.2814855 -6.1069814 -1.6708066
## Malta Norway Netherlands New Zealand Nicaragua
## 2.9749098 -0.8717854 0.4255455 2.2855548 0.6463966
## Panama Paraguay Peru Philippines Portugal
## -3.2941656 -6.1257589 6.5394410 6.6750084 -0.7684447
## South Africa South Rhodesia Spain Sweden Switzerland
## 0.4831656 1.2914342 -0.6711565 -4.2602834 2.4868259
## Turkey Tunisia United Kingdom United States Venezuela
## -2.6656824 -2.8179200 -2.6924128 -1.1115901 3.6325177
## Zambia Jamaica Uruguay Libya Malaysia
## 9.7509138 -3.0185314 -2.2638273 -2.8295257 -2.9708690

g$rank # the numeric rank of the fitted linear model
## [1] 5

Calculate the variance-covariance matrix of the fitted model object:

# help( vcov )
vcov( g )
## (Intercept) pop15 pop75 dpi
## (Intercept) 54.088907156 -1.046928e+00 -6.4480864740 -1.135929e-03
## pop15 -1.046927609 2.092137e-02 0.1199574165 2.422953e-05
## pop75 -6.448086474 1.199574e-01 1.1741866426 -3.703298e-04
## dpi -0.001135929 2.422953e-05 -0.0003703298 8.669606e-07
## ddpi -0.271654582 2.907814e-03 -0.0116339234 4.667202e-05
## ddpi
## (Intercept) -2.716546e-01
## pop15 2.907814e-03
## pop75 -1.163392e-02
## dpi 4.667202e-05
## ddpi 3.849331e-02

1.c Compute the F-test manually.

Solution

Recall that the F-test tells us whether there is at least one estimated coefficient significantly different from 0. Compute:

$$SS_{tot} = \sum_i (y_i - \bar{y})^2$$

SS_tot = sum( ( savings$sr - mean( savings$sr ) )^2 )

$$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$$

SS_res = sum( g$res^2 )

$$F = \frac{(SS_{tot} - SS_{res}) / (p - 1)}{SS_{res} / (n - p)}$$

p = g$rank # p = 5
n = dim( savings )[ 1 ] # n = 50
f_test = ( ( SS_tot - SS_res ) / ( p - 1 ) ) / ( SS_res / ( n - p ) )
1 - pf( f_test, p - 1, n - p )
## [1] 0.0007903779

1.d Test the significance of the parameter $\beta_1$ (the parameter related to pop15), manually.

Solution

We want to test:

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$$

There are several ways to perform this test:

• t-test
We compute the test whose output is shown in the R summary.

X = model.matrix( g )
sigma2 = ( summary( g )$sigma )^2
# manually
sigma2 = sum( ( savings$sr - g$fitted.values )^2 ) / ( n - p )
se_beta_1 = summary( g )$coef[ 2, 2 ]
# manually
se_beta_1 = sqrt( sigma2 * diag( solve( t( X ) %*% X ) )[ 2 ] )
T.0 = abs( ( g$coefficients[ 2 ] - 0 ) / se_beta_1 )
2 * ( 1 - pt( T.0, n - p ) )
## pop15
## 0.002603019

• F-test on nested models
We fit a nested model (the complete model without the covariate we are interested in), then compute the residuals of the two models and carry out the F-test.
REMARK: this is NOT the F-test that you find in the summary!

$$F_0 = \frac{\big( SS_{res}(\text{nested}) - SS_{res}(\text{complete}) \big) / \big( df(\text{nested}) - df(\text{complete}) \big)}{SS_{res}(\text{complete}) / df(\text{complete})} \sim F\big( df(\text{nested}) - df(\text{complete}),\; df(\text{complete}) \big)$$

g2 = lm( sr ~ pop75 + dpi + ddpi, data = savings )
summary( g2 )
##
## Call:
## lm(formula = sr ~ pop75 + dpi + ddpi, data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0577 -3.2144 0.1687 2.4260 10.0763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4874944 1.4276619 3.844 0.00037 ***
## pop75 0.9528574 0.7637455 1.248 0.21849
## dpi 0.0001972 0.0010030 0.197 0.84499
## ddpi 0.4737951 0.2137272 2.217 0.03162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.164 on 46 degrees of freedom
## Multiple R-squared: 0.189, Adjusted R-squared: 0.1361
## F-statistic: 3.573 on 3 and 46 DF, p-value: 0.02093

SS_res_2 = sum( g2$residuals^2 )
f_test_2 = ( ( SS_res_2 - SS_res ) / 1 ) / ( SS_res / ( n - p ) )
1 - pf( f_test_2, 1, n - p )
## [1] 0.002603019

• ANOVA between the two nested models
The analysis of variance between two nested models is based on the statistic $F_0$ that we computed before!
anova( g2, g )
## Analysis of Variance Table
##
## Model 1: sr ~ pop75 + dpi + ddpi
## Model 2: sr ~ pop15 + pop75 + dpi + ddpi
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 797.72
## 2 45 650.71 1 147.01 10.167 0.002603 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We notice that the result is the same with all three methods: $\beta_1$ is significant.

Homework

1.e Test the significance of all the regression parameters, separately.

1.f Test the regression parameter $\beta_4$ (the one related to ddpi) for this test:
$$H_0: \beta_4 = 0.35 \quad \text{vs} \quad H_1: \beta_4 > 0.35$$

2. Confidence Intervals and Regions

Confidence Intervals

2.a Compute the 95% confidence interval for the regression parameter related to pop75.

Solution

The formula for the required confidence interval is:
$$IC_{1-\alpha}(\beta_2) = \big[\, \hat{\beta}_2 \pm t_{1-\alpha/2}(n-p) \cdot se(\hat{\beta}_2) \,\big],$$
where $\alpha = 5\%$ and $df = n - p = 45$.

alpha = 0.05
t_alpha2 = qt( 1 - alpha/2, n - p )
beta_hat_pop75 = g$coefficients[ 3 ]
se_beta_hat_pop75 = summary( g )[[ 4 ]][ 3, 2 ]
IC_pop75 = c( beta_hat_pop75 - t_alpha2 * se_beta_hat_pop75,
              beta_hat_pop75 + t_alpha2 * se_beta_hat_pop75 )
IC_pop75
## pop75 pop75
## -3.8739780 0.4909826

We observe that $IC_{1-\alpha}(\beta_2)$ includes 0, so there is no evidence for rejecting $H_0: \beta_2 = 0$ at the 5% level. Indeed, this parameter was not significant in the previous section either (p-value 12.5%).

summary( g )$coef[ 3, 4 ]
## [1] 0.1255298

2.b Compute the 95% confidence interval for the regression parameter related to ddpi.

Solution

alpha = 0.05
t_alpha2 = qt( 1 - alpha/2, n - p )
beta_hat_ddpi = g$coefficients[ 5 ]
se_beta_hat_ddpi = summary( g )[[ 4 ]][ 5, 2 ]
IC_ddpi = c( beta_hat_ddpi - t_alpha2 * se_beta_hat_ddpi,
             beta_hat_ddpi + t_alpha2 * se_beta_hat_ddpi )
IC_ddpi
## ddpi ddpi
## 0.01453363 0.80485623

In this case, we observe that $IC_{1-\alpha}(\beta_4)$ does NOT include 0, so there is evidence for rejecting $H_0: \beta_4 = 0$ at the 5% level. However, the lower bound of $IC_{1-\alpha}(\beta_4)$ is really close to 0. We can see from the output above that the p-value is 4.2%, lower than 5%, confirming this point.

summary( g )$coef[ 5, 4 ]
## [1] 0.04247114

Notice that this confidence interval is pretty wide, in the sense that the upper limit is about 80 times larger than the lower limit. This means that we are not really that confident about what the exact effect of growth on savings is.

REMARK: Confidence intervals have a duality with two-sided hypothesis tests: a 95% confidence interval contains all the null hypotheses that would not be rejected at the 5% level.

Confidence Regions

2.c Build the joint 95% confidence region for the parameters related to pop15 and pop75, and add the value of $(\beta_1, \beta_2)$ under the null hypothesis.

Solution

# help( ellipse )
plot( ellipse( g, c( 2, 3 ) ), type = "l", xlim = c( -1, 0 ) )
# add the origin, i.e. the point we are testing under the null hypothesis
points( 0, 0 )
# add also the centre of the ellipse, that is, the estimated couple of coefficients
points( g$coef[ 2 ], g$coef[ 3 ], pch = 18 )

[Figure: 95% confidence ellipse for (pop15, pop75), with the origin and the estimated couple of coefficients marked]

The filled dot is the centre of the ellipse and represents the estimates of the two parameters, $(\hat{\beta}_1, \hat{\beta}_2)$. Now, we are interested in this test:
$$H_0: (\beta_1, \beta_2) = (0, 0) \quad \text{vs} \quad H_1: (\beta_1, \beta_2) \neq (0, 0)$$
We observe that the empty dot $(0, 0)$ is not included in the confidence region (which is now an ellipse), so we reject $H_0$ at the 5% level. In other words, we are saying that at least one parameter between $\beta_1$ and $\beta_2$ is not equal to 0.
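As a numerical counterpart to this graphical check (an illustrative addition, not in the original handout), the same joint hypothesis can be tested with an F-test via linearHypothesis() from the car package loaded at the beginning:

# joint F-test of H0: beta_pop15 = 0 and beta_pop75 = 0
linearHypothesis( g, c( "pop15 = 0", "pop75 = 0" ) )
# a small p-value here is consistent with (0,0) lying outside the ellipse

This is the test whose acceptance region is exactly the elliptical confidence region plotted above.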
REMARK: It is important to stress that this confidence region is different from the one obtained as the cartesian product of the two confidence intervals, $IC_{1-\alpha}(\beta_1) \times IC_{1-\alpha}(\beta_2)$. The cartesian product of the two confidence intervals is represented by the four dashed lines.

beta_hat_pop15 = g$coefficients[ 2 ]
se_beta_hat_pop15 = summary( g )[[ 4 ]][ 2, 2 ]
IC_pop15 = c( beta_hat_pop15 - t_alpha2 * se_beta_hat_pop15,
              beta_hat_pop15 + t_alpha2 * se_beta_hat_pop15 )
IC_pop15
## pop15 pop15
## -0.7525175 -0.1698688

plot( ellipse( g, c( 2, 3 ) ), type = "l", xlim = c( -1, 0 ) )
points( 0, 0 )
points( g$coef[ 2 ], g$coef[ 3 ], pch = 18 )
# new part
abline( v = c( IC_pop15[ 1 ], IC_pop15[ 2 ] ), lty = 2 )
abline( h = c( IC_pop75[ 1 ], IC_pop75[ 2 ] ), lty = 2 )

[Figure: confidence ellipse for (pop15, pop75) with the two marginal confidence intervals drawn as dashed lines]

REMARK: The origin (0,0) is included in $IC_{1-\alpha}(\beta_2)$ and is NOT included in $IC_{1-\alpha}(\beta_1)$, as expected from the previous point (we expect $\beta_1$ to be significantly different from 0, while $\beta_2$ is not).

REMARK: It can happen that you would reject according to one confidence region and accept according to the other. So which region should we choose?
• blue point: inside the cartesian product of the marginal ICs, outside the joint confidence region
• red point: outside the cartesian product of the marginal ICs, inside the joint confidence region

plot( ellipse( g, c( 2, 3 ) ), type = "l", xlim = c( -1, 0 ) )
points( 0, 0 )
points( g$coef[ 2 ], g$coef[ 3 ], pch = 18 )
abline( v = c( IC_pop15[ 1 ], IC_pop15[ 2 ] ), lty = 2 )
abline( h = c( IC_pop75[ 1 ], IC_pop75[ 2 ] ), lty = 2 )
# new part
points( -0.22, 0.7, col = "red", lwd = 2 )
points( -0.71, 0, col = "blue", lwd = 2 )

[Figure: confidence ellipse with marginal intervals; the red point is inside the ellipse but outside the rectangle, the blue point is inside the rectangle but outside the ellipse]

We should always refer to the joint confidence region (the elliptical one), because it takes the correlation between the parameters into account. So we will accept the hypothesis represented by the red point and reject the hypothesis represented by the blue one. In this case the correlation is near -1, which means the variables share a lot of variability and, consequently, a lot of information.

cor( savings$pop15, savings$pop75 )
## [1] -0.9084787

3. Diagnostics: detecting influential points

The goal of diagnostics is to detect possible influential points in a sample. In general, an influential point is one whose removal from the dataset would cause a large change in the fit. Influential points are outliers and leverage points (in Italian, punti leva). The definitions of outliers and leverage points can overlap: a possible definition of an outlier is 'a point that does not fit the chosen model', while a leverage point is 'a point that significantly affects the estimates of the model'. It is immediate to see that an outlier is often also a leverage point.

By 'influential' we mean that:
1. the estimated coefficients change significantly with or without the observation: $\hat{\beta} - \hat{\beta}_{(-i)}$;
2. the fitted values change significantly with or without the observation: $x^T (\hat{\beta} - \hat{\beta}_{(-i)}) = \hat{y} - \hat{y}_{(-i)}$.
However, these are hard measures to judge, in the sense that their scale varies between datasets. There are several approaches for identifying influential points in a sample, such as:
• Leverages (projection matrix)
• Standardized Residuals
• Studentized Residuals
• Cook's Distance

Leverages

3.a Investigate possible leverage points among the data. Leverages are defined as the diagonal elements of the matrix H:
$$H = X (X^T X)^{-1} X^T,$$
such that $\hat{y} = H y$.
Solution

We can compute the diagonal elements of the H matrix with two different functions:

X = model.matrix( g )
X
## (Intercept) pop15 pop75 dpi ddpi
## Australia 1 29.35 2.87 2329.68 2.87
## Austria 1 23.32 4.41 1507.99 3.93
## Belgium 1 23.80 4.43 2108.47 3.82
## Bolivia 1 41.89 1.67 189.13 0.22
## Brazil 1 42.19 0.83 728.47 4.56
## Canada 1 31.72 2.85 2982.88 2.43
## Chile 1 39.74 1.34 662.86 2.67
## China 1 44.75 0.67 289.52 6.51
## Colombia 1 46.64 1.06 276.65 3.08
## Costa Rica 1 47.64 1.14 471.24 2.80
## Denmark 1 24.42 3.93 2496.53 3.99
## Ecuador 1 46.31 1.19 287.77 2.19
## Finland 1 27.84 2.37 1681.25 4.32
## France 1 25.06 4.70 2213.82 4.52
## Germany 1 23.31 3.35 2457.12 3.44
## Greece 1 25.62 3.10 870.85 6.28
## Guatamala 1 46.05 0.87 289.71 1.48
## Honduras 1 47.32 0.58 232.44 3.19
## Iceland 1 34.03 3.08 1900.10 1.12
## India 1 41.31 0.96 88.94 1.54
## Ireland 1 31.16 4.19 1139.95 2.99
## Italy 1 24.52 3.48 1390.00 3.54
## Japan 1 27.01 1.91 1257.28 8.21
## Korea 1 41.74 0.91 207.68 5.81
## Luxembourg 1 21.80 3.73 2449.39 1.57
## Malta 1 32.54 2.47 601.05 8.12
## Norway 1 25.95 3.67 2231.03 3.62
## Netherlands 1 24.71 3.25 1740.70 7.66
## New Zealand 1 32.61 3.17 1487.52 1.76
## Nicaragua 1 45.04 1.21 325.54 2.48
## Panama 1 43.56 1.20 568.56 3.61
## Paraguay 1 41.18 1.05 220.56 1.03
## Peru 1 44.19 1.28 400.06 0.67
## Philippines 1 46.26 1.12 152.01 2.00
## Portugal 1 28.96 2.85 579.51 7.48
## South Africa 1 31.94 2.28 651.11 2.19
## South Rhodesia 1 31.92 1.52 250.96 2.00
## Spain 1 27.74 2.87 768.79 4.35
## Sweden 1 21.44 4.54 3299.49 3.01
## Switzerland 1 23.49 3.73 2630.96 2.70
## Turkey 1 43.42 1.08 389.66 2.96
## Tunisia 1 46.12 1.21 249.87 1.13
## United Kingdom 1 23.27 4.46 1813.93 2.01
## United States 1 29.81 3.43 4001.89 2.45
## Venezuela 1 46.40 0.90 813.39 0.53
## Zambia 1 45.25 0.56 138.33 5.14
## Jamaica 1 41.12 1.73 380.47 10.23
## Uruguay 1 28.13 2.72 766.54 1.88
## Libya 1 43.69 2.07 123.58 16.71
## Malaysia 1 47.20 0.66 242.69 5.08
## attr(,"assign")
## [1] 0 1 2 3 4

lev = hat( X )
lev
## [1] 0.06771343 0.12038393 0.08748248 0.08947114 0.06955944 0.15840239
## [7] 0.03729796 0.07795899 0.05730171 0.07546780 0.06271782 0.06372651
## [13] 0.09204246 0.13620478 0.08735739 0.09662073 0.06049212 0.06008079
## [19] 0.07049590 0.07145213 0.21223634 0.06651170 0.22330989 0.06079915
## [25] 0.08634787 0.07940290 0.04793213 0.09061400 0.05421789 0.05035056
## [31] 0.03897459 0.06937188 0.06504891 0.06425415 0.09714946 0.06510405
## [37] 0.16080923 0.07732854 0.12398898 0.07359423 0.03964224 0.07456729
## [43] 0.11651375 0.33368800 0.08628365 0.06433163 0.14076016 0.09794717
## [49] 0.53145676 0.06523300

# or
lev = hatvalues( g )
lev
## Australia Austria Belgium Bolivia Brazil
## 0.06771343 0.12038393 0.08748248 0.08947114 0.06955944
## Canada Chile China Colombia Costa Rica
## 0.15840239 0.03729796 0.07795899 0.05730171 0.07546780
## Denmark Ecuador Finland France Germany
## 0.06271782 0.06372651 0.09204246 0.13620478 0.08735739
## Greece Guatamala Honduras Iceland India
## 0.09662073 0.06049212 0.06008079 0.07049590 0.07145213
## Ireland Italy Japan Korea Luxembourg
## 0.21223634 0.06651170 0.22330989 0.06079915 0.08634787
## Malta Norway Netherlands New Zealand Nicaragua
## 0.07940290 0.04793213 0.09061400 0.05421789 0.05035056
## Panama Paraguay Peru Philippines Portugal
## 0.03897459 0.06937188 0.06504891 0.06425415 0.09714946
## South Africa South Rhodesia Spain Sweden Switzerland
## 0.06510405 0.16080923 0.07732854 0.12398898 0.07359423
## Turkey Tunisia United Kingdom United States Venezuela
## 0.03964224 0.07456729 0.11651375 0.33368800 0.08628365
## Zambia Jamaica Uruguay Libya Malaysia
## 0.06433163 0.14076016 0.09794717 0.53145676 0.06523300

Alternatively, we can compute H manually and then extract its diagonal elements:

# manually
H = X %*% solve( t( X ) %*% X ) %*% t( X )
lev = diag( H )
sum( lev ) # check: sum_i h_ii = p
## [1] 5

REMARK: The trace of the H matrix (the sum of its diagonal elements) is equal to the rank of the X matrix, which is p (number of covariates + 1 for the intercept), assuming that the covariates are all linearly independent and p < n.

Rule of thumb
Observation i is considered a leverage point if $h_{ii} > 2p/n$.

plot( g$fitted.values, lev, ylab = "Leverages", main = "Plot of Leverages",
      pch = 16, col = 'black' )
abline( h = 2 * p/n, lty = 2, col = 'red' )
watchout_points_lev = lev[ which( lev > 2 * p/n ) ]
watchout_ids_lev = seq_along( lev )[ which( lev > 2 * p/n ) ]
points( g$fitted.values[ watchout_ids_lev ], watchout_points_lev, col = 'red', pch = 16 )

[Figure: leverages vs fitted values, with the threshold 2p/n as a dashed line and the points above it highlighted in red]

lev[ lev > 2 * 5 / 50 ]
## Ireland Japan United States Libya
## 0.2122363 0.2233099 0.3336880 0.5314568
sum( lev[ lev > 2 * 5 / 50 ] )
## [1] 1.300691

3.b Fit the model without the leverage points.

Solution

gl = lm( sr ~ pop15 + pop75 + dpi + ddpi, savings, subset = ( lev < 0.2 ) )
summary( gl )
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings,
## subset = (lev < 0.2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9632 -2.6323 0.1466 2.2529 9.6687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.221e+01 9.319e+00 2.384 0.0218 *
## pop15 -3.403e-01 1.798e-01 -1.893 0.0655 .
## pop75 -1.124e+00 1.398e+00 -0.804 0.4258
## dpi -4.499e-05 1.160e-03 -0.039 0.9692
## ddpi 5.273e-01 2.775e-01 1.900 0.0644 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.805 on 41 degrees of freedom
## Multiple R-squared: 0.2959, Adjusted R-squared: 0.2272
## F-statistic: 4.308 on 4 and 41 DF, p-value: 0.005315
# summary( g )

Moreover, investigate the relative variation of $\hat{\beta}$ due to these influential points.

abs( ( g$coefficients - gl$coefficients ) / g$coefficients )
## (Intercept) pop15 pop75 dpi ddpi
## 0.2223914 0.2622274 0.3353998 0.8664714 0.2871002

The leverage points affect the estimates heavily (there is a variation of at least 22%). We can also visualize the position of the leverage points for each couple of covariates.

colors = rep( 'black', nrow( savings ) )
colors[ watchout_ids_lev ] = c( 'red', 'blue', 'green', 'orange' )
pairs( savings[ , c( 'sr', 'pop15', 'pop75', 'dpi', 'ddpi' ) ],
       pch = 16, col = colors, cex = 1 + 0.5 * as.numeric( colors != 'black' ) )

[Figure: scatterplot matrix of sr, pop15, pop75, dpi and ddpi, with the four leverage points highlighted in colour]

Standardized Residuals

3.c Plot the residuals of the complete model.
Solution

# raw residuals (neither standardized nor studentized)
plot( g$res, ylab = "Residuals", main = "Plot of residuals" )

sort( g$res )
## Chile Iceland Paraguay Korea Sweden
## -8.2422307 -6.2105820 -6.1257589 -6.1069814 -4.2602834
## Guatamala Panama Greece Jamaica Malaysia
## -3.3552838 -3.2941656 -3.1161685 -3.0185314 -2.9708690
## Libya Tunisia United Kingdom Turkey Ecuador
## -2.8295257 -2.8179200 -2.6924128 -2.6656824 -2.4056313
## Uruguay Finland Luxembourg Colombia United States
## -2.2638273 -1.6810857 -1.6708066 -1.4517071 -1.1115901
## Norway Portugal Bolivia Spain Canada
## -0.8717854 -0.7684447 -0.6983191 -0.6711565 -0.3168924
## Germany Netherlands South Africa India Austria
## -0.1806993 0.4255455 0.4831656 0.5086740 0.6163860
## Nicaragua Honduras Australia South Rhodesia Italy
## 0.6463966 0.7100245 0.8635798 1.2914342 1.9267549
## Belgium New Zealand France Switzerland China
## 2.2189579 2.2855548 2.4754718 2.4868259 2.5360361
## Malta Ireland Brazil Venezuela Costa Rica
## 2.9749098 3.3911306 3.5528094 3.6325177 5.1250782
## Japan Denmark Peru Philippines Zambia
## 5.2814855 5.4002388 6.5394410 6.6750084 9.7509138

sort( g$res )[ c( 1, 50 ) ]
## Chile Zambia
## -8.242231 9.750914

countries = row.names( savings )
identify( 1:50, g$res, countries )

[Figure: plot of residuals vs observation index]

## integer(0)
# click twice on the points you want to label
# it works only from the console with the plot window open

identify is a useful function for detecting influential points: its inputs are the x and y coordinates of the plotted points and the labels of the data. Usually, the residuals are plotted against the fitted values or against the single predictors. This is useful also for checking the model hypotheses, homoscedasticity and normality of the residuals (we are going to explain them later). The representation with the index of the observation on the x-axis is not that useful (except if we are interested in investigating the distribution of the residuals with respect to the procedure used for data collection).

3.d Plot the Standardized Residuals of the complete model.

Rule of thumb
Standardized residuals are defined as
$$r_i^{std} = \frac{y_i - \hat{y}_i}{\hat{S}} = \frac{\hat{\varepsilon}_i}{\hat{S}},$$
where $\hat{S}$ is the estimated standard deviation of the errors (the residual standard error, gs$sigma in the code below). Influential points satisfy the inequality
$$|r_i^{std}| > 2.$$

Solution

It is easy to see that the influential points flagged by the standardized residuals and those flagged by the leverages are different.
gs = summary( g )
res_std = g$res / gs$sigma
watchout_ids_rstd = which( abs( res_std ) > 2 )
watchout_rstd = res_std[ watchout_ids_rstd ]
watchout_rstd
## Chile Zambia
## -2.167486 2.564229

# standardized residuals
plot( g$fitted.values, res_std, ylab = "Standardized Residuals",
      main = "Standardized Residuals" )
abline( h = c( -2, 2 ), lty = 2, col = 'orange' )
points( g$fitted.values[ watchout_ids_rstd ], res_std[ watchout_ids_rstd ],
        col = 'red', pch = 16 )
points( g$fitted.values[ watchout_ids_lev ], res_std[ watchout_ids_lev ],
        col = 'orange', pch = 16 )
legend( 'topright', col = c( 'red', 'orange' ),
        c( 'Standardized Residuals', 'Leverages' ), pch = rep( 16, 2 ), bty = 'n' )

sort( g$res / gs$sigma )
## Chile Iceland Paraguay Korea Sweden
## -2.16748590 -1.63321671 -1.61091051 -1.60597254 -1.12034042
## Guatamala Panama Greece Jamaica Malaysia
## -0.88234977 -0.86627732 -0.81946884 -0.79379291 -0.78125899
## Libya Tunisia United Kingdom Turkey Ecuador
## -0.74408946 -0.74103748 -0.70803244 -0.70100307 -0.63261660
## Uruguay Finland Luxembourg Colombia United States
## -0.59532595 -0.44208050 -0.43937737 -0.38176008 -0.29231843
## Norway Portugal Bolivia Spain Canada
## -0.22925620 -0.20208037 -0.18363922 -0.17649617 -0.08333420
## Germany Netherlands South Africa India Austria
## -0.04751907 0.11190707 0.12705960 0.13376764 0.16209300
## Nicaragua Honduras Australia South Rhodesia Italy
## 0.16998499 0.18671742 0.22709835 0.33961261 0.50668493
## Belgium New Zealand France Switzerland China
## 0.58352650 0.60103970 0.65098279 0.65396859 0.66690956
## Malta Ireland Brazil Venezuela Costa Rica
## 0.78232159 0.89177653 0.93429372 0.95525486 1.34775829
## Japan Denmark Peru Philippines Zambia
## 1.38888924 1.42011817 1.71969783 1.75534843 2.56422914

sort( g$res / gs$sigma )[ c( 1, 50 ) ]
## Chile Zambia
## -2.167486 2.564229

countries = row.names( savings )
identify( 1:50, g$res / gs$sigma, countries )

[Figure: standardized residuals vs fitted values, with the thresholds ±2 as dashed lines, standardized-residual outliers in red and leverage points in orange]

## integer(0)
# click twice on the points you want to label

Studentized Residuals

3.e Compute the Studentized Residuals, highlighting the influential points.

Solution

Studentized residuals $r_i$ are computed as
$$r_i = \frac{\hat{\varepsilon}_i}{\hat{S}\,\sqrt{1 - h_{ii}}} \sim t_{n-p}.$$
Since $r_i$ is distributed according to a Student-t with $(n - p)$ degrees of freedom, we can compute a p-value to test whether the i-th point is an outlier.

gs = summary( g )
gs$sigma
## [1] 3.802669

# manually
stud = g$residuals / ( gs$sigma * sqrt( 1 - lev ) )
# rstandard gives studentized residuals automatically
stud = rstandard( g )
watchout_ids_stud = which( abs( stud ) > 2 )
watchout_stud = stud[ watchout_ids_stud ]
watchout_stud
## Chile Zambia
## -2.209074 2.650915

plot( g$fitted.values, stud, ylab = "Studentized Residuals",
      main = "Studentized Residuals", pch = 16 )
points( g$fitted.values[ watchout_ids_stud ], stud[ watchout_ids_stud ],
        col = 'pink', pch = 16 )
points( g$fitted.values[ watchout_ids_lev ], stud[ watchout_ids_lev ],
        col = 'orange', pch = 16 )
abline( h = c( -2, 2 ), lty = 2, col = 'orange' )
legend( 'topright', col = c( 'pink', 'orange' ),
        c( 'Studentized Residual', 'Leverages' ), pch = rep( 16, 2 ), bty = 'n' )

[Figure: studentized residuals vs fitted values, with the thresholds ±2 as dashed lines and the flagged points highlighted]

Studentized residuals and standardized residuals identify the same influential points in this case.
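A side note not in the original handout: rstandard() returns internally studentized residuals (the $r_i$ above), while R also provides rstudent() for externally studentized residuals, in which the error standard deviation is re-estimated leaving observation i out. A minimal comparison sketch:

# externally studentized residuals (leave-one-out estimate of sigma)
stud_ext = rstudent( g )
# flag with the same |.| > 2 rule used above
which( abs( stud_ext ) > 2 )

For well-behaved data the two versions flag very similar sets of points, but the external version is more sensitive to single large outliers.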
Cook's distance

Cook's distance is a commonly used influence measure that combines the two characteristics of an influential point, the residual effect and the leverage, i.e. how observations are fitted by the model and how influential they are for the fit. It can be expressed as
$$C_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},$$
in which the $r_i$ are the studentized residuals.

Rule of thumb
A point is defined influential if
$$C_i > \frac{4}{n - p}.$$

Cdist = cooks.distance( g )
watchout_ids_Cdist = which( Cdist > 4 / ( n - p ) )
watchout_Cdist = Cdist[ watchout_ids_Cdist ]
watchout_Cdist
## Japan Zambia Libya
## 0.14281625 0.09663275 0.26807042

Three suspect points are detected.

par( mfrow = c( 1, 3 ) )
plot( g$fitted.values, Cdist, pch = 16, xlab = 'Fitted values',
      ylab = 'Cooks Distance', main = 'Cooks Distance' )
points( g$fitted.values[ watchout_ids_Cdist ], Cdist[ watchout_ids_Cdist ],
        col = 'green', pch = 16 )
plot( g$fitted.values, stud, pch = 16, xlab = 'Fitted values',
      ylab = 'Studentized Residuals', main = 'Studentized Residuals' )
points( g$fitted.values[ watchout_ids_stud ], stud[ watchout_ids_stud ],
        col = 'pink', pch = 16 )
plot( g$fitted.values, lev, pch = 16, xlab = 'Fitted values',
      ylab = 'Leverages', main = 'Leverages' )
points( g$fitted.values[ watchout_ids_lev ], lev[ watchout_ids_lev ],
        col = 'orange', pch = 16 )

[Figure: three panels against fitted values - Cook's distance, studentized residuals and leverages - each with its flagged points highlighted]

3.f Fit the model without the points that are influential according to Cook's distance, and compare the outcome with the former model (on the complete dataset).

Solution

# id_to_keep = ( 1:n )[ -watchout_ids_Cdist ]
id_to_keep = !( 1:n %in% watchout_ids_Cdist )
gl = lm( sr ~ pop15 + pop75 + dpi + ddpi, savings[ id_to_keep, ] )
summary( gl )
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings[id_to_keep,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4552 -2.5129 -0.1117 1.7477 6.6646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8321854 7.8341981 2.531 0.0152 *
## pop15 -0.3047007 0.1513371 -2.013 0.0505 .
## pop75 -0.3030249 1.1135478 -0.272 0.7869
## dpi -0.0005535 0.0008469 -0.654 0.5170
## ddpi 0.4137823 0.2526006 1.638 0.1089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.441 on 42 degrees of freedom
## Multiple R-squared: 0.3503, Adjusted R-squared: 0.2885
## F-statistic: 5.662 on 4 and 42 DF, p-value: 0.0009742

Observe that the fit, in terms of $R^2$, slightly improved with respect to the complete model.

abs( ( gl$coef - g$coef ) / g$coef )
## (Intercept) pop15 pop75 dpi ddpi
## 0.305743704 0.339320881 0.820854095 0.642906116 0.009976742

The coefficient of dpi changed by about 64%, the coefficient of pop75 by 82%.

Influence Plot
The influence plot represents the studentized residuals vs the leverages, and highlights each point with a circle whose size is proportional to its Cook's distance.
influencePlot( g, id.method = "identify", main = "Influence Plot",
               sub = "Circle size is proportional to Cook's Distance" )
## Warning in plot.window(...): "id.method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.method" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.method" is not
## a graphical parameter
## Warning in box(...): "id.method" is not a graphical parameter
## Warning in title(...): "id.method" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.method" is not a
## graphical parameter

[Figure: influence plot of studentized residuals vs hat-values, circle size proportional to Cook's distance; Chile, Japan, United States, Zambia and Libya are labelled]

## StudRes Hat CookD
## Chile -2.3134295 0.03729796 0.03781324
## Japan 1.6032158 0.22330989 0.14281625
## United States -0.3546151 0.33368800 0.01284481
## Zambia 2.8535583 0.06433163 0.09663275
## Libya -1.0893033 0.53145676 0.26807042

There is another easy way to visually detect the influential points according to Cook's distance.

plot( g, which = 5 )

[Figure: "Residuals vs Leverage" diagnostic plot with Cook's distance contours at 0.5 and 1; Libya, Japan and Zambia are labelled]

Influence measures
influence.measures produces an object of class "infl", a tabular display showing several diagnostic measures (such as $h_{ii}$ and Cook's distance). The cases which are influential with respect to any of these measures are marked with an asterisk.

influence.measures( g )
## Influence measures of
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings) :
##
## dfb.1_ dfb.pp15 dfb.pp75 dfb.dpi dfb.ddpi dffit cov.r
## Australia 0.01232 -0.01044 -0.02653 0.04534 -0.000159 0.0627 1.193
## Austria -0.01005 0.00594 0.04084 -0.03672 -0.008182 0.0632 1.268
## Belgium -0.06416 0.05150 0.12070 -0.03472 -0.007265 0.1878 1.176
## Bolivia 0.00578 -0.01270 -0.02253 0.03185 0.040642 -0.0597 1.224
## Brazil 0.08973 -0.06163 -0.17907 0.11997 0.068457 0.2646 1.082
## Canada 0.00541 -0.00675 0.01021 -0.03531 -0.002649 -0.0390 1.328
## Chile -0.19941 0.13265 0.21979 -0.01998 0.120007 -0.4554 0.655
## China 0.02112 -0.00573 -0.08311 0.05180 0.110627 0.2008 1.150
## Colombia 0.03910 -0.05226 -0.02464 0.00168 0.009084 -0.0960 1.167
## Costa Rica -0.23367 0.28428 0.14243 0.05638 -0.032824 0.4049 0.968
## Denmark -0.04051 0.02093 0.04653 0.15220 0.048854 0.3845 0.934
## Ecuador 0.07176 -0.09524 -0.06067 0.01950 0.047786 -0.1695 1.139
## Finland -0.11350 0.11133 0.11695 -0.04364 -0.017132 -0.1464 1.203
## France -0.16600 0.14705 0.21900 -0.02942 0.023952 0.2765 1.226
## Germany -0.00802 0.00822 0.00835 -0.00697 -0.000293 -0.0152 1.226
## Greece -0.14820 0.16394 0.02861 0.15713 -0.059599 -0.2811 1.140
## Guatamala 0.01552 -0.05485 0.00614 0.00585 0.097217 -0.2305 1.085
## Honduras -0.00226 0.00984 -0.01020 0.00812 -0.001887 0.0482 1.186
## Iceland 0.24789 -0.27355 -0.23265 -0.12555 0.184698 -0.4768 0.866
## India 0.02105 -0.01577 -0.01439 -0.01374 -0.018958 0.0381 1.202
## Ireland -0.31001 0.29624 0.48156 -0.25733 -0.093317 0.5216 1.268
## Italy 0.06619 -0.07097 0.00307 -0.06999 -0.028648 0.1388 1.162
## Japan 0.63987 -0.65614 -0.67390 0.14610 0.388603 0.8597 1.085
## Korea -0.16897 0.13509 0.21895 0.00511 -0.169492 -0.4303 0.870
## Luxembourg -0.06827 0.06888 0.04380 -0.02797 0.049134 -0.1401 1.196
## Malta 0.03652 -0.04876 0.00791 -0.08659 0.153014 0.2386 1.128
## Norway 0.00222 -0.00035 -0.00611 -0.01594 -0.001462 -0.0522 1.168
## Netherlands 0.01395 -0.01674 -0.01186 0.00433 0.022591 0.0366 1.229
## New Zealand -0.06002 0.06510 0.09412 -0.02638 -0.064740 0.1469 1.134
## Nicaragua -0.01209 0.01790 0.00972 -0.00474 -0.010467 0.0397 1.174
## Panama 0.02828 -0.05334 0.01446 -0.03467 -0.007889 -0.1775 1.067
## Paraguay -0.23227 0.16416 0.15826 0.14361 0.270478 -0.4655 0.873
## Peru -0.07182 0.14669 0.09148 -0.08585 -0.287184 0.4811 0.831
## Philippines -0.15707 0.22681 0.15743 -0.11140 -0.170674 0.4884 0.818
## Portugal -0.02140 0.02551 -0.00380 0.03991 -0.028011 -0.0690 1.233
## South Africa 0.02218 -0.02030 -0.00672 -0.02049 -0.016326 0.0343 1.195
## South Rhodesia 0.14390 -0.13472 -0.09245 -0.06956 -0.057920 0.1607 1.313
## Spain -0.03035 0.03131 0.00394 0.03512 0.005340 -0.0526 1.208
## Sweden 0.10098 -0.08162 -0.06166 -0.25528 -0.013316 -0.4526 1.086
## Switzerland 0.04323 -0.04649 -0.04364 0.09093 -0.018828 0.1903 1.147
## Turkey -0.01092 -0.01198 0.02645 0.00161 0.025138 -0.1445 1.100
## Tunisia 0.07377 -0.10500 -0.07727 0.04439 0.103058 -0.2177 1.131
## United Kingdom 0.04671 -0.03584 -0.17129 0.12554 0.100314 -0.2722 1.189
## United States 0.06910 -0.07289 0.03745 -0.23312 -0.032729 -0.2510 1.655
## Venezuela -0.05083 0.10080 -0.03366 0.11366 -0.124486 0.3071 1.095
## Zambia 0.16361 -0.07917 -0.33899 0.09406 0.228232 0.7482 0.512
## Jamaica 0.10958 -0.10022 -0.05722 -0.00703 -0.295461 -0.3456 1.200
## Uruguay -0.13403 0.12880 0.02953 0.13132 0.099591 -0.2051 1.187
## Libya 0.55074 -0.48324 -0.37974 -0.01937 -1.024477 -1.1601 2.091
## Malaysia 0.03684 -0.06113 0.03235 -0.04956 -0.072294 -0.2126 1.113
## cook.d hat inf
## Australia 8.04e-04 0.0677
## Austria 8.18e-04 0.1204
## Belgium 7.15e-03 0.0875
## Bolivia 7.28e-04 0.0895
## Brazil 1.40e-02 0.0696
## Canada 3.11e-04 0.1584
## Chile 3.78e-02 0.0373 *
## China 8.16e-03 0.0780
## Colombia 1.88e-03 0.0573
## Costa Rica 3.21e-02 0.0755
## Denmark 2.88e-02 0.0627
## Ecuador 5.82e-03 0.0637
## Finland 4.36e-03 0.0920
## France 1.55e-02 0.1362
## Germany 4.74e-05 0.0874
## Greece 1.59e-02 0.0966
## Guatamala 1.07e-02 0.0605
## Honduras 4.74e-04 0.0601
## Iceland 4.35e-02 0.0705
## India 2.97e-04 0.0715
## Ireland 5.44e-02 0.2122
## Italy 3.92e-03 0.0665
## Japan 1.43e-01 0.2233
## Korea 3.56e-02 0.0608
## Luxembourg 3.99e-03 0.0863
## Malta 1.15e-02 0.0794
## Norway 5.56e-04 0.0479
## Netherlands 2.74e-04 0.0906
## New Zealand 4.38e-03 0.0542
## Nicaragua 3.23e-04 0.0504
## Panama 6.33e-03 0.0390
## Paraguay 4.16e-02 0.0694
## Peru 4.40e-02 0.0650
## Philippines 4.52e-02 0.0643
## Portugal 9.73e-04 0.0971
## South Africa 2.41e-04 0.0651
## South Rhodesia 5.27e-03 0.1608
## Spain 5.66e-04 0.0773
## Sweden 4.06e-02 0.1240
## Switzerland 7.33e-03 0.0736
## Turkey 4.22e-03 0.0396
## Tunisia 9.56e-03 0.0746
## United Kingdom 1.50e-02 0.1165
## United States 1.28e-02 0.3337 *
## Venezuela 1.89e-02 0.0863
## Zambia 9.66e-02 0.0643 *
## Jamaica 2.40e-02 0.1408
## Uruguay 8.53e-03 0.0979
## Libya 2.68e-01 0.5315 *
## Malaysia 9.11e-03 0.0652

There are other indices for detecting influential points, such as DFBETAs and DFFITs:
https://cran.r-project.org/web/packages/olsrr/vignettes/influence_measures.html

4. Hypotheses of the model

Homoscedasticity

4.a Plot the residuals ($\hat{\varepsilon}$) vs the fitted values ($\hat{y}$). A sequence of random variables is homoscedastic if all its random variables have the same finite variance.
Homoscedasticity is a fundamental assumption of OLS.

plot( g$fit, g$res, xlab = "Fitted", ylab = "Residuals",
      main = "Residuals vs Fitted Values", pch = 16 )
abline( h = 0, lwd = 2, lty = 2, col = 'red' )
# variability looks sufficiently uniform

[Figure: residuals vs fitted values with a dashed horizontal line at 0]

The residuals vs fitted values plot does not show any particular features that challenge the normality and constant-variance assumptions on the residuals. Alternatively, we can plot

plot( g, which = 1 )

[Figure: "Residuals vs Fitted" diagnostic plot; Zambia, Chile and Philippines are labelled]

Normality

4.b Plot the QQ plot and test the normality of the residuals with the Shapiro-Wilk test. Normality of the data is another strong assumption of the OLS model.

# QQ plot
qqnorm( g$res, ylab = "Raw Residuals", pch = 16 )
qqline( g$res )

[Figure: normal Q-Q plot of the raw residuals]

# the trend is adequately straight: the hypothesis is satisfied

# Shapiro-Wilk normality test
shapiro.test( g$res )
##
## Shapiro-Wilk normality test
##
## data: g$res
## W = 0.98698, p-value = 0.8524

Since the p-value is very high, we have no evidence to reject $H_0$, which is the Gaussianity of the data. We can also check whether the studentized residuals follow a Gaussian distribution.

qqnorm( rstandard( g ), ylab = "Studentized residuals", pch = 16 )
abline( 0, 1 )

[Figure: normal Q-Q plot of the studentized residuals with the reference line y = x]

The histogram and the boxplot of the residuals are useful tools to look at the shape of the residual distribution and to see whether the tails of the distribution are suspiciously heavy (i.e., there are many outliers), respectively.

# other useful tools...
hist( g$res, 10, probability = TRUE, col = 'lavender', main = 'residuals' )

[Figure: histogram of the residuals with density scale]

boxplot( g$res, main = "Boxplot of savings residuals", pch = 16, col = 'lavender' )

[Figure: boxplot of the residuals]

Nonlinearity/Collinearity

Partial regression plots
Partial Regression or Added Variable plots can help isolate the effect of $x_i$ on $y$:
1. Regress $y$ on all the covariates except $x_i$ and take the residuals $\hat{\delta}$: this represents $y$ with the effect of the other covariates taken out.
2. Regress $x_i$ on all the covariates except $x_i$ and take the residuals $\hat{\gamma}$: this represents $x_i$ with the effect of the other covariates taken out.
3. Plot $\hat{\delta}$ against $\hat{\gamma}$.
The slope of a line fitted to the plot adds some insight into the meaning of the regression coefficients. Look for non-linearity and for outliers and/or influential points.

Partial Residual plots
Partial Residual plots are a competitor to added variable plots. These plot $\hat{\varepsilon} + \hat{\beta}_i x_i$ against $x_i$. The slope on the plot has the same interpretation as in partial regression plots. Partial residual plots are reckoned to be better for non-linearity detection, while added variable plots are better for outlier/influential point detection.

First we construct a partial regression (added variable) plot for pop15:

[...]

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
0.1 1 ## ## Residual standard error: 0.7448 on 42 degrees of freedom ## Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922 ## F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10 We can observe that income, illiteracy, and area seem to be non significant. Also for population and frost we have not such a high p-value. It may depend by many issues: variables could eectively be non informative, there could be high collinearity, there could be too much noise in the model. We need a systematic method to discriminate what variables fit better this model. Manual backward selection method At each step we remove the predictor with the largest p-value and (ideally) stop when we have only predictors with p-values below 0.05 help (update ) ## Help on topic update was found in the following packages: ## ## Package Library ## BAS /Library/Frameworks/R.framework/Versions/3.6/Resources/library ## stats /Library/Frameworks/R.framework/Versions/3.6/Resources/library ## data.table /Library/Frameworks/R.framework/Versions/3.6/Resources/library ## ## ## Using the first match ... help (update.formula ) # remove Area g1 = update ( g, . ~ . - Area ) summary ( g1 ) ## ## Call: ## lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + ## HS.Grad + Frost, data = statedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.49047 -0.52533 -0.02546 0.57160 1.50374 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.099e+01 1.387e+00 51.165 < 2e-16 *** ## Population 5.188e-05 2.879e-05 1.802 0.0785 . ## Income -2.444e-05 2.343e-04 -0.104 0.9174 ## Illiteracy 2.846e-02 3.416e-01 0.083 0.9340 ## Murder -3.018e-01 4.334e-02 -6.963 1.45e-08 *** 48 ## HS.Grad 4.847e-02 2.067e-02 2.345 0.0237 * ## Frost -5.776e-03 2.970e-03 -1.945 0.0584 . ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.7361 on 43 degrees of freedom ## Multiple R-squared: 0.7361, Adjusted R-squared: 0.6993 ## F-statistic: 19.99 on 6 and 43 DF, p-value: 5.362e-11 # remove Illiteracy g2 = update ( g1, . ~ . - Illiteracy ) summary ( g2 ) ## ## Call: ## lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + ## Frost, data = statedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.4892 -0.5122 -0.0329 0.5645 1.5166 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.107e+01 1.029e+00 69.067 < 2e-16 *** ## Population 5.115e-05 2.709e-05 1.888 0.0657 . ## Income -2.477e-05 2.316e-04 -0.107 0.9153 ## Murder -3.000e-01 3.704e-02 -8.099 2.91e-10 *** ## HS.Grad 4.776e-02 1.859e-02 2.569 0.0137 * ## Frost -5.910e-03 2.468e-03 -2.395 0.0210 * ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.7277 on 44 degrees of freedom ## Multiple R-squared: 0.7361, Adjusted R-squared: 0.7061 ## F-statistic: 24.55 on 5 and 44 DF, p-value: 1.019e-11 # Remove Income g3 = update ( g2, . ~ . - Income ) summary ( g3 ) ## ## Call: ## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, ## data = statedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.47095 -0.53464 -0.03701 0.57621 1.50683 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 *** ## Population 5.014e-05 2.512e-05 1.996 0.05201 . ## Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 *** 49 ## HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 ** ## Frost -5.943e-03 2.421e-03 -2.455 0.01802 * ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 
0.1 1 ## ## Residual standard error: 0.7197 on 45 degrees of freedom ## Multiple R-squared: 0.736, Adjusted R-squared: 0.7126 ## F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12 # remove Population g4 = update ( g3, . ~ . - Population ) summary ( g4 ) ## ## Call: ## lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = statedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.5015 -0.5391 0.1014 0.5921 1.2268 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 71.036379 0.983262 72.246 < 2e-16 *** ## Murder -0.283065 0.036731 -7.706 8.04e-10 *** ## HS.Grad 0.049949 0.015201 3.286 0.00195 ** ## Frost -0.006912 0.002447 -2.824 0.00699 ** ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.7427 on 46 degrees of freedom ## Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939 ## F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12 The final removal of variable Population is a close call. We may want to consider including this variable if interpretation is aided. Notice that the R2for the full model of 0.736 is reduced only slightly to 0.713 in the final model. Thus the removal of four predictors causes only a minor reduction in fit. Let’s observe what are the correlated variables. X= statedata [ , -4 ] #not considering the response variable cor (X) ## Population Income Illiteracy Murder HS.Grad Frost ## Population 1.00000000 0.2082276 0.10762237 0.3436428 -0.09848975 -0.3321525 ## Income 0.20822756 1.0000000 -0.43707519 -0.2300776 0.61993232 0.2262822 ## Illiteracy 0.10762237 -0.4370752 1.00000000 0.7029752 -0.65718861 -0.6719470 ## Murder 0.34364275 -0.2300776 0.70297520 1.0000000 -0.48797102 -0.5388834 ## HS.Grad -0.09848975 0.6199323 -0.65718861 -0.4879710 1.00000000 0.3667797 ## Frost -0.33215245 0.2262822 -0.67194697 -0.5388834 0.36677970 1.0000000 ## Area 0.02254384 0.3633154 0.07726113 0.2283902 0.33354187 0.0592291 ## Area ## Population 0.02254384 ## Income 0.36331544 ## Illiteracy 0.07726113 50 ## Murder 0.22839021 ## HS.Grad 0.33354187 ## Frost 0.05922910 ## Area 1.00000000 # Note: - the high positive correlation between Murder and Illiteracy, Income and HS.Grad # - the high negative correlation between Murder and HS.Grad, Illiteracy and Frost heatmap ( cor ( X ), Rowv = NA , Colv = NA , symm = TRUE , keep.dendro = F) Population Income Illiteracy Murder HS.Grad Frost Area Population Income Illiteracy Murder HS.Grad Frost Area #image( as.matrix( cor( X ) ), main = Correlation of X ) pairs (X) 51 Population 3000 2 10 0 150 0 10000 3000 5000 Income Illiteracy 0.5 1.5 2.5 2 6 12 Murder HS.Grad 40 55 0 100 Frost 0 20000 0.5 2.5 40 60 0e+00 4e+05 0e+00 Area Mind spurious correlations! http://www.tylervigen.com/spurious-correlations http://guessthecorrelation.com Automatic backward selection method Now, we see come information criterion to decide what is the best variable subset. Criterion based procedures 1. AIC & BIC 2. R2adjusted 3. Mallow’s Cp 1. AIC & BIC The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby relative quality of statistical models for a given set of data. Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. AI C = ≠2·log( likelihood ) +2 k where kis the number of parameter in the model. 
The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. BIC generally penalizes free parameters more strongly than AIC, although this depends on the size of $n$ and on the relative magnitudes of $n$ and $k$. It works well when $n \gg k$:
$$BIC = -2 \cdot \log(\text{likelihood}) + k \cdot \log(n)$$

g = lm( Life.Exp ~ ., data = statedata )
AIC( g )
## [1] 121.7092
BIC( g )
## [1] 138.9174

These are the AIC and BIC of the complete regression model.

step( g, direction = "backward", trace = T )
## Start: AIC=-22.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Area 1 0.0011 23.298 -24.182
## - Income 1 0.0044 23.302 -24.175
## - Illiteracy 1 0.0047 23.302 -24.174
## <none> 23.297 -22.185
## - Population 1 1.7472 25.044 -20.569
## - Frost 1 1.8466 25.144 -20.371
## - HS.Grad 1 2.4413 25.738 -19.202
## - Murder 1 23.1411 46.438 10.305
##
## Step: AIC=-24.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost
##
## Df Sum of Sq RSS AIC
## - Illiteracy 1 0.0038 23.302 -26.174
## - Income 1 0.0059 23.304 -26.170
## <none> 23.298 -24.182
## - Population 1 1.7599 25.058 -22.541
## - Frost 1 2.0488 25.347 -21.968
## - HS.Grad 1 2.9804 26.279 -20.163
## - Murder 1 26.2721 49.570 11.569
##
## Step: AIC=-26.17
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## - Income 1 0.006 23.308 -28.161
## <none> 23.302 -26.174
## - Population 1 1.887 25.189 -24.280
## - Frost 1 3.037 26.339 -22.048
## - HS.Grad 1 3.495 26.797 -21.187
## - Murder 1 34.739 58.041 17.456
##
## Step: AIC=-28.16
## Life.Exp ~ Population + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
##
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
## data = statedata)
##
## Coefficients:
## (Intercept) Population Murder HS.Grad Frost
## 7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03

# k = 2 is the default, for pure AIC

Population, Murder, HS.Grad and Frost is the selected subset. Observe that we keep the three variables selected according to the p-values, and also Population, which was controversial.
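As a side comparison (an illustrative addition, not in the original handout), step() can also search in both directions, allowing previously dropped variables to re-enter; with the default k = 2 this again uses AIC:

# stepwise search in both directions, AIC penalty (k = 2 is the default)
g_AIC_both = step( g, direction = "both", trace = FALSE )
g_AIC_both$call

On these data the result coincides with the backward search; the two strategies can differ when predictors are strongly correlated.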
g_BIC_back = step( g, direction = "backward", k = log( n ), trace = T )
## Start: AIC=-6.89
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Area 1 0.0011 23.298 -10.7981
## - Income 1 0.0044 23.302 -10.7910
## - Illiteracy 1 0.0047 23.302 -10.7903
## - Population 1 1.7472 25.044 -7.1846
## - Frost 1 1.8466 25.144 -6.9866
## <none> 23.297 -6.8884
## - HS.Grad 1 2.4413 25.738 -5.8178
## - Murder 1 23.1411 46.438 23.6891
##
## Step: AIC=-10.8
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost
##
## Df Sum of Sq RSS AIC
## - Illiteracy 1 0.0038 23.302 -14.7021
## - Income 1 0.0059 23.304 -14.6975
## - Population 1 1.7599 25.058 -11.0691
## <none> 23.298 -10.7981
## - Frost 1 2.0488 25.347 -10.4960
## - HS.Grad 1 2.9804 26.279 -8.6912
## - Murder 1 26.2721 49.570 23.0406
##
## Step: AIC=-14.7
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## - Income 1 0.006 23.308 -18.601
## - Population 1 1.887 25.189 -14.720
## <none> 23.302 -14.702
## - Frost 1 3.037 26.339 -12.488
## - HS.Grad 1 3.495 26.797 -11.627
## - Murder 1 34.739 58.041 27.017
##
## Step: AIC=-18.6
## Life.Exp ~ Population + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -18.601
## - Population 1 2.064 25.372 -18.271
## - Frost 1 3.122 26.430 -16.228
## - HS.Grad 1 5.112 28.420 -12.598
## - Murder 1 34.816 58.124 23.176

# k = log(n) is for BIC

Population, Murder, HS.Grad and Frost is the selected subset also according to BIC.

2. Adjusted $R^2$
We have already met $R^2$ and the adjusted $R^2$. The use of an adjusted $R^2$ is an attempt to account for the phenomenon of $R^2$ automatically and spuriously increasing when extra explanatory variables are added to the model. It is a modification of $R^2$ that adjusts for the number of explanatory terms ($k$) in the model relative to the number of data points ($n$):
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1},$$
where $n - k - 1 = n - p$.

leaps performs an exhaustive (all possible combinations) search for the best subsets of the variables in x for predicting y in linear regression, using an efficient branch-and-bound algorithm.
# predictor matrix only, without the column of ones
x = model.matrix( g )[ , -1 ]
y = statedata$Life
adjr = leaps( x, y, method = "adjr2" )
names( adjr )
## [1] "which" "label" "size" "adjr2"
adjr
## $which
## 1 2 3 4 5 6 7
## 1 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## 2 FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## 2 FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## 2 FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## 2 FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## 3 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## 3 TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## 3 FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## 3 FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## 3 FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## 3 FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## 4 TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## 4 FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## 4 FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## 4 TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## 4 TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## 4 TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## 4 TRUE FALSE FALSE TRUE FALSE TRUE TRUE
## 4 FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## 5 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 5 FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## 5 FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## 5 TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## 5 FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## 5 TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 6 TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 6 FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
## [6] "5" "6" "7"
##
## $size
## [1] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6
## [39] 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8
##
## $adjr2
## [1] 0.601589257 0.332687649 0.325204369 0.097352314 0.049277135
## [6] -0.009073186 -0.016105785 0.648499092 0.640531108 0.630123171
## [11] 0.621505415 0.598658042 0.596338472 0.417498412 0.388611537
## [16] 0.352293622 0.327377601 0.693922972 0.681157120 0.660546563
## [21] 0.657142928 0.652680094 0.646188613 0.644205869 0.644065363
## [26] 0.642090700 0.640621243 0.712569018 0.689369653 0.689240309
## [31] 0.687438534 0.686844403 0.675198232 0.675188701 0.669545043
## [36] 0.665612898 0.664714272 0.706112903 0.706085962 0.706046460
## [41] 0.683964374 0.683039206 0.682778094 0.682185571 0.680321045
## [46] 0.671976112 0.668571103 0.699326842 0.699283899 0.699279833
## [51] 0.676792721 0.675509995 0.667835310 0.400695811 0.692182314
Select the model with the greatest adjusted $R^2$ and then display the best models from a leaps object.

bestmodel_adjr2_ind = which.max( adjr$adjr2 )
adjr$which[ bestmodel_adjr2_ind, ]
## 1 2 3 4 5 6 7
## TRUE FALSE FALSE TRUE TRUE TRUE FALSE

# displays the best models from a leaps object
maxadjr( adjr, 5 )
## 1,4,5,6 1,2,4,5,6 1,3,4,5,6 1,4,5,6,7 1,2,3,4,5,6
## 0.713 0.706 0.706 0.706 0.699

We see that also in this case the best model is the one made of features 1, 4, 5 and 6, namely the model with Population + Murder + HS.Grad + Frost, since it has the largest $R^2_{adj}$ (71.26%).

# other possibilities
R2 = leaps( x, y, method = "r2" )
bestmodel_R2_ind = which.max( R2$r2 )
R2$which[ bestmodel_R2_ind, ]
## 1 2 3 4 5 6 7
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE

According to $R^2$, the best model is the complete one.

REMARK: Remember that variable selection methods are sensitive to influential points.

3. Mallows' Cp
$$C_p = \frac{RSS + 2p\,\hat{\sigma}^2}{n},$$
where RSS is the residual sum of squares, $p$ is the number of predictors and $\hat{\sigma}^2$ is an estimate of the variance associated with each response in the linear model (estimated on a model containing all predictors). A small value of $C_p$ means that the model is relatively precise.

Rule of thumb
The best model according to $C_p$ is the one that leads to a $C_p$ close to $p$. If we are uncertain, we should choose the simplest model (Occam's razor).
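The handout is cut off at this point. As a hedged continuation sketch (reusing the x and y defined above; the object name cp_all is an illustrative choice, and the component names size and Cp follow the leaps documentation), the Cp of every subset can be computed with leaps and compared against the model size:

# Cp for all subsets: good models have Cp close to the number of
# parameters in the subset (size = number of predictors + intercept)
cp_all = leaps( x, y, method = "Cp" )
plot( cp_all$size, cp_all$Cp, pch = 16, xlab = "p", ylab = "Cp" )
abline( 0, 1, lty = 2 ) # reference line Cp = p

Models lying near the Cp = p line fit adequately; among those, the rule of thumb above suggests preferring the smallest one.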