Welcome

Our website is made possible by displaying online advertisements to our visitors.
Please disable your ad blocker to continue.

Mathematica Engineering - Modelli e Metodi dell'Inferenza Statistica

Laboratorio 2 - ANOVA

Laboratory

Laboratorio con R -2 Metodi e Modelli per l’Inferenza Statistica - Ing. Matematica - a.a. 2019-20 Contents 0. Librerie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1. Visualizzazione della decomposizione della varianza . . . . . . . . . . . . . . . . . . . . . . . . . 1 2. One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Costruzione della matrice disegno di ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4. One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5. Two-ways ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 0. Librerie library ( MASS ) library ( car ) #for LEVENE TEST library ( faraway ) library ( Matrix ) Lavoreremo con il dataset presente nel documento ciclisti.txt. Sono state considerate dieci strade con pista ciclabile ed è stata misurata la distanza tra la linea di mezzeria (linea longitudinale che scorre in mezzo alla strada, dividendo la carreggiata in diverse corsie) e un ciclista sulla pista ciclabile. In queste stesse dieci strade è stata determinata attraverso fotograﬁe la distanza tra lo stesso ciclista e una macchina passante per la strada considerata. • Center -> distanza tra la linea di mezzeria e un ciclista sulla pista ciclabile [misuarata in piedi] • Car -> distanza tra lo stesso ciclista e una macchina passante per la strada considerata [misuarata in piedi] 1. Visualizzazione della decomposizione della varianza Quando operiamo una regressione lineare semplice, possiamo usare le sequenti tre quantità per scomporre la varianza (l’informazione) del modello. Explained sum of squares oSum of Squares Regression : SS reg = nÿ i=1 (ˆyi≠ ¯y)2 è una misura della variabilità della variabile dipendente del modello rispetto a come è spiegata dalle variabili indipendenti. Residual sum of squares oSum of Squares Error : SS err = nÿ i=1 (yi≠ ˆyi)2 é una misura della variabilità della variabile dipendente che NON è spiegata dalla nostra regressione. 1 Total sum of squares : SS tot = nÿ i=1 (yi≠ ¯y)2 rappresenta tutta la variabilità della variabile dipendente. dove con yisi sono indicati i valori della variabile risposta in corrispondenza di ciascun valore xi, con ˆyii valori stimati sulla retta di regressione e con ¯yla media delle yi. È facile vedere che: SS tot = SS reg +SS err Soluzione Importiamo del dataset CICLISTI: ciclisti = read.table ( "ciclisti.txt" , header = TRUE ) head (ciclisti) ## Center Car ## 1 12.8 5.5 ## 2 12.9 6.2 ## 3 12.9 6.3 ## 4 13.6 7.0 ## 5 14.5 7.8 ## 6 14.6 8.3 names ( ciclisti ) ## [1] "Center" "Car" dim ( ciclisti ) ## [1] 10 2 n= dim ( ciclisti )[ 1] Fittiamo ora un modello lineare semplice e calcoliamo le quantità di interesse. reg.ciclisti = lm ( Car ~ Center, data = ciclisti ) summary ( reg.ciclisti ) ## ## Call: ## lm(formula = Car ~ Center, data = ciclisti) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.76990 -0.44846 0.03493 0.35609 0.84148 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.18247 1.05669 -2.065 0.0727 . ## Center 0.66034 0.06748 9.786 9.97e-06 *** ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.5821 on 8 degrees of freedom ## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9133 ## F-statistic: 95.76 on 1 and 8 DF, p-value: 9.975e-06 2 # y_i ( dati osservati ) y.i = ciclisti $Car y.i ## [1] 5.5 6.2 6.3 7.0 7.8 8.3 7.1 10.0 10.8 11.0 # y medio y.mean = mean ( ciclisti $Car ) y.mean ## [1] 8 # y.hat ( dati fittati dal modello ) y.hat = reg.ciclisti $fitted.value y.hat ## 1 2 3 4 5 6 7 8 ## 6.269904 6.335939 6.335939 6.798178 7.392485 7.458520 7.788691 9.373511 ## 9 10 ## 10.694195 11.552639 Facciamo il graﬁco delle quantità di interesse. par ( mfrow = c( 1, 3 )) # SStot = Sum( y_i - y_medio )^2 plot ( NULL , xlim = c( 12 , 22 ), ylim = c( 5, 12 ), xlab = "Center" , ylab = "Car" , main = "Contributo di y_i a SStot" ) points ( ciclisti $Center, ciclisti $Car, pch = 16 , col = blue , cex = 1.2 ) for (i in 1:n) segments ( ciclisti $Center [ i ], ciclisti $Car[ i ], ciclisti $Center [ i ], y.mean, lwd = 2, col = red , lty = 1 ) abline ( h= y.mean , col = darkblue , lwd = 1.2 ) abline ( reg.ciclisti, lwd = 2, col = black , lty = 2 ) # SSreg = Sum( y.hat_i - y_medio )^2 plot ( NULL , xlim = c( 12 , 22 ), ylim = c( 5, 12 ), xlab = "Center" , ylab = "Car" , main = "Contributo di y.hat_i a SSreg" ) points ( ciclisti $Center, ciclisti $Car, pch = 16 , col = blue , cex = 1.2 ) for (i in 1:n) segments ( ciclisti $Center [ i ], y.hat[ i ], ciclisti $Center [ i ], y.mean, lwd = 2, col = red , lty = 1 ) abline ( h= y.mean , col = darkblue , lwd = 1.2 ) abline ( reg.ciclisti, lwd = 2, col = black , lty = 2 ) # SSerr = Sum( y_i - y.hat_i )^2 plot ( NULL , xlim = c( 12 , 22 ), ylim = c( 5, 12 ), xlab = "Center" , ylab = "Car" , main = "Contributo di y_i a SSerr" ) 3 points ( ciclisti $Center, ciclisti $Car, pch = 16 , col = blue , cex = 1.2 ) for (i in 1:n) segments ( ciclisti $Center [ i ], ciclisti $Car[ i ], ciclisti $Center [ i ], y.hat [ i ], lwd = 2, col = red , lty = 1 ) abline ( h= y.mean , col = darkblue , lwd = 1.2 ) abline ( reg.ciclisti, lwd = 2, col = black , lty = 2 ) 12 14 16 18 20 22 5 6 7 8 9 10 11 12 Contributo di y_i a SStot Center Car 12 14 16 18 20 22 5 6 7 8 9 10 11 12 Contributo di y.hat_i a SSreg Center Car 12 14 16 18 20 22 5 6 7 8 9 10 11 12 Contributo di y_i a SSerr Center Car 2. One-way ANOVA Importiamo i dati chickwts . Vogliamo investigare se il peso dei polli è inﬂuenzato dal tipo di alimentazione (y = weights e ·=feed). Soluzione Importiamo i dati. data ( chickwts ) head ( chickwts ) ## weight feed ## 1 179 horsebean ## 2 160 horsebean ## 3 136 horsebean ## 4 227 horsebean ## 5 217 horsebean ## 6 168 horsebean tail ( chickwts ) 4 ## weight feed ## 66 352 casein ## 67 359 casein ## 68 216 casein ## 69 222 casein ## 70 283 casein ## 71 332 casein attach ( chickwts ) Iniziamo a farci un’idea descrittiva dei dati per avere indicazioni qualitative sulla presenza di dierenziazione nella risposta a causa dell’appartenenza ad una o all’altra categoria. L’analisi della varianza, nota con l’acronimo di ANOVA, è una tecnica statistica che ha come obiettivo il confronto tra le medie di un fenomeno aleatorio fra dierenti gruppi di unità statistiche. Tale analisi viene arontata tramite decomposizione della varianza. Consiglio graﬁco: spesso è utile rappresentare i gruppi di dati con colori diversi, potete trovare alcune palette di colori nel pacchetto RColorBrewer. library (RColorBrewer) # display.brewer.all() # mostra le palette disponibili in RColorBrewer my_colors = brewer.pal ( length ( levels ( chickwts $feed ) ), Set2 ) summary ( chickwts ) ## weight feed ## Min. :108.0 casein :12 ## 1st Qu.:204.5 horsebean:10 ## Median :258.0 linseed :12 ## Mean :261.3 meatmeal :11 ## 3rd Qu.:323.5 soybean :14 ## Max. :423.0 sunflower:12 boxplot ( weight ~ feed, xlab = feed , ylab = weight , main = Chicken weight according to feed , col = my_colors ) abline ( h= mean ( weight ) ) 5 casein horsebean linseed meatmeal sunflower 100 200 300 400 Chicken weight according to feed feed weight tapply ( chickwts $weight, chickwts $feed, length ) ## casein horsebean linseed meatmeal soybean sunflower ## 12 10 12 11 14 12 Sembra che un qualche eetto ci sia, le medie appaiono diverse e sembra vi sia una dominanza stocastica delle distribuzioni dei pesi. Il modello che vogliamo ﬁttare è il seguente: yij= µj+Áij= µ+·j+Áij in cui iœ{ 1,..,n j}è l’indice dell’unità statistica all’interno del gruppo j,mentre jœ{ 1,..,g }èl’indicedi gruppo. ·jrappresenta la deviazione media rispetto a µnel gruppo j. Siamo interessati ad eseguire il seguente test: H0:µi= µj ’i, j œ{ 1,.., 6} vs H 1:÷(i, j )|µi”= µj Parafrasando, H0prevede che tutti i polli appartengono ad una sola popolazione, mentre H1prevede che i polli appartengono a 2, 3, 4, 5 o 6 popolazioni. par ( mfrow = c( 1, 2 )) barplot ( rep ( mean ( weight ), 6 ), names.arg = levels ( feed ), ylim = c( 0, max ( weight ) ), main = "se H0 vera" , col = grey ) barplot ( tapply ( weight, feed, mean ), names.arg = levels ( feed ), ylim = c( 0, max ( weight ) ), main = "se H0 falsa" , col = my_colors ) 6 casein meatmeal se H0 vera 0 100 200 300 400 casein meatmeal se H0 falsa 0 100 200 300 400 Facciamo un’ANOVA manuale. Veriﬁchiamo che siano soddisfatte le ipotesi dell’ANOVA: • Normalità intragruppo; n= length ( feed ) ng = table ( feed ) treat = levels ( feed ) g= length ( treat ) # Normalità dei dati nei gruppi Ps = c( shapiro.test ( weight [ feed == treat [ 1 ]]) $p, shapiro.test ( weight [ feed == treat [ 2 ]]) $p, shapiro.test ( weight [ feed == treat [ 3 ]]) $p, shapiro.test ( weight [ feed == treat [ 4 ]]) $p, shapiro.test ( weight [ feed == treat [ 5 ]]) $p, shapiro.test ( weight [ feed == treat [ 6 ]]) $p) Ps ## [1] 0.2591841 0.5264499 0.9034734 0.9611795 0.5063768 0.3602904 # In maniera più compatta ed elegante: Ps = tapply ( weight, feed, function (x)( shapiro.test (x) $p)) Ps # tutti p-value alti = > non rifiuto mai hp di Normalità ## casein horsebean linseed meatmeal soybean sunflower ## 0.2591841 0.5264499 0.9034734 0.9611795 0.5063768 0.3602904 • Omoschedasticità fra i gruppi. # Varianze dei gruppi omogenee Var = c( var ( weight [ feed == treat [ 1 ] ] ), var ( weight [ feed == treat [ 2 ] ] ), var ( weight [ feed == treat [ 3 ] ] ), 7 var ( weight [ feed == treat [ 4 ] ] ), var ( weight [ feed == treat [ 5 ] ] ), var ( weight [ feed == treat [ 6 ]])) Var ## [1] 4151.720 1491.956 2728.568 4212.091 2929.956 2384.992 # In maniera più compatta ed elegante: Var = tapply ( weight, feed, var ) Var ## casein horsebean linseed meatmeal soybean sunflower ## 4151.720 1491.956 2728.568 4212.091 2929.956 2384.992 Per veriﬁcare l’omogeneità tra le varianze abbiamo diverse possibilità. Bartlett’s test Il Bartlett test è il seguente: H0:‡1= ‡2= ... = ‡g vs H 1:HC0 La statistica test ‘K’ del test di Bartlett è approssimativamente distribuita come una ‰2con (g-1) gradi di libertà e si deﬁnisce come: K = (N ≠g)ln( S2p)≠q gi=1 (ni≠1) ln( S2i) 1+ 1 3(g≠1) 1q gi=1 ( 1ni≠1)≠ 1N≠g 2 dove N = q ki=1 niè la numerosità totale del campione, nila numerosità del gruppo i,S2p= 1N≠k q i(ni≠1)S2i la stima della varianza pooled. La statistica test ‘K’ confronta una varianza globale (pooled) con uno stimatore costruito sulla base delle varianze dei singoli gruppi. Valori piccoli di ‘K’ mi portano ad accettare H0, mentre valori grandi di ‘K’ mi portano a riﬁutare H0. Il test di Bartlett assume che le osservazioni appartenenti ai vari gruppi siano iid da una Normale. Il test è costruito su questa ipotesi, quindi scostamenti anche di un solo gruppo dalla normalità hanno ripercussioni signiﬁcative sulla validità e l’esito del test. Ci sono altri test, più robusti, che vagliano le stesse ipotesi (e.g. Levene, Brown-Forsythe). # Test di uniformità delle varianze bartlett.test ( weight, feed ) ## ## Bartlett test of homogeneity of variances ## ## data: weight and feed ## Bartlett s K-squared = 3.2597, df = 5, p-value = 0.66 Il test di Bartlett in questo caso accetta H0. Levene’s test Anche il test di Levene serve per valutare l’omogeneità delle varianze di una variabile calcolato per due o più gruppi. Questo test è un’alternativa a quello di Bartlett, meno sensibile alla non normalità dei dati. # Alternative: Levene-Test leveneTest ( weight, feed ) ## Levene s Test for Homogeneity of Variance (center = median) ## Df F value Pr(>F) ## group 5 0.7493 0.5896 ## 65 8 Anche il test di Levene è concorde nell’accettare l’ipotesi nulla. Ora che abbiamo veriﬁcato che le ipotesi sono soddisfatte possiamo procedere con una One-Way ANOVA . Prima, però, osserviamo come possiamo decomporre la varianza tenendo conto di questi gruppi. • Varianza inter-gruppi (Variance Between group) SS B= gÿ j=1 nj(¯yj≠ ¯y)2 dove njè la numerosità del j-esimo gruppo, ¯yjèlamediadel j-esimo gruppo, ¯yla media dell’intero campione. • Varianza intra-gruppi (Variance Within group) SS W = gÿ j=1 njÿ i=1 (yij≠ ¯yj)2 • Varianza totale SS TOT = SS B+SS W Queste statistiche a meno di coecienti di normalizzazione, sono distribuite come ‰2. Allora possiamo deﬁnire una statistica F0come: F0= SS B/(g≠1) SS W/(N ≠g) Sotto l’ipotesi H0 :·1 = ·2 = ... = ·g, abbiamo F0 ≥ Fg≠1,N≠g, da testare contro H1 :·i”= ·jper qualche (i, j ). Media = mean ( weight ) Media .1 = mean ( weight [ feed == treat [ 1 ]]) Media .2 = mean ( weight [ feed == treat [ 2 ]]) Media .3 = mean ( weight [ feed == treat [ 3 ]]) Media .4 = mean ( weight [ feed == treat [ 4 ]]) Media .5 = mean ( weight [ feed == treat [ 5 ]]) Media .6 = mean ( weight [ feed == treat [ 6 ]]) Mediag = c( Media .1 , Media .2 , Media .3 , Media .4 , Media .5 , Media .6 ) # oppure (FORTEMENTE CONSIGLIATO): Media = mean ( weight ) Mediag = tapply ( weight, feed, mean ) SStot = var ( weight ) * (n -1 ) SSB = sum ( ng * ( Mediag -Media ) ^2 ) SSW = SStot - SSB alpha = 0.05 Fstatistic = ( SSB / (g -1 )) / ( SSW / (n -g)) # valori "piccoli" non ci portano a rifiutare cfr.fisher = qf ( 1-alpha, g -1 ,n -g) Fstatistic > cfr.fisher ## [1] TRUE Fstatistic ## [1] 15.3648 9 cfr.fisher ## [1] 2.356028 F0siamo proprio ben oltre la soglia trovata con la distribuzione F, quindi abbiamo un’evidenza forte per riﬁutare. Calcoliamo anche il p-value. P= 1-pf ( Fstatistic, g -1 ,n -g) P ## [1] 5.93642e-10 In Rsi può anche eseguire l’ANOVA in modo automatico. Costruiamo un modelo anova e guardiamo il summary # help( aov ) fit = aov ( weight ~ feed ) summary ( fit ) ## Df Sum Sq Mean Sq F value Pr(>F) ## feed 5 231129 46226 15.37 5.94e-10 *** ## Residuals 65 195556 3009 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Alternativamente, possiamo anche eseguire l’ANOVA in questo modo. Notate che anche nel summary del modello lineare si può trovare la statistica F. # Oppure: mod = lm ( weight ~ feed ) summary ( mod ) ## ## Call: ## lm(formula = weight ~ feed) ## ## Residuals: ## Min 1Q Median 3Q Max ## -123.909 -34.413 1.571 38.170 103.091 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 323.583 15.834 20.436 < 2e-16 *** ## feedhorsebean -163.383 23.485 -6.957 2.07e-09 *** ## feedlinseed -104.833 22.393 -4.682 1.49e-05 *** ## feedmeatmeal -46.674 22.896 -2.039 0.045567 * ## feedsoybean -77.155 21.578 -3.576 0.000665 *** ## feedsunflower 5.333 22.393 0.238 0.812495 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 54.85 on 65 degrees of freedom ## Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064 ## F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10 # Meglio fare: 10 anova ( mod ) ## Analysis of Variance Table ## ## Response: weight ## Df Sum Sq Mean Sq F value Pr(>F) ## feed 5 231129 46226 15.365 5.936e-10 *** ## Residuals 65 195556 3009 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Osserviamo che la conferma che ci siano medie diverse fra i gruppi ci è data dai seguenti elementi: ANOVA MANUALE P-value del test di dierenza fra medie dei gruppi (5.93642e-10); ANOVA AUTOMATICA P-value del comando ANOVA (5.94e-10); LINEAR MODEL P-value del test di signiﬁcatività dei regressori (5.936e-10). REMARK : La statistica F può essere interpretata in generale come una statistica test sulla signiﬁcatività totale di un modello di regressione. Immaginiamo di analizzare il solito summary di un modello lineare (anche con variabili continue). La statistica F che troviamo, testa l’ipotesi nulla che tutti i coecienti della regressione siano uguali a zero. Viene testato il modello completo contro un modello senza variabili indipendenti (la stima della variabile dipendente è la media dei valori della variabile dipendente). Un valore basso di F implica che almeno qualche parametro della regressione è non nullo e che l’equazione della regressione ha validità nel ﬁtting dei dati (cioè, le variabili indipendenti non sono puramente randomiche rispetto alla variabile dipendente). Aermiamo quindi che c’è dierenza delle medie fra i gruppi. È interessante notare che SS B calcolato a mano fa parte dell’output del comando summary dell’ANOVA. Lo stesso vale per per SS W. 3. Costruzione della matrice disegno di ANOVA Poniamoci nel caso di ONE-WAY ANOVA. Per veriﬁcare l’esistenza di diversi gruppi, di fatto quello che vogliamo fare è un modello di regressione lineare con una variabile factor (variabile dummy , categorica). y= X—+Á Cerchiamo di capire come mai il numero dei regressori sarà g≠1,quindi X avrà dimensioni n◊g(perchè viene aggiunta l’intercetta). Supponiamo di avere un campione di 18 osservazioni suddivisi in 7 gruppi con numerosità {3,2,3,2,3,2,3} rispettivamente, la matrice disegno X dovrebbe essere: 11 X = S WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWU 11000000 11000000 11000000 10100000 10100000 10010000 10010000 10010000 10001000 10001000 10000100 10000100 10000100 10000010 10000010 10000001 10000001 10000001 T XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXV Tuttavia questa matrice disegno (X.full nel codice) è singolare, cioè non invertibile. Per invertirla manualmente dobbiamo ricorrere alla pseudoinversa di Moore-Penrose. In alternativa, possiamo scrivere la matrice disegno sotto forma di contrasto, ovvero rimuoviamo una colonna (un parametro), denotando gli elementi dell’ultimo gruppo come assenti in tutti gli altri nel seguente modo: X = S WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWU 11 0 0 0 0 0 11 0 0 0 0 0 11 0 0 0 0 0 10 1 0 0 0 0 10 1 0 0 0 0 10 0 1 0 0 0 10 0 1 0 0 0 10 0 1 0 0 0 10 0 0 1 0 0 10 0 0 1 0 0 10 0 0 0 1 0 10 0 0 0 1 0 10 0 0 0 1 0 10 0 0 0 0 1 10 0 0 0 0 1 1 ≠1 ≠1 ≠1 ≠1 ≠1 ≠1 1 ≠1 ≠1 ≠1 ≠1 ≠1 ≠1 1 ≠1 ≠1 ≠1 ≠1 ≠1 ≠1 T XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXV Notiamo subito che questa matrice è analoga alla precedente (nel senso che studio la signiﬁcatività degli stessi regressori) ma è non singolare e quindi inveritibile. REMARK Se facciamo tapply(feed, feed, length) , R calcola le numerosità dei gruppi e le riordina in ordine alfabetico di nome del gruppo. Se vogliamo utilizzare le numerosità nell’ordine in cui si presentano i gruppi nel dataset (feed), dobbiamo fare: n ## [1] 71 group_names = unique ( as.character ( feed ) ) ng = tapply ( feed, feed, length )[ group_names ] 12 ng ## horsebean linseed soybean sunflower meatmeal casein ## 10 12 14 12 11 12 Costruiamo la matrice X.full, ovvero una matrice disegno dove consideriamo tutti i gruppi (dimensione = n◊(g+ 1) ). # gruppo 1 (nell ordine dei dati in ( weight,feed ) x1.full = c( rep ( 1, ng [ 1 ] ), rep ( 0,n - ng [ 1 ])) # gruppo 2 (nell ordine dei dati in ( weight,feed ) x2.full = c( rep ( 0, ng [ 1 ] ), rep ( 1, ng [ 2 ] ), rep ( 0,n - ng [ 1 ] - ng [ 2 ])) # gruppo 3 (nell ordine dei dati in ( weight,feed ) x3.full = c( rep ( 0, ng [ 1 ] + ng [ 2 ] ), rep ( 1, ng [ 3 ] ), rep ( 0,n - ng [ 1 ] - ng [ 2 ] - ng [ 3 ])) # gruppo 4 (nell ordine dei dati in ( weight,feed ) x4.full = c( rep ( 0,n - ng [ 6 ] - ng [ 5 ] - ng [ 4 ] ), rep ( 1, ng [ 4 ] ), rep ( 0, ng [ 5 ] + ng [ 6 ])) # gruppo 5 (nell ordine dei dati in ( weight,feed ) x5.full = c( rep ( 0,n - ng [ 6 ] - ng [ 5 ] ), rep ( 1, ng [ 5 ] ), rep ( 0, ng [ 6 ])) # gruppo 6 (nell ordine dei dati in ( weight,feed ) x6.full = c( rep ( 0,n - ng [ 6 ] ), rep ( 1, ng [ 6 ])) X.full = cbind ( rep ( 1, n ), x1.full, x2.full, x3.full, x4.full, x5.full, x6.full ) Visualizziamo questa matrice disegno. #corrplot( X.full, corr = F, method = ellipse ) image (Matrix (X.full)) 13 Dimensions: 71 x 7 Column Row 20 40 60 1357 Vediamo che non ha rango pieno (proviamo che una colonna è combinazione lineare di un’altra). X.full[ , 1 ] - rowSums ( X.full[ , - 1 ]) ## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Stimiamo ora i ˆ—. Ricordiamo che X è singolare, quindi la matrice di proiezione H sarà calcolata come segue: H = X ·(XT·X)†·XT in cui (XT·X)†indica la pseudo-inversa di Moore-Penrose. E i ˆ— saranno calcolati come: ˆ—=( XT·X)†·XT·y # H.full = X.full %*% solve( t( X.full ) %*% X.full ) %*% t( X.full ) # R dà errore, perchè singolare! H.full = X.full %*% ginv ( t( X.full ) %*% X.full ) %*% t( X.full ) y= weight betas.full = as.numeric ( ginv ( t( X.full ) %*% X.full ) %*% t( X.full ) %*% y) betas.full ## [1] 222.112523 -61.912523 -3.362523 24.316048 106.804143 54.796568 101.470810 ginv calcola la matrice pseudo-inversa di Moore-Penrose di una matrice. La media nel gruppo j-esimo è: E[yj]= µj= —0+—j j=1: g 14 means_by_group = betas.full[ 1 ] + betas.full[ 2:length ( betas.full ) ] names ( means_by_group ) = group_names means_by_group ## horsebean linseed soybean sunflower meatmeal casein ## 160.2000 218.7500 246.4286 328.9167 276.9091 323.5833 tapply ( weight, feed, mean )[ unique ( as.character ( feed ) ) ] ## horsebean linseed soybean sunflower meatmeal casein ## 160.2000 218.7500 246.4286 328.9167 276.9091 323.5833 La media globale è: µ= gÿ j=1 nj·µj n global_mean = ng %*% means_by_group / n global_mean ## [,1] ## [1,] 261.3099 mean ( weight ) ## [1] 261.3099 Ora, costruiamo la matrice disegno X basata sui contrasti, ovvero esprimendo l’ultimo gruppo come la negazione di tutti gli altri. x1.red = c( rep ( 1, ng [ 1 ] ), rep ( 0,n - ng [ 1 ] - ng [ 6 ] ), rep ( -1 , ng [ 6 ])) stopifnot ( sum ( x1.red - ( x1.full - x6.full ) ) == 0 ) x2.red = c( rep ( 0, ng [ 1 ] ), rep ( 1, ng [ 2 ] ), rep ( 0,n - ng [ 1 ] - ng [ 2 ] - ng [ 6 ] ), rep ( -1 , ng [ 6 ])) stopifnot ( sum ( x2.red - ( x2.full - x6.full ) ) == 0 ) x3.red = c( rep ( 0, ng [ 1 ] + ng [ 2 ] ), rep ( 1, ng [ 3 ] ), rep ( 0,n - ng [ 1 ] - ng [ 2 ] - ng [ 3 ] - ng [ 6 ] ), rep ( -1 , ng [ 6 ])) stopifnot ( sum ( x3.red - ( x3.full - x6.full ) ) == 0 ) x4.red = c( rep ( 0,n - ng [ 6 ] - ng [ 5 ] - ng [ 4 ] ), rep ( 1, ng [ 4 ] ), rep ( 0, ng [ 5 ] ), rep ( -1 , ng [ 6 ])) stopifnot ( sum ( x4.red - ( x4.full - x6.full ) ) == 0 ) x5.red = c( rep ( 0,n - ng [ 6 ] - ng [ 5 ] ), rep ( 1, ng [ 5 ] ), rep ( -1 , ng [ 6 ])) stopifnot ( sum ( x5.red - ( x5.full - x6.full ) ) == 0 ) 15 X.red = cbind ( rep ( 1, n ), x1.red, x2.red, x3.red, x4.red, x5.red ) Visualizziamo questa matrice di dimensione n◊g. #corrplot( X.red, corr = F, method = ellipse ) image (Matrix (X.red)) Dimensions: 71 x 6 Column Row 20 40 60 135 −1.0 −0.5 0.0 0.5 1.0 Stimiamo ora i ˆ—. # solve( t( X.red ) %*% X.red ) # solve( t( X.red ) %*% X.red ) - ginv( t( X.red ) %*% X.red ) H.red = X.red %*% solve ( t( X.red ) %*% X.red ) %*% t( X.red ) betas.red = as.numeric ( solve ( t( X.red ) %*% X.red ) %*% t( X.red ) %*% y) betas.red ## [1] 259.13128 -98.93128 -40.38128 -12.70271 69.78539 17.77781 La media nel gruppo i-esimo si ottiene nel seguente modo: µj= —0+—i i=1 ,...,g ≠1 µg= —0≠(—1+... +—g≠1) 16 Media per l’ultimo gruppo : NB I valori ﬁttati sono le medie all’interno dei gruppi means_by_group = betas.red[ 1 ] + betas.red[ -1 ] means_by_group = c( means_by_group, betas.red[ 1 ] - sum ( betas.red[ -1 ])) means_by_group.full = betas.full[ 1 ] + betas.full[ -1 ] means_by_group.full = c( means_by_group.full, betas.full[ 1 ] - sum ( betas.full[ -1 ])) names ( means_by_group ) = group_names means_by_group ## horsebean linseed soybean sunflower meatmeal casein ## 160.2000 218.7500 246.4286 328.9167 276.9091 323.5833 tapply ( weight, feed, mean )[ group_names ] ## horsebean linseed soybean sunflower meatmeal casein ## 160.2000 218.7500 246.4286 328.9167 276.9091 323.5833 names ( means_by_group.full ) = group_names means_by_group.full ## horsebean linseed soybean sunflower meatmeal casein ## 1.602000e+02 2.187500e+02 2.464286e+02 3.289167e+02 2.769091e+02 3.235833e+02 ## ## 2.273737e-13 In questo caso abbiamo in tutti i casi dei risultati molto coerenti (nonostante le approssimazioni). REMARK Ma cosa fa Rin automatico? mod_aov = aov ( weight ~ feed ) X_aov = model.matrix ( mod_aov ) Vediamo che la matrice disegno creata dall’ANOVA è di dimensioni n◊g. Notiamo che manca la variabile (livello) casein che è usata come baseline. mod_lm = lm ( weight ~ feed ) X_lm = model.matrix ( mod_lm ) Idem nel caso di modello lineare. Visualizziamo la matrice disegno dell’ANOVA implementata in R. #corrplot( X_lm, corr = F, method = ellipse ) image ( Matrix ( X_lm ) ) 17 Dimensions: 71 x 6 Column Row 20 40 60 135 levels (feed) ## [1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower" unique (feed) ## [1] horsebean linseed soybean sunflower meatmeal casein ## Levels: casein horsebean linseed meatmeal soybean sunflower REMARK Ilprimo gruppo (nell’ordine alfanumerico dei livelli della variabile di stratiﬁcazione, feed, e NON nell’ordine di comparizione dei dati) viene soppresso e preso come riferimento (baseline). Calcoliamo ora i ˆ—. betas.lm = coefficients ( mod_lm ) La media nel gruppo j-esimo si ottiene nel seguente modo: µbaseline = —0 µj= —0+—j,j ”= baseline means_by_group = c( betas.lm[ 1 ], betas.lm[ 1 ] + betas.lm[ -1 ]) names ( means_by_group ) = levels ( feed ) means_by_group ## casein horsebean linseed meatmeal soybean sunflower ## 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167 tapply ( weight, feed, mean ) ## casein horsebean linseed meatmeal soybean sunflower ## 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167 18 4. One-way ANOVA The example dataset we will use is a set of 24 blood coagulation times. 24 animals were randomly assigned to four dierent diets and the samples were taken in a random order. This data comes from Box, Hunter, and Hunter (1978). data ( coagulation ) str ( coagulation ) ## data.frame : 24 obs. of 2 variables: ## $ coag: num 62 60 63 59 63 67 71 64 65 66 ... ## $ diet: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 2 2 2 2 2 2 ... dim ( coagulation ) ## [1] 24 2 names ( coagulation ) ## [1] "coag" "diet" head ( coagulation ) ## coag diet ## 1 62 A ## 2 60 A ## 3 63 A ## 4 59 A ## 5 63 B ## 6 67 B The ﬁrst step is to plot the data to check for: 1. Normality assumption; 2. Equal variances for each level of the factor. We don’t want to detect: 1. Skewness - this will be suggested by an asymmetrical form of the boxes. 2. Unequal variance - this will be suggested by unequal box sizes. Some care is required because often there is very little data to be used in the construction of the boxplots and so even when the variances are truly equal in the groups, we can expect a bit variability. plot ( coag ~ diet, data = coagulation ) 19 A B C D 60 65 70 diet coag In this case, there are no obvious problems. For group C, there are only 4 distinct observations and one is somewhat separated which accounts for the slightly odd looking plot. Always look at sample sizes. table ( coagulation $diet ) ## ## A B C D ## 4 6 6 8 coagulation $coag[ coagulation $diet == C ] ## [1] 68 66 71 67 68 68 unique ( coagulation $coag[ coagulation $diet == C ]) ## [1] 68 66 71 67 Anyway, let’s check assumptions. Ps = tapply (coagulation $coag, coagulation $diet, function (x)( shapiro.test ( x ))) Ps ## $A ## ## Shapiro-Wilk normality test ## ## data: x ## W = 0.94971, p-value = 0.7143 ## ## ## $B ## ## Shapiro-Wilk normality test ## ## data: x ## W = 0.92239, p-value = 0.5227 ## ## 20 ## $C ## ## Shapiro-Wilk normality test ## ## data: x ## W = 0.87279, p-value = 0.2375 ## ## ## $D ## ## Shapiro-Wilk normality test ## ## data: x ## W = 0.93173, p-value = 0.5319 # Alternative: Levene-Test leveneTest ( coagulation $coag, coagulation $diet ) ## Levene s Test for Homogeneity of Variance (center = median) ## Df F value Pr(>F) ## group 3 0.6492 0.5926 ## 20 We accept normality and homoschedasticity. Now let’s ﬁt the model. mod = lm ( coag ~ diet, coagulation ) summary ( mod ) ## ## Call: ## lm(formula = coag ~ diet, data = coagulation) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.00 -1.25 0.00 1.25 5.00 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.100e+01 1.183e+00 51.554 < 2e-16 *** ## dietB 5.000e+00 1.528e+00 3.273 0.003803 ** ## dietC 7.000e+00 1.528e+00 4.583 0.000181 *** ## dietD 2.991e-15 1.449e+00 0.000 1.000000 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 2.366 on 20 degrees of freedom ## Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 ## F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05 What kind of design matrix has been used in this case? Look at the design matrix to understand the coding: model.matrix ( mod ) #n x g ## (Intercept) dietB dietC dietD ## 1 1 0 0 0 ## 2 1 0 0 0 ## 3 1 0 0 0 21 ## 4 1 0 0 0 ## 5 1 1 0 0 ## 6 1 1 0 0 ## 7 1 1 0 0 ## 8 1 1 0 0 ## 9 1 1 0 0 ## 10 1 1 0 0 ## 11 1 0 1 0 ## 12 1 0 1 0 ## 13 1 0 1 0 ## 14 1 0 1 0 ## 15 1 0 1 0 ## 16 1 0 1 0 ## 17 1 0 0 1 ## 18 1 0 0 1 ## 19 1 0 0 1 ## 20 1 0 0 1 ## 21 1 0 0 1 ## 22 1 0 0 1 ## 23 1 0 0 1 ## 24 1 0 0 1 ## attr(,"assign") ## [1] 0 1 1 1 ## attr(,"contrasts") ## attr(,"contrasts")$diet ## [1] "contr.treatment" The eects returned by ANOVA have to be interpreted as dierences from a reference level, the baseline (the ﬁrst in alphabetical order). What do we conclude by looking at the p-value? We can read the output in this way: Group A is the reference level ( baseline ) and has a mean equal to 61 (ˆ—0), groups B, C and D are 5 ( ˆ—1), 7 ( ˆ—2) and 0 ( ˆ—3) seconds larger on average. For completeness, look at: anova (mod) ## Analysis of Variance Table ## ## Response: coag ## Df Sum Sq Mean Sq F value Pr(>F) ## diet 3 228 76.0 13.571 4.658e-05 *** ## Residuals 20 112 5.6 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 We have evidence of the fact that dierent groups have signiﬁcantly dierent means. We can ﬁt the model without an intercept term. mod_i = lm ( coag ~ diet - 1, coagulation ) summary ( mod_i ) ## ## Call: ## lm(formula = coag ~ diet - 1, data = coagulation) 22 Output di un ANOVA automatico ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.00 -1.25 0.00 1.25 5.00 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## dietA 61.0000 1.1832 51.55 F) ## group 11 1.1272 0.3698 ## 36 bartlett.test ( 1/rats $time, rats $treat :rats $poison ) ## ## Bartlett test of homogeneity of variances ## ## data: 1/rats$time and rats$treat:rats$poison ## Bartlett s K-squared = 9.8997, df = 11, p-value = 0.5394 After Box-Cox transformation (without scale-location adjustment), assumptions are respected! g1 = lm ( 1/time ~ poison * treat, rats ) summary (g1) ## ## Call: ## lm(formula = 1/time ~ poison * treat, data = rats) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.76847 -0.29642 -0.06914 0.25458 1.07936 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.48688 0.24499 10.151 4.16e-12 *** ## poisonII 0.78159 0.34647 2.256 0.030252 * ## poisonIII 2.31580 0.34647 6.684 8.56e-08 *** ## treatB -1.32342 0.34647 -3.820 0.000508 *** ## treatC -0.62416 0.34647 -1.801 0.080010 . ## treatD -0.79720 0.34647 -2.301 0.027297 * ## poisonII:treatB -0.55166 0.48999 -1.126 0.267669 ## poisonIII:treatB -0.45030 0.48999 -0.919 0.364213 ## poisonII:treatC 0.06961 0.48999 0.142 0.887826 ## poisonIII:treatC 0.08646 0.48999 0.176 0.860928 ## poisonII:treatD -0.76974 0.48999 -1.571 0.124946 ## poisonIII:treatD -0.91368 0.48999 -1.865 0.070391 . ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.49 on 36 degrees of freedom 31 ## Multiple R-squared: 0.8681, Adjusted R-squared: 0.8277 ## F-statistic: 21.53 on 11 and 36 DF, p-value: 1.289e-12 anova (g1) ## Analysis of Variance Table ## ## Response: 1/time ## Df Sum Sq Mean Sq F value Pr(>F) ## poison 2 34.877 17.4386 72.6347 2.310e-13 *** ## treat 3 20.414 6.8048 28.3431 1.376e-09 *** ## poison:treat 6 1.571 0.2618 1.0904 0.3867 ## Residuals 36 8.643 0.2401 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 The results tell us that levels of dierent factors have signiﬁcantly dierent means taken singularly, but interactions are not signiﬁcant. Then, we should not consider the interaction: g1_sel = lm ( 1/time ~ poison + treat, data = rats ) summary ( g1_sel ) ## ## Call: ## lm(formula = 1/time ~ poison + treat, data = rats) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.82757 -0.37619 0.02116 0.27568 1.18153 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.6977 0.1744 15.473 < 2e-16 *** ## poisonII 0.4686 0.1744 2.688 0.01026 * ## poisonIII 1.9964 0.1744 11.451 1.69e-14 *** ## treatB -1.6574 0.2013 -8.233 2.66e-10 *** ## treatC -0.5721 0.2013 -2.842 0.00689 ** ## treatD -1.3583 0.2013 -6.747 3.35e-08 *** ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 ## ## Residual standard error: 0.4931 on 42 degrees of freedom ## Multiple R-squared: 0.8441, Adjusted R-squared: 0.8255 ## F-statistic: 45.47 on 5 and 42 DF, p-value: 6.974e-16 anova ( g1_sel ) ## Analysis of Variance Table ## ## Response: 1/time ## Df Sum Sq Mean Sq F value Pr(>F) ## poison 2 34.877 17.4386 71.708 2.865e-14 *** ## treat 3 20.414 6.8048 27.982 4.192e-10 *** ## Residuals 42 10.214 0.2432 ## --- ## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 We should check the hypotheses (residual normality and homosckedasticity) of the model. 32 par ( mfrow = c( 1, 2 )) qqnorm ( g1_sel $res /summary ( g1_sel ) $sigma, pch = 16 , main = QQ-norm of residuals ) abline ( 0, 1, lwd = 2, lty = 2, col = red ) shapiro.test ( g1_sel $res ) ## ## Shapiro-Wilk normality test ## ## data: g1_sel$res ## W = 0.97918, p-value = 0.5451 plot ( g1_sel $fitted, g1 $res, xlab = "Fitted" , ylab = "Residuals" , main = "Reciprocal response" , pch = 16 ) −2 −1 0 1 2 −1 0 1 2 QQ−norm of residuals Theoretical Quantiles Sample Quantiles 1 2 3 4 −0.5 0.0 0.5 1.0 Reciprocal response Fitted Residuals The hypotheses are respected. 33