Multiple regression

Multiple regression is a very useful technique in which many predictor variables are tested for their effect on one common dependent (outcome) variable. The effects obtained are the independent effects of the predictor variables after controlling for all other predictors. The predictor variables can be of any type: continuous (i.e. numeric), categorical or ordinal. The method of performing multiple regression depends on the type of outcome variable: linear regression if the outcome variable is continuous, logistic regression if the outcome variable is categorical and ordinal regression if the outcome variable is ordinal. All of these can easily be done in R.
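As an illustration of how the three cases differ in R, the calls below show the usual fitting functions for each outcome type (a minimal sketch; y, x1, x2 and mydata are placeholder names, not objects used in this section):

Code:

# continuous outcome: linear regression
res_linear = lm(y ~ x1 + x2, data=mydata)
# binary/categorical outcome: logistic regression
res_logistic = glm(y ~ x1 + x2, data=mydata, family=binomial)
# ordinal outcome: ordinal (proportional odds) regression from the MASS package
res_ordinal = MASS::polr(y ~ x1 + x2, data=mydata)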

Multiple regression for continuous outcome
We can use the following bwdf dataset:

Code:

> bwdf = mybwdf(F)
> str(bwdf)
'data.frame':   189 obs. of  9 variables:
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 …
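The bwdf data appear to correspond to the birthwt dataset from the MASS package (which is also used directly further below). If the author's mybwdf helper is not available, an equivalent data frame can be built as in the following sketch (assuming mybwdf essentially returns birthwt with the low column dropped and the categorical columns converted to factors):

Code:

> library(MASS)      # provides the birthwt dataset
> bwdf = birthwt
> bwdf$low = NULL    # drop the dichotomised outcome, keeping bwt
> for(v in c('race','smoke','ht','ui')) bwdf[[v]] = factor(bwdf[[v]])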

Multiple regression can be performed to determine the independent predictors of bwt in the above dataset. The function used is lm, and the formula can be specified as “bwt~.”, which means that bwt is the outcome variable and all other variables in the data frame are predictors.

Code:

> res = lm(bwt~., data=bwdf)
> summary(res)

Call:
lm(formula = bwt ~ ., data = bwdf)

Residuals:
     Min       1Q   Median       3Q      Max 
-1825.26  -435.21    55.91   473.46  1701.20 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 2927.962    312.904   9.357 < 0.0000000000000002 ***
age           -3.570      9.620  -0.371             0.711012    
lwt            4.354      1.736   2.509             0.013007 *  
race2       -488.428    149.985  -3.257             0.001349 ** 
race3       -355.077    114.753  -3.094             0.002290 ** 
smoke1      -352.045    106.476  -3.306             0.001142 ** 
ptl          -48.402    101.972  -0.475             0.635607    
ht1         -592.827    202.321  -2.930             0.003830 ** 
ui1         -516.081    138.885  -3.716             0.000271 ***
ftv          -14.058     46.468  -0.303             0.762598    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,    Adjusted R-squared:  0.2047 
F-statistic: 6.376 on 9 and 179 DF,  p-value: 0.00000007891

This shows that lwt, race (group 2), race (group 3), smoke (group 1), ht (group 1) and ui (group 1) are significant independent predictors of bwt. The estimates show that all these except lwt are negative predictors of bwt. 
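The 95% confidence intervals of these coefficients can be obtained with confint (the plotting function shown below uses the same call internally):

Code:

> confint(res, level=0.95)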

Graphical representation: 

For graphical representation of the coefficients with their confidence intervals, the following function can be used:

Code:

rnlmcoefplot = function(mod, stitle='Coefficients of different variables in Multiple regression'){
    library(broom)
    tt = tidy(mod)                    # coefficient table of the fitted model
    tt = tt[-1,]                      # drop the intercept row
    cc = confint(mod, level=0.95)     # 95% confidence intervals of the coefficients
    cc = data.frame(cc)
    cc = cc[-1,]                      # drop the intercept row here as well
    cc = na.omit(cc)
    print(cc)
    cat('----------------------------------------\n')

    names(tt) = c('term','estimate','std.error','statistic','p.value')
    tt$p.value = round(tt$p.value,9)
    tt$upper = cc[,2]                 # attach the confidence limits
    tt$lower = cc[,1]

    print(tt)
    cat('----------------------------------------------------\n')
    pstr = rnpstr(tt$p.value)         # format p values as text labels

    dd = data.frame(var=tt$term, value=tt$estimate, lower=tt$lower, upper=tt$upper, text=pstr)
    print(dd)
    cat('====================================================\n')
    rnggforest(dd, stitle)            # draw the forest plot
}
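rnlmcoefplot calls rnpstr and draws the plot with rnggforest (defined below) together with two small theming helpers, rngg_goodsizes and rnggleg_nil, whose definitions are not included in this section. Minimal stand-ins like the following (hypothetical sketches, not the original definitions) make the code runnable; the plot can then be produced with, for example, rnlmcoefplot(res):

Code:

# Hypothetical stand-ins for helpers not shown in this section
rnpstr = function(p){
    # format p values as short text labels for the plot
    ifelse(p < 0.001, 'p<0.001', paste0('p=', round(p, 3)))
}

rngg_goodsizes = function(base_size=14){
    # slightly larger text sizes for readability
    ggplot2::theme(text=ggplot2::element_text(size=base_size))
}

rnggleg_nil = function(){
    # suppress the legend
    ggplot2::theme(legend.position='none')
}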

rnggforest <- function(d, stitle='Forest plot', xlabel='Values with 95% CI', ylabel='Variable', vertical_at=0, logscale=F){
    # d is a data frame with the following columns
    # d$var gives variable names
    # d$value gives center point
    # d$lower gives lower limits
    # d$upper gives upper limits
    # d$text for text to be put with points; like p values;
    # d$n_study for number in the study;

    require(ggplot2)
    p <- ggplot(d, aes(x=var, y=value, ymin=lower, ymax=upper))+
        geom_pointrange()+   
        coord_flip() + 
        geom_hline(yintercept=vertical_at, lty=2)+ 
        theme_bw()+
        labs(x=ylabel, y=xlabel, title=stitle)+  # x and y labels are swapped since coord_flip is used
        geom_text(aes(label=text), vjust=-.5, size=4)+
        geom_text(aes(x=var, y=lower, label='I'), size=4.5)+
        geom_text(aes(x=var, y=upper, label='I'), size=4.5)+
        rngg_goodsizes()+
        rnggleg_nil()

    if(logscale) p = p + scale_y_log10()

    return(p)
}

Output graph:

(forest plot: Coefficients of different variables in Multiple regression, with 95% CI and p value labels)

This plot shows the coefficients and their confidence intervals. The coefficients are the effect sizes in the multiple regression and hence different predictors can be compared for their effect on the dependent variable. A coefficient indicates the change in outcome for a one unit change in the predictor. Hence the values are generally smaller for numeric predictors than for factors, and this needs to be taken into account while comparing coefficients.
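One way to take this into account is to scale the numeric predictors before fitting, so that their coefficients refer to a change of one standard deviation rather than one raw unit (a sketch, not part of the original workflow; bwdf_sc and num_pred are names introduced here):

Code:

> bwdf_sc = bwdf
> num_pred = sapply(bwdf_sc, is.numeric) & names(bwdf_sc) != 'bwt'   # numeric predictors only
> bwdf_sc[num_pred] = scale(bwdf_sc[num_pred])                       # centre and scale them
> summary(lm(bwt~., data=bwdf_sc))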

A summary of all variables versus the dependent variable (univariate analysis) can be obtained with the following simple command (this formula method of summary comes from the Hmisc package, which needs to be loaded). The continuous numeric predictor variables are automatically grouped into categories and the mean of the dependent variable is printed for each category, so any trends can clearly be seen:

Code:

> summary(bwt~., birthwt)
bwt    N=189

+-------+---------+---+--------+
|       |         |N  |bwt     |
+-------+---------+---+--------+
|low    |No       |130|3329.108|
|       |Yes      | 59|2097.339|
+-------+---------+---+--------+
|age    |[14,20)  | 51|2973.529|
|       |[20,24)  | 56|2919.607|
|       |[24,27)  | 36|2828.444|
|       |[27,45]  | 46|3033.804|
+-------+---------+---+--------+
|lwt    |[ 80,112)| 53|2656.340|
|       |[112,122)| 43|3058.721|
|       |[122,141)| 46|3074.609|
|       |[141,250]| 47|3037.957|
+-------+---------+---+--------+
|race   |1        | 96|3102.719|
|       |2        | 26|2719.692|
|       |3        | 67|2805.284|
+-------+---------+---+--------+
|smoke  |No       |115|3055.696|
|       |Yes      | 74|2771.919|
+-------+---------+---+--------+
|ptl    |0        |159|3013.491|
|       |1        | 24|2496.292|
|       |2        |  5|2766.800|
|       |3        |  1|3637.000|
+-------+---------+---+--------+
|ht     |No       |177|2972.232|
|       |Yes      | 12|2536.833|
+-------+---------+---+--------+
|ui     |No       |161|3030.702|
|       |Yes      | 28|2449.429|
+-------+---------+---+--------+
|ftv    |0        |100|2865.140|
|       |1        | 47|3108.000|
|       |2        | 30|3010.333|
|       |3        |  7|2521.857|
|       |4        |  4|3167.750|
|       |6        |  1|3303.000|
+-------+---------+---+--------+
|Overall|         |189|2944.587|
+-------+---------+---+--------+

Simple regression equations with a few variables can be shown using scatterplots and regression lines by groups. For example, to show the regression lwt~age+low, the regression lines are drawn with the colour specified to depend on the grouping variable:

Code:

> ggplot(bwdf, aes(x=age, y=lwt, col=low))+ geom_point()+ geom_smooth(method='lm', se=F)

Output graph:

(scatterplot of lwt versus age with separate lm regression lines for the two low groups)

The plot shows that the group low=1 has a lower intercept than the group low=0. The difference, however, is not statistically significant since the confidence intervals overlap. The slopes are the same since the regression lines are parallel.
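The impression from the plot can be checked numerically by fitting the corresponding model; a sketch using MASS::birthwt, since the low column is present there but not in the bwdf shown above:

Code:

> fit = lm(lwt ~ age + factor(low), data=MASS::birthwt)
> summary(fit)    # the factor(low)1 coefficient is the intercept difference between groups
> confint(fit)    # and its 95% confidence interval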

Similarly, line type can also be specified: 

Code:

> ggplot(bwdf, aes(x=age, y=lwt, col=low, lty=smoke))+ geom_point()+
    geom_smooth(method='lm', se=F)

Output graph:

(scatterplot of lwt versus age with lm regression lines coloured by low and line type by smoke)

Assumptions for linear regression:

The linear regression method assumes the following about the data, and these should be true for a valid regression:

1. The outcome variable is a continuous numeric value (interval or ratio scale)

2. There is a linear relationship between predictors and outcome variable

3. There are no significant outliers

4. Homoscedasticity should be present in the data. This means that in a predictor (x-axis) versus outcome (y-axis) graph, the variation in the outcome should be almost the same at all levels of the predictor.

5. Residuals should be normally distributed. This is easily checked with residual plots (see the diagnostic plot sketch after this list).

6. All predictors should be independent of each other. If they are highly correlated, it is called multicollinearity. It can be checked with the following command:

Code:

> checkMC = function(mydf){
     eigen(cov(mydf))$values   # eigenvalues of the covariance matrix of the data
}

> checkMC(birthwt)
 [1] 531786.22525070    903.62816814     27.10976476      1.06347409      0.82460439      0.25387956      0.16789166
 [8]      0.11461404      0.07479460      0.05123586

Eigenvalues close to 0 indicate that multicollinearity is present (as in the above case).
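Assumptions 2 to 5 can be inspected from the standard diagnostic plots of the fitted model, as mentioned in point 5 of the list above (a minimal sketch using the res object fitted earlier):

Code:

> par(mfrow=c(2,2))    # 2 x 2 grid for the four diagnostic plots
> plot(res)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
> par(mfrow=c(1,1))    # restore the default single-plot layout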
 

