Partial Least Squares Discriminant Analysis

Partial least squares discriminant analysis (PLS-DA) is useful for examining the relation between numeric predictors and a categorical outcome with multiple groups, which need not be ordered or related. The pls package is used here, though the analysis can also be performed with the plsda() function of the mixOmics package. The commands can be grouped into the following function:

Code:

mypls_da_multinomial = function(mydf, ynum){
# Partial least squares - (multiple) discriminant analysis (PLS-DA)
# mydf: data frame; ynum: column number of the categorical outcome
# All predictor columns must be numeric
# (can also be done with the plsda() function of mixOmics)

    tempdf = mydf[-ynum]
    for(i in seq_along(tempdf)) if(!is.numeric(tempdf[,i])) {
        stop('All predictor variables must be numeric')
    }

    if(!is.factor(mydf[,ynum])) mydf[,ynum] = factor(mydf[,ynum])
    len = length(levels(mydf[,ynum]))
    print(paste('Number of categories =', len))

    library(pls)
    mod = cppls(as.matrix(as.numeric(mydf[,ynum])) ~ as.matrix(mydf[-ynum]), scale=TRUE)
    print(summary(mod))
    # score plot: first two components, points colored by outcome category
    plot(mod$scores[,1], mod$scores[,2],
            col=mydf[,ynum],
            xlab='Component_1', ylab='Component_2',
            main='Partial Least Squares Discriminant Analysis', pch=19)
    legend('bottomright', legend=levels(mydf[,ynum]), pch=19, col=1:len)
    abline(h=0, v=0)
    mod
}
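As noted above, the same analysis can also be run with plsda() from the mixOmics package. A minimal sketch, assuming mixOmics is installed (the predictors are passed as a numeric matrix X and the outcome as a factor Y):

```r
library(mixOmics)

X = iris[, 1:4]               # numeric predictors
Y = iris$Species              # categorical outcome (factor)

# PLS-DA with 2 components; plotIndiv() draws the score plot,
# coloring samples by group
mod2 = plsda(X, Y, ncomp = 2)
plotIndiv(mod2, group = Y, legend = TRUE,
          title = 'PLS-DA on iris (mixOmics)')
```

Unlike cppls(), plsda() accepts the factor outcome directly and dummy-codes it internally, so no manual as.numeric() conversion is needed.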

R's built-in iris dataset, which has petal and sepal lengths and widths for 3 plant species, is used here:

Code:

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> mypls_da_multinomial(iris, 5)
[1] "Number of categories = 3"
Data:   X dimension: 150 4 
        Y dimension: 150 1
Fit method: cppls
Number of components considered: 4
TRAINING: % variance explained
                                     1 comps  2 comps  3 comps  4 comps
X                                      72.95    77.98    99.48   100.00
as.matrix(as.numeric(mydf[, ynum]))    89.62    92.85    93.02    93.04
NULL
Canonical powered partial least squares , fitted with the cppls algorithm.
Call:
cppls(formula = as.matrix(as.numeric(mydf[, ynum])) ~ as.matrix(mydf[-ynum]),     scale = T)
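The X row of the "% variance explained" table above can also be obtained directly with pls's explvar() helper, which returns the percent X-variance captured by each component (a sketch, rebuilding the same model outside the function):

```r
library(pls)

myiris = iris
myiris$num_species = as.numeric(myiris$Species)

# same model as in the function above
mod = cppls(num_species ~ ., data = myiris[-5], scale = TRUE)

# cumulative % of X-variance explained, component by component
round(cumsum(explvar(mod)), 2)
```

This reproduces the 72.95, 77.98, 99.48, 100.00 sequence reported in the summary.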

Output graph: scatter plot of the first two PLS components, with points colored by species.

The same result can be obtained by running each command separately, rather than using the function:

Code:

> myiris = iris
> myiris$num_species = as.numeric(myiris$Species)

> str(myiris)
'data.frame':   150 obs. of  6 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ num_species : num  1 1 1 1 1 1 1 1 1 1 ...

> library(pls)
> mod = cppls(num_species~., data=myiris[-5], scale=T)
> summary(mod)
Data:   X dimension: 150 4 
        Y dimension: 150 1
Fit method: cppls
Number of components considered: 4
TRAINING: % variance explained
             1 comps  2 comps  3 comps  4 comps
X              72.95    77.98    99.48   100.00
num_species    89.62    92.85    93.02    93.04

> plot(mod$scores, col=myiris$Species)
> abline(v=0, h=0)
> legend('bottomright',legend=levels(myiris$Species), col=1:length(levels(myiris$Species)), pch=1 )
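As a rough check of how well the components separate the species, the fitted response can be rounded back onto the 1..3 species codes and compared with the truth. This is only a sketch of training accuracy, not a proper validated classifier:

```r
library(pls)

myiris = iris
myiris$num_species = as.numeric(myiris$Species)
mod = cppls(num_species ~ ., data = myiris[-5], scale = TRUE)

# predict() on an mvr object returns an array [samples, responses, ncomp];
# take the 2-component fit, round to the nearest species code, and clamp
pred = round(predict(mod, ncomp = 2)[, 1, 1])
pred = pmin(pmax(pred, 1), 3)

mean(pred == myiris$num_species)   # in-sample accuracy
```

For honest accuracy estimates, cross-validation (e.g. the validation = "CV" argument of cppls()) should be used instead of in-sample agreement.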


References:
Mevik, B.-H., Wehrens, R. and Liland, K.H. (2013). pls: Partial Least Squares and Principal Component Regression. R package version 2.4-3. http://CRAN.R-project.org/package=pls
 