## Data frames

For statistical work, “data.frame” objects are very convenient. They basically are common tables with rows and columns. There are row names and column names also that can be accessed. Easiest is to create the table in a spreadsheet program and save it as csv (comma separated values file). This  csv file can then be read from within R using following command:

code:

The characters used for missing values and the separator (if not comma), can also be specified here:

code:

The data is now in mydf which is a data.frame object. Its values of which can be accessed using following common commands:

code:

> mydf         # full data.frame is printed out
> str(mydf)      # gives structure of data.frame

> nrow(mydf)     # number of rows
> ncol(mydf)     # number of columns
> rownames(mydf)    # print row names
> colnames(mydf)    # print column names

> head(mydf)     # print top 6 rows
> head(mydf, 10)    # print top 10 rows
> tail(mydf)    # print last 6 rows
> tail(mydf, 10)    # print last 10 rows

> mydf\$varname    # prints out the varname column
> mydf\$varname = 0    # puts 0 in all rows of varname column
> mydf\$varname = vect     # puts values of vect in varname column

> mydf[i,j]        # gives value in ith row and jth column

For examples here we can use modified 'birthwt' dataset available in MASS package of R. The dataset is modified by following function:

code:

> mybwdf = function(sendfactor=T){
library(MASS)
newbw = birthwt
rownames(newbw) = with(newbw, paste0(low,'-',1:nrow(birthwt)))
newbw\$low = factor(newbw\$low)
newbw\$race = factor(newbw\$race)
newbw\$smoke = factor(newbw\$smoke)
newbw\$ui = factor(newbw\$ui)
newbw\$ht = factor(newbw\$ht)
if(sendfactor) newbw = newbw[-10]
else newbw=newbw[-1]
print(str(newbw))
newbw
}

It is obtained with following command:

code:

> bwdf = mybwdf()

The structure of data set can be seen with following command:

code:

> str(bwdf)

output

'data.frame':   189 obs. of  9 variables:
\$ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
\$ age  : int  19 33 20 21 18 21 22 17 29 26 ...
\$ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
\$ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
\$ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
\$ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
\$ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
\$ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
\$ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...

First 6 rows can be seen with head command:

code:

output

low age lwt race smoke ptl ht ui ftv

0-1   0  19 182    2     0   0  0  1   0
0-2   0  33 155    3     0   0  0  0   3
0-3   0  20 105    1     1   0  0  0   1
0-4   0  21 108    1     1   0  0  1   2
0-5   0  18 107    1     1   0  0  1   0
0-6   0  21 124    3     0   0  0  0   0

The rownames start with '0-' if the value in low column is 0 and with '1-' if the value in low column is 1. This will be helpful when rownames are plotted on graphs.

The variables (columns) indicate following:

low:     0 indicates of birth weight less than 2.5 kg; 1 if not.

age:     age of mother (years).

lwt:     weight of mother (pounds)

race:    race of mother (1 if white, 2 if black, 3 if other)

smoke:   1 if smoked during pregnancy.

ptl:     number of premature labours in the past.

Ht:      1 of history of hypertension is present.

Ui:      1 of uterine irritability is present.

ftv:     number of physician visits in early pregnancy.

bwt:     weight of newborn at birth (grams).

Getting a subset of a dataframe:

A subset of rows of a dataframe can be obtained as follows

code:

> subsetdf = mydf[1:10, ]            # get first 10 rows of a dataframe

> subsetdf = mydf[, 1:3]  # get first 3 columns of a dataframe

> subsetdf = mydf[1:3]    # same result as above, since numbers are taken to be column (and not row) numbers if no comma is placed.

> subsetdf = mydf[mydf\$age>5, ]  # get all rows where age column is > 5

> subsetdf = mydf[mydf\$gender=='M', ] # get all rows where gender is 'M'

> subsetdf = mydf[mydf\$gender=='M' & age>5, ]  # get all rows where age is greater than 5 and gender is 'M'

Sorting a data.frame:

The command for sorting a data.frame on a column can be put into following convenient function:

code:

> mysort=function(dd, colnumber){

dd[order(dd[,colnumber]),]

}

> bwdf = mybwdf()

output

low age lwt race smoke ptl ht ui ftv

0-1    0  19 182    2     0   0  0  1   0

0-2    0  33 155    3     0   0  0  0   3

0-3    0  20 105    1     1   0  0  0   1

0-4    0  21 108    1     1   0  0  1   2

0-5    0  18 107    1     1   0  0  1   0

0-6    0  21 124    3     0   0  0  0   0

0-7    0  22 118    1     0   0  0  0   1

0-8    0  17 103    3     0   0  0  0   1

0-9    0  29 123    1     1   0  0  0   1

0-10   0  26 113    1     1   0  0  0   0

code:

> bwbdf = mysort(bwdf,2) # sort on 2nd column

output

low age lwt race smoke ptl ht ui ftv

0-117   0  14 135    1     0   0  0  0   0

1-184   1  14 101    3     1   1  0  0   0

1-186   1  14 100    3     0   0  0  0   2

0-17    0  15  98    2     0   0  0  0   0

1-170   1  15 110    1     0   0  0  0   0

1-174   1  15 115    3     0   0  0  1   0

0-56    0  16 110    3     0   0  0  0   0

0-74    0  16 112    2     0   0  0  0   0

0-75    0  16 135    1     1   0  0  0   0

0-95    0  16 135    1     1   0  0  0   0

Converting data between wide and long forms:

It is often required to convert data between wide and long forms. The functions melt() and dcast() of reshape2 package are most useful for this:

code:

> wide_form= data.frame(subjectID=1:5,

pre_value=sample(1:5),

post_value=sample(1:5))

> wide_form

output

subjectID pre_value post_value

1         1         3          1

2         2         4          5

3         3         1          4

4         4         5          3

5         5         2          2

code:

> library(reshape2)

> long_form = melt(wide_form, id='subjectID')

> long_form

output

subjectID   variable value

1          1  pre_value     3

2          2  pre_value     4

3          3  pre_value     1

4          4  pre_value     5

5          5  pre_value     2

6          1 post_value     1

7          2 post_value     5

8          3 post_value     4

9          4 post_value     3

10         5 post_value     2

code:

> back2wide = dcast(subjectID~variable, value.var='value', data=long_form)

> back2wide

output

subjectID pre_value post_value

1         1         3          1

2         2         4          5

3         3         1          4

4         4         5          3

5         5         2          2

Binding dataframes by rows or columns:

These can be achieved with rbind (bind by rows) cbind (bind by columns) functions.

code:

> newdf = cbind(firstdf, seconddf)  # 2 dataframes should have same number of rows

> newdf = rbind(firstdf, secondf)    # 2 dataframes should have same column names (order may not be the same)

*********************