Summarizing data

Several functions are available for general descriptive statistics such as the minimum, maximum, range, mean, median, standard deviation and variance:

code: 

> min(bwdf$age)
[1] 14

> max(bwdf$age)
[1] 45

> range(bwdf$age)
[1] 14 45

> mean(bwdf$age)
[1] 23.2381

> sd(bwdf$age)
[1] 5.298678

> var(bwdf$age)
[1] 28.07599

> median(bwdf$age)
[1] 23
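
How these functions relate to each other can be checked on a small vector. The sketch below uses a made-up `ages` vector (an assumption, standing in for bwdf$age) and also shows how the functions behave when a missing value is present:

```r
# Hypothetical ages, standing in for bwdf$age
ages <- c(19, 33, 20, 21, 18, 21, 22, 17, 29, 26)

# range() returns the minimum and maximum as a length-2 vector
rng <- range(ages)
identical(rng, c(min(ages), max(ages)))   # TRUE

# The variance is the square of the standard deviation
all.equal(var(ages), sd(ages)^2)          # TRUE

# With a missing value, these functions return NA unless na.rm = TRUE is given
ages_na <- c(ages, NA)
mean(ages_na)                 # NA
mean(ages_na, na.rm = TRUE)   # same value as mean(ages)
```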

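Confidence intervals

Confidence intervals are commonly used to communicate the uncertainty of an estimate. Informally, a 95% confidence interval is often read as a range that has a 95% chance of containing the true value; strictly speaking, it means that if the sampling were repeated many times, about 95% of the intervals computed this way would contain the true value. For a mean from a reasonably large sample, the 95% confidence interval can be approximated by the formula “mean +/- 1.96 * standard error”. For age in the bwdf dataset (mean 23.24, standard error 0.39), this gives roughly 22.5 to 24.0. Confidence intervals are also related to P values: if the 95% confidence intervals of 2 estimates do not overlap, the difference between them is generally statistically significant (P<0.05). The reverse does not hold, since overlapping intervals do not rule out a significant difference.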
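
The formula can be applied directly in R. The following is a minimal sketch that assumes a made-up `ages` vector in place of bwdf$age. It also computes a t-based interval, which is a common alternative for small samples and is not part of the formula quoted here:

```r
# Hypothetical sample of ages; replace with bwdf$age for the dataset used here
ages <- c(19, 33, 20, 21, 18, 21, 22, 17, 29, 26)

n  <- length(ages)
m  <- mean(ages)
se <- sd(ages) / sqrt(n)     # standard error of the mean

# Normal-approximation 95% confidence interval: mean +/- 1.96 * SE
ci_normal <- c(lower = m - 1.96 * se, upper = m + 1.96 * se)

# t-based 95% interval (assumed alternative); wider than ci_normal for small n
ci_t <- m + c(-1, 1) * qt(0.975, df = n - 1) * se

print(ci_normal)
print(ci_t)
```

The t-based interval computed this way should agree with the one reported by `t.test(ages)$conf.int`.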

Summary statistics of all numeric variables of data.frame

The built-in summary() function gives an overview of each variable. For example, to summarize the first 7 variables of the bwdf dataset:

code: 

> summary(bwdf[1:7], digits=2)
 low          age          lwt      race   smoke        ptl      ht     
 0:130   Min.   :14   Min.   : 80   1:96   0:115   Min.   :0.0   0:177  
 1: 59   1st Qu.:19   1st Qu.:110   2:26   1: 74   1st Qu.:0.0   1: 12  
         Median :23   Median :121   3:67           Median :0.0          
         Mean   :23   Mean   :130                  Mean   :0.2          
         3rd Qu.:26   3rd Qu.:140                  3rd Qu.:0.0          
         Max.   :45   Max.   :250                  Max.   :3.0   
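
In the output above, some columns show counts per level while others show quartiles and the mean. This depends on how each column is stored: summary() reports counts for factors and a five-number summary plus the mean for numeric columns. A minimal sketch on a made-up data frame (the names `df` and `age_group` are assumptions for illustration):

```r
# Hypothetical data frame with one factor and one numeric column
df <- data.frame(
  smoke = factor(c(0, 1, 0, 0, 1, 0)),
  age   = c(19, 33, 20, 21, 18, 21)
)

# Factor column: counts per level; numeric column: quartiles and mean
summary(df)

# A numeric variable can be recoded as a factor so that
# summary() reports counts instead of quantiles
df$age_group <- factor(ifelse(df$age < 21, "under21", "21plus"))
summary(df$age_group)
```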
       

The following function also produces summary statistics for all numeric variables of a data.frame. Its advantage is that the list of functions (fnlist) can be altered as needed:

code: 

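mysummary <- function(mydt) {
  library(data.table)
  # Convert to a data.table if needed, then keep only the numeric columns
  if (!is.data.table(mydt)) mydt <- data.table(mydt)
  mydt <- mydt[, sapply(mydt, is.numeric), with = FALSE]

  # Names of the summary functions to apply; edit this list as needed.
  # 'se', 'rnskewness' and 'rnkurtosis' are not part of base R or data.table
  # and must be defined before mysummary() is called.
  fnlist <- c('length', 'min', 'max', 'median', 'mean', 'sd', 'se', 'rnskewness', 'rnkurtosis', 'mad')
  # Apply each function to every column; t() puts one function per row
  ll <- sapply(fnlist, function(x) mydt[, sapply(.SD, x)])
  print(t(ll))
}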

> mysummary(bwdf)

                   age        lwt          ptl          ftv
length     189.0000000 189.000000 189.00000000 189.00000000
min         14.0000000  80.000000   0.00000000   0.00000000
max         45.0000000 250.000000   3.00000000   6.00000000
median      23.0000000 121.000000   0.00000000   0.00000000
mean        23.2380952 129.814815   0.19576720   0.79365079
sd           5.2986779  30.579380   0.49334191   1.05928614
se           0.3854221   2.224323   0.03588534   0.07705173
rnskewness   0.7100000   1.380000   2.76000000   1.56000000
rnkurtosis   0.5300000   2.250000   8.17000000   3.00000000
mad          5.9304000  20.756400   0.00000000   0.00000000
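
The function list refers to se, rnskewness and rnkurtosis, which are not base R functions and whose definitions are not shown here. The sketch below gives one possible set of definitions. The `se` definition reproduces the se row above (sd divided by the square root of the number of observations). The other two are assumptions: moment-based skewness and excess kurtosis rounded to 2 decimals, which may differ from the definitions actually used to produce the table.

```r
# Standard error of the mean; this definition reproduces the 'se' row above
se <- function(x) sd(x) / sqrt(length(x))

# ASSUMPTION: moment-based skewness, rounded to 2 decimals
rnskewness <- function(x) {
  d <- x - mean(x)
  round(mean(d^3) / mean(d^2)^1.5, 2)
}

# ASSUMPTION: moment-based excess kurtosis (0 for a normal distribution),
# rounded to 2 decimals
rnkurtosis <- function(x) {
  d <- x - mean(x)
  round(mean(d^4) / mean(d^2)^2 - 3, 2)
}
```

These helpers need to be defined before mysummary() is called, since the function looks them up by name.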

