11.1. Basic Statistics

Summary Statistics

Some data:

> x <- rnorm(100)

Maximum value:

> max(x)
[1] 3.251759

Minimum value:

> min(x)
[1] -2.340614

Sum of all values:

> sum(x)
[1] 8.446345

Mean of all values:

> mean(x)
[1] 0.08446345

Median of all values:

> median(x)
[1] 0.0814703

Range (min, max) of all values:

> range(x)
[1] -2.340614  3.251759

Parallel (element-wise) maximum and minimum:

> a <- sample(1:10)
> b <- sample(1:10)
> a
 [1]  8  5  9 10  4  2  6  1  7  3
> b
 [1]  4  6  9 10  2  8  7  1  3  5
> pmax(a,b)
 [1]  8  6  9 10  4  8  7  1  7  5
> pmin(a,b)
 [1]  4  5  9 10  2  2  6  1  3  3
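
pmax and pmin compare their arguments element-wise and recycle shorter arguments, so a scalar acts as an element-wise floor or cap. A small sketch (the vector here is illustrative):

```r
# pmax/pmin recycle the shorter argument, so a scalar
# acts as an element-wise floor or ceiling
v <- c(1, 8, 3)
pmax(v, 5)   # element-wise floor of 5: 5 8 5
pmin(v, 5)   # element-wise cap of 5:   1 5 3
```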

Summary statistics for all variables in a data frame can be obtained simultaneously:

> data("mtcars")
> summary(mtcars)
      mpg             cyl             disp             hp             drat             wt
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424
      qsec             vs               am              gear            carb
 Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :17.71   Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
 Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

Variance, Covariance

Computing the sample variance:

> var(mtcars$mpg)
[1] 36.3241

The mean square value:

> n <- length(mtcars$mpg)
> ms <- sum(mtcars$mpg^2) / n
> ms
[1] 438.8222

Verifying the relationship between the mean square value and the variance: the mean square minus the squared mean is the population variance, i.e. the sample variance scaled by (n - 1) / n:

> ms - mean(mtcars$mpg)^2
[1] 35.18897
> var(mtcars$mpg)  * (n - 1) / n
[1] 35.18897

Computing the variance of each variable in a data frame:

> round(sapply(mtcars, var), digits=2)
     mpg      cyl     disp       hp     drat       wt     qsec       vs       am     gear     carb
   36.32     3.19 15360.80  4700.87     0.29     0.96     3.19     0.25     0.25     0.54     2.61

Variances of selected columns:

> sapply(mtcars[, c('cyl', 'disp', 'wt')], var)
         cyl         disp           wt
    3.189516 15360.799829     0.957379

Computing the covariance matrix for all variables in a data frame:

> round(cov(mtcars), digits=2)
         mpg    cyl     disp      hp   drat     wt   qsec     vs     am   gear  carb
mpg    36.32  -9.17  -633.10 -320.73   2.20  -5.12   4.51   2.02   1.80   2.14 -5.36
cyl    -9.17   3.19   199.66  101.93  -0.67   1.37  -1.89  -0.73  -0.47  -0.65  1.52
disp -633.10 199.66 15360.80 6721.16 -47.06 107.68 -96.05 -44.38 -36.56 -50.80 79.07
hp   -320.73 101.93  6721.16 4700.87 -16.45  44.19 -86.77 -24.99  -8.32  -6.36 83.04
drat    2.20  -0.67   -47.06  -16.45   0.29  -0.37   0.09   0.12   0.19   0.28 -0.08
wt     -5.12   1.37   107.68   44.19  -0.37   0.96  -0.31  -0.27  -0.34  -0.42  0.68
qsec    4.51  -1.89   -96.05  -86.77   0.09  -0.31   3.19   0.67  -0.20  -0.28 -1.89
vs      2.02  -0.73   -44.38  -24.99   0.12  -0.27   0.67   0.25   0.04   0.08 -0.46
am      1.80  -0.47   -36.56   -8.32   0.19  -0.34  -0.20   0.04   0.25   0.29  0.05
gear    2.14  -0.65   -50.80   -6.36   0.28  -0.42  -0.28   0.08   0.29   0.54  0.33
carb   -5.36   1.52    79.07   83.04  -0.08   0.68  -1.89  -0.46   0.05   0.33  2.61

Computing the covariance matrix of selected variables:

> cov(mtcars[, c('cyl', 'disp', 'wt')])
            cyl       disp         wt
cyl    3.189516   199.6603   1.367371
disp 199.660282 15360.7998 107.684204
wt     1.367371   107.6842   0.957379
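
A covariance matrix can be converted into a correlation matrix with the base R function cov2cor, which rescales each entry by the corresponding standard deviations:

```r
# cov2cor rescales a covariance matrix into a correlation matrix
V <- cov(mtcars[, c('cyl', 'disp', 'wt')])
all.equal(cov2cor(V), cor(mtcars[, c('cyl', 'disp', 'wt')]))  # TRUE
```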

Computing the standard deviation:

> sd(mtcars$mpg)
[1] 6.026948

Standard deviation of each variable in a data frame:

> sapply(mtcars, sd)
        mpg         cyl        disp          hp        drat          wt        qsec          vs
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574   1.7869432   0.5040161
         am        gear        carb
  0.4989909   0.7378041   1.6152000

Pearson Correlation

Pearson correlation coefficients are useful for estimating the linear dependence between numeric variables. The coefficient ranges from -1 to 1; its absolute value ranges from 0 (no correlation) to 1 (complete correlation). The following table interprets the magnitude:

|Pearson coefficient| Interpretation
0.00 - 0.19 very weak correlation
0.20 - 0.39 weak correlation
0.40 - 0.59 moderate correlation
0.60 - 0.79 strong correlation
0.80 - 1.00 very strong correlation

Computing Pearson correlation coefficients for all variables in a data frame:

> round(cor(mtcars), digits=2)
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Computing Pearson correlation coefficients for selected variables:

> cor(mtcars[, c('cyl', 'disp', 'wt')])
           cyl      disp        wt
cyl  1.0000000 0.9020329 0.7824958
disp 0.9020329 1.0000000 0.8879799
wt   0.7824958 0.8879799 1.0000000

Here is a way to map actual correlation values to these five ranges.

Let us compute the correlation coefficients for the numeric variables in the iris dataset:

> iris.correlations <- cor(iris[, -c(5)])
> iris.correlations
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Note that we have left out the Species variable, which is a factor variable.

Let us use cut to break it into 5 ranges:

> iris.correlation.levels <- cut(abs(iris.correlations), breaks=c(0, .2, .4, .6, .8, 1.0), include.lowest=TRUE, labels=c('VW', 'WK', 'MD', 'ST', 'VS'))

cut returns a vector, so we need to reshape it into a matrix:

> iris.correlation.levels <- matrix(iris.correlation.levels, nrow=4)
> iris.correlation.levels
     [,1] [,2] [,3] [,4]
[1,] "VS" "VW" "VS" "VS"
[2,] "VW" "VS" "MD" "WK"
[3,] "VS" "MD" "VS" "VS"
[4,] "VS" "WK" "VS" "VS"

We are still missing the row names and column names. Let us get them back too:

> rownames(iris.correlation.levels) <- rownames(iris.correlations)
> colnames(iris.correlation.levels) <- colnames(iris.correlations)
> iris.correlation.levels
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length "VS"         "VW"        "VS"         "VS"
Sepal.Width  "VW"         "VS"        "MD"         "WK"
Petal.Length "VS"         "MD"        "VS"         "VS"
Petal.Width  "VS"         "WK"        "VS"         "VS"

Some interesting exercises. Note that although abs(x) and x^2 are completely determined by x, their correlation with x is near zero for this roughly symmetric sample; Pearson correlation captures only linear dependence:

> x <- rnorm(100)
> cor(x, x)
[1] 1
> cor(x, abs(x))
[1] 0.03242731
> cor(x, x^2)
[1] -0.01069063
> cor(abs(x), x^2)
[1] 0.9333162
> cor(x, x^3)
[1] 0.7631594
> cor(abs(x), abs(x)^3)
[1] 0.8048567
> cor(abs(x), x^4)
[1] 0.6817026
> cor(abs(x), log(abs(x)))
[1] 0.8360999

Spearman Correlation

The Spearman correlation method is rank based, which makes it suitable for computing the correlation of a factor variable (through its integer codes) with other variables. In the next example, we will compute the Spearman correlation of the Species variable with the other variables in the iris dataset.

In order to compute the correlation, a factor variable needs to be cast as a numeric variable:

> as.numeric(iris$Species)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [48] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [95] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[142] 3 3 3 3 3 3 3 3 3

We can use this for computing the correlations:

> cor(as.numeric(iris$Species), iris[, -5], method='spearman')
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]    0.7980781  -0.4402896    0.9354305   0.9381792

Note that we have removed the Species variable from the second parameter for computing the correlations.

This may look a bit cumbersome if the number of variables is large. Here is a cleaner look:

> t(cor(as.numeric(iris$Species), iris[, -5], method='spearman'))
                   [,1]
Sepal.Length  0.7980781
Sepal.Width  -0.4402896
Petal.Length  0.9354305
Petal.Width   0.9381792

It can be made a little bit more beautiful:

> iris.species.correlations <- t(cor(as.numeric(iris$Species), iris[, -5], method='spearman'))
> colnames(iris.species.correlations) <- c('correlation')
> iris.species.correlations
             correlation
Sepal.Length   0.7980781
Sepal.Width   -0.4402896
Petal.Length   0.9354305
Petal.Width    0.9381792

We note that petal length and width are very strongly correlated with the species.

Tukey Five Number Summary

The five numbers are the minimum, lower hinge, median, upper hinge, and maximum:

> fivenum(mtcars$mpg)
[1] 10.40 15.35 19.20 22.80 33.90

Quantiles

Computing the quantiles of given data (note that pre-sorting the data makes no difference to the result):

> quantile(mtcars$mpg)
    0%    25%    50%    75%   100%
10.400 15.425 19.200 22.800 33.900
> quantile(sort(mtcars$mpg))
    0%    25%    50%    75%   100%
10.400 15.425 19.200 22.800 33.900
> quantile(mtcars$mpg, probs=c(0.1, 0.2, 0.4, 0.8, 1.0))
  10%   20%   40%   80%  100%
14.34 15.20 17.92 24.08 33.90
> IQR(mtcars$mpg)
[1] 7.375
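
The interquartile range is simply the difference between the 75% and 25% quantiles, which we can verify directly:

```r
# IQR(x) equals the difference of the third and first quartiles
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
unname(q[2] - q[1])   # 7.375, matching IQR(mtcars$mpg)
```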

Median Absolute Deviation

> mad(mtcars$mpg)
[1] 5.41149
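
By default, mad scales the raw median absolute deviation by the constant 1.4826 so that it is a consistent estimator of the standard deviation for normally distributed data. The raw value can be obtained by setting constant to 1 (the small vector here is illustrative):

```r
# mad() multiplies the raw median absolute deviation by 1.4826 by default
x <- c(1, 2, 3, 4, 100)
mad(x, constant = 1)  # raw median absolute deviation: 1
mad(x)                # scaled by the default constant: 1.4826
```

Note how the outlier 100 barely affects the result; this robustness is the main reason to prefer MAD over the standard deviation for contaminated data.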

Skewness

This is available in the e1071 package:

> library(e1071)
> skewness(mtcars$mpg)
[1] 0.610655
> skewness(discoveries)
[1] 1.2076

Kurtosis

This is available in the e1071 package:

> library(e1071)
> kurtosis(mtcars$mpg)
[1] -0.372766
> kurtosis(discoveries)
[1] 1.989659
Note that e1071 reports excess kurtosis, i.e. kurtosis relative to a normal distribution, which has excess kurtosis 0.

  • Samples with negative excess kurtosis are called platykurtic.
  • Samples with positive excess kurtosis are called leptokurtic.
  • Samples with excess kurtosis very close to 0 are called mesokurtic.

Scaling or Standardizing a Variable

Let us pick a variable and check its mean and variance:

> x <- mtcars$mpg
> mean(x)
[1] 20.09062
> var(x)
[1] 36.3241

Let us now scale it to zero mean and unit variance:

> y <- scale(x)
> mean(y)
[1] 7.112366e-17
> var(y)
     [,1]
[1,]    1
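
Scaling is equivalent to subtracting the mean and dividing by the standard deviation, which we can check by hand:

```r
# scale(x) computes (x - mean(x)) / sd(x), returned as a one-column matrix
x <- mtcars$mpg
y.manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), y.manual)  # TRUE
```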

Scaling a whole data frame:

> mtcars2 <- scale(mtcars)

Let us verify the means:

> colMeans(mtcars)
       mpg        cyl       disp         hp       drat         wt       qsec         vs
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500
        am       gear       carb
  0.406250   3.687500   2.812500
> colMeans(mtcars2)
          mpg           cyl          disp            hp          drat            wt
 7.112366e-17 -1.474515e-17 -9.085614e-17  1.040834e-17 -2.918672e-16  4.682398e-17
         qsec            vs            am          gear          carb
 5.299580e-16  6.938894e-18  4.510281e-17 -3.469447e-18  3.165870e-17

Note that scale returns a matrix, and the original column means are preserved in it as an attribute:

> attr(mtcars2, 'scaled:center')
       mpg        cyl       disp         hp       drat         wt       qsec         vs
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500
        am       gear       carb
  0.406250   3.687500   2.812500

And so are original standard deviations:

> attr(mtcars2, 'scaled:scale')
        mpg         cyl        disp          hp        drat          wt        qsec          vs
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574   1.7869432   0.5040161
         am        gear        carb
  0.4989909   0.7378041   1.6152000

Comparing the standard deviations and variances of the original columns with the variances of the scaled columns verifies that the scaled data indeed has unit variance:

> apply(mtcars, 2, sd)
        mpg         cyl        disp          hp        drat          wt        qsec          vs
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574   1.7869432   0.5040161
         am        gear        carb
  0.4989909   0.7378041   1.6152000
> apply(mtcars, 2, var)
         mpg          cyl         disp           hp         drat           wt         qsec
3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01 9.573790e-01 3.193166e+00
          vs           am         gear         carb
2.540323e-01 2.489919e-01 5.443548e-01 2.608871e+00
> apply(mtcars2, 2, var)
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   1    1    1    1    1    1    1    1    1    1    1

Scaling a data frame which contains a mixture of numeric and factor variables is a bit more involved. We will work with the iris dataset for this example:

> iris2 <- iris

We first identify the variables which are numeric:

> ind <- sapply(iris2, is.numeric)
> ind
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
        TRUE         TRUE         TRUE         TRUE        FALSE

Next we scale these variables:

> iris2.scaled <- scale(iris2[ind])
> attr(iris2.scaled, 'scaled:center')
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333
> attr(iris2.scaled, 'scaled:scale')
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   0.8280661    0.4358663    1.7652982    0.7622377

Time to merge it back:

> iris2[ind] <- iris2.scaled

Verify that the numeric columns have indeed been scaled:

> sapply(iris2[ind], mean)
 Sepal.Length   Sepal.Width  Petal.Length   Petal.Width
-4.484318e-16  2.034094e-16 -2.895326e-17 -3.663049e-17
> sapply(iris2[ind], sd)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
           1            1            1            1
> sapply(iris2[ind], var)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
           1            1            1            1

11.1.1. Group Wise Statistics

We will compute summary statistics for each species in the iris dataset:

> data("iris")
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
> tapply(iris$Petal.Length, iris$Species, summary)
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.400   1.500   1.462   1.575   1.900

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00    4.00    4.35    4.26    4.60    5.10

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.500   5.100   5.550   5.552   5.875   6.900

We can compute individual group-wise statistics too:

> tapply(iris$Petal.Length, iris$Species, mean)
    setosa versicolor  virginica
     1.462      4.260      5.552
> tapply(iris$Petal.Length, iris$Species, max)
    setosa versicolor  virginica
       1.9        5.1        6.9
> tapply(iris$Petal.Length, iris$Species, var)
    setosa versicolor  virginica
0.03015918 0.22081633 0.30458776
> tapply(iris$Petal.Length, iris$Species, min)
    setosa versicolor  virginica
       1.0        3.0        4.5

11.1.1.1. Frequency Tables

When we convert a vector into a factor, we can compute the frequency table of its levels as follows:

> states <- sample(datasets::state.name[1:10], 20, replace=TRUE)
> statesf <- factor(states)
> table(statesf)
statesf
    Alabama      Alaska     Arizona  California    Colorado Connecticut    Delaware     Florida     Georgia
          1           1           1           2           2           2           3           5           3

Looking at the tabulation in proportional terms:

> states <- sample(datasets::state.name[1:10], 20, replace=TRUE)
> statesf <- factor(states)
> prop.table(table(statesf))
statesf
   Alabama     Alaska   Arkansas California   Colorado   Delaware    Florida    Georgia
      0.05       0.10       0.25       0.15       0.05       0.15       0.15       0.10

Building a two-dimensional frequency table

Here is a simple example. We will extract the fuel type and vehicle class from the mpg data set (available in the ggplot2 package) and tabulate the co-occurrence of pairs of these two variables:

> table(mpg[, c('fl', 'class')])
   class
fl  2seater compact midsize minivan pickup subcompact suv
  c       0       0       0       0      0          1   0
  d       0       1       0       0      0          2   2
  e       0       0       0       1      3          0   4
  p       5      21      15       0      0          3   8
  r       0      25      26      10     30         29  48

Let us convert this table into a simple data frame:

> df <- as.data.frame(table(mpg[, c('fl', 'class')]))
> df[df$Freq != 0,]
   fl      class Freq
4   p    2seater    5
7   d    compact    1
9   p    compact   21
10  r    compact   25
14  p    midsize   15
15  r    midsize   26
18  e    minivan    1
20  r    minivan   10
23  e     pickup    3
25  r     pickup   30
26  c subcompact    1
27  d subcompact    2
29  p subcompact    3
30  r subcompact   29
32  d        suv    2
33  e        suv    4
34  p        suv    8
35  r        suv   48

Note that only 18 of the rows (or combinations of fuel type and vehicle class) have non-zero entries.

US states income data:

> incomes <- datasets::state.x77[,2]
> summary(incomes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3098    3993    4519    4436    4814    6315

Categorizing the income data:

> incomes_fr <- cut(incomes, breaks=2500+1000*(0:4), dig.lab = 4)
> incomes_fr
 [1] (3500,4500] (5500,6500] (4500,5500] (2500,3500] (4500,5500] (4500,5500] (4500,5500] (4500,5500]
 [9] (4500,5500] (3500,4500] (4500,5500] (3500,4500] (4500,5500] (3500,4500] (4500,5500] (4500,5500]
[17] (3500,4500] (3500,4500] (3500,4500] (4500,5500] (4500,5500] (4500,5500] (4500,5500] (2500,3500]
[25] (3500,4500] (3500,4500] (4500,5500] (4500,5500] (3500,4500] (4500,5500] (3500,4500] (4500,5500]
[33] (3500,4500] (4500,5500] (4500,5500] (3500,4500] (4500,5500] (3500,4500] (4500,5500] (3500,4500]
[41] (3500,4500] (3500,4500] (3500,4500] (3500,4500] (3500,4500] (4500,5500] (4500,5500] (3500,4500]
[49] (3500,4500] (4500,5500]
Levels: (2500,3500] (3500,4500] (4500,5500] (5500,6500]

Tabulating the income data frequencies:

> table(incomes_fr)
incomes_fr
(2500,3500] (3500,4500] (4500,5500] (5500,6500]
          2          22          25           1

US states illiteracy data:

> illiteracy <- datasets::state.x77[,3]
> summary(illiteracy)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.500   0.625   0.950   1.170   1.575   2.800

Categorizing the illiteracy data:

> illiteracy_fr <- cut(illiteracy, breaks=c(0, .5, 1.0, 1.5, 2.0,2.5, 3.0))
> illiteracy_fr
 [1] (2,2.5] (1,1.5] (1.5,2] (1.5,2] (1,1.5] (0.5,1] (1,1.5] (0.5,1] (1,1.5] (1.5,2] (1.5,2] (0.5,1] (0.5,1]
[14] (0.5,1] (0,0.5] (0.5,1] (1.5,2] (2.5,3] (0.5,1] (0.5,1] (1,1.5] (0.5,1] (0.5,1] (2,2.5] (0.5,1] (0.5,1]
[27] (0.5,1] (0,0.5] (0.5,1] (1,1.5] (2,2.5] (1,1.5] (1.5,2] (0.5,1] (0.5,1] (1,1.5] (0.5,1] (0.5,1] (1,1.5]
[40] (2,2.5] (0,0.5] (1.5,2] (2,2.5] (0.5,1] (0.5,1] (1,1.5] (0.5,1] (1,1.5] (0.5,1] (0.5,1]
Levels: (0,0.5] (0.5,1] (1,1.5] (1.5,2] (2,2.5] (2.5,3]

Tabulating the illiteracy data frequencies:

> table(illiteracy_fr)
illiteracy_fr
(0,0.5] (0.5,1] (1,1.5] (1.5,2] (2,2.5] (2.5,3]
      3      23      11       7       5       1

Tabulating income vs illiteracy

> table(incomes_fr, illiteracy_fr)
             illiteracy_fr
incomes_fr    (0,0.5] (0.5,1] (1,1.5] (1.5,2] (2,2.5] (2.5,3]
  (2500,3500]       0       0       0       1       1       0
  (3500,4500]       1      10       2       4       4       1
  (4500,5500]       2      13       8       2       0       0
  (5500,6500]       0       0       1       0       0       0
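
Two-way tables can also be viewed as proportions. prop.table with margin=1 normalizes each row (each income band's distribution over illiteracy bands) to sum to 1; this sketch reuses incomes_fr and illiteracy_fr from above:

```r
# Row-wise proportions: each income band's distribution over illiteracy bands
tab <- table(incomes_fr, illiteracy_fr)
row.props <- prop.table(tab, margin = 1)
rowSums(row.props)  # every row sums to 1
```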

11.1.1.2. Aggregation

Computing mean of sepal length for each species in iris:

> aggregate(iris$Sepal.Length, by=list(iris$Species), FUN=mean)
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

Computing the mean mileage per gallon of cars aggregated by their number of cylinders and V/S:

> unique(mtcars$cyl)
[1] 6 4 8
> unique(mtcars$vs)
[1] 0 1
> aggregate(mtcars$mpg, by=list(mtcars$cyl,mtcars$vs),
+           FUN=mean, na.rm=TRUE)
  Group.1 Group.2        x
1       4       0 26.00000
2       6       0 20.56667
3       8       0 15.10000
4       4       1 26.73000
5       6       1 19.12500

Computing the mean of all attributes of the cars aggregated by their number of cylinders and V/S:

> aggregate(mtcars, by=list(mtcars$cyl,mtcars$vs),
+           FUN=mean, na.rm=TRUE)
  Group.1 Group.2      mpg cyl   disp       hp     drat       wt     qsec vs        am
1       4       0 26.00000   4 120.30  91.0000 4.430000 2.140000 16.70000  0 1.0000000
2       6       0 20.56667   6 155.00 131.6667 3.806667 2.755000 16.32667  0 1.0000000
3       8       0 15.10000   8 353.10 209.2143 3.229286 3.999214 16.77214  0 0.1428571
4       4       1 26.73000   4 103.62  81.8000 4.035000 2.300300 19.38100  1 0.7000000
5       6       1 19.12500   6 204.55 115.2500 3.420000 3.388750 19.21500  1 0.0000000
      gear     carb
1 5.000000 2.000000
2 4.333333 4.666667
3 3.285714 3.500000
4 4.000000 1.500000
5 3.500000 2.500000

Aggregation using a formula

> aggregate(mpg~cyl+vs, data=mtcars, FUN=mean)
  cyl vs      mpg
1   4  0 26.00000
2   6  0 20.56667
3   8  0 15.10000
4   4  1 26.73000
5   6  1 19.12500
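
The formula interface also supports aggregating several variables at once by combining them with cbind on the left-hand side:

```r
# Mean mpg and hp for each (cyl, vs) combination
aggregate(cbind(mpg, hp) ~ cyl + vs, data = mtcars, FUN = mean)
```

This returns one row per (cyl, vs) combination, with one column of group means for each response variable.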