Compute Summary Statistics in R


Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the built-in R data set iris, which we first convert into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data

## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows

Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function n():

my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )

## # A tibble: 1 x 3
##   count mean_sep mean_pet
##   <int>    <dbl>    <dbl>
## 1   150     5.84     3.76

Note that we used the additional argument na.rm = TRUE to remove missing values (NAs) before computing the means.
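The iris data contain no missing values, so the effect of na.rm is not visible in the output above. The following minimal sketch uses a hypothetical toy tibble (the names toy, x and res are illustrative, not part of the tutorial data) to show what happens with and without the argument:

```r
library(dplyr)

# Hypothetical toy data with one missing value in column x
toy <- tibble(x = c(1, 2, NA, 5))

res <- toy %>%
  summarise(
    mean_default = mean(x),               # NA: the missing value propagates
    mean_na_rm   = mean(x, na.rm = TRUE), # mean of the three observed values
    n_missing    = sum(is.na(x))          # number of values that were dropped
  )
res
```

Reporting the number of missing values alongside the mean makes it clear how many observations the summary is actually based on.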

Summary statistics of grouped data

Key R functions: group_by() and summarise()

Group by one variable

my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )

## # A tibble: 3 x 4
##   Species    count mean_sep mean_pet
##   <fct>      <int>    <dbl>    <dbl>
## 1 setosa        50     5.01     1.46
## 2 versicolor    50     5.94     4.26
## 3 virginica     50     6.59     5.55

Note that it’s possible to chain multiple operations using the magrittr forward-pipe operator, %>%. For example, x %>% f is equivalent to f(x).

In the R code above:

first, my_data is passed to the group_by() function
next, the output of group_by() is passed to the summarise() function
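The equivalence between the piped form and ordinary nested function calls can be checked directly. This is a minimal sketch, assuming the iris tibble used in this tutorial; the object names nested and piped are illustrative:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Nested form: read from the inside out
nested <- summarise(group_by(my_data, Species), mean_sep = mean(Sepal.Length))

# Piped form: read from top to bottom
piped <- my_data %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Both forms should give the same group means
all.equal(nested$mean_sep, piped$mean_sep)
```

The piped form is usually easier to read once more than two steps are chained.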

Group by multiple variables

# ToothGrowth demo data set
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )

## # A tibble: 6 x 5
## # Groups:   supp [?]
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
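The "Groups: supp" line in the output indicates that the result is still grouped by supp: summarise() removes only the last grouping variable. If an ungrouped result is preferred, one option is the .groups argument. The sketch below assumes dplyr version 1.0.0 or later, where this argument is available; the object names by_supp and flat are illustrative:

```r
library(dplyr)

# Default behaviour: the result keeps the first grouping variable
by_supp <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len))
group_vars(by_supp)

# Drop all grouping explicitly (assumes dplyr >= 1.0.0)
flat <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len), .groups = "drop")
group_vars(flat)
```

Calling ungroup() on the result is an alternative that also works in older dplyr versions.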

Summarise multiple variables

Key R functions

The functions summarise_all(), summarise_at() and summarise_if() can be used to summarise multiple columns at once.

The simplified formats are as follows:

summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)

.tbl: a tbl data frame
.funs: A function, a character vector of function names, or a list of functions. Older dplyr versions also accepted function calls generated by funs(), which has since been deprecated.
.vars: The columns to summarise in summarise_at(), for example a character vector of column names.
…: Additional arguments for the function calls in .funs.
.predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.

Summarise variables

Summarise all variables - compute the mean of all variables:

my_data %>%
  group_by(Species) %>%
  summarise_all(mean)

## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33
## 3 virginica          6.59        2.97         5.55       2.03

Summarise specific variables selected with a character vector:

my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
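The .funs argument can also take a named list of functions, which computes several statistics per column in one call. A minimal sketch, assuming the same iris data; the object name multi is illustrative, and the list names (mean, sd) are used as suffixes in the resulting column names:

```r
library(dplyr)

multi <- as_tibble(iris) %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),
    list(mean = mean, sd = sd)  # named list: the names become column suffixes
  )

# Inspect the generated column names
names(multi)
```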

Summarise specific variables selected with a predicate function:

my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
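In dplyr 1.0.0 and later, the scoped verbs summarise_all(), summarise_at() and summarise_if() are superseded by across(), although they continue to work. The following sketch, which assumes dplyr 1.0.0 or later and uses the illustrative object names old_style and new_style, shows the same predicate-based summary written both ways:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Scoped verb, as used in this tutorial
old_style <- my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

# across() form (assumes dplyr >= 1.0.0)
new_style <- my_data %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# The two results should contain the same values
all.equal(as.data.frame(old_style), as.data.frame(new_style))
```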

Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

mean(x): sum of x divided by the length
median(x): 50% of x is above and 50% is below

Measure of variation:

sd(x): standard deviation
IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)

Measure of rank:

min(x): minimum value of x
max(x): maximum value of x
quantile(x, 0.25): 25% of x is below this value

Measure of position:

first(x): equivalent to x[1]
nth(x, 2): equivalent to n<-2; x[n]
last(x): equivalent to x[length(x)]

Counts:

n(): the number of observations in the current group (takes no argument)
sum(!is.na(x)): count non-missing values
n_distinct(x): count the number of unique values

Counts and proportions of logical values:

sum(x > 10): count the number of elements where x > 10
mean(y == 0): proportion of elements where y = 0
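Several of the functions listed in this section can be combined in a single summarise() call. This is a minimal sketch on the ToothGrowth data; the output column names and the threshold of 20 are arbitrary choices for illustration:

```r
library(dplyr)

stats <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(
    n          = n(),                  # rows in the group
    n_doses    = n_distinct(dose),     # number of unique dose values
    median_len = median(len),
    iqr_len    = IQR(len),
    mad_len    = mad(len),
    q25_len    = quantile(len, 0.25),
    first_len  = first(len),           # same as len[1]
    last_len   = last(len),            # same as len[length(len)]
    n_over_20  = sum(len > 20),        # count of rows where len > 20
    prop_low   = mean(dose == 0.5)     # proportion of rows at the lowest dose
  )
stats
```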

Summary

In this tutorial, we described how to compute statistical summaries using the R functions summarise() and group_by() [in the dplyr package].

Teacher

Alboukadel Kassambara

Role : Founder of Datanovia

Website : https://www.datanovia.com/en
Experience : >10 years
Specialist in : Bioinformatics and Cancer Biology

Compute Summary Statistics in R - Datanovia (2024)