Compute Summary Statistics in R - Datanovia (2024)


Compute Summary Statistics in R


This tutorial shows how to easily compute statistical summaries in R using the dplyr package.

You will learn how to:

  • Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: summarise() and group_by().
  • Summarise multiple variable columns. R functions:
    • summarise_all(): apply summary functions to every column in the data frame.
    • summarise_at(): apply summary functions to specific columns selected with a character vector.
    • summarise_if(): apply summary functions to columns selected with a predicate function that returns TRUE.




Contents:

  • Required packages
  • Demo dataset
  • Summary statistics of ungrouped data
  • Summary statistics of grouped data
    • Group by one variable
    • Group by multiple variables
  • Summarise multiple variables
    • Key R functions
    • Summarise variables
  • Useful statistical summary functions
  • Summary

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows

Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function n():

my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
## # A tibble: 1 x 3
##   count mean_sep mean_pet
##   <int>    <dbl>    <dbl>
## 1   150     5.84     3.76

Note that we used the additional argument na.rm = TRUE to remove missing values (NA) before computing the means.
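The iris data contains no missing values, so the effect of na.rm is not visible above. As a minimal sketch, the following uses a made-up data frame (toy, a hypothetical name not used elsewhere in this tutorial) with one NA to show the difference:

```r
library(dplyr)

# Hypothetical toy data with one missing value (not part of the tutorial data)
toy <- tibble(x = c(1, 2, NA, 5))

res <- toy %>%
  summarise(
    mean_default = mean(x),               # the NA propagates to the result
    mean_na_rm   = mean(x, na.rm = TRUE)  # the NA is dropped before averaging
  )
res
```

Without na.rm = TRUE, a single missing value makes the whole mean NA.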

Summary statistics of grouped data

Key R functions: group_by() and summarise()

Group by one variable

my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
## # A tibble: 3 x 4
##   Species    count mean_sep mean_pet
##   <fct>      <int>    <dbl>    <dbl>
## 1 setosa        50     5.01     1.46
## 2 versicolor    50     5.94     4.26
## 3 virginica     50     6.59     5.55

Note that it’s possible to chain multiple operations using the magrittr forward-pipe operator: %>%. For example, x %>% f is equivalent to f(x).

In the R code above:

  • first, my_data is passed to the group_by() function
  • next, the output of group_by() is passed to the summarise() function
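To make the equivalence concrete, here is a small sketch that writes the same computation both ways. The object names piped and nested are chosen only for illustration:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Piped form: read from left to right
piped <- my_data %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Nested form: the same calls, read from the inside out
nested <- summarise(group_by(my_data, Species), mean_sep = mean(Sepal.Length))

# Compare the two results
all.equal(piped, nested)
```

The piped form tends to be easier to read once more than two steps are involved.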

Group by multiple variables

# ToothGrowth demo data set
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )
## # A tibble: 6 x 5
## # Groups:   supp [?]
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
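The "# Groups: supp" line in the output indicates that the result is still grouped by supp: by default, summarise() removes only the last grouping level. As a sketch, here are two ways to obtain an ungrouped result, assuming dplyr 1.0.0 or later for the .groups argument (res1 and res2 are illustrative names):

```r
library(dplyr)

# Option 1: drop all grouping inside summarise() (needs dplyr >= 1.0.0)
res1 <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len), .groups = "drop")

# Option 2: remove the remaining grouping explicitly afterwards
res2 <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len)) %>%
  ungroup()

# Check whether the results are still grouped
is_grouped_df(res1)
is_grouped_df(res2)
```

Dropping the grouping avoids surprises when later steps, such as mutate() or another summarise(), would otherwise operate per supp group.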

Summarise multiple variables

Key R functions

The functions summarise_all(), summarise_at() and summarise_if() can be used to summarise multiple columns at once.

The simplified formats are as follows:

summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
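  • .tbl: a tbl data frame
  • .vars: the columns to summarise, given as a character vector of column names, a numeric vector of column positions, or a list of columns generated by vars().
  • .funs: a function, a character vector of function names, or a list of function calls generated by funs(). Note that funs() is deprecated since dplyr 0.8.0; a named list of functions, such as list(mean = mean, sd = sd), can be used instead.
  • …: Additional arguments for the function calls in .funs.
  • .predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.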
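As an illustration of the .funs and … arguments, the sketch below applies two summary functions to two columns at once by passing a named list of functions. The list names (mean, sd) and the object name res are choices made for this example:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Apply two summary functions to two columns.
# na.rm = TRUE is forwarded through ... to both mean() and sd().
res <- my_data %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  )

# Output columns are named <column>_<function name>
names(res)
```

Naming the list elements controls the suffix of the output columns, which keeps the result readable when several functions are applied.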

Summarise variables

  • Summarise all variables - compute the mean of all variables:
my_data %>% group_by(Species) %>% summarise_all(mean)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33
## 3 virginica          6.59        2.97         5.55       2.03
  • Summarise specific variables selected with a character vector:
my_data %>% group_by(Species) %>% summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
  • Summarise specific variables selected with a predicate function:
my_data %>% group_by(Species) %>% summarise_if(is.numeric, mean, na.rm = TRUE)
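In dplyr 1.0.0 and later, summarise_all(), summarise_at() and summarise_if() are superseded by across(), although they continue to work. As a sketch under that assumption, the three calls above could be rewritten as follows; the formula lambda ~ mean(.x, na.rm = TRUE) is one way to pass extra arguments, and the result names are illustrative:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Counterpart of summarise_all(mean): every non-grouping column
all_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Counterpart of summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
some_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
num_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

With across(), column selection and the summary function live inside a regular summarise() call, so they can be combined with other summaries such as n() in the same step.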

Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

  • mean(x): sum of x divided by the length
  • median(x): 50% of x is above and 50% is below

Measure of variation:

  • sd(x): standard deviation
  • IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
  • mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)

Measure of rank:

  • min(x): minimum value of x
  • max(x): maximum value of x
  • quantile(x, 0.25): 25% of x is below this value
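The location, variation and rank functions listed so far can be combined in a single summarise() call. A minimal sketch on the iris data, with output column names chosen for illustration:

```r
library(dplyr)

my_data <- as_tibble(iris)

res <- my_data %>%
  group_by(Species) %>%
  summarise(
    median_sep = median(Sepal.Length),          # location
    sd_sep     = sd(Sepal.Length),              # variation
    iqr_sep    = IQR(Sepal.Length),             # robust variation
    mad_sep    = mad(Sepal.Length),             # robust variation
    min_sep    = min(Sepal.Length),             # rank
    q25_sep    = quantile(Sepal.Length, 0.25),  # 25th percentile
    max_sep    = max(Sepal.Length)              # rank
  )
res
```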

Measure of position:

  • first(x): equivalent to x[1]
  • nth(x, 2): equivalent to n<-2; x[n]
  • last(x): equivalent to x[length(x)]

Counts:

  • n(): the number of rows in the current group (takes no arguments)
  • sum(!is.na(x)): count non-missing values
  • n_distinct(x): count the number of unique values

Counts and proportions of logical values:

  • sum(x > 10): count the number of elements where x > 10
  • mean(y == 0): proportion of elements where y = 0
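The position, count and proportion helpers can be used the same way. In the sketch below, the threshold of 6 and the output column names are arbitrary choices made for illustration:

```r
library(dplyr)

my_data <- as_tibble(iris)

res <- my_data %>%
  group_by(Species) %>%
  summarise(
    first_sep  = first(Sepal.Length),        # position: first value in the group
    second_sep = nth(Sepal.Length, 2),       # position: second value
    last_sep   = last(Sepal.Length),         # position: last value
    count      = n(),                        # rows in the group
    n_non_na   = sum(!is.na(Sepal.Length)),  # non-missing values
    n_unique   = n_distinct(Sepal.Length),   # distinct values
    n_long     = sum(Sepal.Length > 6),      # count of values above 6
    prop_long  = mean(Sepal.Length > 6)      # proportion of values above 6
  )
res
```

Summing a logical vector counts its TRUE values and averaging it gives their proportion, which is why sum() and mean() work directly on conditions such as Sepal.Length > 6.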

Summary

In this tutorial, we described how to easily compute statistical summaries using the R functions summarise() and group_by() [in the dplyr package].






Teacher


Alboukadel Kassambara
Role : Founder of Datanovia
  • Website : https://www.datanovia.com/en
  • Experience : >10 years
  • Specialist in : Bioinformatics and Cancer Biology
