Book review: Beyond Spreadsheets with R

Disclaimer: Manning publications gave me the ebook version of Beyond Spreadsheets with R - A beginner’s guide to R and RStudio by Dr. Jonathan Carroll free of charge.

Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques like loops and conditionals to create your own custom functions. You’ll come away with a toolkit of strategies for analyzing and visualizing data of all sorts using R and RStudio.

Aimed at

This book covers all the very basics of getting started with R, so it is useful to beginners but probably won’t offer much for users who are already familiar with R. However, it does hint at some advanced functions, like defining specific function calls for custom classes.

What stood out for me in the book (both positive and negative)

It introduces version control right away in the first chapter. Coming from an academic background where tools like GitHub and virtual machines were not widely used (at least not in the biology department), I learned about these invaluable tools way too late. That is why I always encourage students to start using them as early as possible. They make your analyses more reproducible, robust and comprehensible, and they make it easier to collaborate and to keep track of past and present projects!
Great recap of the very basics, like data types. Even after working with R for a while, I still find it useful to go over the basics now and then - often I find some minor detail that I hadn’t known before or had forgotten again (e.g. that you shouldn’t use T and F instead of TRUE and FALSE, because T and F are ordinary variable names that can be reassigned, in the worst case to the opposite logical value).
The book also includes a short part on debugging and testing, which I think is valuable knowledge that most introductions to R skip. Of course, you won’t get in-depth coverage of these two topics, but at least you will have heard about them early on.
It introduced the rex package, which I hadn’t known before. Looks very convenient for working with regular expressions!
The book introduces tibble and dplyr and also goes into Non-Standard Evaluation. It shows a couple of ways to work around that using tidy evaluation.

This is what is shown:

library(dplyr)
library(rlang)

head(mutate(mtcars, ratio = mpg / cyl))

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb    ratio
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.500000
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.500000
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 5.700000
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.566667
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.337500
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.016667

column_to_divide <- "cyl"
head(mutate(mtcars, ratio = mpg / !!sym(column_to_divide)))

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb    ratio
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.500000
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.500000
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 5.700000
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.566667
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.337500
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.016667

This snippet about non-standard evaluation in the book is very short. Users who have written functions with dplyr before might wonder why the book doesn’t mention the standard evaluation versions of the dplyr verbs (like mutate_). This is because those verbs are now deprecated. Instead, if you want to write functions with dplyr, use tidy evaluation.

dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with dplyr. However, dplyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.

This is how it would work:

You write your function with arguments as you would normally (the input to column_to_divide is passed as a bare column name, without quotation marks).
You use enquo to “capture the expression supplied as argument by the user of the current function” (?enquo).
You feed this object to your dplyr function (here mutate) and use !! “to say that you want to unquote an input so that it’s evaluated”.

my_mutate_2 <- function(df, column_to_divide) {
  
  quo_column_to_divide <- enquo(column_to_divide)
  
  df %>%
    mutate(ratio = mpg / !!quo_column_to_divide)
}

my_mutate_2(mtcars, cyl) %>%
  select(mpg, cyl, ratio) %>%
  head()

##    mpg cyl    ratio
## 1 21.0   6 3.500000
## 2 21.0   6 3.500000
## 3 22.8   4 5.700000
## 4 21.4   6 3.566667
## 5 18.7   8 2.337500
## 6 18.1   6 3.016667
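
For comparison, here is a minimal sketch of the same idea with the column name passed as a string, reusing the sym() approach from the first example. The function name my_mutate_chr is made up for illustration and is not from the book.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of my_mutate_2(): the column is given as a string.
my_mutate_chr <- function(df, column_to_divide) {
  # sym() turns the string into a symbol; !! unquotes it inside mutate()
  df %>%
    mutate(ratio = mpg / !!sym(column_to_divide))
}

res_chr <- my_mutate_chr(mtcars, "cyl")
head(select(res_chr, mpg, cyl, ratio))
```

The enquo() version reads more naturally in interactive use, while the string version can be handy when column names arrive as character data, for example from a configuration file.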

Weirdly (to me), the book introduces dplyr and pipe operators, but when Jonathan talks about replacing NAs, he only mentions the base R version:

d_NA <- data.frame(x = c(7, NA, 9, NA), y = 1:4)
d_NA[is.na(d_NA)] <- 0
d_NA

##   x y
## 1 7 1
## 2 0 2
## 3 9 3
## 4 0 4

But he doesn’t mention the tidy way to do this:

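d_NA <- data.frame(x = c(7, NA, 9, NA), y = 1:4)  # recreate the NAs overwritten above
d_NA %>%
  tidyr::replace_na(list(x = 0))  # replacement values are given per column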

##   x y
## 1 7 1
## 2 0 2
## 3 9 3
## 4 0 4

The same goes for aggregate. This is the base R version that is shown:

aggregate(formula = . ~ Species, data = iris, FUN = mean)

##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

But the book doesn’t show the tidy way. I guess this is because at this point in the book the tidy way would be a bit too complicated, due to the “wrong” data format (wide vs. long). He does introduce the tidy principles of data analysis in the next chapters.

library(tidyverse)
iris %>%
  gather(x, y, Sepal.Length:Petal.Width) %>%
  group_by(Species, x) %>%
  summarise(mean = mean(y))

## # A tibble: 12 x 3
## # Groups:   Species [?]
##    Species    x             mean
##    <fct>      <chr>        <dbl>
##  1 setosa     Petal.Length 1.46 
##  2 setosa     Petal.Width  0.246
##  3 setosa     Sepal.Length 5.01 
##  4 setosa     Sepal.Width  3.43 
##  5 versicolor Petal.Length 4.26 
##  6 versicolor Petal.Width  1.33 
##  7 versicolor Sepal.Length 5.94 
##  8 versicolor Sepal.Width  2.77 
##  9 virginica  Petal.Length 5.55 
## 10 virginica  Petal.Width  2.03 
## 11 virginica  Sepal.Length 6.59 
## 12 virginica  Sepal.Width  2.97
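
If you want to stay in the wide format and get the same layout as aggregate(), one possible sketch (not from the book) uses dplyr’s scoped verb summarise_all(). It is available in the dplyr version listed in the session info below; newer dplyr releases mark the scoped verbs as superseded.

```r
library(dplyr)

# One row per species, one column per measurement, like aggregate()
iris_means <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

iris_means
```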

It covers web scraping and exporting data to .RData, .rds and how to use dput() when asking StackOverflow questions.
In the part about looping, the book mainly focuses on using the purrr package. You can find an example for using purrr in data analysis on my blog.
The book introduces package development, which is very useful, even if you just want to have a place where all your self-made functions “live”. :-) Moreover, it also goes into Unit Testing with testthat!
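
To give an idea of the export functions and dput() mentioned above, here is a minimal base R sketch. The data frame and the temporary file paths are made up for illustration and are not taken from the book.

```r
# Small example data frame (made up for illustration)
d <- data.frame(x = c(7, NA, 9), y = 1:3)

# .rds stores a single object; readRDS() lets you assign it to any name
rds_file <- tempfile(fileext = ".rds")
saveRDS(d, rds_file)
d_rds <- readRDS(rds_file)

# .RData stores objects together with their names; load() restores them
# under those names (here into a separate environment to avoid overwriting d)
rdata_file <- tempfile(fileext = ".RData")
save(d, file = rdata_file)
restored <- new.env()
load(rdata_file, envir = restored)

# dput() prints R code that recreates the object,
# which you can paste directly into a question
dput(d)
```

Pasting the dput() output into a question lets others rebuild exactly the same object, including column types and missing values.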

Verdict

Even though I am not really the target audience for a book like this, since I am not a beginner any more (I basically skimmed through a lot of the very basic stuff), it still offered a lot of valuable information! I particularly liked that the book introduces topics that matter for practical, modern work with R but don’t usually get mentioned in beginner’s books: the tidyverse, version control, debugging, how to ask proper questions on StackOverflow and basic package development.

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.3.0   stringr_1.3.1   purrr_0.2.5     readr_1.3.1    
##  [5] tidyr_0.8.2     tibble_2.0.1    ggplot2_3.1.0   tidyverse_1.2.1
##  [9] bindrcpp_0.2.2  rlang_0.3.1     dplyr_0.7.8    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.4         haven_2.0.0      lattice_0.20-38 
##  [5] colorspace_1.4-0 generics_0.0.2   htmltools_0.3.6  yaml_2.2.0      
##  [9] utf8_1.1.4       pillar_1.3.1     glue_1.3.0       withr_2.1.2     
## [13] modelr_0.1.2     readxl_1.2.0     bindr_0.1.1      plyr_1.8.4      
## [17] munsell_0.5.0    blogdown_0.10    gtable_0.2.0     cellranger_1.1.0
## [21] rvest_0.3.2      evaluate_0.12    knitr_1.21       fansi_0.4.0     
## [25] broom_0.5.1      Rcpp_1.0.0       scales_1.0.0     backports_1.1.3 
## [29] jsonlite_1.6     hms_0.4.2        digest_0.6.18    stringi_1.2.4   
## [33] bookdown_0.9     grid_3.5.2       cli_1.0.1        tools_3.5.2     
## [37] magrittr_1.5     lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2 
## [41] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.11  
## [45] httr_1.4.0       rstudioapi_0.9.0 R6_2.3.0         nlme_3.1-137    
## [49] compiler_3.5.2

Dr. Shirin Elsinghorst