Book review: Beyond Spreadsheets with R
Disclaimer: Manning publications gave me the ebook version of Beyond Spreadsheets with R - A beginner’s guide to R and RStudio by Dr. Jonathan Carroll free of charge.
Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques like loops and conditionals to create your own custom functions. You’ll come away with a toolkit of strategies for analyzing and visualizing data of all sorts using R and RStudio.
Aimed at
This book covers all the very basics of getting started with R, so it is useful to beginners but probably won’t offer much for users who are already familiar with R. However, it does hint at some advanced functions, like defining specific function calls for custom classes.
What stood out for me in the book (both positive and negative)
It introduced version control right away in the first chapter. Coming from an academic background where such things as Github and virtual machines were not widely used (at least not in the biology department), I learned about these invaluable tools way too late. Thus, I would always encourage students to start using them as early as possible. It will definitely make your analyses more reproducible, failproof and comprehensible. And it will make it easier to collaborate and keep track of different projects from past and present!
Great recap of the very basics, like data types. Even though you might have been working with R for a while, I still find it useful to go over some basics once in a while - often I find some minor detail that I hadn’t known before or had forgotten again (e.g. that you shouldn’t use
T
andF
instead ofTRUE
andFALSE
becauseT
andF
can be variable names and in the worst case assigned to the opposite logical).The book also includes a short part to debugging and testing, which I think is valuable knowledge that most introductions to R skip. Of course, you won’t get an in depth covering of these two topics but at least you will have heard about them early on.
It introduced the
rex
package, which I hadn’t known before. Looks very convenient for working with regular expressions!The book introduces
tibble
anddplyr
and also goes into Non-Standard Evaluation. It shows a couple of ways to work around that using tidy evaluation.
This is what is shown:
library(dplyr)
library(rlang)
head(mutate(mtcars, ratio = mpg / cyl))
## mpg cyl disp hp drat wt qsec vs am gear carb ratio
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.500000
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.500000
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5.700000
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.566667
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.337500
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.016667
column_to_divide <- "cyl"
head(mutate(mtcars, ratio = mpg / !!sym(column_to_divide)))
## mpg cyl disp hp drat wt qsec vs am gear carb ratio
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.500000
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.500000
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5.700000
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.566667
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.337500
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.016667
This snippet about Non-Standard-Evaluation in the book is very short. Users who worked with dplyr
and functions before, might wonder why the book doesn’t mention the standard evaluation form of the dplyr
verbs (like mutate_
). This is because those verbs are deprecated now. Instead, if you want to write functions with dplyr
, use tidy evaluation.
dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with dplyr. However, dplyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.
This is how it would work:
- You write your function with arguments as you would normally (the input to
column_to_divide
will be unquoted). - You use
enquo
to “capture the expression supplied as argument by the user of the current function” (?enquo
). - You feed this object to your
dplyr
function (heremutate
) and use!!
“to say that you want to unquote an input so that it’s evaluated”.
my_mutate_2 <- function(df, column_to_divide) {
quo_column_to_divide <- enquo(column_to_divide)
df %>%
mutate(ratio = mpg / !!quo_column_to_divide)
}
my_mutate_2(mtcars, cyl) %>%
select(mpg, cyl, ratio) %>%
head()
## mpg cyl ratio
## 1 21.0 6 3.500000
## 2 21.0 6 3.500000
## 3 22.8 4 5.700000
## 4 21.4 6 3.566667
## 5 18.7 8 2.337500
## 6 18.1 6 3.016667
- Weirdly (to me) the book introduces
dplyr
and pipe operators but when Jonathan talked about replacing NAs, he only mentions the base R version:
d_NA <- data.frame(x = c(7, NA, 9, NA), y = 1:4)
d_NA[is.na(d_NA)] <- 0
d_NA
## x y
## 1 7 1
## 2 0 2
## 3 9 3
## 4 0 4
But he doesn’t mention the tidy way to do this:
d_NA %>%
tidyr::replace_na()
## x y
## 1 7 1
## 2 0 2
## 3 9 3
## 4 0 4
- Similar with
aggregate
. This is the base R version that is shown:
aggregate(formula = . ~ Species, data = iris, FUN = mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
But doesn’t show the tidy way. I guess this is because at this point in book the tidy way would be a bit too complicated, due to the “wrong” data format (wide vs. long). He does introduce the tidy principles of data analysis in the next chapters.
library(tidyverse)
iris %>%
gather(x, y, Sepal.Length:Petal.Width) %>%
group_by(Species, x) %>%
summarise(mean = mean(y))
## # A tibble: 12 x 3
## # Groups: Species [?]
## Species x mean
## <fct> <chr> <dbl>
## 1 setosa Petal.Length 1.46
## 2 setosa Petal.Width 0.246
## 3 setosa Sepal.Length 5.01
## 4 setosa Sepal.Width 3.43
## 5 versicolor Petal.Length 4.26
## 6 versicolor Petal.Width 1.33
## 7 versicolor Sepal.Length 5.94
## 8 versicolor Sepal.Width 2.77
## 9 virginica Petal.Length 5.55
## 10 virginica Petal.Width 2.03
## 11 virginica Sepal.Length 6.59
## 12 virginica Sepal.Width 2.97
It covers web scraping and exporting data to .RData, .rds and how to use
dput()
when asking StackOverflow questions.In the part about looping, the book mainly focuses on using the
purrr
package. You can find an example for usingpurrr
in data analysis on my blog.The book introduces package development, which is very useful, even if you just want to have a place where all your self-made functions “live”. :-) Moreover, it also goes into Unit Testing with
testthat
!
Verdict
Even though I am not really the target audience for a book like that since I am not a beginner any more (I basically skimmed through a lot of the very basic stuff), it still offered a lot of valuable information! I particularly liked that the book introduces very important parts of learning R for practical and modern applications, such as using the tidyverse
, version control, debugging, how to aks proper questions on StackOverflow and basic package development, that don’t usually get mentioned in beginner’s books.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.3.0 stringr_1.3.1 purrr_0.2.5 readr_1.3.1
## [5] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0 tidyverse_1.2.1
## [9] bindrcpp_0.2.2 rlang_0.3.1 dplyr_0.7.8
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.4 haven_2.0.0 lattice_0.20-38
## [5] colorspace_1.4-0 generics_0.0.2 htmltools_0.3.6 yaml_2.2.0
## [9] utf8_1.1.4 pillar_1.3.1 glue_1.3.0 withr_2.1.2
## [13] modelr_0.1.2 readxl_1.2.0 bindr_0.1.1 plyr_1.8.4
## [17] munsell_0.5.0 blogdown_0.10 gtable_0.2.0 cellranger_1.1.0
## [21] rvest_0.3.2 evaluate_0.12 knitr_1.21 fansi_0.4.0
## [25] broom_0.5.1 Rcpp_1.0.0 scales_1.0.0 backports_1.1.3
## [29] jsonlite_1.6 hms_0.4.2 digest_0.6.18 stringi_1.2.4
## [33] bookdown_0.9 grid_3.5.2 cli_1.0.1 tools_3.5.2
## [37] magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2
## [41] xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.0 rmarkdown_1.11
## [45] httr_1.4.0 rstudioapi_0.9.0 R6_2.3.0 nlme_3.1-137
## [49] compiler_3.5.2