Why I use R for Data Science - An Ode to R
Working in Data Science, I often feel like I have to justify using R over Python. And while I do use Python for running scripts in production, I am much more comfortable with the R environment. Basically, whenever I can, I use R for prototyping, testing, visualizing and teaching. But because a personal gut feeling isn’t a very good reason to give to (scientifically minded) people, I’ve thought a lot about the pros and cons of using R. Here is what I came up with to explain why I still prefer R…
Disclaimer: I have “grown up” with R and I’m much more familiar with it, so I admit that I am quite biased in my assessment. If you think I’m not doing other languages justice, I’ll be happy to hear your pros and cons!
First off, R is an open-source, cross-platform language, so it’s free to use for anybody and everybody. This in itself doesn’t make it special, though, because the same is true of other languages, like Python.
It is an established language, so there are lots and lots of packages for basically every type of analysis you can think of. You find packages for data analysis, machine learning, visualization, data wrangling, spatial analysis, bioinformatics and much more. But, same as with Python, this plethora of packages can sometimes make things a bit confusing: you often need to test and compare several similar packages in order to find the best one.
Most of the packages are of very high quality. And when a package is on CRAN or Bioconductor (as most are), you can be fairly confident that it has been checked, that you will get proper documentation and that you won’t have problems with installation, dependencies, etc. In my experience, the documentation of R packages and functions generally tends to be better than that of, say, Python packages.
R’s graphics capabilities are superior to any other I know. Especially ggplot2 with all its extensions provides a structured, yet powerful set of tools for producing high-quality, publication-ready graphs and figures. Moreover, ggplot2 is part of the tidyverse and works well with broom. Together, these tools have made data wrangling and analysis much more convenient and structured for me.
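To give a flavour of how ggplot2 and broom fit together, here is a minimal sketch. The dataset (mtcars, which ships with R), the model and the variable names are my own illustrative choices, not part of any particular workflow, and the sketch assumes that ggplot2 and broom are installed.

```r
# Sketch only: mtcars and the mpg ~ wt model are illustrative assumptions.
library(ggplot2)
library(broom)

# Fit a simple linear model on a dataset that ships with R.
fit <- lm(mpg ~ wt, data = mtcars)

# tidy() turns the model coefficients into a data frame, one row per term.
coefs <- tidy(fit)

# augment() adds fitted values and residuals to the original observations.
fitted_df <- augment(fit)

# Plot the observations and overlay the fitted line.
p <- ggplot(fitted_df, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line(aes(y = .fitted), colour = "steelblue") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(coefs)
print(p)
```

Because the model output arrives as ordinary data frames, it can go straight into the same plotting and wrangling code as the raw data.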
The suite of tools around RStudio makes it perfect for documenting data analysis workflows and for teaching. You can provide easy instructions for installation and R Markdown files for your students to follow along. Everybody is going to use the same system. In Python, you are always dealing with questions like version 2 vs version 3, Spyder vs Jupyter Notebook, pip vs conda, etc. Everything around RStudio is very well maintained and comes with extensive documentation and detailed tutorials. You find add-ins for version control, Shiny apps, and writing books or other documents (bookdown), and you can write presentations directly in R Markdown, including code + output and everything, as LaTeX beamer presentations, ioslides or reveal.js. You can also create dashboards, include interactive HTML widgets and you can even build your blog (as this one is) with blogdown conveniently from within RStudio!
If you are looking for advanced functionality, it is very likely that somebody has already written a package for it. There are packages that allow you to access Spark, H2O, Elasticsearch, TensorFlow, Keras, Tesseract, and so many more with very little hassle. And you can even run bash and Python from within R!
There is a big - and very active - community! This is one of the things I most enjoy about working with R. You can find many high-quality manuals, resources and tutorials for all kinds of topics. Most of them are provided free of charge by people who often dedicate their spare time to helping others. The same goes for asking questions on Stack Overflow, opening issues on GitHub or posting to Google Groups: usually you will get several answers within a short period of time (in my experience, minutes to hours). What other community is so supportive and so helpful!? But for most things, you wouldn’t even need to ask for help, because many of the packages come with absolutely amazing vignettes that describe the functions and workflows in a detailed, yet easy to understand way. If that’s not enough, you will very likely find additional tutorials on R-bloggers, a site maintained by Tal Galili that aggregates hundreds of R blogs. There are several R conferences, like useR! and the rOpenSci Unconference, and many R user groups all around the globe.
I can’t stress enough how much I appreciate all the people who are involved in the R community: people who write packages, tutorials and blogs, who share information, who provide support and who think about how to make data analysis easier, more convenient and - dare I say - fun!
The main drawback I experience with R is that scripts tend to be harder to deploy than Python scripts (R-server might be a solution, but I don’t know enough about it to really judge). Dealing with memory, space and security issues is also often difficult in R. But there has already been a vast improvement over the last months/years, so I’m sure we will see further development there in the future…