rOpenSci unconference 2018 + introduction to TensorFlow Probability & the 'greta' package

On May 21st and 22nd, I had the honor of attending the rOpenSci unconference 2018 in Seattle. It was a great event and I got to meet many amazing people!

rOpenSci

rOpenSci is a non-profit organisation that maintains a number of widely used R packages and actively promotes a community spirit in the R world. Its core values are open and reproducible research, shared data and easy-to-use tools, and making all of this accessible to a large number of people.

rOpenSci unconference

Part of creating a welcoming community infrastructure is their yearly unconference. At the unconference, about 60 invited R users from around the world get together to work on small projects that are relevant to the R community at the time. Project ideas are collected and discussed in GitHub issues during the weeks before the unconference, but the final decision on which projects to work on is made by the participants on the first morning of the unconference.

This year’s rOpenSci unconference was held at the Microsoft Reactor in Seattle.

The whole organizing team - first and foremost Stefanie Butland - did a wonderful job hosting this event. Everybody made sure that the spirit of the unconference was inclusive and welcoming to everybody, from long-established fixtures in the R world to newbies and anyone in between.

We were a pretty diverse group of social scientists, bioinformaticians, ecologists, historians, data scientists, developers, people working with Google, Microsoft or RStudio and R enthusiasts from many other fields. Some people already knew a few others, and many knew each other from Twitter, R-Ladies or other online communities, but most of us (including me) had never met in person.

Therefore, the official part of the unconference started on Monday morning with a few “ice breakers”: Stefanie would ask a question or make a statement and we would position ourselves in the room according to our answer and discuss with the people close to us. Starting with “Are you a dog or a cat person?” and finishing with “I know my place in the R community”, we all quickly had a lot to talk about! It was a great way to meet many of the people we would spend the next two days with.

It was a great experience working with so many talented and motivated people who share my passion for the R language, particularly because in my line of work as a data scientist R is often considered inferior to Python, and because the majority of the active R community is situated in the Pacific Northwest and California. It was a whole new experience to work together with other people on an R project and I absolutely loved it!

Working on `greta`

During the two days of the unconference, people worked on many interesting, useful and cool projects (the rOpenSci tweet quoted at the end of this post leads to a complete list with links to the GitHub repos for every project)!

The group I joined originally wanted to bring TensorFlow Probability to R.

TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFlow ecosystem, TensorFlow Probability provides integration of probabilistic methods with deep networks, gradient-based inference via automatic differentiation, and scalability to large datasets and models via hardware acceleration (e.g., GPUs) and distributed computation. https://github.com/tensorflow/probability

In the end, we - that is Michael Quinn, David Smith, Tiphaine Martin, Matt Mulvahill and I - ended up working with the R package greta, which offers functionality similar to TensorFlow Probability. We recreated some of the examples from the TensorFlow Probability package tutorials in greta and added a few more examples that show how you can use greta.

Check out the README of our repo for an overview and links to everything we’ve contributed. The repo is a fork of the original greta package repo, and the vignettes will hopefully be included in the main repo in the near future.

What is `greta`?

greta is an R package created by Nick Golding for fitting statistical models by Markov chain Monte Carlo (MCMC) sampling, e.g. with Hamiltonian Monte Carlo (HMC). It offers a number of functions that make it easy to define these models, particularly for Bayesian statistics (similar to Stan).

greta lets us build statistical models interactively in R, and then sample from them by MCMC. https://greta-dev.github.io/greta/get_started.html#how_greta_works

Google’s TensorFlow is used as a backend to compute the defined models. Because TensorFlow has been optimized for large-scale computing and supports multi-core and GPU calculations, greta is particularly efficient and useful for working with complex models. As TensorFlow is not natively an R package, greta uses RStudio’s reticulate and tensorflow packages to connect to the TensorFlow backend. This way, we can work with all the TensorFlow functions directly from within R.

How does `greta` work?

There are three layers to how greta defines a model: users manipulate greta arrays, these define nodes, and nodes then define Tensors. https://greta-dev.github.io/greta/technical_details.html

This is a minimal working example of the linear mixed model that we developed in greta, based on an example from a TensorFlow Probability Jupyter notebook. The full example with explanations can be found in our forked repo.

library(greta)

# data: one label, observed effect and standard deviation per school
N <- letters[1:8]
treatment_effects <- c(28.39, 7.94, -2.75, 6.82, -0.64, 0.63, 18.01, 12.16)
treatment_stddevs <- c(14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6)

# variables and priors
avg_effect <- normal(mean = 0, sd = 10)
# avg_stddev is on the log scale; it is exponentiated below
avg_stddev <- normal(5, 1)
school_effects_standard <- normal(0, 1, dim = length(N))
# shift and scale the standard-normal effects
school_effects <- avg_effect + exp(avg_stddev) * school_effects_standard

# likelihood
distribution(treatment_effects) <- normal(school_effects, treatment_stddevs)

# defining the hierarchical model
m <- model(avg_effect, avg_stddev, school_effects_standard)
m

## greta model

plot(m)

# sampling
draws <- greta::mcmc(m)
plot(draws)
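
To get a feeling for what the priors above imply before sampling with greta, they can be forward-simulated in base R. This is only a rough sketch and not part of greta or of our vignettes; the number of simulations (n_sims) and the seed are arbitrary assumptions, while the variable names mirror the model above.

```r
# Prior simulation in base R (no greta or TensorFlow needed).
set.seed(42)        # arbitrary seed, assumed here for reproducibility
n_schools <- 8
n_sims <- 1000      # hypothetical number of prior draws

avg_effect <- rnorm(n_sims, mean = 0, sd = 10)
avg_stddev <- rnorm(n_sims, mean = 5, sd = 1)   # on the log scale

# one row per simulation, one column per school
school_effects_standard <- matrix(rnorm(n_sims * n_schools), nrow = n_sims)

# shift and scale the standard-normal draws; the length-n_sims vectors
# are recycled down the columns, so row i uses avg_effect[i] and avg_stddev[i]
school_effects <- avg_effect + exp(avg_stddev) * school_effects_standard

dim(school_effects)
summary(as.vector(school_effects))
```

Because avg_stddev is exponentiated, the simulated school effects spread over a very wide range, which is worth keeping in mind when looking at the posterior draws.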

The main type of object you’ll be using in greta is the greta array. You can create greta arrays or convert R objects, like data frames, into greta arrays. A greta array is basically a list with one element: an R6 class object representing a node, which is either data, an operation or a variable. This way, greta makes use of a graph-based organisation of the model. Every node in the model graph comes from a greta array and connects variables, data and operations into a directed acyclic graph (DAG), which defines the model when the model() function is called.
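
The idea that arithmetic on such objects records graph nodes instead of computing numbers can be illustrated with a toy example in base R. To be clear, this is not how greta is implemented internally: new_node(), node_labels() and the toy_node class are hypothetical names made up for this sketch.

```r
# Toy sketch of graph-building arithmetic (NOT greta's actual internals).
new_node <- function(type, label, parents = list()) {
  structure(list(type = type, label = label, parents = parents),
            class = "toy_node")
}

# Arithmetic on toy nodes returns a new "operation" node instead of a number.
Ops.toy_node <- function(e1, e2) {
  # plain R values become "data" nodes
  wrap <- function(x) {
    if (inherits(x, "toy_node")) x else new_node("data", deparse(x))
  }
  new_node("operation", .Generic, parents = list(wrap(e1), wrap(e2)))
}

mu <- new_node("variable", "mu")
x <- new_node("data", "x")
y <- mu + 2 * x   # nothing is computed, only the graph is recorded

# Walk from a node back to its inputs (depth-first).
node_labels <- function(node) {
  c(node$label, unlist(lapply(node$parents, node_labels)))
}

node_labels(y)
# "+" "mu" "*" "2" "x"
```

The resulting structure is a small DAG with y at the top, which is roughly the kind of graph that plot(m) draws for a real greta model.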

TensorFlow Probability

While greta makes it super easy to build models similar to those in TensorFlow Probability, I also tried migrating the example code directly into R using the reticulate package. It’s still a work in progress, but for everyone who might want to try as well (and achieve what I couldn’t up until now), here is how I started out.

TensorFlow Probability isn’t part of the core TensorFlow package, so we won’t have it loaded with library(tensorflow). But we can use the reticulate package instead to import any Python module (aka library) into R and use it there. This way, we could use the original functions from the tensorflow_probability Python package in R.

We could, for example, work with the Edward2 functionality from TensorFlow Probability.

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilistic models, ranging from classical hierarchical models on small data sets to complex deep probabilistic models on large data sets. Edward fuses three fields: Bayesian statistics and machine learning, deep learning, and probabilistic programming. […] Edward is built on TensorFlow. It enables features such as computational graphs, distributed training, CPU/GPU integration, automatic differentiation, and visualization with TensorBoard. http://edwardlib.org/

library(reticulate)
tf <- import("tensorflow")
tfp <- import("tensorflow_probability")
ed <- tfp$edward2

Note on installing a working version of TensorFlow Probability for R

TensorFlow Probability isn’t part of the core TensorFlow package, so it has to be installed separately, and for us that meant the bleeding-edge nightly version. However, we had a few problems installing a working version of TensorFlow Probability that had all the necessary submodules we wanted to use (like edward2). So, this is the version that worked in the end (as of today):

TensorFlow Probability version 0.0.1.dev20180515
TensorFlow version 1.9.0.dev20180515

For full disclosure: I worked from within the R virtualenv r-tensorflow that was created when I ran install_tensorflow() from within R. In this environment I installed:

pip install tfp-nightly==0.0.1.dev20180515
pip install tf-nightly==1.9.0.dev20180515

I used Python 3.6 on macOS High Sierra (version 10.13.4).
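
Since missing submodules were the main stumbling block, it can help to check from R what reticulate actually finds before running any model code. The following is a minimal sketch that assumes reticulate is looking at the same Python environment the packages were installed into; it only reports what is available and does not install anything.

```r
# Check whether the Python module can be imported before relying on it.
has_tfp <- requireNamespace("reticulate", quietly = TRUE) &&
  reticulate::py_module_available("tensorflow_probability")

if (has_tfp) {
  tfp <- reticulate::import("tensorflow_probability")
  message("TensorFlow Probability version: ", tfp$`__version__`)
  # edward2 was one of the submodules we had trouble getting
  message("edward2 available: ", reticulate::py_has_attr(tfp, "edward2"))
} else {
  message("tensorflow_probability not found in the active Python environment")
}
```

If the module is reported as missing although it was installed, calling reticulate::use_virtualenv("r-tensorflow") at the start of a fresh R session should point reticulate at the environment mentioned above.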

Thanks

Huge thanks go out to my amazing greta team and to rOpenSci - particularly Stefanie Butland - for organizing such a wonderful event!

Thank you also to all sponsors, who made it possible for me to fly all the way over to the Pacific Northwest and attend the unconf!

A sincere thank you to all participants in #runconf18

This thread👇includes links to all project repos: https://t.co/2PhAz4zSuK #rstats pic.twitter.com/8SICcWkQ0v
— rOpenSci (@rOpenSci) May 25, 2018

Session Information

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 greta_0.2.4   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       lattice_0.20-35    tidyr_0.8.1       
##  [4] visNetwork_2.0.3   prettyunits_1.0.2  assertthat_0.2.0  
##  [7] rprojroot_1.3-2    digest_0.6.15      R6_2.2.2          
## [10] plyr_1.8.4         backports_1.1.2    evaluate_0.10.1   
## [13] coda_0.19-1        ggplot2_2.2.1      blogdown_0.6      
## [16] pillar_1.2.3       tfruns_1.3         rlang_0.2.1       
## [19] progress_1.1.2     lazyeval_0.2.1     rstudioapi_0.7    
## [22] whisker_0.3-2      Matrix_1.2-14      reticulate_1.7    
## [25] rmarkdown_1.9      DiagrammeR_1.0.0   downloader_0.4    
## [28] readr_1.1.1        stringr_1.3.1      htmlwidgets_1.2   
## [31] igraph_1.2.1       munsell_0.4.3      compiler_3.5.0    
## [34] influenceR_0.1.0   rgexf_0.15.3       xfun_0.1          
## [37] pkgconfig_2.0.1    base64enc_0.1-3    tensorflow_1.5    
## [40] htmltools_0.3.6    tidyselect_0.2.4   tibble_1.4.2      
## [43] gridExtra_2.3      bookdown_0.7       XML_3.98-1.11     
## [46] viridisLite_0.3.0  dplyr_0.7.5        grid_3.5.0        
## [49] jsonlite_1.5       gtable_0.2.0       magrittr_1.5      
## [52] scales_0.5.0       stringi_1.2.2      viridis_0.5.1     
## [55] brew_1.0-6         RColorBrewer_1.1-2 tools_3.5.0       
## [58] glue_1.2.0         purrr_0.2.5        hms_0.4.2         
## [61] Rook_1.1-1         parallel_3.5.0     yaml_2.1.19       
## [64] colorspace_1.3-2   knitr_1.20         bindr_0.1.1