Slides from Münster Data Science Meetup

These are my slides from the Münster Data Science Meetup on December 12th, 2017.


My sketchnotes were collected from these two podcasts:

  • TWiML Talk #7 with Carlos Guestrin: Explaining the Predictions of Machine Learning Models
  • Data Skeptic Podcast: Trusting Machine Learning Models with Lime

Example Code

  • the following libraries were loaded:
library(tidyverse)  # for tidy data analysis
library(farff)      # for reading arff file
library(missForest) # for imputing missing values
library(dummies)    # for creating dummy variables
library(caret)      # for modeling
library(lime)       # for explaining predictions


The Chronic Kidney Disease dataset was downloaded from UC Irvine’s Machine Learning repository:

data_file <- file.path("path/to/chronic_kidney_disease_full.arff")
  • load data with the farff package
data <- readARFF(data_file)


The dataset contains the following variables:

  • age - age
  • bp - blood pressure
  • sg - specific gravity
  • al - albumin
  • su - sugar
  • rbc - red blood cells
  • pc - pus cell
  • pcc - pus cell clumps
  • ba - bacteria
  • bgr - blood glucose random
  • bu - blood urea
  • sc - serum creatinine
  • sod - sodium
  • pot - potassium
  • hemo - hemoglobin
  • pcv - packed cell volume
  • wc - white blood cell count
  • rc - red blood cell count
  • htn - hypertension
  • dm - diabetes mellitus
  • cad - coronary artery disease
  • appet - appetite
  • pe - pedal edema
  • ane - anemia
  • class - class (ckd or notckd)

Missing data

  • impute missing data with Nonparametric Missing Value Imputation using Random Forest (missForest package)
data_imp <- missForest(data)
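Before imputing, it can be useful to check how much is actually missing. The following is a minimal base R sketch on a small made-up data frame (toy_data and its values are hypothetical, not taken from the kidney dataset); the same two calls could be applied to data before imputation and to data_imp$ximp afterwards.

```r
# Hypothetical toy data with a few missing values (not the real dataset)
toy_data <- data.frame(
  age = c(48, NA, 62, 51),
  bp  = c(80, 70, NA, NA),
  htn = factor(c("yes", "no", NA, "no"))
)

# Number of missing values per column
missing_per_column <- colSums(is.na(toy_data))
print(missing_per_column)

# Share of rows without any missing value
complete_share <- mean(complete.cases(toy_data))
print(complete_share)
```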

One-hot encoding

  • create dummy variables with dummy.data.frame() from the dummies package
  • scale and center
data_imp_final <- data_imp$ximp
data_dummy <- dummy.data.frame(dplyr::select(data_imp_final, -class), sep = "_")
data <- cbind(dplyr::select(data_imp_final, class),
              scale(data_dummy,
                    center = apply(data_dummy, 2, min),
                    scale = apply(data_dummy, 2, max)))
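To make the two preprocessing steps concrete, here is a small base R sketch on made-up data (toy and its values are hypothetical). It builds the indicator columns by hand rather than with the dummies package, then applies the same scale() call as above.

```r
# Hypothetical toy data (not the real dataset)
toy <- data.frame(
  bp  = c(60, 80, 100),
  htn = factor(c("no", "yes", "no"))
)

# One indicator column per factor level, named <variable>_<level>
htn_dummies <- sapply(levels(toy$htn), function(lvl) as.integer(toy$htn == lvl))
colnames(htn_dummies) <- paste("htn", levels(toy$htn), sep = "_")
toy_dummy <- cbind(bp = toy$bp, htn_dummies)

# Same scaling as above: subtract the column minimum, divide by the column maximum
toy_scaled <- scale(toy_dummy,
                    center = apply(toy_dummy, 2, min),
                    scale  = apply(toy_dummy, 2, max))
print(toy_scaled)
```

The 0/1 indicator columns pass through unchanged. Because the divisor is the column maximum rather than the range, bp ends at 0.4 here instead of 1; dividing by max - min would map every column onto [0, 1].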


# training and test set
index <- createDataPartition(data$class, p = 0.9, list = FALSE)
train_data <- data[index, ]
test_data  <- data[-index, ]

# modeling
model_rf <- caret::train(class ~ .,
  data = train_data,
  method = "rf", # random forest
  trControl = trainControl(method = "repeatedcv", 
       number = 10, 
       repeats = 5, 
       verboseIter = FALSE))
## Random Forest 
## 360 samples
##  48 predictor
##   2 classes: 'ckd', 'notckd' 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 324, 324, 324, 324, 325, 324, ... 
## Resampling results across tuning parameters:
##   mtry  Accuracy   Kappa    
##    2    0.9922647  0.9838466
##   25    0.9917392  0.9826070
##   48    0.9872930  0.9729881
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
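The resampling summary above follows from the trainControl() settings: 10 folds repeated 5 times gives 50 resamples, each trained on roughly nine tenths of the 360 training samples. A rough base R sketch of that bookkeeping, assuming plain random folds (caret builds class-stratified folds, which is presumably why the printed sizes vary between 324 and 325):

```r
set.seed(42)

n       <- 360  # training samples, as reported in the model summary
k       <- 10   # folds per repeat
repeats <- 5    # number of repeats

# For every repeat, assign each sample to one of k folds at random and
# store the indices of the samples used for training in each resample
resamples <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    resamples[[paste0("Fold", f, ".Rep", r)]] <- which(fold_id != f)
  }
}

n_resamples <- length(resamples)           # number of model fits per mtry value
train_sizes <- unique(lengths(resamples))  # training set size per resample
print(n_resamples)
print(train_sizes)
```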
# predictions
pred <- data.frame(sample_id = 1:nrow(test_data),
                   predict(model_rf, test_data, type = "prob"),
                   actual = test_data$class) %>%
  mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)],
         correct = ifelse(actual == prediction, "correct", "wrong"))

confusionMatrix(pred$actual, pred$prediction)
## Confusion Matrix and Statistics
##           Reference
## Prediction ckd notckd
##     ckd     23      2
##     notckd   0     15
##                Accuracy : 0.95            
##                  95% CI : (0.8308, 0.9939)
##     No Information Rate : 0.575           
##     P-Value [Acc > NIR] : 1.113e-07       
##                   Kappa : 0.8961          
##  Mcnemar's Test P-Value : 0.4795          
##             Sensitivity : 1.0000          
##             Specificity : 0.8824          
##          Pos Pred Value : 0.9200          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5750          
##          Detection Rate : 0.5750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.9412          
##        'Positive' Class : ckd             
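The statistics in this output can be recomputed by hand from the four cell counts. A short base R sketch, with the counts copied from the printed table and ckd as the positive class:

```r
# Counts copied from the printed confusion matrix (rows and columns as printed)
cm <- matrix(c(23, 0, 2, 15), nrow = 2,
             dimnames = list(Prediction = c("ckd", "notckd"),
                             Reference  = c("ckd", "notckd")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["ckd", "ckd"] / sum(cm[, "ckd"])           # share of Reference ckd found
specificity <- cm["notckd", "notckd"] / sum(cm[, "notckd"])  # share of Reference notckd found
ppv         <- cm["ckd", "ckd"] / sum(cm["ckd", ])           # Pos Pred Value
npv         <- cm["notckd", "notckd"] / sum(cm["notckd", ])  # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced_accuracy = balanced), 4))
```

caret's confusionMatrix() takes the predictions as its first argument and the reference as its second. The call above passes pred$actual first, so the rows labeled Prediction hold the actual classes.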


  • LIME needs data without response variable
train_x <- dplyr::select(train_data, -class)
test_x <- dplyr::select(test_data, -class)

train_y <- dplyr::select(train_data, class)
test_y <- dplyr::select(test_data, class)
  • build explainer
explainer <- lime(train_x, model_rf, n_bins = 5, quantile_bins = TRUE)
  • run explain() function
explanation_df <- lime::explain(test_x, explainer,
                                n_labels = 1,
                                n_features = 8,
                                n_permutations = 1000,
                                feature_select = "forward_selection")
  • model reliability
explanation_df %>%
  ggplot(aes(x = model_r2, fill = label)) +
    geom_density(alpha = 0.5)
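The model_r2 column plotted above describes how well each local surrogate model fits the black-box predictions around a sample. The core idea behind LIME can be sketched in base R, assuming a made-up black-box function, Gaussian perturbations, and an exponential distance kernel (all hypothetical choices for illustration; the lime package uses its own sampling, binning, and feature selection):

```r
set.seed(1)

# Hypothetical black-box model: a nonlinear function of two features
black_box <- function(x1, x2) plogis(3 * x1 - x2^2)

# The instance whose prediction we want to explain (made-up values)
x1_0 <- 0.5
x2_0 <- 1

# 1. Perturb the instance
n_perm <- 1000
perturbed <- data.frame(x1 = rnorm(n_perm, mean = x1_0, sd = 1),
                        x2 = rnorm(n_perm, mean = x2_0, sd = 1))

# 2. Ask the black box for predictions on the perturbed samples
perturbed$y <- black_box(perturbed$x1, perturbed$x2)

# 3. Weight each perturbed sample by its proximity to the instance
kernel_width <- 0.75  # assumed value for this sketch
dist <- sqrt((perturbed$x1 - x1_0)^2 + (perturbed$x2 - x2_0)^2)
w <- exp(-dist^2 / kernel_width^2)

# 4. Fit a weighted linear model as the local, interpretable surrogate
local_fit <- lm(y ~ x1 + x2, data = perturbed, weights = w)

local_coef <- coef(local_fit)             # local feature effects
local_r2   <- summary(local_fit)$r.squared  # analogous to model_r2
print(local_coef)
print(local_r2)
```

The signs of the coefficients indicate whether a feature pushes the prediction up or down near the instance, which is roughly what the explanation plots visualize as supporting or contradicting features.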

  • plot explanations
plot_features(explanation_df[1:24, ], ncol = 1)

