Lab 5A: ML in Practice (Easy)

1. Explore the Data

  1. Set working directory
setwd("C:/Workshop/Data")
  2. Read CSV file
raw <- read.csv("Titanic.csv")
  3. Peek at the data
head(raw)
##   pclass survived                                            name    sex
## 1      1        1                   Allen, Miss. Elisabeth Walton female
## 2      1        1                  Allison, Master. Hudson Trevor   male
## 3      1        0                    Allison, Miss. Helen Loraine female
## 4      1        0            Allison, Mr. Hudson Joshua Creighton   male
## 5      1        0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female
## 6      1        1                             Anderson, Mr. Harry   male
##       age sibsp parch ticket     fare   cabin embarked boat body
## 1 29.0000     0     0  24160 211.3375      B5        S    2   NA
## 2  0.9167     1     2 113781 151.5500 C22 C26        S   11   NA
## 3  2.0000     1     2 113781 151.5500 C22 C26        S        NA
## 4 30.0000     1     2 113781 151.5500 C22 C26        S       135
## 5 25.0000     1     2 113781 151.5500 C22 C26        S        NA
## 6 48.0000     0     0  19952  26.5500     E12        S    3   NA
##                         home.dest
## 1                    St Louis, MO
## 2 Montreal, PQ / Chesterville, ON
## 3 Montreal, PQ / Chesterville, ON
## 4 Montreal, PQ / Chesterville, ON
## 5 Montreal, PQ / Chesterville, ON
## 6                    New York, NY
  4. Summarize the data set
summary(raw)
##      pclass         survived                                   name     
##  Min.   :1.000   Min.   :0.000   Connolly, Miss. Kate            :   2  
##  1st Qu.:2.000   1st Qu.:0.000   Kelly, Mr. James                :   2  
##  Median :3.000   Median :0.000   Abbing, Mr. Anthony             :   1  
##  Mean   :2.295   Mean   :0.382   Abbott, Master. Eugene Joseph   :   1  
##  3rd Qu.:3.000   3rd Qu.:1.000   Abbott, Mr. Rossmore Edward     :   1  
##  Max.   :3.000   Max.   :1.000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
##                                  (Other)                         :1301  
##      sex           age              sibsp            parch      
##  female:466   Min.   : 0.1667   Min.   :0.0000   Min.   :0.000  
##  male  :843   1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000  
##               Median :28.0000   Median :0.0000   Median :0.000  
##               Mean   :29.8811   Mean   :0.4989   Mean   :0.385  
##               3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
##               Max.   :80.0000   Max.   :8.0000   Max.   :9.000  
##               NA's   :263                                       
##       ticket          fare                     cabin      embarked
##  CA. 2343:  11   Min.   :  0.000                  :1014    :  2   
##  1601    :   8   1st Qu.:  7.896   C23 C25 C27    :   6   C:270   
##  CA 2144 :   8   Median : 14.454   B57 B59 B63 B66:   5   Q:123   
##  3101295 :   7   Mean   : 33.295   G6             :   5   S:914   
##  347077  :   7   3rd Qu.: 31.275   B96 B98        :   4           
##  347082  :   7   Max.   :512.329   C22 C26        :   4           
##  (Other) :1261   NA's   :1         (Other)        : 271           
##       boat          body                      home.dest  
##         :823   Min.   :  1.0                       :564  
##  13     : 39   1st Qu.: 72.0   New York, NY        : 64  
##  C      : 38   Median :155.0   London              : 14  
##  15     : 37   Mean   :160.8   Montreal, PQ        : 10  
##  14     : 33   3rd Qu.:256.0   Cornwall / Akron, OH:  9  
##  4      : 31   Max.   :328.0   Paris, France       :  9  
##  (Other):308   NA's   :1188    (Other)             :639
  5. Visualize the data set
plot(raw)

  6. Load the corrgram package
library(corrgram)
  7. Visualize the correlations
corrgram(raw)

  8. Count the missing values
sum(is.na(raw))
## [1] 1452
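Note that sum(is.na(raw)) counts individual missing cells, not rows (age, fare, and body account for all 1452). If you instead want the number of rows containing at least one missing value, base R's complete.cases() gives that directly:

# Count rows with at least one missing value
sum(!complete.cases(raw))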
  9. Load the dplyr package
library(dplyr)
  10. Transform, clean, engineer, and select features
clean <- raw %>%
  mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age)) %>%
  mutate(family = sibsp + parch) %>%
  mutate(survived = as.factor(ifelse(survived == 0, "No", "Yes"))) %>%
  select(
    Class = pclass,
    Sex = sex,
    Age = age,
    Family = family,
    Survived = survived) %>%
  as.data.frame()
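A quick sanity check that the pipeline did what we intended, confirming that no missing values remain and that Survived is now a factor:

# Verify the cleaned data set: expect 0 missing values
sum(is.na(clean))
str(clean)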

2. Create Training and Test Set

  1. Load the caret package
library(caret)
  2. Set the seed to 42 to make randomness reproducible.
set.seed(42)
  3. Create row indexes for the training set
indexes <- createDataPartition(
  clean$Survived, 
  p = .8, 
  list = FALSE, 
  times = 1)
  4. Create the training set using the row indexes
train <- clean[indexes, ]
  5. Create the test set using the remaining rows
test <- clean[-indexes, ]
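Because createDataPartition() samples within each class, the training and test sets should preserve the roughly 38% survival rate of the full data set. A quick check:

# Compare class proportions across the two splits
prop.table(table(train$Survived))
prop.table(table(test$Survived))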
  6. Specify center and scale as preprocessing steps
preProcess <- c("center", "scale")
  7. Specify training control parameters
control <- trainControl(
  method = "cv",
  number = 10)
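If you want steadier accuracy estimates at the cost of extra training time, repeated cross-validation is a drop-in alternative. A sketch, not used for the results shown in this lab:

# Alternative: 10-fold cross-validation repeated 3 times
controlRepeated <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3)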

3. Train KNN Models

  1. Train k-nearest neighbor models
knnModel <- train(
  form = Survived ~ .,
  data = train,
  method = "knn",
  preProcess = preProcess,
  trControl = control,
  tuneLength = 5,
  metric = "Accuracy")
  2. Display model details
print(knnModel)
## k-Nearest Neighbors 
## 
## 1048 samples
##    4 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 943, 943, 943, 944, 943, 943, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8015842  0.5754371
##    7  0.7929762  0.5562429
##    9  0.8034890  0.5779054
##   11  0.8120788  0.5945939
##   13  0.8120696  0.5931258
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
  3. Plot model accuracy
plot(knnModel)
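The winning hyperparameter can also be read programmatically; caret stores it in the fitted object:

# The selected neighborhood size (k = 11 in the run above)
knnModel$bestTune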

4. Train Decision Tree Models

  1. Train decision tree models
treeModel <- train(
  form = Survived ~ .,
  data = train,
  method = "rpart",
  preProcess = preProcess,
  trControl = control,
  tuneLength = 5,
  metric = "Accuracy")
  2. Display model summary
print(treeModel)
## CART 
## 
## 1048 samples
##    4 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 943, 943, 943, 943, 943, 944, ... 
## Resampling results across tuning parameters:
## 
##   cp        Accuracy   Kappa    
##   0.004375  0.8053388  0.5776352
##   0.005000  0.8072619  0.5811883
##   0.021250  0.8005861  0.5695910
##   0.023750  0.7871978  0.5433680
##   0.450000  0.6727473  0.1890019
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005.
  3. Plot model accuracy
plot(treeModel)
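To inspect the actual splits rather than just accuracy by complexity parameter, you can plot the final tree. This assumes the rpart.plot package is installed; note the split thresholds appear on the centered-and-scaled scale because of the preProcess step:

# Visualize the final pruned tree
library(rpart.plot)
rpart.plot(treeModel$finalModel)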

5. Train Neural Network Models

  1. Specify hyperparameter-tuning grid
neuralTuneGrid <- data.frame(
  size = c(3, 4, 5, 3, 4, 5, 3, 4, 5),
  decay = c(0.1, 0.1, 0.1, 0.01, 0.01, 0.01, 0.001, 0.001, 0.001))
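Listing all nine combinations by hand is error-prone; expand.grid() builds the same grid from the two vectors (the row order differs, which caret does not care about):

# Equivalent grid: every combination of size and decay
neuralTuneGrid <- expand.grid(
  size = c(3, 4, 5),
  decay = c(0.1, 0.01, 0.001))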
  2. Train neural network models
neuralModel <- train(
  form = Survived ~ .,
  data = train,
  method = "nnet",
  preProcess = preProcess,
  trControl = control,
  tuneGrid = neuralTuneGrid)
  3. Display model summary
print(neuralModel)
## Neural Network 
## 
## 1048 samples
##    4 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 943, 943, 943, 943, 943, 943, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   3     0.001  0.8053388  0.5789952
##   3     0.010  0.8139469  0.5966001
##   3     0.100  0.8158700  0.5994007
##   4     0.001  0.8129945  0.5935100
##   4     0.010  0.8168315  0.6029929
##   4     0.100  0.8158700  0.5968726
##   5     0.001  0.8196886  0.6092592
##   5     0.010  0.8129945  0.5953079
##   5     0.100  0.8177564  0.6022955
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.001.

  4. Plot model accuracy
plot(neuralModel)
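caret can also estimate which inputs the network relies on most; varImp() supports nnet models by combining the absolute connection weights, so treat the ranking as a rough guide:

# Rough variable importance for the fitted network
varImp(neuralModel)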

6. Evaluate the Models

  1. Combine results
results <- resamples(list(
  knn = knnModel,
  tree = treeModel,
  nnet = neuralModel))
  2. Summarize results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: knn, tree, nnet 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## knn  0.7428571 0.8023810 0.8142857 0.8120788 0.8281593 0.8557692    0
## tree 0.7238095 0.7852335 0.8229853 0.8072619 0.8357143 0.8476190    0
## nnet 0.7714286 0.8081502 0.8238095 0.8196886 0.8380952 0.8653846    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## knn  0.4521739 0.5805516 0.5956140 0.5945939 0.6252555 0.6849758    0
## tree 0.3817259 0.5313906 0.6191558 0.5811883 0.6373072 0.6673267    0
## nnet 0.5058824 0.5790588 0.6090769 0.6092592 0.6465084 0.7156250    0
  3. Create a dot plot to compare the models
dotplot(results)

  4. Create a box plot to compare models
bwplot(results)

  5. Create a density plot to compare models
densityplot(results, auto.key = TRUE)
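Before deciding, it can help to check whether the accuracy differences are larger than the fold-to-fold noise; caret's diff() runs pairwise comparisons across the resamples:

# Pairwise differences between models across the 10 folds
summary(diff(results))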

  6. Question: Which model would you choose? Why?

7. Evaluate the Final Model

  1. Make final predictions using hold-out test set
final_predictions <- predict(
  object = treeModel,
  newdata = test)
  2. Determine final prediction accuracy
finalMatrix <- confusionMatrix(
  data = final_predictions,
  reference = test$Survived)
  3. Inspect final prediction accuracy
print(finalMatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  134  30
##        Yes  27  70
##                                           
##                Accuracy : 0.7816          
##                  95% CI : (0.7265, 0.8302)
##     No Information Rate : 0.6169          
##     P-Value [Acc > NIR] : 9.348e-09       
##                                           
##                   Kappa : 0.5353          
##  Mcnemar's Test P-Value : 0.7911          
##                                           
##             Sensitivity : 0.8323          
##             Specificity : 0.7000          
##          Pos Pred Value : 0.8171          
##          Neg Pred Value : 0.7216          
##              Prevalence : 0.6169          
##          Detection Rate : 0.5134          
##    Detection Prevalence : 0.6284          
##       Balanced Accuracy : 0.7661          
##                                           
##        'Positive' Class : No              
## 
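Individual statistics can be pulled out of the confusion matrix object when you need them programmatically:

# Overall accuracy and Kappa as plain numbers
finalMatrix$overall[c("Accuracy", "Kappa")]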

8. Deploy the Model

  1. Determine the likelihood that Jack survives the Titanic.
predict(
  object = neuralModel,
  newdata = data.frame(
    Class = 3,
    Sex = "male",
    Age = 20, 
    Family = 1),
  type = "prob")
##          No       Yes
## 1 0.8598103 0.1401897
  2. Question: Would you bet on Jack surviving?

  3. Determine the likelihood that you would survive the Titanic.

  4. Question: Would you take that bet?
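As a template for step 3, substitute your own values for the placeholders below (the passenger shown is hypothetical):

# Hypothetical passenger: replace each placeholder with your own values
predict(
  object = neuralModel,
  newdata = data.frame(
    Class = 2,       # ticket class: 1, 2, or 3
    Sex = "female",  # "male" or "female"
    Age = 30,        # age in years
    Family = 0),     # siblings/spouses + parents/children aboard
  type = "prob")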