Lab 7B: Machine Learning (Hard)

  1. Set the working directory

  2. Load policies from the CSV file

  3. Set the seed to 123 to make the randomness reproducible
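
The setup steps might look like the following sketch. The working-directory path and the file name policies.csv are assumptions; use whatever location and name your course materials specify.

```r
# Set the working directory (path is a placeholder -- substitute your own)
setwd("~/lab7b")

# Load the policies data from the CSV file (file name assumed)
policies <- read.csv("policies.csv")

# Fix the random seed so sampling results are reproducible
set.seed(123)
```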

Problem 1: Prepare the Data

  1. Create a dataset called “full” containing just Gender, State, Age, and Rate

  2. Inspect the first six rows

##   Gender State Age     Rate
## 1   Male    MA  77 0.033200
## 2   Male    VA  82 0.067000
## 3   Male    NY  31 0.001000
## 4   Male    TN  39 0.001900
## 5   Male    FL  68 0.014975
## 6   Male    WA  64 0.013150
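
Selecting the four columns can be sketched as below, assuming the loaded data frame is named policies:

```r
# Keep only the four columns of interest
full <- policies[, c("Gender", "State", "Age", "Rate")]

# Inspect the first six rows
head(full)
```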

  3. Eliminate rows containing NA values

  4. Find the average rate

## [1] 0.01179769
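
A minimal sketch of these two steps (the variable name averageRate is an assumption):

```r
# Drop any rows that contain NA values
full <- na.omit(full)

# Average rate across the remaining rows
averageRate <- mean(full$Rate)
averageRate
```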

  5. Create a categorical variable called “Risk” with two classes (i.e., High and Low)
    NOTE: Use “Low” if the Rate is less than or equal to the average Rate and “High” if it is greater

  6. Convert Risk to a factor variable

  7. Remove the Rate column
    Hint: Set the column equal to NULL

  8. Inspect the first six rows

##   Gender State Age Risk
## 1   Male    MA  77 High
## 2   Male    VA  82 High
## 3   Male    NY  31  Low
## 4   Male    TN  39  Low
## 5   Male    FL  68 High
## 6   Male    WA  64 High
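
These steps could be sketched as follows, assuming the average rate was stored as averageRate:

```r
# Label rows at or below the average rate "Low", above it "High"
full$Risk <- ifelse(full$Rate <= averageRate, "Low", "High")

# Convert the labels to a factor for classification
full$Risk <- as.factor(full$Risk)

# Drop the now-redundant Rate column
full$Rate <- NULL

# Inspect the first six rows
head(full)
```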

  9. Randomly sample 1500 of the 2000 row indexes

  10. Create a training set called “train” from the sampled indexes

  11. Create a test set called “test” from the remaining indexes
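
One way to sketch the split, assuming the data frame has exactly 2000 rows:

```r
# Sample 1500 of the 2000 row indexes without replacement
indexes <- sample(1:2000, 1500)

# Training set from the sampled indexes, test set from the rest
train <- full[indexes, ]
test  <- full[-indexes, ]
```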

Problem 2: Predict using a Decision Tree

  1. Load the tree package

  2. Train the decision tree model with “Risk ~ Gender + Age”
    NOTE: We can’t use State because the tree package only supports factor predictors with at most 32 levels

  3. Inspect the model

## 
## Classification tree:
## tree(formula = Risk ~ Gender + Age, data = train)
## Number of terminal nodes:  7 
## Residual mean deviance:  0.1254 = 187.3 / 1493 
## Misclassification error rate: 0.03067 = 46 / 1500
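
The formula and data names above come from the printed output; the model name treeModel is an assumption. A sketch:

```r
# Load the tree package (install.packages("tree") if needed)
library(tree)

# Train the decision tree; State is excluded because tree caps
# factor predictors at 32 levels
treeModel <- tree(Risk ~ Gender + Age, data = train)

# Inspect the fitted tree
summary(treeModel)
```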

  4. Plot the model
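
Plotting could be sketched as below, assuming the fitted model is named treeModel; text() overlays the split labels on the tree drawn by plot():

```r
# Draw the tree structure and label its splits and leaves
plot(treeModel)
text(treeModel, pretty = 0)
```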

Question: How do you interpret this model?

  5. Predict the test data set with the model

  6. Summarize the prediction accuracy

##       y
## x      High Low
##   High  124   1
##   Low    15 360
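
The x/y labels in the table above suggest a cross-tabulation like the following (treeModel and treePredictions are assumed names):

```r
# Class predictions on the held-out test data
treePredictions <- predict(treeModel, test, type = "class")

# Cross-tabulate predictions (x) against actual risk (y)
table(x = treePredictions, y = test$Risk)
```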

  7. Load the caret package

  8. Evaluate the prediction results

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  124   1
##       Low    15 360
##                                           
##                Accuracy : 0.968           
##                  95% CI : (0.9486, 0.9816)
##     No Information Rate : 0.722           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9177          
##  Mcnemar's Test P-Value : 0.001154        
##                                           
##             Sensitivity : 0.8921          
##             Specificity : 0.9972          
##          Pos Pred Value : 0.9920          
##          Neg Pred Value : 0.9600          
##              Prevalence : 0.2780          
##          Detection Rate : 0.2480          
##    Detection Prevalence : 0.2500          
##       Balanced Accuracy : 0.9447          
##                                           
##        'Positive' Class : High            
## 

  9. Note the accuracy

## Accuracy 
##    0.968
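
A sketch of the caret evaluation, assuming the predictions were stored as treePredictions; the accuracy figure is one element of the confusion matrix object's overall statistics:

```r
# Load caret (install.packages("caret") if needed)
library(caret)

# Full confusion matrix with accuracy, kappa, sensitivity, etc.
treeResults <- confusionMatrix(data = treePredictions,
                               reference = test$Risk)
treeResults

# Pull out just the accuracy figure
treeResults$overall["Accuracy"]
```

Note that the reported accuracy matches the confusion matrix: (124 + 360) correct out of 500 test rows is 0.968.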

Problem 3: Predict using a Naive Bayes Classifier

  1. Load the e1071 package

  2. Train the model

  3. Inspect the model

##         Length Class  Mode     
## apriori 2      table  numeric  
## tables  3      -none- list     
## levels  2      -none- character
## call    4      -none- call
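
The component table above is what summary() prints for a naiveBayes object; the three entries under tables suggest all three predictors were used. A sketch, with bayesModel as an assumed name:

```r
# Load e1071 (install.packages("e1071") if needed)
library(e1071)

# Naive Bayes handles many-level factors, so State can be included
bayesModel <- naiveBayes(Risk ~ ., data = train)

# summary() lists the components of the fitted model object
summary(bayesModel)
```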

  4. Predict with the model

  5. Evaluate the prediction results

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  129  14
##       Low    10 347
##                                          
##                Accuracy : 0.952          
##                  95% CI : (0.9294, 0.969)
##     No Information Rate : 0.722          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8815         
##  Mcnemar's Test P-Value : 0.5403         
##                                          
##             Sensitivity : 0.9281         
##             Specificity : 0.9612         
##          Pos Pred Value : 0.9021         
##          Neg Pred Value : 0.9720         
##              Prevalence : 0.2780         
##          Detection Rate : 0.2580         
##    Detection Prevalence : 0.2860         
##       Balanced Accuracy : 0.9446         
##                                          
##        'Positive' Class : High           
## 

  6. Note the accuracy

## Accuracy 
##    0.952
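
Prediction and evaluation mirror the decision-tree workflow; bayesModel and bayesPredictions are assumed names:

```r
library(caret)

# Class predictions on the test set
bayesPredictions <- predict(bayesModel, test)

# Evaluate against the actual risk labels
bayesResults <- confusionMatrix(data = bayesPredictions,
                                reference = test$Risk)
bayesResults$overall["Accuracy"]
```

Again the accuracy checks out against the matrix: (129 + 347) / 500 = 0.952.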

Problem 4: Predict with a Neural Network

  1. Load the nnet package

  2. Train the model with size = 3, decay = 0.0001, and maxit = 500

## # weights:  163
## initial  value 899.369719 
## iter  10 value 892.253620
## final  value 892.253443 
## converged
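
Training could be sketched as below; neuralModel is an assumed name. The 52 inputs reported above come from dummy-coding the factor predictors (Gender and State) alongside numeric Age:

```r
# Load nnet (ships with R as a recommended package)
library(nnet)

# One hidden layer with 3 units, light weight decay, up to 500 iterations
neuralModel <- nnet(Risk ~ ., data = train, size = 3,
                    decay = 0.0001, maxit = 500)
```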

  3. Inspect the model

## a 52-3-1 network with 163 weights
## options were - entropy fitting  decay=1e-04
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1 
##    0.60    0.60   -0.05   -0.33    0.24    0.07    0.11   -0.08   -0.09 
##  i9->h1 i10->h1 i11->h1 i12->h1 i13->h1 i14->h1 i15->h1 i16->h1 i17->h1 
##    0.12   -0.05    0.56    0.43   -0.34   -0.10   -0.09   -0.59   -0.15 
## i18->h1 i19->h1 i20->h1 i21->h1 i22->h1 i23->h1 i24->h1 i25->h1 i26->h1 
##   -0.10   -0.38    0.26   -0.60    0.29   -0.08    0.29   -0.60   -0.57 
## i27->h1 i28->h1 i29->h1 i30->h1 i31->h1 i32->h1 i33->h1 i34->h1 i35->h1 
##   -0.10    0.18   -0.35    0.41    0.24    0.49   -0.11   -0.11   -0.30 
## i36->h1 i37->h1 i38->h1 i39->h1 i40->h1 i41->h1 i42->h1 i43->h1 i44->h1 
##   -0.15   -0.33    0.54   -0.03   -0.56    0.34   -0.21    0.07    0.23 
## i45->h1 i46->h1 i47->h1 i48->h1 i49->h1 i50->h1 i51->h1 i52->h1 
##   -0.28   -0.55   -0.09    0.14    0.59    0.58   -0.51   -4.83 
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2 
##   -0.16   -0.15   -0.23    0.23   -0.27   -0.14    0.61    0.36   -0.23 
##  i9->h2 i10->h2 i11->h2 i12->h2 i13->h2 i14->h2 i15->h2 i16->h2 i17->h2 
##    0.39   -0.35    0.00   -0.41   -0.47    0.49   -0.51    0.45   -0.28 
## i18->h2 i19->h2 i20->h2 i21->h2 i22->h2 i23->h2 i24->h2 i25->h2 i26->h2 
##   -0.59    0.36   -0.18   -0.59    0.31   -0.58   -0.08   -0.30    0.00 
## i27->h2 i28->h2 i29->h2 i30->h2 i31->h2 i32->h2 i33->h2 i34->h2 i35->h2 
##    0.19   -0.52   -0.40   -0.49    0.39    0.16    0.16   -0.14   -0.09 
## i36->h2 i37->h2 i38->h2 i39->h2 i40->h2 i41->h2 i42->h2 i43->h2 i44->h2 
##   -0.19    0.36    0.48    0.19    0.12   -0.21   -0.53    0.11   -0.40 
## i45->h2 i46->h2 i47->h2 i48->h2 i49->h2 i50->h2 i51->h2 i52->h2 
##    0.30   -0.26    0.17   -0.47   -0.30    0.40    0.38   -1.12 
##   b->h3  i1->h3  i2->h3  i3->h3  i4->h3  i5->h3  i6->h3  i7->h3  i8->h3 
##    0.40   -0.48    0.13   -0.04    0.07    0.27   -0.04    0.61    0.60 
##  i9->h3 i10->h3 i11->h3 i12->h3 i13->h3 i14->h3 i15->h3 i16->h3 i17->h3 
##   -0.13    0.13   -0.19   -0.08   -0.34   -0.61   -0.59   -0.55    0.33 
## i18->h3 i19->h3 i20->h3 i21->h3 i22->h3 i23->h3 i24->h3 i25->h3 i26->h3 
##   -0.04    0.43    0.10   -0.39    0.14    0.48    0.36    0.00   -0.10 
## i27->h3 i28->h3 i29->h3 i30->h3 i31->h3 i32->h3 i33->h3 i34->h3 i35->h3 
##    0.19    0.48    0.49   -0.40   -0.30   -0.48    0.37   -0.48    0.40 
## i36->h3 i37->h3 i38->h3 i39->h3 i40->h3 i41->h3 i42->h3 i43->h3 i44->h3 
##    0.33    0.55   -0.04    0.55    0.35    0.46   -0.13   -0.50   -0.28 
## i45->h3 i46->h3 i47->h3 i48->h3 i49->h3 i50->h3 i51->h3 i52->h3 
##    0.27   -0.18   -0.15   -0.58    0.03   -0.47    0.48   -1.81 
##  b->o h1->o h2->o h3->o 
##  0.93  0.75 -0.53 -0.06

  4. Predict with the model

  5. Evaluate the prediction results

## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High    0   0
##       Low   139 361
##                                           
##                Accuracy : 0.722           
##                  95% CI : (0.6805, 0.7609)
##     No Information Rate : 0.722           
##     P-Value [Acc > NIR] : 0.5228          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.722           
##              Prevalence : 0.278           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : High            
## 

  6. Note the accuracy

## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Accuracy 
##    0.722
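
A sketch of the final prediction step, with neuralModel and neuralPredictions as assumed names. Note from the confusion matrix above that this network predicted “Low” for every test row, so its 0.722 accuracy is exactly the no-information rate; the level-order warning arises because only one class appears in the predictions:

```r
library(caret)

# type = "class" returns labels instead of raw probabilities;
# predict() gives character output, so convert to a factor
neuralPredictions <- as.factor(predict(neuralModel, test, type = "class"))

neuralResults <- confusionMatrix(data = neuralPredictions,
                                 reference = test$Risk)
neuralResults$overall["Accuracy"]
```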

Question: Which algorithm provides the highest accuracy?

Question: Which algorithm provides the most transparency?