Set the working directory
Load policies from the CSV file
Set the seed to 123 to make randomness reproducible
Create a dataset called “full” containing just Gender, State, Age, and Rate
Inspect the first six rows
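These setup steps might look like the following. The directory path and the CSV filename are placeholders (the source does not give them), and the data frame names are assumptions:

```r
# Set the working directory (path is a placeholder)
setwd("~/insurance")

# Load policies from the CSV file (filename is an assumption)
policies <- read.csv("policies.csv")

# Set the seed to 123 to make randomness reproducible
set.seed(123)

# Keep only the Gender, State, Age, and Rate columns
full <- policies[, c("Gender", "State", "Age", "Rate")]

# Inspect the first six rows
head(full)
```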
## Gender State Age Rate
## 1 Male MA 77 0.033200
## 2 Male VA 82 0.067000
## 3 Male NY 31 0.001000
## 4 Male TN 39 0.001900
## 5 Male FL 68 0.014975
## 6 Male WA 64 0.013150
Eliminate rows containing NA values
Find the average rate
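A sketch of these two steps, assuming the data frame is named `full`; the variable name `averageRate` is an assumption:

```r
# Eliminate rows containing NA values
full <- na.omit(full)

# Find the average rate
averageRate <- mean(full$Rate)
averageRate
```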
## [1] 0.01179769
Create a categorical variable called “Risk” with two classes (i.e., High and Low)
NOTE: Use “Low” if the Rate is less than or equal to the average and “High” if it is greater
Convert Risk to a factor variable
Remove the Rate column
Hint: Set the column equal to NULL
Inspect the first six rows
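One way to derive the Risk label, assuming the mean Rate was stored earlier in a variable named `averageRate` (variable names are assumptions):

```r
# "Low" if less than or equal to the average rate, otherwise "High"
full$Risk <- ifelse(full$Rate <= averageRate, "Low", "High")

# Convert Risk to a factor variable
full$Risk <- as.factor(full$Risk)

# Remove the Rate column by setting it to NULL
full$Rate <- NULL

# Inspect the first six rows
head(full)
```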
## Gender State Age Risk
## 1 Male MA 77 High
## 2 Male VA 82 High
## 3 Male NY 31 Low
## 4 Male TN 39 Low
## 5 Male FL 68 High
## 6 Male WA 64 High
Randomly sample 1500 out of the 2000 row indexes
Create a training set called “train” from the indexes
Create a test set called “test” from the remaining indexes
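The split could be done as below; `sample()` draws 1500 of the 2000 row indexes without replacement, and negative indexing selects the remaining rows (variable names are assumptions):

```r
# Randomly sample 1500 of the 2000 row indexes
indexes <- sample(nrow(full), 1500)

# Training set: the sampled rows; test set: everything else
train <- full[indexes, ]
test  <- full[-indexes, ]
```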
Load the tree package
Train the decision tree model with “Risk ~ Gender + Age”
NOTE: We can’t use State because the tree package only supports categorical predictors with at most 32 levels
Inspect the model
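The call below matches the formula echoed in the summary output; the object name `treeModel` is an assumption:

```r
# Load the tree package
library(tree)

# Train the decision tree on Gender and Age
treeModel <- tree(Risk ~ Gender + Age, data = train)

# Inspect the fitted tree
summary(treeModel)
```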
##
## Classification tree:
## tree(formula = Risk ~ Gender + Age, data = train)
## Number of terminal nodes: 7
## Residual mean deviance: 0.1254 = 187.3 / 1493
## Misclassification error rate: 0.03067 = 46 / 1500
Question: How do you interpret this model?
Predict the test data set with the model
Summarize the prediction accuracy
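A sketch of the prediction step, assuming the fitted tree is stored as `treeModel`; `type = "class"` makes `predict()` return class labels rather than probabilities:

```r
# Predict class labels for the test set
treePredictions <- predict(treeModel, test, type = "class")

# Cross-tabulate predictions (x) against actual labels (y)
table(x = treePredictions, y = test$Risk)
```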
## y
## x High Low
## High 124 1
## Low 15 360
Load the caret package
Evaluate the prediction results
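The output below is what caret's `confusionMatrix()` produces; the object name `treeResults` is an assumption:

```r
# Load the caret package
library(caret)

# Full confusion matrix with accuracy, kappa, and per-class statistics
treeResults <- confusionMatrix(data = treePredictions, reference = test$Risk)
treeResults

# Report the accuracy on its own
treeResults$overall["Accuracy"]
```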
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 124 1
## Low 15 360
##
## Accuracy : 0.968
## 95% CI : (0.9486, 0.9816)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9177
## Mcnemar's Test P-Value : 0.001154
##
## Sensitivity : 0.8921
## Specificity : 0.9972
## Pos Pred Value : 0.9920
## Neg Pred Value : 0.9600
## Prevalence : 0.2780
## Detection Rate : 0.2480
## Detection Prevalence : 0.2500
## Balanced Accuracy : 0.9447
##
## 'Positive' Class : High
##
## Accuracy
## 0.968
Load the e1071 package
Train the model
Inspect the model
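Naive Bayes has no 32-level restriction, and the three conditional tables in the summary suggest that Gender, State, and Age were all used, so the formula below is an inference; object names are assumptions:

```r
# Load the e1071 package
library(e1071)

# Train a naive Bayes classifier on all predictors,
# including State (no 32-level limit here)
bayesModel <- naiveBayes(Risk ~ ., data = train)

# Inspect the model structure
summary(bayesModel)
```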
## Length Class Mode
## apriori 2 table numeric
## tables 3 -none- list
## levels 2 -none- character
## call 4 -none- call
Predict with the model
Evaluate the prediction results
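The prediction and evaluation mirror the tree workflow, assuming the model is stored as `bayesModel` and caret is loaded (object names are assumptions):

```r
# Predict class labels for the test set
bayesPredictions <- predict(bayesModel, test)

# Evaluate with caret's confusionMatrix()
bayesResults <- confusionMatrix(data = bayesPredictions, reference = test$Risk)
bayesResults

# Report the accuracy on its own
bayesResults$overall["Accuracy"]
```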
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 129 14
## Low 10 347
##
## Accuracy : 0.952
## 95% CI : (0.9294, 0.969)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8815
## Mcnemar's Test P-Value : 0.5403
##
## Sensitivity : 0.9281
## Specificity : 0.9612
## Pos Pred Value : 0.9021
## Neg Pred Value : 0.9720
## Prevalence : 0.2780
## Detection Rate : 0.2580
## Detection Prevalence : 0.2860
## Balanced Accuracy : 0.9446
##
## 'Positive' Class : High
##
## Accuracy
## 0.952
Load the nnet package
Train the model with size = 3, decay = 0.0001, and maxit = 500
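The 52-3-1 architecture in the output implies all predictors fed the network (1 Gender dummy + 50 State dummies + Age = 52 inputs), so the formula below is an inference; the object name `neuralModel` is an assumption:

```r
# Load the nnet package
library(nnet)

# Single hidden layer: 3 units, weight decay 0.0001, up to 500 iterations
neuralModel <- nnet(Risk ~ ., data = train,
                    size = 3, decay = 0.0001, maxit = 500)

# Inspect the fitted weights
summary(neuralModel)
```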
## # weights: 163
## initial value 899.369719
## iter 10 value 892.253620
## final value 892.253443
## converged
## a 52-3-1 network with 163 weights
## options were - entropy fitting decay=1e-04
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1
## 0.60 0.60 -0.05 -0.33 0.24 0.07 0.11 -0.08 -0.09
## i9->h1 i10->h1 i11->h1 i12->h1 i13->h1 i14->h1 i15->h1 i16->h1 i17->h1
## 0.12 -0.05 0.56 0.43 -0.34 -0.10 -0.09 -0.59 -0.15
## i18->h1 i19->h1 i20->h1 i21->h1 i22->h1 i23->h1 i24->h1 i25->h1 i26->h1
## -0.10 -0.38 0.26 -0.60 0.29 -0.08 0.29 -0.60 -0.57
## i27->h1 i28->h1 i29->h1 i30->h1 i31->h1 i32->h1 i33->h1 i34->h1 i35->h1
## -0.10 0.18 -0.35 0.41 0.24 0.49 -0.11 -0.11 -0.30
## i36->h1 i37->h1 i38->h1 i39->h1 i40->h1 i41->h1 i42->h1 i43->h1 i44->h1
## -0.15 -0.33 0.54 -0.03 -0.56 0.34 -0.21 0.07 0.23
## i45->h1 i46->h1 i47->h1 i48->h1 i49->h1 i50->h1 i51->h1 i52->h1
## -0.28 -0.55 -0.09 0.14 0.59 0.58 -0.51 -4.83
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2
## -0.16 -0.15 -0.23 0.23 -0.27 -0.14 0.61 0.36 -0.23
## i9->h2 i10->h2 i11->h2 i12->h2 i13->h2 i14->h2 i15->h2 i16->h2 i17->h2
## 0.39 -0.35 0.00 -0.41 -0.47 0.49 -0.51 0.45 -0.28
## i18->h2 i19->h2 i20->h2 i21->h2 i22->h2 i23->h2 i24->h2 i25->h2 i26->h2
## -0.59 0.36 -0.18 -0.59 0.31 -0.58 -0.08 -0.30 0.00
## i27->h2 i28->h2 i29->h2 i30->h2 i31->h2 i32->h2 i33->h2 i34->h2 i35->h2
## 0.19 -0.52 -0.40 -0.49 0.39 0.16 0.16 -0.14 -0.09
## i36->h2 i37->h2 i38->h2 i39->h2 i40->h2 i41->h2 i42->h2 i43->h2 i44->h2
## -0.19 0.36 0.48 0.19 0.12 -0.21 -0.53 0.11 -0.40
## i45->h2 i46->h2 i47->h2 i48->h2 i49->h2 i50->h2 i51->h2 i52->h2
## 0.30 -0.26 0.17 -0.47 -0.30 0.40 0.38 -1.12
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 i8->h3
## 0.40 -0.48 0.13 -0.04 0.07 0.27 -0.04 0.61 0.60
## i9->h3 i10->h3 i11->h3 i12->h3 i13->h3 i14->h3 i15->h3 i16->h3 i17->h3
## -0.13 0.13 -0.19 -0.08 -0.34 -0.61 -0.59 -0.55 0.33
## i18->h3 i19->h3 i20->h3 i21->h3 i22->h3 i23->h3 i24->h3 i25->h3 i26->h3
## -0.04 0.43 0.10 -0.39 0.14 0.48 0.36 0.00 -0.10
## i27->h3 i28->h3 i29->h3 i30->h3 i31->h3 i32->h3 i33->h3 i34->h3 i35->h3
## 0.19 0.48 0.49 -0.40 -0.30 -0.48 0.37 -0.48 0.40
## i36->h3 i37->h3 i38->h3 i39->h3 i40->h3 i41->h3 i42->h3 i43->h3 i44->h3
## 0.33 0.55 -0.04 0.55 0.35 0.46 -0.13 -0.50 -0.28
## i45->h3 i46->h3 i47->h3 i48->h3 i49->h3 i50->h3 i51->h3 i52->h3
## 0.27 -0.18 -0.15 -0.58 0.03 -0.47 0.48 -1.81
## b->o h1->o h2->o h3->o
## 0.93 0.75 -0.53 -0.06
Predict with the model
Evaluate the prediction results
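The warnings below name the prediction object, so the step likely resembled this; `type = "class"` makes nnet return labels, which must be converted to a factor before `confusionMatrix()` will accept them. Calling `confusionMatrix()` a second time for the standalone accuracy would explain why the level-order warning appears twice:

```r
# Predict class labels for the test set
neuralPredictions <- as.factor(predict(neuralModel, test, type = "class"))

# Evaluate the prediction results
confusionMatrix(data = neuralPredictions, reference = test$Risk)

# Report the accuracy on its own (repeats the level-order warning)
confusionMatrix(data = neuralPredictions, reference = test$Risk)$overall["Accuracy"]
```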
## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 0 0
## Low 139 361
##
## Accuracy : 0.722
## 95% CI : (0.6805, 0.7609)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 0.5228
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.722
## Prevalence : 0.278
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : High
##
## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Accuracy
## 0.722
Question: Which algorithm provides the highest accuracy?
Question: Which algorithm provides the most transparency?