# Lab 7B: Machine Learning (Hard)

1. Set the working directory

2. Load policies from the CSV file

3. Set the seed to 123 to make the randomness reproducible
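The setup steps above might look like this in R; the directory path and CSV file name are assumptions, since the lab does not name them:

```r
# Assumed path and file name -- substitute your own.
setwd("~/lab7b")                      # 1. set the working directory
policies <- read.csv("policies.csv")  # 2. load the policies
set.seed(123)                         # 3. make the random sampling reproducible
```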

### Problem 1: Prepare the Data

1. Create a dataset called “full” containing just Gender, State, Age, and Rate

2. Inspect the first six rows

##   Gender State Age     Rate
## 1   Male    MA  77 0.033200
## 2   Male    VA  82 0.067000
## 3   Male    NY  31 0.001000
## 4   Male    TN  39 0.001900
## 5   Male    FL  68 0.014975
## 6   Male    WA  64 0.013150
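A sketch of the two steps above, assuming the loaded data frame is called `policies`:

```r
# Keep only the four columns of interest
full <- policies[, c("Gender", "State", "Age", "Rate")]
head(full)  # inspect the first six rows
```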
3. Eliminate rows containing NA values

4. Find the average rate

## [1] 0.01179769
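These two steps could be, for example:

```r
full <- na.omit(full)           # drop any row containing an NA
averageRate <- mean(full$Rate)  # average rate across the remaining rows
averageRate
```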
5. Create a categorical variable called “Risk” with two classes (High and Low)
NOTE: Use “Low” if the Rate is less than or equal to the average Rate and “High” if it is greater

6. Convert Risk to a factor variable

7. Remove the Rate column
Hint: Set the column equal to NULL

8. Inspect the first six rows

##   Gender State Age Risk
## 1   Male    MA  77 High
## 2   Male    VA  82 High
## 3   Male    NY  31  Low
## 4   Male    TN  39  Low
## 5   Male    FL  68 High
## 6   Male    WA  64 High
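One way to derive Risk and drop Rate, assuming the average rate was stored in `averageRate`:

```r
# Label each policy relative to the average rate
full$Risk <- ifelse(full$Rate <= averageRate, "Low", "High")
full$Risk <- factor(full$Risk)  # convert to a factor variable
full$Rate <- NULL               # remove the Rate column
head(full)
```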
9. Randomly sample 1500 of the 2000 row indexes

10. Create a training set called “train” from the sampled indexes

11. Create a test set called “test” from the remaining indexes
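A common way to split the data, using the seed set earlier:

```r
indexes <- sample(nrow(full), 1500)  # 1500 of the 2000 row indexes
train <- full[indexes, ]             # sampled rows
test  <- full[-indexes, ]            # remaining 500 rows
```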

### Problem 2: Predict using a Decision Tree

1. Train the decision tree model with “Risk ~ Gender + Age”
NOTE: We can’t use State because the tree package supports at most 32 levels in a categorical predictor

2. Inspect the model

##
## Classification tree:
## tree(formula = Risk ~ Gender + Age, data = train)
## Number of terminal nodes:  7
## Residual mean deviance:  0.1254 = 187.3 / 1493
## Misclassification error rate: 0.03067 = 46 / 1500
3. Plot the model

Question: How do you interpret this model?
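The tree steps above could be sketched as follows, using the `tree` package:

```r
library(tree)

# Train on Gender and Age only (State has too many levels for tree)
treeModel <- tree(Risk ~ Gender + Age, data = train)
summary(treeModel)  # terminal nodes, deviance, misclassification rate

plot(treeModel)     # draw the tree
text(treeModel)     # label the splits and leaves
```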

4. Predict the test data set with the model

5. Summarize the prediction accuracy

##       y
## x      High Low
##   High  124   1
##   Low    15 360

6. Evaluate the prediction results

## Confusion Matrix and Statistics
##
##           Reference
## Prediction High Low
##       High  124   1
##       Low    15 360
##
##                Accuracy : 0.968
##                  95% CI : (0.9486, 0.9816)
##     No Information Rate : 0.722
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9177
##  Mcnemar's Test P-Value : 0.001154
##
##             Sensitivity : 0.8921
##             Specificity : 0.9972
##          Pos Pred Value : 0.9920
##          Neg Pred Value : 0.9600
##              Prevalence : 0.2780
##          Detection Rate : 0.2480
##    Detection Prevalence : 0.2500
##       Balanced Accuracy : 0.9447
##
##        'Positive' Class : High
## 
7. Note the accuracy
## Accuracy
##    0.968
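Prediction and evaluation might be done with `predict` and `caret::confusionMatrix`; the variable names are illustrative:

```r
library(caret)

treePredictions <- predict(treeModel, test, type = "class")
table(x = treePredictions, y = test$Risk)  # summarize the prediction accuracy
confusionMatrix(data = treePredictions, reference = test$Risk)
```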

### Problem 3: Predict using a Naive Bayes Classifier

1. Train the model

2. Inspect the model

##         Length Class  Mode
## apriori 2      table  numeric
## tables  3      -none- list
## levels  2      -none- character
## call    4      -none- call
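The summary above (an `apriori` table, three conditional tables, two levels) is consistent with `e1071::naiveBayes` trained on all three predictors, so a sketch might be:

```r
library(e1071)

# Naive Bayes can use State despite its many levels
bayesModel <- naiveBayes(Risk ~ Gender + State + Age, data = train)
summary(bayesModel)
```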
3. Predict with the model

4. Evaluate the prediction results

## Confusion Matrix and Statistics
##
##           Reference
## Prediction High Low
##       High  129  14
##       Low    10 347
##
##                Accuracy : 0.952
##                  95% CI : (0.9294, 0.969)
##     No Information Rate : 0.722
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.8815
##  Mcnemar's Test P-Value : 0.5403
##
##             Sensitivity : 0.9281
##             Specificity : 0.9612
##          Pos Pred Value : 0.9021
##          Neg Pred Value : 0.9720
##              Prevalence : 0.2780
##          Detection Rate : 0.2580
##    Detection Prevalence : 0.2860
##       Balanced Accuracy : 0.9446
##
##        'Positive' Class : High
## 
5. Note the accuracy
## Accuracy
##    0.952
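Prediction and evaluation follow the same pattern as the decision tree:

```r
library(caret)

bayesPredictions <- predict(bayesModel, test)
confusionMatrix(data = bayesPredictions, reference = test$Risk)
```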

### Problem 4: Predict with a Neural Network

1. Train the model with size = 3, decay = 0.0001, and maxit = 500

## # weights:  163
## initial  value 899.369719
## iter  10 value 892.253620
## final  value 892.253443
## converged
2. Inspect the model
## a 52-3-1 network with 163 weights
## options were - entropy fitting  decay=1e-04
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1
##    0.60    0.60   -0.05   -0.33    0.24    0.07    0.11   -0.08   -0.09
##  i9->h1 i10->h1 i11->h1 i12->h1 i13->h1 i14->h1 i15->h1 i16->h1 i17->h1
##    0.12   -0.05    0.56    0.43   -0.34   -0.10   -0.09   -0.59   -0.15
## i18->h1 i19->h1 i20->h1 i21->h1 i22->h1 i23->h1 i24->h1 i25->h1 i26->h1
##   -0.10   -0.38    0.26   -0.60    0.29   -0.08    0.29   -0.60   -0.57
## i27->h1 i28->h1 i29->h1 i30->h1 i31->h1 i32->h1 i33->h1 i34->h1 i35->h1
##   -0.10    0.18   -0.35    0.41    0.24    0.49   -0.11   -0.11   -0.30
## i36->h1 i37->h1 i38->h1 i39->h1 i40->h1 i41->h1 i42->h1 i43->h1 i44->h1
##   -0.15   -0.33    0.54   -0.03   -0.56    0.34   -0.21    0.07    0.23
## i45->h1 i46->h1 i47->h1 i48->h1 i49->h1 i50->h1 i51->h1 i52->h1
##   -0.28   -0.55   -0.09    0.14    0.59    0.58   -0.51   -4.83
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2
##   -0.16   -0.15   -0.23    0.23   -0.27   -0.14    0.61    0.36   -0.23
##  i9->h2 i10->h2 i11->h2 i12->h2 i13->h2 i14->h2 i15->h2 i16->h2 i17->h2
##    0.39   -0.35    0.00   -0.41   -0.47    0.49   -0.51    0.45   -0.28
## i18->h2 i19->h2 i20->h2 i21->h2 i22->h2 i23->h2 i24->h2 i25->h2 i26->h2
##   -0.59    0.36   -0.18   -0.59    0.31   -0.58   -0.08   -0.30    0.00
## i27->h2 i28->h2 i29->h2 i30->h2 i31->h2 i32->h2 i33->h2 i34->h2 i35->h2
##    0.19   -0.52   -0.40   -0.49    0.39    0.16    0.16   -0.14   -0.09
## i36->h2 i37->h2 i38->h2 i39->h2 i40->h2 i41->h2 i42->h2 i43->h2 i44->h2
##   -0.19    0.36    0.48    0.19    0.12   -0.21   -0.53    0.11   -0.40
## i45->h2 i46->h2 i47->h2 i48->h2 i49->h2 i50->h2 i51->h2 i52->h2
##    0.30   -0.26    0.17   -0.47   -0.30    0.40    0.38   -1.12
##   b->h3  i1->h3  i2->h3  i3->h3  i4->h3  i5->h3  i6->h3  i7->h3  i8->h3
##    0.40   -0.48    0.13   -0.04    0.07    0.27   -0.04    0.61    0.60
##  i9->h3 i10->h3 i11->h3 i12->h3 i13->h3 i14->h3 i15->h3 i16->h3 i17->h3
##   -0.13    0.13   -0.19   -0.08   -0.34   -0.61   -0.59   -0.55    0.33
## i18->h3 i19->h3 i20->h3 i21->h3 i22->h3 i23->h3 i24->h3 i25->h3 i26->h3
##   -0.04    0.43    0.10   -0.39    0.14    0.48    0.36    0.00   -0.10
## i27->h3 i28->h3 i29->h3 i30->h3 i31->h3 i32->h3 i33->h3 i34->h3 i35->h3
##    0.19    0.48    0.49   -0.40   -0.30   -0.48    0.37   -0.48    0.40
## i36->h3 i37->h3 i38->h3 i39->h3 i40->h3 i41->h3 i42->h3 i43->h3 i44->h3
##    0.33    0.55   -0.04    0.55    0.35    0.46   -0.13   -0.50   -0.28
## i45->h3 i46->h3 i47->h3 i48->h3 i49->h3 i50->h3 i51->h3 i52->h3
##    0.27   -0.18   -0.15   -0.58    0.03   -0.47    0.48   -1.81
##  b->o h1->o h2->o h3->o
##  0.93  0.75 -0.53 -0.06
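The 52-3-1 architecture above is consistent with `nnet` one-hot encoding the categorical predictors (including State), so training might look like:

```r
library(nnet)

# Three hidden units, small weight decay, up to 500 iterations
neuralModel <- nnet(Risk ~ Gender + State + Age, data = train,
                    size = 3, decay = 0.0001, maxit = 500)
summary(neuralModel)  # inspect the fitted weights
```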
3. Predict with the model

4. Evaluate the prediction results

## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction High Low
##       High    0   0
##       Low   139 361
##
##                Accuracy : 0.722
##                  95% CI : (0.6805, 0.7609)
##     No Information Rate : 0.722
##     P-Value [Acc > NIR] : 0.5228
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.000
##             Specificity : 1.000
##          Pos Pred Value :   NaN
##          Neg Pred Value : 0.722
##              Prevalence : 0.278
##          Detection Rate : 0.000
##    Detection Prevalence : 0.000
##       Balanced Accuracy : 0.500
##
##        'Positive' Class : High
##
5. Note the accuracy
## Warning in confusionMatrix.default(data = neuralPredictions, reference
## = test$Risk): Levels are not in the same order for reference and data.
## Refactoring data to match.
## Accuracy
##    0.722
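The warning above appears because `predict(..., type = "class")` returns a character vector; converting it to a factor with the reference levels avoids the level-order mismatch. A sketch:

```r
library(caret)

neuralPredictions <- predict(neuralModel, test, type = "class")

# Match the factor levels of the reference to avoid the level-order warning
neuralPredictions <- factor(neuralPredictions, levels = levels(test$Risk))
confusionMatrix(data = neuralPredictions, reference = test$Risk)
```

Note that in the output above the network predicted every test row as Low, so its accuracy simply equals the No Information Rate (0.722).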

Question: Which algorithm provides the highest accuracy?

Question: Which algorithm provides the most transparency?