Lab 5B - Machine Learning in Practice

1. Load the Data

1. Load all required libraries.

In [1]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

2. Set the working directory to "C:\Workshop\Data".

In [2]:
os.chdir("C:\Workshop\Data")

3. Read the Risk.csv file into a data frame called policies.

In [3]:
policies = pd.read_csv("Risk.csv")

2. Explore the Data

1. Inspect the policies data using the head function.

In [4]:
policies.head()
Out[4]:
Gender State State_Rate Height Weight BMI Age Risk
0 Male MA 0.100434 184 67.8 20.025992 77 High
1 Male VA 0.141723 163 89.4 33.648237 82 High
2 Male NY 0.090803 170 81.2 28.096886 31 Low
3 Male TN 0.119973 175 99.7 32.555102 39 Low
4 Male FL 0.110345 184 72.1 21.296078 68 High

2. Summarize the columns in the data frame using the info function.

In [5]:
policies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1942 entries, 0 to 1941
Data columns (total 8 columns):
Gender        1942 non-null object
State         1942 non-null object
State_Rate    1942 non-null float64
Height        1942 non-null int64
Weight        1942 non-null float64
BMI           1942 non-null float64
Age           1942 non-null int64
Risk          1942 non-null object
dtypes: float64(3), int64(2), object(3)
memory usage: 121.5+ KB

3. Sumarize the data in the data frame using the describe function.

In [6]:
policies.describe(
    include = "all")
Out[6]:
Gender State State_Rate Height Weight BMI Age Risk
count 1942 1942 1942.000000 1942.000000 1942.000000 1942.000000 1942.000000 1942
unique 2 51 NaN NaN NaN NaN NaN 2
top Male CA NaN NaN NaN NaN NaN Low
freq 986 191 NaN NaN NaN NaN NaN 1366
mean NaN NaN 0.138064 169.718847 81.155767 28.292804 50.841401 NaN
std NaN NaN 0.044180 9.571082 16.009041 5.808799 19.327130 NaN
min NaN NaN 0.001000 150.000000 44.100000 16.022174 18.000000 NaN
25% NaN NaN 0.110345 162.000000 68.600000 23.739705 34.000000 NaN
50% NaN NaN 0.127584 170.000000 81.300000 28.055706 51.000000 NaN
75% NaN NaN 0.144251 176.000000 93.800000 32.456822 68.000000 NaN
max NaN NaN 0.318100 190.000000 116.500000 46.796193 84.000000 NaN

4. Create a correlation matrix using the corr function.

In [7]:
correlations = policies.corr()

5. Create a correlogram using the seaborn heatmap function.

In [8]:
sns.heatmap(
    data = correlations,
    cmap = sns.diverging_palette(
        h_neg = 10, 
        h_pos = 220, 
        as_cmap = True));

6. Inspect missing values with the isnull and sum functions.

In [9]:
policies.isnull().sum()
Out[9]:
Gender        0
State         0
State_Rate    0
Height        0
Weight        0
BMI           0
Age           0
Risk          0
dtype: int64

3. Transform the Data

1. Assign the following features to a data frame named X: Gender, State Rate, Height, Weight, BMI, and Age.

In [10]:
X = policies[["Gender", "State_Rate", "Height", "Weight", "BMI", "Age"]]

2. Encode the categorical Gender variable {Female, Male} as an integer {0, 1}.

In [11]:
X.Gender.replace(("Female", "Male"), (0, 1), inplace = True)
C:\Users\Matthew\Anaconda3\lib\site-packages\pandas\core\generic.py:4619: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)

3. Inspect the transfomred data with the head function.

In [12]:
X.head()
Out[12]:
Gender State_Rate Height Weight BMI Age
0 1 0.100434 184 67.8 20.025992 77
1 1 0.141723 163 89.4 33.648237 82
2 1 0.090803 170 81.2 28.096886 31
3 1 0.119973 175 99.7 32.555102 39
4 1 0.110345 184 72.1 21.296078 68

4. Create a new series for the labels named y.

In [13]:
y = policies.Risk

5. Scale the feature data using the standard scaler.

In [14]:
scaler = StandardScaler()

scaler.fit(X)

X_scaled = scaler.transform(X)

4. Create the Training and Test Set

1. Set the random number seed to 42.

In [15]:
np.random.seed(42)

2. Create stratified training and test sets (80/20).

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify = y,
    train_size = 0.80,
    test_size = 0.20)

5. Create KNN Classifier Models

1. Create a KNN model.

In [17]:
knn_model = KNeighborsClassifier()

2. Define the KNN hyperparameters to test (i.e. k = {2, 7, 9, 11, 13})

In [18]:
knn_params = [5, 7, 9, 11, 13]

knn_param_grid = {"n_neighbors" : knn_params }

3. Create 10 KNN models for each of the five hyper-parameters using 10-fold cross validation.

In [19]:
knn_models = GridSearchCV(
    estimator = knn_model, 
    param_grid = knn_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

In [20]:
knn_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    4.3s finished
Out[20]:
GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [5, 7, 9, 11, 13]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each of the five hyperparameters.

In [21]:
knn_avg_scores = knn_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [22]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(knn_params[i], knn_avg_scores[i]))
  5 : 0.965
  7 : 0.967
  9 : 0.968
 11 : 0.965
 13 : 0.968

7. Plot the change in accuracy over each hyper-parameter.

In [23]:
plt.plot(
    knn_params, 
    knn_avg_scores)
plt.xlabel("k (neighbors)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error of the top performing model.

In [24]:
knn_top_index = np.argmax(knn_avg_scores)
knn_top_param = knn_params[knn_top_index]
knn_top_score = knn_avg_scores[knn_top_index]
knn_top_error = knn_models.cv_results_["std_test_score"][knn_top_index]

9. Inspect the top performing model.

In [25]:
print("Top knn model is k = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_param, knn_top_score, knn_top_error))
Top knn model is k = 9 at 0.97 +/- 0.015 accuracy

6. Create Decision Tree Classifier Models.

1. Create a decision tree model.

In [26]:
tree_model = DecisionTreeClassifier()

2. Define the hyper-parameters to test (i.e. max_depth = {3, 4, 5, 6, 7}).

In [27]:
tree_params = [3, 4, 5, 6, 7]

tree_param_grid = {"max_depth" : tree_params }

3. Create 10 tree models for each of the 5 hyper-parameters using 10-fold cross validation.

In [28]:
tree_models = GridSearchCV(
    estimator = tree_model, 
    param_grid = tree_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

In [29]:
tree_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    2.3s finished
Out[29]:
GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 4, 5, 6, 7]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='accuracy',
       verbose=1)

5. Get the average accuracy for each hyper-parameter.

In [30]:
tree_avg_scores = tree_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [31]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(tree_params[i], tree_avg_scores[i]))
  3 : 0.972
  4 : 0.980
  5 : 0.979
  6 : 0.977
  7 : 0.979

7. Plot the change in accuracy over each hyper-parameter.

In [32]:
plt.plot(
    tree_params, 
    tree_avg_scores)
plt.xlabel("Max Depth (nodes)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error for the top-performing model.

In [33]:
tree_top_index = np.argmax(tree_avg_scores)
tree_top_param = tree_params[tree_top_index]
tree_top_score = tree_avg_scores[tree_top_index]
tree_top_error = tree_models.cv_results_["std_test_score"][tree_top_index]

9. Inspect the top-performing model.

In [34]:
print("Top tree model is k = {:d} at {:0.2f} +/- {:0.3} accuracy"
    .format(tree_top_param, tree_top_score, tree_top_error))
Top tree model is k = 4 at 0.98 +/- 0.0197 accuracy

7. Create Neural Network Classifier Models.

1. Create a neural network model with tanh activation functions and 5000 max iterations.

In [35]:
neural_model = MLPClassifier(
    activation = "tanh",
    solver = "sgd",
    max_iter = 5000)

2. Define hyper-parameters to test (i.e. hidden_layer_sizes = {3, 4, 5, 6, 7}).

In [36]:
neural_params = [3, 4, 5, 6, 7]

neural_param_grid = {"hidden_layer_sizes" : neural_params }

3. Create 10 models for each of the 5 hyper-parameters using 10-fold cross validation.

In [37]:
neural_models = GridSearchCV(
    estimator = neural_model, 
    param_grid = neural_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.
Note: This could take a few minutes.

In [38]:
neural_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.2min finished
Out[38]:
GridSearchCV(cv=10, error_score='raise',
       estimator=MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='sgd', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'hidden_layer_sizes': [3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each hyper-parameter.

In [39]:
neural_avg_scores = neural_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [40]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(neural_params[i], neural_avg_scores[i]))
  3 : 0.797
  4 : 0.771
  5 : 0.818
  6 : 0.840
  7 : 0.787

7. Plot the change in accuracy over each hyper-parameter.

In [41]:
plt.plot(
    neural_params, 
    neural_avg_scores)
plt.xlabel("Hidden Layer Nodes")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error for the top-performing model.

In [42]:
neural_top_index = np.argmax(neural_avg_scores)
neural_top_param = neural_params[neural_top_index]
neural_top_score = neural_avg_scores[neural_top_index]
neural_top_error = neural_models.cv_results_["std_test_score"][neural_top_index]

9. Inspect the statistics of the top-performing 10 models.

In [43]:
print("Top nnet model is k = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_param, neural_top_score, neural_top_error))
Top nnet model is k = 6 at 0.84 +/- 0.113 accuracy

8. Evaluate the Models

1. Compare the top three performers numerically.

In [44]:
print("KNN:  {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_score, knn_top_error))
print("Tree: {:0.2f} +/- {:0.3f} accuracy"
    .format(tree_top_score, tree_top_error))
print("NNet: {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_score, neural_top_error))
KNN:  0.97 +/- 0.015 accuracy
Tree: 0.98 +/- 0.020 accuracy
NNet: 0.84 +/- 0.113 accuracy

2. Compare the top-three performing models visually.

In [45]:
plt.errorbar(
    x = [knn_top_score, tree_top_score, neural_top_score],
    y = ["KNN", "Tree", "NNet"],
    xerr = [knn_top_error, tree_top_error, neural_top_error],
    linestyle = "none",
    marker = "o")
plt.xlim(0, 1)
Out[45]:
(0, 1)

3. Question: Which model would you choose based on this information?

9. Test the Final Model

1. Create a final model based on the top-performing algorithm and hyper-parameter.

In [46]:
final_model = DecisionTreeClassifier(
    max_depth = 3)

2. Train the final model using the entire training set.

In [47]:
final_model.fit(
    X = X_train, 
    y = y_train)
Out[47]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

6. Predict the labels of the test set using the hold-out test set.

In [48]:
final_predictions = final_model.predict(X_test)

7. Get the final prediction accuracy.

In [49]:
final_score = accuracy_score(
    y_true = y_test, 
    y_pred = final_predictions)

8. Inspect the final prediction accuracy.

In [50]:
print(final_score)
0.9794344473007712

10. Deploy the Model

Question to be answered: Is Jack (from the Titanic) a high risk or low risk policy?

1. Create an input feature for Jack.

In [51]:
X_jack = pd.DataFrame(
    columns = ["Gender", "State_Rate", "Height", "Weight", "BMI", "Age"],
    data = [[1, 0.09080315, 183, 75, 22.4, 20]])

2. Predict the risk class of Jack.

In [52]:
final_model.predict(X_jack)[0]
Out[52]:
'Low'

3. Predict the probablility that Jack belongs to the above risk class.

In [53]:
final_model.predict_proba(X_jack)[0][1]
Out[53]:
1.0

4. Question: Would you offer life insurance to Jack?