Lab 2B - Classification (Hard)

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory to "C:\Workshop\Data".

In [2]:
os.chdir(r"C:\Workshop\Data")  # raw string so the backslashes are not treated as escape sequences

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Risk.csv file into a data frame named policies.

In [4]:
policies = pd.read_csv("Risk.csv")

2. Explore the Data

1. Inspect the policies data set with the head function.

In [5]:
policies.head()
Out[5]:
  Gender State  State_Rate  Height  Weight        BMI  Age  Risk
0   Male    MA    0.100434     184    67.8  20.025992   77  High
1   Male    VA    0.141723     163    89.4  33.648237   82  High
2   Male    NY    0.090803     170    81.2  28.096886   31   Low
3   Male    TN    0.119973     175    99.7  32.555102   39   Low
4   Male    FL    0.110345     184    72.1  21.296078   68  High
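
Note: before plotting, two standard pandas summaries give a quick (optional) overview of column types and value ranges:

policies.info()       # column names, dtypes, non-null counts
policies.describe()   # summary statistics for the numeric columns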

2. Import the matplotlib.pyplot library as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a color palette containing two colors for Low and High risk.

In [7]:
palette = {
    'Low':'#fb8072', 
    'High':'#80b1d3'}

4. Map the colors to each risk category.

In [8]:
colors = policies.Risk.apply(lambda x:palette[x])
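
Note: pandas can perform the same lookup with Series.map, which takes the palette dictionary directly (equivalent here, since Risk contains only "Low" and "High"):

colors = policies.Risk.map(palette)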

5. Create a scatterplot matrix of the policies data set colored by risk.
Note: the trailing semicolon suppresses the text output, so only the plot is displayed.

In [9]:
pd.plotting.scatter_matrix(
    frame = policies,
    color = colors,
    alpha = 1,
    s = 100,
    diagonal = "none");
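
Note: diagonal = "none" leaves the diagonal cells blank; passing diagonal = "kde" (or "hist") would instead draw each variable's distribution there:

pd.plotting.scatter_matrix(
    frame = policies,
    color = colors,
    alpha = 1,
    s = 100,
    diagonal = "kde");  # density estimate on the diagonal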

6. Create a scatterplot of BMI (on the y-axis) vs Age (on the x-axis) colored by risk.

In [10]:
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

3. Transform the Data

1. Create a data frame named X containing the features (i.e. Age, BMI, Gender, State_Rate).

In [11]:
X = policies.loc[:, ["Age", "BMI", "Gender", "State_Rate"]]

2. Inspect the features data frame X using the head function.

In [12]:
X.head()
Out[12]:
   Age        BMI Gender  State_Rate
0   77  20.025992   Male    0.100434
1   82  33.648237   Male    0.141723
2   31  28.096886   Male    0.090803
3   39  32.555102   Male    0.119973
4   68  21.296078   Male    0.110345

3. Encode the categorical Gender variable {Female, Male} as integers {0, 1}.

In [13]:
X["Gender"] = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
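
Note: assuming Gender takes only the values "Female" and "Male", an equivalent vectorized encoding avoids the row-by-row lambda:

X["Gender"] = (policies.Gender == "Male").astype(int)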

4. Inspect the new Gender encoding using the head function.

In [14]:
X.Gender.head()
Out[14]:
0    1
1    1
2    1
3    1
4    1
Name: Gender, dtype: int64

5. Create a series named y containing the Risk labels.

In [15]:
y = policies.Risk

6. Inspect the series of labels y using the head function.

In [16]:
y.head()
Out[16]:
0    High
1    High
2     Low
3     Low
4    High
Name: Risk, dtype: object

4. Create the Training and Test Set

1. Import the numpy library as "np".

In [17]:
import numpy as np

2. Set the random number seed to 42.

In [18]:
np.random.seed(42)

3. Import the train_test_split function from sklearn.

In [19]:
from sklearn.model_selection import train_test_split

4. Randomly sample 80% of the rows for the training set and 20% for the test set.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.8,
    test_size = 0.2)
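
Note: because no random_state is given, train_test_split draws on NumPy's global state seeded above; passing the seed explicitly makes the split reproducible regardless of earlier random calls:

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.8,
    test_size = 0.2,
    random_state = 42)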

5. Inspect the shape of the training and test sets using their shape property.

In [21]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)
X_train:  (1553, 4)
y_train:  (1553,)
X_test:   (389, 4)
y_test:   (389,)
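
Note: the row counts can be checked against the intended 80/20 split of the 1,942 policies:

print(len(X_train) / len(X))  # ~0.8
print(len(X_test) / len(X))   # ~0.2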

6. Question: How do you interpret these shapes in terms of columns and rows?

5. Predict with K-Nearest Neighbors

1. Import the KNN classifier class from sklearn.

In [22]:
from sklearn.neighbors import KNeighborsClassifier

2. Create a KNN model with k = 3.

In [23]:
knn_model = KNeighborsClassifier(
    n_neighbors = 3)

3. Train the model using the training data.

In [24]:
knn_model.fit(
    X = X_train,
    y = y_train)
Out[24]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

4. Predict the labels of the test set using the model.

In [25]:
knn_predictions = knn_model.predict(X_test)

5. Create a confusion matrix for the predictions.

In [26]:
pd.crosstab(
    y_test, 
    knn_predictions, 
    rownames = ['Reference'], 
    colnames = ['Predicted'])
Out[26]:
Predicted  High  Low
Reference
High        108    7
Low           5  269
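
Note: sklearn.metrics provides an equivalent confusion matrix; the labels argument pins the row and column order:

from sklearn.metrics import confusion_matrix
confusion_matrix(
    y_true = y_test,
    y_pred = knn_predictions,
    labels = ["High", "Low"])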

6. Import the accuracy_score function from sklearn.

In [27]:
from sklearn.metrics import accuracy_score

7. Get the prediction accuracy.

In [28]:
knn_score = accuracy_score(
    y_true = y_test,
    y_pred = knn_predictions)

8. Inspect the prediction accuracy.

In [29]:
print(knn_score)
0.9691516709511568
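
Note: this matches the confusion matrix above: 108 + 269 = 377 correct predictions out of 389 test rows.

print((108 + 269) / 389)  # 0.9691516709511568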

9. Visualize the KNN predictions, with correct predictions shown in black and incorrect predictions in red.

In [30]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == knn_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

10. Question: Why do you think these data points were misclassified?

6. Predict with a Decision Tree Classifier

1. Import the decision tree classifier from sklearn.

In [31]:
from sklearn.tree import DecisionTreeClassifier

2. Create a decision tree classifier with max_depth = 3.

In [32]:
tree_model = DecisionTreeClassifier(
    max_depth = 3)

3. Train the model using the training data.

In [33]:
tree_model.fit(
    X = X_train, 
    y = y_train)
Out[33]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

4. Import the tree visualizer from sklearn.

In [34]:
from sklearn.tree import export_graphviz

5. Visualize the decision tree.

In [35]:
import graphviz
tree_graph = export_graphviz(
    decision_tree = tree_model,
    feature_names = list(X_train.columns.values),  
    class_names = list(y_train.unique()), 
    out_file = None) 
graphviz.Source(tree_graph)
Out[35]:
Rendered decision tree (reconstructed as text from the graphviz figure):

Age <= 64.5  [gini 0.417, samples 1553, value [461, 1092], class High]
    True:  BMI <= 34.695  [gini 0.075, samples 1109, value [43, 1066], class High]
        True:  Age <= 60.5  [gini 0.006, samples 927, value [3, 924], class High]
            True:  leaf  [gini 0.0, samples 850, value [0, 850], class High]
            False: leaf  [gini 0.075, samples 77, value [3, 74], class High]
        False: Age <= 55.5  [gini 0.343, samples 182, value [40, 142], class High]
            True:  leaf  [gini 0.031, samples 129, value [2, 127], class High]
            False: leaf  [gini 0.406, samples 53, value [38, 15], class Low]
    False: Age <= 69.5  [gini 0.11, samples 444, value [418, 26], class Low]
        True:  Gender <= 0.5  [gini 0.411, samples 90, value [64, 26], class Low]
            True:  leaf  [gini 0.455, samples 40, value [14, 26], class High]
            False: leaf  [gini 0.0, samples 50, value [50, 0], class Low]
        False: leaf  [gini 0.0, samples 354, value [354, 0], class Low]
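
Note: if graphviz is not available, scikit-learn 0.21+ can print the same structure as plain text with export_text:

from sklearn.tree import export_text
print(export_text(
    tree_model,
    feature_names = list(X_train.columns)))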

6. Question: Are you able to read and follow the logic of this decision tree?

7. Predict the labels of the test set with the model.

In [36]:
tree_predictions = tree_model.predict(X_test)

8. Get the prediction accuracy.

In [37]:
tree_score = accuracy_score(
    y_true = y_test, 
    y_pred = tree_predictions)

9. Inspect the prediction accuracy.

In [38]:
print(tree_score)
0.9665809768637532

10. Visualize the prediction errors (in red).

In [39]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == tree_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

7. Predict with a Neural Network Classifier

1. Import the standard scaler from sklearn.

In [40]:
from sklearn.preprocessing import StandardScaler

2. Create a standard scaler.

In [41]:
scaler = StandardScaler()

3. Fit the scaler to the full feature matrix X.

In [42]:
scaler.fit(X)
Out[42]:
StandardScaler(copy=True, with_mean=True, with_std=True)
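
Note: fitting the scaler on all of X lets the test rows influence the estimated means and standard deviations, a mild form of leakage. A stricter variant fits on the training set only:

scaler = StandardScaler()
scaler.fit(X_train)  # statistics come from the training rows only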

4. Scale the training and test set.

In [43]:
X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)

5. Import the neural network classifier from sklearn.

In [44]:
from sklearn.neural_network import MLPClassifier

6. Create a neural network classifier with a single hidden layer of 4 tanh units.

In [45]:
neural_model = MLPClassifier(
    hidden_layer_sizes = (4),
    activation = "tanh",
    max_iter = 2000)
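
Note: hidden_layer_sizes expects a tuple of layer widths, and (4) is just the integer 4, so this model has one hidden layer with 4 units. Two hidden layers of 4 units each would be written as:

neural_model = MLPClassifier(
    hidden_layer_sizes = (4, 4),
    activation = "tanh",
    max_iter = 2000)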

7. Train the model using the training data.

In [46]:
neural_model.fit(
    X = X_train_scaled, 
    y = y_train)
Out[46]:
MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=4, learning_rate='constant',
       learning_rate_init=0.001, max_iter=2000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

8. Predict the test set labels using the model.

In [47]:
neural_predictions = neural_model.predict(X_test_scaled)

9. Get the prediction accuracy.

In [48]:
neural_score = accuracy_score(
    y_true = y_test, 
    y_pred = neural_predictions)

10. Inspect the prediction accuracy.

In [49]:
print(neural_score)
0.9717223650385605

11. Visualize the prediction errors (in red).

In [50]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == neural_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

8. Evaluate the Classifiers

1. Compare the accuracy of all three models.

In [51]:
print("KNN: ", knn_score)
print("Tree:", tree_score)
print("NNet:", neural_score)
KNN:  0.9691516709511568
Tree: 0.9665809768637532
NNet: 0.9717223650385605
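
Note: collecting the scores in a pandas Series makes the comparison easy to sort:

scores = pd.Series({
    "KNN": knn_score,
    "Tree": tree_score,
    "NNet": neural_score})
print(scores.sort_values(ascending = False))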

2. Question: Which of these three classifiers would you choose? Why?