Lab 2B - Classification (Hard)

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory to "C:\Workshop\Data".

In [2]:
os.chdir(r"C:\Workshop\Data")  # raw string so the backslashes are not treated as escape sequences

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Risk.csv file into a data frame named policies.

In [4]:
policies = pd.read_csv("Risk.csv")

2. Explore the Data

1. Inspect the policies data set with the head function.

In [5]:
policies.head()
Out[5]:
  Gender State  State_Rate  Height  Weight        BMI  Age  Risk
0   Male    MA    0.100434     184    67.8  20.025992   77  High
1   Male    VA    0.141723     163    89.4  33.648237   82  High
2   Male    NY    0.090803     170    81.2  28.096886   31   Low
3   Male    TN    0.119973     175    99.7  32.555102   39   Low
4   Male    FL    0.110345     184    72.1  21.296078   68  High
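
Note: before plotting, two standard pandas summaries give a quick (optional) overview of column types and value ranges:

policies.info()       # column names, dtypes, non-null counts
policies.describe()   # summary statistics for the numeric columns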

2. Import the matplotlib.pyplot library as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a color palette containing two colors for Low and High risk.

In [7]:
palette = {
    'Low':'#fb8072', 
    'High':'#80b1d3'}

4. Map the colors to each risk category.

In [8]:
colors = policies.Risk.apply(lambda x:palette[x])
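
Note: pandas can perform the same lookup with Series.map, which takes the palette dictionary directly (equivalent here, since Risk contains only "Low" and "High"):

colors = policies.Risk.map(palette)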

5. Create a scatterplot matrix of the policies data set colored by risk.
Note: the trailing semicolon suppresses the text output, so only the plot is displayed.

In [9]:
pd.plotting.scatter_matrix(
    frame = policies,
    color = colors,
    alpha = 1,
    s = 100,
    diagonal = "none");
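
Note: diagonal = "none" leaves the diagonal cells blank; passing diagonal = "kde" (or "hist") would instead draw each variable's distribution there:

pd.plotting.scatter_matrix(
    frame = policies,
    color = colors,
    alpha = 1,
    s = 100,
    diagonal = "kde");  # density estimate on the diagonal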

6. Create a scatterplot of BMI (on the y-axis) vs Age (on the x-axis) colored by risk.

In [10]:
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

3. Transform the Data

1. Create a data frame named X containing the features (i.e. Age, BMI, Gender, State_Rate).

In [11]:
X = policies.loc[:, ["Age", "BMI", "Gender", "State_Rate"]]

2. Inspect the features data frame X using the head function.

In [12]:
X.head()
Out[12]:
   Age        BMI Gender  State_Rate
0   77  20.025992   Male    0.100434
1   82  33.648237   Male    0.141723
2   31  28.096886   Male    0.090803
3   39  32.555102   Male    0.119973
4   68  21.296078   Male    0.110345

3. Encode the categorical Gender variable {Female, Male} as integers {0, 1}.

In [13]:
X["Gender"] = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
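
Note: assuming Gender takes only the values "Female" and "Male", an equivalent vectorized encoding avoids the row-by-row lambda:

X["Gender"] = (policies.Gender == "Male").astype(int)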

4. Inspect the new Gender encoding using the head function.

In [14]:
X.Gender.head()
Out[14]:
0    1
1    1
2    1
3    1
4    1
Name: Gender, dtype: int64

5. Create a series named y containing the Risk labels.

In [15]:
y = policies.Risk

6. Inspect the series of labels y using the head function.

In [16]:
y.head()
Out[16]:
0    High
1    High
2     Low
3     Low
4    High
Name: Risk, dtype: object

4. Create the Training and Test Set

1. Import the numpy library as "np".

In [17]:
import numpy as np

2. Set the random number seed to 42.

In [18]:
np.random.seed(42)

3. Import the train_test_split function from sklearn.

In [19]:
from sklearn.model_selection import train_test_split

4. Randomly sample 80% of the rows for the training set and 20% for the test set.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.8,
    test_size = 0.2)
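
Note: because no random_state is given, train_test_split draws on NumPy's global state seeded above; passing the seed explicitly makes the split reproducible regardless of earlier random calls:

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.8,
    test_size = 0.2,
    random_state = 42)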

5. Inspect the shape of the training and test sets using their shape property.

In [21]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)
X_train:  (1553, 4)
y_train:  (1553,)
X_test:   (389, 4)
y_test:   (389,)
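
Note: the row counts can be checked against the intended 80/20 split of the 1,942 policies:

print(len(X_train) / len(X))  # ~0.8
print(len(X_test) / len(X))   # ~0.2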

6. Question: How do you interpret these shapes in terms of columns and rows?

5. Predict with K-Nearest Neighbors

1. Import the KNN classifier class from sklearn.

In [22]:
from sklearn.neighbors import KNeighborsClassifier

2. Create a KNN model with k = 3.

In [23]:
knn_model = KNeighborsClassifier(
    n_neighbors = 3)

3. Train the model using the training data.

In [24]:
knn_model.fit(
    X = X_train,
    y = y_train)
Out[24]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

4. Predict the labels of the test set using the model.

In [25]:
knn_predictions = knn_model.predict(X_test)

5. Create a confusion matrix for the predictions.

In [26]:
pd.crosstab(
    y_test, 
    knn_predictions, 
    rownames = ['Reference'], 
    colnames = ['Predicted'])
Out[26]:
Predicted  High  Low
Reference
High        108    7
Low           5  269
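
Note: sklearn.metrics provides an equivalent confusion matrix; the labels argument pins the row and column order:

from sklearn.metrics import confusion_matrix
confusion_matrix(
    y_true = y_test,
    y_pred = knn_predictions,
    labels = ["High", "Low"])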

6. Import the accuracy_score function from sklearn.

In [27]:
from sklearn.metrics import accuracy_score

7. Get the prediction accuracy.

In [28]:
knn_score = accuracy_score(
    y_true = y_test,
    y_pred = knn_predictions)

8. Inspect the prediction accuracy.

In [29]:
print(knn_score)
0.9691516709511568
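
Note: this matches the confusion matrix above: 108 + 269 = 377 correct predictions out of 389 test rows.

print((108 + 269) / 389)  # 0.9691516709511568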

9. Visualize the KNN predictions, with correct predictions shown in black and incorrect predictions in red.

In [30]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == knn_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

10. Question: Why do you think these data points were misclassified?

6. Predict with a Decision Tree Classifier

1. Import the decision tree classifier from sklearn.

In [31]:
from sklearn.tree import DecisionTreeClassifier

2. Create a decision tree classifier with max_depth = 3.

In [32]:
tree_model = DecisionTreeClassifier(
    max_depth = 3)

3. Train the model using the training data.

In [33]:
tree_model.fit(
    X = X_train, 
    y = y_train)
Out[33]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

4. Import the tree visualizer from sklearn.

In [34]:
from sklearn.tree import export_graphviz

5. Visualize the decision tree.

In [35]:
import graphviz
tree_graph = export_graphviz(
    decision_tree = tree_model,
    feature_names = list(X_train.columns.values),  
    class_names = list(y_train.unique()), 
    out_file = None) 
graphviz.Source(tree_graph)
Out[35]:
Rendered decision tree (reconstructed as text from the graphviz figure):

Age <= 64.5  [gini 0.417, samples 1553, value [461, 1092], class High]
    True:  BMI <= 34.695  [gini 0.075, samples 1109, value [43, 1066], class High]
        True:  Age <= 60.5  [gini 0.006, samples 927, value [3, 924], class High]
            True:  leaf  [gini 0.0, samples 850, value [0, 850], class High]
            False: leaf  [gini 0.075, samples 77, value [3, 74], class High]
        False: Age <= 55.5  [gini 0.343, samples 182, value [40, 142], class High]
            True:  leaf  [gini 0.031, samples 129, value [2, 127], class High]
            False: leaf  [gini 0.406, samples 53, value [38, 15], class Low]
    False: Age <= 69.5  [gini 0.11, samples 444, value [418, 26], class Low]
        True:  Gender <= 0.5  [gini 0.411, samples 90, value [64, 26], class Low]
            True:  leaf  [gini 0.455, samples 40, value [14, 26], class High]
            False: leaf  [gini 0.0, samples 50, value [50, 0], class Low]
        False: leaf  [gini 0.0, samples 354, value [354, 0], class Low]
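
Note: if graphviz is not available, scikit-learn 0.21+ can print the same structure as plain text with export_text:

from sklearn.tree import export_text
print(export_text(
    tree_model,
    feature_names = list(X_train.columns)))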

6. Question: Are you able to read and follow the logic of this decision tree?

7. Predict the labels of the test set with the model.

In [36]:
tree_predictions = tree_model.predict(X_test)

8. Get the prediction accuracy.

In [37]:
tree_score = accuracy_score(
    y_true = y_test, 
    y_pred = tree_predictions)

9. Inspect the prediction accuracy.

In [38]:
print(tree_score)
0.9665809768637532

10. Visualize the prediction errors (in red).

In [39]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == tree_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

7. Predict with a Neural Network Classifier

1. Import the standard scaler from sklearn.

In [40]:
from sklearn.preprocessing import StandardScaler

2. Create a standard scaler.

In [41]:
scaler = StandardScaler()

3. Fit the scaler to the full feature matrix X.

In [42]:
scaler.fit(X)
Out[42]:
StandardScaler(copy=True, with_mean=True, with_std=True)
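
Note: fitting the scaler on all of X lets the test rows influence the estimated means and standard deviations, a mild form of leakage. A stricter variant fits on the training set only:

scaler = StandardScaler()
scaler.fit(X_train)  # statistics come from the training rows only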

4. Scale the training and test set.

In [43]:
X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)

5. Import the neural network classifier from sklearn.

In [44]:
from sklearn.neural_network import MLPClassifier

6. Create a neural network classifier with a single hidden layer of 4 tanh units.

In [45]:
neural_model = MLPClassifier(
    hidden_layer_sizes = (4),
    activation = "tanh",
    max_iter = 2000)
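
Note: hidden_layer_sizes expects a tuple of layer widths, and (4) is just the integer 4, so this model has one hidden layer with 4 units. Two hidden layers of 4 units each would be written as:

neural_model = MLPClassifier(
    hidden_layer_sizes = (4, 4),
    activation = "tanh",
    max_iter = 2000)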

7. Train the model using the training data.

In [46]:
neural_model.fit(
    X = X_train_scaled, 
    y = y_train)
Out[46]:
MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=4, learning_rate='constant',
       learning_rate_init=0.001, max_iter=2000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

8. Predict the test set labels using the model.

In [47]:
neural_predictions = neural_model.predict(X_test_scaled)

9. Get the prediction accuracy.

In [48]:
neural_score = accuracy_score(
    y_true = y_test, 
    y_pred = neural_predictions)

10. Inspect the prediction accuracy.

In [49]:
print(neural_score)
0.9717223650385605

11. Visualize the prediction errors (in red).

In [50]:
plt.scatter(
    x = X_test.Age,
    y = X_test.BMI,
    color = np.where(
        y_test == neural_predictions, 
        'black', 
        'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

8. Evaluate the Classifiers

1. Compare the accuracy of all three models.

In [51]:
print("KNN: ", knn_score)
print("Tree:", tree_score)
print("NNet:", neural_score)
KNN:  0.9691516709511568
Tree: 0.9665809768637532
NNet: 0.9717223650385605
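
Note: collecting the scores in a pandas Series makes the comparison easy to sort:

scores = pd.Series({
    "KNN": knn_score,
    "Tree": tree_score,
    "NNet": neural_score})
print(scores.sort_values(ascending = False))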

2. Question: Which of these three classifiers would you choose? Why?