Lab 3B - Regression (Hard)

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Rates.csv file into a data frame called policies.

In [4]:
policies = pd.read_csv("Rates.csv")

2. Explore the Data

1. Inspect the policy rates data set using the head function.
Note: Notice this data set has a numeric Rate variable instead of a categorical Risk variable.

In [5]:
policies.head()
Out[5]:
  Gender State  State_Rate  Height  Weight        BMI  Age      Rate
0   Male    MA    0.100434     184    67.8  20.025992   77  0.332000
1   Male    VA    0.141723     163    89.4  33.648237   82  0.869148
2   Male    NY    0.090803     170    81.2  28.096886   31  0.010000
3   Male    TN    0.119973     175    99.7  32.555102   39  0.021532
4   Male    FL    0.110345     184    72.1  21.296078   68  0.149750

2. Import the matplotlib pyplot library as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the policies data set.
Note: The semicolon at the end prevents text output from being displayed with the plot.

In [7]:
pd.plotting.scatter_matrix(
    frame = policies,
    alpha = 1,
    s = 100,
    diagonal = 'none');
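
Note: pandas documents only 'hist' and 'kde' as options for the diagonal argument; 'none' is undocumented and appears to simply leave the diagonal panels blank. If you prefer a documented option, a sketch with kernel density estimates on the diagonal:

pd.plotting.scatter_matrix(
    frame = policies,
    alpha = 1,
    s = 100,
    diagonal = 'kde');  # documented options: 'hist' or 'kde'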

4. Create a correlation matrix of the policies data set.

In [8]:
correlations = policies.corr()

print(correlations)
            State_Rate    Height    Weight       BMI       Age      Rate
State_Rate    1.000000 -0.016523  0.009233  0.019241  0.112347  0.226852
Height       -0.016523  1.000000  0.238085 -0.316961 -0.164781 -0.128582
Weight        0.009233  0.238085  1.000000  0.839628  0.011679  0.060939
BMI           0.019241 -0.316961  0.839628  1.000000  0.102317  0.140507
Age           0.112347 -0.164781  0.011679  0.102317  1.000000  0.780079
Rate          0.226852 -0.128582  0.060939  0.140507  0.780079  1.000000
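
Note: newer versions of pandas raise an error when corr() encounters non-numeric columns such as Gender and State. If that happens, a sketch that keeps only the numeric columns (numeric_only was added in pandas 1.5):

correlations = policies.corr(numeric_only = True)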

5. Import the seaborn library as "sns".

In [9]:
import seaborn as sns

6. Create a correlogram using the correlation matrix.

In [10]:
sns.heatmap(
    data = correlations,
    cmap = sns.diverging_palette(
        h_neg = 10, 
        h_pos = 220, 
        as_cmap=True));

7. Question: Which variable is most strongly correlated with Rate?
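
Note: one way to check programmatically (not part of the original lab) is to query the correlation matrix directly:

# Variable with the largest absolute correlation to Rate,
# excluding Rate itself
correlations.Rate.drop("Rate").abs().idxmax()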

8. Get the correlation between Age and Rate.

In [11]:
policies.Age \
    .corr(policies.Rate)
Out[11]:
0.7800790487947441

9. Create a scatterplot of Rate (on the y-axis) vs Age (on the x-axis).

In [12]:
plt.scatter(
    x = policies.Age,
    y = policies.Rate)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()

3. Transform the Data

1. Inspect the policies data set.

In [13]:
policies.head()
Out[13]:
  Gender State  State_Rate  Height  Weight        BMI  Age      Rate
0   Male    MA    0.100434     184    67.8  20.025992   77  0.332000
1   Male    VA    0.141723     163    89.4  33.648237   82  0.869148
2   Male    NY    0.090803     170    81.2  28.096886   31  0.010000
3   Male    TN    0.119973     175    99.7  32.555102   39  0.021532
4   Male    FL    0.110345     184    72.1  21.296078   68  0.149750

2. Create a data frame named X containing the feature variables Gender, Age, State_Rate, and BMI.

In [14]:
X = policies[["Gender", "Age", "State_Rate", "BMI"]]

3. Inspect the features X.

In [15]:
X.head()
Out[15]:
  Gender  Age  State_Rate        BMI
0   Male   77    0.100434  20.025992
1   Male   82    0.141723  33.648237
2   Male   31    0.090803  28.096886
3   Male   39    0.119973  32.555102
4   Male   68    0.110345  21.296078

4. Convert the categorical variable Gender into a set of one-hot-encoded variables.

In [16]:
dummies = pd.get_dummies(X.Gender)

5. Inspect the one-hot encoded variables.

In [17]:
dummies.head()
Out[17]:
   Female  Male
0       0     1
1       0     1
2       0     1
3       0     1
4       0     1
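
Note: keeping both Female and Male makes the features perfectly collinear, since the two columns always sum to 1 (the "dummy variable trap"). The lab proceeds with both columns, but if you wanted to avoid the redundancy, a sketch like this should work:

# Drop the first level so a single Male column encodes Gender
# (1 = Male, 0 = Female); dummies_alt is a hypothetical name
dummies_alt = pd.get_dummies(X.Gender, drop_first = True)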

6. Append the one-hot-encoded gender variables to the features data set X.

In [18]:
X = pd.concat([X, dummies], axis = 1)

7. Drop the Gender column from the features data frame X.

In [19]:
X = X.drop("Gender", axis = 1)

8. Inspect the features data frame X.

In [20]:
X.head()
Out[20]:
   Age  State_Rate        BMI  Female  Male
0   77    0.100434  20.025992       0     1
1   82    0.141723  33.648237       0     1
2   31    0.090803  28.096886       0     1
3   39    0.119973  32.555102       0     1
4   68    0.110345  21.296078       0     1

9. Create a series named y containing just the labels (i.e. Rate).

In [21]:
y = policies.Rate

10. Inspect the series of labels y.

In [22]:
y.head()
Out[22]:
0    0.332000
1    0.869148
2    0.010000
3    0.021532
4    0.149750
Name: Rate, dtype: float64

4. Create the Training and Test Set

1. Import the numpy library as "np".

In [24]:
import numpy as np

2. Set the random number seed to 42.

In [25]:
np.random.seed(42)

3. Import the train_test_split function from sklearn.

In [26]:
from sklearn.model_selection import train_test_split

4. Randomly sample 80% of the rows for the training set and 20% of the rows for the test set.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.80,
    test_size = 0.20)

5. Inspect the shape of the training and test sets using the shape property.

In [28]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)
X_train:  (1553, 5)
y_train:  (1553,)
X_test:   (389, 5)
y_test:   (389,)
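
Note: the split is reproducible here because of np.random.seed above, but train_test_split also accepts its own random_state parameter. A self-contained sketch:

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.80,
    test_size = 0.20,
    random_state = 42)  # fixes the split without the global seed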

5. Predict with Simple Linear Regression

1. Import the linear regression class from sklearn.

In [29]:
from sklearn.linear_model import LinearRegression

2. Create a simple linear regression model.

In [30]:
simple_model = LinearRegression()

3. Create a data frame named x1_train containing only the Age feature from the training set.

In [31]:
x1_train = X_train.loc[:, ["Age"]]

4. Create a data frame named x1_test containing only the Age feature from the test set.

In [32]:
x1_test = X_test.loc[:, ["Age"]]

5. Train the model using the training data.
Note: You should be using x1_train as your training data.

In [33]:
simple_model.fit(
    X = x1_train,
    y = y_train)
Out[33]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

6. Draw the regression line on top of a scatterplot of Rate (y-axis) vs Age (x-axis).

In [34]:
plt.scatter(
    x = policies.Age,
    y = policies.Rate,
    color = "grey")
plt.plot(
    x1_test,
    simple_model.predict(
        x1_test),
    color = "blue",
    linewidth = 3)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()

7. Inspect the slope (m) and y-intercept (b) parameter estimates.

In [35]:
print("y-intercept (b): ", simple_model.intercept_)
print("Slope (m):        ", simple_model.coef_[0])
y-intercept (b):  -0.26291950607211056
Slope (m):         0.007888729541323627

8. Question: How do you interpret these two values?
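
Note: as a concrete reading (not part of the original lab), the fitted line predicts y = b + m * x for any age x. For a 40-year-old:

# Predicted rate at age 40 using the estimates printed above
predicted_rate = simple_model.intercept_ + simple_model.coef_[0] * 40
print(predicted_rate)  # about 0.053 with these estimates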

9. Predict the labels of the test set using the model.

In [36]:
simple_predictions = simple_model.predict(x1_test)

10. Visualize the prediction error.

In [37]:
# Plot the training set (grey dots)
plt.scatter(
    x = x1_train.Age,
    y = y_train,
    color = "grey",
    facecolor = "none")

# Plot the predictions (blue x's)
plt.scatter(
    x = x1_test.Age,
    y = simple_predictions,
    color = "blue",
    marker = 'x')

# Plot the correct answer (green dots)
plt.scatter(
    x = x1_test.Age,
    y = y_test,
    color = "green")

# Plot the error (red lines)
plt.plot(
    [x1_test.Age, x1_test.Age],
    [simple_predictions, y_test],
    color = "red",
    zorder = 0)

# Finish the plot
plt.xlabel("Age")
plt.ylabel("Risk")
plt.show()

11. Question: How do you interpret this graph?

12. Compute the root mean squared error (RMSE) of these predictions.

In [38]:
simple_rmse = np.sqrt(np.mean((y_test - simple_predictions) ** 2))

print(simple_rmse)
0.12079653135112772
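
Note: scikit-learn's metrics module computes the same quantity; a sketch that should match the manual calculation above:

from sklearn.metrics import mean_squared_error

print(np.sqrt(mean_squared_error(y_test, simple_predictions)))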

13. Question: Was simple linear regression a good choice for modeling this relationship? Why or why not?

6. Predict with Multiple Linear Regression

1. Create a linear regression model.

In [39]:
multiple_model = LinearRegression()

2. Train the model using all features of the training data.

In [40]:
multiple_model.fit(
    X = X_train,
    y = y_train)
Out[40]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

3. Inspect the parameter estimates.

In [41]:
print("{:<12}: {: .3f}"
    .format("y-intercept", multiple_model.intercept_))

for i, column_name in enumerate(X_train.columns):
    print("{:<12}: {: .3f}".format(
        column_name, 
        multiple_model.coef_[i]))
y-intercept : -0.406
Age         :  0.008
State_Rate  :  0.626
BMI         :  0.002
Female      : -0.018
Male        :  0.018

4. Question: How do you interpret these values?
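
Note: a prediction is just the intercept plus the dot product of the coefficients and a feature row. A sketch (not part of the original lab) verifying this for the first test row:

# Manual prediction should match model.predict for the same row
manual = multiple_model.intercept_ + np.dot(
    multiple_model.coef_,
    X_test.iloc[0])
print(manual)
print(multiple_model.predict(X_test.iloc[[0]])[0])  # same value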

5. Predict output values for the input values in the test set.

In [42]:
multiple_predictions = multiple_model.predict(X_test)

6. Visualize the prediction error.

In [43]:
plt.scatter(
    x = X_train.Age,
    y = y_train,
    color = "black",
    facecolor = "none")
plt.scatter(
    x = X_test.Age,
    y = multiple_predictions,
    color = "blue",
    marker = 'x')
plt.scatter(
    x = X_test.Age,
    y = y_test,
    color = "green")
plt.plot(
    [X_test.Age, X_test.Age],
    [multiple_predictions, y_test],
    color = "red",
    zorder = 0)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()

7. Question: How do you interpret this graph?

8. Compute the root mean squared error (RMSE) of these predictions.

In [44]:
multiple_rmse = np.sqrt(np.mean((y_test - multiple_predictions) ** 2))

print(multiple_rmse)
0.11691051667329788

9. Question: Is this a better predictive model of the data?

7. Predict with a Neural Network Regressor

1. Import the standard scaler from sklearn.

In [45]:
from sklearn.preprocessing import StandardScaler

2. Create standard scalers for the features (X) and labels (y).

In [46]:
X_scaler = StandardScaler()
y_scaler = StandardScaler()

3. Fit the scalers to the full data set.

In [47]:
X_scaler.fit(X)
y_scaler.fit(y.values.reshape(-1, 1))
Out[47]:
StandardScaler(copy=True, with_mean=True, with_std=True)
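
Note: the lab fits both scalers on the full data set. A common alternative is to fit on the training split only, so no information about the test set leaks into the preprocessing; a sketch:

# Fit the scalers on the training data only to avoid leakage
X_scaler.fit(X_train)
y_scaler.fit(y_train.values.reshape(-1, 1))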

4. Scale the training and test data.

In [48]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))

5. Import the neural network regressor class from sklearn.

In [49]:
from sklearn.neural_network import MLPRegressor

6. Create a neural network regressor with 4 hidden nodes, a tanh activation function, an LBFGS solver, and 1000 maximum iterations.

In [50]:
neural_model = MLPRegressor(
    hidden_layer_sizes = (4),
    activation = "tanh",
    solver = "lbfgs",
    max_iter = 1000)
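
Note: hidden_layer_sizes = (4) is just the integer 4 in Python; the single-element tuple is written (4,). The network also starts from random weights, so results vary between runs unless random_state is fixed. An equivalent sketch with both tweaks:

neural_model = MLPRegressor(
    hidden_layer_sizes = (4,),  # explicit one-layer tuple
    activation = "tanh",
    solver = "lbfgs",
    max_iter = 1000,
    random_state = 42)  # makes the weight initialization repeatable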

7. Train the model with the training set.

In [51]:
neural_model.fit(
    X = X_train_scaled,
    y = y_train_scaled.reshape(-1, ))
Out[51]:
MLPRegressor(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=4, learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

8. Predict output values for the test set.

In [52]:
scaled_predictions = neural_model.predict(X_test_scaled)

9. Unscale the predictions.

In [53]:
neural_predictions = y_scaler.inverse_transform(scaled_predictions)
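
Note: newer versions of scikit-learn require a 2-D array for inverse_transform. If the line above raises an error, a reshape sketch like this should work:

# Reshape to a column, unscale, then flatten back to 1-D
neural_predictions = y_scaler.inverse_transform(
    scaled_predictions.reshape(-1, 1)).ravel()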

10. Visualize the prediction error.

In [54]:
plt.scatter(
    x = X_train.Age,
    y = y_train,
    color = "black",
    facecolor = "none")
plt.scatter(
    x = X_test.Age,
    y = neural_predictions,
    color = "blue",
    marker = 'x')
plt.scatter(
    x = X_test.Age,
    y = y_test,
    color = "green")
plt.plot(
    [X_test.Age, X_test.Age],
    [neural_predictions, y_test],
    color = "red",
    zorder = 0)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()

11. Compute the root mean squared error (RMSE) of these predictions.

In [55]:
neural_rmse = np.sqrt(np.mean((y_test - neural_predictions) ** 2))

12. Inspect the RMSE of these predictions.

In [56]:
print(neural_rmse)
0.03612720419871486

8. Evaluate the Regressors

1. Compare all three results.

In [57]:
print("Simple RMSE:   ", simple_rmse)
print("Multiple RMSE: ", multiple_rmse)
print("Neural RMSE:   ", neural_rmse)
Simple RMSE:    0.12079653135112772
Multiple RMSE:  0.11691051667329788
Neural RMSE:    0.03612720419871486
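
Note: if a visual comparison helps, a quick bar chart sketch (not part of the original lab):

plt.bar(
    ["Simple", "Multiple", "Neural"],
    [simple_rmse, multiple_rmse, neural_rmse])
plt.ylabel("RMSE")
plt.show()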

2. Question: Which of these models would you choose? Why?