Lab 3A - Regression

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Iris CSV file into a data frame called iris.

In [4]:
iris = pd.read_csv("Iris.csv")

2. Explore the Data

1. Inspect the iris data set.

In [5]:
iris.head()
Out[5]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

2. Import the matplotlib pyplot library as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the iris data set.
Note: The semicolon at the end prevents text output from being displayed with the plot.

In [7]:
pd.plotting.scatter_matrix(
    frame = iris,
    alpha = 1,
    s = 100,
    diagonal = 'none');

4. Create a correlation matrix of the iris data set.

In [8]:
correlations = iris.corr()

print(correlations)
              Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
Sepal_Length      1.000000    -0.117570      0.871754     0.817941
Sepal_Width      -0.117570     1.000000     -0.428440    -0.366126
Petal_Length      0.871754    -0.428440      1.000000     0.962865
Petal_Width       0.817941    -0.366126      0.962865     1.000000

5. Import the seaborn library as "sns".

In [9]:
import seaborn as sns

6. Create a correlogram using the correlation matrix.

In [10]:
sns.heatmap(
    data = correlations,
    cmap = sns.diverging_palette(
        h_neg = 10, 
        h_pos = 220, 
        as_cmap=True));
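
Note: seaborn's heatmap also accepts an annot parameter that prints each correlation value in its cell; an optional variant:

sns.heatmap(
    data = correlations,
    annot = True,
    cmap = sns.diverging_palette(
        h_neg = 10,
        h_pos = 220,
        as_cmap = True));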

7. Question: Which variable is most strongly correlated with Petal Width?

8. Get the correlation between petal length and width.

In [11]:
iris.Petal_Length \
    .corr(iris.Petal_Width)
Out[11]:
0.9628654314027961
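
Note: The corr method computes the Pearson correlation coefficient by default. As a sanity check, this sketch reproduces the value by hand (the variable names pl and pw are illustrative):

pl = iris.Petal_Length
pw = iris.Petal_Width

# Pearson r = covariance / (std of pl * std of pw), using population (ddof = 0) statistics
r = ((pl - pl.mean()) * (pw - pw.mean())).mean() / (pl.std(ddof = 0) * pw.std(ddof = 0))
print(r)  # approx. 0.9629, matching the output above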

9. Create a scatterplot of petal width (y) vs. petal length (x).

In [12]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

3. Transform the Data

1. Inspect the iris data set.

In [13]:
iris.head()
Out[13]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

2. Create a data frame named X containing all variables except for petal width.

In [14]:
X = iris.iloc[:, iris.columns != "Petal_Width"]

3. Inspect the features X.

In [15]:
X.head()
Out[15]:
   Sepal_Length  Sepal_Width  Petal_Length Species
0           5.1          3.5           1.4  setosa
1           4.9          3.0           1.4  setosa
2           4.7          3.2           1.3  setosa
3           4.6          3.1           1.5  setosa
4           5.0          3.6           1.4  setosa

4. Convert the categorical variable Species into a set of one-hot encoded variables.

In [16]:
dummies = pd.get_dummies(X.Species)

5. Inspect the one-hot encoded variables.

In [17]:
dummies.head()
Out[17]:
   setosa  versicolor  virginica
0       1           0          0
1       1           0          0
2       1           0          0
3       1           0          0
4       1           0          0
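
Note: For a linear model, one of the three dummy columns is redundant, because the three always sum to 1 (the "dummy variable trap"). This lab keeps all three columns, but pd.get_dummies can drop one level if you prefer; an optional variant:

# Optional: drop the first level to avoid perfectly collinear dummy columns
dummies_reduced = pd.get_dummies(X.Species, drop_first = True)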

6. Append the one-hot-encoded species variables to the features data set X.

In [18]:
X = pd.concat([X, dummies], axis = 1)

7. Drop the Species column from the features data frame X.

In [19]:
X = X.drop("Species", axis = 1)

8. Inspect the features data frame X.

In [20]:
X.head()
Out[20]:
   Sepal_Length  Sepal_Width  Petal_Length  setosa  versicolor  virginica
0           5.1          3.5           1.4       1           0          0
1           4.9          3.0           1.4       1           0          0
2           4.7          3.2           1.3       1           0          0
3           4.6          3.1           1.5       1           0          0
4           5.0          3.6           1.4       1           0          0

9. Create a series named y containing just the labels (i.e. Petal_Width).

In [21]:
y = iris.Petal_Width

10. Inspect the series of labels y.

In [22]:
y.head()
Out[22]:
0    0.2
1    0.2
2    0.2
3    0.2
4    0.2
Name: Petal_Width, dtype: float64

4. Create the Training and Test Set

1. Import the numpy library as "np".

In [23]:
import numpy as np

2. Set the random number seed to 234.

In [24]:
np.random.seed(234)

3. Import the train_test_split function from sklearn.

In [25]:
from sklearn.model_selection import train_test_split

4. Randomly sample 80% of the rows for the training set and 20% of the rows for the test set.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.80,
    test_size = 0.20)
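
Note: When random_state is not given, train_test_split draws from numpy's global random state, which is why the seed set in step 2 makes this split reproducible. A more self-contained sketch passes a seed directly (random_state = 234 is illustrative and may not reproduce the exact split above):

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.80,
    test_size = 0.20,
    random_state = 234)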

5. Inspect the shape of the training and test sets using the shape property.

In [27]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)
X_train:  (120, 6)
y_train:  (120,)
X_test:   (30, 6)
y_test:   (30,)

5. Predict with Simple Linear Regression

1. Import the linear regression class from sklearn.

In [28]:
from sklearn.linear_model import LinearRegression

2. Create a simple linear regression model.

In [29]:
simple_model = LinearRegression()

3. Create a data frame named x1_train containing only the petal length feature from the training set.

In [30]:
x1_train = X_train.loc[:, ["Petal_Length"]]

4. Create a data frame named x1_test containing only the petal length feature from the test set.

In [31]:
x1_test = X_test.loc[:, ["Petal_Length"]]

5. Train the model using the training data.
Note: You should be using x1_train as your training data.

In [32]:
simple_model.fit(
    X = x1_train,
    y = y_train)
Out[32]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

6. Draw the regression line on top of a scatterplot of petal width (y) vs. petal length (x).

In [33]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = "black")
plt.plot(
    x1_test,
    simple_model.predict(
        x1_test),
    color = "blue",
    linewidth = 3)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

7. Inspect the slope (m) and y-intercept (b) parameter estimates.

In [34]:
print("y-intercept (b): ", simple_model.intercept_)
print("Slope (m):        ", simple_model.coef_[0])
y-intercept (b):  -0.3486959626278219
Slope (m):         0.40736325883164276

8. Question: How do you interpret these two values?
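
Note: The two values define the fitted line y = mx + b. As an illustrative check, this sketch applies them by hand (a petal length of 4.0 is an arbitrary input, not part of the lab):

# Manual prediction: petal width = slope * petal length + y-intercept
print(simple_model.coef_[0] * 4.0 + simple_model.intercept_)  # approx. 1.28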

9. Predict the labels of the test set using the model.

In [35]:
simple_predictions = simple_model.predict(x1_test)

10. Visualize the prediction error.

In [36]:
# Plot the training set (hollow black circles)
plt.scatter(
    x = x1_train.Petal_Length,
    y = y_train,
    color = "black",
    facecolor = "none")

# Plot the predictions (blue x marks)
plt.scatter(
    x = x1_test.Petal_Length,
    y = simple_predictions,
    color = "blue",
    marker = 'x')

# Plot the correct answer (green dots)
plt.scatter(
    x = x1_test.Petal_Length,
    y = y_test,
    color = "green")

# Plot the error (red lines)
plt.plot(
    [x1_test.Petal_Length, x1_test.Petal_Length],
    [simple_predictions, y_test],
    color = "red",
    zorder = 0)

# Finish the plot
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

11. Question: How do you interpret this graph?

12. Compute the root mean squared error (RMSE) of these predictions.

In [37]:
simple_rmse = np.sqrt(np.mean((y_test - simple_predictions) ** 2))

print(simple_rmse)
0.24794604821207347
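
Note: scikit-learn's mean_squared_error computes the same quantity before the square root; an equivalent check:

from sklearn.metrics import mean_squared_error

# RMSE is the square root of the mean squared error
print(np.sqrt(mean_squared_error(y_test, simple_predictions)))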

6. Predict with Multiple Linear Regression

1. Create a linear regression model.

In [38]:
multiple_model = LinearRegression()

2. Train the model using all features of the training data.

In [39]:
multiple_model.fit(
    X = X_train,
    y = y_train)
Out[39]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

3. Inspect the parameter estimates.

In [40]:
print("{:<12}: {: .3f}"
    .format("y-intercept", multiple_model.intercept_))

for i, column_name in enumerate(X_train.columns):
    print("{:<12}: {: .3f}".format(
        column_name, 
        multiple_model.coef_[i]))
y-intercept :  0.108
Sepal_Length: -0.116
Sepal_Width :  0.270
Petal_Length:  0.249
setosa      : -0.564
versicolor  :  0.103
virginica   :  0.461

4. Question: How do you interpret these values?
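
Note: A multiple regression prediction is the y-intercept plus the dot product of the coefficients with a row of feature values. A minimal sketch checking this against the model (the first test row is an arbitrary choice):

# Manual prediction for the first test row: b + sum(coefficient_i * feature_i)
manual_prediction = multiple_model.intercept_ + np.dot(
    multiple_model.coef_,
    X_test.iloc[0])
print(manual_prediction)
print(multiple_model.predict(X_test.iloc[[0]])[0])  # should match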

5. Predict output values for the input values in the test set.

In [41]:
multiple_predictions = multiple_model.predict(X_test)

6. Visualize the prediction error.

In [42]:
plt.scatter(
    x = X_train.Petal_Length,
    y = y_train,
    color = "black",
    facecolor = "none")
plt.scatter(
    x = X_test.Petal_Length,
    y = multiple_predictions,
    color = "blue",
    marker = 'x')
plt.scatter(
    x = X_test.Petal_Length,
    y = y_test,
    color = "green")
plt.plot(
    [X_test.Petal_Length, X_test.Petal_Length],
    [multiple_predictions, y_test],
    color = "red",
    zorder = 0)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

7. Question: How do you interpret this graph?

8. Compute the root mean squared error (RMSE) of these predictions.

In [43]:
multiple_rmse = np.sqrt(np.mean((y_test - multiple_predictions) ** 2))

print(multiple_rmse)
0.20398495060494315

9. Question: How do you interpret this value?

7. Predict with a Neural Network Regressor

1. Import the standard scaler from sklearn.

In [44]:
from sklearn.preprocessing import StandardScaler

2. Create standard scalers for the features (X) and the labels (y).

In [45]:
X_scaler = StandardScaler()
y_scaler = StandardScaler()

3. Fit the scalers to the full data set (X and y).

In [46]:
X_scaler.fit(X)
y_scaler.fit(y.values.reshape(-1, 1))
Out[46]:
StandardScaler(copy=True, with_mean=True, with_std=True)

4. Scale the training and test data.

In [47]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
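
Note: StandardScaler standardizes each column as z = (x - mean) / standard deviation, using the statistics learned in step 3 and stored in the mean_ and scale_ attributes. A minimal sketch verifying one column by hand (Sepal_Length, column 0, is an illustrative choice):

# Standardize Sepal_Length manually and compare with the scaler's output
manual_scaled = (X_train.Sepal_Length.values - X_scaler.mean_[0]) / X_scaler.scale_[0]
print(manual_scaled[:3])
print(X_train_scaled[:3, 0])  # should match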

5. Import the neural network regressor class from sklearn.

In [48]:
from sklearn.neural_network import MLPRegressor

6. Create a neural network regressor with 4 hidden nodes, a tanh activation function, an LBFGS solver, and 1000 maximum iterations.

In [49]:
neural_model = MLPRegressor(
    hidden_layer_sizes = (4),
    activation = "tanh",
    solver = "lbfgs",
    max_iter = 1000)

7. Train the model with the training set.

In [50]:
neural_model.fit(
    X = X_train_scaled,
    y = y_train_scaled.reshape(-1, ))
Out[50]:
MLPRegressor(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=4, learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

8. Predict output values for the test set.

In [51]:
scaled_predictions = neural_model.predict(X_test_scaled)

9. Unscale the predictions.

In [52]:
neural_predictions = y_scaler.inverse_transform(scaled_predictions)

10. Visualize the prediction error.

In [53]:
plt.scatter(
    x = X_train.Petal_Length,
    y = y_train,
    color = "black",
    facecolor = "none")
plt.scatter(
    x = X_test.Petal_Length,
    y = neural_predictions,
    color = "blue",
    marker = 'x')
plt.scatter(
    x = X_test.Petal_Length,
    y = y_test,
    color = "green")
plt.plot(
    [X_test.Petal_Length, X_test.Petal_Length],
    [neural_predictions, y_test],
    color = "red",
    zorder = 0)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

11. Compute the root mean squared error (RMSE) of these predictions.

In [54]:
neural_rmse = np.sqrt(np.mean((y_test - neural_predictions) ** 2))

12. Inspect the RMSE of these predictions.

In [55]:
print(neural_rmse)
0.18341008377583856

8. Evaluate the Regressors

1. Compare all three results.

In [56]:
print("Simple RMSE:   ", simple_rmse)
print("Multiple RMSE: ", multiple_rmse)
print("Neural RMSE:   ", neural_rmse)
Simple RMSE:    0.24794604821207347
Multiple RMSE:  0.20398495060494315
Neural RMSE:    0.18341008377583856

2. Question: Which of these models would you choose? Why?