Lab 4B - Clustering

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Rates.csv file into a data frame called policies.

In [4]:
policies = pd.read_csv("Rates.csv")

2. Explore the Data

1. Inspect the policies data frame using the head function.

In [5]:
policies.head()
Out[5]:
Gender State State_Rate Height Weight BMI Age Rate
0 Male MA 0.100434 184 67.8 20.025992 77 0.332000
1 Male VA 0.141723 163 89.4 33.648237 82 0.869148
2 Male NY 0.090803 170 81.2 28.096886 31 0.010000
3 Male TN 0.119973 175 99.7 32.555102 39 0.021532
4 Male FL 0.110345 184 72.1 21.296078 68 0.149750

2. Import pyplot from matplotlib as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the data set.

In [7]:
pd.plotting.scatter_matrix(
    frame = policies,
    alpha = 1,
    s = 100,
    diagonal = 'none');

4. Question: Do you see any natural clusters in these data? How many?

3. Transform the Data

1. Create a data frame of features for clustering (omit the categorical State column).

In [8]:
X = policies.iloc[:, policies.columns != "State"]

2. Inspect the features using the head function.

In [9]:
X.head()
Out[9]:
Gender State_Rate Height Weight BMI Age Rate
0 Male 0.100434 184 67.8 20.025992 77 0.332000
1 Male 0.141723 163 89.4 33.648237 82 0.869148
2 Male 0.090803 170 81.2 28.096886 31 0.010000
3 Male 0.119973 175 99.7 32.555102 39 0.021532
4 Male 0.110345 184 72.1 21.296078 68 0.149750

3. Convert the categorical Gender variable {Female, Male} to integers {0, 1}.

In [10]:
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
C:\Users\Matthew\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
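
Note: the SettingWithCopyWarning appears because X was created as a slice of the policies data frame. It is harmless here, but one way to avoid it (a suggested variation, not an original lab step) is to take an explicit copy of the slice back in step 1:

# Taking an explicit copy makes X independent of policies,
# so later assignments no longer trigger the warning.
X = policies.loc[:, policies.columns != "State"].copy()
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)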

4. Inspect the integer Gender series.

In [11]:
X.Gender.head()
Out[11]:
0    1
1    1
2    1
3    1
4    1
Name: Gender, dtype: int64

5. Import the numpy library as "np".

In [12]:
import numpy as np

6. Set the random number seed.

In [13]:
np.random.seed(42)
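
Side note (not part of the original lab steps): these features sit on very different scales (Height is in the hundreds while Rate is near zero), and distance-based clustering is dominated by the largest-scale columns. If you want to experiment, a minimal sketch of standardizing the features with scikit-learn:

from sklearn.preprocessing import StandardScaler

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)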

4. Cluster with k-Means

1. Import the k-means class from sklearn.

In [14]:
from sklearn.cluster import KMeans

2. Create a k-means model with k = 3 and 10 random initializations.

In [15]:
k_model = KMeans(
    n_clusters = 3,
    n_init = 10)
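
Side note: KMeans performs its own random initialization, so if you want the fit to be exactly reproducible regardless of numpy's global seed, you can also pass a seed to the model directly (an optional variation, not an original lab step):

# Optional: pin the initialization seed on the model itself.
k_model = KMeans(
    n_clusters = 3,
    n_init = 10,
    random_state = 42)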

3. Fit the model to the data.

In [16]:
k_model.fit(X)
Out[16]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
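
To sanity-check the choice of k = 3, one common approach (a sketch, not an original lab step) is the elbow method: fit k-means for several values of k, plot the inertia (within-cluster sum of squares), and look for the point where it stops dropping sharply:

# Elbow method: fit k-means for k = 1..9 and record the inertia.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters = k, n_init = 10, random_state = 42)
    inertias.append(model.fit(X).inertia_)

plt.plot(range(1, 10), inertias, marker = 'o')
plt.show()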

4. Create a palette with one color for each of the three clusters.

In [17]:
palette = {0:'#fb8072', 1:'#80b1d3', 2:'#b3de69'}

5. Map the colors to each of the clusters.

In [18]:
k_colors = pd.Series(k_model.labels_) \
    .apply(lambda x:palette[x])

6. Create a scatterplot matrix colored by cluster.

In [19]:
pd.plotting.scatter_matrix(
    frame = policies,
    color = k_colors,
    alpha = 1,
    s = 100,
    diagonal = 'none');

7. Question: Do you see any 2D projections of these data that show each of the clusters clearly separated?

8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by the clusters.
Superimpose the centroids of each cluster as X's on the scatterplot.

In [20]:
# Plot each policy colored by its cluster assignment.
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = k_colors)

# Superimpose the centroids; columns 5 and 4 of cluster_centers_
# correspond to the Age and BMI columns of X.
plt.scatter(
    x = k_model.cluster_centers_[:,5],
    y = k_model.cluster_centers_[:,4],
    marker = 'x',
    color = "black",
    s = 100)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

9. Question: What would you call each of these clusters if you had to give them a name based on their common properties?

5. Cluster with Hierarchical Clustering

1. Import the agglomerative clustering class from sklearn.

In [21]:
from sklearn.cluster import AgglomerativeClustering

2. Create a hierarchical cluster model with three clusters.

In [22]:
h_model = AgglomerativeClustering(
    n_clusters = 3)

3. Fit the model to the data.

In [23]:
h_model.fit(X)
Out[23]:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=3,
            pooling_func=<function mean at 0x000001F594B640D0>)

4. Import the dendrogram function from scipy.

In [ ]:
from scipy.cluster.hierarchy import dendrogram

5. Plot the dendrogram.

In [ ]:
# Pairs of cluster indices merged at each step of the agglomeration.
children = h_model.children_

# AgglomerativeClustering does not expose merge distances, so use the
# merge order as a stand-in distance for plotting.
distance = np.arange(children.shape[0])

# Stand-in for the number of observations in each merged cluster.
observations = np.arange(2, children.shape[0] + 2)

# Assemble the (children, distance, count) columns scipy expects.
linkage_matrix = np.column_stack([children, distance, observations]).astype(float)

dendrogram(
    Z = linkage_matrix,
    leaf_font_size = 8,
    color_threshold = 1939);
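
As an aside (not an original lab step), scipy can compute the full Ward linkage itself, which yields a dendrogram whose merge heights are real distances rather than the stand-in values above:

from scipy.cluster.hierarchy import linkage

# Ward linkage computed directly by scipy, with true merge distances.
Z = linkage(X, method = 'ward')
dendrogram(Z, leaf_font_size = 8);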

6. Question: How do you interpret this dendrogram?

7. Map the previous three colors to each cluster.

In [ ]:
h_colors = pd.Series(h_model.labels_) \
    .apply(lambda x:palette[x])

8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by cluster.

In [ ]:
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = h_colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

9. Question: What is the difference between these two methods of clustering?
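
One quick empirical check (a sketch, not part of the lab) is to cross-tabulate the two sets of labels. The integer labels themselves are arbitrary, so agreement between the methods shows up as one dominant count in each row:

# Cross-tabulate k-means labels against hierarchical labels.
pd.crosstab(
    pd.Series(k_model.labels_, name = "k-means"),
    pd.Series(h_model.labels_, name = "hierarchical"))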