Lab 4B - Clustering

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Rates.csv file into a data frame called policies.

In [4]:
policies = pd.read_csv("Rates.csv")

2. Explore the Data

1. Inspect the policies data frame using the head function.

In [5]:
policies.head()
Out[5]:
Gender State State_Rate Height Weight BMI Age Rate
0 Male MA 0.100434 184 67.8 20.025992 77 0.332000
1 Male VA 0.141723 163 89.4 33.648237 82 0.869148
2 Male NY 0.090803 170 81.2 28.096886 31 0.010000
3 Male TN 0.119973 175 99.7 32.555102 39 0.021532
4 Male FL 0.110345 184 72.1 21.296078 68 0.149750

2. Import pyplot from matplotlib as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the data set.

In [7]:
pd.plotting.scatter_matrix(
    frame = policies,
    alpha = 1,
    s = 100,
    diagonal = 'none');

4. Question: Do you see any natural clusters in these data? How many?

3. Transform the Data

1. Create a data frame of features for clustering (omit the categorical State column).

In [8]:
X = policies.iloc[:, policies.columns != "State"]

2. Inspect the features using the head function.

In [9]:
X.head()
Out[9]:
Gender State_Rate Height Weight BMI Age Rate
0 Male 0.100434 184 67.8 20.025992 77 0.332000
1 Male 0.141723 163 89.4 33.648237 82 0.869148
2 Male 0.090803 170 81.2 28.096886 31 0.010000
3 Male 0.119973 175 99.7 32.555102 39 0.021532
4 Male 0.110345 184 72.1 21.296078 68 0.149750

3. Convert the categorical Gender variable {Female, Male} to integers {0, 1}.

In [10]:
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
C:\Users\Matthew\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
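
Note: the SettingWithCopyWarning appears because X was created as a slice of the policies data frame. It is harmless here, but one way to avoid it (a suggested variation, not an original lab step) is to take an explicit copy of the slice back in step 1:

# Taking an explicit copy makes X independent of policies,
# so later assignments no longer trigger the warning.
X = policies.loc[:, policies.columns != "State"].copy()
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)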

4. Inspect the integer Gender series.

In [11]:
X.Gender.head()
Out[11]:
0    1
1    1
2    1
3    1
4    1
Name: Gender, dtype: int64

5. Import the numpy library as "np".

In [12]:
import numpy as np

6. Set the random number seed.

In [13]:
np.random.seed(42)
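
Side note (not part of the original lab steps): these features sit on very different scales (Height is in the hundreds while Rate is near zero), and distance-based clustering is dominated by the largest-scale columns. If you want to experiment, a minimal sketch of standardizing the features with scikit-learn:

from sklearn.preprocessing import StandardScaler

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)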

4. Cluster with k-Means

1. Import the k-means class from sklearn.

In [14]:
from sklearn.cluster import KMeans

2. Create a k-means model with k = 3 and 10 random initializations.

In [15]:
k_model = KMeans(
    n_clusters = 3,
    n_init = 10)
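
Side note: KMeans performs its own random initialization, so if you want the fit to be exactly reproducible regardless of numpy's global seed, you can also pass a seed to the model directly (an optional variation, not an original lab step):

# Optional: pin the initialization seed on the model itself.
k_model = KMeans(
    n_clusters = 3,
    n_init = 10,
    random_state = 42)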

3. Fit the model to the data.

In [16]:
k_model.fit(X)
Out[16]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
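
To sanity-check the choice of k = 3, one common approach (a sketch, not an original lab step) is the elbow method: fit k-means for several values of k, plot the inertia (within-cluster sum of squares), and look for the point where it stops dropping sharply:

# Elbow method: fit k-means for k = 1..9 and record the inertia.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters = k, n_init = 10, random_state = 42)
    inertias.append(model.fit(X).inertia_)

plt.plot(range(1, 10), inertias, marker = 'o')
plt.show()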

4. Create a palette with one color for each of the three clusters.

In [17]:
palette = {0:'#fb8072', 1:'#80b1d3', 2:'#b3de69'}

5. Map the colors to each of the clusters.

In [18]:
k_colors = pd.Series(k_model.labels_) \
    .apply(lambda x:palette[x])

6. Create a scatterplot matrix colored by cluster.

In [19]:
pd.plotting.scatter_matrix(
    frame = policies,
    color = k_colors,
    alpha = 1,
    s = 100,
    diagonal = 'none');

7. Question: Do you see any 2D projections of these data that show each of the clusters clearly separated?

8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by the clusters.
Superimpose the centroids of each cluster as X's on the scatterplot.

In [20]:
# Plot each policy colored by its cluster assignment.
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = k_colors)

# Superimpose the centroids; columns 5 and 4 of cluster_centers_
# correspond to the Age and BMI columns of X.
plt.scatter(
    x = k_model.cluster_centers_[:,5],
    y = k_model.cluster_centers_[:,4],
    marker = 'x',
    color = "black",
    s = 100)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

9. Question: What would you call each of these clusters if you had to give them a name based on their common properties?

5. Cluster with Hierarchical Clustering

1. Import the agglomerative clustering class from sklearn.

In [21]:
from sklearn.cluster import AgglomerativeClustering

2. Create a hierarchical cluster model with three clusters.

In [22]:
h_model = AgglomerativeClustering(
    n_clusters = 3)

3. Fit the model to the data.

In [23]:
h_model.fit(X)
Out[23]:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=3,
            pooling_func=<function mean at 0x000001F594B640D0>)

4. Import the dendrogram function from scipy.

In [ ]:
from scipy.cluster.hierarchy import dendrogram

5. Plot the dendrogram.

In [ ]:
# Pairs of cluster indices merged at each step of the agglomeration.
children = h_model.children_

# AgglomerativeClustering does not expose merge distances, so use the
# merge order as a stand-in distance for plotting.
distance = np.arange(children.shape[0])

# Stand-in for the number of observations in each merged cluster.
observations = np.arange(2, children.shape[0] + 2)

# Assemble the (children, distance, count) columns scipy expects.
linkage_matrix = np.column_stack([children, distance, observations]).astype(float)

dendrogram(
    Z = linkage_matrix,
    leaf_font_size = 8,
    color_threshold = 1939);
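
As an aside (not an original lab step), scipy can compute the full Ward linkage itself, which yields a dendrogram whose merge heights are real distances rather than the stand-in values above:

from scipy.cluster.hierarchy import linkage

# Ward linkage computed directly by scipy, with true merge distances.
Z = linkage(X, method = 'ward')
dendrogram(Z, leaf_font_size = 8);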

6. Question: How do you interpret this dendrogram?

7. Map the previous three colors to each cluster.

In [ ]:
h_colors = pd.Series(h_model.labels_) \
    .apply(lambda x:palette[x])

8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by cluster.

In [ ]:
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = h_colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()

9. Question: What is the difference between these two methods of clustering?
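
One quick empirical check (a sketch, not part of the lab) is to cross-tabulate the two sets of labels. The integer labels themselves are arbitrary, so agreement between the methods shows up as one dominant count in each row:

# Cross-tabulate k-means labels against hierarchical labels.
pd.crosstab(
    pd.Series(k_model.labels_, name = "k-means"),
    pd.Series(h_model.labels_, name = "hierarchical"))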