1. Import the OS library.
import os
2. Set the working directory.
os.chdir("C:\\Workshop\\Data")
3. Import the pandas library as "pd".
import pandas as pd
4. Read the Rates.csv file into a data frame called policies.
policies = pd.read_csv("Rates.csv")
1. Inspect the data using the head function.
policies.head()
2. Import pyplot from matplotlib as "plt".
import matplotlib.pyplot as plt
3. Create a scatterplot matrix of the data set.
pd.plotting.scatter_matrix(
    frame = policies,
    alpha = 1,
    s = 100,
    diagonal = 'none');
4. Question: Do you see any natural clusters in these data? How many?
1. Create a data frame of features for clustering (omit the categorical State column).
X = policies.iloc[:, policies.columns != "State"]
2. Inspect the features using the head function.
X.head()
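As an aside, the boolean column mask above is equivalent to a drop() call, which reads more directly. A minimal sketch on a small synthetic frame (the column names here are illustrative, not the full Rates.csv schema):

```python
import pandas as pd

# Small synthetic frame standing in for the policies data (illustrative only;
# the real file has additional columns such as BMI and Rate).
policies = pd.DataFrame({
    "State": ["TX", "CA", "NY"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [34, 51, 29],
})

# The boolean-mask selection used above ...
X_mask = policies.iloc[:, policies.columns != "State"]

# ... keeps the same columns as drop(), which names the column to omit.
X_drop = policies.drop(columns="State")

print(X_mask.equals(X_drop))
```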
3. Convert the categorical Gender variable {Female, Male} to integers {0, 1}.
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
4. Inspect the integer Gender series.
X.Gender.head()
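The same encoding can also be written with map(); this is a sketch of an alternative, not part of the workshop code. Unlike the lambda above, which silently codes anything non-Female as 1, map() yields NaN for unexpected values:

```python
import pandas as pd

gender = pd.Series(["Female", "Male", "Female"])  # illustrative values

# Explicit dictionary encoding; values outside the mapping become NaN.
encoded = gender.map({"Female": 0, "Male": 1})
print(encoded.tolist())  # [0, 1, 0]
```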
3. Import the numpy library as "np".
import numpy as np
4. Set the random number seed.
np.random.seed(42)
1. Import the k-means class from sklearn.
from sklearn.cluster import KMeans
2. Create a k-means model with k = 3 and 10 random initializations.
k_model = KMeans(
    n_clusters = 3,
    n_init = 10)
3. Fit the model to the data.
k_model.fit(X)
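After fitting, the model's attributes can be inspected directly. A minimal sketch on synthetic blobs (X_demo and its centers are illustrative assumptions, not the policies data):

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)
# Three well-separated synthetic blobs standing in for X (illustrative only).
X_demo = np.vstack([np.random.randn(20, 2) + center
                    for center in ([0, 0], [8, 8], [0, 8])])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

print(model.labels_[:5])       # cluster index assigned to each observation
print(model.cluster_centers_)  # one row of feature means per cluster
print(model.inertia_)          # within-cluster sum of squared distances
```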
4. Create a palette with three colors, one for each of the three clusters.
palette = {0:'#fb8072', 1:'#80b1d3', 2:'#b3de69'}
5. Map the colors to each of the clusters.
k_colors = pd.Series(k_model.labels_) \
    .apply(lambda x: palette[x])
6. Create a scatterplot matrix colored by cluster.
pd.plotting.scatter_matrix(
    frame = policies,
    color = k_colors,
    alpha = 1,
    s = 100,
    diagonal = 'none');
7. Question: Do you see any 2D projections of these data that show each of the clusters clearly separated?
8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by the clusters.
Superimpose the centroids of each cluster as X's on the scatterplot.
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = k_colors)
plt.scatter(
    x = k_model.cluster_centers_[:,5],
    y = k_model.cluster_centers_[:,4],
    marker = 'x',
    color = "black",
    s = 100)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
9. Question: What would you call each of these clusters if you had to give them a name based on their common properties?
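One common way to sanity-check the choice k = 3 is the elbow heuristic: fit models for several values of k and watch the inertia, which falls sharply until k reaches the natural cluster count and then flattens. A sketch on synthetic data (not the policies file):

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)
# Three synthetic blobs (illustrative only).
X_demo = np.vstack([np.random.randn(30, 2) + center
                    for center in ([0, 0], [8, 8], [0, 8])])

# Inertia for k = 1..6; the "elbow" appears at the true cluster count.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42)
                  .fit(X_demo).inertia_
            for k in range(1, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))
```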
1. Import the agglomerative clustering class from sklearn.
from sklearn.cluster import AgglomerativeClustering
2. Create a hierarchical cluster model with three clusters.
h_model = AgglomerativeClustering(
    n_clusters = 3)
3. Fit the model to the data.
h_model.fit(X)
4. Import the dendrogram function from scipy.
from scipy.cluster.hierarchy import dendrogram
5. Plot the dendrogram.
# sklearn does not expose merge distances here, so substitute uniform spacing
# (the vertical axis will show merge order, not true distance) and approximate
# the per-merge observation counts.
children = h_model.children_
distance = np.arange(children.shape[0])
observations = np.arange(2, children.shape[0] + 2)
linkage_matrix = np.column_stack([children, distance, observations]).astype(float)
dendrogram(
    Z = linkage_matrix,
    leaf_font_size = 8,
    color_threshold = 1939);
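The uniform distances above are a workaround. If your scikit-learn is 0.24 or newer (an assumption about your environment), compute_distances=True records the real merge heights, making the dendrogram's vertical axis meaningful. A sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

np.random.seed(42)
X_demo = np.random.randn(12, 2)  # illustrative data, not the policies file

# compute_distances=True (scikit-learn >= 0.24) stores true merge heights.
model = AgglomerativeClustering(n_clusters=3, compute_distances=True).fit(X_demo)

# Count the original observations under each merge for the linkage matrix.
n = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if child < n else counts[child - n] for child in merge)

linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]).astype(float)
tree = dendrogram(linkage_matrix, no_plot=True)  # set no_plot=False to draw
```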
6. Question: How do you interpret this dendrogram?
7. Map the previous three colors to each cluster.
h_colors = pd.Series(h_model.labels_) \
    .apply(lambda x: palette[x])
8. Plot a scatterplot of BMI (y-axis) vs. Age (x-axis) colored by cluster.
plt.scatter(
    x = policies.Age,
    y = policies.BMI,
    color = h_colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
9. Question: What is the difference between these two methods of clustering?
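The question above can also be examined numerically. Since cluster numbers are arbitrary, a contingency table of the two label vectors shows how the partitions line up: one dominant cell per row means the two methods found essentially the same groups. A sketch on synthetic blobs (not the policies data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

np.random.seed(42)
# Three well-separated synthetic blobs (illustrative only).
X_demo = np.vstack([np.random.randn(25, 2) + center
                    for center in ([0, 0], [8, 8], [0, 8])])

k_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
h_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# Cross-tabulate the two partitions; label numbering may differ even when the
# underlying groups agree.
ct = pd.crosstab(pd.Series(k_labels, name="kmeans"),
                 pd.Series(h_labels, name="hierarchical"))
print(ct)
```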