Lab 4B - Clustering (Hard)

1. Explore the Data

Set the working directory to “C:/Workshop/Data”

setwd("C:/Workshop/Data")

Load the Rates.csv data into a dataframe called “policies”

policies <- read.csv("Rates.csv")

Load the dplyr package

library(dplyr)

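Convert the policy data to numeric values with the following code.

policiesNumeric <- policies %>%
  # as.numeric() on raw text returns NA, so encode Gender as a factor first
  mutate(Gender = as.numeric(factor(Gender))) %>%
  # State is a text label; drop it and keep State.Rate
  select(-State)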

Inspect the results

head(policiesNumeric)

##   Gender State.Rate Height Weight      BMI Age       Rate
## 1      2 0.10043368    184   67.8 20.02599  77 0.33200000
## 2      2 0.14172319    163   89.4 33.64824  82 0.86914779
## 3      2 0.09080315    170   81.2 28.09689  31 0.01000000
## 4      2 0.11997276    175   99.7 32.55510  39 0.02153204
## 5      2 0.11034460    184   72.1 21.29608  68 0.14975000
## 6      2 0.16292470    166   98.4 35.70910  64 0.21123703
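
The numeric codes in the Gender column come from R's factor encoding. A minimal sketch of how text labels map to integers, using a small made-up vector (hypothetical, not taken from Rates.csv):

```r
# Hypothetical labels for illustration; not taken from Rates.csv
gender <- c("Male", "Female", "Male", "Female")

# Factor levels are sorted alphabetically: "Female" = 1, "Male" = 2
genderFactor <- factor(gender)
levels(genderFactor)

# as.numeric() on a factor returns the integer level codes
genderCodes <- as.numeric(genderFactor)
genderCodes

# as.numeric() on the raw text cannot parse the labels and returns NA
rawCodes <- suppressWarnings(as.numeric(gender))
rawCodes
```

The codes are arbitrary labels, so the distance between 1 and 2 carries no meaning of its own.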

Question: Why do you think we are converting all policy data to numeric values before using our clustering algorithms?

2. Create Equal-Width Clusters using a Single Variable

Load the RColorBrewer package

library(RColorBrewer)

Create a “Set2” color palette with 3 colors

palette <- brewer.pal(3, "Set2")

Cut the policies' mortality rates into 3 equal-width clusters

cuts <- cut(policiesNumeric$Rate, 3)
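
To see what cut() produces, it may help to tabulate the result. The sketch below uses six made-up rates (hypothetical, not from Rates.csv). When given a number of breaks, cut() splits the range into intervals of equal width, so the number of observations per interval can differ.

```r
# Hypothetical mortality rates for illustration
rates <- c(0.01, 0.02, 0.15, 0.21, 0.33, 0.87)

# Split the range of rates into 3 intervals of equal width
rateCuts <- cut(rates, 3)

# The result is a factor whose levels are the interval labels
levels(rateCuts)

# Count the observations in each interval
rateCounts <- table(rateCuts)
rateCounts
```

With these values, most observations fall in the lowest interval, which is typical for skewed data.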

Create a scatterplot matrix colored by mortality rate clusters.

plot(
  x = policiesNumeric, 
  col = palette[cuts],
  pch = 19)
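
The expression palette[cuts] works because indexing a vector with a factor uses the factor's integer level codes. A small sketch with placeholder colors and labels (both hypothetical):

```r
# Placeholder colors standing in for the brewer.pal() palette
demoPalette <- c("red", "green", "blue")

# Placeholder segment labels; levels sort as "high", "low", "mid"
segments <- factor(c("low", "high", "mid", "low"))

# Indexing with a factor uses its integer codes (2, 1, 3, 2 here)
pointColors <- demoPalette[segments]
pointColors
```

The same indexing applies to the integer cluster labels returned by the clustering functions later in this lab.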

Question: How might these three market segments (based on Mortality Rates) be useful?

3. Create Clusters using k-Means

Set the seed to 42 to make the random starting points reproducible.

set.seed(42)

Create 3 k-means clusters, trying 10 random starts and keeping the best result.

kClusters <- kmeans(
  x = policiesNumeric, 
  centers = 3, 
  nstart = 10)
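
The object returned by kmeans() holds more than the cluster assignments. A minimal sketch on made-up data (the demo data frame and its values are hypothetical) showing a few components that may be worth inspecting before plotting:

```r
set.seed(42)

# Hypothetical data: three well-separated groups in two dimensions
demo <- data.frame(
  BMI = c(rnorm(20, 20, 1), rnorm(20, 28, 1), rnorm(20, 36, 1)),
  Age = c(rnorm(20, 30, 3), rnorm(20, 50, 3), rnorm(20, 70, 3)))

demoClusters <- kmeans(
  x = demo,
  centers = 3,
  nstart = 10)

# Number of observations assigned to each cluster
demoClusters$size

# Coordinates of each cluster centroid (one row per cluster)
demoClusters$centers

# Total within-cluster sum of squares; lower means tighter clusters
demoClusters$tot.withinss
```

The same components are available on kClusters, for example kClusters$size.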

Create a scatterplot matrix colored by cluster.

plot(
  x = policies, 
  col = palette[kClusters$cluster])

Create a scatterplot of BMI vs Age colored by cluster.

plot(
  x = policiesNumeric$BMI,
  y = policiesNumeric$Age, 
  col = palette[kClusters$cluster])

Plot the centroids of the clusters.

plot(
  x = policiesNumeric$BMI,
  y = policiesNumeric$Age, 
  col = palette[kClusters$cluster])
  
points(
  x = kClusters$centers[, "BMI"], 
  y = kClusters$centers[, "Age"],
  pch = 4, 
  lwd = 4, 
  col = "blue")

Plot the labels of the clusters.

plot(
  x = policiesNumeric$BMI,
  y = policiesNumeric$Age, 
  col = palette[kClusters$cluster])

# Label each centroid with its cluster number (the row names of the centers matrix)
text(
  x = kClusters$centers[, "BMI"], 
  y = kClusters$centers[, "Age"],
  labels = rownames(kClusters$centers),
  cex = 4, 
  col = "blue")

Question: What would you name each of these three clusters?
Question: How might these market segments be more or less useful than the previous three segments?

4. Create Hierarchical Clusters

Create hierarchical clusters.

hclusters <- hclust(dist(policiesNumeric))

Cut the tree into 3 clusters.

hCuts <- cutree(
  tree = hclusters, 
  k = 3)
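
The steps above can be sketched end to end on made-up data (the demo data frame and its values are hypothetical). dist() computes pairwise Euclidean distances between rows, hclust() merges the closest clusters step by step, and cutree() returns one integer label per row.

```r
set.seed(42)

# Hypothetical data: three well-separated groups of 10 rows each
demo <- data.frame(
  BMI = c(rnorm(10, 20, 1), rnorm(10, 28, 1), rnorm(10, 36, 1)),
  Age = c(rnorm(10, 20, 2), rnorm(10, 50, 2), rnorm(10, 80, 2)))

# Pairwise Euclidean distances, then agglomerative clustering
# (hclust() uses complete linkage by default)
demoTree <- hclust(dist(demo))

# One integer cluster label per row
demoCuts <- cutree(
  tree = demoTree,
  k = 3)

# Count the observations in each cluster
demoCounts <- table(demoCuts)
demoCounts
```

Running table(hCuts) on the lab data in the same way shows how evenly the policies are spread across the three clusters.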

Create a scatterplot matrix colored by cluster.

plot(
  x = policies, 
  col = palette[hCuts])

Create a scatterplot of BMI and Age and color by cluster.

plot(
  x = policiesNumeric$BMI,
  y = policiesNumeric$Age, 
  col = palette[hCuts])

Question: What would you name each of these three market segments?
Question: How might these market segments be more or less useful than the previous clusters?