# Lab 2B: Working with Data (Hard)

1. Set the working directory

2. Read the tab-delimited mortality statistics file into a data set called “rates”
(NOTE: Mortality Statistics.txt)

3. Inspect the data

##     State State.Code Gender Gender.Code Five.Year.Age.Groups
## 1 Alabama          1 Female           F             < 1 year
## 2 Alabama          1 Female           F             < 1 year
## 3 Alabama          1 Female           F             < 1 year
## 4 Alabama          1 Female           F             < 1 year
## 5 Alabama          1 Female           F            1-4 years
## 6 Alabama          1 Female           F            1-4 years
##   Five.Year.Age.Groups.Code                             Race Deaths
## 1                         1 American Indian or Alaska Native     12
## 2                         1        Asian or Pacific Islander     18
## 3                         1        Black or African American   1855
## 4                         1                            White   1932
## 5                     4-Jan        Black or African American    264
## 6                     4-Jan                            White    349
##   Population  Rate
## 1       3028 0.40%
## 2       6386 0.28%
## 3     149748 1.24%
## 4     312213 0.62%
## 5     596169 0.04%
## 6    1239104 0.03%

### Problem 1: Rename the Age Group Column

1. Inspect the name of the Age Group column (i.e the fifth column)
## [1] "Five.Year.Age.Groups"
1. Rename “Five.Year.Age.Groups” to “Age.Group”

2. Verify rename

## [1] "Age.Group"

### Problem 2: Convert Rate from Percentage to Decimal

1. Inspect the data type of the Rate column
## [1] "factor"
1. Convert Rate to a character string

2. Eliminate the percent sign from the Rate values

3. Convert “#VALUE!” in Rate to an empty string (i.e. “”)

4. Cast Rate to numeric

5. Verify Rate column is numeric

## [1] "numeric"
1. Convert Rate from percent to decimal (i.e. multiply by 0.01)

2. Inspect the first six rows

## [1] 0.0040 0.0028 0.0124 0.0062 0.0004 0.0003
1. Inspect the last six rows
## [1] NA NA NA NA NA NA

### Problem 3: Eliminate rows with NA values

1. Count the rows with NA values
## [1] 1373
1. Eliminate the rows with NA values

2. Verify that the NA rows have been omitted

## [1] 0

### Problem 4: Create High and Low Age Range Columns

1. Inspect the unique values in the Age Group column
##  [1] < 1 year     1-4 years    5-9 years    10-14 years  15-19 years
##  [6] 20-24 years  25-29 years  30-34 years  35-39 years  40-44 years
## [11] 45-49 years  50-54 years  55-59 years  60-64 years  65-69 years
## [16] 70-74 years  75-79 years  80-84 years
## 23 Levels: < 1 year 1-4 years 10-14 years 100+ years ... Not Stated
1. Inspect unique values in Five-Year Age Group Codes column
##  [1] 1      4-Jan  9-May  14-Oct 15-19  20-24  25-29  30-34  35-39  40-44
## [11] 45-49  50-54  55-59  60-64  65-69  70-74  75-79  80-84  42461  42618
## [21] 41913
## 26 Levels: 1 100+ 14-Oct 15-19 20-24 25-29 30-34 35-39 4-Jan ... NS
1. Decide which column to use to create your high and low age columns

Question: Why did you choose this column?

1. Create a function to extract the low value from each age group
Note: The code to create this function is provided below
Note: use 0 for “< 1 year”
Note: use NA for “Not Stated”
getLowAge <- function(group)
{
if (group == "Not Stated"){
NA
}
else if (group == "< 1 year") {
0
}
else if (group == "100+ years") {
100
}
else {
as.integer(sub("-.*", "", group))
}
}
1. Apply the function to create a column called “Age.Group.Low” Note: The code to apply this function is below
rates$Age.Group.Low <- sapply( X = rates$Age.Group,
FUN = getLowAge)
1. Inspect the first six Age.Group.Low rows
## [1] 0 0 0 0 1 1
1. Create a function to extract the high value from each age group
Note: use 999 for “100+ years”
Note: use NA for “Not Stated”
getHighAge <- function(group)
{
if (group == "Not Stated") {
NA
}
else if (group == "< 1 year") {
1
}
else if (group == "100+ years") {
999
}
else {
as.integer(gsub("[a-z]|\\s|.*-", "", group))
}
}
1. Apply the function to create a column called “Age.Group.High”

2. Inspect the first six Age.Group.High rows

## [1] 1 1 1 1 4 4
1. Save the data frame as a CSV file called “Mortality Rates 2.csv”

2. Inspect the newly created CSV file

### Problem 5: Compute the Mortality Rate for an Individual

2. Use dplyr to find the average mortality rate for 38-year old male from Iowa
NOTE: It might be best to do this step-by-step first
NOTE: The equivalent SQL would be:
select avg(Rate)
from [mortality]
where State = “Iowa”
and Gender = “Male”
and Age.Group.Low <= 38
and Age.Group.High >= 38

## Warning: package 'bindrcpp' was built under R version 3.4.1
##    AvgRate
## 1 0.001575
1. Create a function called “getRate” to return the average mortality rate given state, gender, and age

2. Use this function to calculate the rate for a 20-year old female from New York

getRate("New York", "Female", 20)
##    AvgRate
## 1 0.000275

Question: Who has the higher mortality rate?