Lab 2B: Working with Data (Hard)

Set the working directory
Read the tab-delimited mortality statistics file into a data set called “rates”
(NOTE: Mortality Statistics.txt)
Inspect the data

##     State State.Code Gender Gender.Code Five.Year.Age.Groups
## 1 Alabama          1 Female           F             < 1 year
## 2 Alabama          1 Female           F             < 1 year
## 3 Alabama          1 Female           F             < 1 year
## 4 Alabama          1 Female           F             < 1 year
## 5 Alabama          1 Female           F            1-4 years
## 6 Alabama          1 Female           F            1-4 years
##   Five.Year.Age.Groups.Code                             Race Deaths
## 1                         1 American Indian or Alaska Native     12
## 2                         1        Asian or Pacific Islander     18
## 3                         1        Black or African American   1855
## 4                         1                            White   1932
## 5                     4-Jan        Black or African American    264
## 6                     4-Jan                            White    349
##   Population  Rate
## 1       3028 0.40%
## 2       6386 0.28%
## 3     149748 1.24%
## 4     312213 0.62%
## 5     596169 0.04%
## 6    1239104 0.03%

Problem 1: Rename the Age Group Column

Inspect the name of the Age Group column (i.e the fifth column)

## [1] "Five.Year.Age.Groups"

Rename “Five.Year.Age.Groups” to “Age.Group”
Verify rename

## [1] "Age.Group"

Problem 2: Convert Rate from Percentage to Decimal

Inspect the data type of the Rate column

## [1] "factor"

Convert Rate to a character string
Eliminate the percent sign from the Rate values
Convert “#VALUE!” in Rate to an empty string (i.e. “”)
Cast Rate to numeric
Verify Rate column is numeric

## [1] "numeric"

Convert Rate from percent to decimal (i.e. multiply by 0.01)
Inspect the first six rows

## [1] 0.0040 0.0028 0.0124 0.0062 0.0004 0.0003

Inspect the last six rows

## [1] NA NA NA NA NA NA

Problem 3: Eliminate rows with NA values

Count the rows with NA values

## [1] 1373

Eliminate the rows with NA values
Verify that the NA rows have been omitted

## [1] 0

Problem 4: Create High and Low Age Range Columns

Inspect the unique values in the Age Group column

##  [1] < 1 year     1-4 years    5-9 years    10-14 years  15-19 years 
##  [6] 20-24 years  25-29 years  30-34 years  35-39 years  40-44 years 
## [11] 45-49 years  50-54 years  55-59 years  60-64 years  65-69 years 
## [16] 70-74 years  75-79 years  80-84 years 
## 23 Levels: < 1 year 1-4 years 10-14 years 100+ years ... Not Stated

Inspect unique values in Five-Year Age Group Codes column

##  [1] 1      4-Jan  9-May  14-Oct 15-19  20-24  25-29  30-34  35-39  40-44 
## [11] 45-49  50-54  55-59  60-64  65-69  70-74  75-79  80-84  42461  42618 
## [21] 41913 
## 26 Levels: 1 100+ 14-Oct 15-19 20-24 25-29 30-34 35-39 4-Jan ... NS

Decide which column to use to create your high and low age columns

Question: Why did you choose this column?

Create a function to extract the low value from each age group
Note: The code to create this function is provided below
Note: use 0 for “< 1 year”
Note: use NA for “Not Stated”

getLowAge <- function(group)
{
    if (group == "Not Stated"){
        NA
    }
    else if (group == "< 1 year") {
        0
    }
    else if (group == "100+ years") {
        100
    }
    else {
        as.integer(sub("-.*", "", group))
    }
}

Apply the function to create a column called “Age.Group.Low” Note: The code to apply this function is below

rates$Age.Group.Low <- sapply(
    X = rates$Age.Group,
    FUN = getLowAge)

Inspect the first six Age.Group.Low rows

## [1] 0 0 0 0 1 1

Create a function to extract the high value from each age group
Note: use 999 for “100+ years”
Note: use NA for “Not Stated”

getHighAge <- function(group)
{
    if (group == "Not Stated") {
        NA
    }
    else if (group == "< 1 year") {
        1
    }
    else if (group == "100+ years") {
        999
    }
    else {
        as.integer(gsub("[a-z]|\\s|.*-", "", group))
    }
}

Apply the function to create a column called “Age.Group.High”
Inspect the first six Age.Group.High rows

## [1] 1 1 1 1 4 4

Save the data frame as a CSV file called “Mortality Rates 2.csv”
Inspect the newly created CSV file

Problem 5: Compute the Mortality Rate for an Individual

Load the dplyr library
Use dplyr to find the average mortality rate for 38-year old male from Iowa
NOTE: It might be best to do this step-by-step first
NOTE: The equivalent SQL would be:
select avg(Rate)
from [mortality]
where State = “Iowa”
and Gender = “Male”
and Age.Group.Low <= 38
and Age.Group.High >= 38

## Warning: package 'bindrcpp' was built under R version 3.4.1

##    AvgRate
## 1 0.001575

Create a function called “getRate” to return the average mortality rate given state, gender, and age
Use this function to calculate the rate for a 20-year old female from New York

getRate("New York", "Female", 20)

##    AvgRate
## 1 0.000275

Question: Who has the higher mortality rate?