Lab 2B: Working with Data (Hard)

  1. Set the working directory

  2. Read the tab-delimited mortality statistics file into a data set called “rates”
    (NOTE: Mortality Statistics.txt)

  3. Inspect the data

##     State State.Code Gender Gender.Code Five.Year.Age.Groups
## 1 Alabama          1 Female           F             < 1 year
## 2 Alabama          1 Female           F             < 1 year
## 3 Alabama          1 Female           F             < 1 year
## 4 Alabama          1 Female           F             < 1 year
## 5 Alabama          1 Female           F            1-4 years
## 6 Alabama          1 Female           F            1-4 years
##   Five.Year.Age.Groups.Code                             Race Deaths
## 1                         1 American Indian or Alaska Native     12
## 2                         1        Asian or Pacific Islander     18
## 3                         1        Black or African American   1855
## 4                         1                            White   1932
## 5                     4-Jan        Black or African American    264
## 6                     4-Jan                            White    349
##   Population  Rate
## 1       3028 0.40%
## 2       6386 0.28%
## 3     149748 1.24%
## 4     312213 0.62%
## 5     596169 0.04%
## 6    1239104 0.03%

Problem 1: Rename the Age Group Column

  1. Inspect the name of the Age Group column (i.e the fifth column)
## [1] "Five.Year.Age.Groups"
  1. Rename “Five.Year.Age.Groups” to “Age.Group”

  2. Verify rename

## [1] "Age.Group"

Problem 2: Convert Rate from Percentage to Decimal

  1. Inspect the data type of the Rate column
## [1] "factor"
  1. Convert Rate to a character string

  2. Eliminate the percent sign from the Rate values

  3. Convert “#VALUE!” in Rate to an empty string (i.e. “”)

  4. Cast Rate to numeric

  5. Verify Rate column is numeric

## [1] "numeric"
  1. Convert Rate from percent to decimal (i.e. multiply by 0.01)

  2. Inspect the first six rows

## [1] 0.0040 0.0028 0.0124 0.0062 0.0004 0.0003
  1. Inspect the last six rows
## [1] NA NA NA NA NA NA

Problem 3: Eliminate rows with NA values

  1. Count the rows with NA values
## [1] 1373
  1. Eliminate the rows with NA values

  2. Verify that the NA rows have been omitted

## [1] 0

Problem 4: Create High and Low Age Range Columns

  1. Inspect the unique values in the Age Group column
##  [1] < 1 year     1-4 years    5-9 years    10-14 years  15-19 years 
##  [6] 20-24 years  25-29 years  30-34 years  35-39 years  40-44 years 
## [11] 45-49 years  50-54 years  55-59 years  60-64 years  65-69 years 
## [16] 70-74 years  75-79 years  80-84 years 
## 23 Levels: < 1 year 1-4 years 10-14 years 100+ years ... Not Stated
  1. Inspect unique values in Five-Year Age Group Codes column
##  [1] 1      4-Jan  9-May  14-Oct 15-19  20-24  25-29  30-34  35-39  40-44 
## [11] 45-49  50-54  55-59  60-64  65-69  70-74  75-79  80-84  42461  42618 
## [21] 41913 
## 26 Levels: 1 100+ 14-Oct 15-19 20-24 25-29 30-34 35-39 4-Jan ... NS
  1. Decide which column to use to create your high and low age columns

Question: Why did you choose this column?

  1. Create a function to extract the low value from each age group
    Note: The code to create this function is provided below
    Note: use 0 for “< 1 year”
    Note: use NA for “Not Stated”
getLowAge <- function(group)
{
    if (group == "Not Stated"){
        NA
    }
    else if (group == "< 1 year") {
        0
    }
    else if (group == "100+ years") {
        100
    }
    else {
        as.integer(sub("-.*", "", group))
    }
}
  1. Apply the function to create a column called “Age.Group.Low” Note: The code to apply this function is below
rates$Age.Group.Low <- sapply(
    X = rates$Age.Group,
    FUN = getLowAge)
  1. Inspect the first six Age.Group.Low rows
## [1] 0 0 0 0 1 1
  1. Create a function to extract the high value from each age group
    Note: use 999 for “100+ years”
    Note: use NA for “Not Stated”
getHighAge <- function(group)
{
    if (group == "Not Stated") {
        NA
    }
    else if (group == "< 1 year") {
        1
    }
    else if (group == "100+ years") {
        999
    }
    else {
        as.integer(gsub("[a-z]|\\s|.*-", "", group))
    }
}
  1. Apply the function to create a column called “Age.Group.High”

  2. Inspect the first six Age.Group.High rows

## [1] 1 1 1 1 4 4
  1. Save the data frame as a CSV file called “Mortality Rates 2.csv”

  2. Inspect the newly created CSV file

Problem 5: Compute the Mortality Rate for an Individual

  1. Load the dplyr library

  2. Use dplyr to find the average mortality rate for 38-year old male from Iowa
    NOTE: It might be best to do this step-by-step first
    NOTE: The equivalent SQL would be:
    select avg(Rate)
    from [mortality]
    where State = “Iowa”
    and Gender = “Male”
    and Age.Group.Low <= 38
    and Age.Group.High >= 38

## Warning: package 'bindrcpp' was built under R version 3.4.1
##    AvgRate
## 1 0.001575
  1. Create a function called “getRate” to return the average mortality rate given state, gender, and age

  2. Use this function to calculate the rate for a 20-year old female from New York

getRate("New York", "Female", 20)
##    AvgRate
## 1 0.000275

Question: Who has the higher mortality rate?