Set the working directory
Read the tab-delimited mortality statistics file into a data set called “rates”
(NOTE: Mortality Statistics.txt)
Inspect the data
## State State.Code Gender Gender.Code Five.Year.Age.Groups
## 1 Alabama 1 Female F < 1 year
## 2 Alabama 1 Female F < 1 year
## 3 Alabama 1 Female F < 1 year
## 4 Alabama 1 Female F < 1 year
## 5 Alabama 1 Female F 1-4 years
## 6 Alabama 1 Female F 1-4 years
## Five.Year.Age.Groups.Code Race Deaths
## 1 1 American Indian or Alaska Native 12
## 2 1 Asian or Pacific Islander 18
## 3 1 Black or African American 1855
## 4 1 White 1932
## 5 4-Jan Black or African American 264
## 6 4-Jan White 349
## Population Rate
## 1 3028 0.40%
## 2 6386 0.28%
## 3 149748 1.24%
## 4 312213 0.62%
## 5 596169 0.04%
## 6 1239104 0.03%
## [1] "Five.Year.Age.Groups"
Rename “Five.Year.Age.Groups” to “Age.Group”
Verify rename
## [1] "Age.Group"
## [1] "factor"
Convert Rate to a character string
Eliminate the percent sign from the Rate values
Convert “#VALUE!” in Rate to an empty string (i.e. “”)
Cast Rate to numeric
Verify Rate column is numeric
## [1] "numeric"
Convert Rate from percent to decimal (i.e. multiply by 0.01)
Inspect the first six rows
## [1] 0.0040 0.0028 0.0124 0.0062 0.0004 0.0003
## [1] NA NA NA NA NA NA
## [1] 1373
Eliminate the rows with NA values
Verify that the NA rows have been omitted
## [1] 0
## [1] < 1 year 1-4 years 5-9 years 10-14 years 15-19 years
## [6] 20-24 years 25-29 years 30-34 years 35-39 years 40-44 years
## [11] 45-49 years 50-54 years 55-59 years 60-64 years 65-69 years
## [16] 70-74 years 75-79 years 80-84 years
## 23 Levels: < 1 year 1-4 years 10-14 years 100+ years ... Not Stated
## [1] 1 4-Jan 9-May 14-Oct 15-19 20-24 25-29 30-34 35-39 40-44
## [11] 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 42461 42618
## [21] 41913
## 26 Levels: 1 100+ 14-Oct 15-19 20-24 25-29 30-34 35-39 4-Jan ... NS
Question: Why did you choose this column?
getLowAge <- function(group)
{
if (group == "Not Stated"){
NA
}
else if (group == "< 1 year") {
0
}
else if (group == "100+ years") {
100
}
else {
as.integer(sub("-.*", "", group))
}
}
rates$Age.Group.Low <- sapply(
X = rates$Age.Group,
FUN = getLowAge)
## [1] 0 0 0 0 1 1
getHighAge <- function(group)
{
if (group == "Not Stated") {
NA
}
else if (group == "< 1 year") {
1
}
else if (group == "100+ years") {
999
}
else {
as.integer(gsub("[a-z]|\\s|.*-", "", group))
}
}
Apply the function to create a column called “Age.Group.High”
Inspect the first six Age.Group.High rows
## [1] 1 1 1 1 4 4
Save the data frame as a CSV file called “Mortality Rates 2.csv”
Inspect the newly created CSV file
Load the dplyr library
Use dplyr to find the average mortality rate for 38-year old male from Iowa
NOTE: It might be best to do this step-by-step first
NOTE: The equivalent SQL would be:
select avg(Rate)
from [mortality]
where State = “Iowa”
and Gender = “Male”
and Age.Group.Low <= 38
and Age.Group.High >= 38
## Warning: package 'bindrcpp' was built under R version 3.4.1
## AvgRate
## 1 0.001575
Create a function called “getRate” to return the average mortality rate given state, gender, and age
Use this function to calculate the rate for a 20-year old female from New York
getRate("New York", "Female", 20)
## AvgRate
## 1 0.000275
Question: Who has the higher mortality rate?