# Lab 2A: Working with Data (Easy)

### Import the Data

setwd("C:/Workshop/Data")
1. Load the data from the tab-delimited data file.
movies <- read.table(
file = "Movies.txt",
sep = "\t",
quote = "\"")
1. Take a peek at data.
head(movies)
##                   Title Year Rating Runtime Tomato.Meter Box.Office
## 1  The Whole Nine Yards 2000      R  98 min           45     $57.3M ## 2 Gladiator 2000 R 155 min 76$187.3M
## 3      Cirque du Soleil 2000      G  39 min           45     $13.4M ## 4 Dinosaur 2000 PG 82 min 65$135.6M
## 5     Big Momma's House 2000  PG-13  99 min           30      $0.5M ## 6 Gone in Sixty Seconds 2000 PG-13 118 min 24$101.0M
1. Inspect the column names.
names(movies)
## [1] "Title"        "Year"         "Rating"       "Runtime"
## [5] "Tomato.Meter" "Box.Office"

### Problem 1: Column name is incorrect

1. Inspect the incorrect column name (i.e. Tomato.Meter).
names(movies)[5]
## [1] "Tomato.Meter"
1. Rename the column to “Critic.Score”.
names(movies)[5] <- "Critic.Score"
1. Verify that problem 1 is solved.
names(movies)
## [1] "Title"        "Year"         "Rating"       "Runtime"
## [5] "Critic.Score" "Box.Office"

### Problem 2: Missing values

1. Count the missing values.
sum(is.na(movies))
## [1] 4
1. Exclude rows with missing values.
movies <- na.omit(movies)
1. Verify that Problem 2 is solved.
sum(is.na(movies))
## [1] 0

### Problem 3: Units of measure in runtime column

1. Peek at the movie runtime data.
head(movies$Runtime) ## [1] 98 min 155 min 39 min 82 min 99 min 118 min ## 114 Levels: 100 min 101 min 102 min 103 min 104 min 105 min ... 99 min 1. Verify that we cannot perform mathematical operations on the column. NOTE: This next line will throw an error mean(movies$Runtime)
## Warning in mean.default(movies$Runtime): argument is not numeric or ## logical: returning NA ## [1] NA 1. Determine the data type. class(movies$Runtime)
## [1] "factor"
1. Cast from factor to a character string.
runtimes <- as.character(movies$Runtime) 1. Verify the values are now character strings. head(runtimes) ## [1] "98 min" "155 min" "39 min" "82 min" "99 min" "118 min" class(runtimes) ## [1] "character" 1. Eliminate the unit of measure. runtimes <- sub(" min", "", runtimes) 1. Verify the unit of measure has been eliminated. head(runtimes) ## [1] "98" "155" "39" "82" "99" "118" 1. Cast the character string to integer. movies$Runtime <- as.integer(runtimes)
1. Verify the values are now integers.
head(movies$Runtime) ## [1] 98 155 39 82 99 118 class(movies$Runtime)
## [1] "integer"
1. Verify that we can now perform mathematical operations.
mean(movies$Runtime) ## [1] 104.4052 ### Problem 4: Box Office uses three units of measure 1. Inspect the Box Office column. head(movies$Box.Office)
## [1] $57.3M$187.3M $13.4M$135.6M $0.5M$101.0M
## 1367 Levels: $0.1M$0.2M $0.3M$0.4M $0.5M$0.6M $0.7M$0.8M $0.9M ...$99.9k
1. Create a function to convert box office revenue to a single unit of measure (i.e. millions of USD).
convertBoxOffice <- function(boxOffice)
{
stringBoxOffice <- as.character(boxOffice)

replacedBoxOffice <- gsub("[$|k|M]", "", stringBoxOffice) numericBoxOffice <- as.numeric(replacedBoxOffice) if (grepl("M", boxOffice)) { numericBoxOffice } else if (grepl("k", boxOffice)){ numericBoxOffice * 0.001 } else { numericBoxOffice * 0.000001 } } 1. Convert box office to single unit of measure. movies$Box.Office <- sapply(movies$Box.Office, convertBoxOffice) 1. Verify that Problem 4 has been solved. head(movies$Box.Office)
## [1]  57.3 187.3  13.4 135.6   0.5 101.0
class(movies$Box.Office) ## [1] "numeric" mean(movies$Box.Office)
## [1] 40.67558

### Export the data

1. Export the data to a CSV file.
write.csv(movies, "Movies2.csv")
1. Verify the new CSV file exists in your C:/Workshop directory.

2. Verify the file’s contents by opening the file.

### Bonus Problem: Replace Joe’s Analysis Software

library(dplyr)
1. Select columns
temp <- select(movies, Year, Rating, Box.Office)
1. Filter rows
temp <- filter(temp, Year == 2014)
## Warning: package 'bindrcpp' was built under R version 3.4.1
1. Create new revenue column (\$B)
temp <- mutate(temp, Revenue = Box.Office / 1000)
1. Group by rating
temp <- group_by(temp, Rating)
1. Summarize by rating category
temp <- summarize(temp, Total.Revenue = sum(Revenue))
1. Sort rows by revenue (descending)
temp <- arrange(temp, desc(Total.Revenue))
1. Display results
print(temp)
## # A tibble: 4 x 2
##   Rating Total.Revenue
##   <fctr>         <dbl>
## 1  PG-13      4.207022
## 2      R      1.985307
## 3     PG      1.931452
## 4      G      0.159132
1. Chain dplyr methods
report <- movies %>%
select(Year, Rating, Box.Office) %>%
filter(Year == 2014) %>%
mutate(Revenue = Box.Office / 1000) %>%
group_by(Rating) %>%
summarize(Total.Revenue = sum(Revenue)) %>%
arrange(desc(Total.Revenue)) %>%
as.data.frame()
1. Display results
print (report)
##   Rating Total.Revenue
## 1  PG-13      4.207022
## 2      R      1.985307
## 3     PG      1.931452
## 4      G      0.159132