# Lab 3A: Descriptive Statistics (Easy)

1. Load the CSV data files.
movies <- read.csv("Movies.csv")

genres <- read.csv("Genres.csv")
1. Peek at the data.
head(movies)
##                   Title Year Rating Runtime Critic.Score Box.Office
## 1  The Whole Nine Yards 2000      R      98           45       57.3
## 2             Gladiator 2000      R     155           76      187.3
## 3      Cirque du Soleil 2000      G      39           45       13.4
## 4              Dinosaur 2000     PG      82           65      135.6
## 5     Big Momma's House 2000  PG-13      99           30        0.5
## 6 Gone in Sixty Seconds 2000  PG-13     118           24      101.0
head(genres)
##                  Title  Genre Year Rating Runtime Critic.Score Box.Office
## 1 The Whole Nine Yards  Crime 2000      R      98           45       57.3
## 2 The Whole Nine Yards Comedy 2000      R      98           45       57.3
## 3     Cirque du Soleil  Drama 2000      G      39           45       13.4
## 4     Cirque du Soleil Family 2000      G      39           45       13.4
## 5            Gladiator Action 2000      R     155           76      187.3
## 6            Gladiator  Drama 2000      R     155           76      187.3

### Analyzing One Categorical Variable

1. Create a frequency table of observations of movies by rating category.
table(movies$Rating)
##
##     G    PG PG-13     R
##    93   497  1225  1423

### Analyzing One Numeric Variable

1. Analyze measures of central tendency (i.e. location) for movie runtime.
mean(movies$Runtime)
## [1] 104.4052
median(movies$Runtime)
## [1] 101
1. Analyze measures of dispersion (i.e. spread) for movie runtime.
min(movies$Runtime)
## [1] 38
max(movies$Runtime)
## [1] 219
range(movies$Runtime)
## [1]  38 219
diff(range(movies$Runtime))
## [1] 181
quantile(movies$Runtime)
##   0%  25%  50%  75% 100%
##   38   93  101  113  219
quantile(movies$Runtime, 0.95)
## 95%
## 135
IQR(movies$Runtime)
## [1] 20
var(movies$Runtime)
## [1] 284.4487
sd(movies$Runtime)
## [1] 16.86561
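
To make the spread measures above concrete, the following sketch recomputes the sample variance, standard deviation, and interquartile range by hand and compares them with the built-in functions. The vector `x` is a hypothetical set of runtimes, not data from `Movies.csv`, and the `manual_*` names are illustrative only.

```r
# Hypothetical runtimes in minutes (not taken from Movies.csv).
x <- c(90, 95, 100, 105, 120)

n <- length(x)
deviations <- x - mean(x)

# Sample variance divides the sum of squared deviations by n - 1.
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# IQR is the distance between the 75th and 25th percentiles.
q <- quantile(x, c(0.25, 0.75))
manual_iqr <- unname(q[2] - q[1])

# Each comparison should print TRUE.
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
all.equal(manual_iqr, IQR(x))
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data (minutes here), which usually makes it the easier of the two to interpret.
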
1. Analyze measures of the shape of movie runtime.
library(moments)

skewness(movies$Runtime)
## [1] 1.007788
kurtosis(movies$Runtime)
## [1] 5.956355
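
As a rough illustration of what the shape measures above capture, this sketch computes skewness and kurtosis from central moments using base R only. It assumes the moment-based (population) definitions, which the `moments` package appears to follow; the sample `x` and the helper `central_moment()` are hypothetical.

```r
# Hypothetical right-skewed sample (not taken from Movies.csv).
x <- c(1, 2, 2, 3, 12)

# k-th central moment, dividing by n (the population form).
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(x, 2)
manual_skewness <- central_moment(x, 3) / m2^(3 / 2)
manual_kurtosis <- central_moment(x, 4) / m2^2  # not "excess" kurtosis

manual_skewness  # positive, so the long tail is on the right
manual_kurtosis
```

Under this definition a normal distribution has a kurtosis of 3, so a value near 6 for movie runtime would suggest heavier tails than a normal distribution, and the positive skewness suggests a longer right tail.
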
1. Summarize a quantitative variable (i.e. movie runtime).
summary(movies$Runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    38.0    93.0   101.0   104.4   113.0   219.0

### Analyzing Two Categorical Variables

1. Create a contingency table containing the frequency of observations of movies by genre and rating.
table(genres$Genre, genres$Rating)
##
##                 G  PG PG-13   R
##   Action        2  70   311 229
##   Adventure    44 179   209  64
##   Animation    43 111     8   6
##   Biography     0  27    73  93
##   Comedy       45 258   472 506
##   Crime         0   9   141 328
##   Documentary  27  73    78  65
##   Drama        12 136   586 836
##   Family       38 181    10   1
##   Fantasy       6  51   115  43
##   History       3  12    36  35
##   Horror        0   3    71 195
##   Music         5  31    81  59
##   Musical       0  11    20   6
##   Mystery       0   6   102 136
##   Sci-Fi        0   7   119  72
##   Sport         4  36    62  19
##   Thriller      0   2   167 324
##   War           1   0    19  31
##   Western       0   4     6  10

### Analyzing Two Numeric Variables

1. Analyze the correlation coefficient for runtime and box office.
cor(movies$Runtime, movies$Box.Office)
## [1] 0.347748
1. Analyze the correlation coefficient for critic score and box office.
cor(movies$Critic.Score, movies$Box.Office)
## [1] 0.1608324

### Analyzing a Numeric Variable Grouped by a Categorical Variable

1. Create a table of aggregate numeric values (i.e. average box office revenue) grouped by a categorical variable (i.e. rating category).
tapply(movies$Box.Office, movies$Rating, mean)
##        G       PG    PG-13        R
## 55.47561 56.40439 54.56134 22.26118
1. Create a table of average box office revenue grouped by genre.
tapply(genres$Box.Office, genres$Genre, mean)
##      Action   Adventure   Animation   Biography      Comedy       Crime
##   76.530806  101.745110   96.603311   26.500308   40.860973   34.320142
## Documentary       Drama      Family     Fantasy     History      Horror
##    6.268575   24.740296   68.339200   93.251211   24.181583   27.932895
##       Music     Musical     Mystery      Sci-Fi       Sport    Thriller
##   21.978918   37.172776   40.328661   86.874763   27.739240   38.523364
##         War     Western
##   26.474298   36.146105
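
The grouped averages above rely on `tapply()`, which splits a numeric vector by a grouping vector and applies a function to each piece. The sketch below shows the mechanics on a small hypothetical data frame (`toy`, with made-up titles and revenues) laid out like `genres`, with one row per movie-genre pair.

```r
# Hypothetical long-format table: one row per movie-genre pair.
toy <- data.frame(
  Title = c("A", "A", "B", "C", "C"),
  Genre = c("Crime", "Comedy", "Comedy", "Drama", "Crime"),
  Box.Office = c(10, 10, 30, 50, 50)
)

# tapply(values, groups, fn) splits the values by group and applies fn to each.
by_genre <- tapply(toy$Box.Office, toy$Genre, mean)
by_genre

# The same grouping, done by hand for a single genre.
mean(toy$Box.Office[toy$Genre == "Comedy"])
```

Note that a movie listed under several genres contributes its box office to each of those genre averages, so the per-genre means are not independent of one another.
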

### Analyzing Many Variables

1. Create a correlation matrix of the numeric variables (runtime, critic score, and box office).
cor(movies[, 4:6])
##                Runtime Critic.Score Box.Office
## Runtime      1.0000000    0.1881713  0.3477480
## Critic.Score 0.1881713    1.0000000  0.1608324
## Box.Office   0.3477480    0.1608324  1.0000000
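
To show where each entry of a correlation matrix comes from, this sketch computes Pearson's correlation coefficient by hand as the covariance divided by the product of the standard deviations. The vectors `runtime` and `box_office` are hypothetical values, not rows from `Movies.csv`.

```r
# Hypothetical paired observations (not taken from Movies.csv).
runtime <- c(90, 100, 110, 120, 130)
box_office <- c(20, 35, 30, 60, 55)

# Pearson's r: covariance scaled by both standard deviations.
manual_r <- cov(runtime, box_office) / (sd(runtime) * sd(box_office))

all.equal(manual_r, cor(runtime, box_office))  # should print TRUE

# cor() on a data frame returns every pairwise coefficient; the matrix is
# symmetric with 1s on the diagonal.
toy <- data.frame(runtime, box_office)
cor(toy)
```

The coefficient ranges from -1 to 1, so values such as 0.35 or 0.16 indicate weak positive linear relationships.
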
1. Summarize an entire table.
summary(movies)
##                   Title           Year        Rating        Runtime
##  Camp                :   2   Min.   :2000   G    :  93   Min.   : 38.0
##  Frozen              :   2   1st Qu.:2004   PG   : 497   1st Qu.: 93.0
##  The Other Woman     :   2   Median :2008   PG-13:1225   Median :101.0
##  (500) Days of Summer:   1   Mean   :2008   R    :1423   Mean   :104.4
##  (Untitled)          :   1   3rd Qu.:2011                3rd Qu.:113.0
##  10 Items or Less    :   1   Max.   :2015                Max.   :219.0
##  (Other)             :3229
##   Critic.Score      Box.Office
##  Min.   :  0.00   Min.   :  0.0002
##  1st Qu.: 26.00   1st Qu.:  1.0000
##  Median : 49.00   Median : 16.1000
##  Mean   : 49.68   Mean   : 40.6756
##  3rd Qu.: 74.00   3rd Qu.: 51.4750
##  Max.   :100.00   Max.   :760.5000
##