Assignment 6
Raymond Guo
2020-02-26
Exercise 1
Each observation spans 7 days (168 hours).
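As a quick check of the arithmetic, the gap between two consecutive weekly dates can be computed in base R. This is a minimal sketch; the two dates below are illustrative values chosen a week apart, not rows read from the avocado data.

```r
# Two illustrative dates one week apart (assumed values, not read from the dataset)
dates <- as.Date(c("2015-01-04", "2015-01-11"))

# diff() on Date objects returns a difftime measured in days
gap <- diff(dates)

gap_days  <- as.numeric(gap, units = "days")   # length of the gap in days
gap_hours <- as.numeric(gap, units = "hours")  # the same gap expressed in hours

print(c(days = gap_days, hours = gap_hours))
```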
Exercise 2
avocado %>%
select(region) %>%
unique()
region
Albany
Atlanta
BaltimoreWashington
Boise
Boston
BuffaloRochester
California
Charlotte
Chicago
CincinnatiDayton
Columbus
DallasFtWorth
Denver
Detroit
GrandRapids
GreatLakes
HarrisburgScranton
HartfordSpringfield
Houston
Indianapolis
Jacksonville
LasVegas
LosAngeles
Louisville
MiamiFtLauderdale
Midsouth
Nashville
NewOrleansMobile
NewYork
Northeast
NorthernNewEngland
Orlando
Philadelphia
PhoenixTucson
Pittsburgh
Plains
Portland
RaleighGreensboro
RichmondNorfolk
Roanoke
Sacramento
SanDiego
SanFrancisco
Seattle
SouthCarolina
SouthCentral
Southeast
Spokane
StLouis
Syracuse
Tampa
TotalUS
West
WestTexNewMexico
[Figure: Time series of average avocado price. x-axis: Date (2015 to 2018); y-axis: Average Price/$]
Exercise 4
avocado %>%
filter(region == "TotalUS") %>%
ggplot() +
geom_line(mapping = aes(x = Date, y = AveragePrice)) +
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$") +
geom_smooth(mapping = aes(x = Date, y = AveragePrice), span = 0.1, se = FALSE, color = "blu
[Figure: average price line with smoothed trend. x-axis: Date (2015 to 2018); y-axis: Average Price/$]
Decreasing the span makes the smoothed line follow the original series more closely.
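The effect of the span can be illustrated outside ggplot2 with base R's loess(), which is the smoother geom_smooth() uses by default for small datasets. This is a minimal sketch on a synthetic weekly series; the data, the variable names, and the wide span of 0.75 are all assumptions made for illustration.

```r
# Synthetic weekly price series (made-up data, not the avocado dataset)
set.seed(1)
week  <- 1:169
price <- 1.1 + 0.2 * sin(2 * pi * week / 52) + rnorm(length(week), sd = 0.05)

# Fit the same series with a narrow and a wide smoothing span
fit_narrow <- loess(price ~ week, span = 0.1)
fit_wide   <- loess(price ~ week, span = 0.75)

# Residual sum of squares: distance between each smoothed curve and the raw series
rss_narrow <- sum(residuals(fit_narrow)^2)
rss_wide   <- sum(residuals(fit_wide)^2)

# The narrow span leaves smaller residuals, i.e. it tracks the raw series more closely
print(c(narrow = rss_narrow, wide = rss_wide))
```

A smaller span fits each point from a smaller neighbourhood, so the curve bends with short-term movements instead of averaging them away.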
Exercise 5
Exercise 6
avocado_regional %>%
ggplot() +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = 0.1, method =
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$")
[Figure: smoothed average price by region on shared axes. x-axis: Date (2015 to 2018); y-axis: Average Price/$; legend: region (Albany, BaltimoreWashington, TotalUS)]
ggplot(data = avocado_regional) +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = .3, method = "l
facet_grid(. ~ region)
[Figure: smoothed average price in three side-by-side panels (Albany, BaltimoreWashington, TotalUS). x-axis: Date (2015 to 2018); y-axis: AveragePrice; legend: region]
Faceting makes it easier to see each region on its own and to analyze its trend, compared with the
overlaid plot, where it is difficult to tell what each region is doing.
Faceting also makes comparative analysis harder, such as judging which region is performing better
than another over a given timeframe. That comparison is easier when all 3 curves are overlaid on
the same axes.
Exercise 7
avocado_usa %>%
ggplot() +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = 0.1, method =
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$")
[Figure: smoothed average price for TotalUS. y-axis: Average Price/$; legend: region (TotalUS)]
avocado_usa_model %>%
tidy()
avocado_usa_model %>%
glance() %>%
select(r.squared)
r.squared
0.2445715
avocado_usa_model %>%
ggplot() +
geom_point(mapping = aes(x = Date, y = AveragePrice)) +
geom_abline(slope = avocado_usa_model$coefficients[2], intercept = avocado_usa_model$coeffici
[Figure: scatter plot of AveragePrice against Date with the fitted regression line. x-axis: Date (2015 to 2018); y-axis: AveragePrice]
Exercise 8
According to information found using Google, “avocado trees is best harvested when immature, green
and hard and ripened off the tree”. Harvesting in September (fall) best meets those conditions.
That is when avocado supply and quality both go up, which pushes the price down.
In other seasons, such as spring and summer, lower quality and quantity lead to sharply higher
prices.
Exercise 9
avocado %>%
ggplot() +
geom_point(mapping = aes(x = TotalVolume, y = AveragePrice))
[Figure: scatter plot of AveragePrice (y-axis) against TotalVolume (x-axis)]
The data points with the lowest volume do not show a clear correlation with price.
There is a weak negative relationship: lower prices go with higher volume.
The remaining points, which are largely outliers relative to the rest, suggest that people purchase
avocados during the harvest season.
Exercise 10
avocado_usa_model1 <- lm(AveragePrice ~ TotalVolume, data = avocado_usa)
avocado_usa_model1 %>%
tidy()
avocado_usa_model1 %>%
glance() %>%
select(r.squared)
r.squared
0.2600595
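The r.squared value reported by glance() is the share of the variance in the response that the fitted line explains, so roughly 26% here. The sketch below recomputes R-squared by hand for a simple linear model; it uses synthetic data with made-up variable names, not the avocado dataset.

```r
# Synthetic data (hypothetical volumes and prices, for illustration only)
set.seed(2)
volume <- runif(100, min = 20, max = 60)
price  <- 1.5 - 0.01 * volume + rnorm(100, sd = 0.1)

fit <- lm(price ~ volume)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((price - mean(price))^2)
r2_manual <- 1 - rss / tss

# Matches the value stored in the model summary
print(all.equal(r2_manual, summary(fit)$r.squared))
```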
ggplot(avocado_usa_df)+
geom_point(mapping =aes(pred, AveragePrice)) +
geom_abline(slope = 1, intercept = 0, color = "red", size = 1)
[Figure: observed vs. predicted values with the reference line y = x in red. x-axis: pred; y-axis: AveragePrice]
ggplot(avocado_usa_df) +
geom_point(aes(pred, resid)) +
geom_ref_line(h = 0)
[Figure: residuals vs. predicted values with a reference line at 0. x-axis: pred; y-axis: resid]
ggplot(data = avocado_usa_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid))
[Figure: normal Q-Q plot of the residuals. x-axis: theoretical; y-axis: sample]
The observed vs. predicted plot shows a linear relationship, the residuals vs. predicted plot shows
the residuals balanced around zero, and the Q-Q plot shows the residuals following the reference
line, which is consistent with approximately normal residuals. With all 3 conditions satisfied, the
model appears reliable.
Exercise 11
new <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%
group_by(region) %>%
summarize(mean_avocado_price = mean(AveragePrice, na.rm = TRUE))
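# Observed difference in mean price (Albany minus BaltimoreWashington)
obs_stat <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%
specify(AveragePrice ~ region) %>%
calculate(stat = "diff in means", order = c("Albany", "BaltimoreWashington"))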
Exercise 12
i. The null hypothesis states that there is no difference between the average prices of the 2
regions. The alternative hypothesis states that there is a difference between the average prices
of the 2 regions.
ii.
null <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%
specify(AveragePrice ~ region) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Albany", "BaltimoreWashington"))
iii.
null %>%
get_p_value(obs_stat = obs_stat, direction = "right")
p_value
0.4221
iv.
null %>%
visualize() +
shade_p_value(obs_stat = obs_stat, direction = "right")
[Figure: histogram of the simulated null distribution with the p-value region shaded. x-axis: stat; y-axis: count]
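The permutation test in parts ii to iv can also be sketched in base R to show what the infer pipeline does conceptually: shuffle the region labels, recompute the difference in means each time, and count how often the shuffled statistic is at least as large as the observed one. The prices below are synthetic, and the helper diff_in_means() is a hypothetical name introduced only for this illustration.

```r
# Synthetic prices for two regions (made-up values, not the avocado dataset)
set.seed(3)
prices <- c(rnorm(50, mean = 1.55, sd = 0.3), rnorm(50, mean = 1.53, sd = 0.3))
region <- rep(c("Albany", "BaltimoreWashington"), each = 50)

# Hypothetical helper: mean(Albany) minus mean(BaltimoreWashington)
diff_in_means <- function(x, g) {
  mean(x[g == "Albany"]) - mean(x[g == "BaltimoreWashington"])
}

obs <- diff_in_means(prices, region)

# Null distribution: shuffling the labels breaks any link between region and price
null_stats <- replicate(2000, diff_in_means(prices, sample(region)))

# Right-tailed p-value: share of shuffled statistics at least as large as the observed one
p_right <- mean(null_stats >= obs)
print(p_right)
```

A large p-value, like the 0.4221 reported in part iii, means the observed difference is well within what random label shuffling produces, so the null hypothesis is not rejected.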