
Assignment 6

This document discusses analyzing avocado pricing data from different regions in the United States. Various data wrangling and visualization exercises are presented to explore trends in average avocado prices over time for different regions. Key findings include that average avocado prices have increased overall from 2015 to 2018, but pricing trends differ between individual regions. Faceted graphs are useful for comparing trends between regions. A linear regression model finds a moderate positive relationship between date and average price for the total US.


Assignment 6: Avocado outrage

Raymond Guo
2020-02-26

Exercise 1
Each observation covers a time span of 7 days (168 hours).
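The 7-day spacing can be checked by differencing consecutive dates. This is a minimal sketch in base R. The three weekly dates are hypothetical stand-ins, not values read from the `Date` column.

```r
# Hypothetical weekly observation dates (illustrative, not read from the dataset)
dates <- as.Date(c("2015-01-04", "2015-01-11", "2015-01-18"))

# Gap between consecutive observations, in days and in hours
gap_days  <- as.numeric(diff(dates), units = "days")
gap_hours <- as.numeric(diff(dates), units = "hours")

print(gap_days)
print(gap_hours)
```

On the real data the same idea would apply to the sorted `Date` values of a single region.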

Exercise 2

avocado %>%
select(region) %>%
unique()

region
Albany
Atlanta
BaltimoreWashington
Boise
Boston
BuffaloRochester
California
Charlotte
Chicago
CincinnatiDayton
Columbus
DallasFtWorth
Denver
Detroit
GrandRapids
GreatLakes
HarrisburgScranton
HartfordSpringfield
Houston
Indianapolis
Jacksonville
LasVegas
LosAngeles
Louisville
MiamiFtLauderdale
Midsouth
Nashville
NewOrleansMobile
NewYork
Northeast

NorthernNewEngland
Orlando
Philadelphia
PhoenixTucson
Pittsburgh
Plains
Portland
RaleighGreensboro
RichmondNorfolk
Roanoke
Sacramento
SanDiego
SanFrancisco
Seattle
SouthCarolina
SouthCentral
Southeast
Spokane
StLouis
Syracuse
Tampa
TotalUS
West
WestTexNewMexico

i. All of the regions in the dataset are listed above.
ii. BaltimoreWashington
iii. TotalUS

Exercise 3

avocado %>%
filter(region == "TotalUS") %>%
ggplot() +
geom_line(mapping = aes(x = Date, y = AveragePrice)) +
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$")

[Figure: line plot "Time series of average avocado price". x-axis: Date (2015 to 2018); y-axis: Average Price/$.]

Exercise 4

avocado %>%
filter(region == "TotalUS") %>%
ggplot() +
geom_line(mapping = aes(x = Date, y = AveragePrice)) +
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$") +
geom_smooth(mapping = aes(x = Date, y = AveragePrice), span = 0.1, se = FALSE, color = "blu

[Figure: line plot "Time series of average avocado price" with a smoothed line overlaid. x-axis: Date (2015 to 2018); y-axis: Average Price/$.]

Decreasing the span makes the smoothed line follow the original line graph more closely.
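The effect of the span can be checked numerically. This is a minimal sketch that assumes the smoother is LOESS (the method for which `geom_smooth()` uses `span`) and calls base R `loess()` directly. The series is synthetic and hypothetical, not the avocado prices.

```r
# Synthetic wiggly series (hypothetical, stands in for the weekly prices)
x <- 1:200
y <- sin(x / 8) + 0.2 * sin(x * 1.3)

# Fit LOESS curves with a small and a large span
fit_small <- loess(y ~ x, span = 0.1)
fit_large <- loess(y ~ x, span = 0.75)

# Mean squared distance between each smooth and the original series
mse_small <- mean(residuals(fit_small)^2)
mse_large <- mean(residuals(fit_large)^2)

print(c(span_0.1 = mse_small, span_0.75 = mse_large))
```

The smaller span leaves a smaller mean squared distance, which is the numeric counterpart of the smoothed line hugging the original series.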

Exercise 5

avocado_regional <- avocado %>%
  filter(region %in% c("TotalUS", "BaltimoreWashington", "Albany"))

Exercise 6

avocado_regional %>%
ggplot() +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = 0.1, method =
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$")

[Figure: smoothed lines "Time series of average avocado price", colored by region (Albany, BaltimoreWashington, TotalUS). x-axis: Date (2015 to 2018); y-axis: Average Price/$.]

ggplot(data = avocado_regional) +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = .3, method = "l
facet_grid(. ~ region)

[Figure: smoothed average price in three side-by-side panels (Albany, BaltimoreWashington, TotalUS), colored by region. x-axis: Date (2015 to 2018); y-axis: AveragePrice.]

Faceting makes it easier to see each region on its own and to analyze its graph, whereas in the
overlaid graph it is quite difficult to see what each region is doing.
Faceting also makes comparative analysis harder, such as judging which region is performing better
than the others over a given timeframe. That comparison is easier when all 3 curves are overlaid
on a single graph.

Exercise 7

avocado_usa <- avocado %>%
  filter(region == "TotalUS")

avocado_usa %>%
ggplot() +
geom_smooth(mapping = aes(x = Date, y = AveragePrice, color = region), span = 0.1, method =
labs(title = "Time series of average avocado price", x = "Date", y = "Average Price/$")

[Figure: smoothed line "Time series of average avocado price" for TotalUS. x-axis: Date (2015 to 2018); y-axis: Average Price/$.]

avocado_usa_model <- lm(AveragePrice ~ Date, data = avocado_usa)

avocado_usa_model %>%
tidy()

term          estimate    std.error   statistic   p.value
(Intercept)  -3.1886561   0.5822833   -5.476125   2e-07
Date          0.0002514   0.0000342    7.352999   0e+00

avocado_usa_model %>%
glance() %>%
select(r.squared)

r.squared
0.2445715
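R stores a `Date` as the number of days since 1970-01-01, so the `Date` slope is in dollars per day. The sketch below works through what the reported numbers imply. It uses the rounded coefficients from the output, so the results are approximate, and the example date is an arbitrary choice for illustration.

```r
# Coefficients copied (rounded) from the tidy() output
intercept <- -3.1886561
slope_per_day <- 0.0002514   # dollars per day, since Date is counted in days

# Approximate change in average price per year
slope_per_year <- slope_per_day * 365

# Fitted price on an example date (2017-01-01 chosen only for illustration)
example_day <- as.numeric(as.Date("2017-01-01"))  # days since 1970-01-01
fitted_price <- intercept + slope_per_day * example_day

# Correlation implied by R-squared (positive because the slope is positive)
r <- sqrt(0.2445715)

print(round(c(per_year = slope_per_year, fitted = fitted_price, r = r), 3))
```

Under these assumptions the trend is roughly 9 cents per year, and the implied correlation of about 0.49 indicates a moderate positive relationship.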

avocado_usa_model %>%
ggplot() +
geom_point(mapping = aes(x = Date, y = AveragePrice)) +
geom_abline(slope = avocado_usa_model$coefficients[2], intercept = avocado_usa_model$coeffici

[Figure: scatter plot of AveragePrice against Date (2015 to 2018) with the fitted regression line.]

Exercise 8
According to information found using Google, “avocado trees is best harvested when immature, green
and hard and ripened off the tree”. Harvesting in September (fall) maximizes those attributes.
That is when both the supply and the quality of avocados go up, which drives the price down.
In other seasons, such as spring and summer, lower quality and quantity result in much higher
prices.

Exercise 9

avocado %>%
ggplot() +
geom_point(mapping = aes(x = TotalVolume, y = AveragePrice))

[Figure: scatter plot of AveragePrice (about 0.5 to 2.0) against TotalVolume (0 to about 6e+07).]

The data points with the lowest volume do not show a clear correlation with price. Overall there is
a weak negative relationship, where a lower price goes with a higher volume.
The remaining points, which sit well apart from the rest as outliers, suggest that people purchase
more avocados during the harvest season.

Exercise 10

avocado_usa_model1 <- lm(AveragePrice ~ TotalVolume, data = avocado_usa)

avocado_usa_model1 %>%
tidy()

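term          estimate   std.error   statistic   p.value
(Intercept)   1.581618   0.0649437   24.353671   0
TotalVolume   0.000000   0.0000000   -7.661189   0

Note: the TotalVolume estimate and standard error appear as zero only because of rounding. The negative statistic indicates a small negative slope.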

avocado_usa_model1 %>%
glance() %>%
select(r.squared)

r.squared
0.2600595

avocado_usa_df <- avocado_usa %>%
  add_predictions(avocado_usa_model1) %>%
  add_residuals(avocado_usa_model1)

ggplot(avocado_usa_df)+
geom_point(mapping =aes(pred, AveragePrice)) +
geom_abline(slope = 1, intercept = 0, color = "red", size = 1)

[Figure: observed vs. predicted plot. x-axis: pred; y-axis: AveragePrice; red reference line with slope 1 and intercept 0.]

ggplot(avocado_usa_df) +
geom_point(aes(pred, resid)) +
geom_ref_line(h = 0)

[Figure: residuals vs. predicted plot. x-axis: pred; y-axis: resid; horizontal reference line at 0.]

ggplot(data = avocado_usa_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid))

[Figure: normal Q-Q plot of the residuals. x-axis: theoretical; y-axis: sample; with Q-Q reference line.]

The observed vs. predicted graph shows a roughly linear relationship, the residuals vs. predicted
graph shows the residuals scattered evenly around zero, and the Q-Q plot shows the residuals
staying close to the reference line, which indicates that they are approximately normal. With all
3 conditions satisfied, this model is reliable.
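All three checks are built from the model's predictions and residuals. This is a minimal sketch in base R, using `fitted()` and `resid()` as stand-ins for `add_predictions()` and `add_residuals()`. The data are simulated and hypothetical, not the TotalUS rows.

```r
# Hypothetical data standing in for TotalVolume and AveragePrice
set.seed(1)
volume <- seq(2e7, 6e7, length.out = 50)
price  <- 1.6 - 1.5e-8 * volume + rnorm(50, sd = 0.1)

model <- lm(price ~ volume)

# Predictions and residuals, the inputs to all three diagnostic plots
pred <- fitted(model)
res  <- resid(model)

# Observed values decompose exactly into prediction + residual
print(all.equal(unname(pred + res), price))

# Least-squares residuals average to zero (up to floating-point error)
print(abs(mean(res)) < 1e-10)

# Q-Q plot coordinates without drawing: theoretical vs sample quantiles
qq <- qqnorm(res, plot.it = FALSE)
print(cor(qq$x, qq$y))
```

A correlation near 1 between theoretical and sample quantiles corresponds to points lying close to the Q-Q reference line.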

Exercise 11

new <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%
group_by(region) %>%
summarize(mean_avocado_price = mean(AveragePrice, na.rm = TRUE))

The difference in mean average price between the two regions is 0.004556.
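The difference in means is a subtraction of the two group averages. This is a minimal sketch in base R with `tapply()`. The prices are hypothetical values chosen for easy arithmetic, not the avocado data.

```r
# Hypothetical prices for two regions (illustrative values only)
prices <- data.frame(
  region = rep(c("Albany", "BaltimoreWashington"), each = 4),
  AveragePrice = c(1.50, 1.60, 1.55, 1.59, 1.52, 1.58, 1.54, 1.56)
)

# Mean price per region, then Albany minus BaltimoreWashington
means <- tapply(prices$AveragePrice, prices$region, mean)
diff_in_means <- unname(means["Albany"] - means["BaltimoreWashington"])

print(means)
print(diff_in_means)
```

The order of subtraction matters for the sign, which is why `calculate()` takes an explicit `order` argument.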


obs_stat <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%

specify(AveragePrice ~ region) %>%
calculate(stat = "diff in means", order = c("Albany", "BaltimoreWashington"))

Exercise 12
i. The null hypothesis states that there is no difference between the mean average price of the 2
regions. The alternative hypothesis states that there is a difference between the mean average
price of the 2 regions.
ii.
null <- avocado %>%
filter(region == "Albany" | region == "BaltimoreWashington") %>%
specify(AveragePrice ~ region) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Albany", "BaltimoreWashington"))

iii.
null %>%
get_p_value(obs_stat = obs_stat, direction = "right")

p_value
0.4221

iv.
null %>%
visualize() +
shade_p_value(obs_stat = obs_stat, direction = "right")

[Figure: histogram "Simulation-Based Null Distribution" with the p-value region shaded. x-axis: stat (about -0.05 to 0.05); y-axis: count.]

v. The p-value (0.4221) is greater than 0.05, so we fail to reject the null hypothesis.
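The permutation test and the decision rule in this exercise can be sketched in base R without the infer pipeline. The data, group labels, and the helpers `diff_means()` and `decide()` are hypothetical names introduced for illustration only.

```r
set.seed(42)

# Hypothetical prices for two groups (not the avocado data)
group <- rep(c("A", "B"), each = 30)
price <- c(rnorm(30, mean = 1.56, sd = 0.3), rnorm(30, mean = 1.55, sd = 0.3))

# Difference in means, A minus B
diff_means <- function(values, labels) {
  mean(values[labels == "A"]) - mean(values[labels == "B"])
}

observed <- diff_means(price, group)

# Null distribution: shuffle the labels and recompute the statistic
null_stats <- replicate(2000, diff_means(price, sample(group)))

# Right-tailed p-value: share of shuffled statistics at least as large as observed
p_value <- mean(null_stats >= observed)

# Decision rule at a chosen significance level
decide <- function(p, alpha = 0.05) {
  if (p < alpha) "reject H0" else "fail to reject H0"
}

print(p_value)
print(decide(p_value))
print(decide(0.4221))  # the p-value reported in part iii
```

The null hypothesis is rejected only when the p-value falls below the significance level, so a p-value of 0.4221 leads to "fail to reject H0".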
