Milestone 6 Solution Sheet
INTRODUCTION: Data is often stored across multiple tables to keep the storage
requirements compact, and to organize different types of data. Knowing how to
use a join is a vital skill when working with data, since bringing tables together can
open the door to additional insights that are cumbersome or impossible looking at
just one table at a time.
In this Milestone, you’ll use your proficiency with joins to help a reporter in California
use data to support an article they’re writing on the causes of motor vehicle
accidents. In particular, they want some information about how many accidents are
caused by the influence of alcohol, or due to inattention (such as using a cell phone
to text or talk to others), and when these types of accidents tend to occur.
HOW IT WORKS: Follow the prompts in the questions below to investigate your
data. Post your answers in the provided boxes: the yellow boxes for the queries you
write, purple boxes for visualizations and blue boxes for text-based answers. When
you're done, export your document as a pdf file and submit it on the Milestone page
– see instructions for creating a PDF at the end of the Milestone.
RESOURCES: If you need hints on the Milestone or are feeling stuck, there are
multiple ways of getting help. Attend Drop-In Hours to work on these problems with
your peers, or reach out to the HelpHub if you have questions. Good luck!
PROMPT: To help the reporters out, you will be making use of data regarding traffic
accidents in the state of California released by the California Highway Patrol.
Certain insights can be found by looking at data on the incident level, while other
insights are possible by looking deeper at the parties involved in an incident. But to
make insights across those two levels, we need a join to be able to relate the unique
information contained in each table.
SQL App: Here’s that link to our specialized SQL app, where you’ll write your SQL
queries and interact with the data.
The original collisions table has 469 664 rows and 76 columns, but we’ll be focusing
on only the following four columns in this Milestone:
The original parties table has 940 216 rows and 33 columns, with the following five
columns of interest:
Most of the features in the dataset are coded in some way for efficient data
storage, which can make working with highly detailed data like this tricky. This
includes the party_sobriety, oaf_1, and oaf_2 columns you’ll be investigating in
the Milestone. Don’t sweat that point, though: the instructions will explain the
encoding values relevant to the tasks.
If you’re curious to explore the data further on your own, or want to see the
parts of the dataset that aren’t available in this Milestone, you can find a
comprehensive description of the full data here, on the SWITRS information page.
A. Write a query and answer the following question: How many parties are cited
as being at fault for a collision?
SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
GROUP BY at_fault
There are 438,491 parties cited as being at fault for a
collision. Out of 940,216 total parties, this is roughly 47%.
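As a cross-check on the percentage, the count and the share can be
computed in a single query. This is a minimal sketch, assuming at_fault
holds 'Y' for at-fault parties (the value used in the Part B filter) and
that the SQL app accepts PostgreSQL-style syntax; the column aliases are
made up for illustration.

```sql
-- Sketch: at-fault parties as a share of all rows in the parties table.
-- Assumption: at_fault = 'Y' marks an at-fault party. Aliases are illustrative.
SELECT
    COUNT(*) AS total_parties,
    SUM(CASE WHEN at_fault = 'Y' THEN 1 ELSE 0 END) AS at_fault_parties,
    -- 100.0 forces decimal division; NULLIF guards against an empty table
    ROUND(
        100.0 * SUM(CASE WHEN at_fault = 'Y' THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0),
        1
    ) AS at_fault_pct
FROM switrs.parties
```

Doing the division in the query avoids copying two counts by hand and
mixing up which total belongs in the denominator.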
B. The party_sobriety field takes on a value of 'B' when the party is known to
have been drinking, and under the influence of alcohol. Modify your query
from part A to answer the following question: How many parties were found
at fault while under the influence of alcohol?
SELECT
COUNT(case_id),
party_sobriety
FROM switrs.parties
WHERE at_fault = 'Y'
AND party_sobriety = 'B'
GROUP BY party_sobriety
C. The oaf_1 or oaf_2 feature takes on a value of 'F' if inattention was a factor in
the collision. Modify your query to answer the following question: How many
parties were found at fault while lack of attention was a factor in the collision?
SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
WHERE oaf_1 = 'F'
OR oaf_2 = 'F'
GROUP BY at_fault
There were 18,311 parties found at fault where inattention was a
factor in the collision.
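The query above returns one row per at_fault value, so the answer is
read off the 'Y' row. If only the single number is wanted, the at-fault
condition can go in the WHERE clause instead. A minimal sketch, assuming
at_fault = 'Y' marks at-fault parties as in Part B; the alias is
illustrative.

```sql
-- Sketch: at-fault parties where inattention ('F') appears in oaf_1 or oaf_2.
-- The parentheses keep the OR together. Without them, AND binds first and the
-- filter would be read as (at_fault = 'Y' AND oaf_1 = 'F') OR oaf_2 = 'F'.
SELECT
    COUNT(case_id) AS inattentive_at_fault_parties
FROM switrs.parties
WHERE at_fault = 'Y'
  AND (oaf_1 = 'F' OR oaf_2 = 'F')
```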
A. Let’s start with the collisions table on its own. Write a query that returns the
number of collisions, grouped by day of the week. Which days have the
highest number of collisions, and which days have the least number? Note:
Day of week is encoded slightly differently than what comes out of the
date_part function: Sunday is indicated by a 7 instead of a 0.
SELECT
COUNT(case_id),
day_of_week
FROM switrs.collisions
GROUP BY day_of_week
ORDER BY COUNT(case_id)
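To make the output easier to read, the day codes can be translated into
names. This is a sketch under a stated assumption: the prompt only
confirms that Sunday is 7, and the mapping of 1 through 6 to Monday
through Saturday is inferred from the comparison with date_part. The
cast to varchar is there so the comparison works whether day_of_week is
stored as a number or as text.

```sql
-- Sketch: collisions per day of week with readable day names.
-- Assumption: 1 = Monday ... 6 = Saturday, 7 = Sunday (only 7 is confirmed).
SELECT
    CASE CAST(day_of_week AS varchar)
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
        WHEN '7' THEN 'Sunday'
        ELSE 'Unknown'
    END AS day_name,
    COUNT(case_id) AS collision_count
FROM switrs.collisions
GROUP BY day_of_week
ORDER BY collision_count DESC
```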
B. The collisions table and parties tables share values in the case_id column.
Write a new query that inner joins the two tables on that column, returning the
number of rows. How many rows are in the combined output table, and why?
SELECT
COUNT(parties.case_id)
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
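There are 940,216 rows in the combined table. The collisions
table alone has only 469,664 rows, but an inner join returns one
row for every matching pair of rows. The result has exactly as
many rows as the parties table, which means each party row
matches one collision on case_id, so there is one row per party.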
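The row count of an inner join depends on how the case_id values line
up between the two tables, and that can be checked directly. The two
queries below are a hedged sketch (aliases are illustrative); run them
one at a time if the SQL app only accepts a single statement.

```sql
-- Check 1: is case_id unique in collisions?
-- If the two counts are equal, each party can match at most one collision.
SELECT
    COUNT(*) AS collision_rows,
    COUNT(DISTINCT case_id) AS distinct_case_ids
FROM switrs.collisions;

-- Check 2: how many parties have no matching collision?
-- These rows would be dropped by an inner join; 0 means none are lost.
SELECT
    COUNT(*) AS unmatched_parties
FROM switrs.parties AS parties
LEFT JOIN switrs.collisions AS collisions
    ON parties.case_id = collisions.case_id
WHERE collisions.case_id IS NULL;
```

If case_id is unique in collisions and no parties are unmatched, the
inner join must return exactly one row per party.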
C. Combine the queries from parts A and B to return the number of collisions
grouped by the day of the week. Add a condition for the involved parties so
that we only count accidents where the party was found to be at fault AND
under the influence of alcohol. Which days have the highest number of
collisions, and which days have the smallest number?
SELECT
COUNT(collisions.case_id),
collisions.day_of_week
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
WHERE parties.party_sobriety = 'B'
AND parties.at_fault = 'Y'
GROUP BY collisions.day_of_week
ORDER BY COUNT(collisions.case_id) DESC
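Because the joined table has one row per party, the query above counts
matching parties rather than collisions: a collision with two at-fault
parties under the influence would be counted twice. A minimal sketch of
a variant that counts each collision once, assuming the same column
codes as above; the alias is illustrative.

```sql
-- Sketch: count each collision once per day, even if several parties match.
SELECT
    collisions.day_of_week,
    COUNT(DISTINCT collisions.case_id) AS collision_count
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
    ON collisions.case_id = parties.case_id
WHERE parties.party_sobriety = 'B'
  AND parties.at_fault = 'Y'
GROUP BY collisions.day_of_week
ORDER BY collision_count DESC
```

Comparing the two versions shows whether double counting matters for
this filter; if the numbers agree, no collision has more than one
matching party.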
D. Modify your query to look at the number of accidents by the day of the week
where the party was found to be at fault AND inattention was a factor. Which
days have the highest number of collisions, and which days have the smallest
number?
SELECT
COUNT(collisions.case_id),
collisions.day_of_week
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
WHERE parties.at_fault = 'Y'
AND (parties.oaf_1 = 'F'
OR parties.oaf_2 = 'F')
GROUP BY collisions.day_of_week
ORDER BY COUNT(collisions.case_id) DESC
Friday has the most collisions in which the party was at fault
and inattention was a factor, whereas Sunday has the fewest.
Looking at the data, inattentive driving increases slightly but
steadily throughout the week, peaks on Friday, and drops off
again by Sunday.
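To compare the alcohol and inattention patterns day by day, both counts
can be produced side by side with conditional aggregation. This is a
sketch, assuming the same codes used in the earlier parts ('Y', 'B',
'F'); the aliases are made up for illustration.

```sql
-- Sketch: at-fault parties per day of week, split by contributing factor.
-- A party can fall into both columns if both conditions apply.
SELECT
    collisions.day_of_week,
    SUM(CASE WHEN parties.party_sobriety = 'B' THEN 1 ELSE 0 END)
        AS alcohol_count,
    SUM(CASE WHEN parties.oaf_1 = 'F' OR parties.oaf_2 = 'F' THEN 1 ELSE 0 END)
        AS inattention_count
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
    ON collisions.case_id = parties.case_id
WHERE parties.at_fault = 'Y'
GROUP BY collisions.day_of_week
ORDER BY collisions.day_of_week
```

Ordering by day_of_week rather than by count keeps the week in sequence,
which makes the rise and fall across the week easier to see.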
Let’s use this new data summary to look at how accident patterns change based on
the time of day. Since the data has already been queried, we’ll do this visually within
Tableau! Click this link to navigate to the workbook you’ll use to complete the
remainder of this Milestone. Once you’ve published your Tableau Workbook in the
folder named Upload Workbooks Here, paste the Share Link in the box below.
https://prod-useast-b.online.tableau.com/#/site/globaltech/workbooks/733296?:origin=card_share_link
Continue to post your answers in the provided boxes: purple boxes for your
visualizations, and blue boxes for text-based answers.
A. On Sheet 1, create a bar chart of the number of collisions by the hour of day.
Describe the pattern in the data. Are there times of day where more
accidents occur? Does this fit in with your expectations?
From the visualization, it looks like collisions are most likely to
occur at around 1700 (5pm). This is most likely due to “rush hour”
and people commuting from work back home. There is also a
slight peak at 7-8 am, most likely also caused by commuting to
work.
B. Copy the chart into a new sheet and add a filter so that the bar chart only
shows accidents where the party at fault was found to be under the influence
of alcohol. How does this distribution of accidents by time of day compare
to the overall distribution?
Compared to the first bar chart, this visualization tells a
completely different story. It is roughly inverted from the last
one: most collisions that involve alcohol take place at night and
in the very early morning hours, peaking at 2 am. A quick Google
search told me that a majority of bars in California close at
either 1 or 2 am most nights.
C. Copy the chart into one more sheet, but now change the filter to only look at
accidents where inattention was a factor from the party-at-fault. How does
this distribution compare to the overall distribution?
This visualization closely resembles the first visualization, telling
the story that most collisions due to inattentive driving occur
when the roads are the busiest, again peaking at 5 pm.
— Level Up
Simply because an accident was such that inattention was a factor does not
necessarily mean that a cell phone was the source of the driver’s distraction. In the
parties table, there is a column called sp_info_2. This feature takes on a value of B,
1, or 2 if a cell phone was known to be in use at the time of the accident. If you’re
interested in digging deeper, you might want to try seeing what proportion of
accidents were caused by cell phone distraction, and if they differ from other
‘inattention’ accidents. Keep in mind that the sp_info_2 column is a string data type,
so you’ll need to treat the '1', and '2' codes appropriately!
SELECT
sp_info_2,
COUNT(case_id)
FROM switrs.parties
WHERE sp_info_2 = '1'
OR sp_info_2 = '2'
OR sp_info_2 = 'B'
GROUP BY sp_info_2;
SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
WHERE oaf_1 = 'F'
OR oaf_2 = 'F'
GROUP BY at_fault;
For this, I ran two separate queries so I could compare the
number of parties linked to inattentive driving with the number
linked to phone usage. I found a total of 12,010 parties reported
with a cell phone in use and 18,311 at-fault parties where
inattention was a factor, so the phone-usage count is around 66%
of the inattention count. Note that the first query does not
filter on at_fault or on the inattention codes, so this is a
rough comparison rather than an exact share.
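For a share with one consistent denominator, the cell phone codes can be
counted inside the set of at-fault inattention parties. This is a
minimal sketch, assuming at_fault = 'Y' marks at-fault parties and using
the sp_info_2 codes given in the prompt; the aliases are illustrative,
and the result is not quoted here because it has not been run against
the data.

```sql
-- Sketch: share of at-fault inattention parties with a cell phone in use.
-- sp_info_2 is a string column, so '1' and '2' are quoted.
SELECT
    COUNT(*) AS inattentive_at_fault_parties,
    SUM(CASE WHEN sp_info_2 IN ('B', '1', '2') THEN 1 ELSE 0 END)
        AS with_cell_phone,
    -- 100.0 forces decimal division; NULLIF guards against an empty result
    ROUND(
        100.0 * SUM(CASE WHEN sp_info_2 IN ('B', '1', '2') THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0),
        1
    ) AS cell_phone_pct
FROM switrs.parties
WHERE at_fault = 'Y'
  AND (oaf_1 = 'F' OR oaf_2 = 'F')
```

Because both counts come from the same filtered rows, the percentage is
a true proportion of the inattention group rather than a ratio of two
differently filtered totals.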
— Submission
Great work completing this Milestone! To submit your completed Milestone, you will
need to download / export this document as a PDF and then upload it to the
Milestone submission page. You can find the option to download as a PDF from the
File menu in the upper-left corner of the Google Doc interface.