Milestone 6 Solution Sheet

This document analyzes 2019 traffic collision data from California to identify trends related to alcohol use and inattention. The data comes from the California Highway Patrol and is stored across two tables, collisions and parties. Several SQL queries are run against the data: 1) About 3.6% of all parties were found at fault while under the influence of alcohol, and around 18,000 parties were at fault with inattention as a factor. 2) Fridays and Thursdays had the highest number of overall collisions, while Sundays had the fewest. 3) Sundays saw the most collisions involving parties at fault and under the influence of alcohol, and Tuesdays saw the fewest.

Uploaded by

api-708555321

Milestone 6 | Traffic Collisions in California

INTRODUCTION: Data is often stored across multiple tables to keep the storage
requirements compact, and to organize different types of data. Knowing how to
use a join is a vital skill when working with data, since bringing tables together can
open the door to additional insights that are cumbersome or impossible looking at
just one table at a time.

In this Milestone, you’ll use your proficiency with joins to help a reporter in California
use data to support an article they’re writing on the causes of motor vehicle
accidents. In particular, they want some information about how many accidents are
caused by the influence of alcohol, or due to inattention (such as using a cell phone
to text or talk to others), and when these types of accidents tend to occur.

HOW IT WORKS: Follow the prompts in the questions below to investigate your
data. Post your answers in the provided boxes: the yellow boxes for the queries you
write, purple boxes for visualizations and blue boxes for text-based answers. When
you're done, export your document as a pdf file and submit it on the Milestone page
– see instructions for creating a PDF at the end of the Milestone.

RESOURCES: If you need hints on the Milestone or are feeling stuck, there are
multiple ways of getting help. Attend Drop-In Hours to work on these problems with
your peers, or reach out to the HelpHub if you have questions. Good luck!

PROMPT: To help the reporters out, you will be making use of data regarding traffic
accidents in the state of California released by the California Highway Patrol.
Certain insights can be found by looking at data on the incident level, while other
insights are possible by looking deeper at the parties involved in an incident. But to
make insights across those two levels, we need a join to be able to relate the unique
information contained in each table.
SQL App: Here’s that link to our specialized SQL app, where you’ll write your SQL
queries and interact with the data.

— Data Set Description


Data for this Milestone comes from the California Highway Patrol’s Statewide
Integrated Traffic Records System (SWITRS). The SWITRS data we’ve provided
(switrs.*) consists of two tables from the 2019 data collection: collisions and
parties. The tables are related hierarchically. At the top level, there is a unique row
and identifier for each incident in the collisions table. Then, in the lower level, each
collision is between one or more parties, which include vehicles, pedestrians, etc.

The original collisions table has 469,664 rows and 76 columns, but we’ll be focusing
on only the following four columns in this Milestone:

● case_id - unique identifier for each collision
● collision_time - time of day when collision occurred, in 24 hour format
● day_of_week - day of week when collision occurred. Note that numbering
starts at 1 = Monday and ends at 7 = Sunday (instead of 0 = Sunday)
● party_count - number of parties involved in the collision

The original parties table has 940,216 rows and 33 columns, with the following five
columns of interest:

● case_id - associated with a collision with matching case_id, may not be
unique
● party_number - numbering of parties involved, always starts from 1 for each
collision
● at_fault - Y/N indicating whether party was at fault for collision
● party_sobriety - encodings for whether or not the party had been drinking
● oaf_1, oaf_2 - encodings for other associated factors

Most of the features in the dataset are coded in some way for efficient data
storage, which can make working with highly detailed data like this tricky. This
includes the party_sobriety, oaf_1, and oaf_2 columns you’ll be investigating in
the Milestone. Don’t sweat that point, though: the instructions will explain the
encoding values relevant to the tasks.
If you’re curious to explore the data further on your own, or want to see the
parts of the dataset that aren’t provided here, you can find a comprehensive
description of the full data on the SWITRS information page.

— Task 1: How frequently does alcohol use or lack of attention feature in accidents?

To start, we should run some queries on the parties table to understand how fault,
alcohol use, and inattention are attributed to accidents.

A. Write a query and answer the following question: How many parties are cited
as being at fault for a collision?

SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
GROUP BY at_fault

There are 438,491 parties that are cited as being at fault for a
collision. Out of 940,216 total parties, this is roughly 47%.
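
As a rough check on the arithmetic, the sketch below runs the same GROUP BY on a tiny in-memory SQLite stand-in for switrs.parties and then repeats the percentage calculation with the reported figures. The five rows are invented purely for illustration, and SQLite is an assumption here; the Milestone's SQL app may use a different engine.

```python
import sqlite3

# Toy stand-in for switrs.parties; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute(
    "CREATE TABLE switrs.parties (case_id INTEGER, party_number INTEGER, at_fault TEXT)"
)
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?, ?)",
    [(1, 1, "Y"), (1, 2, "N"), (2, 1, "Y"), (3, 1, "N"), (3, 2, "N")],
)

# Same shape as the Task 1A query: one output row per at_fault value.
counts = dict(
    conn.execute(
        "SELECT at_fault, COUNT(case_id) FROM switrs.parties GROUP BY at_fault"
    ).fetchall()
)

# Share of at-fault parties relative to all parties (not all collisions).
share_toy = counts["Y"] / sum(counts.values())

# The same arithmetic with the figures reported in the answer.
share_reported = 438_491 / 940_216

print(counts, share_toy, round(share_reported, 3))
```

The denominator is the number of rows in the parties table, so the result reads as a share of parties rather than a share of collisions.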

B. The party_sobriety field takes on a value of 'B' when the party is known to
have been drinking, and under the influence of alcohol. Modify your query
from part A to answer the following question: How many parties were found
at fault while under the influence of alcohol?

SELECT
COUNT(case_id),
party_sobriety
FROM switrs.parties
WHERE at_fault = 'Y'
GROUP BY party_sobriety
HAVING party_sobriety = 'B'

33,512 parties were found at fault while under the influence of
alcohol. That is roughly 3.6% of all 940,216 parties, or about 7.6%
of the 438,491 parties at fault.
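
A filter on the grouping column itself can be written either in HAVING or directly in WHERE. The sketch below, on invented rows in an in-memory SQLite database, suggests that both forms give the same count. The sobriety code 'A' is only a placeholder for "any value other than 'B'" and is not taken from the SWITRS documentation.

```python
import sqlite3

# Toy stand-in for switrs.parties; rows and the 'A' code are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute(
    "CREATE TABLE switrs.parties (case_id INTEGER, at_fault TEXT, party_sobriety TEXT)"
)
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?, ?)",
    [(1, "Y", "B"), (1, "N", "A"), (2, "Y", "A"), (3, "Y", "B"), (4, "N", "B")],
)

# Filter applied after grouping, on the grouping column.
having_version = conn.execute("""
    SELECT COUNT(case_id), party_sobriety
    FROM switrs.parties
    WHERE at_fault = 'Y'
    GROUP BY party_sobriety
    HAVING party_sobriety = 'B'
""").fetchall()

# Filter applied before any grouping; no GROUP BY is needed for a single count.
where_version = conn.execute("""
    SELECT COUNT(case_id)
    FROM switrs.parties
    WHERE at_fault = 'Y'
      AND party_sobriety = 'B'
""").fetchone()[0]

print(having_version, where_version)
```

WHERE is usually the more conventional place for a row-level condition, while HAVING is typically reserved for conditions on aggregates such as COUNT.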

C. The oaf_1 or oaf_2 feature takes on a value of 'F' if inattention was a factor in
the collision. Modify your query to answer the following question: How many
parties were found at fault while lack of attention was a factor in the collision?

SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
WHERE oaf_1 = 'F'
OR oaf_2 = 'F'
GROUP BY at_fault

There were 18,311 parties at fault in collisions where inattention
was a factor.
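
If the at_fault condition is moved into the same WHERE clause as the two oaf checks, the OR pair needs parentheses, because AND binds more tightly than OR in SQL. The sketch below illustrates the difference on four invented rows; the '-' value is a placeholder for "no inattention code", not a real SWITRS encoding.

```python
import sqlite3

# Toy stand-in for switrs.parties; rows and the '-' code are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute(
    "CREATE TABLE switrs.parties (case_id INTEGER, at_fault TEXT, oaf_1 TEXT, oaf_2 TEXT)"
)
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?, ?, ?)",
    [
        (1, "Y", "F", "-"),  # at fault, inattention in oaf_1
        (2, "Y", "-", "F"),  # at fault, inattention in oaf_2
        (3, "N", "-", "F"),  # not at fault, inattention in oaf_2
        (4, "Y", "-", "-"),  # at fault, no inattention
    ],
)


def count_where(condition):
    """Count parties matching a WHERE condition (trusted literal strings only)."""
    query = f"SELECT COUNT(case_id) FROM switrs.parties WHERE {condition}"
    return conn.execute(query).fetchone()[0]


# Intended meaning: at fault AND (either inattention flag).
with_parens = count_where("at_fault = 'Y' AND (oaf_1 = 'F' OR oaf_2 = 'F')")

# Without parentheses this parses as (at_fault AND oaf_1) OR oaf_2,
# which also lets the not-at-fault party from case 3 through.
without_parens = count_where("at_fault = 'Y' AND oaf_1 = 'F' OR oaf_2 = 'F'")

print(with_parens, without_parens)
```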

— Task 2: When do accidents occur by day of the week?


Now that we have a way to identify whether or not a collision can be attributed to
alcohol or inattention, let’s add in the collisions table to answer the journalist’s
question of whether or not there are differences between the two accident
sources.

A. Let’s start with the collisions table on its own. Write a query that returns the
number of collisions, grouped by day of the week. Which days have the
highest number of collisions, and which days have the least number? Note:
Day of week is encoded slightly differently than what comes out of the
date_part function: Sunday is indicated by a 7 instead of a 0.

SELECT
COUNT(case_id),
day_of_week
FROM switrs.collisions
GROUP BY day_of_week
ORDER BY COUNT(case_id)

Friday (5) has the highest number of reported collisions at
75,654, whereas Sunday (7) has the fewest at 55,159. Thursday is
the runner-up for most collisions, not far behind Friday.
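
Because day_of_week is stored as a number from 1 = Monday to 7 = Sunday, it can help to translate the codes into names before reading off the result. The sketch below does this on a handful of invented collisions; the DAY_NAMES lookup is a helper introduced here for illustration.

```python
import sqlite3

# Mapping taken from the data set description: 1 = Monday ... 7 = Sunday.
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

# Toy stand-in for switrs.collisions; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute("CREATE TABLE switrs.collisions (case_id INTEGER, day_of_week INTEGER)")
conn.executemany(
    "INSERT INTO switrs.collisions VALUES (?, ?)",
    [(1, 5), (2, 5), (3, 4), (4, 7), (5, 5), (6, 4)],
)

rows = conn.execute("""
    SELECT day_of_week, COUNT(case_id) AS n
    FROM switrs.collisions
    GROUP BY day_of_week
    ORDER BY n DESC
""").fetchall()

# Replace each numeric code with its day name, keeping the sort order.
by_day = [(DAY_NAMES[day], n) for day, n in rows]
print(by_day)
```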

B. The collisions table and parties tables share values in the case_id column.
Write a new query that inner joins the two tables on that column, returning the
number of rows. How many rows are in the combined output table, and why?

SELECT
COUNT(parties.case_id)
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
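
There are 940,216 rows in the combined table. An inner join returns
one row for every matching pair of collision and party. Each party
row matches exactly one collision through its case_id, and every
collision has at least one party, so the output has one row per
party: 940,216 rows, the same as the parties table, rather than the
469,664 rows of the collisions table.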
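
The row count of the join can be reproduced on a small scale. In the sketch below, three invented collisions have two, one, and three parties respectively, and the inner join returns one row per party.

```python
import sqlite3

# Toy stand-ins for the two SWITRS tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute(
    "CREATE TABLE switrs.collisions (case_id INTEGER PRIMARY KEY, party_count INTEGER)"
)
conn.execute("CREATE TABLE switrs.parties (case_id INTEGER, party_number INTEGER)")
conn.executemany(
    "INSERT INTO switrs.collisions VALUES (?, ?)", [(1, 2), (2, 1), (3, 3)]
)
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3)],
)

n_collisions = conn.execute("SELECT COUNT(*) FROM switrs.collisions").fetchone()[0]
n_parties = conn.execute("SELECT COUNT(*) FROM switrs.parties").fetchone()[0]

# Each party row pairs with exactly one collision row via case_id.
n_joined = conn.execute("""
    SELECT COUNT(parties.case_id)
    FROM switrs.collisions AS collisions
    INNER JOIN switrs.parties AS parties
        ON collisions.case_id = parties.case_id
""").fetchone()[0]

print(n_collisions, n_parties, n_joined)
```

The joined count also equals the sum of party_count in this toy data, which is one way to sanity-check a join of this kind.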

C. Combine the queries from parts A and B to return the number of collisions
grouped by the day of the week. Add a condition for the involved parties so
that we only count accidents where the party was found to be at fault AND
under the influence of alcohol. Which days have the highest number of
collisions, and which days have the smallest number?

SELECT
COUNT(collisions.case_id),
collisions.day_of_week
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
WHERE parties.party_sobriety = 'B'
AND parties.at_fault = 'Y'
GROUP BY collisions.day_of_week
ORDER BY COUNT(collisions.case_id) DESC

In this query, we count collisions by day of the week where a
party was deemed at fault and under the influence of alcohol.
Sunday (7) is the day with the most accidents under these
conditions, and Tuesday (2) is the day with the fewest. This is
notable because Sunday was previously the day with the fewest
collisions overall; only after joining in the parties table can we
see that Sunday had the most drunk driving incidents.
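
One caveat worth keeping in mind: after the join, COUNT(collisions.case_id) counts party rows, so a collision with more than one qualifying party would be counted more than once. Whether the real data contains such cases is not established here. The sketch below uses invented rows in which one collision has two at-fault, alcohol-involved parties, and compares the plain count with COUNT(DISTINCT ...). The sobriety code 'A' is a placeholder.

```python
import sqlite3

# Toy stand-ins for the two SWITRS tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute("CREATE TABLE switrs.collisions (case_id INTEGER, day_of_week INTEGER)")
conn.execute(
    "CREATE TABLE switrs.parties (case_id INTEGER, at_fault TEXT, party_sobriety TEXT)"
)
conn.executemany(
    "INSERT INTO switrs.collisions VALUES (?, ?)", [(1, 7), (2, 7), (3, 2), (4, 2)]
)
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?, ?)",
    [
        (1, "Y", "B"), (1, "Y", "B"),  # two qualifying parties in one collision
        (2, "Y", "B"),
        (3, "Y", "B"),
        (4, "N", "B"), (4, "Y", "A"),  # no party is both at fault and 'B'
    ],
)

rows = conn.execute("""
    SELECT
        collisions.day_of_week,
        COUNT(collisions.case_id) AS party_rows,
        COUNT(DISTINCT collisions.case_id) AS distinct_collisions
    FROM switrs.collisions AS collisions
    INNER JOIN switrs.parties AS parties
        ON collisions.case_id = parties.case_id
    WHERE parties.party_sobriety = 'B'
      AND parties.at_fault = 'Y'
    GROUP BY collisions.day_of_week
    ORDER BY collisions.day_of_week
""").fetchall()

print(rows)
```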

D. Modify your query to look at the number of accidents by the day of the week
where the party was found to be at fault AND inattention was a factor. Which
days have the highest number of collisions, and which days have the smallest
number?

SELECT
COUNT(collisions.case_id),
collisions.day_of_week
FROM switrs.collisions AS collisions
INNER JOIN switrs.parties AS parties
ON collisions.case_id = parties.case_id
WHERE parties.at_fault = 'Y'
AND (parties.oaf_1 = 'F'
OR parties.oaf_2 = 'F')
GROUP BY collisions.day_of_week
ORDER BY COUNT(collisions.case_id) DESC

Friday has the most collisions in which the party was at fault
and inattention was a factor, whereas Sunday has the fewest.
The number of inattention-related collisions rises gradually
through the week, peaks on Friday, and drops off again by
Sunday.

— Task 3: When do accidents occur by the time of day?


A data analyst colleague of yours has taken interest in your project with the
journalist and has pitched in their own contribution by providing you a summary of
the dataset with five features:
● alcohol_involved - TRUE/FALSE whether or not the party at fault was under
the influence of alcohol
● inattention_involved - TRUE/FALSE whether or not inattention was a factor
for the party at fault
● day_of_week - day of week when collision occurred. Note that numbering
starts at 1 = Monday and ends at 7 = Sunday (instead of 0 = Sunday)
● hour_of_day - hour of day when collision occurred, in 24 hour format
(0-2300). Values of 2500 indicate an unknown time of day.
● n_collisions - number of collisions matching the conditions of the first four
columns

Let’s use this new data summary to look at how accident patterns change based on
the time of day. Since the data has already been queried, we’ll do this visually within
Tableau! Click this link to navigate to the workbook you’ll use to complete the
remainder of this Milestone. Once you’ve published your Tableau Workbook in the
folder named Upload Workbooks Here, paste the Share Link in the box below.

https://prod-useast-b.online.tableau.com/#/site/globaltech/workbooks/733296?:origin=card_share_link

Continue to post your answers in the provided boxes: purple boxes for your
visualizations, and blue boxes for text-based answers.

A. On Sheet 1, create a bar chart of the number of collisions by the hour of day.
Describe the pattern in the data. Are there times of day where more
accidents occur? Does this fit in with your expectations?

From the visualization, collisions are most frequent at around
1700 (5 pm). This is most likely due to rush hour, when people
commute home from work. There is also a smaller peak at
7-8 am, most likely caused by the morning commute.

B. Copy the chart into a new sheet and add a filter so that the bar chart only
shows accidents where the party at fault was found to be under the influence
of alcohol. How does this distribution of accidents by time of day compare
to the overall distribution?

Compared to the first bar chart, this visualization tells a
completely different story. The distribution is roughly inverted:
most collisions that involve alcohol take place at night and in the
very early morning hours, peaking at 2 am. A quick Google search
suggests that a majority of bars in California close at either 1 or
2 am most nights.

C. Copy the chart into one more sheet, but now change the filter to only look at
accidents where inattention was a factor from the party-at-fault. How does
this distribution compare to the overall distribution?

This visualization closely resembles the first one: most
collisions due to inattentive driving occur when the roads are
busiest, again peaking at 5 pm.
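
The three bar charts amount to summing n_collisions by hour_of_day under different filters. The sketch below reproduces that aggregation on invented rows. The table name summary, the storage of TRUE/FALSE as 1/0, and the helper functions are assumptions made for illustration; the rows do not come from the actual workbook.

```python
import sqlite3

# Toy version of the colleague's summary table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE summary (
        alcohol_involved INTEGER,      -- assumed encoding: 1 = TRUE, 0 = FALSE
        inattention_involved INTEGER,  -- assumed encoding: 1 = TRUE, 0 = FALSE
        day_of_week INTEGER,
        hour_of_day INTEGER,
        n_collisions INTEGER
    )
""")
conn.executemany(
    "INSERT INTO summary VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 5, 1700, 10),
        (0, 1, 5, 1700, 4),
        (1, 0, 7, 200, 6),
        (1, 0, 5, 1700, 1),
        (0, 0, 1, 800, 5),
        (0, 0, 3, 2500, 2),  # 2500 marks an unknown time of day
    ],
)


def collisions_by_hour(extra_filter="1 = 1"):
    """Sum collisions per hour, skipping the 2500 'unknown time' code."""
    query = f"""
        SELECT hour_of_day, SUM(n_collisions)
        FROM summary
        WHERE hour_of_day <> 2500 AND {extra_filter}
        GROUP BY hour_of_day
        ORDER BY hour_of_day
    """
    return conn.execute(query).fetchall()


def peak_hour(rows):
    """Return the hour with the largest collision total."""
    return max(rows, key=lambda row: row[1])[0]


overall = collisions_by_hour()
alcohol = collisions_by_hour("alcohol_involved = 1")
inattention = collisions_by_hour("inattention_involved = 1")

print(peak_hour(overall), peak_hour(alcohol), peak_hour(inattention))
```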

— Level Up
Simply because an accident was such that inattention was a factor does not
necessarily mean that a cell phone was the source of the driver’s distraction. In the
parties table, there is a column called sp_info_2. This feature takes on a value of B,
1, or 2 if a cell phone was known to be in use at the time of the accident. If you’re
interested in digging deeper, you might want to try seeing what proportion of
accidents were caused by cell phone distraction, and if they differ from other
‘inattention’ accidents. Keep in mind that the sp_info_2 column is a string data type,
so you’ll need to treat the '1', and '2' codes appropriately!

SELECT
sp_info_2,
COUNT(case_id)
FROM switrs.parties
WHERE sp_info_2 = '1'
OR sp_info_2 = '2'
OR sp_info_2 = 'B'
GROUP BY sp_info_2;

SELECT
COUNT(case_id),
at_fault
FROM switrs.parties
WHERE oaf_1 = 'F'
OR oaf_2 = 'F'
GROUP BY at_fault;

For this, I ran two separate queries so I could compare the number
of parties with known cell phone use to the number of at-fault
parties with inattention as a factor. The first query returns a
total of 12,010 parties with a cell phone code, and the second
returns 18,311 at-fault parties where inattention was a factor. The
ratio is around 66% (12,010 / 18,311), but it is only a rough
comparison: the first query is not restricted to at-fault parties
or to parties with an inattention code, so the 12,010 are not
necessarily a subset of the 18,311.
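
One way to get a proportion in which the numerator is a subset of the denominator is to compute both counts in a single query over the same filtered rows. The sketch below does this on invented rows; '-' is a placeholder for "no code recorded" and is not a real SWITRS value.

```python
import sqlite3

# Toy stand-in for switrs.parties; rows and the '-' code are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS switrs")
conn.execute("""
    CREATE TABLE switrs.parties (
        case_id INTEGER, at_fault TEXT, oaf_1 TEXT, oaf_2 TEXT, sp_info_2 TEXT
    )
""")
conn.executemany(
    "INSERT INTO switrs.parties VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Y", "F", "-", "1"),  # at fault, inattentive, phone in use
        (2, "Y", "-", "F", "-"),  # at fault, inattentive, no phone code
        (3, "Y", "F", "-", "B"),  # at fault, inattentive, phone in use
        (4, "N", "F", "-", "2"),  # phone in use, but not at fault
        (5, "Y", "-", "-", "2"),  # phone in use, but no inattention code
        (6, "Y", "F", "-", "-"),  # at fault, inattentive, no phone code
    ],
)

# Both counts are taken over the same set: at-fault parties with inattention.
# sp_info_2 is a string column, so '1' and '2' are compared as text.
inattentive, with_phone = conn.execute("""
    SELECT
        COUNT(*),
        SUM(CASE WHEN sp_info_2 IN ('B', '1', '2') THEN 1 ELSE 0 END)
    FROM switrs.parties
    WHERE at_fault = 'Y'
      AND (oaf_1 = 'F' OR oaf_2 = 'F')
""").fetchone()

# For contrast: every party with a phone code, regardless of fault or inattention.
all_phone = conn.execute(
    "SELECT COUNT(*) FROM switrs.parties WHERE sp_info_2 IN ('B', '1', '2')"
).fetchone()[0]

share = with_phone / inattentive
print(inattentive, with_phone, all_phone, share)
```

In this toy data the unrestricted phone count (4) equals the inattention count (4), which would suggest 100%, while the share computed within the same filtered rows is 50%.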

— Submission
Great work completing this Milestone! To submit your completed Milestone, you will
need to download / export this document as a PDF and then upload it to the
Milestone submission page. You can find the option to download as a PDF from the
File menu in the upper-left corner of the Google Doc interface.
