PMT2 21
This problem builds on your knowledge of Pandas and Numpy. It has 8 exercises, numbered 0 to 7. There
are 13 available points. However, to earn 100%, the threshold is just 11 points. (Therefore, once you hit 11
points, you can stop. There is no extra credit for exceeding this threshold.)
Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't
solve an exercise, you can still move on and try the next one. However, if you see a code cell introduced
by the phrase, "Sample result for ...", please run it. Some demo cells in the notebook may depend on
these precomputed results.
Pro-tips.
Many or all test cells use randomly generated inputs. Therefore, try your best to write solutions
that do not assume too much. To help you debug, when a test cell does fail, it will often tell you
exactly what inputs it was using and what output it expected, compared to yours.
If your program's behavior seems strange, try resetting the kernel and rerunning everything.
If you mess up this notebook or just want to start from scratch, save copies of all your partial
responses and use Actions → Reset Assignment to get a fresh, original copy of this
notebook. (Resetting will wipe out any answers you've written so far, so be sure to stash those
somewhere safe if you intend to keep or reuse them!)
If you generate excessive output (e.g., from an ill-placed print statement) that causes the
notebook to load slowly or not at all, use Actions → Clear Notebook Output to get a clean
copy. The clean copy will retain your code but remove any generated output. However, it will also
rename the notebook to clean.xxx.ipynb. Since the autograder expects a notebook file with
the original name, you'll need to rename the clean notebook accordingly. Be forewarned: we won't
manually grade "cleaned" notebooks if you forget!
Good luck!
This problem is designed to test your fluency with pandas and Numpy, as well as your ability to quickly connect what you know with new tools. Once you've loaded the data, the overall workflow proceeds step by step through the exercises below.
Setup
Run the code cell below to load some modules that subsequent cells will need.
In [29]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
import pandas as pd
import numpy as np
import scipy as sp
from pprint import pprint # For pretty-printing native Python data structures
from testing_tools import load_df, load_geopandas
The NYC Taxi Dataset that you will analyze contains records for taxi rides or trips. Each trip starts in one "zone" and ends in another. The NYC Metropolitan area is divided into 265 "zones."
Run the cell below, which loads metadata about these zones into a pandas dataframe named zones.
In [30]:
Each zone has a unique integer ID (the LocationID column), a name (Zone), and an administrative district
(Borough).
Note that all location IDs from 1 to len(zones) are represented in this dataframe. However, you should not
assume that in the exercises below.
In [31]:
zones['LocationID'].describe()
count 265.000000
mean 133.000000
std 76.643112
min 1.000000
25% 67.000000
50% 133.000000
75% 199.000000
max 265.000000
Name: LocationID, dtype: float64
Note: Your function must not modify the input dataframe, zones. The test cell will check for
that and may fail with an error if it detects a change.
In [32]:
# 10 mins
def zones_to_dict(zones):
    ###
    ### YOUR CODE HERE
    ###
    df = zones.copy()  # work on a copy so the input dataframe is not modified
    df = df.set_index('LocationID')
    df['value'] = df['Zone'].str.strip() + ", " + df['Borough'].str.strip()
    df = df[['value']]
    d = df.to_dict()['value']  # maps each LocationID to "Zone, Borough"
    return d
In [33]:
# Demo:
zones_to_dict(zones.iloc[:3]) # Sample output on the first three examples of `zones`
Out[33]:
{1: 'Newark Airport, EWR',
 2: 'Jamaica Bay, Queens',
 3: 'Allerton/Pelham Gardens, Bronx'}
In [34]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
zones_to_dict__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
Read and run this cell even if you skipped or otherwise did not complete Exercise 0.
In [35]:
print("\nExamples:")
for loc_id in range(1, 6):
print("* Location", loc_id, "=>", zones_dict[loc_id])
Examples:
* Location 1 => Newark Airport, EWR
* Location 2 => Jamaica Bay, Queens
* Location 3 => Allerton/Pelham Gardens, Bronx
* Location 4 => Alphabet City, Manhattan
* Location 5 => Arden Heights, Staten Island
Complete the function, path_to_zones(p, zones_dict), below. It takes as input two objects: a path p, given as a list of location IDs, and zones_dict, a dictionary mapping location IDs to zone names (like the result of Exercise 0).
It should output a Python list of zone names, in the same sequence as they appear in the path p. However, these zone names should be formatted to include the location ID, using the specific format, "{loc_id}. {zone_borough_name}". For example, location ID 3 would become "3. Allerton/Pelham Gardens, Bronx" (see the demo below).
In [36]:
def path_to_zones(p, zones_dict):
    ###
    ### YOUR CODE HERE
    ###
    dict_map = [f"{loc_id}. {zones_dict[loc_id]}" for loc_id in p]
    return dict_map
In [37]:
# Demo:
path_to_zones([3, 2, 1], zones_dict)
Out[37]:
['3. Allerton/Pelham Gardens, Bronx',
 '2. Jamaica Bay, Queens',
 '1. Newark Airport, EWR']
In [38]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
path_to_zones__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
In [39]:
!date

taxi_trips_raw_dfs = []
for month in ['06']: #, '07', '08']:
    taxi_trips_raw_dfs.append(load_df(f"nyc-taxi-data/yellow_tripdata_2019-{month}.csv",
                                      parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']))
taxi_trips_raw = pd.concat(taxi_trips_raw_dfs)
del taxi_trips_raw_dfs # Save some memory

!date
In [40]:
print(f"The raw taxi trips data has {len(taxi_trips_raw):,} records (rows). Her
e's a sample:")
taxi_trips_raw.head()
The raw taxi trips data has 6,941,024 records (rows). Here's a sampl
e:
Out[40]:
Let's start by "focusing" our attention on just the columns we'll need in this problem.
1. Pick-up location ID, 'PULocationID', which should be renamed to 'I' in the new dataframe.
2. Drop-off location ID, 'DOLocationID', which should be renamed to 'J'.
3. Trip distance in miles, 'trip_distance', which should be renamed to 'D' (for "distance").
4. The fare amount (cost) in dollars, 'fare_amount', which should be renamed to 'C' (for "cost").
5. The pick-up time, 'tpep_pickup_datetime', which should be renamed to 'T_start'.
6. The drop-off time, 'tpep_dropoff_datetime', which should be renamed to 'T_end'.
I J D C T_start T_end
Note 0: The test code will use randomly generated columns and values. Your function should depend only on the columns you need to keep. It should not depend on the order of columns or on specific column names other than those in the list above. For instance, the example above contains a column named 'VendorID'; since that is not a column we need for the output, your solution should work whether or not the input has a column named 'VendorID'.
Note 1: The order of columns or rows in the returned dataframe will not matter, since the test
code uses a tibble-equivalency test to check your answer against the reference solution.
In [41]:
# 5 min
def focus(trips_raw):
    ###
    ### YOUR CODE HERE
    ###
    df = trips_raw.copy()
    df = df[['PULocationID', 'DOLocationID', 'trip_distance', 'fare_amount',
             'tpep_pickup_datetime', 'tpep_dropoff_datetime']]
    df = df.rename(columns={'PULocationID': 'I', 'DOLocationID': 'J',
                            'trip_distance': 'D', 'fare_amount': 'C',
                            'tpep_pickup_datetime': 'T_start',
                            'tpep_dropoff_datetime': 'T_end'})
    return df
In [42]:
# Demo:
focus(taxi_trips_raw.iloc[:3])
Out[42]:
I J D C T_start T_end
In [43]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
focus__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
Read and run this cell even if you skipped or otherwise did not complete Exercise 2.
In [44]:
I J D C T_start T_end T
In [45]:
trips.head(3)
Out[45]:
I J D C T_start T_end T
Suppose we want to know the duration of the very first ride (row 0). The contents of the T_start and
T_end columns are special objects for storing date/timestamps:
In [46]:
print(type(trips['T_start'].iloc[0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
You can use simple arithmetic to compute the time difference, which produces a special timedelta object
[documentation link (https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html)]:
In [47]:
t_start_demo = trips['T_start'].iloc[0]
t_end_demo = trips['T_end'].iloc[0]
dt_demo = t_end_demo - t_start_demo
print(dt_demo, "<==", type(dt_demo))
This ride was evidently a short one, lasting just over 1 minute (1 minute and 4 seconds).
These timedelta objects have special accessor fields, too. For example, if you want to convert this value to
seconds, you can use the .total_seconds() function [docs
(https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.total_seconds.html)]:
In [48]:
dt_demo.total_seconds()
Out[48]:
64.0
Vectorized datetime accessors via .dt. Beyond one-at-a-time access, there is another, faster way to do
operations on any datetime or timedelta Series object using the .dt accessor. For example, here we
calculate the time differences and extract the seconds for the first 3 rows:
In [49]:
dt_demo3 = trips['T_end'].head(3) - trips['T_start'].head(3)  # vectorized differences
display(dt_demo3)
display(dt_demo3.dt.total_seconds())  # vectorized conversion to seconds
0 0 days 00:01:04
1 0 days 00:00:21
2 0 days 00:19:33
dtype: timedelta64[ns]
0 64.0
1 21.0
2 1173.0
dtype: float64
Complete the function, get_minutes(trips), below. Given a dataframe like trips, whose columns include 'T_start' and 'T_end', it should return a pandas Series holding each trip's duration in (fractional) minutes. For example, the first three rows of trips would yield:
0     1.066667
1     0.350000
2    19.550000
dtype: float64
Note 1: The index of your Series should match the index of the input trips.
In [50]:
#10mins
def get_minutes(trips):
    ###
    ### YOUR CODE HERE
    ###
    trips = trips.copy()
    trips['time_delta'] = (trips.T_end - trips.T_start) / pd.Timedelta(minutes=1)
    return trips['time_delta']
In [51]:
# Demo:
get_minutes(trips.head(3))
Out[51]:
0 1.066667
1 0.350000
2 19.550000
Name: time_delta, dtype: float64
In [52]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
get_minutes__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
Read and run this cell even if you skipped or otherwise did not complete Exercise 3.
In [53]:
I J D C T_start T_end T
In [54]:
assert 'trip_times' in globals(), "*** Be sure you ran the 'sample results' cell for Exercise 3 ***"
Out[54]:
D C T
The trip-distances column ('D') has distances as small as 0 miles (min value) and as large as
45,000 miles (max value).
The fare-amount or cost column ('C') includes negative costs (-305 US dollars) and costs as large
as 346,949.99 USD. (Sure, NYC is expensive, but ... seriously?)
The trip-times data also includes negative values, as well as times as high as 1,500 minutes (over
24 hours!).
It's possible these are legitimate data points. But to avoid skewing our later analyses too much, let's get rid
of them.
For instance, we might want to keep a value x only if it lies in the right-open interval [4, 20), meaning greater than or equal to 4 and strictly less than 20 (4 ≤ x < 20). Or perhaps we only want x ∈ [−3, 5], meaning between -3 and 5 inclusive (−3 ≤ x ≤ 5). Or maybe x ∈ (10, ∞), meaning strictly greater than 10 with no upper-bound (10 < x).
Complete the function, filter_bounds(s, ...), below, so that it can implement bounds-based filters like those shown above. (Note the use of default values.)
An interval may be empty. For instance, the interval (3, 2) has its "lower" bound of 3 greater than its "upper"
bound of 2. In instances like this one, your function should return all False values.
By "inclusive" versus "strict," we mean the following. Suppose lower=4.5. Then setting
include_lower=True means we want to keep any value that is greater than or equal to
4.5; setting it to False means we only want values that are strictly greater than 4.5. The
upper-bound is treated similarly. Therefore, a right-open bound like [4, 20) becomes
lower=4, upper=20, include_lower=True, include_upper=False; and a bound
like (10, ∞) becomes lower=10, upper=None. (When lower or upper are None, the
include_lower and include_upper flags, respectively, can be ignored.)
Your function should return a pandas Series of boolean values (True or False), where an entry is True only if the corresponding value of s lies within the desired bounds. For example, suppose s is the pandas Series ex4_s_demo defined in the demo below, holding the values [10, 2, -10, -2, -9, 9, 5, 1, 2, 8]. Then:
filter_bounds(ex4_s_demo, lower=2) \
    == pd.Series([True, False, False, False, False, True, True, False, False, True])
       # 10,    2,    -10,   -2,    -9,    9,    5,    1,     2,    8
Note: Your Series should have the same index as the input s.
In [77]:
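A minimal sketch of one possible solution. The parameter names follow the exercise text, but the exact signature and default values here are assumptions:

def filter_bounds(s, lower=None, include_lower=True, upper=None, include_upper=True):
    # NOTE: the defaults above are assumed, not taken from the original notebook.
    # Start by keeping every value, then AND-in each requested bound.
    mask = pd.Series(True, index=s.index)  # same index as the input, per the note above
    if lower is not None:
        mask &= (s >= lower) if include_lower else (s > lower)
    if upper is not None:
        mask &= (s <= upper) if include_upper else (s < upper)
    return mask  # an empty interval like (3, 2) naturally yields all False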
In [78]:
# Demo
ex4_s_demo = pd.Series([10, 2, -10, -2, -9, 9, 5, 1, 2, 8])
print(f"Input:\n{ex4_s_demo.values}")
Input:
[ 10 2 -10 -2 -9 9 5 1 2 8]
Note: There are three test cells below for Exercise 4, meaning it is possible to get partial
credit if only a subset pass.
In [79]:
def mt2_ex4a_filter_bounds_check():
    from testing_tools import mt2_ex4__check
    print("Testing...")
    for trial in range(100):
        mt2_ex4__check(filter_bounds, lower=True, include_lower=True, upper=True, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=True, upper=True, include_upper=False)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=True, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=True, include_upper=False)

mt2_ex4a_filter_bounds_check()
filter_bounds_a__passed = True
print("\n(Passed!)")
Testing...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-79-feced6ad720d> in <module>
     10         mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=True, include_upper=False)
     11
---> 12 mt2_ex4a_filter_bounds_check()
     13 filter_bounds_a__passed = True
     14 print("\n(Passed!)")

<ipython-input-79-feced6ad720d> in mt2_ex4a_filter_bounds_check()
      8         mt2_ex4__check(filter_bounds, lower=True, include_lower=True, upper=True, include_upper=False)
      9         mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=True, include_upper=True)
---> 10         mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=True, include_upper=False)
     11
     12 mt2_ex4a_filter_bounds_check()

~/testing_tools.py in <listcomp>(.0)
    490     a = lower+(not include_lower) if lower is not None else -infinity
    491     b = upper-(not include_upper) if upper is not None else infinity
--> 492     middle = [randint(a, b) for _ in range(1, 5)]
    493     left_flags = [False] * len(left)
    494     middle_flags = [True] * len(middle)

/usr/local/lib/python3.8/random.py in randint(self, a, b)
    246         """
    247
--> 248         return self.randrange(a, b+1)
    249
    250     def _randbelow_with_getrandbits(self, n):
In [80]:
def mt2_ex4b_filter_bounds_check():
    from testing_tools import mt2_ex4__check
    print("Testing...")
    for trial in range(50):
        mt2_ex4__check(filter_bounds, lower=False, include_lower=True, upper=True, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=False, include_lower=True, upper=True, include_upper=False)
        mt2_ex4__check(filter_bounds, lower=False, include_lower=False, upper=True, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=False, include_lower=False, upper=True, include_upper=False)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=True, upper=False, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=True, upper=False, include_upper=False)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=False, include_upper=True)
        mt2_ex4__check(filter_bounds, lower=True, include_lower=False, upper=False, include_upper=False)
        mt2_ex4__check(filter_bounds, lower=False, include_lower=False, upper=False, include_upper=False)

mt2_ex4b_filter_bounds_check()
filter_bounds_b__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
In [81]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
def mt2_ex4c_filter_bounds_check():
    from testing_tools import mt2_ex4__check
    print("Testing...")
    for trial in range(50):
        mt2_ex4__check(filter_bounds, lower=False, include_lower=False, upper=False, include_upper=False)

mt2_ex4c_filter_bounds_check()
filter_bounds_c__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
We did this filtering for you. In particular, we kept only those trips whose distance ('D'), cost ('C'), and trip time ('T') fall within reasonable bounds. The code cell below loads a precomputed trips_clean. Observe from its descriptive statistics, below, that these bounds are indeed satisfied.
Read and run this cell even if you skipped or otherwise did not complete Exercise 4.
In [56]:
Out[56]:
D C T
Let trip_coords be a dataframe consisting of just two columns, 'I', and 'J', taken from trips or
trips_clean, for instance. Complete the function, count_trips(trip_coords, min_trips=0), so
that it counts the number of start/end pairs and retains only those where the count is at least a certain value.
trip_coords: a pandas DataFrame with two columns, 'I' and 'J', indicating start/end zone
pairs
min_trips: the minimum number of trips to consider (the default is 0, meaning include all pairs)
The function should return a pandas DataFrame object with three columns derived from the input trip_coords: 'I' and 'J', holding the distinct start/end pairs, and 'N', holding the number of times each pair occurs.
Your function should only include start-end pairs where 'N' >= min_trips.
For example, suppose trip_coords holds the following ten trips:
I J
139 28
169 51
231 128
169 51
169 51
169 51
139 28
85 217
231 128
231 128
Then count_trips(trip_coords, min_trips=3) would return
I J N
169 51 4
231 128 3
which omits the pairs (85, 217) and (139, 28) since they appear only once and twice, respectively.
Note: If no pair meets the minimum trips threshold, your function should return a DataFrame
with the required columns but no rows.
In [57]:
# 10mins
def count_trips(trip_coords, min_trips=0):
    # Count occurrences of each (I, J) pair...
    grouped_df = trip_coords.groupby(['I', 'J']).size().reset_index(name='N')
    # ... then keep only pairs meeting the threshold. (If none do, this
    # yields a frame with columns 'I', 'J', 'N' but no rows.)
    grouped_df_min = grouped_df[grouped_df['N'] >= min_trips]
    return grouped_df_min
In [58]:
# Demo:
ex5_df = trips_clean[((trips_clean['I'] == 85) & (trips_clean['J'] == 217))
                     | ((trips_clean['I'] == 139) & (trips_clean['J'] == 28))
                     | ((trips_clean['I'] == 231) & (trips_clean['J'] == 128))
                     | ((trips_clean['I'] == 169) & (trips_clean['J'] == 51))] \
                    [['I', 'J']] \
                    .reset_index(drop=True)
display(ex5_df)
count_trips(ex5_df, min_trips=3)
I J
0 169 51
1 231 128
2 139 28
3 231 128
4 169 51
5 139 28
6 231 128
7 85 217
8 169 51
9 169 51
Out[58]:
I J N
2 169 51 4
3 231 128 3
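Per the note above, a threshold that no pair can meet should produce the three columns but no rows. A quick, hypothetical spot-check:

count_trips(ex5_df, min_trips=100)  # no pair occurs 100 times ==> empty frame with columns I, J, N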
In [59]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
count_trips__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
The code cell below loads that result into the global object trip_counts (as distinct from your
count_trips() function).
Read and run this cell even if you skipped or otherwise did not complete Exercise 5.
In [60]:
display(trip_counts.sample(5))
trip_counts['N'].describe()
I J N
6348 68 90 4500
Out[60]:
count 4515.000000
mean 1392.024142
std 2315.544401
min 100.000000
25% 206.000000
50% 502.000000
75% 1546.000000
max 40526.000000
Name: N, dtype: float64
In [61]:
# Imports assumed for this cell (likely provided by an earlier cell in the original notebook):
from matplotlib.pyplot import figure, subplot, xlabel
from seaborn import histplot

figure(figsize=(18, 6))
subplot(1, 3, 1)
histplot(data=trips_clean, x='D', stat='density')
xlabel("Distance (miles)")
subplot(1, 3, 2)
histplot(data=trips_clean, x='T', stat='density')
xlabel("Time (minutes)")
subplot(1, 3, 3)
histplot(data=trips_clean, x='C', stat='density')
xlabel("Cost (US Dollars)")
pass
Complete the function, part_of_day(tss), below. The input tss is a pandas Series containing datetime objects, just like trips_clean['T_start'] or trips_clean['T_end'].
Your function should determine the hour, as an integer between 0-23 inclusive, corresponding to a 24-hour clock. Hint: Per Exercise 3, recall that a datetime Series s has an accessor s.dt; the attribute s.dt.hour will return the hour as a value in the interval [0, 23]; see this link (https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.hour.html) if you need details.
Your function should then return a new pandas Series object with hour ranges converted to strings as follows: hours 0-5 become 'wee hours', 6-11 become 'morning', 12-17 become 'afternoon', and 18-23 become 'evening'. For example, suppose tss is the Series
23 2019-06-01 00:30:42
39781 2019-06-01 06:42:38
164505 2019-06-01 17:40:07
404098 2019-06-02 18:35:08
Name: T_start, dtype: datetime64[ns]
(The leftmost column of this example shows hypothetical index values.) Observe that the hours are 0, 6, 17,
and 18. Therefore, your function would return a Series with these values:
23 wee hours
39781 morning
164505 afternoon
404098 evening
Name: T_start, dtype: object
Note: Your Series should have the same index as the input tss, as suggested by the
example above.
In [62]:
def part_of_day(tss):
    ### 1. divide dt.hour by 6 to segment hours into 4 parts
    ### 2. astype(int) floors the series
    ### 3. map() maps each int to its string description
    segments = (tss.dt.hour / 6).astype(int)
    return segments.map({0: 'wee hours', 1: 'morning', 2: 'afternoon', 3: 'evening'})
In [63]:
# Demo:
print("* Sample input `Series`:")
ex6_demo = trips_clean['T_start'].iloc[[20, 37752, 155816, 382741]]
display(ex6_demo)
print("* Your output:")
part_of_day(ex6_demo)
23 2019-06-01 00:30:42
39781 2019-06-01 06:42:38
164505 2019-06-01 17:40:07
404098 2019-06-02 18:35:08
Name: T_start, dtype: datetime64[ns]
* Your output:
Out[63]:
23 wee hours
39781 morning
164505 afternoon
404098 evening
Name: T_start, dtype: object
In [64]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
part_of_day__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
The code cell below loads that result into the global object trips_pod and runs a simple aggregation query
to summarize the median distances, costs, and trip times by part-of-day.
Read and run this cell even if you skipped or otherwise did not complete Exercise 6.
In [65]:
trips_pod = trips_clean.copy()
trips_pod['P'] = pod
trips_pod[['P', 'D', 'C', 'T']].groupby('P').agg('median')
Out[65]:
D C T
Perhaps unsurprisingly, people tend to travel longer distances in the "wee hours," but it takes less time to do
so (presumably due to less traffic).
By analogy, when you are shopping for flights, you might sometimes find that a route through a particular
city (e.g., New York to Houston to Los Angeles) is cheaper than flying directly from New York to Los Angeles.
Are there such potential routes in the taxi dataset?
Direct "routes." The taxi dataset itself contains "direct routes" between pairs of zones.
To start, for each pair of zones, let's calculate the median trip cost.
In [66]:
# For each (I, J) pair, compute the median fare:
pair_costs = trips_clean[['I', 'J', 'C']].groupby(['I', 'J']).median().reset_index()
pair_costs.head()
Out[66]:
I J C
0 1 1 89.0
1 1 158 30.0
2 1 161 8.5
3 1 162 55.0
4 1 163 75.0
In the sample output above, the columns 'I' and 'J' are the starting and ending zones, and C is the
median (dollar) cost to travel from zone 'I' to zone 'J'. Here are the most expensive zone-to-zone trips:
In [67]:
pair_costs.sort_values(by='C', ascending=False).head()
Out[67]:
I J C
8321 83 1 99.0
2938 37 1 98.5
2180 28 1 97.5
For the path analysis, we'll need to convert pair_costs into a sparse matrix representation. That is your
next (and final) task.
Complete the function, make_csr(pair_costs, n), below. It should return a Scipy sparse matrix in CSR (compressed sparse row) format. For the nonzero coordinates, use the zone IDs, pair_costs['I'] and pair_costs['J'], as-is. For the nonzero values, use the cost, pair_costs['C']. For example, suppose pair_costs contains the following:
I J C
1 1 89
3 3 10
4 1 70
4 3 46
4 4 5
The matrix dimension must be n >= 5; suppose we take it to be n=5. Then the corresponding sparse matrix is, logically, as follows (blanks are zeroes):
    0     1     2     3     4
0
1        89.0
2
3                    10.0
4        70.0        46.0   5.0
You need to construct this matrix and store it as a Scipy CSR sparse matrix object.
Note: Assume coordinates start at 0 and end at n-1, inclusive. If any zones IDs are missing,
which may have happened during our filtering, those will simply become zero rows and
columns in the matrix, as shown in the above example where there are no coordinates for
row/column 0 or row/column 2.
In [68]:
# 10 mins
def make_csr(pair_costs, n):
    from scipy.sparse import csr_matrix
    row = pair_costs['I']
    col = pair_costs['J']
    data = pair_costs['C']
    shp = (n, n)  # zone IDs run from 0 to n-1, inclusive
    csr = csr_matrix((data, (row, col)), shape=shp)
    return csr
In [69]:
# Demo:
ex7_demo = pair_costs[(pair_costs['I'] <= 4) & (pair_costs['J'] <= 4)]
display(ex7_demo)
ex7_csr = make_csr(ex7_demo, n=5)

# Try to visualize:
from matplotlib.pyplot import spy
spy(ex7_csr);
I J C
0 1 1 89.0
19 3 3 10.0
71 4 1 70.0
72 4 3 46.0
73 4 4 5.0
In [70]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
make_csr__passed = True
print("\n(Passed!)")
Testing...
(Passed!)
The code cell below loads that result into the global object Cost_matrix.
Read and run this cell even if you skipped or otherwise did not complete Exercise 7.
In [71]:
Congrats, you’ve reached the end of this exam problem. Don’t forget to restart and run all cells again to
make sure it’s all working when run in sequence; and make sure your work passes the submission process.
Good luck!
Epilogue. If you have some time to spare, the rest of this notebook shows you how to use the infrastructure
you just built to do an interesting analysis, namely, looking for indirect paths between locations that might be
cheaper than going "directly" between those locations.
This analysis relies on a standard Python module for graph analysis called NetworkX (https://networkx.org/).
Recall that a sparse matrix can be interpreted as a weighted graph of interconnected vertices, where we can
assign a cost or weight to each edge that directly connects two vertices. Let's start by constructing this
graph.
In [72]:
from networkx import from_scipy_sparse_matrix  # builds a weighted graph from a sparse matrix
Cost_graph = from_scipy_sparse_matrix(Cost_matrix)
The weight of every edge of this graph is the value of the corresponding entry of the sparse matrix. For
instance:
In [73]:
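A minimal sketch of this kind of lookup, assuming NetworkX's standard G[u][v] edge-attribute access; the 99.0 fare for the direct edge from zone 83 to zone 1 comes from pair_costs above:

print(Cost_graph[83][1])  # attributes of the direct edge 83 -> 1, e.g., {'weight': 99.0}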
Shortest paths. One cool aspect of the NetworkX graph representation is that we can perform graph
queries. For example, here is a function that will look for the shortest path---that is, the sequence of vertices
such that traversing their edges yields a path whose total weight is the smallest among all possible paths.
Indeed, that path can be cheaper than the direct path, as you'll see momentarily!
The function get_shortest_path(G, i, j) finds the shortest path in the graph G going between i and
j, and returns the path as a list of vertices along with the length of that path:
In [74]:
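A minimal sketch of such a function, assuming NetworkX's dijkstra_path and dijkstra_path_length helpers (a sketch, not necessarily the notebook's exact implementation):

import networkx as nx

def get_shortest_path(G, i, j):
    path = nx.dijkstra_path(G, i, j)           # vertices along the minimum-weight path
    length = nx.dijkstra_path_length(G, i, j)  # total weight of that path
    return path, length

get_shortest_path(Cost_graph, 83, 1)  # ==> ([83, 233, 156, 1], 69.5), per the discussion below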
In the example above, the path starting at 83 and going through 233 and 156 before arriving at 1 has a cost
of 69.5. Compare that to the direct path cost of 99!
Here is a visual representation of that path (run the next two cells).
In [75]:
shapes = load_geopandas('nyc-taxi-data/zone-shapes/geo_export_28967859-3b38-43de-a1a2-26aee980d05c.shp')
shapes['location_i'] = shapes['location_i'].astype(int)
In [76]:
This example is just a teaser; we hope you'll find some time to explore examples like this one in your own
projects.