Yash Week 3 Uber Case Study
0.0.1 Context:
Ridesharing is a service that arranges transportation on short notice. It is a very volatile market
and its demand fluctuates wildly with time, place, weather, local events, etc. The key to being
successful in this business is to be able to detect patterns in these fluctuations and cater to the
demand at any given time.
0.0.2 Objective:
Uber Technologies, Inc. is an American multinational transportation network company based in San
Francisco and has operations in over 785 metropolitan areas with over 110 million users worldwide.
As a newly hired Data Scientist in Uber's New York office, you have been given the task of
extracting actionable insights from data that will help grow the business.
• pcp06: 6-hour liquid precipitation
• pcp24: 24-hour liquid precipitation
• sd: Snow depth in inches
• hday: Being a holiday (Y) or not (N)
[ ]: #Mount drive
from google.colab import drive
drive.mount('/content/drive')
[ ]: # Imports used throughout the notebook (the original import cell is not shown in this export)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('Uber_Data.csv')
df = data.copy()  # later cells refer to the frame as df (assumed copy step)
1 0.0 0.0 0.0 Y
2 0.0 0.0 0.0 Y
3 0.0 0.0 0.0 Y
4 0.0 0.0 0.0 Y
[ ]: (29101, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup_dt 29101 non-null object
1 borough 26058 non-null object
2 pickups 29101 non-null int64
3 spd 29101 non-null float64
4 vsb 29101 non-null float64
5 temp 28742 non-null float64
6 dewp 29101 non-null float64
7 slp 29101 non-null float64
8 pcp01 29101 non-null float64
9 pcp06 29101 non-null float64
10 pcp24 29101 non-null float64
11 sd 29101 non-null float64
12 hday 29101 non-null object
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB
• All the columns have 29,101 observations except borough and temp, which have 26,058 and
28,742 observations respectively, indicating that they contain missing values
• The pickup_dt column is being read as an 'object' data type but it should be in date-time
format
• The borough and hday columns are of object type while the rest of the columns are numerical
in nature
• The object type columns contain categories in them
max
pickups 7883.00
spd 21.00
vsb 10.00
temp 89.00
dewp 73.00
slp 1043.40
pcp01 0.28
pcp06 1.24
pcp24 2.10
sd 19.00
• There is a huge difference between the 3rd quartile and the maximum value for the number of
pickups (pickups) and snow depth (sd) indicating that there might be outliers to the right in
these variables
• The temperature has a wide range, indicating that the data consists of entries from different seasons
Let’s check the count of each unique category in each of the categorical/object type
variables.
[ ]: df['borough'].unique()
• We can observe that there are 5 unique boroughs present in the dataset for New York plus
EWR (Newark Liberty Airport)
– Valid NYC Boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island)
– EWR is the acronym for Newark Liberty Airport (EWR IS NOT AN NYC BOROUGH)
∗ NYC customers have the flexibility to catch flights from any of JFK Airport,
LaGuardia Airport, or Newark Liberty Airport (EWR)
[ ]: df['hday'].value_counts(normalize=True)
[ ]: N 0.961479
Y 0.038521
Name: hday, dtype: float64
• The number of non-holiday observations is much higher than the number of holiday observations,
which makes sense: around 96% of the observations are from non-holidays
We have observed earlier that the data type for pickup_dt is object in nature. Let us change the
data type of pickup_dt to date-time format.
[ ]: df['pickup_dt'] = pd.to_datetime(df['pickup_dt'])
df['pickup_dt']  # display the converted column
[ ]: 0 2015-01-01 01:00:00
1 2015-01-01 01:00:00
2 2015-01-01 01:00:00
3 2015-01-01 01:00:00
4 2015-01-01 01:00:00
…
29096 2015-06-30 23:00:00
29097 2015-06-30 23:00:00
29098 2015-06-30 23:00:00
29099 2015-06-30 23:00:00
29100 2015-06-30 23:00:00
Name: pickup_dt, Length: 29101, dtype: datetime64[ns]
Let’s check the data types of the columns again to ensure that the change has been executed
properly.
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup_dt 29101 non-null datetime64[ns]
1 borough 26058 non-null object
2 pickups 29101 non-null int64
3 spd 29101 non-null float64
4 vsb 29101 non-null float64
5 temp 28742 non-null float64
6 dewp 29101 non-null float64
7 slp 29101 non-null float64
8 pcp01 29101 non-null float64
9 pcp06 29101 non-null float64
10 pcp24 29101 non-null float64
11 sd 29101 non-null float64
12 hday 29101 non-null object
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 2.9+ MB
• The data type of the pickup_dt column has been successfully changed to date-time format
• There are now 10 numerical columns, 2 object type columns and 1 date-time column
Now let’s check the range of time period for which the data has been collected.
[ ]: df['pickup_dt'].min() # the earliest observation in the data
[ ]: Timestamp('2015-01-01 01:00:00')
[ ]: df['pickup_dt'].max() # the latest observation in the data
[ ]: Timestamp('2015-06-30 23:00:00')
• So the time period for the data is from January to June of the year 2015
• There is significant variation in weather conditions over this period, which we have observed
in the statistical summary of the weather parameters, e.g. temperature ranging from
2°F to 89°F
Since the pickup_dt column combines the day, month, year and time of day, let's extract each
piece of information as a separate column to study how the number of rides varies over time.
[ ]:
Now we can remove the pickup_dt column from our dataset as it will not be required for further
analysis.
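The extraction cell is blank in this export; a minimal sketch of what it presumably contained, using pandas' `.dt` accessor on a toy frame. The new column names (start_year, start_month, start_day, start_hour, week_day) match those that appear later in the notebook.

```python
import pandas as pd

# Toy frame standing in for the notebook's df
df = pd.DataFrame({'pickup_dt': pd.to_datetime(['2015-01-01 01:00:00',
                                                '2015-06-30 23:00:00'])})

# Extract each date-time component as a separate column
df['start_year'] = df['pickup_dt'].dt.year
df['start_month'] = df['pickup_dt'].dt.month_name()
df['start_day'] = df['pickup_dt'].dt.day
df['start_hour'] = df['pickup_dt'].dt.hour
df['week_day'] = df['pickup_dt'].dt.day_name()

# pickup_dt is no longer needed for further analysis
df = df.drop(columns=['pickup_dt'])
print(df.to_dict('list'))
```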
Let's check the first few rows of the dataset to see if the changes have been applied properly.
[ ]: df.head()
[ ]: borough pickups spd vsb temp dewp slp pcp01 pcp06 pcp24 \
0 Bronx 152 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0
1 Brooklyn 1519 5.0 10.0 NaN 7.0 1023.5 0.0 0.0 0.0
2 EWR 0 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0
3 Manhattan 5258 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0
4 Queens 405 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0
We can see the changes have been applied to the dataset properly.
Let’s analyze the statistical summary for the new columns added in the dataset.
[ ]: df.describe(include='all').T
# setting include='all' will get the statistical summary for both the numerical and categorical variables.
spd 29101.0 NaN NaN NaN … 3.0 6.0 8.0 21.0
vsb 29101.0 NaN NaN NaN … 9.1 10.0 10.0 10.0
temp 28742.0 NaN NaN NaN … 32.0 46.5 65.0 89.0
dewp 29101.0 NaN NaN NaN … 14.0 30.0 50.0 73.0
slp 29101.0 NaN NaN NaN … 1012.5 1018.2 1022.9 1043.4
pcp01 29101.0 NaN NaN NaN … 0.0 0.0 0.0 0.28
pcp06 29101.0 NaN NaN NaN … 0.0 0.0 0.0 1.24
pcp24 29101.0 NaN NaN NaN … 0.0 0.0 0.05 2.1
sd 29101.0 NaN NaN NaN … 0.0 0.0 2.958333 19.0
hday 29101 2 N 27980 … NaN NaN NaN NaN
start_year 29101.0 NaN NaN NaN … 2015.0 2015.0 2015.0 2015.0
start_month 29101 6 May 5058 … NaN NaN NaN NaN
start_hour 29101.0 NaN NaN NaN … 6.0 12.0 18.0 23.0
start_day 29101.0 NaN NaN NaN … 8.0 16.0 23.0 31.0
week_day 29101 7 Friday 4219 … NaN NaN NaN NaN
• Replacing with median: In this method the missing values are imputed with the median of the
column. Since the median is robust to outliers, it is generally the preferred measure of
central tendency to deal with missing values over the mean.
• Replacing with mode: In this method the missing values are imputed with the mode of the
column. This method is generally preferred with categorical data.
Let’s check how many missing values are present in each variable.
[ ]: borough 3043
pickups 0
spd 0
vsb 0
temp 359
dewp 0
slp 0
pcp01 0
pcp06 0
pcp24 0
sd 0
hday 0
start_year 0
start_month 0
start_hour 0
start_day 0
week_day 0
dtype: int64
• The variables borough and temp have 3,043 and 359 missing values respectively
• There are no missing values in other variables
Let us first look at the missing values in the borough column in detail.
[ ]: Bronx 0.149239
Brooklyn 0.149239
EWR 0.149239
Manhattan 0.149239
Queens 0.149239
Staten Island 0.149239
NaN 0.104567
Name: borough, dtype: float64
• All 6 categories have nearly the same share, i.e. ~15% each, so there is no single dominant mode
for this variable
• The percentage of missing values is close to the percentage of observations from the other boroughs
• We can treat the missing values as a separate category for this variable
We can replace the null values present in the borough column with a new label, Unknown.
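The replacement cell itself is not shown in this export; a sketch of the likely step, on a toy series:

```python
import pandas as pd

borough = pd.Series(['Bronx', None, 'Queens', None])  # toy borough column with NaNs
borough = borough.fillna('Unknown')                   # label missing boroughs as Unknown
print(borough.tolist())  # ['Bronx', 'Unknown', 'Queens', 'Unknown']
```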
[ ]: df['borough'].unique()
It can be observed that the new label Unknown has been added in the borough column
[ ]: df.isnull().sum()
[ ]: borough 0
pickups 0
spd 0
vsb 0
temp 359
dewp 0
slp 0
pcp01 0
pcp06 0
pcp24 0
sd 0
hday 0
start_year 0
start_month 0
start_hour 0
start_day 0
week_day 0
dtype: int64
The missing values in the borough column have been treated. Let us now move on to the temp variable
and see how to deal with the missing values present there.
Since this is a numerical variable, we can impute the missing values with the mean or median, but
before imputation, let's analyze the temp variable in detail.
Let us print the rows where the temp variable is having missing values.
[ ]: df.loc[df['temp'].isnull()==True]
[ ]: borough pickups spd vsb temp dewp slp pcp01 pcp06 pcp24 \
1 Brooklyn 1519 5.0 10.0 NaN 7.0 1023.5 0.0 0.0 0.0
8 Brooklyn 1229 3.0 10.0 NaN 6.0 1023.0 0.0 0.0 0.0
15 Brooklyn 1601 5.0 10.0 NaN 8.0 1022.3 0.0 0.0 0.0
22 Brooklyn 1390 5.0 10.0 NaN 9.0 1022.0 0.0 0.0 0.0
29 Brooklyn 759 5.0 10.0 NaN 9.0 1021.8 0.0 0.0 0.0
… … … … … … … … … … …
2334 Brooklyn 594 5.0 10.0 NaN 13.0 1016.2 0.0 0.0 0.0
2340 Brooklyn 620 5.0 10.0 NaN 13.0 1015.5 0.0 0.0 0.0
2347 Brooklyn 607 3.0 10.0 NaN 14.0 1015.4 0.0 0.0 0.0
2354 Brooklyn 648 9.0 10.0 NaN 14.0 1015.4 0.0 0.0 0.0
2361 Brooklyn 602 5.0 10.0 NaN 16.0 1015.4 0.0 0.0 0.0
[ ]: a = [10, 20, 30, 40, 50, 60, np.nan]  # toy column with one missing value
[ ]: temp = pd.DataFrame({"a": a})
[ ]: temp.mean()  # pandas ignores NaN when computing the mean
[ ]: a 35.0
dtype: float64
[ ]: temp.sum()  # NaN is ignored in the sum as well
[ ]: a 210.0
dtype: float64
[ ]: 210/7  # mean if the NaN were counted as an observation
[ ]: 30.0
[ ]: 210/6  # mean over the six non-missing values, matching temp.mean()
[ ]: 35.0
[ ]: temp = temp.fillna(35)  # imputing with the mean leaves the mean unchanged
[ ]: temp['a'].mean()
[ ]: 35.0
[ ]: missing_vals_of_temp=df.loc[df['temp'].isnull()==True]
There are 359 observations where temp variable has missing values. From the overview of the
dataset, it seems as if the missing temperature values are from the Brooklyn borough in the month
of January.
So let’s confirm our hypothesis by printing the unique boroughs and month names present for these
missing values.
[ ]: missing_vals_of_temp['borough'].value_counts()
[ ]: Brooklyn 359
Name: borough, dtype: int64
[ ]: missing_vals_of_temp['start_month'].value_counts()
[ ]: January 359
Name: start_month, dtype: int64
[ ]: data['temp'].mean()
[ ]: 47.90001872156705
[ ]: df.groupby(['borough']).mean()['temp']
[ ]: borough
Bronx 47.489005
Brooklyn 49.139130
EWR 47.489005
Manhattan 47.489005
Queens 47.489005
Staten Island 47.489005
Unknown 49.210748
Name: temp, dtype: float64
[ ]:
[ ]: df.groupby(['borough','start_month']).mean()['temp']
[ ]: borough start_month
Bronx April 53.535417
February 24.291872
January 30.085740
June 70.602774
March 37.839807
May 67.250794
Brooklyn April 53.535417
February 24.291872
January 30.935547
June 70.602774
March 37.839807
May 67.250794
EWR April 53.535417
February 24.291872
January 30.085740
June 70.602774
March 37.839807
May 67.250794
Manhattan April 53.535417
February 24.291872
January 30.085740
June 70.602774
March 37.839807
May 67.250794
Queens April 53.535417
February 24.291872
January 30.085740
June 70.602774
March 37.839807
May 67.250794
Staten Island April 53.535417
February 24.291872
January 30.085740
June 70.602774
March 37.839807
May 67.250794
Unknown April 53.764121
February 24.278442
January 30.193443
June 70.843026
March 37.603922
May 67.456578
Name: temp, dtype: float64
[ ]: df.loc[df['temp'].isnull()==True,'borough'].value_counts()
[ ]: Brooklyn 359
Name: borough, dtype: int64
[ ]: df.loc[df['temp'].isnull()==True,'start_month'].value_counts()
[ ]: January 359
Name: start_month, dtype: int64
The missing values in temp are from the Brooklyn borough and they are from the month of January.
Let's check on which dates in January the missing values are present.
It can be observed that out of the 31 days in January, the data is missing for the first 15 days.
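The cell behind this observation is not shown; it can be sketched as below, on toy data (start_day is assumed to exist, as created earlier):

```python
import numpy as np
import pandas as pd

# Toy data: temp is missing on the first days, present later
df = pd.DataFrame({'start_day': [1, 2, 16, 17],
                   'temp': [np.nan, np.nan, 30.0, 32.0]})

# Unique days of the month on which temp is missing
missing_days = df.loc[df['temp'].isnull(), 'start_day'].unique().tolist()
print(missing_days)  # [1, 2]
```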
Since, from the statistical summary, the mean and median values of temperature are close to each
other, we can impute the missing values in the temp column with the mean temperature
of the Brooklyn borough during 16th to 31st January.
We will use fillna() function to impute the missing values.
fillna() - The fillna() function is used to fill NaN values by using the provided input value.
Syntax of fillna(): data['column'].fillna(value = x)
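The imputation step itself is not shown in this export; a sketch of it, on toy data standing in for the Brooklyn rows:

```python
import pandas as pd

# Toy frame: Brooklyn temps for 16-31 Jan are known, earlier days are missing
df = pd.DataFrame({'borough': ['Brooklyn'] * 4,
                   'start_month': ['January'] * 4,
                   'start_day': [1, 15, 16, 31],
                   'temp': [None, None, 30.0, 32.0]})

# Mean temperature in Brooklyn during 16th to 31st January
mask = ((df['borough'] == 'Brooklyn') &
        (df['start_month'] == 'January') &
        (df['start_day'] >= 16))
mean_temp = df.loc[mask, 'temp'].mean()   # 31.0 on this toy data

df['temp'] = df['temp'].fillna(value=mean_temp)
print(df['temp'].tolist())  # [31.0, 31.0, 30.0, 32.0]
```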
[ ]: df.isnull().sum()
[ ]: borough 0
pickups 0
spd 0
vsb 0
temp 0
dewp 0
slp 0
pcp01 0
pcp06 0
pcp24 0
sd 0
hday 0
start_year 0
start_month 0
start_hour 0
start_day 0
week_day 0
dtype: int64
• All the missing values have been imputed and there are no missing values in our dataset now.
Let’s now perform the Exploratory Data Analysis on the dataset
A boxplot gives a summary of one or several numeric variables. The line that divides the box into
2 parts represents the median of the data. The end of the box shows the upper and lower quartiles.
The extreme lines show the highest and lowest value excluding outliers.
[ ]: def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize)  # creating the 2 subplots
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True)  # boxplot (this line is missing in the export; call assumed)
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="mediumpurple")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="mediumpurple")  # for histogram
Observations on Pickups
[ ]: df.describe().T
75% max
pickups 449.000000 7883.00
spd 8.000000 21.00
vsb 10.000000 10.00
temp 64.500000 89.00
dewp 50.000000 73.00
slp 1022.900000 1043.40
pcp01 0.000000 0.28
pcp06 0.000000 1.24
pcp24 0.050000 2.10
sd 2.958333 19.00
start_year 2015.000000 2015.00
start_hour 18.000000 23.00
start_day 23.000000 31.00
[ ]: 449+1.5*(449-1)  # upper whisker = Q3 + 1.5*IQR, with Q3 = 449 and Q1 = 1
[ ]: 1121.0
[ ]: df[df['pickups']>1121].shape[0]
[ ]: 3498
[ ]: df.shape[0]
[ ]: 29101
[ ]: 29101-3498
[ ]: 25603
[ ]: 3498/29101
[ ]: 0.12020205491220233
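The hard-coded cutoff above follows the standard Q3 + 1.5×IQR whisker rule. A general version of the same computation, on a toy series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 449, 449, 5000])   # toy pickups-like series
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_whisker = q3 + 1.5 * (q3 - q1)       # values above this are flagged as outliers
n_outliers = (s > upper_whisker).sum()
print(upper_whisker, n_outliers)
```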
[ ]: histogram_boxplot(df,'pickups')
• The distribution of pickups is highly right skewed
• There are a lot of outliers in this variable
• While the number of pickups is mostly at the lower end, there are observations where it went
as high as nearly 8,000
Observations on Visibility
[ ]: histogram_boxplot(df,'vsb')
Observations on Temperature
[ ]: histogram_boxplot(df,'temp')
• Temperature does not have any outliers
• 50% of the temperature values are less than 45°F (~7 degrees Celsius), indicating cold weather
conditions
• There are no outliers for dew point either
• The distribution is similar to that of temperature. It suggests possible correlation between the
two variables
• Dew point is an indication of humidity, which is correlated with temperature
[ ]: histogram_boxplot(df,'pcp01')
Observations on 6-hour liquid precipitation
[ ]: histogram_boxplot(df,'pcp06')
[ ]: histogram_boxplot(df,'pcp24')
• We can observe that there is snowfall in the time period that we are analyzing
• There are outliers in this data
• We will have to see how snowfall affects pickups. We know that very few people are likely to
go out if it is snowing heavily, so pickups are likely to decrease when it snows
Let’s explore the categorical variables now
Bar Charts can be used to explore the distribution of Categorical Variables. Each entity of the
categorical variable is represented as a bar. The size of the bar represents its numeric value.
Observations on holiday
[ ]: sns.countplot(data=df,x='hday');
• The number of pickups is more on non-holidays than on holidays
Observations on borough
[ ]: sns.countplot(data=df,x='borough');
plt.xticks(rotation = 90)
• The observations are uniformly distributed across the boroughs, except for the observations that
had NaN values and were assigned to the Unknown borough
[ ]: # df.corr()
[ ]: missing_val_exists_df = data.isna()*1
missing_val_exists_df
# plt.figure(figsize=(15, 7))
# sns.heatmap(missing_val_exists_df, annot=True, vmin=0, vmax=1, fmt=".2f", cmap="Spectral")
# plt.show()
[ ]: pickup_dt borough pickups spd vsb temp dewp slp pcp01 pcp06 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
… … … … … … … … … … …
29096 0 0 0 0 0 0 0 0 0 0
29097 0 0 0 0 0 0 0 0 0 0
29098 0 0 0 0 0 0 0 0 0 0
29099 0 0 0 0 0 0 0 0 0 0
29100 0 1 0 0 0 0 0 0 0 0
pcp24 sd hday
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
… … .. …
29096 0 0 0
29097 0 0 0
29098 0 0 0
29099 0 0 0
29100 0 0 0
[ ]: # num_var is not defined anywhere in this export; the definition below is assumed
num_var = df.select_dtypes(include='number').columns.tolist()
corr = df[num_var].corr()
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
• As expected, temperature shows high correlation with dew point
• Visibility is negatively correlated with precipitation. If it is raining heavily, then the visibility
will be low. This is aligned with our intuitive understanding
• Snow depth of course would be negatively correlated with temperature.
• Wind speed and sea level pressure are negatively correlated with temperature
• It is important to note that correlation does not imply causation
• There does not seem to be a strong relationship between number of pickups and weather stats
Bivariate Scatter Plots A scatterplot displays the relationship between 2 numeric variables.
For each data point, the value of its first variable is represented on the X axis, the second on the
Y axis
[ ]: sns.pairplot(data=df[num_var], diag_kind="kde")
plt.show()
• We get the same insights as from the correlation plot
• There does not seem to be a strong relationship between number of pickups and weather stats
Now let’s check the trend between pickups across different time based variables We
can check the trends for time measures by plotting Line charts
A line chart is often used to visualize a trend in data over intervals of time, thus the line is often
drawn chronologically.
[ ]: cats = df.start_month.unique().tolist()
df.start_month = pd.Categorical(df.start_month, ordered=True, categories=cats)
plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="start_month", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Month')
plt.show()
[ ]: df.groupby(['start_month']).sum()['pickups']
[ ]: start_month
April 2279842
February 2263427
January 1947808
June 2816348
March 2261710
May 2696638
Name: pickups, dtype: int64
[ ]: # The first lines of this cell are missing from the export; the lineplot call is assumed
plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="start_day", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Day of Month')
plt.show()
• There is a steep fall in the number of pickups over the last days of the month
• This can partially be attributed to the month of Feb having just 28 days. We can drop Feb and
have a look at this chart again
• There is a peak in the bookings around the 20th day of the month
Let us drop the observations for the month of Feb and see the trend
[ ]: # The first lines of this cell are missing from the export; the filter and lineplot call are assumed
plt.figure(figsize=(15,7))
sns.lineplot(data=df[df['start_month'] != 'February'], x="start_day", y="pickups", ci=False,
             color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Day of Month')
plt.show()
• Number of pickups for 31st is still low because not all months have the 31st day
[ ]: # The first lines of this cell are missing from the export; the lineplot call is assumed
plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="start_hour", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Hour of the day')
plt.show()
• Bookings peak around the 19th and 20th hour of the day
• The peak can be attributed to the time when people leave their workplaces
• From 5 AM onwards, we can see an increasing trend till 10, possibly the office rush
• Pickups then dip from 10 AM to 12 PM, after which they start increasing again
[ ]: plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="week_day", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Day of Week')
plt.show()
• Pickups gradually increase as the week progresses and start dropping after Saturday
• We need to do more investigation to understand why the demand for Uber is low at the beginning
of the week
Let’s check if there is any significant effect of the categorical variables on the number
of pickups
[ ]: # Dispersion of pickups in every borough
sns.catplot(x='pickups', col='borough', data=df, col_wrap=4, kind="violin")
plt.show()
• There is a clear difference in the number of riders across the different boroughs
• Manhattan has the highest number of bookings
• Brooklyn and Queens are distant followers
• EWR, Unknown and Staten Island have very few bookings. The demand is so small that it can
probably be covered by the drop-offs of inbound trips from other areas
[ ]: sns.catplot(x='hday', y='pickups', data=df, kind="bar")
plt.show()
The bars for EWR, Staten Island and Unknown are not visible. Let's check the mean pickups in
all the boroughs to verify this.
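The cell that produced the table below is not shown; it is presumably a groupby on borough and hday, e.g. (toy data):

```python
import pandas as pd

# Toy frame standing in for the notebook's df
df = pd.DataFrame({'borough': ['Bronx', 'Bronx', 'Queens', 'Queens'],
                   'hday':    ['N', 'Y', 'N', 'Y'],
                   'pickups': [50, 48, 300, 320]})

# Mean pickups per borough, split by holiday flag
mean_pickups = df.groupby(['borough', 'hday'])['pickups'].mean()
print(mean_pickups)
```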
[ ]: borough hday
Bronx N 50.771073
Y 48.065868
Brooklyn N 534.727969
Y 527.011976
EWR N 0.023467
Y 0.041916
Manhattan N 2401.302921
Y 2035.928144
Queens N 308.899904
Y 320.730539
Staten Island N 1.606082
Y 1.497006
Unknown N 2.057456
Y 2.050420
Name: pickups, dtype: float64
• In all the boroughs except Manhattan, the mean pickups on a holiday are very similar to those
on a non-holiday
• In Queens, mean pickups on a holiday is higher
• There are hardly any pickups in EWR
Since we have seen that borough has a significant effect on the number of pickups, let’s check if
that effect is present across different hours of the day.
[ ]: plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="start_hour", y="pickups", hue='borough', estimator='sum', ci=False)
plt.ylabel('Total pickups')
plt.xlabel('Hour of the day')
plt.show()
• The number of pickups in Manhattan is very high and dominant when we see the spread across
boroughs
• The hourly trend which we have observed earlier can be mainly attributed to the borough of
Manhattan, as the rest of the boroughs do not show any significant change in the number
of pickups on an hourly basis
0.0.13 Outlier Detection and Treatment
Let’s visualize all the outliers present in data together
plt.figure(figsize=(15, 12))
# The plotting call is missing from the export; a boxplot of the numerical columns is assumed
sns.boxplot(data=df.select_dtypes(include='number'), orient='h')
plt.show()
• The pickups column has a wide range of values with lots of outliers. However we are not
going to treat this column since the number of pickups can have a varying range and we can
miss out on some genuine values if we treat this column
• Starting from spd to sd, all the columns are related to weather. The weather related variables
have some outliers, however all of them seem to be genuine values. So we are not going to
treat the outliers present in these columns
0.0.14 Actionable Insights and Recommendations
Insights We analyzed a dataset of nearly 30K hourly Uber pickup records from the New York
boroughs. The data spans every day of the first six months of 2015. The main
feature of interest here is the number of pickups. From both an environmental and a business
perspective, having cars roam one area while the demand is in another, or filling the streets
with cars during a low-demand period while lacking them during peak hours, is inefficient. Thus we
determined the factors that affect pickups and the nature of their effect.
We have been able to conclude that -
1. Uber cabs are most popular in the Manhattan area of New York
2. Contrary to intuition, weather conditions do not have much impact on the number of Uber
pickups
3. The demand for Uber has been increasing steadily over the months (Jan to June)
4. The rate of pickups is higher on the weekends as compared to weekdays
5. It is encouraging to see that New Yorkers trust Uber taxi services when they step out to enjoy
their evenings
6. We can also conclude that people use Uber for regular office commutes. The demand steadily
increases from 6 AM to 10 AM, then declines a little and starts picking up till midnight. The
demand peaks at 7-8 PM
7. We need to further investigate the low demand for Uber on Mondays
Recommendations to business
1. Manhattan is the most mature market for Uber. Brooklyn, Queens, and Bronx show potential
2. There has been a gradual increase in Uber rides over the last few months and we need to keep
up the momentum
3. The number of rides is high at peak office-commute hours on weekdays and during late
evenings on Saturdays. Cab availability must be ensured during these times
4. The demand for cabs is highest on Saturday nights. Cab availability must be ensured during
this time of the week
5. Data should be procured for fleet size availability to get a better understanding of the demand-
supply status and build a machine learning model to accurately predict pickups per hour, to
optimize the cab fleet in respective areas
6. More data should be procured on price and a model can be built that can predict optimal
pricing
7. It would be great if Uber provided rides to/from the JFK and LaGuardia airports.
This would rival other services that provide airport rides throughout the USA.
[ ]: