Taxi Trip Analysis Using Hive
This case study uses Hive for exploratory analysis of a large taxi trip dataset. Here’s a
structured approach to the tasks:
1. Create the Table:
Run the following DDL script in Hive to create the table schema for storing the taxi data:
-- Assumes a comma-delimited text file; adjust the delimiter to match the data.
CREATE TABLE IF NOT EXISTS taxidata (
vendor_id STRING,
pickup_datetime STRING,
dropoff_datetime STRING,
passenger_count INT,
trip_distance DECIMAL(9,6),
pickup_longitude DECIMAL(9,6),
pickup_latitude DECIMAL(9,6),
rate_code INT,
store_and_fwd_flag STRING,
dropoff_longitude DECIMAL(9,6),
dropoff_latitude DECIMAL(9,6),
payment_type STRING,
fare_amount DECIMAL(9,6),
extra DECIMAL(9,6),
mta_tax DECIMAL(9,6),
tip_amount DECIMAL(9,6),
tolls_amount DECIMAL(9,6),
total_amount DECIMAL(9,6),
trip_time_in_secs INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
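2. Load Data:
The table stays empty until the data file is loaded into it. A minimal sketch, assuming the table is named taxidata (as in the queries in this document) and the CSV already sits at a hypothetical HDFS path:

```sql
-- Hypothetical HDFS path; replace it with the actual location of the CSV file.
LOAD DATA INPATH '/user/hive/input/taxidata.csv'
INTO TABLE taxidata;
```

Note that LOAD DATA INPATH moves the file from its HDFS location into the table's storage directory rather than copying it.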
3. Preview Data:
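A small LIMIT query is a quick way to confirm that the columns line up with the schema. A minimal sketch, assuming the table is named taxidata as in the queries in this document:

```sql
-- Show a handful of rows to check that values landed in the right columns.
SELECT *
FROM taxidata
LIMIT 10;
```

If most columns come back as NULL, the field delimiter in the table definition probably does not match the file.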
4. Analysis Queries
The first query averages the total amount, tip, and MTA tax for each payment type:
SELECT
payment_type,
AVG(total_amount) AS average_fare,
AVG(tip_amount) AS average_tip,
AVG(mta_tax) AS average_tax
FROM taxidata
GROUP BY payment_type;
To find the hour of the day with the highest average revenue, extract the hour from
pickup_datetime:
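-- HOUR() reads the hour directly from a 'yyyy-MM-dd HH:mm:ss' string.
-- Sorting by the average makes LIMIT 1 return the highest-revenue hour.
SELECT
HOUR(pickup_datetime) AS hour_of_day,
AVG(total_amount) AS average_revenue
FROM taxidata
GROUP BY HOUR(pickup_datetime)
ORDER BY average_revenue DESC
LIMIT 1;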
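Extracting the hour from a STRING column depends on how the timestamps are written in the file. If pickup_datetime is not in Hive's default yyyy-MM-dd HH:mm:ss layout, it can be parsed with an explicit pattern first. A sketch, assuming a hypothetical MM/dd/yyyy HH:mm layout (check the actual file before using it):

```sql
-- Assumption: pickup_datetime looks like '01/15/2013 18:42' (hypothetical layout).
-- UNIX_TIMESTAMP parses the string with the given pattern;
-- FROM_UNIXTIME turns it back into a 'yyyy-MM-dd HH:mm:ss' string that HOUR() understands.
SELECT
  HOUR(FROM_UNIXTIME(UNIX_TIMESTAMP(pickup_datetime, 'MM/dd/yyyy HH:mm'))) AS hour_of_day,
  AVG(total_amount) AS average_revenue
FROM taxidata
GROUP BY HOUR(FROM_UNIXTIME(UNIX_TIMESTAMP(pickup_datetime, 'MM/dd/yyyy HH:mm')))
ORDER BY average_revenue DESC
LIMIT 1;
```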
5. Notes
• Ensure the file is formatted correctly and accessible in HDFS before loading.
• Verify that the Hadoop and Hive services are running and that your Hive client can connect to them.
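A few quick statements can confirm that Hive is reachable and that the load worked. This is only a sketch, assuming the table is named taxidata:

```sql
-- Confirms the session can reach the metastore and that the table exists.
SHOW TABLES 'taxidata';

-- Lists the columns and types Hive recorded for the table.
DESCRIBE taxidata;

-- A non-zero count indicates that rows were actually loaded.
SELECT COUNT(*) AS row_count
FROM taxidata;
```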
This workflow should give you insights into the taxi dataset while using Hive effectively!