Hadoop Hive - One

This document analyzes car data to identify vehicles with the best resale value. It loads car data into Hive tables and runs queries to find the top manufacturers by average mileage and price. The analysis finds Chrysler has the highest average mileage, indicating longevity. However, luxury brands dominate highest average prices. To account for this, economy and luxury groups are analyzed separately. Recommendations are made for Volvo, Honda, Land Rover, Audi, and Mercedes/Lexus based on balance of high mileage and price. Fuel type is also analyzed, finding gasoline has highest mileage and resale value, followed by diesel.

CSDA1020

Big Data Analytics Tools


Project 1, Hadoop and Hive Project

S.E.A.L. Cars
Analysis
Report

Created: Susiette Adams


Table of Contents
1. Introduction
2. Data being prepared to be loaded into Hive
3. Create the project1 database, then a cartable table to load the car.csv data
4. Loading the car.csv data into “cartable” and confirming the first 5 rows of the data
Analysis (Questions)
5. Identifying cars that have the most value
6. Loading the values into the newly created “average” table and displaying the first 5 entries
7. Top 10 makers with highest avg. mileage
8. Top 10 makers with highest avg. price
9. Sort the economy cars group
10. Sort the luxury cars group
11. Quality of data impacting the quality of this analysis
12. Broaden the search to top 20 highest avg. mileage
13. Our 5 recommendations
14. Analyzing the data by fuel type
15. Adding the data into the new table “fuel”
16. Results from the fuel table query
17. Is there a relationship between make, model and price?
18. Cleaning the data
19. Does fuel type have an impact on price?
20. Conclusion

1. Introduction
The car data used for our analysis was compiled from several websites in the
Czech Republic and Germany over a period of more than a year. Some of the sources provided
unstructured data, so the data is dirty: there are missing values, and some values are
clearly wrong. For example, phone numbers have been entered as mileage in some cases. The data
consists of roughly 3.5 million rows and sixteen columns: maker, model, mileage (in km),
manufacture_year, engine_displacement, engine_power, body_type, color_slug, stk_year,
transmission, door_count, seat_count, fuel_type, date_created, date_last_seen, price_eur.

We will focus on the make, mileage, fuel type and price to estimate which cars have the
best resale value. These factors will help us decide whether our business should specialize
in economy cars only or in both luxury and economy cars. We will use Hadoop
and Hive on Google Cloud Platform to perform our analysis.

2. The car data had already been uploaded into Hadoop and the database had already been created. This
step shows the data being prepared to be loaded into Hive.

3. Here we create the “project1” database in Hive, then a table named “cartable” to load the car.csv data into.

4. Loading the car.csv data into “cartable” and confirming the first 5 rows of the data.
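
Steps 3 and 4 can be sketched in a self-contained way. Hive cannot run inside a short snippet, so this sketch uses Python's built-in sqlite3 module as a stand-in. The table name cartable and the column names come from this report; the column types, the maker_a/maker_b placeholders and the three CSV rows are assumptions invented purely for illustration.

```python
import csv
import io
import sqlite3

# Column names are taken from the report; the types are assumptions.
COLUMNS = [
    ("maker", "TEXT"), ("model", "TEXT"), ("mileage", "REAL"),
    ("manufacture_year", "INTEGER"), ("engine_displacement", "REAL"),
    ("engine_power", "REAL"), ("body_type", "TEXT"), ("color_slug", "TEXT"),
    ("stk_year", "TEXT"), ("transmission", "TEXT"), ("door_count", "TEXT"),
    ("seat_count", "TEXT"), ("fuel_type", "TEXT"), ("date_created", "TEXT"),
    ("date_last_seen", "TEXT"), ("price_eur", "REAL"),
]

# Invented rows standing in for car.csv (not real data).
SAMPLE_CSV = "\n".join([
    ",".join(name for name, _ in COLUMNS),
    "maker_a,model_x,151000,2011,1560,80,other,,,man,5,5,diesel,2015-11-14,2016-01-27,10584.75",
    "maker_b,model_y,143476,2012,2000,103,other,,,man,5,5,diesel,2015-11-14,2016-01-27,8882.31",
    ",,97676,2010,1995,85,other,,,man,5,5,diesel,2015-11-14,2016-01-27,12065.06",
])


def load_cartable(conn, csv_text):
    """Create cartable, load the CSV text into it, and return the row count."""
    col_defs = ", ".join(f"{name} {ctype}" for name, ctype in COLUMNS)
    conn.execute(f"CREATE TABLE cartable ({col_defs})")
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    rows = [row for row in reader if row]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(f"INSERT INTO cartable VALUES ({placeholders})", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_cartable(conn, SAMPLE_CSV)

# Confirm the load by looking at (up to) the first 5 rows.
first_five = conn.execute("SELECT * FROM cartable LIMIT 5").fetchall()
for row in first_five:
    print(row)
```

Note that the third sample row has a blank maker and model, mirroring the missing values mentioned in the introduction.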

Analysis (Questions)
1. What is the relationship between car makes, models and price?
2. Which top five vehicle manufacturers would you recommend? Why?
3. Does fuel type have any impact on the car price? Explain.

5. Although these are specific questions, the aim of our analysis is to identify the cars that have the most value. We
decided to run a number of queries to find out what can be indicators of good value in cars.

We first tried to identify the car makers with the highest avg. mileage (lifespan) and highest avg. resale price.
To get these results, we created a new table to store the avg. mileage and avg. price by maker.
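
A per-maker averages table of the kind described above could be built roughly as follows. This is a minimal sketch using Python's sqlite3 as a stand-in for Hive; the column names avg_mileage and avg_price, the maker placeholders and all sample values are assumptions, not figures from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL, price_eur REAL)")

# Invented rows for illustration only (not values from the real dataset).
conn.executemany(
    "INSERT INTO cartable VALUES (?, ?, ?)",
    [
        ("maker_a", 180000, 9000),
        ("maker_a", 160000, 11000),
        ("maker_b", 120000, 8000),
        ("maker_b", 140000, 6000),
        ("maker_c", 100000, 15000),
    ],
)

# One row per maker holding its average mileage and average price.
conn.execute(
    """
    CREATE TABLE average AS
    SELECT maker,
           AVG(mileage)   AS avg_mileage,
           AVG(price_eur) AS avg_price
    FROM cartable
    GROUP BY maker
    """
)

# Display the first 5 entries of the new table.
first_rows = conn.execute(
    "SELECT maker, avg_mileage, avg_price FROM average ORDER BY maker LIMIT 5"
).fetchall()
for row in first_rows:
    print(row)
```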

6. From the car.csv data we load the values into the newly created “average” table and display the first
5 entries.

From the results above, Jaguar, Lotus and Maserati seem to have the best results of the 5
outputs when comparing mileage and price.

7. Here are the top 10 makers with the highest avg. mileage. Although there are other factors to be considered,
this can be an indicator of the avg. lifespan of a vehicle.

Looking at the lifespan of the cars from the outputs above, Chrysler seems to have the best
lifespan. The lifespan of a car also depends on how well its owners maintained it.
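
The ranking in step 7 amounts to grouping by maker, averaging mileage, sorting in descending order and keeping the first rows. A minimal sketch, again with sqlite3 standing in for Hive; the function name top_makers_by_avg_mileage, the maker placeholders and the sample rows are hypothetical.

```python
import sqlite3


def top_makers_by_avg_mileage(conn, n=10):
    """Return the n makers with the highest average mileage."""
    return conn.execute(
        """
        SELECT maker, AVG(mileage) AS avg_mileage
        FROM cartable
        GROUP BY maker
        ORDER BY avg_mileage DESC
        LIMIT ?
        """,
        (n,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL)")

# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO cartable VALUES (?, ?)",
    [
        ("maker_a", 200000), ("maker_a", 180000),
        ("maker_b", 150000), ("maker_b", 130000),
        ("maker_c", 60000),
    ],
)

top_two = top_makers_by_avg_mileage(conn, n=2)
print(top_two)
```

Swapping mileage for price_eur in the query gives the equivalent ranking by average price.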

8. Here are the top 10 makers with the highest avg. price. This can be an indication of the good resale value of a
car.

Unfortunately, the highest avg. price on its own does not seem to be a good indication of car value, because this list
consists mostly of high-end luxury vehicles, and a luxury vehicle's price is naturally higher than that of an economy
car. In addition, as stated in the introduction, the data quality is poor. Therefore, we decided to analyze cars in
an economy group (top 10 highest avg. mileage) and a luxury group (top 10 highest avg. price).

9. To come up with more reasonable numbers, we sort the economy cars group by highest
avg. mileage and also include their avg. prices, so that their avg. resale value can be viewed as well.

Keep in mind that the quality of the dataset is affecting the results of the analysis. For instance, Subaru's avg.
mileage (136K) places it in the economy group, but its avg. price (147K) does not seem realistic. In
addition, rows with a missing “maker” field were included in this group; they averaged 547K in price and
124K in mileage. Despite this, the analysis suggests some good economy options such as
Chrysler, Skoda, Volvo, Honda, Ford, and Volkswagen, and some options with really bad resale value,
such as Alfa-Romeo and Land-Rover.
10. Now we perform the same analysis on the luxury group. To do that, we sort the group by highest
avg. price and also include their avg. mileage to review how much they are driven.
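
The economy and luxury groupings in steps 9 and 10 differ only in the column used for sorting. The sketch below shows one way to express that, with sqlite3 standing in for Hive. The helper rank_makers, the column names avg_mileage and avg_price, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Only these columns may be used for sorting; the names are assumptions.
ALLOWED_ORDER_COLUMNS = {"avg_mileage", "avg_price"}


def rank_makers(conn, order_by, n=10):
    """Return the top n makers by the given column, with both averages."""
    if order_by not in ALLOWED_ORDER_COLUMNS:
        raise ValueError(f"unsupported sort column: {order_by}")
    query = (
        "SELECT maker, avg_mileage, avg_price FROM average "
        f"ORDER BY {order_by} DESC LIMIT ?"
    )
    return conn.execute(query, (n,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE average (maker TEXT, avg_mileage REAL, avg_price REAL)")

# Invented per-maker averages for illustration only.
conn.executemany(
    "INSERT INTO average VALUES (?, ?, ?)",
    [
        ("maker_a", 150000, 8000),
        ("maker_b", 140000, 9000),
        ("maker_c", 60000, 90000),
        ("maker_d", 50000, 120000),
    ],
)

economy_group = rank_makers(conn, "avg_mileage", n=2)  # most-driven makers
luxury_group = rank_makers(conn, "avg_price", n=2)     # most expensive makers
print(economy_group)
print(luxury_group)
```

Returning both averages for each group makes it possible to read mileage and price side by side, which is what the comparison between the two groups relies on.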

11. Again, we see that the quality of the data is affecting the quality of this analysis. For instance, the prices for
Renault, BMW, Subaru (again) and Citroen are not realistic, and rows with no maker field were included in this
list again. However, a trend in the luxury group can be seen in this analysis. High-end luxury vehicles like
Lamborghini, Bentley, Porsche and Maserati are not driven more than 100K in mileage, whereas the top 10 in the
economy group are all driven more than 100K.

12. Because we are analysing a used-car dataset and recommending vehicles that hold their value,
we will broaden our search to the top 20 makers by highest avg. mileage.

13. Keeping “value” in mind, meaning cars that are driven long distances while maintaining a higher resale value,
here are our 5 recommendations.

• Volvo
• Honda
• Land Rover
• Audi
• Mercedes-Benz or Lexus

14. Now, in order to analyze the data by fuel type we have created a new table.

15. We have added the data into the new table “fuel”, applying the same logic, where avg. mileage and
avg. price illustrate a vehicle's value by fuel type.

16. Here are the results from the fuel table query.

While gasoline vehicles show the most distance driven and the highest resale value, diesel seems to be second. LPG
and electric vehicles seem to have similar efficiency. CNG has the lowest avg. mileage and the lowest price.
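
The fuel-type aggregation in steps 14 to 16 follows the same pattern as the per-maker averages, grouped by fuel_type instead. A minimal sketch with sqlite3 standing in for Hive; the column names avg_mileage and avg_price are assumptions, and the sample rows are invented for illustration rather than taken from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (fuel_type TEXT, mileage REAL, price_eur REAL)")

# Invented rows for illustration only (not values from the real dataset).
conn.executemany(
    "INSERT INTO cartable VALUES (?, ?, ?)",
    [
        ("gasoline", 160000, 12000),
        ("gasoline", 140000, 10000),
        ("diesel", 130000, 9000),
        ("cng", 60000, 4000),
    ],
)

# One row per fuel type with its average mileage and average price.
conn.execute(
    """
    CREATE TABLE fuel AS
    SELECT fuel_type,
           AVG(mileage)   AS avg_mileage,
           AVG(price_eur) AS avg_price
    FROM cartable
    GROUP BY fuel_type
    """
)

fuel_rows = conn.execute(
    "SELECT fuel_type, avg_mileage, avg_price FROM fuel ORDER BY avg_mileage DESC"
).fetchall()
for row in fuel_rows:
    print(row)
```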

17. Is there a relationship between make, model and price?

There is definitely a relationship between the three variables: they are all connected, and each can act as a dependent or
an independent variable in some cases. We would not have been able to complete our analysis without
including one or more of these variables.

18. From the analysis, it was evident that the quality of the data was affecting our results. Steps were
taken to clean the data, but these attempts do not seem to have worked. We
attempted to remove all NULL values from the ‘maker’ column by creating a temp table, which was
essentially a copy of the original table with a condition applied. The original table was then overwritten
with the temp table (intending to drop the rows with NULL in ‘maker’, thereby reducing the overall
number of rows). However, after checking the number of rows, it seems that this code had no effect.
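
One possible explanation, which this report does not confirm, is that the missing makers were stored as empty strings rather than as NULL, in which case an IS NOT NULL condition removes nothing. The sketch below illustrates that effect with sqlite3 standing in for Hive; the helper count_rows and the sample rows are hypothetical, and the actual cleaning query used in the project is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL, price_eur REAL)")

# Invented rows: two have a blank maker, stored as text rather than NULL.
conn.executemany(
    "INSERT INTO cartable VALUES (?, ?, ?)",
    [
        ("maker_a", 150000, 8000),
        ("", 120000, 7000),       # empty string, as a CSV load may produce
        ("maker_b", 90000, 12000),
        ("   ", 50000, 3000),     # whitespace only
    ],
)


def count_rows(conn, condition="1 = 1"):
    """Count the rows of cartable that satisfy a fixed SQL condition."""
    query = f"SELECT COUNT(*) FROM cartable WHERE {condition}"
    return conn.execute(query).fetchone()[0]


total = count_rows(conn)
after_null_filter = count_rows(conn, "maker IS NOT NULL")
after_blank_filter = count_rows(conn, "maker IS NOT NULL AND TRIM(maker) <> ''")

print(total, after_null_filter, after_blank_filter)
```

If the row count is unchanged after the NULL filter but drops after the blank filter, the missing values are blanks rather than NULLs, and the cleaning condition needs to test for both.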

Other possible data cleansing tasks could include:

• Limiting the range of values for certain columns (Mileage, Price)
• Re-mapping the ‘price’ column to be presented in CAD (vs. Euros)
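
Both cleansing tasks can be expressed as a single filtered query. The sketch below uses sqlite3 as a stand-in for Hive. Every bound and the exchange rate are hypothetical placeholders chosen for illustration; none of them come from this report, and real values would have to be chosen after inspecting the data.

```python
import sqlite3

# Hypothetical bounds and exchange rate (assumptions, not values from the report).
MAX_MILEAGE_KM = 1_000_000
MIN_PRICE_EUR = 100
MAX_PRICE_EUR = 500_000
EUR_TO_CAD = 1.5


def cleaned_rows(conn):
    """Keep rows inside the plausible ranges and add the price in CAD."""
    return conn.execute(
        """
        SELECT maker, mileage, price_eur * ? AS price_cad
        FROM cartable
        WHERE mileage BETWEEN 0 AND ?
          AND price_eur BETWEEN ? AND ?
        ORDER BY maker
        """,
        (EUR_TO_CAD, MAX_MILEAGE_KM, MIN_PRICE_EUR, MAX_PRICE_EUR),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL, price_eur REAL)")

# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO cartable VALUES (?, ?, ?)",
    [
        ("maker_a", 150000, 8000),       # plausible row, kept
        ("maker_b", 420777123456, 5000), # phone-number-like mileage, dropped
        ("maker_c", 90000, 0.04),        # implausibly low price, dropped
    ],
)

kept = cleaned_rows(conn)
print(kept)
```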

19. Does fuel type have an impact on price?

The type of fuel does seem to have an impact on the price of the car. In our analysis above,
gasoline cars show the most distance driven and the highest resale value. Diesel cars are second
in both driving distance and price. Comparing the two, gasoline/petrol is much easier
to find at a gas station, and some fuel stations do not carry diesel as widely as gasoline. In the past,
diesel raised more air-pollution concerns than gasoline. Diesel cars were also once seen as
more fuel efficient than gasoline cars, but that has changed because gasoline cars are now built to
be more fuel efficient.

20. Conclusion
After carefully reviewing and analysing the cars dataset for the prospective used-car business, our
advice is based on the intention to maximize revenue and profit. Revenue is dependent on car
sales, and sales are dependent on the value proposition for prospective customers. We have made a
judgement that used-car customers want reliable cars that also retain their monetary value relatively
longer than the alternatives. As such, the cars we are recommending are the best candidates to
meet those two requirements, as shown by the 'mileage' and 'price' analysis.
