Hadoop Hive - One
S.E.A.L. Cars
Analysis
Report
1. Introduction
The cars data used for our analysis was compiled from several websites in the Czech Republic
and Germany over a period of more than a year. Some of the sources provided unstructured data,
so the data is dirty: there are missing values, and some values are plainly wrong. For example,
phone numbers have been entered as mileage in some cases. The data consists of roughly
3.5 million rows and sixteen columns: maker, model, mileage (in km),
manufacture_year, engine_displacement, engine_power, body_type, color_slug, stk_year,
transmission, door_count, seat_count, fuel_type, date_created, date_last_seen, price_eur.
We will focus on make, mileage, fuel type, and price to estimate which cars have the
best resale value. These factors will help us decide whether our business should specialize in
economical cars only or in both luxury and economical cars. We will use Hadoop
and Hive on Google Cloud Platform to perform our analysis.
2. The car data was previously uploaded into Hadoop, and the database was previously created. Below,
the data is being prepared to be loaded into Hive.
3. Here we have the “project1” database in Hive, with a table named “cartable” to load the car.csv data into.
4. We load the car.csv data into “cartable” and confirm the first 5 rows of the data.
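The Hive statements for steps 3 and 4 are not reproduced in the text. As a minimal sketch of the same create, load, and inspect sequence, the block below assumes Python's built-in sqlite3 module as a stand-in for Hive and a few made-up rows in place of car.csv. The helper name load_cartable, the column types, and the sample values are illustrative assumptions; only the table name and the sixteen column names come from the report.

```python
import csv
import io
import sqlite3

# The sixteen columns listed in the introduction.
COLUMNS = [
    "maker", "model", "mileage", "manufacture_year", "engine_displacement",
    "engine_power", "body_type", "color_slug", "stk_year", "transmission",
    "door_count", "seat_count", "fuel_type", "date_created",
    "date_last_seen", "price_eur",
]
# Assumption: these columns are numeric; the report does not state the types.
NUMERIC = {"mileage", "manufacture_year", "engine_displacement",
           "engine_power", "door_count", "seat_count", "price_eur"}

# Made-up stand-in for car.csv (NOT rows from the real dataset).
sample_rows = [
    {"maker": "maker_a", "model": "model_x", "mileage": "120000",
     "fuel_type": "diesel", "price_eur": "5000"},
    {"maker": "maker_b", "model": "model_y", "mileage": "45000",
     "fuel_type": "gasoline", "price_eur": "30000"},
    {"maker": "", "model": "", "mileage": "90000",
     "fuel_type": "gasoline", "price_eur": "7000"},  # missing maker
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="")
writer.writeheader()
writer.writerows(sample_rows)
SAMPLE_CSV = buf.getvalue()


def load_cartable(conn, csv_text):
    """Create cartable and load every CSV row into it; return the row count."""
    column_defs = ", ".join(
        f"{c} {'REAL' if c in NUMERIC else 'TEXT'}" for c in COLUMNS
    )
    conn.execute(f"CREATE TABLE cartable ({column_defs})")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [tuple(row[c] for c in COLUMNS) for row in reader]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(f"INSERT INTO cartable VALUES ({placeholders})", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_cartable(conn, SAMPLE_CSV)

# Confirm the load by looking at (up to) the first 5 rows.
first_rows = conn.execute("SELECT * FROM cartable LIMIT 5").fetchall()
for row in first_rows:
    print(row)
```

Note that the row with a missing maker is stored here as an empty string rather than NULL, which matters for any later cleanup of the maker column.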
Analysis (Questions)
1. What is the relationship between car makes, models and price?
2. Which top five vehicle manufacturers would you recommend? Why?
3. Does fuel type have any impact on the car price? Explain.
5. Although these are specific questions, our analysis aims to identify the cars that have the most value. We
decided to run numerous queries to find out what can indicate good value in cars.
We first tried to identify the car makers with the highest avg. mileage (lifespan) and the highest avg. resale price.
To get the results, we created a new table to store the avg. mileage and avg. price by maker.
6. From the car.csv data, we load the values into the newly created “average” table and display the first
entries.
From the results above, Jaguar, Lotus, and Maserati seem to have the best results of the 5
outputs when comparing mileage and price.
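The per-maker aggregation behind the “average” table can be sketched as follows. This is an illustration under assumptions, not the original Hive code: it uses Python's built-in sqlite3 module in place of Hive, a reduced three-column “cartable”, made-up rows, and the hypothetical column names avg_mileage and avg_price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL, price_eur REAL)")

# Made-up rows for illustration only (NOT values from the real dataset).
conn.executemany("INSERT INTO cartable VALUES (?, ?, ?)", [
    ("maker_a", 100000, 4000),
    ("maker_a", 200000, 2000),
    ("maker_b", 50000, 30000),
])

# One row per maker, holding its average mileage and average price.
conn.execute("""
    CREATE TABLE average AS
    SELECT maker,
           AVG(mileage)   AS avg_mileage,
           AVG(price_eur) AS avg_price
    FROM cartable
    GROUP BY maker
""")

averages = conn.execute(
    "SELECT maker, avg_mileage, avg_price FROM average ORDER BY maker"
).fetchall()
for row in averages:
    print(row)
```

Storing the aggregates in their own table means the later rankings only have to sort a few dozen maker rows instead of rescanning millions of listings.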
7. Here are the top 10 makers with the highest avg. mileage. Although there are other factors to consider,
this can be an indicator of a vehicle’s avg. lifespan.
Looking at the lifespan of the cars in the outputs above, Chrysler seems to have the best
lifespan. The lifespan of a car also depends on how well its owner maintained it.
8. Here are the top 10 makers with the highest avg. price. This can be an indication of a car’s good
resale value.
Unfortunately, the highest avg. price by itself does not seem to be a good indication of car value, because this
list consists mostly of high-end luxury vehicles, which naturally cost more than economy
cars. As stated in the introduction, the data quality is also poor. Therefore, we decided to analyze cars in
two groups: an economy group (top 10 highest avg. mileage) and a luxury group (top 10 highest avg. price).
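The two groups amount to the same ranking query with a different sort column. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for Hive, a made-up “average” table, and a hypothetical helper named top_makers:

```python
import sqlite3


def top_makers(conn, order_by, n=10):
    """Return the n makers with the highest value in the given column."""
    # Whitelist the column name, since it is interpolated into the SQL text.
    if order_by not in ("avg_mileage", "avg_price"):
        raise ValueError(f"unsupported sort column: {order_by}")
    query = (
        "SELECT maker, avg_mileage, avg_price FROM average "
        f"ORDER BY {order_by} DESC LIMIT ?"
    )
    return conn.execute(query, (n,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE average (maker TEXT, avg_mileage REAL, avg_price REAL)"
)
# Made-up aggregates for illustration only (NOT results from the report).
conn.executemany("INSERT INTO average VALUES (?, ?, ?)", [
    ("maker_a", 150000, 3000),
    ("maker_b", 50000, 30000),
    ("maker_c", 120000, 8000),
    ("maker_d", 80000, 60000),
])

economy_group = top_makers(conn, "avg_mileage", n=2)  # most driven
luxury_group = top_makers(conn, "avg_price", n=2)     # most expensive
print(economy_group)
print(luxury_group)
```

Each returned row carries both averages, so a group ranked by one metric can be read against the other, which is how the economy and luxury lists are compared in the steps that follow.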
9. To come up with more reasonable numbers, we sort the economy group by highest
avg. mileage and also include the avg. price to view avg. resale value as well.
Keep in mind that the quality of the dataset affects the results of the analysis. For instance, Subaru’s avg.
mileage (136K) places it in the economy group, but its avg. price (147K) does not seem realistic. In
addition, rows with a missing “maker” field were included in this group, averaging 547K in price and
124K in mileage. Despite this, the analysis suggests some good economical options such as
Chrysler, Skoda, Volvo, Honda, Ford, and Volkswagen, and some options with really bad resale value
such as Alfa-Romeo and Land-Rover.
10. Now we perform the same analysis on the luxury group. To do that, we sort the group by highest
avg. price and also include avg. mileage to review how much the cars are driven.
11. Again, we see that the quality of the data affects the quality of this analysis. For instance, the prices for
Renault, BMW, Subaru (again), and Citroen are not realistic, and rows with no maker field were included in this
list again. However, a trend in the luxury group can be seen: high-end luxury vehicles like
Lamborghini, Bentley, Porsche, and Maserati are not driven more than 100K in avg. mileage, whereas the top 10 in
the economy group are all driven more than 100K.
12. Because we are analysing a used-car dataset and recommending the most valuable
vehicles, we will broaden our search to the top 20 makers by highest avg. mileage.
13. Keeping “value” in mind, meaning cars that are driven long distances while maintaining a higher resale value,
here are our 5 recommendations.
Volvo
Honda
Land Rover
Audi
Mercedes-Benz or Lexus
14. Now, to analyze the data by fuel type, we created a new table.
15. We added the data to the new table “fuel”, applying the same logic: avg. mileage and
avg. price illustrate a vehicle’s value by fuel type.
16. Here are the results of the fuel table query.
While gasoline vehicles show the most distance driven and the highest resale value, diesel seems to be second. LPG
and electric vehicles seem to have similar efficiency. CNG has the lowest avg. mileage and the lowest price.
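The “fuel” table applies the maker aggregation with fuel_type as the grouping key. A short sketch, again assuming Python's built-in sqlite3 module in place of Hive; the placeholder fuel names, the rows, and the column names avg_mileage and avg_price are made up for illustration and do not reproduce the report's results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cartable (fuel_type TEXT, mileage REAL, price_eur REAL)"
)
# Made-up rows with placeholder fuel names (NOT values from the dataset).
conn.executemany("INSERT INTO cartable VALUES (?, ?, ?)", [
    ("fuel_a", 140000, 9000),
    ("fuel_a", 100000, 11000),
    ("fuel_b", 110000, 8000),
    ("fuel_c", 60000, 3000),
])

# One row per fuel type with its average mileage and average price.
conn.execute("""
    CREATE TABLE fuel AS
    SELECT fuel_type,
           AVG(mileage)   AS avg_mileage,
           AVG(price_eur) AS avg_price
    FROM cartable
    GROUP BY fuel_type
""")

fuel_rows = conn.execute(
    "SELECT fuel_type, avg_mileage, avg_price FROM fuel "
    "ORDER BY avg_mileage DESC"
).fetchall()
for row in fuel_rows:
    print(row)
```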
17. There is definitely a relationship between the three variables (make, model, and price): they are all
connected, and each can act as a dependent or an independent variable in some cases. We would not have been
able to complete our analysis without including one or more of these variables.
18. From the analysis, it was evident that the quality of the data was affecting our results. We took steps
to clean the data, but these attempts do not appear to have worked. See below, where we
attempted to remove all NULL values from the ‘maker’ column by creating a temp table that was
essentially a copy of the original table with a condition applied. The original table was then overwritten
with the temp table (intending to drop the rows with NULL in ‘maker’ and thereby reduce the overall
number of rows). However, after checking the number of rows, it seems that this code had no effect.
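One possible explanation, offered as an assumption rather than a diagnosis of the original code: when a CSV file is loaded into a text-backed table, a missing maker can arrive as an empty string instead of NULL, and an IS NOT NULL condition then matches every row. The sketch below illustrates that effect with Python's built-in sqlite3 module standing in for Hive; the rows and the helper name count_rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cartable (maker TEXT, mileage REAL, price_eur REAL)")
# Made-up rows (NOT from the dataset): two listings have no usable maker,
# but the value is stored as an empty or blank string, not as NULL.
conn.executemany("INSERT INTO cartable VALUES (?, ?, ?)", [
    ("maker_a", 100000, 4000),
    ("", 90000, 7000),
    ("  ", 80000, 6000),
])


def count_rows(conn, condition="1 = 1"):
    """Count the cartable rows that satisfy a fixed, trusted SQL condition."""
    query = f"SELECT COUNT(*) FROM cartable WHERE {condition}"
    return conn.execute(query).fetchone()[0]


total = count_rows(conn)
# Filtering on NULL alone keeps every row, so the cleanup appears to do nothing.
kept_not_null = count_rows(conn, "maker IS NOT NULL")
# Also rejecting empty and whitespace-only makers removes the bad rows.
kept_strict = count_rows(conn, "maker IS NOT NULL AND TRIM(maker) <> ''")

print(total, kept_not_null, kept_strict)
```

If this is what happened, comparing the row count under each condition before overwriting the table would show whether the filter removes anything.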
19. The type of fuel does seem to have an impact on the price of the car. In our analysis above,
gasoline cars seem to have the most distance driven and the highest resale value. Diesel cars are second
in both driving distance and price. Comparing the two, gasoline/petrol is much easier
to find at a gas station, and some fuel stations do not carry diesel as widely as gasoline. In the past,
diesel raised more air pollution concerns than gasoline. Diesel cars were also seen as
more fuel efficient than gasoline cars, but that has changed because gasoline cars are now built to
be more fuel efficient.
20. Conclusion
After carefully reviewing and analysing the cars dataset for the prospective used-car business, our
advice is based on the intention to maximize revenue and profit. Revenue depends on car
sales, and sales depend on the value proposition for prospective customers. We have made the
judgement that used-car customers want reliable cars that also retain their monetary value
longer than the alternatives. As such, the cars we are recommending are the best candidates to
meet those two requirements, as shown by the 'mileage' and 'price' analysis.