Belarus Car Price Prediction
The dataset was taken from Kaggle. It has 56,244 rows and 12 columns.
Data Dictionary
Variable Description
condition represents the condition of the car at the moment of sale (with mileage, for parts, etc.)
   make   model  priceUSD  year  condition     mileage(kilometers)  fuel_type  volume(cm3)
0  mazda  2      5500      2008  with mileage  162000.0             petrol     150...
1  mazda  2      5350      2009  with mileage  120000.0             petrol     130...
2  mazda  2      7000      2009  with mileage  61000.0              petrol     150...
3  mazda  2      3300      2003  with mileage  265000.0             diesel     140...
4  mazda  2      5200      2008  with mileage  97183.0              diesel     140...
In [ ]: # Dropping the columns that are not needed for the analysis
df.drop(columns=['model', 'segment'], inplace=True)
Out[ ]: make 96
priceUSD 2970
year 78
condition 3
mileage(kilometers) 8400
fuel_type 3
volume(cm3) 458
color 13
transmission 2
drive_unit 4
dtype: int64
Since there are too many car makes to analyze individually, I will group them into
categories: Luxury European, Mainstream European, Russian/Eastern European, Asian,
American, Speciality, and Other. The grouping is based on the car make and its
country of origin.
df['make_segment'] = df['make'].apply(car_make)
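The definition of `car_make` is not shown in this export. A minimal sketch of what such a grouping function might look like is given below; the brand lists are illustrative assumptions, deliberately incomplete, and not the notebook's actual lists. Only the segment names and the `df['make'].apply(car_make)` call pattern come from the text above.

```python
import pandas as pd

# Illustrative, incomplete brand lists (assumptions, not the notebook's own).
SEGMENTS = {
    'Luxury European': {'audi', 'bmw', 'mercedes-benz', 'porsche'},
    'Mainstream European': {'volkswagen', 'renault', 'peugeot', 'opel'},
    'Russian/Eastern European': {'lada-vaz', 'gaz', 'uaz', 'skoda'},
    'Asian': {'mazda', 'toyota', 'honda', 'hyundai'},
    'American': {'ford', 'chevrolet', 'jeep', 'dodge'},
    'Speciality': {'tesla', 'mclaren', 'bentley'},
}

def car_make(make):
    """Return the segment for a make, or 'Other' if the make is not listed."""
    name = str(make).strip().lower()
    for segment, makes in SEGMENTS.items():
        if name in makes:
            return segment
    return 'Other'

# Small stand-in frame; the real notebook applies this to df['make'].
demo = pd.DataFrame({'make': ['mazda', 'bentley', 'bmw', 'unknown-brand']})
demo['make_segment'] = demo['make'].apply(car_make)
print(demo)
```

Keeping the lists in one dictionary makes it easy to move a brand between segments without touching the function itself.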
Descriptive statistics
In [ ]: df.describe()
In [ ]: df.head()
   make   priceUSD  year  condition     mileage(kilometers)  fuel_type  volume(cm3)  color
0  mazda  5500      2008  with mileage  162000.0             petrol     1500.0       bur...
1  mazda  5350      2009  with mileage  120000.0             petrol     1300.0       ...
2  mazda  7000      2009  with mileage  61000.0              petrol     1500.0       ...
3  mazda  3300      2003  with mileage  265000.0             diesel     1400.0       ...
4  mazda  5200      2008  with mileage  97183.0              diesel     1400.0       ...
In the dataset, most of the cars are European (the majority are Luxury, followed by
Mainstream and Russian/Eastern European). However, the dataset also contains
American and Asian cars, as well as some speciality cars such as Tesla, McLaren, and
Bentley. Some cars do not fall into any of the above categories.
The graphs above give an overview of the categorical variables in the dataset. The
majority of the cars sold are in working condition. Most run on petrol, followed by
diesel, and hardly any run on electricity. Most cars have manual transmission and
front-wheel drive, and the most common colors are black, silver, blue, white, and
grey.
The graphs above show the distribution of the continuous variables. The majority of
the cars were manufactured between 1990 and 2019, are priced below 50k USD, have a
mileage below 1 million km, and have an engine volume between 1750 and 2000 cm3.
Since most of the cars were manufactured after 1980, I will only consider cars
manufactured after 1980.
In [ ]: df = df[df['year'] > 1980]
# Bar Plot
plt.figure(figsize=(8,5))
sns.barplot(y='make', x='priceUSD', data=demodf)
plt.xticks(rotation=90)
plt.title('Top 10 Most Expensive Car Brands')
plt.ylabel('Car Brand')
plt.xlabel('Price in USD')
plt.show()
This graph shows the top 10 most expensive car brands in the dataset. The top 5 most
expensive brands are Bentley, McLaren, Aston Martin, Tesla, and Maserati.
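The plotting cell above reads from `demodf`, whose construction is not shown in this export. One plausible way to build such a frame, assuming it holds the mean `priceUSD` per `make` limited to the ten highest, is sketched below on made-up toy data; the aggregation choice (mean rather than median or max) is an assumption.

```python
import pandas as pd

# Toy stand-in for df; makes and prices are made up for illustration.
toy = pd.DataFrame({
    'make': ['bentley', 'bentley', 'mazda', 'mazda', 'tesla'],
    'priceUSD': [90000, 110000, 5500, 7000, 60000],
})

# Mean price per make, highest first, limited to the top 10.
demodf = (
    toy.groupby('make', as_index=False)['priceUSD']
    .mean()
    .sort_values('priceUSD', ascending=False)
    .head(10)
)
print(demodf)
```

The resulting frame has the `make` and `priceUSD` columns that the `sns.barplot` call expects.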
This graph shows the relationship between the price and the year of the car, along with
the condition in which it was sold. Cars sold in working condition are more expensive,
and their price increased with time, rising exponentially between 2015 and 2020.
Damaged cars were priced similarly to cars sold for parts between 1980 and 2000.
However, the price of damaged cars increased significantly after 2000. Cars sold for
parts tend to have the lowest price, and their price increased very little with time.
Cars running on petrol and diesel have similar mileage, but their prices are quite
different: petrol cars tend to be priced higher than diesel ones. Electric cars tend
to have very high prices and low mileage.
This graph shows how car prices changed over time for each transmission type. The
price of cars with automatic transmission decreased significantly after 1983 and then
increased exponentially after 2000. Cars with manual transmission were always
cheaper than cars with automatic transmission, although they show a similar increase
in price after 2000.
Until 2005, there was no major difference in price between petrol and diesel cars.
After 2015, however, the price of petrol cars increased significantly, whereas the
price of diesel cars increased only by a small margin. The graph also highlights the
introduction of electro (electric) cars in 1995. The price of electro cars increases
exponentially after 2015, making them the most expensive cars by fuel type.
Between 1980 and 1995, there was not much difference in car prices across drive
units. After 1995, however, the price of cars with front-wheel drive increased at a
slower pace than that of the other drive units. The price of cars with all-wheel
drive increased significantly after 2005 and is the highest among all the drive
units, followed by part-time four-wheel drive and rear-wheel drive.
This graph shows the surge in car prices after 2005: the price of the speciality
segment increased significantly, followed by the Luxury European, American, Asian,
and Mainstream European segments. The price of the Russian/Eastern European segment
increased at a slower pace than that of the other segments and is the lowest among
all the segments.
Out[ ]: make 0
priceUSD 0
year 0
condition 0
mileage(kilometers) 0
fuel_type 0
volume(cm3) 47
color 0
transmission 0
drive_unit 1874
make_segment 0
dtype: int64
Since the number of null values is small compared to the dataset size, I will drop
the rows that contain them.
In [ ]: df.dropna(inplace=True)
In [ ]: df.drop(columns=['make'], inplace=True)
# columns to encode
cols = ['condition', 'fuel_type', 'transmission', 'color', 'drive_unit', 'make_s
condition [2 1 0]
fuel_type [1 0]
transmission [1 0]
color [ 3 0 10 11 4 1 7 8 9 5 2 12 6]
drive_unit [1 3 0 2]
make_segment [2 3 5 0 4 6 1]
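The encoding cell is cut off in this export, so the exact method is not visible. The integer codes printed above are consistent with a label encoding that sorts category names alphabetically. A minimal sketch under that assumption, using scikit-learn's `LabelEncoder` on a toy frame (the category name `'with damage'` is a guess, not taken from the source), could look like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df; 'with damage' is an assumed category name.
demo = pd.DataFrame({
    'condition': ['with mileage', 'with damage', 'for parts'],
    'fuel_type': ['petrol', 'diesel', 'petrol'],
})

# Columns to encode (a shortened list for the toy frame).
cols = ['condition', 'fuel_type']

for col in cols:
    # Each column gets its own encoder; classes are sorted alphabetically.
    demo[col] = LabelEncoder().fit_transform(demo[col])
    print(col, demo[col].unique())
```

Label encoding imposes an arbitrary order on the categories. Tree-based models such as the one used later are generally tolerant of this, which is one reason it is a common choice here.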
Outlier Removal
In [ ]: # Using Z-score to remove outliers
import numpy as np
from scipy import stats
z = np.abs(stats.zscore(df))
threshold = 3
# keep only the rows where every column lies within the threshold
df = df[(z < threshold).all(axis=1)]
Model Building
#best parameters
print(grid.best_params_)
C:\Users\DELL\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sklearn\tree\_classes.py:277:
FutureWarning: `max_features='auto'` has been deprecated in 1.1 and will be removed in 1.3.
To keep the past behaviour, explicitly set `max_features=1.0'`.
warnings.warn(
Out[ ]: DecisionTreeRegressor
In [ ]: #training score
dtr.score(X_train, y_train)
Out[ ]: 0.8689232243678456
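Most of the model-building cells are missing from this export; only `grid.best_params_`, the fitted `DecisionTreeRegressor`, and the training score survive. The sketch below shows one way such a workflow might be put together with `GridSearchCV`. The synthetic data, the parameter grid, the split ratio, and the random seeds are all assumptions chosen for illustration, not the notebook's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded features and priceUSD target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 5000 * X[:, 0] + 2000 * X[:, 1] + rng.normal(scale=100, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space; the notebook's real grid is not shown.
param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_leaf': [1, 5, 10]}

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid, cv=5, scoring='r2'
)
grid.fit(X_train, y_train)

# best parameters
print(grid.best_params_)

# Refit a tree with the selected parameters and check the training score.
dtr = DecisionTreeRegressor(random_state=0, **grid.best_params_)
dtr.fit(X_train, y_train)
print(dtr.score(X_train, y_train))

# Predictions on the held-out set, used later for evaluation.
y_pred = dtr.predict(X_test)
```

Note that the grid here avoids `max_features='auto'`, the value the deprecation warning refers to.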
Model Evaluation
In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
print('R2 Score: ', r2_score(y_test, y_pred))
print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(y_test, y_pred)))
R2 Score: 0.8529954473045238
Mean Squared Error: 4704555.776616746
Mean Absolute Error: 1414.2804910704947
Root Mean Squared Error: 2168.9987959002524
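For readers who want to see how these four numbers relate, the sketch below computes them by hand with NumPy on a tiny made-up example (the arrays are illustrative, not the notebook's data). The root mean squared error is simply the square root of the mean squared error, which matches the output above: the square root of 4704555.78 is about 2169.00.

```python
import numpy as np

# Tiny made-up example, not the notebook's data.
y_test = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_test - y_pred

mse = np.mean(residuals ** 2)       # mean squared error
mae = np.mean(np.abs(residuals))    # mean absolute error
rmse = np.sqrt(mse)                 # root mean squared error

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('R2 Score: ', r2)
print('Mean Squared Error: ', mse)
print('Mean Absolute Error: ', mae)
print('Root Mean Squared Error: ', rmse)
```

Because MAE and RMSE are in the same unit as the target, the values reported above can be read directly as errors in USD.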
Feature Importance
In [ ]: feat_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': dtr.feature_im
feat_df = feat_df.sort_values(by='Importance', ascending=False)
feat_df
   Feature              Importance
0  year                 0.754301
4  volume(cm3)          0.200413
3  fuel_type            0.017333
6  transmission         0.010267
8  make_segment         0.009639
7  drive_unit           0.006883
2  mileage(kilometers)  0.000872
5  color                0.000292
1  condition            0.000000
In [ ]: # Bar Plot
sns.set_style('darkgrid')
plt.figure(figsize=(8,5))
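The plotting cell is cut off after `plt.figure`. A possible completion is sketched below, assuming a horizontal seaborn bar plot of `Importance` against `Feature`. To keep the sketch self-contained, `feat_df` is rebuilt from the importance values printed above. The axis orientation and the title are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Importances copied from the table printed above.
feat_df = pd.DataFrame({
    'Feature': ['year', 'volume(cm3)', 'fuel_type', 'transmission',
                'make_segment', 'drive_unit', 'mileage(kilometers)',
                'color', 'condition'],
    'Importance': [0.754301, 0.200413, 0.017333, 0.010267,
                   0.009639, 0.006883, 0.000872, 0.000292, 0.000000],
})

# Bar Plot
sns.set_style('darkgrid')
plt.figure(figsize=(8, 5))
ax = sns.barplot(x='Importance', y='Feature', data=feat_df)
ax.set_title('Feature Importance')  # assumed title
plt.tight_layout()
plt.show()
```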
Conclusion
The aim of this project was to predict car prices in Belarus by analyzing car
features such as brand, year, engine volume, fuel type, transmission, mileage, drive
unit, color, and segment. The exploratory data analysis showed a significant increase
in car prices in Belarus after the year 2000. Petrol cars with automatic transmission
are priced higher than diesel cars with manual transmission, while electric cars are
distinctly more expensive than the rest. Cars with all-wheel drive have the highest
price among the drive units. Speciality cars have the highest price among the
segments, followed by the Luxury European, American, and Asian segments.
A decision tree regressor was used to predict the car price. The model reached an R2
score of 0.853 on the test set (0.869 on the training set), meaning it explains about
85% of the variance in price. The most important features for predicting the car price
were the year of manufacture and the engine volume.