
WEEK 7

ANOVA - ANALYSIS OF VARIANCE

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis
import statsmodels.api as sm
%matplotlib inline

# Load the dataset
path = 'Imports_Autos_85.csv'
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight", "engine-type",
"num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
"compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df = pd.read_csv(path, names=headers)

# Display the first few rows of the dataset
print(df.head())

symboling normalized-losses make fuel-type aspiration num-of-doors \
0 3 ? alfa-romero gas std two
1 3 ? alfa-romero gas std two
2 1 ? alfa-romero gas std two
3 2 164 audi gas std four
4 2 164 audi gas std four

body-style drive-wheels engine-location wheel-base ... engine-size \
0 convertible rwd front 88.6 ... 130
1 convertible rwd front 88.6 ... 130
2 hatchback rwd front 94.5 ... 152
3 sedan fwd front 99.8 ... 109
4 sedan 4wd front 99.4 ... 136

fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg \
0 mpfi 3.47 2.68 9.0 111 5000 21
1 mpfi 3.47 2.68 9.0 111 5000 21
2 mpfi 2.68 3.47 9.0 154 5000 19
3 mpfi 3.19 3.40 10.0 102 5500 24
4 mpfi 3.19 3.40 8.0 115 5500 18

highway-mpg price
0 27 13495
1 27 16500
2 26 16500
3 30 13950
4 22 17450

[5 rows x 26 columns]

# Convert relevant columns to numeric, coercing the '?' placeholders to NaN
numeric_columns = ["symboling", "normalized-losses", "wheel-base", "length", "width", "height",
"curb-weight", "engine-size", "bore", "stroke", "compression-ratio",
"horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Fill missing values with the column mean
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
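An optional sanity check, added here as a sketch: confirm that no missing values remain after the mean-fill.

# Optional check: count remaining NaNs (expected to be 0 after the fill)
print(df[numeric_columns].isna().sum().sum())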

# Display basic statistics
print(df.describe())

symboling normalized-losses wheel-base length width \
count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 122.000000 98.756585 174.049268 65.907805
std 1.245307 31.681008 6.021776 12.337289 2.145204
min -2.000000 65.000000 86.600000 141.100000 60.300000
25% 0.000000 101.000000 94.500000 166.300000 64.100000
50% 1.000000 122.000000 97.000000 173.200000 65.500000
75% 2.000000 137.000000 102.400000 183.100000 66.900000
max 3.000000 256.000000 120.900000 208.100000 72.300000

height curb-weight engine-size bore stroke \
count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 53.724878 2555.565854 126.907317 3.329751 3.255423
std 2.443522 520.680204 41.642693 0.270844 0.313597
min 47.800000 1488.000000 61.000000 2.540000 2.070000
25% 52.000000 2145.000000 97.000000 3.150000 3.110000
50% 54.100000 2414.000000 120.000000 3.310000 3.290000
75% 55.500000 2935.000000 141.000000 3.580000 3.410000
max 59.800000 4066.000000 326.000000 3.940000 4.170000


compression-ratio horsepower peak-rpm city-mpg highway-mpg \
count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 10.142537 104.256158 5125.369458 25.219512 30.751220
std 3.972040 39.519211 476.979093 6.542142 6.886443
min 7.000000 48.000000 4150.000000 13.000000 16.000000
25% 8.600000 70.000000 4800.000000 19.000000 25.000000
50% 9.000000 95.000000 5200.000000 24.000000 30.000000
75% 9.400000 116.000000 5500.000000 30.000000 34.000000
max 23.000000 288.000000 6600.000000 49.000000 54.000000

price
count 205.000000
mean 13207.129353
std 7868.768212
min 5118.000000
25% 7788.000000
50% 10595.000000
75% 16500.000000
max 45400.000000

# Visualizing the distribution of the target variable (price)
sns.histplot(df['price'], kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Visualize relationships between variables and price
sns.pairplot(df, x_vars=['engine-size', 'horsepower', 'curb-weight', 'highway-mpg'], y_vars='price', kind='reg')
plt.show()

# Calculate and display skewness and kurtosis
print("Skewness:")
print(df[numeric_columns].skew())
print("\nKurtosis:")
print(df[numeric_columns].kurtosis())

Skewness:
symboling 0.211072

normalized-losses 0.854802
wheel-base 1.050214
length 0.155954
width 0.904003
height 0.063123
curb-weight 0.681398
engine-size 1.947655
bore 0.020211
stroke -0.689784
compression-ratio 2.610862
horsepower 1.397763
peak-rpm 0.073591
city-mpg 0.663704
highway-mpg 0.539997
price 1.827324
dtype: float64

Kurtosis:
symboling -0.676271
normalized-losses 1.404644
wheel-base 1.017039
length -0.082895
width 0.702764
height -0.443812
curb-weight -0.042854
engine-size 5.305682
bore -0.785040
stroke 2.174471
compression-ratio 5.233054
horsepower 2.678182
peak-rpm 0.086770
city-mpg 0.578648
highway-mpg 0.440070
price 3.354216
dtype: float64

numeric_df = df[numeric_columns]
correlation_matrix = numeric_df.corr()
print(correlation_matrix)

symboling normalized-losses wheel-base length \
symboling 1.000000 0.465190 -0.531954 -0.357612
normalized-losses 0.465190 1.000000 -0.056518 0.019209
wheel-base -0.531954 -0.056518 1.000000 0.874587
length -0.357612 0.019209 0.874587 1.000000
width -0.232919 0.084195 0.795144 0.841118
height -0.541038 -0.370706 0.589435 0.491029
curb-weight -0.227691 0.097785 0.776386 0.877728
engine-size -0.105790 0.110997 0.569329 0.683360
bore -0.130083 -0.029266 0.488760 0.606462
stroke -0.008689 0.054929 0.160944 0.129522
compression-ratio -0.178515 -0.114525 0.249786 0.158414
horsepower 0.071389 0.203434 0.351957 0.554434
peak-rpm 0.273679 0.237748 -0.360704 -0.287031
city-mpg -0.035823 -0.218749 -0.470414 -0.670909
highway-mpg 0.034606 -0.178221 -0.544082 -0.704662
price -0.082201 0.133999 0.583168 0.682986

width height curb-weight engine-size bore \
symboling -0.232919 -0.541038 -0.227691 -0.105790 -0.130083
normalized-losses 0.084195 -0.370706 0.097785 0.110997 -0.029266
wheel-base 0.795144 0.589435 0.776386 0.569329 0.488760
length 0.841118 0.491029 0.877728 0.683360 0.606462
width 1.000000 0.279210 0.867032 0.735433 0.559152
height 0.279210 1.000000 0.295572 0.067149 0.171101
curb-weight 0.867032 0.295572 1.000000 0.850594 0.648485
engine-size 0.735433 0.067149 0.850594 1.000000 0.583798
bore 0.559152 0.171101 0.648485 0.583798 1.000000
stroke 0.182939 -0.055351 0.168783 0.203094 -0.055909
compression-ratio 0.181129 0.261214 0.151362 0.028971 0.005201
horsepower 0.642195 -0.110137 0.750968 0.810713 0.575737
peak-rpm -0.219859 -0.320602 -0.266283 -0.244599 -0.254761
city-mpg -0.642704 -0.048640 -0.757414 -0.653658 -0.584508
highway-mpg -0.677218 -0.107358 -0.797465 -0.677470 -0.586992
price 0.728699 0.134388 0.820825 0.861752 0.532300

stroke compression-ratio horsepower peak-rpm \
symboling -0.008689 -0.178515 0.071389 0.273679
normalized-losses 0.054929 -0.114525 0.203434 0.237748
wheel-base 0.160944 0.249786 0.351957 -0.360704
length 0.129522 0.158414 0.554434 -0.287031
width 0.182939 0.181129 0.642195 -0.219859
height -0.055351 0.261214 -0.110137 -0.320602
curb-weight 0.168783 0.151362 0.750968 -0.266283
engine-size 0.203094 0.028971 0.810713 -0.244599
bore -0.055909 0.005201 0.575737 -0.254761
stroke 1.000000 0.186105 0.088264 -0.066844
compression-ratio 0.186105 1.000000 -0.205740 -0.435936
horsepower 0.088264 -0.205740 1.000000 0.130971

peak-rpm -0.066844 -0.435936 0.130971 1.000000
city-mpg -0.042179 0.324701 -0.803162 -0.113723
highway-mpg -0.043961 0.265201 -0.770903 -0.054257
price 0.082095 0.070990 0.757917 -0.100854

city-mpg highway-mpg price
symboling -0.035823 0.034606 -0.082201
normalized-losses -0.218749 -0.178221 0.133999
wheel-base -0.470414 -0.544082 0.583168
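A heatmap makes a 16-column correlation matrix much easier to scan. A minimal sketch, using the matplotlib and seaborn imports already loaded above:

# Sketch: visualize the correlation matrix as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numeric Features')
plt.show()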

import statsmodels.formula.api as smf

# 7. ANOVA Analysis
# Group prices by 'make' and inspect the groups
grouped_test2 = df[['make', 'price']].groupby(['make'])
print(grouped_test2.head(2))
print(grouped_test2.get_group('honda')['price'])

# Fit an OLS model using a formula; Q() quotes the hyphenated column names
model = smf.ols('price ~ Q("engine-size") + Q("horsepower") + Q("curb-weight") + Q("highway-mpg")', data=df).fit()

# ANOVA table (Type II) for the regression terms
anova_results = sm.stats.anova_lm(model, typ=2)
print(anova_results)

make price
0 alfa-romero 13495.000000
1 alfa-romero 16500.000000
3 audi 13950.000000
4 audi 17450.000000
10 bmw 16430.000000
11 bmw 16925.000000
18 chevrolet 5151.000000
19 chevrolet 6295.000000
21 dodge 5572.000000
22 dodge 6377.000000
30 honda 6479.000000
31 honda 6855.000000
43 isuzu 6785.000000
44 isuzu 13207.129353
47 jaguar 32250.000000
48 jaguar 35550.000000
50 mazda 5195.000000
51 mazda 6095.000000
67 mercedes-benz 25552.000000
68 mercedes-benz 28248.000000
75 mercury 16503.000000
76 mitsubishi 5389.000000
77 mitsubishi 6189.000000
89 nissan 5499.000000
90 nissan 7099.000000
107 peugot 11900.000000
108 peugot 13200.000000
118 plymouth 5572.000000
119 plymouth 7957.000000
125 porsche 22018.000000
126 porsche 32528.000000
130 renault 9295.000000
131 renault 9895.000000
132 saab 11850.000000
133 saab 12170.000000
138 subaru 5118.000000
139 subaru 7053.000000
150 toyota 5348.000000
151 toyota 6338.000000
182 volkswagen 7775.000000
183 volkswagen 7975.000000
194 volvo 12940.000000
195 volvo 13415.000000
30 6479.0
31 6855.0
32 5399.0
33 6529.0
34 7129.0
35 7295.0
36 7295.0
37 7895.0
38 9095.0
39 8845.0
40 10295.0
41 12945.0
42 10345.0
Name: price, dtype: float64
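Note that anova_lm above tests the regression terms rather than the make groups. A one-way ANOVA of price across makes could look like the following sketch (the three makes are an illustrative choice):

# Sketch: one-way ANOVA testing whether mean price differs across makes
from scipy.stats import f_oneway

f_val, p_val = f_oneway(
    grouped_test2.get_group('honda')['price'],
    grouped_test2.get_group('subaru')['price'],
    grouped_test2.get_group('jaguar')['price'],
)
print("ANOVA results: F =", f_val, ", p =", p_val)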

# 8. Regression Plots
sns.regplot(x='engine-size', y='price', data=df)
plt.title('Engine Size vs Price')
plt.show()

sns.regplot(x='horsepower', y='price', data=df)
plt.title('Horsepower vs Price')
plt.show()

sns.regplot(x='curb-weight', y='price', data=df)
plt.title('Curb Weight vs Price')
plt.show()


sns.regplot(x='highway-mpg', y='price', data=df)
plt.title('Highway MPG vs Price')
plt.show()


WEEK 8
CALCULATING THE SKEWNESS OF A DATA SET
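Skewness measures the asymmetry of a distribution about its mean. A minimal sketch, assuming a small illustrative dataset, computing it both from the moment formula g1 = m3 / m2**1.5 and with scipy.stats.skew:

import numpy as np
from scipy.stats import skew

data = [2, 8, 0, 4, 1, 9, 9, 0]   # illustrative values
arr = np.array(data, dtype=float)

# Moment-based (biased) sample skewness: m3 / m2**1.5
m2 = np.mean((arr - arr.mean())**2)
m3 = np.mean((arr - arr.mean())**3)
print("Manual skewness:", m3 / m2**1.5)

# scipy.stats.skew uses the same biased estimator by default (bias=True)
print("scipy skewness:", skew(arr))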
WEEK 9
5-POINT SUMMARY

In [3]: import numpy as np

# Example dataset
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Desired percentiles (e.g. the 25th, 50th, and 75th)
percentiles = [25, 50, 75]

# Calculate each percentile using (n + 1)-based positions
for percentile in percentiles:
    # Calculate the (1-based) position within the sorted data
    position = (percentile / 100) * (len(data) + 1)

    # Interpolate if the position falls between two data points
    if position.is_integer():
        value = data[int(position) - 1]
    else:
        lower_index = int(position)
        upper_index = lower_index + 1
        lower_value = data[lower_index - 1]
        upper_value = data[upper_index - 1]
        value = lower_value + (position - lower_index) * (upper_value - lower_value)

    print(f"{percentile}th percentile:", value)

25th percentile: 27.5
50th percentile: 55.0
75th percentile: 82.5

In [7]: # Calculate minimum and maximum
minimum = np.min(data)
maximum = np.max(data)

# Calculate quartiles
Q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
Q3 = np.percentile(data, 75)

# Print the five-number summary
print("Minimum:" , minimum)
print("First Quartile (Q1):" , Q1)
print("Median (Q2):" , median)
print("Third Quartile (Q3):" , Q3)
print("Maximum:" , maximum)

Minimum: 10
First Quartile (Q1): 32.5
Median (Q2): 55.0
Third Quartile (Q3): 77.5
Maximum: 100
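Note that NumPy's Q1 (32.5) differs from the 27.5 computed by the manual loop above: np.percentile defaults to the 'linear' method, which interpolates on (n - 1)-based positions, while the loop uses (n + 1)-based positions. Assuming NumPy 1.22 or newer, the two can be reconciled via the method argument:

# 'weibull' uses the same (n + 1)-based positions as the manual loop
print(np.percentile(data, 25, method='weibull'))   # 27.5
print(np.percentile(data, 25))                     # 32.5 (default 'linear')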


In [8]: import matplotlib.pyplot as plt

# Calculate Quartiles
Q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
Q3 = np.percentile(data, 75)

# Calculate interquartile range (IQR)
IQR = Q3 - Q1
print("IQR: ", IQR)

IQR: 45.0
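The IQR is commonly used to flag outliers: values more than 1.5 * IQR beyond the quartiles. A short sketch:

# Flag values beyond the 1.5 * IQR fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("Fences:", lower_fence, upper_fence)   # -35.0 145.0
print("Outliers:", outliers)                 # [] for this dataset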


In [9]: # Create a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(data, vert=False, patch_artist=True)
plt.title('Box Plot of Data with Interquartile Range (IQR)')
plt.xlabel('Values')
plt.ylabel('Data')
plt.xticks(fontsize=10)
plt.yticks([])
plt.grid(True)

# Highlight median and IQR
plt.scatter(median, 1, color='red', label='Median')
plt.scatter([Q1, Q3], [1, 1], color='blue', label='Q1/Q3')
plt.plot([Q1, Q1], [0.75, 1.25], color='blue')
plt.plot([Q3, Q3], [0.75, 1.25], color='blue')

plt.text(Q1, 1.4, f'Q1 ({Q1})', ha='center')
plt.text(Q3, 1.4, f'Q3 ({Q3})', ha='center')
plt.text(median, 1.4, f'Median ({median})', ha='center', color='red')

plt.legend()
plt.show()

WEEK 10
UNIVARIATE, BIVARIATE, MULTIVARIATE DESCRIPTIVE STATISTICAL MEASURES
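A minimal sketch of the three levels, assuming the car dataframe df from Week 7 is still available:

# Univariate: summary statistics of a single variable
print(df['price'].mean(), df['price'].median(), df['price'].std())

# Bivariate: covariance and correlation between two variables
print(df['price'].cov(df['horsepower']))
print(df['price'].corr(df['horsepower']))

# Multivariate: covariance matrix across several variables
print(df[['price', 'horsepower', 'engine-size']].cov())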
WEEK 11
NORMAL DISTRIBUTION

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

n = np.arange(0,30)
print(n)

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]

rate = 12
poisson = stats.poisson.pmf(n,rate)

print(poisson)

print(poisson[5])

[6.14421235e-06 7.37305482e-05 4.42383289e-04 1.76953316e-03
5.30859947e-03 1.27406387e-02 2.54812775e-02 4.36821900e-02
6.55232849e-02 8.73643799e-02 1.04837256e-01 1.14367916e-01
1.14367916e-01 1.05570384e-01 9.04889002e-02 7.23911201e-02
5.42933401e-02 3.83247107e-02 2.55498071e-02 1.61367203e-02
9.68203217e-03 5.53258981e-03 3.01777626e-03 1.57449196e-03
7.87245981e-04 3.77878071e-04 1.74405263e-04 7.75134504e-05
3.32200502e-05 1.37462277e-05]
0.012740638735861376

print(poisson[0] + poisson[1] + poisson[2] + poisson[3] + poisson[4])

0.007600390681067
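The same cumulative probability P(X <= 4) is available directly from the CDF:

# Equivalent to summing the first five PMF values above
print(stats.poisson.cdf(4, rate))   # 0.007600390681067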

plt.plot(n,poisson, 'o-')
plt.show()

# Parameters
n = 10 # Number of trials
p = 0.5 # Probability of success

# Creating a binomial distribution object
binom_dist = stats.binom(n, p)

# Probability mass function (PMF) for a given number of successes (k)
k = 3
pmf = binom_dist.pmf(k)
print(f"PMF at k={k}: {pmf}")

# Cumulative distribution function (CDF) for a given number of successes (k)
cdf = binom_dist.cdf(k)
print(f"CDF at k={k}: {cdf}")

# Generating random samples
samples = binom_dist.rvs(size=1000)
print(f"Random samples: {samples[:10]}")

PMF at k=3: 0.1171875
CDF at k=3: 0.171875
Random samples: [5 2 4 6 5 5 5 5 5 3]
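As a quick check, the PMF value agrees with the closed form C(10, 3) * p**3 * (1 - p)**7:

# Sanity check: binomial PMF at k=3 by hand
from math import comb
print(comb(10, 3) * 0.5**3 * 0.5**7)   # 0.1171875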

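The week's title topic, the normal distribution itself, does not appear in the cells above. A minimal sketch plotting the standard normal PDF with the imports already in place:

# Sketch: standard normal PDF (mean 0, standard deviation 1)
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.norm.pdf(x, 0, 1))
plt.title('Standard Normal Distribution')
plt.xlabel('x')
plt.ylabel('Density')
plt.show()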
WEEK 12
LINEAR REGRESSION
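A minimal sketch, assuming the Week 7 dataframe df: fit a simple linear regression of price on engine size with statsmodels OLS.

import statsmodels.api as sm

X = sm.add_constant(df['engine-size'])   # add an intercept column
model = sm.OLS(df['price'], X).fit()
print(model.params)      # intercept and slope
print(model.rsquared)    # proportion of variance explained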
WEEK 13
T-TEST
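A minimal sketch, again assuming the Week 7 dataframe df: Welch's two-sample t-test comparing mean price between gas and diesel cars.

from scipy.stats import ttest_ind

gas = df.loc[df['fuel-type'] == 'gas', 'price']
diesel = df.loc[df['fuel-type'] == 'diesel', 'price']
t_stat, p_val = ttest_ind(gas, diesel, equal_var=False)   # Welch's t-test
print("t =", t_stat, ", p =", p_val)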
