Data Preprocessing

The document reads in student data from an Excel file, analyzes it using pandas and numpy, splits the data into training and test sets using scikit-learn, encodes categorical variables using LabelEncoder and OneHotEncoder, and scales numeric variables using MinMaxScaler and StandardScaler. It loads multiple datasets, cleans missing values, calculates statistics, and preprocesses the data for machine learning.

Uploaded by

vishalsharma24yt

import pandas as pd
import numpy as np

# Load the student records (roll number, attendance, CPI, placement status)
df = pd.read_excel(r"/content/Untitled spreadsheet.xlsx")
df

     roll no.  attendance percentage  CPI  PLACED
0  2042000101                     77  6.7      NO
1  2042000102                     61  7.4      NO
2  2042000103                     95  7.0      NO
3  2042000104                     85  7.6     YES
4  2042000105                     96  8.3     YES
5  2042000106                     70  8.4     YES
6  2042000107                     68  9.2     YES
7  2042000108                     95  6.2     YES
8  2042000109                     43  5.9      NO
9  2042000110                     75  7.8     YES

print("Independent data")
print(df.iloc[:,:-1])
print("dependent data")
print(df.iloc[:,-1])

Independent data
roll no. attendance percentage CPI
0 2042000101 77 6.7
1 2042000102 61 7.4
2 2042000103 95 7.0
3 2042000104 85 7.6
4 2042000105 96 8.3
5 2042000106 70 8.4
6 2042000107 68 9.2
7 2042000108 95 6.2
8 2042000109 43 5.9
9 2042000110 75 7.8
dependent data
0 NO
1 NO
2 NO
3 YES
4 YES
5 YES
6 YES
7 YES
8 NO
9 YES
Name: PLACED, dtype: object
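The positional slices above assume the target is always the last column. As a minimal sketch (using a four-row excerpt of the table typed in by hand, since the Excel file is not available here), the same split can be written by column name, which keeps working if the column order changes:

```python
import pandas as pd

# Excerpt of the student table above, entered by hand for illustration
df = pd.DataFrame({
    "roll no.": [2042000101, 2042000102, 2042000103, 2042000104],
    "attendance percentage": [77, 61, 95, 85],
    "CPI": [6.7, 7.4, 7.0, 7.6],
    "PLACED": ["NO", "NO", "NO", "YES"],
})

# Independent variables: every column except the target
X = df.drop(columns="PLACED")
# Dependent variable: the target column selected by name
y = df["PLACED"]

print(X)
print(y)
```

For this table both approaches select the same columns; the name-based form only makes the intent explicit.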

print("Mean of CPI:", np.mean(df['CPI']))
print("Median of CPI:", np.median(df['CPI']))
print("Mean of attendance percentage:", np.mean(df['attendance percentage']))
print("Median of attendance percentage:", np.median(df['attendance percentage']))

Mean of CPI: 7.45
Median of CPI: 7.5
Mean of attendance percentage: 76.5
Median of attendance percentage: 76.0
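The NumPy functions used above work here because the first dataset has no missing values. The sketch below (CPI values copied from the table; the variable names are illustrative) suggests why this matters for the next dataset: `np.median` propagates NaN, whereas the pandas `Series.median` method skips missing entries by default.

```python
import numpy as np
import pandas as pd

# CPI values copied from the first table
cpi = pd.Series([6.7, 7.4, 7.0, 7.6, 8.3, 8.4, 9.2, 6.2, 5.9, 7.8], name="CPI")

# On complete data, the pandas methods agree with np.mean / np.median
cpi_mean = cpi.mean()
cpi_median = cpi.median()
print(round(cpi_mean, 2), cpi_median)

# Blank out one entry to imitate the dataset with missing values
cpi_missing = cpi.copy()
cpi_missing.iloc[8] = np.nan

# NumPy returns NaN when any element is NaN; pandas ignores the gap
numpy_median = np.median(cpi_missing.to_numpy())
pandas_median = cpi_missing.median()
print(numpy_median, pandas_median)
```

`np.nanmean` and `np.nanmedian` are the NumPy counterparts that ignore NaN.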

# Second copy of the student data, this time with missing entries
df1 = pd.read_excel(r"/content/Untitled spreadsheet (1).xlsx")
df1

     roll no.  attendance percentage  CPI  PLACED
0  2042000101                   77.0  6.7      NO
1  2042000102                   61.0  7.4      NO
2  2042000103                   95.0  7.0      NO
3  2042000104                    NaN  7.6     YES
4  2042000105                   96.0  8.3     YES
5  2042000106                   70.0  8.4     YES
6  2042000107                   68.0  9.2     NaN
7  2042000108                   95.0  6.2     YES
8  2042000109                   43.0  NaN      NO
9  2042000110                   75.0  7.8     YES

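# Fill the missing attendance with the mean taken from the complete dataset (df).
# Assigning the result back avoids relying on inplace=True on a column selection.
mean_value = np.mean(df['attendance percentage'])
df1['attendance percentage'] = df1['attendance percentage'].fillna(mean_value)
df1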

     roll no.  attendance percentage  CPI  PLACED
0  2042000101                   77.0  6.7      NO
1  2042000102                   61.0  7.4      NO
2  2042000103                   95.0  7.0      NO
3  2042000104                   76.5  7.6     YES
4  2042000105                   96.0  8.3     YES
5  2042000106                   70.0  8.4     YES
6  2042000107                   68.0  9.2     NaN
7  2042000108                   95.0  6.2     YES
8  2042000109                   43.0  NaN      NO
9  2042000110                   75.0  7.8     YES

# Fill the missing CPI with the median taken from the complete dataset (df)
median_value = np.median(df['CPI'])
df1['CPI'] = df1['CPI'].fillna(median_value)
df1

     roll no.  attendance percentage  CPI  PLACED
0  2042000101                   77.0  6.7      NO
1  2042000102                   61.0  7.4      NO
2  2042000103                   95.0  7.0      NO
3  2042000104                   76.5  7.6     YES
4  2042000105                   96.0  8.3     YES
5  2042000106                   70.0  8.4     YES
6  2042000107                   68.0  9.2     NaN
7  2042000108                   95.0  6.2     YES
8  2042000109                   43.0  7.5      NO
9  2042000110                   75.0  7.8     YES

# Fill the missing placement status with the most frequent value in df
mode_value = df['PLACED'].mode()[0]
df1['PLACED'] = df1['PLACED'].fillna(mode_value)
df1

     roll no.  attendance percentage  CPI  PLACED
0  2042000101                   77.0  6.7      NO
1  2042000102                   61.0  7.4      NO
2  2042000103                   95.0  7.0      NO
3  2042000104                   76.5  7.6     YES
4  2042000105                   96.0  8.3     YES
5  2042000106                   70.0  8.4     YES
6  2042000107                   68.0  9.2     YES
7  2042000108                   95.0  6.2     YES
8  2042000109                   43.0  7.5      NO
9  2042000110                   75.0  7.8     YES

df['PLACED'].mode()[0]

'YES'
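The three fills above take their values from the complete dataset `df`. When no complete copy exists, the fill values have to come from the incomplete data itself. The following is a rough sketch of that variant using scikit-learn's `SimpleImputer`; the DataFrame is retyped from the table with missing values, and the `strategies` mapping and `filled` name are illustrative choices rather than part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rows copied from df1 above (three missing entries); roll numbers omitted for brevity
df1 = pd.DataFrame({
    "attendance percentage": [77, 61, 95, np.nan, 96, 70, 68, 95, 43, 75],
    "CPI": [6.7, 7.4, 7.0, 7.6, 8.3, 8.4, 9.2, 6.2, np.nan, 7.8],
    "PLACED": ["NO", "NO", "NO", "YES", "YES", "YES", np.nan, "YES", "NO", "YES"],
})

# One strategy per column, mirroring the mean / median / mode choice above
strategies = {
    "attendance percentage": "mean",
    "CPI": "median",
    "PLACED": "most_frequent",
}

filled = df1.copy()
for column, strategy in strategies.items():
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform expects 2-D input, hence the double brackets
    filled[column] = imputer.fit_transform(df1[[column]]).ravel()

print(filled)
```

Because the statistics are computed from the nine observed values in each column, the filled attendance is about 75.56 and the filled CPI is 7.6, slightly different from the 76.5 and 7.5 obtained from `df`.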

from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

     roll no.  attendance percentage  CPI
9  2042000110                     75  7.8
1  2042000102                     61  7.4
6  2042000107                     68  9.2
7  2042000108                     95  6.2
3  2042000104                     85  7.6
0  2042000101                     77  6.7
5  2042000106                     70  8.4
     roll no.  attendance percentage  CPI
2  2042000103                     95  7.0
8  2042000109                     43  5.9
4  2042000105                     96  8.3
9    YES
1     NO
6    YES
7    YES
3    YES
0     NO
5    YES
Name: PLACED, dtype: object
2     NO
8     NO
4    YES
Name: PLACED, dtype: object
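With only ten rows, a plain random split can leave the class proportions in the two parts quite different from the full data. One possible refinement, sketched below on a hand-typed copy of the table (roll numbers left out for brevity), is to pass `stratify=y` so that `train_test_split` keeps the YES/NO ratio roughly equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Student data retyped from the first table
df = pd.DataFrame({
    "attendance percentage": [77, 61, 95, 85, 96, 70, 68, 95, 43, 75],
    "CPI": [6.7, 7.4, 7.0, 7.6, 8.3, 8.4, 9.2, 6.2, 5.9, 7.8],
    "PLACED": ["NO", "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", "YES"],
})
X = df.drop(columns="PLACED")
y = df["PLACED"]

# stratify=y asks for approximately the same class ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print(y_train.value_counts())
print(y_test.value_counts())
```

The split sizes stay at seven training rows and three test rows; only the way rows are assigned changes.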

df2=pd.read_excel(r"/content/Untitled spreadsheet (2).xlsx")


df2

     Name  Favourite color  Favourite game
0    Ajay            Green         cricket
1   Vijay            Green          hockey
2   Rohit             Blue         cricket
3  Mayank             Blue         cricket
4   Manoj              Red       badminton

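from sklearn.preprocessing import LabelEncoder

# The same encoder object is refit for each column, so after the second call
# it only remembers the mapping for 'Favourite game'
le = LabelEncoder()
df2['Favourite color'] = le.fit_transform(df2['Favourite color'])
df2['Favourite game'] = le.fit_transform(df2['Favourite game'])
df2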

     Name  Favourite color  Favourite game
0    Ajay                1               1
1   Vijay                1               2
2   Rohit                0               1
3  Mayank                0               1
4   Manoj                2               0
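The integer codes above follow the alphabetical order of the labels in each column (Blue=0, Green=1, Red=2). If the original labels need to be recovered later, one option is to keep a separate fitted encoder per column. The sketch below assumes the table is retyped by hand; the `encoders` dictionary is an illustrative helper, not something from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Table retyped from the output above
df2 = pd.DataFrame({
    "Name": ["Ajay", "Vijay", "Rohit", "Mayank", "Manoj"],
    "Favourite color": ["Green", "Green", "Blue", "Blue", "Red"],
    "Favourite game": ["cricket", "hockey", "cricket", "cricket", "badminton"],
})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
encoded = df2.copy()
for column in ["Favourite color", "Favourite game"]:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df2[column])

# classes_ lists the labels in code order: position 0 is code 0, and so on
color_classes = list(encoders["Favourite color"].classes_)
print(color_classes)

# inverse_transform maps the integer codes back to the original labels
decoded_colors = list(
    encoders["Favourite color"].inverse_transform(encoded["Favourite color"]))
print(decoded_colors)
```

Note that integer codes imply an ordering (Red > Green > Blue) that the categories do not actually have, which is one motivation for the one-hot encoding that follows.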

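import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Reload the raw categorical data; the 5 x 11 output below corresponds to this file
df2 = pd.read_excel(r"/content/Untitled spreadsheet (2).xlsx")
ohe = OneHotEncoder()
# Every column, including Name, is expanded into one indicator column per distinct value
x = ohe.fit_transform(df2).toarray()
x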

array([[1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0.]])

import pandas as pd
df4=pd.read_excel(r"/content/program6.xlsx")
from sklearn.preprocessing import MinMaxScaler
s=MinMaxScaler()
print(s.fit_transform(df4))

[[0.         0.33333333 0.39583333]
 [0.1111111  0.2        0.5       ]
 [0.22222221 0.93333333 0.6875    ]
 [0.33333331 0.53333333 0.47916667]
 [0.44444445 0.4        0.95833333]
 [0.55555555 0.66666667 0.4375    ]
 [0.66666666 0.         0.375     ]
 [0.77777776 0.93333333 1.        ]
 [0.8888889  0.46666667 0.        ]
 [1.         1.         0.5       ]]
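Min-max scaling maps each column to the range [0, 1] using x' = (x - min) / (max - min), computed per column. Since the contents of `program6.xlsx` are not shown, the sketch below uses a small made-up array (the values are purely illustrative) to check the formula by hand against `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two numeric columns on very different scales
data = np.array([
    [1.0, 50.0],
    [2.0, 80.0],
    [3.0, 65.0],
    [5.0, 95.0],
])

# Column-wise formula: subtract the minimum, divide by the range
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

# The scaler with default settings applies the same transformation
scaled = MinMaxScaler().fit_transform(data)

print(manual)
print(np.allclose(manual, scaled))
```

In every column the smallest value becomes 0 and the largest becomes 1, which matches the pattern in the output above.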

from sklearn.preprocessing import StandardScaler


s=StandardScaler()
print(s.fit_transform(df4))

[[-1.5666989  -0.67075489 -0.49715486]
 [-1.21854359 -1.0899767  -0.12052239]
 [-0.87038828  1.21574324  0.55741606]
 [-0.52223297 -0.04192218 -0.19584889]
 [-0.17407766 -0.46114399  1.53666049]
 [ 0.17407766  0.37729963 -0.34650188]
 [ 0.52223297 -1.71880941 -0.57248136]
 [ 0.87038828  1.21574324  1.68731348]
 [ 1.21854359 -0.25153308 -1.92835826]
 [ 1.5666989   1.42535415 -0.12052239]]
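Standardisation rescales each column to zero mean and unit variance using z = (x - mean) / std, where `StandardScaler` uses the population standard deviation (the NumPy default, `ddof=0`). When the data is split into training and test sets, a common practice is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below illustrates both points on made-up numbers; the array and the 3/1 split are assumptions for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the first three rows act as training data, the last as test data
data = np.array([
    [1.0, 50.0],
    [2.0, 80.0],
    [3.0, 65.0],
    [5.0, 95.0],
])
train, test = data[:3], data[3:]

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
# The test rows are transformed with the training statistics, not their own
test_scaled = scaler.transform(test)

# Manual check of the formula; np.std uses ddof=0 by default
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
manual_train = (train - train_mean) / train_std
manual_test = (test - train_mean) / train_std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Calling `fit_transform` on the whole dataset, as in the cell above, is fine for exploration, but it lets test-set statistics influence the scaling.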
