Group A
Assignment No: 1
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
Import all the required Python Libraries.
1. Locate open source data from the web (e.g. https://www.kaggle.com).
2. Provide a clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into the pandas data frame.
4. Data Preprocessing: Check for missing values in the data using the pandas isnull() and describe() functions to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data wrangling operations using Python on any open source dataset.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors) and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not
used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. An example of some raw data stored as CSV (comma separated values) is shown below.
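For illustration, a small CSV sample in this row-and-column layout, using the column names of the Iris dataset described later in this assignment (the two rows shown are the first two setosa samples):
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa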
Pandas dtype | Python type | NumPy type | Usage
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers
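A minimal sketch of checking and converting column data types with pandas (the small dataframe and the string-to-integer conversion are illustrative):
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3'], 'SepalLengthCm': [5.1, 4.9, 4.7]})
print(df.dtypes)                     # Id is stored as object (string), SepalLengthCm as float64
df['Id'] = df['Id'].astype('int64')  # apply a proper type conversion
print(df.dtypes)                     # Id is now int64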
b. NumPy
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic slicing and advanced indexing in NumPy
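A minimal sketch of the basic and advanced array operations listed above (the array values are arbitrary):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # create a 2x3 array
b = a + 10                            # element-wise add (broadcasting a scalar)
c = a * 2                             # element-wise multiply
row = a[0, :]                         # basic slicing: the first row
flat = a.flatten()                    # flatten to a 1-D array
r = a.reshape(3, 2)                   # reshape to 3x2
stacked = np.vstack([a, a])           # stack two arrays vertically
parts = np.split(flat, 3)             # split into 3 equal sections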
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create stories with the data visualized with Matplotlib. Another library from the SciPy Stack, Matplotlib plots 2D figures.
From histograms and bar plots to scatter plots, area plots, and pie plots, Matplotlib can depict a wide range of visualizations. With a bit of effort, you can create just about any visualization with Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and other formatting entities.
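For instance, a minimal sketch of a line plot with labels, a grid, and a legend (the data points are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y, label='y = x^2')  # line plot
plt.xlabel('x')                  # axis labels
plt.ylabel('y')
plt.grid(True)                   # grid
plt.legend()                     # legend
plt.show()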
d. Seaborn
The official documentation defines Seaborn as a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.
e. Scikit-learn
Introduced to the world as a Google Summer of Code project, Scikit-learn is a robust machine learning library for Python. It features ML algorithms like SVMs, random forests, k-means clustering, spectral clustering, mean shift, cross-validation, and more. Even NumPy, SciPy, and related scientific operations are supported by Scikit-learn, which is part of the SciPy Stack.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Samples: 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
The three species, Iris-setosa, Iris-versicolor, and Iris-virginica, each contain 50 samples.
3. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Read in the dataset from the UCI Machine Learning Repository link and specify the column names to use:
iris = pd.read_csv(csv_url, names = col_names)
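Putting these steps together, a minimal loading sketch; csv_url is assumed here to be the UCI repository's standard location for the iris data file:
import pandas as pd

col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
# assumed UCI location of the iris data file
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(csv_url, names=col_names)
print(iris.head())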
● dataset.tail(n=5): Return the last n rows.
● dataset.iloc[:m, :n]: Return a subset of the first m rows and the first n columns.
● dataset[cols_2_4]: Select the columns named in the list cols_2_4.
Function: DataFrame.isnull()
Output:
c. Count of missing values of the entire dataframe using isna() and isnull()
In order to get the count of missing values of the entire dataframe, the isnull() function is used together with sum(): the first sum() does the column-wise sum, and calling sum() again gives the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. Count of missing values row-wise using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
Method 2:
Function: dataframe.isna().sum()
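A minimal sketch of these missing-value counts on a small hypothetical dataframe (values chosen so that some entries are NaN):
import pandas as pd
import numpy as np

dataframe = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})
print(dataframe.isnull())              # boolean mask of missing values
print(dataframe.isnull().sum())        # column-wise count: A -> 1, B -> 2
print(dataframe.isnull().sum().sum())  # total count for the whole dataframe: 3
print(dataframe.isnull().sum(axis=1))  # row-wise count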
Data must be formatted and normalized before it can be analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the correct format to use special date-time functions.
b. Data normalization: Data normalization involves mapping all the numeric data values onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with statistical analysis and ensures better comparisons later on. It is also known as Min-Max scaling.
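As a formula, Min-Max scaling maps each value x to x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the variable, so every scaled value lies between 0 and 1. For example, a sepal length of 5.8 cm, with a minimum of 4.3 and a maximum of 7.9 in the iris data, is scaled to (5.8 - 4.3) / (7.9 - 4.3) ≈ 0.42.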
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris
Step 2: Load the iris dataset in dataframe object df
iris = load_iris()
df = pd.DataFrame(iris.data,
columns=iris.feature_names)
Step 3: Print iris dataset.
df.head()
Step 4: Create x, where x holds one feature column's values as floats (the iris dataframe has no 'score' column, so the sepal length feature is used here)
x = df[['sepal length (cm)']].values.astype(float)
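The scaling step itself is not shown above; a minimal sketch completing the normalization with sklearn's MinMaxScaler (column names follow load_iris naming, and the new column name is illustrative):
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
x = df[['sepal length (cm)']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()   # scales values to the range [0, 1]
x_scaled = min_max_scaler.fit_transform(x)
df['sepal_length_normalized'] = x_scaled        # add the normalized column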
Example: Suppose we have a column Height in some dataset with the values tall, medium, and short. After applying label encoding, the Height column is converted into numeric labels, where 0 is the label for tall, 1 is the label for medium, and 2 is the label for short.
Label Encoding on the iris dataset: For the iris dataset, the target column is Species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder: Encodes labels with values between 0 and n_classes-1.
● fit_transform(y):
Parameters: y : array-like of shape (n_samples,)
Target values.
Returns: y : array-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Define a label_encoder object that knows how to convert word labels into numbers.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in the column 'Species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
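As a usage note, the fitted encoder can also map the numeric labels back to the species names; a small sketch using the label_encoder from the steps above:
original = label_encoder.inverse_transform([0, 1, 2])
# array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)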
● Use LabelEncoder when there are only two possible values of a categorical feature, for example features having values such as yes or no, or a gender feature with only two possible values such as male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may lead to priority issues in the data set: a label with a higher value may be considered to have higher priority than a label with a lower value.
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
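A minimal sketch of one-hot encoding this Color example with sklearn (note that the encoder sorts categories alphabetically, so the resulting vectors differ from the Red/Green/Blue ordering shown above):
import numpy as np
from sklearn import preprocessing

colors = np.array([['Red'], ['Green'], ['Blue']])
encoder = preprocessing.OneHotEncoder(sparse_output=False)  # dense output; use sparse=False on older sklearn
encoded = encoder.fit_transform(colors)
# categories are sorted alphabetically: Blue, Green, Red
# so 'Red' -> [0. 0. 1.], 'Green' -> [0. 1. 0.], 'Blue' -> [1. 0. 0.]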
One-hot encoding on the iris dataset: For the iris dataset, the target column is Species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encodes categorical integer features using a one-hot (aka one-of-K) scheme.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object to encode the Species column, then observe the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Remove the target variable from the dataset
features_df=df.drop(columns=['Species'])
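The step that applies the one-hot encoder itself is not shown above; a minimal sketch completing it (the enc_df column names are illustrative, and the output columns follow the sorted encoded labels 0, 1, 2):
import pandas as pd
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder(sparse_output=False)  # use sparse=False on older sklearn
encoded = enc.fit_transform(df[['Species']])            # one indicator column per species label
enc_df = pd.DataFrame(encoded, columns=['Setosa', 'Versicolor', 'Virginica'])
df_encoded = features_df.join(enc_df)                   # merge the features with the encoded target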
c. Dummy Encoding:
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of dummy variables equal to the number of categories (k) in the variable, dummy encoding uses k-1 dummy variables. To encode the same Color variable with three categories using dummy encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
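A minimal sketch of dummy encoding the same Color example with pandas (get_dummies sorts the categories alphabetically and drop_first=True drops the first, here 'Blue', so the vectors differ from the ordering shown above):
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
dummies = pd.get_dummies(colors, columns=['Color'], drop_first=True, dtype=int)
# remaining columns: Color_Green, Color_Red
# 'Blue' is encoded as [0, 0], 'Green' as [1, 0], 'Red' as [0, 1]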
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None): Convert categorical variable into dummy/indicator variables.
● Parameters:
data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append to DataFrame column names.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs; if False, NaNs are ignored.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object to encode the Species column, then observe the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Apply one-hot encoding with dummy variables for the Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=False)
Step 6: Observe the merged dataframe.
one_hot_df
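Assuming the Species column was label encoded in Step 4, the resulting dataframe replaces Species with three indicator columns named Species_0, Species_1, and Species_2 (one per encoded label), while all other columns are kept unchanged.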
Conclusion: In this way, we have explored the functions of the Python libraries for data preprocessing and data wrangling techniques, and how to handle missing values, on the Iris dataset.
Assignment Question
1. Explain Data Frame with a suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?