Group A
Assignment No: 1
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
Import all the required Python Libraries.
1. Locate open source data from the web (e.g. https://www.kaggle.com).
2. Provide a clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into the pandas data frame.
4. Data Preprocessing: Check for missing values in the data using the pandas isnull() and describe() functions to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data wrangling operations using Python on any open source dataset.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors) and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not
used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. An example of some raw data stored as CSV (comma separated values) is shown below.
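For illustration, a small CSV sample in this row-and-column layout, using the column names of the Iris dataset described later in this assignment (the two rows shown are the first two setosa samples):
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa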
Pandas dtype | Python type | NumPy type | Usage
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers
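A minimal sketch of checking and converting column data types with pandas (the small dataframe and the string-to-integer conversion are illustrative):
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3'], 'SepalLengthCm': [5.1, 4.9, 4.7]})
print(df.dtypes)                     # Id is stored as object (string), SepalLengthCm as float64
df['Id'] = df['Id'].astype('int64')  # apply a proper type conversion
print(df.dtypes)                     # Id is now int64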
b. NumPy
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic slicing and advanced indexing in NumPy
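A minimal sketch of the basic and advanced array operations listed above (the array values are arbitrary):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # create a 2x3 array
b = a + 10                            # element-wise add (broadcasting a scalar)
c = a * 2                             # element-wise multiply
row = a[0, :]                         # basic slicing: the first row
flat = a.flatten()                    # flatten to a 1-D array
r = a.reshape(3, 2)                   # reshape to 3x2
stacked = np.vstack([a, a])           # stack two arrays vertically
parts = np.split(flat, 3)             # split into 3 equal sections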
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create stories with the data visualized with Matplotlib. Another library from the SciPy Stack, Matplotlib plots 2D figures.
From histograms and bar plots to scatter plots, area plots, and pie plots, Matplotlib can depict a wide range of visualizations. With a bit of effort, you can create just about any visualization with Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and other formatting entities.
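For instance, a minimal sketch of a line plot with labels, a grid, and a legend (the data points are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y, label='y = x^2')  # line plot
plt.xlabel('x')                  # axis labels
plt.ylabel('y')
plt.grid(True)                   # grid
plt.legend()                     # legend
plt.show()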
d. Seaborn
The official documentation defines Seaborn as a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.
e. Scikit-learn
Introduced to the world as a Google Summer of Code project, Scikit-learn is a robust machine learning library for Python. It features ML algorithms like SVMs, random forests, k-means clustering, spectral clustering, mean shift, cross-validation, and more. Even NumPy, SciPy, and related scientific operations are supported by Scikit-learn, which is part of the SciPy Stack.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Samples: 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
The three species, Iris-setosa, Iris-versicolor, and Iris-virginica, each contain 50 samples.
3. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Read in the dataset from the UCI Machine Learning Repository link and specify the column names to use:
iris = pd.read_csv(csv_url, names = col_names)
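Putting these steps together, a minimal loading sketch; csv_url is assumed here to be the UCI repository's standard location for the iris data file:
import pandas as pd

col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
# assumed UCI location of the iris data file
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(csv_url, names=col_names)
print(iris.head())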
● dataset.tail(n=5): Return the last n rows.
● dataset.iloc[:m, :n]: Return a subset of the first m rows and the first n columns.
● dataset[cols_2_4]: Select the columns named in the list cols_2_4.
Function: DataFrame.isnull()
Output:
c. Count of missing values of the entire dataframe using isna() and isnull()
In order to get the count of missing values of the entire dataframe, the isnull() function is used together with sum(): the first sum() does the column-wise sum, and calling sum() again gives the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. Count of missing values row-wise using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
Method 2:
Function: dataframe.isna().sum()
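A minimal sketch of these missing-value counts on a small hypothetical dataframe (values chosen so that some entries are NaN):
import pandas as pd
import numpy as np

dataframe = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})
print(dataframe.isnull())              # boolean mask of missing values
print(dataframe.isnull().sum())        # column-wise count: A -> 1, B -> 2
print(dataframe.isnull().sum().sum())  # total count for the whole dataframe: 3
print(dataframe.isnull().sum(axis=1))  # row-wise count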
Data must be formatted and normalized before it can be analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the correct format to use special date-time functions.
b. Data normalization: Data normalization involves mapping all the numeric data values onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with statistical analysis and ensures better comparisons later on. It is also known as Min-Max scaling.
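As a formula, Min-Max scaling maps each value x to x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the variable, so every scaled value lies between 0 and 1. For example, a sepal length of 5.8 cm, with a minimum of 4.3 and a maximum of 7.9 in the iris data, is scaled to (5.8 - 4.3) / (7.9 - 4.3) ≈ 0.42.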
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris
Step 2: Load the iris dataset in dataframe object df
iris = load_iris()
df = pd.DataFrame(iris.data,
columns=iris.feature_names)
Step 3: Print iris dataset.
df.head()
Step 4: Create x, where x holds one feature column's values as floats (the iris dataframe has no 'score' column, so the sepal length feature is used here)
x = df[['sepal length (cm)']].values.astype(float)
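The scaling step itself is not shown above; a minimal sketch completing the normalization with sklearn's MinMaxScaler (column names follow load_iris naming, and the new column name is illustrative):
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
x = df[['sepal length (cm)']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()   # scales values to the range [0, 1]
x_scaled = min_max_scaler.fit_transform(x)
df['sepal_length_normalized'] = x_scaled        # add the normalized column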
Example: Suppose we have a column Height in some dataset with the values tall, medium, and short. After applying label encoding, the Height column is converted into numeric labels, where 0 is the label for tall, 1 is the label for medium, and 2 is the label for short.
Label Encoding on the iris dataset: For the iris dataset, the target column is Species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder: Encodes labels with values between 0 and n_classes-1.
● fit_transform(y):
Parameters: y : array-like of shape (n_samples,)
Target values.
Returns: y : array-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Define a label_encoder object that knows how to convert word labels into numbers.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in the column 'Species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
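As a usage note, the fitted encoder can also map the numeric labels back to the species names; a small sketch using the label_encoder from the steps above:
original = label_encoder.inverse_transform([0, 1, 2])
# array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)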
● Use LabelEncoder when there are only two possible values of a categorical feature, for example features having values such as yes or no, or a gender feature with only two possible values such as male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may lead to priority issues in the data set: a label with a higher value may be considered to have higher priority than a label with a lower value.
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
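A minimal sketch of one-hot encoding this Color example with sklearn (note that the encoder sorts categories alphabetically, so the resulting vectors differ from the Red/Green/Blue ordering shown above):
import numpy as np
from sklearn import preprocessing

colors = np.array([['Red'], ['Green'], ['Blue']])
encoder = preprocessing.OneHotEncoder(sparse_output=False)  # dense output; use sparse=False on older sklearn
encoded = encoder.fit_transform(colors)
# categories are sorted alphabetically: Blue, Green, Red
# so 'Red' -> [0. 0. 1.], 'Green' -> [0. 1. 0.], 'Blue' -> [1. 0. 0.]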
One-hot encoding on the iris dataset: For the iris dataset, the target column is Species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encodes categorical integer features using a one-hot (aka one-of-K) scheme.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object to encode the Species column, then observe the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Remove the target variable from the dataset
features_df=df.drop(columns=['Species'])
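The step that applies the one-hot encoder itself is not shown above; a minimal sketch completing it (the enc_df column names are illustrative, and the output columns follow the sorted encoded labels 0, 1, 2):
import pandas as pd
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder(sparse_output=False)  # use sparse=False on older sklearn
encoded = enc.fit_transform(df[['Species']])            # one indicator column per species label
enc_df = pd.DataFrame(encoded, columns=['Setosa', 'Versicolor', 'Virginica'])
df_encoded = features_df.join(enc_df)                   # merge the features with the encoded target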
c. Dummy Encoding:
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of dummy variables equal to the number of categories (k) in the variable, dummy encoding uses k-1 dummy variables. To encode the same Color variable with three categories using dummy encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
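A minimal sketch of dummy encoding the same Color example with pandas (get_dummies sorts the categories alphabetically and drop_first=True drops the first, here 'Blue', so the vectors differ from the ordering shown above):
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
dummies = pd.get_dummies(colors, columns=['Color'], drop_first=True, dtype=int)
# remaining columns: Color_Green, Color_Red
# 'Blue' is encoded as [0, 0], 'Green' as [1, 0], 'Red' as [0, 1]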
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None): Convert categorical variable into dummy/indicator variables.
● Parameters:
data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append to DataFrame column names.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs; if False, NaNs are ignored.
Algorithm:
Step 1: Import pandas and the sklearn library for preprocessing
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object to encode the Species column, then observe the unique values for the Species column.
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Apply one-hot encoding with dummy variables for the Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=False)
Step 6: Observe the merged dataframe.
one_hot_df
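Assuming the Species column was label encoded in Step 4, the resulting dataframe replaces Species with three indicator columns named Species_0, Species_1, and Species_2 (one per encoded label), while all other columns are kept unchanged.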
Conclusion: In this way, we have explored the functions of the Python libraries for data preprocessing and data wrangling techniques, and how to handle missing values, on the Iris dataset.
Assignment Question
1. Explain Data Frame with a suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?