Dataset Extraction and Pre-processing
Here, we use an existing dataset obtained from kaggle.com. We import the downloaded dataset (.csv file) into Jupyter Notebook and clean / pre-process the data using the pandas module. For this assignment I have used the “Heart Attack Analysis & Prediction Dataset” (which also includes blood oxygen saturation readings) from https://www.kaggle.com/.
It is available to download for free from the following link:
The downloaded .csv file can be viewed using Microsoft Excel.
Now we need to open Jupyter Notebook and create a Python 3 (.ipynb) project file.
We can then import the downloaded .csv dataset into the notebook; for that, we need
to import some libraries.
To load and view the dataset in Jupyter Notebook, we need to import the “pandas”
library. Then we can store the dataset in a DataFrame.
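A minimal sketch of this step is shown below; the file name heart.csv is an assumption, so substitute the actual name of the downloaded file.

# Import the pandas library for loading and cleaning the data
import pandas as pd

# Load the downloaded CSV file into a DataFrame
# ("heart.csv" is an assumed file name; use the name of your download)
df = pd.read_csv("heart.csv")

# Display the first few rows to confirm the data loaded correctly
df.head()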
As we can see, the dataset contains 4,545 data values (4,545 cells in an Excel sheet).
Now, to see the number of rows and columns in the dataset (its shape), we simply
have to run the following code.
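For example, assuming the DataFrame variable df created above:

# Print the number of rows and columns as a (rows, columns) tuple
df.shape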
Now we need to find whether there are any null values present in the
dataset. To do that, we need to execute the following piece of code.
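One way to do this, continuing with the same df variable:

# Count the missing (null) values in each column
df.isnull().sum()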
To make the dataset cleaner, we have to remove those empty / null values from
the dataset. To do that, we first need to identify all the null values: the
columns reported as True are the columns that contain null values. Then we drop
the records that contain null values.
The dropna() function is used to remove missing values; it can drop either the rows
or the columns that contain missing values.
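A sketch of this cleaning step, again assuming the DataFrame variable df from before:

# Columns that report True contain at least one null value
df.isnull().any()

# Drop every row (record) that contains at least one null value
df = df.dropna()

# Confirm the shape of the cleaned dataset
df.shape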
This makes the dataset cleaner and the processed dataset now has 299 rows and 15
columns.