Pandas: Import

Pandas is a Python library used for working with and analyzing datasets. It allows users to clean messy data, explore and manipulate it, and make conclusions based on statistical analysis. Pandas provides functions for reading CSV files into DataFrames. DataFrames allow viewing and exploring data through methods like head(), tail(), and info(). Pandas can clean data by handling empty values, wrong formats, incorrect values, and duplicates. It also allows visualizing data through plots like bar plots, line plots, histograms, and more generated by the DataFrame plot method.

Uploaded by

hello

Pandas

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.

Pandas allows us to analyze big data and draw conclusions based on
statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Read CSV Files

A simple way to store big data sets is to use CSV (comma-separated values)
files. The read_csv() function loads a CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
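
As a rough sketch of what read_csv() produces, the snippet below parses a small CSV held in memory instead of a file on disk. The column names (Duration, Calories) and the values are made up for illustration; with a real file you would pass its path, as in the line above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Duration,Calories
60,409.1
45,282.4
30,
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # one dtype per column
```

The empty field in the last row is read as a missing value (NaN), which is the kind of cell the cleaning steps later in these notes deal with.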

Viewing the Data

The head() method returns the headers and a specified number of rows,
starting from the top.
df.head()
df.head(20)
The tail() method returns the headers and a specified number of rows,
starting from the bottom.
df.tail()
Info About the Data
df.info()
df.columns
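
The inspection calls above can be tried on a small frame. This is only a sketch: the data is hypothetical, and the names first_rows and last_rows are introduced here for illustration.

```python
import pandas as pd

# Hypothetical data: ten rows, two columns
df = pd.DataFrame({"Duration": range(10, 110, 10),
                   "Calories": range(100, 1100, 100)})

first_rows = df.head(3)   # first 3 rows; head() with no argument returns 5
last_rows = df.tail(2)    # last 2 rows

print(first_rows)
print(last_rows)
df.info()                 # prints the index range, column dtypes, and non-null counts
print(list(df.columns))   # column labels as a plain list
```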

Cleaning Data
1. Clean Data
2. Clean Empty Cells
3. Clean Wrong Format
4. Clean Wrong Data
5. Remove Duplicates

Data cleaning means fixing bad data in your data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

1. Pandas - Cleaning Empty Cells

Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

1. Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.

import pandas as pd

df = pd.read_csv('data.csv')
new_df = df.dropna()

print(new_df.to_string())

If you want to change the original DataFrame, use the inplace = True
argument:

df.dropna(inplace = True)

print(df.to_string())
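
A minimal sketch of the difference between the two calls, using hypothetical data in place of data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({"Duration": [60, 45, 30],
                   "Calories": [409.1, np.nan, 195.0]})

new_df = df.dropna()        # returns a new DataFrame; df is unchanged
print(len(df), len(new_df))

df.dropna(inplace=True)     # modifies df itself and returns None
print(len(df))
```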

Replace Empty Values

Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty
cells.

The fillna() method allows us to replace empty cells with a value:

df.fillna(130, inplace = True)

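Replace Only for Specified Columns

The example above replaces all empty cells in the whole DataFrame.

To replace empty values in only one column, select that column by name and
assign the filled result back to it:

# Assigning back is safer than fillna(inplace = True) on a selected column,
# which may not update df in recent pandas versions
df["Calories"] = df["Calories"].fillna(130)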

Replace Using Mean, Median, or Mode

A common way to replace empty cells is to calculate the mean, median, or
mode value of the column.
Pandas uses the mean(), median(), and mode() methods to calculate the
respective values for a specified column:

# Mean: the average value
x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x)

# Median: the middle value after sorting
x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x)

# Mode: the most frequent value; mode() returns a Series, so take the first entry
x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x)
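
To see how the three statistics differ, here is a small sketch on a hypothetical column with one empty cell. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one empty cell
calories = pd.Series([100.0, np.nan, 200.0, 200.0, 500.0])

# Each statistic skips NaN by default
mean_value = calories.mean()
median_value = calories.median()
mode_value = calories.mode()[0]   # mode() returns a Series of the most common values

filled_with_mean = calories.fillna(mean_value)
filled_with_median = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
```

The mean is pulled upward by the single large value, while the median and mode are not, which is one reason the choice of statistic matters when a column has outliers.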

2. Data in Wrong Format

Data of Wrong Format

Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the
columns into the same format.

Convert Into a Correct Format

df['Date'] = pd.to_datetime(df['Date'])

Removing Rows
df.dropna(subset=['Date'], inplace = True)
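
One possible way to combine the two options is sketched below. The sample data, the format string, and the use of errors="coerce" are choices made for this sketch and are not taken from the notes.

```python
import pandas as pd

# Hypothetical data: one date cannot be parsed and one is missing
df = pd.DataFrame({"Date": ["2020/12/01", "not a date", None, "2020/12/04"],
                   "Duration": [60, 45, 30, 60]})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# NaT counts as an empty value, so dropna removes those rows
df = df.dropna(subset=["Date"])
print(df)
```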

3. Wrong Data
"Wrong data" does not have to be an empty cell or a wrong format; it can
simply be wrong, like a duration of "450" instead of "60".

Replacing Values
One way to fix wrong values is to replace them with something else.

df.loc[7, 'Duration'] = 120

Loop through all values in the "Duration" column and set any value above
120 to 120:

for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

Removing Rows
Another way of handling wrong data is to remove the rows that contain
wrong data.

for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
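
The same two fixes can also be written without an explicit loop. The sketch below uses boolean masks, which is an alternative to the loops in these notes and not the method they describe; the data and the names capped and filtered are hypothetical.

```python
import pandas as pd

# Hypothetical data; 450 is an obviously wrong duration
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Option 1: cap values above 120 using a boolean mask
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Option 2: keep only the rows whose value is acceptable
filtered = df[df["Duration"] <= 120]

print(capped["Duration"].tolist())
print(filtered["Duration"].tolist())
```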

4. Removing Duplicates

# Returns True for every row that duplicates an earlier row
df.duplicated()

Removing Duplicates
df.drop_duplicates(inplace = True)
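
A short sketch of both calls on hypothetical data, where the last row repeats the first:

```python
import pandas as pd

# Hypothetical data: the last row repeats the first
df = pd.DataFrame({"Duration": [60, 45, 60],
                   "Calories": [409.1, 282.4, 409.1]})

flags = df.duplicated()          # True where a row repeats an earlier row
print(flags.tolist())

deduped = df.drop_duplicates()   # keeps the first occurrence of each row
print(len(deduped))
```
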
import matplotlib.pyplot as plt
import pandas as pd

df.plot(y='Tmax', x='Month')

To plot both maximum and minimum temperatures,

df.plot(y=['Tmax','Tmin'], x='Month')

The Pandas Plot Function

Pandas has a built-in .plot() function as part of the DataFrame class. It
has several key parameters:

kind: 'bar', 'barh', 'pie', 'scatter', 'kde', etc.; the full list can be
found in the docs.
color: accepts an array of hex codes, applied in sequence to each data
series / column.
linestyle: 'solid', 'dotted', 'dashed' (applies to line graphs only)
xlim, ylim: a tuple (lower limit, upper limit) giving the range over which
the plot will be drawn
legend: a boolean value to display or hide the legend
labels: a list corresponding to the number of columns in the dataframe; a
descriptive name can be provided here for the legend
title: the string title of the plot

Bar Charts

df.plot(kind='bar', y='Tmax', x='Month')

df.plot.bar(y='Tmax', x='Month')

df.plot.barh(y='Tmax', x='Month')

df.plot.bar(y=['Tmax','Tmin'], x='Month')

Different color for each bar

color=['blue', 'red']

Control color of border

edgecolor='blue'

df.plot.bar(xlabel='Class')
df.plot.bar(ylabel='Amounts')
df.plot.bar(title='I am title')
figsize=(8, 6)

Rotate Label

rot=70
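
The keyword fragments above are meant to be passed to a plotting call. The sketch below combines them in one bar chart; the temperature values are invented, and the non-interactive "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly temperatures
df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"],
                   "Tmax": [7, 8, 11],
                   "Tmin": [1, 1, 3]})

ax = df.plot.bar(
    x="Month",
    y=["Tmax", "Tmin"],
    color=["blue", "red"],     # one color per column
    edgecolor="black",         # border color of each bar
    title="Monthly temperatures",
    xlabel="Month",
    ylabel="Temperature",
    figsize=(8, 6),
    rot=70,                    # rotate the x tick labels
)
# In an interactive session, plt.show() would display the chart
```

The call returns a matplotlib Axes object, so further adjustments can be made on ax afterwards.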

Multiple charts

df.plot(y=['Tmax','Tmin','Rain','Sun'], x='Month', subplots=True, layout=(2,2))

df.plot.bar(y=['Tmax','Tmin','Rain','Sun'], x='Month', subplots=True, layout=(2,2))

Scatter plot of two columns

df.plot.scatter(x='Sun', y='Rain')

df.plot(kind='line', y='Tmax', x='Month')

df.plot( y='Tmax', x='Month')

df.plot(kind='hist', y='Tmax', x='Month')

df.plot(kind='hexbin', y='Tmax', x='Month')

df.plot.kde()

Pie

df = pd.DataFrame({'cost': [79, 40, 60]}, index=['Oranges', 'Bananas', 'Apples'])

df.plot.pie(y='cost', figsize=(8, 6))
plt.show()
Python Pandas - Series
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index.
pandas.Series( data, index, dtype, copy)

data: takes various forms such as ndarray, list, or constants.

index: Index values must be hashable and the same length as data; they do
not have to be unique. Defaults to np.arange(n) if no index is passed.

dtype: the data type of the Series.

copy: whether to copy the input data. Default False.
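
As a brief sketch of how these parameters fit together (the values and the name default_index are illustrative):

```python
import pandas as pd

# data as a list, explicit index labels, and an explicit dtype
s = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="float64")
print(s)

# Without an index, labels default to 0, 1, 2, ...
default_index = pd.Series([1, 2, 3])
print(list(default_index.index))
```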

A series can be created using various inputs like −

• Array
• Dict
• Scalar value or constant

Create an Empty Series

import pandas as pd
s = pd.Series()
print(s)

import pandas as pd
s = pd.Series()
s

Create a Series from ndarray

import pandas as pd
import numpy as np
data = np.array(['FY','SY','TY','BE'])
s = pd.Series(data)
s

Customized indexed values

import pandas as pd
import numpy as np
data = np.array(['FY','SY','TY','BE'])
s = pd.Series(data,index=[100,101,102,103])
s

Create a Series from dictionary

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
s

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
s
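
When both a dictionary and an index are given, the values are pulled out in index order, and any label that is missing from the dictionary gets NaN. A small sketch that checks this:

```python
import math

import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

print(s.index.tolist())      # order follows the index argument
print(math.isnan(s['d']))    # 'd' has no entry in data, so its value is NaN
```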

Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the
length of index.

import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
s

Accessing Data from Series with Position

Data in the series can be accessed similar to that in an ndarray.
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

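#retrieve the first element by position
#(.iloc is explicit; plain s[0] on a labeled index is deprecated in recent pandas)

s.iloc[0]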

Retrieve the first three elements in the Series.

s[:3]

Retrieve the last three elements.

s[-3:]
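
Position-based access can also be written with .iloc, which always works by position regardless of the index labels. A minimal sketch, with variable names chosen for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]          # first element
first_three = s.iloc[:3]   # first three elements
last_three = s.iloc[-3:]   # last three elements

print(first)
print(first_three.tolist())
print(last_three.tolist())
```
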
Retrieve Data Using Label (Index)
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element

print(s['d'])

Retrieve multiple elements using a list of index label values.

s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements

s[['a','c','d']]

If a label is not contained in the index, a KeyError exception is raised.

s[['aa','c','d']]
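
The sketch below catches that exception and, as one alternative not covered in these notes, uses reindex() to get NaN for unknown labels instead of an error. The names missing_label_raised and safe are hypothetical.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

try:
    s[['aa', 'c', 'd']]
    missing_label_raised = False
except KeyError as exc:
    missing_label_raised = True
    print("KeyError:", exc)

# reindex returns NaN for labels that are not in the index
safe = s.reindex(['aa', 'c', 'd'])
print(safe)
```
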

DataFrame:
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.

Features of DataFrame

• Columns can be of different types
• Size is mutable
• Labeled axes (rows and columns)
• Can perform arithmetic operations on rows and columns

pandas.DataFrame( data, index, columns, dtype, copy)

data:
Data takes various forms like ndarray, series, map, lists, dict, constants and also another
DataFrame.

index:
Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

columns:
Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels
are passed.

dtype:
Data type of each column.

copy:
Flag that controls whether the input data is copied. For ndarray or DataFrame input, the
default behaves like False.
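
A minimal sketch of the constructor with each of the labeling parameters set. The array values, the labels, and the name plain are made up for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])

df = pd.DataFrame(data,
                  index=["r1", "r2", "r3"],   # row labels
                  columns=["A", "B"],         # column labels
                  dtype="float64")            # applied to every column
print(df)

# With no labels given, both axes default to 0, 1, 2, ...
plain = pd.DataFrame(data)
print(list(plain.index), list(plain.columns))
```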

Create DataFrame

Create an Empty DataFrame

import pandas as pd
df = pd.DataFrame()
print(df)

Create a DataFrame from Lists

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

import pandas as pd
data = [['SY',10],['TY',12],['FY',13]]
df = pd.DataFrame(data,columns=['CLASS','RN'])
print(df)

Create a DataFrame from Dict of ndarrays / Lists

import pandas as pd
data = {'CLASS':['FY', 'SY', 'TY', 'BTECH'],'RN':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
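
The dictionary form can be combined with an explicit index. In this sketch the row labels are hypothetical; the dictionary is the one used above.

```python
import pandas as pd

data = {'CLASS': ['FY', 'SY', 'TY', 'BTECH'], 'RN': [28, 34, 29, 42]}

# Hypothetical row labels; the list must match the length of the columns
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df)
print(df.loc['rank2', 'CLASS'])   # look up a single cell by row and column label
```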
