M3-Introduction to Numpy and Pandas

The document provides an overview of NumPy and its functionalities for handling numerical data in Python, including data types, array creation, and operations such as indexing, slicing, and reshaping. It also introduces pandas as a data analysis library, highlighting its Series and DataFrame structures for managing and manipulating data. Key features of both libraries are discussed, emphasizing their importance for data analysis and artificial intelligence applications.

Uploaded by

Saraswathi
Copyright
© © All Rights Reserved

Module 4

Intro.
 NumPy: Numerical Python
 Python provides a set of modules required for different software applications
 Python is a general-purpose programming language
 Modules include numpy, pandas, ...
 It is a required language and platform for data analysis (DA), and a useful programming language for AI

 NumPy:
 We store data in different formats.
 Data are stored according to their dimensions, i.e., as arrays.
 Python is a dynamically typed language, whereas an array is a static (fixed-type) container; the NumPy module provides such arrays.
Understanding Data Types in Python
 How arrays of data are handled in Python
 How NumPy improves on this
/* C code */
int result = 0;
for (int i = 0; i < 100; i++) {
    result += i;
}
While in Python the equivalent operation could be written this way:
# Python code
result = 0
for i in range(100):
    result += i
 In C, the data type of each variable is explicitly declared, while in Python the types are dynamically inferred.

# Python code
x = 4
x = "four"   # allowed: the name x is simply rebound to a string

/* C code */
int x = 4;
x = "four";  // FAIL
 A Python List
 A Python data structure that holds many Python objects
 The standard mutable multi-element container in Python is the list.

#list of integers:
In[1]: L = list(range(10))
L
Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In[2]: type(L[0])
Out[2]: int
#list of strings:
In[3]: L2 = [str(c) for c in L]
L2
Out[3]: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
In[4]: type(L2[0])
Out[4]: str

Because of Python's dynamic typing, we can even create heterogeneous lists:
In[5]: L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
Out[5]: [bool, str, float, int]
Fixed-Type Arrays
 Python offers several different options for storing data in efficient, fixed-type data buffers, such as the built-in array module.
In[6]: import array
l = list(range(10))
a = array.array('i', l)
a
Out[6]: array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Here 'i' is a type code indicating the contents are integers.
Creating Arrays from Python Lists
 Python's array object provides efficient storage of array-based data; NumPy adds to this efficient operations on that data.
In[7]: import numpy as np

 Use np.array to create arrays from Python lists

In[8]: # integer array:
np.array([1, 4, 2, 5, 3])
Out[8]: array([1, 4, 2, 5, 3])
 Unlike Python lists, NumPy arrays are constrained to contain elements of a single type.
 If the types do not match, NumPy will upcast where possible (for example, integers are upcast to floating point).
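The upcasting rule can be checked directly. The following is a minimal sketch; the variable names mixed and ints are chosen here for illustration only.

```python
import numpy as np

# One float among integers: NumPy picks a type that can hold every element.
mixed = np.array([3.14, 4, 2, 3])
print(mixed.dtype)        # float64

# A list of plain integers stays an integer array
# (the exact width, e.g. int64 or int32, depends on the platform).
ints = np.array([1, 4, 2, 5, 3])
print(ints.dtype.kind)    # i
```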
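Creating Arrays from Scratch
 For larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy
 np.zeros(10, dtype=int) # Create a length-10 integer array filled with zeros
 np.ones((3, 5), dtype=float) # Create a 3x5 floating-point array filled with ones
 np.full((3, 5), 3.14) # Create a 3x5 array filled with 3.14
 np.arange(0, 20, 2) # Create an array from 0 up to (but not including) 20, stepping by 2
 np.linspace(0, 1, 5) # Create 5 evenly spaced values from 0 to 1, inclusive
 np.random.random((3, 3)) # 3x3 array of uniform random values in [0, 1)
 np.random.normal(0, 1, (3, 3)) # 3x3 array of normal random values, mean 0 and standard deviation 1
 np.random.randint(0, 10, (3, 3)) # 3x3 array of random integers in [0, 10)
 np.eye(3) # 3x3 identity matrix
 np.empty(3) # Uninitialized array of length 3; the values are whatever is already in memory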
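NumPy Standard Data Types
import numpy as np
np.array([3.14, 4, 2, 3]) # integers are upcast to float
np.array([1, 2, 3, 4], dtype='float32') # set the data type explicitly
np.array([range(i, i + 3) for i in [2, 4, 6]]) # nested lists give a multidimensional array
array([[2, 3, 4], [4, 5, 6], [6, 7, 8]])

 The dtype can be given as a string or as a NumPy object: np.zeros(10, dtype='int16') or np.zeros(10, dtype=np.int16)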


The Basics of NumPy Arrays
 Attributes of arrays
 Determining the size, shape, memory consumption, and data types of
arrays
 Indexing of arrays
 Getting and setting the value of individual array elements
 Slicing of arrays
 Getting and setting smaller sub arrays within a larger array
 Reshaping of arrays
 Changing the shape of a given array
 Joining and splitting of arrays
 Combining multiple arrays into one, and splitting one array into many
NumPy Array Attributes
 Three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array
In[1]:
import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

 Each array has the attributes
 ndim (the number of dimensions),
 shape (the size of each dimension),
 and size (the total size of the array)
In[2]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
Attr. Cont.
 Another useful attribute is dtype, the data type of the array
In[3]:
print("dtype:", x3.dtype)
dtype: int64

 Other attributes include itemsize, which lists the size (in bytes) of each array element, and nbytes, which lists the total size (in bytes) of the array:

In[4]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 8 bytes
nbytes: 480 bytes
Array Indexing: Accessing Single Elements
 In a one-dimensional array, we can access the ith value (counting from zero) by specifying the desired index in square brackets:
x1
x1[0]
x1[4]
 To index from the end of the array, use negative indices:
x1[-1]
x1[-2]
 In a multidimensional array, access items using a comma-separated tuple of indices:
x2
x2[0, 0]
x2[2, 0]
x2[2, -1]
 We can also modify values using any of the above index notation:
x2[0, 0] = 12
x2
 NumPy arrays have a fixed type, so assigning a floating-point value to an integer array silently truncates it:
x1[0] = 3.14159 # this will be truncated!
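A short sketch of the truncation behavior described above. The array values are illustrative and are not the random values produced by the seeded example.

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])   # an integer array

x[0] = 3.14159     # the float is converted to the array's integer type
print(x[0])        # 3
print(x[-1])       # 9  (negative indices count from the end)
print(x.dtype.kind)  # i  (the array type did not change)
```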
Array Slicing: Accessing Subarrays
 Subarrays are accessed with slice notation, marked by the colon (:) character.
 The NumPy slicing syntax (start defaults to 0, stop to the size of the dimension, and step to 1):
x[start:stop:step]
 One-dimensional subarrays:
 x = np.arange(10)
 x[:5] # first five elements
 x[5:] # elements from index 5 onward
 x[4:7] # middle subarray
 x[::2] # every other element
 x[1::2] # every other element, starting at index 1
 A potentially confusing case is when the step value is negative: the defaults for start and stop are swapped.
 x[::-1] # all elements, reversed
 x[5::-2] # reversed every other from index 5
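The slices listed above can be run as a self-contained snippet; the outputs in the comments follow from x = np.arange(10).

```python
import numpy as np

x = np.arange(10)

print(x[:5])      # [0 1 2 3 4]
print(x[5:])      # [5 6 7 8 9]
print(x[4:7])     # [4 5 6]
print(x[1::2])    # [1 3 5 7 9]
print(x[5::-2])   # [5 3 1]  (negative step: start at index 5 and walk backwards)
```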
 Multidimensional subarrays
 x2
 x2[:2, :3] # two rows, three columns
 x2[:3, ::2] # all rows, every other column
 x2[::-1, ::-1] #reversed

 Accessing array rows and columns


 print(x2[:, 0]) # first column of x2
 print(x2[0, :]) # first row of x2
 print(x2[0]) # equivalent to x2[0, :]
 Subarrays as no-copy views
 Array slices return views rather than copies of the array data.
 This is where NumPy array slicing differs from Python list slicing: in lists, slices are copies.
print(x2)
x2_sub = x2[:2, :2]
print(x2_sub)
x2_sub[0, 0] = 99
print(x2_sub)
print(x2) # the original array is changed as well
 Creating copies of arrays
 To explicitly copy the data within an array or a subarray, use the copy() method:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
print(x2) # the original array is not touched
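The difference between a view and a copy can be sketched as follows. The array grid and the names view and sub_copy are illustrative; np.shares_memory is used only to confirm which object shares the original buffer.

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))

view = grid[:2, :2]          # a slice is a view on the same data
view[0, 0] = 99
print(grid[0, 0])            # 99  (the original changed)

sub_copy = grid[:2, :2].copy()   # an explicit, independent copy
sub_copy[0, 0] = 42
print(grid[0, 0])            # 99  (the original is untouched)

print(np.shares_memory(grid, view))      # True
print(np.shares_memory(grid, sub_copy))  # False
```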
Reshaping of Arrays
 The size of the initial array must match the size of the reshaped array. Where possible, the reshape method returns a no-copy view of the initial array, but with noncontiguous memory buffers this is not always the case.
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
 Conversion of a one-dimensional array into a two-dimensional row or column matrix
x = np.array([1, 2, 3])
x.reshape((1, 3)) # row vector via reshape
x[np.newaxis, :] # row vector via newaxis
x.reshape((3, 1)) # column vector via reshape
x[:, np.newaxis] # column vector via newaxis
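As a quick check of the shapes produced by reshape and np.newaxis, here is a minimal sketch using the same arrays as above.

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))   # 9 elements fit a 3x3 grid
print(grid[1])                            # [4 5 6]

x = np.array([1, 2, 3])
row = x[np.newaxis, :]    # same as x.reshape((1, 3))
col = x[:, np.newaxis]    # same as x.reshape((3, 1))
print(row.shape, col.shape)               # (1, 3) (3, 1)
```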
Array Concatenation and Splitting
 Concatenation of arrays :
 np.concatenate, np.vstack, and np.hstack
 np.concatenate takes a tuple or list of arrays as its first argument

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
z = [99, 99, 99] # concatenate more than two arrays at once
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid]) # concatenate along the first axis
np.concatenate([grid, grid], axis=1) # concatenate along the second axis (axes are zero-indexed)
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7], [6, 5, 4]])
np.vstack([x, grid]) # vertically stack the arrays
y = np.array([[99], [99]])
np.hstack([grid, y]) # horizontally stack the arrays
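The resulting shapes make the effect of each routine easier to see. A small sketch with the arrays from above (the result names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7], [6, 5, 4]])
y = np.array([[99], [99]])

stacked_v = np.vstack([x, grid])                 # adds x as a new first row
stacked_h = np.hstack([grid, y])                 # adds y as a new last column
side_by_side = np.concatenate([grid, grid], axis=1)

print(stacked_v.shape)      # (3, 3)
print(stacked_h.shape)      # (2, 4)
print(side_by_side.shape)   # (2, 6)
```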
Cont..
 Splitting of arrays:
 np.split, np.hsplit, and np.vsplit take a list of indices giving the split points; N split points produce N + 1 subarrays.

x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
grid = np.arange(16).reshape((4, 4))
grid
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
left, right = np.hsplit(grid, [2])
print(left)
print(right)
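A compact sketch of the splitting routines, showing how the split points map to the pieces. The name parts is introduced here for illustration.

```python
import numpy as np

# Two split points give three pieces: [0:3], [3:5], [5:]
parts = np.split(np.arange(8), [3, 5])
print([p.tolist() for p in parts])   # [[0, 1, 2], [3, 4], [5, 6, 7]]

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])  # split the rows at index 2
left, right = np.hsplit(grid, [2])   # split the columns at index 2
print(upper.shape, left.shape)       # (2, 4) (4, 2)
```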
 Introduction to pandas Data Structures
 pandas has functions for analyzing, cleaning, exploring, and manipulating data.
 The name "pandas" refers to both "panel data" and "Python data analysis"
 Analyze big data
 Clean data sets with missing values
 Install: pip install pandas
 Import: import pandas as pd
 Main data structures:
 Series
 DataFrame
 Index objects
Series
 Importing Library
 Implementing the Series
 Indexing on Series
 Re-Indexing on Series
 Missing values of Series
Series cont.
 A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type)
 It also has an associated array of data labels, called its index.
 The default index is the integers 0 to n-1

import numpy as np
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
obj
obj.values
obj.index

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]

obj2[obj2 > 0] # boolean filtering keeps the index-value link
obj2 * 2
np.exp(obj2)

'b' in obj2 # a Series behaves like a fixed-length, ordered dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states) # 'California' is not in sdata, so its value is NaN
pd.isnull(obj4)
pd.notnull(obj4)
obj4.isnull()
obj3
obj4
obj3 + obj4 # differently indexed data is aligned automatically
obj4.name = 'population'
obj4.index.name = 'state'
obj4
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] # the index can be altered in place by assignment
obj
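The alignment behavior of Series arithmetic is worth seeing end to end. This is a minimal sketch using the same sdata and states values as above; the name total is introduced here.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)   # no data for 'California' -> NaN

# Addition aligns on index labels; a label missing on either side gives NaN.
total = obj3 + obj4
print(total)
print(int(obj4.isnull().sum()))         # 1
```

Here 'California' (missing from obj3) and 'Utah' (missing from obj4) both come out as NaN, while 'Ohio' is 70000.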
DataFrame
 A DataFrame represents a tabular, spreadsheet-like data structure
 It contains an ordered collection of columns
 Each column can be a different value type (numeric, string, boolean, etc.).
 It has both a row and a column index.
 The most common way to construct one is from a dict of equal-length lists or NumPy arrays
 The resulting DataFrame has its index assigned automatically, as with Series. Older pandas versions placed the columns in sorted order; current versions keep the order of the dict.
from pandas import Series, DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

DataFrame(data, columns=['year', 'state', 'pop']) # columns in the given order

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2 # 'debt' is not in data, so it appears as NaN
frame2.columns

frame2['state'] # column by dict-like notation
frame2.year # column by attribute
frame2.loc['three'] # row by label (the older .ix indexer was removed in pandas 1.0)

frame2['debt'] = 16.5 # assign a scalar to every row
frame2
frame2['debt'] = np.arange(5.) # assign an array; its length must match the DataFrame
frame2
Cont.
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val # a Series is aligned on the index; missing labels get NaN
frame2
frame2['eastern'] = frame2.state == 'Ohio' # assigning a column that does not exist creates it
frame2
del frame2['eastern']
frame2.columns
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop) # outer dict keys become columns, inner keys become the row index
frame3
frame3.T
DataFrame(pop, index=[2001, 2002, 2003])
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3
frame3.values
frame2.values
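The column-assignment rules can be sketched in a few lines. This example reuses the data dict from above and assumes a current pandas release, where rows are selected by label with .loc.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
print(frame2['debt'].isna().all())   # True: 'debt' had no data

# Assigning a Series aligns on the row labels; unmatched rows stay NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2.loc['two', 'debt'])     # -1.2
print(int(frame2['debt'].isna().sum()))  # 2  (rows 'one' and 'three')
```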
Index Objects
 pandas's Index objects are responsible for holding the axis labels and other metadata (such as the axis name or names)
 Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
index[1] = 'd' # raises TypeError: Index objects are immutable

index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index # checks whether the Index object is shared rather than copied
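Immutability can be demonstrated by catching the error that item assignment raises. A minimal sketch; the flag name mutated is introduced here for illustration.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'          # item assignment is not supported on an Index
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

print(index[1:].tolist())   # ['b', 'c']  (slicing returns a new Index)
```

Because an Index cannot be changed in place, it can be shared safely between data structures.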
Class            Description
Index            The most general Index object, representing axis labels in a NumPy array of Python objects.
Int64Index       Specialized Index for integer values (removed in pandas 2.0, where a plain Index holds integer dtypes).
MultiIndex       "Hierarchical" index object representing multiple levels of indexing on a single axis.
DatetimeIndex    Stores nanosecond timestamps.
PeriodIndex      Specialized Index for Period data (timespans).
Method                     Description
append                     Concatenate with additional Index objects, producing a new Index
difference                 Compute set difference as an Index (called diff in early pandas versions)
intersection               Compute set intersection
union                      Compute set union
isin                       Compute boolean array indicating whether each value is contained in the passed collection
delete                     Compute new Index with element at index i deleted
drop                       Compute new Index by deleting passed values
insert                     Compute new Index by inserting element at index i
is_monotonic_increasing    Returns True if each element is greater than or equal to the previous element (called is_monotonic in older pandas versions)
is_unique                  Returns True if the Index has no duplicate values
unique                     Compute the array of unique values in the Index
Creation of an array:
np.arange(0, 10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The array can be used to build a Series; the index must have one label per element:
pandas.Series(np.arange(0, 10))
pandas.Series(np.arange(0, 10), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

A DataFrame can be created from an array in the same way:
pandas.DataFrame(np.random.randn(4, 3),
                 columns=list('bde'),
                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])
 axis=0 refers to the rows (the index)
 axis=1 refers to the columns

 A Series is indexed row-wise using the "index" property
 pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

 A DataFrame is indexed using two properties:
 "index" (row labels)
 "columns" (column labels)
DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
          columns=['Ohio', 'Texas', 'California'])
 Essential Functionality
 Reindexing
 Dropping entries from an axis
 Indexing, selection, and filtering
 Arithmetic and data alignment
 Arithmetic methods with fill values
 Operations between DataFrame and Series
 Function application and mapping
 Sorting and ranking
 Axis indexes with duplicate values
Reindexing
 A critical method on pandas objects is reindex
 It creates a new object with the data conformed to a new index.
 It allows you to change the row index and the column labels
 Labels in the new index that were not present in the old one get NaN values
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']) # 'e' is new, so its value is NaN
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0) # fill missing values with 0 instead of NaN

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill') # the method option fills holes while reindexing; ffill forward fills the values
Syntax:
dataframe.reindex(labels, method, copy, level, fill_value, limit, tolerance)
Parameter    Value                                                    Description
labels       String or list                                           The new row indexes or column labels (index= and columns= can be used instead)
method       None, 'backfill', 'bfill', 'pad', 'ffill', 'nearest'     Optional, default None. Specifies the method to use when filling holes in the indexes. For increasing/decreasing indexes only.
copy         True, False                                              Optional, default True. Whether to return a new object (a copy) when all the new indexes are the same as the old
level        Number, Label                                            Optional
fill_value   Scalar value                                             Optional, default NaN. Specifies the value to use for missing values
limit        Number                                                   Optional, default None.
tolerance                                                             Optional
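Cont.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

frame2 = frame.reindex(['a', 'b', 'c', 'd']) # passing a sequence reindexes the rows

states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states) # reindex the columns

# Reindex both axes; forward filling needs a sorted axis, so apply it to the rows only
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)

# The older frame.ix[['a', 'b', 'c', 'd'], states] shortcut no longer works: .ix was removed in pandas 1.0
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)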
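The three reindexing behaviors (NaN for new labels, fill_value, and forward fill) can be compared side by side. A minimal sketch using the Series from above; the names filled and forward are introduced here.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(pd.isna(obj2['e']))          # True: 'e' did not exist before

filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(filled['e'])                 # 0.0

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
forward = obj3.reindex(range(6), method='ffill')
print(forward.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
```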


Dropping entries from an axis
Syntax: dataframe.drop(labels, axis, index, columns, level, inplace, errors)
Parameter   Value                       Description
labels                                  Optional. The labels or indexes to drop. If more than one, specify them in a list.
axis        0, 1, 'index', 'columns'    Optional, default 0. Which axis to drop from.
index       String, List                Optional. Specifies the name of the rows to drop. Can be used instead of the labels parameter.
columns     String, List                Optional. Specifies the name of the columns to drop. Can be used instead of the labels parameter.
level       Number, level name          Optional, default None. Specifies which level (in a hierarchical MultiIndex) to check along.
inplace     True, False                 Optional, default False. If True, the removal is done on the current DataFrame. If False, a copy with the entries removed is returned.
errors      'ignore', 'raise'           Optional, default 'raise'. With 'ignore', labels that do not exist are skipped instead of raising an error.
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
obj.drop(['d', 'c'])

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio']) # drop rows (axis 0 is the default)
data.drop('two', axis=1) # drop a column
data.drop(['two', 'four'], axis=1)
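A short sketch showing that drop returns a new object and leaves the original untouched unless inplace=True is passed. The result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])       # rows, by label
cols_dropped = data.drop(columns=['two', 'four'])    # same as axis=1

print(rows_dropped.index.tolist())     # ['Utah', 'New York']
print(cols_dropped.columns.tolist())   # ['one', 'three']
print(data.shape)                      # (4, 4): the original is unchanged

# A label that does not exist is skipped only when errors='ignore' is given.
unchanged = data.drop('Texas', errors='ignore')
print(unchanged.shape)                 # (4, 4)
```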
Indexing, selection, and filtering
 pandas indexing uses [ ], .loc[], .iloc[], and (in old versions) .ix[]
 DataFrame[ ]: also known as the indexing operator
 DataFrame.loc[]: used for label-based indexing
 DataFrame.iloc[]: used for position-based (integer) indexing
 DataFrame.ix[]: accepted both labels and integer positions; it was deprecated and then removed in pandas 1.0, so use .loc[] or .iloc[] instead
 Selection with [] takes several forms:
 [label] a single label
 [start:stop] a slice
 [[a, b]] a list of labels
 [obj < value] and [obj > value] a boolean condition
 Series indexing works like NumPy array indexing, except that you can use the Series's index values instead of only integers.
obj = Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj['b']
obj.iloc[1] # by position; plain obj[1] on a labeled Series is deprecated in recent pandas
obj[2:4]
obj[['b', 'a', 'd']]
obj.iloc[[1, 3]]
obj[obj < 2]
 Slicing with labels differs from normal Python slicing in that the endpoint is inclusive
obj['b':'c']
obj['b':'c'] = 5
 Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence

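data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data['two']
data[['three', 'one']]

 Indexing
data[:2] # a slice selects rows
data[data['three'] > 5] # a boolean array selects rows
data < 5
data[data < 5] = 0 # a boolean DataFrame selects individual values
# .ix was removed in pandas 1.0; the lines below use .loc (labels) and .iloc (positions) instead
data.loc['Colorado', ['two', 'three']]
data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]
data.iloc[2]
data.loc[:'Utah', 'two']
data.loc[data.three > 5].iloc[:, :3]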
Type                   Notes
obj[val]               Select a single column or sequence of columns from the DataFrame. Special cases: boolean array (filter rows), slice (slice rows), or boolean DataFrame.
obj.loc[val]           Select a single row or subset of rows from the DataFrame by label (formerly obj.ix[val]).
obj.loc[:, val]        Select a single column or subset of columns by label (formerly obj.ix[:, val]).
obj.loc[val1, val2]    Select both rows and columns by label (formerly obj.ix[val1, val2]).
obj.iloc[...]          The same selections by integer position (replaces the removed icol and irow methods).
reindex method         Conform one or more axes to new indexes.
xs method              Select a single row or column as a Series by label.
at, iat                Select a single value by row and column label (at) or position (iat); these replace the removed get_value and set_value methods.
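For readers on a current pandas release, the selections in the table can be sketched with .loc, .iloc, and .at. The DataFrame is the same 4x4 example used above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: one row, two columns
print(data.loc['Colorado', ['two', 'three']].tolist())   # [5, 6]

# Position-based: third row
print(data.iloc[2].tolist())                             # [8, 9, 10, 11]

# Label slices include the endpoint
print(data.loc[:'Utah', 'two'].tolist())                 # [1, 5, 9]

# Single value by row and column label
print(data.at['Utah', 'three'])                          # 10
```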
Aligning, mapping, and sorting data in pandas
 Data alignment
df1 = DataFrame(np.arange(9).reshape(3, 3),
                columns=['a', 'b', 'c'], index=['SA', 'VIC', 'NSW'])
df1

df2 = DataFrame(np.arange(12).reshape(4, 3),
                columns=['a', 'b', 'e'], index=['SA', 'VIC', 'NSW', 'ACT'])
df2
 Adding DataFrames
df1 + df2 # labels are aligned; any label missing from either frame gives NaN

 Handling missing data
df1.add(df2, fill_value=0) # treat a value missing from one frame as 0; cells missing from both stay NaN
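The effect of fill_value is easiest to see by comparing individual cells. A minimal sketch with the same df1 and df2; the names total and filled are introduced here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'], index=['SA', 'VIC', 'NSW'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=['a', 'b', 'e'], index=['SA', 'VIC', 'NSW', 'ACT'])

total = df1 + df2
print(total.loc['VIC', 'a'])          # 6.0  (3 + 3)
print(total['c'].isna().all())        # True: column 'c' exists only in df1

filled = df1.add(df2, fill_value=0)
print(filled.loc['ACT', 'a'])         # 9.0  (missing in df1, treated as 0)
print(filled.loc['SA', 'c'])          # 2.0  (missing in df2, treated as 0)
print(pd.isna(filled.loc['ACT', 'c']))  # True: missing in both frames
```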
Mapping
 We often want to change or manipulate the values in a particular row or column by applying a function only to selected values.
 apply takes a function, often a lambda, that specifies the transformation to perform
 DataFrame.apply also takes an axis parameter, which defaults to 0, so the function is applied to each column (that is, down the index) rather than to each row.
df_states # a DataFrame with a 'state' column of strings, assumed to be defined earlier

f = lambda x: x.upper()
df_states['state'] = df_states['state'].apply(f) # on a Series, apply works element by element
df_states
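Since df_states is not defined in the slides, the sketch below builds a small hypothetical stand-in with made-up values, then shows both element-wise apply on a Series and column-wise apply on a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for df_states; the values are invented for illustration.
df_states = pd.DataFrame({'state': ['ohio', 'texas', 'oregon'],
                          'pop': [1.5, 2.5, 1.0]})

# Series.apply calls the function once per element.
df_states['state'] = df_states['state'].apply(lambda s: s.upper())
print(df_states['state'].tolist())    # ['OHIO', 'TEXAS', 'OREGON']

# DataFrame.apply with the default axis=0 calls the function once per column.
col_range = df_states[['pop']].apply(lambda col: col.max() - col.min())
print(col_range['pop'])               # 1.5
```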
Sorting and ranking
 df_states.sort_index() # sort by row index

 df_states.sort_index(axis=1) # sort by column labels
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index() # sorting row index wise
frame.sort_index(axis=1) # column index wise
frame.sort_index(axis=1, ascending=False)
obj = Series([4, 7, -3, 2])
obj.sort_values() # sort by value (called order() in early pandas versions)
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values() # missing values are placed at the end by default
Tie-breaking methods for rank:
Method       Description
'average'    Default: assign the average rank to each entry in the equal group.
'min'        Use the minimum rank for the whole group.
'max'        Use the maximum rank for the whole group.
'first'      Assign ranks in the order the values appear in the data.

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_values(by='b') # sort rows by one column (sort_index(by=...) in early pandas versions)
frame.sort_values(by=['a', 'b'])
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank(method='first')
obj.rank(ascending=False, method='max')
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame.rank(axis=1) # rank across each row
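The tie-breaking methods are easier to follow with concrete output. A minimal sketch, assuming a current pandas release where sort_values is used for sorting by value; the names by_b, first_ranks, and max_ranks are introduced here.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_b = frame.sort_values(by='b')
print(by_b['b'].tolist())     # [-3, 2, 4, 7]

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'first': ties are ranked in the order they appear
first_ranks = obj.rank(method='first')
print(first_ranks.tolist())   # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]

# Descending order, ties share the largest rank in their group
max_ranks = obj.rank(ascending=False, method='max')
print(max_ranks.tolist())     # [2.0, 7.0, 2.0, 4.0, 5.0, 6.0, 4.0]
```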
