Pandas.ipynb - Colab (1)
import numpy as np
import pandas as pd
data = pd.Series([5,14,99,888])
data
0 5
1 14
2 99
3 888
dtype: int64
data[3]
888
data.values
array([  5,  14,  99, 888])
data.index
RangeIndex(start=0, stop=4, step=1)
As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
14
data[1:3]
1 14
2 99
dtype: int64
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
students = pd.Series({'Ram': 123, 'Shyam': 124, 'Arun': 125})
students
Ram 123
Shyam 124
Arun 125
dtype: int64
By default, a Series will be created whose index is drawn from the dictionary's keys (in insertion order in modern Pandas). From here, typical dictionary-style item access can be performed:
students['Ram']
123
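As a short sketch of the dictionary-to-Series conversion (the `marks` dict below is illustrative, not the notebook's data), an explicit index passed alongside the dictionary selects and orders the entries:

```python
import pandas as pd

# illustrative dict mapping names to values (assumed data)
marks = {'Ram': 123, 'Shyam': 124, 'Arun': 125}

# an explicit index selects and orders which keys to keep
s = pd.Series(marks, index=['Arun', 'Ram'])
print(s)
```

Keys named in the explicit index but missing from the dictionary would appear with NaN values.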
# create a DataFrame from a dict of lists
df = pd.DataFrame({'Name': ['tom', 'nick', 'juli'],
                   'Age': [10, 15, 14]})
# print dataframe.
df
Name Age
0 tom 10
1 nick 15
2 juli 14
df.index
RangeIndex(start=0, stop=3, step=1)
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
df.columns
Index(['Name', 'Age'], dtype='object')
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a
generalized index for accessing the data.
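As a minimal sketch of that view (toy values assumed for illustration), NumPy-style positional access and label-based access reach the same cell:

```python
import numpy as np
import pandas as pd

# assumed toy values: two rows, two labeled columns
arr = np.array([[10, 80], [15, 70]])
df = pd.DataFrame(arr, columns=['Age', 'Mark'], index=['tom', 'nick'])

print(arr[0, 1])              # plain 2D positional indexing
print(df.loc['tom', 'Mark'])  # explicit row/column labels
print(df.iloc[0, 1])          # implicit integer positions
```

All three expressions select the same underlying value, which is what makes the "generalized 2D array" picture useful.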
df['Name']
0 tom
1 nick
2 juli
Name: Name, dtype: object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series :
pd.DataFrame(students, columns=['rollno'])
rollno
Ram 123
Shyam 124
Arun 125
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a b c
0 1.0 2 NaN
1 NaN 3 4.0
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will
be used for each:
np.random.rand(3, 2)
array([[0.48925761, 0.81202557],
[0.37526746, 0.9834642 ],
[0.10226165, 0.37402615]])
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo bar
a 0.965321 0.512423
b 0.969355 0.437354
c 0.196705 0.719428
We covered structured arrays in Structured Data: NumPy's Structured Arrays. A Pandas DataFrame operates much like a structured array, and
can be created directly from one:
A = np.zeros(3, dtype=[('A', '<i8'), ('B', '<f8')])
A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
A B
0 0 0.0
1 0 0.0
2 0 0.0
# index pair assumed for illustration; their intersection matches the output below
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection - common elements
<ipython-input-71-b0dd807d5915>:1: FutureWarning: Index.__and__ operating as a set operation is deprecated, in the future this will be a logical operation matching Series.__and__. Use index.intersection(other) instead.
Int64Index([3, 5, 7], dtype='int64')
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
# masking
data[data == 0.5]
b 0.5
dtype: float64
# fancy indexing
data[['a', 'd']]
a 0.25
d 1.00
dtype: float64
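Masks can also be combined with the element-wise `&` and `|` operators; a sketch reusing the same Series (note the parentheses, since `&` binds tighter than the comparison operators):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# parentheses are required: & binds tighter than > and <
sel = data[(data > 0.3) & (data < 0.8)]
print(sel)
```

Using Python's `and`/`or` here would raise an error, because those operate on whole objects rather than element-wise.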
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
data[1] # explicit index when indexing
'a'
data[1:3] # implicit index when slicing
3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose
certain indexing schemes.
First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1] #Explicit
'a'
data.loc[1:3]
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1:3] #Implicit
3 b
5 c
dtype: object
data = pd.DataFrame({'Name': ['Ram', 'Shyam', 'Arun', 'Gopal'],
                     'Rollno': [123, 124, 125, 235],
                     'FDS_Mark': [80, 70, 35, 95],
                     'DS_Mark': [85, 75, 60, 70]})
data
Name Rollno FDS_Mark DS_Mark
0 Ram 123 80 85
1 Shyam 124 70 75
2 Arun 125 35 60
3 Gopal 235 95 70
data['Rollno']
0 123
1 124
2 125
3 235
Name: Rollno, dtype: int64
data['FDS_Mark']/100 # marks as a fraction
0 0.80
1 0.70
2 0.35
3 0.95
Name: FDS_Mark, dtype: float64
data['DS_Mark']-15
0 70
1 60
2 45
3 55
Name: DS_Mark, dtype: int64
data['Total_mark']=data['FDS_Mark']+data['DS_Mark']
data
data['Total_mark'].mean()
142.5
data['Total_mark'].median()
155.0
data['Total_mark'].mode()
0 165
dtype: int64
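The three aggregations above can be checked on a standalone Series whose values mirror the Total_mark column (165, 145, 95, 165):

```python
import pandas as pd

# totals assumed to match FDS_Mark + DS_Mark from the data above
totals = pd.Series([165, 145, 95, 165])

print(totals.mean())    # 570 / 4
print(totals.median())  # midpoint of the two middle values, 145 and 165
print(totals.mode())    # most frequent value(s), returned as a Series
```

Note that mode() always returns a Series, since a dataset can have more than one most-frequent value.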
fillna(): Return a copy of the data with missing values filled or imputed
import numpy as np
data = pd.Series([1, np.nan, 'hello', None])
data
0 1
1 NaN
2 hello
3 None
dtype: object
data.isnull()
0 False
1 True
2 False
3 True
dtype: bool
data.dropna()
0 1
2 hello
dtype: object
data
0 1
1 NaN
2 hello
3 None
dtype: object
data.fillna(0)
0 1
1 0
2 hello
3 0
dtype: object
# forward-fill (recent Pandas prefers data.ffill() over method='ffill')
data.fillna(method='ffill')
0 1
1 1
2 hello
3 hello
dtype: object
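fillna also works column-wise on a DataFrame; a short sketch (assumed toy data) passing a dict of per-column fill values, with one column filled by its own mean:

```python
import numpy as np
import pandas as pd

# assumed toy frame with gaps in both columns
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# fill column A with a constant and column B with its column mean
filled = df.fillna({'A': 0, 'B': df['B'].mean()})
print(filled)
```

Like the Series version, this returns a filled copy and leaves the original DataFrame untouched.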