Pandas.ipynb - Colab

Uploaded by nghianb23413b

07:34 8/12/24 Pandas.ipynb - Colab

Importing data


pip install pandas  # Install the pandas library

Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)


Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas) (1.23.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

import pandas as pd  # load the pandas library

Loading xls, xlsx files


df = pd.read_excel('/content/data.xlsx')
df

        Hovaten  MSSV  TCC  LTXS  TK
0  Nguyen Van A   210    9     8   8
1    Tran Van B   211    6     5   4
2      Le thi C   212    8     7   9
3    Tran Thi D   213    7     7  10

Loading data from a csv file


df2 = pd.read_csv('/content/df2.csv')
df2

        Hovaten  MSSV  TCC  LTXS  TK  LT
0  Nguyen Van A   210    9     8   8   9
1    Tran Van B   211    6     5   4  10
2      Le thi C   212    8     7   9   3
3    Tran Thi D   213    7     7  10   9
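A quick way to experiment with read_csv without uploading a file is to parse CSV text held in memory. The sketch below is only illustrative: the CSV text is a hypothetical stand-in for /content/df2.csv (two of its rows retyped by hand), and index_col and usecols are optional read_csv parameters that the notebook itself does not use.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for /content/df2.csv (assumption: comma-separated).
csv_text = """Hovaten,MSSV,TCC,LTXS,TK,LT
Nguyen Van A,210,9,8,8,9
Tran Van B,211,6,5,4,10
"""

# usecols keeps only the listed columns; index_col uses MSSV as the row index.
scores = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Hovaten", "MSSV", "TCC", "LT"],
    index_col="MSSV",
)
print(scores)
```

With MSSV as the index, a single student can be looked up with scores.loc[211].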



Loading from a link


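# Google Drive sharing link of the csv file
url = 'https://drive.google.com/file/d/1M5KCfI3yI9lNDGruDfpU7ncas1AD_oz3/view?usp=drive_link'
# The file ID is the second-to-last path segment; build a direct-download URL from it
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df3 = pd.read_csv(url)  # optional argument left disabled in the notebook: encoding='unicode_escape'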
df3.head()

        Hovaten  MSSV  TCC  LTXS  TK  LT
0  Nguyen Van A   210    9     8   8   9
1    Tran Van B   211    6     5   4  10
2      Le thi C   212    8     7   9   3
3    Tran Thi D   213    7     7  10   9
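The link conversion relies on the file ID being the second-to-last path segment of the sharing URL. A slightly more defensive sketch is shown here, assuming the link has the /file/d/<id>/ shape used in this notebook. The helper name drive_direct_url is hypothetical, and the file presumably has to be shared with anyone who has the link for pd.read_csv to receive CSV rather than a sign-in page.

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Turn a Drive sharing link into a direct-download URL (hypothetical helper)."""
    # Assumption: the sharing link contains '/file/d/<file id>/'.
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError("no '/file/d/<id>' segment found in: " + share_url)
    return "https://drive.google.com/uc?id=" + match.group(1)


share = "https://drive.google.com/file/d/1M5KCfI3yI9lNDGruDfpU7ncas1AD_oz3/view?usp=drive_link"
direct = drive_direct_url(share)
print(direct)
```

The result can then be passed to pd.read_csv(direct) exactly as in the cell above.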



Loading data from a library or online repository


csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
iris = pd.read_csv(csv_url, names = col_names)
iris

     Sepal_Length  Sepal_Width  Petal_Length  Petal_Width           Class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

150 rows × 5 columns
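The names argument matters here because iris.data has no header row. A small sketch of the difference, using two rows retyped from the output above as an in-memory stand-in for the remote file (so no network access is needed):

```python
import io

import pandas as pd

# Two header-less rows in the same layout as iris.data (stand-in for the remote file).
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

# Without names, read_csv treats the first data row as the header and loses it as data.
no_names = pd.read_csv(io.StringIO(raw))

# With names, every line is kept as data and the columns get readable labels.
with_names = pd.read_csv(io.StringIO(raw), names=col_names)

print(len(no_names), len(with_names))
```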



import seaborn as sns


diamonds_df = sns.load_dataset('diamonds')  # Load the diamonds dataset from the seaborn library
print(diamonds_df)

carat cut color clarity depth table price x y z


0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

[53940 rows x 10 columns]

Processing the diamonds data


diamonds_df.head()  # first 5 rows of the data

   carat        cut  color  clarity  depth  table  price      x      y      z
0   0.23      Ideal      E      SI2   61.5   55.0    326   3.95   3.98   2.43
1   0.21    Premium      E      SI1   59.8   61.0    326   3.89   3.84   2.31
2   0.23       Good      E      VS1   56.9   65.0    327   4.05   4.07   2.31
3   0.29    Premium      I      VS2   62.4   58.0    334   4.20   4.23   2.63
4   0.31       Good      J      SI2   63.3   58.0    335   4.34   4.35   2.75

diamonds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null category
2 color 53940 non-null category
3 clarity 53940 non-null category
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB

diamonds_df.describe()


              carat         depth         table         price             x             y             z
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157      5.734526      3.538734
std        0.474011      1.432621      2.234491   3989.439738      1.121761      1.142135      0.705699
min        0.200000     43.000000     43.000000    326.000000      0.000000      0.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000      4.720000      2.910000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000      5.710000      3.530000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000      6.540000      4.040000
max        5.010000     79.000000     95.000000  18823.000000     10.740000     58.900000     31.800000
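Note that describe() only summarised the seven numeric columns; the categorical columns cut, color and clarity are left out by default. A minimal sketch of how the include argument changes that, using a three-row hypothetical sample instead of the full diamonds data:

```python
import pandas as pd

# Tiny hypothetical sample with one categorical column, mimicking the diamonds layout.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": pd.Categorical(["Ideal", "Premium", "Good"]),
    "price": [326, 326, 327],
})

numeric_summary = sample.describe()                          # numeric columns only
categorical_summary = sample.describe(include=["category"])  # count, unique, top, freq

print(numeric_summary)
print(categorical_summary)
```

Passing include="all" instead would put both kinds of columns in one table, with NaN where a statistic does not apply.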



diamonds_df['cut']  # extract the data of a single column

0 Ideal
1 Premium
2 Good
3 Premium
4 Good
...
53935 Ideal
53936 Good
53937 Very Good
53938 Premium
53939 Ideal
Name: cut, Length: 53940, dtype: category
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

diamonds_df[diamonds_df['cut'] == 'Good']  # extract the diamonds whose cut quality is Good

       carat        cut  color  clarity  depth  table  price      x      y      z
    2   0.23       Good      E      VS1   56.9   65.0    327   4.05   4.07   2.31
    4   0.31       Good      J      SI2   63.3   58.0    335   4.34   4.35   2.75
   10   0.30       Good      J      SI1   64.0   55.0    339   4.25   4.28   2.73
   17   0.30       Good      J      SI1   63.4   54.0    351   4.23   4.29   2.70
   18   0.30       Good      J      SI1   63.8   56.0    351   4.23   4.26   2.71
  ...    ...        ...    ...      ...    ...    ...    ...    ...    ...    ...
53913   0.80       Good      G      VS2   64.2   58.0   2753   5.84   5.81   3.74
53914   0.84       Good      I      VS1   63.7   59.0   2753   5.94   5.90   3.77
53916   0.74       Good      D      SI1   63.1   59.0   2753   5.71   5.74   3.61
53927   0.79       Good      F      SI1   58.1   59.0   2756   6.06   6.13   3.54
53936   0.72       Good      D      SI1   63.1   55.0   2757   5.69   5.75   3.61

4906 rows × 10 columns

# Extract the diamonds with cut = Premium and color = E


diamonds_df[(diamonds_df['cut']=='Premium') & (diamonds_df['color']=='E')]

       carat        cut  color  clarity  depth  table  price      x      y      z
    1   0.21    Premium      E      SI1   59.8   61.0    326   3.89   3.84   2.31
   14   0.20    Premium      E      SI2   60.2   62.0    345   3.79   3.75   2.27
   15   0.32    Premium      E       I1   60.9   58.0    345   4.38   4.42   2.68
   53   0.22    Premium      E      VS2   61.6   58.0    404   3.93   3.89   2.41
   69   0.24    Premium      E     VVS1   60.7   58.0    553   4.01   4.03   2.44
  ...    ...        ...    ...      ...    ...    ...    ...    ...    ...    ...
53905   0.70    Premium      E      SI1   60.0   59.0   2753   5.75   5.79   3.46
53910   0.70    Premium      E      SI1   60.5   58.0   2753   5.74   5.77   3.48
53911   0.57    Premium      E       IF   59.8   60.0   2753   5.43   5.38   3.23
53928   0.79    Premium      E      SI2   61.4   58.0   2756   6.03   5.96   3.68
53930   0.71    Premium      E      SI1   60.5   55.0   2756   5.79   5.74   3.49

2337 rows × 10 columns


# Extract the diamonds with cut = Premium or price >= 400


diamonds_df[(diamonds_df['cut']=='Premium') | (diamonds_df['price']>= 400)]

       carat        cut  color  clarity  depth  table  price      x      y      z
    1   0.21    Premium      E      SI1   59.8   61.0    326   3.89   3.84   2.31
    3   0.29    Premium      I      VS2   62.4   58.0    334   4.20   4.23   2.63
   12   0.22    Premium      F      SI1   60.4   61.0    342   3.88   3.84   2.33
   14   0.20    Premium      E      SI2   60.2   62.0    345   3.79   3.75   2.27
   15   0.32    Premium      E       I1   60.9   58.0    345   4.38   4.42   2.68
  ...    ...        ...    ...      ...    ...    ...    ...    ...    ...    ...
53935   0.72      Ideal      D      SI1   60.8   57.0   2757   5.75   5.76   3.50
53936   0.72       Good      D      SI1   63.1   55.0   2757   5.69   5.75   3.61
53937   0.70  Very Good      D      SI1   62.8   60.0   2757   5.66   5.68   3.56
53938   0.86    Premium      H      SI2   61.0   58.0   2757   6.15   6.12   3.74
53939   0.75      Ideal      D      SI2   62.2   55.0   2757   5.83   5.87   3.64

53735 rows × 10 columns
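The filters above combine boolean masks with & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than == and >=. A small sketch on a hypothetical four-row sample (not the real diamonds data), with DataFrame.query shown as an equivalent spelling:

```python
import pandas as pd

# Hypothetical four-row sample in the diamonds layout.
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium"],
    "color": ["E", "E", "E", "I"],
    "price": [326, 326, 327, 334],
})

# AND: both conditions must hold for a row to be kept.
both = sample[(sample["cut"] == "Premium") & (sample["color"] == "E")]

# OR: a row is kept if at least one condition holds.
either = sample[(sample["cut"] == "Premium") | (sample["price"] >= 330)]

# The same AND filter written as a query string.
same_as_both = sample.query("cut == 'Premium' and color == 'E'")

print(both)
print(either)
```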



diamonds_df['cut'].describe()

count 53940
unique 5
top Ideal
freq 21551
Name: cut, dtype: object

diamonds_df['cut'].value_counts()  # count how many diamonds there are of each cut

Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: cut, dtype: int64
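Counts are often easier to compare as shares of the total. The sketch below recomputes the shares from the counts printed above and, on a hypothetical toy Series, shows that value_counts(normalize=True) returns such shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"Ideal": 21551, "Premium": 13791, "Very Good": 12082, "Good": 4906, "Fair": 1610},
    name="cut",
)
shares = counts / counts.sum()  # fraction of all diamonds in each cut
print(shares.round(4))

# Toy data (assumption, not the diamonds set): normalize=True yields fractions directly.
cuts = pd.Series(["Ideal", "Premium", "Ideal", "Good"])
toy_shares = cuts.value_counts(normalize=True)
print(toy_shares)
```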

diamonds_df['cut'].value_counts().plot(kind='barh')  # plot the counts; kind can be 'bar', 'barh' or 'pie'

<Axes: >


# Create the column Carat/price = carat / price


diamonds_df['Carat/price'] = diamonds_df['carat']/diamonds_df['price']
diamonds_df


       carat        cut  color  clarity  depth  table  price      x      y      z  Carat/price
    0   0.23      Ideal      E      SI2   61.5   55.0    326   3.95   3.98   2.43     0.000706
    1   0.21    Premium      E      SI1   59.8   61.0    326   3.89   3.84   2.31     0.000644
    2   0.23       Good      E      VS1   56.9   65.0    327   4.05   4.07   2.31     0.000703
    3   0.29    Premium      I      VS2   62.4   58.0    334   4.20   4.23   2.63     0.000868
    4   0.31       Good      J      SI2   63.3   58.0    335   4.34   4.35   2.75     0.000925
  ...    ...        ...    ...      ...    ...    ...    ...    ...    ...    ...          ...
53935   0.72      Ideal      D      SI1   60.8   57.0   2757   5.75   5.76   3.50     0.000261
53936   0.72       Good      D      SI1   63.1   55.0   2757   5.69   5.75   3.61     0.000261
53937   0.70  Very Good      D      SI1   62.8   60.0   2757   5.66   5.68   3.56     0.000254
53938   0.86    Premium      H      SI2   61.0   58.0   2757   6.15   6.12   3.74     0.000312
53939   0.75      Ideal      D      SI2   62.2   55.0   2757   5.83   5.87   3.64     0.000272

53940 rows × 11 columns
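Column assignment modifies diamonds_df in place. As a hedged alternative, DataFrame.assign returns a new frame and leaves the original untouched. The sketch below uses two rows retyped from the output above; the column names carat_per_price and price_per_carat are hypothetical, the latter being the inverse ratio, which may be easier to read.

```python
import pandas as pd

# Two rows retyped from the diamonds output above.
sample = pd.DataFrame({"carat": [0.23, 0.21], "price": [326, 326]})

# assign() returns a copy with the extra columns; `sample` itself is unchanged.
with_ratios = sample.assign(
    carat_per_price=sample["carat"] / sample["price"],  # same ratio as Carat/price
    price_per_carat=sample["price"] / sample["carat"],  # inverse ratio
)
print(with_ratios.round(6))
```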



df4 = diamonds_df.drop('depth', axis=1)  # drop the depth column and assign the result to df4
df4

       carat        cut  color  clarity  table  price      x      y      z  Carat/price
    0   0.23      Ideal      E      SI2   55.0    326   3.95   3.98   2.43     0.000706
    1   0.21    Premium      E      SI1   61.0    326   3.89   3.84   2.31     0.000644
    2   0.23       Good      E      VS1   65.0    327   4.05   4.07   2.31     0.000703
    3   0.29    Premium      I      VS2   58.0    334   4.20   4.23   2.63     0.000868
    4   0.31       Good      J      SI2   58.0    335   4.34   4.35   2.75     0.000925
  ...    ...        ...    ...      ...    ...    ...    ...    ...    ...          ...
53935   0.72      Ideal      D      SI1   57.0   2757   5.75   5.76   3.50     0.000261
53936   0.72       Good      D      SI1   55.0   2757   5.69   5.75   3.61     0.000261
53937   0.70  Very Good      D      SI1   60.0   2757   5.66   5.68   3.56     0.000254
53938   0.86    Premium      H      SI2   58.0   2757   6.15   6.12   3.74     0.000312
53939   0.75      Ideal      D      SI2   55.0   2757   5.83   5.87   3.64     0.000272

53940 rows × 10 columns

df5 = diamonds_df.drop(columns=['color', 'depth'])  # drop the color and depth columns and assign the result to df5
df5

       carat        cut  clarity  table  price      x      y      z  Carat/price
    0   0.23      Ideal      SI2   55.0    326   3.95   3.98   2.43     0.000706
    1   0.21    Premium      SI1   61.0    326   3.89   3.84   2.31     0.000644
    2   0.23       Good      VS1   65.0    327   4.05   4.07   2.31     0.000703
    3   0.29    Premium      VS2   58.0    334   4.20   4.23   2.63     0.000868
    4   0.31       Good      SI2   58.0    335   4.34   4.35   2.75     0.000925
  ...    ...        ...      ...    ...    ...    ...    ...    ...          ...
53935   0.72      Ideal      SI1   57.0   2757   5.75   5.76   3.50     0.000261
53936   0.72       Good      SI1   55.0   2757   5.69   5.75   3.61     0.000261
53937   0.70  Very Good      SI1   60.0   2757   5.66   5.68   3.56     0.000254
53938   0.86    Premium      SI2   58.0   2757   6.15   6.12   3.74     0.000312
53939   0.75      Ideal      SI2   55.0   2757   5.83   5.87   3.64     0.000272

53940 rows × 9 columns
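Both drop calls return a new DataFrame, which is why df4 and df5 lack the dropped columns while diamonds_df keeps all of them. A minimal sketch on a hypothetical two-row sample; errors="ignore" is an optional drop parameter the notebook does not use, and the column no_such_column is deliberately made up.

```python
import pandas as pd

# Hypothetical two-row sample in the diamonds layout.
sample = pd.DataFrame({
    "carat": [0.23, 0.21],
    "color": ["E", "E"],
    "depth": [61.5, 59.8],
    "price": [326, 326],
})

# drop() returns a new frame; `sample` keeps every column.
trimmed = sample.drop(columns=["color", "depth"])

# errors="ignore" skips labels that do not exist instead of raising KeyError.
safe = sample.drop(columns=["depth", "no_such_column"], errors="ignore")

print(list(sample.columns), list(trimmed.columns), list(safe.columns))
```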

Group by data


df_new = diamonds_df.groupby('cut')['color']  # group by the cut column and select the color column
df_new.value_counts()


cut        color
Ideal      G        4884
           E        3903
           F        3826
           H        3115
           D        2834
           I        2093
           J         896
Premium    G        2924
           H        2360
           E        2337
           F        2331
           D        1603
           I        1428
           J         808
Very Good  E        2400
           G        2299
           F        2164
           H        1824
           D        1513
           I        1204
           J         678
Good       E         933
           F         909
           G         871
           H         702
           D         662
           I         522
           J         307
Fair       G         314
           F         312
           H         303
           E         224
           I         175
           D         163
           J         119
Name: color, dtype: int64

df_new2 = diamonds_df.groupby('cut')['price'].mean()  # group by the cut column and take the mean price of each group
df_new2

cut
Ideal 3457.541970
Premium 4584.257704
Very Good 3981.759891
Good 3928.864452
Fair 4358.757764
Name: price, dtype: float64
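The mean is only one possible summary per group. As a sketch, agg accepts a list of function names and returns one column per statistic; the four-row sample below is hypothetical and much smaller than the diamonds data.

```python
import pandas as pd

# Hypothetical four-row sample: two Ideal and two Good diamonds.
sample = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "price": [326, 340, 327, 335],
})

# One row per cut, one column per aggregation function.
summary = sample.groupby("cut")["price"].agg(["mean", "min", "max", "count"])
print(summary)
```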

import numpy as np
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df

     A      B         C         D
0  foo    one -0.997775  0.215626
1  bar    one -0.676385  0.752285
2  foo    two -0.229495 -0.348257
3  bar  three -0.993730  1.433516
4  foo    two  0.611294  0.650352
5  bar    two  0.149692 -0.705117
6  foo    one -1.235844  1.285833
7  foo  three  0.961605  1.292979

#Grouping and then applying the sum() function to the resulting groups.
df.groupby('A').sum()


<ipython-input-38-9670a701a0a2>:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a futu


df.groupby('A').sum()
            C         D
bar -1.520422  1.480684
foo -0.890215  3.096534
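The FutureWarning appears because column B holds strings and sum() was relying on the deprecated default for numeric_only. Passing numeric_only=True states the intent explicitly. A minimal sketch on a hypothetical three-row frame with fixed numbers in place of the random ones:

```python
import pandas as pd

# Hypothetical frame with the same column layout but fixed values.
df_small = pd.DataFrame({
    "A": ["foo", "bar", "foo"],
    "B": ["one", "one", "two"],
    "C": [1.0, 2.0, 3.0],
    "D": [10.0, 20.0, 30.0],
})

# Only the numeric columns C and D are summed; the string column B is skipped.
totals = df_small.groupby("A").sum(numeric_only=True)
print(totals)
```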



#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(['A', 'B']).sum()

                  C         D
A   B
bar one   -0.676385  0.752285
    three -0.993730  1.433516
    two    0.149692 -0.705117
foo one   -2.233619  1.501459
    three  0.961605  1.292979
    two    0.381799  0.302095
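A result with a hierarchical index can be read with a tuple key, or flattened back into ordinary columns with reset_index. A small sketch on a hypothetical frame with fixed values:

```python
import pandas as pd

# Hypothetical frame; two rows share the key ('foo', 'one').
df_small = pd.DataFrame({
    "A": ["foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

grouped = df_small.groupby(["A", "B"]).sum()

# A tuple selects one (A, B) combination from the hierarchical index.
foo_one = grouped.loc[("foo", "one"), "C"]

# reset_index() turns the index levels A and B back into regular columns.
flat = grouped.reset_index()

print(foo_one)
print(flat)
```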



# Replace the text labels in column B with integers in a single call
df['B'] = df['B'].replace({'one': 1, 'two': 2, 'three': 3})
df

     A  B         C         D
0  foo  1 -0.997775  0.215626
1  bar  1 -0.676385  0.752285
2  foo  2 -0.229495 -0.348257
3  bar  3 -0.993730  1.433516
4  foo  2  0.611294  0.650352
5  bar  2  0.149692 -0.705117
6  foo  1 -1.235844  1.285833
7  foo  3  0.961605  1.292979

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 8 non-null object
1 B 8 non-null int64
2 C 8 non-null float64
3 D 8 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 384.0+ bytes

df_NA = pd.read_csv('/content/df_na.csv')
df_NA

        Hovaten  MSSV  TCC  LTXS    TK  LT
0  Nguyen Van A   210  9.0   8.0   8.0   9
1    Tran Van B   211  NaN   5.0   4.0  10
2      Le thi C   212  8.0   7.0   NaN   3
3    Tran Thi D   213  7.0   NaN  10.0   9

df_NA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):

# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hovaten 4 non-null object
1 MSSV 4 non-null int64
2 TCC 3 non-null float64
3 LTXS 3 non-null float64
4 TK 3 non-null float64
5 LT 4 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 320.0+ bytes

df_NA2 = df_NA.dropna()  # drop every row that contains a NaN value


df_NA2

        Hovaten  MSSV  TCC  LTXS   TK  LT
0  Nguyen Van A   210  9.0   8.0  8.0   9
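Plain dropna() is strict: three of the four students disappear because each misses a single score. A sketch of two gentler options, using the df_NA values retyped by hand (the variable names are hypothetical): counting the gaps first with isna().sum(), and restricting the check to chosen columns with subset.

```python
import numpy as np
import pandas as pd

# Values retyped from the df_NA output above.
df_na = pd.DataFrame({
    "Hovaten": ["Nguyen Van A", "Tran Van B", "Le thi C", "Tran Thi D"],
    "MSSV": [210, 211, 212, 213],
    "TCC": [9.0, np.nan, 8.0, 7.0],
    "LTXS": [8.0, 5.0, 7.0, np.nan],
    "TK": [8.0, 4.0, np.nan, 10.0],
    "LT": [9, 10, 3, 9],
})

missing_per_column = df_na.isna().sum()  # number of NaN values in each column
complete = df_na.dropna()                # keeps only rows without any NaN
has_tcc = df_na.dropna(subset=["TCC"])   # only rows missing TCC are removed

print(missing_per_column)
print(len(complete), len(has_tcc))
```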



df_NA3 = df_NA.fillna('Vang thi')  # fill the NaN entries with 'Vang thi' (absent from the exam)
df_NA3

        Hovaten  MSSV       TCC      LTXS        TK  LT
0  Nguyen Van A   210       9.0       8.0       8.0   9
1    Tran Van B   211  Vang thi       5.0       4.0  10
2      Le thi C   212       8.0       7.0  Vang thi   3
3    Tran Thi D   213       7.0  Vang thi      10.0   9
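Filling with a text label keeps the table readable, but a column that mixes numbers and strings is stored as object and can no longer be averaged directly. A sketch of the trade-off on the TCC scores, with the column mean as one possible numeric fill (an illustrative choice, not something the notebook prescribes):

```python
import numpy as np
import pandas as pd

# TCC scores retyped from df_NA above.
tcc = pd.Series([9.0, np.nan, 8.0, 7.0], name="TCC")

# Text fill: readable, but the dtype becomes object.
as_text = tcc.fillna("Vang thi")

# Numeric fill: mean() skips NaN, so the gap is filled with (9 + 8 + 7) / 3.
as_mean = tcc.fillna(tcc.mean())

print(as_text.dtype, as_mean.dtype)
print(as_mean.tolist())
```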


