Pandas.ipynb - Colab
2 Le thi C 212 8 7 9
2 Le thi C 212 8 7 9 3
2 Le thi C 212 8 7 9 3
https://colab.research.google.com/drive/13cUSoGs2geRy7cWF0t9RwMpVZr5KdVSh#scrollTo=ror_eTYNs_V1&printMode=true 1/8
07:34 8/12/24 Pandas.ipynb - Colab
diamonds_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null category
2 color 53940 non-null category
3 clarity 53940 non-null category
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
diamonds_df.describe()
0 Ideal
1 Premium
2 Good
3 Premium
4 Good
...
53935 Ideal
53936 Good
53937 Very Good
53938 Premium
53939 Ideal
Name: cut, Length: 53940, dtype: category
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
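The output above shows `cut` stored with the `category` dtype and five declared categories. As a minimal sketch of what that dtype carries, the snippet below builds a small categorical Series by hand; `toy_cut` and its values are hypothetical stand-ins for `diamonds_df['cut']`, not data from the notebook.

```python
import pandas as pd

# Category order copied from the output above
cut_order = ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

# Hypothetical values standing in for diamonds_df['cut']
toy_cut = pd.Series(
    pd.Categorical(['Ideal', 'Premium', 'Good', 'Premium', 'Good'],
                   categories=cut_order),
    name='cut',
)

print(toy_cut.dtype)                 # the dtype is 'category'
print(list(toy_cut.cat.categories))  # all declared categories, including unused ones
print(toy_cut.cat.codes.tolist())    # integer position of each value in cut_order
```

Each value is stored as a small integer code pointing into the category list, which is likely why the three categorical columns keep the reported memory usage low.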
... ... ... ... ... ... ... ... ... ... ...
53913 0.80 Good G VS2 64.2 58.0 2753 5.84 5.81 3.74
53914 0.84 Good I VS1 63.7 59.0 2753 5.94 5.90 3.77
53916 0.74 Good D SI1 63.1 59.0 2753 5.71 5.74 3.61
53927 0.79 Good F SI1 58.1 59.0 2756 6.06 6.13 3.54
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
... ... ... ... ... ... ... ... ... ... ...
53905 0.70 Premium E SI1 60.0 59.0 2753 5.75 5.79 3.46
53910 0.70 Premium E SI1 60.5 58.0 2753 5.74 5.77 3.48
53928 0.79 Premium E SI2 61.4 58.0 2756 6.03 5.96 3.68
53930 0.71 Premium E SI1 60.5 55.0 2756 5.79 5.74 3.49
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64
diamonds_df['cut'].describe()
count 53940
unique 5
top Ideal
freq 21551
Name: cut, dtype: object
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: cut, dtype: int64
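The counts above have the shape of a `value_counts()` result, but the cell that produced them did not survive in this printout. The following is only a sketch of that kind of call on a small hypothetical frame (`toy` is not part of the notebook).

```python
import pandas as pd

# Hypothetical toy data standing in for diamonds_df
toy = pd.DataFrame(
    {'cut': ['Ideal', 'Premium', 'Ideal', 'Good', 'Ideal', 'Premium']}
)

# Count how often each value occurs; the result is sorted by count, descending
cut_counts = toy['cut'].value_counts()
print(cut_counts)
```

The first index label of the result is the most frequent value, which matches the `top` and `freq` entries reported by `describe()` on a categorical column.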
<Axes: >
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 0.000706
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 0.000644
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 0.000703
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 0.000868
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 0.000925
... ... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50 0.000261
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61 0.000261
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56 0.000254
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74 0.000312
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64 0.000272
df4 = diamonds_df.drop('depth', axis=1)  # Drop the depth column and assign the result to df4
df4
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 57.0 2757 5.75 5.76 3.50 0.000261
53936 0.72 Good D SI1 55.0 2757 5.69 5.75 3.61 0.000261
53937 0.70 Very Good D SI1 60.0 2757 5.66 5.68 3.56 0.000254
53938 0.86 Premium H SI2 58.0 2757 6.15 6.12 3.74 0.000312
53939 0.75 Ideal D SI2 55.0 2757 5.83 5.87 3.64 0.000272
df5 = diamonds_df.drop(columns=['color', 'depth'])  # Drop the color and depth columns and assign the result to df5
df5
... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal SI1 57.0 2757 5.75 5.76 3.50 0.000261
53936 0.72 Good SI1 55.0 2757 5.69 5.75 3.61 0.000261
53937 0.70 Very Good SI1 60.0 2757 5.66 5.68 3.56 0.000254
53938 0.86 Premium SI2 58.0 2757 6.15 6.12 3.74 0.000312
53939 0.75 Ideal SI2 55.0 2757 5.83 5.87 3.64 0.000272
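The two `drop` cells above use different spellings, `axis=1` and `columns=`. A minimal sketch on a hypothetical two-row frame (`toy` and its values are illustrative only) suggests that both forms give the same result and that the original frame is left unchanged, because `drop` returns a new DataFrame by default.

```python
import pandas as pd

# Hypothetical two-row frame with a few diamonds-like columns
toy = pd.DataFrame({
    'carat': [0.23, 0.21],
    'color': ['E', 'E'],
    'depth': [61.5, 59.8],
    'price': [326, 326],
})

by_axis = toy.drop('depth', axis=1)       # positional label plus axis=1
by_keyword = toy.drop(columns=['depth'])  # keyword form, same effect

print(by_axis.equals(by_keyword))  # both spellings produce identical frames
print(list(toy.columns))           # the original frame still has 'depth'
```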
cut color
Ideal G 4884
E 3903
F 3826
H 3115
D 2834
I 2093
J 896
Premium G 2924
H 2360
E 2337
F 2331
D 1603
I 1428
J 808
Very Good E 2400
G 2299
F 2164
H 1824
D 1513
I 1204
J 678
Good E 933
F 909
G 871
H 702
D 662
I 522
J 307
Fair G 314
F 312
H 303
E 224
I 175
D 163
J 119
Name: color, dtype: int64
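The table above lists, for each `cut`, how many diamonds have each `color`. The generating cell is missing from the printout; one plausible way to obtain such a two-level count is `groupby(...)[...].value_counts()`, sketched here on hypothetical data (this is an assumption about the approach, not the notebook's confirmed code).

```python
import pandas as pd

# Hypothetical toy data standing in for diamonds_df
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Ideal', 'Premium', 'Premium'],
    'color': ['G', 'G', 'E', 'H', 'H'],
})

# Count colors separately inside each cut group; the result has a (cut, color) index
pair_counts = toy.groupby('cut')['color'].value_counts()
print(pair_counts)

# A single (cut, color) pair can be looked up with a tuple
print(pair_counts.loc[('Ideal', 'G')])
```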
df_new2 = diamonds_df.groupby('cut')['price'].mean()  # Group by the cut column and compute the mean price of each group
df_new2
cut
Ideal 3457.541970
Premium 4584.257704
Very Good 3981.759891
Good 3928.864452
Fair 4358.757764
Name: price, dtype: float64
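The cell above reduces each `cut` group to a single mean price. When several statistics are wanted at once, `agg` with a list of function names is one option; the sketch below uses a hypothetical four-row frame (`toy`), not the diamonds data.

```python
import pandas as pd

# Hypothetical toy data standing in for diamonds_df
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Premium', 'Premium'],
    'price': [300, 500, 400, 1000],
})

# One row per cut, one column per requested statistic
summary = toy.groupby('cut')['price'].agg(['mean', 'min', 'max'])
print(summary)
```

A mean alone can hide a wide spread inside a group, so printing `min` and `max` next to it is a cheap sanity check before comparing groups.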
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df
A B C D
#Grouping and then applying the sum() function to the resulting groups.
df.groupby('A').sum()
#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(['A', 'B']).sum()
C D
A B
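Grouping by two columns yields a result indexed by (A, B) pairs. As a small sketch of how such a hierarchical index can be read or flattened, the snippet below uses fixed hypothetical numbers in place of the random values in the notebook, so the sums are reproducible.

```python
import pandas as pd

# Hypothetical fixed values instead of np.random.randn
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'one'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

grouped = toy.groupby(['A', 'B']).sum()

# Select one group with a tuple of index labels
print(grouped.loc[('bar', 'one'), 'C'])

# Turn the two index levels back into ordinary columns
flat = grouped.reset_index()
print(flat)
```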
df['B'] = df['B'].map({'one': 1, 'two': 2, 'three': 3})  # Encode the labels in B as integers
df
A B C D
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 8 non-null object
1 B 8 non-null int64
2 C 8 non-null float64
3 D 8 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 384.0+ bytes
df_NA = pd.read_csv('/content/df_na.csv')
df_NA
df_NA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hovaten 4 non-null object
1 MSSV 4 non-null int64
2 TCC 3 non-null float64
3 LTXS 3 non-null float64
4 TK 3 non-null float64
5 LT 4 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 320.0+ bytes
df_NA3 = df_NA.fillna('Vang thi')  # Replace NaN entries with 'Vang thi' (absent from the exam)
df_NA3
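Filling a numeric column with a string has a side effect worth checking: the column can no longer stay `float64`. The sketch below illustrates this on a hypothetical two-row frame (`toy`; the names and scores are made up, only the column labels `Hovaten` and `TCC` come from the `info()` output above).

```python
import numpy as np
import pandas as pd

# Hypothetical two-student frame with one missing score
toy = pd.DataFrame({
    'Hovaten': ['Student A', 'Student B'],
    'TCC': [8.0, np.nan],
})

print(toy.isna().sum())         # missing values per column before filling

filled = toy.fillna('Vang thi')  # returns a new frame; toy is unchanged
print(filled.dtypes)             # TCC now holds mixed values, so it is object
print(filled)
```

If the scores are needed for arithmetic later, it may be safer to keep the NaN values in the numeric columns and add the 'Vang thi' label only when displaying the table.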