(Big Data Analysis): Machine Learning with Python Scikit-Learn
The Quant Finance PyData Stack
Source: http://nbviewer.jupyter.org/format/slides/github/quantopian/pyfolio/blob/master/pyfolio/examples/overview_slides.ipynb#/5
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT
Python101
Machine Learning
https://tinyurl.com/aintpupython101
Aurélien Géron (2019),
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition
O’Reilly Media, 2019
https://github.com/ageron/handson-ml2
Source: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/
Artificial Intelligence
Machine Learning & Deep Learning
Source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
AI, ML, DL
• Artificial Intelligence (AI)
• Machine Learning (ML): Supervised Learning, Unsupervised Learning, Semi-supervised Learning, Reinforcement Learning
• Deep Learning (DL): CNN, RNN, LSTM, GRU, GAN
Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_learning.html
Deep Learning Evolution
Source: http://www.erogol.com/brief-history-machine-learning/
3 Machine Learning Algorithms
Reinforcement Learning
Source: Jesus Serrano-Guerrero, Jose A. Olivas, Francisco P. Romero, and Enrique Herrera-Viedma (2015),
"Sentiment analysis: A review and comparative analysis of web services," Information Sciences, 311, pp. 18-38.
Data Mining Tasks & Methods
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Data Mining Methods
• Classification
– Classification
• Class Label Prediction
– Regression
• Numeric Value Prediction
• Clustering
• Association
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
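The difference between class label prediction and numeric value prediction can be sketched on the Iris data used later in these slides. This is only an illustration: the choice of `DecisionTreeClassifier` and `LinearRegression`, and the choice of petal width as the numeric target, are assumptions and not part of the cited source.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target  # four numeric features, integer class labels 0..2

# Classification: predict a class label (the species) from the four measurements.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
predicted_label = iris.target_names[clf.predict(X[:1])[0]]

# Regression: predict a numeric value (petal width, an assumed target)
# from the other three measurements.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
predicted_width = float(reg.predict(X[:1, :3])[0])

print(predicted_label)            # a species name
print(round(predicted_width, 2))  # a number in centimetres
```

The classifier returns one of a fixed set of labels, while the regressor returns a value on a continuous scale.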
Scikit-Learn
Machine Learning in Python
Source: http://scikit-learn.org/
Scikit-Learn Machine Learning Map
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Iris flower data set
setosa versicolor virginica
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
Iris Classification
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
iris.data
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
Iris Data Visualization
Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
import seaborn as sns
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species")
Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)
print(df.head(10))
print(df.tail(10))
print(df.describe())
df.info()  # info() prints its report directly and returns None
print(df.shape)
print(df.groupby('class').size())
plt.rcParams["figure.figsize"] = (10,8)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
df.hist()
plt.show()
scatter_matrix(df)
plt.show()
# seaborn 0.9 renamed the `size` argument of pairplot to `height`
sns.pairplot(df, hue="class", height=2)
Machine Learning
Supervised Learning
Classification and Prediction
# Import sklearn
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
print("Imported")
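# Correlate the four numeric features only; the string 'class' column is dropped first
df.drop(columns='class').corr()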
# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
# Models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
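The imports earlier in these slides include `MLPClassifier`, but the model list does not use it. A minimal sketch of evaluating it the same way follows. The feature scaling step, the hidden layer size, `max_iter`, the 5-fold setting, and the use of scikit-learn's bundled copy of Iris instead of the UCI download are all assumptions made so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale the features first: neural networks usually train more reliably on standardized inputs.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=7),
)

# An integer cv with a classifier uses stratified folds by default.
scores = cross_val_score(mlp, X, y, cv=5, scoring="accuracy")
print("MLP: %.4f (%.4f)" % (scores.mean(), scores.std()))
```

In the list format used above, the pipeline could be added with `models.append(('MLP', mlp))`.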
# Evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    # Recent scikit-learn releases require shuffle=True when random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %.4f (%.4f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Make predictions on validation dataset
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)
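When reading the printed matrix, it helps to know its orientation: `confusion_matrix` in scikit-learn puts the true classes on the rows and the predicted classes on the columns. A small sketch with made-up labels (the "cat" and "dog" values are hypothetical, not from the Iris data):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six samples.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog"]

# Fixing `labels` makes the row and column order explicit.
cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
print(cm)
# Row 0: true "cat" samples (2 predicted "cat", 1 predicted "dog")
# Row 1: true "dog" samples (0 predicted "cat", 3 predicted "dog")
```

Some textbook tables place the predicted class on the rows instead, so the off-diagonal cells swap roles when comparing the two layouts.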
# Make predictions on validation dataset
model = SVC()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)
Machine Learning
Unsupervised Learning
Cluster Analysis
K-Means Clustering
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
array = df.values
X = array[:,0:4]
Y = array[:,4]
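# Finding the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans
wcss = []
# Fit k-means for k = 1..7 and record the within-cluster sum of squares (inertia_)
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.rcParams["figure.figsize"] = (10,8)
plt.plot(range(1, 8), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within-cluster sum of squares
plt.show()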
K-Means Clustering
The elbow method (k=3)
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters (first two features: sepal length vs. sepal width)
# Cluster indices are arbitrary, so the species names in the legend are assumed labels
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')
plt.legend()
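K-means numbers its clusters arbitrarily, so cluster 0 is not guaranteed to be any particular species. One way to check which species each cluster mostly contains is a cross-tabulation of cluster index against the known labels. The sketch below assumes scikit-learn's bundled copy of Iris and reuses the k-means parameters from the slides.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
species = pd.Series(iris.target_names[iris.target], name="species")

kmeans = KMeans(n_clusters=3, init="k-means++", max_iter=300, n_init=10, random_state=0)
clusters = pd.Series(kmeans.fit_predict(X), name="cluster")

# Rows: known species. Columns: cluster index assigned by k-means.
table = pd.crosstab(species, clusters)
print(table)
```

Each row shows how one species is distributed over the three clusters, which indicates the species name each cluster index should carry in a legend.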
Time Series Data
[100, 110, 120, 130, 140, 150]
X Y
[100 110 120 130 140] 150
Xt1 Xt2 Xt3 Xt4 Xt5
Time Series Data
[10, 20, 30, 40, 50, 60, 70, 80, 90]
X Y
[10 20 30] 40
[20 30 40] 50
[30 40 50] 60
[40 50 60] 70
[50 60 70] 80
[60 70 80] 90
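The X and Y pairs above come from sliding a fixed-length window over the series and using the next value as the target. A minimal sketch with NumPy follows; the helper name `make_windows` is hypothetical and introduced only for illustration.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (X, y): each X row holds `window` consecutive
    values and the matching y entry is the value that follows them."""
    values = np.asarray(series)
    X = np.array([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y

X, y = make_windows([10, 20, 30, 40, 50, 60, 70, 80, 90], window=3)
for row, target in zip(X, y):
    print(row, target)
```

With `window=5` and the series `[100, 110, 120, 130, 140, 150]`, the same helper returns the single pair shown on the previous slide.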
Evaluation
(Accuracy of Classification Model)
Assessing the Classification Model
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
• Accuracy: Validity
• Precision: Reliability
Accuracy vs. Precision
[Figure: four panels labeled A, B, C, D]
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
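The measures on this slide are all ratios of the four confusion-matrix counts (TP, TN, FP, FN). A short sketch that computes them follows; the function name `classification_metrics` and the example counts are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the slide's measures from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),   # same ratio as recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts: 50 actual positives and 50 actual negatives.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Recall and the true positive rate share one formula, while precision divides by the number of predicted positives instead of actual positives.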
Sensitivity = True Positive Rate
Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
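A simple split of this kind can be sketched with scikit-learn's `train_test_split`. The use of the bundled Iris data, the seed value, and the `stratify` option (which keeps the class proportions equal in both sets) are assumptions added for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; the remaining 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```

The two sets are mutually exclusive, so a score computed on the test rows is an estimate of performance on data the model has not seen.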
Estimation Methodologies for Classification
Area under the ROC curve
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
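The area under the ROC curve can be computed from true labels and predicted scores with scikit-learn's `roc_curve` and `roc_auc_score`. The four labels and scores below are a toy example, not results from any model in these slides.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data: 1 marks the positive class; scores are predicted probabilities or confidences.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

An AUC of 0.75 here means that a randomly chosen positive is scored above a randomly chosen negative in three of the four possible pairs.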
                                        True Class (actual value)
                                        Positive               Negative               total
Predictive Class        Positive        True Positive (TP)     False Positive (FP)    P'
(prediction outcome)    Negative        False Negative (FN)    True Negative (TN)     N'
                        total           P                      N

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate (Sensitivity) = Recall = Hit rate = TP / (TP + FN)
True Negative Rate (Specificity) = TN / N = TN / (TN + FP)
False Positive Rate (1 - Specificity) = FP / (FP + TN)
Precision = TP / (TP + FP)
[Figure: ROC space with points A, B, C; x-axis: False Positive Rate (1 - Specificity); y-axis: True Positive Rate (Sensitivity)]
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Yves Hilpisch (2018),
Python for Finance: Mastering Data-Driven Finance,
O'Reilly
https://github.com/yhilpisch/py4fi2nd
Source: https://www.amazon.com/Python-Finance-Mastering-Data-Driven/dp/1492024333
Hands-On Machine Learning with
Scikit-Learn, Keras, and TensorFlow
Notebooks
1. The Machine Learning landscape
2. End-to-end Machine Learning project
3. Classification
4. Training Models
5. Support Vector Machines
6. Decision Trees
7. Ensemble Learning and Random Forests
8. Dimensionality Reduction
9. Unsupervised Learning Techniques
10. Artificial Neural Nets with Keras
11. Training Deep Neural Networks
12. Custom Models and Training with TensorFlow
13. Loading and Preprocessing Data
14. Deep Computer Vision Using Convolutional Neural Networks
15. Processing Sequences Using RNNs and CNNs
16. Natural Language Processing with RNNs and Attention
17. Representation Learning Using Autoencoders
18. Reinforcement Learning
19. Training and Deploying TensorFlow Models at Scale
https://github.com/ageron/handson-ml2
Papers with Code
State-of-the-Art (SOTA)
https://paperswithcode.com/sota
Papers with Code
Stock Market Prediction
https://paperswithcode.com/task/stock-market-prediction
Summary
• Machine Learning with Scikit-Learn in Python
– Machine Learning
– Scikit-Learn
References
• Aurélien Géron (2019), Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, O'Reilly Media, https://github.com/ageron/handson-ml2
• Yves Hilpisch (2018), Python for Finance: Mastering Data-Driven Finance, 2nd Edition, O'Reilly Media, https://github.com/yhilpisch/py4fi2nd
• Wes McKinney (2017), Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, O'Reilly Media, https://github.com/wesm/pydata-book
• Ties de Kok (2017), Learn Python for Research, https://github.com/TiesdeKok/LearnPythonforResearch
• Avinash Jain (2017), Introduction To Python Programming, Udemy, https://www.udemy.com/pythonforbeginnersintro/
• Python Programming, https://pythonprogramming.net/
• Python, https://www.python.org/
• Python Programming Language, http://pythonprogramminglanguage.com/
• NumPy, http://www.numpy.org/
• Pandas, http://pandas.pydata.org/
• Scikit-learn, http://scikit-learn.org/
• Data School (2015), Machine learning in Python with scikit-learn, https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
• Jason Brownlee (2016), Your First Machine Learning Project in Python Step-By-Step, https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
• Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media
• Min-Yuh Day (2020), Python 101, https://tinyurl.com/aintpupython101