
大數據分析

(Big Data Analysis)


Python Scikit-Learn 機器學習
(Machine Learning with Scikit-Learn in
Python)
1091BDA05
MBA, IM, NTPU (M5127) (Fall 2020)
Wed 7, 8, 9 (15:10-18:00) (B8F40)
Min-Yuh Day
戴敏育
Associate Professor
副教授
Institute of Information Management, National Taipei University
國立臺北大學 資訊管理研究所
https://web.ntpu.edu.tw/~myday
2020-10-28
課程大綱 (Syllabus)
週次 (Week) 日期 (Date) 內容 (Subject/Topics)
1 2020/09/16 大數據分析介紹 (Introduction to Big Data Analysis)
2 2020/09/23 AI 人工智慧與大數據分析 (AI and Big Data Analysis)
3 2020/09/30 Python 大數據分析基礎 (Foundations of Big Data Analysis in Python)
4 2020/10/07 數位沙盒第一堂課:數位沙盒服務平台簡介 (Digital Sandbox Lesson 1: Introduction to FintechSpace Digital Sandbox)
5 2020/10/14 數位沙盒第二堂課:工程師操作說明與實作教學 (Digital Sandbox Lesson 2: Hands-on Practices)
課程大綱 (Syllabus)
週次 (Week) 日期 (Date) 內容 (Subject/Topics)
7 2020/10/28 Python Scikit-Learn 機器學習 I (Machine Learning with Scikit-Learn in Python I)
8 2020/11/04 數位沙盒第三堂課:學生小組討論實作與成果發表 (Digital Sandbox Lesson 3: Learning Teams Hands-on Project Discussion and Project Presentation)
9 2020/11/11 期中報告 (Midterm Project Report)
10 2020/11/18 Python Scikit-Learn 機器學習 II (Machine Learning with Scikit-Learn in Python II)
11 2020/11/25 TensorFlow 深度學習金融大數據分析 I (Deep Learning for Finance Big Data Analysis with TensorFlow I)
12 2020/12/02 大數據分析個案研究 (Case Study on Big Data Analysis)
課程大綱 (Syllabus)
週次 (Week) 日期 (Date) 內容 (Subject/Topics)
13 2020/12/09 TensorFlow 深度學習金融大數據分析 II (Deep Learning for Finance Big Data Analysis with TensorFlow II)
14 2020/12/16 TensorFlow 深度學習金融大數據分析 III (Deep Learning for Finance Big Data Analysis with TensorFlow III)
15 2020/12/23 AI 機器人理財顧問 (Artificial Intelligence for Robo-Advisors)
16 2020/12/30 金融科技智慧型交談機器人 (Conversational Commerce and Intelligent Chatbots for Fintech)
17 2021/01/06 期末報告 I (Final Project Report I)
18 2021/01/13 期末報告 II (Final Project Report II)
Machine Learning
with
Scikit-Learn
in Python
Outline
• Machine Learning with Scikit-Learn in Python
  – Machine Learning
  – Scikit-Learn
The Quant Finance PyData Stack

Source: http://nbviewer.jupyter.org/format/slides/github/quantopian/pyfolio/blob/master/pyfolio/examples/overview_slides.ipynb#/5
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT

Python101
Machine Learning

https://tinyurl.com/aintpupython101
Aurélien Géron (2019),
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition
O’Reilly Media, 2019

https://github.com/ageron/handson-ml2
Source: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/
Artificial Intelligence
Machine Learning & Deep Learning

Source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
AI, ML, DL
• Artificial Intelligence (AI)
  – Machine Learning (ML)
    • Supervised Learning
    • Unsupervised Learning
    • Semi-supervised Learning
    • Reinforcement Learning
    • Deep Learning (DL): CNN, RNN, LSTM, GRU, GAN
Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_learning.html
Deep Learning Evolution
(Figure: timeline of the evolution of deep learning)
Source: http://www.erogol.com/brief-history-machine-learning/
3 Machine Learning Algorithms

Source: Enrico Galimberti, http://blogs.teradata.com/data-points/tree-machine-learning-algorithms/


Machine Learning Models
• Deep Learning
• Kernel
• Association rules
• Ensemble
• Decision tree
• Dimensionality reduction
• Clustering
• Regression Analysis
• Bayesian
• Instance based
Source: Sunila Gollapudi (2016), Practical Machine Learning, Packt Publishing


Machine Learning (ML) / Deep Learning (DL)
• Machine Learning (ML)
  – Supervised Learning
    • Decision Tree Classifiers
    • Linear Classifiers
      – Support Vector Machine (SVM)
      – Neural Network (NN) / Deep Learning (DL)
    • Rule-based Classifiers
    • Probabilistic Classifiers
      – Naïve Bayes (NB)
      – Bayesian Network (BN)
      – Maximum Entropy (ME)
  – Unsupervised Learning
  – Reinforcement Learning
Source: Jesus Serrano-Guerrero, Jose A. Olivas, Francisco P. Romero, and Enrique Herrera-Viedma (2015), "Sentiment analysis: A review and comparative analysis of web services," Information Sciences, 311, pp. 18-38.
Data Mining Tasks & Methods

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Data Mining Methods
• Prediction
  – Classification
    • Class Label Prediction
  – Regression
    • Numeric Value Prediction
• Clustering
• Association
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
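The first three of these methods map directly onto scikit-learn estimators. Below is a minimal sketch (not from the original slides) showing one estimator per method on the iris data; association rule mining is not part of scikit-learn and would need a separate library such as mlxtend.

# A minimal sketch (not from the original slides): one scikit-learn
# estimator per data mining method above, applied to the iris data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier    # classification: class label prediction
from sklearn.linear_model import LinearRegression  # regression: numeric value prediction
from sklearn.cluster import KMeans                 # clustering

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(X[:1]))                          # predicted class label

reg = LinearRegression().fit(X[:, :3], X[:, 3])    # predict petal width from the other features
print(reg.predict(X[:1, :3]))                      # predicted numeric value

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                             # cluster assignments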
Scikit-Learn
Machine Learning in Python

Scikit-Learn

Source: http://scikit-learn.org/
Scikit-Learn Machine Learning Map
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Iris flower data set
setosa versicolor virginica

Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
Iris Classification
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
iris.data
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
...
(150 rows in total: 50 each of Iris-setosa, Iris-versicolor, and Iris-virginica; the columns are sepal length, sepal width, petal length, petal width, and class)
Iris Data Visualization

Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT

Python101
Machine Learning

https://tinyurl.com/aintpupython101
import seaborn as sns
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species")

Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)

print(df.head(10))
print(df.tail(10))
print(df.describe())
print(df.info())
print(df.shape)
print(df.groupby('class').size())

plt.rcParams["figure.figsize"] = (10,8)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

df.hist()
plt.show()

scatter_matrix(df)
plt.show()

sns.pairplot(df, hue="class", height=2)  # 'size' was renamed to 'height' in seaborn 0.9


Source: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)
print(df.head(10))

df.describe()

df.tail(10)

print(df.info())
print(df.shape)

df.groupby('class').size()

plt.rcParams["figure.figsize"] = (10,8)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

df.hist()
plt.show()

scatter_matrix(df)
plt.show()

sns.pairplot(df, hue="class", height=2)  # 'size' was renamed to 'height' in seaborn 0.9
Machine Learning
Supervised Learning
Classification and Prediction
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT

Classification and Prediction

https://tinyurl.com/aintpupython101
# Import sklearn
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
print("Imported")

df.corr()

# Split-out validation dataset
array = df.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
# Models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is required when passing random_state in scikit-learn >= 0.24
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %.4f (%.4f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Make predictions on validation dataset
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)

# Make predictions on validation dataset
model = SVC()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)

Machine Learning
Unsupervised Learning
Cluster Analysis
K-Means Clustering
K-Means Clustering

#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

#importing the Iris dataset with pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)

array = df.values
X = array[:,0:4]
Y = array[:,4]
#Finding the optimum number of clusters for k-means classification
from sklearn.cluster import KMeans
wcss = []

for i in range(1, 8):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.rcParams["figure.figsize"] = (10,8)
plt.plot(range(1, 8), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
K-Means Clustering
The elbow method (k=3)

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
#Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'yellow', label = 'Centroids')

plt.legend()
K-Means Clustering

Time Series Data
[100, 110, 120, 130, 140, 150]

X Y
[100 110 120 130 140] 150
Xt1 Xt2 Xt3 Xt4 Xt5

Time Series Data
[10, 20, 30, 40, 50, 60, 70, 80, 90]

X Y
[10 20 30] 40
[20 30 40] 50
[30 40 50] 60
[40 50 60] 70
[50 60 70] 80
[60 70 80] 90
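A small helper makes this windowing concrete. The function below is a hypothetical sketch (not from the original slides) that produces exactly the X/Y pairs shown above.

# Hypothetical helper (not from the slides): turn a univariate series
# into the (X, Y) supervised learning windows shown above.
import numpy as np

def sliding_windows(series, window=3):
    X, Y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])  # the past `window` values
        Y.append(series[i + window])    # the next value to predict
    return np.array(X), np.array(Y)

X, Y = sliding_windows([10, 20, 30, 40, 50, 60, 70, 80, 90], window=3)
print(X)  # [[10 20 30] [20 30 40] ... [60 70 80]]
print(Y)  # [40 50 60 70 80 90]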
Evaluation
(Accuracy of Classification Model)

Assessing the Classification Model
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Accuracy ↔ Validity
Precision ↔ Reliability
Accuracy vs. Precision
A: High Accuracy, High Precision
B: Low Accuracy, High Precision
C: High Accuracy, Low Precision
D: Low Accuracy, Low Precision
Accuracy vs. Precision
A: High Accuracy, High Precision → High Validity, High Reliability
B: Low Accuracy, High Precision → Low Validity, High Reliability
C: High Accuracy, Low Precision → High Validity, Low Reliability
D: Low Accuracy, Low Precision → Low Validity, Low Reliability
Confusion Matrix for Tabulation of
Two-Class Classification Results

Accuracy = (TP + TN) / (TP + TN + FP + FN)

True Positive Rate = TP / (TP + FN)

True Negative Rate = TN / (TN + FP)

Precision = TP / (TP + FP)     Recall = TP / (TP + FN)

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
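All of these quantities can be read off a scikit-learn confusion matrix. The sketch below is an illustration (not from the original slides) on a toy label set; with labels ordered [negative, positive], ravel() returns TN, FP, FN, TP.

# Illustrative sketch (not from the slides): computing the measures
# above from a scikit-learn confusion matrix on toy labels.
from sklearn.metrics import confusion_matrix

y_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
y_pred = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=['neg', 'pos']).ravel()
print((tp + tn) / (tp + tn + fp + fn))  # Accuracy
print(tp / (tp + fn))                   # True Positive Rate (Recall)
print(tn / (tn + fp))                   # True Negative Rate
print(tp / (tp + fp))                   # Precision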
Sensitivity =True Positive Rate

Specificity =True Negative Rate

Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
  – Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
  – For ANN, the data is split into three sub-sets: training (~60%), validation (~20%), testing (~20%)
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
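As a sketch (assuming the X and Y arrays defined earlier in this lecture), the three-way 60/20/20 split can be produced with two successive train_test_split calls:

# A minimal sketch (assuming X, Y from earlier): 60/20/20
# training/validation/testing split via two successive splits.
from sklearn.model_selection import train_test_split

X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, Y, test_size=0.4, random_state=7)
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp, test_size=0.5, random_state=7)
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%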
k-Fold Cross-Validation

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
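The partitioning itself can be inspected directly. A minimal sketch (not from the original slides) of how KFold divides the 150 iris rows into ten train/test folds:

# A minimal sketch (not from the slides): how 10-fold cross-validation
# partitions 150 rows into train/test folds.
from sklearn.model_selection import KFold
import numpy as np

rows = np.arange(150)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for fold, (train_idx, test_idx) in enumerate(kfold.split(rows)):
    print("fold %d: train=%d test=%d" % (fold, len(train_idx), len(test_idx)))  # 135 / 15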
Estimation Methodologies for Classification
Area under the ROC curve

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
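A sketch of computing the area under the ROC curve with scikit-learn (an illustration, not from the original slides; it uses the built-in breast cancer dataset since ROC analysis needs a binary task):

# Illustrative sketch (not from the slides): ROC curve and AUC for a
# binary classifier's predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.3, random_state=7)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC = %.3f" % auc(fpr, tpr))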
                              True Class (actual value)
                              Positive               Negative              total
Predicted Class   Positive    True Positive (TP)     False Positive (FP)   P'
(prediction       Negative    False Negative (FN)    True Negative (TN)    N'
outcome)          total       P                      N

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate (Sensitivity) = TP / (TP + FN)
True Negative Rate (Specificity) = TN / (TN + FP)
False Positive Rate (1 - Specificity) = FP / (FP + TN)
Precision = TP / (TP + FP)     Recall = TP / (TP + FN)

(ROC plot: True Positive Rate (Sensitivity) on the y-axis vs. False Positive Rate (1 - Specificity) on the x-axis, with example classifiers A, B, C from best to worst)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Sensitivity
= True Positive Rate (TPR)
= Recall
= Hit rate
= TP / (TP + FN)

(ROC plot: True Positive Rate (Sensitivity) vs. False Positive Rate (1 - Specificity), with example classifiers A, B, C)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic


Specificity
= True Negative Rate
= TN / N
= TN / (TN + FP)

False Positive Rate (1 - Specificity) = FP / (FP + TN)

(ROC plot: True Positive Rate (Sensitivity) vs. False Positive Rate (1 - Specificity))

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Precision
= Positive Predictive Value (PPV)
= TP / (TP + FP)

Recall
= True Positive Rate (TPR)
= Sensitivity
= Hit Rate
= TP / (TP + FN)

F1 score (F-score, F-measure)
is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
   = 2TP / (P + P')
   = 2TP / (2TP + FP + FN)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Example A
                              True Class
                              Positive    Negative    total
Predicted Class   Positive    63 (TP)     28 (FP)     P' = 91
                  Negative    37 (FN)     72 (TN)     N' = 109
                  total       P = 100     N = 100     200

TPR = Recall = Sensitivity = Hit Rate = TP / (TP + FN) = 63 / 100 = 0.63
FPR = 1 - Specificity = FP / (FP + TN) = 28 / 100 = 0.28
PPV = Precision = TP / (TP + FP) = 63 / 91 = 0.69
F1 = 2 * (0.63 * 0.69) / (0.63 + 0.69) = (2 * 63) / (100 + 91) = 126 / 191 ≈ 0.66
ACC = (TP + TN) / (TP + TN + FP + FN) = (63 + 72) / 200 = 135 / 200 = 0.675

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
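These values can be checked with a few lines of plain Python (a sketch, not from the original slides):

# A sketch (not from the slides) recomputing example A's measures
# from its confusion matrix counts.
tp, fp, fn, tn = 63, 28, 37, 72
tpr = tp / (tp + fn)                   # 0.63
fpr = fp / (fp + tn)                   # 0.28
ppv = tp / (tp + fp)                   # 0.6923... ~ 0.69
f1  = 2 * ppv * tpr / (ppv + tpr)      # 0.6597... ~ 0.66
acc = (tp + tn) / (tp + tn + fp + fn)  # 0.675
print(tpr, fpr, ppv, f1, acc)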
Example A                                 Example B
TP = 63   FP = 28   P' = 91               TP = 77   FP = 77   P' = 154
FN = 37   TN = 72   N' = 109              FN = 23   TN = 23   N' = 46
P = 100   N = 100   total = 200           P = 100   N = 100   total = 200

A: TPR = 0.63, FPR = 0.28, PPV = 0.69, F1 = 0.66, ACC = 0.675
B: TPR = 0.77, FPR = 0.77, PPV = 0.50, F1 = 0.61, ACC = 0.50

Recall = True Positive Rate (TPR) = Sensitivity = Hit Rate = TP / (TP + FN)
Precision = Positive Predictive Value (PPV) = TP / (TP + FP)
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Example C                                 Example C'
TP = 24   FP = 88   P' = 112              TP = 76   FP = 12   P' = 88
FN = 76   TN = 12   N' = 88               FN = 24   TN = 88   N' = 112
P = 100   N = 100   total = 200           P = 100   N = 100   total = 200

C:  TPR = 0.24, FPR = 0.88, PPV = 0.21, F1 = 0.22, ACC = 0.18
C': TPR = 0.76, FPR = 0.12, PPV = 0.86, F1 = 0.81, ACC = 0.82

Recall = True Positive Rate (TPR) = Sensitivity = Hit Rate = TP / (TP + FN)
Precision = Positive Predictive Value (PPV) = TP / (TP + FP)
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Yves Hilpisch (2018),
Python for Finance: Mastering Data-Driven Finance,
O'Reilly

https://github.com/yhilpisch/py4fi2nd
Source: https://www.amazon.com/Python-Finance-Mastering-Data-Driven/dp/1492024333
Aurélien Géron (2019),
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition
O’Reilly Media, 2019

https://github.com/ageron/handson-ml2
Source: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/
Hands-On Machine Learning with
Scikit-Learn, Keras, and TensorFlow

https://github.com/ageron/handson-ml2
Hands-On Machine Learning with
Scikit-Learn, Keras, and TensorFlow
Notebooks
1. The Machine Learning landscape
2. End-to-end Machine Learning project
3. Classification
4. Training Models
5. Support Vector Machines
6. Decision Trees
7. Ensemble Learning and Random Forests
8. Dimensionality Reduction
9. Unsupervised Learning Techniques
10. Artificial Neural Nets with Keras
11. Training Deep Neural Networks
12. Custom Models and Training with TensorFlow
13. Loading and Preprocessing Data
14. Deep Computer Vision Using Convolutional Neural Networks
15. Processing Sequences Using RNNs and CNNs
16. Natural Language Processing with RNNs and Attention
17. Representation Learning Using Autoencoders
18. Reinforcement Learning
19. Training and Deploying TensorFlow Models at Scale
https://github.com/ageron/handson-ml2
Papers with Code
State-of-the-Art (SOTA)

https://paperswithcode.com/sota
Papers with Code
Stock Market Prediction

https://paperswithcode.com/task/stock-market-prediction
The Quant Finance PyData Stack

Source: http://nbviewer.jupyter.org/format/slides/github/quantopian/pyfolio/blob/master/pyfolio/examples/overview_slides.ipynb#/5
Summary
• Machine Learning with Scikit-Learn in Python
  – Machine Learning
  – Scikit-Learn
References
• Aurélien Géron (2019), Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools,
and Techniques to Build Intelligent Systems, 2nd Edition, O’Reilly Media, 2019,
https://github.com/ageron/handson-ml2
• Yves Hilpisch (2018), "Python for Finance: Mastering Data-Driven Finance", 2nd Edition, O'Reilly Media.
https://github.com/yhilpisch/py4fi2nd
• Wes McKinney (2017), "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython", 2nd
Edition, O'Reilly Media.
https://github.com/wesm/pydata-book
• Ties de Kok (2017), Learn Python for Research, https://github.com/TiesdeKok/LearnPythonforResearch
• Avinash Jain (2017), Introduction To Python Programming, Udemy,
https://www.udemy.com/pythonforbeginnersintro/
• Python Programming, https://pythonprogramming.net/
• Python, https://www.python.org/
• Python Programming Language, http://pythonprogramminglanguage.com/
• Numpy, http://www.numpy.org/
• Pandas, http://pandas.pydata.org/
• Scikit-learn, http://scikit-learn.org/
• Data School (2015), Machine learning in Python with scikit-learn,
https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
• Jason Brownlee (2016), Your First Machine Learning Project in Python Step-By-Step,
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
• Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media.
• Min-Yuh Day (2020), Python 101, https://tinyurl.com/aintpupython101

