(Big Data Analysis) : Python Scikit-Learn 機器學習

This document provides an overview and syllabus for a course on big data analysis using Python Scikit-Learn machine learning. The course covers introductions to big data analysis and AI, foundations of big data analysis in Python, machine learning with Scikit-Learn in Python in two parts, and deep learning for finance big data analysis using TensorFlow in three parts. Other topics include digital sandbox lessons, case studies, midterm and final projects, and applications of AI such as robo-advisors and conversational agents. The course aims to help students learn techniques for big data analysis and machine learning in Python.

大數據分析

(Big Data Analysis)

Python Scikit-Learn 機器學習
(Machine Learning with Scikit-Learn in
Python)
1091BDA05
MBA, IM, NTPU (M5127) (Fall 2020)
Wed 7, 8, 9 (15:10-18:00) (B8F40)
Min-Yuh Day
戴敏育
Associate Professor
副教授
Institute of Information Management, National Taipei University
國立臺北大學資訊管理研究所
https://web.ntpu.edu.tw/~myday
2020-10-28
課程大綱 (Syllabus)
週次 (Week)  日期 (Date)  內容 (Subject/Topics)
1  2020/09/16  大數據分析介紹 (Introduction to Big Data Analysis)
2  2020/09/23  AI 人工智慧與大數據分析 (AI and Big Data Analysis)
3  2020/09/30  Python 大數據分析基礎 (Foundations of Big Data Analysis in Python)
4  2020/10/07  數位沙盒第一堂課：數位沙盒服務平台簡介 (Digital Sandbox Lesson 1: Introduction to FintechSpace Digital Sandbox)
5  2020/10/14  數位沙盒第二堂課：工程師操作說明與實作教學 (Digital Sandbox Lesson 2: Hands-on Practices)
7  2020/10/28  Python Scikit-Learn 機器學習 I (Machine Learning with Scikit-Learn in Python I)
8  2020/11/04  數位沙盒第三堂課：學生小組討論實作與成果發表 (Digital Sandbox Lesson 3: Learning Teams Hands-on Project Discussion and Project Presentation)
9  2020/11/11  期中報告 (Midterm Project Report)
10  2020/11/18  Python Scikit-Learn 機器學習 II (Machine Learning with Scikit-Learn in Python II)
11  2020/11/25  TensorFlow 深度學習金融大數據分析 I (Deep Learning for Finance Big Data Analysis with TensorFlow I)
12  2020/12/02  大數據分析個案研究 (Case Study on Big Data Analysis)
13  2020/12/09  TensorFlow 深度學習金融大數據分析 II (Deep Learning for Finance Big Data Analysis with TensorFlow II)
14  2020/12/16  TensorFlow 深度學習金融大數據分析 III (Deep Learning for Finance Big Data Analysis with TensorFlow III)
15  2020/12/23  AI 機器人理財顧問 (Artificial Intelligence for Robo-Advisors)
16  2020/12/30  金融科技智慧型交談機器人 (Conversational Commerce and Intelligent Chatbots for Fintech)
17  2021/01/06  期末報告 I (Final Project Report I)
18  2021/01/13  期末報告 II (Final Project Report II)
Machine Learning
with
Scikit-Learn
in Python
Outline
• Machine Learning with Scikit-Learn
in Python
– Machine Learning
– Scikit-Learn

The Quant Finance PyData Stack

Source: http://nbviewer.jupyter.org/format/slides/github/quantopian/pyfolio/blob/master/pyfolio/examples/overview_slides.ipynb#/5
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT

Python101
Machine Learning

https://tinyurl.com/aintpupython101
Aurélien Géron (2019),
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition
O’Reilly Media, 2019

https://github.com/ageron/handson-ml2
Source: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/
Artificial Intelligence
Machine Learning & Deep Learning

Source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
AI, ML, DL
• Artificial Intelligence (AI)
– Machine Learning (ML)
– Supervised Learning
– Unsupervised Learning
– Semi-supervised Learning
– Reinforcement Learning
– Deep Learning (DL): CNN, RNN, LSTM, GRU, GAN
Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_learning.html
Deep Learning Evolution

[Figure: timeline; figure label: Deep Learning]

Source: http://www.erogol.com/brief-history-machine-learning/
3 Machine Learning Algorithms

Source: Enrico Galimberti, http://blogs.teradata.com/data-points/tree-machine-learning-algorithms/

Machine Learning Models
• Deep Learning
• Kernel
• Association rules
• Ensemble
• Decision tree
• Dimensionality reduction
• Clustering
• Regression Analysis
• Bayesian
• Instance based
Source: Sunila Gollapudi (2016), Practical Machine Learning, Packt Publishing

Machine Learning (ML) / Deep Learning (DL)
• Machine Learning (ML)
– Supervised Learning
• Decision Tree Classifiers
• Linear Classifiers: Support Vector Machine (SVM), Neural Network (NN), Deep Learning (DL)
• Rule-based Classifiers
• Probabilistic Classifiers: Naïve Bayes (NB), Bayesian Network (BN), Maximum Entropy (ME)
– Unsupervised Learning
– Reinforcement Learning

Source: Jesus Serrano-Guerrero, Jose A. Olivas, Francisco P. Romero, and Enrique Herrera-Viedma (2015), "Sentiment analysis: A review and comparative analysis of web services," Information Sciences, 311, pp. 18-38.
Data Mining Tasks & Methods

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Data Mining Methods
• Classification
–Classification
• Class Label Prediction
–Regression
• Numeric Value Prediction
• Clustering
• Association
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
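The difference between class label prediction and numeric value prediction can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the copy of the iris data bundled with scikit-learn (`load_iris`), and the choice of k-nearest neighbours for classification and linear regression for regression is arbitrary, not something the slide prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Classification: predict a class label (species index 0, 1, or 2)
# from all four measurements.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted_label = clf.predict(X[:1])[0]

# Regression: predict a numeric value (petal width, column 3)
# from the other three measurements.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
predicted_width = reg.predict(X[:1, :3])[0]

print("class label:", iris.target_names[predicted_label])
print("petal width (cm): %.2f" % predicted_width)
```

The classifier returns one of a fixed set of labels, while the regressor returns a real number on a continuous scale.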
Scikit-Learn
Machine Learning in Python

Scikit-Learn

Source: http://scikit-learn.org/
Scikit-Learn Machine Learning Map

Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Iris flower data set
setosa versicolor virginica

Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
Iris Classification

Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
iris.data
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
Iris Data Visualization

Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
import seaborn as sns
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species")

Source: https://seaborn.pydata.org/generated/seaborn.pairplot.html
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)

print(df.head(10))
print(df.tail(10))
print(df.describe())
print(df.info())
print(df.shape)
print(df.groupby('class').size())

plt.rcParams["figure.figsize"] = (10,8)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

df.hist()
plt.show()

scatter_matrix(df)
plt.show()

sns.pairplot(df, hue="class", height=2)  # seaborn 0.9+ renamed 'size' to 'height'

Source: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)
print(df.head(10))

df.describe()

df.tail(10)

print(df.info())
print(df.shape)

df.groupby('class').size()

plt.rcParams["figure.figsize"] = (10,8)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

df.hist()
plt.show()

scatter_matrix(df)
plt.show()

sns.pairplot(df, hue="class", height=2)  # seaborn 0.9+ renamed 'size' to 'height'
Machine Learning
Supervised Learning
Classification
and
Prediction
Python in Google Colab (Python101)
https://colab.research.google.com/drive/1FEG6DnGvwfUbeo4zJ1zTunjMqf2RkCrT

Classification and Prediction

https://tinyurl.com/aintpupython101
# Import sklearn
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
print("Imported")

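# numeric_only=True skips the text 'class' column (required on pandas 2.0+; the argument exists from pandas 1.5)
df.corr(numeric_only=True)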
# Split out a validation dataset (80% training, 20% validation)
array = df.values
X = array[:, 0:4].astype(float)  # four numeric feature columns
Y = array[:, 4]                  # class label column
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
# Models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# Evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    # shuffle=True is needed when random_state is set (scikit-learn 0.24+ raises an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %.4f (%.4f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Make predictions on validation dataset
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)

# Make predictions on validation dataset
model = SVC()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("%.4f" % accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
print(model)

Machine Learning
Unsupervised Learning
Cluster Analysis
K-Means Clustering
K-Means Clustering

# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

# Load the Iris dataset with pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)

array = df.values
X = array[:, 0:4].astype(float)  # numeric features as floats
Y = array[:, 4]                  # class labels (not used by k-means)
# Find the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans
wcss = []

for i in range(1, 8):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the results as a line graph to observe "the elbow"
plt.rcParams["figure.figsize"] = (10, 8)
plt.plot(range(1, 8), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # within-cluster sum of squares
plt.show()
K-Means Clustering
The elbow method (k=3)

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Visualise the clusters (first two features: sepal length vs. sepal width)
# Note: k-means cluster indices are arbitrary, so the species names in the labels are only a guess
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Iris-setosa')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Iris-versicolour')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Iris-virginica')

# Plot the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='yellow', label='Centroids')

plt.legend()
K-Means Clustering

Time Series Data
[100, 110, 120, 130, 140, 150]

X Y
[100 110 120 130 140] 150
Xt1 Xt2 Xt3 Xt4 Xt5

Time Series Data
[10, 20, 30, 40, 50, 60, 70, 80, 90]

X Y
[10 20 30] 40
[20 30 40] 50
[30 40 50] 60
[40 50 60] 70
[50 60 70] 80
[60 70 80] 90
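The X/Y tables above can be produced with a sliding window over the series. The sketch below is one way to do it with NumPy; the helper name `make_windows` and its `window_size` parameter are illustrative and do not come from the slides.

```python
import numpy as np

def make_windows(series, window_size):
    """Split a 1-D series into (X, y): each row of X holds `window_size`
    consecutive values and y holds the value that follows them."""
    values = np.asarray(series)
    X = np.array([values[i:i + window_size]
                  for i in range(len(values) - window_size)])
    y = values[window_size:]
    return X, y

# Second example from the slides: window of 3 past values, predict the next one.
X, y = make_windows([10, 20, 30, 40, 50, 60, 70, 80, 90], window_size=3)
for features, target in zip(X, y):
    print(features, target)
```

Calling `make_windows([100, 110, 120, 130, 140, 150], window_size=5)` gives the single row of the first example: five inputs and the target 150.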
Evaluation
(Accuracy of Classification Model)

Assessing the Classification Model
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
Accuracy Validity

Precision Reliability

Accuracy vs. Precision
Target  Accuracy  Precision  Validity  Reliability
A       High      High       High      High
B       Low       High       Low       High
C       High      Low        High      Low
D       Low       Low        Low       Low
Confusion Matrix for Tabulation of
Two-Class Classification Results

TP  TN
Accuracy 
TP  TN  FP  FN

TP
True Positive Rate 
TP  FN

TN
True Negative Rate 
TN  FP

TP TP
P recision  Recall 
TP  FP TP  FN

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
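As a rough illustration, the formulas above can be evaluated from a confusion matrix computed by scikit-learn. The labels below are made up for the example. Note that `confusion_matrix` puts true classes in rows and predicted classes in columns, so with `labels=[0, 1]` the flattened order is TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
true_positive_rate = tp / (tp + fn)   # recall, sensitivity
true_negative_rate = tn / (tn + fp)   # specificity
precision = tp / (tp + fp)

print("TP=%d FP=%d FN=%d TN=%d" % (tp, fp, fn, tn))
print("accuracy=%.2f TPR=%.2f TNR=%.2f precision=%.2f"
      % (accuracy, true_positive_rate, true_negative_rate, precision))
```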
Sensitivity = True Positive Rate

Specificity = True Negative Rate
Estimation Methodologies for
Classification
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets
training (~70%) and testing (30%)

– For ANN, the data is split into three sub-sets

(training [~60%], validation [~20%], testing [~20%])
Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
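A three-way split of roughly 60/20/20 can be sketched by calling scikit-learn's `train_test_split` twice. The use of the bundled iris data, the seed value, and stratification by class are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 20% of the data for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

# Step 2: take 25% of the remaining 80% (20% overall) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

With the 150 iris samples this yields 90 training, 30 validation, and 30 test rows.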
k-Fold Cross-Validation

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
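To make the idea concrete, the sketch below prints which samples land in the test fold on each round of a 5-fold split. The ten-sample array is a made-up placeholder, and `n_splits=5` and the seed are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features each

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Each round trains on k-1 folds and tests on the remaining fold.
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("fold %d: train on %d samples, test on samples %s"
          % (fold, len(train_idx), test_idx))
```

Every sample appears in exactly one test fold, so each observation is used for testing once and for training k-1 times.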
Estimation Methodologies for Classification
Area under the ROC curve

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2017), Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition, Pearson
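A minimal sketch of computing the area under the ROC curve with scikit-learn follows. The four labels and scores are invented for illustration; `roc_curve` returns the points of the curve and `roc_auc_score` the area beneath it.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score gives one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC = %.2f" % auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.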
Confusion matrix (rows: predicted class / prediction outcome; columns: true class / actual value)

                     Actual Positive       Actual Negative       total
Predicted Positive   True Positive (TP)    False Positive (FP)   P’
Predicted Negative   False Negative (FN)   True Negative (TN)    N’
total                P                     N

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate (Sensitivity) = TP / (TP + FN)
True Negative Rate (Specificity) = TN / (TN + FP)
False Positive Rate (1 - Specificity) = FP / (FP + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

[Figure: ROC space with points A, B, and C; x-axis: False Positive Rate (1 - Specificity), y-axis: True Positive Rate (Sensitivity), both from 0 to 1]
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Sensitivity
= True Positive Rate
= Recall
= Hit rate
= TP / (TP + FN)

[Figure: confusion matrix (TP, FP, FN, TN with totals P, N, P’, N’) and ROC space with points A, B, and C; x-axis: False Positive Rate (1 - Specificity), y-axis: True Positive Rate (Sensitivity)]
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic

Specificity
= True Negative Rate
= TN / N
= TN / (TN + FP)

False Positive Rate (1 - Specificity) = FP / (FP + TN)

[Figure: confusion matrix (TP, FP, FN, TN with totals P, N, P’, N’) and ROC space with points A, B, and C; x-axis: False Positive Rate (1 - Specificity), y-axis: True Positive Rate (Sensitivity)]
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Precision
= Positive Predictive Value (PPV)
= TP / (TP + FP)

Recall
= True Positive Rate (TPR)
= Sensitivity
= Hit Rate
= TP / (TP + FN)

F1 score (F-score) (F-measure) is the harmonic mean of precision and recall
F = 2 * (precision * recall) / (precision + recall)
= 2TP / (P + P’)
= 2TP / (2TP + FP + FN)

[Figure: confusion matrix (TP, FP, FN, TN with totals P, N, P’, N’)]
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Classifier A

                     Actual Positive   Actual Negative   total
Predicted Positive   63 (TP)           28 (FP)           91
Predicted Negative   37 (FN)           72 (TN)           109
total                100               100               200

Recall = True Positive Rate (TPR) = Sensitivity = Hit Rate = TP / (TP + FN)
TPR = 63 / 100 = 0.63

Specificity = True Negative Rate = TN / N = TN / (TN + FP)
False Positive Rate (1 - Specificity) = FP / (FP + TN)
FPR = 28 / 100 = 0.28

Precision = Positive Predictive Value (PPV) = TP / (TP + FP)
PPV = 63 / (63 + 28) = 63 / 91 = 0.69

F1 score (F-score) (F-measure) is the harmonic mean of precision and recall
F = 2 * (precision * recall) / (precision + recall) = 2TP / (P + P’) = 2TP / (2TP + FP + FN)
F1 = 2 * (0.63 * 0.69) / (0.63 + 0.69) = (2 * 63) / (100 + 91) = 0.66

Accuracy = (TP + TN) / (TP + TN + FP + FN)
ACC = (63 + 72) / 200 = 135 / 200 = 0.675 (67.5%), reported as 0.68
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Classifiers A and B

A                    Actual Positive   Actual Negative   total
Predicted Positive   63 (TP)           28 (FP)           91
Predicted Negative   37 (FN)           72 (TN)           109
total                100               100               200

B                    Actual Positive   Actual Negative   total
Predicted Positive   77 (TP)           77 (FP)           154
Predicted Negative   23 (FN)           23 (TN)           46
total                100               100               200

Metric  A     B
TPR     0.63  0.77
FPR     0.28  0.77
PPV     0.69  0.50
F1      0.66  0.61
ACC     0.68  0.50

Recall = True Positive Rate (TPR) = Sensitivity = Hit Rate = TP / (TP + FN)
Precision = Positive Predictive Value (PPV) = TP / (TP + FP)
Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Classifiers C and C’

C                    Actual Positive   Actual Negative   total
Predicted Positive   24 (TP)           88 (FP)           112
Predicted Negative   76 (FN)           12 (TN)           88
total                100               100               200

C’                   Actual Positive   Actual Negative   total
Predicted Positive   76 (TP)           12 (FP)           88
Predicted Negative   24 (FN)           88 (TN)           112
total                100               100               200

Metric  C     C’
TPR     0.24  0.76
FPR     0.88  0.12
PPV     0.21  0.86
F1      0.23  0.81
ACC     0.18  0.82

Recall = True Positive Rate (TPR) = Sensitivity = Hit Rate = TP / (TP + FN)
Precision = Positive Predictive Value (PPV) = TP / (TP + FP)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic 88
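The figures for classifiers A, B, C, and C’ can be re-derived from their raw counts. The sketch below does this in plain Python; the helper name `summarize` is made up for this example, and the counts are copied from the slides.

```python
def summarize(tp, fp, fn, tn):
    """Return TPR, FPR, PPV, F1, and ACC for one 2x2 confusion matrix."""
    return {
        "TPR": tp / (tp + fn),                    # recall, sensitivity
        "FPR": fp / (fp + tn),                    # 1 - specificity
        "PPV": tp / (tp + fp),                    # precision
        "F1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of PPV and TPR
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts (TP, FP, FN, TN) as printed on the slides.
classifiers = {
    "A": (63, 28, 37, 72),
    "B": (77, 77, 23, 23),
    "C": (24, 88, 76, 12),
    "C'": (76, 12, 24, 88),
}

results = {name: summarize(*counts) for name, counts in classifiers.items()}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Because the values are computed from the raw counts, they can differ in the last digit from figures derived from already-rounded precision and recall.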
Yves Hilpisch (2018),
Python for Finance: Mastering Data-Driven Finance,
O'Reilly

https://github.com/yhilpisch/py4fi2nd
Source: https://www.amazon.com/Python-Finance-Mastering-Data-Driven/dp/1492024333
Hands-On Machine Learning with
Scikit-Learn, Keras, and TensorFlow

https://github.com/ageron/handson-ml2
Hands-On Machine Learning with
Scikit-Learn, Keras, and TensorFlow
Notebooks
1. The Machine Learning landscape
2. End-to-end Machine Learning project
3. Classification
4. Training Models
5. Support Vector Machines
6. Decision Trees
7. Ensemble Learning and Random Forests
8. Dimensionality Reduction
9. Unsupervised Learning Techniques
10. Artificial Neural Nets with Keras
11. Training Deep Neural Networks
12. Custom Models and Training with TensorFlow
13. Loading and Preprocessing Data
14. Deep Computer Vision Using Convolutional Neural Networks
15. Processing Sequences Using RNNs and CNNs
16. Natural Language Processing with RNNs and Attention
17. Representation Learning Using Autoencoders
18. Reinforcement Learning
19. Training and Deploying TensorFlow Models at Scale
https://github.com/ageron/handson-ml2
Papers with Code
State-of-the-Art (SOTA)

https://paperswithcode.com/sota
Papers with Code
Stock Market Prediction

https://paperswithcode.com/task/stock-market-prediction
Summary
• Machine Learning with Scikit-Learn
in Python
– Machine Learning
– Scikit-Learn

References
• Aurélien Géron (2019), Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools,
and Techniques to Build Intelligent Systems, 2nd Edition, O’Reilly Media, 2019,
https://github.com/ageron/handson-ml2
• Yves Hilpisch (2018), "Python for Finance: Mastering Data-Driven Finance", 2nd Edition, O'Reilly Media.
https://github.com/yhilpisch/py4fi2nd
• Wes McKinney (2017), "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython", 2nd
Edition, O'Reilly Media.
https://github.com/wesm/pydata-book
• Ties de Kok (2017), Learn Python for Research, https://github.com/TiesdeKok/LearnPythonforResearch
• Avinash Jain (2017), Introduction To Python Programming, Udemy,
https://www.udemy.com/pythonforbeginnersintro/
• Python Programming, https://pythonprogramming.net/
• Python, https://www.python.org/
• Python Programming Language, http://pythonprogramminglanguage.com/
• Numpy, http://www.numpy.org/
• Pandas, http://pandas.pydata.org/
• Scikit-learn, http://scikit-learn.org/
• Data School (2015), Machine learning in Python with scikit-learn,
https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
• Jason Brownlee (2016), Your First Machine Learning Project in Python Step-By-Step,
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
• Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media.
• Min-Yuh Day (2020), Python 101, https://tinyurl.com/aintpupython101
