Machine Learning on Massive Datasets
Alexander Gray
Georgia Institute of Technology
College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools
Could be large:
N (#data), D (#features), M (#models)
What we'd like
Core methods of statistics / machine learning / mining, with their naive computational costs:
! Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
! Density estimation: kernel density estimation O(N^2) (sketched below), mixture of Gaussians O(N)
! Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
! Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
! Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
! Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
! Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
! Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
! 2-sample testing: n-point correlation O(N^n)
! Cross-match: bipartite matching O(N^3)
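To make the quadratic bottleneck concrete, here is a minimal sketch of the naive kernel density estimation referenced above (Python and the Gaussian kernel are illustrative choices; the slides do not fix either). Each of the N query evaluations sums over all N reference points, hence O(N^2) kernel evaluations.

    import numpy as np

    def naive_kde(queries, references, h):
        """Naive Gaussian KDE: O(N^2) kernel evaluations for N queries on N points."""
        n, d = references.shape
        norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0) / n   # Gaussian normalization
        densities = np.empty(len(queries))
        for i, q in enumerate(queries):
            # inner pass over all N reference points -- the quadratic bottleneck
            sq_dists = np.sum((references - q) ** 2, axis=1)
            densities[i] = norm * np.sum(np.exp(-sq_dists / (2.0 * h ** 2)))
        return densities

    # N = 1,000 points in D = 3 already costs 10^6 kernel evaluations.
    X = np.random.randn(1000, 3)
    print(naive_kde(X, X, h=0.5)[:5])    # h is an illustrative bandwidth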
5 main computational bottlenecks:
Aggregations, GNPs (generalized N-body problems), graphical models, linear algebra, optimization
Multi-scale decompositions, e.g. kd-trees
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
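A minimal kd-tree construction sketch follows (Python; the median split on the widest dimension and the leaf_size threshold are common but illustrative choices, not necessarily those of the cited papers). The bounding boxes stored at each node are what the later pruning rules exploit.

    import numpy as np

    class KDNode:
        """One kd-tree node: a bounding box over a subset of the points."""
        def __init__(self, points):
            self.lo = points.min(axis=0)   # box corners, used later for pruning
            self.hi = points.max(axis=0)
            self.points = points
            self.left = self.right = None

    def build_kdtree(points, leaf_size=20):
        """Recursively split at the median of the widest dimension."""
        node = KDNode(points)
        if len(points) > leaf_size:
            dim = int(np.argmax(node.hi - node.lo))   # dimension of widest spread
            median = np.median(points[:, dim])
            mask = points[:, dim] <= median
            if 0 < mask.sum() < len(points):          # guard against degenerate splits
                node.left = build_kdtree(points[mask], leaf_size)
                node.right = build_kdtree(points[~mask], leaf_size)
        return node

    root = build_kdtree(np.random.rand(10000, 2))     # O(N log N) construction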
How can we compute these efficiently?
! Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009] (sketched after this list)
! Hierarchical series expansions for kernel summations [2004, 2006, 2008]
! Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
! Stochastic process approximations for time series [2009]
! Monte Carlo optimization: online, progressive [2009]
! Parallel computing [1998, 2006, 2009]
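A minimal sketch of the dual-tree idea behind generalized N-body algorithms, reusing KDNode and build_kdtree from above. Whole node pairs are pruned when their bounding boxes prove that every cross-pair lies inside (or outside) the range; shown here for pair counting, the 2-point correlation kernel. The recursion rule and leaf handling are illustrative choices, not the exact algorithms of the cited papers.

    def min_box_dist2(a, b):
        """Squared lower bound on distances between points of boxes a and b."""
        gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
        return float(np.sum(gap ** 2))

    def max_box_dist2(a, b):
        """Squared upper bound on distances between points of boxes a and b."""
        far = np.maximum(np.abs(a.hi - b.lo), np.abs(b.hi - a.lo))
        return float(np.sum(far ** 2))

    def dual_tree_count(q, r, radius):
        """Count (query, reference) pairs within radius, pruning node pairs."""
        if min_box_dist2(q, r) > radius ** 2:
            return 0                                  # exclusion: no pair qualifies
        if max_box_dist2(q, r) <= radius ** 2:
            return len(q.points) * len(r.points)      # inclusion: all pairs qualify
        if q.left is None and r.left is None:         # both leaves: brute force
            diff = q.points[:, None, :] - r.points[None, :, :]
            return int(np.sum(np.sum(diff ** 2, axis=2) <= radius ** 2))
        # otherwise recurse, splitting the larger (or only splittable) node
        if q.left is None or (r.left is not None and len(r.points) >= len(q.points)):
            return dual_tree_count(q, r.left, radius) + dual_tree_count(q, r.right, radius)
        return dual_tree_count(q.left, r, radius) + dual_tree_count(q.right, r, radius)

    pairs = dual_tree_count(root, root, radius=0.1)   # includes self-pairs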
Computational complexity using fast algorithms
! Querying: nearest-neighbor O(log N) (sketched after this list), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
! Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
! Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
! Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
! Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
! Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
! Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
! Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
! 2-sample testing: n-point correlation O(N^log n)
! Cross-match: bipartite matching O(N) or O(1)
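Where the per-query O(log N) claims come from, in sketch form: branch-and-bound nearest-neighbor search on the kd-tree built earlier. A subtree is pruned whenever its bounding box provably cannot beat the best distance found so far; O(log N) is the typical-case behavior on well-behaved data, not a worst-case guarantee.

    def nn_search(node, q, best=(np.inf, None)):
        """Branch-and-bound nearest neighbor: returns (squared distance, point)."""
        gap = np.maximum(0.0, np.maximum(node.lo - q, q - node.hi))
        if float(np.sum(gap ** 2)) >= best[0]:
            return best                               # prune: box cannot beat best
        if node.left is None:                         # leaf: scan its points
            d2 = np.sum((node.points - q) ** 2, axis=1)
            i = int(np.argmin(d2))
            return (d2[i], node.points[i]) if d2[i] < best[0] else best
        # descend into the nearer child first to tighten the bound early
        children = sorted(
            (node.left, node.right),
            key=lambda c: np.sum(np.maximum(0.0, np.maximum(c.lo - q, q - c.hi)) ** 2))
        for child in children:
            best = nn_search(child, q, best)
        return best

    d2, neighbor = nn_search(root, np.random.rand(2))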
Issues
! How to disseminate/integrate?
! In-database/centralized or not?
! Trust of complex algorithms?
! Other statistical/ML needs?