
Machine Learning on Massive Datasets

Alexander Gray
Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools

Overview: traditional machine learning algorithms hit computational bottlenecks on large datasets. This talk proposes multi-scale decompositions and generalized N-body algorithms that reduce the complexity of many common machine learning tasks from quadratic or cubic to linear or logarithmic time, shows the resulting speedups for two algorithms on very large real-world datasets, and briefly describes software implementing these methods.

The problem: big datasets

[Figure: an N-by-D data matrix]

Any of these could be large:
N (# data points), D (# features), M (# models)
What we'd like

Allow users to apply all the state-of-the-art statistical methods...
...with orders-of-magnitude more computational efficiency.

Core methods of
statistics / machine learning / mining

- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^n)
- Cross-match: bipartite matching O(N^3)

Five main computational bottlenecks: aggregations, generalized N-body problems (GNPs), graphical models, linear algebra, and optimization (a naive example of the first two follows below).
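Aggregations and GNPs share a common shape: every query point must be combined with every reference point, which is where the quadratic costs above come from. A minimal sketch of the naive O(N^2) kernel density estimate (a hypothetical illustration, not code from the talk):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive Gaussian kernel density estimate at each query point.
// Every query interacts with every reference point: O(N_q * N_r) work.
std::vector<double> naiveKde(const std::vector<std::vector<double>>& queries,
                             const std::vector<std::vector<double>>& refs,
                             double bandwidth) {
    std::vector<double> density(queries.size(), 0.0);
    for (std::size_t q = 0; q < queries.size(); ++q) {
        for (const auto& r : refs) {
            double dist2 = 0.0;
            for (std::size_t d = 0; d < r.size(); ++d) {
                double diff = queries[q][d] - r[d];
                dist2 += diff * diff;
            }
            density[q] += std::exp(-dist2 / (2.0 * bandwidth * bandwidth));
        }
        density[q] /= refs.size();  // unnormalized kernel average
    }
    return density;
}
```

Every method tagged O(N^2) in the list above contains a double loop of essentially this shape; the techniques that follow attack that loop directly.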


How can we compute these efficiently?

Multi-scale decompositions, e.g. kd-trees
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
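To make the decomposition concrete, here is a minimal kd-tree build, assuming a median split on a depth-cycled dimension and a fixed leaf size; the cited papers refine both choices:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

using Point = std::vector<double>;

// A kd-tree node: splits points on one dimension at the median,
// cycling dimensions with depth. Leaves hold small point sets.
struct KdNode {
    int splitDim = -1;
    double splitVal = 0.0;
    std::vector<Point> points;            // nonempty only at leaves
    std::unique_ptr<KdNode> left, right;
};

std::unique_ptr<KdNode> build(std::vector<Point> pts, int depth = 0,
                              std::size_t leafSize = 16) {
    auto node = std::make_unique<KdNode>();
    if (pts.size() <= leafSize) {         // small enough: make a leaf
        node->points = std::move(pts);
        return node;
    }
    int dim = depth % static_cast<int>(pts[0].size());
    std::size_t mid = pts.size() / 2;
    // Partition around the median coordinate in O(N) without a full sort.
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [dim](const Point& a, const Point& b) { return a[dim] < b[dim]; });
    node->splitDim = dim;
    node->splitVal = pts[mid][dim];
    node->left = build({pts.begin(), pts.begin() + mid}, depth + 1, leafSize);
    node->right = build({pts.begin() + mid, pts.end()}, depth + 1, leafSize);
    return node;
}
```

Each node owns an axis-aligned region of the data, so distance bounds between whole regions can be computed cheaply — the property the algorithms on the next slide exploit.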
How can we compute these efficiently?

- Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009] (see the sketch after this list)
- Hierarchical series expansions for kernel summations [2004, 2006, 2008]
- Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
- Stochastic process approximations for time series [2009]
- Monte Carlo optimization: online, progressive [2009]
- Parallel computing [1998, 2006, 2009]
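The dual-tree pattern behind generalized N-body algorithms, in miniature: compare a query node and a reference node through their bounding boxes, count whole node pairs when the bounds already decide the answer, and recurse only on ambiguous pairs. The node layout below is an assumption for illustration (tree construction as in the kd-tree sketch above), not MLPACK's actual data structure:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// A tree node that knows its axis-aligned bounding box and point count.
struct Node {
    std::vector<double> lo, hi;               // bounding-box corners
    std::size_t count = 0;                    // points under this node
    const Node* left = nullptr;
    const Node* right = nullptr;
    std::vector<std::vector<double>> points;  // leaf points
    bool isLeaf() const { return left == nullptr; }
};

// Smallest and largest possible distance between two boxes.
double minDist(const Node& a, const Node& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.lo.size(); ++d) {
        double gap = std::max({0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]});
        s += gap * gap;
    }
    return std::sqrt(s);
}
double maxDist(const Node& a, const Node& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.lo.size(); ++d) {
        double w = std::max(std::abs(a.hi[d] - b.lo[d]), std::abs(b.hi[d] - a.lo[d]));
        s += w * w;
    }
    return std::sqrt(s);
}

// Dual-tree count of pairs (q, r) with distance <= radius.
std::size_t pairCount(const Node& q, const Node& r, double radius) {
    if (minDist(q, r) > radius) return 0;                    // prune: all pairs too far
    if (maxDist(q, r) <= radius) return q.count * r.count;   // prune: all pairs close
    if (q.isLeaf() && r.isLeaf()) {                          // base case: brute force
        std::size_t c = 0;
        for (const auto& a : q.points)
            for (const auto& b : r.points) {
                double s = 0.0;
                for (std::size_t d = 0; d < a.size(); ++d)
                    s += (a[d] - b[d]) * (a[d] - b[d]);
                if (std::sqrt(s) <= radius) ++c;
            }
        return c;
    }
    // Otherwise recurse, splitting the larger (or only splittable) node.
    if (q.isLeaf() || (!r.isLeaf() && r.count > q.count))
        return pairCount(q, *r.left, radius) + pairCount(q, *r.right, radius);
    return pairCount(*q.left, r, radius) + pairCount(*q.right, r, radius);
}
```

The two pruning tests at the top are what turn the naive O(N^2) pair count into something near-linear on clustered data.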

Computational complexity
using fast algorithms

- Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
- Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N) (see the Monte Carlo sketch after this list)
- Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
- Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
- Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
- Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^(log n))
- Cross-match: bipartite matching O(N) or O(1)
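The O(1) entries presuppose approximation: once an error tolerance is accepted, a kernel summation can be estimated from a random sample whose size depends on the tolerance and the variance rather than on N. A minimal Monte Carlo sketch with a fixed sample size m (the multi-scale Monte Carlo methods cited above choose m adaptively; this simplification is ours):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Estimate sum_i K(x, r_i) from a random sample of m reference points.
// Cost is O(m), independent of N: the "O(1) in N" regime.
double mcKernelSum(const std::vector<double>& x,
                   const std::vector<std::vector<double>>& refs,
                   double bandwidth, std::size_t m, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, refs.size() - 1);
    double sampleSum = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        const auto& r = refs[pick(rng)];
        double dist2 = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d) {
            double diff = x[d] - r[d];
            dist2 += diff * diff;
        }
        sampleSum += std::exp(-dist2 / (2.0 * bandwidth * bandwidth));
    }
    // Scale the sample mean back up to the full population size.
    return sampleSum * static_cast<double>(refs.size()) / static_cast<double>(m);
}
```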

Ex: 3-point correlation runtime

[Figure: runtime vs. number of data, with scaling n=2: O(N), n=3: O(N^(log 3)), n=4: O(N^2); biggest previous computation: 20K points]

VIRGO simulation data, N = 75,000,000
naive: 5x10^9 sec. (~150 years)
multi-tree: 55 sec. (exact)
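For concreteness, what the 3-point statistic counts: triples of points whose pairwise separations all fall in a prescribed bin, which is why the naive cost grows as O(N^n) for n-point statistics. A brute-force sketch, feasible only at small N and simplified to a single shared distance bin (real estimators use a separate bin per edge):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Count triples (i, j, k) whose three pairwise distances all lie in
// [rLo, rHi]. Brute force is O(N^3); the multi-tree algorithm instead
// prunes node triples whose bounding boxes already violate the bin.
std::size_t naiveThreePoint(const std::vector<std::vector<double>>& pts,
                            double rLo, double rHi) {
    auto inBin = [&](const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t d = 0; d < a.size(); ++d)
            s += (a[d] - b[d]) * (a[d] - b[d]);
        double dist = std::sqrt(s);
        return dist >= rLo && dist <= rHi;
    };
    std::size_t count = 0;
    for (std::size_t i = 0; i < pts.size(); ++i)
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            if (!inBin(pts[i], pts[j])) continue;   // early exit on first edge
            for (std::size_t k = j + 1; k < pts.size(); ++k)
                if (inBin(pts[i], pts[k]) && inBin(pts[j], pts[k])) ++count;
        }
    return count;
}
```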

Ex: support vector machine

Data: IJCNN1 [DP01a]
2 classes, 49,990 training points, 91,701 testing points, 22 features

SMO: O(N^2) to O(N^3)
SFW: O(N/ε + 1/ε^2)

SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec.
SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec.
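One way to read the support-vector counts: a Frank-Wolfe style solver moves toward a single vertex of the feasible simplex per iteration, so at most one new point can enter the support set per step, and 4,500 iterations can touch at most 4,500 points. The sketch below shows the textbook Frank-Wolfe step for a simplex-constrained quadratic; it illustrates that mechanism and is not a reproduction of the SFW algorithm from the slide:

```cpp
#include <cstddef>
#include <vector>

// Frank-Wolfe for min f(alpha) = 0.5 * alpha^T Q alpha over the simplex
// {alpha >= 0, sum alpha = 1}. Each step moves toward the vertex with the
// most negative gradient coordinate, so at most one coordinate becomes
// newly nonzero per iteration -- the source of sparse support sets.
std::vector<double> frankWolfe(const std::vector<std::vector<double>>& Q,
                               int iterations) {
    std::size_t n = Q.size();
    std::vector<double> alpha(n, 0.0);
    alpha[0] = 1.0;                              // start at a vertex
    for (int t = 0; t < iterations; ++t) {
        // Gradient of the quadratic: grad = Q * alpha (naive O(n^2) here).
        std::vector<double> grad(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                grad[i] += Q[i][j] * alpha[j];
        // Linear minimization over the simplex picks a single vertex.
        std::size_t best = 0;
        for (std::size_t i = 1; i < n; ++i)
            if (grad[i] < grad[best]) best = i;
        // Standard diminishing step size; line search also works.
        double gamma = 2.0 / (t + 2.0);
        for (std::size_t i = 0; i < n; ++i) alpha[i] *= (1.0 - gamma);
        alpha[best] += gamma;
    }
    return alpha;
}
```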
Software

- MLPACK (C++): the first scalable comprehensive ML library
- MLPACK-db: fast data analytics in relational databases (SQL Server)
- MLPACK Pro: very-large-scale data
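For reference, a minimal k-nearest-neighbor query against MLPACK. The single header and the mlpack::KNN typedef below assume an mlpack 4-style API; earlier releases use per-method headers and nested namespaces, so treat this as a sketch:

```cpp
#include <mlpack.hpp>  // mlpack 4 single header; older versions use per-method headers

int main() {
    // 3-dimensional reference points, one column per point (Armadillo layout).
    arma::mat data(3, 1000, arma::fill::randu);

    // KNN builds a kd-tree by default and answers queries with
    // tree-based search instead of brute force.
    mlpack::KNN knn(data);

    arma::Mat<size_t> neighbors;   // indices of each point's neighbors
    arma::mat distances;           // matching distances
    knn.Search(5, neighbors, distances);  // 5 nearest neighbors of every point

    return 0;
}
```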

Issues

- How to disseminate/integrate?
- In-database/centralized or not?
- Trust of complex algorithms?
- Other statistical/ML needs?
