Machine Learning on Massive Datasets
Alexander Gray
Georgia Institute of Technology
College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools
Could be large:
N (#data), D (#features), M (#models)
What we'd like
Core methods of statistics / machine learning / mining, with their naive computational costs:
! Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
! Density estimation: kernel density estimation O(N^2) (sketched below), mixture of Gaussians O(N)
! Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
! Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
! Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
! Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
! Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
! Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
! 2-sample testing: n-point correlation O(N^n)
! Cross-match: bipartite matching O(N^3)
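To make the quadratic bottleneck concrete, here is a minimal sketch of the naive kernel density estimation referenced above (Python and the Gaussian kernel are illustrative choices; the slides do not fix either). Each of the N query evaluations sums over all N reference points, hence O(N^2) kernel evaluations.

    import numpy as np

    def naive_kde(queries, references, h):
        """Naive Gaussian KDE: O(N^2) kernel evaluations for N queries on N points."""
        n, d = references.shape
        norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0) / n   # Gaussian normalization
        densities = np.empty(len(queries))
        for i, q in enumerate(queries):
            # inner pass over all N reference points -- the quadratic bottleneck
            sq_dists = np.sum((references - q) ** 2, axis=1)
            densities[i] = norm * np.sum(np.exp(-sq_dists / (2.0 * h ** 2)))
        return densities

    # N = 1,000 points in D = 3 already costs 10^6 kernel evaluations.
    X = np.random.randn(1000, 3)
    print(naive_kde(X, X, h=0.5)[:5])    # h is an illustrative bandwidth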
5 main computational bottlenecks:
Aggregations, GNPs (generalized N-body problems), graphical models, linear algebra, optimization
Multi-scale decompositions, e.g. kd-trees
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
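A minimal kd-tree construction sketch follows (Python; the median split on the widest dimension and the leaf_size threshold are common but illustrative choices, not necessarily those of the cited papers). The bounding boxes stored at each node are what the later pruning rules exploit.

    import numpy as np

    class KDNode:
        """One kd-tree node: a bounding box over a subset of the points."""
        def __init__(self, points):
            self.lo = points.min(axis=0)   # box corners, used later for pruning
            self.hi = points.max(axis=0)
            self.points = points
            self.left = self.right = None

    def build_kdtree(points, leaf_size=20):
        """Recursively split at the median of the widest dimension."""
        node = KDNode(points)
        if len(points) > leaf_size:
            dim = int(np.argmax(node.hi - node.lo))   # dimension of widest spread
            median = np.median(points[:, dim])
            mask = points[:, dim] <= median
            if 0 < mask.sum() < len(points):          # guard against degenerate splits
                node.left = build_kdtree(points[mask], leaf_size)
                node.right = build_kdtree(points[~mask], leaf_size)
        return node

    root = build_kdtree(np.random.rand(10000, 2))     # O(N log N) construction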
How can we compute these efficiently?
! Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009] (sketched after this list)
! Hierarchical series expansions for kernel summations [2004, 2006, 2008]
! Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
! Stochastic process approximations for time series [2009]
! Monte Carlo optimization: online, progressive [2009]
! Parallel computing [1998, 2006, 2009]
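A minimal sketch of the dual-tree idea behind generalized N-body algorithms, reusing KDNode and build_kdtree from above. Whole node pairs are pruned when their bounding boxes prove that every cross-pair lies inside (or outside) the range; shown here for pair counting, the 2-point correlation kernel. The recursion rule and leaf handling are illustrative choices, not the exact algorithms of the cited papers.

    def min_box_dist2(a, b):
        """Squared lower bound on distances between points of boxes a and b."""
        gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
        return float(np.sum(gap ** 2))

    def max_box_dist2(a, b):
        """Squared upper bound on distances between points of boxes a and b."""
        far = np.maximum(np.abs(a.hi - b.lo), np.abs(b.hi - a.lo))
        return float(np.sum(far ** 2))

    def dual_tree_count(q, r, radius):
        """Count (query, reference) pairs within radius, pruning node pairs."""
        if min_box_dist2(q, r) > radius ** 2:
            return 0                                  # exclusion: no pair qualifies
        if max_box_dist2(q, r) <= radius ** 2:
            return len(q.points) * len(r.points)      # inclusion: all pairs qualify
        if q.left is None and r.left is None:         # both leaves: brute force
            diff = q.points[:, None, :] - r.points[None, :, :]
            return int(np.sum(np.sum(diff ** 2, axis=2) <= radius ** 2))
        # otherwise recurse, splitting the larger (or only splittable) node
        if q.left is None or (r.left is not None and len(r.points) >= len(q.points)):
            return dual_tree_count(q, r.left, radius) + dual_tree_count(q, r.right, radius)
        return dual_tree_count(q.left, r, radius) + dual_tree_count(q.right, r, radius)

    pairs = dual_tree_count(root, root, radius=0.1)   # includes self-pairs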
Computational complexity using fast algorithms
! Querying: nearest-neighbor O(log N) (sketched after this list), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
! Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
! Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
! Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
! Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
! Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
! Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
! Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
! 2-sample testing: n-point correlation O(N^log n)
! Cross-match: bipartite matching O(N) or O(1)
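Where the per-query O(log N) claims come from, in sketch form: branch-and-bound nearest-neighbor search on the kd-tree built earlier. A subtree is pruned whenever its bounding box provably cannot beat the best distance found so far; O(log N) is the typical-case behavior on well-behaved data, not a worst-case guarantee.

    def nn_search(node, q, best=(np.inf, None)):
        """Branch-and-bound nearest neighbor: returns (squared distance, point)."""
        gap = np.maximum(0.0, np.maximum(node.lo - q, q - node.hi))
        if float(np.sum(gap ** 2)) >= best[0]:
            return best                               # prune: box cannot beat best
        if node.left is None:                         # leaf: scan its points
            d2 = np.sum((node.points - q) ** 2, axis=1)
            i = int(np.argmin(d2))
            return (d2[i], node.points[i]) if d2[i] < best[0] else best
        # descend into the nearer child first to tighten the bound early
        children = sorted(
            (node.left, node.right),
            key=lambda c: np.sum(np.maximum(0.0, np.maximum(c.lo - q, q - c.hi)) ** 2))
        for child in children:
            best = nn_search(child, q, best)
        return best

    d2, neighbor = nn_search(root, np.random.rand(2))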
Issues
! How to disseminate/integrate?
! In-database/centralized or not?
! Trust of complex algorithms?
! Other statistical/ML needs?