0% found this document useful (0 votes)
71 views16 pages

Architecture Conscious Data Mining: Srinivasan Parthasarathy Data Mining Research Lab Ohio State University

This document discusses challenges and opportunities in efficient data mining. It covers: 1. Data mining is an iterative process to extract useful information from large data stores, but faces challenges around efficiency due to dataset properties, computational complexity, and algorithm irregularity. 2. Architecture-conscious approaches can help address efficiency by understanding hardware limitations and designing algorithms to better utilize system resources like memory locality, parallelism, and multi-core processors. 3. Examples show adapting algorithms for modern architectures like multi-core CPUs and the Cell processor can significantly improve performance and enable mining very large datasets.

Uploaded by

api-3798592
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
71 views16 pages

Architecture Conscious Data Mining: Srinivasan Parthasarathy Data Mining Research Lab Ohio State University

This document discusses challenges and opportunities in efficient data mining. It covers: 1. Data mining is an iterative process to extract useful information from large data stores, but faces challenges around efficiency due to dataset properties, computational complexity, and algorithm irregularity. 2. Architecture-conscious approaches can help address efficiency by understanding hardware limitations and designing algorithms to better utilize system resources like memory locality, parallelism, and multi-core processors. 3. Examples show adapting algorithms for modern architectures like multi-core CPUs and the Cell processor can significantly improve performance and enable mining very large datasets.

Uploaded by

api-3798592
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 16

Architecture Conscious Data

Mining
Srinivasan Parthasarathy
Data Mining Research Lab
Ohio State University
KDD & Next Generation
Challenges
• KDD is an iterative and interactive process the goal of
which is to extract interesting and actionable
information from potentially large data stores efficiently
• Young field, long laundry list of technical challenges
– Theoretical foundations in various sub-fields
– Interestingness and Ranking
– New and Exciting Applications
• Embedding domain knowledge effectively
– Visualization for data & model understanding
– Efficient and scalable algorithms (focus of this talk)
• Other challenges
– Educational (talk a bit about this at the end)
– Reproducability (need for benchmarks)
– Socio-Political
Efficiency in the KDD process
• Why is it important?
– Interactive nature of KDD
– Real-time constraints
• What makes it challenging?
– Dataset properties (large,
heterogeneous, distributed)
– Computational complexity Mining Simulation Data
• Example Applications
– Clinical data Diagnosing disease and
modeling progression
– Biological data Twa et al 2005
Analyzing (dynamic) Networks
– Large scale simulation data Protein Interaction Network (yeast)
– Social network data
– Sensor data, WWW data….
Toward Efficient Realizations
• Data driven approach
– Compression, Sampling, Dimensionality Reduction, Feature
Selection, Matrix Factorization etc.
• Computational driven approach
– Intelligent search space pruning to reduce complexity
– Approximate algorithms, streaming algorithms
– Parallel and distributed algorithms
• Architecture-Conscious approach (this talk)
– Largely orthogonal to the above alternatives
– Objective is to understand limitations and novel features of
modern and emerging architecture(s)
– Subsequently, re-architect algorithms to better utilize system
resources.
Houston, do we have a problem?
• Turns out we do
– Many state-of-the-art data mining algorithms grossly
under-utilize processor resources [Ghoting 2005]
• Why?
1. Data intensive algorithms – lots of memory accesses
– high latency penalty.
2. Mining algorithms are extremely irregular in nature –
data and parameter driven – hard to predict
3. Use of pointer-based data structures – poor ILP
4. Do not leverage important features of modern
architectures – automated compiler/runtime systems
are handicapped because of 1, 2 and 3.
Spatial Locality
• Improve spatial locality of r
dynamic data structures
a f c
– Memory pooling
– Loss-less compression – c b b
store only data that is
needed – allows for more f p
data per cache line
– Memory placement to m b

match dominant access


order p m

– Side benefit – enables


effective hardware
prefetching (latency
alleviating mechanism) DFS allocation
Temporal Locality and Leveraging
SMT r

• Data Structure Tiling


– Operate on a tile-by-tile
basis
• Non-overlapping
(traditional)
• Overlapping
• Smart data partitioning
– Jigsaw puzzle analogy
• SMT
– Co-schedule tasks that
operate on same data tile
helps improve
performance Tile 1 Tile 2 Tile N-1 Tile N
Sample Benefits
• Gains in performance can be
staggering
– Frequent patterns (itemsets,
trees, graphs)
– Outlier Detection
– Clustering
• Benefits to end applications
– Scientific simulation data
– Web data
– Molecular and Clinical data
• For network of workstations
– minimize communication and VLDB’05, KDD’06, VLDBJ07
leverage remote memory PPOPP’07
– Enables mining of terabyte scale
distributed datasets efficiently.
CMPs (next frontier)
• Why the push from • Challenges
industry? – Existing applications, they
– Increasing clock need to be rewritten to use
frequencies is not returning multiple threads of
improved IPC, and it is execution
increasing power costs and – Compiler and runtime
thermal issues techniques have a hard
• Two new PCs in my den, time already – application
must help
no need for the heat vent! – Fine-grained sharing of
– Great for winters! processor resources
• Importantly (cache, bus/channel etc.)
– Parallel Computing meets – Memory hierarchy issues
mainstream commodity are even more challenging
market • Potential solution
– Adaptable algorithms
Adaptive algorithms
• Key idea:Trading off • Key idea: Moldable
memory for redundant partitioning and adaptive
computation scheduling of tasks
• Benefits: • Benefits
– Reduced working set sizes – Better CPU utilization
– Likely to have reduced – If co-scheduling – reduced
bandwidth pressure cache miss rates
– Utilizing strengths of the • Challenges:
CMP – Sensing the problem
• Challenge: – Re-architecting algorithm
– Sensing the problem • Moldable task
– Re-architecting algorithm to decomposition
reduce memory • Pass on enough state to
consumption move task to another core
Adaptive algorithms performance
• Graph mining • Tree Mining
– Converted to sequence space
– Gaston vs. Gspan vs. (dynamic arrays)
Hybrid (adaptive) • Better locality, ILP
– Reduced memory LCS
matching + structure checks
– Leveraged hybrid scheduling
– Sequential Performance
• 2 order reduction in
memory footprint
• 3 orders improvement in
processing time
– Parallel Performance
• Linear scalability on a 4-core
dual chip (8 cores)
• Adapted similar idea to XML
indexing with similar results!

ICDM’06, CIKM’06, VLDB’07


Esoteric CMPs (CELL)
• Interesting design point on 1000
commodity CMP space
– 25 GB/s OC bandwidth 100
– 8 cores (SPUs) + 1 PPU kMeans
– FP computation 200 GFlops 10 KNN

– Orca
Breakthroughs in commodity
1
processing
• Challenges

on

8
2

D2
0

l l-

l l-
Xe
0.1

25
um

Ce

Ce
m
– Hard to program

ni

iu
ro
It a

nt
te

Pe
op
– Need to explicitly manage
memory and data transfers
between PPU and SPUs Cell-6 on Sony Playstation
– Probably not suitable for all Cell-8 is simulated
programs
All cases codes optimized and
– Interesting class of algorithms
and kernels can benefit Implemented on appropriate compiler
significantly!
Mining on Clusters
• Heavily researched over the last 15 years
– DDM Wiki (a very nice start point resource)
• What are the “new” challenges?
– Non-homogeneous “hybrid” clusters – (e.g. Roadrunner)
– Multi-level parallelism (on chip, on node, on cluster)
– Leveraging features of high end systems networking
• Infiniband makes it feasible and cheaper to access remote memory
than local disk – how to leverage?
– KDD may be particularly amenable to pipelined parallelism – a
largely ignored approach
– KDD and the grid (heard about this yesterday)
– Application specific challenges -- e.g. astronomy, folding@home
etc.
Discussion
• KDD is an iterative and interactive process the goal of
which is to extract interesting and actionable information
from potentially large data stores efficiently
• This talk was primarily about the last but all 3 are
important.
• Architecture conscious data mining is a viable orthogonal
approach to achieve efficiency (references in paper)
– Tangible benefits to applications, algorithms and kernels
– Lower memory footprints + significantly faster performance
– Adaptive algorithms are necessary for emerging architectures
– Whats next? Services oriented architecture
• Plug-and-Play naturally connects with KDD process
• An effective mechanism to keep cores busy.
Broadly Speaking
• Education
– As an aside parallel algorithms and high performance
computing has to be a part of basic CS curriculum.
– We as data-intensive science need to understand the
key systems issues better from OS and architecture
friends
• Broader Scientific Impact
– Interactions between Systems and Data Mining
• Data mining for software engineering, invariant tracking,
testing, bug detection in sequential and parallel codes
• Data mining for performance modeling
• Leveraging systems features for data mining
Thanks
• Students
– A. Ghoting, G. Buehrer, S. Tatikonda
• Collaborating Colleagues
– OSU-Physics, OSU-Biomedical Informatics, Intel, IBM
• Funding agencies
– NSF CCF0702587, CNS-0406386, CAREER-IIS-0347662, RI-
CNS-0403342.
– DOE Early career principal investigator grant
– IBM Faculty partnership
• Organizers of this workshop
• Additional Information: dmrl.cse.ohio-state.edu or
srini@cse.ohio-state.edu

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy