Architecture Conscious Data Mining: Srinivasan Parthasarathy Data Mining Research Lab Ohio State University

This document discusses challenges and opportunities in efficient data mining. It covers: 1. Data mining is an iterative process to extract useful information from large data stores, but faces challenges around efficiency due to dataset properties, computational complexity, and algorithm irregularity. 2. Architecture-conscious approaches can help address efficiency by understanding hardware limitations and designing algorithms to better utilize system resources like memory locality, parallelism, and multi-core processors. 3. Examples show adapting algorithms for modern architectures like multi-core CPUs and the Cell processor can significantly improve performance and enable mining very large datasets.


Architecture Conscious Data Mining
Srinivasan Parthasarathy
Data Mining Research Lab
Ohio State University
KDD & Next Generation Challenges
• KDD is an iterative and interactive process the goal of
which is to extract interesting and actionable
information from potentially large data stores efficiently
• Young field, long laundry list of technical challenges
– Theoretical foundations in various sub-fields
– Interestingness and Ranking
– New and Exciting Applications
• Embedding domain knowledge effectively
– Visualization for data & model understanding
– Efficient and scalable algorithms (focus of this talk)
• Other challenges
– Educational (talk a bit about this at the end)
– Reproducibility (need for benchmarks)
– Socio-Political
Efficiency in the KDD process
• Why is it important?
– Interactive nature of KDD
– Real-time constraints
• What makes it challenging?
– Dataset properties (large, heterogeneous, distributed)
– Computational complexity
• Example Applications
– Clinical data
– Biological data
– Large scale simulation data
– Social network data
– Sensor data, WWW data, ...
[Figures: Mining Simulation Data; Diagnosing disease and modeling progression (Twa et al 2005); Analyzing (dynamic) Networks; Protein Interaction Network (yeast)]
Toward Efficient Realizations
• Data driven approach
– Compression, Sampling, Dimensionality Reduction, Feature
Selection, Matrix Factorization etc.
• Computational driven approach
– Intelligent search space pruning to reduce complexity
– Approximate algorithms, streaming algorithms
– Parallel and distributed algorithms
• Architecture-Conscious approach (this talk)
– Largely orthogonal to the above alternatives
– Objective is to understand limitations and novel features of
modern and emerging architecture(s)
– Subsequently, re-architect algorithms to better utilize system
resources.
Houston, do we have a problem?
• Turns out we do
– Many state-of-the-art data mining algorithms grossly
under-utilize processor resources [Ghoting 2005]
• Why?
1. Data intensive algorithms – lots of memory accesses
– high latency penalty.
2. Mining algorithms are extremely irregular in nature –
data and parameter driven – hard to predict
3. Use of pointer-based data structures – poor ILP
4. Do not leverage important features of modern
architectures – automated compiler/runtime systems
are handicapped because of 1, 2 and 3.
Spatial Locality
• Improve spatial locality of dynamic data structures
– Memory pooling
– Loss-less compression – store only data that is needed – allows for more data per cache line
– Memory placement to match dominant access order
– Side benefit – enables effective hardware prefetching (latency alleviating mechanism)
[Figure: example tree (node labels r, a, f, c, b, p, m) with DFS allocation]
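As an illustration of memory placement that matches the dominant access order, the sketch below copies a pointer-based tree into flat arrays in depth-first (preorder) order. It is only a minimal sketch under assumptions: the `Node` class, `dfs_layout`, and the `subtree_end` array are hypothetical names chosen here, not taken from the talk, and Python lists only model the layout (in C or C++ the arrays would come from a memory pool).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical pointer-based tree node used only for this sketch."""
    label: str
    children: list = field(default_factory=list)


def dfs_layout(root):
    """Copy a tree into flat arrays in depth-first (preorder) order.

    Returns (labels, subtree_end). Node i's subtree occupies the index
    range i .. subtree_end[i] - 1, so a depth-first walk of any subtree
    is a linear scan over consecutive slots.
    """
    labels = []
    subtree_end = []

    def visit(node):
        i = len(labels)
        labels.append(node.label)
        subtree_end.append(0)  # placeholder, filled in after the children
        for child in node.children:
            visit(child)
        subtree_end[i] = len(labels)

    visit(root)
    return labels, subtree_end


# Small example tree: r has children a and f; a has child c.
example = Node("r", [Node("a", [Node("c")]), Node("f")])
labels, subtree_end = dfs_layout(example)
```

With this layout, skipping a whole subtree is a single index jump to `subtree_end[i]`, and a traversal touches memory sequentially, which is the access pattern hardware prefetchers handle well.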
Temporal Locality and Leveraging SMT
• Data Structure Tiling
– Operate on a tile-by-tile basis
• Non-overlapping (traditional)
• Overlapping
• Smart data partitioning
– Jigsaw puzzle analogy
• SMT
– Co-scheduling tasks that operate on the same data tile helps improve performance
[Figure: data structure divided into Tile 1, Tile 2, ..., Tile N-1, Tile N]
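The tile-by-tile idea can be sketched on a simple support-counting loop: each tile of the data is reused for every candidate before the next tile is loaded. This is an assumed illustration with non-overlapping tiles, not the algorithm from the talk; `count_support_tiled` and `tile_size` are hypothetical names.

```python
def count_support_tiled(transactions, candidates, tile_size):
    """Count how many transactions contain each candidate itemset.

    The data is processed one non-overlapping tile at a time, and every
    candidate is checked against the current tile before moving on, so
    the tile is reused while it is still resident in cache.
    """
    counts = [0] * len(candidates)
    candidate_sets = [frozenset(c) for c in candidates]
    for start in range(0, len(transactions), tile_size):
        tile = [set(t) for t in transactions[start:start + tile_size]]
        for j, candidate in enumerate(candidate_sets):
            for transaction in tile:
                if candidate <= transaction:
                    counts[j] += 1
    return counts


transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
candidates = [["a"], ["a", "c"], ["b", "c"]]
tiled_counts = count_support_tiled(transactions, candidates, tile_size=2)
```

The result does not depend on `tile_size`; only the order of memory accesses changes. On an SMT processor, two threads working on different candidates could be co-scheduled on the same tile.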
Sample Benefits
• Gains in performance can be
staggering
– Frequent patterns (itemsets,
trees, graphs)
– Outlier Detection
– Clustering
• Benefits to end applications
– Scientific simulation data
– Web data
– Molecular and Clinical data
• For network of workstations
– minimize communication and leverage remote memory
– Enables mining of terabyte scale distributed datasets efficiently.
[References: VLDB’05, KDD’06, VLDBJ07, PPOPP’07]
CMPs (next frontier)
• Why the push from industry?
– Increasing clock frequencies is not returning improved IPC, and it is increasing power costs and thermal issues
• Two new PCs in my den, no need for the heat vent!
– Great for winters!
• Importantly
– Parallel Computing meets mainstream commodity market
• Challenges
– Existing applications, they need to be rewritten to use multiple threads of execution
– Compiler and runtime techniques have a hard time already – application must help
– Fine-grained sharing of processor resources (cache, bus/channel etc.)
– Memory hierarchy issues are even more challenging
• Potential solution
– Adaptable algorithms
Adaptive algorithms
• Key idea: Trading off memory for redundant computation
• Benefits:
– Reduced working set sizes
– Likely to have reduced bandwidth pressure
– Utilizing strengths of the CMP
• Challenge:
– Sensing the problem
– Re-architecting algorithm to reduce memory consumption
• Key idea: Moldable partitioning and adaptive scheduling of tasks
• Benefits
– Better CPU utilization
– If co-scheduling – reduced cache miss rates
• Challenges:
– Sensing the problem
– Re-architecting algorithm
• Moldable task decomposition
• Pass on enough state to move task to another core
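One way to picture trading memory for redundant computation is a support counter that caches intermediate transaction-id sets only while a memory budget allows, and recomputes them otherwise. This is a sketch under assumptions, not the method from the talk: the class `AdaptiveTidsets`, the `budget` knob (counted in stored transaction ids), and the `recomputations` counter are all hypothetical, and itemsets are assumed to be non-empty.

```python
class AdaptiveTidsets:
    """Support counting that caches results only within a memory budget."""

    def __init__(self, item_tidsets, budget):
        self.item_tidsets = item_tidsets  # item -> frozenset of transaction ids
        self.budget = budget              # max number of cached transaction ids
        self.cache = {}
        self.cached_tids = 0
        self.recomputations = 0

    def tidset(self, itemset):
        """Return the ids of transactions containing every item in itemset."""
        key = frozenset(itemset)
        if key in self.cache:
            return self.cache[key]
        # Cache miss: recompute by intersecting the per-item id sets.
        self.recomputations += 1
        items = sorted(key)
        result = self.item_tidsets[items[0]]
        for item in items[1:]:
            result = result & self.item_tidsets[item]
        # "Sense" memory pressure: keep the result only if it fits the budget.
        if self.cached_tids + len(result) <= self.budget:
            self.cache[key] = result
            self.cached_tids += len(result)
        return result

    def support(self, itemset):
        return len(self.tidset(itemset))


item_tidsets = {
    "a": frozenset({0, 1, 3}),
    "b": frozenset({0, 2, 3}),
    "c": frozenset({0, 1, 2}),
}
low_memory = AdaptiveTidsets(item_tidsets, budget=0)
high_memory = AdaptiveTidsets(item_tidsets, budget=10)
```

With `budget=0` every query is recomputed and the working set stays minimal; with a larger budget repeated queries are served from the cache. Both settings return the same supports.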
Adaptive algorithms performance
• Graph mining
– Gaston vs. Gspan vs. Hybrid (adaptive)
• Tree Mining
– Converted to sequence space (dynamic arrays)
• Better locality, ILP
– Reduced memory LCS matching + structure checks
– Leveraged hybrid scheduling
– Sequential Performance
• 2 orders of magnitude reduction in memory footprint
• 3 orders of magnitude improvement in processing time
– Parallel Performance
• Linear scalability on a 4-core dual chip (8 cores)
• Adapted similar idea to XML indexing with similar results!
[References: ICDM’06, CIKM’06, VLDB’07]
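To give a feel for working in sequence space, the sketch below flattens trees into preorder label sequences (plain arrays) and uses a longest-common-subsequence (LCS) test as a cheap filter. It is an assumed illustration rather than the algorithm from the cited papers: the tuple encoding `(label, [children])` and the names `preorder_labels`, `lcs_length`, and `may_contain` are hypothetical. The filter is necessary but not sufficient for ordered subtree containment, so a structure check would still have to follow a positive answer.

```python
def preorder_labels(tree):
    """Flatten a tree given as (label, [children]) into its preorder labels."""
    label, children = tree
    sequence = [label]
    for child in children:
        sequence.extend(preorder_labels(child))
    return sequence


def lcs_length(a, b):
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def may_contain(pattern, tree):
    """Filter: the pattern's label sequence must be a subsequence of the tree's."""
    p = preorder_labels(pattern)
    return lcs_length(p, preorder_labels(tree)) == len(p)


data_tree = ("r", [("a", [("c", [])]), ("f", [])])
pattern_ok = ("r", [("a", []), ("f", [])])
pattern_bad = ("f", [("a", [])])
```

Here `may_contain(pattern_ok, data_tree)` passes the filter, while `pattern_bad` is rejected without any pointer chasing, since its labels do not occur in that order in the data tree.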

Esoteric CMPs (CELL)
• Interesting design point on commodity CMP space
– 25 GB/s OC bandwidth
– 8 cores (SPUs) + 1 PPU
– FP computation 200 GFlops
– Breakthroughs in commodity processing
• Challenges
– Hard to program
– Need to explicitly manage memory and data transfers between PPU and SPUs
– Probably not suitable for all programs
– Interesting class of algorithms and kernels can benefit significantly!
[Figure: log-scale chart (0.1 to 1000) comparing kMeans, KNN and Orca on Itanium 2, Opteron 250, Pentium D, Xeon, Cell-6 and Cell-8. Cell-6 on Sony Playstation; Cell-8 is simulated. In all cases, codes optimized and implemented on appropriate compiler.]
Mining on Clusters
• Heavily researched over the last 15 years
– DDM Wiki (a very good starting point)
• What are the “new” challenges?
– Non-homogeneous “hybrid” clusters – (e.g. Roadrunner)
– Multi-level parallelism (on chip, on node, on cluster)
– Leveraging features of high end systems networking
• Infiniband makes it feasible and cheaper to access remote memory
than local disk – how to leverage?
– KDD may be particularly amenable to pipelined parallelism – a
largely ignored approach
– KDD and the grid (heard about this yesterday)
– Application specific challenges -- e.g. astronomy, folding@home
etc.
Discussion
• KDD is an iterative and interactive process the goal of
which is to extract interesting and actionable information
from potentially large data stores efficiently
• This talk was primarily about the last of these (efficiency), but all three (interesting, actionable, efficient) are important.
• Architecture conscious data mining is a viable orthogonal
approach to achieve efficiency (references in paper)
– Tangible benefits to applications, algorithms and kernels
– Lower memory footprints + significantly faster performance
– Adaptive algorithms are necessary for emerging architectures
– What's next? Service-oriented architecture
• Plug-and-Play naturally connects with KDD process
• An effective mechanism to keep cores busy.
Broadly Speaking
• Education
– As an aside, parallel algorithms and high performance computing have to be a part of the basic CS curriculum.
– We, as a data-intensive science, need to understand the key systems issues better from our OS and architecture friends
• Broader Scientific Impact
– Interactions between Systems and Data Mining
• Data mining for software engineering, invariant tracking,
testing, bug detection in sequential and parallel codes
• Data mining for performance modeling
• Leveraging systems features for data mining
Thanks
• Students
– A. Ghoting, G. Buehrer, S. Tatikonda
• Collaborating Colleagues
– OSU-Physics, OSU-Biomedical Informatics, Intel, IBM
• Funding agencies
– NSF CCF0702587, CNS-0406386, CAREER-IIS-0347662, RI-CNS-0403342.
– DOE Early career principal investigator grant
– IBM Faculty partnership
• Organizers of this workshop
• Additional Information: dmrl.cse.ohio-state.edu or
srini@cse.ohio-state.edu
