Slide 03 Chapter1 Introduction
Introduction
HUI-YIN CHANG (張彙音)
1
Outline
⚫ Why Data Mining?
⚫ Summary
2
You have probably read or heard of the following books…
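Sun Tzu said:
In the art of war, taking the enemy's country whole is best; destroying it is second best.
Taking an army whole is best; destroying it is second best.
Taking a brigade whole is best; destroying it is second best.
Taking a company whole is best; destroying it is second best.
Taking a squad whole is best; destroying it is second best.
Therefore, winning a hundred victories in a hundred battles is not the highest excellence;
subduing the enemy's troops without fighting is the highest excellence.
So the best strategy is to attack the enemy's plans, the next is to attack his alliances, the next is to attack his army, and the worst is to besiege his cities.
Besieging cities is a last resort.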
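Qian (乾): originating, prospering, beneficial, steadfast.
Nine at the beginning: a hidden dragon; do not act.
Nine in the second place: a dragon appears in the field; it is favorable to see the great man.
Nine in the third place: the noble person is diligent all day and vigilant at night; danger, but no blame.
Nine in the fourth place: perhaps leaping in the deep; no blame.
Nine in the fifth place: a flying dragon in the sky; it is favorable to see the great man.
Nine at the top: an overreaching dragon will have cause for regret.
All nines: a group of dragons appears without a leader; good fortune.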
3
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
◦ Data collection and data availability
◦ Automated data collection tools, database systems, web, computerized
society
◦ Major sources of abundant data
◦ Business: Web, e-commerce, transactions, stocks, …
◦ Science: Remote sensing, bioinformatics, scientific simulation, …
◦ Society and everyone: news, digital cameras, YouTube
4
Examples of data mining
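⚫ Beer and diapers
⚫ While analyzing consumer shopping behavior, the global retail giant Walmart found that male customers buying baby diapers often picked up a few bottles of beer as a treat for themselves,
so it tried a promotion that placed beer and diapers together. Unexpectedly, sales of both diapers and beer rose sharply. Today the "beer + diapers" analysis
is a classic case of applied big data technology that people still like to cite.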
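⚫ Google successfully predicted winter flu
⚫ In 2009, Google analyzed the 50 million search terms most frequently used by Americans, compared them with data from the US Centers for Disease Control on the spread of seasonal flu between 2003 and 2008,
and built a specific mathematical model.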
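⚫ Big data and Steve Jobs's cancer treatment
⚫ Jobs was among the first people in the world to have all of his own DNA and his tumor's DNA sequenced, for which he paid several hundred thousand US dollars. What he received was not a sample but a data file
covering the entire genome. His doctors chose drugs as needed based on the full genetic data, and this approach ultimately helped extend his life by several years.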
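⚫ Obama's successful re-election
⚫ Obama's re-election victory in November 2012 was also credited to big data, because his campaign team carried out large-scale, in-depth data mining.
Time magazine went further and asserted that the advantage of deciding by intuition and experience was declining sharply and that the era of big data had arrived in politics. Overwhelming coverage from media, forums,
and experts got people excited about the arrival of the big data era, and countless companies and entrepreneurs rushed to join the celebration.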
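⚫ Microsoft used big data to predict the Oscar winners
⚫ In 2013, David Rothschild, an economist at Microsoft Research New York, used big data to correctly predict 19 of the 24 Oscar categories,
which became a widely discussed story.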
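⚫ QQ Circles recommended an ex-girlfriend to a fiancée
⚫ In March 2012, Tencent launched QQ Circles (QQ圈子), which mapped out each user's social network by following chains of mutual friends, recommended a user's ex-girlfriend to his fiancée, and sorted classmates, colleagues, and friends
into separate circles, using its big data processing power to "shock" people.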
Source: https://www.itread01.com/lhkqcc.html
5
Evolution of Sciences
Before 1600, empirical science
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11):
50-54, Nov. 2002
6
Evolution of Database Technology
1960s:
◦ Data collection, database creation, IMS and network DBMS
1970s:
◦ Relational data model, relational DBMS implementation
1980s:
◦ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◦ Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
◦ Data mining, data warehousing, multimedia databases, and Web databases
2000s:
◦ Stream data management and mining
◦ Data mining and its applications
◦ Web technology (XML, data integration) and global information systems
7
What Is Data Mining?
Data mining (knowledge discovery from data)
◦ Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
◦ Data mining: a misnomer?
Alternative names
◦ Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
8
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process, from bottom to top: Databases, Data Cleaning and Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
9
The knowledge discovery process
◦ Data cleaning
◦ To remove noise and inconsistent data
◦ Data integration from multiple sources
◦ Where multiple data sources may be combined
◦ Data selection
◦ Where data relevant to the analysis task are retrieved from the database
◦ Data transformation
◦ Where data are transformed and consolidated into forms appropriate for mining by performing summary or
aggregation operations.
◦ Data mining
◦ An essential process where intelligent methods are applied to extract data patterns.
◦ Pattern evaluation
◦ To identify the truly interesting patterns representing knowledge based on interestingness measures
◦ Knowledge presentation
◦ Where visualization and knowledge representation techniques are used to present mined knowledge to users
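The seven steps above can be sketched as a chain of small functions. This is only a minimal illustration: the two data sources, the field names, and the 0.5 support threshold are hypothetical and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"id": 1, "items": "beer, diapers, chips"},
    {"id": 2, "items": "Beer, Diapers"},
    {"id": 3, "items": None},  # noisy record with no item list
]
store_b = [
    {"id": 4, "items": "milk, diapers, beer"},
    {"id": 5, "items": "milk, bread"},
]

def clean(records):
    """Data cleaning: drop records whose item list is missing."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records):
    """Data selection: keep only the field relevant to the task."""
    return [r["items"] for r in records]

def transform(item_strings):
    """Data transformation: turn each string into a set of lowercase item names."""
    return [frozenset(s.strip().lower() for s in line.split(","))
            for line in item_strings]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n_baskets for pair, c in pair_counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: format the surviving patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
print("\n".join(report))
```

In practice each step is far more involved; the point here is only the order of the stages and that the mining step is one link in the chain.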
10
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
◦ Decision Making (End User)
◦ Data Exploration: Statistical Summary, Querying, and Reporting
11
Example: Mining vs. Data Exploration
Business intelligence view
◦ Warehouse, data cube, reporting but not much mining
12
KDD Process: A Typical View from ML and Statistics
13
Example: Medical Data Mining
Health care and medical data mining often adopts this view from statistics and machine
learning
14
Multi-Dimensional View of Data Mining
Data to be mined
◦ Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and web, multi-media, graphs & social and information
networks
Knowledge to be mined (or: Data mining functions)
◦ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
◦ Descriptive vs. predictive data mining
◦ Multiple/integrated functions and mining at multiple levels
Techniques utilized
◦ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
Applications adapted
◦ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
15
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
◦ Relational database, data warehouse, transactional database
16
What kinds of patterns can be mined
⚫ Descriptive
⚫ Descriptive mining tasks characterize properties of the data in a target data set.
⚫ Predictive
⚫ Predictive mining tasks perform induction on the current data in order to make predictions.
17
Data Mining Function: (1) Generalization
18
Data Mining Function: (2) Association and Correlation
Analysis
Frequent patterns (or frequent item sets)
◦ What items are frequently purchased together in your Walmart?
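As a rough illustration of the question above, the standard support, confidence, and lift measures can be computed directly from a list of baskets. The five transactions below are invented for the example and are not Walmart data.

```python
# Hypothetical market baskets, invented for illustration.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; above 1 suggests positive correlation."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

s = support({"diapers", "beer"}, transactions)
c = confidence({"diapers"}, {"beer"}, transactions)
l = lift({"diapers"}, {"beer"}, transactions)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Support measures how frequent the pattern is, while lift addresses the correlation side of the slide title: a rule can be frequent yet uninformative if the consequent is common anyway.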
19
Data Mining Function: (3) Classification
Supervised learning: classification and label prediction
◦ Construct models (functions) based on some training examples
◦ Describe and distinguish classes or concepts for future prediction
◦ E.g., classify countries based on (climate), or classify cars based on (gas
mileage)
◦ Predict some unknown class labels
Typical methods
◦ Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression, …
Typical applications:
◦ Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
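To make "construct a model from training examples, then predict unknown labels" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) for the gas-mileage example. The mileage values and the "low"/"high" labels are hypothetical, and the sketch assumes exactly two classes and a single numeric feature.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(12, "low"), (15, "low"), (18, "low"),
            (29, "high"), (33, "high"), (40, "high")]

def train_stump(examples):
    """Pick the threshold and label assignment that misclassify the fewest examples."""
    values = sorted({x for x, _ in examples})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    labels = sorted({y for _, y in examples})  # assumes exactly two labels
    best = None
    for t in candidates:
        # Try both ways of assigning the two labels to the two sides.
        for below, above in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum((below if x <= t else above) != y for x, y in examples)
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    return best[1:]

def predict(model, x):
    """Label a new value using the learned threshold."""
    threshold, below, above = model
    return below if x <= threshold else above

model = train_stump(training)
print(model, predict(model, 35), predict(model, 14))
```

Full decision tree learners apply this kind of split search recursively and over many features; the stump only shows the training and prediction phases in their simplest form.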
20
Data Mining Function: (4) Cluster Analysis
21
Data Mining Function: (5) Outlier Analysis
Outlier analysis
◦ Outlier: A data object that does not comply with the general behavior of the
data
◦ Noise or exception? One person’s garbage could be another person’s
treasure
◦ Methods: by-product of clustering or regression analysis, …
◦ Useful in fraud detection, rare events analysis
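One way to obtain outliers as a by-product of regression analysis is to fit a line and flag points with unusually large residuals. The sketch below assumes a simple least-squares line and a cutoff of two standard deviations; the data points and the cutoff `k` are hypothetical choices for illustration.

```python
from statistics import mean, pstdev

# Hypothetical (x, y) observations; the point (5, 30.0) breaks the trend.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0),
          (5, 30.0), (6, 12.1), (7, 13.9), (8, 16.0)]

def fit_line(pts):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx = mean(x for x, _ in pts)
    my = mean(y for _, y in pts)
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return slope, my - slope * mx

def residual_outliers(pts, k=2.0):
    """Return points whose residual exceeds k standard deviations of all residuals."""
    slope, intercept = fit_line(pts)
    residuals = [y - (slope * x + intercept) for x, y in pts]
    spread = pstdev(residuals)
    return [p for p, r in zip(pts, residuals) if abs(r) > k * spread]

outliers = residual_outliers(points)
print(outliers)
```

Whether a flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application, which is the "garbage or treasure" question above.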
22
Time and Ordering: Sequential Pattern, Trend and Evolution
Analysis
23
Structure and Network Analysis
Graph mining
◦ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
Information network analysis
◦ Social networks: actors (objects, nodes) and relationships (edges)
◦ e.g., author networks in CS, terrorist networks
◦ Multiple heterogeneous networks
◦ A person could be in multiple information networks: friends, family,
classmates, …
◦ Links carry a lot of semantic information: Link mining
Web mining
◦ Web is a big information network: from PageRank to Google
◦ Analysis of Web information networks
◦ Web community discovery, opinion mining, usage mining, …
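Since PageRank is mentioned above, here is a minimal power-iteration sketch of the textbook formulation on a tiny link graph. The four-page graph is hypothetical, the damping factor of 0.85 is the commonly quoted value, and this should not be read as a description of Google's production system.

```python
# Hypothetical four-page web graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: repeatedly pass each page's rank along its out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because three pages link to it, while D, which nothing links to, keeps only the random-jump share.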
24
Evaluation of Knowledge
Is all mined knowledge interesting?
◦ One can mine a tremendous number of “patterns” and a great deal of knowledge
◦ Some may fit only certain dimension space (time, location, …)
◦ Some may not be representative, may be transient, …
25
Data Mining: Confluence of Multiple Disciplines
26
Why Confluence of Multiple Disciplines?
Tremendous amount of data
◦ Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data
◦ Microarray data may have tens of thousands of dimensions
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
28
Major Issues in Data Mining (1)
Mining Methodology
◦ Mining various and new kinds of knowledge
◦ Mining knowledge in multi-dimensional space
◦ Data mining: An interdisciplinary effort
◦ Boosting the power of discovery in a networked environment
◦ Handling noise, uncertainty, and incompleteness of data
◦ Pattern evaluation and pattern- or constraint-guided mining
User Interaction
◦ Interactive mining
◦ Incorporation of background knowledge
◦ Presentation and visualization of data mining results
29
Major Issues in Data Mining (2)
30
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
◦ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
31
Conferences and Journals on Data Mining
32
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
◦ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
◦ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Web and IR
◦ Conferences: SIGIR, WWW, CIKM, etc.
◦ Journals: WWW: Internet and Web Information Systems, etc.
Statistics
◦ Conferences: Joint Stat. Meeting, etc.
◦ Journals: Annals of Statistics, etc.
Visualization
◦ Conference proceedings: CHI, ACM SIGGRAPH, etc.
◦ Journals: IEEE Trans. Visualization and Computer Graphics, etc.
33
Summary
Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
34
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009
P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed., 2005
35
Thanks for Your Attention
Q&A
36