Data Mining and Analysis: Fundamental Concepts and Algorithms
MOHAMMED J. ZAKI
Rensselaer Polytechnic Institute, Troy, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521766333
A catalog record for this publication is available from the British Library.
Contents
Preface
2 Numeric Attributes
2.1 Univariate Analysis
2.2 Bivariate Analysis
2.3 Multivariate Analysis
2.4 Data Normalization
2.5 Normal Distribution
2.6 Further Reading
2.7 Exercises
3 Categorical Attributes
3.1 Univariate Analysis
3.2 Bivariate Analysis
3.3 Multivariate Analysis
3.4 Distance and Angle
3.5 Discretization
3.6 Further Reading
3.7 Exercises
4 Graph Data
4.1 Graph Concepts
Index
Preface
• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook
Having understood the basic principles and algorithms in data mining and data
analysis, readers will be well equipped to develop their own methods or use more
advanced techniques.
Figure 0.1. Chapter dependencies
Suggested Roadmaps
The chapter dependency graph is shown in Figure 0.1. We suggest some typical
roadmaps for courses and readings based on this book. For an undergraduate-level
course, we suggest the following chapters: 1–3, 8, 10, 12–15, 17–19, and 21–22. For an
undergraduate course without exploratory data analysis, we recommend Chapters 1,
8–15, 17–19, and 21–22. For a graduate course, one possibility is to quickly go over the
material in Part I or to assume it as background reading and to directly cover Chapters
9–22; the other parts of the book, namely frequent pattern mining (Part II), clustering
(Part III), and classification (Part IV), can be covered in any order. For a course on
data analysis, the chapters covered must include 1–7, 13–14, 15 (Section 2), and 20.
Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5,
7 (Sections 1–3), 11–12, 13 (Sections 1–2), 16–17, and 20–22.
Acknowledgments
Initial drafts of this book have been used in several data mining courses. We received
many valuable comments and corrections from both the faculty and students. Our
thanks go to
We would like to thank all the students enrolled in our data mining courses at RPI
and UFMG, as well as the anonymous reviewers who provided technical comments
on various chapters. We appreciate the collegial and supportive environment within
the computer science departments at RPI and UFMG and at the Qatar Computing
Research Institute. In addition, we thank NSF, CNPq, CAPES, FAPEMIG, Inweb –
the National Institute of Science and Technology for the Web, and Brazil’s Science
without Borders program for their support. We thank Lauren Cowles, our editor at
Cambridge University Press, for her guidance and patience in realizing this book.
Finally, on a more personal front, MJZ dedicates the book to his wife, Amina,
for her love, patience and support over all these years, and to his children, Abrar and
Afsah, and his parents. WMJ gratefully dedicates the book to his wife Patricia; to his
children, Gabriel and Marina; and to his parents, Wagner and Marlene, for their love,
encouragement, and inspiration.