
Data Mining for the Masses

Dr. Matthew North


A Global Text Project Book

This book is available on Amazon.com.

© 2012 Dr. Matthew A. North


This book is licensed under a Creative Commons Attribution 3.0 License

All rights reserved.

ISBN: 0615684378
ISBN-13: 978-0615684376

DEDICATION

This book is gratefully dedicated to Dr. Charles Hannon, who gave me the chance to become a
college professor and then challenged me to learn how to teach data mining to the masses.


Table of Contents

Dedication
Table of Contents
Acknowledgements

SECTION ONE: Data Mining Basics

Chapter One: Introduction to Data Mining and CRISP-DM
    Introduction
    A Note About Tools
    The Data Mining Process
    Data Mining and You

Chapter Two: Organizational Understanding and Data Understanding
    Context and Perspective
    Learning Objectives
    Purposes, Intents and Limitations of Data Mining
    Database, Data Warehouse, Data Mart, Data Set…?
    Types of Data
    A Note about Privacy and Security
    Chapter Summary
    Review Questions
    Exercises

Chapter Three: Data Preparation
    Context and Perspective
    Learning Objectives
    Collation
    Data Scrubbing
    Hands on Exercise
    Preparing RapidMiner, Importing Data, and Handling Missing Data
    Data Reduction
    Handling Inconsistent Data
    Attribute Reduction
    Chapter Summary
    Review Questions
    Exercise

SECTION TWO: Data Mining Models and Methods

Chapter Four: Correlation
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Five: Association Rules
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Six: k-Means Clustering
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Seven: Discriminant Analysis
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Eight: Linear Regression
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Nine: Logistic Regression
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Ten: Decision Trees
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Eleven: Neural Networks
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

Chapter Twelve: Text Mining
    Context and Perspective
    Learning Objectives
    Organizational Understanding
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
    Deployment
    Chapter Summary
    Review Questions
    Exercise

SECTION THREE: Special Considerations in Data Mining

Chapter Thirteen: Evaluation and Deployment
    How Far We’ve Come
    Learning Objectives
    Cross-Validation
    Chapter Summary: The Value of Experience
    Review Questions
    Exercise

Chapter Fourteen: Data Mining Ethics
    Why Data Mining Ethics?
    Ethical Frameworks and Suggestions
    Conclusion

GLOSSARY and INDEX

About the Author


ACKNOWLEDGEMENTS

I would not have had the expertise to write this book if not for the assistance of many colleagues at
various institutions. I would like to acknowledge Drs. Thomas Hilton and Jean Pratt, formerly of
Utah State University and now of the University of Wisconsin—Eau Claire, who served as my Master’s
degree advisors. I would also like to acknowledge Drs. Terence Ahern and Sebastian Diaz of West
Virginia University, who served as doctoral advisors to me.

I express my sincere and heartfelt gratitude for the assistance of Dr. Simon Fischer and the rest of
the team at Rapid-I. I thank them for their excellent work on the RapidMiner software product
and for their willingness to share their time and expertise with me on my visit to Dortmund.

Finally, I am grateful to the Kenneth M. Mason, Sr. Faculty Research Fund and Washington &
Jefferson College, for providing financial support for my work on this text.


SECTION ONE: DATA MINING BASICS


CHAPTER ONE:
INTRODUCTION TO DATA MINING AND CRISP-DM

INTRODUCTION

Data mining as a discipline is largely transparent to the world. Most of the time, we never even
notice that it’s happening. But whenever we sign up for a grocery store shopping card, make a
purchase using a credit card, or surf the Web, we are creating data. These data are stored in large
sets on powerful computers owned by the companies we deal with every day. Lying within those
data sets are patterns—indicators of our interests, our habits, and our behaviors. Data mining
allows people to locate and interpret those patterns, helping them make better informed decisions
and better serve their customers. That being said, there are also concerns about the practice of
data mining. Privacy watchdog groups in particular are vocal about organizations that amass vast
quantities of data, some of which can be very personal in nature.

The intent of this book is to introduce you to concepts and practices common in data mining. It is
intended primarily for undergraduate college students and for business professionals who may be
interested in using information systems and technologies to solve business problems by mining
data, but who likely do not have a formal background or education in computer science. Although
data mining is the fusion of applied statistics, logic, artificial intelligence, machine learning and data
management systems, you are not required to have a strong background in these fields to use this
book. While having taken introductory college-level courses in statistics and databases will be
helpful, care has been taken to explain within this book the necessary concepts and techniques
required to successfully learn how to mine data.

Each chapter in this book will explain a data mining concept or technique. You should understand
that the book is not designed to be an instruction manual or tutorial for the tools we will use
(RapidMiner and OpenOffice Base and Calc). These software packages are capable of many types
of data analysis, and this text is not intended to cover all of their capabilities, but rather, to
illustrate how these software tools can be used to perform certain kinds of data mining. The book
is also not exhaustive; it includes a variety of common data mining techniques, but RapidMiner in
particular is capable of many, many data mining tasks that are not covered in the book.

The chapters will all follow a common format. First, chapters will present a scenario referred to as
Context and Perspective. This section will help you to gain a real-world idea about a certain kind of
problem that data mining can help solve. It is intended to help you think of ways that the data
mining technique in that given chapter can be applied to organizational problems you might face.
Following Context and Perspective, a set of Learning Objectives is offered. The idea behind this section
is that each chapter is designed to teach you something new about data mining. By listing the
objectives at the beginning of the chapter, you will have a better idea of what you should expect to
learn by reading it. The chapter will follow with several sections addressing the chapter’s topic. In
these sections, step-by-step examples will frequently be given to enable you to work alongside an
actual data mining task. Finally, after the main concepts of the chapter have been delivered, each
chapter will conclude with a Chapter Summary, a set of Review Questions to help reinforce the main
points of the chapter, and one or more Exercises to allow you to try your hand at applying what was
taught in the chapter.

A NOTE ABOUT TOOLS

There are many software tools designed to facilitate data mining; however, many of these are
expensive and complicated to install, configure and use. Simply put, they’re not a good fit for
learning the basics of data mining. This book will use OpenOffice Calc and Base in conjunction
with an open source software product called RapidMiner, developed by Rapid-I, GmbH of
Dortmund, Germany. Because OpenOffice is widely available and very intuitive, it is a logical
place to begin teaching introductory level data mining concepts. However, it lacks some of the
tools data miners like to use. RapidMiner is an ideal complement to OpenOffice, and was selected
for this book for several reasons:

• RapidMiner provides specific data mining functions not currently found in OpenOffice, such as decision trees and association rules, which you will learn to use later in this book.
• RapidMiner is easy to install and will run on just about any computer.
• RapidMiner’s maker provides a Community Edition of its software, making it free for readers to obtain and use.
• Both RapidMiner and OpenOffice provide intuitive graphical user interface environments, which make it easier for general computer-using audiences to experience the power of data mining.

All examples using OpenOffice or RapidMiner in this book will be illustrated in a Microsoft
Windows environment, although it should be noted that these software packages will work on a
variety of computing platforms. It is recommended that you download and install these two
software packages on your computer now, so that you can work along with the examples in the
book if you would like.

• OpenOffice can be downloaded from: http://www.openoffice.org/

• RapidMiner Community Edition can be downloaded from: http://rapid-i.com/content/view/26/84/

THE DATA MINING PROCESS

Although data mining’s roots can be traced back to the late 1980s, for most of the 1990s the field
was still in its infancy. Data mining was still being defined, and refined. It was largely a loose
conglomeration of data models, analysis algorithms, and ad hoc outputs. In 1999, several sizeable
companies including auto maker Daimler-Benz, insurance provider OHRA, hardware and software
manufacturer NCR Corp. and statistical software maker SPSS, Inc. began working together to
formalize and standardize an approach to data mining. The result of their work was CRISP-DM,
the CRoss-Industry Standard Process for Data Mining. Although the participants in the creation
of CRISP-DM certainly had vested interests in certain software and
hardware tools, the process was designed independent of any specific tool. It was written in such a
way as to be conceptual in nature—something that could be applied independent of any certain
tool or kind of data. The process consists of six steps or phases, as illustrated in Figure 1-1.

[Figure 1-1 (original image not reproduced) shows the six CRISP-DM phases arranged in a cycle around the data: 1. Business Understanding, 2. Data Understanding, 3. Data Preparation, 4. Modeling, 5. Evaluation, 6. Deployment.]

Figure 1-1: CRISP-DM Conceptual Model.

CRISP-DM Step 1: Business (Organizational) Understanding

The first step in CRISP-DM is Business Understanding, or what will be referred to in this text
as Organizational Understanding, since organizations of all kinds, not just businesses, can use
data mining to answer questions and solve problems. This step is crucial to a successful data
mining outcome, yet is often overlooked as folks try to dive right into mining their data. This is
natural of course—we are often anxious to generate some interesting output; we want to find
answers. But you wouldn’t begin building a car without first defining what you want the vehicle to
do, and without first designing what you are going to build. Consider these oft-quoted lines from
Lewis Carroll’s Alice’s Adventures in Wonderland:

"Would you tell me, please, which way I ought to go from here?"
"That depends a good deal on where you want to get to," said the Cat.
"I don’t much care where--" said Alice.
"Then it doesn’t matter which way you go," said the Cat.
"--so long as I get SOMEWHERE," Alice added as an explanation.
"Oh, you’re sure to do that," said the Cat, "if you only walk long enough."

Indeed. You can mine data all day long and into the night, but if you don’t know what you want to
know, if you haven’t defined any questions to answer, then the efforts of your data mining are less
likely to be fruitful. Start with high level ideas: What is making my customers complain so much?
How can I increase my per-unit profit margin? How can I anticipate and fix manufacturing flaws
and thus avoid shipping a defective product? From there, you can begin to develop the more
specific questions you want to answer, and this will enable you to proceed to …

CRISP-DM Step 2: Data Understanding

As with Organizational Understanding, Data Understanding is a preparatory activity, and
sometimes, its value is lost on people. Don’t let its value be lost on you! Years ago when workers
did not have their own computer (or multiple computers) sitting on their desk (or lap, or in their
pocket), data were centralized. If you needed information from a company’s data store, you could
request a report from someone who could query that information from a central database (or fetch
it from a company filing cabinet) and provide the results to you. The inventions of the personal
computer, workstation, laptop, tablet computer and even smartphone have each triggered moves
away from data centralization. As hard drives became simultaneously larger and cheaper, and as
software like Microsoft Excel and Access became increasingly more accessible and easier to use,
data began to disperse across the enterprise. Over time, valuable data stores became strewn across
hundreds and even thousands of devices, sequestered in marketing managers’ spreadsheets,
customer support databases, and human resources file systems.

As you can imagine, this has created a multi-faceted data problem. Marketing may have wonderful
data that could be a valuable asset to senior management, but senior management may not be
aware of the data’s existence—either because of territorialism on the part of the marketing
department, or because the marketing folks simply haven’t thought to tell the executives about the
data they’ve gathered. The same could be said of the information sharing, or lack thereof, between
almost any two business units in an organization. In Corporate America lingo, the term ‘silos’ is
often invoked to describe the separation of units to the point where interdepartmental sharing and
communication is almost non-existent. It is unlikely that effective organizational data mining can
occur when employees do not know what data they have (or could have) at their disposal or where
those data are currently located. In chapter two we will take a closer look at some mechanisms
that organizations are using to try to bring all their data into a common location. These include
databases, data marts and data warehouses.

Simply centralizing data is not enough, however. There are plenty of questions that arise once an
organization’s data have been corralled. Where did the data come from? Who collected them and
was there a standard method of collection? What do the various columns and rows of data mean?
Are there acronyms or abbreviations that are unknown or unclear? You may need to do some
research in the Data Preparation phase of your data mining activities. Sometimes you will need to
meet with subject matter experts in various departments to unravel where certain data came from,
how they were collected, and how they have been coded and stored. It is critically important that
you verify the accuracy and reliability of the data as well. The old adage “It’s better than nothing”
does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a
data mining activity, because decisions based upon partial or wrong data are likely to be partial or
wrong decisions. Once you have gathered, identified and understood your data assets, then you
may engage in…

CRISP-DM Step 3: Data Preparation

Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text,
and others are in picture form such as charts, graphs and maps. Some data are anecdotal or
narrative, such as comments on a customer satisfaction survey or the transcript of a witness’s
testimony. Data that aren’t in rows or columns of numbers shouldn’t be dismissed though—
sometimes non-traditional data formats can be the most information rich. We’ll talk in this book
about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be
one of our most common layouts, we’ll also get into text mining where paragraphs can be fed into
RapidMiner and analyzed for patterns as well.

Data Preparation involves a number of activities. These may include joining two or more data
sets together, reducing data sets to only those variables that are interesting in a given data mining
exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or re-
formatting data for consistency purposes. For example, you may have seen a spreadsheet or
database that held phone numbers in many different formats:
(555) 555-5555
555/555-5555
555-555-5555
555.555.5555
555 555 5555
5555555555

Each of these represents the same phone number, but stored in a different format. The results of a
data mining exercise are most likely to yield good, useful results when the underlying data are as
consistent as possible.
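
As a minimal, hedged sketch of this kind of re-formatting, here is one way it might be scripted in Python (a tool not otherwise used in this book; the function name and the canonical format chosen here are our own assumptions):

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce any of the formats listed above to one canonical form."""
    digits = re.sub(r"\D", "", raw)       # strip everything that is not a digit
    if len(digits) != 10:                 # unexpected length: flag for human review
        raise ValueError(f"cannot normalize {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

samples = ["(555) 555-5555", "555/555-5555", "555.555.5555", "5555555555"]
print([normalize_phone(s) for s in samples])  # every value becomes '555-555-5555'
```

Collapsing every value to its digits first, rather than handling each format separately, keeps the cleanup robust to format variants that haven’t been seen yet.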

Data preparation can help to ensure that you improve your chances of a
successful outcome when you begin…

CRISP-DM Step 4: Modeling

A model, in data mining at least, is a computerized representation of real-world observations.
Models are the application of algorithms to seek out, identify, and display any patterns or messages
in your data. There are two basic kinds or types of models in data mining: those that classify and
those that predict.

[Figure 1-2 (original image not reproduced) shows the two basic types of data mining models, those that classify and those that predict, as overlapping categories.]

Figure 1-2: Types of Data Mining Models.

As you can see in Figure 1-2, there is some overlap between the types of models data mining uses.
For example, this book will teach you about decision trees. Decision Trees are a predictive
model used to determine which attributes of a given data set are the strongest indicators of a given
outcome. The outcome is usually expressed as the likelihood that an observation will fall into a
certain category. Thus, Decision Trees are predictive in nature, but they also help us to classify our
data. This will probably make more sense when we get to the chapter on Decision Trees, but for
now, it’s important just to understand that models help us to classify and predict based on patterns
the models find in our data.

Models may be simple or complex. They may contain only a single process, or stream, or they may
contain sub-processes. Regardless of their layout, models are where data mining moves from
preparation and understanding to development and interpretation. We will build a number of
example models in this text. Once a model has been built, it is time for…


CRISP-DM Step 5: Evaluation

All analyses of data have the potential for false positives. Even if a model doesn’t yield false
positives, however, it may not find any interesting patterns in your data. This may be
because the model isn’t set up well to find the patterns, you could be using the wrong technique, or
there simply may not be anything interesting in your data for the model to find. The Evaluation
phase of CRISP-DM is there specifically to help you determine how valuable your model is, and
what you might want to do with it.
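
As a tiny, tool-neutral illustration of what a false positive is (the labels below are invented; 1 means the model flagged an observation, 0 means it did not):

```python
# Compare model output against known real-world outcomes. A false positive
# is a case the model flagged that the actual outcome does not support.
predicted = [1, 0, 1, 1, 0, 1]
actual    = [1, 0, 0, 1, 0, 0]

false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
print(false_positives)  # 2 observations flagged that shouldn't have been
```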

Evaluation can be accomplished using a number of techniques, both mathematical and logical in
nature. This book will examine techniques for cross-validation and testing for false positives using
RapidMiner. For some models, the power or strength indicated by certain test statistics will also be
discussed. Beyond these measures however, model evaluation must also include a human aspect.
As individuals gain experience and expertise in their field, they will have operational knowledge
which may not be measurable in a mathematical sense, but is nonetheless indispensable in
determining the value of a data mining model. This human element will also be discussed
throughout the book. Using both data-driven and instinctive evaluation techniques to determine a
model’s usefulness, we can then decide how to move on to…

CRISP-DM Step 6: Deployment

If you have successfully identified your questions, prepared data that can answer those questions,
and created a model that passes the test of being interesting and useful, then you have arrived at
the point of actually using your results. This is deployment, and it is a happy and busy time for a data
miner. Activities in this phase include setting up and automating your model, meeting with consumers
of your model’s outputs, integrating with existing management or operational information systems,
feeding new learning from model use back into the model to improve its accuracy and
performance, and monitoring and measuring the outcomes of model use. Be prepared for a bit of
distrust of your model at first—you may even face pushback from groups who may feel their jobs
are threatened by this new tool, or who may not trust the reliability or accuracy of the outputs. But
don’t let this discourage you! Remember that CBS did not trust the initial predictions of the
UNIVAC, one of the first commercial computer systems, when the network used it to predict the
eventual outcome of the 1952 presidential election on election night. With only 5% of the votes
counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide;
something no pollster or election insider considered likely, or even possible. In fact, most ‘experts’
expected Stevenson to win by a narrow margin, with some acknowledging that because they
expected it to be close, Eisenhower might also prevail in a tight vote. It was only late that night,
when human vote counts confirmed that Eisenhower was running away with the election, that
CBS went on the air to acknowledge first that Eisenhower had won, and second, that UNIVAC
had predicted this very outcome hours earlier, but network brass had refused to trust the
computer’s prediction. UNIVAC was further vindicated later, when its prediction was found to
be within 1% of what the eventual tally showed. New technology is often unsettling to people,
and it is hard sometimes to trust what computers show. Be patient and specific as you explain how
a new data mining model works, what the results mean, and how they can be used.

While the UNIVAC example illustrates the power and utility of predictive computer modeling
(despite inherent mistrust), it should not be construed as a reason for blind trust either. In the days of
UNIVAC, the biggest problem was the newness of the technology. It was doing something no
one really expected or could explain, and because few people understood how the computer
worked, it was hard to trust it. Today we face a different but equally troubling problem: computers
have become ubiquitous, and too often, we don’t question enough whether or not the results are
accurate and meaningful. In order for data mining models to be effectively deployed, balance must
be struck. By clearly communicating a model’s function and utility to stakeholders, thoroughly
testing and proving the model, then planning for and monitoring its implementation, data mining
models can be effectively introduced into the organizational flow. Failure to carefully and
effectively manage deployment, however, can sink even the best and most effective models.

DATA MINING AND YOU

Because data mining can be applied to such a wide array of professional fields, this book has been
written with the intent of explaining data mining in plain English, using software tools that are
accessible and intuitive to everyone. You may not have studied algorithms, data structures, or
programming, but you may have questions that can be answered through data mining. It is our
hope that by writing in an informal tone and by illustrating data mining concepts with accessible,
logical examples, data mining can become a useful tool for you regardless of your previous level of
data analysis or computing expertise. Let’s start digging!


CHAPTER TWO:
ORGANIZATIONAL UNDERSTANDING AND DATA
UNDERSTANDING

CONTEXT AND PERSPECTIVE

Consider some of the activities you’ve been involved with in the past three or four days. Have you
purchased groceries or gasoline? Attended a concert, movie or other public event? Perhaps you
went out to eat at a restaurant, stopped by your local post office to mail a package, made a
purchase online, or placed a phone call to a utility company. Every day, our lives are filled with
interactions – encounters with companies, other individuals, the government, and various other
organizations.

In today’s technology-driven society, many of those encounters involve the transfer of information
electronically. That information is recorded and passed across networks in order to complete
financial transactions, reassign ownership or responsibility, and enable delivery of goods and
services. Think about the amount of data collected each time even one of these activities occurs.

Take the grocery store for example. If you take items off the shelf, those items will have to be
replenished for future shoppers – perhaps even for yourself – after all, you’ll need to make similar
purchases again when that case of cereal runs out in a few weeks. The grocery store must
constantly replenish its supply of inventory, keeping the items people want in stock while
maintaining freshness in the products they sell. It makes sense that large databases are running
behind the scenes, recording data about what you bought and how much of it, as you check out
and pay your grocery bill. All of that data must be recorded and then reported to someone whose
job it is to reorder items for the store’s inventory.

However, in the world of data mining, simply keeping inventory up-to-date is only the beginning.
Does your grocery store require you to carry a frequent shopper card or similar device which,
when scanned at checkout time, gives you the best price on each item you’re buying? If so, they
can now begin not only to keep track of store-wide purchasing trends, but individual purchasing
trends as well. The store can target its marketing to you by sending mailers with coupons for products
you tend to purchase most frequently.

Now let’s take it one step further. Remember, if you can, what types of information you provided
when you filled out the form to receive your frequent shopper card. You probably indicated your
address, date of birth (or at least birth year), whether you’re male or female, and perhaps the size of
your family, annual household income range, or other such information. Think about the range of
possibilities now open to your grocery store as they analyze that vast amount of data they collect at
the cash register each day:

• Using ZIP codes, the store can locate the areas of greatest customer density, perhaps aiding their decision about the construction location for their next store (a minimal sketch of this kind of tally appears below).
• Using information regarding customer gender, the store may be able to tailor marketing displays or promotions to the preferences of male or female customers.
• With age information, the store can avoid mailing coupons for baby food to elderly customers, or promotions for feminine hygiene products to households with a single male occupant.
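
As a minimal sketch of the first idea, assuming hypothetical loyalty-card records (the layout, IDs, and ZIP codes below are invented for illustration):

```python
from collections import Counter

# Hypothetical frequent-shopper signups: (customer_id, zip_code).
signups = [
    ("C001", "84321"), ("C002", "84321"), ("C003", "84341"),
    ("C004", "84321"), ("C005", "84335"), ("C006", "84321"),
]

# Tally customers per ZIP code to find the densest areas.
density = Counter(zip_code for _, zip_code in signups)
print(density.most_common(2))  # e.g., [('84321', 4), ('84341', 1)]
```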

These are only a few of the many examples of potential uses for data mining. Perhaps as you read
through this introduction, some other potential uses for data mining came to your mind. You may
have also wondered how ethical some of these applications might be. This text has been designed
to help you understand not only the possibilities brought about through data mining, but also the
techniques involved in making those possibilities a reality while accepting the responsibility that
accompanies the collection and use of such vast amounts of personal information.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
• Define the discipline of Data Mining
• List and define various types of data
• List and define various sources of data
• Explain the fundamental differences between databases, data warehouses and data sets

• Explain some of the ethical dilemmas associated with data mining and outline possible solutions

PURPOSES, INTENTS AND LIMITATIONS OF DATA MINING

Data mining, as explained in Chapter 1 of this text, applies statistical and logical methods to large
data sets. These methods can be used to categorize the data, or they can be used to create predictive
models. Categorizations of large sets may include grouping people into similar types of
classifications, or identifying similar characteristics across a large number of observations.

Predictive models however, transform these descriptions into expectations upon which we can
base decisions. For example, the owner of a book-selling Web site could project how frequently
she may need to restock her supply of a given title, or the owner of a ski resort may attempt to
predict the earliest possible opening date based on projected snow arrivals and accumulations.

It is important to recognize that data mining cannot provide answers to every question, nor can we
expect that predictive models will always yield results which will in fact turn out to be the reality.
Data mining is limited to the data that has been collected. And those limitations may be many.
We must remember that the data may not be completely representative of the group of individuals
to which we would like to apply our results. The data may have been collected incorrectly, or it
may be out-of-date. There is an expression which can adequately be applied to data mining,
among many other things: GIGO, or Garbage In, Garbage Out. The quality of our data mining results
will directly depend upon the quality of our data collection and organization. Even after doing our
very best to collect high quality data, we must still remember to base decisions not only on data
mining results, but also on available resources, acceptable amounts of risk, and plain old common
sense.

DATABASE, DATA WAREHOUSE, DATA MART, DATA SET…?

In order to understand data mining, it is important to understand the nature of databases, data
collection and data organization. This is fundamental to the discipline of Data Mining, and will
directly impact the quality and reliability of all data mining activities. In this section, we will
examine the differences between databases, data warehouses, and data sets. We will also
examine some of the variations in terminology used to describe data attributes.

Although we will be examining the differences between databases, data warehouses and data sets,
we will begin by discussing what they have in common. In Figure 2-1, we see some data organized
into rows (shown here as 1, 2, etc.) and columns (shown here as A, B, etc.). In varying data
environments, these may be referred to by differing names. In a database, rows would be referred
to as tuples or records, while the columns would be referred to as fields.

Figure 2-1: Data arranged in columns and rows.

In data warehouses and data sets, rows are sometimes referred to as observations, examples or
cases, and columns are sometimes called variables or attributes. For purposes of consistency in
this book, we will use the terminology of observations for rows and attributes for columns. It is
important to note that RapidMiner will use the term examples for rows of data, so keep this in
mind throughout the rest of the text.

A database is an organized grouping of information within a specific structure. Database
containers, such as the one pictured in Figure 2-2, are called tables in a database environment.
Most databases in use today are relational databases—they are designed using many tables which
relate to one another in a logical fashion. Relational databases generally contain dozens or even
hundreds of tables, depending upon the size of the organization.

16
Chapter 2: Organizational Understanding and Data Understanding

Figure 2-2: A simple database with a relation between two tables.

Figure 2-2 depicts a relational database environment with two tables. The first table contains
information about pet owners; the second, information about pets. The tables are related by the
single column they have in common: Owner_ID. By relating tables to one another, we can reduce
redundancy of data and improve database performance. The process of breaking tables apart and
thereby reducing data redundancy is called normalization.

Most relational databases which are designed to handle a high number of reads and writes (updates
and retrievals of information) are referred to as OLTP (online transaction processing) systems.
OLTP systems are very efficient for high volume activities such as cashiering, where many items
are being recorded via bar code scanners in a very short period of time. However, using OLTP
databases for analysis is generally not very efficient, because in order to retrieve data from multiple
tables at the same time, a query containing joins must be written. A query is simply a method
retrieving data from database tables for viewing. Queries are usually written in a language called
SQL (Structured Query Language; pronounced ‘sequel’). Because it is not very useful to only
query pet names or owner names, for example, we must join two or more tables together in order
to retrieve both pets and owners at the same time. Joining requires that the computer match the
Owner_ID column in the Owners table to the Owner_ID column in the Pets table. When tables
contain thousands or even millions of rows of data, this matching process can be very intensive
and time consuming on even the most robust computers.
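
To make the join concrete, here is a small sketch using Python’s built-in sqlite3 module (Python and SQLite are not tools used in this book; only the Owner_ID column comes from the example above, and the table contents and remaining column names are our own illustrative assumptions):

```python
import sqlite3

# A minimal in-memory version of the Owners/Pets example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owners (Owner_ID INTEGER PRIMARY KEY, Owner_Name TEXT);
    CREATE TABLE Pets   (Pet_ID   INTEGER PRIMARY KEY, Pet_Name  TEXT,
                         Owner_ID INTEGER REFERENCES Owners(Owner_ID));
    INSERT INTO Owners VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Pets   VALUES (10, 'Rex', 1), (11, 'Whiskers', 2);
""")

# The join: match Owner_ID in Owners to Owner_ID in Pets so that owners
# and pets can be retrieved together in a single query.
rows = conn.execute("""
    SELECT Owners.Owner_Name, Pets.Pet_Name
    FROM Owners JOIN Pets ON Owners.Owner_ID = Pets.Owner_ID
""").fetchall()
print(rows)  # e.g., [('Alice', 'Rex'), ('Bob', 'Whiskers')]
```

Note how the joined result pairs each pet with its owner in a single row; this is exactly the denormalized shape that the data warehouses and data sets discussed next store up front.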

For much more on database design and management, check out geekgirls.com:
(http://www.geekgirls.com/menu_databases.htm).


In order to keep our transactional databases running quickly and smoothly, we may wish to create
a data warehouse. A data warehouse is a type of large database that has been denormalized and
archived. Denormalization is the process of intentionally combining some tables into a single
table in spite of the fact that this may introduce duplicate data in some columns (or in other words,
attributes).
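In SQL terms, denormalization can be as simple as materializing the result of a join into a new, single table. A hedged sketch, reusing the hypothetical pet-owner tables from earlier (CREATE TABLE ... AS is supported by most engines, with minor syntax differences):

    CREATE TABLE Owners_And_Pets AS
    SELECT o.Owner_ID, o.Owner_Name, p.Pet_Name
      FROM Owners o
      JOIN Pets p ON p.Owner_ID = o.Owner_ID;

Owner_Name now repeats on every row for owners with more than one pet, which is exactly the duplication described above.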

Figure 2-3: A combination of the tables into a single data set.

Figure 2-3 depicts what our simple example data might look like if it were in a data warehouse.
When we design databases in this way, we reduce the number of joins necessary to query related
data, thereby speeding up the process of analyzing our data. Databases designed in this manner are
called OLAP (online analytical processing) systems.

Transactional systems and analytical systems have conflicting purposes when it comes to database
speed and performance. For this reason, it is difficult to design a single system which will serve
both purposes. This is why data warehouses generally contain archived data. Archived data are
data that have been copied out of a transactional database. Denormalization typically takes place at
the time data are copied out of the transactional system. It is important to keep in mind that if a
copy of the data is made in the data warehouse, the data may become out-of-synch. This happens
when a copy is made in the data warehouse and then later, a change to the original record
(observation) is made in the source database. Data mining activities performed on out-of-synch
observations may be useless, or worse, misleading. An alternative archiving method would be to
move the data out of the transactional system. This ensures that data won’t get out-of-synch,
however, it also makes the data unavailable should a user of the transactional system need to view
or update it.

A data set is a subset of a database or a data warehouse. It is usually denormalized so that only one table is used. The creation of a data set may involve several steps, such as appending or combining tables from the source database, or simplifying some data expressions. One example
of this may be changing a date/time format from ‘10-DEC-2002 12:21:56’ to ‘12/10/02’. If this
latter date format is adequate for the type of data mining being performed, it would make sense to
simplify the attribute containing dates and times when we create our data set. Data sets may be
made up of a representative sample of a larger set of data, or they may contain all observations
relevant to a specific group. We will discuss sampling methods and practices in Chapter 3.
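Returning to the date/time example: that kind of simplification is often performed right in the extraction query. In an Oracle-style engine, for instance, something like the following would convert ‘10-DEC-2002 12:21:56’ to ‘12/10/02’ as the data set is created (the table and column names here are hypothetical):

    SELECT TO_CHAR(Purchase_Timestamp, 'MM/DD/YY') AS Purchase_Date
      FROM Transactions;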

TYPES OF DATA

Thus far in this text, you’ve read about some fundamental aspects of data which are critical to the
discipline of data mining. But we haven’t spent much time discussing where those data are going to
come from. In essence, there are really two types of data that can be mined: operational and
organizational.

The most elemental type of data, operational data, comes from transactional systems which record
everyday activities. Simple encounters like buying gasoline, making an online purchase, or
checking in for a flight at the airport all result in the creation of operational data. The times,
prices and descriptions of the goods or services we have purchased are all recorded. This
information can be combined in a data warehouse or may be extracted directly into a data set from
the OLTP system.

Oftentimes, transactional data are too detailed to be of much use, or the detail may compromise
individuals’ privacy. In many instances, government, academic or not-for-profit organizations may
create data sets and then make them available to the public. For example, if we wanted to identify
regions of the United States which are historically at high risk for influenza, it would be difficult to
obtain permission and to collect doctor visit records nationwide and compile this information into
a meaningful data set. However, the U.S. Centers for Disease Control and Prevention (CDCP), do
exactly that every year. Government agencies do not always make this information immediately
available to the general public, but it often can be requested. Other organizations create such
summary data as well. The grocery store mentioned at the beginning of this chapter wouldn’t
necessarily want to analyze records of individual cans of green beans sold, but they may want to
watch trends for daily, weekly or perhaps monthly totals. Organizational data sets can help to
protect peoples’ privacy, while still proving useful to data miners watching for trends in a given
population.


Another type of data often overlooked within organizations is something called a data mart. A
data mart is an organizational data store, similar to a data warehouse, but often created with specific business units’ needs in mind, such as Marketing or Customer Service, for reporting and management purposes. Data marts are usually intentionally created by an
organization to be a type of one-stop shop for employees throughout the organization to find data
they might be looking for. Data marts may contain wonderful data, prime for data mining
activities, but they must be known, current, and accurate to be useful. They should also be well-
managed in terms of privacy and security.

All of these types of organizational data carry with them some concern. Because they are
secondary, meaning they have been derived from other more detailed primary data sources, they
may lack adequate documentation, and the rigor with which they were created can be highly
variable. Such data sources may also not be intended for general distribution, and it is always wise
to ensure proper permission is obtained before engaging in data mining activities on any data set.
Remember, simply because a data set may have been acquired from the Internet does not mean it
is in the public domain; and simply because a data set may exist within your organization does not
mean it can be freely mined. Checking with relevant managers, authors and stakeholders is critical
before beginning data mining activities.

A NOTE ABOUT PRIVACY AND SECURITY

In 2003, JetBlue Airlines supplied more than one million passenger records to a U.S. government
contractor, Torch Concepts. Torch subsequently augmented the passenger data with
additional information such as family sizes and social security numbers—information purchased
from a data broker called Acxiom. The data were intended for a data mining project in order to
develop potential terrorist profiles. All of this was done without notification or consent of
passengers. When news of the activities got out however, dozens of privacy lawsuits were filed
against JetBlue, Torch and Acxiom, and several U.S. senators called for an investigation into the
incident.

This incident serves several valuable purposes for this book. First, we should be aware that as we
gather, organize and analyze data, there are real people behind the figures. These people have
certain rights to privacy and protection against crimes such as identity theft. We as data miners have an ethical obligation to protect these individuals’ rights. This requires the utmost care in
terms of information security. Simply because a government representative or contractor asks for
data does not mean it should be given.

Beyond technological security however, we must also consider our moral obligation to those
individuals behind the numbers. Recall the grocery store shopping card example given at the
beginning of this chapter. In order to encourage use of frequent shopper cards, grocery stores
frequently list two prices for items, one with use of the card and one without. Each individual’s answer may vary, but consider this question for yourself: At what price mark-up has the grocery store crossed an ethical line between encouraging consumers to participate in frequent shopper programs, and forcing them to participate in order to afford to buy groceries? Your answer will differ from others’, but it is important to keep such moral obligations in mind when gathering, storing and mining data.

The objectives hoped for through data mining activities should never justify unethical means of
achievement. Data mining can be a powerful tool for customer relationship management,
marketing, operations management, and production, however in all cases the human element must
be kept sharply in focus. When working long hours at a data mining task, interacting primarily
with hardware, software, and numbers, it can be easy to forget about the people, which is why the human element is emphasized so strongly here.

CHAPTER SUMMARY

This chapter has introduced you to the discipline of data mining. Data mining brings statistical
and logical methods of analysis to large data sets for the purposes of describing them and using
them to create predictive models. Databases, data warehouses and data sets are all unique kinds of
digital record keeping systems, however, they do share many similarities. Data mining is generally
most effectively executed on data sets extracted from OLAP, rather than OLTP, systems.
Both operational data and organizational data provide good starting points for data mining
activities, however both come with their own issues that may inhibit quality data mining activities.
These should be mitigated before beginning to mine the data. Finally, when mining data, it is
critical to remember the human factor behind manipulation of numbers and figures. Data miners
have an ethical responsibility to the individuals whose lives may be affected by the decisions that
are made as a result of data mining activities.

REVIEW QUESTIONS

1) What is data mining in general terms?

2) What is the difference between a database, a data warehouse and a data set?

3) What are some of the limitations of data mining? How can we address those limitations?

4) What is the difference between operational and organizational data? What are the pros and
cons of each?

5) What are some of the ethical issues we face in data mining? How can they be addressed?

6) What is meant by out-of-synch data? How can this situation be remedied?

7) What is normalization? What are some reasons why it is a good thing in OLTP systems,
but not so good in OLAP systems?

EXERCISES

1) Design a relational database with at least three tables. Be sure to create the columns
necessary within each table to relate the tables to one another.

2) Design a data warehouse table with some columns which would usually be normalized.
Explain why it makes sense to denormalize in a data warehouse.

3) Perform an Internet search to find information about data security and privacy. List three
web sites that you found that provided information that could be applied to data mining.
Explain how it might be applied.

4) Find a newspaper, magazine or Internet news article related to information privacy or security. Summarize the article and explain how it might be related to data mining.


5) Using the Internet, locate a data set which is available for download. Describe the data set
(contents, purpose, size, age, etc.). Classify the data set as operational or organizational.
Summarize any requirements placed on individuals who may wish to use the data set.

6) Obtain a copy of an application for a grocery store shopping card. Summarize the type of
data requested when filling out the application. Give an example of how that data may aid
in a data mining activity. What privacy concerns arise regarding the data being collected?


CHAPTER THREE:
DATA PREPARATION

CONTEXT AND PERSPECTIVE

Jerry is the marketing manager for a small Internet design and advertising firm. Jerry’s boss asks
him to develop a data set containing information about Internet users. The company will use this
data to determine what kinds of people are using the Internet and how the firm may be able to
market their services to this group of users.

To accomplish his assignment, Jerry creates an online survey and places links to the survey on
several popular Web sites. Within two weeks, Jerry has collected enough data to begin analysis, but
he finds that his data needs to be denormalized. He also notes that some observations in the set
are missing values or they appear to contain invalid values. Jerry realizes that some additional work
on the data needs to take place before analysis begins.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
• Explain the concept and purpose of data scrubbing
• List possible solutions for handling missing data
• Explain the role of, and perform basic methods for, data reduction
• Define and handle inconsistent data
• Discuss the importance and process of attribute reduction

APPLYING THE CRISP DATA MINING MODEL

Recall from Chapter 1 that the CRISP Data Mining methodology requires three phases before any
actual data mining models are constructed. In the Context and Perspective paragraphs above, Jerry has a number of tasks before him, each of which falls into one of the first three phases of CRISP.
First, Jerry must ensure that he has developed a clear Organizational Understanding. What is
the purpose of this project for his employer? Why is he surveying Internet users? Which data
points are important to collect, which would be nice to have, and which would be irrelevant or
even distracting to the project? Once the data are collected, who will have access to the data set
and through what mechanisms? How will the business ensure privacy is protected? All of these
questions, and perhaps others, should be answered before Jerry even creates the survey mentioned
in the second paragraph above.

Once answered, Jerry can then begin to craft his survey. This is where Data Understanding
enters the process. What database system will he use? What survey software? Will he use a
publicly available tool like SurveyMonkey™, a commercial product, or something homegrown? If
he uses a publicly available tool, how will he access and extract data for mining? Can he trust this third party to secure his data, and if so, why? How will the underlying database be designed? What
mechanisms will be put in place to ensure consistency and integrity in the data? These are all
questions of data understanding. An easy example of ensuring consistency might be if a person’s
home city were to be collected as part of the data. If the online survey just provides an open text
box for entry, respondents could put just about anything as their home city. They might put New
York, NY, N.Y., Nwe York, or any number of other possible combinations, including typos. This
could be avoided by forcing users to select their home city from a dropdown menu, but
considering the number of cities there are in most countries, that list could be unacceptably long! So
the choice of how to handle this potential data consistency problem isn’t necessarily an obvious or
easy one, and this is just one of many data points to be collected. While ‘home state’ or ‘country’
may be reasonable to constrain to a dropdown, ‘city’ may have to be entered freehand into a
textbox, with some sort of data correction process to be applied later.
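As a preview of what such a correction process might look like, here is a minimal SQL sketch; the table name, column name, and list of variants are all hypothetical:

    UPDATE Survey_Responses
       SET Home_City = 'New York'
     WHERE Home_City IN ('NY', 'N.Y.', 'Nwe York');  -- variants observed in the data

Note that each variant still has to be discovered before it can be corrected, which is part of why constraining input in the first place is so attractive.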

The ‘later’ would come once the survey has been developed and deployed, and data have been
collected. With the data in place, the third CRISP-DM phase, Data Preparation, can begin. If
you haven’t installed OpenOffice and RapidMiner yet, and you want to work along with the
examples given in the rest of the book, now would be a good time to go ahead and install these
applications. Remember that both are freely available for download and installation via the
Internet, and the links to both applications are given in Chapter 1. We’ll begin by doing some data
preparation in OpenOffice Base (the database application), OpenOffice Calc (the spreadsheet
application), and then move on to other data preparation tools in RapidMiner. You should
understand that the examples of data preparation in this book are only a subset of possible data
preparation approaches.

COLLATION

Suppose that the database underlying Jerry’s Internet survey is designed as depicted in the
screenshot from OpenOffice Base in Figure 3-1.

Figure 3-1: A simple relational (one-to-one) database for Internet survey data.

This design would enable Jerry to collect data about people in one table, and data about their
Internet behaviors in another. RapidMiner would be able to connect to either of these tables in
order to mine the responses, but what if Jerry were interested in mining data from both tables at
once?

One simple way to collate data in multiple tables into a single location for data mining is to create a
database view. A view is a type of pseudo-table, created by writing a SQL statement which is
named and stored in the database. Figure 3-2 shows the creation of a view in OpenOffice Base,
while Figure 3-3 shows the view in datasheet view.


Figure 3-2: Creation of a view in OpenOffice Base.

Figure 3-3: Results of the view from Figure 3-2 in datasheet view.

The creation of views is one way that data from a relational database can be collated and organized
in preparation for data mining activities. In this example, although the personal information in the
‘Respondents’ table is only stored once in the database, it is displayed for each record in the
‘Responses’ table, creating a data set that is more easily mined because it is both richer in
information and consistent in its formatting.
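Although Figure 3-2 shows the view being created through Base’s interface, the underlying SQL might resemble the sketch below. Which attributes live in which table, and the name of the linking column, are assumptions made for illustration:

    CREATE VIEW Collated_Responses AS
    SELECT r.Respondent_ID, r.Gender, r.Birth_Year,        -- personal information
           b.Online_Gaming, b.Online_Shopping, b.Twitter   -- Internet behaviors
      FROM Respondents r
      JOIN Responses b ON b.Respondent_ID = r.Respondent_ID;

Because a view stores the query rather than the data, it always reflects the current contents of the underlying tables.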

DATA SCRUBBING

In spite of our very best efforts to maintain quality and integrity during data collection, it is
inevitable that some anomalies will be introduced into our data at some point. The process of data
scrubbing allows us to handle these anomalies in ways that make sense for us. In the remainder of
this chapter, we will examine data scrubbing in four different ways: handling missing data, reducing
data (observations), handling inconsistent data, and reducing attributes.


HANDS ON EXERCISE

Starting now, and throughout the next chapters of this book, there will be opportunities for you to
put your hands on your computer and follow along. In order to do this, you will need to be sure
to install OpenOffice and RapidMiner, as was discussed in the section A Note about Tools in
Chapter 1. You will also need to have an Internet connection to access this book’s companion
web site, where copies of all data sets used in the chapter exercises are available. The companion
web site is located at:

https://sites.google.com/site/dataminingforthemasses/

Figure 3-4. Data Mining for the Masses companion web site.

You can download the Chapter 3 data set, which is an export of the view created in OpenOffice
Base, from the web site by locating it in the list of files and then clicking the down arrow to the far
right of the file name, as indicated by the black arrows in Figure 3-4. You may want to consider
creating a folder labeled ‘data mining’ or something similar where you can keep copies of your
data—more files will be required and created as we continue through the rest of the book,
especially when we get into building data mining models in RapidMiner. Having a central place to
keep everything together will simplify things, and upon your first launch of the RapidMiner
software, you’ll be prompted to create a repository, so it’s a good idea to have a space ready. Once
you’ve downloaded the Chapter 3 data set, you’re ready to begin learning how to handle and
prepare data for mining in RapidMiner.

PREPARING RAPIDMINER, IMPORTING DATA, AND HANDLING MISSING DATA

Our first task in data preparation is to handle missing data; however, because this will be our first time using RapidMiner, the first few steps will involve getting RapidMiner set up. We’ll then move straight into handling missing data. Missing data are data that do not exist in a data set. As you can see in Figure 3-5, a missing value is not the same as zero or some other value. It is blank, and the value is unknown. Missing data are also sometimes known in the database world as null values.
Depending on your objective in data mining, you may choose to leave missing data as they are, or
you may wish to replace missing data with some other value.

Figure 3-5: Some missing data within the survey data set.

In this example, our database view has missing data in a number of its attributes. Black arrows indicate a couple of these attributes in Figure 3-5 above. In
some instances, missing data are not a problem; they are expected. For example, in the Other Social Network attribute, it is entirely possible that the survey respondent did not indicate that they use social networking sites other than the ones listed in the survey. Thus, missing data are
probably accurate and acceptable. On the other hand, in the Online Gaming attribute, there are
answers of either ‘Y’ or ‘N’, indicating that the respondent either does, or does not participate in
online gaming. But what do the missing, or null values in this attribute indicate? It is unknown to
us. For the purposes of data mining, there are a number of options available for handling missing
data.
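Before turning to RapidMiner, it may help to see one of those options, replacing missing values with a constant, written out. A minimal SQL sketch, assuming the survey lived in a hypothetical table named Survey_Responses (in the database world, remember, missing values are nulls):

    -- Replace missing Online_Gaming answers with 'N'
    UPDATE Survey_Responses
       SET Online_Gaming = 'N'
     WHERE Online_Gaming IS NULL;

RapidMiner’s operators accomplish the same kind of transformation without hand-written code, as the steps below demonstrate.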

To learn about handling missing data in RapidMiner, follow the steps below to connect to your
data set and begin modifying it:


1) Launch the RapidMiner application. This can be done by double clicking your desktop
icon or by finding it in your application menu. The first time RapidMiner is launched, you
will get the message depicted in Figure 3-6. Click OK to set up a repository.

Figure 3-6. The prompt to create an initial data repository for RapidMiner to use.

2) For most purposes (and for all examples in this book), a local repository will be sufficient.
Click OK to accept the default option as depicted in Figure 3-7.

Figure 3-7. Setting up a local data repository.

3) In the example given in Figure 3-8, we have named our repository ‘RapidMinerBook’, and
pointed it to our data folder, RapidMiner Data, which is found on our E: drive. Use the
folder icon to browse and find the folder or directory you created for storing your
RapidMiner data sets. Then click Finish.


Figure 3-8. Setting the repository name and directory.

4) You may get a notice that updates are available. If this is the case, go ahead and accept the
option to update, where you will be presented with a window similar to Figure 3-9. Take
advantage of the opportunity to add in the Text Mining module (indicated by the black
arrow), since Chapter 12 will deal with Text Mining. Double click the check box to add a
green check mark indicating that you wish to install or update the module, then click
Install.


Figure 3-9. Installing updates and adding the Text Mining module.

5) Once the updates and installations are complete, RapidMiner will open and your window
should look like Figure 3-10:

Figure 3-10. The RapidMiner start screen.


6) Next we will need to start a new data mining project in RapidMiner. To do this we click
on the ‘New’ icon as indicated by the black arrow in Figure 3-10. The resulting window
should look like Figure 3-11.

Figure 3-11. Getting started with a new project in RapidMiner.

7) Within RapidMiner there are two main areas that hold useful tools: Repositories and
Operators. These are accessed by the tabs indicated by the black arrow in Figure 3-11.
The Repositories area is the place where you will connect to each data set you wish to
mine. The Operators area is where all data mining tools are located. These are used to
build models and otherwise manipulate data sets. Click on Repositories. You will find that
the initial repository we created upon our first launch of the RapidMiner software is
present in the list.


Figure 3-12. Adding a data set to a repository in RapidMiner.

8) Because the focus of this book is to introduce data mining to the broadest possible
audience, we will not use all of the tools available in RapidMiner. At this point, we could
do a number of complicated and technical things, such as connecting to a remote
enterprise database. This however would likely be overwhelming and inaccessible to many
readers. For the purposes of this text, we will therefore only be connecting to comma separated values (CSV) files. You should know that most data mining projects
incorporate extremely large data sets encompassing dozens of attributes and thousands or
even millions of observations. We will use smaller data sets in this text, but the
foundational concepts illustrated are the same for large or small data. The Chapter 3 data
set downloaded from the companion web site is very small, comprised of only 15 attributes
and 11 observations. Our next step is to connect to this data set. Click on the Import
icon, which is the second icon from the left in the Repositories area, as indicated by the
black arrow in Figure 3-12.


Figure 3-13. Importing a CSV file.

9) You will see by the black arrow in Figure 3-13 that you can import from a number of
different data sources. Note that by importing, you are bringing your data into a
RapidMiner file, rather than working with data that are already stored elsewhere. If your
data set is extremely large, it may take some time to import the data, and you should be
mindful of disk space that is available to you. As data sets grow, you may be better off
using the first (leftmost) icon to set up a remote repository in order to work with data
already stored in other areas. As previously explained, all examples in this text will be
conducted by importing CSV files that are small enough to work with quickly and easily.
Click on the Import CSV File option.


Figure 3-14. Locating the data set to import.

10) When the data import wizard opens, navigate to the folder where your data set is stored
and select the file. In this example, only one file is visible: the Chapter 3 data set
downloaded from the companion web site. Click Next.

Figure 3-15. Configuring attribute separation.



11) By default, RapidMiner looks for semicolons as attribute separators in our data. We must
change the column separation delimiter to be Comma, in order to be able to see each
attribute separated correctly. Note: If your data naturally contain commas, then you
should be careful as you are collecting or collating your data to use a delimiter that does
not naturally occur in the data. A semicolon or a pipe (|) symbol can often help you avoid
unintended column separation.

Figure 3-16. A preview of attributes separated into columns with the Comma option selected.

12) Once the preview shows columns for each attribute, click Next. Note that RapidMiner has
treated our attribute names as if they are our first row of data, or in other words, our first
observation. To fix this, click the Annotation dropdown box next to this row and set it to
Name, as indicated in Figure 3-17. With the attribute names designated correctly, click
Next.


Figure 3-17. Setting the attribute names.

13) In step 4 of the data import wizard, RapidMiner will take its best guess at a data type for
each attribute. The data type is the kind of data an attribute holds, such as numeric, text or
date. These can be changed in this screen, but for our purposes in Chapter 3, we will
accept the defaults. Just below each attribute’s data type, RapidMiner also indicates a Role
for each attribute to play. By default, all columns are imported simply with the role of
‘attribute’, however we can change these here if we know that one attribute is going to play
a specific role in a data mining model that we will create. Since roles can be set within
RapidMiner’s main process window when building data mining models, we will accept the
default of ‘attribute’ whenever we import data sets in exercises in this text. Also, you may
note that the check boxes above each attribute in this window allow you to not import
some of the attributes if you don’t want to. This is accomplished by simply clearing the
checkbox. Again, attributes can be excluded from models later, so for the purposes of this
text, we will always include all attributes when importing data. All of these functions are
indicated by the black arrows in Figure 3-18. Go ahead and accept these defaults as they
stand and click Next.


Figure 3-18. Setting data types, roles and import attributes.

14) The final step is to choose a repository to store the data set in, and to give the data set a
name within RapidMiner. In Figure 3-19, we have chosen to store the data set in the
RapidMiner Book repository, and given it the name Chapter3. Once we click Finish, this
data set will become available to us for any type of data mining process we would like to
build upon it.

Figure 3-19. Selecting the repository and setting a data set name
for our imported CSV file.


15) We can now see that the data set is available for use in RapidMiner. To begin using it in a
RapidMiner data mining process, simply drag the data set and drop it in the Main Process
window, as has been done in Figure 3-20.

Figure 3-20. Adding a data set to a process in RapidMiner.

16) Each rectangle in a process in RapidMiner is an operator. The Retrieve operator simply
gets a data set and makes it available for use. The small half-circles on the sides of the
operator, and of the Main Process window, are called ports. In Figure 3-20, an output (out)
port from our data set’s Retrieve operator is connected to a result set (res) port via a spline.
The splines, combined with the operators connected by them, constitute a data mining
stream. To run a data mining stream and see the results, click the blue, triangular Play
button in the toolbar at the top of the RapidMiner window. This will change your view
from Design Perspective, which is the view pictured in Figure 3-20 where you can
change your data mining stream, to Results Perspective, which shows your stream’s
results, as pictured in Figure 3-21. When you hit the Play button, you may be prompted to
save your process, and you are encouraged to do so. RapidMiner may also ask you if you
wish to overwrite a saved process each time it is run, and you can select your preference on
this prompt as well.


Figure 3-21. Results perspective for the Chapter3 data set.

17) You can toggle between design and results perspectives using the two icons indicated by
the black arrows in Figure 3-21. As you can see, there is a rich set of information in results
perspective. In the meta data view, basic descriptive statistics are given. It is here that we
can also get a sense for the number of observations that have missing values in each
attribute of the data set. The columns in meta data view can be stretched to make their
contents more readable. This is accomplished by hovering your mouse over the faint
vertical gray bars between each column, then clicking and dragging to make them wider.
The information presented here can be very helpful in deciding where missing data are
located, and what to do about it. Take for example the Online_Gaming attribute. The
results perspective shows us that we have six ‘N’ responses in that attribute, two ‘Y’
responses, and three missing. We could use the mode, or most common response, to replace the missing values. This of course assumes that the most common response is accurate for all observations, and that may not be the case. As data miners, we must be
responsible for thinking about each change we make in our data, and whether or not we
threaten the integrity of our data by making that change. In some instances the
consequences could be drastic. Consider, for instance, if the mode for an attribute of
Felony_Conviction were ‘Y’. Would we really want to convert all missing values in this
attribute to ‘Y’ simply because that is the mode in our data set? Probably not; the implications about the persons represented in each observation of our data set would be
unfair and misrepresentative. Thus, we will change the missing values in the current
example to illustrate how to handle missing values in RapidMiner, recognizing that what we
are about to do won’t always be the right way to handle missing data. In order to have
RapidMiner handle the change from missing to ‘N’ for the three observations in our
Online_Gaming variable, click the design perspective icon.

Figure 3-22. Finding an operator to handle missing values.

18) In order to find a tool in the Operators area, you can navigate through the folder tree in
the lower left hand corner. RapidMiner offers many tools, and sometimes, finding the one
you want can be tricky. There is a handy search box, indicated by the black arrow in Figure
3-22, that allows you to type in key words to find tools that might do what you need. Type
the word ‘missing’ into this box, and you will see that RapidMiner automatically searches
for tools with this word in their name. We want to replace missing values, and we can see
that within the Data Transformation tool area, inside a sub-area called Value Modification,
there is an operator called Replace Missing Values. Let’s add this operator to our stream.
Click and hold on the operator name, and drag it up to your spline. When you point your
mouse cursor on the spline, the spline will turn slightly bold, indicating that when you let
go of your mouse button, the operator will be connected into the stream. If you let go and
the Replace Missing Values operator fails to connect into your stream, you can reconfigure your splines manually. Simply click on the out port in your Retrieve operator, and then
click on the exa port on the Replace Missing Values operator. Exa stands for example set,
and remember that ‘examples’ is the word RapidMiner uses for observations in a data set.
Be sure the exa port from the Replace Missing Values operator is connected to your result
set (res) port so that when you run your process, you will have output. Your model should
now look similar to Figure 3-23.

Figure 3-23. Adding a missing value operator to the stream.

19) When an operator is selected in RapidMiner, it has an orange rectangle around it. This will
also enable you to modify that operator’s parameters, or properties. The Parameters pane
is located on the right side of the RapidMiner window, as indicated by the black arrow in
Figure 3-23. For this exercise, we have decided to change all missing values in the
Online_Gaming attribute to be ‘N’, since this is the most common response in that
attribute. To do this, change the ‘attribute filter type’ to ‘single’, and you will see that a
dropdown box appears, allowing you to choose the Online_Gaming attribute as the target
for modification. Next, expand the ‘default’ dropdown box, and select ‘value’, which will
cause a ‘replenishment value’ box to appear. Type the replacement value ‘N’ in this box.
Note that you may need to expand your RapidMiner window, or use the vertical scroll bar
on the left of the Parameters pane in order to see all options, as the options change based
on what you have selected. When you are finished, your parameters should look like the ones in Figure 3-24. Parameter settings that were changed are highlighted with black
arrows.

Figure 3-24. Missing value parameters.

20) You should understand that there are many other options available to you in the
parameters pane. We will not explore all of them here, but feel free to experiment with
them. For example, instead of changing a single attribute at a time, you could change a
subset of the attributes in your data set. You will learn much about the flexibility and
power of RapidMiner by trying out different tools and features. When you have your
parameter set, click the play button. This will run your process and switch you to results
perspective once again. Your results should look like Figure 3-25.


Figure 3-25. Results of changing missing data.

21) You can see now that the Online_Gaming attribute has been moved to the top of our list,
and that there are zero missing values. Click on the Data View radio button, above and to
the left hand side of the attribute list to see your data in a spreadsheet-type view. You will
see that the Online_Gaming variable is now populated with only ‘Y’ and ‘N’ values. We
have successfully replaced all missing values in that attribute. While in Data View, take
note of how missing values are annotated in other variables, Online_Shopping for example.
A question mark (?) denotes a missing value in an observation. Suppose that for this
variable, we do not wish to replace the null values with the mode, but rather, that we wish
to remove those observations from our data set prior to mining it. This is accomplished
through data reduction.

DATA REDUCTION

Go ahead and switch back to design perspective. The next set of steps will teach you to reduce the
number of observations in your data set through the process of filtering.

1) In the search box within the Operators tab, type in the word ‘filter’. This will help you
locate the ‘Filter Examples’ operator, which is what we will use in this example. Drag the Filter Examples operator over and connect it into your stream, right after the Replace
Missing Values operator. Your window will look like Figure 3-26.

Figure 3-26. Adding a filter to the stream.

2) In the condition class, choose ‘attribute_value_filter’, and for the parameter_string, type
the following: Online_Shopping=. Be sure to include the period. This parameter string
refers to our attribute, Online_Shopping, and it tells RapidMiner to filter out all
observations where the value in that attribute is missing. This is a bit confusing, because in Data View in results perspective, missing values are denoted by a question mark (?), but when entering the parameter string, they are denoted by a period (.). Once you’ve typed
these parameter values in, your screen will look like Figure 3-27.


Figure 3-27. Adding observation filter parameters.

Go ahead and run your model by clicking the play button. In results perspective, you will now see
that your data set has been reduced from eleven observations (or examples) to nine. This is
because the two observations where the Online_Shopping attribute had a missing value have been
removed. You’ll be able to see that they’re gone by selecting the Data View radio button. They
have not been deleted from the original source data; they are simply removed from the data set at
the point in the stream where the filter operator is located and will no longer be considered in any
downstream data mining operations. In instances where the missing value cannot be safely
assumed or computed, removal of the entire observation is often the best course of action. When
attributes are numeric in nature, such as with ages or number of visits to a certain place, an
arithmetic measure of central tendency, such as mean, median or mode might be an acceptable
replacement for missing values, but in more subjective attributes, such as whether one is an online
shopper or not, you may be better off simply filtering out observations where the datum is missing.
(One cool trick you can try in RapidMiner is to use the Invert Filter option in design perspective.
In this example, if you check that check box in the parameters pane of the Filter Examples
operator, you will keep the missing observations, and filter out the rest.)
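For comparison, the same two filters expressed in SQL would be simple WHERE clauses, again assuming a hypothetical Survey_Responses table:

    -- Keep only observations where Online_Shopping has a value
    SELECT * FROM Survey_Responses
     WHERE Online_Shopping IS NOT NULL;

    -- The Invert Filter analogue: keep only the observations missing it
    SELECT * FROM Survey_Responses
     WHERE Online_Shopping IS NULL;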

Data mining can be confusing and overwhelming, especially when data sets get large. It doesn’t
have to be though, if we manage our data well. The previous example has shown how to filter out
observations containing undesired data (or missing data) in an attribute, but we can also reduce
data to test out a data mining model on a smaller subset of our data. This can greatly reduce processing time while testing a model to see if it will work to answer our questions. Follow the
steps below to take a sample of our data set in RapidMiner.

1) Using the search techniques previously demonstrated, use the Operators search feature to
find an operator called ‘Sample’ and add this to your stream. In the parameters pane, set the sample to be a ‘relative’ sample, and then indicate that you want to retain 50% of your observations in the resulting data set by typing .5 into the sample ratio field. Your window
should look like Figure 3-28.

Figure 3-28. Taking a 50% random sample of the data set.

2) When you run your model now, you will find that your results only contain four or five
observations, randomly selected from the nine that were remaining after our filter operator
removed records that had missing Online_Shopping values.
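If these data lived in a database instead, a comparable random sample could be sketched in SQL like this (ORDER BY RANDOM() is SQLite/PostgreSQL syntax; a strictly relative 50% sample would need engine-specific features such as PostgreSQL’s TABLESAMPLE):

    -- Roughly half of the nine remaining observations
    SELECT * FROM Survey_Responses
     WHERE Online_Shopping IS NOT NULL
     ORDER BY RANDOM()
     LIMIT 5;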

Thus you can see that there are many ways, and various reasons to reduce data by decreasing the
number of observations in your data set. We’ll now move on to handling inconsistent data, but
before doing so, it is going to be important to reset our data back to its original form. While
filtering, we removed an observation that we will need in order to illustrate what inconsistent data
is, and to demonstrate how to handle it in RapidMiner. This is a good time to learn how to
remove operators from your stream. Switch back to design perspective and click on your
Sampling operator. Next, right click and choose Delete, or simply press the Delete key on your keyboard. Delete the Filter Examples operator at this time as well. Note that your spline that was
connected to the res port is also deleted. This is not a problem; you can reconnect the exa port
from the Replace Missing Values operator to the res port, or you will find that the spline will
reappear when you complete the steps under Handling Inconsistent Data.

HANDLING INCONSISTENT DATA

Inconsistent data is different from missing data. Inconsistent data occurs when a value does exist, but that value is not valid or meaningful. Refer back to Figure 3-25; a close-up version of that image is shown here as Figure 3-29.

Figure 3-29. Inconsistent data in the Twitter attribute.

What is that 99 doing there? It seems that the only two valid values for the Twitter attribute
should be ‘Y’ and ‘N’. This is a value that is inconsistent and is therefore meaningless. As data
miners, we can decide if we want to filter this observation out, as we did with the missing
Online_Shopping records, or, we could use an operator designed to allow us to replace certain
values with others.

1) Return to design perspective if you are not already there. Ensure that you have deleted
your sampling and filter operators from your stream, so that your window looks like Figure
3-30.

Figure 3-30. Returning to a full data set in RapidMiner.



2) Note that we don’t need to remove the Replace Missing Values operator, because it is not
removing any observations in our data set. It only changes the values in the
Online_Gaming attribute, which won’t affect our next operator. Use the search feature in
the Operators tab to find an operator called Replace. Drag this operator into your stream.
If your splines had been disconnected during the deletion of the sampling and filtering
operators, as is the case in Figure 3-30, you will see that your splines are automatically
reconnected when you add the Replace operator to the stream.

3) In the parameters pane, change the attribute filter type to single, then indicate Twitter as
the attribute to be modified. In truth, in this data set there is only one instance of the value
99 across all attributes and observations, so this change to a single attribute is not actually
necessary in this example, but it is good to be thoughtful and intentional with every step in
a data mining process. Most data sets will be far larger and more complex than the Chapter 3 data set we are currently working with. In the ‘replace what’ field, type the value 99, since
this is the value we’re looking to replace. Finally, in the ‘replace by’ field, we must decide
what we want to have in the place of the 99. If we leave this field blank, then the observation will have a missing value (?) when we run the model and switch to Data View in
results perspective. We could also choose the mode of ‘N’, and given that 80% of the
survey respondents indicated that they did not use Twitter, this would seem a safe course
of action. You may choose the value you would like to use. For the book’s example, we
will enter ‘N’ and then run our model. You can see in Figure 3-31 that we now have nine
values of ‘N’, and two of ‘Y’ for our Twitter attribute.

Figure 3-31. Replacement of inconsistent value with a consistent one.


Keep in mind that not all inconsistent data is going to be as easy to handle as replacing a single
value. It would be entirely possible that in addition to the inconsistent value of 99, values of 87,
96, 101, or others could be present in a data set. If this were the case, it might take multiple
replacements and/or missing data operators to prepare the data set for mining. In numeric data
we might also come across data which are accurate, but which are also statistical outliers. These
might also be considered to be inconsistent data, so an example in a later chapter will illustrate the
handling of statistical outliers. Sometimes data scrubbing can become tedious, but it will ultimately
affect the usefulness of data mining results, so these types of activities are important, and attention
to detail is critical.
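Expressed in SQL, the same cleanup might look like the following hedged sketch; the second statement shows how a whole family of invalid codes (87, 96, 101 and so on) could be caught in one pass. Null values are unaffected by the NOT IN test, so missing data would still need separate handling:

    -- Replace the single known bad value
    UPDATE Survey_Responses
       SET Twitter = 'N'
     WHERE Twitter = '99';

    -- Or replace anything outside the valid domain of 'Y'/'N'
    UPDATE Survey_Responses
       SET Twitter = 'N'
     WHERE Twitter NOT IN ('Y', 'N');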

ATTRIBUTE REDUCTION

In many data sets, you will find that some attributes are simply irrelevant to answering a given
question. In Chapter 4 we will discuss methods for evaluating correlation, or the strength of
relationships between given attributes. In some instances, you will not know the extent to which a
certain attribute will be useful without statistically assessing that attribute’s correlation to the other
data you will be evaluating. In our process stream in RapidMiner, we can remove attributes that
are not very interesting in terms of answering a given question without completely deleting them
from the data set. Remember, simply because certain variables in a data set aren’t interesting for
answering a certain question doesn’t mean those variables won’t ever be interesting. This is why
we recommended bringing in all attributes when importing the Chapter 3 data set earlier in this
chapter—uninteresting or irrelevant attributes are easy to exclude within your stream by following
these steps:

1) Return to design perspective. In the operator search field, type Select Attribute. The
Select Attributes operator will appear. Drag it onto the end of your stream so that it fits
between the Replace operator and the result set port. Your window should look like
Figure 3-32.


Figure 3-32. Selecting a subset of a data set’s attributes.

2) In the Parameters pane, set the attribute filter type to ‘subset’, then click the Select
Attributes button; a window similar to Figure 3-33 will appear.

Figure 3-33. The attribute subset selection window.


3) Using the green right and left arrows, you can select which attributes you would like to
keep. Suppose we were going to study the demographics of Internet users. In this
instance, we might select Birth_Year, Gender, Marital_Status, Race, and perhaps
Years_on_Internet, and move them to the right under Selected Attributes using the right
green arrow. You can select more than one attribute at a time by holding down your
control or shift keys (on a Windows computer) while clicking on the attributes you want to
select or deselect. We could then click OK, and these would be the only attributes we
would see in results perspective when we run our model. All subsequent downstream data
mining operations added to our model will act only upon this subset of our attributes.
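The database analogue of attribute selection is simply listing the columns you want rather than selecting them all; a sketch using the demographic attributes named above and our hypothetical survey table:

    SELECT Birth_Year, Gender, Marital_Status, Race, Years_on_Internet
      FROM Survey_Responses;

As with the Select Attributes operator, the unselected columns still exist in the source; they are simply left out of this particular result.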

CHAPTER SUMMARY

This chapter has introduced you to a number of concepts related to data preparation. Recall that
Data Preparation is the third step in the CRISP-DM process. Once you have established
Organizational Understanding as it relates to your data mining plans, and developed Data
Understanding in terms of what data you need, what data you have, where it is located, and so
forth; you can begin to prepare your data for mining. This has been the focus of this chapter.

The chapter used a small and very simple data set to help you learn to set up the RapidMiner data
mining environment. You have learned about viewing data sets in OpenOffice Base, and learned
some ways that data sets in relational databases can be collated. You have also learned about
comma separated values (CSV) files.

We have then stepped through adding CSV files to a RapidMiner data repository in order to
handle missing data, reduce data through observation filtering, handle inconsistencies in data, and
reduce the number of attributes in a model. All of these methods will be used in future chapters to
prepare data for modeling.

Data mining is most successful when conducted upon a foundation of well-prepared data. Recall
the quotation from Chapter 1 from Alice’s Adventures in Wonderland—which way you go does not
matter very much if you don’t know, or don’t care, where you are going. Likewise, the value of
where you arrive when you complete a data mining exercise will largely depend upon how well you
prepared to get there. Sometimes we hear the phrase “It’s better than nothing”. Well, in data
mining, results gleaned from poorly prepared data might be “Worse than nothing”, because they
may be misleading. Decisions based upon them could lead an organization down a detrimental
and costly path. Learn to value the process of data preparation, and you will learn to be a better
data miner.

REVIEW QUESTIONS

1) What are the four main processes of data preparation discussed in this chapter? What do
they accomplish and why are they important?

2) What are some ways to collate data from a relational database?

3) For what kinds of problems might a data set need to be scrubbed?

4) Why is it often better to perform reductions using operators rather than excluding
attributes or observations as data are imported?

5) What is a data repository in RapidMiner and how is one created?

6) How might inconsistent data cause later trouble in data mining activities?

EXERCISE

1) Locate a data set of any number of attributes and observations. You may have access to
data sets through personal data collection or through your employment, although if you
use an employer’s data, make sure to do so only by permission! You can also search the
Internet for data set libraries. A simple search on the term ‘data sets’ in your favorite
search engine will yield a number of web sites that offer libraries of data sets that you can
use for academic and learning purposes. Download a data set that looks interesting to you
and complete the following:

2) Format the data set into a CSV file. It may come in this format, or you may need to open
the data in OpenOffice Calc or some similar software, and then use the File > Save As
feature to save your data as a CSV file.


3) Import your data into your RapidMiner repository. Save it in the repository as
Chapter3_Exercise.

4) Create a new, blank process stream in RapidMiner and drag your data set into the process
window.

5) Run your process and examine your data set in both meta data view and Data View. Note
if any attributes have missing or inconsistent data.

6) If you found any missing or inconsistent data, use operators to handle these. Perhaps try
browsing through the folder tree in the Operators tab and experiment with some operators
that were not covered in this chapter.

7) Try filtering out some observations based on some attribute’s value, and filter out some
attributes.

8) Document where you found your data set, how you prepared it for import into
RapidMiner, and what data preparation activities you applied to it.

SECTION TWO: DATA MINING MODELS AND METHODS


CHAPTER FOUR:
CORRELATION

CONTEXT AND PERSPECTIVE

Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home heating. Recent
volatility in market prices for heating oil specifically, coupled with wide variability in the size of
each order for home heating oil, has Sarah concerned. She feels a need to understand the types of
behaviors and other factors that may influence the demand for heating oil in the domestic market.
What factors are related to heating oil usage, and how might she use a knowledge of such factors
to better manage her inventory, and anticipate demand? Sarah believes that data mining can help
her begin to formulate an understanding of these factors and interactions.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
• Explain what correlation is, and what it isn’t.
• Recognize the necessary format for data in order to perform correlation analysis.
• Develop a correlation model in RapidMiner.
• Interpret the coefficients in a correlation matrix and explain their significance, if any.

ORGANIZATIONAL UNDERSTANDING

Sarah’s goal is to better understand how her company can succeed in the home heating oil market.
She recognizes that there are many factors that influence heating oil consumption, and believes
that by investigating the relationship between a number of those factors, she will be able to better
monitor and respond to heating oil demand. She has selected correlation as a way to model the
relationship between the factors she wishes to investigate. Correlation is a statistical measure of
how strong the relationships are between attributes in a data set.
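For reference, the coefficient involved is Pearson’s product-moment correlation, which for two attributes x and y is computed as

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

where x-bar and y-bar are the attribute means. RapidMiner will compute these values for us; the formula is shown only to make clear what the numbers in the matrix represent.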

DATA UNDERSTANDING

In order to investigate her question, Sarah has enlisted our help in creating a correlation matrix of
six attributes. Working together, using Sarah’s employer’s data resources which are primarily
drawn from the company’s billing database, we create a data set comprised of the following
attributes:
• Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home’s insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation.
• Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit.
• Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year.
• Num_Occupants: This is the total number of occupants living in each home.
• Avg_Age: This is the average age of those occupants.
• Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The higher the number, the larger the home.
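Incidentally, if these data still lived in the company’s billing database rather than in a CSV file, individual coefficients could be checked with SQL in engines that provide a corr() aggregate, such as PostgreSQL or Oracle (the table name here is hypothetical):

    SELECT corr(Heating_Oil, Insulation) AS r
      FROM Home_Heating;

A full matrix, however, would require one such expression per pair of attributes, which is where RapidMiner’s Correlation Matrix operator earns its keep.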

DATA PREPARATION

A CSV data set for this chapter’s example is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the example, go ahead and download the Chapter04DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for correlation
mining:

1) Import the Chapter 4 CSV data set into your RapidMiner data repository. Save it with the
name Chapter4. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. When you are finished, your repository
should look similar to Figure 4-1.

Figure 4-1. The chapter four data set added to the author’s RapidMiner Book repository.

2) If your RapidMiner application is not open to a new, blank process window, click the new
process icon, or click File > New to create a new process. Drag your Chapter4 data set
into your main process window. Go ahead and click the run (play) button to examine the
data set’s meta data. If you are prompted, you may choose to save your new model. For
this book’s example, we’ll save the model as Chapter4_Process.

Figure 4-2. Meta Data view of the chapter four data set.

We can see in Figure 4-2 that our six attributes are shown. There are a total of 1,218
homes represented in the data set. Our data set appears to be very clean, with no missing
values in any of the six attributes, and no inconsistent data apparent in our ranges or other
descriptive statistics. If you wish, you can take a minute to switch to Data View to
familiarize yourself with the data. It feels like these data are in good shape, and are in no
further need of data preparation operators, so we are ready to move on to…


MODELING

3) Switch back to design perspective. On the Operators tab in the lower left hand corner, use
the search box and begin typing in the word correlation. The tool we are looking for is called
Correlation Matrix. You may be able to find it before you even finish typing the full search
term. Once you’ve located it, drag it over into your process window and drop it into your
stream. By default, the exa port will connect to the res port, but in this chapter’s example
we are interested in creating a matrix of correlation coefficients that we can analyze. Thus,
it is important for you to connect the mat (matrix) port to a res port, as illustrated in Figure
4-3.

Figure 4-3. The addition of a Correlation Matrix to our stream, with the
mat (matrix) port connected to a result set (res) port.

4) Correlation is a relatively simple statistical analysis tool, so there are few parameters to
modify. We will accept the defaults, and run the model. The results will be similar to
Figure 4-4.

Figure 4-4. Results of a Correlation Matrix.
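
As an aside for readers who also work in Python: the same matrix can be approximated outside
RapidMiner with the pandas library. This is only a sketch, not part of the book's RapidMiner
process; it assumes pandas is installed and that Chapter04DataSet.csv is in your working
directory. The coefficients it prints should match Figure 4-4.

    # A rough sketch in Python/pandas; assumes the chapter's CSV file
    # has been downloaded as described in the Data Preparation section.
    import pandas as pd

    df = pd.read_csv("Chapter04DataSet.csv")
    print(len(df))            # number of homes (1,218 in the book's data)
    print(df.isnull().sum())  # confirm there are no missing values
    print(df.corr())          # Pearson correlation matrix, as in Figure 4-4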



5) In Figure 4-4, we have our correlation coefficients in a matrix. Correlation coefficients
are relatively easy to decipher. They are simply a measure of the strength of the
relationship between each possible set of attributes in the data set. Because we have six
attributes in this data set, our matrix is six columns wide by six rows tall. In the location
where an attribute intersects with itself, the correlation coefficient is ‘1’, because everything
compared to itself has a perfectly matched relationship. All other pairs of attributes will
have a correlation coefficient of less than one. To complicate matters a bit, correlation
coefficients can actually be negative as well, so all correlation coefficients will fall
somewhere between -1 and 1. We can see that this is the case in Figure 4-4, and so we can
now move on to the CRISP-DM step of…

EVALUATION

All correlation coefficients between 0 and 1 represent positive correlations, while all coefficients
between 0 and -1 are negative correlations. While this may seem straightforward, there is an
important distinction to be made when interpreting the matrix’s values. This distinction has to do
with the direction of movement between the two attributes being analyzed. Let’s consider the
relationship between the Heating_Oil consumption attribute, and the Insulation rating level
attribute. The coefficient there, as seen in our matrix in Figure 4-4, is 0.736. This is a positive
number, and therefore, a positive correlation. But what does that mean? Correlations that are
positive mean that as one attribute’s value rises, the other attribute’s value also rises. But, a positive
correlation also means that as one attribute’s value falls, the other’s also falls. Data analysts
sometimes make the mistake in thinking that a negative correlation exists if an attribute’s values are
decreasing, but if its corresponding attribute’s values are also decreasing, the correlation is still a
positive one. This is illustrated in Figure 4-5.

Whenever both attribute values move in the same direction (heating oil use rises and the
insulation rating also rises, or heating oil use falls and the insulation rating also falls),
the correlation is positive.
Figure 4-5. Illustration of positive correlations.


Next, consider the relationship between the Temperature attribute and the Insulation rating
attribute. In our Figure 4-4 matrix, we see that the coefficient there is -0.794. In this example, the
correlation is negative, as illustrated in Figure 4-6.

Whenever attribute values move in opposite directions (temperature rises while the insulation
rating falls, or temperature falls while the insulation rating rises), the correlation is
negative.
Figure 4-6. Illustration of negative correlations.

So correlation coefficients tell us something about the relationship between attributes, and this is
helpful, but they also tell us something about the strength of the correlation. As previously
mentioned, all correlations will fall between 0 and 1 or 0 and -1. The closer a correlation
coefficient is to 1 or to -1, the stronger it is. Figure 4-7 illustrates the correlation strength along the
continuum from -1 to 1.

-1.0 to -0.8: Very strong correlation
-0.8 to -0.6: Strong correlation
-0.6 to -0.4: Some correlation
-0.4 to 0.0: No correlation
0.0 to 0.4: No correlation
0.4 to 0.6: Some correlation
0.6 to 0.8: Strong correlation
0.8 to 1.0: Very strong correlation
Figure 4-7. Correlation strengths between -1 and 1.
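
Purely as an illustrative aid (this is not something RapidMiner produces), the bands in Figure
4-7 can be written as a small Python helper for labeling coefficients:

    def correlation_strength(r):
        """Label a coefficient using the rough bands from Figure 4-7."""
        a = abs(r)  # strength depends on distance from zero, not on sign
        if a >= 0.8:
            return "very strong"
        if a >= 0.6:
            return "strong"
        if a >= 0.4:
            return "some"
        return "none"

    print(correlation_strength(0.736))   # strong
    print(correlation_strength(-0.794))  # strong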

RapidMiner attempts to help us recognize correlation strengths through color coding. In the
Figure 4-4 matrix, we can see that some of the cells are tinted with shades of purple in graduated
colors, in order to more strongly highlight those with stronger correlations. It is important to
recognize that these are only general guidelines and not hard-and-fast rules. A correlation
coefficient around .2 does show some interaction between attributes, even if that interaction
is weak. This should be kept in mind as we proceed to…


DEPLOYMENT

The concept of deployment in data mining means doing something with what you’ve learned from
your model; taking some action based upon what your model tells you. In this chapter’s example,
we conducted some basic, exploratory analysis for our fictional figure, Sarah. There are several
possible outcomes from this investigation.

We learned through our investigation, that the two most strongly correlated attributes in our data
set are Heating_Oil and Avg_Age, with a coefficient of 0.848. Thus, we know that in this data set,
as the average age of the occupants in a home increases, so too does the heating oil usage in that
home. What we do not know is why that occurs. Data analysts often make the mistake of
interpreting correlation as causation. The assumption that correlation proves causation is
dangerous and often false.

Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
Referring back to Figure 4-7, we see that this is considered to be a relatively strong negative
correlation. As the age of a home’s residents increases, the average temperature outside decreases;
and as the temperature rises, the age of the folks inside goes down. But could the average age of a
home’s occupants have any effect on that home’s average yearly outdoor temperature? Certainly
not. If it did, we could control the temperature by simply moving people of different ages in and
out of homes. This of course is silly. While statistically, there is a correlation between these two
attributes in our data set, there is no logical reason that movement in one causes movement in the
other. The relationship is probably coincidental, but if not, there must be some other explanation
that our model cannot offer. Such limitations must be recognized and accepted in all data mining
deployment decisions.

Another false interpretation about correlations is that the coefficients are percentages, as if to say
that a correlation coefficient of 0.776 between two attributes is an indication that there is 77.6%
shared variability between those two attributes. This is not correct. While the coefficients do tell a
story about the shared variability between attributes, the underlying mathematical formula used to
calculate correlation coefficients solely measures strength, as indicated by proximity to 1 or -1, of
the interaction between attributes. No percentage is calculated or intended.
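
To see why no percentage is intended, it may help to look at the arithmetic itself. Below is a
minimal from-scratch sketch of the standard Pearson coefficient; the result is a unitless ratio
between -1 and 1, not a share of anything.

    import math

    def pearson_r(xs, ys):
        """Pearson correlation: the covariance of two attributes scaled
        by both standard deviations, yielding a unitless value in
        [-1, 1]. It is not a percentage of shared variability."""
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)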


With these interpretation parameters explained, there may be several things that Sarah can do in
order to take action based upon our model. A few options might include:

 Dropping the Num_Occupants attribute. While the number of people living in a home
might logically seem like a variable that would influence energy usage, in our model it did
not correlate in any significant way with anything else. Sometimes there are attributes that
don’t turn out to be very interesting.

 Investigating the role of home insulation. The Insulation rating attribute was fairly strongly
correlated with a number of other attributes. There may be some opportunity there to
partner with a company (or start one…?) that specializes in adding insulation to existing
homes. If she is interested in contributing to conservation, working on a marketing
promotion to show the benefits of adding insulation to a home might be a good course of
action; however, if she wishes to continue to sell as much heating oil as she can, she may
feel conflicted about participating in such a campaign.

 Adding greater granularity in the data set. This data set has yielded some interesting
results, but frankly, it’s pretty general. We have used average yearly temperatures and total
annual number of heating oil units in this model. But we also know that temperatures
fluctuate throughout the year in most areas of the world, and thus monthly, or even weekly
measures would not only be likely to show more detailed results of demand and usage over
time, but the correlations between attributes would probably be more interesting. From
our model, Sarah now knows how certain attributes interact with one another, but in the
day-to-day business of doing her job, she’ll probably want to know about usage over time
periods shorter than one year.

 Adding additional attributes to the data set. It turned out that the number of occupants in
the home didn’t correlate much with other attributes, but that doesn’t mean that other
attributes would be equally uninteresting. For example, what if Sarah had access to the
number of furnaces and/or boilers in each home? Home_size was slightly correlated with
Heating_Oil usage, so perhaps the number of instruments that consume heating oil in each
home would tell an interesting story, or at least add to her insight.


Sarah would also be wise to remember that the CRISP-DM approach is cyclical in nature. Each
month as new orders come in and new bills go out, as new customers sign up for a heating oil
account, there are additional data available to add into the model. As she learns more about how
each attribute in her data set interacts with others, she can improve her correlation model by
adding not only new attributes, but also new observations.

CHAPTER SUMMARY

This chapter has introduced the concept of correlation as a data mining model. It has been chosen
as the first model for this book because it is relatively simple to construct, run and interpret, thus
serving as an easy starting point upon which to build. Future models will become more complex,
but continuing to develop your skills in RapidMiner and getting comfortable with the tools will
make the more complex models easier for you to achieve as we move forward.

Recall from Chapter 1 (Figure 1-2) that data mining has two somewhat interconnected sides:
Classification, and Prediction. Correlation has been shown to be primarily on the side of
Classification. We do not infer causation using correlation metrics, nor do we use correlation
coefficients to predict one attribute’s value based on another’s. We can however quickly find
general trends in data sets using correlations, and we can anticipate how strongly an observed
movement in one attribute will be accompanied by movement in another.

Correlation can be a quick and easy way to see how elements of a given problem may be
interacting with one another. Whenever you find yourself asking how certain factors in a problem
you’re trying to solve interact with one another, consider building a correlation matrix to find out.
For example, does customer satisfaction change based on time of year? Does the amount of
rainfall change the price of a crop? Does household income influence which restaurants a person
patronizes? The answer to each of these questions is probably ‘yes’, but correlation can not only
help us know if that’s true, it can also help us learn how strong the interactions are when, and if,
they occur.


REVIEW QUESTIONS

1) What are some of the limitations of correlation models?

2) What is a correlation coefficient? How is it interpreted?

3) What is the difference between a positive and a negative correlation? If two attributes have
values that decrease at essentially the same rate, is that a negative correlation? Why or why
not?

4) How is correlation strength measured? What are the ranges for strengths of correlation?

5) The number of heating oil consuming devices was suggested as a possibly interesting
attribute that could be added to the example data set for this chapter. Can you think of
others? Why might they be interesting? To what other attributes in the data set do you
think your suggested attributes might be correlated? What would be the value in knowing
if they are?

EXERCISE

It is now your turn to develop a correlation model, generate a coefficient matrix, and analyze the
results. To complete this chapter’s exercise, follow the steps below.

1) Select a professional sporting organization that you enjoy, or of which you are aware.
Locate that organization’s web site and search it for statistics, facts and figures about the
athletes in that organization.

2) Open OpenOffice Calc, and starting in Cell A across Row 1 of the spreadsheet, define
some attributes (at least three or four) to hold data about each athlete. Some possible
attributes you may wish to consider could be annual_salary, points_per_game,
years_as_pro, height, weight, age, etc. The list is potentially unlimited, will vary based on
the type of sport you choose, and will depend on the data available to you on the web site
you’ve selected. Measurements of the athletes’ salaries and performance in competition are
likely to be the most interesting. You may include the athletes’ names; however, keep in
mind that correlations can only be conducted on numeric data, so the name attribute
would need to be reduced out of your data set before creating your correlation matrix.
(Remember the Select Attributes operator!)
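
If you later want to check your RapidMiner results in Python, the same reduction is one line in
pandas. The file and column names below are only placeholders for whatever you build in this
exercise.

    import pandas as pd

    athletes = pd.read_csv("Chapter4Exercise.csv")  # hypothetical file name
    numeric = athletes.drop(columns=["name"])       # "name" is hypothetical too
    print(numeric.corr())  # correlations on the numeric attributes only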

3) Look up the statistics for each of your selected attributes and enter them as observations
into your spreadsheet. Try to find as many as you can—at least thirty is a good rule of
thumb in order to achieve at least a basic level of statistical validity. More is better.

4) Once you’ve created your data set, use the menu to save it as a CSV file. Click File, then
Save As. Enter a file name, and change ‘Save as type:’ to be Text CSV (.csv). Be sure to
save the file in your data mining data folder.

5) Open RapidMiner and import your data set into your RapidMiner repository. Name it
Chapter4Exercise, or something descriptive so that you will remember what data are
contained in the data set when you look in your repository.

6) Add the data set to a new process in RapidMiner. Ensure that the out port is connected to
a res port and run your model. Save your process with a descriptive name if you wish.
Examine your data in results perspective and ensure there are no missing, inconsistent, or
other potentially problematic data that might need to be handled as part of your Data
Preparation phase. Return to design perspective and handle any data preparation tasks that
may be necessary.

7) Add a Correlation Matrix operator to your stream and ensure that the mat port is
connected to a res port. Run your model again. Interpret your correlation coefficients as
displayed on the matrix tab.

8) Document your findings. What correlations exist? How strong are they? Are they
surprising to you and if so, why? What other attributes would you like to add? Are there
any you’d eliminate now that you’ve mined your data?


Challenge step!

9) While still in results perspective, click on the ExampleSet tab (which exists assuming you
left the exa port connected to a res port when you were in design perspective). Click on the
Plot View radio button. Examine correlations that you found in your model visually by
creating a scatter plot of your data. Choose one attribute for your x-Axis and a correlated
one for your y-Axis. Experiment with the Jitter slide bar. What is it doing? (Hint: Try an
Internet search on the term ‘jittering statistics’.) For an additional visual experience, try a
Scatter 3D or Scatter 3D Color plot. Consider Figures 4-8 and 4-9 as examples. Note that
with 3D plots in RapidMiner, you can click and hold to rotate your plot in order to better
see the interactions between the data.

Figure 4-8. A two-dimensional scatterplot with a colored third dimension and a slight jitter.


Figure 4-9. A three-dimensional scatterplot with a colored fourth dimension.
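
For the curious, jittering is easy to reproduce outside RapidMiner. This sketch (assuming
pandas, NumPy, and matplotlib are installed) simply adds small random noise to one axis so that
stacked points separate visually; nothing about the underlying data changes.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Chapter04DataSet.csv")

    # Small uniform noise spreads out points that share an Insulation rating.
    jitter = np.random.uniform(-0.2, 0.2, size=len(df))
    plt.scatter(df["Insulation"] + jitter, df["Heating_Oil"], alpha=0.5)
    plt.xlabel("Insulation (jittered)")
    plt.ylabel("Heating_Oil")
    plt.show()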


CHAPTER FIVE:
ASSOCIATION RULES

CONTEXT AND PERSPECTIVE

Roger is a city manager for a medium-sized, but steadily growing, city. The city has limited
resources, and like most municipalities, there are more needs than there are resources. He feels
like the citizens in the community are fairly active in various community organizations, and
believes that he may be able to get a number of groups to work together to meet some of the
needs in the community. He knows there are churches, social clubs, hobby enthusiasts and other
types of groups in the community. What he doesn’t know is if there are connections between the
groups that might enable natural collaborations between two or more groups that could work
together on projects around town. He decides that before he can begin asking community
organizations to begin working together and to accept responsibility for projects, he needs to find
out if there are any existing associations between the different types of groups in the area.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
 Explain what association rules are, how they are found and the benefits of using them.
 Recognize the necessary format for data in order to create association rules.
 Develop an association rule model in RapidMiner.
 Interpret the rules generated by an association rule model and explain their significance, if
any.

ORGANIZATIONAL UNDERSTANDING

Roger’s goal is to identify and then try to take advantage of existing connections in his local
community to get some work done that will benefit the entire community. He knows of many of
the organizations in town, has contact information for them and is even involved in some of them
himself. His family is involved in an even broader group of organizations, so he understands on a
personal level the diversity of groups and their interests. Because people he and his family know
are involved in other groups around town, he is aware in a more general sense of many different
types of organizations, their interests, objectives and potential contributions. He knows that to
start, his main concern is finding types of organizations that seem to be connected with one
another. Identifying individuals to work with at each church, social club or political organization
will be overwhelming without first categorizing the organizations into groups and looking for
associations between the groups. Only once he’s checked for existing connections will he feel
ready to begin contacting people and asking them to use their cross-organizational contacts and
take on project ownership. His first need is to find where such associations exist.

DATA UNDERSTANDING

In order to answer his question, Roger has enlisted our help in creating an association rules data
mining model. Association rules are a data mining methodology that seeks to find frequent
connections between attributes in a data set. Association rules are very common when doing
shopping basket analysis. Marketers and vendors in many sectors use this data mining approach to
try to find which products are most frequently purchased together. If you have ever purchased
items on an e-Commerce retail site like Amazon.com, you have probably seen the fruits of
association rule data mining. These are most commonly found in the recommendations sections
of such web sites. You might notice that when you search for a smartphone, accessories such as
screen protectors, protective cases, and charging cords or data cables are often recommended to
you. The items being recommended are identified by mining for items that
previous customers bought in conjunction with the item you search for. In other words, those
items are found to be associated with the item you are looking for, and that association is so frequent
in the web site’s data set, that the association might be considered a rule. Thus is born the name of
this data mining approach: “association rules”. While association rules are most common in
shopping basket analysis, this modeling technique can be applied to a broad range of questions.
We will help Roger by creating an association rule model to try to find linkages across types of
community organizations.

74
Chapter 5: Association Rules

Working together, we use Roger’s knowledge of the local community to create a short survey
which we will administer online via a web site. In order to ensure a measure of data integrity and
to try to protect against possible abuse, our web survey is password protected. Each organization
invited to participate in the survey is given a unique password. The leader of that organization is
asked to share the password with his or her membership and to encourage participation in the
survey. Community members are given a month to respond, and each time an individual logs on
to complete the survey, the password used is recorded so that we can determine how many people
from each organization responded. After the month ends, we have a data set comprised of the
following attributes:

 Elapsed_Time: This is the amount of time each respondent spent completing our survey.
It is expressed in decimal minutes (e.g. 4.5 in this attribute would be four minutes, thirty
seconds).
 Time_in_Community: This question on the survey asked the person if they have lived in
the area for 0-2 years, 3-9 years, or 10+ years; and is recorded in the data set as Short,
Medium, or Long respectively.
 Gender: The survey respondent’s gender.
 Working: A yes/no column indicating whether or not the respondent currently has a paid
job.
 Age: The survey respondent’s age in years.
 Family: A yes/no column indicating whether or not the respondent is currently a member
of a family-oriented community organization, such as Big Brothers/Big Sisters, children’s
recreation or sports leagues, genealogy groups, etc.
 Hobbies: A yes/no column indicating whether or not the respondent is currently a
member of a hobby-oriented community organization, such as amateur radio, outdoor
recreation, motorcycle or bicycle riding, etc.
 Social_Club: A yes/no column indicating whether or not the respondent is currently a
member of a community social organization, such as Rotary International, Lion’s Club, etc.
 Political: A yes/no column indicating whether or not the respondent is currently a
member of a political organization with regular meetings in the community, such as a
political party, a grass-roots action group, a lobbying effort, etc.

 Professional: A yes/no column indicating whether or not the respondent is currently a
member of a professional organization with local chapter meetings, such as a chapter of a
law or medical society, a small business owner’s group, etc.
 Religious: A yes/no column indicating whether or not the respondent is currently a
member of a church in the community.
 Support_Group: A yes/no column indicating whether or not the respondent is currently
a member of a support-oriented community organization, such as Alcoholics Anonymous,
an anger management group, etc.

In order to preserve a level of personal privacy, individual respondents’ names were not collected
through the survey, and no respondent was asked to give personally identifiable information when
responding.

DATA PREPARATION

A CSV data set for this chapter’s exercise is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the exercise, go ahead and download the Chapter05DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for association
rule mining:

1) Import the Chapter 5 CSV data set into your RapidMiner data repository. Save it with the
name Chapter5. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. This is the same process as was done in
Chapter 4, so hopefully by now, you are getting comfortable with the steps to import data
into RapidMiner.

2) Drag your Chapter5 data set into a new process window in RapidMiner, and run the model
in order to inspect the data. When running the model, if prompted, save the process as
Chapter5_Process, as shown in Figure 5-1.


Figure 5-1. Adding the data for the Chapter 5 example model.

3) In results perspective, look first at Meta Data view (Figure 5-2). Note that we do not have
any missing values among any of the 12 attributes across 3,483 observations. In examining
the statistics, we do not see any inconsistent data. For numeric data types, RapidMiner has
given us the average (avg), or mean, for each attribute, as well as the standard deviation for
each attribute. Standard deviations are measurements of how dispersed or varied the
values in an attribute are, and so can be used to watch for inconsistent data. A good rule
of thumb is that any value that is more than two standard deviations below the mean (or
arithmetic average), or more than two standard deviations above the mean, is a statistical outlier.
example, in the Age attribute in Figure 5-2, the average age is 36.731, while the standard
deviation is 10.647. Two standard deviations above the mean would be 58.025
(36.731+(2*10.647)), and two standard deviations below the mean would be 15.437
(36.731-(2*10.647)). If we look at the Range column in Figure 5-2, we can see that the Age
attribute has a range of 17 to 57, so all of our observations fall within two standard
deviations of the mean. We find no inconsistent data in this attribute. This won’t always
be the case, so a data miner should always be watchful for such indications of inconsistent
data. It’s important to realize also that while two standard deviations is a guideline, it’s not
a hard-and-fast rule. Data miners should be thoughtful about why some observations may
be legitimate and yet far from the mean, or why some values that fall within two standard
deviations of the mean should still be scrutinized. One other item should be noted as we
examine Figure 5-2: the yes/no attributes indicating whether or not a person was a member of
various types of community organizations were recorded as 0 or 1, and those attributes
were imported as ‘integer’ data types. The association rule operators we’ll be using in
RapidMiner require attributes to be of ‘binominal’ data type, so we still have some data
preparation yet to do.

Figure 5-2. Meta data of our community group involvement survey.
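
The two-standard-deviation screen described above is easy to run yourself. Here is a brief
sketch (assuming pandas and the chapter's CSV file) that reproduces the Age calculation from
step 3:

    import pandas as pd

    df = pd.read_csv("Chapter05DataSet.csv")

    mean = df["Age"].mean()   # about 36.731
    sd = df["Age"].std()      # about 10.647
    lower, upper = mean - 2 * sd, mean + 2 * sd   # about 15.437 and 58.025

    # Any observations outside the band would deserve a closer look.
    print(df[(df["Age"] < lower) | (df["Age"] > upper)])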

4) Switch back to design perspective. We have a fairly good understanding of our objectives
and our data, but we know that some additional preparation is needed. First off, we need
to reduce the number of attributes in our data set. The elapsed time each person took to
complete the survey isn’t necessarily interesting in the context of our current question,
which is whether or not there are existing connections between types of organizations in
our community, and if so, where those linkages exist. In order to reduce our data set to
only those attributes related to our question, add a Select Attributes operator to your
stream (as was demonstrated in Chapter 3), and select the following attributes for inclusion,
as illustrated in Figure 5-3: Family, Hobbies, Social_Club, Political, Professional, Religious,
Support_Group. Once you have these attributes selected, click OK to return to your main
process.


Figure 5-3. Selection of attributes to include in the association rules model.

5) One other step is needed in our data preparation. This is to change the data types of our
selected attributes from integer to binominal. As previously mentioned, the association
rules operators need this data type in order to function properly. In the search box on the
Operators tab in design view, type ‘Numerical to’ (without the single quotes) to locate the
operators that will change attributes with a numeric data type to some other data type. The
one we will use is Numerical to Binominal. Drag this operator into your stream.


Figure 5-4. Adding a data type conversion operator to a data mining model.

6) For our purposes, all attributes which remain after application of the Select Attributes
operator need to be converted from numeric to binominal, so as the black arrow indicates
in Figure 5-4, we will convert ‘all’ from the former data type to the latter. We could
convert a subset or a single attribute, by selecting one of those options in the attribute filter
type dropdown menu. We have done this in the past, but in this example, we can accept
the default and convert all attributes at once. You should also observe that within
RapidMiner, the data type binominal is used instead of binomial, a term many data
analysts are more used to. There is an important distinction. Binomial means one of two
numbers (usually 0 and 1), so the basic underlying data type is still numeric. Binominal, on
the other hand, means one of two values which may be numeric or character based. Click
the play button to run your model and see how this conversion has taken place in our data
set. In results perspective, you should see the transformation, as depicted in Figure 5-5.


Figure 5-5. The results of a data type transformation.
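
For readers following along outside RapidMiner, the same two preparation steps (selecting the
membership attributes and converting the 0/1 integers to true/false values) look like this in a
pandas sketch:

    import pandas as pd

    df = pd.read_csv("Chapter05DataSet.csv")

    group_cols = ["Family", "Hobbies", "Social_Club", "Political",
                  "Professional", "Religious", "Support_Group"]

    # Select Attributes, then Numerical to Binominal, in pandas terms.
    baskets = df[group_cols].astype(bool)
    print(baskets.head())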

7) For each attribute in our data set, the values of 1 or 0 that existed in our source data set
now are reflected as either ‘true’ or ‘false’. Our data preparation phase is now complete
and we are ready for…

MODELING

8) Switch back to design perspective. We will use two specific operators in order to generate
our association rule data mining model. Understand that there are many other operators
offered in RapidMiner that can be used in association rule models. At the outset, we
established that this book is not a RapidMiner training manual and thus, will not cover
every possible operator that could be used in a given model. Thus, please do not assume
that this chapter’s example is demonstrating the one and only way to mine for association
rules. This is one of several possible approaches, and you are encouraged to explore other
operators and their functionality.

To proceed with the example, use the search field in the operators tab to look for an
operator called FP-Growth. Note that you might find one called W-FPGrowth. This is
simply a slightly different implementation of the FP-Growth algorithm that will look for
associations in our data, so do not be confused by the two very similar names. For this
chapter’s example, select the operator that is just called FP-Growth. Go ahead and drag it
into your stream. The FP in FP-Growth stands for Frequency Pattern. Frequency
pattern analysis is handy for many kinds of data mining, and is a necessary component of
association rule mining. Without having frequencies of attribute combinations, we cannot
determine whether any of the patterns in the data occur often enough to be considered
rules. Your stream should now look like Figure 5-6.


Figure 5-6. Addition of an FP-Growth operator to an association rule model.

9) Take note of the min support parameter on the right hand side. We will come back to this
parameter during the evaluation portion of this chapter’s example. Also, be sure that both
your exa port and your fre port are connected to res ports. The exa port will generate a tab
of your examples (your data set’s observations and meta data), while the fre port will
generate a matrix of any frequent patterns the operator might find in your data set. Run
your model to switch to results perspective.

Figure 5-7. Results of an FP-Growth operator.



10) In results perspective, we see that some of our attributes appear to have some frequent
patterns in them, and in fact, we begin to see that three attributes look like they might have
some association with one another. The black arrows point to areas where it seems that
Religious organizations might have some natural connections with Family and Hobby
organizations. We can investigate this possible connection further by adding one final
operator to our model. Return to design perspective, and in the operators search box, look
for ‘Create Association’ (again, without the single quotes). Drag the Create Association
Rules operator over and drop it into the spline that connects the fre port to the res port.
This operator takes in frequent pattern matrix data and seeks out any patterns that occur so
frequently that they could be considered rules. Your model should now look like Figure 5-
8.

Figure 5-8. Addition of Create Association Rules operator.

11) The Create Association Rules operator can generate both a set of rules (through the rul
port) and a set of associated items (through the ite port). We will simply generate rules, and
for now, accept the default parameters for the Create Association Rules, though note the
min confidence parameter, which we will address in the evaluation phase of our mining. Run
your model.

Figure 5-9. The results of our association rule model.



12) Bummer. No rules found. Did we do all that work for nothing? It seemed like we had
some hope for some associations back in step 9, so what happened? Remember from
Chapter 1 that the CRISP-DM process is cyclical in nature, and sometimes, you have to go
back and forth between steps before you will create a model that yields results. Such is the
case here. We have nothing to consider here, so perhaps we need to tweak some of our
model’s parameters. This may be a process of trial and error, which will take us back and
forth between our current CRISP-DM step of Modeling and…

EVALUATION

13) So we’ve evaluated our model’s first run. No rules found. Not much to evaluate there,
right? So let’s switch back to design perspective, and take a look at those parameters we
highlighted briefly in the previous steps. There are two main factors that dictate whether
or not frequency patterns get translated into association rules: Confidence percent and
Support percent. Confidence percent is a measure of how confident we are that when
one attribute is flagged as true, the associated attribute will also be flagged as true. In the
classic shopping basket analysis example, we could look at two items often associated with
one another: cookies and milk. If we examined ten shopping baskets and found that
cookies were purchased in four of them, and milk was purchased in seven, and that further,
in three of the four instances where cookies were purchased, milk was also in those
baskets, we would have a 75% confidence in the association rule: cookies → milk. This is
calculated by dividing the three instances where cookies and milk coincided by the four
instances where they could have coincided (3/4 = .75, or 75%). The rule cookies → milk
had a chance to occur four times, but it only occurred three, so our confidence in this rule
is not absolute.

Now consider the reciprocal of the rule: milk → cookies. Milk was found in seven of our
ten hypothetical baskets, while cookies were found in four. We know that the coincidence,
or frequency of connection between these two products is three. So our confidence in
milk → cookies falls to only 43% (3/7 = .429, or 43%). Milk had a chance to be found
with cookies seven times, but it was only found with them three times, so our confidence
in milk → cookies is a good bit lower than our confidence in cookies → milk. If a person
comes to the store with the intention of buying cookies, we are more confident that they
will also buy milk than if their intentions were reversed. This concept is referred to in
association rule mining as Premise → Conclusion. Premises are sometimes also referred
to as antecedents, while conclusions are sometimes referred to as consequents. For each
pairing, the confidence percentages will differ based on which attribute is the premise and
which the conclusion. When associations between three or more attributes are found, for
example, cookies, crackers → milk, the confidence percentages are calculated based on the
two attributes being found with the third. This can become complicated to do manually,
so it is nice to have RapidMiner to find these combinations and run the calculations for us!

The support percent is an easier measure to calculate. This is simply the number of times
that the rule did occur, divided by the number of observations in the data set. The number
of observations in the data set is the absolute number of times the association could have occurred,
since every customer could have purchased cookies and milk together in their shopping
basket. The fact is, they didn’t, and such a phenomenon would be highly unlikely in any
analysis. Possible, but unlikely. We know that in our hypothetical example, cookies and
milk were found together in three out of ten shopping baskets, so our support percentage
for this association is 30% (3/10 = .3, or 30%). There is no reciprocal for support
percentages since this metric is simply the number of times the association did occur over
the number of times it could have occurred in the data set.
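
The cookies-and-milk arithmetic is small enough to verify in a few lines of Python; the counts
below are the hypothetical basket counts from the discussion above.

    baskets = 10   # total shopping baskets examined
    cookies = 4    # baskets containing cookies
    milk = 7       # baskets containing milk
    both = 3       # baskets containing cookies and milk together

    support = both / baskets               # 0.30, the same in either direction
    conf_cookies_to_milk = both / cookies  # 0.75
    conf_milk_to_cookies = both / milk     # roughly 0.43

    print(support, conf_cookies_to_milk, conf_milk_to_cookies)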

So now that we understand these two pivotal parameters in association rule mining, let’s
make a parameter modification and see if we find any association rules in our data. You
should be in design perspective again, but if not, switch back now. Click on your Create
Association Rules operator and change the min confidence parameter to .5 (see Figure 5-10).
This indicates to RapidMiner that any association with at least 50% confidence should be
displayed as a rule. With this as the confidence percent threshold, if we were using the
hypothetical shopping baskets discussed in the previous paragraphs to explain confidence
and support, cookies → milk would return as a rule because its confidence percent was
75%, while milk → cookies would not, due to that association’s 43% confidence percent.
Let’s run our model again with the .5 confidence value and see what we get.


Figure 5-10. Changing the confidence percent threshold.

Figure 5-11. Four rules found with the 50% confidence threshold.

14) Eureka! We have found rules, and our hunch that Religious, Family and Hobby
organizations are related was correct (remember Figure 5-7). Look at rule number four. It
just barely missed being considered a rule with an 80% confidence threshold at 79.6%.
Our other associations have lower confidence percentages, but are still quite good. We can
see that for each of these four rules, more than 20% of the observations in our data set
support them. Remember that since support is not reciprocal, the support percents for
rules 1 and 3 are the same, as they are for rules 2 and 4. As the premises and conclusions
were reversed, their confidence percentages did vary however. Had we set our confidence
percent threshold at .55 (or 55 percent), rule 1 would drop out of our results, so Family
→ Religious would be a rule but Religious → Family would not. The other calculations to
the right (LaPlace…Conviction) are additional arithmetic indicators of the strength of the
rules’ relationships. As you compare these values to support and confidence percents, you
will see that they track fairly consistently with one another.
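
For completeness, the same mining can be sketched outside RapidMiner. The snippet below uses
the third-party mlxtend library (an assumption: it is not part of the book's toolchain and must
be installed separately), applying FP-Growth and then a 50% confidence threshold, just as we did
with the Create Association Rules operator.

    import pandas as pd
    from mlxtend.frequent_patterns import fpgrowth, association_rules

    df = pd.read_csv("Chapter05DataSet.csv")
    group_cols = ["Family", "Hobbies", "Social_Club", "Political",
                  "Professional", "Religious", "Support_Group"]
    baskets = df[group_cols].astype(bool)

    # Frequent itemsets first, then rules at 50% confidence; the support
    # floor of 0.2 mirrors the 20%+ support our four rules showed.
    frequent = fpgrowth(baskets, min_support=0.2, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
    print(rules[["antecedents", "consequents", "support", "confidence"]])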


If you would like, you may return to design perspective and experiment. If you click on
the FP-Growth operator, you can modify the min support value. Note that while support
percent is the metric calculated and displayed by the Create Association Rules operator, the
min support parameter in the FP-Growth operator actually calls for a confidence level. The default of
.95 is very common in much data analysis, but you may want to lower it a bit and re-run
your model to see what happens. Lowering min support to .5 does yield additional rules,
including some with more than two attributes in the association rules. As you experiment
you can see that a data miner might need to go back and forth a number of times between
modeling and evaluating before moving on to…

DEPLOYMENT

We have been able to help Roger with his question. Do existing linkages between types of
community groups exist? Yes, they do. We have found that the community’s churches, family,
and hobby organizations have some common members. It may be a bit surprising that the political
and professional groups do not appear to be interconnected, but these groups may also be more
specialized (e.g. a local chapter of the bar association) and thus may not have tremendous cross-
organizational appeal or need. It seems that Roger will have the most luck finding groups that will
collaborate on projects around town by engaging churches, hobbyists and family-related
organizations. Using his contacts among local pastors and other clergy, he might ask for
volunteers from their congregations to spearhead projects to clean up city parks used for youth
sports (family organization association rule) or to improve a local biking trail (hobby organization
association rule).

CHAPTER SUMMARY

This chapter’s fictional scenario with Roger’s desire to use community groups to improve his city
has shown how association rule data mining can identify linkages in data that can have a practical
application. In addition to learning about the process of creating association rule models in
RapidMiner, we introduced a new operator that enabled us to change attributes’ data types. We
also used CRISP-DM’s cyclical nature to understand that sometimes data mining involves some
back and forth ‘digging’ before moving on to the next step. You learned how support and
confidence percentages are calculated and about the importance of these two metrics in identifying
rules and determining their strength in a data set.

REVIEW QUESTIONS

1) What are association rules? What are they good for?

2) What are the two main metrics that are calculated in association rules and how are they
calculated?

3) What data type must a data set’s attributes be in order to use Frequent Pattern operators in
RapidMiner?

4) How are rule results interpreted? In this chapter’s example, what was our strongest rule?
How do we know?

EXERCISE

In explaining support and confidence percentages in this chapter, the classic example of shopping
basket analysis was used. For this exercise, you will do a shopping basket association rule analysis.
Complete the following steps:

1) Using the Internet, locate a sample shopping basket data set. Search terms such as
‘association rule data set’ or ‘shopping basket data set’ will yield a number of downloadable
examples. With a little effort, you will be able to find a suitable example.

2) If necessary, convert your data set to CSV format and import it into your RapidMiner
repository. Give it a descriptive name and drag it into a new process window.

3) As necessary, conduct your Data Understanding and Data Preparation activities on your
data set. Ensure that all of your variables have consistent data and that their data types are
appropriate for the FP-Growth operator.


4) Generate association rules for your data set. Modify your confidence and support values in
order to identify their most ideal levels such that you will have some interesting rules with
reasonable confidence and support. Look at the other measures of rule strength such as
LaPlace or Conviction.

5) Document your findings. What rules did you find? What attributes are most strongly
associated with one another? Are there products that are frequently connected that
surprise you? Why do you think this might be? How much did you have to test different
support and confidence values before you found some association rules? Were any of your
association rules good enough that you would base decisions on them? Why or why not?

Challenge Step!

6) Build a new association rule model using your same data set, but this time, use the W-
FPGrowth operator. (Hints for using the W-FPGrowth operator: (1) This operator creates
its own rules without help from other operators; and (2) This operator’s support and
confidence parameters are labeled U and C, respectively.)

Exploration!

7) The Apriori algorithm is often used in data mining for associations. Search the
RapidMiner Operators tree for Apriori operators and add them to your data set in a new
process. Use the Help tab in RapidMiner’s lower right hand corner to learn about these
operators’ parameters and functions (be sure you have the operator selected in your main
process window in order to see its help content).


CHAPTER SIX:
K-MEANS CLUSTERING

CONTEXT AND PERSPECTIVE

Sonia is a program director for a major health insurance provider. Recently she has been reading
in medical journals and other articles, and found a strong emphasis on the influence of weight,
gender and cholesterol on the development of coronary heart disease. The research she’s read
confirms time after time that there is a connection between these three variables, and while there is
little that can be done about one’s gender, there are certainly life choices that can be made to alter
one’s cholesterol and weight. She begins brainstorming ideas for her company to offer weight and
cholesterol management programs to individuals who receive health insurance through her
employer. As she considers where her efforts might be most effective, she finds herself wondering
if there are natural groups of individuals who are most at risk for high weight and high cholesterol,
and if there are such groups, where the natural dividing lines between the groups occur.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
 Explain what k-means clusters are, how they are found and the benefits of using them.
 Recognize the necessary format for data in order to create k-means clusters.
 Develop a k-means cluster data mining model in RapidMiner.
 Interpret the clusters generated by a k-means model and explain their significance, if any.

ORGANIZATIONAL UNDERSTANDING

Sonia’s goal is to identify and then try to reach out to individuals insured by her employer who are
at high risk for coronary heart disease because of their weight and/or high cholesterol. She
understands that those at low risk, that is, those with low weight and cholesterol, are unlikely to
participate in the programs she will offer. She also understands that there are probably policy
holders with high weight and low cholesterol, those with high weight and high cholesterol, and
those with low weight and high cholesterol. She further recognizes there are likely to be a lot of
people somewhere in between. In order to accomplish her goal, she needs to search among the
thousands of policy holders to find groups of people with similar characteristics and craft
programs and communications that will be relevant and appealing to people in these different
groups.

DATA UNDERSTANDING

Using the insurance company’s claims database, Sonia extracts three attributes for 547 randomly
selected individuals. The three attributes are the insured’s weight in pounds as recorded on the
person’s most recent medical examination, their last cholesterol level determined by blood work in
their doctor’s lab, and their gender. As is typical in many data sets, the gender attribute uses 0 to
indicate Female and 1 to indicate Male. We will use this sample data from Sonia’s employer’s
database to build a cluster model to help Sonia understand how her company’s clients, the health
insurance policy holders, appear to group together on the basis of their weights, genders and
cholesterol levels. We should remember as we do this that means are particularly susceptible to
undue influence by extreme outliers, so watching for inconsistent data when using the k-Means
clustering data mining methodology is very important.

DATA PREPARATION

As with previous chapters, a data set has been prepared for this chapter’s example, and is available
as Chapter06DataSet.csv on the book’s companion web site. If you would like to follow along
with this example exercise, go ahead and download the data set now, and import it into your
RapidMiner data repository. At this point you are probably getting comfortable with importing
CSV data sets into a RapidMiner repository, but remember that the steps are outlined in Chapter 3
if you need to review them. Be sure to designate the attribute names correctly and to check your
data types as you import. Once you have imported the data set, drag it into a new, blank process
window so that you can begin to set up your k-means clustering data mining model. Your process
should look like Figure 6-1.


Figure 6-1. Cholesterol, Weight and Gender data set added to a new process.

Go ahead and click the play button to run your model and examine the data set. In Figure 6-2 we
can see that we have 547 observations across our three previously defined attributes. We can see
the averages for each of the three attributes, along with their accompanying standard deviations
and ranges. None of these values appear to be inconsistent (remember the earlier comments about
using standard deviations to find statistical outliers). We have no missing values to handle, so our
data appear to be very clean and ready to be mined.

Figure 6-2. A view of our data set’s meta data.


MODELING

The ‘k’ in k-means clustering stands for some number of groups, or clusters. The aim of this data
mining methodology is to look at each observation’s individual attribute values and compare them
to the means, or in other words averages, of potential groups of other observations in order to find
natural groups that are similar to one another. The k-means algorithm accomplishes this by
sampling some set of observations in the data set, calculating the averages, or means, for each
attribute for the observations in that sample, and then comparing the other observations in the data
set to that sample’s means. The system does this repetitively in order to ‘circle-in’ on the best
matches and then to formulate groups of observations which become the clusters. As the means
calculated become more and more similar, clusters are formed, and each observation whose
attributes values are most like the means of a cluster become members of that cluster. Using this
process, k-means clustering models can sometimes take a long time to run, especially if you
indicate a large number of “max runs” through the data, or if you seek for a large number of
clusters (k). To build your k-means cluster model, complete the following steps:

1) Return to design view in RapidMiner if you have not done so already. In the operators
search box, type k-means (be sure to include the hyphen). There are three operators that
conduct k-means clustering work in RapidMiner. For this exercise, we will choose the first,
which is simply named “k-Means”. Drag this operator into your stream, as shown in
Figure 6-3.

Figure 6-3. Adding the k-Means operator to our model.


2) Because we did not need to add any other operators in order to prepare our data for
mining, our model in this exercise is very simple. We could, at this point, run our model
and begin to interpret the results. This would not be very interesting however. This is
because the default for our k, or our number of clusters, is 2, as indicated by the black
arrow on the right hand side of Figure 6-3. This means we are asking RapidMiner to find
only two clusters in our data. If we only wanted to find those with high and low levels of
risk for coronary heart disease, two clusters would work. But as discussed in the
Organizational Understanding section earlier in the chapter, Sonia has already recognized
that there are likely a number of different types of groups to be considered. Simply
splitting the data set into two clusters is probably not going to give Sonia the level of detail
she seeks. Because Sonia felt that there were probably at least 4 potentially different
groups, let’s change the k value to four, as depicted in Figure 6-4. We could also increase
the number of ‘max runs’, but for now, let’s accept the default and run the model.

Figure 6-4. Setting the desired number of clusters for our model.
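
As a point of comparison, here is a hedged scikit-learn sketch of the same clustering (an
assumption: scikit-learn is not part of the book's toolchain). Because k-means begins from
random samples of the data, its cluster numbers and centroids will not match Figures 6-5 and
6-6 exactly.

    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("Chapter06DataSet.csv")

    # k = 4, mirroring the parameter we set in RapidMiner.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(df)
    df["cluster"] = kmeans.labels_

    print(df["cluster"].value_counts())  # observations per cluster
    print(kmeans.cluster_centers_)       # a centroid table, one row per cluster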

3) When the model is run, we find an initial report of the number of items that fell into each
of our four clusters. (Note that the clusters are numbered starting from 0, a result of
RapidMiner being written in the Java programming language.) In this particular model, our
clusters are fairly well balanced. While Cluster 1, with only 118 observations (Figure 6-5),
is smaller than the other clusters, it is not unreasonably so.

Figure 6-5. The distribution of observations across our four clusters.

We could go back at this point and adjust our number of clusters, our number of ‘max runs’, or
even experiment with the other parameters offered by the k-Means operator. There are other
options for measurement type or divergence algorithms. Feel free to try out some of these options
if you wish. As was the case with Association Rules, there may be some back and forth trial-and-
error as you test different parameters to generate model output. When you are satisfied with your
model parameters, you can proceed to…

EVALUATION

Recall that Sonia’s major objective in the hypothetical scenario posed at the beginning of the
chapter was to try to find natural breaks between different types of heart disease risk groups.
Using the k-Means operator in RapidMiner, we have identified four clusters for Sonia, and we can
now evaluate their usefulness in addressing Sonia’s question. Refer back to Figure 6-5. There are a
number of radio buttons which allow us to select options for analyzing our clusters. We will start
by looking at our Centroid Table. This view of our results, shown in Figure 6-6, gives the means
for each attribute in each of the four clusters we created.


Figure 6-6. The means for each attribute in our four (k) clusters.

We see in this view that cluster 0 has the highest average weight and cholesterol. With 0
representing Female and 1 representing Male, a mean of 0.591 indicates that we have more men
than women represented in this cluster. Knowing that high cholesterol and weight are two key
indicators of heart disease risk that policy holders can do something about, Sonia would likely want
to start with the members of cluster 0 when promoting her new programs. She could then extend
her programming to include the people in clusters 1 and 2, which have the next incrementally
lower means for these two key risk factor attributes. You should note that in this chapter’s
example, the clusters’ numeric order (0, 1, 2, 3) corresponds to decreasing means for each cluster.
This is coincidental. Sometimes, depending on your data set, cluster 0 might have the highest
means, but cluster 2 might have the next highest, so it’s important to pay close attention to your
centroid values whenever you generate clusters.
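
If you are following along in code, RapidMiner’s Centroid Table is nothing more than the
per-cluster mean of each attribute. A hedged sketch, continuing the hypothetical DataFrame
from the earlier k-Means example (the attribute names are assumed from this chapter’s
discussion):

# The Centroid Table (Figure 6-6) is just the mean of each attribute within
# each cluster; Weight, Cholesterol and Gender are assumed column names.
centroids = df.groupby("cluster")[["Weight", "Cholesterol", "Gender"]].mean()
print(centroids)  # per-cluster attribute means, as in Figure 6-6
# Rather than trusting the coincidental 0,1,2,3 ordering, sort on a key risk
# attribute to see which cluster to target first:
print(centroids.sort_values("Cholesterol", ascending=False))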

So we know that cluster 0 is where Sonia will likely focus her early efforts, but how does she know
who to try to contact? Who are the members of this highest risk cluster? We can find this
information by selecting the Folder View radio button. Folder View is depicted in Figure 6-7.

Figure 6-7. Folder view showing the observations included in Cluster 0.


By clicking the small + sign next to cluster 0 in Folder View, we can see all of the observations that
have means which are similar to the mean for this cluster. Remember that these means are
calculated for each attribute. You can see the details for any observation in the cluster by clicking
on it. Figure 6-8 shows the results of clicking on observation 6 (6.0):

Figure 6-8. The details of an observation within cluster 0.

The means for cluster 0 were just over 184 pounds for weight and just under 219 for cholesterol.
The person represented in observation 6 is heavier and has higher cholesterol than the average for
this highest risk group. Thus, this is a person Sonia is really hoping to help with her outreach
program. But we know from the Centroid Table that there are 154 individuals in the data set who
fall into this cluster. Clicking on each one of them in Folder View probably isn’t the most efficient
use of Sonia’s time. Furthermore, we know from our Data Understanding paragraph earlier in this
chapter that this model is built on only a sample data set of policy holders. Sonia might want to
extract these attributes for all policy holders from the company’s database and run the model again
on that data set. Or, if she is satisfied that the sample has given her what she wants in terms of
finding the breaks between the groups, she can move forward with…

DEPLOYMENT

We can help Sonia extract the observations from cluster 0 fairly quickly and easily. Return to
design perspective in RapidMiner. Recall from Chapter 3 that we can filter out observations in our


data set. In that chapter, we discussed filtering out observations as a Data Preparation step, but we
can use the same operator in our Deployment as well. Using the search field in the Operators tab,
locate the Filter Examples operator and connect it to your k-Means Clustering operator, as is
depicted in Figure 6-9. Note that we have not disconnected the clu (cluster) port from the ‘res’
(result set) port, but rather, we have connected a second clu port to our exa port on the Filter
Examples operator, and connected the exa port from Filter Examples to its own res port.

Figure 6-9. Filtering our cluster model’s output for only observations in cluster 0.

As indicated by the black arrows in Figure 6-9, we are filtering out our observations based on an
attribute filter, using the parameter string cluster=cluster_0. This means that only those
observations in the data set that are classified in the cluster_0 group will be retained. Go ahead
and click the play button to run the model again.

You will see that we have not lost our Cluster Model tab. It is still available to us, but now we
have added an ExampleSet tab, which contains only those 154 observations which fell into cluster
0. As with the results of previous models we’ve created, we have descriptive statistics for the
various attributes in the data set.


Figure 6-10. Filtered results for only cluster 0 observations.
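
The Filter Examples step and the descriptive statistics of Figure 6-10 also have a compact
equivalent in the hypothetical pandas sketch we have been building, shown here for comparison:

# Equivalent of Filter Examples with cluster=cluster_0: keep only cluster 0.
cluster0 = df[df["cluster"] == 0]
print(len(cluster0))  # the 154 observations in cluster 0
print(cluster0[["Weight", "Cholesterol"]].describe())  # ranges, as in Figure 6-10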

Sonia could use these figures to begin contacting potential participants in her programs. With the
high risk group having weights between 167 and 203 pounds, and cholesterol levels between 204
and 235 (these are taken from the Range statistics in Figure 6-10), she could return to her
company’s database and issue a SQL query like this one:

SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE Weight >= 167
AND Cholesterol >= 204;

This would give her the contact list for every person, male or female, insured by her employer who
would fall into the higher risk group (cluster 0) in our data mining model. She could change the
parameter criteria in our Filter Examples operator to be cluster=cluster_1 and re-run the model to
get the descriptive statistics for those in the next highest risk group, and modify her SQL statement
to get the contact list for that group from her organizational database; something akin to this
query:

SELECT First_Name, Last_Name, Policy_Num, Address, Phone_Num
FROM PolicyHolders_view
WHERE (Weight >= 140 AND Weight <= 169)
AND (Cholesterol >= 168 AND Cholesterol <= 204);

If she wishes to also separate her groups by gender, she could add that criterion as well, such as
“AND Gender = 1” in the WHERE clause of the SQL statement. As she continues to develop
her health improvement programs, Sonia would have the lists of individuals that she most wants to


target in the hopes of raising awareness, educating policy holders, and modifying behaviors that
will lead to lower incidence of heart disease among her employer’s clients.

CHAPTER SUMMARY

k-Means clustering is a data mining model that falls primarily on the side of Classification when
referring to the Venn diagram from Chapter 1 (Figure 1-2). For this chapter’s example, it does not
necessarily predict which insurance policy holders will or will not develop heart disease. It simply
takes known indicators from the attributes in a data set, and groups them together based on those
attributes’ similarity to group averages. Because any attributes that can be quantified can also have
means calculated, k-means clustering provides an effective way of grouping observations together
based on what is typical or normal for that group. It also helps us understand where one group
begins and the other ends, or in other words, where the natural breaks occur between groups in a
data set.

k-Means clustering is very flexible in its ability to group observations together. The k-Means
operator in RapidMiner allows data miners to set the number of clusters they wish to generate, to
dictate the number of sample means used to determine the clusters, and to use a number of
different algorithms to evaluate means. While fairly simple in its set-up and definition, k-Means
clustering is a powerful method for finding natural groups of observations in a data set.

REVIEW QUESTIONS

1) What does the k in k-Means clustering stand for?

2) How are clusters identified? What process does RapidMiner use to define clusters and
place observations in a given cluster?

3) What does the Centroid Table tell the data miner? How do you interpret the values in a
Centroid Table?

4) How do descriptive statistics aid in the process of evaluating and deploying a k-Means
clustering model?


5) How might the presence of outliers in the attributes of a data set influence the usefulness
of a k-Means clustering model? What could be done to address the problem?

EXERCISE

Think of an example of a problem that could be at least partially addressed by being able to group
observations in a data set into clusters. Some examples might be grouping kids who might be at
risk for delinquency, grouping product sale volumes, grouping workers by productivity and
effectiveness, etc. Search the Internet or other resources available to you for a data set that would
allow you to investigate your question using a k-means model. As with all exercises in this text,
please ensure that you have permission to use any data set that might belong to your employer or
another entity. When you have secured your data set, complete the following steps:

1) Ensure that your data set is saved as a CSV file. Import your data set into your
RapidMiner repository and save it with a meaningful name. Drag it into a new process
window in RapidMiner.

2) Conduct any data preparation that you need for your data set. This may include handling
inconsistent data, dealing with missing values, or changing data types. Remember that in
order to calculate means, each attribute in your data set will need to be numeric. If, for
example, one of your attributes contains the values ‘yes’ and ‘no’, you may need to change
these to be 1 and 0 respectively, in order for the k-Means operator to work.

3) Connect a k-Means operator to your data set, configure your parameters (especially set
your k to something meaningful for your question) and then run your model.

4) Investigate your Centroid Table, Folder View, and the other evaluation tools.

5) Report your findings for your clusters. Discuss what is interesting about them and
describe what iterations of modeling you went through, such as experimentation with
different parameter values, to generate the clusters. Explain how your findings are relevant
to your original question.

102
Chapter 6: k-Means Clustering

Challenge Step!

6) Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast.
How are they different from your original model? Did the use of these operators change
your clusters, and if so, how?


CHAPTER SEVEN:
DISCRIMINANT ANALYSIS

CONTEXT AND PERSPECTIVE

Gill runs a sports academy designed to help high school aged athletes achieve their maximum
athletic potential. On the boys’ side of his academy, he focuses on four major sports: Football,
Basketball, Baseball and Hockey. He has found that while many high school athletes enjoy
participating in a number of sports in high school, as they begin to consider playing a sport at the
college level, they would prefer to specialize in one sport. As he’s worked with athletes over the
years, Gill has developed an extensive data set, and he now is wondering if he can use past
performance from some of his previous clients to predict prime sports for up-and-coming high
school athletes. Ultimately, he hopes he can make a recommendation to each athlete as to the
sport in which they should most likely choose to specialize. By evaluating each athlete’s
performance across a battery of tests, Gill hopes we can help him figure out for which sport each
athlete has the highest aptitude.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
 Explain what discriminant analysis is, how it is used and the benefits of using it.
 Recognize the necessary format for data in order to perform discriminant analysis.
 Explain the differences and similarities between k-Means clustering and discriminant
analysis.
 Develop a discriminant analysis data mining model in RapidMiner using a training data
set.
 Interpret the model output and apply it to a scoring data set in order to deploy the model.


ORGANIZATIONAL UNDERSTANDING

Gill’s objective is to examine young athletes and, based upon their performance across a number
of metrics, help them decide which sport is the most prime for their specialized success. Gill
recognizes that all of his clients possess some measure of athleticism, and that they enjoy
participating in a number of sports. Being young, athletic, and adaptive, most of his clients are
quite good at a number of sports, and he has seen over the years that some people are so naturally
gifted that they would excel in any sport they choose for specialization. Thus, he recognizes, as a
limitation of this data mining exercise, that he may not be able to use data to determine an athlete’s
“best” sport. Still, he has seen metrics and evaluations work in the past, and has seen that some of
his previous athletes really were pre-disposed to a certain sport, and that they were successful as
they went on to specialize in that sport. Based on his industry experience, he has decided to go
ahead with an experiment in mining data for athletic aptitude, and has enlisted our help.

DATA UNDERSTANDING

In order to begin to formulate a plan, we sit down with Gill to review his data assets. Every athlete
that has enrolled at Gill’s academy over the past several years has taken a battery test, which tested
for a number of athletic and personal traits. The battery has been administered to both boys and
girls participating in a number of different sports, but for this preliminary study we have decided
with Gill that we will look at data only for boys. Because the academy has been operating for
some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize
in a single sport, and which sport it was for each of them. Working with Gill, we gather the results
of the batteries for all former clients who have gone on to specialize, Gill adds the sport each
person specialized in, and we have a data set comprised of 493 observations containing the
following attributes:
 Age: This is the age in years (one decimal precision for the part of the year since the
client’s last birthday) at the time that the athletic and personality trait battery test was
administered. Participants ranged in age from 13-19 years old at the time they took the
battery.
 Strength: This is the participant’s strength measured through a series of weight lifting
exercises and recorded on a scale of 0-10, with 0 being limited strength and 10 being


sufficient strength to perform all lifts without any difficulty. No participant scored 8, 9 or
10, but some participants did score 0.
 Quickness: This is the participant’s performance on a series of responsiveness tests.
Participants were timed on how quickly they were able to press buttons when they were
illuminated or to jump when a buzzer sounded. Their response times were tabulated on a
scale of 0-6, with 6 being extremely quick response and 0 being very slow. Participants
scored all along the spectrum for this attribute.
 Injury: This is a simple yes (1) / no (0) column indicating whether or not the young athlete
had already suffered an athletic-related injury that was severe enough to require surgery or
other major medical intervention. Common injuries treated with ice, rest, stretching, etc.
were entered as 0. Injuries that took more than three weeks to heal, or that required physical
therapy or surgery, were flagged as 1.
 Vision: Athletes were not only tested on the usual 20/20 vision scale using an eye chart,
but were also tested using eye-tracking technology to see how well they were able to pick
up objects visually. This test challenged participants to identify items that moved quickly
across their field of vision, and to estimate speed and direction of moving objects. Their
scores were recorded on a 0 to 4 scale with 4 being perfect vision and identification of
moving objects. No participant scored a perfect 4, but the scores did range from 0 to 3.
 Endurance: Participants were subjected to an array of physical fitness tests including
running, calisthenics, aerobic and cardiovascular exercise, and distance swimming. Their
performance was rated on a scale of 0-10, with 10 representing the ability to perform all
tasks without fatigue of any kind. Scores ranged from 0 to 6 on this attribute. Gill has
acknowledged to us that even finely tuned professional athletes would not be able to score
a 10 on this portion of the battery, as it is specifically designed to test the limits of human
endurance.
 Agility: This is the participant’s score on a series of tests of their ability to move, twist,
turn, jump, change direction, etc. The test checked the athlete’s ability to move nimbly,
precisely, and powerfully in a full range of directions. This metric is comprehensive in
nature, and is influenced by some of the other metrics, as agility is often dictated by one’s
strength, quickness, etc. Participants were scored between 0 and 100 on this attribute, and
in our data set from Gill, we have found performance between 13 and 80.
 Decision_Making: This portion of the battery tests the athlete’s process of deciding what
to do in athletic situations. Athletes participated in simulations that tested their choices of


whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a
playing surface, etc. Their scores were to have been recorded on a scale of 0 to 100,
though Gill has indicated that no one who completed the test should have been able to
score lower than a 3, as three points are awarded simply for successfully entering and
exiting the decision making part of the battery. Gill knows that all 493 of his former
athletes represented in this data set successfully entered and exited this portion, but there
are a few scores lower than 3, and also a few over 100 in the data set, so we know we have
some data preparation in our future.
 Prime_Sport: This attribute is the sport each of the 493 athletes went on to specialize in
after they left Gill’s academy. This is the attribute Gill is hoping to be able to predict for
his current clients. For the boys in this study, this attribute will be one of four sports:
Football (American football, not soccer; sorry soccer fans), Basketball, Baseball, or Hockey.

As we analyze and familiarize ourselves with these data, we realize that all of the attributes with the
exception of Prime_Sport are numeric, and as such, we could exclude Prime_Sport and conduct a
k-means clustering data mining exercise on the data set. Doing this, we might be able to group
individuals into one sport cluster or another based on the means for each of the attributes in the
data set. However, having the Prime_Sport attribute gives us the ability to use a different type of
data mining model: Discriminant Analysis. Discriminant analysis is a lot like k-means clustering,
in that it groups observations together into like-types of values, but it also gives us something
more, and that is the ability to predict. Discriminant analysis then helps us cross that intersection
seen in the Venn diagram in Chapter 1 (Figure 1-2). It is still a data mining methodology for
classifying observations, but it classifies them in a predictive way. When we have a data set that
contains an attribute that we know is useful in predicting the same value for other observations
that do not yet have that attribute, then we can use training data and scoring data to mine
predictively. Training data are simply data sets that have that known prediction attribute. For the
observations in the training data set, the outcome of the prediction attribute is already known. The
prediction attribute is also sometimes referred to as the dependent attribute (or variable) or the
target attribute. It is the thing you are trying to predict. RapidMiner will ask us to set this
attribute to be the label when we build our model. Scoring data are the observations which have
all of the same attributes as the training data set, with the exception of the prediction attribute. We
can use the training data set to allow RapidMiner to evaluate the values of all our attributes in the
context of the resulting prediction variable (in this case, Prime_Sport), and then compare those
values to the scoring data set and predict the Prime_Sport for each observation in the scoring data

set. That may seem a little confusing, but our chapter example should help clarify it, so let’s move
on to the next CRISP-DM step.

DATA PREPARATION

This chapter’s example will be a slight divergence from other chapters. Instead of there being a
single example data set in CSV format for you to download, there are two this time. You can
access the Chapter 7 data sets on the book’s companion web site
(https://sites.google.com/site/dataminingforthemasses/).

They are labeled Chapter07DataSet_Scoring.csv and Chapter07DataSet_Training.csv. Go ahead
and download those now, and import both of them into your RapidMiner repository as you have
in past chapters. Be sure to designate the attribute names in the first row of the data sets as you
import them. Be sure you give each of the two data sets descriptive names, so that you can tell
they are for Chapter 7, and also so that you can tell the difference between the training data set and
the scoring data set. After importing them, drag only the training data set into a new process
window, and then follow the steps below to prepare for and create a discriminant analysis data
mining model.

1) Thus far, when we have added data to a new process, we have allowed the operator to
simply be labeled ‘Retrieve’, which is done by RapidMiner by default. For the first time, we
will have more than one Retrieve operator in our model, because we have a training data
set and a scoring data set. In order to easily differentiate between the two, let’s start by
renaming the Retrieve operator for the training data set that you’ve dragged and dropped
into your main process window. Right click on this operator and select Rename. You will
then be able to type in a new name for this operator. For this example, we will name the
operator ‘Training’, as is depicted in Figure 7-1.


Figure 7-1. Our Retrieve operator renamed as ‘Training’.

2) We know from our Data Understanding phase that we have some data that need to be fixed
before we can mine this data set. Specifically, Gill noticed some inconsistencies in the
Decision_Making attribute. Run your model and let’s examine the meta data, as seen in
Figure 7-2.

Figure 7-2. Identifying inconsistent data in the Decision_Making attribute.

3) While still in results perspective, switch to the Data View radio button. Click on the
column heading for the Decision_Making attribute. This will sort the attribute from
smallest to largest (note the small triangle indicating that the data are sorted in ascending
order using this attribute). In this view (Figure 7-3) we see that we have three observations
with scores smaller than three. We will need to handle these observations.


Figure 7-3. The data set sorted in ascending order by the Decision_Making attribute.

4) Click on the Decision_Making attribute again. This will re-sort the attribute in descending
order. Again, we have some values that need to be addressed (Figure 7-4).

Figure 7-4. The Decision_Making variable, re-sorted in descending order.

5) Switch back to design perspective. Let’s address these inconsistent data by removing them
from our training data set. We could set these inconsistent values to missing then set
missing values to another value, such as the mean, but in this instance we don’t really know

what should have been in this variable, so changing these to the mean seems a bit arbitrary.
Removing these inconsistencies means only removing 11 of our 493 observations, so rather
than risk using bad data, we will simply remove them. To do this, add two Filter Examples
operators in a row to your stream. For each of these, set the condition class to
attribute_value_filter, and for the parameter strings, enter ‘Decision_Making>=3’ (without
single quotes) for the first one, and ‘Decision_Making<=100’ for the second one. This
will reduce our training data set down to 482 observations. The set-up described in this
step is shown in Figure 7-5.

Figure 7-5. Filtering out observations with inconsistent data.
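
As an aside, if you were doing this cleanup in code rather than in RapidMiner, the two filters
collapse into a single line. A hedged sketch, assuming the training CSV has been loaded into a
pandas DataFrame:

# Hedged sketch: drop rows whose Decision_Making value is outside the valid
# 3-100 range, mirroring the two Filter Examples operators above.
import pandas as pd

train = pd.read_csv("Chapter07DataSet_Training.csv")
train = train[train["Decision_Making"].between(3, 100)]  # both bounds inclusive
print(len(train))  # 482 observations should remain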

6) If you would like, you can run the model to confirm that your number of observations
(examples) has been reduced to 482. Then, in design perspective, use the search field in
the Operators tab to look for ‘Discriminant’ and locate the operator for Linear
Discriminant Analysis. Add this operator to your stream, as shown in Figure 7-6.

Figure 7-6. Addition of the Linear Discriminant Analysis operator to the model.

7) The tra port on the LDA (or Linear Discriminant Analysis) operator indicates that this tool
does expect to receive input from a training data set like the one we’ve provided, but
despite this, we still have received two errors, as indicated by the black arrow at the bottom
of the Figure 7-6 image. The first error is because of our Prime_Sport attribute. It is data
typed as polynominal, and LDA likes attributes that are numeric. This is OK, because the
attribute being predicted (the label) can have a polynominal data type, and the Prime_Sport attribute is the
one we want to predict, so this error will be resolved shortly. This is because it is related to
the second error, which tells us that the LDA operator wants one of our attributes to be
designated as a ‘label’. In RapidMiner, the label is the attribute that you want to predict.
At the time that we imported our data set, we could have designated the Prime_Sport
attribute as a label, rather than as a normal attribute, but it is very simple to change an
attribute’s role right in your stream. Using the search field in the Operators tab, search for
an operator called Set Role. Add this to your stream and then in the parameters area on
the right side of the window, select Prime_Sport in the name field, and in target role, select
label. We still have a warning (which does not prevent us from continuing), but you will
see the errors have now disappeared at the bottom of the RapidMiner window (Figure 7-7).

Figure 7-7. Setting an attribute’s role in RapidMiner.

With our inconsistent data removed and our errors resolved, we are now prepared to move on
to…


MODELING

8) We now have a functional stream. Go ahead and run the model as it is now. With the mod
port connected to the res port, RapidMiner will generate Discriminant Analysis output for
us.

Figure 7-8. The results of discriminant analysis on our training data set.

9) The probabilities given in the results will total to 1. This is because at this stage of our
Discriminant Analysis model, all that has been calculated is the likelihood of an observation
landing in one of the four categories in our target attribute of Prime_Sport. Because this is
our training data set, RapidMiner can calculate these probabilities easily—every
observation is already classified. Football has a probability of 0.3237. If you refer back to
Figure 7-2, you will see that Football as Prime_Sport comprised 160 of our 493
observations. Thus, the probability of an observation having Football is 160/493, or
0.3245. But in steps 3 and 4 (Figures 7-3 and 7-4), we removed 11 observations that had
inconsistent data in their Decision_Making attribute. Four of these were Football
observations (Figure 7-4), so our Football count dropped to 156 and our total count
dropped to 482: 156/482 = 0.3237. Since we have no observations where the value for
Prime_Sport is missing, each possible value in Prime_Sport will have some portion of the
total count, and the sum of these portions will equal 1, as is the case in Figure 7-8. These
probabilities, coupled with the values for each attribute, will be used to predict the
Prime_Sport classification for each of Gill’s current clients represented in our scoring data
set. Return now to design perspective and in the Repositories tab, drag the Chapter 7
scoring data set over and drop it in the main process window. Do not connect it to your


existing stream, but rather, allow it to connect directly to a res port. Right click the
operator and rename it to ‘Scoring’. These steps are illustrated in Figure 7-9.

Figure 7-9. Adding the scoring data set to our model.

10) Run the model again. RapidMiner will give you an additional tab in results perspective this
time which will show the meta data for the scoring data set (Figure 7-10).

Figure 7-10. Results perspective meta data for our scoring data set.

11) The scoring data set contains 1,841 observations; however, as indicated by the black arrow in the Range
column of Figure 7-10, the Decision_Making attribute has some inconsistent data again.
Repeating the process previously outlined in steps 3 and 4, return to design perspective and
use two consecutive Filter Examples operators to remove any observations that have
values below 3 or above 100 in the Decision_Making attribute (Figure 7-11). This will


leave us with 1,767 observations, and you can check this by running the model again
(Figure 7-12).

Figure 7-11. Filtering out observations containing inconsistent Decision_Making values.

Figure 7-12. Verification that observations with inconsistent values have been removed.

12) We now have just one step remaining to complete our model and predict the Prime_Sport
for the 1,767 boys represented in our scoring data set. Return to design perspective, and
use the search field in the Operators tab to locate an operator called Apply Model. Drag
this operator over and place it in the Scoring data set’s stream, as is shown in Figure 7-13.


Figure 7-13. Adding the Apply Model operator to our Discriminant Analysis model.

13) As you can see in Figure 7-13, the Apply Model operator has given us an error. This is
because the Apply Model operator expects the output of a model generation operator as its
input. This is an easy fix, because our LDA operator (which generated a model for us) has
a mod port for its output. We simply need to disconnect the LDA’s mod port from the res
port it’s currently connected to, and connect it instead to the Apply Model operator’s mod
input port. To do this, click on the mod port for the LDA operator, and then click on the
mod port for the Apply Model operator. When you do this, the following warning will pop
up:

Figure 7-14. The port reconnection warning in RapidMiner.

14) Click OK to indicate to RapidMiner that you do in fact wish to reconfigure the spline to
connect mod port to mod port. The error message will disappear and your scoring model
will be ready for prediction (Figure 7-15).

Figure 7-15. Discriminant analysis model with training and scoring data streams.

15) Run the model by clicking the play button. RapidMiner will generate five new attributes
and add them to our results perspective (Figure 7-16), preparing us for…

EVALUATION

Figure 7-16. Prediction attributes generated by RapidMiner.

The first four attributes created by RapidMiner are confidence percentages, which indicate the
relative strength of RapidMiner’s prediction when compared to the other values the software might
have predicted for each observation. In this example data set, RapidMiner has not distributed
confidence percentages across each of our four target sports. If RapidMiner had found some
significant possibility that an observation might have more than one possible Prime_Sport, it
would have calculated the percent probability that the person represented by an observation would
succeed in one sport and in the others. For example, if an observation yielded a statistical
possibility that the Prime_Sport for a person could have been any of the four, but Baseball was the
strongest statistically, the confidence attributes on that observation might be: confidence(Football):
8%; confidence(Baseball): 69%; confidence(Hockey): 12%; confidence(Basketball): 11%. In some
predictive data mining models (including some later in this text), your data will yield partial
confidence percentages such as this. This phenomenon did not occur however in the data sets we
used for this chapter’s example. This is most likely explained by the fact discussed earlier in the
chapter: all athletes will display some measure of aptitude in many sports, and so their battery test
scores will likely be varied across the specializations. In statistical language, this is often referred to
as heterogeneity.

Not finding confidence percentages does not mean that our experiment has been a failure
however. The fifth new attribute, generated by RapidMiner when we applied our LDA model to
our scoring data, is the prediction of Prime_Sport for each of our 1,767 boys. Click on the Data
View radio button, and you will see that RapidMiner has applied our discriminant analysis model to
our scoring data, resulting in a predicted Prime_Sport for each boy based on the specialization
sport of previous academy attendees (Figure 7-17).

Figure 7-17. Prime_Sport predictions for each boy in the scoring data set.
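
For comparison only, the whole stream of Figure 7-15 can be sketched in a few lines of Python
with scikit-learn. This is a rough analogue, not the book’s method, and scikit-learn’s LDA will
not necessarily reproduce RapidMiner’s exact predictions:

# A hedged Python/scikit-learn analogue of this chapter's RapidMiner stream.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

train = pd.read_csv("Chapter07DataSet_Training.csv")
score = pd.read_csv("Chapter07DataSet_Scoring.csv")

# Data Preparation: remove inconsistent Decision_Making values (steps 5 and 11).
train = train[train["Decision_Making"].between(3, 100)].copy()
score = score[score["Decision_Making"].between(3, 100)].copy()

# Modeling: Prime_Sport is the label; every other attribute is a predictor.
X, y = train.drop(columns=["Prime_Sport"]), train["Prime_Sport"]
lda = LinearDiscriminantAnalysis().fit(X, y)

# Apply Model: predictions plus per-sport probabilities, the analogue of
# RapidMiner's confidence(...) attributes; each row of probabilities sums to 1.
X_score = score[X.columns]
score["prediction(Prime_Sport)"] = lda.predict(X_score)
confidences = lda.predict_proba(X_score)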


DEPLOYMENT

Gill now has a data set with a prediction for each boy that has been tested using the athletic battery
at his academy. What to do with these predictions will be a matter of some thought and
discussion. Gill can extract these data from RapidMiner and relate them back to each boy
individually. For relatively small data sets, such as this one, we could move the results into a
spreadsheet by simply copying and pasting them. Just as a quick exercise in moving results to
other formats, try this:

1) Open a blank OpenOffice Calc spreadsheet.

2) In RapidMiner, click on the 1 under Row No. in Data View of results perspective (the cell
will turn gray).

3) Press Ctrl+A (the keyboard command for ‘select all’ in Windows; you can use equivalent
keyboard command for Mac or Linux as well). All cells in Data View will turn gray.

4) Press Ctrl+C (or the equivalent keyboard command for ‘copy’ if not using Windows).

5) In your blank OpenOffice Calc spreadsheet, right click in cell A1 and choose Paste
Special… from the context menu.

6) In the pop up dialog box, select Unformatted Text, then click OK.

7) A Text Import pop up dialog box will appear with a preview of the RapidMiner data.
Accept the defaults by clicking OK. The data will be pasted into the spreadsheet. The
attribute names will have to be transcribed and added to the top row of the spreadsheet,
but the data are now available outside of RapidMiner. Gill can match each prediction back
to each boy in the scoring data set. The data are still in order, but remember that a few
were removed because of inconsistent data, so care should be exercised when matching
the predictions back to the boys represented by each observation. Bringing a unique
identifying number into the training and scoring data sets might aid the matching once


predictions have been generated. This will be demonstrated in an upcoming chapter’s example.
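
If the copy-and-paste route ever becomes tedious, the hand-off can also be scripted. A hedged
sketch, continuing the DataFrames from the earlier Python analogue; the Athlete_ID column is
hypothetical:

# Attach a simple identifier so predictions can be matched back to individual
# boys, then write a CSV file that OpenOffice Calc can open directly.
score.insert(0, "Athlete_ID", range(1, len(score) + 1))  # hypothetical identifier
score.to_csv("prime_sport_predictions.csv", index=False)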

Chapter 14 of this book will spend some time talking about ethics in data mining. As previously
mentioned, Gill’s use of these predictions is going to require some thought and discussion. Is it
ethical to push one of his young clients in the direction of one specific sport based on our model’s
prediction that that activity is a good match for the boy? Simply because previous academy
attendees went on to specialize in one sport or another, can we assume that current clients would
follow the same path? The final chapter will offer some suggestions for ways to answer such
questions, but it is wise for us to at least consider them now in the context of the chapter
examples.

It is likely that Gill, being experienced at working with young athletes and recognizing their
strengths and weaknesses, will be able to use our predictions in an ethical way. Perhaps he can
begin by grouping his clients by their predicted Prime_Sports and administering more ‘sport-
specific’ drills—say, jumping tests for basketball, skating for hockey, throwing and catching for
baseball, etc. This may allow him to capture more specific data on each athlete, or even to simply
observe whether or not the predictions based on the data are in fact consistent with observable
performance on the field, court, or ice. This is an excellent example of why the CRISP-DM
approach is cyclical: the predictions we’ve generated for Gill are a starting point for a new round of
assessment and evaluation, not the ending or culminating point. Discriminant analysis has given
Gill some idea about where his young proteges may have strengths, and this can point him in
certain directions when working with each of them, but he will inevitably gather more data and
learn whether or not the use of this data mining methodology and approach is helpful in guiding
his clients to a sport in which they might choose to specialize as they mature.

CHAPTER SUMMARY

Discriminant analysis helps us to cross the threshold between Classification and Prediction in data
mining. Prior to Chapter 7, our data mining models and methodologies focused primarily on
categorization of data. With Discriminant Analysis, we can take a process that is very similar in
nature to k-means clustering, and with the right target attribute in a training data set, generate


predictions for a scoring data set. This can become a powerful addition to k-means models, giving
us the ability to apply our clusters to other data sets that haven’t yet been classified.

Discriminant analysis can be useful where the classification for some observations is known and is
not known for others. Some classic applications of discriminant analysis are in the fields of
biology and organizational behavior. In biology, for example, discriminant analysis has been
successfully applied to the classification of plant and animal species based on the traits of those
living things. In organizational behavior, this type of data modeling has been used to help workers
identify potentially successful career paths based on personality traits, preferences and aptitudes.
By coupling known past performance with unknown but similarly structured data, we can use
discriminant analysis to effectively train a model that can then score the unknown records for us,
giving us a picture of what categories the unknown observations would likely be in.

REVIEW QUESTIONS

1) What type of attribute does a data set need in order to conduct discriminant analysis
instead of k-means clustering?

2) What is a ‘label’ role in RapidMiner and why do you need an attribute with this role in
order to conduct discriminant analysis?

3) What is the difference between a training data set and a scoring data set?

4) What is the purpose of the Apply Model operator in RapidMiner?

5) What are confidence percent attributes used for in RapidMiner? What was the likely
reason that we did not find any in this chapter’s example? Are there attributes about young
athletes that you can think of that were not included in our data sets that might have
helped us find some confidence percents? (Hint: think of things that are fairly specific to
only one or two sports.)

6) What would be problematic about including both male and female athletes in this chapter’s
example data?


EXERCISE

For this chapter’s exercise, you will compile your own data set based on people you know and the
cars they drive, and then create a linear discriminant analysis of your data in order to predict
categories for a scoring data set. Complete the following steps:

1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet
there will be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one
Training and the second one Scoring. You can rename the tabs by double clicking on their
labels. You can delete or ignore the third default sheet.

2) On the training sheet, starting in cell A1 and going across, create attribute labels for six
attributes: Age, Gender, Marital_Status, Employment, Housing, and Car_Type.

3) Copy each of these attribute names except Car_Type into the Scoring sheet.

4) On the Training sheet, enter values for each of these attributes for several people that you
know who have a car. These could be family members, friends and neighbors, coworkers
or fellow students, etc. Try to do at least 20 observations; 30 or more would be better.
Enter husband and wife couples as two separate observations, so long as each spouse has a
different vehicle. Use the following to guide your data entry:
a. For Age, you could put the person’s actual age in years, or you could put them in
buckets. For example, you could put 10 for people aged 10-19; 20 for people aged
20-29; etc.
b. For Gender, enter 0 for female and 1 for male.
c. For Marital_Status, use 0 for single, 1 for married, 2 for divorced, and 3 for
widowed.
d. For Employment, enter 0 for student, 1 for full-time, 2 for part-time, and 3 for
retired.
e. For Housing, use 0 for lives rent-free with someone else, 1 for rents housing, and 2
for owns housing.
f. For Car_Type, you can record data in a number of ways. This will be your label, or
the attribute you wish to predict. You could record each person’s car by make (e.g.


Toyota, Honda, Ford, etc.), or you could record it by body style (e.g. Car, Truck,
SUV, etc.). Be consistent in assigning classifications, and note that depending on
the size of the data set you create, you won’t want to have too many possible
classifications, or your predictions in the scoring data set will be spread out too
much. With small data sets containing only 20-30 observations, the number of
categories should be limited to three or four. You might even consider using
Japanese, American, European as your Car_Type values.

5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice
Calc. Repeat the data entry process for at least 20 people (more is better) that you know
who do not have a car. You will use the training set to try to predict the type of car each of
these people would drive if they had one.

6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring
sheets as CSV files.

7) Import your two CSV files into your RapidMiner repository. Be sure to give them
descriptive names.

8) Drag your two data sets into a new process window. If you have prepared your data well
in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with,
so data preparation should be minimal. Rename the two retrieve operators so you can tell
the difference between your training and scoring data sets.

9) One necessary data preparation step is to add a Set Role operator and define the Car_Type
attribute as your label.

10) Add a Linear Discriminant Analysis operator to your Training stream.

11) Apply your LDA model to your scoring data and run your model. Evaluate and report
your results. Did you get any confidence percentages? Do the predicted Car_Types seem
reasonable and consistent with your training data? Why or why not?


Challenge Step!

12) Change your LDA operator to a different type of discriminant analysis (e.g. Quadratic)
operator. Re-run your model. Consider doing some research to learn about the difference
between linear and quadratic discriminant analysis. Compare your new results to the LDA
results and report any interesting findings or differences.


CHAPTER EIGHT:
LINEAR REGRESSION

CONTEXT AND PERSPECTIVE

Sarah, the regional sales manager from the Chapter 4 example, is back for more help. Business is
booming, her sales team is signing up thousands of new clients, and she wants to be sure the
company will be able to meet this new level of demand. She was so pleased with our assistance in
finding correlations in her data, she now is hoping we can help her do some prediction as well.
She knows that there is some correlation between the attributes in her data set (things like
temperature, insulation, and occupant ages), and she’s now wondering if she can use the data set
from Chapter 4 to predict heating oil usage for new customers. You see, these new customers
haven’t begun consuming heating oil yet, there are a lot of them (42,650 to be exact), and she
wants to know how much oil she needs to expect to keep in stock in order to meet these new
customers’ demand. Can she use data mining to examine household attributes and known past
consumption quantities to anticipate and meet her new customers’ needs?

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
 Explain what linear regression is, how it is used and the benefits of using it.
 Recognize the necessary format for data in order to perform predictive linear regression.
 Explain the basic algebraic formula for calculating linear regression.
 Develop a linear regression data mining model in RapidMiner using a training data set.
 Interpret the model’s coefficients and apply them to a scoring data set in order to deploy
the model.


ORGANIZATIONAL UNDERSTANDING

Sarah’s new data mining objective is pretty clear: she wants to anticipate demand for a consumable
product. We will use a linear regression model to help her with her desired predictions. She has
data, 1,218 observations from the Chapter 4 data set that give an attribute profile for each home,
along with those homes’ annual heating oil consumption. She wants to use this data set as training
data to predict the usage that 42,650 new clients will bring to her company. She knows that these
new clients’ homes are similar in nature to her existing client base, so the existing customers’ usage
behavior should serve as a solid gauge for predicting future usage by new customers.

DATA UNDERSTANDING

As a review, our data set from Chapter 4 contains the following attributes:
 Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
each home’s insulation. A home with a density rating of one is poorly insulated, while a
home with a density of ten has excellent insulation.
 Temperature: This is the average outdoor ambient temperature at each home for the
most recent year, measured in degrees Fahrenheit.
 Heating_Oil: This is the total number of units of heating oil purchased by the owner of
each home in the most recent year.
 Num_Occupants: This is the total number of occupants living in each home.
 Avg_Age: This is the average age of those occupants.
 Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home.

We will use the Chapter 4 data set as our training data set in this chapter. Sarah has assembled a
separate Comma Separated Values file containing all of these same attributes, except of course for
Heating_Oil, for her 42,650 new clients. She has provided this data set to us to use as the scoring
data set in our model.


DATA PREPARATION

You should already have downloaded and imported the Chapter 4 data set, but if not, you can get
it from the book’s companion web site (https://sites.google.com/site/dataminingforthemasses/).
Download and import the Chapter 8 data set from the companion web site as well. Once you
have both the Chapter 4 and Chapter 8 data sets imported into your RapidMiner data repository,
complete the following steps:

1) Drag and drop both data sets into a new process window in RapidMiner. Rename the
Chapter 4 data set to ‘Training (CH4)’, and the Chapter 8 data set to ‘Scoring (CH8)’.
Connect both out ports to res ports, as shown in Figure 8-1, and then run your model.

Figure 8-1. Using both Chapter 4 and 8 data sets to set up a linear regression model.

2) Figures 8-2 and 8-3 show side-by-side comparisons of the training and scoring data sets.
When using linear regression as a predictive model, it is extremely important to remember
that the ranges for all attributes in the scoring data must be within the ranges for the
corresponding attributes in the training data. This is because a training data set cannot be
relied upon to predict a target attribute for observations whose values fall outside the
training data set’s values.


Figure 8-2. Value ranges for the training data set’s attributes.

Figure 8-3. Value ranges for the scoring data set’s attributes.

3) We can see that in comparing Figures 8-2 and 8-3, the ranges are the same for all attributes
except Avg_Age. In the scoring data set, we have some observations where the Avg_Age
is slightly below the training data set’s lower bound of 15.1, and some observations where
the scoring Avg_Age is slightly above the training set’s upper bound of 72.2. You might
think that these values are so close to the training data set’s values that it would not matter
if we used our training data set to predict heating oil usage for the homes represented by
these observations. While it is likely that such a slight deviation from the range on this
attribute would not yield wildly inaccurate results, we cannot use linear regression
prediction values as evidence to support such an assumption. Thus, we will need to
remove these observations from our data set. Add two Filter Examples operators with the
parameters attribute_value_filter and Avg_Age>=15.1 | Avg_Age <=72.2. When you run
your model now, you should have 42,042 observations remaining. Check the ranges again
to ensure that none of the scoring attributes now have ranges outside those of the training
attributes. Then return to design perspective.
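
The same range check and filter can be sketched in code as well. This is a hedged aside, with
assumed file names standing in for the Chapter 4 and Chapter 8 CSV files:

# Hedged sketch: keep only scoring observations whose Avg_Age falls within the
# training data's observed range, mirroring the two Filter Examples operators.
import pandas as pd

training = pd.read_csv("Chapter04DataSet.csv")  # assumed file name
scoring = pd.read_csv("Chapter08DataSet.csv")   # assumed file name

low, high = training["Avg_Age"].min(), training["Avg_Age"].max()  # 15.1 and 72.2
scoring = scoring[scoring["Avg_Age"].between(low, high)].copy()
print(len(scoring))  # 42,042 observations should remain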

4) As was the case with discriminant analysis, linear regression is a predictive model, and thus
will need an attribute to be designated as the label—this is the target, the thing we want to
predict. Search for the Set Role operator in the Operators tab and drag it into your training


stream. Change the parameters to designate Heating_Oil as the label for this model
(Figure 8-4).

Figure 8-4. Adding an operator to designate Heating_Oil as our label.

With this step complete our data sets are now prepared for…

MODELING

5) Using the search field in the Operators tab again, locate the Linear Regression operator and
drag and drop it into your training data set’s stream (Figure 8-5).

Figure 8-5. Adding the Linear Regression model operator to our stream.


6) Note that the Linear Regression operator uses a default tolerance of .05 (also known in
statistical language as the alpha, or significance, level). This value of .05 is very
common in statistical analysis of this type, so we will accept this default. The final step to
complete our model is to use an Apply Model operator to connect our training stream to
our scoring stream. Be sure to connect both the lab and mod ports coming from the Apply
Model operator to res ports. This is illustrated in Figure 8-6.

Figure 8-6. Applying the model to the scoring data set.

7) Run the model. Having two splines coming from the Apply Model operator and
connecting to res ports will result in two tabs in results perspective. Let’s examine the
LinearRegression tab first, as we begin our…

EVALUATION

Figure 8-7. Linear regression coefficients.



Linear regression modeling is all about determining how close a given observation is to an imaginary
line representing the average, or center of all points in the data set. That imaginary line gives us the
first part of the term “linear regression”. The formula for calculating a prediction using linear
regression is y=mx+b. You may recognize this from a former algebra class as the slope-intercept
formula for a line. In this formula, the variable y is the target, the label, the thing we
want to predict. So in this chapter’s example, y is the amount of Heating_Oil we expect each
home to consume. But how will we predict y? We need to know what m, x, and b are. The
variable x is the value for a given predictor attribute, or what is sometimes referred to as an
independent variable. Insulation, for example, is a predictor of heating oil usage, so Insulation is
a predictor attribute. The variable m is that attribute’s coefficient, shown in the second column of
Figure 8-7. The coefficient is the amount of weight the attribute is given in the formula.
Insulation, with a coefficient of 3.323, is weighted heavier than any of the other predictor attributes
in this data set. Each observation will have its Insulation value multiplied by the Insulation
coefficient to properly weight that attribute when calculating y (heating oil usage). The variable b is
a constant that is added to all linear regression calculations. It is represented by the Intercept,
shown in Figure 8-7 as 134.511. So suppose we had a house with insulation density of 5; our
formula using these Insulation values would be y=(5*3.323)+134.511.

But wait! We had more than one predictor attribute. We started out using a combination of five
attributes to try to predict heating oil usage. The formula described in the previous paragraph only
uses one. Furthermore, our LinearRegression result set tab pictured in Figure 8-7 only has four
predictor variables. What happened to Num_Occupants?

The answer to the latter question is that Num_Occupants was not a statistically significant
predictor of heating oil usage in this data set, and therefore, RapidMiner removed it as a predictor.
In other words, when RapidMiner evaluated the amount of influence each attribute in the data set
had on heating oil usage for each home represented in the training data set, the number of
occupants was so non-influential that its weight in the formula was set to zero. An example of
why this might occur could be that two older people living in a house may use the same amount of
heating oil as a young family of five in the house. The older couple might take longer showers, and
prefer to keep their house much warmer in the winter time than would the young family. The
variability in the number of occupants in the house doesn’t help to explain each home’s heating oil
usage very well, and so it was removed as a predictor in our model.
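
In code form, the fitted coefficients and intercept can be read directly off the model object. A
hedged sketch, continuing the pandas example above; note one difference from RapidMiner:
scikit-learn keeps every predictor you give it rather than removing non-significant ones, so
Num_Occupants is excluded by hand here:

# Hedged sketch: fit the same regression in scikit-learn and inspect the
# m values (coefficients) and b (intercept), as shown in Figure 8-7.
from sklearn.linear_model import LinearRegression

predictors = ["Insulation", "Temperature", "Avg_Age", "Home_Size"]  # Num_Occupants dropped by hand
reg = LinearRegression().fit(training[predictors], training["Heating_Oil"])

for name, m in zip(predictors, reg.coef_):
    print(name, round(m, 3))  # one coefficient per predictor attribute
print("Intercept (b):", round(reg.intercept_, 3))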

But what about the former question, the one about having multiple independent variables in this
model? How can we set up our linear formula when we have multiple predictors? This is done by
using the formula y = m1x1 + m2x2 + … + b, with one coefficient-times-value term per predictor. Let’s take an example. Suppose we wanted to predict
heating oil usage, using our model, for a home with the following attributes:
 Insulation: 6
 Temperature: 67
 Avg_Age: 35.4
 Home_Size: 5

Our formula for this home would be: y=(6*3.323)+(67*-0.869)+(35.4*1.968)+(5*3.173)+134.511

Our prediction for this home’s annual number of heating oil units ordered (y) is 181.758, or
basically 182 units. Let’s check our model’s predictions as we discuss possibilities for…

DEPLOYMENT

While still in results perspective, switch to the ExampleSet tab, and select the Data View radio
button. We can see in this view (Figure 8-8) that RapidMiner has quickly and efficiently predicted
the number of units of heating oil each of Sarah’s company’s new customers will likely use in their
first year. This is seen in the prediction(Heating_Oil) attribute.

Figure 8-8. Heating oil predictions for 42,042 new clients.



Let’s check the first of our 42,042 households by running the linear regression formula for row 1:

(5*3.323)+(69*-0.869)+(70.1*1.968)+(7*3.173)+134.511 = 251.321

Note that in this formula we skipped the Num_Occupants attribute because it is not predictive.
The formula’s result matches RapidMiner’s prediction for this home (to within rounding of the displayed coefficients). Sarah now has a
prediction for each of the new clients’ homes, with the exception of those that had Avg_Age
values that were out of range. How might Sarah use this data? She could start by summing the
prediction attribute. This will tell her the total new units of heating oil her company is going to
need to be able to provide in the coming year. This can be accomplished by exporting her data to
a spreadsheet and summing the column, or it can even be done within RapidMiner using an
Aggregate operator. We will demonstrate this briefly.
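
Before moving to the Aggregate demonstration, the row 1 arithmetic can be double-checked with a
few lines of code. This sketch uses the rounded coefficients shown in Figure 8-7, so it lands at
about 251.33; RapidMiner’s 251.321 comes from the unrounded coefficients:

# Checking the row 1 prediction by hand with the rounded Figure 8-7 values.
coefficients = {"Insulation": 3.323, "Temperature": -0.869,
                "Avg_Age": 1.968, "Home_Size": 3.173}
intercept = 134.511

row1 = {"Insulation": 5, "Temperature": 69, "Avg_Age": 70.1, "Home_Size": 7}
y = sum(coefficients[a] * row1[a] for a in coefficients) + intercept
print(round(y, 3))  # about 251.333 with these rounded coefficients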

1) Switch back to design perspective.

2) Search for the Aggregate operator in the Operators tab and add it between the lab and res
ports, as shown in Figure 8-9. It is not depicted in Figure 8-9, but if you wish to generate a
tab in results perspective that shows all of your observations and their predictions, you can
connect the ori port on the Aggregate operator to a res port.

Figure 8-9. Adding an Aggregate operator to our linear regression model.


3) Click on the Edit List button. A window similar to Figure 8-10 will appear. Set the
prediction(Heating_Oil) attribute as the aggregation attribute, and the aggregation function
to ‘sum’. If you would like, you can add other aggregations. In the Figure 8-10 example, we
have added an average for prediction(Heating_Oil) as well.

Figure 8-10. Configuring aggregations in RapidMiner.

4) When you are satisfied with your aggregations, click OK to return to your main process
window, then run the model. In results perspective, select the ExampleSet(Aggregate) tab,
then select the Data View radio button. The sum and average for the prediction attribute
will be shown, as depicted in Figure 8-11.

Figure 8-11. Aggregate descriptive statistics for our predicted attribute.

From this image, we can see that Sarah’s company is likely to sell some 8,368,088 units of heating
oil to these new customers. The company can expect that on average, their new customers will
order about 200 units each. These figures are for all 42,042 clients together, but Sarah is probably
going to be more interested in regional trends. In order to deploy this model to help her more
specifically address her new customers’ needs, she should probably extract the predictions, match
them back to their source records which might contain the new clients’ addresses, enabling her to
break the predictions down by city, county, or region of the country. Sarah could then work with
her colleagues in Operations and Order Fulfillment to ensure that regional heating oil distribution
centers around the country have appropriate amounts of stock on hand to meet anticipated need.
If Sarah wanted to get even more granular in her analysis of these data, she could break her
training and scoring data sets down into months using a month attribute, and then run the
predictions again to reveal fluctuations in usage throughout the course of the year.
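
As an aside, once the predictions have been exported from RapidMiner, the same aggregation and a regional breakdown could be sketched in a few lines of Python with pandas. This is only an illustration: the exported file name is assumed, and the Region column is a hypothetical attribute matched back in from the source records.

import pandas as pd

# Assumed export of the scored results from RapidMiner.
scored = pd.read_csv("scored_new_customers.csv")

# Total and average predicted units, as in Figure 8-11.
print(scored["prediction(Heating_Oil)"].sum())
print(scored["prediction(Heating_Oil)"].mean())

# With a hypothetical Region attribute joined back in, regional demand is:
print(scored.groupby("Region")["prediction(Heating_Oil)"].sum())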

CHAPTER SUMMARY

Linear regression is a predictive model that uses training and scoring data sets to generate numeric
predictions in data. It is important to remember that linear regression uses numeric data types for
all of its attributes. It uses the algebraic formula for calculating the slope of a line to determine
where an observation would fall along an imaginary line through the scoring data. Each attribute
in the data set is evaluated statistically for its ability to predict the target attribute. Attributes that
are not strong predictors are removed from the model. Those attributes that are good predictors
are assigned coefficients which give them weight in the prediction formula. Any observations
whose attribute values fall in the range of corresponding training attribute values can be plugged
into the formula in order to predict the target.

Once linear regression predictions are calculated, the results can be summarized in order to
determine if there are differences in the predictions in subsets of the scoring data. As more data
are collected, they can be added into the training data set in order to create a more robust training
data set, or to expand the ranges of some attributes to include even more values. It is very
important to remember that the ranges for the scoring attributes must fall within the ranges for the
training attributes in order to ensure valid predictions.
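
Although this book’s examples are built in RapidMiner, the train-then-score workflow summarized above can also be sketched in Python with scikit-learn. The sketch below is illustrative only: the CSV file names are assumed, and the predictor list omits Num_Occupants by hand because, as we saw, it is not predictive.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed file names for the chapter's training and scoring data sets.
train = pd.read_csv("Chapter08DataSet_Training.csv")
score = pd.read_csv("Chapter08DataSet_Scoring.csv")

predictors = ["Insulation", "Temperature", "Avg_Age", "Home_Size"]
model = LinearRegression().fit(train[predictors], train["Heating_Oil"])

# The coefficients play the role of RapidMiner's attribute weights.
print(dict(zip(predictors, model.coef_)), model.intercept_)

# Score the new observations, as RapidMiner does when the model is applied.
score["prediction(Heating_Oil)"] = model.predict(score[predictors])

Note that, unlike the RapidMiner model in this chapter, scikit-learn will not zero out weak predictors on its own, which is why we excluded Num_Occupants ourselves.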

REVIEW QUESTIONS

1) What data type does linear regression expect for all attributes? What data type will the
predicted attribute be when it is calculated?

2) Why are the attribute ranges so important when doing linear regression data mining?


3) What are linear regression coefficients? What does ‘weight’ mean?

4) What is the linear regression mathematical formula, and how is it arranged?

5) How are linear regression results interpreted?

Extra thought question:


6) If you have an attribute that you want to use in a linear regression model, but it contains
text data, such as the make or model of a car, what could you do in order to be able to use
that attribute in your model?

EXERCISE

In the Chapter 4 exercise, you compiled your own data set about professional athletes. For this
exercise, we will enhance this data set and then build a linear regression model on it. Complete the
following steps:

1) Open the data set you compiled for the Chapter 4 exercise. If you did not do that exercise,
please turn back to Chapter 4 and complete steps 1 – 4.

2) Split your data set’s observations in two: a training portion and a scoring portion. Be sure
that you have at least 20 observations in your training data set, and at least 10 in your
scoring data set. More would be better, so if you only have 30 observations total, perhaps
it would be good to take some time to look up ten or so more athletes to add to your
scoring data set. Also, we are going to try to predict each athlete’s salary, so if Salary is not
one of your attributes, look it up for each athlete in your training data set (don’t look it up
for the scoring data set athletes, we’re going to try to predict these). Also, if there are other
attributes that you don’t have, but that you think would be great predictors of salary, look
these up, and add them to both your training and scoring data sets. These might be things
like points per game, defensive statistics, etc. Be sure your attributes are numeric.


3) Import both of your data sets into your RapidMiner repository. Be sure to give them
descriptive names. Drag and drop them into a new process, and rename them as Training
and Scoring so that you can tell them apart.

4) Use a Set Role operator to designate the Salary attribute as the label for the training data.

5) Add a linear regression operator and apply your model to your scoring data set.

6) Run your model. In results perspective, examine your attribute coefficients and the
predictions for the athletes’ salaries in your scoring data set.

7) Report your results:


a. Which attributes have the greatest weight?
b. Were any attributes dropped from the data set as non-predictors? If so, which ones
and why do you think they weren’t effective predictors?
c. Look up a few of the salaries for some of your scoring data athletes and compare
their actual salary to the predicted salary. Is it very close? Why or why not, do you
think?
d. What other attributes do you think would help your model better predict
professional athletes’ salaries?


CHAPTER NINE:
LOGISTIC REGRESSION

CONTEXT AND PERSPECTIVE

Remember Sonia, the health insurance program director from Chapter 6? Well, she’s back for
more help too! Her k-means clustering project was so helpful in finding groups of folks who could
benefit from her programs, that she wants to do more. This time around, she is concerned with
helping those who have suffered heart attacks. She wants to help them improve lifestyle choices,
including management of weight and stress, in order to improve their chances of not suffering a
second heart attack. Sonia is wondering if, with the right training data, we can predict the chances
of her company’s policy holders suffering second heart attacks. She feels like she could really help
some of her policy holders who have suffered heart attacks by offering weight, cholesterol and
stress management classes or support groups. By lowering these key heart attack risk factors, her
employer’s clients will live healthier lives, and her employer’s risk at having to pay costs associated
with treatment of second heart attacks will also go down. Sonia thinks she might even be able to
educate the insured individuals about ways to save money in other aspects of their lives, such as
their life insurance premiums, by being able to demonstrate that they are now a lower risk policy
holder.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:
 Explain what logistic regression is, how it is used and the benefits of using it.
 Recognize the necessary format for data in order to perform predictive logistic regression.
 Develop a logistic regression data mining model in RapidMiner using a training data set.
 Interpret the model’s outputs and apply them to a scoring data set in order to deploy the
model.


ORGANIZATIONAL UNDERSTANDING

Sonia’s desire is to expand her data mining activities to determine what kinds of programs she
should develop to help victims of heart attacks avoid suffering a recurrence. She knows that
several risk factors such as weight, high cholesterol and stress contribute to heart attacks,
particularly in those who have already suffered one. She also knows that the cost of providing
programs developed to help mitigate these risks is a fraction of the cost of providing medical care
for a patient who has suffered multiple heart attacks. Getting her employer on board with funding
the programs is the easy part. Figuring out which patients will benefit from which programs is
trickier. She is looking to us to provide some guidance, based on data mining, to figure out which
patients are good candidates for which programs. Sonia’s bottom line is that she wants to know
whether or not something (a second heart attack) is likely to happen, and if so, just how likely it
is. Logistic regression is an excellent tool for predicting the likelihood of
something happening or not.

DATA UNDERSTANDING

Sonia has access to the company’s medical claims database. With this access, she is able to
generate two data sets for us. The first is a list of people who have suffered heart attacks, with an
attribute indicating whether or not they have had more than one; and the second is a list of those
who have had a first heart attack, but not a second. The former data set, comprising 138
observations, will serve as our training data, while the latter, comprising 690 people’s data, will be
for scoring. Sonia’s hope is to help this latter group of people avoid becoming second heart attack
victims. In compiling the two data sets we have defined the following attributes:
 Age: The age in years of the person, rounded to the nearest whole year.
 Marital_Status: The person’s current marital status, indicated by a coded number: 0–
Single, never married; 1–Married; 2–Divorced; 3–Widowed.
 Gender: The person’s gender: 0 for female; 1 for male.
 Weight_Category: The person’s weight categorized into one of three levels: 0 for normal
weight range; 1 for overweight; and 2 for obese.
 Cholesterol: The person’s cholesterol level, as recorded at the time of their treatment for
their most recent heart attack (their only heart attack, in the case of those individuals in the
scoring data set.)


 Stress_Management: A binary attribute indicating whether or not the person has previously attended a stress management course: 0 for no; 1 for yes.
 Trait_Anxiety: A score on a scale of 0 to 100 measuring each person’s natural stress level and ability to cope with stress. A short time after each person in each of the two data sets had recovered from their first heart attack, they were administered a standard test of natural anxiety. Their scores are tabulated and recorded in this attribute in five-point increments. A score of 0 would indicate that the person never feels anxiety, pressure
or stress in any situation, while a score of 100 would indicate that the person lives in a
constant state of being overwhelmed and unable to deal with his or her circumstances.
 2nd_Heart_Attack: This attribute exists only in the training data set. It will be our label,
the prediction or target attribute. In the training data set, the attribute is set to ‘yes’ for
individuals who have suffered second heart attacks, and ‘no’ for those who have not.
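
For readers who like to see the shape of such a model in code, here is a minimal Python sketch of the same idea using scikit-learn. This is not the RapidMiner process we will build in this chapter, just an illustration: it assumes the companion data set file names given in the next section, and it treats the coded categorical attributes as numbers, exactly as the chapter’s data sets do.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("Chapter09DataSet_Training.csv")
score = pd.read_csv("Chapter09DataSet_Scoring.csv")

label = "2nd_Heart_Attack"
predictors = [col for col in train.columns if col != label]

model = LogisticRegression(max_iter=1000).fit(train[predictors], train[label])

# predict_proba yields the likelihood of each outcome ('no'/'yes'),
# which is exactly the kind of confidence Sonia is looking for.
print(model.predict_proba(score[predictors])[:5])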

DATA PREPARATION

Two data sets have been prepared and are available for you to download from the companion web
site. These are labeled Chapter09DataSet_Training.csv, and Chapter09DataSet_Scoring.csv. If
you would like to follow along with this chapter’s example, download these two datasets now, and
complete the following steps:

1) Begin the process of importing the training data set first. For the most part, the process
will be the same as what you have done in past chapters, but for logistic regression, there
are a few subtle differences. Be sure to set the first row as the attribute names. On the
fourth step, when setting data types and attribute roles, you will need to make at least one
change. Be sure to set the 2nd_Heart_Attack data type to ‘nominal’, rather than binominal.
Even though it is a yes/no field, and RapidMiner will default it to binominal because of
that, the Logistic Regression operator we’ll be using in our modeling phase expects the
label to be nominal. RapidMiner does not offer binominal-to-nominal or integer-to-
nominal operators, so we need to be sure to set this target attribute to the needed data type
of ‘nominal’ as we import it. This is shown in Figure 9-1:


Figure 9-1. Setting the 2nd_Heart_Attack attribute’s data type to ‘nominal’ during import.

2) At this time you can also change the 2nd_Heart_Attack attribute’s role to ‘label’, if you
wish. We have not done this in Figure 9-1, and subsequently we will be adding a Set Role
operator to our stream as we continue our data preparation.

3) Complete the data import process for the training data, then drag and drop the data set
into a new, blank main process. Rename the data set’s Retrieve operator as Training.

4) Import the scoring data set now. Be sure the data type for all attributes is ‘integer’. This
should be the default, but may not be, so double check. Since the 2nd_Heart_Attack
attribute is not included in the scoring data set, you don’t need to worry about changing it
as you did in step 1. Complete the import process, drag and drop the scoring data set into
your main process and rename this data set’s Retrieve operator to be Scoring. Your model
should now appear similar to Figure 9-2.


Figure 9-2. The training and scoring data sets in a new main process window in RapidMiner.

5) Run the model and compare the ranges for all attributes between the scoring and training
result set tabs (Figures 9-3 and 9-4, respectively). You should find that the ranges are the
same. As was the case with Linear Regression, the scoring values must all fall within the
lower and upper bounds set by the corresponding values in the training data set. We can
see in Figures 9-3 and 9-4 that this is the case, so our data are very clean; they were
prepared during extraction from Sonia’s source database, and we will not need to do
further data preparation in order to filter out observations with inconsistent values or
modify missing values.

Figure 9-3. Meta data for the scoring data set (note the absence of the 2nd_Heart_Attack attribute).


Figure 9-4. Meta data for the training data set (2nd_Heart_Attack attribute is present
with ‘nominal’ data type.) Note that all scoring ranges fall within all training ranges.
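
If you would like to verify the range check from step 5 programmatically, a few lines of pandas will do it. Again this is just a sketch, assuming the two companion CSV files:

import pandas as pd

train = pd.read_csv("Chapter09DataSet_Training.csv")
score = pd.read_csv("Chapter09DataSet_Scoring.csv")

# Every scoring value should fall within the training minimum and maximum.
for col in score.columns:
    ok = score[col].between(train[col].min(), train[col].max()).all()
    print(col, "within training range" if ok else "OUT OF RANGE")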

6) Switch back to design perspective and add a Set Role operator to your training stream.
Remember that if you designated 2nd_Heart_Attack to have a ‘label’ role during the
import process, you won’t need to add a Set Role operator at this time. We did not do this
in the book example, so we need the operator to designate 2nd_Heart_Attack as our label,
our target attribute:

Figure 9-5. Configuring the 2nd_Heart_Attack attribute’s role in preparation for logistic regression mining.
