Winning with Data Science
A Handbook for Business Leaders
Howard Friedman and Akshay Swaminathan
Columbia University Press
Publishers Since 1893
New York Chichester, West Sussex
cup.columbia.edu
CONTENTS
Acknowledgments
Introduction
1 Tools of the Trade
2 The Data Science Project
3 Data Science Foundations
4 Making Decisions with Data
5 Clustering, Segmenting, and Cutting Through the Noise
6 Building Your First Model
7 Tools for Machine Learning
8 Pulling It Together
9 Ethics
Conclusion
Notes
Index
ACKNOWLEDGMENTS
AS:
I would like to thank my co-author Howard Friedman for the opport-
unity to collaborate on this important project. Learning from How-
ard’s years of experience as an author and an educator has been
truly enlightening. His patience, generosity, and mentorship have
shaped me as a writer and as a person.
HF:
I would like to start by thanking my co-author Akshay Swaminathan
who has been a brilliant inspiration and wonderful co-author. Without
his vision, this book would have been mired in the wilderness for years.
I warmly thank Shui Chen and Howard Chen Friedman Jr for
their love, encouragement, and support.
I am grateful to Arthur Goldwag, who has been a friend and men-
tor throughout the development of this book.
Much of this book was inspired by the great leaders and collabo-
rators that I have worked with in the past. I wish to thank Prakash
Navaratnam, Joe Gricar, Kim Heithoff, Nathan Hill, Dorie Clark,
Mark Jordan, David Todd, Nelson Lin, Whit Bundy, Jack Harnett,
Kyle Neumann, Xena Ugrinsky, Afsheen Afshar, Charles Solomon,
Armen Kherlopian, Emma Arakelyan, Armen Aghinyan, Arby Leo-
nian, Conner Raikes, Derrick Perkins, Gavin Miyasato, Mohit Misra,
Vamsi Kasivajjala, Wlad Perdomo, Mark Jordan, Rachel Schutt, Brandt
McKee, Jason Mozingo, Daniel Lasaga, Russ Abramson, Natalia
Kanem, Diene Keita, Julitta Onabanjo, Will Zeck, Anneka Knutsson,
Arthur Erken, Ramiz Alakbarov, Frank Rotman, and Peter Schnall.
INTRODUCTION
This book follows two protagonists, Steve and Kamala, and both need to extract value from the data science expertise in their companies. Steve works in consumer finance, with rotations
in the Fraud Department, Recoveries Department, and Real Estate
Division. Kamala, a junior executive at a health insurance company,
is tasked with roles in both clinical strategy and marketing, where
she needs to balance delivering good patient care with keeping
her company profitable. While we will dig deeply into consumer
finance and health insurance in this book, the lessons about being
a good customer are general across industries, so you will find rel-
evant guidance regardless of your professional focus.
What are you waiting for? Let’s get started.
1
TOOLS OF THE TRADE
If a customer has not paid anything owed in the previous six months, the debt is
charged off. This is true for any financial product offered by Shu
Money Financial, be it a credit card, a line of credit, a car loan,
or a mortgage. The charged-off account is then transferred to the
Recoveries Department, whose task is to collect as much of the debt
as possible.1
This morning Steve met with the company’s data science team.
As a primer, they shared with him a bewilderingly long list of pro-
gramming languages, software tools, and deployment methods
used in the last three years. His eyes glazed over as the team mem-
bers recounted their inventory of data products delivered to other
parts of the company. This was the data science team’s first project
in the Recoveries Department.
Steve felt overwhelmed. He wanted to understand everything
that the data science team members mentioned and not embarrass
himself. He wanted them to view him as a well-informed future
leader of the company. He started thinking about the data science
discussion and again felt a tightness in his chest.
But then Steve remembered that he wasn’t building the solution.
He was the customer.
Instead of lining up different grades of sandpaper, wood, drill
bits, and saws, he needed to act like a customer. If he was buy-
ing cabinets and having someone install them, he would know how
large the cabinets had to be, where they would be installed, and
how exactly he would use them. He would let the cabinet profes-
sionals build the solutions for him, while he made sure that the
products met his needs.
Steve relaxed a bit. He didn’t need to become an expert on every
tool in the data science toolbox. What he needed to do was focus
on the problem he was trying to solve. While it is true that different
tools are better for different situations, he was not going to make
the decisions in terms of which data storage devices, software, pro-
gramming languages, and other tools were being used. He needed to
understand the options and implications of the data science team’s
recommendations. If he had specific needs or restrictions, then he
had to make sure that the data science team members understood them.
DATA WORKFLOW
[Figure: stages of the data workflow, from data collection and storage through preparation and exploration to modeling (experimentation and prediction).]
DATA STORAGE
Looking back a few decades, the world of data was much simpler.
The volume and velocity of data were limited to events when the
customer directly interacted with the company by sending in a
check, calling Customer Service, or using a credit card. Data remained
in silos and was linked only if the same identifier, such as the cus-
tomer identification number, appeared in two tables so that the
tables could be joined.
Data storage was local, using physical storage devices such as
hard disk drives, solid-state drives, or external storage devices.
This worked fine until there was a physical disaster or the physical
devices simply failed, something that happened quite often. Back-
ups would need to be loaded to replace the lost data, but one would
often be quickly disappointed to learn that the files were not backed
up with any degree of consistency.
In business school, Steve was assigned a data analysis project
that had to be solved using hard drives and external storage devices
for backup. He presented the results and received a solid B+ on what
turned out to be only part one of the project. Part two required the
students to complete the same data analysis project using a cloud
provider of their choice, with nearly all choosing Google Cloud,
Amazon Web Services, or Microsoft Azure. Part two resulted in
For a large company like Shu Money Financial, Steve isn’t going to
make these decisions. At a smaller company, he may be one of the
key customers who help drive this decision. What he may want to
know is what the advantages are for him in this specific case. If he
has a very small analytics project, one that can be readily run on
a local computer, then he won’t need major computing power or
storage capability. He may be more focused on addressing issues
such as version control, data security, and backup for critical files.
DATA SOURCES
Outside data vendors sell information covering consumer behavior, demographics, and any other content areas where there is a market. Much of this
information is scraped from the web, while other information is
aggregated from public records, APIs, and private data sources.
There is a world of difference between the data elements captured
from a company’s prespecified inputs and the wide assortment of
data that comes from social media, customer emails, chatbots, or
other external sources. In the past, nearly all data was structured,
with predefined values, ranges, and formats for information, such
as how old the customer was, where they lived, or how much they
spent last month. This information was company data in that it
reflected information about the relationship between the customer
and the company. The information was stored in data warehouses,
where a substantial effort was made to clean the data and make
it readily amenable to analysis. This type of structured informa-
tion is readily stored in relational databases, since the data points
are related to one another and can be thought of as a large data
table.8 The columns hold attributes of the data, and each record has
a value for each attribute. Different files could be merged using a
variable that was common to each file, such as a customer’s unique
identification number.
Today there are vast quantities of unstructured data (data that
is not captured in traditional relational databases), such as the
content of email communications, online chatbot exchanges, video
files, audio files, web pages, social media sites, and speech-to-text
conversions from interactions with a customer service representa-
tive. This type of data is often stored in a document database, or it
can be stored as a data lake and later processed and stored as a data
warehouse. Data lakes are data storage mechanisms that hold the
raw data in its native format until it is required for use.9 The data has
some tagging, such as keywords that are added to the records. These
tags are useful later when there is a need to use the information,
enabling the correct records to be extracted, cleaned, and then used.
Some useful information can be automatically extracted from
unstructured data. Years ago a human was required to manually
convert this unstructured data to structured data. Now technologies
such as speech-to-text conversion and natural language processing can perform much of this conversion automatically.
DATA QUALITY
The breadth and richness of the data are going to provide only lim-
ited value if the data quality is not assured. A truism for data sci-
ence, and for analytics more generally, is “garbage in, garbage out.”
As a result, the data preparation and cleaning stages are critical and
usually involve many steps. That said, many data sets have their
own idiosyncrasies, so the data scientist will need to understand
the input data well enough to know what kinds of data cleaning are
the highest priority.
Removing duplicate records is necessary so that observations
do not receive excessive weighting because they were acciden-
tally repeated in the database. The data types have to be verified
to prevent the same field from being represented as both a num-
ber and a character or as both a date and a text field. The range
of possible values needs to be understood so that misentered val-
ues can be readily detected and corrected. Also, the relevance of
missing values needs to be understood. How frequently is there
missing data in each field? Does the fact that data is missing
have meaning? If so, then that should be considered in the data
quality checking. Are missing values allowed, or does there need
to be some imputation method (a data-driven method for making
a guess when the data is missing) in situations where the data is
not available?
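As a rough illustration, a few of these checks can be sketched in Python with pandas. The file and column names below (accounts.csv, balance, open_date, age, income) are hypothetical placeholders, not fields from Shu Money Financial's systems.

    import pandas as pd

    # Load a hypothetical extract.
    df = pd.read_csv("accounts.csv")

    # 1. Remove exact duplicate records so repeated rows are not double counted.
    df = df.drop_duplicates()

    # 2. Verify data types: force numeric and date fields to consistent types.
    df["balance"] = pd.to_numeric(df["balance"], errors="coerce")
    df["open_date"] = pd.to_datetime(df["open_date"], errors="coerce")

    # 3. Range checks: flag values outside plausible bounds for review.
    suspect_ages = df[(df["age"] < 18) | (df["age"] > 120)]
    print(len(suspect_ages), "records with implausible ages")

    # 4. Missing data: how often is each field missing?
    print(df.isna().mean().sort_values(ascending=False))

    # 5. Simple median imputation, where business rules allow it.
    df["income"] = df["income"].fillna(df["income"].median())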
Different tables can be linked very easily using SQL. The user
simply has to identify the tables in the FROM statement and specify
exactly what fields need to be the same in the two tables in a JOIN
statement. There is a wide variety of ways to join tables that specify
which of the records needs to be in which of the tables.
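For readers who have not seen a join before, here is a small self-contained sketch using Python's built-in sqlite3 module. The table and column names (customers, claims, customer_id) are made up for illustration, not taken from the book's examples.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Two toy tables that share a common key, customer_id.
    cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    cur.execute("CREATE TABLE claims (claim_id INTEGER, customer_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
    cur.executemany("INSERT INTO claims VALUES (?, ?, ?)", [(10, 1, 250.0), (11, 1, 80.0)])

    # An inner JOIN keeps only rows whose customer_id appears in both tables;
    # a LEFT JOIN would instead keep every customer, even those with no claims.
    rows = cur.execute(
        """
        SELECT c.name, cl.claim_id, cl.amount
        FROM customers AS c
        JOIN claims AS cl ON c.customer_id = cl.customer_id
        """
    ).fetchall()

    print(rows)  # [('Ana', 10, 250.0), ('Ana', 11, 80.0)]
    conn.close()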
DATA PRODUCTS
KEY TOOLS
There are many important questions a customer can ask the data
science team, including the following:
• Overall Project
◦ What are the business's priorities in terms of solutions?
◦ What are the business's key constraints?
◦ Does the business have institutional policies regarding data storage, programming languages, and software choices?
◦ What are the advantages and disadvantages of each tool option?
• Data Workflow
◦ What steps are you taking to handle data cleaning and preprocessing tasks?
◦ How do you handle data integration or merging from multiple sources?
◦ How do you handle missing or incomplete data?
◦ What tools and alerts have you implemented to monitor and maintain data pipelines?
◦ How do you handle updates or changes to the underlying data sources?
• Data Storage
◦ What are the advantages and disadvantages of our current data storage system?
◦ What other data storage options have you considered?
◦ How are data security and privacy concerns addressed in our current data storage solution?
◦ What techniques are being used for data backup and disaster recovery?
• Data Sources
◦ Is all the source data structured, or is it a mix of structured and unstructured data?
◦ Is there a list of the data sources that can be shared?
• Data Quality
◦ Is there an up-to-date data dictionary?
◦ What data quality checks are being implemented for specific data sources and fields?
◦ Would the data quality checking be improved by linking the data scientists with the subject-matter experts?
◦ What is the frequency of missing data for different data fields and sources?
◦ Is there a pattern to the missing data, or is it random?
◦ Are missing values allowed, or does there need to be some imputation method?
• Coding Languages and Repositories
◦ What coding repository is being used?
◦ Can the customer have access to the project repository?
◦ What coding languages are being used and why were those languages chosen?
• Data Products
◦ What is the level of data product that is appropriate for the business given its IT capacity and human resource skill set?
◦ Why is the specific data solution being proposed?
2
THE DATA SCIENCE PROJECT
Kamala was the pride and joy of her family. Her parents had moved
from Bangalore to New Jersey when her mom was pregnant with
her. As the eldest of three children, she was always the one to break
new ground. She was named both “the top entrepreneur” and “the
one most likely to be retired by age 40” by her high school class.
Kamala was the first in her family to go to college and, just a few
years later, had finished her MD and MBA from Stanford. She settled
into the Bay Area after graduation to work at a med-tech start-up.
She figured they would IPO in less than 5 years. With her fortune,
she could coax her parents into an early retirement so they could
finally enjoy themselves instead of working night and day to pay for
the other kids’ education. But the start-up folded in less than a year.
With her degrees in hand and over a quarter of a million dollars
of debt, she soon found herself looking for stable employment. At
age 28, she joined Stardust Health Insurance, a medium-sized insur-
ance company working primarily in California and Florida. Within
4 years, Kamala was promoted to become the director of clinical
strategy and marketing. This position perfectly matched her pas-
sion for health care and her talent for growing businesses. Her posi-
tion was considered to be one that would put her on the fast track
to joining the C-suite. As long as Kamala could show some big wins,
she would quickly move even higher up the corporate ladder.
Kamala was given a budget of a few million dollars to spend on a
diverse set of projects as broad as her job title.
As her boss, Annie, told her on day one, “Your goal is to increase
the profitability of Stardust Health Insurance by taking data-driven
decisions on which drugs and procedures are the most cost effec-
tive. Given the choice among drugs that are commonly used to treat
pneumonia, lung cancer, high blood pressure, or other medical con-
ditions, Stardust needs to know which one is most likely to achieve
better outcomes after accounting for the patient’s demographics,
other medications, medical history, and family medical history. If
there is a difference in the expected outcomes of the drugs, we need
to know the total expected treatment cost of each option. And if the
drugs are expected to achieve the same outcomes, then we need to
know which one is the least expensive.”
Kamala nodded her head. “I understand completely. Sometimes
the more cost-effective decision involves a drug that is more expen-
sive but leads to better health outcomes as well as to less future
health care resource utilization, such as reduced future outpatient
visits, emergency room visits, or inpatient admissions. Other times
the increased drug costs are not offset by reduced expenses or bet-
ter health outcomes, so they are not worth the additional expenses.
If my team finds data in support of one of the drugs, we will advise
the medication formulary team at Stardust to add that to the list
of generics and brand-name prescription drugs we cover. Also, we
can incentivize the patient to use this more cost-effective drug by
reducing the co-pay for the preferred drug.”
Annie added, “Ultimately, the patient still has an option regard-
ing which treatment they choose, but your team plays a very strong
role in nudging the patient in one direction or another. Your job
function includes the marketing title as well. You will lead promo-
tional campaigns that build awareness of Stardust within the target
population. I expect you to manage the negotiations with advertis-
ers regarding the different media outlets, including billboards and
TV, radio, and online advertising. Stardust has always pitched its
PROJECT MANAGEMENT
A key lesson that Kamala took from her med-tech start-up failure
was the importance of project management. That business was
pitched as delivering an end-to-end system for taking MRIs, read-
ing the images via artificial intelligence algorithms, and delivering
the results online, followed by chatbot-based customer support to
answer patient questions. The main issue was that the product was
never built. No functional prototypes were ever produced. Sure,
they had some algorithms that could be coaxed into making accu-
rate predictions on a specially selected set of images, but that wasn’t
going to get them a product that could work in the marketplace.
Rather, it became an academic exercise led by the founder, a profes-
sor of data science, and a data science team whose members were
too focused on seeing what cool things they could do rather than on
delivering what they needed to produce to become profitable. One
of the main reasons that this academic exercise never left academia
was the lack of basic project management supervision. Since her
arrival at Stardust, Kamala has insisted on aggressive project man-
agement for all projects, especially data science projects, which she
feared could cost her money and time but yield few usable prod-
ucts. Her mantra for her team is “Measure twice, cut once,” and her
personality is such that she usually plans diligently by lining up the
critical steps rather than trying to sprint into the wind.
In her first meeting with the data science team, she asked, “So
which one of you is formally trained as a project manager?” The
dead silence was eventually broken by someone coughing up, “I’ve
managed a few projects before, and, well, most of them did OK.”
Kamala’s confidence was deflating by the second, but she quickly
rebounded. “I have some great project managers on my team. Phil
has worked on many IT projects and is a pro at planning, organiz-
ing, staffing, and leading projects from concept to completion. He’ll
be working with you on a day-to-day basis, and we will do weekly
check-ins. Of course, I am always checking emails and available
for a quick call.” Kamala understood that, as the project sponsor,
she needed to stay informed of the progress and key decisions but
didn’t need to get into the day-to-day minutia, as Phil would take
care of those details.
The second meeting featured Phil walking the team through the
paces of his project management training with the precision that one
expects from a military officer with 20 years’ experience. Following
Kamala’s lead, he insisted that data science projects are just like any
other projects, with four basic phases and milestones: concept, plan-
ning, implementation, and closeout (figure 2.1).1 Other companies
have their own standard sets of definitions for the different project
phases. These may include more phases and have different names.
The key is to ensure that the project team members agree on what
they will call the different phases of the project and what exactly
those phases represent. Agreeing on definitions is critical, since a
large percentage of the disagreements that people have about proj-
ects stem from simply working off of a different set of definitions.
Phil explained, “The concept phase refers specifically to defin-
ing the project requirements and its desired outcomes. This is
necessary, since a project can never succeed if everyone involved—
from the sponsor, Kamala, to the project manager to the data sci-
ence team—doesn’t agree on the project’s requirements, goals,
and scope.” There was plenty of head nodding around the room, as
everyone understood that if the project’s scope and objectives con-
tinue to change, it will constantly be a moving target for everyone
involved. Clearly, these questions had to be answered: What exact
data product will be produced? How are the products expected to
be used? How will success be measured? How will the utility of
the deliverable be assessed? What are the constraints on human
resources, time, cost, equipment, and other resources? What are
the risks associated with the data product and this project? How can
those risks be mitigated?
Kamala took over at this point. “The planning phase includes
laying out many of the key activities for the project along with
the resources required, risks identified, and risk mitigation steps
needed. It also includes discussing what the final deliverables are,
who will review those final deliverables, and what the timelines and
budgets are for the work.”
“Let’s review the planned activities and then clarify roles and
responsibilities,” Phil announced. As people began chiming in on
suggestions, Phil reminded them that activities are the tasks that
need to be accomplished in order for the project to be completed.
Sequencing refers to the fact that some activities have to happen
before others.
“Here’s a perfect example of sequencing,” said David, the senior
data scientist. “You have to first acquire data before you can begin
any data-cleaning process.”
“And, similarly, you need to clean the data before you begin
doing any modeling work,” added Kamala.
Human resource risks are among the first that most teams consider. Looking around the room, they assess
whether the team has the appropriate skills to do the job and, if not,
what should be done to address gaps. Steps needed to address gaps
often include hiring other staff or consultants, providing training,
and planning for unexpected departures.
Beyond these human resource risks are cost risks, where the proj-
ect costs exceed the planned budget; schedule risks, where the
activities take longer than planned; performance risks, where the
data product doesn’t produce results that meet the business needs;
operational risks, where the implementation of the data product is
not successful, resulting in poor outcomes; and legal risks, where
the project violates specific regulatory requirements or legal restric-
tions. Other risks relate to governance, strategy, and external haz-
ards and risks outside of the scope of the project team or company,
such as natural disasters or changes in the regulatory framework
or the competitive landscape. Risk assessment is often subject to
human biases, so some consideration of the biases that are implicit
in the assessments is useful.2 For example, “Do we have data to sup-
port our risk assessment?” is a fair question, though it may result in
some blank stares.
Phil liked to open up the discussion about risks by asking the
simple question “What can go wrong with this project?” and devel-
oping a systematic list of categories of risk such as those mentioned
above. For each risk, he would ask the team to guess the likelihood
of the risk becoming a problem as well as the expected impact. For
risks that are worth attending to, the team members would then
develop mitigation strategies as well as ways of detecting whether
those risks are becoming more likely.
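One lightweight way to capture Phil's exercise is a simple risk register that scores each risk by likelihood and impact. The entries and 1-to-5 scores below are hypothetical placeholders, not risks from an actual project.

    # A toy risk register: likelihood and impact are rough 1-5 scores from the team.
    risks = [
        {"risk": "Key data engineer leaves mid-project", "likelihood": 2, "impact": 4},
        {"risk": "Source data arrives later than planned", "likelihood": 4, "impact": 3},
        {"risk": "Model misses the performance target", "likelihood": 3, "impact": 5},
    ]

    # Rank risks by a simple severity score (likelihood x impact) to decide
    # which ones deserve mitigation plans and early-warning indicators.
    for r in sorted(risks, key=lambda r: r["likelihood"] * r["impact"], reverse=True):
        print(r["likelihood"] * r["impact"], r["risk"])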
The key take-home regarding project management of data sci-
ence projects is that these are projects and should be treated simi-
larly to how one would treat other important projects. The best
practices of project management should be applied so as to reduce
the chances of project failure and increase the chances that the data
product will solve the customer’s needs.
Phil drafted a project plan design document that listed the key
activities, roles and responsibilities, task sequencing, risks, and mitigation strategies.
"Much of the team's early work," David explained, "involves common data transformations that convert the data into formats that are more valuable for doing
further analysis and modeling work.”
“What are some examples of these common data transforma-
tions?” asked Kamala.
David explained, “Common data transformations include taking
values that have very large ranges and using log transformations to
map them to a space with a smaller range. For continuous variables,
there are sometimes advantages to converting, creating, and using
categorical variables. For example, consider the patient’s age. While
it could be measured in days or years, often it is more useful to
use age groupings such as 5-year increments. In that case, the data
analysts might identify which new variables should be created and
then add those to the database for future use. In addition to this
data preparation, data analysts’ work includes doing basic query-
ing of the data, retrieving data for specific questions, and preparing
output used for reporting. Their work is often displayed using busi-
ness intelligence tools and data visualization tools such as Tableau
and Power BI or, more simply, using spreadsheets.”
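A minimal sketch of the two transformations David mentions, written in Python with pandas and NumPy; the column names (claim_amount, age) and the 5-year bins are hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "claim_amount": [120.0, 85.5, 15000.0, 2.3, 980.0],
        "age": [23, 47, 68, 35, 51],
    })

    # Log transformation: compress a variable with a very large range.
    # log1p computes log(1 + x), which also handles zero values gracefully.
    df["log_claim_amount"] = np.log1p(df["claim_amount"])

    # Convert a continuous variable (age) into 5-year groupings.
    bins = range(0, 105, 5)  # 0-4, 5-9, ..., 95-99
    df["age_group"] = pd.cut(df["age"], bins=bins, right=False)

    print(df[["claim_amount", "log_claim_amount", "age", "age_group"]])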
Kamala asked, “What about specialized roles like data visualiza-
tion specialist or data storyteller?”
David smiled as he replied, “There is often little distinction
between the role of a data analyst and that of a data visualization
specialist or data storyteller. They all use data as inputs to commu-
nicate results, often to a nontechnical audience. They are the link
between the data science team and you, Kamala, or perhaps they
produce the key communications content that you use to explain
the project’s value and the utility of its deliverables to the board
and senior management like Annie. Ideally, this work goes beyond
the charts, tables, and graphs to develop a narrative that can sup-
port effective storytelling.”
David finished off by providing some more details on other
titles, including machine learning scientist, natural language pro-
cessing (NLP) expert, and statistician. He explained that machine
learning scientists focus on creating and using the newest technol-
ogy and innovative approaches. They can be engaged in developing new algorithms and modeling approaches.
A geospatial data scientist leverages GPS and other location data to create systems used for
site selection (the best place for a new store), navigation (the best
route to take), and other geospatial-specific tasks.
Some “old school” titles that may appear in your data science
team and still are used in the corporate world include math-
ematician and statistician. Mathematicians often have degrees in
operations research or applied mathematics and tend to focus on
optimization problems. Statisticians would have experience in
theoretical and applied statistics. They would have a deep under-
standing of how to apply specific statistical tests or models and to
quantify uncertainty, for example using confidence intervals. The
statistical tools and tests will need to be well matched to the spe-
cific problem and data set. Before implementing the statistical test,
the statistician should verify that the key assumptions in the statis-
tical test are valid. The medical field and health care industry tend
to employ many statisticians, since their work is often scrutinized
by governments and scientific committees.
The list of typical data scientist job titles above is meant to be
only a rough guide, since often titles don’t reflect the actual job
function as well as the experience and skill sets required. A key
step in the project management of a data science project is to iden-
tify what skills are needed at different phases of the project. When
exactly will we need data engineers and for how long? In what spe-
cific areas of machine learning do we need expertise? Do we need
someone with experience in NLP or geospatial data?
This listing of the skill sets required for the project and when
they are required needs to be mapped against the actual available
resources. In areas where there are skills gaps, a decision needs to be
made regarding where to source those resources. Internal transfers
can be a short-term option, though in many cases staff are already
well utilized in their current roles and have little availability. When
they can’t find additional resources internally, the project manager
needs to look externally. External sourcing can involve working with internal human resources staff or outside headhunters to identify short-term consultants, engaged for a specific project or on a time-and-materials basis, or new employees if the need is expected to be permanent.
PRIORITIZING PROJECTS
The potential value of a project can be gauged by the number of customers impacted, the impact per customer, and the frequency of the issue that is being addressed.
In some cases, the project can be classified as mission critical. For
those small sets of projects, success on the project is necessary for
the company to be successful. In other situations, an industry aver-
age solution or an off-the-shelf solution might be sufficient where
the area involved is not considered a competitive advantage.
Many within Stardust Health Insurance have adopted the objec-
tive and key results (OKR) framework for project management.4
This framework requires the team to define an objective, which
is a significant, clearly defined goal that can be readily measured.
The key results are used to track the achievement of the objective.
These key results should be SMART: Specific, Measurable, Assign-
able (state who will accomplish the goal), Relevant, and Time-
related.5 This isn’t the only formulation of this acronym; you may
see examples where the A stands for Achievable or Attainable and
the T stands for Time-bound.
The data should be explored to understand the current status of
the in-house data as well as to identify external data sources that
could be useful for the project. For each possible data source, it is
necessary to understand information such as the data availability,
the data limitations, the time required to acquire the data, and the
costs for both initial data acquisition and ongoing data use. Even
data that is publicly available for free, such as that on government
websites, may still involve some costs related to the staff time
needed to acquire the data, do any necessary cleaning, and monitor
the data source, as changes can often be made to variable names,
formats, and other quality control processes.
For the data itself, an assessment should be made regarding
its potential utility. Is the level of resolution appropriate for your
needs? Do you need geographic information accurate to the street
block, but the data provides information only by zip code? Do you
need age information accurate to the year, but the data is provided
only in 5-year age groupings?
Is the sample size sufficient for your purposes? If you already
know that a model like the one you want to develop requires a large amount of data, then a data source with only a small sample will not meet your needs.
"For the cost-benefit analysis, assume we will see savings from the adoption of AutoML due to both efficiency and less need for more advanced,
and expensive, data science talent. But also do a sensitivity analysis
that doesn’t include this assumption, just in case it doesn’t bear
fruit. By the way, any concerns from your end on whether we will
feel less confident in the output of AutoML programs, since we will
be using a black box to produce answers?”
“No concerns. The AutoML software isn’t a black box. It gives
us details on the variables that are important, the factors that drive
individual decisions, the reason a specific model was chosen, and
many other pieces of information we have trouble getting from our
programmers.”
With that guidance, Phil developed a few rough cost-benefit
analysis estimates for the AutoML software. The sensitivity analysis
showed that purchasing the software was not necessarily a major
win for the organization, so they decided to stick with their current
Python and R programming approach.
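The arithmetic behind a sensitivity analysis like Phil's can be sketched in a few lines of Python. The figures below are entirely hypothetical: the idea is simply to compute the net benefit of the software under pessimistic, expected, and optimistic assumptions and see whether the decision changes.

    # Hypothetical inputs for an AutoML purchase decision (not the book's figures).
    license_cost = 150_000            # annual license fee
    hours_saved = {"pessimistic": 500, "expected": 1_500, "optimistic": 3_000}
    hourly_rate = 90                  # fully loaded cost of data science time per hour

    for scenario, hours in hours_saved.items():
        benefit = hours * hourly_rate
        net = benefit - license_cost
        print(f"{scenario:>12}: benefit = ${benefit:,.0f}, net = ${net:,.0f}")

    # If the net benefit is negative in plausible scenarios, sticking with the
    # current Python/R workflow may be the safer choice.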
MEASURING SUCCESS
[Figure: the five dimensions of project success, namely schedule, costs, stakeholder satisfaction, quality, and the business case.]
KEY TOOLS
1. Concept
2. Planning
3. Implementation
4. Closeout
TECHNICAL SKILLS
• Intellectual Curiosity
• Communication Skills
• Business Acumen
• Collaboration
COST-BENEFIT ANALYSIS
PROJECT PURPOSE
DATA SOURCES
FEASIBILITY
• Can the project deliver the expected results in the expected timeline?
• What are the risks involved in the project?
• How can these risks be mitigated?
• Are there new data sources, new software, or new skills required from
human resources that add uncertainty to the project’s chance of success?
1. Schedule: Did the project get completed within the expected timeline?
2. Costs: Was the project’s actual spend in line with the expected spend?
3. Stakeholder Satisfaction: How are the stakeholders in the project feeling
about the project deliverables? Did it meet their signed-off expectations?
4. Quality: Did the project’s deliverables meet the quality standards, whether
measured by the modeling performance, the improvement over current
processes, or some other benchmark?
5. Business Case: How did the benefits of the delivered project compare
with the expected benefits as established in the cost-benefit analysis
developed as part of the project planning?
3
DATA SCIENCE FOUNDATIONS
EXPLORATION
Data Scope
Maya, a junior data scientist who reports to the team lead, David,
provides an overview of the claims database that the team used for
the exploratory analysis.
“The database we used includes claims made by all covered
patients from 2015 to 2023. Note that clients below age 18 at
the time of the claim are not included in the data set due to pri-
vacy concerns. If we want data on pediatric patients, we’ll have
to use another data source. We can be pretty confident that the
data set is comprehensive and accurate, since the data is col-
lected automatically whenever a claim is made. One important
caveat is that the database does not include claims for depen-
dents for certain employers whose contracts stipulate that data
on dependents must be stored in a high-security data lake. For
these folks, we need to tap into a separate database if we want to
look at their claims.”
There are a few key points from the data exploration that help
Kamala frame her analysis:
• The time period of the data. The data set covers claims from 2015 to 2023,
but Kamala is mainly interested in claims patterns from 2017 to 2023
because Stardust made some major changes to reimbursement rates in
2016. Data before that policy change is less relevant.
• Data inclusion criteria. Since the data set contains only claimants who are
18 or older, Kamala may have to access different data sources to under-
stand cost patterns in child claimants.
• Missing data. It’s important to know that the database doesn’t include
claims for certain dependents. Kamala asks the data science team to
investigate how many claimants this omission would impact and the
total reimbursement for those claimants to determine whether it is
worth the effort to get access to the high-security data lake.
Data Limitations
“We also found some data quality issues that you should be aware
of,” Maya points out. “The ‘Date of patient encounter’ field is empty
17 percent of the time. For these cases, we assumed the date of the
encounter was the date of the claim.”
Kamala interrupts. “A missing patient encounter date is not
uncommon, since many services are reimbursed without a patient
encounter. For example, some patients have a health plan that
reimburses gym visits. For those folks, a claim is triggered 3 months
after they buy their gym membership, and it wouldn’t have an asso-
ciated patient encounter, since they’re not interfacing with their
health care provider. The reason we wait 3 months before we trigger
the claim is that many folks cancel their gym membership within
the first month. So if we want to know when they activated their
gym membership, we'd have to go back 3 months from the date of the claim."
Importantly, Kamala realizes that she cannot use this data set to
answer certain questions. For example, she wouldn’t be able to look
at whether patients with a family history of chronic illness incur
higher costs because there is no data on family history captured.
After hearing about the scope of the data set from the data sci-
ence team, Kamala has a better understanding of how this data set
can be used to answer the question she’s interested in.
Summary Statistics
Now that they’ve reviewed the scope and limitations of the data,
the data science team members dive into their key results from the
exploratory data analysis.
Maya begins. “We saw that the total number of patients in our
network has been increasing over time. The total number of claim-
ants in 2018 was 1,003,123, and in 2019, this number was 1,014,239.
We also saw a drastic increase in the average number of claims per
claimant: this increased by 20 percent from 2018 to 2019. Digging
into this result a bit more, we found that in 2018, the number of
claims per claimant ranged from 0 to 368. In 2019, this number
ranged from 0 to 199,093. We also saw an increase in the average
age of claimants from 46 to 50 from 2018 to 2019. Lastly, we looked
at the relationship between the age of claimants and the number of
claims they filed. We made a scatterplot of these two variables that
shows a positive slope.”
There are typical ways results from exploratory data analyses are
reported. One key feature is reporting a measure of central ten-
dency, like an average or mean. Other measures of central tendency
include median and mode.5 These metrics boil down large amounts
of data into a single number. With over 1 million claimants, it’s
useful to understand general patterns in a few simple numbers. For
Kamala, it’s useful to know that the average age of all 1 million
claimants was 46 in 2018 and that this average increased to 50 in
2019. This shows that the claimant population got older on average.
There are times when using an average can be misleading and
when a measure like a median is more appropriate for understanding the typical value.
Figure 3.1 Comparison of the mean, median, and mode for normal, left-skewed, and right-skewed data distributions. In normally distributed data, the mean, median, and mode are equal. In right-skewed distributions, the mode is less than the median, which is less than the mean. In left-skewed distributions, the mode is greater than the median, which is greater than the mean. In skewed distributions, the mean is pulled towards the tail.
For example, it could be that half of all claimants are 26 and half are 66; in this extreme
case, the average age of claimants would be 46 even though none of
the claimants was actually 46! This example (albeit extreme) shows
that understanding variation is just as important as understanding
measures of central tendency. It’s like the joke about the person
who has one leg in an ice bath and the other in scalding hot water
and who says, “The average temperature is very comfortable.”
Some typical measures of variation include range, percentiles,
interquartile range, standard deviation, confidence intervals, and
variance.10 Each of these measures (figure 3.2) is useful in different
scenarios. The range is the difference between the minimum and
the maximum observed values. It can be useful in identifying outli-
ers and errors in data entry. For example, Kamala notices that the
range of claims per claimant goes almost as high as 200,000 in 2019.
That’s over 500 claims per day! She flags that unrealistic result for
the data science team to investigate.
Percentiles are a great way to understand variation. The nth per-
centile value means that n percent of all data points have a value less
than or equal to that value.11 For example, if the 50th percentile for
claims per claimant is 20, that means 50 percent of claimants have
less than or equal to 20 claims per year. The 0th percentile is the same as the minimum, the 50th percentile is the same as the median, and the 100th percentile is the same as the maximum. Another related measure of spread is the interquartile range, which is the difference between the 75th and 25th percentiles. This corresponds to the "middle half" of a data set. The interquartile range can be more helpful than the range (minimum to maximum), since the interquartile range typically excludes outliers at both ends of the distribution.12
[Figure 3.2 A box plot marking the minimum, first quartile (Q1, 25th percentile), median, third quartile (Q3, 75th percentile), maximum, the interquartile range (IQR), and an outlier, with whiskers at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.]
Kamala explains to the data science team, “The range of 0 to
almost 200,000 claims in 2019 makes me think that the upper val-
ues are outliers—maybe there was some issue in how the claims
were being recorded that is making this number so unrealistically
high. Do we know the interquartile range for the number of claims
per person? That may give us a better sense of the data, since it
would minimize the impact of outliers.”
Luckily, one of the data analysts has their computer on hand,
ready to field on-the-fly questions from Kamala. The analyst calcu-
lates the interquartile range for the number of claims per person for
both 2018 and 2019 and recalculates the range and average number
of claims for both years after excluding the outliers. “This makes a
lot more sense.” The analyst goes on to explain, “In 2018, the inter-
quartile range was 3 to 123, and in 2019, it was 5 to 148.”
In this case, the outlier was due to data quality issues: there
must have been something wrong with the data entry because it’s
impossible to have that many claims. However, there may be situa-
tions where outliers arise legitimately, not owing to issues with data
integrity. We saw this in our discussion of wealth, where a high-net-
worth individual may have wealth that is many orders of magnitude
greater than that of most of the population. This is an example of
an outlier that cannot be attributed to data quality issues.
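A short Python sketch (with made-up claim counts, not Stardust data) shows how the median, the interquartile range, and the common 1.5 * IQR outlier rule can be computed.

    import numpy as np

    # Hypothetical claims-per-claimant values, including one suspicious outlier.
    claims = np.array([0, 2, 3, 5, 8, 12, 20, 35, 60, 123, 199_093])

    q1, median, q3 = np.percentile(claims, [25, 50, 75])
    iqr = q3 - q1

    # A common rule of thumb flags points beyond 1.5 * IQR from the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = claims[(claims < lower) | (claims > upper)]

    print(f"median={median}, Q1={q1}, Q3={q3}, IQR={iqr}")
    print("flagged for review:", outliers)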
Exploratory analysis also examines correlations between pairs of variables, summarized by a correlation coefficient that ranges from −1 to 1. An example of a perfect correlation (1.0) is the relationship between the calendar date and your age, since your age increases by exactly one year over the course of a year. Another way of putting this is that given the date, we can
precisely determine your age. An example of a moderate correlation
(−0.4 to −0.7 or 0.4 to 0.7) is the relationship between height and
weight. In general, taller people are heavier, but this is not always the
case. If we were to plot the relationship between height and weight
for everyone in the United States, the points would have a positive
slope to them, but it would be a messy scatterplot, not a clean line.
Something with a weak correlation (–0.2 to 0.2) may be the relation-
ship between the number of steps you take in a day and the number
of minutes you spend listening to music in a day. If we plotted these
two quantities, we would expect to see a random cloud of points that
do not show any clear linear pattern. Situations where there are only
weak correlations can still be useful. For example, using variables
that are only weakly correlated still allows the data science team to
develop predictive models that will do better than random guesses.16
It’s important to keep in mind that the most popular correla-
tion coefficient, called a Pearson correlation coefficient, assumes
linear relationships, while there are other types of correlations—
such as the Spearman coefficient, which looks at the rank order of
values—that are more robust to other relationships.17 Calculating
a correlation coefficient is similar to fitting a line to a scatterplot
(figure 3.3). For some scatterplots, a line cannot be easily fit, and
for these types of plots, calculating a correlation coefficient may be
misleading.18 For example, the relationship between the amount of
carbohydrates eaten and overall health may not be linear—it may
look more like an upside-down U. If you eat very few carbohydrates,
you may not be getting enough energy, and your health outcomes
may be negatively impacted. Similarly, if you eat too many carbohy-
drates, you may develop insulin resistance and associated metabolic
conditions like diabetes. In between the two extremes is the “Goldi-
locks zone” of carbohydrate consumption that maximizes health.
Let’s assume that these two variables, carb consumption and overall
health, have a strong upside-down-U relationship (also known as a
quadratic relationship). If we were to calculate a correlation coef-
ficient for these two quantities, we may get a coefficient of 0.1. The
key is that this indicates a weak linear relationship, not necessarily a weak relationship overall.
[Figure 3.3 Scatterplots of y versus x illustrating relationships of different strengths and shapes.]
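A short Python example (using SciPy, with synthetic data) illustrates both points: the rank-based Spearman coefficient handles a monotonic but nonlinear relationship better than Pearson, while an upside-down-U relationship like the carbohydrate example yields a near-zero value for both, which is why plotting the data matters.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, size=500)

    # Monotonic but nonlinear relationship: Spearman captures it more fully.
    y_expo = np.exp(3 * x)
    r_p, _ = pearsonr(x, y_expo)
    r_s, _ = spearmanr(x, y_expo)
    print(f"exponential: pearson={r_p:.2f}, spearman={r_s:.2f}")

    # Upside-down-U (quadratic) relationship: both coefficients land near zero
    # even though the relationship is strong.
    y_quad = -(x ** 2) + rng.normal(0, 0.1, size=500)
    r_p, _ = pearsonr(x, y_quad)
    r_s, _ = spearmanr(x, y_quad)
    print(f"quadratic:   pearson={r_p:.2f}, spearman={r_s:.2f}")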
“Yes, that’s right.” Kamala goes on. “I’m also wondering if this
relationship has anything to do with the fact that we recently added
some employers to our network that employ older folks. So maybe
this relationship is being driven by that handful of employers that
we recently added.”
“Okay, so we have three questions to answer: (1) Are older
patients sicker than younger patients? (2) Do older patients use
health care services more than younger patients? (3) Is the relation-
ship between age and number of claims being driven by the hand-
ful of recently added employers? Now that we’ve narrowed down
the questions to answer, could you help us clarify some definitions?
First, how are we defining ‘older’ and ‘younger’ patients?”
“Good question. Let’s break up our patients into three age
groups: under 40, 40–65, and 65+. That way we can easily distin-
guish among Medicare-age patients (65+), middle-aged patients,
and young patients,” Kamala explains.
“That makes sense. We still have some other terms to define.
How should we measure who is sicker? Can we look at preexisting
conditions for that?” Maya asks.
“Looking at preexisting conditions is one way. We can also look
at diagnosis codes to see what conditions folks have been recently
diagnosed with,” Kamala explains.
“Okay, great. We can measure how sick someone is by looking
at the number of preexisting conditions and the diagnosis codes
they have.”
Kamala is perplexed. “Now that I think about it, this isn’t the
best measure of how sick someone is because having more diagno-
ses or conditions doesn’t necessarily mean that you are sicker than
someone with fewer. For example, someone with pancreatic cancer
would certainly be sicker than someone with dandruff and spring
allergies, but according to our metric, the person with two condi-
tions would be considered sicker.” By clearly stating what metric
would be used to measure “sickness,” it became clear to Kamala
that the proposed definition had limitations.
“I can see how this metric would have limitations. To my knowl-
edge, we don’t capture any other data on health status that could
68
DATA SCIENCE FOUNDATIONS
be used to measure how sick patients are. What do you think about
sticking with our proposed metric but being clear about how we
are measuring it and interpreting the results accordingly? If we find
the results do not make sense, we can look to import a comorbidity
index from another source,” Maya proposes. Kamala nods. “That
sounds good to me.” In the absence of better data, sometimes the
best option is to use an imperfect metric while clearly stating defi-
nitions, limitations, and interpretations.
“Last question: What do we mean by ‘use health care services
more’? Isn’t use of health care services measured by claims, so any-
one who files more claims would use more health care services?”
“Ah, I realized the way I phrased my initial question wasn’t clear.”
Kamala clarifies, “I’m thinking that older patients may make use of
routine screening procedures and preventive health care more than
younger patients—things like colonoscopies, mammograms, and
routine PCP checkups. These are examples of health care utiliza-
tion that has nothing to do with how sick someone is.”
“That’s really helpful, so we’re interested in utilization of routine
screening and preventive health care—and maybe other types of
services that are not indicative of severity of a certain condition.
For purposes of defining this metric, could we look at patients who
receive a procedure code for any service that falls into this cate-
gory? And do we have a list of procedure codes that we could use to
define the relevant services?” Maya asks.
“Yes, I think using procedure codes here makes sense,” Kamala
agrees. “Let me get back to you on that list. I can work with my
team to come up with a list of codes that would cover screening
procedures, preventive services, and other services that we’d want
to include.”
“Thanks. I think we have a clear path forward now. We will get
back to you soon with results from our three questions of interest.
We will break patients up into the three age groups you proposed,
measure health status by counting the number of preexisting
conditions and diagnosis codes—noting the limitations of this
definition—and measure health care utilization by counting proce-
dure codes for relevant services.”
Testing Hypotheses
• Are men taller than women? Here the quantity of interest is the differ-
ence in height between men and women, and the reference value (null
hypothesis) is a difference in height of 0 inches.
• Is manager A better than manager B at hiring employees who stay with the company for at least 1 year? Here the quantity of interest is the difference between managers A and B in the proportion of hired employees who stay with the company for at least 1 year (i.e., "# hired employees staying > 1 year" / "# hired employees"). The reference value (null hypothesis) is a difference in proportions of 0.
• Is age associated with IQ? Here the quantity of interest is the associa-
tion of age with IQ, measured by a correlation coefficient. The reference
value (null hypothesis) is a correlation coefficient of 0 (no association).
“Is age associated with IQ?” the effect size may be reported as a
correlation coefficient (e.g., r = 0.3).
EFFECT SIZES The key question to ask about effect sizes is “How big
Interpreting Results
Figure 3.4 Using p-values and effect sizes to interpret the results of a hypothesis test.
The figure lays out four broad cases based on the size of the effect and the p-value, and knowing how to interpret these four cases can help you make sense
of the gray areas.
Consider a scenario where the effect size is large and the p-value
is high. This is what we saw in the coin example. After one flip
of heads, our effect estimate was 100 percent. This is a very large
effect size because a coin with a 100 percent probability of heads
is an extremely rigged coin. In addition, we found that the p-value
for this scenario is 0.5 because if the coin was fair, we’d have a
50 percent chance of observing one heads after one flip. Even
though the effect size is large, the p-value is high enough that
we have low confidence in the effect estimate. The coin could be
extremely rigged, but it could just be that we don’t have enough
data. In other words, the true effect is uncertain.30
Now consider a scenario where the effect estimate is large
and the p-value is small. Imagine after ten flips, we get ten heads
in a row. In this case, the effect size is very large at 100 per-
cent, and the p-value is small at 0.001. This means we have an
extremely rigged coin, and we are highly confident that the coin is
truly rigged.31
What if we had a small effect estimate and a high p-value? Let’s
say that after two flips, we got one heads and one tails. Our effect
estimate is 50 percent, and the p-value is 0.75 because if the coin
was fair, we’d have a 75 percent chance of getting at least one heads
every two flips. This is similar to the first scenario, where we sim-
ply don’t have enough evidence to say whether the effect estimate
of 50 percent is accurate or not. It could be, or it could be that we
don’t have enough data yet.32
Lastly, what if the effect estimate is small and the p-value is also
small? Imagine we flip a coin 10,000 times and we get 5,200 heads
and 4,800 tails. If we had a fair coin, we’d expect the ratio to be
closer to 5,000 to 5,000, so we can be pretty sure that this coin is
not completely fair (the p-value in this scenario is about 0.00003).
But the effect estimate is relatively small because we were saying
that instead of a 50 percent chance of heads, this coin gives you a
52 percent chance of heads. This is not a substantial difference, so
we would call this a small effect size.33
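For readers who want to reproduce these numbers, a one-sided binomial test yields approximately the p-values quoted above. The sketch below uses SciPy's binomtest.

    from scipy.stats import binomtest

    # Null hypothesis: the coin is fair (probability of heads = 0.5).
    scenarios = [
        (1, 1),           # 1 head in 1 flip
        (10, 10),         # 10 heads in 10 flips
        (5_200, 10_000),  # 5,200 heads in 10,000 flips
    ]

    for heads, flips in scenarios:
        result = binomtest(heads, flips, p=0.5, alternative="greater")
        print(f"{heads}/{flips}: effect estimate = {heads / flips:.0%}, "
              f"p-value = {result.pvalue:.5f}")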
Some statistical hypothesis tests require that the data meet cer-
tain assumptions. For example, some statistical tests assume that the
data follows a bell curve or normal distribution, with approximately
equal numbers above and below the mean.34 Others may assume
that the two groups being compared have equal standard deviations.
Note that not all data distributions may meet these assumptions.
Remember the distribution of wealth where we had many people
around the median and one person at the extreme end who earned
a lot more? In scenarios like these, it’s important to make sure that
the data meets the assumptions of the tests being used.
Armed with this knowledge, we are now ready to interpret the
findings of the data science team. Kamala says, “0.003 is a fairly low
p-value, and the average difference of twenty-two claims is substan-
tial, so we can be fairly confident that older people do indeed tend
to file more claims than younger people. Now the question is ‘Why
are older people filing more claims than younger people, and what
factors make older people more likely to need more claims?'"
These are questions that can be answered using statistical mod-
eling. In the next section, we will see how modeling can help us
identify the drivers of certain phenomena.
Figure 3.5 Diagram of the sum of squared residuals for the line of best fit. The line represents the line of best fit. Each square represents the squared residual of a given point. The side length of a square is the difference between the point and the line of best fit (residual).
PUTTING IT TOGETHER
A few weeks later Kamala and the data science team reconvened to
discuss initial results from the model. David asked Maya to drive the
conversation, promising that he would jump in when appropriate.
Maya opened with the main finding. “Kamala, thanks again for your
list of proposed variables. We’re already seeing some strong signals
that we want to get your thoughts on. Out of all the variables that
you suggested including in the models, the biggest contributor to
the number of claims filed per year is the presence of chronic dis-
eases. We also found that the presence of chronic diseases affects
claims differently in younger patients versus older patients. In the
model for younger patients, the coefficient for this variable was
4.34, which means that—all else constant—patients with chronic
diseases file about four more claims per year than patients without
chronic diseases. In older patients, the coefficient was 15.17.”
Kamala tried to make sense of these results. “The first result is
that patients with chronic diseases file more claims—this doesn’t
surprise me. But the second result is that chronic diseases have
less of an effect in younger patients compared to older patients.
The natural follow-up question is ‘Why do older patients’ chronic
KEY TOOLS
Important Questions:
• Are there any outliers in the data, and if so, how might they affect our
results?
• What assumptions does this hypothesis test make about the data? Are
these assumptions valid?
• Is the data normally distributed? If not, how does this affect our analysis
and interpretation?
• What other ways could we have set up the hypothesis test to answer the
same question?
• Do we have adequate sample size to use this hypothesis test?
• What effect size represents a meaningful effect?
• Did we conduct a power calculation or a sample size calculation before
collecting data?
• Are we seeing a significant p-value because we have a large sample size
(overpowered) or because there is truly a strong effect?
• Are we seeing a nonsignificant p-value because we have a small sample
size (underpowered) or because there is truly no effect?
• Is this a univariable analysis or a multivariable analysis?
• Could there be important confounders at play?
• Is this the right type of outcome variable?
• Did we use an assumption in our statistical test that was not appropriate,
such as assuming linearity (Pearson correlation) when we could have
used a more robust test (Spearman correlation)?
• Is there a different way we could have done this analysis? For example,
could we have used a different model type or different outcome variable
specification?
• Does this model make sense for this data?
• How was missing data dealt with? Does imputation/exclusion make sense?
Common Mistakes:
4
MAKING DECISIONS WITH DATA
I ordered both. The following day I got sick, but my two friends
didn’t. In the cloning example, when only clone 3 got sick, we could
conclude that it was the combination of nachos and enchiladas that
caused the illness. In the absence of cloning, we can’t conclude that
it was the combination that caused the illness because it’s possible
that my two friends have traits that make them less susceptible to
GI issues than me—it’s not a fair comparison. Just because they
didn’t get sick after ordering only nachos or only enchiladas doesn’t
mean that I wouldn’t have gotten sick! They may have a stronger
stomach than me, or they may have eaten at that restaurant before,
or they may not have eaten as much as I did, etc. Without account-
ing for all these possible baseline differences in susceptibility to GI
issues, it’s difficult to draw causal conclusions.”4
“Ok, I get that cloning is great, but how can we establish causal-
ity in a world where cloning is not possible?” Kamala asked.
David smiled. “Randomization.”
David erased his drawings of stick people eating Mexican food and
started writing. “BioFarm, the manufacturer of ClaroMax, wants to
know whether its drug is effective at increasing survival of elderly
male patients with prostate cancer. Let’s consider how the company
could prove a causal relationship between ClaroMax and longer sur-
vival. In an imaginary world, BioFarm could use the cloning method.
First, it would recruit a representative sample of elderly men with
prostate cancer, and then it would clone them, thereby creating two
identical groups of patients. BioFarm would give one group ClaroMax
and the other group a placebo (a fake drug that has no therapeutic
effect) and see which of the two groups survived longer. Because the
two cloned groups are identical in every imaginable way, any differ-
ence in survival between the groups can be attributed to the drug.”5
Kamala was following. “That makes sense, just like the restaurant
example. They’re cloning so that any differences in survival are attrib-
utable to the drug and not to any differences in the participants.”
the ClaroMax group had longer survival than the placebo group, is
it appropriate to conclude that ClaroMax was responsible for the
increased survival?”
Kamala could answer this one. “No. Because the two groups
were not comparable in terms of age, it could be that the baseline
difference in age is what led to the difference in survival.”
“That’s right!” David was proud of his student. “We know that
age is an important confounder here; obviously, younger people
tend to survive longer than older people. Other important con-
founders for this study may include comorbidities, baseline severity
of cancer, and genetic mutations associated with survival. As long
as critical confounders are accounted for, it’s not important for the
intervention groups to be similar across other attributes.”7
Kamala interrupted. “Just to make sure I’m following, examples
of variables that are not confounders in this example could be
things like the length of their toenails, the number of hairs on their
head, whether their favorite color is blue, whether they prefer waf-
fles or pancakes, etc. Is that right? Even if 100 percent of the Cla-
roMax group preferred the color blue and 0 percent of the placebo
group preferred the color blue, it is not expected to interfere with
our analysis of causality because we have no reason to believe that
liking the color blue is associated with longer survival—it’s not a
confounder.”
“You’re exactly right,” said David. “All those things you listed
are not confounders! So to recap, to determine whether or not Cla-
roMax is effective at improving survival, BioFarm needs to recruit
two groups of elderly men who are nearly identical across all impor-
tant confounders (age, comorbidities, disease severity, etc.). This
requirement presents two issues. First, this sounds like a nightmare
to coordinate logistically; for every 70-year-old man with stage III
cancer and diabetes who gets ClaroMax, BioFarm will need to find
someone with very similar traits to put in the placebo group. Sec-
ond, and more importantly, BioFarm may not be able to measure all
the important confounders. Maybe there’s a genetic mutation that is
strongly associated with shorter life expectancy in cancer patients,
but there is no accessible genetic test available to determine who
A/B TESTING
other clinics would give all patients a placebo. While this approach
may meet organizational constraints, it introduces other complica-
tions: namely, BioFarm will need to ensure that roughly equal num-
bers of patients are allocated to the ClaroMax and placebo groups.
“With a randomization strategy in place, the next important
consideration is data collection. How will you ensure that data on
the intervention and the outcome is being captured accurately and
in a timely manner? Again, we have it easy in the online advertising
world. Our computers do all the data collection for us automati-
cally. We know exactly which ads had more people clicking through
at any moment in time. I imagine data collection would be more
difficult for a clinical trial,” Kyra said.
Kamala nodded. “That’s right. From what I understand, collecting
data from clinical trials can be extremely time consuming. BioFarm
would need to work with each clinic to ensure that it has a data
collection mechanism in place. Nurses would record the date and
time of each administration of the drug, and other coordinators
would be in charge of following up with patients to collect data on
drug side effects. Finally, the site coordinators would be in charge
of tracking documentation of death events among the study partici-
pants. All data would be stored on a secure, HIPAA-compliant data
server. So many moving pieces to keep track of!”
While conducting a causal analysis can involve many nuances, randomization reduces the analytic burden substantially. If randomization is done appropriately and effectively, the data analysis amounts to a simple comparison between the two intervention groups.13 Once the study is completed, BioFarm can simply compare the survival times for the ClaroMax group and the placebo group using a statistical hypothesis test. The resulting effect-size estimate and p-value can be used to draw causal inferences about the effect of ClaroMax on patient survival. Let's say, for example, that the ClaroMax group ended up with 4 percent longer survival than the placebo group, with a p-value of 0.23. The effect estimate is an increase in survival time of about 4 percent, and the p-value indicates that this increase is not statistically significant. A 4 percent increase in this population amounts to a median increase in survival that
all the patients in our EMR database who have the genetic muta-
tion of interest. We can then apply matching to find patients who
do not have the mutation of interest but who are similar to the
mutated patients across all relevant traits (possible confounders).
Suppose we decide that we want to match on patient sex, race, and
comorbidities. Let’s say we have a 64-year-old Asian male who is
suffering from diabetes and has the mutation of interest. Using a
matching algorithm, we can find a 64-year-old Asian male suffering
from diabetes who does not have the mutation of interest. In other
words, matching is trying to find ‘clones’ for the patients in the
intervention group so that the patients in the control group are as
similar as possible.”
Naturally, the more characteristics that you want to match for,
the harder it will be to find a match. Similarly, searching that
requires exact matches is more restrictive than searching that
allows some flexibility in the matching.19 Flexibility involves, for
example, allowing a 64-year-old to be matched with a 63-year-old
or a 65-year-old. Different matching algorithms exist that allow the
user to specify the number of matching variables and the specificity
of the matching. One important assumption inherent in matching is that the variables selected for matching capture all of the relevant confounders.20 It's important to know that matching cannot account for unobserved or unmeasurable confounders. Obviously, if there is a confounder at play that cannot be measured, it's impossible to match on it.
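To make the idea concrete, here is a minimal sketch of a matching step in Python with pandas. The patient table, column names, and the one-year age tolerance are all hypothetical; a real analysis would typically use a dedicated matching package and many more variables.

```python
import pandas as pd

# Hypothetical patient table flagging the mutation of interest
patients = pd.DataFrame({
    "patient_id":   [1, 2, 3, 4, 5, 6],
    "has_mutation": [1, 0, 1, 0, 0, 0],
    "sex":          ["M", "M", "F", "F", "M", "F"],
    "age":          [64, 64, 58, 59, 70, 58],
    "diabetes":     [1, 1, 0, 0, 1, 0],
})

treated = patients[patients["has_mutation"] == 1]
controls = patients[patients["has_mutation"] == 0]

# Exact matching on sex and diabetes status; age is allowed to differ by one year
pairs = treated.merge(controls, on=["sex", "diabetes"], suffixes=("_treated", "_control"))
pairs = pairs[(pairs["age_treated"] - pairs["age_control"]).abs() <= 1]
print(pairs[["patient_id_treated", "patient_id_control"]])
```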
SPOTTING BIASES
Selection bias: Bias that arises from the nonrandom selection of participants or samples.
Measurement error: Bias that arises from inaccurate or imprecise measurements or data collection methods.
Nonresponse bias: Bias that arises when people who respond to a survey are substantially different from those who do not respond.
Reporting bias (publication bias): Bias that arises from the selective reporting of results or findings that are statistically significant or interesting.
Conditional probability fallacy: Mistaking the probability of A given B for the probability of B given A (e.g., not all rectangles are squares, but all squares are rectangles).
Improved detection: Bias that arises from improved diagnostic techniques, leading to an apparent increase in the frequency of a condition or disease.
Absolute vs. percent change: The change in a variable may appear exaggerated when only the absolute change or only the percent change is considered in isolation.
p-hacking: Bias that arises from the selective reporting or analysis of data to obtain statistically significant results.
Figure 4.1 Summary of biases, fallacies, and other pitfalls in analyzing or interpret-
ing data
We’ll start by talking about bias. We’ve all encountered the con-
cept of bias, whether it’s preferential treatment of students by a
teacher, discrimination in the workplace, or misrepresentation of
reality. While these are examples of bias in interpersonal interactions, bias can also make its way into data and statistics.
Here we will talk about biases that show up in data collection, data
analysis, and data presentation (figure 4.1). We’ll see how these
biases can lead to data that is misleading and how being aware of
these biases can help you spot fake statistics in real life.
Selection Bias
Kamala clears her schedule for the day as she dives into the results
of the ClaroMax study. She needs to spend some time reading the
original paper and trying to make sense of the study methods. Her
goal is to understand exactly how the researchers arrived at their
Measurement Error
Bias can also arise when collecting data. Imagine a study that’s try-
ing to measure the average weight of patients at a clinic. Before
Nonresponse Bias
Consider a hospital that wants to know how its patients felt about
the care they received. The hospital CEO looks at the results of a
patient satisfaction survey that the staff had rolled out the previous
month. The CEO notices that there are a lot of five-star reviews
and also several one-star reviews—but very few ratings in between
one and five stars. This is an example of what we’d call nonre-
sponse bias.23 Nonresponse bias occurs when data is captured only
from a specific subset of the population. In this case, the types of
people who are most likely to respond are those who loved the
experience (and gave five-star reviews) and those who hated the
experience (and gave one-star reviews).
Reporting Bias
p-Hacking
There are several biases that can arise during data analysis. One of
these is called data dredging bias, or p-hacking.28 p-hacking occurs
when the analyst fishes for a significant finding by exploring many
different statistical associations, even if they are not meaningful.
Imagine a research group that wants to find genetic mutations that
are associated with cancer. The group has at its disposal a data set
with over 10,000 different mutations. The analyst computes the
association between each mutation and a cancer diagnosis until
they find some significant relationships. The issue is that significance tests based on p-values can produce false positives purely by chance. If you look at enough associations, you're bound to find some that yield a significant p-value (out of 10,000 associations, we'd expect about 500 with a p-value below 0.05 by chance alone).
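A quick simulation makes the point. The sketch below, in Python with entirely synthetic data, tests 10,000 random "mutations" that have no real relationship to the outcome and still finds roughly 500 of them "significant."

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
outcome = rng.normal(size=200)               # a random "cancer risk" score with no real signal
mutations = rng.normal(size=(200, 10_000))   # 10,000 random "mutation" variables

pvalues = np.array([pearsonr(mutations[:, j], outcome)[1] for j in range(10_000)])
print((pvalues < 0.05).sum())                # roughly 500 "significant" findings by chance alone
```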
The best way to avoid p-hacking is to prespecify analysis plans.
Before starting to explore the data, clearly define which variables will
be studied and which will not. Prespecifying analysis plans avoids
situations where the analyst can modify the analysis plan post hoc
if they don’t see the results that they desire.
has had a very high fever, and I know that everyone who has Ebola
gets a really high fever!” Hopefully you can spot the flaw in the
mother’s reasoning. Just because everyone who has Ebola has a
high fever doesn’t mean that everyone who has a high fever has
Ebola. This is an example of confusion of the inverse.30 While this
is a very obvious example, there are more insidious examples of
confusion of the inverse that are less easy to spot. Implicit bias and
racial profiling often stem from confusion of the inverse.
In the wake of 9/11, Muslim Americans faced unprecedented
levels of discrimination, partly because the news media portrayed
terrorists in a way that emphasized their Muslim identities. In other
words, the media portrayed nearly all terrorists as being Muslim. As
Americans consumed this narrative, confusion of the inverse took
hold, and some folks began to implicitly or explicitly associate all
Muslims with terrorism. Confusion of the inverse can also occur
when interpreting the results of a diagnostic test. Consider a test
for COVID-19 that is designed such that everyone who actually
has COVID-19 tests positive. If you take this test and test positive,
what is the chance that you actually have COVID-19? If you said
100 percent, you fell into the trap of confusion of the inverse. Just
because everyone who actually has COVID-19 gets a positive test
result doesn’t mean that everyone who gets a positive test result
actually has COVID-19.
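To see why the answer is far from 100 percent, here is a small worked sketch in Python. The prevalence and false positive rate below are made-up numbers chosen only for illustration.

```python
# Hypothetical numbers: 2 percent of test takers actually have COVID-19,
# everyone with COVID-19 tests positive (100 percent sensitivity),
# and 5 percent of people without COVID-19 also test positive.
prevalence = 0.02
sensitivity = 1.00
false_positive_rate = 0.05

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_covid_given_positive = (sensitivity * prevalence) / p_positive
print(p_covid_given_positive)   # about 0.29, not 1.00
```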
was, the report said that the new ad had increased enrollment from
10 people per 10,000 to 13 people per 10,000. It’s true that this was a
30 percent increase, but after looking at the absolute change, it seems
much less impressive.31 The reverse can also be true. If the market-
ing team’s spending on advertising went up by $1 million in 1 year,
would you say that’s a large increase or a small increase? There’s no
way to know. In this case, knowing only the absolute change makes it
difficult to contextualize. We need to know what the total advertising
budget was in order to decide whether this change is large or small.
For instance, if we found out that $1 million corresponds to 2 per-
cent of the budget, we may not feel that this is a big change after all.
Improved Detection
QUALITY OF EVIDENCE
Now that we’ve discussed biases that arise when collecting data,
analyzing data, and interpreting results, let’s put these concepts
together to assess the quality of a body of evidence. A body of
evidence may consist of one or more quantitative analyses or
that we have access to. In Kamala’s case, she has access to only a
single study. While this study is indeed a randomized experiment,
it wasn’t free of biases, and we know that a single study isn’t as
compelling as several studies.
For questions that cannot be answered through randomized
experiments, observational studies are the only option. However,
several observational studies can be combined and analyzed as a
whole through a technique called meta-analysis.35 Meta-analysis
uses quantitative methods to synthesize the results from many
different studies and come up with an overall estimate. Meta-anal-
ysis can also be conducted on randomized experiments. System-
atic reviews are similar to meta-analysis, but they typically do not
include the quantitative synthesis of results.36 Meta-analysis and
systematic review are the best ways to analyze the results across
multiple different studies. A group in the UK called Cochrane
has put together guidelines for performing meta-analysis and sys-
tematic review such that they are rigorous, transparent, and free
from bias.37
With the number of scientific studies that are published on a
daily basis, it’s possible to find a scientific publication supporting
almost any viewpoint. We see a similar phenomenon with experts who spout advice to the general public. It's possible to find scientists who don't believe in climate change and doctors who don't
believe in vaccine efficacy, but it’s more important to understand
the overall consensus across all experts, not just the opinions of a
single expert. Similarly, it’s important to consider not just a single
study but rather the overall synthesis of evidence across all avail-
able studies. This will allow you to get a holistic understanding
of the phenomenon in question and will prevent you from being
biased by the results of a single study.
CONCLUSION
◦ How reproducible are the results, and are the methods transparent and well-documented?
◦ Does the evidence align with other sources of information or contradict them?
◦ Are there any conflicts of interest that could affect the interpretation of the evidence?
• Is the evidence we're using to make decisions high-quality?
5
CLUSTERING, SEGMENTING, AND
CUTTING THROUGH THE NOISE
fraud, a fact that has been consistent for years. The company is
very concerned about reducing these losses, as they affect not only
Shu Financial’s bottom line but also its reputation. The last thing
Shu Financial wants is a front-page news story about a fraudulent
application that its employees accepted.
The company already has one supervised machine learning
model that predicts the probability that an application is fraudu-
lent. This model is a supervised machine learning model because it
is trained on a target variable, the indicator of whether or not the
application is fraudulent.
A key question that keeps arising is whether there are differ-
ent types of application fraud or whether application fraud is one
homogeneous group. Perhaps different criminals who commit appli-
cation fraud have different ways of getting the false information
or target different victims or use the credit cards in different ways
once their application is approved.
If there is only one type of application fraud, then it may be rea-
sonable to stick with a single model. On the other hand, if there
are a number of different types of application fraud, then it might
be advantageous to identify these different types and target them
with different predictive models. Steve was brimming with ideas
but uncertain as to what steps to take first, so he arranged a meet-
ing with the data science team about his challenge.
DIMENSIONALITY REDUCTION
After hearing the problem description, the lead data scientist, Brett,
nodded his head. “Sounds like a classic unsupervised machine learn-
ing problem. It is unsupervised in that we don’t have a target to
predict like the charge-off amount or the probability of fraud. The
first thing we would like to do is find a good way to visualize the
customer data. This will help us see if there are obvious groupings
of fraudulent applications. Again, we are not trying to predict the
probability of fraud but rather trying to see if there are patterns that
can be detected among the frauds. Unsupervised machine learning
not only is useful for finding subpopulations but also is a very good
tool for conducting exploratory data analysis in general, where we
can find patterns in the data that we currently aren’t aware of or
don’t use.”1
Steve understood the goal, but the actual steps seemed unclear.
“There are so many variables for each application: variables on the
application itself, comparisons to other applications in that zip
code, information from the credit report, the transactions on the
account, and the online and telephone activity. How can you possibly graph all of those variables and make any sense of them? Are you going to make hundreds of scatterplots?”
Brett explained, “Unsupervised machine learning will quickly sur-
pass anything you would do on an Excel spreadsheet. You are right;
there are huge numbers of possible variables, and we will need to
be clever about doing the data exploration. Luckily, there is a great
method called principal components analysis that can help us make
sense of situations where there are a lot of features.2 We’ll do that
as the first step.”
Principal components analysis is called a dimensionality-reduction
method, since it tries to reduce the number of dimensions that a
user would consider in the analysis. Here, “dimension” refers to a variable, or a column in a data set; dimensionality reduction is the process of reducing the number of columns in a data set while retaining the important signals. Essentially, PCA maps the original data set into a new data set that contains the same information but views it through a different, transformed lens. The goal is to reduce the number of dimensions as much as possible while discarding as little information as possible.3
Clearly, this is a balancing act where the customer needs to under-
stand what is being gained and what is being lost.
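As a rough illustration of what a data science team might run, here is a minimal PCA sketch in Python using scikit-learn. The random matrix stands in for the real application data, and keeping three components is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 40))                  # stand-in for 500 applications, 40 variables

X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to the scale of each variable
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)         # 40 columns collapsed into 3 components

print(pca.explained_variance_ratio_)            # how much information each component retains
```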
It is important to note that principal components analysis is
not the only algorithm for dimensionality reduction, but it is prob-
ably the most popular. Its popularity stems from the fact that it
is extremely effective, is fairly intuitive, can be performed quickly,
and has been around for over one hundred years.4 There is a wealth
of multidimensional scaling algorithms that can be used instead of
CLUSTERING METHODS
Figure 5.1 Clustering algorithms, divided into partitional and hierarchical types
could group the cases in a way that brings together similar frauds
and separates out dissimilar frauds.
There is a wealth of different algorithms used for clustering,
including Gaussian mixed models, expectation-maximization mod-
els, latent Dirichlet allocation methods, and fuzzy clustering algo-
rithms. As a customer, however, you don’t need to try to learn the
complete taxonomy of algorithms. Rather, it is useful to have a clear
understanding of some of the most popular methods (figure 5.1)
and then ask the data science team members to explain which meth-
ods they used and why.
What are the main types of clustering algorithms? One way to
think about clustering algorithms is in terms of whether the cus-
tomers are assigned to only one group or whether they can belong
to multiple clusters. Exclusive clusters are simple to understand—
customers belong to one and only one cluster. Other algorithms
allow customers to be shared across multiple clusters, and at the
most extreme are fuzzy clusters, where the customer has a cer-
tain probability of belonging to every cluster. From Steve’s point
of view, it is worthwhile to ask the data science team whether the
clustering is exclusive or not. If not, then there should be a reason
why the team chose a more complicated way of clustering.
Another way to think about clustering algorithms is in terms
of whether they are hierarchical. This refers to the idea that the
elements. Alternatively, you could use the distance between the far-
thest elements in the clusters as the criterion for merging. How you
choose to measure distance and link clusters has major ramifica-
tions for the clusters developed, so you should ask the data science
team to explain which criteria were used and why.
Divisive clustering works top-down and is the opposite of agglom-
erative clustering.23 It starts with all of the customers belonging to
one single cluster. These customers are split into separate clusters
and are then split again into new clusters. This process continues
until each cluster is made up of one and only one customer. Because
it is a hierarchical method, each cluster fits inside another at a different level, so the team can quickly see how cluster sizes and characteristics change when moving from four clusters to six or to ten.
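For the curious, here is a minimal sketch of agglomerative hierarchical clustering in Python using scipy. The data is synthetic, and the choice of "complete" linkage (merging based on the farthest elements, as described above) and six clusters is arbitrary, mirroring the kinds of decisions the text describes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))                    # stand-in for 200 fraud cases, 5 features

Z = linkage(X, method="complete")                # merge clusters based on their farthest elements
labels = fcluster(Z, t=6, criterion="maxclust")  # cut the hierarchy into six clusters
print(np.bincount(labels))                       # size of each resulting cluster
```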
Steve’s head was swirling a bit, so he called Brett. “As a customer,
do I really care whether the algorithm was divisive or agglomera-
tive?” Brett immediately replied, “No, the more important ques-
tions for you are what the resulting clusters represent and how they
are used. That said, if your data science team members can’t tell
you what algorithm they used and why they selected it, then this
should raise a lot of red flags about their methods and capabilities.”
Hierarchical clustering models are popular because they are
intuitive, they perform well, and the results can be easily communi-
cated. That said, they do have some downsides. Hierarchical mod-
els struggle with large data sets because the processing time scales
with the square or more of the number of observations. As with
other clustering methods, there are many different ways to measure
the clustering performance, so you can’t easily say that one solution
is the best clustering answer.24
In some situations, there is some partial labeling of data. For
example, there are different types of fraud, including fraudulent
applications, account takeovers (where the legitimate account holder
applied for the card but then a fraudster took control of the account),
card thefts, illegal use of lost cards, and skimming (where a second card
is made using the data from the first card). If the members of Steve’s
team have already tagged some of the cases they have investigated,
then the data science team can look to see if the clusters correspond
to known fraud types. This helps interpret the clusters and also could
be used as a way to test which clustering techniques are more accu-
rate at distinguishing different types of fraud. Since there is a target
involved in this type of analysis, there is a large family of supervised
machine learning algorithms that can also be used to address this
question. These will be discussed in later chapters.
USING CLUSTERS
Once a cluster analysis has been completed, it can be used not only
to interpret the past but also to help understand new situations.
Let’s see how things worked out for Steve in his fraudulent applica-
tion example.
Brett began the project debrief. “Here’s the rundown of the clus-
tering project. We started by doing a principal components analysis,
which showed us that there was definitely going to be some separa-
tion between the different fraudulent applications, though it was
unclear exactly how many clusters to use. We selected the top three principal components, which accounted for 90 percent of the variance, and then began our cluster analysis. The components were all standardized; some initial exploration showed us that range standardization worked best. We used K-means clustering to test between two and ten clusters. The final number
of clusters was set at six, though we could have made pretty good
arguments for any number between four and seven. Once we had
our final number of clusters, we reviewed the average input values
of each cluster and named them.”
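Here is a minimal sketch of that K-means step in Python with scikit-learn, using synthetic stand-in data. The silhouette score shown is only one of many ways a team might compare different numbers of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X_reduced = rng.normal(size=(300, 3))   # stand-in for the top three principal components

for k in range(2, 11):                  # try between two and ten clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    print(k, round(silhouette_score(X_reduced, km.labels_), 3))
```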
Steve nodded his head. “Yes, I remember all of these steps. Of
course, my manager is going to want to know what value we got out
of doing this clustering. She is very focused on making sure I can
answer the question ‘So how does this translate into Shu Financial
stopping fraud losses?’”
Brett smiled. “Yes, many times she has interrupted me with that
sharp question. But we are prepared to answer it. There are a few
ways we used this cluster analysis to help the company. The first
is that we used to have only one model to predict the probabil-
ity of application fraud. The cluster analysis helped us to develop
other behavioral-based predictive models that target some of the
specific features of the different clusters. These separate models
are far more accurate at predicting fraud than our single model was,
so our fraud detection rate is now much higher, meaning we can
prevent more fraud losses and lower our operating expenses, since
investigators’ time is better targeted.”
Steve interrupted. “So the unsupervised machine learning actu-
ally helped us build new supervised machine learning models to
help detect fraudsters more quickly.”
Brett continued. “Yes, in fact this happens often. Unsupervised
machine learning and supervised machine learning often are part-
ners in solving real-world problems. In addition to the new mod-
els, we use the cluster analysis directly by scoring each application
and assigning it to a cluster. We send the information about which
cluster the application belongs to over to the investigators along
with some of the application’s distinguishing features. These fea-
tures serve as reason codes that inform the investigators as to what
makes an application unusual and worthy of attention. This cluster and reason code information helps the investigative unit, whose employees have divided themselves into different teams, focus more efficiently on different clusters.”
“This means that our cluster analysis helped us not only gain
insight into our data but also create new models and improve the
operational efficiency of our units.”
KEY QUESTIONS
Clustering Concepts:
• If I did determine that there were clusters, can I use that information
operationally?
• Is there a limit to how many clusters I can use in my operation so as to
improve performance?
• What factors do we normally use to segment populations, and how do
these compare with the factors identified in the modeling?
Dimensionality Reduction:
• Which method did you use for dimensionality reduction, and why did
you choose that method?
• How much variance is explained by the different components? Can we
reduce our focus to just a few components, or do we lose too much
information?
• How do we interpret the components? What do we think the first, sec-
ond, and third components represent?
Clustering Algorithm:
• Is the clustering exclusive? If not, then why did you decide to use that
algorithm?
• Which clustering algorithm was used?
• What metric was used for the distance and the linkage between clusters?
Why were those criteria used?
6
BUILDING YOUR FIRST MODEL
it, and patients who don’t need surgery who are getting it! It’s clearly
leading to higher costs, and we’ve seen that often the patients who
get surgery end up having complications or continued back pain.2
Physicians advocate hard for surgery when they submit their autho-
rizations, but it’s sometimes tough for them to see what a patient’s
life may look like after an unsuccessful surgery.”
Annie nodded. “I think this is something that David and the data
science team can help us out with. Instead of having humans manu-
ally review which cases to approve and which to deny, let’s consider
adopting a strategy where we empower our human decision-makers
with data.”
Kamala set up a meeting with the data science team to hash out
a plan. She explained the context of rising costs and poor decision-making: “We're seeing that our reviewers are finding it difficult
to determine whether a patient would benefit from a nonsurgical
treatment option for their back pain. Some combination of physi-
cal therapy and pain medication might be better for many patients.
The biggest issue is that our reviewers have trouble deciding if a
patient would do well on nonsurgical treatment, so they end up
approving the surgery.”
SCOPING
David, the senior data scientist on the team, had been listening care-
fully. “It sounds like this is a difficult decision for the reviewers. Do
you see this as an issue related to reviewer training, or would this
be a difficult decision even for the most highly skilled reviewer?”
David wanted to make sure that before building a data-driven solu-
tion, they were exploring all possible alternatives.
Kamala responded. “Even the most highly skilled reviewer would
have trouble with this because what we’re asking them to do is pre-
dict the future. We’re asking them to consider whether a specific
patient would do well with nonsurgical management of their back
pain and, if so, to deny the surgery. Right now we do see cases
where the patient’s request for surgery was denied and they ended
TO EXPLAIN OR PREDICT?
Kamala smiled. “David, this isn’t the first time you’ve built a
model for us. Remember the model your team built to help us under-
stand which patients had higher costs? You’re treating this like it’s a
totally different exercise, but haven’t we done this before?”
David understood Kamala’s perspective. “You’re right. We have
built models before, but not all models are equal! The key differ-
ence lies in our reasons for building the models. In the model we
built to quantify patient costs, we were interested in finding spe-
cific variables that led to higher cost. We wanted to explain which
phenomena led to higher costs so your team could act on them. To
put it simply, we wanted to explain why certain patients had higher
costs. Now we don’t necessarily care about explaining why some-
one would be a good candidate for nonsurgical management. We
want to predict as accurately as possible who would benefit from
a nonsurgical approach. It’s a subtle distinction, but depending
on whether our goal is explanation or prediction, our approach to
building the model will be substantially different.”4
DEFINING OUTCOMES
and text data from the PDFs—basically turning scanned PDFs into
machine-readable format.14 Keep in mind that extracting features
from images and text data can be time intensive but is continuing to
get easier with the release of new software and increased data avail-
ability. We should make sure that we’ve exhausted all our existing
tabular data before we spend too much time fiddling with scanned
PDF images.”
Every data science team has its strengths and weaknesses. Star-
dust Health Insurance’s data science team does not deal with image
data on a regular basis, so building out the infrastructure to ingest
and handle scanned PDFs would be a substantial lift. It’s important
to define what is in scope versus out of scope for feature engineer-
ing. Otherwise, you run the risk of spending too much time creating
features, which can be a long and expensive process, especially if
new data infrastructure needs to be built.
“We have access to a lot of claims data that is stored as data
tables,” David continued. “I mention this because the fact that something is already in tabular format doesn't mean that we're done. It's often possible to extract even more features from data
that’s already in tabular format.”
Kamala looked intrigued. “What would be an example of that?”
“Well, for instance, we may know from our insurance claims
whenever a patient is hospitalized,” David suggested. “Using that
data, we can create a feature that counts the total number of times
a patient was hospitalized in the 1 year before the prior authoriza-
tion submission.”
“I see. So we’d essentially be aggregating the claims data that
we have in different ways to create patient-level metrics,” reasoned
Kamala.
Maya nodded. “Exactly. Aggregating is a great way to approach
feature engineering.15 We can do this with total number of hos-
pital visits, total number of medications prescribed, average num-
ber of claims per month, and much more. Another great approach
is to create rate-of-change features. For example, we can find the
percentage by which their prescription drug use increased in the
6 months before the prior authorization submission. These are all
features that we can create using our existing tabular data and that may help us in predicting our outcome.”
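Here is a minimal sketch, in Python with pandas, of the kind of aggregation Maya describes. The claims table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical claims table: one row per claim
claims = pd.DataFrame({
    "patient_id":         [1, 1, 1, 2, 2],
    "claim_date":         pd.to_datetime(["2022-01-05", "2022-06-20", "2022-11-02",
                                          "2022-03-03", "2022-04-18"]),
    "is_hospitalization": [1, 0, 1, 0, 1],
})

# Aggregate to one row per patient: total claims and total hospitalizations
features = claims.groupby("patient_id").agg(
    total_claims=("claim_date", "count"),
    total_hospitalizations=("is_hospitalization", "sum"),
)
print(features)
```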
Kamala was starting to make sense of all the possibilities that
feature engineering afforded. “I see. So these aggregate measures and rate-of-change metrics can all be calculated using our existing
claims data. If we combine all these features with the features that
we create from transforming the text data and the PDFs, we could
have a lot of features for our model to use. Is there a specific num-
ber of features we need to have?” Kamala asked.
“There’s no set rule. What we’re trying to find are the most
highly predictive features. If we could wave a magic wand and find
a single feature that perfectly predicts health care utilization for
nonsurgical treatment, we could call it a day. But because we don’t
know a priori which features are good predictors, it’s in our best
interest to come up with as many candidate predictors as possible.”
David’s explanation made sense to those in the group, and now they
were more interested in the mechanics of the model.
“How will the model be able to sort through all of these? Will it
just use all of them at the same time?” Jenna wondered.
Maya responded, “Each model is a little different, but funda-
mentally the goal of a model is to identify statistical relationships
between variables. If we create five hundred features, the model is
going to look through each one to find the ones that have the stron-
gest relationship with our outcome of interest.”
Jenna jumped in. “Naive question, but is it possible to have too
many features, so that the model gets confused?”
“It’s a good question!” Maya always got excited when the non-
technical folks participated in technical discussions. “In general,
the more data points you have per feature, the better. The reason
is that the model needs some data points to learn whether or not
a feature is important. If the data points–to–features ratio is too
low, the model won’t be able to learn well which features are most
important. So if you start out with too many features, you’ll need
to use some method of feature selection to narrow down the num-
ber of features that ultimately get passed into the model. A rough
rule of thumb is to have at least ten data points per predictor, and
FEATURE SELECTION
Figure 6.1 Variable selection methods: filtering, wrappers, and embedded methods
and fit a trend line before! A trend line is actually a basic type of
prediction model—a linear prediction model.22 Once you fit a line
to a set of points, you can use the equation of the line to calculate or
‘predict’ the y variable given a value of the x variable. Remember the
equation of a line from algebra: Y = mX + b? Here m is the slope of
the line, and b is the y-intercept. These are the numbers that Excel
figures out for you automatically in order to plot the lines. Under
the hood, Excel is doing a bit of linear algebra to figure out the
optimal values of m and b in order to find the line of best fit. Once
you know m and b, you can plug in a value for X (input) and get a
predicted Y (outcome). Remember our discussion of model train-
ing and testing? ‘Training’ the model involves doing some math to
figure out the optimal values of m and b—all the various parameters
that together optimize the ‘fit’ of the model.”
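Here is a minimal sketch of that "training" step in Python with numpy. The five data points are made up, and real models would of course have many more predictors.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical outcome values

m, b = np.polyfit(x, y, deg=1)   # "training": find the slope m and intercept b of the best-fit line
y_new = m * 6.0 + b              # "prediction": plug a new input into Y = mX + b
print(m, b, y_new)
```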
“So, to put it simply,” Kamala said hesitantly, “the model uses
the training data to find the ‘line of best fit.’”
“Exactly. The only difference is that with a scatterplot, you have
one X variable and one Y variable. Usually, good prediction mod-
els have more than one predictor variable. If we had fifty predictor
variables in our model, for example, the model would be finding the
fifty-dimensional ‘line of best fit’ instead of the two-dimensional
line of best fit. It’s impossible to visualize what a line would look
like in that many dimensions, but as long as you can understand the
concept in two dimensions, you can understand it in fifty dimen-
sions,” David said reassuringly.
Kamala had made a lot of scatterplots with trend lines in her
day. She couldn’t totally wrap her head around more than three
dimensions, but she understood the concept of feeding a model
data and allowing it to find the values of parameters that optimize
the fit. “What happens if you ask the model to predict a data point
that is really different from what it has seen before? How will it
perform? Basically, I’m asking about a situation where you ask the
kid to identify pictures of cats and dogs and then all of a sudden you
show them a picture of a hippo. Won’t they be confused?”
David was impressed by Kamala’s questions. “You’re right. That
would be a scenario where the training data was not representative
of all the possible situations that may arise in real life. In a case like
this, it would be hard to predict how well the model would perform.
It wouldn’t be surprising if it performed poorly.”
Kamala was encouraged to hear that her question was well-
founded. “How can we protect against that issue? We want our
model to be as robust as possible.”
“We can do our best to ensure that our training data is as repre-
sentative as possible so that our model has the chance to learn the
patterns from a diverse array of cases,” David responded. “Another
option is to filter out data points that look dissimilar from our train-
ing data and avoid making predictions on them: basically, we’re say-
ing ‘Hey, this picture doesn’t look like a cat or a dog, so we’re not
even going to try to make a guess.’”
Jenna chimed in with her own idea. “Why not just include all the
data we have access to in the training set? Wouldn’t that be building
the most robust model possible?”
Maya responded. “In order to test the performance of the model,
we need to leave some data aside. The idea is that we train the model
and then we test it on some new data that the model has never seen
before. There’s a tradeoff in deciding how much data to use for train-
ing and how much to use for testing. The more data you use for train-
ing, the better the model will perform on the training set, but the
harder it is to quantify the performance of the model on an unseen
test set with high certainty. A rough rule of thumb is to use 20 to 40 percent of the available data for testing and the rest for training, but with huge data sets the amount reserved for testing could be 1 percent or less. If you want to be more precise, you can calculate
the amount of data you need in your test set based on the level of
certainty you want in evaluating the performance of the model.”
Depending on the size and variability of the data set, the amount
of data held out for testing can vary. In a huge data set, just 1 per-
cent of the total data may be sufficient to use as a representative
test set.23
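In practice, that holdout step is often a single line of code. Here is a minimal sketch with scikit-learn, using synthetic data and an arbitrary 30 percent test fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # stand-in feature matrix
y = rng.normal(size=1000)         # stand-in outcome

# Hold out 30 percent of the data for testing; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)
```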
“Going back to my question about showing the model new data,”
Kamala said, “how different can the new data be? Eventually, we
want to be able to use this model on our new patients—patients
that we don’t have data on now. Will the model be able to handle
these new patients?”
She smiled as she continued her train of thought. “During my
medical training, we thought by the end of medical school that we
had seen it all. Then I spent a summer volunteering in India and
was shocked to see how little of what I learned in medical school
was relevant. I literally encountered patients with conditions that I
had never seen before in the United States. Sure, you still had your
flu and your diabetes, but I had no idea how to deal with condi-
tions like leprosy and drug-resistant tuberculosis, and I needed to
go back to the books and get more training . . . you just don’t see
those types of things here in the United States anymore.”
“Building machine learning models sounds pretty similar to get-
ting trained as a doctor, Kamala!” David joked. “A case of the flu
in the United States is going to be pretty similar to a case of the
flu in India. But what threw you off were the diseases in India that
rarely show up in U.S. populations! Those were the conditions for
which you had to be retrained! To answer whether our model will
be able to handle new patients, we need to think about how the new
patients may differ from our current population.
“For example, if we’re bringing on a group of new employees
from a car dealership, I would think that our model would general-
ize well. I have no reason to suspect that those employees are dif-
ferent from our current patient network in a meaningful way with
regard to health care utilization for back pain. On the other hand,
if we’re bringing on members of a coal miners’ union, it may be a
different story. Coal miners may be more likely to experience back
pain than the overall population due to the nature of their work,
and because they are more likely to live in rural areas, coal min-
ers may have different rates of health care utilization compared
to other professions. Also, they may be dealing with other health
issues that could impact their use of health care services. For a case
like this, we may want to do some targeted testing to make sure
our model generalizes to that specific population, and we may even
need to retrain our model if we see that it’s performing poorly on
that subgroup.” David’s example landed with the group.
Kamala still had more questions for the team. “What happens if
we test our model and we find it’s not working well? Can we go back
and retrain the model?”
“I’m glad you asked this,” Maya said, looking like she had been
anticipating this question. “Let’s say we do the modeling using the
training data, and we test it out on the testing data, and we see
that the performance isn’t great. Then we go back into the training
data and tweak the model a little bit, make some changes to our
features. Finally, we try it on the test data again and find that it’s
working much better. The problem is that in machine learning, this
is considered cheating.”
David picked up where Maya left off. “It reminds me of my
son, who recently had a big exam at school. After several days of
studying, he finally took a practice test and got a C. He realized he
needed to study harder, so he studied for a few more days and then
retook the same practice test—this time he got an A. What my son
didn’t understand is that just because he did well on that practice
test doesn’t mean he’s going to do well on the real test. Since he
had already seen the practice test before, he knew exactly how to
change his studying to help him pass that practice test the second
time around. But if he were given a completely new test, I’m not
confident that he would get an A.”
Maya nodded. “It’s exactly the same issue here with model build-
ing. When we see our performance on the test set and we go back
and make modifications to the model training and apply it again to
the test set, we run the risk of overfitting the model to the testing
data.24 That is, we’ve optimized the model to perform really well on
our testing data, and in doing so, we run the risk of having designed
a model that doesn’t work well on other data sets.”
On cue, Kamala chimed in with the key question: “So how can
we continuously improve our model, evaluate its performance, and
continue making improvements without overfitting—or cheating?”
David was back up at the whiteboard. “One common approach
that people take is to split the data into three groups: a training set,
a validation set, and a testing set.25 After training your model on the
training set, you can see how it’s doing with the validation set. If the
performance on the validation set isn’t great, you can go back and
make adjustments and then check it again on the validation set—
and repeat this process as many times as you need. Then, once you
think the model is in good shape, you can try it out on the test set.”
Just as David was about to continue, Kamala chimed in, “I don’t
get it. Isn’t this the same as just having a training set and a test set
essentially? If you go back and forth between the training set and
the validation set, won’t you just end up overfitting to the valida-
tion set?”
David smiled. “You’re exactly right, and this is why more and more
folks are moving away from the validation set approach. Instead,
it’s recommended to do something called cross-validation.26 Here’s
how it works: The setup is to split your data into a training set and
a testing set. Next you split up your training data into equally sized
bins or folds. Usually, we take ten folds. Then you repeat the train-
ing and validation process once for each fold. For example, first, you
would take fold 1 and treat it as your validation set; then you would
take the other nine folds and treat them as your training set. You
train your model on the nine folds and test it on the first fold. Then
you store the predictions that your model made on the first fold for
later evaluation. Next you repeat the same process, treating fold 2
as your validation set. You train the model on the remaining nine
folds and test it on fold 2. Then you store these predictions for later
evaluation. At the end of the day, you will have generated predic-
tions for ten mini validation sets, each of which was predicted by
models trained on slightly different training sets. The reason cross-
validation works so well is that you’re not dependent on a single val-
idation set to test your model, so it’s less likely that you will overfit.”
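Here is a minimal sketch of ten-fold cross-validation in Python with scikit-learn. The data is synthetic, and linear regression is only a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))             # stand-in training features
y_train = X_train[:, 0] + rng.normal(size=500)   # stand-in outcome

# Each of the ten folds takes a turn as the validation set
folds = KFold(n_splits=10, shuffle=True, random_state=0)
preds = cross_val_predict(LinearRegression(), X_train, y_train, cv=folds)
print(np.abs(preds - y_train).mean())            # average error across all ten folds
```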
Jenna tried her best to think of an analogy to make sense of all
this. “So if having a single training and validation set is like study-
ing for an exam and then taking a practice test, cross validation is
like . . . ?”
Maya helped her out. “Cross-validation is like getting ten people
to study for a little bit and then answer a few questions on a prac-
tice test. This way they can determine whether their study approach
is good or not.”
MODEL TUNING
Figure: layout of a confusion matrix, comparing predicted positive and negative values against actual positive and negative values
NEXT STEPS
“I’m excited to put this model to the test,” said Kamala, smiling. “How
long will it take for your team to create the features, do the training
and cross-validation, make improvements, and test the model?”
“We’ll need at least a couple of weeks, but we’ll touch base with
feedback in the interim,” David said, nodding to Maya.
“Looking forward to it!”
KEY QUESTIONS
• Population:
◦ What population will we be applying the model to?
◦ Is this the same population that we're training the model on?
• Outcome variable:
◦ What are the limitations of choosing this outcome?
◦ In what scenarios may our outcome variable lead to incorrect or misleading results?
◦ How could our outcome variable be more meaningful?
◦ Is the outcome variable really what we want to measure, or is it a proxy?
• Feature selection:
◦ What features do we think will be the most predictive?
◦ What steps did the data science team take to perform feature selection, and how do the results align with our domain knowledge?
◦ Were any feature importance techniques, such as statistical tests or feature ranking algorithms, used for feature selection?
◦ Did you consider any feature engineering techniques, such as creating interaction terms or transforming variables, to improve the predictive power of the selected features?
• Training:
◦ What machine learning algorithms or models did you use for training? Why did you choose these particular models?
◦ Did you split the dataset into separate training and validation sets? How did you decide on the split ratio?
◦ How did you handle class imbalance, if present, during model training? Did you use any techniques such as oversampling, undersampling, or class weights?
◦ What strategies did you implement to avoid overfitting? Did you use regularization techniques, cross-validation, or early stopping?
◦ Can you describe the hyperparameter tuning process? How did you determine the optimal hyperparameter values for the chosen models?
• Model performance:
◦ What performance metric is most appropriate for our use case?
◦ What evaluation metrics did you use to assess the performance of the trained models?
◦ How did you perform model evaluation on the validation or test set? Did you employ techniques such as k-fold cross-validation or holdout validation?
◦ What was the performance of the model on different evaluation metrics? Can you provide a summary or comparison of the results?
◦ Did you consider any ensemble techniques, such as model averaging or stacking, to improve the overall model performance?
◦ How did you validate the generalizability of the model? Were there any concerns about overfitting to the specific dataset, and if so, how did you address them?
7
TOOLS FOR MACHINE LEARNING
EXAMINING RESIDUALS
“I’d like to welcome you all to the inaugural Stardust Health Insur-
ance data science team hackathon!” David was having fun as the
emcee of the event. “Our teams have been hard at work to optimize
our model to predict health care expenditure for patients with back
pain who receive nonsurgical treatment. Before we get started, let’s
recap the current state of our model.
“The outcome for this model was predicted back pain–related
health care expenditure in the 1 year following the request for
prior authorization. In terms of features, we had access to patients’
demographic data, including their place of employment and type
of employment, zip code, age, sex, race, ethnicity, and informa-
tion about dependents. We also had access to their medical claims,
which include almost every encounter they had with the health
care system that came through our insurance. We also had access
to their prior authorization report, which includes medical docu-
ments prepared by their physician. In terms of model architecture,
we used good old linear regression.
“After training our current model using ten-fold cross-valida-
tion, we ran it on a test data set that the model had never seen
before. On average, the absolute difference between the model’s
predicted health care expenditure and the true health care expen-
diture was $12,000 for the 1 year following the prior authorization
request. The actual differences ranged from $50 to $50,000. The
top features that linear regression selected were prior history of
back pain, prior medication usage, and age. Those on the prior
authorization team have been using this model for the past several
months with success: they are seeing reduced health care expen-
diture as a result of their prior authorization decisions, and initial
analyses show improvements in patient outcomes as well. Although the model predicts a full year of expenditures and has been in use for only a few months, these initial results are promising, and now we want to see how much further we can improve the performance of this model.
hazards associated with the job. When we added in these new fea-
tures, we saw that the residuals for the previously poor-performing
group dropped from $25,000 to $15,000. In other words, we were
able to improve the performance of our model for that group of
patients by about 40 percent. Overall, we were able to bring down
the average residual from $12,000 to $9,000—all by just adding a
few well-selected features!”
David took the mic and welcomed the next team. “The next team
also stuck with linear regression but took a clever approach to fea-
ture engineering.”
The team lead took to the stage. “Thanks for the intro, David.
We also felt that the biggest limitation of our model was the fea-
tures. It doesn’t matter if you use linear regression, random forest,
or a neural network; if your features aren’t good, your predictions
won’t be good. That said, we’re limited by the types of data we have
access to. So we decided to engineer new features using the ones
that we already had. Specifically, we calculated all two-way interac-
tion terms for the features in our set, which dramatically increased
the number of features we had access to.”
One of the simplest approaches to automated feature engineer-
ing is calculating interaction terms.3 An interaction term captures how the effect of one feature on the outcome depends on another feature. For example, patient age
and patient sex may be included in the original health care expendi-
ture model as main effects. In linear regression, the model will out-
put a value (coefficient) that represents the change in the outcome
for every unit increase in one of the main effect variables—change
in health care expenditure for each additional year of age, for
instance. But what if the relationship between health care spending
and age depends on sex? This would mean that as patients get older,
the change in how much they spend depends on whether they are a
man or a woman. We can add an interaction term by including the
product of age * sex as a new feature in our model. This interaction
term allows for different slopes for the relationship between expen-
diture and age for men versus women.
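As a rough illustration (the column names here are invented, not the team's actual features), an age-by-sex interaction term can be built by hand with pandas, and scikit-learn can generate every two-way interaction at once:

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.DataFrame({
        "age": [34, 61, 47, 72],
        "sex_male": [1, 0, 0, 1],   # assumed encoding: 1 = male, 0 = female
    })
    # The interaction term is simply the product of the two main effects.
    df["age_x_sex_male"] = df["age"] * df["sex_male"]

    # Generate all two-way interactions automatically.
    interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_interactions = interactions.fit_transform(df[["age", "sex_male"]])
    print(interactions.get_feature_names_out())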
“That’s right, folks,” another member of the team explained,
“by including all possible two-way interaction terms, we gave our
model access to relationships between variables that we hadn’t
considered including in the original model. Not only that—we also
performed transformations on variables that appeared to have a
nonlinear relationship with health care expenditure. Depending on
the relationship, we performed a square transformation, a square
root transformation, or a log transformation.”
A linear relationship means that as the feature doubles, the con-
tribution to the outcome doubles. If the model has a linear term
for age, then as age doubles, the contribution that the age feature
adds to health care expenditure doubles. If you have a reason to
believe that the relationship between a variable and the outcome is
nonlinear, adding a polynomial (like the square of age = age * age)
or taking the log or square root might improve model fitting.4 A
general rule of thumb is that you should always include the lower-
order terms. If you are including age * age in your model, then you
should also include age.5 A basic question you can ask the data sci-
ence team members is whether they included polynomial terms in
the regression model.
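A minimal sketch of these transformations in Python, with illustrative column names and values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, 40, 55, 70],
                       "prior_spend": [800, 3200, 12000, 45000]})
    df["age_squared"] = df["age"] ** 2                    # keep the lower-order term "age" too
    df["sqrt_prior_spend"] = np.sqrt(df["prior_spend"])
    df["log_prior_spend"] = np.log1p(df["prior_spend"])   # log1p handles zero values gracefully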
The team lead summed up the results. “When we added in these
new features, we were able to bring down the average residual from
$12,000 to $8,000. It’s amazing what a little automated feature
engineering can do!”
The team lead took the stage. “Instead of thinking about how
we could improve the features, we took a critical look at the data
we were using to train the model. One thing we noticed was
that we were using data on prior authorizations from the past
5 years to train the model. This sounds reasonable at first, but
we remembered that we changed our procedures about 2 years
ago to make it easier for Medicare Advantage patients to submit
prior authorizations. As we know, Medicare Advantage patients
are typically 65 or older; this means that starting 2 years ago, the
demographic of patients who were requesting prior authorizations
suddenly skewed older, since it became easier for these patients to
make submissions.”
Kamala was nodding in agreement. “That’s an astute observa-
tion. But what effect would this have on the model?”
The team lead continued. “Our training data includes data from
the past 5 years—3 years’ worth of data before the policy change
and 2 years’ worth of data afterward. We suspected that the 2 years’
worth of data after the policy change was more relevant, since that
data would be more representative of the current demographic of
patients who are submitting prior authorization requests.”
“So you were concerned that the training data from before the
policy change would skew toward younger patients and throw off
the results of the model?” Kamala asked.
“Exactly,” said the team lead. “At the same time, 3 years’ worth
of data is a lot of data, and we didn’t want to just throw it all out.
So we used a weighted regression to weight the recent data more
heavily than the older data when fitting our model.”
Weighted regression can also be used when training data is not
collected in a representative way. For example, let’s say that in the
training data, 70 percent of the patients were men when, in the
real world, men make up only about 50 percent of patients. This
means that the training data has more men than a random sample.
In statistical terms, we would say that men are overrepresented in
the sample. It’s important to consider whether the model is built
on a representative sample—either a random sample or the entire
population. If not, a weighted regression may be appropriate.
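Most regression tools accept observation weights directly. Here is a minimal scikit-learn sketch on synthetic data, with an assumed weighting scheme in which observations after the policy change count twice as heavily:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
    years_ago = rng.integers(0, 5, size=200)        # how old each observation is

    # Observations from the last two years get twice the weight.
    weights = np.where(years_ago < 2, 2.0, 1.0)
    model = LinearRegression().fit(X, y, sample_weight=weights)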
Even if you were to drop the handful of patients who spend more than $25,000 a month from your data set, the quantile regression coefficients will not change much.
A good way to think about this robust method is by consider-
ing the difference between the mean and the median when you
have outliers. Imagine computing the average wealth of a group of
100 people. You compute both the mean and the median wealth. The
value comes out to about $50,000 for both metrics. Now 4 people
now walk into the room: Bill Gates, Jeff Bezos, and two unnamed
people who have zero wealth. The mean wealth of this new group of
104 people is roughly $1 billion, but the median remains the same,
$50,000. Quantile regression helps you focus on that median with-
out being influenced by the outliers, while least squares regression
would be severely impacted by the outliers.
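You can see both effects in a few lines of Python. The net-worth figures and the spending data below are rough, made-up numbers, and statsmodels is just one of several libraries that offer quantile (median) regression:

    import numpy as np
    import statsmodels.api as sm

    # 100 people worth $50,000 each, then two billionaires and two people with nothing.
    wealth = np.concatenate([np.full(100, 50_000.0), [120e9, 110e9, 0.0, 0.0]])
    print("mean:  ", wealth.mean())       # leaps into the billions
    print("median:", np.median(wealth))   # still 50,000

    # Median (0.5 quantile) regression shrugs off a handful of extreme spenders.
    rng = np.random.default_rng(8)
    age = rng.uniform(20, 80, size=300)
    spend = 100 * age + rng.normal(0, 2000, size=300)
    spend[:5] += 500_000                               # a few extreme outliers
    X = sm.add_constant(age)
    print(sm.OLS(spend, X).fit().params)               # pulled toward the outliers
    print(sm.QuantReg(spend, X).fit(q=0.5).params)     # barely moves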
The team lead flashed the team’s money slide on the screen.
“As you can see, the combination of weighted regression and least
absolute deviation helped us bring the average residual down from
$12,000 to $8,500.”
K-NEAREST NEIGHBORS
“This next team moved away from linear regression and tried out a
different model architecture,” said David, as he stepped down from
the stage.
The team lead picked up the mic. “One characteristic of linear
regression is that it uses all the data points in the data set to make
a prediction. This can be good in some cases, but we felt that for
something like health care expenditure, you may not want the
model to make a prediction based on all the patients in the data
set. We thought it might be better to use an approach where the
prediction is based on only the handful of data points that are most
similar to the patient in question—in this case, the patient request-
ing prior authorization. Therefore, we used a K-nearest neighbor
regression model.”
K-nearest neighbor (K-NN) modeling is a different approach to predictive modeling. It can predict categorical or continuous outcomes by basing each prediction on the K most similar data points rather than on the entire data set.
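A minimal K-NN regression sketch with scikit-learn, using synthetic data in place of the real patient features:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(2)
    X_train = rng.normal(size=(300, 4))                       # stand-in patient features
    y_train = rng.gamma(shape=2.0, scale=5000.0, size=300)    # expenditure-like outcome

    # Predict a new patient's expenditure from the 10 most similar patients.
    knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
    new_patient = rng.normal(size=(1, 4))
    print(knn.predict(new_patient))

In practice the choice of K and of the distance metric are tuning decisions, and features usually need to be rescaled so that no single feature dominates the distance calculation.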
NAIVE BAYES
problems. However, it may not work well with features that are highly
correlated, since the assumption of independence may not hold.”
The training process for Naive Bayes involves estimating the
prior probability of each class and the conditional probability of
each feature given the class. Once these probabilities are estimated,
they can be used to classify new data points. To classify a new data
point, the model calculates the posterior probability of each class
given the features of the data point using Bayes’ theorem. The class
with the highest posterior probability is the predicted class.
Overall, Naive Bayes is a powerful and efficient algorithm that
can be used for classification problems with high-dimensional fea-
ture spaces. However, the assumption of independence may not
hold in some cases, and the algorithm may not work well with
highly correlated features.
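A minimal sketch of a Gaussian Naive Bayes classifier in scikit-learn, on synthetic data rather than the team's actual claims-derived features:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 6))
    y = (X[:, 0] + X[:, 1] + rng.normal(size=400) > 0).astype(int)   # binary outcome

    model = GaussianNB().fit(X, y)     # estimates class priors and per-class feature distributions
    print(model.predict_proba(X[:1]))  # posterior probability of each class for one data point
    print(model.predict(X[:1]))        # class with the highest posterior probability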
“We achieved an area under the curve of 0.72 on our test set,
which we think is pretty good, considering the simplicity of the
algorithm and the small amount of data we had to work with,” the
team lead concluded.
DECISION TREES
The next team lead took to the stage. “Similar to the last group,
we wanted to account for possible nonlinear relationships between
variables. We also wanted the output of the model to correspond
to easily interpretable patient profiles. A regression tree model was
the perfect choice for our goals.”
Classification and regression tree (CART) modeling consists of
dividing the population into smaller subpopulations and then mak-
ing predictions on those smaller subpopulations.14 It is a recursive algorithm, meaning the same splitting step is repeated over and over on each new subpopulation. CART modeling starts with
the entire population in the root node. It searches for the best vari-
able and value of that variable for separating the low- and high-
outcome groups. Once the root node is split into two or more
branches, the algorithm repeats again on each of those branches.
After each split, the algorithm repeats over and over until it reaches
a stopping rule.15
For example, assume we’re building a regression tree that splits
the population into two groups for each node. To predict health care
expenditure, the algorithm would find the best variable and value
of that variable to separate out the high- and low-cost patients.
Once this first split is done, there are now two groups, the low- and
the high-cost patients. The algorithm is then applied to each of these
two groups to again find the best variable and value of that variable
to split each of these groups into two more groups, so there are now
four groups. The algorithm will stop when there are not enough patients in each group to split again or when it reaches some other
stopping rule. When the CART model is completed, the entire popu-
lation will be assigned to one and only one group, called a leaf, and
the characteristics of each leaf can be easily read. For example, the
highest-cost leaf may be male patients over 65 years old who live in
the Northeast, and the lowest-cost leaf may be female patients under
18 who live in the South. A prediction for a new patient would be made by identifying which leaf that patient belongs to and assigning the average value of that leaf to the patient.
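Here is a minimal regression tree in scikit-learn on synthetic data; the maximum depth and minimum leaf size act as simple stopping rules, and export_text prints the if/then rules that define each leaf:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(4)
    age = rng.integers(18, 90, size=500)
    prior_spend = rng.gamma(2.0, 3000.0, size=500)
    X = np.column_stack([age, prior_spend]).astype(float)
    y = 200 * age + 0.5 * prior_spend + rng.normal(0, 5000, size=500)

    tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=25).fit(X, y)
    print(export_text(tree, feature_names=["age", "prior_spend"]))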
CART modeling has several advantages. First, the data scientist
does not have to make assumptions about the features and their
supposed relationship with the outcome; only the splitting and stop-
ping rules need to be defined for the tree to be produced.16 Second,
CART models can be used to predict binomial variables, categorical
variables, and continuous variables, and they are not as sensitive to
outliers as other regression methods are. Tree-based models can
also be built with missing data, while the other regression models
often struggle with missing data.17 Third, CART models can eas-
ily represent nonlinear relationships and interaction terms without
the need for the modeler to specify them in the model itself. Lastly,
CART modeling is easily interpretable in that it produces subpopu-
lations of high and low values based on a set of if/then statements,
allowing you easily to look at the rules and ask if they make sense.18
Despite these advantages, CART modeling has limitations. For instance, a small change in the data sample (for example, the addition or removal of a handful of observations) can produce a noticeably different tree.
David clapped for the last team and went back up on stage. “Now
that we’ve heard from all our teams, I want to present a surprise
that the data team leaders have been working on since the end
of the hackathon. We’ve all heard the old saying ‘Two heads are better
than one.’ But have you heard the more recent saying ‘Two machine
learning models are better than one’?” The audience laughed.
“After all the teams submitted their models, we had a secret team
working in the background to combine all the models using ensem-
bling methods. For the machine learning newbies in the audience,
ensembling methods are techniques for combining the predictions
of several models into a single, more accurate, more robust predic-
tion.22 Our team applied these techniques to all the models that
were submitted to generate an ensemble model that outperformed
each of the models individually.”
One of the most common ensembling methods is stacking. In
model stacking, different models are first created, and those models
are then used as input variables to a new model, which is used to make
the final prediction.23 The first models are known as level 1 mod-
els, and these level 1 model predictions serve as inputs to the level
2 model. Stacked models can be thought of as a method to weight dif-
ferent models to produce a final result. A very common model-stack-
ing approach involves the inclusion of commercial or external models
as level 1 models that act as one or more of the inputs to the level
2 model. If you are focused on credit risk modeling, the FICO score can be a good level 1 model. For mortality modeling, the Charlson Comorbidity Index is
a common level 1 model. The level 2 model, also called the stacked
model, can outperform each of the individual models by more heav-
ily weighting the level 1 models where they perform best and giving
those models less weight where they perform poorly.
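A minimal stacking sketch with scikit-learn, roughly mirroring the three level 1 models described in this chapter (synthetic data; a plain linear regression serves as the level 2 model):

    import numpy as np
    from sklearn.ensemble import StackingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=500)

    level1 = [
        ("linear", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("tree", DecisionTreeRegressor(max_depth=4)),
    ]
    # The level 2 model learns how much to trust each level 1 prediction.
    stacked = StackingRegressor(estimators=level1,
                                final_estimator=LinearRegression(), cv=5).fit(X, y)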
A special case of ensembling is called bootstrap aggregation, or
bagging for short. Bootstrapping refers to making several data sets
from the original data set by resampling observations. For example,
from a starting data set of 1,000 observations, you may create ten
bootstrapped data sets, each containing 1,000 observations resa-
mpled from the original 1,000. A model is fit to each of the boot-
strapped data sets, and the resulting predictions are aggregated. It
has been shown that bagging can improve prediction accuracy and
help avoid overfitting.24
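A minimal bagging sketch with scikit-learn: ten bootstrapped resamples, one regression tree fit to each, and the predictions averaged (note that the estimator argument was named base_estimator in older scikit-learn versions):

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(6)
    X = rng.normal(size=(500, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=500)

    bagged = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=4),
        n_estimators=10,          # ten bootstrapped data sets drawn from the original
        bootstrap=True,
    ).fit(X, y)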
Another technique commonly used to improve the performance
of machine learning models is gradient boosted machine learning,
also known as GBML.25 The word boosted is critical here. Boosting
involves having models learn by giving the misclassified observa-
tions more weight in the next iteration of the training as well as
by potentially giving more weight to the more accurate trees. This
boosting process means that as the algorithm repeats itself, it will
improve on the observations that are more difficult to predict while
not sacrificing too much on the ones that are easier to predict.
There are many boosting algorithms; two of the most commonly
used are Adaboost and Arcboost.26
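A minimal boosting sketch using scikit-learn's gradient boosting and AdaBoost regressors; the data is synthetic and the hyperparameters are purely illustrative:

    import numpy as np
    from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=500)

    # Each new tree concentrates on the observations the earlier trees predicted poorly.
    gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05).fit(X, y)
    ada = AdaBoostRegressor(n_estimators=100).fit(X, y)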
David continued. “The models that we included in our ensemble
were the original linear regression model, the K-NN model, and the
regression tree model. We gave each model access to all the fea-
tures that were engineered by the different teams, and we applied
bagging and boosting to the regression tree model to improve its
performance. Finally, we developed a stacked model, which com-
bined the predictions from the three level 1 models using a level 2
linear regression model. Our stacked ensemble vastly outperformed
each of the individual models: we were able to decrease the average
residual from $12,000 to $5,000!”
Kamala was floored. How could the stacked model be that much
better? “David, I’m wondering how the stacked model performed so
well. Before the ensembling, the best model was one of the linear
regression models that had an average residual of $8,500. All the
other models had residuals higher than that. How is it that adding
a model with residuals over $10,000 to a model with residuals at
$8,500 leads to a model with residuals at $5,000?”
“That’s the beauty of ensembling.” David smiled. “Imagine you’re
on a trivia game show with a partner. Your partner is a quiz bowl
whiz. They know just about everything—except they’ve never had a
penchant for pop music, so they’re completely useless when it comes
to any question relating to music. You, on the other hand, know
absolutely nothing about geography, sports, science, or history. But
you’re a huge music and movies fan, so any question on pop music
that comes your way is a piece of cake. Individually, your partner
would perform way better than you. They may get 90 percent of all
questions right if you assume the remaining 10 percent are about
music. You, on the other hand, would do terribly on your own. You’d
be lucky to get 15 percent if you assume those are the questions
about music and movies. Now what if we put you two together on
the same team? You can complement your partner’s knowledge on
music and movies, and they can carry the team for all the remain-
ing questions. Individually, it’s unlikely either of you would get 100
percent, but together you have a real shot at a perfect score.”
This made sense to Kamala. “It’s like forming a committee. The
different backgrounds of the committee members complement
each other, and, collectively, they’re able to make a better decision
than they would individually.”
“Exactly. Two models are better than one!”
KEY QUESTIONS
• What can we learn from examining the residuals of this model? Did you
identify any patterns in the residuals?
• Does the relationship between the feature and the outcome align with
intuition?
• Can we include polynomial terms or interaction terms as features?
• Based on our knowledge of the data itself, are there specific interaction
terms that we expect to see?
• Which features should be transformed before being used for modeling?
• Does the data contain outliers that need to be dealt with? If so, how will
those outliers be treated?
• Would it make sense to weight certain observations in this data set? If so,
what weightings would be appropriate to explore?
• Would a K-nearest neighbor approach be superior to a parametric regres-
sion approach in this scenario?
• Should we try out metalearning methods like boosting, bagging, and
ensembling? What are the potential benefits and risks involved in these
explorations?
8
PULLING IT TOGETHER
can, and let’s see what can be delivered. You will work closely with
the data science team and update me on a weekly basis in terms of
the progress. When we have something definitive to show off, we
can arrange a chat with our boss, Charissa.”
With those words of encouragement, Steve was off to the races.
Step number one on this journey would be natural language pro-
cessing. He knew Jerry was right that the acronym NLP is thrown
around all the time, but few seem to really understand what exactly
it is and how it may be used. His first goal was to better understand
NLP and then discuss with the data science team what could and
couldn’t be done in that area given the data sources, technical limi-
tations, and capacity constraints.
Charissa was able to use her influence as a senior vice president
to get Brett temporarily assigned to work with Steve so they could
continue working as a tightly knit team that combines their busi-
ness and data science knowledge.
Jerry and Steve lined up their project management tools and
decided to focus as a first phase on the development of data sources
and features. Their logic was simple: Jerry had already done an exten-
sive exploration of the competitive landscape of vendors selling
rental price models that they could use. Once they had a good data
set, they were very comfortable building the models themselves.
For the data sources, the search began with a blue sky session.
This session was facilitated by Steve, who prepared everyone with a
discussion guide that summarized what he knew about rental price
modeling from his background research, such as the features that
have been identified by other researchers and companies as being
important, the vendors that already produce rental price models and
their descriptions of these models, and the data sources that have
been used by others. He then laid out some key questions that he
wanted the participants to discuss. These key questions included
what features might influence a rental price and what some poten-
tial data sources are. Participants at the blue sky session included
not only Jerry and Brett but also two former property managers who
were familiar with the operations of renting properties and could lend
their real-world insight. The people with operations insight turned
Stemming reduces words to their root form: walk, walking, and walked all have the same stem, walk. Removing the endings allows
the analysis to focus on the key point of the word walk without
treating those other words (walked, walking) as separate ideas. This
is particularly useful when you are trying to see how often an idea is
mentioned in a review. Lemmatization is similar to stemming, but
it has the constraint that the stem must be a word. For example,
the stem of sharing and shared is shar, while the lemma would be a
word, share.
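A minimal sketch with the NLTK library, which is one common choice; the WordNet lemmatizer needs a one-time nltk.download("wordnet") before first use:

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["walk", "walked", "walking"]:
        print(word, "->", stemmer.stem(word))        # all reduce to the stem "walk"

    print(lemmatizer.lemmatize("sharing", pos="v"))  # lemma is a dictionary word: "share"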
A commonly used step is to search not only for word frequen-
cies but also for N-grams.9 Pretty means attractive, and ugly is the
opposite. Both are N-grams of length 1, or unigrams. But when
Brett announces that his first model is “pretty ugly,” it is a bigram
(N = 2) that means the model is rather poor. Trigrams are three-
word sequences such as “this clearly demonstrates.” N-grams
preserve the order of the words, which is critically important, since
the phrase “ugly pretty” doesn’t mean anything to a native speaker,
but the reverse order, “pretty ugly,” does. NLP analysis of N-grams
looks to identify the most frequently used N-grams. By using longer
N-grams, such as those with five or more words, a computer can
often mimic human writing by automatic text generation. This can
be handy in creating chatbots that mimic human writing.10 It is also
critical in sentence completion when you are typing on your phone
or laptop and it offers you a suggestion of how to complete your
sentence, which is often very accurate.
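Counting unigrams and bigrams takes only a few lines with scikit-learn's CountVectorizer; the example sentences are invented:

    from sklearn.feature_extraction.text import CountVectorizer

    reviews = [
        "the first model is pretty ugly",
        "the apartment is pretty and bright",
    ]
    vectorizer = CountVectorizer(ngram_range=(1, 2))     # unigrams and bigrams
    counts = vectorizer.fit_transform(reviews)
    print(vectorizer.get_feature_names_out())            # includes the bigram "pretty ugly"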
Steve was excited to share his knowledge of NLP with Brett but
was soon redirected to the world of text data vectorizing. As Brett
explained, “This is how we can convert text information into num-
bers that we can use for analysis: for example, by developing a bag
of words to identify how similar two reviews are based on the words
that appear.”
Bag-of-words analysis is a method that can examine a pair of
documents and create a list of the words used in the two sources.11
It then measures how often the same word appears in both doc-
uments. Two very similar sentences will have a lot of overlap in
words, while two sentences that are quite different will not. Another
simple analysis performed with a bag of words is to look at the term frequency, that is, how often each word appears in each document.
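A minimal sketch of comparing two reviews with a bag of words, here using cosine similarity as one reasonable overlap measure (the review text is made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "spacious apartment near the subway with great light",
        "bright and spacious apartment close to the subway",
    ]
    bag = CountVectorizer().fit_transform(docs)   # word counts per document
    print(cosine_similarity(bag[0], bag[1]))      # closer to 1 means more word overlap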
GEOSPATIAL ANALYSIS
In reviewing the data available, a key item that kept coming up was
the use of geospatial information. Jerry’s advice on this was clear:
let Brett be your guide, since he has already done some geospatial
analysis in the Fraud Department.
Brett led off with his usual confidence. “People make it seem like
geospatial analysis is some foreign language compared to other data
analytics, but it is really just one more area that can be explored.
Basically, you are using location information as part of the model-
ing where that location information can be gathered from various
sources. Often the challenge is getting things linked together given
the different location systems used, but, luckily, there are geographic
coordinates and some useful tools for connecting different systems.”
an exercise in linking across data sets and then assigning the value
of the features to the correct zip code, census tract, or other identi-
fier for that house.
Knowing the specific location of the rental properties also enables
other computations. A common approach is to do a spatial smooth-
ing by computing the average rental price within a fixed distance
from the property of interest.20 In its simplest form, this would take
the average of the rental prices within X miles of the rental property
and use that as another feature in the model. This is an extremely
simple way to do a weighted average: properties within X miles are all weighted the same, while properties more than X miles away get a weight of 0. Weighting can
be made more sophisticated by using a weight that scales inversely
with the distance so that properties that are closest to the one of
interest have the highest weights and those that are farthest have
the lowest weights.21 The scale can be linear, where properties that
are twice as far have half the weight, or nonlinear, where properties
that are twice the distance have much less than half the weight. The
choice of weightings as a function of distance is critical and should
be explored systematically.
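In code, the simple version of this smoothing is just a weighted average; the distances, rents, and two-mile radius below are made-up numbers:

    import numpy as np

    distances_miles = np.array([0.2, 0.5, 1.0, 3.0])   # nearby rental properties
    rents = np.array([2400.0, 2200.0, 1900.0, 1500.0])

    max_radius = 2.0                                    # ignore properties beyond X = 2 miles
    in_range = distances_miles <= max_radius
    weights = 1.0 / distances_miles[in_range]           # closer properties weigh more
    print(np.average(rents[in_range], weights=weights))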
A more advanced approach to geospatial analytics involves using
the latitude and longitude data from the rental property and then
analyzing the distance between that location and other points of
interest.22 Which points of interest? That depends on what is being
modeled. In the case of the rental price modeling, the distance to
transportation hubs such as subway stops can be a critical feature.
The distances to shopping malls, parks, and other points of interest
are potentially strong features. This is a simple analysis that looks
at the distance between points. Other geographic information sys-
tem (GIS) analysis can be more advanced. For example, one can
examine the travel time between the rental property and the near-
est transportation hub. This is more complicated than computing
the distance, as it involves other considerations such as the route,
average speed on the road, and road quality. There are algorithms
that compute this given two locations, the same functionality that
you find on your smartphone.23
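The simplest of these calculations, the straight-line distance between two latitude/longitude points, can be done with the haversine formula; the coordinates below are placeholders:

    import math

    def haversine_miles(lat1, lon1, lat2, lon2):
        r = 3959.0                                     # Earth's radius in miles
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    rental = (40.7306, -73.9866)          # hypothetical rental property
    subway_stop = (40.7359, -73.9911)     # hypothetical transit hub
    print(haversine_miles(*rental, *subway_stop))

Travel time, by contrast, usually requires a routing service or GIS package rather than a one-line formula.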
COMPUTER VISION
for roof work or chimney repairs, and to indoor items, such as the
details of the kitchen, bathrooms, floor quality, or ceilings. Wall-to-
wall carpeting from the 1970s should get identified easily, but the
type of material used for the countertops is a far more difficult ques-
tion to address with CV. Many of the values will be blank because
either there are no images or the images do not contain the informa-
tion needed. A set of summary statistics about the completeness of
the images would illuminate how many properties have images, how
many have images showing different parts of the inside and outside
of the property, and how many properties have images that provide
meaningful information about the property details. The information
from the CV analysis becomes features that are added to the data set
that is being used to develop the rental price model.
With these pieces in hand, Steve scheduled a discussion with
Jerry and Brett to review progress.
NETWORK ANALYSIS
are not. The well-connected ones are probably able to draw more
people to their showings. More showings could mean higher rental
prices, so the level of connection of the brokers could help predict
rental prices. Also, we can examine the accuracy of the property
descriptions, the use of specific adjectives, and the overestimation
of features and link them back to the agents who posted them.”
Jerry responded with suspicion. “It seems like quite a stretch. But
some part of data science involves incurring some risk in exploring
methods and data sources. After all, if we knew the answer before
we started, then we wouldn’t call it science. Let’s get clarity on how
much time and money it will cost us to explore network analysis,
and then we can decide if it is worth pursuing. Just curious—has
anyone published anything about applying network analysis to
rental price or home price prediction before?”
Steve nodded his head. “I did an online search and checked the
academic journals. There are a few references.”37
Jerry continued. “Well, then let’s stay on the leading edge but
not be on the bleeding edge here. Figure out how much money and
time we should spend on exploring network analysis, and then stick
to it. If Brett is right and it really is a special sauce that will make
us stand out, then great. If not, then at least we cut our losses and
chalk it up to research expenses.”
With that guidance in mind, Brett paid to access the applica-
tion programming interface (API) of Agentster, a social media net-
work for real estate agents. It serves as the Facebook, LinkedIn, and
Twitter of real estate.
In general, a social network consists of nodes (sometimes called
actors or vertices) connected using a relation, also called a tie,
link, arc, or edge.38 The nodes can be individuals, groups, or teams.
In the case of Agentster, the nodes are the brokers. The relations
reflect how those different nodes are connected. This basic idea of
a network is even reflected in how the data is structured. Network
data always consists of at least two data sets. The first data set is a
nodelist, where the nodes are the units of observation. In our case,
it identifies all of the brokers in the network. The second data set
defines how those nodes are connected. It can take different forms,
but common ones include the adjacency matrix (also called a net-
work matrix) and the edgelist. In an adjacency matrix, the columns
and rows are the nodes, and the cells of the matrix define the rela-
tions.39 An edgelist provides a list of all of the relations between
different nodes. In an edgelist, all pairs of nodes that have relations
are listed along with the type of relation. An edgelist can be readily
converted into an adjacency matrix, and an adjacency matrix can be
converted to an edgelist.
The simplest adjacency matrix or edgelist consists of only binary
data: that is, a set of 0s and 1s that show if someone is or is not
connected.40 Binary data does not convey the strength of the con-
nection but rather only whether or not a connection exists. A more
complicated data set would have values that are scaled to reflect the
strength of the connection. Those with no connection would still
have 0s, but other connections could fall within a range based on
the strength of the relationship, either as an ordinal value (weak,
medium, or strong) or as a continuous value (such as a range
between 0 and 10). Valued networks provide a lot more informa-
tion than binary networks but are far more work to collect and ana-
lyze. Our agent network is a binary network in that it indicates only
whether or not two agents are connected and nothing about the
strength of the connection.
Some networks are directed, while others are undirected. With
directed networks, the connections don’t necessarily go both ways.
These are asymmetric relationships.41 Twitter and Instagram users
know that you can follow someone who doesn’t follow you. That
means the relationship is directed. You can imagine this relation-
ship as a set of arrows where many of the arrows point in only one
direction. The sender, also called the source, is where the informa-
tion comes from, and the arrow points at the receiver (or target).
If you have a directed network, you can describe which nodes have
the most inbound connections (what data scientists call the node’s
in-degree) versus outbound connections (the node’s out-degree).
The brokerage network is undirected, meaning every connection goes both ways, so each node’s in-degree equals its out-degree. This makes the analysis a bit simpler, since direction is not a consideration.
“Keep in mind that there are packages that do this type of work very quickly
that have been validated and used by many people already . . . so you
don’t need to build this yourself. With that in mind, another mea-
sure of node importance is the closeness centrality. This measures
how close a node is to all other nodes, where high closeness scores
have the shortest distances to all other nodes. This is a metric that
helps detect nodes that can spread information quickly through
the network, while nodes with low closeness scores would have a
more difficult time spreading information. Again, minimal effort is
needed to compute this once you have the data set prepped.”
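Those off-the-shelf calculations look roughly like this with networkx, using a toy network with invented broker names:

    import networkx as nx

    g = nx.Graph([("broker_a", "broker_b"), ("broker_a", "broker_c"),
                  ("broker_b", "broker_d"), ("broker_c", "broker_d")])

    print(nx.density(g))                  # how dense or diffuse the network is
    print(nx.degree_centrality(g))        # number of connections, normalized
    print(nx.closeness_centrality(g))     # how close each broker is to all other brokers
    print(nx.eigenvector_centrality(g))   # influence based on the influence of one's connections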
The next day Steve reviewed his notes with Jerry and Brett. Brett
was keen to explore all possible network-related analytics, as he
had just finished an online Python course dedicated to the topic
and wanted to put some theory into practice. Jerry was much more
interested in getting a quick assessment of whether the effort would yield value or whether they should stick with the information they
already have to build the rental price model. He turned to Steve and
said, “You’ve been doing some research on network analysis. What
are the basics we should examine this week to decide if it is worth
digging further?”
Steve cleared his throat. He was expecting Brett to answer this
question. After a few seconds, he blurted out, “So far we have
learned that the network is a binary, undirected network. There are
no measures of the strength of the connections—only that two bro-
kers are connected. The simplicity of this structure should help us
do our analysis quickly. Since we can access the edgelist from the
API, the first thing I would like to compute is the network density.
From that, we will have a sense for whether this is a diffuse or a
dense network.”
He looked around the room and saw nodding heads, so he con-
tinued. “From there, my priority is on identifying a measure of the
broker’s influence. My logic is that brokers who are more influential
amongst their peers are probably able to also bring more people
to view properties. Properties with more showings are more likely
to obtain higher rents than those that are viewed by few people.
I think we can easily capture the basic measures of broker influence
and test each one in the rental price model to see if any of them add
value. Specifically, I suggest we compute the number of connec-
tions, the eigencentrality, and the closeness centrality for each bro-
ker and then determine if any of these three measures of influence
has incremental impact after using all of the other data sources.”
Jerry inquired, “Why are we looking at the network analysis
inputs as the last possible source rather than the first?”
Steve responded immediately. “A few considerations, but mostly
time and money. We already have all of the other data elements
in place. Most of these data sources are free or have already been
purchased at discount prices, since they are used in other parts of
the company. Access to this broker database’s API is not cheap, and
the Real Estate Division will be incurring the full expense, since
no other department uses it. Since we would need to constantly
update the brokerage influence data, there would be a substantial
recurrent expense. I need to understand the costs of acquiring the
data as well as the costs associated with updating it and cleaning
it so it is usable. Of course, we are assuming that the data is of a
high enough quality that it is usable . . . an assumption that is not
always correct. Just because someone provides data doesn’t mean it
is high-quality data. We really don’t want to find ourselves in a situ-
ation where we paid a lot of money for a database and then discover
that the data is so dirty, with missing data or incorrectly entered
information, that it is of little or no value.”
Charissa was pleased with the results. The in-house model had
a reported accuracy that was slightly better than that of the com-
mercial vendor’s model and at a fraction of the price. The addition
of the GIS, computer vision, and NLP information clearly made a
difference in the model performance, as they could measure the
improvement in accuracy as each set of variables was introduced.
As for the network analysis, Charissa summed it up perfectly. “It
didn’t add value in this case, but it can certainly help in other situ-
ations. We didn’t know going into the network analysis if the bro-
ker information was going to be useful, so we tested . . . after all,
data science is a science. Key for me is that you figured out how
to test this idea and come up with a conclusion without risking
a lot of time or money. By the way, we just decided to license an
automated machine learning tool. Steve, this means you can do
some of the modeling yourself without becoming a Python guru
like Brett.”
Steve responded, “Happy to learn more . . . I am already doing a
little programming myself. Just basic querying for now.”
Charissa smiled, looked at everyone, and then announced,
“Great job team—promotions for everyone!”
Steve jumped up. “That’s amazing. I can’t believe I am getting
promoted so quickly.” His huge smile slowly left his face as he
noticed everyone looking at him puzzled.
Jerry shook his head. “Steve, that’s how Charissa likes to announce
that the meeting is over. But, seriously, you did a good job in learn-
ing the basics, asking good questions, and driving to an answer.”
KEY QUESTIONS
NLP:
• What specific steps did you take in the data cleaning and preprocessing?
• Did you remove stop words? If so, can we quickly review the stop words
that were removed to make sure none are relevant to the business?
• Is sentiment analysis appropriate for this problem?
Geospatial Analysis:
• What information can be obtained from this new data source and tech-
nique that cannot be obtained in another way?
• What other data sources can be used?
• How much will it cost to access this data, both initially and as an ongoing expense?
Computer Vision:
• Are there simpler or better ways to obtain this information than using
computer vision?
• How complete is the coverage of the images (what percentage of the
items have image data)?
• What specific information can and cannot be obtained from these images?
Network Analysis:
9
ETHICS
“You all know why we’re here,” the CEO started. “We need a
plan to ensure that what happened to our competitor does not hap-
pen here. I will not condone any use of data that exacerbates or
creates undue harm to our patients. Kamala and David, I’d like you
to work together to develop a comprehensive data ethics strategy.
David, your team is closest to the data. If there’s anything wrong
with the way we handle our data, your team ought to know about
it. Kamala, your team uses data to make decisions, sometimes deci-
sions that affect thousands of patients. We need a strategy that
governs everything from how we store data to how we analyze it
to how we use it to make decisions. If we don’t figure this out now,
we are doing ourselves and our patients a disservice. I’d like to see
something within the next couple of days.”
Walking out of the meeting, Kamala found herself yet again in a
scenario where she needed to get up to speed on technical topics in
a short amount of time. “David, I know how my team uses data to
make decisions, and I know the impact that our decisions have on
our patients. But I don’t know the first thing about data ethics as it
relates to data storage and use.”
“This will certainly be a collaborative process,” said David, nod-
ding. “We need to work together to make sure that the systems we
set up within the data science team are effective and appropriate
for the way that your team uses our data. Let’s get together this
afternoon and start brainstorming our strategy.”
• I will not be ashamed to say, “I know not,” nor will I fail to call in my
colleagues when the skills of another are needed for solving a problem.
• I will respect the privacy of my data subjects, for their data are not dis-
closed to me that the world may know, so I will tread with care in mat-
ters of privacy and security. . . .
• I will remember that my data are not just numbers without meaning or
context, but represent real people and situations, and that my work may
lead to unintended societal consequences, such as inequality, poverty,
and disparities due to algorithmic bias.1
“Across all these oaths and checklists, I think there are four key
concepts to understand: (1) fairness, (2) privacy and security, (3)
transparency and reproducibility, and (4) the social impact of data.”
FAIRNESS
in the model may recapitulate biases that we see in the real world,
such as people of certain racial groups not having access to oppor-
tunities that may make hiring them for a job more desirable. Third,
we should consider where our data on race is coming from and
whether the race variable that we’re using is high-quality.”
Kamala jumped in. “We see low-quality racial data a lot in the
health care field. Different races will be grouped into the same
bucket. Asians may include Indian Asians, Chinese Asians, Korean
Asians, Vietnamese Asians, and Cambodian Asians. Even though all
these groups are technically Asian, they have very different health
care needs. Indian Asians have a higher risk of cardiovascular dis-
ease than Chinese Asians, and Cambodian and Vietnamese Asians
on average face greater socioeconomic barriers than Chinese and
Indian Asians. You always see health data reported for all Asians,
but that’s not useful information because Asians are such a diverse
group of people.”16
David nodded. “Exactly, that’s an example of where your race
data is not of high enough quality to provide a valuable signal. These
are exactly the types of pitfalls we want to be aware of when con-
sidering what sensitive variables to include in operation models.”
Particular industries may have legal restrictions on modeling.
Credit scoring17 and actuarial modeling18 are situations where the
use of sensitive variables like gender and race may be prohibited in
the model development process. In these applications, models may
be evaluated for proxies by seeing if the variables in the model can
predict variables like gender or race. If the variables in the model
cannot accurately predict the sensitive variable, it suggests that the
model does not contain any proxies for the sensitive variable that
could lead to disparate performance across groups.
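One simple version of such a proxy check is to see how well the model's own features can predict the sensitive variable. Here is a sketch with scikit-learn on synthetic data; the 0.5 baseline is exact, but how far above it should trigger concern is a judgment call, not a regulatory standard:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(9)
    X_model_features = rng.normal(size=(1000, 8))    # features used by the operational model
    sensitive = rng.integers(0, 2, size=1000)        # e.g., a binary sensitive attribute

    auc = cross_val_score(LogisticRegression(max_iter=1000),
                          X_model_features, sensitive,
                          cv=5, scoring="roc_auc").mean()
    # An AUC well above 0.5 suggests the features carry proxy information
    # about the sensitive attribute.
    print(f"Proxy-check AUC: {auc:.2f}")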
“So how can we prevent this at our company, David? What sys-
tems can we put in place to ensure that our machine learning mod-
els are trained on an unbiased data set?”
“You said it right, Kamala. We need a system. In other words,
ensuring that the data is representative cannot be solely the respon-
sibility of the data scientist who’s building the model. We need a
comprehensive approach. Of course, the model builder has some
Data Drift
“David, we’ve been spending a lot of time talking about using the
data properly to do analysis and build models, but what about
handling the data itself? At the end of the day, we’re dealing with
sensitive health information. How can we avoid lapses in data secu-
rity that our peer companies are struggling with? Our competitor
clearly didn’t have strong enough data security systems to keep
out hackers. With the sensitive patient information stored on our
servers, a hack could expose personal data for millions of patients.
Legal issues notwithstanding, our patients trust us to safeguard
their data, and we have an ethical obligation to do all that we can to
protect their health information.”
“The data science team takes many measures to keep our data sys-
tems secure,” David pointed out. “The first thing to recognize is that
we’re dealing with protected health information, which is subject to a
number of legal regulations that normal data is not. Protected health
information is covered under the Health Insurance Portability and Accountability Act, or HIPAA for short.24 We work closely with our
security team to ensure that our data infrastructure is secure and
HIPAA compliant. Not only that—it is our ethical obligation to our
patients to maintain their sensitive data as securely as possible.
“All of our computers and laptops are encrypted, our emails have
an option to send encrypted messages when we’re attaching data
files, and our data is stored in highly secure servers with restricted
access. Whenever we need to use the data for analysis, we make
sure that people use a virtual private network and that they have
access only to the minimum information needed for their role. This
means that if you need to know only how many patients received
a certain type of procedure, we won’t give you access to patients’
names, dates of birth, addresses, etc. We call this operating on a
need-to-know basis. Sometimes this involves deidentifying the data
or removing all possible data elements that can be used to link a
data point back to an individual.”
Depending on the industry, data may be subject to different laws
and regulations. It’s important to understand these constraints
and develop the appropriate data infrastructure from day one. All
industries need to protect the data of their customers or clients.
Data breaches in any industry can decrease trust, lead to lawsuits,
and hurt the bottom line. Engaging a data security expert or build-
ing a data security team can be an important step in ensuring the
security of your data.
Data collection is another important consideration. In the health
care context, where data can be extremely sensitive and confiden-
tial, it is critical to obtain consent for data collection and use. In the
research context, informed consent is a prerequisite, both ethically
and legally. For research that will eventually be published, informed
consent procedures often need to be reviewed and approved by a
board of ethics or an institutional review board.
Why Reproducibility?
SOCIAL IMPACT
“My son showed me a video the other day of Obama saying things
that you would never imagine a president saying. Turns out it was
a ‘deep fake’—a computer-generated video that looks extremely
realistic.”
“That’s really concerning, David. I’ve heard of deep fakes and
always wondered what could happen if the wrong person got their
hands on that technology.”
“Unfortunately, Kamala, that’s a big concern for artificial intel-
ligence, or AI, developers these days. We have to be aware of the
possibility that the technology we develop could be used to harm
others.30 Recently, leading AI researchers developed a suite of
tools that can generate not only fake videos but also fake images
and fake text. While normally researchers release their code to the
public, this group chose not to release its work, recognizing that
the technology was powerful enough to do serious harm in the
wrong hands.”
“I know we’re not developing fake images at Stardust Health,
but this is something I want us to consider in our data ethics strat-
egy. The way we use our data could ultimately impact millions of
patients.” Kamala was intent on exploring this point further.
“Definitely,” David agreed. “Just because we’re not developing
cutting-edge AI doesn’t mean that our data and technology can’t
harm people. Take our spine surgery model, for example. We’re
using that model to determine who we think should receive spine
surgery. What if, after deploying the model, we noticed a spike in
the number of spine surgery complications among our patients?
The reality is that our model is being used to shape real-world deci-
sions that have an impact on real people. We need to consider how
to respond if we notice people are being harmed by our model.”
“I liked what we did when we developed that model,” Kamala
started. “We got a group of stakeholders together and brainstormed
all the possible ways it could go wrong. We set up systems to moni-
tor adverse events, and we’re reviewing that data regularly to ensure
our model is not causing any negative consequences. We should
bake this sort of deliberation into our model development work-
flows so that we are proactive about monitoring potential harms.”
CONCLUSION
From the way data is collected and stored to the way it’s analyzed,
interpreted, and used in machine learning models, there are many
opportunities for bias, misinterpretation, and breaches in security
and ethics. While having a team devoted to data security can address
potential dangers, a system for ethical data analysis and interpreta-
tion should be adopted by all stakeholders within an organization.
Building a culture of ethical data stewardship, where analysis plans
are prespecified and documented, potential biases are called out
and explored, and data is not manipulated to suit a certain narra-
tive, is a responsibility that all team members should accept.
KEY QUESTIONS
Data Ethics:
• How will we monitor changes in the data and model performance over
time?
• What variables should we avoid including in the model?
• What variables may bias the output of our model?
  ◦ Which laws and regulations might be applicable to our project?
  ◦ How might individuals’ privacy and anonymity be compromised by our storage and use of the data?
CONCLUSION
produced, which range from raw data to structured data sets that
can be readily analyzed to an advanced product such as automated
decision-making.
Working on the data science project, we emphasized that a
data science project is just that—a project. The best practices
of project management, such as establishing milestones, time-
lines, and roles and responsibilities and identifying risks, are
critical to ensuring project success. A challenge that we acknowl-
edge in data science is that people’s titles are often misleading,
so someone’s official title may be quite different from the tasks
they perform or their actual skill sets. We identified a mechanism
for prioritizing projects as well as providing guidance on how to
measure success.
We then introduced some basic statistics, the kind often used
in exploratory data analysis. This is critical, since exploratory data
analysis is a common first step before developing predictive mod-
els. We brought in the concepts of effect sizes and p-values, leading
up to a basic understanding of linear regression, one of the oldest
and still most useful methods of analyzing data.
Unsupervised machine learning is a set of commonly used
techniques that is often overlooked by data science customers. It
focuses on how observations can be grouped or clustered together
using the data rather than relying on preset rules or expert opinion.
Besides generating useful insight, it can often be a data processing
step before developing a final predictive model.
Supervised machine learning is a topic that gets a lot of press
and one that is likely to gain attention in most companies and orga-
nizations. Predicting sales, charge-offs, risk, product utilization,
spam, or any other outcome is useful across all industries. While
the techniques can range from simple linear regression models to
deep learning neural networks, the basic principles of data science
remain the same, and they often use many of the same key steps,
including cleaning and preparing the data, identifying the features
used as inputs in the prediction, training and testing the model,
and taking steps to improve model performance. In applications
spanning the range from natural language processing to network
Annie called Kamala into her office for her annual review. Kamala
was a little nervous as she sat down, but she saw Annie’s relaxed
smile and felt confident that things were going to go well.
“Kamala, you arrived here with an impressive educational back-
ground, both an MD and an MBA. We rarely see someone arriving
here with such a broad background, yet you really hit the ground
running the moment we hired you. Four years ago you came to Star-
dust, and you are already a director. How do you feel about your
career growth and this past year’s work?” Annie asked.
Kamala was ambitious but didn’t want to sound too aggressive.
“I am proud of the past year’s collaborations, especially the work
with the data science team. The prior authorization modeling work
saved the company millions. Our work on ClaroMax saved lives
and helped Stardust’s bottom line.”
Annie nodded her head. “I agree. You have found the right level
of skills in data science so you are able to understand the problems,
discuss potential solutions, and then let the data science team do
what it does best. You showed how to be both a good customer and
a leader. With that in mind, are there specific items you want to
work on for your development next year?”
“Now that I have a good grasp of how to leverage data science to
generate value for the company, I would love to expand the scope
of my team to include not only folks from the clinical and business
teams but also folks from the data science teams. I believe form-
ing an interdisciplinary team can help kick-start more innovative
data-driven initiatives, and I would be excited to lead that team.
As part of those increased responsibilities, it might be good to con-
tinue building my skills as a project leader. Perhaps I can get more
training in how to be a good project manager as well as how to
lead effectively.”
“These are great suggestions. There are some good online courses,
but you can also look into some executive education courses. And
we have a nice corporate program where our vice presidents receive
intensive training on leadership, communications, and team build-
ing. Would that interest you?”
“Well, yes. But like you said, that is for vice presidents, and I am
still a director.”
Annie smiled broadly. “Yes. But I am looking at the new vice
president of clinical strategy. Congratulations. Keep up the good
work. Pretty soon I will be calling you boss.”
At the same time as Kamala received the good news about her
promotion, Steve was looking for some career advice from his data
science mentor and friend, Brett.
“Got to admit that at first I was dreading working with the team,
but I really got into the analytics,” Steve said.
Brett gave him a thumbs up. “See, we aren’t all a boring set of
nerds. What did you like best?”
“Toward the end of our last project, I was doing some basic
Python coding myself, and it was great being able to build my own
simple models. Given the choice, I think I would rather focus more
on programming than on managing projects or teams.”
“Nothing wrong with becoming a data science guru. There are
great opportunities to grow here at Shu Financial, and nearly every
week a headhunter is calling me about a position somewhere else,”
Brett replied.
“What would I need to do to dig more deeply into the data and
programming side? Last thing I want to do is get another degree.
I am still paying off the debt from my Wharton MBA.”
Brett nodded. “That is what makes data science such a great
career. You don’t need to go back to school. You can learn from
tons of great online resources. Dive into Kaggle, and you can learn
a ton. Coursera and edX have plenty of free content, and if you are
OK paying a little, there are many excellent online courses.”
“Courses are fun, but what I really need to do is get more real-
world experience.”
“You can do it here. If you really are thinking about making
a shift, we can talk with your boss about doing a hybrid role
where you get more into the programming while you also keep
some of your current responsibilities. Believe me, you wouldn’t
be the first Shuster to make this transition, and you won’t be
the last.”
Steve hesitated for a second. “This really would take me down a
different path than I had planned out of school.”
Brett responded immediately, “We both know, if you don’t know
where you are going . . . ”
“ . . . then any road will take you there. OK. I know I want to test
the data science programming waters more deeply, so let’s set up a
chat with my boss and we can see what is possible.”
What about you, the reader? You aren’t Kamala and you aren’t
Steve.
You should make a conscious choice about your next step in the
world of data science. What do you want to do next? Will you seek
Kamala’s path, enhancing your management skills as you lead larger
teams, manage bigger budgets, and take a bigger role in your organi-
zation? Will you seek Steve’s path, going more deeply into the tech-
nical aspects of data science? Will you create a different path—one
that speaks to your interests and desires as they relate to your career
and data? Will you be a mentor to others on how to be a good data
science customer or how to work effectively with technical teams?
Will you be satisfied with your knowledge and skill in data science
and move on to another topic to broaden your skill set and increase
your capability to add value to your company?
The choice is yours.
NOTES
1. R. Sanders, “The Pareto Principle: Its Use and Abuse,” Journal of Services
Marketing 1, no. 2 (1987): 37–40.
2. H. L. Stuckey, “The First Step in Data Analysis: Transcribing and Manag-
ing Qualitative Research Data,” Journal of Social Health and Diabetes 2,
no. 1 (2014): 6.
3. DATA SCIENCE FOUNDATIONS
22. G. M. Sullivan and R. Feinn, “Using Effect Size—or Why the P Value Is Not
Enough,” Journal of Graduate Medical Education 4, no. 3 (2012): 279–282.
23. R. Rosenthal, H. Cooper, and L. Hedges, “Parametric Measures of Effect
Size,” in The Handbook of Research Synthesis, ed. H. Cooper and L. Hedges
(New York: Russell Sage Foundation, 1994), 231–244.
24. E. Burmeister and L. M. Aitken, “Sample Size: How Many Is Enough?,”
Australian Critical Care 25, no. 4 (2012): 271–274.
25. Websites like https://clincalc.com/stats/samplesize.aspx can be used to
perform sample size calculations based on statistical power.
26. J. O. Berger and T. Sellke, “Testing a Point Null Hypothesis: The Irrec-
oncilability of p Values and Evidence,” Journal of the American Statistical
Association 82, no. 397 (1987): 112–122.
27. T. Dahiru, “P-Value, a True Test of Statistical Significance? A Cautionary
Note,” Annals of Ibadan Postgraduate Medicine 6, no. 1 (2008): 21–26.
28. J. A. Berger, “A Comparison of Testing Methodologies,” in PHYSTAT LHC
Workshop on Statistical Issues for LHC Physics (Geneva: CERN, 2008),
8–19; J. P. Ioannidis, “The Proposal to Lower P Value Thresholds to .005,”
JAMA 319, no. 14 (2018): 1429–1430.
29. R. Rosenthal, R. L. Rosnow, and D. B. Rubin, Contrasts and Effect Sizes in
Behavioral Research: A Correlational Approach (Cambridge: Cambridge Uni-
versity Press, 2000).
30. L. G. Halsey, D. Curran-Everett, S. L. Vowler, and G. B. Drummond, “The
Fickle P Value Generates Irreproducible Results,” Nature Methods 12,
no. 3 (2015): 179–185.
31. X. Fan, “Statistical Significance and Effect Size in Education Research:
Two Sides of a Coin,” Journal of Educational Research 94, no. 5 (2001):
275–282.
32. T. Vacha-Haase and B. Thompson, “How to Estimate and Interpret
Various Effect Sizes,” Journal of Counseling Psychology 51, no. 4 (2004):
473.
33. D. A. Prentice and D. T. Miller, “When Small Effects Are Impressive,”
in Methodological Issues and Strategies in Clinical Research, ed. A. E.
Kazdin (Washington, DC: American Psychological Association, 2016),
99–105.
34. E. Flores-Ruiz, M. G. Miranda-Novales, and M. Á. Villasís-Keever, “The
Research Protocol VI: How to Choose the Appropriate Statistical
Test; Inferential Statistics,” Revista Alergia México 64, no. 3 (2017):
364–370.
35. M. M. Mukaka, “A Guide to Appropriate Use of Correlation Coefficient in
Medical Research,” Malawi Medical Journal 24, no. 3 (2012): 69–71.
36. D. York, N. M. Evensen, M. L. Martínez, and J. De Basabe Delgado,
“Unified Equations for the Slope, Intercept, and Standard Errors of
the Best Straight Line,” American Journal of Physics 72, no. 3 (2004):
367–375.
37. R. A. Philipp, “The Many Uses of Algebraic Variables,” Mathematics Teacher
85, no. 7 (1992): 557–561.
38. H. R. Varian, “Goodness-of-Fit in Optimizing Models,” Journal of Econo-
metrics 46, no. 1–2 (1990): 125–140.
39. K. Pearson, “LIII. On Lines and Planes of Closest Fit to Systems of
Points in Space,” London, Edinburgh, and Dublin Philosophical Magazine
and Journal of Science 2, no. 11 (1901): 559–572.
40. O. Fernández, “Obtaining a Best Fitting Plane Through 3D Georeferenced
Data,” Journal of Structural Geology 27, no. 5 (2005): 855–858.
41. M. K. Transtrum, B. B. Machta, and J. P. Sethna, “Why Are Nonlinear Fits
to Data So Challenging?,” Physical Review Letters 104, no. 6 (2010): 060201.
42. R. B. Darlington and A. F. Hayes, Regression Analysis and Linear Models
(New York: Guilford Press, 2017), 603–611.
43. T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, 2nd ed. (New York:
Springer, 2009); G. Shmueli, “To Explain or to Predict?,” Statistical Science
25, no. 3 (2010): 289–310.
44. A. Gandomi and M. Haider, “Beyond the Hype: Big Data Concepts, Meth-
ods, and Analytics,” International Journal of Information Management 35,
no. 2 (2015): 137–144.
45. P. A. Frost, “Proxy Variables and Specification Bias,” Review of Economics
and Statistics 61, no. 2 (1979): 323–325.
46. P. Velentgas, N. A. Dreyer, and A. W. Wu, “Outcome Definition and Mea-
surement,” in Developing a Protocol for Observational Comparative Effective-
ness Research: A User's Guide, ed. P. Velentgas, N. A. Dreyer, P. Nourjah,
S. R. Smith, and M. M. Torchia (Rockville, MD: Agency for Healthcare
Research and Quality, 2013), 71–92.
47. D. G. Altman and P. Royston, “The Cost of Dichotomising Continuous
Variables,” BMJ 332 (2006): 1080.
48. M. L. Thompson, “Selection of Variables in Multiple Regression: Part I. A
Review and Evaluation,” International Statistical Review 46, no. 1 (1978):
1–19.
49. G. Heinze, C. Wallisch, and D. Dunkler, “Variable Selection–a Review and
Recommendations for the Practicing Statistician,” Biometrical Journal 60,
no. 3 (2018): 431–449.
50. L. Kuo and B. Mallick, “Variable Selection for Regression Models,” Sankhyā:
The Indian Journal of Statistics, Series B 60, no. 1 (1998): 65–81.
51. T. Mühlbacher and H. Piringer, “A Partition-Based Framework for Build-
ing and Validating Regression Models,” IEEE Transactions on Visualization
and Computer Graphics 19, no. 12 (2013): 1962–1971.
4. MAKING DECISIONS WITH DATA
17. J. Hahn, P. Todd, and W. Van der Klaauw, “Identification and Estimation
of Treatment Effects with a Regression-Discontinuity Design,” Economet-
rica 69, no. 1 (2001): 201–209.
18. J. M. Bland and D. G. Altman, “Statistics Notes: Matching,” BMJ 309
(1994): 1128.
19. B. Lu, R. Greevy, X. Xu, and C. Beck, “Optimal Nonbipartite Matching and
Its Statistical Applications,” American Statistician 65, no. 1 (2011): 21–30.
20. M. A. Mansournia, N. P. Jewell, and S. Greenland, “Case-Control Match-
ing: Effects, Misconceptions, and Recommendations,” European Journal
of Epidemiology 33, no. 1 (2018): 5–14.
21. J. Heckman, “Varieties of Selection Bias,” American Economic Review 80,
no. 2 (1990): 313–318.
22. J. M. Bland and D. G. Altman, “Measurement Error,” BMJ 312 (1996):
1654.
23. D. L. Paulhus, “Measurement and Control of Response Bias,” in Measures of
Personality and Social Psychological Attitudes, ed. J. P. Robinson, P. R. Shaver,
and L. S. Wrightsman (San Diego, CA: Academic Press, 1991), 17–59.
24. P. E. Shrout and J. L. Fleiss, “Intraclass Correlations: Uses in Assessing
Rater Reliability,” Psychological Bulletin 86, no. 2 (1979): 420.
25. K. L. Gwet, “Intrarater Reliability,” in Wiley Encyclopedia of Clinical Trials, ed.
R. D'Agostino, J. Massaro, and L. Sullivan (Hoboken, NJ: Wiley, 2008), 4.
26. P. E. Fischer and R. E. Verrecchia, “Reporting Bias,” Accounting Review 75,
no. 2 (2000): 229–245.
27. C. B. Begg, “Publication Bias,” in The Handbook of Research Synthesis, ed.
Harris Cooper and Larry V. Hedges (New York: Russell Sage Foundation,
1994), 399–409.
28. M. L. Head, L. Holman, R. Lanfear, A. T. Kahn, and M. D. Jennions, “The
Extent and Consequences of P-Hacking in Science,” PLOS Biology 13, no. 3
(2015): e1002106.
29. N. Barrowman, “Correlation, Causation, and Confusion,” New Atlantis,
Summer/Fall 2014, 23–44.
30. P. D. Bliese and J. W. Lang, “Understanding Relative and Absolute Change
in Discontinuous Growth Models: Coding Alternatives and Implica-
tions for Hypothesis Testing,” Organizational Research Methods 19, no. 4
(2016): 562–592.
31. Bliese and Lang, “Understanding Relative and Absolute Change.”
32. L. Wartofsky, “Increasing World Incidence of Thyroid Cancer: Increased
Detection or Higher Radiation Exposure?,” Hormones 9, no. 2 (2010):
103–108.
33. P. R. Rosenbaum, Design of Observational Studies (New York: Springer,
2010).
34. A. S. Detsky, C. D. Naylor, K. O’Rourke, A. J. McGeer, and K. A. L’Abbé.
“Incorporating Variations in the Quality of Individual Randomized Trials
6. BUILDING YOUR FIRST MODEL
1. Marcus Dillender, “What Happens When the Insurer Can Say No? Assess-
ing Prior Authorization as a Tool to Prevent High-Risk Prescriptions and
to Lower Costs,” Journal of Public Economics 165 (2018): 170–200.
2. Richard A. Deyo, Sohail K. Mirza, Judith A. Turner, and Brook I. Mar-
tin, “Overtreating Chronic Back Pain: Time to Back Off?,” Journal of the
American Board of Family Medicine 22, no. 1 (2009): 62–68.
19. Lukas Meier, Sara Van De Geer, and Peter Bühlmann, “The Group Lasso
for Logistic Regression,” Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 70, no. 1 (2008): 53–71.
20. Arthur E. Hoerl and Robert W. Kennard, “Ridge Regression: Applications
to Nonorthogonal Problems,” Technometrics 12, no. 1 (1970): 69–82.
21. Kelleher and Tierney, Data Science, 145–148.
22. S. Kavitha, S. Varuna, and R. Ramya, “A Comparative Analysis on Linear
Regression and Support Vector Regression,” in 2016 Online International Con-
ference on Green Engineering and Technologies (Piscataway, NJ: IEEE, 2016).
23. Kelleher and Tierney, Data Science, 147.
24. Tom Dietterich, “Overfitting and Undercomputing in Machine Learning,”
ACM Computing Surveys 27, no. 3 (1995): 326–327.
25. Kelleher and Tierney, Data Science, 147–148.
26. Payam Refaeilzadeh, Lei Tang, and Huan Liu, “Cross-Validation,” in Ency-
clopedia of Database Systems, ed. L. Liu and M. T. Özsu (Boston: Springer,
2009), 532–538.
27. Avrim L. Blum and Pat Langley, “Selection of Relevant Features and Examples
in Machine Learning,” Artificial Intelligence 97, no. 1–2 (1997): 245–271.
28. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An
Introduction to Statistical Learning (New York: Springer, 2013), 225–282.
29. Matthias Feurer and Frank Hutter, “Hyperparameter Optimization,”
in Automated Machine Learning: Methods, Systems, Challenges, ed. Frank
Hutter, Lars Kotthoff, and Joaquin Vanschoren (Cham, Switzerland:
Springer, 2019), 3–33.
30. Gary Brassington, “Mean Absolute Error and Root Mean Square Error:
Which Is the Better Metric for Assessing Model Performance?,” in Geophysi-
cal Research Abstracts (Munich: European Geophysical Union, 2017), 3574.
31. Antonio Bella, Cèsar Ferri, José Hernández-Orallo, and María José
Ramírez-Quintana, “Calibration of Machine Learning Models,” in Hand-
book of Research on Machine Learning Applications and Trends: Algorithms,
Methods, and Techniques, ed. Emilio Soria Olivas, Jose David Martin Guer-
rero, Marcelino Martinez Sober, Jose Rafael Magdalena Benedito, and
Antonio Jose Serrano Lopez (Hershey, PA: Information Science Refer-
ence, 2010), 128–146.
32. Sarah A. Gagliano, Andrew D. Paterson, Michael E. Weale, and Jo Knight,
“Assessing Models for Genetic Prediction of Complex Traits: A Compari-
son of Visualization and Quantitative Methods,” BMC Genomics 16, no. 1
(2015): 1–11.
33. Olivier Caelen, “A Bayesian Interpretation of the Confusion Matrix,”
Annals of Mathematics and Artificial Intelligence 81, no. 3 (2017): 429–450.
34. Anthony K. Akobeng, “Understanding Diagnostic Tests 1: Sensitivity, Speci-
ficity and Predictive Values,” Acta Paediatrica 96, no. 3 (2007): 338–341.