Computational and Inferential Thinking
This example is meant to illustrate some of the broad themes of this text.
Don’t worry if the details of the program don’t yet make sense. Instead,
focus on interpreting the images generated below. Later sections of the
text will describe most of the features of the Python programming
language used below.
First, we read the text of both books into lists of chapters,
called huck_finn_chapters and little_women_chapters. In Python, a name cannot
contain any spaces, and so we will often use an underscore _ to stand in
for a space. The = in the lines below gives a name on the left to the result
of some computation described on the right. A uniform resource
locator or URL is an address on the Internet for some content; in this case,
the text of a book. The # symbol starts a comment, which is ignored by the
computer but helpful for people reading the code.
Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
In the plot above, the horizontal axis shows chapter numbers and the
vertical axis shows how many times each character has been mentioned
up to and including that chapter.
You can see that Jim is a central character by the large number of times
his name appears. Notice how Tom is hardly mentioned for much of the
book until he arrives and joins Huck and Jim, after Chapter 30. His curve
and Jim’s rise sharply at that point, as the action involving both of them
intensifies. As for Huck, his name hardly appears at all, because he is the
narrator.
Little Women is a story of four sisters growing up together during the Civil
War. In this book, chapter numbers are spelled out and chapter titles are
written in all capital letters.
Chapters
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...
... (37 rows omitted)
We can track the mentions of main characters to learn about the plot of
this book as well. The protagonist Jo interacts with her sisters Meg, Beth,
and Amy regularly, up until Chapter 27 when she moves to New York
alone.
Laurie is a young man who marries one of the girls in the end. See if you
can use the plots to guess which one.
1.3.2. Another Kind of Character
In some situations, the relationships between quantities allow us to make
predictions. This text will explore how to make accurate predictions based
on incomplete information and develop methods for combining multiple
sources of uncertain information to make decisions.
Here are the data for Huckleberry Finn. Each row of the table corresponds
to one chapter of the novel and displays the number of characters as well
as the number of periods in the chapter. Not surprisingly, chapters with
fewer characters also tend to have fewer periods, in general: the shorter
the chapter, the fewer sentences there tend to be, and vice versa. The
relation is not entirely predictable, however, as sentences are of varying
lengths and can involve other punctuation such as question marks.
Huck Finn Chapter Length   Number of Periods
7026    66
11982   117
8529    72
6799    84
8166    91
14550   125
13218   127
22208   249
8081    71
7036    70
... (33 rows omitted)
chars_periods_little_women
Little Women Chapter Length   Number of Periods
21759   189
22148   188
20558   231
25526   195
23395   255
14622   140
14431   131
22476   214
33767   337
18508   185
... (37 rows omitted)
You can see that the chapters of Little Women are in general longer than
those of Huckleberry Finn. Let us see if these two simple variables – the
length and number of periods in each chapter – can tell us anything more
about the two books. One way to do this is to plot both sets of data on the
same axes.
In the plot below, there is a dot for each chapter in each book. Blue dots
correspond to Huckleberry Finn and gold dots to Little Women. The
horizontal axis represents the number of periods and the vertical axis
represents the number of characters.
The plot shows us that many but not all of the chapters of Little Women are
longer than those of Huckleberry Finn, as we had observed by just looking
at the numbers. But it also shows us something more. Notice how the blue
points are roughly clustered around a straight line, as are the gold
points. Moreover, it looks as though both colors of points might be
clustered around the same straight line.
Now look at all the chapters that contain about 100 periods. The plot
shows that those chapters contain roughly 10,000 to 15,000 characters.
That’s about 100 to 150 characters per period.
Indeed, it appears from looking at the plot that on average both books
tend to have somewhere between 100 and 150 characters between
periods, as a very rough estimate. Perhaps these two great 19th century
novels were signaling something so very familiar to us now: the 140-
character limit of Twitter.
2. Causality and Experiments
Does the death penalty have a deterrent effect? Is chocolate good for
you? What causes breast cancer?
For several years, a doctor by the name of John Snow had been following
the devastating waves of cholera that hit England from time to time. The
disease arrived suddenly and was almost immediately deadly: people died
within a day or two of contracting it, hundreds could die in a week, and
the total death toll in a single wave could reach tens of thousands. Snow
was skeptical of the miasma theory, the prevailing idea that epidemic
diseases were caused by foul air. He had noticed that while entire
households were wiped out by cholera, the people in neighboring houses
sometimes remained completely unaffected. As they were breathing the
same air—and miasmas—as their neighbors, there was no compelling
association between bad smells and the incidence of cholera.
Snow had also noticed that the onset of the disease almost always
involved vomiting and diarrhea. He therefore believed that the infection
was carried by something people ate or drank, not by the air that they
breathed. His prime suspect was water contaminated by sewage.
At the end of August 1854, cholera struck in the overcrowded Soho district
of London. As the deaths mounted, Snow recorded them diligently, using a
method that went on to become standard in the study of how diseases
spread: he drew a map. On a street map of the district, he recorded the
location of each death.
Here is Snow’s original map. Each black bar represents one death. When
there are multiple deaths at the same address, the bars corresponding to
those deaths are stacked on top of each other. The black discs mark the
locations of water pumps. The map displays a striking revelation—the
deaths are roughly clustered around the Broad Street pump.
Snow studied his map carefully and investigated the apparent anomalies.
All of them implicated the Broad Street pump. For example:
There were deaths in houses that were nearer the Rupert Street
pump than the Broad Street pump. Though the Rupert Street pump
was closer as the crow flies, it was less convenient to get to because
of dead ends and the layout of the streets. The residents in those
houses used the Broad Street pump instead.
There were no deaths in two blocks just east of the pump. That was
the location of the Lion Brewery, where the workers drank what they
brewed. If they wanted water, the brewery had its own well.
There were scattered deaths in houses several blocks away from the
Broad Street pump. Those were children who drank from the Broad
Street pump on their way to school. The pump’s water was known to
be cool and refreshing.
The final piece of evidence in support of Snow’s theory was provided by
two isolated deaths in the leafy and genteel Hampstead area, quite far
from Soho. Snow was puzzled by these until he learned that the deceased
were Mrs. Susannah Eley, who had once lived in Broad Street, and her
niece. Mrs. Eley had water from the Broad Street pump delivered to her in
Hampstead every day. She liked its taste.
Later it was discovered that a cesspit that was just a few feet away from
the well of the Broad Street pump had been leaking into the well. Thus the
pump’s water was contaminated by sewage from the houses of cholera
victims.
Snow used his map to convince local authorities to remove the handle of
the Broad Street pump. Though the cholera epidemic was already on the
wane when he did so, it is possible that the disabling of the pump
prevented many deaths from future waves of the disease.
The removal of the Broad Street pump handle has become the stuff of
legend. At the Centers for Disease Control (CDC) in Atlanta, when
scientists look for simple answers to questions about epidemics, they
sometimes ask each other, “Where is the handle to this pump?”
Snow’s map is one of the earliest and most powerful uses of data
visualization. Disease maps of various kinds are now a standard tool for
tracking epidemics.
Towards Causality
Though the map gave Snow a strong indication that the cleanliness of the
water supply was the key to controlling cholera, he was still a long way
from a convincing scientific argument that contaminated water was
causing the spread of the disease. To make a more compelling case, he
had to use the method of comparison.
The map below shows the areas served by the two companies. Snow
homed in on the region where the two service areas overlap.
Snow noticed that there was no systematic difference between the people
who were supplied by S&V and those supplied by Lambeth. “Each
company supplies both rich and poor, both large houses and small; there
is no difference either in the condition or occupation of the persons
receiving the water of the different Companies … there is no difference
whatever in the houses or the people receiving the supply of the two
Water Companies, or in any of the physical conditions with which they are
surrounded …”
The only difference was in the water supply, “one group being supplied
with water containing the sewage of London, and amongst it, whatever
might have come from the cholera patients, the other group having water
quite free from impurity.”
In order to establish whether it was the water supply that was causing
cholera, Snow had to compare two groups that were similar to each other
in all but one aspect—their water supply. Only then would he be able to
ascribe the differences in their outcomes to the water supply. If the two
groups had been different in some other way as well, it would have been
difficult to point the finger at the water supply as the source of the
disease. For example, if the treatment group consisted of factory workers
and the control group did not, then differences between the outcomes in
the two groups could have been due to the water supply, or to factory
work, or both. The final picture would have been much more fuzzy.
Snow’s brilliance lay in identifying two groups that would make his
comparison clear. He had set out to establish a causal relation between
contaminated water and cholera infection, and to a great extent he
succeeded, even though the miasmatists ignored and even ridiculed him.
Of course, Snow did not understand the detailed mechanism by which
humans contract cholera. That discovery was made in 1883, when the
German scientist Robert Koch isolated the Vibrio cholerae, the bacterium
that enters the human small intestine and causes cholera.
In fact the Vibrio cholerae had been identified in 1854 by Filippo Pacini in
Italy, just about when Snow was analyzing his data in London. Because of
the dominance of the miasmatists in Italy, Pacini’s discovery languished
unknown. But by the end of the 1800s, the miasma brigade was in
retreat. Subsequent history has vindicated Pacini and John Snow. Snow’s
methods led to the development of the field of epidemiology, which is the
study of the spread of diseases.
Confounding
Let us now return to more modern times, armed with an important lesson
that we have learned along the way: in an observational study, if the
treatment and control groups differ in ways other than the treatment, it is
difficult to draw conclusions about causality.
Example: Coffee and lung cancer. Studies in the 1960s showed that
coffee drinkers had higher rates of lung cancer than those who did not
drink coffee. Because of this, some people identified coffee as a cause of
lung cancer. But coffee does not cause lung cancer. The analysis
contained a confounding factor—smoking. In those days, coffee drinkers
were also likely to have been smokers, and smoking does cause lung
cancer. Coffee drinking was associated with lung cancer, but it did not
cause the disease.
2.4. Randomization
An excellent way to avoid confounding is to assign individuals to the
treatment and control groups at random, and then administer the
treatment to those who were assigned to the treatment group.
Randomization keeps the two groups similar apart from the treatment.
If you are able to randomize individuals into the treatment and control
groups, you are running a randomized controlled experiment, also known
as a randomized controlled trial (RCT). Sometimes, people’s responses in
an experiment are influenced by their knowing which group they are in.
So you might want to run a blind experiment in which individuals do not
know whether they are in the treatment group or the control group. To
make this work, you will have to give the control group a placebo, which is
something that looks exactly like the treatment but in fact has no effect.
Benefits of Randomization
In this course, you will learn how to conduct and analyze your own
randomized experiments. That will involve more detail than has been
presented in this chapter. For now, just focus on the main idea: to try to
establish causality, run a randomized controlled experiment if possible. If
you are conducting an observational study, you might be able to establish
association but it will be harder to establish causation. Be extremely
careful about confounding factors before making conclusions about
causality based on an observational study.
Key terminology from this chapter:
observational study
treatment
outcome
association
causal association
causality
comparison
treatment group
control group
epidemiology
confounding
randomization
blind
placebo
Fun facts
The Strange Case of the Broad Street Pump: John Snow and the Mystery
of Cholera by Sandra Hempel, published by our own University of
California Press, reads like a whodunit. It was one of the main sources for
this section’s account of John Snow and his work. A word of warning: some
of the contents of the book are stomach-churning.
Poor Economics, the best seller by Abhijit Banerjee and Esther Duflo of
MIT, is an accessible and lively account of ways to fight global poverty. It
includes numerous examples of RCTs, including the PROGRESA example
in this chapter. In 2019, Banerjee, Duflo, and Michael Kremer received
the Nobel Prize in Economics, in part for showing that “questions are often
best answered via carefully designed experiments.”
3. Programming in Python
Programming can dramatically improve our ability to collect and analyze
information about the world, which in turn can lead to discoveries through
the kind of careful reasoning demonstrated in the previous section. In
data science, the purpose of writing a program is to instruct a computer to
carry out the steps of an analysis. Computers cannot study the world on
their own. People must describe precisely what steps the computer should
take in order to collect and analyze data, and those steps are expressed
through programs.
3.1. Expressions
Programming languages are much simpler than human languages. Nonetheless, there are
some rules of grammar to learn in any language, and that is where we will begin. In this text,
we will use the Python programming language. Learning the grammar rules is essential, and
the same rules used in the most basic programs are also central to more sophisticated
programs.
Programs are made up of expressions, which describe to the computer how to combine pieces
of data. For example, a multiplication expression consists of a * symbol between two
numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value
(the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below
the cell.
3 * 4
12
The grammar rules of a programming language are rigid. In Python, the * symbol cannot
appear twice in a row. The computer will not try to interpret an expression that differs from
its prescribed expression structures. Instead, it will show a SyntaxError. The syntax of a
language is its set of grammar rules, and a SyntaxError indicates that an expression structure
doesn’t match any of the rules of the language.
3 * * 4
File "<ipython-input-2-012ea60b41dd>", line 1
3 * * 4
^
SyntaxError: invalid syntax
Small changes to an expression can change its meaning entirely. Below, the space between
the *’s has been removed. Because ** appears between two numerical expressions, the
expression is a well-formed exponentiation expression (the first number raised to the power
of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the
values they combine are called operands.
3 ** 4
81
Common Operators. Data science often involves combining numerical values, and the set of
operators in a programming language is designed so that expressions can express any sort of
arithmetic. In Python, the following operators are essential: addition (+), subtraction (-),
multiplication (*), division (/), remainder (%), and exponentiation (**).
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10
17.555555555555557
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10
2017.0
This chapter introduces many types of expressions. Learning to program involves trying out
everything you learn in combination, investigating the behavior of the computer. What
happens if you divide by zero? What happens if you divide twice in a row? You don’t always
need to ask an expert (or the Internet); many of these details can be discovered by trying them
out yourself.
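For instance, here is a quick experiment of the kind suggested above (a sketch you can run in any Python interpreter): dividing by zero does not produce a value; it raises a ZeroDivisionError. Dividing twice in a row, on the other hand, is a perfectly well-formed expression.

```python
# Experimenting with expressions: what happens when you divide by zero?
try:
    1 / 0
except ZeroDivisionError as err:
    print("Dividing by zero raises an error:", err)

# Dividing twice in a row is fine: / groups left to right,
# so 100 / 5 / 2 means (100 / 5) / 2.
repeated_division = 100 / 5 / 2
print(repeated_division)  # 10.0
```

Trying small experiments like this is often faster than looking up the answer.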
3.2. Names
Names are given to values in Python using an assignment statement. In an assignment, a
name is followed by =, which is followed by any expression. The value of the expression to
the right of = is assigned to the name. Once a name has a value assigned to it, the value will
be substituted for that name in future expressions.
a = 10
b = 20
a + b
30
A previously assigned name can be used in the expression to the right of =.
quarter = 1/4
half = 2 * quarter
half
0.5
However, only the current value of an expression is assigned to a name. If that value changes
later, names that were defined in terms of that value will not change automatically.
quarter = 4
half
0.5
Names must start with a letter, but can contain both letters and numbers. A name cannot
contain a space; instead, it is common to use an underscore character _ to replace each space.
Names are only as useful as you make them; it’s up to the programmer to choose names that
are easy to interpret. Typically, more meaningful names can be invented than a and b. For
example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names
clarify the meaning of the various quantities involved.
purchase_price = 5
state_tax_rate = 0.075
county_tax_rate = 0.02
city_tax_rate = 0
sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate
sales_tax = purchase_price * sales_tax_rate
sales_tax
0.475
For example, the number of employees of the US Federal Government grew
from 2,766,000 in 2002 to 2,814,000 in 2012. To compute the relative
growth, divide the change by the initial value.
initial = 2766000
changed = 2814000
(changed - initial) / initial
0.01735357917570499
It is also typical to subtract one from the ratio of the two measurements,
which yields the same value.
(changed/initial) - 1
0.017353579175704903
This value is the growth rate over 10 years. A useful property of growth
rates is that they don’t change even if the values are expressed in
different units. So, for example, we can express the same relationship
between thousands of people in 2002 and 2012.
initial = 2766
changed = 2814
(changed/initial) - 1
0.017353579175704903
In 10 years, the number of employees of the US Federal Government has
increased by only 1.74%. In that time, the total expenditures of the US
Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.
initial = 2.37
changed = 3.38
(changed/initial) - 1
0.4261603375527425
A 42.6% increase in the federal budget is much larger than the 1.74%
increase in federal employees. In fact, the number of federal employees
has grown much more slowly than the population of the United States,
which increased 9.21% in the same time period from 287.6 million people
in 2002 to 314.1 million in 2012.
initial = 287.6
changed = 314.1
(changed/initial) - 1
0.09214186369958277
A growth rate can be negative, representing a decrease in some value.
For example, the number of manufacturing jobs in the US decreased from
15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.
initial = 15.3
changed = 11.9
(changed/initial) - 1
-0.2222222222222222
An annual growth rate is a growth rate of some quantity over a single
year. An annual growth rate of 0.035, accumulated each year for 10
years, gives a much larger ten-year growth rate of 0.41 (or 41%).
annual_growth_rate = 0.035
ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1
ten_year_growth_rate
0.410598760621121
Likewise, a ten-year growth rate can be used to compute an equivalent
annual growth rate. Below, t is the number of years that have passed
between measurements. The following computes the annual growth rate
of federal expenditures over the last 10 years.
initial = 2.37
changed = 3.38
t = 10
(changed/initial) ** (1/t) - 1
0.03613617208346853
The total growth over 10 years is equivalent to a 3.6% increase each year.
In general, if an initial amount grows at a fixed annual rate g for t years,
it becomes
initial * (1 + g) ** t
To compute g, raise the total growth changed/initial to the power of 1/t and
subtract one.
(changed/initial) ** (1/t) - 1
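As a sanity check (a sketch using the expenditure figures and names from the cells above), the annual rate recovered this way does reproduce the ten-year change:

```python
# Federal expenditures (trillions of dollars), from the text above.
initial = 2.37
changed = 3.38
t = 10

# Annual growth rate equivalent to the total ten-year growth.
g = (changed / initial) ** (1 / t) - 1
print(g)  # roughly 0.036, i.e. about 3.6% per year

# Growing the initial amount at rate g for t years recovers the final amount.
print(initial * (1 + g) ** t)  # approximately 3.38
```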
3.3. Call Expressions
Call expressions invoke functions, which are named operations. The name of
the function appears first, followed by expressions in parentheses that
provide its arguments.
abs(-12)
12
round(5 - 1.3)
4
max(2, 2 + 3, 4)
5
In this last example, the max function is called on three arguments: 2, 5,
and 4. The value of each expression within parentheses is passed to the
function, and the function returns the final value of the full call expression.
The max function can take any number of arguments and returns the
maximum.
A few functions are available by default, such as abs and round, but most
functions that are built into the Python language are stored in a collection
of functions called a module. An import statement is used to provide
access to a module, such as math or operator.
import math
import operator
math.sqrt(operator.add(4, 5))
3.0
The same result can be computed using the + and ** operators
instead.
(4 + 5) ** 0.5
3.0
Operators and call expressions can be used together in an expression.
The percent difference between two values is used to compare values for
which neither one is obviously initial or changed. For example, in 2014
Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25
billion eggs (http://quickstats.nass.usda.gov/). The percent difference is
100 times the absolute value of the difference between the values,
divided by their average. In this case, the difference is larger than the
average, and so the percent difference is greater than 100.
florida = 2.72
iowa = 16.25
100*abs(florida-iowa)/((florida+iowa)/2)
142.6462836056932
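The computation above can be packaged as a function (a sketch: def statements are covered later in the text, and the name percent_difference is ours, not a built-in):

```python
def percent_difference(a, b):
    """Percent difference: 100 times |a - b| divided by the average of a and b."""
    return 100 * abs(a - b) / ((a + b) / 2)

# Egg production in billions, 2014, from the text above.
florida = 2.72
iowa = 16.25
print(percent_difference(florida, iowa))  # about 142.6
```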
Learning how different functions behave is an important part of learning a
programming language. A Jupyter notebook can assist in remembering the
names and effects of different functions. When editing a code cell, press
the tab key after typing the beginning of a name to bring up a list of ways
to complete that name. For example, press tab after math. to see all of the
functions available in the math module. Typing will narrow down the list of
options. To learn more about a function, place a ? after its name. For
example, typing math.log? will bring up a description of the log function in
the math module.
math.log?
log(x[, base])
math.log(16, 2)
4.0
math.log(16)/math.log(2)
4.0
The list of Python’s built-in functions is quite long and includes many
functions that are never needed in data science applications. The list
of mathematical functions in the math module is similarly long. This text
will introduce the most important functions in context, rather than
expecting the reader to memorize or understand these lists.
The table cones has been imported for us; later we will see how, but here
we will just work with it. First, let’s take a look at it.
cones
Flavor Color Price
The table has six rows. Each row corresponds to one ice cream cone. The
ice cream cones are the individuals.
Each cone has three attributes: flavor, color, and price. Each column
contains the data on one of these attributes, and so all the entries of any
single column are of the same kind. Each column has a label. We will refer
to columns by their labels.
Table methods are called using dot notation: the name of the table, a dot,
then the method name and its arguments in parentheses.
name_of_table.method(arguments)
For example, if you want to see just the first two rows of a table, you can
use the table method show.
cones.show(2)
Flavor Color Price
You can replace 2 by any number of rows. If you ask for more than six,
you will only get six, because cones only has six rows.
The select method creates a new table that contains only the columns you
specify.
cones.select('Flavor')
Flavor
strawberry
chocolate
chocolate
strawberry
chocolate
bubblegum
You can select more than one column by separating the column labels with commas.
cones.select('Flavor', 'Price')
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 5.25
strawberry 5.25
chocolate 5.25
bubblegum 4.75
You can also drop columns you don’t want. The table above can be created by dropping
the Color column.
cones.drop('Color')
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 5.25
strawberry 5.25
chocolate 5.25
bubblegum 4.75
You can name this new table and look at it again by just typing its name.
no_colors = cones.drop('Color')
no_colors
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 5.25
strawberry 5.25
chocolate 5.25
bubblegum 4.75
Like select, the drop method creates a smaller table and leaves the original table
unchanged. In order to explore your data, you can create any number of smaller tables by
choosing or dropping columns. It will do no harm to your original data table.
cones.sort('Price')
Flavor Color Price
To sort in descending order, you can use an optional argument to sort. As the name implies,
optional arguments don’t have to be used, but they can be used if you want to change the
default behavior of a method.
By default, sort sorts in increasing order of the values in the specified column. To sort in
decreasing order, use the optional argument descending=True.
cones.sort('Price', descending=True)
Flavor Color Price
Like select and drop, the sort method leaves the original table unchanged.
The code in the cell below creates a table consisting only of the rows corresponding to
chocolate cones.
cones.where('Flavor', 'chocolate')
Flavor Color Price
chocolate light brown 4.75
chocolate dark brown 5.25
chocolate dark brown 5.25
The arguments, separated by a comma, are the label of the column and the value we are
looking for in that column. The where method can also be used when the condition that the
rows must satisfy is more complicated. In those situations the call will be a little more
complicated as well.
It is important to provide the value exactly. For example, if we specify Chocolate instead
of chocolate, then where correctly finds no rows where the flavor is Chocolate.
cones.where('Flavor', 'Chocolate')
Flavor Color Price
Like all the other table methods in this section, where leaves the original table unchanged.
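The four operations above have close analogues in other table libraries. Here is a rough sketch using pandas as a stand-in (the text itself uses a Table class, not pandas; the values are taken from the cones table above, with the Color column omitted):

```python
import pandas as pd

# The cones data from the text (Flavor and Price columns only).
cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# select: keep only the specified columns (returns a new DataFrame).
flavors = cones[['Flavor']]

# drop: remove a column, leaving the original unchanged.
no_price = cones.drop(columns='Price')

# sort: descending order of Price.
by_price = cones.sort_values('Price', ascending=False)

# where: only the rows whose Flavor is exactly 'chocolate'.
chocolate = cones[cones['Flavor'] == 'chocolate']
print(len(chocolate))  # 3
```

As in the text, each operation produces a new table and leaves the original unchanged.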
The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of
almost $18.7 million in 2015-2016.
nba
PLAYER POSITION TEAM SALARY
We can also create a new table called warriors consisting of just the data for the Golden
State Warriors.
PLAYER POSITION TEAM SALARY
Klay Thompson SG Golden State Warriors 15.501
Draymond Green PF Golden State Warriors 14.2609
Andrew Bogut C Golden State Warriors 13.8
Andre Iguodala SF Golden State Warriors 11.7105
Stephen Curry PG Golden State Warriors 11.3708
Jason Thompson PF Golden State Warriors 7.00847
Shaun Livingston PG Golden State Warriors 5.54373
Harrison Barnes SF Golden State Warriors 3.8734
Marreese Speights C Golden State Warriors 3.815
Leandro Barbosa SG Golden State Warriors 2.5
... (4 rows omitted)
By default, the first 10 lines of a table are displayed. You can use show to display more or
fewer. To display the entire table, use show with no argument in the parentheses.
warriors.show()
PLAYER POSITION TEAM SALARY
Klay Thompson SG Golden State Warriors 15.501
Draymond Green PF Golden State Warriors 14.2609
Andrew Bogut C Golden State Warriors 13.8
Andre Iguodala SF Golden State Warriors 11.7105
Stephen Curry PG Golden State Warriors 11.3708
Jason Thompson PF Golden State Warriors 7.00847
Shaun Livingston PG Golden State Warriors 5.54373
Harrison Barnes SF Golden State Warriors 3.8734
Marreese Speights C Golden State Warriors 3.815
Leandro Barbosa SG Golden State Warriors 2.5
Festus Ezeli C Golden State Warriors 2.00875
Brandon Rush SF Golden State Warriors 1.27096
Kevon Looney SF Golden State Warriors 1.13196
Anderson Varejao PF Golden State Warriors 0.289755
The nba table is sorted in alphabetical order of the team names. To see how the players were
paid in 2015-2016, it is useful to sort the data by salary. Remember that by default, the
sorting is in increasing order.
nba.sort('SALARY')
PLAYER POSITION TEAM SALARY
Thanasis Antetokounmpo SF New York Knicks 0.030888
Jeff Ayres PF Los Angeles Clippers 0.111444
These figures are somewhat difficult to compare as some of these players changed teams
during the season and received salaries from more than one team; only the salary from the
last team appears in the table.
The CNN report is about the other end of the salary scale – the players who were among the
highest paid in the world. To identify these players we can sort in descending order of salary
and look at the top few rows.
nba.sort('SALARY', descending=True)
PLAYER POSITION TEAM SALARY
The late Kobe Bryant was the highest earning NBA player in 2015-2016.
4. Data Types
Every value has a type, and the built-in type function returns the type of
the result of any expression.
type(abs)
builtin_function_or_method
This chapter will explore many useful types of data.
4.1. Numbers
Computers are designed to perform numerical calculations, but there are
some important details about working with numbers that every
programmer working with quantitative data should know. Python (like
most other programming languages) distinguishes between two different
types of numbers:
Integers are called int values in the Python language. They can only
represent whole numbers (negative, zero, or positive) that don’t
have a fractional component.
Real numbers are called float values (or floating point values) in the
Python language. They can represent whole or fractional numbers
but have some limitations.
The type of a number is evident from the way it is displayed: int values
have no decimal point and float values always have a decimal point.
1.5 + 2
3.5
3 / 1
3.0
-12345678900000000000.0
-1.23456789e+19
The type function can be used to find the type of any number.
type(3)
int
type(3 / 1)
float
The type of an expression is the type of its final value. So,
the type function will never indicate that the type of an expression is a
name, because names are always evaluated to their assigned values.
x = 3
type(x) # The type of x is an int, not a name
int
type(x + 2.5)
float
4.1.1. More About Float Values
Float values are very flexible, but they do have limits.
1. A float can represent extremely large and extremely small numbers. There are
limits, but you will rarely encounter them.
2. A float only represents 15 or 16 significant digits for any number; the remaining
precision is lost. This limited precision is enough for the vast majority of applications.
3. After combining float values with arithmetic, the last few digits may be incorrect.
Small rounding errors are often confusing when first encountered.
The first limit can be observed in two ways. If the result of a computation is a very large
number, then it is represented as infinite. If the result is a very small number, then it is
represented as zero.
2e306 * 10
2e+307
2e306 * 100
inf
2e-322 / 10
2e-323
2e-322 / 100
0.0
The second limit can be observed by an expression that involves numbers with more than 15
significant digits. These extra digits are discarded before any arithmetic is carried out.
0.6666666666666666 - 0.6666666666666666123456789
0.0
The third limit can be observed when taking the difference between two expressions that
should be equivalent. For example, the expression 2 ** 0.5 computes the square root of 2,
but squaring this value does not exactly recover 2.
2 ** 0.5
1.4142135623730951
(2 ** 0.5) * (2 ** 0.5)
2.0000000000000004
(2 ** 0.5) * (2 ** 0.5) - 2
4.440892098500626e-16
The final result above is 0.0000000000000004440892098500626, a number that is very
close to zero. The correct answer to this arithmetic expression is 0, but a small error in the
final significant digit appears very different in scientific notation. This behavior appears in
almost all programming languages because it is the result of the standard way that arithmetic
is carried out on computers.
Although float values are not always exact, they are certainly reliable and work the same
way across all different kinds of computers and programming languages.
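Because of these small rounding errors, testing two float results for exact equality is fragile. A common practice, sketched below with the standard library's math.isclose, is to compare floats within a small tolerance instead.

```python
import math

root = 2 ** 0.5

# Direct equality fails because of the tiny rounding error shown above.
print(root * root == 2)              # False
# math.isclose compares within a small relative tolerance instead.
print(math.isclose(root * root, 2))  # True
```

This pattern applies whenever two computations should agree mathematically but may differ in the last few digits.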
4.2. Strings
Much of the world’s data is text, and a piece of text represented in a
computer is called a string. A string can represent a word, a sentence, or
even the contents of every book in a library. Since text can include
numbers (like this: 5) or truth values (True), a string can also describe
those things.
The meaning of an expression depends both upon its structure and the
types of values that are being combined. So, for instance, adding two
strings together produces another string. This expression is still an
addition expression, but it is combining a different type of value.
"data" + "science"
'datascience'
Addition is completely literal; it combines these two strings together
without regard for their contents. It doesn’t add a space because these
are different words; that’s up to the programmer (you) to specify.
"data" + " " + "science"
'data science'
Single and double quotes can both be used to create
strings: 'hi' and "hi" are identical expressions. Double quotes are often
preferred because they allow you to include apostrophes inside of strings.
The str function returns a string representation of any value. Using this
function, strings can be constructed that have embedded values.
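The str example itself appears to have been dropped from this copy; a minimal sketch of embedding values in a string:

```python
# str converts any value to its string representation, letting us
# splice numbers and booleans into text with the + operator.
message = "That's " + str(1 + 1) + ' ' + str(True)
message
```

Without the str calls, adding a string to a number would be an error, because + cannot combine a string with an int directly.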
"loud".upper()
'LOUD'
Perhaps the most important method is replace, which replaces all
instances of a substring within the string. The replace method takes two
arguments, the text to be replaced and its replacement.
'hitchhiker'.replace('hi', 'ma')
'matchmaker'
String methods can also be invoked using variable names, as long as
those names are bound to strings. So, for instance, the following two-step
process generates the word “degrade” starting from “train” by first
creating “ingrain” and then applying a second replacement.
s = "train"
t = s.replace('t', 'ing')
u = t.replace('in', 'de')
u
'degrade'
Note that the line t = s.replace('t', 'ing') doesn’t change the string s,
which is still “train”. The method call s.replace('t', 'ing') just has a
value, which is the string “ingrain”.
s
'train'
This is the first time we’ve seen methods, but methods are not unique to
strings. As we will see shortly, other types of objects can have them.
4.3. Comparisons
Boolean values most often arise from comparison operators. Python
includes a variety of operators that compare values. For example, 3 is
larger than 1 + 1.
3 > 1 + 1
True
The value True indicates that the comparison is valid; Python has
confirmed this simple fact about the relationship between 3 and 1+1. The
full set of common comparison operators is listed below.
Comparison Operator True example False example
Less than < 2 < 3 2 < 2
Greater than > 3 > 2 3 > 3
Less than or equal <= 2 <= 2 3 <= 2
Greater than or equal >= 3 >= 3 2 >= 3
Equal == 3 == 3 3 == 2
Not equal != 3 != 2 2 != 2
1 < 1 + 1 < 3
True
The average of two numbers is always between the smaller number and
the larger number. We express this relationship for the
numbers x and y below. You can try different values of x and y to confirm
this relationship.
x = 12
y = 5
min(x, y) <= (x+y)/2 <= max(x, y)
True
Strings can also be compared, and their order is alphabetical. A shorter
string is less than a longer string that begins with the shorter string.
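A sketch of both rules (note that Python compares character by character, so the first differing character decides the result):

```python
# 'D' comes after 'C' in the alphabet, so the comparison is decided
# at the very first character.
print("Dog" > "Catastrophe")   # True
# A shorter string is less than a longer string that begins with it.
print("Dog" < "Dogs")          # True
```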
5. Sequences
Values can be grouped together into collections, which allows
programmers to organize those values and refer to all of them with a
single name. By grouping values together, we can write code that
performs a computation on many pieces of data at once.
baseline_high = 14.48
highs = make_array(baseline_high - 0.880, baseline_high - 0.093,
baseline_high + 0.105, baseline_high + 0.684)
highs
array([13.6 , 14.387, 14.585, 15.164])
Collections allow us to pass multiple values into a function using a single
name. For instance, the sum function computes the sum of all values in a
collection, and the len function computes its length. (That’s the number of
values we put in it.) Using them together, we can compute the average of
a collection.
sum(highs)/len(highs)
14.434000000000001
The complete chart of daily high and low temperatures appears below.
[Figure: line chart of Mean of Daily High Temperature and Mean of Daily Low Temperature.]
5.1. Arrays
While there are many kinds of collections in Python, we will work primarily
with arrays in this class. We’ve already seen that the make_array function
can be used to create arrays of numbers.
Arrays can also contain strings or other types of values, but a single array
can only contain a single kind of data. (It usually doesn’t make sense to
group together unlike data anyway.) For example:
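A sketch of an array containing only strings (written here with NumPy's np.array; the make_array function used in this text behaves the same way):

```python
import numpy as np

# An array holding only strings. Mixing in a number would force
# every element to be converted to a single common type.
english_parts_of_speech = np.array(["noun", "pronoun", "verb", "adverb"])
english_parts_of_speech
```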
baseline_high = 14.48
highs = make_array(baseline_high - 0.880,
baseline_high - 0.093,
baseline_high + 0.105,
baseline_high + 0.684)
highs
array([13.6 , 14.387, 14.585, 15.164])
Arrays can be used in arithmetic expressions to compute over their
contents. When an array is combined with a single number, that number
is combined with each element of the array. Therefore, we can convert all
of these temperatures to Fahrenheit by writing the familiar conversion
formula.
(9/5) * highs + 32
array([56.48 , 57.8966, 58.253 , 59.2952])
Arrays also have methods, which are functions that operate on the array
values. The mean of a collection of numbers is its average value: the sum
divided by the length. In the examples below, sum and mean are methods;
each pair of parentheses is part of a call expression that calls the method
with no arguments to perform a computation on the array called highs.
(size is an attribute rather than a method, which is why it appears
without parentheses.)
highs.size
4
highs.sum()
57.736000000000004
highs.mean()
14.434000000000001
5.1.1. Functions on Arrays
The numpy package, abbreviated np in programs, provides Python programmers with
convenient and powerful functions for creating and manipulating arrays.
import numpy as np
For example, the diff function computes the difference between each adjacent pair of
elements in an array. The first element of the diff is the second element minus the first.
np.diff(highs)
array([0.787, 0.198, 0.579])
The full NumPy reference lists these functions exhaustively, but only a small subset is used
commonly for data processing applications. These are grouped into different packages
within np. Learning this vocabulary is an important part of learning the Python language, so
refer back to this list often as you work through examples and problems.
Each of these functions takes an array as an argument and returns a single value.
Function Description
np.prod Multiply all elements together
np.sum Add all elements together
np.all Test whether all elements are true values (non-zero numbers are true)
np.any Test whether any elements are true values (non-zero numbers are true)
np.count_nonzero Count the number of non-zero elements
Each of these functions takes an array as an argument and returns an array of values.
Function Description
np.diff Difference between adjacent elements
np.round Round each number to the nearest integer (whole number)
np.cumprod A cumulative product: for each element, multiply all elements so far
np.cumsum A cumulative sum: for each element, add all elements so far
np.exp Exponentiate each element
np.log Take the natural logarithm of each element
np.sqrt Take the square root of each element
np.sort Sort the elements
Each of these functions takes an array of strings and returns an array.
Function Description
np.char.lower Lowercase each element
np.char.upper Uppercase each element
np.char.strip Remove spaces at the beginning and end of each element
np.char.isalpha Whether each element is only letters (no numbers or symbols)
np.char.isnumeric Whether each element is only numeric (no letters)
Each of these functions takes both an array of strings and a search string; each returns an
array.
Function Description
np.char.count Count the number of times a search string appears among the elements of an array
np.char.find The position within each element that a search string is found first
np.char.rfind The position within each element that a search string is found last
np.char.startswith Whether each element starts with the search string
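A few of these functions in action, sketched with NumPy directly on a small temperature array:

```python
import numpy as np

temps = np.array([13.6, 14.387, 14.585, 15.164])

print(np.sum(temps))     # add all elements together
print(np.diff(temps))    # differences between adjacent elements
print(np.cumsum(temps))  # running total up to each element

words = np.array(["data", " science "])
print(np.char.strip(words))       # remove surrounding spaces
print(np.char.count(words, "a"))  # occurrences of "a" in each element
```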
5.2. Ranges
A range is an array of numbers in increasing or decreasing order, each
separated by a regular interval. Ranges are useful in a surprisingly large
number of situations, so it’s worthwhile to learn about them.
Ranges are defined using the np.arange function, which takes either one,
two, or three arguments: a start, an end, and a step.
If you pass one argument to np.arange, this becomes the end value,
with start=0, step=1 assumed. Two arguments give
the start and end with step=1 assumed. Three arguments give
the start, end and step explicitly.
A range always includes its start value, but does not include its end value. It
counts up by step, and it stops before it gets to the end.
When you specify a step, the start, end, and step can all be either positive
or negative and may be whole numbers or fractions.
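A sketch of all three forms:

```python
import numpy as np

print(np.arange(5))              # end only: 0, 1, 2, 3, 4
print(np.arange(3, 9))           # start and end: 3 through 8
print(np.arange(3, 30, 5))       # start, end, step: 3, 8, 13, 18, 23, 28
print(np.arange(1.5, -2, -0.5))  # negative and fractional steps also work
```

In every case the start value is included and the end value is excluded, exactly as described above.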
5.2.1. Example: Leibniz’s formula for π
Leibniz discovered a remarkable formula for π as an infinite sum of simple fractions:
π = 4 · (1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + …)
Though some math is needed to establish this, we can use arrays to convince ourselves that
the formula works. Let’s calculate the first 5000 terms of Leibniz’s infinite sum and see if it
is close to π.
4 · (1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + ⋯ − 1/9999)
We will calculate this finite sum by adding all the positive terms first and then subtracting the
sum of all the negative terms [1]:
4 · ((1 + 1/5 + 1/9 + ⋯ + 1/9997) − (1/3 + 1/7 + 1/11 + ⋯ + 1/9999))
The positive terms in the sum have 1, 5, 9, and so on in their
denominators. For example, the range np.arange(1, 20, 4) contains these
numbers up to 17. The full set of denominators up to 9,997, and the
positive terms themselves, are:
positive_term_denominators = np.arange(1, 10000, 4)
positive_terms = 1 / positive_term_denominators
The negative terms have 3, 7, 11, and so on in their denominators. This array is just 2
added to positive_term_denominators.
negative_terms = 1 / (positive_term_denominators + 2)
The overall sum is
4 * ( sum(positive_terms) - sum(negative_terms) )
3.1413926535917955
This is very close to π=3.14159…. Leibniz’s formula is looking good!
5.2.2. Footnotes
[1] Surprisingly, when we add infinitely many positive and negative fractions, the order can
matter! But our approximation to π uses only a large finite number of fractions, so it’s okay to
add the terms in any convenient order.
5.3. More on Arrays
For our first example, we return once more to the temperature data. This
time, we create arrays of average daily high and low temperatures for the
decades surrounding 1850, 1900, 1950, and 2000.
baseline_high = 14.48
highs = make_array(baseline_high - 0.880,
baseline_high - 0.093,
baseline_high + 0.105,
baseline_high + 0.684)
highs
array([13.6 , 14.387, 14.585, 15.164])
baseline_low = 3.00
lows = make_array(baseline_low - 0.872, baseline_low - 0.629,
baseline_low - 0.126, baseline_low + 0.728)
lows
array([2.128, 2.371, 2.874, 3.728])
Suppose we’d like to compute the average daily range of temperatures for
each decade. That is, we want to subtract the average daily low in the
1850s from the average daily high in the 1850s, and the same for each
other decade.
make_array(
highs.item(0) - lows.item(0),
highs.item(1) - lows.item(1),
highs.item(2) - lows.item(2),
highs.item(3) - lows.item(3)
)
array([11.472, 12.016, 11.711, 11.436])
As when we converted an array of temperatures from Celsius to
Fahrenheit, Python provides a much cleaner way to write this:
highs - lows
array([11.472, 12.016, 11.711, 11.436])
What we’ve seen in these examples are special cases of a general feature
of arrays.
For example, if array1 and array2 have the same number of elements, then the value
of array1 * array2 is an array. Its first element is the first element of array1 times the
first element of array2, its second element is the second element of array1 times the
second element of array2, and so on.
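A minimal sketch of this elementwise behavior, using np.array in place of make_array:

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([10, 20, 30])

# Each pair of corresponding elements is multiplied:
# 1*10, 2*20, 3*30.
array1 * array2
```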
5.3.2. Example: Wallis’ formula for π
π = 2 · (2/1 · 2/3 · 4/3 · 4/5 · 6/5 · 6/7 · …)
This is a product of “even/odd” fractions. Let’s use arrays to multiply a million of them, and
see if the product is close to π.
Remember that multiplication can be done in any order [1], so we can rearrange our
calculation to:
π ≈ 2 · (2/1 · 4/3 · 6/5 ⋯ 1,000,000/999,999) · (2/3 · 4/5 · 6/7 ⋯ 1,000,000/1,000,001)
We’re now ready to do the calculation. We start by creating an array of even numbers 2, 4, 6,
and so on up to 1,000,000. Then we create two arrays of odd numbers: 1, 3, 5, 7, … up to
999,999, and 3, 5, 7, … up to 1,000,001.
even = np.arange(2, 1000001, 2)
one_below_even = even - 1
one_above_even = even + 1
2 * np.prod(even/one_below_even) * np.prod(even/one_above_even)
3.1415910827951143
That’s π correct to five decimal places. Wallis clearly came up with a great formula.
5.3.3. Footnotes
[1] As we saw in the example about Leibniz’s formula, when we add infinitely many
fractions, the order can matter. The same is true with multiplying fractions, as we are doing
here. But our approximation to π uses only a large finite number of fractions, so it’s okay to
multiply the terms in any convenient order.