
1

TM351
Data management and analysis
3

SESSION 2
parts 3 and 4
5

Session Overview
1. Part 3: Data preparation
2. Part 4: Data analysis
6

PART 3
Data preparation
9

Workload (total about 3 hours)


• During the course of this week you will work through the VLE content
and make extensive use of IPython (Jupyter) Notebooks.
• You will work through five Notebooks, looking at manipulating tabular
data in SQL and pandas DataFrames.
• Activity 3.5 uses 03.1 Cleaning data (30 minutes).
• Activity 3.9 (optional) recaps features of SQL.
• Activity 3.10 uses 03.2 Selecting and projecting, sorting and limiting
(40 minutes).
• Activity 3.11 uses 03.3 Combining data from multiple datasets
(30 minutes).
• Activity 3.12 uses 03.4 Handling missing data (20 minutes).
• In addition there are 3 screencasts showing how to use OpenRefine.
• Activity 3.4: Cleaning data with OpenRefine and clustering data to
show outliers for cleaning (30 minutes).
• Activity 3.8: Reshaping data with OpenRefine (20 minutes).
11

Data preparation
Purpose:
• Convert acquired ‘raw’ datasets into valid, consistent data, using structures and
representations that will make analysis straightforward.

Initial Steps:
1. Explore the content, values and overall shape of the data.
2. Determine the purpose for which the data will be used.
3. Determine the type and aims of the analysis to be applied to it.

Possible discovered problems with real data:


1. Data is wrongly packaged
2. Some values may not make sense
3. Some values may be missing
4. The format doesn’t seem right
5. The data doesn’t have the right structure for the tools and packages to be
used with it, for example, it might be represented in an XML schema, and a
CSV format is required, or organised geographically rather than by property
type.
12

Data preparation
Activities:
1. Data cleansing: removing or repairing obvious errors and inconsistencies in the dataset
2. Data integration: combining datasets
3. Data transformation: shaping datasets

Activities also known as:


• In data warehousing, the acronym ETL (Extract, Transform, and Load) is used for the
process of taking data from operational systems and loading it into the warehouse.
• Terms like data harmonisation and data enhancement are also used.

Note:
• Some of the techniques used in data preparation – especially in transformation and
integration – are also used to manipulate data during analysis
• Conversely, some analysis techniques are also used in data preparation.

Looking ahead:
This week you will look first at some basic data cleansing issues that apply to single and
multiple tabular datasets, and then at the processes used to combine and shape them:
selection, projection, aggregation, and joins. Many of these techniques can also be
straightforwardly applied to data structures other than tables.
13

2 Data cleansing
Is the process of:
• detecting and correcting errors in a dataset.
• It can even mean removing irrelevant parts of the data –
we will look at this later in the section.
• Having found errors – incomplete, incorrect, inaccurate or
irrelevant data – a decision must be made about how to
handle them.
14

2.1 Data cleansing headaches


Errors can be introduced into data in many ways:
• user input mistakes
• transport errors
• conversion between representations
• disagreements about the meaning of data elements

Some error types:


• incorrect formats
• incorrect structures
• inaccurate values – these can be the hardest to identify and correct without additional data or complex checking processes. (Is ‘Jean Smit’ the real name of a person in a survey?)

Most operational systems try to keep ‘dirty’ data out of the data store, by:
• input validation
• database constraints
• error checking.
However, despite these efforts, errors will occur.
15

Exercise 3.1 Exploratory (Repeated)


• Identify possible errors and issues that might require
further attention in the table.

Table 3.1 Fictitious details of family members


16

Classification of error types


• Validity
• Accuracy
• Completeness
• Consistency
• Uniformity
17

Validity

• Do the data values match any specified constraints, value limits, and formats for the column in which they appear?
18

Accuracy

• Checking correctness requires some external ‘gold standard’ to check values against (e.g. a table of valid postcodes would show that M60 9HP isn’t a postcode that is currently in use). Otherwise, hints based on spelling and capitalisation are the best hope.
19

Completeness

• Are all the required values present? Everyone has a DOB and a postcode, although they may not know the value (assuming they are in the UK – if they live elsewhere they may not have a postcode), but can the dataset be considered complete with some of these missing? This will depend on the purpose of any future analysis.
20

Consistency

• If two values should be the same but are not, then there is an inconsistency. So, if the two rows with ‘John Smith’ and ‘J. Smith’ do indeed represent a single individual, John Smith, then the data for that individual’s monthly income is inconsistent.
21

Uniformity

• The DOB field contains date values drawn from two different calendars, which would create problems in later processing. It would be necessary to choose a base or canonical representation and translate all values to that form. A similar issue appears in the income column.
24

2.2 Combining data from multiple sources
• Harmonisation is the data cleansing activity of creating a
common (aka canonical) form for non-uniform data.
• Mixed forms more often occur when two or more data
sources use different base representations.
25

2.2 Combining data from multiple sources (Examples)
• Imagine a company with two departments.
• One stores local phone numbers
• the other stores them in international format.
• A trouble-free canonical form might be:
• to specify international format for phone numbers in one column, or
• to create columns for both local and international versions.
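As a sketch of what such harmonisation might look like in pandas (the column names, example numbers and the +44 prefix rule are all invented for illustration, not part of the module materials):

import pandas as pd

# Invented contact data: one department stores local numbers,
# the other already stores them in international format.
contacts = pd.DataFrame({'name': ['Ann', 'Bob'],
                         'phone': ['01908 274066', '+44 1908 274066']})

def to_international(number, country_code='+44'):
    # Assumed rule: strip spaces; if the number is not already international,
    # drop the leading 0 and prepend the country code.
    digits = number.replace(' ', '')
    if digits.startswith('+'):
        return digits
    return country_code + digits.lstrip('0')

# A single canonical column holding the international form for every row
contacts['phone_intl'] = contacts['phone'].apply(to_international)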
27

2.2 Combining data from multiple sources (Examples)
• There are limits to how much harmonisation can be achieved
with subjective values:

• Figure 3.2 The challenge of agreeing subjective values


29

2.4 Approaches to handling dirty data


• fix it – replace incorrect or missing values with the correct
values
• remove it – remove the value, or a group of values (or
rows of data or data elements) from the dataset
• replace it – substitute a default marker for the incorrect
value, so that later processing can recognise it is dealing
with inappropriate values
• leave it – simply note that it was identified and leave it,
hoping that its impact on subsequent processing is
minimal.
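A minimal pandas sketch of the four options, using an invented table (the −999 sentinel is an assumption for illustration, not a module convention):

import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'J. Smith', 'Jean Smit'],
                   'income': [1500, -1, None]})   # -1 is an obvious error

# fix it: replace the incorrect value with the correct one, if it is known
df.loc[df['name'] == 'J. Smith', 'income'] = 1500

# remove it: drop rows that still contain missing values
cleaned = df.dropna(subset=['income'])

# replace it: substitute a default marker so later processing can spot it
marked = df.fillna({'income': -999})

# leave it: keep the value, but record that it was identified as suspect
df['income_missing'] = df['income'].isna()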
31

Documenting data cleansing


• It is necessary to:
• document how the dirty data was identified and handled, and for what reason
• maintain the data in both raw and ‘cleaned’ form.
• If the data originally came from operational systems, it might be necessary to feed the findings back to the managers of these systems.
32

Benefits of Documenting data cleansing

1. Allows others to consider the changes made and ensure they were both valid and sensible.
2. Helps to build a core of approaches and methods for the kinds of datasets that are frequently used.
3. Allows managers of the operational systems where the data came from to adjust and improve their validation processes.
4. Allows you, in time, to develop effective cleansing regimes for specialized data assets.
34

2.5 Data laundering and data obfuscation


• Two further data cleansing activities:
• Data laundering attempts to break the link between the
dataset and its (valid) provenance.
• Data obfuscation (aka data anonymisation) is the
process of removing the link between sensitive data and
the real-world entities to which it applies, while at the
same time retaining the value and usefulness of that data.
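One common obfuscation technique (not prescribed by the module) is to replace identifying values with stable pseudonyms, for example by salted hashing. A hedged sketch with invented names and an invented salt:

import hashlib
import pandas as pd

people = pd.DataFrame({'name': ['John Smith', 'Jean Smit'],
                       'monthly_income': [1500, 2100]})

def pseudonymise(value, salt='assumed-secret-salt'):
    # Produce a stable but meaningless token for an identifying value
    return hashlib.sha256((salt + value).encode()).hexdigest()[:10]

people['person_id'] = people['name'].apply(pseudonymise)
anonymised = people.drop(columns=['name'])   # the data stays useful; the link to real names is gone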
35

2.5 Data laundering and data obfuscation


• The key difference between these activities and data
cleansing itself is this:
• in data cleansing we are trying to document and maintain the full
provenance of our dataset;
• in laundering we want to lose its history, and
• in obfuscation we’re trying to produce anonymised but useful data.
36

3 Data integration and transformation


• A new dataset may be in the wrong shape
• For example, data held in a tree-like structure may be
needed in table form.
• Another reason for reshaping data is to choose a subset
of a dataset for some purpose
• Finally, reshaping may also mean combining multiple
datasets.
• In this section, you’ll try out data origami, by reshaping
some datasets of your own.
• We will concentrate on tabular data.
38

SQL and Python Pandas


• To explore data integration and transformation, we will
use
• OpenRefine to make a quick change to a tabular dataset.
• SQL and Python pandas DataFrame objects

 Activity 3.9 follows


39

3.1 Picking only the data you want – projection and selection
• Projection: extracting columns from a table
• Selection: choosing rows from a table
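In pandas terms the two operations look like this (a small invented table, not one of the module datasets); the SQL equivalents are the SELECT column list (projection) and the WHERE clause (selection):

import pandas as pd

sales = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'],
                      'Region': ['North', 'South', 'North'],
                      'TotalAmount': [120, 95, 80]})

projection = sales[['Month', 'TotalAmount']]     # projection: keep only some columns
selection = sales[sales['Region'] == 'North']    # selection: keep only some rows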

• Activity 3.10 will take you through some basic table manipulation operations using SQL and Python.
40

3.2 Sorting data


• Which of these two tables is the more informative, at first
glance?
41

Tables 3.6 and 3.7 Unsorted monthly sales data in Table 3.6 and sorted by month
order in Table 3.7

With the data sorted by month, it’s relatively easy to see the gradual
decline in the TotalAmount values over the year. It’s much harder to see
this trend in the unsorted data.
42

Sorting other data types


• Sorting data types other than numbers and strings, or
sorting complex data structures, might raise issues.
• For example, numbers – when embedded in character strings – are sorted by their character string representation:
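For example, in Python:

print(sorted(['2', '10', '1']))   # ['1', '10', '2'] – character-by-character comparison puts '10' before '2'
print(sorted([2, 10, 1]))         # [1, 2, 10] – numeric comparison gives the expected order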
43

Sorting other data types


• String sorting order might not be what some people
expect, or require.
• ASCII sorting order sorts upper case ahead of lower case,
but we might want to ignore the case, e.g. ‘alison’ and
‘Alison’:
• A ‘natural’ ordering of the days of the week, or months of the year, will probably be more useful than an alphabetical ordering.

• DBMSs and data management libraries allow the definition of a collating sequence.
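A pandas sketch of both ideas (the key argument to sort_values assumes pandas 1.1 or later):

import pandas as pd

months = pd.Series(['Mar', 'Jan', 'Feb'])
print(months.sort_values())        # alphabetical order: Feb, Jan, Mar

# An ordered categorical acts as a user-defined collating sequence
natural = pd.CategoricalDtype(categories=['Jan', 'Feb', 'Mar'], ordered=True)
print(months.astype(natural).sort_values())   # natural order: Jan, Feb, Mar

# Case-insensitive sorting of names
names = pd.Series(['alison', 'Alison', 'Bob'])
print(names.sort_values(key=lambda s: s.str.lower()))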
45

3.3 Limiting the display of the result data
• When processing very large tables, it can be distracting to
get a huge table of results every time, especially when
developing or debugging a complex process
incrementally.
• Both SQL and Python offer ways to limit the number of
rows of a table that will be displayed, offering the choice
of seeing sufficient data to confirm that the results look
right, but not so much data as to get in the way of working
interactively.
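In pandas this is typically done with head() and tail(); in SQL a LIMIT clause plays the same role. A small sketch with invented data:

import pandas as pd

df = pd.DataFrame({'TransactionId': range(1, 1001),
                   'Amount': range(1000, 2000)})

print(df.head(10))   # show only the first 10 rows while developing
print(df.tail(5))    # or just the last 5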
46

3.4 Combining data from multiple datasets
Data analysts often hear the plea ‘we need more data’; but there are several interpretations of ‘more’:
• more of the same, a bigger dataset with more data
elements (a longer table, one with more rows)
• more data about a data element we already have (a wider
table, one with more columns)
• more datasets (more tables).
• Many such cases involve multiple datasets – an original
table, and those containing additional data. In general,
when two tables are combined into a single table we talk
about ‘joining’ them. We will now look at several different
types of join and the tables that result from them.
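A pandas sketch of the idea, loosely modelled on the small sports club example used in the notebooks (the names and amounts here are invented):

import pandas as pd

members = pd.DataFrame({'MemberId': [1, 2, 3],
                        'Name': ['Kirrin', 'Blyton', 'Oldroyd']})
payments = pd.DataFrame({'MemberId': [2, 3],
                         'TotalAmount': [55, 52]})

# Inner join: only members with a matching payment row
inner = members.merge(payments, on='MemberId', how='inner')

# (Left) outer join: keep every member, with NaN where no payment exists
outer = members.merge(payments, on='MemberId', how='left')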
47

4 Coping with missing or invalid data elements

• Missing values: a ‘DOB’, a ‘Postcode’ and an ‘Approx_income’ value.


• Every person has a DOB, so presumably Smith, Jean’s DOB is missing because it’s not known – but one does exist.
• If Walter Smith is not living in the UK then he won’t have a postcode,
so this may be a different kind of missing data – it doesn’t exist.
• Jean Smit’s missing ‘Approx_income’ value could be missing because
she refused to offer the information. This is another, semantically
different, form of missing data.
48

Is the SQL NULL marker adequate?


• SQL uses the NULL marker for all these types of missing
data, but
• for data processing purposes, simply marking the missing
data may not be sufficient.
• It might be more useful to do something such as
categorising the reasons data is missing. (A ‘DOB’ or
‘Approx_income’ value of ‘refused’ would be useful in
some situations.)
49

Detecting type mismatches


• Some systems may provide warnings that certain values
are clearly meaningless in some way. Imagine the
following table of raw data:

• If we had ‘Number_attending’ defined as a numeric field and ‘Start_time’ as a time field, then some systems would flag the data mismatches.
50

Flagging invalid entries

Using:
• Not a Number (NaN) and Not a Time (NaT) values
• the NULL marker
• the None value
• others.
51

NaN & NaT


• Some systems would flag the data mismatches to show
that inappropriate data has been detected.
• Here, NaN represents Not a Number and NaT represents
Not a Time, indicating that meaningful data is missing
from these elements.

Table 3.12 Invalid entries are flagged as NaN and NaT
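One way of producing these flags yourself in pandas, using invented event data of the kind shown above, is to coerce each column to its intended type:

import pandas as pd

raw = pd.DataFrame({'Number_attending': ['24', 'about 30', '18'],
                    'Start_time': ['19:30', 'evening', '20:00']})

# Values that do not fit the intended type become NaN / NaT
raw['Number_attending'] = pd.to_numeric(raw['Number_attending'], errors='coerce')
raw['Start_time'] = pd.to_datetime(raw['Start_time'], format='%H:%M', errors='coerce')
print(raw)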


52

Null
• In an earlier example used in Notebook 03.3 Combining
data from multiple datasets, we used the SQL OUTER
JOIN to create rows with missing data, putting the SQL
NULL marker in place of the missing values.

Table 3.13 The outer join table for the small sports club
53

None

Table 3.14 The parts table showing those parts with no colour having the colour value ‘none’
54

Treating missing or invalid data


• There are various forms of missing or invalid data that need to be treated appropriately. Consider some of the issues raised by Table 3.10:

What is the average ‘Approx_income’?


• Do we include refused values as 0, which will distort the
average? or
• Do we calculate the average without them? or
• Do we report that we can’t give the average because we don’t
have all the values needed?
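The three options look like this in pandas, with an invented income column in which NaN stands for a refused value:

import pandas as pd
import numpy as np

income = pd.Series([1500, 2100, np.nan])

print(income.mean())             # 1800.0 – pandas skips NaN by default
print(income.fillna(0).mean())   # 1200.0 – treating 'refused' as 0 distorts the average
print(income.mean() if income.notna().all() else 'unknown')   # refuse to report an average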
55

Treating missing or invalid data

• What is the total of the ‘TotalAmount’ column in Table 3.13? 107 or unknown?
• If we know that NULL indicates that Kirrin had made no payment, is it a substitute for 0, and does it make sense to handle it that way?
56

Treating missing or invalid data

• How many distinct colours appear in Table 3.14? Two, three or four?
• It is two if we ignore the ‘no colours’, but should ‘no colour’ be considered a kind of colour? And if so, is the ‘None’ of the Flange the same as the ‘None’ of the Sprocket?
57

Treating missing or invalid data

• Finally, if NaN and NaT values are being used, then strict
rules for the data types in a column are probably being
broken. Should we go back and consider what to do with
the original data?
58

Treating missing or invalid data


• There is no consistent or automatic way to handle the full range of semantic interpretations of missing values: we simply have to:
• treat them with care
• decide what they represent and how they can be interpreted, and
• decide how they can best be cleaned so that subsequent processing and analysis does not lead to logical errors.
• Much will depend on how the chosen libraries and
packages handle missing data.
59

5 Bringing it all together


• Do Activity 3.13 Exploratory
• 25 minutes
61

EXERCISES
For Part 3
62

Exercise 3.1 Exploratory


• 10 minutes
• Consider the following table of data, showing names, dates of
birth (DOB), genders, postcodes and approximate monthly
incomes for five (fictitious) individuals from the same family.

Table 3.1 Fictitious details of family members


• Identify possible errors and issues that might require further
attention in the table.
63

Exercise 3.1 Exploratory (Cont.)


Discussion
• The following might require consideration. Not all of them are ‘errors’; some may simply be
unexpected values or require additional consideration:
• Names are problematic: are J. Smith and John Smith the same person, or twins? (Same
initial and surname, same postcode, same DOB, same gender … different incomes?) Is Jean
Smit a member of the Smith family? Is Jean Smit also Smith, Jean?
• Mixed representations are used in the same column: the ‘Name’ column uses initial and full
first name, but also surname-name, as well as name-surname; the ‘DOB’ column uses
different representations; the ‘Approx_income’ column uses both the £ and $ prefix, or no
prefix; the postcode column uses a space and no space.
• The gender value ‘R’ looks intriguingly strange. A mistake? Or is it specified for this column
that ‘R’ means ‘refused’?
• Missing DOB: What about Jean Smith? Everyone has a date of birth. When was he (or she)
born? Or is this not known, withheld, or not collected?
• Postcodes might need checking: M60 doesn’t look right (or rather the single character start
looks unusual, unless you come from Manchester).
• It might be correct to show monthly income as empty, but shouldn’t that then be £0 (or
might that impact on later processing)?
• 13-15-1901 doesn’t look like a standard date format. Assuming we are dealing with the
living, then 1901 makes Walter 114 in 2015. More seriously 13-15 is not a DD-MM or MM-DD
form.
• Hang on! Can £40 000 000 really be a sensible monthly income?
64

Exercise 3.2 Self-assessment


• 5 minutes
• If two distinct datasets are being taken into a system,
which do you think would be the better strategy: clean
then harmonise, or harmonise then clean?
• Discussion
• Probably the path of cleaning each dataset independently
before harmonising would be best: this ensures that the
harmonisation is applied to the ‘cleanest’ available version
of the data, reducing possible errors caused by merging
erroneous data. However, as with most activities in data
cleansing, be led by the data – explore it before making
changes.
65

Exercise 3.3 Exploratory


• 15 minutes
• Two online shops are being merged. Here are customer
tables from each, with a representative row.

Table 3.2 Company X customer table

Table 3.3 Company Y customer table


66

Exercise 3.3 Exploratory (Cont.)


• 15 minutes
• Describe how you might go about trying to harmonise
these datasets. What problems might arise, and can you
suggest forms for the harmonised data?
67

Discussion
• Firstly, you would need to spend time understanding what each table shows: understanding
what the data rows represent and the way in which the values in the columns are to be
interpreted.
• Ideally you would have a lot of sample data, supported by descriptive documentation,
available for this review stage; what we’ve supplied here is deliberately short so that we can
highlight some key questions we might ask.
• Company X uses fairly standard table components – even if we’re not sure of their exact
interpretation. There are also insufficient data values in some columns to get a sense of the
range of possible values that may occur there. For example, the ‘Priority’ column only has the
value ‘Top’, and we have no way of inferring other values that might appear in that column.
The ‘Cno’ column – assuming an interpretation of it being a unique numeric value
representing a customer number – would allow us to infer other possible numeric values for
that column.
• Company Y appears to be using a complex string representation for contact details that
combines the email and address into one field. This appears to be semi-colon separated,
and using a tag of the form label: string for that label. The ‘Id’ column is a five-digit string with
leading zeros, and for ‘Gender’ we can infer a second value of ‘M’ although there may be
more values permitted. The ‘Class’ column might relate to Company X’s ‘Priority’, but without
further information this would be a guess; even if it does relate to priority we’ve no idea if ‘1’
is a high or low priority, or how this might relate to the ‘Top’ used by Company Y. Finally,
Company Y uses a single ‘Name’ field, which appears to be split (using a comma) into
surname and name – it’s not possible to say what might happen to multi-part names.
• Without a lot of additional information it would be impossible to suggest a robust harmonised
form for the data – the only fields where this would appear possible are:
68

Discussion (Cont.)

Table 3.4 Data fields from Tables 3.2 and 3.3 that could be harmonised
• It might be possible simply to put leading zeros in front of the
‘Cno’, provided of course that the Cno range didn’t overlap the
values of the ‘Id’ column. But if this forces Company X
customers to log in using customer numbers with leading
zeros, then they would need to be told of this change.
• In operational systems, any attempts to harmonise will usually
impact on the existing systems, requiring maintenance updates
to allow existing applications to mesh with the harmonised
datastores.
69

ACTIVITIES
For part 3
70

Activity 3.1 Exploratory


• 5 minutes
• Watch the following video. While watching the video, identify the original error and the dramatic outcome.
How were the two connected, even though the original error was rectified?
• Data Quality Example (Elliott, 2007)
• Discussion
• The original error was that the value of a property was mistyped, possibly after misinterpreting information
from another source.
• The outcome was that the predicted overall property tax income for an entire county was miscalculated and
county budgets were set using this incorrect amount. When the predicted tax revenue failed to materialise,
the county was forced to make good a huge budget shortfall.
• The original error was corrected, but not before the erroneous values had propagated into downstream
data analysis and decision-making systems, and too late for downstream warnings that the decision-
making information was potentially compromised.
• Although elements of the story sound rather like an urban myth, the core of it appears to be true, at least
according to the article in The New York Times, ‘A one-house, $400 million bubble goes pop’:
• [T]he value of [Daelyn and Dennis Charnetzky’s] house … skyrocketed to $400 million from $121,900, after
someone most likely hit a wrong key and changed the figure in the county’s computer system.
• The inflated value, discovered by the Charnetzkys’ mortgage company, has become a financial headache
for Porter County. It was used to calculate tax rates and led the county to expect $8 million in property
taxes that did not exist.
• The fiasco peaked last week when the county’s 18 taxing units were told they must repay the county $3.1
million that had been advanced to them in error. On Monday, Valparaiso, the county seat, hired a financial
consultant to investigate a $900,000 budget shortfall.
• (Ruethling, 2006)
• A similar tale was told a few days earlier by the Times of Northwest Indiana in the article: ‘A $400 million
home’ (Van Dusen, 2006). To use a technical term: oops!
71

Activity 3.2 Exploratory


• 60 minutes
• The IEEE Technical Committee on Data Engineering publish regular articles on a wide range of topics
relevant to TM351. Search the web for the Committee’s official website and the back issues of their Data
Engineering Bulletin. From there, access the article:
• Rahm, E. and Do, H. H. (2000) ‘Data cleaning: problems and current approaches’, IEEE Data Eng. Bull.,
vol. 23, no. 4, pp. 3–13.
• Via the Open University Library, obtain a copy of the following paper:
• Kim, W., Choi, B. J., Hong, E. K., Kim, S. K. and Lee, D. (2003) ‘A taxonomy of dirty data’, Data Mining and
Knowledge Discovery, vol. 7, no. 1, pp. 81–99.
• Read quickly through Sections 1 and 2 of Rahm and Do (2000) and Sections 1 and 2 of Kim et al. (2003)
taking notes of what strike you as the key points of the way they categorise dirty data problems.
• The intention is that you start to build up a mental picture of the different ways in which data can go wrong
– and ways of mitigating such errors – and develop your vocabulary for describing such problems.
In reading Kim et al. (2003), you may skip over the sections that refer particularly to referential integrity and
transaction management in databases for now, although you may find it useful to revisit this paper later in
the module when we consider these issues.
• You may find it useful to use an IPython Notebook as a place to make your notes (you don’t need to limit
your use of the Notebook to just writing Python code). Alternatively, you may find another note-taking
approach, such as a mind map (GitHub, 2015), more appropriate.
• Now read through the identified sections again, giving them a slightly closer reading, and further annotating
the key points you identified at the first reading. To what extent are the categorisations given in the two
papers coherent? That is, to what extent do they offer a similar view and to what extent do they differ, or
even disagree?
• If you have any experience of working with dirty data, how well do the taxonomies capture your
experience? Add these observations to your notes. If you have any particularly notable anecdotes about
dirty data that you are willing to share, post them to the module forum thread ‘Dirty data war stories’. (You
should ensure you obscure the names of any companies or individuals to avoid any embarrassments.)
72

Discussion
• Note: the following is based on my [The author's] notes as I read through the papers – this
will differ from what others reading the same papers might choose to note; we should be
seeing the same kinds of things in the paper but we might attach a different level of
significance to the things we see.
• Rahm and Do (2000) break down the issues initially into those related to single or multi-
source data, and at a lower level distinguish between schema (data model and description)
and instance (data values). Single-source schema issues usually result from a lack of
adequate constraints to prevent incorrect data appearing, and fixing them requires schema change. For
multi-source data, the issues include combining multiple schemas, requiring schema
harmonisation. At the instance level, the single schema issues are generally reflections of
poor constraints, or simply erroneous – but plausible – data entry errors. The multi-schema
issues include differences in units/data types for comparable values, differences in
aggregation levels for grouped data items and the challenges of identifying when multiple
values are referring to the same ‘thing’ which they label – overlapping data.
• Kim et al. (2003) take a different approach to Rahm and Do, with a top-level description of
how dirty data appears: ‘missing data, not missing but wrong data, and not missing and not
wrong but unusable’. They then break each of these descriptions down, using categories
similar to the single versus multiple source distinctions of Rahm and Do (see Section 2.1.2 of
the paper, which in the version that I read is incorrectly indented). Their taxonomy is very
description based: their leaf nodes are of specific issues of specific types of problem; in
contrast, Rahm and Do focus more on classes of problems (based on where they occur).
• In their final bullet point of Section 1, Kim et al. (2003) state clearly that they do not intend to
discuss metadata issues and in this they include independent design of multiple data sources
– so this paper addresses what Rahm and Do label ‘instance data’.
73

Activity 3.3 Exploratory


• 10 minutes
• Read Section 3 (the first paragraph and Table 2 should be
sufficient) of:
• Kim, W., Choi, B. J., Hong, E. K., Kim, S. K. and Lee, D.
(2003) ‘A taxonomy of dirty data’, Data Mining and
Knowledge Discovery, vol. 7, no. 1, pp. 81–99 (a paper
you saw in the last activity).
• You will see how they describe handling dirty data
mapped against their taxonomy of dirty data.
74

Activity 3.3 Exploratory


Discussion
• This mapping mixes prevention (how to avoid it
happening), checking (how the dirty data can be found)
and repairing (doing something about it) against the
taxonomy, and so not all suggestions address ‘handling’
the dirty data that is present. It’s quite significant to note
how often this requires a call to ‘intervention by a domain
expert’ which suggests these are not going to be easily
automated.
75

Activity 3.4 Exploratory


• 15 minutes
• In this activity you will have an opportunity to see some dirty
datasets being cleaned using OpenRefine. One advantage of
using OpenRefine for this exercise is that it provides an explicit
view over the data that you can manipulate directly, you can
see the data and the impact of the cleaning actions you
perform.
• The following screencasts build on the spending datasets we
opened into OpenRefine in Part 2.
• Previewing data in OpenRefine

• A powerful feature of OpenRefine is the set of tools available to cluster partially matching text strings; some of these are shown in the following screencast.
• Grouping partially matching strings in OpenRefine
76

Activity 3.5 Notebook


• 30 minutes
• Work through Notebook 03.1 Cleaning data.
77

Activity 3.6 Exploratory and social (optional)
• 20 minutes
• Sometimes, bad code can lead to bad data, as this story
reported by Digital Health Intelligence shows: ‘Inquiry into
transplant database errors’ (Bruce, 2010).
• Optional extension
• You can read the outcome of the review here: Review of the
Organ Donor Register (Department of Health, 2010).
• Warning! The full report is 45 pages, but the key descriptions of
the data-related problems are given on pp. 14–18 and the data-
related recommendations are: Recommendations 2, 3 and 4 on
p. 6.
• If you know of any similar stories that you are able to share
here, please do so. Note that this is at your own risk. We can’t
guarantee that the story won’t become public and escape onto
the web!
78

Activity 3.7 Exploratory


• 10 minutes
• Read through the blog post ‘Several takes on the notion
of “data laundering”’ (Hirst, 2012).
• How are the different forms of data laundering
characterised?
• If you would like to contest any of the descriptions offered
there, or would like to offer your own definition of ‘data
laundering’, do so via the module forum thread ‘Data
laundering’.
79

Activity 3.8 Exploratory


• 10 minutes
• To set the scene for our exploration of data integration and
transformation, consider this example of how we can use
OpenRefine to reshape a dataset. In the following screencast
you will see how we can import a dataset, remove unnecessary
columns and rows (in this case empty rows) and then combine
data from two datasets that have values in common, allowing
us to produce a single table integrating data from the two
source datasets.

• Reshaping a dataset in OpenRefine

• We will now go on to use SQL and Python pandas to explore data integration and transformation; but don’t forget that OpenRefine is handy when you want to make a quick change to a tabular dataset.
80

Activity 3.9 Notebook (optional)


• 30 minutes
• If you studied M269, or are familiar with database work,
you will have seen examples of SQL. If you want a quick
recap on SQL before you look at the table manipulation
examples you can review the notebook:
• Example notebook reviewing SQL covered in M269
Recap – Python.
81

Activity 3.10 Notebook


• 40 minutes
• Work through Notebook 03.2 Selecting and projecting,
sorting and limiting, which contains SQL and Python to
manipulate tabular data.
82

Activity 3.11 Notebook


• 30 minutes
• Work through Notebook 03.3 Combining data from
multiple datasets, which looks at SQL and Python code
for joining datasets in different ways.
83

Activity 3.12 Notebook


• 20 minutes
• Work through Notebook 03.4 Handling missing data.
84

Activity 3.13 Exploratory


• 25 minutes
• Many real-world datasets often require a considerable
amount of tidying in order to get them into a workable
state.
• An example of published real-world data is given in the
blog post ‘A wrangling example with OpenRefine: making
“oven ready data”’ (Hirst, 2013).
• Read through the post and make notes on the
OpenRefine techniques used to work through the
example described there. You may want to refer to the
notes in later activities in the module – so I suggest
putting your notes in a Notebook.
85

PART 4
Data analysis
88

Workload
• This part of the module is split between reading, exercises, and notebook
activities.
• There are two, largely independent pieces of work to be completed this week:
• studying the module content, exercises and activities
• practical work in which you will use OpenRefine, regular expressions, SQL and
Python.
• During this part of the module you will work through six notebooks, looking at
Python’s pandas and developing skills in reading, writing and manipulating
content in different file formats.
• Activity 4.1 uses 04.1 Crosstabs and pivot tables (15 minutes).
• Activity 4.2 uses 04.2 Descriptive statistics in pandas (15 minutes).
• Activity 4.3 uses 04.3 Simple visualisations in pandas (20 minutes).
• Activity 4.4 uses 04.4 Activity 4.4 Walkthrough (10 minutes).
• Activity 4.5 uses 04.5 Split-apply-combine with SQL and pandas (30 minutes).
• Activity 4.6 uses 04.6 Introducing regular expressions (30 minutes).
• Activity 4.7 uses 04.7 Reshaping data with pandas (30 minutes).
• In addition there is a screencast in Activity 4.7 (20 minutes), which shows how
OpenRefine is used to reshape a table.
89

2 Analysis: finding the data’s voice


• Our path through the data analysis pipeline so far has
seen us:
• acquire and package data from external sources, and
• clean and prepare it for analysis and reporting.
• In this part we look at the actual examination and
interpretation of the data – the analysis stage of the
analysis pipeline, as shown in Figure 4.1.

Figure 4.1 A data analysis pipeline


90

Perspectives of the analysis step


• We will look at analysis from two perspectives – both of
which constitute ways of finding the data’s voice. They
are:
1. the descriptive: explore the data to:
• bring out its basic features and
• isolate any of those features that might be of particular interest.
2. the inferential: aims to go beyond the descriptive and
to bring out new ideas from it; for example:
• to confirm hypotheses
• make comparisons between sectors of the data.
• Very often, the data has been collected and shaped
especially with some specific inferential purpose in mind.
91

Next
• Most forms of data analysis consist in transforming the data in
some way.
• In the case of both these approaches, we will look at some of
the standard ways in which datasets can be manipulated to
support analysis, to assist decision making and generate
information and insight.
• We will build on the techniques presented in the previous
section on data preparation, and extend them to consider
common techniques for transforming data for analytical
purposes.
• Some of these new techniques can also be used in data
preparation activities, but we are presenting them here as tools
by means of which datasets can be broken down and rebuilt in
useful ways.
92

The role of statistics


• Many scientists, social and physical, might argue that analysis
without numbers is essentially without value – that only through
numbers are the stories that inhabit data revealed.
• Whether this is true or not, much of data analysis is indeed
numerical, and numerical analysis – except in the simplest
cases – means statistics.
• We will briefly discuss a number of statistical tools and
techniques here, but without probing their mathematical
foundations.
• Although statistical analysts should have an understanding of
the techniques they use, they will employ specialised software
packages such as SPSS to do their calculations, so we will not
consider how the measures we discuss are actually calculated.
• We will present them purely in terms of their application to
various types of data analysis.
93

Sharing analysis results


• One of the principal aims of any analysis activity is to
produce results that can be reported on and shared with
others.
• This is most often achieved by using visualisations,
which we will touch on here, but consider in detail in the
next section.
• However, for fairly simple sets of analytical results, just
presenting the figures alone may be enough, provided
these are presented in a clear and helpful manner.
• At the end of this section, then, we will briefly consider
ways in which data may be reshaped to achieve this.
94

3 Descriptive analysis
• Descriptive analysis seeks to describe the basic
features of the data in a study – to describe the data’s
characteristics and shape in some useful way.
• One way to do this is to aggregate the data: that is, if the
data consists of elements on an interval scale, to boil
down masses of data into a few key numbers, including
certain basic statistical measures.
• Compressing the data in this way runs the risk of
distorting the original data or losing important detail.
• Nevertheless, descriptive statistics may provide powerful
indicators for decision making.
• A second way to describe masses of data is through
visualisation techniques.
95

3 Descriptive analysis
3.1 Aggregation for descriptive analysis
• Simple aggregation functions
• 2 examples:
• a large (imaginary) OU module, TX987, part of which is expressed
in Table 4.1.
• And an (imaginary) student’s overall transcript, as shown in
Table 4.2.
96

Example 1
• a large (imaginary) OU module, TX987, part of which is
expressed in Table 4.1.
97

Example 2
• an (imaginary) student’s overall transcript (Table 4.2.)
98

Aggregation functions
• An aggregation function reduces a set, a list of values or
expressions over a set, or over a column of a table, to a
single value, or small number of values.
• Among the most obvious of these are:
• Count: the number of values in the set or list.
• Sum: the sum total of the values in the set or list.
• Max: the largest value from all the values in the set or list.
• Min: the smallest value from all the values in the set or list.
• Average (= mean): obtained by dividing the sum of all the
values in the set or list by the number of values in it.
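The same five functions in pandas, applied to an invented column of marks:

import pandas as pd

marks = pd.Series([85, 72, 64, 91, 58])

print(marks.count())   # 5
print(marks.sum())     # 370
print(marks.max())     # 91
print(marks.min())     # 58
print(marks.mean())    # 74.0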
99

Aggregation functions in analysis packages
• Commonly provided in data processing and analysis
packages.
• SQL provides all five, for example, and you will also see in
the accompanying Notebook how these functions work in
SQL and Python pandas.
• However, remember Section 4 in Part 3 of the module on
handling missing data. It is important to know how the
packages you use handle NULLs or marker values in the
datasets.
• For example, standard SQL ignores NULLs in all these
cases. Here is one example of SQL at work.
100

Aggregation functions in analysis packages: example
• At the OU, a small award is made to students whose
transcript shows they have completed more than four
modules, studied at least 140 credits, and achieved an
overall average mark of 40 or more.
• (No, not really, but just follow us here.)
• The SQL that will give us the descriptive values that will
let us check to see if this student has met these criteria
would be:
SELECT COUNT("Module Code") AS how_many_modules,
SUM(Credit) AS total_credit, AVG(Mark) AS average_mark
FROM Transcript;
101

Aggregation functions in analysis packages: example
SELECT COUNT("Module Code") AS how_many_modules,
SUM(Credit) AS total_credit, AVG(Mark) AS average_mark
FROM Transcript;
• This will return Table 4.3:
102

Aggregation in data warehousing


• In data warehouses used for OLAP activities (mentioned
in Part 1), a topic addressed later in the module, it is
common to precalculate and store many of the aggregate
values for datasets directly in the data warehouse, so that
the overheads of processing are applied once at data
load, rather than each time the values are required.
103

Cross tabulation
• Cross tabulation or crosstab is a process used to reveal
the extent to which the values of categorical variables are
associated with each other.
• The result of a cross tabulation operation is a
contingency table, also known as a cross tabulation
frequency distribution table.
104

Cross tabulation: Example


• suppose we have a set of council spending transaction data
that allocates items to a particular directorate as well as to a
capital or revenue budget.

• Table 4.4 A sample of data from a council spending dataset


105

Cross tabulation: Example


• A cross tabulation of this data could be used to produce
Table 4.5, providing a summary count of the number of
transactions associated with each type of spend (capital or
revenue) by each directorate.

• Table 4.5 A cross tabulation of the council spending data


106

Cross tabulation: Example


• Crosstab functions can also be used to support ‘margin’ calculations (so-called because the calculation results are shown at the margins of the table); for example, to calculate the total number of transactions by Capital or Revenue budget, or by Directorate.

Table 4.6 A cross tabulation of council spending data with margin calculations
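A pandas sketch of the same kind of summary with margin totals, using invented directorate names rather than the module's actual council data:

import pandas as pd

spend = pd.DataFrame({'Directorate': ['Children', 'Children', 'Resources', 'Resources', 'Resources'],
                      'Budget': ['Capital', 'Revenue', 'Revenue', 'Revenue', 'Capital']})

# Counts of transactions per directorate and budget type, with 'All' margin totals
summary = pd.crosstab(spend['Directorate'], spend['Budget'], margins=True)
print(summary)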
107

Cross tabulation: Example

Table 4.6 A cross tabulation of council spending data with margin calculations

• Crosstab summaries can provide a useful analytical tool.


• For example, in a large dataset, any very low margin count
values may represent errors, such as mistyping the name of a
directorate, or alternatively may represent items of interest as
potential outliers.
108

3.2 Statistics for descriptive analysis


• More powerful statistical measures summarise the properties of each variable (i.e. each column) separately.
• This technique is known as univariate analysis.
• 3 main characteristics of a single variable generally
covered:
• the distribution
• the central tendency
• the dispersion.
109

The normal distribution


• If a module is large, and we plot each possible mark against
the number of students who achieved it:

Figure 4.2 A normal distribution


• The normal distribution – the bell curve – is the foundation of statistics.
110

The normal distribution - aggregated


• Another technique: aggregate marks into ranges and plot them instead. This suggests the same bell-shaped distribution.

Figure 4.3 A normal distribution showing the number of students with marks, grouped into mark ranges
111

The normal distribution – real life


• Real-life data is much more likely to be imperfect:

Figure 4.4 A skewed normal distribution with a mean around 20


• Here, the distribution is skewed towards the lower end.
• It is not a perfect normal distribution – something interesting might be happening.
112

Central tendency
• Very few students will get marks far from the average, and statisticians tend not to be interested in outliers
• (although the statistician Nassim Taleb (2007) argued that outliers are the most important feature of a dataset).
• Most statisticians are more interested in what is
happening at the middle of the distribution – the central
tendency.
• Three major statistical measures are used here:
• mean
• median
• mode.
113

Central tendency 3 measures


• the mean is also known as the average
• The median is the middle value of the set.
• to compute the median, sort the values and then
find the central value. For example:
• 15, 15, 15, 17, 20, 21, 25, 36, 99
• The median is: 20.
• The mode is the most frequently occurring value
in the set.
• Thus, in the example above, 15 is the mode.
• There may be more than one modal value; if all values appear only once, they’re all modal values!
• All three measures are quick to compute, so there’s no excuse not to look at them.
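Using the list above in pandas:

import pandas as pd

values = pd.Series([15, 15, 15, 17, 20, 21, 25, 36, 99])

print(values.mean())     # 29.22...
print(values.median())   # 20.0
print(values.mode())     # 15 (returned as a Series, since there can be more than one mode)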
114

The mean vs median


• In a truly normal distribution, the mean = median.
• In a skewed distribution, they are not equal:

Figure 4.5 The skewed normal distribution curve with lines showing a mean = 58.8 and a median = 42.7 overlaid
115

Dispersion
• The dispersion of a dataset is an indication of how
spread out the values are around the central tendency.

Figure 4.6 Two normal distributions, each centred at 50 marks, but with different dispersion
116

Common measures of dispersion


• There are three common measures of dispersion:
• the range
• the variance
• the standard deviation.
117

The range
• The range is the highest value minus the lowest value.
• For example:

15, 15, 15, 17, 20, 21, 25, 36, 99

• the range is 99 − 15 = 84.


• However, this is little used because an outlier (e.g. the 99)
can wildly exaggerate the true picture.
118

The variance & the standard deviation


• The variance and the related standard deviation, which
are measures of how spread out around the mean the
values are, are the accepted measures.
• There is no need to worry about how they are calculated here.
• S1 will have a higher variance and standard deviation than S2.
119

Correlation
• One of the most widely used statistical measures.
• A correlation is a single value that describes how related
two variables are.
• For example, the relationship between age and
achievement on OU modules – with the (dubious)
hypothesis that older students tend to do better in TX987.
• A first step might be to produce a scatterplot of age
against mark for the TX987 dataset, which
(controversially) might reveal something like Figure 4.7.
120

Figure 4.7 A fictitious scatterplot of age against marks
121

Interpreting the scatter plot


• There is some kind of positive relationship between age and mark.
• But there are two questions:
• Strength – what is the degree of this relationship; in other words, to what extent are the two variables correlated?
• Significance – is the relationship real, or is it just the product of chance?
122

Correlation coefficient
• Statistical packages can compute the correlation value, r.
• r will always be between −1.0 and +1.0.
• If the correlation is positive (i.e. in this case, achievement does improve with age), r will be positive; otherwise it will be negative.
• It is then possible to determine the probability that the correlation is real, or just occurred by chance variations in the data. This is known as a significance test (see the following slides).
123

Significance test
• A significance test determines whether the correlation is real, or just occurred by chance variations in the data.
• This is done either automatically, or by consulting a table of critical values of r to get a significance value, alpha (see the next slides).
124

Significance value alpha


• Most introductory statistics texts have a table of critical
values of r to get a significance value alpha, like:
125

Significance value alpha


• Generally, analysts look for a value of alpha = 0.05 or
less, meaning that the odds that the correlation is a
chance occurrence are no more than 5 out of 100.
• The absolute gold standard for these kinds of tests is
alpha = 0.01.
126

Correlation among multiple variables


• The diagonal contains 1.00s (each variable correlates perfectly with itself).
• There is no need to duplicate the upper triangle.

Table 4.7 A correlation matrix showing the r values for possible variable pairings over variables C1–C5
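The pandas corr() method produces exactly this kind of matrix; a sketch with synthetic age and mark data (invented purely to illustrate the shape of the output, not a real result):

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=100)
mark = 40 + 0.5 * age + rng.normal(0, 10, size=100)   # an invented positive relationship

df = pd.DataFrame({'age': age, 'mark': mark})
print(df.corr())   # Pearson r for every pair of columns, with 1.0 on the diagonal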
127

3.3 Visualisation for descriptive analysis
• Sometimes aggregation measures might not tell the whole
story.
• Returning to our student in Table 4.2, the mean score on
its own reveals little.
• Assuming that the student’s results are given in the order
in which the modules were taken, the simple visualisation
in Figure 4.8 offers a more revealing picture.
128

Example 2
• an (imaginary) student’s overall transcript (Table 4.2.)
129

3.3 Visualisation for descriptive analysis

Figure 4.8 Bar graph showing marks on modules for the fictitious student from Table 4.2
130

3.4 Comparing datasets


• It is often necessary to describe datasets in comparison to one another – always ensuring that a meaningful comparison can be made, that is, that the datasets are related in ways such that a comparison would yield some new information.
• Do Activity 4.4 (Practical).
131

3.5 Segmenting datasets


• What and why?
• In data segmentation, a dataset is split into separate
partitions based on properties shared by all members of
each partition.
• One or more operations can then be applied to each
partition.
• Segmenting the data in this way can have two analytical
purposes:
1. looking for other shared characteristics of each group
2. bringing out similarities and differences between the groups.
132

3.5 Segmenting datasets


• Example 1 (Sales team):
• Sales data could be partitioned by region, month, team
member, etc.

In other applications, segments may be based on particular behaviours:
• Example 2 (Web analytics):
• a commonly used, if crude, metric in web analytics splits
website visitors into ‘new’ or ‘returning’ groups. ‘Returning’
visitors may be further grouped by how recently they last
visited the website.
133

3.5 Segmenting datasets


• Example 3 (Marketing):
• market segmentation or consumer segmentation refers to
a strategy in which the potential market for a product is
split into groups depending on how they respond to
different marketing messages, product packaging, pricing
strategies, etc.
• These behaviours may well cut across the boundaries of
traditional demographic groupings (gender, age, job, etc.),
and may be thought to contain members receptive to
particular marketing messages, or be well matched to a
particular product.
134

3.5 Segmenting datasets


Sometimes, an organisation may want to segment its data according to groupings defined by third-party classifications.

• Example 4 (third-party classification)


• the well-known ABC1 categorisation scheme (Ipsos MediaCT, 2009)
categorises households based on the employment of the chief income
earner (CIE);
• another social grade classification is based on the household reference
person (HRP) identified for each household in the UK 2011 census.
• Third-party data brokers run data enrichment services that will augment
an organisation’s customer data with social status data from their
databases;
• organisations can then run analyses that identify whether or not particular
patterns of behaviour appear to exist within particular geodemographic
groups.
• This simple idea – of enriching one dataset with elements from another,
and then summarising the behaviour of members of the first dataset
according to groupings defined in the second – lies at the heart of many
data processing activities.
135

3.5 Segmenting datasets


• In the world of business and enterprise information
systems, data segmentation often plays an important role
in online analytical processing (OLAP – see Part 1).
• Experian, CACI, and similar organisations have access to
vast quantities of information which require automated
segmentation techniques, such as clustering and
classification algorithms.
• Later parts of the module will look at data warehousing,
mining, classification and clustering in more detail.
136

Working with subsets


• value segmentation:
• the most basic form of segmentation
• shared values of attributes determine the segments.
• In an earlier exercise we started with two separate
datasets represented by two tables of data, to which we
applied the same processing.
• But now suppose that the data from the exercise is
instead held in a single table called, say
‘mixed_module_data’, shown in Table 4.12.
137

Table 4.12 Combined QQ223 and QQ224 data into a single table
138

Using SQL to segment the data


• Using SQL it is easy to select the rows of data for the data
subsets that we are interested in:

SELECT Module_code, AVG(Mark)
FROM mixed_module_data
WHERE Module_code = 'QQ223'
GROUP BY Module_code;
139

Combining the results


• Process each subset, then combine the results:

SELECT Module_code, AVG(Mark)
FROM mixed_module_data
WHERE Module_code = 'QQ223'
GROUP BY Module_code
UNION
SELECT Module_code, AVG(Mark)
FROM mixed_module_data
WHERE Module_code = 'QQ224'
GROUP BY Module_code;
140

The split-apply-combine processing pattern
• very common in handling large datasets (Wickham, 2011)
• and often appears in code libraries for data processing
packages.
• SQL allows us to build a version of the split-apply-
combine pattern around the GROUP BY clause
• Python pandas has a similar groupby method. (We’ll
explore these in the next activity.)
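As a taste of the pandas version (a sketch with invented marks; the notebook activity covers it properly):

import pandas as pd

mixed_module_data = pd.DataFrame({'Module_code': ['QQ223', 'QQ223', 'QQ224', 'QQ224'],
                                  'Mark': [62, 74, 55, 81]})

# split on Module_code, apply the mean to each group, combine into one result
average_marks = mixed_module_data.groupby('Module_code')['Mark'].mean()
print(average_marks)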
141

Binning
Predefined bins
• There are often clearly defined groupings or
segmentations to be imposed on data.
• These ranges, often referred to as bins or sometimes as
buckets, are used to split up a continuous range into a
set of discrete, non-overlapping segments that cover the
full range.
• Example: Age ranges in a survey.
• Allocating members of a range to one or other bin is
referred to as binning (or discretisation if the range is
continuous).
• As with most forms of segmentation, all items allocated to
a particular bin can then be treated to the same operation.
142

Defining the bins


• Binning is a non-reversible process and thus represents a
loss of information (unless you explicitly retain the
continuous value alongside the bin value).
• Example: if all we know is that someone is ‘between 20 and 25’, we can’t tell their exact age.
• Vital to ensure that the fenceposts are well defined.
• Identify interval boundaries as inclusive or exclusive.
• The collection of bins must cover every possible value in the entire range.
• Each value in that range must fall into only one bin.
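In pandas, pd.cut makes the fenceposts explicit; a sketch with invented ages:

import pandas as pd

ages = pd.Series([19, 23, 25, 31, 44, 25])

# Fenceposts at 18, 25, 35 and 65; the right-hand edge of each bin is
# inclusive by default, so 25 falls into the (18, 25] bin
bins = pd.cut(ages, bins=[18, 25, 35, 65])
print(bins.value_counts())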
143

Fuzzy classification
• In ‘fuzzy classification’, values may be in more than one
bin.
• For example, a survey of people’s heights might use the bins ‘Very tall’, ‘Tall’, ‘Medium’, ‘Short’, ‘Very short’.
• Into which bin would someone 6ft tall be placed? 20% of
people might consider this ‘Very Tall’, 70% as ‘Tall’, and
10% as ‘Medium’.
• This might be represented by our six-footer having an
entry in the ‘Very tall’ bin but tagged with a 20% marker,
and an entry in the ‘Tall’ bin with a 70% marker, and an
entry in the ‘Medium’ bin with a 10% marker.
144

Imposed bins
• So far, we’ve been discussing bin descriptions that are
decided in advance of allocating data. However, there are
analysis techniques that require bins to be defined based
on the shape of the dataset itself.
• In equal-frequency binning, the fencepost values are
selected such that an equal number of data instances are
placed into each bin.
• This means that the width of the bins may be unequal.
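pandas provides pd.qcut for exactly this; a sketch with an invented population of ages:

import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(0, 90, size=600))

# Three bins with (roughly) equal numbers of people in each;
# the bin edges are chosen from the data, so the bin widths are usually unequal
tertiles = pd.qcut(ages, q=3)
print(tertiles.value_counts())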
145

Equal-frequency binning
• Figure 4.13 shows a graph of the numbers of people of
each age in a population. We’ve superimposed the
fenceposts for three bins that cover the range where each
bin has an equal number of population members in it (600
in this case).

Figure 4.13 Age versus the number of members of the population with that age, with three bins superimposed
146

Problems with equal-frequency binning
• A fencepost might appear in the middle of a group of members,
all with the same age.
• It might prove impossible to allocate meaningful fenceposts; for
example, there could never be 6 distinct bins if the population
members only had ages 14, 20, and 30.
• Therefore, care is needed when creating or applying algorithms
that create bins based on data populations.
• It might be necessary to create bins containing sets of different
sizes.
• Examples:
• In business it is common to analyse first, second, third, and fourth
quarter sales figures, or
• property locations within fixed radii of a reference point, for example
properties within one mile of a police station, between one and three
miles, and between three and five miles.
147

4 Inferential analysis
• Descriptive analysis seeks only to describe the data one
has, usually by means of transforming it in some way.
• Inferential analysis seeks to reach conclusions that
extend beyond it.
• Common examples:
• An inferential analysis attempts to infer whether conclusions drawn
from sample data might apply to an entire population.
• Or an inferential analysis may be used to judge the probability that
an observed difference between groups is real, or has just
happened by chance.
148

4 Inferential analysis
• Data is extensively used to support business processes
and transactions, and also in research.
• When considering users’ requirements for data
management to support research work, it can be useful to
know something about research design.
• So, while this is not a module on research methods or
experimental design, we do believe the following short
discussion of the methods by means of which
experimental data is collected, recorded and analysed is
necessary.
• Some knowledge of these puts the data manager in a
position to understand how research efforts may best be
supported.
149

4.1 Experiments and experimental design
• Formal research methods tend to be based on one of two
types of study: observational studies and experiments.
• In observational studies, data is collected in situations outside the
control of the researcher
Examples: analysing the outcomes of a particular policy or the results
of a marketing campaign.
• In experiments, which are procedures specifically designed to test
a hypothesis or idea about how a particular system operates, the
data to be collected is rigidly specified by the experimenter, and its
collection is strictly controlled.
Example: an experiment to test a new method of electronic learning, with
a control group and a test group
150

The design of experiments


• The vast majority of experiments seek to discover the
relationship (if any) between one or more data elements,
known as the independent variable(s), and one or more
other data elements, known as the dependent
variable(s).
• The values of the independent variables are controlled by
the experimenter, who seeks to establish whether or not
values of the dependent variable change in some regular
way in response to changes in the independent variable.
151

The design of experiments


Example
• suppose that a pharmaceutical company has developed a
drug to improve memory in older people.
• The experimenter starts with a hypothesis that memory
improves with application of the drug and will gather data
to demonstrate (or refute) this hypothesis by getting
together a group of participants, treating them with the
drug and administering a memory test of some kind to
them.
• Here, the independent variable is the application of the
treatment, and the dependent variable is the score in the
memory test.
152

The design of experiments


Two possibilities:
• within-subjects or repeated measures design: all participants
are treated equally, that is they all receive the drug and they
take a series of tests during the treatment (or just ‘before’ and
‘after’ treatment).
• between-subjects designs (more common), in which
participants are divided into two or more groups and receive
different treatments. In the classic medical trial, subjects are
split into a control group and a treatment group. The drug
under test is administered to members of the treatment group
but not to the members of the control group. The outcomes (the
dependent variable) of members of each group (the
independent variable) are measured to identify whether or not
the treatment was a likely cause of a particular outcome.
153

Considerations for between-subjects designs
• Such experiments assume that the only significant difference
between the groups is the drug – the only independent variable
to change across the two groups.
• Subjects must be randomly allocated to the control group and
the treatment group.
• Must be confident that there are no confounding variables –
that is, any other factor (differences between the average age
of participants in each group, for instance) that might influence
outcomes
• The allocation of individuals to groups in such a way that the
only difference between the groups is the independent variable,
is one of the most important aspects of experimental design
and can be extremely difficult to achieve if the condition being
studied is rare or the population groups are very small.
• Once data has been gathered, it must be suitably shaped and
then subjected to statistical analysis.
154

4.2 Shaping the data


• Shaping the data for analysis is usually fairly straightforward.
• However, there will be crucial differences depending on whether a
within-subjects or between-subjects design is being
analysed.
• In the simplest form of our example of the memory-
enhancing drug, the results would be tabulated as in
Table 4.13.
155

4.2 Shaping the data


• Between-subjects analyses require the grouping variable
to be explicitly stated. For example, it is shown in
Table 4.14 as a column of group values where ‘1’ and ‘2’
represent the treatment and control groups, respectively.
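• A sketch of this shape as a pandas DataFrame (the participants and
scores are invented; ‘1’ = treatment and ‘2’ = control, as in Table 4.14):

import pandas as pd

results = pd.DataFrame({'Participant': ['S01', 'S02', 'S03', 'S04'],
                        'Group': [1, 1, 2, 2],
                        'Memory_score': [78, 82, 64, 70]})

# the explicit Group column lets the two groups be separated for analysis
print(results.groupby('Group')['Memory_score'].mean())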
156

4.2 Shaping the data


• Of course, it is also possible to take repeated measures of
subjects in a between-subjects design in order to
measure progression under treatment, as shown in
Table 4.15.
157

4.3 Statistics for inferential analysis


• Statisticians have available to them an immense battery
of statistical tests and techniques for inferential analysis
• Among them a family of statistical models known as
the general linear model (GLM).
• We can do no more here than scratch the surface of
these.
158

The General Linear Model (GLM) family
• Let’s return to our example of a pharmaceutical company
testing a new memory drug.
• Data in the form of scores on a memory test from the
treatment group and the control group have been
gathered.
• But the world is generally far too messy and chaotic for
simple relationships between variables to exist:
• within each group there will be much variation, because individuals
respond in different ways.
• Within both groups, we would expect the test scores to be
distributed in something like the classic bell curve.
• If we superimpose the score distributions of both groups
on the same graph, we might get something like
Figure 4.14.
159

• It looks as if the drug improves memory, as the two groups have different means.
• However, there is a fair amount of overlap between them.
• The experimenter will want to be fairly sure that the differences are not due to
random variation, but to the effect of the drug.
• Put another way, they will want to show that the probability of the difference
between the means being due to chance is very low.
• This probability – the alpha value – is usually set to 0.05 (5 in 100) but, as with
the correlation statistics we discussed earlier, the gold standard is 0.01 (1 in 100).
160

The GLM Family: the t-test


• One of the simplest statistical tests, and the most
appropriate to apply here, is known as the t-test.
• The t-test works by calculating a t-value for the difference
between the means of the two groups. (Details of how this
is calculated can be found in any statistics textbook, or
online.)
• Then, with the alpha value (and another value known as
the degrees of freedom, df), the significance of the t-
value can be looked up in a table of standard values.
• (Alternatively, a statistics package will both do the
calculations and find the significance for you.)
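• For instance, a minimal sketch using SciPy (the scores below are
invented): scipy.stats.ttest_ind returns both the t-value and its
p-value for two independent groups, so no table lookup is needed.

from scipy import stats

treatment_scores = [68, 72, 75, 80, 77, 71, 74]
control_scores = [65, 70, 66, 72, 69, 64, 68]

# independent-samples t-test on the two groups of memory-test scores
t_value, p_value = stats.ttest_ind(treatment_scores, control_scores)

# if p_value falls below the chosen alpha (e.g. 0.05), the difference in
# means is judged unlikely to be due to chance alone
print(t_value, p_value)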
161

The GLM Family: ANOVA (& other tools)


• As we stated, the t-test is perhaps the simplest test in the GLM
family.
• For more complex experiments, involving several variables, a
technique known as analysis of variance (ANOVA) is used.
• This calculates the significance of interactions between all pairs
of variables in a multi-variable experiment.
• Other GLM tools include:
• analysis of covariance (ANCOVA)
• regression analysis
• factor analysis
• multidimensional scaling
• cluster analysis, and
• discriminant function analysis
• some of which we look at in later parts of the module.
162

5 Working with textual data


• ‘Non-numerical data’ essentially means data values that
are text strings.
• Text strings are notoriously messy.
• When we looked at data cleaning and preparation we
met:
• strings of differing format or type, or
• strings that did not meet some expected pattern:
• wrong date representations
• $ and £ prefixes
• postcodes with and without spaces
• and so on.
163

5 Working with textual data


• There are numerous analytical tasks that will involve
working with datasets that contain strings, among them
might be:
• Discovery: to discover and mark all strings that contain a certain
sequence of characters.
• Replacement: to find and replace occurrences of strings containing
certain substrings with updated versions.
• Extraction: to extract the pattern matched from the text string and
return it in a variable.
164

Examples
• To clarify this, consider two rather simplified examples:
1. Genetic databases
• Structurally, the DNA molecule consists of two spiral chains of sugar and phosphate
molecules, bound together by pairs of nitrogen bases: guanine (G), adenine (A), cytosine (C)
and thymine (T). Guanine only binds with cytosine; adenine only binds with thymine.
• It is the precise sequence of these base pairs that bestows on every individual their
individuality. Thus, an individual’s unique biological identity can be expressed as a string,
many hundreds of millions of characters long, of the characters C, A, G, T: e.g.
CGAGGGTAATTTGATG ….
• Certain areas (known as loci) of the human genome are highly variable between individuals.
Thus DNA analysis may be used to pick out individuals stored in a DNA database, possibly
containing millions of profiles, using DNA fragments of a few groups of bases –
GAGTGACCCTGA, for example – taken from certain loci of DNA recovered from a crime
scene, say.
2. Codes
• If some boundary changes led to a revision of postcodes it might be necessary to find all
instances of partial postcodes (MK6 xxx, MK7 xxx, MJ8 xxx, MJ9 xxx) and amend these in
some way. Or, identify vehicles from partial number plate data, e.g. M?16??T.

• All these kinds of analysis depend on what are known as regular expressions.
165

5.1 Regular expressions


• Regular expression: at its simplest, a string of one or more characters to
be discovered within a text or string.
Example: ‘abc’ will match ‘abc’ inside a string such as ‘hello abc world’.
• The . wildcard: the ‘.’ symbol matches any single character inside a
string.
Example: ‘a.c’ will match ‘abc’, ‘adc’ and ‘a9c’, but not ‘abbc’ or ‘a9sc’.
• Sets and ranges: matching one of a specified set of characters.
Examples: ‘a[bcd]c’ will match ‘abc’, ‘acc’ and ‘adc’, but not ‘abdc’;
‘I am 2[1234] years old’ will match any string beginning ‘I am 2’ and
ending ‘ years old’, where the age is one of ‘21’, ‘22’, ‘23’ or ‘24’.
• Repetitions: matching one or more repetitions of a pattern within a string.
Example: ‘a+bc’ will match ‘abc’, ‘aabc’, ‘aaabc’, etc., but not ‘aabbc’.
• There are many other possibilities and variations on these themes
covered in the following Notebook activity.
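• As a minimal sketch in Python (the example strings are invented),
the re module covers the discovery, replacement and extraction tasks
listed earlier:

import re

text = 'hello abc world, I am 23 years old'

# discovery: does the text contain a three-character a.c pattern?
print(re.search(r'a.c', text) is not None)         # True, matching 'abc'

# extraction: pull out the matched age phrase
match = re.search(r'I am 2[1234] years old', text)
if match:
    print(match.group())                           # 'I am 23 years old'

# replacement: rewrite every 'abc' substring
print(re.sub(r'abc', 'xyz', text))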
166

6 Reshaping datasets for reporting


• We have already considered several ways in which tables
may be reshaped for analysis – removing rows or
columns, joining tables, etc.
• However, it may also be necessary to carry out other
reshaping operations during analysis (and, as you will see
in the next part of the module, for reporting and
visualisation purposes). Here are some examples:
• Transposing rows and columns
• Wide versus long format
• Hierarchies
167

6.1 Transposing rows and columns


• One of the most common transformations is to transpose
the rows and columns, so that the rows become columns
and the columns become rows. For example, consider
Table 4.16.

• Table 4.16 can be converted into Table 4.17 by
transforming N + 1 rows (N data rows, plus a header row) and
two columns into a table with one data row and N columns, plus an
‘index’ column identifying the property the row relates to.
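• In pandas this transposition is a one-liner (a sketch; the module codes
and points values are invented stand-ins for Table 4.16):

import pandas as pd

modules = pd.DataFrame({'Module': ['TM351', 'TM352', 'TM354'],
                        'Points': [30, 60, 30]})

# make the Module column the index, then swap rows and columns
transposed = modules.set_index('Module').T
print(transposed)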
168

169

Notice that in both the inverted forms we run into problems if
module codes in Table 4.16 can appear several times, or if the
points size is repeated for more than one module. Does Table 4.19
become Table 4.20?
170

171

Or do we create extra ‘Points’ rows for the duplicate module
codes as shown in Table 4.21?
172

Or do we list the different points values for TM351 in the same
cell as in Table 4.22?
173

Obviously, the answer will depend on the analysis requirements.
However, in general when transposing a table like this it is the
comma-separated list form, shown in Table 4.22, that is used.
174

6.2 Wide versus long format


• The following datasets, in Tables 4.23 and 4.24, both
contain the same information, but represented in different
ways, each with a different shape.
175
176

• Table 4.23 is often referred to as wide (or unstacked) format.
• One or more columns act as indexes to identify a
particular entity, with additional columns representing
attributes of that entity.
177

Table 4.24 is referred to as long, narrow or stacked format. Again,
one or more index columns identify each entity, a second column
(Variable in this case) names an attribute, and a third (Value) column
records the value of the attribute. The format is also known as a
‘triple store’ (object_identifier, attribute_name, attribute_value), or
sometimes as O-A-V format.
178

Choosing the best shape


• Knowing how and when to transform a dataset:
• from a wide to long format through a process of melting
or stacking the data, or
• from a long to a wide format, sometimes referred to as
casting or unstacking
• are useful skills for the analyst when reporting on data
using a required layout, or with a visualisation tool that
requires the data to be presented in a specific form.
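• A minimal sketch of both directions in pandas (the column names and
values are invented): melt goes from wide to long, and pivot goes back:

import pandas as pd

wide_df = pd.DataFrame({'Student': ['P123', 'Q456'],
                        'TMA01': [72, 55],
                        'TMA02': [68, 61]})

# wide -> long ('melting' / 'stacking'): one row per Student/Variable pair
long_df = wide_df.melt(id_vars='Student',
                       var_name='Variable', value_name='Value')
print(long_df)

# long -> wide ('casting' / 'unstacking')
wide_again = long_df.pivot(index='Student', columns='Variable',
                           values='Value')
print(wide_again)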
179

6.3 Hierarchies
• Table 4.25 shows a fragment of an Olympic medal table in a
traditional tabular view.

• Table 4.25 looks fairly chaotic. If we are interested in the performance
of nations, for example, information could be gleaned more
straightforwardly by grouping rows hierarchically, as in Table 4.26.
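• One way to produce this hierarchical grouping in pandas (a sketch; the
medal data is invented) is a MultiIndex built with set_index:

import pandas as pd

medals = pd.DataFrame({
    'Country': ['USA', 'USA', 'GBR', 'GBR'],
    'Event': ['100m', 'Rowing', '100m', 'Rowing'],
    'Medal': ['Gold', 'Silver', 'Bronze', 'Gold']})

# group the rows hierarchically by Country, then by Event
by_country = medals.set_index(['Country', 'Event']).sort_index()
print(by_country)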
180

6.3 Hierarchies

• If we are focusing on the success (or otherwise) of each
country in each individual event, Table 4.27 might be more
informative.
181

6.3 Hierarchies

• Notice, however, that the table reshaping is now supporting a
particular interpretation or reading of the data – it has a more
structured presentation related to a given purpose, but is now
harder to read for other purposes.
182

7 Summary
In this part, you have learned about:
• descriptive analysis of data, including transforming,
aggregating and visualising datasets, together with some common
statistical measures for descriptive analysis
• inferential analysis of data, including experimental design,
shaping data and statistical measures for inferential analysis
• the shaping of data for reports, including the use of
OpenRefine as a data shaping tool.
• Practically you will have worked with:
• SQL and pandas to manipulate tabular data in several ways
• simple pandas visualisations to produce scatter and bar plots
• regular expressions to process text strings.
• In the next part of the module, we will consider how to present
reports on data investigations and your findings, and to use
more complex visualisations.
183

ACTIVITIES
For part 4
184

Activity 4.1 Notebook


• 15 minutes
• In spreadsheets and other data reporting packages, the
crosstab functions are usually supported by the ‘pivot
table’ tools for reshaping tables and embedding
summaries, subtotals and totals into report tables. They
allow a wide range of table reshaping and summarisation
to be applied, usually in a drag-and-drop or menu-driven
manner.
• Work through Notebook 04.1 Crosstabs and pivot tables.
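• As a taster (a sketch only; the sales figures are invented), pandas offers
pd.pivot_table for this kind of reshaping with embedded summaries:

import pandas as pd

sales = pd.DataFrame({'Region': ['North', 'North', 'South', 'South'],
                      'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
                      'Sales': [100, 120, 90, 130]})

# one row per Region, one column per Quarter, cells holding total sales,
# with row and column totals added via margins=True
report = pd.pivot_table(sales, index='Region', columns='Quarter',
                        values='Sales', aggfunc='sum', margins=True)
print(report)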
185

Activity 4.2 Notebook


• 15 minutes
• Work through Notebook 04.2 Descriptive statistics in
pandas, which looks at some basic statistical methods
applied to pandas DataFrames.
186

Activity 4.3 Notebook


• 20 minutes
• Work through Notebook 04.3 Simple visualisations in
pandas, which demonstrates some simple ways to chart
the values in a data frame.
187

Activity 4.4 Practical


• 10 minutes
• Tables 4.8 and 4.9 show two separate datasets (available
in files QQ223.CSV, and QQ224.CSV in the Part 4
Notebook’s data folder).
188

Tables 4.8 and 4.9 Student marks on QQ223 and QQ224
189

• Complete Table 4.10 by calculating values using the datasets
shown in Tables 4.8 and 4.9; you could use a spreadsheet or
pandas Notebook to help.

• Assuming the pass mark on each module was 40, compare the
modules and the performance of students on them. Is there
anything that might suggest there is scope for further
exploration?
• Hint: it may help to draw a chart of the sorted marks for each
student and the pass mark line, then do an alphabetic sort on
the student identifier before drawing the charts.
• Optional: See Notebook 04.4 Activity 4.4 Walkthrough to help
with this activity
190

Discussion

Table 4.11 The completed comparative statistics table


• The differences between the average mark on each
module seem significant: the much higher average on
QQ224 might be the result of a few students with very
high marks, pulling the mean away from the centre of the
distribution. Once again, a chart (Figure 4.11) is helpful
here.
191

Discussion

Figure 4.11 A bar chart showing each student’s mark on two modules
• This shows that students (with a couple of exceptions) perform
consistently better on QQ224 (red bars).
• Moreover, in QQ223 only one student gets above 60 marks, with the
majority of the students getting over 40, whereas in QQ224 the
majority of students gain over 60 marks.
• However, if we now sort the results by student identifier, as in
Figure 4.12, another problem is revealed.
192

Discussion

Figure 4.12 The previous bar chart sorted by student identifier

• It is immediately obvious that in both modules students whose identifiers
begin with Q are performing extremely poorly, and R students seem slightly
weaker than P students, suggesting that the relative performance of these
groups should be explored more deeply. Of course, for such a small sample
this may just be chance, but, for example, this might trigger an analysis of
larger datasets or different modules, and an exploration of the significance of
the starting letter of the student number.
193


Activity 4.5 Notebook


• 30 minutes
• Work through Notebook 04.5 Split-apply-combine with
SQL and pandas, which walks through some summaries
and questions that might be covered by the split-apply-
combine analysis, applied to a table of sales data.
196

Activity 4.6 Notebook


• 30 minutes
• Work through Notebook 04.6 Introducing regular
expressions.
197

Activity 4.7 Notebook


• 30 minutes
• Work through Notebook 04.7 Reshaping data with
pandas.
• Table reorganisation of this kind can also be achieved
using the OpenRefine tool.
• The following screencast shows this process for the
Olympic medal table.
198


EXERCISES
For part 4
202

Exercise 4.1 Self-assessment


• 3 minutes
• What kind of descriptive measures do you think a module
team chair might want for the data in Table 4.1?
• Discussion
• It is likely that the module team chair would want to know
about such features as the average mark, what the best
(and worst) marks were, something that would indicate
whether the module was too challenging (or not
challenging enough) for students. Possibly they might
want to know if age or gender had any effect on the
marks, and so on.
203

Exercise 4.2 Self-assessment


• 2 minutes
• Is information ‘destroyed’ if we replace a full dataset with
a summary report (e.g. sum total, average value) of the
values in one or more data columns?
• Discussion
• Yes, information is destroyed, firstly because information
about columns not summarised disappears. Secondly, we
can’t recreate information from the result of the
summarising operation applied to it. Applying the SUM
operation to a column with values [2, 3, 4] returns a single
value 9. But from this we cannot know (a) how many
elements were in the original column and (b) the
individual values they took. Information has been lost.
204

Exercise 4.3 Self-assessment


• 10 minutes
a. What does the graph in Figure 4.9 of the prices of
products sold in an online store against the numbers of
units sold in that price range show?

Figure 4.9 Number of sales versus price plotted


205

Exercise 4.3 Self-assessment

Discussion
• The figure shows a skew towards both high-priced and low-
priced goods sold, with comparatively less in-between – a kind
of inversion of the normal distribution curve discussed earlier.
Statisticians refer to this as a bimodal distribution. However,
note should be taken of the differences in the size of each price
range.
206

Exercise 4.3 Self-assessment


b. To encourage its customers to consider ways of saving
energy, a power company produces a graph of energy use over the
previous 12 months [averaged every three months], and it
can include monthly average daytime temperatures for
comparison. A typical graph is shown in Figure 4.10.
• What criticism could you make of this graph
representation?
207

Exercise 4.3 Self-assessment

• Figure 4.10 Graph showing the amount of energy used
each month and average monthly temperature
208

Exercise 4.3 Self-assessment


Discussion
• The power company is collecting aggregated (summary)
data (the total amount of energy used in a three-month
period) but appears to be showing it on the graph as if
they had a record of the monthly amounts of energy used.
• The use of summary data and the implicit loss of
information at the monthly level means that there can be
no confidence in the monthly points plotted on the chart.
209

Exercise 4.4 Self-assessment


• 3 minutes
• What do you think the following SQL will do?

SELECT Module_code, COUNT(Student) AS how_many_students
FROM mixed_module_data
GROUP BY Module_code;
210

Discussion
• This results in a table with a single row for each distinct
module code from the mixed module data table. Each row
shows the module code and a count of the number of
students who completed that module. It can be described
as producing a row of data for each unique module code
in the mixed module data table, each row consisting of the
module code and the number of students on that module.
• If you’re not sure how this query would be evaluated, the
following is a description of the logical processing for the
SQL (we’ve illustrated it with tables at each step).
• (Kindly refer to the module learning materials for detailed
steps)
211

Exercise 4.5 Self-assessment


• 5 minutes
• Can you think of any other examples of binning you’re familiar
with? Remember this is using a discrete set of bins in place of
a value in a range. If you know of any unusual examples, share
them in the module forum.
• Discussion
• As a student on an OU module you will know that you get a
module mark between 0 and 100 based on your assignments
and examination/project marks. However, you may also get a
classification that reduces this mark range to a series of bins,
which (depending on your module) may be ‘distinction’, ‘merit’,
‘pass’ or ‘fail’.
• Buyers of clothes generally purchase ‘small’, ‘medium’, ‘large’,
‘x-large’ or ‘12’, ‘14’, ‘16’, ‘18’, ‘20’ and not a size based on their
exact measurements.
212

Exercise 4.6 Self-assessment


• 3 minutes
• What are the problems with the following description of age-
related bin descriptions?
• 0–18, 18–34, 36–40, 41–49, 50+
Discussion
• There are three clear problems with the bin fenceposts. Where
does an 18-year-old go, where does a 35-year-old go, and
where does a 50-year-old go?
1. 18-year-olds seem to belong to two bins
2. a 35-year-old has no home – part of the range is missing
3. 50+ could be taken to mean ‘over 50 but not 50’ or ‘50 and
above’.
• Wider problem: would someone 40 years and 6 months old be
considered 40 or 41? A clear understanding of the fencepost
interpretation is required.
