Session 2 Short
TM351
Data management and analysis
3
SESSION 2
Parts 3 and 4
5
Session Overview
1. Part 3: Data preparation
2. Part 4: Data analysis
6
PART 3
Data preparation
9
Data preparation
Purpose:
• Convert acquired ‘raw’ datasets into valid, consistent data, using structures and
representations that will make analysis straightforward.
Initial Steps:
1. Explore the content, values and the overall shape of the data.
2. Determine the purpose for which the data will be used.
3. Determine the type and aims of the analysis to be applied to it.
Data preparation
Activities:
1. Data cleansing: removing or repairing obvious errors and inconsistencies in the dataset
2. Data integration: combining datasets
3. Data transformation: shaping datasets
Note:
• Some of the techniques used in data preparation – especially in transformation and
integration – are also used to manipulate data during analysis
• Conversely, some analysis techniques are also used in data preparation.
Looking ahead:
This week you will look first at some basic data cleansing issues that apply to single and
multiple tabular datasets, and then at the processes used to combine and shape them:
selection, projection, aggregation, and joins. Many of these techniques can also be
straightforwardly applied to data structures other than tables.
13
2 Data cleansing
Data cleansing is the process of:
• detecting and correcting errors in a dataset
• removing irrelevant parts of the data – we will look at this later in the section.
• Having found errors – incomplete, incorrect, inaccurate or irrelevant data – a decision must be made about how to handle them.
14
Most operational systems try to keep ‘dirty’ data out of the data store, by:
• Input validation
• Database constraints
• Error checking
However, despite these efforts, errors will occur.
15
Validity
Accuracy
Completeness
Consistency
• If two values should be the same but are not, then there is
an inconsistency. So, if the two rows with ‘John Smith’ and
‘J. Smith’ do indeed represent a single individual, John
Smith, then the data for that individual’s monthly income is
inconsistent.
21
Uniformity
Tables 3.6 and 3.7: Unsorted monthly sales data (Table 3.6) and the same data sorted by
month order (Table 3.7)
With the data sorted by month, it’s relatively easy to see the gradual
decline in the TotalAmount values over the year. It’s much harder to see
this trend in the unsorted data.
42
Missing values can be represented using:
• Not a Number (NaN) and Not a Time (NaT) values
• the NULL marker
• the None value
• others…
51
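As a minimal sketch of these markers in pandas (the member names and values below are invented, not taken from the module's datasets):

```python
# How pandas represents missing values with None, NaN and NaT.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'member': ['John Smith', 'J. Smith', None],            # None marks a missing string
    'monthly_income': [2100.0, np.nan, 1850.0],             # NaN marks a missing number
    'joined': [pd.Timestamp('2021-03-01'), pd.NaT, pd.NaT]  # NaT marks a missing datetime
})

print(df)
print(df.isna())   # True wherever a value is missing, whichever marker is used
```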
Null
• In an earlier example used in Notebook 03.3 Combining
data from multiple datasets, we used the SQL OUTER
JOIN to create rows with missing data, putting the SQL
NULL marker in place of the missing values.
Table 3.13 The outer join table for the small sports club
53
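A hedged sketch of the same idea in pandas: the member and team tables below are invented stand-ins for the module's small sports club example, but the outer merge fills the unmatched rows with missing-value markers in just the way the SQL OUTER JOIN fills them with NULL.

```python
import pandas as pd

members = pd.DataFrame({'member_id': [1, 2, 3],
                        'name': ['Ann', 'Bob', 'Cara']})
teams = pd.DataFrame({'member_id': [2, 3, 4],
                      'team': ['Hockey', 'Tennis', 'Squash']})

# how='outer' keeps unmatched rows from both sides, with NaN in the gaps.
outer = members.merge(teams, on='member_id', how='outer')
print(outer)
```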
None
• Finally, if NaN and NaT values are being used, then strict
rules for the data types in a column are probably being
broken. Should we go back and consider what to do with
the original data?
58
EXERCISES
For Part 3
62
Discussion
• Firstly, you would need to spend time understanding what each table shows: understanding
what the data rows represent and the way in which the values in the columns are to be
interpreted.
• Ideally you would have a lot of sample data, supported by descriptive documentation,
available for this review stage; what we’ve supplied here is deliberately short so that we can
highlight some key questions we might ask.
• Company X uses fairly standard table components – even if we’re not sure of their exact
interpretation. There are also insufficient data values in some columns to get a sense of the
range of possible values that may occur there. For example, the ‘Priority’ column only has the
value ‘Top’, and we have no way of inferring other values that might appear in that column.
The ‘Cno’ column – assuming an interpretation of it being a unique numeric value
representing a customer number – would allow us to infer other possible numeric values for
that column.
• Company Y appears to be using a complex string representation for contact details that
combines the email and address into one field. This appears to be semicolon-separated,
using a tag of the form ‘label: string’ for each component. The ‘Id’ column is a five-digit string with
leading zeros, and for ‘Gender’ we can infer a second value of ‘M’, although there may be
more values permitted. The ‘Class’ column might relate to Company X’s ‘Priority’, but without
further information this would be a guess; even if it does relate to priority, we’ve no idea whether ‘1’
is a high or low priority, or how this might relate to the ‘Top’ used by Company X. Finally,
Company Y uses a single ‘Name’ field, which appears to be split (using a comma) into
surname and first name – it’s not possible to say what might happen to multi-part names.
• Without a lot of additional information it would be impossible to suggest a robust harmonised
form for the data – the only fields where this would appear possible are:
68
Discussion (Cont.)
Table 3.4 Data fields from Tables 3.2 and 3.3 that could be harmonised
• It might be possible simply to put leading zeros in front of the
‘Cno’, provided of course that the Cno range didn’t overlap the
values of the ‘Id’ column. But if this forces Company X
customers to log in using customer numbers with leading
zeros, then they would need to be told of this change.
• In operational systems, any attempts to harmonise will usually
impact on the existing systems, requiring maintenance updates
to allow existing applications to mesh with the harmonised
datastores.
69
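A hedged sketch of the leading-zero harmonisation suggested above, using pandas string padding; the ‘Cno’ values below are invented for illustration.

```python
import pandas as pd

company_x = pd.DataFrame({'Cno': [231, 4872, 19]})

# Pad the numeric customer numbers to five-digit strings with leading zeros,
# matching the format assumed for Company Y's 'Id' column.
company_x['Cno_harmonised'] = company_x['Cno'].astype(str).str.zfill(5)
print(company_x)
```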
ACTIVITIES
For Part 3
70
Discussion
• Note: the following is based on my [The author's] notes as I read through the papers – this
will differ from what others reading the same papers might choose to note; we should be
seeing the same kinds of things in the paper but we might attach a different level of
significance to the things we see.
• Rahm and Do (2000) break down the issues initially into those related to single- or multi-
source data, and at a lower level distinguish between schema (data model and description)
and instance (data values). Single-source schema issues usually result from a lack of
adequate constraints to prevent incorrect data appearing, and fixing them requires schema changes. For
multi-source data, the issues include combining multiple schemas, which requires schema
harmonisation. At the instance level, the single-source issues are generally reflections of
poor constraints, or simply erroneous – but plausible – data entry. The multi-source
issues include differences in units or data types for comparable values, differences in
aggregation levels for grouped data items, and the challenge of identifying when multiple
values refer to the same ‘thing’ that they label – overlapping data.
• Kim et al. (2003) take a different approach to Rahm and Do, with a top-level description of
how dirty data appears: ‘missing data, not missing but wrong data, and not missing and not
wrong but unusable’. They then break each of these descriptions down, using categories
similar to the single versus multiple source distinctions of Rahm and Do (see Section 2.1.2 of
the paper, which in the version that I read is incorrectly indented). Their taxonomy is very
description-based: its leaf nodes are specific issues of specific types of problem; in
contrast, Rahm and Do focus more on classes of problems (based on where they occur).
• In their final bullet point of Section 1, Kim et al. (2003) state clearly that they do not intend to
discuss metadata issues and in this they include independent design of multiple data sources
– so this paper addresses what Rahm and Do label ‘instance data’.
73
PART 4
Data analysis
88
Workload
• This part of the module is split between reading, exercises, and notebook
activities.
• There are two largely independent pieces of work to be completed this week:
• studying the module content, exercises and activities
• practical work in which you will use OpenRefine, regular expressions, SQL and
Python.
• During this part of the module you will work through six notebooks, looking at
Python’s pandas and developing skills in reading, writing and manipulating
content in different file formats.
• Activity 4.1 uses 04.1 Crosstabs and pivot tables (15 minutes).
• Activity 4.2 uses 04.2 Descriptive statistics in pandas (15 minutes).
• Activity 4.3 uses 04.3 Simple visualisations in pandas (20 minutes).
• Activity 4.4 uses 04.4 Activity 4.4 Walkthrough (10 minutes).
• Activity 4.5 uses 04.5 Split-apply-combine with SQL and pandas (30 minutes).
• Activity 4.6 uses 04.6 Introducing regular expressions (30 minutes).
• Activity 4.7 uses 04.7 Reshaping data with pandas (30 minutes).
• In addition there is a screencast in Activity 4.7 (20 minutes), which shows how
OpenRefine is used to reshape a table.
89
Next
• Most forms of data analysis consist in transforming the data in
some way.
• For both descriptive and inferential analysis, we will look at some of
the standard ways in which datasets can be manipulated to
support analysis, assist decision making and generate
information and insight.
• We will build on the techniques presented in the previous
section on data preparation, and extend them to consider
common techniques for transforming data for analytical
purposes.
• Some of these new techniques can also be used in data
preparation activities, but we are presenting them here as tools
by means of which datasets can be broken down and rebuilt in
useful ways.
92
3 Descriptive analysis
• Descriptive analysis seeks to describe the basic
features of the data in a study – to describe the data’s
characteristics and shape in some useful way.
• One way to do this is to aggregate the data: that is, if the
data consists of elements on an interval scale, to boil
down masses of data into a few key numbers, including
certain basic statistical measures.
• Compressing the data in this way runs the risk of
distorting the original data or losing important detail.
• Nevertheless, descriptive statistics may provide powerful
indicators for decision making.
• A second way to describe masses of data is through
visualisation techniques.
95
3 Descriptive analysis
3.1 Aggregation for descriptive analysis
• Simple aggregation functions
• Two examples:
• a large (imaginary) OU module, TX987, part of which is expressed
in Table 4.1
• an (imaginary) student’s overall transcript, as shown in
Table 4.2.
96
Example 1
• a large (imaginary) OU module, TX987, part of which is
expressed in Table 4.1.
97
Example 2
• an (imaginary) student’s overall transcript (Table 4.2)
98
Aggregation functions
• An aggregation function reduces a set, a list of values or
expressions over a set, or a column of a table, to a
single value or a small number of values.
• Among the most obvious of these are:
• Count: the number of values in the set or list.
• Sum: the sum total of the values in the set or list.
• Max: the largest value from all the values in the set or list.
• Min: the smallest value from all the values in the set or list.
• Average (= mean): obtained by dividing the sum of all the
values in the set or list by the number of values in it.
99
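A minimal pandas sketch of the aggregation functions listed above, applied to an invented column of marks (not the real TX987 data):

```python
import pandas as pd

marks = pd.Series([62, 48, 91, 75, 55])

print(marks.count())   # Count: the number of values
print(marks.sum())     # Sum: the total of the values
print(marks.max())     # Max: the largest value
print(marks.min())     # Min: the smallest value
print(marks.mean())    # Average (mean): sum divided by count
```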
Cross tabulation
• Cross tabulation or crosstab is a process used to reveal
the extent to which the values of categorical variables are
associated with each other.
• The result of a cross tabulation operation is a
contingency table, also known as a cross tabulation
frequency distribution table.
104
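A hedged sketch of cross tabulation with pandas.crosstab, using invented categorical data rather than the module's tables:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
                   'result': ['Pass', 'Pass', 'Fail', 'Pass', 'Pass', 'Fail']})

# The contingency table counts how often each pair of category values occurs.
print(pd.crosstab(df['gender'], df['result']))
```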
Central tendency
• Very few students will get marks far from the average, and
statisticians tend not to be interested in outliers.
• The statistician Nassim Taleb (2007), however, argued that outliers are
the most important feature of a dataset.
• Most statisticians are more interested in what is
happening at the middle of the distribution – the central
tendency.
• Three major statistical measures are used here:
• mean
• median
• mode.
113
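A minimal sketch of the three central tendency measures in pandas, computed over an invented set of marks:

```python
import pandas as pd

marks = pd.Series([40, 55, 55, 62, 70, 70, 70, 85])

print(marks.mean())    # mean: the arithmetic average
print(marks.median())  # median: the middle value when sorted
print(marks.mode())    # mode: the most frequent value(s) - may return more than one
```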
Dispersion
• The dispersion of a dataset is an indication of how
spread out the values are around the central tendency.
The range
• The range is the highest value minus the lowest value.
• For example: if the highest mark on a module is 91 and the lowest is 35, the range is 56.
Correlation
• One of the most widely used statistical measures.
• A correlation is a single value that describes how related
two variables are.
• For example, the relationship between age and
achievement on OU modules – with the (dubious)
hypothesis that older students tend to do better in TX987.
• A first step might be to produce a scatterplot of age
against mark for the TX987 dataset, which
(controversially) might reveal something like Figure 4.7.
120
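A hedged sketch of computing a range and producing the kind of age-against-mark scatterplot described above, on invented data rather than the TX987 dataset:

```python
import pandas as pd

df = pd.DataFrame({'age': [21, 25, 32, 40, 48, 55],
                   'mark': [52, 58, 61, 66, 70, 74]})

value_range = df['mark'].max() - df['mark'].min()   # range: highest minus lowest
print(value_range)

# A quick scatterplot to eyeball any relationship before computing r.
df.plot.scatter(x='age', y='mark')
```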
Correlation coefficient
• Statistical packages can compute the correlation value, r.
• r will always be between −1.0 and +1.0.
• If the correlation is positive (i.e. in this case,
achievement does improve with age), r will be positive;
otherwise it will be negative.
• It is then possible to determine the probability that the
correlation is real, or just occurred by chance variations in the
data. This is known as a significance test. This is generally
done either automatically, or by consulting a table of critical
values of r to get a significance value alpha. Most introductory
statistics texts would have a table like this. Generally, analysts
look for a value of alpha = 0.05 or less, meaning that the odds
that the correlation is a chance occurrence are no more than 5
out of 100. The absolute gold standard for these kinds of tests
is alpha = 0.01.
123
Significance test
• A significance test determines the probability that
the correlation is real, or has just occurred by chance
variations in the data.
• Done either automatically, or by consulting a table of
critical values of r to get a significance value alpha.
• Most introductory statistics texts would have a table like
this. Generally, analysts look for a value of alpha = 0.05 or
less, meaning that the odds that the correlation is a
chance occurrence are no more than 5 out of 100. The
absolute gold standard for these kinds of tests is alpha =
0.01.
124
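One way to obtain both r and its significance is scipy's pearsonr; this is a hedged sketch on the same invented age/mark data as before, not the module's own analysis:

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({'age': [21, 25, 32, 40, 48, 55],
                   'mark': [52, 58, 61, 66, 70, 74]})

r, p_value = pearsonr(df['age'], df['mark'])
print(r)         # correlation coefficient, between -1.0 and +1.0
print(p_value)   # compare against the chosen alpha (e.g. 0.05, or 0.01 for the gold standard)
```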
Example 2
• an (imaginary) student’s overall transcript (Table 4.2)
129
Binning
Predefined bins
• There are often clearly defined groupings or
segmentations to be imposed on data.
• These ranges, often referred to as bins or sometimes as
buckets, are used to split up a continuous range into a
set of discrete, non-overlapping segments that cover the
full range.
• Example: Age ranges in a survey.
• Allocating members of a range to one or other bin is
referred to as binning (or discretisation if the range is
continuous).
• As with most forms of segmentation, all items allocated to
a particular bin can then be treated to the same operation.
142
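A minimal sketch of binning ages into predefined, non-overlapping ranges with pandas.cut; the bin edges and labels below are invented for illustration:

```python
import pandas as pd

ages = pd.Series([17, 23, 35, 41, 58, 64, 72])

bins = [0, 18, 30, 45, 60, 120]                       # the fencepost values
labels = ['<18', '18-29', '30-44', '45-59', '60+']
print(pd.cut(ages, bins=bins, labels=labels, right=False))
```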
Fuzzy classification
• In ‘fuzzy classification’, values may be in more than one
bin.
• For example, an analyst might use the bins ‘Very tall’, ‘Tall’,
‘Medium’, ‘Short’, ‘Very short’.
• Into which bin would someone 6 ft tall be placed? 20% of
people might consider this person ‘Very tall’, 70% ‘Tall’, and
10% ‘Medium’.
• This might be represented by our six-footer having an
entry in the ‘Very tall’ bin but tagged with a 20% marker,
and an entry in the ‘Tall’ bin with a 70% marker, and an
entry in the ‘Medium’ bin with a 10% marker.
144
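An illustrative sketch (not a module API) of how that fuzzy membership might be recorded: each person appears against every bin they partly belong to, tagged with a degree of membership.

```python
import pandas as pd

memberships = pd.DataFrame([
    {'person': 'six-footer', 'bin': 'Very tall', 'degree': 0.20},
    {'person': 'six-footer', 'bin': 'Tall',      'degree': 0.70},
    {'person': 'six-footer', 'bin': 'Medium',    'degree': 0.10},
])

# The membership degrees for any one person sum to 1.0.
print(memberships.groupby('person')['degree'].sum())
```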
Imposed bins
• So far, we’ve been discussing bin descriptions that are
decided in advance of allocating data. However, there are
analysis techniques that require bins to be defined based
on the shape of the dataset itself.
• In equal-frequency binning, the fencepost values are
selected such that an equal number of data instances are
placed into each bin.
• This means that the width of the bins may be unequal.
145
Equal-frequency binning
• Figure 4.13 shows a graph of the numbers of people of
each age in a population. We’ve superimposed the
fenceposts for three bins that cover the range where each
bin has an equal number of population members in it (600
in this case).
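A hedged sketch of equal-frequency binning with pandas.qcut: three bins, each holding (roughly) the same number of values, on an invented set of ages.

```python
import pandas as pd

ages = pd.Series([18, 19, 21, 24, 27, 31, 36, 44, 52, 63, 70, 81])

bins = pd.qcut(ages, q=3)       # fenceposts chosen from the data itself
print(bins.value_counts())      # roughly equal counts, but unequal bin widths
```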
4 Inferential analysis
• Descriptive analysis seeks only to describe the data one
has, usually by means of transforming it in some way.
• Inferential analysis seeks to reach conclusions that
extend beyond the data itself.
• Common examples:
• An inferential analysis attempts to infer whether conclusions drawn
from sample data might apply to an entire population.
• Or an inferential analysis may be used to judge the probability that
an observed difference between groups is real, or has just
happened by chance.
148
4 Inferential analysis
• Data is extensively used to support business processes
and transactions, and also in research.
• When considering users’ requirements for data
management to support research work, it can be useful to
know something about research design.
• So, while this is not a module on research methods or
experimental design, we do believe the following short
discussion of the methods by means of which
experimental data is collected, recorded and analysed is
necessary.
• Some knowledge of these puts the data manager in a
position to understand how research efforts may best be
supported.
149
• It looks as if the drug improves memory, as the two groups have different means.
• However, there is a fair amount of overlap between them.
• The experimenter will want to be fairly sure that the differences are not due to
random variation, but to the effect of the drug.
• Put another way, they will want to show that the probability of the difference
between the means being due to chance is very low.
• This probability – the alpha value – is usually set to 0.05 (5 in 100) but, as with
the correlation statistics we discussed earlier, the gold standard is 0.01 (1 in 100).
160
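One common way to judge whether the two group means differ by more than chance (not necessarily the exact test used in the module materials) is an independent-samples t-test; here is a hedged sketch on invented memory scores.

```python
from scipy.stats import ttest_ind

drug_group = [14, 17, 15, 19, 16, 18, 20, 15]
placebo_group = [13, 14, 12, 16, 15, 13, 14, 12]

t_stat, p_value = ttest_ind(drug_group, placebo_group)
print(p_value)   # if below the chosen alpha (0.05, or 0.01 for the gold standard),
                 # the difference is unlikely to be due to chance alone
```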
Examples
• To clarify this, consider two rather simplified examples:
1. Genetic databases
• Structurally, the DNA molecule consists of two spiral chains of sugar and phosphate
molecules, bound together by pairs of nitrogen bases: guanine (G), adenine (A), cytosine (C)
and thymine (T). Guanine only binds with cytosine; adenine only binds with thymine.
• It is the precise sequence of these base pairs that bestows on every individual their
individuality. Thus, an individual’s unique biological identity can be expressed as a string,
many hundreds of millions of characters long, of the characters C, A, G, T: e.g.
CGAGGGTAATTTGATG ….
• Certain areas (known as loci) of the human genome are highly variable between individuals.
Thus DNA analysis may be used to pick out individuals stored in a DNA database, possibly
containing millions of profiles, using DNA fragments of a few groups of bases –
GAGTGACCCTGA, for example – taken from certain loci of DNA recovered from a crime
scene, say.
2. Codes
• If some boundary changes led to a revision of postcodes, it might be necessary to find all
instances of partial postcodes (MK6 xxx, MK7 xxx, MJ8 xxx, MJ9 xxx) and amend these in
some way; or to identify vehicles from partial number plate data, e.g. M?16??T.
• All these kinds of analysis depend on what are known as regular expressions.
165
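A hedged sketch of the kind of regular expression matching described above; the postcode and number-plate patterns are illustrative only, not official formats.

```python
import re

# Partial postcodes MK6, MK7, MJ8 or MJ9 at the start of the string.
postcodes = ['MK6 1AB', 'MK7 6AA', 'MJ8 2CD', 'SW1 1AA']
postcode_pattern = re.compile(r'^(MK[67]|MJ[89])\s')
print([pc for pc in postcodes if postcode_pattern.match(pc)])

# Partial plate M?16??T, treating '?' as 'any single character'.
plates = ['MA16BCT', 'MB16XYT', 'MA17BCT']
plate_pattern = re.compile(r'^M.16..T$')
print([p for p in plates if plate_pattern.match(p)])
```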
6.3 Hierarchies
• Table 4.25 shows a fragment of an Olympic medal table in a
traditional tabular view.
7 Summary
In this part, you have learned about:
• descriptive analysis of data, including transforming,
aggregating and visualising datasets, and some common
statistical measures for descriptive analysis
• inferential analysis of data, including experimental design,
shaping data and statistical measures for inferential analysis
• the shaping of data for reports, including the use of
OpenRefine as a data shaping tool.
• Practically you will have worked with:
• SQL and pandas to manipulate tabular data in several ways
• simple pandas visualisations to produce scatter and bar plots
• regular expressions to process text strings.
• In the next part of the module, we will consider how to present
reports on data investigations and your findings, and how to use
more complex visualisations.
183
ACTIVITIES
For Part 4
184
• Assuming the pass mark on each module was 40, compare the
modules and the performance of students on them. Is there
anything that might suggest there is scope for further
exploration?
• Hint: it may help to draw a chart of the sorted marks for each
student and the pass mark line, then do an alphabetic sort on
the student identifier before drawing the charts.
• Optional: See Notebook 04.4 Activity 4.4 Walkthrough to help
with this activity
190
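As a hedged sketch of the kind of chart the hint describes, here are bars of each student's marks on the two modules with a pass-mark line at 40; the student identifiers and marks are invented, not the module's dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

marks = pd.DataFrame({'QQ224': [72, 65, 58, 81, 35],
                      'QQ233': [48, 41, 39, 55, 30]},
                     index=['P01', 'Q02', 'R03', 'P04', 'Q05'])

ax = marks.sort_index().plot.bar()   # alphabetic sort on the student identifier
ax.axhline(40, linestyle='--')       # the pass mark line
plt.show()
```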
Discussion
Figure 4.11 A bar chart showing each student’s mark on two modules
• This shows that students (with a couple of exceptions) perform
consistently better on QQ224 (red bars).
• Moreover, in QQ233 only one student gets above 60 marks, with the
majority of the students getting over 40, whereas in QQ224 the
majority of students gain over 60 marks.
• However, if we now sort the results by student identifier, as in
Figure 4.12, another problem is revealed.
192
Discussion
• This shows that students (with a couple of exceptions) perform consistently better on QQ224 (red bars).
Moreover, in QQ233 only one student gets above 60 marks, with the majority of the students getting over
40, whereas in QQ224 the majority of students gain over 60 marks.
• However, if we now sort the results by student identifier, as in Figure 4.12, another problem is revealed.
• Figure 4.12 The previous bar chart sorted by student identifier
• It is immediately obvious that in both modules students whose identifiers begin with Q are performing
extremely poorly, and R students seem slightly weaker than P students, suggesting that the relative
performance of these groups should be explored more deeply. Of course, for such a small sample this may
just be chance, but, for example, this might trigger an analysis of larger datasets or different modules, and
an exploration of the significance of the starting letter of the student number.
Figure 4.11 A bar chart showing each student’s mark on two modules
194
Discussion
Table 4.11 The completed comparative statistics table
EXERCISES
For Part 4
202
Discussion
• The figure shows a skew towards both high-priced and low-
priced goods sold, with comparatively few sales in between – a kind
of inversion of the normal distribution curve discussed earlier.
Statisticians refer to this as a bimodal distribution. However,
note should be taken of the differences in the size of each price
range.
206
Discussion
• This results in a table with a single row for each distinct
module code from the mixed module data table; each row
shows the module code and a count of the number of
students who completed that module.
• If you’re not sure how this query would be evaluated, the
following is a description of the logical processing for the
SQL (we’ve illustrated it with tables at each step).
• (Kindly refer to the module learning materials for detailed
steps)
211
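A hedged pandas equivalent of the SQL described above – one row per distinct module code with a count of the students who completed it; the mixed module data here is invented for illustration.

```python
import pandas as pd

mixed = pd.DataFrame({'student': ['P01', 'P02', 'Q03', 'Q04', 'R05'],
                      'module':  ['QQ224', 'QQ224', 'QQ233', 'QQ224', 'QQ233']})

# Group the rows by module code and count the students in each group.
counts = mixed.groupby('module')['student'].count().reset_index(name='num_students')
print(counts)
```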