CH 4 Handout
DATA MERGING
relationship where you have a legacy system with very poorly formatted data that you want to integrate with your new system. This is where data merging comes into the picture. Let us now dive deep into data merging techniques.

We can perform data merging by implementing data joins on the databases in question. There are three categories of data joins:

1. One to One Joins
2. One to Many Joins
3. Many to Many Joins

One to One Joins

A one to one join is probably one of the simplest join techniques. In this type of join, each row in one table is linked to a single row in another table using a “key” column.

For example, in a company database, each employee has only one Employee ID, and each Employee ID is assigned to only one employee.

In the database, a one to one relationship looks like this:

designed to contain unique values. In the Employees table, the Employee ID field is the primary key; in the Contact Info table, the Employee ID field is a foreign key.

The one to one relationship returns the related records when the value in the Employee ID field in the Contact Info table is the same as the value in the Employee ID field in the Employees table.

This is how a one to one join works: by merging the data tables using this primary key.

One to Many Joins

In a one to many join, one record in a table can be related to one or many records in another table. For example, each student can borrow multiple books from the school library.

In the database, a one to many relationship looks like this:
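As a minimal sketch, the one to one and one to many relationships can be reproduced with Python's built-in sqlite3 module. The table names follow the text; the column names (employee_id, student_id, email, book) and all rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; every name and row below is hypothetical sample data.
join_db = sqlite3.connect(":memory:")
join_db.executescript("""
    CREATE TABLE Employees (employee_id INTEGER PRIMARY KEY, name TEXT);
    -- UNIQUE allows at most one contact row per employee (one to one).
    CREATE TABLE ContactInfo (
        employee_id INTEGER UNIQUE REFERENCES Employees(employee_id),
        email TEXT
    );
    CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT);
    -- No UNIQUE here, so one student may have many rows (one to many).
    CREATE TABLE Library (
        student_id INTEGER REFERENCES Students(student_id),
        book TEXT
    );

    INSERT INTO Employees VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO ContactInfo VALUES (1, 'asha@example.com'), (2, 'ravi@example.com');
    INSERT INTO Students VALUES (10, 'Meera'), (11, 'John');
    INSERT INTO Library VALUES (10, 'Algebra'), (10, 'Physics'), (11, 'History');
""")

# One to one: each employee matches exactly one contact row.
one_to_one = join_db.execute("""
    SELECT e.name, c.email
    FROM Employees AS e
    JOIN ContactInfo AS c ON c.employee_id = e.employee_id
    ORDER BY e.employee_id
""").fetchall()

# One to many: a student appears once per borrowed book.
one_to_many = join_db.execute("""
    SELECT s.name, l.book
    FROM Students AS s
    JOIN Library AS l ON l.student_id = s.student_id
    ORDER BY s.student_id, l.book
""").fetchall()

print(one_to_one)
print(one_to_many)
```

In the one to one result every employee appears once; in the one to many result the student with two books appears twice, once per matching Library row.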
The one to many relationship returns the related records when the value in the Student ID field in the Library table is the same as the value in the Student ID field in the Students table.

This is how a one to many join works: by merging tables using the primary key, which demonstrates a one to many relationship.

Many to Many Joins

using a third table, which is called a join table. Every record in a join table contains match fields that hold the values of the primary keys of the two tables that it joins. In a join table, these match fields are usually called foreign keys. These foreign keys are populated with data as records in the join table are created from either table that it joins.

The example below uses a Students table, which contains a record for every student, and a Courses table, which contains a record for each course. A join table called Enrollments creates two one to many relationships, one with each of the two tables.
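A minimal sketch of such a join table, again using Python's built-in sqlite3 module. The table names Students, Courses and Enrollments come from the text; the column names and the sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory database with hypothetical sample data.
school_db = sqlite3.connect(":memory:")
school_db.executescript("""
    CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Courses (course_id INTEGER PRIMARY KEY, title TEXT);
    -- Join table: each row pairs one student with one course.
    CREATE TABLE Enrollments (
        student_id INTEGER REFERENCES Students(student_id),
        course_id INTEGER REFERENCES Courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO Students VALUES (1, 'Meera'), (2, 'John');
    INSERT INTO Courses VALUES
        (101, 'Maths'), (102, 'Physics'), (103, 'Chemistry'), (104, 'Biology');
    -- Meera registers for four courses, John for one.
    INSERT INTO Enrollments VALUES (1, 101), (1, 102), (1, 103), (1, 104), (2, 101);
""")

# All courses for one student: Students -> Enrollments -> Courses.
courses_for_meera = [row[0] for row in school_db.execute("""
    SELECT c.title
    FROM Students AS s
    JOIN Enrollments AS e ON e.student_id = s.student_id
    JOIN Courses AS c ON c.course_id = e.course_id
    WHERE s.name = 'Meera'
    ORDER BY c.course_id
""")]

# All students in one course: the same join read in the other direction.
students_in_maths = [row[0] for row in school_db.execute("""
    SELECT s.name
    FROM Courses AS c
    JOIN Enrollments AS e ON e.course_id = c.course_id
    JOIN Students AS s ON s.student_id = e.student_id
    WHERE c.title = 'Maths'
    ORDER BY s.student_id
""")]

# One Students row, four Enrollments rows for the same student.
meera_student_rows = school_db.execute(
    "SELECT COUNT(*) FROM Students WHERE name = 'Meera'").fetchone()[0]
meera_enrollment_rows = school_db.execute(
    "SELECT COUNT(*) FROM Enrollments WHERE student_id = 1").fetchone()[0]

print(courses_for_meera, students_in_maths)
```

The composite primary key on Enrollments is a design choice in this sketch: it prevents the same student from being enrolled in the same course twice.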
1. Using the above example, you can create a table called Enrollments. This will act as a join table.
2. In the Enrollments table, make a Student ID field and a Course ID field.
3. Make a relationship between the two Student ID fields in the tables. Then make a relationship between the two Course ID fields in the tables.

With this design, if a student registers for four courses, we can ensure that the student has only one record in the Students table and four records in the Enrollments table, one for each course the student is enrolled in.

2. What is Z-Score?

A z-score describes the position of a point in terms of its distance from the mean, measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if it lies below the mean.

The z-score is also known as the standard score, as it allows comparison of scores on different types of variables by standardizing the distribution.

A standard normal distribution is a normally shaped distribution with a mean of 0 and a standard deviation of 1.

3. How to calculate a Z-score?

The mathematical formula for calculating the z-score is as follows:

Z = (x - μ)/σ

Where,

x = raw score
μ = population mean
σ = population standard deviation

Thus, the z-score is the raw score minus the population mean, divided by the population standard deviation.

When the population mean and the population standard deviation are unknown, the standard score can be calculated using the sample mean (x̄) and the sample standard deviation as estimates of the population values.

Now we will consider an example that illustrates the use of the z-score formula. Suppose we know about a population of kids whose weights are normally distributed. Suppose further that the mean of the distribution is 10 kg and the standard deviation is 2 kg. Now consider the questions below:

1. What is the z-score for 12 kg?
2. What is the z-score for 5 kg?
3. How many kg correspond to a z-score of 1.25?

For the first question, we simply plug x = 12 into our z-score formula. The result is: (12 - 10)/2 = 1.

This means that 12 is one standard deviation above the mean.
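The formula can be sketched in Python and checked against the three questions about the weight distribution (mean 10 kg, standard deviation 2 kg). The function names z_score and raw_score are assumptions chosen for this illustration.

```python
def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


def raw_score(z, mean, std):
    """Invert the z-score formula: solve z = (x - mean) / std for x."""
    return z * std + mean


MEAN_KG = 10
STD_KG = 2

z_for_12 = z_score(12, MEAN_KG, STD_KG)       # question 1
z_for_5 = z_score(5, MEAN_KG, STD_KG)         # question 2
kg_for_z = raw_score(1.25, MEAN_KG, STD_KG)   # question 3

print(z_for_12, z_for_5, kg_for_z)  # 1.0 -2.5 12.5
```

The same functions work with a sample mean and sample standard deviation when the population values are unknown.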
The second question is very similar. Simply put x = 5 into the formula. The result is:

(5 - 10)/2 = -2.5

The interpretation of this is that 5 is 2.5 standard deviations below the mean.

For the last question, we already know the z-score. We plug z = 1.25 into the formula and use basic algebra to solve for x:

1.25 = (x - 10)/2

Multiply both sides by 2:

2.5 = (x - 10)

Add 10 to both sides:

12.5 = x

Hence, we see that 12.5 kg corresponds to a z-score of 1.25.

4. How to interpret the Z-score?

The value of a z-score tells us how many standard deviations we are away from the mean. For example, if the z-score is equal to 0, the value is exactly at the mean.

A positive z-score tells us that the raw score is higher than the mean. For example, if the z-score is equal to +2, the score is 2 standard deviations above the mean.

A negative z-score tells us that the score is below the mean. For example, if a z-score is equal to -3, the score is 3 standard deviations below the mean.

5. Why is a Z-score so important?

It is very helpful to standardize the values of a normal distribution by converting them into z-scores because:

1. It lets us calculate the probability of a value occurring within a normal distribution.
2. It allows us to compare two values that come from different samples.

6. Concept of Percentiles

The maximum value of the distribution can be considered in an alternative way. We can represent it as the value in a set of data having 100% of the observations at or below it. When we consider the maximum value this way, it is called the 100th percentile.

The percentile of a value can be defined as the percentage of the total ordered observations at or below it. Therefore, the pth percentile of a distribution is the value such that p percent of the ordered observations fall at or below it.

Consider the following data set: [10, 12, 15, 17, 13, 22, 16, 23, 20, 24]

Here, if we want to find the percentile for element 22, we follow the steps below:
▪ Sort the dataset in ascending order. Once sorted, the dataset will look like [10, 12, 13, 15, 16, 17, 20, 22, 23, 24].
▪ The number of values at or below the element 22 is 8. The total number of elements in the dataset is 10.
▪ Thus, going by the definition, 80 percent of the values are at or below the element 22, so the element 22 is at the 80th percentile.

7. Quartiles

The quartiles of a dataset partition the data into four equal parts, with one-fourth of the data values in each part. The total of 100% is divided into four equal parts, ending at 25%, 50%, 75% and 100%. Since the median is defined as the middlemost value of the observations, the median has 50% of the observations at or below it. Thus, the second quartile (Q2), or the 50th percentile, is the median.

The most frequently used percentiles other than the median are the 25th percentile and the 75th percentile. The 25th percentile defines the first quartile, the 75th percentile defines the third quartile, and the 100th percentile represents the fourth quartile.

The first quartile is the median of all the values to the left of the actual median (Q2). Similarly, the third quartile is the median of all the values to the right of the actual median (Q2).

Using the values of the quartiles, we can also find the interquartile range. The interquartile range can be defined as a measure of the spread of the middle 50% of the values when ordered from lowest to highest. It is calculated by subtracting the first quartile (Q1) from the third quartile (Q3).

IQR = Q3 - Q1

Let us consider the following 10 data points:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

Here, as there are ten values (an even number of values), the median is halfway between the fifth and sixth data values, which gives us 55 as the median, or Q2.

The first quartile, or Q1, is the median of all the values to the left of Q2. Here, 30 is the middle number of the five values to the left of the actual median (Q2).

The third quartile, or Q3, is the median of all the values to the right of Q2. Here, 80 is the middle number of the five values to the right of the actual median (Q2).

The interquartile range (IQR) can be calculated as Q3 - Q1, which is 80 - 30 = 50.
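The percentile and quartile calculations above can be sketched in Python using only the standard library. The helper names percentile_rank and quartiles are assumptions for this illustration. The handout only works through an even-sized dataset; for an odd number of values this sketch leaves the median out of both halves, which is one common convention and an assumption here.

```python
from statistics import median


def percentile_rank(data, value):
    """Percentage of observations at or below the given value."""
    at_or_below = sum(1 for v in data if v <= value)
    return 100 * at_or_below / len(data)


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values. For an odd count the median itself
    is excluded from both halves (an assumption of this sketch).
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


# Percentile example from the text: 8 of 10 values are at or below 22.
rank_of_22 = percentile_rank([10, 12, 15, 17, 13, 22, 16, 23, 20, 24], 22)

# Quartile example from the text.
q1, q2, q3 = quartiles([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
iqr = q3 - q1

print(rank_of_22)       # 80.0
print(q1, q2, q3, iqr)  # 30 55.0 80 50
```

Library routines such as statistics.quantiles use interpolation rules that can give slightly different quartiles from the median-of-halves method shown here.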
An important application of quartiles is in temperature ranges for the day as reported in a weather report. In the presence of irregularities, the range can be significantly influenced by them. Hence, it is preferred to use the IQR instead, thereby ignoring the top 25 percent and the bottom 25 percent of the data points. In the presence of irregularities, the IQR is more robust and a better representation of the amount of spread in the data.

8. Deciles

Just like quartiles, we have deciles. While quartiles sort the data into four quarters, deciles sort the data into ten equal parts: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.

Di is the ith decile and can be represented as:

1st Decile, D1 = 1 * (n + 1)/10 th data
2nd Decile, D2 = 2 * (n + 1)/10 th data
and so on

Steps to calculate a decile:

a. Find the number of data values in the sample or population. This is denoted by n.
b. Sort all the data values in the sample or population in ascending order.
c. Based on the decile that is required, calculate the decile using the formula:

Di = i * (n + 1)/10 th data
Let’s say the raw numbers are: [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54, 57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

Following the steps mentioned above, we first determine the number of values in the sample (n). Here n = 23.

We then need to sort the 23 numbers in ascending order, as shown below.

[23, 24, 27, 30, 32, 32, 32, 33, 36, 36, 42, 45, 51, 54, 55, 55, 56, 57, 60, 62, 63, 72, 77]

Now D1 = 1 * (n + 1)/10 th data
= 1 * (23 + 1)/10
= 2.4th data, i.e. between the 2nd and 3rd values
Which is 24 + 0.4 * (27 - 24) = 25.2

Similarly,

D2 = 2 * (n + 1)/10 th data
= 2 * (23 + 1)/10
= 4.8th data, i.e. between the 4th and 5th values
Which is 30 + 0.8 * (32 - 30) = 31.6
D5 = 5 * (23 + 1)/10
= 12th data, i.e. the 12th value
Which is 45

D8 = 8 * (23 + 1)/10
= 19.2th data, i.e. between the 19th and 20th values
Which is 60 + 0.2 * (62 - 60) = 60.4

D9 = 9 * (23 + 1)/10
= 21.6th data, i.e. between the 21st and 22nd values
Which is 63 + 0.6 * (72 - 63) = 68.4

One example of the use of deciles is in school rankings. Students in the top 10% or highest decile will be rewarded, whereas students in the bottom 10% or lowest decile will be given extra assistance to improve their scores.
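The decile steps can be sketched in Python as follows. The function name decile is an assumption, and so is the clamping to the smallest or largest value when the computed position falls outside the data, a case the handout does not cover. The position i * (n + 1)/10 is split with integer arithmetic to keep the fractional part exact.

```python
def decile(data, i):
    """Return the i-th decile (i = 1..9) at position i * (n + 1) / 10.

    The position is 1-based; a fractional position is resolved by
    linear interpolation between the two neighbouring sorted values.
    """
    if not 1 <= i <= 9:
        raise ValueError("i must be between 1 and 9")
    ordered = sorted(data)
    n = len(ordered)
    # whole = integer part of the position, tenths = fractional part * 10
    whole, tenths = divmod(i * (n + 1), 10)
    if whole < 1:        # position before the first value (assumed clamp)
        return ordered[0]
    if whole >= n:       # position at or past the last value (assumed clamp)
        return ordered[-1]
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + (tenths / 10) * (upper - lower)


raw_numbers = [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54,
               57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

# Rounded to one decimal place to hide floating-point noise.
d1 = round(decile(raw_numbers, 1), 1)
d5 = round(decile(raw_numbers, 5), 1)
d8 = round(decile(raw_numbers, 8), 1)
d9 = round(decile(raw_numbers, 9), 1)

print(d1, d5, d8, d9)  # 25.2 45.0 60.4 68.4
```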
Recap
Exercises
4. What is a z-score?
a) It is the number of standard deviations a particular score lies above or below
the mean of the set of scores.
b) It is a standardized measure of the mean of a set of data.
c) It is the average frequency of scores in a sample.
d) It is a measure of central tendency in the data.
5. The median, mode, deciles and percentiles are all considered as measures of
a) Mathematical averages
b) Population averages
c) Sample averages
d) Averages of position
Standard Questions
Please answer the questions below in no less than 100 words.
1. What is data merging?
2. Why is data merging required in data science?
3. Name the different ways of merging data sets.
4. Explain one-to-one join with the help of an example
5. Explain one-to-many join with the help of an example