CH 4 Handout

This chapter discusses data merging techniques in data science, highlighting the importance of combining datasets from different sources while addressing potential issues such as naming conventions and data formatting. It covers various types of joins (one-to-one, one-to-many, and many-to-many) and explains statistical concepts like standard deviation, z-scores, percentiles, quartiles, and deciles. The chapter aims to provide a comprehensive understanding of how to effectively merge and analyze data for better insights.

Uploaded by

ayan.infernogod
Copyright
© All Rights Reserved

CHAPTER

DATA MERGING

Studying this chapter should enable you to understand:
1. How to merge data sets?
2. What is Standard Deviation and what are different ways to calculate it?

1. Overview of Data Merging

In Data Science, data merging is the process of combining two or more data sets into a single data frame. This process is necessary when we have raw data stored in multiple files or data tables that we want to analyze all in one go.

However, several issues that arise while merging data from different sources must be corrected for the merge to succeed. Different data sources often follow naming conventions that differ from those of the main data source, and they may group the data differently. An additional data source has frequently been created at a different time, by different people, with a different objective and different use cases. Owing to all these factors, it should not be surprising if multiple data sources differ considerably.

In this topic, we will explore various ways of simplifying the process of data merging. There are many places where these techniques will help you. For example, you may have two different systems that operate in parallel with each other, or you may have to analyze a relationship in which a legacy system with poorly formatted data has to be integrated with your new system. This is where data merging comes into the picture. Let us now dive deep into data merging techniques.

We can perform data merging by implementing data joins on the tables involved. There are three categories of data joins:

1. One to One Joins
2. One to Many Joins
3. Many to Many Joins

One to One Joins

A one-to-one join is probably one of the simplest join techniques. In this type of join, each row in one table is linked to a single row in another table using a “key” column.

For example, in a company database, each employee has only one Employee ID, and each Employee ID is assigned to only one employee.

In the database, a one-to-one relationship looks like this:

In this example, the “key” field in each table is “Employee ID”. This “key” field is designed to contain unique values. In the Employees table, the Employee ID field is the primary key; in the Contact Info table, the Employee ID field is a foreign key.

The one-to-one relationship returns the related records when the value in the Employee ID field in the Contact Info table is the same as the value in the Employee ID field in the Employees table.

This is how a one-to-one join works: it merges the data tables using this primary key.

One to Many Joins

In a one-to-many join, one record in a table can be related to one or many records in another table. For example, each student can borrow multiple books from the school library.

In the database, a one-to-many relationship looks like this:

In this example, the primary key field in the Students table, Student ID, is designed to contain unique values. The foreign key field in the Library table, Student ID, is designed to allow multiple instances of the same value.
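The key matching behind the one-to-one and one-to-many joins described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the field names (`employee_id`, `student_id`, `phone`, `book`) and the helper `join_on_key` are hypothetical and do not come from the chapter.

```python
from collections import defaultdict

# Hypothetical rows modelled on the Employees / Contact Info example.
employees = [
    {"employee_id": 1, "name": "Asha"},
    {"employee_id": 2, "name": "Ravi"},
]
contact_info = [
    {"employee_id": 1, "phone": "555-0101"},
    {"employee_id": 2, "phone": "555-0102"},
]

# Hypothetical rows modelled on the Students / Library example.
students = [
    {"student_id": 10, "name": "Kiran"},
    {"student_id": 11, "name": "Zoya"},
]
library = [
    {"student_id": 10, "book": "Algebra"},
    {"student_id": 10, "book": "Physics"},
    {"student_id": 11, "book": "History"},
]


def join_on_key(left, right, key):
    """Inner join: pair each left row with every right row sharing its key value."""
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right_by_key.get(left_row[key], [])
    ]


# One-to-one: each employee_id matches exactly one contact row.
one_to_one = join_on_key(employees, contact_info, "employee_id")

# One-to-many: one student_id can match several library rows.
one_to_many = join_on_key(students, library, "student_id")

for row in one_to_one + one_to_many:
    print(row)
```

The same function serves both cases; the only difference is whether the key repeats in the second table. In pandas, a comparable join is written with `DataFrame.merge`, whose `validate` argument (for example `"one_to_one"` or `"one_to_many"`) checks the key uniqueness described above.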
The one-to-many relationship returns the related records when the value in the Student ID field in the Library table is the same as the value in the Student ID field in the Students table.

This is how a one-to-many join works: it merges the tables using the primary key, which here demonstrates a one-to-many relationship.

Many to Many Joins

A many-to-many relationship is said to occur when multiple records in one table are related to multiple records in another table. For example, a many-to-many relationship exists between students and courses. A student can register for multiple courses. A course can have multiple students.

It is not easy to perform a join on tables that have a many-to-many relationship. As a workaround, you can break a many-to-many relationship into two one-to-many relationships by using a third table, which is called a join table. Every record in a join table contains match fields that hold the values of the primary keys of the two tables it joins. In a join table, these match fields are usually called foreign keys. These foreign keys are populated with data as records in the join table are created from either of the tables it joins.

This example has a Students table, which contains a record for every student, and a Courses table, which contains a record for each course. A join table called Enrollments creates two one-to-many relationships, one with each of the two tables.

The primary key Student ID is a unique identifier for every student in the Students table. The primary key Course ID is a unique identifier for every course in the Courses table. The Enrollments table carries the foreign keys Student ID and Course ID.

To set up a join table for a many-to-many relationship:
1. Using the above example, create a table called Enrollments. This will act as the join table.
2. In the Enrollments table, make a Student ID field and a Course ID field.
3. Make a relationship between the two Student ID fields in the Students and Enrollments tables. Then make a relationship between the two Course ID fields in the Courses and Enrollments tables.

With this design, if a student registers for four courses, we can ensure that the student has only one record in the Students table and four records in the Enrollments table, one for each course the student is enrolled in.

2. What is Z-Score?

A z-score describes the position of a point in terms of its distance from the mean, measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if the value lies below the mean.

The z-score is also known as the standard score, as it allows comparison of scores on different types of variables by standardizing the distribution.

A standard normal distribution is a normally shaped distribution with a mean of 0 and a standard deviation of 1.

3. How to calculate a Z-score?

The mathematical formula for calculating the z-score is as follows:

Z = (x-μ)/σ

Where,
x = raw score
μ = Population mean
σ = Population Standard Deviation

Thus, the z-score is the raw score minus the population mean, divided by the population standard deviation.

Whenever the population mean and the population standard deviation are unknown, the standard score can be calculated using the sample mean (x̄) and the sample standard deviation as estimates of the population values.

Now we will consider an example that illustrates the use of the z-score formula. Consider a population of kids whose weights are normally distributed. Further, suppose we know that the mean of the distribution is 10 kgs and the standard deviation is 2 kgs. Now consider the questions below:

1. What is the z-score for 12 kgs?
2. What is the z-score for 5 kgs?
3. How many kgs corresponds to a z-score of 1.25?

For the first question, we simply plug x = 12 into our z-score formula. The result is: (12-10)/2 = 1.

This means that 12 is one standard deviation above the mean.
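The z-score formula can also be written as a small function. The sketch below is illustrative only: the function names `z_score` and `raw_score` are assumptions, and the numbers are the mean (10 kgs) and standard deviation (2 kgs) from the weight example, applied to the three questions listed above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


def raw_score(z, mean, std_dev):
    """Invert the z-score formula: x = mean + z * std_dev."""
    return mean + z * std_dev


# Values from the chapter's example: mean 10 kgs, standard deviation 2 kgs.
print(z_score(12, 10, 2))      # 1.0
print(z_score(5, 10, 2))       # -2.5
print(raw_score(1.25, 10, 2))  # 12.5
```

Solving the formula for x, as `raw_score` does, is the same algebra used to answer the third question by hand.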
The second question is very similar. Simply put x = 5 into the formula. The result is:

(5-10)/2 = -2.5

The interpretation of this is that 5 is 2.5 standard deviations below the mean.

For the last question, we know the z-score and need the raw score. We plug z = 1.25 into the formula and use basic algebra to solve for x:

1.25 = (x-10)/2

Multiply both sides by 2:

2.5 = (x-10)

Add 10 to both sides:

12.5 = x

Hence, we see that 12.5 kgs corresponds to a z-score of 1.25.

4. How to interpret the Z-score?

The value of a z-score tells us how many standard deviations a score is away from the mean. For example, if the z-score is equal to 0, the score is exactly at the mean.

A positive z-score tells us that the raw score is higher than the mean. For example, if the z-score is equal to +2, the score is 2 standard deviations above the mean.

A negative z-score tells us that the raw score is below the mean. For example, if the z-score is equal to -3, the score is 3 standard deviations below the mean.

5. Why is a Z-score so important?

It is very helpful to standardize the values of a normal distribution by converting them into z-scores because:

1. It gives us an opportunity to calculate the probability of a value occurring within a normal distribution.
2. It allows us to compare two values that come from different samples.

6. Concept of Percentiles

The maximum value of a distribution can be considered in an alternative way. We can represent it as the value in a set of data that has 100% of the observations at or below it. When we consider the maximum value this way, it is called the 100th percentile.

A percentile is defined by the percentage of the total ordered observations at or below it. Therefore, the pth percentile of a distribution is the value such that p percent of the ordered observations fall at or below it.

Consider the following data set: [10, 12, 15, 17, 13, 22, 16, 23, 20, 24]

Here, if we want to find the percentile for element 22, we follow the steps below:
▪ Sort the dataset in ascending order. Once sorted, the dataset will look like [10, 12, 13, 15, 16, 17, 20, 22, 23, 24]
▪ The number of values at or below the element 22 is 8. The total number of elements in the dataset is 10.
▪ Thus, going by the definition, 80 percent of the values are at or below the element 22. The element 22 is therefore at the 80th percentile.

7. Quartiles

The quartiles of a dataset partition the data into four equal parts, with one-fourth of the data values in each part. The total of 100% is divided into four equal parts at 25%, 50%, 75% and 100%. Since the median is defined as the middlemost value of the observations, the median has 50% of the observations at or below it. Thus, the second quartile (Q2), or the 50th percentile, marks the median.

The most frequently used percentiles other than the median are the 25th percentile and the 75th percentile. The 25th percentile defines the first quartile, the 75th percentile defines the third quartile, and the 100th percentile represents the fourth quartile.

The first quartile is the median of all the values to the left of the actual median (Q2). Similarly, the third quartile is the median of all the values to the right of the actual median (Q2).

Using the values of the quartiles, we can also find the interquartile range. The interquartile range measures the spread of the middle 50% of the values when they are ordered from lowest to highest. It is calculated by subtracting the first quartile (Q1) from the third quartile (Q3).

Thus, IQR = Q3 - Q1

Let us consider the following 10 data points:
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

Here, as there are ten values (an even number of values), the median is halfway between the fifth and sixth data values, which gives us 55 as the median, or Q2.

The first quartile, or Q1, is the median of all the values to the left of Q2. Here, 30 is the middle number of the values to the left of the median (Q2).

The third quartile, or Q3, is the median of all the values to the right of Q2. Here, 80 is the middle number of the values to the right of the median (Q2).

The interquartile range (IQR) can be calculated as Q3 - Q1, which is 80 - 30 = 50.
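The percentile and quartile calculations above can be checked with a short sketch. The function names are hypothetical, and one detail is an assumption the chapter does not state: for an odd number of values, the median itself is left out of both halves before Q1 and Q3 are taken.

```python
from statistics import median


def percentile_rank(data, value):
    """Percentage of observations that are at or below `value`."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


def quartiles(data):
    """Return (Q1, Q2, Q3), with Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle value is excluded from both halves.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


scores = [10, 12, 15, 17, 13, 22, 16, 23, 20, 24]
print(percentile_rank(scores, 22))  # 80.0

points = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q1, q2, q3 = quartiles(points)
print(q1, q2, q3, q3 - q1)  # 30 55.0 80 50
```

Library routines such as `statistics.quantiles` use interpolation rules of their own, so their quartiles can differ from the median-of-halves values computed here.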
An important application of quartiles is in the temperature ranges for the day given in a weather report. In the presence of irregularities, the range can be significantly influenced by them. Hence, it is preferred to use the IQR instead, thereby ignoring the top 25 percent and the bottom 25 percent of the data points. In the presence of irregularities, the IQR is more robust and a better representation of the amount of spread in the data.

8. Deciles

Just like quartiles, we have deciles. While quartiles sort the data into four quarters, deciles sort the data into ten equal parts, marked by the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.

The higher the place in the decile ranking, the higher the overall ranking. For example, a person scoring at the 99th percentile in a test would be placed in a decile ranking of 10. However, a person scoring at the 5th percentile in the same test would be placed in a decile ranking of 1.

The mathematical formula to calculate a decile is:

Di = i * (n + 1)/ 10 th data

Where n is the number of data values in the population or sample, and i is the number of the decile. The deciles can be represented as:

1st Decile, D1 = 1 * (n + 1)/ 10 th data
2nd Decile, D2 = 2 * (n + 1)/ 10 th data

and so on

Steps to calculate decile:

a. Find out the number of data values in the sample or population. This is denoted by n.
b. In the next step, sort all the data values in the sample or population in ascending order.
c. In the next step, based on the decile that is required, calculate the position of the decile by using the formula:

Di = i * (n + 1)/ 10 th data

d. Lastly, based on the decile position, determine the corresponding value from amongst the sorted data. If the position is not a whole number, interpolate between the two neighbouring values.

Let us look at an example to understand the concept in detail:

Suppose we have been given 23 random numbers between 20 and 80. We need to represent them as deciles.
Let’s say the raw numbers are: [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54, 57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

Following the steps mentioned above, we first determine the number of values in the sample (n). Here n = 23.

We then need to sort the 23 random numbers in ascending order, as shown below.

Sr. No.   Value
1         23
2         24
3         27
4         30
5         32
6         32
7         32
8         33
9         36
10        36
11        42
12        45
13        51
14        54
15        55
16        55
17        56
18        57
19        60
20        62
21        63
22        72
23        77

We can now calculate the positions of decile D1 to decile D9.

D1 = 1 * (n+1)/ 10 th data
   = 1 * (23 + 1)/ 10
   = 2.4th data, i.e. data between values number 2 & 3
Which is 24 + 0.4 * (27 - 24) = 25.2

Similarly,

D2 = 2 * (n+1)/ 10 th data
   = 2 * (23 + 1)/ 10
   = 4.8th data, i.e. data between values number 4 & 5
Which is 30 + 0.8 * (32 - 30) = 31.6

D3 = 3 * (n+1)/ 10 th data
   = 3 * (23 + 1)/ 10
   = 7.2th data, i.e. data between values number 7 & 8
Which is 32 + 0.2 * (33 - 32) = 32.2

D4 = 4 * (n+1)/ 10 th data
   = 4 * (23 + 1)/ 10
   = 9.6th data, i.e. data between values number 9 & 10
Which is 36 + 0.6 * (36 - 36) = 36

D5 = 5 * (n+1)/ 10 th data
   = 5 * (23 + 1)/ 10
   = 12th data, i.e. data at value number 12
Which is 45

D6 = 6 * (n+1)/ 10 th data
   = 6 * (23 + 1)/ 10
   = 14.4th data, i.e. data between values number 14 & 15
Which is 54 + 0.4 * (55 - 54) = 54.4

D7 = 7 * (n+1)/ 10 th data
   = 7 * (23 + 1)/ 10
   = 16.8th data, i.e. data between values number 16 & 17
Which is 55 + 0.8 * (56 - 55) = 55.8

D8 = 8 * (n+1)/ 10 th data
   = 8 * (23 + 1)/ 10
   = 19.2th data, i.e. data between values number 19 & 20
Which is 60 + 0.2 * (62 - 60) = 60.4

D9 = 9 * (n+1)/ 10 th data
   = 9 * (23 + 1)/ 10
   = 21.6th data, i.e. data between values number 21 & 22
Which is 63 + 0.6 * (72 - 63) = 68.4

Thus, we can represent the deciles for the data set with their positions and corresponding values in a table as shown below:

Decile   Data position   Value
1        2.4             25.2
2        4.8             31.6
3        7.2             32.2
4        9.6             36
5        12              45
6        14.4            54.4
7        16.8            55.8
8        19.2            60.4
9        21.6            68.4

One example of the use of deciles is in school rankings. Students in the top 10% or highest decile will be rewarded, whereas students in the last 10% or lowest decile will be given extra assistance to improve their scores.
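The decile calculation worked through above can be sketched as a function. This is a minimal illustration, assuming the position rule i * (n + 1) / 10 and the linear interpolation used in the worked example; the name `decile` and the handling of positions that fall outside the data (clamped to the smallest or largest value) are assumptions, not part of the chapter.

```python
def decile(data, i):
    """i-th decile (i = 1..9) at position i * (n + 1) / 10, with linear interpolation."""
    if not 1 <= i <= 9:
        raise ValueError("i must be between 1 and 9")
    ordered = sorted(data)
    n = len(ordered)
    # Split the 1-based position into its whole part and its tenths.
    whole, tenths = divmod(i * (n + 1), 10)
    if whole < 1:    # position falls before the first value (assumed clamp)
        return ordered[0]
    if whole >= n:   # position falls at or after the last value (assumed clamp)
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + (tenths / 10) * (above - below)


numbers = [24, 32, 27, 32, 23, 62, 45, 77, 60, 63, 36, 54,
           57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30]

for i in range(1, 10):
    print(f"D{i} = {decile(numbers, i):.1f}")
```

For the 23 numbers of the example, the values printed to one decimal place agree with the table of deciles (25.2, 31.6, 32.2, 36.0, 45.0, 54.4, 55.8, 60.4, 68.4).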
Recap

• In Data Science, data merging is the process of combining two or more data sets into a single data frame.
• In a one-to-one join, each row in one table is linked to a single row in another table using a “key” column.
• In a one-to-many join, one record in a table can be related to one or many records in another table.
• A many-to-many relationship is said to occur when multiple records in one table are related to multiple records in another table.
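The many-to-many case in the last point, resolved through an Enrollments join table as the chapter describes, can be sketched with Python's built-in `sqlite3` module. The table and column names follow the chapter's Students, Courses and Enrollments example; the rows themselves are hypothetical, and an in-memory database is assumed purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- Join table: one row per (student, course) pair, holding both foreign keys.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(student_id),
        course_id  INTEGER REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'Kiran'), (2, 'Zoya');
    INSERT INTO courses  VALUES (101, 'Maths'), (102, 'Physics');
    INSERT INTO enrollments VALUES (1, 101), (1, 102), (2, 101);
    """
)

# Two one-to-many joins through the join table resolve the many-to-many relationship.
rows = conn.execute(
    """
    SELECT s.name, c.title
    FROM enrollments AS e
    JOIN students AS s ON s.student_id = e.student_id
    JOIN courses  AS c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
    """
).fetchall()

for name, title in rows:
    print(name, title)
conn.close()
```

Each student and each course is stored once, while the Enrollments table holds one row per registration, which is the design the chapter recommends.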

Exercises

Objective Type Questions


Please choose the correct option in the questions below.
1. The pth percentile of a distribution is such that:
a) p percent of the observations fall at it
b) p percent of the observations fall below it
c) p percent of the observations fall at or below it
d) the value is p.

2. Which of the following functions is used to compute quantiles of quantitative values?


a) Quantile
b) Quantity
c) Quantiles
d) All of the mentioned

3. The heights of Indian women aged 18 to 24 are approximately normally distributed with a mean of 65.5 inches and a standard deviation of 2.5 inches. Calculate the z-score for a woman six feet tall.
a) 2.60
b) 4.11
c) 1.04
d) 1.33

4. What is a z-score?
a) It is the number of standard deviations a particular score lies above or below
the mean of the set of scores.
b) It is a standardized measure of the mean of a set of data.
c) It is the average frequency of scores in a sample
d) It is a measure of central tendency in the data.

5. The median, mode, deciles and percentiles are all considered as measures of
a) Mathematical averages
b) Population averages
c) Sample averages
d) Averages of position

6. In terms of percentiles, the median lies at the


a) 80th
b) 40th
c) 50th
d) 100th

7. Which measure of position divides the distribution into 10 equal parts?


a) Quartiles
b) Deciles
c) Percentiles
d) Range

8. Which measure of position divides the distribution into 4 equal parts?


a) Quartiles
b) Deciles
c) Percentiles
d) Range

Standard Questions
Please answer the questions below in no less than 100 words.
1. What is data merging?
2. Why is data merging required in data science?
3. Name different ways of merging data sets
4. Explain one-to-one join with the help of an example
5. Explain one-to-many join with the help of an example
