CHAPTER-4-Data-Management
DATA MANAGEMENT
Data management is the practice of collecting, organizing, storing, and analyzing data
to derive meaningful insights. In today's digital age, data has become an invaluable asset,
driving innovation and decision-making across various industries. Statistics, a branch of
mathematics, plays a crucial role in data management by providing the tools and techniques
to extract valuable information from raw data.
Learning Outcomes
1. Utilize various data management tools to process and manage quantitative data.
2. Identify the types of data and their levels of measurement.
3. Calculate the measures of central tendency and measures of dispersion of a given set
of data.
4. Interpret data based on the results of computation.
5. Appreciate the importance and application of measures of central tendency and
measures of dispersion in real-life situations.
A. Data Collection
B. Data Preparation - Transforming raw data into a suitable format for analysis,
which may involve:
D. Data Analysis and Interpretation
What is Data?
Data refers to raw and unprocessed facts, figures, or values collected or observed from the
real world. It can take various forms, including numbers, text, images, or any other
representations. Data by itself lacks context and meaning until it is processed, organized, and
interpreted.
Examples:
1. Numeric Data - Numbers like age, height, weight, temperature, sales figures, etc.
2. Textual Data - Words, sentences, and paragraphs, such as articles, emails, social media
posts, etc.
3. Categorical Data - Red, blue, green; Yes or No; Categories like "High," "Medium," "Low"
4. Video Data - Movies, TV shows, video clips, etc.
5. Image Data - Pixel values in a digital image
6. Audio Data - Waveform values in digital audio
Types of Data
Data, the raw material of the information age, comes in various forms and can be categorized
based on different criteria.
1. Quantitative Data - These variables represent measurable quantities and can be either
discrete or continuous.
Discrete Data - Takes on distinct, separate values with no intermediate values, often
whole numbers. Examples include the number of siblings or the number of cars in a
parking lot.
Continuous Data - Can take on any value within a given range, so it has infinitely many
possible values. Examples include height, weight, and temperature.
2. Qualitative Data - These variables represent categories or groups and can be
either nominal or ordinal.
Nominal Data - Categories with no inherent order or ranking. Examples include
gender, ethnicity, or types of fruits.
Ordinal Data - Categories with a meaningful order or ranking but with inconsistent
intervals between them. Examples include education levels (e.g., high school,
college, graduate), satisfaction rating (low, medium, high), Likert scale responses.
Level of Data Measurement
Understanding the level of measurement of your data is crucial for selecting
appropriate statistical techniques and interpreting your findings accurately. There are four
primary levels of measurement, each with distinct characteristics and limitations.
1. Nominal Data - Nominal data represent categories or groups with no inherent order
or ranking.
Examples:
• Gender (categories: male, female)
• Eye color (categories: brown, blue, green)
• Marital status (single, married, divorced, widowed)
• Hair type (straight, wavy, curly, kinky)
• Car brands (Toyota, Ford, Honda, Chevrolet)
• Political affiliation (Democrat, Republican, Independent)
2. Ordinal Data - Ordinal data have ordered categories, but the intervals between
them are not consistent or meaningful.
Examples:
• Educational levels (categories: high school, college, graduate)
• Customer satisfaction ratings (categories: dissatisfied, neutral,
satisfied)
• Socio-economic status (lower class, working class, middle class,
upper-middle class, upper class)
• Likert scale for agreement (strongly disagree, disagree, neutral, agree,
strongly agree)
• Performance rating (below expectations, meeting expectations,
exceeding expectations)
3. Interval Data - Interval data have ordered categories with consistent and meaningful
intervals between them, but they lack a true zero point.
Examples:
• Temperature in Celsius or Fahrenheit
• IQ scores
• pH level
• Longitude or latitude
• Standardized Test Scores
4. Ratio Data - Ratio data have all the properties of interval variables, but they also
have a true zero point, indicating the absence of the attribute.
Examples:
• Height in centimeters or inches
• Income
• Weight
• Distance travelled
• Time (in seconds, minutes, hours)
Measure of Central Tendency
A measure of central tendency is a summary statistic that represents the center point
or typical value of a dataset. It is also referred to as the central location of a distribution. There
are three measures of central tendency: mean, median, and mode. Choosing the best
measure of central tendency depends on the type of data.
A. Mode
Mode is a statistical measure that represents the most frequently occurring value in a dataset.
It is the value with the greatest frequency. Mode is appropriate to use when the variable
measured is in the nominal scale.
Example 1.
Let's say we surveyed a group of people about their favorite color. Here are the results:
• Blue: 15 people
• Red: 10 people
• Green: 8 people
• Yellow: 7 people
In this case, blue is the mode because it is the most frequently chosen color.
Example 2
A teacher records the following scores for a class of 10 students on a recent test:
75, 82, 85, 85, 85, 90, 92, 95, 95, 100
Solution
To find the mode, we identify the score that appears most frequently. In this case, the score
85 appears three times, which is more frequent than any other score. Therefore, the mode of
the test scores is 85.
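The two examples above can be checked with a short Python sketch. It assumes the survey results are stored as a category-to-count mapping. The variable names are illustrative, and `Counter` and `multimode` come from Python's standard library.

```python
from collections import Counter
from statistics import multimode

# Example 1: favorite-color survey, stored as category -> number of people.
color_counts = Counter({"Blue": 15, "Red": 10, "Green": 8, "Yellow": 7})
top_color, top_count = color_counts.most_common(1)[0]
print(top_color, top_count)  # Blue 15

# Example 2: test scores of the 10 students.
scores = [75, 82, 85, 85, 85, 90, 92, 95, 95, 100]
score_modes = multimode(scores)  # every value tied for the highest frequency
print(score_modes)  # [85]
```

`multimode` returns a list, so a dataset with two or more equally frequent values would show all of them rather than only one.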
Real-world examples:
Fashion. A clothing store owner might notice that a particular style of jeans is selling more
than any other. The mode would be the most popular style.
Weather. A meteorologist might observe that the most common daily high temperature in a
particular city during a specific month is 25 degrees Celsius. This would be the modal
temperature.
Product Sales. A supermarket manager might identify the best-selling brand of cereal by
determining the brand that appears most frequently in sales records.
Quality Control. A manufacturer might inspect a batch of products and find that a certain
defect occurs most often. This would be the modal defect.
Characteristics:
• Multiple Modes. A dataset can have more than one mode. A dataset with two modes is
called bimodal, and one with more than two is called multimodal. For instance, if "red" and
"blue" tie as the most popular colors in the survey example, the dataset would be bimodal.
• No Mode. A dataset might not have a mode if all values occur with the same
frequency.
• Mode for Categorical Data. Mode is often used for categorical data, as it helps
identify the most common category.
• Mode for Numerical Data. While it can be used for numerical data, it's less common
than the mean or median, especially for large datasets.
• Identifying the Most Common Value - When you want to know the most frequent
occurrence.
• Categorical Data Analysis - When dealing with categorical data, mode is a useful
measure of central tendency.
• Non-Normal Distributions - In cases where the data is not normally distributed, the
mode can provide insights that might be missed by the mean or median.
B. Median
The median is the middle entry or term in a set of data arranged in either increasing or
decreasing order. The median is a positional measure: it depends on the order of the
measures and on how many there are, not on the size of the extreme values. This measure
is appropriate to use when the data are at least ordinal in scale, since ranking of the data is
involved.
To find the median of a given set of data, take note of the following:
1. Arrange the data in either increasing or decreasing order.
2. Locate the middle value. If the number of cases is odd, the middle value is the
median. If the number of cases is even, take the arithmetic mean of the two
middle measures.
Example 1
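The numbers of books borrowed in the library from Monday to Friday last week were 58, 60,
54, 35, and 97 respectively. Find the median.
Solution
Arranged in increasing order: 35, 54, 58, 60, 97
Since the number of measures is odd, the median is the middle value, 58.
Example 2
Cora’s quizzes for the second quarter are 8, 7, 6, 10, 9, 5, 9, 6, 10, and 7. Find the median.
Solution
Arranged in increasing order: 5, 6, 6, 7, 7, 8, 9, 9, 10, 10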
Since the number of measures is even, the median is the average of the two
middle scores, 7 and 8: (7 + 8)/2 = 7.5.
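The two-step procedure (sort, then locate the middle) can be sketched in Python as follows. The helper name `median_by_hand` is illustrative. The standard library's `statistics.median` applies the same rule and is shown for comparison.

```python
from statistics import median

books_borrowed = [58, 60, 54, 35, 97]          # Example 1
quiz_scores = [8, 7, 6, 10, 9, 5, 9, 6, 10, 7]  # Example 2

def median_by_hand(values):
    """Sort the values, then take the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]          # odd count: single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: average the pair

print(median_by_hand(books_borrowed))  # 58
print(median_by_hand(quiz_scores))     # 7.5
print(median(books_borrowed), median(quiz_scores))  # same results from the library
```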
Characteristics of Median
• Less Affected by Outliers. Unlike the mean, the median is not significantly affected by
extremely large or small values.
• Quick and Easy to Calculate. It's relatively simple to find the median, especially for
smaller datasets.
• Represents the Middle Value. It provides a good measure of central tendency,
indicating the value that separates the lower half of the data from the upper half.
Real-world examples
Real Estate. When analyzing housing market trends, real estate agents often use the median
home price. This is because it's less affected by outliers like extremely expensive or
inexpensive homes.
Income. Economists and policymakers often use the median income to gauge the overall
economic health of a population. This is because it provides a better picture of typical income
levels, as it's less influenced by very high or very low incomes.
Demographics. Demographers use the median age to understand the age distribution of a
population. This can help in planning for future needs like healthcare, education, and social
services.
C. Mean
The mean (also known as the arithmetic mean) is the most commonly used measure
of central position. It is the sum of measures divided by the number of measures in a variable.
It is symbolized as x̄ (read as x bar). Mean is appropriate to use when the distribution is at
least interval scale.
Example 1
The grades in Chemistry of 10 students are 87, 84, 85, 85, 86, 90, 79, 82, 78, 76. What is
the average grade of the 10 students?
Solution:
Mean = (87 + 84 + 85 + 85 + 86 + 90 + 79 + 82 + 78 + 76)/10 = 832/10 = 83.2
Example 2
Suppose a company in the Philippines has 10 employees with the following annual salaries
in Philippine Pesos (PHP):
Php 150,000
Php 175,000
Php 200,000
Php 225,000
Php 250,000
Php 250,000
Php 275,000
Php 300,000
Php 350,000
Php 1,000,000
1. Add up all the salaries: Php 150,000 + Php 175,000 + Php 200,000 + Php 225,000 +
Php 250,000 + Php 250,000 + Php 275,000 + Php 300,000 + Php 350,000 + Php
1,000,000 = Php 3,175,000
2. Divide the total salary by the number of employees: Php 3,175,000/10 = Php 317,500
Characteristics of Mean:
• Sensitivity to Outliers. The mean is sensitive to outliers. This means that extreme
values can significantly influence the mean, potentially skewing it. For example, if a few
very high salaries are included in a dataset of employee salaries, the mean salary will be
higher than the typical salary.
• Uses All Data Points. The mean takes into account every data point in the dataset. This
makes it a comprehensive measure of central tendency.
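The sensitivity to outliers can be seen on the ten salaries from the example. The sketch below is only an illustration: it compares the mean with the median, then recomputes the mean without the Php 1,000,000 salary. That last step is an added comparison, not part of the original example.

```python
from statistics import mean, median

# Annual salaries (Php) of the 10 employees in the example.
salaries = [150_000, 175_000, 200_000, 225_000, 250_000,
            250_000, 275_000, 300_000, 350_000, 1_000_000]

mean_salary = sum(salaries) / len(salaries)   # uses every data point
median_salary = median(salaries)              # middle of the sorted list
mean_without_top = mean(salaries[:-1])        # drop the single extreme salary

print(mean_salary)                 # 317500.0
print(median_salary)               # 250000.0
print(round(mean_without_top, 2))  # 241666.67
```

Nine of the ten employees earn less than the mean, which is why the median is often the better description of a typical salary in data like this.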
Real-world examples
Academic Performance. Teachers often calculate the average score on a test to assess
class performance and identify areas where students may need additional support.
Additionally, a student's Grade Point Average (GPA) is calculated by averaging their grades
in different courses.
Finance. Investors track the average price of a stock over a specific period to assess its
performance. Moreover, investors calculate the average return on their investments (ROI) to
evaluate their portfolio's performance.
Business. Businesses use average sales figures to track performance and set sales targets.
Weather. Meteorologists use the average temperature to predict weather patterns and climate
trends.
Weighted Mean
A weighted mean is a type of average that assigns different weights to different data
points. This is useful when some data points are more important or reliable than others. The
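formula for weighted mean is:
\[ \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \]
where each \(x_i\) is a data point and \(w_i\) is its weight.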
Example 1.
Below are Maria’s subjects and the corresponding number of units and grades she got for
the first grading period. Compute her grade point average.
Therefore, Maria has a GPA of 81.86 for the first grading period.
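A weighted mean of this kind can be sketched in Python as below. The grades and units in the sketch are hypothetical values chosen for illustration. They are not Maria's record, so the result differs from her GPA. The function name `weighted_mean` is also illustrative.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical grades and units (NOT the data from Maria's example).
grades = [85, 90, 80]
units = [3, 2, 5]

gpa = weighted_mean(grades, units)
print(gpa)                        # 83.5
print(sum(grades) / len(grades))  # 85.0, the unweighted mean for comparison
```

The 5-unit subject pulls the weighted result toward its grade of 80, which the unweighted mean ignores.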
Measures of Dispersion
The measures that describe the degree of spread of the data are called measures of
dispersion, measures of variability, or measures of spread. They are used to
determine how scattered the values are in the distribution. In this topic, we will consider four
measures of dispersion, namely: range, average deviation, variance, and standard deviation.
The range is the simplest measure of variability. It is the difference between the largest and
smallest measurements. To determine the range of ungrouped data, the formula is:
\[ R = H - L \]
where \(H\) is the highest value and \(L\) is the lowest value.
Example 1
Consider the four data sets presented below. Find the range of each data set.
Comparing the data sets, Data Set 1 has the least variation because it has the smallest
value of R. On the other hand, Data Set 3 has the most variation because it has the largest
value of R.
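The same comparison can be sketched in Python. The sketch uses the Statistics and Psychology score lists that appear in the average deviation examples of this chapter, not the four data sets of this example. The function name `data_range` is illustrative.

```python
def data_range(values):
    """Range of ungrouped data: largest value minus smallest value."""
    return max(values) - min(values)

# Score lists taken from the average-deviation examples in this chapter.
statistics_scores = [17, 17, 26, 28, 30, 30, 31, 37]
psychology_scores = [15, 19, 20, 24, 28, 30, 32, 32, 40]

range_statistics = data_range(statistics_scores)
range_psychology = data_range(psychology_scores)
print(range_statistics)  # 20
print(range_psychology)  # 25
```

The range uses only the two extreme values, so it says nothing about how the remaining scores are distributed between them.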
A large average deviation would mean that a set of scores is widely dispersed about
the mean, while a small average deviation would imply that the set of scores is closer to the
mean.
The formula of average deviation for ungrouped data is:
Example 1
The raw scores of eight students in Statistics are given as follows: 17, 17, 26, 28, 30,
30, 31, and 37. Compute the average deviation.
Example 2.
The scores of nine students in Psychology are given as follows: 15, 19, 20, 24, 28,
30, 32, 32, and 40. Calculate the average deviation.
The computed average deviation (A.D.) of the scores in Statistics is 6, while that of the scores
in Psychology is 7.17. This means that the scores in Statistics are less dispersed, that is, more
closely clustered around the mean (homogeneous), while the scores in Psychology are more
dispersed away from the mean (heterogeneous).
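Both results can be reproduced with the Python sketch below. It rests on an assumption inferred from the reported values: the sum of the absolute deviations is divided by n - 1, since that is the divisor that yields 6 and 7.17. Many references divide by n instead, so the divisor is left as a parameter. The function and parameter names are illustrative.

```python
def average_deviation(values, divisor_offset=1):
    """Sum of |x - mean| divided by (n - divisor_offset).

    divisor_offset=1 (assumed here) divides by n - 1; divisor_offset=0 divides by n.
    """
    n = len(values)
    mean = sum(values) / n
    total_abs_deviation = sum(abs(x - mean) for x in values)
    return total_abs_deviation / (n - divisor_offset)

statistics_scores = [17, 17, 26, 28, 30, 30, 31, 37]
psychology_scores = [15, 19, 20, 24, 28, 30, 32, 32, 40]

ad_statistics = average_deviation(statistics_scores)
ad_psychology = average_deviation(psychology_scores)
print(ad_statistics)            # 6.0
print(round(ad_psychology, 2))  # 7.17

# Dividing by n instead gives smaller values for the same data.
print(average_deviation(statistics_scores, divisor_offset=0))  # 5.25
```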
Variance for Ungrouped Data
Another way to avoid a sum of zero for the deviation scores is to square each deviation
score and get the average of all squared deviation scores. The resulting measure is called
the variance, which is expressed in squared units. In symbols, s².
To compute the variance of ungrouped data, the following formula may be used:
Example 1. Consider the data sets below. Compute the variance of each data set.
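As an illustration of the squaring-and-averaging idea, the Python sketch below computes the variance of the Statistics scores from the average deviation example rather than the data sets of this example. It assumes the sample form, which divides by n - 1. The population form, which divides by n, is shown alongside for comparison. The function name is illustrative.

```python
from statistics import pvariance, variance

# Statistics scores from the average-deviation example in this chapter.
statistics_scores = [17, 17, 26, 28, 30, 30, 31, 37]

def sample_variance(values):
    """Average of the squared deviations from the mean, dividing by n - 1 (assumed)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

print(sample_variance(statistics_scores))  # 48.0  (336 / 7)
print(variance(statistics_scores))         # library sample variance, same value
print(pvariance(statistics_scores))        # population variance: 336 / 8 = 42
```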
Standard Deviation for Ungrouped Data
Recall that, in the computation of the variance, the deviation was squared. This implies
that the variance is expressed in squared units. Extracting the square root of the value of the
variance will give the value of the standard deviation. In symbols, s.
To take the standard deviation of ungrouped data, extract the square root of the
variance. In mathematical formula,
\[ s = \sqrt{s^2} \]
Example 1. Consider the data sets below. Compute the standard deviation of each data set.
On the basis of the obtained standard deviation, we say that the scores in Data Set 1
deviate from the mean by 2.06 units, on the average. For Data Set 2, the scores deviate from
the mean by an average of 2.56 units.
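The square-root step can be sketched in Python as below. The sketch uses the Statistics and Psychology scores from the average deviation examples, not Data Sets 1 and 2, so its results differ from 2.06 and 2.56. It assumes the sample variance, with divisor n - 1. The function name is illustrative.

```python
import math
from statistics import stdev

statistics_scores = [17, 17, 26, 28, 30, 30, 31, 37]
psychology_scores = [15, 19, 20, 24, 28, 30, 32, 32, 40]

def sample_standard_deviation(values):
    """Square root of the sample variance (divisor n - 1 is an assumption)."""
    n = len(values)
    mean = sum(values) / n
    sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(sample_variance)

sd_statistics = sample_standard_deviation(statistics_scores)
sd_psychology = sample_standard_deviation(psychology_scores)
print(round(sd_statistics, 2))  # 6.93
print(round(sd_psychology, 2))  # 7.86

# statistics.stdev uses the same n - 1 divisor.
print(round(stdev(statistics_scores), 2))
```

Unlike the variance, the standard deviation is in the same unit as the scores, so it can be read directly as a typical distance from the mean.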