Unit 2

data science unit 2 notes

Uploaded by

AJAY KRISHNA
Copyright
© © All Rights Reserved
UNIT – II DESCRIPTIVE ANALYTICS USING STATISTICS 9

Frequency Distributions – Outliers – Interpreting Distributions – Graphs – Averages – Describing
Variability – Interquartile Range – Variability for Qualitative and Ranked Data – Normal
Distributions – Z Scores – Correlation – Scatter Plots – Regression – Regression Line – Least
Squares Regression Line – Standard Error of Estimate – Interpretation of R² – Multiple
Regression Equations – Regression Toward the Mean.
DESCRIPTIVE ANALYTICS
Descriptive analytics focuses on summarizing and interpreting historical data to gain insights
into events, patterns, and trends in a business. It involves exploring and examining data
through statistical techniques, visualizations, and KPI reports to provide a clear
understanding of what has happened. This provides a foundation for decision-making,
performance evaluation, and operational improvements. It is a method of describing the
characteristics of a data set. Descriptive measures are useful because they help make sense of
the data; they include the average of the data, its spread, and the shape of its distribution.
Qualitative data
A qualitative variable is a variable that cannot be measured in numerical units; it yields
non-numerical data.
E.g. education, marital status, eye colour.
Frequency
The number of observations falling into a particular class or category of the qualitative variable.
FREQUENCY DISTRIBUTIONS
A frequency distribution is a collection of observations produced by sorting observations into
classes and showing their frequency (f) of occurrence in each class. A frequency distribution
helps us to detect any pattern in the data (assuming a pattern exists) by superimposing some order
on the inevitable variability among observations.
Frequency Distribution for Grouped Data
A frequency distribution for grouped data is produced whenever observations are sorted into
classes of more than one value. Grouped data refers to data that are bundled together in
different classes or categories.
Frequency Distribution for Ungrouped Data
A frequency distribution for ungrouped data is produced whenever observations are sorted into
classes of single values. It displays the frequency of each individual data value instead of
groups of data values. Ungrouped data, or raw data, is a mere list of numbers that conveys
little on its own, because no summarization or aggregation has yet been applied.

Not Always Appropriate


The frequency distribution shown for ungrouped data is only partially displayed because there are
more than 100 possible values between the largest and smallest observations. Frequency
distributions for ungrouped data are much more informative when the number of possible values
is less than about 20. Under these circumstances, they are a straightforward method for
organizing data. Otherwise, if there are 20 or more possible values, consider using a frequency
distribution for grouped data.
Example Problem
1. Students in theater arts appreciation class rated the classic film ‘The Wizard of Oz’ on
a 10-point scale, ranging from 1 (poor) to 10 (excellent), as follows: Construct a
frequency distribution for the above data.

Since the number of possible values is relatively small—only 10—it’s appropriate to construct
a frequency distribution for ungrouped data.
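
A minimal Python sketch of how such a table can be tallied. The ratings from the problem are not reproduced in these notes, so the `ratings` list below is hypothetical sample data, and `ungrouped_frequency` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# Hypothetical ratings on the 1-10 scale (NOT the data from the problem).
ratings = [3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 8, 9, 7, 6, 3, 7]

def ungrouped_frequency(values, lowest, highest):
    """Return (value, frequency) pairs for every possible value, highest first."""
    counts = Counter(values)
    # Include values with zero frequency so the table has no missing rows.
    return [(v, counts.get(v, 0)) for v in range(highest, lowest - 1, -1)]

table = ungrouped_frequency(ratings, 1, 10)
print("Rating   f")
for value, f in table:
    print(f"{value:>6} {f:>3}")
print(f" Total {sum(f for _, f in table):>3}")
```

Listing every possible value, including those with a frequency of 0, keeps the table complete and makes gaps in the data visible.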

Real Limits of Class Intervals


Gaps cannot be ignored when you are determining the actual width of any class interval. The real
limits are located at the midpoint of the gap between adjacent tabled boundaries; that is, one-half
of one unit of measurement below the lower tabled boundary and one-half of one unit of
measurement above the upper tabled boundary.
Constructing Frequency Distributions

2. The IQ scores for a group of 35 high school dropouts are as follows

(a) Construct a frequency distribution for grouped data.


(b) Specify the real limits for the lowest class interval in this frequency distribution.
To construct a frequency distribution for grouped data from the given IQ scores, follow
these steps:
1. Sort the IQ scores in ascending order.
2. Determine the range and the number of intervals.
3. Calculate the width of each interval.
4. Create the class intervals.
5. Count the number of IQ scores falling within each interval.
6. Construct the frequency distribution table.

Given the IQ scores:
69, 71, 75, 77, 79, 80, 80, 84, 85, 86, 87, 89, 90, 90, 90, 91, 93, 94, 95, 95, 96, 98, 98, 99,
100, 100, 103, 104, 105, 108, 109, 110, 112, 123.
1. Sort the IQ scores:
69, 71, 75, 77, 79, 80, 80, 84, 85, 86, 87, 89, 90, 90, 90, 91, 93, 94, 95, 95, 96, 98, 98, 99,
100, 100, 103, 104, 105, 108, 109, 110, 112, 123
2. Determine the range:
Range = Maximum value - Minimum value
= 123 - 69 = 54
3. Decide the number of intervals (about 10 here) and calculate the width of each interval:
Width = Range / Number of intervals
= 54/10 = 5.4. Round off to a convenient number, such as 5.
4. Determine the class intervals:
Class intervals
65-69, 70-74, 75-79, 80-84, 85-89, 90-94, 95-99, 100-104, 105-109, 110-114, 115-119,
120-124.
5. Construct the frequency distribution table by counting the scores that fall within each class interval.
6. The real limits are located at the midpoint of the gap between adjacent tabled boundaries;
that is, one-half of one unit of measurement below the lower tabled boundary and one-half of
one unit of measurement above the upper tabled boundary.
65 - 0.5 = 64.5
69 + 0.5 = 69.5
The real limits for the lowest class interval are 64.5-69.5.
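
The tallying and real-limit steps above can be sketched in Python. This is an illustrative sketch, not the textbook's table: it uses the scores exactly as listed in these notes (the list contains 34 values), a class width of 5 starting at 65, and the helper names `grouped_frequency` and `real_limits` are made up for this example.

```python
# The IQ scores exactly as listed in these notes.
iq_scores = [69, 71, 75, 77, 79, 80, 80, 84, 85, 86, 87, 89, 90, 90, 90, 91,
             93, 94, 95, 95, 96, 98, 98, 99, 100, 100, 103, 104, 105, 108,
             109, 110, 112, 123]

def grouped_frequency(values, start, width):
    """Tally values into classes of equal width, beginning at `start`."""
    table = []
    lower = start
    while lower <= max(values):
        upper = lower + width - 1          # tabled upper boundary
        f = sum(lower <= v <= upper for v in values)
        table.append((lower, upper, f))
        lower += width
    return table

def real_limits(lower, upper, unit=1):
    """Real limits lie half a unit of measurement beyond the tabled boundaries."""
    return (lower - unit / 2, upper + unit / 2)

table = grouped_frequency(iq_scores, start=65, width=5)
for lower, upper, f in reversed(table):    # highest class listed first
    print(f"{lower:>3}-{upper:<3} {f:>2}")
print("Real limits of the lowest class:", real_limits(*table[0][:2]))
```

Every score falls into exactly one class because the classes are contiguous, equal in width, and do not overlap.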
3. What are some possible poor features of the following frequency distribution?

Not all observations can be assigned to one and only one class (because of gap between
20–22 and 25–30 and overlap between 25–30 and 30–34). All classes are not equal in
width (25–30 versus 30–34). All classes do not have both boundaries (35–above).
Outliers (Very extreme score)
An outlier is an extremely high or extremely low data point relative to the nearest data point and
the rest of the neighboring co-existing values in a data graph or dataset.
Example
The value in the month of January is significantly less than in the other months.

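4. Identify any outliers in each of the following sets of data collected from nine college students.

Z-scores below use the sample standard deviation (n - 1 in the denominator).
1. Summer Income:
Mean = $6,330.00
Standard Deviation = $7,787.32
Z-scores:
$6,450: 0.015
$4,820: -0.194
$5,650: -0.087
$1,720: -0.592
$600: -0.736
$0: -0.813
$3,482: -0.366
$25,700: 2.487
$8,548: 0.285
Outlier: $25,700 (z ≈ 2.49, far above every other income)
2. Family Size:
Mean = 5.33
Standard Deviation = 4.97
Z-scores:
2: -0.670
4: -0.268
3: -0.469
6: 0.134
18: 2.546
2: -0.670
6: 0.134
3: -0.469
4: -0.268
Outlier: 18 (z ≈ 2.55, far above every other family size)
3. GPA:
Mean = 3.08
Standard Deviation = 0.59
Z-scores:
2.30: -1.319
4.00: 1.564
3.56: 0.818
2.89: -0.318
2.15: -1.573
3.01: -0.115
3.09: 0.021
3.50: 0.716
3.20: 0.207
No outliers (every z-score lies within ±1.6).
Note: with only nine observations, no z-score based on the sample standard deviation can exceed
(n - 1)/√n = 8/3 ≈ 2.67, so a cutoff of 3 can never be reached here. The outliers are identified
by how far they stand apart from the remaining values.
Therefore, the outliers in the data are:
• Summer Income: $25,700
• Family Size: 18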
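
The z-scores can be checked by recomputing them from the nine listed values in each set. The sketch below assumes the sample standard deviation (n - 1 divisor, `statistics.stdev`); using the population formula instead would change the numbers slightly. `z_scores` and `most_extreme` are illustrative helper names.

```python
from statistics import mean, stdev

# Data as listed in the solution above.
summer_income = [6450, 4820, 5650, 1720, 600, 0, 3482, 25700, 8548]
family_size = [2, 4, 3, 6, 18, 2, 6, 3, 4]
gpa = [2.30, 4.00, 3.56, 2.89, 2.15, 3.01, 3.09, 3.50, 3.20]

def z_scores(values):
    """Return each value's distance from the mean in sample-SD units."""
    m = mean(values)
    s = stdev(values)          # assumption: sample SD (n - 1 divisor)
    return [(x - m) / s for x in values]

def most_extreme(values):
    """Return the (value, z) pair with the largest absolute z-score."""
    return max(zip(values, z_scores(values)), key=lambda pair: abs(pair[1]))

for name, data in [("Summer income", summer_income),
                   ("Family size", family_size),
                   ("GPA", gpa)]:
    value, z = most_extreme(data)
    print(f"{name}: mean={mean(data):.2f}, sd={stdev(data):.2f}, "
          f"most extreme value={value} (z={z:.2f})")
```

A value whose z-score is far larger in magnitude than all the others in its set is a candidate outlier; the final judgment still depends on the context of the data.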
INTERPRETING DISTRIBUTIONS
In data science, interpreting distributions involves analyzing the patterns and characteristics of
data sets to extract insights and make informed decisions.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution.
Graphs for quantitative data
For visualizing quantitative data, histograms and box plots are commonly used.
Histogram:
A histogram is a bar-type graph that represents the distribution of quantitative (numerical)
data. The common boundaries between adjacent bars emphasize the continuity of the data, as with
continuous variables. It consists of a series of bars, where each bar
represents a range of values (bin) and the height of the bar indicates the frequency of data points
falling within that range. Histograms are useful for visualizing the shape, center, and spread of
the data distribution.
Features of histograms
• Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
• Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
• The intersection of the two axes defines the origin at which both numerical scales equal 0.
• Numerical scales always increase from left to right along the horizontal axis and from bottom
to top along the vertical axis.
• The body of the histogram consists of a series of bars whose heights reflect the frequencies
for the various classes. Notice that adjacent bars in histograms have common boundaries that
emphasize the continuity of quantitative data for continuous variables. The introduction of
gaps between adjacent bars would suggest an artificial disruption in the data more
appropriate for discrete quantitative variables or for qualitative variables.
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph: a line graph for
quantitative data that also emphasizes the continuity of continuous variables. Frequency
polygons may be constructed directly from frequency distributions.
Transformation of a histogram into a frequency polygon
1. Construct a Histogram: Start by creating a histogram to represent the frequency distribution
of the data. Divide the range of the data into intervals (bins) and count the number of data
points falling into each interval.
2. Identify Midpoints and Heights: For each bar in the histogram, identify the midpoint of the
interval and the height of the bar (representing the frequency or relative frequency of data
points in that interval).
3. Plot Points: Plot each midpoint on the horizontal axis, with its corresponding height on the
vertical axis. These points represent the tops of the bars in the histogram.
4. Connect the Points: Connect the points on the graph using straight line segments, starting
from the leftmost point and ending at the rightmost point. To anchor the polygon to the
horizontal axis, extend each end to the midpoint of the adjacent empty class (frequency 0)
just below the lowest class and just above the highest class.
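
The transformation steps can be sketched in Python by computing the points that a frequency polygon connects. The class boundaries and frequencies below are hypothetical, and the sketch assumes equal class widths. It also assumes the common convention of anchoring the polygon to the horizontal axis with a zero-frequency point one class width beyond each end.

```python
def polygon_points(classes):
    """classes: list of (lower, upper, frequency) with equal widths, in order.

    Returns (midpoint, frequency) pairs, anchored to the horizontal axis by
    adding a zero-frequency point one class width beyond each end.
    """
    width = classes[1][0] - classes[0][0]
    points = [((lo + hi) / 2, f) for lo, hi, f in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

# Hypothetical grouped distribution (illustrative only).
classes = [(10, 19, 2), (20, 29, 5), (30, 39, 8), (40, 49, 4), (50, 59, 1)]
points = polygon_points(classes)
for x, f in points:
    # Crude text rendering: one asterisk per observation at each midpoint.
    print(f"{x:>5.1f} | {'*' * f}")
```

Passing `points` to any plotting library as x and y coordinates of a line plot would draw the polygon itself.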
Example Problem
5. The following frequency distribution shows the annual incomes in dollars for a group of
college graduates.

a) Construct a histogram.
b) Construct a frequency polygon.
c) Is this distribution balanced or lopsided?
To determine if the distribution is balanced or lopsided, we typically look at the shape of the
histogram or frequency polygon. In this case, both the histogram and frequency polygon show
that the distribution is lopsided, with more data points concentrated on the left side (lower
income ranges) and fewer data points on the right side (higher income ranges). This suggests that
the distribution is positively skewed, meaning it has a longer tail on the right side. Thus, the
distribution is lopsided or skewed to the right.
6. The number of friends reported by Facebook users is summarized in the following
frequency distribution

a) Convert to a histogram.
b) Why would it not be possible to convert to a stem and leaf display?
It would not be possible to convert this distribution to a stem and leaf display because a stem
and leaf display requires every individual observation, whereas a frequency distribution for
grouped data reports only how many observations fall in each class; the identities of the
individual values have been lost. In addition, with 200 data points (the total number of users
across the frequency categories), a stem and leaf display would be impractical and hard to read.
Stem and leaf displays are better suited to smaller data sets, where they show the distribution
of values in a compact and readable form.
Stem and Leaf Displays
Still another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf
displays are ideal for summarizing distributions, such as that for weight data, without destroying
the identities of individual observations.
Selection of Stems
Stem values are not limited to units of 10. Depending on the data, identify the stem with one or
more leading digits that culminates in some variation on a stem value of 10, such as 1, 100, 1000,
or even .1, .01, .001, and so on.
7. Construct a stem and leaf display from the data:
The stem represents the tens digit of the weight.
The leaves represent the units digit of the weight.
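
A minimal Python sketch of building a stem and leaf display. The weights below are hypothetical (the data for this problem are not reproduced in these notes). Each value is split with `divmod(v, 10)`, so the leaf is the units digit and the stem is everything in front of it, which for three-digit weights means the hundreds and tens digits together.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of units-digit leaves."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical weights (illustrative only; not the data from the problem).
weights = [160, 193, 226, 152, 180, 205, 163, 157, 151, 157, 220, 145, 158,
           172, 165]

display = stem_and_leaf(weights)
for stem in range(min(display), max(display) + 1):
    # Stems with no observations are still printed so gaps stay visible.
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Because every leaf is kept, the original observations can be read back from the display, which is the property that a grouped frequency distribution lacks.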

8. Construct a stem and leaf display for the following IQ scores obtained from a group of
four-year-old children
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
For qualitative (nominal) data, a bar graph is often used to represent the frequency or count of
each category.
Bar Graph
Gaps between adjacent bars emphasize the discontinuous nature of the data. A bar graph, also
known as a bar chart, is a graphical representation of data where the length or height of bars
corresponds to the frequency, count, or other numerical measures of different categories or
groups.
9. Construct a bar graph for the data shown in the following table:
AVERAGES
Averages consist of numbers (or words) about which the data are, in some sense, centered. They
are often referred to as measures of central tendency: the several types of average yield numbers
or words that attempt to describe, most generally, the middle or typical value for a distribution.
This section focuses on three measures of central tendency: the mode, the median, and the mean.
Each of these has its special uses, but the mean is the most important average in both descriptive
and inferential statistics. Each average summarizes a set of data points in a single value.
MODE
The mode reflects the value of the most frequently occurring score.
More Than One Mode
Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data. For instance, the distribution of
weights for both male and female statistics students would most likely be bimodal, reflecting the
combination of two separate weight distributions—a heavier one for males and a lighter one for
females.
10. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60,
65, 63.
The retirement age 63 appears most frequently, occurring 4 times. So, the mode for this set of
retirement ages is 63.
11. The owner of a new car conducts six gas mileage tests and obtains the following results,
expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find the mode for these
data.
Here, the mileage 27.4 appears twice, which is more than any other value. So, the mode for this
set of gas mileage tests is 27.4 miles per gallon.
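
Both answers can be checked with Python's standard library. This sketch uses `statistics.multimode` (available from Python 3.8), which returns a list and therefore also handles distributions with more than one mode; the third data set is a made-up example of a bimodal case.

```python
from statistics import multimode

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns every value tied for the highest frequency.
print(multimode(retirement_ages))   # [63]
print(multimode(gas_mileage))       # [27.4]

# Made-up example with two equally frequent values (bimodal).
print(multimode([1, 1, 2, 2, 3]))   # [1, 2]
```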
MEDIAN
The median reflects the middle value when observations are ordered from least to most. The
median splits a set of ordered observations into two equal parts, the upper and lower halves.
FINDING THE MEDIAN
12. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65,
63.
Arrange the retirement ages in ascending order:
45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70.
Since there are 11 data points, the median will be the middle value. In this case, the middle value
is the sixth value, which is 63.
So, the median retirement age for this set of data is 63.
13. Find the median for the following gas mileage tests: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9.
Arrange the values in ascending order:
26.3, 26.6, 26.9, 27.4, 27.4, 28.7
Since there are 6 data points (an even number), the median is the average of the two middle
values. Here, the two middle values are 26.9 and 27.4.
Calculating the average:
Median = (26.9 + 27.4) / 2
Median = 54.3 / 2
Median = 27.15
So, the median for this set of gas mileage tests is 27.15 miles per gallon.
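
The two cases (odd and even number of observations) can be sketched in Python as follows. `median_by_hand` is an illustrative helper that mirrors the steps above; `statistics.median` from the standard library is printed alongside it for comparison.

```python
from statistics import median

def median_by_hand(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the pair

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(median_by_hand(retirement_ages), median(retirement_ages))
print(f"{median_by_hand(gas_mileage):.2f}", f"{median(gas_mileage):.2f}")
```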
MEAN
The mean is the most common average, one you have doubtless calculated many times. The
mean is found by adding all scores and then dividing by the number of scores.

Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores) or as
a sample (a subset of scores).
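
In words, mean = (sum of all scores) / (number of scores); the arithmetic is the same whether the scores are treated as a population or as a sample. As a short illustration, the sketch below applies it to the retirement ages and gas mileage values used in the earlier problems. `mean_by_hand` is an illustrative helper name.

```python
from statistics import mean

def mean_by_hand(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(f"Retirement ages: sum={sum(retirement_ages)}, n={len(retirement_ages)}, "
      f"mean={mean_by_hand(retirement_ages):.2f}")
print(f"Gas mileage: mean={mean_by_hand(gas_mileage):.2f} miles per gallon")

# The standard library gives the same result.
print(f"{mean(retirement_ages):.2f}")
```

For the retirement ages, the single low value of 45 pulls the mean below both the median and the mode (63), which shows how the mean responds to extreme scores.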
