
T1. DescriptiveStatistics

This document outlines a lesson on descriptive statistics. It introduces key concepts like population and sample, and defines descriptive statistics as organizing and summarizing sample data through tables, graphs, and numerical values. The lesson covers frequency distributions, graphical representations of data including histograms and frequency polygons, and numerical descriptions like measures of central tendency (mean, median, mode), position (quantiles and quartiles), and dispersion (range, variance, standard deviation). The goal is to teach students how to properly analyze and summarize sample data.

Lesson 1.

Descriptive Statistics

Irene García-Camacha Gutiérrez


Elements of Probability and Statistics
1st course of Mathematics Degree

January 8, 2024

I. García-Camacha Descriptive Statistics January 8, 2024 1 / 27


Outline

1 Preliminaries

2 Frequency Distribution

3 Graphical representation of data

4 Numerical Description of Data
    Measures of Central Tendency
    Measures of Position
    Measures of Dispersion
    Measures of Distribution

5 z-scores

6 Box-Plots



Preliminaries

Introduction
Statistics is the science of data. It involves collecting, classifying,
summarizing, organizing, analyzing, and interpreting data. It also
involves model building.
Population: A population is the collection or set of all objects or
measurements that are of interest to the collector.
Sample: A sample is a subset of data selected from a population.
Example: In a clinical study, the population is all the patients with
the same disease, whereas the sample is the subset of patients
participating in the study.


Descriptive Statistics: The methods of organizing, summarizing, and
presenting sample data in the form of tables, graphs, or numerical
values are called descriptive statistics.
A measurement representing some feature of the sample is called a statistic.

Statistical Inference: The methods of estimating, predicting, making
decisions, and generalizing about a population based on the information
contained in a sample are called statistical inference.
(Statistical Inference → 2nd course).
A “true” measurable characteristic of the population that cannot, in
practice, be known with certainty is called a parameter.
Inferential statistics uses PROBABILITY THEORY → Main part
of this course (next units...)


Types of variables
Variable: It is a feature of interest of the sample.

Types of variables:

Remark: Properly determining the type of variable to analyze is the key
step for any statistical analysis.

Exercise 1. Classify each of the following variables as one of the previous types.
Justify your answer.
Sex (male, female).
Seasons.
Rating of satisfaction (1 to 7).
Body type (slim, average, heavy).
Number of students in your class.
Temperature (degrees Fahrenheit).
Position standing in line.
Political affiliation (republican, democrat).
Weight (in pounds) of an infant.
Civil status.



Frequency Distribution

Types of frequency tables


Frequency tables for UNGROUPED data


Frequency tables for GROUPED data


Number of class intervals k: If N is small, the nearest integer to $\sqrt{N}$ (it must
be between 5 and 20). Otherwise, the nearest integer to $1 + 3.322 \log_{10}(N)$ (Sturges' rule).
Class interval width: round up $\frac{l_{k+1} - l_0}{k}$.
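
As a rough illustration of the two rules above, the following Python sketch picks the number of classes and the class width. It rests on assumptions not stated in the slide: the small-sample rule is read as the square root of N, the large-sample rule as Sturges' rule with a base-10 logarithm, and the cut-off `small_threshold` between "small" and "large" samples is a hypothetical choice.

```python
import math


def number_of_classes(n_obs: int, small_threshold: int = 100) -> int:
    """Suggest a number of class intervals k for n_obs observations.

    small_threshold is a hypothetical cut-off; the slide does not define "small".
    """
    if n_obs < small_threshold:
        # Square-root rule, clamped to the range 5..20 required by the slide.
        k = round(math.sqrt(n_obs))
        return min(max(k, 5), 20)
    # Sturges' rule with a base-10 logarithm.
    return round(1 + 3.322 * math.log10(n_obs))


def class_width(lower: float, upper: float, k: int) -> int:
    """Width of each class: the covered range divided by k, rounded up."""
    return math.ceil((upper - lower) / k)


k_small = number_of_classes(30)      # square-root rule
k_large = number_of_classes(1000)    # Sturges' rule
width = class_width(10, 47, k_small)
print(k_small, k_large, width)
```

Rounding the width up guarantees that k classes cover the whole range of the data.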



Graphical representation of data

Graphs for UNGROUPED data


A graph of bars whose heights represent the frequencies (or relative
frequencies) of the respective categories is called a bar graph.
A circle divided into sectors that represent the percentages of a sample that
belong to different categories is called a pie chart.


Graphs for GROUPED data

A histogram is a graph in which classes are marked on the horizontal axis
and either the frequencies, relative frequencies, or percentages are
represented by the heights on the vertical axis. In a histogram, the bars are
drawn adjacent to each other without any gaps.



A frequency polygon is the line that joins the midpoints of the tops of the bars
of a histogram, one per class interval. To close the polygon we add a class
interval with frequency zero at each end. It is a first approximation to the
DENSITY function of a random variable.
A cumulative frequency graph represents the running total of frequencies for
each bin in a data set. It allows us to determine the number of data points
below (or above) a particular value.
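
The points of a frequency polygon and the running totals of a cumulative frequency graph can be computed before any plotting. The sketch below uses hypothetical class limits and frequencies and assumes classes of equal width; it only prepares the coordinates, leaving the drawing to whichever plotting tool is preferred.

```python
from itertools import accumulate

# Hypothetical grouped data: class limits l_0..l_k and absolute frequencies n_i.
limits = [0, 10, 20, 30]
counts = [4, 9, 7]

width = limits[1] - limits[0]  # assumes all classes share the same width
midpoints = [(a + b) / 2 for a, b in zip(limits, limits[1:])]

# Frequency polygon: class midpoints against frequencies, closed with a
# zero-frequency class added on each side.
polygon = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, counts))
    + [(midpoints[-1] + width, 0)]
)

# Cumulative frequencies N_i: running total of the n_i.
cumulative = list(accumulate(counts))

print(polygon)
print(cumulative)
```

Here `cumulative[1]` is the number of observations below the upper limit of the second class.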



Numerical Description of Data



Numerical Description of Data Measures of Central Tendency

Mean

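Add up all the numbers and divide by how many numbers there are.
Ungrouped data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
Grouped data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i$, where $x_i$ is the class mark and $n_i$ the frequency of class $i$, and $k$ is the number of classes.
It can only be computed for NUMERICAL data.
It is very sensitive to outliers.
It is not a good indicator when the data distribution is not symmetric.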
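
A minimal sketch of both versions of the mean follows. The data values are hypothetical, and for the grouped case `marks` is assumed to hold the class marks (midpoints) and `counts` the absolute frequencies.

```python
def mean_ungrouped(xs):
    """Arithmetic mean of raw observations."""
    return sum(xs) / len(xs)


def mean_grouped(marks, counts):
    """Mean of grouped data: class marks weighted by their frequencies."""
    n = sum(counts)
    return sum(n_i * x_i for x_i, n_i in zip(marks, counts)) / n


raw_mean = mean_ungrouped([1, 2, 3, 4])
grouped_mean = mean_grouped([5, 15, 25], [2, 5, 3])
print(raw_mean, grouped_mean)
```

The grouped mean is an approximation of the mean of the raw data, since every observation in a class is replaced by the class mark.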


Median
It is the middle number of the ordered data set. It is found by putting
the numbers in order and taking the actual middle number if there is
one, or the average of the two middle numbers if not.
Ungrouped data: Me = order the data and locate the central value.
Grouped data: $Me = l_{i-1} + \frac{n/2 - N_{i-1}}{n_i} (l_i - l_{i-1})$, where class $i$ is the median class, $n_i$ its frequency, and $N_{i-1}$ the cumulative frequency up to the previous class.
It can only be computed for NUMERICAL and ORDINAL data.
It is more robust than the mean.
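
To see what robustness means here, the sketch below compares the median and the mean of a small hypothetical data set before and after adding one extreme value. It uses the standard-library `statistics` module.

```python
import statistics

data = [2, 4, 4, 5, 7, 9]        # hypothetical observations
with_outlier = data + [400]      # same data plus one extreme value

median_before = statistics.median(data)          # average of the two middle values
median_after = statistics.median(with_outlier)   # actual middle value
mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)

print(median_before, median_after)
print(mean_before, mean_after)
```

The median moves by half a unit, while the mean is pulled far away from the bulk of the data.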


Mode
It is the most commonly occurring value.
Ungrouped data: Mo = the most frequent category or number (or several of
them, if they share the same highest frequency).
Grouped data: $Mo = l_{i-1} + \frac{\delta_1}{\delta_1 + \delta_2} (l_i - l_{i-1})$, where $\delta_1 = n_i - n_{i-1}$ and
$\delta_2 = n_i - n_{i+1}$.
Theoretically it can be computed for ALL types of data, but it is only
representative for nominal, ordinal and discrete (with few different
values) data.



Numerical Description of Data Measures of Position

Quantiles, Quartiles, Deciles and Percentiles


Quantiles are cut points dividing the ordered observations in a sample into
groups with equal probabilities.
Ungrouped data: q-quantiles are values that partition a set of ordered
values into q subsets of (nearly) equal sizes.
Grouped data: $Q_\alpha = l_{i-1} + \frac{\alpha \cdot \frac{n}{100} - N_{i-1}}{n_i} (l_i - l_{i-1})$, where $n$ is the sample size and class $i$ is the first one whose cumulative frequency $N_i$ reaches $\alpha \cdot \frac{n}{100}$.
They can be calculated for ORDINAL and NUMERICAL data.
Common quantiles are quartiles (four groups), deciles (ten groups), and
percentiles (hundred groups).
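
The grouped-data formula can be sketched as a single function that covers quartiles, deciles and percentiles. It assumes that the interpolation starts at the lower limit of the class containing the quantile, as in the grouped median formula, so that the 50th percentile reproduces the median. The class limits and frequencies are hypothetical.

```python
def grouped_percentile(alpha, limits, counts):
    """alpha-th percentile (0 < alpha < 100) of grouped data.

    limits holds the class limits l_0..l_k and counts the frequencies n_1..n_k.
    """
    if not 0 < alpha < 100:
        raise ValueError("alpha must lie strictly between 0 and 100")
    n = sum(counts)
    target = alpha * n / 100      # position of the quantile among n observations
    cumulative = 0                # N_{i-1}: frequency accumulated before class i
    for i, n_i in enumerate(counts):
        if n_i > 0 and cumulative + n_i >= target:
            lower, upper = limits[i], limits[i + 1]
            return lower + (target - cumulative) / n_i * (upper - lower)
        cumulative += n_i
    raise ValueError("the frequency table is empty")


limits = [0, 10, 20, 30, 40]
counts = [2, 5, 8, 5]
q1 = grouped_percentile(25, limits, counts)
median = grouped_percentile(50, limits, counts)
q3 = grouped_percentile(75, limits, counts)
print(q1, median, q3)
```

Deciles follow from the same call with alpha equal to 10, 20, and so on.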



Numerical Description of Data Measures of Dispersion

Range, IQR, variance and standard deviation

The range of a data set is $R = x_{\max} - x_{\min}$. It is severely sensitive to
extreme values.
The interquartile range (IQR) is $IQR = Q_3 - Q_1$. It is a robust statistic.
The variance is $S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$. For a quick calculation, $S^2 = \overline{x^2} - \bar{x}^2$.
The standard deviation is $S = +\sqrt{S^2}$.
The sample variance or quasi-variance is $S_c^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
(Statistical Inference → 2nd course).
The sample standard deviation or quasi-standard deviation is $S_c = +\sqrt{S_c^2}$.
The coefficient of variation is $CV = \frac{S}{|\bar{x}|}$. It is dimensionless.
They can be calculated for NUMERICAL data.
The last five statistics depend on the mean, so they are sensitive to extreme values.
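
The dispersion measures above can be checked numerically with a short sketch. The data set is hypothetical, and the function name `dispersion_summary` is just an illustrative helper. The variance is computed both from its definition and with the quick formula, and the results are compared with the standard-library functions `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n - 1).

```python
import math
import statistics


def dispersion_summary(xs):
    """Range, variance, quasi-variance, standard deviations and CV of raw data."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    variance = squared_deviations / n
    variance_quick = sum(x * x for x in xs) / n - mean ** 2  # mean of squares minus squared mean
    quasi_variance = squared_deviations / (n - 1)
    sd = math.sqrt(variance)
    return {
        "range": max(xs) - min(xs),
        "variance": variance,
        "variance_quick": variance_quick,
        "sd": sd,
        "quasi_variance": quasi_variance,
        "quasi_sd": math.sqrt(quasi_variance),
        "cv": sd / abs(mean),
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = dispersion_summary(data)
print(summary)
print(statistics.pvariance(data), statistics.variance(data))
```

The quick formula saves one pass over the data but can lose precision when the mean is large compared with the spread.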

Variability and location



Numerical Description of Data Measures of Distribution

Skewness
Skewness denotes the degree of asymmetry in the data. It is the tendency of
a distribution to depart from symmetry.
Pearson’s 1st coefficient of skewness: $A_{P1} = \frac{\bar{x} - Mo}{S}$.
Pearson’s 2nd coefficient of skewness: $A_{P2} = \frac{3(\bar{x} - Me)}{S}$.
Fisher’s coefficient of skewness: $A_F = \frac{\frac{1}{n} \sum_i n_i (x_i - \bar{x})^3}{S^3}$.
They can only be calculated for NUMERICAL data.
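
As a hedged illustration, the sketch below computes Fisher's coefficient and Pearson's second coefficient on raw data, which corresponds to taking every frequency equal to 1 unless `counts` is given. Both data sets are hypothetical, and S is the standard deviation with divisor n.

```python
import math
import statistics


def fisher_skewness(values, counts=None):
    """Fisher's coefficient: third central moment divided by S cubed."""
    if counts is None:
        counts = [1] * len(values)   # raw data: each value observed once
    n = sum(counts)
    mean = sum(c * x for x, c in zip(values, counts)) / n
    variance = sum(c * (x - mean) ** 2 for x, c in zip(values, counts)) / n
    third_moment = sum(c * (x - mean) ** 3 for x, c in zip(values, counts)) / n
    return third_moment / variance ** 1.5


def pearson_second_skewness(values):
    """Pearson's second coefficient: 3 (mean - median) / S, on raw data."""
    mean = statistics.mean(values)
    sd = math.sqrt(statistics.pvariance(values))
    return 3 * (mean - statistics.median(values)) / sd


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(fisher_skewness(symmetric), fisher_skewness(right_skewed))
print(pearson_second_skewness(right_skewed))
```

A value near zero suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail.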


Kurtosis

Kurtosis denotes the degree of peakedness of the frequency curve. It is used to
describe the frequency curve with regard to the sharpness of its peak.
Coefficient of kurtosis: $K = \frac{\frac{1}{n} \sum_i n_i (x_i - \bar{x})^4}{S^4} - 3$.
It can only be calculated for NUMERICAL data.
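
The coefficient of kurtosis can be sketched in the same way as the other moment-based measures. The function below works on raw data by default, taking every frequency equal to 1, and accepts optional frequencies; the example values are hypothetical.

```python
def kurtosis_coefficient(values, counts=None):
    """Fourth central moment divided by S to the fourth power, minus 3."""
    if counts is None:
        counts = [1] * len(values)   # raw data: each value observed once
    n = sum(counts)
    mean = sum(c * x for x, c in zip(values, counts)) / n
    variance = sum(c * (x - mean) ** 2 for x, c in zip(values, counts)) / n
    fourth_moment = sum(c * (x - mean) ** 4 for x, c in zip(values, counts)) / n
    return fourth_moment / variance ** 2 - 3


flat = kurtosis_coefficient([1, 2, 3, 4, 5])
two_point = kurtosis_coefficient([-1, 1])
print(flat, two_point)
```

Subtracting 3 makes the normal distribution the reference: negative values indicate a flatter curve than the normal one and positive values a sharper peak with heavier tails.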



z-scores

z-scores
The z-score indicates how many standard deviations an individual
observation x is from the center of the data set, its mean.
The z-score of an observation x is calculated as $z = \frac{x - \bar{x}}{S}$.
It is quite useful for making fair comparisons between observations that were
measured using different scales or units. It is dimensionless.
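
A minimal sketch of the comparison idea follows. The two grading scales and all the numbers are hypothetical, chosen only to show how observations on different scales become comparable.

```python
def z_score(x, mean, sd):
    """Number of standard deviations between x and the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical example: two marks recorded on different scales.
z_first = z_score(7.5, mean=6.0, sd=1.5)    # mark on a 0-10 scale
z_second = z_score(620, mean=500, sd=100)   # mark on a 200-800 scale
print(z_first, z_second)
```

The observation with the larger z-score stands further above its own group, whatever the original units were.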

Exercise 2. A professor at a university records the scores of students on a
statistics exam. The scores are normally distributed with a mean of 68 and
a standard deviation of 10. One of the students scored 85 on the exam.
Calculate the z-score for this student's score and interpret its meaning.
A student from another university scored 6.2 points on this exam, but the
scores of his class are normally distributed with a mean of 4.7 and a
standard deviation of 2.3. If a scholarship is to be given to the better of
these two students, who deserves it more?
Box-Plots

Box-Plot

A box-plot (also called box-and-whisker plot) is a diagrammatic representation of
data which summarizes their center, spread, and the extent and nature of any
departure from symmetry, and identifies the outliers.

How to build a box-plot:


1 Draw a vertical measurement axis and mark Q1, Q2 (median), and Q3 on
this axis.
2 Construct a rectangular box whose bottom edge lies at the lower quartile,
Q1 and whose upper edge lies at the upper quartile, Q3.
3 Draw a horizontal line segment inside the box through the median.
4 Extend the lines from each end of the box out to the farthest observation
that is still within 1.5(IQR) of the corresponding edge. These lines are called
whiskers.

5 Draw an open circle (or asterisks *) to identify each observation that falls
between 1.5(IQR) and 3(IQR) from the edge to which it is closest; these are
called mild outliers.
6 Draw a solid circle to identify each observation that falls more than 3(IQR)
from the closest edge; these are called extreme outliers.
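
The numbers needed to draw a box-plot by hand can be obtained with a short sketch of the steps above. It is an illustration under assumptions: the data are hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method, which may differ slightly from the quartile convention used in class.

```python
import statistics


def boxplot_summary(data):
    """Quartiles, whisker ends, and mild and extreme outliers of a sample."""
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4)   # default method: "exclusive"
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Whiskers reach the farthest observations within 1.5 IQR of the box edges.
    inside = [x for x in xs if inner_low <= x <= inner_high]
    mild = [x for x in xs
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in xs if x < outer_low or x > outer_high]
    return {
        "quartiles": (q1, q2, q3),
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": mild,
        "extreme_outliers": extreme,
    }


sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 45]
summary = boxplot_summary(sample)
print(summary)
```

In this sample the value 30 lies between 1.5 and 3 IQR above the upper quartile, so it is a mild outlier, while 45 lies beyond 3 IQR and is an extreme outlier.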


Exercise 3. The following data give the time in months from hire to
promotion to chief pharmacist for a random sample of 25 employees from
a certain group of employees in a large corporation of drugstores.

Construct a box plot. Do the data appear to be symmetrically distributed
along the measurement axis?


Recap

Exercise 4. Fill in the following table with the types of frequencies,
graphs and statistics studied for each type of variable.
