
Apuntes Estadistica

The document introduces basic statistical concepts, including what statistics is, common applications, and the steps of statistical analysis. It discusses databases, noting they contain variables, statistical units (observations), and different types of variables including numerical discrete and continuous variables. Examples are provided for each topic to illustrate the concepts.

Uploaded by

claudiazdeandres
Copyright
© All Rights Reserved

Topic 1: Introduction and basic concepts

1. What is Statistics? Examples of applications


2. Databases: elements, variables and observations

All rights reserved. Economic exploitation and transformation of this work are not permitted. Printing it in its entirety is permitted.
3. Types of variables
4. Data sources: population and sample
5. Statistical analysis with R Commander
Topic 1: Introduction and basic concepts
1. What is Statistics?
In everyday language, the term statistics is used to refer to numbers that describe some aspect of
the world. Statistics is better described as a discipline than as a science in its own right. Its name comes from the
word “state” because it was originally used only to describe matters related to politics and
government. Nowadays it has many applications, and almost all disciplines use it.
Statistics is much more than mere numbers: it is the discipline that addresses how to collect,
summarize, analyse, and interpret data, with the goal of obtaining information to draw conclusions
and make better, more rational decisions.
In the narrowest sense, the term denotes the data or numbers obtained from the analysis.
For example:
• Economic statistics: number of unemployed, inflation rate, …
• Demographic statistics: birth rate, life expectancy…
• Sports statistics: goals scored, number of red cards in a football match…
• Meteorological statistics: temperature, rain, …
Applications of Statistics
• In Accounting: audits, …
• In Finance: analysis and prediction of the value of a firm, …
• In Marketing: information about consumption habits, effectiveness of ad campaigns, …
• In Economics: predictions of economic indicators, analysis of the effects of a policy, …
• In Politics: opinion polls, analysis of electoral results, social indicators, …
• In Sustainability: UN 2030 Sustainable Development Goals (SDG). Indicators to increase
visibility of vulnerable groups and to detect the degree of attainment of 17 goals (No poverty,
Zero hunger, Good health and well-being, Quality education, Gender equality, etc.)
• … and many more: in sports, medicine, engineering, …
Steps in statistical analysis:
1. Definition of the problem
2. Definition of the statistical units, the variables (see section 3, Types of statistical
variables) and the population
3. Sampling: simple random sampling or stratified random sampling
4. Collect data
5. Analysis of collected data (different methods)
6. Interpretation and decision-making
2. Databases (DB): elements, variables and observations
Data are the collected features about the phenomenon under study.
• Time series data: data that evolve over time.
o Example: GNP (Gross National Product) from 1970 to 2021
• Static data: time is fixed. Statistical units are firms, countries, etc. This is the only type of
data treated in this course.
A data matrix (or data set) organizes the collected data. It has
three elements: the variables, the statistical units (also called
elements) and the observations.
• Variables: each column represents a variable. A variable is a characteristic of interest for the
statistical unit or element. Different types of variables require different treatments. Examples
of statistical variables:
o Vote of “madrileños”: Cs, IU, PP, PSOE, UP, Vox, …
o Employment status of “getafenses”: unemployed, part time, full time, …
o Customer purchase satisfaction
o Number of newspapers bought by “madrileños” in a day
o Number of employees of Madrid firms
o Expenses of Spanish city councils
• Statistical units or elements: the entities (individuals) on which data are collected.
• Observations: the set of measurements collected for a particular element. Data (metrics)
recorded for each element or statistical unit.
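
The three elements of a data matrix can be sketched in a few lines of Python. This is only an illustration: the firm names, variable names and values below are invented, and a plain dictionary stands in for whatever software holds the real data.

```python
# Hypothetical data matrix: each row is a statistical unit (a firm),
# each column is a variable. All names and values are invented.
variables = ["sector", "employees", "expenses_eur"]

data_matrix = {
    "Firm A": ["retail", 55, 120000.0],
    "Firm B": ["industry", 3000, 9500000.0],
    "Firm C": ["services", 12, 48000.5],
}

def observation(unit):
    """Return the observation (all measurements) for one statistical unit."""
    return dict(zip(variables, data_matrix[unit]))

def column(variable):
    """Return the values of one variable across all statistical units."""
    j = variables.index(variable)
    return [row[j] for row in data_matrix.values()]

print(observation("Firm B"))
print(column("employees"))
```

Reading a row gives an observation, and reading a column gives all the recorded values of one variable.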

3. Types of statistical variables


A variable is a symbol, such as X, Y, H, x or D, that can take any value from a prescribed set of
values known as its domain. A variable that can take only one value is called a constant.
• Numerical (quantitative): Variables collected from measuring → Measures or numbers

o Discrete: if the variable can take only certain separated values, with nothing in between, we call it a
discrete variable. Discrete variables generally arise from enumerations or counts.
▪ Example: In a family the number of children N can take any of the values
0, 1, 2, 3…, but cannot be 2.5 or 3.842; it is a discrete variable.
▪ Data described by a discrete variable are called discrete data.
• Example: The number of children in each of 1000 families.
o Continuous: if the variable can take any value between two given values, we call it a
continuous variable. Continuous variables generally arise from measurements.
▪ Example: The height H of an individual can be 62 inches, 63.8 inches or 65.8341
inches, depending on the precision of the measurement; it is a continuous
variable.
▪ Data described by a continuous variable are called continuous data.
• Example: The heights of 100 university students.
• Categorical (qualitative): Variables are attributes, no measuring → Labels
o Nominal: no natural ordering
o Ordinal: naturally ordered classes



Notation: typically, the letters X, Y, Z are used. NOTE: Numerical codes for categorical variables
DO NOT make them numerical (ex: Male = 1, Female = 2). Examples:
• X = Number of employees in Madrid firms (upper case in definition)
• x1 = 55; x2 = 3000 (lower case for specific values, we add subscripts to indicate individuals)
• The colour C in the solar spectrum is a variable that can have the “values” red, orange, yellow,
green, blue, indigo and violet. It is possible to replace such values with numeric codes,
for example, giving the colour red the value 1, orange the value 2, etc.; the variable remains categorical.
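
A short Python sketch may help show why numeric codes do not turn a categorical variable into a numerical one. The five coded answers are hypothetical; only the coding Male = 1, Female = 2 comes from the note above.

```python
from collections import Counter
from statistics import mean

# Hypothetical coded answers, using the coding Male = 1, Female = 2.
codes = [1, 2, 2, 1, 2]
labels = {1: "Male", 2: "Female"}

# Arithmetic on the codes runs without error, but the result has no interpretation:
meaningless = mean(codes)  # a value between the codes is not a category

# Counting how often each category appears is the meaningful summary.
counts = Counter(labels[c] for c in codes)

print(meaningless)
print(counts)
```

The software happily averages the codes, so it is up to the analyst to remember the type of each variable.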

4. Population and sample; data sources

When collecting data about the characteristics of a group of individuals or objects, for example the
height or weight of the students of a university or the number of defective and non-defective locks
produced by a factory on a specific day, it is typically impossible or impractical to observe all
of the individuals, especially if there are many of them. Instead of studying the whole group, called the
population, a small part of the group, called the sample, is examined.
• Population: complete collection of individuals. In practice it is unusual to study all the
individuals of a population:
o The individuals may exist conceptually, but not in reality
o It may be economically infeasible to study the entire population
o The study might take so much time that it would be infeasible and, moreover, the
population might change over the time span of the study
o The study may imply the destruction of individuals
• Sample: a subset of individuals drawn from the population



o To draw valid conclusions, it must be representative of the population. If it is, we can infer
important conclusions about the population from the analysis of the sample.
▪ Inferential analysis: the part of statistics that deals with the conditions under which
such inferences are valid. It is the bridge between
the sample and the population (Confidence intervals and hypothesis testing →
Chapter 6)
▪ Probability: As we cannot be completely sure that such
inferences are true, conclusions are often stated in terms of
“probability”. (Probability models → Chapters 4 and 5)
▪ Descriptive statistics: the part of statistics that only describes and
analyses a given group (sample), without drawing any conclusions or
inferences about a larger group. It only works with
the sample. (Frequency tables, graphs, statistical measures → Chapters 2 and
3; univariate or bivariate)

[Figure: diagram labelled “Population”, “Inferential statistics” and “Probability models”]
o The sample selection method (sampling method) is very important
• Data sources:
o Available historical information
o From observations (observational studies)
o From experiments (experimental studies)
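
The two sampling methods named in the steps of statistical analysis (random sampling and stratified random sampling) can be sketched as follows. This is a minimal illustration under stated assumptions: the population of 1000 students, the two strata and the 10% sampling fraction are all hypothetical, and proportional allocation is just one possible way to stratify.

```python
import random

def simple_random_sample(population, size, seed=0):
    """Draw `size` distinct individuals, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, size)

def stratified_sample(strata, fraction, seed=0):
    """Draw the same fraction at random from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 1000 students split into two strata.
strata = {
    "undergraduate": [f"U{i}" for i in range(800)],
    "postgraduate": [f"P{i}" for i in range(200)],
}
population = strata["undergraduate"] + strata["postgraduate"]

srs = simple_random_sample(population, 100)
strat = stratified_sample(strata, 0.10)
print(len(srs), len(strat))
```

The stratified sample always contains 80 undergraduates and 20 postgraduates, whereas in the simple random sample those counts vary from draw to draw.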

STATISTICS I

EXERCISES OF TOPIC 1

ACADEMIC YEAR 2020/21: SOLUTIONS

1. Classify the following variables:


a) City or town of residence: Categorical Nominal.
b) Total income: Numerical continuous.
c) Number of points in driving license: Numerical discrete.
d) Degree of agreement with the decision of having exams with exercises (as opposed to short questions):
Categorical ordinal.
e) Telephone number. Categorical nominal.
f) Educational level: categorical ordinal.
g) Address postal code: categorical nominal.
h) Number of siblings: numerical discrete.
i) Device on which the respondent watches series more often: categorical nominal.
j) Who went to a party: categorical nominal.
k) How many went to a party: numerical discrete.
l) Favorite social network of students in a class: categorical nominal.

2.
a) Sample biased to a particular person profile.
b) Sample biased to students in Madrid who are undergoing a course in statistics.
c) Correct: sample unbiased and with a lower percentage of non-response compared to other methods (e.g.:
e-mail, phone call...).
d) Sample biased to university students.
e) Biased: we obtain response only from relatives and friends (who in most of the cases will have a similar
opinion to ours).
f) Sample biased to readers of that particular newspaper. Furthermore, we collect information only from
those who answer (usually people with the most extreme opinion).

3. The following is part of a questionnaire of the INE life conditions survey:


9.a: Categorical nominal, 9.b: Categorical ordinal, 9.c: Categorical ordinal (if we did not know the year, it would
be nominal), 9.d: Numerical continuous.
4.
a) The population: All companies from Madrid with more than 50 workers and that use outsourcing.
b) The sample: the 100 companies selected.
c) An element or individual: one of the 100 companies.

Topic 2: Analysis of univariate data
1. Representations and graphs
a. Frequency tables

b. Bar and pie charts, histograms, frequency polygons, pictograms. Other
graphs. Lying with graphs
2. Numerical measures to summarize and describe data:
a. Central tendency (mean, median, mode)
b. Location (quartiles and percentiles). Box plots
c. Spread (variance, standard deviation, quasi-variance, quasi-standard deviation,
range, IQR, coefficient of variation)
d. Shape (coefficients of skewness and kurtosis)
Topic 2: Analysis of univariate data
Frequency table (categorical variables)
Example about education data:



A tabular arrangement of the data in classes, together with their corresponding frequencies, is known
as a frequency distribution or frequency table.
• Class (category): ci → values of the variable. When there are many data, it is useful to
distribute them into classes or categories.
o In this example the category “High school” corresponds to the value 1, “College” to
value 2 and “Advanced degree” to value 3.
• Absolute frequency: ni → how many times each value appears in the
data, that is, how many individuals belong to each class or category.
o In this example “High school” (1) appears 14 times, “College” (2) appears 19 times
and “Advanced degree” (3) appears 13 times. “n” is the sum n1 + n2 + … + nk. So,
in this example n (the total number of individuals) is 14 + 19 + 13 = 46
• Relative frequency: fi → the proportion of each value; therefore the sum of
all the relative frequencies is 1. It is the absolute frequency divided by the total frequency of
all the classes, and it is often expressed as a percentage.
o In this example the relative frequency of “High school” is 14/46 =
0.304, that of “College” is 19/46 = 0.413 and
that of “Advanced degree” is 13/46 = 0.283. The
sum is 0.304 + 0.413 + 0.283 = 1.

Note:
• ni = number of individuals of class ci in the sample; n = sample size
• fi = ni/n
• 0 ≤ fi ≤ 1
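
As a rough sketch, the frequency table of the education example can be computed in Python. Only the three counts (14, 19, 13) come from the notes; the raw list of codes is rebuilt from them, and the function name `frequency_table` is an illustrative choice.

```python
from collections import Counter

# Codes from the education example: 1 = High school, 2 = College, 3 = Advanced degree.
# Only the counts (14, 19, 13) come from the notes; the raw list is rebuilt from them.
data = [1] * 14 + [2] * 19 + [3] * 13

def frequency_table(values):
    """Return {class: (absolute frequency n_i, relative frequency f_i)}."""
    n = len(values)
    counts = Counter(values)
    return {c: (counts[c], counts[c] / n) for c in sorted(counts)}

table = frequency_table(data)
for c, (n_i, f_i) in table.items():
    print(c, n_i, round(f_i, 3))
```

The relative frequencies printed are 0.304, 0.413 and 0.283, matching the values worked out by hand.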
Frequency table (discrete numeric variables)
Example:
• Sample: 100 shopping malls in which a promotion of a certain service was launched last
November



• Variable: number of new customers of the service

• Cumulative Absolute frequency: Ni → Ni = Ni−1 + ni.


o In this example, N0 = 1, N1 = 1 + 4 = 5, N2 = 5 + 7 = 12, N3 = 12 + 8 = 20 … and so
on.
• Cumulative Relative frequency: Fi → Fi = Fi−1 + fi.
o In this example, F0 = 0.01, F1 = 0.01 + 0.04 = 0.05, F2 = 0.05 + 0.07 = 0.12, F3 =
0.12 + 0.08 = 0.20 … and so on

Note:
• c1 < c2 < . . . < ck
• ni = number of individuals of class ci in the sample, fi = ni/n
• Ni = Ni−1 + ni, Fi = Fi−1 + fi
• 0 ≤ fi, Fi ≤ 1
• Fi and Ni also make sense for ordinal categorical variables
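
The recurrences Ni = Ni−1 + ni and Fi = Fi−1 + fi are running sums, which Python's `itertools.accumulate` computes directly. Because the table of the shopping-mall example is not reproduced here, this sketch reuses the counts of the education example (an ordinal variable, for which cumulative frequencies also make sense).

```python
from itertools import accumulate

# Absolute frequencies of the education example, in increasing class order.
n_i = [14, 19, 13]
n = sum(n_i)

N_i = list(accumulate(n_i))   # N_i = N_(i-1) + n_i
F_i = [N / n for N in N_i]    # F_i = N_i / n, i.e. the running sum of the f_i

print(N_i)
print([round(F, 3) for F in F_i])
```

The last cumulative absolute frequency always equals n, and the last cumulative relative frequency always equals 1.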
Grouping in class intervals: numeric data

• Example about the height of 100 university students (frequency table):

Height (inches) (ci)    Number of students (ni)
60 – 62 = c1            5 = n1
63 – 65 = c2            18 = n2
66 – 68 = c3            42 = n3
69 – 71 = c4            27 = n4
72 – 74 = c5            8 = n5
Total                   100 = n

The first category, for example, covers the heights between 60 and 62 inches and is denoted by
the symbol “60 – 62”. As there are 5 students whose height falls in this class (c1), the
frequency (n1) for this class is 5.
• Grey table example:

• Class Mark (midpoint): The class mark is the midpoint of the class interval and is obtained
by adding the right and left endpoints and dividing by 2.
Ck = (Lk-1 + Lk)/2

o Examples:
▪ Height example: The class mark of the interval 60 – 62 is (60 + 62)/2 = 61
▪ Grey table example: C1 = (0+3)/2 = 1.5, C2 = (3+6)/2 = 4.5, C3 =
(6+9)/2 = 7.5 … and so on.
• Class interval: A symbol that defines a class, as “60 – 62” as in the previous example, is
known as a class interval. The extreme numbers 60 and 62, are the endpoints, the smallest
number 60 is the left endpoint and the larger one, 62, is the right endpoint.
o Very often class intervals have the same width. The width of a class interval is the
difference between the endpoints that form the interval.

▪ Example (heights): w = 62 - 60 = 65 - 63 = 68 - 66 = 71 - 69 = 74 - 72 = 2
o Class intervals cannot overlap
o Round up the interval width to get convenient interval endpoints
o We can determine the width (w) of each interval by
w = (largest number - smallest number)/number of desired intervals

Determining the width of a class interval:

1. Find the range (the difference between the largest and the smallest number in the
data)
range = (largest number – smallest number)
Examples:
- If the largest height is 74 inches and the smallest is 60 inches, then the range is:
74 − 60 = 14
- In the table shown, the largest value is 20 and the smallest value is 1. Then the range is:
20 − 1 = 19
2. Select the number of classes: k = √n. Here, k = √46 = 6.78 ≈ 7
3. Compute the interval width: w = range/k. Here, 19/7 = 2.71 ⇒ 3
4. Determine the endpoints (beginning before the first value and ending after the last one): [0, 3],
(3, 6], ..., (18, 21]

Note:
• In R, with the default options, each interval excludes its left endpoint and includes its right
endpoint, except for the first interval, which includes both
• Grouping in intervals is also useful for tabulating discrete variables with many possible values
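
The four steps for choosing class intervals can be sketched in Python. The function name `class_intervals` and its `start` parameter are illustrative assumptions; rounding k to the nearest integer and rounding the width up follow the worked example (n = 46, smallest value 1, largest value 20, first endpoint 0).

```python
import math

def class_intervals(smallest, largest, n, start=None):
    """Return (k, w, endpoints) with k = sqrt(n) rounded and w = range/k rounded up."""
    k = round(math.sqrt(n))
    w = math.ceil((largest - smallest) / k)
    if start is None:
        start = smallest  # assumption: begin at the smallest value unless told otherwise
    endpoints = [start + i * w for i in range(k + 1)]
    return k, w, endpoints

# Values from the worked example: n = 46, smallest = 1, largest = 20, first endpoint 0.
k, w, endpoints = class_intervals(1, 20, 46, start=0)

# Class marks are the midpoints of consecutive endpoints.
marks = [(a + b) / 2 for a, b in zip(endpoints, endpoints[1:])]

print(k, w)
print(endpoints)
print(marks[:3])
```

With these inputs the sketch gives 7 intervals of width 3 with endpoints 0, 3, ..., 21, and the first class marks are 1.5, 4.5 and 7.5.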

The pie chart (categorical variables)


Example about education:
• Each pie sector is a fraction of the circle
• Sectors are labelled with their corresponding
class names
• Computer software typically orders classes in
alphabetical order
• Pie charts are visually engaging, but relative
sector sizes are harder to assess correctly than
in bar charts
• Avoid 3D pie charts: 3D perspective distorts our
perception of relative sector sizes

Bar charts
Example about education:
• Bars are of the same width and equally spaced,
their heights represent frequencies
• There are gaps between bars
• Bars are labelled with class names (or codes)
• Bar charts with cumulative frequencies:
Beware! Many software programs rank classes in
alphabetical order when the variable is
categorical. If it is an ordinal variable, it must be
ranked in increasing order.

• Bar charts can also be used for discrete data if there are not too many different values

o Example (too many different values in discrete data):
▪ Sample: 46 employees of a company
▪ Variable: EXPRNC: years working in the company



Histogram and frequency polygon
Histogram: consists of a series of rectangles whose bases lie on the horizontal axis (x),
centred on the class marks.
• There are no gaps between the bars/bins
• Bin widths = widths of class intervals (identical), class boundaries are marked on the
horizontal axis
• Bin heights = frequencies (here, absolute)
• Bin areas are proportional to the frequencies
Frequency polygon: line graph traced over the class marks. It can be obtained by joining the
midpoints of the tops of the rectangles in the histogram.

The Pareto chart
• Bar chart in which the variable classes are ranked in decreasing order of frequency
• It only applies to nominal categorical variables
• Useful to identify the more relevant classes
• The Pareto Principle (80/20 rule): Pareto stated (c. 1896) that, typically, about 80% of the
effects come from 20% of the possible causes. Example:



o 20% of the population owns about 80% of the wealth
o 80% of the population owns the remaining 20%
Example:
• Sample: Among the 1,100 visitors of the art exhibition Turner and the Masters (Prado
Museum, 2010), those who bought their tickets online (20.3%). Source: Institute for Tourism
Studies I
• Variable: Main reason for buying the ticket online
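
The ordering behind a Pareto chart can be sketched as a small table. The categories and counts below are hypothetical (the chart of the original example is not reproduced here); the sketch only shows the ranking by decreasing frequency and the cumulative percentages that usually accompany it.

```python
# Hypothetical counts for a nominal variable (reason for buying online).
# The categories and numbers are invented for illustration only.
counts = {"Discount": 25, "Avoid queues": 120, "Other": 18, "Convenience": 60}

def pareto_table(freqs):
    """Rank classes by decreasing frequency and add cumulative percentages."""
    total = sum(freqs.values())
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for label, n_i in ranked:
        running += n_i
        rows.append((label, n_i, round(100 * running / total, 1)))
    return rows

rows = pareto_table(counts)
for row in rows:
    print(row)
```

Reading the cumulative column from the top shows how few classes account for most of the observations.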

Other graphics: pictograms



• Sample: 70 university students from Madrid
• Variable: Preferred political party

• The area of each class graph is proportional to its frequency

Cartograms
INE, Encuesta de Turismo de residentes

Time series
INE, Encuesta de Población Activa



Measures of central tendency: mean, median and mode
Example: years of experience (EXPRNC) of the 46 employees:

The (arithmetic) mean: The mean is the average of all the data

• It is the most common measure of central tendency


• It is the centre of gravity of the data
• It should be calculated only for numeric variables

o Example: For the experience of the 46 employees. What is the mean?

o Example: the arithmetic mean of the numbers 8, 3, 5, 12, 10 is
x̄ = (8 + 3 + 5 + 12 + 10)/5 = 38/5 = 7.6
o Example: if 5, 8, 6 and 2 appear with frequencies 3, 2, 4 and 1 respectively,
the mean is
x̄ = (5·3 + 8·2 + 6·4 + 2·1)/(3 + 2 + 4 + 1) = 57/10 = 5.7
• Calculating the mean from grouped data
o It is the same formula but using the centre of each interval.
o Example: For the salary of the 46 employees. What is the mean?

Note: the mean salary from the original data equals 17250.41
• Linearity: a linear transformation applied to the data applies to the mean in the same way.
o If Y = a + bX ⇒ ȳ = a + b·x̄
o If Z = X + Y ⇒ z̄ = x̄ + ȳ
• Disadvantage: it is affected by extreme values (outliers)
o Example: X: 3, 1, 5, 4, 2 Y: 3, 1, 5, 4, 200
x̄ = (3 + 1 + 5 + 4 + 2)/5 = 3
ȳ = (3 + 1 + 5 + 4 + 200)/5 = 42.6

o When the data are skewed, an alternative robust measure of central tendency is
more appropriate
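
The worked examples for the mean can be checked with a short Python sketch. The function names are illustrative; the data are the numbers used in the examples of this section.

```python
def mean(values):
    """Arithmetic mean of raw data."""
    return sum(values) / len(values)

def mean_from_frequencies(values, freqs):
    """Mean of data given as distinct values with their absolute frequencies."""
    return sum(x * n_i for x, n_i in zip(values, freqs)) / sum(freqs)

raw_mean = mean([8, 3, 5, 12, 10])
freq_mean = mean_from_frequencies([5, 8, 6, 2], [3, 2, 4, 1])

# Effect of a single outlier on the mean.
mean_x = mean([3, 1, 5, 4, 2])
mean_y = mean([3, 1, 5, 4, 200])

print(raw_mean, freq_mean, mean_x, mean_y)
```

Replacing one value (2) by an outlier (200) moves the mean from 3 to 42.6, which is why a robust alternative is preferred for skewed data.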
The median: the value in the central position of the data ordered from smallest to largest: x(1) , x(2) , . . . , x(n)
1. Order the data from smallest to largest
2. Include repetitions
3. The median is in the central position
o Odd number of data: 1 1 1 3 3 5 5 7 8 8 9 → M = 5
o Even number of data (average of the two central values): 1 1 1 3 3 5 5 7 8 8 → M = (3 + 5)/2 = 4
• To find the median in the frequency table we look for the first value whose Fi > 0.5 = 50%
o Example of the experience: M = 6


• Linearity: If Y = a + bX with b > 0 ⇒ My = a + bMx


• Advantage: Not affected by outliers
o Example: X: 3, 1, 5, 4, 2 Y: 3, 1, 5, 4, 200
Mx = 3 My = 4
When the data are skewed it is a better measure of central tendency than the mean.
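
A minimal Python sketch of the median procedure (order the data, keep repetitions, take the central position) is given below. The function name is illustrative; the data are those of the examples above.

```python
def median(values):
    """Central value of the sorted data; average of the two central values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9])
even_case = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8])

# The outlier 200 barely changes the median.
m_x = median([3, 1, 5, 4, 2])
m_y = median([3, 1, 5, 4, 200])

print(odd_case, even_case, m_x, m_y)
```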
The mode: the value with the highest absolute frequency ni, in other words, the most
common value

Examples:



• Properties:
o It can be calculated for both categorical and numeric variables. Indeed, it is the only
descriptive measure that makes sense for nominal categorical variables.
o Not affected by outliers
o There can be more than one mode: bimodal–trimodal–plurimodal
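
The mode can be sketched in Python as the list of values that reach the highest absolute frequency. The function name `modes` is illustrative; the categorical data are rebuilt from the counts of the education example, and the second list is a made-up bimodal case.

```python
from collections import Counter

def modes(values):
    """Return every value whose absolute frequency is the highest."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

education = ["High school"] * 14 + ["College"] * 19 + ["Advanced degree"] * 13
education_mode = modes(education)

bimodal = modes([1, 1, 2, 2, 3])  # two values share the highest frequency

print(education_mode, bimodal)
```

The same function works for labels and for numbers, which reflects the property that the mode is defined for both categorical and numeric variables.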

Location measures: quartiles and percentiles


Quartiles split the ranked data into four segments with an (approximately) equal number of values
per segment.
Percentiles split the ranked data into a hundred segments with an (approximately) equal number
of values per segment.
1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
o The first quartile Q1 is in position ¼ (n + 1).
o The second quartile Q2 (= median) is in position ½ (n + 1).
o The third quartile Q3 is in position ¾ (n + 1).
o The k-th percentile Pk is in position k*(n + 1)/100, k = 1, …, 99, leaving k% of data
below
Note: Typically, the positions ¼ (n + 1), ¾ (n + 1) and k/100 (n + 1) are not integers ⇒ a rounding criterion
is used.
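
The position rule k·(n + 1)/100 can be sketched in Python. The notes do not say which rounding criterion is used, so this sketch assumes linear interpolation between the two neighbouring ordered values; other criteria give slightly different results. The function names are illustrative.

```python
import math

def value_at_position(ordered, pos):
    """Value at 1-based position `pos`, interpolating linearly if pos is not an integer."""
    pos = min(max(pos, 1), len(ordered))  # keep the position inside the data
    lower = math.floor(pos)
    frac = pos - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def percentile(values, k):
    """k-th percentile using position k*(n + 1)/100 (interpolation is an assumption)."""
    ordered = sorted(values)
    return value_at_position(ordered, k * (len(ordered) + 1) / 100)

data = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]  # n = 11, so positions 3, 6 and 9
q1, q2, q3 = (percentile(data, k) for k in (25, 50, 75))

print(q1, q2, q3)
```

With n = 11 the three positions are whole numbers, so no rounding criterion is needed and Q2 coincides with the median found earlier.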

Measures of spread: range, interquartile range (IQR), variance, standard deviation and coefficient
of variation
The range is the simplest measure of spread, it is the difference between the highest number and
the smallest.
R = Xmax − Xmin
• It ignores the way the data are distributed
• Sensitive to outliers
Example: Given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4

Example: Given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99
The Interquartile range (IQR) can eliminate some outlier problems. Eliminate high and low
observations and calculate the range of the middle 50% of the data
IQR = 3rd quartile − 1st quartile = Q3 − Q1
• Outliers are observations that fall
o below the value Q1 − 1.5 · IQR, or
o above the value Q3 + 1.5 · IQR
• For extreme outliers, replace 1.5 by 3 in the above definition

• Boxplot
o It shows five location measures
o It allows us to assess the spread of the data
o It allows us to assess the symmetry of the data
o It is very useful for comparing different datasets
o Note: R produces a modified boxplot, where
outliers are plotted as separate points (the
min and max shown are those without outliers)
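
The outlier rule based on the IQR can be sketched with Python's standard `statistics.quantiles`, whose `"exclusive"` method places the quartiles at positions ¼(n + 1), ½(n + 1) and ¾(n + 1) and interpolates linearly when needed. The data set is hypothetical and the function name `outliers` is illustrative.

```python
from statistics import quantiles

def outliers(values, factor=1.5):
    """Return (lower limit, upper limit, outliers) for the factor*IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return low, high, [x for x in values if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # hypothetical data with one large value

mild = outliers(data)               # limits Q1 - 1.5*IQR and Q3 + 1.5*IQR
extreme = outliers(data, factor=3)  # limits with 1.5 replaced by 3

print(mild)
print(extreme)
```

Here Q1 = 3 and Q3 = 9, so IQR = 6; the value 50 lies above both upper limits (18 and 27) and is therefore an extreme outlier.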
Variance
• Average of squared deviations of values from the mean
• Sample variance: the sum of the squared deviations (xi − x̄)² divided by n
• Sample quasi-variance (corrected sample variance): the same sum divided by n − 1
• They are related via: quasi-variance = n/(n − 1) · variance
If a, b are real numbers and y = a + bx, then the variance (and the quasi-variance) of y equals b² times that of x.
Standard deviation (SD)
• The most commonly used measure of spread
• The sample standard deviation and sample quasi-standard deviation are, respectively, the
square roots of the sample variance and of the sample quasi-variance
• They both measure variation about the mean
• They have the same units as the original data, while the variance is in squared units
• Variance and SD are both sensitive to outliers
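
The two versions of the variance and their relationship can be checked with Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1. The data and the constants a = 10, b = 3 are arbitrary illustrative choices.

```python
from statistics import pvariance, variance, pstdev, stdev

x = [3, 1, 5, 4, 2]
n = len(x)

var_x = pvariance(x)       # sample variance: divides by n
quasi_var_x = variance(x)  # sample quasi-variance: divides by n - 1

# Relationship between the two: quasi-variance = n/(n - 1) * variance
relation_holds = abs(quasi_var_x - n / (n - 1) * var_x) < 1e-12

# Linear transformation y = a + b*x multiplies the variance by b squared.
a, b = 10, 3
y = [a + b * xi for xi in x]
var_y = pvariance(y)

# Standard deviations are the square roots of the corresponding variances.
sd_x, quasi_sd_x = pstdev(x), stdev(x)

print(var_x, quasi_var_x, var_y, sd_x, quasi_sd_x)
```

The shift a = 10 has no effect on the spread, while the factor b = 3 multiplies the variance by 9.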

Measures of spread: coefficient of variation (CV)


• The CV measures relative variation and is defined as
CV = s/|x̄|
• It is a unitless number (sometimes given in %)
• It represents variation relative to the mean
o Example:
▪ Stock A: Mean price last year = 50, Quasi-standard deviation = 5
▪ Stock B: Mean price last year = 100, Quasi-standard deviation = 5
CV_A = 5/50 = 0.10, CV_B = 5/100 = 0.05
Both stocks have the same quasi-SD, but stock B is less variable relative to its mean
price
Standardizing variables

• Standardizing a variable x means calculating a new variable
z = (x − x̄)/s
• If you apply this formula to all observations x1, …, xn and call the transformed ones z1, ...,
zn, then the z’s have mean zero and standard deviation one
• Standardizing = calculating z-scores
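
Standardization can be sketched in Python as follows. The sketch assumes s is the quasi-standard deviation (computed with `statistics.stdev`, which divides by n − 1); the data are arbitrary.

```python
from statistics import mean, stdev

def standardize(values):
    """Return the z-scores (x - mean)/s, with s the quasi-standard deviation."""
    x_bar = mean(values)
    s = stdev(values)  # assumption: s divides by n - 1
    return [(x - x_bar) / s for x in values]

z = standardize([3, 1, 5, 4, 2])

print([round(zi, 3) for zi in z])
print(round(mean(z), 6), round(stdev(z), 6))
```

Whatever the units of the original data, the z-scores are unitless, which makes variables measured on different scales comparable.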

Measures of shape: coefficient of skewness and coefficient of kurtosis


Coefficient of skewness

Do not make a decision about the shape just through a comparison between the Mean, the Median
and the Mode.

Coefficient of Kurtosis

If the data are bell-shaped (normal), that is, symmetric with light tails, the following rule holds:
• About 68 % of the data are in (x̄ − 1s, x̄ + 1s)
• About 95 % of the data are in (x̄ − 2s, x̄ + 2s)
• About 99.7 % of the data are in (x̄ − 3s, x̄ + 3s)
Note: This rule is also known as the 68–95–99.7 rule
Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard
deviation is 5. Assuming that the data are bell-shaped, give the endpoints of an interval that contains
about 95 % of the observations.
95 % of the xi are in: (x̄ ± 2s) = (40 ± 2(5)) = (30, 50)
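
The three intervals of the rule can be computed with a few lines of Python. The function name is illustrative; the inputs are the mean and quasi-standard deviation of the example.

```python
def empirical_intervals(x_bar, s):
    """Intervals x_bar ± k*s for k = 1, 2, 3, keyed by their approximate coverage in %."""
    coverage = {1: 68, 2: 95, 3: 99.7}
    return {pct: (x_bar - k * s, x_bar + k * s) for k, pct in coverage.items()}

intervals = empirical_intervals(40, 5)
for pct, (low, high) in intervals.items():
    print(pct, low, high)
```

The coverages are only approximate and apply only when the data are roughly bell-shaped.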

1. The histogram of a numerical continuous variable has bars for each class whose area is
proportional to the frequency of the class.
Select one:
• True → This is a convention introduced to facilitate the interpretation of a histogram. The
correct answer is 'True'.
• False

2. The Pareto chart for the sample of a nominal random variable orders the values of the variable
according to their values.
Select one:
• True
• False → A Pareto chart orders the values of the variable according to their frequencies
(largest to smallest). The correct answer is 'False'.

3. For a sample of an integer random variable taking values between 0 and 10 (all of them with
positive frequencies), it holds that its median (Q2) is equal to 3 and Q3 is equal to 6. Then, the
sample contains no outliers.
Select one:
• True → Q1 must be a value between 0 and 3, so the IQR is at least 6 − 3 = 3. In the least
favourable case (Q1 = 3) the lower limit for the outliers would be 3 − 1.5 × (6 − 3) = −1.5 < 0. For
the upper limit we would have (again in the least favourable case) 6 + 1.5 × (6 − 3) = 10.5 > 10.
Since all observations lie between 0 and 10, none can be an outlier. The correct answer is 'True'.
• False
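
The argument for question 3 can be checked numerically. The sketch below assumes, for illustration only, that Q1 takes one of the integer values 0, 1, 2 or 3; the helper `tukey_fences` is a hypothetical name for the usual 1.5 × IQR limits.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper limits q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


Q3 = 6
always_outlier_free = True
for q1 in (0, 1, 2, 3):  # assumed candidate values of Q1
    lower, upper = tukey_fences(q1, Q3)
    # Data lie in [0, 10]; outliers are impossible if the limits enclose that range.
    if not (lower < 0 and upper > 10):
        always_outlier_free = False
    print(q1, lower, upper)

print(always_outlier_free)
```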

4. We have collected a sample of size 40 from a certain variable. The variable takes integer values
between 5 and 10. Then, the quasivariance for this sample cannot be larger than 5.
Select one:
• True
• False → You could have half the values in the sample equal to 5 and half equal to 10, and
the quasivariance would equal 250/39 ≈ 6.41 > 5. The correct answer is 'False'.
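
The counterexample for question 4 can be verified with Python's standard `statistics` module, whose `variance` function divides by n − 1 (the quasivariance). The variable names are illustrative.

```python
import statistics

# Half of the 40 observations equal to 5 and half equal to 10.
sample = [5] * 20 + [10] * 20

quasi_variance = statistics.variance(sample)  # divides by n - 1
print(round(quasi_variance, 2))
```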

Statistics I
Exercises for Topic 2
Academic year 2020/21
- Solutions

Exercises
1. The following table shows the absolute frequency distribution of the duration (in minutes) of 60 taxi services with origin
in a certain airport:

Duration (interval) Number of services


[0, 10) 8
[10, 20) 17
[20, 30) 14
[30, 40) 10
[40, 60) 11
Total 60

(a) Draw a histogram, taking into account that not all classes have the same width. Calculate the height of each bar so
that the area of each rectangle equals the relative frequency of its class (histogram with unit total area).
(b) From the histogram, describe the shape of the distribution. Indicate the modal and median intervals.

(c) From the table of frequencies, calculate (approximately) the mean and variance of the duration using the class marks.

Solution:

(a) The height of each bar is the relative frequency of its class divided by the class width:

Duration (interval)   Class mark   Rel. frequency   Height   Cumulative rel. frequency
[0, 10)               5            0.133            0.013    0.133
[10, 20)              15           0.283            0.028    0.416
[20, 30)              25           0.233            0.023    0.649
[30, 40)              35           0.167            0.017    0.816
[40, 60)              50           0.184            0.009    1.000

(b) The histogram shows a slight positive (right) skew. It is unimodal, and the modal interval is [10, 20).
The median would be obtained as the average of the durations in positions 30 and 31, both of which lie in the interval
[20, 30).

(c) The mean duration estimated from the class marks is

x̄ = 0.133 · 5 + 0.283 · 15 + 0.233 · 25 + 0.167 · 35 + 0.184 · 50 = 25.78 minutes

The variance is approximated by:

σ̂² = 0.133 · (5 − 25.78)² + 0.283 · (15 − 25.78)² + 0.233 · (25 − 25.78)² + 0.167 · (35 − 25.78)² + 0.184 · (50 − 25.78)² =
212.59 minutes²
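
As a rough check of the calculations for Exercise 1, the following sketch recomputes the bar heights and the grouped mean and variance directly from the absolute frequencies. Because it uses unrounded relative frequencies, it gives 25.75 and about 212.35 rather than the 25.78 and 212.59 obtained from the rounded table. Variable names are illustrative.

```python
# Class limits and absolute frequencies from the table in Exercise 1.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 60)]
counts = [8, 17, 14, 10, 11]
n = sum(counts)

rel_freq = [c / n for c in counts]
widths = [hi - lo for lo, hi in classes]
# Height = relative frequency / width, so that each bar's area is its relative frequency.
heights = [f / w for f, w in zip(rel_freq, widths)]
marks = [(lo + hi) / 2 for lo, hi in classes]

mean = sum(f * m for f, m in zip(rel_freq, marks))
variance = sum(f * (m - mean) ** 2 for f, m in zip(rel_freq, marks))
total_area = sum(h * w for h, w in zip(heights, widths))

print([round(h, 3) for h in heights])
print(round(mean, 2), round(variance, 2), round(total_area, 6))
```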

2. The spreadsheet data_condemned_2016_INE of the Excel workbook Datos_spreadsheet2.xls contains information provided
by the INE¹ about the age and number of prison sentences handed down in 2016.

(a) Represent the relative and cumulative frequency distributions of the variable age through a bar chart. What information
can you obtain about the age of the condemned? (Note: if you use Excel you can represent simultaneously both
distributions through a combined chart. Select the cumulative frequencies as secondary axis).

(b) Represent the relative frequency distribution of the variable number of prison sentences through a pie chart. Do the
quartiles and percentiles make sense for this variable? If yes, calculate the 80th percentile and interpret it.

Solution:

(a) The bar chart shows that the largest percentage of convicted persons is aged between 41 and 50, after which
the percentage drops markedly. Those under 41 represent more than 60 % of the convicted (64.4 %). The youngest,
aged 18 to 20, represent 8.8 %, while each of the remaining age brackets stays at around 15 %. The frequency
distributions are included in the chart itself.
¹ In Estadística de condenados: Adultos, from information in the Registro Central de Penados of the Ministerio de Justicia.

(b) The following table shows the absolute, relative and cumulative relative frequency distributions:

Number of sentences       ni       fi      Fi
One sentence              94709    0.349   0.349
Two sentences             89703    0.330   0.679
Three sentences           32624    0.120   0.799
Four sentences            23692    0.087   0.887
Five sentences            10681    0.039   0.926
More than five sentences  20117    0.074   1.000
The pie chart is:

Convicted persons with 1 or 2 sentences represent more than half of the total, almost 70 % (67.9 %). The percentage
of convicted persons decreases as the number of sentences increases, except in the last category, where the order is
reversed: the number of persons with more than 5 sentences is almost double the number with exactly 5 sentences.

Since the variable is quantitative, position measures make sense. In this case the 80th percentile P80 can be taken as
3 or 4 sentences, since about 80 % of the convicted (79.9 %) have 3 or fewer sentences, while about 20 % have 4 or more
sentences.
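
A minimal sketch of how a percentile can be read from a cumulative frequency table follows. It assumes the strict rule "first category whose cumulative relative frequency reaches p"; under that rule P80 is 4 sentences, because the cumulative frequency at 3 sentences (0.799) falls just short of 0.80, which is why the solution above accepts either 3 or 4. The function name `percentile_category` is hypothetical.

```python
from itertools import accumulate

categories = ["1", "2", "3", "4", "5", "more than 5"]
counts = [94709, 89703, 32624, 23692, 10681, 20117]
n = sum(counts)

# Cumulative relative frequencies Fi.
cumulative = [c / n for c in accumulate(counts)]


def percentile_category(p):
    """First category whose cumulative relative frequency reaches p (0 < p <= 1)."""
    for category, F in zip(categories, cumulative):
        if F >= p:
            return category


print([round(F, 3) for F in cumulative])
print(percentile_category(0.80))
```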

3. The following is a chart from the report "La Universidad Española en Cifras 2015/2016"².
² Published by the Conferencia de Rectores de las Universidades Españolas (CRUE) with the collaboration of Santander Universidades.

(a) What is the variable of interest? What is its type? What are the population and the sample?

(b) From the chart, obtain the maximum, the minimum, the median, the first and the third quartiles, the range and
the interquartile range (IQR) of the variable. Which universities occupy each of these positions? Interpret the values
obtained.

(c) Draw the chart that you consider more appropriate to represent the data and comment on the shape and the possible
presence of outliers.

(d) If you observe outliers, calculate the mean and the standard deviation of the data with and without outliers. Obtain
also the median and the interquartile range of the data without outliers. Comment on the results obtained.

(e) Taking into account the datum at the end of the chart, indicating that the total percentage of mobility students in
public Spanish universities is 6.18 %, which criterion do you guess has been used to select the 20 universities in the
chart?

Solution:

(a) The variable of interest is the percentage of undergraduate (Grado) students in international mobility at Spanish
public universities during the academic year 2015/2016; it is quantitative continuous. The population is all on-campus
public universities. The available data are a sample of 20 of these universities.

(b) The chart shows the values sorted in increasing order. The maximum is reached at Universidad Carlos III,
with 15.68 %, and the minimum at Universidad de Málaga, with 6.21 %. Since (1/4) · 21 = 5.25 and (3/4) · 21 = 15.75,
we can round and select Universidad de Salamanca and Politécnica de Catalunya, which occupy positions 5 and 16
respectively, as those giving the first and third quartiles, with values Q1 = 7.17 % and Q3 = 9.09 %.
(c) The data can be represented with a histogram or a box plot. Since there are few data (n = 20),
we take 5 classes of width 1.9 starting at 6.2 to build the histogram. We obtain:

The distribution is positively skewed (to the right) and there is one atypically high value. As the descriptive
statistics with and without the Carlos III outlier show later, the skewness (quite pronounced with the 20 data points)
is considerably attenuated without it, but the distribution remains skewed. Note also that the high kurtosis value
indicates the presence of outliers.

IQR = 9.09 − 7.17 = 1.92, and therefore the lower and upper limits for considering a value an outlier are:

LI = 7.17 − 1.5 · 1.92 = 4.29
LI-ext = 7.17 − 3 · 1.92 = 1.41
LS = 9.09 + 1.5 · 1.92 = 11.97
LS-ext = 9.09 + 3 · 1.92 = 14.85

Since the minimum percentage is 6.21 > 4.29, there are no low outliers. There is a single high outlier, which is
moreover extreme: the percentage of mobility students at Universidad Carlos III is atypically high compared with the
rest.
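
The outlier limits for Exercise 3 can be reproduced with a small helper. This is a sketch under the convention used in these notes (1.5 × IQR for outliers, 3 × IQR for extreme outliers); the function names `tukey_fences` and `classify` are illustrative.

```python
def tukey_fences(q1, q3, k):
    """Limits q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(value, q1, q3):
    """Label a value as regular, outlier or extreme outlier."""
    inner_low, inner_high = tukey_fences(q1, q3, 1.5)
    outer_low, outer_high = tukey_fences(q1, q3, 3)
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "outlier"
    return "regular"


Q1, Q3 = 7.17, 9.09  # quartiles obtained in part (b)
print([round(x, 2) for x in tukey_fences(Q1, Q3, 1.5)])
print([round(x, 2) for x in tukey_fences(Q1, Q3, 3)])
print(classify(15.68, Q1, Q3))  # Universidad Carlos III
print(classify(6.21, Q1, Q3))   # Universidad de Málaga
```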

(d) The descriptive measures obtained with and without the outlier are:

Comparing all the requested measures, the one most affected by the outlier is the standard deviation, followed by
the mean. The median remains practically the same and the IQR changes little. In relative terms
the changes are:
• 41.05 % for the quasi-standard deviation: (2.0737 − 1.222)/2.0737 = 0.4105
• 4.48 % for the mean: (8.464 − 8.084)/8.464 = 0.0448
• 1.16 % for the median: (8.155 − 8.06)/8.155 = 0.0116
• 2.6 % for the IQR: (1.92 − 1.87)/1.92 = 0.026
(e) The overall percentage, 6.18 %, is below 6.21 %, the minimum among the 20 universities for which data are shown.
Everything suggests that these are the 20 universities with the highest proportion of mobility students.
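
The relative changes quoted in part (d) are simple ratios. The sketch below recomputes them from the rounded values given in the solution, so the last digit may differ slightly from the figures above; `relative_change` is an illustrative name.

```python
def relative_change(with_outlier, without_outlier):
    """Relative reduction of a measure when the outlier is removed."""
    return (with_outlier - without_outlier) / with_outlier


# (value with the outlier, value without it), as quoted in the solution.
measures = {
    "quasi-standard deviation": (2.0737, 1.222),
    "mean": (8.464, 8.084),
    "median": (8.155, 8.06),
    "IQR": (1.92, 1.87),
}

changes = {name: relative_change(a, b) for name, (a, b) in measures.items()}
most_affected = max(changes, key=changes.get)

for name, change in changes.items():
    print(f"{name}: {change:.2%}")
print(most_affected)
```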

4. The following bar chart represents the distribution of cumulative frequencies of a certain variable:

(a) What is the type of the variable?

(b) Deduce and represent the corresponding absolute frequency table.

(c) Discuss the shape of the distribution.

(d) Calculate the mean and the standard deviation of this dataset.

(e) Calculate the mode, the median and the percentiles 20 and 80.

Solution:

(a) The variable is quantitative discrete.

(b) The absolute frequency distribution is:

ci      Absolute frequency (ni)
0       6
1       10
2       12
3       8
4       5
5       4
6       3
8       1
10      1
Total   50
(c) The distribution is asymmetric, skewed to the right.

(d) Using the frequency table to compute the mean and the standard deviation of this dataset:

x̄ = (Σ_{i=1}^{k} ci ni) / n = 134/50 = 2.68

s²x = (Σ_{i=1}^{k} ci² ni − n x̄²) / (n − 1) = (582 − 50 · 2.68²)/49 = 4.5485, so sx = √4.5485 = 2.13

(e) Mode = 2
The sorted observations are:

0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
6, 8, 10

Median:

Mx = (x(50/2) + x(50/2+1))/2 = (x(25) + x(26))/2 = (2 + 2)/2 = 2

The 20th percentile occupies position 20(n + 1)/100 = 20 · 51/100 = 10.2. Rounding: P20 = x(10) = 1
The 80th percentile occupies position 80(n + 1)/100 = 80 · 51/100 = 40.8. Rounding: P80 = x(41) = 4
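
The calculations for Exercise 4 can be sketched by expanding the frequency table into the 50 individual observations. The percentile helper follows the position rule p(n + 1)/100 with rounding used above; note that Python's `round` rounds exact halves to the nearest even integer, which does not matter for the positions 10.2 and 40.8. The helper name is hypothetical.

```python
import statistics

# Frequency table from part (b): value -> absolute frequency.
freq_table = {0: 6, 1: 10, 2: 12, 3: 8, 4: 5, 5: 4, 6: 3, 8: 1, 10: 1}

# Expand the table into the sorted list of observations.
data = sorted(v for v, count in freq_table.items() for _ in range(count))
n = len(data)

mean = statistics.mean(data)
quasi_variance = statistics.variance(data)  # divides by n - 1
mode = statistics.mode(data)
median = statistics.median(data)


def percentile_by_position(sorted_data, p):
    """Value at the rounded position p*(n+1)/100 (1-based)."""
    position = p * (len(sorted_data) + 1) / 100
    index = min(max(round(position), 1), len(sorted_data))
    return sorted_data[index - 1]


p20 = percentile_by_position(data, 20)
p80 = percentile_by_position(data, 80)
print(n, mean, round(quasi_variance, 4), mode, median, p20, p80)
```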
5. The following table shows the number of university graduates for the academic year 2009-2010 by Comunidad Autónoma
(CA) of the university in which they graduated (INE, Encuesta de inserción laboral de titulados universitarios 2014).

CA Number of univ. graduates


Andalucía 31655
Aragón 4989
Asturias, Principado de 3947
Balears, Illes 1905
Canarias 4615
Cantabria 1751
Castilla y León 14368
Castilla-La Mancha 4924
Cataluña 31345
Comunitat Valenciana 19799
Extremadura 3767
Galicia 10175
Madrid, Comunidad de 38739
Murcia, Región de 6771
Navarra, Comunidad Foral de 3162
País Vasco 9744
Rioja, La 1005

If the variable of interest is the CA of the university in which graduates obtained their degree:

(a) What is the variable type and what is the population?

(b) What frequency distribution does the above table show?

(c) What statistical measures can you obtain for such a variable?

Draw a Pareto chart to check whether the following claims, which refer to university graduates in the academic year
2009-2010, are true or false:

(a) Less than 25 % of universities produce more than 60 % of graduates.

(b) The median of the distribution is Cataluña.

(c) 20 % of CCAA concentrate the universities from which more than 50 % of graduates come.

(d) From the universities of 35 % of the CCAA come less than 10 % of graduates.

Solution:

(a) The variable is qualitative nominal, and the population consists of all university graduates of the academic year
2009-2010.

(b) The absolute frequency distribution.

(c) The only measure that makes sense for nominal qualitative data is the mode, which in this case is Madrid.

The Pareto chart is:

It is built from the following information:

Comunidad Autónoma             Graduates   Percentage   Cumulative percentage
Madrid, Comunidad de           38739       20.11 %      20.11 %
Andalucía                      31655       16.43 %      36.54 %
Cataluña                       31345       16.27 %      52.81 %
Comunitat Valenciana           19799       10.28 %      63.08 %
Castilla y León                14368       7.46 %       70.54 %
Galicia                        10175       5.28 %       75.82 %
País Vasco                     9744        5.06 %       80.88 %
Murcia, Región de              6771        3.51 %       84.39 %
Aragón                         4989        2.59 %       86.98 %
Castilla-La Mancha             4924        2.56 %       89.54 %
Canarias                       4615        2.40 %       91.94 %
Asturias, Principado de        3947        2.05 %       93.98 %
Extremadura                    3767        1.96 %       95.94 %
Navarra, Comunidad Foral de    3162        1.64 %       97.58 %
Balears, Illes                 1905        0.99 %       98.57 %
Cantabria                      1751        0.91 %       99.48 %
Rioja, La                      1005        0.52 %       100.00 %
TOTAL                          192661      100.00 %

(a) We do not have enough information to know whether it is true or false. We would need to know at which university
each graduate obtained the degree. We can say that the universities of Madrid together account for 20.11 % of that
year's graduates, but we do not know how that 20.11 % is distributed among them.

(b) The median does not make sense for nominal data, which cannot be ordered.

(c) True. Madrid, Andalucía and Cataluña, which represent 17.65 % of the Autonomous Communities, together account for
52.81 % of graduates.

(d) True. The 6 communities (35.29 %) with the smallest percentages, Asturias, Extremadura, Navarra, Balears, Cantabria
and La Rioja, account for only 8.06 % of graduates.
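
The Pareto table for Exercise 5 can be rebuilt from the raw counts. The sketch below sorts the regions by frequency, accumulates the percentages and checks claims (c) and (d); the shortened region names used as dictionary keys are only labels for this example.

```python
from itertools import accumulate

graduates = {
    "Andalucía": 31655, "Aragón": 4989, "Asturias": 3947, "Balears": 1905,
    "Canarias": 4615, "Cantabria": 1751, "Castilla y León": 14368,
    "Castilla-La Mancha": 4924, "Cataluña": 31345, "Comunitat Valenciana": 19799,
    "Extremadura": 3767, "Galicia": 10175, "Madrid": 38739, "Murcia": 6771,
    "Navarra": 3162, "País Vasco": 9744, "La Rioja": 1005,
}

# A Pareto chart orders categories from largest to smallest frequency.
ranked = sorted(graduates.items(), key=lambda item: item[1], reverse=True)
total = sum(graduates.values())
cumulative_pct = [100 * c / total for c in accumulate(count for _, count in ranked)]

top3_share = cumulative_pct[2]            # claim (c): 3 of 17 regions
bottom6_share = 100 - cumulative_pct[-7]  # claim (d): 6 of 17 regions

for (region, count), cum in zip(ranked, cumulative_pct):
    print(f"{region:22s} {count:6d} {100 * count / total:6.2f} % {cum:7.2f} %")
print(round(top3_share, 2), round(bottom6_share, 2))
```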

6. Consider the following charts published in El Mundo³ about circulation data of the Spanish press (OJD, Oficina de
Justificación de la Difusión).
³ 24 September, 2014. Source: blog Malaprensa

(a) Do you find the charts adequate? Why?

(b) Represent properly the data in such charts and compare the charts obtained with those that were published.

Solution. The vertical axis is truncated: it does not start at 0 but at 50000. Moreover, the scale is not shown and,
worst of all, it is not respected either. The distances between the figures for El Mundo and El País look much
shorter than those between El Mundo and ABC, when in reality it is the other way round. For example, looking only at
the last data point, for the month of August, the difference in newsstand sales between El País and El Mundo was
42605, while the difference with ABC's sales was 22054, slightly more than half. The discrepancy in total circulation
in August is even more pronounced: the difference with El País was 89486, while it was 30769 with
ABC.
The charts, respecting the scale and without truncating the axis, would look as follows:

7. The following table shows data from the Encuesta de Condiciones de Vida (INE) corresponding to the years 2014 and 2006
about the percentage of households facing economic hardship by CCAA.

CA                             2014    2006
Andalucía                      24.3    16.8
Aragón                         9.8     5.7
Asturias, Principado de        4.6     3.1
Balears, Illes                 14.7    9.7
Canarias                       19.5    18.4
Cantabria                      15.2    9.3
Castilla y León                12.1    8.5
Castilla-La Mancha             15.9    10.7
Cataluña                       12.2    11.3
Comunitat Valenciana           18.0    12.3
Extremadura                    19.6    8.7
Galicia                        20.8    11.9
Madrid, Comunidad de           12.4    8.8
Murcia, Región de              22.7    14.1
Navarra, Comunidad Foral de    4.2     6.6
País Vasco                     11.5    5.2
Rioja, La                      12.9    6.6
Ceuta                          32.9    25.7
Melilla                        12.9    15.9

The following tables show information about the variable percentage of households facing economic hardship in each of the
observed periods:

2014                               2006

Mean                   15.5895     Mean                   11.0158
Standard error         1.5677      Standard error         1.2385
Median                 14.7        Median                 9.7
Mode                   12.9        Mode                   6.6
Standard deviation     6.8333      Standard deviation     5.3986
Sample variance        46.6943     Sample variance        29.1447
Kurtosis               1.1493      Kurtosis               1.7247
Skewness coefficient   0.6357      Skewness coefficient   1.1245
Range                  28.7        Range                  22.6
Minimum                4.2         Minimum                3.1
Maximum                32.9        Maximum                25.7
Sum                    296.2       Sum                    209.3
Count                  19          Count                  19

(a) Represent the data of 2006 and of 2014 in histograms and compare their distributions. What differences do you find?

(b) To analyze the evolution of the percentage of households facing economic hardship in the period 2006–2014, obtain the
percentiles 20, 40, 60 and 80 for each year. Tabulate these data for each year, along with the minimum and maximum
values. What conclusions can you draw? Also, represent the data in the table as a chart.

(c) What central tendency measure is more adequate in each case and why?

(d) Which year shows more variability in the data?

(e) In which of the two periods, 2014 or 2006, do the Comunidad de Madrid and the Comunitat Valenciana show the worst
results relative to the situation in those years?

Solution.

(a) Taking 5 classes in both cases, we obtain the following histograms:

The percentage of households having difficulty making ends meet increased from 2006 to 2014, and the distribution
has also shifted: whereas in 2006 it showed positive skewness, in 2014 the skewness is only very slight. The shift
is seen most clearly by comparing the medians and the percentiles (next part). While in 2006 the percentage of
households in difficulty was greater than or equal to 9.7 % in 50 % of the CCAA, this value rose by 5 percentage
points, to 14.7 %, in 2014.

(b) The percentiles are given by:

Percentile   Position   2014    2006
Minimum      1          4.2     3.1
P20          4          11.5    6.6
P40          8          12.9    8.8
P60          12         15.9    11.3
P80          16         20.8    15.9
Maximum      19         32.9    25.7

Graphically:

(c) In this case, given the skewness of the 2006 data and the outlier detected in the 2014 data, it is more
appropriate to use the median as the measure of central tendency. In addition, the median is customarily used for
data of this nature.

The lower and upper limits for the 2014 data would be:

LI = Q1 − 1.5 · IQR = 12.1 − 1.5 · 7.5 = 0.85 < 4.2 = min
LS = Q3 + 1.5 · IQR = 19.6 + 1.5 · 7.5 = 30.85 < 32.9 = max

There are no atypically low values, but there is an atypically high one: the value for Ceuta. There are no others.
We check whether it is an extreme outlier: LSe = Q3 + 3 · IQR = 42.1 > 32.9. It is not.
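
The percentile table and the outlier limits for the 2014 data can be reproduced as follows. The sketch assumes the position rule p(n + 1)/100 with rounding used in these solutions; the helper `value_at` is a hypothetical name.

```python
# Percentage of households facing economic hardship in 2014 (19 regions).
hardship_2014 = [24.3, 9.8, 4.6, 14.7, 19.5, 15.2, 12.1, 15.9, 12.2, 18.0,
                 19.6, 20.8, 12.4, 22.7, 4.2, 11.5, 12.9, 32.9, 12.9]
data = sorted(hardship_2014)
n = len(data)


def value_at(p):
    """Observation at the rounded position p*(n+1)/100 (1-based)."""
    position = round(p * (n + 1) / 100)
    return data[position - 1]


percentiles = {p: value_at(p) for p in (20, 40, 60, 80)}
q1, q3 = value_at(25), value_at(75)
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
extreme = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]

print(percentiles)
print(q1, q3, round(lower, 2), round(upper, 2))
print(outliers, extreme)
```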

(d) Comparing the variances, we would say 2014, with a variance of 46.6943 against 29.1447 in 2006. However, the
percentages have increased from one year to the other. Comparing the coefficients of variation, cv2014 = 6.833/15.5895 =
0.438 and cv2006 = 5.399/11.0158 = 0.490, that is, a variation of 43.8 % against 49.0 %, so the most appropriate answer
is 2006.

(e) To compare the situation of these two regions taking into account the general situation in 2014 and in 2006,
we compute the standardized percentages (cv = Comunitat Valenciana, mad = Comunidad de Madrid):

z(cv, 2006) = (12.3 − 11.0158)/5.399 = 0.238
z(cv, 2014) = (18 − 15.5895)/6.833 = 0.353
z(mad, 2006) = (8.8 − 11.0158)/5.399 = −0.410
z(mad, 2014) = (12.4 − 15.5895)/6.833 = −0.467

The situation of the Comunitat Valenciana is worse in 2014, the year in which it has the larger standardized
percentage. For the Comunidad de Madrid, however, the worse year is 2006.
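
Parts (d) and (e) rely on two dimensionless quantities, the coefficient of variation and the standardized value (z-score). A minimal sketch using the summary statistics given in the exercise follows; the dictionary layout and function names are illustrative.

```python
def coefficient_of_variation(sd, mean):
    return sd / mean


def z_score(x, mean, sd):
    return (x - mean) / sd


# Mean and standard deviation per year, from the descriptive tables.
summary = {2006: {"mean": 11.0158, "sd": 5.3986},
           2014: {"mean": 15.5895, "sd": 6.8333}}
# Observed percentages for the two regions.
observed = {"Comunitat Valenciana": {2006: 12.3, 2014: 18.0},
            "Comunidad de Madrid": {2006: 8.8, 2014: 12.4}}

cv = {year: coefficient_of_variation(s["sd"], s["mean"]) for year, s in summary.items()}
z = {region: {year: z_score(x, summary[year]["mean"], summary[year]["sd"])
              for year, x in by_year.items()}
     for region, by_year in observed.items()}
# A larger z means a worse position relative to the other regions that year.
worst_year = {region: max(scores, key=scores.get) for region, scores in z.items()}

print({year: round(value, 3) for year, value in cv.items()})
print(worst_year)
```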

Exercises from exams of previous academic years


8. (May 2015 exam) Vendors doing business with a particular company were sampled to determine the economic impact
of company business on their gross sales. A sample of 15 firms that provide services to the company had the following
percentages of total annual sales as a result of sales to the company:

27  12  14.9  1.2  0.1  1  0.1  5.3  7.6  5  1  1  3.2  3  7

(a) Is the sample mean of the 15 percentages larger than the sample median? If true, what does this result suggest? Justify
your answers.

(b) Calculate the three sample quartiles. Interpret them in terms of percentages.

(c) Compute the sample quasi-variance and coefficient of variation of the 15 percentages.

(d) Draw a box-plot of the data and identify the outliers (if any). Justify your answer.

Solution.

(a) The sample mean of the 15 percentages is x̄ = 5.96, while the sample median is M = 3.2. Therefore the sample
mean is larger than the sample median. This result suggests that the distribution of the data is skewed to the
right (positive skewness); that is, there is a small number of firms for which the percentage of total annual
sales made to the company is notably higher than for the rest. This skewness is corroborated
in the box plot (part d), even after removing the effect of the outlier.

(b) The sample quartiles are Q1 = x(4) = 1, Q2 = x(8) = 3.2 and Q3 = x(12) = 7.6, respectively. Thus
25 % of the percentages are no larger than 1 %, 50 % are no larger than 3.2 % and 75 % are no larger than
7.6 %. Consequently, the three sample quartiles divide the sample into four
sub-samples containing the same number of percentages. In general, for most of the firms, sales to the
company represent less than 7.6 % of total sales.
(c) The sample quasi-variance is s² = 53.2668, while the sample coefficient of variation is cv = 1.2245.
(d) To build the box plot we need the interquartile range, IQR = Q3 − Q1 = 7.6 − 1 =
6.6. Furthermore, to draw the whiskers and to detect outliers, if any, we need the values
Q1 − 1.5 IQR = 1 − 1.5 · 6.6 = −8.9 and Q3 + 1.5 IQR = 7.6 + 1.5 · 6.6 = 17.5. In addition, the minimum and maximum
values in the sample are 0.1 and 27, respectively. Hence there is a single outlier, since 17.5 < 27. The box plot
is shown below.
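
The figures in this solution can be recomputed with the standard `statistics` module. The quartiles below use the position rule (n + 1)/4, which gives whole positions 4, 8 and 12 for n = 15; other software may use a different quartile convention and return slightly different values.

```python
import statistics

sales_pct = [27, 12, 14.9, 1.2, 0.1, 1, 0.1, 5.3, 7.6, 5, 1, 1, 3.2, 3, 7]
data = sorted(sales_pct)
n = len(data)

mean = statistics.mean(data)
median = statistics.median(data)

# Quartiles at positions (n+1)/4 and 3(n+1)/4 (exact integers for n = 15).
q1 = data[(n + 1) // 4 - 1]
q3 = data[3 * (n + 1) // 4 - 1]
iqr = q3 - q1

quasi_variance = statistics.variance(data)  # divides by n - 1
cv = quasi_variance ** 0.5 / mean

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(round(mean, 2), median, q1, q3)
print(round(quasi_variance, 4), round(cv, 4))
print(round(lower, 1), round(upper, 1), outliers)
```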

The box plot shows the skewness of the distribution, which persists even when the atypically high value is removed,
as seen in the following box plot. Note also the values of the skewness coefficient in the two cases, with and without
the outlier. Observe as well the change in the less robust measures (mean, standard deviation, range, ...).

9. (June 2015 exam) The following tables contain information about the GDP and the unemployment rate of the Spanish
Autonomous Regions:

Answer and justify the following questions:

(a) Fill in the gaps in Table 2.

(b) Which of the two variables is more dispersed?

(c) Determine the group of Autonomous Regions formed by the 15 % with higher GDP.

(d) Draw the box-plot of the unemployment rate. What can you tell about the shape of the distribution?
(e) From the previous box-plot, decide if there are outliers and/or extreme outliers in the data. Identify the Autonomous
Regions that can be considered outliers and/or extreme outliers.

Solution.

(a) Table 2 is missing some descriptive statistics for the variable unemployment rate: mean 10.5, median 10.3,
quasi-standard deviation 3.4, Q3 = 11.2, P15 = 7.0.
(b) Since the units of measurement and the range of values are very different for GDP and the unemployment rate, the
quasi-standard deviation is not a good descriptive measure for comparing their variabilities. It is better to use a
dimensionless measure, such as the coefficient of variation (CV). In this case,

cv(GDP) = 3251.1/17530.7 = 0.19;   cv(unemployment rate) = 3.4/10.5 = 0.32

Hence the variation of the unemployment rate is larger.

(c) The group of Autonomous Regions formed by the 15 % with the highest GDP consists of those whose GDP exceeds the
85th percentile, that is, exceeds 21360.3. Three Autonomous Regions meet this condition: Navarra, País Vasco and Madrid.

(d) Box plot for the unemployment rate

10. (May 2016 Exam) The following tables contain information about 10 companies of the IBEX 35. In particular, three variables
are shown: X1 =average remuneration of the governing board, X2 =average remuneration of senior management and
X3 =average expenditure per employee (in millions of euros). Source: El País, 8th May 2016.

Table 1
Company      X1      X2      X3
BBVA         0.985   1.144   0.455
ACS          0.667   0.540   0.401
FCC          0.720   0.650   0.323
Inditex      1.270   1.730   0.231
Acciona      0.463   0.590   0.390
Santander    1.484   2.580   0.586
IAG          1.220   2.440   0.809
Iberdrola    0.920   1.979   0.894
Ferrovial    1.330   1.800   0.391
Telefónica   1.240   1.869   0.491

Figure 1: box-plots A and B

Table 2
                 X1      X2      X3
Mean             1.030   ___     0.497
Median           ___     1.765   0.428
Standard dev.    0.333   0.756   0.210
Variance         ___     0.572   0.044
Q1               0.770   0.774   0.390
Q3               1.263   1.952   ___

Answer the following questions:

(a) Fill in the gaps in Table 2.

(b) Determine the shape of the distribution of X2. Justify your answer.

(c) Which of the three variables is the most dispersed? Justify your answer.

(d) Are there any outliers in X3? Justify your answer.

(e) Match the box-plots A and B of Figure 1 with the corresponding variables (X1, X2, X3). Justify your answer.

(f) It is known that the correlation between X1 and X3 is 0.175 and, on the other hand, that the covariance between X2
and X3 is 0.093. Is it true that the linear relationship between X3 and X1 is stronger than between X3 and X2? Justify
your answer. (Note: this question is from Chapter 3)

Solution.

(a) The complete table is:

Statistics            X1      X2      X3
Mean                  1.030   1.532   0.497
Median                1.103   1.765   0.428
Standard deviation    0.333   0.756   0.210
Variance              0.111   0.572   0.044
Percentile 25 (Q1)    0.770   0.774   0.390
Percentile 75 (Q3)    1.263   1.952   0.562
CV                    0.323   0.494   0.423
IQR                   0.493   1.178   0.172

(b) The mean of X2 is smaller than its median. Therefore, the distribution is skewed to the
left (negative skewness).
(c) The means of the three variables are very different, so dispersion should be measured with Pearson's coefficient
of variation. In this case CV(X1) = 0.323, CV(X2) = 0.494 and CV(X3) = 0.423, so the variable with the
greatest dispersion is X2.
(d) X3 is positively skewed (to the right), so if it has outliers they will be in the right tail
of X3. Outliers are values greater than Q3 + 1.5 × IQR = 0.562 + 1.5 × 0.172 = 0.82
(or, with another quartile convention, Q3 + 1.5 × IQR = 0.586 + 1.5 × 0.196 = 0.88). There is one outlier, corresponding
to the company Iberdrola.
(e) Box-plot A corresponds to X1 (for example, looking at the values of the median or of Q1),
while box-plot B corresponds to X3 (looking at the values of the median, the value of the outlier, etc.).

(f) To determine the degree of linear relationship it is necessary to compute Pearson's linear correlation
coefficient. In this case, r(X1, X3) = 0.175 and r(X2, X3) = 0.093/(0.756 × 0.210) = 0.586. Therefore, the linear
relationship between X2 and X3 is stronger than between X1 and X3.
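
The comparison in part (f) only makes sense between correlations, not between a correlation and a covariance. A minimal sketch of the conversion r = cov/(sx · sy), using the standard deviations from Table 2, follows; `pearson_r` is an illustrative name.

```python
def pearson_r(covariance, sd_x, sd_y):
    """Pearson correlation from a covariance and the two standard deviations."""
    return covariance / (sd_x * sd_y)


r_x1_x3 = 0.175                            # given directly in the question
r_x2_x3 = pearson_r(0.093, 0.756, 0.210)   # covariance and standard deviations of X2, X3

print(round(r_x2_x3, 3))
print(r_x2_x3 > r_x1_x3)
```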
11. (June 2016 Exam) The following tables contain information about 10 companies of the Dow Jones. In particular:
X1 = highest-paid CEO (in million dollars) and X2 = share price (in dollars). Source: El País, 8th May 2016.

Table 1
X1       X2
44.91    106.11
27.29    87.81
24.2     53.95
23.79    113.69
23.37    29.97
22.58    158.31
22.03    100.31
21.98    64.21
20.01    110.68
19.82    147.6

Figure 1: box-plots A, B and C
a) Draw the box-plot for X2 and identify the outliers (if any). Justify your answer.

b) Determine if X1 and X2 have the same type of asymmetry. Justify your answer.

c) Determine which box-plot (A, B, C) corresponds to X1 . Justify your answer.

d) If 1 euro = 1.14 dollars, calculate the average salary for those CEOs (in million euros) and the variance.

Solution.

a) The variable X2 takes values from min(X2) = 29.97 to max(X2) = 158.31, with a mean of x̄2 = 97.26. In addition,

Me(X2) = (100.31 + 106.11)/2 = 103.21,   Q1(X2) = 64.21,   Q3(X2) = 113.69,   IQR(X2) = 49.48.

There are no outliers (nor extreme outliers) because Q3(X2) + 1.5 IQR(X2) = 187.91 and Q1(X2) − 1.5 IQR(X2) < 0. The
box plot is:

b) X1 and X2 have different types of skewness. Specifically, X1 is positively skewed because x̄1 =
24.998 > Me(X1) = 22.975, while X2 is negatively skewed, since x̄2 = 97.26 < Me(X2) = 103.21.
c) The clear difference between box-plots A, B and C lies in the number of outliers and extreme outliers. It is
therefore enough to find the number of outliers of X1. For X1 we have

Q1(X1) = 21.98,   Q3(X1) = 24.2,   IQR(X1) = 2.22.

We compute the limits: Q1(X1) − 1.5 IQR(X1) = 18.65, so there are no outliers in the left tail and we
therefore discard box-plot C. As for the right tail, Q3(X1) + 1.5 IQR(X1) = 27.53 (and Q3(X1) + 3 IQR(X1) =
30.86), so 44.91 is an extreme outlier. We therefore choose box-plot A.

d) Let Y = salary of the highest-paid CEO (in million euros). Since 1 euro = 1.14 dollars,
Y = X1/1.14. As Y is a linear transformation of X1,

ȳ = x̄1/1.14 = 21.928,   s²n(Y) = s²n(X1)/1.14² = 48.194/1.14² = 37.084.
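
Part d) uses the fact that dividing every observation by a constant divides the mean by that constant and the variance by its square. The sketch below checks this on the X1 data, assuming s²n denotes the variance with divisor n (`statistics.pvariance`); the variable names are illustrative.

```python
import statistics

ceo_pay_usd = [44.91, 27.29, 24.2, 23.79, 23.37, 22.58, 22.03, 21.98, 20.01, 19.82]
RATE = 1.14  # 1 euro = 1.14 dollars, as stated in the exercise

mean_usd = statistics.mean(ceo_pay_usd)
var_usd = statistics.pvariance(ceo_pay_usd)  # divisor n

# Properties of the linear transformation Y = X1 / RATE.
mean_eur = mean_usd / RATE
var_eur = var_usd / RATE ** 2

# Direct computation on the converted data, for comparison.
ceo_pay_eur = [x / RATE for x in ceo_pay_usd]
direct_mean = statistics.mean(ceo_pay_eur)
direct_var = statistics.pvariance(ceo_pay_eur)

print(round(mean_usd, 3), round(var_usd, 3))
print(round(mean_eur, 3), round(var_eur, 3))
```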

12. (May 2017 exam) The following table shows the values of the Human Development Index (HDI) for different countries of
Africa, America and Europe in the year 2015.

África   0.348 0.411 0.413 0.416 0.419 0.646 0.666 0.684 0.69 0.698 0.721 0.724 0.736 0.772 0.777

América  0.483 0.666 0.679 0.714 0.715 0.772 0.78 0.783 0.785 0.79 0.793 0.827 0.847 0.919 0.923

Europa   0.693 0.751 0.754 0.761 0.771 0.899 0.907 0.907 0.908 0.916 0.916 0.922 0.923 0.93 0.944

Answer the following questions:

(a) Find the three quartiles for each of the three continents and decide if there are any outliers in the data of each continent.

(b) Draw the box-plot of the American data in the following picture. Determine the shape of each distribution and compare
them. Which measures of centrality and variability are more appropriate in each case? Do not calculate them.

(c) Justify the truthfulness or falseness of the following statements. Apply the quartiles to justify your answers.

1) 50 % of African countries in the table have an HDI that is below the level reached by any of the European countries
in the table.

2) 75 % of American countries in the table have an HDI that is above the level reached by any African countries in the
table.

(d) The HDI can be classified as follows: Very High [0.8, 1); High [0.7, 0.8); Medium [0.55, 0.7) and Low [0, 0.55). The contingency table for the variable continent (X) and the variable HDI in categories (Y) is shown below:

X / Y     Low [0, 0.55)   Medium [0.55, 0.7)   High [0.7, 0.8)   Very High [0.8, 1)
Africa    5               5                    5                 0
America   1               2                    8                 4
Europe    0               1                    4                 10

What percentage of countries with high or very high HDI belong to Europe? And what percentage of countries with an HDI below 0.55 belong to the African continent?

Solution.
(a) Since there are n = 15 observations, the quartiles are at positions (1/4)·16 = 4, (1/2)·16 = 8 and (3/4)·16 = 12 of the ordered data. Hence,

          Min     Q1      Q2      Q3      Max
Africa    0.348   0.416   0.684   0.724   0.777
America   0.483   0.714   0.783   0.827   0.923
Europe    0.693   0.761   0.907   0.922   0.944

For the outlier analysis, with lower fence LF = Q1 − 1.5 IQR and upper fence UF = Q3 + 1.5 IQR:

Africa:  IQR = 0.308 ⇒ UF = 1.186 > Max and LF = −0.046 < Min (no outliers)
America: IQR = 0.113 ⇒ UF = 0.9965 > Max and LF = 0.5445 > Min (it has a lower outlier, 0.483)
Europe:  IQR = 0.161 ⇒ UF = 1.1635 > Max and LF = 0.5195 < Min (no outliers)
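The quartile positions and the 1.5 IQR fences used in part (a) can be sketched as follows. This is a minimal illustration that assumes the position (n + 1)·p is a whole number, as it is for n = 15; the function names are hypothetical and a general implementation would need an interpolation rule.

```python
def quartile(sorted_data, p):
    """Value at position (n + 1) * p, assuming that position is a whole number."""
    pos = (len(sorted_data) + 1) * p
    assert pos == int(pos), "this sketch only handles whole-number positions"
    return sorted_data[int(pos) - 1]  # positions are 1-based

def outliers(sorted_data):
    """Observations outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = quartile(sorted_data, 0.25), quartile(sorted_data, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sorted_data if x < low or x > high]

# HDI values from the exercise, already sorted.
hdi = {
    "Africa": [0.348, 0.411, 0.413, 0.416, 0.419, 0.646, 0.666, 0.684,
               0.69, 0.698, 0.721, 0.724, 0.736, 0.772, 0.777],
    "America": [0.483, 0.666, 0.679, 0.714, 0.715, 0.772, 0.78, 0.783,
                0.785, 0.79, 0.793, 0.827, 0.847, 0.919, 0.923],
    "Europe": [0.693, 0.751, 0.754, 0.761, 0.771, 0.899, 0.907, 0.907,
               0.908, 0.916, 0.916, 0.922, 0.923, 0.93, 0.944],
}

for continent, values in hdi.items():
    quartiles = [quartile(values, p) for p in (0.25, 0.5, 0.75)]
    print(continent, quartiles, outliers(values))
```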

(b) With the measures above, the following box plots are built:

The three distributions are skewed to the left to a greater or lesser extent; America, apart from its lower outlier, has the most symmetric distribution. The joint plot makes it easy to compare the HDI across the three continents: America is the most homogeneous (apart from the outlier), Europe reaches the highest median, and Africa shows the most variability and the lowest HDI.

Given the more pronounced skewness of Africa and Europe and the outlier in America, the most suitable measures of centrality and variability are the most robust ones: the median and the IQR, respectively.

There is no need to compute the measures of centre and variability; it is enough to choose robust measures. With the box plots at hand, the continents can be compared without the exact values. The medians clearly increase: Md(Africa) < Md(America) < Md(Europe). As for variability, the plot also shows clearly how the box widths compare. Once the effect of the outlier is removed, the variability of America and Europe is more similar, with America's being the lower. Africa's is clearly higher, which indicates larger differences in the level of development reached by the countries of that continent.

(c) To assess the statements we need Md(Africa), Min(Europe), Q1(America), Max(Africa), Md(Europe) and Q3(America).
1) True. Md(Africa) = 0.684 < 0.693 = Min(Europe), so 50 % of the African countries in the table have an HDI below the minimum value among the European countries.

2) False. Q1(America) = 0.714 < Max(Africa) = 0.777, so we cannot guarantee that the most developed 75 % of American countries exceed the maximum level reached by the African countries, which is 0.777. Note that Q3(America) = 0.827 > Max(Africa) = 0.777 only guarantees that the top 25 % of American countries have a value of at least 0.827. In fact, only 9 American countries exceed the African maximum, which is 60 %.
(d) There are 14 + 17 = 31 countries with a High or Very High HDI, of which 10 + 4 = 14 are European. Hence 14/31 = 0.4516 and the requested percentage is 45.16 %.
There are 6 countries with an HDI below 0.55, of which 5 are African. Hence 5/6 = 0.8333 and the requested percentage is 83.33 %.

Topic 3: Analysis of bivariate data
1. Bivariate data.
2. Tabular methods.

a. The two-way table. Absolute and relative frequency tables.
b. Marginal and conditional frequencies.
c. The two-way table with quantitative variables.
3. Charts and numerical summary:
a. Qualitative variables: bar charts (clustered, stacked)
b. Qualitative variable and quantitative variable:
i. Multiple box-plots, histograms.
ii. Multiple numerical summaries.
c. Quantitative variables:
i. Scatterplots.
ii. Types of relationships between numerical variables
iii. Measures of linear association: covariance and correlation.

Chapter 3: Analysis of bivariate data
Bivariate data
• Bivariate data are obtained when we observe two variables (X, Y), numerical or categorical,
in a sample of n individuals:
(x1, y1), (x2, y2), …, (xn, yn).
• Besides analysing each variable separately, we want to study whether there is any relation between them and, if so, to analyse that relation.
Joint absolute frequency distribution - Absolute frequency two-way table
When at least one variable is qualitative, the two-way table is also called a contingency table
Two-way table with k rows and m columns

Notation:
• Joint absolute freq. for classes ci of X and c’j of Y: nij
• Marginal absolute freq. for class ci of X: ni. = ni1 + … + nim
• Marginal absolute freq. for class c’j of Y: n.j = n1j + … + nkj
• Sample size: n.. = n

Example:
• Sample: 10 madrileños
• Variable X: Educational level attained (1=Below secondary, 2=Secondary, 3=Post-
secondary)
• Variable Y: Employment status (1=Employed, 2=Unemployed, 3=Inactive)
X/Y     1   2   3   Total
1       0   0   2   2
2       1   0   4   5
3       2   0   1   3
Total   3   0   7   10

Another example:
• Sample: 1508 madrileños.
• Variable X: Educational level attained (1=Below secondary, 2=Secondary, 3=Post-
secondary)
• Variable Y: Employment status (1=Employed, 2=Unemployed, 3=Inactive)

Relative frequency two-way table
• fij = nij/n: Joint relative freq. for classes ci of X and c’j of Y

• Marginal relative freq. for row i (class ci of X):
fi. = fi1 + … + fij + … + fim
• Marginal relative freq. for column j (class c’j of Y):
f.j = f1j + … + fij + … + fkj
Example:

The table is obtained by dividing every entry of the previous table by n.., which in this case is 1508.
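As a small illustration of the same computation, the sketch below builds the relative frequency table and both marginals for the 10-person example given earlier (the 1508-person table is not reproduced in these notes). Exact fractions are used so the results can be read off directly; the variable names are illustrative.

```python
from fractions import Fraction

# Joint absolute frequencies n_ij from the 10-person example:
# rows = education level X (1, 2, 3), columns = employment status Y (1, 2, 3).
counts = [
    [0, 0, 2],
    [1, 0, 4],
    [2, 0, 1],
]

n = sum(sum(row) for row in counts)  # sample size n..

# Joint relative frequencies f_ij = n_ij / n.
rel = [[Fraction(nij, n) for nij in row] for row in counts]

# Marginal relative frequencies: sum over columns (f_i.) and over rows (f_.j).
row_marginals = [sum(row) for row in rel]
col_marginals = [sum(col) for col in zip(*rel)]

print(n, row_marginals, col_marginals)
```

Each set of marginals adds up to 1, which is a quick sanity check on the table.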

Conditional frequency distributions
• Given the joint distribution of (X, Y), the absolute frequency distribution of one of the variables,
assuming known and fixed the value of the other variable, is a conditional distribution.
• Notation: Y | X = ci or X | Y = c’j. The symbol “|” reads “given” and is followed by the condition.

Example:
Conditional frequency distribution for employment status (Y) given that the attained education level
is post-secondary (X)

• Interpretation: 1.93% of sampled individuals who have attained a post-secondary education level are unemployed.
• Variable Y takes the values “Employed”, “Unemployed” and “Inactive”.
• We go to the frequency table and look for the entries that satisfy the condition. In this case they are the entries with X = post-secondary.

• We compute the relative frequencies using only the entries that satisfy the condition; that is, we divide by 414 (the individuals with post-secondary education), not by 1508.

Another example:
We can also condition on one variable taking more than one value:

• Y | (X ≥ secondary).

• Interpretation: 3.3% of sampled individuals who have attained a secondary or post-secondary level of education are unemployed.
• The values that satisfy that condition are Secondary and Post-secondary
• We add their absolute frequencies
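The two kinds of conditioning described above can be sketched in a few lines. Since the 1508-person table is not reproduced here, the example reuses the 10-person table from the first example; `conditional_y` is a hypothetical helper name.

```python
from fractions import Fraction

# Absolute frequencies from the 10-person example:
# key = education level X, value = counts for Y = 1, 2, 3.
counts = {
    1: [0, 0, 2],
    2: [1, 0, 4],
    3: [2, 0, 1],
}

def conditional_y(levels):
    """Relative frequencies of Y given that X takes one of the values in `levels`."""
    # Add the absolute frequencies of the rows that satisfy the condition...
    pooled = [sum(counts[x][j] for x in levels) for j in range(3)]
    # ...and divide by their total, not by the full sample size.
    total = sum(pooled)
    return [Fraction(c, total) for c in pooled]

print(conditional_y([3]))     # Y | X = post-secondary
print(conditional_y([2, 3]))  # Y | X >= secondary
```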

Two-way table for quantitative variables

Example:
• Sample: 43 students.
• Variable X: # of times the student went to the theatre last month
• Variable Y: # of times the student went to the movies last month
X and Y are discrete quantitative variables and take a small number of distinct values ⇒ data
without grouping

Another example:
• Sample: 1000 USA firms
• Variable X: Sales volume. Variable Y: Number of workers
X and Y are discrete quantitative variables that take a large number of distinct values ⇒ grouped
data

Graphs
Clustered bar graph Stacked bar graph

Quantitative variables: the scatterplot
Is there a relation between the size and the price of a house?
• Sample: 15 houses
• Variable Y: price
• Variable X: size, in m²
Two statistical variables X and Y can present:
• no relationship
• a negative linear relationship
• a positive linear relationship
• a nonlinear relationship.

Measure of linear association for quantitative variables: covariance
The covariance (sxy) is a measure of association between two variables. It quantifies the
information in a scatter plot on the linear association between them.

• sxy >> 0 ⇒ Positive linear relationship
• sxy << 0 ⇒ Negative linear relationship
• sxy ≈ 0 ⇒ No relationship or nonlinear relationship
• Drawbacks of the covariance:
o It can take any value, hence it is unclear when sxy is sufficiently large or small
o It depends on the units of measurement of the variables:
▪ If sxy is the covariance between X and Y, a and b are numbers with b ≠ 0, and T = a + bY, then sxt = b·sxy
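The dependence on units can be seen with a small numerical sketch. The data below are made up for illustration, and the covariance is computed with divisor n; that divisor is an assumption, since the formula itself is not reproduced in these notes (a divisor of n − 1 would give the same scaling property).

```python
def covariance(x, y):
    """Covariance with divisor n (assumed convention for this sketch)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

# Change the units of Y: T = a + b*Y.
a, b = 10.0, 3.0
t = [a + b * yi for yi in y]

s_xy = covariance(x, y)
s_xt = covariance(x, t)
print(s_xy, s_xt)  # s_xt equals b * s_xy; the shift a has no effect
```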

Measures of linear association: The correlation

• Correlation (Pearson’s linear correlation coefficient): rxy = sxy / (sx · sy), that is, the covariance divided by the product of the standard deviations of X and Y

• Advantages:
o It is bounded: −1 ≤ rxy ≤ 1
o It does not depend on the units of measurement

o Interpretation:
▪ rxy > 0: Positive linear association
▪ rxy < 0: Negative linear association
▪ | rxy | = 1: Perfect linear association
▪ rxy = 0: X and Y are uncorrelated (no linear relationship)
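A short sketch of these properties, using made-up house sizes and prices (hypothetical values, not the 15 houses of the earlier example). Divisor n is assumed for the covariance and the standard deviations; the divisor cancels in the ratio, so the correlation does not depend on that choice.

```python
import math

def correlation(x, y):
    """Pearson correlation r_xy = s_xy / (s_x * s_y), all computed with divisor n."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return s_xy / (s_x * s_y)

# Hypothetical data: size in square metres, price in thousands of euros.
size_m2 = [50.0, 70.0, 80.0, 100.0, 120.0]
price = [150.0, 200.0, 210.0, 280.0, 330.0]

r = correlation(size_m2, price)

# Changing units (square metres to square feet) leaves the correlation unchanged.
size_ft2 = [10.764 * s for s in size_m2]
r_ft2 = correlation(size_ft2, price)

# An exact increasing linear relation gives a correlation of 1.
r_perfect = correlation(size_m2, [2 * s - 1 for s in size_m2])

print(r, r_ft2, r_perfect)
```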

Correlation and causation


Suppose that the correlation between two variables X and Y is very high (e.g., rxy = 0.9)
• Can we conclude that there is a causal relationship between such variables? (one causes
the other)
• The answer is: NO
• For example, X = foot size of a child, Y = reading comprehension of a child: both increase with age, so they are correlated, but neither causes the other
• Correlation does not imply causation

Example. Three variables are measured over 91 countries: X = female life expectancy, Y = male
life expectancy and Z = GDP
• The covariances between the three pairs of two variables are sxy = 105.15, sxz = 50066.04
and syz = 57917.93.
• On the other hand, the correlations between them are rxy = 0.98, rxz = 0.64 and ryz = 0.65
• Therefore, even though the covariances of female and male life expectancy with GDP are much larger than the covariance between the two life expectancies, the correlation is largest for the two life expectancies
Example: PISA 2012
• Sample: 64 countries participating in the 2012 PISA tests
• X: Countrywide mean reading score
• Y: Countrywide mean math score
We have:
• The covariance between X and Y is sxy = 2440.78

• The correlation between X and Y is rxy = 0.96
The following slides show the data through scatterplots
• What can you conclude about the relationship between such variables?

Question 1
For a sample of two variables, X and Y, if we modify the first variable X to X′ = 2X − 1, the correlation coefficient of X′ and Y is the same as that for X and Y.

The correlation coefficient does not change if we apply a linear transformation with positive slope to one or both variables (a negative slope only changes its sign).
The correct answer is 'True'.

Question 2
For two variables X and Y, you are given the following frequency table:
Frequencies

        Y = 0   Y = 1   Y = 2
X = 0   0.2     0.7     0.1
X = 1   0.3     0.3     0.4

Then, the frequencies for each value of X are conditional frequencies.

The statement is true, as the frequencies in each row add up to 1.


The correct answer is 'True'.

Question 3
For a bivariate sample from two random variables, X and Y, if all the conditional means of Y for
each of the different values of X are equal, they are also equal to the marginal mean of Y .
The marginal mean of Y can be written as the sum of the conditional means times their marginal
frequencies, divided by the sample size.
The correct answer is 'True'.
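The argument can be written out as a short derivation. Here n_{i·} is the marginal absolute frequency of the i-th value of X (the notation of the two-way table section), k is the number of distinct values of X, and m is a name introduced only for this sketch to denote the common value of the conditional means.

```latex
\bar{y} \;=\; \frac{1}{n}\sum_{i=1}^{k} n_{i\cdot}\,\bar{y}_{\mid x_i},
\qquad\text{so if } \bar{y}_{\mid x_i} = m \text{ for every } i,\qquad
\bar{y} \;=\; \frac{m}{n}\sum_{i=1}^{k} n_{i\cdot} \;=\; \frac{m\,n}{n} \;=\; m .
```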

Question 4
For two nonnegative variables X and Y, if the relative frequency of X = 1 | Y = 0 is 0.25 and that of X = 0 | Y = 0 is 0.1, then the relative frequency of X ≤ 1 | Y = 0 is 0.35.
If we condition on the same value of the variable, we can add the frequencies, as the denominator used to compute each conditional frequency (the marginal frequency of Y = 0) is the same in all cases.
The correct answer is 'True'.

Statistics I
Exercises for Topic 3
Academic year 2020/21
- Short Answers

These short answers do not include in general comments related to the interpretation of the solutions. Nevertheless, some
interpretations are provided for the exercises corresponding to exams from previous years.

Exercises
1. Answer(s):

a) Marginal relative frequencies:

# of children \ Income: (0, 1000] (1000, 2000] (2000, 3000] (3000, 5000] # children marginal distribution
0 0,15 0,05 0,03 0,02 0,25
1 0,10 0,20 0,10 0,05 0,45
2 0,05 0,10 0,05 0,03 0,23
3 0,02 0,03 0,02 0,00 0,07
Income marginal distribution 0,32 0,38 0,20 0,10 1
b) Conditional distribution for Y | X ≥ 2:
Income | # children ≥ 2:   (0, 1000]   (1000, 2000]   (2000, 3000]   (3000, 5000]   Total
fr(c’j | x ≥ 2)            0,233       0,433          0,233          0,1            1

Conditional mean income: Ȳ | (X ≥ 2) = 1750 euros.


c) Frequency table for X | Y ∈ (1000, 2000]:
# children, X | Y ∈ (1000, 2000]   0      1      2      3
fr(xi | y ∈ (1000, 2000])          0,13   0,53   0,26   0,08
d) Distributions for X conditioned on each income level:
Conditional distributions
# children   X|Y ∈ (0, 1000]   X|Y ∈ (1000, 2000]   X|Y ∈ (2000, 3000]   X|Y ∈ (3000, 5000]
0            0,47              0,13                 0,15                 0,20
1            0,31              0,53                 0,50                 0,50
2            0,16              0,26                 0,25                 0,30
3            0,06              0,08                 0,10                 0,00
The following figure shows the bar graphs for the four conditional distributions and the marginal:

2. Answer(s):
a) Marginal distribution for Y = Weekly number of credit card purchases:
Purchases 0 1 2 3 4 Total
nj 36 72 69 69 54 300
Mean, quasivariance and quasi-standard deviation:
ȳ = 2,11, s2y = 1,6634, sy = 1,29.
b) Distribution of the number of credit cards:
# credit cards ni
1 117
2 105
3 78
Total 300
The mode is equal to 1.
c) Distribution of the number of purchases by persons with 3 credit cards:
# purchases | (# cards = 3)   0       1       2       3       4       Total
fr(yj | x = 3)                0,038   0,115   0,231   0,308   0,308   1
Conditional mean: ȳ|(x = 3) = 2,731
d) Conditional frequencies:
Y = # purchases
Cond. distr. 0 1 2 3 4 Total
Y |X = 1 0,205 0,333 0,231 0,154 0,077 1
Y |X = 2 0,086 0,229 0,229 0,257 0,200 1
Y |X = 3 0,038 0,115 0,231 0,308 0,308 1
Conditional means:
ȳ|(x = 1) = 1,564, ȳ|(x = 2) = 2,257
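As a consistency check, the marginal mean from part a) can be recovered from the conditional means by weighting each one with the number of persons holding that many cards (the counts in part b). This is only a sketch using the rounded values printed above; the variable names are illustrative.

```python
# Number of persons with 1, 2 and 3 credit cards (part b).
n_x = {1: 117, 2: 105, 3: 78}

# Conditional means of the number of purchases given the number of cards (parts c and d).
cond_mean = {1: 1.564, 2: 2.257, 3: 2.731}

n = sum(n_x.values())

# Marginal mean = weighted average of the conditional means.
marginal_mean = sum(n_x[x] * cond_mean[x] for x in n_x) / n

print(round(marginal_mean, 2))
```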
3. Answer(s):
a) Contingency table:
(−34,5, −29,5] (−29,5, −24,5] (−24,5, −19,5] (−19,5, −14,5] Total
Madrid 2 3 1 0 6
Castilla y León 0 1 0 1 2
Castilla-La Mancha 0 0 2 2 4
Total 2 4 3 3 12

b) Conditional frequencies for the sales changes by autonomous region:
                     (−34,5, −29,5]   (−29,5, −24,5]   (−24,5, −19,5]   (−19,5, −14,5]
Madrid               0,33             0,5              0,17             0
Castilla y León      0                0,5              0                0,5
Castilla-La Mancha   0                0                0,5              0,5
c) Marginal distribution for the changes in sales:
      (−34,5, −29,5]   (−29,5, −24,5]   (−24,5, −19,5]   (−19,5, −14,5]
frj   1/6              1/3              1/4              1/4

d) Conditional distributions depend on the region. Castilla-La Mancha had the smallest changes, equally distributed among
the largest intervals, while Madrid had the largest changes.
e) Scatterplot:

The correlation coefficient is r(x,y) = −0,856.


4. Answer(s): Relative frequency contingency tables for both surveys:

Galicia
Female 20,0 % 60,0 % 10,0 % 10,0 % 100,0 %
Male 50,0 % 30,0 % 15,0 % 5,0 % 100,0 %
Total Galicia 35,0 % 45,0 % 12,5 % 7,5 % 100,0 %
Madrid
Female 15,0 % 30,0 % 35,0 % 20,0 % 100,0 %
Male 5,0 % 15,0 % 35,0 % 45,0 % 100,0 %
Total Madrid 10,0 % 22,5 % 35,0 % 32,5 % 100,0 %
Total both 22,5 % 33,8 % 23,8 % 20,0 % 100,0 %

The bar plot obtained from the preceding table is:

All the interpretation questions can be answered from this bar chart.
The differences between regions are shown in the following bar plot:

5. Answer(s):
a) Scatterplot:

b) Correlation coefficient: r(x,y) = 0,9848

6. Answer(s):

a) Correlation coefficient: rxy = −0,736.


b) The highlighted observation has a large percentage of households facing economic hardship for its income level (but
note that it is not an outlier). It corresponds to the Ciudad Autónoma de Ceuta.
If this observation is removed, the relationship will keep the same sign but it may become slightly stronger. In our case, the correlation without Ceuta is rxy(−ceuta) = −0,786.
7. Answer(s):

a) Scatterplot:

b) The correlation coefficient is rxy = 0,786.


c) Not necessarily, as correlation does not imply causation.
8. Answer(s):
a) Table 1: Joint distribution of relative frequencies for gender and Department
Table 2: Conditional distribution of relative frequencies. Income distribution conditioned to Department
Table 3: Marginal distribution of absolute frequencies by income level
Table 4: Joint distribution of absolute frequencies for marital status and Department
b) Use the charts for the conditionals in relative terms to analyze any possible relationships.

Exercises from Old Exams


9. (May 2012 Exam)
Answer(s):

(a) X is numerical discrete and Y is numerical continuous.


(b) The marginal frequency of X is
xi 0 1 2
ni· 466 352 182
The mean, quasivariance and quasi-standard deviation of X are:

x̄ = 0,716, s2x = 0,5679, sx = 0,7536

(c) The conditional mean is x̄ | (Y < 50) = 0,2609, and the mode is 0.

(d) Boxplot a) corresponds to those households for which X = 2, b) to those where X = 0 and c) to those having X = 1.
(e) Histogram I corresponds to c) (X = 1), II to a) (X = 2) and III to b) (X = 0).
(f) The standardized incomes (zi = (xi − x̄)/s) are 0,3045, 0,1592, −3,3487 respectively. The household with an income
equal to 75,000 (X = 2) is the poorest in relative terms.

10. (May 2016 Exam) See the answer for question (f) in the problem sheet for Lesson 2 (Exercise 12).
11. (May 2017 Exam) See the answer for question (d) in the problem sheet for Lesson 2 (Exercise 14).
12. (June 2017 Exam)
Answer(s):
a) The scatterplot shows high dispersion levels for v when p is between 0,6 and 0,8. Also, the correlation coefficient
is quite small. Being conservative in our interpretation, we may conclude that there is no clear linear relationship
between the variables.
b) From the data we can obtain the limits for the corresponding boxplots. They are [0,55, 0,91] for p and [0, 0,68] for v.
We observe in the scatterplot that there exist values of p beyond these limits, both above and below them. There are
also outliers for v above 0,60, but there are no outliers for small values of this variable.
c) The mean and median values do not change, and this may seem to indicate that the distributions could be symmetric.
Looking at the histograms we notice that the distribution of p has positive asymmetry and the distribution of v has
a slight negative asymmetry.
d) Comparing the corresponding relative frequencies

$$\frac{2239}{2239 + 14803} \approx 0,13 \;<\; 0,25 \approx \frac{9027}{9027 + 27686},$$

we see that the percentage of votes in polling stations with high participation is roughly twice that for the low participation polling stations, supporting the conclusion of the analysts.
13. (June 2018 Exam)
Answer(s):

a) The mean income for all families from both regions is

$$\bar{x}_T = \frac{120\,\bar{x}_I + 120\,\bar{x}_{II}}{240} = 2063,60$$
To obtain the joint median we would need to have disaggregated information for the 240 families.
For the modal interval we would also need additional information. We could obtain the modal interval, for example, if
the classes used for both regions were the same.
Using the information we have been given, we could try to define a more detailed set of classes, with width equal to
300 and starting at 300 euros. We would need to assume that each new class would contain exactly one half of the
observations from the preceding larger classes (uniformity in the frequencies). We would obtain:

The modal interval would be [1800, 2100).
b) We will compare the coefficients of variation. Their values are:

$$cv_I = \frac{s_I}{|\bar{x}_I|} = 0,3723, \qquad cv_{II} = \frac{s_{II}}{|\bar{x}_{II}|} = 0,4096$$
Both values are very similar, but the variability in Region I is slightly lower.
c) We will conduct this comparison by using the histograms and frequency polygons for both regions.

From these graphs we observe that Region I has a negative asymmetric distribution, while Region II has a slightly
positive asymmetric distribution. This is confirmed by the signs of their asymmetry coefficients.
An interpretation of these differences is that Region I seems to present larger income inequalities, while having larger
income levels.
d) We compute the conditional means using the class marks:

$$\bar{x}\mid(y = 1) = \frac{900 \cdot 12 + 1500 \cdot 6 + 2100 \cdot 3 + 2700 \cdot 1 + 3300 \cdot 6}{12 + 6 + 3 + 1 + 6} = 1735,7 \text{ euros}$$

$$\bar{x}\mid(y = 2) = \frac{900 \cdot 2 + 1500 \cdot 3 + 2100 \cdot 12 + 2700 \cdot 5 + 3300 \cdot 8}{2 + 3 + 12 + 5 + 8} = 2380 \text{ euros}$$

$$\bar{x}\mid(y = 3) = \frac{900 \cdot 3 + 1500 \cdot 3 + 2100 \cdot 3 + 2700 \cdot 9 + 3300 \cdot 22}{3 + 3 + 3 + 9 + 22} = 2760 \text{ euros}$$

$$\bar{x}\mid(y = 4 \text{ or more}) = \frac{900 \cdot 1 + 1500 \cdot 0 + 2100 \cdot 2 + 2700 \cdot 5 + 3300 \cdot 14}{1 + 0 + 2 + 5 + 14} = 2945 \text{ euros}$$
The average incomes increase with the number of family members.
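The four conditional means can be reproduced with a short sketch that weights each class mark by its frequency. The class marks and counts are the ones shown in the fractions above; the dictionary keys and the helper name `grouped_mean` are illustrative.

```python
# Class marks of the income intervals (euros).
class_marks = [900, 1500, 2100, 2700, 3300]

# Number of families in each income class, by number of family members.
counts_by_size = {
    "1": [12, 6, 3, 1, 6],
    "2": [2, 3, 12, 5, 8],
    "3": [3, 3, 3, 9, 22],
    "4 or more": [1, 0, 2, 5, 14],
}

def grouped_mean(marks, counts):
    """Mean of grouped data: class marks weighted by their absolute frequencies."""
    return sum(m * c for m, c in zip(marks, counts)) / sum(counts)

cond_means = {size: grouped_mean(class_marks, counts)
              for size, counts in counts_by_size.items()}

for size, mean in cond_means.items():
    print(size, round(mean, 1))
```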

14. (May 2018 Exam)


Answer(s):

(a) The right boxplot includes an outlier. We can use the lower limits for outlier values in both cases to differentiate between them: LL = Q1 − 1,5 × IQR, where IQR = Q3 − Q1.
For employees up to 35 years old, Q1 = 17 and Q3 = 31. Thus, IQR = 31 − 17 = 14 and the lower limit is LL = 17 − 1,5 × 14 = −4 < 0. We conclude that in this group there are no lower outlying observations. These data correspond to the left boxplot.
For employees older than 35: Q1 = 34 and Q3 = 41. Thus, IQR = 41 − 34 = 7 and the lower limit is LL = 34 − 1,5 × 7 = 23,5. In this case there is an outlier (x = 21). The right boxplot is the one representing these values.

(b) Median = 32,5, CV = 9,42/29,95 = 0,31, Q1 = Q3 - IQR = 38 − 17 = 21, Range = 43 − 14 = 29.
Symmetry: As the mean is smaller than the median, this suggests the presence of negative asymmetry.
(c) (1) Raise salaries by 1000 euros: y = x + 1.
Mean: 29,95 + 1 = 30,95. Median: 32,5 + 1 = 33,5. SD = not affected by this transformation.
(2) Raise salaries by 10%: z = 1,1x.
Mean: 1,1 × 29,95 = 32,94. Median: 1,1 × 32,5 = 35,75. SD = 1,1 × 9,42 = 10,36.
(d) (i) Mean for employees older than 35: 37,7. Q3 for employees younger than 35: 29.
Thus, 75 % of the employees younger than 35 earn less than 29 (thousands of euros) each year, while the average
salary for older than 35 employees is 37,7. The statement is true.
(ii) Highest salary: 43. Q1 for these 20 employees: 21. Thus, 25 % of these 20 employees earn less than 21 (thousands
of euros), that is, less than 21,5, half the income of the employee with the highest salary. The statement is true.

15. (May 2018 Exam)


Answer(s):

(a) Both variables are quantitative (numerical) continuous variables (and their frequencies are grouped by intervals).
(b) The frequency table including the marginal distributions is:
X\Y          ≤ 60    (60, 80]   (80, 100]   (100, 150]   > 150   ni·       fi·
(50, 100]    20      18         2           1            0       41        0,155
(100, 200]   25      40         30          2            1       98        0,370
(200, 350]   5       10         15          25           3       58        0,219
(350, 500]   0       5          15          20           8       48        0,181
> 500        0       1          2           7            10      20        0,075
n·j          50      74         64          55           22      N = 265   1
f·j          0,189   0,279      0,242       0,208        0,083   1
(c) The modal interval for the monthly income is (100, 200]. The distribution of the home size conditioned to this interval
is given by
X\Y          ≤ 60    (60, 80]   (80, 100]   (100, 150]   > 150
(100, 200]   0,255   0,408      0,307       0,02         0,01
(d) The median interval for the home size is (80, 100]. The distribution of family income conditioned to this interval is
X\Y (80, 100]
(50, 100] 0,031
(100, 200] 0,470
(200, 350] 0,234
(350, 500] 0,234
> 500 0,031

Topic 4. Probability
1. Random experiments, sample space, elementary and composite events.
2. Definition of probability. Properties.

3. Conditional probability and Multiplication Law. Independence.
4. Law of Total Probability and Bayes’ Theorem.
Topic 4: Probability
Basic concepts: examples
• Random experiment: outcome of a die toss
o Sample space (possible outcomes) finite: Ω = {1, 2, 3, 4, 5, 6}
o Elementary events (sample points): {1}, {2}, . . ., {6}
o Composite events: e.g., A = “even outcome” = {2, 4, 6}, B = “outcome greater than 3”
= {4, 5, 6}

• Random experiment: number of visits to UC3M’s web page next Monday
o Sample space countably infinite: Ω = {0, 1, 2, . . .} = N ∪ {0}
o Elementary events: {0}, {1}, {2}, . . .
o Composite events: e.g., A = “at least 100 visits” = {100, 101, . . .} and B = “less than
500 visits” = {0, 1, . . ., 499}
• Random experiment: closing price of a certain share of stock next Monday
o Sample space uncountably infinite: Ω = (0, +∞) or Ω = (0, M) for some M large enough
o Elementary events: {x}, with x ∈ Ω
o Composite events: e.g., A = “price larger than 5 euros” = (5, M) and B = “price between 3 and 8 euros” = (3, 8)

Random events: basic concepts


Events: An event is a “reasonable” subset A of the sample space Ω (A ⊆ Ω). If the outcome ω of
the random experiment satisfies that ω ∈ A, the event happens. Otherwise, event A does not
happen.
Trivial events:
• Sure event: The complete sample space Ω. It always happens.
• Impossible event: The empty set ∅. It never happens.
Complementary event to an event A: event that happens when A does not happen. It comprises the elementary events of Ω that are not in A. We denote it by A̅ or Ac

Basic operations with random events
Suppose that A and B are events of the sample space Ω
• Intersection of events: The intersection A ∩ B comprises all elements that are both in A and
B (A∩B: “A and B happen”)
o A and B are incompatible events if they have no element in common, i.e., if their
intersection is A ∩ B = ∅
• Union of events: The union A ∪ B comprises all elements that are in A or in B (A ∪ B: “A or B happen”)
• Difference of events: The difference A \ B comprises all elements of A that are not in B (A \ B: “A happens but not B”)

De Morgan’s Laws
Relations between the union, intersection and complementary events: (A ∪ B)c = Ac ∩ Bc and (A ∩ B)c = Ac ∪ Bc

Example: die tossing
Random experiment “result of a die toss”:

• Sample space: Ω = {1, 2, 3, 4, 5, 6}


• Elementary events: {1}, {2}, {3}, {4}, {5}, {6}
• Composite events: e.g., A = {2, 4, 6}, B = {4, 5, 6}
Event A happens when “the outcome is even”
Event B happens when “the outcome is larger than 3”
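The operations on events can be tried out directly with Python sets on the die example. The sketch below also checks De Morgan's laws for these two events; the helper name `complement` is illustrative.

```python
# Sample space and the two composite events of the die example.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # "the outcome is even"
B = {4, 5, 6}  # "the outcome is larger than 3"

def complement(event):
    """Elementary outcomes of omega that are not in the event."""
    return omega - event

intersection = A & B  # A and B happen
union = A | B         # A or B happen
difference = A - B    # A happens but not B

print(sorted(intersection), sorted(union), sorted(difference), sorted(complement(A)))

# De Morgan's laws for these events.
de_morgan_union = complement(A | B) == complement(A) & complement(B)
de_morgan_intersection = complement(A & B) == complement(A) | complement(B)
print(de_morgan_union, de_morgan_intersection)
```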

Probability: Intuition
The probability of an event is a measure of the confidence we have a priori that the event will
happen when the random experiment takes place (the larger the probability of an event, the higher
the confidence that it will happen).



When tossing a fair die, intuitively:
• The probability that the outcome is 1 is less than the probability that the outcome is larger
than 1.
• The probability of getting a 4 is equal to that of getting a 6.
• The probability of getting a 7 is minimal since it is an impossible event.
• The probability of getting a positive number is maximal because it is a sure event.

Three approaches/interpretations
Classical probability (Laplace’s Rule): It considers random experiments where all elementary
events are equiprobable. If event A has n(A) sample points, we define the probability of A as

P(A) = number of cases favourable to A / number of possible cases = n(A) / n(Ω).

Frequentist approach: If the experiment were to be repeated many times, the relative frequency
of event A happening would converge to its probability.

P(A) = limiting frequency of event A

Subjective probability: It depends on the available information.

P(A) = degree of belief that event A will happen
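The classical and frequentist interpretations can be compared numerically. The sketch below assumes a fair die and the illustrative event "the outcome is even"; the seed and the number of simulated tosses are arbitrary choices made only so that the run is reproducible.

```python
import random
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]
A = {2, 4, 6}  # illustrative event: "the outcome is even"

# Classical probability (Laplace's Rule): favourable cases / possible cases.
p_classical = Fraction(len(A), len(omega))

# Frequentist approach: relative frequency of A over many simulated tosses.
rng = random.Random(12345)   # fixed, arbitrary seed
n_tosses = 100_000
hits = sum(1 for _ in range(n_tosses) if rng.choice(omega) in A)
relative_frequency = hits / n_tosses

print(p_classical, relative_frequency)
```

With more tosses the relative frequency tends to get closer to 1/2, which is the idea behind the limiting frequency.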

Probability: Axioms and properties

Definition: Let F be the collection of all events of Ω (note: F comprises all subsets of Ω if Ω is
countable). A probability is a function P : F → [0, 1] that assigns to each event A ∈ F a number
P(A), satisfying the following axioms:
• P(A) ≥ 0 for every event A ∈ F
• P(Ω) = 1
• Probability of the union of incompatible events: if A and B are incompatible (A ∩ B = ∅), then

P(A ∪ B) = P(A) + P(B)

Properties (consequences of the Axioms):

• Probability of the complementary:

P(A̅) = 1 − P(A)

• P(∅) = 0
• If event A is included in event B (A ⊆ B), then P(A) ≤ P(B)
• If A = {e1, . . ., en} is finite (or countably infinite, with the sum running over all its elements), then
P(A) = ∑i P({ei}) (note: we’ll write P({ei}) = P(ei))
• Probability of the union:



P(A∪B) = P(A) + P(B) − P(A∩B)

Example: tossing a fair die
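As a rough illustration of the properties above on the fair die, the following sketch uses Laplace's Rule with exact fractions. The events A and B are the ones from the earlier die-tossing example; the helper `prob` is a name introduced here for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Laplace's Rule on the fair die: favourable cases / possible cases."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "the outcome is even"
B = {4, 5, 6}   # "the outcome is larger than 3"

p_complement = prob(omega - A)                      # P(complement of A)
p_union = prob(A | B)                               # P(A ∪ B) by direct counting
p_union_formula = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) − P(A ∩ B)

print(p_complement, p_union, p_union_formula)
```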

Conditional probability. Independent events


Conditional probability
Definition: The probability of an event A given that another event B (with P(B) > 0) has happened
is

P(A | B) = P(A ∩ B)/P(B)

Independent events

Intuitively: knowing that an event has happened gives us no information about whether the other
event has happened
Definition: Two events A and B are independent if

P(A ∩ B) = P(A)P(B)

Property: Suppose that P(B) > 0. Then, A and B are independent ⇐⇒ P(A | B) = P(A)
Conditional probability: Example
The following table shows the results of classifying a group of 100 executives according to their
weight and to whether or not they are hypertensive:



• Random experiment: we select one of the 100 executives at random, all being equally likely, and observe their
blood pressure and weight classification.
• Sample space: Ω = {(H, I),(H, N),(H, O),(N, I),(N, N),(N, O)}
Probability of A = “the selected executive is hypertensive”?

P(A) = 20/100 = 0.2


Suppose the selected executive is overweight. What is then the probability that (s)he is hypertensive? Is
it the same as before?
Probability of A (“is hypertensive”) given B (“is overweight”): P(A | B)
To calculate it, we consider only the overweight executives:

P(A | B) = n(A ∩ B)/n(B)

= 10/25 = 0.4 > 0.2 = P(A)


The probability of an event depends on the available information

The conditional probability P(A | B) is the probability that A happens given that we know B has
happened
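The calculation in this example can be reproduced from the counts quoted in the text (100 executives, 20 hypertensive, 25 overweight, 10 both). This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

n_total = 100         # executives in the group
n_hypertensive = 20   # n(A)
n_overweight = 25     # n(B)
n_both = 10           # n(A ∩ B)

p_A = Fraction(n_hypertensive, n_total)
p_B = Fraction(n_overweight, n_total)
p_A_and_B = Fraction(n_both, n_total)

# Definition of conditional probability: P(A | B) = P(A ∩ B) / P(B).
p_A_given_B = p_A_and_B / p_B

print(float(p_A), float(p_A_given_B))
```

Dividing by P(B) is the same as restricting attention to the 25 overweight executives, which is why n(A ∩ B)/n(B) gives the same number.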

Independent events: example


• Fair die toss
• Event A: outcome is even
• Event B: outcome is larger than 2
• We are told that B happened. What is the conditional probability that the outcome was even?
P(A | B) = P(A ∩ B)/P(B) = (2/6)/(4/6) = 1/2 = P(A)
Events A and B are independent
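A small sketch of the independence check for this example, using the definition P(A ∩ B) = P(A)P(B). The third event C ("larger than 3") is added here only for contrast and is not part of this example.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event on the fair die (Laplace's Rule)."""
    return Fraction(len(event), len(omega))

def independent(E, F):
    """True when P(E ∩ F) = P(E) P(F)."""
    return prob(E & F) == prob(E) * prob(F)

A = {2, 4, 6}        # outcome is even
B = {3, 4, 5, 6}     # outcome is larger than 2
C = {4, 5, 6}        # outcome is larger than 3 (extra event, for contrast)

print(independent(A, B))   # A and B: independent
print(independent(A, C))   # A and C: not independent
```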

Fundamental theorems of probability calculus
Multiplication Rule
Useful to compute the probability that several events happen simultaneously when the conditional
probabilities are easy to calculate.
• P(A ∩ B) = P(A) P(B | A), if P(A) > 0
• P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A ∩ B), if P(A ∩ B) > 0
• It extends to calculate the probability of the intersection of n events A1, . . . , An

Multiplication Rule: examples
We draw successively two cards from a Spanish card deck. Probability that:
• the first card is a “copa”: P(A) = 12/48
• the second card is a “copa”, knowing that the first card was a “copa”: P(B | A) = 11/47
• both cards are “copas”: P(A ∩ B) = P(A)P(B | A) = 12/48 × 11/47
We throw a fair die twice. Probability that:
• we get a 1 in the first throw: P(C) = 1/6
• we get a 1 in the second throw, knowing that in the first we got a 1: P(D | C) = P(D) = 1/6
• we get a 1 in the first throw, knowing that in the second we got a 1: P(C | D) = P(C) = 1/6
• we get a 1 in both throws: P(C ∩ D) = P(C)P(D | C) = P(C) P(D) = 1/6 × 1/6 = 1/36 (independent
events)
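The two examples can be checked with exact fractions. The sketch below also cross-checks the card example by counting ordered pairs of cards, assuming a 48-card Spanish deck with 12 cards in each of the four suits (consistent with the 12/48 used above); the suit names and variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication Rule: P(A ∩ B) = P(A) P(B | A).
p_first_copa = Fraction(12, 48)
p_second_copa_given_first = Fraction(11, 47)
p_both_copas = p_first_copa * p_second_copa_given_first

# Cross-check by counting ordered pairs of distinct cards (Laplace's Rule).
suits = ["oros", "copas", "espadas", "bastos"]
deck = [(suit, rank) for suit in suits for rank in range(1, 13)]
pairs = list(permutations(deck, 2))
favourable = sum(1 for first, second in pairs
                 if first[0] == "copas" and second[0] == "copas")
p_both_copas_by_counting = Fraction(favourable, len(pairs))

# Two throws of a fair die: independent events, so the probabilities multiply.
p_one = Fraction(1, 6)
p_two_ones = p_one * p_one

print(p_both_copas, p_both_copas_by_counting, p_two_ones)
```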

Fundamental theorems: Theorem of Total Probability


Events B1, B2, …, Bk are mutually exclusive if Bi ∩ Bj = ∅ for i ≠ j. If furthermore they satisfy that
Ω = B1 ∪ B2 ∪ … ∪ Bk, we say they are a partition of the sample space

If B1, B2, …, Bk is a partition of the sample space such that P(Bi) ≠ 0, i = 1, …, k, and A is any
event, then

P(A) = P(A ∩ B1)+P(A ∩ B2)+ … +P(A ∩ Bk ) = P(A|B1)P(B1)+P(A|B2)P(B2) + … + P(A|Bk)P(Bk)

Theorem of Total Probability: example


In a cookie factory there are four packaging lines: A1, A2, A3 and A4. 35% of total production is
packed in line A1, and 20%, 24% and 21% in lines A2, A3 and A4, respectively. Data show that a small
percentage of cookie packages are incorrectly packaged: 1% in A1, 3% in A2, 2.5% in A3 and 2%
in A4. What is the probability that a randomly chosen cookie package is defective (event D)?
P(D) = P(D ∩ A1) + P(D ∩ A2) + P(D ∩ A3) + P(D ∩ A4) = P(D|A1)P(A1) + P(D|A2)P(A2) + P(D|A3)P(A3) + P(D|A4)P(A4)
= 0.01 × 0.35 + 0.03 × 0.20 + 0.025 × 0.24 + 0.02 × 0.21 = 0.0197
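The weighted sum in this example is easy to reproduce. A minimal sketch, with dictionary names chosen here for illustration:

```python
# Share of production packed in each line: P(Ak).
line_share = {"A1": 0.35, "A2": 0.20, "A3": 0.24, "A4": 0.21}
# Fraction of incorrectly packaged packages in each line: P(D | Ak).
defect_rate = {"A1": 0.01, "A2": 0.03, "A3": 0.025, "A4": 0.02}

# Theorem of Total Probability: P(D) = sum over k of P(D | Ak) P(Ak).
p_defective = sum(defect_rate[k] * line_share[k] for k in line_share)

print(round(p_defective, 4))
```

The shares must add up to 1 for the lines to form a partition of the sample space, which is the case here.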

Fundamental theorems: Bayes’ Theorem


Given two events A and B with P(A) > 0 and P(B) > 0 we have

P(A | B) = P(B | A)P(A) /P(B)

Example: (cont.) Suppose that the chosen cookie package is defective. What is the probability that
it was packed in line A1?
P(A1 | D) = P(D | A1)P(A1)/P(D) = (0.01 × 0.35)/0.0197 = 0.17766
Given a partition of the sample space B1, B2, . . . , Bk , with P(Bi) ≠ 0, i = 1, . . . , k, and
given an event A with P(A) > 0, we have, for j = 1, . . . , k,
P(Bj | A) = P(A | Bj)P(Bj)/P(A) =
P(A | Bj)P(Bj)/ [P(A | B1)P(B1) + P(A | B2)P(B2) + . . . + P(A | Bk )P(Bk )]
• Prior probabilities of the Bj : P(B1), . . . , P(Bk )
• Posterior probabilities of the Bj : P(B1 | A), . . . , P(Bk | A)
• Likelihood of A given each Bj : P(A | B1), . . . , P(A | Bk )
Bayes’ Theorem: Example
There is a clinical test for a rare disease affecting 1 in 10000 people.
On average, the test gives a positive outcome (it detects the disease) in 99 out of 100 people having
it, and gives a negative outcome (it does not detect it) in 97 out of 100 people who do not have it.
The test is applied to a randomly chosen person, and the outcome is positive. What is the probability
that the person has the disease?
Events: B1 = the person has the disease, B2 = the person does not have the disease, A = positive
test outcome
We apply Bayes’ Theorem with P(B1) = 0.0001, P(B2) = 0.9999, P(A | B1) = 0.99 and P(A | B2) = 1 − 0.97 = 0.03:

P(B1 | A) = P(A | B1)P(B1)/[P(A | B1)P(B1) + P(A | B2)P(B2)] = (0.99 × 0.0001)/(0.99 × 0.0001 + 0.03 × 0.9999) = 0.0033

The probability that the person has the disease is only 0.33%
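Both Bayes examples (the clinical test and the cookie factory) follow the same pattern: priors times likelihoods, divided by their sum. The helper `posterior` below is a hypothetical name introduced for this sketch.

```python
def posterior(priors, likelihoods):
    """Bayes' Theorem for a partition B1..Bk: returns P(Bj | A) for every j."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A | Bj) P(Bj)
    evidence = sum(joint)   # P(A), by the Theorem of Total Probability
    return [j / evidence for j in joint]

# Clinical test: B1 = has the disease, B2 = does not have it, A = positive test.
disease_priors = [1 / 10000, 1 - 1 / 10000]
positive_likelihoods = [0.99, 1 - 0.97]
p_disease_given_positive = posterior(disease_priors, positive_likelihoods)[0]

# Cookie factory: probability of each line given that the package is defective.
line_priors = [0.35, 0.20, 0.24, 0.21]
defect_likelihoods = [0.01, 0.03, 0.025, 0.02]
p_line_given_defective = posterior(line_priors, defect_likelihoods)

print(round(p_disease_given_positive, 4))
print(round(p_line_given_defective[0], 5))
```

The posterior is small in the clinical example because the disease is so rare that false positives among healthy people far outnumber true positives.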

Fundamental theorems: Applications


The Theorem of Total Probability and Bayes’ theorem are especially useful when:
• The random experiment can be organized in 2 stages

• It is easy to partition the sample space Ω through events B1 … Bk corresponding to the first
stage
• We know, or can easily calculate, the a priori probabilities P(B1) … P(Bk )
• We know, or can easily calculate, the likelihoods P(A | B1), . . . , P(A | Bk ), where A is an
event corresponding to the second stage

Combinatorics
Combinatorics summary

Basic concepts in Combinatorics
• Many random experiments can be defined on the basis of a finite number of elementary
equiprobable events. For example, many games of chance have this property
o In these cases, the computation of probabilities for specific events can be carried out
by counting the number of elementary events that belong to the events of interest
(Laplace rule)
• This count can be facilitated applying results from Combinatorics
o Combinatorics is the study of the different ways to arrange or configure the elements of a finite set, and of
counting the elements in the resulting configurations



Permutations
For a given set with n elements, its permutations are the orderings generated from these elements
without repeating any of them while including all of them
• The ordering of the elements is relevant
The number of permutations that can be generated from a set of n
elements is given by the factorial of n: Pn = n! = n × (n − 1) × … × 2 × 1
Example: Obtain the total number of different orderings for six persons in a row: P6 = 6! = 720

Permutations with repetition


We are given a set of n elements, where “a” elements are identical, “b” elements are also identical
but different from the preceding ones, c are also identical and different from the previous ones, etc.
• Each ordering of these elements is called a permutation with repetition
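The number of permutations with repetition for the preceding case is n!/(a! × b! × c! × …)
Example: Determine how many 10-digit numbers can be formed using
the digits in the number 3233244155. The digit 3 appears three times, the digits 2, 4 and 5 appear twice
each and the digit 1 once, so the answer is 10!/(3! × 2! × 2! × 2! × 1!) = 75600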

Combinations
We call combinations of n elements choose m at a time the subsets of m different elements that
can be selected from a set composed of n elements
• The order of the elements is not relevant in this case
The total number of different combinations for n elements choose m is
given by the combinatorial number: C(n, m) = n!/(m! × (n − m)!)
Example: A committee of three students is to be chosen from a class of 35 students.
Calculate the number of committees that can be formed: C(35, 3) = 6545
Combinations with repetition

The combinations with repetition of n elements choose m at a time are the different selections of m
elements that can be obtained from a set of n elements when repetition is allowed
• They can be interpreted, for example, as the different ways to assign n tasks to m individuals,
if a task can be assigned to more than one individual
The number of combinations with repetition of n elements
choose m is given by: C(n + m − 1, m) = (n + m − 1)!/(m! × (n − 1)!)
Example: Obtain the number of possible outcomes obtained by throwing four indistinguishable dice
(n = 6, m = 4): C(9, 4) = 126

Variations
The variations of n elements choose m at a time are defined as the different orderings of m elements
that can be generated from a set having n elements
• The order of the elements is relevant in this case
The total number of variations of n elements choose m
is given by: n!/(n − m)! = n × (n − 1) × … × (n − m + 1)
Example: Ten athletes compete in a race. Obtain the number of different ways the podium could be
configured with a first, a second and a third athlete: 10 × 9 × 8 = 720



Variations with repetition
The variations with repetition of n elements choose m are defined as the different orderings of m
elements with possible repeated values, obtained from a set of n elements
• The number of variations with repetition of n elements choose m is n^m
Example: Determine how many different sequences of results you can obtain if you throw a coin
ten consecutive times: 2^10 = 1024
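The six counting examples of this summary can be evaluated with Python's standard `math` module (`math.comb` and `math.perm` need Python 3.8 or later). This is a sketch using the standard counting formulas; the variable names are illustrative.

```python
import math

# Permutations: orderings of six persons in a row, 6!.
orderings_six_persons = math.factorial(6)

# Permutations with repetition: n! divided by the factorial of each digit's multiplicity.
digits = "3233244155"
numbers_from_digits = math.factorial(len(digits))
for d in set(digits):
    numbers_from_digits //= math.factorial(digits.count(d))

# Combinations: committees of 3 students out of 35.
committees = math.comb(35, 3)

# Combinations with repetition: four indistinguishable dice (n = 6, m = 4).
dice_outcomes = math.comb(6 + 4 - 1, 4)

# Variations: podium (first, second, third) among ten athletes.
podiums = math.perm(10, 3)

# Variations with repetition: sequences of ten coin throws.
coin_sequences = 2 ** 10

print(orderings_six_persons, numbers_from_digits, committees,
      dice_outcomes, podiums, coin_sequences)
```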

Statistics I
Exercises Topic 4. Probability.
Academic year 2020/21 — Answers

1. [Newbold et al.] A mail-order company considers three possible errors when handling an order: A =
“a wrong item is delivered”, B = “the item is lost in transit”, and C = “the item is delivered
damaged”. It is assumed that A is independent of B and of C, and that B and C are mutually
exclusive. It is known that P (A) = 0.02, P (B) = 0.01, and P (C) = 0.04. Calculate the probability
that at least one of the mentioned errors happens for a given order.

Answer(s): 0.069.
2. The following are four possible results of a stock index in two consecutive days:
O1 : the index goes up both days.
O2 : the index goes up the first day and does not go up the second.
O3 : the index does not go up the first day and goes up the second.
O4 : the index does not go up either day.
a) Determine the sample space corresponding to the random experiment “Observe whether the
index goes up or not in two consecutive days”.
b) Consider the following events:
A: “The index goes up the first day”.
B: “The index goes up the second day”.

Determine the intersection, the union and the complements of A and B.


c) Assuming that the elementary events are equiprobable, what is the probability that the index
goes up at least one day?

Answer(s):

a) Ω = {O1 , O2 , O3 , O4 }.
b) A ∩ B = {O1 }; A ∪ B = {O1 , O2 , O3 }; Ā = {O3 , O4 } and B̄ = {O2 , O4 }.
c) 0.75.

3. From the experience of an online clothes shopping portal, it has been observed that, on average,
every 1000 visits result in 10 big sales (over 500 €) and 100 small sales. We assume that all visits
have the same probability of resulting in a big sale, and the same for a small sale.

a) Indicate the sample space corresponding to the random experiment “observe the result of a
visit to the portal”.
b) What is the probability that a visit results in a big sale?
c) What is the probability that a visit results in a small sale?
d ) What is the probability that a visit results in a sale?

Answer(s):

a) Ω = {VP, VG, NV}, where VP = “small sale”, VG = “big sale” and NV = “no sale”.
b) 0.01.
c) 0.10.
d ) 0.11.

4. In a market research study, a mobile phone company observed that 75% of its clients wanted the
SMS functionality, 80% wanted the capability to take pictures, and 65% wanted both.

a) What is the probability that a client wants at least one of the two functionalities?
b) What is the probability that a client who wants SMSs also wants to be able to take pictures?
c) What is the probability that a client who wants to be able to take pictures also wants SMSs?

Answer(s):
a) 0.9.
b) 0.8667.
c) 0.8125.
5. In a Spanish oposición, the opositores must develop a topic they can choose among 3 drawn at
random from a total of 85 topics. An opositor has prepared only 35 topics. What is the probability
that (s)he has prepared at least one of the 3 topics in her/his draw?

Answer(s): 0.8016.
6. In order to estimate the audiences of a debate and a movie aired at nonoverlapping times, a TV
channel asked 2500 people whether they watched each of them: 2100 watched the movie, 1500 watched
the debate, and 350 watched neither of the two programs. If we choose at random one of the
people surveyed:

a) What is the probability that this person watched both the movie and the debate?
b) What is the probability that this person watched the movie, knowing that (s)he watched the
debate?
c) knowing that (s)he watched the movie, what is the probability that this person watched the
debate?

Answer(s):

a) 0.58.
b) 0.9667.
c) 0.6905.

7. According to a study, 38% of Madrid households have a monthly income over 2000 € and 37%
between 1000 € and 2000 €. On the other hand, the percentage of households owning a second
residence is 6.4% among those with incomes of not more than 1000 €, 12.57% among households
with incomes between 1000 € and 2000 €, and 23.4% among households with incomes over 2000 €.

a) Calculate the percentage of households owning a second residence.


b) If a household owns a second residence, what is the probability that its income is over 2000 €?
c) Calculate the probability that a household does not own a second residence and has an income
over 1000 €.
d) Among households with incomes not exceeding 2000 €, what percentage owns a second residence?

Answer(s):
a) 0.1514.
b) 0.5873.
c) 0.614571.
d ) 0.1008.

8. [Ross 2005] An insurance company classifies its clients in two groups, those who are accident-prone,
representing 20% of clients, and those who are not. Data indicate that, in a given year, 10% of
accident-prone clients have an accident, while only 5% of not accident-prone clients have it.

a) What is the probability that a new client has an accident during the first year?
b) If a new client has an accident during the first year, what is the probability that the client is
accident-prone?
c) If a new client does not have an accident during the first year, what is the probability that the
client is not accident-prone?

Answer(s):
a) 0.06.
b) 1/3.
c) 0.8085.
9. [Ross 2005] An inspector in charge of a criminal investigation has a certainty of 60% that a certain
suspect is guilty. A forensic analysis reveals that the criminal was left-handed. It is known that 20%
of the general population is left-handed.

a) What is the probability that the suspect is left-handed?


b) If the suspect turns out to be left-handed, what is the probability that (s)he is guilty?

Answer(s):
a) 0.68.
b) 0.8824.
10. (Exam, May 2016) The AROPE indicator is computed using several social vulnerability factors and
measures whether a given household is at risk of poverty or social exclusion. During 2014, an NGO
attended 1,200,000 households, of which 156,000 were not under AROPE. Among the households that
were not under AROPE, 84% were not over-indebted. On the other hand, among the
households that were under AROPE, 40% were also over-indebted.
(a) Calculate the number of households attended by the NGO that suffer over-indebtedness.
(b) Given that a household is not over-indebted, compute the probability that it is under AROPE.
(c) (Topic 5) A social worker can visit 20 households per day. Compute the probability that, in a
given day, 5 out of 20 households are not under AROPE.
(d) (Topic 5) Considering that 420 households can be visited per month, compute the probability
that at least 150 households suffer both AROPE and over-indebtedness.
Answer(s).

a) 0.3688 (as a proportion of the attended households, i.e., 0.3688 × 1,200,000 = 442,560 households)
b) 0.8270
c) 0.0713
d ) 0.6517

11. (Exam, June 2016) In a city of 3.5 million inhabitants there are three urban transport systems:
metro, bus and tram. On a typical working day, the number of travellers is 1,500,000 for the
metro, 750,000 for the bus and 450,000 for the tram. Moreover, it is known that 30% of metro
travellers also use the bus, 10% of metro travellers also use the tram and 5% of metro travellers
also use both bus and tram. Finally, 15% of bus travellers also use the tram. (Hint: An inhabitant
may or may not use urban transport.)
(a) Calculate the probability that, in a working day, an inhabitant uses only one of the three
transport systems.

(b) Calculate the probability that, in a working day, an inhabitant uses at least one transport
system.
(c) When only one transport system is used, there is a 2% probability of having a delay of more than
5 minutes in a working day. However, the probability of having such a delay rises to 7% when
combining more than one transport system in a working day. Calculate the probability that an
inhabitant suffers a delay of more than 5 minutes in a working day.
(d) With the same information as in part (c) and given that a traveller suffered a delay of more than
5 minutes, calculate the probability that this traveller took more than one transport system.

Answer(s).
a) 0.4286
b) 0.5893
c) 0.0198
d ) 0.5681

Topic 5. Probability models
• Random variables: concept

• Discrete random variables:
o Probability function and distribution function
o Mean and variance of a discrete r.v.
• Continuous random variables:
o Density function and distribution function
o Mean and variance of a continuous r.v.
• Probability models:
o Discrete probability models: Bernoulli, Binomial and Poisson.
o Continuous probability models: Uniform, exponential and normal.
o The Central Limit Theorem and applications (Normal approximation to the Binomial)
Topic 5
Random variables: concept



• Let Ω be the sample space for a random experiment
• A random variable (r.v.) is a function X : Ω −→ R that assigns to each element ω ∈ Ω a
number X(ω) ∈ R
• Intuitively, a r.v. X is a model of a random experiment, given as a number X(ω) that
varies according to the outcome ω obtained
• A r.v. is written in upper case (X), while lower case (x) indicates concrete values that the r.v.
can take when evaluated in a sample point (ω)
• OBS: The statistical variables seen in Topics 1, 2 and 3 can be modelled as r.v.
Discrete r.v.: If X takes values in a finite or countably infinite set S ⊆ R, we say that X is a discrete
r.v.
Continuous r.v.: If X takes values in an uncountably infinite set S ⊆ R (e.g., an interval or a union
of intervals of R), we say that X is a continuous r.v.
Examples
• X = “Outcome of a die throw” is a discrete r.v. with S = {1, 2, 3, 4, 5, 6}
• Y = “number of cars crossing a certain toll in a week” is a discrete r.v. with S = {0, 1, 2, . . .}
= N ∪ {0}
• Z = “height (in cm) of a randomly chosen student” is a continuous r.v. with S = [0, ∞)

Discrete random variables: Probability function


Let X be a discrete r.v. with values x ∈ S. Its probability (or mass) function assigns to each
possible value of X its probability: px = P{X = x} for x ∈ S. To understand the behaviour
of a discrete r.v., the probability function is usually preferred to the distribution function.
Example
X = outcome of throwing a fair die. The probability function is
In this case, S = {1, 2, 3, 4, 5, 6} and p1 = · · · = p6 = 1/6

Properties
Let X be a discrete r.v. taking values in the set S with probabilities px = P{X = x} for x ∈ S. Then

Properties example:
A game consists in trying to throw 3 rings successively onto a stick. Participating costs 3 euros. Prizes
are 4 euros for one success, 6 euros for two and 30 euros for three. We assume that the probability of
inserting a ring is 0.1 in each throw, and that the outcomes are independent.
We define the r.v. X as the net gain in the game. The sample space is
Ω = {(f, f, f), (a, f, f), (f, a, f), (f, f, a), (a, a, f), (a, f, a), (f, a, a), (a, a, a)}
where a denotes success and f failure. Hence, X only admits four possible values, with the
following probabilities:



X ∈ {−3, 1, 3, 27}
Y = number of rings inserted; Ii = “the i-th ring is inserted”
P{X = −3} = P{Y = 0} = P(I̅1 ∩ I̅2 ∩ I̅3) = P(I̅1) P(I̅2) P(I̅3) = 0.9^3 = 0.729
P{X = 1} = 3 × 0.1 × 0.9^2 = 0.243
P{X = 3} = 3 × 0.1^2 × 0.9 = 0.027
P{X = 27} = 0.1^3 = 0.001

x     −3      1       3       27
px    0.729   0.243   0.027   0.001

What is the probability of earning at least 3 euros, net of the 3 euros for participating?
P{X ≥ 3} = P{X = 3}+P{X = 27} = 0.027+0.001 = 0.028
What is the probability of not losing money?
P{X ≥ 0} = P{X = 1} + P{X = 3} + P{X = 27} = 0.243 + 0.027 + 0.001 = 0.271
or, equivalently,
P{X ≥ 0} = 1−P{X < 0} = 1−P{X = −3} = 1−0.729 = 0.271
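The probability function of this game can also be obtained by enumerating the eight outcomes of the sample space. A minimal sketch under the assumptions stated in the example (success probability 0.1, independent throws); the names are illustrative.

```python
from itertools import product

p_success = 0.1
prize = {0: 0, 1: 4, 2: 6, 3: 30}   # prize in euros by number of successes
cost = 3                            # euros paid to participate

pmf = {}   # maps each net gain x to P{X = x}
for outcome in product("af", repeat=3):   # a = success, f = failure
    successes = outcome.count("a")
    prob = p_success ** successes * (1 - p_success) ** (3 - successes)
    gain = prize[successes] - cost
    pmf[gain] = pmf.get(gain, 0.0) + prob

p_at_least_3 = sum(p for x, p in pmf.items() if x >= 3)
p_not_losing = sum(p for x, p in pmf.items() if x >= 0)

print({x: round(p, 3) for x, p in sorted(pmf.items())})
print(round(p_at_least_3, 3), round(p_not_losing, 3))
```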

Discrete random variables: Distribution function


The distribution function or cumulative probability function of an r.v. X is the function F : R →
[0, 1] that assigns to each value x ∈ R the probability F(x) = P{X ≤ x}
OBS: It is defined for every x ∈ R, not only for x ∈ S
• Properties:
o 0 ≤ F(x) ≤ 1 for every x ∈ R.
o F(y) = 0 for every y < min S. Hence, F(−∞) = 0.
o F(y) = 1 for every y > max S. Hence, F(∞) = 1.
o If x < y, then F(x) ≤ F(y), i.e., F(x) is nondecreasing

o For any a, b ∈ R, P{a < X ≤ b} = P{X ≤ b} − P{X ≤ a} = F(b) − F(a).
Example:
The cumulative probabilities of the r.v. X in the game example at the points of S are

x      −3      1                       3                       27
F(x)   0.729   0.729 + 0.243 = 0.972   0.972 + 0.027 = 0.999   0.999 + 0.001 = 1

Its distribution function is
F(x) = 0 if x < −3; 0.729 if −3 ≤ x < 1; 0.972 if 1 ≤ x < 3; 0.999 if 3 ≤ x < 27; 1 if x ≥ 27

Note that F(x) is piecewise constant with jump discontinuities at points in S. The jump at x ∈ S
has magnitude P{X = x}
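The step shape of F can be seen by evaluating it at points inside and outside S. The sketch below builds the distribution function of the game example from its probability function; the function name `F` follows the notes, the rest is illustrative.

```python
from bisect import bisect_right
from itertools import accumulate

values = [-3, 1, 3, 27]                  # the set S, in increasing order
probs = [0.729, 0.243, 0.027, 0.001]     # P{X = x} for each x in S
cumulative = list(accumulate(probs))     # running sums of the probabilities

def F(x):
    """F(x) = P{X <= x}: sum of the probabilities of the values not exceeding x."""
    k = bisect_right(values, x)          # number of points of S that are <= x
    return cumulative[k - 1] if k > 0 else 0.0

for x in (-5, -3, 0, 1, 2, 3, 27, 100):
    print(x, round(F(x), 3))
```

Between two consecutive points of S the function stays constant, and F(3) − F(1) recovers the jump P{X = 3}.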

Discrete random variables: Expectation (mean)


Let X be a discrete r.v. taking values in S with probabilities px = P{X = x}. The expectation (mean)
of X is E[X] = ∑x∈S x px = ∑x∈S x P{X = x}

Properties

Example
x     −3      1       3       27
px    0.729   0.243   0.027   0.001

The expectation of the r.v. X in the game example is


E[X] = ∑x∈S xP{X = x} =−3 × P{X = −3} + 1 × P{X = 1} + 3 × P{X = 3} +27 × P{X = 27} = −3 × 0.729
+ 1 × 0.243 + 3 × 0.027 + 27 × 0.001 = −1.836
Therefore, the expected (mean) net gain is −1.836 euros

Discrete random variables: Variance
The variance of the discrete r.v. X is V[X] = E[(X − E[X])^2] = ∑x∈S (x − E[X])^2 px

The square root of the variance is the standard deviation, denoted by S[X] = √ V[X]
Properties



(The shortcut V[X] = E[X^2] − E[X]^2 is analogous to the sample formula ∑xi^2*fi − mean^2)

Example:
x     −3      1       3       27
px    0.729   0.243   0.027   0.001

E[X] = −3 × 0.729 + 1 × 0.243 + 3 × 0.027 + 27 × 0.001 = −1.836

V[X] = E[(X − E[X])^2] = (−3 − (−1.836))^2 × 0.729 + (1 − (−1.836))^2 × 0.243 + (3 − (−1.836))^2 × 0.027 +
(27 − (−1.836))^2 × 0.001 = 4.405 euros^2
V[X] = E[X^2] − E[X]^2 = (−3)^2 × 0.729 + 1^2 × 0.243 + 3^2 × 0.027 + 27^2 × 0.001 − (−1.836)^2 =
4.405
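Both ways of computing the variance can be checked numerically for the game example. A short sketch; the variable names are illustrative.

```python
values = [-3, 1, 3, 27]
probs = [0.729, 0.243, 0.027, 0.001]

# Expectation: E[X] = sum of x * P{X = x}.
mean = sum(x * p for x, p in zip(values, probs))

# Variance from the definition: E[(X - E[X])^2].
var_definition = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Variance from the shortcut: E[X^2] - E[X]^2.
second_moment = sum(x ** 2 * p for x, p in zip(values, probs))
var_shortcut = second_moment - mean ** 2

std_dev = var_definition ** 0.5

print(round(mean, 3), round(var_definition, 3), round(var_shortcut, 3), round(std_dev, 3))
```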

Let X be the r.v. giving the number of heads minus the number of tails in 3 tosses of a loaded coin,
in which heads is twice as likely as tails (so P(heads) = 2/3 and P(tails) = 1/3).
We denote by “c” = {heads} and by “+” = {tails}.
The sample space is

We participate in a gamble where we have to pay 6 euros upfront. If, when tossing the coin of the
previous example 3 times, we get 1 tail, we earn 4 euros; if we get 2 tails we earn 6 euros; and if we
get 3 tails we earn 30 euros. What is the mean net gain?
Let Y be the r.v. “net gain in the gamble”. We have:
• If we don’t get any tails, X = 3, so Y = −6 with probability P{Y = −6} = P{X = 3} = 8/27
• If we get one tail, X = 1, so Y = −2 with probability P{Y = −2} = P{X = 1} = 4/9
• If we get two tails, X = −1, so Y = 0 with probability P{Y = 0} = P{X = −1} = 2/9
• If we get three tails, X = −3, so Y = 24 with probability P{Y = 24} = P{X = −3} = 1/27
Hence, Y takes values in the set S = {−6, −2, 0, 24}. The mean net gain is
E[Y] = −6 × 8/27 − 2 × 4/9 + 0 × 2/9 + 24 × 1/27 = −1.78 euros
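The probabilities of X and the mean of Y can be verified by enumerating the eight possible sequences of tosses with exact fractions. This is a sketch under the assumptions of the example (P(heads) = 2/3, independent tosses); the names are illustrative.

```python
from fractions import Fraction
from itertools import product

p = {"c": Fraction(2, 3), "+": Fraction(1, 3)}   # c = heads, + = tails
net_gain_by_tails = {0: -6, 1: -2, 2: 0, 3: 24}  # prize minus the 6 euros paid

pmf_X, pmf_Y = {}, {}
for tosses in product("c+", repeat=3):
    prob = p[tosses[0]] * p[tosses[1]] * p[tosses[2]]
    heads, tails = tosses.count("c"), tosses.count("+")
    x = heads - tails                 # value of X for this sequence
    y = net_gain_by_tails[tails]      # value of Y for this sequence
    pmf_X[x] = pmf_X.get(x, 0) + prob
    pmf_Y[y] = pmf_Y.get(y, 0) + prob

mean_Y = sum(y * q for y, q in pmf_Y.items())

print(pmf_X)
print(mean_Y, float(mean_Y))
```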

Continuous random variables


Continuous random variables: Distribution function
The distribution function of a continuous r.v. X is F(x) = P{X ≤ x}, for x ∈ R. To understand the
behaviour of a continuous random variable we use the distribution function.

As in the discrete case, the function F(x) gives the cumulative probabilities until the point x ∈ R, but
now it is a continuous function
Properties
• 0 ≤ F(x) ≤ 1 for every x ∈ R
• F(−∞) = 0.
• F(∞) = 1.

• If x < y then F(x) ≤ F(y), i.e., F(x) is nondecreasing.
• For a < b, P(a < X ≤ b) = P(a ≤ X ≤ b) = F(b) − F(a).
• F(x) is continuous.
The probability function does not make sense in continuous r.v., because P(X = x) = 0. Instead, we
shall use the density function.

Continuous random variables: Density function

For a continuous r.v. X with distribution function F(x), the density function of X is the derivative of
the distribution function. The density is not a probability but is related to probability when you
integrate it (second property).

f (x) = d/dx F(x) = F’(x)
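To see how the density and the distribution function relate, the following sketch uses a hypothetical r.v. with F(x) = x^2 on [0, 1] (an assumption made only for illustration, not the example of these notes), so that f(x) = 2x on [0, 1]. The integration helper is a plain midpoint rule written here for the purpose.

```python
def F(x):
    """Hypothetical distribution function: F(x) = x^2 on [0, 1]."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x * x

def f(x):
    """Its density, the derivative of F: f(x) = 2x on [0, 1] and 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

total_area = integrate(f, 0, 1)        # should be close to 1
p_interval = integrate(f, 0.2, 0.5)    # P(0.2 < X <= 0.5)

# Numerical derivative of F at 0.6, to compare with f(0.6).
h = 1e-6
slope = (F(0.6 + h) - F(0.6 - h)) / (2 * h)

print(round(total_area, 6), round(p_interval, 6), round(slope, 4))
```

Note that this density takes values up to 2, which illustrates that a density is not itself a probability; only its integrals are.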

Properties



Example:

1. To check if it is a density function, we:
• Plot the function and check that it is nonnegative, f(x) ≥ 0 (in this example its values lie between 0 and 1,
although in general a density may exceed 1)
• Check that it integrates to 1 over R: ∫ f(x) dx = 1
Continuous random variables: Expectation (mean)
Let X be a continuous r.v. taking values in S ⊆ R, with density function f (x). Then, the expectation
(mean) of X is E[X] = ∫S x f(x) dx


The same properties of the expectation of a discrete r.v. hold
Example
The expectation (mean) of the r.v. X of the previous example is

Continuous random variables: Variance


The variance of the continuous r.v. X is V[X] = E[(X − E[X])^2] = ∫S (x − E[X])^2 f(x) dx

The square root of the variance is the standard deviation, denoted by S[X] = √V[X].
The same properties as for the variance of a discrete r.v. hold
Example


Probability models
• Discrete probability models: Bernoulli, Binomial and Poisson
• Continuous probability models: Uniform, exponential and normal
• Central Limit Theorem

The Bernoulli model


Description

Consider a random experiment with two possible outcomes, which we call “success” and “failure”.
Define the r.v. X = 1 if the outcome is a success, and X = 0 if it is a failure.
Let p be the probability of success (so 1 − p is the probability of failure). The
experiment is called a Bernoulli trial and we say that the r.v. X has a Bernoulli
distribution with parameter p.
We write X ∼ Ber(p).
Example: We toss a fair coin and obtain

Then X ∼ Ber(1/2)
Example: An airline considers that passengers who buy a ticket for a certain flight have a probability
of 0.05 of not showing up. Define, for a randomly chosen passenger who buys a ticket for that flight,
Y = 1 if the passenger shows up and Y = 0 otherwise.

Then Y ∼ Ber(0.95)

The Binomial model


Description
A Bernoulli trial with parameter p is repeated n (fixed) times with trials being mutually independent.
The r.v. total number of successes obtained follows a Binomial distribution with parameters n and
p.
Definition:
A r.v. X follows a Binomial distribution with parameters n and p if

P{X = x} = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, …, n,

where C(n, x) = n!/(x!(n − x)!). We write X ∼ B(n, p).

Example

The airline of the previous example has sold 80 tickets for the flight. The probability that a passenger
does not show up is 0.05. Define X = number of passengers showing up. Then (assuming
independence)
X ∼ B(80, 0.95)
The probability that all passengers show up is

P{X = 80} = 0.95^80 ≈ 0.0165

The probability that at least one passenger does not show up is


P{X < 80} = 1 − P{X = 80} = 1 − 0.0165 = 0.9835
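
The two probabilities of the airline example can be checked numerically. The following is a minimal Python sketch using only the standard library; the helper name binom_pmf is an illustrative choice, not something defined in these notes.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """Probability function P{X = x} of X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Airline example: X ~ B(80, 0.95) counts the passengers who show up.
p_all_show = binom_pmf(80, 80, 0.95)   # P{X = 80}
p_some_missing = 1 - p_all_show        # P{X < 80}

print(round(p_all_show, 4), round(p_some_missing, 4))
```

The same helper evaluates any other binomial probability, for instance binom_pmf(2, 4, 0.5) for two girls among four children.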



Another example:
Families with 4 children. Probability that exactly 2 of them are girls.
X = number of girls, X ∼ B(n, p) with n = 4, p = 1/2
P{X = 2} = C(4, 2) (1/2)^2 (1/2)^2 = 6/16 = 0.375
Properties
E[X] = np and V[X] = np(1 − p)

The Poisson model


Definition:
A r.v. X follows a Poisson distribution with parameter (rate) λ (events per unit time) if

P{X = x} = e^(−λ) λ^x / x!, x = 0, 1, 2, …,

where e is the base of natural logarithms.


We write X ∼ P(λ)
Description
We use the Poisson distribution for modelling the number of random events of a certain type that
happen in a given time interval (there is no maximum quantity), assuming they occur
homogeneously over time, with λ being the mean number of events in the interval. It is
convenient to specify the unit of measure of time (minutes, hours, days, weeks, etc.)
Example
Over the years it has been observed that in a certain road there are, on average, 25 accidents per
year. We assume that X, the yearly number of accidents in such a road, follows a Poisson
distribution,
X ∼ P(25)
The probability that in a given year there are 25 accidents is

The probability that there are 20 accidents or less is

In the previous example, consider Y, the number of accidents on that road over two consecutive
years.
The distribution of the r.v. Y is Y ∼ P(2 × 25) = P(50)



The probability that over two years there are 50 accidents is

The probability that there are no more than 40 accidents is

Properties

Example:

Computing the median

The uniform distribution


Description: The uniform distribution is such that all intervals of equal length are equally likely.
Hence, its density function is constant over the range of possible values it can take.
Definition: A r.v. X follows a uniform distribution in the interval [a, b] if its density function is

f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise

We write X ∼ U[a, b].

The uniform distribution: distribution function


Suppose that X ∼ U[a, b]. Then its distribution function is

F(x) = 0 if x < a, F(x) = (x − a)/(b − a) if a ≤ x ≤ b, and F(x) = 1 if x > b

Properties

Example: Distribution U[3, 5]



Distribution function

The exponential distribution
Description: The exponential distribution is used for modelling the time elapsed until a certain
random event happens. It is important to specify the time units (seconds, minutes, hours, etc.)
Definition: We say that a r.v. X follows an exponential distribution with parameter (rate) λ (events
per unit time) if its density function is

f(x) = λ e^(−λx) if x ≥ 0, and f(x) = 0 if x < 0

We write X ∼ Exp(λ)
The exponential distribution: distribution function
Assume that X ∼ Exp(λ). Its distribution function is

F(x) = 1 − e^(−λx) if x ≥ 0, and F(x) = 0 if x < 0

Properties

Relation between the exponential and Poisson distributions

Example: the exponential distribution


The Normal distribution


Description: The normal distribution is a theoretical model that approximates well many real-world
random quantities. A good part of statistical inference is based on the normal and related
distributions
Definition: A r.v. X follows a normal (or Gaussian) distribution with parameters µ (mean) and σ
(standard deviation), denoted by X ∼ N(µ, σ), if its density function is

f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), for all real x

Properties

Density function for 3 different values of µ and σ



Property
If X ∼ N(µ, σ), then
• P(µ − σ < X < µ + σ) ≈ 0.683
• P(µ − 2σ < X < µ + 2σ) ≈ 0.955
• P(µ − 3σ < X < µ + 3σ) ≈ 0.997
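
This property does not depend on the particular values of µ and σ. As a quick check, the sketch below uses statistics.NormalDist from the Python standard library; the helper prob_within and the values µ = 2, σ = 3 are illustrative assumptions.

```python
from statistics import NormalDist

def prob_within(k: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The result is the same for any mu and sigma; here mu = 2, sigma = 3.
for k in (1, 2, 3):
    print(k, prob_within(k, mu=2, sigma=3))
```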

Table N(0,1)

Example:
Let Z ∼ N(0, 1). Let us calculate some probabilities:
• P(Z < 1.5) = 0.9332. (table)
• P(Z > −1.5) = P(Z < 1.5) = 0.9332. (why?)

• P(Z < −1.5) = P(Z > 1.5) = 1 − P(Z < 1.5) = 1 − 0.9332 = 0.0668. (why not ≤?)
• P(−1.5 < Z < 1.5) = P(Z < 1.5) − P(Z < −1.5) = 0.9332 − 0.0668 = 0.8664.
Let X ∼ N(µ = 2, σ = 3). We want to calculate P(X < 4) and P(−1 < X < 3.5):
• First, we standardize the original r.v. as follows:

Z = (X − µ)/σ = (X − 2)/3, where Z ∼ N(0, 1)

• Then,

P(X < 4) = P(Z < (4 − 2)/3) ≈ P(Z < 0.67) = 0.7486
P(−1 < X < 3.5) = P((−1 − 2)/3 < Z < (3.5 − 2)/3) = P(−1 < Z < 0.5) = 0.6915 − 0.1587 = 0.5328

where Z ∼ N(0, 1)
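
The table look-ups and the standardization step can be reproduced in software. The following is a minimal sketch with statistics.NormalDist (Python standard library); the variable names are illustrative. Values computed this way may differ from table values in the third decimal, because the table rounds z to two decimals.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)
p1 = Z.cdf(1.5)                  # P(Z < 1.5)
p2 = 1 - Z.cdf(-1.5)             # P(Z > -1.5), equal to p1 by symmetry
p3 = Z.cdf(-1.5)                 # P(Z < -1.5)
p4 = Z.cdf(1.5) - Z.cdf(-1.5)    # P(-1.5 < Z < 1.5)

X = NormalDist(mu=2, sigma=3)
q1 = X.cdf(4)                    # P(X < 4)
q2 = X.cdf(3.5) - X.cdf(-1)      # P(-1 < X < 3.5)

# Standardizing by hand gives the same number as working with X directly.
q1_standardized = Z.cdf((4 - 2) / 3)

print(p1, p2, p3, p4)
print(q1, q2, q1_standardized)
```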

Another example:
It is hard to label packed meat with the exact weight due to loss of liquid effects (measured as a
percentage of the meat’s true weight). Assume that the loss of liquid in a package of chicken breast
is normally distributed with mean 4 % and standard deviation 1 %
Let X be the loss of liquid in a randomly chosen package
• What is the probability that 3 % < X < 5 %?
• What is the value of x for which 90 % of packages have a loss of liquid less than x?
• In a sample of 4 packages, find the probability that all have a loss of liquid between 3 % and
5 %.

Standardizing with Z = (X − 4)/1 ∼ N(0, 1), we have P(3 < X < 5) = P(−1 < Z < 1) = 0.6827.
From the tables, we have (x − 4)/1 ≈ 1.28, which implies that 90 % of packages have a loss of liquid less
than x = 5.28 %.
For a package, let p = P(3 < X < 5) = 0.6827. Let Y be the number of packages in the
sample of 4 packages having a loss of liquid between 3 % and 5 %. We have Y ∼ B(4, 0.6827), and

P{Y = 4} = 0.6827^4 ≈ 0.2172

If the sample were of 5 packages, what would be the probability that at least one would have losses
between 3 % and 5 %? We have n = 5 and p = 0.6827. Therefore, Y ∼ B(5, 0.6827). Then,

P{Y ≥ 1} = 1 − P{Y = 0} = 1 − (1 − 0.6827)^5 ≈ 0.9968
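
The questions of this example combine a normal probability, a normal quantile and two binomial probabilities. A short Python sketch (standard library only, variable names chosen for illustration) that evaluates them:

```python
from statistics import NormalDist

loss = NormalDist(mu=4, sigma=1)        # loss of liquid, in percent

p = loss.cdf(5) - loss.cdf(3)           # P(3 < X < 5)
x90 = loss.inv_cdf(0.90)                # value x with P(X < x) = 0.90

# Y = number of packages with loss between 3 % and 5 %, Y ~ B(n, p).
p_all_of_4 = p ** 4                     # P{Y = 4} with n = 4
p_at_least_one_of_5 = 1 - (1 - p) ** 5  # P{Y >= 1} with n = 5

print(p, x90, p_all_of_4, p_at_least_one_of_5)
```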



The Central Limit Theorem (CLT)

This result refers to the limiting distribution of the sample mean X̄ = (X1 + · · · + Xn)/n of n independent
and identically distributed (i.i.d.) r.v. X1, …, Xn with finite mean µ and standard deviation σ. It says
that, for large n, the distribution of X̄ is approximately normal, whatever the distribution of the Xi:

X̄ ≈ N(µ, σ/√n)

An application of the CLT: Normal approximation to the Binomial distribution
Binomial
If X ∼ B(n, p) with n large enough (either n ≥ 30 and 0.1 ≤ p ≤ 0.9, or np ≥ 5 and n(1 − p) ≥ 5), then

X ≈ N(np, √(np(1 − p)))

CLT and approximations: Example



Let X ∼ B(100, 1/3). We want to approximate the value of P{X < 40}, whose exact calculation by hand is
very cumbersome.
Using the CLT we have X ∼ B(100, 1/3) ≈ N(33.33, 4.714), since np = 100/3 ≈ 33.33 and
√(np(1 − p)) = √(200/9) ≈ 4.714

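
With a computer the exact binomial sum is no longer cumbersome, so the quality of the normal approximation can be inspected. The sketch below (Python standard library) compares the exact value of P{X < 40} with the CLT approximation; the continuity-corrected variant is an optional refinement that is not part of these notes and is included only for comparison.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 1 / 3
mu, sigma = n * p, sqrt(n * p * (1 - p))   # 33.33 and 4.714

# Exact: P{X < 40} = P{X <= 39}, summing the binomial probability function.
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(40))

approx_dist = NormalDist(mu, sigma)
approx = approx_dist.cdf(40)               # plain CLT approximation
approx_cc = approx_dist.cdf(39.5)          # with continuity correction (assumption)

print(exact, approx, approx_cc)
```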
Question 1

Given a collection of independent r.v.s Xi (i=1,…,n) that follow Poisson distributions with the same
parameter λ, then Y = X1 + ⋯ + Xn also follows a Poisson distribution with parameter λ/n.

Y follows a Poisson distribution with parameter nλ. Note that E[Xi]= λ and that implies E[Y]= nλ

The correct answer is 'False'.

Question 2

The variance of a discrete r.v. X cannot be larger than E[X^2].

The variance of any r.v. X (if it exists) must satisfy Var(X) = E[X^2] − (E[X])^2 ≤ E[X^2]

The correct answer is 'True'.

Question 3

Let X and Y be two independent r.v.s, both of them with means equal to 1 and variances equal to 2.
The random variable W = Y − X has mean equal to 0 and standard deviation equal to 2.

If X and Y are normal, W also has a normal distribution, as it is a linear combination of normal r.v.s.
In any case, its mean is E[W] = E[Y] − E[X] = 1 − 1 = 0 and, by independence, its variance is
Var(W) = Var(Y) + Var(X) = 2 + 2 = 4, implying that σW = √4 = 2


The correct answer is 'True'.

Question 4

A r.v. X is defined as the number of times you get a prize after 10 (independent) draws in a lottery.
Another r.v. Y is defined as the number of times you get a prize after 5 additional (independent)
draws. If p is the probability of getting a prize in one draw, the variance of X + Y takes the
value 15p(1−p).

X+Y follows a binomial distribution with parameters n=15 and p, and its variance is given
by Var(X+Y) = np(1−p).

The correct answer is 'True'.

Statistics I

Exercises. Topic 5: Probability models.

Academic year 2020/21


- Answers

1. The random variable X = number of children in a randomly chosen family from a certain city has the following probability
function:

X P(X=x)
0 0.47
1 0.30
2 0.10
3 0.06
4 0.04
5 0.02
6 0.01

Answer the following questions:

(a) Calculate and interpret the mean of X.


(b) Calculate and interpret the variance and standard deviation of X.
(c) If the city council pays 1000 euros per child, what does the random variable Y = 1000X represent? What is its
probability function?

(d) Calculate and interpret the mean and standard deviation of Y.


(e) Answer the two previous questions if the subsidy per family is Z = 350X², where X is the number of children.

Answer(s).

(a) Mean: µ = E(X) = 1

(b) Variance: σ² = E(X²) − µ² = 2.74 − 1 = 1.74 and σ = √1.74 = 1.319.

(c)
Y P(Y=y)
0 0.47
1000 0.30
2000 0.10
3000 0.06
4000 0.04
5000 0.02
6000 0.01

(d) µY = 1000µ = 1000 euros and σY = 1000σ = 1319 euros

(e)
Z P(Z=z)
0 0.47
350 0.30
1400 0.10
3150 0.06
5600 0.04
8750 0.02
12600 0.01

E(Z) = 959 euros, V(Z) = 4281669 euros² and SD(Z) = √V(Z) = 2069.219 euros.
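
The moments in this exercise all follow from the probability function of X. A minimal Python sketch (standard library only; the helper mean_var is an illustrative name) that computes the mean and variance of X, Y = 1000X and Z = 350X²:

```python
from math import sqrt

# Probability function of X = number of children, as given in the exercise.
pmf = {0: 0.47, 1: 0.30, 2: 0.10, 3: 0.06, 4: 0.04, 5: 0.02, 6: 0.01}

def mean_var(transform):
    """Mean and variance of transform(X), using V = E[g(X)^2] - (E[g(X)])^2."""
    m = sum(transform(x) * p for x, p in pmf.items())
    m2 = sum(transform(x) ** 2 * p for x, p in pmf.items())
    return m, m2 - m**2

mean_x, var_x = mean_var(lambda x: x)
mean_y, var_y = mean_var(lambda x: 1000 * x)
mean_z, var_z = mean_var(lambda x: 350 * x**2)

print(mean_x, var_x, sqrt(var_x))
print(mean_y, sqrt(var_y))
print(mean_z, var_z, sqrt(var_z))
```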

2. The following figure shows the density function of a continuous r.v. X.

(a) Calculate the probability that X is less than 1 using the graph.

(b) Calculate the probability that X is greater than 0.5 and less than 3/2, arguing analytically.

(c) Calculate the mean of X.


(d) Calculate the variance of X.

Answer(s). The density function can be written as

f(x) = 1 − x/2 if x ∈ (0, 2),
f(x) = 0 if x ∉ (0, 2).

(a) P(X < 1) = 0.5 + 0.25 = 0.75.

(b) P[0.5 < X < 1.5] = 1/2
(c) µ = 0.667
(d) σ² = 2/9 = 0.22

3. The length in minutes of a phone call to a certain customer service is a continuous random variable with distribution
function

F(x) = 0 if x ≤ 0,
F(x) = 1 − (2/3) e^(−2x/3) − (1/3) e^(−x/3) if x > 0.

It is known that calls lasting more than 6 minutes receive a very low satisfaction rating, while those lasting less than 3
minutes receive a very high rating:

(a) Calculate the probability that the duration of a call lies between 3 and 6 minutes.

(b) Calculate the probability that the duration of a call is over 6 minutes.

(c) Knowing that an ongoing call has already lasted 3 minutes, what is the probability that it will be shorter than 6
minutes?

Answer(s).

(a) P(3 < X < 6) = 0.1552
(b) P(X > 6) = 0.0574
(c) P(X < 6 | X > 3) = 0.7292

4. In each of the following situations, say if the random variable X defined can be modeled as a binomial distribution. In
such a case, identify the parameter values n and p:

(a) We roll a die 100 times and X is the total number of 1's obtained.

(b) We draw one card from a deck of 52 cards and check if it is an ace. We repeat this procedure 10 times, without
putting the card back after each draw. X is the total number of aces drawn.

(c) 2% of oranges shipped in a certain place are rotten. Oranges are placed in bags of 10 units. We randomly select one
bag and take X to be the number of rotten oranges in the bag.

(d) In a box there are 2 red balls, 3 white balls and 2 green balls. We draw one ball at random, write down its colour and
put it back in the box. We repeat this procedure 10 times and take X to be the total number of white balls drawn.

(e) In a box there are 2 red balls, 3 are white and 2 are green. We select one ball at random, report its color and return
it to the box. We repeat this process 10 times and count the number of balls of each colour.

Answer(s).

(a) Bin (100, 1/6).


(b) It is not a binomial distribution.

(c) Bin (10, 0.02).


(d) Bin (10, 3/7).

(e) It is not a binomial distribution.

5. (May 2017 exam) A firm has designed the following campaign to advertise a product on a global scale by massive email
sending: The firm will send one hundred thousand emails to potential customers unrelated among themselves offering the
product, which yields a profit of 70 € per unit sold. The firm assumes that, on average, one out of ten people receiving
the email will purchase the product. Answer the following questions, providing adequate justification.

(a) Specify a probabilistic model for the random variable Y, which models the profit that will be obtained with the
campaign.

(b) Calculate the mean, variance and standard deviation of the profit Y.
(c) Calculate (exactly or approximately) the probability that the profit Y exceeds 712000 €.

Solution.

(a) Let X = number of units sold. X ∼ B(n = 100000, p = 1/10) and Y = 70X.

(b) E[Y] = 70E[X] = 70np = 700000 €, V[Y] = 70² V[X] = 70² np(1 − p) = 441 × 10^5 €², and √V[Y] = 6640.78 €.

(c) Since n is large, np ≥ 5 and n(1 − p) ≥ 5, by the CLT:
P{Y > 712000} = P{(Y − 700000)/6640.78 > (712000 − 700000)/6640.78} ≈ P{Z > 1.81} ≈ 0.0351.
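
The numbers in this solution can be reproduced with a few lines of Python (standard library only). This is a sketch under the same binomial model and normal approximation; the variable names are illustrative. The computed probability differs slightly from the table value because z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

n, p, profit_per_unit = 100_000, 0.1, 70

mean_y = profit_per_unit * n * p                  # E[Y] = 70 n p
var_y = profit_per_unit**2 * n * p * (1 - p)      # V[Y] = 70^2 n p (1 - p)
sd_y = sqrt(var_y)

# Normal approximation (CLT) to P{Y > 712000}.
prob = 1 - NormalDist(mean_y, sd_y).cdf(712_000)

print(mean_y, var_y, sd_y, prob)
```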

6. A bank offers a deposit of 6000 euros with full liquidity. It has 25 subscribed deposits. If the probability that a customer
requests a full refund in a given day is 0.01, and refund requests are independent, how much money should the bank reserve
to ensure that refund requests in a given day are honored with a probability of at least 99%?

Answer(s). P99 = 2 and therefore the bank should reserve 12000 euros.
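
The 99th percentile used in this answer is the smallest k such that P{X ≤ k} ≥ 0.99, where X ∼ B(25, 0.01) is the number of refund requests in a day. A possible way to find it in Python (standard library only; names are illustrative):

```python
from math import comb

n, p, deposit = 25, 0.01, 6000

def binom_cdf(k: int) -> float:
    """P{X <= k} for X ~ B(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

# Smallest number of refunds k that covers at least 99 % of the days.
k = 0
while binom_cdf(k) < 0.99:
    k += 1

reserve = k * deposit
print(k, reserve)
```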

7. A company rents a computer for periods of t hours, charging for it 600 euros per hour. The number of times the computer
breaks down in an hour is a random variable with a Poisson distribution and failure rate λ = 0.08 per hour. If the computer
breaks down x times in the t hours, the company must pay 50x² to fix it. Calculate the expected benefit of the company
as a function of t. For what value of t does the company obtain the maximum expected benefit?

Answer(s). The number of breakdowns in t hours is X ∼ P(0.08t), so E(X²) = 0.08t + (0.08t)² and
E(B) = 600t − 50E(X²) = 596t − 0.32t². Thus, the maximum expected benefit is attained at t = 931.25 hours (38.8 days).

8. An insurance company receives on average 3 claims for accidents per day. It is assumed that the number of claims in a
random day follows a Poisson distribution.

a) Calculate the probability that more than two claims are received in two consecutive days.

b) If the amount to be paid for a random claim follows an exponential distribution with mean 500 euros, determine its
distribution function. Calculate the probability that the company has to pay for a claim more than 1200 euros.

c) Calculate the median of the amount to be paid for a random claim.

Answer(s).

a) P{Y > 2} = 0.9380
b) F(x) = 0 if x < 0, and F(x) = 1 − exp(−x/500) if x ≥ 0.
P{C > 1200} = 0.091
c) Md = 346.5736
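
These three answers can be verified numerically. The sketch below (Python standard library) assumes, as in the exercise, that the number of claims in two days is Poisson with mean 2 × 3 = 6 and that the claim amount is exponential with mean 500; the names are illustrative.

```python
from math import exp, factorial, log

# a) Claims in two consecutive days: Y ~ P(6).
lam = 2 * 3
p_more_than_two = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(3))

# b) Claim amount C ~ Exp(1/500), with distribution function F.
mean_claim = 500

def F(x: float) -> float:
    """Distribution function of the claim amount."""
    return 1 - exp(-x / mean_claim) if x >= 0 else 0.0

p_over_1200 = 1 - F(1200)

# c) Median: solve F(m) = 0.5, which gives m = 500 ln 2.
median = mean_claim * log(2)

print(p_more_than_two, p_over_1200, median)
```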
9. Based on her previous experience, the head of a construction company knows that the amount of a randomly chosen project
contract follows a uniform distribution in the interval (2C/3, 2C), where C is the project cost. What is the expected benefit
per project?

Answer(s). E(B) = C/3.

10. (May 2014 exam) In a company, each technical service visit to fix a computer system breakdown costs 350 euros, plus a
fixed monthly fee of 175 euros. The monthly average number of breakdowns is 9.5 with a standard deviation of 2.

a) Obtain the expectation and variance of the monthly repair cost (including the monthly fee).

b) Using Chebyshev's inequality, bound the probability that in a given month the cost of repair is lower than or equal to
2000 euros or greater than or equal to 5000 euros.

c) If we instead assume that the monthly cost of repairs is uniformly distributed with the expectation and variance in a),
calculate the probability of the previous part.

d) How can you explain the difference between the results in parts b) and c)?
Solution.

a) X = number of breakdowns in a month. E[X] = 9.5 and Var[X] = 2² = 4.
Let C be the monthly repair cost. Since C = 350X + 175, then E[C] = 350 · 9.5 + 175 = 3500 and Var[C] = 350² · 4 =
490000.

b) By Chebyshev's inequality,
P(C ≤ 2000 or C ≥ 5000) = P(C − 3500 ≤ −1500 or C − 3500 ≥ 1500) = P(|C − 3500| ≥ 1500)
= P(|C − E[C]| ≥ 1500) ≤ Var[C]/1500² = 490000/1500² = 0.22.

c) If C ∼ U(a, b), then E[C] = (a + b)/2 = 3500 and Var[C] = (b − a)²/12 = 490000. Thus,
a + b = 7000 and b − a = √(12 · 490000),
so a = (7000 − √(12 · 490000))/2 = 2287.56 and b = √(12 · 490000) + 2287.56 = 4712.44.

That is, C ∼ U(2287.56, 4712.44) and therefore P(C ≤ 2000 or C ≥ 5000) = 0.

d) In the first case we only know the mean and the variance, and Chebyshev's inequality provides a weak bound of the
requested probability. In the second case, since we know the distribution of the random variable we can calculate the
probability precisely.

11. (Anderson et al., 2018) Wendy's fast-food chain has been recognized for having the fastest average service time among
fast-food restaurants. In a benchmark study, Wendy's average service time of 2.2 minutes was less than those of Burger
King, Chick-fil-A, Krystal, McDonald's, Taco Bell, and Taco John's (QSR Magazine website, December 2014). Assume that
the service time for Wendy's follows an exponential distribution.

a) What is the probability that a service does not exceed one minute?

b) What is the probability that a service is between 30 seconds and one minute?

c) Suppose Wendy's is considering a policy under which if a service time exceeds five minutes, the customer's order is free.
What is the probability that you will get a free meal?

Answer(s).

a) P{X ≤ 1} = 0.365
b) P{0.5 ≤ X ≤ 1} = 0.162
c) P{X ≥ 5} = 0.103
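
For an exponential service time with mean 2.2 minutes the rate is λ = 1/2.2 per minute, and every answer follows from F(x) = 1 − e^(−λx). A minimal Python sketch (standard library only; the function name exp_cdf is an illustrative assumption):

```python
from math import exp

mean_service = 2.2          # minutes
lam = 1 / mean_service      # rate, in services per minute

def exp_cdf(x: float) -> float:
    """Distribution function of Exp(lam): P{X <= x}."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

p_a = exp_cdf(1)                    # P{X <= 1}
p_b = exp_cdf(1) - exp_cdf(0.5)     # P{0.5 <= X <= 1}
p_c = 1 - exp_cdf(5)                # P{X >= 5}

print(p_a, p_b, p_c)
```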
12. (Newbold et al., 2013) In Great Britain, a factory of 2000 employees has a rate of weekly accidents with a loss equal
to λ = 0.4 and the number of accidents follows a Poisson distribution. Get the average time between two consecutive
accidents. What is the probability that the time between two consecutive accidents is less than 2 weeks?

Answer(s). E(T) = 1/0.4 = 2.5 weeks and P{T < 2} = 0.5507
13. The manufacturing cost (in euro) of a certain product can be modeled as a random variable X with normal distribution
N (100, σ = 3). The sale price is independent of the manufacturing cost and varies depending on market conditions. Let Y
be the normal random variable N (129, σ = 6) indicating the unit sale price (in euro) of the product. Answer the following
questions:

a) Obtain the distribution of the benefit obtained with the sale of 10 product units. Do you need to assume any additional
hypothesis to answer?

b) Obtain the expected benefit and its variance.

c) What is the probability of obtaining a benefit of at least 320 €?

Answer(s).

a) B ∼ N(290, σ = √450 ≈ 21.21), assuming that the costs and prices of the 10 units are mutually independent.

b) E(B) = 290 euros and V(B) = 450 euros².

c) P{B ≥ 320} ≈ 0.079
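
The benefit of 10 units is the sum of 10 differences (price minus cost). Assuming all costs and prices are mutually independent, the means add and the variances add, which is what the following Python sketch (standard library, illustrative names) computes:

```python
from math import sqrt
from statistics import NormalDist

units = 10
mean_cost, sd_cost = 100, 3      # X ~ N(100, sigma = 3)
mean_price, sd_price = 129, 6    # Y ~ N(129, sigma = 6)

# Benefit B = sum over the 10 units of (price - cost), all terms independent.
mean_b = units * (mean_price - mean_cost)
var_b = units * (sd_price**2 + sd_cost**2)
sd_b = sqrt(var_b)

p_at_least_320 = 1 - NormalDist(mean_b, sd_b).cdf(320)
print(mean_b, var_b, sd_b, p_at_least_320)
```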

14. According to a bank's study, the number of bounced checks received in a bank branch follows a Poisson distribution, with
a mean of 10 bounced checks per day.

(a) A random sample of 200 branches is selected, for which the number of bounced checks received is recorded in a day.
What is the probability that the total number of bounced checks received is larger than 1900?

(b) What is the probability that a branch receives less than 3 bounced checks in a day?

Answer(s).

a) P ≈ 0.987
b) P = 0.0028.

15. (May 2010 exam) The transit time X (in minutes) for the buses of a certain city in a certain route is modeled as a uniform
random variable on the interval (30, 40).
a) Draw the probability density function of X. Indicate in the coordinate axes both the values of X with positive density
and the values of the density function.

b) What is the probability that the route transit time of a bus will be between 30 and 37.5 minutes?

c) We select 100 buses at random and are interested in the number of buses having a route transit time between 30 and
37.5 (we denote this random variable as Y ). Give the name of the distribution of Y and its parameter values. Calculate
the expectation and standard deviation of Y .

d) What is the (approximate) probability that less than 64 buses will have route transit times between 30 and 37.5?

Solution.
a) f(x) = 1/10 for all x ∈ (30, 40), and f(x) = 0 for all x ∉ (30, 40).

[Figure: density of X, equal to 0.1 on (30, 40) and 0 elsewhere; horizontal axis from −10 to 70, vertical axis from 0 to 0.12]

b) 0.75
c) Y ∼ B(n = 100, p = 0.75), where success = {the bus has a route transit time between 30 and 37.5}. E(Y) = np = 75,
and SD(Y) = √V(Y) = √(np(1 − p)) = 4.33
d) 0.0055 (by the CLT).

Topic 6. Introduction to statistical inference
• Statistical inference: objectives and basic concepts.
• Point estimation of parameters.
• Goodness of fit to a distribution. Graphical methods.
• Distribution of the sample mean.
• Confidence intervals for the population mean.
Topic 6: Introduction to statistical inference
Statistical inference
• Goal: Obtaining information about the parameters of a population from a sample from it.
• We identify the concept of statistical population with that of population for a random
variable (r.v.) X.
• The distribution of the population is the distribution of the r.v. X. For example, X may have
a normal distribution with parameters µ and σ, X ∼ N(µ, σ).



• Statistical inference is concerned with inferring the unknown values of population
parameters for an r.v. (such as its mean) from information in a sample.

Sampling
• A sample is a finite subset of a population. The number of individuals in it is called the
sample size.
• The reasons to consider a sample instead of the entire population include the following:
o The elements of the population may exist conceptually, but maybe not in reality at a
given moment (population of defective parts that a machine will produce during its
lifespan).
o It can be economically infeasible to study the entire population.
o The study of the population would take an excessive time. Further, its characteristics
might change over time (electoral polls).
o The study might entail the destruction of elements studied (mean life of a type of light
bulb, mean breakpoint tension of a cable type, …).

Simple random sampling


• To obtain valid inferences about the population from a sample, it is of paramount importance
that the sample be representative of the population (how about a sample from basketball
players to infer the mean height of the general population?)
• Simple random sampling is a model of a representative sample, which randomly selects
individuals of a population in such a way that:
o each element of the population has the same probability of being selected; and
o draws are carried out with replacement, so that the population remains identical. If the
population size (N) is large with respect to the sample size (n), then, in practice, it is
indifferent to sample with or without replacement.

Simple random sample
• Let X be an r.v. with distribution F. A simple random sample (s.r.s.) of size n of X is a set of
r.v. X1, …, Xn such that:
o X1, …, Xn have distribution F (Xi ∼ F, for i = 1, …, n).
o X1, …, Xn are mutually independent.
• Each realization x1, …, xn of such an s.r.s. is called a particular sample.
• A statistic is a function of the s.r.s. X1, …, Xn. Hence, a statistic is an r.v. (unlike a
parameter, which takes a fixed numerical value for a given population).
• An estimator is a statistic used to approximate a parameter.

Point estimation of the population mean


• Suppose that X is an r.v. for which the value of E[X] is unknown. To estimate E[X], we use an
s.r.s. X1, …, Xn to obtain the sample mean statistic (estimator):

X̄ = (X1 + · · · + Xn)/n

Note that X̄ is an r.v.
• Further, from a particular sample x1, …, xn, we obtain the concrete numerical value

x̄ = (x1 + · · · + xn)/n

• Later we will see why X̄ is a good estimator of E [X].

Example of sampling and inference


A physician visited 24 patients on a certain day. For that day, we thus have a finite population of N
= 24 individuals, with the variable of interest X = “duration (min.) of a visit”.
The values of X in the population were
1 0.9 3.8 10.2 2.1 9.5 4.5 1 2.2 1.5 4.8 1.6 8.8 4.3 1 9 5.1 0.2 2.3 0.8 7.8 7.7 1.5
Hence, the population mean of X is E [X] = 4.

We draw an s.r.s. of size n = 7 of X, obtaining

3.8 9.5 4.8 1.6 0.2 0.8 1.5
The sample mean of these values is x̄ = 3.171. The relative error of this estimate of E[X] is
(3.171 − 4) /4 = −0.207 (−20.7%).
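
The estimate and its relative error are simple to reproduce. A minimal Python sketch using the standard library; the population mean 4 is the value stated in these notes, and the variable names are illustrative.

```python
from statistics import mean

sample = [3.8, 9.5, 4.8, 1.6, 0.2, 0.8, 1.5]   # the particular sample of size n = 7
population_mean = 4                             # E[X], as stated in the notes

x_bar = mean(sample)                            # point estimate of E[X]
relative_error = (x_bar - population_mean) / population_mean

print(round(x_bar, 3), round(relative_error, 3))
```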
If we add new elements to the above s.r.s., the sample mean changes: it tends to get closer to the
population mean.



For a fixed sample size n, the sample mean changes with each particular sample. Thus, we can
draw another sample (n = 7),
5.1 1 0.9 3.8 10.2 2.1 9.5
with sample mean x̄ = 4.66. Next, we see a histogram of the possible values of the sample mean with
n = 7:

Below, we see histograms of the possible values of the sample mean for samples of size n = 7 and
n = 17, respectively. What do such histograms suggest?

Review of key concepts
Key concepts:
• An s.r.s. of size n of an r.v. X is a set of independent r.v. with the same distribution as X:
X1, …, Xn are independent and identically distributed (i.i.d.)
• The statistic sample mean is an r.v. In general, a statistic is an r.v. obtained as a function of
an s.r.s.



Expectation and variance of the sample mean
• We will obtain the expectation and the variance of the r.v. X̄ to understand why X̄ is
considered a good estimator of E [X]. For such a purpose, we use two properties of the
expectation and the variance of a sum of r.v.
• Let X1, …, Xn be an s.r.s. of an r.v. X with expectation E [X] and variance V [X]. Then, for
any numbers a1, …, an:
E [a1X1 + · · · + anXn] = a1E [X1] + · · · + anE [Xn]
and (note: what property are we using in the following identity?)
V [a1X1 + · · · + anXn] = a1² V [X1] + · · · + an² V [Xn]
• Applying the above properties with a1 = · · · = an = 1/n, we obtain

E [X̄] = (1/n)(E [X1] + · · · + E [Xn]) = E [X]
V [X̄] = (1/n²)(V [X1] + · · · + V [Xn]) = V [X]/n

• Thus, the expected value (mean) of X̄ is E [X], so we say that X̄ is an unbiased
estimator of E [X].
• Further, since V [X̄] = V [X]/n, the larger n is, the more concentrated around E [X] will be
the distribution of X̄.
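
Both identities, E [X̄] = E [X] and V [X̄] = V [X]/n, can be verified exactly on a toy case by listing every possible sample drawn with replacement. The sketch below does this in Python with exact fractions; the four-value population is a made-up illustration, not data from these notes.

```python
from fractions import Fraction
from itertools import product

# Hypothetical small population; each value is equally likely to be drawn.
population = [Fraction(v) for v in (1, 2, 3, 6)]
N = len(population)
mu = sum(population) / N                               # E[X]
sigma2 = sum((v - mu) ** 2 for v in population) / N    # V[X]

def sample_mean_moments(n: int):
    """Exact mean and variance of the sample mean over all N**n equally likely samples."""
    means = [sum(s) / n for s in product(population, repeat=n)]
    e = sum(means) / len(means)
    v = sum((m - e) ** 2 for m in means) / len(means)
    return e, v

for n in (1, 2, 3):
    e, v = sample_mean_moments(n)
    print(n, e == mu, v == sigma2 / n)
```

Exact fractions are used so that the equalities can be tested without rounding error.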

• These properties justify the use of the mean as estimator of E [X].

Application to the Bernoulli distribution
• The above results allow us to obtain statistics useful to estimate the parameters of the
distributions studied in Topic 5, from an s.r.s.
• Let X be a Bernoulli r.v. with parameter p, X ∼ Ber (p):

P{X = 1} = p, P{X = 0} = 1 − p

• p is the probability of success.


• Recall that E [X] = p and V [X] = p (1 − p).



Bernoulli distribution
• Suppose that we have an s.r.s. X1, …, Xn from the r.v. X. We want to estimate the value of
its parameter p. Since E [X] = p, we can estimate p as follows:

p̂ = X̄ = (X1 + · · · + Xn)/n

• Furthermore, from the above results, we have

E [p̂] = p and V [p̂] = p(1 − p)/n

Hence, if the sample size n is large, we can expect p̂ to be close to p (p̂ ≈ p).

Example
Pablo wants to run for mayor of his town. To assess his chances, he takes a poll of n = 10 voters to
estimate the proportion of votes that he would obtain.
Consider the r.v. X =“Votes to Pablo”, taking the value 1 if the person says (s)he will vote for Pablo,
and 0 otherwise, with X ∼ Ber (p)
He thus draws a sample of size n = 10, obtaining
1 0 0 1 1 0 1 0 1 0
From this particular sample, we obtain the estimate p̂ = x̄ = (1+0+0+1+1+0+1+0+1+0)/10 = 5/10 = 0.5
of p, the expected proportion of votes that Pablo would obtain.

Binomial distribution
• Let Y be a Binomial r.v. with parameters m and p, Y ∼ B (m, p):
$$P(Y = k) = \binom{m}{k} p^k (1-p)^{m-k}, \qquad k = 0, 1, \ldots, m$$
• We know that Y is the sum of m independent Bernoulli r.v. X1, . . . , Xm with parameter p: Y
= X1 + · · · + Xm.
• Recall that E [Y] = mp and V [Y] = mp(1 − p).
• We will see how to estimate p.
• Given an s.r.s. Y1, …, Yn of size n of Y (recall that m is the number of Bernoulli trials in each Yi), we estimate p as follows:
$$\hat{p} = \frac{\bar{Y}}{m} = \frac{Y_1 + \cdots + Y_n}{n\,m}$$
• Furthermore, by the properties of the sample mean, we have
$$E[\hat{p}] = p, \qquad V[\hat{p}] = \frac{p(1-p)}{n\,m}$$
hence, if the number of Bernoulli trials, m, and/or the number of binomial samples, n, is very large, we can expect p̂ to be close to p.



Example
In the previous example, if we define the variable Y = “Number of voters for Pablo in a sample of size 10 of X”, then Y ∼ B (m, p) with m = 10, and for the sample drawn we obtain Y1 = 5.
Suppose next that Pablo takes a second poll, obtaining the following values of the variable X = “Vote for Pablo”:
0 0 0 0 1 0 0 1 0 0
In this case, the estimated proportion of voters is 0.2, and the value obtained for the variable Y is Y2 = 2.
Since E [Y] = mp, the estimator of the proportion p that takes into account both values Y1 = 5 and Y2 = 2 (with n = 2 and m = 10) gives the estimate
$$\hat{p} = \frac{Y_1 + Y_2}{n\,m} = \frac{5 + 2}{2 \cdot 10} = 0.35$$
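The pooled estimate can be reproduced with a short sketch. It assumes the estimator p̂ = (Y1 + · · · + Yn)/(n·m), that is, the mean of the binomial counts divided by the number of trials m. The function name `estimate_p` is hypothetical.

```python
def estimate_p(counts, m):
    """Estimate p from binomial counts Y1..Yn, each based on m Bernoulli trials."""
    n = len(counts)
    return sum(counts) / (n * m)

# Two polls of m = 10 voters each, with Y1 = 5 and Y2 = 2 votes for Pablo.
p_hat = estimate_p([5, 2], m=10)
print(p_hat)
```

This is the same as pooling the 20 individual answers into a single Bernoulli sample and taking its mean.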

Application to the normal distribution
• Let X be an r.v. with normal (Gaussian) distribution with parameters µ and σ, X ∼ N (µ, σ). Its density function is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R}$$
Recall that E [X] = µ and V [X] = σ^2.


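• From an s.r.s. X1, …, Xn of X we can estimate µ and σ through
$$\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$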


Normal distribution
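• From the properties of the sample mean, it holds that
$$E[\hat{\mu}] = \mu, \qquad V[\hat{\mu}] = \frac{\sigma^2}{n}$$
hence, if n is very large, we can expect that µ̂ is close to µ.
• As for the variance estimator σ̂^2 = (1/n) Σ (Xi − X̄)^2, it holds that
$$E[\hat{\sigma}^2] = \frac{n-1}{n}\,\sigma^2$$
so σ̂^2 is a biased estimator of σ^2.
• That is why the following estimator, called the quasi-standard deviation, is also used:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The reason is that the estimator s^2 is unbiased for σ^2: E [s^2] = σ^2.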
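The difference between the two estimators is only the divisor (n versus n − 1). The sketch below illustrates this with Python's standard library, where `statistics.pstdev` divides by n and `statistics.stdev` divides by n − 1. The data values are hypothetical and are not the returns used in these notes.

```python
import math
import statistics

# Hypothetical monthly percent returns (illustrative values only).
returns = [1.2, -3.4, 5.1, 0.7, -1.9, 4.3, 2.2, -0.8, 6.0, -2.5, 3.1, 0.4]
n = len(returns)

mu_hat = statistics.fmean(returns)      # estimate of mu: the sample mean
sigma_hat = statistics.pstdev(returns)  # divides by n (biased for sigma^2 when squared)
s = statistics.stdev(returns)           # divides by n - 1 (quasi-standard deviation)

print(mu_hat, sigma_hat, s)
# The two are linked by s = sigma_hat * sqrt(n / (n - 1)), so s is always slightly larger.
print(math.isclose(s, sigma_hat * math.sqrt(n / (n - 1))))
```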

Example

We assume that the monthly percent returns of a financial asset follow a normal distribution. We
want to estimate the parameters of their distribution.
We have n = 46 values of the monthly returns.
The sample mean, x̄ = 1.03, is an estimate of the population mean µ.
The sample standard deviation, σ̂ = 4.16, is an estimate of the population standard deviation σ. An alternative estimate is the quasi-standard deviation, s = 4.25.

Goodness of fit to a distribution

• To carry out a statistical inference analysis, it is often assumed that data come from a certain
distribution type (example: normal). Yet, such an assumption should be properly justified.
• There are different methods that can be used for such a purpose, called goodness of fit
methods.
• Here we will only consider two very common graphical goodness of fit methods.
Histogram with density function
• The first is to compare a histogram of the data to the density function obtained with the estimated parameters. If the hypothesis is true, then such a density function will be close to the histogram.

For example, the following chart is obtained from 200 returns of a financial asset. The chart shows the histogram and the normal density function obtained with the estimated parameters (µ̂ = 0.83 and σ̂ = 4.12).
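The comparison can also be made numerically, bin by bin. The following is a minimal sketch under stated assumptions: the 200 returns are not available here, so the data are simulated from a normal distribution with made-up parameters, and the bin range and the helper name `histogram_density` are illustrative choices.

```python
import random
from statistics import NormalDist, fmean, pstdev

def histogram_density(data, lo, hi, bins):
    """Return (bin midpoint, density-scaled bar height) pairs for data in [lo, hi)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), bins - 1)] += 1
    return [(lo + (k + 0.5) * width, c / (len(data) * width)) for k, c in enumerate(counts)]

# Simulated stand-in for the 200 returns (hypothetical parameters).
rng = random.Random(1)
data = [rng.gauss(0.8, 4.0) for _ in range(200)]

# Normal density with the parameters estimated from the data.
fitted = NormalDist(fmean(data), pstdev(data))

bars = histogram_density(data, -12.0, 14.0, 13)
for mid, height in bars:
    print(f"{mid:6.1f}  histogram={height:.3f}  normal density={fitted.pdf(mid):.3f}")
```

If the normality assumption is reasonable, the two columns should be of similar size in every bin; large, systematic gaps suggest a poor fit.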

QQ-plot
• The second method is based on a chart called QQ-plot. This plots the estimated quantiles
from the data vs. the theoretical quantiles for the distribution with the parameters estimated
from the sample.
• If the data come from the assumed distribution, then the points in the plot will be close to the
line y = x.
• If the distribution function is continuous and increasing, the p-th quantile (0 < p < 1), denoted by qp, is obtained by inverting the distribution function. Thus, if we look for the value qp such that F(qp) = p, then qp = F^−1 (p). If the distribution is discrete or piecewise constant, we take qp = min {x : F(x) ≥ p}.
• The p-th sample quantile, Qp, is obtained through the following procedure: (1) order the data from smallest to largest, obtaining x(1), . . . , x(n); (2) then, take Qp = x([np]).
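The points of a QQ-plot can be computed as pairs (theoretical quantile, sample quantile). The sketch below is illustrative only: it interprets [np] as np rounded up to the next integer (an assumption, chosen so that it matches min {x : F(x) ≥ p} for the empirical distribution), and it uses simulated data with made-up parameters in place of real returns.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def sample_quantile(sorted_data, p):
    """p-th sample quantile x([np]); [np] is taken here as ceil(n*p), an assumption."""
    k = max(1, math.ceil(len(sorted_data) * p))
    return sorted_data[k - 1]  # x(k) with 1-based k

# Simulated stand-in for the data (hypothetical parameters), sorted as in step (1).
rng = random.Random(2)
data = sorted(rng.gauss(0.8, 4.0) for _ in range(200))

# Theoretical quantiles come from the normal distribution with estimated parameters.
fitted = NormalDist(fmean(data), pstdev(data))
probs = [i / 20 for i in range(1, 20)]
qq_pairs = [(fitted.inv_cdf(p), sample_quantile(data, p)) for p in probs]

for theoretical, observed in qq_pairs:
    print(f"{theoretical:7.2f}  {observed:7.2f}")
```

Plotting the second column against the first gives the QQ-plot; a good fit shows up as pairs with similar values, that is, points near the line y = x.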

For example, the following chart shows the QQ-plot of 200 returns of a certain financial asset, where the estimated quantiles are plotted against the quantiles of the normal distribution with parameters µ̂ = 0.83 and σ̂ = 4.12.
The chart shows that the fit is quite good.



The distribution of the sample mean
• We have seen how to estimate the parameters of some distributions based on properties of
the sample mean.
• Next, we’ll determine the distribution of the sample mean. This distribution will be very useful
to calculate confidence intervals.
• If X has a normal distribution N (µ, σ) and X1, …, Xn is an s.r.s. of X, it holds that
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
• If X has expectation E [X] and variance V [X] but does not have a normal distribution, the Central Limit Theorem (CLT) ensures that, if X1, …, Xn is an s.r.s. of X with n large enough (as a rule of thumb, n ≥ 30), it holds approximately that
$$\bar{X} \sim N\left(E[X], \sqrt{\frac{V[X]}{n}}\right)$$

Example
If X1, …, Xn is an s.r.s. of X with distribution Ber (p), for large n we have approximately
$$\bar{X} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$
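To get a feel for the quality of the CLT approximation in the Bernoulli case, one can compare it with the exact binomial probability. The following is a minimal sketch: the values p = 0.3 and n = 400, the interval of counts, and the function names are assumptions made for illustration, and no continuity correction is applied.

```python
import math
from statistics import NormalDist

def exact_prob(n, p, k_lo, k_hi):
    """Exact P(k_lo <= X1 + ... + Xn <= k_hi) for Bernoulli(p) observations."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_lo, k_hi + 1))

def clt_prob(n, p, k_lo, k_hi):
    """CLT approximation of P(k_lo/n <= sample mean <= k_hi/n)."""
    approx = NormalDist(p, math.sqrt(p * (1 - p) / n))
    return approx.cdf(k_hi / n) - approx.cdf(k_lo / n)

# Hypothetical setting: p = 0.3, n = 400, sample mean between 104/400 and 136/400.
n, p = 400, 0.3
exact = exact_prob(n, p, 104, 136)
approx = clt_prob(n, p, 104, 136)
print(exact, approx)
```

The two values should be close for large n; the remaining gap is mostly due to approximating a discrete distribution by a continuous one.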
Let X be a discrete r.v. with probability function:

We draw an s.r.s. of size n = 125 of X. What is the probability that the sample mean lies between
2.4 and 2.6?

We have

Hence, by the CLT it holds approximately that



From which we obtain

Confidence intervals
• Instead of a point estimator, it is more informative to give an interval of plausible values for
the unknown parameter.
• Given a sample, we would like to have a narrow interval of values that, with certainty, will
contain the true value of the population mean, µ. But that is not possible. Why?
• We will consider a method to construct random intervals from an s.r.s., such that about 100(1 − α)% of the intervals generated from different s.r.s. contain the true value of the population mean µ. We call 1 − α the confidence level and the intervals obtained confidence intervals.

Confidence intervals for the population mean
• Suppose X1, …, Xn is an s.r.s. of an r.v. X with distribution N (µ, σ), with σ known.
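• We know that X̄ ∼ N (µ, σ/√n), and hence (X̄ − µ)/(σ/√n) ∼ N (0, 1). Then, writing z_{α/2} for the value such that P(Z > z_{α/2}) = α/2 when Z ∼ N (0, 1),
$$P\left(-z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$$
• Hence, for a particular sample x1, …, xn, a confidence interval for µ with confidence level 1 − α is given by
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$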
• We have generated 100 samples of size n = 50 from a N (−2, 1) distribution. The following chart shows the resulting 90% confidence intervals for µ. About 90% of them contain the true value µ = −2.
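An experiment of this kind can be reproduced with a short simulation. This is a sketch rather than the code used to produce the chart: the function name `coverage`, the seed, and the use of the interval x̄ ± z·σ/√n with σ known are assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(reps, n, mu, sigma, level, seed=0):
    """Fraction of `reps` confidence intervals (sigma known) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. about 1.645 for level = 0.90
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

# 100 samples of size 50 from N(-2, 1), 90% intervals.
print(coverage(reps=100, n=50, mu=-2.0, sigma=1.0, level=0.90))
```

With only 100 intervals the observed fraction fluctuates around 0.90; raising `reps` brings it closer to the nominal level.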

Example
Suppose that the stock returns of the firm SEGURA S.L. follow a normal distribution with mean µ
euros and variance σ^2 = 1. We take an s.r.s. of n = 20 returns, obtaining the values
5.29 3.66 5.71 6.62 4.30 5.85 6.25 3.40 3.55 5.57 4.60 5.69 5.81 5.71 6.29 5.66 6.19 3.79 4.98 4.84
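The sample mean obtained is x̄ = 5.188. Hence, with σ = 1, n = 20 and z_{0.05} ≈ 1.645, the 90% confidence interval for the mean return is
$$5.188 \pm 1.645 \cdot \frac{1}{\sqrt{20}} = 5.188 \pm 0.368, \quad \text{that is, } (4.820,\ 5.556)$$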
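The interval for this example can be computed directly from the data. The sketch below uses the 20 returns listed above and the known variance σ^2 = 1; the variable names are illustrative.

```python
import math
from statistics import NormalDist, fmean

# The 20 returns from the example.
returns = [5.29, 3.66, 5.71, 6.62, 4.30, 5.85, 6.25, 3.40, 3.55, 5.57,
           4.60, 5.69, 5.81, 5.71, 6.29, 5.66, 6.19, 3.79, 4.98, 4.84]

sigma = 1.0   # known standard deviation (variance 1)
alpha = 0.10  # 90% confidence level

z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile, about 1.645
xbar = fmean(returns)
half_width = z * sigma / math.sqrt(len(returns))
ci = (xbar - half_width, xbar + half_width)
print(xbar, ci)
```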

Confidence intervals with large samples
• What if the standard deviation is unknown or the population is not normal?
• When the sample size n is large, the CLT ensures that the distribution of the sample mean X̄ is approximately normal, regardless of the distribution of the observations.
• Hence, even if the data are not normal, for large samples we can use the following confidence interval for the population mean:
$$\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$$
with s the estimated quasi-standard deviation and z_{α/2} the value such that P(Z > z_{α/2}) = α/2 for Z ∼ N (0, 1).
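As an illustration, the large-sample interval x̄ ± z·s/√n can be applied to the summary figures of the earlier returns example (n = 46, x̄ = 1.03, s = 4.25). This is only a sketch: the 95% confidence level and the function name `large_sample_ci` are assumptions, and the notes do not report this interval.

```python
import math
from statistics import NormalDist

def large_sample_ci(xbar, s, n, level=0.95):
    """Approximate CI for the mean using the quasi-standard deviation s (large n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Summary figures from the monthly returns example.
lo, hi = large_sample_ci(xbar=1.03, s=4.25, n=46)
print(lo, hi)
```

Only the summary statistics are needed, so the same function works for any large sample once x̄, s and n are known.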

Confidence intervals for a proportion: Bernoulli


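• Let X1, …, Xn be an s.r.s. of an r.v. X with a Ber(p) distribution. Then the sample mean p̂ = X̄ is an r.v. that estimates the proportion p of successes of the Bernoulli experiment X.
• By the CLT we know that, for large n, approximately
$$\hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$
• The (approximate) confidence interval for the proportion p with confidence level 1 − α, where z_{α/2} is the value such that P(Z > z_{α/2}) = α/2 for Z ∼ N (0, 1), is
$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$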

Example
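In the example of the estimation of the Bernoulli parameter p, Pablo finally takes a poll of n = 100 voters and obtains the estimate p̂ = 0.4.
With z_{0.025} ≈ 1.96, the 95% confidence interval for p is
$$0.4 \pm 1.96\sqrt{\frac{0.4 \cdot 0.6}{100}} = 0.4 \pm 0.096, \quad \text{that is, } (0.304,\ 0.496)$$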
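The interval for Pablo's poll can be computed with a few lines. This sketch assumes the usual large-sample interval p̂ ± z·√(p̂(1 − p̂)/n); the function name `proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, level=0.95):
    """Approximate CI for a Bernoulli proportion based on the CLT (large n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Pablo's poll: n = 100 voters, estimate 0.4, confidence level 95%.
lo, hi = proportion_ci(0.4, 100)
print(lo, hi)
```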

Confidence intervals for a proportion: Binomial
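• Let Y1, …, Yn be an s.r.s. of an r.v. Y ∼ B(m, p). Then the sample mean Ȳ is an r.v. that estimates its mean mp, and p̂ = Ȳ/m estimates p.
• By the CLT we know that, for large n, approximately
$$\bar{Y} \sim N\left(mp, \sqrt{\frac{mp(1-p)}{n}}\right)$$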



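• The (approximate) confidence interval for the proportion p with confidence level 1 − α, where p̂ = Ȳ/m and z_{α/2} is the value such that P(Z > z_{α/2}) = α/2 for Z ∼ N (0, 1), is
$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n\,m}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n\,m}}\right)$$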
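A minimal sketch of the binomial case follows. It assumes the interval p̂ ± z·√(p̂(1 − p̂)/(n·m)) with p̂ = Ȳ/m, which follows from dividing the CLT approximation for Ȳ by m. The counts are hypothetical (40 polls of m = 10 voters each) and the function name `binomial_p_ci` is illustrative.

```python
import math
from statistics import NormalDist

def binomial_p_ci(counts, m, level=0.95):
    """Estimate p and an approximate CI from binomial counts Y1..Yn with m trials each."""
    n = len(counts)
    p_hat = sum(counts) / (n * m)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / (n * m))
    return p_hat, p_hat - half_width, p_hat + half_width

# Hypothetical data: number of votes for a candidate in each of 40 polls of 10 voters.
counts = [4, 3, 5, 2, 4, 3, 3, 4, 5, 2] * 4
p_hat, lo, hi = binomial_p_ci(counts, m=10)
print(p_hat, lo, hi)
```

The result coincides with the Bernoulli interval applied to all n·m individual answers, since the binomial counts are just sums of Bernoulli observations.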
