Statistics Notes
Commercial exploitation and modification of this work are not permitted. Printing it in its entirety is permitted.
3. Types of variables
4. Data sources: population and sample
5. Statistical analysis with R Commander
Topic 1: Introduction and basic concepts
1. What is Statistics?
In everyday language, the term statistics refers to numbers that describe some aspect of
the world. Statistics is better described as a discipline than as a science in its own right. Its name comes from the
word “state”, because at first it was used only to describe matters related to politics and
governments. Nowadays it has many applications, and almost every discipline uses it.
Statistics is much more than mere numbers: it is the discipline that addresses how to collect,
6. Interpretation and decision making
2. Databases (DB): elements, variables and observations
Data are the collected features about the phenomenon under study.
• Time series data: data that evolve over time.
o Example: GNP (Gross National Product) from 1970 to 2021
• Static data: time is fixed. Statistical units are firms, countries, etc. This is the only type of
data treated in this course.
A data matrix (or data set) organizes the collected data so that information can be extracted from it. We can distinguish
three elements in a data matrix: the variables, the statistical units (also called
elements) and the observations.
• Variables: each column represents a variable. A variable is a characteristic of interest for the
statistical unit or element. Different types of variables require different treatments. Examples
of statistical variables:
o Vote of “madrileños”: Cs, IU, PP, PSOE, UP, Vox, …
o Employment status of “getafenses”: unemployed, part time, full time, …
o Customer purchase satisfaction
▪ Example: The height H of an individual can be 62 inches, 63.8 inches or 65.8341
inches, depending on the precision of the measurement; it is a continuous
variable.
▪ The data that are defined by a continuous variable are called continuous data.
• Example: The heights of 100 university students.
• Categorical (qualitative): values are attributes (labels), not measurements
o Nominal: no natural ordering
o Ordinal: naturally ordered classes
of the individuals, especially if they are a large number. Instead of studying the whole group, called the
population, a small part of the group, called the sample, is examined.
• Population: complete collection of individuals. In practice it is unusual to study all the
individuals of a population:
o The individuals may exist conceptually, but not in reality
o It may be economically infeasible to study the entire population
o The study might take so much time that it would be infeasible and, moreover, the
population might change over the time span of the study
o The study may imply the destruction of individuals
• Sample: a subset of individuals drawn from the population
[Figure: diagram labelled POPULATION, Inferential statistics and Probability models]
o The sample selection method (sampling method) is very important
• Data sources:
o Available historical information
o From observations (observational studies)
o From experiments (experimental studies)
All rights reserved.
EXERCISES OF TOPIC 1
2.
a) Sample biased to a particular person profile.
b) Sample biased to students in Madrid who are undergoing a course in statistics.
c) Correct: unbiased sample, with a lower percentage of non-response than other methods (e.g.,
e-mail, phone call...).
d) Sample biased to university students.
e) Biased: we obtain responses only from relatives and friends (who in most cases will have an opinion similar
to ours).
f) Sample biased to readers of that particular newspaper. Furthermore, we collect information only from
those who answer (usually people with the most extreme opinions).
Topic 2: Analysis of univariate data
1. Representations and graphs
a. Frequency tables
b. Bar and pie charts, pictograms, histograms, frequency polygons. Other
graphs. Lying with graphs
2. Numerical measures to summarize and describe data:
a. Central tendency (mean, median, mode)
b. Location (quartiles and percentiles). Box plots
c. Spread (variance, standard deviation, quasi-variance, quasi-standard deviation,
range, IQR, coefficient of variation)
d. Shape (coefficients of skewness and kurtosis)
Topic 2: Analysis of univariate data
Frequency table (categorical variables)
Example about education data:
Note:
• ni = number of individuals of class ci in the sample (absolute frequency); n = sample size
• fi = ni/n (relative frequency)
• 0 ≤ fi ≤ 1
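As a minimal sketch of how the quantities in the note fit together, the following Python snippet builds a frequency table for a small categorical sample. The class names and the sample itself are invented for illustration; they are not the education data of the example.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (not the course data).
sample = ["primary", "secondary", "secondary", "university",
          "primary", "secondary", "university", "secondary"]

n = len(sample)                     # n: sample size
abs_freq = Counter(sample)          # n_i: absolute frequency of each class c_i
rel_freq = {c: n_i / n for c, n_i in abs_freq.items()}   # f_i = n_i / n

for c in abs_freq:
    print(f"{c:<12} n_i = {abs_freq[c]}   f_i = {rel_freq[c]:.3f}")

# Every relative frequency lies between 0 and 1, and together they add up to 1.
total_rel = sum(rel_freq.values())
```

The absolute frequencies add up to n, which is why the relative frequencies add up to 1.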
Frequency table (discrete numeric variables)
Example:
• Sample: 100 shopping malls in which a promotion of a certain service was launched last
November
• Class mark (midpoint): the midpoint of the class interval, obtained
by adding its left and right endpoints and dividing by 2:
c_k = (L_{k-1} + L_k)/2
▪ Example (heights): w = 62 − 60 = 65 − 63 = 68 − 66 = 71 − 69 = 74 − 72 = 2
o Class intervals cannot overlap
o Round up the interval width to get convenient interval endpoints
o We can determine the width (w) of each interval by
w = (largest number - smallest number)/number of desired intervals
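The rules above can be sketched in a few lines of Python. The measurements and the number of intervals are hypothetical, and rounding the width up to an integer is only one possible way of obtaining convenient endpoints.

```python
import math

# Hypothetical measurements (not the course data).
data = [60.3, 61.8, 63.1, 64.7, 66.0, 67.2, 68.9, 70.4, 71.5, 73.6]
k = 5                                    # desired number of intervals

# w = (largest number - smallest number) / number of desired intervals
raw_width = (max(data) - min(data)) / k
w = math.ceil(raw_width)                 # round up to a convenient width

# Non-overlapping intervals [L_{i-1}, L_i) starting at a convenient endpoint.
start = math.floor(min(data))
edges = [start + i * w for i in range(k + 1)]

# Class mark of each interval: (left endpoint + right endpoint) / 2.
class_marks = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]

# Absolute frequency of each interval (the last one also takes its right endpoint).
counts = [0] * k
for x in data:
    i = min(int((x - start) // w), k - 1)
    counts[i] += 1
```

Because the width is rounded up, the k intervals always cover the whole range of the data.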
• Useful for tabulating discrete variables with many possible values
Bar charts
Example about education:
• Bars are of the same width and equally spaced,
their heights represent frequencies
• There are gaps between bars
• Bars are labelled with class names (or codes)
• Bar charts with cumulative frequencies:
Beware! Many software programs sort classes in
alphabetical order when the variable is
categorical. If the variable is ordinal, its classes must be
shown in their natural (increasing) order.
• Bar charts can also be used for discrete data if there are not too many different values
o Example (too many different values in discrete data):
▪ Sample: 46 employees of a company
▪ Variable: EXPRNC: years working in the company
Cartograms
INE, Encuesta de Turismo de residentes
Time series
INE, Encuesta de Población Activa
o Example: For the experience of the 46 employees, what is the mean?
Note: the mean salary from the original data equals 17250.41
• Linearity: a linear operation applied to the data applies in the same way to the mean.
o If Y = a + bX ⇒ ȳ = a + b·x̄
o If Z = X + Y ⇒ z̄ = x̄ + ȳ
• Disadvantages: Affected by extreme values (outliers)
o Example: X: 3, 1, 5, 4, 2 Y: 3, 1, 5, 4, 200
x̄ = (3 + 1 + 5 + 4 + 2)/5 = 3
ȳ = (3 + 1 + 5 + 4 + 200)/5 = 42.6
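The two properties of the mean discussed here, linearity and sensitivity to outliers, can be checked numerically. This is a small sketch using the X and Y data of the example; the constants a and b are arbitrary values chosen only for illustration.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]          # same data, with one extreme value

mean_x = mean(x)               # 3
mean_y = mean(y)               # 42.6: a single outlier drags the mean upwards

# Linearity: if Y = a + bX, then the mean of Y is a + b * (mean of X).
a, b = 10, 2                   # arbitrary constants (assumption for the demo)
transformed = [a + b * xi for xi in x]
mean_transformed = mean(transformed)

# If Z = X + Y, then the mean of Z is (mean of X) + (mean of Y).
z = [xi + yi for xi, yi in zip(x, y)]
mean_z = mean(z)
```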
1 1 1 3 3 5 5 7 8 8 9 → M = 5
1. Order the data from smallest to largest
2. Include repetitions
3. The median is in the central position
1 1 1 3 3 5 5 7 8 8 → M = (3 + 5)/2 = 4
• To find the median in a frequency table, we look for the first value whose cumulative relative frequency Fi exceeds 0.5 (50%)
o Example of the experience: M = 6
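A minimal Python sketch of both procedures follows: the median of raw data, and the median read from a frequency table. The two raw samples are the ones used in these notes; the frequency table is hypothetical (the original table for the experience data is not reproduced here), and both function names are made up for this example.

```python
def median(values):
    """Median of raw data: order, keep repetitions, take the central position."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # odd n: the central value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # even n: average of the two central values


def median_from_table(values, counts):
    """First value whose cumulative relative frequency F_i exceeds 0.5.

    Assumes the values are sorted. If some F_i equals 0.5 exactly, the usual
    convention is to average that value and the next one; that case is not
    handled in this sketch.
    """
    n = sum(counts)
    cumulative = 0
    for value, n_i in zip(values, counts):
        cumulative += n_i
        if cumulative / n > 0.5:
            return value


m_odd = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9])    # 11 observations
m_even = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8])      # 10 observations

# Hypothetical frequency table with n = 46.
m_table = median_from_table([2, 4, 6, 8], [10, 12, 14, 10])
```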
Examples:
Measures of spread: range, interquartile range (IQR), variance, standard deviation and coefficient
of variation
The range is the simplest measure of spread: the difference between the largest and
the smallest observation.
R = Xmax − Xmin
• It ignores the way the data are distributed
• Sensitive to outliers
Example: Given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4
Example: Given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99
The interquartile range (IQR) avoids some of the problems caused by outliers: it discards the highest and lowest
observations and measures the range of the middle 50% of the data
IQR = 3rd quartile − 1st quartile = Q3 − Q1
• Outliers are observations that fall
o below the value Q1 − 1.5 · IQR
o above the value Q3 + 1.5 · IQR
• Boxplot
o It shows five location measures
o It allows us to assess the spread of the data
o It allows us to assess the symmetry of the data
o It is very useful for comparing different datasets
o Note: R produces a modified boxplot, where
outliers are plotted as distinguished points (the
min and max shown are those without outliers)
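The outlier rule and the ingredients of a modified boxplot can be sketched as follows. The sample is hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default "exclusive" method; other software (including R) may use a different quartile convention and give slightly different values.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

q1, q2, q3 = quantiles(data, n=4)     # quartiles; q2 is the median
iqr = q3 - q1                         # range of the middle 50% of the data

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# In a modified boxplot the whiskers stop at the most extreme
# observations that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_min, whisker_max = min(inside), max(inside)
```

Here the value 30 is plotted as a separate point, and the upper whisker ends at 12 instead of at the sample maximum.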
Variance
• Average of squared deviations of values from the mean
• Sample variance
If a, b are real numbers and y = a + bx, then the variance of y equals b² times the variance of x (the shift a has no effect).
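The effect of a linear transformation on the variance can be checked numerically. In this sketch the variance divides by n (`statistics.pvariance`) and the quasi-variance divides by n − 1 (`statistics.variance`), which is the usual meaning of these two terms; the data and the constants a and b are chosen only for illustration.

```python
from statistics import pvariance, variance

x = [3, 1, 5, 4, 2]

var_x = pvariance(x)          # average squared deviation from the mean (divides by n)
quasi_var_x = variance(x)     # quasi-variance (divides by n - 1)

# Linear transformation y = a + b*x with arbitrary constants.
a, b = 10, 3
y = [a + b * xi for xi in x]

var_y = pvariance(y)          # equals b**2 * var_x; the shift a drops out
quasi_var_y = variance(y)     # the same scaling holds for the quasi-variance
```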
Standard deviation (SD)
• The most-commonly used measure of spread
• The sample standard deviation and sample quasi-standard deviation are respectively
• Standardizing a variable x means computing a new variable
z = (x − x̄)/s
• If you apply this formula to all observations x1, …, xn and call the transformed ones z1, ...,
zn, then the z’s have mean zero and standard deviation one
• Standardizing = calculating z-scores
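A short numerical check of this property is sketched below. It uses the standard deviation that divides by n (`statistics.pstdev`); using the quasi-standard deviation consistently (`statistics.stdev`) gives the analogous result. The data are illustrative only.

```python
from statistics import mean, pstdev

x = [3, 1, 5, 4, 2]

x_bar = mean(x)
s = pstdev(x)                            # standard deviation (divides by n)

# z-score of each observation: z_i = (x_i - x_bar) / s
z = [(xi - x_bar) / s for xi in x]

z_mean = mean(z)                         # 0, up to floating-point rounding
z_sd = pstdev(z)                         # 1, up to floating-point rounding
```

A z-score tells how many standard deviations an observation lies above (positive) or below (negative) the mean.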
Do not make a decision about the shape just through a comparison between the Mean, the Median
and the Mode.
Coefficient of Kurtosis
1. The histogram of a numerical continuous variable has bars for each class whose area is
proportional to the frequency of the class.
Select one:
• True → This is a convention introduced to facilitate the interpretation of a histogram. The
correct answer is 'True'.
• False
2. The Pareto chart for the sample of a nominal random variable orders the values of the variable
according to their values.
Select one:
• True
• False → A Pareto chart orders the values of the variable according to their frequencies
(largest to smallest). The correct answer is 'False'.
3. For a sample of an integer random variable taking values between 0 and 10 (all of them with
positive frequencies), it holds that its median (Q2) is equal to 3 and Q3 is equal to 6 . Then, the
4. We have collected a sample of size 40 from a certain variable. The variable takes integer values
between 5 and 10. Then, the quasivariance for this sample cannot be larger than 5.
Select one:
• True
• False → You could have half the values in the sample equal to 5 and half equal to 10, and
the quasivariance would then be approximately 6.41. The correct answer is 'False'.
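The counterexample can be verified directly. This sketch builds the sample described in the answer (20 values equal to 5 and 20 equal to 10) and computes its quasivariance, that is, the variance with denominator n − 1.

```python
from statistics import mean, variance

# Sample of size 40: half the values equal to 5, half equal to 10.
sample = [5] * 20 + [10] * 20

x_bar = mean(sample)                 # 7.5
quasi_var = variance(sample)         # sum of squared deviations / (n - 1) = 250 / 39

exceeds_five = quasi_var > 5         # True, so the claim in the question is false
```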
Exercises
1. The following table shows the absolute frequency distribution of the duration (in minutes) of 60 taxi services with origin
in a certain airport:
(a) Draw a histogram, taking into account that not all classes have the same width. Calculate the height of each bar so
that the area of each rectangle equals the relative frequency of its class (histogram with unit total area).
(b) From the histogram, describe the shape of the distribution. Indicate the modal and median intervals.
(c) From the table of frequencies, calculate (approximately) the mean and variance of the duration using the class marks.
Solution:
(a)
Duration (interval)   Class mark   Rel. frequency   Height   Cum. rel. frequency
[0, 10)               5            0.133            0.013    0.133
[10, 20)              15           0.283            0.028    0.416
[20, 30)              25           0.233            0.023    0.649
[30, 40)              35           0.167            0.017    0.816
[40, 60)              50           0.184            0.009    1.000
(b) The histogram shows a slight positive (right) skewness. It is unimodal, and the modal interval is [10, 20).
The median would be obtained as the average of the durations in positions 30 and 31, both of which lie in the interval
[20, 30).
(c) The mean duration estimated from the class marks is x̄ = 25.78 minutes, and the variance is
σ̂² = 0.133 · (5 − 25.78)² + 0.283 · (15 − 25.78)² + 0.233 · (25 − 25.78)² + 0.167 · (35 − 25.78)² + 0.184 · (50 − 25.78)² =
212.59 minutes²
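The calculations of this solution can be reproduced with a short Python sketch. It uses the class intervals and the rounded relative frequencies of the table; the height of each bar is the relative frequency divided by the class width, so that the total area of the histogram is 1.

```python
# Class intervals and relative frequencies taken from the solution table.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 60)]
rel_freq = [0.133, 0.283, 0.233, 0.167, 0.184]

# Height = relative frequency / width, so that area = relative frequency.
heights = [f / (hi - lo) for (lo, hi), f in zip(intervals, rel_freq)]
total_area = sum(h * (hi - lo) for h, (lo, hi) in zip(heights, intervals))

# Approximate mean and variance using the class marks (interval midpoints).
class_marks = [(lo + hi) / 2 for lo, hi in intervals]
mean = sum(f * c for f, c in zip(rel_freq, class_marks))
variance = sum(f * (c - mean) ** 2 for f, c in zip(rel_freq, class_marks))

print([round(h, 3) for h in heights], round(mean, 2), round(variance, 2))
```

The last class is twice as wide as the others, which is why its bar is lower than that of [30, 40) even though its relative frequency is larger.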
2. The spreadsheet data_condemned_2016_INE of the Excel workbook Datos_spreadsheet2.xls contains information provided
by the INE¹ about the age and number of prison sentences handed down in 2016.
(a) Represent the relative and cumulative frequency distributions of the variable age through a bar chart. What information
can you obtain about the age of the condemned? (Note: if you use Excel you can represent simultaneously both
distributions through a combined chart. Select the cumulative frequencies as secondary axis).
(b) Represent the relative frequency distribution of the variable number of prison sentences through a pie chart. Do the
quartiles and percentiles make sense for this variable? If yes, calculate the 80th percentile and interpret it.
Solution:
(a) The bar chart shows that the largest percentage of convicted persons is aged between 41 and 50, after which
the percentage drops markedly. Convicted persons under 41 account for
more than 60% of the total (64.4%). The youngest, aged 18 to 20, account for 8.8%, while the
remaining age brackets each stay at around 15%. The frequency distributions are included in the chart
itself.
¹ From Estadística de condenados: Adultos, based on information from the Registro Central de Penados of the Ministerio de Justicia.
(b) The following table shows the absolute, relative and cumulative relative frequency distributions:
Number of sentences   ni       fi      Fi
One sentence          94,709   0.349   0.349
Two sentences         89,703   0.330   0.679
Three sentences       32,624   0.120   0.799
Four sentences        23,692   0.087   0.887
Five sentences        10,681   0.039   0.926
More than five        20,117   0.074   1.000
The pie chart is:
Convicted persons with 1 or 2 sentences account for more than half of the total, almost 70% (67.9%).
The percentage of convicted persons decreases as the number of sentences increases, except in the
last categories, where the order is reversed: the number of convicted persons with more than 5 sentences is almost twice the
number with exactly 5 sentences.
Since the variable is quantitative, location measures make sense. In this case the percentile P80 can be taken as 3 or 4
sentences, since about 80% of convicted persons (79.9%) have 3 or fewer sentences, while about 20% have 4 or more
sentences.
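The relative and cumulative frequencies of this table, and the position of the 80th percentile, can be recomputed from the absolute frequencies. This is a sketch: the English category labels are translations introduced here, and `percentile_category` is a helper name made up for the example. It applies the strict rule "first category whose cumulative relative frequency reaches p".

```python
labels = ["One sentence", "Two sentences", "Three sentences",
          "Four sentences", "Five sentences", "More than five"]
counts = [94709, 89703, 32624, 23692, 10681, 20117]   # absolute frequencies n_i

n = sum(counts)
rel = [c / n for c in counts]                          # f_i = n_i / n

# Cumulative relative frequencies F_i.
cum = []
running = 0
for c in counts:
    running += c
    cum.append(running / n)


def percentile_category(p):
    """First category whose cumulative relative frequency reaches p."""
    for label, F in zip(labels, cum):
        if F >= p:
            return label


p80 = percentile_category(0.80)
```

Under the strict rule P80 falls in the category of four sentences, because the cumulative frequency up to three sentences is 79.9%, just short of 80%. That near-tie is why the solution accepts either 3 or 4.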
3. The following is a chart from the report La Universidad Española en Cifras 2015/2016².
² Published by the Conferencia de Rectores de las Universidades Españolas (CRUE) with the collaboration of Santander Universidades.
(a) What is the variable of interest? What is its type? What are the population and the sample?
(b) From the chart, obtain the maximum, the minimum, the median, the first and the third quartiles, the range and
the interquartile range (IQR) of the variable. Which universities occupy each of these positions? Interpret the values
obtained.
(c) Draw the chart that you consider most appropriate to represent the data and comment on the shape and the possible
presence of outliers.
(d) If you observe outliers, calculate the mean and the standard deviation of the data with and without outliers. Obtain
also the median and the interquartile range of the data without outliers. Comment on the results obtained.
(e) Taking into account the figure at the end of the chart, indicating that the total percentage of mobility students in
public Spanish universities is 6.18%, which criterion do you think was used to select the 20 universities in the
chart?
Solution:
(a) The variable of interest is the percentage of undergraduate (Grado) students in international mobility at Spanish public
universities during the academic year 2015/2016; it is quantitative and continuous. The population consists of all on-site
public universities. The available data are a sample of 20 of these universities.
(b) The chart shows the values sorted in increasing order. The maximum is reached at Universidad Carlos III
with 15.68%, and Universidad de Málaga has the minimum percentage of mobility students,
6.21%. Since (1/4) · 21 = 5.25 and (3/4) · 21 = 15.75, we can round and select Universidad de Salamanca and
Politécnica de Catalunya, which occupy positions 5 and 16 respectively, as those giving the first and third
quartiles, with values Q1 = 7.17% and Q3 = 9.09%.
(c) The data can be represented with a histogram or a box plot. Since there are few data (n = 20),
we take 5 classes of width 1.9 starting at 6.2 to build the histogram. We obtain:
The distribution is positively skewed (to the right) and there is one unusually high value. As the
descriptive statistics with and without the Carlos III outlier show, the skewness (quite
pronounced with all 20 data) is considerably reduced without it, but the distribution remains skewed. Note how the high value of the kurtosis also
points to the presence of outliers.
The IQR = 9.09 − 7.17 = 1.92, and therefore the lower and upper limits for considering a value an outlier are
Q1 − 1.5 · IQR = 7.17 − 2.88 = 4.29 and Q3 + 1.5 · IQR = 9.09 + 2.88 = 11.97.
Since the minimum percentage is 6.21 > 4.29, there are no low outliers. There is a single high outlier, which is also
extreme. The percentage of mobility students at Universidad Carlos III is unusually high compared with the
rest.
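The outlier check for this exercise can be sketched from the summary values quoted in the solution (Q1, Q3, minimum and maximum). The threshold Q3 + 3 · IQR for extreme outliers is the one used elsewhere in these exercises; all variable names are illustrative.

```python
# Summary values quoted in the solution (percentages).
q1, q3 = 7.17, 9.09
minimum, maximum = 6.21, 15.68        # Universidad de Málaga, Universidad Carlos III

iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr          # values below this are low outliers
upper_limit = q3 + 1.5 * iqr          # values above this are high outliers
extreme_upper_limit = q3 + 3 * iqr    # values above this are extreme outliers

has_low_outlier = minimum < lower_limit
max_is_outlier = maximum > upper_limit
max_is_extreme = maximum > extreme_upper_limit

print(round(lower_limit, 2), round(upper_limit, 2), round(extreme_upper_limit, 2))
```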
(d) The descriptive measures obtained with and without the outlier are:
Comparing all the measures, the one most affected by the outlier is the standard
deviation, followed by the mean. The median remains practically the same and the IQR changes little. In relative terms
the changes are:
41.05% for the quasi-standard deviation: (2.0737 − 1.222)/2.0737 ≈ 0.4105
4.48% for the mean: (8.464 − 8.084)/8.464 ≈ 0.0448
1.16% for the median: (8.155 − 8.06)/8.155 ≈ 0.0116
2.6% for the IQR: (1.92 − 1.87)/1.92 ≈ 0.026
(e) The overall percentage, 6.18, is lower than 6.21, the minimum percentage among the 20 universities shown.
Everything suggests that these are the 20 universities with the highest proportion of mobility students.
4. The following bar chart represents the distribution of cumulative frequencies of a certain variable:
(d) Calculate the mean and the standard deviation of this dataset.
(e) Calculate the mode, the median and the percentiles 20 and 80.
Solution:
ci      Absolute frequency
0       6
1       10
2       12
3       8
4       5
5       4
6       3
8       1
10      1
Total   50
(c) The distribution is skewed to the right (positive skewness).
(d) Use the frequency table to compute the mean and the standard deviation of this dataset.
\bar{x} = \frac{\sum_{i=1}^{k} c_i n_i}{n} = 2.68
s_x^2 = \frac{\sum_{i=1}^{k} c_i^2 n_i - n\bar{x}^2}{n-1} = 4.5485
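The two formulas can be evaluated directly from the frequency table. This is a sketch in plain Python; the variable names are illustrative, and the quasi-standard deviation is simply the square root of the quasi-variance.

```python
import math

# Frequency table of the exercise: values c_i and absolute frequencies n_i.
values = [0, 1, 2, 3, 4, 5, 6, 8, 10]
counts = [6, 10, 12, 8, 5, 4, 3, 1, 1]

n = sum(counts)                                           # 50

# Mean: sum of c_i * n_i, divided by n.
mean = sum(c * n_i for c, n_i in zip(values, counts)) / n

# Quasi-variance: (sum of c_i^2 * n_i  -  n * mean^2) / (n - 1).
sum_sq = sum(c ** 2 * n_i for c, n_i in zip(values, counts))
quasi_var = (sum_sq - n * mean ** 2) / (n - 1)

quasi_sd = math.sqrt(quasi_var)

print(round(mean, 2), round(quasi_var, 4), round(quasi_sd, 4))
```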
(e) Mode = 2
The sorted observations:
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
6, 8, 10
Median
If the variable of interest is the CA of the university in which graduates obtained their degree:
(c) What statistical measures can you obtain for such a variable?
Draw a Pareto chart to check whether the following claims, which refer to university graduates in the academic year
2009-2010, are true or false:
(c) 20 % of CCAA concentrate the universities from which more than 50 % of graduates come.
(d) From the universities of 35 % of the CCAA come less than 10 % of graduates.
Solution:
(a) The variable is qualitative nominal, and the population consists of all university graduates of the academic year 2009-2010.
(c) The only measure that makes sense for nominal qualitative data is the mode, which in this case is Madrid.
(a) We do not have enough information to know whether it is true or false. We would need to know at which university
they obtained their degree. We can say that all the universities in Madrid together accounted for 20.11% of that year's
graduates, but we do not know how that 20.11% is distributed among them.
(b) The median makes no sense for nominal data, since they cannot be ordered.
(c) True. Madrid, Andalucía and Cataluña, which represent 17.65% of the autonomous communities, concentrate
52.81% of graduates.
(d) True. The 6 communities (35.29%) with the lowest percentages, Asturias, Extremadura, Navarra, Balears, Cantabria
and La Rioja, account for only 8.6% of graduates.
6. Consider the following charts published in El Mundo³ about circulation data of the Spanish press (OJD, Oficina de Justificación
de la Difusión).
³ 24 September 2014. Source: blog Malaprensa
(b) Represent properly the data in such charts and compare the charts obtained with those that were published.
Solution. The vertical axis is truncated: it does not start at 0 but at 50000. In addition, the scale is not shown and,
worst of all, it is not respected either. The distances between the figures for El Mundo and El País look much
shorter than those between El Mundo and ABC, when the opposite is true. For example, looking only at the
last data point, for August, the difference in newsstand sales between El País and El Mundo was
42605, while the difference with ABC's sales was 22054, slightly more than half. The discrepancy in the total
circulation figures for August is even more pronounced: the difference with El País was 89486, while it was 30769 with
ABC.
The charts, respecting the scale and without truncating the axis, would look as follows:
7. The following table shows data from the Encuesta de Condiciones de Vida (INE) corresponding to the years 2014 and 2006
about the percentage of households facing economic hardship by CCAA.
CA                            2014   2006
Andalucía                     24.3   16.8
Aragón                        9.8    5.7
Asturias, Principado de       4.6    3.1
Balears, Illes                14.7   9.7
Canarias                      19.5   18.4
Cantabria                     15.2   9.3
Castilla y León               12.1   8.5
Castilla - La Mancha          15.9   10.7
Cataluña                      12.2   11.3
Comunitat Valenciana          18.0   12.3
Extremadura                   19.6   8.7
Galicia                       20.8   11.9
Madrid, Comunidad de          12.4   8.8
Murcia, Región de             22.7   14.1
Navarra, Comunidad Foral de   4.2    6.6
País Vasco                    11.5   5.2
Rioja, La                     12.9   6.6
Ceuta                         32.9   25.7
Melilla                       12.9   15.9
The following tables show information about the variable percentage of households facing economic hardship in each of the
observed periods:
2014 2006
(a) Represent the data of 2006 and of 2014 in histograms and compare their distributions. What differences do you find?
(b) To analyze the evolution of the percentage of households facing economic hardship in the period 2006–2014, obtain the
percentiles 20, 40, 60 and 80 for each year. Tabulate these data for each year, along with the minimum and maximum
values. What conclusions can you draw? Also, represent the data in the table as a chart.
(c) What central tendency measure is more adequate in each case and why?
(e) In which of the two periods, 2014 or 2006, do the Comunidad de Madrid and the Comunitat Valenciana show the worst
results relative to the situation in those years?
Solution.
(a) Taking 5 classes in both cases, we obtain the following histograms:
The percentages of households struggling to make ends meet increased from 2006
to 2014, and the distribution has also shifted: while in 2006 it showed positive skewness, in
2014 the skewness is very slight. The shift is seen above all by comparing the medians and the percentiles
(next part). While in 2006 the percentage of households in difficulty was greater than or equal to 9.7% in 50% of the CCAA,
this value rose by 5 percentage points, to 14.7%, in 2014.
Graphically:
(c) In this case, given the skewness of the 2006 data and the outlier detected in the 2014 data, it would be more
appropriate to use the median as the measure of central tendency. Moreover, the median is customarily used
for data of this nature.
The lower and upper limits for the 2014 data would be Q1 − 1.5 · IQR = 12.1 − 1.5 · 7.5 = 0.85 and Q3 + 1.5 · IQR = 19.6 + 1.5 · 7.5 = 30.85.
There are no unusually low values, but there is an unusually high one: Ceuta (32.9). There are no others. We check whether it is
an extreme outlier: LSe = Q3 + 3 · IQR = 42.1 > 32.9. It is not.
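The outlier analysis for the 2014 data can be reproduced from the table of the exercise. This sketch assumes the quartiles are taken with the default "exclusive" method of `statistics.quantiles` (positions (n + 1)/4 and 3(n + 1)/4), which is consistent with the value Q3 + 3 · IQR = 42.1 quoted in the solution; other quartile conventions may give slightly different limits.

```python
from statistics import quantiles

# Percentage of households facing economic hardship in 2014, by CA (from the table).
pct_2014 = {
    "Andalucía": 24.3, "Aragón": 9.8, "Asturias": 4.6, "Balears": 14.7,
    "Canarias": 19.5, "Cantabria": 15.2, "Castilla y León": 12.1,
    "Castilla - La Mancha": 15.9, "Cataluña": 12.2, "Comunitat Valenciana": 18.0,
    "Extremadura": 19.6, "Galicia": 20.8, "Madrid": 12.4, "Murcia": 22.7,
    "Navarra": 4.2, "País Vasco": 11.5, "La Rioja": 12.9, "Ceuta": 32.9,
    "Melilla": 12.9,
}

q1, median, q3 = quantiles(pct_2014.values(), n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
extreme_upper_limit = q3 + 3 * iqr

outliers = [ca for ca, x in pct_2014.items() if x < lower_limit or x > upper_limit]
extreme_outliers = [ca for ca, x in pct_2014.items() if x > extreme_upper_limit]
```

With these limits Ceuta is the only outlier, and it is not extreme because 32.9 lies below the extreme upper limit.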
(d) Comparing the variances, we would say 2014, with a variance of 46.6943 versus 29.1447 in 2006. However,
the percentages increased from one year to the other. Comparing the coefficients of variation, cv2014 = 6.833/15.5895 =
0.438 and cv2006 = 5.399/11.0158 = 0.490, gives a variation of 43.8% versus 49.0%, so the most appropriate answer is
2006.
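The comparison of variances and coefficients of variation can be recomputed from the table of the exercise. In this sketch the values are listed in the order of the table, and the variance quoted in the solution corresponds to the quasi-variance (denominator n − 1), which is what `statistics.variance` computes.

```python
from statistics import mean, stdev, variance

# Percentages by CA, in the order of the table (Andalucía ... Melilla).
pct_2014 = [24.3, 9.8, 4.6, 14.7, 19.5, 15.2, 12.1, 15.9, 12.2, 18.0,
            19.6, 20.8, 12.4, 22.7, 4.2, 11.5, 12.9, 32.9, 12.9]
pct_2006 = [16.8, 5.7, 3.1, 9.7, 18.4, 9.3, 8.5, 10.7, 11.3, 12.3,
            8.7, 11.9, 8.8, 14.1, 6.6, 5.2, 6.6, 25.7, 15.9]

var_2014 = variance(pct_2014)          # quasi-variance, about 46.69
var_2006 = variance(pct_2006)          # quasi-variance, about 29.14

# Coefficient of variation: standard deviation relative to the mean.
cv_2014 = stdev(pct_2014) / mean(pct_2014)
cv_2006 = stdev(pct_2006) / mean(pct_2006)

more_variable_in_absolute_terms = 2014 if var_2014 > var_2006 else 2006
more_variable_in_relative_terms = 2014 if cv_2014 > cv_2006 else 2006
```

The two criteria disagree because the mean also rose between 2006 and 2014; the coefficient of variation removes that change of level.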
(e) To compare the situation of these two communities taking into account the general situation in 2014 and in 2006,
we compute the standardized percentages:
z_{cv,2006} = \frac{12.3 - 11.0158}{5.399} = 0.238
z_{cv,2014} = \frac{18 - 15.5895}{6.833} = 0.353
z_{mad,2006} = \frac{8.8 - 11.0158}{5.399} = -0.41
z_{mad,2014} = \frac{12.4 - 15.5895}{6.833} = -0.467
The situation of the Comunitat Valenciana is worse in 2014, the year in which its standardized percentage is higher. For
the Comunidad de Madrid, however, the worst year is 2006.
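The standardized percentages can be recomputed with a few lines of Python. This sketch uses the means and quasi-standard deviations quoted in the solution; the dictionary layout and names are illustrative only.

```python
# (mean, quasi-standard deviation) of the percentages in each year, as quoted above.
summary = {2006: (11.0158, 5.399), 2014: (15.5895, 6.833)}

# Observed percentages for the two communities.
observed = {
    "Comunitat Valenciana": {2006: 12.3, 2014: 18.0},
    "Comunidad de Madrid": {2006: 8.8, 2014: 12.4},
}

# z-score: distance to that year's mean, in standard deviations.
z = {
    region: {year: (x - summary[year][0]) / summary[year][1]
             for year, x in by_year.items()}
    for region, by_year in observed.items()
}

# A higher z-score means a worse position relative to the other regions that year.
worst_year = {region: max(scores, key=scores.get) for region, scores in z.items()}
```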
(a) Is the sample mean of the 15 percentages larger than the sample median? If true, what does this result suggest? Justify
your answers.
(b) Calculate the three sample quartiles. Interpret them in terms of percentages.
(c) Compute the sample quasi-variance and coefficient of variation of the 15 percentages.
(d) Draw a box-plot of the data and identify the outliers (if any). Justify your answer.
Solution.
(a) The sample mean of the 15 percentages is x̄ = 5.96, while the sample median is M = 3.2. Hence the sample
mean is larger than the sample median. This result suggests that the distribution of the data is skewed to the
right (positive skewness), that is, there is a small number of small companies for which the percentage of
total annual sales to the large company is notably higher than for the rest. This skewness is corroborated
by the box plot (part d), even after removing the effect of the outlier.
(b) The sample quartiles are Q1 = x(4) = 1, Q2 = x(8) = 3.2 and Q3 = x(12) = 7.6, respectively. Hence
25% of the percentages are less than 1%, 50% are less than 3.2% and 75%
are less than 7.6%. Consequently, the three sample quartiles split the sample into four
sub-samples containing the same number of percentages. In general, for most small companies the percentage of
total sales to the company is below 7.6%.
(c) The sample quasi-variance is s² = 53.2668, while the sample coefficient of variation is cv = 1.2245.
(d) To build the box plot we need the interquartile range, IQR = Q3 − Q1 = 7.6 − 1 =
6.6. Moreover, to draw the whiskers and to detect outliers, if any, we need the values
Q1 − 1.5 · IQR = 1 − 1.5 · 6.6 = −8.9 and Q3 + 1.5 · IQR = 7.6 + 1.5 · 6.6 = 17.5. In addition, the minimum and maximum values in
the sample are 0.1 and 27, respectively. Hence there is a single outlier, since 17.5 < 27. The box plot
is shown next.
The box plot shows the skewness of the distribution, which persists even when the unusually high value is removed, as
seen in the following box plot. Note also the values of the coefficient of skewness in the two cases, with and without the
outlier, and the change in the less robust measures (mean, standard deviation, range...).
9. (June 2015 exam) The following tables contain information about the GDP and the unemployment rate of the Spanish
Autonomous Regions:
(c) Determine the group of Autonomous Regions formed by the 15 % with higher GDP.
(d) Draw the box-plot of the unemployment rate. What can you tell about the shape of the distribution?
(e) From the previous box-plot, decide if there are outliers and/or extreme outliers in the data. Identify the Autonomous
Regions that can be considered outliers and/or extreme outliers.
Solution.
(a) Some descriptive statistics for the unemployment rate are missing from Table 2: mean 10.5, median 10.3,
quasi-standard deviation 3.4, Q3 = 11.2, P15 = 7.0.
(b) Since the units of measurement and the range of values are very different for GDP and the unemployment rate, the
quasi-standard deviation is not a good descriptive measure for comparing their variabilities. It is better to use a
dimensionless measure, such as the coefficient of variation (CV). In this case,
cv(GDP) = 3251.1 / 17530.7 = 0.19; cv(unemployment rate) = 3.4 / 10.5 = 0.32
Hence the relative variation of the unemployment rate is larger.
(c) The group of Autonomous Regions formed by the 15 % with the highest GDP consists of those whose GDP exceeds the
85th percentile, that is, 21360.3. Three regions satisfy this condition: Navarra, País Vasco and Madrid.
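The comparison of variabilities in part (b) reduces to two divisions. A minimal sketch, assuming only the summary statistics quoted in the solution (the helper name `coefficient_of_variation` is illustrative):

```python
def coefficient_of_variation(std_dev, mean):
    """Dimensionless dispersion: standard deviation relative to |mean|."""
    return std_dev / abs(mean)

# Summary statistics quoted in the solution.
cv_gdp = coefficient_of_variation(3251.1, 17530.7)
cv_unemployment = coefficient_of_variation(3.4, 10.5)

print(round(cv_gdp, 2), round(cv_unemployment, 2))  # 0.19 0.32
print(cv_unemployment > cv_gdp)                     # True
```

Because each ratio has the same units in numerator and denominator, the two results can be compared even though GDP and the unemployment rate are measured on very different scales.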
10. (May 2016 Exam) The following tables contain information about 10 companies of the IBEX 35. In particular, three variables
are shown: X1 =average remuneration of the governing board, X2 =average remuneration of senior management and
X3 =average expenditure per employee (in millions of euros). Source: El País, 8th May 2016.
Table 1
Company      X1      X2      X3
BBVA         0.985   1.144   0.455
ACS          0.667   0.540   0.401
FCC          0.720   0.650   0.323
Inditex      1.270   1.730   0.231
Acciona      0.463   0.590   0.390
Santander    1.484   2.580   0.586
IAG          1.220   2.440   0.809
Iberdrola    0.920   1.979   0.894
Ferrovial    1.330   1.800   0.391
Telefónica   1.240   1.869   0.491
[Figure 1: box-plots A and B]
Table 2 (entries marked ? are not given)
                 X1      X2      X3
Mean             1.030   ?       0.497
Median           ?       1.765   0.428
Standard dev.    0.333   0.756   0.210
Variance         ?       0.572   0.044
Q1               0.770   0.774   0.390
Q3               1.263   1.952   ?
(c) Which of the three variables is the most dispersed? Justify your answer.
(e) Match the box-plots A and B of Figure 1 with the corresponding variables (X1 , X2 , X3 ). Justify your answer.
(f ) It is known that the correlation between X1 and X3 is 0.175 and, on the other hand, that the covariance between X2
and X3 is 0.093. Is it true that the linear relationship between X3 and X1 is stronger than between X3 and X2 ? Justify
your answer. (Note: this question is from Chapter 3)
Solution.
(a) The completed table is:
Statistics
                      X1      X2      X3
Mean                  1.030   1.532   0.497
Median                1.103   1.765   0.428
Standard deviation    0.333   0.756   0.210
Variance              0.111   0.572   0.044
Percentile 25         0.770   0.774   0.390
Percentile 75         1.263   1.952   0.562
(b) The mean of variable X2 is smaller than its median. Therefore, its distribution is skewed to the left (negative
asymmetry).
(c) The means of the three variables are very different, so dispersion should be measured with Pearson's coefficient
of variation. In this case CV(X1) = 0.323, CV(X2) = 0.494 and CV(X3) = 0.423, so the variable with the greatest
dispersion is X2.
(d) Variable X3 has positive asymmetry (skewed to the right), so any outliers will lie in the right tail of X3. The
outliers are the values above Q3 + 1.5 × IQR = 0.562 + 1.5 × 0.172 = 0.82 (or, alternatively,
Q3 + 1.5 × IQR = 0.586 + 1.5 × 0.196 = 0.88). There is one outlier, the company Iberdrola.
(e) Box-plot A corresponds to variable X1 (for example, looking at the values of the median or of Q1), while box-plot
B corresponds to variable X3 (looking at the values of the median, the value of the outlier, etc.)
a) Draw the box-plot for X2 and identify the outliers (if any). Justify your answer.
b) Determine if X1 and X2 have the same type of asymmetry. Justify your answer.
d) If 1 euro = 1.14 dollars, calculate the average salary for those CEOs (in million euros) and the variance.
Solution.
a) Variable X2 takes values from min(X2) = 29.97 to max(X2) = 158.31, with mean x̄2 = 97.26. In addition,
Me(X2) = (100.31 + 106.11) / 2 = 103.21, Q1(X2) = 64.21, Q3(X2) = 113.69, IQR(X2) = 49.49.
There are no outliers (nor extreme outliers) because Q3(X2) + 1.5 IQR(X2) = 187.91 and Q1(X2) − 1.5 IQR(X2) < 0. The
box plot is:
b) X1 and X2 show different types of asymmetry. Specifically, X1 has a positively skewed distribution because x̄1 =
24.998 > Me(X1) = 22.975, whereas X2 is negatively skewed, since x̄2 = 97.26 < Me(X2) = 103.21.
c) The clear difference between box-plots A, B and C lies in the number of outliers and extreme outliers, so it is
enough to find how many outliers variable X1 has.
We compute the fences for X1: Q1(X1) − 1.5 IQR(X1) = 18.65, so there are no outliers in the left tail and we
discard box-plot C. As for the right tail, Q3(X1) + 1.5 IQR(X1) = 27.53 (and Q3(X1) + 3 IQR(X1) = 30.86), so 44.91
is an extreme outlier. We therefore choose box-plot A.
d) Let Y = salary of the best-paid executive (in millions of euros). Since 1 euro = 1.14 dollars, Y = X1 / 1.14.
Because Y is a linear transformation of X1, its mean and variance follow from those of X1:
ȳ = x̄1 / 1.14 = 24.998 / 1.14 ≈ 21.93 and s²(Y) = s²(X1) / 1.14².
12. (May 2017 exam) The following table shows the values of the Human Development Index (HDI) for different countries of
Africa, America and Europe in the year 2015.
África 0,348 0,411 0,413 0,416 0,419 0,646 0,666 0,684 0,69 0,698 0,721 0,724 0,736 0,772 0,777
América 0,483 0,666 0,679 0,714 0,715 0,772 0,78 0,783 0,785 0,79 0,793 0,827 0,847 0,919 0,923
Europa 0,693 0,751 0,754 0,761 0,771 0,899 0,907 0,907 0,908 0,916 0,916 0,922 0,923 0,93 0,944
(a) Find the three quartiles for each of the three continents and decide if there are any outliers in the data of each continent.
(b) Draw the box-plot of the American data in the following picture. Determine the shape of each distribution and compare
them. Which measures of centrality and variability are more appropriate in each case? Do not calculate them.
(c) Justify the truthfulness or falseness of the following statements. Apply the quartiles to justify your answers.
1) 50 % of African countries in the table have an HDI that is below the level reached by any of the European countries
in the table.
2) 75 % of American countries in the table have an HDI that is above the level reached by any African countries in the
table.
(d) The HDI can be classified as follows: Very High [0,8; 1); High [0,7; 0,8); Medium [0,55; 0,7) and Low [0; 0,55). The
contingency table for the variable continent (X) and the variable HDI in categories (Y) is depicted below:
X / Y      Low [0; 0,55)   Medium [0,55; 0,7)   High [0,7; 0,8)   Very High [0,8; 1)
África     5               5                    5                 0
América    1               2                    8                 4
Europa     0               1                    4                 10
What percentage of countries with high or very high HDI belongs to Europe? And what percentage of countries with
an HDI of less than 0,55 belongs to the African continent?
Solution.
(a) Since there are n = 15 observations, the quartiles occupy positions (1/4) · 16 = 4, (1/2) · 16 = 8 and
(3/4) · 16 = 12. Hence,
Africa:  Min = 0.348  Q1 = 0.416  Q2 = 0.684  Q3 = 0.724  Max = 0.777
America: Min = 0.483  Q1 = 0.714  Q2 = 0.783  Q3 = 0.827  Max = 0.923
Europe:  Min = 0.693  Q1 = 0.761  Q2 = 0.907  Q3 = 0.922  Max = 0.944
Writing IQR for the interquartile range, UL = Q3 + 1.5 IQR for the upper limit and LL = Q1 − 1.5 IQR for the lower limit:
Africa:  IQR = 0.308 ⇒ UL = 1.186 > max and LL = −0.046 < min (no outliers)
America: IQR = 0.113 ⇒ UL = 0.9965 > max and LL = 0.5445 > min (it has an outlier, 0.483)
Europe:  IQR = 0.161 ⇒ UL = 1.1635 > max and LL = 0.5195 < min (no outliers)
(b) With the measures above, the following box plots are obtained:
The three distributions are skewed to the left, to a greater or lesser extent; the American one, apart from its low
outlier, is the most symmetric. The joint plot makes it easy to compare the HDI in the three continents: America is
the most homogeneous (apart from the outlier) and Europe reaches the highest median, while Africa shows the greatest
variability and the lowest HDI.
Given the more pronounced asymmetry of Africa and Europe and the outlier in America, the most appropriate measures of
centrality and variability are the most robust ones: the median and the IQR, respectively.
There is no need to compute the measures of centre and variability; it is enough to choose robust measures, and the
box plots allow a comparison without the exact values. The medians clearly increase:
Md(Africa) < Md(America) < Md(Europe). As for variability, the plot also shows clearly how the widths of the boxes
compare. Once the effect of the outlier is removed, the variability of America and Europe is rather similar, that of
America being the smaller. The variability of Africa is clearly larger, which reflects greater differences in the
level of development reached by the countries of that continent.
(c) To assess the statements we need Md(Africa), Min(Europe), Q1(America), Max(Africa), Md(Europe) and Q3(America).
1) True. Md(Africa) = 0.684 < 0.693 = Min(Europe), so 50 % of the African countries in the table have an HDI below
the minimum value among the European countries.
2) False. Q1(America) = 0.714 < Max(Africa) = 0.777, so we cannot guarantee that the most developed 75 % of the
American countries exceed the highest level reached by the African countries, which is 0.777. Note that
Q3(America) = 0.827 > Max(Africa) = 0.777 only guarantees that the best-placed 25 % of the American countries reach
at least 0.827 and therefore exceed the African maximum. In fact, only 9 American countries exceed 0.777, which is
60 %.
(d) There are 14 + 17 = 31 countries with a High or Very High HDI, of which 10 + 4 = 14 are European. Hence
14/31 = 0.4516 and the requested percentage is 45.16 %.
There are 6 countries with an HDI below 0.55, of which 5 are African. Hence 5/6 = 0.8333 and the requested
percentage is 83.33 %.
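The quartile positions and outlier limits of part (a) can be reproduced from the data in the exercise statement. The sketch below assumes the position rule used in the solution (quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4), which only works directly when n + 1 is divisible by 4; the function name is hypothetical.

```python
hdi = {
    "Africa":  [0.348, 0.411, 0.413, 0.416, 0.419, 0.646, 0.666, 0.684,
                0.69, 0.698, 0.721, 0.724, 0.736, 0.772, 0.777],
    "America": [0.483, 0.666, 0.679, 0.714, 0.715, 0.772, 0.78, 0.783,
                0.785, 0.79, 0.793, 0.827, 0.847, 0.919, 0.923],
    "Europe":  [0.693, 0.751, 0.754, 0.761, 0.771, 0.899, 0.907, 0.907,
                0.908, 0.916, 0.916, 0.922, 0.923, 0.93, 0.944],
}

def quartiles_by_position(sorted_values):
    """Q1, Q2, Q3 taken at 1-based positions (n+1)/4, (n+1)/2, 3(n+1)/4."""
    n = len(sorted_values)
    if (n + 1) % 4 != 0:
        raise ValueError("this sketch only handles n + 1 divisible by 4")
    step = (n + 1) // 4
    return tuple(sorted_values[k * step - 1] for k in (1, 2, 3))

summary = {}
for continent, values in hdi.items():
    data = sorted(values)
    q1, q2, q3 = quartiles_by_position(data)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    summary[continent] = (q1, q2, q3, outliers)
    print(continent, q1, q2, q3, round(lower, 4), round(upper, 4), outliers)
```

Other software may use a different interpolation rule for quartiles, so results for other sample sizes can differ slightly from this positional rule.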
Topic 3: Analysis of bivariate data
1. Bivariate data.
2. Tabular methods.
a. The two-way table. Absolute and relative frequency tables.
b. Marginal and conditional frequencies.
c. The two-way table with quantitative variables.
3. Charts and numerical summary:
a. Qualitative variables: bar charts (clustered, stacked)
b. Qualitative variable and quantitative variable:
i. Multiple box-plots, histograms.
ii. Multiple numerical summaries.
c. Quantitative variables:
i. Scatterplots.
ii. Types of relationships between numerical variables
iii. Measures of linear association: covariance and correlation.
Notation:
• Joint absolute freq. for classes ci of X and c’j of Y: nij
• Marginal absolute freq. for class ci of X: ni. = ni1 + … + nim
• Marginal absolute freq. for class c’j of Y: n.j = n1j + … + nkj
• Sample size: n.. = n
Example:
• Sample: 10 madrileños
• Variable X: Educational level attained (1=Below secondary, 2=Secondary, 3=Post-
secondary)
• Variable Y: Employment status (1=Employed, 2=Unemployed, 3=Inactive)
X/Y     1   2   3   Total
1       0   0   2   2
2       1   0   4   5
3       2   0   1   3
Total   3   0   7   10
Another example:
• Sample: 1508 madrileños.
• Variable X: Educational level attained (1=Below secondary, 2=Secondary, 3=Post-
secondary)
• Variable Y: Employment status (1=Employed, 2=Unemployed, 3=Inactive)
Conditional frequency distributions
• Given the joint distribution of (X, Y), the frequency distribution of one of the variables, with the value of the
other variable known and fixed, is a conditional distribution.
• Notation: Y | X = ci or X | Y = c’j. The symbol “|” means “given the following condition”.
• We compute the relative frequencies only over the observations that satisfy the condition; that is, we divide by
the size of the conditioning group (414 in this example) rather than by the full sample size (1508).
Another example:
We can also condition on one variable taking more than one value:
• We add their absolute frequencies
Another example:
• Sample: 1000 USA firms
• Variable X: Sales volume. Variable Y: Number of workers
X and Y are discrete quantitative variables that take a large number of distinct values ⇒ grouped
data
Measure of linear association for quantitative variables: covariance
The covariance (sxy) is a measure of association between two variables. It quantifies the
information in a scatter plot on the linear association between them.
The correlation coefficient, rxy = sxy / (sx sy), rescales the covariance by the standard deviations of X and Y.
• Advantages:
o It is bounded: −1 ≤ rxy ≤ 1
o It does not depend on the units of measurement
o Interpretation:
▪ rxy > 0: Positive linear association
▪ rxy < 0: Negative linear association
▪ | rxy | = 1: Perfect linear association
▪ rxy = 0: X and Y are uncorrelated (no linear relationship)
Example. Three variables are measured over 91 countries: X = female life expectancy, Y = male
life expectancy and Z = GDP
• The covariances between the three pairs of two variables are sxy = 105.15, sxz = 50066.04
and syz = 57917.93.
• On the other hand, the correlations between them are rxy = 0.98, rxz = 0.64 and ryz = 0.65
• Therefore, even though the covariances of female and male life expectancy with GDP are larger than the covariance
between female and male life expectancies, the correlation is largest for this last pair of variables
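The contrast between covariance and correlation in this example can be illustrated numerically. The sketch below uses made-up height and weight data (not the life expectancy data of the example) and assumes the quasi-covariance with divisor n − 1; the choice of divisor cancels out in the correlation.

```python
from math import sqrt

def covariance(x, y):
    """Sample (quasi-)covariance with divisor n - 1 (an assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data, for illustration only.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 88.0]
height_cm = [100 * h for h in height_m]   # same heights, different units

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
r_m, r_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)

print(cov_m, cov_cm)   # the covariance is multiplied by 100
print(r_m, r_cm)       # the correlation is unchanged

# A linear change with negative slope only flips the sign of the correlation.
r_neg = correlation([-h for h in height_m], weight_kg)
```

Changing the units of one variable rescales the covariance but leaves the correlation as it was, which is why correlations, not covariances, are compared across pairs of variables.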
Example: PISA 2012
• Sample: 64 countries participating in the 2012 PISA tests
• X: Countrywide mean reading score
• Y: Countrywide mean math score
We have:
• The covariance between X and Y is sxy = 2440.78
The correlation coefficient does not change if we introduce a linear transformation on one or both
variables.
The correct answer is 'True'.
Question 2
For two variables X and Y, you are given the following frequency table:
Frequencies
Y = 0   Y = 1   Y = 2
Question 3
For a bivariate sample from two random variables, X and Y, if all the conditional means of Y for
each of the different values of X are equal, they are also equal to the marginal mean of Y .
The marginal mean of Y can be written as the sum of the conditional means times their marginal
frequencies, divided by the sample size.
The correct answer is 'True'.
Question 4
For two nonnegative variables X and Y, if the relative frequency of X = 1 | Y = 0 is 0.25 and that of
X = 0 | Y = 0 is 0.1, then the relative frequency of X <= 1 | Y = 0 is 0.35.
If we condition on the same value of the variable, we can add the frequencies, because the denominator used to
compute each conditional frequency (the marginal frequency of Y = 0) is the same in all cases.
The correct answer is 'True'.
Statistics I
Exercises for Topic 3
Academic year 2020/21
- Short Answers
These short answers do not include in general comments related to the interpretation of the solutions. Nevertheless, some
interpretations are provided for the exercises corresponding to exams from previous years.
Exercises
1. Answer(s):
# of children \ Income: (0, 1000] (1000, 2000] (2000, 3000] (3000, 5000] # children marginal distribution
0 0,15 0,05 0,03 0,02 0,25
1 0,10 0,20 0,10 0,05 0,45
2 0,05 0,10 0,05 0,03 0,23
3 0,02 0,03 0,02 0,00 0,07
Income marginal distribution 0,32 0,38 0,20 0,10 1
b) Conditional distribution for Y |X ≥ 2:
Income | # children ≥ 2: (0, 1000] (1000, 2000] (2000, 3000] (3000, 5000] Total
fr(c'j | x ≥ 2) 0,233 0,433 0,233 0,1 1
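The conditional distribution in part b) pools the rows that satisfy the condition and divides by their total. A minimal sketch using the joint relative frequencies of the table above (the function name `conditional_on_rows` is hypothetical):

```python
income_classes = ["(0, 1000]", "(1000, 2000]", "(2000, 3000]", "(3000, 5000]"]

# Joint relative frequencies: key = number of children, one value per income class.
joint = {
    0: [0.15, 0.05, 0.03, 0.02],
    1: [0.10, 0.20, 0.10, 0.05],
    2: [0.05, 0.10, 0.05, 0.03],
    3: [0.02, 0.03, 0.02, 0.00],
}

def conditional_on_rows(table, rows):
    """Distribution of the column variable given that the row variable is in `rows`."""
    n_cols = len(next(iter(table.values())))
    pooled = [sum(table[r][j] for r in rows) for j in range(n_cols)]
    total = sum(pooled)   # marginal frequency of the condition
    return [p / total for p in pooled]

cond = conditional_on_rows(joint, rows=[2, 3])
for label, value in zip(income_classes, cond):
    print(label, round(value, 3))
```

The pooled frequencies are 0.07, 0.13, 0.07 and 0.03, and the divisor is the marginal frequency of having two or more children, 0.30.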
2. Answer(s):
a) Marginal distribution for Y = Weekly number of credit card purchases:
Purchases 0 1 2 3 4 Total
nj 36 72 69 69 54 300
Mean, quasivariance and quasi-standard deviation:
ȳ = 2,11, s2y = 1,6634, sy = 1,29.
b) Distribution of the number of credit cards:
# credit cards ni
1 117
2 105
3 78
Total 300
The mode is equal to 1.
c) Distribution of the number of purchases by persons with 3 credit cards:
# purchases | (# cards = 3) 0 1 2 3 4 Total
fr(yj | x = 3) 0,038 0,115 0,231 0,308 0,308 1
Conditional mean: ȳ|(x = 3) = 2,731
d) Conditional frequencies:
Y = # purchases
Cond. distr. 0 1 2 3 4 Total
Y |X = 1 0,205 0,333 0,231 0,154 0,077 1
Y |X = 2 0,086 0,229 0,229 0,257 0,200 1
Y |X = 3 0,038 0,115 0,231 0,308 0,308 1
Conditional means:
ȳ|(x = 1) = 1,564, ȳ|(x = 2) = 2,257
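The conditional means can be tied back to the marginal mean of part a). The sketch below uses joint absolute frequencies reconstructed from the rounded conditional frequencies and the group sizes 117, 105 and 78; these counts are an assumption, since the original joint table is not reproduced here.

```python
purchases = [0, 1, 2, 3, 4]

# Reconstructed joint counts (assumption): key = number of credit cards.
counts = {
    1: [24, 39, 27, 18, 9],
    2: [9, 24, 24, 27, 21],
    3: [3, 9, 18, 24, 24],
}

def conditional_mean(row):
    """Mean number of purchases within one group of card holders."""
    return sum(y * n for y, n in zip(purchases, row)) / sum(row)

group_sizes = {x: sum(row) for x, row in counts.items()}
cond_means = {x: conditional_mean(row) for x, row in counts.items()}

# Marginal mean = conditional means weighted by group size, divided by n.
n = sum(group_sizes.values())
marginal_mean = sum(cond_means[x] * group_sizes[x] for x in counts) / n

print({x: round(m, 3) for x, m in cond_means.items()})  # {1: 1.564, 2: 2.257, 3: 2.731}
print(n, round(marginal_mean, 2))                        # 300 2.11
```

The weighted average recovers ȳ = 2,11, which is the identity used in Question 3 of the quiz: if all conditional means were equal, the weighted average would equal that common value.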
3. Answer(s):
a) Contingency table:
(−34,5, −29,5] (−29,5, −24,5] (−24,5, −19,5] (−19,5, −14,5] Total
Madrid 2 3 1 0 6
Castilla y León 0 1 0 1 2
Castilla-La Mancha 0 0 2 2 4
Total 2 4 3 3 12
b) Conditional frequencies for the sales changes by autonomous region:
(−34,5, −29,5] (−29,5, −24,5] (−24,5, −19,5] (−19,5, −14,5]
Madrid 0,33 0,5 0,17 0
Castilla y León 0 0,5 0 0,5
Castilla-La Mancha 0 0 0,5 0,5
c) Marginal distribution for the changes in sales:
(−34,5, −29,5] (−29,5, −24,5] (−24,5, −19,5] (−19,5, −14,5]
frj 1/6 1/3 1/4 1/4
d) Conditional distributions depend on the region. Castilla-La Mancha had the smallest changes, equally distributed among
the largest intervals, while Madrid had the largest changes.
e) Scatterplot:
Galicia
Female 20,0 % 60,0 % 10,0 % 10,0 % 100,0 %
Male 50,0 % 30,0 % 15,0 % 5,0 % 100,0 %
Total Galicia 35,0 % 45,0 % 12,5 % 7,5 % 100,0 %
Madrid
Female 15,0 % 30,0 % 35,0 % 20,0 % 100,0 %
Male 5,0 % 15,0 % 35,0 % 45,0 % 100,0 %
Total Madrid 10,0 % 22,5 % 35,0 % 32,5 % 100,0 %
Total both 22,5 % 33,8 % 23,8 % 20,0 % 100,0 %
All the interpretation questions can be answered from this bar chart.
The differences between regions are shown in the following bar plot:
5. Answer(s):
a) Scatterplot:
6. Answer(s):
a) Scatterplot:
(c) The conditional mean is x̄ | (Y < 50) = 0,2609, and the mode is 0.
(d) Boxplot a) corresponds to those households for which X = 2, b) to those where X = 0 and c) to those having X = 1.
(e) Histogram I corresponds to c) (X = 1), II to a) (X = 2) and III to b) (X = 0).
(f) The standardized incomes (zi = (xi − x̄)/s) are 0,3045, 0,1592, −3,3487 respectively. The household with an income
equal to 75,000 (X = 2) is the poorest in relative terms.
10. (May 2016 Exam) See the answer for question (f) in the problem sheet for Lesson 2 (Exercise 12).
11. (May 2017 Exam) See the answer for question (d) in the problem sheet for Lesson 2 (Exercise 14).
12. (June 2017 Exam)
Answer(s):
a) The scatterplot shows high dispersion levels for v when p is between 0,6 and 0,8. Also, the correlation coefficient
is quite small. Being conservative in our interpretation, we may conclude that there is no clear linear relationship
between the variables.
b) From the data we can obtain the limits for the corresponding boxplots. They are [0,55, 0,91] for p and [0, 0,68] for v.
We observe in the scatterplot that there exist values of p beyond these limits, both above and below them. There are
also outliers for v above 0,60, but there are no outliers for small values of this variable.
c) The mean and median values do not change, and this may seem to indicate that the distributions could be symmetric.
Looking at the histograms we notice that the distribution of p has positive asymmetry and the distribution of v has
a slight negative asymmetry.
d) Comparing the corresponding relative frequencies
2239 / (2239 + 14803) ≈ 0,13 < 0,25 ≈ 9027 / (9027 + 27686),
we see that the percentage of votes in polling stations with high participation is twice that for the low participation
polling stations, supporting the conclusion of the analysts.
13. (June 2018 Exam)
Answer(s):
The modal interval would be [1800, 2100).
b) We will compare the coefficients of variation. Their values are:
cvI = sI / |x̄I| = 0,3723     cvII = sII / |x̄II| = 0,4096
Both values are very similar, but the variability in Region I is slightly lower.
c) We will conduct this comparison by using the histograms and frequency polygons for both regions.
From these graphs we observe that Region I has a negative asymmetric distribution, while Region II has a slightly
positive asymmetric distribution. This is confirmed by the signs of their asymmetry coefficients.
An interpretation of these differences is that Region I seems to present larger income inequalities, while having larger
income levels.
d) We compute the conditional means using the class marks:
x̄|(y = 1) = (900 · 12 + 1500 · 6 + 2100 · 3 + 2700 · 1 + 3300 · 6) / (12 + 6 + 3 + 1 + 6) = 1735,7 euros
x̄|(y = 2) = (900 · 2 + 1500 · 3 + 2100 · 12 + 2700 · 5 + 3300 · 8) / (2 + 3 + 12 + 5 + 8) = 2380 euros
x̄|(y = 3) = (900 · 3 + 1500 · 3 + 2100 · 3 + 2700 · 9 + 3300 · 22) / (3 + 3 + 3 + 9 + 22) = 2760 euros
x̄|(y = 4 or more) = (900 · 1 + 1500 · 0 + 2100 · 2 + 2700 · 5 + 3300 · 14) / (1 + 0 + 2 + 5 + 14) = 2945 euros
The average incomes increase with the number of family members.
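The four conditional means of part d) follow the same pattern, a weighted mean of class marks. A short sketch with the frequencies taken from the formulas above (the dictionary keys and the function name are illustrative):

```python
class_marks = [900, 1500, 2100, 2700, 3300]   # euros

# Absolute frequencies per income class, by number of family members.
counts_by_size = {
    "1": [12, 6, 3, 1, 6],
    "2": [2, 3, 12, 5, 8],
    "3": [3, 3, 3, 9, 22],
    "4 or more": [1, 0, 2, 5, 14],
}

def grouped_mean(marks, counts):
    """Mean of grouped data: class marks weighted by their frequencies."""
    return sum(m * c for m, c in zip(marks, counts)) / sum(counts)

means = {size: grouped_mean(class_marks, c) for size, c in counts_by_size.items()}
for size, value in means.items():
    print(size, round(value, 1))
```

The result for four or more members is 2945.45 before rounding, which the answer reports as 2945 euros.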
(a) The right boxplot includes an outlier. We can use the lower limit for outliers in each case to tell the two
boxplots apart: LL = Q1 − 1,5 × IQR, where IQR = Q3 − Q1.
For employees up to 35 years old, Q1 = 17 and Q3 = 31. Thus, IQR = 31 − 17 = 14 and the lower limit is
LL = 17 − 1,5 × 14 = −4 < 0. We conclude that in this group there are no lower outlying observations. These data
correspond to the left boxplot.
For employees older than 35: Q1 = 34 and Q3 = 41. Thus, IQR = 41 − 34 = 7 and the lower limit is LL = 34 − 1,5 × 7 =
23,5. In this case there is an outlier (x = 21). The right boxplot is the one representing these values.
(b) Median = 32,5, CV = 9,42/29,95 = 0,31, Q1 = Q3 - IQR = 38 − 17 = 21, Range = 43 − 14 = 29.
Symmetry: As the mean is smaller than the median, this suggests the presence of negative asymmetry.
(c) (1) Raise salaries by 1000 euros: y = x + 1.
Mean: 29,95 + 1 = 30,95. Median: 32,5 + 1 = 33,5. SD = not affected by this transformation.
(2) Raise salaries by 10%: z = 1,1x.
Mean: 1,1 × 29,95 = 32,94. Median: 1,1 × 32,5 = 35,75. SD = 1,1 × 9,42 = 10,36.
(d) (i) Mean for employees older than 35: 37,7. Q3 for employees younger than 35: 29.
Thus, 75 % of the employees younger than 35 earn less than 29 (thousands of euros) each year, while the average
salary for older than 35 employees is 37,7. The statement is true.
(ii) Highest salary: 43. Q1 for these 20 employees: 21. Thus, 25 % of these 20 employees earn less than 21 (thousands
of euros), that is, less than 21,5, half the income of the employee with the highest salary. The statement is true.
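The rules applied in part (c) can be checked on any data set: adding a constant shifts the mean and the median but leaves the standard deviation alone, while multiplying by a constant scales all three. The salaries below are hypothetical values chosen only for illustration, not the data of the exercise.

```python
from statistics import mean, median, stdev

# Hypothetical yearly salaries in thousands of euros.
salaries = [14, 21, 25, 30, 32.5, 35, 38, 43]

def summary(values):
    """Return (mean, median, sample standard deviation)."""
    return mean(values), median(values), stdev(values)

base = summary(salaries)
shifted = summary([x + 1 for x in salaries])     # raise of 1000 euros: y = x + 1
scaled = summary([1.1 * x for x in salaries])    # raise of 10 %: z = 1.1 x

print(base)
print(shifted)   # mean and median move by 1, SD unchanged
print(scaled)    # mean, median and SD all multiplied by 1.1
```

With the figures of the exercise the same rules give a mean of 30,95 and an unchanged SD of 9,42 in case (1), and an SD of 10,36 in case (2).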
(a) Both variables are quantitative (numerical) continuous variables (and their frequencies are grouped by intervals).
(b) The frequency table including the marginal distributions is:
X\Y ≤ 60 (60, 80] (80, 100] (100, 150] > 150 ni· fi·
(50, 100] 20 18 2 1 0 41 0,155
(100, 200] 25 40 30 2 1 98 0,370
(200, 350] 5 10 15 25 3 58 0,219
(350, 500] 0 5 15 20 8 48 0,181
> 500 0 1 2 7 10 20 0,075
n·j 50 74 64 55 22 N = 265 1
f·j 0,189 0,279 0,242 0,208 0,083 1
(c) The modal interval for the monthly income is (100, 200]. The distribution of the home size conditioned to this interval
is given by
X\Y ≤ 60 (60, 80] (80, 100] (100, 150] > 150
(100, 200] 0,255 0,408 0,307 0,02 0,01
(d) The median interval for the home size is (80, 100]. The distribution of family income conditioned to this interval is
X\Y (80, 100]
(50, 100] 0,031
(100, 200] 0,470
(200, 350] 0,234
(350, 500] 0,234
> 500 0,031
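The marginal totals, the modal and median intervals, and the two conditional distributions of the last exercise can all be derived from the table of counts. The sketch below follows the answer in picking the modal interval by raw frequency; the variable and function names are illustrative.

```python
income_classes = ["(50, 100]", "(100, 200]", "(200, 350]", "(350, 500]", "> 500"]
size_classes = ["<= 60", "(60, 80]", "(80, 100]", "(100, 150]", "> 150"]

# Joint absolute frequencies: rows = income (X), columns = home size (Y).
counts = [
    [20, 18, 2, 1, 0],
    [25, 40, 30, 2, 1],
    [5, 10, 15, 25, 3],
    [0, 5, 15, 20, 8],
    [0, 1, 2, 7, 10],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

def median_class(totals):
    """Index of the first class whose cumulative frequency reaches n / 2."""
    cumulative = 0
    for index, total in enumerate(totals):
        cumulative += total
        if cumulative >= sum(totals) / 2:
            return index

modal_row = row_totals.index(max(row_totals))
median_col = median_class(col_totals)

size_given_income = [c / row_totals[modal_row] for c in counts[modal_row]]
income_given_size = [row[median_col] / col_totals[median_col] for row in counts]

print(n, income_classes[modal_row], size_classes[median_col])
print([f"{p:.4f}" for p in size_given_income])
print([f"{p:.4f}" for p in income_given_size])
```

The printed frequencies are unrounded to four decimals; the tables in the answer round to three decimals and adjust one entry so that each row adds up to exactly 1.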
Topic 4. Probability
1. Random experiments, sample space, elementary and composite events.
2. Definition of probability. Properties.
3. Conditional probability and Multiplication Law. Independence.
4. Law of Total Probability and Bayes’ Theorem.
Topic 4: Probability
Basic concepts: examples
• Random experiment: outcome of a die toss
o Sample space (possible outcomes) finite: Ω = {1, 2, 3, 4, 5, 6}
o Elementary events (sample points): {1}, {2}, . . ., {6}
o Composite events: e.g., A = “even outcome” = {2, 4, 6}, B = “outcome greater than 3”
= {4, 5, 6}
Basic operations with random events
Suppose that A and B are events of the sample space Ω
• Intersection of events: The intersection A ∩ B comprises all elements that are both in A and
B (A∩B: “A and B happen”)
o A and B are incompatible events if they have no element in common, i.e., if their
intersection is A ∩ B = ∅
• Union of events: The union A ∪ B comprises all elements that are in A or in B (A ∪ B: “A or B happens”)
• Difference of events: The difference A \ B comprises all elements of A that are not in B (A \
B: “A happens but not B”)
De Morgan’s Laws
Relations between the union, intersection and complementary events: the complement of A ∪ B is Ā ∩ B̄, and the
complement of A ∩ B is Ā ∪ B̄.
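Events of a finite sample space behave like sets, so the operations above and De Morgan's laws can be tried out with Python's built-in `set` type. The sketch uses the die-toss events A and B defined earlier; the `complement` helper is an illustrative name.

```python
omega = {1, 2, 3, 4, 5, 6}   # sample space of a die toss
A = {2, 4, 6}                # "even outcome"
B = {4, 5, 6}                # "outcome greater than 3"

def complement(event):
    """All outcomes of the sample space that are not in the event."""
    return omega - event

print(sorted(A & B))   # intersection: [4, 6]
print(sorted(A | B))   # union: [2, 4, 5, 6]
print(sorted(A - B))   # difference: [2]

# De Morgan's laws.
law_union = complement(A | B) == complement(A) & complement(B)
law_intersection = complement(A & B) == complement(A) | complement(B)
print(law_union, law_intersection)   # True True
```

Two events are incompatible exactly when `A & B` is the empty set.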
Three approaches/interpretations
Classical probability (Laplace’s Rule): It considers random experiments where all elementary
events are equiprobable. If event A has n(A) sample points, we define the probability of A as
P(A) = n(A) / n(Ω), the number of favourable outcomes divided by the number of possible outcomes.
Frequentist approach: If the experiment were to be repeated many times, the relative frequency
of event A happening would converge to its probability.
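The classical and frequentist views can be compared on the die-toss example. The sketch below computes the Laplace probability of an even outcome exactly and then estimates it by simulation; the seed and the numbers of tosses are arbitrary choices.

```python
import random
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]
A = {2, 4, 6}   # "even outcome"

# Classical probability: favourable outcomes over possible outcomes.
laplace = Fraction(len(A), len(omega))
print(laplace)   # 1/2

def relative_frequency(event, tosses, rng):
    """Share of simulated fair-die tosses that fall in the event."""
    hits = sum(1 for _ in range(tosses) if rng.choice(omega) in event)
    return hits / tosses

rng = random.Random(0)   # fixed seed so the run is repeatable
for tosses in (100, 10_000, 100_000):
    print(tosses, relative_frequency(A, tosses, rng))
```

The simulated frequencies fluctuate around 0.5 and tend to settle closer to it as the number of tosses grows, which is the frequentist reading of P(A).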
• Probability of the complementary:
P(Ā) = 1 − P(A)
• P(∅) = 0
• If (event A is included in event B) A ⊆ B ⇒ P(A) ≤ P(B)
• If A = {e1, . . ., en} is finite (or countably infinite) ⇒ P(A) = ∑i P({ei}), the sum over all ei in A (note: we’ll
write P({ei}) = P(ei))
• Probability of the union: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Independent events
A and B are independent if P(A ∩ B) = P(A)P(B)
Property: Suppose that P(B) > 0. Then, A and B are independent ⇐⇒ P(A | B) = P(A)
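Independence can be checked directly from the product condition. The sketch below works with a fair die and exact fractions; the events A and B are the ones from the basic-concepts example, while C is an extra event introduced here only for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Laplace probability of an event of a fair die toss."""
    return Fraction(len(event & omega), len(omega))

def independent(a, b):
    """True when P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}      # even outcome
B = {4, 5, 6}      # outcome greater than 3
C = {1, 2, 3, 4}   # outcome at most 4 (illustrative)

print(independent(A, B))   # False: P(A ∩ B) = 1/3 but P(A)P(B) = 1/4
print(independent(A, C))   # True:  P(A ∩ C) = 1/3 = (1/2)(2/3)

# Equivalent check through the conditional probability P(A | C).
p_a_given_c = prob(A & C) / prob(C)
print(p_a_given_c == prob(A))   # True
```

Using `Fraction` avoids rounding problems when comparing products of probabilities for equality.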
Conditional probability: Example
The following table shows the results of classifying a group of 100 executives according to their
weight and to whether or not they are hypertensive:
The conditional probability P(A | B) is the probability that A happens given that we know B has
happened: P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0
Fundamental theorems of probability calculus
Multiplication Rule
Useful to compute the probability that several events happen simultaneously when the conditional
probabilities are easy to calculate.
• P(A ∩ B) = P(A) P(B | A), if P(A) > 0
• P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A ∩ B), if P(A ∩ B) > 0
• It extends to calculate the probability of the intersection of n events A1, . . . , An
Multiplication Rule: examples
We draw successively two cards from a Spanish card deck. Probability that:
• the first card is a “copa”: P(A) = 12/48
• the second card is a “copa”, knowing that the first card was a “copa”: P(B | A) = 11/47
• both cards are “copas”: P(A ∩ B) = P(A)P(B | A) = 12/48 × 11/47
We throw twice a fair die. Probability that:
• we get a 1 in the first throw: P(C) = 1/6
• we get a 1 in the second throw, knowing that in the first we got a 1: P(D | C) = P(D) = 1/6
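Both examples can be verified with exact fractions, and the card example can also be checked by brute-force enumeration. The sketch assumes, as in the example, a 48-card deck with 12 cards per suit; suit and rank labels are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule for the two cards.
p_first_copa = Fraction(12, 48)
p_second_copa_given_first = Fraction(11, 47)
p_both_copas = p_first_copa * p_second_copa_given_first
print(p_both_copas)   # 11/188

# Brute-force check: enumerate all ordered pairs of distinct cards.
suits = ["oros", "copas", "espadas", "bastos"]
deck = [(suit, rank) for suit in suits for rank in range(1, 13)]
pairs = list(permutations(deck, 2))
favourable = sum(1 for first, second in pairs
                 if first[0] == "copas" and second[0] == "copas")
p_enumerated = Fraction(favourable, len(pairs))
print(p_enumerated)   # 11/188

# Two throws of a fair die are independent, so P(D | C) = P(D).
p_two_ones = Fraction(1, 6) * Fraction(1, 6)
print(p_two_ones)     # 1/36
```

The enumeration confirms that drawing without replacement changes the second factor (11/47 instead of 12/48), whereas the die throws keep the same probability.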
Law of Total Probability: if B1, B2,… ,Bk is a partition of the sample space such that P(Bi) ≠ 0, i = 1, …, k, and A
is any event, then
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + . . . + P(A | Bk )P(Bk )
Example: (cont.) Suppose that the chosen cookie package is defective. What is the probability that
it was packed in line A1?
P(A1 | D) = P(D | A1)P(A1)/P(D) = (0.01 × 0.35)/0.0197 = 0.17766
Given a partition of events of the sample space B1, B2, . . . , Bk , with P(Bi) ≠ 0, i = 1, . . . , k, and
given an event A with P(A) > 0, we have, for j = 1 . . . , k,
P(Bj | A) = P(A | Bj)P(Bj)/P(A) =
P(A | Bj)P(Bj)/ [P(A | B1)P(B1) + P(A | B2)P(B2) + . . . + P(A | Bk )P(Bk )]
• Prior probabilities of the Bj : P(B1), . . . , P(Bk )
The probability that the person has the disease is only 0.33%
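Bayes' theorem combines the priors and the conditional probabilities through the law of total probability. The sketch below first reproduces the cookie computation with the three numbers given in the notes, and then applies a small helper to a two-event partition whose values (1 % prevalence, 95 % detection rate, 5 % false positives) are purely hypothetical and are not the figures behind the 0.33 % result.

```python
def posteriors(priors, likelihoods):
    """Return (posterior probabilities, P(A)) for a partition B1..Bk.

    priors[j] = P(Bj), likelihoods[j] = P(A | Bj).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)                 # law of total probability
    return [j / evidence for j in joint], evidence

# Cookie example, using only the values stated in the notes.
p_a1_given_d = 0.01 * 0.35 / 0.0197
print(round(p_a1_given_d, 5))             # 0.17766

# Hypothetical diagnostic test: B1 = "has the disease", B2 = "does not".
post, p_positive = posteriors(priors=[0.01, 0.99], likelihoods=[0.95, 0.05])
print(p_positive, post)
```

Even with an accurate test, a low prior keeps the posterior probability of disease small, which is the effect the disease example in the notes points to.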
Basic concepts in Combinatorics
• Many random experiments can be defined on the basis of a finite number of elementary
equiprobable events. For example, many games of chance have this property
o In these cases, the computation of probabilities for specific events can be carried out
by counting the number of elementary events that belong to the events of interest
(Laplace rule)
• This count can be facilitated applying results from Combinatorics
o The study of different ways to arrange or configure the elements of a finite set, and of
counting the elements in the resulting configurations
Combinations
We call combinations of n elements choose m at a time, the subsets of m different elements that
can be selected from a set composed of n elements
• The order of the elements is not relevant in this case
The total number of different combinations for n elements choose m is
given by the combinatorial number: C(n, m) = n! / (m! (n − m)!)
Example: In a class of 35 students, a committee of three students is to be chosen.
Calculate the number of committees that can be formed: C(35, 3) = 35! / (3! 32!) = 6545
Combinations with repetition
The combinations with repetition of n elements choose m at a time are the different groups of m
elements that can be obtained from a set of n elements when repetition is allowed
• They can be interpreted, for example, as the different ways to assign n tasks to m individuals,
if a task can be assigned to more than one individual
The number of combinations with repetition of n elements
choose m is given by: CR(n, m) = C(n + m − 1, m) = (n + m − 1)! / (m! (n − 1)!)
Example: Obtain the number of possible outcomes obtained by throwing four indistinguishable dice
(n = 6, m = 4): CR(6, 4) = C(9, 4) = 126
Variations
The variations of n elements choose m at a time are defined as the different orderings of m elements
that can be generated from a set having n elements
• The order of the elements is relevant in this case
The total number of variations of n elements choose m
is given by: V(n, m) = n! / (n − m)! = n (n − 1) · · · (n − m + 1)
Example. Ten athletes compete in a race. Obtain the number of different ways the podium could be
configured with a first, a second and a third athlete: V(10, 3) = 10 × 9 × 8 = 720
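The three counting examples in this section (the committee, the four dice and the podium) can be checked numerically. The sketch below uses Python's `math.comb` and `math.perm` (available from Python 3.8); the variable names are illustrative only.

```python
import math

# Combinations: committees of 3 students out of 35 (order is irrelevant).
committees = math.comb(35, 3)

# Combinations with repetition of n = 6 faces taken m = 4 at a time,
# counted as C(n + m - 1, m).
n_faces, n_dice = 6, 4
dice_outcomes = math.comb(n_faces + n_dice - 1, n_dice)

# Variations: ordered podiums of 3 athletes out of 10.
podiums = math.perm(10, 3)

print(committees, dice_outcomes, podiums)
```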
1. [Newbold et al.] A mail-order company considers three possible errors when handling an order: A =
“a wrong item is delivered”, B = “the item is lost in transit”, and C = “the item is delivered
damaged”. It is assumed that A is independent of B and of C, and that B and C are mutually
exclusive. It is known that P (A) = 0.02, P (B) = 0.01, and P (C) = 0.04. Calculate the probability
that at least one of the mentioned errors happens for a given order.
Answer(s): 0.069.
2. The following are four possible results of a stock index in two consecutive days:
O1 : the index goes up both days.
O2 : the index goes up the first day and does not go up the second.
O3 : the index does not go up the first day and goes up the second.
O4 : the index does not go up either day.
a) Determine the sample space corresponding to the random experiment “Observe whether the
index goes up or not in two consecutive days”.
b) Consider the following events:
A: “The index goes up the first day”.
B: “The index goes up the second day”.
Answer(s):
a) Ω = {O1 , O2 , O3 , O4 }.
b) A ∩ B = {O1}; A ∪ B = {O1, O2, O3}; Ā = {O3, O4} and B̄ = {O2, O4}
c) 0.75.
3. From the experience of an online clothes shopping portal, it has been observed that, on average,
every 1000 visits result in 10 big sales (over 500 €) and 100 small sales. We assume that all visits
have the same probability of resulting in a big sale, and the same for a small sale.
a) Indicate the sample space corresponding to the random experiment “observe the result of a
visit to the portal”.
b) What is the probability that a visit results in a big sale?
c) What is the probability that a visit results in a small sale?
d ) What is the probability that a visit results in a sale?
Answer(s):
a) Ω = {VP, VG, NV}, where VP = “small sale”, VG = “big sale” and NV = “no sale”.
b) 0.01.
c) 0.10.
d ) 0.11.
4. In a market research study, a mobile phone company observed that 75% of its clients wanted the
SMS functionality, 80% wanted the capability to take pictures, and 65% wanted both.
a) What is the probability that a client wants at least one of the two functionalities?
b) What is the probability that a client who wants SMSs also wants to be able to take pictures?
c) What is the probability that a client who wants to be able to take pictures also wants SMSs?
Answer(s):
a) 0.9.
b) 0.8667.
c) 0.8125.
5. In a Spanish oposición, the opositores must develop a topic they can choose among 3 drawn at
random from a total of 85 topics. An opositor has prepared only 35 topics. What is the probability
that (s)he has prepared at least one of the 3 topics in her/his draw?
Answer(s): 0.8016.
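One way to reach the answer to exercise 5 is through the complementary event "none of the 3 drawn topics was prepared", counted with combinations. The sketch below assumes the draw is without replacement and equiprobable; the variable names are illustrative.

```python
import math

total_topics, prepared, drawn = 85, 35, 3
unprepared = total_topics - prepared

# Laplace rule: favourable draws / possible draws for "no prepared topic is drawn".
p_none_prepared = math.comb(unprepared, drawn) / math.comb(total_topics, drawn)

# Complement: at least one prepared topic among the 3 drawn.
p_at_least_one = 1 - p_none_prepared

print(round(p_at_least_one, 4))
```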
6. In order to estimate the audiences of a debate and a movie aired at nonoverlapping times, a TV
channel asked 2500 people whether they watched each of them: 2100 watched the movie, 1500 watched
the debate, and 350 did not watch any of the two programs. If we choose at random one of the
people surveyed:
a) What is the probability that this person watched both the movie and the debate?
b) What is the probability that this person watched the movie, knowing that (s)he watched the
debate?
c) knowing that (s)he watched the movie, what is the probability that this person watched the
debate?
Answer(s):
a) 0.58.
b) 0.9667.
c) 0.6905.
7. According to a study, 38% of Madrid households have a monthly income over 2000 € and 37%
between 1000 € and 2000 €. On the other hand, the percentage of households owning a second
residence is 6.4% among those with incomes of not more than 1000 €, 12.57% among households
with incomes between 1000 € and 2000 €, and 23.4% among households with incomes over 2000 €.
Answer(s):
a) 0.1514.
b) 0.5873.
c) 0.614571.
d ) 0.1008.
8. [Ross 2005] An insurance company classifies its clients in two groups, those who are accident-prone,
representing 20% of clients, and those who are not. Data indicate that, in a given year, 10% of
accident-prone clients have an accident, while only 5% of not accident-prone clients have it.
a) What is the probability that a new client has an accident during the first year?
b) If a new client has an accident during the first year, what is the probability that the client is
accident-prone?
c) If a new client does not have an accident during the first year, what is the probability that the
client is not accident-prone?
Answer(s):
a) 0.06.
b) 1/3.
c) 0.8085.
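The answers to exercise 8 follow from the law of total probability and Bayes' theorem. The following is a minimal sketch with the data of the statement; the dictionary keys and variable names are hypothetical labels chosen for illustration.

```python
# Prior probabilities of each group of clients.
prior = {"prone": 0.20, "not_prone": 0.80}
# P(accident in a year | group).
p_accident_given = {"prone": 0.10, "not_prone": 0.05}

# a) Law of total probability: P(accident).
p_accident = sum(prior[g] * p_accident_given[g] for g in prior)

# b) Bayes' theorem: P(prone | accident).
p_prone_given_accident = prior["prone"] * p_accident_given["prone"] / p_accident

# c) Bayes' theorem with the complementary evidence: P(not prone | no accident).
p_no_accident = 1 - p_accident
p_not_prone_given_no_accident = (
    prior["not_prone"] * (1 - p_accident_given["not_prone"]) / p_no_accident
)

print(round(p_accident, 4))
print(round(p_prone_given_accident, 4))
print(round(p_not_prone_given_no_accident, 4))
```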
9. [Ross 2005] An inspector in charge of a criminal investigation has a certainty of 60% that a certain
suspect is guilty. A forensic analysis reveals that the criminal was left-handed. It is known that 20%
of the general population is left-handed.
Answer(s):
a) 0.68.
b) 0.8824.
10. (Exam, May 2016) The AROPE indicator is computed using several social vulnerability factors and
measures whether a given household is at risk of poverty or social exclusion. During 2014, an NGO
attended 1,200,000 households, of which 156,000 were not under AROPE. Regarding the households that
did not suffer AROPE, 84% of them were not over-indebted. On the other hand, considering the
households that were under AROPE, 40% of them also suffered over-indebtedness.
(a) Calculate the number of households attended by the NGO that suffer over-indebtedness.
(b) Given that a household is not over-indebted, compute the probability that it is under AROPE.
(c) (Topic 5) A social worker can visit 20 households per day. Compute the probability that, in a
given day, 5 out of 20 households are not under AROPE.
(d) (Topic 5) Considering that 420 households can be visited per month, compute the probability
that at least 150 households suffer both AROPE and over-indebtedness.
Answer(s).
a) 0.3688 of the households attended, that is, 442,560 households
b) 0.8270
c) 0.0713
d) 0.6517
11. (Exam, June 2016) In a city of 3.5 million inhabitants there are three urban transport systems:
metro, bus and tram. In general, on a working day, the numbers of travellers are 1,500,000 for the
metro, 750,000 for the bus and 450,000 for the tram. Moreover, it is known that 30% of metro
travellers also use the bus, 10% of metro travellers also use the tram and 5% of metro travellers
also use both bus and tram. Finally, 15% of bus travellers also use the tram. (Hint: An inhabitant
may or may not use urban transport.)
(a) Calculate the probability that, in a working day, an inhabitant uses only one of the three
transport systems.
(b) Calculate the probability that, in a working day, an inhabitant uses at least one transport
system.
(c) When only one transport system is used, there is a 2% probability of having a delay of more than
5 minutes in a working day. However, the probability of having such a delay rises to 7% when
combining more than one transport system in a working day. Calculate the probability that an
inhabitant suffers a delay of more than 5 minutes in a working day.
(d) With the same information as in part (c) and given that a traveller suffered a delay of more than
5 minutes, calculate the probability that this traveller took more than one transport system.
Answer(s).
a) 0.4286
b) 0.5893
c) 0.0198
d ) 0.5681
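Parts (a) to (c) of exercise 11 can be reproduced with the inclusion-exclusion principle. The sketch below is one possible reading of the statement: it assumes that the 30% and 10% figures include the travellers who use all three systems, and that an inhabitant who uses no transport suffers no delay. All variable names are illustrative.

```python
population = 3_500_000
metro, bus, tram = 1_500_000, 750_000, 450_000

# Pairwise and triple overlaps, from the percentages in the statement.
metro_bus = metro * 30 // 100
metro_tram = metro * 10 // 100
all_three = metro * 5 // 100
bus_tram = bus * 15 // 100

# Inclusion-exclusion: users of at least one system.
at_least_one = metro + bus + tram - metro_bus - metro_tram - bus_tram + all_three

# Users of exactly one system: subtract both pairwise overlaps, add back the triple one.
only_metro = metro - metro_bus - metro_tram + all_three
only_bus = bus - metro_bus - bus_tram + all_three
only_tram = tram - metro_tram - bus_tram + all_three
only_one = only_metro + only_bus + only_tram

p_only_one = only_one / population                      # part (a)
p_at_least_one = at_least_one / population              # part (b)
p_several = (at_least_one - only_one) / population

# Part (c): law of total probability for the delay.
p_delay = 0.02 * p_only_one + 0.07 * p_several

print(round(p_only_one, 4), round(p_at_least_one, 4), round(p_delay, 4))
```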
Topic 5. Probability models
• Random variables: concept
• Discrete random variables:
o Probability function and distribution function
o Mean and variance of a discrete r.v.
• Continuous random variables:
o Density function and distribution function
o Mean and variance of a continuous r.v
• Probability models:
o Discrete probability models: Bernoulli, Binomial and Poisson.
o Continuous probability models: Uniform, exponential and normal.
o The Central Limit Theorem and applications (Normal approximation to the Binomial)
Topic 5
Random variables: concept
Properties
Let X be a discrete r.v. taking values in the set S with probabilities px = P{X = x} for x ∈ S. Then
0 ≤ px ≤ 1 for every x ∈ S, and the probabilities add up to one: Σ_{x ∈ S} px = 1.
Properties example:
A game consists of trying to toss 3 rings successively onto a stick. Participating costs 3 euros. Prizes
are 4 euros for one success, 6 euros for two and 30 euros for three. We assume that the probability of
landing a ring is 0.1 in each toss, and that the outcomes are independent.
We define the r.v. X as the net gain in the game. The sample space is
Ω = {(f, f, f), (a, f, f), (f, a, f), (f, f, a), (a, a, f), (a, f, a), (f, a, a), (a, a, a)}
where a denotes success and f failure. Hence, X only admits four possible outcomes, with the
following probabilities:
x          −3      1       3       27
P{X = x}   0.729   0.243   0.027   0.001
What is the probability of earning at least 3 euros, net of the 3 euros for participating?
P{X ≥ 3} = P{X = 3}+P{X = 27} = 0.027+0.001 = 0.028
What is the probability of not losing money?
P{X ≥ 0} = P{X = 1} + P{X = 3} + P{X = 27} = 0.243 + 0.027 + 0.001 = 0.271
or, equivalently,
P{X ≥ 0} = 1−P{X < 0} = 1−P{X = −3} = 1−0.729 = 0.271
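The probabilities in the ring game can be rebuilt from the binomial count of successes. The following sketch does so under the assumptions stated in the example (3 independent tosses, success probability 0.1); the dictionary and variable names are illustrative.

```python
from math import comb

p_success, n_tosses, cost = 0.1, 3, 3
prize = {0: 0, 1: 4, 2: 6, 3: 30}   # prize in euros by number of successes

# Probability function of the net gain X = prize - cost.
pmf = {}
for k in range(n_tosses + 1):
    prob_k = comb(n_tosses, k) * p_success**k * (1 - p_success) ** (n_tosses - k)
    pmf[prize[k] - cost] = prob_k

p_at_least_3 = sum(p for x, p in pmf.items() if x >= 3)   # P{X >= 3}
p_not_losing = sum(p for x, p in pmf.items() if x >= 0)   # P{X >= 0}

for x in sorted(pmf):
    print(x, round(pmf[x], 3))
print(round(p_at_least_3, 3), round(p_not_losing, 3))
```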
o For any a, b ∈ R, P{a < X ≤ b} = P{X ≤ b} − P{X ≤ a} = F(b) − F(a).
Example:
The distribution function of the r.v. X in the game example is
x      −3      1                       3                       27
F(x)   0.729   0.729 + 0.243 = 0.972   0.972 + 0.027 = 0.999   0.999 + 0.001 = 1
Properties
Discrete random variables: Variance
The variance of the discrete r.v. X is
V[X] = E[(X − E[X])²] = Σ_{x ∈ S} (x − E[X])² px
The square root of the variance is the standard deviation, denoted by S[X] = √V[X]
Properties
Example:
x    −3      1       3       27
px   0.729   0.243   0.027   0.001
We participate in a gamble where we have to pay 6 euros upfront. If, when tossing the coin from
the previous example 3 times, we get 1 tail, we earn 4 euros; if we get 2 tails, we earn 6 euros; and if we
get 3 tails, we earn 30 euros. What is the mean net gain?
Let Y be the r.v. “net gain in the gamble”. We have:
• If we don’t get any tails, X = 3, so Y = −6 with probability P{Y = −6} = P{X = 3} = 8/27
• If we get one tail, X = 1, so Y = −2 with probability P{Y = −2} = P{X = 1} = 4/9
As in the discrete case, the function F(x) gives the cumulative probabilities until the point x ∈ R, but
now it is a continuous function
Properties
• 0 ≤ F(x) ≤ 1 for every x ∈ R
• F(−∞) = 0.
• F(∞) = 1.
For a continuous r.v. X with distribution function F(x), the density function of X is the derivative of
the distribution function. The density is not a probability but is related to probability when you
integrate it (second property).
Properties
Continuous random variables: Expectation (mean)
Let X be a continuous r.v. taking values in S ⊆ R, with density function f(x). Then, the expectation
(mean) of X is
E[X] = ∫_S x f(x) dx
The square root of the variance is the standard deviation, denoted by S[X] = √V[X].
The same properties as for the variance of a discrete r.v. hold
Example
Probability models
• Discrete probability models: Bernoulli, Binomial and Poisson
• Continuous probability models: Uniform, exponential and normal
• Central Limit Theorem
Consider a random experiment with two possible outcomes, which we call “success” and “failure”
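Define the r.v. X = 1 if the outcome is a success and X = 0 if it is a failure. With p = P(success), we write X ∼ Ber(p).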
Then X ∼ Ber(1/2)
Example: An airline considers that passengers who buy a ticket for a certain flight have a probability
of 0.05 of not showing up. Define, for a randomly chosen passenger who buys a ticket for that flight,
The airline of the previous example has sold 80 tickets for the flight. The probability that a passenger
does not show up is 0.05. Define X = number of passengers showing up. Then (assuming
independence)
X ∼ B(80, 0.95)
The probability that all passengers show up is P{X = 80} = 0.95^80 ≈ 0.0165
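The airline example can be evaluated directly from the binomial probability function. This is a minimal sketch; `binomial_pmf` is a helper name introduced here for illustration, not a function from the course material.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P{X = k} for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_tickets, p_show = 80, 0.95

# Probability that all 80 passengers show up.
p_all_show = binomial_pmf(n_tickets, n_tickets, p_show)

# Expected number of passengers showing up: E[X] = n p.
expected_show = n_tickets * p_show

print(round(p_all_show, 4), expected_show)
```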
The probability that there are 20 accidents or less is
In the previous example, consider Y, the number of accidents on such a road over two consecutive
years
The distribution of the r.v. Y is Y ∼ P(2 × 25) = P(50)
Properties
Example:
We write X ∼ U[a, b].
Properties
Example: Distribution U[3, 5]
We write X ∼ Exp(λ)
The exponential distribution: distribution function
Assume that X ∼ Exp(λ). Its distribution function is F(x) = 1 − e^(−λx) if x ≥ 0, and F(x) = 0 if x < 0
Properties
Properties
N(0, 1) table
Example:
Let Z ∼ N(0, 1). Let us calculate some probabilities:
• P(Z < 1.5) = 0.9332. (table)
• P(Z > −1.5) = P(Z < 1.5) = 0.9332. (why?)
• P(Z < −1.5) = P(Z > 1.5) = 1 − P(Z < 1.5) = 1 − 0.9332 = 0.0668. (why not ≤?)
• P(−1.5 < Z < 1.5) = P(Z < 1.5) − P(Z < −1.5) = 0.9332 − 0.0668 = 0.8664.
Let X ∼ N(µ = 2, σ = 3). We want to calculate P(X < 4) and P(−1 < X < 3.5):
• First, we standardize the original r.v. as follows:
Z = (X − µ)/σ = (X − 2)/3, where Z ∼ N(0, 1)
• Then,
P(X < 4) = P(Z < (4 − 2)/3) ≈ P(Z < 0.67) = 0.7486
P(−1 < X < 3.5) = P((−1 − 2)/3 < Z < (3.5 − 2)/3) = P(−1 < Z < 0.5) = 0.6915 − 0.1587 = 0.5328,
where Z ∼ N(0, 1)
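The normal probabilities in the two examples above can also be computed without the table, using the error function from Python's `math` module. The helper names `std_normal_cdf` and `normal_cdf` are introduced here for illustration. Results can differ from table lookups in the third decimal, because the table rounds z to two decimals.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma), obtained by standardizing."""
    return std_normal_cdf((x - mu) / sigma)

# Standard normal examples.
p_less_15 = std_normal_cdf(1.5)
p_less_minus_15 = std_normal_cdf(-1.5)
p_between = p_less_15 - p_less_minus_15

# X ~ N(mu = 2, sigma = 3).
p_x_lt_4 = normal_cdf(4, 2, 3)
p_x_between = normal_cdf(3.5, 2, 3) - normal_cdf(-1, 2, 3)

print(round(p_less_15, 4), round(p_less_minus_15, 4), round(p_between, 4))
print(round(p_x_lt_4, 4), round(p_x_between, 4))
```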
sample of 4 packages having a loss of liquid between 3 % and 5 %. We have Y ∼ B(4, 0.6827), and
If the sample were of 5 packages, what would be the probability that at least one would have losses
between 3% and 5%? We have n = 5 and p = 0.6827. Therefore, Y ∼ B(5, 0.6827). Then,
P{Y ≥ 1} = 1 − P{Y = 0} = 1 − (1 − 0.6827)^5 ≈ 0.9968
The Central Limit Theorem refers to the limit of the sample mean X̄ of n independent and
identically distributed (i.i.d.) r.v.s X1, …, Xn with finite mean µ and standard deviation σ. It says that, for large
n, the distribution of X̄ is approximately normal, N(µ, σ/√n), whatever the distribution of the Xi.
Question 1
Given a collection of independent r.v.s Xi (i=1,…,n) that follow Poisson distributions with the same
parameter λ, then Y = X1 + ⋯ + Xn also follows a Poisson distribution with parameter λ/n.
Y follows a Poisson distribution with parameter nλ. Note that E[Xi]= λ and that implies E[Y]= nλ
Question 2
The variance of any r.v. X (if it exists) must satisfy Var(X) = E[X^2] − (E[X])^2 ≤ E[X^2]
Question 3
Let X and Y be two independent r.v.s, both of them with means equal to 1 and variances equal to 2.
The random variable W = Y − X has mean equal to 0 and standard deviation equal to 2.
W has a normal distribution, as it is a linear combination of normal r.v.s. Its mean is E[W] = E[Y] −
E[X] = 1 − 1 = 0 and its variance is Var(W) = Var(Y) + Var(X) = 2 + 2 = 4, implying that σW = √4 = 2
Question 4
A r.v. X is defined as the number of times you get a prize after 10 (independent) draws in a lottery.
Another r.v. Y is defined as the number of times you get a prize after 5 additional (independent)
draws. If p is the probability of getting a prize in one draw, the variance of X + Y takes the
value 15p(1−p).
X+Y follows a binomial distribution with parameters n=15 and p, and its variance is given
by Var(X+Y) = np(1−p).
1. The random variable X = number of children in a randomly chosen family from a certain city has the following probability
function:
X P(X=x)
0 0.47
1 0.30
2 0.10
3 0.06
4 0.04
5 0.02
6 0.01
Answer(s).
(d) µY = 1655
(e)
z        P(Z = z)
0        0.47
350      0.30
1400     0.10
3150     0.06
5600     0.04
8750     0.02
12600    0.01
E(Z) = 959 euros, V(Z) = 4281669 euros² and SD(Z) = √V(Z) = 2069.219 euros.
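The mean, variance and standard deviation reported in part (e) can be verified from the probability function of Z. The sketch below takes the table of Z as given; the names `pmf_z`, `mean_z`, `var_z` and `sd_z` are illustrative.

```python
from math import sqrt

# Probability function of Z, copied from the answer table.
pmf_z = {0: 0.47, 350: 0.30, 1400: 0.10, 3150: 0.06,
         5600: 0.04, 8750: 0.02, 12600: 0.01}

# E[Z] = sum of z * P(Z = z).
mean_z = sum(z * p for z, p in pmf_z.items())

# V[Z] = sum of (z - E[Z])^2 * P(Z = z).
var_z = sum((z - mean_z) ** 2 * p for z, p in pmf_z.items())

sd_z = sqrt(var_z)

print(round(mean_z, 2), round(var_z, 2), round(sd_z, 3))
```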
2. The following figure shows the density function of a continuous r.v. X.
(a) Calculate the probability that X is less than 1 using the graph.
(b) Calculate the probability that X is greater than 0.5 and less than 3/2, arguing analytically.
f(x) = 1 − x/2 if x ∈ (0, 2), and f(x) = 0 if x ∉ (0, 2).
3. The length in minutes of a phone call to a certain customer service is a continuous random variable with distribution
function
F(x) = 0 if x ≤ 0
F(x) = 1 − (2/3) e^(−2x/3) − (1/3) e^(−x/3) if x > 0
It is known that calls lasting more than 6 minutes receive a very low satisfaction rating, while those lasting less than 3
minutes receive a very high rating:
(a) Calculate the probability that the duration of a call lies between 3 and 6 minutes.
(b) Calculate the probability that the duration of a call is over 6 minutes.
(c) Knowing that an ongoing call has already lasted 3 minutes, what is the probability that it will be shorter than 6
minutes?
Answer(s).
(a) P(3 < X < 6) = 0.1552
(b) P(X > 6) = 0.0574
(c) P(X < 6 | X > 3) = 0.7292
4. In each of the following situations, say if the random variable X defined can be modeled as a binomial distribution. In
such a case, identify the parameter values n and p:
(a) We roll a die 100 times and X is the total number of 1's obtained.
(b) We draw one card from a deck of 52 cards and check if it is an ace. We repeat this procedure 10 times, without
putting the card back after each draw. X is the total number of aces drawn.
(c) 2% of oranges shipped in a certain place are rotten. Oranges are placed in bags of 10 units. We randomly select one
bag and take X to be the number of rotten oranges in the bag.
(d) In a box there are 2 red balls, 3 white balls and 2 green balls. We draw one ball at random, write down its colour and
put it back in the box. We repeat this procedure 10 times and take X to be the total number of white balls drawn.
(e) In a box there are 2 red balls, 3 are white and 2 are green. We select one ball at random, report its color and return
it to the box. We repeat this process 10 times and count the number of balls of each colour.
Answer(s).
5. (May 2017 exam) A firm has designed the following campaign to advertise a product on a global scale by mass email
sending: The firm will send one hundred thousand emails to potential customers, unrelated among themselves, offering the
product, which yields a profit of 70 € per unit sold. The firm assumes that, on average, one out of ten people receiving
the email will purchase the product. Answer the following questions, providing adequate justification.
(a) Specify a probabilistic model for the random variable Y, which models the profit that will be obtained with the
campaign.
(b) Calculate the mean, variance and standard deviation of the profit Y.
(c) Calculate (exactly or approximately) the probability that the profit Y exceeds 712000 €.
Solution.
(c) Since n is large, p > 0.1 and np > 5, then by the CLT:
P{Y > 712000} = P{(Y − 700000)/6640.78 > (712000 − 700000)/6640.78}
≈ P{Z > 1.81} ≈ 0.0351.
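The normal approximation in part (c) can be reproduced numerically. The sketch below assumes the model Y = 70 X with X ∼ B(100000, 0.1), which is consistent with the mean 700000 and standard deviation 6640.78 used in the solution; the variable names are illustrative. The result differs slightly from 0.0351 because z is not rounded to 1.81.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n_emails, p_buy, unit_profit = 100_000, 0.1, 70

# Number of sales X ~ B(n, p); profit Y = 70 X.
mean_y = unit_profit * n_emails * p_buy
sd_y = unit_profit * sqrt(n_emails * p_buy * (1 - p_buy))

# CLT: Y is approximately N(mean_y, sd_y).
z = (712_000 - mean_y) / sd_y
p_exceeds = 1 - std_normal_cdf(z)

print(round(mean_y), round(sd_y, 2), round(z, 2), round(p_exceeds, 4))
```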
6. A bank offers a deposit of 6000 euros with full liquidity. It has 25 subscribed deposits. If the probability that a customer
requests a full refund in a given day is 0.01, and refund requests are independent, how much money should the bank reserve
to ensure that refund requests in a given day are honored with a probability of at least 99%?
Answer(s). P99 = 2 and therefore the bank should reserve 12000 euros.
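The 99th percentile in exercise 6 can be found by accumulating binomial probabilities until the required level is reached. This is a minimal sketch assuming the number of refund requests follows B(25, 0.01); `binomial_cdf` is a helper name introduced here.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P{X <= k} for X ~ B(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

n_deposits, p_refund, deposit_amount = 25, 0.01, 6000

# Smallest number of refunds k such that P{X <= k} >= 0.99.
k = 0
while binomial_cdf(k, n_deposits, p_refund) < 0.99:
    k += 1

reserve = k * deposit_amount
print(k, reserve)
```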
7. A company rents a computer for periods of t hours, charging for it 600 euros per hour. The number of times the computer
breaks down in an hour is a random variable with a Poisson distribution and failure rate λ = 0.08 per hour. If the computer
breaks down x times in the t hours, the company must pay 50x² to fix it. Calculate the expected benefit of the company
as a function of t. For what value of t does the company obtain the maximum expected benefit?
Answer(s). E(B) = 596t − 0.32t². Thus, the maximum expected benefit is attained at t = 931.25 hours (38.8 days).
8. An insurance company receives on average 3 claims for accidents per day. It is assumed that the number of claims in a
random day follows a Poisson distribution.
a) Calculate the probability that more than two claims are received in two consecutive days.
b) If the amount to be paid for a random claim follows an exponential distribution with mean 500 euros, determine its
distribution function. Calculate the probability that the company has to pay for a claim more than 1200 euros.
Answer(s).
a) P {Y > 2} = 0.9380
b) F(x) = 0 if x < 0, and F(x) = 1 − exp(−x/500) if x ≥ 0.
P{C > 1200} = 0.091
c) Md = 346.5736
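The answers to exercise 8 can be checked with the Poisson probability function and the exponential distribution function. In the sketch below, the number of claims in two days is taken as Poisson with mean 2 × 3 = 6, and the median is computed as mean × ln 2; the function and variable names are illustrative.

```python
from math import exp, factorial, log

def poisson_pmf(k, lam):
    """P{Y = k} for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# a) Claims in two consecutive days: Y ~ Poisson(6).
lam_two_days = 2 * 3
p_more_than_two = 1 - sum(poisson_pmf(k, lam_two_days) for k in range(3))

# b) Claim amount C ~ Exp with mean 500 euros, so the rate is 1/500.
mean_claim = 500

def claim_cdf(x):
    """Distribution function of the claim amount."""
    return 1 - exp(-x / mean_claim) if x >= 0 else 0.0

p_over_1200 = 1 - claim_cdf(1200)

# c) Median of the exponential: the value m with claim_cdf(m) = 0.5.
median_claim = mean_claim * log(2)

print(round(p_more_than_two, 4), round(p_over_1200, 3), round(median_claim, 4))
```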
9. Based on her previous experience, the head of a construction company knows that the amount of a randomly chosen project
contract follows a uniform distribution in the interval (2C/3, 2C), where C is the project cost. What is the expected benefit
per project?
Answer(s). E(B) = C/3.
10. (May 2014 exam) In a company, each technical service visit to fix a computer system breakdown costs 350 euros, plus a
fixed monthly fee of 175 euros. The monthly average number of breakdowns is 9.5 with a standard deviation of 2.
a) Obtain the expectation and variance of the monthly repair cost (including the monthly fee).
b) Using Chebyshev's inequality, bound the probability that in a given month the cost of repair is lower than or equal to
2000 euros or greater than or equal to 5000 euros.
c) If we instead assume that the monthly cost of repairs is uniformly distributed with the expectation and variance in a),
calculate the probability of the previous part.
d) How can you explain the difference between the results in parts b) and c)?
Solution.
11. (Anderson et al., 2018) Wendy's fast-food chain has been recognized for having the fastest average service time among
fast-food restaurants. In a benchmark study, Wendy's average service time of 2.2 minutes was less than those of Burger
King, Chick-fil-A, Krystal, McDonald's, Taco Bell, and Taco John's (QSR Magazine website, December 2014). Assume that
the service time for Wendy's follows an exponential distribution.
a) What is the probability that a service does not exceed one minute?
b) What is the probability that a service is between 30 seconds and one minute?
c) Suppose Wendy's is considering a policy under which if a service time exceeds five minutes, the customer's order is free.
What is the probability that you will get a free meal?
Answer(s).
a) P{X ≤ 1} = 0.365
b) P{0.5 ≤ X ≤ 1} = 0.162
c) P{X ≥ 5} = 0.103
12. (Newbold et al., 2013) In Great Britain, a factory of 2000 employees has a rate of weekly accidents with a loss equal
to λ = 0.4 and the number of accidents follows a Poisson distribution. Get the average time between two consecutive
accidents. What is the probability that the time between two consecutive accidents is less than 2 weeks?
Answer(s). E(T) = 1/0.4 = 2.5 weeks and P{T < 2} = 0.5507
13. The manufacturing cost (in euro) of a certain product can be modeled as a random variable X with normal distribution
N (100, σ = 3). The sale price is independent of the manufacturing cost and varies depending on market conditions. Let Y
be the normal random variable N (129, σ = 6) indicating the unit sale price (in euro) of the product. Answer the following
questions:
a) Obtain the distribution of the benefit obtained with the sale of 10 product units. Do you need to assume any additional
hypothesis to answer?
Answer(s).
a) B ∼ N(290, σ = √450)
b) E(B) = 290 euros and SD(B) = √450 ≈ 21.21 euros.
c) P {B ≥ 320} ≈ 0.079
14. According to a bank's study, the number of bounced checks received in a bank branch follows a Poisson distribution, with
a mean of 10 bounced checks per day.
(a) A random sample of 200 branches is selected, for which the number of bounced checks received is recorded in a day.
What is the probability that the total number of bounced checks received is larger than 1900?
(b) What is the probability that a branch receives less than 3 bounced checks in a day?
Answer(s).
a) P ≈ 0.987
b) P = 0.0028.
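Both answers to exercise 14 can be reproduced as follows. Part (b) uses the exact Poisson probability function; part (a) is a sketch that assumes independent branches, so the total is Poisson with mean 200 × 10 = 2000 and is approximated by N(2000, √2000) without continuity correction. The helper names are illustrative.

```python
from math import erf, exp, factorial, sqrt

def poisson_pmf(k, lam):
    """P{X = k} for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def std_normal_cdf(z):
    """P(Z < z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

daily_mean, n_branches = 10, 200

# (a) Total over 200 branches: mean 2000, variance 2000 (normal approximation).
total_mean = n_branches * daily_mean
total_sd = sqrt(n_branches * daily_mean)
p_total_over_1900 = 1 - std_normal_cdf((1900 - total_mean) / total_sd)

# (b) One branch: P{X < 3} = P{X = 0} + P{X = 1} + P{X = 2}.
p_less_than_3 = sum(poisson_pmf(k, daily_mean) for k in range(3))

print(round(p_total_over_1900, 3), round(p_less_than_3, 4))
```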
15. (May 2010 exam) The transit time X (in minutes) for the buses of a certain city in a certain route is modeled as a uniform
random variable on the interval (30, 40).
a) Draw the probability density function of X. Indicate in the coordinate axes both the values of X with positive density
and the values of the density function.
b) What is the probability that the route transit time of a bus will be between 30 and 37.5 minutes?
c) We select 100 buses at random and are interested in the number of buses having a route transit time between 30 and
37.5 (we denote this random variable as Y ). Give the name of the distribution of Y and its parameter values. Calculate
the expectation and standard deviation of Y .
d) What is the (approximate) probability that less than 64 buses will have route transit times between 30 and 37.5?
Solution.
a) f(x) = 1/10, for all x ∈ (30, 40), and f(x) = 0, for all x ∉ (30, 40).
[Figure: plot of the density f(x), with x from −10 to 70 on the horizontal axis and density values from 0 to 0.12 on the vertical axis.]
b) 0.75
c) Y ∼ B(n = 100, p = 0.75), where success = {the bus has a route transit time between 30 and 37.5}. E(Y) = np = 75,
and SD(Y) = √V(Y) = 4.33
d) 0.0055 (by the CLT).
Topic 6. Introduction to statistical inference
• Statistical inference: objectives and basic concepts.
• Point estimation of parameters.
• Goodness of fit to a distribution. Graphical methods.
• Distribution of the sample mean.
• Confidence intervals for the population mean.
Topic 6: Introduction to statistical inference
Statistical inference
• Goal: Obtaining information about the parameters of a population from a sample from it.
• We identify the concept of statistical population with that of population for a random
variable (r.v.) X.
• The distribution of the population is the distribution of the r.v. X. For example, X may have
a normal distribution with parameters µ and σ, X ∼ N(µ, σ).
Sampling
• A sample is a finite subset of a population. The number of individuals in it is called the
sample size.
• The reasons to consider a sample instead of the entire population include the following:
o The elements of the population may exist conceptually, but maybe not in reality at a
given moment (population of defective parts that a machine will produce during its
lifespan).
o It can be economically infeasible to study the entire population.
o The study of the population would take an excessive time. Further, its characteristics
might change over time (electoral polls).
o The study might entail the destruction of elements studied (mean life of a type of light
bulb, mean breakpoint tension of a cable type, …).
Simple random sample
• Let X be an r.v. with distribution F. A simple random sample (s.r.s.) of size n of X is a set of
r.v. X1, …, Xn such that:
o X1, …, Xn have distribution F (Xi ∼ F, for i = 1, …, n).
o X1, …, Xn are mutually independent.
• Each realization x1, …, xn of such an s.r.s. is called a particular sample.
• A statistic is a function of the s.r.s. X1, …, Xn. Hence, a statistic is an r.v. (unlike a
parameter, which takes a fixed numerical value for a given population)
• An estimator is a statistic used to approximate a parameter
3.8 9.5 4.8 1.6 0.2 0.8 1.5
The sample mean of these values is x̅ = 3.171. Since E[X] = 4 here, the relative error of this
estimate of E[X] is (3.171 − 4)/4 = −0.207 (−20.7%).
If we add new elements to the above s.r.s., the sample mean changes: it tends to get closer to the
population mean.
Below, we see histograms of the possible values of the sample mean for samples of size n = 7 and
n = 17, respectively. What do such histograms suggest?
• Thus, the expected value of the sample mean X̅ is E[X], so we say that the sample mean is an
unbiased estimator of E[X].
• Further, since V[X̅] = V[X]/n, the larger n is, the more concentrated around E[X] the
distribution of the sample mean will be.
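Both properties can be checked by simulation. The sketch below is only an illustration under an assumption that is not taken from the notes: X is exponential with E[X] = 4, so V[X] = 16. For n = 7 and n = 17 it simulates many sample means and compares their average with E[X] and their variance with V[X]/n.

```python
import random
import statistics

def sample_means(n, reps, rng, mean_x=4.0):
    """Simulate `reps` sample means, each from an s.r.s. of size n.

    Assumption (for illustration only): X is exponential with E[X] = mean_x,
    so that V[X] = mean_x ** 2.
    """
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / mean_x) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

rng = random.Random(0)
for n in (7, 17):
    means = sample_means(n, reps=20000, rng=rng)
    # Average of the sample means (close to E[X] = 4) and their variance
    # (close to V[X]/n = 16/n).
    print(n, round(statistics.mean(means), 3),
          round(statistics.pvariance(means), 3), round(16 / n, 3))
```

The variance column shrinks as n grows, which is why the histogram for n = 17 is narrower than the one for n = 7.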
Application to the Bernoulli distribution
• The above results allow us to obtain statistics useful to estimate the parameters of the
distributions studied in Topic 5, from an s.r.s.
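• Let X be a Bernoulli r.v. with parameter p, X ∼ Ber(p). Since E[X] = p and V[X] = p(1 − p),
we estimate p with the sample mean of an s.r.s. X1, …, Xn, p̂ = X̅, for which E[p̂] = p and
V[p̂] = p(1 − p)/n.
Hence, if the sample size n is large, we can expect p̂ to be close to p (p̂ ≈ p).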
Example
Pablo wants to run for mayor of his town. To assess his chances, he takes a poll of n = 10 voters to
estimate the proportion of votes that he would obtain.
Consider the r.v. X = “Vote for Pablo”, taking the value 1 if the person says they will vote for
Pablo, and 0 otherwise, with X ∼ Ber(p).
He thus draws a sample of size n = 10, obtaining
1 0 0 1 1 0 1 0 1 0
From this particular sample, we obtain the estimate p̂ = 0.5 of p, the expected proportion of votes
that Pablo would obtain: p̂ = x̅ = (1+0+0+1+1+0+1+0+1+0)/10 = 5/10 = 0.5.
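The same computation as a short sketch. The variable names are illustrative; the data are the ten poll answers listed above.

```python
# Poll answers from the example: 1 = would vote for Pablo, 0 = would not.
votes = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# For a Bernoulli sample, the estimate of p is the sample mean,
# that is, the proportion of ones.
p_hat = sum(votes) / len(votes)
print(p_hat)  # 0.5
```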
Binomial distribution
• Let Y be a Binomial r.v. with parameters m and p, Y ∼ B (m, p):
• We know that Y is the sum of m independent Bernoulli r.v. X1, . . . , Xm with parameter p: Y
= X1 + · · · + Xm.
• Recall that E [Y] = mp and V [Y] = mp(1 − p).
• We will see how to estimate p.
• Given an s.r.s. Y1, …, Yn of size n of Y (recall that m is the number of Bernoulli trials behind
each Yi), we estimate p as follows: p̂ = Y̅/m = (Y1 + ⋯ + Yn)/(nm).
• Furthermore, by the properties of the sample mean, we have E[p̂] = E[Y]/m = p and
V[p̂] = V[Y]/(nm²) = p(1 − p)/(nm). Hence, if the number of Bernoulli trials, m, and/or the
number of binomial samples, n, is very large, we can expect p̂ to be close to p.
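A minimal sketch of this estimator, taking p̂ as the sample mean of the Yi divided by m, which follows from E[Y] = mp. The values m = 20, p = 0.3 and n = 500, and the helper names `binomial_draw` and `estimate_p`, are hypothetical and chosen only for illustration.

```python
import random

def binomial_draw(m, p, rng):
    """One B(m, p) value, built as the sum of m independent Bernoulli(p) trials."""
    return sum(1 for _ in range(m) if rng.random() < p)

def estimate_p(ys, m):
    """Estimate p as (sample mean of the Yi) / m = sum(Yi) / (n * m)."""
    return sum(ys) / (len(ys) * m)

rng = random.Random(42)
m, p, n = 20, 0.3, 500  # hypothetical values for illustration
ys = [binomial_draw(m, p, rng) for _ in range(n)]
print(round(estimate_p(ys, m), 4))  # should be close to p = 0.3
```

With these values the estimate is based on nm = 10,000 Bernoulli trials in total, which is why it lands close to p.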
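• That is why the following estimator, called the quasi-standard deviation, is also used. It is given
by s = √( ((x1 − x̅)² + ⋯ + (xn − x̅)²)/(n − 1) ), that is, it divides by n − 1 instead of n.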
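As a sketch of the difference between dividing by n and by n − 1, the code below computes both versions for the seven values used earlier in this topic (3.8, 9.5, 4.8, 1.6, 0.2, 0.8, 1.5). The names `s_n` and `s_quasi` are illustrative. The quasi-standard deviation corresponds to `statistics.stdev` in the Python standard library, and the version dividing by n to `statistics.pstdev`.

```python
import math
import statistics

data = [3.8, 9.5, 4.8, 1.6, 0.2, 0.8, 1.5]
n = len(data)
mean = sum(data) / n
squared_dev = sum((x - mean) ** 2 for x in data)

s_n = math.sqrt(squared_dev / n)            # divides by n
s_quasi = math.sqrt(squared_dev / (n - 1))  # quasi-standard deviation, divides by n - 1

# The standard library implements both versions.
assert math.isclose(s_n, statistics.pstdev(data))
assert math.isclose(s_quasi, statistics.stdev(data))
print(round(s_n, 3), round(s_quasi, 3))
```

The quasi-standard deviation is always slightly larger, by a factor √(n/(n − 1)) that tends to 1 as n grows.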
Example
• To carry out a statistical inference analysis, it is often assumed that data come from a certain
distribution type (example: normal). Yet, such an assumption should be properly justified.
• There are different methods that can be used for such a purpose, called goodness of fit
methods.
• Here we will only consider two very common graphical goodness of fit methods.
Histogram with density function
• The first is to compare a histogram of the data to the density function obtained with the
estimated parameters. If the hypothesis is true, then such a density function will be close to
the histogram.
For example, the following chart is obtained from 200 returns of a financial asset. The
chart shows the histogram and the normal density function obtained with the estimated parameters
(µ̂ = 0.83 and σ̂ = 4.12).
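The comparison can also be sketched numerically, without drawing the chart. The 200 returns are not reproduced in these notes, so the code below uses simulated data as a stand-in (an assumption), estimates µ and σ from it, and prints the height of each histogram bar next to the fitted normal density at the bar's midpoint. The helper `histogram_density` is illustrative.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(3)
# Simulated stand-in for the 200 returns (the real data are not available here).
returns = [rng.gauss(0.83, 4.12) for _ in range(200)]

# Normal distribution with the parameters estimated from the data.
fitted = NormalDist(mean(returns), stdev(returns))

def histogram_density(data, n_bins=10):
    """Return (midpoint, height) pairs; heights are scaled so the total area is 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        k = min(int((x - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[k] += 1
    return [(lo + (k + 0.5) * width, c / (len(data) * width))
            for k, c in enumerate(counts)]

for mid, height in histogram_density(returns):
    print(f"{mid:8.2f}  histogram={height:.4f}  normal pdf={fitted.pdf(mid):.4f}")
```

If the normality assumption is reasonable, the two columns should be similar bin by bin.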
QQ-plot
• The second method is based on a chart called QQ-plot. This plots the estimated quantiles
from the data vs. the theoretical quantiles for the distribution with the parameters estimated
from the sample.
• If the data come from the assumed distribution, then the points in the plot will be close to the
line y = x.
• If the distribution function F is continuous and increasing, the p-th quantile (0 < p < 1), denoted
by qp, is obtained by inverting the distribution function. Thus, if we look for the value qp such
that F(qp) = p, we obtain qp = F⁻¹(p).
• To estimate the p-th quantile from the data x1, …, xn, we use the sample quantile Qp: (1) sort
the data from smallest to largest, obtaining x(1), …, x(n); (2) then, take Qp = x([np]).
For example, the following chart shows the QQ-plot of 200 returns of a certain financial asset, where
the estimated quantiles are plotted against the quantiles for the normal distribution with parameters
µ̂ = 0.83 and σ̂ = 4.12.
The chart shows that the fit is quite good.
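The points of a QQ-plot can be computed as in the following sketch. Two choices here are assumptions: the data are simulated, since the returns are not reproduced in these notes, and [np] is read as np rounded up so that the index always lies between 1 and n. The helper names are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def sample_quantile(sorted_data, p):
    """Q_p = x_([np]), reading [np] as np rounded up (assumption) and clipped to 1..n."""
    n = len(sorted_data)
    k = min(max(math.ceil(n * p), 1), n)
    return sorted_data[k - 1]

def qq_points(data, n_points=19):
    """Pairs (theoretical quantile, sample quantile) for p = 1/(n_points + 1), 2/(n_points + 1), ..."""
    fitted = NormalDist(mean(data), stdev(data))  # parameters estimated from the sample
    xs = sorted(data)
    ps = [i / (n_points + 1) for i in range(1, n_points + 1)]
    return [(fitted.inv_cdf(p), sample_quantile(xs, p)) for p in ps]

rng = random.Random(5)
data = [rng.gauss(0.83, 4.12) for _ in range(200)]  # simulated stand-in data
for q_theory, q_sample in qq_points(data):
    print(f"{q_theory:8.3f} {q_sample:8.3f}")
```

When the data come from the assumed distribution, the two columns are close to each other, which is the numerical counterpart of the points lying near the line y = x.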
• If X has expectation E[X] and variance V[X], then, even if X does not have a normal
distribution, the Central Limit Theorem (CLT) ensures that, if X1, …, Xn is an s.r.s. of X, with n
large enough (as a rule of thumb, n ≥ 30), it holds approximately that X̅ ∼ N(E[X], √(V[X]/n)),
where the second parameter is the standard deviation.
Example
If X1, …, Xn is an s.r.s. of X with distribution Ber(p), for large n we have, approximately,
p̂ = X̅ ∼ N(p, √(p(1 − p)/n)).
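The normal approximation for the Bernoulli case, p̂ approximately N(p, √(p(1 − p)/n)), can be checked by simulation. In the sketch below the values n = 100 and p = 0.3 are hypothetical, and `simulate_p_hat` is an illustrative helper. It compares the mean and standard deviation of many simulated values of p̂ with those given by the CLT.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def simulate_p_hat(n, p, reps, rng):
    """Return `reps` values of p-hat, each from an s.r.s. of size n of Ber(p)."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

n, p = 100, 0.3  # hypothetical values for illustration
rng = random.Random(7)
p_hats = simulate_p_hat(n, p, reps=5000, rng=rng)

clt = NormalDist(p, math.sqrt(p * (1 - p) / n))  # N(p, sqrt(p(1 - p)/n))
print("mean of p-hat:", round(mean(p_hats), 4), "CLT:", clt.mean)
print("sd of p-hat:  ", round(pstdev(p_hats), 4), "CLT:", round(clt.stdev, 4))

# Share of simulated p-hat values in [0.25, 0.35] vs. the normal approximation.
inside = sum(0.25 <= v <= 0.35 for v in p_hats) / len(p_hats)
print(round(inside, 3), round(clt.cdf(0.35) - clt.cdf(0.25), 3))
```

The last two numbers agree only roughly, because p̂ is discrete and no continuity correction is applied in this sketch.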
Let X be a discrete r.v. with probability function:
We draw an s.r.s. of size n = 125 of X. What is the probability that the sample mean lies between
2.4 and 2.6?
We have
Confidence intervals
• Instead of a point estimator, it is more informative to give an interval of plausible values for
the unknown parameter.
• Given a sample, we would like to have a narrow interval of values that, with certainty, will
contain the true value of the population mean, µ. But that is not possible. Why?
• We will consider a method to construct random intervals from an s.r.s., such that about
100(1 − α)% of the intervals generated from different s.r.s. contain the true value of the
population mean µ. We will call 1 − α the confidence level and the intervals obtained confidence
intervals.
• We have generated 100 samples of size n = 50 from a distribution N(−2, 1). The following chart
shows the resulting 90% confidence intervals for µ. About 90% of them contain the true value
µ = −2.
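This experiment can be reproduced with a short simulation. The sketch assumes the interval for a normal population with known σ, x̅ ± z·σ/√n, where z is the N(0, 1) quantile of order 1 − α/2 (about 1.645 for 90%). The function `coverage` and its parameter names are illustrative.

```python
import math
import random
from statistics import NormalDist

def coverage(reps, n=50, mu=-2.0, sigma=1.0, level=0.90, seed=0):
    """Fraction of `reps` simulated confidence intervals that contain mu.

    Assumption: normal population with known sigma, interval x_bar +/- z * sigma / sqrt(n).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.645 for a 90% level
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / reps

print(coverage(100))    # 100 intervals, as in the chart
print(coverage(5000))   # with more intervals the fraction settles near 0.90
```

With only 100 intervals the fraction fluctuates around 0.90 from run to run. It is the long-run proportion that equals the confidence level.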
Confidence intervals with large samples
• What if the standard deviation is unknown or the population is not normal?
• When the sample size n is large, the CLT ensures that the distribution of the sample mean X̅ is
approximately normal, regardless of the distribution of the observations.
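• Hence, if data are not normal, for large samples we can use the following confidence interval
for the population mean, with confidence level 1 − α: x̅ ± z_{α/2} · s/√n, where s is the
quasi-standard deviation of the sample and z_{α/2} is the value that a N(0, 1) r.v. exceeds with
probability α/2.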
Example
In the example of the estimation of the Bernoulli parameter p, Pablo finally takes a poll of n = 100
voters and obtains the estimate p̂ = 0.4.
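The 95% confidence interval for p is p̂ ± z_{0.025} · √(p̂(1 − p̂)/n) = 0.4 ± 1.96 · √(0.4 · 0.6/100)
= 0.4 ± 0.096, that is, approximately (0.304, 0.496).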
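A minimal sketch of this computation, assuming the usual large-sample interval for a proportion, p̂ ± z·√(p̂(1 − p̂)/n), with z the N(0, 1) quantile of order 0.975 (about 1.96). The function name `proportion_ci` is illustrative.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, level=0.95):
    """Large-sample confidence interval for a Bernoulli parameter p.

    Assumption: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), justified by the CLT.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% level
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_ci(0.4, 100)
print(round(low, 3), round(high, 3))  # 0.304 0.496
```

Since the whole interval lies below 0.5, this poll does not support the claim that Pablo would obtain a majority of the votes, at the 95% confidence level.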