Stat210 FL17 LCN 1

Uploaded by

haifa almazrouie
STAT 210

Probability and
Statistics
Unit 1: Descriptive Statistics
Outline
 Introduction to Statistics:

 Graphical method:
Bar and pie charts, Histogram

 Summary Statistics:
Measures of location, measures of variability,
boxplot

STAT210: Probability and Statistics 2


Why Statistics?
 Statistics deals with collecting, processing, summarizing,
analyzing and interpreting data. On the other hand,
engineering and industrial management deal with such
diverse issues as solving production problems, effective use
of materials and labor, development of new products,
quality improvement and reliability and, of course, basic
research.
 The field of statistics involves methods for:
1. Designing and carrying out research studies.
2. Describing collected data.
3. Making decisions, predictions, or inferences about
phenomena represented by the data by designing valid
experiments and drawing reliable conclusions.
Branches of Statistics
1. Descriptive statistics: statistical
methods that summarize and describe the
prominent features of data.
2. Inferential statistics: statistical methods that
generalize results from a sample to a
population.



Sampling
 As it is generally impossible or impractical to find out something about
the entire population, we examine a part of it to make inferences.
 A population is the entire collection of objects or outcomes about
which information is sought.
 A sample is a subset of a population, containing the objects or
outcomes that are actually observed.
 A parameter is a numerical characteristic of a population, which is
usually unknown.
 A statistic is a numerical characteristic computed from the sample. It varies
from sample to sample and is used as an estimate of the population parameter.

Example: A researcher is interested in measuring the satisfaction of
customers with the internet connection in a certain city. He randomly
samples 50 customers from a list of subscribers. The population of
interest is all customers in the city, while the sample is the 50 selected
customers.
Data Collection
Besides organizing and analyzing data, statistics
deals with the development of techniques for
collecting the data. If data is not properly collected,
an investigator may not be able to answer the
questions under consideration with a reasonable
degree of confidence.
Observational Studies: The engineer simply observes the
process without disturbing it and records the quantities
of interest. Such a study may reveal a relationship between
input and output, but it cannot establish the relationship
between all factors because the factors were not
deliberately changed.
Controlled (Designed) Experiments: Measurements
are recorded while controlling some factors that
might influence the results of the study. The response
(output) variable of interest is then measured.

Surveys: Questionnaires designed to solicit
information from people. Data may be collected by
face-to-face interview, telephone interview, postal
mail, email, or fax.



Simple Random Sampling (SRS)
A simple random sample (SRS) of size n is a sample chosen by a
method in which each collection of n population items is equally
likely to comprise the sample.
 An SRS is not guaranteed to reflect the population perfectly.
 SRSs always differ in some ways from each other.
 Two samples from the same population may vary from each
other. This is known as sampling variation.
 Items in an SRS may be treated as independent in most
cases encountered in practice. The exception occurs when
the population is finite and the sample comprises a
substantial fraction (more than 5%) of the population.



Simple Random Sampling
Sampling with replacement: Replace each item after it is
sampled.
 The population remains the same on every draw. The
sampled units are truly independent.
 In the sample the researcher collected, 80% of users were
satisfied with their internet connection.
 In the population of customers, it is unlikely there will be
exactly 80% who are satisfied with their internet connection.
 It is more realistic to think that there will be somewhere
around 80% of the customers who are satisfied with their
internet connection.
 Another researcher repeats the study with a different SRS of
50 customers. She finds 90% are satisfied with their internet
connection.
 Did she do something wrong, or did the first researcher do
something wrong?
 Neither: this is sampling variation at work. Two different samples
from the same population will differ from each other and from the
population.
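
The sampling variation described above can be illustrated with a short simulation. This is only a sketch and not part of the original slides: the population size (10,000 customers), the true satisfaction rate (85%), and the seeds are hypothetical values chosen for illustration.

```python
import random

# Hypothetical population (assumed values): 10,000 customers, 85% satisfied.
population = [1] * 8500 + [0] * 1500  # 1 = satisfied, 0 = not satisfied


def srs_proportion(pop, n, seed):
    """Draw a simple random sample of size n (without replacement)
    and return the sample proportion of satisfied customers."""
    rng = random.Random(seed)
    sample = rng.sample(pop, n)  # every subset of size n is equally likely
    return sum(sample) / n


# Two researchers draw two different samples of 50 from the same population.
p_first = srs_proportion(population, 50, seed=1)
p_second = srs_proportion(population, 50, seed=2)
print(p_first, p_second)
```

Neither sample proportion is expected to equal the population value exactly, and the two will generally differ from each other, even though both samples are drawn correctly.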



Stratified Sampling
Sometimes alternative sampling methods can be used to
make the selection process easier, to obtain extra information,
or to increase the degree of confidence in conclusions.
 One such method, stratified sampling, entails separating
the population units into non-overlapping groups and
taking a sample from each one.
 For example, a manufacturer of TVs might want information
about customer satisfaction for units produced during the
previous year. If three different models were manufactured
and sold, a separate sample could be selected from each
of the three corresponding strata.
 This would result in information on all three models and
ensure that no one model was over- or underrepresented
in the entire sample.
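
A minimal sketch of stratified sampling for the TV example follows. The model names, stratum sizes, serial numbers, and the choice of 20 units per stratum are all hypothetical; the slides do not specify them.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample of n_per_stratum units from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}


# Hypothetical strata: serial numbers of the TVs sold for each of three models.
strata = {
    "Model A": [f"A{i}" for i in range(500)],
    "Model B": [f"B{i}" for i in range(300)],
    "Model C": [f"C{i}" for i in range(200)],
}

sample = stratified_sample(strata, n_per_stratum=20)
for model, units in sample.items():
    print(model, len(units))
```

Taking the same number of units from every stratum is one possible design. Sampling each stratum in proportion to its size is another common choice.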
Convenience Sampling
Frequently a convenience sample is obtained by selecting
individuals or objects without systematic randomization. Such
a sample is not drawn by a well-defined random method.
Example: A computer engineer receives a shipment of
1000 monitors in a huge container. He wants to test the
brightness of the monitors using a sample of 10.
The engineer takes 10 monitors from the top of the
container as the sample.
 Things to consider with convenience samples:
 They may differ systematically in some way from the population.
 Use them only when it is not feasible to draw a random
sample.



Types of Variable
 A variable is any characteristic whose value may change
from one object to another. The variables can be classified
as either quantitative or qualitative.
 Quantitative (Numerical) variables: A numerical quantity is
assigned to each item in the sample. Quantitative variables
can be classified as either discrete or continuous:
 A discrete variable is a variable whose possible values
can be listed, even though the list may continue
indefinitely. For example, the number of visits to a
particular Web site during a specified period, the
number of PCs owned by a family, or the number of
students in an introductory statistics class.
 A continuous variable is a variable whose possible
values form some interval of numbers. Typically, a
continuous variable involves a measurement of
something, such as the price of a laptop or the CPU time.
Qualitative (Categorical) variables: The sample items are
placed into categories, groups or levels.
Examples: brand of laptop owned by a student, the defective
status (defective or not), computer knowledge (beginner,
intermediate, expert), education level (less than high school,
high school, etc.).
Values of a qualitative variable are sometimes coded with numbers.
We cannot do arithmetic with such numbers, in contrast to those of a
quantitative variable.
Qualitative data can be classified as either nominal or ordinal. The
categories of ordinal data can be ranked or meaningfully ordered,
but the categories of nominal data cannot. Of the four
qualitative data sets listed above, brand of laptop and defective
status are nominal, while computer knowledge and education level
are ordinal.



Exercises
(1) An IT student, working on his thesis, plans a survey to determine
the proportion of all computer users who regularly scan flash disks
before using them. He decides to interview his classmates in the
three classes in which he is currently enrolled.
a) What is the population of interest?
b) Do the student's classmates constitute a simple random sample
from the population of interest?
c) What name have we given to the kind of sample that the student
collected?
d) Do you think that this sample proportion is likely to
overestimate or underestimate the true proportion of all computer
users who regularly scan flash disks before using them?



Exercises
(2) Are the following data quantitative or qualitative?
a) Number of hard drives a PC has.
b) Employment Status (employed, unemployed).
c) The price of a laptop.
d) Quality of an item (low, medium, high).



Graphical Methods
Descriptive statistics can be divided into two general areas:
graphical and numerical. In this part, we consider representing
a data set using graphical techniques.
Appropriate graphs are:
 For qualitative data: Bar chart and Pie chart
 For quantitative data: Histogram and Boxplot



Bar and Pie Charts
 Bar chart: A vertical or horizontal rectangle represents the
frequency for each category.
Height can be frequency, relative frequency, or percent
frequency.
In some cases, there will be a natural ordering of groups, for
example, freshmen, sophomores, juniors, seniors, graduate
students, whereas in other cases the order will be arbitrary, for
example, Dell, HP, etc.
What to Look For: Frequently and infrequently occurring
categories.
In Minitab: Graph - Bar Chart
 Pie chart: A circle divided into slices where the size of each slice
represents its relative frequency or percent frequency.
What to Look For: Categories that form large and small
proportions of the data set.
In Minitab: Graph - Pie Chart
Example
A quality manager uses a questionnaire to ask customers how
they rate the customer support services offered by the IT
Services center. The services are rated on a scale of
outstanding (O), very good (V), good (G), average (A), and
poor (P). The responses of 50 customers were:
GOVGAOVOVGOVAVOPVOGAOOOGOVVAGOVPV
OOGOOVOGAOVOOGVAG
The data are summarized in the following frequency table:
Rating        Frequency
Outstanding   19
Very good     13
Good          10
Average        6
Poor           2
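
The frequency table can be rebuilt from the raw responses with a few lines of Python. This is a sketch added for illustration: the response string is copied from the example, and the printed layout (frequency, relative frequency, percent) is an arbitrary choice.

```python
from collections import Counter

# The 50 one-letter responses from the example, joined into one string.
responses = "GOVGAOVOVGOVAVOPVOGAOOOGOVVAGOVPV" + "OOGOOVOGAOVOOGVAG"
labels = {"O": "Outstanding", "V": "Very good", "G": "Good",
          "A": "Average", "P": "Poor"}

counts = Counter(responses)   # frequency of each rating code
n = len(responses)            # number of customers

# Print frequency, relative frequency, and percent frequency per rating.
for code, label in labels.items():
    freq = counts[code]
    print(f"{label:<12} {freq:>3} {freq / n:>6.2f} {100 * freq / n:>5.0f}%")
```

The relative and percent frequencies are the quantities a bar chart or pie chart of these ratings would display.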



Exercise
The top three internet browsers in 2011 were Internet Explorer (IE),
Firefox (FF) and Chrome (GC); all other browsers are grouped as Others (OT).
Data indicating the preferred browser for a sample of 60 internet users follow.
GC FF FF IE IE IE IE GC OT GC
IE FF GC GC OT FF FF FF FF IE
GC FF FF OT FF FF IE GC FF FF
GC IE IE IE GC FF OT OT OT OT
FF IE IE IE OT IE FF OT IE FF
FF IE IE GC IE FF GC GC GC FF

(a) Are these data categorical or quantitative?
(b) Provide frequency and percent frequency distributions.
(c) Construct a bar chart and a pie chart.
(d) On the basis of the sample, which browser has the largest
share? Which one is second?



Histogram
Graphical display that gives an idea of the shape of
the data distribution.
The bars of the histogram touch each other. A space
indicates that there are no observations in that
interval.
What to Look For: Central or typical value, extent of
spread or variation, general shape, location and
number of peaks, presence of gaps and outliers.

In Minitab: Graph - Histogram



Shapes of Histogram
 A histogram is perfectly symmetric if its right half is a
mirror image of its left half.
 Histograms that are not symmetric are referred to as
skewed.
 A histogram with a long right-hand tail is said to be skewed
to the right, or positively skewed.
 A histogram with a long left-hand tail is said to be skewed
to the left, or negatively skewed.

A histogram is unimodal if it has only one peak, or mode, and
bimodal if it has two clearly distinct modes. Bimodality can occur
when the data set consists of observations on two quite different
kinds of individuals or objects. In principle, a histogram can have
more than two modes, but this does not happen often in practice.
[Figure: Shapes of Histogram]


Example
To evaluate the effectiveness of a processor for a certain type
of task, a researcher recorded the CPU time (in seconds) for n = 30
randomly chosen jobs:
70 36 43 69 82 48
34 62 35 15 59 139
46 37 42 30 55 56
36 82 38 89 54 25
35 24 22 9 56 19

Construct a histogram and describe the distribution of the CPU
times.



Example

The distribution of the CPU times is skewed to the right with one potential outlier.
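
The shape described here can be checked without plotting by counting the observations in each class. The sketch below is illustrative only: the class width of 20 seconds and the starting point of 0 are assumptions, and the histogram on the original slide may have used different classes.

```python
cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

width = 20                           # assumed class width, in seconds
edges = list(range(0, 141, width))   # class boundaries 0, 20, ..., 140
counts = [0] * (len(edges) - 1)

for x in cpu_times:
    counts[x // width] += 1          # x falls in [k*width, (k+1)*width)

# Text histogram: one row per class, one star per observation.
for lower, count in zip(edges, counts):
    print(f"[{lower:>3}, {lower + width:>3})  {'*' * count}")
```

With these classes, most jobs fall between 20 and 60 seconds, the counts thin out toward larger values, and the class from 100 to 120 is empty, which isolates the 139-second job as the potential outlier.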



Exercises
For each of the following data sets, draw a histogram and
determine whether the distribution is right-skewed,
left-skewed, or symmetric.
(1) 19, 24, 12, 19, 18, 24, 8, 5, 9, 20, 13, 11, 1, 12, 11, 10,
22, 21, 7, 16, 15, 15, 26, 16, 1, 13, 21, 21, 20, 19

(2) 17, 24, 21, 22, 26, 22, 19, 21, 23, 11, 19, 14, 23, 25,
26, 15, 17, 26, 21, 18, 19, 21, 24, 18, 16, 20, 21, 20, 23, 33

(3) 56, 52, 13, 34, 33, 18, 44, 41, 48, 75, 24, 19, 35, 27, 46,
62, 71, 24, 66, 94, 40, 18, 15, 39, 53, 23, 41, 78, 15, 35



Descriptive Statistics
Visual summaries of data are excellent tools for obtaining preliminary
impressions and insights. More formal data analysis often requires
the calculation and interpretation of numerical summary measures.
In practice, the entire population is rarely observed, so the population
parameters usually cannot be calculated directly. Instead, sample statistics
are used to estimate the parameters.
Percentages and proportions are used to summarize the distribution
of qualitative variables. For quantitative data, we will look at:
 Measures of location (center): mean, median, trimmed mean,
percentiles and quartiles.
 Measures of variability (spread): variance, standard deviation
(SD), range, interquartile range (IQR).

In Minitab, all summary statistics can be produced using:
Stat - Basic Statistics - Display Descriptive Statistics



Mean
Let $x_1, x_2, \ldots, x_n$ be the values of the sample data; then the
mean is the average of these values.
The sample mean, denoted by $\bar{x}$, is given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Similarly, the population mean, denoted by $\mu$, is given by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N is the population size.
Sometimes a sample may contain a few points that are much
larger or smaller than the rest. Such points are called outliers
and may affect the mean.
Median
The median is the value in the middle when the data are
arranged in ascending order (smallest value to largest value).
To find the median, order the values in the sample from
smallest to largest; then
 If n is odd, the sample median is the number in position
(n+1)/2.
 If n is even, the sample median is the average of the
numbers in positions n/2 and (n/2)+1.
Although the mean is the more commonly used measure of
central location, it is influenced by extremely small and large
data values. In such cases, the median is often the preferred
measure of central location.
Mean vs. Median
 Mean tends to be drawn in the direction of the tail of a
skewed distribution. The median is more appropriate when
the distribution is highly skewed.
 Mean can be greatly affected by the presence of outliers
whereas median is not.
 For symmetric distributions, mean and median are the
same.
 For skewed distributions, the mean lies towards the longer
tail relative to the median.
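
The comparison above can be tried on the CPU time data from the histogram example. This is a sketch using Python's standard `statistics` module; dropping the largest value to mimic "removing the outlier" is an illustrative choice, not a step taken in the slides.

```python
import statistics

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

# Mean and median of the full sample.
mean_all = statistics.mean(cpu_times)
median_all = statistics.median(cpu_times)

# Same statistics after dropping the largest observation (139).
without_max = sorted(cpu_times)[:-1]
mean_reduced = statistics.mean(without_max)
median_reduced = statistics.median(without_max)

print(round(mean_all, 2), median_all)
print(round(mean_reduced, 2), median_reduced)
```

For this right-skewed sample the mean (about 48.2) lies above the median (42.5). Removing the single largest value lowers the mean to about 45.1 but moves the median only to 42.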



Mode and Trimmed Mean
Mode:
The mode is the value which occurs most frequently in the
sample. There may be no mode or may be several modes.
 The mode is not affected by extreme values.
 Mainly used for grouped numerical data or categorical data.

Trimmed Mean:
 The trimmed mean is a measure of center that is designed to be
unaffected by outliers.
 With the trimmed mean, p% of the data is trimmed from each
end of the data set.
 First, arrange the sample values in (ascending or descending)
order. Then, trim an equal number of them (np/100 points)
from each end. Finally, compute the sample mean of the
remaining points.
Note: Minitab prints the 5% trimmed mean.
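
The three steps of the trimmed mean can be sketched as follows. The function name `trimmed_mean` is hypothetical, and truncating np/100 to a whole number is an assumption: the slides do not say what to do when np/100 is not an integer, and Minitab may use a different rounding rule.

```python
def trimmed_mean(values, p):
    """Mean after trimming p% of the points from each end.

    Assumption: n*p/100 is truncated to a whole number of points.
    """
    ordered = sorted(values)                  # step 1: order the values
    k = int(len(ordered) * p / 100)           # points dropped from each end
    kept = ordered[k:len(ordered) - k]        # step 2: trim both ends
    return sum(kept) / len(kept)              # step 3: mean of the rest


cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

# 10% trimming removes 3 of the 30 values from each end.
print(round(trimmed_mean(cpu_times, 10), 2))
```

A 10% trim is used here because 30 × 10/100 = 3 is a whole number. With the 5% trim that Minitab reports, np/100 = 1.5 for this sample, so the result depends on the rounding convention.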



Percentile and Quartile
The pth percentile of a sample, for a number p between 0 and 100,
divides the sample so that as nearly as possible p% of the sample
values are less than the pth percentile.

To find the percentiles, order the sample values from smallest to
largest. Then compute the quantity i = (n+1)p/100, where n is the
sample size. If this quantity is an integer, the sample value in this
position is the pth percentile. Otherwise, average the two sample
values on either side.

The first quartile, Q1, is the value that has approximately 25% of
the observations below it. It represents the median of the lower half
of the data and corresponds to the 25th percentile.
The second quartile or median is the 50th percentile.
The third quartile, Q3, has approximately 75% of the observations
below it and corresponds to the 75th percentile.
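
The position rule i = (n+1)p/100 can be written out directly. This is a sketch of the rule as stated on this slide; the function name `percentile` and the error raised for positions outside the sample are illustrative choices.

```python
import math


def percentile(values, p):
    """p-th percentile using the position rule i = (n + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    i = (n + 1) * p / 100
    if i < 1 or i > n:
        raise ValueError("position falls outside the sample")
    if i == int(i):
        return ordered[int(i) - 1]            # positions are 1-based
    lower = ordered[math.floor(i) - 1]        # value just below position i
    upper = ordered[math.ceil(i) - 1]         # value just above position i
    return (lower + upper) / 2


cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

q1 = percentile(cpu_times, 25)
q2 = percentile(cpu_times, 50)
q3 = percentile(cpu_times, 75)
print(q1, q2, q3)
```

For the CPU time data this rule gives Q1 = 32, median = 42.5 and Q3 = 60.5. The Minitab output for the same data reports Q1 = 33.00 and Q3 = 59.75, which is consistent with interpolating linearly between the two neighbouring values instead of averaging them, so small differences between software and hand calculation are to be expected.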
Measures of Variability: Variance and Standard Deviation
The variance is the average of the squared deviations of the values from the
mean. The population variance ($\sigma^2$) is given by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
while the sample variance ($s^2$) is given by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The sample variance is a reasonable estimate of the population
variance.

The standard deviation is the square root of the variance.
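
The two variance formulas differ only in the divisor, which the following sketch makes visible using Python's standard `statistics` module on the CPU time data. Treating the 30 jobs as a whole population for `pvariance` is done purely to show the contrast; in the example they are a sample.

```python
import statistics

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

pop_var = statistics.pvariance(cpu_times)   # divides by N
samp_var = statistics.variance(cpu_times)   # divides by n - 1
samp_sd = statistics.stdev(cpu_times)       # square root of the sample variance

print(round(pop_var, 2), round(samp_var, 2), round(samp_sd, 2))
```

The sample standard deviation comes out at about 26.52, matching the StDev value in the Minitab output for these data.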



Range and Interquartile Range
The Range (R) is the simplest measure of variation but is of limited use.
It is the difference between the largest and the smallest observations:
R = max(xi) - min(xi)
It is not commonly used, as it is based on only two observations and
is highly influenced by extreme values.

The Interquartile Range (IQR) is the range of the middle 50% of
the data:
IQR = Q3 - Q1
It is not influenced by outliers but is used to detect them.

Detection of outliers: Measure 1.5×(IQR) down from the first
quartile and up from the third quartile. All data points that fall
outside this interval are classified as outliers.
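
The 1.5×IQR rule can be sketched on the CPU time data as follows. The quartiles are taken from `statistics.quantiles`, whose default "exclusive" method interpolates at position (n+1)p/100; using that function instead of a hand calculation is a choice made for this sketch.

```python
import statistics

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

# Quartiles (default method interpolates at position (n + 1) * p / 100).
q1, q2, q3 = statistics.quantiles(cpu_times, n=4)
iqr = q3 - q1

# Fences 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in cpu_times if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With Q1 = 33 and Q3 = 59.75 the fences are roughly -7.1 and 99.9, so the only observation flagged is the 139-second job.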
Example
To evaluate the effectiveness of a processor for a certain type of
task, a researcher recorded the CPU time (in seconds) for n = 30
randomly chosen jobs:
70 36 43 69 82 48 34 62 35 15
59 139 46 37 42 30 55 56 36 82
38 89 54 25 35 24 22 9 56 19

Minitab Output:
Descriptive Statistics: CPU Time

Variable  N   N*  Mean   SE Mean  StDev  Minimum  Q1     Median  Q3     Maximum
CPU Time  30  0   48.23  4.84     26.52  9.00     33.00  42.50   59.75  139.00
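
Several entries of the Minitab row can be cross-checked with a few lines of Python. The SE Mean column is not defined in these slides; the sketch assumes it is the standard error of the mean, s/√n, which reproduces the reported value.

```python
import math
import statistics

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15,
             59, 139, 46, 37, 42, 30, 55, 56, 36, 82,
             38, 89, 54, 25, 35, 24, 22, 9, 56, 19]

n = len(cpu_times)
mean = statistics.mean(cpu_times)
sd = statistics.stdev(cpu_times)       # sample standard deviation
se_mean = sd / math.sqrt(n)            # assumed definition of "SE Mean"
median = statistics.median(cpu_times)

print(n, round(mean, 2), round(se_mean, 2), round(sd, 2),
      min(cpu_times), median, max(cpu_times))
```

Rounded to two decimals, these agree with the N, Mean, SE Mean, StDev, Minimum, Median and Maximum columns of the output.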



Boxplot
 The boxplot is a graphical display that simultaneously
describes several important features of a data set, such as
center, spread, departure from symmetry, and identification
of outliers.
 The plot is based on the five number summary:
(minimum; Q1; median; Q3; maximum)
 Comparative or side-by-side boxplots are a very effective
way of comparing two or more data sets consisting of
observations on the same variable, for example fuel efficiency
observations for four different types of automobiles, prices
for three different brands of notebooks, and so on.

In Minitab: Graph - Boxplot

[Figure: Distribution shape and boxplot]


Example

The distribution of the CPU times is skewed to the right with
one outlier.



Comparative or Side-by-Side Boxplot
The following comparative boxplots represent the amount of internet traffic
handled by a certain center during a week. What we can see:
 Traffic is heaviest on Fridays and least on Saturdays and Sundays.
 The greatest spread occurs on Fridays and the least on Saturdays and
Sundays.
 The distributions all appear to be slightly right skewed, although there is
little skew in the distributions on Saturday and Sunday. There are large
outliers on Monday, Thursday, and Friday.



Exercises
(1) The following data set represents the number of new computer
accounts registered during ten consecutive days:
43 37 50 51 58 105 52 45 45 10
a) Compute the mean, median, quartiles, and standard deviation.
b) Delete the outliers and redo part (a).
c) Draw a conclusion about the effect of outliers.

(2) The numbers of blocked intrusion attempts on each day during
the first two weeks of the month were
56 47 49 37 38 60 50 43 43 59 50 56 54 58
After the change of firewall settings, the numbers of intrusions during the next
20 days were
53 21 32 49 45 38 44 33 32 43
53 46 36 48 39 35 37 36 39 45
Compare the number of intrusions before and after the change: construct
parallel boxplots and comment on your findings.
Exercise
(3) Match each histogram to the boxplot that represents the
same data set.



Exercise
(4) A network provider investigates the load of its network. The
number of concurrent users (in thousands of people) is recorded
at 50 locations:
17.2 22.1 18.5 17.2 18.6 14.8 21.7 15.8 16.3 22.8

24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7

13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4

15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7

19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9

a) Compute the sample mean, variance, and standard deviation of
the number of concurrent users.
b) Compute the five-number summary and construct a boxplot.
c) Compute the interquartile range. Are there any outliers?
d) It is reported that the number of concurrent users approximately
follows a normal distribution. Does the histogram support
this claim?