Stats 1 Module Updated
Types of Statistics
1. Descriptive
2. Inferential
Descriptive
Inferential
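Suppose a survey finds that 20% of people like blue cars, with a
2% margin of error (20% +/- 2%). From the sample we infer that
the true share in the whole population lies between 18% and 22%,
at the confidence level chosen for the survey. The margin of
error is not itself the confidence level.
Drawing conclusions about a population from a sample in this way
is called inferential statistics.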
• Data (plural): The set of values collected for the variable from
each of the elements belonging to the sample.
Example: A store's data covers how the business is doing: how
much inventory you keep, how many customers come to your
store, in which months and on which days of the week more of
them come, and which products sell more at which hours.
It also covers what kind of customers come: whether male or
female customers come more at a certain point in time, and
whether people with children, cigarette buyers, beer buyers,
or grocery buyers come more.
Examples:
Levels of measurement
1) Qualitative: A variable that categorizes or describes a
population element.
Nominal
Ordinal
Examples:
Here n is the size of the data set, x̄ is the sample mean, and
xᵢ are the individual values in sequence.
Example – if n is odd
Example – if n is even
-> Mode: two of the men have the same cholesterol level, so
the mode = 274.
Mean, Mode and Median in Brief:
Note:
Mean is highly sensitive to outliers
Example:
o 1,2,3,4,5
-> Mean: 3
-> Median: 3
o 1,2,3,4,5,100
-> Mean: 19.17
-> Median: 3.5
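The two small lists above can be checked directly. The following is a minimal sketch using NumPy; the variable names are illustrative and not part of the original notes.

```python
import numpy as np

small = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 100]

# Mean and median agree when there is no outlier
mean_small = np.mean(small)
median_small = np.median(small)

# Adding 100 drags the mean upward, while the median barely moves
mean_outlier = np.mean(with_outlier)
median_outlier = np.median(with_outlier)

print("Without outlier: mean =", mean_small, "median =", median_small)
print("With outlier:    mean =", round(mean_outlier, 2), "median =", median_outlier)
```

The mean moves from 3 to about 19.17, while the median only moves from 3 to 3.5.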
Calculating Mean:
np.mean(expenditure)
24820.02338888958
Calculating Median:
np.median(expenditure)
24691.98032372038
Now, we are adding a large number to the sample.
expenditure = np.append(expenditure, [10000000000])
np.median(expenditure)
24698.883118187983
np.mean(expenditure)
424650.16332356015
Here, the Median did not change much, but the Mean did.
Calculating Mode:
ModeResult(mode=array([15]), count=array([15]))
Measures of Dispersion
Range
-> Finite
-> Infinite
2.5833333333333335
Standard deviation
import numpy as np
from scipy import stats

dataset = [5,5,2,3,4,6,18]
#mean value
mean = np.mean(dataset)
#median value
median = np.median(dataset)
#mode value
mode = stats.mode(dataset)
#standard deviation and variance (population versions, ddof=0)
std = np.std(dataset)
vr = np.var(dataset)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("STD", std)
print("Var:", vr)
Outputs:
Mean: 6.142857142857143
Median: 5.0
Mode: ModeResult(mode=array([5]), count=array([2]))
STD 4.997958767010258
Var: 24.9795918367347
Random Variables
A random variable is a variable whose possible values are the
numerical outcomes of a random experiment.
Set
Set Definition
A set is a well-defined collection of objects.
A set that contains zero elements is called a null set
(empty set).
Let A and B be two sets. Then A is said to be a subset of B
(or B is a superset of A) if every element of A belongs to B.
Operations of Set
The union of two sets is defined as the set of elements
that are present in one or both sets. Thus, if A is {1, 2}
and B is {2, 3, 4}, the union of sets A and B is:
A ∪ B = {1, 2, 3, 4}
A ∩ B = {2,5}
X' = {0, 8, 9}
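The set operations can be tried out with Python's built-in set type. This is a small sketch using the sets A = {1, 2} and B = {2, 3, 4} from the union example; the universal set U is a hypothetical choice made only to illustrate the complement.

```python
A = {1, 2}
B = {2, 3, 4}

# Union: elements present in one or both sets
union = A | B

# Intersection: elements present in both sets
intersection = A & B

# Complement needs a universal set; U is an assumption for illustration
U = set(range(10))
complement_A = U - A

# Subset check: every element of A also belongs to the union
a_is_subset = A <= union

print("A ∪ B =", sorted(union))
print("A ∩ B =", sorted(intersection))
print("A' =", sorted(complement_A))
print("A ⊆ A ∪ B:", a_is_subset)
```

For these particular sets the intersection is {2}, since 2 is the only element that appears in both.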
Examples
1. The set of consonants.
B = {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,y,z}
The figure above shows a symmetrical data set. This data set
was created by generating values from 65 to 135 in steps of 5,
with the count of each value as shown in the figure.
For example, there are three 65's, six 70's, and nine 75's, etc.
Figure: Symmetrical data set with skewness equal to 0
when X = 65
So the -4278 term and the +4278 term cancel out to 0. A
symmetrical data set will therefore have 0 skewness.
If the skewness is between -0.5 and 0.5, the data is fairly
symmetrical. If the skewness is between -1 and -0.5 or
between 0.5 and 1, the data is moderately skewed.
If the skewness is greater than 1 or less than -1, the data is
highly skewed.
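The rule of thumb above can be written as a small helper. This is only a sketch: the function name is hypothetical, and treating the boundary values 0.5 and 1 as belonging to the lower category is an assumption, since the notes do not say which side they fall on.

```python
def describe_skewness(skew: float) -> str:
    """Classify a skewness value using the rule of thumb from the notes."""
    magnitude = abs(skew)
    if magnitude <= 0.5:      # assumption: 0.5 itself counts as fairly symmetrical
        return "fairly symmetrical"
    if magnitude <= 1:        # assumption: 1 itself counts as moderately skewed
        return "moderately skewed"
    return "highly skewed"


for value in (0.0, -0.8, 1.7):
    print(value, "->", describe_skewness(value))
```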
Height (inches)   Class Mark, x   Frequency, f
59.5–62.5         61              5
62.5–65.5         64              18
65.5–68.5         67              42
68.5–71.5         70              27
71.5–74.5         73              8
n = 5+18+42+27+8 = 100
x̅ = (61×5 + 64×18 + 67×42 + 70×27 + 73×8) ÷ 100
x̅ = (305 + 1152 + 2814 + 1890 + 584) ÷ 100
x̅ = 6745÷100 = 67.45
Class Mark, x   Frequency, f   x*f    (x−x̅)   (x−x̅)²f   (x−x̅)³f
61              5              305    −6.45    208.01     −1341.68
64              18             1152   −3.45    214.25     −739.15
67              42             2814   −0.45    8.51       −3.83
70              27             1890   2.55     175.57     447.70
73              8              584    5.55     246.42     1367.63
∑               100            6745   n/a      852.75     −269.33
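Skewness = ∑(x−x̅)³f ÷ (n × s³), where s = √(∑(x−x̅)²f ÷ n) = √(852.75 ÷ 100) ≈ 2.92
= −269.33 ÷ (100 × 2.92³)
= − 0.1082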
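The grouped-data skewness can be reproduced from the class marks and frequencies of the height table. This is a minimal sketch assuming the population form of the standard deviation (dividing by n), which is the form that gives −0.1082; the variable names are illustrative.

```python
import numpy as np

# Class marks and frequencies from the height table
x = np.array([61, 64, 67, 70, 73], dtype=float)
f = np.array([5, 18, 42, 27, 8], dtype=float)

n = f.sum()                                  # total frequency
mean = (x * f).sum() / n                     # grouped mean

# Second and third central moments, weighted by frequency
m2 = ((x - mean) ** 2 * f).sum() / n
m3 = ((x - mean) ** 3 * f).sum() / n

# Skewness = third central moment divided by the cube of the standard deviation
skewness = m3 / m2 ** 1.5

print("n =", n)
print("mean =", mean)
print("skewness =", round(skewness, 4))
```

The slightly negative value falls in the "fairly symmetrical" band of the rule of thumb.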
Probability Density Function
Normal Distribution
Bernoulli’s Distribution
Binomial Distribution
Uniform Distribution
Student’s T Distribution
Poisson Distribution
Expected value
E(C) = C
The expected value of a constant is the constant itself.
E(X + C) = E(X) + C
E(CX) = C E(X)
We can “pull” a constant out of an expected value
expression, whether it is added to a random variable X
or multiplies it.
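These properties can be verified numerically. The sketch below uses a fair six-sided die as the random variable X and C = 10 as the constant; both are hypothetical choices for illustration. Exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

# X is the result of a fair six-sided die (an illustrative assumption)
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)


def expected(transform):
    """Expected value of transform(X) for the fair die."""
    return sum(transform(x) * prob for x in values)


C = 10  # arbitrary constant chosen for the example

E_X = expected(lambda x: x)
E_const = expected(lambda x: C)        # E(C)
E_shift = expected(lambda x: x + C)    # E(X + C)
E_scale = expected(lambda x: C * x)    # E(CX)

print("E(X)     =", E_X)
print("E(C)     =", E_const)
print("E(X + C) =", E_shift, "and E(X) + C =", E_X + C)
print("E(CX)    =", E_scale, "and C E(X)  =", C * E_X)
```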
Binomial Distribution
The binomial distribution is used when each trial has exactly
two possible outcomes. These outcomes are labeled as “Success”
and “Failure.”
Here, the trials are independent and the probability of success
is the same for all the trials.
Binomial distribution:
In Python, the binomial distribution tells us how probable each
number of successes is in ‘n’ independent experiments. Such
experiments are yes-no questions. One example may be
tossing a coin.
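import seaborn
from scipy.stats import binom
# Draw 1010 random values from a binomial distribution with n=17, p=0.7
data=binom.rvs(n=17,p=0.7,loc=0,size=1010)
# Note: distplot is deprecated in recent seaborn releases (histplot replaces it)
ax=seaborn.distplot(data,kde=True,color='pink',
                    hist_kws={"linewidth": 22,'alpha':0.77})
ax.set(xlabel='Binomial',ylabel='Frequency')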
Examples
(c) 3 fives?
Solution:
There are only two possible outcomes (we get a 5, or we do
not).
(a) When, x = 0.
(b) When, x = 1.
(c) When, x = 3.
Solution:
= 0.20133
Solution:
= 0.8192
= 0.0272
P(X ≤ 2)
= 0.2785 + 0.37977 + 0.23304
= 0.89131
P(X ≥ 2)
= 1 − P(X ≤ 1)
= 1 − (P(x0) + P(x1))
= 1 − (0.2785 + 0.37977)
= 0.34173
Question:
A company drills 9 wild-cat oil exploration wells, each with an
estimated probability of success of 0.1. What is the probability
that all nine wells fail?
Solution:
Let’s do 20,000 trials of the model, and count the number that
generates zero positive results.
0.3918
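The simulation code itself is not shown here, so the following is only a sketch of how such a run might look, using NumPy's random generator with a fixed seed (an assumption made for repeatability). It also computes the exact answer, (1 − 0.1)⁹, for comparison.

```python
import numpy as np

n_wells = 9        # wells drilled
p_success = 0.1    # probability that a single well succeeds
n_trials = 20000   # number of simulated drilling campaigns

# Exact probability that all nine wells fail
exact = (1 - p_success) ** n_wells

# Simulation: count campaigns with zero successful wells
rng = np.random.default_rng(seed=0)   # seed is an arbitrary choice
successes = rng.binomial(n_wells, p_success, size=n_trials)
simulated = np.mean(successes == 0)

print("Exact:    ", round(exact, 4))
print("Simulated:", simulated)
```

The simulated share varies from run to run (the 0.3918 above is one such run) but stays close to the exact value of about 0.3874.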
Question:
Solution:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Input variables
# Number of trials
trials = 1000
ax.set_xlabel("Number of Heads",fontsize=16)
ax.set_ylabel("Frequency",fontsize=16)
# Probability of getting 5 heads
runs = 10000
prob_5 = sum([1 for i in np.random.binomial(n, p, size = runs) if i==5])/runs
print('The probability of 5 heads is: ' + str(prob_5))
X ~ N(μ, σ²)
With the help of Z scores, we can tell how many standard
deviations a value is from the mean. When you standardize a
random variable, its mean μ becomes 0, and its standard
deviation becomes 1.
X = {1,1,1,2,2,2,3,3,4,4,4,4,5}
Now we will subtract the mean from every data point, that
is, x – μ.
μ as 0, but the variance and std dev still as 1.49 and 1.22
respectively
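The two standardization steps can be sketched in code. This example uses the data set X listed above and NumPy's default population standard deviation (ddof=0), which is an assumption, since the notes do not say which form they use. It shows that subtracting the mean shifts the mean to 0 without changing the spread, and that dividing by the standard deviation then brings the spread to 1.

```python
import numpy as np

X = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5], dtype=float)

mu = X.mean()
sigma = X.std()          # population standard deviation (ddof=0), an assumption

# Step 1: subtract the mean; the mean becomes 0, the spread is unchanged
centered = X - mu

# Step 2: divide by the standard deviation; the spread becomes 1
z = centered / sigma

print("Original: mean =", round(mu, 4), "std =", round(sigma, 4))
print("Centered: mean =", round(centered.mean(), 4), "std =", round(centered.std(), 4))
print("Z scores: mean =", round(z.mean(), 4), "std =", round(z.std(), 4))
```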
Plotting it on a graph:
Bernoulli Distribution
Bernoulli distribution is a discrete probability distribution of a
random variable that has only two outcomes,
namely 1 (success) and 0 (failure). The value n = 1 occurs with
probability p, and n = 0 (usually called a “failure”) occurs with
probability q = 1 – p.
Since the area under the curve must be equal to 1, the
length of the interval determines the height of the curve: for
an interval of length b – a, the height is 1/(b – a). The
following figure shows a uniform distribution on (a,b).
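Because the density is a flat rectangle, its height must be 1/(b − a) for the area to equal 1. The sketch below checks this for a hypothetical interval (a, b) = (2, 6); the function name and the interval are illustrative assumptions.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of a uniform distribution on the interval [a, b]."""
    if b <= a:
        raise ValueError("b must be greater than a")
    # Constant height inside the interval, zero outside it
    return 1.0 / (b - a) if a <= x <= b else 0.0


a, b = 2.0, 6.0                      # hypothetical interval
height = uniform_pdf(3.0, a, b)      # height of the flat curve
area = height * (b - a)              # rectangle area = height * width

print("Height:", height)
print("Area:  ", area)
print("Density outside the interval:", uniform_pdf(7.0, a, b))
```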
EXAMPLE:
Example:
Solution: Here, μ = 3
P(X ≥ 1)
= 1 − P(x0)
= 0.95021
P(2≤X<5)
= P(x2)+P(x3)+P(x4)
= 0.61611
(c) The average number of policies sold per day is 3/5 = 0.6.
On a given day,
Number of flaws   Frequency
0                 4
1                 3
2                 5
3                 2
4                 4
5                 1
6                 1
Probability=P(X≥3)
=1−(P(x0)+P(x1)+P(x2))
=0.40396
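The result 0.40396 can be reproduced from the flaw table. The sketch below assumes that the Poisson mean is estimated as the average number of flaws per sample (46 flaws over 20 samples, so μ = 2.3); that step is not shown in the notes, but it is the value that yields 0.40396. The helper name is illustrative.

```python
import math

# Number of flaws -> frequency, taken from the table
flaw_counts = {0: 4, 1: 3, 2: 5, 3: 2, 4: 4, 5: 1, 6: 1}

n_samples = sum(flaw_counts.values())
total_flaws = sum(k * freq for k, freq in flaw_counts.items())
mu = total_flaws / n_samples          # estimated Poisson mean (assumption)


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# P(X >= 3) = 1 - (P(0) + P(1) + P(2))
p_at_least_3 = 1 - sum(poisson_pmf(k, mu) for k in range(3))

print("mu =", mu)
print("P(X >= 3) =", round(p_at_least_3, 5))
```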
Question:
Z Stats
z= (x – μ) / σ
When you work with the mean x̄ of a sample of size n, divide by
the standard error σ / √n instead of σ, and use the following
z-score formula:
z = (x̄ – μ) / (σ / √n)
Calculating z-scores
Question.
The grades on a physics midterm at Covington are roughly
symmetric with μ = 72 and σ=2.0.
Stephanie scored 74 on the exam.
Find the z-score for Stephanie's exam grade. Round to two
decimal places.
Solution:
z= (74-72)/2.0
z ≈ 1.00
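Both z-score formulas are short enough to express as helper functions. This is a sketch with hypothetical function names; the first call uses Stephanie's numbers, while the sample size of 4 in the second call is an invented value used only to show the standard-error version.

```python
import math


def z_score(x: float, mu: float, sigma: float) -> float:
    """z-score of a single value."""
    return (x - mu) / sigma


def z_score_of_mean(x_bar: float, mu: float, sigma: float, n: int) -> float:
    """z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / math.sqrt(n))


# Stephanie's exam grade: x = 74, mu = 72, sigma = 2.0
z = z_score(74, 72, 2.0)
print("z =", round(z, 2))

# Hypothetical: a sample of 4 students averaging 74 on the same exam
z_mean = z_score_of_mean(74, 72, 2.0, 4)
print("z for the sample mean =", round(z_mean, 2))
```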
(A) 0.10
(B) 0.18
(C) 1.50
(D) 5.82
(E) 2.90
= 1 - P( z < 0.90)
=1 - 0.8159 = 0.1841.
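The table look-up P(z < 0.90) = 0.8159 can be cross-checked without a z-table. The sketch below builds the standard normal CDF from the error function in Python's math module; the function name is illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability P(Z < z) for the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


p_below = normal_cdf(0.90)     # area to the left of z = 0.90
p_above = 1 - p_below          # area to the right of z = 0.90

print("P(z < 0.90) =", round(p_below, 4))
print("P(z > 0.90) =", round(p_above, 4))
```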
Where,
be.
We write
=
= P(z < −2.33)
= 1 − 0.9901
= 0.0099.
= 0.95.
A = 52854.88
The weekly delivery must be 52854.88 gallons.