ps project file
PRACTICAL:01
Aim: Load real-world datasets from sources such as CSV files or online repositories.
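A minimal R sketch of this aim is given here. It assumes a small hypothetical CSV file written to a temporary path so that the example is self-contained; the column names and the commented-out URL are placeholders and not part of the original practical.

```r
# Write a small hypothetical CSV to a temporary file so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "A,18", "B,19", "C,20"), csv_path)

# Load the CSV into a data frame; header = TRUE uses the first row as column names.
students <- read.csv(csv_path, header = TRUE)
str(students)   # column names and types
head(students)  # first rows of the data

# A CSV hosted online can be read the same way by passing its URL (placeholder shown).
# remote <- read.csv("https://example.com/data.csv")

# R also ships with built-in datasets such as mtcars.
data(mtcars)
head(mtcars)
```

read.csv() returns a data frame, so the loaded data can be passed directly to functions such as summary() or plot().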
Output:
PRACTICAL:02
Aim: Calculate descriptive statistics like mean, median, mode and
standard deviation
Mean: - The mean is the average of the data: the sum of all values divided by
the number of data points. It summarizes the data best when the distribution is
roughly symmetric, such as a normal distribution, and it is sensitive to outliers.
For a random variable, the mean corresponds to the expected value.
Median: - The median is the middle value of the sorted data and is also the 50th
percentile. Unlike the mean, the median is only weakly affected by outliers and
skewness, so it is a better measure of centrality than the mean when the
data is skewed.
Mode: - The mode is the value in the data that has the highest frequency. It is
useful when the values are non-numeric, where a mean or median cannot be computed.
Standard Deviation: - The standard deviation (σ) is the measure of the
dispersion of the values around the mean.
Range: - The range is the difference between the largest and smallest points in
the data.
Interquartile Range: - The interquartile range is the difference
between the 75th percentile (third quartile) and the 25th percentile (first quartile).
Question: Twenty students, graduates and undergraduates, were enrolled in a
statistics course. Their ages were
18, 19, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 23, 24, 27, 30, 36.
a) Find the mean and median age of all students.
b) Find the median age of all students under 25 years.
c) Find the modal age of all students.
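One possible way to answer the question in R is sketched below. The variable names are illustrative choices, and the mode is computed from a frequency table because base R has no built-in function for the statistical mode.

```r
# Ages of the twenty students from the question.
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# a) Mean and median of all students.
mean_age   <- mean(ages)     # 440 / 20 = 22
median_age <- median(ages)   # average of the 10th and 11th sorted values = 20.5

# b) Median age of the students under 25 years.
median_under_25 <- median(ages[ages < 25])   # 20

# c) Modal age: the value(s) with the highest frequency.
freq <- table(ages)
modal_age <- as.numeric(names(freq)[freq == max(freq)])   # 20

# Measures of spread described above.
sd_age    <- sd(ages)            # sample standard deviation (divides by n - 1)
age_range <- diff(range(ages))   # largest minus smallest = 18
age_iqr   <- IQR(ages)           # third quartile minus first quartile

print(c(mean = mean_age, median = median_age,
        median_under_25 = median_under_25, mode = modal_age))
```

The mean (22) is larger than the median (20.5) because the few older students (27, 30 and 36) pull the average upward, which illustrates why the median is preferred for skewed data.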
PRACTICAL:03
Aim: Create histograms, boxplots, scatter plots, and bar charts to visualize data
distributions, relationships between variables, and identify potential outliers.
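For the histogram named in the aim, a minimal sketch might look like the following. It assumes the student ages from Practical 02 as the data; the bin edges, title and colour are illustrative choices.

```r
# Student ages reused from Practical 02 (assumed data for this sketch).
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Histogram with 5-year bins; hist() also returns the bin counts.
h <- hist(ages,
          breaks = seq(15, 40, by = 5),
          main = "Histogram of Student Ages",
          xlab = "Age (years)",
          ylab = "Frequency",
          col = "lightblue")

# Number of students falling in each bin.
print(h$counts)
```

The long right tail of the histogram reflects the few older students in the class.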
Output:
Boxplot: A boxplot is useful for visualizing the distribution of data, including the
median, quartiles, and potential outliers.
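A boxplot of the same data can be sketched as follows, again assuming the ages from Practical 02. The orientation, title and colour are illustrative.

```r
# Student ages reused from Practical 02 (assumed data for this sketch).
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Draw the boxplot; the returned list holds the plotted statistics.
b <- boxplot(ages,
             horizontal = TRUE,
             main = "Boxplot of Student Ages",
             xlab = "Age (years)",
             col = "lightgreen")

print(b$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)    # points beyond 1.5 times the box length, flagged as potential outliers
```

With these ages, the values 30 and 36 lie beyond the upper whisker and are drawn as individual points.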
Output:
Scatter Plot: A scatter plot helps visualize the relationship between two
continuous variables. Since we only have one variable (ages) in this
example, we'll generate a simple scatter plot of ages against an index
(just to demonstrate).
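The index-based scatter plot described above could be produced along these lines. The ages are assumed to be those from Practical 02, and the plotting symbol and colour are illustrative.

```r
# Student ages reused from Practical 02 (assumed data for this sketch).
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Use the position of each observation as the x-axis variable.
index <- seq_along(ages)

plot(index, ages,
     main = "Scatter Plot of Ages against Index",
     xlab = "Index",
     ylab = "Age (years)",
     pch = 19,
     col = "blue")
```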
Output:
Bar Chart: A bar chart is suitable for categorical data, showing the
frequency of each category. In this case, we'll create a bar chart showing
the frequency of each age.
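A frequency bar chart of the ages can be sketched with table() and barplot(). The data is assumed to be the ages from Practical 02; the labels and colour are illustrative.

```r
# Student ages reused from Practical 02 (assumed data for this sketch).
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Count how many students have each age.
age_freq <- table(ages)

# One bar per distinct age, with height equal to its frequency.
barplot(age_freq,
        main = "Frequency of Each Age",
        xlab = "Age (years)",
        ylab = "Number of students",
        col = "orange")
```

The tallest bar corresponds to age 20, which is the modal age found in Practical 02.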
Output:
PRACTICAL:04
Aim: Simulate random outcomes and calculate probabilities for the case of coin
flips and card draws
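The aim can be illustrated with a short simulation based on sample(). This is a sketch under stated assumptions: the seed, the number of repetitions, and the events checked (drawing an ace, drawing a heart) are illustrative choices, and cards are drawn with replacement so that each draw is an independent trial.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Coin flips: simulate fair flips and estimate P(heads).
n_flips <- 10000
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
p_heads_est <- mean(flips == "H")

# Card draws: build a standard 52-card deck.
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

# Draw one card at a time, with replacement, many times.
n_draws <- 10000
drawn <- deck[sample(nrow(deck), size = n_draws, replace = TRUE), ]
p_ace_est   <- mean(drawn$rank == "A")        # theoretical value 4/52
p_heart_est <- mean(drawn$suit == "Hearts")   # theoretical value 13/52

print(c(heads = p_heads_est, ace = p_ace_est, heart = p_heart_est))
```

The estimates approach the theoretical probabilities (1/2, 4/52 and 13/52) as the number of simulated trials grows.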
Description:
There are two main types of random variables:
1. Discrete Random Variable: Takes on a countable number of distinct values. For
example, the number of heads in a series of coin flips.
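a) The expected value of a discrete random variable X is defined as:
$$E(X) = \sum_{i=1}^{n} x_i \cdot P(x_i)$$
where x_i are the possible values of the random variable X and P(x_i) = P(X = x_i) is the
probability of X taking the value x_i.
b) The variance of a discrete random variable X is defined as:
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$$
2. Continuous Random Variable: Takes on any value within an interval of real numbers.
a) The expected value of a continuous random variable X is defined as:
$$E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$
Where:
x ranges over the possible values of the random variable,
f(x) is the probability density function.
b) The variance of a continuous random variable X is defined as:
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$$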
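The expectation and variance formulas can be checked numerically. The sketch below uses two assumed examples that do not appear in the original text: a fair six-sided die for the discrete case and a Uniform(0, 1) variable for the continuous case. Because the uniform density is zero outside [0, 1], the integral is taken over that interval only.

```r
# Discrete case: a fair six-sided die (assumed example).
x <- 1:6
p <- rep(1 / 6, 6)
ex    <- sum(x * p)      # E(X)   = sum of x_i * P(x_i) = 3.5
ex2   <- sum(x^2 * p)    # E(X^2)
var_x <- ex2 - ex^2      # Var(X) = E(X^2) - [E(X)]^2 = 35/12

# Continuous case: Uniform(0, 1) (assumed example), density f(x) = dunif(x).
f <- function(x) dunif(x, min = 0, max = 1)
ey    <- integrate(function(x) x * f(x),   lower = 0, upper = 1)$value  # E(Y)   = 1/2
ey2   <- integrate(function(x) x^2 * f(x), lower = 0, upper = 1)$value  # E(Y^2) = 1/3
var_y <- ey2 - ey^2                                                     # Var(Y) = 1/12

print(c(E_die = ex, Var_die = var_x, E_unif = ey, Var_unif = var_y))
```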
Description:
Binomial Distribution
The binomial distribution is a discrete probability distribution that models the
number of successes in a fixed number of independent trials of a binary
experiment. Each trial can result in one of two outcomes: "success" or "failure."
Key Characteristics
1. dbinom(x, size, prob) : This function gives the probability mass P(X = x), that is,
the probability of exactly x successes in size trials.
2. pbinom(q, size, prob, lower.tail = TRUE) : This function gives the cumulative
probability P(X ≤ q) of at most q successes.
Parameters:
lower.tail: If TRUE (default), probabilities are 𝑃(𝑋 ≤ 𝑞); if FALSE, 𝑃(𝑋 > 𝑞).
experiment of flipping a fair coin 20 times. Also, create a bar plot to visualize the
probabilities for each possible outcome (number of heads).
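A sketch for the 20-flip experiment follows. Since the opening of the question is not available here, the specific value of 10 heads used for the point and cumulative probabilities is an assumption made for illustration; the bar plot covers every possible number of heads as the question asks.

```r
n <- 20    # number of coin flips
p <- 0.5   # probability of heads for a fair coin

# Probability of each possible number of heads, 0 to 20.
heads <- 0:n
probs <- dbinom(heads, size = n, prob = p)

# Illustrative queries (the value 10 is an assumed example).
p_exactly_10   <- dbinom(10, size = n, prob = p)                      # P(X = 10)
p_at_most_10   <- pbinom(10, size = n, prob = p)                      # P(X <= 10)
p_more_than_10 <- pbinom(10, size = n, prob = p, lower.tail = FALSE)  # P(X > 10)

# Bar plot of the probability of each outcome.
barplot(probs,
        names.arg = heads,
        main = "Binomial Distribution (n = 20, p = 0.5)",
        xlab = "Number of heads",
        ylab = "Probability",
        col = "skyblue")
```

The two tail probabilities add up to 1, which shows how lower.tail switches between P(X ≤ q) and P(X > q).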
Poisson Distribution
The Poisson distribution is a discrete probability distribution that expresses the
probability of a given number of events occurring in a fixed
interval of time or space, given that these events occur with a known constant
mean rate and independently of the time since the last event. It is often called the
distribution of rare events.
Key Characteristics
1. Parameter (𝜆): The average number of events in the given interval. It is also the
mean and variance of the distribution.
2. Events: The events must occur independently. That is, the occurrence of one
event does not affect the occurrence of another.
3. Interval: The interval can be time, space, or any other measurable quantity.
4. For a small interval, the probability of the event occurring is proportional to the
size of the interval.
5. The probability of more than one occurrence in the small interval is negligible.
Parameters:
lower.tail: If TRUE (default), probabilities are 𝑃(𝑋 ≤ 𝑞); if FALSE, 𝑃(𝑋 > 𝑞).
A bookstore sells an average of 6 books per hour. What is the probability that the
bookstore sells exactly 4 books in a given hour? Additionally, visualize the
probability mass function (PMF) of the Poisson distribution for the number of books
sold from 0 to 10 in one hour.
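The bookstore problem can be worked with dpois(), R's Poisson probability mass function. The sketch below takes λ = 6 from the question; the plot labels and colour are illustrative.

```r
lambda <- 6   # average number of books sold per hour

# Probability of selling exactly 4 books in one hour:
# P(X = 4) = exp(-6) * 6^4 / 4!, which is about 0.134.
p_exactly_4 <- dpois(4, lambda = lambda)
print(p_exactly_4)

# PMF for 0 to 10 books sold in one hour.
books <- 0:10
pmf <- dpois(books, lambda = lambda)

barplot(pmf,
        names.arg = books,
        main = "Poisson PMF (lambda = 6)",
        xlab = "Number of books sold in one hour",
        ylab = "Probability",
        col = "lightblue")
```

The bars for 0 to 10 do not sum to 1, because a Poisson variable can also take values above 10.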
Geometric Distribution
The geometric distribution is a discrete probability distribution that models the
number of trials required to achieve the first success in a sequence of independent
Bernoulli trials (where each trial has two possible outcomes: success or failure). It is
particularly useful in scenarios where you want to determine how many attempts it
takes before the first success occurs.
Key Characteristics
Formula
R provides several built-in functions for handling the Geometric distribution.
1. dgeom(x, prob, log = FALSE): Calculates the probability of having exactly x
failures before the first success.
2. pgeom(q, prob, lower.tail = TRUE, log.p = FALSE): Calculates the probability of
having at most q failures before the first success.
In a scenario where the probability of success in a Bernoulli trial is 0.4, what is:
1. The probability of experiencing exactly 5 failures before achieving the first
success?
2. The cumulative probability of experiencing at most 5 failures before the first
success?
3. Additionally, visualize the probability mass function (PMF) of the geometric
distribution for the first 20 trials.
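The three parts of this question can be sketched with dgeom() and pgeom(). Both functions count failures before the first success, so "the first 20 trials" is interpreted here, as an assumption, as 0 to 19 failures, meaning the first success falls on trials 1 to 20.

```r
p <- 0.4   # probability of success in each Bernoulli trial

# 1. Exactly 5 failures before the first success: (1 - p)^5 * p = 0.031104
p_exactly_5 <- dgeom(5, prob = p)

# 2. At most 5 failures before the first success: 1 - (1 - p)^6 = 0.953344
p_at_most_5 <- pgeom(5, prob = p)

print(c(exactly_5 = p_exactly_5, at_most_5 = p_at_most_5))

# 3. PMF for the first 20 trials (0 to 19 failures before the first success).
failures <- 0:19
pmf <- dgeom(failures, prob = p)

barplot(pmf,
        names.arg = failures + 1,
        main = "Geometric PMF (p = 0.4)",
        xlab = "Trial on which the first success occurs",
        ylab = "Probability",
        col = "lightgreen")
```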
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution that is symmetric about its mean, indicating that data near
the mean are more frequent in occurrence than data far from the mean.
Key Characteristics
1. Bell-Shaped Curve: The graph of the normal distribution is bell-shaped and
symmetric around the mean. Fifty percent of values lie to the left of
the mean and the other fifty percent lie to the right of the mean.
2. Mean (𝜇): The central value of the distribution, which is also its median and
mode.
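The symmetry property can be illustrated with dnorm() and pnorm(). The parameters below (mean 0, standard deviation 1) are assumed for illustration and do not come from the original text.

```r
mu    <- 0   # assumed mean
sigma <- 1   # assumed standard deviation

# Bell-shaped density curve over four standard deviations on each side of the mean.
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
plot(x, dnorm(x, mean = mu, sd = sigma),
     type = "l",
     main = "Normal Distribution (mean = 0, sd = 1)",
     xlab = "x",
     ylab = "Density")
abline(v = mu, lty = 2)   # dashed vertical line at the mean

# Half of the probability lies on each side of the mean.
p_left  <- pnorm(mu, mean = mu, sd = sigma)                       # P(X <= mu) = 0.5
p_right <- pnorm(mu, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X >  mu) = 0.5
print(c(left = p_left, right = p_right))
```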
Linear Regression
The simple linear regression model is:
y = β0 + β1x + ϵ
Parameters:
1. y is the dependent variable,
2. x is the independent variable,
3. β0 is the intercept,
4. β1 is the slope (coefficient for x),
5. ϵ is the error term.
Plot the wt vs mpg data as a scatter plot, then overlay the regression line in
red to show the linear relationship between weight and miles per gallon.
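A sketch of this plot is given below. It assumes that wt and mpg are the columns of R's built-in mtcars dataset and that mpg is the dependent variable y with wt as the independent variable x; neither assumption is stated explicitly in the text above.

```r
# Built-in dataset assumed to be the source of wt (weight) and mpg (miles per gallon).
data(mtcars)

# Fit the simple linear regression mpg = b0 + b1 * wt + error.
model <- lm(mpg ~ wt, data = mtcars)
print(coef(model))   # estimated intercept (b0) and slope (b1)

# Scatter plot of the data with the fitted regression line in red.
plot(mtcars$wt, mtcars$mpg,
     main = "Miles per Gallon vs Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     pch = 19)
abline(model, col = "red", lwd = 2)
```

The estimated slope is negative, so heavier cars tend to have lower miles per gallon.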