
ps project file

Uploaded by

Durga Nandini

INDEX

S.NO.  PRACTICAL                                                  DATE       REMARK
1.     Load real-world datasets from sources like CSV files       6/9/24
       or online repositories.
2.     Calculate descriptive statistics like mean, median,        13/9/24
       and standard deviation.
3.     Create histograms, boxplots, scatter plots, and bar        20/9/24
       charts to visualize data distributions, relationships
       between variables, and identify potential outliers.
4.     Simulate random outcomes and calculate probabilities       27/9/24
       for the case of coin flips and card draws.
5.     Simulate experimental probabilities and compare them       4/10/24
       with theoretical probabilities.
6.     Calculate expected value and variance in the context       11/10/24
       of a single random variable.
7.     Generate and plot probabilities for events in discrete     8/10/24
       and continuous distributions (Binomial, Poisson,
       Geometric, and Normal).
8.     Fit simple linear regression models using built-in         15/10/24
       functions.
PRACTICAL:01
Aim: Load real-world datasets from sources like CSV files or online repositories.
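The original code for this practical did not survive extraction. The following is a minimal sketch of one way to meet the aim, assuming the built-in mtcars dataset as a stand-in for a real-world dataset and a temporary file as the CSV source; the variable names csv_path and cars are illustrative.

```r
# Assumption: mtcars (shipped with R) stands in for a real-world dataset.
# Write it to a temporary CSV file so there is a file to load.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Load the CSV file; the first column holds the row names (car models).
cars <- read.csv(csv_path, row.names = 1)

# Inspect what was loaded.
head(cars)   # first six rows
str(cars)    # column types
dim(cars)    # number of rows and columns

# read.csv() also accepts a URL, so a file in an online repository
# can be loaded the same way by passing its address instead of csv_path.
```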
Output:
PRACTICAL:02
Aim: Calculate descriptive statistics like mean, median, mode and
standard deviation

Central Tendency: - A measure of central tendency summarizes a data set with a single value that describes its centre. Mean, median, and mode are the most common measures of central tendency.

Mean: - The mean is the average of the data: the sum of all values divided by the number of data points. The mean works best when the data are roughly symmetric, as in a normal distribution, and contain no extreme outliers. For a random variable, the mean of its distribution is the expected value.

Median: - The median is the middle value of the sorted data and is also the 50th percentile. Unlike the mean, the median is affected very little by outliers and skewness, so it is a better measure of centrality than the mean when the data are skewed.
Mode: - The mode is the value in the data that has the highest frequency. It is especially useful for non-numeric (categorical) data, where a mean or median cannot be computed.
Standard Deviation: - The standard deviation is a measure of the dispersion of the values around the mean.

Range: - The range is the difference between the largest and smallest points in
the data.
Interquartile Range: - The interquartile range is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile).
Question: Twenty students, graduates and undergraduates, were enrolled in a
statistics course. Their ages were
18,19,19,19,19,20,20,20,20,20,21,21,21,21,22,23,24,27,30,36.
a) Find the mean and median age of all students.
b) Find the median age of all students under 25 years.
c) Find the modal age of all students.
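The question above can be answered with base R functions. This is a sketch rather than the original solution; the variable names are illustrative, and because base R has no built-in function for the statistical mode, it is derived here from a frequency table.

```r
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# a) Mean and median of all students
mean_age   <- mean(ages)     # 22
median_age <- median(ages)   # 20.5

# b) Median age of students under 25 years
median_under_25 <- median(ages[ages < 25])   # 20

# c) Modal age: the value(s) with the highest frequency
freq      <- table(ages)
modal_age <- as.numeric(names(freq)[freq == max(freq)])   # 20

# Measures of dispersion described above
sd_age    <- sd(ages)            # sample standard deviation
range_age <- diff(range(ages))   # largest minus smallest value: 18
iqr_age   <- IQR(ages)           # third quartile minus first quartile: 2.5

print(c(mean = mean_age, median = median_age,
        median_under_25 = median_under_25, mode = modal_age))
```

Note that sd() divides by n - 1 (the sample standard deviation), and IQR() uses R's default quantile method.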
PRACTICAL:03
Aim: Create histograms, boxplots, scatter plots, and bar charts to visualize data
distributions, relationships between variables, and identify potential outliers.

Histogram: A histogram shows the frequency distribution of a continuous variable. It is useful for understanding the distribution of data.
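A possible histogram for the ages data from Practical 02 is sketched below. The bin edges, title, and colour are assumptions chosen for illustration.

```r
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Assumed bins of width 5 covering the range of the data: (15,20], (20,25], ...
h <- hist(ages,
          breaks = seq(15, 40, by = 5),
          main = "Histogram of Student Ages",
          xlab = "Age", ylab = "Frequency",
          col = "lightblue", border = "black")

# Number of students falling in each bin
print(h$counts)
```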

Output:

Boxplot: A boxplot is useful for visualizing the distribution of data, including the
median, quartiles, and potential outliers.
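A boxplot of the same ages can be drawn as in the sketch below; the plot options are assumptions. boxplot() also returns the statistics it draws, so the points it flags as potential outliers can be read from the result.

```r
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

b <- boxplot(ages,
             horizontal = TRUE,
             main = "Boxplot of Student Ages",
             xlab = "Age",
             col = "lightgreen")

# Points beyond 1.5 times the box length from the box are drawn as outliers
print(b$out)
```

For these data the flagged points are the ages 30 and 36.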
Output:

Scatter Plot: A scatter plot helps visualize the relationship between two
continuous variables. Since we only have one variable (ages) in this
example, we'll generate a simple scatter plot of ages against an index
(just to demonstrate).
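One way to draw the scatter plot described above is sketched here, using the position of each student in the vector as the index; the labels and plotting symbol are assumptions.

```r
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Index 1..20 on the x-axis, age on the y-axis
idx <- seq_along(ages)
plot(idx, ages,
     main = "Scatter Plot of Ages",
     xlab = "Index", ylab = "Age",
     pch = 19, col = "blue")
```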

Output:
Bar Chart: A bar chart is suitable for categorical data, showing the
frequency of each category. In this case, we'll create a bar chart showing
the frequency of each age.
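The bar chart can be built from a frequency table of the ages, as in this sketch; the title and colour are assumptions.

```r
ages <- c(18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 22, 23, 24, 27, 30, 36)

# Count how often each distinct age occurs
age_freq <- table(ages)

barplot(age_freq,
        main = "Frequency of Each Age",
        xlab = "Age", ylab = "Frequency",
        col = "orange")

print(age_freq)
```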

Output:
PRACTICAL:04
Aim: Simulate random outcomes and calculate probabilities for the case of coin
flips and card draws

Coin Flips Simulation: In this scenario, we will simulate a certain number of coin flips and calculate the probabilities of heads and tails using the sample() function:
sample(x, size, replace = FALSE, prob = NULL)
• x is the vector of elements from which you are sampling.
• size is the number of samples you wish to take.
• replace determines whether you are sampling with replacement or not. Sampling
without replacement means that sample will not pick the same value twice, and this
is the default behaviour. Pass replace = TRUE to sample if you wish to sample with
replacement.
• prob is a vector of probabilities or weights associated with x. It should be a vector
of nonnegative numbers of the same length as x. If the sum of prob is not 1, it will
be normalized. If this value is not provided, then each element of x is considered to
be equally likely.
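A coin-flip simulation using sample() might look like the sketch below. The number of flips and the seed are assumptions; the seed is only there to make the run reproducible.

```r
set.seed(42)      # assumed seed, for reproducibility only
n_flips <- 1000   # assumed number of flips

# Sample with replacement, since every flip can again be H or T
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Estimated probabilities: proportion of each outcome
p_heads <- mean(flips == "H")
p_tails <- mean(flips == "T")

print(table(flips))
cat("P(Heads) =", p_heads, " P(Tails) =", p_tails, "\n")
```

Because prob is not supplied, both outcomes are equally likely, which models a fair coin.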

Card Draw Simulation: To simulate drawing cards from a shuffled deck, here's an easy method:
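The original code is missing here, so the following is only a sketch of one such method. It assumes a standard 52-card deck, a 5-card hand, and 10,000 repeated single draws for estimating the probability of an ace; all of these choices are illustrative.

```r
# Build a standard 52-card deck (assumed composition)
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
ranks <- c("A", 2:10, "J", "Q", "K")
deck  <- paste(rep(ranks, times = 4), "of", rep(suits, each = 13))

set.seed(1)   # assumed seed, for reproducibility only

# Draw a 5-card hand WITHOUT replacement: no card can appear twice
hand <- sample(deck, size = 5)
print(hand)

# Estimate P(ace) from many single draws (each from a full deck,
# hence sampling with replacement)
n_draws      <- 10000
draws        <- sample(deck, size = n_draws, replace = TRUE)
p_ace_est    <- mean(startsWith(draws, "A of"))
p_ace_theory <- 4 / 52

cat("Estimated P(ace):", p_ace_est, " Theoretical:", p_ace_theory, "\n")
```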
PRACTICAL:05
Aim: Simulate experimental probabilities and compare them with theoretical
probabilities
Probability: Mathematical Approach
P(success) = (number of ways to get success) / (total number of possible outcomes)
Probability: Statistical Approach
P(success) = (number of times the event occurred) / (total number of trials of the
experiment)

1: Comparison of theoretical and experimental probability for flipping a coin
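The comparison for a coin can be sketched as follows, assuming a fair coin and 100 trials (the original number of trials is not given).

```r
set.seed(123)     # assumed seed, for reproducibility only
n_trials <- 100   # assumed number of flips

flips <- sample(c("Heads", "Tails"), size = n_trials, replace = TRUE)

# Statistical approach: times the event occurred / number of trials
experimental <- c(Heads = mean(flips == "Heads"),
                  Tails = mean(flips == "Tails"))

# Mathematical approach: favourable outcomes / possible outcomes
theoretical <- c(Heads = 1 / 2, Tails = 1 / 2)

coin_comparison <- data.frame(outcome      = names(theoretical),
                              theoretical  = as.numeric(theoretical),
                              experimental = as.numeric(experimental))
print(coin_comparison)
```

With more trials, the experimental values tend to move closer to the theoretical value of 0.5.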


2: Comparison of theoretical and experimental probability for rolling a die,
with sample size 35
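A sketch of the die comparison is shown below, reading "size 35" as 35 rolls of a fair six-sided die; the seed and plot options are assumptions.

```r
set.seed(2024)   # assumed seed, for reproducibility only
n_rolls <- 35

rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Experimental probability of each face; factor() keeps faces that never occur
experimental <- as.numeric(table(factor(rolls, levels = 1:6))) / n_rolls
theoretical  <- rep(1 / 6, 6)

die_comparison <- data.frame(face = 1:6, theoretical, experimental)
print(die_comparison)

# Side-by-side bars for each face
barplot(rbind(theoretical, experimental),
        beside = TRUE, names.arg = 1:6,
        col = c("grey", "steelblue"),
        legend.text = c("Theoretical", "Experimental"),
        main = "Die: Theoretical vs Experimental Probability",
        xlab = "Face", ylab = "Probability")
```

With only 35 rolls the experimental values can differ noticeably from 1/6.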
PRACTICAL:06
Aim: Calculate expected value and variance in the context of a single random
variable

Description:
There are two main types of random variables:
1. Discrete Random Variable: Takes on a countable number of distinct values. For
example, the number of heads in a series of coin flips.

a) Expected Value (E[X]): is a measure of the central tendency of a random variable. It represents the mean of the possible values that a random variable can take, weighted by their probabilities. It is calculated by summing the products of each value and its probability:

$E(X) = \sum_{i=1}^{n} x_i \, P(x_i)$

where $x_i$ are the possible values of the random variable $X$ and $P(x_i) = P(X = x_i)$ is the probability of $X$ taking the value $x_i$.

b) Variance (Var(X)): measures the spread or dispersion of a random variable's possible values around the expected value. It quantifies how much the values of the random variable differ from the expected value. A higher variance indicates greater variability in the values. Variance is calculated as the expected value of the squared deviation of a random variable from its mean, which simplifies to:

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$

Where:
• $E(X^2)$ is the expected value of $X^2$, i.e., $E(X^2) = \sum_{i=1}^{n} x_i^2 \, P(x_i)$,
• $E(X)$ is the expected value.


2. Continuous Random Variable: Takes on an infinite number of possible values
within a given range. For example, the height of individuals in a population.

a) Expected Value (E[X]): It is calculated by integrating the product of the value and
its probability density function over the entire range:

$E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx$

Where:
• $x$ is a possible value of the random variable $X$,
• $f(x)$ is the probability density function.

b) Variance (Var(X)): The variance of a continuous random variable X is defined as:

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$

Where:
• $E(X^2) = \int_{-\infty}^{\infty} x^2 \, f(x) \, dx$,
• $E(X)$ is the expected value calculated earlier.


For Discrete Random Variable:
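The original example for the discrete case is missing. As an illustration only, the sketch below assumes X is the number of heads in 3 flips of a fair coin and applies the two formulas given above.

```r
# Assumed example: X = number of heads in 3 fair coin flips
x <- 0:3
p <- dbinom(x, size = 3, prob = 0.5)   # P(X = x): 1/8, 3/8, 3/8, 1/8

# E(X) = sum of x_i * P(x_i)
expected_value <- sum(x * p)

# E(X^2) = sum of x_i^2 * P(x_i)
expected_x2 <- sum(x^2 * p)

# Var(X) = E(X^2) - [E(X)]^2
variance <- expected_x2 - expected_value^2

cat("E(X) =", expected_value, " Var(X) =", variance, "\n")
```

For this example E(X) = 1.5 and Var(X) = 0.75.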

For Continuous Random Variable:


Given the continuous random variable X with the probability density function (PDF):

$f(x) = \begin{cases} 6x(1 - x), & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$

We are tasked with finding the expected value and variance of X.
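One way to compute these quantities is numerical integration with R's integrate() function, as sketched below. Since f(x) is zero outside (0, 1), the integrals only need to run from 0 to 1. The variable names are illustrative.

```r
# PDF from the question: 6x(1 - x) on (0, 1), zero elsewhere
f <- function(x) ifelse(x > 0 & x < 1, 6 * x * (1 - x), 0)

# Sanity check: a valid PDF integrates to 1
total_prob <- integrate(f, lower = 0, upper = 1)$value

# E(X) = integral of x * f(x)
expected_value <- integrate(function(x) x * f(x), lower = 0, upper = 1)$value

# E(X^2) = integral of x^2 * f(x)
expected_x2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 1)$value

# Var(X) = E(X^2) - [E(X)]^2
variance <- expected_x2 - expected_value^2

cat("E(X) =", expected_value, " Var(X) =", variance, "\n")
```

The results agree with the exact values obtained by integrating by hand: E(X) = 6(1/3 - 1/4) = 0.5, E(X^2) = 6(1/4 - 1/5) = 0.3, and Var(X) = 0.3 - 0.25 = 0.05.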
PRACTICAL:07
Aim: Generate and plot probabilities for events in discrete and continuous
distributions (Binomial, Poisson, Geometric, and Normal).

Description:
Binomial Distribution
The binomial distribution is a discrete probability distribution that models the
number of successes in a fixed number of independent trials of a binary
experiment. Each trial can result in one of two outcomes: "success" or "failure."
Key Characteristics

1. Number of Trials (𝑛): The total number of independent trials or experiments.

2. Probability of Success (𝑝): The probability of success in a single trial.

3. Probability of Failure (q): The probability of failure in a single trial, calculated as q = 1 − p.

4. Random Variable (X): Represents the number of successes in n trials.

Probability Mass Function (PMF)
$P(X = k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0, 1, \dots, n$

Formula
R provides several built-in functions for handling the binomial distribution.

1. dbinom(x, size, prob): This function gives the probability mass P(X = x) at
each point x.

2. pbinom(q, size, prob, lower.tail = TRUE): This function gives the cumulative
probability P(X ≤ q).
Parameters:
• x: Number of successes (can be a vector).
• q: The quantile (number of successes).
• size: Number of trials.
• prob: Probability of success on each trial.
• lower.tail: If TRUE (default), probabilities are P(X ≤ q); if FALSE, P(X > q).

Question: Calculate the binomial probabilities for the experiment of flipping a fair coin 20 times. Also, create a bar plot to visualize the
probabilities for each possible outcome (number of heads).
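The coin-flip exercise above can be sketched with dbinom() and pbinom(). The cumulative probability for at most 10 heads is included only as an illustration of pbinom(); it is not asked for in the text.

```r
n <- 20    # number of flips
p <- 0.5   # probability of heads for a fair coin

# P(X = k) for every possible number of heads k = 0, 1, ..., 20
k   <- 0:n
pmf <- dbinom(k, size = n, prob = p)

# Illustration of the cumulative function: P(X <= 10)
p_at_most_10 <- pbinom(10, size = n, prob = p)

barplot(pmf, names.arg = k,
        main = "Binomial Distribution (n = 20, p = 0.5)",
        xlab = "Number of heads", ylab = "Probability",
        col = "skyblue")

cat("P(X = 10)  =", pmf[11], "\n")   # pmf[11] corresponds to k = 10
cat("P(X <= 10) =", p_at_most_10, "\n")
```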
Poisson Distribution
The Poisson distribution is a discrete probability distribution that expresses the
probability of a given number of events occurring within a fixed
interval of time or space, given that these events occur with a known constant
mean rate and independently of the time since the last event. It is sometimes called the
distribution of rare events.
Key Characteristics

1. Parameter (𝜆): The average number of events in the given interval. It is also the
mean and variance of the distribution.
2. Events: The events must occur independently. That is, the occurrence of one
event does not affect the occurrence of another.
3. Interval: The interval can be time, space, or any other measurable quantity.
4. For a small interval, the probability of the event occurring is proportional to the
size of the interval.
5. The probability of more than one occurrence in the small interval is negligible.

Parameters (of the built-in functions dpois(x, lambda, log = FALSE) and ppois(q, lambda, lower.tail = TRUE)):
• x or q: Number of events (can be a vector).
• lambda: Average number of times the event occurs in the interval.
• log: A logical value. If TRUE, the function returns the logarithm of the probability; if
FALSE (the default), it returns the actual probability.
• lower.tail: If TRUE (default), probabilities are P(X ≤ q); if FALSE, P(X > q).
A bookstore sells an average of 6 books per hour. What is the probability that the
bookstore sells exactly 4 books in a given hour? Additionally, visualize the
probability mass function (PMF) of the Poisson distribution for the number of books
sold from 0 to 10 in one hour.
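A sketch of the bookstore problem using dpois() follows; the variable names and plot options are illustrative.

```r
lambda <- 6   # average number of books sold per hour

# Probability of selling exactly 4 books in an hour
p_exactly_4 <- dpois(4, lambda = lambda)
cat("P(X = 4) =", p_exactly_4, "\n")

# PMF for 0 to 10 books sold in one hour
books <- 0:10
pmf   <- dpois(books, lambda = lambda)

barplot(pmf, names.arg = books,
        main = "Poisson Distribution (lambda = 6)",
        xlab = "Number of books sold in one hour", ylab = "Probability",
        col = "lightgreen")
```

The probability of exactly 4 sales is exp(-6) * 6^4 / 4!, which is about 0.134.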
Geometric Distribution
The geometric distribution is a discrete probability distribution that models the
number of trials required to achieve the first success in a sequence of independent
Bernoulli trials (where each trial has two possible outcomes: success or failure). It is
particularly useful in scenarios where you want to determine how many attempts it
takes before the first success occurs. R's built-in functions use the closely related
convention of counting the number of failures before the first success, which is one
less than the number of trials.
Key Characteristics

1. Parameter (𝑝): The probability of success on each trial.


2. Trials: Each trial is independent, meaning the outcome of one trial does not affect
the others.

Formula
$P(X = x) = (1 - p)^x \, p, \quad x = 0, 1, 2, \dots$
where x is the number of failures before the first success.
R provides several built-in functions for handling the Geometric distribution.
1. dgeom(x, prob, log = FALSE): Calculates the probability of having exactly x
failures before the first success.
2. pgeom(q, prob, lower.tail = TRUE, log.p = FALSE): Calculates the probability of
having at most q failures before the first success.
In a scenario where the probability of success in a Bernoulli trial is 0.4, what is:
1. The probability of experiencing exactly 5 failures before achieving the first
success?
2. The cumulative probability of experiencing at most 5 failures before the first
success?
3. Additionally, visualize the probability mass function (PMF) of the geometric
distribution for the first 20 trials.
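The three parts of this question can be sketched with dgeom() and pgeom(). For the plot, "the first 20 trials" is read here as the first success occurring on trial 1 to 20, which corresponds to 0 to 19 failures; that reading is an assumption.

```r
p <- 0.4   # probability of success on each trial

# 1. Exactly 5 failures before the first success
p_exactly_5 <- dgeom(5, prob = p)

# 2. At most 5 failures before the first success
p_at_most_5 <- pgeom(5, prob = p)

cat("P(X = 5)  =", p_exactly_5, "\n")
cat("P(X <= 5) =", p_at_most_5, "\n")

# 3. PMF for the first 20 trials: x failures means success on trial x + 1
failures <- 0:19
pmf      <- dgeom(failures, prob = p)

barplot(pmf, names.arg = failures + 1,
        main = "Geometric Distribution (p = 0.4)",
        xlab = "Trial of first success", ylab = "Probability",
        col = "salmon")
```

These match the closed forms 0.6^5 * 0.4 = 0.031104 and 1 - 0.6^6 = 0.953344.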
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution that is symmetric about its mean, indicating that data near
the mean are more frequent in occurrence than data far from the mean.
Key Characteristics
1. Bell-Shaped Curve: The graph of the normal distribution is bell-shaped and
symmetric around the mean. Fifty percent of the values lie to the left of
the mean and the other fifty percent lie to the right of the mean.

2. Mean (μ): The central value of the distribution, which is also its median and
mode.

3. Standard Deviation (σ): A measure of the dispersion or spread of the distribution. It determines the width of the bell curve.

4. Notation: A normal distribution is often denoted as N(μ, σ²), where μ is the mean and σ² is the variance.
Parameters (of the built-in functions dnorm(x, mean = 0, sd = 1) and pnorm(q, mean = 0, sd = 1)):
• x or q: Vector of quantiles.
• mean: The mean of the distribution. Its default value is 0.
• sd: The standard deviation of the distribution. Its default value is 1.
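No exercise is given for the normal distribution, so the sketch below assumes the standard normal distribution (mean 0, sd 1) and uses dnorm() for the density curve and pnorm() for probabilities. The plotted range of -4 to 4 is also an assumption.

```r
mu    <- 0   # assumed mean
sigma <- 1   # assumed standard deviation

# Density curve over mu +/- 4 standard deviations
x       <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
density <- dnorm(x, mean = mu, sd = sigma)

plot(x, density, type = "l", lwd = 2, col = "purple",
     main = "Normal Distribution (mean = 0, sd = 1)",
     xlab = "x", ylab = "Density")
abline(v = mu, lty = 2)   # the curve is symmetric about the mean

# Half of the probability lies below the mean
p_below_mean <- pnorm(mu, mean = mu, sd = sigma)

# Probability of falling within one standard deviation of the mean
p_within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

cat("P(X <= mean) =", p_below_mean, "\n")
cat("P(|X - mean| <= sd) =", p_within_1sd, "\n")
```

About 68% of the values fall within one standard deviation of the mean.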
PRACTICAL:08
Aim: Fit simple linear regression models using built-in functions.
Formula:

$y = \beta_0 + \beta_1 x + \epsilon$

Parameters:
1. y is the dependent variable,
2. x is the independent variable,
3. β0 is the intercept,
4. β1 is the slope (coefficient for x),
5. ϵ is the error term.
Plot the wt vs mpg data as a scatter plot and then overlay the regression line in
red, showing the linear relationship between weight and miles per gallon.
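The fit and plot can be sketched with the built-in lm() function. The sketch assumes that wt and mpg refer to the columns of R's built-in mtcars dataset, which the surviving text does not state explicitly.

```r
# Assumption: wt (weight) and mpg (miles per gallon) come from mtcars.
# Fit mpg = b0 + b1 * wt + error by ordinary least squares.
model <- lm(mpg ~ wt, data = mtcars)
print(summary(model))

# Estimated intercept (b0) and slope (b1)
b <- coef(model)
cat("Intercept:", b[["(Intercept)"]], " Slope:", b[["wt"]], "\n")

# Scatter plot of the data with the fitted line overlaid in red
plot(mtcars$wt, mtcars$mpg,
     main = "Simple Linear Regression: mpg vs wt",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     pch = 19, col = "blue")
abline(model, col = "red", lwd = 2)
```

The slope is negative: under this fit, heavier cars tend to have lower miles per gallon.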

THANK YOU:
