Unit Iii

The document provides an overview of statistics and probability, focusing on data visualization techniques in R, including bar charts, pie charts, histograms, boxplots, and scatterplots. It explains the syntax and parameters for creating these visualizations, along with examples using R code. Additionally, it covers basic statistical concepts such as average, variance, standard deviation, and random variables, including discrete and continuous types.

Uploaded by

lonow36672

STATISTICS AND PROBABILITY

UNIT-III
STATISTICS AND PROBABILITY
Basic Data Visualization
Data visualization is an efficient technique for gaining insight into data through
a visual medium.
With data visualization techniques, we can work with large datasets and
efficiently extract key insights from them.
Graphics play an important role in bringing out the important features of the data.

R Bar Charts
A bar chart is a pictorial representation in which the numerical values of variables
are represented by the length or height of lines or rectangles of equal width.
A bar chart is used for summarizing a set of categorical data.
In a bar chart, the data is shown as rectangular bars whose length is
proportional to the value of the variable.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)

S.No  Parameter  Description
1.    H          A vector or matrix which contains the numeric values used in the bar chart.
2.    xlab       A label for the x-axis.
3.    ylab       A label for the y-axis.
4.    main       A title of the bar chart.
5.    names.arg  A vector of names that appear under each bar.
6.    col        It is used to give colors to the bars in the graph.

BGS FGC Mysuru


Example:
# Creating the data for the bar chart
H <- c(12, 35, 54, 3, 41)
M <- c("Feb", "Mar", "Apr", "May", "Jun")
# Giving the chart file a name
png(file = "bar_properties.png")
# Plotting the bar chart
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "green",
        main = "Revenue Bar chart", border = "red")
# Saving the file
dev.off()

Output:


Group Bar Chart & Stacked Bar Chart


We can create bar charts with groups of bars (grouped) or stacks of bars (stacked)
by passing a matrix as the input. Each row of the matrix is one series and each
column is one bar position. By default, barplot() stacks the rows within each bar.
Example:
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
# Creating the matrix of the values (one row per region, one column per month).
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                 nrow = 3, ncol = 5, byrow = TRUE)
# Giving the chart file a name
png(file = "stacked_chart.png")
# Creating the bar chart
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = c("cadetblue3", "deeppink2", "goldenrod1"))
# Adding the legend to the chart
legend("topleft", regions, cex = 1.3,
       fill = c("cadetblue3", "deeppink2", "goldenrod1"))
# Saving the file
dev.off()
Output:
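The example above produces a stacked chart. As a minimal sketch of the grouped variant, the same matrix can be drawn with beside = TRUE, which places the rows side by side within each month. The names region_colors and bar_mids are illustrative only, and the plot goes to the currently active graphics device because no png() call is made here.

```r
# Same data as the stacked example: 3 regions (rows) x 5 months (columns)
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                 nrow = 3, ncol = 5, byrow = TRUE)
region_colors <- c("cadetblue3", "deeppink2", "goldenrod1")  # illustrative name

# beside = TRUE draws the rows side by side instead of stacking them.
# barplot() returns the bar midpoints (one per bar).
bar_mids <- barplot(Values, beside = TRUE, main = "Total Revenue",
                    names.arg = months, xlab = "Month", ylab = "Revenue",
                    col = region_colors)
legend("topleft", regions, cex = 1.3, fill = region_colors)

# The column totals are the total bar heights in the stacked version
print(colSums(Values))
```

With beside = TRUE each month shows three separate bars, which makes it easier to compare regions within a month; the stacked form makes it easier to compare monthly totals.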


R Pie Charts
A pie chart represents values as slices of a circle, drawn in different colors.
Slices are labeled with a description, and the numbers corresponding to each slice
can also be shown in the chart.
Pie charts are created with the pie() function, which takes a vector of
non-negative numbers as input.
Syntax:
pie(x, labels, radius, main, col, clockwise)

S.No  Parameter  Description
1.    x          A vector that contains the numeric values used in the pie chart.
2.    labels     Used to give a description to the slices.
3.    radius     Describes the radius of the pie chart.
4.    main       Describes the title of the chart.
5.    col        Defines the colour palette.
6.    clockwise  A logical value that indicates whether slices are drawn in the clockwise or anti-clockwise direction.

Example:
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
# Giving the chart file a name (png() writes a PNG file, so use a .png extension).
png(file = "title_color.png")
# Plotting the chart.
pie(x, labels, main = "Country Pie chart", col = rainbow(length(x)))
# Saving the file.
dev.off()
Output:

Slice Percentage & Chart Legend


There are two additional properties of the pie chart, i.e., slice percentage and chart
legend. We can label the slices with percentages, and we can add a legend to a
plot in R by using the legend() function.
Example:
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
pie_percent <- round(100 * x / sum(x), 1)
# Giving the chart file a name.
png(file = "per_pie.png")
# Plotting the chart.
pie(x, labels = pie_percent, main = "Country Pie Chart", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
# Saving the file.
dev.off()
Output:
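In the example above the slices carry only the percentage and the legend carries the country names. As a small sketch of an alternative, the two can be combined into one label per slice with paste0(). The name slice_labels is illustrative, and the plot goes to the active graphics device.

```r
# Data from the pie chart example
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")

# Percentage share of each slice, rounded to one decimal place
pie_percent <- round(100 * x / sum(x), 1)

# Combine name and percentage into one label per slice (illustrative name)
slice_labels <- paste0(labels, " (", pie_percent, "%)")
print(slice_labels)

# Draw the chart with the combined labels
pie(x, labels = slice_labels, main = "Country Pie Chart",
    col = rainbow(length(x)))
```

With combined labels a separate legend is usually not needed.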


R Histogram
A histogram is a type of bar chart that shows how many values fall into each of a
set of consecutive value ranges (bins).
A histogram is used to show the distribution of a single variable, whereas a bar
chart is used for comparing different entities.
In a histogram, the height of each bar represents the number of values present
in the given range.
For creating a histogram, R provides the hist() function, which takes a vector as
input.

Syntax:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)

S.No  Parameter  Description
1.    v          It is a vector that contains numeric values.
2.    main       It indicates the title of the chart.
3.    col        It is used to set the color of the bars.
4.    border     It is used to set the border color of each bar.
5.    xlab       It is used to describe the x-axis.
6.    ylab       It is used to describe the y-axis.
7.    xlim       It is used to specify the range of values on the x-axis.
8.    ylim       It is used to specify the range of values on the y-axis.
9.    breaks     It is used to specify the breakpoints between bars, or the suggested number of bars.

Example:
# Creating data for the graph.
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Creating the histogram.
hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red")
# Saving the file.
dev.off()

Example: Use of xlim & ylim parameter
# Creating data for the graph.
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)
# Giving a name to the chart file.
png(file = "histogram_chart_lim.png")
# Creating the histogram.
# xlim and ylim only restrict the visible region; bars outside it are clipped.
hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red",
     xlim = c(0, 40), ylim = c(0, 3), breaks = 5)
# Saving the file.
dev.off()
Output:
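The xlim and ylim values in the example above (c(0, 40) and c(0, 3)) are narrower than the data, so part of the histogram is not visible. One way to check this before choosing limits, sketched here, is to compute the histogram without drawing it using plot = FALSE and inspect the bins and counts. The name h is illustrative.

```r
# Data from the histogram examples
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Compute the histogram without drawing it
h <- hist(v, plot = FALSE)

# Bin boundaries and the number of values in each bin
print(h$breaks)
print(h$counts)

# Compare the data range and tallest bar with the chosen xlim / ylim
print(range(v))        # the data extends beyond 40
print(max(h$counts))   # the tallest bar
```

If the tallest bar exceeds the upper ylim, or the data range exceeds xlim, the limits can be widened so that the whole distribution is shown.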

R Boxplot

A boxplot summarizes how the values in a data set are distributed. The quartiles
divide the data into four equal parts, and the plot shows the minimum, first
quartile, median, third quartile, and maximum.

Boxplots are also useful for comparing the distributions of several groups in a
data set by drawing a boxplot for each of them.


R provides a boxplot() function to create a boxplot.

Syntax:

boxplot(x, data, notch, varwidth, names, main)

S.No  Parameter  Description
1.    x          It is a vector or a formula.
2.    data       It is the data frame.
3.    notch      It is a logical value; set it to TRUE to draw a notch.
4.    varwidth   It is a logical value; if TRUE, the boxes are drawn with widths proportional to the square roots of the sample sizes of the groups.
5.    names      It is the group of labels that will be printed under each boxplot.
6.    main       It is used to give a title to the graph.

Example: In the example below, we use the "mtcars" dataset that ships with R.
We use only two of its columns, "mpg" and "cyl". The example creates a boxplot
of mpg against cyl, i.e., miles per gallon grouped by number of cylinders.

# Giving a name to the chart file.
png(file = "boxplot.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
        ylab = "Miles Per Gallon", main = "R Boxplot Example")
# Saving the file.
dev.off()

Output:
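The numbers behind each box can be inspected directly. As a sketch, calling boxplot() with plot = FALSE returns the five summary values per group instead of drawing them. The names bp, five_num and group_medians are illustrative; note that R's boxplot uses hinges, which are close to, but not always identical to, the quartiles.

```r
# Compute the boxplot statistics for mpg grouped by cyl without drawing
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# One column per group; rows are the five values drawn in each box
five_num <- bp$stats
dimnames(five_num) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  bp$names
)
print(five_num)

# Cross-check: the middle row is the median mpg of each cylinder group
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(group_medians)
```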

Boxplot using notch

In R, we can draw a notched boxplot by setting notch = TRUE. The notches give a
rough indication of whether the medians of the groups differ.

Example:

# Giving a name to our chart.
png(file = "boxplot_using_notch.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Quantity of Cylinders",
        ylab = "Miles Per Gallon",
        main = "Boxplot Example",
        notch = TRUE,
        varwidth = TRUE,
        col = c("green", "yellow", "red"),
        names = c("High", "Medium", "Low"))
# Saving the file.
dev.off()

Output:

R Scatterplots
Scatterplots are used to compare two variables, for example when we need to see
how strongly one variable is related to another.
In a scatterplot, the data is represented as a collection of points. Each point on
the scatterplot gives the values of the two variables.
One variable is selected for the vertical axis and the other for the horizontal axis.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

S.No  Parameter  Description
1.    x          It is the dataset whose values are the horizontal coordinates.
2.    y          It is the dataset whose values are the vertical coordinates.
3.    main       It is the title of the graph.
4.    xlab       It is the label on the horizontal axis.
5.    ylab       It is the label on the vertical axis.
6.    xlim       It is the limits of the x values used for plotting.
7.    ylim       It is the limits of the y values used for plotting.
8.    axes       It indicates whether both axes should be drawn on the plot.

Example: In our example, we will use the dataset "mtcars", which is a
predefined dataset available in the R environment.
# Fetching two columns from mtcars
data <- mtcars[, c("wt", "mpg")]
# Giving a name to the chart file.
png(file = "scatterplot.png")
# Plotting the chart for cars with weight between 2.5 and 5 and mileage
# between 15 and 30.
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30), main = "Weight vs Mileage")
# Saving the file.
dev.off()
Output:
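A scatterplot shows the relationship visually; a single number that summarizes a linear relationship is the correlation coefficient. As a brief sketch, cor() can be applied to the same two columns. The name wt_mpg_cor is illustrative.

```r
# Pearson correlation between weight and miles per gallon in mtcars
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)

# A value close to -1 means mpg tends to fall as weight rises,
# which matches the downward trend of the points in the scatterplot.
```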


Scatterplot using ggplot2


In R, there is another way of creating a scatterplot, i.e., with the help of the
ggplot2 package.
The ggplot2 package provides the ggplot() and geom_point() functions for creating
a scatterplot. The first argument of ggplot() is the data frame, and the second is
a call to aes(), in which we map columns to the x-axis and y-axis.
Example:
# Loading ggplot2 package
library(ggplot2)
# Giving a name to the chart file.
png(file = "scatterplot_ggplot.png")
# Plotting the chart using ggplot() and geom_point() functions.
ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point()
# Saving the file.
dev.off()
Output:


Statistics
Statistics is a form of mathematical analysis that concerns the collection,
organization, analysis, interpretation, and presentation of data.
R – Statistics
R is a programming language and environment used for statistical computing
and graphics.
Average in R Programming: The average (mean) is calculated by dividing the sum
of the values in the set by their number. In R it is computed with mean().

Variance in R Programming Language: Variance is the average of the squared
differences between each value and the mean. R's var() computes the sample
variance, which divides the sum of squared differences by n - 1.

Standard Deviation in R Programming Language: Standard deviation is the
square root of the variance. In R it is computed with sd().
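The three definitions above can be checked by hand against R's built-in functions. This is a small sketch on made-up numbers; the vector x and the names avg, sample_var and pop_var are illustrative only.

```r
# Illustrative data (not from the notes)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Average: sum of the values divided by their number
avg <- sum(x) / n
print(c(manual = avg, builtin = mean(x)))

# Sample variance: sum of squared differences from the mean, divided by n - 1
sample_var <- sum((x - avg)^2) / (n - 1)
print(c(manual = sample_var, builtin = var(x)))

# Standard deviation: square root of the variance
print(c(manual = sqrt(sample_var), builtin = sd(x)))

# Population variance divides by n instead; R has no separate built-in for it
pop_var <- sum((x - avg)^2) / n
print(pop_var)
```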

Random Variable
A random variable is a function that assigns a real value to each outcome of a
random experiment.

The two types of random variables are:

1) Discrete Random Variable
In probability theory, a discrete random variable is a random variable that can
take on a finite or countably infinite number of distinct values. These values are
often integers, but they can be any other discrete values as well.
For example, the number of heads obtained after flipping a coin three times is a
discrete random variable. The possible values of this variable are 0, 1, 2, or 3.
2) Continuous Random Variable
Consider a general experiment rather than any particular one. Suppose the
outcome of the experiment can take values in some interval (a, b), so that every
single point in the interval is a possible outcome.
Thus, X = {x : x belongs to (a, b)}
Example: The speed of a vehicle on a highway.
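Both kinds of random variable can be illustrated with a short sketch. The discrete part uses the three-coin-flip example from the text (assuming a fair coin). For the continuous part, the speed is assumed, purely for illustration, to be uniformly distributed between 40 and 120; that range and distribution are not from the notes.

```r
# Discrete: number of heads in three flips of a fair coin
heads <- 0:3
pmf <- dbinom(heads, size = 3, prob = 0.5)
names(pmf) <- heads
print(pmf)        # probability of 0, 1, 2 and 3 heads
print(sum(pmf))   # the probabilities add up to 1

# Continuous: speed assumed uniform on (40, 120) for illustration only.
# Probabilities are assigned to intervals, not to single points.
p_interval <- punif(80, min = 40, max = 120) - punif(60, min = 40, max = 120)
print(p_interval) # P(60 <= speed <= 80)
```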

Common Probability Distribution


Common Probability Mass Functions
1) Bernoulli Distribution
The Bernoulli distribution is the special case of the binomial distribution in
which only a single trial is performed. It is a discrete probability distribution for
a Bernoulli trial (a trial that has only two outcomes, i.e. either success or failure).
For example, a fair coin toss is a Bernoulli trial in which the probability of
getting a head is 0.5 and the probability of getting a tail is 0.5. It is the
probability distribution of a random variable that takes the value 1 with
probability p and the value 0 with probability q = 1 - p.
The probability mass function f of this distribution, over possible outcomes k, is
given by:
$$f(k; p) = p^k (1 - p)^{1 - k}, \quad k \in \{0, 1\}$$

In R, the Rlab package provides 4 functions for the Bernoulli distribution (they
are not part of base R). They are:
dbern()
The dbern() function gives the probability mass (density) function of the
Bernoulli distribution.
Syntax: dbern(x, prob, log = FALSE)
Parameter:
x: vector of quantiles
prob: probability of success on each trial
log: logical; if TRUE, probabilities p are given as log(p)
Example:
# Importing the Rlab library
library(Rlab)
# x values for the dbern() function
# (values other than 0 and 1 are outside the support of the distribution)
x <- c(0, 1, 3, 5, 7, 10)
# Using dbern() function to obtain the corresponding Bernoulli probabilities
y <- dbern(x, prob = 0.5)
# Plotting dbern values
plot(x, y, type = "o")
Output:


pbern()
The pbern() function gives the distribution function of the Bernoulli
distribution.
The distribution function, also called the cumulative distribution function (CDF)
or cumulative frequency function, describes the probability that a variate X takes
on a value less than or equal to a number x.
Syntax: pbern(q, prob, log.p = FALSE)
Parameter:
q: vector of quantiles
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).

qbern()
qbern() gives the quantile function of the Bernoulli distribution. A quantile
function specifies the value of the random variable such that the probability of
the variable being less than or equal to that value equals the given probability.
Syntax: qbern(p, prob, log.p = FALSE)
Parameter:
p: vector of probabilities
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).

rbern()
The rbern() function is used to generate a vector of random numbers which are
Bernoulli distributed.
Syntax: rbern(n, prob)
Parameter:
n: number of observations.
prob: probability of success on each trial.
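Because a Bernoulli trial is a binomial trial with a single attempt, the same four quantities can also be obtained in base R, without the Rlab package, by using the binomial functions with size = 1. The sketch below takes this route; the success probability 0.3 and the variable names are illustrative.

```r
p <- 0.3  # illustrative probability of success

# Mass function: P(X = 0) and P(X = 1)
bern_pmf <- dbinom(c(0, 1), size = 1, prob = p)
print(bern_pmf)

# Distribution function: P(X <= 0)
bern_cdf0 <- pbinom(0, size = 1, prob = p)
print(bern_cdf0)

# Quantile function: smallest k with P(X <= k) >= 0.5
bern_q <- qbinom(0.5, size = 1, prob = p)
print(bern_q)

# Random generation: ten Bernoulli draws (each is 0 or 1)
set.seed(1)
draws <- rbinom(10, size = 1, prob = p)
print(draws)
```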

2) Binomial Distribution
The binomial distribution is a discrete probability distribution for the number of
successes in a fixed number n of trials, each of which has only two outcomes, i.e.
success or failure. The trials are independent: the probability of success p
remains the same in every trial, and the previous outcome does not affect the
next outcome.
The binomial distribution helps us find individual probabilities as well as
cumulative probabilities over a certain range.
In mathematical terms, for a discrete random variable X and a realization X = x,
the binomial mass function is
$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \dots, n$$

In R Programming Language, there are 4 built-in functions for the binomial
distribution. They are:
dbinom()
The function dbinom() is used to find the probability of exactly k successes, i.e.
it finds P(X = k).
Syntax:
dbinom(k, n, p)

pbinom()
The function pbinom() is used to find the cumulative probability of data
following a binomial distribution up to a given value, i.e. it finds P(X <= k).
Syntax:
pbinom(k, n, p)

qbinom()
This function is used to find the quantile for a given probability, that is, if
P = P(X <= k) is given, it finds k.
Syntax:
qbinom(P, n, p)

rbinom()
This function generates N random values from a binomial distribution with the
given parameters.
Syntax:
rbinom(N, n, p)
where N is the number of random values to generate, n is the total number of
trials, p is the probability of success, and k is the value at which the probability
has to be found.
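A short worked sketch of the binomial functions, using 10 tosses of a fair coin as an illustrative setting (the numbers and variable names are assumptions for the example, not from the notes). The exact probability is also computed by hand from the mass function as a cross-check.

```r
n <- 10    # number of trials (illustrative)
p <- 0.5   # probability of success (illustrative)

# P(X = 3), from dbinom() and by hand from the mass function
p_exact <- dbinom(3, size = n, prob = p)
p_manual <- choose(n, 3) * p^3 * (1 - p)^(n - 3)
print(c(dbinom = p_exact, manual = p_manual))

# P(X <= 3): the sum of the individual probabilities for 0, 1, 2, 3
p_cum <- pbinom(3, size = n, prob = p)
print(p_cum)

# Smallest k such that P(X <= k) >= 0.5
q_med <- qbinom(0.5, size = n, prob = p)
print(q_med)

# Five random values, each the number of successes in n trials
set.seed(42)
draws <- rbinom(5, size = n, prob = p)
print(draws)
```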

3) Poisson Distribution
The Poisson distribution gives the probability of a given number of events
happening in a fixed interval of time or space, if these events happen
independently and with a known constant mean rate.
In mathematical terms, for a discrete random variable X and a realization X = x,
the Poisson mass function f is given as follows, where λ is the parameter of the
distribution (the mean number of events per interval).
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

There are four Poisson functions available in R:

dpois()
This function gives the Poisson probability mass (density). The function dpois()
calculates the probability that the random variable takes exactly a given value.
Syntax:
dpois(k, lambda, log = FALSE)
where,
k: number of events that happened in an interval
lambda: mean number of events per interval (λ)
log: if TRUE, the function returns the probability in the form of log

ppois()
This function gives the cumulative probability function. The function ppois()
calculates the probability that the random variable will be less than or equal to
a given number.
Syntax:
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
where,
q: number of events that happened in an interval
lambda: mean number of events per interval (λ)
lower.tail: if TRUE, the left tail P(X <= q) is considered; if FALSE, the right tail
P(X > q) is considered
log.p: if TRUE, the function returns the probability in the form of log

rpois()
The function rpois() is used for generating random numbers from a given
Poisson distribution.
Syntax:
rpois(n, lambda)
where,
n: number of random numbers needed
lambda: mean number of events per interval (λ)

qpois()
The function qpois() is used for finding quantiles of a given Poisson
distribution.
In probability, quantiles are cut points that divide the range of a probability
distribution into intervals which have equal probabilities.
Syntax:
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
where,
p: probability for which the quantile is required
lambda: mean number of events per interval (λ)
lower.tail: if TRUE, the left tail is considered; if FALSE, the right tail is
considered
log.p: if TRUE, the given probability is interpreted as a log probability
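A worked sketch of the four Poisson functions, assuming for illustration a mean of 3 events per interval (the value and variable names are not from the notes). The point probability is also computed by hand from the mass function.

```r
lambda <- 3  # illustrative mean number of events per interval

# P(X = 2), from dpois() and by hand from the mass function
p_two <- dpois(2, lambda)
p_manual <- lambda^2 * exp(-lambda) / factorial(2)
print(c(dpois = p_two, manual = p_manual))

# P(X <= 2) and its complement P(X > 2)
p_at_most_two <- ppois(2, lambda)
p_more_than_two <- ppois(2, lambda, lower.tail = FALSE)
print(c(left = p_at_most_two, right = p_more_than_two))

# Smallest k such that P(X <= k) >= 0.5
q_half <- qpois(0.5, lambda)
print(q_half)

# Five random counts from the distribution
set.seed(10)
counts <- rpois(5, lambda)
print(counts)
```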

Common Probability Density Functions


1) Uniform Distribution
The continuous uniform distribution is the probability distribution of a random
number selected from the continuous interval between a and b.
A uniform distribution has the same probability density over the entire interval.
Thus, its plot is a rectangle, and therefore it is often referred to as the
rectangular distribution.

Probability Density Function
The dunif() function in R calculates the uniform density function on the
specified interval (a, b).
Syntax:
dunif(x, min = 0, max = 1, log = FALSE)
Parameter:
x: input sequence
min, max: range of values
log: logical; if TRUE, the densities are returned as logarithms.
Cumulative probability distribution
The punif() function in R is used to calculate the uniform cumulative distribution
function, that is, the probability of a variable X taking a value less than or equal
to x (that is, X <= x).
Syntax:
punif(q, min = 0, max = 1, lower.tail = TRUE)

The runif() function in R is used to generate a sequence of random numbers
following the uniform distribution.
Syntax:
runif(n, min = 0, max = 1)
Parameter:
n: number of random samples
min: minimum value (by default 0)
max: maximum value (by default 1)
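A brief sketch of the three uniform functions on the interval (0, 2); the interval and variable names are chosen only for illustration. On this interval the density is the constant 1 / (2 - 0) = 0.5.

```r
a <- 0  # illustrative lower limit
b <- 2  # illustrative upper limit

# Density at a point inside the interval: 1 / (b - a)
dens <- dunif(0.5, min = a, max = b)
print(dens)

# Cumulative probability P(X <= 0.5): (0.5 - a) / (b - a)
cum <- punif(0.5, min = a, max = b)
print(cum)

# Five random values from the interval
set.seed(7)
samples <- runif(5, min = a, max = b)
print(samples)
```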

2) Normal Distribution in R
The normal distribution is a probability distribution used in statistics that
describes how the data values are distributed.
For example, the height of a population, shoe size, and IQ level are
approximately normally distributed.
It is generally observed that data is approximately normally distributed when it
is a random collection of data from independent sources. The graph produced
by plotting the value of the variable on the x-axis and the count of the value on
the y-axis is a bell-shaped curve.
The graph signifies that the peak point is the mean of the data set, and half of
the values of the data set lie on the left side of the mean and the other half lie on
the right side of the mean.

In R, there are 4 built-in functions for the normal distribution:

dnorm()
The dnorm() function gives the density function of the normal distribution.
Syntax:
dnorm(x, mean, sd)
pnorm()
The pnorm() function is the cumulative distribution function, which gives the
probability that a random number X takes a value less than or equal to x.
Syntax:
pnorm(x, mean, sd)
qnorm()
The qnorm() function is the inverse of the pnorm() function. It takes a
probability value and returns the value of x that corresponds to that probability.
It is useful in finding the percentiles of a normal distribution.
Syntax:
qnorm(p, mean, sd)
rnorm()
The rnorm() function is used to generate a vector of random numbers which are
normally distributed.
Syntax:
rnorm(n, mean, sd)
where n is the number of random values to generate.
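The four normal functions can be tried out as follows. This is a sketch on the standard normal distribution (mean 0, sd 1); the simulated heights with mean 170 and sd 10 are made-up values used only to show rnorm().

```r
# Density at the mean of the standard normal: 1 / sqrt(2 * pi)
dens0 <- dnorm(0, mean = 0, sd = 1)
print(dens0)

# P(X <= 1.96) for the standard normal, about 0.975
p_below <- pnorm(1.96, mean = 0, sd = 1)
print(p_below)

# The 97.5th percentile, about 1.96; qnorm() undoes pnorm()
q_975 <- qnorm(0.975, mean = 0, sd = 1)
print(q_975)

# 1000 simulated heights (illustrative mean and sd)
set.seed(123)
sample_heights <- rnorm(1000, mean = 170, sd = 10)
print(mean(sample_heights))
print(sd(sample_heights))
```

The sample mean and standard deviation of the simulated values should be close to, but not exactly equal to, the parameters passed to rnorm().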

3) Student's t-distribution
The Student's t-distribution is a continuous probability distribution generally
used when dealing with statistics estimated from a sample of data.
Any particular t-distribution looks a lot like the standard normal distribution:
it is bell-shaped, symmetric, and centered on zero. The difference is that while
a normal distribution is typically used to deal with a population, the
t-distribution deals with a sample from a population.

Functions used:
To find the value of the probability density function (pdf) of the Student's
t-distribution at a given value x, use the dt() function in R.
Syntax: dt(x, df)

Parameters:
x is the quantiles vector
df is the degrees of freedom


The pt() function is used to get the cumulative distribution function (CDF) of a
t-distribution.
Syntax: pt(q, df, lower.tail = TRUE)
Parameter:
q is the quantiles vector
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].

The qt() function is used to get the quantile function, or inverse cumulative
distribution function, of a t-distribution.
Syntax: qt(p, df, lower.tail = TRUE)
Parameter:
p is the vector of probabilities
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
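A short sketch of dt(), pt() and qt(), assuming 10 degrees of freedom for illustration (the value and the variable names are not from the notes). It also compares the 97.5% critical value with that of the standard normal distribution, which shows the heavier tails of the t-distribution.

```r
deg_free <- 10  # illustrative degrees of freedom

# Density at zero, the center of the distribution
dens0 <- dt(0, df = deg_free)
print(dens0)

# The distribution is symmetric about zero, so P(X <= 0) = 0.5
p_zero <- pt(0, df = deg_free)
print(p_zero)

# 97.5th percentile of the t-distribution, about 2.228 for 10 df
t_crit <- qt(0.975, df = deg_free)
print(t_crit)

# The matching normal value is smaller, about 1.96
z_crit <- qnorm(0.975)
print(z_crit)

# Upper-tail probability beyond the critical value
p_upper <- pt(t_crit, df = deg_free, lower.tail = FALSE)
print(p_upper)
```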
