UNIT-III
STATISTICS AND PROBABILITY
Basic Data Visualization
Data visualization is an efficient technique for gaining insight into data through
a visual medium.
With data visualization techniques, we can work with large datasets and
efficiently obtain key insights from them.
Graphics play an important role in bringing out the important features of the data.
R Bar Charts
A bar chart is a pictorial representation in which numerical values of variables are
represented by length or height of lines or rectangles of equal width.
A bar chart is used for summarizing a set of categorical data.
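In a bar chart, the data is shown as rectangular bars, with the length of each
bar proportional to the value of the variable.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)
Here H is a vector or matrix of bar heights, xlab and ylab are the axis labels,
main is the chart title, names.arg gives the label under each bar, and col sets
the bar colors.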
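Example: a minimal sketch of a barplot() call. The revenue figures and month names are hypothetical values invented for illustration; they do not come from these notes.

```r
# Hypothetical monthly revenue figures, invented for illustration.
revenue <- c(12, 5, 15, 18, 9)
months <- c("Jan", "Feb", "Mar", "Apr", "May")

# barplot() draws one bar per value; names.arg labels the bars.
# It also returns the x-coordinates of the bar midpoints.
bar_midpoints <- barplot(revenue,
                         names.arg = months,
                         xlab = "Month",
                         ylab = "Revenue",
                         main = "Revenue Bar Chart",
                         col = "blue")
```

The returned midpoints are useful when text or extra marks need to be placed on top of individual bars.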
R Pie Charts
A pie-chart is a representation of values in the form of slices of a circle with
different colors.
Slices are labeled with a description, and the numbers corresponding to each slice
are also shown in the chart.
Pie charts are created with the pie() function, which takes a vector of
non-negative numbers as input.
Syntax:
pie(x, labels, radius, main, col, clockwise)
Example:
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
# Giving the chart file a name (png() writes a PNG file).
png(file = "title_color.png")
# Plotting the chart.
pie(x, labels, main = "Country Pie chart", col = rainbow(length(x)))
# Saving the file.
dev.off()
Output:
A pie chart has two further useful properties: slice percentages and a chart
legend. We can label the slices with percentages, and we can add a legend to the
plot with the legend() function.
Example:
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
# Percentage share of each slice, rounded to one decimal place.
pie_percent <- round(100 * x / sum(x), 1)
# Giving the chart file a name (png() writes a PNG file).
png(file = "per_pie.png")
# Plotting the chart with percentage labels.
pie(x, labels = pie_percent, main = "Country Pie Chart", col = rainbow(length(x)))
# Adding a legend that maps each color to its country.
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
# Saving the file.
dev.off()
Output:
R Histogram
A histogram is a type of bar chart that shows how many values fall within each
of a set of value ranges (bins).
A histogram is used to show the distribution of a variable, whereas a bar chart
is used for comparing different entities.
In a histogram, the height of each bar represents the number of values present
in the given range.
To create a histogram, R provides the hist() function, which takes a vector as
input.
Syntax:
hist(v,main,xlab,ylab,xlim,ylim,breaks,col,border)
Example:
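The example that belonged here is missing from these notes, so the following is only a minimal sketch of a hist() call. The vector v holds hypothetical values invented for illustration.

```r
# Hypothetical sample values, invented for illustration.
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# hist() groups the values into bins and draws one bar per bin.
h <- hist(v, main = "Histogram of v", xlab = "Value", ylab = "Frequency",
          col = "green", border = "red")

# The returned object holds the bin boundaries and the count in each bin.
print(h$breaks)
print(h$counts)
```

Because every value falls into exactly one bin, the counts add up to the length of v.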
R Boxplot
Boxplots show how the values in a data set are distributed. The three quartiles
divide the data set into four parts. The graph shows the minimum, first quartile,
median, third quartile, and maximum.
Syntax:
boxplot(x, data, notch, varwidth, names, main)
1. x: It is a vector or a formula.
Example: In the below example, we will use the "mtcars" dataset present in the
R environment. We will use its two columns only, i.e., "mpg" and "cyl". The
below example will create a boxplot graph for the relation between mpg and cyl,
i.e., miles per gallon and number of cylinders, respectively.
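png(file = "boxplot.png")
# Boxplot of miles per gallon grouped by number of cylinders.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles per Gallon")
dev.off()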
Output:
BGS FGC Mysuru
Example:
png(file = "boxplot_using_notch.png")
boxplot(mpg ~ cyl, data = mtcars,
notch = TRUE,
varwidth = TRUE,
col = c("green","yellow","red"),
names = c("High","Medium","Low"))
dev.off()
Output:
R Scatterplots
Scatter plots are used to compare variables. A comparison between variables
is required when we need to determine how strongly one variable is related to
another variable.
In a scatterplot, the data is represented as a collection of points. Each point on
the scatterplot gives the values of the two variables.
One variable is selected for the vertical axis and the other for the horizontal axis.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Example: In our example, we will use the dataset "mtcars", which is the
predefined dataset available in the R environment.
# Fetching two columns from mtcars.
data <- mtcars[, c('wt', 'mpg')]
# Giving a name to the chart file.
png(file = "scatterplot.png")
# Plotting the chart for cars with weight between 2.5 and 5 and mileage
# between 15 and 30.
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage", xlim = c(2.5, 5),
ylim = c(15, 30), main = "Weight vs Mileage")
# Saving the file.
dev.off()
Output:
Statistics
Statistics is a form of mathematical analysis that concerns the collection,
organization, analysis, interpretation, and presentation of data.
R – Statistics
R is a programming language and environment used for statistical computing
and graphics.
Average in R Programming: The average (mean) is calculated by dividing the sum
of the values in the set by the number of values.
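The definition of the average can be checked directly in R. This is a minimal sketch; the vector x holds hypothetical values chosen for illustration.

```r
# Hypothetical values, chosen for illustration.
x <- c(2, 4, 6, 8, 10)

# Average from the definition: sum of the values divided by their count.
avg_manual <- sum(x) / length(x)

# The built-in mean() function gives the same result.
avg_builtin <- mean(x)

print(avg_manual)   # 6
print(avg_builtin)  # 6
```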
Random Variable
A random variable is a variable that assigns a real value to each outcome of a
random experiment.
pbern()
The pbern() function, which is not part of base R but is provided by the Rlab
package, gives the distribution function of the Bernoulli distribution.
The distribution function or cumulative distribution function (CDF) or
cumulative frequency function, describes the probability that a variate X takes on
a value less than or equal to a number x.
Syntax: pbern(q, prob, log.p = FALSE)
Parameter:
q: vector of quantiles
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).
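A Bernoulli distribution is a binomial distribution with a single trial, so its distribution function can also be sketched with base R's pbinom() and size = 1. This is an illustrative equivalent, not the pbern() implementation itself. These notes do not name the package that provides pbern(); it is assumed here to be Rlab. The value prob = 0.3 is a hypothetical choice.

```r
# Hypothetical probability of success, chosen for illustration.
prob <- 0.3

# Bernoulli CDF at q = 0 and q = 1, using a binomial with one trial.
# P(X <= 0) = 1 - prob and P(X <= 1) = 1.
cdf_values <- pbinom(c(0, 1), size = 1, prob = prob)
print(cdf_values)  # 0.7 1.0

# Assumption: with the Rlab package installed, the call
# pbern(c(0, 1), prob = 0.3) is expected to give the same values.
```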
qbern()
qbern() gives the quantile function for the Bernoulli distribution. A quantile
function in statistical terms specifies the value of the random variable such that
the probability of the variable being less than or equal to that value equals the
given probability.
Syntax: qbern(p, prob, log.p = FALSE)
Parameter:
p: vector of probabilities
prob: probability of success on each trial
log.p: logical; if TRUE, probabilities p are given as log(p).
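The quantile function can be sketched in base R with qbinom() and size = 1, since a Bernoulli variable is a binomial variable with one trial. This is an illustrative equivalent, not the qbern() implementation; prob = 0.3 is a hypothetical choice.

```r
# Hypothetical probability of success, chosen for illustration.
prob <- 0.3

# Here P(X <= 0) = 0.7, so any probability up to 0.7 maps to 0
# and any larger probability maps to 1.
quantiles <- qbinom(c(0.5, 0.9), size = 1, prob = prob)
print(quantiles)  # 0 1
```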
rbern()
The rbern() function in R programming is used to generate a vector of random
numbers that are Bernoulli distributed.
Syntax: rbern(n, prob)
Parameter:
n: number of observations.
prob: probability of success on each trial.
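Random Bernoulli values can be sketched in base R with rbinom() and size = 1. This is an illustrative equivalent, not the rbern() implementation; the seed, the sample size and prob = 0.3 are hypothetical choices.

```r
# Fix the seed so the random draws are reproducible.
set.seed(42)

# Ten Bernoulli draws: each value is 1 (success) with probability 0.3,
# otherwise 0 (failure).
draws <- rbinom(10, size = 1, prob = 0.3)
print(draws)

# The proportion of successes estimates prob.
print(mean(draws))
```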
2) Binomial Distribution
The binomial distribution is a discrete probability distribution used in statistics.
It describes the number of successes in a fixed number of trials, where each trial
has only two outcomes, i.e. success or failure. The trials are independent: the
probability of success remains the same and the previous outcome does not affect
the next outcome.
The binomial distribution helps us to find individual probabilities as well as
cumulative probabilities over a certain range.
In mathematical terms, for a discrete random variable X and a realization X = x,
the binomial mass function is
$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \dots, n$$
where n is the total number of trials and p is the probability of success.
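For reference, the binomial mass function can be evaluated from its standard formula, choose(n, x) * p^x * (1 - p)^(n - x), and compared with R's built-in dbinom(). This is a minimal sketch; the values of n, p and x are hypothetical choices.

```r
# Hypothetical parameters: 10 trials, success probability 0.5,
# probability of exactly 3 successes.
n <- 10
p <- 0.5
x <- 3

# Mass function evaluated from the formula.
pmf_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same value from the built-in density function.
pmf_builtin <- dbinom(x, size = n, prob = p)

print(pmf_formula)  # 0.1171875
print(pmf_builtin)  # 0.1171875
```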
pbinom()
The function pbinom() finds the cumulative probability for data following a
binomial distribution up to a given value, i.e., it finds P(X <= k).
Syntax:
pbinom(k, n, p)
where k is the value up to which the probability is accumulated, n is the total
number of trials and p is the probability of success.
qbinom()
This function finds a quantile: given the cumulative probability P = P(X <= k),
it finds k.
Syntax:
qbinom(P, n, p)
rbinom()
This function generates random values that follow a binomial distribution.
Syntax:
rbinom(n, N, p)
where n is the number of random values to generate, N is the total number of
trials and p is the probability of success.
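The three binomial functions can be tried together as follows. This is a minimal sketch with hypothetical parameters (10 trials, success probability 0.5). Note that R's own argument names are size and prob.

```r
# Hypothetical parameters, chosen for illustration.
n_trials <- 10
p <- 0.5

# pbinom(): cumulative probability P(X <= 4).
cum_prob <- pbinom(4, size = n_trials, prob = p)
print(cum_prob)

# qbinom(): smallest k such that P(X <= k) >= 0.5.
k <- qbinom(0.5, size = n_trials, prob = p)
print(k)  # 5

# rbinom(): five random values, each the number of successes in 10 trials.
set.seed(1)
draws <- rbinom(5, size = n_trials, prob = p)
print(draws)
```

Here P(X <= 4) is below 0.5 and P(X <= 5) is above it, which is why the 0.5 quantile is 5.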
3) Poisson Functions
The Poisson distribution gives the probability of a given number of events
occurring in a fixed interval of space or time, when these events occur at a known
constant mean rate.
In mathematical terms, for a discrete random variable X and a realization X = x,
the Poisson mass function f is given as follows, where λ is a parameter of the
distribution.
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$
The function dpois() evaluates this mass function.
Syntax:
dpois(k, lambda, log = FALSE)
where,
k: number of successful events that happened in an interval
lambda (λ): mean per interval
log: if TRUE, the function returns the probability as a logarithm
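The Poisson mass function can be evaluated from its standard formula, lambda^k * exp(-lambda) / k!, and compared with R's built-in dpois(). This is a minimal sketch; lambda = 3 and k = 2 are hypothetical choices.

```r
# Hypothetical parameters: mean of 3 events per interval,
# probability of exactly 2 events.
lambda <- 3
k <- 2

# Mass function evaluated from the formula.
pmf_formula <- lambda^k * exp(-lambda) / factorial(k)

# The same value from the built-in function.
pmf_builtin <- dpois(k, lambda = lambda)

print(pmf_formula)
print(pmf_builtin)

# With log = TRUE the logarithm of the probability is returned.
log_pmf <- dpois(k, lambda = lambda, log = TRUE)
```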
ppois()
The function ppois() computes the cumulative probability function, i.e., the
probability that a random variable will be equal to or less than a given number.
Syntax:
ppois(k, lambda, lower.tail = TRUE, log.p = FALSE)
where,
k: number of successful events that happened in an interval
lambda (λ): mean per interval
lower.tail: if TRUE, the left tail P(X <= k) is considered; if FALSE, the right
tail P(X > k) is considered
log.p: if TRUE, the function returns the probability as a logarithm
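The cumulative probability from ppois() is the sum of the individual Poisson probabilities up to the given value. This is a minimal sketch; lambda = 3 and k = 2 are hypothetical choices.

```r
# Hypothetical parameters, chosen for illustration.
lambda <- 3
k <- 2

# Left tail: P(X <= 2).
lower <- ppois(k, lambda = lambda)

# The same probability as a sum of the individual probabilities.
lower_by_sum <- sum(dpois(0:k, lambda = lambda))

# Right tail: P(X > 2).
upper <- ppois(k, lambda = lambda, lower.tail = FALSE)

print(lower)
print(upper)
```

The two tails cover all possible outcomes, so lower and upper add up to 1.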
rpois()
The function rpois() is used for generating random numbers from a given
Poisson distribution.
Syntax:
rpois(q, lambda)
where,
q: number of random numbers needed
lambda (λ): mean per interval
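Random Poisson values can be generated as follows. This is a minimal sketch; the seed, the sample size and lambda = 3 are hypothetical choices.

```r
# Fix the seed so the random draws are reproducible.
set.seed(1)

# Five random counts from a Poisson distribution with mean 3 per interval.
counts <- rpois(5, lambda = 3)
print(counts)

# With a large sample, the sample mean should be close to lambda.
large_sample <- rpois(10000, lambda = 3)
print(mean(large_sample))
```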
qpois()
The function qpois() is used for finding the quantiles of a given Poisson
distribution.
In probability, quantiles are cut points that divide the range of a probability
distribution into intervals which have equal probabilities.
Syntax:
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
where,
p: vector of probabilities
lambda (λ): mean per interval
lower.tail: if TRUE, the left tail is considered; if FALSE, the right tail
is considered
log.p: if TRUE, the probabilities p are given as logarithms
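A quantile from qpois() is the smallest count whose cumulative probability reaches the requested probability. This is a minimal sketch; lambda = 3 and the probability 0.5 are hypothetical choices.

```r
# Hypothetical parameter, chosen for illustration.
lambda <- 3

# Median: smallest k with P(X <= k) >= 0.5.
median_k <- qpois(0.5, lambda = lambda)
print(median_k)  # 3

# Check against the cumulative probabilities on either side.
print(ppois(median_k - 1, lambda = lambda))  # below 0.5
print(ppois(median_k, lambda = lambda))      # at least 0.5
```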
2) Normal Distribution in R
The normal distribution is a probability distribution used in statistics that
describes how data values are distributed.
Examples include the heights of a population, shoe sizes, IQ levels, the sum of
many dice rolls, and many more.
It is generally observed that data is normally distributed when it is collected at
random from independent sources. The graph produced by plotting the value of the
variable on the x-axis and the count of the value on the y-axis is a bell-shaped
curve.
The graph shows that the peak is at the mean of the data set: half of the values
lie to the left of the mean and the other half lie to the right of the mean.
Syntax:
rnorm(n, mean, sd)
where n is the number of random values to generate, mean is the mean of the
distribution and sd is its standard deviation.
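Random values from a normal distribution can be generated with rnorm(), whose first argument is the number of values to generate. This is a minimal sketch; the seed, the sample size, the mean of 50 and the standard deviation of 10 are hypothetical choices.

```r
# Fix the seed so the random draws are reproducible.
set.seed(123)

# 1000 random values from a normal distribution with mean 50 and sd 10.
x <- rnorm(1000, mean = 50, sd = 10)

# The sample mean and standard deviation should be close to 50 and 10.
print(mean(x))
print(sd(x))

# A histogram of the sample shows the bell shape.
hist(x, main = "Sample from a normal distribution", xlab = "Value")
```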
3) Student’s t-distribution
The Student’s t-distribution is a continuous probability distribution generally
used when dealing with statistics estimated from a sample of data.
Any particular t-distribution looks a lot like the standard normal distribution:
it is bell-shaped, symmetric and centered on zero. The difference is that while
a normal distribution is typically used to deal with a population, the t-distribution
deals with a sample from a population.
Functions used:
To find the value of the probability density function (pdf) of the Student’s
t-distribution at a given value x, use the dt() function in R.
Syntax: dt(x, df)
Parameters:
x is the quantiles vector
df is the degrees of freedom
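The density from dt() can be compared with the standard normal density from dnorm(). This is a minimal sketch; the evaluation points and df = 10 are hypothetical choices.

```r
# Hypothetical degrees of freedom, chosen for illustration.
df <- 10

# Density of the t-distribution at -1, 0 and 1.
t_density <- dt(c(-1, 0, 1), df = df)
print(t_density)

# Standard normal density at 0, for comparison.
normal_peak <- dnorm(0)
print(normal_peak)
```

The t density is symmetric about zero, and its peak is lower than the normal peak because the t-distribution places more probability in the tails.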
The qt() function is used to get the quantile function, or inverse cumulative
distribution function, of a t-distribution.
Syntax: qt(p, df, lower.tail = TRUE)
Parameter:
p is the vector of probabilities
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
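The effect of lower.tail in qt() can be sketched as follows. The probabilities and df = 10 are hypothetical choices. The pt() function used for the check is R's cumulative distribution function for the t-distribution, which these notes do not cover.

```r
# Hypothetical degrees of freedom, chosen for illustration.
df <- 10

# Value below which 97.5% of the distribution lies.
q_lower <- qt(0.975, df = df)

# The same value expressed through the upper tail: P[X > x] = 0.025.
q_upper <- qt(0.025, df = df, lower.tail = FALSE)

print(q_lower)
print(q_upper)

# Check: the cumulative probability at q_lower is 0.975.
print(pt(q_lower, df = df))
```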