R Statistics PDF
Mean
It is calculated by taking the sum of the values and dividing it by
the number of values in a data series. The mean() function is used
in R to calculate this value.
Syntax
mean(x, trim = 0, na.rm = FALSE, ...)
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
[1] 8.22
When the trim parameter is supplied, the values are sorted and the
given fraction of observations is dropped from each end before the
mean is calculated. With trim = 0.3 and 10 values, 3 values are
dropped from each end.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18,
54), so the values removed before calculating the mean
are (−21, −5, 2) from the left and (12, 18, 54) from the right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
[1] 5.55
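To make the trimming concrete, the following minimal sketch reproduces mean(x, trim = 0.3) by hand. The helper name manual_trimmed_mean is hypothetical and introduced only for illustration; it assumes there are no missing values and that trim is below 0.5.

```r
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Hypothetical helper (not part of R): trimmed mean computed step by step.
manual_trimmed_mean <- function(v, trim) {
  n <- length(v)
  k <- floor(n * trim)            # number of values dropped from each end
  sorted <- sort(v)
  mean(sorted[(k + 1):(n - k)])   # average of the values left in the middle
}

print(manual_trimmed_mean(x, 0.3))
print(mean(x, trim = 0.3))
```

Both calls average the four middle values (3, 4.2, 7, 8), so they print the same result.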
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
# Find mean dropping NA values.
result.mean <- mean(x,na.rm = TRUE)
print(result.mean)
[1] NA
[1] 8.22
Median
The middle value of a sorted data series is called the median. When
the series has an even number of values, the median is the mean of
the two middle values.
The median() function is used in R to calculate this value.
Syntax
median(x, na.rm = FALSE)
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
print(median(x))
[1] 5.6
Mode
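The mode is the value that has the highest number of occurrences in
a set of data. Unlike the mean and median, the mode can be found for
both numeric and character data. R has no built-in function for the
statistical mode, so the example defines one.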
Example
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
[1] 2
[1] "it"
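The calls that produce these two results are not shown. A minimal sketch of how getmode() might be used follows; the vectors v and charv are illustrative assumptions, not the original data, chosen so that the modes are 2 and "it".

```r
# Same function as above, repeated so the sketch runs on its own.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Illustrative vectors (assumptions, not the original data).
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2)
charv <- c("o","it","the","it","it")

print(getmode(v))      # 2 occurs four times, more than any other value
print(getmode(charv))  # "it" occurs three times
```

Because which.max() returns the first maximum, a tie is resolved in favour of the value that appears first in the vector.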
R - Linear Regression
y = ax + b
Input Data
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the
predictor and the response variable.
Syntax
lm(formula,data)
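As a minimal sketch, the model can be fitted on the input data given earlier. It assumes height is the predictor x and weight is the response y, which matches the formula y ~ x reported in the model output; the variable name relation follows the print(relation) call used in this section.

```r
# Heights (predictor) and weights (response) from the input data.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Fit the linear model y = ax + b.
relation <- lm(y ~ x)

# coef() returns the intercept (b) and the slope (a).
print(coef(relation))
```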
print(relation)
Coefficients:
(Intercept)            x
   -38.4551       0.6746
print(summary(relation))
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict(object, newdata)
1
76.22869
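The call that produces this prediction is not shown. The sketch below is one way it might look: it refits the model on the input data and predicts the weight for a new height. The height of 170 is an assumption, chosen because it is consistent with the coefficients and the predicted value reported in this section.

```r
# Heights (predictor) and weights (response) from the input data.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# New height to predict for; 170 is an assumed value.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
```

The column name in newdata must match the predictor name used in the formula (here x), otherwise predict() cannot find the variable.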
lm() Function
Here lm() creates the relationship model between several
predictor variables and the response variable.
Syntax
lm(y ~ x1+x2+x3...,data)
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460
print(Xdisp)
print(Xhp)
print(Xwt)
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891
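The code that creates the model and the variables Xdisp, Xhp and Xwt is not shown in full. A minimal sketch of how they might be obtained follows; the object name model is an assumption, while a, Xdisp, Xhp and Xwt follow the names used in this section.

```r
input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)

# Extract the intercept and the coefficient of each predictor by name.
a <- coef(model)[["(Intercept)"]]
Xdisp <- coef(model)[["disp"]]
Xhp <- coef(model)[["hp"]]
Xwt <- coef(model)[["wt"]]
```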
Y = a+Xdisp*x1+Xhp*x2+Xwt*x3
or
Y = 37.1055+(-0.000937)*x1+(-0.031157)*x2+(-3.800891)*x3
Apply Equation for predicting New Values
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted
mileage is:
Y = 37.1055+(-0.000937)*221+(-0.031157)*102+(-3.800891)*2.91 = 22.6598
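The same prediction can be obtained with predict() instead of typing the equation by hand. The sketch below is illustrative; the names model, new.car, pred and by.hand are assumptions. Because predict() works with the unrounded coefficients, its result (about 22.66) can differ slightly from a hand calculation made with rounded values.

```r
input <- mtcars[,c("mpg","disp","hp","wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# The car described above.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Prediction from predict().
pred <- predict(model, new.car)

# Same prediction from the regression equation, using unrounded coefficients.
b <- coef(model)
by.hand <- b[["(Intercept)"]] + b[["disp"]]*221 + b[["hp"]]*102 + b[["wt"]]*2.91

print(pred)
print(by.hand)
```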
R - Logistic Regression
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Syntax
glm(formula,data,family)
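Following is the description of the parameters used:
• formula is the symbol presenting the relationship between the
variables.
• data is the data set giving the values of these variables.
• family is the R object that specifies the details of the model.
Its value is binomial for logistic regression.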
Example
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
input <- mtcars[,c("am","cyl","hp","wt")]
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.17272  -0.14907  -0.01464   0.14116   1.27641
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
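To connect the fitted coefficients back to the formula y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...)), the following sketch computes the probability that am is 1 for one car, first with predict() and then by hand. The car (cyl = 6, hp = 110, wt = 2.62) and the names new.car, p.predict and p.by.hand are assumptions made for illustration.

```r
input <- mtcars[,c("am","cyl","hp","wt")]
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

# Hypothetical car, chosen for illustration.
new.car <- data.frame(cyl = 6, hp = 110, wt = 2.62)

# Probability from predict(); type = "response" returns it on the 0 to 1 scale.
p.predict <- predict(am.data, new.car, type = "response")

# Same probability from the logistic formula, using the fitted coefficients.
b <- coef(am.data)
z <- b[["(Intercept)"]] + b[["cyl"]]*6 + b[["hp"]]*110 + b[["wt"]]*2.62
p.by.hand <- 1/(1 + exp(-z))

print(p.predict)
print(p.by.hand)
```

Without type = "response", predict() returns the linear predictor z rather than the probability.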
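Conclusion
In the summary, the p-values in the last column for cyl (0.6491) and
hp (0.0840) are above 0.05, so these variables do not contribute
significantly to the value of am. Only wt (p-value 0.0276) is
significant at the 0.05 level in this model.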
R - Normal Distribution
• x is a vector of numbers.
• p is a vector of probabilities.
• n is number of observations(sample size).
• mean is the mean value of the sample data. Its default value
is zero.
• sd is the standard deviation. Its default value is 1.
dnorm()
This function gives the height of the probability density at each
point for a given mean and standard deviation.
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
plot(x,y)
pnorm()
This function gives the probability that a normally distributed
random number is less than a given value. It is
also called the "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)
# Cumulative probability at each point (default mean 0 and sd 1).
y <- pnorm(x)
plot(x,y)
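qnorm()
This function takes a probability value and gives the number
whose cumulative probability matches it.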
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Value matching each probability (default mean 0 and sd 1).
y <- qnorm(x)
plot(x,y)
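rnorm()
This function generates random numbers that are normally
distributed. It takes the sample size as input.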
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
# Plot the histogram of the sample.
hist(y)
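The normal distribution functions are related to each other, and a short numeric check can make that easier to see than the plots. The following sketch is illustrative only: the mean of 2.5 and standard deviation of 0.5 are reused from the dnorm() example, while the probabilities in p and the seed value are arbitrary assumptions.

```r
# pnorm() is the cumulative probability; at the mean it is one half.
print(pnorm(2.5, mean = 2.5, sd = 0.5))

# qnorm() is the inverse of pnorm(): it maps a probability back to a value.
p <- c(0.1, 0.5, 0.9)
q <- qnorm(p, mean = 2.5, sd = 0.5)
print(pnorm(q, mean = 2.5, sd = 0.5))   # recovers the probabilities in p

# rnorm() draws random values; set.seed() makes the draw repeatable.
set.seed(1)
y <- rnorm(50, mean = 2.5, sd = 0.5)
print(mean(y))                          # close to, but not exactly, 2.5
```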
R - Binomial Distribution
• x is a vector of numbers.
• p is a vector of probabilities.
• n is number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
dbinom()
This function gives the probability of getting exactly the given
number of successes at each point.
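# Create a sequence of numbers from 0 to 50 incremented by 1.
x <- seq(0,50,by = 1)
# Assumed for illustration: 50 trials with success probability 0.5.
y <- dbinom(x,50,0.5)
plot(x,y)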
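pbinom()
This function gives the cumulative probability of an event, that is,
the probability of getting at most a given number of successes.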
# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)
print(x)
[1] 0.610116
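The cumulative probability is the sum of the individual probabilities up to the given number of successes. As a minimal sketch (the names cumulative and summed are assumptions), the value above can be reproduced by adding up dbinom() for 0 to 26 heads:

```r
# Cumulative probability of at most 26 heads in 51 tosses.
cumulative <- pbinom(26, 51, 0.5)

# The same value as a sum of the probabilities of 0, 1, ..., 26 heads.
summed <- sum(dbinom(0:26, 51, 0.5))

print(cumulative)
print(summed)
```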
qbinom()
This function takes the probability value and gives a number
whose cumulative value matches the probability value.
# Find the number of heads whose cumulative probability reaches 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
[1] 23
rbinom()
This function generates the required number of random values, each
being the number of successes in a given number of trials with a
given probability of success.
# Generate 8 random values, each the number of successes in 150 trials
# with a success probability of 0.4.
x <- rbinom(8,150,.4)
print(x)
[1] 58 61 59 66 55 60 61 67
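Because rbinom() is random, each run gives different values. The sketch below shows one way to make a run repeatable and to check that the draws behave as expected; the seed value and the number of draws in many are arbitrary assumptions.

```r
set.seed(123)               # arbitrary seed, only to make the draw repeatable
x <- rbinom(8, 150, 0.4)    # 8 draws, each a success count out of 150 trials
print(x)

# Over many draws the average approaches size * prob = 150 * 0.4 = 60.
many <- rbinom(10000, 150, 0.4)
print(mean(many))
```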
R - Random Forest
An error estimate is made for the cases which were not used
while building the tree. It is called the OOB (out-of-bag) error
estimate and is reported as a percentage.
Install R Package
Use the command below in the R console to install the package. You
also have to install any dependent packages.
install.packages("randomForest")
Syntax
randomForest(formula, data)
Input Data
Example
Call:
randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
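The code that produces this output is not shown. The following is a minimal sketch of how such a forest might be built, not the original example. It assumes the randomForest package is installed and that the readingSkills data set comes from the party package; the object name fit is also an assumption. The sketch does nothing if either package is missing.

```r
# Assumes the randomForest and party packages are installed;
# party is assumed to supply the readingSkills data set.
fit <- NULL
if (requireNamespace("randomForest", quietly = TRUE) &&
    requireNamespace("party", quietly = TRUE)) {
  data("readingSkills", package = "party")

  # Grow the forest; nativeSpeaker is a factor, so this is classification.
  fit <- randomForest::randomForest(nativeSpeaker ~ age + shoeSize + score,
                                    data = readingSkills)

  print(fit)                             # includes the OOB error estimate
  print(randomForest::importance(fit))   # importance of each predictor
}
```

The forest is built from random samples, so the OOB estimate and importance values vary slightly between runs unless set.seed() is called first.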
Conclusion
From the random forest shown above we can conclude that
shoeSize and score are the important factors in deciding whether
someone is a native speaker or not. The model has an OOB error
estimate of only 1%, which means we can expect to predict with
about 99% accuracy.
Syntax
The function used for performing the Chi-Square test is chisq.test().
chisq.test(data)
Example
We will take the Cars93 data in the "MASS" library which
represents the sales of different models of car in the year 1993.
library("MASS")
# str() prints the structure itself, so no print() is needed.
str(Cars93)
The structure printed by str() shows that the data set has many
Factor variables, which can be treated as categorical variables. For
our model we will consider the variables "AirBags" and "Type". Here
we aim to find out whether there is a significant association
between the type of car sold and the type of air bags it has. If an
association is observed, we can estimate which types of cars sell
better with which types of air bags.
# Load the library.
library("MASS")
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
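The output above comes from a test on an object named car.data, whose construction is not shown. A minimal sketch follows, assuming car.data is the cross-tabulation of AirBags against Type built with table(); the name result is also an assumption.

```r
library("MASS")

# Cross-tabulate air bag type against car type.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test on the contingency table.
result <- chisq.test(car.data)
print(result)
```

The warning is raised because some cells of the table have small expected counts, so the p-value is an approximation.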
Conclusion
The result shows a p-value of less than 0.05, which indicates a
statistically significant association between the type of car and
the type of air bags.