R Statistics

The document provides an overview of statistical analysis in R, focusing on functions for calculating mean, median, and mode, as well as linear and multiple regression techniques. It explains the syntax and parameters for relevant functions like mean(), median(), lm(), and glm(), along with examples demonstrating their application. Additionally, it covers the process of establishing relationships between variables and predicting outcomes using regression models.

R - Mean, Median and Mode


Statistical analysis in R is performed using many in-built functions. Most of these functions are part of the R base package. These functions take an R vector as input, along with additional arguments, and return the result.

The functions we are discussing in this chapter are mean, median and mode.

Mean
The mean is calculated by taking the sum of the values and dividing by the number of values in a data series.

The function mean() is used to calculate this in R.
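As a quick sanity check of this definition, the mean is simply the sum of the values divided by their count:

```r
# A small sample vector.
x <- c(12, 7, 3, 4.2, 18)

# The mean from first principles equals mean().
print(sum(x) / length(x))  # 8.84
print(mean(x))             # 8.84
```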

Syntax

The basic syntax for calculating mean in R is −

mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

• x is the input vector.
• trim is the fraction (0 to 0.5) of observations to drop from each end of the sorted vector before the mean is computed.
• na.rm is used to remove the missing values from the input vector.

Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

When we execute the above code, it produces the following result


[1] 8.22

Applying Trim Option

When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end before calculating the mean.

When trim = 0.3, 3 values (0.3 × 10 observations) from each end will be dropped from the calculation of the mean.

In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18,
54) and the values removed from the vector for calculating mean
are (−21,−5,2) from left and (12,18,54) from right.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)

When we execute the above code, it produces the following result


[1] 5.55

Applying NA Option
If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA values.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)

# Find mean dropping NA values.
result.mean <- mean(x,na.rm = TRUE)
print(result.mean)

When we execute the above code, it produces the following result


[1] NA
[1] 8.22

Median
The middle-most value in a data series is called the median; when the series has an even number of values, it is the average of the two middle values.
The median() function is used in R to calculate this value.

Syntax

The basic syntax for calculating median in R is −

median(x, na.rm = FALSE)

Following is the description of the parameters used −

• x is the input vector.
• na.rm is used to remove the missing values from the input vector.

Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.
median.result <- median(x)
print(median.result)

When we execute the above code, it produces the following result



[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike the mean and median, the mode can be computed for both numeric and character data.

R does not have a standard in-built function to calculate the mode, so we create a user-defined function to calculate the mode of a data set in R. This function takes the vector as input and gives the mode value as output.

Example
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)   # the distinct values in v
   # Count the occurrences of each distinct value and return the most frequent one.
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.
result <- getmode(v)
print(result)

# Create the vector with characters.
charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.
result <- getmode(charv)
print(result)

When we execute the above code, it produces the following result


[1] 2
[1] "it"
R - Linear Regression

Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable.

In Linear Regression these two variables are related through an equation, where the exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.

The general mathematical equation for a linear regression is −

y = ax + b

Following is the description of the parameters used −

• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
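As a sketch of what lm() computes, the least-squares coefficients can also be obtained directly from the sample covariance and variance; using the height/weight data of this chapter, the values agree with the lm() output shown later:

```r
# Height (predictor) and weight (response) observations.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Least-squares slope and intercept from first principles.
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

print(round(slope, 4))      # 0.6746
print(round(intercept, 4))  # -38.4551
```

In the y = ax + b notation above, slope corresponds to a and intercept to b.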

Steps to Establish a Regression

A simple example of regression is predicting the weight of a person when his height is known. To do this we need to have the relationship between height and weight of a person.

The steps to create the relationship are −

• Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and create the mathematical equation using these coefficients.
• Get a summary of the relationship model to know the average error in prediction (the residuals).
• To predict the weight of new persons, use the predict() function in R.

Input Data

Below is the sample data representing the observations −

# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the
predictor and the response variable.

Syntax

The basic syntax for lm() function in linear regression is −

lm(formula,data)

Following is the description of the parameters used −

• formula is a symbolic description of the relation between x and y, such as y ~ x.
• data is the data frame (or environment) in which the formula's variables are found.

Create Relationship Model & get the Coefficients

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result



Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746

Get the Summary of the Relationship

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result


Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
predict() Function
Syntax

The basic syntax for predict() in linear regression is −

predict(object, newdata)

Following is the description of the parameters used −

• object is the model already created using the lm() function.
• newdata is the data frame containing the new values for the predictor variable.

Predict the weight of new persons

# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

When we execute the above code, it produces the following result


1
76.22869

Visualize the Regression Graphically

# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
plot(y, x, col = "blue", main = "Height & Weight Regression",
   abline(lm(x~y)), cex = 1.3, pch = 16,
   xlab = "Weight in Kg", ylab = "Height in cm")

# Save the file.
dev.off()

When we execute the above code, it produces a scatter plot with the fitted regression line, saved as linearregression.png.
R - Multiple Regression

Multiple regression is an extension of linear regression to the relationship between more than two variables. In simple linear regression we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is −

y = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −

• y is the response variable.
• a, b1, b2...bn are the coefficients.
• x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model determines the value of the coefficients using the input data. Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.

lm() Function
This function creates the relationship model between the
predictor and the response variable.

Syntax

The basic syntax for lm() function in multiple regression is −

lm(y ~ x1+x2+x3...,data)

Following is the description of the parameters used −

• formula is a symbolic description of the relation between the response variable and the predictor variables.
• data is the data frame on which the formula will be applied.
Example
Input Data

Consider the data set "mtcars" available in the R environment. It gives a comparison between different car models in terms of mileage per gallon ("mpg"), cylinder displacement ("disp"), horse power ("hp"), weight of the car ("wt") and some more parameters.

The goal of the model is to establish the relationship between "mpg" as a response variable with "disp", "hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars data set for this purpose.

input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))

When we execute the above code, it produces the following result


mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460

Create Relationship Model & get the Coefficients

input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.
print(model)

# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)

When we execute the above code, it produces the following result


Call:
lm(formula = mpg ~ disp + hp + wt, data = input)

Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891

Create Equation for Regression Model

Based on the above intercept and coefficient values, we create the mathematical equation.

Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.105 + (-0.000937)*x1 + (-0.0311)*x2 + (-3.8008)*x3
Apply Equation for predicting New Values

We can use the regression equation created above to predict the mileage when a new set of values for displacement, horse power and weight is provided.

For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is −

Y = 37.105 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.6654
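Rather than rounding the coefficients by hand, the same prediction can be obtained from the fitted model with predict(); a minimal sketch using the model built above (the small difference from the hand calculation comes from coefficient rounding):

```r
# Rebuild the model from the mtcars subset.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# Predict mileage for disp = 221, hp = 102, wt = 2.91.
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, newcar))  # approximately 22.66
```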

R - Logistic Regression

Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1. It actually measures the probability of a binary response as the value of the response variable based on the mathematical equation relating it with the predictor variables.

The general mathematical equation for logistic regression is −

y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

Following is the description of the parameters used −

• y is the response variable.
• x1, x2, x3... are the predictor variables.
• a and b1, b2, b3... are the coefficients, which are numeric constants.
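The right-hand side of this equation is the logistic (sigmoid) function applied to a linear predictor, which squashes any real number into the interval (0, 1); a minimal sketch with made-up coefficient values chosen only for illustration:

```r
# The logistic (sigmoid) function.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients a, b1 and a single predictor value x1.
a <- -1.5
b1 <- 0.8
x1 <- 2
p <- sigmoid(a + b1 * x1)
print(p)  # the predicted probability of the "success" outcome

# R's built-in plogis() computes the same function.
print(plogis(a + b1 * x1))
```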

The function used to create the regression model is the glm() function.

Syntax

The basic syntax for glm() function in logistic regression is −

glm(formula,data,family)
Following is the description of the parameters used −

• formula is the symbol presenting the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object specifying the details of the model. Its value is binomial for logistic regression.

Example

The in-built data set "mtcars" describes different models of a car with their various engine specifications. In the "mtcars" data set, the transmission mode (automatic or manual) is described by the column am, which is a binary value (0 or 1). We can create a logistic regression model between the column "am" and 3 other columns - hp, wt and cyl.

# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result


am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460

Create Regression Model

We use the glm() function to create the regression model and get its summary for analysis.

input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))

print(summary(am.data))

When we execute the above code, it produces the following result


Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
AIC: 17.841

Number of Fisher Scoring iterations: 8

Conclusion

In the summary, the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", so we consider them to be insignificant in contributing to the value of the variable "am". Only weight (wt) has a significant impact on "am" in this regression model.
R - Normal Distribution

In a random collection of data from independent sources, it is generally observed that the distribution of data is normal. This means that, on plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve. The center of the curve represents the mean of the data set: fifty percent of the values lie to the left of the mean and the other fifty percent lie to its right. This is referred to as the normal distribution in statistics.

R has four in-built functions to generate the normal distribution. They are described below.

dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

Following is the description of the parameters used in the above functions −

• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations (sample size).
• mean is the mean value of the sample data. Its default value is zero.
• sd is the standard deviation. Its default value is 1.
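A quick sketch of how the four functions fit together: dnorm() is the density, pnorm() is its cumulative integral, qnorm() inverts pnorm(), and rnorm() draws random samples:

```r
# pnorm() and qnorm() are inverse functions of each other.
p <- pnorm(1.2)        # P(X <= 1.2) for a standard normal
print(qnorm(p))        # recovers 1.2

# The standard normal density at 0 is 1/sqrt(2*pi).
print(dnorm(0))        # 0.3989423

# rnorm() draws random values; set a seed for reproducible output.
set.seed(1)
print(rnorm(3))
```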

dnorm()
This function gives the height of the probability density at each point for a given mean and standard deviation.

# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)

# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)

# Give the chart file a name.
png(file = "dnorm.png")

plot(x,y)

# Save the file.
dev.off()

When we execute the above code, it produces a bell-shaped density curve saved as dnorm.png.
pnorm()
This function gives the probability of a normally distributed random number being less than the value of a given number. It is also called the "Cumulative Distribution Function".
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)

# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)

# Give the chart file a name.
png(file = "pnorm.png")

# Plot the graph.
plot(x,y)

# Save the file.
dev.off()

When we execute the above code, it produces an S-shaped cumulative distribution curve saved as pnorm.png.
qnorm()
This function takes the probability value and gives a number
whose cumulative value matches the probability value.

# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)

# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)

# Give the chart file a name.
png(file = "qnorm.png")

# Plot the graph.
plot(x,y)

# Save the file.
dev.off()

When we execute the above code, it produces the quantile curve saved as qnorm.png.
rnorm()
This function is used to generate random numbers whose
distribution is normal. It takes the sample size as input and
generates that many random numbers. We draw a histogram to
show the distribution of the generated numbers.

# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)

# Give the chart file a name.
png(file = "rnorm.png")

# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")

# Save the file.
dev.off()
When we execute the above code, it produces a histogram of the generated sample saved as rnorm.png.
R - Binomial Distribution

The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. For example, tossing a coin always gives a head or a tail. The probability of finding exactly 3 heads when tossing a coin repeatedly 10 times can be computed using the binomial distribution.
R has four in-built functions to generate binomial distribution.
They are described below.

dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

Following is the description of the parameters used −

• x is a vector of numbers (numbers of successes).
• p is a vector of probabilities.
• n is the number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
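As a sketch of the coin example above, the probability of exactly 3 heads in 10 tosses of a fair coin follows directly from dbinom():

```r
# P(exactly 3 heads in 10 tosses of a fair coin).
p3 <- dbinom(3, size = 10, prob = 0.5)
print(p3)  # 0.1171875

# The same value from the closed-form count: choose(10,3) / 2^10.
print(choose(10, 3) / 2^10)
```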

dbinom()
This function gives the probability density distribution at each
point.

# Create a sample of 50 numbers which are incremented by 1.
x <- seq(0,50,by = 1)

# Create the binomial distribution.
y <- dbinom(x,50,0.5)

# Give the chart file a name.
png(file = "dbinom.png")

# Plot the graph for this sample.
plot(x,y)

# Save the file.
dev.off()

When we execute the above code, it produces a bell-shaped plot of the probabilities saved as dbinom.png.
pbinom()
This function gives the cumulative probability of an event. It is a
single value representing the probability.

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)

print(x)

When we execute the above code, it produces the following result


[1] 0.610116

qbinom()
This function takes the probability value and gives a number
whose cumulative value matches the probability value.

# How many heads correspond to a cumulative probability of 0.25
# when a coin is tossed 51 times?
x <- qbinom(0.25,51,1/2)

print(x)

When we execute the above code, it produces the following result


[1] 23

rbinom()
This function generates the required number of random values drawn from a binomial distribution with a given size and probability.

# Draw 8 random values from a binomial distribution with 150 trials
# and probability of success 0.4.
x <- rbinom(8,150,.4)

print(x)

When we execute the above code, it produces a result similar to the following (the values are random, so yours will differ) −

[1] 58 61 59 66 55 60 61 67

R - Random Forest

In the random forest approach, a large number of decision trees are created. Every observation is fed into every decision tree, and the most common outcome across the trees (a majority vote) is used as the final output for that observation.

An error estimate is made for the cases which were not used
while building the tree. That is called an OOB (Out-of-bag) error
estimate which is mentioned as a percentage.

The R package "randomForest" is used to create random forests.

Install R Package
Use the below command in R console to install the package. You
also have to install the dependent packages if any.

install.packages("randomForest")

The package "randomForest" has the function randomForest() which is used to create and analyze random forests.

Syntax

The basic syntax for creating a random forest in R is −

randomForest(formula, data)

Following is the description of the parameters used −

• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.

Input Data

We will use the data set named readingSkills (available in the party package) to create the random forest. It describes a person's reading skills score along with the variables "age", "shoeSize" and whether the person is a native speaker.

Here is the sample data.

# Load the party package. It will automatically load other required packages.
library(party)

# Print some records from data set readingSkills.
print(head(readingSkills))

When we execute the above code, it produces the following result and chart −

nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................

Example

We will use the randomForest() function to create the random forest and examine its results.

# Load the party package. It will automatically load other required packages.
library(party)
library(randomForest)

# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
           data = readingSkills)

# View the forest results.
print(output.forest)

# Importance of each predictor.
print(importance(output.forest, type = 2))
When we execute the above code, it produces the following result

Call:
randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
data = readingSkills)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1

OOB estimate of error rate: 1%
Confusion matrix:
no yes class.error
no 99 1 0.01
yes 1 99 0.01
MeanDecreaseGini
age 13.95406
shoeSize 18.91006
score 56.73051

Conclusion

From the random forest shown above we can conclude that shoeSize and score are the important factors deciding if someone is a native speaker or not. Also the model has an OOB error estimate of only 1%, which means we can predict with about 99% accuracy.

R - Chi Square Test

The Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them. Both variables should be from the same population and they should be categorical, like Yes/No, Male/Female, Red/Green etc. For example, we can build a data set with observations on people's ice-cream buying pattern and try to correlate the gender of a person with the flavor of the ice-cream they prefer. If a correlation is found, we can plan an appropriate stock of flavors based on the gender distribution of the people visiting.

Syntax
The function used for performing chi-Square test is chisq.test().

The basic syntax for creating a chi-square test in R is −

chisq.test(data)

Following is the description of the parameters used −

• data is the data in the form of a table containing the count values of the variables in the observation.
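Before turning to a real data set, here is a minimal sketch on a hand-built contingency table; the counts below are made up purely for illustration:

```r
# Hypothetical counts: rows are gender, columns are preferred flavor.
flavors <- matrix(c(30, 10,
                    15, 25),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Male", "Female"),
                                  c("Chocolate", "Vanilla")))

# chisq.test() accepts any table (or matrix) of counts.
print(chisq.test(flavors))
```

A small p-value here would suggest that flavor preference is not independent of gender in this made-up sample.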

Example
We will use the Cars93 data set from the "MASS" library, which contains data on 93 different car models on sale in the year 1993.

library("MASS")
print(str(Cars93))

When we execute the above code, it produces the following result


'data.frame': 93 obs. of 27 variables:


$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

The above result shows the dataset has many Factor variables
which can be considered as categorical variables. For our model
we will consider the variables "AirBags" and "Type". Here we aim
to find out any significant correlation between the types of car
sold and the type of Air bags it has. If correlation is observed we
can estimate which types of cars can sell better with what types
of air bags.

# Load the library.
library("MASS")

# Create a data frame from the main data set.
car.data <- data.frame(Cars93$AirBags, Cars93$Type)

# Create a table with the needed variables.
car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.
print(chisq.test(car.data))
When we execute the above code, it produces the following result

Compact Large Midsize Small Sporty Van
Driver & Passenger 2 4 7 0 3 0
Driver only 9 7 11 5 8 3
None 5 0 4 16 3 6

Pearson's Chi-squared test

data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect

Conclusion
The result shows a p-value of less than 0.05, which indicates a significant correlation between the type of car and the type of air bags it has. (The warning appears because some expected cell counts in the table are small.)
