
Lab 3

STATS 13 – Kim

Winter 2025

Logical Data and Subsetting


The final type of data we will learn in R is the logical type. Logical data can only take on the values TRUE
and FALSE and is typically generated from relational operators and comparisons. For example, the results of
inequalities are expressed with logical data.
5 > 3

## [1] TRUE
5 < 3

## [1] FALSE
You can use a relational operator between a number and a vector to perform the comparison on every element
of the vector. All the typical operators are demonstrated as follows.
3 > c(1, 3, 5) # Greater than

## [1] TRUE FALSE FALSE


3 >= c(1, 3, 5) # Greater than or equal to

## [1] TRUE TRUE FALSE


3 < c(1, 3, 5) # Less than

## [1] FALSE FALSE TRUE


3 <= c(1, 3, 5) # Less than or equal to

## [1] FALSE TRUE TRUE


3 == c(1, 3, 5) # Equal to

## [1] FALSE TRUE FALSE


3 != c(1, 3, 5) # Not equal to

## [1] TRUE FALSE TRUE


Logical data can interact with functions that require numeric data, such as sum() or mean(). In these cases,
R treats TRUE as a 1 and FALSE as a 0. Therefore, sum() will simply count the number of TRUE elements
and mean() will return the proportion of TRUE elements. See the following example.
logical_vec <- c(TRUE, TRUE, FALSE, TRUE)
sum(logical_vec)

## [1] 3

mean(logical_vec)

## [1] 0.75

The chocopie Dataset


We will practice logical expressions on a new dataset: the chocopie dataset.
chocopie <- read.csv("chocopie.csv")
head(chocopie)

## flavor weight diameter taste
## 1 Green Tea 33.52 7.22 3
## 2 Green Tea 36.76 6.24 4
## 3 Chocolate 34.63 6.08 3
## 4 Chocolate 33.52 6.99 2
## 5 Milk Tea 35.78 7.35 5
## 6 Green Tea 34.49 6.99 3
It contains data on n = 500 chocopies that Professor Kim has collected. The variables are flavor (type of
chocopie), weight (in grams), diameter (in cm), and taste (Professor Kim’s taste rating on a scale of 1 to 5).
We can use logical expressions to query more advanced quantities. For example, we can calculate the number
of chocopies which have a taste rating of 3 or less with the following code.
sum(chocopie$taste <= 3)

## [1] 193
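Since mean() gives the proportion of TRUE elements, the same logical expression can also be phrased as a
proportion; here is a minimal sketch (the 0.386 follows directly from the count of 193 out of n = 500 above).
mean(chocopie$taste <= 3)

## [1] 0.386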

Subsetting with Logical Expressions


We can use logical vectors to create subsets of data frames. We do this by placing a logical vector inside the
square brackets: every position that is TRUE will be selected, and every position that is FALSE will not. For
example, suppose we wanted to subset to all large chocopies (weight of 35 g or more). We can create a logical
vector, then subset to those rows with the following code.
chocopie_large <- chocopie[chocopie$weight >= 35, ]
head(chocopie_large)

## flavor weight diameter taste
## 2 Green Tea 36.76 6.24 4
## 5 Milk Tea 35.78 7.35 5
## 7 Milk Tea 35.90 7.20 5
## 9 Milk Tea 35.85 6.16 5
## 11 Green Tea 36.49 6.16 4
## 12 Milk Tea 35.76 6.75 5

Combining Logical Expressions


Logical expressions can be fine-tuned using the “and” and “or” operators, which are invoked with the & and |
symbols, respectively. As an example, suppose we only want the data of large chocopies with a taste rating of
5. We can use the & symbol to require both conditions. Since compound expressions tend to get unwieldy,
we will save the logical vector in its own object first.
large_and_tasty_index <- chocopie$weight >= 35 & chocopie$taste == 5
chocopie_large_and_tasty <- chocopie[large_and_tasty_index, ]
head(chocopie_large_and_tasty)

## flavor weight diameter taste
## 5 Milk Tea 35.78 7.35 5
## 7 Milk Tea 35.90 7.20 5
## 9 Milk Tea 35.85 6.16 5
## 12 Milk Tea 35.76 6.75 5
## 13 Milk Tea 36.33 7.18 5
## 22 Milk Tea 36.80 7.33 5
Now suppose we wanted a dataset with either a large weight or a taste rating of 5. We can use the | symbol
to require at least one of these conditions.
large_or_tasty_index <- chocopie$weight >= 35 | chocopie$taste == 5
chocopie_large_or_tasty <- chocopie[large_or_tasty_index, ]
head(chocopie_large_or_tasty)

## flavor weight diameter taste
## 2 Green Tea 36.76 6.24 4
## 5 Milk Tea 35.78 7.35 5
## 7 Milk Tea 35.90 7.20 5
## 8 Milk Tea 33.47 5.71 5
## 9 Milk Tea 35.85 6.16 5
## 11 Green Tea 36.49 6.16 4
Use what you learned about logical expressions to answer the following questions.

With Your TA

Question 1: (1 point) Read in the chocopie.csv file into an object called chocopie. Print out the
first 6 rows and verify it matches the lab manual.

On Your Own

Question 2: (1 point) Save a subset of the chocopie dataset that contains only data with taste
ratings of 4 or more. Save this into an object called chocopie_subset_1 and print the first 6 rows.

Question 3: (1 point) Save a subset of the chocopie dataset that contains only data with taste
ratings exactly 3 and weights of 35 or less. Save this into an object called chocopie_subset_2 and
print the first 6 rows.

Question 4: (1 point) Save a subset of the chocopie dataset that contains only data with taste
ratings of 2 or taste ratings of 4. Save this into an object called chocopie_subset_3 and print the
first 6 rows.

The sample() Function


A general way to choose elements at random is the sample() function, which can also be used to create our
own random variables. For example, R doesn’t have a built-in random variable that represents rolling a die,
but we can create one using sample(). To create a die roll, we can use the following code:
set.seed(154)
sample(1:6, size = 1, replace = TRUE)

## [1] 4
The first argument we put into sample() is the sample space that we would like to draw from. In this case,
we put in the vector 1:6 since it is the sample space of a die. The size argument indicates how many
samples we would like to take, and the replace argument indicates whether we would like to sample with or
without replacement. We will always set replace = TRUE; otherwise, an outcome would be removed from the
sample space after it is drawn. The sample() function chooses all outcomes with equal probability by default.
As another example, consider drawing 5 samples from the numbers 1 through 10, all with equal probability.
set.seed(544)
sample(1:10, size = 5, replace = TRUE)

## [1] 4 10 2 8 3
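As a side note, here is a minimal sketch of what replace = FALSE would do (the seed is arbitrary and this is
not part of the lab’s required code): each outcome can be drawn at most once, so a full draw from 1:6 is just
a shuffled ordering of the die faces, and asking for more samples than the sample space contains produces an
error.
set.seed(1)
sample(1:6, size = 6, replace = FALSE) # a random permutation of 1 through 6
# sample(1:6, size = 7, replace = FALSE) # error: cannot take a sample larger than the population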
Use what you learned about the sample() function to answer the following questions.

On Your Own

Question 5: (1 point) Set the seed to 1464 and draw 10 samples from the numbers 1 through 4, each
having equal probability.

Question 6: (1 point) Set the seed to 8535 and draw 7 samples from the numbers 10 through 30,
each having equal probability.

The for Loop


The for loop is used to easily repeat a block of code. An example loop is as follows.
for(i in 1:5) {
  print("Chocopie is yummy.")
}

## [1] "Chocopie is yummy."


## [1] "Chocopie is yummy."
## [1] "Chocopie is yummy."
## [1] "Chocopie is yummy."
## [1] "Chocopie is yummy."
The syntax for(i in 1:5) means it will execute all the code in the set of curly braces once for each element
in the vector 1:5. Now, the powerful aspect of for loops is that we can have the block of code depend on i.
Consider this next example.
for(i in 1:5) {
  print(i^2 + 1)
}

## [1] 2
## [1] 5
## [1] 10
## [1] 17
## [1] 26
Here, we calculated x^2 + 1 for the numbers 1 through 5. Thus we have effectively executed 5 lines of code
with a single line of code inside a for loop.
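Looking ahead, a common pattern is to save each iteration’s result into a vector instead of printing it; here is
a minimal sketch (the object results is purely illustrative and not part of the lab’s code).
results <- numeric(5) # empty numeric vector of length 5
for(i in 1:5) {
  results[i] <- i^2 + 1 # store the i-th calculation
}
results

## [1]  2  5 10 17 26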

On Your Own

Question 7: (1 point) Use a for loop to print your full name three times.

Question 8: (1 point) Use a for loop to print the calculation of (x − 2)^3 for the numbers 5 through 10.

Non-Parametric Bootstrap
We will turn our attention to the Non-Parametric Bootstrap, which is a method of approximating a distribution
based on the data. It does this by empirically approximating the original distribution with a discrete one
that samples each observed data point with probability 1/n (equal probability). Combining the sample() function
and a for loop makes it easy to simulate these distributions, as sketched below.
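For instance, here is a minimal sketch of a single bootstrap resample (the vector x and the seed are
hypothetical and not from the chocopie data): every element of x is drawn with probability 1/n.
x <- c(2.1, 3.4, 2.8, 3.9, 3.1) # a small made-up sample with n = 5
set.seed(2025) # arbitrary seed
sample(x, size = length(x), replace = TRUE) # one bootstrap resample of the data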

Hypothesis Testing with Non-Parametric Bootstrap


Consider the taste variable in the chocopie dataset. Suppose we know that the average taste rating of
chocopies from other taste raters is 3.7. Consider the research question: Are Professor Kim’s taste ratings (X)
of chocopies different from those of other taste raters, on average?

(1) Setup:
Our hypotheses would be as follows.
H0 : E[X] = 3.7
H1 : E[X] ≠ 3.7
Let us assume α = 0.05. Notice that we use E[X] since we are talking about the population mean, which is a
parameter, not the sample mean x̄.

(2) Sampling Distribution of the test statistic if H0 is true:


Since our population value is E[X], our test statistic will be x̄, the sample mean. However, since we do not
know the distribution X_i ∼ F, we do not know the sampling distribution of X̄. So we will construct an
empirical version of F, called F_emp, such that H0 is true, using the re-centering technique we learned in
lecture. Recall that if x is our data, then the shifted data (x − x̄ + c) will have a sample mean of c. We can
execute this with a single line of code.
F_emp <- chocopie$taste - mean(chocopie$taste) + 3.7
mean(F_emp)

## [1] 3.7
We can see that the mean of F_emp indeed matches the null hypothesis. Now, we will sample from F_emp to
construct the sampling distribution of the mean, X̄_emp.
M <- 1000 # Number of bootstrap samples
n <- length(chocopie$taste) # Sample size

X_bar_emp <- numeric(M) # Empty numeric vector of size M


set.seed(8481)
for(i in 1:M) {
  # 1. Draw a bootstrap sample from F_emp
  bootstrap_sample <- sample(F_emp, size = n, replace = TRUE)

  # 2. Save the mean into X_bar_emp
  X_bar_emp[i] <- mean(bootstrap_sample)
}

Note that the length() function simply returns the size of a vector; this is an easy way to get the sample
size. Recall that square brackets allow you to access a given element of a vector, hence X_bar_emp[i] yields
the i-th element of X_bar_emp. Notice that we can use this to assign a value to the i-th element as well, as
you see here and in the short sketch below.
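Here is a quick sketch of assigning into the i-th element of a vector (the object v is purely illustrative).
v <- numeric(3) # c(0, 0, 0)
v[2] <- 10 # assign 10 to the 2nd element
v

## [1]  0 10  0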
Let’s view our sampling distribution with the histogram() function. Recall that you will need to load the
mosaic package to use it, and note that mosaic is loaded automatically in your templates.
library(mosaic)
histogram(X_bar_emp)

[Histogram of X_bar_emp: density of the 1000 bootstrap sample means, ranging from roughly 3.5 to 3.8 and centered at 3.7.]

From here, we can calculate p-values and critical regions.

(3) Calculate p-values and critical regions:


Recall that in a two-tailed test, the definition of “more extreme” is “farther from H0 than the observed test
statistic”. So we will calculate the distance between the null value and x̄_obs to demarcate the extreme regions,
and calculate the proportion of bootstrap sample means that fall in these regions with logical expressions.
# Calculate the distance between the null and x_bar_obs
d <- abs(3.7 - mean(chocopie$taste))

# Borders of extreme regions


upper_ex <- 3.7 + d
lower_ex <- 3.7 - d

# Count number of samples in these regions
upper_sum <- sum(X_bar_emp >= upper_ex)
lower_sum <- sum(X_bar_emp <= lower_ex)

# Calculate proportion
pvalue <- (upper_sum + lower_sum)/M
pvalue

## [1] 0.009
Thus we have a p-value of 0.009. Critical regions are much simpler. Since this is a two-tailed test, we want to
figure out which values of the sampling distribution correspond to the bottom 0.025 of the area under the curve
and the top 0.025 of the area under the curve. In other words, we want the 2.5th and 97.5th percentiles. We
can find these with the quantile() function.
quantile(X_bar_emp, probs = c(0.025, 0.975))

## 2.5% 97.5%
## 3.606 3.798
Therefore the critical regions are (−∞, 3.606], [3.798, ∞).

(4) Decision
Since our p-value of 0.009 is less than or equal to our α = 0.05, we reject H0. Equivalently, we can see that
our x̄_obs = 3.824 falls inside the critical region [3.798, ∞), thus we reject H0.
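As a quick check, the decision itself can be written as a logical expression using the pvalue object computed
above.
pvalue <= 0.05 # TRUE means we reject H0

## [1] TRUE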

(5) Conclusion
We conclude that Professor Kim’s taste ratings of chocopies are different from those of other taste raters, on
average (at α = 0.05).
Now, using this example as a reference, conduct the hypothesis test using the non-parametric bootstrap.

With Your TA
Consider the weight variable in the chocopie dataset. Suppose the chocopies have an advertised
weight of 35 grams. Professor Kim wonders if his chocopies are the same weight as advertised or
not. Consider the research question: Are Professor Kim’s chocopie weights (X) different than the
advertised weight, on average? Assume α = 0.05.

Question 9: (1 point) State the null and alternative hypotheses. Write them in LaTeX code.

Question 10: (1 point) Use the re-centering technique to construct F_emp such that it obeys H0.
Print the mean to verify H0 is true.

Question 11: (1 point) Using a loop and the sample() function like the example above, create
X̄_emp. Set M = 1000, and use 7947 as a seed. Print a histogram of this sampling distribution.

Question 12: (1 point) Calculate and print x̄_obs.

Question 13: (1 point) Calculate and print the p-value.

Question 14: (1 point) Calculate and print the critical values and state the critical regions. Write
your regions using LaTeX code (your TA will teach you).

On Your Own

Question 15: (1 point) What is the decision? You may use any scale to make your decision.

Question 16: (1 point) What is the conclusion?

On Your Own
Now, you will conduct a hypothesis test completely on your own, using the re-centering technique you learned.
For Questions 17 to 24, your R and LaTeX code should mirror that of Questions 9 to 16. However, it is up to
you to modify the correct parts of the code to conduct the new hypothesis test correctly.

On Your Own
Consider the diameter variable in the chocopie dataset. Suppose the chocopies have an advertised
diameter of 6.5 cm. Professor Kim wonders if his chocopies are the same diameter as advertised or
not. Consider the research question: Are Professor Kim’s chocopie diameters (X) different than the
advertised diameter, on average? Assume α = 0.01.

Question 17: (1 point) State the null and alternative hypotheses. Write them in LaTeX code.

Question 18: (1 point) Use the re-centering technique to construct F_emp such that it obeys H0.
Print the mean to verify H0 is true.

Question 19: (1 point) Using a loop and the sample() function like the example above, create
X̄_emp. Set M = 1000, and use 5820 as a seed. Print a histogram of this sampling distribution.

Question 20: (1 point) Calculate and print x̄_obs.

Question 21: (1 point) Calculate and print the p-value.

Question 22: (1 point) Calculate and print the critical values and state the critical regions. Write
your regions using LaTeX code.

Question 23: (1 point) What is the decision? You may use any scale to make your decision.

Question 24: (1 point) What is the conclusion?
