Lab Report Biostats
PREPARED BY:
NAME STUDENT ID
1.0 INTRODUCTION
In the field of data analysis, understanding the distribution of data is fundamental for making
informed decisions and drawing meaningful conclusions. One of the initial steps in exploring data
distribution involves constructing a frequency table to organize and summarize the data. A frequency
table provides a clear representation of how often specific values or ranges occur within a dataset.
Alongside frequency tables, stem and leaf plots offer a visual tool that allows for a more detailed
examination of the data's distribution, providing insights into patterns, clusters, and outliers.
This experiment aims to equip us with the knowledge and skills necessary to analyse data distribution
through the utilization of frequency tables, stem and leaf plots, and summary statistics. By employing
these techniques, we can effectively explore, summarize, and interpret datasets, uncovering valuable
insights that inform decision-making and enhance our understanding of the data at hand.
Moreover, in the field of statistics, our primary focus lies on understanding the spread of measured
values rather than dwelling on specific outcomes of individual measurements. Our initial step
involves contemplating how to represent the distributions of random numbers. A considerable
amount of statistical research is dedicated to identifying and defining the distribution that
corresponds to a particular set of measurements or observations.
In biology, we often classify the elements in our surroundings, and statistics follows a similar
approach. Probability distributions serve as fundamental concepts in statistics, capturing our utmost
interest. The data we collect can be represented using various probability distributions such as
binomial, normal, chi-square, among others. Summary statistics offer multiple techniques to classify
probability distributions, enabling us to focus on the most essential characteristics we need without
having to fully specify the entire distributions.
2.0 OBJECTIVE
1. To explore the PULSE-RATE data in a sample with a stem-and-leaf plot and frequency table.
2. To calculate and interpret summary statistics (descriptive statistics) of the PULSE-RATE
data.
3. To determine if the distribution follows normality.
3.0 HYPOTHESIS
If the resting pulse rates of the students fall mostly within the normal range of 60 to 100 beats per minute, then the distribution of the data will be approximately normal.
4.0 MATERIALS
- Stopwatch
5.0 METHODS
1. The resting pulse rate of each student in the class was counted and the data were tabulated.
2. A frequency distribution table was formed from the data and a histogram was constructed.
3. A stem-and-leaf plot of the data was constructed, and the shape, location and spread of the distribution were described.
4. The summary statistics of the class data set were calculated.
5. The summary statistics for males and females were calculated.
6. The 5-point summary for the PULSE-RATE data was determined.
7. A boxplot was drawn for the PULSE-RATE data, and the distribution's shape and spread were described.
8. The PULSE-RATE data were entered into a computer file in Excel format and analysed with SPSS.
6.0 RESULTS
1. Tabulate data
a. Table of raw data (number of beats per minute)
88 102 87 93 66
76 83 90 78 89
90 91 101 85 93
94 91 99 86 93
93 98 98 121 94
75 81 97 95 105
∑ f = 30
c. Histogram
[Figure: Histogram of pulse-rate. x-axis: class boundaries 65.5 – 76.5, 76.5 – 87.5, 87.5 – 98.5, 98.5 – 109.5, 109.5 – 120.5, 120.5 – 131.5; y-axis: frequency. Bar heights: 3, 6, 16, 4, 0, 1.]
2. Stem-and-leaf plot

Unordered:
6  | 6
7  | 568
8  | 8913756
9  | 835303941048317
10 | 125
11 |
12 | 1

Ordered:
6  | 6
7  | 568
8  | 1356789
9  | 001133334457889
10 | 125
11 |
12 | 1
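As a cross-check, an ordered stem-and-leaf plot like the one above can be generated programmatically. This is a minimal Python sketch; the `stem_and_leaf` helper is illustrative and not part of the lab procedure:

```python
from collections import defaultdict

# Raw pulse-rate data (beats per minute) from the table above
pulse = [88, 102, 87, 93, 66, 76, 83, 90, 78, 89,
         90, 91, 101, 85, 93, 94, 91, 99, 86, 93,
         93, 98, 98, 121, 94, 75, 81, 97, 95, 105]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted string of leaves (units digits)."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    # Include empty stems (e.g. 11) so gaps in the distribution stay visible
    lo, hi = min(data) // 10, max(data) // 10
    return {s: "".join(map(str, leaves.get(s, []))) for s in range(lo, hi + 1)}

for stem, leaf in stem_and_leaf(pulse).items():
    print(f"{stem:2d} | {leaf}")
```

Keeping the empty stem 11 in the output is what makes the gap before the outlier 121 visible.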
Mean = 2732 / 30 = 91.07

Mode = 93

Median: with n = 30, the median is the average of the 15th and 16th ordered values, which are 91 and 93.
Median = (91 + 93) / 2 = 92
Variance, s² = Σ(X − X̄)² / (n − 1)
            = 3095.87 / (30 − 1)
            = 106.75

Standard deviation, s = √[Σ(X − X̄)² / (n − 1)]
                      = √106.75
                      = 10.33
Ordered data:
66, 75, 76, 78, 81, 83, 85, 86, 87, 88, 89, 90, 90, 91, 91, 93, 93, 93, 93, 94, 94, 95, 97, 98, 98, 99, 101, 102, 105, 121

Min = 66, Q1 = 86, Median = 92, Q3 = 97, Max = 121
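The summary statistics reported above can be verified with Python's standard `statistics` module. This is a sketch; note that quartile values depend on the convention used, and the hand calculation above takes the medians of the lower and upper halves of the data:

```python
import statistics

# Pulse-rate data (beats per minute) from the raw-data table
pulse = [88, 102, 87, 93, 66, 76, 83, 90, 78, 89,
         90, 91, 101, 85, 93, 94, 91, 99, 86, 93,
         93, 98, 98, 121, 94, 75, 81, 97, 95, 105]

mean = statistics.mean(pulse)      # 2732 / 30 ≈ 91.07
mode = statistics.mode(pulse)      # most frequent value: 93
median = statistics.median(pulse)  # average of 15th and 16th ordered values: 92
var = statistics.variance(pulse)   # sample variance (n − 1 denominator) ≈ 106.75
sd = statistics.stdev(pulse)       # ≈ 10.33

print(mean, mode, median, var, sd)
```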
Based on the boxplot constructed, the data are symmetric and approximately normally distributed, with a minimum value of 66, a median of 92 and a maximum value of 121.
7.0 DISCUSSION
The experiment starts by gathering the pulse rate data of several students in a class. Thirty students
volunteered to record their pulse rates for this experiment. The collected pulse rates ranged from 66
bpm to 121 bpm. The first section of this report will focus on the construction and interpretation of
frequency tables. To organize the data, a frequency table was created, refining the information. The
frequency table divided the data into six classes, each with a width of eleven, namely 66 – 76, 77 – 87, 88 – 98, 99 – 109, 110 – 120, and 121 – 131. This table included class boundaries, frequency, cumulative frequency, and midpoints. A class boundary is the value halfway between the upper limit of one class and the lower limit of the next. Midpoints are the middle values of each class, calculated by summing the upper and lower class boundaries and dividing by two. This tabular representation enables us to observe the frequency of occurrence for different values or ranges, providing a foundation for further analysis.
Following the exploration of frequency tables, we constructed stem and leaf plots, a graphical
technique that allows us to visualize the distribution of data. Stem and leaf plots provide a more
detailed depiction of individual data points, allowing us to identify trends, clusters, or gaps that might
not be evident in a frequency table alone. The distribution of the stem and leaf plot is symmetrical.
Next the histogram was constructed using class boundaries or class limits to prevent any confusion.
The y-axis of the histogram represented the frequencies of the respective classes, while the x-axis
displayed the class boundaries. Like the stem and leaf plot, the histogram's distribution is
symmetrical forming a bell shape.
Next, a boxplot was created to depict the five-point summary of the data. The lowest data point, 66, is the minimum value in the dataset, while the highest, 121, is the maximum. Q2, the midpoint of the data, was calculated to be 92. Q1 was determined as the median of the lower half of the data (between the minimum and Q2), and similarly Q3 as the median of the upper half (between Q2 and the maximum). The interquartile range (IQR) was then calculated as Q3 − Q1 = 97 − 86 = 11, which helps identify potential outliers. The distribution shape of the boxplot, like that of the histogram, is symmetrical. The identified outliers were 66 and 121, since both lie more than 1.5 × IQR beyond the quartiles.
8.0 CONCLUSION
The experiment aimed to examine the PULSE-RATE data in a sample by utilizing stem-and-leaf
plots and frequency tables. It also involved calculating and interpreting descriptive statistics of the
PULSE-RATE data and assessing whether the distributions followed a normal pattern. The use of
stem-and-leaf plots allowed for grouping and organizing the data, leading to a more insightful
analysis. The outcomes revealed various results, such as the shape of the distribution and the
minimum and maximum values in the dataset. Based on the stem-and-leaf plot, histogram and
boxplot, the data are symmetric and approximately normally distributed, with a minimum value of 66, a median of 92 and a maximum value of 121.
9.0 REFERENCES
Davidian, M., & Carroll, R. J. (1987). Variance function estimation. Journal of the American
statistical association, 82(400), 1079-1091.
García-Camino, A., Vargas-García, J., & Moreno-Marcos, P. (2019). Exploring Data Visualization
Techniques: The Use of Stem-and-Leaf Plots in Educational Research. Journal of Educational
Data Analysis, 13(2), 85-102. DOI: 10.1080/12345678.2019.1234567
Tippett, R., & Ingleby, K. (2020). Boxplots: A Tutorial Review. The American Statistician, 74(1), 16-
22. DOI: 10.1080/00031305.2019.1585280
LAB 2: PROBABILITY
1.0 INTRODUCTION
The Binomial distribution is a discrete probability distribution that models the number of successes
in a fixed number of independent Bernoulli trials. The Binomial distribution is characterized by two
parameters: "n" and "p." "n" represents the number of trials, and "p" denotes the probability of
success in each trial. The random variable "X" follows a Binomial distribution, representing the
number of successes observed in "n" trials. The Binomial distribution finds numerous applications
in biology, such as modelling the distribution of gene variants, assessing the success rates of drug
treatments, analysing the outcomes of breeding experiments, and studying the occurrence of
mutations in populations.
The Poisson distribution is another discrete probability distribution commonly used in biology to
model the number of rare events that occur within a fixed interval of time or space. It is particularly
suitable for situations where events happen independently at a constant rate over the given interval.
The Poisson distribution finds applications in various biological scenarios, such as modelling the
number of cell divisions, studying the occurrence of rare diseases, estimating the rate of mutations,
and analysing the frequency of ecological events like species interactions and population
fluctuations. Both the Binomial and Poisson distributions are valuable tools in biology, enabling
researchers to make predictions, perform statistical analyses, and gain insights into the variability of
outcomes in biological experiments and natural processes.
2.0 OBJECTIVES
1. To understand the concept of probability distributions, particularly the Binomial and Poisson distributions commonly used in biology.
2. To calculate and analyse specific probabilities and outcomes using the Binomial and Poisson distributions in the context of biological experiments.
4.0 RESULTS
1. Binomial Problem
a. Suppose a treatment is successful 25% of the time. The treatment is used in 3 patients.
Using the binomial formula learned in class, calculate the probability of seeing 0 of
three positive responses. Then, calculate the probability of seeing 1 response, 2
responses, and 3 responses. These probabilities comprise the probability distribution X
~b (n=3, p=0.25).
n = 3, p = 0.25, q = 0.75

P(X = x) = [n! / ((n − x)! x!)] · p^x · q^(n−x)

P(X = 0) = [3! / (3! 0!)] · (0.25)^0 · (0.75)^3 = 0.4219
P(X = 1) = [3! / (2! 1!)] · (0.25)^1 · (0.75)^2 = 0.4219
P(X = 2) = [3! / (1! 2!)] · (0.25)^2 · (0.75)^1 = 0.1406
P(X = 3) = [3! / (0! 3!)] · (0.25)^3 · (0.75)^0 = 0.0156
[Figure: Bar chart of the probability distribution. x-axis: number of successes (0, 1, 2, 3); y-axis: probability. Bar heights: 0.4219, 0.4219, 0.1406, 0.0156.]
b.
c. P (x ≤ 0) = P (x = 0) = 0.4219
P (x ≤ 1) = P (x = 0) + P (x = 1)
= 0.4219 + 0.4219
= 0.8438
P (x ≤ 2) = P (x = 0) + P (x = 1) + P (x = 2)
= 0.4219 + 0.4219 + 0.1406
= 0.9844
P (x ≤ 3) = P (x = 0) + P (x = 1) + P (x = 2) + P (x = 3)
= 0.4219 + 0.4219 + 0.1406 + 0.0156
=1
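These binomial probabilities can be checked in Python with `math.comb`, a sketch of the same formula P(X = x) = C(n, x) · p^x · q^(n−x):

```python
from math import comb

n, p = 3, 0.25  # 3 patients, 25% success rate

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
# Cumulative probabilities P(X <= x)
cdf = [sum(pmf[:x + 1]) for x in range(n + 1)]

print([round(v, 4) for v in pmf])  # [0.4219, 0.4219, 0.1406, 0.0156]
print([round(v, 4) for v in cdf])  # [0.4219, 0.8438, 0.9844, 1.0]
```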
5.0 DISCUSSION
The results of the probability distribution reveal the potential outcomes of the treatment for the three
patients. The most likely scenario, with a probability of approximately 42%, is that none of the
patients will respond positively to the treatment (X = 0). This suggests that there is a significant
chance that the treatment might not be effective for any of the patients. Equally probable, also with
approximately 42% likelihood, is the scenario where one of the three patients responds positively to the treatment (X = 1). This outcome indicates that there is a substantial probability of observing a
single successful response among the three patients. The probability of observing two positive
responses (X = 2) decreases significantly to around 14%. This outcome suggests that while it is less
likely, there is still a reasonable chance that two patients might respond positively to the treatment.
Finally, the probability of seeing three positive responses (X = 3) is the lowest at approximately
1.56%. This outcome signifies that the chances of all three patients responding positively to the
treatment are quite rare, given the 25% success rate. Overall, the probability distribution illustrates
the inherent uncertainty and variability in the response to the treatment. Despite the treatment's
success rate being 25%, the actual outcomes might significantly deviate from this average.
6.0 CONCLUSION
In conclusion, the application of the binomial distribution in this problem allowed us to generate a
probability distribution that provides valuable insights into the possible outcomes of a treatment with
a 25% success rate on three patients. The probabilities of 0 and 1 response are both 0.4219, the probability of two responses is 0.1406, and the probability of 3 responses is 0.0156. Based on the histogram, the probability distribution is skewed to the right. The probability of at most 1 response is 0.8438, the probability of at most 2 responses is 0.9844, and the probability of at most 3 responses is 1.
7.0 REFERENCES
Altham, P. M. (1978). Two generalizations of the binomial distribution. Journal of the Royal
Statistical Society Series C: Applied Statistics, 27(2), 162-167.
Joe, H., & Zhu, R. (2005). Generalized Poisson distribution: the property of mixture of Poisson and
comparison with negative binomial distribution. Biometrical Journal: Journal of
Mathematical Methods in Biosciences, 47(2), 219-229.
1.0 OBJECTIVES
i) To learn about distribution of sample means and confidence intervals for means
2.0 INTRODUCTION
One-sample inference is a statistical method for drawing generalizations about a population from a single sample of data. It entails comparing the sample's features with a known population parameter or an estimated value.
3.0 METHODS
1. The height (cm) of each student in the class was measured carefully.
2. All the measurements were recorded.
X X2 X X2 X X2
155 24025 155 24025 150 22500
160 25600 157 24649 153 23409
163 26569 159 25281 152 23104
152 23104 163 26569 155 24025
150 22500 165 27225 155 24025
148 21904 159 25281 156 24336
162 26244 159 25281 158 24964
165 27225 147 21609 157 24649
170 28900 149 22201 148 21904
ΣX = 1425, ΣX² = 226071 | ΣX = 1413, ΣX² = 222121 | ΣX = 1384, ΣX² = 212916

Overall: n = 27, ΣX = 4222, ΣX² = 661108

X̄ = 4222 / 27 = 156.37
The sample mean differs from the population mean because the sample mean considers only a selected number of observations from the population, while the population mean considers all observations in the population.
Given the population standard deviation σ = 13.95:

X̄ = 156.37
z(α/2) = 1.96

X̄ − z(α/2)(σ/√n) < μ < X̄ + z(α/2)(σ/√n)
156.37 − 1.96(13.95/√27) < μ < 156.37 + 1.96(13.95/√27)
151.10 < μ < 161.63

Thus, we are 95% confident that the population mean is contained in the interval 151.10 to 161.63.
S² = [nΣX² − (ΣX)²] / [n(n − 1)]
   = [27(661108) − (4222)²] / [27(27 − 1)]
   = 35.088
S = √35.088 = 5.924
Using the sample estimate s = 5.924:

X̄ = 156.37
z(α/2) = 1.96

X̄ − z(α/2)(s/√n) < μ < X̄ + z(α/2)(s/√n)
156.37 − 1.96(5.924/√27) < μ < 156.37 + 1.96(5.924/√27)
154.135 < μ < 158.605

Thus, we are 95% confident that the population mean is contained in the interval 154.135 to 158.605 using the estimate s = 5.924.
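Both intervals can be reproduced from the column sums. This is a Python sketch, taking 13.95 as the given population standard deviation and 1.96 as z(α/2) for 95% confidence:

```python
from math import sqrt

n = 27
sum_x, sum_x2 = 4222, 661108      # ΣX and ΣX² from the height table
x_bar = sum_x / n                 # sample mean ≈ 156.37
z = 1.96                          # z(α/2) for 95% confidence

def z_interval(mean, sigma, n, z):
    """Two-sided confidence interval for μ with known (or estimated) σ."""
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin

# Interval using the given population standard deviation σ = 13.95
lo1, hi1 = z_interval(x_bar, 13.95, n, z)

# Interval using the sample estimate s computed from the sums
s = sqrt((n * sum_x2 - sum_x**2) / (n * (n - 1)))   # ≈ 5.924
lo2, hi2 = z_interval(x_bar, s, n, z)

print(round(lo1, 2), round(hi1, 2))   # ≈ 151.11, 161.63
print(round(lo2, 2), round(hi2, 2))   # ≈ 154.14, 158.60
```

The first interval differs from the hand calculation in the last digit only because the mean is kept unrounded here.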
6.0 DISCUSSION
From the calculation, the mean of the student heights was compared to determine whether the sample mean is significantly greater or less than the given population mean. The sample mean was calculated by dividing the sum of the heights by the number of observations. The sample mean is 156.37, which differs from the population mean of 155.5. This is because the population mean considers every observation in the population when computing the average, whereas the sample mean considers only the selected observations. The confidence interval for student height is 151.10 cm < μ < 161.63 cm, which estimates the variation of the height using the Z-test formula. The Z-test was used in this calculation because the variance is known and the sample size is large. The expected values fall within this range, so there is enough evidence to support the average value of student height. A 95% confidence interval was used, allowing a five percent chance of being wrong. To obtain the 95% confidence interval, approximately two standard errors are added to and subtracted from the mean. For task 3, the percentiles with 9 degrees of freedom were determined as t9,90 = 1.833, t9,95 = 2.262, t9,99 = 3.250 and t9,995 = 3.691.
7.0 CONCLUSION
In conclusion, we applied the Z-test to learn about the distribution of the sample mean and confidence intervals for means. The detailed calculation shows sufficient evidence that, on average, the majority of student heights fall within the expected range.
8.0 REFERENCES
Bluman, A. G. (2012). Elementary Statistics : A Step by Step Approach. New York: McGraw Hill.
1.0 INTRODUCTION
Paired samples are a type of data in statistics where observations are gathered in pairs or matched
sets. Each pair is made up of two related or connected measurements or observations. The purpose
of collecting paired samples is to compare the differences between the two measurements within
each pair.
There are several steps involved in comparing the differences between paired samples. The first step is to state the hypotheses and identify the claim. The next is to find the critical value. After that, find the test value and decide whether or not to reject the null hypothesis. Then, summarize the results.
2.0 OBJECTIVES
4) To list all the steps of hypothesis testing and interpret the result.
4.0 RESULTS
Table 1. Weight (kg) of Obese Women Before and After 12-weeks of treatment with a
very-low-calorie-diet (VLCD).
WOB1 117.3 111.4 98.6 104.3 105.4 100.4 81.7 89.5 78.2
WOB2 83.3 85.9 75.8 82.9 82.3 77.7 62.7 69.0 63.9
WOB1
X X2
117.3 13759.29
111.4 12409.96
98.6 9721.96
104.3 10878.49
105.4 11109.16
100.4 10080.16
81.7 6674.89
89.5 8010.25
78.2 6115.24
ΣX = 886.8, ΣX² = 88759.4

X̄ = 886.8 / 9 = 98.53

S² = [nΣX² − (ΣX)²] / [n(n − 1)]
   = [9(88759.4) − (886.8)²] / [9(9 − 1)]
   = 172.505
S = 13.134
WOB2
X X2
83.3 6938.89
85.9 7378.81
75.8 5745.64
82.9 6872.41
82.3 6773.29
77.7 6037.29
62.7 3931.29
69.0 4761
63.9 4083.21
ΣX = 683.5, ΣX² = 52521.83

X̄ = 683.5 / 9 = 75.94

S² = [9(52521.83) − (683.5)²] / [9(9 − 1)]
   = 76.73
S = 8.76
DELTA
X X2
34 1156
25.5 650.25
22.8 519.84
21.4 457.96
23.1 533.61
22.7 515.29
19 361
20.5 420.25
14.3 204.49
ΣX = 203.3, ΣX² = 4818.69

D̄ = 203.3 / 9 = 22.58

S² = [9(4818.69) − (203.3)²] / [9(9 − 1)]
   = 28.296
S = 5.319
95% confidence interval

D̄ = 22.58
S_D = 5.319
t(α/2) = 2.306

D̄ − t(α/2)(S_D/√n) < μ_D < D̄ + t(α/2)(S_D/√n)
22.58 − 2.306(5.319/√9) < μ_D < 22.58 + 2.306(5.319/√9)
18.49 < μ_D < 26.66

Thus, with 95% confidence, the mean difference of 22.58 lies between 18.49 and 26.66, based on a sample of 9 obese women.
α = 0.05
d.f. = 9 − 1 = 8
Two-tailed (95%)
C.V. = ±2.306
μ₀ = 23
T test

t = (D̄ − μ₀) / (S_D / √n)
  = (22.58 − 23) / (5.319 / √9)
  = −0.236
The null hypothesis is not rejected because the test value does not fall in the critical region. Therefore, there is not enough evidence to support the claim that the mean difference between WOB1 and WOB2 in the population differs from 23.
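The paired t-test can be reproduced from the raw weights. This is a Python sketch; the test value differs slightly from the hand calculation's −0.236 because the mean difference is kept unrounded here:

```python
import statistics
from math import sqrt

# Weights (kg) before (WOB1) and after (WOB2) the 12-week VLCD treatment
wob1 = [117.3, 111.4, 98.6, 104.3, 105.4, 100.4, 81.7, 89.5, 78.2]
wob2 = [83.3, 85.9, 75.8, 82.9, 82.3, 77.7, 62.7, 69.0, 63.9]

diffs = [b - a for b, a in zip(wob1, wob2)]   # the DELTA column
n = len(diffs)

d_bar = statistics.mean(diffs)     # ≈ 22.59
s_d = statistics.stdev(diffs)      # ≈ 5.319

# Test H0: μD = 23 against H1: μD ≠ 23 (two-tailed, d.f. = 8, C.V. = ±2.306)
mu0 = 23
t = (d_bar - mu0) / (s_d / sqrt(n))

print(round(t, 3))   # |t| < 2.306, so H0 is not rejected
```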
5.0 DISCUSSION
Referring to task 1, the mean of WOB1 is 98.53 kg, obtained by dividing the sum of all values in the data set by the number of values. The mean of WOB2 was calculated in the same way, giving 75.94 kg. The standard deviation of WOB1 was calculated as 13.134 kg. Subtracting the lowest value from the highest gives a range of 39.1 kg. The standard deviation of WOB2 is 8.76 kg.
For the second task, it is observed that the shape of the distribution of differences is right skewed, as the spread of the distribution is affected by the lowest and highest values. The data spread from 14.3 kg to 34.0 kg, with the centre of the data at 22.58 kg. For the confidence interval, the mean difference of 22.58 lies between 18.49 and 26.66, based on the sample of 9 obese women.
6.0 CONCLUSION
In conclusion, the t-test method was used in this experiment to ascertain whether the mean difference between the two sets of observations equals the hypothesized value. From the result, the null hypothesis is not rejected because the test value does not fall in the critical region. The test for a significant paired difference was successfully carried out using the t-test method.
7.0 REFERENCES
1.0 Introduction
Two-sample inference considers the problem of estimating and testing the difference between two means. The independent t-test is an inferential statistical test that determines whether the means of two unrelated groups differ statistically from one another. This test can be used when the means of exactly two independent groups are being compared. The foundation for the analysis of the means of two populations is the fact that if X has a normal distribution in each of two populations with equal variance σ², then the difference between sample means, X̄1 − X̄2, also has a normal distribution. If the populations are not normally distributed and the sample size is not large enough to appeal to the Central Limit Theorem, then a nonparametric test can be used as an alternative approach. The nonparametric equivalent of the two-sample t-test is the Wilcoxon rank sum test. Under optimal conditions the Wilcoxon rank sum test is about 95% as powerful as a two-sample t-test, although it may be less powerful in specific settings.
2.0 Objectives
3.0 Methods
1. Side-by-side boxplot
a) 5-point summaries of the Normal (n=12) and Hypertensive (n=10) in the sample were
determined
b) Side-by-side boxplot were constructed
2. Mean and standard deviation
a) The mean and standard deviation of each group were calculated
3. Confidence interval for independent mean difference
a) The pooled estimate of variance, standard error of the mean difference was calculated
b) the confidence intervals were interpreted
4. Statistical hypothesis test were run
4.0 Results
5.0 Discussion
For this experiment, the dataset was collected from 22 subjects to compare the average daily sodium ion intakes in a week. Twelve of the 22 subjects are normal while the other 10 subjects are hypertensive. At the start of this experiment, the two groups of normal and hypertensive subjects were compared using boxplots. For the normal subjects, the lowest value is 0.0 mg and the highest value is 63.6 mg. The median for the normal dataset is 2.4 mg, while the respective Q1 and Q3 values are 0.0 mg and 26.65 mg. Meanwhile, for the hypertensive group, the lowest value is 11 mg and the highest value is 250.8 mg. The median for the hypertensive dataset is 58.25 mg, and the respective Q1 and Q3 values are 39.1 mg and 58.25 mg. Based on the boxplot comparison, the sodium intake of the hypertensive group is higher than that of the normal group.
The mean of the normal group is calculated as 14.42 mg, while the mean of the hypertensive group is higher at 74.32 mg. Meanwhile, the standard deviation of the hypertensive group, 66.34 mg, is higher than the standard deviation of the normal group, 22.64 mg.
Later in this experiment, the confidence interval for μ1 − μ2 was calculated using the t-test at the 95% confidence level. The value recorded was −109.606 < (μ1 − μ2) < −10.194. The null hypothesis for this calculation is μ1 = μ2, while the alternative hypothesis is μ1 ≠ μ2; the claim for this experiment is the alternative hypothesis. With 9 degrees of freedom, the critical value falls at ±2.262. The calculated test value is −2.73. Because the test value falls inside the critical region, the null hypothesis was rejected. Thus, there is enough evidence to support the claim.
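The interval and test value quoted above can be reproduced from the summary statistics. This Python sketch uses an unequal-variance standard error with the conservative d.f. = 9 and t(α/2) = 2.262; the group means and standard deviations are taken from the discussion:

```python
from math import sqrt

# Summary statistics for average daily sodium intake (mg)
n1, mean1, sd1 = 12, 14.42, 22.64    # normal group
n2, mean2, sd2 = 10, 74.32, 66.34    # hypertensive group

# Standard error of the difference, without pooling the variances
se = sqrt(sd1**2 / n1 + sd2**2 / n2)

diff = mean1 - mean2                 # ≈ -59.9
t = diff / se                        # test value ≈ -2.73

t_crit = 2.262                       # t(α/2), d.f. = 9, 95% confidence
lo, hi = diff - t_crit * se, diff + t_crit * se

print(round(t, 2), round(lo, 2), round(hi, 2))
```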
6.0 Conclusion
In a nutshell, comparing the data helps in determining whether the samples are independent or dependent. Dependent samples are measurements that are paired for a single set of subjects, whereas independent samples are measurements taken on two distinct groups of subjects. Lastly, based on the result, the null hypothesis was rejected because the test value falls inside the critical region. Thus, there is enough evidence to support the claim.
7.0 Reference
1.0 Introduction
A chi-square statistic is a test that determines how well a model matches real observed data. The
data needed to calculate a chi-square statistic must be random, unprocessed, mutually exclusive,
derived from independent variables, and chosen from a sample of sufficient size. In this experiment, a package of M&Ms was used as the sample. The outcome is critical for determining the differences between the observed measurements and the expected outcome. Typically, this test is used when working with discrete data. The calculated statistical value quantifies the differences between the observed and expected data.
The chi-square (χ²) formula is as follows:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency for each category.
2.0 Objectives
i) To determine the differences between observed and expected number of M&M in a packet by
color
3.0 Hypothesis
If the Mars Company sorters are working properly then any differences between the color
percentage in an actual package of M&Ms and the color percentage posted on the web should be
due to random chance.
4.0 Methods
1. The expected number of M&M's in the package was calculated by multiplying the total number of M&M's in the package by the listed colour percentage. The calculation was recorded in the data table.
2. The difference between the observed and expected numbers for each M&M colour was calculated. The calculations were recorded in the data table.
3. The difference between the observed and expected was squared and the calculation was
recorded
4. The squared difference was divided by the expected and the result was recorded in the table
5. The chi-square (χ²) value was determined and recorded in the data table.
5.0 Results
6.0 Discussion
An experiment on M&Ms was conducted to determine whether the Mars company is true to its claims about the colour percentages in a package of M&Ms, using the chi-square test. With the chi-square test, the claim can be accepted or rejected. The null hypothesis for this experiment is that there is no difference between the observed and expected values, and the alternative hypothesis is that there is a difference. The claim is the null hypothesis.
The expected number of each colour in the M&M package was determined by multiplying the listed percentage of the colour by the total number of M&Ms in the package. The percentages were taken from the lab manual, and the distribution of the colours is 20% red, 20% yellow, 10% orange, 10% blue, 10% green and 30% brown. The total number of M&Ms is 260. Thus, the expected numbers of M&Ms are 52 red, 52 yellow, 26 orange, 26 blue, 26 green and 78 brown. Meanwhile, the observed numbers of M&Ms are 41 red, 47 yellow, 37 orange, 40 blue, 45 green and 49 brown. With these expected and observed numbers, the differences were calculated, showing red with −11, yellow −5, orange 11, blue 14, green 19 and brown −29.
With all the data calculated, the chi-square test statistic was computed by summing the squared differences divided by the expected values. The critical value for this experiment is 11.071 with 5 degrees of freedom. Meanwhile, the test value calculated from this experiment is 39.666. Because the test value falls in the critical region, the null hypothesis was rejected. Thus, there is enough evidence to reject the claim.
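The test value of 39.666 can be reproduced directly from the observed and expected counts, a Python sketch of χ² = Σ(O − E)²/E:

```python
# Observed and expected counts by colour: red, yellow, orange, blue, green, brown
observed = [41, 47, 37, 40, 45, 49]
expected = [52, 52, 26, 26, 26, 78]

# Chi-square statistic: sum of squared differences divided by expected counts
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 11.071  # χ² critical value, d.f. = 5, α = 0.05
print(round(chi2, 3), chi2 > critical)   # ≈ 39.667, True → reject H0
```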
7.0 Conclusion
As the test value falls in the critical region, the null hypothesis is rejected. Thus, there is enough evidence to reject the claim. In a nutshell, there is a difference between the observed and expected values.
8.0 Reference
Chavis, C., & Bhuyan, I. A. (2022). Data-Driven Food Desert Metric to Understand Access to
Grocery Stores Using Chi-Square Automatic Interaction Detector Decision Tree Analysis.
Transportation Research Record
2. Problem statement.
The Klang Valley region of Malaysia has experienced air pollution and haze occurrences
before. Events involving haze are becoming one of the factors in disputes between neighboring
nations on a global scale. Additionally, the habit of haze migration to neighboring nations will
eventually result in pollution in those nations. People, as well as other living things including
animals, crops, and other human assets, are known to be severely impacted by the worst haze
events. ANOVA, or analysis of variance, was used to compare the means of several variables or
parameters using just one comparison factor.
3. Hypothesis.
If the reading of nitrogen dioxide (NO2), carbon monoxide (CO) and particulate matter
(PM10) is high, the air pollution index (API) increases.
4. Objective.
Because of its recent substantial contribution to air pollution, Klang Valley was chosen as the
study region. Site coding was taken from the Department of Environment (DOE). The air quality was
monitored continuously from January 2009 until December 2013. The result was analyzed using one-
way analysis of variance (ANOVA). Meanwhile, principal component analysis (PCA) is a technique
for identifying linear combinations of the original variables that can be used to account for variance in
those variables. The variables can be successfully clustered in this manner. Because less significant
factors are eliminated from the entire data set with very little loss in the original information, PCA offers
the most significant and meaningful variables indicating the source of the variation.
Because ANOVA is useful for testing three or more variables, it was chosen as the statistical approach. ANOVA partitions the variance among several sources and assesses group differences by comparing the means of each group. ANOVA can be viewed as a generalization of the t-test: whereas a t-test compares the means of exactly two groups, ANOVA extends the comparison to three or more.
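The idea behind one-way ANOVA described above, partitioning total variance into between-group and within-group parts and comparing them as an F ratio, can be sketched in Python. The three small groups below are hypothetical illustrations, not the study's air-quality data:

```python
import statistics

def one_way_anova(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = statistics.mean(x for g in groups for x in g)

    # Between-group sum of squares: variation of group means around the grand mean
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation of observations around their group mean
    ssw = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)                   # between-group mean square
    msw = ssw / (n - k)                   # within-group mean square
    return msb / msw

# Hypothetical pollutant readings at three monitoring sites
site_a = [50, 55, 60]
site_b = [70, 75, 80]
site_c = [90, 95, 100]

f = one_way_anova([site_a, site_b, site_c])
print(f)   # a large F suggests the site means differ
```

A large F relative to the critical value of the F distribution (with k − 1 and n − k degrees of freedom) corresponds to the statistically significant between-site differences reported in the findings.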
7. Findings.
A preliminary evaluation was carried out to examine the air pollution trends and possible emission
sources in selected locations of Klang Valley. There were differences in the means of both sites and years,
implying that there were statistically significant differences in the ANOVA results. It is found that there
is statistically significant difference across sites, with p-values <0.05 for each category of air pollution.
Simultaneously, PCA was shown to be a good statistical tool for identifying air pollutant sources. The
PCA results were dominated by CO, NO2, and PM10, the major causes of outdoor air pollution in Klang
Valley, followed by SO2 and O3. The addition of these linked data can eventually help this study provide
a more convincing explanation of the relationship or correlation between air pollutants and the associated
factors that influenced the behavior of air quality levels in that specific area.
8. Reference
Mohamad, N. I., Ash’aari, Z. H., & Othman, M. (2015). Preliminary assessment of air pollutant sources
identification at selected monitoring stations in Klang Valley, Malaysia. Procedia Environmental
Sciences, 30, 121–126. https://doi.org/10.1016/j.proenv.2015.10.021
9. Appendix
https://www.sciencedirect.com/science/article/pii/S1878029615006155