Statistical significance
A study's defined significance level, denoted α, is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true;[4] and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true.[5] The result is statistically significant, by the standards of the study, when p ≤ α.
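For illustration only (this sketch is not part of the article's sources), the decision rule p ≤ α can be written out in Python. The data, the hypothesized mean of 5.0, the choice of SciPy's one-sample t-test, and α = 0.05 below are all invented assumptions:

    # A minimal sketch of the decision rule p <= alpha.
    # The data, null mean, and alpha are illustrative assumptions.
    from scipy import stats

    alpha = 0.05                                       # significance level, fixed before the study
    sample = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9]

    # Null hypothesis: the population mean equals 5.0.
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

    # The result is statistically significant when p <= alpha.
    if p_value <= alpha:
        print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")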
Related concepts[edit]
The significance level α is the threshold for p below which the null hypothesis is rejected, even though by assumption it is true and something else is going on. This means that α is also the probability of mistakenly rejecting the null hypothesis, if the null hypothesis is true.[4] This is also called a false positive or a type I error.
Sometimes researchers talk about the confidence level γ = (1 − α) instead. This is the probability of not rejecting the null hypothesis given that it is true.[33][34] Confidence levels and confidence intervals were introduced by Neyman in 1937.[35]
Role in statistical hypothesis testing[edit]
To determine whether a result is statistically significant, a researcher calculates a p-value, which is the probability of observing an effect of the same magnitude or more extreme given that the null hypothesis is true.[5][12] The null hypothesis is rejected if the p-value is less than (or equal to) a predetermined level, α. The value α is also called the significance level, and is the probability of rejecting the null hypothesis given that it is true (a type I error). It is usually set at or below 5%.
For example, when α
is set to 5%, the conditional probability of a type I error, given that the null
hypothesis is true, is 5%,[37] and a statistically significant result is one where
the observed p-value is less than (or equal to) 5%.[38] When drawing data from a
sample, this means that the rejection region comprises 5% of the sampling
distribution.[39] This 5% can be allocated to one side of the sampling
distribution, as in a one-tailed test, or partitioned to both sides of the
distribution, as in a two-tailed test, with each tail (or rejection region)
containing 2.5% of the distribution.
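As a sketch of where these rejection regions sit, the cutoff values can be computed with SciPy's inverse CDF; the standard normal null distribution and α = 5% are assumptions carried over for illustration:

    # Sketch: cutoffs bounding the 5% rejection region on a standard
    # normal null distribution, one-tailed vs. two-tailed.
    from scipy import stats

    alpha = 0.05
    z_one = stats.norm.ppf(1 - alpha)      # one tail holds all 5%: z ≈ 1.645
    z_two = stats.norm.ppf(1 - alpha / 2)  # each tail holds 2.5%: z ≈ 1.960
    print(f"one-tailed rejection region: z > {z_one:.3f}")
    print(f"two-tailed rejection region: |z| > {z_two:.3f}")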
The use of a one-tailed test depends on whether the research question or alternative hypothesis specifies a direction, such as whether a group of objects is heavier or whether students perform better on an assessment.[3] A two-tailed
test may still be used but it will be less powerful than a one-tailed test, because
the rejection region for a one-tailed test is concentrated on one end of the null
distribution and is twice the size (5% vs. 2.5%) of each rejection region for a
two-tailed test. As a result, the null hypothesis can be rejected with a less
extreme result if a one-tailed test is used.[40] The one-tailed test is only more
powerful than a two-tailed test if the specified direction of the alternative
hypothesis is correct. If it is wrong, however, then the one-tailed test has no
power.
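The contrast can be sketched numerically. In the hypothetical example below (invented group scores; SciPy's ttest_ind with its alternative argument, available since SciPy 1.6), the one-tailed p-value is half the two-tailed one because the observed difference lies in the specified direction:

    # Sketch: one-tailed vs. two-tailed p-values for the same data.
    # Group scores are invented; requires SciPy >= 1.6 for `alternative`.
    from scipy import stats

    group_a = [82, 85, 88, 90, 79, 86, 91, 84]   # e.g., assessment scores, method A
    group_b = [78, 80, 83, 85, 76, 81, 84, 79]   # e.g., assessment scores, control

    # Two-tailed: alternative is "the means differ in either direction".
    _, p_two = stats.ttest_ind(group_a, group_b, alternative="two-sided")

    # One-tailed: alternative is "group_a's mean is greater".
    _, p_one = stats.ttest_ind(group_a, group_b, alternative="greater")

    print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")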
Limitations[edit]
Researchers focusing solely on whether their results are statistically significant
might report findings that are not substantive[46] and not replicable.[47][48]
There is also a difference between statistical significance and practical
significance. A study that is found to be statistically significant may not
necessarily be practically significant.[49][19]
Effect size[edit]
Main article: Effect size
Effect size is a measure of a study's practical significance.[49] A statistically
significant result may have a weak effect. To gauge the research significance of
their result, researchers are encouraged to always report an effect size along with
p-values. An effect size measure quantifies the strength of an effect, such as the
distance between two means in units of standard deviation (cf. Cohen's d), the
correlation coefficient between two variables or its square, and other measures.
[50]
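As a sketch of such reporting (the data are invented, and the pooled-standard-deviation form of Cohen's d used here is one of several variants):

    # Sketch: report an effect size (Cohen's d) alongside the p-value.
    import numpy as np
    from scipy import stats

    a = np.array([5.2, 5.8, 6.1, 5.5, 6.3, 5.9])
    b = np.array([4.9, 5.1, 5.6, 5.0, 5.4, 5.2])

    _, p_value = stats.ttest_ind(a, b)

    # Cohen's d: difference between means in units of the pooled
    # standard deviation.
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1))
                        / (n_a + n_b - 2))
    cohens_d = (a.mean() - b.mean()) / pooled_sd

    print(f"p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")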
Reproducibility[edit]
Main article: Reproducibility
A statistically significant result may not be easy to reproduce.[48] In particular,
some statistically significant results will in fact be false positives. Each failed
attempt to reproduce a result increases the likelihood that the result was a false
positive.[51]
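A minimal simulation sketch of this point: when the null hypothesis is true in every experiment, about α of the tests nonetheless come out significant. The sample sizes, trial count, and seed below are arbitrary assumptions:

    # Sketch: with a true null hypothesis, about alpha of all tests are
    # still "significant" -- these are the false positives.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n_trials = 0.05, 10_000

    false_positives = 0
    for _ in range(n_trials):
        # Both samples come from the same distribution: the null is true.
        x = rng.normal(0.0, 1.0, 30)
        y = rng.normal(0.0, 1.0, 30)
        _, p = stats.ttest_ind(x, y)
        if p <= alpha:
            false_positives += 1

    print(f"false positive rate ~ {false_positives / n_trials:.3f}")  # about 0.05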
Challenges[edit]
See also: Misuse of p-values
Overuse in some journals[edit]
Starting in the 2010s, some journals began questioning whether significance
testing, and particularly using a threshold of α=5%, was being relied on too
heavily as the primary measure of validity of a hypothesis.[52] Some journals
encouraged authors to do more detailed analysis than just a statistical
significance test. In social psychology, the journal Basic and Applied Social
Psychology banned the use of significance testing altogether from papers it
published,[53] requiring authors to use other measures to evaluate hypotheses and
impact.[54][55]
Other editors, commenting on this ban, have noted: "Banning the reporting of p-
values, as Basic and Applied Social Psychology recently did, is not going to solve
the problem because it is merely treating a symptom of the problem. There is
nothing wrong with hypothesis testing and p-values per se as long as authors,
reviewers, and action editors use them correctly."[56] Some statisticians prefer to
use alternative measures of evidence, such as likelihood ratios or Bayes factors.
[57] Using Bayesian statistics can avoid confidence levels, but also requires
making additional assumptions,[57] and may not necessarily improve practice
regarding statistical testing.[58]
The widespread abuse of statistical significance represents an important topic of
research in metascience.[59]
Redefining significance[edit]
In 2016, the American Statistical Association (ASA) published a statement on p-
values, saying that "the widespread use of 'statistical significance' (generally
interpreted as 'p ≤ 0.05') as a license for making a claim of a scientific finding
(or implied truth) leads to considerable distortion of the scientific process".[57]
In 2017, a group of 72 authors proposed to enhance reproducibility by changing the
p-value threshold for statistical significance from 0.05 to 0.005.[60] Other
researchers responded that imposing a more stringent significance threshold would
aggravate problems such as data dredging; alternative proposals are thus to
select and justify flexible p-value thresholds before collecting data,[61] or to
interpret p-values as continuous indices, thereby discarding thresholds and
statistical significance.[62] Additionally, the change to 0.005 would increase the
likelihood of false negatives, whereby the effect being studied is real, but the
test fails to show it.[63]
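That trade-off can be sketched by simulation: with an assumed real effect (here an invented 0.5-standard-deviation shift, with arbitrary sample sizes and trial count), the stricter threshold misses the effect more often:

    # Sketch: a stricter significance threshold produces more false
    # negatives when a real effect exists.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_trials = 5_000
    misses = {0.05: 0, 0.005: 0}

    for _ in range(n_trials):
        x = rng.normal(0.0, 1.0, 30)   # control group
        y = rng.normal(0.5, 1.0, 30)   # a real effect: mean shifted by 0.5 SD
        _, p = stats.ttest_ind(x, y)
        for alpha in misses:
            if p > alpha:              # the test fails to detect the real effect
                misses[alpha] += 1

    for alpha, count in misses.items():
        print(f"alpha = {alpha}: false negative rate ~ {count / n_trials:.3f}")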
In 2019, over 800 statisticians and scientists signed a message calling for the
abandonment of the term "statistical significance" in science,[64] and the ASA
published a further official statement[65] declaring (page 2): "We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term 'statistically significant' entirely. Nor should variants such as 'significantly different,' 'p ≤ 0.05,' and 'nonsignificant' survive, whether expressed in words, by asterisks in a table, or in some other way."
See also[edit]
A/B testing, ABX test
Estimation statistics
Fisher's method for combining independent tests of significance
Look-elsewhere effect
Multiple comparisons problem
Sample size
Texas sharpshooter fallacy (gives examples of tests where the significance level
was set too high)
References[edit]
^ a b Myers, Jerome L.; Well, Arnold D.; Lorch, Robert F. Jr. (2010). "Developing
fundamentals of hypothesis testing using the binomial distribution". Research
design and statistical analysis (3rd ed.). New York, NY: Routledge. pp. 65–90.
ISBN 978-0-8058-6431-1.
^ Cumming, Geoff (2012). Understanding The New Statistics: Effect Sizes, Confidence
Intervals, and Meta-Analysis. New York, USA: Routledge. pp. 27–28.
^ Sham, Pak C.; Purcell, Shaun M. (17 April 2014). "Statistical power and
significance testing in large-scale genetic studies". Nature Reviews Genetics. 15
(5): 335–346. doi:10.1038/nrg3706. PMID 24739678. S2CID 10961123.
^ Altman, Douglas G. (1999). Practical Statistics for Medical Research. New York,
USA: Chapman & Hall/CRC. p. 167. ISBN 978-0-412-27630-9.
^ a b Devore, Jay L. (2011). Probability and Statistics for Engineering and the
Sciences (8th ed.). Boston, MA: Cengage Learning. pp. 300–344. ISBN 978-0-538-
73352-6.
^ Babbie, Earl R. (2013). "The logic of sampling". The Practice of Social Research
(13th ed.). Belmont, CA: Cengage Learning. pp. 185–226. ISBN 978-1-133-04979-1.
^ McKillup, Steve (2006). "Probability helps you make a decision about your
results". Statistics Explained: An Introductory Guide for Life Scientists
(1st ed.). Cambridge, United Kingdom: Cambridge University Press. pp. 44–56.
ISBN 978-0-521-54316-3.
^ Myers, Jerome L.; Well, Arnold D.; Lorch, Robert F. Jr. (2010). "The t
distribution and its applications". Research Design and Statistical Analysis
(3rd ed.). New York, NY: Routledge. pp. 124–153. ISBN 978-0-8058-6431-1.
^ John Arbuthnot (1710). "An argument for Divine Providence, taken from the
constant regularity observed in the births of both sexes" (PDF). Philosophical
Transactions of the Royal Society of London. 27 (325–336): 186–190.
doi:10.1098/rstl.1710.0011.
^ Conover, W.J. (1999), "Chapter 3.4: The Sign Test", Practical Nonparametric
Statistics (Third ed.), Wiley, pp. 157–176, ISBN 978-0-471-16068-7
^ a b c Quinn, Geoffrey R.; Keough, Michael J. (2002). Experimental Design and Data
Analysis for Biologists (1st ed.). Cambridge, UK: Cambridge University Press.
pp. 46–69. ISBN 978-0-521-00976-8.
^ "Conclusions about statistical significance are possible with the help of the
confidence interval. If the confidence interval does not include the value of zero
effect, it can be assumed that there is a statistically significant result." Prel,
Jean-Baptist du; Hommel, Gerhard; Röhrig, Bernd; Blettner, Maria (2009).
"Confidence Interval or P-Value?". Deutsches Ärzteblatt Online. 106 (19): 335–9.
doi:10.3238/arztebl.2009.0335. PMC 2689604. PMID 19547734.
^ Meier, Kenneth J.; Brudney, Jeffrey L.; Bohte, John (2011). Applied Statistics
for Public and Nonprofit Administration (3rd ed.). Boston, MA: Cengage Learning.
pp. 189–209. ISBN 978-1-111-34280-7.
^ Healy, Joseph F. (2009). The Essentials of Statistics: A Tool for Social Research
(2nd ed.). Belmont, CA: Cengage Learning. pp. 177–205. ISBN 978-0-495-60143-2.
^ Vaughan, Simon (2013). Scientific Inference: Learning from Data (1st ed.).
Cambridge, UK: Cambridge University Press. pp. 146–152. ISBN 978-1-107-02482-3.
^ Franklin, Allan (2013). "Prologue: The rise of the sigmas". Shifting Standards:
Experiments in Particle Physics in the Twentieth Century (1st ed.). Pittsburgh, PA:
University of Pittsburgh Press. pp. ii–iii. ISBN 978-0-8229-4430-0.
^ Clarke, GM; Anderson, CA; Pettersson, FH; Cardon, LR; Morris, AP; Zondervan, KT
(February 6, 2011). "Basic statistical analysis in genetic case-control studies".
Nature Protocols. 6 (2): 121–33. doi:10.1038/nprot.2010.182. PMC 3154648.
PMID 21293453.
^ Ioannidis, John P. A. (2005). "Why most published research findings are false".
PLOS Medicine. 2 (8): e124. doi:10.1371/journal.pmed.0020124. PMC 1182327.
PMID 16060722.
^ a b Hojat, Mohammadreza; Xu, Gang (2004). "A Visitor's Guide to Effect Sizes".
Advances in Health Sciences Education. 9 (3): 241–9.
doi:10.1023/B:AHSE.0000038173.00909.f6. PMID 15316274. S2CID 8045624.
^ "CSSME Seminar Series: The argument over p-values and the Null Hypothesis
Significance Testing (NHST) paradigm". www.education.leeds.ac.uk. School of
Education, University of Leeds. Retrieved 2016-12-01.
^ Siegfried, Tom (2015-03-17). "P value ban: small step for a journal, giant leap
for science". Science News. Retrieved 2016-12-01.
^ Antonakis, John (February 2017). "On doing better science: From thrill of
discovery to policy implications" (PDF). The Leadership Quarterly. 28 (1): 5–21.
doi:10.1016/j.leaqua.2017.01.006.
^ García-Pérez, Miguel A. (2016-10-05). "Thou Shalt Not Bear False Witness Against
Null Hypothesis Significance Testing". Educational and Psychological Measurement.
77 (4): 631–662. doi:10.1177/0013164416668232. ISSN 0013-1644. PMC 5991793.
PMID 30034024.
^ Ioannidis, John P. A.; Ware, Jennifer J.; Wagenmakers, Eric-Jan; Simonsohn, Uri;
Chambers, Christopher D.; Button, Katherine S.; Bishop, Dorothy V. M.; Nosek, Brian
A.; Munafò, Marcus R. (January 2017). "A manifesto for reproducible science".
Nature Human Behaviour. 1 (1): 0021. doi:10.1038/s41562-016-0021. PMC 7610724.
PMID 33954258.
^ Wasserstein, Ronald L.; Schirm, Allen L.; Lazar, Nicole A. (2019-03-20). "Moving
to a World Beyond "p < 0.05"". The American Statistician. 73 (sup1): 1–19.
doi:10.1080/00031305.2019.1583913.
Further reading[edit]
Lydia Denworth, "A Significant Problem: Standard scientific methods are under fire.
Will anything change?", Scientific American, vol. 321, no. 4 (October 2019),
pp. 62–67. "The use of p values for nearly a century [since 1925] to determine
statistical significance of experimental results has contributed to an illusion of
certainty and [to] reproducibility crises in many scientific fields. There is
growing determination to reform statistical analysis... Some [researchers] suggest
changing statistical methods, whereas others would do away with a threshold for
defining "significant" results." (p. 63.)
Ziliak, Stephen and Deirdre McCloskey (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Archived 2010-06-08 at the Wayback Machine. Ann Arbor: University of Michigan Press, 2009. ISBN 978-0-472-07007-7.
Thompson, Bruce (2004). "The "significance" crisis in psychology and education".
Journal of Socio-Economics. 33 (5): 607–613. doi:10.1016/j.socec.2004.09.034.
Chow, Siu L. (1996). Statistical Significance: Rationale, Validity and Utility
Archived 2013-12-03 at the Wayback Machine, Volume 1 of series Introducing
Statistical Methods, Sage Publications Ltd, ISBN 978-0-7619-5205-3 – argues that
statistical significance is useful in certain circumstances.
Kline, Rex (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association.
Nuzzo, Regina (2014). Scientific method: Statistical errors. Nature, Vol. 506, pp. 150–152 (open access). Highlights common misunderstandings about the p value.
Cohen, Jacob (1994). The earth is round (p < .05). American Psychologist, Vol. 49, pp. 997–1003. Archived 2017-07-13 at the Wayback Machine. Reviews problems with null hypothesis statistical testing.
Amrhein, Valentin; Greenland, Sander; McShane, Blake (2019-03-20). "Scientists rise
up against statistical significance". Nature. 567 (7748): 305–307.
Bibcode:2019Natur.567..305A. doi:10.1038/d41586-019-00857-9. PMID 30894741.
External links[edit]
The article "Earliest Known Uses of Some of the Words of Mathematics (S)" contains
an entry on Significance that provides some historical information.
"The Concept of Statistical Significance Testing Archived 2022-09-07 at the Wayback
Machine" (February 1994): article by Bruce Thompon hosted by the ERIC Clearinghouse
on Assessment and Evaluation, Washington, D.C.
"What does it mean for a result to be "statistically significant"?" (no date): an
article from the Statistical Assessment Service at George Mason University,
Washington, D.C.