Research Notes
TYPES OF FACTOR ANALYSIS (FA)
Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) are statistical
methods used to study latent variables in datasets, particularly in psychology, education, and
social sciences. Here's a detailed explanation of their differences, steps for each, and commonly
used indices.
Steps of Exploratory Factor Analysis (EFA) (a Python sketch follows this list)
1. Data Preparation
○ Ensure a large sample size (minimum 5-10 cases per variable).
○ Perform Bartlett’s test of sphericity and calculate the Kaiser-Meyer-Olkin (KMO)
measure to check data suitability.
2. Factor Extraction
○ Choose an extraction method like Principal Axis Factoring or Maximum
Likelihood.
3. Determining the Number of Factors
○ Use eigenvalues > 1, scree plot, or parallel analysis to decide the number of
factors.
4. Factor Rotation
○ Apply rotation (e.g., Varimax for orthogonal, Promax for oblique) to achieve a
simpler factor structure.
5. Interpretation
○ Examine the rotated factor matrix and assign meaningful labels to the factors.
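As a rough illustration of these EFA steps, here is a minimal Python sketch. It assumes the third-party factor_analyzer package and a hypothetical file survey_items.csv of numeric items; the file name, number of factors, and cut-offs are illustrative assumptions, not part of the notes above.

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical data: one numeric column per survey item
df = pd.read_csv("survey_items.csv")  # assumed file name

# 1. Data preparation: Bartlett's test of sphericity and the KMO measure
chi_square, p_value = calculate_bartlett_sphericity(df)
kmo_per_item, kmo_total = calculate_kmo(df)
print(f"Bartlett p = {p_value:.4f}, overall KMO = {kmo_total:.2f}")  # KMO above ~0.6 is usually considered acceptable

# 2-3. Extraction and number of factors: inspect eigenvalues (Kaiser criterion: eigenvalue > 1)
fa = FactorAnalyzer(rotation=None)
fa.fit(df)
eigenvalues, _ = fa.get_eigenvalues()
n_factors = int((eigenvalues > 1).sum())

# 4. Re-fit with the chosen number of factors and a Varimax (orthogonal) rotation
fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax")
fa.fit(df)

# 5. Interpretation: examine the rotated loadings and label the factors
loadings = pd.DataFrame(fa.loadings_, index=df.columns)
print(loadings.round(2))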
Steps of Confirmatory Factor Analysis (CFA)
Confirmatory Factor Analysis (CFA) is a statistical technique used to test whether the data fit a hypothesized measurement model. The steps involved are listed below (a code sketch follows the list):
1. Model Specification
○ Define the number of factors and the relationships between observed and latent
variables based on theory.
2. Model Identification
○ Ensure the model is mathematically identifiable (the number of unique variances and covariances among the observed variables must be at least as large as the number of freely estimated parameters).
3. Parameter Estimation
○ Estimate factor loadings, error variances, and covariances using techniques like
Maximum Likelihood Estimation (MLE).
4. Model Evaluation
○ Assess model fit using indices (explained below). Modify the model if necessary
(e.g., adding correlations between errors).
5. Interpretation and Validation
○ Confirm whether the model fits the data well and validate it with a separate
dataset.
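One way to run such a CFA in Python is with the third-party semopy package. The sketch below is only illustrative: the two-factor model, the item names x1-x6, and the data file are assumptions, and the package's API may differ slightly across versions.

import pandas as pd
import semopy

# Hypothetical data with observed items x1..x6
df = pd.read_csv("questionnaire.csv")  # assumed file name

# 1. Model specification: two latent factors with three indicators each (lavaan-style syntax)
model_desc = """
Anxiety =~ x1 + x2 + x3
Depression =~ x4 + x5 + x6
"""

# 2-3. Identification and estimation (maximum likelihood by default)
model = semopy.Model(model_desc)
model.fit(df)

# 4-5. Model evaluation: parameter estimates and fit indices (e.g., CFI, TLI, RMSEA)
print(model.inspect())           # loadings, error variances, covariances
print(semopy.calc_stats(model))  # table of fit statistics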
Methods of Extraction
● Function: This method minimizes the residuals (the differences between the observed correlations and those reproduced by the factor model). The quality of the solution depends on the number of factors extracted.
● Example: In factor-analyzing a set of survey data, the residuals between the observed and reproduced correlations are minimized to produce a clean factor solution.
5. Unweighted Least Squares (ULS):
● Function: This method minimizes the sum of squared differences between the observed and reproduced correlation matrices.
● Example: In a study of employee satisfaction, ULS could minimize differences between the actual survey responses and the predicted responses based on factors like job satisfaction and work-life balance.
6. Generalized Least Squares (GLS):
● Function: GLS is similar to ULS but weights the variables when minimizing the differences between the observed and reproduced correlation matrices, giving more weight to variables the factors explain well.
● Example: In a factor analysis of educational data, GLS might give more weight to well-measured items (e.g., reliable test scores) to refine the factor extraction.
7. Image Factoring:
● Function: A hybrid approach combining elements of PCA and PAF. It works with the "image" of each variable, the part of the variable that can be predicted from the other variables, and extracts factors from these image scores.
● Example: In social media data analysis, image factoring could extract factors like "user
engagement" by combining PCA's variance extraction with PAF's focus on shared
behaviors.
8. Alpha Factoring:
● Purpose: Used for psychometric purposes, especially when reliability matters. It extracts factors so as to maximize their Cronbach's alpha (internal consistency, the extent to which items in a scale are related).
● Example: In a psychological test, alpha factoring could assess the reliability of questions
measuring depression symptoms.
Rotation in Factor Analysis refers to the process of transforming the factor solution to achieve a simpler, more interpretable structure. After an initial factor extraction (such as Principal Component Analysis or Maximum Likelihood), the factors are usually not easily interpretable. Rotation helps to achieve a solution where the factors are more meaningful and easier to understand. A code sketch comparing the two rotation families follows the list of rotation types below.
Types of Rotation:
1. Orthogonal Rotation:
○ In this type, the factors remain uncorrelated, meaning they are kept at right
angles (perpendicular) to each other.
○ Types of Orthogonal Rotation:
■ Varimax: The most commonly used. It maximizes the variance of the squared loadings within each factor, making each factor as distinct as possible by pushing loadings toward either high or near-zero values.
■ Quartimax: Attempts to simplify the factor structure by reducing the
number of variables with large loadings on more than one factor.
■ Equamax: A compromise between Varimax and Quartimax, aiming for a
balance in simplicity and clarity.
2. Oblique Rotation:
○ Here, the factors are allowed to correlate, meaning the axes of the factors can tilt
in any direction. This is more flexible and realistic in many social science
contexts, where factors often have some degree of correlation.
○ Types of Oblique Rotation:
■ Direct Oblimin: A commonly used oblique rotation method that allows for
some degree of correlation between factors.
■ Promax: A faster and simpler oblique method. It starts with an orthogonal
rotation and then "promotes" the factors into oblique relationships.
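To see what rotation does in practice, the hedged sketch below fits the same two-factor solution with a Varimax (orthogonal) and a Promax (oblique) rotation and prints the loadings side by side; it again assumes the factor_analyzer package and a hypothetical survey_items.csv.

import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("survey_items.csv")  # assumed data: numeric items only

for rotation in ("varimax", "promax"):
    fa = FactorAnalyzer(n_factors=2, rotation=rotation)  # two factors assumed for illustration
    fa.fit(df)
    print(f"\n{rotation.capitalize()} loadings:")
    # A simple structure shows high loadings on one factor and near-zero loadings on the others
    print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))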
Descriptive Statistics
Descriptive Statistics involve methods for summarizing and organizing data to make it
interpretable. They focus on describing the basic features of data in a study.
Characteristics:
● Descriptive statistics summarize and describe the sample at hand; on their own, they do not support inferences about a larger population.
Advantages:
● They are simple to compute and give a quick, easily interpreted overview of the data (typical values and spread).
Measures of Central Tendency describe the center or typical value of a dataset. The three
main measures are Mean, Median, and Mode.
1. Mean
The mean is the sum of all values divided by the total number of values.
Formula:
Mean = (sum of all values) / (number of values), i.e., x̄ = Σx / n
Example:
Consider the dataset: 10, 20, 20, 30, 40.
Mean = (10 + 20 + 20 + 30 + 40) / 5 = 120 / 5 = 24
Use:
● Appropriate for interval or ratio data.
● Sensitive to extreme outliers, which can pull the mean toward them.
2. Median
The median is the middle value when the data is ordered. If the dataset has an even number of values, the median is the average of the two middle values.
Steps:
1. Arrange the values in ascending order.
2. If the number of values is odd, take the middle value; if it is even, average the two middle values.
Example:
Dataset: 10, 20, 20, 30, 40 (already in order).
Median = 20
Use:
● Appropriate for ordinal, interval, or ratio data.
● Effective when the dataset contains extreme outliers.
3. Mode
The mode is the most frequently occurring value(s) in the dataset. There can be more than one mode.
Example:
Dataset: 10, 20, 20, 30, 40.
Mode = 20
Use:
● Appropriate for nominal data (and any level of measurement).
● Useful for identifying the most common value or category.
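A short Python sketch of the three measures, using the standard library and the example dataset above:

import statistics

data = [10, 20, 20, 30, 40]

print(statistics.mean(data))    # 24 -> sum of the values divided by the number of values
print(statistics.median(data))  # 20 -> middle value of the ordered data
print(statistics.mode(data))    # 20 -> most frequently occurring value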
1. Variance
Variance is a measure of how far each data point in the dataset is from the mean (average). It
calculates the average of the squared differences from the mean. This method gives a sense of
the overall spread in the data but in squared units, which can sometimes make it difficult to
interpret in the context of the original data.
2. Standard Deviation
The standard deviation is the square root of the variance. This measure gives a sense of the
spread of data in the same units as the original data, which makes it easier to interpret than
variance. A larger standard deviation indicates that the data points are more spread out from the
mean, while a smaller standard deviation indicates that the data points are closer to the mean.
Key Differences Between Variance and Standard Deviation:
1. Units:
○ Variance is expressed in squared units, making it harder to interpret in practical
terms.
○ Standard deviation is expressed in the original units, making it easier to
understand.
2. Interpretability:
○ Standard deviation is more commonly used because it provides a clearer picture
of the data spread in its original units.
● Variance and standard deviation are essential for understanding data spread,
particularly in fields like finance, education, psychology, and quality control. They help
determine the risk, predictability, and consistency of datasets.
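A minimal Python sketch of both measures, using the sample (n - 1) formulas from the standard library and the same example dataset (the dataset is just for illustration):

import statistics

data = [10, 20, 20, 30, 40]

var = statistics.variance(data)  # sample variance, expressed in squared units
sd = statistics.stdev(data)      # sample standard deviation, in the original units

print(var)  # 130 -> average squared distance from the mean (with n - 1 in the denominator)
print(sd)   # about 11.40 -> square root of the variance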
Skewness
What is Skewness? Skewness indicates whether a data set is symmetrical or uneven, that is, how much the data lean to one side compared to the other.
Positive Skew (Right Skewed): The right side has a longer tail. Most values are lower, with a few
very high ones.
Example: Imagine a test where most students score between 50 and 70, but a few students score
100. The average score will be higher than most individual scores because of those few high scores.
Negative Skew (Left Skewed): The left side has a longer tail. Most values are higher, with a
few very low ones.
Example: most students score between 80 and 100, but a few score 20. The average score will
be lower than most individual scores because of those few low scores.
Kurtosis
What is Kurtosis? Kurtosis measures the "peakedness" of a distribution. It tells us how much
of the data is in the tails (extreme values) compared to the center.
Why is Kurtosis Important? Kurtosis helps us understand the risk of extreme outcomes. For
example, in finance, knowing if returns are high in kurtosis can indicate a higher risk of large
losses or gains.
Types of Kurtosis:
● Mesokurtic: This is a normal distribution (like a bell curve) with a balanced peak and
tails.
● Leptokurtic: This has a sharp peak and thicker tails, indicating more extreme values.
Think of stock prices that can vary widely.
● Platykurtic: This has a flatter peak and thinner tails, meaning fewer extreme values. For
example, daily temperatures that stay fairly consistent.
Why They Matter:
● Skewness helps determine if certain statistical methods can be applied to the data.
● Kurtosis provides insight into the likelihood of extreme outcomes, such as significant
gains or losses.
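As a quick check, SciPy can compute both statistics for a sample; the data below are made up to show a right-skewed pattern.

import numpy as np
from scipy.stats import skew, kurtosis

# Made-up exam scores: most are moderate, a few are very high -> positive (right) skew
scores = np.array([55, 58, 60, 62, 65, 66, 68, 70, 95, 100])

print(round(float(skew(scores)), 2))                   # > 0 indicates a longer right tail
print(round(float(kurtosis(scores, fisher=True)), 2))  # excess kurtosis: ~0 mesokurtic, > 0 leptokurtic, < 0 platykurtic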
Two-Way ANOVA
A Two-Way ANOVA examines how two categorical independent variables (factors), and their interaction, affect a single continuous dependent variable, for example, how teaching method and gender affect students' test scores.
Key Concepts:
1. Factors and Levels:
○ The two independent variables are the factors (e.g., teaching method and gender), and the categories within each factor are its levels.
2. Dependent Variable:
○ This is the outcome variable that is measured. In the example above, the dependent variable would be the students' test scores.
3. Hypotheses:
○ Null Hypotheses (H0): These hypotheses state that there are no effects from
the factors on the dependent variable:
■ H0 for Factor A: The means of the dependent variable are equal across
all levels of Factor A.
■ H0 for Factor B: The means of the dependent variable are equal across
all levels of Factor B.
■ H0 for Interaction: There is no interaction effect between Factor A and
Factor B on the dependent variable.
○ Alternative Hypotheses (H1): At least one of the means is different from the
others.
4. Interaction Effect:
○ A significant interaction effect indicates that the impact of one factor on the
dependent variable varies depending on the level of the other factor. For
example, the effectiveness of a teaching method may differ between male and
female students.
Steps in Conducting a Two-Way ANOVA:
1. Data Collection:
○ Gather data for the dependent variable across all combinations of the levels of the two factors.
2. Assumption Checks:
○ Verify independence of observations, normality of the residuals, and homogeneity of variance across groups.
3. Interpreting the Results:
○ If the p-value is less than the significance level (commonly set at 0.05), the null hypothesis for that factor or interaction is rejected, indicating a significant effect.
4. Post-Hoc Tests:
○ If significant effects are found, post-hoc tests (like Tukey's HSD) can be performed to identify which specific groups differ.
5. Reporting Findings:
○ Present the results clearly, including F-values, p-values, and any significant
interactions.
Applications:
Two-Way ANOVA is widely utilized in various fields, including psychology, medicine, and social sciences, to analyze the effects of multiple factors on a response variable.
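A minimal statsmodels sketch of a Two-Way ANOVA along these lines; the long-format data file and the column names score, method, and gender are assumptions for illustration.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: one row per student with test score, teaching method, and gender
df = pd.read_csv("scores.csv")  # assumed columns: score, method, gender

# Fit a linear model with both main effects and the interaction term
model = ols("score ~ C(method) + C(gender) + C(method):C(gender)", data=df).fit()

# Two-way ANOVA table (Type II sums of squares): F-values and p-values for each effect
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)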
Two-Way MANOVA:
A Two-Way MANOVA extends the Two-Way ANOVA to designs with two or more dependent variables measured on the same participants.
Key Concepts:
1. Factors and Levels:
○ As in Two-Way ANOVA, there are two independent variables (factors), each with two or more levels (e.g., teaching method and gender).
2. Dependent Variables:
○ In Two-Way MANOVA, there are two or more dependent variables that are measured. For example, you might measure both test scores and engagement levels as outcomes of the teaching methods and gender.
3. Hypotheses:
○ Null Hypotheses (H0): These hypotheses state that there are no effects from
the factors on the dependent variables:
■ H0 for Factor A: The means of the dependent variables are equal across
all levels of Factor A.
■ H0 for Factor B: The means of the dependent variables are equal across
all levels of Factor B.
■ H0 for Interaction: There is no interaction effect between Factor A and
Factor B on the dependent variables.
○ Alternative Hypotheses (H1): At least one of the means is different from the
others.
4. Interaction Effect:
○ A significant interaction effect indicates that the impact of one factor on the
dependent variables varies depending on the level of the other factor. For
example, the effectiveness of a teaching method may differ between male and
female students in terms of both test scores and engagement levels.
Steps in Conducting a Two-Way MANOVA:
1. Data Collection:
○ Gather data for the dependent variables across all combinations of the levels of the two factors.
2. Assumption Checks:
○ Verify independence of observations, multivariate normality, and homogeneity of the covariance matrices (e.g., with Box's M test).
3. Interpreting the Results:
○ If the p-value is less than the significance level (commonly set at 0.05), the null hypothesis for that factor or interaction is rejected, indicating a significant effect.
4. Follow-Up Analyses:
○ If significant effects are found, follow-up analyses (such as ANOVA for each dependent variable) can be performed to identify which specific groups differ.
5. Reporting Findings:
○ Present the results clearly, including multivariate statistics (e.g., Wilks' Lambda),
F-values, p-values, and any significant interactions.
Applications:
Two-Way MANOVA is widely utilized in various fields, including psychology, education, and
social sciences, to analyze the effects of multiple factors on multiple response variables. For
instance, it can be employed to study the impact of different therapies (Factor A) and patient
age groups (Factor B) on recovery rates and quality-of-life measures.
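A corresponding statsmodels sketch for a Two-Way MANOVA with two hypothetical dependent variables (score and engagement); the file and column names are assumptions.

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data: two outcomes (score, engagement) and two factors (method, gender)
df = pd.read_csv("outcomes.csv")  # assumed columns: score, engagement, method, gender

mv = MANOVA.from_formula("score + engagement ~ C(method) * C(gender)", data=df)

# Multivariate tests (Wilks' lambda, Pillai's trace, etc.) for each factor and the interaction
print(mv.mv_test())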
Correlation
Correlation theory is a statistical method used to measure and analyze the strength and
direction of the relationship between two or more variables. It is a fundamental concept in
statistics and research methods, particularly in fields like psychology, where understanding
relationships between variables is crucial.
1. Definition of Correlation:
○ Correlation quantifies the degree to which two variables are related. A correlation
coefficient, typically denoted as r, ranges from -1 to +1. A value of +1 indicates a
perfect positive correlation, -1 indicates a perfect negative correlation, and 0
indicates no correlation.
2. Types of Correlation:
○ Pearson's correlation coefficient (r) measures the strength and direction of the linear relationship between two continuous (interval or ratio) variables.
○ When the data do not meet the assumptions of normality or when dealing with ordinal data, Spearman's rank correlation can be used. It assesses how well the relationship between two variables can be described by a monotonic function.
5. Interpreting the Correlation Coefficient:
○ Strength of Correlation:
■ 0.00 to 0.19: Very weak
■ 0.20 to 0.39: Weak
■ 0.40 to 0.59: Moderate
■ 0.60 to 0.79: Strong
■ 0.80 to 1.00: Very strong
○ Direction of Correlation:
■ Positive values indicate a direct relationship, while negative values
indicate an inverse relationship.
7. Limitations of Correlation:
○ Correlation does not imply causation. Just because two variables are correlated
does not mean that one causes the other.
○ Outliers can significantly affect the correlation coefficient, leading to misleading
interpretations.
8. Applications of Correlation:
○ Correlation is widely used to examine relationships between variables (e.g., hours of study and exam performance) and as a starting point for regression and reliability analyses.
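For illustration, SciPy provides both coefficients directly; the study-hours data below are made up.

from scipy.stats import pearsonr, spearmanr

# Made-up data: hours studied and exam scores for eight students
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 72, 75, 83]

r, p_pearson = pearsonr(hours, scores)      # linear relationship (interval/ratio data)
rho, p_spearman = spearmanr(hours, scores)  # monotonic relationship (ordinal or non-normal data)

print(f"Pearson r = {r:.2f} (p = {p_pearson:.4f})")
print(f"Spearman rho = {rho:.2f} (p = {p_spearman:.4f})")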
Assumptions of the t-test and ANOVA
Both the t-test and ANOVA (Analysis of Variance) are statistical methods used to compare means across groups. However, they rely on certain assumptions to ensure the validity of the results. Below are the key assumptions for both tests, explained in detail.
1. Random Sampling
Definition: Random sampling refers to the process of selecting a subset of individuals from a
larger population in such a way that every individual has an equal chance of being chosen. This
helps to ensure that the sample is representative of the population.
Importance:
● Generalizability: Random sampling enhances the ability to generalize findings from the
sample to the broader population.
● Reduction of Bias: It minimizes selection bias, ensuring that the results are not skewed
by the characteristics of the sample.
● Validity of Inference: Random sampling supports the validity of statistical inferences
made from the sample data.
● For both t-tests and ANOVA, it is crucial that the samples are drawn randomly from the
populations being studied. If the samples are not random, the results may not accurately
reflect the population parameters, leading to erroneous conclusions.
2. Normality
Definition: Normality refers to the assumption that the data follows a normal distribution
(bell-shaped curve). This means that most of the observations cluster around the mean, with
fewer observations occurring as you move away from the mean.
Importance:
● Statistical Validity: Many statistical tests, including the t-test and ANOVA, assume that
the data is normally distributed. This assumption is particularly important for small
sample sizes.
● Robustness: While t-tests and ANOVA are robust to violations of normality with larger
sample sizes (due to the Central Limit Theorem), significant deviations from normality
can affect the results, especially in smaller samples.
● Normality can be assessed using graphical methods (e.g., Q-Q plots, histograms) or
statistical tests (e.g., Shapiro-Wilk test, Kolmogorov-Smirnov test). If the data
significantly deviates from normality, transformations (e.g., logarithmic, square root) may
be applied, or non-parametric tests may be considered.
3. Homogeneity of Variance
Definition: Homogeneity of variance refers to the assumption that the variances among the
groups being compared are equal. This means that the spread or dispersion of scores in each
group should be similar.
Importance:
● Validity of Results: Homogeneity of variance is crucial for the validity of the t-test and
ANOVA results. If the variances are significantly different, it can lead to inaccurate
conclusions about the means of the groups.
● Type I Error Rate: Violations of this assumption can inflate the Type I error rate (the
probability of incorrectly rejecting the null hypothesis), leading to false positives.
● The assumption can be tested using Levene's test, Bartlett's test, or the Brown-Forsythe test (a Python sketch follows this list). If the assumption is violated, researchers may consider using alternative methods, such as:
○ Welch's t-test: A variation of the t-test that does not assume equal variances.
○ Welch's ANOVA: A version of ANOVA that is robust to violations of homogeneity of variance.
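A short SciPy sketch of this workflow with two made-up groups: Levene's test checks homogeneity of variance, and Welch's t-test is used when the assumption looks violated.

from scipy.stats import levene, ttest_ind

# Made-up scores for two independent groups
group_a = [23, 25, 28, 30, 31, 33, 35]
group_b = [20, 22, 29, 35, 41, 48, 55]  # visibly more spread out than group_a

stat, p_levene = levene(group_a, group_b)
equal_var = bool(p_levene > 0.05)  # fail to reject equal variances at alpha = 0.05

# Student's t-test if the variances look equal, Welch's t-test (equal_var=False) otherwise
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=equal_var)
print(f"Levene p = {p_levene:.3f}, equal_var = {equal_var}, t-test p = {p_value:.3f}")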
Steps of Hypothesis Testing
1. State the Hypotheses
● Definition: The null hypothesis (H0) posits that no statistically significant relationship, effect, or difference exists in the population; the alternative hypothesis (Ha) states that such an effect or difference does exist.
● Purpose: H0 acts as a baseline for testing. Researchers aim to disprove H0.
● Example: In a drug efficacy study, H0: "The drug has no effect on reducing symptoms."
2. Set the Significance Level (α)
● The significance level is the threshold for determining whether to reject the null hypothesis.
● Commonly used values are α = 0.05 (5%) or α = 0.01 (1%).
● α represents the probability of rejecting the null hypothesis when it is true (Type I error).
3. Choose the Appropriate Statistical Test
● The choice of test depends on the type of data, the sample size, and the hypothesis being tested. Examples include:
○ Z-test: For large sample sizes and known population standard deviation.
○ t-test: For small sample sizes or unknown population standard deviation.
○ ANOVA: For comparing means of three or more groups.
4. Calculate the Test Statistic
● Use the appropriate formula for the selected statistical test. The test statistic quantifies the difference between the sample data and what is expected under H0.
5. Determine the Critical Value or p-value
● Compare the calculated test statistic to the critical value based on the significance level (α).
● Alternatively, calculate the p-value, which represents the probability of obtaining results at least as extreme as those observed, assuming H0 is true.
6. Make a Decision
● If the test statistic exceeds the critical value (or if p-value ≤ α):
○ Reject the null hypothesis (H0).
○ Accept the alternative hypothesis (Ha).
● If the test statistic does not exceed the critical value (or if p-value > α):
○ Fail to reject H0.
○ Conclude there is insufficient evidence to support Ha.
7. State the Conclusion
● Clearly state the conclusion in the context of the research question, ensuring it aligns with the findings and the hypothesis.
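A worked Python sketch of these steps using an independent-samples t-test; the symptom scores and the significance level are made-up assumptions for illustration.

from scipy.stats import ttest_ind

# Step 1: H0 - the drug has no effect (treatment and control means are equal); Ha - the means differ
# Step 2: significance level
alpha = 0.05

# Made-up symptom scores (lower = fewer symptoms)
treatment = [12, 10, 9, 11, 8, 7, 10, 9]
control = [14, 15, 13, 16, 12, 15, 14, 13]

# Steps 3-4: choose a t-test and calculate the test statistic
t_stat, p_value = ttest_ind(treatment, control)

# Steps 5-7: compare the p-value with alpha and state the conclusion
if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject H0; the group means differ significantly.")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject H0; insufficient evidence of a difference.")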
5 types of graphs
Here are five types of graphs commonly used in statistics and their detailed characteristics:
1. Bar Graph
● Represents: Categorical data (nominal or ordinal).
● Purpose: Compares the frequency or magnitude of different categories.
● Characteristics:
○ The x-axis represents the different categories (e.g., colors, types of products, or
countries).
○ The y-axis shows the frequency or magnitude (e.g., number of occurrences or
percentage).
○ Bars are typically rectangular and stand separate from each other, emphasizing
that the categories are distinct.
○ Can be plotted vertically or horizontally.
● When to use: When you have discrete data points that are not numerically ordered or
do not have a logical progression.
2. Histogram
● Represents: Continuous numerical data.
● Purpose: Shows the distribution of data across intervals (or bins).
● Characteristics:
○ The x-axis represents intervals or ranges (e.g., 10-20, 21-30) of numerical data.
○ The y-axis represents the frequency or count of data points within each range.
○ The bars are adjacent to each other to indicate continuity in the data.
○ Useful for visualizing the spread, central tendency, and variability of data.
● When to use: When dealing with numerical data that has a continuous scale and you
want to examine the distribution.
3. Line Graph
● Represents: Continuous data or trends over time.
● Purpose: Displays changes in data points over intervals (e.g., time, temperature, sales).
● Characteristics:
○ The x-axis typically represents time or ordered categories.
○ The y-axis represents the variable being measured (e.g., temperature, stock
price).
○ Data points are connected by lines to highlight trends and changes over time.
○ The graph is particularly useful for showing trends, patterns, or relationships over
time.
● When to use: For time series data or when you need to track changes in a variable over
a continuous period.
4. Pie Chart
● Represents: Proportional data in a whole.
● Purpose: Displays how different parts contribute to a total.
● Characteristics:
○ A circular graph divided into slices, each representing a category's proportion of
the total.
○ The size of each slice is proportional to the percentage or frequency of each
category.
○ Labels or a legend are often used to indicate what each slice represents.
○ Ideal for showing parts of a whole, particularly when there are a limited number of
categories.
● When to use: When you need to show the relative proportions of different categories,
especially if the categories sum up to a total of 100%.
5. Scatter Plot
● Represents: Relationship between two continuous variables.
● Purpose: Identifies correlations, trends, or patterns between two variables.
● Characteristics:
○ Each point on the graph represents one observation or data point.
○ The x-axis and y-axis represent two different variables.
○ Points are plotted on the graph, and patterns can reveal correlations, outliers, or
clusters.
○ Often used to show the strength and direction of a relationship between two
variables (positive, negative, or none).
● When to use: When you need to explore or visualize the relationship or correlation between two variables (a minimal plotting sketch follows).
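As an example of this last graph type, a minimal matplotlib sketch of a scatter plot (the data are made up):

import matplotlib.pyplot as plt

# Made-up data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 72, 75, 83]

plt.scatter(hours, scores)    # each point is one observation
plt.xlabel("Hours studied")   # x-axis: first variable
plt.ylabel("Exam score")      # y-axis: second variable
plt.title("Study time vs. exam score")
plt.show()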
Conclusion
Each of these graphs serves a different purpose based on the type of data you are working with: bar graphs and pie charts for categorical data, histograms for the distribution of numerical data, line graphs for trends over time, and scatter plots for relationships between two variables.