
UNIT -1 INTRODUCTION TO ECONOMETRICS

STRUCTURE

1.0 Learning Objectives

1.1 Introduction

1.2 Testing of Hypothesis - Normal distribution, T-distribution, Chi-square distribution, F-distribution

1.3 Nature, meaning and scope of econometrics

1.4 Simple and general linear regression model

1.5 Assumptions, Estimation (through the Ordinary Least Squares approach) and properties of
estimators

1.6 Gauss-Markov theorem

1.7 Normality Assumptions

1.8 Concepts and derivation of R2 and adjusted R2

1.9 Concept and Interpretation of Partial and Multiple Correlation

1.10 Analysis of Variance and its Applications in Regression Analysis

1.11 Reporting of the Results of Regression

1.12 Summary

1.13 Keywords

1.14 Learning Activity

1.15 Unit End Questions

1.16 References

1.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

• Comprehend the nature and scope of econometrics. This chapter will help students
understand concepts such as testing of hypotheses, linear regression, and estimation.

• It also includes measures of central tendency (mean, median and mode) and measures of
dispersion (range, standard deviation, coefficient of variation, etc.).

• Understand maximum likelihood estimation and identify what it has in common with
least squares estimation and where the two differ.

• Apply maximum likelihood estimation and likelihood ratio testing in specific models.

1.1 INTRODUCTION

Econometrics is concerned with the measurement of economic relationships. With the goal of
giving numerical values to the parameters of economic relationships, it integrates economics,
mathematical economics, and statistics. The relationships suggested by economic theory are
typically expressed in mathematical form and then integrated with empirical economics.
Econometric techniques are used to obtain the values of the parameters, which are simply the
coefficients of the mathematical form of the economic relationships. Econometric methods are
statistical methods adapted to explain economic phenomena. Econometric relationships capture
the random behaviour of economic relationships, which is usually not taken into account in
mathematical and economic formulations.

It should be noted that econometric methodologies can also be applied in other fields, such as
the engineering sciences, biological sciences, medicine, geosciences, agriculture, etc. Simply
put, wherever a stochastic relationship needs to be expressed in mathematical form, econometric
techniques and tools can help. Econometric methods can thus be used to explain the
relationships between variables.

Economic models: a model is a simplified representation of a real-world process. It ought to be
representative in the sense that it should capture the key characteristics of the phenomenon
being studied. Generally speaking, one of the goals in modelling is to create a straightforward
explanation of a complicated phenomenon. Such a goal may occasionally lead to an
oversimplified model and occasionally to erroneous assumptions. Generally, the experimenter
includes in the model the factors that they believe are necessary to explain the phenomenon. The
remaining variables are thrown into a "disturbances" basket, where the disturbances are random
variables. This is the key distinction between economic modelling and econometric modelling.
It is also the primary distinction between mathematical and statistical modelling: mathematical
modelling is exact in nature, while statistical modelling also includes a stochastic term. An
economic model is a framework of assumptions that describes how an economy or, more
broadly, a phenomenon behaves.

An economic model includes:

- an equation system describing the behaviour. These equations are derived from the economic
model and are composed of observable variables and disturbances;

- a statement about the errors in the observed values of the variables;

- a specification of the probability distribution of the disturbances.

Objectives of econometrics:

Following are econometrics' three primary goals:

1. Developing and defining econometric models:

Economic models are presented in a form that can be empirically tested. Various econometric
models can be constructed from a given economic model. These models differ due to different
choices of functional form, the description of the stochastic structure of the variables, and so
on.

2. Model estimation and testing:

The models are estimated using the observed set of data, and their adequacy is evaluated. This
is the statistical inference component of modelling. Various estimation techniques are employed
to determine numerical values for the unknown parameters of the model. Based on various
statistical criteria, a good and adequate model is chosen from among the candidate models.

3. Model usage

The estimated models are used for forecasting and policy formulation, which is a crucial
component of any policy decision. These forecasts help decision-makers to judge the accuracy
of the fitted model and to take appropriate actions to alter the relevant economic variables.

Statistics and econometrics


Econometrics is distinct from both economic statistics and mathematical statistics. Economic
statistics gathers, records, tabulates and uses empirical data to describe the pattern of its
evolution over time. Economic statistics is a descriptive branch of economics; it does not offer
explanations of the evolution of the various variables, nor does it measure the parameters of
economic relationships. Statistical methods, on the other hand, are measurement techniques
developed on the basis of controlled experiments. Such methods may not be appropriate for
economic phenomena, because these do not fit into a framework of controlled experimentation:
in the real economy the relevant variables change continually and simultaneously, so controlled
experiments cannot be set up.

Econometrics uses statistical methods after adapting them to the problems of economic life.
These adapted statistical methods are frequently referred to as econometric methods. They are
adjusted so that they become suitable for the measurement of stochastic relationships. These
adjustments essentially attempt to specify the stochastic elements that operate in the real world
and enter into the determination of the observed data, so that the data can be treated as a
random sample, which is required for the application of statistical tools. The development of
appropriate methods for the measurement of economic relationships is one of the tasks of
theoretical econometrics.

Economic relationships are not suitable for laboratory-based controlled experiments;
econometric tools are therefore typically developed for the analysis of non-experimental data.

The term "applied econometrics" refers to the application of econometric methods to specific
branches of economic theory and to problems of supply, demand, production, investment,
consumption, etc. Applied econometrics uses the tools of theoretical econometrics to study
economic data and to forecast economic behaviour and phenomena.

Econometrics and regression analysis:

One of the very important roles of econometrics is to provide tools for modelling on the basis
of given data. Regression modelling helps greatly in this task. Regression models can be either
linear or non-linear, giving rise to linear regression analysis and non-linear regression analysis
respectively. We will consider only the tools of linear regression analysis, and our main interest
will be fitting the linear regression model to a given set of data.
Econometrics is the application of statistical methods to economic data in order to give
empirical content to economic relationships. More precisely, it is "the quantitative analysis of
actual economic phenomena based on the concurrent development of theory and observation,
related by appropriate methods of inference". An introductory economics textbook describes
econometrics as allowing economists "to sift through mountains of data to extract simple
relationships". Jan Tinbergen is one of the two founding fathers of econometrics. The other,
Ragnar Frisch, also coined the term in the sense in which it is used today.

A basic tool for econometrics is the multiple linear regression model. Econometric theory
uses statistical theory and mathematical statistics to evaluate and develop econometric
methods. Econometricians try to find estimators that have desirable statistical properties
including unbiasedness, efficiency, and consistency. Applied econometrics uses theoretical
econometrics and real-world data for assessing economic theories, developing econometric
models, analysing economic history, and forecasting.

METHOD-

Econometrics may use standard statistical models to study economic questions, but most often
these are applied to observational data rather than data from controlled experiments. In this
respect, the design of observational studies in econometrics is similar to the design of studies
in other observational disciplines, such as astronomy, epidemiology, sociology and political science.
Analysis of data from an observational study is guided by the study protocol, although
exploratory data analysis may be useful for generating new hypotheses. Economics often
analyses systems of equations and inequalities, such as supply and demand hypothesized to
be in equilibrium. Consequently, the field of econometrics has developed methods for
identification and estimation of simultaneous equations models. These methods are analogous
to methods used in other areas of science, such as the field of system identification in systems
analysis and control theory. Such methods may allow researchers to estimate models and
investigate their empirical consequences, without directly manipulating the system.

One of the fundamental statistical methods used by econometricians is regression analysis.


Regression methods are important in econometrics because economists typically cannot use
controlled experiments. Econometricians often seek illuminating natural experiments in the
absence of evidence from controlled experiments. Observational data may be subject to
omitted-variable bias and a list of other problems that must be addressed using causal
analysis of simultaneous-equation models.

In addition to natural experiments, quasi-experimental methods have been used increasingly
commonly by econometricians since the 1980s, in order to credibly identify causal effects.

1.2 TESTING OF HYPOTHESIS - NORMAL DISTRIBUTION,


T-DISTRIBUTION, CHI SQUARE DISTRIBUTION, F-DISTRIBUTION

A hypothesis is an assumption or quantitative statement about a population parameter which may
be true or false. In order to make a proper decision about such a quantitative statement, the
technique of testing of hypothesis is used. The procedure of testing the reliability or validity
of such a hypothesis by using a sample statistic is called testing of hypothesis, statistical
hypothesis testing, or a test of significance.

For example, suppose a manufacturer of light bulbs claims that the mean life of its product is
3000 hours. Using a sample of, say, 500 bulbs selected at random, a consumer or decision maker
can test the manufacturer's claim by calculating the sample mean life, i.e., how many hours the
bulbs last on average. If the selected sample gives an average life of 2900 or 3100 hours, the
decision maker will accept the manufacturer's claim; if it gives an average life of 1800 hours,
the decision maker will reject the claim. From this example it is clear that acceptance or
rejection depends upon the gap between the sample statistic and the population parameter.

Types of Hypotheses

Basically, there are two types of hypotheses

1) Null hypothesis 2) Alternative hypothesis

1) Null Hypothesis-

A null hypothesis is a type of statistical hypothesis that proposes that no statistical
significance exists in a set of given observations. Hypothesis testing is used to assess the
credibility of a hypothesis by using sample data. Sometimes referred to simply as the "null,"
it is represented as H0.

The null hypothesis, also known as the conjecture, is used in quantitative analysis to test
theories about markets, investing strategies, or economies to decide if an idea is true or false.

A null hypothesis is a type of conjecture in statistics that proposes that there is no difference
between certain characteristics of a population or data-generating process. For example, a
gambler may be interested in whether a game of chance is fair. If it is fair, then the expected
earnings per play come to zero for both players. If the game is not fair, then the expected
earnings are positive for one player and negative for the other. To test whether the game is
fair, the gambler collects earnings data from many repetitions of the game, calculates the
average earnings from these data, then tests the null hypothesis that the expected earnings are
not different from zero.

If the average earnings from the sample data are sufficiently far from zero, then the gambler
will reject the null hypothesis and conclude the alternative hypothesis—namely, that the
expected earnings per play are different from zero. If the average earnings from the sample
data are near zero, then the gambler will not reject the null hypothesis, concluding instead
that the difference between the average from the data and zero is explainable by chance
alone.

The null hypothesis assumes that any kind of difference between the chosen characteristics
that you see in a set of data is due to chance. For example, if the expected earnings for the
gambling game are truly equal to zero, then any difference between the average earnings in
the data and zero is due to chance.

Analysts look to reject the null hypothesis because doing so is a strong conclusion. This
requires strong evidence in the form of an observed difference that is too large to be
explained solely by chance. Failing to reject the null hypothesis—that the results are
explainable by chance alone—is a weak conclusion because it allows that factors other than
chance may be at work but may not be strong enough for the statistical test to detect them.

It is a hypothesis of no difference, which means there is no significant difference between the
sample statistic and the population parameter. In other words, if the difference between the true
and the expected value is set to zero, then the hypothesis is called the null hypothesis. It is
denoted by Ho.

For example, if the population mean (μ) has a specified value μo, then we set up the null
hypothesis as

Ho: μ = μo (μ – true value, μo – expected value)

If the manufacturer claims that the average life of a bulb is 3000 hours, then the null hypothesis
is set up as Ho: μ = 3000 hours.

Examples of a Null Hypothesis

Here is a simple example: A school principal claims that students in her school score an
average of seven out of 10 in exams. The null hypothesis is that the population mean is 7.0.
To test this null hypothesis, we record marks of, say, 30 students (sample) from the entire
student population of the school (say 300) and calculate the mean of that sample.

We can then compare the (calculated) sample mean to the (hypothesized) population mean of
7.0 and attempt to reject the null hypothesis. (The null hypothesis here—that the population
mean is 7.0—cannot be proved using the sample data. It can only be rejected.)

Take another example: The annual return of a particular mutual fund is claimed to be 8%.
Assume that a mutual fund has been in existence for 20 years. The null hypothesis is that the
mean return is 8% for the mutual fund. We take a random sample of annual returns of the
mutual fund for, say, five years (sample) and calculate the sample mean. We then compare
the (calculated) sample mean to the (claimed) population mean (8%) to test the null
hypothesis.

For the above examples, null hypotheses are:

Example A: Students in the school score an average of seven out of 10 in exams

Example B: Mean annual return of the mutual fund is 8% per year.

For the purposes of determining whether to reject the null hypothesis, the null hypothesis
(abbreviated H0) is assumed, for the sake of argument, to be true. Then the likely range of
possible values of the calculated statistic (e.g., the average score on 30 students’ tests) is
determined under this presumption (e.g., the range of plausible averages might range from
6.2 to 7.8 if the population mean is 7.0). Then, if the sample average is outside of this range,
the null hypothesis is rejected. Otherwise, the difference is said to be “explainable by chance
alone,” being within the range that is determined by chance alone.
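The idea of a range of plausible sample averages under the null hypothesis can be made concrete
with a few lines of code. The sketch below is written in Python with SciPy purely for
illustration (the text does not prescribe any software), and the population standard deviation of
1.2 is an assumed value; the 6.2 to 7.8 range quoted above is itself only illustrative.

# Sketch: plausible range for the sample mean when H0 (mu = 7.0) is true.
# The population standard deviation (1.2) is assumed for illustration only.
import math
from scipy import stats

mu0 = 7.0        # hypothesized population mean
sigma = 1.2      # assumed population standard deviation (hypothetical)
n = 30           # sample size from the example
alpha = 0.05     # significance level, two-tailed

z_crit = stats.norm.ppf(1 - alpha / 2)      # about 1.96
margin = z_crit * sigma / math.sqrt(n)
print(f"Do not reject H0 if the sample mean lies in "
      f"[{mu0 - margin:.2f}, {mu0 + margin:.2f}]")

If the sample mean falls outside the printed interval, the null hypothesis is rejected; otherwise
the difference is attributed to chance alone, exactly as described above.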
2) Alternative Hypothesis-

An important point to note is that we are testing the null hypothesis because there is an
element of doubt about its validity. Whatever evidence there is against the stated null
hypothesis is captured in the alternative (alternate) hypothesis.

For the above examples, the alternative hypothesis would be:

Students score an average that is not equal to seven.

The mean annual return of the mutual fund is not equal to 8% per year.

In other words, the alternative hypothesis is a direct contradiction of the null hypothesis.

If the decision maker rejects the null hypothesis on the basis of sample information, he/she
should accept another hypothesis which is complementary to the null hypothesis and is known as
the alternative hypothesis. In other words, if the difference between the true and expected values
is not equal to zero, then the hypothesis is called the alternative hypothesis. It is denoted by H1.

If the null hypothesis is set up as Ho: μ = μo, then the alternative hypothesis may be any one of
the following:

i) H1: μ ≠ μo, i.e., there is a significant difference between the sample statistic and the
population parameter.
ii) H1: μ > μo, i.e., the population mean is greater than μo
iii) H1: μ < μo, i.e., the population mean is less than μo

Types of Error in Testing of Hypothesis

When the test procedure is applied to test the Ho against H1, we may find two types of error.

i) we may reject Ho when Ho is true


ii) we may accept Ho when Ho is false

This can be shown in the following table:

Decision \ Situation      Ho True           Ho False

Accept Ho                 No Error          Type II Error

Reject Ho                 Type I Error      No Error

Type I Error

The error committed in rejecting Ho when Ho is true is called type I error. The probability of
committing type I error is called the size of the test or the size of the critical region and is
denoted by α.

α = Prob. {type I error} = Prob. {Reject Ho / Ho is true}

Type II Error

The error committed in accepting Ho when Ho is false is called type II error. The probability
of committing type II error is denoted by β.

β = Prob. {type II error} = Prob. {Accept Ho / Ho is false}

[Suppose we are going to buy 1000 apples. Out of these 1000, 50 apples are sour and the
remaining are sweet. In order to decide whether to buy, 20 apples are selected as a sample and
tasted. By chance, the selected 20 apples are all sour, so we drop the idea of buying the apples.
Here not buying the apples is the wrong decision. Such an erroneous decision is a type I error,
as we are rejecting a true hypothesis, i.e., rejecting sweet apples.

Similarly, suppose only 50 apples out of 1000 are sweet and the remaining are sour. Again 20
apples are selected as a sample and tasted. By chance, the selected 20 apples are all sweet, so
we decide to buy the apples. Here buying the apples is the wrong decision. Such an erroneous
decision is a type II error, as we are accepting a false hypothesis, i.e., accepting sour apples.

If the hypothesis is true and we reject it, no great harm has been done because we can wait for
the next lot. This type of error simply leads to an opportunity loss. But if the hypothesis is
false and we accept it, the result may be very harmful. A type II error is a very undesirable
result and has a direct impact on the decision maker. Type II error is more harmful than type I
error.]
Level of Significance

The maximum probability of committing a type I error is called the level of significance. In
other words, the probability of rejecting a true null hypothesis is called the level of
significance. It is denoted by α.

α = Prob. {type I error} = Prob. {Rejecting Ho when Ho is true}

Critical Region or Region of Rejection

A region in the sample space S which amounts to rejection of Ho is called the critical region or
region of rejection.

Let us define a sample space S and divide the whole sample space S into two disjoint subsets,
namely W and S-W, such that

W ∪ (S-W) = S

W ∩ (S-W) = ∅ (null set)

If the sample points fall into region W, then we reject Ho, and this region is known as the
critical region or region of rejection. Otherwise, if the sample points fall into region S-W,
called the acceptance region, then we accept Ho.

If the sampling distribution follows the normal distribution, the critical region or region of
rejection is the area under the standard normal curve corresponding to a pre-assigned level of
significance (α). The critical region therefore includes all the sets of possible sample values
which lead to the rejection of Ho when it is true. The region under the normal curve other than
the area of size α is called the acceptance region.

One Tailed and Two Tailed Test

One Tailed Test

Any test where the critical region consists of only one tail of the sampling distribution of the
test statistic is called a one-tailed test or one-sided test. In other words, a test of hypothesis
which is based on a critical region represented by only one tail under the normal curve is called
a one-tailed test. In a one-tailed test there is one rejection region, and the test may be either
right-tailed or left-tailed. If the critical region lies entirely on the right tail of the normal
probability curve, it is called a right-tailed test. If the critical region lies entirely on the
left tail of the normal probability curve, it is called a left-tailed test.

In a one-tailed test, the hypothesis is set up as follows:

Ho: μ = μo against

i) H1: μ > μo (right-tailed test)

ii) H1: μ < μo (left-tailed test)

Two Tailed Test

Any test where the critical region consists of both tails of the sampling distribution of the
test statistic is called a two-tailed test or two-sided test. In other words, a test of hypothesis
which is based on a critical region represented by both tails under the normal curve is called a
two-tailed test. In a two-tailed test, the null hypothesis is rejected if the sample value is
significantly higher or lower than the expected value of the population parameter.

In a two-tailed test, the hypothesis is set up as follows.

Null hypothesis Ho: μ = μo against

alternative hypothesis H1: μ ≠ μo (μ > μo or μ < μo)

Critical Values or Significant Values

The values of the test statistic which separate the critical region from the acceptance region
are called critical values or significant values. The critical values depend upon the level of
significance (α) and the alternative hypothesis, which leads to one-tailed or two-tailed tests.

For example, in the case of a normally distributed sampling distribution, the critical value
which divides the total distribution into acceptance and rejection regions for a given level of
significance α is given by Zα. It is important to note that the critical value of Z for a
one-tailed test at level of significance α is the same as the critical value of Z for a two-tailed
test at level of significance 2α. The critical values of Z for two-tailed and one-tailed tests at
the α = 5% level of significance are shown in the following table.


Critical values (Zα) of Z

                       Level of significance (α)
                       1%             5%              10%
Two-tailed test        Zα = 2.58      Zα = 1.96       Zα = 1.645
Right-tailed test      Zα = 2.33      Zα = 1.645      Zα = 1.28
Left-tailed test       Zα = -2.33     Zα = -1.645     Zα = -1.28

Procedure of Testing a Hypothesis

The following steps should be considered while testing a hypothesis

1. Formulating the hypothesis:

First of all, set up the null hypothesis against the alternative hypothesis:

Ho: μ = μo against

i) H1: μ > μo (right-tailed test)

ii) H1: μ < μo (left-tailed test)

iii) H1: μ ≠ μo (two-tailed test)

2. Compute the Test Statistic:

After formulating the hypothesis, the next step is to compute the appropriate test statistic. For
testing whether the null hypothesis should be accepted or rejected, we use the Z-test for large
samples (n ≥ 30) and the t-test for small samples (n < 30).

3. Choose the level of Significance:

Determine the level of significance α at which the hypothesis is to be tested. Generally, the 5%
level of significance is used.
4. Find Critical Value:

Obtain the critical value or significant value at the α level of significance (the tabulated
value), according to whether the alternative hypothesis leads to a one-tailed (left or right) or
a two-tailed test.

5. Draw Conclusion:

Draw a conclusion by comparing the calculated value and the tabulated value of the test statistic
as follows:

If calculated value is less than or equal to the tabulated value at particular level of
significance, then the null hypothesis is accepted which means that there is no significant
difference between the sample statistic and the population parameter.

If calculated value is greater than the tabulated value at particular level of significance, then
the null hypothesis is rejected which means that there is significant difference between the
sample statistic and the population parameter.
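As a hedged illustration of the five-step procedure, the Python sketch below runs a large-sample
Z-test on the light-bulb claim from the introduction. The sample figures (500 bulbs, a sample
mean of 2,900 hours, a standard deviation of 350 hours) are hypothetical, and the use of
Python/SciPy is only one possible way to carry out the calculation.

# Sketch of the five-step testing procedure using a large-sample Z-test.
import math
from scipy import stats

# Step 1: formulate hypotheses -> Ho: mu = 3000 against H1: mu < 3000 (left-tailed)
mu0 = 3000
n, x_bar, s = 500, 2900, 350            # assumed sample size, mean and std. deviation

# Step 2: compute the test statistic
z_calc = (x_bar - mu0) / (s / math.sqrt(n))

# Step 3: choose the level of significance
alpha = 0.05

# Step 4: find the critical (tabulated) value for a left-tailed test
z_crit = stats.norm.ppf(alpha)          # about -1.645

# Step 5: draw the conclusion by comparing calculated and tabulated values
if z_calc < z_crit:
    print(f"z = {z_calc:.2f} < {z_crit:.2f}: reject Ho")
else:
    print(f"z = {z_calc:.2f} >= {z_crit:.2f}: do not reject Ho")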

Four Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:

• The first step is for the analyst to state the two hypotheses so that only one can be
right.

• The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.

• The third step is to carry out the plan and physically analyze the sample data.

• The fourth and final step is to analyze the results and either accept or reject the
null hypothesis.

A) Normal Distribution

A hypothesis test formally tests if the population the sample represents is normally-
distributed. The null hypothesis states that the population is normally distributed, against
the alternative hypothesis that it is not normally-distributed. If the test p-value is less than
the predefined significance level, you can reject the null hypothesis and conclude the data
are not from a population with a normal distribution. If the p-value is greater than the
predefined significance level, you cannot reject the null hypothesis.

Note that small deviations from normality can produce a statistically significant p-value
when the sample size is large, and conversely it can be impossible to detect non-normality
with a small sample. You should always examine the normal plot and use your judgment,
rather than rely solely on the hypothesis test. Many statistical tests and estimators are
robust against moderate departures in normality due to the central limit theorem.

The second building block of statistical significance is the normal distribution, also called
the Gaussian or bell curve. The normal distribution is used to represent how data from a
process is distributed and is defined by the mean, given the Greek letter μ (mu), and the
standard deviation, given the letter σ (sigma). The mean shows the location of the center of
the data and the standard deviation is the spread in the data.

FIGURE 1.1: The normal (Gaussian) distribution

The application of the normal distribution comes from assessing data points in terms of the
standard deviation. We can determine how anomalous a data point is based on how many
standard deviations it is from the mean. The normal distribution has the following helpful
properties, which the short sketch after this list verifies numerically:

• 68% of the data lie within ±1 standard deviation of the mean

• 95% of the data lie within ±2 standard deviations of the mean

• 99.7% of the data lie within ±3 standard deviations of the mean

B) T – Distribution

When you perform a t-test for a single study, you obtain a single t-value. However, if we
drew multiple random samples of the same size from the same population and performed
the same t-test, we would obtain many t-values and we could plot a distribution of all of
them. This type of distribution is known as a sampling distribution.

Fortunately, the properties of t-distributions are well understood in statistics, so we can
plot them without having to collect many samples! A specific t-distribution is defined by its
degrees of freedom (DF), a value closely related to sample size. Therefore, different
t-distributions exist for every sample size. You can graph t-distributions using Minitab's
probability distribution plots.

T-distributions assume that you draw repeated random samples from a population where
the null hypothesis is true. You place the t-value from your study in the t-distribution to
determine how consistent your results are with the null hypothesis.

The t-test is any statistical hypothesis test in which the test statistic follows a Student's
t-distribution under the null hypothesis.

A t-test is most commonly applied when the test statistic would follow a normal distribution
if the value of a scaling term in the test statistic were known. When the scaling term is
unknown and is replaced by an estimate based on the data, the test statistic (under certain
conditions) follows a Student's t-distribution. The t-test can be used, for example, to
determine if the means of two sets of data are significantly different from each other.

Among the most frequently used t-tests are:

• A one-sample location test of whether the mean of a population has a value specified in
a null hypothesis.

• A two-sample location test of the null hypothesis that the means of two populations are
equal. All such tests are usually called Student's t-tests, though strictly speaking that name
should only be used if the variances of the two populations are also assumed to be equal; the
form of the test used when this assumption is dropped is sometimes called Welch's t-test.
These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are
typically applied when the statistical units underlying the two samples being compared are
non-overlapping.

T-tests are called t-tests because the test results are all based on t-values. T-values are an
example of what statisticians call test statistics. A test statistic is a standardized value that
is calculated from sample data during a hypothesis test. The procedure that calculates the
test statistic compares your data to what is expected under the null hypothesis.

Each type of t-test uses a specific procedure to boil all of your sample data down to one
value, the t-value. The calculations behind t-values compare your sample mean(s) to the
null hypothesis and incorporate both the sample size and the variability in the data. A
t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the
difference between the sample data and the null hypothesis increases, the absolute value of
the t-value increases.

Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What
does that even mean? I might as well have told you that our data equal 2 fizbins! We don't
know if that's common or rare when the null hypothesis is true.

By itself, a t-value of 2 doesn't really tell us anything. T-values are not in the units of
the original data, or anything else we'd be familiar with. We need a larger context in
which we can place individual t-values before we can interpret them. This is where
t-distributions come in.

FIGURE 1.2: The t-distribution with 20 degrees of freedom

The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds
to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that
is similar to the normal distribution, but with thicker tails. This graph plots the probability
density function (PDF), which describes the likelihood of each t-value.

The peak of the graph is right at zero, which indicates that obtaining a sample value close
to the null hypothesis is the most likely. That makes sense because t-distributions assume
that the null hypothesis is true. T-values become less likely as you get further away from
zero in either direction. In other words, when the null hypothesis is true, you are less likely
to obtain a sample that is very different from the null hypothesis.

Our t-value of 2 indicates a positive difference between our sample data and the null
hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value
from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we
don't know exactly how unusual. Our ultimate goal is to determine whether our t-value is
unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate
the probability, as in the sketch below.
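That probability can be computed directly from the t-distribution with 20 degrees of freedom. The
Python/SciPy sketch below is only an illustration of the calculation described in the text.

# How unusual is a t-value of 2 when df = 20 and the null hypothesis is true?
from scipy import stats

df = 20
t_value = 2.0
p_two_sided = 2 * stats.t.sf(t_value, df)   # probability beyond |t| = 2 in both tails
print(f"P(|T| > 2) with df = {df}: {p_two_sided:.4f}")   # roughly 0.06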

C) Chi-square distribution

The phrase "a chi-squared test", also written as χ2 test, could be used as the description of
any statistical hypothesis test where the sampling distribution of the test statistic is, under
some circumstances, approximately (or is simply hoped to be approximately) a chi-square
distribution when the null hypothesis is true. Usually, however, the phrase is used as
shorthand for Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is
used to determine whether there is a statistically significant difference (i.e., a difference
which clearly is not just due to chance fluctuations) between the expected frequencies and the
observed frequencies in one or more categories of a so-called contingency table.

In the standard applications of this test, the observations are classified into mutually
exclusive classes, and there is some theory, which is called "the null hypothesis", which
gives the probability that any observation falls into the corresponding class. The purpose of
the test is to evaluate how likely the observations that are made would be, assuming the
null hypothesis is true.

Approximately chi-squared distributed tests often arise when looking at a sum of squared
errors, or through the sample variance. Test statistics that follow a chi-squared distribution
arise from an assumption of independent normally distributed observations, which might
be approximately valid in some cases due to the central limit theorem. There exist chi-
squared tests for testing the null hypothesis of independence of a pair of random variables
based on observations of the pairs.

As mentioned already, the phrase "chi-squared test" is often used to describe tests for
which the chi-square distribution only holds asymptotically, meaning that the sampling
distribution (if the null hypothesis is true) approximates a chi-squared distribution more
and more closely as the sample size gets larger.

When the underlying distribution is normal, the chi-squared distribution is most often
employed for hypothesis testing and, to a lesser extent, for estimating confidence intervals
for population variance. The chi-squared distribution is not as commonly used as the
normal distribution or the exponential distribution for directly modelling natural
occurrences. It arises in, among others, the following hypothesis tests:

Test for independence of groups using the chi-squared statistic in a table of data

Chi-squared test for comparing two hypothetical distributions with actual data.

Test for nested models using the likelihood ratio

The log-rank test for use in survival analysis

Cochran-Mantel-Haenszel test for categorized frequency tables

The chi-squared distribution is also used in t-tests, ANOVA, and regression analysis, and it is
part of the definition of the F-distribution.

Because of its close relationship to the normal distribution, the chi-squared distribution is
frequently employed in testing hypotheses. The t-statistic used in the t-test, for example, is
just one example of a test statistic. The sampling distribution of such test statistics
approaches the normal distribution as n grows larger (central limit theorem). Because the test
statistic (such as t) is asymptotically normally distributed, the normal distribution can be used
as an approximation for hypothesis testing provided the sample size is big enough. Statistical
hypothesis testing in the context of a normal distribution is straightforward and well
understood. The simplest form of the chi-squared distribution is the square of a standard normal
distribution, so the chi-squared distribution may be employed for hypothesis testing in every
situation where the normal distribution would be appropriate.

Suppose that Z is a random variable drawn from a normal distribution with a mean of 0 and a
variance of 1, i.e., Z ~ N(0, 1). Now consider the random variable Q = Z². The distribution of Q
is an example of a chi-squared distribution: Q ~ χ²(1). Chi-squared distributions can be built
from several normal distributions; the subscript 1 denotes that this particular distribution is
built from a single normal distribution. One degree of freedom is associated with a chi-squared
distribution that is the result of squaring a single standard normal variable. As the sample size
for a hypothesis test grows, the distribution of the test statistic approaches a normal
distribution, and the same logic carries over to the chi-squared distribution: extreme values
have low probability and result in small p-values, just as they do for the normal distribution.

The chi-squared distribution is also commonly employed as the large-sample distribution of
generalized likelihood ratio tests (LRTs).

The Neyman-Pearson lemma shows that simple LRTs offer the greatest power to reject the null
hypothesis, and the optimality properties of generalized LRTs rest on this fact. The normal and
chi-squared approximations, however, hold true only in the asymptotic limit. As a result, when
working with a small sample size, the t-distribution is preferred to the normal approximation
and exact methods are preferred to the chi-squared approximation. Fisher's exact test is superior
to the chi-squared approximation in analyses of contingency tables when the sample size is
limited, and Ramsey demonstrates that the exact binomial test is superior to the normal
approximation in all cases.

The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is
the distribution of the sum of the squares of k independent standard normal random variables in
probability theory and statistics. One of the most widely used probability distributions in
inferential statistics, the chi-squared distribution is a special case of the gamma distribution
used mostly in hypothesis testing and confidence interval construction. This form of the
distribution is also called the central chi-squared distribution, a special case of the broader
noncentral chi-squared distribution.

Chi-squared tests are frequently used to determine whether or not an observed distribution
is a good fit for a theoretical distribution, whether or not two criteria for classifying
qualitative data are independent, and to estimate confidence intervals for the population
standard deviation of a normal distribution given a sample standard deviation. The analysis
of variance by ranks developed by Friedman is only one example of the many statistical
tests that make use of this distribution.

Chi-squared test for variance in a normal population

To ascertain whether the variance of a variable obtained from a particular sample has the same
magnitude as the known variance of the same variable in the population, a statistical test with a
chi-square-distributed test statistic is employed: the chi-square test for variance. The variance
of a population can only be calculated exactly by examining the entire population; in many cases,
however, a statistically representative sample is enough to estimate the population standard
deviation accurately.

When the test statistic is chi-squared distributed under the null hypothesis, a statistical
hypothesis test called a chi-squared test (also known as a chi-square or χ² test) can be used.
This includes Pearson's chi-squared test and its variations. If there is a statistically
significant discrepancy between the predicted frequencies and the observed frequencies in
one or more categories of a contingency table, it may be determined using Pearson's chi-
squared test.

The observations are categorized into groups that are mutually exclusive in the conventional
uses of this test. The test statistic computed from the data follows a χ² frequency
distribution if the null hypothesis, that there are no differences between the classes in the
population, is true. The goal of the test is to determine how likely it would be for the
observed frequencies to occur under the null hypothesis.

Test statistics with a χ² distribution are produced when the observations are independent.
Additionally, there are χ² tests for examining the independence of a pair of random variables
based on observations of the pairs.

Chi-squared tests are tests for which the distribution of the test statistic asymptotically
approaches the χ² distribution, i.e., the sampling distribution of the test statistic (if the
null hypothesis is true) increasingly resembles a chi-squared distribution as sample sizes rise.

In cryptanalysis, the chi-squared test is used to compare the distribution of plaintext with that
of (potentially) decrypted ciphertext. The lowest value of the test statistic indicates that
there is a strong possibility that the decryption was successful. This technique may be used more
generally to attack modern cryptographic problems.

The chi-squared test is employed in bioinformatics to assess the distribution of certain genes'
features across several categories, including genomic content, mutation rate, interaction
network clustering, etc. (e.g., disease genes, essential genes, genes on a certain chromosome
etc.).

If a sample of size n is taken from a population having a normal distribution, then there is a
result (see the distribution of the sample variance) which allows a test to be made of whether
the variance of the population has a pre-determined value. For example, a manufacturing process
might have been in a stable condition for a long period, allowing a value for the variance to be
determined essentially without error. Suppose that a variant of the process is being tested,
giving rise to a small sample of n product items whose variation is to be tested. The test
statistic T in this instance could be set to be the sum of squares about the sample mean,
divided by the nominal value of the variance (i.e., the value being tested as holding):

T = Σ (xi − x̄)² / σo²

Then T has a chi-squared distribution with n − 1 degrees of freedom. For example, if the sample
size is 21, the acceptance region for T with a significance level of 5% is between 9.59 and
34.17.
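The acceptance region quoted above (9.59 to 34.17 for n = 21 at the 5% level) can be reproduced
with a few lines of code. In the Python/SciPy sketch below the sample data and the nominal
variance are simulated purely for illustration; only the critical values come from the
chi-squared distribution itself.

# Sketch of the chi-square test for a population variance (n = 21, alpha = 0.05).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma0_sq = 4.0                                  # nominal (hypothesized) variance
sample = rng.normal(loc=10, scale=2, size=21)    # simulated sample data

n = len(sample)
T = np.sum((sample - sample.mean()) ** 2) / sigma0_sq   # sum of squares / nominal variance

lower = stats.chi2.ppf(0.025, df=n - 1)   # about 9.59
upper = stats.chi2.ppf(0.975, df=n - 1)   # about 34.17
print(f"T = {T:.2f}, acceptance region = [{lower:.2f}, {upper:.2f}]")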

How to calculate the chi-square statistic

First, we have to calculate the expected value for each cell of the table of the two nominal
variables. We can calculate the expected value by using this formula:

E(i, k) = (Ci × Rk) / N

where

E(i, k) = expected value for the cell in the ith column and kth row
Ci = sum of the ith column
Rk = sum of the kth row
N = total number of observations

After calculating the expected values, we apply the following formula to calculate the value of
the chi-square test of independence:

χ² = Σ (O − E)² / E

where

χ² = chi-square statistic for the test of independence
O = observed value of the two nominal variables
E = expected value of the two nominal variables

The degrees of freedom are calculated by using the following formula:

DF = (r − 1)(c − 1)

where

DF = degrees of freedom
r = number of rows
c = number of columns
Hypothesis:

• Null hypothesis: assumes that there is no association between the two variables.

• Alternative hypothesis: assumes that there is an association between the two variables.

Hypothesis testing: hypothesis testing for the chi-square test of independence proceeds as it
does for other tests like ANOVA: a test statistic is computed and compared to a critical value.
The critical value for the chi-square statistic is determined by the level of significance
(typically 0.05) and the degrees of freedom. The degrees of freedom for the chi-square are
calculated using the formula DF = (r − 1)(c − 1), where r is the number of rows and c is the
number of columns. If the observed chi-square test statistic is greater than the critical value,
the null hypothesis can be rejected, as illustrated in the sketch below.
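The whole procedure (expected values, the chi-square statistic, the degrees of freedom and the
decision) can be carried out in a few lines. The Python/SciPy sketch below uses a made-up 2 x 3
contingency table; the counts and the choice of library are assumptions made only for
illustration.

# Sketch of a chi-square test of independence on a hypothetical contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20, 10],      # observed counts (made up for illustration)
                     [20, 25, 15]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2_stat:.3f}, df = {dof}, p-value = {p_value:.4f}")
print("Expected frequencies:")
print(np.round(expected, 2))
# Reject the null hypothesis of independence if p_value is below the chosen level (e.g. 0.05).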

D) F-distribution

An F-test is any statistical test in which the test statistic has an F-distribution under the null
hypothesis. It is most often used when comparing statistical models that have been fitted to
a data set, in order to identify the model that best fits the population from which the data
were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data
using least squares. The name was coined by George W. Snedecor, in honor of Sir Ronald
A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.

An "F-test" is a catch-all term for any test that uses the F-distribution. In most cases,
when people talk about the F-test, what they are actually talking about is the F-test to
compare two variances. However, the F-statistic is used in a variety of tests including
regression analysis, the Chow test and the Scheffé test (a post-hoc ANOVA test).

General Steps for an F Test

If you're running an F-test, you should use Excel, SPSS, Minitab or some other kind of
statistical software to run the test. Why? Calculating the F-test by hand, including the
variances, is tedious and time-consuming, and you'll probably make some errors along the way.

If you're running an F-test using software (for example, an F-test two-sample for variances in
Excel), the only steps you really need to do are steps 1 and 4 (dealing with the null
hypothesis). The software will calculate steps 2 and 3 for you.

• State the null hypothesis and the alternate hypothesis.

• Calculate the F-value. When comparing a restricted and an unrestricted regression model, the
F-value is calculated using the formula F = [(SSE1 − SSE2) / m] / [SSE2 / (n − k)], where SSE =
residual sum of squares, m = number of restrictions and k = number of independent variables.

• Find the F-statistic (the critical value for this test). In ANOVA, the F-statistic formula is:

• F-statistic = variance of the group means / mean of the within-group variances.

• You can find the critical F-value in the F-table.

• Support or reject the null hypothesis.


Assumptions

• The larger variance should always go in the numerator (the top number) to force the
test into a right-tailed test. Right-tailed tests are easier to calculate.

• For two-tailed tests, divide alpha by 2 before finding the right critical value.

• If you are given standard deviations, they must be squared to get the variances.

• If your degrees of freedom aren't listed in the F-table, use the larger critical value.
This helps to avoid the possibility of Type I errors (see the sketch after this list).
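The sketch below illustrates an F-test for comparing two variances, following the steps and
assumptions listed above. It is written in Python with NumPy/SciPy, and the two samples are
simulated; none of this is prescribed by the text.

# Sketch of an F-test to compare two variances (simulated data, two-tailed at alpha = 0.05).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample1 = rng.normal(0, 3, size=25)
sample2 = rng.normal(0, 2, size=30)

var1 = np.var(sample1, ddof=1)
var2 = np.var(sample2, ddof=1)

# Put the larger variance in the numerator, as recommended above.
if var1 >= var2:
    F, df_num, df_den = var1 / var2, len(sample1) - 1, len(sample2) - 1
else:
    F, df_num, df_den = var2 / var1, len(sample2) - 1, len(sample1) - 1

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha / 2, df_num, df_den)   # alpha is halved for a two-tailed test
print(f"F = {F:.3f}, critical value = {f_crit:.3f}")
print("Reject H0 (equal variances)" if F > f_crit else "Do not reject H0")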

1.3 NATURE, MEANING AND SCOPE OF ECONOMETRICS

By applying statistical techniques to economic data, econometrics aims to give economic
relationships an empirical foundation. More specifically, it is "the quantitative study of real
economic events based on the parallel development of theory and observation, coupled by
suitable inference procedures." Econometrics allows economists to "sift through masses of
data to identify basic associations," according to one introductory economics textbook. One
of the two founding fathers of econometrics is Jan Tinbergen. The term was first used by the
other, Ragnar Frisch, who also gave it its current meaning.

The multiple linear regression model is a fundamental tool of econometrics.

To assess and create econometric approaches, econometric theory makes use of statistical
theory and mathematical statistics. The unbiasedness, efficiency, and consistency are only a
few of the desired statistical features that econometricians look for in estimators. In order to
evaluate economic theories, create econometric models, examine economic history, and make
forecasts, applied econometrics combines theoretical econometrics with actual data.

An estimator is unbiased if its expected value is the true value of the parameter; it is
efficient if, for a given sample size, it has a smaller standard error than other unbiased
estimators; and it is consistent if it converges to the true value as the sample size increases.
Ordinary least squares (OLS) is frequently employed for estimation since it provides the BLUE or
"best linear unbiased estimator" (where "best" denotes the most efficient unbiased estimator)
under the Gauss-Markov assumptions. Other estimation procedures, such as maximum likelihood
estimation, the generalized method of moments, or generalized least squares, are used when these
assumptions are violated or when other statistical properties are sought. Those who prefer
Bayesian statistics over conventional, classical, or "frequentist" techniques advocate estimators
that take prior beliefs into account.
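As a hedged illustration of the OLS estimator discussed here, the short Python/NumPy sketch below
computes the coefficients from the formula beta_hat = (X'X)^(-1) X'y on simulated data. The
data-generating values (intercept 2.0, slope 1.5) and the use of NumPy are assumptions made only
for this example.

# Minimal sketch of ordinary least squares (OLS) on simulated data.
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)           # disturbance term
y = 2.0 + 1.5 * x + u                       # assumed true intercept 2.0 and slope 1.5

X = np.column_stack([np.ones(n), x])        # design matrix with a constant term
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) beta = X'y
print("OLS estimates (intercept, slope):", np.round(beta_hat, 3))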

Standard statistical models may be used in econometrics to analyse economic issues; however,
these models are most frequently applied to observational data rather than data from
controlled experiments. In this respect, observational studies in econometrics are designed similarly
controlled trials. In this respect, observational studies in econometrics are designed similarly
to those in astronomy, epidemiology, sociology, and political science, among other
observational fields. An observational study's data analysis is governed by its protocol, while
exploratory data analysis may be helpful for developing new ideas. Systems of equations and
inequalities, such as supply and demand that are assumed to be in equilibrium, are frequently
analyzed in economics. Consequently, techniques for identifying and estimating simultaneous
equations models have been developed in the discipline of econometrics. These techniques
are comparable to those employed in other branches of research, such as control theory and
the subject of system identification. These techniques could enable researchers to estimate
models and examine their empirical implications without actually changing the system.

Regression analysis is one of the core statistical techniques employed by econometricians.

Because economists frequently cannot conduct controlled experiments, regression techniques


are crucial in econometrics. In the absence of data from controlled experiments, econometricians
frequently look for illuminating natural experiments. When using simultaneous-equation models
for causal analysis, it is necessary to take into account the possibility of omitted-variable
bias and a number of other issues that might affect observational data.

Since the 1980s, econometricians have increasingly employed quasi-experimental approaches, in
addition to natural experiments, to credibly identify causal effects.

Structural econometrics: structural estimation


By employing economic models as the lens through which to examine the data, structural
econometrics expands the range of data analysis techniques available to researchers. The
advantage of this strategy is that any policy suggestions won't be open to the Lucas criticism,
provided that counterfactual studies take an agent's re-optimization into account. Starting
with an economic model that encapsulates the key characteristics of the actors under study,
structural econometric studies are conducted. The researcher then looks for model parameters
that match the model's outputs to the data.

Dynamic discrete choice is one instance, where there are two typical approaches. In the first,
the researcher must fully solve the model before using maximum likelihood. The second
method avoids the model's complete solution and estimates models in two steps, enabling the
researcher to take into account more complex models with strategic interactions and
numerous equilibria.

The estimation of first-price sealed-bid auctions with independent private values is yet
another instance of structural econometrics in action.

The main issue with the bidding data from these auctions is that the bids shade the underlying
values, which means that they only partially convey information about them. To determine
the size of the profits made by each bidder, one would wish to estimate these values. More
crucially, designing a mechanism requires having the value distribution in hand. A bidder's
anticipated return in a first-price sealed-bid auction is given by the bidder's surplus (the
value minus the bid) multiplied by the probability of winning with that bid.

In recent years, econometricians have relied more and more on experiments to assess the
sometimes-incongruent findings of observational research. Here, compared to just
observational investigations, controlled and randomized experiments offer statistical
conclusions that may produce higher empirical performance.
As the preceding definitions suggest, econometrics makes use of economic theory, mathematical
economics, economic statistics (i.e., economic data), and mathematical statistics. Yet it is a
subject that deserves to be studied in its own right, for the following reasons. Economic theory
makes statements or hypotheses that are mostly qualitative in nature. For example,
microeconomic theory states that, other things remaining the same (the famous ceteris paribus of
economics), an increase in the price of a commodity is expected to decrease the quantity
demanded of that commodity. Thus, economic theory postulates a negative or inverse relationship
between the price and quantity demanded of a commodity; this is the widely known law of
downward-sloping demand, or simply the law of demand. But the theory itself does not provide
any numerical measure of the strength of the relationship between the two; that is, it does not
tell by how much the quantity demanded will go up or down as a result of a certain change in the
price of the commodity. It is the econometrician's job to provide such numerical estimates.
Econometrics gives empirical (i.e., based on observation or experiment) content to most economic
theory. If we find in a study or experiment that when the price of a unit increases by a dollar
the quantity demanded goes down by, say, 100 units, we have not only confirmed the law of
demand, but in the process we have also provided a numerical estimate of the relationship
between the two variables, price and quantity.

The main concern of mathematical economics is to express economic theory in mathematical form or
equations (or models) without regard to measurability or empirical verification of the theory.
Econometrics, as noted earlier, is primarily interested in the empirical verification of economic
theory. As we will show shortly, the econometrician often uses mathematical models proposed by
the mathematical economist but puts these models in forms that lend themselves to empirical
testing.

Economic statistics is mainly concerned with collecting, processing, and presenting economic data
in the form of charts, diagrams, and tables. This is the economic statistician's job. He or she
collects data on GDP, employment, unemployment, prices, etc. These data constitute the raw data
for econometric work. But the economic statistician does not go any further, because he or she
is not primarily concerned with using the collected data to test economic theories.

Although mathematical statistics provides many of the tools employed in the trade, the
econometrician often needs special methods because of the unique nature of most economic data,
namely, that the data are not usually generated as the result of a controlled experiment. The
econometrician, like the meteorologist, generally depends on data that cannot be controlled
directly. Thus, data on consumption, income, investment, savings, prices, etc., which are
collected by public and private agencies, are nonexperimental in nature. The econometrician
takes these data as given. This creates special problems not normally dealt with in mathematical
statistics. Moreover, such data are likely to contain errors of measurement, of either omission
or commission, and the econometrician

may be called upon to develop special methods of analysis to deal with such errors of measurement. For students majoring in economics and business there is a pragmatic reason for studying econometrics. After graduation, in their employment, they may be called upon to forecast sales, interest rates, and money supply or to estimate demand and supply functions or price elasticities for products. Quite often, economists appear as expert witnesses before federal and state regulatory agencies on behalf of their clients or the public at large. Thus, an economist appearing before a state regulatory commission that controls prices of gas and electricity may be required to assess the impact of a proposed price increase on the quantity demanded of electricity before the commission will approve the price increase. In situations like this the economist may need to develop a demand function for electricity for this purpose. Such a demand function may enable the economist to estimate the price elasticity of demand, that is, the percentage change in the quantity demanded for a percentage change in the price. Knowledge of econometrics is very helpful in estimating such demand functions. It is fair to say that econometrics has become an integral part of training in economics and business.

Definition

Econometrics is the application of statistical methods to economic data and is described as the branch of economics that aims to give empirical content to economic relations.
Econometrics is an amalgam of economic theory, mathematical economics, economic
statistics, and mathematical statistics. Economic theory makes statements or hypotheses that
are mostly qualitative in nature; while, econometrics gives empirical content to most
economic theory. For example, microeconomic theory states that, other things remaining the
same, a reduction in the price of a commodity is expected to increase the quantity demanded
of that commodity. Thus, economic theory postulates a negative or inverse relationship
between the price and quantity demanded of a commodity. But the theory itself does not
provide any numerical measure of the relationship between the two; that is, it does not tell by
how much the quantity will go up or down as a result of a certain change in the price of the
commodity. It is the job of the econometrician to provide such numerical estimates.
Econometrics allows economists to sift through mountains of data to extract simple
relationships. Econometricians formulate a statistical model, usually based on economic
theory, confront it with the data, and try to come up with a specification that meets the
required goals. Precisely, Econometrics is the quantitative analysis of actual economic
phenomena based on the concurrent development of theory and observation, related by
appropriate methods of inference. The first known use of the term "econometrics" (in cognate form) was by the Polish economist Paweł Ciompa in 1910. Ragnar Frisch is credited with coining the term in the sense in which it is used today.

Specification of the econometric model: the purely mathematical model is turned into an econometric model by adding a disturbance (error) term u, a random (stochastic) variable having well-defined probabilistic properties. The disturbance term may well represent all the effective factors that are not taken into account explicitly.

For example, the econometric model of the Keynesian consumption function may be specified as:

Y = β1 + β2X + u

where u is the error (disturbance) term

Obtaining the data: To estimate the econometric model, that is, to obtain the numerical values
of the parameters of the model, we need data.

For example, we may use U.S. economic data for the period 1981–1996, from Economic
Report of the President (1998), to estimate numeric values of β1 and β2.

Estimation of the parameters of the econometric model: Having the data, the next task is to
estimate the parameters of the model. The numerical estimates of the parameters give
empirical content to the model.

For example, using the regression analysis technique and the data, we obtain the estimates of β1 and β2 as −184.08 and 0.7064, respectively.

Thus, the estimated consumption function is:

Ŷi = −184.08 + 0.7064Xi

The hat on the Y indicates that it is an estimate.
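As a minimal sketch of how such estimates are computed in practice (the income and consumption figures below are hypothetical placeholders, not the actual 1981–1996 U.S. data), the OLS intercept and slope can be obtained in Python as follows:

import numpy as np

# hypothetical aggregate income (X) and consumption expenditure (Y), in billions of dollars
X = np.array([4000.0, 4300.0, 4600.0, 5000.0, 5400.0, 5900.0, 6400.0, 7000.0])
Y = np.array([2650.0, 2860.0, 3050.0, 3340.0, 3620.0, 3980.0, 4310.0, 4760.0])

# OLS formulas: slope b2 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2), intercept b1 = Ybar - b2*Xbar
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1 = Y.mean() - b2 * X.mean()
print(f"estimated consumption function: Y_hat = {b1:.2f} + {b2:.4f} X")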

Hypothesis testing: According to “positive” economists like Milton Friedman, a theory or hypothesis that is not verifiable by appeal to empirical evidence may not be admissible as a
part of scientific enquiry. So, assuming that the fitted model is a reasonably good
approximation of reality, we have to develop suitable criteria to find out whether the obtained
estimates are in accord with the expectations of the theory that is being tested. Such
confirmation or refutation of economic theories on the basis of sample evidence is known as
hypothesis testing.

For example, as noted earlier, Keynes expected the MPC to be positive but less than 1. In our
example, we found the MPC to be about 0.70. But before we accept this finding as
confirmation of Keynesian consumption theory, we must enquire whether this estimate is
sufficiently below 1 to convince us that this is not a chance occurrence or peculiarity of the
particular data we have used. In other words, is 0.70 statistically less than 1? If it is, it may
support Keynes’ theory.
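A hedged sketch of how this test can be carried out: given the slope estimate and its standard error (the standard error below is an illustrative assumption, since it is not reported above), compute a t statistic for the null hypothesis that the true MPC equals 1 against the one-sided alternative that it is less than 1.

from scipy import stats

b2_hat = 0.7064      # estimated MPC from the fitted consumption function above
se_b2 = 0.05         # hypothetical standard error of the slope estimate
n, k = 16, 2         # 16 annual observations (1981-1996), 2 estimated parameters

t_stat = (b2_hat - 1.0) / se_b2           # H0: beta2 = 1 versus H1: beta2 < 1
p_value = stats.t.cdf(t_stat, df=n - k)   # one-sided (left-tail) p-value
print(t_stat, p_value)                    # a very small p-value supports MPC < 1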

Forecasting or prediction: If the chosen model does not refute the hypothesis or theory under
consideration, we may use it to predict the future value(s) of the dependent variable (or,
forecast variable) Y on the basis of known or expected future value(s) of the explanatory (or
predictor) variable X.

For example, putting the GDP value for 1997 (7269.8 billion dollars) on the obtained
Keynesian model, we obtain:

Y1997 = −184.0779 + 0.7064 (7269.8)

= 4951.3167

Thus, given the value of the GDP, the average forecast consumption expenditure is about
4951 billion dollars.
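This point forecast can be reproduced directly from the fitted equation; a one-line sketch:

y_hat_1997 = -184.0779 + 0.7064 * 7269.8   # forecast consumption for 1997, in billions of dollars
print(round(y_hat_1997, 2))                # about 4951.31, matching the figure above up to rounding of the coefficients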

Using the model for control or policy purposes: It is possible, and of a great interest, to use an
estimated model for control, or policy, purposes. By appropriate fiscal and monetary policy
mix, the government can manipulate the control variable(s) to produce the desired level of the
target variable.

Nature of Econometrics

Econometrics analyses data using statistical methods in order to test or develop economic
theory. These methods rely on statistical inferences to quantify and analyze economic
theories by leveraging tools such as frequency distributions, probability, and probability
distributions, statistical inference, correlation analysis, simple and multiple regression
analysis, simultaneous equations models, and time series methods.
Econometrics was pioneered by Lawrence Klein, Ragnar Frisch, and Simon Kuznets, each of whom was awarded the Nobel Memorial Prize in Economic Sciences for his contributions (Frisch in 1969, Kuznets in 1971, and Klein in 1980). Today, it is used regularly among academics as well as practitioners such as Wall Street traders and analysts.

An example of the application of econometrics is to study the income effect using observable data. An economist may hypothesize that as a person increases his income,
his spending will also increase. If the data show that such an association is present, a
regression analysis can then be conducted to understand the strength of the relationship
between income and consumption and whether or not that relationship is statistically
significant—that is, it appears to be unlikely that it is due to chance alone.

Scope of Econometrics

Quantitative economics is a highly specialized field of study taught at the post-graduate level. Courses in this field are popular in India and are pursued by some of the best minds in the country. The study of quantitative economics has become all the more important because economics, as a subject, can approach problems analytically and provide efficient solutions. The subject is also known as econometrics, as it deals with complex mathematical and statistical models that support the detailed study of economic concepts.

The growth and development of industries always depends on factors such as resource utilization and revenue maximization. Quantitative economics provides the economic models required for the analysis of such factors. The demand for experts in this field is huge across all sectors, some of them being advisory bodies, multinational corporations, manufacturing units, business conglomerates and so on.

Global competition has led to an unending race in which every enterprise wants to become the market leader, which in turn has enhanced the quality of goods and services provided by organizations. The role of economic models in such a scenario becomes all the more important. Econometrics is a highly important subject for research purposes because it leads to efficient solutions of economic problems.
A number of esteemed institutes in the country offer courses on this subject. In a developing economy like India, the role of econometricians becomes indispensable. The experts in this field are the ones who bag the best jobs in the market. The employment scope for quantitative economics experts is huge not only in India but abroad as well.

Econometrics is the application of statistical methods and mathematics to economic data. It is a branch of economics that focuses on giving empirical content to economic relations. It also aims at quantifying relationships between economic variables through statistical techniques.

Courses in Econometrics

There are many courses available in the field of Econometrics. Some of them are Graduate
Diploma in Econometrics, B.A in Economics with Econometrics, M.Sc. in Econometrics,
M.A in Econometrics, etc. A few colleges or universities in India that offer various courses
in Econometrics are listed below.

 University of Madras

 Sri Venkateshwara University College of Arts

 Gujarat University

 Centre for Population Studies

 Bharathi University

 University of Jammu

 University of Delhi

 Dibrugarh University

 Delhi School of Economics


1.4 SIMPLE AND GENERAL LINEAR REGRESSION MODEL

Simple Linear Regression Model

In statistics, simple linear regression is a linear regression model with a single explanatory
variable. That is, it concerns two-dimensional sample points with one independent variable
and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate
system) and finds a linear function (a non-vertical straight line) that, as accurately as possible,
predicts the dependent variable values as a function of the independent variable. The
adjective simple refers to the fact that the outcome variable is related to a single predictor.

It is common to make the additional stipulation that the ordinary least squares (OLS) method
should be used: the accuracy of each predicted value is measured by its squared residual
(vertical distance between the point of the data set and the fitted line), and the goal is to make
the sum of these squared deviations as small as possible. Other regression methods that can
be used in place of ordinary least squares include least absolute deviations (minimizing the
sum of absolute values of residuals) and the Theil–Sen estimator (which chooses a line whose
slope is the median of the slopes determined by pairs of sample points). Deming regression
(total least squares) also finds a line that fits a set of two-dimensional sample points, but
(unlike ordinary least squares, least absolute deviations, and median slope regression) it is not
really an instance of simple linear regression, because it does not separate the coordinates
into one dependent and one independent variable and could potentially return a vertical line
as its fit.

The remainder of the article assumes an ordinary least squares regression. In this case, the
slope of the fitted line is equal to the correlation between y and x corrected by the ratio of
standard deviations of these variables. The intercept of the fitted line is such that the line
passes through the center of mass (x, y) of the data points.

The simple linear regression model is represented like this: y = β0 + β1x + ε

By mathematical convention, the two factors that are involved in a simple linear regression analysis are designated x and y. The equation that describes how y is related to x is known as the regression model. The linear regression model also contains an error term that is represented by ε, the Greek letter epsilon. The error term is used to account for the variability in y that cannot be explained by the linear relationship between x and y. There are also parameters that represent the population being studied; these parameters of the model are β0 and β1.

The simple linear regression equation is represented like this: E(y) = β0 + β1x. The simple linear regression equation is graphed as a straight line, where:

β0 is the y intercept of the regression line,

β1 is the slope, and

E(y) is the mean or expected value of y for a given value of x.

A regression line can show a positive linear relationship, a negative linear relationship, or no relationship. If the graphed line in a simple linear regression is flat (not sloped), there
is no relationship between the two variables. If the regression line slopes upward with the
lower end of the line at the y intercept (axis) of the graph, and the upper end of line
extending upward into the graph field, away from the x intercept (axis) a positive linear
relationship exists. If the regression line slopes downward with the upper end of the line at
the y intercept (axis) of the graph, and the lower end of line extending downward into the
graph field, toward the x intercept(axis) a negative linear relationship exists.

If the parameters of the population were known, the simple linear regression equation
(shown below) could be used to compute the mean value of y for a known value of x.

E(y) = β0 + β1x.

However, in practice, the parameter values are not known so they must be estimated by
using data from a sample of the population. The population parameters are estimated by
using sample statistics. The sample statistics are represented by b0 and b1. When the sample statistics are substituted for the population parameters, the estimated regression equation is formed.

The estimated regression equation is shown below:

ŷ = b0 + b1x

ŷ is pronounced y hat.

The graph of the estimated simple regression equation is called the estimated regression line. b0 is the y intercept and b1 is the slope. ŷ is the estimated value of y for a given value of x.
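As an illustrative sketch with made-up x and y values, the estimated regression line ŷ = b0 + b1x can be obtained in Python with numpy.polyfit and then used to compute the fitted values:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # a degree-1 fit returns the slope first, then the intercept
y_hat = b0 + b1 * x                # estimated values on the fitted regression line
print(b0, b1)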

General Linear Regression Model

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related
to the response variable via a link function and by allowing the magnitude of the variance of
each measurement to be a function of its predicted value.

Generalized linear models were formulated by John Nelder and Robert Wedderburn as a way
of unifying various other statistical models, including linear regression, logistic regression
and Poisson regression. They proposed an iteratively reweighted least squares method for
maximum likelihood estimation of the model parameters. Maximum-likelihood estimation
remains popular and is the default method on many statistical computing packages. Other
approaches, including Bayesian approaches and least squares fits to variance stabilized
responses, have been developed.

The general linear model or multivariate regression model is a statistical linear model. It may
bewritten as

Y = XB + U

where Y is a matrix with series of multivariate measurements (each column being a set of
measurements on one of the dependent variables), X is a matrix of observations on
independent variables that might be a design matrix (each column being a set of
observations on one of the independent variables), B is a matrix containing parameters that
are usually to be estimated and U is a matrix containing errors (noise). The errors are usually
assumed to be uncorrelated across measurements, and follow a multivariate normal
distribution. If the errors do not follow a multivariate normal distribution, generalized linear
models may be used to relax assumptions about Y and U.

The general linear model incorporates a number of different statistical models: ANOVA,
ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and F-test. The
general linear model is a generalization of multiple linear regression to the case of more
than one dependent variable. If Y, B, and U were column vectors, the matrix equation
above would represent multiple linear regression.

Hypothesis tests with the general linear model can be made in two ways: multivariate or as
several independent univariate tests. In multivariate tests the columns of Y are tested
together, whereas in univariate tests the columns of Y are tested independently, i.e., as
multiple univariatetests with the same design matrix.

An application of the general linear model appears in the analysis of multiple brain
scans in scientific experiments where Y contains data from brain scanners, X contains
experimental design variables and confounds. It is usually tested in a univariate way
(usually referred to a mass-univariate in this setting) and is often referred to as statistical
parametric mapping.

The general linear model or general multivariate regression model can also be viewed as a compact way of simultaneously writing several multiple linear regression models; in that sense it is not a separate statistical linear model, but a notation that stacks those regressions into the single matrix equation Y = XB + U described above.


1.5 ASSUMPTIONS, ESTIMATION (THROUGH ORDINARY LEAST SQUARE APPROACH) AND PROPERTIES OF ESTIMATORS

OLS estimators have the following properties:

Linear

OLS estimators are linear functions of the values of Y (the dependent variable) which are
linearly combined using weights that are a non-linear function of the values of X (the
regressors or explanatory variables). So, the OLS estimator is a "linear" estimator with
respect to how it uses the values of the dependent variable only, and irrespective of how it
uses the values of the regressors.

This property concerns the estimator rather than the original equation that is being estimated. In assumption A1, the focus was that the linear regression should be "linear in parameters." The linear property of the OLS estimator, however, means that OLS belongs to the class of estimators that are linear in Y, the dependent variable. Note that OLS estimators are linear only with respect to the dependent variable and not necessarily with respect to the independent variables. The linear property of OLS estimators does not depend only on assumption A1 but on all assumptions A1 to A5.
Unbiased

If you look at the regression equation, you will find an error term associated with the
regression equation that is estimated. This makes the dependent variable also random. If an
estimator uses the dependent variable, then that estimator would also be a random number.
Therefore, before describing what unbiasedness is, it is important to mention that
unbiasedness property is a property of the estimator and not of any sample.

Unbiasedness is one of the most desirable properties of any estimator. The estimator should
ideally be an unbiased estimator of true parameter/population values.

Efficient: it has the minimum variance

First, let us look at what efficient estimators are. The efficient property of any estimator says that the estimator is the minimum variance unbiased estimator. Therefore, if you take all the unbiased estimators of the unknown population parameter, the efficient estimator will have the least variance. An estimator with less variance will have individual estimates closer to the true value, and as a result it is more likely to give better and more accurate results than other estimators having higher variance. In short:

If the estimator is unbiased but does not have the least variance, it is not the best. If the estimator has the least variance but is biased, it is again not the best. If the estimator is both unbiased and has the least variance, it is the best estimator. Now, talking about OLS, OLS estimators have the least variance among the class of all linear unbiased estimators. So, this property of OLS regression is less strict than the efficiency property: efficiency requires the least variance among all unbiased estimators, whereas OLS estimators have the least variance among all linear and unbiased estimators.

Consistent

An estimator is said to be consistent if its value approaches the actual, true parameter
(population) value as the sample size increases. An estimator is consistent if it satisfies two
conditions:

1. It is asymptotically unbiased

2. Its variance converges to 0 as the sample size increases.

Both these hold true for OLS estimators and, hence, they are consistent estimators. For an
estimator to be useful, consistency is the minimum basic requirement. Since there may be
several such estimators, asymptotic efficiency also is considered. Asymptotic efficiency is
the sufficient condition that makes OLS estimators the best estimators.

Asymptotic Unbiasedness

This property of OLS says that as the sample size increases, the biasedness of OLS
estimators disappears.
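A small simulation sketch (entirely synthetic data) illustrates these properties: across many replications the OLS slope estimates center on the true parameter (unbiasedness), and their spread shrinks as the sample size grows (consistency).

import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0                  # true parameters of the data-generating process

for n in (25, 100, 400):                 # increasing sample sizes
    slopes = []
    for _ in range(2000):                # Monte Carlo replications
        x = rng.normal(size=n)
        y = beta0 + beta1 * x + rng.normal(size=n)   # errors with mean zero
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        slopes.append(b1)
    slopes = np.array(slopes)
    print(n, round(slopes.mean(), 3), round(slopes.std(), 3))   # mean near 2.0; std falls as n grows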

In real life, linear regression models have many uses. The Ordinary Least Squares (OLS)
method is a popular econometric technique for estimating a linear regression model’s
parameter. While running linear regression models, some assumptions are made regarding
the accuracy of OLS estimations.

A1. The linear regression model is linear in its parameters.

A2. The observations are drawn by random sampling.

A3. The error term has a conditional mean of zero given the regressors.

A4. There is no perfect multicollinearity (no exact linear relationship among the regressors).

A5. Spherical errors: homoscedasticity and no autocorrelation.

A6. (Optional) The error terms follow a normal distribution.

These presumptions are crucial because if any of them were broken, OLS estimates would
be inaccurate and unreliable. In particular, a violation would lead to OLS estimates with
wrong signs or unreliable variance, resulting in either too broad or too small confidence
intervals.

Given this, it is worth looking at why OLS estimators and their underlying assumptions receive so much attention. The well-known Gauss-Markov theorem is described first, the properties of the OLS estimators are then discussed in greater depth, and the section concludes with a brief discussion of the uses of these properties in econometrics.

Applications and their Relationship to Econometrics Research

Due to the above-discussed beneficial qualities, OLS estimators are popular and have many
practical applications.

Consider a bank that wishes to forecast a customer's risk in the event of a default. The bank
may use a number of independent factors, such as client level characteristics, credit history,
loan type, mortgage, etc., together with the exposure at default as the dependent variable. To
determine the elements that are significant in predicting a customer's risk at default, the
bank may easily conduct an OLS regression and receive the estimates. OLS estimators are
simple to operate and comprehend. They may be utilized extensively and are included in
many statistical software programmes.

The foundation of econometrics is OLS regressions. Every econometrics course will begin
by assuming OLS regressions. It is one of the most often asked topics in employment and
college entrance interviews. A variety of models, including GLM (generalized linear
models), general linear models, heteroscedastic models, multi-level regression models, etc.,
have been developed based on the OLS's fundamental components and by loosening the
assumptions.

Econometrics has a significant influence on research in both finance and economics. The
foundation of econometrics is OLS. However, in practice, there are problems like reverse
causality that make OLS inappropriate or irrelevant. OLS may still be used to look at the
problems with cross-sectional data, though. Even though OLS cannot be used for regression,
it is utilized to identify concerns, problems, and potential solutions.

Conclusion

In conclusion, the OLS estimation technique is the most widely used and is an essential form of linear regression. The characteristics of OLS estimators were covered here because OLS is the most popular estimation method. OLS estimators are BLUE (i.e., they are linear, unbiased and have the least variance among the class of all linear and unbiased estimators). Remember that the Gauss-Markov theorem, which states that the estimators of the OLS model are BLUE, holds only if the OLS assumptions are met. Each assumption places limitations on the model but also enables more definitive claims to be made about OLS. As a result, you should always verify the OLS assumptions whenever you want to employ a linear regression model estimated by OLS. If the assumptions are met, the Gauss-Markov theorem makes life easier: OLS may then be used directly for the best outcomes.
The Seven Classical OLS Assumptions

Like many statistical analyses, ordinary least squares (OLS) regression has underlying
assumptions. When these classical assumptions for linear regression are true, ordinary least
squares produces the best estimates. However, if some of these assumptions are not true,
you might need to employ remedial measures or use other estimation methods to improve
the results.

Many of these assumptions describe properties of the error term. Unfortunately, the error
term is a population value that we’ll never know. Instead, we’ll use the next best thing that
is available—the residuals. Residuals are the sample estimate of the error for each
observation.

Residuals = Observed value – the fitted value

When it comes to checking OLS assumptions, assessing the residuals is crucial!

There are seven classical OLS assumptions for linear regression. The first six are
mandatory to produce the best estimates. While the quality of the estimates does not
depend on the seventh assumption, analysts often evaluate it for other important reasons
that I’ll cover.

OLS Assumption 1: The regression model is linear in the coefficients and the error term

This assumption addresses the functional form of the model. In statistics, a regression model is linear when all terms in the model are either the constant or a parameter multiplied by an independent variable. You build the model equation only by adding the terms together. These rules constrain the model to one type:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

In the equation, the betas (βs) are the parameters that OLS estimates. Epsilon (ε) is the random error.

In fact, the defining characteristic of linear regression is this functional form of the parameters rather than the ability to model curvature. Linear models can model curvature by including nonlinear variables such as polynomials and transforming exponential functions.
OLS Assumption 2: The error term has a population mean of zero

The error term accounts for the variation in the dependent variable that the independent
variables do not explain. Random chance should determine the values of the error term. For
your model to be unbiased, the average value of the error term must equal zero.

Suppose the average error is +7. This non-zero average error indicates that our model
systematically underpredicts the observed values. Statisticians refer to systematic error like
this as bias, and it signifies that our model is inadequate because it is not correct on
average.

Stated another way, we want the expected value of the error to equal zero. If the expected
value is +7 rather than zero, part of the error term is predictable, and we should add that
information to the regression model itself. We want only random error left for the error
term.

You don’t need to worry about this assumption when you include the constant in your regression model, because doing so forces the mean of the residuals to equal zero.

OLS Assumption 3: All independent variables are uncorrelated with the error term

If an independent variable is correlated with the error term, we can use the independent
variable to predict the error term, which violates the notion that the error term represents
unpredictable random error. We need to find a way to incorporate that information into the
regression model itself.

This assumption is also referred to as exogeneity. When this type of correlation exists,
there is endogeneity. Violations of this assumption can occur because there is simultaneity
between the independent and dependent variables, omitted variable bias, or measurement
error in the independent variables.

Violating this assumption biases the coefficient estimates. To understand why this bias occurs, keep in mind that the error term always explains some of the variability in the dependent variable. However, when an independent variable correlates with the error term, OLS incorrectly attributes some of the variance that the error term actually explains to the independent variable instead.

OLS Assumption 4: Observations of the error term are uncorrelated with each other

One observation of the error term should not predict the next observation. For instance, if
the error for one observation is positive and that systematically increases the probability
that the following error is positive, that is a positive correlation. If the subsequent error is
more likely to have the opposite sign, that is a negative correlation. This problem is known
both as serial correlation and autocorrelation. Serial correlation is most likely to occur in
time series models.

For example, if sales are unexpectedly high on one day, then they are likely to be higher
than average on the next day. This type of correlation isn’t an unreasonable expectation for
some subject areas, such as inflation rates, GDP, unemployment, and so on.

Assess this assumption by graphing the residuals in the order that the data were collected.
You want to see randomness in the plot. In the graph for a sales model, there is a cyclical
pattern with a positive correlation.

Fig.-1.3 Versus Order


As I’ve explained, if you have information that allows you to predict the error term for an
observation, you must incorporate that information into the model itself. To resolve this
issue, you might need to add an independent variable to the model that captures this
information. Analysts commonly use distributed lag models, which use both current values
of the dependent variable and past values of independent variables.

For the sales model above, we need to add variables that explain the cyclical pattern.

Serial correlation reduces the precision of OLS estimates. Analysts can also use time series
analysis for time dependent effects.
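A sketch of how this assumption can be checked numerically (synthetic data; the statsmodels package is assumed to be available): fit the model and compute the Durbin-Watson statistic of the residuals, where values near 2 suggest little first-order serial correlation.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(size=200)      # synthetic data with independent errors

results = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(results.resid))           # a value near 2 indicates little autocorrelation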

OLS Assumption 5: The error term has a constant variance (no heteroscedasticity)

The variance of the errors should be consistent for all observations. In other words, the
variance does not change for each observation or for a range of observations. This
preferred condition is known as homoscedasticity (same scatter). If the variance changes,
we refer to that as heteroscedasticity (different scatter).

The easiest way to check this assumption is to create a residuals versus fitted value plot. On
this type of graph, heteroscedasticity appears as a cone shape where the spread of the
residuals increases in one direction. In the graph below, the spread of the residuals
increases as the fitted value increases.
Fig. 1.4 Versus Fits

Heteroscedasticity reduces the precision of the estimates in OLS linear regression.
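Beyond the visual check, a formal test can be run; a sketch using the Breusch-Pagan test from statsmodels, reusing a fitted OLS results object like the one in the previous sketch:

from statsmodels.stats.diagnostic import het_breuschpagan

# null hypothesis: homoscedasticity; a small p-value signals heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(lm_pvalue)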

OLS Assumption 6: No independent variable is a perfect linear function of other explanatory variables

Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of
+1 or -1. When one of the variables changes, the other variable also changes by a
completely fixed proportion. The two variables move in unison.

Perfect correlation suggests that two variables are different forms of the same variable. For
example, games won and games lost have a perfect negative correlation (-1). The
temperature in Fahrenheit and Celsius have a perfect positive correlation (+1).

Ordinary least squares cannot distinguish one variable from the other when they are
perfectly correlated. If you specify a model that contains independent variables with perfect
correlation, your statistical software can’t fit the model, and it will display an error
message. You must remove one of the variables from the model to proceed.

Perfect correlation is a show stopper. However, your statistical software can fit OLS
regression models with imperfect but strong relationships between the independent
variables. If these correlations are high enough, they can cause problems. Statisticians refer
to this condition as multicollinearity, and it reduces the precision of the estimates in OLS
linear regression.

OLS Assumption 7: The error term is normally distributed (optional)

OLS does not require that the error term follows a normal distribution to produce unbiased
estimates with the minimum variance. However, satisfying this assumption allows you to
perform statistical hypothesis testing and generate reliable confidence intervals and
prediction intervals.

The easiest way to determine whether the residuals follow a normal distribution is to assess
a normal probability plot. If the residuals follow the straight line on this type of graph, they
are normally distributed. They look good on the plot below

Fig.-1.5 Normal Probability Plot
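A sketch of the corresponding checks in Python (again reusing a fitted OLS results object; scipy, matplotlib and statsmodels are assumed): a normal probability (Q-Q) plot gives the visual check described above, and the Jarque-Bera statistic gives a formal test.

import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.stattools import jarque_bera

stats.probplot(results.resid, dist="norm", plot=plt)   # points near a straight line suggest normality
plt.show()

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)
print(jb_pvalue)                                       # a large p-value gives no evidence against normality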


1.6 GAUSS-MARKOV THEOREM

In statistics, the Gauss–Markov theorem (or simply Gauss theorem for some authors) states
that the ordinary least squares (OLS) estimator has the lowest sampling variance within the
class of linear unbiased estimators, if the errors in the linear regression model are
uncorrelated, have equal variances and expectation value of zero. The errors do not need to
be normal, nor do they need to be independent and identically distributed (only uncorrelated
with mean zero and homoscedastic with finite variance). The requirement that the estimator
be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for
example, the James–Stein estimator (which also drops linearity), ridge regression, or simply
any degenerate estimator.

The theorem was named after Carl Friedrich Gauss and Andrey Markov, although Gauss'
work significantly predates Markov's. But while Gauss derived the result under the
assumption of independence and normality, Markov reduced the assumptions to the form
stated above. A further generalization to non-spherical errors was given by Alexander Aitken.


The Gauss Markov theorem tells us that if a certain set of assumptions are met, the
ordinary least squares estimate for regression coefficients gives you the best linear
unbiased estimate (BLUE) possible.

When estimating regression models, we know that the results of the estimation procedure
are random. However, when using unbiased estimators, at least on average, we estimate the
true parameter. When comparing different unbiased estimators, it is therefore interesting to
know which one has the highest precision: being aware that the likelihood of estimating the
exact value of the parameter of interest is 0 in an empirical application, we want to make
sure that the likelihood of obtaining an estimate very close to the true value is as high
as possible. This means we want to use the estimator with the lowest variance of all
unbiased estimators, provided we care about unbiasedness. The Gauss-Markov theorem
states that, in the class of conditionally unbiased linear estimators, the OLS estimator has
this property under certain conditions.

Gauss Markov Assumptions

There are five Gauss Markov assumptions (also called conditions):

 Linearity: the parameters we are estimating using the OLS method must be themselves linear.

 Random sampling: our data must have been randomly sampled from the population.

 Non-collinearity: the regressors being used are not perfectly correlated with each other.

 Exogeneity: the regressors are not correlated with the error term.

 Homoscedasticity: no matter what the values of our regressors might be, the variance of the error is constant.

Purpose of the Assumptions

The Gauss Markov assumptions guarantee the validity of ordinary least squares for
estimating regression coefficients.
Checking how well our data matches these assumptions is an important part of estimating
regression coefficients. When you know where these conditions are violated, you may be
able to plan ways to change your experiment setup to help your situation fit the ideal Gauss
Markov situation more closely.

In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still useful as a benchmark because they show us what 'ideal' conditions would be. They also allow us to pinpoint problem areas that might cause our estimated regression coefficients to be inaccurate or even unusable.

The Gauss-Markov Assumptions in Algebra

We can summarize the Gauss-Markov assumptions succinctly in algebra by saying that a linear regression model represented by

yi = xi′β + εi

and estimated by ordinary least squares gives the best linear unbiased estimate (BLUE) possible if

 E{εi} = 0, i = 1, . . ., N

 {ε1, . . ., εN} and {x1, . . ., xN} are independent

 cov{εi, εj} = 0 for all i, j = 1, . . ., N with i ≠ j

 V{εi} = σ2, i = 1, . . ., N

The first of these assumptions can be read as "the expected value of the error term is zero." The second assumption is exogeneity, the third is the absence of autocorrelation (no serial correlation), and the fourth is homoscedasticity.

1.7 NORMALITY ASSUMPTIONS

Normality tests are used in statistics to examine if a data set is well-modeled by a normal
distribution and to calculate the likelihood that a random variable underlying the data set will be
normally distributed.
Specifically, the tests are a type of model selection, and depending on how one views probability,
they may be understood in a variety of ways:

Without passing judgement on any underlying variables, descriptive statistics examines the
goodness of fit of a normal model to the data. If the fit is low, the data are not properly described
by a normal distribution in that aspect.

In frequentist statistical hypothesis testing, data are tested against the null hypothesis that they are normally distributed.

In Bayesian statistics, one does not specifically "test normality," but rather computes the likelihood that the data come from a normal distribution with the given parameters and compares that to the likelihood that the data come from other distributions under consideration. This is most commonly done by using a Bayes factor, which indicates the relative likelihood of seeing the data given various models, or more precisely by using a prior distribution on potential models and parameters.

To ascertain whether sample data were drawn from a normally distributed population (within some tolerance), a normality test is used. A normally distributed sample population is required for a variety of statistical tests, including the Student's t-test and the one-way and two-way ANOVA.

Graphical methods -

An informal approach to testing normality is to compare a histogram of the sample data to a


normal probability curve. The empirical distribution of the data (the histogram) should be bell-
shaped and resemble the normal distribution. This might be difficult to see if the sample is small.
In this case one might proceed by regressing the data against the quantiles of a normal
distribution with the same mean and variance as the sample. Lack of fit to the regression line
suggests a departure from normality (see Anderson Darling coefficient and Minitab).

A graphical tool for assessing normality is the normal probability plot, a quantile-quantile plot
(QQ plot) of the standardized data against the standard normal distribution. Here the correlation
between the sample data and normal quantiles (a measure of the goodness of fit) measures how
well the data are modeled by a normal distribution. For normal data the points plotted in the QQ
plot should fall approximately on a straight line, indicating high positive correlation. These plots
are easy to interpret and also have the benefit that outliers are easily identified.

One application of normality tests is to the residuals from a linear regression model. If they are
not normally distributed, the residuals should not be used in Z tests or in any other tests derived
from the normal distribution, such as t tests, F tests and chi-squared tests. If the residuals are not
normally distributed, then the dependent variable or at least one explanatory variable may have
the wrong functional form, or important variables may be missing, etc. Correcting one or more of
these systematic errors may produce residuals that are normally distributed; in other words, non-
normality of residuals is often a model deficiency rather than a data problem

Assumption of normality means that you should make sure your data roughly fit a bell curve shape before running certain statistical tests or regressions. The tests that require normally distributed data include:

 Independent Samples t-test.

 Hierarchical Linear Modeling.

 ANCOVA.

 Goodness of Fit Test.

You've got two main ways to test for normality: eyeball a graph, or run a test that is specifically designed to test for normality. The data do not have to be perfectly normal. However, data that definitely do not meet the assumption of normality are going to give you poor results for certain types of tests (i.e., ones that state that the assumption must be met!). How closely does your data have to meet the test for normality? This is a judgment call.

If you're uncomfortable with making that judgement (which is usually based on experience with statistics), then you may be better off running a statistical test. That said, some of the tests can be cumbersome to use and involve finding test statistics and critical values.

Stuck on which option to choose? If you're new to statistics, the easiest graph to decipher is the histogram. The easiest test to run is probably the Jarque-Bera test.
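A brief sketch of both approaches on a generic data vector (synthetic here), using matplotlib and scipy: draw a histogram for the visual check and run the Jarque-Bera test for the formal one.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=2.0, size=500)   # synthetic sample to be checked

plt.hist(data, bins=30)        # eyeball check: is the shape roughly a bell curve?
plt.show()

jb_stat, p_value = stats.jarque_bera(data)         # H0: the data are normally distributed
print(p_value)                                     # a large p-value is consistent with normality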
1.8 CONCEPTS AND DERIVATION OF R2 AND ADJUSTED R2

R-squared (R2) and adjusted R-square allow an investor to measure the value of a mutual
fund against the value of a benchmark. Investors may also use this calculation to measure
their portfolio against a given benchmark.

These values range between 0 and 100. The resulting figure does not indicate how well a
particular group of securities is performing, and it only measures how closely the returns
from the holdings align to those of the measured benchmark.

R-squared—also known as the coefficient of determination—is a statistical analysis tool used to predict the future outcome of an investment and how closely it aligns to a single
measured model. Adjusted R-squared compares the correlation of the investment to several
measured models.

R-Squared

R-squared cannot verify whether the coefficient estimates and predictions are biased. It also does not show whether a regression model is adequate; it can show a low R-squared figure for a good model, or a high R-squared figure for a model that does not fit. The lower the value of the R2, the less the two variables correlate with one another. Results higher than 70% usually indicate that a portfolio closely follows the measured benchmark. Higher R-squared values also indicate the reliability of beta readings. Beta measures the volatility of a security or a portfolio.

One major difference between R-squared and the adjusted R-squared is that R2 assumes
every independent variable—benchmark—in the model explains the variation in the
dependent variable—mutual fund or portfolio. It gives the percentage of explained variation
as if all independent variables in the model affect the dependent variable. In the real world, this one-to-one relationship rarely happens. Adjusted R-squared, on the other hand, gives
the percentage of variation explained by only those independent variables that, in reality,
affect the dependent variable.
Adjusted R-Squared

The adjusted R-squared compares the descriptive power of regression models—two or more
variables—that include a diverse number of independent variables—known as a predictor.
Every predictor or independent variable, added to a model increases the R-squared value
and never decreases it. So, a model that includes several predictors will return higher R2
values and may seem to be a better fit. However, this result is due to it including more terms.

The adjusted R-squared compensates for the addition of variables and only increases if the
new predictor enhances the model above what would be obtained by probability.
Conversely, it will decrease when a predictor improves the model less than what is predicted
by chance.

Overfitting occurs when a statistical model contains too many predictors relative to the number of data points. Overfitting can return an unwarrantedly high R-squared value, and this misleading figure can reduce the ability to predict performance outcomes. The adjusted R-squared is a modified version of R2 that accounts for the number of predictors in a model. The adjusted R-squared can be negative, though it usually is not.

While R-squared reports the linear relationship in the sample of data, even when there is no underlying relationship, the adjusted R-squared gives the best estimate of the degree of relationship in the underlying population.
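The standard formulas behind the two measures can be written as a short sketch (n observations, k predictors): R-squared is one minus the ratio of the residual to the total sum of squares, and adjusted R-squared rescales it by the degrees of freedom.

import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, k):
    n = len(y)                                 # n observations, k predictors
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)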

When comparing models with R-squared alone, one would simply pick the model with the highest value. The better and easier way to compare models, however, is to use the adjusted R-squared, which penalizes predictors that do not genuinely improve the model, so the model with the higher adjusted R-squared is preferred. Adjusted R-squared is not intended for comparing nonlinear models; it applies to multiple linear regression.

R-Squared is a statistical measure of fit that indicates how much variation of a dependent
variable is explained by the independent variable(s) in a regression model.

In investing, R-squared is generally interpreted as the percentage of a fund or security's movements that can be explained by movements in a benchmark index.

An R-squared of 100% means that all movements of a security (or other dependent
variables) are completely explained by movements in the index (or the independent
variable(s) you are interested in).

Limitations of R-Squared

R-squared will give you an estimate of the relationship between movements of a dependent
variable based on an independent variable's movements. It doesn't tell you whether your
chosen model is good or bad, nor will it tell you whether the data and predictions are
biased. A high or low R-square isn't necessarily good or bad, as it doesn't convey the
reliability of the model, nor whether you've chosen the right regression. You can get a low
R-squared for a good model, or a high R-square for a poorly fitted model, and vice versa.

R-squared and adjusted R-squared enable investors to measure the performance of a mutual
fund against that of a benchmark. Investors may also use them to calculate the performance
of their portfolio against a given benchmark.

In the world of investing, R-squared is expressed as a percentage between 0 and 100, with
100 signaling perfect correlation and zero no correlation at all. The figure does not indicate
how well a particular group of securities is performing. It only measures how closely the
returns align with those of the measured benchmark. It is also backwards-looking—it is not
a predictor of future results.

Adjusted R-squared can provide a more precise view of that correlation by also taking into
account how many independent variables are added to a particular model against which the
stock index is measured. This is done because such additions of independent variables
usually increase the reliability of that model—meaning, for investors, the correlation with
the index.

The most obvious difference between adjusted R-squared and R-squared is simply that
adjusted R-squared considers and tests different independent variables against the stock
index and R-squared does not. Because of this, many investment professionals prefer using
adjusted R-squared because it has the potential to be more accurate. Furthermore, investors
can gain additional information about what is affecting a stock by testing various
independent variables using the adjusted R-squared model.

R-squared, on the other hand, does have its limitations. One of the most essential limits to
using this model is that R-squared cannot be used to determine whether or not the
coefficient estimates and predictions are biased. Furthermore, in multiple linear regression,
the R-squared can not tell us which regression variable is more important than the other.

1.9 CONCEPT AND INTERPRETATION OF PARTIAL AND MULTIPLE CORRELATION

Partial Correlation

In probability theory and statistics, partial correlation measures the degree of association
between two random variables, with the effect of a set of controlling random variables
removed. If we are interested in finding whether or to what extent there is a numerical
relationship between two variables of interest, using their correlation coefficient will give
misleading results if there is another, confounding, variable that is numerically related to
both variables of interest. This misleading information can be avoided by controlling for the
confounding variable, which is done by computing the partial correlation coefficient. This is
precisely the motivation for including other right-side variables in a multiple regression; but
while multiple regression gives unbiased results for the effect size, it does not give a
numerical value of a measure of the strength of the relationship between the two variables of
interest.

For example, if we have economic data on the consumption, income, and wealth of various
individuals and we wish to see if there is a relationship between consumption and income,
failing to control for wealth when computing a correlation coefficient between
consumption and income would give a misleading result, since income might be
numerically related to wealth which in turn might be numerically related to consumption; a
measured correlation between consumption and income might actually be contaminated by
these other correlations. The use of a partial correlation avoids this problem.

Like the correlation coefficient, the partial correlation coefficient takes on a value in the
range from –1 to 1. The value –1 conveys a perfect negative correlation controlling for
some variables (that is, an exact linear relationship in which higher values of one variable
are associated with lower values of the other); the value 1 conveys a perfect positive linear
relationship, and the value 0 conveys that there is no linear relationship.

The partial correlation coincides with the conditional correlation if the random variables
are jointly distributed as the multivariate normal, other elliptical, multivariate
hypergeometric, multivariate negative hypergeometric, multinomial or Dirichlet
distribution, but not in general otherwise.

Formally, the partial correlation between X and Y given a set of n controlling variables Z =
{Z1, Z2, ..., Zn}, written ρXY·Z, is the correlation between the residuals eX and eY resulting
from the linear regression of X with Z and of Y with Z, respectively. The first-order partial
correlation (i.e., when n = 1) is the difference between a correlation and the product of the
removable correlations divided by the product of the coefficients of alienation of the
removable correlations. The coefficient of alienation, and its relation with joint variance
through correlation are available in Guilford (1973, pp. 344–345).

A simple correlation coefficient indicates the relationship between two variables, but there are also correlation coefficients that involve more than two variables. This may sound unusual: how is it done, and under what circumstances? Two examples illustrate the idea. The first concerns the correlation between cholesterol level and bank balance for adults. Suppose we find a positive correlation between these two factors: as the bank balance increases, cholesterol level also increases. But this is not a genuine relationship, because cholesterol level also increases as age increases, and as age increases the bank balance may also increase, since a person saves from his or her salary over the years. Thus, there is an age factor which influences both cholesterol level and bank balance. Suppose we want to know only the correlation between cholesterol and bank balance without the influence of age. We could take persons from the same age group and thus control for age, but if this is not possible, we can statistically control the age factor and thus remove its influence on both cholesterol and bank balance. Doing this is called partial correlation; partial and part correlation can both be used for this purpose.

Sometimes in psychology we have certain factors which are influenced by a large number of variables. For instance, academic achievement will be affected by intelligence, work habits, extra coaching, socio-economic status, and so on. Finding the correlation between academic achievement and various other factors such as those mentioned above is done by multiple correlation. In this unit we will be learning about partial, part and multiple correlation.

Partial correlation is a method used to describe the relationship between two variables
whilst taking away the effects of another variable, or several other variables, on this
relationship.

Partial correlation is best thought of in terms of multiple regression; Stats Direct shows the
partial correlation coefficient r with its main results from multiple linear regression.

A different way to calculate partial correlation coefficients, which does not require a full multiple regression, is shown below to further explain the principles:

Consider a correlation matrix for variables A, B and C (note that the multiple linear regression function in Stats Direct will output correlation matrices for you as one of its options).

The partial correlation of A and B adjusted for C is:

r(AB.C) = (rAB − rAC·rBC) / sqrt[(1 − rAC²)(1 − rBC²)]

The same can be done using Spearman's rank correlation co-efficient.


The hypothesis test for the partial correlation co-efficient is performed in the same way as
for the usual correlation co-efficient but it is based upon n-3 degrees of freedom.

Please note that this sort of relationship between three or more variables is more usefully
investigated using the multiple regression itself (Altman, 1991).

The general form of the partial correlation from a multiple regression is:

r(partial, k) = tk / sqrt(tk² + residual degrees of freedom)

where tk is the Student t statistic for the kth term in the linear model.
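
As an illustration of the correlation-matrix route and the n − 3 degrees-of-freedom test described above, the following sketch uses made-up pairwise correlations rAB, rAC, rBC and a made-up sample size n:

```python
import numpy as np
from scipy import stats

# Illustrative pairwise correlations among A, B and C, and an assumed sample size
r_ab, r_ac, r_bc = 0.60, 0.45, 0.50
n = 40

# First-order partial correlation of A and B adjusted for C
r_ab_c = (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# Hypothesis test: same form as for an ordinary correlation, but on n - 3 df
df = n - 3
t = r_ab_c * np.sqrt(df / (1 - r_ab_c**2))
p = 2 * stats.t.sf(abs(t), df)

print(f"partial r(A,B|C) = {r_ab_c:.3f}, t({df}) = {t:.2f}, p = {p:.4f}")
```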

Multiple Correlations

In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables. It is the correlation between the variable's values and the best predictions that can be computed linearly from the predictive variables.

The coefficient of multiple correlation takes values between .00 and 1.00; a higher value indicates higher predictability of the dependent variable from the independent variables, with a value of 1 indicating that the predictions are exactly correct and a value of 0 indicating that no linear combination of the independent variables is a better predictor than is the fixed mean of the dependent variable.

The coefficient of multiple correlation can be computed as the square root of the coefficient of determination, under the particular assumptions that an intercept is included and that the best possible linear predictors are used, whereas the coefficient of determination is defined for more general cases, including those of nonlinear prediction and those in which the predicted values have not been derived from a model-fitting procedure.

The coefficient of multiple correlation, denoted R, is a scalar that is defined as the Pearson correlation coefficient between the predicted and the actual values of the dependent variable in a linear regression model that includes an intercept.
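
The following sketch (with simulated data, so all variable names and coefficients are illustrative) computes R exactly as defined above, as the Pearson correlation between actual and fitted values from a regression with an intercept, and confirms that its square equals the coefficient of determination:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150

# Two hypothetical predictors and a response
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

# OLS fit with an intercept
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Coefficient of multiple correlation R and coefficient of determination R^2
R = np.corrcoef(y, y_hat)[0, 1]
R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

print(f"R = {R:.3f}, R^2 = {R2:.3f}, R**2 matches R2: {np.isclose(R**2, R2)}")
```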

An intuitive approach to the multiple regression analysis is to sum the squared correlations
between the predictor variables and the criterion variable to obtain an index of the over-all
relationship between the predictor variables and the criterion variable. However, such a
sum is often greater than one, suggesting that simple summation of the squared coefficients
of correlations is not a correct procedure to employ. In fact, a simple summation of squared
coefficients of correlations between the predictor variables and the criterion variable is the
correct procedure, but only in the special case when the predictor variables are not
correlated. If the predictors are related, their inter-correlations must be removed so that
only the unique contributions of each predictor toward explanation of the criterion are
included.

Multiple linear regression (MLR) is a multivariate statistical technique for examining the
linear correlations between two or more independent variables (IVs) and a single dependent
variable (DV).

Research questions suitable for MLR can be of the form "To what extent do X1, X2, and
X3 (IVs) predict Y (DV)?"

e.g., "To what extent does people's age and gender (IVs) predict their levels of blood
cholesterol (DV)?"

MLR analyses can be visualized as path diagrams and/or Venn diagrams

FIGURE-1.6
1.10 ANALYSIS OF VARIANCE AND ITS APPLICATIONS IN
REGRESSION ANALYSIS

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyse the differences among group means in a sample. ANOVA was developed by the statistician and evolutionary biologist Ronald Fisher. The ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random
factors. The systematic factors have a statistical influence on the given data set, while the
random factors do not. Analysts use the ANOVA test to determine the influence that
independent variables have on the dependent variable in a regression study.

The independent variables are referred to as factors in an ANOVA, while the dependent
variable is referred to as the response variable. Factor levels refer to the various values of
the independent variables. A treatment is an experiment (or observation) at a certain level of
all elements.

An F test is used in single-factor ANOVA models (also known as one-way ANOVA models) to evaluate the equality of multiple means. It's crucial to understand that an ANOVA won't reveal the optimum treatment or the relationship between the variables and the response variable. It simply informs us if there is a notable distinction between factors and factor levels.

The two-way ANOVA is one of the most frequently used ANOVA models (with two
factors). Here, we may test for an interaction between the components as well as the
importance of the various factors. The majority of statistical software tools will provide us
with the full ANOVA table, which includes all of the crucial information we require to
determine if there is a difference between the treatment means or not. When we talk about
treatment methods, we imply combining each level of one element with each level of a
different factor. Our data may be represented graphically, and we can infer from these charts
whether there is or is not an interaction, as well as whether the first or second element had
the predominant influence.

An analysis of variance is used in business to examine any differences in a company's financial performance. Furthermore, it assists management in doing an extra control check on operational performance, hence maintaining operations under budget.

The ANOVA test allows you to investigate discrepancies in your data set by analyzing the
numerous elements that influence it. These techniques are used by analysts to create
supplementary data that is more compatible with regression models. When there is no
significant difference between the two tested groups, this is referred to as a ‘null
hypothesis,’ and the F-ratio of the ANOVA test should be near to one.

Stigler claims that although the analysis of variance was developed in the 20th century, its
precursors go back many centuries. These comprise the additive model, sums of squares
partitioning, experimentation, and hypothesis testing. In the 1770s, Laplace started testing
hypotheses. The least-squares approach for combining data was created by Laplace and
Gauss about 1800, and it replaced earlier techniques used in geodesy and astronomy. The
contributions to sums of squares have also been the subject of extensive investigation.
Laplace was able to calculate a variance from a residual sum of squares rather than a
complete sum. Laplace began applying least squares techniques to ANOVA issues involving
observations of atmospheric tides in 1827. Before the year 1800, astronomers had identified
observational mistakes (the "personal equation") caused by reaction delays and had created
techniques to minimize the flaws. The experimental techniques utilized in the investigation
of the personal equation were eventually approved by psychology, a newly forming
discipline. This produced robust (full factorial) experimental techniques, to which blinding
and randomization were shortly added. The additive effects model had an elegant non-
mathematical justification in 1885.

The Correlation Between Relatives on the Supposition of Mendelian Inheritance, a 1918 work by Ronald Fisher, introduced the word variance and suggested its systematic investigation. In 1921, he reported the results of his initial study of variance. Following its inclusion in Fisher's 1925 book Statistical Methods for Research Workers, analysis of variance gained widespread recognition.

A number of researchers created randomization models. Jerzy Neyman published the first in Polish in 1923.

The Formula for ANOVA is:

F = MST / MSE

where

F = ANOVA coefficient

MST = Mean sum of squares due to treatment

MSE = Mean sum of squares due to error
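
A minimal one-way ANOVA sketch, using three made-up treatment groups, shows how MST, MSE and the F ratio are obtained; scipy's built-in f_oneway is used only as a cross-check:

```python
import numpy as np
from scipy import stats

# Three hypothetical treatment groups
groups = [
    np.array([23.0, 25.1, 21.8, 24.4, 22.9]),
    np.array([27.3, 26.0, 28.1, 25.7, 26.9]),
    np.array([22.1, 23.5, 21.0, 22.8, 23.9]),
]

k = len(groups)                       # number of treatments
n = sum(len(g) for g in groups)       # total observations
grand_mean = np.mean(np.concatenate(groups))

# Between-treatment and within-treatment (error) sums of squares
ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

mst = ss_treatment / (k - 1)          # mean square due to treatment
mse = ss_error / (n - k)              # mean square due to error
F = mst / mse

p = stats.f.sf(F, k - 1, n - k)
print(f"F({k - 1}, {n - k}) = {F:.2f}, p = {p:.4f}")

# Cross-check against scipy's one-way ANOVA
print(stats.f_oneway(*groups))
```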

ANOVA is used in the analysis of comparative experiments, those in which only the
difference in outcomes is of interest. The statistical significance of the experiment is
determined by a ratio of two variances. This ratio is independent of several possible
alterations to the experimental observations: Adding a constant to all observations does not
alter significance. Multiplying all observations by a constant does not alter significance.
So, ANOVA statistical significance result is independent of constant bias and scaling errors
as well as the units used in expressing observations. In the era of mechanical calculation, it
was common to subtract a constant from all observations (when equivalent to dropping
leading digits) to simplify data entry. This is an example of data coding

The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an F-test to generate additional data that aligns with the proposed regression models.

The ANOVA test allows a comparison of more than two groups at the same time to
determine whether a relationship exists between them. The result of the ANOVA formula,
the F statistic (also called the F-ratio), allows for the analysis of multiple groups of data to
determine the variability between samples and within samples.

If no real difference exists between the tested groups, which is called the null hypothesis, the result of the ANOVA's F-ratio statistic will be close to 1. Fluctuations in its sampling will likely follow the Fisher F distribution. This is actually a family of distribution functions, with two characteristic numbers, called the numerator degrees of freedom and the denominator degrees of freedom.

For example, one might test students from several institutions to check whether one of the colleges regularly produces better performers than the others. In a corporate setting, an R&D researcher may examine two distinct product-creation methods to determine whether one is more cost-effective than the other.

The type of ANOVA test used depends on a number of factors. It is applied when the data are experimental. Analysis of variance can even be computed by hand if statistical software is unavailable; it is easy to use and works best with small samples. In many experimental designs, the sample sizes must be the same for all possible factor-level combinations.

An ANOVA is useful when examining three or more groups. It resembles running several two-sample t-tests, but it produces fewer Type I errors and is suitable for a wider range of problems. ANOVA compares the means of each group and apportions the variation across its different sources; it is used with test subjects and study groups, both between and within groups.

The analysis of variance can be used to describe otherwise complex relations among
variables. A dog show provides an example. A dog show is not a random sampling of the
breed: it is typically limited to dogs that are adult, pure-bred, and exemplary. A histogram
of dog weights from a show might plausibly be rather complex, like the yellow-orange
distribution shown in the illustrations. Suppose we wanted to predict the weight of a dog
based on a certain set of characteristics of each dog. One way to do that is to explain the
distribution of weights by dividing the dog population into groups based on those
characteristics. A successful grouping will split dogs such that (a) each group has a low
variance of dog weights (meaning the group is relatively homogeneous) and (b) the mean
of each group is distinct (if two groups have the same mean, then it isn't reasonable to
conclude that the groups are, in fact, separate in any meaningful way).

In the illustrations to the right, groups are identified as X1, X2, etc. In the first illustration,
the dogs are divided according to the product (interaction) of two binary groupings: young
vs old, and short-haired vs long-haired (e.g., group 1 is young, short-haired dogs, group 2 is
young, long-haired dogs, etc.). Since the distributions of dog weight within each of the
groups (shown in blue) has a relatively large variance, and since the means are very similar
across groups, grouping dogs by these characteristics does not produce an effective way to
explain the variation in dog weights: knowing which group a dog is in doesn't allow us to
predict its weight much better than simply knowing the dog is in a dog show. Thus, this
grouping fails to explain the variation in the overall distribution (yellow-orange).

An attempt to explain the weight distribution by grouping dogs as pet vs working breed and
less athletic vs more athletic would probably be somewhat more successful (fair fit). The
heaviest show dogs are likely to be big, strong, working breeds, while breeds kept as pets
tend to be smaller and thus lighter. As shown by the second illustration, the distributions
have variances that are considerably smaller than in the first case, and the means are more
distinguishable. However, the significant overlap of distributions, for example, means that
we cannot distinguish X1 and X2 reliably. Grouping dogs according to a coin flip might
produce distributions that look similar.

An attempt to explain weight by breed is likely to produce a very good fit. All Chihuahuas are light and all St Bernards are heavy. The difference in weights between Setters and Pointers does not justify separate breeds. The analysis of variance provides the formal tools to justify these intuitive judgments. A common use of the method is the analysis of experimental data or the development of models. The method has some advantages over correlation: not all of the data must be numeric, and one result of the method is a judgment of the confidence in an explanatory relationship.

1.11 REPORTING OF THE RESULTS OF REGRESSION


Results of the multiple linear regression indicated that there was a significant collective effect of gender and age on job satisfaction (F(9, 394) = 20.82, p < .001, R2 = .32). The individual predictors were examined further and indicated that age (t = -11.98, p = .002) and gender (t = 2.81, p = .005) were significant predictors in the model.

Results of the binary logistic regression indicated that there was a significant association
between age, gender, race, and passing the reading exam (χ2(3) = 69.22, p < .001).

In the above examples, the numbers in parentheses after the test statistics F and χ2 again represent the degrees of freedom. The F statistic will always have two numbers reported for the degrees of freedom, following the format (df regression, df error). For statistics such as R2 and p-values, where the number before the decimal point is assumed to be zero, the 0 is omitted.
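
As a hedged sketch of where these reported quantities come from in practice, the snippet below fits an OLS model in statsmodels on simulated data (the age/gender/satisfaction variables are invented) and prints F with its two degrees of freedom, the p-value and R2:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 120

# Illustrative predictors (age and a gender dummy) and an outcome
age = rng.uniform(20, 60, size=n)
gender = rng.integers(0, 2, size=n)
satisfaction = 3.0 - 0.02 * age + 0.4 * gender + rng.normal(scale=0.8, size=n)

X = sm.add_constant(np.column_stack([age, gender]))
fit = sm.OLS(satisfaction, X).fit()

# The pieces quoted in a write-up: F(df regression, df error), its p-value, R^2
df1, df2 = int(fit.df_model), int(fit.df_resid)
print(f"F({df1}, {df2}) = {fit.fvalue:.2f}, p = {fit.f_pvalue:.3f}, R2 = {fit.rsquared:.2f}")
# By convention the leading zero on p and R2 is dropped in the written report,
# e.g. "p = .002" rather than "p = 0.002".
print("per-predictor t values:", np.round(fit.tvalues, 2))
```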

Results of the independent-samples t-tests indicated that there were no significant differences in job satisfaction between males and females (t(29) = -1.85, p = .074).

Results of the dependent (paired) samples t-tests indicated that there were significant differences in job satisfaction between pretest and posttest (t(33) = 37.25, p < .001).

Once again, for t-tests, the number in parentheses following the t is the degrees of freedom. Results of the ANOVA indicated that there were no significant differences in job satisfaction between ethnicities (F(2, 125) = 0.16, p = .854, partial η2 = .003).

Following the F notation from the previous regression example, the first number in
parentheses refers to the numerator degrees of freedom and the second number corresponds
to the denominator (error) degrees of freedom. The partial η2 refers to the effect size of the
test.

1.12 SUMMARY

 Econometrics deals with the measurement of economic relationships. It is an integration of economics, mathematical economics and statistics with an objective to provide numerical values to the parameters of economic relationships.
 The relationships of economic theories are usually expressed in mathematical forms
and combined with empirical economics.

 The econometrics methods are used to obtain the values of parameters which are
essentially the coefficients of mathematical form of the economic relationships.

 The statistical methods which help in explaining the economic phenomenon are
adapted as econometric methods.

 The econometric relationships depict the random behavior of economic relationships which are generally not considered in economics and mathematical formulations.

 It may be pointed out that the econometric methods can be used in other areas like
engineering sciences, biological sciences, medical sciences, geosciences, agricultural
sciences etc.

 Econometrics is the use of statistical methods to develop theories or test existing hypotheses in economics or finance.

 Econometrics relies on techniques such as regression models and null hypothesis testing.

 Econometrics can also be used to try to forecast future economic or financial trends.

 As with other statistical tools, econometricians should be careful not to infer a causal
relationship from statistical correlation.

 Some economists have criticized the field of econometrics for prioritizing statistical
models over economic reasoning.

1.13 KEYWORDS

 Dependent variable – The variable of interest (e.g., household WTP): a variable that is observed to arise based on the levels of all independent variables.

 Independent variable – Variables that are considered changeable and that affect the dependent variable.

 Continuous variable: A variable which in the simplest terms can take 'any' value. For example, a survey respondent's age or the distance from a respondent's home to a visitor site.

 Categorical variable: A variable which can take a value from a finite set of values.
Survey questions that require respondents to answer on the basis of a Likert scale
result in categorical variables.

 Dummy variable: Categorical variables believed to influence the outcome of a regression model can be coded as 1 or 0 for existing or not existing respectively. Inclusion of dummy variables can help to increase the fit of a model, but at a loss of generality of the model.

1.14 LEARNING ACTIVITY

1. Define Hypothesis.

___________________________________________________________________________
___________________________________________________________________________

2. State the history of Econometrics

__________________________________________________________________________
__________________________________________________________________________

1.15 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Describe the nature of econometrics.

2. Explain the scope of econometrics.

3. What is simple linear regression?

4. Explain the general linear regression model.

5. Define regression in short.

6. What is the ordinary least squares (OLS) approach to estimation?

7. What are the properties of estimators?


Long Questions

1. Write a note on testing of hypotheses for the normal distribution, t-distribution and chi-square distribution.

2. Explain the nature, meaning and scope of econometrics in detail.

3. Describe the assumptions and estimators of OLS along with their properties.

4. Explain the Gauss–Markov theorem.

5. Write a note on analysis of variance and its application in regression analysis.

6. How is reporting of the results done for regression?

B. Multiple Choice Questions

1. Two events, A and B, are said to be mutually exclusive if:

A. P (A | B) = 1

B. P (B | A) = 1

C. P (A and B) = 1

D. P (A and B) = 0

2. A Type I error occurs when we:

A. reject a false null hypothesis

B. reject a true null hypothesis

C. do not reject a false null hypothesis


D. do not reject a true null hypothesis

3. What is the meaning of the term "heteroscedasticity"?

A. The variance of the errors is not constant

B. The variance of the dependent variable is not constant

C. The errors are not linearly independent of one another

D. The errors have non- zero mean

4. What would be the consequences for the OLS estimator if heteroscedasticity is present in a regression model but ignored?

A. It will be biased

B. It will be inconsistent

C. It will be inefficient

D. All of (a), (b) and (c) will be true

5. Which one of the following is NOT a plausible remedy for near multicollinearity?

A. Use principal components analysis

B. Drop one of the collinear variables


C. Use a longer run of data

D. Take logarithms of each of the variables

Answers

1. D, 2. B, 3. A, 4. C, 5. D

1.16 REFERENCES

Suggested Readings
 Gujarati, D., Porter, D.C and Guna Sekhar, C. Basic Econometrics, McGraw Hill
Education.
 Anderson, D. R., D. J. Sweeney and T. A. Williams. Statistics for Business and
Economics. 12th Edition, Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third edition,
Thomson South-Western.
 Johnston, J., Econometric Methods, 3rd Edition, McGraw Hill, New York.
 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic Press.

 Koutsoyiannis, A., The Theory of Econometrics.

Books

 Mostly Harmless Econometrics: An Empiricist’s Companion


 Using Econometrics: A Practical Guide
 Introductory Econometrics: A Modern Approach
 Introduction to Econometrics, (Pearson Series in Economics)
 Econometric Analysis of Cross Section and Panel Data (MIT Press)
 Micro econometrics Using Stata
 Econometric Analysis
Website:

 http://home.iitk.ac.in/~shalab/econometrics/Chapter1-Econometrics-
IntroductionToEconometrics.pdf

UNIT- 2 PROBLEMS IN REGRESSION ANALYSIS I

STRUCTURE

2.0 Learning Objective

2.1 Introduction

2.2 Nature, test, consequences and remedial steps of problems of heteroscedasticity

2.3 OLS estimation in the presence of heteroscedasticity

2.4 Method of Generalized Least Squares

2.5 Nature, test, consequences and remedial steps of Multicollinearity

2.6 Estimation in the presence of perfect multicollinearity

2.7 Estimation in the presence of high but imperfect multicollinearity

2.8 Summary

2.9 Keywords

2.10 Learning Activity

2.11 Unit End Questions

2.12 References

2.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

 To understand the nature, tests, consequences and remedial steps of problems of heteroscedasticity

 To learn about problem in regression analysis.

 To understand method of generalized least square

 This module will help the students to comprehend Multicollinearity.

 Deep understanding about Estimation in the presence of high but imperfect multicollinearity
2.1 INTRODUCTION

In statistical modeling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the ‘outcome variable’) and one or more
independent variables (often called ‘predictors’, ‘covariates’, or ‘features’).
The terminology you will often come across in connection with regression analysis is:
Dependent variable or target variable: Variable to predict.
Independent variable or predictor variable: Variables to estimate the dependent variable.
Outlier: Observation that differs significantly from other observations. It should be avoided since it
may hamper the result.
Multicollinearity: Situation in which two or more independent variables are highly linearly related.

Homoscedasticity or homogeneity of variance: Situation in which the variance of the error term is the same across all values of the independent variables.
Regression analysis is primarily used for two distinct purposes. First, it is widely used for prediction
and forecasting, which overlaps with the field of machine learning. Second, it is also used to infer
causal relationships between independent and dependent variables.
We introduced the idea of a statistical relationship between two variables, such as a relationship
between sales volume and advertising expenditure, a relationship between crop yield and fertilizer
input, a relationship between product price and supply, and so on. Such variables' relationships
show the strength and direction of their association but do not address the following: Are there any
functional (or algebraic) relationships between the variables? If so, can the most likely value of one
variable be estimated using the value of the other variable?
Regression analysis is a statistical method for estimating the value of a variable based on the known
value of another variable. It expresses the relationship between two or more variables as an
equation. Dependent (or response) variable refers to the variable whose value is estimated using the
algebraic equation, and independent (regressor or predictor) variable refers to the variable whose
value is used to estimate this value. The term "linear regression equation" refers to the algebraic
formula used to express a dependent variable in terms of an independent variable.

Sir Francis Galton first used the term regression in 1877 while studying the relationship between the heights of fathers and sons. He found that although tall fathers tend to have tall sons, if a father's height is x above the average, his son's height tends to be only about 2x/3 above the average. Galton referred to this pull of heights back towards the average as "regression to mediocrity." Because Galton's particular finding is not universally applicable, the term regression is used in business and economics to refer to relationships between other types of variables as well. In the literary sense, regression simply means "moving backward."

The following is a summary of the key distinctions between correlation and regression analysis:

1. Regression analysis is the process of creating an algebraic equation between two variables from
sample data and predicting the value of one variable given the value of the other variable, while
correlation analysis is the process of determining the strength (or degree) of the relationship
between two variables. The absolute value of the correlation coefficient, on the other hand,
indicates the degree of the relationship between the two variables, while the sign of the correlation
coefficient indicates the nature of the relationship (direct or inverse).

2. The results of a correlation analysis show a connection between two variables x and y, but not a
cause-and-effect connection. In contrast to correlation, regression analysis establishes the cause-
and-effect relationship between x and y, i.e., that a change in the value of the independent variable
x results in an equivalent change (effect) in the value of the dependent variable y, provided that all
other variables that have an impact on y remain constant.

3. While both variables are regarded as independent in correlation analysis, one variable is
considered the dependent variable and the other the independent variable in linear regression
analysis.

1. The amount of variance in the dependent variable that can be explained or accounted for
by variation in the independent variable is indicated by the coefficient of determination,
or r2. R2's value is subject to sampling error because it is derived from a sample. The
assumption of a linear regression may be incorrect even if the value of r2 is high because
it may represent a portion of the relationship that is actually in the shape of a curve.

Several significant benefits of regression analysis include the following:


1. Regression analysis aids in the creation of a regression equation that can be used to estimate the
value of a dependent variable given the value of an independent variable.
2. Standard error of estimate is calculated using regression analysis to assess the variability or range
of values of a dependent variable in relation to the regression line. A good estimate of the value of
variable y can be made because the regression line fits the data better and the pair of values (x, y)
fall closer to it when the variance and estimation error are smaller. The standard error of estimate
equals zero when all the points lie on the line.
3. Changing the values of either x or y is thought to be acceptable when the sample size is large (df
29) and interval estimation is used to predict the value of a dependent variable based on standard
error of estimate. Regardless of the values of the two variables, r2 has a constant magnitude.

Different Kinds of Regression Models-

The creation of a regression model to explain the relationship between two or more variables in a
given population is the main goal of regression analysis. The mathematical equation known as a
regression model predicts the value of the dependent variable based on the values of one or more
independent variables that are known.
The type of data available and the nature of the problem being studied determine which regression
model should be used. However, an equation linking a dependent variable to one or more
independent variables can be used to describe each type of association or relationship.
Regression Analysis
Regression analysis is the oldest, and probably, most widely used multivariate technique in the social
sciences. Unlike the preceding methods, regression is an example of dependence analysis in which
the variables are not treated symmetrically. In regression analysis, the object is to obtain a
prediction of one variable, given the values of the others. To accommodate this change of viewpoint,
a different terminology and notation are used. The variable being predicted is usually denoted by y
and the predictor variables by x with subscripts added to distinguish one from another. In linear
multiple regression, we look for a linear combination of the predictors (often called regressor
variables). For example, in educational research, we might be interested in the extent to which
school performance could be predicted by home circumstances, age, or performance on a previous
occasion. In practice, regression models are estimated by least squares using appropriate software.
Important practical matters concern the best selection of the best regressor variables, testing the
significance of their coefficients, and setting confidence limits to the predictions.

In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of it.
For example, global warming may be reducing average snowfall in your town and you are asked to
predict how much snow you think will fall this year. Looking at the following table you might guess
somewhere around 10-20 inches. That’s a good guess, but you could make a better guess, by using
regression.

Fig. -2.1 Regression Models

Essentially, regression is the “best guess” at using a set of data to make some kind of prediction. It’s
fitting a set of points to a graph. There’s a whole host of tools that can run regression for you,
including Excel, which I used here to help make sense of that snowfall data:
Fig.-2.2 Regression Models Amount

Just by looking at the regression line running down through the data, you can fine tune your best
guess a bit. You can see that the original guess (20 inches or so) was way off. For 2015, it looks like
the line will be somewhere between 5 and 10 inches! That might be “good enough”, but regression
also gives you a useful equation, which for this chart is:
y = -2.2923x + 4624.4.
What that means is you can plug in an x value (the year) and get a pretty good estimate of snowfall
for any year. For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches
for that year.

Best of all, you can use the equation to make predictions. For example, how much snow will fall in
2017?

y = -2.2923(2017) + 4624.4 ≈ 0.8 inches.

Regression also gives you an R squared value, which for this graph is 0.702. This number tells you
how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a
perfect model. As you can probably see, 0.7 is a fairly decent model so you can be fairly confident in
your weather prediction.
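
The same kind of fit can be reproduced with a few lines of numpy; the year/snowfall pairs below are made up for illustration and are not the data behind the chart above:

```python
import numpy as np

# Hypothetical (year, snowfall in inches) pairs, standing in for the table above
years = np.array([2000, 2002, 2005, 2008, 2011, 2014])
snow = np.array([40.0, 33.0, 30.0, 22.0, 14.0, 9.0])

# Least-squares line: snow = slope * year + intercept
slope, intercept = np.polyfit(years, snow, deg=1)
print(f"fitted line: y = {slope:.4f}x + {intercept:.1f}")

# Prediction for a new year and the R^2 of the fit
pred_2017 = slope * 2017 + intercept
fitted = slope * years + intercept
r2 = 1 - np.sum((snow - fitted) ** 2) / np.sum((snow - snow.mean()) ** 2)
print(f"predicted snowfall for 2017: {pred_2017:.1f} inches, R^2 = {r2:.2f}")
```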

2.2 NATURE, TEST, CONSEQUENCES AND REMEDIAL STEPS OF PROBLEMS OF HETEROSCEDASTICITY

In statistics, a sequence (or a vector) of random variables is homoscedastic if all its random variables
have the same finite variance. This is also known as homogeneity of variance. The complementary
notion is called heteroscedasticity. The spellings homoskedasticity and heteroskedasticity are also
frequently used.
Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but
inefficient point estimates and in biased estimates of standard errors, and may result in
overestimating the goodness of fit as measured by the Pearson coefficient.

The existence of heteroscedasticity is a major concern in regression analysis and the analysis of
variance, as it invalidates statistical tests of significance that assume that the modelling errors all
have the same variance. While the ordinary least squares estimator is still unbiased in the presence
of heteroscedasticity, it is inefficient and generalized least squares should be used instead.

Because heteroscedasticity concerns expectations of the second moment of the errors, its presence
is referred to as misspecification of the second order.

The econometrician Robert Engle was awarded the 2003 Nobel Memorial Prize for Economics for his
studies on regression analysis in the presence of heteroscedasticity, which led to his formulation of
the autoregressive conditional heteroscedasticity (ARCH) modeling technique

Heteroscedasticity implies that the variances (i.e. - the dispersion around the expected mean of
zero) of the residuals are not constant, but that they are different for different observations. This
causes a problem: if the variances are unequal, then the relative reliability of each observation (used
in the regression analysis) is unequal. The larger the variance, the lower should be the importance
(or weight) attached to that observation.

Nature of Heteroscedasticity

One of the key assumptions of the classical linear regression model is that the variance of each disturbance term ui, conditional on the chosen values of the explanatory variables, is some fixed number equal to σ2. This is the assumption of homoscedasticity, or equal (homo) spread (scedasticity), that is, equal variance. Symbolically,

E(ui2) = σ2 for all i.

Diagrammatically, homoscedasticity in the two-variable regression model may be displayed as in Fig. 2.3 below.

Fig. 2.3 Homoscedastic disturbances

Fig. 2.4 Heteroscedastic disturbances

Fig. 2.3 demonstrates that regardless of the values taken by the variable X, the conditional variance of Yi (which is equal to that of ui), conditional upon the given Xi, remains the same.

Fig. 2.4, on the other hand, shows that the conditional variance of Yi grows as X grows. The variances of Yi are not the same in this case; hence, heteroscedasticity exists. Symbolically, E(ui2) = σi2. The subscript on σ2 serves to remind us that the conditional variances of ui (= conditional variances of Yi) are no longer constant.

To make the distinction between homoscedasticity and heteroscedasticity clear, assume that in the two-variable model Yi = β1 + β2Xi + ui, Y represents savings and X represents income. Both figures show that as income rises, savings on average also rise. But in Fig. 2.3 the variance of savings remains constant at all levels of income, whereas in Fig. 2.4 it increases with income: higher-income families on average save more than lower-income families, but there is also more variability in their savings.

The variances of ui may differ for a number of reasons, some of which are listed here.

1. Following error-learning models, as people learn, their errors of behaviour become smaller over time, and σi2 is expected to decrease. An example is the relationship between the number of hours of typing practice and the number of typing errors made in a test: as the number of practice hours increases, both the average number of typing errors and their variability decrease.

2. As incomes grow, people have more discretionary income and hence greater freedom to choose how to spend it. As a result, σi2 is likely to rise with income. Thus in a regression of savings on income, σi2 is likely to increase with income (as in Fig. 2.4), because people have more choices about their saving behaviour. Similarly, companies with larger profits are generally expected to show greater variability in their dividend policies than companies with lower profits, and growth-oriented companies are likely to show more variability in their dividend payout ratios than established ones. (See Stefan Valavanis, Econometrics, McGraw-Hill, New York, 1959, p. 48. According to Valavanis, "Income accumulates, and individuals can scarcely distinguish dollars where they could previously distinguish dimes," ibid., p. 48.)

3. As data-collecting techniques improve, σi2 is likely to decrease. Thus banks with sophisticated data-processing equipment are likely to commit fewer errors in the monthly or quarterly statements of their customers than banks without such facilities.

Fig 2.5 Illustration of heteroscedasticity
4. The existence of outliers can also lead to heteroscedasticity. An outlier, or outlying observation, is an observation that is very different (very small or very large) in relation to the other observations in the sample; more precisely, it is an observation from a population other than the one generating the remaining sample observations. The inclusion or exclusion of such an observation, especially if the sample size is small, can substantially alter the results of regression analysis.

Consider, as an illustration, a scattergram of the percent rate of change of stock prices (Y) and consumer prices (X) for the post-World War II period through 1969 for 20 countries. Because the Y and X values for Chile are substantially higher than for the other countries, the observation for Chile can be regarded as an outlier. In situations such as this it would be hard to maintain the assumption of homoscedasticity.

Tests of Heteroscedasticity

Heteroskedasticity affects both the estimation and the testing of hypotheses, and numerous factors can cause data to become heteroskedastic. Each test for heteroskedasticity makes some assumption about the form of the heteroskedasticity. Several tests are available in the literature, including the following.

1. Bartlett test-

In statistics, Bartlett's test, named after Maurice Stevenson Bartlett, is used to test
homoscedasticity, that is, if multiple samples are from populations with equal variances. Some
statistical tests, such as the analysis of variance, assume that variances are equal across groups or
samples, which can be verified with Bartlett's test.

In a Bartlett test, we construct the null and alternative hypotheses. For this purpose, several test procedures have been devised; the test procedure due to M. S. Bartlett is presented here. It is based on a statistic whose sampling distribution is approximately a chi-square distribution with (k − 1) degrees of freedom, where k is the number of random samples, which may vary in size and are each drawn from independent normal distributions. Bartlett's test is sensitive to departures from normality: if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
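
A quick sketch of the Bartlett test using scipy, on three simulated samples where the third deliberately has a larger variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Three hypothetical samples; the third deliberately has a larger variance
a = rng.normal(loc=10, scale=1.0, size=30)
b = rng.normal(loc=10, scale=1.1, size=30)
c = rng.normal(loc=10, scale=3.0, size=30)

# H0: all population variances are equal (homoscedasticity)
stat, p = stats.bartlett(a, b, c)
print(f"Bartlett chi-square = {stat:.2f}, p = {p:.4f}")
# A small p-value is evidence against equal variances; because the test is
# sensitive to non-normality, Levene's test is a common alternative:
# scipy.stats.levene(a, b, c) is used the same way.
```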

2.Breusch Pagan test-

In statistics, the Breusch–Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan, is used
to test for heteroskedasticity in a linear regression model. It was independently suggested with
some extension by R. Dennis Cook and Sanford Weisberg in 1983 (Cook–Weisberg test). Derived
from the Lagrange multiplier test principle, it tests whether the variance of the errors from a
regression is dependent on the values of the independent variables. In that case, heteroskedasticity
is present.

Suppose that we estimate a linear regression model and obtain from the fitted model a set of values for û, the residuals. Ordinary least squares constrains these so that their mean is 0, and so, given the assumption that their variance does not depend on the independent variables, an estimate of this variance can be obtained from the average of the squared values of the residuals. If that assumption does not hold, a simple model might be that the variance is linearly related to the independent variables. Such a model can be examined by regressing the squared residuals on the independent variables, using an auxiliary regression equation.

This is the basis of the Breusch–Pagan test. It is a chi-squared test: a commonly used form of the test statistic is n·R2 from the auxiliary regression, which is asymptotically distributed as χ2 with k degrees of freedom (k being the number of regressors in the auxiliary equation). If the test statistic has a p-value below an appropriate threshold (e.g., p < 0.05) then the null hypothesis of homoskedasticity is rejected and heteroskedasticity is assumed.
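
A minimal sketch of the auxiliary-regression idea, using simulated data in which the error spread grows with x; it computes the commonly used studentized (Koenker) form of the LM statistic, n·R2 from regressing the squared residuals on the regressors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 300

# Simulated heteroscedastic data: error standard deviation grows with x
x = rng.uniform(1, 10, size=n)
u = rng.normal(scale=0.5 * x)            # sd proportional to x
y = 2.0 + 1.5 * x + u

def ols_resid_r2(y, X):
    """OLS fit; return residuals and R^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return e, r2

X = np.column_stack([np.ones(n), x])
resid, _ = ols_resid_r2(y, X)

# Auxiliary regression of squared residuals on the regressors
_, r2_aux = ols_resid_r2(resid**2, X)
lm = n * r2_aux                          # studentized (Koenker) LM form
k = 1                                    # regressors excluding the constant
p = stats.chi2.sf(lm, df=k)
print(f"Breusch-Pagan LM = {lm:.2f}, p = {p:.4g}")
```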

3. The Goldfeld–Quandt test

In statistics, the Goldfeld–Quandt test checks for homoscedasticity in regression analyses. It does this by dividing a dataset into two parts or groups, and hence the test is sometimes called a two-group test. The Goldfeld–Quandt test is one of two tests proposed in a 1965 paper by Stephen Goldfeld and Richard Quandt. Both a parametric and a nonparametric test are described in the paper, but the term "Goldfeld–Quandt test" is usually associated only with the former.

Advantages and disadvantages

The parametric Goldfeld–Quandt test offers a simple and intuitive diagnostic for heteroskedastic errors in a univariate or multivariate regression model. However, some disadvantages arise under certain specifications or in comparison to other diagnostics, namely the Breusch–Pagan test, as the Goldfeld–Quandt test is somewhat of an ad hoc test. Primarily, the Goldfeld–Quandt test requires that data be ordered along a known explanatory variable, from lowest to highest. If the error structure depends on an unknown or unobserved variable, the Goldfeld–Quandt test provides little guidance. Also, error variance must be a monotonic function of the specified explanatory variable. For example, when faced with a quadratic function mapping the explanatory variable to error variance, the Goldfeld–Quandt test may improperly accept the null hypothesis of homoscedastic errors.
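
A sketch of the parametric Goldfeld–Quandt procedure on simulated data: order the observations by the suspect regressor, drop a central block, fit separate regressions to the low and high groups, and compare the residual sums of squares with an F ratio (the 20% drop fraction is an illustrative choice, not a fixed rule):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 120

x = np.sort(rng.uniform(1, 10, size=n))
y = 1.0 + 2.0 * x + rng.normal(scale=0.4 * x)   # error spread rises with x

def rss(y, x):
    """Residual sum of squares and residual df from a simple OLS fit."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), len(x) - X.shape[1]

# Drop the middle ~20% of the x-ordered observations, split the rest in two
drop = n // 5
lo, hi = slice(0, (n - drop) // 2), slice(n - (n - drop) // 2, n)
rss1, df1 = rss(y[lo], x[lo])
rss2, df2 = rss(y[hi], x[hi])

F = (rss2 / df2) / (rss1 / df1)     # a large F suggests variance grows with x
p = stats.f.sf(F, df2, df1)
print(f"Goldfeld-Quandt F({df2}, {df1}) = {F:.2f}, p = {p:.4g}")
```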

4. Glejser test

In statistics, the Glejser test for heteroscedasticity, developed by Herbert Glejser, regresses the
residuals on the explanatory variable that is thought to be related to the heteroscedastic variance.
After it was found not to be asymptotically valid under asymmetric disturbances, similar
improvements have been independently suggested by Im, and Machado and Santos Silva.

5.Spearman rank correlation coefficient test-


In statistics, Spearman's rank correlation coefficient, or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical, for a correlation of 1) rank (i.e., relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed, for a correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for both continuous and discrete ordinal variables. Both Spearman's ρ and Kendall's τ can be formulated as special cases of a more general correlation coefficient. As a test for heteroscedasticity, the idea is applied by computing the rank correlation between the absolute values of the OLS residuals and the explanatory variable suspected of driving the changing variance.
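
Applied to heteroscedasticity, the rank-correlation idea can be sketched as follows with simulated data: compute the OLS residuals, then test the Spearman correlation between the regressor and the absolute residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 200

x = rng.uniform(1, 10, size=n)
y = 3.0 + 1.2 * x + rng.normal(scale=0.3 * x)   # heteroscedastic errors

# OLS residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Rank correlation between the regressor and the absolute residuals:
# a significant positive rho suggests the error spread rises with x.
rho, p = stats.spearmanr(x, np.abs(resid))
print(f"Spearman rho = {rho:.3f}, p = {p:.4g}")
```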

6.The White Test –

In statistics, the White test is a statistical test that establishes whether the variance of the errors in a regression model is constant: that is, it tests for homoskedasticity.
This test, and an estimator for heteroscedasticity-consistent standard errors, were proposed by
Halbert White in 1980. These methods have become extremely widely used, making this paper one
of the most cited articles in economics.
In cases where the White test statistic is statistically significant, heteroskedasticity may not
necessarily be the cause; instead, the problem could be a specification error. In other words, the
White test can be a test of heteroskedasticity or specification error or both. If no cross-product
terms are introduced in the White test procedure, then this is a test of pure heteroskedasticity. If
cross products are introduced in the model, then it is a test of both heteroskedasticity and
specification bias.
.

7.Ramsey test-

In statistics, the Ramsey Regression Equation Specification Error Test (RESET) is a general specification test for the linear regression model. More specifically, it tests whether non-linear combinations of the fitted values help explain the response variable. The intuition behind the test is that if non-linear combinations of the explanatory variables have any power in explaining the response variable, the model is misspecified in the sense that the data-generating process might be better approximated by a polynomial or another non-linear functional form.

The test was developed by James B. Ramsey as part of his Ph.D. thesis at the University of
Wisconsin–Madison in 1968, and later published in the Journal of the Royal Statistical Society in
1969.

8. Szroeter test

As an alternative to the score test used by estat hettest, Stata's estat szroeter command uses a rank test for heteroskedasticity. The Szroeter test evaluates the alternative hypothesis that the error variance rises monotonically in the variables under test. Monotonic means one-way: if the variance is rising monotonically it can never decrease. It may rise along with the regression line and then level off and remain constant, but as long as it never decreases it is regarded as monotonic.

This command tests the independent variables of a linear regression. Depending on the options chosen, you can test a single variable, a group of named variables, or all independent variables. When several hypotheses are tested at once, the risk of mistakenly rejecting a true null hypothesis (a false positive, or Type I error) grows with each additional test, so the p-values of the multiple tests need to be adjusted to reflect this.

Three adjustments are available in Stata, requested through the command's multiple-testing option (bonferroni, holm or sidak). The first is the Bonferroni adjustment, which controls both the per-family Type I error rate and the probability of making any Type I error at all. The second is the Holm–Bonferroni method, which is always at least as powerful as the Bonferroni correction; unlike the Bonferroni correction, however, it does not control the per-family error rate. The third option is the Šidák correction, which is more powerful than the other two adjustments only when the hypothesis tests are independent or positively dependent.

If you do not specify a multiple-test adjustment, the default is to leave the p-values uncorrected. When conducting this test simultaneously on several variables, it is strongly advised that you apply some form of correction.

9. The nonparametric peak test-

A variety of tests of hypothesis for continuous, dichotomous, and discrete outcomes were covered in
the three courses on hypothesis testing. While tests for dichotomous and discrete outcomes
concentrated on comparing proportions, tests for continuous outcomes focused on comparing
means. The tests that are discussed in the sections on hypothesis testing are all known as parametric
tests and are predicated on certain premises. For instance, all parametric tests presumptively
assume that the outcome is roughly normally distributed in the population when conducting tests of
hypothesis for means of continuous outcomes. This doesn't imply that the data in the sample that
was seen has a normal distribution; rather, it just means that the outcome in the entire population
that was not observed has a normal distribution. Investigators are at ease using the normalcy
assumption for many outcomes (i.e., most of the observations are in the center of the distribution
while fewer are at either extreme). Additionally, it turns out that a lot of statistical tests are resilient,
which means they keep their statistical qualities even if some of the assumptions are not fully
satisfied. Based on the Central Limit Theorem, tests are resilient in the presence of breaches of the
normality assumption when the sample size is high (see page 11 in the module on Probability).
Alternative tests known as nonparametric tests are acceptable when the sample size is small, the
distribution of the outcome is unknown, and it cannot be assumed that it is roughly normally
distributed.
Nonparametric tests are sometimes called distribution-free tests because they are based on fewer
assumptions (e.g., they do not assume that the outcome is approximately normally distributed).
Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests
involve estimation of the key parameters of that distribution (e.g., the mean or difference in means)
from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less
powerful than their parametric counterparts (i.e., when the alternative is true, they may be less
likely to reject H0).
It can sometimes be difficult to assess whether a continuous outcome follows a normal distribution
and, thus, whether a parametric or nonparametric test is appropriate. There are several statistical
tests that can be used to assess whether data are likely from a normal distribution. The most popular
are the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test. Each test is
essentially a goodness of fit test and compares observed data to quantiles of the normal (or other
specified) distribution. The null hypothesis for each test is H0: Data follow a normal distribution
versus H1: Data do not follow a normal distribution. If the test is statistically significant (e.g., p<0.05),
then data do not follow a normal distribution, and a nonparametric test is warranted. It should be
noted that these tests for normality can be subject to low power. Specifically, the tests may fail to
reject H0: Data follow a normal distribution when in fact the data do not follow a normal
distribution. Low power is a major issue when the sample size is small - which unfortunately is often
when we wish to employ these tests. The most practical approach to assessing normality involves
investigating the distributional form of the outcome in the sample using a histogram and to augment
that with data from other studies, if available, that may indicate the likely distribution of the
outcome in the population.
There are some situations when it is clear that the outcome does not follow a normal distribution. These include situations:
 when the outcome is an ordinal variable or a rank,
 when there are definite outliers, or
 when the outcome has clear limits of detection.

CONSEQUENCES

1. Ordinary least squares estimators are still linear and unbiased.
2. Ordinary least squares estimators are not efficient.
3. The usual formulas give incorrect standard errors for least squares.
4. Confidence intervals and hypothesis tests based on the usual standard errors are wrong.

The existence of heteroscedasticity in the error term of an equation violates Classical Assumption V, and the estimation of the equation with OLS has at least three consequences:

OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE. This implies that if we still use OLS in the presence of heteroscedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading. Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroscedasticity.

In the presence of heteroscedasticity, the variances of the OLS estimators are not provided by the usual OLS formulas. But if we persist in using the usual OLS formulas, the t and F tests based on them can be highly misleading, resulting in erroneous conclusions.

REMEDIAL STEPS OF PROBLEMS OF HETEROSCEDASTICITY

The unbiasedness and consistency properties of the OLS estimators remain intact even after
Heteroscedasticity, but they are no longer efficient, not even asymptotically (i.e., large sample size).
This lack of efficiency makes the usual hypothesis-testing procedure of dubious value.
Hence, remedial measures are needed. Below discussed are the various approaches to correcting
Heteroscedasticity.
1.Whentrue error variance
is known: the Method of Generalized Least Squares
(GLS) Estimator
If the variance of error term is non-constant, then the best linear unbiased estimator (BLUE) is the
generalized least squares (GLS) estimator. It is also called as weighted least squares (WLS)
estimator.
the GLS estimator identical to weighted
least squares estimator. Or one can say that, the GLS estimator is a particular kind of WLS
estimator. In GLS estimation, a weight we given to each observation of each variable, which is
inversely proportional to the standard deviation of the error term. It implies that observations
with a smaller error variance are given more weight in the GLS regression and observations with
a large error variance is given less weight
2. Remedial measures when the true error variance is unknown
The WLS method makes an implicit assumption that the true error variance is known. In reality, however, it is difficult to have knowledge of the true error variance, so we need other methods to obtain a consistent estimate of the variance of the error term.
Feasible Generalized Least Squares (FGLS) Estimator
In this method, we use the sample data to obtain an estimate of the error variance and then apply the GLS estimator using that estimate. When we do this, we have a different estimator called the Feasible Generalized Least Squares Estimator, or FGLS estimator.
It is important to note that if the variance function assumed in the FGLS procedure is a reasonable approximation of the true heteroscedasticity, then the FGLS estimator is asymptotically more efficient than the OLS estimator. However, if it is not a reasonable approximation of the true heteroscedasticity, the FGLS estimator can produce worse estimates than the OLS estimator.
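A rough FGLS sketch follows, under one common auxiliary assumption, namely that the log of the squared residuals is linear in the regressors; this is only one of several possible specifications, and the data and names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4 * x)      # heteroscedastic errors
X = sm.add_constant(x)

ols_res = sm.OLS(y, X).fit()                        # step 1: OLS residuals
aux = sm.OLS(np.log(ols_res.resid ** 2), X).fit()   # step 2: auxiliary regression
var_hat = np.exp(aux.fittedvalues)                  # estimated error variances

fgls_res = sm.WLS(y, X, weights=1.0 / var_hat).fit()  # step 3: feasible GLS
print(fgls_res.params, fgls_res.bse)
```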
White's Correction: Heteroscedasticity-Consistent Variances and Standard Errors
White's method shows that statistical inferences can still be made about the true parameter values in large samples (i.e., they are asymptotically valid). It provides a way to obtain consistent estimates of the variances and covariances of the OLS estimates, and is also called the "heteroscedasticity-consistent covariance matrix" estimator. Because these estimated variances converge to the true variances as the sample size increases indefinitely, White's heteroscedasticity-corrected standard errors are also known as "Robust Standard Errors".
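In practice, White-type robust standard errors can be requested directly; the following minimal sketch (simulated data, illustrative names) uses the cov_type option of statsmodels.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 250)
y = 3.0 - 0.8 * x + rng.normal(scale=0.5 * x)
X = sm.add_constant(x)

usual = sm.OLS(y, X).fit()                   # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")    # White-type robust standard errors

print(usual.bse)
print(robust.bse)   # asymptotically valid even under heteroscedasticity
```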
2.3 OLS ESTIMATION IN THE PRESENCE OF
HETEROSCEDASTICITY

When heteroscedasticity is present in the data, estimates based on Ordinary Least Squares (OLS) are subject to the following consequences:
We cannot apply the usual formula for the variance of the coefficients to conduct tests of significance and construct confidence intervals.

If the error term is heteroscedastic, then the OLS estimates do not have the minimum variance property in the class of unbiased estimators, i.e., they are inefficient in small samples. Furthermore, they are asymptotically inefficient.

The estimated coefficients remain statistically unbiased. That means the unbiasedness property of OLS estimation is not violated by the presence of heteroscedasticity.
Forecasts based on a model with heteroscedasticity will be less efficient, as OLS estimation yields higher values of the variance of the estimated coefficients.

All this means the standard errors will be underestimated and the t-statistics and F-statistics will be inaccurate. Heteroscedasticity can be caused by a number of factors, but the main cause is when the variables have substantially different values across observations. For instance, GDP will suffer from heteroscedasticity if we include large countries such as the USA and small countries such as Cuba; in this case it may be better to use GDP per person. Also note that heteroscedasticity tends to affect cross-sectional data more than time series data.
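As a practical aside, one way to check for this kind of heteroscedasticity before deciding on a remedy is a formal test such as Breusch-Pagan. The sketch below uses simulated data and statsmodels; all variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 400)
y = 5.0 + 1.5 * x + rng.normal(scale=0.2 * x)   # error variance rises with x
X = sm.add_constant(x)

resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_pval)        # a small value signals heteroscedasticity
```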

2.4 METHOD OF GENERALIZED LEAST SQUARE

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals of the model. In such cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1934.

GLS is one of the most popular methods for estimating the unknown coefficients of a linear regression model when the residuals are correlated or have unequal variances. The ordinary least squares (OLS) method estimates the parameters of a linear regression model by minimizing the sum of the squares of the differences between the observed responses in the given dataset and those predicted by a linear function. The main advantage of using OLS regression for estimating parameters is that it is easy to use. However, OLS gives robust results only if there are no missing values in the data and there are no major outliers in the data set. Moreover, the OLS regression model does not take into account unequal variance, or 'heteroskedastic errors'. Heteroskedastic errors make the usual results unreliable and bias the inference.

Therefore, the generalized least squares approach is crucial in tackling the problems of outliers, heteroskedasticity and bias in the data. It is capable of producing estimators that are 'Best Linear Unbiased Estimates'. Thus, the GLS estimator is unbiased, consistent, efficient and asymptotically normal.

Major assumption for generalized least square regression analysis

GLS relaxes the usual OLS assumption that the errors are independent and identically distributed. The classical assumptions about the error term include:
 The error variances are homoscedastic
 Errors are uncorrelated
 Errors are normally distributed
When these assumptions hold, the OLS estimators and the GLS estimators are the same. Thus, the difference between OLS and GLS lies in the assumptions made about the error term of the model.
There are two perspectives from which one can understand the GLS estimator:
 As a generalization of OLS
 As a transformation of the model equation into a new model whose errors are uncorrelated and have equal variances, that is, are homoscedastic.

Application of generalized least squares

 The GLS model is useful in the regionalization of hydrologic data.
 GLS is also useful in reducing autocorrelation by choosing an appropriate weighting matrix.
 It is one of the best methods to estimate regression models with autocorrelated disturbances and to test for serial correlation (serial correlation and autocorrelation refer to the same thing).
 One can also use the maximum likelihood technique to estimate regression models with autocorrelated disturbances.
 The GLS procedure finds extensive use across various domains; for example, it is used to estimate the parameters of regional regression models of flood quantiles.
 GLS is widely popular in market response modelling, econometrics and time series analysis.
A number of available software support the generalized least squares test, like R, MATLAB, SAS,
SPSS, and STATA.
A special case of GLS called weighted least squares (WLS) occurs when all the off-diagonal entries
of Ω are 0. This situation arises when the variances of the observed values are unequal (i.e.
heteroscedasticity is present), but where no correlations exist among the observed variances. The
weight for unit i is proportional to the reciprocal of the variance of the response for unit i.

The generalized least squares (GLS) estimator of the coefficients of a linear regression is a
generalization of the ordinary least squares (OLS) estimator. It is used to deal with situations in
which the OLS estimator is not BLUE (best linear unbiased estimator) because one of the main
assumptions of the Gauss-Markov theorem, namely that of homoskedasticity and absence of serial
correlation, is violated. In such situations, provided that the other assumptions of the Gauss-
Markov theorem are satisfied, the GLS estimator is BLUE.
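To make the idea concrete, the following sketch estimates a model by GLS in statsmodels when the error covariance matrix is treated as known. The data are simulated, and the AR(1) covariance structure is an assumption made only for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho = 150, 0.6
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                      # simulate AR(1) errors
    u[t] = rho * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
X = sm.add_constant(x)

# Assumed error covariance: Sigma[i, j] = rho**|i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
gls_res = sm.GLS(y, X, sigma=Sigma).fit()  # GLS with the assumed covariance
print(gls_res.params, gls_res.bse)
```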

2.5 NATURE, TEST, CONSEQUENCES AND REMEDIAL STEPS OF


MULTICOLLINEARITY

In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in


a multiple regression model can be linearly predicted from the others with a substantial degree of
accuracy. In this situation the coefficient estimates of the multiple regression may change
erratically in response to small changes in the model or the data. Multicollinearity does not reduce
the predictive power or reliability of the model as a whole, at least within the sample data set; it
only affects calculations regarding individual predictors. That is, a multivariate regression model
with collinear predictors can indicate how well the entire bundle of predictors predicts the
outcome variable, but it may not give valid results about any individual predictor, or about which
predictors are redundant with respect to others.

Note that in statements of the assumptions underlying regression analyses such as ordinary least
squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity,
which is an exact (non-stochastic) linear relation among the predictors. In such a case, the data matrix X has less than full rank, and therefore the moment matrix XᵀX cannot be inverted.

In any case, multicollinearity is a characteristic of the data matrix, not the underlying statistical model. Since it is generally more severe in small samples, Arthur Goldberger went so far as to call it "micronumerosity."

Multicollinearity is a state of very high intercorrelations or inter-associations among the


independent variables. It is therefore a type of disturbance in the data, and if present in the data
the statistical inferences made about the data may not be reliable.
There are certain reasons why multicollinearity occurs:

 It is caused by an inaccurate use of dummy variables.


 It is caused by the inclusion of a variable which is computed from other variables in the data
set.
 Multicollinearity can also result from the repetition of the same kind of variable.
 It generally occurs when the independent variables are highly correlated with each other.

Multicollinearity can result in several problems. These problems are as follows:

 The partial regression coefficient due to multicollinearity may not be estimated precisely.
The standard errors are likely to be high.
 Multicollinearity results in a change in the signs as well as in the magnitudes of the partial
regression coefficients from one sample to another sample.
 Multicollinearity makes it tedious to assess the relative importance of the independent
variables in explaining the variation caused by the dependent variable.

 In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t-statistics tend to be very small. It becomes difficult to reject the null hypothesis of any study when multicollinearity is present in the data under study.
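As a practical check on the problems listed above, variance inflation factors (VIFs) are a standard diagnostic. The sketch below uses simulated data and statsmodels; all names are hypothetical, and VIF values well above 10 are conventionally read as a sign of serious multicollinearity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)   # x1 and x2 should show very large VIFs
```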

What Causes Multicollinearity?

The two types are:


 Data-based multicollinearity: caused by poorly designed experiments, data that is 100%
observational, or data collection methods that cannot be manipulated. In some cases,
variables may be highly correlated (usually due to collecting data from purely observational
studies) and there is no error on the researcher ‘s part. For this reason, you should conduct
experiments whenever possible, setting the level of the predictor variables in advance.
 Structural multicollinearity: caused by you, the researcher, creating new predictor variables.
 Causes for multicollinearity can also include:
 Insufficient data. In some cases, collecting more data can resolve the issue.
 Dummy variables may be incorrectly used. For example, the researcher may fail to exclude
one category, or add a dummy variable for every category (e.g., spring, summer, autumn,
winter).

 Including a variable in the regression that is actually a combination of two other variables. For example, including "total investment income" when total investment income = income from stocks and bonds + income from savings interest.

 Including two identical (or almost identical) variables. For example, weight in pounds and
weight in kilos, or investment income and savings/bond income.
What Problems Do Multicollinearity Cause?

 Multicollinearity causes the following two basic types of problems:


 The coefficient estimates can swing wildly based on which other independent variables are in
the model. The coefficients become very sensitive to small changes in the model.
 Multicollinearity reduces the precision of the estimate coefficients, which weakens the
statistical power of your regression model. You might not be able to trust the p-values to
identify independent variables that are statistically significant.

Imagine you fit a regression model and the coefficient values, and even the signs, change
dramatically depending on the specific variables that you include in the model. It’s a disconcerting
feeling when slightly different models lead to very different conclusions. You don't feel like you know the actual effect of each variable. Now, throw in the fact that you can't necessarily trust the p-values to select the independent variables to include in the model. This problem makes it difficult
both to specify the correct model and to justify the model if many of your p-values are not
statistically significant.

As the severity of the multicollinearity increases so do these problematic effects. However, these
issues affect only those independent variables that are correlated. You can have a model with
severe multicollinearity and yet some variables in the model can be completely unaffected.

Worked regression examples with severe multicollinearity illustrate these problems in action.

Do I Have to Fix Multicollinearity?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your
model to identify independent variables that are statistically significant. These are definitely serious
problems. However, the good news is that you don't always have to find a way to fix multicollinearity.

The need to reduce multicollinearity depends on its severity and your primary goal for your
regression model. Keep the following three points in mind:
 The severity of the problems increases with the degree of the multicollinearity. Therefore, if
you have only moderate multicollinearity, you may not need to resolve it.
 Multicollinearity affects only the specific independent variables that are correlated.
Therefore, if multicollinearity is not present for the independent variables that you are
particularly interested in, you may not need to resolve it. Suppose your model contains the
experimental variables of interest and some control variables. If high multicollinearity exists
for the control variables but not the experimental variables, then you can interpret the
experimental variables without problems.
 Multicollinearity affects the coefficients and p-values, but it does not influence the
predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don't need to understand the role of each independent variable, you don't need to reduce severe multicollinearity.

The Consequences of Multicollinearity

 Imperfect multicollinearity does not violate Assumption 6. Therefore, the Gauss Markov
Theorem tells us that the OLS estimators are BLUE.
 So then why do we care about multicollinearity?
 The variances and the standard errors of the regression coefficient estimates will increase.
This means lower t-statistics.
 The overall fit of the regression equation will be largely unaffected by multicollinearity.
This also means that forecasting and prediction will be largely unaffected.
 Regression coefficients will be sensitive to specifications. Regression coefficients can change substantially when variables are added or dropped.

2.6 ESTIMATION IN THE PRESENCE OF PERFECT


MULTICOLLINEARITY


Note that in statements of the assumptions underlying regression analyses such as ordinary least
squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity,
which is an exact (non-stochastic) linear relation among the predictors. In such a case, the design matrix X has less than full rank, and therefore the moment matrix XᵀX cannot be inverted. Under these circumstances, for the general linear model y = Xβ + ε, the ordinary least squares estimator β̂_OLS = (XᵀX)⁻¹Xᵀy does not exist. In any case, multicollinearity is a characteristic of the design matrix, not the underlying statistical model.

Multicollinearity refers to a situation in which more than two explanatory variables in a multiple
regression model are highly linearly related. We have perfect multicollinearity if, for example as in
the equation above, the correlation between two independent variables is equal to 1 or −1. In
practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of
multicollinearity arises when there is an approximate linear relationship among two or more
independent variables.

It was stated previously that in the case of perfect multicollinearity the regression coefficients remain indeterminate and their standard errors are infinite. This fact can be demonstrated readily in terms of the three-variable regression model. Using the deviation form, where all the variables are expressed as deviations from their sample means, we can write the three-variable regression model as

yi = β̂2 x2i + β̂3 x3i + ûi
The result of perfect multicollinearity is that you cannot obtain any structural inferences about the original model using sample data for estimation. In a model with perfect multicollinearity, the regression coefficients are indeterminate and their standard errors are infinite.
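A tiny numerical illustration of this indeterminacy (simulated data; the values are chosen only for demonstration): when one regressor is an exact multiple of another, the moment matrix XᵀX has less than full rank, so the OLS normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
x2 = 3.0 * x1                                  # exact linear dependence
X = np.column_stack([np.ones(50), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))              # 2 rather than 3: not full rank
# Inverting XtX is not possible here, so the OLS normal equations
# have no unique solution -- the coefficients are indeterminate.
```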

2.7 ESTIMATION IN THE PRESENCE OF HIGH BUT IMPERFECT


MULTICOLLINEARITY

With imperfect multicollinearity, an independent variable is a strong but not perfect linear function of one or more other independent variables. This also means that there are variables outside the relation that affect the independent variable. In other words, two independent variables may be related to each other, yet there are also other variables outside the relation that affect one of them, which means that there is no perfect linear function between the two alone. Thus, the inclusion of a stochastic term in the relation shows that other variables are also affecting the regressors.

 Imperfect multicollinearity varies in degree from sample to sample.


 The presence of the error term dilutes the relationship between the independent variables.

Fig.-2.6 Estimation in the Presence of High but Imperfect Multicollinearity


The equations tell us that there might be a relationship between X1 and X2, but they do not imply that X1 is completely explained by X2; there is a possibility of unexplained variation as well, in the form of the stochastic error term.

2.8 SUMMARY

 Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
 Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you have been putting on weight over the last few years, it can predict how much you will weigh in ten years' time if you continue to put on weight at the same rate.
 It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you how accurate your model is. Most elementary statistics courses cover very basic techniques, like making scatter plots and performing linear regression. However, you may come across more advanced techniques like multiple regression.

2.9 KEYWORDS

 Unbiased Estimate - An estimator of a given parameter is said to be unbiased if its expected value is equal to the true value of the parameter.
 Heteroscedasticity - When the standard deviations of a predicted variable are not
constant throughout a range of independent variable values or when they are
compared to earlier time periods, this is known as heteroskedasticity (or
heteroscedasticity) in statistics. When residual errors are visually inspected for
heteroskedasticity, they will often fan out with time, as shown in the illustration
below.

 Multivariate Statistics - A subfield of statistics that includes the concurrent observation and analysis of several outcome variables.

 Inductive Generalization - Using observations about a sample to draw a conclusion about the population from which the sample was taken. To make such statistical generalizations about populations, you require representative sampling data.

 Mathematical Geology- The use of mathematical techniques to address issues in the


geosciences, including geology and geophysics, with a focus on geodynamics and
seismology, is known as geomathematics (also: mathematical geosciences,
mathematical geology, and mathematical geophysics).

2.10 LEARNING ACTIVITY

1. Explain GLS in detail


__________________________________________________________________________________
____________________________________________________________________
2. State the reasons & the problems of multicollinearity.
_________________________________________________________________________________
___________________________________________________________________
3. What are the remedial steps of the problem & the consequences of Heteroscedasticity?
_________________________________________________________________________________
___________________________________________________________________

2.11 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1.Calculate the regression coefficient and obtain the lines of regression for the following data

2. Calculate the regression coefficient and obtain the lines of regression for the following data
3. Find the means of X and Y variables and the coefficient of correlation between them from the
following two regression equations:

2Y–X–50 = 0

3Y–2X–10 = 0.

4. The two regression lines are 3X+2Y=26 and 6X+3Y=31. Find the correlation
coefficient.

5. Find the means of X and Y variables and the coefficient of correlation between them
from the following two regression equations:

4X–5Y+33 = 0

20X–9Y–107 = 0

Long Questions

1. For 5 pairs of observations the following results are obtained ∑X=15, ∑Y=25, ∑X2
=55, ∑Y2 =135, ∑XY=83 Find the equation of the lines of regression and estimate
the value of X on the first line when Y=12 and value of Y on the second line if X=8.

2. In a laboratory experiment on correlation research study the equation of the two


regression lines were found to be 2X–Y+1=0 and 3X–2Y+7=0. Find the means of X
and Y. Also work out the values of the regression coefficient and correlation between
the two variables X and Y.

3. For the given lines of regression 3X–2Y=5and X–4Y=7. Find

(I) Regression coefficients

(ii) Coefficient of correlation

4. Obtain the two regression lines from the following data N=20, ∑X=80, ∑Y=40,
∑X2=1680, ∑Y2=320 and ∑XY=48
5. Find the equation of the regression line of Y on X, if the observations (Xi, Yi) are the following: (1,4) (2,8) (3,2) (4,12) (5,10) (6,14) (7,16) (8,6) (9,18)

B. Multiple Choice Questions

1. The process of constructing a mathematical model or function that can be used to predict
or determine one variable by another variable is called

A. regression

B. correlation

C. residual

D. outlier plot

2. In the regression equation Y = 21 - 3X, the slope is

A. 21

B. -21

C. 3

D. -3

3. In the regression equation Y = 75.65 + 0.50X, the intercept is

A. 0.50

B. 75.65

C. 1.00

D. indeterminable

4. The difference between the actual Y value and the predicted Y value found using a
regression equation is called the
A. slope

B. residual

C. outlier

D. scatter plot

5. The total of the squared residuals is called the

A. coefficient of determination

B. sum of squares of error

C. standard error of the estimate

D. r-squared


Answers

1-A, 2-D, 3-B, 4-B, 5-B

2.12 REFERENCES

 Gujarati, D., Porter, D. C. and Gunasekar, S. (2012). Basic Econometrics (Fifth Edition), McGraw Hill Education.
 Anderson, D. R., D. J. Sweeney and T. A. Williams (2011). Statistics for Business and Economics, 12th Edition, Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third Edition, Thomson South-Western, 2007.
 Johnston, J., Econometric Methods, 3rd Edition, McGraw Hill, New York, 1994.
 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic Press, 2002 (IGM Library Call No. 330.0182 R14I).
 Koutsoyiannis, A., The Theory of Econometrics, 2nd Edition, ELBS.
UNIT – 3 PROBLEMS IN REGRESSION ANALYSIS II

STRUCTURE

3.0 Learning Objective

3.1 Introduction

3.2 Nature, test, consequences and remedial steps of problems of Auto-correlation

3.3 OLS in the presence of autocorrelation

3.4 BLUE estimator in the presence of Autocorrelation

3.5 Problems and consequences of specification error

3.6 Errors of measurement

3.7 Model Selection Criteria

3.8 Summary

3.9 Keywords

3.10 Learning Activity

3.11 Unit End Questions

3.12 References

3.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:


 This module will help the students to understand the problems of Auto-correlation with
respect to OLS and BLUE estimator
 This module will also introduce the problems and consequences of specification error, errors of measurement and model selection criteria

3.1 INTRODUCTION

When attempting to predict a continuous dependent variable from a number of independent


factors, regression analysis is utilized. Logistic regression should be used if the dependent variable is

dichotomous. The independent variables employed in regression can be either continuous or
dichotomous (if the split between the two levels of the dependent variable is close to 50-50, then
both logistic and linear regression will end up giving you comparable results). Regression analyses
may also be employed with independent variables that have more than two levels, but they must
first be transformed into variables with just two levels. Dummy code is what this is, and it will be
covered later. Although you may apply regression with experimentally altered variables, regression
analysis is often employed with naturally occurring variables as opposed to experimentally treated
ones. Regression analysis has the drawback that causal links between the variables cannot be
established. Despite the fact that the vocabulary allows us to state that X "predicts" Y, we are unable
to argue that X "causes" Y.
Regression analysis is a group of statistical procedures used in statistical modelling to determine the
relationships between a dependent variable (often referred to as the "outcome" or "response"
variable, or a "label" in machine learning jargon), and one or more independent variables (often
referred to as "predictors," "covariates," "explanatory variables," or "features"). In linear regression,
the most typical type of regression analysis, the line (or a more complicated linear combination) that
most closely matches the data in terms of a given mathematical criterion is found. When using the
ordinary least squares approach, for instance, the specific line (or hyperplane) that minimizes the
sum of squared differences between the genuine data and that line is computed (or hyperplane).
This enables the researcher to estimate the conditional expectation (or population average value) of
the dependent variable when the independent variables take on a specified set of values for precise
mathematical reasons (see linear regression). Quantile regression or Necessary Condition Analysis[1]
are two less popular types of regression that employ somewhat different methods to estimate
alternative location parameters or to estimate the conditional expectation across a larger group of
non-linear models (e.g., nonparametric regression).

Two fundamentally separate uses of regression analysis predominate.

First, there is a significant overlap between the usage of regression analysis and machine learning in
the areas of prediction and forecasting.

Second, regression analysis may be used to establish causal links between the independent and
dependent variables in specific circumstances. Regressions by themselves, it should be noted, only
illuminate connections between a dependent variable and a group of independent variables in a
given dataset. Researchers must carefully explain why existing correlations have predictive value in a
new context or why a link between two variables has a causal meaning before using regressions for
prediction or to infer causal relationships, respectively. When attempting to infer causal linkages
using observational data, the latter is particularly crucial.

3.2 NATURE, TEST, CONSEQUENCES AND REMEDIAL STEPS OF
PROBLEMS OF AUTO-CORRELATION

Autocorrelation is a mathematical representation of the degree of similarity between a given time


series and a lagged version of itself over successive time intervals. It is the same as calculating the
correlation between two different time series, except autocorrelation uses the same time series
twice: once in its original form and once lagged one or more time periods.
Autocorrelation can also be referred to as lagged correlation or serial correlation, as it measures
the relationship between a variable's current value and its past values. When computing
autocorrelation, the resulting output can range from 1 to negative 1, in line with the traditional
correlation statistic. An autocorrelation of +1 represents a perfect positive correlation (an increase
seen in one time series leads to a proportionate increase in the other time series). An
autocorrelation of negative 1, on the other hand, represents perfect negative correlation (an
increase seen in one time series results in a proportionate decrease in the other time series).
Autocorrelation measures linear relationships; even if the autocorrelation is minuscule, there may
still be a nonlinear relationship between a time series and a lagged version of itself.
Autocorrelation refers to the degree of correlation between the values of the same variables across
different observations in the data. The concept of autocorrelation is most often discussed in the
context of time series data in which observations occur at different points in time (e.g., air
temperature measured on different days of the month). For example, one might expect the air
temperature on the 1st day of the month to be more similar to the temperature on the 2nd day
compared to the 31st day. If the temperature values that occurred closer together in time are, in
fact, more similar than the temperature values that occurred farther apart in time, the data would
be auto correlated.

However, autocorrelation can also occur in cross-sectional data when the observations are related
in some other way. In a survey, for instance, one might expect people from nearby geographic
locations to provide more similar answers to each other than people who are more
geographically distant. Similarly, students from the same class might perform more similarly to each
other than students from different classes. Thus, autocorrelation can occur if observations are
dependent in aspects other than time. Autocorrelation can cause problems in conventional
analyses (such as ordinary least squares regression) that assume independence of observations.

In a regression analysis, autocorrelation of the regression residuals can also occur if the model is
incorrectly specified. For example, if you are attempting to model a simple linear relationship but
the observed relationship is non-linear (i.e., it follows a curved or U-shaped function), then the
residuals will be auto correlated.
For example, if it's rainy today, the data suggests that it's more likely to rain tomorrow than if it's
clear today. When it comes to investing, a stock might have a strong positive autocorrelation of
returns, suggesting that if it's "up" today, it's more likely to be up tomorrow, too.

Naturally, autocorrelation can be a useful tool for traders to utilize; particularly for technical
analysts.

The most common method of test autocorrelation is the Durbin-Watson test. Without getting too
technical, the Durbin-Watson is a statistic that detects autocorrelation from a regression analysis.

The Durbin-Watson always produces a test number range from 0 to 4. Values closer to 0 indicate a
greater degree of positive correlation, values closer to 4 indicate a greater degree of negative
autocorrelation, while values closer to the middle suggest less autocorrelation.

So why is autocorrelation important in financial markets? Simple. Autocorrelation can be applied to


thoroughly analyze historical price movements, which investors can then use to predict future price
movements. Specifically, autocorrelation can be used to determine if a momentum trading strategy
makes sense.
Autocorrelation can be useful for technical analysis. That's because technical analysis is most
concerned with the trends of, and relationships between, security prices using charting techniques.
This is in contrast with fundamental analysis, which focuses instead on a company's financial health
or management.

Technical analysts can use autocorrelation to figure out how much of an impact past prices for a security have on its future price.

Autocorrelation can help determine if there is a momentum factor at play with a given stock. If a
stock with a high positive autocorrelation posts two straight days of big gains, for example, it might
be reasonable to expect the stock to rise over the next two days, as well.
A good autocorrelation example
Assume Rain wants to know whether a stock's returns in their portfolio show autocorrelation, or a
relationship between those returns and those from earlier trading sessions.

Rain may classify it as a momentum stock if the returns show autocorrelation because past returns
appear to affect future returns. Rain does a regression using the current return as the dependent
variable and the return from the previous trading session as the independent variable. They
discover that returns from the previous day have a 0.8 positive autocorrelation.

Since 0.8 is extremely near to +1, historical returns for this specific stock appear to be a very
excellent predictor of future returns.
Rain might thus alter their portfolio by holding onto their position or buying more shares in order
to benefit from the autocorrelation or momentum.
An autocorrelated error term can take a range of different specifications to manifest a correlation between pairwise observations. The most basic form of autocorrelation is referred to as first-order autocorrelation and is specified in the following way:

u_t = ρ u_{t-1} + v_t

where u refers to the error term of the population regression function. As can be seen from this expression, the error term at period t is a function of itself in the previous time period t-1, multiplied by the coefficient ρ, which is referred to as the first-order autocorrelation coefficient (ρ is the Greek letter rho, pronounced "row"). The last term, v_t, is a so-called white noise error term, and is supposed to be completely random. It is often assumed to be standard normal.

This type of autocorrelation is called autoregression because the error term is a function of its past values. Since u is a function of itself one period back only, as opposed to several periods, we call it the first-order autoregressive error scheme, which is denoted AR(1). This specification can be generalized to capture up to n lagged terms. We would then refer to it as the nth order of autocorrelation and it would be specified like this:

u_t = ρ_1 u_{t-1} + ρ_2 u_{t-2} + ... + ρ_n u_{t-n} + v_t

First-order autocorrelation is perhaps the most common type of autocorrelation and is for that reason the main target of our discussion. The autocorrelation can be positive or negative, and is related to the sign of the autocorrelation coefficient ρ. One way to find out whether the model suffers from autocorrelation, and whether it is positive or negative, is to plot the residual term against its own lagged value.

The figure shows two examples of how such plots could look when the error term is autocorrelated. The graph to the left represents the case of a positive autocorrelation with a coefficient equal to 0.3. A regression line is also fitted to the dots in order to make it easier to see in what direction the correlation goes. Sometimes you are exposed to plots where the dependent
variable or the residual term is followed over time. However, when the correlation is below 0.5 in
absolute terms, it might be difficult to identify any pattern using those plots, and therefore the
plots above are preferable.

Fig.-3.1 Positive and negative autocorrelation

Positive and negative autocorrelation

The example above shows positive first-order autocorrelation, where first order indicates that
observations that are one apart are correlated, and positive means that the correlation between
the observations is positive. When data exhibiting positive first-order correlation is plotted, the
points appear in a smooth snake-like curve, as on the left. With negative first-order correlation, the
points form a zigzag pattern if connected, as shown on the right.

Fig.-3.2 Positive and negative autocorrelation- Testing for autocorrelation


Testing for autocorrelation

Sampling error alone means that we will typically see some autocorrelation in any data set, so a
statistical test is required to rule out the possibility that sampling error is causing the
autocorrelation. The standard test for this is the Durbin-Watson test. This test only explicitly tests
first order correlation, but in practice it tends to detect most common forms of autocorrelation as
most forms of autocorrelation exhibit some degree of first order correlation.
Testing for Autocorrelation

You can test for autocorrelation with:

A plot of residuals. Plot e_t against t and look for clusters of successive residuals on one side of the zero line. You can also try adding a Lowess (locally weighted smoothing) line.

A Durbin-Watson test.

A Lagrange Multiplier Test.

A Ljung-Box test.

A correlogram. A pattern in the results is an indication for autocorrelation. Any values above zero
should be looked at with suspicion.

The Moran’s I statistic, which is similar to a correlation coefficient.
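A brief, illustrative sketch of two of the checks listed above, the Durbin-Watson statistic and a Breusch-Godfrey (Lagrange Multiplier) test, using statsmodels. The AR(1) errors are simulated and all names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(6)
n, rho = 200, 0.7
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                     # simulate AR(1) errors
    u[t] = rho * u[t - 1] + rng.normal()
y = 0.5 + 1.2 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", durbin_watson(res.resid))       # well below 2 here
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=1)
print("Breusch-Godfrey p-value:", lm_pval)              # small -> autocorrelation
```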

The power of each of four tests of first-order autocorrelation in the linear regression model is determined for a simple and a multiple regression model whose parameters are presumed to be known. The tests are: the Durbin-Watson bounds test, a test based on Theil's best linear unbiased scalar (BLUS) estimator, a test devised by Abrahamse, Koerts and Louter, and an exact test devised by Durbin.

For positive values of the coefficient of autocorrelation the Durbin-Watson bounds test is generally better than the tests based on the estimator proposed by Abrahamse, Koerts and Louter, the best linear unbiased scalar estimator, and the Durbin exact test. For negative values of the coefficient of autocorrelation, the pattern of results is mixed for all four test procedures. A byproduct of these experiments is the demonstrated feasibility of enumerating the distribution of the Durbin-Watson test statistic for any regression matrix and thus eliminating the region of indeterminacy from the Durbin-Watson test procedure.

The implications of autocorrelation

When autocorrelation is detected in the residuals from a model, it suggests that the model is mis
specified (i.e., in some sense wrong). A cause is that some key variable or variables are missing
from the model. Where the data has been collected across space or time, and the model does not
explicitly account for this, autocorrelation is likely. For example, if a weather model is wrong in one
suburb, it will likely be wrong in the same way in a neighboring suburb. The fix is to either include
the missing variables, or explicitly model the autocorrelation (e.g., using an ARIMA model). The
existence of autocorrelation means that computed standard errors, and consequently p-values,
are misleading.

When the disturbance term exhibits serial correlation, the values as well as the standard errors of the parameter estimates are affected. In particular:
When the residuals are serially correlated, the parameter estimates of OLS are still statistically unbiased.

With autocorrelated values of the disturbance term, the OLS variances of the parameter estimates are likely to be larger than those of other econometric methods.

The variance of the random term u may be seriously underestimated if the u's are autocorrelated.

Finally, if the values of u are autocorrelated, predictions based on ordinary least squares estimates will be inefficient.

In simpler terms, when we identify the presence of autocorrelation in the residuals of a model, we can say that some key variables may be missing from the model, or that the model may not be correctly specified. The presence of autocorrelation implies that we cannot rely on the standard errors, and consequently the p-values. The effects of autocorrelated errors on the least squares (OLS) estimators are:

If there are no lagged dependent variables among the explanatory variables in our model, the estimators are still going to be unbiased in the presence of autocorrelation; however, they will no longer be efficient (i.e., no longer the optimal estimators).

If there are lagged dependent variables included in the model, the least square estimators may not
be consistent in the presence of autocorrelation as n (sample size) tends to infinity.

In the classical linear regression model, we assume that successive values of the disturbance term are temporally independent when observations are taken over time. When this assumption is violated, the problem is known as autocorrelation. When the disturbance term exhibits serial correlation, the standard errors of the parameter estimates are affected and predictions based on ordinary least squares estimates will be inefficient, so the usual inferences drawn from those estimates can be misleading. In this study the main focus is on how one can detect the problem of autocorrelation and how the problem can be solved, so that we can estimate the values of the parameters correctly, that is, by estimates that are best, linear and unbiased. To explain the procedure for detecting autocorrelation and its remedial measures, we take an example on indices of real compensation per hour and output per hour in the business sector of the U.S. economy for the period 1960-2005. We use the Run Test and the Durbin-Watson test to detect the problem of autocorrelation, and then explain how the problem can be solved.
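As a hedged illustration of one common remedial step (the data below are simulated, not the U.S. compensation series), statsmodels' GLSAR can re-estimate a model by feasible GLS with AR(1) errors, in the spirit of the Cochrane-Orcutt procedure.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, rho = 200, 0.7
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                     # simulate AR(1) errors
    u[t] = rho * u[t - 1] + rng.normal()
y = 0.5 + 1.2 * x + u
X = sm.add_constant(x)

model = sm.GLSAR(y, X, rho=1)             # AR(1) error structure
res = model.iterative_fit(maxiter=10)     # alternate between estimating rho and beta
print("Estimated rho:", model.rho)
print(res.params, res.bse)
```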

3.3 OLS IN THE PRESENCE OF AUTOCORRELATION

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the
unknown parameters in a linear regression model. OLS chooses the parameters of a linear function
of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares
of the differences between the observed dependent variable (values of the variable being observed)
in the given dataset and those predicted by the linear function of the independent variable.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent
variable, between each data point in the set and the corresponding point on the regression
surface—the smaller the differences, the better the model fits the data. The resulting estimator can
be expressed by a simple formula, especially in the case of a simple linear regression, in which there
is a single regressor on the right side of the regression equation.

The OLS estimator is consistent when the regressors are exogenous, and—by the Gauss–Markov
theorem—optimal in the class of linear unbiased estimators when the errors are homoscedastic and
serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-
unbiased estimation when the errors have finite variances. Under the additional assumption that the
errors are normally distributed, OLS is the maximum likelihood estimator.

As in the case of heteroscedasticity, in the presence of autocorrelation the OLS estimators are still
linear unbiased as well as consistent and asymptotically normally distributed, but they are no
longer efficient (i.e., minimum variance). What then happens to our usual hypothesis testing
procedures if we continue to use the OLS estimators? Again, as in the case of heteroscedasticity,
we distinguish two cases. For pedagogical purposes we still continue to work with the two-variable model, although the following discussion can be extended to multiple regressions without much trouble.

OLS Estimation Allowing for Autocorrelation


As noted, the OLS estimator of β2 is not BLUE, and even if we use its variance computed under the AR(1) scheme, the confidence intervals derived from it are likely to be wider than those based on the GLS procedure. As Kmenta shows, this result is likely to be the case even if the sample size increases indefinitely; that is, the OLS estimator of β2 is not asymptotically efficient. The implication of this finding for hypothesis testing is clear: we are likely to declare a coefficient statistically insignificant (i.e., not different from zero) even though in fact (i.e., based on the correct GLS procedure) it may be significant. This difference can be seen clearly from Figure 3.3, which shows the 95% OLS [AR(1)] and GLS confidence intervals assuming that the true β2 = 0. Consider a particular estimate of β2, say b2. Since b2 lies in the wider OLS confidence interval, we could accept the hypothesis that the true β2 is zero, whereas the narrower GLS interval would lead us to reject that hypothesis.

Fig. 3.3 OLS Estimation Allowing for Autocorrelation

3.4 BLUE ESTIMATOR IN THE PRESENCE OF AUTOCORRELATION

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed
data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the
estimate) are distinguished.[1] For example, the sample mean is a commonly used estimator of the
population mean.

There are point and interval estimators. The point estimators yield single-valued results. This is in
contrast to an interval estimator, where the result would be a range of plausible values. "Single
value" does not necessarily mean "single number", but includes vector valued or function valued
estimators.

Estimation theory is concerned with the properties of estimators; that is, with defining properties
that can be used to compare different estimators (different rules for creating estimates) for the
same quantity, based on the same data. Such properties can be used to determine the best rules to
use under given circumstances. However, in robust statistics, statistical theory goes on to consider
the balance between having good properties, if tightly defined assumptions hold, and having less
good properties that hold under wider conditions

We have discussed Minimum Variance Unbiased Estimator (MVUE) in one of the previous
articles. Following points should be considered when applying MVUE to an estimation problem

MVUE is the optimal estimator

Finding a MVUE requires full knowledge of PDF (Probability Density Function) of the underlying
process.

Even if the PDF is known, finding an MVUE is not guaranteed.

If the PDF is unknown, it is impossible to find an MVUE using techniques like the Cramer-Rao Lower Bound (CRLB).

In practice, the PDF of the underlying process is usually unknown.

Considering all the points above, the best possible solution is to resort to finding a sub-optimal estimator: we restrict the estimator to be linear in the data and look for the linear estimator that is unbiased and has minimum variance, which is the BLUE.

Consider a data set x[n] = {x[0], x[1], ..., x[N-1]} whose parameterized PDF p(x; θ) depends on the unknown parameter θ. As the BLUE restricts the estimator to be linear in the data, the estimate of the parameter can be written as a linear combination of the data samples with some weights a_n:

θ̂ = Σ_{n=0}^{N-1} a_n x[n] = aᵀx

Here a is a vector of constants whose values we seek to find in order to meet the design specifications. Thus, the entire estimation problem boils down to finding the vector of constants a. The above equation may lead to multiple solutions for the vector a; however, we need to choose the set of values of a that provides estimates that are unbiased and have minimum variance.

Thus, the set of values of a that yields a BLUE estimator with minimum variance must satisfy the following two constraints:
The estimator must be linear in the data.
The estimate must be unbiased.

The Best Linear Unbiased Estimate (BLUE) of a parameter θ based on data Y is a linear function of Y. That is, the estimator can be written as b′Y; it is unbiased (E[b′Y] = θ) and has the smallest variance among all unbiased linear estimators. The steps for constructing the Best Linear Unbiased Estimator (BLUE) are summarized later in this section.

The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates

The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical
assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have
the smallest variance of all possible linear estimators.

The proof of this theorem goes beyond the scope of this unit. However, the critical point
is that when you satisfy the classical assumptions, you can be confident that you are obtaining the
best possible coefficient estimates. The Gauss-Markov theorem does not state that these are just

the best possible estimates for the OLS procedure, but the best possible estimates for any linear
model estimator. Think about that!

The classical assumptions of OLS linear regression, and how to verify them, have been discussed earlier. Here we take a closer look at the nature of OLS estimates. What does the Gauss-Markov theorem mean exactly when it states that OLS estimates are the best estimates when the assumptions hold true?

The Gauss-Markov Theorem: OLS is BLUE!

The Gauss-Markov theorem famously states that OLS is BLUE. BLUE is an acronym for the following:

Best Linear Unbiased Estimator

In this context, the definition of “best” refers to the minimum variance or the narrowest sampling
distribution. More specifically, when your model satisfies the assumptions, OLS coefficient estimates
follow the tightest possible sampling distribution of unbiased estimates compared to other linear
estimation methods.

Let’s dig deeper into everything that is packed into that sentence!

What Does OLS Estimate?

Regression analysis is like any other inferential methodology. Our goal is to draw a random sample
from a population and use it to estimate the properties of that population. In regression analysis, the
coefficients in the equation are estimates of the actual population parameters.

The notation for the model of a population is the following:

Y = β₀ + β₁X₁ + β₂X₂ + ... + ε

The betas (β) represent the population parameters for each term in the model. Epsilon (ε) represents the random error that the model doesn't explain. Unfortunately, we'll never know these population values because it is generally impossible to measure the entire population. Instead, we'll obtain estimates of them using our random sample.

The notation for an estimated model from a random sample is the following:

Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + ... + e

The hats over the betas indicate that these are parameter estimates while e represents the residuals, which are estimates of the random error.

Typically, statisticians consider estimates to be useful when they are unbiased (correct on average)
and precise (minimum variance). To apply these concepts to parameter estimates and the Gauss-
Markov theorem, we’ll need to understand the sampling distribution of the parameter estimates.
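A small Monte Carlo sketch of this idea (simulated data; the parameter values are hypothetical): across repeated random samples, the OLS slope estimates center on the true parameter (unbiasedness) and scatter around it (the sampling variance).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
true_b0, true_b1 = 2.0, 0.5
slopes = []
for _ in range(1000):                     # repeated random samples
    x = rng.normal(size=100)
    y = true_b0 + true_b1 * x + rng.normal(size=100)
    slopes.append(sm.OLS(y, sm.add_constant(x)).fit().params[1])

slopes = np.array(slopes)
print("Mean of slope estimates:", slopes.mean())   # close to the true 0.5
print("Std. dev. of slope estimates:", slopes.std())
```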

1. Define a linear estimator.
2. It must have the property of being unbiased.
3. The minimum variance is then computed.
4. The conditions under which the minimum variance is computed need to be determined.
5. This then needs to be put in the form of a vector.
The BLUE becomes an MVU estimator if the data is Gaussian in nature, irrespective of whether the parameter is in scalar or vector form. In order to estimate the BLUE, only two quantities are needed: the scaled mean and the covariance, that is, the first and second moments respectively.

Advantages over Disadvantages

If data can be modelled as linear observations in noise, then the Gauss-Markov theorem can be used to find the BLUE. The Markov theorem generalizes the BLUE result to the case where the rank is less than full.
BLUE is applicable to amplitude estimation of known signals in noise. However, it is to be noted that the noise need not necessarily be Gaussian in nature.
The biggest disadvantage of BLUE is that it is already sub-optimal in nature and sometimes it is not the right fit for the problem in question.

3.5 PROBLEMS AND CONSEQUENCES OF SPECIFICATION ERROR

In the context of a statistical model, specification error means that at least one of the key features or
assumptions of the model is incorrect. In consequence, estimation of the model may yield results
that are incorrect or misleading. Specification error can occur with any sort of statistical model,
although some models and estimation methods are much less affected by it than others. Estimation
methods that are unaffected by certain types of specification error are often said to be robust. For
example, the sample median is a much more robust measure of central tendency than the sample
mean because it is unaffected by the presence of extreme observations in the sample.

For concreteness, consider the case of the linear regression model. The simplest such model is

Y = β0 + β1X + U,

where Y is the regressand, X is a single regressor, U is an error term, and β0 and β1 are parameters to be estimated. This model, which is usually estimated by ordinary least squares, could be misspecified in a great many ways. Some forms of misspecification will result in misleading estimates of the parameters, and other forms will result in misleading confidence intervals and test statistics.
One common form of misspecification is caused by nonlinearity. According to the linear regression model above, increasing the value of the regressor X by one unit always increases the expected value of the regressand Y by β1 units. But perhaps the effect of X on Y depends on the level of X. If so, the model is misspecified. A more general model is

Y = β0 + β1X + β2X² + U,

which includes the square of X as an additional regressor. In many cases, a model like this is much less likely to be misspecified than the purely linear model. A classic example in economics is the relationship between years of experience in the labor market and wages. Whenever economists estimate such a relationship, they find that β1 is positive and β2 is negative.

If the relationship between X and Y really is nonlinear, and the sample contains a reasonable amount of information, then it is likely that the estimate of β2 in the more general model will be significantly different from zero. Thus, we can test for specification error in the linear model by estimating the more general model and testing the hypothesis that β2 = 0.
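As a hedged illustration of this testing strategy (the data below are simulated with a genuinely nonlinear relationship; names and values are only for demonstration), one can fit the more general model and inspect the t-test on the squared term.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 300)
y = 1.0 + 2.0 * x - 0.15 * x**2 + rng.normal(size=300)  # truly quadratic relation

X_quad = sm.add_constant(np.column_stack([x, x**2]))
res = sm.OLS(y, X_quad).fit()
print(res.pvalues[2])   # p-value on the squared term; a small value
                        # signals that the purely linear model is mis-specified
```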

Another type of specification error occurs when we mistakenly use the wrong regressor(s). For example, suppose that Y really depends on Z, not on X. If X and Z are positively correlated, we may well get what appear to be reasonable results when we estimate the regression of Y on X. But the correct regression of Y on Z should fit better. A number of procedures exist for deciding whether one of these equations, the other, or neither of them is correctly specified. These are often called non-nested hypothesis tests, and they are really a way of testing for specification error. In this case, we simply need to estimate the model

Y = β0 + β1X + β2Z + U,

which includes both rival specifications as special cases. We can test whether the regression on X alone is correctly specified by using the t-statistic for β2 = 0, and we can test whether the regression on Z alone is correctly specified by using the t-statistic for β1 = 0.
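For concreteness, here is a small simulated sketch (names and numbers are illustrative) of the encompassing regression that nests the two rival specifications:

# Estimate Y on both X and Z and use the two t-statistics to judge the rival models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
Z = rng.normal(size=300)
X = 0.7 * Z + rng.normal(size=300)          # X and Z positively correlated
Y = 1.0 + 2.0 * Z + rng.normal(size=300)    # Y truly depends on Z, not X

fit = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
print(fit.tvalues)   # the t-statistic on X should be small; on Z it should be large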

Of course, it is possible that Y actually depends on both X and Z, so that the true model includes both regressors. In that case, if we mistakenly estimated the regression of Y on X alone, we would be guilty of omitting the explanatory variable Z. Unless X and Z happened to be uncorrelated, this would cause the estimate of β1 to be biased.
This type of bias is often called omitted variable bias, and it can be severe when the correlation
between X and Z is high.
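A short simulation (illustrative values only) makes the omitted variable bias visible: when Z is left out and corr(X, Z) > 0, the estimated coefficient on X drifts away from its true value.

# Omitted-variable bias: Y depends on X and Z, but Z is dropped from the estimated model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
Z = rng.normal(size=2000)
X = 0.8 * Z + rng.normal(size=2000)                 # X and Z positively correlated
Y = 1.0 + 1.5 * X + 2.0 * Z + rng.normal(size=2000)

short = sm.OLS(Y, sm.add_constant(X)).fit()                        # omits Z
long = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()   # includes Z
print(short.params[1], long.params[1])   # biased estimate versus roughly 1.5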

Another very damaging type of specification error occurs when the error term U is correlated with X. This can occur in a variety of circumstances, notably when X is measured with error or when the equation is just one equation from a system of simultaneous equations that determine X and Y
jointly. Using ordinary least squares in this situation results in estimates of β 0 and β 1 that are
biased and inconsistent. Because they are biased, the estimates are not centered around the true
values. Moreover, because they are inconsistent, they actually converge to incorrect values of the
parameters as the sample size gets larger.

The classic way of dealing with this type of specification error is to use instrumental variables (IV).
This requires the investigator to find one or more variables that are correlated with X but not correlated with U, something that may or may not be easy to do. The IV estimator that results, which
is also called two-stage least squares, is still biased, although generally much less so than ordinary
least squares, but at least it is consistent. Thus, if the sample size is reasonably large and various
other conditions are satisfied, IV estimates can be quite reliable.

Even if a regression model is correctly specified in the sense that the relationship between the regressand and the regressors is correct and the regressors are uncorrelated with the error terms, it
may still suffer from specification error. For ordinary least squares estimates with the usual standard
errors to yield valid inferences, it is essential that the error terms be uncorrelated and have constant
variance. If these assumptions are violated, the parameter estimates may still be unbiased, but
confidence intervals and test statistics will generally be incorrect.

When the error terms do not have constant variance, the model is said to suffer from heteroskedasticity. There are various ways to deal with this problem. One of the simplest is just to
use heteroskedasticity-robust standard errors instead of conventional standard errors. When the
sample size is reasonably large, this generally allows us to obtain valid confidence intervals and test
statistics.
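As a hedged sketch (simulated data, illustrative names), heteroskedasticity-robust standard errors can be requested directly when fitting by OLS:

# Conventional versus heteroskedasticity-robust (HC3) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=400)
y = 2 + 0.5 * x + rng.normal(0, x)          # error variance grows with x

X = sm.add_constant(x)
conventional = sm.OLS(y, X).fit()
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-robust covariance
print(conventional.bse)
print(robust.bse)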

If the error terms are correlated, confidence intervals and test statistics will generally be incorrect.
This is most likely to occur with time-series data and with data where the observations fall naturally
into groups, or clusters. In the latter case, it is often a good idea to use cluster-robust standard
errors instead of conventional ones.

When using time-series data, one should always test for serial correlation, that is, correlation
between error terms that are close together in time. If evidence of serial correlation is found, it is
common, but not always wise, to employ an estimation method that “corrects” for it, and there are
many such methods. The problem is that many types of specification error can produce the
appearance of serial correlation. For example, if the true model included the variable Z but we mistakenly omitted it, and Z were serially correlated, it is likely that we would find evidence of serial correlation. The right thing to do in this case would be to add Z to the model, not to estimate the misspecified model using a method that corrects for serial correlation.
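A minimal sketch of testing OLS residuals for serial correlation (with a simulated AR(1) error process; names are illustrative) might look as follows:

# Durbin-Watson and Breusch-Godfrey checks on the residuals of an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)

u = np.zeros(n)
for t in range(1, n):                # AR(1) errors: u_t = 0.7 * u_{t-1} + e_t
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1 + 2 * x + u

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))                 # values well below 2 suggest positive serial correlation
print(acorr_breusch_godfrey(fit, nlags=1)[:2])  # LM statistic and its p-value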

The specification of a linear regression model consists of a formulation of the regression relationships and of statements or assumptions concerning the explanatory variables and disturbances. If any of these is violated, e.g., an incorrect functional form, the improper introduction of the disturbance term in the model, etc., then specification error occurs. In a narrower sense, the specification error refers to the explanatory variables.

The complete regression analysis depends on the explanatory variables present in the model. It is understood in regression analysis that only correct and important explanatory variables appear in the model. In practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory variables which possibly influence the process or experiment. Generally, all such candidate variables are not used in the regression modeling, but a subset of explanatory variables is chosen from this pool.
While choosing a subset of explanatory variables, there are two possible options:

1. In order to make the model as realistic as possible, the analyst may include as many as
possible explanatory variables.
2. In order to make the model as simple as possible, one may include only fewer number of
explanatory variables.
In such selections, there can be two types of incorrect model specifications.
1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.
Now we discuss the statistical consequences arising from both situations.

When we think about model assumptions, we tend to focus on assumptions like independence,
normality, and constant variance.
The other big assumption, which is harder to see or test, is that there is no specification error. The assumption of linearity is part of this, but it's actually a bigger assumption.

What is this assumption of no specification error?

The basic idea is that when you choose a final model, you want to choose one that accurately represents the real relationships among variables. There are a few common ways of specifying a linear model inaccurately.

Specifying a linear relationship between X and Y when the relationship isn't linear. It's often the case that the relationship between a predictor X and Y isn't a straight line. Let's use a common one as an example: a curvilinear relationship. Specifying a line when the relationship is really a curve will result in less-than-optimal model fit, non-independent residuals, and inaccurate predicted values. One way to check for a curvilinear relationship is with bivariate graphing before you get started modeling. Many times (though not always) the fix is simple: a log transformation of X or the addition of a quadratic (X squared) term. Other ways to find it include residual graphs and, if they make theoretical sense, adding transformations of X to the model and assessing model fit.

Another example is an interaction term. If the effect of a variable X is moderated by another predictor, it means X doesn't have a simple linear relationship with Y. X's relationship with Y depends on the value of a third variable, the moderator. Including that interaction in the model will accurately represent the real relationship between X and Y. Failing to include it means mis-specification of X's real effect.

Consequences of specification error

Specification error often, but not always, causes other assumptions to fail. For example, sometimes
non-normality of the residuals is solved by adding a missed covariate or interaction term. So, the
first step in solving problems with other assumptions is usually not to jump to transformations or
some other complicated modeling, but to reassess the predictors you've put into the model.

3.6 ERRORS OF MEASUREMENT

In statistics, errors-in-variables models or measurement error models are regression models that
account for measurement errors in the independent variables. In contrast, standard regression
models assume that those regressors have been measured exactly, or observed without error; as
such, those models account only for errors in the dependent variables, or responses.[citation
needed]

Fig.: Illustration of regression dilution (or attenuation bias) in errors-in-variables models. Two regression lines bound the range of linear regression possibilities: the shallower slope is obtained when the independent variable (or predictor) is on the x-axis, and the steeper slope when the independent variable is on the y-axis.

In the case when some regressors have been measured with errors, estimation based on the
standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not
tend to the true values even in very large samples. For simple linear regression the effect is an
underestimate of the coefficient, known as the attenuation bias. In non-linear models the direction
of the bias is likely to be more complicated.
Measurement error refers to a circumstance in which the true empirical value of a variable cannot
be observed or measured precisely. The error is thus the difference between the actual value of that
variable and what can be observed or measured. For instance, household consumption/expenditures
over some intervals are often of great empirical interest (in many applications because of the
theoretical role they play under the forward-looking theories of consumption). These are usually
observed or measured via household surveys in which respondents are asked to catalog their
consumption/expenditures over some recall window. However, these respondents often cannot
recall precisely how much they spent on the various items over that window. Their reported
consumption/expenditures are thus unlikely to reflect precisely what they or their households
actually spent over the recall interval.

Unfortunately, measurement error is not without consequence in many empirical applications.


Perhaps the most widely recognized difficulty associated with measurement error is bias to
estimates of regression parameters. Consider the following regression model:

Y = β0 + β1 · x + ε

If y and x are observed precisely (and assuming no other complications), β0 and β1 can be estimated
via straightforward linear regression techniques. Suppose, however, that we actually observe x*,
which is x plus some randomly distributed error v:

x* = x + v

In terms of observed variables, our regression model now becomes

y = β0 + β1 · (x* − v) + ε
  = β0 + β1 · x* + ε − β1 · v
  = β0 + β1 · x* + ζ

where ζ = ε − β1 · v. From this setup, a problem should be immediately apparent. Because x* = x + v and ζ = ε − β1 · v, x* is correlated with the error term ζ, violating a central assumption of linear regression (that is, independence between regressors and the regression error) required to recover consistent, unbiased estimates of regression parameters. If one were to regress y on what we can observe (that is, x*), the probability limit of the estimate of β1 would be

plim β̂1 = β1 · σ²x / (σ²x + σ²v),

so that using x* does not yield a consistent estimate of β1:

|plim β̂1| < |β1|

If both v and ε are normally distributed, or if the conditional expectation from the regression model is linear, then this attenuation holds even in small samples as an expectation (Hausman 2001).

This is generally referred to as attenuation bias. While measurement error in right-hand-side explanatory variables will also result in biased and inconsistent estimates in a multiple regression framework, the direction of the bias is less clear and will depend on the correlations between the measurement errors of the various regressors. Similarly, biased and inconsistent estimates will be obtained when the measurement error v is correlated with ε or x, although once again the sign of the bias will no longer be clear a priori.
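The attenuation result is easy to see in a short simulation (all values are illustrative): with var(x) = var(v) = 1, the OLS slope on the mismeasured regressor converges to about half the true β1.

# Attenuation bias: OLS of y on x* = x + v shrinks the slope toward zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5000
x = rng.normal(0, 1.0, size=n)                  # true regressor, variance 1
v = rng.normal(0, 1.0, size=n)                  # measurement error, variance 1
x_star = x + v
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)

fit = sm.OLS(y, sm.add_constant(x_star)).fit()
print(fit.params[1])     # close to 2 * 1/(1 + 1) = 1 rather than the true 2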

Measurement error in left-hand side, dependent variables has a different consequence. To cleanly
separate issues, imagine that x can now be observed perfectly but that we cannot observe the
dependent variable y precisely, but only with a degree of error, as follows:

y* = y + v

Here y* is the observed variable. Thus, we cannot observe y directly because of some measurement
error v. Returning to our regression framework, we have

y = β0 + β1 · x + ε

which, in terms of observed variables, yields

y* − v = β0 + β1 · x + ε

or

y* = β0 + β1 · x + ζ

where ζ = ε + v. Since x is still uncorrelated with the new regression error ζ, straightforward linear regression of y* on x will still yield unbiased and consistent estimates of the regression parameters β0 and β1. However, the variance of ζ will in general exceed that of ε, implying more uncertain estimates (and hence higher standard errors and lower t-statistics for those parameters).
Because it is likely a ubiquitous condition (particularly with many variables typically found in
microlevel data, often based on interviews at the household, firm, or individual level),

many econometric remedies for measurement error have been proposed. Here we focus on the case
of measurement errors in right-hand side explanatory variables x because it is errors in these that
will actually lead to biased and inconsistent (as opposed to merely inefficient) estimates. While a
variety of practical solutions has been proposed, in practice one has become particularly popular:
instrumental variables.

In some sense the instrumental variables approach is rooted in part in the contributions of Vincent
Geraci (1976, 1977), who explored identification and estimation of systems of simultaneous
equations with generalized measurement error. Geraci established the necessary conditions for
identification and efficient estimation under such circumstances. Of particular importance, his work
stressed the need for prior restrictions sufficiently numerous to compensate for the additional
parameters introduced by the measurement error.

Despite the rather elaborate work in the context of systems of equations by Geraci and others, in
practice most instrumental variables estimation to surmount measurement error is carried out in a
simple, two-stage setting. Once again, to isolate issues, let us assume that y is observed without error but that x is not; specifically, assume that we actually observe x*, where

x* = x + v.

To implement the instrumental variables remedy for this sort of measurement error, one must have
some variable z (an instrument) that is correlated with the true value x and not the measurement
error v. Furthermore, z must be correlated with y only through its correlation with x. (Following standard results for instrumental variables estimation, z can be correlated with other observed determinants of y; what it cannot be correlated with is the regression error ζ = ε − β1 · v.) Once such
an instrument has been identified, the standard two-stage least squares procedure can be adopted:
Regress x* on z, use the fitted model to predict x*, and finally, regress y on the predicted x*. For
example, the case of mismeasured household consumption is often addressed through instruments
such as household income (often measured in a separate survey module), local prices (which
influence consumption, given income), and the like. What is required is a variable correlated with
the true measure and not the error. All the concerns regarding the predictive power of instruments (see, for example, Staiger and Stock 1997) apply.
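A hedged sketch of this two-stage procedure (simulated data; the instrument z, the error v, and all numbers are illustrative) is:

# Two-stage least squares by hand: regress x* on z, then regress y on the fitted values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)                          # instrument: correlated with x, not with v or eps
x = 1.0 * z + rng.normal(size=n)                # true regressor
x_star = x + rng.normal(size=n)                 # observed, error-ridden regressor
y = 1.0 + 2.0 * x + rng.normal(size=n)

stage1 = sm.OLS(x_star, sm.add_constant(z)).fit()     # stage 1: fit x* on z
x_hat = stage1.fittedvalues

stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()      # stage 2: fit y on the predicted x*
naive = sm.OLS(y, sm.add_constant(x_star)).fit()      # for comparison: attenuated OLS
print(stage2.params[1], naive.params[1])              # close to 2 versus a noticeably attenuated value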

The result that mismeasured explanatory variables lead to biased and inconsistent estimates generalizes to nonlinear regression and limited-dependent-variable models (such as logit and probit), although the instrumental variables solution discussed above is no longer effective. See Jerry Hausman (2001) for further discussion of the case of nonlinear regression and Douglas Rivers and Quang Vuong (1988) for solutions in the case of limited dependent variable models. Interestingly, measurement error in dependent variables can also lead to biased and inconsistent estimates of model parameters in limited dependent variable models; see Hausman (2001) for further discussion.
A fundamental assumption in all statistical analysis is that all the observations are correctly measured. In the context of the multiple regression model, it is assumed that the observations on the study and explanatory variables are observed without any error. In many situations, this basic assumption is violated. There can be several reasons for such a violation.

 For example, the variables may not be measurable, e.g., taste, climatic conditions, intelligence, education, ability, etc. In such cases, dummy variables are used, and the observations can be recorded in terms of the values of the dummy variables.

 Sometimes the variables are clearly defined, but it is hard to take correct observations. For example, age is generally reported in completed years or in multiples of five.

 Sometimes the variable is conceptually well defined, but it is not possible to take a correct observation on it. Instead, the observations are obtained on closely related proxy variables, e.g., the level of education is measured by the number of years of schooling.

 Sometimes the variable is well understood, but it is qualitative in nature. For example, intelligence is measured by intelligence quotient (IQ) scores.

In all such cases, the true value of the variable cannot be recorded. Instead, it is observed with some error. The difference between the observed and true values of the variable is called measurement error or errors-in-variables.

Measurement Error (also called Observational Error) is the difference between a measured
quantity and its true value. It includes random error (naturally occurring errors that are to be
expected with any experiment) and systematic error (caused by a mis-calibrated instrument that
affects all measurements).
For example, let's say you were measuring the weights of 100 marathon athletes. The scale you use is one pound off: this is a systematic error that will result in all athletes' body-weight calculations being off by a pound. On the other hand, let's say your scale was accurate. Some athletes might be more dehydrated than others. Some might have wetter (and therefore heavier) clothing or a 2 oz. candy bar in a pocket. These are random errors and are to be expected. In fact, all collected samples will have random errors; they are, for the most part, unavoidable.

Measurement errors can quickly grow in size when used in formulas. For example, if you're using a small error in a velocity measurement to calculate kinetic energy, your errors can easily quadruple. To account for this, you should use a formula for error propagation whenever you use uncertain measures in an experiment to calculate something else.

Large and small measurement errors

If the magnitude of the measurement errors is small, then they can be assumed to be merged in the disturbance term, and they will not affect the statistical inferences much. On the other hand, if they are large in magnitude, then they will lead to incorrect and invalid statistical inferences. For example, in the context of the linear regression model, the ordinary least squares estimator (OLSE) is the best linear unbiased estimator of the regression coefficients when measurement errors are absent. When measurement errors are present in the data, the same OLSE becomes a biased as well as an inconsistent estimator of the regression coefficients.

Types of Errors in Measurement-

The errors may arise from different sources and are usually classified into the following types:

Gross Errors

Systematic Errors

Random Errors

Fig. -3.4 Types of Errors in Measurement

1. Gross Errors-

The gross error occurs because of human mistakes. For example, the person using the instrument may take a wrong reading or record the data incorrectly. Such types of error come under the gross error, and they can only be avoided by taking the readings carefully.
For example – the experimenter reads 31.5ºC while the actual reading is 21.5ºC. This happens because of oversight: the experimenter takes the wrong reading, and because of this an error occurs in the measurement.
Such types of error are very common in measurement. The complete elimination of such errors is not possible. Some gross errors are easily detected by the experimenter, but some of them are difficult to find. Two methods can remove the gross error:
The readings should be taken very carefully.
Two or more readings of the measured quantity should be taken. The readings are taken by different experimenters and at different points to remove the error.

2. Systematic Errors-

The systematic errors are mainly classified into three categories.


Instrumental Errors
Environmental Errors
Observational Errors
(i) Instrumental Errors
These errors mainly arise due to three main reasons.
(a) Inherent Shortcomings of Instruments – Such types of errors are inbuilt in instruments because of their mechanical structure. They may be due to the manufacturing, calibration or operation of the device. These errors may cause the instrument to read too low or too high.
For example – if the instrument uses a weak spring, then it gives a high value of the measured quantity. The error may also occur in the instrument because of friction or hysteresis loss.
(b) Misuse of Instrument – The error occurs in the instrument because of the fault of the operator. A good instrument used in an unintelligent way may give erroneous results.
For example – misuse of the instrument may cause the failure to adjust the zero of the instrument, poor initial adjustment, or using leads of too high resistance. These improper practices may not cause permanent damage to the instrument, but all the same, they cause errors.
(c) Loading Effect – This is the most common type of error caused by the instrument in measurement work. For example, when a voltmeter is connected to a high-resistance circuit it gives a misleading reading, and when it is connected to a low-resistance circuit, it gives a dependable reading. This means the voltmeter has a loading effect on the circuit.
The error caused by the loading effect can be overcome by using the meters intelligently. For example, when measuring a low resistance by the ammeter-voltmeter method, a voltmeter having a very high value of resistance should be used.

(ii) Environmental Errors


These errors are due to the external condition of the measuring devices. Such types of errors mainly
occur due to the effect of temperature, pressure, humidity, dust, vibration or because of the
magnetic or electrostatic field. The corrective measures employed to eliminate or to reduce these
undesirable effects are
The arrangement should be made to keep the conditions as constant as possible.
Using the equipment which is free from these effects.
By using the techniques which eliminate the effect of these disturbances.
By applying the computed correction
(iii) Observational Errors
Such types of errors are due to wrong observation of the reading. There are many sources of observational error. For example, the pointer of a voltmeter rests slightly above the surface of the scale. Thus, an error occurs (because of parallax) unless the line of vision of the observer is exactly above the pointer. To minimize the parallax error, highly accurate meters are provided with mirrored scales.

3. Random Errors-
The error which is caused by a sudden change in the atmospheric conditions is called a random error. These types of error remain even after the removal of the systematic error. Hence, such type of error is also called residual error.

Different Measures of Error

Different measures of error include:


 Absolute Error: the amount of error in your measurement. For example, if you step on a scale and it says 150 pounds but you know your true weight is 145 pounds, then the scale has an absolute error of 150 lbs. – 145 lbs. = 5 lbs.

 Greatest Possible Error: defined as one half of the measuring unit. For example, if you use a ruler that measures in whole yards (i.e., without any fractions), then the greatest possible error is one half yard.

 Instrument Error: error caused by an inaccurate instrument (like a scale that is off or a poorly
worded questionnaire).

 Margin of Error: an amount above and below your measurement. For example, you might say
that the average baby weighs 8 pounds with a margin of error of 2 pounds (± 2 lbs.).

 Measurement Location Error: caused by an instrument being placed somewhere it shouldn't, like a thermometer left out in the full sun.

 Operator Error: human factors that cause error, like reading a scale incorrectly.

 Percent Error: another way of expressing measurement error. Defined as: Percent Error = |measured value − true value| / |true value| × 100%.
Ways to Reduce Measurement Error

 Double check all measurements for accuracy. For example, double-enter all inputs on two
worksheets and compare them.
 Double check your formulas are correct.
 Make sure observers and measurement takers are well trained.
 Make the measurement with the instrument that has the highest precision.
 Take the measurements under controlled conditions.
 Pilot test your measuring instruments. For example, put together a focus group and ask how
easy or difficult the questions were to understand.
 Use multiple measures for the same construct. For example, if you are testing for
depression, use two different questionnaires.

Statistical Procedures to Assess Measurement Error

The following methods assess "absolute reliability" (a small computational sketch follows the list):


 Standard error of measurement (SEM): estimates how repeated measurements taken with the same instrument are distributed around the true score.
 Coefficient of variation (CV): a measure of the variability of a distribution of repeated scores
or measurements. Smaller values indicate a smaller variation and therefore values closer to
the true score.

 Limits of agreement (LOA): gives an estimate of the interval where a proportion of the
differences lie between measurements.
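As a hedged computational sketch (the formulas follow common conventions, e.g., SEM taken as the SD of the test-retest differences divided by the square root of 2 and 95% limits of agreement as the mean difference plus or minus 1.96 SD; the data are simulated):

# SEM, coefficient of variation, and Bland-Altman limits of agreement for repeated measures.
import numpy as np

rng = np.random.default_rng(8)
true_score = rng.normal(70, 10, size=50)
test = true_score + rng.normal(0, 2, size=50)     # first measurement
retest = true_score + rng.normal(0, 2, size=50)   # repeated measurement

diff = test - retest
sem = diff.std(ddof=1) / np.sqrt(2)                                  # standard error of measurement
scores = np.concatenate([test, retest])
cv = 100 * scores.std(ddof=1) / scores.mean()                        # coefficient of variation (%)
loa = (diff.mean() - 1.96 * diff.std(ddof=1),
       diff.mean() + 1.96 * diff.std(ddof=1))                        # 95% limits of agreement
print(sem, cv, loa)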

3.7 MODEL SELECTION CRITERIA

Model selection is the task of selecting a statistical model from a set of candidate models, given
data. In the simplest cases, a pre-existing set of data is considered. However, the task can also
involve the design of experiments such that the data collected is well-suited to the problem of
model selection. Given candidate models of similar predictive or explanatory power, the
simplest model is most likely to be the best choice

Model selection criteria are rules used to select a statistical model among a set of candidate
models, based on observed data. Typically, the criteria try to minimize the expected dissimilarity,
measured by the Kullback-Leibler divergence, between the chosen model and the true model (i.e.,
the probability distribution that generated the data).
Competing models

First of all, we need to define precisely what we mean by statistical model.


A statistical model is a set of probability distributions that could have generated the data we are
analyzing.
Model selection is an important part of any statistical analysis, and indeed is central to the pursuit of science in general. Many authors have examined this question, from both frequentist and Bayesian perspectives, and many tools for selecting the "best model" have been suggested in the literature; a number of them evaluate the various proposals from a decision-theoretic perspective, as a way of bringing coherence to a complex and central question.
There are two main objectives in inference and learning from data. One is for scientific discovery,
understanding of the underlying data-generating mechanism, and interpretation of the nature of
the data. Another objective of learning from data is for predicting future or unseen observations. In
the second objective, the data scientist is not necessarily concerned with an accurate probabilistic
description of the data. Of course, one may also be interested in both directions.

In line with the two different objectives, model selection can also have two directions: model
selection for inference and model selection for prediction. The first direction is to identify the best
model for the data, which will preferably provide a reliable characterization of the sources of
uncertainty for scientific interpretation. For this goal, it is significantly important that the selected
model is not too sensitive to the sample size. Accordingly, an appropriate notion for evaluating
model selection is the selection consistency, meaning that the most robust candidate will be
consistently selected given sufficiently many data samples.

The second direction is to choose a model as machinery to offer excellent predictive performance.
For the latter, however, the selected model may simply be the lucky winner among a few close
competitors, yet the predictive performance can still be the best possible. If so, the model selection
is fine for the second goal (prediction), but the use of the selected model for insight and
interpretation may be severely unreliable and misleading. Moreover, for very complex models
selected this way, even predictions may be unreasonable for data only slightly different from those
on which the selection was made.

This paper discusses the problem of choosing a linear model from a set of nested alternatives. Two
popular devices for selecting a model in this situation have been model selection criteria and
conditional test sequences (model selection tests). If there are only two alternative models then
choosing a model for which the model selection criterion attains its minimum value is equivalent to
an ordinary F-test of the hypothesis that the more restrictive model is true. In certain cases, the
equivalence carries over to situations where both models are false. The paper makes use of this
equivalence when deriving approximate significance levels of model selection criteria for the
general case where the number of alternative models is larger than two but finite. The
approximations do not use any asymptotic theory. A simulation study is carried out to compare
model selection criteria and model selection tests. The latter, as a family, seem to be fully
competitive with the former but determining the significance levels of the individual tests in the
sequence is a crucial problem.
Model selection is a process researchers use to compare the relative value of different statistical models and determine which one is the best fit for the observed data.

The Akaike information criterion is one of the most common methods of model selection. AIC
weights the ability of the model to predict the observed data against the number of parameters the
model requires to reach that level of precision.

AIC model selection can help researchers find a model that explains the observed variation in their
data while avoiding overfitting.
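As a small illustration (simulated data; the candidate models are chosen for the example only), AIC-based selection simply compares the criterion across fitted candidates and prefers the smallest value:

# AIC comparison of a linear and a quadratic candidate model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=200)
y = 1 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 2, size=200)

linear = sm.OLS(y, sm.add_constant(x)).fit()
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

print(linear.aic, quadratic.aic)    # the model with the smaller AIC is preferred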

In its most basic forms, model selection is one of the fundamental tasks of scientific inquiry.
Determining the principle that explains a series of observations is often linked directly to a
mathematical model predicting those observations. For example, when Galileo performed his
inclined plane experiments, he demonstrated that the motion of the balls fitted the parabola
predicted by his model.

Of the countless number of possible mechanisms and processes that could have produced the data,
how can one even begin to choose the best model? The mathematical approach commonly taken
decides among a set of candidate models; this set must be chosen by the researcher. Often simple
models such as polynomials are used, at least initially. Burnham & Anderson
(2002) emphasize throughout their book the importance of choosing models based on sound
scientific principles, such as understanding of the phenomenological processes or mechanisms
(e.g., chemical reactions) underlying the data.

Once the set of candidate models has been chosen, the statistical analysis allows us to select the
best of these models. What is meant by best is controversial. A good model selection technique will
balance goodness of fit with simplicity. More complex models will be better able
to adapt their shape to fit the data (for example, a fifth-order polynomial can exactly fit six points),
but the additional parameters may not represent anything useful. (Perhaps those six points are
really just randomly distributed about a straight line.) Goodness of fit is generally determined using
a likelihood ratio approach, or an approximation of this,
leading to a chi-squared test. The complexity is generally measured by counting the number of
parameters in the model.

Model selection techniques can be considered as estimators of some physical quantity, such as the
probability of the model producing the given data. The bias and variance are both important
measures of the quality of this estimator; efficiency is also often considered.

A standard example of model selection is that of curve fitting, where, given a set of points and
other background knowledge (e.g. points are a result of i.i.d. samples), we must select a curve that
describes the function that generated the points.

3.8 SUMMARY

Regression analysis is a helpful statistical method that can be leveraged across an organization to
determine the degree to which particular independent variables are influencing dependent
variables.

The possible scenarios for conducting regression analysis to yield valuable, actionable business
insights are endless.

The next time someone in your business is proposing a hypothesis that states that one factor,
whether you can control that factor or not, is impacting a portion of the business, suggest
performing a regression analysis to determine just how confident you should be in that hypothesis!
This will allow you to make more informed business decisions, allocate resources more efficiently,
and ultimately boost your bottom line.

3.9 KEYWORDS

1. Model Specification- In statistics, model specification is part of the process of building a statistical model: specification consists of selecting an appropriate functional form for the model and choosing which variables to include.

2. Multiple Regression Model- Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses the information to estimate the values of the dependent variable. In multiple regression, the objective is to develop a model that relates a dependent variable y to more than one independent variable.

3.Standardize Variable- Standardization is the process of putting different variables on the same
scale. In regression analysis, there are some scenarios where it is crucial to standardize your
independent variables or risk obtaining misleading results.

4. Dependent Variable- A dependent variable is exactly what it sounds like: it is something that depends on other factors.

5.Specification Error- Specification Error-In the context of a statistical model, specification error
means that at least one of the key features or assumptions of the model is incorrect. In
consequence, estimation of the model may yield results that are incorrect or misleading.

3.10 LEARNING ACTIVITY

1. Write a Note on autocorrelation.


__________________________________________________________________________________
____________________________________________________________________
2. Explain specification error?
__________________________________________________________________________________
____________________________________________________________________
3. What does one mean by error of measurement?
__________________________________________________________________________________
____________________________________________________________________

3.11 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Which step/assumption in regression modelling impacts the trade-off between under-fitting and over-fitting the most?

2. Explain absolute and instrumental error.

3. Define a linear estimator.

4. What are the advantages of the BLUE estimator in the presence of autocorrelation?

5. What are the statistical procedures to assess measurement error?

Long Questions

1. Explain in detail the model selection criteria.

2. What are the errors of measurement?

3. What are the different ways to reduce measurement error?

4. Write a short note on autocorrelation.

5. What are the consequences of specification error?

B. Multiple Choice Questions

1. Which of the following are types of correlation?

a. Positive and Negative

b. Simple, Partial and Multiple

c. Linear and Nonlinear

d. All of these

2. Which of the following statements is true for correlation analysis?

a. It is a bivariate analysis

b. It is a multivariate analysis

c. It is a univariate analysis

d. Both a and c

3. If the values of two variables move in the same direction, ___________

a. The correlation is said to be non-linear

b. The correlation is said to be linear

c. The correlation is said to be negative

d. The correlation is said to be positive

4. Which of the following techniques is an analysis of the relationship between two variables
to help provide the prediction mechanism?

a. Standard error

b. Correlation

c. Regression

d. Progression

5. The original hypothesis is known as ______.

a. Alternate hypothesis

b. Null hypothesis

c. Both a and b are incorrect

d. Both a and b are correct

Answers

1-D, 2-C, 3-D, 4-C, 5-B

3.12 REFERENCES

 Gujarati, D., Porter, D.C and Gunasekhar, C (2012). Basic Econometrics (Fifth Edition)
McGraw Hill Education.

 Anderson, D. R., D. J. Sweeney and T. A. Williams. (2011). Statistics for Business and
Economics. 12th Edition, Cengage Learning India Pvt. Ltd.

 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third edition,


Thomson South-Western, 2007.

 Johnstone, J., Econometric Methods, 3rd Edition, McGraw Hill, New York, 1994.

 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic Press, 2002 (IGM Library Call No. 330.0182 R14I).

UNIT -4 REGRESSIONS WITH QUALITATIVE
INDEPENDENT VARIABLES

STRUCTURE
4.0 Learning Objective

4.1 Introduction

4.2 Dummy variable technique

4.3 Testing structural stability of regression models comparing to regressions

4.4 Interaction effects, seasonal analysis

4.5 Piecewise linear regression

4.6 Use of dummy variables

4.7 Dummy variable trap

4.8 Regression with dummy dependent variables

4.9 Logit, Probit, Tobit models – Applications

4.10 Summary

4.11 Keywords

4.12 Learning Activity

4.13 Unit End Questions

4.14 References

4.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:


 This module will help the students to comprehend the concept of Dummy variable
technique.
 This module will also introduce the Regression with dummy dependent variables, Logit,
Probit, Tobit models – Applications
 This module will help to understand the Testing structural stability of regression models
 This module will help the students to understand Dummy variable trap

4.1 INTRODUCTION

Regression analysis often involves quantitative variables, such as monetary values, years of experience, the proportion of persons that share a particular attribute, etc. However, there are situations when we want to include qualitative factors. For instance, after accounting for variations in experience and education level, do gender or marital status matter for people's wages? Does race affect compensation or the likelihood of finding employment? Will the USA's trade patterns be significantly altered by the implementation of NAFTA? In each of these situations the variable we are interested in is categorical or qualitative: while it is not numerical in and of itself, several types of numerical coding can be applied to it.

Using the dummy-variable approach, these variables may be included in regression analysis. Although this approach is extremely broad, let us begin with the most straightforward scenario, where the qualitative variable under consideration is binary, meaning that there are only two potential values (male versus female, prior to NAFTA versus after NAFTA).

As a rule, binary variables are coded with the values 0 and 1. For example, we may create a gender dummy variable with a value of 1 for the men in our sample and a value of 0 for the women, or a dummy variable for NAFTA by putting a 0 for years before NAFTA and a 1 for years after it was signed into effect.
If you have a continuous dependent variable and several independent factors, you may use
regression analysis to make predictions about the dependent variable. Use logistic regression if your
dependent variable can be categorized into two categories. If the proportion of cases falling into
each of the two categories of the dependent variable is somewhat close to 50-50, then the findings
from logistic and linear regression will be comparable. Regression can be performed with either
continuous or categorical independent variables. When doing a regression analysis, it is possible to
employ independent variables with more than two levels if they are transformed into variables with
just two levels. This is known as dummy coding, which will be described further on. Although
regression may be used with modified variables, it is most commonly employed with naturally
occurring variables. Remember that causal links among the variables cannot be identified using
regression analysis. Although we state that X "predicts" Y, we cannot claim that X "causes" Y because
of the way the phrase is constructed.
All the statistical methods we have developed so far have been for quantitative dependent variables,
measured on more-or-less continuous scales. The assumptions of linear regression—in particular,
that the mean value of the population at any combination of the independent variables be a linear
function of the independent variables and that the variation about the plane of means be normally
distributed—required that the dependent variable be measured on a continuous scale. In contrast,
because we did not need to make any assumptions about the nature of the independent variables,

we could incorporate qualitative or categorical information (such as whether or not a Martian was
exposed to secondhand tobacco smoke) into the independent variables of the regression model.
There are, however, many times when we would like to evaluate the effects of multiple independent
variables on a qualitative dependent variable, such as the presence or absence of a disease. Because
the methods that we have developed so far depend strongly on the continuous nature of the
dependent variable, we will have to develop a new approach to deal with the problem of regression
with a qualitative dependent variable.

To meet this need, we will develop two related statistical techniques, logistic regression in this
chapter and the Cox proportional hazards model in Chapter 13. Logistic regression is used when we
are seeking to predict a dichotomous outcome from one or more independent variables, all of which are known at a given time. The Cox proportional hazards model is used when we are
following individuals for varying lengths of time to see when events occur and how the pattern of
events over time is influenced by one or more additional independent variables.

To develop these two techniques, we need to address three related issues.


We need a dependent variable that represents the two possible qualitative outcomes in the
observations. We will use a dummy variable, which takes on values of 0 and 1, as the dependent
variable.

We need a way to estimate the coefficients in the regression model because the ordinary least-
squares criterion we have used so far is not relevant when we have a qualitative dependent variable.
We will use maximum likelihood estimation.

We need statistical hypothesis tests for the goodness of fit of the regression model and whether or
not the individual coefficients in the model are significantly different from zero, as well as
confidence intervals for the individual coefficients.

4.2 DUMMY VARIABLE TECHNIQUE

A dummy variable is a numerical variable used in regression analysis to represent subgroups of the
sample in your study. In research design, a dummy variable is often used to distinguish different
treatment groups. In the simplest case, we would use a 0,1-dummy variable where a person is given a value of 0 if they are in the control group or a 1 if they are in the treated group. Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups. This means that we don't need to write out separate equation models for each subgroup. The dummy variables act like 'switches' that turn various parameters on and off in an equation. Another advantage of a 0,1 dummy-coded variable is that even though it is a nominal-level variable you can treat it statistically like an interval-level variable (if this made no sense to you, you probably should refresh your memory on levels of measurement). For instance, if you take an average of a 0,1 variable, the result is the proportion of 1s in the distribution.

Fig.- 4.1 Dummy Variable Technique

To illustrate dummy variables, consider the simple regression model for a posttest-only two-group randomized experiment. This model is essentially the same as conducting a t-test on the posttest means for two groups or conducting a one-way Analysis of Variance (ANOVA). The key term in the model is b1, the estimate of the difference between the groups. To see how dummy variables work, we'll use this simple model to show you how to use them to pull out the separate sub-equations for each subgroup. Then we'll show how you estimate the difference between the subgroups by subtracting their respective equations. You'll see that we can pack an enormous amount of information into a single equation using dummy variables. All I want to show you here is that b1 is the difference between the treatment and control groups.

To see this, the first step is to compute what the equation would be for each of our two groups separately. For the control group, Z = 0. When we substitute that into the equation, and recognize that by assumption the error term averages to 0, we find that the predicted value for the control group is b0, the intercept. Now, to figure out the treatment group line, we substitute the value of 1 for Z, again recognizing that by assumption the error term averages to 0. The equation for the treatment group indicates that the treatment group value is the sum of the two beta values.

Fig. -4.2 Dummy Variable Technique (a)

Now, we're ready to move on to the second step – computing the difference between the groups. How do we determine that? Well, the difference must be the difference between the equations for the two groups that we worked out above. In other words, to find the difference between the groups we just find the difference between the equations for the two groups! It should be obvious from the figure that the difference is b1. Think about what this means. The difference between the groups is b1. OK, one more time just for the sheer heck of it: the difference between the groups in this model is b1.

Fig. -4.2 Dummy Variable Technique (b)

Whenever you have a regression model with dummy variables, you can always see how the variables are being used to represent multiple subgroup equations by following the two steps described above (a short numerical sketch follows the list):

 create separate equations for each subgroup by substituting the dummy values
 find the difference between groups by finding the difference between their equations
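Here is a short numerical sketch of those two steps (simulated data; group sizes and effect sizes are illustrative): fitting y = b0 + b1*Z + e with a 0,1 dummy Z reproduces the difference between the treatment and control means as b1.

# Posttest-only two-group model: b1 equals the treatment-minus-control difference in means.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
Z = np.repeat([0, 1], 50)                          # 0 = control, 1 = treatment
y = 50 + 5 * Z + rng.normal(0, 10, size=100)       # true group difference of 5

fit = sm.OLS(y, sm.add_constant(Z)).fit()
print(fit.params)                                  # b0 ~ control mean, b1 ~ group difference
print(y[Z == 1].mean() - y[Z == 0].mean())         # identical to the estimate of b1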

Dummy variables assign the numbers '0' and '1' to indicate membership in any mutually exclusive and exhaustive category.
1. The number of dummy variables necessary to represent a single attribute variable is equal to the number of levels (categories) in that variable minus one.
2. For a given attribute variable, none of the dummy variables constructed can be redundant. That is, one dummy variable cannot be a constant multiple or a simple linear relation of another.
3. The interaction of two attribute variables (e.g., Gender and Marital Status) is represented by a third dummy variable which is simply the product of the two individual dummy variables.

A dummy variable (aka, an indicator variable) is a numeric variable that represents categorical data,
such as gender, race, political affiliation, etc.

Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small;
they can take on only two quantitative values. As a practical matter, regression results are easiest to
interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the
presence of a qualitative attribute, and 0 represents the absence.

How Many Dummy Variables?

The number of dummy variables required to represent a particular categorical variable depends on
the number of values that the categorical variable can assume. To represent a categorical variable
that can assume k different values, a researcher would need to define k - 1 dummy variables.

For example, suppose we are interested in political affiliation, a categorical variable that might
assume three values - Republican, Democrat, or Independent. We could represent political affiliation
with two dummy variables:

X1 = 1, if Republican; X1 = 0, otherwise.

X2 = 1, if Democrat; X2 = 0, otherwise.

In this example, notice that we don't have to create a dummy variable to represent the "Independent" category of political affiliation. If X1 equals zero and X2 equals zero, we know the voter is neither Republican nor Democrat. Therefore, the voter must be Independent.
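A hedged sketch of this coding (illustrative data; any of the k categories can serve as the omitted reference) uses k - 1 = 2 dummies for the three-category affiliation variable:

# k-1 dummy coding for political affiliation, with "Independent" as the reference category.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
party = rng.choice(["Republican", "Democrat", "Independent"], size=300)
y = rng.normal(50, 10, size=300)                   # some outcome of interest (illustrative)

dummies = pd.get_dummies(party, dtype=float)[["Republican", "Democrat"]]   # X1 and X2 only
fit = sm.OLS(y, sm.add_constant(dummies)).fit()
print(fit.params)   # intercept = Independent mean; coefficients = differences from that mean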

4.3 TESTING STRUCTURAL STABILITY OF REGRESSION MODELS


– COMPARISON OF REGRESSION

Chow Test for Structural Stability-


The Chow test (Chinese: 鄒檢定), proposed by econometrician Gregory Chow in 1960, is a test of
whether the true coefficients in two linear regressions on different data sets are equal. In
econometrics, it is most commonly used in time series analysis to test for the presence of a
structural break at a period which can be assumed to be known a priori (for instance, a major
historical event such as a war). In program evaluation, the Chow test is often used to determine
whether the independent variables have different impacts on different subgroups of the population.

A series of data can often contain a structural break, due to a change in policy or a sudden shock to the economy, e.g., the 1987 stock market crash. In order to test for a structural break, we often use the Chow test; this is Chow's first test (the second test relates to predictions). The test in effect uses an F-test to determine whether a single regression is more efficient than two separate regressions involving splitting the data into two sub-samples. This could occur as follows, where in the second case we have a structural break at time t:

Fig.-4.4 Chow Test for Structural Stability (plots of y against x): Case 1 shows a single regression line (Model 1); Case 2 shows a structural break at time t, with Model 1 before the break and Model 2 after it.

This suggests that model 1 applies before the break at time t, and model 2 applies after the structural break. If the parameters in the two models are the same, then models 1 and 2 can be expressed as a single model, as in Case 1, where there is a single regression line. The Chow test basically tests whether the single regression line or the two separate regression lines fit the data best. The stages in running the Chow test are (a worked numerical sketch follows the list):
1. Firstly, run the regression using all the data, before and after the structural break, and collect RSSc.
2. Run two separate regressions on the data before and after the structural break, collecting the RSS in both cases, giving RSS1 and RSS2.
3. Using these three values, calculate the test statistic from the following formula:
F = {[RSSc − (RSS1 + RSS2)] / k} / {(RSS1 + RSS2) / (n − 2k)}
4. Find the critical values in the F-test tables; in this case the statistic has F(k, n − 2k) degrees of freedom.
5. Conclude: the null hypothesis is that there is no structural break.
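As a worked numerical sketch of these stages (the break point, sample sizes and coefficients below are simulated for illustration, with k = 2 parameters per regression):

# Chow test: compare the pooled RSS with the RSS from the two sub-sample regressions.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(12)
n1, n2, k = 60, 60, 2
x1 = rng.normal(size=n1)
y1 = 1 + 2 * x1 + rng.normal(size=n1)              # regime before the break
x2 = rng.normal(size=n2)
y2 = 3 + 0.5 * x2 + rng.normal(size=n2)            # regime after the break

def rss(y, x):
    return sm.OLS(y, sm.add_constant(x)).fit().ssr

rss_c = rss(np.concatenate([y1, y2]), np.concatenate([x1, x2]))   # pooled regression
rss_1, rss_2 = rss(y1, x1), rss(y2, x2)

n = n1 + n2
F = ((rss_c - (rss_1 + rss_2)) / k) / ((rss_1 + rss_2) / (n - 2 * k))
p_value = 1 - stats.f.cdf(F, k, n - 2 * k)
print(F, p_value)        # a small p-value rejects the null of no structural break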
Multicollinearity

This occurs when there is an approximate linear relationship between the explanatory variables, which could lead to unreliable regression estimates, although the OLS estimates are still BLUE. In general, it leads to the standard errors of the parameters being too large, so the t-statistics tend to be insignificant. The explanatory variables are always related to an extent, and in most cases this is not a problem; it becomes one only when the relationship becomes too strong. A further difficulty is that multicollinearity is hard to detect and it is hard to decide when it is a problem. The main ways of detecting it are (a small simulated example follows the list below):
 The regression has a high R2 statistic, but few if any of the t-statistics on the explanatory variables are significant.
 The simple correlation coefficient between the two explanatory variables in question can be used, although the cut-off between acceptable and unacceptable correlation can be a problem.

If multicollinearity does appear to be a problem, then there are a number of ways of remedying it. The obvious solution is to drop one of the variables suffering from multicollinearity; however, if this is an important variable for the model being tested, this might not be an option. Other ways of overcoming the problem could be:
 Finding additional data: an alternative sample of data might not produce any evidence of multicollinearity. Also, increasing the sample size can reduce the standard errors of the explanatory variables, helping to overcome the problem.
 Use an alternative technique to the standard OLS technique.
 Transform the variables, for instance taking logarithms of the variables or differencing them (i.e., dy = y - y(-1)).
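Returning to detection, here is the small simulated example referred to above (illustrative numbers): a high R2 together with weak individual t-statistics, and a pairwise correlation close to one.

# Multicollinearity symptoms: high R-squared, insignificant t-statistics, near-unit correlation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)          # x2 is nearly collinear with x1
y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared, fit.tvalues)               # high R2, individually weak t-statistics
print(np.corrcoef(x1, x2)[0, 1])               # simple correlation coefficient close to 1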

4.4 INTERACTION EFFECTS, SEASONAL ANALYSIS

In statistics, an interaction may arise when considering the relationship among three or more
variables, and describes a situation in which the effect of one causal variable on an outcome
depends on the state of a second causal variable (that is, when effects of the two causes are not
additive). Although commonly thought of in terms of causal relationships, the concept of an
interaction can also describe non-causal associations. Interactions are often considered in the
context of regression analyses or factorial experiments.

The presence of interactions can have important implications for the interpretation of statistical
models. If two variables of interest interact, the relationship between each of the interacting
variables and a third "dependent variable" depends on the value of the other interacting variable. In
practice, this makes it more difficult to predict the consequences of changing the value of a variable,
particularly if the variables it interacts with are hard to measure or difficult to control.

An interaction in statistics refers to a scenario where the impact of one causal variable on an
outcome relies on the state of a second causal variable and may occur while studying the link among
three or more variables (that is, when effects of the two causes are not additive). Even while causal linkages are frequently conceived of in terms of interactions, non-causal correlations can also
be described by interactions. Regression analysis and factorial experiments frequently take
interactions into account.

Interactions can have significant effects on how statistical models should be interpreted. When two
variables of interest interact, each interacting variable's connection with a third "dependent
variable" is based on the value of the other interacting variable. In actuality, this makes it more
challenging to forecast the effects of altering the value of a variable, especially if the factors with
which it interacts are challenging to measure or to control.

In social and health science research, the concept of "interaction" is closely connected to that of
"moderation": the interaction of an explanatory variable and an environmental variable implies that
the explanatory variable's influence has been changed or moderated by the environmental variable.

An interaction variable or interaction feature is a variable constructed from an original set of
variables to try to represent either all of the interaction present or some part of it. In exploratory
statistical analyses it is common to use products of original variables as the basis of testing whether
interaction is present with the possibility of substituting other more realistic interaction variables at
a later stage. When there are more than two explanatory variables, several interaction variables are
constructed, with pairwise-products representing pairwise-interactions and higher order products
representing higher order interactions.

For example, a binary factor A and a quantitative variable X may interact (be non-additive) when analyzed with respect to an outcome variable Y.

Thus, for a response Y and two variables x1 and x2, an additive model would be:

Y = β0 + β1*x1 + β2*x2 + error

In contrast to this,

Y = β0 + β1*x1 + β2*x2 + β3*x1*x2 + error

is an example of a model with an interaction between variables x1 and x2 ("error" refers to the random variable whose value is that by which Y differs from the expected value of Y; see errors and residuals in statistics). Often, models are presented without the interaction term β3*x1*x2, but this confounds the main effect and interaction effect (i.e., without specifying the interaction term, it is possible that any main effect found is actually due to an interaction).

What is an Interaction Effect in Regression?

An interaction effect is the simultaneous effect of two or more independent variables on at least one
dependent variable in which their joint effect is significantly greater (or significantly less) than the
sum of the parts. It helps in understanding how two or more independent variables work in tandem
to impact the dependent variable.

It is important to understand two components first: main effects and interaction effects.

Recognizing the Importance of Interaction Effects in Statistical Analysis


When the impact of one variable relies on the value of another, we speak of an interaction effect. In
regression models, ANOVA, and well-constructed studies, interaction effects frequently arise. In this
piece, I'll go through what interaction effects are, how to test for them, how to understand
interaction models, and the potential pitfalls of ignoring them altogether.

Many factors can alter the results of any experiment, whether it's a taste test or production process
research. The results are highly sensitive to these factors being altered. Changing the condiment
used in a taste test, for instance, may have a significant impact on the participants' enjoyment of the
meal. Analysts use models to evaluate the strength of the association between each independent variable and the dependent variable; this type of result is called a main effect. Even though identifying main effects is usually a simple process, focusing solely on those variables may be misleading.

The independent variables may interact with one another in increasingly intricate research domains.
When a third factor enters into the equation between a given independent and dependent variable,
we say that there is an interaction effect. The connection between an independent and dependent
variable shifts based on the value of a third variable, leading statisticians to conclude that these variables interact. If the real world behaves in this way, it is essential to include this kind of
influence in your model. As we'll see in this essay, the link between condiments and enjoyment likely
varies with the type of cuisine.

To illustrate the use of interaction effects with categorical independent variables, consider the following. Interaction effects may be thought of as an "it depends" type of impact. To begin conceptualizing these impacts in an interaction model, let's look at an intuitive example.

Main Effects:

A main effect is the effect of a single independent variable on the dependent variable, ignoring the effects of all other independent variables.

Interaction Effect:

As mentioned above, the simultaneous effect of two or more independent variables on at least one
dependent variable in which their joint effect is significantly greater (or significantly less) than the
sum of the parts.
This discussion is limited to the interaction between two variables.

Interaction Effect can be between two:

1. Categorical variables
2. Continuous variables
3. One categorical and one continuous variable
For each of these scenarios, the interpretation would vary slightly.

1. Between categorical variables:

Imagine someone is trying to lose weight. Weight Loss could be a result of exercising or
following a diet plan or due to both working in tandem.

The numbers referred to above indicate weight loss in kg. What does the first result indicate?
a) It shows that exercising alone is more effective than the diet plan and results in 5 kg weight loss.
b) Exercising alone causes more weight loss than when exercising and the diet plan are followed together (the diet plan is not working :)).
What does the second result indicate?
It shows that the weight loss is higher when exercising and the diet plan are implemented together. So, in that case, we can say that there is an interaction effect between exercising and the diet plan.

2. Between continuous variables

Let us view a regression equation showing both main effect and interaction effect components:
Y = β0 + β1*X1 + β2*X2 + β3*X1*X2
The above equation is interpreted as follows:
a) β1 is the effect of X1 on Y when X2 equals 0, i.e., a one unit increase in X1 causes a β1 unit increase in Y when X2 equals 0.
b) Similarly, β2 is the effect of X2 on Y when X1 equals 0, i.e., a one unit increase in X2 causes a β2 unit increase in Y when X1 equals 0.
c) When neither X1 nor X2 is zero, the effect of X1 on Y depends on X2 and the effect of X2 on Y depends on X1.
To make this clearer, let us rewrite the above equation in another format:
Y = β0 + (β1 + β3*X2)*X1 + β2*X2
=> Y = β0 + β1*X1 + (β2 + β3*X1)*X2
=> (β1 + β3*X2) is the effect of X1 on Y and it depends on the value of X2
=> (β2 + β3*X1) is the effect of X2 on Y and it depends on the value of X1
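
As a minimal, hedged illustration of this specification, the sketch below simulates data and fits Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 with the statsmodels formula interface; the data, coefficient values, and variable names are invented for the example.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which the effect of X1 on Y genuinely depends on X2.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"X1": rng.uniform(0, 10, n), "X2": rng.uniform(0, 10, n)})
df["Y"] = 2 + 1.5 * df.X1 + 0.8 * df.X2 + 0.4 * df.X1 * df.X2 + rng.normal(size=n)

# "X1 * X2" expands to the main effects plus the X1:X2 interaction term.
model = smf.ols("Y ~ X1 * X2", data=df).fit()
print(model.params)

# The slope of X1 is (β1 + β3*X2), so it changes with the value of X2.
b = model.params
for x2 in (0, 5, 10):
    print(f"effect of X1 on Y when X2 = {x2}: {b['X1'] + b['X1:X2'] * x2:.2f}")

With the simulated coefficients the printed slopes should come out near 1.5, 3.5 and 5.5, matching β1 + β3*X2.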
Please note that this discussion is written with respect to the inputs/variables used for Market Mix Modeling (MMM). The above interpretation is a likely scenario for MMM, where the inputs can take a zero value. For a scenario where the input variables cannot be zero, other measures (such as centering the variables) are taken. An example could be a model where a person's weight is one of the regressors; a person's weight cannot be zero :)

3. One continuous variable and one categorical variable

The interaction between one categorical variable and one continuous variable is similar to two
continuous variables.
Let‘s go back to our regression equation:
Y = β0 + β1* X1 + β2*X2 + β3* X1X2
Where X1 is categorical variable, say (Female = 1, Male = 0)
And X2 = Continuous variable
When X1 = 0, Y = β0 + β2*X2
=> A one unit increase in X2 will cause a β2 unit increase in Y for males.
When X1 = 1, Y = β0 + β1 + (β2 + β3)*X2
=> A one unit increase in X2 will cause a (β2 + β3) unit increase in Y for females. The effect of X2 on Y is therefore higher for females than for males (see Fig. 4.5 below).

Fig. 4.5 Between continuous variables

Interpretation of Interaction in MMM:

1. Both categorical variables:

Let's take two categorical variables: seasonality and a product launch.
Assume that both seasonality and the product launch have a positive relationship with sales. Seasonality and the product launch in their individual capacities will lead to sales. If there is an interaction effect between them, this might lead to incremental sales.
Y = β0 + β1* Seasonality + β2*Product launch + β3* Seasonality * Product Launch
=> Y = β0 + β1 + β2 + β3
where Seasonality and Product Launch = 1
In case there is no interaction, Y = β0 + β1 + β2

2. Both continuous variables:

An example of an interaction between two continuous variables in MMM could be the effect of TV advertisement and digital ads together on sales.
So, when there is an interaction term, the effect of TV ads on sales depends on digital ads and the effect of digital ads on sales depends on TV ads.
Y = β0 + β1*TV Ad + β2*Digital Ad + β3*TV Ad*Digital Ad -> positive interaction term
If the interaction term is positive, then the joint effect of these two variables is synergistic, as it leads to additional sales. It is suggested that both types of ads be run simultaneously to get higher sales.

Y = β0 + β1* TV Ad + β2*Digital Ad - β3* TV Ad * Digital Ad -> Negative interaction term
If the interaction term is negative, the interaction component takes away some part of sales, thus reducing the overall sales. In this scenario, it is suggested not to run both campaigns simultaneously, as doing so takes away sales (the campaigns may be creating confusion among customers :P).
Note that the main effects of these two inputs are positive, but the combined effect has a negative beta value, resulting in a reduction in total sales.

3. One continuous variable and one categorical variable

Here X1 is a categorical variable, say Seasonality (1 if there is seasonality, 0 otherwise),
X2 is a continuous variable, TV advertisement, and
Y = Sales.
Sales are impacted by seasonality and TV advertisement individually and when they work together.
Y = β0 + β1*Seasonality + β2*TV Ad + β3*TV Ad*Seasonality
In this scenario, when the seasonality component is present:
Y = β0 + β1 + β2*TV Ad + β3*TV Ad
=> Y = β0 + β1 + (β2 + β3)*TV Ad
The interaction effect between TV advertising and seasonality has led to additional sales.
This was a brief overview of interaction effects between variables. The interaction effect is a vast topic in itself, and there are more nuances to it.
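
To make the seasonality example above concrete, the following is a small, hedged sketch that simulates weekly sales data and recovers the conditional TV-ad slope (β2 out of season, β2 + β3 in season). All numbers and names here are invented for illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 156  # three years of weekly observations
df = pd.DataFrame({
    "tv_ad": rng.uniform(0, 100, n),          # TV advertising spend
    "season": rng.integers(0, 2, n),          # 1 in seasonal weeks, 0 otherwise
})
df["sales"] = (50 + 5 * df.season + 0.8 * df.tv_ad
               + 0.3 * df.season * df.tv_ad + rng.normal(scale=5, size=n))

fit = smf.ols("sales ~ season * tv_ad", data=df).fit()
b = fit.params
print("TV-ad slope out of season (β2):  ", round(b["tv_ad"], 2))
print("TV-ad slope in season (β2 + β3): ", round(b["tv_ad"] + b["season:tv_ad"], 2))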

4.5 PIECEWISE LINEAR REGRESSION

In statistics, linear regression is a linear approach for modelling the relationship between a scalar
response and one or more explanatory variables (also known as dependent and independent
variables). The case of one explanatory variable is called simple linear regression; for more than one,
the process is called multiple linear regression. This term is distinct from multivariate linear
regression, where multiple correlated dependent variables are predicted, rather than a single scalar
variable.

In linear regression, the relationships are modeled using linear predictor functions whose unknown
model parameters are estimated from the data. Such models are called linear models. Most
commonly, the conditional mean of the response given the values of the explanatory variables (or
predictors) is assumed to be an affine function of those values; less commonly, the conditional
median or some other quantile is used. Like all forms of regression analysis, linear regression focuses
on the conditional probability distribution of the response given the values of the predictors, rather
than on the joint probability distribution of all of these variables, which is the domain of multivariate
analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used
extensively in practical applications. This is because models which depend linearly on their unknown
parameters are easier to fit than models which are non-linearly related to their parameters and
because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications fall into one of the following two broad
categories:

If the goal is prediction, forecasting, or error reduction,[clarification needed] linear regression can be
used to fit a predictive model to an observed data set of values of the response and explanatory
variables. After developing such a model, if additional values of the explanatory variables are
collected without an accompanying response value, the fitted model can be used to make a
prediction of the response.

If the goal is to explain variation in the response variable that can be attributed to variation in the
explanatory variables, linear regression analysis can be applied to quantify the strength of the
relationship between the response and the explanatory variables, and in particular to determine
whether some explanatory variables may have no linear relationship with the response at all, or to
identify which subsets of explanatory variables may contain redundant information about the
response.

Linear regression models are often fitted using the least squares approach, but they may also be
fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least
absolute deviations regression), or by minimizing a penalized version of the least squares cost
function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least
squares approach can be used to fit models that are not linear models. Thus, although the terms
"least squares" and "linear model" are closely linked, they are not synonymous.

For a relationship between a response variable (Y) and an explanatory variable (X), different
linear relationships may apply for different ranges of X. A single linear model will not provide an
adequate description of the relationship. Often a non-linear model will be most appropriate in this
situation, but sometimes there is a clear break point demarcating two different linear relationships.
Piecewise linear regression is a form of regression that allows multiple linear models to be fitted to
the data for different ranges of X.

The regression function at the breakpoint may be discontinuous, but it is possible to specify the
model such that the model is continuous at all points. For such a model the two equations for Y
need to be equal at the breakpoint. Non-linear least squares regression techniques can be used to
fit the model to the data.
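
A minimal sketch of this idea follows, assuming a single breakpoint and a model kept continuous at that point by using a "hinge" term max(x - c, 0); the breakpoint is chosen here by a simple grid search over the sum of squared errors. The data and breakpoint grid are invented for illustration.

import numpy as np

# Simulated data with a true break at x = 6 and a slope change from 2 to -0.5.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 120))
y = np.where(x < 6, 1 + 2 * x, 13 - 0.5 * (x - 6)) + rng.normal(scale=0.5, size=x.size)

def fit_with_break(c):
    # Columns: intercept, x, hinge term max(x - c, 0); the hinge keeps the fit
    # continuous at the breakpoint while allowing the slope to change there.
    X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return sse, beta

candidates = np.linspace(1, 9, 161)
best_sse, best_c = min((fit_with_break(c)[0], c) for c in candidates)
_, beta = fit_with_break(best_c)
print(f"estimated breakpoint  ~ {best_c:.2f}")
print(f"slope below the break ~ {beta[1]:.2f}, slope above ~ {beta[1] + beta[2]:.2f}")

With more than one breakpoint the same idea extends by adding one hinge column per breakpoint, which is essentially what segmented-regression packages automate.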

Linear regression is a method for modelling the correlation between a scalar response and a set of
predictors that are all assumed to be linearly related (also known as dependent and independent
variables). A basic linear regression is performed when there is only one explanatory variable,
whereas a multiple linear regression analysis is performed when there are numerous explanatory
variables. This word is used in contrast to multivariate linear regression, which predicts several
dependent variables that are all interrelated.

Linear regression is a method for modelling relationships by estimating the unknown model
parameters using the available data. Linear models are one kind of statistical representation. The
conditional mean of the answer is often employed since it is expected to be an affine function of
the explanatory factors (or predictors), while the conditional median or another quantile is used
less frequently. Linear regression, like other types of regression analysis, is concerned with the
probability distribution of the answer given the values of the predictors, as opposed to the
combined probability distribution of all of these variables, which is the province of multivariate
analysis.

Among the many types of regression analysis, linear regression was the first to be researched in
depth and put to widespread use in the real world. This is because it is simpler to fit linear models
to data and to assess the statistical features of the resulting estimators than it is to fit non-linear
models to data.

The field of linear regression has several applications. Most software may be classified as either of
these two types:

To fit a predictive model to an observed data set of values of the response and explanatory
variables, linear regression is commonly employed when the objective is prediction, forecasting, or
error reduction. Once such a model has been developed, it may be used to predict the answer if
new values of the explanatory variables are gathered without a corresponding response value.

To determine whether some explanatory variables may have no linear relationship with the
response at all, or to identify which subsets of explanatory variables may contain redundant
information, linear regression analysis can be applied if the goal is to explain variation in the
response variable that can be attributed to variation in the explanatory variables.

Although least squares is the most used method for fitting linear regression models, there are
alternative methods that may be used instead. For example, least absolute deviations regression or
ridge regression (L2-norm penalty) and lasso regression can be used to build linear regression
models (L1-norm penalty). Alternatively, non-linear models can be fitted using the least squares
method. Therefore, although "least squares" and "linear model" are often used interchangeably,
they are not the same thing.

There is often no linearity in real-world data. Fitting a line and obtaining a perfect model on
nonlinear and non-monotonic datasets is notoriously challenging. Though sophisticated models
such as SVM, Trees, and Neural Networks are available, they often come at the expense of being
easily explained and interpreted.

When the decision boundaries are not convoluted, is there a compromise that can be made?

16
Clearly, it's all in the name: piecewise regression divides a data set into smaller, independently analyzed pieces and then applies a linear regression to each subset. The points at which two pieces separate are known as break points.

Using a small dataset for demonstration, we will plot the results of a Linear and Piecewise linear
regression analysis.

Fig. 4.6 Data points

Fig. 4.7 Linear fit    Fig. 4.8 Piecewise linear fit


If you compare the figure above with one using a piecewise fit, you'll see that the linear fit
produces a greater standard error. The piecewise plot shown in the previous section may appear to
be overfitting, but it is not. As additional data points are added, this method performs admirably. In
this scenario, the data is divided into three groups, and a regression line is fitted to each group.

To solve a problem, piecewise searches for a distribution of breakpoints that results in the smallest
sum of squared errors. Minimum sum of squared errors are achieved by employing least squares
fitting inside the critical region. Finding the best places to make cuts quickly while dealing with an
issue that spans several segments can be expedited through the use of a multi-start gradient-based
search.

In highly regulated business situations such as credit decisions and risk-based simulation, where
model explain-ability is necessary, a piecewise linear function is utilized to eliminate model bias by
segmenting on critical decision factors.
Piecewise Function: How to Use It?

The assumption of linearity between independent and dependent variables is crucial to the linear
regression model. To partition nonlinear variables into linear decision boundaries, a piecewise
model can be used inside a final linear model.

Piecewise independent nonlinear variables are segmented into intervals, and the properties of
these intervals are then included independently into linear regression models.

Non-linearity may be handled in a number of ways, including the use of polynomial functions,
however in order to describe variables with complex structure, one is often left with characteristics
of higher degree polynomials. This might cause the models to become unstable.

Assumptions-
Standard linear regression models with standard estimation techniques make a number of
assumptions about the predictor variables, the response variables and their relationship. Numerous
extensions have been developed that allow each of these assumptions to be relaxed (i.e., reduced
to a weaker form), and in some cases eliminated entirely. Generally, these extensions make the
estimation procedure more complex and time-consuming, and may also require more data in order
to produce an equally precise model.

The following are the major assumptions made by standard linear regression models with standard
estimation techniques (e.g., ordinary least squares):

Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed
values, rather than random variables. This means, for example, that the predictor variables are
assumed to be error-free—that is, not contaminated with measurement errors. Although this
assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-
in-variables models.

Linearity. This means that the mean of the response variable is a linear combination of the
parameters (regression coefficients) and the predictor variables. Note that this assumption is much
less restrictive than it may at first seem. Because the predictor variables are treated as fixed values
(see above), linearity is really only a restriction on the parameters. The predictor variables
themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying
predictor variable can be added, each one transformed differently. This technique is used, for
example, in polynomial regression, which uses linear regression to fit the response variable as an
arbitrary polynomial function (up to a given degree) of a predictor variable. With this much
flexibility, models such as polynomial regression often have "too much power", in that they tend to
overfit the data. As a result, some kind of regularization must typically be used to prevent
unreasonable solutions coming out of the estimation process. Common examples are ridge
regression and lasso regression. Bayesian linear regression can also be used, which by its nature is
more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression
can both be viewed as special cases of Bayesian linear regression, with particular types of prior
distributions placed on the regression coefficients.)

Constant variance (a.k.a. homoscedasticity). This means that the variance of the errors does not
depend on the values of the predictor variables. Thus, the variability of the responses for given
fixed values of the predictors is the same regardless of how large or small the responses are. This is
often not the case, as a variable whose mean is large will typically have a greater variance than one
whose mean is small. For example, a person whose income is predicted to be $100,000 may easily
have an actual income of $80,000 or $120,000—i.e., a standard deviation of around $20,000—
while another person with a predicted income of $10,000 is unlikely to have the same $20,000
standard deviation, since that would imply their actual income could vary anywhere between
−$10,000 and $30,000. (In fact, as this shows, in many cases—often the same cases where the
assumption of normally distributed errors fails—the variance or standard deviation should be
predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity
is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted
values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e.,
increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the
absolute or squared residuals versus the predicted values (or each predictor) can also be examined
for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of
heteroscedasticity will result in an overall "average" estimate of variance being used instead of one
that takes into account the true variance structure. This leads to less precise (but in the case of
ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in
misleading tests and interval estimates. The mean squared error for the model will also be wrong.
Various estimation techniques including weighted least squares and the use of heteroscedasticity-
consistent standard errors can handle heteroscedasticity in a quite general way. Bayesian linear
regression techniques can also be used when the variance is assumed to be a function of the mean.
It is also possible in some cases to fix the problem by applying a transformation to the response
variable (e.g., fitting the logarithm of the response variable using a linear regression model, which
implies that the response variable itself has a log-normal distribution rather than a normal
distribution).
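
As a hedged sketch of how this assumption is usually checked in practice, the code below simulates data whose error spread grows with the predictor, runs the Breusch-Pagan test from statsmodels, and then refits with heteroscedasticity-consistent (robust) standard errors; the data are invented for illustration.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 400
x = rng.uniform(1, 10, n)
y = 3 + 2 * x + rng.normal(scale=0.5 * x, size=n)   # error spread grows with x ("fanning")

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")   # a small p-value flags heteroscedasticity

# Robust (HC1) standard errors keep the OLS coefficients but correct the inference.
robust = res.get_robustcov_results(cov_type="HC1")
print(robust.bse)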

Independence of errors. This assumes that the errors of the response variables are uncorrelated
with each other. (Actual statistical independence is a stronger condition than mere lack of
correlation and is often not needed, although it can be exploited if it is known to hold.) Some
methods such as generalized least squares are capable of handling correlated errors, although they
typically require significantly more data unless some sort of regularization is used to bias the model
towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this
issue.

Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods,
the design matrix X must have full column rank p; otherwise, perfect multicollinearity exists in the
predictor variables, meaning a linear relationship exists between two or more predictor variables.
This can be caused by accidentally duplicating a variable in the data, using a linear transformation
of a variable along with the original (e.g., the same temperature measurements expressed in
Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such
as their mean. It can also happen if there is too little data available compared to the number of
parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of
this assumption, where predictors are highly but not perfectly correlated, can reduce the precision
of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the
parameter vector β will be non-identifiable—it has no unique solution. In such a case, only some of
the parameters can be identified (i.e., their values can only be estimated within some linear
subspace of the full parameter space Rp). See partial least squares regression. Methods for fitting
linear models with multicollinearity have been developed,[5][6][7][8] some of which require
additional assumptions such as "effect sparsity"—that a large fraction of the effects are exactly
zero. Note that the more computationally expensive iterated algorithms for parameter estimation,
such as those used in generalized linear models, do not suffer from this problem.

Beyond these assumptions, several other statistical properties of the data strongly influence the
performance of different estimation methods:

The statistical relationship between the error terms and the regressors plays an important role in
determining whether an estimation procedure has desirable sampling properties such as being
unbiased and consistent.

The arrangement, or probability distribution of the predictor variables x has a major influence on
the precision of estimates of β. Sampling and design of experiments are highly developed subfields
of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of
β.

4.6 USE OF DUMMY VARIABLES

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that
takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may
be expected to shift the outcome. They can be thought of as numeric stand-ins for qualitative facts
in a regression model, sorting data into mutually exclusive categories (such as smoker and non-
smoker).

A dummy independent variable (also called a dummy explanatory variable) which for some
observation has a value of 0 will cause that variable's coefficient to have no role in influencing the
dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter the
intercept. For example, suppose membership in a group is one of the qualitative variables relevant
to a regression. If group membership is arbitrarily assigned the value of 1, then all others would get
the value 0. Then the intercept would be the constant term for non-membersbut would be the
constant term plus the coefficient of the membership dummy in the case of group members.

Dummy variables are used frequently in time series analysis with regime switching, seasonal
analysis and qualitative data applications.

All of the independent (X) variables in a regression analysis are interpreted numerically. Numerical variables are those measured on interval or ratio scales, where statements such as "10 is twice as much as 5" or "3 minus 1 = 2" are meaningful. On the other hand, you may wish to incorporate an attribute or nominal scale variable, such as "Brand Name" or "Defect Type", in your research. Suppose you have identified three distinct defect types and have labelled them '1', '2' and '3'. In this context, the expression "three minus one" has no meaning: Defect 1 cannot be subtracted from Defect 3. The numerical values used to label each "Defect Type" are only descriptive codes and have no bearing on the magnitude of the defects themselves. In such cases, dummy variables are introduced to "trick" the regression algorithm into producing accurate results.
Interpreting and assessing attribute variables.

Dummy variables are dichotomous variables derived from a more complex variable.
A dichotomous variable is the simplest form of data. For example, color (e.g., Black = 0; White = 1).
It may be necessary to dummy code variables in order to meet the assumptions of some analyses.
A common scenario is to dummy code a categorical variable for use as a predictor in multiple linear
regression (MLR).
For example, we may have data about participants' religion, with each participant coded as follows:
A categorical or nominal variable with three categories

Religion     Code
Christian    1
Muslim       2
Atheist      3

This is a categorical variable which would be inappropriate to use in this format as a predictor in
MLR. However, this variable could be represented using a series of three dichotomous variables
(coded as 0 or 1), as follows:

Dummy coding for a categorical variable with three categories

Religion     Christian   Muslim   Atheist
Christian    1           0        0
Muslim       0           1        0
Atheist      0           0        1

There is some redundancy in this dummy coding. For instance, if we know that someone is not
Christian and not Muslim, then they are Atheist.
So, we only need to use two of the three dummy-coded variables as predictors. More generally, the
number of dummy-coded variables needed is one less than the number of categories (k - 1, where
k is the original number of categories). If all dummy variables were used, there would be
multicollinearity.
Choosing which dummy variables to use is arbitrary, but depends on the researcher's logic. The dummy variable not used becomes the reference category. Then (this is the tricky part conceptually) all other dummy variables predict the outcome variable in relation to that reference category. A code sketch of this k - 1 coding appears after the tables below.
For example, if I'm particularly interested in whether atheism is associated with higher rates of
depression, then use the dummy coded variables for:
Christian (0 = Not Christian or 1 = Christian)
Muslim (0 = Not Muslim or 1 = Muslim)
If the regression coefficient for the Christian dummy coded variable is:
not significant, then whether someone is Christian vs. Atheist isn't related to their depression
significant and positive, then Christian people tend to be more depressed than Atheists
significant and negative, then Christian people tend to be less depressed than Atheists
If the regression coefficient for the Muslim dummy coded variable is:
not significant, then whether someone is Muslim vs. Atheist isn't related to their depression
significant and positive, then Muslim people tend to be more depressed than Atheists
significant and negative, then Muslim people tend to be less depressed than Atheists
Alternatively, I may simply be interested to recode the data into a single dichotomous variable to
indicate, for example, whether a participant is Atheist (0) or Religious (1), where Religious category
consists of those who are either Christian or Muslim. The coding would be as follows:

A categorical or nominal variable with three categories


Religiosity   Code
Atheism       0
Religious     1
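
The dummy coding described above can be produced directly in Python with pandas; the sketch below is a minimal illustration on made-up data, with drop_first=True so that only k - 1 dummies are kept and the dropped category becomes the reference group.

import pandas as pd

df = pd.DataFrame({"religion": ["Christian", "Muslim", "Atheist", "Muslim", "Christian"]})

# drop_first=True keeps k - 1 = 2 dummies; the dropped (first alphabetical) category,
# Atheist here, becomes the reference category in a regression.
dummies = pd.get_dummies(df["religion"], prefix="religion", drop_first=True)
print(dummies)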

4.7 DUMMY VARIABLE TRAP

The Dummy variable trap is a scenario where there are attributes which are highly correlated
(Multicollinear) and one variable predicts the value of others. When we use one hot encoding
for handling the categorical data, then one dummy variable (attribute) can be predicted with the
help of other dummy variables. Hence, one dummy variable is highly correlated with other dummy
variables. Using all of the dummy variables in a regression model leads to the dummy variable trap, so the regression model should be designed excluding one dummy variable.
For example:
Let's consider the case of gender, coded with two dummy variables, male (0 or 1) and female (1 or 0). Including both dummy variables causes redundancy because if a person is not male then that person is female; hence, we don't need to use both variables in the regression model. This will protect us from the dummy variable trap.

In statistical terms, a dummy variable can be used for qualitative data analysis, time series analysis,
and other purposes. This article will introduce you to the idea of a Dummy Variable Trap and provide
you with a basic grasp of the Dummy Variable Trap model. Dummy variables is another name for this
idea when used in a regression analysis.

Sub-sample analysis and investigation are reflected by dummy variables in the regression model.
Categories such as gender, age, height, and weight can be represented by "Dummy Variables," which
are assigned numeric values to serve as stand-ins. These quantitative and categorical Dummy
Variables are used in the regression model. Their values are either 0 or 1, and they are represented
by tiny integers. In a collection of categorical data, they show whether something is absent or
present, with zero indicating absence and one indicating presence.

Let's start with the definition of the Dummy Variable Trap. It arises when the dummy attributes are correlated with one another (multicollinear), so that one variable can be used as a predictor of the values of the others. When one-hot encoding is applied to a categorical data set, any one dummy variable can be forecast from the remaining dummy variables. The scenario in which all of these dummy variables are nevertheless employed together in a regression model is known as the Dummy Variable Trap.

This is a common issue when doing straightforward linear regression. One common assumption in
statistics is that dependent variables are unchanging throughout time. On the other hand, both
continuous and categorical values can be assigned to independent variables.

The next thing to do is go through the differences between how Dummy Variables in regression
analysis are interpreted and how continuous variables in a linear model are interpreted.

When two or more dummy variables generated using one-hot encoding are significantly linked, a
phenomenon known as the Dummy Variable Trap arises (multi-collinear). As a result, it is challenging
to understand predicted coefficient variables in regression models, as one variable might be inferred
from the others. Simply put, due to multicollinearity, it is difficult to draw conclusions about the
influence of the dummy variables on the prediction model in isolation.

With one-hot encoding, a separate dummy variable is generated for each category variable to
indicate its existence (1) or absence (0). In this way, a categorical variable such as "tree species,"
which may take the values "pine" and "oak," might be represented as a dummy variable by
translating each value into a one-hot vector. This results in two columns, the first of which indicates
whether or not the tree is a pine, and the second whether or not it is an oak. If the tree in question is of the species represented in a given column, that column will contain a 1; otherwise, it will contain a 0. These two columns are multicollinear because, once we know a tree is not a pine, it must be an oak (and vice versa).
There are many types of data used in statistics, and regression models in particular must be able to
handle them all. Quantitative (numerical) or qualitative (description) information might be collected
(categorical). Regression models work well with numerical data, but we can't use categorical data
without first transforming it.

The label encoding process allows us to convert categorical features into numeric ones (label
encoding assigns a unique integer to each category of data). However, there are other factors that
make this technique less ideal. After label encoding, regression models typically employ a single hot
encoding. This allows us to generate as many new attributes as there are classes in the
corresponding category attribute; if the latter has n classes, then the former must also produce n
attributes. The attributes created in this way are called dummy variables. In regression models, dummy variables stand in as proxies for the actual categories of information.

Each attribute will be assigned a value of 0 or 1 to indicate the attribute's presence or absence in the
dummy variables that will be constructed using one-hot encoding.

Indicator Variable Fallacy:

When qualities are strongly linked (Multicollinear) and one variable predicts the value of others, a
situation known as the Dummy variable trap arises. We may anticipate the value of one dummy
variable (attribute) using the values of other dummy variables when we utilize one-hot encoding to
deal with the categorical data. It follows that there is a strong relationship between dummy
variables. The dummy variable trap occurs when regression models only use dummy variables. As a
result, while developing regression models, it is necessary to take into account the possibility that
one dummy variable may be ignored.
Just one example:

Let's take into account the scenario where gender is coded with two dummy variables, male (1 or 0) and female (1 or 0). Since a person who is not male must be female, including both dummy variables in regression models is unnecessary; dropping one of them keeps us safe from the dummy variable trap.

In linear regression models, to create a model that can infer relationship between features (having
categorical data) and the outcome, we use the dummy variable technique.

A “Dummy Variable” or “Indicator Variable” is an artificial variable created to represent an attribute
with two or more distinct categories/levels.

The dummy variable trap is a scenario in which the independent variables become multicollinear
after addition of dummy variables.

Multicollinearity is a phenomenon in which two or more variables are highly correlated. In simple
words, it means value of one variable can be predicted from the values of other variable(s).
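
The multicollinearity behind the trap can be seen directly with a small numeric sketch (invented data): with an intercept plus a full set of dummies, the columns of the design matrix are perfectly collinear, so the matrix loses rank and OLS has no unique solution.

import numpy as np

male = np.array([1, 0, 1, 0, 0])
female = 1 - male                    # redundant: female is exactly 1 - male
const = np.ones(5)

X_trap = np.column_stack([const, male, female])   # intercept + both dummies
X_ok = np.column_stack([const, male])             # intercept + one dummy (female dropped)

print(np.linalg.matrix_rank(X_trap))   # 2 -> rank-deficient (3 columns, rank 2)
print(np.linalg.matrix_rank(X_ok))     # 2 -> full column rank (2 columns, rank 2)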

4.8 REGRESSION WITH DUMMY DEPENDENT VARIABLES

In statistics, especially in regression models, we deal with various kind of data. The data may be
quantitative (numerical) or qualitative (categorical). The numerical data can be easily handled in
regression models but we can‘t use categorical data directly; it needs to be transformed in some
way.

For transforming categorical attribute to numerical attribute, we can use label encoding procedure
(label encoding assigns a unique integer to each category of data). But this procedure is not alone
that much suitable, hence, One hot encoding is used in regression models following label encoding.
This enables us to create new attributes according to the number of classes present in the
categorical attribute i.e., if there are n number of categories in categorical attribute, n new
attributes will be created. These attributes created are called Dummy Variables. Hence, dummy
variables are "proxy" variables for categorical data in regression models.

These dummy variables will be created with one hot encoding and each attribute will have value
either 0 or 1, representing presence or absence of that attribute.
Dummy independent variables in regressions have been introduced and understood in previous
chapters. We have seen how testing for gender-based pay disparities may be done with 0/1
variables like Female (1 if female, 0 if male). These variables can only take on two possible values,
true or false. However, Y has functioned as a continuous variable throughout the analysis. That is,
the Y variable has always taken on many different values in all the regressions we have seen thus far,
from the first SAT score regression through the several earnings function regressions.

In this section, we'll look at models in which the dependent variable is a dummy or dichotomous
variable. Binary response, dichotomous choice, and qualitative response models are all names for
this type of structure.

Dummy dependent variable models necessitate quite advanced econometrics and are challenging to
manage with our standard regression methods. We offer the subject with a strong focus on intuition
and graphical analysis since that is in line with our pedagogical tenets. Also, the box model and its
associated error term are the main points of attention. Last but not least, we proceed to use Monte
Carlo simulation to justify the part played by randomness. Although the subject matter is still
challenging, we are confident that our method significantly improves comprehension.

Specifically, What Does a Model With a Dummy Dependent Variable Mean?

That's a simple question to answer. A dummy dependent variable model has a qualitative rather
than quantitative Y variable (sometimes called the response, left-hand side, or X variable).

One's yearly income is a numeric value that might be anything from zero to many millions of dollars.
Similarly, the Unemployment Rate is a measurable statistic, calculated by dividing the number of unemployed individuals by the number of individuals in the labor force in a certain area (county, state, or nation). This fraction is expressed as a percentage (e.g., 4.3 or 6.7 percent). The relationship between unemployment and income can be represented
by a cloud of dots in a scatter diagram.
On the other hand, the decision to emigrate is qualitative, taking on the values 0 (do not emigrate)
or 1 (do emigrate). If we were to plot Emigrate and the Unemployment Rate in each
county as a scatter diagram, we wouldn't see a cloud. One horizontal strip would show
unemployment rates in different counties for those who did not depart, and the other would show
the same data for people who did leave the country.

As a qualitative variable, your political affiliation can take on values such as 0 for Democrat, 1 for
Republican, 2 for Libertarian, 3 for Green, 4 for Other, and 5 for Independent. The precise figures are
not given. The mean and standard deviation of the values 0, 1, 2, 3, 4, and 5 have no significance.
Each political party's numerical value in a scatter plot of political affiliation and annual income would
be represented by a horizontal strip.

Binary choice models are commonly used when the qualitative dependent variable has precisely two
values (like Emigrate). An appropriate representation of the dependent variable here would be a
dummy variable with values of 0 and 1. A multiresponse, multinomial, or polychotomous model is
one in which the qualitative dependent variable can take on more than two values (such as Political
Party). Models with a qualitative dependent variable that can take on more than two values present
extra challenges in terms of interpretation and estimation; such models are beyond the scope of this book.
Dummy variables, which only contain 1s and 0s, can likewise serve as the dependent variable. When
it assumes the value 1, it is considered a success. Consider the case of house ownership or mortgage
approval, where the dummy variable would be assigned the value 1 if the individual was a
homeowner and 0 otherwise. After that, it may be regressed on a number of different factors,
including both the typical continuous ones and additional dummy variables. This is how a scatterplot
for such a model might look like:

Fig. 4.9 LPM

Although this method, known as the Linear Probability Model (LPM), is commonly used for
estimation, it has a number of drawbacks when utilizing ordinary least squares (OLS). Since the
regression line does not provide a good fit to the data, the typical measures of this, such as the R2
statistic, cannot be relied upon. The technique has additional flaws as well:

First, any model estimated with the LPM method will suffer from heteroskedasticity.

Since the LPM assesses probabilities, and a probability larger than 1 does not exist, it is feasible that
the LPM will provide estimates that are both more than 1 and less than 0.

As seen in the following graphic, the error term in such a model is highly unlikely to be normally
distributed.

Fourthly, the most significant issue is that the model's variables are probably not related in a linear
fashion. This indicates that a new sort of regression line, such as a 'S' shaped curve, is required to fit
the data more precisely.
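
A small, hedged illustration of these points follows: fitting OLS to a simulated 0/1 outcome (the LPM) yields some fitted "probabilities" outside [0, 1], whereas a logit fit, which imposes the S-shaped curve mentioned above, keeps them inside the unit interval. The data are simulated; nothing here comes from the text.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 2.5 * x)))    # the true relationship is S-shaped
y = rng.binomial(1, p_true)

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()                        # the Linear Probability Model
outside = np.mean((lpm.fittedvalues < 0) | (lpm.fittedvalues > 1))
print(f"share of LPM fitted values outside [0, 1]: {outside:.3f}")

logit = sm.Logit(y, X).fit(disp=0)              # the S-shaped alternative
print(f"logit fitted range: {logit.predict(X).min():.3f} to {logit.predict(X).max():.3f}")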

we have created and interpreted dummy independent variables in regressions. We have seen how
0/1 variables such as Female (1 if female, 0 if male) can be used to test for wage discrimination.
These variables have either/or values with nothing in between. Up to this point, however, the
dependent variable Y has always been essentially a continuous variable. That is, in all the regressions
we have seen thus far, from our first regression using SAT scores to the many earnings function
regressions, the Y variable has always taken on many possible values.

This chapter discusses models in which the dependent variable (i.e., the variable on the left-hand
side of the regression equation, which is the variable being predicted) is a dummy or dichotomous
variable. This kind of model is often called a dummy dependent variable (DDV), binary response,
dichotomous choice, or qualitative response model.

Dummy dependent variable models are difficult to handle with our usual regression techniques and
require some rather sophisticated econometrics. In keeping with our teaching philosophy, we
present the material with a heavy emphasis on intuition and graphical analysis. In addition, we focus
on the box model and the source of the error term. Finally, we continue to rely on Monte Carlo
simulation in explaining the role of chance. Although the material remains difficult, we believe our
approach greatly increases understanding.

What Exactly Is a Dummy Dependent Variable Model?

That question is easy to answer. In a dummy dependent variable model, the dependent variable
(also known as the response, left-hand side, or Y variable) is qualitative, not quantitative.

Yearly Income is a quantitative variable; it might range from zero dollars per year to millions of
dollars per year. Similarly, the Unemployment Rate is a quantitative variable; it is defined as the
number of people unemployed divided by the number of people in the labor force in a given location
(county, state, or nation). This fraction is expressed as a percentage (e.g., 4.3 or 6.7 percent). A
scatter diagram of unemployment rate and income is a cloud of points with each point representing
a combination of the two variables.

On the other hand, whether you choose to emigrate is a qualitative variable; it is 0 (do not emigrate)
or 1 (do emigrate). A scatter diagram of Emigrate and the county Unemployment Rate would not be
a cloud. It would be simply two strips: one horizontal strip for various county unemployment rates
for individuals who did not emigrate and another horizontal strip for individuals who did emigrate.

The political party to which you belong is a qualitative variable; it might be 0 if Democrat, 1 if
Republican, 2 if Libertarian, 3 if Green Party, 4 if any other party, and 5 if independent. The numbers
are arbitrary. The average and SD of the 0, 1, 2, 3, 4, and 5 are meaningless. A scatter diagram of
Political Party and Yearly Income would have a horizontal strip for each value of political party.

When the qualitative dependent variable has exactly two values (like Emigrate), we often speak of
binary choice models. In this case, the dependent variable can be conveniently represented by a
dummy variable that takes on the value 0 or 1. If the qualitative dependent variable can take on
more than two values (such as Political Party), the model is said to be multiresponse or multinomial
or polychotomous. Qualitative dependent variable models with more than two values are more
difficult to understand and estimate. They are beyond the scope of this book.

More Examples of Dummy Dependent Variables

Figure gives more examples of applications of dummy dependent variables in economics. Notice that
many variables are dummy variables at the individual level (like Emigrate or Unemployed), although
their aggregated counterparts are continuous variables (like emigration rate or unemployment rate).

Fig. 4.10 Dummy Dependent Variables
The careful student might point out that some variables commonly considered to be continuous, like
income, are not truly continuous because fractions of pennies are not possible. Although technically
correct, this criticism could be leveled at any observed variable and for practical purposes is
generally ignored. There are some examples, however, like educational attainment (in years of
schooling), in which determining whether the variable is continuous or qualitative is not so clear.

The definition of a dummy dependent variable model is quite simple: If the dependent, response,
left-hand side, or Y variable is a dummy variable, you have a dummy dependent variable model. The
reason dummy dependent variable models are important is that they are everywhere. Many
individual decisions of how much to do something require a prior decision to do or not do at all.
Although dummy dependent variable models are difficult to understand and estimate, they are
worth the effort needed to grasp them.

Dependent and Independent variables are variables in mathematical modeling, statistical modeling
and experimental sciences. Dependent variables receive this name because, in an experiment, their
values are studied under the supposition or demand that they depend, by some law or rule (e.g., by
a mathematical function), on the values of other variables. Independent variables, in turn, are not
seen as depending on any other variable in the scope of the experiment in question. In this sense,
some common independent variables are time, space, density, mass, fluid flow rate, and previous
values of some observed value of interest (e.g. human population size) to predict future values (the
dependent variable).

Of the two, it is always the dependent variable whose variation is being studied, by altering inputs,
also known as regressors in a statistical context. In an experiment, any variable that can be
attributed a value without attributing a value to any other variable is called an independent variable.
Models and experiments test the effects that the independent variables have on the dependent
variables. Sometimes, even if their influence is not of direct interest, independent variables may be
included for other reasons, such as to account for their potential confounding effect.

4.9 LOGIT, PROBIT, TOBIT MODELS – APPLICATIONS

LOGIT MODEL-

The Logit Model, better known as Logistic Regression, is a binomial regression model. Logistic Regression is used to associate a vector of random (explanatory) variables with a binomial random variable. Logistic regression is a special case of a generalized linear model. It is widely used in machine learning.

A link function is simply a function of the mean of the response variable Y that we use as the
response instead of Y itself.

All that means is when Y is categorical, we use the logit of Y as the response in our regression equation instead of just Y:

logit(P) = ln(P / (1 - P)) = β0 + β1*X1 + β2*X2 + ... + βk*Xk
The logit function is the natural log of the odds that Y equals one of the categories. For
mathematical simplicity, we‘re going to assume Y has only two categories and code them as 0 and
1.
This is entirely arbitrary–we could have used any numbers. But these make the math work out
nicely, so let‘s stick with them. P is defined as the probability that Y=1. So, for example, those
Xs could be specific risk factors, like age, high blood pressure, and cholesterol level, and P would be
the probability that a patient develops heart disease. The explanation of the Model. We denote Y
as the variable to predict and X = (X1, X2 … Xn) as predictive variables (explanatory variables). In
the context of binary logistic regression, the variable Y takes two possible modalities {1, 0}. The
variables Xj are exclusively continuous or binary.
P(Y = 1) denotes the probability that the variable Y takes the value 1. Similarly we can define P(Y =
0) as the probability that the variable Y takes the value 0. P(X|1) is the conditional distribution of X
knowing the value taken by Y. Similarly, P(X|0) is defined.
The posterior probability of obtaining the modality 1 of Y knowing the value taken by X is noted
P(1|X). Similarly P(0|X) is defined.
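
A hedged sketch of fitting this model in Python follows, using simulated data with risk-factor-style predictors (age and cholesterol, echoing the heart-disease example above); the coefficient values are invented, and exponentiating the fitted coefficients gives odds ratios.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame({"age": rng.uniform(30, 70, n), "chol": rng.normal(200, 30, n)})
log_odds = -12 + 0.12 * df.age + 0.02 * df.chol
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))   # P(Y = 1) from the logit

fit = smf.logit("disease ~ age + chol", data=df).fit(disp=0)
print(fit.params)            # coefficients on the log-odds scale
print(np.exp(fit.params))    # odds ratios per one-unit increase in each predictor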

In this monograph, Dr. DeMaris begins by describing the logit model in the context of the general
loglinear model, moving its application from two-way to multidimensional tables. In the first half of
the book, contingency table analysis is developed, aided by effective use of data from the General
Social Survey for 1989.

As long as the variables are measured at the nominal or ordinal levels, the cross-tab format for logit
modeling works well. However, if independent variables are continuous, then the more
disaggregated logistic regression technique is favored. . . . A data example explores the relationship
of three continuous explanatory variables—population size, population growth, and literacy—to
the log odds of a high murder rate (in a sample of 54 cities). . . . Besides a comparative discussion of
the substantive interpretation of coefficients (odds versus probabilities), DeMaris describes
significance testing and goodness-of-fit measures for logistic regressions, not to mention the
modeling of nonlinearity and interaction effects.

In the final chapter, logistic regression is extended to dependent variables with more than two
categories, categories that may be either nominal or ordinal. The extension to polytomous logistic
regression allows researchers to forsake the inefficiency of ordinary regression in such a case, as
well as to avoid turning to discriminant analysis, with its unrealistic multivariate normal
assumption.

In sum, logit modeling achieves a general purpose, serving whenever the measurement
assumptions for classical multiple regression fail to be met, for either independent or dependent
variables.

Applications

 In medicine, it allows one to find the factors that characterize a group of sick subjects as compared to healthy subjects.
 In the field of insurance, it makes it possible to target a fraction of the customers who will
be sensitive to an insurance policy on a particular risk.
 In the banking field, to detect risk groups when subscribing a loan.
 In econometrics, to explain a discrete variable. For example, voting intentions in elections.

Probit Model

In statistics, a probit model is a type of regression where the dependent variable can take only two
values, for example married or not married. The word is a portmanteau, coming from probability +
unit. The purpose of the model is to estimate the probability that an observation with particular
characteristics will fall into a specific one of the categories; moreover, classifying observations
based on their predicted probabilities is a type of binary classification model.

A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression, using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated
using the maximum likelihood procedure,[3] such an estimation being called a probit regression.
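
The sketch below is a hedged illustration of a probit regression in Python on simulated 0/1 data generated with the normal-CDF link, alongside a logit fit on the same data for comparison (logit coefficients typically come out roughly 1.6 to 1.8 times the probit ones); all values are invented.

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 800
x = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(0.3 + 1.2 * x))    # P(y = 1) = Phi(Xb), the probit assumption

X = sm.add_constant(x)
probit = sm.Probit(y, X).fit(disp=0)            # maximum likelihood "probit regression"
logit = sm.Logit(y, X).fit(disp=0)
print("probit coefficients:", np.round(probit.params, 2))
print("logit coefficients: ", np.round(logit.params, 2))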

Tobit Model

In statistics, a Tobit model is any of a class of regression models in which the observed range of the
dependent variable is censored in some way. The term was coined by Arthur Goldberger in
reference to James Tobin, who developed the model in 1958 to mitigate the problem of zero-
inflated data for observations of household expenditure on durable goods. Because Tobin's method
can be easily extended to handle truncated and other non-randomly selected samples, some
authors adopt a broader definition of the tobit model that includes these cases.

Tobin's idea was to modify the likelihood function so that it reflects the unequal sampling
probability for each observation depending on whether the latent dependent variable fell above or
below the determined threshold. For a sample that, as in Tobin's original case, was censored from
below at zero, the sampling probability for each non-limit observation is simply the height of the
appropriate density function. For any limit observation, it is the cumulative distribution, i.e.
the integral below zero of the appropriate density function. The Tobit likelihood function is thus a
mixture of densities and cumulative distribution functions.
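To make that structure concrete, the following is a hedged Python sketch (not a production estimator) of the Tobit log-likelihood for a sample censored from below at zero: it mixes the normal log-density for non-limit observations with the normal log-CDF for limit observations, and could be maximized numerically, for example with scipy.optimize.minimize applied to its negative.

import numpy as np
from scipy.stats import norm

def tobit_loglik(params, y, X):
    """Log-likelihood of a Tobit model censored from below at zero.
    params = (beta_0, ..., beta_k, log_sigma); y and X are numpy arrays."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)              # keep sigma positive
    xb = X @ beta
    limit = (y <= 0)                       # censored (limit) observations
    # density term for uncensored (non-limit) observations
    ll_uncens = norm.logpdf((y[~limit] - xb[~limit]) / sigma) - np.log(sigma)
    # CDF term for censored observations: P(y* <= 0) = Phi(-x'b / sigma)
    ll_cens = norm.logcdf(-xb[limit] / sigma)
    return ll_uncens.sum() + ll_cens.sum()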

Applications

Tobit models have, for example, been applied to estimate factors that impact grant receipt,
including financial transfers distributed to sub-national governments who may apply for these
grants. In these cases, grant recipients cannot receive negative amounts, and the data is thus left-
censored. For instance, Dahlberg and Johansson (2002) analyze a sample of 115 municipalities (42
of which received a grant). Dubois and Fattore (2011) use a tobit model to investigate the role
of various factors in the receipt of European Union funds by Polish sub-national governments. The
data may, however, be left-censored at a point higher than zero, with the risk of mis-specification.
Both studies apply Probit and other models to check for robustness. Tobit models have also been
applied in demand analysis to accommodate observations with zero expenditures on some goods.
In a related application of Tobit models, a system of nonlinear Tobit regressions models has been
used to jointly estimate a brand demand system with homoscedastic, heteroscedastic and
generalized heteroscedastic variants.

Quantal responses and limited (restricted) responses are two useful categories for grouping together
variables that cannot be treated by ordinary regression analysis, the primary instrument of
econometrics. The methods of analysis known as probit and logit are applicable to dichotomous,
qualitative, and categorical outcomes, which fall within the quantal response (all or nothing) group;
examples include the decision to buy a home versus rent one, the choice of a mode of transportation,
and the choice of a career path. Variables with both discrete and continuous outcomes fall under the
limited response category, with tobit being the standard model and analysis tool for this kind of
data: samples of durable-goods spending contain both zero and positive expenditures, and models of
markets with price ceilings contain both limit and non-limit prices. While the limited and quantal
response approaches, including the tobit model, share their roots with the probit model, they are
distinct enough from one another to warrant individual consideration.

4.10 SUMMARY

At times it is desirable to have independent variables in the model that are qualitative rather
than quantitative. This is easily handled in a regression framework. Regression uses qualitative
variables to distinguish between populations. There are two main advantages of fitting both
populations in one model. You gain the ability to test for different slopes or intercepts in the
populations, and more degrees of freedom are available for the analysis.
Regression with qualitative variables is different from analysis of variance and analysis of
covariance. Analysis of variance uses qualitative independent variables only. Analysis of covariance
uses quantitative variables in addition to the qualitative variables in order to account for
correlation in the data and reduce MSE; however, the quantitative variables are not of primary
interest and merely improve the precision of the analysis.
For modelling discrete outcomes, the logit model is a popular choice. It can be used either for a
binary outcome (a value of 0 or 1) or for an outcome with three or more possible values
(multinomial logit). The logit model is often preferred in large samples and is based on the logistic
distribution (which arises from Gumbel, or extreme value, distributed errors in random-utility formulations).
Probit models give very similar results when the outcome is binary (0 and 1). Their behaviour differs
more noticeably when there are three or more ordered outcomes (ordered probit): with only one
regression equation to work with, firm conclusions about marginal effects are easiest to draw for the
"extreme" (top and bottom) categories.
Tobit models are different: the outcome is neither binary nor discrete. The Tobit model is a form of
linear regression for a continuous dependent variable that is censored. It allows the analyst to keep
the linearity assumptions of linear regression while specifying a lower (or upper) threshold at which
the dependent variable is censored.
When the dependent variable in a regression model is a dichotomous event, the logit or probit model
is used.
The probit model is based on the (standard) normal distribution, while the logit model is based on
the logistic distribution.
The logistic distribution has somewhat fatter tails than the normal distribution.

Since stock returns typically have fat tails, the logistic distribution is frequently used to analyse
their behaviour.

Probit theory is grounded in utility theory or the rational choice perspective on human behaviour.

Basic Econometrics by Gujarati is a good resource for learning more about these types of models.
For example, in adoption models (with a dichotomous dependent variable), the logit and probit models are
typically employed in the first hurdle of a double-hurdle model, whereas the Tobit model is typically used
in the second hurdle, where the dependent variable takes "actual" values instead of a simple yes/no choice.
Farmers in a certain area may be polled to determine whether they will switch to hybridized maize seeds
(a yes/no answer, modelled with logit or probit depending on the assumed error distribution); this is the
first hurdle. Those who answer yes are then asked how much they are willing to pay for the seed, and this
amount serves as the dependent variable in a Tobit model; this is the second hurdle.

4.11 KEYWORDS

 Regression Analysis- A collection of statistical techniques known as regression analysis


is used to estimate the associations between a dependent variable and one or more
independent variables. It may be used to simulate the long-term link between variables
and gauge how strongly the relationships between them are related.

 Regression Model- An explanation of the relationship between one or more independent


variables and a response, dependent, or target variable is provided by a regression model.

 Categorical Variable- A categorical variable, also known as a qualitative variable, is a


statistical variable that, based on some qualitative quality, assigns each human or other
unit of observation to one of a small, and typically fixed, number of potential groups or
nominal categories.

 Binary Variable- A binary variable is the same as a "truth value" in mathematical logic
or a "bit" in computer science. Similar to how statisticians refer to a Bell curve as a
"Normal Distribution" and physicists refer to it as a "Gaussian distribution," they are
really different names for the same thing.

 Multicollinearity- High intercorrelations between two or more independent variables in


a multiple regression model are referred to as multicollinearity. When a researcher or
analyst tries to figure out how well each independent variable can be utilized to predict
or comprehend the dependent variable in a statistical model, multicollinearity can result
in skewed or misleading conclusions.

4.12 LEARNING ACTIVITY

1. Note on Dummy Variables & Dummy Variable Trap


_________________________________________________________________________________
___________________________________________________________________
2. Explain Tobit Model

_________________________________________________________________________________
___________________________________________________________________
3. Explain Logit Model in detail
_________________________________________________________________________________
___________________________________________________________________

4.13 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Explain Dummy variable technique?

2. Define regression analysis?

3. What is dummy variable trap?

4.Give 3 applications of logit model?

5. What is an Interaction Effect in Regression?

Long Questions

1. Explain in detail Tobit model and give its applications.

2.What is Piecewise linear regression?

3.What are the Interaction effects and seasonal analysis?

4.Briefly explain probit and logit model?

5.How to test structural stability of regression models?

B. Multiple Choice Questions

1.Which of the following is true for the coefficient of correlation?

a. The coefficient of correlation is not dependent on the change of scale

b. The coefficient of correlation is not dependent on the change of origin

c. The coefficient of correlation is not dependent on both the change of scale and change of
origin

d. None of these.

2.The independent variable is used to explain the dependent variable in ________.

a. Linear regression analysis

b. Multiple regression analysis

c. Non-linear regression analysis

d. None of these

3.The independent variable is also called

a. regressor

b. regressed

c. predictand

d. estimated

4. To determine the height of a person when his weight is given is

a. correlation problem

b. association problem

c. regression problem

d. Qualitative problem

5. If the regression line is y = 5, then the value of the regression coefficient of y on x is

a. 0

b. 0.5

c. 1

d.1.5

Answers

1-C, 2-A, 3-A, 4-C, 5-A

4.14 REFERENCES

 Gujarati, D., Porter, D.C and Gunasekhar, C (2012). Basic Econometrics (Fifth Edition)
McGraw Hill Education.
 Anderson, D. R., D. J. Sweeney and T. A. Williams. (2011). Statistics for Business and
Economics. 12th Edition, Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third edition,
Thomson South-Western, 2007.
 Johnston, J., Econometric Methods, 3rd Edition, McGraw Hill, New York, 1994.
 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic
Press, 2002 (IGM Library Call No. 330.0182 R14I).
 Koutsoyiannis, A. The Theory of Econometrics, 2nd Edition, ESLB

UNIT- 5 PANEL DATA REGRESSION MODELS

STRUCTURE

5.0 Learning Objective

5.1 Introduction

5.2 Panel Data

5.3 Pooled OLS Regression

5.4 Fixed Effect Least Squares

5.5 Dummy Variable Model

5.6 Fixed effect within group (WG) Estimator

5.7 The Random effects model

5.8 Summary

5.9 Keywords

5.10 Learning Activity

5.11 Unit End Questions

5.12 References

5.0 LEARNING OBJECTIVES

 This module 'PANEL DATA REGRESSION MODELS' introduces the key concepts of Panel Data,
Pooled OLS Regression, and Fixed Effect Least Squares.
 This module will help the students to understand the concepts of the Dummy Variable Model,
the Fixed effect within group (WG) Estimator, and the Random effects model.

5.1 INTRODUCTION

In panel data the same cross-sectional unit is surveyed over time. In short, panel data have
space as well as time dimensions. There are other names for panel data, such as pooled data
(pooling of time series and cross-sectional observations), combination of time series and cross-

section data, micropanel data, longitudinal data (a study over time of a variable or group of
subjects), event history analysis (e.g., studying the movement over time of subjects through
successive states or conditions), cohort analysis (e.g., following the career path of 1965
graduates of a business school).

Let us take an example: market prices of wheat and wheat production in 20 states of India from 1950
to 2000. For a given year, the observations on wheat output and prices across the 20 states form a
cross-section sample. For a given state, there are two time series, one on wheat output and one on its
price. Pooling these, we have 20 × 2 = 40 series observed over time, which together constitute panel
observations on wheat and its prices. Regression models based on such panel data are known as
panel data regression models.

In contrast to standard linear regression models, panel data regression explicitly handles dependencies
of unobserved independent variables on the dependent variable that might otherwise result in biased
estimators. This unit covers the key theoretical underpinnings of the subject; such models can also be
estimated step by step in software such as Python or R. Panel data regression is somewhat harder to
set up in Python than in, say, R, but that does not make it any less effective.

Let us start by defining panel data and explaining why it is so useful.
Panel data is a two-dimensional notion in which the same individuals are regularly observed
throughout a range of time periods.

In general, panel data can be thought of as a combination of cross-sectional and time-series data.
Cross-sectional data consist of one observation on several units and their accompanying characteristics
at a particular moment (i.e. each unit is observed once). Time-series data repeatedly observe a single
unit over time. By gathering information on the same units repeatedly over time, panel data combine
both types of features in a single data set.
Panel data are a type of longitudinal data, that is, data collected on units at several points in time.
There are three primary forms of longitudinal data:

Time series data. Numerous observations (large t) on a single unit or a few units (small N).
Examples include aggregate national data and stock price series.

Pooled cross sections. Several independent samples of different units (large N) drawn from the same
population at various points in time, for example:

the General Social Surveys

extracts from the US Decennial Census

the Current Population Surveys

Panel data. Multiple observations over time (small t) on two or more units (large N), for example:

household and individual panel studies (PSID, NLSY, ANES)

data on businesses and organizations at several points in time

regional data compiled over time

The discussion below serves as a basic introduction to panel data analysis, focusing in particular

on the linear error-components model.

5.2 PANEL DATA

A panel data set contains data that is collected over a period of time for one or more uniquely
identifiable individuals or “things”. In panel data terminology, each individual or “thing” for
which data is collected is called a unit.

Here are three real world examples of panel data sets:

The Framingham Heart Study: The Framingham heart study is a long running experiment that
was started in 1948 in the city of Framingham, Massachusetts. Each year, health data from
5000+ individuals is being captured with the goal of identifying risk factors for cardiovascular
disease. In this data set, the unit is a person.

The Grunfeld Investment Data: This is a popular research data set that contains corporate
performance data of 10 US companies that was accumulated over a period of 20 years. In this
data set, the unit is a company.

The British Household Panel Survey: This is a survey of a sample of British households. Since
1991, members of each sampled household were asked a set of questions and their responses
were recorded. The same sample of households was interviewed again each subsequent year.
The goal of the survey is to analyze the effects of socioeconomic changes happening in Britain
on British households. In this data set, the unit is a household.

While building a panel data set, researchers measure one or more parameters called variables
for each unit and record their values in a tabular format. Examples of variables are sex, race,
weight and lipid levels for individuals or employee count, outstanding shares and EBITDA
for companies. Notice that some variables may change across time periods, while others stay
constant.

What results from this data collection exercise is a three-dimensional data set in which each
row represents a unique unit, each column contains the data from one of the measured
variables for that unit, and the z-axis contains the sequence of time periods over which the
unit has been tracked.

Panel data sets arise out of longitudinal studies in which the researchers wish to study the
impact of the measured variables on one or more response variables such as the yearly
investment made by a company, or GDP growth of a country

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data
involving measurements over time. Panel data contain observations of multiple phenomena
obtained over multiple time periods for the same firms or individuals.
Time series and cross-sectional data can be thought of as special cases of panel data that are
in one dimension only (one panel member or individual for the former, one time point for the

latter). Data is broadly classified according to the number of dimensions. A data set containing
observations on a single phenomenon observed over multiple time periods is called time
series. In time series data, both the values and the ordering of the data points have meaning.
A data set containing observations on multiple phenomena observed at a single point in time
is called cross-sectional. In cross-sectional data sets, the values of the data points have
meaning, but the ordering of the data points does not. A data set containing observations on
multiple phenomena observed over multiple time periods is called panel data. Panel Data
aggregates all the individuals, and analyzes them in a period of time. Alternatively, the second
dimension of data may be some entity other than time. For example, when there is a sample
of groups, such as siblings or families, and several observations from every group, the data
are panel data. Whereas time series and cross-sectional data are both one-dimensional, panel
data sets are two-dimensional.

A study that uses panel data is called a longitudinal study or panel study.

Viewed statistically, panel data are two-dimensional data.

The number of dimensions provides a useful framework for categorizing data. Time series data consist of
observations on a particular unit over numerous time periods; both the values and the ordering of the data
points are significant. Cross-sectional data contain information on many units collected at the
same point in time; the values of the data points are meaningful, but the order in which they appear has
no bearing on the interpretation of the data. Panel data contain information on several units over several
time periods. Panel data are used to study groups of units over time, although the second dimension may
be something other than time: whenever there are several observations from each group in a sample, such
as a family or sibling group, the data are panel data. Panel data sets are two-dimensional, as opposed to one-
dimensional time series data and cross-sectional data.

Data sets which have a panel design-

 Russia Longitudinal Monitoring Survey (RLMS)

 German Socio-Economic Panel (SOEP)

 Household, Income and Labor Dynamics in Australia Survey (HILDA)

 British Household Panel Survey (BHPS)

 Survey of Family Income and Employment (SoFIE)

 Survey of Income and Program Participation (SIPP)

 Lifelong Labor Market Database (LLMDB)

 Longitudinal Internet Studies for the Social sciences (LISS)

 Panel Study of Income Dynamics (PSID)

 Korean Labor and Income Panel Study (KLIPS)

 China Family Panel Studies (CFPS)

 German Family Panel (pairfam)

 National Longitudinal Surveys (NLSY)

 Labor Force Survey (LFS)

 Korean Youth Panel (YP)

 Korean Longitudinal Study of Aging (KLoS)

Reasons for using Panel Data

1. Panel data can take explicit account of individual-specific heterogeneity ("individual"
here means the micro unit).
2. By combining data in two dimensions, panel data gives more data variation, less
collinearity and more degrees of freedom.
3. Panel data is better suited than cross-sectional data for studying the dynamics of change.
For example, it is well suited to understanding transition behaviour – for example
company bankruptcy or merger.
4. It is better in detecting and measuring the effects which cannot be observed in either
cross- section or time-series data.
5. Panel data enables the study of more complex behavioral models –for example the effects
of technological change, or economic cycles.
6. Panel data can minimize the effects of aggregation bias, from aggregating firms into
broad groups.

Example-

Fig. 5.1 Panel Data

In the multiple response permutation procedure (MRPP) example above, two datasets with a
panel structure are shown and the objective is to test whether there's a significant difference
between people in the sample data. Individual characteristics (income, age, sex) are collected
for different persons and different years. In the first dataset, two persons (1, 2) are observed
every year for three years (2016, 2017, 2018). In the second dataset, three persons (1, 2, 3)
are observed two times (person 1), three times (person 2), and one time (person 3),
respectively, over three years (2016, 2017, 2018); in particular, person 1 is not observed in
year 2018 and person 3 is not observed in 2016 or 2018.

A balanced panel (e.g., the first dataset above) is a dataset in which each panel member (i.e.,
person) is observed every year. Consequently, if a balanced panel contains N panel members
and T periods, the number of observations (n) in the dataset is necessarily n = N×T.

An unbalanced panel (e.g., the second dataset above) is a dataset in which at least one panel
member is not observed every period. Therefore, if an unbalanced panel contains N panel
members and T periods, then the following strict inequality holds for the number of
observations (n) in the dataset: n < N×T.

Both datasets above are structured in the long format, where one row holds one observation per time
period. Another way to structure panel data would be the wide format, where one row represents one
observational unit for all points in time; for the example, the wide format would have only two (first
example) or three (second example) rows of data, with additional columns for each time-varying
variable (income, age).
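As a small illustration (made-up values; the column names person, year, income and age are hypothetical), pandas can move such a panel between the long and wide formats:

import pandas as pd

# Long format: one row per person-year observation
long = pd.DataFrame({
    "person": [1, 1, 1, 2, 2, 2],
    "year":   [2016, 2017, 2018, 2016, 2017, 2018],
    "income": [1300, 1600, 2000, 2000, 2100, 2200],
    "age":    [27, 28, 29, 38, 39, 40],
})

# Wide format: one row per person, one column per year for each time-varying variable
wide = long.pivot(index="person", columns="year", values=["income", "age"])
print(wide)

# Back to the long format
back_to_long = wide.stack(level="year").reset_index()
print(back_to_long)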

Advantages of Panel Data

1. Since panel data relate to individuals, firms, states, countries, etc., over time, there is bound
to be heterogeneity in these units. The techniques of panel data estimation can take such

heterogeneity explicitly into account by allowing for individual-specific variables, as we shall show
shortly. We use the term individual in a generic sense to include microunits such as individuals,
firms, states, and countries.

2. By combining time series of cross-section observations, panel data give “more informative
data, more variability, less collinearity among variables, more degrees of freedom and more
efficiency.”

3. By studying the repeated cross section of observations, panel data are better suited to study
the dynamics of change. Spells of unemployment, job turnover, and labor mobility are better
studied with panel data.

4. Panel data can better detect and measure effects that simply cannot be observed in pure cross-
section or pure time series data. For example, the effects of minimum wage laws on
employment and earnings can be better studied if we include successive waves of minimum
wage increases in the federal and/or state minimum wages.

5. Panel data enables us to study more complicated behavioral models. For example, phenomena
such as economies of scale and technological change can be better handled by panel data than
by pure cross-section or pure time series data.

6. By making data available for several thousand units, panel data can minimize the bias that
might result if we aggregate individuals or firms into broad aggregates.

Balanced and unbalanced panel data

If each cross-sectional unit has the same number of time series observations, then such a panel
(data set) is called a balanced panel; in that case each unit in the sample has the same number of
observations. If the number of observations differs among panel members, we call such a panel an
unbalanced panel.

5.3 POOLED OLS REGRESSION

In statistics, ordinary least squares (OLS) is a type of linear least squares method for
estimating the unknown parameters in a linear regression model. OLS chooses the parameters
of a linear function of a set of explanatory variables by the principle of least squares:
minimizing the sum of the squares of the differences between the observed dependent variable
(values of the variable being observed) in the given dataset and those predicted by the linear

function of the independent variable.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the
dependent variable, between each data point in the set and the corresponding point on the
regression surface—the smaller the differences, the better the model fits the data. The
resulting estimator can be expressed by a simple formula, especially in the case of a simple
linear regression, in which there is a single regressor on the right side of the regression
equation.

The OLS estimator is consistent when the regressors are exogenous, and—by the Gauss–
Markov theorem—optimal in the class of linear unbiased estimators when the errors are
homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides
minimum-variance mean-unbiased estimation when the errors have finite variances. Under
the additional assumption that the errors are normally distributed, OLS is the maximum
likelihood estimator.

Pooled OLS assumes that there are no unique attributes of individuals within the measurement set
and no universal effects across time. If such individual attributes do exist, however, pooled
regression may result in heterogeneity bias:

Fig. 5.2 Pooled OLS Regression
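As a minimal sketch (simulated long-format data; the statsmodels package and the variable names y and x1 are assumptions), pooled OLS simply stacks all entity-time observations and runs a single OLS regression, ignoring any individual-specific effects:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
entities, periods = 30, 6
df = pd.DataFrame({
    "entity": np.repeat(np.arange(entities), periods),
    "time":   np.tile(np.arange(periods), entities),
})
df["x1"] = rng.normal(size=len(df))
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=len(df))

# Pooled OLS: stack all entity-time observations and run one OLS regression
X = sm.add_constant(df[["x1"]])
pooled = sm.OLS(df["y"], X).fit()
print(pooled.params)            # ignores any entity-specific effects entirely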

5.4 FIXED EFFECT LEAST SQUARES

In statistics, a fixed effects model is a statistical model in which the model parameters are
fixed or non-random quantities. This is in contrast to random effects models and mixed models
in which all or some of the model parameters are considered as random variables. In
many applications, including econometrics and biostatistics, a fixed effects model refers to a
regression model in which the group means are fixed (non-random) as opposed to a random
effects model in which the group means are a random sample from a population. Generally,
data can be grouped according to several observed factors. The group means could be modeled
as fixed or random effects for each grouping. In a fixed effects model each group mean is a
group-specific fixed quantity.

Such models assist in controlling for omitted variable bias due to unobserved heterogeneity
when this heterogeneity is constant over time. This heterogeneity can be removed from the data
through differencing, for example by subtracting the group-level average over time, or by
taking a first difference which will remove any time invariant components of the model.

There are two common assumptions made about the individual specific effect: the random
effects assumption and the fixed effects assumption. The random effects assumption is that
the individual-specific effects are uncorrelated with the independent variables. The fixed
effect assumption is that the individual-specific effects are correlated with the independent
variables. If the random effects assumption holds, the random effects estimator is more
efficient than the fixed effects estimator. However, if this assumption does not hold, the
random effects estimator is not consistent. The Durbin–Wu–Hausman test is often used to
discriminate between the fixed and the random effects models. In panel data, where
longitudinal observations exist for the same subject, fixed effects represent the subject-
specific means. In panel data analysis the term fixed effects estimator (also known as the
within estimator) is used to refer to an estimator for the coefficients in the regression model
including those fixed effects (one time-invariant intercept for each subject).
Regression analysis typically employs the method of least squares to approximate the solution
of overdetermined systems (sets of equations with more equations than unknowns) by
minimizing the sum of squares of the residuals (a residual is the difference between an
observed value and the fitted value provided by a model).

Data fitting is the primary use case. Simple regression and least-squares approaches run into
trouble when the issue involves significant uncertainties in the independent variable (the x
variable); in such circumstances, the methodology necessary for fitting errors-in-variables
models may be considered in place of that for least-squares.

Whether or not the residuals are linear in all unknowns classifies a least-squares issue as either
linear (or ordinary) or nonlinear. In statistical regression analysis, the linear least-squares issue
arises; it may be solved in closed form. Iterative refinement is commonly used to solve a
nonlinear issue, with the system being approximated by a linear one at each iteration.

The prediction error for a dependent variable is modelled as a function of the independent
variable and the outliers from the fitted curve in polynomial least squares.

Least-squares estimates and maximum-likelihood estimates coincide when the observations come
from an exponential family whose natural sufficient statistic is the identity (such as the normal,
exponential, Poisson, and binomial distributions) and mild conditions are fulfilled. The least squares
approach may also be reformulated as a method-of-moments estimator.

Although the discussion here focuses mostly on linear functions, least squares can and should be
used for far broader classes of functions. For example, the least-squares approach may be used to
fit a generalized linear model by repeatedly applying a local quadratic approximation to the
likelihood (using the Fisher information).

Although the least-squares approach is commonly ascribed to Carl Friedrich Gauss (1795),
who made important theoretical contributions to the method and may have used it in the past,
the formal discovery and publication of the method occurred much later, in 1805, by Adrien-
Marie Legendre.

A fixed effects regression consists in subtracting the time mean from each variable in the
model and then estimating the resulting transformed model by Ordinary Least Squares. Consider a
model of the form

y it = β'x it + μ i + ε it

where μ i is an unobserved, time-invariant unit-specific component and ε it is an idiosyncratic error.
This procedure, known as the "within" transformation, allows one to drop the unobserved component
and consistently estimate β. Analytically, the model becomes

ỹ it = β'x̃ it + ε̃ it

where ỹ it = y it − ȳ i with ȳ i = (1/T) Σt y it, the average over t = 1, …, T (and the same for x, μ,
and ε). Because μ i is fixed over time, we have μ i − μ̄ i = 0.

This procedure is numerically identical to including N – 1 dummies in the regression,


suggesting intuitively that a fixed effects regression accounts for unobserved individual
heterogeneity by means of individual specific intercepts. In other words, the slopes of the
regression are common across units (the coefficients of x1, x2, …, xK) whereas the intercept
is allowed to vary.

One drawback of the fixed effects procedure is that the within transformation does not allow
one to include time-invariant independent variables in the regression, because they get
eliminated similarly to the fixed unobserved component. In addition, parameter estimates are
likely to be imprecise if the time series dimension is limited.

Under classical assumptions, the fixed effects estimator is consistent (with N → ∞ and T
fixed) in the cases of both E(x jit μ i) = 0 and E(x jit μ i) ≠ 0, where j = 1, …, K. It is efficient
when all the explanatory variables are correlated with μ i. However, it is less efficient than the
random effects estimator when E(x jit μ i) = 0.

The consistency property requires the strict exogeneity of x. However, this property is not
satisfied when the estimated model includes a lagged dependent variable, as in
y it = α y it−1 + β'x it + μ i + ε it.

This suggests the adoption of instrumental variables or Generalized Method of Moments


techniques in order to obtain consistent estimates. However, a large time dimension T assures
consistency even in the case of the dynamic specification above.

Sometimes the true model includes unobserved shocks that are common to all units i but vary over
time. In this case, the model includes an additional time-specific error component that can be
controlled for by simply including time dummies in the equation.

A typical application of a fixed effects regression is in the context of wage equations. Let us
assume that we are interested in assessing the impact of years of education in logs e on wages
in logs w when the ability of individuals a is not observed. The true model is then

w it = β0 + β1 e it + v it

where v it = a i + ε it. Given that unobserved ability is likely to be correlated with education, the
composite stochastic error v is also correlated with the regressor and the estimate of β1 will be
biased. However, since innate ability does not change over time, if our data set is longitudinal we
can use a fixed effects estimator to obtain a consistent estimate of β1. Applying the within
transformation to the preceding equation we end up with

w̃ it = β1 ẽ it + ε̃ it

where we have eliminated the time-invariant unobserved component a i. Since E(ẽ it ε̃ it) = 0,
the model now satisfies the classical assumptions and we can estimate it by Ordinary Least
Squares.
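The within estimator described above can be sketched in a few lines of Python (a hedged illustration on simulated data, not a full panel-data routine): each variable is demeaned by entity and OLS is then run on the demeaned data.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
N, T, beta = 50, 8, 1.5
entity = np.repeat(np.arange(N), T)
mu = rng.normal(scale=2.0, size=N)[entity]     # time-invariant entity effect mu_i
x = 0.5 * mu + rng.normal(size=N * T)          # regressor correlated with mu_i (so pooled OLS would be biased)
y = beta * x + mu + rng.normal(size=N * T)

df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# Within transformation: subtract each entity's time mean from y and x
demeaned = df[["x", "y"]] - df.groupby("entity")[["x", "y"]].transform("mean")

fe = sm.OLS(demeaned["y"], demeaned[["x"]]).fit()  # no constant needed after demeaning
print(fe.params)                                   # close to the true beta = 1.5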

5.5 DUMMY VARIABLE MODEL

In statistics and econometrics, particularly in regression analysis, a dummy variable is one
that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect
that may be expected to shift the outcome. They can be thought of as numeric stand-ins for
qualitative facts in a regression model, sorting data into mutually exclusive categories (such as
smoker and non-smoker).

A dummy independent variable (also called a dummy explanatory variable) which for some
observation has a value of 0 will cause that variable's coefficient to have no role in influencing
the dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter
the intercept. For example, suppose membership in a group is one of the qualitative variables
relevant to a regression. If group membership is arbitrarily assigned the value of 1, then all
others would get the value 0. Then the intercept would be the constant term for non-members
but would be the constant term plus the coefficient of the membership dummy in the case of
group members.

Incorporating a dummy independent-

Dummy variables are incorporated in the same way as quantitative variables are included (as
explanatory variables) in regression models. For example, if we consider a Mincer-type
regression model of wage determination, wherein wages are dependent on gender (qualitative)
and years of education (quantitative):

wage = β0 + δ0 female + β1 education + u

where u is the error term. In the model, female = 1 when the person is a female
and female = 0 when the person is male. δ0 can be interpreted as the difference in wages between
females and males, holding education constant. Thus, δ0 helps to determine whether there is
discrimination in wages between males and females. For example, if δ0 > 0 (positive coefficient),
then women earn a higher wage than men (keeping other factors constant). The coefficients attached
to the dummy variables are called differential intercept coefficients. The model can be depicted
graphically as an intercept shift between females and males; in the case δ0 < 0, men earn a
higher wage than women.
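A brief, hedged sketch of such a wage regression on simulated data (the numbers are made up; the statsmodels formula interface is one convenient way to include the female dummy):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
female = rng.integers(0, 2, size=n)              # dummy: 1 = female, 0 = male
education = rng.integers(8, 21, size=n)          # years of education
wage = 5.0 - 1.5 * female + 0.8 * education + rng.normal(size=n)

df = pd.DataFrame({"wage": wage, "female": female, "education": education})

# The coefficient on the female dummy is the differential intercept (delta_0 above),
# i.e. the male/female wage gap holding education constant
res = smf.ols("wage ~ female + education", data=df).fit()
print(res.params)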

Dummy variables may be extended to more complex cases. For example, seasonal effects may
be captured by creating dummy variables for each of the seasons: D1 = 1 if the observation is for
summer, and equals zero otherwise; D2 = 1 if and only if autumn, otherwise equals zero; D3 = 1 if
and only if winter, otherwise equals zero; and D4 = 1 if and only if spring, otherwise equals zero.
In panel data, fixed effects estimator dummies are created for each of the units in cross-sectional
data (e.g. firms or countries) or for periods in a pooled time series. However, in such regressions
either the constant term has to be removed or one of the dummies has to be removed, with its
associated category becoming the base category against which the others are assessed, in order to
avoid the dummy variable trap:

The constant term in all regression equations is a coefficient multiplied by a regressor equal to
one. When the regression is expressed as a matrix equation, the matrix of regressors then
consists of a column of ones (the constant term), vectors of zeros and ones (the dummies), and
possibly other regressors. If one includes both male and female dummies, say, the sum of these
vectors is a vector of ones, since every observation is categorized as either male or female. This
sum is thus equal to the constant term's regressor, the first vector of ones. As a result, the
regression equation will be unsolvable, even by the typical pseudoinverse method. In other
words: if both the vector-of-ones (constant term) regressor and an exhaustive set of dummies
are present, perfect multicollinearity occurs, and the system of equations formed by the
regression does not have a unique solution. This is referred to as the dummy variable trap. The
trap can be avoided by removing either the constant term or one of the offending dummies. The
removed dummy then becomes the base category against which the other categories are
compared.
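A short sketch of how the trap is avoided in practice (hypothetical season labels; pandas shown as one possible tool): dropping one category leaves the remaining dummies to be interpreted against that base category.

import pandas as pd

seasons = pd.Series(["summer", "autumn", "winter", "spring", "summer", "winter"])

# The full set of four dummies together with a constant would be perfectly collinear
full = pd.get_dummies(seasons)

# drop_first=True removes one category (the base), avoiding the dummy variable trap
safe = pd.get_dummies(seasons, drop_first=True)
print(full.columns.tolist())   # ['autumn', 'spring', 'summer', 'winter']
print(safe.columns.tolist())   # base category dropped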

Dependent dummy variable models

Analysis of dependent dummy variable models can be done through different methods. One
such method is the usual OLS method, which in this context is called the linear probability
model. An alternative method is to assume that there is an unobservable continuous latent
variable Y* and that the observed dichotomous variable Y = 1 if Y* > 0, 0 otherwise. This is
the underlying concept of the logit and probit models. These models are discussed in brief below.

Linear probability model



An ordinary least squares model in which the dependent variable Y is a dichotomous
dummy, taking the values of 0 and 1, is the linear probability model (LPM). Suppose we
consider the following regression:
Yi = α1 + α2Xi + ui
where
X = family income
Y=1 if a house is owned by the family, 0 if a house is not owned by the family
The model is called the linear probability model because the regression is linear. The
conditional mean of Yi given Xi, written as E(Yi | Xi), is interpreted as the conditional
probability that the event will occur for that value of Xi — that is, Pr(Yi = 1 | Xi). In this
example, E(Yi | Xi) gives the probability of a house being owned by a family whose
income is given by Xi.
Now, using the OLS assumption E(ui) = 0, we get
E(Yi | Xi) = α1 + α2Xi
Some problems are inherent in the LPM model:
 The regression line will not be a well-fitted one and hence measures of significance,
such as R2, will not be reliable.
 Models that are analyzed using the LPM approach will have heteroscedastic
disturbances.
 The error term will have a non-normal distribution.

 The LPM may give predicted values of the dependent variable that are greater than 1
or less than 0. This will be difficult to interpret as the predicted values are intended to
be probabilities, which must lie between 0 and 1.
 There might exist a non-linear relationship between the variables of the LPM
model, in which case the linear regression will not fit the data accurately.


Dummy variables are employed often in time series analysis with regime
switching, seasonal analysis and qualitative data applications.
A dummy variable is a numerical variable used in regression analysis to represent
subgroups of the sample in your study. In research design, a dummy variable is
often used to distinguish different treatment groups. In the simplest case, we
would use a 0,1-dummy variable where a person is given a value of 0 if they are
in the control group or a 1 if they are in the treated group. Dummy variables are
useful because they enable us to use a single regression equation to represent
multiple groups. This means that we don’t need to write out separate equation
models for each subgroup. The dummy variables act like ‘switches’ that turn
various parameters on and off in an equation. Another advantage of a 0,1 dummy-
coded variable is that even though it is a nominal-level variable you can treat it
statistically like an interval-level variable (if this made no sense to you, you
probably should refresh your memory on levels of measurement). For instance, if
you take an average of a 0,1 variable, the result is the proportion of 1s in the
distribution.
yi = β0 + β1Zi + ei
where:

 yi is outcome score of ith unit,


 β0 is coefficient for the intercept,
 β1 is coefficient for the slope,

 Zi is:
o 1 if the ith unit is in the treatment group;
o 0 if the ith unit is in the control group;
 ei is residual for the ith unit.

As an example of a dummy variable, consider the basic regression model for a two-group
randomized experiment with a single posttest measurement. A comparison of posttest means for
the two groups using this model is equivalent to a t-test or a one-way analysis of variance
(ANOVA). The key quantity in the model is the estimate of the difference between the groups,
given by the term β1. Using this simple example, we can use the dummy variable to solve for the
equation of each subgroup, and then subtract the subgroup equations to obtain the difference
between them. This shows how dummy variables pack a great deal of information into a single
equation: the treatment group differs from the control group by β1.

The first step is to work out the equation for each of the two groups separately. In the control
group, Z = 0. Substituting this into the equation, and assuming the error term averages to zero,
we get the intercept β0 as the predicted value for the control group. We can then obtain the
treatment group line by setting Z = 1, again assuming the error term averages to zero. The
predicted value for the treatment group is therefore the sum of the two beta values, β0 + β1.

Having completed the first step, we can now compute the difference between the groups. The
difference corresponds to the difference between the equations derived for the two groups above;
in other words, it is obtained by subtracting the control group equation from the treatment group
equation. The difference is (β0 + β1) − β0 = β1. In other words, for this model the difference
between the two groups is β1.

By following the two steps outlined above, you can see how dummy variables are used to represent
the subgroup equations in any regression model that uses dummy variables: first, substitute the
dummy values into the equation to obtain the prediction for each subgroup; then compare the
resulting subgroup equations to determine the differences between the groups.
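A small numerical check of this logic on simulated scores (not from any actual study): the OLS coefficient on the 0,1 treatment dummy Z equals the difference between the two group means, and the intercept equals the control group mean.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
z = np.repeat([0, 1], 100)                        # 0 = control group, 1 = treatment group
y = 50 + 5 * z + rng.normal(scale=10, size=200)   # true group difference is 5

res = sm.OLS(y, sm.add_constant(z)).fit()
print(res.params)                                 # [beta_0, beta_1]
print(y[z == 1].mean() - y[z == 0].mean())        # equals the estimated beta_1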

5.6 FIXED EFFECT WITHIN GROUP (WG) ESTIMATOR

Fixed effects are variables that are constant across individuals; these variables, like age, sex,
or ethnicity, don't change or change at a constant rate over time. They have fixed effects; in
other words, any change they cause to an individual is the same. For example, any effects
from being a woman, a person of color, or a 17-year-old will not change over time.
It could be argued that these variables could change over time. For example, take women in
the workplace: Forbes reports that the glass ceiling is cracking. However, the wheels of
change are extremely slow (there was a 26-year gap between Britain's first woman prime
minister, Margaret Thatcher, and the country's second woman prime minister, Theresa May).
Therefore, for purposes of research and experimental design, these variables are treated as a
constant.

In a fixed effects model, random variables are treated as though they were non-random, or
fixed. For example, in regression analysis, "fixed effects regression" fixes (holds constant)
average effects for whatever variable you think might affect the outcome of your analysis.
Fixed effects models do have some limitations. For example, they can't control for variables
that vary over time (like income level or employment status). However, these variables can
be included in the model by including dummy variables for time or space units. This may
seem like a good idea, but the more dummy variables you introduce, the more the "noise"
in the model is controlled for; this could lead to over-dampening the model, reducing the
useful as well as the useless information.

Such models assist in controlling for unobserved heterogeneity when this heterogeneity is
constant over time: typically, ethnicity and the year and location of birth are heterogeneous
variables that a fixed effects model can control for. This constant heterogeneity is the fixed effect
for this individual. This constant can be removed from the data, for example by subtracting
each individual's mean from each of his observations before estimating the model.
A random effects model makes the additional assumption that the individual effects are
randomly distributed. It is thus not the opposite of a fixed effects model, but a special case. If
the random effects assumption holds, the random effects model is more efficient than the fixed
effects model. However, if this additional assumption does not hold (i.e., if the Hausman test
fails), the random effects model is not consistent.

5.7 THE RANDOM EFFECTS MODEL

In statistics, a random effects model, also called a variance components model, is a statistical
model where the model parameters are random variables. It is a kind of hierarchical linear
model, which assumes that the data being analyzed are drawn from a hierarchy of different
populations whose differences relate to that hierarchy. In econometrics, random effects
models are used in the analysis of hierarchical or panel data when one assumes no fixed effects
(it allows for individual effects). The random effects model is a special case of the fixed effects
model.

Contrast this to the biostatistics definitions, as biostatisticians use "fixed" and "random"
effects to respectively refer to the population-average and subject-specific effects (and where
the latter are generally assumed to be unknown, latent variables).

Random effect models assist in controlling for unobserved heterogeneity when the
heterogeneity is constant over time and not correlated with independent variables. This
constant can be removed from longitudinal data through differencing, for example by taking a first
difference, which will remove any time-invariant components of the model.
Two common assumptions can be made about the individual specific effect: the random
effects assumption and the fixed effects assumption. The random effects assumption is that
the individual unobserved heterogeneity is uncorrelated with the independent variables. The
fixed effect assumption is that the individual specific effect is correlated with the independent
variables.

If the random effects assumption holds, the random effects model is more efficient than the

fixed effects model. However, if this assumption does not hold, the random effects model is
not consistent.

In general, random effects are efficient, and should be used (over fixed effects) if the
assumptions underlying them are believed to be satisfied. For random effects to work in the
school example it is necessary that the school-specific effects be uncorrelated to the other
covariates of the model. This can be tested by running fixed effects, then random effects, and
doing a Hausman specification test. If the test rejects, then random effects is biased and fixed
effects is the correct estimation procedure.
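As a hedged sketch (assuming the third-party linearmodels package and a long-format DataFrame with a MultiIndex of entity and time; the variable names and data are hypothetical), fixed effects and random effects estimates can be obtained side by side; in practice the Hausman test described above is then used to choose between them:

import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS, RandomEffects

rng = np.random.default_rng(5)
N, T = 40, 6
idx = pd.MultiIndex.from_product([range(N), range(T)], names=["entity", "time"])
df = pd.DataFrame({"x1": rng.normal(size=N * T)}, index=idx)
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=N * T)
df["const"] = 1.0

fe = PanelOLS(df["y"], df[["const", "x1"]], entity_effects=True).fit()
re = RandomEffects(df["y"], df[["const", "x1"]]).fit()
print(fe.params)
print(re.params)   # a large gap in the slope estimates casts doubt on the RE assumption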

To quantify the impact of immeasurable personal traits like perseverance or savviness, the Random
Effects regression model is commonly employed. Similar effects at the individual level are common in
panel data analyses. The Random Effects model, along with the Fixed Effect regression model, is a
popular method for investigating how different factors influence the panel data set's response.

This is the third and final instalment of our three-part series on Panel Data Analysis.

Inference in Panel Data Using Pooled Ordinary Least Squares

To Analyze Panel Data Using a Fixed Effects Regression Model

An Overview of the Random Effects Regression Model for Panel Data

It's possible that the first 10% of this chapter will feel like a review for individuals who have already
studied the chapters on the FE model and the Pooled OLS model.

Let's start by (quickly) reviewing panel data.

Covariate effects in mixed-effects models may be calculated using the full random-effects
model (FREM). Mean and standard deviation are used to characterize the covariates in the
model. Estimated covariances between parameters and covariates capture the covariate effects.
This strategy is robust in situations where methods based on estimating fixed effects may
experience performance drops (e.g., correlated covariates whose effects cannot be
simultaneously identified in fixed-effects methods). You may modify the covariate-parameter

relationship by employing FREM covariate parameterization and transforming covariate data records.
It was demonstrated that the four relations used in this implementation (linear, log-linear, exponential,
and power) yield estimates that are comparable to those obtained using a fixed-effects design. Both real
and simulated data with and without non-normally distributed and strongly correlated variables were
used to compare FREM to technically identical full fixed-effects models (FFEMs). Based on these
studies, it is clear that both FREM and FFEM work admirably in the studied instances, with FREM
providing somewhat more precise estimates of parameter interindividual variability (IIV).
Moreover, FREM provides the distinct benefit of allowing a single estimation to produce
covariate impact coefficient estimates and IIV estimates for any subset of the analyzed
variables, including the influence of each covariate separately. Covariate effects can be
communicated in a fashion that is not dependent on other covariates, or the model can be
applied to data sets with varying sets of accessible covariates.

A random effects model, also called a variance components model, is a kind of hierarchical
linear model. It assumes that the data describe a hierarchy of different populations whose
differences are constrained by the hierarchy. The fixed effects model is a special case.
Simple example-

Suppose m large elementary schools are chosen randomly from among millions in a large
country. Then n pupils are chosen randomly from among those at each such school. Their
scores on a standard aptitude test are ascertained. Let Yij be the score of the jth pupil at the
ith school. Then

Yij = μ + Ui + Wij

where μ is the average of all scores in the whole population, Ui is the deviation of the average
of all scores at the ith school from the average in the whole population, and Wij is the
deviation of the jth pupil's score from the average score at the ith school.
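A tiny simulation of this two-level structure (illustrative numbers only) shows how the variance of Yij decomposes into a between-school component Var(Ui) and a within-school component Var(Wij):

import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 30                  # schools and pupils per school
mu = 500.0                      # population mean score
U = rng.normal(scale=20.0, size=m)         # school effects U_i, sd 20
W = rng.normal(scale=50.0, size=(m, n))    # pupil-level deviations W_ij, sd 50
Y = mu + U[:, None] + W                    # scores Y_ij = mu + U_i + W_ij

school_means = Y.mean(axis=1)
print(Y.mean())                  # close to mu = 500
print(school_means.var(ddof=1))  # approx Var(U) + Var(W)/n = 400 + 2500/30
print(Y.var(ddof=1))             # approx Var(U) + Var(W) = 400 + 2500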

5.8 SUMMARY

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data
involving measurements over time. Panel data contain observations of multiple phenomena
obtained over multiple time periods for the same firms or individuals.
In panel data the same cross-sectional unit (an industry, firm or country) is surveyed over time,
so we have data which are pooled over space as well as time.

Unobserved independent variables that influence the dependent variable and are correlated with the
included regressors can lead to biased estimators in standard linear regression models, but these
dependencies can be tamed with panel data regression. In this post, I will show you how to construct
a panel data regression model in Python and explain the most significant theory behind this topic.
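As a hedged illustration of what such a model can look like in Python, the sketch below uses the linearmodels package; the package choice, the file name panel.csv and the column names firm, year, y and x are assumptions made for this example, not something prescribed by these notes.

import pandas as pd
from linearmodels.panel import RandomEffects

# Hypothetical long-format panel: one row per firm-year combination
data = pd.read_csv("panel.csv")               # columns: firm, year, y, x (illustrative file)
data = data.set_index(["firm", "year"])       # linearmodels expects an entity-time MultiIndex

# Random effects regression of y on x with a constant
re_model = RandomEffects.from_formula("y ~ 1 + x", data=data)
re_result = re_model.fit()
print(re_result)                              # coefficients plus the variance decomposition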

In writing this, I hope to accomplish two things: My first point is that I have yet to come
across a simple and clear explanation of an integrated panel data regression model. Second,
although it is possible to execute panel data regression in Python, it is not as user-friendly as
it is, say, in R. This, however, does not diminish the effectiveness of the method. So, in the
hopes of making future panel data analysis slightly less painful for myself ;-), I've chosen to
share what I've learned on a recent assignment.

• With panel data we can study different issues:


- Cross sectional variation (unobservable in time series data) vs.
Time series variation (unobservable in cross sectional data)
- Heterogeneity (observable and unobservable individual heterogeneity)
- Hierarchical structures (say, zip code, city and state effects)
- Dynamics in economic behavior
- Individual/Group effects (individual effects)
- Time effects

Regression using panel data may mitigate omitted variable bias when there is no information
on variables that correlate with both the regressors of interest and the dependent variable,
provided these omitted variables are constant in the time dimension or across entities. Provided
that panel data are available, panel regression methods may improve upon multiple regression
models which, as discussed in Chapter 9, produce results that are not internally valid in such a setting.
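To make the intuition concrete, here is a small sketch of my own (with simulated data, so every number and name is an assumption) showing that entity demeaning, the so-called within transformation, removes a time-constant unobserved effect that would otherwise bias pooled OLS.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_entities, n_periods = 100, 6
entity = np.repeat(np.arange(n_entities), n_periods)

alpha = rng.normal(0, 2, n_entities)                        # unobserved, time-constant entity effect
x = 0.8 * alpha[entity] + rng.normal(size=entity.size)      # regressor correlated with that effect
y = 1.5 * x + alpha[entity] + rng.normal(size=entity.size)  # true slope is 1.5

df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# Within transformation: subtract each entity's own mean from its observations
demeaned = df.groupby("entity")[["x", "y"]].transform(lambda s: s - s.mean())

pooled = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()    # biased upward by the omitted effect
within = sm.OLS(demeaned["y"], demeaned["x"]).fit()         # close to the true slope of 1.5
print(pooled.params["x"], within.params["x"])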

This chapter covers the following topics:
notation for panel data
fixed effects regression using time and/or entity fixed effects
computation of standard errors in fixed effects regression models
Following the book, for applications we make use of the dataset Fatalities from the AER
package (Christian Kleiber and Zeileis 2020) which is a panel dataset reporting annual state
level observations on U.S. traffic fatalities for the period 1982 through 1988. The applications
analyze if there are effects of alcohol taxes and drunk driving laws on road fatalities and, if
present, how strong these effects are.
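A hedged sketch of this kind of two-way fixed effects estimation in Python follows; the linearmodels package, the file name fatalities.csv and the column names state, year, fatal_rate and beertax are assumptions chosen to mirror the application, not code taken from the book.

import pandas as pd
from linearmodels.panel import PanelOLS

# Hypothetical long-format version of the traffic-fatalities panel
df = pd.read_csv("fatalities.csv")                 # columns: state, year, fatal_rate, beertax
df = df.set_index(["state", "year"])

# Entity (state) and time (year) fixed effects in one specification
fe_model = PanelOLS.from_formula(
    "fatal_rate ~ beertax + EntityEffects + TimeEffects", data=df
)

# Standard errors clustered by state, a common choice in fixed effects panel regressions
fe_result = fe_model.fit(cov_type="clustered", cluster_entity=True)
print(fe_result)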

5.9 KEYWORDS

 1. Regression analysis - Regression analysis is a set of statistical methods used for the
estimation of relationships between a dependent variable and one or more
independent variables. It can be utilized to assess the strength of the relationship
between variables and for modeling the future relationship between them.

 2. Panel regression - Panel regression is a modeling method adapted to panel data,
also called longitudinal data or cross-sectional time-series data. It is widely used in
econometrics, where the behavior of statistical units (i.e. panel units) is followed across
time. Those units can be firms, countries, states, etc.

 3. Estimator - In statistics, an estimator is a rule for calculating an estimate of a given
quantity based on observed data: thus the rule (the estimator), the quantity of interest
(the estimand) and its result (the estimate) are distinguished. For example, the sample
mean is a commonly used estimator of the population mean.

 4. Accounting - Accounting is the process of documenting a business's financial
transactions. These transactions are compiled, examined, and reported to oversight
organizations, regulatory bodies, and tax collecting organizations as part of the
accounting process. A company's activities, financial condition, and cash flows are
summarized in the financial statements that are used in accounting. They provide a
succinct overview of financial events across an accounting period.

 5. Finance - Finance is defined as the management of money and includes activities
such as investing, borrowing, lending, budgeting, saving, and forecasting. There are
three main types of finance: (1) personal, (2) corporate, and (3) public/government.

5.10 LEARNING ACTIVITY

1. Note on Panel Data


__________________________________________________________________________
__________________________________________________________________________
2. Explain Fixed Effect Least Squares
__________________________________________________________________________
__________________________________________________________________________
3. Give details of Random Effects Model
__________________________________________________________________________
__________________________________________________________________________

5.12 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Which cross-sectional dependence test is more appropriate for a macro panel with a small time
dimension and a large cross-section?

2. How can a True Random Effects technical efficiency model with inefficiency determinants be
estimated in a single step?

3. What is panel data?

4. What are the different types of panel data regression?

5. What kind of data are required for panel analysis?

Long Questions

1. Explain the random effects model.

2. Explain briefly the dummy variable model.

3. What is pooled OLS regression? Explain with an example.

4. What are the reasons for using panel data?

5. Explain the fixed effects model in detail.

B. Multiple Choice Questions

1.Which of the following is a disadvantage of the fixed effects approach to estimating a panel
model?

a. The model is likely to be technical to estimate

b. The approach may not be valid if the composite error term is correlated with one or more of
the explanatory variables

c. The number of parameters to estimate may be large, resulting in a loss of degrees of freedom

d. The fixed effects approach can only capture cross-sectional heterogeneity and not temporal
variation in the dependent variable.

2. The "within transform" involves

a. Taking the average values of the variables

b. Subtracting the mean of each entity away from each observation on that entity

c. Estimating a panel data model using least squares dummy variables

d. Using both time dummies and cross-sectional dummies in a fixed effects panel model

3. The fixed effects panel model is also sometimes known as

a. A seemingly unrelated regression model

b. The least squares dummy variables approach

c. The random effects model

d. Heteroscedasticity and autocorrelation consistent

4. Which of the following is a disadvantage of the random effects approach to estimating a


panel model?

a. The approach may not be valid if the composite error term is correlated with one or more of
the explanatory variables

b. The number of parameters to estimate may be large, resulting in a loss of degrees of freedom

c. The random effects approach can only capture cross-sectional heterogeneity and not
temporal variation in the dependent variable.

d. All of (a) to (c) are potential disadvantages of the random effects approach.

5. Which of the following are advantages of the use of panel data over pure cross-sectional or
pure time-series modelling? (i) The use of panel data can increase the number of degrees of
freedom and therefore the power of tests

(ii) The use of panel data allows the average value of the dependent variable to vary either
cross-sectionally or over time or both

(iii) The use of panel data allows the estimated relationship between the independent and
dependent variables to vary either cross-sectionally or over time or both

a. (i) only

b. (i) and (ii) only

c. (ii) only

d. (i), (ii), and (iii)

Answers

1-c, 2-b, 3-b, 4-a, 5-b

5.13 REFERENCES

 Gujarati, D., Porter, D.C. and Gunasekhar, C. (2012). Basic Econometrics (Fifth Edition). McGraw Hill Education.
 Anderson, D. R., D. J. Sweeney and T. A. Williams (2011). Statistics for Business and Economics. 12th Edition, Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third Edition, Thomson South-Western, 2007.
 Johnstone, J., Econometric Methods, 3rd Edition, McGraw Hill, New York, 1994.
 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic Press, 2002 (IGM Library Call No. 330.0182 R14I).
 Koutsoyiannis, A., The Theory of Econometrics, 2nd Edition, ESL.

SELF ASSESSMENT QUIZ: UNIT I - INTRODUCTION TO ECONOMETRICS

The coefficient of determination, r2 shows

Select one:
a. Proportion of the variation in the dependent variable X is explained by the
independent variable Y
b. Proportion of the variation in ui is explained by the independent variable X
c. . Both a and c

d. Proportion of the variation in the dependent variable Y is explained by the


independent variable X

The correct answer is: Proportion of the variation in the dependent variable Y is explained
by the independent variable X

Suppose that we wanted to sum the 2007 returns on ten shares


to calculate the return on a portfolio over that year. What method
of calculating the individual stock returns would enable us to do
this?

Select one:
a. Continuously compounded

b. Either approach could be used and they would both give the same portfolio return
c. Neither approach would allow us to do this validly
d. Simple

The correct answer is: Simple


Suppose the variable x2 has been omitted from the following regression equation,
y = β0 + β1x1 + β2x2 + u. ~β1 is the estimator obtained when x2 is omitted from the equation.
If E(~β1) > β1, ~β1 is said to _____

Select one:
a. have a downward bias
b. be unbiased
c. be biased toward zero

d. have an upward bias

The correct answer is: have an upward bias

Which of the following statements is false concerning the linear


probability model?

Select one:
a. Even if the probabilities are truncated at zero and one, there will probably be many
observations for which the probability is either exactly zero or exactly one

b. The error terms will be heteroscedastic and not normally distributed


c. There is nothing in the model to ensure that the estimated probabilities lie between
zero and one
d. The model is much harder to estimate than a standard regression model with a
continuous dependent variable

The correct answer is: The model is much harder to estimate than a standard regression
model with a continuous dependent variable
Which one of the following is NOT a plausible remedy for near
multicollinearity?

Select one:
a. Use principal components analysis
b. . Use a longer run of data

c. Take logarithms of each of the variables


d. Drop one of the collinear variables

The correct answer is: Take logarithms of each of the variables

What would be the consequences for the OLS estimator if heteroscedasticity is present in a regression model but ignored?

Select one:
a. It will be inconsistent
b. All of a), c) and d) will be true

c. It will be ignored
d. It will be inefficient

The correct answer is: It will be inefficient

What is the meaning of the term "heteroscedasticity"?

Select one:
a. The errors have non- zero mean
b. The variance of the errors is not constant

c. The errors are not linearly independent of one another


d. The variance of the dependent variable is not constant

The correct answer is: The variance of the errors is not constant
Which of the following is true of R2?

Select one:
a. R2 usually decreases with an increase in the number of independent variables in
aregression
b. R2 is also called the standard error of regression

c. A low R2 indicates that the Ordinary Least Squares line fits the data well.
d. R2 shows what percentage of the total variation in the dependent variable, Y, is
explained by the explanatory variables

The correct answer is: R2 shows what percentage of the total variation in the dependent variable, Y, is
explained by the explanatory variables

Which one of the following is NOT an assumption of the classical


linear regression model?

Select one:
a. The dependent variable is not correlated with the disturbance terms
b. The disturbance terms have zero mean

c. The explanatory variables are uncorrelated with the error terms.


d. The disturbance terms are independent of one another.

The correct answer is: The dependent variable is not correlated with the disturbance
terms
Nonexperimental data is called _____.

Select one:
a. panel data
b. observational data
c. time series data
d. cross-sectional data

The correct answer is: observational data


Which of the following is true of experimental data?

Select one:
a. Experimental data is sometimes called observational data
b. Experimental data are collected in laboratory environments in the natural sciences.

c. Experimental data is sometimes called retrospective data


d. Experimental data cannot be collected in a controlled environment

The correct answer is: Experimental data are collected in laboratory environments in the
natural sciences.

Find the degrees of freedom in a regression model that has 10


observations and 7 independent variables

Select one:
a. 4
b. 2

c. 8
d. 3

The correct answer is: 2
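A brief note on the arithmetic behind this answer: with an intercept included, degrees of freedom = n - (k + 1) = 10 - (7 + 1) = 2.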


In the equation y = β0 + β1x + u, β0 is the _____

Select one:
a. slope parameter
b. intercept parameter

c. independent variable
d. dependent variable

The correct answer is: intercept parameter

Which of the following is an example of time series data?

Select one:
a. Data on the number of vacancies in various departments of an organization on a
particular month.
b. Data on the gross domestic product of a country over a period of 10 years

c. Data on the consumption of wheat by 200 households during a year.


d. Data on the unemployment rates in different parts of a country during a year.

The correct answer is: Data on the gross domestic product of a country over a period of
10 years
Which of the following is a correct interpretation of a “95%
confidence interval” for aregression parameter?

Select one:
a. We are 95% sure that the interval contains our estimate of the coefficient
b. We are 95% sure that the interval contains the true value of the parameter

c. We are 95% sure that our estimate of the coefficient is correct


d. In repeated samples, we would derive the same estimate for the coefficient 95% of
thetime

The correct answer is: We are 95% sure that the interval contains the true value of the
parameter

If the residual sum of squares (SSR) in a regression analysis is 66


and the total sum of squares (SST) is equal to 90, what is the value
of the coefficient of determination?

Select one:
a. 0.55

b. 0.27
c. 0.74
d. 1.2

The correct answer is: 0.27
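For reference, using the question's notation where SSR denotes the residual sum of squares: r2 = 1 - SSR/SST = 1 - 66/90 ≈ 0.27.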


The type I error associated with testing a hypothesis is equal to

Select one:
a. The size of the test

b. The size of the sample


c. The confidence level
d. One minus the type II error

The correct answer is: The size of the test

Autocorrelation is generally occurred in

Select one:
a. Analysis data
b. Pooled data

c. Time series data


d. Cross-section data

The correct answer is: Time series data


The assumption that there are no exact linear relationships
among the independent variables in a multiple linear regression
model fails if _____, where n is the sample size and k is the
number of parameters

Select one:
a. n > 2
b. n < k + 1
c. n = k + 1
d. n > k + 1
e. n > k

The correct answer is: n < k + 1

Econometrics is the branch of economics that _____

Select one:
a. applies mathematical methods to represent economic theories and solve economic
problems.
b. studies the behavior of individual economic agents in making economic decisions

c. develops and uses statistical methods for estimating economic relationships


d. deals with the performance, structure, behavior, and decision-making of an economy
as a whole

The correct answer is: develops and uses statistical methods for estimating economic
relationships


Which of the following statements is correct concerning the


conditions required for OLS to be a usable estimation technique?

Select one:
a. The model must be linear in the variables
b. The model must be linear in the residuals.
c. The model must be linear in the variables and the parameters

d. The model must be linear in the parameters

The correct answer is: The model must be linear in the parameters
Which of the following is a difference between panel and pooled
cross-sectional data?

Select one:
a. A panel data set consists of data on the same cross-sectional units over a given period
of time while a pooled data set consists of data on different cross-sectional units over a
given period of time
b. A panel data set consists of data on different cross-sectional units over a given period
of time while a pooled data set consists of data on the same cross-sectional units over a
given period of time.

c. A panel data set consists of data on a single variable measured at a given point intime
while a pooled data set consists of data on more than one variable at a given point in
time
d. A panel data consists of data on a single variable measured at a given point in time
while a pooled data set consists of data on the same cross-sectional units over a given
period of time.

The correct answer is: A panel data set consists of data on the same cross-sectional units
over a given period of time while a pooled data set consists of data on different cross-
sectional units over a given period of time

In order to determine whether to use a fixed effects or random


effects model, a researcher conducts a Hausman test. Which of
the following statements is false?

Select one:
a. For random effects models, the use of OLS would result in consistent but inefficient
parameter estimation
b. Random effects estimation will not be appropriate if the composite error term is
correlated with one or more of the explanatory variables in the model
c. If the Hausman test is not satisfied, the random effects model is more appropriate.

d. Random effects estimation involves the construction of "quasi-demeaned" data

The correct answer is: If the Hausman test is not satisfied, the random effects model is more appropriate.
Consider the following regression model: y = β0 + β1x1 + u. Which of the following is a property of
Ordinary Least Squares (OLS) estimates of this model and their associated statistics?

Select one:
a. The point (x̄, ȳ) always lies on the OLS regression line.

b. The sum, and therefore the sample average of the OLS residuals, is positive.
c. The sum of the OLS residuals is negative
d. The sample covariance between the regressors and the OLS residuals is positive

The correct answer is: The point (x̄, ȳ) always lies on the OLS regression line.

High (but not perfect) correlation between two or more


independent variables is called _____

Select one:
a. multicollinearity
b. micronumerosity
c. heteroskedasticity
d. homoskedasticity

The correct answer is: multicollinearity

A dependent variable is also known as a(n) _____

Select one:
a. control variable

b. predictor variable
c. response variable
d. explanatory variable

The correct answer is: response variable


If an independent variable in a multiple linear regression model is
an exact linear combination of other independent variables, the
model suffers from the problem of _____.

Select one:
a. perfect collinearity
b. omitted variable bias

c. heteroskedasticity
d. homoskedasticity

The correct answer is: perfect collinearity

Consider an increase in the size of the test used to examine a


hypothesis from 5% to 10%. Which one of the following would be
an implication?

Select one:
a. The probability of a Type I error is increased
b. The probability of a Type II error is increased
c. The null hypothesis will be rejected less often

d. The rejection criterion has become more strict

The correct answer is: The probability of a Type I error is increased


A data set that consists of observations on a variable or several
variables over time is called a _____ data set

Select one:
a. Experimental

b. time series
c. binary
d. cross-sectional

The correct answer is: time series

An empirical analysis relies on _____to test a theory

Select one:
a. common sense
b. customs and conventions
c. data
d. ethical considerations

The correct answer is: data

SELF ASSESSMENT QUIZ: UNIT II - PROBLEMS IN REGRESSION ANALYSIS I

Partial correlations are useful indicator of:

Select one:
a. Normality
b. Heteroscedasticity

c. Multicollinearity
d. Autocorrelation

The correct answer is: Multicollinearity

The standard deviation of the sampling distribution of an


estimator is.....

Select one:
a. Standard Error
b. t value

c. RSS
d. r2

The correct answer is: Standard Error


E(Ui, Uj) ≠ 0, when i ≠ j, is termed as:

Select one:
a. Multicollinearity
b. Homoscedasticity
c. Heteroscedasticity
d. Autocorrelation

The correct answer is: Autocorrelation

Which of the following pair is not correctly matched?

Select one:
a. F test – Overall significance of the regression model
b. Durbin's h test – Autocorrelation in autoregressive models
c. Dickey-Fuller test – Heteroscedasticity
d. Distributed lag models – Koyck approach

The correct answer is: Dickey-Fuller test – Heteroscedasticity

In confidence interval estimation, α = 5%, this means that this


interval includes the true β with probability of

Select one:
a. 0.50%
b. 0.95
c. 0.5

d. 0.05

The correct answer is: 0.95


Park Test is used for what purpose?

Select one:
a. Detecting Multicollinearity
b. Detecting Heteroscedasticity
c. Solving Heteroscedasticity

d. Solving Multicollinearity

The correct answer is: Detecting Heteroscedasticity
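As a rough sketch of how the Park test works in practice (my own illustration with simulated data, not part of the quiz), the idea is to regress the log of the squared OLS residuals on the log of a regressor; a significant slope suggests heteroscedasticity.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
u = rng.normal(scale=0.5 * x)                 # error variance grows with x (heteroscedastic)
y = 2 + 3 * x + u

# Step 1: ordinary least squares and its residuals
ols_fit = sm.OLS(y, sm.add_constant(x)).fit()

# Step 2 (Park test): regress ln(residual^2) on ln(x)
park_fit = sm.OLS(np.log(ols_fit.resid ** 2), sm.add_constant(np.log(x))).fit()
print(park_fit.params[1], park_fit.pvalues[1])  # a significant slope points to heteroscedasticity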

A linear function has:

Select one:
a. Constant Slope and constant elasticity
b. Varying Slope and constant elasticity

c. Varying Slope and varying elasticity


d. Constant Slope and varying elasticity

The correct answer is: Constant Slope and varying elasticity

WLS is used to correct for:

Select one:
a. Correlation
b. Heteroscedasticity
c. Autocorrelation

d. Multicollinearity

The correct answer is: Heteroscedasticity
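A small sketch of weighted least squares in Python follows; the data are simulated and the weighting scheme (weights proportional to 1/x^2, appropriate when the error variance is proportional to x^2) is just one common assumption, not the only possible choice.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=x)                  # error standard deviation proportional to x

X = sm.add_constant(x)
wls_fit = sm.WLS(y, X, weights=1.0 / x ** 2).fit()   # weights are the reciprocal of the error variance
print(wls_fit.params, wls_fit.bse)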


In Linear Probability Model, the:

Select one:
a. Regressor is dichotomous
b. Regressor is ordinal
c. Regressand is ordinal variable
d. Regressand is dichotomous

The correct answer is: Regressand is dichotomous

Consider two regression models (1 and 2) with R2 of 0.52 and 0.89 respectively. Which among the following statements is true?

Select one:
a. None of these

b. Goodness of fit in regression 2 is more than that of 1


c. Both (a) and (b) are true
d. Goodness of fit in regression 1 is more than that of 2

The correct answer is: Goodness of fit in regression 2 is more than that of 1


The likelihood ratio test is related to:

Select one:
a. Maximum likelihood method.

b. Likelihood method.
c. Generalised least squares method.
d. Ordinary least squares method.

The correct answer is: Maximum likelihood method.

Regression models containing a mixture of quantitative and qualitative variables are called:

Select one:
a. ANOVA models.
b. Parallel regressions.

c. Coincident regressions.
d. ANCOVA models.

The correct answer is: ANCOVA models.


Which of the following is a multi-collinearity diagnostic?

Select one:
a. Condition Index.
b. Durbin's m test.
c. Park test.
d. Glejser test.

The correct answer is: Condition Index.

The number of Explanatory variables in a simple regression is....

Select one:
a. One
b. More than Two

c. Two
d. Zero

The correct answer is: One

Linear Regression is estimated through:

Select one:
a. MLE
b. PIML

c. OLS
d. FIML

The correct answer is: OLS


In a two variable regression, Y is the dependent variable and X is the independent variable. The correlation coefficient between Y and X is 0.8. For this, which of the following is correct?

Select one:
a. 80% of variations in Y are explained by X
b. 0.8% of variations in Y are explained by X
c. 64% of variations in Y are explained by X

d. 8% of variations in Y are explained by X

The correct answer is: 64% of variations in Y are explained by X

In a regression model, Y= a+bX+ui:

Select one:
a. Y is the regressor; X is the explanatory variable
b. Y is the independent variable; X is the dependent variable

c. Y is the regressand; X is the explanatory variable


d. Y is the independent variable; X is the outcome variable

The correct answer is: Y is the regressand; X is the explanatory variable

Choose the correct one from the following given

Select one:
a. 0 ≤ r2≤ 1
b. 0 > r2 > 1

c. 0 ≥ r2≥1
d. 0 < r2 < 1

The correct answer is: 0 ≤ r2≤ 1


In the k-variable case, the main diagonal elements in the simple correlation matrix are all:

Select one:
a. More than one.
b. Less than One.
c. One

d. Zero.

The correct answer is: One

Exact level of significance is obtained through:

Select one:
a. p- value
b. t- ratio
c. F- ratio
d. Chi-square

The correct answer is: p- value


Degree of freedom refers to:

Select one:
a. Number of observations minus number of constraints
b. Number of constraints minus number of relations
c. Number of observations plus number of constraints
d. Number of constraints minus number of observations

The correct answer is: Number of observations minus number of constraints

Type I error is:

Select one:
a. Failing to reject null hypothesis when it is false
b. Rejecting the null hypothesis when it is true

c. Autocorrelation
d. Rejecting the null hypothesis when it is false

The correct answer is: Rejecting the null hypothesis when it is true
Choose the correct one from the following

Select one:
a. r2 = ESS/TSS
b. r2 = TSS/RSS

c. r2 = TSS/ESS
d. r2 = RSS/TSS

The correct answer is: r2 = ESS/TSS

If the value of Durbin-Watson's d statistic = 2, the value of the coefficient of autocorrelation is:

Select one:
a. Two
b. Zero

c. More than Two


d. One

The correct answer is: Zero
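For reference, the reasoning uses the approximate relation d ≈ 2(1 - ρ̂), so d = 2 implies ρ̂ ≈ 0.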


Given regression co-efficient b= 2 and standard error of 0.5, the
value of t ratio is:

Select one:
a. 2
b. 3
c. 1
d. 4

The correct answer is: 4

Which of the following models is used to regress on dummy


dependent variable ?

Select one:
a. The logit model.
b. The LPM model.

c. All of these
d. The tobit model.

The correct answer is: All of these


Which of the following is correct concerning logit and probit
models?

Select one:
a. They use a different method of transforming the model so that the probabilities lie
between zero and one
b. The probit model is based on a cumulative logistic function

c. For the logit model, the marginal effect of a change in one of the explanatory variables
is simply the estimate of the parameter attached to that variable, whereas this is not the
case for the probit model
d. The logit model can result in too many observations falling at exactly zero or exactly
one

The correct answer is: They use a different method of transforming the model so that the
probabilities lie between zero and one

The Structural break in a data set is tested by :

Select one:
a. Chow test.
b. Von-Neumann ratio test.
c. Runs test.

d. Lagrange multiplier test.

The correct answer is: Chow test.


Variance Inflation Factor is used for...

Select one:
a. Detecting Heteroscedasticity

b. Solving Heteroscedasticity
c. Solving Multicollinearity
d. Detecting Multicollinearity

The correct answer is: Detecting Multicollinearity
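For reference, a minimal sketch of how VIFs can be computed in Python with statsmodels; the small data set and the variable names are purely illustrative assumptions.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative regressors; x2 is nearly collinear with x1
X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6],
                  "x2": [2, 4, 6, 8, 10, 13],
                  "x3": [5, 3, 6, 2, 7, 4]})
X = sm.add_constant(X)

vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)   # the VIF of the constant can be ignored; a VIF above 10 is the usual rule-of-thumb signal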

The name ‘Econometrics’ was coined by.......

Select one:
a. Alfred Marshall
b. Ragnar Frisch
c. J.M Keynes
d. Irving Fisher

The correct answer is: Ragnar Frisch

SELF ASSESSMENT QUIZ: UNIT III - PROBLEMS IN REGRESSION ANALYSIS II

When one or more of the regressors are linear combinations of the other regressors, it is called

Select one:
a. Heteroscedasticity.
b. Multicollinearity.
c. Serial correlation.

d. Autocorrelation.

The correct answer is: Multicollinearity.

Which of the following theorems is utilised to justify the normality assumption of the random variable in a regression model?

Select one:
a. Central limit theorem.
b. Gauss-Markov theorem.

c. Chebyshev's theorem.
d. Euler's theorem.

The correct answer is: Central limit theorem.


What would be the consequences for the OLS estimator if heteroscedasticity is present in a regression model but ignored?

Select one:
a. It will be inconsistent
b. It will be ignored
c. It will be inefficient

d. a),c), b) will be true.

The correct answer is: It will be inefficient

A sure way of removing multicollinearity from the model is to

Select one:
a. Work with panel data
b. Drop variables that cause multicollinearity in the first place
c. Transform the variables by first differencing them
d. Obtaining additional sample data

The correct answer is: Drop variables that cause multicollinearity in the first place

The Runs test used to detect autocorrelation is:

Select one:
a. Equivalent test.
b. Hypothesis test.
c. Non-parametric test.
d. Parametric test.

The correct answer is: Non-parametric test.


Plotting the residuals against time is termed as the :

Select one:
a. Stem and leaf plot.

b. Scatter plot.
c. Box plot.
d. Time sequence plot.

The correct answer is: Time sequence plot.

Rejecting a true hypothesis results in this type of error

Select one:
a. Structural error
b. Hypothesis error

c. Type II error
d. Type I error

The correct answer is: Type I error

Which one of the following is not an example of mis-specification of functional form?

Select one:
a. Modelling y as a function of x when in fact it scales as a function of the square of x
b. Modelling y as a function of x when in fact it scales as a function of 1/x
c. Excluding a relevant variable from a linear regression model

d. Using a linear specification when a double logarithmic model would be more appropriate

The correct answer is: Excluding a relevant variable from a linear regression model
In a regression model with multicollinearity being very high, the estimators

Select one:
a. Are unbiased

b. Are consistent
c. Standard errors are correctly estimated
d. All of these

The correct answer is: Are unbiased

Which of the following is NOT considered the assumption about


the pattern of heteroscedasticity

Select one:
a. The error variance is proportional to Xi2
b. The error variance is proportional to Yi
c. The error variance is proportional to the square of the mean value of Y
d. The error variance is proportional to Xi

The correct answer is: The error variance is proportional to Yi


If the mean and variance of a time series do not vary systematically over time, it is called

Select one:
a. Random.
b. Non-Random.

c. Stationary
d. Non- Stationary

The correct answer is: Stationary

What is the meaning of the term "heteroscedasticity"?

Select one:
a. errors have non-zero mean

b. variance of the dependent variable is not constant


c. variance of the errors is not constant
d. errors are not linearly independent of one another

The correct answer is: variance of the errors is not constant


In the regression function y=α + βx +c

Select one:
a. x is the regressor
b. none of these
c. y is the regressor

d. x is the regressand

The correct answer is: x is the regressor

As a rule of thumb, a variable is said to be highly collinear if the Variance Inflation Factor (VIF) is:

Select one:
a. Exceeds 10.
b. None of these
c. Less than 10.

d. Exactly 10.

The correct answer is: Exceeds 10.

The lowest significance level at which a null hypothesis can be


rejected is:

Select one:
a. p-value.
b. F value.

c. t value.
d. R square value.

The correct answer is: p-value.


The number of independent values assigned to a statistical
distribution is called

Select one:
a. Degrees of freedom
b. None of these
c. Trial and error
d. Goodness of fit

The correct answer is: Degrees of freedom

When a linear function is fitted to a non-linear data set, it will result in?

Select one:
a. Sampling error.
b. None of these
c. Specification error.
d. Measurement error.

The correct answer is: Specification error.

If multicollinearity is perfect in a regression model the standard


errors of the regression coefficients are

Select one:
a. Small negative values
b. Indeterminate

c. Infinite values
d. Determinate

The correct answer is: Infinite values


Which one of the following is NOT an example of mis-
specification of functional form?

Select one:
a. Modelling y as a function of x when in fact it scales as a function of 1/x
b. Excluding a relevant variable from a linear regression model

c. Using a linear specification when y scales as a function of the squares of x


d. Using a linear specification when a double-logarithmic model would be more appropriate

The correct answer is: Excluding a relevant variable from a linear regression model

Near multicollinearity occurs when

Select one:
a. Two or more explanatory variables are perfectly correlated with one another
b. Two or more explanatory variables are highly correlated with one another
c. explanatory variables are highly correlated with the dependent variable
d. explanatory variables are highly correlated with the error term

The correct answer is: Two or more explanatory variables are highly correlated with one
another


Multicollinearity is limited to

Select one:
a. All of these
b. Time series data
c. Pooled data
d. Cross-section data

The correct answer is: All of these

If the residuals from a regression estimated using a small sample


of data are not normally distributed, which one of the following
consequences may arise?

Select one:
a. The coefficient estimates will be biased and inconsistent
b. Test statistics concerning the parameter will not follow their assumed distributions

c. The coefficient estimates will be unbiased but inconsistent

d. The coefficient estimates will be biased but consistent

The correct answer is: Test statistics concerning the parameter will not follow their
assumed distributions
Assumption of 'No multicollinearity' means the correlation
between the regressand and regressor is

Select one:
a. Low
b. Any of the above

c. Zero
d. High

The correct answer is: Any of the above

Which of the following is not a formal method of detecting heteroscedasticity?

Select one:
a. Durbin's m test.
b. Spearman's rank correlation test.
c. Glejser test.
d. Park test.

The correct answer is: Durbin's m test.


F test in most cases will reject the hypothesis that the partial slope coefficients are simultaneously equal to zero. This happens when

Select one:
a. Multicollinearity is present
b. Multicollinearity is absent
c. Multicollinearity may be present OR may not be present
d. Depends on the F-value

The correct answer is: Multicollinearity is absent

The full form of CLR is

Select one:
a. Classical linear regression
b. Class line ratio

c. Classical linear relation


d. none of these

The correct answer is: Classical linear regression

Even if heteroscedasticity is suspected and detected, it is not easy


to correct the problem. This statement is

Select one:
a. Sometimes right
b. Wrong

c. Right
d. Depends on test statistics

The correct answer is: Right


Which of these is NOT a symptom of multicollinearity in a
regression model

Select one:
a. VIF of a variable is below 10
b. High R2 and all partial correlation among regressors
c. High R2 with few significant t ratios for coefficients

d. High pair-wise correlations among regressor

The correct answer is: VIF of a variable is below 10

Consider the following statements and choose the correct answer: i) Pooled data imply a combination of time series and cross sectional data. ii) Panel data is a special type of pooled data in which the same cross-section unit is surveyed over time.

Select one:
a. Both a and b are correct...
b. Only a is correct

c. Only b is correct
d. Both a and b are wrong

The correct answer is: Both a and b are correct...


Including relevant lagged values of the dependent variable on the right hand side of a regression equation could lead to which one of the following?

Select one:
a. Unbiased but inconsistent coefficient estimate
b. Biased and inconsistent coefficient estimate
c. Unbiased and consistent but inefficient coefficient estimate

d. Biased but consistent coefficient estimate

The correct answer is: Biased but consistent coefficient estimate

SELF ASSESSMENT QUIZ: UNIT V - PANEL DATA REGRESSION MODELS

Which of the following is correct concerning logit and probit


models?

Select one:
a. For the logit model, the marginal effect of a change in one of the explanatory variables
is simply the estimate of the parameter attached to that variable, whereas this is not the
case for the probit model
b. The logit model can result in too many observations falling at exactly zero or exactly
one
c. They use a different method of transforming the model so that the probabilities lie
between zero and one

d. The probit model is based on a cumulative logistic function

The correct answer is: They use a different method of transforming the model so that the
probabilities lie between zero and one
Which of the following statements are correct concerning the use of antithetic variates as part of a Monte Carlo experiment? i) Antithetic variates work by reducing the number of replications required to cover the whole probability space. ii) Antithetic variates involve employing a similar variable to that used in the simulation, but whose properties are known analytically. iii) Antithetic variates involve using the negative of each of the random draws and repeating the experiment using those values as the draws. iv) Antithetic variates involve taking one over each of the random draws and repeating the experiment using those values as the draws.

Select one:
a. (i), (ii), and (iv) only

b. (ii) and (iv) only


c. (i) and (iii) only
d. (i), (ii), (iii), and (iv)

The correct answer is: (i) and (iii) only


Which of the following statements is true about the level of
significance?

Select one:
a. In testing a hypothesis, we take the level of significance as 10% if it is not mentioned
earlier
b. In testing a hypothesis, we take the level of significance as 1% if it is not mentioned
earlier
c. In testing a hypothesis, we take the level of significance as 2% if it is not mentioned
earlier

d. In testing a hypothesis, we take the level of significance as 5% if it is not mentioned


earlier

The correct answer is: In testing a hypothesis, we take the level of significance as 5% if it is
not mentioned earlier

If the values of two variables move in the same direction,


__________

Select one:
a. The correlation is said to be linear

b. The correlation is said to be negative


c. The correlation is said to be non-linear
d. The correlation is said to be positive

The correct answer is: The correlation is said to be positive


Which of the following statements is true about the null
hypothesis

Select one:
a. Any wrong decision related to the null hypothesis results in one type of an error
b. Any wrong decision related to the null hypothesis results in four types of errors
c. Any wrong decision related to the null hypothesis results in three types of errors

d. Any wrong decision related to the null hypothesis results in two types of errors

The correct answer is: Any wrong decision related to the null hypothesis results in two
types of errors

Under which of the following situations would bootstrapping be preferred to pure simulation? i) If it is desired that the distributional properties of the data in the experiment are the same as those of some actual data. ii) If it is desired that the distributional properties of the data in the experiment are known exactly. iii) If the distributional properties of the actual data are unknown. iv) If the sample of actual data available is very small.

Select one:
a. (i) and (iii) only
b. (ii) and (iv) only
c. (i), (ii), and (iv) only
d. (i), (ii), (iii), and (iv)

The correct answer is: (i) and (iii) only


The coefficient of correlation

Select one:
a. . is the square root of the coefficient of determination
b. is the same as r-square
c. can never be negative
d. is the square of the coefficient of determination

The correct answer is: . is the square root of the coefficient of determination

Which of the following is true for the coefficient of correlation?

Select one:
a. The coefficient of correlation is not dependent on the change of scale
b. The coefficient of correlation is not dependent on the change of origin

c. None of these
d. The coefficient of correlation is not dependent on both the change of scale and
change of origin

The correct answer is: The coefficient of correlation is not dependent on both the change
of scale and change of origin

The original hypothesis is known as ______.

Select one:
a. Alternate hypothesis
b. Non-linear regression analysis
c. Multiple regression analysis
d. Null hypothesis

The correct answer is: Null hypothesis


Regression modeling is a statistical framework for developing a mathematical equation that describes how

Select one:
a. All of these are correct.
b. one explanatory and one or more response variables are related

c. several explanatory and several response variables response are related


d. . one response and one or more explanatory variables are related

The correct answer is: . one response and one or more explanatory variables are related


The independent variable is used to explain the dependent


variable in ________

Select one:
a. Non-linear regression analysis
b. Alternate hypothesis

c. Linear regression analysis


d. Multiple regression analysis

The correct answer is: Linear regression analysis

The coefficient of determination equals

Select one:
a. 0
b. 0.6471

c. 1
d. -0.6471

The correct answer is: 0.6471


Which of the following statements is true about the arithmetic
mean of two regression coefficients?

Select one:
a. It is greater than the correlation coefficient
b. It is equal to the correlation coefficient
c. It is less than the correlation coefficient
d. It is greater than or equal to the correlation coefficient

The correct answer is: It is greater than the correlation coefficient

Which of the following statements is true for correlation analysis?

Select one:
a. It is a multivariate analysis
b. It is a univariate analysis
c. It is a bivariate analysis

d. Both a and c

The correct answer is: It is a bivariate analysis

In regression analysis, the variable that is being predicted is the

Select one:
a. intervening variable
b. response, or dependent, variable

c. independent variable
d. is usually x

The correct answer is: response, or dependent, variable


In the case of an algebraic model for a straight line, if a value for
the x variable is specified, then

Select one:
a. . the computed value of y will always be the best estimate of the mean response

b. none of these alternatives is correct.


c. the computed response to the independent value will always give a minimal residual
d. the exact value of the response variable can be computed

The correct answer is: the exact value of the response variable can be computed

Suppose you use regression to predict the height of a woman's current boyfriend by using her own height as the explanatory variable. Height was measured in feet from a sample of 100 women undergraduates, and their boyfriends, at Dalhousie University. Now, suppose that the height of both the women and the men are converted to centimeters. The impact of this conversion on the slope is:

Select one:
a. the value of the coefficient of determination will change
b. the magnitude of the slope will change

c. the sign of the slope will change


d. the value of SSE will change

The correct answer is: the value of SSE will change


A dependent variable whose values are not observable outside a
certain range but where the corresponding values of the
independent variables are still available would be most accurately
described as what kind of variable?

Select one:
a. Multinomial variable
b. Discrete choice

c. Censored
d. Truncated

The correct answer is: Censored

What is the meaning of the testing of the hypothesis?

Select one:
a. It is a significant estimation of the problem
b. It is a method of making a significant statement
c. It is greater than the correlation coefficient
d. It is a rule for acceptance or rejection of the hypothesis of the research problem

The correct answer is: It is a rule for acceptance or rejection of the hypothesis of the
research problem
In a regression analysis if r2 = 1, then

Select one:
a. SSE must be negative
b. SSE must be equal to zero
c. SSE can be any positive value

d. SSE must also be equal to one

The correct answer is: SSE must be equal to zero


Suppose that you have carried out a regression analysis where the total variation in the response is 133452 and the correlation coefficient was 0.85. The residual sum of squares is:

Select one:
a. 20017.8
b. 96419.07
c. 113434.2
d. 37032.92

The correct answer is: 37032.92
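The arithmetic, for reference: with r = 0.85, r2 = 0.7225, so SSE = (1 - r2) × SST = 0.2775 × 133452 ≈ 37032.9.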

If the coefficient of determination is 0.81, the correlation coefficient

Select one:
a. could be either + 0.9 or - 0.9
b. must be negative
c. is 0.6561

d. must be positive

The correct answer is: could be either + 0.9 or - 0.9


Suppose the correlation coefficient between height (as measured in feet) versus weight (as measured in pounds) is 0.40. What is the correlation coefficient of height measured in inches versus weight measured in ounces? [12 inches = one foot; 16 ounces = one pound]

Select one:
a. cannot be determined from information given
b. 0.533

c. 0.4
d. 0.3

The correct answer is: 0.4

If there is a very strong correlation between two variables then


the correlation coefficient must be

Select one:
a. . much larger than 0, regardless of whether the correlation is negative or positive
b. much smaller than 0, if the correlation is negative
c. any value larger than 1
d. None of these alternatives is correct.

The correct answer is: much smaller than 0, if the correlation is negative
If the correlation coefficient is a positive value, then the slope of
the regression line

Select one:
a. must also be positive
b. can be either negative or positive
c. can not be zero

d. can be zero

The correct answer is: must also be positive

When the error terms have a constant variance, a plot of the residuals versus the independent variable x has a pattern that

Select one:
a. funnels in
b. fans out, but then funnels in
c. forms a horizontal band pattern

d. . fans out

The correct answer is: forms a horizontal band pattern

Which of the following statements is true about the regression


line?

Select one:
a. A regression line is also known as the prediction equation

b. All of these
c. A regression line is also known as the estimating equation
d. A regression line is also known as the line of the average relationship

The correct answer is: All of these


A residual plot:

Select one:
a. . displays explanatory variable versus residuals of the response variable
b. displays the explanatory variable on the x axis versus the response variable on the y
axis.
c. displays residuals of the explanatory variable versus residuals of the response variable

d. displays residuals of the explanatory variable versus the response variable.

The correct answer is: . displays explanatory variable versus residuals of the response
variable

If two variables, x and y, have a very strong linear relationship,


then

Select one:
a. there is evidence that x causes a change in y

b. there is evidence that y causes a change in x


c. None of these alternatives is correct.
d. there might not be any causal relationship between x and y

The correct answer is: there might not be any causal relationship between x and y
If the correlation coefficient is 0.8, the percentage of variation in the response variable explained by the variation in the explanatory variable is

Select one:
a. 0.8
b. 0.008

c. 0.64
d. 0.0064

The correct answer is: 0.64

SELF ASSESSMENT QUIZ: UNIT IV - REGRESSIONS WITH QUALITATIVE INDEPENDENT VARIABLES

You studied the impact of the dose of a new drug treatment for high blood pressure. You think that the drug might be more effective in people with very high blood pressure. Because you expect a bigger change in those patients who start the treatment with high blood pressure, you use regression to analyze the relationship between the initial blood pressure of a patient (x) and the change in blood pressure after treatment with the new drug (y). If you find a very strong positive association between these variables, then:

Select one:
a. there is evidence for an association of some kind between the patient's initial blood pressure and the impact of the new drug on the patient's blood pressure
b. there is evidence that the higher the patient's initial blood pressure, the smaller the impact of the new drug

c. there is evidence that the higher the patient's initial blood pressure, the bigger the impact of the new drug.
d. none of these are correct, this is a case of regression fallacy

The correct answer is: none of these are correct, this is a case of regression fallacy
Under which of the following situations would bootstrapping be preferred to pure simulation? i) If it is desired that the distributional properties of the data in the experiment are the same as those of some actual data. ii) If it is desired that the distributional properties of the data in the experiment are known exactly. iii) If the distributional properties of the actual data are unknown. iv) If the sample of actual data available is very small.

Select one:
a. (i), (ii), (iii), and (iv)
b. (ii) and (iv) only
c. (i), (ii), and (iv) only
d. (i) and (iii) only

The correct answer is: (i) and (iii) only

A residual plot:

Select one:
a. . displays explanatory variable versus residuals of the response variable
b. displays the explanatory variable on the x axis versus the response variable on the y
axis.
c. displays residuals of the explanatory variable versus residuals of the response variable

d. displays residuals of the explanatory variable versus the response variable.

The correct answer is: . displays explanatory variable versus residuals of the response
variable
A dependent variable whose values are not observable outside a
certain range but where the corresponding values of the
independent variables are still available would be most accurately
described as what kind of variable?

Select one:
a. Truncated
b. Censored
c. Discrete choice

d. Multinomial variable

The correct answer is: Censored

The coefficient of determination equals

Select one:
a. 0
b. -0.6471
c. 0.6471

d. 1

The correct answer is: 0.6471


If two variables, x and y, have a very strong linear relationship,
then

Select one:
a. there is evidence that y causes a change in x
b. None of these alternatives is correct.
c. there is evidence that x causes a change in y

d. there might not be any causal relationship between x and y

The correct answer is: there might not be any causal relationship between x and y

When the error terms have a constant variance, a plot of the residuals versus the independent variable x has a pattern that

Select one:
a. fans out, but then funnels in

b. funnels in
c. . fans out
d. forms a horizontal band pattern

The correct answer is: forms a horizontal band pattern

Regression modeling is a statistical framework for developing a mathematical equation that describes how

Select one:
a. one explanatory and one or more response variables are related
b. . one response and one or more explanatory variables are related
c. several explanatory and several response variables response are related

d. All of these are correct.

The correct answer is: . one response and one or more explanatory variables are related
If there is a very strong correlation between two variables then
the correlation coefficient must be

Select one:
a. . much larger than 0, regardless of whether the correlation is negative or positive
b. any value larger than 1

c. much smaller than 0, if the correlation is negative


d. None of these alternatives is correct.

The correct answer is: much smaller than 0, if the correlation is negative

SSE can never be

Select one:
a. equal to 1

b. smaller than SST


c. equal to zero
d. larger than SST

The correct answer is: larger than SST

In a regression analysis if SSE = 200 and SSR = 300, then the


coefficient of determination is

Select one:
a. 0.4
b. 1.5
c. 0.6667
d. 0.6

The correct answer is: 0.6
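For reference, here SSR denotes the regression (explained) sum of squares and SSE the error sum of squares, so R2 = SSR/SST = SSR/(SSR + SSE) = 300/(300 + 200) = 0.6.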


Which of the following statements is true about the type two
error?

Select one:
a. Type two error means to accept a correct hypothesis
b. Type two error means to reject an incorrect hypothesis
c. Type two error means to reject a correct hypothesis

d. Type two error means to accept an incorrect hypothesis

The correct answer is: Type two error means to accept an incorrect hypothesis

Which of the following are types of correlation?

Select one:
a. Positive and Negative
b. Linear and Nonlinear

c. All of these
d. Simple, Partial and Multiple

The correct answer is: All of these

The least squares estimate of b1 equals

Select one:
a. 1.991
b. -0.923
c. -1.991
d. 0.923

The correct answer is: 1.991


Which of the following statements is true for correlation analysis?

Select one:
a. It is a univariate analysis
b. It is a bivariate analysis

c. It is a multivariate analysis
d. Both a and c

The correct answer is: It is a bivariate analysis

Which of the following statements is true about the arithmetic


mean of two regression coefficients?

Select one:
a. It is greater than the correlation coefficient
b. It is less than the correlation coefficient

c. It is equal to the correlation coefficient


d. It is greater than or equal to the correlation coefficient

The correct answer is: It is greater than the correlation coefficient


Suppose that we wished to evaluate the factors that affected the
probability that an investor would choose an equity fund rather
than a bond fund or a cash investment. Which class of model
would be most appropriate?

Select one:
a. A tobit model

b. A multinomial logit
c. A logit model
d. An ordered logit model

The correct answer is: A multinomial logit
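
A minimal sketch of this kind of model, assuming a hypothetical data set in which the three-way fund choice is coded 0 = cash, 1 = bond, 2 = equity (the variable names and simulated data are illustrative, not from the quiz), using statsmodels:

# Multinomial logit sketch for an unordered three-way choice (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(25, 65, n)                  # assumed investor characteristics
wealth = rng.normal(100.0, 30.0, n)
choice = rng.integers(0, 3, n)                # 0 = cash, 1 = bond fund, 2 = equity fund

X = sm.add_constant(np.column_stack([age, wealth]))
result = sm.MNLogit(choice, X).fit(disp=False)   # multinomial logit estimation
print(result.summary())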

The coefficient of correlation

Select one:
a. is the same as r-square
b. is the square of the coefficient of determination
c. is the square root of the coefficient of determination
d. can never be negative

The correct answer is: is the square root of the coefficient of determination
You have carried out a regression analysis; but, after thinking
about the relationship between variables, you have decided you
must swap the explanatory and the response variables.
After refitting the regression model to the data you expect that:

Select one:
a. the value of the coefficient of determination will change
b. the value of the correlation coefficient will change
c. the value of SSE will change
d. the sign of the slope will change

The correct answer is: the value of SSE will change

In regression analysis, the variable that is being predicted is the

Select one:
a. is usually x
b. intervening variable
c. independent variable

d. response, or dependent, variable

The correct answer is: response, or dependent, variable


Which of the following techniques is an analysis of the
relationship between two variables to help provide the prediction
mechanism?

Select one:
a. Correlation
b. None of these

c. Standard error
d. Regression

The correct answer is: Regression

In regression, the equation that describes how the response
variable (y) is related to the explanatory variable (x) is:

Select one:
a. the regression model
b. the correlation model
c. None of these alternatives is correct.
d. used to compute the correlation coefficient

The correct answer is: the regression model


You have carried out a regression analysis; but, after thinking
about the relationship between variables, you have decided you
must swap the explanatory and the response variables.
After refitting the regression model to the data you expect that:

Select one:
a. the value of the correlation coefficient will change
b. the value of SSE will change
c. the value of the coefficient of determination will change
d. the sign of the slope will change

The correct answer is: the value of SSE will change

Suppose that we wished to evaluate the factors that affected the
probability that an investor would choose an equity fund rather
than a bond fund or a cash investment. Which class of model
would be most appropriate?

Select one:
a. A logit model
b. A multinomial logit

c. A tobit model
d. An ordered logit model

The correct answer is: A multinomial logit


The original hypothesis is known as ______.

Select one:
a. Multiple regression analysis
b. Non-linear regression analysis
c. Alternate hypothesis

d. Null hypothesis

The correct answer is: Null hypothesis

If the values of two variables move in the same direction, __________

Select one:
a. The correlation is said to be positive
b. The correlation is said to be non-linear
c. The correlation is said to be linear
d. The correlation is said to be negative

The correct answer is: The correlation is said to be positive

The independent variable is used to explain the dependent
variable in ________

Select one:
a. Non-linear regression analysis
b. Linear regression analysis

c. Multiple regression analysis


d. Alternate hypothesis

The correct answer is: Linear regression analysis


Suppose that you have carried out a regression analysis where
the total variance in the response is 133452 and the correlation
coefficient was 0.85. The residual sum of squares is:

Select one:
a. 96419.07
b. 37032.92

c. 20017.8
d. 113434.2

The correct answer is: 37032.92
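
The arithmetic behind this answer (a check, not quiz content): the residual sum of squares is SST multiplied by (1 − r²).

# Worked check: residual sum of squares from the total variation and r.
SST = 133452.0
r = 0.85
SSE = SST * (1 - r ** 2)      # 133452 * 0.2775
print(round(SSE, 2))          # about 37032.93, matching the option above up to rounding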

In a regression analysis if SSE = 200 and SSR = 300, then the
coefficient of determination is

Select one:
a. 0.6667
b. 1.5

c. 0.6
d. 0.4

The correct answer is: 0.6


Which of the following statements is true about the regression
line?

Select one:
a. A regression line is also known as the prediction equation
b. A regression line is also known as the estimating equation

c. All of these
d. A regression line is also known as the line of the average relationship

The correct answer is: All of these

If the coefficient of determination is 0.81, the correlation
coefficient

Select one:
a. must be negative

b. is 0.6561
c. must be positive
d. could be either + 0.9 or - 0.9

The correct answer is: could be either + 0.9 or - 0.9
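
A one-line check of this relationship (illustrative only): r is the square root of the coefficient of determination, and its sign is taken from the sign of the estimated slope.

# Worked check: recovering r from the coefficient of determination.
import math
r_squared = 0.81
r = math.sqrt(r_squared)      # 0.9
print(+r, -r)                 # +0.9 if the slope is positive, -0.9 if it is negative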

In the case of an algebraic model for a straight line, if a value for
the x variable is specified, then

Select one:
a. the exact value of the response variable can be computed
b. none of these alternatives is correct.
c. the computed value of y will always be the best estimate of the mean response
d. the computed response to the independent value will always give a minimal residual

The correct answer is: the exact value of the response variable can be computed
The coefficient of correlation

Select one:
a. can never be negative
b. is the same as r-square

c. is the square of the coefficient of determination


d. is the square root of the coefficient of determination

The correct answer is: is the square root of the coefficient of determination

Which of the following statements is true about the level of
significance?

Select one:
a. In testing a hypothesis, we take the level of significance as 1% if it is not mentioned
earlier
b. In testing a hypothesis, we take the level of significance as 10% if it is not mentioned
earlier
c. In testing a hypothesis, we take the level of significance as 2% if it is not mentioned
earlier
d. In testing a hypothesis, we take the level of significance as 5% if it is not mentioned
earlier

The correct answer is: In testing a hypothesis, we take the level of significance as 5% if it is
not mentioned earlier
In a regression analysis if r2 = 1, then

Select one:
a. SSE can be any positive value
b. SSE must be negative

c. SSE must also be equal to one


d. SSE must be equal to zero

The correct answer is: SSE must be equal to zero

Which of the following statements is true about the arithmetic
mean of two regression coefficients?

Select one:
a. It is equal to the correlation coefficient

b. It is less than the correlation coefficient


c. It is greater than the correlation coefficient
d. It is greater than or equal to the correlation coefficient

The correct answer is: It is greater than the correlation coefficient

Which of the following are types of correlation?

Select one:
a. Simple, Partial and Multiple
b. All of these

c. Linear and Nonlinear


d. Positive and Negative

The correct answer is: All of these


If the correlation coefficient is 0.8, the percentage of variation in
the response variable explained by the variation in the
explanatory variable is

Select one:
a. 0.008
b. 0.0064
c. 0.64

d. 0.8

The correct answer is: 0.64

Which of the following statements is true for correlation analysis?

Select one:
a. It is a bivariate analysis
b. Both a and c
c. It is a multivariate analysis

d. It is a univariate analysis

The correct answer is: It is a bivariate analysis

Larger values of r2 (R2) imply that the observations are more
closely grouped about the

Select one:
a. average value of the dependent variable
b. average value of the independent variables
c. origin
d. least squares line

The correct answer is: least squares line



Which of the following is correct concerning logit and probit
models?

Select one:
a. The probit model is based on a cumulative logistic function
b. For the logit model, the marginal effect of a change in one of the explanatory
variables is simply the estimate of the parameter attached to that variable, whereas this
is not the case for the probit model

c. They use a different method of transforming the model so that the probabilities lie
between zero and one
d. The logit model can result in too many observations falling at exactly zero or exactly
one

The correct answer is: They use a different method of transforming the model so that the
probabilities lie between zero and one
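
The point about the transformation can be seen by comparing the two link functions on the same linear index; the sketch below (illustrative values, using scipy) shows that both map x'beta into a probability between zero and one, but through different functions.

# Logit vs probit: two different transformations of the linear index into a probability.
import numpy as np
from scipy.stats import norm

index = np.linspace(-3, 3, 7)               # hypothetical values of x'beta
p_logit = 1.0 / (1.0 + np.exp(-index))      # cumulative logistic function (logit)
p_probit = norm.cdf(index)                  # cumulative standard normal (probit)

for z, pl, pp in zip(index, p_logit, p_probit):
    print(f"index={z:+.1f}  logit={pl:.3f}  probit={pp:.3f}")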

Which of the following statements is true about the null
hypothesis

Select one:
a. Any wrong decision related to the null hypothesis results in two types of errors

b. Any wrong decision related to the null hypothesis results in three types of errors
c. Any wrong decision related to the null hypothesis results in four types of errors
d. Any wrong decision related to the null hypothesis results in one type of an error

The correct answer is: Any wrong decision related to the null hypothesis results in two
types of errors
The correlation coefficient is used to determine:

Select one:
a. The strength of the relationship between the x and y variables
b. A specific value of the x-variable given a specific value of the y-variable
c. A specific value of the y-variable given a specific value of the x-variable
d. None of these

The correct answer is: The strength of the relationship between the x and y variables

Which of the following statements is true about the correlational
analysis between two sets of data

Select one:
a. The correlational analysis between two sets of data is known as multiple correlation

b. None of these
c. The correlational analysis between two sets of data is known as partial correlation
d. The correlational analysis between two sets of data is known as a simple correlation

The correct answer is: The correlational analysis between two sets of data is known as a
simple correlation
Which of the following statements will be true if the number of
replications used in a Monte Carlo study is small?
i) The statistic of interest may be estimated imprecisely
ii) The results may be affected by unrepresentative combinations of random draws
iii) The standard errors on the estimated quantities may be unacceptably large
iv) Variance reduction techniques can be used to reduce the standard errors

Select one:
a. (i), (ii), and (iv) only
b. (i) and (iii) only
c. (ii) and (iv) only
d. (i), (ii), (iii), and (iv)

The correct answer is: (i), (ii), (iii), and (iv)

Assume the same variables as in question 28 above; height is
measured in feet and weight is measured in pounds. Now,
suppose that the units of both variables are converted to metric
(meters and kilograms). The impact on the slope is:

Select one:
a. the sign of the slope will change
b. both a and b are correct
c. neither a nor b are correct

d. the magnitude of the slope will change

The correct answer is: the magnitude of the slope will change
The least squares estimate of b1 equals

Select one:
a. -0.923
b. 0.923

c. 1.991
d. -1.991

The correct answer is: 1.991

If the coefficient of determination is a positive value, then the
regression equation

Select one:
a. must have a negative slope
b. could have either a positive or a negative slope
c. must have a positive y intercept

d. must have a positive slope

The correct answer is: could have either a positive or a negative slope

What is the meaning of the testing of the hypothesis?

Select one:
a. It is greater than the correlation coefficient
b. It is a rule for acceptance or rejection of the hypothesis of the research problem
c. It is a method of making a significant statement
d. It is a significant estimation of the problem

The correct answer is: It is a rule for acceptance or rejection of the hypothesis of the
research problem
Which of the following statements are correct concerning the use
of antithetic variates as part of a Monte Carlo experiment?
i) Antithetic variates work by reducing the number of replications required to cover the
whole probability space
ii) Antithetic variates involve employing a similar variable to that used in the simulation,
but whose properties are known analytically
iii) Antithetic variates involve using the negative of each of the random draws and
repeating the experiment using those values as the draws
iv) Antithetic variates involve taking one over each of the random draws and repeating
the experiment using those values as the draws

Select one:
a. (i) and (iii) only
b. (i), (ii), and (iv) only
c. (ii) and (iv) only

d. (i), (ii), (iii), and (iv)

The correct answer is: (i) and (iii) only
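
A minimal Monte Carlo sketch of statement (iii) (the target quantity E[exp(Z)] for a standard normal Z is my own illustrative choice): each random draw u is reused with its sign flipped, and the two evaluations are averaged, which lowers the simulation variance because the pairs are negatively correlated.

# Antithetic variates sketch: pair each standard normal draw u with -u.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
u = rng.standard_normal(n)

plain = np.exp(rng.standard_normal(2 * n)).mean()      # 2n independent draws
antithetic = 0.5 * (np.exp(u) + np.exp(-u)).mean()     # n draws, each reused with sign flipped

true_value = np.exp(0.5)                               # E[exp(Z)] = e^0.5 for Z ~ N(0, 1)
print(plain, antithetic, true_value)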

If the correlation coefficient is a positive value, then the slope of
the regression line

Select one:
a. can not be zero
b. can be zero
c. can be either negative or positive

d. must also be positive

The correct answer is: must also be positive


Which of the following is true for the coefficient of correlation?

Select one:
a. None of these
b. The coefficient of correlation is not dependent on the change of origin

c. The coefficient of correlation is not dependent on both the change of scale and
change of origin
d. The coefficient of correlation is not dependent on the change of scale

The correct answer is: The coefficient of correlation is not dependent on both the change
of scale and change of origin

Suppose you use regression to predict the height of a woman’s
current boyfriend by using her own height as the explanatory
variable. Height was measured in feet from a sample of 100
women undergraduates, and their boyfriends, at Dalhousie
University. Now, suppose that the height of both the women and
the men are converted to centimeters. The impact of this
conversion on the slope is:

Select one:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. the value of the coefficient of determination will change
d. the value of SSE will change

The correct answer is: the value of SSE will change


In regression analysis, if the independent variable is measured in
kilograms, the dependent variable

Select one:
a. can be any units
b. must be in some unit of weight
c. cannot be in kilograms

d. must be in kilograms

The correct answer is: can be any units

A fitted least squares regression line

Select one:
a. may be used to predict a value of y if the corresponding x value is given
b. None of these alternatives is correct.
c. can only be computed if a strong linear relationship exists between x and y

d. is evidence for a cause-effect relationship between x and y

The correct answer is: may be used to predict a value of y if the corresponding x value is
given
In a regression and correlation analysis if r2 = 1, then

Select one:
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST

The correct answer is: SSR = SST

If the coefficient of determination is equal to 1, then the


correlation coefficient

Select one:
a. must be -1
b. can be either -1 or +1

c. can be any value between -1 to +1


d. must also be equal to 1

The correct answer is: can be either -1 or +1


Which of the following statements is true about the type two
error?

Select one:
a. Type two error means to reject a correct hypothesis
b. Type two error means to accept a correct hypothesis

c. Type two error means to accept an incorrect hypothesis


d. Type two error means to reject an incorrect hypothesis

The correct answer is: Type two error means to accept an incorrect hypothesis

Suppose the correlation coefficient between height (as measured
in feet) versus weight (as measured in pounds) is 0.40. What is the
correlation coefficient of height measured in inches versus weight
measured in ounces? [12 inches = one foot; 16 ounces = one pound]

Select one:
a. 0.4
b. 0.3
c. 0.533

d. cannot be determined from information given

The correct answer is: 0.4
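
A quick numerical illustration of this invariance (the simulated heights and weights below are hypothetical): multiplying one variable by 12 and the other by 16 leaves the correlation coefficient unchanged, since r does not depend on changes of scale or origin.

# Correlation is unchanged by a linear rescaling of either variable.
import numpy as np

rng = np.random.default_rng(7)
height_ft = rng.normal(5.5, 0.3, 200)                   # hypothetical heights in feet
weight_lb = 30.0 * height_ft + rng.normal(0, 15, 200)   # hypothetical weights in pounds

r_original = np.corrcoef(height_ft, weight_lb)[0, 1]
r_rescaled = np.corrcoef(12 * height_ft, 16 * weight_lb)[0, 1]   # inches vs ounces
print(round(r_original, 4), round(r_rescaled, 4))       # the two values are identical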


In regression analysis, the variable that is being predicted is the

Select one:
a. response, or dependent, variable
b. intervening variable

c. independent variable
d. is usually x

The correct answer is: response, or dependent, variable
