P & S Unit-1 Material
SYLLABUS: Descriptive Statistics and Methods for Data Science: Data science – Statistics
Introduction - Population Vs Sample – Collection of Data - Primary and Secondary Data –
Type of Variable: Dependent and Independent Categorical and Continuous Variables – Data
Visualization – Measures of Central Tendency – Measures of Variability (Spread or Variance) –
Skewness, Kurtosis.
1. DATA SCIENCE:
Answer: There have been many arguments about whether statistics is a science or an art. Science
is defined as a systematized body of knowledge. It focuses on cause-and-effect
relationships in terms of scientific principles or laws in order to make clear generalizations. In
simple words, science provides the knowledge that is helpful in finding a way, but it does not
show the direction to be chosen. In contrast to science, art refers to the ability of managing
facts in order to attain set goals. It provides the ways of handling and presenting
data, making judgments logically and obtaining effective and relevant results.
“Statistics is not a body of substantive knowledge but a body of methods for obtaining
knowledge”.
Science is knowledge whereas art is action. Thus, from this point of view, statistics can
also be considered an “art”. The basic function of statistics is to apply the given methods in
order to extract facts and results so as to arrive at effective conclusions.
2. STATISTICS INTRODUCTION
The methods used in the analysis of statistical data are called as statistical methods.
Characteristics of Statistics:
1. Aggregate of Facts: Statistics is an aggregate of information, i.e., complete facts and
figures that can be related and compared. Single figures cannot be termed statistics.
For example, if the heights of different students in a class are given, then such data can be
compared and conclusions can be drawn regarding the height of each student;
therefore, such data is considered statistics. Whereas if the height of a single
student is given, comparison becomes impossible and it is not considered
statistics. Likewise, single figures concerning the sales of a firm, the marks of a student, price,
demand, exports, etc., cannot be considered statistics.
2. Affected by Multiplicity of Causes: In statistics, the figures and facts are significantly
affected by various factors operating together.
3. Statistics are Expressed Numerically: Statistics are numerical statements of facts. In
other words, statistics are expressed in numbers or quantitative terms. For example,
the production level of XYZ manufacturing company increased from 100 tonnes in the
year 2000 to 150 tonnes in the year 2002. A qualitative statement, such as "the sales of XYZ
company are increasing year after year", cannot be regarded as statistics.
If an estimation is made that 10,000 people participated in the protest, then it
is an approximate figure; the actual number can be either more or less than 10,000. In
statistics, attaining perfect mathematical accuracy is a complicated task. Therefore,
estimation has to be carried out by applying reasonable standards of
accuracy.
2. Statistics in State: Statistics was also known as the 'Science of Statecraft' and was used
for collecting data for making fiscal and military policies. Nowadays, statistical
data on prices, production, income, consumption, expenditure and profits is used by the
government. It plays a vital role in increasing the welfare of the state.
3. Statistics in Mathematics: Statistics and mathematics are two interrelated
subjects; both are based on the concept of the 'theory of probability'. The
significant role of mathematics in statistics has evolved into a new branch of statistics
called 'Mathematical Statistics'. According to CONNOR, statistics "is a branch of
applied mathematics which specializes in data".
4. Statistics in Economics: The relation between statistics and economics was first
explained by William Petty in his book 'Political Arithmetic'. Statistical techniques
are used in solving economic problems such as production, consumption,
distribution of income, wealth, profits, etc. The relation of mathematics and statistics
with economics has evolved into a new branch called 'Econometrics'.
5. Statistics in Business and Management: Statistics has an extensive role in business
as well as management. According to Prof. Ya-Lun-Chou “Statistics is a method of
decision-making in the face of uncertainty on the basis of numerical data and
calculated risks”. Most managerial decisions and business forecasting
techniques are based on statistical information.
6. Statistics in Accountancy and Auditing: Statistics has various applications in
accountancy and auditing. An example of the application of statistics in accountancy is
the 'method of inflation accounting'. Various statistical techniques are used for cost
accounting and auditing purposes.
7. Statistics in Insurance: Statistics in insurance is applied on the basis of probability
theory. The scientific basis of life insurance was developed by Edmond Halley, whose
life tables (published in 1693) were used to solve the problems of life insurance.
The success of the insurance industry mainly depends on the application of statistical data
in life tables.
8. Statistics in Physical Sciences and Astronomy: Astronomy is a physical
science in which statistics has various applications. Kepler gave his three famous laws on the
movements of heavenly bodies with the help of statistical data collected by Tycho Brahe. On
the basis of Kepler's laws, Sir Isaac Newton gave his famous law of gravitation. Statistics
is widely used in the physical sciences such as astronomy, engineering, geology, physics,
etc.
9. Statistics in Social Science: According to Prof. A.L. Bowley, "Statistics is the science
of measurement of social organism regarded as a whole in all its manifestation."
Sampling techniques and estimation theory are the most useful tools of statistics
used in social science for conducting social surveys. The important applications of
statistics in sociology are the study of death rates, birth rates, population growth, etc.
10. Statistics in Biology and Medical Sciences: The study of 'correlation analysis' given
by Prof. Karl Pearson is purely based on statistics. Moreover, in medical sciences,
statistical data related to the causes and incidence of diseases are of great importance.
For example, statistical methods are used in the study of heart beats through the
electrocardiogram [E.C.G.]. Statistics also has important applications in psychology and
education, which has led to 'psychometry', and in war. Thus statistics plays a vital role in
various disciplines and has wide managerial applications.
Q4: Explain the factors responsible for the development of statistics in modern
times.
The development of statistics is attributable to two main factors:
1. Increased Demand for Statistics: In modern times, there has been tremendous
development in business, science, governmental activities, research, commerce, etc.
Statistics acts as an important tool in formulating appropriate policies in these fields.
In the business context, statistics is essential in solving problems of complexity
and growing needs. In the case of governmental activities, the demand for statistics has
also increased. Earlier, government was primarily involved in the maintenance of law
and order, but today government has a presence in almost all spheres. With the
growing number of governmental functions, the need for statistics arises. Similarly, the
advancement of science and extensive research have called for the assistance
of statistics; thus, the demand for statistics has greatly increased.
2. Decreasing Cost of Statistics: The development of electronic machines like
computers and calculators has reduced the cost and time required for data collection.
Due to this reason, statistics is increasingly used in solving problems. In addition,
the development of statistical theory has led to reduced cost of data
collection and processing. Also, a branch of statistics known as 'Design of
Experiments' has developed, which is helpful in collecting and analyzing data more
rapidly and economically. Even though many scholars attempted to contribute to the
science of statistics, Sir Ronald Fisher (1890-1962) must be credited for the
development and progress of statistics; his contribution has brought outstanding
development in the statistical sciences. Although statistical tools are widely used in
solving problems, at times they are not accurate. It can be concluded that
statistical methods are effective ways of drawing conclusions so as to arrive at
better results.
6. Assist in Testing and Formulating Theories: Statistics not only assists in testing and
formulating theories in various fields but also assists in measuring the impact of such
theories on those fields. For instance, in agricultural and biological science,
statistical techniques are used for ascertaining the role of growth and development
activities of a plant. Consumer surveys and market surveys, carried out
effectively, act as the basis for formulating specific and clear production policies.
7. Assist in Formulating Policies: Statistical methods are useful in formulating various
economic and business policies. Surveys conducted with respect to exports, imports,
production, wages and so on are useful in formulating policies and plans in the
respective fields.
1. Statistics are Not Qualitative: Statistics are numerical statements; they apply only to
areas that can be measured quantitatively. Hence a statement such as "the
number of students in the college has increased compared to last year" does not
form statistics.
2. Statistics Does Not Constitute Isolated Facts and Figures: In statistics, the facts and
figures are always aggregate in nature rather than single or isolated. For example, the price
of a single product or the marks of one student cannot be regarded as statistics, as such
figures can be neither related nor compared with each other. Aggregate figures such as
household income, expenditure, and the profits and sales of a firm over different years
constitute statistics.
3. Statistical Laws are Probabilistic in Nature: Statistical laws are not exact; they are
probabilistic in nature. The conclusions drawn by following statistical laws give
approximate results/values and not exact figures.
4. Statistics Can be Misused: Statistics can be misused if it is used by non-experts.
The figures can also be misused by politicians or unethical workers who manipulate the
facts for their own selfish intentions. Statistics does not prove or disprove
anything; it is only a tool, which can be very useful if utilized appropriately. If
it is misused by inexperienced or unethical statisticians, then it may lead to
false conclusions and may prove highly dangerous to the firm under consideration.
The two main branches of statistics are 1] Descriptive statistics and 2] Inferential statistics.
1. Descriptive Statistics: The statistical methods included in this branch are the collection,
presentation and characterization of data, so as to explain the different characteristics of
the data set.
The various methods of descriptive statistics include,
(i) Graphic Method: Bar charts, pie charts and line graphs.
(ii) Numeric measures: Dispersion, kurtosis, measures of central tendency and
skewness.
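As an illustration, the numeric measures listed above can be computed for a small data set using only the Python standard library. The marks below are hypothetical, and the skewness and kurtosis formulas used are the common moment-based ones; this is a sketch, not a full treatment of these measures.

```python
import statistics

# Hypothetical sample of exam marks (illustrative data only).
marks = [42, 47, 50, 50, 53, 58, 60, 65, 70, 85]

n = len(marks)
mean = statistics.mean(marks)          # measure of central tendency
sd = statistics.pstdev(marks)          # dispersion (population standard deviation)

# Moment-based skewness and excess kurtosis, computed from first principles.
m3 = sum((x - mean) ** 3 for x in marks) / n
m4 = sum((x - mean) ** 4 for x in marks) / n
skewness = m3 / sd ** 3                # > 0 here: the large value 85 pulls the tail right
kurtosis = m4 / sd ** 4 - 3            # 0 for a normal distribution

print(mean, sd, skewness, kurtosis)
```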
2. Inferential Statistics: Inferential statistics includes the statistical methods involving
the estimation of population characteristics or decision making regarding the
population based on sample results. The term population refers to a large group of
units about which inferences are to be drawn, whereas a sample is a fraction, portion or
subset of the population.
Inferential statistics is classified as
(i) Parametric Statistics: It is based on the assumption that the population from
which the sample is drawn is normally distributed. It can only be used when the
data collected is on a ratio scale or interval scale.
(ii) Non-parametric Statistics: It makes no explicit assumption about the normality of
the distribution in the population. It can be used when the data is collected on an
ordinal or nominal scale. When data is sought for a large number of elements such
as companies, households, customers, products, voters or individuals, there arises
the need for sampling to draw conclusions about the population. Hence the
data is collected from a small portion of the population (i.e., a sample) due to time, cost
and other concerns.
The concept of inferential statistics can be clearly understood from the following
definitions.
(a) Process: A process is nothing but a set of rules that collectively perform to
transform inputs into outputs. Example: Banking transaction.
(b) Population: It is a set of elements or observations associated with the phenomenon
under study for which a better comprehension and knowledge is required.
(c) Statistical Variable: It is a feature / characteristics of a population / process
defined operationally. It describes the quantity to be measured or observed.
(d) Sample: It is a set of few elements or observations of a process or population.
(e) Parameter: It is a descriptive measure related to a statistical variable that outlines
the features of entire population.
(f) Statistic: It is a numerical quantity that outlines the features of a sample drawn
from a population.
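The distinction between a parameter and a statistic can be sketched with a simulated population. The heights below are synthetic values generated for illustration, not real data.

```python
import random
import statistics

random.seed(7)

# Hypothetical population: heights (cm) of 1,000 students.
population = [random.gauss(165, 8) for _ in range(1000)]

# Parameter: a descriptive measure of the ENTIRE population.
mu = statistics.mean(population)

# Statistic: the same measure computed on a SAMPLE drawn from that population.
sample = random.sample(population, 50)
x_bar = statistics.mean(sample)

print(round(mu, 1), round(x_bar, 1))  # the statistic estimates the parameter
```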
3. POPULATION Vs SAMPLE
Q8. Define the term population and sample. What are the ways into which
samples are classified?
POPULATION: The term population refers to the totality of objects or observations under study. Population size is represented as 'N'. Example: 2] Budget of India.
SAMPLE: The term sample refers to a finite subset of the population. Sample size is
represented as ‘n’ denoting the number of objects or observation in the sample.
TYPES OF SAMPLES: There are two ways into which samples are classified.
(a) Large Sampling: A sample comprising 30 or more objects [i.e., n >= 30] is known as
a large sample.
(b) Small Sampling: A sample comprising fewer than 30 objects [i.e., n < 30] is known
as a small sample.
Answer: Methods of Sampling: The different methods of sampling are (i) Purposive sampling
(ii) Random sampling (iii) Simple Sampling (iv) Stratified Sampling.
1. Purposive Sampling: If the elements of the sample are selected with some definite
purpose, the sampling is said to be purposive.
Example: If there is a complaint about the defectiveness of components produced,
then only the defective components are included in the sample, while the others are not
considered. This is purposive sampling.
2. Random Sampling: If every element in the population has an equal chance of
being included in the sample, the sampling is said to be random.
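Random sampling can be sketched with Python's standard library; the lot of components below is hypothetical. `random.sample` gives every element an equal chance of selection without replacement.

```python
import random

random.seed(42)

# Hypothetical lot of 500 manufactured components, identified by serial number.
lot = list(range(1, 501))

# Random sampling: every component has an equal chance of being selected.
sample = random.sample(lot, 30)   # n >= 30, so this is also a "large sample"

print(sample[:5])
```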
Q10: What do you mean by data collection? What are the different types of
data?
Answer: Data Collection: The collection of data refers to the process of gathering all the
facts and figures related to a particular subject under investigation. These facts and
figures are often referred to as data. In other words, "meaningful facts that are expressed in a
quantitative form can be termed as data".
Data may be needed in any situation for making decisions. The success or failure of any
statistical investigation depends upon the accuracy and reliability of the available data.
TYPES OF DATA:
Data can be classified into two types based on the sources from where it is collected.
They are, 1] Primary Data 2] Secondary Data.
1. Primary Data: Primary data refers to the data collected specifically for the
purpose of the research problem. It is first-hand information collected by the research
firm or by an external agent with the objective of solving the research problem. There are
different methods of collecting primary information: researchers can conduct
experiments to gather the required information; other methods include questionnaires,
mails, and interviews of individuals, families, organizations, representatives, etc.
For example, data from a study of the working conditions of labourers in a big industry,
conducted by the investigator himself or by his agent, is known as primary data.
2. Secondary Data: Primary data for one party sometimes acts as the secondary data for
the other party. Secondary data refers to the existing data that have been collected
with an objective other than for research. It could be the data collected by the firm
itself for any other purpose, or by any external party for the same or other research
problem.
For example, inventory records maintained by a firm as a part of their routine operating
function acts as the internal secondary data of the firm for evaluating the seasonal
demand for its product. Alternatively, data gathered on industry demand by a
marketing research firm can also be used for the same research problem.
Q11: Explain the methods of collecting primary data with merits and demerits.
Answer: The following are the various methods of collecting primary data,
DEMERITS:
(i) The data collected depends upon the indirectly obtained information i.e., the
information is gathered from those persons that are not related with the facts
and so the results are likely to be inaccurate and unreliable.
(ii) The information received is not free from the prejudices and ignorance of the
informers.
MERITS:
(i) This method is applicable in situations where the field of enquiry is extensive
and places of enquiry are scattered.
(ii) This method is economical as it saves time, labor and money.
Schedules through Enumerators: This method is an improvement over the previous method.
Here, some enumerators are appointed who contact the persons concerned and fill in the
schedules [i.e., blank forms with questions printed and space for noting the replies] after
making enquiries from them. The enumerators are allotted different areas, so that each
person is contacted by only one enumerator.
This method is adopted by the government especially during census operations.
MERITS:
(i) By this method, information can be obtained from uneducated persons too.
(ii) Homogeneous information is obtained because the schedules are fully explained
by each enumerator.
(iii) This method can be applied to large areas of investigation.
(iv) Replies are obtained from all the persons concerned, since the enumerators
contact them personally.
(v) The personal prejudices of individual enumerators do not matter much because
they are available in large numbers.
DEMERITS:
(i) If the enumerators are inefficient or careless, the results obtained from the
information collected by them are wrong and unreliable.
(ii) This is the most uneconomical method as it requires more time, labor and
money; hence, it is generally adopted only by government organizations.
1. Gathering data from a primary source requires a large number of planning and
execution processes, which take a lot of time.
2. The correctness of primary data relies on factors like the honesty, integrity and
sincerity of the investigation carried out by the investigator and the amount of
response from the persons concerned.
3. Gathering of primary data must be done by efficient, skillful, intelligent, tactful, sincere
and trained investigators, which often leads to complexities such as problems of time and money.
4. There is always a possibility of improper information being provided because of a lack of
integrity among investigators and the persons concerned.
Answer: Editing of Primary Data: Immediately after gathering the data from primary or
secondary sources, the data must be edited. This editing process involves the identification of
errors and mistakes present in the collected data. The degree of accuracy and the extent of
acceptable error are decided early to avoid any confusion in later stages. However, editing of
primary data is an extensive process.
The primary data can be edited on the basis of the following factors.
1. Completeness: The editor should check whether an answer to each and every
question in the questionnaire is furnished or not. If any question is found unanswered,
the respondent must be contacted to obtain the answer to that question.
If the editor fails to contact the respondent, then 'No report' must be
marked under that question. Moreover, if the question remains unanswered and is
of vital importance, the editor should discard the questionnaire itself.
2. Consistency: The editor should check whether the answers to the questions in the
questionnaire are contradictory or not. If any mutually contradictory answers are found,
the editor must take the necessary action to obtain the correct answers. This
problem can be solved either by referring back to the questionnaire or by contacting
the respondent.
3. Uniformity: The editor should check whether all the respondents have answered the
questions in the same sense or not, because sometimes a single question can be
interpreted in different ways by different respondents.
For example, consider a question of salary. Different respondents may take this
question in different sense. That is, some may answer this by writing yearly salary
while some may write monthly salary.
Therefore, if a question is found to be answered by different respondents in different
senses, then the answers to that question must be reduced to some common base.
4. Accuracy: The editor should check whether the questionnaire that has been received
provides correct information or not. If the information received is found to be
incorrect, it may lead to misleading conclusions of the investigation. Therefore, to
obtain accuracy, appropriate actions must be taken to avoid wrong information. Editing
data for accuracy is one of the most complex tasks, but at the same time it is
necessary in order to obtain a reliable conclusion of an investigation.
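The completeness and uniformity checks described above can be sketched in plain Python. The questionnaire fields and figures below are entirely hypothetical.

```python
# Hypothetical questionnaire responses; None marks an unanswered question.
responses = [
    {"id": 1, "age": 34, "salary": 480000, "salary_period": "yearly"},
    {"id": 2, "age": 29, "salary": 35000,  "salary_period": "monthly"},
    {"id": 3, "age": None, "salary": 520000, "salary_period": "yearly"},
]

# Completeness: flag respondents who left any question unanswered.
incomplete = [r["id"] for r in responses if any(v is None for v in r.values())]

# Uniformity: reduce salary answers to a common (yearly) base.
for r in responses:
    if r["salary_period"] == "monthly":
        r["salary"] *= 12
        r["salary_period"] = "yearly"

print(incomplete)
```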
Answer: Sources of Secondary Data: The major sources of secondary data are of two types:
[1] Published sources [2] Unpublished sources.
Answer: Editing of Secondary Data: The editing process carried out on secondary data is
much simpler than the editing of primary data. However, secondary data must be used
with utmost care. In other words, secondary data must be used only if it is reliable, accurate,
adequate and compatible with the problem under investigation. All the inconsistencies, errors
and omissions present in the secondary data must be eliminated before using it. The process
of examining secondary data for any inconsistencies, errors, omissions, etc., is also
referred to as 'scrutinization of secondary data'. Thus it can be said that it is never safe to use
secondary data without proper scrutinization.
The following are the factors that must be considered while using the secondary data.
Example:
Consider the consumption of petrol and oil in Andhra Pradesh: this data will be
inadequate if it is required to measure the consumption of oil and petrol for the entire country,
because consumption may fluctuate across different states. Similarly, if consumption data
exists only for the entire country, it is difficult to identify the consumption in each
individual state.
Therefore, for obtaining the accurate and reliable data, the secondary data must be
subjected to a thorough scrutiny and complete editing process before it is used.
Answer: Variable: Statistical data is gathered from observations of several
individuals, and each of the observations collected as data is distinct.
A variable corresponds to a quantity or quality that varies from one individual to
another in the same population. The observations collected for a variable are called variates.
Since any change in the value of such a variable is unpredictable, these variables are
random variables.
(a) Quantitative Variable: A variable whose value can be measured numerically.
(i) Continuous Variables: Variables that can assume any value, including fractional
values, within a given range are continuous variables. Example: Height, weight, age,
etc., that are measured as fractional values.
(ii) Discontinuous (Discrete) Variables: These variables have fixed numerical values without
any intermediate values in between them. Example: Blood pressure, blood
sugar, number of children in a family, pulse rate, etc.
(b) Qualitative [or] Categorical Variable: A variable whose value cannot be measured
numerically but can be used to categorize individuals based on some quality is called a
qualitative variable.
Example: The marital status, qualification, sex, etc., of an individual, or the color of a
flower.
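A toy sketch of how these variable types might be distinguished programmatically. The classification rule and the example record are assumptions made for illustration, not a general-purpose test.

```python
# Hypothetical record: each value illustrates one type of variable.
student = {
    "height_cm": 162.4,          # quantitative, continuous (fractional values allowed)
    "num_siblings": 2,           # quantitative, discrete (fixed whole-number values)
    "marital_status": "single",  # qualitative / categorical
}

def variable_kind(value):
    """Classify a value as continuous, discrete, or categorical (illustrative rule)."""
    if isinstance(value, float):
        return "continuous"
    if isinstance(value, int):
        return "discrete"
    return "categorical"

kinds = {name: variable_kind(v) for name, v in student.items()}
print(kinds)
```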
6] DATA VISUALIZATION
Q17: Explain in detail about data visualization and also list its major
advantages.
Answer: Data Visualization: Data visualization is the process of converting numeric data into
meaningful images that are easily interpreted by humans. In other words, it can be
defined as a graphical representation of information that provides the viewer with a qualitative
interpretation of that information. It is the study of visual representation, wherein information is
abstracted into some schematic format. Earlier, the field of data visualization was related to
information graphics and statistical graphics, but now it has become important in the areas of
research, teaching and development.
Basically, the human brain can process visual information consisting of different physical
objects more quickly than textual information. The primary function of a data
visualization tool is to help users examine complex data sets. This analysis is done by
mapping physical characteristics such as transparency, curvature, speed, color and
lighting effects to the data.
Advantages of Data Visualization:
1. Data visualization helps in identifying hidden patterns present within the data.
2. It helps in developing polling applications that enable humans to focus only on the
important issues.
3. It helps the organization to gain a competitive advantage by identifying the important
trends in corporate and market data.
4. It plays an important role in big data and advanced analytics projects.
5. It has become an effective standard for modern business intelligence.
6. Data visualization tools have been important in democratizing data and analytics and
making data-driven insights available to workers throughout an organization.
7. It is typically easier to operate than traditional statistical analysis software or earlier
versions of business intelligence software.
8. It assists in including interactive capabilities, enabling users to drill into the data for
querying and analysis.
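As a minimal sketch of the underlying idea, numeric data can be turned into a crude visual form even without a graphics library. The sales figures below are hypothetical, and a real application would use a charting tool rather than text output.

```python
# Hypothetical yearly sales figures (in lakhs) for a firm.
sales = {"2019": 40, "2020": 25, "2021": 55, "2022": 70}

def bar_chart(data, width=20):
    """A minimal text 'bar chart': each value becomes a row of proportional marks."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / peak * width)
        lines.append(f"{label} | {bar} {value}")
    return "\n".join(lines)

print(bar_chart(sales))
```

Even this crude picture makes the 2020 dip and the 2022 peak visible at a glance, which is the point of visualization: patterns surface faster from shapes than from raw numbers.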
Q19: What do you mean by measures of central tendency? List out the
characteristics of a good measure of central tendency.
Answer: Measures of Central Tendency: Measures of central tendency are also known as
averages. An average is a value that lies between the smallest and the largest observations
and represents the centre of the given data. The five important measures of central tendency are the
arithmetic mean, median, mode, geometric mean and harmonic mean. All these averages can
be calculated for individual, discrete and continuous series.
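All five averages named above can be computed with Python's standard `statistics` module; the data set below is hypothetical.

```python
import statistics

# Hypothetical data set of daily sales.
data = [4, 8, 8, 15, 16, 23]

amean = statistics.mean(data)             # arithmetic mean
med = statistics.median(data)             # middle value of the ordered data
mode = statistics.mode(data)              # most frequent value
gmean = statistics.geometric_mean(data)   # nth root of the product
hmean = statistics.harmonic_mean(data)    # reciprocal of the mean of reciprocals

print(amean, med, mode, round(gmean, 2), round(hmean, 2))
```

Note that for any data set of unequal positive values, the harmonic mean is less than the geometric mean, which is less than the arithmetic mean.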
Answer: Arithmetic Mean: The arithmetic mean is the most popular measure of central
tendency and is used only in the case of quantitative data. It is simply referred to as the
'mean' and is defined as the sum of all observations divided by the total number of
observations. The mean of the observations x1, x2, ..., xN is generally denoted as X̄.
The computation of the mean is performed in different ways for different types of data,
based on the way the data is distributed. Basically, there are two forms in which data can be
presented: ungrouped (or individual series) data and grouped data.
Merits: The arithmetic mean is the most widely used measure of central tendency due to
the following merits:
1. It is simple to understand.
2. It is simple to compute.
3. Its definition is rigid (i.e., unique).
4. It is comparatively less affected by fluctuations in sampling.
5. It considers all the observations.
Answer: Mean for Ungrouped Data: Ungrouped data is raw data that has not undergone any
statistical treatment.
(i) Direct Method (or) Actual Mean Method: In this method, the mean is obtained simply
by adding the values of all observations in the given series and dividing by the total
number of observations. It is mathematically denoted as,
X̄ = Σxi / N, where
Σxi = x1 + x2 + ... + xN and
N = number of observations.
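The direct method can be sketched in a few lines of Python; the observations below are hypothetical.

```python
# Direct method for ungrouped data: mean = (x1 + x2 + ... + xN) / N.
x = [10, 20, 30, 40, 50]   # hypothetical observations
N = len(x)
mean = sum(x) / N
print(mean)                # 30.0
```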
(ii) Short-cut Method (or) Indirect Method (or) Assumed Mean Method: For data
containing a large number of observations or large figures, the calculation of the mean by
the direct method becomes complex. This complexity can be reduced by using the short-cut
method. In this method, an arbitrary value from the given data set is assumed
as the mean, and deviations from it are taken for each observation. All the
deviations are then added and divided by the total number of observations, and the
resultant value is added to the assumed mean to obtain the actual mean. It is
mathematically denoted as,
X̄ = A + Σdi / N, where
A = assumed mean,
Σdi = d1 + d2 + ... + dN,
di = xi − A,
N = number of observations.
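A sketch of the short-cut method in Python, using the same hypothetical observations; any convenient value from the data may serve as the assumed mean A.

```python
# Short-cut method: choose an assumed mean A, then mean = A + (sum of deviations) / N.
x = [10, 20, 30, 40, 50]    # hypothetical observations
A = 30                      # assumed mean, taken from the data
d = [xi - A for xi in x]    # deviations from the assumed mean
mean = A + sum(d) / len(x)
print(mean)                 # 30.0, agreeing with the direct method
```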
(iii) Step Deviation Method: The calculation of the mean can be further simplified by
using the step deviation method. In this method, all the deviations obtained from the
assumed mean are divided by a common factor (say 'C') of all the given observations.
The formula to calculate the mean by step deviation is,
X̄ = A + (Σd'i / N) × C, where
A = assumed mean,
Σd'i = d'1 + d'2 + ... + d'N,
d'i = (xi − A) / C,
C = common factor of the observations,
N = number of observations.
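The step deviation method can be sketched the same way; the assumed mean A and common factor C below are chosen for illustration.

```python
# Step-deviation method: divide deviations by a common factor C before averaging.
x = [10, 20, 30, 40, 50]             # hypothetical observations
A, C = 30, 10                        # assumed mean and common factor
d_prime = [(xi - A) / C for xi in x] # scaled deviations: -2, -1, 0, 1, 2
mean = A + (sum(d_prime) / len(x)) * C
print(mean)                          # 30.0, same result as the other methods
```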
Answer: Mean for Grouped Data: Grouped data is data organized in a tabular format
containing different groups. There are two forms of grouped data: discrete series and
continuous series.
I. Mean for Discrete Series: A discrete series consists of the values of the variable
together with their frequencies, without class intervals.
(i) Direct Method:
X̄ = Σfixi / Σfi, where
Σfixi = f1x1 + f2x2 + ... + fNxN and
Σfi = f1 + f2 + ... + fN.
(ii) Short-cut Method:
X̄ = A + Σfidi / Σfi, where A = assumed mean taken from the observations and di = xi − A,
Σfidi = f1d1 + f2d2 + ... + fNdN,
Σfi = f1 + f2 + ... + fN.
(iii) Step Deviation Method:
X̄ = A + (Σfid'i / Σfi) × C, where A = assumed mean,
Σfid'i = f1d'1 + f2d'2 + ... + fNd'N,
d'i = (xi − A) / C; C = common factor of the observations; Σfi = total frequency.
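The discrete-series mean (direct method) can be sketched in Python; the values and frequencies below are hypothetical.

```python
# Mean of a discrete series: weight each value by its frequency.
x = [5, 10, 15, 20]    # hypothetical variable values
f = [2, 5, 8, 5]       # corresponding frequencies

total_f = sum(f)                                         # Σfi = 20
mean = sum(fi * xi for fi, xi in zip(f, x)) / total_f    # Σfixi / Σfi
print(mean)                                              # 14.0
```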
II. Mean for Continuous Series: In a continuous series, the variable X is given in terms of
class intervals. Therefore, the mid-point (m) of each class is considered, and
using these mid-points and the given frequencies the mean is calculated in the same
way as for a discrete series.
(i) Direct Method:
X̄ = Σfimi / Σfi, where mi = (li + ui) / 2 [mid-point of each class interval],
Σfimi = f1m1 + f2m2 + ... + fNmN,
Σfi = f1 + f2 + ... + fN.
(ii) Short-cut Method:
X̄ = A + Σfidi / Σfi, where A = assumed mean taken from the mid values, di = mi − A and
Σfi = total frequency.
(iii) Step Deviation Method:
X̄ = A + (Σfid'i / Σfi) × C.I., where A = assumed mean taken from the mid values,
d'i = (mi − A) / C.I., and C.I. = width of the class interval.
For all such class intervals, the arithmetic mean is calculated in a similar manner.
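The continuous-series calculation (direct method) can be sketched with hypothetical class intervals; each class is replaced by its mid-point before the frequency-weighted mean is taken.

```python
# Mean of a continuous series: replace each class by its mid-point m = (l + u) / 2.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]  # hypothetical class intervals
f = [3, 7, 6, 4]                                   # corresponding frequencies

mids = [(l + u) / 2 for l, u in classes]           # mid-points: 5, 15, 25, 35
mean = sum(fi * mi for fi, mi in zip(f, mids)) / sum(f)
print(mean)                                        # 20.5
```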
1. The sum of the deviations of a series of values taken from their arithmetic mean is
zero. That is, if x1, x2, ..., xN is a set of values with frequencies f1, f2, ..., fN, then
Σ (i = 1 to N) fi(xi − x̄) = 0.
2. The sum of the squared deviations of a set of values is minimum when the deviations
are taken from the arithmetic mean.
3. If n1 and n2 are the sizes of two data sets with means x̄1 and x̄2 respectively,
then the combined mean is defined as
x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2).
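A quick numerical check of properties 1 and 3 above; the data sets are hypothetical.

```python
# Numerical check of the stated properties of the arithmetic mean.
x = [2, 4, 4, 6, 9]                     # hypothetical observations
mean = sum(x) / len(x)

# Property 1: deviations from the mean sum to zero.
dev_sum = sum(xi - mean for xi in x)

# Property 3: combined mean of two data sets equals the mean of the pooled data.
a, b = [2, 4, 6], [8, 10]
x1, x2 = sum(a) / len(a), sum(b) / len(b)
combined = (len(a) * x1 + len(b) * x2) / (len(a) + len(b))

print(dev_sum, combined)
```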