Brainheaters™ LLC
Brainheaters Notes
Applied Data Science (ADS)
COMPUTER Semester-8
MODULE-1

Q1. Write a short note on Data Science. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science is a multidisciplinary field that involves the use of statistical, computational, and machine learning techniques to extract insights and knowledge from complex data sets.
● It encompasses a range of topics including data analysis, data mining, machine learning, and artificial intelligence.
● Data Science involves a structured approach to analyzing and interpreting large and complex data sets.
● This includes identifying patterns, making predictions, and developing models that can be used to gain insights and drive business decisions.
● Data Scientists use a variety of tools and techniques, such as statistical programming languages like R and Python, to work with data sets and extract meaningful information.
● The applications of Data Science are widespread and can be found in various industries, such as healthcare, finance, marketing, and transportation. It plays a crucial role in providing businesses with valuable insights into customer behavior, market trends, and operational efficiencies.
● To become a Data Scientist, one typically needs to have a strong background in mathematics, statistics, and computer science.
● Many universities now offer Data Science programs, which provide students with the necessary skills to work in this field.
● The demand for Data Scientists is growing rapidly, and it is expected to continue to increase in the future.

Q2. Explain in detail the Data Science Process. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Data Science process is a structured and iterative approach to solving complex problems using data.
● It involves a set of steps that help to extract insights and knowledge from data to inform business decisions. The following are the steps involved in the Data Science process:
1. Problem Formulation: In this step, the problem is identified and the business objective is defined. The goal is to determine what questions need to be answered using data and to ensure that they align with the business goals.
2. Data Collection: The next step is to collect relevant data. This can involve various sources, such as internal databases, public data repositories, or web scraping. The data must be accurate, complete, and representative of the problem.
3. Data Cleaning and Preparation: In this step, the collected data is cleaned, pre-processed, and transformed to ensure that it is consistent and ready for analysis. This involves tasks such as removing duplicates, handling missing values, and encoding categorical variables.
4. Data Exploration: The goal of this step is to understand the data and gain insights. This involves the use of descriptive statistics, data visualization, and exploratory analysis techniques to identify patterns, correlations, and trends.
5. Feature Engineering: This step involves the creation of new features or variables from the existing data. This is done to improve the performance of the predictive model or to gain further insights into the problem.
6. Model Selection and Training: In this step, a suitable model is selected to solve the problem. This involves a trade-off between accuracy, interpretability, and complexity. Once the model is selected, it is trained on the data to learn the underlying patterns and relationships.
7. Model Evaluation: The performance of the model is evaluated using metrics to determine whether the model is performing well or whether it needs to be improved.
8. Model Deployment: Once the model has been evaluated and deemed acceptable, it can be deployed into production. This involves integrating the model into a production environment or application.
9. Model Monitoring and Maintenance: The final step involves monitoring the performance of the model in the real-world environment. This involves tracking the model's performance, detecting any drift, and retraining the model as necessary.
● In summary, the Data Science process is a systematic and iterative approach that involves problem formulation, data collection, data cleaning and preparation, data exploration, feature engineering, model selection and training, model evaluation, model deployment, and model monitoring and maintenance.
● The goal is to extract insights and knowledge from data to inform business decisions.
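As a rough illustration of steps 2-7, here is a minimal Python sketch using pandas and scikit-learn. The file name customers.csv and the columns age, income, and churn are hypothetical placeholders used only for this sketch.

# Minimal sketch of the core Data Science process steps (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 2-3: data collection and cleaning (assumed CSV with age, income, churn columns)
df = pd.read_csv("customers.csv")
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())   # handle missing values

# Step 4: quick exploration
print(df.describe())

# Step 6: model selection and training
X, y = df[["age", "income"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 7: model evaluation
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))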
Q3. Describe Motivation to use Data Science Techniques: Volume, Dimensions and Complexity. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science techniques are used to solve problems involving large, complex, and high-dimensional data sets that cannot be analyzed using traditional methods.
● The motivation to use Data Science techniques is driven by three main factors: volume, dimensions, and complexity.
1. Volume: The amount of data being generated is growing at an unprecedented rate. Large volumes of data are being generated from various sources such as social media, IoT devices, and sensors. Data Science techniques are used to manage and analyze these large volumes of data efficiently.
2. Dimensions: The number of dimensions (features) in a data set can be very high. For example, in genetics research, the number of genes being analyzed can be in the millions. Data Science techniques such as dimensionality reduction, feature selection, and feature extraction are used to reduce the number of dimensions in the data and to identify the most relevant features.
3. Complexity: Data sets can be complex, containing non-linear relationships between variables, missing values, and noisy data. Data Science techniques such as machine learning are used to model and analyze such complex data.
3. Model Evaluation: Model evaluation involves assessing the quality of the model's predictions or decisions. This includes measuring the model's accuracy, precision, recall, and other performance metrics.
4. Model Deployment: Model deployment involves integrating the model into the decision-making process. This includes implementing the model in a software application, developing an API for the model, or integrating the model into existing systems.

Data Science vs. Data Analytics (Parameter: Definition)
● Data Science: A field that encompasses a wide range of techniques, tools, and methodologies for working with data.
● Data Analytics: A subset of data science that focuses on using statistical and computational methods to explore, analyze, and extract insights from data.

Understanding the type of data is important because it can influence the choice of statistical and computational techniques used for analysis.

Q2. Discuss in detail Properties of data. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In Data Science, data is often described in terms of its properties, which are characteristics that define the data and influence how it can be analyzed and processed. Here are some of the key properties of data:
1. Scale: Scale refers to the range and distribution of values in the data. Data can have a small or large scale, depending on the range of values that it encompasses. For example, a dataset containing the ages of people in a population might have a scale of 0 to 100 years.
2. Resolution: Resolution refers to the level of detail or granularity in the data. Data can be high resolution, with a fine level of detail, or low resolution, with a coarser level of detail. For example, satellite imagery can have a high resolution, allowing for the identification of small details on the ground, while weather data might have a lower resolution, providing broader information about a region.
3. Accuracy: Accuracy refers to the degree to which the data represents the true or intended values. Accurate data is essential for making informed decisions and drawing accurate conclusions. For example, a dataset containing inaccurate or incomplete customer information could lead to incorrect conclusions about customer behavior.
4. Completeness: Completeness refers to the extent to which the data represents the full set of values or observations that are needed. Incomplete data can result in gaps or biases in the analysis, and can limit the ability to draw accurate conclusions. For example, a dataset that contains only a subset of the population may not accurately represent the true population.
5. Consistency: Consistency refers to the degree to which the data is uniform and follows a consistent format or structure. Inconsistent data can make it difficult to analyze and compare data, and can lead to errors in analysis. For example, a dataset that contains inconsistent date formats could make it difficult to accurately analyze time-series data.
6. Relevance: Relevance refers to the extent to which the data is useful for the intended analysis or application. Relevant data is essential for making informed decisions and drawing accurate conclusions. For example, a dataset containing irrelevant variables or data points could lead to incorrect conclusions about the relationships between variables.
7. Timeliness: Timeliness refers to the degree to which the data is current and relevant to the intended analysis or application. Timely data is essential for making informed decisions and drawing accurate conclusions. For example, stock price data that is delayed or outdated could lead to incorrect conclusions about market trends.
Q3. Explain in detail Descriptive Statistics. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Descriptive statistics refers to the process of analyzing and summarizing data using various statistical methods.
● The purpose of descriptive statistics is to provide an overview of the data and to help identify patterns, trends, and relationships that may be present.
● Here are some of the key methods used in descriptive statistics:
1. Measures of Central Tendency: Measures of central tendency are statistics that represent the "center" of the data, or the typical or average value. The three most common measures of central tendency are the mean (average), median (middle value), and mode (most common value).
2. Measures of Variability: Measures of variability are statistics that describe how spread out or varied the data is. The most common measures of variability are the range (difference between the highest and lowest values), variance (average squared deviation from the mean), and standard deviation (square root of the variance).
3. Frequency Distributions: Frequency distributions show how often each value or range of values occurs in the data. Frequency distributions can be displayed using histograms, bar charts, or pie charts.
4. Correlation Analysis: Correlation analysis is used to identify the relationship between two variables. Correlation coefficients range from -1 to +1, with a value of 0 indicating no correlation and a value of +1 indicating a perfect positive correlation.
5. Regression Analysis: Regression analysis is used to model the relationship between two or more variables. Simple linear regression models the relationship between two variables, while multiple regression models the relationship between multiple variables.
6. Percentiles and Quartiles: Percentiles and quartiles are used to divide the data into equal parts based on their rank or position. The median represents the 50th percentile, while the quartiles divide the data into quarters (25th, 50th, and 75th percentiles).
● Descriptive statistics is an important tool for analyzing and summarizing data in a meaningful way. It is often used to provide a baseline understanding of the data before more complex analyses are performed.
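A small Python sketch showing how these descriptive statistics can be computed with pandas (the numbers below are made up for illustration):

import pandas as pd

marks = pd.Series([45, 52, 52, 60, 61, 63, 70, 75, 88, 95])

print(marks.mean())      # central tendency: mean
print(marks.median())    # median (50th percentile)
print(marks.mode()[0])   # mode (most frequent value)
print(marks.max() - marks.min())          # range
print(marks.var(), marks.std())           # variance and standard deviation
print(marks.quantile([0.25, 0.5, 0.75]))  # quartiles / percentiles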
Q4. Describe in detail Univariate Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Univariate exploration is a data analysis technique that focuses on analyzing a single variable at a time.
● The purpose of univariate exploration is to gain an understanding of the distribution and characteristics of a single variable, which can help in identifying any patterns or outliers in the data.
● Here are some of the key methods used in univariate exploration:
1. Histograms: Histograms are used to visualize the distribution of a single variable. A histogram is a graph that displays the frequency of data within different intervals or bins. The height of each bar represents the number of data points in that interval.
2. Box plots: Box plots are used to visualize the distribution of a single variable by displaying the median, quartiles, and outliers. A box plot consists of a box that spans the interquartile range (IQR) and whiskers that extend to the highest and lowest values within 1.5 times the IQR.
3. Density plots: Density plots are used to visualize the probability density function of a single variable. A density plot is a smoothed version of a histogram that represents the distribution of the variable as a continuous curve.
4. Bar charts: Bar charts are used to visualize the distribution of categorical variables. A bar chart displays the frequency or proportion of each category as a bar.
5. Summary statistics: Summary statistics such as mean, median, mode, variance, and standard deviation can be used to describe the central tendency and variability of a single variable.
6. Skewness and Kurtosis: Skewness and Kurtosis are used to measure the shape of the distribution of a variable. Skewness measures the asymmetry of the distribution, while kurtosis measures the degree of peakedness or flatness of the distribution.
● Univariate exploration is an important step in data analysis as it provides a detailed understanding of a single variable and helps in identifying any patterns or outliers in the data.
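A minimal sketch of univariate exploration in Python using matplotlib; the ages variable is randomly generated here purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
ages = np.random.normal(loc=35, scale=10, size=500)   # hypothetical single variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)        # histogram of the distribution
ax1.set_title("Histogram")
ax2.boxplot(ages, vert=False)  # box plot: median, quartiles, outliers
ax2.set_title("Box plot")
plt.show()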
Q5. Explain in detail the Measure of Central Tendency. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of central tendency are statistics that describe the "center" or typical value of a dataset. There are three common measures of central tendency: the mean, median, and mode.
1. Mean: The mean is calculated by adding up all the values in a dataset and then dividing by the number of values. It is the most commonly used measure of central tendency. However, it can be sensitive to outliers, or extreme values, which can skew the mean.
2. Median: The median is the middle value of a dataset when the values are arranged in order. If there is an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean.
3. Mode: The mode is the value that occurs most frequently in a dataset. It can be useful in identifying the most common value in a dataset, but it may not be a good measure of central tendency if there are multiple modes or if the dataset is continuous.
Each measure of central tendency has its advantages and disadvantages, and the choice of which to use depends on the nature of the data and the research question. For example, if the data is normally distributed, the mean may be the most appropriate measure of central tendency. However, if the data is skewed or contains outliers, the median may be a better choice. Similarly, the mode may be useful for categorical data, but not for continuous data.
Q6. Write down Measures of Spread, Symmetry. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of spread and symmetry are important descriptive statistics that help to characterize the distribution of a dataset. Here are some of the most common measures of spread and symmetry:
● Measures of spread:
1. Range: The range is the difference between the largest and smallest values in a dataset.
2. Interquartile range (IQR): The IQR is the range of the middle 50% of the dataset, calculated by subtracting the 25th percentile from the 75th percentile.
3. Variance: The variance is the average of the squared differences from the mean. It measures how much the data varies from the mean.
4. Standard deviation: The standard deviation is the square root of the variance. It measures the spread of the data around the mean.
● Measures of symmetry:
1. Skewness: Skewness is a measure of the asymmetry of the distribution. A positive skew indicates that the tail of the distribution is longer on the positive side, while a negative skew indicates that the tail is longer on the negative side. A skewness value of zero indicates that the distribution is perfectly symmetrical.
2. Kurtosis: Kurtosis is a measure of the "peakedness" of the distribution. A high kurtosis value indicates that the distribution has a sharp peak and heavy tails, while a low kurtosis value indicates a flat or rounded distribution.
● These measures are important because they provide valuable information about the shape and variability of a dataset. Understanding the spread and symmetry of a dataset can help in making informed decisions about the data and in identifying any patterns or outliers that may be present.
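These measures of spread and symmetry can be computed directly with pandas, as in this small illustrative sketch (the values are made up; note how the single large value 40 stretches the range and produces positive skewness):

import pandas as pd

data = pd.Series([2, 4, 4, 5, 7, 9, 12, 13, 15, 40])   # note the extreme value 40

print(data.max() - data.min())                    # range
print(data.quantile(0.75) - data.quantile(0.25))  # interquartile range (IQR)
print(data.var(), data.std())                     # variance and standard deviation
print(data.skew())                                # skewness (positive: long right tail)
print(data.kurt())                                # (excess) kurtosis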
Q7. Write a short note on Skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Skewness is a measure of the asymmetry of a probability distribution. It describes the extent to which a distribution deviates from symmetry around its mean.
● A distribution can be skewed to the left (negative skewness) or skewed to the right (positive skewness).
● If a distribution is skewed to the left, the tail of the distribution is longer on the left-hand side, and the mean is less than the median.
● This is because there are more extreme values on the left side of the distribution. Conversely, if a distribution is skewed to the right, the tail of the distribution is longer on the right-hand side, and the mean is greater than the median.
● This is because there are more extreme values on the right side of the distribution.
● Skewness can be quantified using a number of different measures, such as Pearson's moment coefficient of skewness, the Bowley skewness, or the quartile skewness coefficient.
● Skewness is an important measure of distributional shape, as it provides insight into the direction and degree of deviation from symmetry.
● Skewed distributions can have important implications in statistical analysis, as they can affect the interpretation of statistical tests, such as the t-test or ANOVA.
Q8. Discuss in detail Karl Pearson Coefficient of skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Karl Pearson Coefficient of skewness is a measure of the skewness of a distribution.
● Pearson's first coefficient is defined as the ratio of the difference between the mean and the mode of a distribution to the standard deviation of the distribution. When the mode is not well defined, the second coefficient, based on the median, is used instead.
● This measure was developed by Karl Pearson, a British mathematician and statistician.
● The formula for the Karl Pearson Coefficient of skewness (second coefficient) is:
Skewness = 3 * (Mean - Median) / Standard deviation
Where,
Mean = arithmetic mean of the dataset
Median = median of the dataset
Standard deviation = standard deviation of the dataset
● The Karl Pearson Coefficient of skewness is a dimensionless measure of skewness, meaning that it has no units.
● The measure is always zero for a perfectly symmetrical distribution.
● Positive values of skewness indicate that the tail of the distribution is longer on the right-hand side, while negative values of skewness indicate that the tail is longer on the left-hand side.
● One of the main advantages of the Karl Pearson Coefficient of skewness is that it is easy to calculate and interpret.
● However, it can be sensitive to outliers in the data, and it may not be appropriate for distributions that are heavily skewed.
● Overall, the Karl Pearson Coefficient of skewness is a useful measure of the skewness of a distribution, and it can provide valuable insight into the shape and characteristics of the data.
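A small worked example of the coefficient (using the median-based form given above) on a made-up dataset:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 12])          # hypothetical dataset
mean, median, std = data.mean(), np.median(data), data.std(ddof=1)

skewness = 3 * (mean - median) / std                 # Karl Pearson's (second) coefficient
print(round(skewness, 3))                            # positive value -> right-skewed data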
Q9. Explain in detail Bowley's Coefficient. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Bowley's coefficient of skewness, also known as the quartile skewness coefficient, is a measure of the skewness of a distribution.
● It is based on the difference between the upper and lower quartiles of a dataset. The measure was developed by Arthur Bowley, an English statistician.
● The formula for Bowley's coefficient of skewness is:
Skewness = (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)
Where,
Q1 = first quartile of the dataset
Q2 = second quartile (median) of the dataset
Q3 = third quartile of the dataset
● The coefficient of skewness can take values ranging from -1 to +1. A value of zero indicates that the distribution is perfectly symmetrical, while negative and positive values indicate that the distribution is skewed to the left or right, respectively.
● Overall, Bowley's coefficient of skewness is a useful measure of the skewness of a distribution, and it can provide valuable insight into the shape and characteristics of the data.
● One of the main advantages of Bowley's coefficient of skewness is that it is less sensitive to extreme values or outliers than other measures of skewness, such as the Karl Pearson coefficient. This is because it is based on quartiles, which are less affected by extreme values than the mean and standard deviation.
● However, one limitation of Bowley's coefficient of skewness is that it is only based on three quartiles of the dataset, and may not be as accurate as other measures of skewness for distributions that are heavily skewed.
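A short illustrative computation of Bowley's coefficient with NumPy, using the formula above on a made-up dataset:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 12])            # hypothetical dataset
q1, q2, q3 = np.percentile(data, [25, 50, 75])

bowley = (q3 + q1 - 2 * q2) / (q3 - q1)                # quartile (Bowley) skewness
print(round(bowley, 3))                                # lies between -1 and +1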
Q10. Discuss in detail Kurtosis (Multivariate Exploration). (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Kurtosis is a statistical measure that describes the shape of a distribution. It is a measure of the "peakedness" or "flatness" of a distribution compared to a normal distribution.
● Multivariate exploration refers to the analysis of more than one variable at a time. In this context, kurtosis can be used to analyze the relationship between multiple variables in a dataset.
● The most commonly used measure of kurtosis is Pearson's coefficient of kurtosis, which is calculated by dividing the fourth central moment by the square of the variance.
● The formula for Pearson's (excess) coefficient of kurtosis is:
Kurtosis = (M4 / S⁴) - 3
Where,
M4 = fourth central moment of the dataset
S = standard deviation of the dataset
● The value of kurtosis can be positive or negative. A positive value indicates that the distribution is more peaked than a normal distribution, while a negative value indicates that the distribution is flatter than a normal distribution.
● A value of zero indicates that the distribution is similar in shape to a normal distribution.
● In multivariate exploration, kurtosis can be used to analyze the relationship between multiple variables in a dataset.
● For example, if two variables have similar levels of kurtosis, it may indicate that they are related or have a similar distribution.
● On the other hand, if two variables have different levels of kurtosis, it may indicate that they are not related or have different distributions.
● One limitation of kurtosis in multivariate exploration is that it only measures the shape of a distribution, and does not take into account other factors such as the location and spread of the data.
● Therefore, it is important to use kurtosis in combination with other measures such as central tendency and dispersion when analyzing relationships between multiple variables.
Q11. Explain in detail Central Data Point. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Central data point is a term used to describe a specific value that represents the central tendency or central location of a dataset.
● In statistics, measures of central tendency are used to describe the typical or most common value in a dataset.
● There are three commonly used measures of central tendency: mean, median, and mode.
● The mean is calculated by adding up all the values in a dataset and dividing by the number of observations. The mean is a sensitive measure of central tendency and can be affected by outliers or extreme values in the dataset.
● The median is the middle value in a dataset when the observations are arranged in order from smallest to largest. The median is less sensitive to extreme values and is a more robust measure of central tendency than the mean.
● The mode is the value that appears most frequently in a dataset. The mode is useful when there is a high frequency of one or a few specific values in the dataset.
● The choice of central data point depends on the characteristics of the dataset and the research question being addressed.
● In some cases, the mean may be more appropriate, while in other cases the median or mode may be more appropriate.
● For example, if the dataset contains extreme values or outliers, the median may be a better measure of central tendency than the mean.
Q12. Write a short note on Correlation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Correlation is a statistical measure that describes the relationship between two variables.
● It is used to determine whether there is a statistical association between the two variables, and if so, the strength and direction of the association. Correlation can be expressed as a numerical value known as the correlation coefficient.
● The correlation coefficient ranges from -1 to +1.
● A value of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other variable decreases in a linear fashion.
● A value of +1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable increases in a linear fashion. A value of 0 indicates no correlation between the two variables.
● Correlation is commonly used in research to investigate the relationship between variables.
● For example, in a medical study, correlation may be used to determine whether there is a relationship between a specific treatment and patient outcomes.
● In finance, correlation may be used to determine whether there is a relationship between the performance of two stocks or investments.
● It is important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.
● It is also important to consider other factors that may influence the relationship between the variables.
● Correlation is a useful statistical tool for investigating the relationship between two variables.
● It can help researchers identify patterns and trends in the data and make predictions about future outcomes.
● However, it is important to interpret correlation in the context of the research question being addressed and consider other factors that may influence the relationship between the variables.

The main forms of correlation are:
1. Pearson correlation: This form of correlation measures the linear relationship between two continuous variables. It ranges from -1 to +1, with 0 indicating no correlation and values closer to -1 or +1 indicating a stronger correlation.
2. Spearman correlation: This form of correlation measures the relationship between two variables based on their rank order. It is used when the variables are ordinal or when the relationship is non-linear.
3. Kendall correlation: This form of correlation is also based on rank order and measures the strength and direction of the relationship between two variables. It is often used when the data is non-parametric or when the variables are not normally distributed.
4. Point-biserial correlation: This form of correlation measures the relationship between a continuous variable and a dichotomous variable. It is used when one variable is continuous and the other variable is binary.
5. Biserial correlation: This form of correlation measures the relationship between a continuous variable and a dichotomous variable that has an underlying continuous distribution.
6. Phi coefficient: This form of correlation is used when both variables are dichotomous; it is equivalent to applying the Pearson correlation to two binary variables.
The choice of correlation method depends on the type of data being analyzed and the research question being addressed. It is important to select the appropriate method to accurately measure the relationship between two variables.
● A value of 0 indicates no correlation between the two variables.
● The formula to calculate the Pearson correlation coefficient is as follows:
r = Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)² * Σ(y - ȳ)²)
where x and y are the two variables, x̄ and ȳ are their respective means, and Σ represents the sum of the values.
● To interpret the correlation coefficient, we can use the following guidelines:
1. A value of r between -0.7 and -1 or between 0.7 and 1 indicates a strong correlation.
2. A value of r between -0.5 and -0.7 or between 0.5 and 0.7 indicates a moderate correlation.
3. A value of r between -0.3 and -0.5 or between 0.3 and 0.5 indicates a weak correlation.
4. A value of r between -0.3 and 0.3 indicates no correlation.
● It is important to note that correlation does not imply causation. A strong correlation between two variables does not necessarily mean that one variable causes the other. Other factors may be responsible for the observed relationship.
● It is a useful tool for identifying patterns and trends in the data and making predictions about future outcomes.
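In practice the coefficient is rarely computed by hand; a minimal sketch using scipy.stats.pearsonr on made-up data:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])           # e.g. hours studied (hypothetical)
y = np.array([35, 42, 50, 54, 60, 68, 71, 80])   # e.g. exam score (hypothetical)

r, p_value = stats.pearsonr(x, y)
print(round(r, 3))        # close to +1 -> strong positive correlation
print(round(p_value, 4))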
There are various types of probability distributions, each with its own unique characteristics and properties. Two common types of distributions are the normal distribution and the Poisson distribution:
1. Normal Distribution:
● The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric and bell-shaped.
● It is often used to model naturally occurring phenomena.
● The normal distribution is characterized by two parameters, the mean (μ) and the standard deviation (σ).
● The mean represents the central tendency of the distribution, while the standard deviation represents the spread or variability of the distribution.
● The area under the normal curve is equal to 1, and about 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations.
2. Poisson Distribution:
● The Poisson distribution is a discrete probability distribution that is used to model the number of times an event occurs in a fixed interval of time or space, given that the events occur independently and at a constant rate.
● It is named after the French mathematician Siméon Denis Poisson. The Poisson distribution is characterized by a single parameter, λ (lambda), which represents the average rate of occurrence of the event.
● The probability of observing a certain number of events in a fixed interval of time or space can be calculated using the Poisson probability mass function.
● The Poisson distribution is commonly used in fields such as biology, physics, and telecommunications.
Both the normal distribution and the Poisson distribution are important tools in statistics and data analysis. They allow us to model and analyze data in a meaningful way, and make predictions about future outcomes based on past observations.
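A brief sketch of working with both distributions using scipy.stats; the parameter values (mean 170, standard deviation 10, rate λ = 4) are chosen only for illustration:

from scipy import stats

# Normal distribution with mean 170 and standard deviation 10 (hypothetical heights)
norm = stats.norm(loc=170, scale=10)
print(norm.cdf(180) - norm.cdf(160))   # ~0.68: data within one standard deviation

# Poisson distribution with an average rate of 4 events per interval
pois = stats.poisson(mu=4)
print(pois.pmf(2))                     # probability of observing exactly 2 events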
Q16. Describe in detail Test Hypothesis. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, a hypothesis is a proposed explanation or prediction for a phenomenon or set of observations.
● Hypothesis testing is a statistical technique that allows us to test the validity of a hypothesis by comparing it to an alternative hypothesis using a set of statistical tools and methods.
● The goal of hypothesis testing is to determine whether there is enough evidence to support or reject the null hypothesis.
● The hypothesis testing process typically involves the following steps:
1. Formulate the null and alternative hypotheses: The null hypothesis is the default hypothesis that there is no significant difference or effect between two populations or samples, while the alternative hypothesis is the opposite hypothesis that there is a significant difference or effect.
2. Choose a significance level: The significance level (denoted as α) is the probability of rejecting the null hypothesis when it is actually true. The most common significance level is 0.05 or 5%.
3. Collect data and calculate the test statistic: Collect the data and calculate a test statistic, which is a numerical value that measures the difference between the observed data and the expected values under the null hypothesis.
4. Determine the p-value: The p-value is the probability of observing the test statistic or a more extreme value if the null hypothesis is true. It is used to determine the statistical significance of the test result.
5. Compare the p-value to the significance level: If the p-value is less than the significance level, we reject the null hypothesis and accept the alternative hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis.
6. Draw conclusions: Based on the results of the hypothesis test, we can draw conclusions about the relationship between the variables being tested.
● Hypothesis testing is used in many different fields, including business, medicine, and social sciences.
● It is a powerful tool for making decisions and drawing conclusions based on data and statistical analysis.
● However, it is important to carefully formulate the null and alternative hypotheses, choose an appropriate significance level, and properly interpret the results of the test to ensure accurate and meaningful conclusions.
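The steps above can be illustrated with a one-sample t-test in Python; the sample values and the hypothesized mean of 50 are made up for this sketch:

import numpy as np
from scipy import stats

# H0: the population mean is 50; H1: it is not. Significance level alpha = 0.05.
sample = np.array([52, 55, 49, 51, 58, 54, 53, 56, 50, 57])   # hypothetical sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)       # test statistic and p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")          # statistically significant difference
else:
    print("Fail to reject the null hypothesis")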
Q17. Write a short note on the Central Limit Theorem. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Central Limit Theorem (CLT) is a key concept in data science that is used to make statistical inferences about large datasets.
● It states that if a large number of independent and identically distributed random variables are added or averaged together, the resulting distribution will be approximately normal, regardless of the underlying distribution of the individual variables.
● In practical terms, this means that if we have a large enough sample size, we can use the CLT to make accurate estimates about the population mean and standard deviation, even if we don't know the underlying distribution of the data.
● This is important in data science because it allows us to draw meaningful conclusions from data, even when we have incomplete information.
● For example, the CLT is often used in hypothesis testing, where we compare a sample mean to a hypothesized population mean.
● By calculating the standard error of the mean using the CLT, we can estimate the probability of observing a sample mean as extreme as the one we have, given the hypothesized population mean.
● This helps us determine whether the difference between the sample mean and the hypothesized mean is statistically significant.
● Overall, the CLT is a fundamental concept in data science that helps us make sense of large datasets and draw meaningful conclusions from them.
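A small simulation that illustrates the CLT: sample means drawn from a skewed (exponential) population are approximately normally distributed around the population mean:

import numpy as np

np.random.seed(0)
population = np.random.exponential(scale=2.0, size=100_000)   # skewed population, mean ~2

sample_means = [np.random.choice(population, size=50).mean() for _ in range(2000)]
print(np.mean(sample_means))   # close to the population mean (~2.0)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2 / sqrt(50)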
Q18. Confidence Interval, Z-test, t-test. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Confidence Interval:
● A confidence interval is a range of values that is likely to contain the true population parameter with a certain level of confidence.
● It is an important tool in statistical inference that is used to estimate the range of values that the true population parameter could fall within based on a sample of data.
● The level of confidence is typically expressed as a percentage, such as 95% or 99%.
● A wider confidence interval indicates a less precise estimate, while a narrower interval indicates a more precise estimate; demanding a higher level of confidence produces a wider interval.
Z-test:
● A Z-test is a hypothesis test that is used to determine whether a sample mean is significantly different from a hypothesized population mean, when the population variance is known.
● It involves calculating the Z-score of the sample mean, which is the number of standard deviations the sample mean is from the hypothesized mean.
● If the Z-score falls outside a certain range of values, the null hypothesis is rejected and the sample mean is deemed to be significantly different from the hypothesized mean.
T-test:
● A T-test is a hypothesis test that is used to determine whether a sample mean is significantly different from a hypothesized population mean, when the population variance is unknown.
● The T-test is used instead of the Z-test when the sample size is small, or when the population variance is unknown.
● The T-test involves calculating the T-score of the sample mean, which is similar to the Z-score, but takes into account the sample size and the sample variance.
● If the T-score falls outside a certain range of values, the null hypothesis is rejected and the sample mean is deemed to be significantly different from the hypothesized mean.
● In summary, confidence intervals are used to estimate the range of values that the true population parameter could fall within based on a sample of data. Z-tests are used when the population variance is known, and T-tests are used when the population variance is unknown or when the sample size is small.
● Both tests are used to determine whether a sample mean is significantly different from a hypothesized population mean.
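A minimal sketch of computing a 95% confidence interval for a mean with scipy (t distribution, since the population variance is assumed unknown; the sample is made up):

import numpy as np
from scipy import stats

sample = np.array([52, 55, 49, 51, 58, 54, 53, 56, 50, 57])   # hypothetical sample
mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean

# 95% confidence interval using the t distribution
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(low, high)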
Q19. Describe in detail Type-I, Type-II Errors. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In data science, Type-I and Type-II errors have the same definitions as in statistical hypothesis testing.
● However, in the context of data science, these errors can occur in different ways and have different consequences.
Type-I Error:
● In data science, a Type-I error occurs when a statistical model or algorithm incorrectly identifies a pattern or relationship in the data as being significant, when in fact it is not.
● This is similar to a false positive in statistical hypothesis testing. For example, suppose a predictive model incorrectly identifies a feature as being important for predicting a target variable, when in fact it is not.
● This can lead to inaccurate predictions and misleading insights, which can have serious consequences in fields such as healthcare or finance.
Type-II Error:
● In data science, a Type-II error occurs when a statistical model or algorithm fails to identify a significant pattern or relationship in the data, when in fact it is present. This is similar to a false negative in statistical hypothesis testing.
● For example, suppose a predictive model fails to identify an important feature for predicting a target variable, leading to inaccurate predictions and missed opportunities. This can also have serious consequences in fields such as healthcare or finance, where missing an important trend or relationship can have significant implications.
● It is important to note that in data science, the consequences of Type-I and Type-II errors can vary depending on the specific application and the cost of making an incorrect decision.
● For example, in a medical diagnosis application, the cost of a Type-II error (failing to identify a disease) may be much higher than the cost of a Type-I error (identifying a disease that is not present), as the latter can be verified with additional tests, whereas the former may result in delayed treatment and worse health outcomes. Therefore, it is important for data scientists to consider the consequences of both types of errors when designing and evaluating models and algorithms.
4. Model selection: This involves selecting a suitable model, based on the nature of the data and the research question or problem.
5. Model training and evaluation: This involves fitting the model to the data using available algorithms and techniques, and evaluating the performance of the model using various metrics and techniques, such as cross-validation or confusion matrices.
6. Model deployment and maintenance: This involves deploying the model in a real-world setting and monitoring its performance over time to ensure that it continues to provide accurate and reliable predictions.
● Model building is a complex and iterative process that requires careful attention to data quality, feature selection, and algorithm selection, as well as ongoing monitoring and maintenance to ensure that the model remains valid and reliable.
● By following sound model building practices, data scientists can create predictive models that provide useful insights and inform decision-making.

Q3. Write a short note on Cross Validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Cross-validation is a statistical technique used to evaluate the performance of predictive models.
● The basic idea of cross-validation is to split the available data into two or more subsets, one for training the model and the other for testing the model's performance.
● This helps to ensure that the model is not overfitting to the training data and provides an unbiased estimate of the model's performance on new data.
● The most common type of cross-validation is k-fold cross-validation, where the data is split into k subsets of roughly equal size.
● The model is then trained on k-1 of these subsets and tested on the remaining subset.
● This process is repeated k times, with each subset used exactly once for testing.
● The results of each test are then averaged to provide an overall estimate of the model's performance.
● Cross-validation is a useful technique for evaluating the performance of different predictive models and selecting the best model for a given problem.
● It can also be used to tune the parameters of a model, such as the regularization parameter in linear regression or the number of trees in a random forest, by testing different values of the parameter on different subsets of the data.
● Cross-validation is a powerful technique for evaluating the performance of predictive models and ensuring that they generalize well to new data.
● By using cross-validation, data scientists can make more informed decisions about which models to use and how to tune their parameters, resulting in more accurate and reliable predictions.

Q4. Explain in detail K-fold cross validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: K-fold cross-validation is a technique used to evaluate the performance of a predictive model.
● It involves dividing the available data into k subsets of roughly equal size, and then training and testing the model k times, each time using a different subset as the test set.
● The basic steps involved in k-fold cross-validation are as follows:
1. Split the data into k subsets (folds) of roughly equal size.
2. Train the model on k-1 of the subsets, holding out the remaining subset.
3. Use the trained model to predict the outcomes of the test set.
4. Calculate the performance metric (such as accuracy, precision, recall, or F1 score) for the test set.
5. Repeat steps 2-4 k times, each time using a different subset as the test set.
6. Average the performance metrics over the k folds to get an overall estimate of the model's performance.
● The advantage of k-fold cross-validation is that it allows for a more accurate and reliable estimate of the model's performance than simply splitting the data into a training set and a test set.
● By repeating the process k times and averaging the results, we can reduce the variance of the performance estimate and obtain a more accurate assessment of the model's ability to generalize to new data.
● The choice of k depends on the size of the dataset and the complexity of the model.
● In general, larger values of k are more computationally expensive but provide more accurate estimates of the model's performance, while smaller values of k are less computationally expensive but may be more prone to overfitting.
● K-fold cross-validation is widely used in data science and machine learning, as it provides a robust and reliable way to evaluate the performance of predictive models and select the best model for a given problem.
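A minimal k-fold cross-validation sketch with scikit-learn, using the bundled Iris dataset and logistic regression purely as an example model:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # overall estimate of the model's performance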
Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k is equal to the number of samples in the dataset.
● In other words, LOOCV involves training the model on all but one of the samples in the dataset, and using the remaining sample as the test set.
● The basic steps involved in LOOCV are as follows:
1. Remove one sample from the dataset and use the remaining samples to train the model.
2. Use the trained model to predict the outcome of the removed sample.
3. Calculate the performance metric (such as accuracy, precision, recall, or F1 score) for the removed sample.
4. Repeat steps 1-3 for each sample in the dataset.
5. Average the performance metrics over all samples to get an overall estimate of the model's performance.
● The advantage of LOOCV is that it provides the most accurate estimate of the model's performance possible, as each sample in the dataset is used as the test set exactly once.
● However, LOOCV can be computationally expensive, especially for large datasets, as it requires training the model on almost all of the samples multiple times.
● LOOCV is useful for evaluating the performance of models that have a small number of samples or that are prone to overfitting.
● By using LOOCV, data scientists can obtain a more accurate estimate of the model's ability to generalize to new data and select the most appropriate model.
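The same idea with leave-one-out cross-validation in scikit-learn (again using Iris and logistic regression only as illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# LOOCV: each of the 150 samples is used as the test set exactly once
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())   # average performance over all single-sample tests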
5. Density plot: A density plot is a graphical representation of the distribution of a numerical variable that displays the density of observations across the range of the variable.
6. Frequency polygon: A frequency polygon is a graphical representation of the distribution of a numerical variable that displays the frequency or count of observations as a line connecting the midpoints of each bin or interval of the variable.
● Univariate visualization can help data analysts and scientists identify patterns, outliers, and anomalies in the distribution of a variable.
● It can also provide insights into the skewness, kurtosis, and central tendency of the data, and help determine the appropriate statistical tests and models to use for further analysis.
● However, univariate visualization should be complemented with bivariate and multivariate visualization techniques to explore the relationships between variables and to gain a more comprehensive understanding of the data.

● For example, a histogram can show if the data is skewed to one side, if it has a bell-shaped distribution, or if it has multiple peaks.
● To create a histogram, we first choose the number and size of the bins. The number of bins should be large enough to capture the shape of the data but not so large that individual bins have very few data points.
● The size of the bins determines the width of the bars in the histogram.
● After selecting the bins, we count the number of data points that fall within each bin and plot these counts as the height of the bars. The resulting histogram provides a visual representation of the distribution of the data.
● The following are some typical histograms, with a caption below each one explaining the distribution of the data, as well as the characteristics of the mean, median, and mode. (Histogram figures not reproduced here.)
● They are also used in statistical calculations, such as calculating the interquartile range (IQR), which is the difference between the third and first quartiles.
● The IQR is a measure of the spread of the middle 50% of the data and is often used to identify potential outliers in a dataset.

Q9. Write a short note on the Distribution Chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A distribution chart is a graphical representation of the frequency distribution of a dataset. It shows how the data is distributed across different values or ranges. There are different types of distribution charts, each of which is used for a specific purpose:
1. Histogram: A histogram is a type of distribution chart that shows the frequency distribution of continuous data. It is used to visualize the shape, center, and spread of the distribution of the data.
2. Box plot: A box plot is a type of distribution chart that shows the distribution of continuous data using quartiles. It is used to identify outliers, skewness, and the spread of the data.
3. Stem and leaf plot: A stem and leaf plot is a type of distribution chart that shows the distribution of discrete data. It is used to visualize the frequency distribution of the data.
4. Probability density function: A probability density function is a type of distribution chart that shows the probability of a continuous random variable falling within a certain range. It is used to model and analyze continuous data.
5. Bar chart: A bar chart is a type of distribution chart that shows the frequency distribution of categorical data. It is used to visualize the distribution of data across different categories.
Distribution charts are an important tool in data analysis and are used to gain insights into the distribution and characteristics of a dataset. They are useful for identifying patterns, trends, and outliers in the data, and can be used to inform decision-making in various fields, including finance, healthcare, and marketing.

Q10. Discuss in detail Scatter Plot. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A scatter plot is a type of data visualization that displays the relationship between two variables in a dataset.
● It is a graph with points that represent individual data points and show their positions along two axes.
● Here is an example of a scatter plot: (example figure not reproduced here)
● In this scatter plot, the x-axis represents one variable, and the y-axis represents the other variable. Each point represents a single data point and its value for both variables. The pattern of the points on the plot provides insight into the relationship between the two variables.
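A minimal scatter plot sketch with matplotlib; the two variables are randomly generated and loosely related purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
hours = np.random.uniform(1, 10, 50)                      # hypothetical x-axis variable
scores = 30 + 6 * hours + np.random.normal(0, 5, 50)      # related y-axis variable

plt.scatter(hours, scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.show()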
● Scatter matrices are a powerful tool for exploring the relationships between multiple variables in a dataset. By using scatter matrices to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.

Q12. Write a short note on the Bubble chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A bubble chart is a type of data visualization that is used to display three dimensions of data in a two-dimensional space.
● In a bubble chart, data points are represented by bubbles of varying sizes and colors. The x and y axes represent two variables, and the size and color of the bubbles represent a third variable.
● Bubble charts are useful for displaying data with multiple dimensions, as they allow for the visualization of relationships between three variables.
● They are often used in business and economics to show market trends, and in social science to visualize relationships between multiple variables.
● When creating a bubble chart, it is important to choose an appropriate size and color scheme to ensure that the bubbles are easy to read and interpret.
● It is also important to label the axes and provide a clear legend to explain the meaning of the bubble sizes and colors.
● In conclusion, bubble charts are a powerful tool for displaying data with multiple dimensions. By using bubble charts to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.

Q13. Write a short note on the Density chart.
Ans: A density chart is a useful data visualization tool that displays the distribution of a variable in a dataset.
● It provides a continuous estimate of the probability density function of the data by smoothing out the frequency distribution of a histogram.
● The area under the curve of a density chart represents the probability of the variable being in a certain range of values.
● Density charts are particularly useful for displaying continuous variables, such as age or weight, where there are many potential values in the data.
● They can help identify patterns in the data, such as whether the variable is normally distributed or skewed to one side. By comparing the density charts of different variables, analysts can also identify relationships between variables, such as correlation or causation.
● There are several ways to create a density chart, including using statistical software packages such as R or Python.
● One common method for creating a density chart is to use kernel density estimation (KDE), which is a non-parametric way of estimating the probability density function of a variable.
● In KDE, a kernel function is used to smooth the frequency distribution of the variable, resulting in a continuous curve that approximates the probability density function.
● When creating a density chart, it is important to choose an appropriate bandwidth for the kernel function.
● A higher bandwidth will result in a smoother curve, but it may also hide important features in the data, while a lower bandwidth may result in a more jagged curve that is difficult to interpret.
● It is also important to label the axes and provide a clear legend to explain the meaning of the density values.
● Density charts are a powerful tool for displaying the distribution of a variable in a dataset. By using density charts to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.
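A minimal KDE-based density chart sketch using scipy's gaussian_kde; the data and the bandwidth choice are illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

np.random.seed(0)
weights = np.random.normal(70, 12, 300)      # hypothetical continuous variable

kde = gaussian_kde(weights)                  # kernel density estimate (default bandwidth)
# A larger bw_method gives a smoother curve, a smaller one a more jagged curve, e.g.:
# kde = gaussian_kde(weights, bw_method=0.2)

xs = np.linspace(weights.min(), weights.max(), 200)
plt.plot(xs, kde(xs))
plt.xlabel("Weight")
plt.ylabel("Density")
plt.show()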
Q14. Explain in detail Roadmap for Data Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data exploration is a crucial step in the data analysis process, as it allows analysts to understand the data and identify patterns or relationships between variables. A roadmap for data exploration can help guide analysts through this process and ensure that all important aspects of the data are considered.
Here is a roadmap for data exploration:
1. Define the problem: Start by defining the research question or problem that the data will be used to address. This will help guide the exploration process and ensure that the analysis is focused on relevant variables.
2. Collect the data: Gather all relevant data, whether it be from surveys, experiments, or other sources. Ensure that the data is clean and organized, and that missing values are appropriately handled.
3. Describe the data: Begin by describing the data using basic statistical measures such as mean, median, mode, and standard deviation. This will provide an overview of the data and help identify any outliers or unusual values.
4. Visualize the data: Use data visualization techniques such as histograms, box plots, and scatter plots to explore the relationships between variables and identify any patterns or trends.
5. Check for correlations: Analyze the correlation matrix to identify strong correlations between variables. This can help identify potential predictors or explanatory variables for further analysis.
6. Identify outliers: Use outlier detection techniques to identify any data points that are significantly different from the rest of the data. This can help identify potential errors or unusual events that may need to be further investigated.
7. Test hypotheses: Use statistical tests to test hypotheses and explore the relationships between variables in more depth. This can include regression analysis, ANOVA, or other techniques depending on the research question.
8. Communicate results: Finally, communicate the results of the data exploration process to relevant stakeholders. This can include visualizations, tables, and reports that summarize the key findings and insights from the data.
By following this roadmap for data exploration, analysts can gain a deeper understanding of the data and make more informed decisions. It is important to note that this process is iterative and may require multiple rounds of exploration and analysis to fully understand the data and address the research question.

Q15. Explain Visualizing high dimensional data: Parallel chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Visualizing high-dimensional data can be challenging, as it can be difficult to represent all of the dimensions in a way that is easy to understand.
● One technique for visualizing high-dimensional data is the parallel coordinates chart, also known as a parallel plot or parallel chart.
● A parallel coordinates chart displays multivariate data by representing each dimension as a separate axis, all parallel to each other.
● Each data point is represented as a line that intersects with each axis at the value of that dimension for that particular data point.
● By displaying all of the dimensions in parallel, the parallel coordinates chart can provide a comprehensive view of the relationships between different variables.
● Here are some key features and considerations of parallel coordinates charts:
1. Axes and scaling: Each axis represents a single dimension or variable, and it is important to scale the axes appropriately to ensure that all data points are visible. Nonlinear scaling may be required for data that is highly skewed.
2. Interactivity: Parallel coordinates charts can be made interactive by allowing users to highlight or filter data points based on specific values or ranges.
3. Overplotting: When multiple data points overlap each other, it can be difficult to distinguish between them. Techniques such as transparency, jittering, and bundling can be used to alleviate this issue.
4. Clustering: Parallel coordinates charts can be used to identify clusters or subgroups in the data: points that are close together along multiple axes may represent a distinct subgroup.
● Parallel coordinates charts can be useful for a variety of applications, such as exploratory data analysis and cluster analysis.
● They can help identify patterns and relationships between variables that may be difficult to see with other visualization techniques.
● However, they can also be complex and difficult to interpret, especially for large datasets with many dimensions.
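A minimal parallel coordinates sketch using pandas' built-in plotting helper, with the Iris dataset as an illustrative example:

import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# One axis per dimension; each observation is a line crossing all axes
parallel_coordinates(df, class_column="species", colormap="viridis")
plt.show()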
distinguish between them. Techniques such as ● Here are some key features and considerations of deviation charts:
transparency, jittering, and bundling can be used to alleviate 1. Layout and design: Deviation charts can be designed in
this issue. various ways, with either horizontal or vertical bars. The bars
4. Clustering: Parallel coordinates charts can be used to can be colored or shaded to indicate positive and negative
points that are close together along multiple axes may 2. Labels and annotations: It is important to label the bars and
represent a distinct subgroup. provide clear annotations to explain the meaning of the
● Parallel coordinates charts can be useful for a variety of chart. This can include axis labels, legend, and annotations
applications, such as exploratory data analysis, cluster analysis, to indicate the source of the data and any important
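● A minimal Python sketch of a parallel coordinates chart, assuming pandas, Matplotlib, and scikit-learn are installed (the Iris dataset and styling choices are illustrative, not part of the notes):

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

# Load a small multivariate dataset: 4 numeric dimensions plus a class label
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# One vertical axis per dimension; each observation becomes a line across the axes
parallel_coordinates(df, class_column="species", colormap="viridis", alpha=0.4)
plt.title("Parallel coordinates chart (Iris)")
plt.tight_layout()
plt.show()
```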
Q16. Discuss in detail Deviation chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A deviation chart, also known as a diverging stacked bar chart or a butterfly chart, is a visualization technique used to compare two groups of data with a common metric.
● It displays the difference between two sets of data using stacked bars that deviate from a central axis.
● The deviation chart is particularly useful when comparing positive and negative values, as it allows for a clear comparison of the differences between the two groups.
● The central axis represents the point of balance between the two groups, with the positive values on one side and the negative values on the other.
● Here are some key features and considerations of deviation charts:
1. Layout and design: Deviation charts can be designed in various ways, with either horizontal or vertical bars. The bars can be colored or shaded to indicate positive and negative values.
2. Labels and annotations: It is important to label the bars and provide clear annotations to explain the meaning of the chart. This can include axis labels, a legend, and annotations to indicate the source of the data and any important findings.
3. Stacking: The chart is constructed by taking the two groups and stacking the positive and negative values on either side of the central axis.
4. Interpretation: Deviation charts can help identify the magnitude and direction of differences between two groups. They can be used to compare a variety of metrics, such as revenue, profit, or performance indicators.
5. Limitations: Deviation charts can be difficult to read if there are many categories or if the differences between the groups are small. They may also be less effective for comparing more than two groups of data.
● Overall, deviation charts can be a useful tool for comparing two groups of data with a common metric, especially when there are significant differences in positive and negative values. They can help highlight patterns and trends in the data, and provide a clear visual representation of the differences between the two groups.
● By comparing the shapes of the curves, it is possible to determine which features have the most influence on the data.
● The curves can also be colored or labeled to represent different groups or classes of observations.
● One limitation of Andrews curves is that they may not be suitable for very large datasets, as the computation of Fourier coefficients can be computationally expensive.
● They also may not be as effective for datasets with complex nonlinear relationships between features.
● Overall, Andrews curves can be a useful tool for visualizing high-dimensional data and exploring the relationships between features.
● They provide a way to represent complex data in a simple and intuitive way, and can help to uncover patterns and insights that may not be apparent with other visualization techniques.
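● The Andrews curves described above can be drawn directly with pandas; a minimal sketch, assuming pandas, Matplotlib, and scikit-learn are available (the Wine dataset is only an illustration):

```python
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves
from sklearn.datasets import load_wine

# Each observation is mapped to a finite Fourier series, giving one curve per row
wine = load_wine(as_frame=True)
df = wine.frame.rename(columns={"target": "class"})

andrews_curves(df, class_column="class", colormap="tab10")
plt.title("Andrews curves (Wine)")
plt.show()
```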
MODULE-4

Q1. Write a short note on Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, outliers are data points that are significantly different from other observations in a dataset.
● Outliers can have a significant impact on statistical analysis, as they can affect the mean, standard deviation, and other measures of central tendency and dispersion.
● Outliers can occur due to various reasons, such as measurement errors, natural variation, extreme events, and data manipulation.
● Outliers can be detected using various techniques, such as box plots, scatter plots, and statistical tests like the Z-score, Tukey's method, and Grubbs' test.
● However, identifying an outlier does not necessarily mean it should be removed from the dataset.
● Outliers can sometimes provide valuable insights into the data and need to be analyzed further.
● It is essential to understand the cause of the outliers and their impact on the data before deciding whether to remove them or not.
● To summarize, outliers are data points that are significantly different from other observations in a dataset, and they can occur due to various reasons. Identifying and analyzing outliers is essential for effective statistical analysis, but it is equally important to understand their cause and impact before deciding whether to remove them or not.
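● A minimal NumPy sketch of Z-score-based outlier detection (the data values and the threshold of 3 are illustrative assumptions, not values given in the notes):

```python
import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7, 10.4, 9.6,
                 10.1, 10.2, 9.9, 10.0, 30.0])

# Z-score: how many standard deviations each point lies from the mean
z_scores = (data - data.mean()) / data.std()

# Flag points whose absolute Z-score exceeds the chosen threshold
threshold = 3
outliers = data[np.abs(z_scores) > threshold]
print("Outliers:", outliers)   # flags the unusually large value 30.0
```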
Q2. Discuss in detail Causes of Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, an outlier is an observation that is significantly different from other observations in a dataset.
● Outliers can occur due to a variety of reasons, and understanding the causes of outliers is crucial to effectively analyze and interpret data.
● Here are some of the common causes of outliers:
1. Measurement error: Outliers can occur due to errors in the measurement process. For example, a device used to measure temperature may malfunction and produce an inaccurate reading that is significantly different from other readings in the dataset.
2. Data entry errors: Human error during data entry can result in outliers. For instance, a data entry operator may accidentally enter a wrong value, leading to an outlier.
3. Sampling errors: Outliers can occur due to sampling errors, where the sample data is not representative of the population. For instance, if a sample of a population is skewed towards one end of the distribution, it may result in outliers in the dataset.
4. Natural variation: In some cases, outliers can occur naturally due to variation in the data. For example, in a dataset of heights of adult humans, there may be a few individuals who are significantly taller or shorter than the rest of the population.
5. Extreme events: Outliers can occur due to extreme events that are not representative of the typical behavior of the system being observed. For example, a stock market crash or a natural disaster can cause outliers in financial or weather datasets, respectively.
6. Data manipulation: Outliers can also be deliberately introduced into a dataset through data manipulation. For example, an individual may add an outlier to a dataset to achieve a particular result or to influence a decision.
● It is essential to identify the cause of outliers in a dataset to determine the appropriate course of action.
● In some cases, outliers may need to be removed from the dataset, while in others, they may be valuable data points that need to be analyzed further.
● Techniques such as data visualization, statistical tests, and outlier detection algorithms can help identify and analyze outliers in a dataset.
Q3. Anomaly detection techniques. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Anomaly detection techniques are used to identify and isolate unusual data points or patterns that deviate from the norm or expected behavior in a dataset.
● Here are some of the commonly used anomaly detection techniques:
1. Statistical methods: Statistical techniques such as the Z-score, Grubbs' test, and the modified Thompson Tau test are used to identify outliers or anomalies in a dataset based on their deviation from the mean or other statistical measures.
2. Machine learning algorithms: Machine learning algorithms, such as clustering, classification, and regression algorithms, can be used to identify anomalies in a dataset. These algorithms are trained on normal data patterns and can detect any deviations from the expected patterns as anomalies.
3. Time-series analysis: Time-series analysis techniques can be used to identify anomalies in time-series data by identifying any sudden or unexpected changes in the data patterns.
4. Pattern recognition: Pattern recognition techniques, such as neural networks and decision trees, can be used to identify anomalies in a dataset based on their deviation from the expected patterns.
5. Visualization techniques: Visualization techniques, such as scatter plots, histograms, and heat maps, can be used to identify anomalies visually by spotting any unusual patterns or outliers in the data.
6. Rule-based methods: Rule-based methods involve setting predefined rules or thresholds to identify anomalies based on specific criteria or domain knowledge.
● To summarize, anomaly detection techniques are used to identify unusual data points or patterns in a dataset.
● These techniques include statistical methods, machine learning algorithms, time-series analysis, pattern recognition, visualization techniques, and rule-based methods.
● The choice of the appropriate technique depends on the type and nature of the data and the specific anomaly detection requirements.
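● As one concrete illustration of the machine learning approach, the sketch below uses scikit-learn's Isolation Forest, an algorithm not named in the notes and shown here only as an example (the synthetic data and parameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Mostly "normal" two-dimensional points plus a few injected anomalies
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# contamination = rough fraction of anomalies expected in the data
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(X)          # +1 = normal, -1 = anomaly
print("Points flagged as anomalies:", int((labels == -1).sum()))
```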
A data point is considered an outlier if its LOF score is significantly higher than the LOF scores of its neighbors (that is, if its local density is much lower than theirs).
3. Distance to cluster center: In clustering-based outlier detection, the distance between a data point and the center of the cluster is computed. If a data point is significantly far from the center of the cluster, it is considered an outlier.
4. DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that identifies core points, border points, and noise points in a dataset. Data points that are classified as noise points are considered outliers.
5. One-class SVM: One-Class Support Vector Machines (SVM) is a machine learning technique that learns a boundary around the normal data points and identifies outliers as data points outside the boundary.
● These distance-based methods can be used alone or in combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data and the application requirements.
● It is important to note that distance-based methods have limitations, such as sensitivity to the choice of distance metric, scalability, and parameter tuning.
● Therefore, careful consideration is required when selecting and applying these methods for outlier detection.
Q6. Describe Outlier detection using density-based methods. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Density-based methods are a popular approach for outlier detection in data analysis.
● These methods identify outliers based on the density of data points in a region of the feature space. In these methods, an outlier is defined as a data point that is located in a low-density region, while normal data points are located in high-density regions.
● Some commonly used density-based methods for outlier detection are:
1. Local Outlier Factor (LOF): LOF measures the local density of a data point relative to its neighbors. It computes the ratio of the average density of the k-nearest neighbors of a data point to its own density. An observation is considered an outlier if its LOF score is significantly greater than 1, meaning its local density is much lower than that of its neighbors.
2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN is a clustering algorithm that identifies core points, border points, and noise points in a dataset. Core points are defined as data points with a minimum number of neighbors within a specific distance. Border points are neighbors of core points that are not core points themselves, while noise points have no neighbors within the specified distance. Data points that are classified as noise points are considered outliers.
3. Gaussian Mixture Model (GMM): GMM is a probabilistic clustering technique that models the data distribution as a mixture of Gaussian distributions. Outliers are identified as data points that have low probabilities of being generated by the model.
4. Local Density-Based Outlier Factor (LDOF): LDOF is a density-based method that measures the degree of outlierness of a data point based on its local density and distance to high-density regions. It computes a score that represents the deviation of a data point from the density distribution of the dataset.
5. Kernel Density Estimation (KDE): KDE is a non-parametric method that estimates the density of data points in a region of the feature space. Outliers are identified as data points with low probability density values.
● These density-based methods can be used alone or in combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data and the application requirements.
● It is important to note that density-based methods have limitations, such as sensitivity to the choice of parameters, scalability, and robustness to high-dimensional data.
● Therefore, careful consideration is required when selecting and applying these methods for outlier detection.
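● A minimal scikit-learn sketch of two of the detectors described above, LOF and DBSCAN (the synthetic data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),      # dense "normal" region
               np.array([[4.0, 4.0], [5.0, -4.0]])])   # two low-density points

# Local Outlier Factor: -1 marks points whose local density is much lower
# than that of their neighbours
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print("LOF outliers at indices:", np.where(lof_labels == -1)[0])

# DBSCAN: points labelled -1 are noise points, i.e. potential outliers
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("DBSCAN noise points at indices:", np.where(db_labels == -1)[0])
```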
Q7. Write a short note on SMOTE. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address the problem of imbalanced datasets.
● In imbalanced datasets, the number of instances in the minority class (e.g., fraud cases) is much smaller than the number of instances in the majority class (e.g., non-fraud cases).
● This can lead to biased models that are unable to accurately predict the minority class.
● SMOTE is a technique that generates synthetic samples of the minority class to balance the dataset.
● The algorithm works by selecting a minority class instance and computing its k nearest neighbors in the feature space.
● Synthetic instances are then created by interpolating between the selected instance and its neighbors.
● The SMOTE algorithm has several advantages over other over-sampling techniques, such as random over-sampling, including:
1. SMOTE generates synthetic instances that are more representative of the minority class than random over-sampling.
2. SMOTE does not create exact copies of existing instances, reducing the risk of overfitting.
3. SMOTE can be combined with other techniques, such as under-sampling, to further balance the dataset.
4. SMOTE can improve the performance of machine learning models, especially in cases where the minority class is under-represented.
● However, SMOTE also has some limitations, such as:
1. SMOTE can create noisy samples that do not accurately represent the minority class.
2. SMOTE can increase the risk of overfitting if the synthetic samples are too similar to the existing samples.
3. SMOTE may not be effective in cases where the minority class is very small or the data is highly imbalanced.
● Overall, SMOTE is a useful technique for balancing imbalanced datasets and improving the performance of machine learning models. However, it should be used with caution and in combination with other techniques to ensure the validity and generalizability of the results.
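● A minimal sketch of SMOTE using the imbalanced-learn package (assumed to be installed alongside scikit-learn; the synthetic dataset is illustrative):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build an imbalanced binary dataset: roughly 5% minority class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between each minority sample and its k nearest
# minority-class neighbours to create synthetic minority instances
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_resampled))
```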
2. Seasonal Component: The second step is to estimate the seasonal component of the time series, which represents the repeating patterns that occur at regular time intervals. This can be done using various methods, such as seasonal indices, Fourier analysis, or seasonal regression models. The seasonal component captures the systematic variation in the time series due to seasonal effects, such as weather, holidays, or economic cycles.
3. Residual Component: The third step is to estimate the residual component of the time series, which represents the random or unpredictable variation in the data that is not explained by the trend or seasonal components. This can be done by subtracting the estimated trend and seasonal components from the original time series. The residual component captures the unexplained variation in the time series and is usually the least important component for forecasting purposes.
● Once the time series has been decomposed into its components, each component can be analyzed separately to better understand its characteristics and behavior.
● For example, the trend component can be used to identify long-term patterns and changes in the data, while the seasonal component can be used to identify seasonal effects and patterns.
● The residual component can be used to identify outliers and random fluctuations in the data.
● The decomposition of a time series can also be used for forecasting by extrapolating the trend and seasonal components into the future and adding them together to obtain a forecast.
● The residual component can then be used to estimate the uncertainty or error in the forecast.
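● A minimal statsmodels sketch of the decomposition described above (the synthetic monthly series and period=12 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series = trend + seasonality + noise
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
trend = np.linspace(100, 160, 72)
seasonal = 10 * np.sin(2 * np.pi * np.arange(72) / 12)
noise = np.random.default_rng(1).normal(0, 2, 72)
series = pd.Series(trend + seasonal + noise, index=idx)

# Additive decomposition into trend, seasonal and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.seasonal.head(12))      # the repeating seasonal pattern
print(result.resid.dropna().head())  # the unexplained (residual) variation
```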
Ans: The average method is a simple and widely used forecasting technique that involves calculating the average of past observations and using it as a forecast for future periods.
● This method is based on the assumption that future values will be similar to past values, and that the average provides a reasonable estimate of the future trend.
● To use the average method, the first step is to collect historical data for the time series. Then, the average of the past observations is calculated and used as the forecast for the next period.
● This process is repeated for each future period, with the forecast for each period equal to the average of the past observations.
● The average method is easy to use and does not require any complex calculations or statistical expertise.
● It can also serve as a simple benchmark for comparing the performance of more advanced forecasting techniques.
● However, the average method has several limitations, including:
1. It does not capture any trend or seasonal patterns in the data, and assumes that future values will be the same as past values.
2. It is sensitive to outliers and extreme values in the data, which can affect the accuracy of the forecast.
3. It does not take into account any external factors or events that may affect the time series, such as changes in the economy or market conditions.
● Despite these limitations, the average method can be useful for short-term forecasting of stable and relatively predictable time series.
● It is also a useful tool for generating quick and simple forecasts, especially when more advanced methods are not available or necessary.
● However, for more complex time series and longer-term forecasts, other forecasting methods may be more appropriate.
Q4. Explain in detail Moving Average smoothing. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Moving Average (MA) smoothing is a widely used statistical method for analyzing time-series data.
● It is a technique for identifying trends and patterns in a time series by calculating an average of the values in a sliding window over the series.
● In simple terms, MA smoothing involves calculating the average of a fixed number of previous data points in a time series.
● This creates a smoothed version of the series, which is useful for identifying trends and patterns that may be difficult to see in the raw data.
● The basic idea behind MA smoothing is to calculate the moving average of a fixed window size (usually denoted as k) for each point in the series.
● The window size determines how many data points are included in the average calculation.
● For example, if the window size is 3, the moving average at each point is calculated by taking the average of the current point and the two previous points.
● To calculate the moving average for each point in the series, we start with the first k data points and calculate the average. We then move the window one data point at a time and recalculate the average for each new window.
● The formula for calculating the moving average for each point in the series is:
MA(t) = (Y(t) + Y(t-1) + ... + Y(t-k+1)) / k
where Y(t) is the value at time t, k is the window size, and MA(t) is the moving average at time t.
● MA smoothing has several advantages. It can help to reduce noise in the data, making it easier to identify trends and patterns. It is also a simple and intuitive method that requires only basic mathematical skills.
● However, MA smoothing also has some limitations. One of the main limitations is that it is sensitive to the choice of window size.
● A small window size may not smooth the data enough to reveal meaningful patterns, while a large window size may smooth the data too much and obscure important features.
● Another limitation is that it is a backward-looking method and may not be suitable for predicting future values beyond the range of the data used to calculate the moving averages.
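● A minimal pandas sketch of moving average smoothing with window size k = 3, matching the example above (the data values are illustrative):

```python
import pandas as pd

series = pd.Series([12, 15, 14, 18, 21, 19, 23, 25, 24, 28])

# Trailing moving average with window size k = 3:
# MA(t) = (Y(t) + Y(t-1) + Y(t-2)) / 3
ma3 = series.rolling(window=3).mean()
print(pd.DataFrame({"raw": series, "MA(3)": ma3}))
```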
Q5. Explain Time series analysis using linear regression. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series analysis is a statistical method used to analyze data that is collected over a period of time.
● In time series analysis, the data is often plotted over time to identify patterns, trends, and other useful information.
● Linear regression is a statistical method that can be used to analyze time series data. It is a technique used to model the relationship between a dependent variable and one or more independent variables.
● In time series analysis, the dependent variable is usually the variable of interest that changes over time, and the independent variables are time-related variables such as time itself or other factors that may affect the dependent variable.
● The basic idea behind linear regression in time series analysis is to use a linear equation to model the relationship between the dependent variable and the independent variables.
● The linear equation takes the form:
Y = a + bX + e
where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and e is the error term.
● The slope and intercept can be estimated using least-squares regression, which minimizes the sum of the squared differences between the predicted values and the actual values.
● Once the slope and intercept are estimated, they can be used to make predictions about future values of the dependent variable based on the independent variables.
● This is useful in time series analysis because it allows us to identify trends and patterns in the data and make predictions about future behavior.
● Linear regression can also be used to test hypotheses about the relationship between the dependent variable and the independent variables.
● For example, we may want to test whether a particular independent variable has a significant effect on the dependent variable, or whether there is a trend in the data over time.
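● A minimal scikit-learn sketch of fitting the linear trend Y = a + bX + e, where X is simply the time index (the data values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Monthly observations with a roughly linear upward trend
y = np.array([102, 108, 111, 115, 121, 124, 130, 133, 138, 142, 147, 151])
X = np.arange(len(y)).reshape(-1, 1)   # time index 0, 1, 2, ... as the predictor

model = LinearRegression().fit(X, y)
print("intercept a:", round(model.intercept_, 2))
print("slope b:", round(model.coef_[0], 2))

# Extrapolate the fitted line to forecast the next three periods
future = np.arange(len(y), len(y) + 3).reshape(-1, 1)
print("forecast:", model.predict(future).round(1))
```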
Q6. Write a short note on ARIMA Model. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The ARIMA (Autoregressive Integrated Moving Average) model is a popular statistical method used for time series analysis and forecasting.
● It is a combination of three methods: autoregression, differencing, and moving average.
● The ARIMA model is a generalization of the simpler ARMA (Autoregressive Moving Average) model, which assumes that the time series is stationary (i.e., the statistical properties of the series do not change over time).
● However, many time series in real-world applications are non-stationary, meaning that the statistical properties change over time.
● ARIMA models can handle non-stationary time series by incorporating differencing, which removes the trend or seasonality component from the series. The ARIMA model also includes autoregression and moving average components, which capture the autocorrelation and noise components of the series, respectively.
● ARIMA models are specified by three parameters: p, d, and q.
● The parameter p represents the autoregression order, which is the number of lagged values of the dependent variable used to predict the current value.
● The parameter q represents the moving average order, which is the number of lagged errors used to predict the current value.
● The parameter d represents the differencing order, which is the number of times the series is differenced to make it stationary.
● ARIMA models can be used for both time series analysis and forecasting.
● For time series analysis, ARIMA models can be used to identify the underlying patterns and trends in the data, and to test for the presence of seasonality or other cyclical components.
● For forecasting, building an ARIMA model typically involves the following steps:
1. Examine the time series data to determine if it is stationary. If it is not, it will need to be differenced to make it stationary.
2. Determine the order of differencing (d) required to make the series stationary. This can be done by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the differenced series.
3. Determine the order of the autoregressive (p) and moving average (q) terms required for the model. This can also be done by examining the ACF and PACF plots of the differenced series.
4. Fit the ARIMA model to the time series data using the chosen values of p, d, and q. This can be done using a variety of statistical software packages.
5. Check the residuals of the ARIMA model for autocorrelation, non-normality, and heteroscedasticity. If any of these issues are present, re-estimate the model or consider using a different model.
6. Use the ARIMA model to make predictions for future time periods.
7. Check the accuracy of the ARIMA model predictions using metrics such as mean absolute error (MAE) or root mean squared error (RMSE).
8. Refine the ARIMA model as necessary based on the prediction accuracy and additional insights gained from the analysis of the time series data.
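● A minimal statsmodels sketch of fitting and forecasting with an ARIMA model (the synthetic series and the order (p, d, q) = (1, 1, 1) are illustrative assumptions, not values prescribed by the notes):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A small non-stationary (trending) series
rng = np.random.default_rng(7)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)))

# ARIMA(1, 1, 1): one autoregressive lag, one difference, one moving average lag
fitted = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 5 periods
print(fitted.forecast(steps=5))
```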
Q7. Write a short note on Mean Absolute Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Error (MAE) is a popular metric used to measure the accuracy of a predictive model.
● It is particularly useful for evaluating models that predict continuous variables, such as regression models.
● MAE measures the average absolute difference between the actual and predicted values of a variable.
● It is calculated by taking the absolute value of the difference between each predicted value and its corresponding actual value, and then taking the average of these differences:
MAE = (1/n) * Σ| yi - ŷi |
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, and Σ is the summation symbol.
● MAE is a useful metric because it is easy to understand and interpret. It measures the average size of the errors made by the model, with larger errors contributing more to the overall score than smaller errors.
● MAE is expressed in the same units as the variable being predicted, which makes it easy to compare the accuracy of different models.
● One limitation of MAE is that it treats all errors as equally important, regardless of their direction.
● This means that a model that consistently underestimates the variable of interest will have the same MAE as a model that consistently overestimates the variable, even though these errors may have different implications for the practical use of the model.
● Despite this limitation, MAE is a widely used metric for evaluating the accuracy of predictive models.
● It is often used in conjunction with other metrics, such as Root Mean Squared Error (RMSE), to provide a more comprehensive evaluation of model performance.
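● The MAE formula above in a few lines of NumPy (the example arrays are illustrative):

```python
import numpy as np

y_true = np.array([120.0, 135.0, 150.0, 160.0, 170.0])
y_pred = np.array([118.0, 140.0, 145.0, 162.0, 165.0])

# MAE = (1/n) * Σ| yi - ŷi |
mae = np.mean(np.abs(y_true - y_pred))
print("MAE:", mae)   # 3.8, in the same units as the variable being predicted
```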
Q8. Write a short note on Root Mean Square Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Root Mean Square Error (RMSE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the model, with larger errors contributing more to the overall score than smaller errors. RMSE is particularly useful for evaluating models that predict continuous variables, such as regression models.
● The formula for calculating RMSE is:
RMSE = sqrt((1/n) * Σ(yi - ŷi)^2)
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, Σ is the summation symbol, and sqrt is the square root function.
● RMSE is expressed in the same units as the variable being predicted, which makes it easy to compare the accuracy of different models.
● One advantage of RMSE over Mean Absolute Error (MAE) is that it gives more weight to larger errors, which may be more important to consider in certain applications.
● However, one limitation of RMSE is that it is sensitive to outliers in the data, which can inflate the value of the metric. Another limitation is that it can be difficult to interpret in practical terms, since it does not have a direct relationship to the performance of the model.
● Despite these limitations, RMSE is widely used as a metric for evaluating the accuracy of predictive models, especially in cases where the variable being predicted has a wide range of possible values.
● It is often used in conjunction with other metrics, such as MAE or Mean Absolute Percentage Error (MAPE), to provide a more comprehensive evaluation of model performance.
Q9. Mean Absolute Percentage Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Percentage Error (MAPE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average percentage difference between the actual and predicted values of a variable, making it particularly useful for evaluating models that predict variables with varying scales or magnitudes.
● The formula for calculating MAPE is:
MAPE = (1/n) * Σ| (yi - ŷi) / yi | × 100%
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, Σ is the summation symbol, and | | represents absolute value.
● MAPE is expressed as a percentage, which makes it easy to interpret and compare across different variables and models.
● It measures the average size of the errors made by the model relative to the actual values of the variable, with larger errors contributing more to the overall score than smaller errors.
● Unlike RMSE, MAPE is not sensitive to the scale of the variable being predicted, which can be an advantage in certain applications.
● One limitation of MAPE is that it is undefined when the actual value of the variable is zero, which can occur in some applications.
● In addition, MAPE can be affected by outliers in the data, which can skew the overall score. Finally, MAPE can be less intuitive to interpret than other metrics, such as RMSE or MAE.
Q10. Discuss in detail Mean Absolute Scaled Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Scaled Error (MASE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the model relative to the errors made by a simple benchmark model, making it particularly useful for evaluating models that make predictions over time series data.
● The formula for calculating MASE is:
MASE = (1/n) * Σ| (yi - ŷi) / MAE |
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, | | represents absolute value, and MAE is the mean absolute error of a benchmark model that always predicts the value of the variable as its most recent observation.
● MASE is a unitless measure, which makes it easy to interpret and compare across different models and datasets.
● It measures the average size of the errors made by the model relative to the errors made by a simple benchmark model, with values less than 1 indicating that the model is better than the benchmark model and values greater than 1 indicating that the benchmark model is better.
● One advantage of MASE over other metrics, such as RMSE or MAPE, is that it provides a standardized way to compare the accuracy of different models over time series data, without being affected by the scale of the variable being predicted or by outliers in the data.
● In addition, MASE is less sensitive to changes in the distribution of the data over time, which can be an advantage in applications where the underlying data is subject to external factors, such as seasonality or trend.
● Despite these advantages, MASE can be more difficult to calculate than other metrics, since it requires the calculation of a benchmark model for each time series.
● In addition, MASE can be affected by the choice of benchmark model, which can influence the overall score.
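● A minimal NumPy sketch of the MASE calculation above, using the naive 'most recent observation' benchmark (the arrays are illustrative):

```python
import numpy as np

y_true = np.array([100.0, 104.0, 109.0, 115.0, 120.0, 126.0])
y_pred = np.array([101.0, 105.0, 108.0, 116.0, 121.0, 125.0])

# Benchmark: always predict the previous (most recent) observation
benchmark_mae = np.mean(np.abs(y_true[1:] - y_true[:-1]))

# MASE = mean(| yi - ŷi |) / MAE of the naive benchmark
mase = np.mean(np.abs(y_true - y_pred)) / benchmark_mae
print("MASE:", round(mase, 3))   # below 1, so the model beats the naive benchmark
```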
Q11. Explain Evaluation parameters for Classification. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Evaluation parameters for classification are metrics that are used to evaluate the performance of a classification model.
● These metrics provide a quantitative measure of how well the model is able to correctly classify instances into their respective classes.
● Some of the commonly used evaluation parameters for classification are:
1. Accuracy: It is the proportion of correct predictions made by the model out of the total number of predictions. It is calculated as follows:
Accuracy = Number of Correct Predictions / Total Number of Predictions
2. Precision: It is the proportion of true positive predictions out of the total number of positive predictions made by the model. It is calculated as follows:
Precision = True Positives / (True Positives + False Positives)
3. Recall (Sensitivity): It is the proportion of true positive predictions out of the total number of actual positive instances in the dataset. It is calculated as follows:
Recall = True Positives / (True Positives + False Negatives)
4. F1 Score: It is the harmonic mean of precision and recall. It is a measure that balances between precision and recall. It is calculated as follows:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
5. Specificity: It is the proportion of true negative predictions out of the total number of actual negative instances in the dataset. It is calculated as follows:
Specificity = True Negatives / (True Negatives + False Positives)
6. Area under the receiver operating characteristic curve (AUC-ROC): It is a measure of the classifier's ability to distinguish between positive and negative classes. It is calculated as the area under the ROC curve, which is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds.
● These evaluation parameters help in understanding the performance of the classification model in terms of its ability to accurately classify instances into their respective classes.
● Based on the specific requirements of the classification task, one or more of these parameters can be used to evaluate and compare the performance of different classification models.
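● A minimal scikit-learn sketch that computes several of these parameters (the label vectors and scores are illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # predicted probabilities

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_score))

# Specificity = True Negatives / (True Negatives + False Positives)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
```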
Q12. Write a short note on regression and clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Regression and clustering are two different types of machine learning algorithms that are used for different purposes.
Regression:
● It is a type of supervised learning algorithm that is used to model the relationship between a dependent variable and one or more independent variables.
● The goal of regression is to find a mathematical function that can predict the value of the dependent variable based on the values of the independent variables.
● Regression is used to solve problems such as predicting house prices based on their features, estimating sales based on advertising spend, or forecasting the stock market based on historical data.
Clustering:
● Clustering, on the other hand, is a type of unsupervised learning algorithm that is used to group similar data points together based on their features.
● The goal of clustering is to identify clusters or groups of data points that share similar characteristics, without any prior knowledge of the groups.
● Clustering is used to solve problems such as customer segmentation, fraud detection, or image segmentation.
● While regression and clustering are different types of algorithms, they can be used together in certain scenarios.
● One use case for combining regression and clustering is to use clustering to identify outliers in the data, which can then be removed before fitting the regression model.
● Outliers can have a significant impact on the regression model, as they can skew the fitted relationship between the variables.
● By removing the outliers identified through clustering, the accuracy of the regression model can be improved.
Ans: Predictive modeling is a statistical and data mining technique used to create a model that can predict future events or behaviors based on historical data.
● It involves identifying patterns and relationships within data sets to make predictions about future outcomes.
● The process of predictive modeling typically involves several steps, including data collection, data cleaning and preprocessing, feature selection, model training, model evaluation, and deployment.
● The goal is to build a model that accurately predicts future outcomes based on the available data.
● Predictive modeling has many practical applications, including fraud detection, marketing analytics, credit scoring, and risk management.
● It is widely used in industries such as finance, insurance, healthcare, and retail to help organizations make better decisions and improve their performance.
● Predictive modeling is a powerful tool that enables businesses to use data to make more informed decisions and gain a competitive advantage.
● However, it requires a deep understanding of statistical and data science concepts and techniques, as well as access to high-quality data and advanced analytics tools.
5. Model training: Once the model has been selected, it needs to be trained using the available data. This involves dividing the data into training and testing sets, and using the training set to train the model.
6. Model evaluation: After the model has been trained, it needs to be evaluated to determine its accuracy and performance. This can be done by comparing the predicted results with the actual results of past transactions.
7. Model deployment: Finally, the model can be deployed to identify fraudulent activities in real-time. This can be done using a web-based interface or a mobile app.
Overall, fraud detection is a challenging task that requires a deep understanding of statistical and machine learning concepts. However, with the right data and tools, it can be an effective way to prevent financial losses and protect consumer data.
Q4. Write a short note on Clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Clustering is a popular unsupervised learning technique used in machine learning and data mining. The goal of clustering is to group similar data points together based on their characteristics or features, without prior knowledge of the specific classes or labels.
The process of clustering typically involves several steps:
1. Data preprocessing: The first step is to clean and preprocess the data, including removing duplicates, filling in missing values, and normalizing the data.
2. Feature selection: The next step is to select the relevant features that will be used to group the data points. This can be done using various techniques such as correlation analysis, principal component analysis, or domain expertise.
3. Similarity or distance measure: The similarity or distance measure is a key component of clustering, which is used to determine how similar or dissimilar two data points are. Common measures include Euclidean distance, cosine similarity, and Manhattan distance.
4. Clustering algorithm: There are several clustering algorithms that can be used, including K-means, hierarchical clustering, and density-based clustering. The choice of algorithm depends on the specific problem and the characteristics of the data.
5. Evaluation: After clustering, it is important to evaluate the quality of the clustering results. This can be done using various metrics such as silhouette score, Dunn index, or Calinski-Harabasz index.
Clustering has many practical applications, including customer segmentation, image segmentation, and anomaly detection. It is widely used in industries such as marketing, healthcare, and finance to gain insights from large datasets and make better decisions.
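● A minimal scikit-learn sketch of the clustering workflow above, using K-means with Euclidean distance and the silhouette score (the synthetic data and the choice of k = 3 are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Steps 1-2: prepare the data (here a synthetic dataset with 3 natural groups)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)   # normalise the features

# Steps 3-4: K-means assigns each point to the nearest of k cluster centres
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Step 5: evaluate cluster quality (silhouette score closer to 1 is better)
print("Silhouette score:", round(silhouette_score(X_scaled, labels), 3))
```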
Q5. Describe in detail Customer Segmentation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Customer segmentation is a technique used by businesses to divide their customers into groups based on common characteristics such as demographics, behavior, and purchasing habits. The goal of customer segmentation is to better understand the needs and preferences of different customer groups and tailor marketing strategies to each group accordingly.
Here are the steps involved in customer segmentation:
1. Data collection: The first step is to gather data on customers, including information such as age, gender, income, and purchase history.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and converting categorical variables into numerical values.
3. Feature selection: The next step is to select the relevant features that will be used to group the customers. This can be done using various techniques such as correlation analysis, principal component analysis, or domain expertise.
4. Similarity or distance measure: The similarity or distance measure is a key component of customer segmentation, which is used to determine how similar or dissimilar two customers are. Common measures include Euclidean distance, cosine similarity, and Manhattan distance.
5. Clustering algorithm: There are several clustering algorithms that can be used for customer segmentation, including K-means, hierarchical clustering, and density-based clustering. The choice of algorithm depends on the specific problem and the characteristics of the data.
6. Evaluation: After clustering, it is important to evaluate the quality of the clustering results. This can be done using various metrics such as silhouette score, Dunn index, or Calinski-Harabasz index.
7. Marketing strategy: Once the customers have been segmented into groups, businesses can tailor their marketing strategies to each group. For example, a company may create different advertising campaigns for high-income customers versus low-income customers.
Customer segmentation has many practical applications, including improving customer retention, increasing customer lifetime value, and optimizing marketing spend. It is widely used in industries such as retail, e-commerce, and healthcare to gain insights from large datasets and make better decisions.
Q6. Explain in detail Time series forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series forecasting is a popular technique in data science used to predict future values of a time-dependent variable based on historical data. Time series data consists of observations taken at regular time intervals, such as daily, weekly, or monthly, and includes data from a wide range of domains such as finance, economics, weather, and energy.
Here are the steps involved in time series forecasting:
1. Data collection: The first step is to gather historical data on the time series variable of interest, including the time stamps and values.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Data exploration: The next step is to explore the data to identify trends and patterns over time. This can be done using various visualization techniques such as line charts and scatterplots.
4. Time series modeling: There are several time series models that can be used for forecasting, including ARIMA (autoregressive integrated moving average), exponential smoothing, and seasonal decomposition. The choice of model depends on the specific problem and the characteristics of the data.
5. Model evaluation: After building the model, it is important to evaluate its performance using metrics such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). This helps to ensure that the model is accurate and reliable.
6. Forecasting: Once the time series model has been evaluated, it can be used to make predictions about future values of the time-dependent variable. This can help businesses and organizations make better decisions and plan for the future.
Time series forecasting has many practical applications, including predicting stock prices, forecasting demand for products, and estimating energy consumption. It is widely used in industries such as finance, retail, and manufacturing to gain insights from historical data and make better decisions.
Q7. Write a short note on Weather Forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Weather forecasting is the process of predicting the future state of the atmosphere at a given location and time. Weather forecasting is an important application of data science and is widely used in a range of fields such as agriculture, aviation, transportation, and emergency management.
The process of weather forecasting typically involves the following steps:
1. Data collection: The first step is to collect a range of weather data, including temperature, humidity, pressure, wind speed, and precipitation. This data can be collected from various sources such as weather stations, satellites, and radars.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Weather modeling: There are several models that can be used for weather forecasting, including numerical weather prediction models, statistical models, and machine learning models. The choice of model depends on the specific problem and the characteristics of the data.
4. Model evaluation: After building the weather model, it is important to evaluate its performance using various metrics such as accuracy, precision, and recall. This helps to ensure that the model is accurate and reliable.
5. Forecasting: Once the weather model has been evaluated, it can be used to make predictions about future weather conditions. These predictions can be used to provide weather alerts and advisories to the public, as well as to inform decision-making in various industries.
Weather forecasting has many practical applications, including predicting storms, droughts, and heat waves, as well as forecasting crop yields and informing transportation planning. It is a critical tool for emergency management and disaster response, helping to save lives and minimize property damage.
Q8. Explain in detail Product recommendation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Product recommendation is a technique in data science and machine learning used to suggest products to users based on their past behavior and preferences.
● This is done by analyzing the user's historical data, such as their purchase history, search history, and clickstream data, and using this information to make personalized recommendations.
● Product recommendation is a powerful technique in data science and machine learning that can help businesses and organizations provide personalized recommendations to users, leading to increased sales and customer satisfaction.
● Here are the steps involved in product recommendation:
1. Data collection: The first step is to collect data on user behavior, such as their purchase history, search history, and clickstream data. This data can be collected from various sources such as e-commerce websites, social media platforms, and mobile apps.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Feature extraction: The next step is to extract features from the data that are relevant to the product recommendation task. For example, features such as the user's age, gender, location, and previous purchases can be used to make recommendations.
4. Recommendation engine: There are several recommendation algorithms that can be used, including collaborative filtering, content-based filtering, and hybrid models. The choice of algorithm depends on the specific problem and the characteristics of the data.
5. Model training: After choosing the recommendation algorithm, the model needs to be trained on the historical data to learn patterns and relationships between users and products.
6. Model evaluation: Once the model has been trained, it needs to be evaluated using various metrics such as precision, recall, and F1 score. This helps to ensure that the model is accurate and reliable.
7. Recommendation generation: Once the model has been evaluated, it can be used to generate personalized recommendations for users based on their past behavior and preferences.
● Product recommendation has many practical applications, including in e-commerce, advertising, and entertainment.
● It can help businesses increase sales by suggesting relevant products to customers, as well as improve customer satisfaction by providing personalized recommendations.