Brainheaters™ LLC
Brainheaters Notes
Applied Data Science (ADS)
COMPUTER Semester-8
MODULE-1

Q1. Write a short note on Data Science. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science is a multidisciplinary field that involves the use of statistical, computational, and machine learning techniques to extract insights and knowledge from complex data sets.
● It encompasses a range of topics including data analysis, data mining, machine learning, and artificial intelligence.
● Data Science involves a structured approach to analyzing and interpreting large and complex data sets.
● This includes identifying patterns, making predictions, and developing models that can be used to gain insights and drive business decisions.
● Data Scientists use a variety of tools and techniques, such as statistical programming languages like R and Python, to work with data sets and extract meaningful information.
● The applications of Data Science are widespread and can be found in various industries, such as healthcare, finance, marketing, and transportation. It plays a crucial role in providing businesses with valuable insights into customer behavior, market trends, and operational efficiencies.
● To become a Data Scientist, one typically needs to have a strong background in mathematics, statistics, and computer science.
● Many universities now offer Data Science programs, which provide students with the necessary skills to work in this field.
● The demand for Data Scientists is growing rapidly, and it is expected to continue to increase in the future.

Q2. Explain in detail the Data Science Process. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Data Science process is a structured and iterative approach to solving complex problems using data.
● It involves a set of steps that help to extract insights and knowledge from data to inform business decisions. The following are the steps involved in the Data Science process:
1. Problem Formulation: In this step, the problem is identified and the business objective is defined. The goal is to determine what questions need to be answered using data and to ensure that they align with the business goals.
2. Data Collection: The next step is to collect relevant data. This can involve various sources, such as internal databases, public data repositories, or web scraping. The data must be accurate, complete, and representative of the problem.
3. Data Cleaning and Preparation: In this step, the collected data is cleaned, pre-processed, and transformed to ensure that it is consistent and ready for analysis. This involves tasks such as removing duplicates, handling missing values, and encoding categorical variables.
4. Data Exploration: The goal of this step is to understand the data and gain insights. This involves the use of descriptive statistics, data visualization, and exploratory analysis techniques to identify patterns, correlations, and trends.
5. Feature Engineering: This step involves the creation of new features or variables from the existing data. This is done to improve the performance of the predictive model or to gain further insights into the problem.
6. Model Selection and Training: In this step, a suitable model is selected to solve the problem. This involves a trade-off between accuracy, interpretability, and complexity. Once the model is selected, it is trained on the data to learn the underlying patterns and relationships.
7. Model Evaluation: The performance of the model is evaluated using metrics to determine whether the model is performing well or whether it needs to be improved.
8. Model Deployment: Once the model has been evaluated and deemed acceptable, it can be deployed into production. This involves integrating the model into a production environment or application.
9. Model Monitoring and Maintenance: The final step involves monitoring the performance of the model in the real-world environment. This involves tracking the model's performance, detecting any drift, and retraining the model as necessary.
● In summary, the Data Science process is a systematic and iterative approach that involves problem formulation, data collection, data cleaning and preparation, data exploration, feature engineering, model selection and training, model evaluation, model deployment, and model monitoring and maintenance.
● The goal is to extract insights and knowledge from data to inform business decisions.
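As a rough illustration of steps 2-7, here is a minimal Python sketch using pandas and scikit-learn. The file name customers.csv and the columns age, income, and churn are hypothetical placeholders used only for this sketch.

# Minimal sketch of the core Data Science process steps (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 2-3: data collection and cleaning (assumed CSV with age, income, churn columns)
df = pd.read_csv("customers.csv")
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())   # handle missing values

# Step 4: quick exploration
print(df.describe())

# Step 6: model selection and training
X, y = df[["age", "income"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 7: model evaluation
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))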
Q3. Describe Motivation to use Data Science Techniques: Volume, Dimensions and Complexity. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science techniques are used to solve problems involving large, complex, and high-dimensional data sets that cannot be analyzed using traditional methods.
● The motivation to use Data Science techniques is driven by three main factors: volume, dimensions, and complexity.
1. Volume: The amount of data being generated is growing at an unprecedented rate. Large volumes of data are being generated from various sources such as social media, IoT devices, and sensors. Data Science techniques are used to manage and analyze these large volumes of data efficiently.
2. Dimensions: The number of dimensions (features) in a data set can be very high. For example, in genetics research, the number of genes being analyzed can be in the millions. Data Science techniques such as dimensionality reduction, feature selection, and feature extraction are used to reduce the number of dimensions in the data and to identify the most relevant features.
3. Complexity: Data sets can be complex, containing non-linear relationships between variables, missing values, and noisy data. Data Science techniques such as machine learning are used to model and analyze such complex data.
3. Model Evaluation: Model evaluation involves assessing the quality of the model's predictions or decisions. This includes measuring the model's accuracy, precision, recall, and other performance metrics.
4. Model Deployment: Model deployment involves integrating the model into the decision-making process. This includes implementing the model in a software application, developing an API for the model, or integrating the model into existing systems.

Data Science vs. Data Analytics (Parameter: Definition)
● Data Science: A field that encompasses a wide range of techniques, tools, and methodologies for working with data.
● Data Analytics: A subset of data science that focuses on using statistical and computational methods to explore, analyze, and extract insights from data.

Understanding the type of data is important because it can influence the choice of statistical and computational techniques used for analysis.

Q2. Discuss in detail Properties of data. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In Data Science, data is often described in terms of its properties, which are characteristics that define the data and influence how it can be analyzed and processed. Here are some of the key properties of data:
1. Scale: Scale refers to the range and distribution of values in the data. Data can have a small or large scale, depending on the range of values that it encompasses. For example, a dataset containing the ages of people in a population might have a scale of 0 to 100 years.
2. Resolution: Resolution refers to the level of detail or granularity in the data. Data can be high resolution, with a fine level of detail, or low resolution, with a coarser level of detail. For example, satellite imagery can have a high resolution, allowing for the identification of small details on the ground, while weather data might have a lower resolution, providing broader information about a region.
3. Accuracy: Accuracy refers to the degree to which the data represents the true or intended values. Accurate data is essential for making informed decisions and drawing accurate conclusions. For example, a dataset containing inaccurate or incomplete customer information could lead to incorrect conclusions about customer behavior.
4. Completeness: Completeness refers to the extent to which the data represents the full set of values or observations that are needed. Incomplete data can result in gaps or biases in the analysis, and can limit the ability to draw accurate conclusions. For example, a dataset that contains only a subset of the population may not accurately represent the true population.
5. Consistency: Consistency refers to the degree to which the data is uniform and follows a consistent format or structure. Inconsistent data can make it difficult to analyze and compare data, and can lead to errors in analysis. For example, a dataset that contains inconsistent date formats could make it difficult to accurately analyze time-series data.
6. Relevance: Relevance refers to the extent to which the data is useful for the intended analysis or application. Relevant data is essential for making informed decisions and drawing accurate conclusions. For example, a dataset containing irrelevant variables or data points could lead to incorrect conclusions about the relationships between variables.
7. Timeliness: Timeliness refers to the degree to which the data is current and relevant to the intended analysis or application. Timely data is essential for making informed decisions and drawing accurate conclusions. For example, stock price data that is delayed or outdated could lead to incorrect conclusions about market trends.
Q3. Explain in detail Descriptive Statistics. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Descriptive statistics refers to the process of analyzing and summarizing data using various statistical methods.
● The purpose of descriptive statistics is to provide an overview of the data and to help identify patterns, trends, and relationships that may be present.
● Here are some of the key methods used in descriptive statistics:
1. Measures of Central Tendency: Measures of central tendency are statistics that represent the "center" of the data, or the typical or average value. The three most common measures of central tendency are the mean (average), median (middle value), and mode (most common value).
2. Measures of Variability: Measures of variability are statistics that describe how spread out or varied the data is. The most common measures of variability are the range (difference between the highest and lowest values), variance (average squared deviation from the mean), and standard deviation (square root of the variance).
3. Frequency Distributions: Frequency distributions show how often each value or range of values occurs in the data. Frequency distributions can be displayed using histograms, bar charts, or pie charts.
4. Correlation Analysis: Correlation analysis is used to identify the relationship between two variables. Correlation coefficients range from -1 to +1, with a value of 0 indicating no correlation and a value of +1 indicating a perfect positive correlation.
5. Regression Analysis: Regression analysis is used to model the relationship between two or more variables. Simple linear regression models the relationship between two variables, while multiple regression models the relationship between multiple variables.
6. Percentiles and Quartiles: Percentiles and quartiles are used to divide the data into equal parts based on their rank or position. The median represents the 50th percentile, while the quartiles divide the data into quarters (25th, 50th, and 75th percentiles).
● Descriptive statistics is an important tool for analyzing and summarizing data in a meaningful way. It is often used to provide a baseline understanding of the data before more complex analyses are performed.
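A small Python sketch showing how these descriptive statistics can be computed with pandas (the numbers below are made up for illustration):

import pandas as pd

marks = pd.Series([45, 52, 52, 60, 61, 63, 70, 75, 88, 95])

print(marks.mean())      # central tendency: mean
print(marks.median())    # median (50th percentile)
print(marks.mode()[0])   # mode (most frequent value)
print(marks.max() - marks.min())          # range
print(marks.var(), marks.std())           # variance and standard deviation
print(marks.quantile([0.25, 0.5, 0.75]))  # quartiles / percentiles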
Q4. Describe in detail Univariate Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Univariate exploration is a data analysis technique that focuses on analyzing a single variable at a time.
● The purpose of univariate exploration is to gain an understanding of the distribution and characteristics of a single variable, which can help in identifying any patterns or outliers in the data.
● Here are some of the key methods used in univariate exploration:
1. Histograms: Histograms are used to visualize the distribution of a single variable. A histogram is a graph that displays the frequency of data within different intervals or bins. The height of each bar represents the number of data points in that interval.
2. Box plots: Box plots are used to visualize the distribution of a single variable by displaying the median, quartiles, and outliers. A box plot consists of a box that spans the interquartile range (IQR) and whiskers that extend to the highest and lowest values within 1.5 times the IQR.
3. Density plots: Density plots are used to visualize the probability density function of a single variable. A density plot is a smoothed version of a histogram that represents the distribution of the variable as a continuous curve.
4. Bar charts: Bar charts are used to visualize the distribution of categorical variables. A bar chart displays the frequency or proportion of each category as a bar.
5. Summary statistics: Summary statistics such as mean, median, mode, variance, and standard deviation can be used to describe the central tendency and variability of a single variable.
6. Skewness and Kurtosis: Skewness and Kurtosis are used to measure the shape of the distribution of a variable. Skewness measures the asymmetry of the distribution, while kurtosis measures the degree of peakedness or flatness of the distribution.
● Univariate exploration is an important step in data analysis as it provides a detailed understanding of a single variable and helps in identifying any patterns or outliers in the data.
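A minimal sketch of univariate exploration in Python using matplotlib; the ages variable is randomly generated here purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
ages = np.random.normal(loc=35, scale=10, size=500)   # hypothetical single variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)        # histogram of the distribution
ax1.set_title("Histogram")
ax2.boxplot(ages, vert=False)  # box plot: median, quartiles, outliers
ax2.set_title("Box plot")
plt.show()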
Q5. Explain in detail the Measure of Central Tendency. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of central tendency are statistics that describe the "center" or typical value of a dataset. There are three common measures of central tendency: the mean, median, and mode.
1. Mean: The mean is calculated by adding up all the values in a dataset and then dividing by the number of values. It is the most commonly used measure of central tendency. However, it can be sensitive to outliers, or extreme values, which can skew the mean.
2. Median: The median is the middle value of a dataset when the values are arranged in order. If there is an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean.
3. Mode: The mode is the value that occurs most frequently in a dataset. It can be useful in identifying the most common value in a dataset, but it may not be a good measure of central tendency if there are multiple modes or if the dataset is continuous.
Each measure of central tendency has its advantages and disadvantages, and the choice of which to use depends on the nature of the data and the research question. For example, if the data is normally distributed, the mean may be the most appropriate measure of central tendency. However, if the data is skewed or contains outliers, the median may be a better choice. Similarly, the mode may be useful for categorical data, but not for continuous data.
Q6. Write down Measures of Spread, Symmetry. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of spread and symmetry are important descriptive statistics that help to characterize the distribution of a dataset. Here are some of the most common measures of spread and symmetry:
● Measures of spread:
1. Range: The range is the difference between the largest and smallest values in a dataset.
2. Interquartile range (IQR): The IQR is the range of the middle 50% of the dataset, calculated by subtracting the 25th percentile from the 75th percentile.
3. Variance: The variance is the average of the squared differences from the mean. It measures how much the data varies from the mean.
4. Standard deviation: The standard deviation is the square root of the variance. It measures the spread of the data around the mean.
● Measures of symmetry:
1. Skewness: Skewness is a measure of the asymmetry of the distribution. A positive skew indicates that the tail of the distribution is longer on the positive side, while a negative skew indicates that the tail is longer on the negative side. A skewness value of zero indicates that the distribution is perfectly symmetrical.
2. Kurtosis: Kurtosis is a measure of the "peakedness" of the distribution. A high kurtosis value indicates that the distribution has a sharp peak and heavy tails, while a low kurtosis value indicates a flat or rounded distribution.
● These measures are important because they provide valuable information about the shape and variability of a dataset. Understanding the spread and symmetry of a dataset can help in making informed decisions about the data and in identifying any patterns or outliers that may be present.
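These measures of spread and symmetry can be computed directly with pandas, as in this small illustrative sketch (the values are made up; note how the single large value 40 stretches the range and produces positive skewness):

import pandas as pd

data = pd.Series([2, 4, 4, 5, 7, 9, 12, 13, 15, 40])   # note the extreme value 40

print(data.max() - data.min())                    # range
print(data.quantile(0.75) - data.quantile(0.25))  # interquartile range (IQR)
print(data.var(), data.std())                     # variance and standard deviation
print(data.skew())                                # skewness (positive: long right tail)
print(data.kurt())                                # (excess) kurtosis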
Q7. Write a short note on Skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Skewness is a measure of the asymmetry of a probability distribution. It describes the extent to which a distribution deviates from symmetry around its mean.
● A distribution can be skewed to the left (negative skewness) or skewed to the right (positive skewness).
● If a distribution is skewed to the left, the tail of the distribution is longer on the left-hand side, and the mean is less than the median.
● This is because there are more extreme values on the left side of the distribution. Conversely, if a distribution is skewed to the right, the tail of the distribution is longer on the right-hand side, and the mean is greater than the median.
● This is because there are more extreme values on the right side of the distribution.
● Skewness can be quantified using a number of different measures, such as Pearson's moment coefficient of skewness, the Bowley skewness, or the quartile skewness coefficient.
● Skewness is an important measure of distributional shape, as it provides insight into the direction and degree of deviation from symmetry.
● Skewed distributions can have important implications in statistical analysis, as they can affect the interpretation of statistical tests, such as the t-test or ANOVA.
Q8. Discuss in detail Karl Pearson Coefficient of skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Karl Pearson Coefficient of skewness is a measure of the skewness of a distribution.
● Pearson's first coefficient is defined as the ratio of the difference between the mean and the mode of a distribution to the standard deviation of the distribution. When the mode is not well defined, the second coefficient, based on the median, is used instead.
● This measure was developed by Karl Pearson, a British mathematician and statistician.
● The formula for the Karl Pearson Coefficient of skewness (second coefficient) is:
Skewness = 3 * (Mean - Median) / Standard deviation
Where,
Mean = arithmetic mean of the dataset
Median = median of the dataset
Standard deviation = standard deviation of the dataset
● The Karl Pearson Coefficient of skewness is a dimensionless measure of skewness, meaning that it has no units.
● The measure is always zero for a perfectly symmetrical distribution.
● Positive values of skewness indicate that the tail of the distribution is longer on the right-hand side, while negative values of skewness indicate that the tail is longer on the left-hand side.
● One of the main advantages of the Karl Pearson Coefficient of skewness is that it is easy to calculate and interpret.
● However, it can be sensitive to outliers in the data, and it may not be appropriate for distributions that are heavily skewed.
● Overall, the Karl Pearson Coefficient of skewness is a useful measure of the skewness of a distribution, and it can provide valuable insight into the shape and characteristics of the data.
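A small worked example of the coefficient (using the median-based form given above) on a made-up dataset:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 12])          # hypothetical dataset
mean, median, std = data.mean(), np.median(data), data.std(ddof=1)

skewness = 3 * (mean - median) / std                 # Karl Pearson's (second) coefficient
print(round(skewness, 3))                            # positive value -> right-skewed data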
Q9. Explain in detail Bowley's Coefficient. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Bowley's coefficient of skewness, also known as the quartile skewness coefficient, is a measure of the skewness of a distribution.
● It is based on the difference between the upper and lower quartiles of a dataset. The measure was developed by Arthur Bowley, an English statistician.
● The formula for Bowley's coefficient of skewness is:
Skewness = (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)
Where,
Q1 = first quartile of the dataset
Q2 = second quartile (median) of the dataset
Q3 = third quartile of the dataset
● The coefficient of skewness can take values ranging from -1 to +1. A value of zero indicates that the distribution is perfectly symmetrical, while negative and positive values indicate that the distribution is skewed to the left or right, respectively.
● Overall, Bowley's coefficient of skewness is a useful measure of the skewness of a distribution, and it can provide valuable insight into the shape and characteristics of the data.
● One of the main advantages of Bowley's coefficient of skewness is that it is less sensitive to extreme values or outliers than other measures of skewness, such as the Karl Pearson coefficient. This is because it is based on quartiles, which are less affected by extreme values than the mean and standard deviation.
● However, one limitation of Bowley's coefficient of skewness is that it is only based on three quartiles of the dataset, and may not be as accurate as other measures of skewness for distributions that are heavily skewed.
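A short illustrative computation of Bowley's coefficient with NumPy, using the formula above on a made-up dataset:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 12])            # hypothetical dataset
q1, q2, q3 = np.percentile(data, [25, 50, 75])

bowley = (q3 + q1 - 2 * q2) / (q3 - q1)                # quartile (Bowley) skewness
print(round(bowley, 3))                                # lies between -1 and +1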
Q10. Discuss in detail Kurtosis (Multivariate Exploration). (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Kurtosis is a statistical measure that describes the shape of a distribution. It is a measure of the "peakedness" or "flatness" of a distribution compared to a normal distribution.
● Multivariate exploration refers to the analysis of more than one variable at a time. In this context, kurtosis can be used to analyze the relationship between multiple variables in a dataset.
● The most commonly used measure of kurtosis is Pearson's coefficient of kurtosis, which is calculated by dividing the fourth central moment by the square of the variance.
● The formula for Pearson's (excess) coefficient of kurtosis is:
Kurtosis = (M4 / S⁴) - 3
Where,
M4 = fourth central moment of the dataset
S = standard deviation of the dataset
● The value of kurtosis can be positive or negative. A positive value indicates that the distribution is more peaked than a normal distribution, while a negative value indicates that the distribution is flatter than a normal distribution.
● A value of zero indicates that the distribution is similar in shape to a normal distribution.
● In multivariate exploration, kurtosis can be used to analyze the relationship between multiple variables in a dataset.
● For example, if two variables have similar levels of kurtosis, it may indicate that they are related or have a similar distribution.
● On the other hand, if two variables have different levels of kurtosis, it may indicate that they are not related or have different distributions.
● One limitation of kurtosis in multivariate exploration is that it only measures the shape of a distribution, and does not take into account other factors such as the location and spread of the data.
● Therefore, it is important to use kurtosis in combination with other measures such as central tendency and dispersion when analyzing relationships between multiple variables.
Q11. Explain in detail Central Data Point. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Central data point is a term used to describe a specific value that represents the central tendency or central location of a dataset.
● In statistics, measures of central tendency are used to describe the typical or most common value in a dataset.
● There are three commonly used measures of central tendency: mean, median, and mode.
● The mean is calculated by adding up all the values in a dataset and dividing by the number of observations. The mean is a sensitive measure of central tendency and can be affected by outliers or extreme values in the dataset.
● The median is the middle value in a dataset when the observations are arranged in order from smallest to largest. The median is less sensitive to extreme values and is a more robust measure of central tendency than the mean.
● The mode is the value that appears most frequently in a dataset. The mode is useful when there is a high frequency of one or a few specific values in the dataset.
● The choice of central data point depends on the characteristics of the dataset and the research question being addressed.
● In some cases, the mean may be more appropriate, while in other cases the median or mode may be more appropriate.
● For example, if the dataset contains extreme values or outliers, the median may be a better measure of central tendency than the mean.
Q12. Write a short note on Correlation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Correlation is a statistical measure that describes the relationship between two variables.
● It is used to determine whether there is a statistical association between the two variables, and if so, the strength and direction of the association. Correlation can be expressed as a numerical value known as the correlation coefficient.
● The correlation coefficient ranges from -1 to +1.
● A value of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other variable decreases in a linear fashion.
● A value of +1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable increases in a linear fashion. A value of 0 indicates no correlation between the two variables.
● Correlation is commonly used in research to investigate the relationship between variables.
● For example, in a medical study, correlation may be used to determine whether there is a relationship between a specific treatment and patient outcomes.
● In finance, correlation may be used to determine whether there is a relationship between the performance of two stocks or investments.
● It is important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.
● It is also important to consider other factors that may influence the relationship between the variables.
● Correlation is a useful statistical tool for investigating the relationship between two variables.
● It can help researchers identify patterns and trends in the data and make predictions about future outcomes.
● However, it is important to interpret correlation in the context of the research question being addressed and consider other factors that may influence the relationship between the variables.

The main forms of correlation are:
1. Pearson correlation: This form of correlation measures the linear relationship between two continuous variables. It ranges from -1 to +1, with 0 indicating no correlation and values closer to -1 or +1 indicating a stronger correlation.
2. Spearman correlation: This form of correlation measures the relationship between two variables based on their rank order. It is used when the variables are ordinal or when the relationship is non-linear.
3. Kendall correlation: This form of correlation is also based on rank order and measures the strength and direction of the relationship between two variables. It is often used when the data is non-parametric or when the variables are not normally distributed.
4. Point-biserial correlation: This form of correlation measures the relationship between a continuous variable and a dichotomous variable. It is used when one variable is continuous and the other variable is binary.
5. Biserial correlation: This form of correlation measures the relationship between a continuous variable and a dichotomous variable that has an underlying continuous distribution.
6. Phi coefficient: This form of correlation is used when both variables are dichotomous; it is equivalent to applying the Pearson correlation to two binary variables.
The choice of correlation method depends on the type of data being analyzed and the research question being addressed. It is important to select the appropriate method to accurately measure the relationship between two variables.
● A value of 0 indicates no correlation between the two variables.
● The formula to calculate the Pearson correlation coefficient is as follows:
r = Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)² * Σ(y - ȳ)²)
where x and y are the two variables, x̄ and ȳ are their respective means, and Σ represents the sum of the values.
● To interpret the correlation coefficient, we can use the following guidelines:
1. A value of r between -0.7 and -1 or between 0.7 and 1 indicates a strong correlation.
2. A value of r between -0.5 and -0.7 or between 0.5 and 0.7 indicates a moderate correlation.
3. A value of r between -0.3 and -0.5 or between 0.3 and 0.5 indicates a weak correlation.
4. A value of r between -0.3 and 0.3 indicates no correlation.
● It is important to note that correlation does not imply causation. A strong correlation between two variables does not necessarily mean that one variable causes the other. Other factors may be responsible for the observed relationship.
● It is a useful tool for identifying patterns and trends in the data and making predictions about future outcomes.
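In practice the coefficient is rarely computed by hand; a minimal sketch using scipy.stats.pearsonr on made-up data:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])           # e.g. hours studied (hypothetical)
y = np.array([35, 42, 50, 54, 60, 68, 71, 80])   # e.g. exam score (hypothetical)

r, p_value = stats.pearsonr(x, y)
print(round(r, 3))        # close to +1 -> strong positive correlation
print(round(p_value, 4))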
There are various types of probability distributions, each with its own unique characteristics and properties. Two common types of distributions are the normal distribution and the Poisson distribution:
1. Normal Distribution:
● The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric and bell-shaped.
● It is often used to model naturally occurring phenomena.
● The normal distribution is characterized by two parameters, the mean (μ) and the standard deviation (σ).
● The mean represents the central tendency of the distribution, while the standard deviation represents the spread or variability of the distribution.
● The area under the normal curve is equal to 1, and about 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations.
2. Poisson Distribution:
● The Poisson distribution is a discrete probability distribution that is used to model the number of times an event occurs in a fixed interval of time or space, given that the events occur independently and at a constant rate.
● It is named after the French mathematician Siméon Denis Poisson. The Poisson distribution is characterized by a single parameter, λ (lambda), which represents the average rate of occurrence of the event.
● The probability of observing a certain number of events in a fixed interval of time or space can be calculated using the Poisson probability mass function.
● The Poisson distribution is commonly used in fields such as biology, physics, and telecommunications.
Both the normal distribution and the Poisson distribution are important tools in statistics and data analysis. They allow us to model and analyze data in a meaningful way, and make predictions about future outcomes based on past observations.
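A brief sketch of working with both distributions using scipy.stats; the parameter values (mean 170, standard deviation 10, rate λ = 4) are chosen only for illustration:

from scipy import stats

# Normal distribution with mean 170 and standard deviation 10 (hypothetical heights)
norm = stats.norm(loc=170, scale=10)
print(norm.cdf(180) - norm.cdf(160))   # ~0.68: data within one standard deviation

# Poisson distribution with an average rate of 4 events per interval
pois = stats.poisson(mu=4)
print(pois.pmf(2))                     # probability of observing exactly 2 events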
Q16. Describe in detail Test Hypothesis. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, a hypothesis is a proposed explanation or prediction for a phenomenon or set of observations.
● Hypothesis testing is a statistical technique that allows us to test the validity of a hypothesis by comparing it to an alternative hypothesis using a set of statistical tools and methods.
● The goal of hypothesis testing is to determine whether there is enough evidence to support or reject the null hypothesis.
● The hypothesis testing process typically involves the following steps:
1. Formulate the null and alternative hypotheses: The null hypothesis is the default hypothesis that there is no significant difference or effect between two populations or samples, while the alternative hypothesis is the opposite hypothesis that there is a significant difference or effect.
2. Choose a significance level: The significance level (denoted as α) is the probability of rejecting the null hypothesis when it is actually true. The most common significance level is 0.05 or 5%.
3. Collect data and calculate the test statistic: Collect the data and calculate a test statistic, which is a numerical value that measures the difference between the observed data and the expected values under the null hypothesis.
4. Determine the p-value: The p-value is the probability of observing the test statistic or a more extreme value if the null hypothesis is true. It is used to determine the statistical significance of the test result.
5. Compare the p-value to the significance level: If the p-value is less than the significance level, we reject the null hypothesis and accept the alternative hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis.
6. Draw conclusions: Based on the results of the hypothesis test, we can draw conclusions about the relationship between the variables being tested.
● Hypothesis testing is used in many different fields, including business, medicine, and social sciences.
● It is a powerful tool for making decisions and drawing conclusions based on data and statistical analysis.
● However, it is important to carefully formulate the null and alternative hypotheses, choose an appropriate significance level, and properly interpret the results of the test to ensure accurate and meaningful conclusions.
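The steps above can be illustrated with a one-sample t-test in Python; the sample values and the hypothesized mean of 50 are made up for this sketch:

import numpy as np
from scipy import stats

# H0: the population mean is 50; H1: it is not. Significance level alpha = 0.05.
sample = np.array([52, 55, 49, 51, 58, 54, 53, 56, 50, 57])   # hypothetical sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)       # test statistic and p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")          # statistically significant difference
else:
    print("Fail to reject the null hypothesis")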
Q17. Write a short note on the Central Limit Theorem. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Central Limit Theorem (CLT) is a key concept in data science that is used to make statistical inferences about large datasets.
● It states that if a large number of independent and identically distributed random variables are added or averaged together, the resulting distribution will be approximately normal, regardless of the underlying distribution of the individual variables.
● In practical terms, this means that if we have a large enough sample size, we can use the CLT to make accurate estimates about the population mean and standard deviation, even if we don't know the underlying distribution of the data.
● This is important in data science because it allows us to draw meaningful conclusions from data, even when we have incomplete information.
● For example, the CLT is often used in hypothesis testing, where we compare a sample mean to a hypothesized population mean.
● By calculating the standard error of the mean using the CLT, we can estimate the probability of observing a sample mean as extreme as the one we have, given the hypothesized population mean.
● This helps us determine whether the difference between the sample mean and the hypothesized mean is statistically significant.
● Overall, the CLT is a fundamental concept in data science that helps us make sense of large datasets and draw meaningful conclusions from them.
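A small simulation that illustrates the CLT: sample means drawn from a skewed (exponential) population are approximately normally distributed around the population mean:

import numpy as np

np.random.seed(0)
population = np.random.exponential(scale=2.0, size=100_000)   # skewed population, mean ~2

sample_means = [np.random.choice(population, size=50).mean() for _ in range(2000)]
print(np.mean(sample_means))   # close to the population mean (~2.0)
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2 / sqrt(50)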
Q18. Confidence Interval, Z-test, t-test. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Confidence Interval:
● A confidence interval is a range of values that is likely to contain the true population parameter with a certain level of confidence.
● It is an important tool in statistical inference that is used to estimate the range of values that the true population parameter could fall within based on a sample of data.
● The level of confidence is typically expressed as a percentage, such as 95% or 99%.
● A wider confidence interval indicates a less precise estimate, while a narrower interval indicates a more precise estimate; demanding a higher level of confidence produces a wider interval.
Z-test:
● A Z-test is a hypothesis test that is used to determine whether a sample mean is significantly different from a hypothesized population mean, when the population variance is known.
● It involves calculating the Z-score of the sample mean, which is the number of standard deviations the sample mean is from the hypothesized mean.
● If the Z-score falls outside a certain range of values, the null hypothesis is rejected and the sample mean is deemed to be significantly different from the hypothesized mean.
T-test:
● A T-test is a hypothesis test that is used to determine whether a sample mean is significantly different from a hypothesized population mean, when the population variance is unknown.
● The T-test is used instead of the Z-test when the sample size is small, or when the population variance is unknown.
● The T-test involves calculating the T-score of the sample mean, which is similar to the Z-score, but takes into account the sample size and the sample variance.
● If the T-score falls outside a certain range of values, the null hypothesis is rejected and the sample mean is deemed to be significantly different from the hypothesized mean.
● In summary, confidence intervals are used to estimate the range of values that the true population parameter could fall within based on a sample of data. Z-tests are used when the population variance is known, and T-tests are used when the population variance is unknown or when the sample size is small.
● Both tests are used to determine whether a sample mean is significantly different from a hypothesized population mean.
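A minimal sketch of computing a 95% confidence interval for a mean with scipy (t distribution, since the population variance is assumed unknown; the sample is made up):

import numpy as np
from scipy import stats

sample = np.array([52, 55, 49, 51, 58, 54, 53, 56, 50, 57])   # hypothetical sample
mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean

# 95% confidence interval using the t distribution
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(low, high)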
Q19. Describe in detail Type-I, Type-II Errors. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In data science, Type-I and Type-II errors have the same definitions as in statistical hypothesis testing.
● However, in the context of data science, these errors can occur in different ways and have different consequences.
Type-I Error:
● In data science, a Type-I error occurs when a statistical model or algorithm incorrectly identifies a pattern or relationship in the data as being significant, when in fact it is not.
● This is similar to a false positive in statistical hypothesis testing. For example, suppose a predictive model incorrectly identifies a feature as being important for predicting a target variable, when in fact it is not.
● This can lead to inaccurate predictions and misleading insights, which can have serious consequences in fields such as healthcare or finance.
Type-II Error:
● In data science, a Type-II error occurs when a statistical model or algorithm fails to identify a significant pattern or relationship in the data, when in fact it is present. This is similar to a false negative in statistical hypothesis testing.
● For example, suppose a predictive model fails to identify an important feature for predicting a target variable, leading to inaccurate predictions and missed opportunities. This can also have serious consequences in fields such as healthcare or finance, where missing an important trend or relationship can have significant implications.
● It is important to note that in data science, the consequences of Type-I and Type-II errors can vary depending on the specific application and the cost of making an incorrect decision.
● For example, in a medical diagnosis application, the cost of a Type-II error (failing to identify a disease) may be much higher than the cost of a Type-I error (identifying a disease that is not present), as the latter can be verified with additional tests, whereas the former may result in delayed treatment and worse health outcomes. Therefore, it is important for data scientists to consider the consequences of both types of errors when designing and evaluating models and algorithms.
4. Model selection: This involves selecting a suitable model, based on the nature of the data and the research question or problem.
5. Model training and evaluation: This involves fitting the model to the data using available algorithms and techniques, and evaluating the performance of the model using various metrics and techniques, such as cross-validation or confusion matrices.
6. Model deployment and maintenance: This involves deploying the model in a real-world setting and monitoring its performance over time to ensure that it continues to provide accurate and reliable predictions.
● Model building is a complex and iterative process that requires careful attention to data quality, feature selection, and algorithm selection, as well as ongoing monitoring and maintenance to ensure that the model remains valid and reliable.
● By following sound model building practices, data scientists can create predictive models that provide useful insights and inform decision-making.

Q3. Write a short note on Cross Validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Cross-validation is a statistical technique used to evaluate the performance of predictive models.
● The basic idea of cross-validation is to split the available data into two or more subsets, one for training the model and the other for testing the model's performance.
● This helps to ensure that the model is not overfitting to the training data and provides an unbiased estimate of the model's performance on new data.
● The most common type of cross-validation is k-fold cross-validation, where the data is split into k subsets of roughly equal size.
● The model is then trained on k-1 of these subsets and tested on the remaining subset.
● This process is repeated k times, with each subset used exactly once for testing.
● The results of each test are then averaged to provide an overall estimate of the model's performance.
● Cross-validation is a useful technique for evaluating the performance of different predictive models and selecting the best model for a given problem.
● It can also be used to tune the parameters of a model, such as the regularization parameter in linear regression or the number of trees in a random forest, by testing different values of the parameter on different subsets of the data.
● Cross-validation is a powerful technique for evaluating the performance of predictive models and ensuring that they generalize well to new data.
● By using cross-validation, data scientists can make more informed decisions about which models to use and how to tune their parameters, resulting in more accurate and reliable predictions.

Q4. Explain in detail K-fold cross validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: K-fold cross-validation is a technique used to evaluate the performance of a predictive model.
● It involves dividing the available data into k subsets of roughly equal size, and then training and testing the model k times, each time using a different subset as the test set.
● The basic steps involved in k-fold cross-validation are as follows:
1. Split the data into k subsets (folds) of roughly equal size.
2. Train the model on k-1 of the subsets, holding out the remaining subset.
3. Use the trained model to predict the outcomes of the test set.
4. Calculate the performance metric (such as accuracy, precision, recall, or F1 score) for the test set.
5. Repeat steps 2-4 k times, each time using a different subset as the test set.
6. Average the performance metrics over the k folds to get an overall estimate of the model's performance.
● The advantage of k-fold cross-validation is that it allows for a more accurate and reliable estimate of the model's performance than simply splitting the data into a training set and a test set.
● By repeating the process k times and averaging the results, we can reduce the variance of the performance estimate and obtain a more accurate assessment of the model's ability to generalize to new data.
● The choice of k depends on the size of the dataset and the complexity of the model.
● In general, larger values of k are more computationally expensive but provide more accurate estimates of the model's performance, while smaller values of k are less computationally expensive but may be more prone to overfitting.
● K-fold cross-validation is widely used in data science and machine learning, as it provides a robust and reliable way to evaluate the performance of predictive models and select the best model for a given problem.
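A minimal k-fold cross-validation sketch with scikit-learn, using the bundled Iris dataset and logistic regression purely as an example model:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # overall estimate of the model's performance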
Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k is equal to the number of samples in the dataset.
● In other words, LOOCV involves training the model on all but one of the samples in the dataset, and using the remaining sample as the test set.
● The basic steps involved in LOOCV are as follows:
1. Remove one sample from the dataset and use the remaining samples to train the model.
2. Use the trained model to predict the outcome of the removed sample.
3. Calculate the performance metric (such as accuracy, precision, recall, or F1 score) for the removed sample.
4. Repeat steps 1-3 for each sample in the dataset.
5. Average the performance metrics over all samples to get an overall estimate of the model's performance.
● The advantage of LOOCV is that it provides the most accurate estimate of the model's performance possible, as each sample in the dataset is used as the test set exactly once.
● However, LOOCV can be computationally expensive, especially for large datasets, as it requires training the model on almost all of the samples multiple times.
● LOOCV is useful for evaluating the performance of models that have a small number of samples or that are prone to overfitting.
● By using LOOCV, data scientists can obtain a more accurate estimate of the model's ability to generalize to new data and select the most appropriate model.
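The same idea with leave-one-out cross-validation in scikit-learn (again using Iris and logistic regression only as illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# LOOCV: each of the 150 samples is used as the test set exactly once
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())   # average performance over all single-sample tests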
5. Density plot: A density plot is a graphical representation of the distribution of a numerical variable that displays the density of observations across the range of the variable.
6. Frequency polygon: A frequency polygon is a graphical representation of the distribution of a numerical variable that displays the frequency or count of observations as a line connecting the midpoints of each bin or interval of the variable.
● Univariate visualization can help data analysts and scientists identify patterns, outliers, and anomalies in the distribution of a variable.
● It can also provide insights into the skewness, kurtosis, and central tendency of the data, and help determine the appropriate statistical tests and models to use for further analysis.
● However, univariate visualization should be complemented with bivariate and multivariate visualization techniques to explore the relationships between variables and to gain a more comprehensive understanding of the data.

● For example, a histogram can show if the data is skewed to one side, if it has a bell-shaped distribution, or if it has multiple peaks.
● To create a histogram, we first choose the number and size of the bins. The number of bins should be large enough to capture the shape of the data but not so large that individual bins have very few data points.
● The size of the bins determines the width of the bars in the histogram.
● After selecting the bins, we count the number of data points that fall within each bin and plot these counts as the height of the bars. The resulting histogram provides a visual representation of the distribution of the data.
● The following are some typical histograms, with a caption below each one explaining the distribution of the data, as well as the characteristics of the mean, median, and mode. (Histogram figures not reproduced here.)
● They are also used in statistical calculations, such as calculating the interquartile range (IQR), which is the difference between the third and first quartiles.
● The IQR is a measure of the spread of the middle 50% of the data and is often used to identify potential outliers in a dataset.

Q9. Write a short note on the Distribution Chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A distribution chart is a graphical representation of the frequency distribution of a dataset. It shows how the data is distributed across different values or ranges. There are different types of distribution charts, each of which is used for a specific purpose:
1. Histogram: A histogram is a type of distribution chart that shows the frequency distribution of continuous data. It is used to visualize the shape, center, and spread of the distribution of the data.
2. Box plot: A box plot is a type of distribution chart that shows the distribution of continuous data using quartiles. It is used to identify outliers, skewness, and the spread of the data.
3. Stem and leaf plot: A stem and leaf plot is a type of distribution chart that shows the distribution of discrete data. It is used to visualize the frequency distribution of the data.
4. Probability density function: A probability density function is a type of distribution chart that shows the probability of a continuous random variable falling within a certain range. It is used to model and analyze continuous data.
5. Bar chart: A bar chart is a type of distribution chart that shows the frequency distribution of categorical data. It is used to visualize the distribution of data across different categories.
Distribution charts are an important tool in data analysis and are used to gain insights into the distribution and characteristics of a dataset. They are useful for identifying patterns, trends, and outliers in the data, and can be used to inform decision-making in various fields, including finance, healthcare, and marketing.

Q10. Discuss in detail Scatter Plot. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A scatter plot is a type of data visualization that displays the relationship between two variables in a dataset.
● It is a graph with points that represent individual data points and show their positions along two axes.
● Here is an example of a scatter plot: (example figure not reproduced here)
● In this scatter plot, the x-axis represents one variable, and the y-axis represents the other variable. Each point represents a single data point and its value for both variables. The pattern of the points on the plot provides insight into the relationship between the two variables.
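A minimal scatter plot sketch with matplotlib; the two variables are randomly generated and loosely related purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
hours = np.random.uniform(1, 10, 50)                      # hypothetical x-axis variable
scores = 30 + 6 * hours + np.random.normal(0, 5, 50)      # related y-axis variable

plt.scatter(hours, scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.show()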
● Scatter matrices are a powerful tool for exploring the relationships between multiple variables in a dataset. By using scatter matrices to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.

Q12. Write a short note on the Bubble chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A bubble chart is a type of data visualization that is used to display three dimensions of data in a two-dimensional space.
● In a bubble chart, data points are represented by bubbles of varying sizes and colors. The x and y axes represent two variables, and the size and color of the bubbles represent a third variable.
● Bubble charts are useful for displaying data with multiple dimensions, as they allow for the visualization of relationships between three variables.
● They are often used in business and economics to show market trends, and in social science to visualize relationships between multiple variables.
● When creating a bubble chart, it is important to choose an appropriate size and color scheme to ensure that the bubbles are easy to read and interpret.
● It is also important to label the axes and provide a clear legend to explain the meaning of the bubble sizes and colors.
● In conclusion, bubble charts are a powerful tool for displaying data with multiple dimensions. By using bubble charts to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.

Q13. Write a short note on the Density chart.
Ans: A density chart is a useful data visualization tool that displays the distribution of a variable in a dataset.
● It provides a continuous estimate of the probability density function of the data by smoothing out the frequency distribution of a histogram.
● The area under the curve of a density chart represents the probability of the variable being in a certain range of values.
● Density charts are particularly useful for displaying continuous variables, such as age or weight, where there are many potential values in the data.
● They can help identify patterns in the data, such as whether the variable is normally distributed or skewed to one side. By comparing the density charts of different variables, analysts can also identify relationships between variables, such as correlation or causation.
● There are several ways to create a density chart, including using statistical software packages such as R or Python.
● One common method for creating a density chart is to use kernel density estimation (KDE), which is a non-parametric way of estimating the probability density function of a variable.
● In KDE, a kernel function is used to smooth the frequency distribution of the variable, resulting in a continuous curve that approximates the probability density function.
● When creating a density chart, it is important to choose an appropriate bandwidth for the kernel function.
● A higher bandwidth will result in a smoother curve, but it may also hide important features in the data, while a lower bandwidth may result in a more jagged curve that is difficult to interpret.
● It is also important to label the axes and provide a clear legend to explain the meaning of the density values.
● Density charts are a powerful tool for displaying the distribution of a variable in a dataset. By using density charts to visualize and analyze data, researchers and analysts can gain deeper insights and make more informed decisions.
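A minimal KDE-based density chart sketch using scipy's gaussian_kde; the data and the bandwidth choice are illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

np.random.seed(0)
weights = np.random.normal(70, 12, 300)      # hypothetical continuous variable

kde = gaussian_kde(weights)                  # kernel density estimate (default bandwidth)
# A larger bw_method gives a smoother curve, a smaller one a more jagged curve, e.g.:
# kde = gaussian_kde(weights, bw_method=0.2)

xs = np.linspace(weights.min(), weights.max(), 200)
plt.plot(xs, kde(xs))
plt.xlabel("Weight")
plt.ylabel("Density")
plt.show()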
Q14. Explain in detail Roadmap for Data Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data exploration is a crucial step in the data analysis process, as it allows analysts to understand the data and identify patterns or relationships between variables. A roadmap for data exploration can help guide analysts through this process and ensure that all important aspects of the data are considered.
Here is a roadmap for data exploration:
1. Define the problem: Start by defining the research question or problem that the data will be used to address. This will help guide the exploration process and ensure that the analysis is focused on relevant variables.
2. Collect the data: Gather all relevant data, whether it be from surveys, experiments, or other sources. Ensure that the data is clean and organized, and that missing values are appropriately handled.
3. Describe the data: Begin by describing the data using basic statistical measures such as mean, median, mode, and standard deviation. This will provide an overview of the data and help identify any outliers or unusual values.
4. Visualize the data: Use data visualization techniques such as histograms, box plots, and scatter plots to explore the relationships between variables and identify any patterns or trends.
5. Check for correlations: Analyze the correlation matrix to identify strong correlations between variables. This can help identify potential predictors or explanatory variables for further analysis.
6. Identify outliers: Use outlier detection techniques to identify any data points that are significantly different from the rest of the data. This can help identify potential errors or unusual events that may need to be further investigated.
7. Test hypotheses: Use statistical tests to test hypotheses and explore the relationships between variables in more depth. This can include regression analysis, ANOVA, or other techniques depending on the research question.
8. Communicate results: Finally, communicate the results of the data exploration process to relevant stakeholders. This can include visualizations, tables, and reports that summarize the key findings and insights from the data.
By following this roadmap for data exploration, analysts can gain a deeper understanding of the data and make more informed decisions. It is important to note that this process is iterative and may require multiple rounds of exploration and analysis to fully understand the data and address the research question.

Q15. Explain Visualizing high dimensional data: Parallel chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Visualizing high-dimensional data can be challenging, as it can be difficult to represent all of the dimensions in a way that is easy to understand.
● One technique for visualizing high-dimensional data is the parallel coordinates chart, also known as a parallel plot or parallel chart.
● A parallel coordinates chart displays multivariate data by representing each dimension as a separate axis, all parallel to each other.
● Each data point is represented as a line that intersects with each axis at the value of that dimension for that particular data point.
● By displaying all of the dimensions in parallel, the parallel coordinates chart can provide a comprehensive view of the relationships between different variables.
● Here are some key features and considerations of parallel coordinates charts:
1. Axes and scaling: Each axis represents a single dimension or variable, and it is important to scale the axes appropriately to ensure that all data points are visible. Nonlinear scaling may be required for data that is highly skewed.
2. Interactivity: Parallel coordinates charts can be made interactive by allowing users to highlight or filter data points based on specific values or ranges.
3. Overplotting: When multiple data points overlap each other, it can be difficult to distinguish between them. Techniques such as transparency, jittering, and bundling can be used to alleviate this issue.
4. Clustering: Parallel coordinates charts can be used to identify clusters or subgroups in the data: points that are close together along multiple axes may represent a distinct subgroup.
● Parallel coordinates charts can be useful for a variety of applications, such as exploratory data analysis and cluster analysis.
● They can help identify patterns and relationships between variables that may be difficult to see with other visualization techniques.
● However, they can also be complex and difficult to interpret, especially for large datasets with many dimensions.
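A minimal parallel coordinates sketch using pandas' built-in plotting helper, with the Iris dataset as an illustrative example:

import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# One axis per dimension; each observation is a line crossing all axes
parallel_coordinates(df, class_column="species", colormap="viridis")
plt.show()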
distinguish between them. Techniques such as ● Here are some key features and considerations of deviation charts:
transparency, jittering, and bundling can be used to alleviate 1. Layout and design: Deviation charts can be designed in
this issue. various ways, with either horizontal or vertical bars. The bars
4. Clustering: Parallel coordinates charts can be used to can be colored or shaded to indicate positive and negative
points that are close together along multiple axes may 2. Labels and annotations: It is important to label the bars and
represent a distinct subgroup. provide clear annotations to explain the meaning of the
● Parallel coordinates charts can be useful for a variety of chart. This can include axis labels, legend, and annotations
applications, such as exploratory data analysis, cluster analysis, to indicate the source of the data and any important
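● A minimal Python sketch of a parallel coordinates chart, assuming pandas, Matplotlib, and scikit-learn are installed (the Iris dataset and styling choices are illustrative, not part of the notes):

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

# Load a small multivariate dataset: 4 numeric dimensions plus a class label
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# One vertical axis per dimension; each observation becomes a line across the axes
parallel_coordinates(df, class_column="species", colormap="viridis", alpha=0.4)
plt.title("Parallel coordinates chart (Iris)")
plt.tight_layout()
plt.show()
```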
Q16. Discuss in detail Deviation chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A deviation chart, also known as a diverging stacked bar chart or a butterfly chart, is a visualization technique used to compare two groups of data with a common metric.
● It displays the difference between two sets of data using stacked bars that deviate from a central axis.
● The deviation chart is particularly useful when comparing positive and negative values, as it allows for a clear comparison of the differences between the two groups.
● The central axis represents the point of balance between the two groups, with the positive values on one side and the negative values on the other.
● Here are some key features and considerations of deviation charts:
1. Layout and design: Deviation charts can be designed in various ways, with either horizontal or vertical bars. The bars can be colored or shaded to indicate positive and negative values.
2. Labels and annotations: It is important to label the bars and provide clear annotations to explain the meaning of the chart. This can include axis labels, a legend, and annotations to indicate the source of the data and any important findings.
3. Stacking: The chart is constructed by taking the two groups and stacking the positive and negative values on either side of the central axis.
4. Interpretation: Deviation charts can help identify the magnitude and direction of differences between two groups. They can be used to compare a variety of metrics, such as revenue, profit, or performance indicators.
5. Limitations: Deviation charts can be difficult to read if there are many categories or if the differences between the groups are small. They may also be less effective for comparing more than two groups of data.
● Overall, deviation charts can be a useful tool for comparing two groups of data with a common metric, especially when there are significant differences in positive and negative values. They can help highlight patterns and trends in the data, and provide a clear visual representation of the differences between the two groups.
● By comparing the shapes of the curves, it is possible to determine which features have the most influence on the data.
● The curves can also be colored or labeled to represent different groups or classes of observations.
● One limitation of Andrews curves is that they may not be suitable for very large datasets, as the computation of Fourier coefficients can be computationally expensive.
● They also may not be as effective for datasets with complex nonlinear relationships between features.
● Overall, Andrews curves can be a useful tool for visualizing high-dimensional data and exploring the relationships between features.
● They provide a way to represent complex data in a simple and intuitive way, and can help to uncover patterns and insights that may not be apparent with other visualization techniques.
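● The Andrews curves described above can be drawn directly with pandas; a minimal sketch, assuming pandas, Matplotlib, and scikit-learn are available (the Wine dataset is only an illustration):

```python
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves
from sklearn.datasets import load_wine

# Each observation is mapped to a finite Fourier series, giving one curve per row
wine = load_wine(as_frame=True)
df = wine.frame.rename(columns={"target": "class"})

andrews_curves(df, class_column="class", colormap="tab10")
plt.title("Andrews curves (Wine)")
plt.show()
```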
MODULE-4

Q1. Write a short note on Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, outliers are data points that are significantly different from other observations in a dataset.
● Outliers can have a significant impact on statistical analysis, as they can affect the mean, standard deviation, and other measures of central tendency and dispersion.
● Outliers can occur due to various reasons, such as measurement errors, natural variation, extreme events, and data manipulation.
● Outliers can be detected using various techniques, such as box plots, scatter plots, and statistical tests like the Z-score, Tukey's method, and Grubbs' test.
● However, identifying an outlier does not necessarily mean it should be removed from the dataset.
● Outliers can sometimes provide valuable insights into the data and need to be analyzed further.
● It is essential to understand the cause of the outliers and their impact on the data before deciding whether to remove them or not.
● To summarize, outliers are data points that are significantly different from other observations in a dataset, and they can occur due to various reasons. Identifying and analyzing outliers is essential for effective statistical analysis, but it is equally important to understand their cause and impact before deciding whether to remove them or not.
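● A minimal NumPy sketch of Z-score-based outlier detection (the data values and the threshold of 3 are illustrative assumptions, not values given in the notes):

```python
import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7, 10.4, 9.6,
                 10.1, 10.2, 9.9, 10.0, 30.0])

# Z-score: how many standard deviations each point lies from the mean
z_scores = (data - data.mean()) / data.std()

# Flag points whose absolute Z-score exceeds the chosen threshold
threshold = 3
outliers = data[np.abs(z_scores) > threshold]
print("Outliers:", outliers)   # flags the unusually large value 30.0
```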
Q2. Discuss in detail Causes of Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, an outlier is an observation that is significantly different from other observations in a dataset.
● Outliers can occur due to a variety of reasons, and understanding the causes of outliers is crucial to effectively analyze and interpret data.
● Here are some of the common causes of outliers:
1. Measurement error: Outliers can occur due to errors in the measurement process. For example, a device used to measure temperature may malfunction and produce an inaccurate reading that is significantly different from other readings in the dataset.
2. Data entry errors: Human error during data entry can result in outliers. For instance, a data entry operator may accidentally enter a wrong value, leading to an outlier.
3. Sampling errors: Outliers can occur due to sampling errors, where the sample data is not representative of the population. For instance, if a sample of a population is skewed towards one end of the distribution, it may result in outliers in the dataset.
4. Natural variation: In some cases, outliers can occur naturally due to variation in the data. For example, in a dataset of heights of adult humans, there may be a few individuals who are significantly taller or shorter than the rest of the population.
5. Extreme events: Outliers can occur due to extreme events that are not representative of the typical behavior of the system being observed. For example, a stock market crash or a natural disaster can cause outliers in financial or weather datasets, respectively.
6. Data manipulation: Outliers can also be deliberately introduced into a dataset through data manipulation. For example, an individual may add an outlier to a dataset to achieve a particular result or to influence a decision.
● It is essential to identify the cause of outliers in a dataset to determine the appropriate course of action.
● In some cases, outliers may need to be removed from the dataset, while in others, they may be valuable data points that need to be analyzed further.
● Techniques such as data visualization, statistical tests, and outlier detection algorithms can help identify and analyze outliers in a dataset.
Q3. Anomaly detection techniques. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Anomaly detection techniques are used to identify and isolate unusual data points or patterns that deviate from the norm or expected behavior in a dataset.
● Here are some of the commonly used anomaly detection techniques:
1. Statistical methods: Statistical techniques such as the Z-score, Grubbs' test, and the modified Thompson Tau test are used to identify outliers or anomalies in a dataset based on their deviation from the mean or other statistical measures.
2. Machine learning algorithms: Machine learning algorithms, such as clustering, classification, and regression algorithms, can be used to identify anomalies in a dataset. These algorithms are trained on normal data patterns and can detect any deviations from the expected patterns as anomalies.
3. Time-series analysis: Time-series analysis techniques can be used to identify anomalies in time-series data by identifying any sudden or unexpected changes in the data patterns.
4. Pattern recognition: Pattern recognition techniques, such as neural networks and decision trees, can be used to identify anomalies in a dataset based on their deviation from the expected patterns.
5. Visualization techniques: Visualization techniques, such as scatter plots, histograms, and heat maps, can be used to identify anomalies visually by spotting any unusual patterns or outliers in the data.
6. Rule-based methods: Rule-based methods involve setting predefined rules or thresholds to identify anomalies based on specific criteria or domain knowledge.
● To summarize, anomaly detection techniques are used to identify unusual data points or patterns in a dataset.
● These techniques include statistical methods, machine learning algorithms, time-series analysis, pattern recognition, visualization techniques, and rule-based methods.
● The choice of the appropriate technique depends on the type and nature of the data and the specific anomaly detection requirements.
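● As one concrete illustration of the machine learning approach, the sketch below uses scikit-learn's Isolation Forest, an algorithm not named in the notes and shown here only as an example (the synthetic data and parameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Mostly "normal" two-dimensional points plus a few injected anomalies
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# contamination = rough fraction of anomalies expected in the data
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(X)          # +1 = normal, -1 = anomaly
print("Points flagged as anomalies:", int((labels == -1).sum()))
```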
A data point is considered an outlier if its LOF score is significantly higher than the LOF scores of its neighbors (that is, if its local density is much lower than theirs).
3. Distance to cluster center: In clustering-based outlier detection, the distance between a data point and the center of the cluster is computed. If a data point is significantly far from the center of the cluster, it is considered an outlier.
4. DBSCAN: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that identifies core points, border points, and noise points in a dataset. Data points that are classified as noise points are considered outliers.
5. One-class SVM: One-Class Support Vector Machines (SVM) is a machine learning technique that learns a boundary around the normal data points and identifies outliers as data points outside the boundary.
● These distance-based methods can be used alone or in combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data and the application requirements.
● It is important to note that distance-based methods have limitations, such as sensitivity to the choice of distance metric, scalability, and parameter tuning.
● Therefore, careful consideration is required when selecting and applying these methods for outlier detection.
Q6. Describe Outlier detection using density-based methods. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Density-based methods are a popular approach for outlier detection in data analysis.
● These methods identify outliers based on the density of data points in a region of the feature space. In these methods, an outlier is defined as a data point that is located in a low-density region, while normal data points are located in high-density regions.
● Some commonly used density-based methods for outlier detection are:
1. Local Outlier Factor (LOF): LOF measures the local density of a data point relative to its neighbors. It computes the ratio of the average density of the k-nearest neighbors of a data point to its own density. An observation is considered an outlier if its LOF score is significantly greater than 1, meaning its local density is much lower than that of its neighbors.
2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN is a clustering algorithm that identifies core points, border points, and noise points in a dataset. Core points are defined as data points with a minimum number of neighbors within a specific distance. Border points are neighbors of core points that are not core points themselves, while noise points have no neighbors within the specified distance. Data points that are classified as noise points are considered outliers.
3. Gaussian Mixture Model (GMM): GMM is a probabilistic clustering technique that models the data distribution as a mixture of Gaussian distributions. Outliers are identified as data points that have low probabilities of being generated by the model.
4. Local Density-Based Outlier Factor (LDOF): LDOF is a density-based method that measures the degree of outlierness of a data point based on its local density and distance to high-density regions. It computes a score that represents the deviation of a data point from the density distribution of the dataset.
5. Kernel Density Estimation (KDE): KDE is a non-parametric method that estimates the density of data points in a region of the feature space. Outliers are identified as data points with low probability density values.
● These density-based methods can be used alone or in combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data and the application requirements.
● It is important to note that density-based methods have limitations, such as sensitivity to the choice of parameters, scalability, and robustness to high-dimensional data.
● Therefore, careful consideration is required when selecting and applying these methods for outlier detection.
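● A minimal scikit-learn sketch of two of the detectors described above, LOF and DBSCAN (the synthetic data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),      # dense "normal" region
               np.array([[4.0, 4.0], [5.0, -4.0]])])   # two low-density points

# Local Outlier Factor: -1 marks points whose local density is much lower
# than that of their neighbours
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print("LOF outliers at indices:", np.where(lof_labels == -1)[0])

# DBSCAN: points labelled -1 are noise points, i.e. potential outliers
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("DBSCAN noise points at indices:", np.where(db_labels == -1)[0])
```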
Q7. Write a short note on SMOTE. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address the problem of imbalanced datasets.
● In imbalanced datasets, the number of instances in the minority class (e.g., fraud cases) is much smaller than the number of instances in the majority class (e.g., non-fraud cases).
● This can lead to biased models that are unable to accurately predict the minority class.
● SMOTE is a technique that generates synthetic samples of the minority class to balance the dataset.
● The algorithm works by selecting a minority class instance and computing its k nearest neighbors in the feature space.
● Synthetic instances are then created by interpolating between the selected instance and its neighbors.
● The SMOTE algorithm has several advantages over other over-sampling techniques, such as random over-sampling, including:
1. SMOTE generates synthetic instances that are more representative of the minority class than random over-sampling.
2. SMOTE does not create exact copies of existing instances, reducing the risk of overfitting.
3. SMOTE can be combined with other techniques, such as under-sampling, to further balance the dataset.
4. SMOTE can improve the performance of machine learning models, especially in cases where the minority class is under-represented.
● However, SMOTE also has some limitations, such as:
1. SMOTE can create noisy samples that do not accurately represent the minority class.
2. SMOTE can increase the risk of overfitting if the synthetic samples are too similar to the existing samples.
3. SMOTE may not be effective in cases where the minority class is very small or the data is highly imbalanced.
● Overall, SMOTE is a useful technique for balancing imbalanced datasets and improving the performance of machine learning models. However, it should be used with caution and in combination with other techniques to ensure the validity and generalizability of the results.
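● A minimal sketch of SMOTE using the imbalanced-learn package (assumed to be installed alongside scikit-learn; the synthetic dataset is illustrative):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build an imbalanced binary dataset: roughly 5% minority class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between each minority sample and its k nearest
# minority-class neighbours to create synthetic minority instances
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_resampled))
```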
2. Seasonal Component: The second step is to estimate the seasonal component of the time series, which represents the repeating patterns that occur at regular time intervals. This can be done using various methods, such as seasonal indices, Fourier analysis, or seasonal regression models. The seasonal component captures the systematic variation in the time series due to seasonal effects, such as weather, holidays, or economic cycles.
3. Residual Component: The third step is to estimate the residual component of the time series, which represents the random or unpredictable variation in the data that is not explained by the trend or seasonal components. This can be done by subtracting the estimated trend and seasonal components from the original time series. The residual component captures the unexplained variation in the time series and is usually the least important component for forecasting purposes.
● Once the time series has been decomposed into its components, each component can be analyzed separately to better understand its characteristics and behavior.
● For example, the trend component can be used to identify long-term patterns and changes in the data, while the seasonal component can be used to identify seasonal effects and patterns.
● The residual component can be used to identify outliers and random fluctuations in the data.
● The decomposition of a time series can also be used for forecasting by extrapolating the trend and seasonal components into the future and adding them together to obtain a forecast.
● The residual component can then be used to estimate the uncertainty or error in the forecast.
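● A minimal statsmodels sketch of the decomposition described above (the synthetic monthly series and period=12 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series = trend + seasonality + noise
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
trend = np.linspace(100, 160, 72)
seasonal = 10 * np.sin(2 * np.pi * np.arange(72) / 12)
noise = np.random.default_rng(1).normal(0, 2, 72)
series = pd.Series(trend + seasonal + noise, index=idx)

# Additive decomposition into trend, seasonal and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.seasonal.head(12))      # the repeating seasonal pattern
print(result.resid.dropna().head())  # the unexplained (residual) variation
```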
Ans: The average method is a simple and widely used forecasting technique that involves calculating the average of past observations and using it as a forecast for future periods.
● This method is based on the assumption that future values will be similar to past values, and that the average provides a reasonable estimate of the future trend.
● To use the average method, the first step is to collect historical data for the time series. Then, the average of the past observations is calculated and used as the forecast for the next period.
● This process is repeated for each future period, with the forecast for each period equal to the average of the past observations.
● The average method is easy to use and does not require any complex calculations or statistical expertise.
● It can also serve as a simple benchmark for comparing the performance of more advanced forecasting techniques.
● However, the average method has several limitations, including:
1. It does not capture any trend or seasonal patterns in the data, and assumes that future values will be the same as past values.
2. It is sensitive to outliers and extreme values in the data, which can affect the accuracy of the forecast.
3. It does not take into account any external factors or events that may affect the time series, such as changes in the economy or market conditions.
● Despite these limitations, the average method can be useful for short-term forecasting of stable and relatively predictable time series.
● It is also a useful tool for generating quick and simple forecasts, especially when more advanced methods are not available or necessary.
● However, for more complex time series and longer-term forecasts, other forecasting methods may be more appropriate.
Q4. Explain in detail Moving Average smoothing. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Moving Average (MA) smoothing is a widely used statistical method for analyzing time-series data.
● It is a technique for identifying trends and patterns in a time series by calculating an average of the values in a sliding window over the series.
● In simple terms, MA smoothing involves calculating the average of a fixed number of previous data points in a time series.
● This creates a smoothed version of the series, which is useful for identifying trends and patterns that may be difficult to see in the raw data.
● The basic idea behind MA smoothing is to calculate the moving average of a fixed window size (usually denoted as k) for each point in the series.
● The window size determines how many data points are included in the average calculation.
● For example, if the window size is 3, the moving average at each point is calculated by taking the average of the current point and the two previous points.
● To calculate the moving average for each point in the series, we start with the first k data points and calculate the average. We then move the window one data point at a time and recalculate the average for each new window.
● The formula for calculating the moving average for each point in the series is:
MA(t) = (Y(t) + Y(t-1) + ... + Y(t-k+1)) / k
where Y(t) is the value at time t, k is the window size, and MA(t) is the moving average at time t.
● MA smoothing has several advantages. It can help to reduce noise in the data, making it easier to identify trends and patterns. It is also a simple and intuitive method that requires only basic mathematical skills.
● However, MA smoothing also has some limitations. One of the main limitations is that it is sensitive to the choice of window size.
● A small window size may not smooth the data enough to reveal meaningful patterns, while a large window size may smooth the data too much and obscure important features.
● Another limitation is that it is a backward-looking method and may not be suitable for predicting future values beyond the range of the data used to calculate the moving averages.
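● A minimal pandas sketch of moving average smoothing with window size k = 3, matching the example above (the data values are illustrative):

```python
import pandas as pd

series = pd.Series([12, 15, 14, 18, 21, 19, 23, 25, 24, 28])

# Trailing moving average with window size k = 3:
# MA(t) = (Y(t) + Y(t-1) + Y(t-2)) / 3
ma3 = series.rolling(window=3).mean()
print(pd.DataFrame({"raw": series, "MA(3)": ma3}))
```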
Q5. Explain Time series analysis using linear regression. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series analysis is a statistical method used to analyze data that is collected over a period of time.
● In time series analysis, the data is often plotted over time to identify patterns, trends, and other useful information.
● Linear regression is a statistical method that can be used to analyze time series data. It is a technique used to model the relationship between a dependent variable and one or more independent variables.
● In time series analysis, the dependent variable is usually the variable of interest that changes over time, and the independent variables are time-related variables such as time itself or other factors that may affect the dependent variable.
● The basic idea behind linear regression in time series analysis is to use a linear equation to model the relationship between the dependent variable and the independent variables.
● The linear equation takes the form:
Y = a + bX + e
where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and e is the error term.
● The slope and intercept can be estimated using least-squares regression, which minimizes the sum of the squared differences between the predicted values and the actual values.
● Once the slope and intercept are estimated, they can be used to make predictions about future values of the dependent variable based on the independent variables.
● This is useful in time series analysis because it allows us to identify trends and patterns in the data and make predictions about future behavior.
● Linear regression can also be used to test hypotheses about the relationship between the dependent variable and the independent variables.
● For example, we may want to test whether a particular independent variable has a significant effect on the dependent variable, or whether there is a trend in the data over time.
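● A minimal scikit-learn sketch of fitting the linear trend Y = a + bX + e, where X is simply the time index (the data values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Monthly observations with a roughly linear upward trend
y = np.array([102, 108, 111, 115, 121, 124, 130, 133, 138, 142, 147, 151])
X = np.arange(len(y)).reshape(-1, 1)   # time index 0, 1, 2, ... as the predictor

model = LinearRegression().fit(X, y)
print("intercept a:", round(model.intercept_, 2))
print("slope b:", round(model.coef_[0], 2))

# Extrapolate the fitted line to forecast the next three periods
future = np.arange(len(y), len(y) + 3).reshape(-1, 1)
print("forecast:", model.predict(future).round(1))
```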
Q6. Write a short note on ARIMA Model. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The ARIMA (Autoregressive Integrated Moving Average) model is a popular statistical method used for time series analysis and forecasting.
● It is a combination of three methods: autoregression, differencing, and moving average.
● The ARIMA model is a generalization of the simpler ARMA (Autoregressive Moving Average) model, which assumes that the time series is stationary (i.e., the statistical properties of the series do not change over time).
● However, many time series in real-world applications are non-stationary, meaning that the statistical properties change over time.
● ARIMA models can handle non-stationary time series by incorporating differencing, which removes the trend or seasonality component from the series. The ARIMA model also includes autoregression and moving average components, which capture the autocorrelation and noise components of the series, respectively.
● ARIMA models are specified by three parameters: p, d, and q.
● The parameter p represents the autoregression order, which is the number of lagged values of the dependent variable used to predict the current value.
● The parameter q represents the moving average order, which is the number of lagged errors used to predict the current value.
● The parameter d represents the differencing order, which is the number of times the series is differenced to make it stationary.
● ARIMA models can be used for both time series analysis and forecasting.
● For time series analysis, ARIMA models can be used to identify the underlying patterns and trends in the data, and to test for the presence of seasonality or other cyclical components.
● For forecasting, building an ARIMA model typically involves the following steps:
1. Examine the time series data to determine if it is stationary. If it is not, it will need to be differenced to make it stationary.
2. Determine the order of differencing (d) required to make the series stationary. This can be done by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the differenced series.
3. Determine the order of the autoregressive (p) and moving average (q) terms required for the model. This can also be done by examining the ACF and PACF plots of the differenced series.
4. Fit the ARIMA model to the time series data using the chosen values of p, d, and q. This can be done using a variety of statistical software packages.
5. Check the residuals of the ARIMA model for autocorrelation, non-normality, and heteroscedasticity. If any of these issues are present, re-estimate the model or consider using a different model.
6. Use the ARIMA model to make predictions for future time periods.
7. Check the accuracy of the ARIMA model predictions using metrics such as mean absolute error (MAE) or root mean squared error (RMSE).
8. Refine the ARIMA model as necessary based on the prediction accuracy and additional insights gained from the analysis of the time series data.
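● A minimal statsmodels sketch of fitting and forecasting with an ARIMA model (the synthetic series and the order (p, d, q) = (1, 1, 1) are illustrative assumptions, not values prescribed by the notes):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A small non-stationary (trending) series
rng = np.random.default_rng(7)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)))

# ARIMA(1, 1, 1): one autoregressive lag, one difference, one moving average lag
fitted = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 5 periods
print(fitted.forecast(steps=5))
```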
Q7. Write a short note on Mean Absolute Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Error (MAE) is a popular metric used to measure the accuracy of a predictive model.
● It is particularly useful for evaluating models that predict continuous variables, such as regression models.
● MAE measures the average absolute difference between the actual and predicted values of a variable.
● It is calculated by taking the absolute value of the difference between each predicted value and its corresponding actual value, and then taking the average of these differences:
MAE = (1/n) * Σ| yi - ŷi |
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, and Σ is the summation symbol.
● MAE is a useful metric because it is easy to understand and interpret. It measures the average size of the errors made by the model, with larger errors contributing more to the overall score than smaller errors.
● MAE is expressed in the same units as the variable being predicted, which makes it easy to compare the accuracy of different models.
● One limitation of MAE is that it treats all errors as equally important, regardless of their direction.
● This means that a model that consistently underestimates the variable of interest will have the same MAE as a model that consistently overestimates the variable, even though these errors may have different implications for the practical use of the model.
● Despite this limitation, MAE is a widely used metric for evaluating the accuracy of predictive models.
● It is often used in conjunction with other metrics, such as Root Mean Squared Error (RMSE), to provide a more comprehensive evaluation of model performance.
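● The MAE formula above in a few lines of NumPy (the example arrays are illustrative):

```python
import numpy as np

y_true = np.array([120.0, 135.0, 150.0, 160.0, 170.0])
y_pred = np.array([118.0, 140.0, 145.0, 162.0, 165.0])

# MAE = (1/n) * Σ| yi - ŷi |
mae = np.mean(np.abs(y_true - y_pred))
print("MAE:", mae)   # 3.8, in the same units as the variable being predicted
```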
Q8. Write a short note on Root Mean Square Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Root Mean Square Error (RMSE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the model, with larger errors contributing more to the overall score than smaller errors. RMSE is particularly useful for evaluating models that predict continuous variables, such as regression models.
● The formula for calculating RMSE is:
RMSE = sqrt((1/n) * Σ(yi - ŷi)^2)
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, Σ is the summation symbol, and sqrt is the square root function.
● RMSE is expressed in the same units as the variable being predicted, which makes it easy to compare the accuracy of different models.
● One advantage of RMSE over Mean Absolute Error (MAE) is that it gives more weight to larger errors, which may be more important to consider in certain applications.
● However, one limitation of RMSE is that it is sensitive to outliers in the data, which can inflate the value of the metric. Another limitation is that it can be difficult to interpret in practical terms, since it does not have a direct relationship to the performance of the model.
● Despite these limitations, RMSE is widely used as a metric for evaluating the accuracy of predictive models, especially in cases where the variable being predicted has a wide range of possible values.
● It is often used in conjunction with other metrics, such as MAE or Mean Absolute Percentage Error (MAPE), to provide a more comprehensive evaluation of model performance.
Q9. Mean Absolute Percentage Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Percentage Error (MAPE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average percentage difference between the actual and predicted values of a variable, making it particularly useful for evaluating models that predict variables with varying scales or magnitudes.
● The formula for calculating MAPE is:
MAPE = (1/n) * Σ| (yi - ŷi) / yi | × 100%
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, Σ is the summation symbol, and | | represents absolute value.
● MAPE is expressed as a percentage, which makes it easy to interpret and compare across different variables and models.
● It measures the average size of the errors made by the model relative to the actual values of the variable, with larger errors contributing more to the overall score than smaller errors.
● Unlike RMSE, MAPE is not sensitive to the scale of the variable being predicted, which can be an advantage in certain applications.
● One limitation of MAPE is that it is undefined when the actual value of the variable is zero, which can occur in some applications.
● In addition, MAPE can be affected by outliers in the data, which can skew the overall score. Finally, MAPE can be less intuitive to interpret than other metrics, such as RMSE or MAE.
Q10. Discuss in detail Mean Absolute Scaled Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Scaled Error (MASE) is a commonly used metric for evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the model relative to the errors made by a simple benchmark model, making it particularly useful for evaluating models that make predictions over time series data.
● The formula for calculating MASE is:
MASE = (1/n) * Σ| (yi - ŷi) / MAE |
where yi is the actual value of the variable, ŷi is the predicted value of the variable, n is the total number of observations, | | represents absolute value, and MAE is the mean absolute error of a benchmark model that always predicts the value of the variable as its most recent observation.
● MASE is a unitless measure, which makes it easy to interpret and compare across different models and datasets.
● It measures the average size of the errors made by the model relative to the errors made by a simple benchmark model, with values less than 1 indicating that the model is better than the benchmark model and values greater than 1 indicating that the benchmark model is better.
● One advantage of MASE over other metrics, such as RMSE or MAPE, is that it provides a standardized way to compare the accuracy of different models over time series data, without being affected by the scale of the variable being predicted or by outliers in the data.
● In addition, MASE is less sensitive to changes in the distribution of the data over time, which can be an advantage in applications where the underlying data is subject to external factors, such as seasonality or trend.
● Despite these advantages, MASE can be more difficult to calculate than other metrics, since it requires the calculation of a benchmark model for each time series.
● In addition, MASE can be affected by the choice of benchmark model, which can influence the overall score.
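● A minimal NumPy sketch of the MASE calculation above, using the naive 'most recent observation' benchmark (the arrays are illustrative):

```python
import numpy as np

y_true = np.array([100.0, 104.0, 109.0, 115.0, 120.0, 126.0])
y_pred = np.array([101.0, 105.0, 108.0, 116.0, 121.0, 125.0])

# Benchmark: always predict the previous (most recent) observation
benchmark_mae = np.mean(np.abs(y_true[1:] - y_true[:-1]))

# MASE = mean(| yi - ŷi |) / MAE of the naive benchmark
mase = np.mean(np.abs(y_true - y_pred)) / benchmark_mae
print("MASE:", round(mase, 3))   # below 1, so the model beats the naive benchmark
```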
Q11. Explain Evaluation parameters for Classification. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Evaluation parameters for classification are metrics that are used to evaluate the performance of a classification model.
● These metrics provide a quantitative measure of how well the model is able to correctly classify instances into their respective classes.
● Some of the commonly used evaluation parameters for classification are:
1. Accuracy: It is the proportion of correct predictions made by the model out of the total number of predictions. It is calculated as follows:
Accuracy = Number of Correct Predictions / Total Number of Predictions
2. Precision: It is the proportion of true positive predictions out of the total number of positive predictions made by the model. It is calculated as follows:
Precision = True Positives / (True Positives + False Positives)
3. Recall (Sensitivity): It is the proportion of true positive predictions out of the total number of actual positive instances in the dataset. It is calculated as follows:
Recall = True Positives / (True Positives + False Negatives)
4. F1 Score: It is the harmonic mean of precision and recall. It is a measure that balances between precision and recall. It is calculated as follows:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
5. Specificity: It is the proportion of true negative predictions out of the total number of actual negative instances in the dataset. It is calculated as follows:
Specificity = True Negatives / (True Negatives + False Positives)
6. Area under the receiver operating characteristic curve (AUC-ROC): It is a measure of the classifier's ability to distinguish between positive and negative classes. It is calculated as the area under the ROC curve, which is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds.
● These evaluation parameters help in understanding the performance of the classification model in terms of its ability to accurately classify instances into their respective classes.
● Based on the specific requirements of the classification task, one or more of these parameters can be used to evaluate and compare the performance of different classification models.
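● A minimal scikit-learn sketch that computes several of these parameters (the label vectors and scores are illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # predicted probabilities

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_score))

# Specificity = True Negatives / (True Negatives + False Positives)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
```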
Q12. Write a short note on regression and clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Regression and clustering are two different types of machine learning algorithms that are used for different purposes.
Regression:
● It is a type of supervised learning algorithm that is used to model the relationship between a dependent variable and one or more independent variables.
● The goal of regression is to find a mathematical function that can predict the value of the dependent variable based on the values of the independent variables.
● Regression is used to solve problems such as predicting house prices based on their features, estimating sales based on advertising spend, or forecasting the stock market based on historical data.
Clustering:
● Clustering, on the other hand, is a type of unsupervised learning algorithm that is used to group similar data points together based on their features.
● The goal of clustering is to identify clusters or groups of data points that share similar characteristics, without any prior knowledge of the groups.
● Clustering is used to solve problems such as customer segmentation, fraud detection, or image segmentation.
● While regression and clustering are different types of algorithms, they can be used together in certain scenarios.
● One use case for combining regression and clustering is to use clustering to identify outliers in the data, which can then be removed before fitting the regression model.
● Outliers can have a significant impact on the regression model, as they can skew the fitted relationship between the variables.
● By removing the outliers identified through clustering, the accuracy of the regression model can be improved.
Ans: Predictive modeling is a statistical and data mining technique used to create a model that can predict future events or behaviors based on historical data.
● It involves identifying patterns and relationships within data sets to make predictions about future outcomes.
● The process of predictive modeling typically involves several steps, including data collection, data cleaning and preprocessing, feature selection, model training, model evaluation, and deployment.
● The goal is to build a model that accurately predicts future outcomes based on the available data.
● Predictive modeling has many practical applications, including fraud detection, marketing analytics, credit scoring, and risk management.
● It is widely used in industries such as finance, insurance, healthcare, and retail to help organizations make better decisions and improve their performance.
● Predictive modeling is a powerful tool that enables businesses to use data to make more informed decisions and gain a competitive advantage.
● However, it requires a deep understanding of statistical and data science concepts and techniques, as well as access to high-quality data and advanced analytics tools.
5. Model training: Once the model has been selected, it needs to be trained using the available data. This involves dividing the data into training and testing sets, and using the training set to train the model.
6. Model evaluation: After the model has been trained, it needs to be evaluated to determine its accuracy and performance. This can be done by comparing the predicted results with the actual results of past transactions.
7. Model deployment: Finally, the model can be deployed to identify fraudulent activities in real-time. This can be done using a web-based interface or a mobile app.
Overall, fraud detection is a challenging task that requires a deep understanding of statistical and machine learning concepts. However, with the right data and tools, it can be an effective way to prevent financial losses and protect consumer data.
Q4. Write a short note on Clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Clustering is a popular unsupervised learning technique used in machine learning and data mining. The goal of clustering is to group similar data points together based on their characteristics or features, without prior knowledge of the specific classes or labels.
The process of clustering typically involves several steps:
1. Data preprocessing: The first step is to clean and preprocess the data, including removing duplicates, filling in missing values, and normalizing the data.
2. Feature selection: The next step is to select the relevant features that will be used to group the data points. This can be done using various techniques such as correlation analysis, principal component analysis, or domain expertise.
3. Similarity or distance measure: The similarity or distance measure is a key component of clustering, which is used to determine how similar or dissimilar two data points are. Common measures include Euclidean distance, cosine similarity, and Manhattan distance.
4. Clustering algorithm: There are several clustering algorithms that can be used, including K-means, hierarchical clustering, and density-based clustering. The choice of algorithm depends on the specific problem and the characteristics of the data.
5. Evaluation: After clustering, it is important to evaluate the quality of the clustering results. This can be done using various metrics such as silhouette score, Dunn index, or Calinski-Harabasz index.
Clustering has many practical applications, including customer segmentation, image segmentation, and anomaly detection. It is widely used in industries such as marketing, healthcare, and finance to gain insights from large datasets and make better decisions.
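● A minimal scikit-learn sketch of the clustering workflow above, using K-means with Euclidean distance and the silhouette score (the synthetic data and the choice of k = 3 are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Steps 1-2: prepare the data (here a synthetic dataset with 3 natural groups)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)   # normalise the features

# Steps 3-4: K-means assigns each point to the nearest of k cluster centres
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Step 5: evaluate cluster quality (silhouette score closer to 1 is better)
print("Silhouette score:", round(silhouette_score(X_scaled, labels), 3))
```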
Q5. Describe in detail Customer Segmentation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Customer segmentation is a technique used by businesses to divide their customers into groups based on common characteristics such as demographics, behavior, and purchasing habits. The goal of customer segmentation is to better understand the needs and preferences of different customer groups and tailor marketing strategies to each group accordingly.
Here are the steps involved in customer segmentation:
1. Data collection: The first step is to gather data on customers, including information such as age, gender, income, and purchase history.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and converting categorical variables into numerical values.
3. Feature selection: The next step is to select the relevant features that will be used to group the customers. This can be done using various techniques such as correlation analysis, principal component analysis, or domain expertise.
4. Similarity or distance measure: The similarity or distance measure is a key component of customer segmentation, which is used to determine how similar or dissimilar two customers are. Common measures include Euclidean distance, cosine similarity, and Manhattan distance.
5. Clustering algorithm: There are several clustering algorithms that can be used for customer segmentation, including K-means, hierarchical clustering, and density-based clustering. The choice of algorithm depends on the specific problem and the characteristics of the data.
6. Evaluation: After clustering, it is important to evaluate the quality of the clustering results. This can be done using various metrics such as silhouette score, Dunn index, or Calinski-Harabasz index.
7. Marketing strategy: Once the customers have been segmented into groups, businesses can tailor their marketing strategies to each group. For example, a company may create different advertising campaigns for high-income customers versus low-income customers.
Customer segmentation has many practical applications, including improving customer retention, increasing customer lifetime value, and optimizing marketing spend. It is widely used in industries such as retail, e-commerce, and healthcare to gain insights from large datasets and make better decisions.
Q6. Explain in detail Time series forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series forecasting is a popular technique in data science used to predict future values of a time-dependent variable based on historical data. Time series data consists of observations taken at regular time intervals, such as daily, weekly, or monthly, and includes data from a wide range of domains such as finance, economics, weather, and energy.
Here are the steps involved in time series forecasting:
1. Data collection: The first step is to gather historical data on the time series variable of interest, including the time stamps and values.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Data exploration: The next step is to explore the data to identify trends and patterns over time. This can be done using various visualization techniques such as line charts and scatterplots.
4. Time series modeling: There are several time series models that can be used for forecasting, including ARIMA (autoregressive integrated moving average), exponential smoothing, and seasonal decomposition. The choice of model depends on the specific problem and the characteristics of the data.
5. Model evaluation: After building the model, it is important to evaluate its performance using metrics such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). This helps to ensure that the model is accurate and reliable.
6. Forecasting: Once the time series model has been evaluated, it can be used to make predictions about future values of the time-dependent variable. This can help businesses and organizations make better decisions and plan for the future.
Time series forecasting has many practical applications, including predicting stock prices, forecasting demand for products, and estimating energy consumption. It is widely used in industries such as finance, retail, and manufacturing to gain insights from historical data and make better decisions.
Q7. Write a short note on Weather Forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Weather forecasting is the process of predicting the future state of the atmosphere at a given location and time. Weather forecasting is an important application of data science and is widely used in a range of fields such as agriculture, aviation, transportation, and emergency management.
The process of weather forecasting typically involves the following steps:
1. Data collection: The first step is to collect a range of weather data, including temperature, humidity, pressure, wind speed, and precipitation. This data can be collected from various sources such as weather stations, satellites, and radars.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Weather modeling: There are several models that can be used for weather forecasting, including numerical weather prediction models, statistical models, and machine learning models. The choice of model depends on the specific problem and the characteristics of the data.
4. Model evaluation: After building the weather model, it is important to evaluate its performance using various metrics such as accuracy, precision, and recall. This helps to ensure that the model is accurate and reliable.
5. Forecasting: Once the weather model has been evaluated, it can be used to make predictions about future weather conditions. These predictions can be used to provide weather alerts and advisories to the public, as well as to inform decision-making in various industries.
Weather forecasting has many practical applications, including predicting storms, droughts, and heat waves, as well as forecasting crop yields and informing transportation planning. It is a critical tool for emergency management and disaster response, helping to save lives and minimize property damage.
Q8. Explain in detail Product recommendation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Product recommendation is a technique in data science and machine learning used to suggest products to users based on their past behavior and preferences.
● This is done by analyzing the user's historical data, such as their purchase history, search history, and clickstream data, and using this information to make personalized recommendations.
● Product recommendation is a powerful technique in data science and machine learning that can help businesses and organizations provide personalized recommendations to users, leading to increased sales and customer satisfaction.
● Here are the steps involved in product recommendation:
1. Data collection: The first step is to collect data on user behavior, such as their purchase history, search history, and clickstream data. This data can be collected from various sources such as e-commerce websites, social media platforms, and mobile apps.
2. Data preprocessing: Once the data has been collected, it needs to be cleaned and processed. This involves removing duplicates, filling in missing values, and handling any outliers or anomalies.
3. Feature extraction: The next step is to extract features from the data that are relevant to the product recommendation task. For example, features such as the user's age, gender, location, and previous purchases can be used to make recommendations.
4. Recommendation engine: There are several recommendation algorithms that can be used, including collaborative filtering, content-based filtering, and hybrid models. The choice of algorithm depends on the specific problem and the characteristics of the data.
5. Model training: After choosing the recommendation algorithm, the model needs to be trained on the historical data to learn patterns and relationships between users and products.
6. Model evaluation: Once the model has been trained, it needs to be evaluated using various metrics such as precision, recall, and F1 score. This helps to ensure that the model is accurate and reliable.
7. Recommendation generation: Once the model has been evaluated, it can be used to generate personalized recommendations for users based on their past behavior and preferences.
● Product recommendation has many practical applications, including in e-commerce, advertising, and entertainment.
● It can help businesses increase sales by suggesting relevant products to customers, as well as improve customer satisfaction by providing personalized recommendations.