
A Simple Guide

accredian

Mastering the Data Science Interview

Top 90+ Questions for Success

TABLE OF CONTENTS

Overview
Basic Data Science Interview Questions
Intermediate Data Science Interview Questions
Advanced Data Science Interview Questions


OVERVIEW
Over the years, data science has gained widespread importance because of the value locked inside data. Data is often called the new oil: when analyzed and harnessed properly, it can be enormously beneficial to stakeholders. Beyond that, a data scientist gets to work in diverse domains, solving real-life, practical problems with modern technologies.

A successful data scientist can interpret data, innovate, and bring creativity to solving problems that drive business and strategic goals.

This is what makes data science one of the most lucrative jobs of the 21st century.

In this book, we explore the most commonly asked data science technical interview questions, to help both aspiring and experienced data scientists prepare.

MANVENDER SINGH
CEO | ACCREDIAN
Basic Data Science Interview Questions


Q1. What is Data Science?

Data Science is an interdisciplinary field that brings together various scientific processes, algorithms, tools, and machine learning techniques to find common patterns and gather sensible insights from raw input data using statistical and mathematical analysis.

The following figure represents the life cycle of data


science.


It starts with gathering the business requirements


and relevant data.

Once the data is acquired, it is maintained by


performing data cleaning, data warehousing, data
staging, and data architecture.

Data processing does the task of exploring the


data, mining it, and analyzing it which can be
finally used to generate the summary of the
insights extracted from the data.

Once the exploratory steps are completed, the cleansed data is subjected to various algorithms like predictive analysis, regression, text mining, pattern recognition, etc., depending on the requirements.

In the final stage, the results are communicated to


the business in a visually appealing manner. This is
where the skill of data visualization, reporting, and
different business intelligence tools come into the
picture.


Q2. What is the difference between


Data analytics and Data science?

Data science involves the task of transforming data


by using various technical analysis methods to
extract meaningful insights using which a data
analyst can apply to their business scenarios.

Data analytics deals with checking the existing


hypothesis and information and answers questions
for a better and effective business-related decision-
making process.

Data Science drives innovation by answering


questions that build connections and answers for
futuristic problems. Data analytics focuses on
getting present meaning from existing historical
context whereas data science focuses on predictive
modeling.

Data Science can be considered as a broad subject


that makes use of various mathematical and
scientific tools and algorithms for solving complex
problems whereas data analytics can be considered
as a specific field dealing with specific concentrated
problems using fewer tools of statistics and
visualization.


Q3. What are some of the techniques used


for sampling? What is the main advantage
of sampling?

Data analysis often cannot be performed on the entire volume of data at once, especially with larger datasets. It therefore becomes crucial to take data samples that represent the whole population and to perform the analysis on them. This is the main advantage of sampling: it makes analysis feasible and faster while still supporting conclusions about the full population.

While doing this, it is essential to select the sample carefully so that it truly represents the entire dataset.


There are two major categories of sampling techniques, based on whether they make use of probability:

Probability Sampling techniques: Clustered


sampling, Simple random sampling, Stratified
sampling.

Non-Probability Sampling techniques: Quota


sampling, Convenience sampling, snowball sampling,
etc.


Q4. Differentiate between Data Analytics


and Data Science

Q5. How is Python Useful?

Python is widely recognized as an exceptionally


advantageous programming language due to its
versatility and simplicity.

Its extensive range of applications and associated


benefits have established it as a preferred choice
among developers. Notably,

Python stands out in terms of readability and user-


friendliness.

Its syntax is meticulously designed to be intuitive and


concise, enabling ease in coding, comprehension,
and maintenance.

Python offers a comprehensive standard library that


encompasses a diverse collection of pre-built
modules and functions. This wealth of resources
substantially minimizes the time and effort expended
by developers, streamlining the execution of routine
programming tasks.

Q6. How R is Useful in the Data Science


Domain?

Here are some ways in which R is useful in the data


science domain:

Data Manipulation and Analysis: R offers a


comprehensive collection of libraries and functions
that facilitate proficient data manipulation,
transformation, and statistical analysis.

Statistical Modeling and Machine Learning: R


offers a wide range of packages for advanced
statistical modeling and machine learning tasks,

empowering data scientists to build predictive models


and perform complex analyses.

Data Visualization: R’s extensive visualization


libraries enable the creation of visually appealing and
insightful plots, charts, and graphs.

Reproducible Research: R supports the integration


of code, data, and documentation, facilitating
reproducible workflows and ensuring transparency in
data science projects.

Q7. What is Supervised Learning?


Supervised learning is a machine learning approach in
which an algorithm learns from labeled training data to
make predictions or classify new, unseen data.

It involves :

the use of input data and corresponding output


labels, allowing the algorithm to learn patterns and
relationships.

The goal is to generalize the learned patterns and


accurately predict outputs for new input data based
on the learned patterns.


Q8. What is Unsupervised Learning?

Unsupervised learning is a machine learning approach


wherein an algorithm uncovers patterns and structures
within unlabeled data, operating without explicit
guidance or predetermined output labels.

Its objective is to reveal hidden relationships, patterns,


and clusters present in the data.

Unlike supervised learning, the algorithm autonomously


explores the data to identify inherent structures and
draw inferences, proving valuable for exploratory data
analysis and the discovery of novel insights.

Q9. What do you understand about Logistic Regression?

Logistic regression is a classification algorithm that can


be used when the dependent variable is binary. Let’s take
an example. Here, we are trying to determine whether it
will rain or not on the basis of temperature and humidity.


Temperature and humidity are the independent


variables, and rain would be our dependent variable. So,
the logistic regression algorithm actually produces an S
shape curve.

So, basically in logistic regression, the Y value lies within


the range of 0 and 1. This is how logistic regression works.
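The S-shaped curve comes from the sigmoid (logistic) function. As a rough sketch in Python, with made-up coefficient values used purely for illustration:

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range, producing the S-shaped curve.
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients of a fitted logistic regression model:
# intercept b0 plus weights b1 (temperature) and b2 (humidity).
b0, b1, b2 = -4.0, 0.08, 0.05
temperature, humidity = 25.0, 80.0

prob_rain = sigmoid(b0 + b1 * temperature + b2 * humidity)
print(f"Predicted probability of rain: {prob_rain:.2f}")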

Q10. What is a confusion matrix?

The confusion matrix is a table that is used to estimate


the performance of a model. It tabulates the actual
values and the predicted values in a 2×2 matrix.



True Positive (d): This denotes all of those records


where the actual values are true and the predicted
values are also true. So, these denote all of the true
positives.
False Negative (c): This denotes all of those records
where the actual values are true, but the predicted
values are false.
False Positive (b): In this, the actual values are false,
but the predicted values are true.
True Negative (a): Here, the actual values are false
and the predicted values are also false.

So, the correctly classified records are represented by the true positives and the true negatives taken together. This is how the confusion matrix works.

Q11. List down the conditions for Overfitting


and Underfitting.

Overfitting:
The model performs well only on the sample training data. If any new data is given as input, the model fails to generalize and gives poor results. These conditions occur due to low bias and high variance in the model. Decision trees are more prone to overfitting.

Underfitting:
Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it does not perform well even on the training data, let alone the test data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.


Q12. Differentiate between the long and


wide format data.


The following image depicts the representation of


wide format and long format data:


Q13. What are Eigenvectors and Eigenvalues?

Eigenvectors of a matrix are non-zero vectors whose direction is unchanged when the matrix is applied to them; they are usually normalized to unit length (magnitude 1) and are also called right vectors.

Eigenvalues are the scalars associated with eigenvectors: they tell us by how much an eigenvector is stretched or shrunk, so that A·v = λ·v.

A matrix can be decomposed into Eigenvectors and


Eigenvalues and this process is called Eigen
decomposition. These are then eventually used in
machine learning methods like PCA (Principal
Component Analysis) for gathering valuable insights from
the given matrix.
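As a minimal illustration, assuming NumPy is available, the eigenvalues and eigenvectors of a small matrix can be computed and checked against the defining relation A·v = λ·v:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Eigen decomposition: A @ v = w * v for each eigenvalue w and eigenvector v.
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:", eigenvalues)            # [2. 3.]
print("Eigenvectors (columns):\n", eigenvectors)

# Verify the defining relation for the first eigenpair.
v = eigenvectors[:, 0]
assert np.allclose(A @ v, eigenvalues[0] * v)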


Q14. What do you understand by


Imbalanced Data?
Data is said to be highly imbalanced if it is distributed
unequally across different categories. These datasets
result in an error in model performance and result in
inaccuracy.

Q15. Are there any differences between


the expected value and mean value?
There are not many differences between the two, but they are used in different contexts. The mean is generally used when describing a sample or a probability distribution directly, whereas the expected value is used in contexts involving random variables.

Q16. What do you understand by


Survivorship Bias?
Survivorship bias refers to the logical error of focusing on the aspects that survived some process while overlooking those that did not, because of their lack of prominence. This bias can lead to wrong conclusions.

Q17. How are Data Science and Machine


Learning related to each other?
Data Science and Machine Learning are two terms that


are closely related but are often misunderstood. Both of


them deal with data. However, there are some
fundamental distinctions that show us how they are
different from each other.

Data Science is a broad field that deals with large


volumes of data and allows us to draw insights from this
voluminous data. The entire process of data science
takes care of multiple steps that are involved in drawing
insights out of the available data. This process includes
crucial steps such as data gathering, data analysis,
data manipulation, data visualization, etc.

Machine Learning, on the other hand, can be thought


of as a sub-field of data science. It also deals with data,
but here, we are solely focused on learning how to
convert the processed data into a functional model,
which can be used to map inputs to outputs, e.g., a
model that can expect an image as an input and tell us if
that image contains a flower as an output.

In short, data science deals with gathering data,


processing it, and finally, drawing insights from it. The
field of data science that deals with building models
using algorithms is called machine learning. Therefore,
machine learning is an integral part of data science.


Q18. Explain how machine learning is


different from deep learning.
A field of computer science, machine learning is a
subfield of data science that deals with using existing
data to help systems automatically learn new skills to
perform different tasks without having rules to be
explicitly programmed.

Deep Learning, on the other hand, is a field in


machine learning that deals with building machine
learning models using algorithms that try to imitate
the process of how the human brain learns from the
information in a system for it to attain new capabilities. In
deep learning, we make heavy use of deeply connected
neural networks with many layers.


Q19. Explain the differences between


supervised and unsupervised learning.

Q20. Why is Python used for data cleaning


in Data Science?
Python libraries such as Matplotlib, Pandas, Numpy,
Keras, and SciPy are extensively used for data
cleaning and analysis. These libraries are used to load
and clean the data and do effective analysis. For instance,
you might decide to remove outliers that are beyond a
certain standard deviation from the mean of a numerical
column.
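As a hedged sketch of that idea, the snippet below uses Pandas and NumPy to drop rows that lie far from the column mean; the column name, the toy values, and the 2-standard-deviation cutoff are all illustrative choices, not fixed rules:

import numpy as np
import pandas as pd

# Toy dataset; 'income' contains one extreme value.
df = pd.DataFrame({"income": [42_000, 48_000, 51_000, 55_000, 47_000, 1_000_000]})

# Keep only the rows within 2 standard deviations of the column mean.
mean, std = df["income"].mean(), df["income"].std()
cleaned = df[np.abs(df["income"] - mean) <= 2 * std]

print(cleaned)  # the 1,000,000 row is dropped as an outlier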

Intermediate Data Science Interview Questions


Q21. What is deep learning? What is the


difference between deep learning and
machine learning?

Deep learning is a paradigm of machine learning. In deep learning, multiple layers of processing are involved in order to extract high-level features from the data. The neural networks are designed in such a way that they try to simulate the human brain.

Deep learning has shown incredible performance in recent years, largely because this layered structure is loosely analogous to how the human brain processes information.

The difference between machine learning and deep


learning is that deep learning is a paradigm or a part
of machine learning that is inspired by the structure
and functions of the human brain called the artificial
neural networks.

Q22. What is a Gradient and Gradient


Descent?
GRADIENT

The gradient is the measure of how much the output of a function changes with respect to a small change in the input.


In other words, we can say that it is a measure of the change in the error with respect to a change in the weights.

The gradient can be mathematically represented as the


slope of a function.

GRADIENT DESCENT

Gradient descent is a minimization algorithm. It can minimize any function given to it, but in machine learning it is usually applied to the cost (loss) function of the model.

Gradient descent, as the name suggests, means a descent or a decrease in something. The analogy of gradient descent is often taken as a person climbing down a hill/mountain.

The update rule can be written as:

b = a − γ · ∇F(a)

So, if a person is climbing down the hill, the next position that the climber has to come to is denoted by "b" in this equation, and "a" is the current position. There is a minus sign because it denotes minimization (gradient descent is a minimization algorithm). The gamma (γ) is a weighting factor, the learning rate, and the remaining term, the gradient ∇F(a), points in the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.

This situation can be represented in a graph as follows:

Here, we are somewhere at the “Initial Weights” and we


want to reach the Global minimum. So, this
minimization algorithm will help us do that.
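A minimal sketch of this loop in Python, assuming a simple one-dimensional function F(w) = (w − 3)² whose gradient we know in closed form:

def gradient(w):
    # Gradient of F(w) = (w - 3)**2 is 2 * (w - 3).
    return 2 * (w - 3)

w = 10.0      # initial weight ("Initial Weights" in the graph)
gamma = 0.1   # the gamma weighting factor, i.e. the learning rate

for step in range(100):
    w = w - gamma * gradient(w)   # b = a - gamma * grad F(a)

print(round(w, 4))  # converges towards the global minimum at w = 3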


Q23. What is the ROC curve?


It stands for Receiver Operating Characteristic.

It is basically a plot between a true positive rate and a


false positive rate, and it helps us to find out the right
tradeoff between the true positive rate and the false
positive rate for different probability thresholds of the
predicted values.

So, the closer the curve to the upper left corner, the
better the model is. In other words, whichever curve has
greater area under it that would be the better model.
You can see this in the below graph:


Q24. What do you understand by a decision


tree?
A decision tree is a supervised learning algorithm
that is used for both classification and regression. Hence,
in this case, the dependent variable can be both a
numerical value and a categorical value.

Here, each node denotes the test on an attribute, and


each edge denotes the outcome of that attribute, and
each leaf node holds the class label. So, in this case, we
have a series of test conditions which give the final
decision according to the condition.

Q25. What do you understand by a random


forest model?
It combines multiple models together to get the final
output or, to be more precise, it combines multiple
decision trees together to get the final output.


So, decision trees are the building blocks of the random


forest model.

Q26. How is Data modeling different from


Database design?
Data Modeling:
It can be considered as the first step towards the
design of a database.
Data modeling creates a conceptual model based on
the relationship between various data models.
The process involves moving from the conceptual
stage to the logical model to the physical schema.
It involves the systematic method of applying data
modeling techniques.

Database Design:
This is the process of designing the database.
The database design creates an output which is a
detailed data model of the database.
Strictly speaking, database design includes the
detailed logical model of a database but it can also
include physical design choices and storage
parameters.

Q27. What is precision?


When we are implementing algorithms for the classification of data or the retrieval of information, precision tells us what proportion of the predicted positive values are actually positive. In other words, it measures how accurate the positive predictions are.

The formula to calculate precision is:

Precision = TP / (TP + FP)

Q28. What is a recall?


Recall is the proportion of actual positive instances that are correctly predicted as positive. It helps us identify the positive instances that were missed (misclassified as negative).

The formula to calculate recall is:

Recall = TP / (TP + FN)


Q29. What is the F1 score and how to


calculate it?
The F1 score is the harmonic mean of precision and recall, and it summarizes the test's accuracy in a single number.

If F1 = 1, both precision and recall are perfect. The closer F1 is to 0, the less accurate precision and/or recall are, with 0 meaning one of them is completely inaccurate.

The formula to calculate the F1 score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
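For illustration, all three metrics can be computed with scikit-learn; the labels below are made-up toy data:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)  # 0.8 0.8 0.8 for this toy example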

Q30. Why do we use p-value?


We use the p-value to understand whether the given data really supports the observed effect or not.

The p-value for an observed effect 'E', under the assumption that the null hypothesis 'H0' is true, is:

p-value = P(E | H0)


Q31. What is the difference between an


error and a residual error?
An error is the difference between the observed values and the true (population) values of a dataset. The residual error, on the other hand, is the difference between the observed values and the values predicted by the model.

The reason we use the residual error to evaluate the performance of an algorithm is that the true values are never known. Hence, we use the observed values to measure the error using residuals. This gives us a usable estimate of the error.

Q32. Why do we use the summary function?


The summary function in R gives us the statistics of the
implemented algorithm on a particular dataset. It
consists of various objects, variables, data attributes, etc.
It provides summary statistics for individual objects when
fed into the function.

We use a summary function when we want information


about the values present in the dataset. It gives us the
summary statistics in the following form:


Here, it gives the minimum and maximum values from a


specific column of the dataset. Also, it provides the
median, mean, 1st quartile, and 3rd quartile values that
help us understand the values better.

Q33. Explain univariate, bivariate, and


multivariate analyses.
When we are dealing with data analysis, we often come
across terms such as univariate, bivariate, and
multivariate.

Let’s try and understand what these mean.

Univariate analysis: Univariate analysis involves


analyzing data with only one variable or, in other
words, a single column or a vector of the data. This
analysis allows us to understand the data and extract
patterns and trends from it. Example: Analyzing the
weight of a group of people.


Bivariate analysis: Bivariate analysis involves


analyzing the data with exactly two variables or, in
other words, the data can be put into a two-column
table. This kind of analysis allows us to figure out the
relationship between the variables. Example:
Analyzing the data that contains temperature and
altitude.

Multivariate analysis: Multivariate analysis involves


analyzing the data with more than two variables. The
number of columns of the data can be anything
more than two. This kind of analysis allows us to
figure out the effects of all other variables (input
variables) on a single variable (the output variable).

Example: Analyzing data about house prices, which


contains information about the houses, such as locality,
crime rate, area, the number of floors, etc.


Q34. What is the benefit of dimensionality


reduction?
Dimensionality reduction reduces the dimensions
and size of the entire dataset. It drops unnecessary
features while retaining the overall information in the
data intact. Reduction in dimensions leads to faster
processing of the data.

The reason why data with high dimensions is considered


so difficult to deal with is that it leads to high time
consumption while processing the data and training a
model on it. Reducing dimensions speeds up this
process, removes noise, and also leads to better model
accuracy.

Q35.What is a kernel function in SVM?

In the SVM algorithm, a kernel function is a special


mathematical function.

In simple terms, a kernel function takes data as


input and converts it into a required form. This
transformation of the data is based on something called
a kernel trick, which is what gives the kernel function its
name. Using the kernel function, we can transform the
data that is not linearly separable (cannot be separated
using a straight line) into one that is linearly separable.
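As a minimal sketch with scikit-learn: the concentric-circles toy dataset is not linearly separable, but an SVM with an RBF kernel handles it; the dataset and parameter values here are purely illustrative:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles cannot be separated by a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data to a space where a linear separator works.
model = SVC(kernel="rbf").fit(X, y)
print("Training accuracy:", model.score(X, y))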


Q36. What does it mean when the p-values


are high and low?
A p-value is the measure of the probability of having
results equal to or more than the results achieved under
a specific hypothesis assuming that the null hypothesis is
correct. This represents the probability that the observed
difference occurred randomly by chance.

A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the observed data would be unlikely if the null hypothesis were true.
A high p-value (≥ 0.05) indicates strength in favor of the null hypothesis: the data is likely under a true null.
A p-value of exactly 0.05 means the result could go either way.

Q37. When is resampling done?


Resampling is a methodology used to sample data for
improving accuracy and quantify the uncertainty of
population parameters. It is done to ensure the model is
good enough by training the model on different patterns
of a dataset to ensure variations are handled. It is also
done in the cases where models need to be validated
using random subsets or when substituting labels on
data points while performing tests.


Q38. Define confounding variables.


Confounding variables are also known as confounders. These variables are a type of extraneous variable that influences both the independent and dependent variables, causing a spurious association and a mathematical relationship between variables that are associated but not causally related to each other.

Q39. Define and explain selection bias?


The selection bias occurs in the case when the
researcher has to make a decision on which participant
to study. The selection bias is associated with those
researches when the participant selection is not random.
The selection bias is also called the selection effect. The
selection bias is caused by as a result of the method of
sample collection.

Four types of selection bias are explained below:

Sampling Bias: As a result of a population that is not


random at all, some members of a population have
fewer chances of getting included than others,
resulting in a biased sample. This causes a systematic
error known as sampling bias.


Time interval: Trials may be stopped early when an extreme value is reached, but if all variables are similar in variance, the variable with the highest variance has a higher chance of achieving the extreme value first.

Data: It is when specific data is selected arbitrarily


and the generally agreed criteria are not followed.

Attrition: Attrition in this context means the loss of


the participants. It is the discounting of those
subjects that did not complete the trial.

Q40. Define bias-variance trade-off?

Let us first understand the meaning of bias and variance


in detail:

Bias: It is a kind of error in a machine learning model


when an ML Algorithm is oversimplified. When a
model is trained, at that time it makes simplified
assumptions so that it can easily understand the
target function. Some algorithms that have low bias
are Decision Trees, SVM, etc. On the other hand,
logistic and linear regression algorithms are the ones
with a high bias.

Variance: Variance is also a kind of error. It is


introduced into an ML Model when an ML algorithm is


made highly complex. This model also learns noise from the data set that is meant for training. It then performs badly on the test data set. This may lead to overfitting as well as high sensitivity.

When the complexity of a model is increased, a reduction in the error is seen. This is caused by the lower bias in the model. However, this only continues until we reach a particular point, called the optimal point. After this point, if we keep increasing the complexity of the model, it will become overfitted and will suffer from the problem of high variance.

We can represent this situation with the help of a graph


as shown below:


As you can see from the image, before the optimal point,
increasing the complexity of the model reduces the error
(bias). However, after the optimal point, we see that the
increase in the complexity of the machine learning
model increases the variance.

Trade-off Of Bias And Variance: So, as we know that


bias and variance, both are errors in machine learning
models, it is very essential that any machine learning
model has low variance as well as a low bias so that it
can achieve good performance.

Let us see some examples. The K-Nearest Neighbor


Algorithm is a good example of an algorithm with
low bias and high variance. This trade-off can easily be
reversed by increasing the k value which in turn results in
increasing the number of neighbours. This, in turn, results
in increasing the bias and reducing the variance.

Another example can be the algorithm of a support


vector machine. This algorithm also has a high variance
and obviously, a low bias and we can reverse the trade-
off by increasing the value of parameter C. Thus,
increasing the C parameter increases the bias and
decreases the variance.

So, the trade-off is simple. If we increase the bias, the


variance will decrease and vice versa.

Q41. Define the confusion matrix?


It is a matrix that has 2 rows and 2 columns. It has 4
outputs that a binary classifier provides to it. It is used to
derive various measures like specificity, error rate,
accuracy, precision, sensitivity, and recall.

The test data set contains the correct (observed) labels along with the predicted labels. If the binary classifier performs perfectly, the predicted labels are exactly the same as the observed labels; in real-world scenarios they match only part of the observed labels.

The four outcomes in the confusion matrix mean the following:


True Positive: This means that the positive


prediction is correct.
False Positive: This means that the positive
prediction is incorrect.
True Negative: This means that the negative
prediction is correct.
False Negative: This means that the negative
prediction is incorrect.

The formulas for calculating basic measures that comes


from the confusion matrix are:

Error rate = (FP + FN) / (P + N)
Accuracy = (TP + TN) / (P + N)
Sensitivity = TP / P
Specificity = TN / N
Precision = TP / (TP + FP)
F-score = (1 + b²)(Precision × Recall) / (b² × Precision + Recall), where b is usually 0.5, 1, or 2

In these formulas:

FP = false positive
FN = false negative
TP = true positive
TN = true negative


Also,

Sensitivity is the measure of the True Positive Rate. It


is also called recall.

Specificity is the measure of the true negative rate.

Precision is the measure of a positive predicted


value.

F-score is the harmonic mean of precision and recall.

Q42. What is logistic regression?


State an example where you have recently
used logistic regression.
Logistic Regression is also known as the logit model. It is
a technique to predict the binary outcome from a
linear combination of variables (called the predictor
variables).

For example, let us say that we want to predict the


outcome of elections for a particular political leader. So,
we want to find out whether this leader is going to win
the election or not. So, the result is binary i.e. win (1) or
loss (0). However, the input is a combination of linear
variables like the money spent on advertising, the past
work done by the leader and the party, etc.


Q43. What is Linear Regression? What are


some of the major drawbacks of the linear
model?
Linear regression is a technique in which the score of a
variable Y is predicted using the score of a predictor
variable X. Y is called the criterion variable. Some of the
drawbacks of Linear Regression are as follows:

The assumption of a linear relationship between the predictors and the outcome is a major drawback.
It cannot be used for binary outcomes; we have Logistic Regression for that.
It is prone to overfitting and to being distorted by outliers, which the basic model cannot address on its own.

Q44. What is a random forest? Explain it’s


working.

Classification is very important in machine learning. It is


very important to know to which class does an
observation belongs. Hence, we have various
classification algorithms in machine learning like
logistic regression, support vector machine, decision
trees, Naive Bayes classifier, etc. One such
classification technique that is near the top of the
classification hierarchy is the random forest classifier.


So, first we need to understand a decision tree before we can understand the random forest classifier and how it works.

So, let us say that we have a string as given below:

So, we have the string with 5 ones and 4 zeroes and we


want to classify the characters of this string using their
features. These features are colour (red or green in this
case) and whether the observation (i.e. character) is
underlined or not.

Now, let us say that we are only interested in red and


underlined observations. So, the decision tree would look
something like this:


So, we started with the colour first as we are only


interested in the red observations and we separated the
red and the green-coloured characters. After that, the
“No” branch i.e. the branch that had all the green
coloured characters was not expanded further as we
want only red-underlined characters. So, we expanded
the “Yes” branch and we again got a “Yes” and a “No”
branch based on the fact whether the characters were
underlined or not.

So, this is how we draw a typical decision tree. However,


the data in real life is not this clean but this was just to
give an idea about the working of the decision trees. Let
us now move to the random forest.

RANDOM FOREST

It consists of a large number of decision trees that


operate as an ensemble. Basically, each tree in the
forest gives a class prediction and the one with the
maximum number of votes becomes the prediction of
our model. For instance, in the example shown below, 4
decision trees predict 1, and 2 predict 0. Hence,
prediction 1 will be considered.


The underlying principle of a random forest is that several weak learners combine to form a strong learner.

The steps to build a random forest are as follows:

Build several decision trees on bootstrap samples of the data and record their predictions.
Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all the p predictors. This happens for every tree in the random forest.
As a rule of thumb, at each split use m ≈ √p.
Combine the predictions using the majority rule.


Q45. How can we select an appropriate


value of k in k-means?
Selecting the correct value of k is an important aspect of
k-means clustering. We can make use of the elbow
method to pick the appropriate k value. To do this, we
run the k-means algorithm on a range of values, e.g., 1 to
15. For each value of k, we compute a score called the inertia, which is the intra-cluster (within-cluster) variance.

Inertia is calculated as the sum of squared distances of all points in a cluster from the cluster centre. As k increases from a low value, we initially see a sharp decrease in the inertia. After a certain value of k, the drop in the inertia becomes quite small; this "elbow" is the value of k that we should choose for the k-means clustering algorithm.
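A rough sketch of the elbow method with scikit-learn, using a synthetic dataset with 4 blobs (all parameter values are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k and record the inertia (within-cluster sum of squares).
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))

# The "elbow", where the drop in inertia flattens out, suggests k = 4 for this data.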

Q46. How can we deal with outliers?

Outliers can be dealt with in several ways. One way is to


drop them. We can only drop the outliers if they have
values that are incorrect or extreme. For example, if a
dataset with the weights of babies has a value 98.6-
degree Fahrenheit, then it is incorrect. Now, if the value is
187 kg, then it is an extreme value, which is not useful for
our model.


In case the outliers are not that extreme, then we can try:

A different kind of model. For example, if we were


using a linear model, then we can choose a non-
linear model.

Normalizing the data, which will shift the extreme


values closer to other data points.

Using algorithms that are not so affected by outliers,


such as random forest, etc.

Q47. How to calculate the accuracy of a


binary classification algorithm using its
confusion matrix?
In a binary classification algorithm, we have only two
labels, which are True and False. Before we can calculate
the accuracy, we need to understand a few key terms:

True positives: Number of observations correctly


classified as True
True negatives: Number of observations correctly
classified as False
False positives: Number of observations incorrectly
classified as True
False negatives: Number of observations incorrectly
classified as False


To calculate the accuracy, we need to divide the sum of


the correctly classified observations by the number of
total observations.
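For example, with some hypothetical confusion-matrix counts:

# Hypothetical counts from a confusion matrix.
tp, tn, fp, fn = 50, 35, 5, 10

# Accuracy = correctly classified / all observations.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85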

Q48. What is ensemble learning?

When we are building models using Data Science and


Machine Learning, our goal is to get a model that can
understand the underlying trends in the training data
and can make predictions or classifications with a high
level of accuracy.

However, sometimes some datasets are very complex,


and it is difficult for one model to be able to grasp the
underlying trends in these datasets. In such situations,
we combine several individual models together to
improve performance. This is what is called ensemble
learning.

Q49. Explain collaborative filtering in


recommender systems.

Collaborative filtering is a technique used to build recommender systems. In this technique, to generate recommendations for a user, we make use of data about the likes and dislikes of other users who are similar to that user. This similarity is estimated based on several varying factors, such as age, gender, locality, etc.


If User A, similar to User B, watched and liked a movie,


then that movie will be recommended to User B, and
similarly, if User B watched and liked a movie, then that
would be recommended to User A.

In other words, the content of the movie does not matter


much. When recommending it to a user what matters is
if other users similar to that particular user liked the
content of the movie or not.

Q50. Explain content-based filtering in


recommender systems.

Content-based filtering is one of the techniques used to


build recommender systems. In this technique,
recommendations are generated by making use of the
properties of the content that a user is interested in.

For example, if a user is watching movies belonging to


the action and mystery genre and giving them good
ratings, it is a clear indication that the user likes movies of
this kind. If shown movies of a similar genre as
recommendations, there is a higher probability that the
user would like those recommendations as well.

In other words, here, the content of the movie is taken


into consideration when generating recommendations
for users.


Q51. Explain bagging in Data Science.

Bagging is an ensemble learning method; it stands for bootstrap aggregating. In this technique, we generate data using the bootstrap method: from an existing dataset we draw multiple samples of size N (with replacement). This bootstrapped data is then used to train multiple models in parallel, which makes the bagging model more robust than a simple model.

Once all the models are trained and it is time to make a prediction, we make predictions using all the trained models. We then average the results in the case of regression and, for classification, choose the result generated by the models with the highest frequency (the majority vote).
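As an illustrative sketch, scikit-learn's BaggingClassifier trains decision trees (its default base estimator) on bootstrap samples and aggregates their votes; the dataset and settings are arbitrary here:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))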

Q52. Explain boosting in Data science.

Boosting is one of the ensemble learning methods. Unlike


bagging, it is not a technique used to parallelly train our
models. In boosting, we create multiple models and
sequentially train them by combining weak models
iteratively in a way that training a new model depends on
the models trained before it.

In doing so, we take the patterns learned by a previous


model and test them on a dataset when training the new
model. In each iteration, we give more importance

to observations in the dataset that are incorrectly


handled or predicted by previous models. Boosting is
useful in reducing bias in models as well.

Q53. Explain stacking in Data science.

Just like bagging and boosting, stacking is also an


ensemble learning method. In bagging and boosting, we
could only combine weak models that used the same
learning algorithms, e.g., logistic regression. These
models are called homogeneous learners.

However, in stacking, we can combine weak models


that use different learning algorithms as well. These
learners are called heterogeneous learners. Stacking
works by training multiple (and different) weak models
or learners and then using them together by training
another model, called a meta-model, to make
predictions based on the multiple outputs of predictions
returned by these multiple weak models.
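A minimal sketch with scikit-learn's StackingClassifier, combining heterogeneous base learners with a logistic-regression meta-model; the particular learners chosen here are purely illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Two different weak learners plus a meta-model that learns from their outputs.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print("Training accuracy:", stack.fit(X, y).score(X, y))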

Q54. What does the word ‘Naive’ mean in


Naive Bayes?

Naive Bayes is a data science algorithm. It has the word


‘Bayes’ in it because it is based on the Bayes theorem,
which deals with the probability of an event occurring


given that another event has already occurred.


It has ‘naive’ in it because it makes the assumption that
each variable in the dataset is independent of the other.
This kind of assumption is unrealistic for real-world data.
However, even with this assumption, it is very useful for
solving a range of complicated problems, e.g., spam
email classification, etc.

Q55. What is Batch normalization?

Batch normalization is a method for improving the performance and stability of a neural network. It normalizes the inputs to each layer so that the mean output activation stays close to 0 and the standard deviation close to 1.
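As a sketch, assuming TensorFlow/Keras is available, batch normalization is usually added as a layer after dense (or convolutional) layers; the layer sizes below are arbitrary:

from tensorflow import keras

# A small fully connected network with batch normalization after each hidden layer.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.BatchNormalization(),  # normalizes activations per mini-batch
    keras.layers.Dense(32, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()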

Q56. What do you understand from cluster


sampling and systematic sampling?

Cluster sampling is a probability sampling approach in which you divide a population into groups, such as districts or schools, and then randomly select a representative sample from among these groups. Each cluster should contain a modest representation of the population as a whole.

A probability sampling strategy called systematic


sampling involves picking people from the population at


regular intervals, such as every 15th person on a


population list. The population can be organized
randomly to mimic the benefits of simple random
sampling.

Q57. What is the Computational Graph?

A directed graph with variables or operations as nodes is


a computational graph. Variables can contribute to
operations with their value, and operations can
contribute their output to other operations. In this
manner, each node in the graph establishes a function of
the variables.

Q58. What is the difference between


Batch and Stochastic Gradient Descent?

The differences between Batch and Stochastic Gradient Descent are as follows:

In batch gradient descent, the gradient is computed on the entire training set before each weight update, so every update is stable but computationally expensive on large datasets.
In stochastic gradient descent, the weights are updated after computing the gradient on a single training example (or a small mini-batch), so updates are much cheaper and more frequent but noisier.

Q59. What is an Activation function?

An activation function is a function that is incorporated into an artificial neural network to help the network learn complicated patterns in the input data. Mirroring the neuron-based model seen in human brains, the activation function decides, at the very end, what signal should be sent to the following neuron.

Q60. How Do You Build a Random forest


model?

The steps for creating a random forest model are as


follows:

Randomly choose n records from a dataset of k records.
Build a separate decision tree for each sample, with each tree producing a predicted result.
Each of the results is subjected to a voting mechanism.
The final outcome is the prediction that receives the most votes (see the sketch below).
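The same procedure, sketched with scikit-learn's RandomForestClassifier on the Iris dataset; the number of trees and the dataset are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each built on a bootstrap sample; predictions are combined by voting.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))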


Q61. Can you avoid overfitting your model?


if yes, then how?

In practice, data models can overfit. The strategies listed below can be applied to avoid it:

Increase the amount of data in the dataset under


study to make it simpler to separate the links
between the input and output variables.
To discover important traits or parameters that need
to be examined, use feature selection.
Use regularization strategies to lessen the variation of
the outcomes a data model generates.
Rarely, datasets are stabilized by adding a little
amount of noisy data. This practice is called data
augmentation.

Q62. What is Cross Validation?

Cross-validation is a model validation method used to


assess the generalizability of statistical analysis results to
other data sets. It is frequently applied when forecasting
is the main objective and one wants to gauge how well a
model will work in real-world applications.

In order to prevent overfitting and gather knowledge on


how the model will generalize to different data sets,
cross-validation aims to establish a data set to test the


model during the training phase (i.e. validation data set).
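A minimal sketch of k-fold cross-validation with scikit-learn; the model, dataset, and the choice of 5 folds are illustrative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once as the validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())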

Q63. What is variance in Data Science?

Variance is a type of error that occurs in a Data Science


model when the model ends up being too complex and
learns features from data, along with the noise that exists
in it.

This kind of error can occur if the algorithm used to train


the model has high complexity, even though the data
and the underlying patterns and trends are quite easy to
discover. This makes the model a very sensitive one that
performs well on the training dataset but poorly on the
testing dataset, and on any kind of data that the model
has not yet seen.

Variance generally leads to poor accuracy in testing and


results in overfitting.

Q64. What is pruning in a Decision Tree


algorithm?

Pruning a decision tree is the process of removing the


sections of the tree that are not necessary or are
redundant. Pruning leads to a smaller decision tree,
which performs better and gives higher accuracy and
speed.

Q65. What is entropy in a decision tree


algorithm?

In a decision tree algorithm, entropy is the measure of


impurity or randomness. The entropy of a given dataset
tells us how pure or impure the values of the dataset are.
In simple terms, it tells us about the variance in the
dataset.

For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0, as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy of the box rises to about 0.97 bits.
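A small sketch of this calculation (Shannon entropy in bits), matching the marble example above:

import math

def entropy(probabilities):
    # Shannon entropy in bits: -sum(p * log2(p)) over the non-zero class probabilities.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 0.0   -> 10 blue marbles, no impurity
print(entropy([0.6, 0.4]))   # ~0.97 -> 6 blue and 4 red marbles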

Additionally, in a decision tree algorithm, multi-class


entropy is a measure used to evaluate the impurity or
disorder of a dataset with respect to the class labels
when there are multiple classes involved. It is commonly
used as a criterion to make decisions about splitting
nodes in a decision tree.


Q66. What is information gain in a decision tree algorithm?
When building a decision tree, at each step, we have to
create a node that decides which feature we should use
to split data, i.e., which feature would best separate our
data so that we can make predictions. This decision is
made using information gain, which is a measure of how
much entropy is reduced when a particular feature is
used to split the data. The feature that gives the highest
information gain is the one that is chosen to split the
data.

Let’s consider a practical example to gain a better


understanding of how information gain operates within
a decision tree algorithm. Imagine we have a dataset
containing customer information such as age, income,
and purchase history. Our objective is to predict whether
a customer will make a purchase or not.

To determine which attribute provides the most valuable


information, we calculate the information gain for each
attribute. If splitting the data based on income leads to
subsets with significantly reduced entropy, it indicates
that income plays a crucial role in predicting purchase
behavior. Consequently, income becomes a crucial factor
in constructing the decision tree as it offers valuable
insights.


By maximizing information gain, the decision tree


algorithm identifies attributes that effectively reduce
uncertainty and enable accurate splits. This process
enhances the model’s predictive accuracy, enabling
informed decisions pertaining to customer purchases.

Advanced Data Science Interview Questions


Q67. How are the time series problems


different from other regression problems?
Time series analysis can be thought of as an extension of linear regression that uses concepts like autocorrelation and moving averages to summarize historical values of the target (y-axis) variable in order to predict the future better.

Forecasting and prediction is the main goal of time


series problems where accurate predictions can be
made but sometimes the underlying reasons might
not be known.

Having Time in the problem does not necessarily


mean it becomes a time series problem. There should
be a relationship between target and time for a
problem to become a time series problem.

Observations close to one another in time are expected to be more similar than observations far apart, which accounts for seasonality. For instance, today's weather would be similar to tomorrow's weather but not to the weather 4 months from today. Hence, weather prediction based on past data becomes a time series problem.


Q68. What are RMSE and MSE in a linear


regression model?

RMSE:

RMSE stands for Root Mean Square Error. In a linear


regression model, RMSE is used to test the performance
of the machine learning model. It is used to evaluate the
data spread around the line of best fit. So, in simple
words, it is used to measure the deviation of the
residuals.

RMSE is calculated using the formula:

RMSE = sqrt( (1/N) × Σ (Yi − Ŷi)² )

where:
Yi is the actual value of the output variable,
Ŷi (Y cap) is the predicted value, and
N is the number of data points.

MSE:

Mean Squared Error is used to find how close the line is to the actual data. We take the difference (the distance) of each data point from the line and square it. Doing this for all the data points and dividing the sum of the squared differences by the total number of data points gives us the Mean Squared Error (MSE).

So, if we are taking the squared difference of N data


points and dividing the sum by N, what does it mean?
Yes, it represents the average of the squared difference of
a data point from the line i.e. the average of the squared
difference between the actual and the predicted values.

The formula for finding MSE is given below:

MSE = (1/N) × Σ (Yi − Ŷi)²



Yi is the actual value of the output variable (the ith data point),
Ŷi (Y cap) is the predicted value, and
N is the total number of data points.

So, RMSE is the square root of MSE.
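A quick sketch of both metrics with NumPy; the actual and predicted values are made up:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predicted = np.array([2.8, 5.4, 7.0, 10.6])

mse = np.mean((y_actual - y_predicted) ** 2)  # average squared residual
rmse = np.sqrt(mse)                           # square root of MSE

print(mse, rmse)  # 0.2025 and 0.45 for these toy values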

Q69. What are Support Vectors in SVM


(Support Vector Machine)?

In the above diagram, we can see that the thin lines


mark the distance from the classifier to the closest data
points (darkened data points). These are called support
vectors. So, we can define the support vectors as the data


points or vectors that are nearest (closest) to the


hyperplane. They affect the position of the hyperplane.
Since they support the hyperplane, they are known as
support vectors.

Q70. Explain Neural Network Fundamentals.

In the human brain, different neurons are present. These


neurons combine and perform various tasks. The Neural
Network in deep learning tries to imitate human brain
neurons. The neural network learns the patterns from the
data and uses the knowledge that it gains from various
patterns to predict the output for new data, without any
human assistance.

A perceptron is the simplest neural network that


contains a single neuron that performs 2 functions. The
first function is to perform the weighted sum of all the
inputs and the second is an activation function.


There are some other neural networks that are more


complicated. Such networks consist of the following
three layers:

Input Layer: The neural network has the input layer to


receive the input

Hidden Layer: There can be multiple hidden layers


between the input layer and the output layer. The initial hidden layers are used for detecting low-level patterns, whereas the later layers are responsible for combining outputs from previous layers to find more complex patterns.

Output Layer: This layer outputs the prediction.

An example neural network image is shown below:


Q71. What is Generative Adversarial


Network?
This approach can be understood with the famous
example of the wine seller.

Let us say that there is a wine seller who has his own
shop. This wine seller purchases wine from the dealers
who sell him the wine at a low cost so that he can sell
the wine at a high cost to the customers. Now, let us say
that the dealers whom he is purchasing the wine from,
are selling him fake wine. They do this as the fake wine
costs way less than the original wine and the fake and
the real wine are indistinguishable to a normal consumer
(customer in this case). The shop owner has some friends
who are wine experts and he sends his wine to them
every time before keeping the stock for sale in his shop.
So, his friends, the wine experts, give him feedback that
the wine is probably fake. Since the wine seller has been
purchasing the wine for a long time from the same
dealers, he wants to make sure that their feedback is
right before he complains to the dealers about it. Now,
let us say that the dealers also have got a tip from
somewhere that the wine seller is suspicious of them.

So, in this situation, the dealers will try their best to sell
the fake wine whereas the wine seller will try his best to
identify the fake wine. Let us see this with the help of a
diagram shown below:
From the image above, it is clear that a noise vector enters the generator (the dealer), which generates the fake wine, and the discriminator has to distinguish between the fake wine and the real wine. This is a Generative Adversarial Network (GAN).

In a GAN, there are 2 main components, viz.

Generator and,
Discriminator.

So, for image data, the generator is typically a CNN that keeps producing images, and the discriminator tries to identify the real images from the fake ones.
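
A minimal structural sketch of these two components (assuming TensorFlow/Keras; the layer sizes and the 784-dimensional flattened "image" are arbitrary illustrative choices):

    import tensorflow as tf
    from tensorflow.keras import layers

    # generator: turns a 100-dimensional noise vector into a fake 784-dimensional sample
    generator = tf.keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(100,)),
        layers.Dense(784, activation="tanh"),
    ])

    # discriminator: outputs the probability that a 784-dimensional sample is real
    discriminator = tf.keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(784,)),
        layers.Dense(1, activation="sigmoid"),
    ])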

Q72. What is a computational graph?

A computational graph is also known as a “Dataflow Graph”. Everything in the famous deep learning library TensorFlow is based on the computational graph. A computational graph in TensorFlow is a network of nodes, where each node performs an operation: the nodes of the graph represent operations and the edges represent the tensors that flow between them.
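
A small sketch of this idea (assuming TensorFlow 2.x, where tf.function traces Python code into a graph; the function itself is made up for illustration):

    import tensorflow as tf

    @tf.function                               # traces this Python function into a computational graph
    def f(x, y):
        return tf.add(tf.multiply(x, y), y)    # nodes: multiply and add; edges: the tensors between them

    print(f(tf.constant(2.0), tf.constant(3.0)))   # tf.Tensor(9.0, shape=(), dtype=float32)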

Q73. What are auto-encoders?

Auto-encoders are learning networks that transform inputs into outputs with the minimum possible error; in other words, the output should be as close to the input as possible.

Multiple layers are added between the input and the output layer, and these intermediate (hidden) layers are smaller than the input layer, forming a bottleneck. The network receives unlabelled input, encodes it into this compressed representation, and later uses it to reconstruct the input.
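
A minimal Keras sketch of this idea (assuming TensorFlow/Keras; the 784-dimensional input and 32-dimensional bottleneck are illustrative choices only):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    inputs = tf.keras.Input(shape=(784,))
    encoded = layers.Dense(32, activation="relu")(inputs)        # bottleneck smaller than the input
    decoded = layers.Dense(784, activation="sigmoid")(encoded)   # reconstruction of the input

    autoencoder = models.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(X, X, epochs=10)   # trained to reproduce its own (unlabelled) input X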

Q74. What are Exploding Gradients and


Vanishing Gradients?

Exploding Gradients: Let us say that you are training


an RNN. Say, you saw exponentially growing error


gradients that accumulate, and as a result of this, very
large updates are made to the neural network model
weights. These exponentially growing error gradients
that update the neural network weights to a great extent
are called Exploding Gradients.

Vanishing Gradients: Let us say again that you are training an RNN. Say the gradients become too small as they are propagated back through the layers, so the weight updates become negligible. This problem of the gradients becoming vanishingly small is called Vanishing Gradient. It causes a major increase in training time and leads to poor performance and extremely low accuracy.

Q75. What is the p-value and what does it


indicate in the Null Hypothesis?

The p-value is a number that ranges from 0 to 1. In a hypothesis test in statistics, the p-value tells us how strong the evidence is: it is the probability of observing results at least as extreme as ours if the Null Hypothesis were true. The claim that is put to the test in the experiment or trial is called the Null Hypothesis.

A low p-value i.e. p-value less than or equal to 0.05


indicates the strength of the results against the Null
Hypothesis which in turn means that the Null
Hypothesis can be rejected.

A high p-value i.e. p-value greater than 0.05


indicates weak evidence against the Null Hypothesis, which in turn means that we fail to reject the Null Hypothesis (strictly speaking, we do not "accept" it; we simply lack sufficient evidence against it).
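
A small sketch of computing a p-value (assuming SciPy is available; the sample values and the hypothesised mean of 3.0 are made up for illustration):

    import numpy as np
    from scipy import stats

    sample = np.array([2.9, 3.1, 3.4, 2.8, 3.2, 3.0, 3.3])

    # H0 (Null Hypothesis): the population mean equals 3.0
    t_stat, p_value = stats.ttest_1samp(sample, popmean=3.0)

    if p_value <= 0.05:
        print("Low p-value: reject the Null Hypothesis")
    else:
        print("High p-value: fail to reject the Null Hypothesis")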

Q76. Can you tell us why TensorFlow is the


most preferred library in deep learning?

TensorFlow is a very famous library in deep learning, and the reasons are fairly simple.

It provides C++ as well as Python APIs, which makes it easier to work with.
TensorFlow has a faster compilation speed compared to Keras and Torch (other famous deep learning libraries).
TensorFlow supports both GPU and CPU computing devices.

Hence, it is a major success and a very popular library


for deep learning.

Q77. What is Cross-Validation?

Cross-Validation is a statistical technique used for evaluating and improving a model's performance. Here, the model is trained and tested in rotation using different samples (folds) of the training dataset to ensure that the model performs well on unknown data.
The training data will be split into various groups and the
model is run and validated against these groups in
rotation.

The most commonly used techniques are:

K- Fold method
Leave p-out method
Leave-one-out method
Holdout method
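
For instance, K-fold cross-validation takes only a couple of lines with scikit-learn (a sketch assuming scikit-learn is installed; the iris data and logistic regression model are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # 5-fold cross-validation
    print(scores.mean(), scores.std())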

Q78. What are the differences between


correlation and covariance?
Although these two terms are used for establishing a
relationship and dependency between any two random
variables, the following are the differences between
them:
Correlation: This technique is used to measure and estimate the quantitative relationship between two variables and is measured in terms of how strongly the two variables are related.

Covariance: It represents the extent to which the two variables change together. This explains the systematic relationship between a pair of variables, where changes in one are associated with changes in the other variable.

Mathematically, consider two random variables X and Y, with means μX and μY, standard deviations σX and σY, and let E represent the expected value operator. Then:

    Cov(X, Y) = E[(X − μX)(Y − μY)]

    Corr(X, Y) = Cov(X, Y) / (σX · σY)

Based on the above formula, we can deduce that the


correlation is dimensionless whereas covariance is
represented in units that are obtained from the
multiplication of units of two variables.
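
A quick numerical check of this (a sketch assuming NumPy; the sample values are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

    print(np.cov(x, y)[0, 1])        # covariance: carries the units of x times the units of y
    print(np.corrcoef(x, y)[0, 1])   # correlation: dimensionless, between -1 and 1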


The following image graphically shows the difference


between correlation and covariance:

Q79. How do you approach solving any data


analytics based project?

Generally, we follow the below steps:

The first step is to thoroughly understand the


business requirement/problem

Next, explore the given data and analyze it carefully. If


you find any data missing, get the requirements
clarified from the business.

Data cleanup and preparation step is to be


performed next which is then used for modelling.
Here, the missing values are found and the variables
are transformed.

Run your model against the data, build meaningful


visualization and analyze the results to get
meaningful insights.

Release the model implementation, and track the


results and performance over a specified period to
analyze the usefulness.

Perform cross-validation of the model.


Q80. What is selection bias and why does it matter?

Selection Bias happens in cases where proper randomization is not achieved while picking a part of the dataset for analysis. This bias means that the sample analyzed does not represent the whole population meant to be analyzed.

For example, if the sample we selected does not entirely represent the whole population that we have, any conclusions drawn from it may be misleading. Recognizing this helps us question whether we have selected the right data for analysis or not.

Q81. Why is Data cleaning crucial?


How do you clean the data?

While running an algorithm on any data, to gather


proper insights, it is very much necessary to have
correct and clean data that contains only relevant
information. Dirty data most often results in poor or
incorrect insights and predictions which can have
damaging effects.

For example, while launching any big campaign to


market a product, if our data analysis tells us to target a
product that in reality has no demand and if the
campaign is launched, it is bound to fail. This results in
a loss of the company’s revenue.
This is where the importance of having proper and clean


data comes into the picture.

Cleaning the data coming from different sources helps in data transformation and results in data that data scientists can actually work with.

Properly cleaned data increases the accuracy of the


model and provides very good predictions.

If the dataset is very large, it becomes cumbersome to run models on it. The data cleanup step takes a lot of time (often around 80% of a project's time) when the data is huge, and it cannot simply be folded into the model-running step. Hence, cleaning data before running the model results in increased speed and efficiency of the model.

Data cleaning helps to identify and fix any structural


issues in the data. It also helps in removing any
duplicates and helps to maintain the consistency of
the data.

The following diagram represents the advantages of


data cleaning:
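
In practice, typical cleaning steps look like the following sketch (assuming pandas is installed; the tiny DataFrame and its columns are invented purely for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "name": ["Ann", "Ann", "Bob", "Cara"],
        "amount": ["100", "100", None, "250"],
    })

    df = df.drop_duplicates()                      # remove duplicate rows
    df["amount"] = pd.to_numeric(df["amount"])     # fix inconsistent data types
    df = df.dropna(subset=["amount"])              # drop rows with missing amounts
    print(df)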

Q82. What are the available feature


selection methods for selecting the right
variables for building efficient predictive
models?
While using a dataset in data science or machine learning algorithms, it so happens that not all the variables are necessary and useful to build a model. Smarter feature selection methods are required to weed out redundant variables and increase the efficiency of our model.
Following are the three main methods in feature


selection:

FILTER METHODS:
These methods pick up only the intrinsic properties of
features that are measured via univariate statistics
and not cross-validated performance. They are
straightforward and are generally faster and require
less computational resources when compared to
wrapper methods.

There are various filter methods such as the:

Chi-Square test,
Fisher’s Score method,
Correlation Coefficient,
Variance Threshold,
Mean Absolute Difference (MAD) method,
Dispersion Ratios, etc

WRAPPER METHODS:

These methods greedily search over possible feature subsets and assess the quality of each subset by training and evaluating a classifier with those features.

The selection technique is built upon the machine


learning algorithm on which the given dataset needs
to fit.

There are three types of wrapper methods, they


are:

Forward Selection: Here, one feature is tested at a time and new features are added until a good fit is obtained.
Backward Selection: Here, we start with all the features and eliminate the least useful ones one by one, checking at each step which subset works better.
Recursive Feature Elimination: The features are recursively eliminated and the remaining subsets are evaluated for how well they perform.

EMBEDDED METHODS

Embedded methods constitute the advantages of


both filter and wrapper methods by including feature
interactions while maintaining reasonable
computational costs.

These methods are iterative as they take each model


iteration and carefully extract features contributing to
most of the training in that iteration.

Examples of embedded methods: LASSO


Regularization (L1), Random Forest Importance.
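
The three families can be sketched with scikit-learn as follows (a sketch assuming scikit-learn is installed; the breast-cancer dataset, k=10 and alpha=0.01 are illustrative choices only):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectKBest, chi2
    from sklearn.linear_model import Lasso, LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Filter: keep the 10 features with the highest chi-square score
    X_filtered = SelectKBest(chi2, k=10).fit_transform(X, y)

    # Wrapper: recursive feature elimination around a classifier
    rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

    # Embedded: LASSO (L1) regularisation shrinks unimportant coefficients to zero
    lasso = Lasso(alpha=0.01).fit(X, y)

    print(X_filtered.shape, rfe.support_, lasso.coef_)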

Q83. During analysis, how do you treat the


missing values?

To identify the extent of missing values, we first have to


identify the variables with the missing values. Let us say
a pattern is identified. The analyst should now
concentrate on them as it could lead to interesting and
meaningful insights.

However, if there are no patterns identified, we can


substitute the missing values with the median or mean
values or we can simply ignore the missing values.

If the variable is categorical, the common strategies for


handling missing values include:

Assigning a New Category: You can assign a new


category, such as "Unknown" or "Other," to represent
the missing values.

Mode imputation: You can replace missing values


with the mode, which represents the most frequent
category in the variable.

Using a Separate Category: If the missing values


carry significant information, you can create a
separate category to indicate missing values.

It's important to select an appropriate strategy based on


the nature of the data and the potential impact on
subsequent analysis or modelling.

If 80% of the values are missing for a particular variable,


then we would drop the variable instead of treating the
missing values.
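
A short pandas sketch of these strategies (assuming pandas/NumPy; the toy DataFrame and column names are invented for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":  [25, np.nan, 40, 35],
        "city": ["Pune", None, "Delhi", "Pune"],
    })

    df["age"] = df["age"].fillna(df["age"].median())       # numeric: impute with the median
    df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: impute with the mode
    # or: df["city"] = df["city"].fillna("Unknown")        # categorical: assign a new category
    print(df)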

Q84. What does the ROC Curve represent


and how to create it?

ROC (Receiver Operating Characteristic) curve is a


graphical representation of the contrast between false-
positive rates and true positive rates at different
thresholds. The curve is used as a proxy for a trade-off
between sensitivity and specificity.

The ROC curve is created by plotting values of the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 − specificity). The TPR represents the proportion of positive observations correctly predicted as positive out of all positive observations. The FPR represents the proportion of negative observations incorrectly predicted as positive out of all negative observations. Consider the example of medical testing: the TPR represents the rate at which people are correctly tested positive for a particular disease.
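
A minimal sketch of building the curve (assuming scikit-learn; the labels and predicted probabilities below are made-up values):

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # predicted probabilities of the positive class

    fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
    print(list(zip(fpr, tpr)))
    print("AUC:", roc_auc_score(y_true, y_score))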

Q85. What is the difference between the


Test set and validation set?
The test set is used to test or evaluate the performance of
the trained model. It evaluates the predictive power of
the model.

The validation set is part of the training set that is used to


select parameters for avoiding model overfitting.

Q86. What do you understand by kernel functions and the kernel trick?
Kernel functions are generalized dot product functions used for computing the dot product of vectors x and y in a high-dimensional feature space.

The kernel trick is a method for solving a non-linear problem with a linear classifier, by transforming linearly inseparable data into separable data in a higher-dimensional space.
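
As an illustration (a sketch assuming scikit-learn; make_circles creates a toy dataset that is not linearly separable in two dimensions):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

    clf = SVC(kernel="rbf").fit(X, y)   # the RBF kernel implicitly maps the data to a higher-dimensional space
    print(clf.score(X, y))              # a linear separator in that space separates the circles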

Q87. Differentiate between box plot and


histogram.
Box plots and histograms are both visualizations used for
showing data distributions for efficient communication
of information.

Histograms are bar-chart-style representations that show the frequency of numerical variable values and are useful for estimating the probability distribution, variation and outliers.

Boxplots are used for communicating different aspects of the data distribution (median, quartiles, outliers); the exact shape of the distribution is not seen, but insights can still be gathered. They are useful for comparing multiple distributions at the same time, as they take less space than histograms.
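
A small plotting sketch of the two (assuming NumPy and Matplotlib; the normally distributed sample is synthetic):

    import matplotlib.pyplot as plt
    import numpy as np

    data = np.random.normal(loc=50, scale=10, size=500)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(data, bins=30)   # histogram: shows the shape of the distribution
    ax2.boxplot(data)         # box plot: median, quartiles and outliers in less space
    plt.show()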

Q88. How will you balance/correct


imbalanced data?
There are different techniques to correct/balance
imbalanced data. It can be done by increasing the
sample numbers for minority classes. The number of
samples can be decreased for those classes with
extremely high data points.

Following are some approaches followed to balance


data:

Use the right evaluation metrics: In cases of


imbalanced data, it is very important to use the right
evaluation metrics that provide valuable information.

Precision: Indicates the proportion of selected (predicted positive) instances that are actually relevant.

Specificity: Indicates the proportion of actual negative instances that are correctly identified as negative.

Sensitivity (Recall): Indicates the proportion of relevant (actual positive) instances that are correctly selected.

F1 score: It represents the harmonic mean of


precision and sensitivity.

MCC (Matthews correlation coefficient): It


represents the correlation coefficient between

observed and predicted binary classifications.

AUC (Area Under the Curve): This represents a


relation between the true positive rates and false-
positive rates.

For example, consider training data in which 99.9% of the labels are "0". If we measure the accuracy of the model simply in terms of how often it predicts "0", the accuracy would be very high (99.9%), but the model does not convey any valuable information. In such cases, we can apply the different evaluation metrics stated above.

Training Set Resampling: It is also possible to balance the data by working with different datasets, and this can be achieved by resampling. There are two resampling approaches, used based on the use case and the requirements (a small code sketch follows at the end of this answer):
1. Under-sampling: This balances the data by reducing


the size of the abundant class and is used when the
data quantity is sufficient. By performing this, a new
dataset that is balanced can be retrieved and this can
be used for further modeling.
2. Over-sampling: This is used when the data quantity is
not sufficient. This method balances the dataset by
trying to increase the sample size of the minority class.
Instead of getting rid of extra samples, new samples are
generated and introduced by employing methods such as
repetition, bootstrapping, etc.

Perform K-fold cross-validation correctly: Cross-


Validation needs to be applied properly while using
over-sampling. The cross-validation should be done
before over-sampling because if it is done later, then
it would be like overfitting the model to get a specific
result. To avoid this, resampling of data is done
repeatedly with different ratios.
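
A minimal over-sampling sketch (assuming pandas and scikit-learn; the tiny 8-versus-2 dataset is invented for illustration):

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

    majority = df[df.label == 0]
    minority = df[df.label == 1]

    # over-sample the minority class (with replacement) up to the size of the majority class
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)

    balanced = pd.concat([majority, minority_upsampled])
    print(balanced.label.value_counts())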

Q89. What is the difference between grid search and random search hyperparameter tuning strategies?

Tuning strategies are used to find the right set of


hyperparameters. Hyperparameters are those properties
that are fixed and model-specific before the model is
tested or trained on the dataset.

Both the grid search and random search tuning


strategies are optimization techniques to find
efficient hyperparameters.

GRID SEARCH:

Here, every combination of a preset list of


hyperparameters is tried out and evaluated.

The search pattern is similar to searching in a grid, where the values are arranged in a matrix and a search is performed over them. Each parameter set is tried out and its accuracy is tracked. After every combination has been tried, the model with the highest accuracy is chosen as the best one.

The main drawback here is that, if the number of


hyperparameters is increased, the technique suffers.
The number of evaluations can increase exponentially
with each increase in the hyperparameter. This is
called the problem of dimensionality in a grid search.

RANDOM SEARCH:

In this technique, random combinations of


hyperparameters set are tried and evaluated for
finding the best solution. For optimizing the search,
the function is tested at random configurations in
parameter space as shown in the image below.

In this method, there are increased chances of finding good parameters because the configurations are sampled at random. The model may end up trained on near-optimal parameters without having to try every combination exhaustively.

This search works the best when there is a lower


number of dimensions as it takes less time to find the
right set.
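
Both strategies are available out of the box in scikit-learn (a sketch assuming scikit-learn; the model, the parameter grid and n_iter are placeholder choices):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = load_iris(return_X_y=True)
    params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

    # grid search tries all 6 combinations; random search samples 4 of them
    grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3).fit(X, y)
    rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                              n_iter=4, cv=3, random_state=0).fit(X, y)

    print(grid.best_params_)
    print(rand.best_params_)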

Q90. Consider a case where you know the


probability of finding at least one shooting
star in a 15-minute interval is 30%. Evaluate
the probability of finding at least one
shooting star in a one-hour duration?

We know that,
Probability of finding at least 1 shooting star in 15 min
= P(sighting in 15 min) = 30% = 0.3
Hence, probability of not sighting any
shooting star in 15 min = 1 - P(sighting in 15 min)
= 1 - 0.3
= 0.7

One hour contains four independent 15-minute intervals, so
Probability of not finding a shooting star in 1 hour
= 0.7^4
= 0.2401
Probability of finding at least 1
shooting star in 1 hour = 1 - 0.2401
= 0.7599

So the probability is 0.7599, i.e. roughly 76%.

Q91. Give some examples where a false positive is more important than a false negative.

Before citing instances, let us understand what are false


positives and false negatives.
False Positives are those cases that were wrongly


identified as an event even if they were not. They are
called Type I errors.

False Negatives are those cases that were wrongly


identified as non-events despite being an event. They
are called Type II errors.

Some examples where false positives are more important than false negatives are:

In the medical field:


Consider a lab report that predicts cancer for a patient who does not actually have cancer. This is an example of a false positive error. It is dangerous to start chemotherapy for that patient: since he does not have cancer, chemotherapy would damage healthy cells and might even itself lead to cancer.

In the e-commerce field:


Suppose a company decides to start a campaign in which it gives $100 gift vouchers to customers whom it expects to purchase at least $10,000 worth of items, assuming the campaign would result in at least 20% profit on items sold above $10,000. What if the vouchers are given to customers who haven't purchased anything but have been mistakenly marked as customers who would purchase $10,000 worth of products? This is a case of false positive error.
Q92. Give one example where both false positives and false negatives are equally important.

In the banking field: Consider a bank that lends money to customers.

A false positive means approving a loan for a customer who later defaults; the bank loses the money it lent.

A false negative means rejecting a creditworthy customer; the bank loses the interest and business it would have earned.

Since both kinds of errors are costly to the bank, false positives and false negatives are equally important in this case.

START YOUR
JOURNEY TO
BECOMING A
DATA SCIENCE
EXPERT
Now that you have a comprehensive overview of the field of Data Science, the career opportunities that await you, and the skills you need to get there, the next and most effective step towards achieving your goal is to get certified and learn all you need to.

Accredian is a pioneer in online training and one of the world’s


leading certification providers in the most in-demand technologies
today. We provide various training and certifications, for all levels of
professionals (beginners to senior level) to equip you with the
knowledge required to forge a career path in data science.

Professional Program
E&ICT IIT G - Executive Program in Data Science & AI - 12 months
E&ICT IIT G - Executive Program in Data Science & AI - 10 months

Advanced Program
E&ICT IIT G - Advanced Certification in Data Science & ML - 6 months

Basic Program
E&ICT IIT G - Certificate in Data Analytics - 3 months

accredian
INDIA

www.accredian.com
