Foundation of Data Science Solved Question Paper Aug 2022
d) What is a quartile?
Ans. Quartiles are position indicators that divide an ordered sequence of numbers into 4 equal parts. Three cut points
mark the divisions: the lower quartile (Q1), the median (Q2), and the upper quartile (Q3).
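As a rough illustration, the three quartile cut points can be computed with Python's standard statistics module.
The data list is the one from question Q3 c) of this paper, used here only as sample input. Several quartile
conventions exist; statistics.quantiles uses its "exclusive" method by default, so other tools may report slightly
different values.

```python
import statistics

# Sample values, used here only for illustration.
data = [14, 9, 13, 16, 25, 7, 12]

# n=4 asks for the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("Lower quartile (Q1):", q1)
print("Median (Q2):", q2)
print("Upper quartile (Q3):", q3)
```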
2. Multiple Imputation: In this method, multiple datasets are created by imputing the missing values using a
statistical model such as regression or clustering. Each dataset is analyzed and the results are combined to
account for the uncertainty in imputed values. This method is more robust and provides more accurate results
compared to mean/median imputation, especially when the values are not missing completely at random.
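The idea can be sketched in a few lines of Python. This is a deliberately simplified sketch, not a full
implementation: it assumes a single numeric variable, imputes each missing value with a random draw from a normal
distribution fitted to the observed values (a real analysis would normally use a regression model on other
variables), and pools the per-dataset means with Rubin's rules. The function name and the ages list are hypothetical.

```python
import random
import statistics


def pooled_mean_by_multiple_imputation(values, m=5, seed=0):
    """Estimate the mean of `values` (None = missing) from m imputed copies."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)

    estimates = []    # one mean per imputed dataset
    within_vars = []  # sampling variance of each of those means
    for _ in range(m):
        # Impute each missing value with a random draw rather than a fixed
        # number, so the copies differ and reflect the uncertainty.
        completed = [v if v is not None else rng.gauss(mu, sd) for v in values]
        estimates.append(statistics.mean(completed))
        within_vars.append(statistics.variance(completed) / len(completed))

    # Combine the m analyses (Rubin's rules).
    pooled = statistics.mean(estimates)
    within = statistics.mean(within_vars)
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance


# Hypothetical data with two missing entries.
ages = [23, 25, None, 31, 28, None, 35, 27]
estimate, variance = pooled_mean_by_multiple_imputation(ages)
print(f"pooled mean = {estimate:.2f}, total variance = {variance:.2f}")
```

The between-dataset term is what mean/median imputation cannot provide: it widens the final variance to reflect
that the imputed values are guesses.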
2. Pandas: Pandas is a fast, flexible, and powerful open-source data analysis and data manipulation library for
Python. It provides data structures for efficiently storing large datasets and tools for working with them,
including data cleaning, filtering, grouping, and aggregating. Pandas makes it easy to manipulate and analyze
large datasets, and is an essential tool in the data scientist's toolbox.
A word cloud (also known as a tag cloud) is a visual representation of the frequency of words in a text corpus.
The size (and sometimes the color) of each word in the cloud indicates its frequency in the text. Word clouds provide a quick and
simple way to understand the most commonly used words in a large text, and can be used to identify patterns
and trends in the data. They are widely used in the fields of text mining, natural language processing, and data
visualization. Word clouds can be generated using a variety of tools and software, and are often used as a first
step in the analysis of large text datasets.
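The counting step behind a word cloud can be sketched with Python's standard library. The sample text, the tiny
stop-word list, and the linear mapping from frequency to font size are all assumptions made for illustration;
dedicated word cloud tools handle layout and drawing on top of counts like these.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = (
    "Data science uses data to answer questions. "
    "Good data and good questions make good science."
)

# Lowercase the text and split it into words.
words = re.findall(r"[a-z']+", text.lower())

# Drop very common words (a tiny, hypothetical stop-word list).
stop_words = {"to", "and", "the", "a", "of"}
counts = Counter(w for w in words if w not in stop_words)

# Scale each word's font size linearly with its frequency.
min_size, max_size = 10, 40
top = max(counts.values())
sizes = {w: min_size + (max_size - min_size) * c / top for w, c in counts.items()}

for word, count in counts.most_common(3):
    print(f"{word}: count={count}, font size={sizes[word]:.0f}")
```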
Q3) Attempt any TWO of the following: [2 x 4 = 8]
Data visualization is used in a variety of fields, including business, science, medicine, and engineering,
to help decision-makers make informed decisions, communicate insights to stakeholders, and present
information in an engaging and memorable way. It is also used for exploratory data analysis, to discover patterns
in the data, and for presentation and reporting, to share findings and insights with others.
c) Calculate the variance and standard deviation for the following data. X: 14 9 13 16 25 7 12.
Ans. The variance and standard deviation are two common measures of dispersion in a set of data. The variance is
the average of the squared deviations from the mean, while the standard deviation is the square root of the
variance and expresses the typical deviation of the data from the mean in the same units as the data.
To calculate the variance and standard deviation for the data set X = [14, 9, 13, 16, 25, 7, 12], we first
need to find the mean.
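Mean (μ) = (14 + 9 + 13 + 16 + 25 + 7 + 12) / 7 = 96 / 7 ≈ 13.71
Next, we subtract the mean from each data point and square the results to find the squared deviations from the mean:
0.08, 22.22, 0.51, 5.22, 127.37, 45.08, 2.94, which sum to about 203.43.
The variance is the average of these squared deviations:
Variance (σ²) = 203.43 / 7 ≈ 29.06
Finally, we find the standard deviation by taking the square root of the variance:
Standard deviation (σ) = √(29.06) ≈ 5.39
So, the variance and standard deviation for the data set X = [14, 9, 13, 16, 25, 7, 12] are approximately 29.06 and 5.39,
respectively. If the data are treated as a sample and the divisor n − 1 = 6 is used instead, the variance is about
33.90 and the standard deviation about 5.82.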
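A hand calculation like this can be cross-checked with Python's standard statistics module. The question does not
say whether the population or the sample formula is intended, so this sketch computes both; which one applies is an
assumption to settle against the course convention.

```python
import statistics

x = [14, 9, 13, 16, 25, 7, 12]

mean = statistics.mean(x)

# Population formulas: divide the sum of squared deviations by n.
pop_variance = statistics.pvariance(x)
pop_std = statistics.pstdev(x)

# Sample formulas: divide by n - 1 instead.
sample_variance = statistics.variance(x)
sample_std = statistics.stdev(x)

print(f"mean = {mean:.2f}")
print(f"population variance = {pop_variance:.2f}, std = {pop_std:.2f}")
print(f"sample variance     = {sample_variance:.2f}, std = {sample_std:.2f}")
```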
Q4) Attempt any TWO of the following: [2 x 4 = 8]
In hypothesis testing, a null hypothesis and an alternative hypothesis are formulated. The null hypothesis
represents the status quo and is usually the opposite of what the researcher wants to prove. The alternative
hypothesis represents the researcher's claim or assumption.
A test statistic is calculated from the sample data, and its distribution under the null hypothesis is determined.
The test statistic is then compared to a critical value, or equivalently its p-value is compared to the significance
level. The p-value is the probability of observing a test statistic as extreme as or more extreme than the one
calculated, assuming the null hypothesis is true.
If the p-value is less than a pre-determined level of significance (often denoted by α), the null hypothesis is
rejected in favor of the alternative hypothesis. If the p-value is greater than α, the null hypothesis is not
rejected; this does not prove the null hypothesis true, only that the data give insufficient evidence against it.
In summary, hypothesis testing is a method used to make inferences about a population based on a sample of
data and helps to make informed decisions by testing claims and assumptions.
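The steps above can be sketched as a two-sided one-sample z-test using only Python's standard library. The function
name, the sample of package weights, the hypothesized mean of 500, and the assumption that the population standard
deviation is known to be 5 are all hypothetical and chosen only to make the example concrete.

```python
import math
import statistics
from statistics import NormalDist


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with sigma known."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # Test statistic: how many standard errors the sample mean is from mu0.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # p-value: probability of a |Z| at least this large if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical sample: package weights in grams; H0 says the mean is 500.
weights = [502, 498, 505, 510, 507, 503, 509, 506, 504]
z, p, reject = one_sample_z_test(weights, mu0=500, sigma=5)

print(f"z = {z:.2f}, p-value = {p:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

When the population standard deviation is not known, a t-test would normally be used in place of the z-test.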
In conclusion, these libraries provide a wide range of options for creating different types of visualizations in
Python. Whether you are looking to create simple plots or interactive visualizations, there is a library in Python
that can meet your needs.
Q5) Attempt any ONE of the following: [1 x 3 = 3]
ii) One Technique of Data Transformation: Normalization is a technique in data transformation that rescales the
values in a data set to a standard scale, typically between 0 and 1. The purpose of normalization is to
remove the impact of scaling differences in the data and to make the data more comparable and
interpretable.
Normalization can be performed using a number of methods, including Min-Max Normalization, Z-Score
Normalization, and Decimal Scaling. In Min-Max Normalization, the data is scaled such that the minimum value is
0 and the maximum value is 1. In Z-Score Normalization, the data is rescaled such that the mean of the data is 0
and the standard deviation is 1. In Decimal Scaling, the data is rescaled by dividing each value by a power of 10,
chosen so that the largest absolute value falls below 1.
Normalization is a critical step in many data analysis and machine learning processes, particularly for methods
that rely on distances or gradient-based optimization, where features with larger ranges would otherwise dominate.
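The three methods can be sketched as small Python functions. The function names and the sample data are
assumptions for illustration; the sketch also assumes the values are not all identical, since otherwise the range
and the standard deviation are zero and the divisions fail.

```python
import statistics


def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale so the mean is 0 and the (population) standard deviation is 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Sample values, used here only for illustration.
data = [14, 9, 13, 16, 25, 7, 12]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```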
ii) Types of Outliers: An outlier is a data point that is significantly different from the other data points in a data
set. There are several types of outliers, including:
1. Univariate Outliers: These are outliers in a single variable data set. They can be identified using a range of
methods, including the use of box plots, scatter plots, and Z-score calculations.
2. Multivariate Outliers: These are outliers that occur in multiple variable data sets. They can be identified
using methods such as Mahalanobis Distance, which calculates the distance between a data point and the
mean of the data set in a multivariate space.
3. Collective Outliers: These are outliers that occur in groups, rather than as individual data points. They can
be identified using methods such as cluster analysis, which groups similar data points together and identifies
clusters that are significantly different from the rest of the data set.
In this answer, we will explain Univariate Outliers in detail:
Univariate Outliers are outliers in a single variable data set. They can be identified using a range of methods,
including the use of box plots, scatter plots, and Z-score calculations.
A box plot is a simple way to visualize univariate outliers. It displays the median, first and third quartiles, and
outliers of a data set in a single plot. Outliers are typically identified as data points that fall outside of the
whiskers of the box plot, which by the usual convention extend at most 1.5 times the interquartile range (IQR)
beyond the first and third quartiles.
Z-score calculations can be used to identify outliers in a data set. Z-scores represent the number of standard
deviations that a data point is from the mean of the data set. Outliers can be identified as data points with a
Z-score that is far from 0; typically, Z-scores greater than 3 or less than -3 are considered outliers.
Scatter plots can also be used to visualize univariate outliers, for example by plotting each value against its
position in the data set. Outliers appear as data points that fall far from the majority of the data points.
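Both numeric rules can be sketched in Python with the standard library. The function names and the readings list
are hypothetical. The 1.5 × IQR fences mirror the usual box plot whisker convention, and the cut-off of 3 is the
Z-score threshold mentioned above. Note that the Z-score rule needs a reasonably large data set: with only a
handful of points, no value can reach a Z-score of 3.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def z_score_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 11, 95]

print("IQR rule:    ", iqr_outliers(readings))
print("Z-score rule:", z_score_outliers(readings))
```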