FDS Most Imp Question
Define Primary data- Primary data is data collected directly from original sources for a specific research purpose or analysis. It is firsthand data gathered through methods such as surveys, interviews, experiments, or observations, ensuring that it is original and not previously analyzed.
Define Interquartile range- The interquartile range (IQR) is a statistical measure of the
spread of a dataset. It is the range between the first quartile (Q1) and the third quartile (Q3)
and represents the middle 50% of the data. The formula is: IQR = Q3 − Q1. It helps identify the variability in the data and is used to detect outliers.
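As a sketch of the definition above, the IQR can be computed with NumPy (the dataset here is purely illustrative):

```python
import numpy as np

# Hypothetical sample dataset (illustrative values only)
data = [4, 7, 9, 11, 12, 20]

q1 = np.percentile(data, 25)   # first quartile
q3 = np.percentile(data, 75)   # third quartile
iqr = q3 - q1                  # interquartile range, Q3 - Q1

# A common outlier rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(iqr)
```

The 1.5 × IQR fences are the usual convention for marking outliers, as mentioned in the definition.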
What are uses of zip files- ZIP files are compressed archive files used to reduce the size of
files or collections of files for storage and transfer. The main uses include: Space-saving: Compressing files to take up less disk space. Ease of transfer: Reducing file size makes it faster and more efficient to share over the internet. Organization: Bundling multiple files or
folders into a single archive. Security: Optionally, ZIP files can be encrypted with a password
for added protection.
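The space-saving, transfer, and bundling uses above can be sketched with Python's standard `zipfile` module (file names are hypothetical):

```python
import os
import tempfile
import zipfile

# Work in a temporary directory so the sketch leaves no files behind
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "notes.txt")
with open(src, "w") as f:
    f.write("FDS revision notes")

# Bundle and compress the file into a single archive
archive = os.path.join(workdir, "archive.zip")
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(src, arcname="notes.txt")

# List the archive's contents and extract them back out
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    zf.extractall(os.path.join(workdir, "unzipped"))
print(names)
```

Password protection is also possible with third-party tools; the standard library can read, but not create, encrypted ZIP entries.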
What do you mean by XML Files data format-XML (eXtensible Markup Language) is a text-
based format used to store and transport structured data. It is both human-readable and
machine-readable, making it widely used for data representation and sharing between
systems. XML files are made up of elements, attributes, and nested structures that define
data hierarchies.
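A minimal sketch of elements, attributes, and nesting, parsed with the standard library (the XML document itself is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A small, hypothetical XML document: <book> elements nested inside
# <library>, each carrying an "id" attribute and a <title> child
xml_text = """
<library>
    <book id="1">
        <title>Data Science Basics</title>
    </book>
    <book id="2">
        <title>Statistics 101</title>
    </book>
</library>
"""

root = ET.fromstring(xml_text)
titles = [book.find("title").text for book in root.findall("book")]
ids = [book.get("id") for book in root.findall("book")]
print(titles, ids)
```

Both element text and attribute values are recovered as plain strings, which is what makes XML easy to exchange between systems.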
What is visual encoding- Visual encoding refers to the process of representing data through
visual means such as charts, graphs, and diagrams. It involves translating data attributes
(e.g., values, categories) into visual elements (e.g., shapes, colors, positions) to facilitate
understanding and pattern recognition.
What is Data science- Data science is an interdisciplinary field that combines scientific
methods, processes, algorithms, and systems to extract insights and knowledge from
structured and unstructured data. It involves areas such as statistics, machine learning, data
analysis, and big data technologies to make informed decisions.
Define Data source- A data source is any origin from which data is obtained or collected for
analysis. Data sources can be primary (e.g., surveys, interviews, experiments) or secondary
(e.g., databases, research papers, and public datasets). Data sources can also include APIs,
sensors, or data dumps from various software.
Define Data cleaning- Data cleaning is the process of identifying and rectifying errors,
inconsistencies, inaccuracies, or missing values in a dataset to improve its quality. This
process can include tasks such as: Removing duplicates ,Correcting typos or formatting
issues , Handling missing values by imputation or deletion , Ensuring data consistency across
the dataset
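The cleaning tasks listed above (removing duplicates, correcting formatting, handling missing values) can be sketched with pandas, assuming a small invented dataset:

```python
import pandas as pd

# Hypothetical raw data with a formatting typo ("pune "), a missing
# value, and a duplicate row after cleaning
df = pd.DataFrame({
    "city": ["Pune", "pune ", "Delhi", "Delhi"],
    "sales": [100.0, 100.0, None, 250.0],
})

# Correct formatting issues: trim whitespace and normalize case
df["city"] = df["city"].str.strip().str.title()

# Handle missing values by imputing the column mean
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Remove exact duplicate rows for consistency
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

After cleaning, the two "Pune" rows collapse into one and the missing sales figure is replaced by the mean of the observed values.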
What is data transformation- Data transformation is the process of converting data from
one format, structure, or value into another. This process is essential for cleaning, enriching,
and formatting data before analysis or integration with other data systems. It may involve
tasks such as normalization, aggregation, or data type conversion.
What is data discretization- Data discretization techniques reduce the number of values of a given continuous attribute by dividing the attribute's range into intervals. Interval labels can then be used in place of the actual data values. Replacing many values of a continuous attribute with a small number of interval labels reduces and simplifies the original data.
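Binning a continuous attribute into labelled intervals, as described above, can be sketched with `pandas.cut` (ages and interval labels are illustrative):

```python
import pandas as pd

# Hypothetical continuous attribute: ages of five people
ages = pd.Series([5, 17, 25, 42, 68])

# Divide the attribute's range into intervals and replace each value
# with its interval label
labels = pd.cut(ages, bins=[0, 18, 60, 100],
                labels=["young", "adult", "senior"])
print(list(labels))
```

Five distinct ages are reduced to just three interval labels, which is the simplification discretization aims for.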
What are missing values- Missing values are data values that are absent from a dataset, or values explicitly designated as user-missing. For example, you might want to distinguish between data that are missing because a respondent refused to answer and data that are missing because the question didn't apply to that respondent.
What is data quality-Data quality refers to the condition of a dataset, determined by factors
such as accuracy, completeness, consistency, reliability, and relevance. High-quality data is
essential for effective analysis and decision-making.
What is tag cloud-A tag cloud (also known as a word cloud or weighted list in visual design) is
a visual representation of text data which is often used to depict keyword metadata on
websites, or to visualize free form text. Tags are usually single words, and the importance of
each tag is shown with font size or color. When used as website navigation aids, the terms are
hyperlinked to items associated with the tag.
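The weighting step behind a tag cloud is just word-frequency counting; a minimal sketch with the standard library (the text is invented, and an actual cloud renderer would map these counts to font sizes):

```python
from collections import Counter

# Hypothetical free-form text to be turned into a tag cloud
text = "data science data analysis data visualization science"

# Count how often each tag (single word) appears
counts = Counter(text.split())

# The most frequent tags would be rendered with the largest fonts
tags = counts.most_common(2)
print(tags)
```

Here "data" (3 occurrences) would be drawn largest, "science" (2) next, and the remaining tags smallest.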
What are the different methods for measuring the data dispersion
The measures of dispersion that are measured and expressed in the units of the data themselves are called Absolute Measures of Dispersion, for example meters, dollars, kg, etc. Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in the distribution.
Mean Deviation: It is the arithmetic mean of the difference between the values and their
mean.
Standard Deviation: It is the square root of the arithmetic average of the square of the
deviations measured from the mean.
Variance: It is defined as the average of the square deviation from the mean of the given data
set.
Quartile Deviation: It is defined as half of the difference between the third quartile and the
first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile Range. Its formula is given as Q3 − Q1.
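The absolute measures above can be computed side by side with the standard library, assuming a small illustrative dataset:

```python
import statistics

# Hypothetical dataset whose mean is exactly 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)                       # Range
mean = statistics.mean(data)
mean_dev = sum(abs(x - mean) for x in data) / len(data)  # Mean deviation
variance = statistics.pvariance(data)                    # Population variance
std_dev = statistics.pstdev(data)                        # Standard deviation
print(data_range, mean_dev, variance, std_dev)
```

Note that the standard deviation (2.0) is the square root of the variance (4.0), exactly as the definitions state.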
What do you mean by Data attribute .Explain types of attributes with example
Data attributes refer to the specific characteristics or properties that describe individual data objects within a dataset. These attributes provide meaningful information about the objects and are used to analyze, classify, or manipulate the data. Understanding and analyzing data attributes is fundamental in various fields such as statistics, machine learning, and data analysis, as they form the basis for deriving insights and making informed decisions from the data.
Nominal Attributes: Nominal attributes, as related to names, refer to categorical data where the values represent different categories or labels without any inherent order or ranking.
Binary Attributes: Binary attributes are a type of qualitative attribute where the data can take
on only two distinct values or states.
Symmetric: In a symmetric attribute, both values or states are considered equally important
or interchangeable.
Asymmetric: An asymmetric attribute indicates that the two values or states are not equally
important or interchangeable.
Ordinal Attributes : Ordinal attributes are a type of qualitative attribute where the values
possess a meaningful order or ranking, but the magnitude between values is not precisely
quantified.
Explain 3V’s of data science-The 3 V's (volume, velocity and variety) are three defining
properties or dimensions of big data. Volume refers to the amount of data, velocity refers to
the speed of data processing, and variety refers to the number of types of data.
What is meant by Noisy data- Noisy data are data with a large amount of additional meaningless information in them, called noise. This includes data corruption, and the term is often used as a synonym for corrupt data. It also includes any data that a user system cannot understand and interpret correctly. Many systems, for example, cannot use unstructured text. Noisy data can adversely affect the results of any data analysis and skew conclusions if not handled properly. Statistical analysis is sometimes used to weed the noise out of noisy data.
Give the purpose of data preprocessing? Data preprocessing is a crucial step in data
mining and machine learning. It involves cleaning, transforming, and preparing raw data to
improve its quality and suitability for analysis. The main purposes of data preprocessing
include: Handling missing values: Imputing missing values or removing records with missing
data. Noise reduction: Identifying and removing noise or outliers. Data integration:
Combining data from multiple sources. Data transformation: Normalization, standardization, and feature engineering.
Data reduction: Dimensionality reduction and feature selection.
In a Venn diagram, the overlapping region represents the intersection of the two sets (in the example illustrated, {3, 4}). Venn diagrams are useful for understanding and visualizing set operations like union, intersection, and difference.
What do you mean by Data transformation? Explain strategies of data transformation.
Data transformation refers to the process of converting data from its original format or
structure into a different format or structure that is more suitable for analysis, reporting, or
further processing. This is a key step in data preprocessing and can involve several operations
that change the data's format, structure, or values to meet specific analytical needs. Data
transformation is often performed during the Extract, Transform, Load (ETL) process in data
engineering, especially when data from disparate sources needs to be integrated into a
centralized data warehouse or used for machine learning.
Data Cleaning (or Data Wrangling): Data cleaning is an essential part of data transformation.
Normalization: The process of adjusting the values in a dataset to a common scale, typically in a range like 0 to 1.
Aggregation: Aggregation is the process of combining data from multiple sources or records into a single summary value. This is often used in reporting or time-series data analysis.
Data Type Conversion: Data often needs to be converted to appropriate types to match the requirements of downstream processes.
Feature Engineering: Feature engineering is the process of creating new variables (features)
or modifying existing ones to improve the performance of a machine learning model or
analysis.
What is a quartile?
A quartile is a statistical measure that divides a dataset into four equal parts. There are three quartiles, and they are used to understand the distribution and variability of data.
First Quartile (Q1): The point below which the lowest 25% of the data fall.
Second Quartile (Q2): Also known as the median; the point below which 50% of the data fall.
Third Quartile (Q3): The point below which 75% of the data fall.
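Assuming a small illustrative dataset, the three quartiles can be computed directly with the standard library:

```python
import statistics

# Hypothetical dataset of seven ordered values
data = [1, 2, 3, 4, 5, 6, 7]

# quantiles() with n=4 returns the three cut points Q1, Q2, Q3
# (default "exclusive" method)
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)
```

Q2 equals the median of the data, and Q3 − Q1 gives the interquartile range defined earlier in these notes.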
Types of Data (with Examples):
Structured Data: Organized into rows and columns (e.g., spreadsheets, databases).
Unstructured Data: Lacks a predefined format (e.g., text, images, videos, social media posts).
Semi-Structured Data: Contains some organizational properties (e.g., JSON, XML).
Qualitative Data: Non-numerical information (e.g., opinions, descriptions).
Quantitative Data: Numerical information (e.g., measurements, counts).
Metadata: Data about data (e.g., file size, author name).
Transformation Strategies
Log Transformation: Reduces skewness and handles large-scale differences in data.
Square Root Transformation: Reduces skewness for positive data.
Z-Score Transformation: Standardizes data using z-scores.
Scaling: Ensures all data is on the same scale for analysis.
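The four strategies above can be sketched in a few lines of standard-library Python (the skewed dataset is invented for illustration):

```python
import math
import statistics

# Hypothetical positively skewed data spanning several orders of magnitude
data = [1, 10, 100, 1000]

# Log transformation: compresses large-scale differences
log_t = [math.log10(x) for x in data]

# Square root transformation: milder skew reduction for positive data
sqrt_t = [math.sqrt(x) for x in data]

# Z-score transformation: standardizes to mean 0, unit variance
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
z_t = [(x - mu) / sigma for x in data]

# Min-max scaling: puts all values on a common 0-1 scale
lo, hi = min(data), max(data)
scaled = [(x - lo) / (hi - lo) for x in data]
print(log_t, scaled)
```

After the log transform the values are evenly spaced (0 to 3), showing how skew is removed; after scaling they all lie in [0, 1].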
List different types of attributes. Attributes can be broadly categorized into two types:
Categorical Attributes:
Nominal: No inherent order (e.g., color, gender).
Ordinal: Has a natural order (e.g., low, medium, high).
Numerical Attributes:
Discrete: Countable values (e.g., number of children).
Continuous: Infinitely many possible values (e.g., height, weight).
State the methods of feature selection.There are several methods for feature selection in
machine learning:
-Filter Methods: Statistical measures like correlation, chi-square test, and information gain
are used to rank features.
-Wrapper Methods: Algorithms like forward selection, backward elimination, and recursive
feature elimination evaluate subsets of features.
-Embedded Methods: Feature selection is integrated into the model building process, such as regularization techniques like L1 and L2 regularization.
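A minimal sketch of a filter method: rank features by their absolute correlation with the target. The data is synthetic (the target is built to depend mainly on `x1`), and a real pipeline would typically use a library such as scikit-learn:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                      # informative feature
x2 = rng.normal(size=n)                      # irrelevant feature
y = 3 * x1 + rng.normal(scale=0.1, size=n)   # target depends on x1

# Filter method: score each feature by |Pearson correlation| with y
features = {"x1": x1, "x2": x2}
scores = {name: abs(np.corrcoef(col, y)[0, 1])
          for name, col in features.items()}

# Keep the highest-scoring feature(s)
best = max(scores, key=scores.get)
print(scores, best)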
Structured data vs. unstructured data, on the basis of Nature:
Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted.
Unstructured data is qualitative, as it cannot be processed and analyzed using conventional tools.
What is nominal attribute?-Nominal attributes are categorical data where the values
represent different categories or labels without any inherent order or ranking. Examples
include gender, color, or country.
Explain two methods of data cleaning for missing values.
Deletion:
Listwise deletion: Removes entire records with missing values.
Pairwise deletion: Excludes cases with missing values only for specific analyses.
Simple to implement but can lead to loss of information.
Imputation:
Mean/Median/Mode Imputation: Replaces missing values with the mean, median, or
mode of the respective variable.
Regression Imputation: Predicts missing values using regression models.
Hot Deck Imputation: Replaces missing values with values from similar records.
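Deletion and mean imputation, the two simplest strategies above, can be sketched with pandas (the series is illustrative):

```python
import pandas as pd

# Hypothetical column with two missing values
s = pd.Series([10.0, None, 30.0, None, 50.0])

# Deletion: drop records with missing values (loses information)
deleted = s.dropna()

# Mean imputation: replace missing values with the column mean
imputed = s.fillna(s.mean())
print(list(imputed))
```

Deletion shrinks the dataset from five values to three, while imputation keeps all five rows at the cost of reduced variance, which is the trade-off between the two methods.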