FDS Most Imp Question

The document provides definitions and explanations of various data science concepts, including primary data, interquartile range, data visualization tools, and data preprocessing. It covers statistical methods, data attributes, and the importance of data quality and transformation in analysis. Additionally, it discusses hypothesis testing, measures of central tendency, and the visualization of geospatial data.

What do you mean by Primary Data- Primary data refers to information that is collected directly from original sources for a specific research purpose or analysis. It is firsthand data gathered through methods such as surveys, interviews, experiments, or observations, ensuring that it is original and not previously analyzed.

Define Interquartile range- The interquartile range (IQR) is a statistical measure of the spread of a dataset. It is the range between the first quartile (Q1) and the third quartile (Q3) and represents the middle 50% of the data. The formula is: IQR = Q3 - Q1. It helps identify the variability in the data and is used to detect outliers.
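
A minimal Python sketch of computing the IQR, assuming NumPy is available (the sample values are invented for illustration):

import numpy as np

data = [4, 7, 9, 11, 12, 20]            # illustrative sample
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # IQR = Q3 - Q1
print(q1, q3, iqr)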

What are uses of zip files- ZIP files are compressed archive files used to reduce the size of files or collections of files for storage and transfer. The main uses include:
Space-saving: Compressing files so they take up less disk space.
Ease of transfer: Reducing file size makes it faster and more efficient to share over the internet.
Organization: Bundling multiple files or folders into a single archive.
Security: Optionally, ZIP files can be encrypted with a password for added protection.

What do you mean by XML Files data format-XML (eXtensible Markup Language) is a text-
based format used to store and transport structured data. It is both human-readable and
machine-readable, making it widely used for data representation and sharing between
systems. XML files are made up of elements, attributes, and nested structures that define
data hierarchies.

What is visual encoding- Visual encoding refers to the process of representing data through
visual means such as charts, graphs, and diagrams. It involves translating data attributes
(e.g., values, categories) into visual elements (e.g., shapes, colors, positions) to facilitate
understanding and pattern recognition.

What is Data science- Data science is an interdisciplinary field that combines scientific
methods, processes, algorithms, and systems to extract insights and knowledge from
structured and unstructured data. It involves areas such as statistics, machine learning, data
analysis, and big data technologies to make informed decisions.

Define Data source- A data source is any origin from which data is obtained or collected for
analysis. Data sources can be primary (e.g., surveys, interviews, experiments) or secondary
(e.g., databases, research papers, and public datasets). Data sources can also include APIs,
sensors, or data dumps from various software.

Define Hypothesis Testing- Hypothesis testing is a statistical method used to determine whether there is enough evidence to reject a null hypothesis about a population parameter. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), collecting data, and using statistical tests (e.g., t-tests, chi-square tests) to make decisions. The result is typically expressed in terms of a p-value, indicating the likelihood of observing the data if the null hypothesis is true.
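
As one hedged illustration, a one-sample t-test with SciPy (the sample values and the hypothesized mean of 50 are made up for the example):

from scipy import stats

sample = [51, 49, 52, 50, 48, 53, 47, 52]                 # hypothetical measurements
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)   # H0: population mean = 50
print(t_stat, p_value)  # reject H0 if p_value is below the chosen significance level
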
What is use of Bubble plot- A bubble plot is a type of scatter plot that displays three or more dimensions of data. The x and y axes represent two variables, the size of each bubble represents a third variable, and the color can optionally encode a fourth. It is useful for visualizing relationships and comparing data points across multiple attributes.
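
A small Matplotlib sketch of a bubble plot (the x, y, size, and color arrays are invented for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 30]
size = [100, 300, 500, 200]   # third variable encoded as bubble area
color = [0.1, 0.5, 0.7, 0.9]  # optional fourth variable encoded as color
plt.scatter(x, y, s=size, c=color, alpha=0.6)
plt.xlabel("x variable")
plt.ylabel("y variable")
plt.show()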

Define Data cleaning- Data cleaning is the process of identifying and rectifying errors, inconsistencies, inaccuracies, or missing values in a dataset to improve its quality. This process can include tasks such as:
Removing duplicates
Correcting typos or formatting issues
Handling missing values by imputation or deletion
Ensuring data consistency across the dataset
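
A minimal pandas sketch of these cleaning steps (the column names and fill strategy are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob", None],
                   "age": [25, 25, None, 40]})
df = df.drop_duplicates()                         # remove duplicate rows
df["name"] = df["name"].str.strip().str.title()   # fix simple formatting issues
df["age"] = df["age"].fillna(df["age"].mean())    # impute missing ages with the mean
df = df.dropna(subset=["name"])                   # drop rows still missing a name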

List the visualization libraries in python.


Matplotlib: Basic, highly customizable 2D plots.
Seaborn: Built on Matplotlib, provides more sophisticated and high-level visualizations.
Plotly: Interactive, web-based plots.
Bokeh: Interactive plots for web applications.
ggplot: A Python adaptation of the popular R library for statistical graphics.
Altair: Declarative statistical visualization library.
Pandas Visualization: Quick plotting capabilities built into Pandas dataframes.
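
As a small illustration of the first two libraries above, a hedged sketch using Matplotlib and Seaborn (the data is invented):

import matplotlib.pyplot as plt
import seaborn as sns

values = [3, 7, 1, 9, 4, 6, 8, 2]
plt.plot(values)              # basic Matplotlib line plot
plt.title("Matplotlib line plot")
plt.show()

sns.histplot(values, bins=4)  # higher-level Seaborn histogram
plt.title("Seaborn histogram")
plt.show()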

List applications of data science.


Healthcare: Predictive analysis, medical image analysis, personalized treatment.
Finance: Fraud detection, risk modeling, stock market predictions
Retail: Customer segmentation, demand forecasting, recommendation systems.
Marketing: Customer behavior analysis, targeted advertising.
E-commerce: Personalization, sales forecasting, inventory management.
Transportation: Route optimization, predictive maintenance, traffic analysis.
Social Media: Sentiment analysis, user engagement analysis.
Education: Adaptive learning systems, dropout prediction.

What is data transformation- Data transformation is the process of converting data from
one format, structure, or value into another. This process is essential for cleaning, enriching,
and formatting data before analysis or integration with other data systems. It may involve
tasks such as normalization, aggregation, or data type conversion.

Define standard deviation- Standard deviation is a measure of the dispersion of a data set from its mean value. It is expressed in the same units as the data itself. Standard deviation is always non-negative and is denoted by σ (sigma).
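
A quick NumPy illustration (the sample values are made up):

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = np.std(data)   # population standard deviation
print(sigma)           # 2.0 for this sample (mean is 5)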

What is data discretization- Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing the many values of a continuous attribute with a small number of interval labels reduces and simplifies the original data.
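
A small pandas sketch of discretizing a continuous attribute into interval labels (the ages and bin edges are illustrative assumptions):

import pandas as pd

ages = pd.Series([3, 17, 25, 42, 58, 71])
labels = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young", "middle-aged", "senior"])
print(labels.tolist())  # each age replaced by its interval label
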
What is missing values- Missing values are data points for which no value is recorded for a variable in an observation. For example, you might want to distinguish between data that are missing because a respondent refused to answer and data that are missing because the question didn't apply to that respondent.

What is data quality- Data quality refers to the condition of a dataset, determined by factors such as accuracy, completeness, consistency, reliability, and relevance. High-quality data is essential for effective analysis and decision-making.

What is tag cloud- A tag cloud (also known as a word cloud or weighted list in visual design) is a visual representation of text data which is often used to depict keyword metadata on websites, or to visualize free-form text. Tags are usually single words, and the importance of each tag is shown with font size or color. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.

What is an outlier? State types of outliers


An outlier is a data point that differs significantly from other observations in a dataset.
Outliers can indicate variability in the measurements, experimental errors, or a novel
phenomenon. They can be classified into two main types:
1. Univariate Outliers: These are outliers in a single variable's distribution. They can be
identified using statistical methods like the Z-score or the IQR (Interquartile Range) method.
2. Multivariate Outliers: These occur when a data point is an outlier in the context of
multiple variables. They can be detected using methods such as Mahalanobis distance or
clustering techniques.

Explain Data transformation and there types


Data transformation is the process of converting, cleaning, and structuring raw data into a usable format for analysis and decision-making. It is a crucial step in data management that ensures your information is accessible, consistent, and secure. As organizations deal with massive amounts of data from various sources daily, data transformation has become an essential tool to integrate, store, and analyze information for business intelligence. Data transformation can be categorized into four main types:
Constructive: Adding, copying, or replicating data.
Destructive: Deleting unnecessary records or fields.
Aesthetic: Standardizing values to meet specific requirements or parameters.
Structural: Reorganizing the database by renaming, moving, or combining columns.

Write details notes on basic data visualization tools


Data Visualization Tools are software platforms that present information in a visual format such as a graph, chart, etc., to make it easily understandable and usable. Data visualization tools are popular because they allow analysts and statisticians to create visual data models easily according to their specifications by conveniently providing an interface, database connections, and machine learning tools all in one place.
Bar Charts: Used to compare quantities across different categories.
Histograms: Represent the distribution of numerical data.
Line Graphs: Show trends over time.
Scatter Plots: Display relationships between two numerical variables.
Box Plots: Summarize data using quartiles and highlight outliers.
What are the measures of central tendency? Explain them
Frequency distributions and graphical representations are used to depict a set of raw data and to draw meaningful conclusions from them. However, these methods sometimes fail to convey a proper and clear picture of the data. Therefore, single summary figures, known as Measures of Central Tendency, are used. The different measures of central tendency can be classified into three categories; viz., Mathematical Averages (Arithmetic Mean, Geometric Mean (G), and Harmonic Mean (H)), Positional Averages (Median (M or Me) and Mode (Z)), and Commercial Averages (Moving Average, Progressive Average, and Composite Average).

What are the different methods for measuring the data dispersion
The measures of dispersion that are measured and expressed in the units of the data themselves are called Absolute Measures of Dispersion, for example meters, dollars, kg, etc. Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in the distribution.
Mean Deviation: It is the arithmetic mean of the absolute differences between the values and their mean.
Standard Deviation: It is the square root of the arithmetic average of the squared deviations measured from the mean.
Variance: It is defined as the average of the squared deviations from the mean of the given data set.
Quartile Deviation: It is defined as half of the difference between the third quartile and the first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile Range. Its formula is given as Q3 - Q1.

How do you visualize geospatial data. Explain in detail


Geospatial visualization is the process of representing data associated with a location on a
map to help people understand it. It can be used to create interactive 3D maps and graphics,
or static maps. Geospatial visualization can help people understand patterns, trends, and
themes on Earth's surface. It can also help identify problems, track changes, and make
predictions.
Maps- Maps can be used to show the boundaries of a country, continent, or the whole planet. They can also be used to show a street, town, or park.
Proportional symbol maps- These maps use symbols of different sizes to show the relative magnitude of data at specific locations. For example, you can use this map type to show population density, crime rates, or economic indicators.
Interactive maps- Interactive maps allow users to navigate the map controls and interact with the data to get more information.
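
As a hedged sketch, an interactive proportional-symbol map could be built with the folium library (the coordinates and population figures are invented for illustration):

import folium

m = folium.Map(location=[20.0, 78.0], zoom_start=4)  # center the map (illustrative coordinates)
cities = [("City A", 19.07, 72.87, 20.4),
          ("City B", 28.61, 77.21, 16.8)]            # name, latitude, longitude, population in millions
for name, lat, lon, pop in cities:
    folium.CircleMarker(location=[lat, lon],
                        radius=pop,                   # symbol size proportional to the value
                        popup=f"{name}: {pop}M").add_to(m)
m.save("population_map.html")                         # open the HTML file in a browser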

Define Interquartile range


The interquartile range (IQR) is a simple way to measure how spread out the middle 50% of a
dataset is. It’s used in statistics to understand the spread of data by focusing on the central
part, ignoring any extreme values or outliers. This makes the IQR a useful tool when you want
to get a clear sense of where most of your data points lie, without letting unusually high or low
values distort the picture. Essentially, the IQR helps describe how clustered or spread out the middle portion of your data is.

Explain different data formats in brief- Data file formats usually come in two main varieties:
Binary files - files that contain information in their binary format, usually supporting
documents containing image or video data.
Text-based files - containing text-based data and information, for documents that are
primarily databases.
1. CSV (Comma-Separated Values): A simple text format for tabular data.
2. JSON (JavaScript Object Notation): A lightweight data interchange format that is easy for
humans to read and write.
3. XML (eXtensible Markup Language): A markup language that defines rules for encoding
documents in a format that is both human-readable and machine-readable.
4. Excel: A spreadsheet format commonly used for data analysis and manipulation.
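
A brief sketch of reading two of these formats in Python (the file names are assumptions):

import json
import pandas as pd

df = pd.read_csv("sales.csv")    # CSV: tabular text data into a DataFrame
with open("config.json") as f:
    config = json.load(f)        # JSON: nested key-value data into Python dicts/lists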

What do you mean by Data attribute? Explain types of attributes with example
Data attributes refer to the specific characteristics or properties that describe individual data objects within a dataset. These attributes provide meaningful information about the objects and are used to analyze, classify, or manipulate the data. Understanding and analyzing data attributes is fundamental in various fields such as statistics, machine learning, and data analysis, as they form the basis for deriving insights and making informed decisions from the data.
Nominal Attributes: Nominal attributes, as related to names, refer to categorical data where the values represent different categories or labels without any inherent order or ranking (e.g., color, country).
Binary Attributes: Binary attributes are a type of qualitative attribute where the data can take on only two distinct values or states (e.g., yes/no).
Symmetric: In a symmetric binary attribute, both values or states are considered equally important or interchangeable (e.g., male/female).
Asymmetric: An asymmetric binary attribute indicates that the two values or states are not equally important or interchangeable (e.g., positive/negative medical test).
Ordinal Attributes: Ordinal attributes are a type of qualitative attribute where the values possess a meaningful order or ranking, but the magnitude between values is not precisely quantified (e.g., low, medium, high).

Explain data cube- When data is grouped or combined into multidimensional matrices, these are called Data Cubes. The data cube method has a few alternative names or variants, such as "Multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)." The general idea of this approach is to materialize certain expensive computations that are frequently queried. For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views, one for each possible grouping of the dimensions part, supplier, and customer.

Define statistical data analysis:- Statistical analysis means gathering, understanding, and showing data to find patterns and connections that can help us make decisions. It includes lots of different ways to look at data, from simple things like basic descriptive facts to more complicated methods for figuring out what those facts mean.

Explain 3V’s of data science-The 3 V's (volume, velocity and variety) are three defining
properties or dimensions of big data. Volume refers to the amount of data, velocity refers to
the speed of data processing, and variety refers to the number of types of data.
What is meant by Noisy data- Noisy data are data with a large amount of additional meaningless information in them, called noise. This includes data corruption, and the term is often used as a synonym for corrupt data. It also includes any data that a user system cannot understand and interpret correctly. Many systems, for example, cannot use unstructured text. Noisy data can adversely affect the results of any data analysis and skew conclusions if not handled properly. Statistical analysis is sometimes used to weed the noise out of noisy data.

Give the purpose of data preprocessing? Data preprocessing is a crucial step in data mining and machine learning. It involves cleaning, transforming, and preparing raw data to improve its quality and suitability for analysis. The main purposes of data preprocessing include:
Handling missing values: Imputing missing values or removing records with missing data.
Noise reduction: Identifying and removing noise or outliers.
Data integration: Combining data from multiple sources.
Data transformation: Normalization, standardization, and feature engineering.
Data reduction: Dimensionality reduction and feature selection.

Explain any four data visualization tools.


Google Charts: A free tool that lets you create interactive charts for online display.
Apache Superset: An open-source platform for data visualization and exploration. It has built-in charts, visualizations, and interactive dashboards.
Boxplots: A plot type for identifying outliers. It displays the distribution of statistical observations.
Visme: An online tool for creating infographics, charts, maps, graphs, presentations, and social media graphics.

Explain null and alternate hypothesis.


Null Hypothesis (H₀): The null hypothesis represents a statement of no effect, no difference, or no relationship. It is the default assumption that any observed differences or effects in the data are due to chance or random variation, rather than any specific cause or intervention.
Alternate Hypothesis (H₁ or Ha): The alternate hypothesis is the complement of the null hypothesis. It suggests that there is a significant effect, difference, or relationship. In other words, it is the hypothesis that researchers hope to support.

What is a Venn diagram? How is it created? Explain with example.


A Venn diagram is a visual representation of sets and their relationships. It consists of
overlapping circles, where each circle represents a set and the overlapping regions represent
the intersection of sets. Example:
Consider two sets: A = {1, 2, 3, 4} and B = {3, 4, 5, 6}.

The overlapping region represents the intersection of the two sets, which is {3, 4}. Venn diagrams are useful for understanding and visualizing set operations like union, intersection, and difference.
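
The same example can be checked with Python sets (and, as an assumption, drawn with the third-party matplotlib-venn package if it is installed):

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(A & B)   # intersection -> {3, 4}
print(A | B)   # union        -> {1, 2, 3, 4, 5, 6}
print(A - B)   # difference   -> {1, 2}

# Optional drawing, assuming matplotlib-venn is installed:
# from matplotlib_venn import venn2
# venn2([A, B], set_labels=("A", "B"))
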
What do you mean by Data transformation? Explain strategies of data transformation.
Data transformation refers to the process of converting data from its original format or
structure into a different format or structure that is more suitable for analysis, reporting, or
further processing. This is a key step in data preprocessing and can involve several operations
that change the data's format, structure, or values to meet specific analytical needs. Data
transformation is often performed during the Extract, Transform, Load (ETL) process in data
engineering, especially when data from disparate sources needs to be integrated into a
centralized data warehouse or used for machine learning.

Data Cleaning (or Data Wrangling): Data cleaning is an essential part of data transformation, involving the correction or removal of inaccurate, incomplete, or duplicated records.
Normalization: The process of adjusting the values in a dataset to a common scale, typically in a range like 0 to 1.
Aggregation: The process of combining data from multiple sources or records into a single summary value. This is often used in reporting or time-series data analysis.
Data Type Conversion: Data often needs to be converted to appropriate types to match the requirements of downstream processes.
Feature Engineering: The process of creating new variables (features) or modifying existing ones to improve the performance of a machine learning model or analysis.

What is inferential statistics?


Inferential statistics is a branch of statistics that involves drawing conclusions about a
population based on a sample of data. It uses statistical tests and models to make inferences
about the population parameters, such as the mean, standard deviation, or proportions.
Common techniques include hypothesis testing, confidence intervals, and regression
analysis.

Explain any two data transformation techniques in detail.


Data transformation is the process of converting raw data into a suitable format for analysis.
Two common techniques are:
Normalization: Scales numerical data to a specific range (e.g., 0 to 1 or -1 to 1). Helps in
improving the performance of machine learning algorithms. Techniques include min-max
scaling and z-score normalization.
Discretization: Converts continuous numerical data into discrete intervals or bins. Reduces
the number of values, simplifies analysis, and can improve model performance. Methods
include equal-width binning, equal-frequency binning, and clustering-based binning.
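
A compact sketch of both techniques (the sample values and bin count are illustrative):

import numpy as np
import pandas as pd

values = np.array([10., 20., 35., 50., 80.])

# Min-max normalization to the range 0-1
normalized = (values - values.min()) / (values.max() - values.min())

# Equal-width discretization into 3 bins
bins = pd.cut(values, bins=3, labels=["low", "medium", "high"])
print(normalized, list(bins))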

Write a short note on feature extraction.


Feature extraction is the process of selecting and transforming relevant features from raw
data to improve the performance of machine learning models. It involves identifying the most
informative characteristics of the data that contribute to the prediction or classification task.
Feature Selection: Choosing a subset of the most relevant features.
Feature Engineering: Creating new features from existing ones.
Dimensionality Reduction: Reducing the number of features using techniques like Principal
Component Analysis (PCA) or t-SNE.
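
A hedged scikit-learn sketch of dimensionality reduction with PCA (the toy matrix and the choice of 2 components are assumptions):

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 5)        # 100 samples with 5 original features
pca = PCA(n_components=2)         # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)  # shape (100, 2)
print(pca.explained_variance_ratio_)
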
Explain Exploratory Data Analysis (EDA) in detail.
Exploratory Data Analysis (EDA) is an essential step in the data science pipeline. It involves
understanding the data through statistical summaries, visualizations, and other techniques.
Understand the Data:
Identify the data types (numerical, categorical)
Check for missing values and outliers
Examine the distribution of variables
Discover Patterns:
Identify trends, correlations, and relationships between variables
Find clusters and anomalies
Prepare for Modeling:
Transform and clean the data
Select relevant features for modelling

Explain any two tools in the data scientist's toolbox.


Python: A versatile programming language for data analysis, machine learning, and data visualization. Popular libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn.
SQL: Essential for working with relational databases. Used for data extraction, transformation, and loading (ETL) processes. Enables querying and manipulating large datasets.

Write a short note on wordclouds.


A word cloud is a visual representation of text data where words are displayed in different
sizes, with larger words representing more frequent terms. It's a useful tool for quickly
identifying the most important keywords or themes within a text document or corpus. Word
clouds are often used in text analysis, natural language processing, and information
visualization. By visually highlighting the most prominent words, word clouds can help users
gain insights into the underlying topics and sentiments of the text.
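
A hedged sketch using the third-party wordcloud package together with Matplotlib (the input text is invented):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data science data analysis machine learning data visualization"
wc = WordCloud(width=400, height=200, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")  # more frequent words appear larger
plt.axis("off")
plt.show()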

Explain any two ways in which data is stored in files.


Text-based Files:
- Data is stored as plain text characters.
- Common formats: CSV, TSV, JSON, XML
- Simple to read and write but less efficient for large datasets.
Binary Files:
- Data is stored in binary format; it is more efficient for storing large amounts of data.
- Common formats: Databases, images, audio, video
- Requires specific software or libraries to read and write.

What is a quartile?
A quartile is a statistical measure that divides a dataset into four equal parts. There are three quartiles, and they are used to understand the distribution and variability of data.
First Quartile (Q1): Separates the lowest 25% of the data from the rest.
Second Quartile (Q2): Also known as the median, separates the lowest 50% of the data.
Third Quartile (Q3): Separates the lowest 75% of the data from the top 25%.
Types of Data (with Examples):
Structured Data: Organized into tables (e.g., spreadsheets, databases).
Unstructured Data: Lacks predefined format (e.g., videos, social media posts).
Semi-Structured Data: Contains some organizational properties (e.g., JSON, XML).
Metadata: Data about data (e.g., file size, author name).

Types of Descriptive Statistics


Descriptive statistics summarize and describe the main features of a dataset.
They are categorized into:
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation
Measures of Shape: Skewness and Kurtosis

Types of Data
Structured Data: Organized into rows and columns (e.g., spreadsheets, databases).
Unstructured Data: Data without a fixed structure (e.g., text, images, videos).
Semi-Structured Data: Data with a flexible structure (e.g., JSON, XML).
Qualitative Data: Non-numerical information (e.g., opinions, descriptions).
Quantitative Data: Numerical information (e.g., measurements, counts).

Transformation Strategies
Log Transformation: Reduces skewness and handles large-scale differences in data.
Square Root Transformation: Reduces skewness for positive data.
Z-Score Transformation: Standardizes data using z-scores.
Scaling: Ensures all data is on the same scale for analysis.
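
A short sketch of the log and z-score strategies (the skewed sample is invented):

import numpy as np

x = np.array([1., 2., 4., 8., 1000.])   # right-skewed data
log_x = np.log1p(x)                     # log transformation reduces skewness
z_x = (x - x.mean()) / x.std()          # z-score standardization
print(log_x, z_x)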

What is Visual Encoding


-Visual encoding refers to the process of representing data values as visual elements, such as
position, size, shape, color, and orientation, to make data interpretable. For example, in a bar
chart, the length of the bar encodes the data value.

Define Box Plot


-A box plot is a graphical representation of the distribution of a dataset. It displays the
median, quartiles, and potential outliers. The box represents the interquartile range (IQR),
while whiskers extend to show the range of the data.
Define Dendrogram
- A dendrogram is a tree-like diagram that represents hierarchical relationships between data
points. It is commonly used in clustering algorithms to visualize how data points are grouped.

What are Donut Charts


-A donut chart is a variation of a pie chart with a circular hole in the center. It is used to
represent proportions or percentages among different categories.

Define Area Chart


-An area chart is a type of graph that represents quantitative data using filled areas under the
line connecting data points. It shows trends over time and emphasizes the magnitude of
changes.

Life Cycle of Data Science:


Problem Definition: Identifying the question or challenge to solve.
Data Acquisition: Collecting raw data from sources.
Data Preparation: Cleaning, transforming, and organizing data for analysis.
Exploratory Data Analysis (EDA): Understanding patterns and relationships in the data.
Modeling : Applying machine learning or statistical models to derive insights.
Evaluation: Assessing the model's accuracy and reliability.
Deployment: Implementing the solution in real-world applications.
Monitoring and Maintenance: Ensuring the model remains effective over time.

Data Scientist's Toolbox:


Programming Languages: Python, R
Libraries and Frameworks: Pandas,NumPy, TensorFlow, Scikit-learn
Data Visualization Tools: Matplotlib, Tableau
Databases: SQL, MongoDB
Cloud Platforms: AWS, Google Cloud

List different types of attributes. Attributes can be broadly categorized into two types:
Categorical Attributes:
Nominal: No inherent order (e.g., color, gender).
Ordinal: Has a natural order (e.g., low, medium, high).
Numerical Attributes:
Discrete: Countable values (e.g., number of children).
Continuous: Infinitely many possible values (e.g., height, weight).

State the methods of feature selection.There are several methods for feature selection in
machine learning:
-Filter Methods: Statistical measures like correlation, chi-square test, and information gain
are used to rank features.
-Wrapper Methods: Algorithms like forward selection, backward elimination, and recursive
feature elimination evaluate subsets of features.
-Embedded Methods: Feature selection is integrated into the model building process, such as
regularization techniques like L1 and L2 regularization.
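
As one hedged illustration of a filter method, scikit-learn's SelectKBest with the chi-square score (the built-in iris dataset is used purely as toy data):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())                 # boolean mask of the selected features
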
Structured data vs Unstructured data (on the basis of):
Technology: Structured data is based on a relational database, while unstructured data is based on character and binary data.
Flexibility: Structured data is less flexible and schema-dependent; unstructured data has no schema, so it is more flexible.
Scalability: It is hard to scale a structured database schema; unstructured data is more scalable.
Robustness: Structured data is very robust; unstructured data is less robust.
Performance: With structured data we can perform structured queries that allow complex joins, so the performance is higher; with unstructured data only textual queries are possible, so the performance is lower than for semi-structured and structured data.
Nature: Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted; unstructured data is qualitative, as it cannot be processed and analyzed using conventional tools.
Format: Structured data has a predefined format; unstructured data comes in a variety of shapes and sizes.
Analysis: Structured data is easy to search; searching unstructured data is more difficult.

Define variance.- Variance is a statistical measure that quantifies the dispersion or spread of data points from their mean. It calculates the average squared difference between each data point and the mean. A higher variance indicates greater variability in the data, while a lower variance indicates less variability.

What is nominal attribute?- Nominal attributes are categorical data where the values represent different categories or labels without any inherent order or ranking. Examples include gender, color, or country.
Explain two methods of data cleaning for missing values.
Deletion:
Listwise deletion: Removes entire records with missing values.
Pairwise deletion: Excludes cases with missing values only for specific analyses.
Simple to implement but can lead to loss of information.
Imputation:
Mean/Median/Mode Imputation: Replaces missing values with the mean, median, or
mode of the respective variable.
Regression Imputation: Predicts missing values using regression models.
Hot Deck Imputation: Replaces missing values with values from similar records.
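
A small pandas sketch of deletion and mean/mode imputation (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({"income": [40000, None, 52000, 61000],
                   "city": ["Pune", "Mumbai", None, "Delhi"]})

listwise = df.dropna()   # deletion: drop rows with any missing value
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].mean())  # mean imputation
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])     # mode imputation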

Explain role of statistics in data science.


Statistics plays a crucial role in data science by providing the tools and techniques to
analyze, interpret, and draw meaningful insights from data. It helps in:
Data exploration and cleaning: Identifying patterns, anomalies, and missing values.
Feature engineering: Creating new features from existing ones.
Model building and evaluation: Selecting appropriate models, training, and evaluating their
performance.
Hypothesis testing: Making inferences about the population based on sample data.
Data visualization: Creating informative visualizations to communicate findings.

Define volume characteristic of data in reference to data science.


Volume in the context of data science refers to the sheer size and quantity of data being generated and stored. As technology advances, the volume of data generated by various sources (e.g., social media, IoT devices, scientific experiments) is rapidly increasing. This massive volume of data presents both challenges and opportunities for data scientists, requiring specialized tools and techniques to store, process, and analyze it effectively.

Give examples of semistructured data.


<book>
<title>The Lord of the Rings</title>
<author>J.R.R. Tolkien</author>
<genre>Fantasy</genre>
</book>
Semistructured data is data that doesn't conform to a rigid, predefined data model. It has a partial structure, often using tags or markers to delimit data elements. Examples include XML, JSON, and HTML. While it lacks the strict structure of relational databases, it offers flexibility for representing complex information.

What is one hot coding?


One-hot encoding is a technique used to convert categorical data into numerical data. It creates a new binary feature for each category, assigning a value of 1 to the corresponding category and 0 to others. This allows machine learning algorithms to process categorical data effectively.
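
A minimal pandas illustration (the color column is invented):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])  # one binary column per category
print(encoded)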
