FDS - 4 Solved
a) What is Primary Data?
Sol:
Primary data refers to data that is collected directly from the source or through firsthand
observation. This type of data is gathered specifically for the research purpose at hand,
making it original and unique. Primary data collection methods include surveys, interviews,
experiments, and observations. It is highly reliable and specific to the needs of the
research.
b) What is Data Quality?
Sol:
Data quality refers to the condition of data based on factors such as accuracy,
completeness, consistency, reliability, and relevance. High-quality data meets the
requirements for its intended use in decision-making, analysis, and operations. Poor data
quality can lead to incorrect insights, flawed decisions, and operational inefficiencies.
c) Define outlier.
Sol:
An outlier is a data point that significantly differs from other observations in a dataset. It
lies outside the overall pattern of the data and can be caused by variability in the data,
errors, or unusual conditions. Outliers can distort statistical analyses and lead to
misleading results if not properly handled.
d) What is IQR (Interquartile Range)?
Sol:
The Interquartile Range (IQR) is a measure of statistical dispersion that represents the
range within which the middle 50% of data points lie. It is calculated as the difference
between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1
The IQR is used to identify outliers and understand the spread of the central portion of the
data.
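For illustration, a short Python sketch (sample values assumed; the 1.5 × IQR fences are a common convention, not stated above) that computes the IQR and flags outliers:
import numpy as np

data = [24, 25, 26, 27, 28, 95]  # sample values chosen for illustration
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # common outlier fences
outliers = [x for x in data if x < lower or x > upper]
print("IQR:", iqr, "Outliers:", outliers)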
e) What do you mean by Missing Values?
Sol:
Missing values refer to the absence of data in a dataset where a data point is expected.
They can occur due to various reasons, such as data entry errors, equipment malfunction,
or respondents not answering certain questions in a survey. Missing values can affect the
quality of analysis and need to be addressed through methods like imputation or deletion.
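As an illustration, a small pandas sketch (sample data assumed) showing deletion and imputation of missing values:
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 30], "city": ["Pune", "Mumbai", None]})
dropped = df.dropna()  # deletion: remove rows that contain missing values
imputed = df.fillna({"age": df["age"].mean(), "city": "Unknown"})  # imputation
print(dropped)
print(imputed)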
f) Why are ZIP files used?
Sol:
ZIP files are used to compress and archive files and directories. They offer several benefits:
• Compression: Reduce the file size, saving storage space and making it easier to transfer
files.
• Archiving: Bundle multiple files and directories into a single archive, simplifying file
management and distribution.
• Encryption: Provide security features to protect the contents of the files through
password protection and encryption.
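For example, a minimal sketch using Python's built-in zipfile module (the file name report.csv is an assumption and must exist in the working directory):
import zipfile

# Create a compressed archive containing one file
with zipfile.ZipFile("archive.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("report.csv")

# List the contents of the archive
with zipfile.ZipFile("archive.zip") as zf:
    print(zf.namelist())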
g) What is XML?
Sol:
XML (eXtensible Markup Language) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable. XML
files are structured using tags to define elements, making it easy to store, transport, and
exchange data between different systems and applications.
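For illustration, a small sketch that parses an XML fragment with Python's built-in xml.etree.ElementTree module (the element names are assumed):
import xml.etree.ElementTree as ET

xml_data = "<students><student><name>Asha</name><marks>85</marks></student></students>"
root = ET.fromstring(xml_data)
for student in root.findall("student"):
    print(student.find("name").text, student.find("marks").text)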
h) What is Data Discretization?
Sol:
Data discretization is the process of converting continuous data attributes into discrete
categories or intervals. It simplifies the data, making it more manageable for analysis and
interpretation. Techniques for data discretization include binning, clustering, and decision
tree methods.
i) What is Tag Cloud?
Sol:
A tag cloud, also known as a word cloud, is a visual representation of text data where
individual words are displayed in varying sizes based on their frequency or importance. It
helps quickly identify the most prominent terms in a dataset, making it useful for text
analysis and summarization.
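A minimal sketch, assuming the third-party wordcloud package and Matplotlib are installed (sample text assumed):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data science data analysis python data visualization"  # sample text
cloud = WordCloud(width=400, height=200, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()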
j) What is Visual Encoding?
Sol:
Visual encoding refers to the process of representing data through visual elements like
position, size, shape, color, and texture. This helps in conveying information effectively and
enables users to understand and interpret data patterns, trends, and relationships through
visual representations.
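As an illustration, a short Matplotlib sketch (sample values assumed) that encodes several variables through position, size, and color:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]             # encoded as horizontal position
y = [10, 20, 15, 30]         # encoded as vertical position
sales = [50, 120, 80, 200]   # encoded as marker size
region = [0, 1, 2, 1]        # encoded as marker color
plt.scatter(x, y, s=sales, c=region)
plt.show()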
Q2) Attempt any four of the following:
a) Explain applications of data science in different domains.
Sol:
1. Healthcare:
o Drug Discovery: Analyze clinical trial data to discover new drugs and predict their
effectiveness.
2. Finance:
o Risk Management: Identify and mitigate financial risks through predictive modeling
and anomaly detection.
o Fraud Detection: Detect fraudulent transactions and activities using machine learning
algorithms.
3. Retail:
o Recommendation Systems: Suggest products to customers based on their purchase
history and browsing behavior.
4. Marketing:
o Sentiment Analysis: Analyze social media and customer reviews to gauge public
sentiment about products and brands.
5. Transportation:
o Route Optimization: Analyze traffic and GPS data to plan efficient routes and reduce
travel time.
b) Explain Null Hypothesis and Alternate Hypothesis with examples.
Sol:
• Null Hypothesis (H0): The null hypothesis is a statement that there is no effect or no
difference, and it serves as the default assumption in hypothesis testing. It is a
baseline that the researcher aims to test against.
o Example: In a clinical trial, the null hypothesis might state that a new drug has
no effect on patients compared to a placebo.
• Alternate Hypothesis (H1): The alternate hypothesis is a statement that there is an
effect or a difference. It is what the researcher seeks to support with evidence from
the data.
o Example: In a clinical trial, the alternate hypothesis might state that the new
drug has a positive effect on patients compared to a placebo.
Hypothesis testing involves determining whether there is enough evidence to reject the null
hypothesis in favor of the alternate hypothesis.
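A minimal sketch, assuming SciPy is available and using made-up sample measurements, of a two-sample t-test for such a comparison:
from scipy import stats

drug_group = [5.1, 5.8, 6.2, 5.9, 6.0]     # assumed outcome scores
placebo_group = [4.9, 5.0, 5.2, 4.8, 5.1]
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)
print("t =", t_stat, "p =", p_value)
# A small p-value (commonly < 0.05) would lead us to reject the null hypothesis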
c) What do you mean by Noisy Data? Explain any two causes of noisy data.
Sol:
Noisy data refers to data that contains errors, outliers, or irrelevant information that can
obscure the true signal or pattern in the dataset. Noisy data can negatively impact the
quality of analysis and model performance.
Two common causes of noisy data:
1. Data Entry Errors:
o Manual data entry by humans is prone to errors, such as typos, incorrect values, and
formatting inconsistencies. These errors introduce noise into the dataset.
2. Sensor Malfunctions:
o Faulty, damaged, or poorly calibrated sensors can record inaccurate or inconsistent
readings, which introduces noise into the collected data.
d) What do you mean by Data Visualization? Give examples of any two data
visualization libraries.
Sol:
Data visualization is the graphical representation of data using visual elements such as
charts, graphs, and maps. It enables users to understand and interpret complex data by
presenting it in a visual context, making it easier to identify patterns, trends, and insights.
1. Matplotlib (Python):
o Example: Creating line charts, bar charts, scatter plots, and histograms.
2. ggplot2 (R):
o Example: Creating layered plots such as scatter plots, box plots, and histograms
using the grammar of graphics.
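As a concrete example of the Matplotlib library mentioned above, a minimal sketch (sample values assumed) that produces a simple bar chart:
import matplotlib.pyplot as plt

products = ["A", "B", "C"]
sales = [120, 90, 150]  # assumed values for illustration
plt.bar(products, sales)
plt.xlabel("Product")
plt.ylabel("Sales")
plt.show()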
e) Explain the 3 V's (characteristics) of Big Data.
Sol: -
1. Volume:
o The volume characteristic of data refers to the sheer amount of data generated and
stored in data systems. It signifies the massive quantities of data that organizations
collect, process, and analyze.
o Example: Social media platforms generating terabytes of user data every day.
2. Velocity:
o Velocity refers to the speed at which data is generated, processed, and analyzed. It
emphasizes the need for real-time or near real-time data processing to gain timely
insights.
o Example: Stock market data where prices and trades are updated every second.
3. Variety:
o Variety refers to the different types and sources of data. It includes structured data
(databases), semi-structured data (XML, JSON), and unstructured data (text,
images, videos).
o Example: Data from social media posts, emails, transaction records, and
multimedia content.
Q3) Attempt any two of the following:
a) What is data cube aggregation? Explain with an example.
Sol:
Data cube aggregation is a technique used in data warehousing and Online Analytical
Processing (OLAP) to reduce the volume of data by summarizing and aggregating it across
multiple dimensions. This method helps in improving query performance and making the
data more manageable for analysis.
Key Concepts:
• Dimensions:
o Dimensions are the perspectives or entities with respect to which data is organized,
such as time, location, and product.
• Measures:
o Measures are the numerical values that are analyzed, such as sales, revenue, and
profit.
• Aggregation Operations:
o Common aggregation operations include sum, average, count, min, max, etc.
Steps in Data Cube Aggregation:
1. Data Loading:
o Raw data is loaded into the data warehouse from various sources.
2. Data Preprocessing:
o Data is cleaned and transformed to ensure quality and consistency.
3. Cube Creation:
o The data cube is created by defining the dimensions and measures. Each cell in the
cube represents an aggregated value for a specific combination of dimension
values.
4. Data Aggregation:
o Aggregation operations are performed to compute the values for each cell in the
data cube. For example, summing sales for each product category over different
time periods.
Example: Consider a sales data warehouse with the following dimensions: Time (Year,
Quarter, Month), Product (Category, Sub-Category), and Location (Country, State, City). The
measure is Sales.
The aggregated data cube can be viewed as a table with one column group per dimension
and one column for the measure: (Year, Quarter, Month) | (Category, Sub-Category) |
(Country, State, City) | Aggregated Sales. Each row holds the aggregated sales for one
combination of dimension values.
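A minimal sketch, assuming made-up sales records in a pandas DataFrame, of how such an aggregation can be computed with a group-by (a simplified stand-in for OLAP cube aggregation):
import pandas as pd

sales = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "category": ["Clothing", "Electronics", "Clothing", "Electronics"],
    "country": ["India", "India", "USA", "USA"],
    "sales": [1000, 2500, 1200, 3000],
})
cube = sales.groupby(["year", "category", "country"])["sales"].sum()
print(cube)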
Advantages:
• Faster query response, since summary values are precomputed.
• Reduced data volume for analysis and reporting.
• Supports multi-level analysis through roll-up and drill-down operations.
Disadvantages:
• Loss of fine-grained detail in the aggregated views.
• Additional storage and processing are needed to build and maintain the cube.
b) Find the median and range of the following data: 24, 24, 24, 24, 25, 25, 27, 29, 32.
Sol:
• Median: The median is the middle value when the numbers are arranged in
ascending order. For an odd number of values, it is the single middle value; here there
are 9 values, so the median is the 5th value.
Sorted Values: 24, 24, 24, 24, 25, 25, 27, 29, 32
Median = 25
• Range: The range is the difference between the maximum and minimum values.
Range = 32 - 24 = 8
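These values can be verified with a short Python sketch using the built-in statistics module:
import statistics

values = [24, 24, 24, 24, 25, 25, 27, 29, 32]
print("Median:", statistics.median(values))   # 25
print("Range:", max(values) - min(values))    # 8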
c) Explain any four data visualization tools.
Sol:
1. Tableau:
A powerful data visualization tool that allows users to create interactive and shareable
dashboards. Tableau provides various visualization options such as bar charts, line charts,
maps, and more. It is known for its user-friendly interface and ability to handle large
datasets efficiently.
2. Power BI:
A business analytics tool from Microsoft for creating interactive reports and dashboards.
Power BI connects to a wide range of data sources and is widely used for business
reporting and sharing insights across an organization.
3. Matplotlib:
A plotting library for Python that provides a variety of plotting functions to create static,
interactive, and animated visualizations. It is widely used in scientific and engineering
communities for its flexibility and ability to produce publication-quality plots.
4. ggplot2:
An R package for data visualization based on the grammar of graphics. It builds plots
layer by layer and is popular for producing elegant, publication-quality statistical
graphics.
Q4) Attempt any two of the following:
Sol:
Data attributes refer to the characteristics or properties of data that describe the elements
within a dataset. Attributes define what type of information is being stored and how it can
be used in analysis.
Types of Attributes:
1. Nominal Attribute:
• Categorical attributes without any natural ordering or ranking.
• Example: Colors (Red, Green, Blue), Gender (Male, Female).
2. Ordinal Attribute:
• Categorical attributes with a meaningful order or ranking.
• Example: Education Level (High School, Bachelor's, Master's, Ph.D.), Customer
Satisfaction (Low, Medium, High).
3. Interval Attribute:
• Numerical attributes where the difference between values is meaningful, but there
is no true zero point.
• Example: Temperature in Celsius or Fahrenheit, Dates (e.g., 2024, 2025).
4. Ratio Attribute:
• Numerical attributes with a meaningful zero point, allowing for comparison of
magnitudes.
• Example: Height, Weight, Age, Income.
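For illustration, a small pandas sketch (sample values assumed) that holds all four attribute types in one DataFrame:
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Green", "Blue"],                        # nominal
    "satisfaction": pd.Categorical(
        ["Low", "High", "Medium"],
        categories=["Low", "Medium", "High"], ordered=True),  # ordinal
    "temperature_c": [21.5, 30.0, 25.2],                      # interval
    "income": [40000, 55000, 72000],                          # ratio
})
print(df.dtypes)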
c) How do you visualize geospatial data? Explain in detail
Sol: -
1. Choropleth Maps:
o Choropleth maps use different shades of colors to represent data values across
geographical regions, such as countries, states, or districts. The intensity of the
color indicates the magnitude of the variable being visualized.
o Use Case: Commonly used to show population density, election results, or any data
that can be aggregated by region.
o Example:
import geopandas as gpd
import matplotlib.pyplot as plt
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))  # bundled dataset in older GeoPandas versions
world.plot(column='pop_est', legend=True)  # shade each country by population estimate
plt.show()
2. Heat Maps:
o Heat maps represent data density or intensity using color gradients, where higher
concentrations of data points are shown with more intense colors.
o Use Case: Useful for visualizing phenomena such as crime rates, traffic congestion,
or disease outbreaks.
o Example:
import folium
from folium.plugins import HeatMap
data = [[19.07, 72.87], [19.08, 72.88], [19.09, 72.86]]  # sample [lat, lon] points
map = folium.Map(location=[19.07, 72.87], zoom_start=12)
HeatMap(data).add_to(map)
map.save("heatmap.html")
3. Scatter Plots:
o Scatter plots display individual data points as markers placed at their longitude and
latitude coordinates.
o Use Case: Suitable for visualizing locations of businesses, events, or any point-
based data.
o Example:
import matplotlib.pyplot as plt
# Example coordinates (sample values)
lons = [72.87, 77.59, 88.36]
lats = [19.07, 12.97, 22.57]
plt.scatter(lons, lats)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
4. Interactive Maps:
o Interactive maps allow users to pan, zoom, and explore geographical data
dynamically. These maps can include layers, pop-ups, and other interactive
elements.
o Use Case: Ideal for detailed exploration of spatial data, such as real estate maps,
environmental monitoring, and urban planning.
o Example:
import folium
map = folium.Map(location=[19.07, 72.87], zoom_start=10)  # sample center point
# Add a marker
folium.Marker([19.07, 72.87], popup="Sample location").add_to(map)
map.save("map.html")
Q5) Attempt any one of the following:
a) What is data transformation? Explain data transformation techniques.
Sol: -
Data transformation is the process of converting data from one format or structure into
another. This process is essential for preparing data for analysis, ensuring consistency,
improving data quality, and making it more suitable for machine learning algorithms.
1. Normalization:
o Normalization rescales numeric values to a common range, typically [0, 1], using
min-max scaling.
o Example:
from sklearn.preprocessing import MinMaxScaler
data = [[10], [20], [30], [40], [50]]  # sample values (assumed)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
2. Standardization:
o Standardization rescales data so that it has a mean of 0 and a standard deviation of 1
(z-scores).
o Example:
from sklearn.preprocessing import StandardScaler
data = [[10], [20], [30], [40], [50]]  # sample values (assumed)
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
3. Log Transformation:
o Log transformation stabilizes variance and makes the data more normally
distributed. It is particularly useful for data that follows a skewed distribution.
o Example:
import numpy as np
data = np.array([1, 10, 100, 1000])  # sample positive values (log requires values > 0)
log_transformed_data = np.log(data)
print(log_transformed_data)
4. One-Hot Encoding:
o One-hot encoding converts categorical values into binary indicator (0/1) columns,
one per category.
o Example:
import pandas as pd
data = {'color': ['red', 'green', 'blue', 'green']}  # sample categories (assumed)
df = pd.DataFrame(data)
one_hot_encoded_data = pd.get_dummies(df)
print(one_hot_encoded_data)
5. Binning:
o Converting continuous data into categorical data by dividing the range of data into
intervals or bins.
o Example:
import pandas as pd
data = {'age': [23, 25, 30, 35, 40, 45, 50, 55, 60, 65]}
df = pd.DataFrame(data)
df['age_bin'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60, 70], labels=['20-30', '30-40', '40-50', '50-60', '60-70'])
print(df)
b) What are the different methods for measuring the data dispersion?
Sol:
Data dispersion refers to the spread or variability of data points in a dataset. It provides
insights into how much the data points differ from the central tendency (mean, median,
mode).
1. Range:
o The range is the difference between the maximum and minimum values in a
dataset.
o Formula: Range = Maximum − Minimum
• Example:
data = [12, 15, 20, 22, 30]  # sample data (assumed)
range_value = max(data) - min(data)
print("Range:", range_value)
2. Interquartile Range (IQR):
o The IQR measures the spread of the middle 50% of the data. It is the difference
between the third quartile (Q3) and the first quartile (Q1).
o Formula: IQR = Q3 − Q1
• Example:
import numpy as np
data = [12, 15, 20, 22, 30]  # sample data (assumed)
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print("IQR:", IQR)
3. Variance:
o Variance measures the average squared deviation of each data point from the mean.
It quantifies the spread of the data points.
o Formula: Variance (σ²) = Σ(xᵢ − μ)² / N
• Example:
import numpy as np
data = [12, 15, 20, 22, 30]  # sample data (assumed)
variance = np.var(data)
print("Variance:", variance)
4. Standard Deviation:
o Standard deviation is the square root of the variance. It expresses the spread of the
data in the same units as the data itself.
o Formula: σ = √Variance
• Example:
import numpy as np
data = [12, 15, 20, 22, 30]  # sample data (assumed)
std_dev = np.std(data)
print("Standard Deviation:", std_dev)
5. Mean Absolute Deviation (MAD):
o MAD measures the average absolute deviation of each data point from the mean.
o Formula: MAD = Σ|xᵢ − μ| / N
• Example:
import numpy as np
data = np.array([12, 15, 20, 22, 30])  # sample data (assumed)
mad = np.mean(np.abs(data - np.mean(data)))
print("MAD:", mad)