FDS - 4 Solved

TYBCS Foundation of Data Science solved question papers

Q1) Attempt any eight of the following:

a) What do you mean by Primary Data?

Sol:

Primary data refers to data that is collected directly from the source or through firsthand
observation. This type of data is gathered specifically for the research purpose at hand,
making it original and unique. Primary data collection methods include surveys, interviews,
experiments, and observations. It is highly reliable and specific to the needs of the
research.

b) What do you mean by Data Quality?

Sol:

Data quality refers to the condition of data based on factors such as accuracy,
completeness, consistency, reliability, and relevance. High-quality data meets the
requirements for its intended use in decision-making, analysis, and operations. Poor data
quality can lead to incorrect insights, flawed decisions, and operational inefficiencies.

c) Define outlier.

Sol:

An outlier is a data point that significantly differs from other observations in a dataset. It
lies outside the overall pattern of the data and can be caused by variability in the data,
errors, or unusual conditions. Outliers can distort statistical analyses and lead to
misleading results if not properly handled.

d) Define Interquartile Range.

Sol:

The Interquartile Range (IQR) is a measure of statistical dispersion that represents the
range within which the middle 50% of data points lie. It is calculated as the difference
between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1

The IQR is used to identify outliers and understand the spread of the central portion of the
data.
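
The IQR and any outliers can be checked with a short NumPy sketch (the values below are illustrative, and the 1.5 × IQR rule is the usual convention):

import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]

# Quartiles and interquartile range
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Common rule of thumb: points beyond 1.5 * IQR from the quartiles are outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = [x for x in data if x < lower or x > upper]

print("IQR:", IQR)
print("Outliers:", outliers)
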
e) What do you mean by Missing Values?

Sol:

Missing values refer to the absence of data in a dataset where a data point is expected.
They can occur due to various reasons, such as data entry errors, equipment malfunction,
or respondents not answering certain questions in a survey. Missing values can affect the
quality of analysis and need to be addressed through methods like imputation or deletion.
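
A small pandas sketch (hypothetical data) showing how missing values can be detected and then handled by imputation or deletion:

import pandas as pd
import numpy as np

# Hypothetical survey data with missing entries
df = pd.DataFrame({'age': [25, np.nan, 30, 28],
                   'city': ['Pune', 'Mumbai', None, 'Delhi']})

print(df.isnull().sum())                        # count missing values per column

df['age'] = df['age'].fillna(df['age'].mean())  # imputation: replace with the mean
df = df.dropna()                                # deletion: drop rows still incomplete
print(df)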

f) What are uses of ZIP files?

Sol:

ZIP files are used to compress and archive files and directories. They offer several benefits:

• Compression: Reduce the file size, saving storage space and making it easier to transfer
files.

• Archiving: Bundle multiple files and directories into a single archive, simplifying file
management and distribution.

• Encryption: Provide security features to protect the contents of the files through
password protection and encryption.
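
A minimal sketch using Python's built-in zipfile module (the file names are assumed to exist) that compresses files into an archive and extracts them again:

import zipfile

# Bundle two files into a single compressed archive
with zipfile.ZipFile('reports.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('report1.txt')
    zf.write('report2.txt')

# Extract the archive into a separate directory
with zipfile.ZipFile('reports.zip') as zf:
    zf.extractall('extracted_reports')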

g) What do you mean by XML Files Data Format?

Sol:

XML (eXtensible Markup Language) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable. XML
files are structured using tags to define elements, making it easy to store, transport, and
exchange data between different systems and applications.
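
A short sketch using Python's standard xml.etree.ElementTree module to parse a small hypothetical XML document:

import xml.etree.ElementTree as ET

# Hypothetical XML document describing students
xml_data = """
<students>
    <student id="1"><name>Asha</name><marks>85</marks></student>
    <student id="2"><name>Ravi</name><marks>78</marks></student>
</students>
"""

root = ET.fromstring(xml_data)
for student in root.findall('student'):
    print(student.get('id'), student.find('name').text, student.find('marks').text)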

h) Define Data Discretization.

Sol:

Data discretization is the process of converting continuous data attributes into discrete
categories or intervals. It simplifies the data, making it more manageable for analysis and
interpretation. Techniques for data discretization include binning, clustering, and decision
tree methods.
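
As a rough sketch (hypothetical scores), pandas offers both equal-width and equal-frequency binning:

import pandas as pd

scores = pd.Series([35, 42, 58, 63, 71, 77, 85, 90, 95])

# Equal-width binning: the value range is split into fixed-size intervals
equal_width = pd.cut(scores, bins=3, labels=['low', 'medium', 'high'])

# Equal-frequency binning: each bin holds roughly the same number of points
equal_freq = pd.qcut(scores, q=3, labels=['low', 'medium', 'high'])

print(pd.DataFrame({'score': scores, 'equal_width': equal_width, 'equal_freq': equal_freq}))
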
i) What is Tag Cloud?

Sol:

A tag cloud, also known as a word cloud, is a visual representation of text data where
individual words are displayed in varying sizes based on their frequency or importance. It
helps quickly identify the most prominent terms in a dataset, making it useful for text
analysis and summarization.
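
A minimal sketch, assuming the third-party wordcloud package is installed alongside Matplotlib:

import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package, assumed installed

text = "data science data analysis machine learning data visualization data"

# Word size reflects how often each word occurs in the text
wc = WordCloud(width=400, height=200, background_color='white').generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()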

j) What is Visual Encoding?

Sol:

Visual encoding refers to the process of representing data through visual elements like
position, size, shape, color, and texture. This helps in conveying information effectively and
enables users to understand and interpret data patterns, trends, and relationships through
visual representations.
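
A small Matplotlib sketch (made-up values) in which each point encodes four variables at once through position, size, and color:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]               # position on the x-axis
y = [10, 20, 15, 30]           # position on the y-axis
sizes = [50, 120, 200, 300]    # marker size encodes a third variable
colors = [0.1, 0.4, 0.6, 0.9]  # marker color encodes a fourth variable

plt.scatter(x, y, s=sizes, c=colors, cmap='viridis')
plt.colorbar(label='fourth variable')
plt.title('Visual encoding: position, size and color')
plt.show()
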
Q2) Attempt any four of the following:

a) Explain different applications of Data Science.

Sol:

Data Science has a wide range of applications across various fields:

1. Healthcare:

o Predictive Analytics: Predict patient outcomes, identify high-risk patients, and
improve diagnosis and treatment plans.

o Drug Discovery: Analyze clinical trial data to discover new drugs and predict their
effectiveness.

2. Finance:

o Risk Management: Identify and mitigate financial risks through predictive modeling
and anomaly detection.

o Fraud Detection: Detect fraudulent transactions and activities using machine learning
algorithms.

3. Retail:

o Customer Segmentation: Segment customers based on purchasing behavior to tailor
marketing strategies.

o Inventory Management: Optimize inventory levels using demand forecasting models.

4. Marketing:

o Personalized Marketing: Create targeted marketing campaigns based on customer
data and preferences.

o Sentiment Analysis: Analyze social media and customer reviews to gauge public
sentiment about products and brands.
5. Transportation:

o Route Optimization: Improve logistics and supply chain efficiency by optimizing
routes and delivery schedules.

o Autonomous Vehicles: Develop and enhance self-driving technologies using sensor
data and machine learning.

b) Explain Null and Alternate Hypothesis.

Sol:

• Null Hypothesis (H0): The null hypothesis is a statement that there is no effect or no
difference, and it serves as the default assumption in hypothesis testing. It is a
baseline that the researcher aims to test against.

o Example: In a clinical trial, the null hypothesis might state that a new drug has
no effect on patients compared to a placebo.

• Alternate Hypothesis (H1 or Ha): The alternate hypothesis is a statement that
indicates the presence of an effect or a difference. It represents the outcome that the
researcher aims to support.

o Example: In a clinical trial, the alternate hypothesis might state that the new
drug has a positive effect on patients compared to a placebo.

Hypothesis testing involves determining whether there is enough evidence to reject the null
hypothesis in favor of the alternate hypothesis.
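
One common way to carry out such a test in Python is an independent two-sample t-test from SciPy; the sketch below uses made-up outcome scores purely for illustration:

from scipy import stats

# Hypothetical outcome scores for a placebo group and a drug group
placebo = [50, 52, 48, 51, 49, 50]
drug = [55, 57, 54, 58, 56, 55]

# Two-sample t-test: H0 states that the group means are equal
t_stat, p_value = stats.ttest_ind(drug, placebo)

# Reject H0 at the 5% significance level if p < 0.05
print("t =", t_stat, "p =", p_value)
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
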
c) What do you mean by Noisy Data? Explain any two causes of noisy data.

Sol:

Noisy data refers to data that contains errors, outliers, or irrelevant information that can
obscure the true signal or pattern in the dataset. Noisy data can negatively impact the
quality of analysis and model performance.

Causes of Noisy Data:

1. Data Entry Errors:

o Manual data entry by humans is prone to errors, such as typos, incorrect values, and
formatting inconsistencies. These errors introduce noise into the dataset.

2. Sensor Malfunctions:

o In sensor-based data collection, malfunctions or inaccuracies in sensors can
produce erroneous readings, leading to noisy data. For example, a faulty temperature
sensor may record incorrect values.

d) What do you mean by Data Visualization? Give examples of any two data
visualization libraries.

Sol:

Data visualization is the graphical representation of data using visual elements such as
charts, graphs, and maps. It enables users to understand and interpret complex data by
presenting it in a visual context, making it easier to identify patterns, trends, and insights.

Examples of Data Visualization Libraries:

1. Matplotlib (Python):

o A widely-used plotting library in Python that provides a variety of plotting functions to
create static, interactive, and animated visualizations.

o Example: Creating line charts, bar charts, scatter plots, and histograms.

2. ggplot2 (R):

o A data visualization package in R based on the Grammar of Graphics. It provides a
coherent system for creating complex and customizable visualizations.

o Example: Creating multi-faceted plots, box plots, and density plots.
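
A minimal Matplotlib sketch (hypothetical monthly sales figures) showing the kind of chart described above:

import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 150, 90, 180]

plt.bar(months, sales, color='steelblue')  # bar chart of sales per month
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')
plt.show()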


e) Explain 3V's of Data Science.

Sol:

1. Volume:

o The volume characteristic of data refers to the sheer amount of data generated and
stored in data systems. It signifies the massive quantities of data that organizations
collect, process, and analyze.

o Example: Social media platforms generating terabytes of user data every day.

2. Velocity:

o Velocity refers to the speed at which data is generated, processed, and analyzed. It
emphasizes the need for real-time or near real-time data processing to gain timely
insights.

o Example: Stock market data where prices and trades are updated every second.

3. Variety:

o Variety refers to the different types and sources of data. It includes structured data
(databases), semi-structured data (XML, JSON), and unstructured data (text,
images, videos).

o Example: Data from social media posts, emails, transaction records, and
multimedia content.
Q3) Attempt any two of the following:

a) Explain data cube aggregation method in context of data reduction.

Sol:

Data cube aggregation is a technique used in data warehousing and Online Analytical
Processing (OLAP) to reduce the volume of data by summarizing and aggregating it across
multiple dimensions. This method helps in improving query performance and making the
data more manageable for analysis.

Key Concepts:

• Dimensions:
o Dimensions are the perspectives or entities with respect to which data is organized,
such as time, location, and product.

• Measures:
o Measures are the numerical values that are analyzed, such as sales, revenue, and
profit.

• Aggregation Operations:
o Common aggregation operations include sum, average, count, min, max, etc.

Steps in Data Cube Aggregation:

1. Data Loading:
o Raw data is loaded into the data warehouse from various sources

2. Data Preprocessing:
o Data is cleaned and transformed to ensure quality and consistency

3. Cube Creation:
o The data cube is created by defining the dimensions and measures. Each cell in the
cube represents an aggregated value for a specific combination of dimension
values.
4. Data Aggregation:
o Aggregation operations are performed to compute the values for each cell in the
data cube. For example, summing sales for each product category over different
time periods.

Example: Consider a sales data warehouse with the following dimensions: Time (Year,
Quarter, Month), Product (Category, Sub-Category), and Location (Country, State, City). The
measure is Sales.

A data cube for this scenario would look like:

| Time | Product | Location | Sales |
|----------------------|------------------------|----------------------|------------------|
| Year, Quarter, Month | Category, Sub-Category | Country, State, City | Aggregated Sales |
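
The same aggregation idea can be approximated in pandas with a pivot table; the sketch below uses made-up transactional data and is not a full OLAP cube:

import pandas as pd

# Hypothetical transactional sales records
sales = pd.DataFrame({
    'Year':     [2023, 2023, 2024, 2024],
    'Category': ['Books', 'Toys', 'Books', 'Toys'],
    'Country':  ['India', 'India', 'USA', 'USA'],
    'Sales':    [100, 150, 200, 250],
})

# Aggregate the Sales measure across the Time and Product dimensions
cube = pd.pivot_table(sales, values='Sales', index='Year',
                      columns='Category', aggfunc='sum')
print(cube)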

Advantages:

• Improved Query Performance: Pre-aggregated data allows faster query responses.
• Flexible Analysis: Users can drill down or roll up to different levels of aggregation.
• Holistic View: Provides a comprehensive view of data across multiple dimensions.

Disadvantages:

• Storage Space: Requires significant storage space for large datasets.
• Complexity: Can be complex to design and maintain for very large datasets with
many dimensions.
b) What is mean, median, mode and range for the following list of values: 24, 29, 24, 25,
24, 27, 25, 32, 24

Sol:

• Mean (Average): Mean = ∑ values / number of values

Mean = (24 + 29 + 24 + 25 + 24 + 27 + 25 + 32 + 24) / 9 = 234 / 9 = 26

• Median: The median is the middle value when the numbers are arranged in
ascending order. For an odd number of values, it is the middle value.

Sorted Values: 24, 24, 24, 24, 25, 25, 27, 29, 32

The median is the 5th value:

Median = 25

• Mode: The mode is the value that appears most frequently.

Mode = 24 (appears 4 times)

• Range: The range is the difference between the maximum and minimum values.

Range = Max value − Min value = 32 – 24 = 8
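
These results can be verified with Python's standard statistics module:

import statistics

values = [24, 29, 24, 25, 24, 27, 25, 32, 24]

print("Mean:", statistics.mean(values))      # 26
print("Median:", statistics.median(values))  # 25
print("Mode:", statistics.mode(values))      # 24
print("Range:", max(values) - min(values))   # 8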


c) Explain any four data visualization tools.

Sol:

1. Tableau:

A powerful data visualization tool that allows users to create interactive and shareable
dashboards. Tableau provides various visualization options such as bar charts, line charts,
maps, and more. It is known for its user-friendly interface and ability to handle large
datasets efficiently.

2. Power BI:

A business analytics service by Microsoft that provides interactive visualizations and
business intelligence capabilities with a simple interface. Power BI allows users to connect
to various data sources, create reports, and share insights across the organization.

3. Matplotlib:

A plotting library for Python that provides a variety of plotting functions to create static,
interactive, and animated visualizations. It is widely used in scientific and engineering
communities for its flexibility and ability to produce publication-quality plots.

4. ggplot2:

A data visualization package in R based on the Grammar of Graphics. ggplot2 provides a
coherent system for creating complex and customizable visualizations. It is known for its
ability to create multi-faceted plots, box plots, density plots, and more.
Q4) Attempt any two of the following:

a) Differentiate between structured and unstructured data.

Sol:

| Feature | Structured Data | Unstructured Data |
|---------|-----------------|-------------------|
| Definition | Organized and formatted in a fixed schema or structure | Lacks a predefined format or organization |
| Storage | Stored in relational databases, spreadsheets, data warehouses | Stored in data lakes, NoSQL databases, file systems |
| Examples | Tables in databases, Excel sheets, SQL databases | Text documents, images, videos, emails, social media posts |
| Querying | Easily queried using SQL and other structured query languages | Requires more complex processing and analytics tools |
| Ease of Analysis | Easier to analyze due to well-defined structure | More challenging to analyze; requires advanced tools for processing |
| Data Types | Typically numeric or categorical | Can include text, multimedia, and mixed types |
| Data Size | Usually smaller and manageable | Often larger in size and complexity |
| Flexibility | Less flexible, as schema must be defined beforehand | More flexible, can accommodate various types of data without a predefined schema |
| Processing Tools | SQL, Excel, BI tools | Text mining, natural language processing (NLP), machine learning |
b) What do you mean by Data attributes? Explain types of attributes with example.

Sol:

Data attributes refer to the characteristics or properties of data that describe the elements
within a dataset. Attributes define what type of information is being stored and how it can
be used in analysis.

Types of Attributes:

1. Nominal Attribute:
• Categorical attributes without any natural ordering or ranking.
• Example: Colors (Red, Green, Blue), Gender (Male, Female).

2. Ordinal Attribute:
• Categorical attributes with a meaningful order or ranking.
• Example: Education Level (High School, Bachelor's, Master's, Ph.D.), Customer
Satisfaction (Low, Medium, High).

3. Interval Attribute:
• Numerical attributes where the difference between values is meaningful, but there
is no true zero point.
• Example: Temperature in Celsius or Fahrenheit, Dates (e.g., 2024, 2025).

4. Ratio Attribute:
• Numerical attributes with a meaningful zero point, allowing for comparison of
magnitudes.
• Example: Height, Weight, Age, Income.
c) How do you visualize geospatial data? Explain in detail

Sol:

Visualizing geospatial data involves representing geographical information in a visual
format to help analyze and interpret spatial patterns, relationships, and trends. This
visualization can be done using various tools and techniques, each suitable for different
types of geospatial data and analysis objectives. Here's a detailed explanation of how to
visualize geospatial data:

Key Techniques for Visualizing Geospatial Data

1. Choropleth Maps:

o Choropleth maps use different shades of colors to represent data values across
geographical regions, such as countries, states, or districts. The intensity of the
color indicates the magnitude of the variable being visualized.

o Use Case: Commonly used to show population density, election results, or any data
that can be aggregated by region.

o Example:

import geopandas as gpd

import matplotlib.pyplot as plt

# Load geospatial data

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Plot a choropleth map

world.plot(column='pop_est', cmap='OrRd', legend=True)

plt.show()

2. Heat Maps:

o Heat maps represent data density or intensity using color gradients, where higher
concentrations of data points are shown with more intense colors.
o Use Case: Useful for visualizing phenomena such as crime rates, traffic congestion,
or disease outbreaks.

o Example:

import folium

from folium.plugins import HeatMap

# Create a map centered at a location

m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)

# Add heat map layer with data points

data = [[40.7128, -74.0060], [40.7306, -73.9352]] # Example data points

HeatMap(data).add_to(m)

# Display the map

m.save("heatmap.html")

3. Scatter Plots:

o Scatter plots use geographical coordinates (latitude and longitude) to display
individual data points on a map. They help identify the distribution and clustering of
data points.

o Use Case: Suitable for visualizing locations of businesses, events, or any point-
based data.

o Example:

import matplotlib.pyplot as plt

# Example coordinates

lons = [-74.0060, -73.9352]

lats = [40.7128, 40.7306]


# Plot scatter plot of coordinates

plt.scatter(lons, lats)

plt.xlabel('Longitude')

plt.ylabel('Latitude')

plt.title('Geospatial Scatter Plot')

plt.show()

4. Interactive Maps:

o Interactive maps allow users to pan, zoom, and explore geographical data
dynamically. These maps can include layers, pop-ups, and other interactive
elements.

o Use Case: Ideal for detailed exploration of spatial data, such as real estate maps,
environmental monitoring, and urban planning.

o Example:

import folium

# Create a map centered at a location

m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)

# Add a marker

folium.Marker([40.7128, -74.0060], popup='New York City').add_to(m)

# Save and display the map

m.save("map.html")
Q5) Attempt any one of the following:

a) What do you mean by Data transformation? Explain strategies of data transformation.

Sol:

Data transformation is the process of converting data from one format or structure into
another. This process is essential for preparing data for analysis, ensuring consistency,
improving data quality, and making it more suitable for machine learning algorithms.

Strategies of Data Transformation:

1. Normalization:

o Normalization scales numeric data to a standard range, typically between 0 and 1.
This technique ensures that all features contribute equally to the analysis,
especially when they have different units or scales.

o Example:

from sklearn.preprocessing import MinMaxScaler

data = [[100], [200], [300], [400], [500]]

scaler = MinMaxScaler()

normalized_data = scaler.fit_transform(data)

print(normalized_data)

2. Standardization:

o Standardization involves scaling data to have a mean of 0 and a standard deviation
of 1. It is useful when the data has varying units or scales and follows a normal
distribution.

o Example:

from sklearn.preprocessing import StandardScaler

data = [[100], [200], [300], [400], [500]]

scaler = StandardScaler()

standardized_data = scaler.fit_transform(data)

print(standardized_data)
3. Log Transformation:

o Log transformation stabilizes variance and makes the data more normally
distributed. It is particularly useful for data that follows a skewed distribution.

o Example:

import numpy as np

data = [1, 10, 100, 1000, 10000]

log_transformed_data = np.log(data)

print(log_transformed_data)

4. Encoding Categorical Variables:

o Converting categorical data into numerical format. Techniques include one-hot
encoding and label encoding.

o Example (One-Hot Encoding):

import pandas as pd

data = {'color': ['red', 'blue', 'green']}

df = pd.DataFrame(data)

one_hot_encoded_data = pd.get_dummies(df)

print(one_hot_encoded_data)

5. Binning:

o Converting continuous data into categorical data by dividing the range of data into
intervals or bins.

o Example:

import pandas as pd

data = {'age': [23, 25, 30, 35, 40, 45, 50, 55, 60, 65]}

df = pd.DataFrame(data)
df['age_bin'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60, 70], labels=['20-30', '30-40', '40-50',
'50-60', '60-70'])

print(df)

b) What are the different methods for measuring the data dispersion?

Sol:

Data dispersion refers to the spread or variability of data points in a dataset. It provides
insights into how much the data points differ from the central tendency (mean, median,
mode).

Methods for Measuring Data Dispersion:

1. Range:

o The range is the difference between the maximum and minimum values in a
dataset.

o Formula: Range = Max Value − Min Value

• Example:

data = [10, 20, 30, 40, 50]

range_value = max(data) - min(data)

print("Range:", range_value)

2. Interquartile Range (IQR):

o The IQR measures the spread of the middle 50% of the data. It is the difference
between the third quartile (Q3) and the first quartile (Q1).

o Formula: IQR = Q3 − Q1

• Example:

import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]

Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)

IQR = Q3 - Q1

print("IQR:", IQR)

3. Variance:

o Variance measures the average squared deviation of each data point from the mean.
It quantifies the spread of the data points.

o Formula: Variance = Σ(Xᵢ − μ)² / N

• Example:

import numpy as np

data = [10, 20, 30, 40, 50]

variance = np.var(data)

print("Variance:", variance)

4. Standard Deviation:

o Standard deviation is the square root of variance. It represents the average
deviation of data points from the mean.

o Formula: Standard Deviation = √(Variance)

• Example:

import numpy as np

data = [10, 20, 30, 40, 50]

std_dev = np.std(data)

print("Standard Deviation:", std_dev)


5. Mean Absolute Deviation (MAD):

o MAD measures the average absolute deviation of each data point from the mean.

o Formula: MAD = Σ|Xᵢ − μ| / N

• Example:

import numpy as np

data = [10, 20, 30, 40, 50]

MAD = np.mean(np.abs(data - np.mean(data)))

print("MAD:", MAD)
