DS Unit 1
is a component of Spark that allows users to query structured data using SQL syntax and provides a programming interface for working with both structured and semi-structured data.
· Support for SQL queries: You can run SQL queries against data stored in various formats, such as Parquet, JSON, or Hive tables.

Spark Streaming
· Spark Streaming is an extension of Spark that enables real-time data processing. It processes data in small micro-batches and integrates seamlessly with Spark's other components.
· Real-time processing: Spark Streaming processes incoming data in real time.

GraphX
· GraphX is a Spark component for graph processing and analytics. It allows users to perform operations on graphs and run graph-parallel computations.
· Graph abstraction: It provides two key abstractions: the Graph and the Pregel API for graph-parallel computation.

Q.2 Explain CRISP-DM:
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is one of the most widely used methodologies for data mining and data science projects. It provides a structured approach for solving business problems through data mining techniques and is flexible enough to be applied across different industries and types of data. CRISP-DM is iterative.

Phases of CRISP-DM:

1. Business Understanding
• Objective: Define the project goals and understand the business context.
• Key Steps:
o Identify the problem or opportunity.
o Establish project objectives and success criteria.
• Outcome: A clear understanding of how the data science project aligns with business goals.

2. Data Understanding
• Objective: Gather and explore data to gain insights and identify data quality issues.
• Key Steps:
o Collect initial data.

[...] Steps:
o Compare model results against business success criteria.
o Validate models using testing data.
o Identify any gaps.
• Outcome: Recommendations for deploying the model or revisiting previous phases for improvements.

6. Deployment
• Objective: Implement the solution in the production environment for business use.
• Key Steps:
o Integrate the model into business systems.
o Document the solution for stakeholders.
o Monitor and maintain the system to ensure continuous performance.
• Outcome: A deployed solution delivering actionable insights or automating processes.

Q.5 Differences between a data lake and a data swamp
Data Quality: Data Lake: High-quality, structured data with governance. Data Swamp: Poor-quality, unstructured, and messy data.
Organization: Data Lake: Organized with metadata and indexing. Data Swamp: Disorganized with no clear structure.
Governance: Data Lake: Strong governance, metadata management, and security. Data Swamp: Lacks governance and metadata.
Data Accessibility: Data Lake: Easy to search, access, and analyze. Data Swamp: Hard to navigate and access.
Usage: Data Lake: Supports analytics, machine learning, and business intelligence. Data Swamp: Difficult to use for meaningful analysis.
Scalability: Data Lake: Scalable and optimized for large data volumes. Data Swamp: Can struggle with scalability due to poor structure.
Performance: Data Lake: High performance with proper optimization. Data Swamp: Poor performance due to disorganization.
Metadata: Data Lake: Rich metadata enables data discovery. Data Swamp: Lacks metadata, making data discovery difficult.
Maintenance Cost: Data Lake: High initial cost, but efficient long-term use