DS Unit 1

The document discusses data processing tools in data science, focusing on Apache Spark and its components such as Spark Core, Spark SQL, and Spark Streaming, which facilitate efficient data handling and analysis. It also outlines the CRISP-DM methodology for data mining, emphasizing its iterative phases from business understanding to deployment. Additionally, it compares data lakes and data swamps, highlighting differences in data quality, organization, governance, accessibility, and performance.

Q1. Explain any five data processing tools in data science.

Apache Spark: Apache Spark is an open-source, distributed computing system designed to process large datasets quickly. Spark is popular for its speed, ease of use, and versatility in handling both batch and real-time processing workloads.
• In-memory processing: Spark stores data in memory (RAM) for faster computation.

Spark Core: Spark Core is the foundational component of Apache Spark that provides essential functionalities for the entire Spark ecosystem, such as task scheduling, memory management, fault tolerance, and resource distribution.
• RDD (Resilient Distributed Datasets): The core data structure in Spark, RDDs are immutable collections of objects distributed across a cluster.
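A minimal RDD sketch in PySpark is shown below to make Spark Core's data structure concrete; the SparkSession setup and the sample numbers are illustrative assumptions, not part of the original notes.

# Minimal PySpark RDD sketch (assumes PySpark is installed; values are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD from an in-memory collection; RDDs are immutable and partitioned.
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Transformations such as map() are lazy; they only describe the computation.
squares = numbers.map(lambda x: x * x)

# Actions such as collect() and reduce() trigger the distributed computation.
print(squares.collect())                    # [1, 4, 9, 16, 25]
print(squares.reduce(lambda a, b: a + b))   # 55

spark.stop()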

Spark SQL: Spark SQL is a component of Spark that allows users to query structured data using SQL syntax and provides a programming interface for working with both structured and semi-structured data.
• Support for SQL queries: You can run SQL queries against data stored in various formats, such as Parquet, JSON, or Hive tables.
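As a rough illustration of querying semi-structured data with SQL syntax, the sketch below loads a JSON file and runs a query against it; the file name "people.json" and its columns are assumptions made for the example.

# Minimal Spark SQL sketch (file path and schema are illustrative assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Load semi-structured JSON into a DataFrame; Parquet or Hive tables work similarly.
people = spark.read.json("people.json")

# Expose the DataFrame as a temporary view so plain SQL can be run against it.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()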
Spark Streaming: Spark Streaming is an extension of Spark that enables real-time data processing. It processes data in small micro-batches and integrates seamlessly with Spark's other components.
• Real-time processing: Spark Streaming processes incoming data in real time.
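The sketch below illustrates micro-batch processing with the DStream API, counting words arriving on a TCP socket; the host, port, and one-second batch interval are illustrative assumptions.

# Minimal Spark Streaming sketch (host/port and batch interval are assumptions).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

# Read lines arriving on a TCP socket in real time.
lines = ssc.socketTextStream("localhost", 9999)

# Count words within each micro-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()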
GraphX: GraphX is a Spark component for graph processing and analytics. It allows users to perform operations on graphs and carry out graph-parallel computations.
• Graph abstraction: It provides two key abstractions: the Graph and the Pregel API for graph-parallel computation.

Q.2 Explain CRISP-DM:
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is one of the most widely used methodologies for data mining and data science projects. It provides a structured approach for solving business problems through data mining techniques and is flexible enough to be applied across different industries and types of data. CRISP-DM is iterative.

Phases of CRISP-DM:
1. Business Understanding
• Objective: Define the project goals and understand the business context.
• Key Steps:
o Identify the problem or opportunity.
o Establish project objectives and success criteria.
• Outcome: A clear understanding of how the data science project aligns with business goals.

2. Data Understanding
• Objective: Gather and explore data to gain insights and identify data quality issues.
• Key Steps:
o Collect initial data.
o Explore data properties.
o Assess data quality.
• Outcome: A refined dataset with initial insights and an understanding of its potential for analysis.

3. Data Preparation
• Objective: Prepare the data for modeling by cleaning, transforming, and organizing it.
• Key Steps:
o Handle missing values.
o Transform data into required formats.
o Select relevant features and create derived attributes.
• Outcome: A final dataset ready for modeling.
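As a concrete illustration of the Data Understanding and Data Preparation phases, the sketch below uses pandas (an assumption; the notes do not prescribe a library) to explore a small dataset, handle missing values, and create a derived attribute. The column names and values are made up for the example.

# CRISP-DM phases 2-3 sketch; the dataset and column names are illustrative.
import pandas as pd

# Collect initial data (an in-memory sample standing in for a real source).
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40000, 52000, 61000, None],
    "churned": [0, 1, 0, 1],
})

# Data Understanding: explore data properties and assess data quality.
print(df.describe())
print(df.isna().sum())          # count missing values per column

# Data Preparation: handle missing values and create a derived attribute.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["income_per_year_of_age"] = df["income"] / df["age"]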

4. Modeling
• Objective: Apply data mining techniques to extract patterns and insights.
• Key Steps:
o Select appropriate modeling techniques.
o Train and test models.
o Tune hyperparameters to optimize performance.
• Outcome: Trained models that meet the project objectives.

5. Evaluation
• Objective: Assess the models' performance and determine if they meet the business objectives.
• Key Steps:
o Compare model results against business success criteria.
o Validate models using testing data.
o Identify any gaps.
• Outcome: Recommendations for deploying the model or revisiting previous phases for improvements.
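The sketch below illustrates the Modeling and Evaluation phases with scikit-learn (an assumption; the notes do not name a library): a model is trained, a hyperparameter is tuned, and the result is validated on held-out testing data. The synthetic dataset and the accuracy target mentioned in the comment are illustrative.

# CRISP-DM phases 4-5 sketch; the synthetic dataset is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the prepared dataset from phase 3.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Modeling: select a technique, tune a hyperparameter, and train.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [50, 100]}, cv=3)
search.fit(X_train, y_train)

# Evaluation: validate on testing data and compare against the success criterion.
accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")   # e.g. compare against a business target such as 0.80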
6. Deployment
• Objective: Implement the solution in the production environment for business use.
• Key Steps:
o Integrate the model into business systems.
o Document the solution for stakeholders.
o Monitor and maintain the system to ensure continuous performance.
• Outcome: A deployed solution delivering actionable insights or automating processes.
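One common first step when integrating a model into business systems is persisting the trained model so the production system can reload and reuse it; the sketch below uses joblib for this (an assumption, since the notes do not specify a deployment mechanism), with a small stand-in model.

# Deployment sketch: persist a trained model and reload it in a serving system.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hand-off artifact: written once at training time, loaded by the production system.
joblib.dump(model, "model.joblib")

# Inside the business system: load the model and score new records.
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:5]))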
Q.5 Differences between Data Lake vs. Data Swamp
Data Quality: Data Lake: High-quality, structured data with governance. Data Swamp: Poor-quality, unstructured, and messy data.
Organization: Data Lake: Organized with metadata and indexing. Data Swamp: Disorganized with no clear structure.
Governance: Data Lake: Strong governance, metadata management, and security. Data Swamp: Lacks governance and metadata.
Data Accessibility: Data Lake: Easy to search, access, and analyze. Data Swamp: Hard to navigate and access.
Usage: Data Lake: Supports analytics, machine learning, and business intelligence. Data Swamp: Difficult to use for meaningful analysis.
Scalability: Data Lake: Scalable and optimized for large data volumes. Data Swamp: Can struggle with scalability due to poor structure.
Performance: Data Lake: High performance with proper optimization. Data Swamp: Poor performance due to disorganization.
Metadata: Data Lake: Rich metadata enables data discovery. Data Swamp: Lacks metadata, making data discovery difficult.
Maintenance Cost: Data Lake: High initial cost, but efficient long-term use.
