Data Engineering - Session 01

Course Curriculum

• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering,
Modern Data Applications & Patterns, Data
Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data stores: SQL, NoSQL, File Systems,
Data Lakes, Data Warehouses, Data Mesh Cloud Data
Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines,
Transformation, Orchestration, Data Aggregation vs
Federation, Lab Demos of select Data Pipeline
Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality,
Lineage, Provenance, Data Security, Regulatory
Compliance, Real-World Application Data Design
• Tutorials
Enterprise Data

Enterprise Data refers to the collection of structured, semi-structured, and unstructured data that is generated, collected, and utilized across an entire organization to support business operations, decision-making, and strategic planning.
Source-Driven Enterprise Data Classification

Internal Enterprise Data
• Operational data: Transaction data about sales and purchases; inventory data regarding raw materials, finished goods and inventory levels; and financial data regarding revenues, costs and profits.
• Human resources data: Employee information including profiles, performance, payroll, attendance and training records.
• Internal infrastructure data: Details on the company's physical assets, properties and IT infrastructure.
• Communication data: Records of internal emails, notes, memos and minutes of meetings.
• Research and development data: Information from research projects, product development stages and testing results.
• Customer data: Contact details, demographic information, purchase history, loyalty program data, customer service feedback and other interactions.

External Enterprise Data
• Market data: Competitor analysis; data on product offerings, pricing and marketing campaigns; and trend analysis of the market, emerging sectors and negative issues.
• Customer data: Social media postings, user reviews and other data generated by customer inputs.
• Economic indicators: Data related to the economy, both domestic and global.
• Regulatory and compliance data: Data relevant to industry regulations, standards and compliance requirements.
• Environmental data: Data concerning environmental conditions pertinent to certain industries and activities.
• Third-party data: Data from, for example, analysts or market research firms.
• Social and news media feeds: Data from news channels, blogs, forums and social media platforms.
• Demographic and geographic data: Information on populations, age groups, cultural nuances and geographic distribution, which is essential for market segmentation and targeting.
Behaviour-Driven Enterprise Data Classification

Data-Sensitivity-Driven Enterprise Data Classification
• Public Data
  Description: Non-sensitive data intended for public use.
  Examples: Company brochures, publicly available reports, press releases.
  Access: Open to everyone, no restrictions.
• Internal Data
  Description: Data used within the organization but not intended for public access.
  Examples: Internal emails, internal process documents, and employee handbooks.
  Access: Accessible to all employees; minimal security controls.
• Confidential Data
  Description: Sensitive data that could harm the organization if exposed.
  Examples: Business strategies, customer data, internal financial reports.
  Access: Restricted to authorized employees; requires moderate security measures.
• Restricted/Sensitive Data
  Description: Highly sensitive data that, if compromised, could result in severe damage to the organization.
  Examples: Personally Identifiable Information (PII), trade secrets, financial transactions, intellectual property.
  Access: Strictly limited to a few authorized personnel; highest level of security controls.
• Regulated Data
  Description: Data governed by specific laws, regulations, or industry standards that mandate how it should be handled, stored, and protected.
  Examples: Financial records (SOX), healthcare information (HIPAA), and credit card details (PCI-DSS).
  Requirements: Strict access controls, encryption, regular audits, and compliance reporting.
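The five-tier scheme above can be made operational in code. The sketch below maps each classification level to a set of handling controls; the specific control values (encryption flags, access groups) are illustrative assumptions, not an official standard.

```python
# Hypothetical mapping from classification level to handling controls.
# The five levels follow the scheme above; the control values are
# illustrative assumptions only.
HANDLING_POLICY = {
    "public":       {"encrypt_at_rest": False, "access": "everyone"},
    "internal":     {"encrypt_at_rest": False, "access": "all_employees"},
    "confidential": {"encrypt_at_rest": True,  "access": "authorized_employees"},
    "restricted":   {"encrypt_at_rest": True,  "access": "named_individuals"},
    "regulated":    {"encrypt_at_rest": True,  "access": "named_individuals",
                     "audit_log": True},
}

def controls_for(level: str) -> dict:
    """Return the handling controls for a classification level."""
    try:
        return HANDLING_POLICY[level.lower()]
    except KeyError:
        # Fail closed: an unknown label gets the strictest treatment.
        return HANDLING_POLICY["regulated"]
```

Note the fail-closed default: a record with a missing or unrecognized label is treated as regulated rather than public.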
Key Data Compliance
Frameworks & Regulations
• GDPR (General Data Protection Regulation): Governs data privacy in the EU.
• CCPA (California Consumer Privacy Act): Protects consumer data in California.
• HIPAA (Health Insurance Portability and Accountability Act): Regulates healthcare data in the US.
• PCI-DSS (Payment Card Industry Data Security Standard): Ensures secure handling of credit card
information.
• SOX (Sarbanes-Oxley Act): Mandates financial data integrity for publicly traded companies.
Analyzing Key Enterprise Data Characteristics
Non-Functional Requirements

• Volumes and Scale (incremental) requirements
• Historical data retention requirements
• Computation requirements
• Data Security requirements
• Data Audit requirements
• Data Lineage requirements
• Data Quality requirements
• Data lifecycle management requirements
• Regulatory requirements
• Data Access Patterns (Data Consumers)
• Data Producers (ETL requirements)
• Data Archival & Purging requirements
Let’s review real-world data requirements!
Artifact 01
Introduction to
Data Engineering
Practice of designing, building, and
maintaining the infrastructure, architecture,
and processes that enable the collection,
storage, transformation, and delivery of data
across an organization. It involves creating
data pipelines and workflows that allow raw
data from various sources to be processed
into structured, usable formats for analytics,
reporting, and machine learning.

Importance
Data engineering provides the foundation for data-driven decision-making by ensuring that data is accessible,
reliable, and ready for use by data scientists, analysts, and business stakeholders.
Data Engineering: The
Backbone of Modern
Enterprises

Data engineering is the essential bridge


between raw data and actionable insights.
It empowers organizations to leverage
their data effectively, drive innovation, and
gain a competitive edge in the market.

• Data as a Strategic Asset: In today's digital age, data is the new oil. Effective data engineering is crucial for extracting value from this valuable resource.
• Enabling Data-Driven Decisions: Data engineers build the infrastructure that empowers organizations to make informed decisions across various departments, from marketing to operations.
• Driving Innovation: By providing clean, accessible, and reliable data, data engineering fuels innovation and competitive advantage.
• Overcoming Data Challenges: Data engineers address common challenges such as data quality, scalability, security, and compliance to ensure data integrity and trustworthiness.
• Foundation for Advanced Analytics: Data engineering provides the foundation for advanced analytics techniques like machine learning and artificial intelligence, enabling organizations to uncover hidden insights and predict future trends.
Traditional Applications vs Modern Data-Driven Applications

Advanced Data Patterns

Historically, we had only two kinds of data models: RDBMS and Data Warehouses.

• The RDBMS served as the Operational Data Store, holding the most current data, which is subject to the most change (Create, Update, Delete). These are typically centralized, shared-everything, row-store databases.
• The Data Warehouse served the purpose of data marts, holding years of historical data that changed less often but was used heavily for views and business intelligence. These are typically column stores.

These systems scale vertically only, and processing was mostly offline, batch, or intraday driven.

How do these approaches address today's changing requirements of scale, performance, consistency, availability, etc.?

Time: 3 mins
The Paradigm Shift

• Horizontal vs Vertical Scalability for Data
• Centralized vs Distributed vs Decentralized Data
• Traditional structured stores vs NoSQLs
• Data Aggregation vs Data Federation/Virtualization
• Standard ETLs vs Versatile Data Integration
• Polyglot & Tiered Databases (Fit-for-purpose)
• Data Lakes: Where is my Enterprise Data?
• Bridging Unstructured & Structured Data with AI techniques
• Data-as-a-Service
• What about Data Governance?
Pattern 1: Vertical vs Horizontal Scaling

• Scaling up (vertical scaling) involves adding more resources to a single unit, such as a server or application pod, by adding more memory, CPU, or disk capacity. Scaling up can be easier to manage and more cost-effective because you only need to manage one larger server. However, it is limited by the maximum capacity of a single machine.
• Scaling out (horizontal scaling) involves adding more servers or systems to distribute the workload across multiple machines. This can improve performance and redundancy by using a network of systems. Scaling out is a good choice when a workload needs to be distributed across many nodes, such as when large numbers of independent requests must be served in parallel.
Pattern 2: Centralized vs Distributed vs Decentralized Data Storage Patterns

• Centralized databases have a single point of access, which makes them easier to secure, monitor, and control, and can help reduce errors. However, they may not perform well as server load increases. They are best suited for organizations with specialized roles and standardized procedures.
• Distributed databases spread data across multiple nodes, which can increase reliability and speed. They are designed to scale horizontally by adding more nodes to the network, allowing for increased storage capacity and processing power, and are ideal for applications that require large amounts of data to be processed quickly and efficiently. However, securing data becomes more difficult, as each node needs to be protected.
• Decentralized databases have no single controlling authority and are better suited for organizations with a generalist workforce that performs complex tasks.
• SQLs
• NoSQLs
• File stores
• Distributed Caches
• Storage & Compute

Examples of a few NoSQL, File and Distributed Cache Databases
Polyglot Persistence

Polyglot persistence involves using different data storage technologies to support the unique needs of different types of data within an application.

Benefits
• Flexibility: Different types of data can be adapted to specific requirements.
• Scalability: Different database technologies can handle different scaling requirements.
• Loosely coupled services: Each service can use a different type of database than other services.
• Avoiding monolith applications: Different services can use different data stores to avoid a single database failure taking down the entire business.

E-Commerce Example
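A minimal sketch of the e-commerce case: durable order records go to a relational store while short-lived session state lives in a key-value cache. Here `sqlite3` and a plain dict stand in for, say, PostgreSQL and Redis; the schema and function names are illustrative assumptions.

```python
import sqlite3

# Polyglot persistence sketch: two stores, each fit for purpose.
orders = sqlite3.connect(":memory:")   # relational: transactional order data
orders.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")

session_cache = {}                     # key-value: ephemeral session state

def place_order(order_id, sku, qty, session_id):
    # The durable fact goes to the relational store...
    orders.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, sku, qty))
    orders.commit()
    # ...while the session pointer lives in the fast, disposable cache.
    session_cache[session_id] = {"last_order": order_id}

place_order(1, "SKU-123", 2, "sess-abc")
```

The design point: losing the cache costs only convenience, while the relational store keeps the transactional record, so each store's durability and scaling profile matches the data it holds.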
Multi-Model Databases

A multi-model database is a database system that can store and process different types of data in a single database, instead of requiring multiple specialized databases. This flexibility allows businesses to handle a variety of data types without reformatting data or switching databases. Multi-model databases can be a good fit for dynamic environments where business needs are constantly changing.

Benefits
• Flexibility: By supporting multiple data models, such as relational, document, graph, key-value, and columnar, they allow for diverse data representation within a single database.
• Unified Query Language: A unified query language to access and manipulate different data models, simplifying interactions and application development.
• Data Consistency: Maintaining consistency across multiple single-model databases can be challenging. Multi-model databases can ensure ACID (Atomicity, Consistency, Isolation, Durability) properties across various data models.
• Reduced Data Redundancy: By centralizing various data models, multi-model databases can reduce the redundancy that might arise from storing similar data in multiple databases.
• Simplified Architecture: Reduces the need to manage and integrate multiple standalone databases, leading to a more simplified architecture.
• Reduced Complexity: Multi-model databases can simplify data integration between different parts of an application or between different applications, as the data is stored in a common database.
• Cost Effective: Operating and maintaining multiple database systems can be expensive. By consolidating these into one multi-model database, organizations can realize significant operational and infrastructure savings.
A few examples of Multi-Model Databases
Data Duplication vs Deduplication

Data duplication: The same data is stored in multiple locations within a system or across different systems. This can lead to inconsistencies, inefficiencies, and increased storage costs.

Data deduplication: The process of identifying and removing duplicate data. It involves comparing data blocks or chunks to find identical copies and then storing only a single instance, while maintaining references to the duplicates. This can significantly reduce storage requirements and improve data management efficiency.

Benefits of Data Deduplication
• Reduced storage costs: By eliminating duplicate data, organizations can significantly reduce their storage requirements and associated costs.
• Improved performance: Deduplication can improve system performance by reducing the amount of data that needs to be processed and stored.
• Enhanced data integrity: Deduplication helps ensure data consistency and accuracy by eliminating conflicting copies.
• Increased data availability: Deduplicated data can be more easily accessed and retrieved, improving data availability and reducing downtime.
Key Deduplication Methods

Source: Oracle
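The block-level method described above can be sketched in a few lines: split data into chunks, store each unique chunk once under its content hash, and keep a list of hash references per object. Fixed-size chunking is the simplest variant; real systems often use variable-size (content-defined) chunking instead. The chunk size here is artificially small for illustration.

```python
import hashlib

CHUNK_SIZE = 4            # artificially small, for illustration
store = {}                # hash -> chunk bytes; each unique chunk stored once

def dedup_write(data: bytes) -> list[str]:
    """Store data chunk-by-chunk; return hash references for reassembly."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # duplicate chunks are not stored again
        refs.append(h)
    return refs

def dedup_read(refs: list[str]) -> bytes:
    """Reassemble the original data from its chunk references."""
    return b"".join(store[h] for h in refs)
```

Writing `b"ABCDABCDABCD"` produces three references but only one stored chunk, which is exactly the storage saving the benefits list describes.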
Data Aggregation
Data aggregation is the process of collecting and organizing data
from multiple sources into a single format for analysis and
decision-making. It can be applied at any scale, from pivot
tables to data lakes.

Key Areas of Focus


Improved Efficiency and Data Quality
Better Decision-Making
Integrating Different Types of Data
Producing Quality Results
Ensuring Legal, Regulatory, and Privacy Compliance

Source: brightdata.com
Data Federation

• It is a technique that integrates data from multiple sources by executing queries across the sources and combining the results. Data federation uses a federated query engine that translates the user or application queries into subqueries that are sent to the source systems and then merges the subquery results into a final output. Data federation allows for flexible and scalable data integration, preserves the autonomy and security of the source systems, and supports complex queries and transformations.

Data Virtualization

• It is a technique that creates a unified view of data from multiple sources without physically moving or copying the data. Data virtualization uses a middleware layer that connects to the source systems and provides a virtual schema that can be queried by the users or applications. Data virtualization enables real-time access to the latest data, reduces data duplication and storage costs, and simplifies data management and governance.
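The federation pattern can be sketched with two in-memory SQLite databases standing in for independent source systems: the "engine" pushes a subquery to each source and merges the results, without ever copying data into a central store. The schemas and data are illustrative assumptions.

```python
import sqlite3

# Source system 1: a hypothetical CRM holding customer names.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Source system 2: a hypothetical billing system holding invoices.
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, total REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 99.0), (1, 1.0), (2, 40.0)])

def federated_totals():
    """Push one subquery to each source, then join the merged results."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = dict(billing.execute(
        "SELECT customer_id, SUM(total) FROM invoices GROUP BY customer_id"))
    return {names[cid]: amount for cid, amount in totals.items()}
```

Note that the aggregation (SUM) is pushed down to the billing source; only the small subquery results cross the wire, which is the key efficiency argument for federation over bulk copying.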
Data Engineering Components
Data Ingestion
• Definition: The process of collecting and importing data from various sources into a centralized data repository.
• Sources: Databases, APIs, IoT devices, social media, cloud storage.
• Tools: Apache NiFi, Apache Kafka, Flume, AWS Glue.
• Considerations: Scalability, data format (structured, semi-structured, unstructured), security.

Data Storage
• Definition: Storing data in a way that is accessible, scalable, and secure for future processing and analysis.
• Types: Data Warehouses (e.g., Amazon Redshift, Google BigQuery), Data Lakes (e.g., Azure Data Lake, AWS S3), NoSQL Databases (e.g., MongoDB, Cassandra).

Data Processing & Transformation
• Definition: Converting raw data into a structured, analyzable format by cleaning, transforming, and enriching it.
• Techniques: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform).
• Tools: Apache Spark, Apache Beam, Talend, Dataflow, Databricks.

Data Integration
• Definition: Combining data from different sources into a unified view to create a complete, consistent data set.
• Methods: Batch Processing, Real-Time Streaming, Data Virtualization.
• Tools: Apache Camel, MuleSoft, Informatica, Stitch.

Data Quality & Governance
• Data Quality: Ensures that data is accurate, consistent, complete, and reliable.
• Data Governance: Establishes policies, standards, and procedures for data management across the organization.
• Tools: Great Expectations, Talend Data Quality, Collibra, Alation.

Data Orchestration
• Definition: Coordinating and managing the execution of data workflows, ensuring that data moves through the pipelines smoothly and efficiently.
• Tools: Apache Airflow, Luigi, Prefect, Control-M.

Data Security & Compliance
• Definition: Protecting data from unauthorized access and breaches, and ensuring compliance with regulations (e.g., GDPR, CCPA).
• Components: Data encryption, access controls, data masking, and auditing.
• Tools: AWS Identity and Access Management (IAM), Azure Active Directory, Vault by HashiCorp.

Data Monitoring & Observability
• Definition: Tracking data flow, identifying issues, and ensuring data pipelines function correctly.
• Tools: Prometheus, Grafana, Splunk, Datadog.

Data Access & Delivery
• Definition: Providing data in a consumable format to end-users, analysts, data scientists, and applications.
• Methods: APIs, SQL queries, dashboards, data exports.
• Tools: Looker, Tableau, Power BI, Jupyter Notebooks.
Q&A
