Data Engineering - Session 01
• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering,
Modern Data Applications & Patterns, Data
Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data stores: SQL, NoSQL, File Systems,
Data Lakes, Data Warehouses, Data Mesh, Cloud Data
Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines,
Transformation, Orchestration, Data Aggregation vs
Federation, Lab Demos of select Data Pipeline
Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality,
Lineage, Provenance, Data Security, Regulatory
Compliance, Real-World Application Data Design
• Tutorials
Enterprise Data
• Enterprise Data: records of internal emails, notes, and memos; communication data and minutes of meetings.
• External Enterprise Data: global economic indicators, and compliance requirements pertinent to certain industries and activities.
• Public data: Access open to everyone, no restrictions.
• Internal data: Accessible to all employees; minimal security controls. Examples: employee handbooks, reports.
• Confidential data: Restricted to authorized employees; requires moderate security measures. Examples: Personally Identifiable Information (PII), trade secrets, financial transactions, intellectual property.
• Regulated data: Restricted to authorized personnel; highest level of security controls. Examples: financial records (SOX), healthcare information (HIPAA), and credit card details (PCI-DSS). Requirements: strict access controls, encryption, regular audits, and compliance reporting.
Key Data Compliance
Frameworks & Regulations
• GDPR (General Data Protection Regulation): Governs data privacy in the EU.
• CCPA (California Consumer Privacy Act): Protects consumer data in California.
• HIPAA (Health Insurance Portability and Accountability Act): Regulates healthcare data in the US.
• PCI-DSS (Payment Card Industry Data Security Standard): Ensures secure handling of credit card
information.
• SOX (Sarbanes-Oxley Act): Mandates financial data integrity for publicly traded companies.
Analyzing Key Enterprise Data Characteristics
Non-Functional Requirements
• Volumes and scale (incremental)
• Computation requirements
• Data security requirements
• Historical data retention requirements
• Data lifecycle management requirements
• Data audit requirements
• Data lineage requirements
• Data quality requirements
Importance
Data engineering provides the foundation for data-driven decision-making by ensuring that data is accessible,
reliable, and ready for use by data scientists, analysts, and business stakeholders.
Data Engineering: The Backbone of Modern Enterprises
• Data as a Strategic Asset: In today's digital age, data is the new oil. Effective data engineering is crucial for extracting value from this valuable resource.
• Enabling Data-Driven Decisions: Data engineers build the infrastructure that empowers organizations to make informed decisions across various departments, from marketing to operations.
• Driving Innovation: By providing clean, accessible, and reliable data, data engineering fuels innovation and competitive advantage.
• Overcoming Data Challenges: Data engineers address common challenges such as data quality, scalability, security, and compliance to ensure data integrity and trustworthiness.
• Foundation for Advanced Analytics: Data engineering provides the foundation for advanced analytics techniques like machine learning and artificial intelligence, enabling organizations to uncover hidden insights and predict future trends.
Traditional Applications vs Modern Data-Driven
Applications
How do these considerations address today’s changing requirements of scale, performance, consistency, availability, etc.?
Time: 3 mins
The Paradigm Shift
File Stores
Examples of a few NoSQL, file, and distributed cache databases
Polyglot Persistence
Polyglot persistence involves using different data storage technologies to support the unique needs of different types of data within an application.
Source: Oracle
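As a minimal, illustrative sketch of polyglot persistence (SQLite and an in-memory dict stand in here for a relational database and a key-value cache such as Redis; the table and session names are hypothetical), each kind of data lives in the store best suited to its access pattern:

```python
import sqlite3

# Relational store: durable, transactional order records.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
orders_db.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("acme", 199.99))
orders_db.commit()

# Key-value store: low-latency, short-lived session data
# (a plain dict stands in for a distributed cache).
session_cache = {}
session_cache["session:42"] = {"user": "acme", "cart_items": 3}

# Each read goes to the technology chosen for that data type.
print(orders_db.execute("SELECT customer, total FROM orders").fetchall())
print(session_cache["session:42"])
```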
Data Aggregation
Data aggregation is the process of collecting and organizing data
from multiple sources into a single format for analysis and
decision-making. It can be applied at any scale, from pivot
tables to data lakes.
Source: brightdata.com
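A small pandas sketch of this idea (the two source extracts and column names are made up for illustration): data from multiple sources is collected into one format and summarized, pivot-table style.

```python
import pandas as pd

# Two hypothetical source extracts with the same schema.
web_sales = pd.DataFrame({"region": ["NA", "EU"], "channel": "web", "revenue": [1200.0, 800.0]})
store_sales = pd.DataFrame({"region": ["NA", "EU"], "channel": "store", "revenue": [950.0, 1100.0]})

# Collect into a single format, then aggregate for analysis.
combined = pd.concat([web_sales, store_sales], ignore_index=True)
summary = combined.pivot_table(index="region", columns="channel", values="revenue", aggfunc="sum")
print(summary)
```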
Data Federation
Data Virtualization
Data pipeline stages: Data Ingestion, Data Storage, Data Integration, Data Processing - Transformation, Data Quality - Governance
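The following is a hedged, end-to-end sketch of these stages in plain Python (the CSV sample, function names, and use of SQLite in place of a real warehouse are all assumptions for illustration):

```python
import csv, io, sqlite3

def ingest(raw_csv: str) -> list[dict]:
    """Ingestion: read records from a source system (CSV text here)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Processing / transformation: normalize types and clean fields."""
    return [{"customer": r["customer"].strip().lower(), "amount": float(r["amount"])} for r in rows]

def quality_check(rows: list[dict]) -> list[dict]:
    """Quality / governance: drop records that violate basic rules."""
    return [r for r in rows if r["amount"] >= 0 and r["customer"]]

def store(rows: list[dict]) -> sqlite3.Connection:
    """Storage: persist curated records (SQLite stands in for a warehouse)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (:customer, :amount)", rows)
    return db

raw = "customer,amount\n ACME ,120.5\nGlobex,-1\n"
db = store(quality_check(transform(ingest(raw))))
print(db.execute("SELECT * FROM sales").fetchall())  # negative-amount row rejected
```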
Orchestration
• Definition: Coordinating and managing the execution of data workflows, ensuring that data moves through the pipelines smoothly and efficiently.
• Tools: Apache Airflow, Luigi, Prefect, Control-M.
Data Security
• Definition: Protecting data from unauthorized access and breaches, and ensuring compliance with regulations (e.g., GDPR, CCPA).
• Components: Data encryption, access controls, data masking, and auditing.
• Tools: AWS Identity and Access Management (IAM), Azure Active Directory, Vault by HashiCorp.
Monitoring
• Definition: Tracking data flow, identifying issues, and ensuring data pipelines function correctly.
• Tools: Prometheus, Grafana, Splunk, Datadog.
Data Serving
• Definition: Providing data in a consumable format to end-users, analysts, data scientists, and applications.
• Methods: APIs, SQL queries, dashboards, data exports.
• Tools: Looker, Tableau, Power BI, Jupyter Notebooks.
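As an illustration of the orchestration layer, here is a minimal sketch using the Apache Airflow 2.4+ TaskFlow API; the DAG name, schedule, and stub tasks are assumptions, not a prescribed pipeline:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_sales_pipeline():
    @task
    def extract():
        # Pull raw records from a source system (stubbed here).
        return [{"customer": "acme", "amount": 120.5}]

    @task
    def transform(rows):
        # Normalize the extracted records.
        return [{**r, "amount": round(r["amount"], 2)} for r in rows]

    @task
    def load(rows):
        # Write to the target store (stubbed as a log line).
        print(f"loading {len(rows)} rows into the warehouse")

    # TaskFlow infers the extract -> transform -> load dependencies.
    load(transform(extract()))

daily_sales_pipeline()
```

The scheduler runs this DAG once per day, retrying and tracking each task so data moves through the pipeline without manual coordination.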