Fundamentals of Data Engineering Index

Copyright
© All Rights Reserved

Preface

What This Book Isn’t

What This Book Is About

Who Should Read This Book

Prerequisites

What You’ll Learn and How It Will Improve Your Abilities

Navigating This Book

Conventions Used in This Book

How to Contact Us

Acknowledgments

I. Foundation and Building Blocks

1. Data Engineering Described

What Is Data Engineering?

Data Engineering Defined

The Data Engineering Lifecycle

Evolution of the Data Engineer

Data Engineering and Data Science

Data Engineering Skills and Activities

Data Maturity and the Data Engineer

The Background and Skills of a Data Engineer

Business Responsibilities

Technical Responsibilities

The Continuum of Data Engineering Roles, from A to B

Data Engineers Inside an Organization

Internal-Facing Versus External-Facing Data Engineers

Data Engineers and Other Technical Roles

Data Engineers and Business Leadership

Conclusion

Additional Resources

2. The Data Engineering Lifecycle

What Is the Data Engineering Lifecycle?

The Data Lifecycle Versus the Data Engineering Lifecycle

Generation: Source Systems

Storage

Ingestion

Transformation

Serving Data

Major Undercurrents Across the Data Engineering Lifecycle

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

3. Designing Good Data Architecture

What Is Data Architecture?

Enterprise Architecture Defined

Data Architecture Defined

“Good” Data Architecture

Principles of Good Data Architecture

Principle 1: Choose Common Components Wisely

Principle 2: Plan for Failure

Principle 3: Architect for Scalability

Principle 4: Architecture Is Leadership

Principle 5: Always Be Architecting

Principle 6: Build Loosely Coupled Systems

Principle 7: Make Reversible Decisions

Principle 8: Prioritize Security

Principle 9: Embrace FinOps

Major Architecture Concepts

Domains and Services

Distributed Systems, Scalability, and Designing for Failure

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices

User Access: Single Versus Multitenant

Event-Driven Architecture

Brownfield Versus Greenfield Projects

Examples and Types of Data Architecture

Data Warehouse

Data Lake

Convergence, Next-Generation Data Lakes, and the Data Platform

Modern Data Stack

Lambda Architecture

Kappa Architecture

The Dataflow Model and Unified Batch and Streaming

Architecture for IoT

Data Mesh

Other Data Architecture Examples

Who’s Involved with Designing a Data Architecture?

Conclusion

Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities

Speed to Market

Interoperability

Cost Optimization and Business Value

Total Cost of Ownership

Total Opportunity Cost of Ownership

FinOps

Today Versus the Future: Immutable Versus Transitory Technologies

Our Advice

Location

On Premises

Cloud

Hybrid Cloud

Multicloud

Decentralized: Blockchain and the Edge

Our Advice

Cloud Repatriation Arguments

Build Versus Buy

Open Source Software

Proprietary Walled Gardens

Our Advice

Monolith Versus Modular

Monolith

Modularity

The Distributed Monolith Pattern

Our Advice

Serverless Versus Servers

Serverless

Containers

How to Evaluate Server Versus Serverless

Our Advice

Optimization, Performance, and the Benchmark Wars

Big Data...for the 1990s

Nonsensical Cost Comparisons

Asymmetric Optimization

Caveat Emptor

Undercurrents and Their Impacts on Choosing Technologies

Data Management

DataOps

Data Architecture

Orchestration Example: Airflow

Software Engineering

Conclusion

Additional Resources

II. The Data Engineering Lifecycle in Depth

5. Data Generation in Source Systems

Sources of Data: How Is Data Created?

Source Systems: Main Ideas

Files and Unstructured Data

APIs

Application Databases (OLTP Systems)

Online Analytical Processing System

Change Data Capture

Logs

Database Logs

CRUD

Insert-Only

Messages and Streams

Types of Time

Source System Practical Details

Databases

APIs

Data Sharing

Third-Party Data Sources

Message Queues and Event-Streaming Platforms

Whom You’ll Work With

Undercurrents and Their Impact on Source Systems

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

6. Storage

Raw Ingredients of Data Storage

Magnetic Disk Drive

Solid-State Drive

Random Access Memory

Networking and CPU

Serialization

Compression

Caching

Data Storage Systems

Single Machine Versus Distributed Storage

Eventual Versus Strong Consistency

File Storage

Block Storage

Object Storage

Cache and Memory-Based Storage Systems

The Hadoop Distributed File System

Streaming Storage

Indexes, Partitioning, and Clustering

Data Engineering Storage Abstractions

The Data Warehouse

The Data Lake

The Data Lakehouse

Data Platforms

Stream-to-Batch Storage Architecture

Big Ideas and Trends in Storage

Data Catalog

Data Sharing

Schema

Separation of Compute from Storage

Data Storage Lifecycle and Data Retention

Single-Tenant Versus Multitenant Storage

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

7. Ingestion

What Is Data Ingestion?

Key Engineering Considerations for the Ingestion Phase

Bounded Versus Unbounded Data

Frequency

Synchronous Versus Asynchronous Ingestion

Serialization and Deserialization

Throughput and Scalability

Reliability and Durability

Payload

Push Versus Pull Versus Poll Patterns

Batch Ingestion Considerations

Snapshot or Differential Extraction

File-Based Export and Ingestion

ETL Versus ELT

Inserts, Updates, and Batch Size

Data Migration

Message and Stream Ingestion Considerations

Schema Evolution

Late-Arriving Data

Ordering and Multiple Delivery

Replay

Time to Live

Message Size

Error Handling and Dead-Letter Queues

Consumer Pull and Push

Location

Ways to Ingest Data

Direct Database Connection

Change Data Capture

APIs

Message Queues and Event-Streaming Platforms

Managed Data Connectors

Moving Data with Object Storage

EDI

Databases and File Export

Practical Issues with Common File Formats

Shell

SSH

SFTP and SCP

Webhooks

Web Interface

Web Scraping

Transfer Appliances for Data Migration

Data Sharing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Orchestration

Software Engineering

Conclusion

Additional Resources

8. Queries, Modeling, and Transformation

Queries

What Is a Query?

The Life of a Query

The Query Optimizer

Improving Query Performance

Queries on Streaming Data

Data Modeling

What Is a Data Model?

Conceptual, Logical, and Physical Data Models

Normalization

Techniques for Modeling Batch Analytical Data

Modeling Streaming Data

Transformations

Batch Transformations

Materialized Views, Federation, and Query Virtualization

Streaming Transformations and Processing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL

General Considerations for Serving Data

Trust

What’s the Use Case, and Who’s the User?

Data Products

Self-Service or Not?

Data Definitions and Logic

Data Mesh

Analytics

Business Analytics

Operational Analytics

Embedded Analytics

Machine Learning

What a Data Engineer Should Know About ML

Ways to Serve Data for Analytics and ML

File Exchange

Databases

Streaming Systems

Query Federation

Data Sharing

Semantic and Metrics Layers

Serving Data in Notebooks

Reverse ETL

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

III. Security, Privacy, and the Future of Data Engineering

10. Security and Privacy

People

The Power of Negative Thinking

Always Be Paranoid

Processes

Security Theater Versus Security Habit

Active Security

The Principle of Least Privilege

Shared Responsibility in the Cloud

Always Back Up Your Data

An Example Security Policy

Technology

Patch and Update Systems

Encryption

Logging, Monitoring, and Alerting

Network Access

Security for Low-Level Data Engineering

Conclusion

Additional Resources

11. The Future of Data Engineering

The Data Engineering Lifecycle Isn’t Going Away

The Decline of Complexity and the Rise of Easy-to-Use Data Tools

The Cloud-Scale Data OS and Improved Interoperability

“Enterprisey” Data Engineering

Titles and Responsibilities Will Morph...

Moving Beyond the Modern Data Stack, Toward the Live Data Stack

The Live Data Stack

Streaming Pipelines and Real-Time Analytical Databases

The Fusion of Data with Applications

The Tight Feedback Between Applications and ML

Dark Matter Data and the Rise of...Spreadsheets?!

Conclusion

A. Serialization and Compression Technical Details

Serialization Formats

Row-Based Serialization

Columnar Serialization

Hybrid Serialization

Database Storage Engines

Compression: gzip, bzip2, Snappy, Etc.

B. Cloud Networking

Cloud Network Topology

Data Egress Charges

Availability Zones

Regions

GCP-Specific Networking and Multiregional Redundancy

Direct Network Connections to the Clouds

CDNs

The Future of Data Egress Fees

Index

About the Authors
