Week 3 - Data Engineering Lifecycle
Data Ingestion or Data Collection Layer, responsible for bringing data from source systems into the data platform.
Data Storage and Integration Layer, responsible for storing and merging extracted data.
Data Processing Layer, responsible for validating, transforming, and applying business rules to data.
Analysis and User Interface Layer, responsible for delivering processed data to data consumers.
Data Pipeline Layer, responsible for implementing and maintaining a continuously flowing data pipeline.
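As a rough illustration only (none of this comes from the course material, and all function names are hypothetical), the five layers can be pictured as stages that data passes through in order, with the pipeline layer wiring them together:

```python
# Hypothetical sketch of the data platform layers as pipeline stages.
# Function names and logic are illustrative, not from any specific product.

def ingest(source):
    """Data Ingestion: pull records from a source system (batch or stream)."""
    return [record for record in source]

def store_and_integrate(records):
    """Storage and Integration: persist and merge extracted data."""
    return sorted(records, key=lambda r: r["id"])  # e.g., merge on a key

def process(records):
    """Processing: validate, transform, and apply business rules."""
    return [r for r in records if r.get("amount", 0) >= 0]  # drop invalid rows

def serve(records):
    """Analysis / User Interface: deliver processed data to consumers."""
    for r in records:
        print(r)

# The Data Pipeline layer keeps the stages flowing as one continuous whole.
def pipeline(source):
    serve(process(store_and_integrate(ingest(source))))

pipeline([{"id": 2, "amount": 10}, {"id": 1, "amount": -5}])
```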
A well-designed data repository is essential for building a system that is scalable and capable of performing well under high workloads.
The choice or design of a data store is influenced by the type and volume of data that needs to be stored, the intended
use of data, and storage considerations. The privacy, security, and governance needs of your organization also influence
this choice.
Confidentiality, Integrity, and Availability, known as the CIA triad, are the three key components of an effective information security strategy. The CIA triad applies to all facets of security, be it infrastructure, network, application, or data security.
Practice Quiz
Question 1
Which one of these steps is an intrinsic part of the “Data Storage and Integration Layer” of a data platform?
Read data in batch or streaming modes from storage and apply transformations
Transfer data from data sources to the data platform in streaming, batch, or both modes
The Storage and Integration layer in a data platform stores, transforms, and merges extracted data to make it available for
data processing.
Question 2
Systems that are used for capturing high-volume transactional data need to be designed for faster response times to
complex queries.
True
False
Systems that are used for capturing high-volume transactional data need to be designed for high-speed read, write, and
update operations.
Question 3
What is the role of “Intrusion Detection” and “Intrusion Prevention” in the area of network security?
Ensure endpoint security by allowing only authorized devices to connect to the network
Create silos, or virtual local area networks, within a network so that you can segregate your assets
Intrusion Detection and Intrusion Prevention systems inspect network traffic for vulnerabilities and intrusion attempts and prevent intrusions from happening.
Graded Quiz
Question 1
Which one of these steps is an intrinsic part of the “Data Processing Layer” of a data platform?
Transfer data from data sources to the data platform in streaming, batch, or both modes
Read data in batch or streaming modes from storage and apply transformations
Question 2
Systems that are used for capturing high-volume transactional data need to be designed for high-speed read, write, and
update operations.
True
False
High-speed read, write, and update operations are essential for systems that need to capture large volumes of
transactional data.
Question 3
What is the role of “Network Access Control” systems in the area of network security?
To ensure endpoint security by allowing only authorized devices to connect to the network
To create silos, or virtual local area networks, within a network so that you can segregate your assets
Question 4
____________ ensures that users access information based on their roles and the privileges assigned to their roles.
Authentication
Authorization
Firewalls
Security Monitoring
One of the primary controls for data security is to enable access to data through a system of Authorization. It allows
access to information based on a user’s role and role-based privileges.
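As a minimal sketch of role-based authorization (the roles and privilege table below are hypothetical, not from the course):

```python
# Minimal role-based access control (RBAC) sketch.
# Roles and their privileges are illustrative examples only.
ROLE_PRIVILEGES = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

def is_authorized(role: str, action: str) -> bool:
    """Authorization: grant access based on the user's role and its privileges."""
    return action in ROLE_PRIVILEGES.get(role, set())

# An authenticated analyst may read data but not delete it.
assert is_authorized("analyst", "read")
assert not is_authorized("analyst", "delete")
```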
Question 5
Security Monitoring and Intelligence systems:
Create virtual local area networks within a network so that you can segregate your assets
Security Monitoring and Intelligence systems create an audit trail and provide reports and alerts that help enterprises react
to security violations in time.
Once the data you need has been gathered and imported, your next step is to make it analytics-ready. This is where the
process of Data Wrangling, or Data Munging, comes in.
Data Wrangling involves a whole range of transformations and cleansing activities performed on the data. Transformation
of raw data includes the tasks you undertake to:
Normalize data, that is, clean the database of unused and redundant data.
Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
Fix issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
A variety of software and tools are available for the data wrangling process. Some of the popularly used ones include
Excel Power Query, Spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, Trifacta Wrangler, Python,
and R, each with their own set of features, strengths, limitations, and applications.
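As a small sketch of what these cleansing tasks can look like in Python with pandas (the table contents and column names are hypothetical):

```python
import pandas as pd

# Hypothetical raw data with common quality issues.
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None, "Cara"],
    "sale_amount": [100.0, 100.0, 250.0, 80.0, 90_000.0],
})

df = df.drop_duplicates()              # remove duplicate rows
df = df.dropna(subset=["customer"])    # drop rows with missing values
# Flag potential outliers with a z-score; 3 is a common threshold on large
# data sets (this tiny sample uses 1.0 just so the flag is visible).
z = (df["sale_amount"] - df["sale_amount"].mean()) / df["sale_amount"].std()
df["is_outlier"] = z.abs() > 1.0
print(df)
```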
Practice Quiz
Question 1
How is data gathered using Application Programming Interfaces, or APIs?
APIs are used for aggregating constant streams of data flowing from instruments, IoT devices and applications, and
GPS data from cars
APIs are used for downloading specific data from web pages based on defined parameters
APIs are used for capturing updated data from online forums and news sites where data is refreshed on an ongoing
basis
APIs are invoked from applications to access databases, web services, data marketplaces and other such
data endpoints for gathering data
Question 2
What is one of the common structural transformations used for combining data from one or more tables?
Joins
Cleaning
Denormalization
Normalization
Question 3
What tool allows you to discover, cleanse, and transform data with built-in operations?
OpenRefine
Trifacta Wrangler
Google DataPrep
Watson Studio Refinery has built-in features that allow you to discover, cleanse, and transform data.
Graded Quiz
Question 1
Question 2
___________ focuses on cleaning the database of unused data and reducing redundancy and inconsistency.
Denormalization
Data Visualization
Data Profiling
Normalization
Normalization cleanses the database of unused data and reduces inconsistencies in data coming from multiple sources.
Question 3
OpenRefine is an open-source tool that allows you to:
Transform data into a variety of formats such as TSV, CSV, XLS, XML, and JSON
Use add-ins such as Microsoft Power Query to identify issues and clean data
Question 4
When you’re combining rows of data from multiple source tables into a single table, what kind of data transformation are
you performing?
Denormalization
Joins
Unions
Normalization
Unions are a common structural transformation used for combining rows of data from multiple source tables.
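As an illustrative sketch in Python with pandas (table contents hypothetical): a union stacks rows from tables that share the same columns, while a join combines columns from two tables on a shared key.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
orders_q1 = pd.DataFrame({"cust_id": [1, 2], "amount": [100, 250]})
orders_q2 = pd.DataFrame({"cust_id": [1, 1], "amount": [75, 40]})

# Union: combine rows from multiple source tables with the same schema.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

# Join: combine columns from two tables on a shared key (a denormalization).
report = orders.merge(customers, on="cust_id", how="left")
print(report)
```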
Question 5
When you detect a value in your data set that is vastly different from other observations in the same data set, what would
you report that as?
Missing value
Irrelevant data
Outlier
Syntax error
Outliers are values in your data set that may be vastly different from other values in the same data field.
Basic querying techniques can help you explore your data, such as counting and aggregating a dataset, identifying extreme values, slicing data, sorting data, filtering for patterns, and grouping data.
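These techniques map directly onto simple operations; as an illustrative sketch in Python with pandas (column names hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "sale_amount": [120.0, 340.0, 90.0, 560.0],
})

print(len(sales))                                     # counting
print(sales["sale_amount"].sum())                     # aggregating
print(sales["sale_amount"].max())                     # identifying extreme values
print(sales.loc[sales["sale_amount"] > 100])          # slicing / filtering
print(sales.sort_values("sale_amount"))               # sorting
print(sales.groupby("region")["sale_amount"].mean())  # grouping
print(sales["sale_amount"].std())                     # spread (standard deviation)
```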
The performance of a data pipeline can be impacted if the workload increases significantly, an application fails, a scheduled job does not run as expected, or some of the tools in the pipeline run into compatibility issues.
Databases are susceptible to outages, capacity overutilization, application slowdown, and conflicting activities and
queries being executed simultaneously.
Monitoring and alerting systems collect quantitative data in real time to give visibility into the performance of data
pipelines, platforms, databases, applications, tools, queries, scheduled jobs, and more.
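As a toy sketch of the idea (the metric and threshold below are hypothetical), a monitoring check samples a quantitative data point and raises an alert when it crosses a threshold:

```python
import time

LATENCY_THRESHOLD_MS = 500  # hypothetical alerting threshold

def measure_query_latency_ms(run_query) -> float:
    """Time a query to collect one quantitative data point."""
    start = time.perf_counter()
    run_query()
    return (time.perf_counter() - start) * 1000

def check_and_alert(run_query) -> None:
    latency = measure_query_latency_ms(run_query)
    if latency > LATENCY_THRESHOLD_MS:
        # A real system would page on-call staff or write to an alerting service.
        print(f"ALERT: query latency {latency:.0f} ms exceeds threshold")

check_and_alert(lambda: time.sleep(0.6))  # simulated slow query -> alert fires
```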
Time-based and condition-based maintenance schedules generate data that helps identify systems and procedures
responsible for faults and low availability.
Practice Quiz
Question 1
In the video, we used a query function to see how spread out the values in the “Sale Amount” field are. What function did
we use?
Average
Count
Maximum Value
Standard Deviation
Question 2
______________ helps you assess if the size of a workload is slowing down the system.
Database Monitoring
Personal Information and Sensitive Personal Information, that is, data that can be traced back to an individual or can be
used to identify or cause harm to an individual, needs to be protected through governance regulations.
General Data Protection Regulation, or GDPR, is one such regulation that protects the personal data and privacy of EU
citizens for transactions that occur within EU member states.
Industry-specific regulations include HIPAA (Health Insurance Portability and Accountability Act) for healthcare, PCI DSS (Payment Card Industry Data Security Standard) for retail, and SOX (Sarbanes-Oxley) for financial data.
Compliance covers the processes and procedures through which an organization adheres to regulations and conducts its
operations in a legal and ethical manner.
Compliance requires organizations to maintain an auditable trail of personal data through its lifecycle, which includes
acquisition, processing, storage, sharing, retention, and disposal of data.
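A bare-bones sketch of what one entry in such an auditable trail might capture (the field names are hypothetical, not from the course):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    """One record in an audit trail of personal data handling."""
    actor: str    # who touched the data
    stage: str    # acquisition, processing, storage, sharing, retention, or disposal
    dataset: str  # which data was affected
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[AuditEvent] = []
audit_log.append(AuditEvent(actor="etl_service", stage="acquisition", dataset="customers"))
audit_log.append(AuditEvent(actor="analyst_42", stage="sharing", dataset="customers"))
print(audit_log)
```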
Tools and technologies play a critical role in the implementation of a governance framework, offering features such as:
Hosting options that comply with requirements and restrictions for international data transfers.
Practice Quiz
Question 1
At what stage of the data lifecycle would you establish which third-party vendors in your supply chain will have access to
the data you are collecting?
Data Sharing
Data Acquisition
Data Processing
Data Storage
It is in the Data Sharing phase of the data lifecycle that you establish which third-party vendors will have access to your
data, and how they will be held accountable to the same regulations you are liable for.
Graded Quiz
Question 1
In which phase of the data lifecycle do you establish the data you need, the amount of data you need, and how you intend to use the data you are collecting?
Data Processing
Data Acquisition
Data Sharing
Data Retention
In the Data Acquisition phase, you establish the data you need to collect, the amount of data you need, and its intended
use.
Question 2
The process of _____________ abstracts the presentation layer without changing the data in the database physically.
Encryption
Data Profiling
Anonymization
Pseudonymization
Anonymization abstracts the presentation layer without changing the data in the database itself.
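As a rough illustration (the masking rule is hypothetical), anonymizing the presentation layer means masking values as they are displayed, while the stored data remains untouched:

```python
def mask_email(email: str) -> str:
    """Mask an email address for display; the stored value is never modified."""
    user, _, domain = email.partition("@")
    return user[0] + "***@" + domain

stored_record = {"name": "Ann Lee", "email": "ann.lee@example.com"}  # unchanged in the database
display_record = {**stored_record, "email": mask_email(stored_record["email"])}
print(display_record)  # {'name': 'Ann Lee', 'email': 'a***@example.com'}
```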