
Datenbank-Spektrum

https://doi.org/10.1007/s13222-024-00490-5

KURZ ERKLÄRT

Data Architectures in Cloud Environments


Maximilian Plazotta1 · Meike Klettke1

Accepted: 2 October 2024


© The Author(s) 2024

Abstract
Data architectures are an integral part of developing robust data pipelines. Cloud computing offers various features to
build modern data architectures that power data engineering pipelines in a scalable, cost-effective, and secure way.
In this article we describe how cloud computing works in the field of data management. We start with brief explanations
of cloud computing and general data architectures, and show how both worlds fit together by introducing design
principles, tools, and services for cloud data architectures.

Keywords Data architecture · Cloud computing · Data warehouse · Data lake

1 Introduction

With the advent of commercial cloud computing (AWS in 2006, Google Cloud in 2008, Azure in 2008) over the last 20 years, new forms of data architectures emerged with it. As a consequence, several novel requirements arose from new data-intensive applications. In particular, varying data formats, big data (data velocity, volume, variety, veracity), and a need for collaboration across platforms drove new types of architectures, e.g. the data lake, data fabric, data mesh, or data lakehouse, apart from the established traditional databases or data warehouse architectures. Fig. 1 displays the popularity of data architectures over time using Google Trends (0 = lowest popularity, 100 = highest popularity). After being very popular, going back even to the 1990s, classical data warehouses declined over the last two decades; this is not to say they have become unpopular, but different approaches arose over the last years. After 2015, one can observe a huge gain in popularity of these new types of architectures. This article gives an overview of how data architectures can be set up within cloud environments and which services can be used for this purpose. In Sect. 2, cloud computing is explained in detail. Sect. 3 is devoted to data architectures and how they are run in the cloud. Sect. 4 gives an outlook on future research possibilities.

2 Cloud computing

"There is no cloud, it is just someone else's computer" is a phrase often used by skeptics to downplay cloud computing and its capabilities. But the cloud provides much more than a typical computer: the three key offerings are compute, network, and storage, with various services around them, e.g., servers, databases, software, continuous integration/continuous deployment (CI/CD), security, and intelligence/analytics. It scales easily through the elastic, on-demand nature of cloud infrastructures, with no upfront investment in the underlying hardware. Forms of cloud computing are public cloud, private cloud, hybrid cloud, and multi-cloud. Public clouds are third-party offerings operated by hyperscaler companies; the technical backbone of cloud computing consists of data centers located around the globe [2]. Private clouds are owned and managed by individual companies; hybrid clouds are a mix of public and private cloud offerings. Multi-cloud set-ups, architectures with multiple cloud providers, gained popularity especially over the last few years due to the rising prices of cloud workloads, which force companies to compare pricing across cloud providers. Furthermore, a multi-cloud set-up also helps to mitigate the risk of service outages and to prevent vendor lock-in. The usage of cloud resources is grouped into four categories: infrastructure as a service (IaaS), platform as a service (PaaS), software as a service (SaaS), and serverless computing.

Maximilian Plazotta
maximilian.plazotta@gmail.com

Meike Klettke
meike.klettke@ur.de

1 University of Regensburg, Regensburg, Germany
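The division of responsibilities behind these four categories can be sketched as a small Python model. The stack layers and the cut-off points below are a common textbook simplification introduced here for illustration only, not a taxonomy from the article (FaaS is placed with PaaS, as the two are closely related):

```python
# Illustrative only: who manages which layer under each cloud service model.
# STACK and the cut-off indices are a simplification chosen for this sketch.
STACK = ["hardware", "virtualization", "operating system", "runtime", "application"]

# Layers below the index are managed by the provider, the rest by the user.
USER_MANAGED_FROM = {"IaaS": 2, "PaaS": 4, "FaaS": 4, "SaaS": 5}

def responsibilities(model: str) -> dict:
    """Return the provider/user split of the stack for a service model."""
    cut = USER_MANAGED_FROM[model]
    return {"provider": STACK[:cut], "user": STACK[cut:]}

for model in ("IaaS", "PaaS", "FaaS", "SaaS"):
    print(f"{model:4s} user manages: {responsibilities(model)['user']}")
```

Running the loop makes the progression visible: with IaaS the user still manages everything from the operating system upward, while with SaaS the provider manages the entire stack.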


Fig. 1 Popularity of data related terms

IaaS services are basically raw IT infrastructure services like virtual machines (VMs). Here the cloud provider operates the underlying infrastructure, but the user is responsible for managing the software (installation, maintenance, updates, etc.) which runs on the VM. PaaS services build a layer on top of IaaS to enable the building, testing, and deployment of software or applications. Closely related to PaaS is serverless computing, also called function as a service (FaaS), which allows running code without the need to manage the underlying servers. SaaS services are fully externally managed applications without any need for maintaining infrastructure, applying updates/upgrades, or patching.

3 Data architectures

Following the guiding principles of The Open Group Architecture Framework (TOGAF1), a data architecture refers to the design, structure, and components for managing data assets. It describes how data is collected, processed, stored, distributed, and consumed across systems. Over the last 50 years, different forms of data architectures became popular; due to the rapid growth of data volume, more computing power, and the introduction of new technologies, various new types arose especially since 2010 (see Fig. 2).

The first form are relational databases, which can be dated back to the 1970s. The key criteria of these architectures are the usage of the relational data model [4], the usage of the structured query language (SQL) [3], an optimization for transactional queries (OLTP), and the support of ACID transactions [7]. For transactional workloads, these relational database management systems (RDBMS) are still in use today.

Introduced by Inmon (1990) [8] and Kimball (1996) [10] in the 1990s, the data warehouse is still a widely used architecture. According to Inmon [8], "a data warehouse is a subject-oriented, integrated, non-volatile, and time variant collection of data in support of management's decisions" and offers more than traditional databases, e.g., ETL capabilities, extensive storage of multi-dimensional data, and business intelligence integration. While traditional database architectures support online transaction processing (OLTP) workloads, data warehouses are optimized for online analytical processing (OLAP). Inmon's top-down approach focuses on the enterprise data warehouse (Corporate Information Factory, CIF), while Kimball introduces the bottom-up dimensional model where data is organized in different schemata, e.g. a star schema, a snowflake schema, or multidimensional storage. Cloud data warehouses, self-deployed or consumed as SaaS solutions, are a commonly used form in enterprises today.

1 https://pubs.opengroup.org/architecture/togaf91-doc/arch/chap10.html

Fig. 2 Timeline data architectures


Unlike data warehouses, which enforce a specific structure on the data before storage, data lakes allow data to be stored as it is: in structured, semi-structured, or unstructured form. The term data lake was initially coined by James Dixon (2010) [6]. Data lake architectures are heavily used nowadays due to the low effort needed for data ingestion and their high scalability (terabytes or petabytes of data), combined with the relatively low cost of storage, fitting perfectly into clouds.

With the rising popularity of big data during the early 2010s, new architectures for real-time data processing were established. The lambda architecture, introduced by Nathan Marz first in his blog post "How to beat the CAP theorem" (2011) [13] and later elaborated in Marz and Warren (2015) [14], describes the parallelization of both batch ("cold path") and real-time ("hot path") data processing. Consisting of three layers, the batch layer, the speed layer, and the serving layer, the lambda architecture provides a scalable and fault-tolerant approach to address different types of big data use cases, e.g., social media analytics [14] or traffic management [16]. The kappa architecture proposed by Jay Kreps (2014) [12] is a derivation of the lambda architecture which excludes the batch layer to focus solely on the streaming layer. Both forms are directly deployable within cloud environments and solve the problem of processing and analyzing large-scale, real-time data alongside historical batch data.

The term data fabric was coined by NetApp [9] in 2015 as they observed a rising demand for shifting data between cloud providers. This procedure is (still today) costly and complicated, often referred to as (cloud) vendor lock-in. Data fabric focuses on connecting distributed data from different (siloed) systems, environments, or applications. The glue that holds the architecture together is the metadata: it captures information on where the data is located and what inter-dependencies exist. Key criteria of data fabric architectures are data cataloging and lineage, metadata and master data management, and data integration. Data fabric is not a single technological solution like, e.g., a data warehouse; it is a framework for building robust, distributed data platforms across multiple environments [15].

One of the most recently proposed architectures, the data mesh, relies on decentrally managed data ecosystems. The four key principles of data mesh proposed by Dehghani (2019) [5] are:

1. Domain ownership
2. Data as a product
3. Self-serve data platform
4. Federated computational governance

Compared to the previously discussed architectures, where one central IT department manages the platform, the data mesh approach designates the individual (business) teams as the responsible parties (domain ownership). Within these domains, data is considered a (data) product which can be used, shared, and distributed to create new data products, all under the self-service principle with a governance framework around it.

With the introduction of the data lakehouse in 2020 [17] by the originators of Apache Spark and founders of Databricks, a new approach was established which combines data warehouse and data lake features. The data lake, with its advantages in handling different data formats, low storage cost, and flexibility with scalable compute, is paired with the data warehouse (primarily structured data, ETL, business intelligence) to achieve a flexible, scalable, and cost-effective data architecture. In this context, Databricks also introduced the medallion architecture approach with bronze (raw data), silver (filtered, cleaned, enriched), and gold (ready for consumption) layers to clean and improve data.

3.1 Components of data architectures

As defined at the beginning of Sect. 3, a data architecture must possess integral components to manage data assets. Over time, different data engineering methods emerged and became more sophisticated and technically complex. In [11], different generations of data engineering pipelines are introduced, from simple preprocessing steps to optimized pipelines. Modern data architectures must support data pipelining efforts for various data formats, velocities, and volumes. Fig. 3 gives a high-level overview of the central components of data architectures where data is processed end-to-end. It often starts with data being extracted from the source system(s) into the data platform where the actual magic happens: data ingestion, data processing, data storage, and data consumption.

The proposed logic can be traced back to the ETL/ELT (extract, transform, load / extract, load, transform) processes of the 1980s/1990s. Nevertheless, the details and degree of complexity of such modern platforms need to include further and more granular steps.

Fig. 3 Typical data pipeline
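The end-to-end flow of a typical pipeline (extraction, ingestion, processing, storage, consumption) can be sketched as composed functions over an in-memory stand-in for a source system. All function names and the record layout are invented for this sketch:

```python
# Sketch of a typical data pipeline: extract -> ingest -> process ->
# store -> consume, over an invented in-memory "source system".
def extract(source: list) -> list:
    return list(source)                      # pull raw records from the source system

def ingest(records: list) -> list:
    return [r for r in records if r]         # land records, dropping empty payloads

def process(records: list) -> list:
    # e.g. type coercion as a stand-in for cleaning/transformation
    return [{**r, "amount": float(r["amount"])} for r in records]

store = []                                   # stand-in for warehouse/lake storage

def consume() -> float:
    return sum(r["amount"] for r in store)   # e.g. a BI aggregation

source_system = [{"amount": "10.5"}, {}, {"amount": "4.5"}]
store.extend(process(ingest(extract(source_system))))
print(consume())  # 15.0
```

Each stage maps onto one box of the pipeline; in a cloud deployment each function would be replaced by a managed service (Sect. 3.3).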


Table 1 Design principles for cloud data architectures

Design principle      Explanation
Modularity            Interchangeability of components, services, and tools related to data (e.g. because of changes in datasets, processing algorithms, or workflows)
Scalability           Adjustment to increasing data volumes and compute needs (e.g. auto-scaling)
Flexibility, Agility  Adaptation to changing data sources or formats
Cost-effectiveness    Resource-efficient platform with on-demand usage and flexible pricing
Automation            Automate where possible, e.g. infrastructure as code (IaC), pipeline scheduling, monitoring/alerting
Reliability           Implementation of fail-over mechanisms and disaster recovery (multiple availability zones with replication); support of ACID transactions
Security              Data access rules, encryption at rest and in transit, authorization (e.g., multi-factor authentication)
Performance           Optimization mechanisms to enhance performance and efficiency (indexing, compression, caching, etc.)
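In practice, the reliability and automation principles of Table 1 often surface as retry logic around transient failures of cloud service calls. A minimal sketch with exponential backoff follows; `flaky_upload` is an invented stand-in for any unreliable remote call:

```python
# Minimal retry-with-exponential-backoff sketch, illustrating the
# reliability/automation principles. flaky_upload is a stand-in for a
# transiently failing cloud call; delays are tiny to keep the demo fast.
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 0.01):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise                        # retries exhausted: surface the error
            time.sleep(base_delay * 2 ** i)  # 0.01s, 0.02s, 0.04s, ...

calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:                       # fail twice, then succeed
        raise ConnectionError("transient outage")
    return "uploaded"

print(with_retries(flaky_upload))  # uploaded
```

Managed services implement the same idea at a larger scale, e.g. replaying failed pipeline runs or failing over across availability zones.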

Fig. 4 Cloud data services
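The ingestion services in Fig. 4 split into batch and streaming paths (e.g. AWS DataSync and Azure Data Factory versus AWS Kinesis and Azure Event Hubs). The distinction can be sketched abstractly: batch moves a bounded dataset at once, streaming consumes an unbounded sequence record by record. The in-memory source below is an invented stand-in for a real source system:

```python
# Batch vs. streaming ingestion as a sketch: a bounded one-shot load
# versus record-by-record consumption of a (potentially unbounded) source.
from typing import Iterable, Iterator

def batch_ingest(source: list) -> list:
    """Load the complete, bounded dataset in one shot."""
    return list(source)

def stream_ingest(source: Iterable) -> Iterator:
    """Yield records one at a time as they arrive."""
    for record in source:
        yield record                     # downstream sees each event immediately

events = [{"id": 1}, {"id": 2}, {"id": 3}]
print(len(batch_ingest(events)))         # 3
print(next(stream_ingest(iter(events)))) # {'id': 1}
```

The generator never materializes the whole dataset, which is why streaming services can handle unbounded event flows with bounded memory.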

3.2 Design principles for data architectures deployed in the cloud

All data architectures presented in Sect. 3 can also be deployed in cloud environments. Popular transactional databases like MySQL or PostgreSQL can easily be hosted in clouds. Data warehousing and data lake solutions can be used directly from cloud providers or from third-party companies. The other, more sophisticated architectures mentioned are also deployable across cloud environments. To build robust and scalable cloud data architectures, some underlying design principles ought to be fulfilled (see Table 1). Data is the key variable for defining and selecting the design principles. Therefore, for the security principle we focus on data security (e.g. encryption, access rules, etc.) and not on general cloud security (e.g. avoidance of open endpoints, misconfiguration, etc.).

3.3 Cloud data services

Cloud providers2 offer various services to implement end-to-end data pipelines (see Fig. 4). Such pipelines and their components are needed in many of the data architectures shown in Fig. 2, e.g. in data warehouses, data meshes, and data lakehouses. Following the process from Sect. 3.1 and Fig. 3, the first step, data extraction, is often executed outside the cloud environment, as many applications are still hosted elsewhere. Nevertheless, there exist various APIs and connectors to extract data from the source systems into the cloud platform. Once the data is accessible, a data ingestion service such as AWS DataSync (batch data), AWS Kinesis (streaming data), Azure Data Factory (batch), Azure Event Hubs (streaming), Google Cloud Storage Transfer Service (batch), or Google Cloud Pub/Sub (streaming) is used. These services support the ingestion of a variety of data sources from different environments and of varying data formats for batch or streaming data.

For data processing, AWS Glue is the central service to prepare, transform, and load data in AWS. In Microsoft Azure one can use Azure Synapse (Analytics) or (Azure) Databricks; in Google Cloud the corresponding service is Dataflow. For storing data, there are basically two forms: the data warehouse and the data lake. The classic AWS data warehouse is called AWS Redshift; for Azure it is Synapse, and for Google Cloud BigQuery. The data lake storage is S3 in AWS, Azure Data Lake Storage (ADLS), and Google Cloud Storage, respectively. Once the data is ready for consumption, it can be visualized through business intelligence services (AWS QuickSight, Microsoft PowerBI, Google Cloud Looker Studio), or advanced analytics like machine learning (AWS SageMaker, Azure Machine Learning, Google Cloud Vertex AI) can be applied to the data. It is important to mention that some services possess overlapping functionality, e.g., Azure Synapse is also able to ingest, process, and store data. Services are not randomly interchangeable across cloud providers, but data is shareable to some extent, e.g., with Delta Lake [1] across cloud object stores (S3, ADLS, GCS).

2 Besides cloud providers, there also exist third-party companies who offer fully managed data platforms as SaaS, like Snowflake, Databricks, Starburst, and Dremio.

4 Open Research Questions

This article gives a broad overview of significant data architectures and their deployment in cloud environments, with guiding principles on design and functionality. As illustrated in Fig. 2, the development of data architectures is inevitable due to the introduction of new technologies, approaches, and technical advancements. So, what is the next big architectural approach after the recent data lakehouse? Obviously, efficient data sharing across multiple (cloud) systems is not solved satisfactorily, but there are some promising concepts like data spaces or Delta Sharing. Egress costs, the costs that arise from moving data from a data center to the internet, are still a huge roadblock when it comes to (outbound) data sharing. This year (11.01.2024), the EU Data Act introduced by the European Commission entered into force; among other provisions, it requires cloud providers to remove any egress costs by January 2027 (Chap. VI, Art. 29.1). In our opinion, an ecosystem for developing data pipelines in the cloud based on specific application requirements, as well as migration to another cloud provider, are also research tasks for the coming years. Furthermore, the impact of artificial intelligence (e.g. in the form of large language models) on data engineering pipelines offers huge potential to improve data quality or optimize performance.

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Armbrust M, Das T, Sun L, Xin R, Zhu S, Ghodsi A, Yavuz B, Murthy M, Torres J, Sun L, Boncz PA, Mokhtar M, Hovell HV, Ionescu A, Luszczak A, Switakowski M, Ueshin T, Li X, Paranjpye S, Szafranski M, Senster P, Zaharia M (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc VLDB Endow 13(12):3411–3424
2. Barroso LA, Hölzle U, Ranganathan P (2018) The Datacenter as a Computer: Designing Warehouse-Scale Machines. Morgan & Claypool Publishers
3. Chamberlin DD, Boyce RF (1974) SEQUEL: A Structured English Query Language. In: Proceedings of the ACM-SIGMOD Workshop on Data Description, Access and Control, pp 249–264
4. Codd EF (1970) A Relational Model of Data for Large Shared Data Banks. Commun ACM 13(6):377–387
5. Dehghani Z, Fowler M (2022) Data Mesh: Delivering Data-driven Value at Scale. O'Reilly Media
6. Dixon J (2010) Pentaho, Hadoop, and Data Lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
7. Härder T, Reuter A (1983) Principles of Transaction-Oriented Database Recovery. ACM Comput Surv 15(4):287–317
8. Inmon WH (1990) Building the Data Warehouse, 1st edn. John Wiley & Sons, Inc, USA
9. Kidd J (2015) Realize the Full Potential of Cloud with the Data Fabric. https://community.netapp.com/t5/Tech-ONTAP-Articles/Realize-the-Full-Potential-of-Cloud-with-the-Data-Fabric/ta-p/101344
10. Kimball R (1996) The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley
11. Klettke M, Störl U (2022) Four Generations in Data Engineering for Data Science. Datenbank Spektrum 22(1):59–66
12. Kreps J (2014) Questioning the Lambda Architecture. https://www.oreilly.com/radar/questioning-the-lambda-architecture/
13. Marz N (2011) How to beat the CAP theorem. http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
14. Marz N, Warren J (2015) Big Data: Principles and best practices of scalable realtime data systems. Manning
15. Strengholt P (2023) Data Management at Scale. O'Reilly
16. Yousfi S, Rhanoui M, Chiadmi D (2019) Towards a Generic Multimodal Architecture for Batch and Streaming Big Data Integration. J Comput Sci 15(1):207–220
17. Zaharia M, Ghodsi A, Xin R, Armbrust M (2021) Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
