Data Architectures in Cloud Environments: Kurz Erklärt
https://doi.org/10.1007/s13222-024-00490-5
KURZ ERKLÄRT
Abstract
Data architectures are an integral part of developing robust data pipelines. Cloud computing offers various features to
build modern data architectures that power data engineering pipelines in a scalable, cost-effective, and secure way.
In this article we describe how cloud computing works in the field of data management. We start with brief explanations
of cloud computing and general data architectures, and then show how both worlds fit together by introducing design
principles, tools, and services for cloud data architectures.
Maximilian Plazotta
maximilian.plazotta@gmail.com

Meike Klettke
meike.klettke@ur.de

University of Regensburg, Regensburg, Germany

Datenbank-Spektrum

1 Introduction

With the rise of commercial cloud computing over the last 20 years (AWS in 2006, Google Cloud in 2008, Azure in 2008), new forms of data architectures emerged with it. As a consequence, several novel requirements arose from new data-intensive applications. In particular, varying data formats, big data (data velocity, volume, variety, veracity), and a need for collaboration across platforms drove new types of architectures, e.g. data lake, data fabric, data mesh, or data lakehouse, alongside the established traditional database and data warehouse architectures. Fig. 1 displays the popularity of data architectures over time based on Google Trends (0 = lowest popularity, 100 = highest popularity). After being very popular as far back as the 1990s, classical data warehouses declined over the last two decades. This does not mean that they have become unpopular, but different approaches arose over the last years. After 2015, one can observe a large gain in popularity of these new types of architectures. This article gives an overview of how data architectures can be set up within cloud environments and which services can be used for this purpose. In Sect. 2, cloud computing is explained in detail. Sect. 3 is devoted to data architectures and how they are run in the cloud. Sect. 4 gives an outlook on future research possibilities.

2 Cloud computing

“There is no cloud, it is just someone else’s computer” is a phrase often used by skeptics to downplay cloud computing and its capabilities. But the cloud provides much more than a typical computer: the three key offerings are compute, network, and storage, surrounded by various services, e.g., servers, databases, software, continuous integration/continuous deployment (CI/CD), security, and intelligence/analytics. It scales easily through the elastic, on-demand nature of cloud infrastructures, with no upfront investment in the underlying hardware.

Forms of cloud computing are public cloud, private cloud, hybrid cloud, and multi-cloud. Public clouds are third-party offerings operated by hyperscaler companies; the technical backbone of cloud computing (firms) is formed by data centers located around the globe [2]. Private clouds are owned and managed by individual companies; hybrid clouds are a mix of public and private cloud offerings. Multi-cloud set-ups, i.e. architectures with multiple cloud providers, have gained popularity especially over the last few years due to rising prices of cloud workloads, which force companies to compare pricing across cloud providers. Furthermore, a multi-cloud set-up also helps to mitigate the risk of service outages and to prevent vendor lock-in.

The usage of cloud resources is grouped into four categories: infrastructure as a service (IaaS), platform as a service (PaaS), software as a service (SaaS), and serverless computing. IaaS offerings are basically raw IT infrastructure services like virtual machines (VMs). Here the cloud provider operates the underlying infrastructure, but the user is responsible for managing the software (installation, maintenance, updates etc.) that runs on the VM. PaaS services build a layer on top of IaaS to enable the building, testing, and deployment of software or applications. Closely related to PaaS is serverless computing, also called function as a service (FaaS), which allows code to run without the need to manage the underlying servers. SaaS services are fully externally managed applications that require no maintenance of infrastructure, updates/upgrades, or patching by the user.

3 Data architectures

Following the guiding principles of The Open Group Architecture Framework (TOGAF¹), a data architecture refers to the design, structure, and components for managing data assets. It describes how data is collected, processed, stored, distributed, and consumed across systems. Over the last 50 years, different forms of data architectures became popular. Due to the rapid growth of data volume, more computing power, and the introduction of new technologies, various new types arose, especially since 2010 (see Fig. 2).

The first form is the relational database, which dates back to the 1970s. The key criteria of these architectures are the use of the relational data model [4], the use of the structured query language (SQL) [3], an optimization for transactional queries (OLTP), and the support of ACID transactions [7]. For transactional workloads, these relational database management systems (RDBMS) are still in use today.

Introduced by Inmon (1990) [8] and Kimball (1996) [10] in the 1990s, the data warehouse is still a widely used architecture. According to Inmon [8], “a data warehouse is a subject-oriented, integrated, non-volatile, and time variant collection of data in support of management’s decisions”. It offers more than traditional databases, e.g., ETL capabilities, extensive storage of multi-dimensional data, and business intelligence integration. Whereas traditional database architectures support online transaction processing (OLTP) workloads, data warehouses are optimized for online analytical processing (OLAP). Inmon’s top-down approach focuses on the enterprise data warehouse (Corporate Information Factory, CIF), while Kimball introduces the bottom-up dimensional model, where data is organized in different schemata, e.g. star schema, snowflake schema, or a multidimensional storage. Cloud data warehouses, self-deployed or through

Fig. 2 Timeline of data architectures

¹ https://pubs.opengroup.org/architecture/togaf91-doc/arch/chap10.html
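To make the dimensional model and the OLAP-style workload described in Sect. 3 more concrete, the following minimal sketch builds a tiny star schema and runs one analytical query over it. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a data warehouse, and all table names (fact_sales, dim_product, dim_date), column names, and sample rows are hypothetical and do not come from the cited literature.

```python
import sqlite3


def build_star_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical star schema: one fact table, two dimension tables."""
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            revenue REAL
        );
        """
    )
    # Illustrative sample data only.
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)", [(1, "books"), (2, "games")]
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?)", [(10, 2023), (11, 2024)]
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0), (2, 11, 25.0)],
    )


def revenue_by_category_and_year(conn: sqlite3.Connection) -> list:
    """OLAP-style query: join the fact table to its dimensions and aggregate."""
    return conn.execute(
        """
        SELECT p.category, d.year, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in revenue_by_category_and_year(connection):
        print(row)
```

An OLTP workload would instead read or update single rows (for example, one sale by its key), whereas the query above scans and aggregates the whole fact table along two dimensions, which is the access pattern data warehouses are optimized for.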
3.2 Design principles for data architectures deployed in the cloud

All data architectures presented in Sect. 3 can also be deployed in cloud environments. Popular transactional databases like MySQL or PostgreSQL can easily be hosted in clouds. Data warehousing and data lake solutions can be used directly from cloud providers or from third-party companies. The other, more sophisticated architectures mentioned are also deployable across cloud environments. To build robust and scalable cloud data architectures, some underlying design principles ought to be fulfilled (see Table 1). Data is the key variable for defining and selecting the design principles. Therefore, for the security principle we focus on data security (e.g. encryption, access rules, etc.) and not on general cloud security (e.g. avoidance of open endpoints, misconfiguration, etc.).

3.3 Cloud data services

Cloud providers² offer various services to implement end-to-end data pipelines (see Fig. 4). Such pipelines and their components are needed in many of the data architectures shown in Fig. 2, e.g. in data warehouses, data mesh, and data lakehouses. Following the process from Sect. 3.1 and Fig. 3, the first step, data extraction, is often executed outside the cloud environment, as many applications are still hosted elsewhere. Nevertheless, there exist various APIs and connectors to extract data from the source systems into the cloud platform. Once the data is accessible, a data ingestion service is used, such as AWS DataSync (batch data), AWS Kinesis (streaming data), Azure Data Factory (batch), Azure Event Hubs (streaming), Google Cloud Storage Transfer Service (batch), or Google Cloud Pub/Sub (streaming). These services support the ingestion of a variety of data sources from different environments and in varying data formats, for batch or streaming data.

For data processing, AWS Glue is the central service to prepare, transform, and load data in AWS. In Microsoft Azure one can use Azure Synapse (Analytics) or (Azure) Databricks; in Google Cloud the corresponding service is called Dataflow. For storing data, there are basically two forms: the data warehouse and the data lake. The classic AWS data warehouse is called AWS Redshift, for Azure it is Synapse, and for Google Cloud it is BigQuery. The data lake storage is S3 in AWS, Azure Data Lake Storage (ADLS) in Azure, and Google Cloud Storage (GCS) in Google Cloud. Once the data is ready for consumption, it can be visualized through business intelligence services (AWS QuickSight, Microsoft Power BI, Google Cloud Looker Studio), or advanced analytics like machine learning (AWS SageMaker, Azure Machine Learning, Google Cloud Vertex AI) can be applied to the data. It is important to mention that some services possess overlapping functionality, e.g., Azure Synapse is also able to ingest, process, and store data. Services are not arbitrarily interchangeable across cloud providers, but data is shareable to some extent, e.g., with Delta Lake [1] across cloud object stores (S3, ADLS, GCS).

² Besides cloud providers, there also exist third-party companies that offer fully managed data platforms as SaaS, like Snowflake, Databricks, Starburst, and Dremio.

4 Open Research Questions

This article gives a broad overview of significant data architectures and their deployment in cloud environments, with guiding principles on design and functionality. As illustrated in Fig. 2, the development of data architectures is inevitable due to the introduction of new technologies, approaches, or technical advancements. So, what is the next big architectural approach after the recent data lakehouse? Efficient data sharing across multiple (cloud) systems is evidently not yet solved satisfactorily, but there are some promising concepts like data spaces or Delta Sharing. Egress costs, i.e. costs that arise from moving data from a data center to the internet, are still a huge roadblock when it comes to (outbound) data sharing. On 11 January 2024, the European Commission introduced the EU Data Act, which among other terms obliges cloud providers to remove any egress cost by January 2027 (Chap. VI, Art. 29.1). In our opinion, an ecosystem for developing data pipelines in the cloud based on specific application requirements, and the migration to another cloud provider, are also research tasks for the coming years. Furthermore, the impact of artificial intelligence (e.g. in the form of large language models) on data engineering pipelines offers huge potential to improve data quality or optimize performance.

Funding Open Access funding enabled and organized by Projekt DEAL.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Armbrust M, Das T, Sun L, Xin R, Zhu S, Ghodsi A, Yavuz B, Murthy M, Torres J, Sun L, Boncz PA, Mokhtar M, Hovell HV, Ionescu A, Luszczak A, Switakowski M, Ueshin T, Li X, Paranjpye S, Szafranski M, Senster P, Zaharia M (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc VLDB Endow 13(12):3411–3424
2. Barroso LA, Hölzle U, Ranganathan P (2018) The Datacenter as a Computer: Designing Warehouse-Scale Machines. Morgan & Claypool Publishers
3. Chamberlin DD, Boyce RF (1974) SEQUEL: A Structured English Query Language. In: Proceedings of ACM-SIGMOD Workshop on Data Description, Access and Control, pp 249–264
4. Codd EF (1970) A Relational Model of Data for Large Shared Data Banks. Commun ACM 13(6):377–387
5. Dehghani Z, Fowler M (2022) Data Mesh: Delivering Data-driven Value at Scale. O’Reilly Media
6. Dixon J (2010) Pentaho, Hadoop, and Data Lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
7. Härder T, Reuter A (1983) Principles of Transaction-Oriented Database Recovery. ACM Comput Surv 15(4):287–317
8. Inmon WH (1990) Building the Data Warehouse, 1st edn. John Wiley & Sons, Inc, USA
9. Kidd J (2015) Realize the Full Potential of Cloud with the Data Fabric. https://community.netapp.com/t5/Tech-ONTAP-Articles/Realize-the-Full-Potential-of-Cloud-with-the-Data-Fabric/ta-p/101344
10. Kimball R (1996) The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley
11. Klettke M, Störl U (2022) Four Generations in Data Engineering for Data Science. Datenbank Spektrum 22(1):59–66
12. Kreps J (2014) Questioning the Lambda Architecture. https://www.oreilly.com/radar/questioning-the-lambda-architecture/
13. Marz N (2011) How to beat the CAP theorem. http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
14. Marz N, Warren J (2015) Big Data: Principles and best practices of scalable realtime data systems. Manning
15. Strengholt P (2023) Data Management at Scale. O’Reilly
16. Yousfi S, Rhanoui M, Chiadmi D (2019) Towards a Generic Multimodal Architecture for Batch and Streaming Big Data Integration. J Comput Sci 15(1):207–220
17. Zaharia M, Ghodsi A, Xin R, Armbrust M (2021) Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.