The Evolution of the Data Warehouse
Complexity in Data Warehousing
Data Warehouse Architecture
There are three primary processes and structures involved in the creation of a traditional data warehouse:
• Extract-Transform-Load (ETL)
• The Data Warehouse
• Online Analytical Processing (OLAP) Cubes
The following sections elaborate on these facets of data warehousing.
ETL in a Nutshell
At some point, data needs to move from source systems to an analysis target—typically a data warehouse. This process is known as the extract-transform-load (ETL) process, and it is often the most challenging part of any data warehouse project. The ETL process involves cleaning the data, which means taking data out of a variety of source formats and consolidating it into a format suitable for analysis. As such, ETL can be a time-consuming and tedious process. Moreover, the ETL process is not always a static one. It can change midstream as business requirements shift, leading to delays and additional work in the course of the analytics effort.
Many ETL processes run just once each day, which means business users often do not see their most recent data; the results can be up to 24 hours out of date, which can feel like a lifetime for some businesses. Although efforts have been made by many organizations to provide more real-time delivery of data into the data warehouse, this can be a challenging process, as it involves adding the ability to capture changed data from the source systems in real time. Capturing real-time data changes adds physical and administrative burdens to the source systems.
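To make this concrete, here is a minimal ETL sketch in Python. It is only an illustration of the extract, transform, and load steps described above; the CSV export, column names, and SQLite target are hypothetical stand-ins for real source systems and a real warehouse.

    import csv
    import sqlite3

    def run_etl(source_csv="orders_export.csv", target_db="warehouse.db"):
        # Load target: create the table in a local SQLite database,
        # standing in for the data warehouse.
        conn = sqlite3.connect(target_db)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL, region TEXT)"
        )
        # Extract: read rows from a hypothetical CSV export of a source system.
        with open(source_csv, newline="") as f:
            for row in csv.DictReader(f):
                # Transform: cast the amount to a number and normalize the
                # region, skipping rows that fail validation.
                try:
                    amount = float(row["amount"])
                except (KeyError, ValueError):
                    continue
                region = row.get("region", "").strip().upper()
                conn.execute(
                    "INSERT INTO orders VALUES (?, ?, ?)",
                    (row.get("order_id"), amount, region),
                )
        conn.commit()
        conn.close()

Even in this toy form, most of the code is cleaning and consolidation rather than analysis, which is exactly why ETL consumes so much project effort.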
A rather recent trend in the space is the concept of a data lake.
This term refers to a storage repository that holds data in its
native format until an analysis process acts on it. A unique ID
is applied to each piece of data upon ingestion. Data lakes are
discussed in more detail in a later section.
From a cost perspective, data warehouses are some of the most expensive resources in an IT organization. Most of this comes down to pure infrastructure costs—on-site enterprise storage is still quite expensive, and data warehouses are colossal systems ranging from terabytes to petabytes.
Trends in Computing
Over the last decade, the amount of computing power that can be brought to bear to work on larger volumes of data has increased dramatically. However, even with significant raw computing power, traditional relational database systems may encounter major performance issues when trying to query large volumes of operational, transactional data, resulting in what has become known as a big data problem.
What is Hadoop?
Hadoop is an open-source software project that provides a platform to store and process massive data sets. The Hadoop Distributed File System can handle very large files as well as store an astronomical quantity of files. The MapReduce framework processes data in parallel, which results in substantially increased performance over the serial processing methods used historically.
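As a toy illustration of the MapReduce idea (not Hadoop's actual API), here is the classic word count in Python: the map step emits key/value pairs from each record independently, which is what makes the work parallelizable, and the reduce step aggregates all values sharing a key.

    from collections import defaultdict

    def map_step(line):
        # Emit a (word, 1) pair for every word in one input record.
        for word in line.split():
            yield word.lower(), 1

    def reduce_step(pairs):
        # Sum the values for each key across all mapped records.
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)

    lines = ["the quick brown fox", "the lazy dog"]
    pairs = [pair for line in lines for pair in map_step(line)]
    print(reduce_step(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}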
In essence, this means that things like networks, storage, and server configuration can all be turned into software and automated.
With cloud computing and the ability to "rent" hardware and licensing, companies can get up and running in a matter of minutes, and have data streaming into their cloud-based data warehouse shortly thereafter. Best of all, the up-front investment is practically zero dollars, since companies pay only for what they use.
Data warehouse security leverages roles, with some semblance of row-level security—either at the reporting layer or in the database itself. Encryption of data at rest is common in organizations that have specific regulatory requirements, as is encryption or obfuscation of data within the database.
Diversity of Data
Another trend impacting data warehousing projects is an increase in the overall diversity of data. Data no longer resides exclusively in relational databases in a perfect tabular format. Many social media sources, for example, use JavaScript Object Notation (JSON) for their data; many application programming interfaces (APIs) use Extensible Markup Language (XML); and some organizations still have mainframe data that is fixed width. This medley of data types can be even more troublesome for the ETL process than the problems previously described. Typically, custom code is required to parse and manage these types of data.
In addition to the custom code required to parse this data, it is very expensive to parse from a computational perspective. Moreover, the data tends to have dynamic formats that can break a rigid traditional ETL process.
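The sketch below illustrates that custom-code burden, assuming three hypothetical records carrying the same two fields in JSON, XML, and fixed-width form. Each format needs its own parser, and each parser breaks in its own way when the format drifts.

    import json
    import xml.etree.ElementTree as ET

    json_record = '{"user": "a17", "clicks": 42}'
    xml_record = "<event><user>a17</user><clicks>42</clicks></event>"
    fixed_record = "a17       0042"  # columns 1-10: user, 11-14: clicks

    def from_json(raw):
        data = json.loads(raw)
        return data["user"], int(data["clicks"])

    def from_xml(raw):
        root = ET.fromstring(raw)
        return root.findtext("user"), int(root.findtext("clicks"))

    def from_fixed_width(raw):
        return raw[0:10].strip(), int(raw[10:14])

    for parser, raw in [(from_json, json_record), (from_xml, xml_record),
                        (from_fixed_width, fixed_record)]:
        print(parser(raw))  # each prints ('a17', 42)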
Data Lakes
A recent trend that builds on a number of advances in computing is the concept of a data lake. A data lake takes advantage of low-cost local storage to house a vast amount of raw data from source systems in its native format until the data is needed for analysis. Each element of data in the lake is assigned a unique identifier and tagged with metadata. This data is typically queried to filter it down to a smaller set, which can then be deeply analyzed.
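A minimal sketch of that ingestion model, assuming an in-memory dictionary as a stand-in for an object store: the payload is kept in its native format, assigned a unique identifier, and tagged with metadata for later discovery.

    import time
    import uuid

    lake = {}  # stand-in for an object store such as S3

    def ingest(raw_bytes, source, fmt):
        # Store the payload untouched; transformation is deferred until analysis.
        object_id = str(uuid.uuid4())
        lake[object_id] = {
            "payload": raw_bytes,
            "metadata": {"source": source, "format": fmt, "ingested_at": time.time()},
        }
        return object_id

    oid = ingest(b'{"user": "a17"}', source="clickstream", fmt="json")
    print(oid, lake[oid]["metadata"])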
[Figure: Data Warehouse, Data Lake, and Data Marts]
Introducing Panoply.io
If you think that all of this sounds rather daunting, you're absolutely right. And that's where Panoply.io comes in. Panoply.io is an end-to-end platform for analytical data warehousing. Its purpose is to abstract away the complexity of the technologies, components, and configurations required to maintain a robust data warehouse, allowing companies to utilize their data with their favorite tools instantly.
In the cloud, capacity can be automated based on workload demands to synchronize costs with workload needs. Panoply.io takes workload sizing to the next level by gathering data about your workloads and auto-scaling the size of your infrastructure predictively, based on observed usage patterns.
Data Security
Security is a major concern for many customers considering cloud computing. They need to know: by shipping their business data off-site to a cloud provider, what risks are they incurring?
On top of the security inherent in AWS, the Panoply.io architecture has security at its very core. All of the data stored in Panoply.io is encrypted, both at the file system and application levels. Panoply.io also supports two-factor authentication to protect against social engineering attacks. All data is secured in transport using TLS encryption and hardware-accelerated AES 256-bit encryption. This means there is no performance penalty for protecting your data.
Data Transformation
Panoply.io does not include an ETL tool. As mentioned earlier, ETL can be a painful, manual, and expensive process. In its place, there is a bit of a twist: Panoply.io takes advantage of a more modern process—extract, load, and transform (ELT).
ELT is quickly becoming the norm for big data systems, where the schema is applied upon read. This transformation model allows users to write their transformations in SQL or Python and represent them as views. This process has the advantage of working retroactively on data through a simple code change to the view, as opposed to making major changes to the ETL process.
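Here is a small ELT sketch using SQLite from Python (table and column names are hypothetical). The raw data is loaded as-is, and the transformation lives in a view that is applied at read time; changing business logic later means redefining the view rather than re-running a pipeline.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Load first: raw data lands untransformed.
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, region TEXT)")
    conn.execute("INSERT INTO raw_orders VALUES ('o1', '19.99', ' us-east ')")

    # Transform on read: the cleanup logic is just a view definition.
    conn.execute("""
        CREATE VIEW clean_orders AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(TRIM(region)) AS region
        FROM raw_orders
    """)

    print(conn.execute("SELECT * FROM clean_orders").fetchall())
    # [('o1', 19.99, 'US-EAST')]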
[Figure: source data (production, client, config, and log systems) flows through data processing (DTS) into the data warehouse and cubes, and on to reporting.]
If there is no built-in integration that meets your needs, a framework is provided to allow users to create custom integrations with other data sources as well.
For example, you may wish to incorporate social media data from advertising campaigns to align sales data with marketing efforts. Since most of the resources in question probably export JSON, and Panoply.io supports API-based ingestion, you can include social data in your analysis with minimal effort and see deeper, richer detail. Additionally, you can easily integrate with other popular data sources like Salesforce and Google Analytics.
Integrated BI Tools
Integrated reporting tools have always been part of the business intelligence (BI) landscape, as far back as Crystal Reports and Brio. Such visualization tools are a vital part of the BI process. Recently, comprehensive external tools like Tableau and Microsoft Power BI have become very popular, as they support a variety of modern data sources and deliver powerful visualizations across large amounts of data. These tools are extremely popular with business users, as they can quickly develop data models and charts without needing comprehensive knowledge of SQL or any other programming language. In addition to Tableau and Power BI, Panoply.io supports tools such as R and Spark for statistical and streaming analysis of data.
Houston, We Have an Efficiency Problem
A traditional data warehouse project typically requires resources from all of the following teams:
• IT Infrastructure
• IT Storage
• Database Administration
• Development
• IT Business Intelligence
• IT Project Management
• Business Team Requesting Project
Managing and aligning all of these resources to reach a common goal is quite challenging, and this is why in large organizations these projects can take years to get off the ground—not to mention get completed. Worse yet, sometimes the projects aren't even successful. There is often contention for these resources, especially the shared ones such as the infrastructure and database teams, which means the data warehouse project may not be the highest priority for the IT organization.
Working with external consultants can be difficult due to the lack of intimate familiarity with the company. Even when working with internal development staff, it can be challenging to convey the requirements that are needed.
Systems Management
For smaller companies, one of the benefits of cloud computing is that they are effectively outsourcing the management of their infrastructure. As mentioned earlier, large IT organizations have a significant percentage of their staff dedicated to keeping the lights on; the IT Operations team performs much of this day-to-day work.
Performance
In organizations of all sizes, there are often issues with the performance of large data systems. When you start working with terabytes and petabytes of data, systems must be optimized for ideal performance. In the traditional relational database world, this meant optimizing indexes across the data warehouse. That optimization used to be a time-consuming, tedious art, but it has evolved over time. In larger organizations, the solution is sometimes as primitive as purchasing more robust hardware to meet the needs of the system. While this can be an acceptable solution, it is expensive, sometimes even wasteful, and does not solve the underlying inefficiency.
Tuning data warehouses involves a tedious process of capturing queries, evaluating execution plans, and gradually implementing improvements through a testing cycle. It can take several days, and often several weeks, to resolve a particularly problematic performance issue. It is important not to underestimate the resources necessary to meet these needs, and automation tools should be considered to speed things along. Automation is an inevitable part of the process for companies that want to deliver successful IT projects in the post-cloud era.
By fully committing to this cloud-based model, Panoply.io eliminates some of the previously described problems and creates some heretofore unknown advantages. Let's look at a few of the advantages working with Panoply.io has over data warehousing solutions of yesteryear.
Performance
As opposed to the manual query tuning process mentioned above, Panoply.io captures metrics on all of your query runs. This information is fed to a self-tuning process that automatically optimizes your data and index structures based on your query patterns and workloads. The tuning goes as far as implementing techniques like partitioning (splitting a table into several smaller sub-tables in a manner that is invisible to users and queries), the implementation of which required a great deal of manual effort to complete in the past.
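To show what partitioning means in principle, here is a deliberately simplified Python sketch: one logical table split into monthly sub-tables, with routing and pruning hidden behind two functions. A real database engine (and Panoply.io's automation) does this transparently inside the engine; the example only illustrates the idea.

    # One logical "sales" table, physically split by month.
    partitions = {"2017-01": [], "2017-02": []}

    def insert(row):
        # Route each row to its partition by the month of its date key.
        partitions[row["date"][:7]].append(row)

    def query(month):
        # Partition pruning: only the relevant sub-table is scanned.
        return partitions[month]

    insert({"date": "2017-01-15", "amount": 10})
    insert({"date": "2017-02-03", "amount": 25})
    print(query("2017-01"))  # [{'date': '2017-01-15', 'amount': 10}]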
Panoply.io stores data in a columnar format, which is optimized for reading and loading of data in a multi-tiered fashion. This design allows for optimal performance without sacrificing on cost. Most organizations want to store more data than they can query at any given time. In practice, this means that for economic reasons, the bulk of your data resides in a Hadoop/S3 store for archiving and backup/recovery purposes. Only the hot data (data frequently accessed by SQL queries) is stored in a fully managed Amazon Redshift data warehouse, which is optimized to deliver the best performance for frequent queries. The final tier in this solution is a small set of data that is stored in Elasticsearch to support small and fast queries; the Elasticsearch component acts as a results cache for those queries. While this architecture would be extremely complex to implement in an on-premises environment, Panoply.io abstracts it behind a single JDBC endpoint that you can use to query seamlessly.
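To illustrate that abstraction, here is a hedged sketch of querying such an endpoint from Python using psycopg2 (Redshift speaks the PostgreSQL wire protocol; a JDBC client works the same way). The hostname, credentials, and table are hypothetical placeholders.

    import psycopg2  # PostgreSQL driver; Redshift is wire-compatible

    # Hypothetical connection details. The multi-tiered storage behind the
    # endpoint is invisible here: the client sees one ordinary SQL database.
    conn = psycopg2.connect(
        host="example.panoply.io", port=5439,
        dbname="analytics", user="analyst", password="secret",
    )
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
    conn.close()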
[Figure: a word cloud of Panoply.io concepts, including machine learning, algorithms, optimizations, auto-scaling, Redshift, S3 storage, Spark, Elasticsearch, ETL, models, data sources, data lake, and warehouse.]
Remodeling
To maximize the benefits of the aforementioned performance optimizations, Panoply rebuilds your indexes whenever it detects changes in your query patterns. These rebuild actions are kicked off by statistical analysis of your queries and data. Additionally, a separate task that redistributes data across nodes takes place asynchronously, to provide better data locality and therefore better performance. Since moving data is expensive and has a higher potential for negative impact, the redistribution algorithm is much more conservative than the reindexing algorithm. In a traditional system, these processes must both be arranged by the database administrator in conjunction with the business team. Panoply.io removes this burden entirely by automating the process altogether.
Self-Service
For many years, the Holy Grail of business intelligence solutions has been the concept of self-service BI. Tools like Power BI and Tableau have gone a long way toward making this possible. By offering a user-friendly interface from which to access the data, these tools allow business users to construct charts and dashboards using their own knowledge and vision of the data they intend to review. Panoply.io takes this to the next level by further abstracting all of the data from a myriad of source systems to provide a single data interface where users can connect all of their business intelligence tools.
Backups
Backup and recovery are an essential part of any data solution—you want to protect the investments you have made in your data against hardware failure or user error. Panoply.io leverages AWS's backup infrastructure to back up all of your data across two Availability Zones on different continents. The system takes incremental backups of data whenever changes are made, and full backups run periodically. These backups are not simple snapshots; you have the ability to restore to any point in time and debug any changes in data. You also have direct access to your backups, allowing you to write your own data analysis scripts that run against them or load them into any internal database.
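As a hedged example of running your own scripts directly against backups, the sketch below downloads a backup file from S3 with boto3 and totals one column. The bucket, key, and file layout are hypothetical; they stand in for whatever export location your account exposes.

    import csv
    import gzip
    import boto3

    # Fetch one (hypothetical) backup file directly from object storage.
    s3 = boto3.client("s3")
    s3.download_file("example-backup-bucket", "orders/2017-06-01.csv.gz",
                     "backup.csv.gz")

    # Analyze the backup without loading it into any database.
    total = 0.0
    with gzip.open("backup.csv.gz", mode="rt", newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])
    print("Total order amount in backup:", total)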
Aggregations
Another part of legacy data warehouse projects is building an aggregation model. Typically, aggregation is accomplished using an OLAP cube, which allows users to query the data warehouse in a more ad-hoc fashion. Users can slice and dice data based on key values and filters. Building this OLAP model requires additional development time, and OLAP queries are batch-processed daily, meaning that the business may be looking at day-old data at times. Panoply.io automates this process by analyzing your metadata and data sources to identify logical entities and build key aggregations automatically.
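The kind of question an aggregation model answers can be sketched as an ad-hoc SQL query (here against SQLite from Python, with hypothetical data): slicing filters to one key value, and dicing groups within that slice.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("east", "widget", 10.0), ("east", "gadget", 5.0), ("west", "widget", 7.5),
    ])

    # Slice: restrict to one region. Dice: group by product within that slice.
    rows = conn.execute("""
        SELECT product, SUM(amount)
        FROM sales
        WHERE region = 'east'
        GROUP BY product
    """).fetchall()
    print(rows)  # [('gadget', 5.0), ('widget', 10.0)]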
Machine Learning
Machine learning is a field of computer science that uses math to identify patterns and train computers to act without being explicitly programmed. This sort of training is used a number of ways internally within Panoply.io—optimization of your queries and hardware architecture happens by collecting data and self-training on it. These techniques are similar to the pattern identification of data types, which allows for a high degree of automation in the data warehouse stack. This transformative power automates a large portion of a process that used to require substantial manual effort.
Cloud computing changes some of these paradigms—in most
cases, you can “rent” hardware and software, transferring
what was traditionally a capital expense to a more palatable
operating expense. In the next section, you will learn about
how that can be beneficial to your business.
Figure 4. CapEx (a smaller number of big, one-time expenses) versus OpEx (many smaller, sometimes recurring expenses)
For a traditional on-premises project, the associated CapEx costs will require executive approval and may need to pass through several budget cycles. Additionally, as mentioned earlier, the lead time associated with a major IT project could double the time it takes to get the project off the ground. With a cloud solution, if the smaller amount of funding required for the pay-as-you-go solution is in the departmental budget, the department can fund the project internally with a much shorter approval cycle. Once the connectivity to the cloud provider comes online and an administrator makes the initial data source connections, the department can start to receive value from the project in very short order.
For a large enterprise, this kind of shift may take many years to accomplish. However, for a smaller company, this shift is revolutionary and game-changing. For a small and fast-growing company, the ability to have all of its human resources focused on mission-centric work rather than maintenance activity such as managing the upkeep of systems can be a huge benefit.

There are very real and costly ramifications of this sort of inefficiency. At some organizations, limitations such as those just described keep them from reporting their monthly sales until 30 days after month end. In that case, the organization would be unable to report its financial data promptly to companies interested in it for acquisition, for example. Other companies may have useful data in place but may take days or months to get it into their target systems, which can impact their agility and flexibility in decision-making. The sooner you have higher-quality information, the faster you can make critical business decisions.
Storage Optimization
One of the largest costs in any large IT organization is storage. This is confusing to many people who aren't deeply involved in IT infrastructure and may go into a retail electronics store and see a 4-terabyte hard drive for less than $200. However, enterprise-class storage, the kind that supports analytics systems, comes with a hefty price tag. In some cases, it can be up to $3,000 per terabyte. Why the dramatic difference in pricing? There are a number of reasons. For one, enterprise storage needs several layers of redundancy for protection in the event of hard drive or disk controller failures. Additionally, the storage in most larger organizations is network-connected and requires the use of dedicated switches and expensive fiber optic cable to connect it to the servers it supports. These specialized storage devices require dedicated staff to maintain and support the storage array.
Cloud providers can offer impressive density at a low cost using a storage technique known as object-based storage. They reap additional efficiency because they write their own management software, which is devoid of all the support and extra costs of legacy enterprise storage vendors. The object-based storage model also offers flexibility and allows cloud providers to simplify deployment of storage and build their own redundancy model without having to rely on a third party.
The nature of object storage affords the following advantages over file and block storage (see the sketch after this list):
• Low cost due to commodity hardware
• Enhanced durability due to erasure coding schemes
• Freedom from the limitations imposed by RAID
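In practice, writing to an object store looks like the following hedged sketch using boto3 against S3 (bucket, key, and metadata are hypothetical): each object is addressed by a key and carries free-form metadata, while redundancy and durability are the storage service's concern rather than a RAID configuration on your side.

    import boto3

    s3 = boto3.client("s3")

    # Write one object: a key, an opaque payload in its native format,
    # and user-defined metadata. No volumes, LUNs, or RAID groups involved.
    s3.put_object(
        Bucket="example-data-bucket",
        Key="raw/clickstream/2017-06-01.json",
        Body=b'{"user": "a17", "clicks": 42}',
        Metadata={"source": "clickstream", "format": "json"},
    )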
Automated Backup
Backups have come a long way since the days of the LTO tape being shipped to an off-site provider. Even though technologies have changed, the costs associated with storing backups have not changed that much. Backups are just additional data that needs to be stored, powered, and cooled. Administrators are also needed to ensure the success of backup jobs and perform storage capacity management.
Developer Efficiency
Given the very manual and waterfall-based nature of ETL processing, it accounts for the largest portion of most traditional data analytics projects. Gathering, cleaning, and transforming data can take up to 80% of developer effort on a project. Unfortunately, this part of the project isn't even something that adds any business value; it is merely a part of the work required to start building any analytics system.
Imagine if, from Day One of your analytics project, your developers were focused only on building the best analytics queries and algorithms, and they had free time to focus on determining how to answer the next questions your business will have, rather than merely cleaning and shredding data and worrying about a character change breaking that day's data load. With the automated data transformation that Panoply.io offers, that dream becomes a reality.
Summary
Moving to a PaaS solution from a traditional on-premises solution can certainly be a leap of faith, as it requires giving up some measure of control. However, a powerful and robust platform like Panoply.io can provide higher velocity to insight, allowing your business to make better-informed decisions and improve growth opportunities. Additionally, your IT staff can be more productive because they will have a quicker time to results and less manual effort getting data from source systems into the warehouse.
© 2017 by Panoply Ltd. All rights reserved. All Panoply Ltd. products and services mentioned herein, as well as their respective logos,
are trademarked or registered trademarks of Panoply Ltd. All other product and service names mentioned are the trademarks of their
respective companies. These materials are subject to change without notice. These materials and the data contained are provided by
Panoply Ltd. and its clients, partners and suppliers for informational purposes only, without representation or warranty of any kind,
and Panoply Ltd. shall not be liable for errors or omissions in this document, which is meant for public promotional purposes.