Hadoop and the Analytic Data Pipeline
TABLE OF CONTENTS
Hadoop is Disruptive
Taking a Holistic View of Big Data Projects
Big Ingestion: Ensuring a Flexible and Scalable Approach to Data Ingestion and Onboarding Processes
Big Transformation: Driving Scalable Data Processing and Blending with Maximum Productivity and Fine Control for Developers and Data Analysts
Big Analytics: Delivering Complete Analytic Insights to the Business in a Dynamic, Governed Fashion
Big Solutions: Taking a Solution-Oriented Approach that Leverages the Best of Both Technology and People
Conclusion
INTRODUCTION
Hadoop is disruptive.
Over the last five years, there have been few more
disruptive forces in information technology than big data –
and at the center of this trend is the Hadoop ecosystem.
While everyone has a slightly different definition of big
data, Hadoop is usually the first technology that comes
to mind in big data discussions.
When organizations can effectively leverage Hadoop, putting to work frameworks like
MapReduce, YARN, and Spark, the potential IT and business benefits can be particularly
large. Over time, we’ve seen pioneering organizations achieve this type of success – and
they’ve established some repeatable value-added use case patterns along the way.
Examples include optimizing data warehouses by offloading less frequently used data
and heavy transformation workloads to Hadoop, as well as customer 360-degree view
projects that blend operational data sources together with big data to create on-demand
intelligence across key customer touch points. In some of these scenarios, organizations
have achieved what can best be described as “order of magnitude” benefits.
Given these potentially transformational results, you might ask – “Why isn’t every
organization doing this today?” One major reason is simply that Hadoop is hard. As with any
technology that is just beginning to mature, barriers to entry are high. Specifically, some
of the most common challenges to successfully implementing Hadoop for value-added
analytics are:
• A mismatch between the complex coding and scripting skillsets required to work
with Hadoop and the SQL-centric skillsets most organizations possess
• High cost of acquiring developers to work with Hadoop, coupled with the risk
of having to interpret and manage their code if they leave
These are some of the most readily apparent reasons why Hadoop projects may fail,
leaving IT organizations disillusioned that the expected massive ROI (return on
investment) has not been delivered. In fact, some experts expect the large majority of
Hadoop projects to fall short of their business goals for these very reasons.1
1 “Through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges,” Gartner analyst Nick Heudecker; infoworld.com, September 2015.
The good news is that traditional data integration software providers have begun to
update their tools to help ease the pain of Hadoop, letting ETL developers and data
analysts integrate and process data in a Hadoop environment with their existing skills.
However, leveraging existing ETL skill sets alleviates just one part of a much larger set
of big data challenges.
To be clear, user-friendly ETL tools for big data can certainly accelerate developer
productivity in Hadoop use cases. However, if there isn’t a clear plan that addresses the end-to-end
delivery of integrated, governed data, as well as analytics to address business goals, a lot
of the potential benefits of Hadoop will be left on the table. Organizations may achieve
some moderate cost take-out benefits, but transformative business results will be much
harder to achieve.
In order to maximize the ROI on their Hadoop investment, enterprises need to take
a “full pipeline” view of their big data projects. This means approaching Hadoop in the
context of end-to-end processes: starting at raw data sources, moving through data
engineering and preparation inside and outside Hadoop, and finally leading to the
delivery of analytic insights to various user roles, often as a part of existing business
processes and applications.
A HOLISTIC APPROACH TO THE BIG DATA PIPELINE AND PROJECT LIFECYCLE
1. Big Ingestion   2. Big Transformation   3. Big Analytics   4. Big Solutions
The remainder of this paper will dive into each of these categories in more detail.
Big Ingestion
ENSURING A FLEXIBLE AND SCALABLE APPROACH TO DATA INGESTION AND ONBOARDING PROCESSES
As such, a key need in Hadoop data and analytics projects is the ability to tap into a
variety of different data sources, types, and formats. Further, organizations need to
prepare not only for the data they want to integrate with Hadoop today, but also data
that will need to be handled for potential additional use cases in the future.
A wide variety of data source types and formats are often part of Hadoop analytics
projects.
Organizations are also finding that cost and efficiency pressures as well as other factors
are leading them to use cloud-computing environments more heavily. They may run
Hadoop distributions and other data stores on cloud infrastructure, and as a result,
may need data integration solutions to be cloud-friendly.
This can include running on the public cloud to take advantage of scalability and
elasticity, private clouds with connectivity to on-premises data sources, as well as hybrid
cloud deployments that span both.
As Hadoop projects evolve from small pilots to departmental use cases and, eventually,
enterprise shared service environments, scalability of the data ingestion and onboarding
processes becomes mission-critical. More data sources are introduced over time, individual
data sources change, and the frequency of ingestion can fluctuate. As this process extends
out to a hundred data sources or more, which could even be a range of similar files
in varying formats, maintaining the Hadoop data ingestion process can become
especially painful.
At this point, organizations desperately need to reduce manual effort, potential for
error, and amount of time spent on the care and feeding of Hadoop. They need to go
beyond manually designing data ingestion workflows to establish a dynamic and reusable
approach while also maintaining traceability and governance.
Being able to create dynamic data ingestion templates that apply metadata on-the-fly
for each new or changed source is one solution to this problem. A recent best practices
guide by Ralph Kimball advises organizations to “consider using a metadata-driven
codeless development environment to increase productivity and help insulate you from
underlying technology changes.”3 Not surprisingly, the earlier organizations can
anticipate these needs, the better.
3 Ralph Kimball, Kimball Group, “Newly Emerging Best Practices for Big Data.”
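As a rough illustration of the metadata-driven pattern described above, the sketch below shows a single reusable ingestion routine driven entirely by a per-source metadata registry, so that adding or changing a source means updating metadata rather than building another hand-crafted flow. This is a minimal Python sketch, not Pentaho functionality; the registry file, field names, and landing paths are all hypothetical, and writing to the local file system stands in for writing to HDFS.

import csv
import json
from pathlib import Path

# Hypothetical per-source metadata registry, maintained outside the code.
# Example entry:
# {"name": "orders_eu", "path": "landing/orders_eu", "delimiter": ";",
#  "columns": ["order_id", "customer_id", "amount", "ts"], "target": "raw/orders"}
SOURCES = json.loads(Path("sources.json").read_text())

def ingest(source: dict) -> None:
    """One generic template applied to any source described by metadata."""
    for path in Path(source["path"]).glob("*.csv"):
        with path.open(newline="") as fh:
            reader = csv.reader(fh, delimiter=source["delimiter"])
            rows = [dict(zip(source["columns"], row)) for row in reader]
        # Land the records (stubbed as local JSON Lines here); a real flow would
        # write to HDFS and record lineage metadata (source, file, load time).
        out = Path(source["target"]) / f"{source['name']}_{path.stem}.jsonl"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(json.dumps(r) for r in rows))

for src in SOURCES:
    ingest(src)

In a graphical tool such as Pentaho Data Integration, the same idea is typically expressed as a template transformation whose layout is injected from metadata rather than written by hand, but the principle is identical: the per-source details live in data, not in code.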
Big Transformation
DRIVING SCALABLE DATA PROCESSING AND BLENDING WITH MAXIMUM PRODUCTIVITY AND FINE CONTROL FOR DEVELOPERS AND DATA ANALYSTS
As touched on earlier, it is essentially a “table stakes” requirement to leverage an intui-
tive and easy-to-use data integration product to design and execute these types of data
integration workflows on the Hadoop cluster. Providing drag and drop Hadoop data
integration tools to ETL developers and data analysts allows enterprises to avoid hiring
expensive developers with Hadoop experience.
This is possible with the combination of a highly portable data transformation engine
(“write once, run anywhere”) and an intuitive graphical development environment for
data integration and orchestration workflow. Ideally, this joint set of capabilities is encap-
sulated entirely within one software platform. Overall, this approach not only boosts IT
productivity dramatically, but it also accelerates the delivery of actionable analytics to
business decision makers.
Ease of installation and configuration is a related element that enterprises can look to in
order to drive superior time to value in Hadoop data integration and analytics projects.
This is fairly intuitive – the more adapters, node-by-node installations, and separate
Hadoop component configurations required, the longer it will take to get up and running.
For instance, as more node-by-node software is installed and more cluster variables are
tuned, it is more likely that an approach will risk interfering with policies and rules set by
Hadoop administrators. Also, more onerous and cluster-invasive platform installation
requirements can create further operational problems.
An alternative approach to big data integration involves the use of code generation
tools, which output code that must then be separately run. In addition, because these
tools generate code, that code is often maintained, tuned, and debugged directly – which
can create additional overhead for Hadoop projects. Code generators may provide fine-
grained control, but they normally have a much steeper learning curve. Use of such code-
generators mandates iterative and repetitive access to highly skilled technical resources
familiar with coding and programming. As such, total cost of ownership (TCO) should be
carefully evaluated.
Key capabilities to look for include:
• Intuitive drag-and-drop design paradigm for big data jobs and transformations, with the ability to configure as needed
• Data integration run-time engine that is highly portable across different data storage and processing frameworks, drastically reducing the need to re-factor data workflows
Big Analytics
DELIVERING COMPLETE ANALYTIC INSIGHTS TO THE BUSINESS IN A DYNAMIC, GOVERNED FASHION
As data scientists and advanced analysts begin to query and explore blended data sets
in Hadoop, they will often make use of data warehouse and SQL-like layers on Hadoop,
such as Hive and Impala. Thanks to a familiar type of query language, these tools do not
take long to learn. As such, skilled data analysts should seek out data integration and
analytics platforms that provide operational reporting and visual analytics directly
on Hive and Impala.
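As a hedged illustration of this kind of direct access (not a Pentaho feature), the snippet below shows the sort of query an analyst might run against a Hive table from Python using the community PyHive client; the host, table, and columns are invented for the example, and a nearly identical query would work against Impala.

from pyhive import hive  # community client for HiveServer2

# Hypothetical connection details and table name -- adjust for your cluster.
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# A typical operational-report query run directly on blended data in Hadoop.
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM blended_orders
    WHERE order_date >= '2016-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, orders, revenue in cursor.fetchall():
    print(f"{region}: {orders} orders, {revenue:,.2f} revenue")
conn.close()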
While this is acceptable in the analytics prototyping phase, the performance and usability
are unlikely to satisfy the requirements of larger groups of analysts and business users
in production environments. The wrong query at the wrong time can potentially strain
Hadoop cluster resources, interfering with the completion of other integration processes.
Enterprises are used to providing business users with analytics tools that sit on top of
highly governed, pre-processed data warehouses. The data is, for the most part, trusted
and accurate, while pre-built analytic cubes offer fast answers to the business questions
that business users may want to ask of the data. Conversely, in the world of Hadoop, it
is a much greater challenge to provide direct analytics access at scale that is both highly
governed as well as easily and interactively consumed by the analytics end user. In many
cases, there may be so much data in the Hadoop cluster that it may not even make sense
to expose all of it directly to business users for interactive analysis.
This is another circumstance where considering Hadoop as part of the broader analytic
pipeline is crucial. Specifically, many organizations are already familiar with
high-performance relational databases that are optimized for interactive end-user analytics – or
“analytic databases.” Enterprises are finding that a highly effective way to unleash the
analytic power of Hadoop is to deliver refined data sets from Hadoop to these databases.
The most effective approach enables the business user or analyst to intuitively request
the subset of Hadoop data he or she would like to analyze. A user selection can trigger
on-demand data processing and blending in the Hadoop environment, followed by delivery
of an analytics-ready data set to the end user for ad hoc analysis and visualization.
An illustration of this process flow, starting at the upper left: Request → Refine → Publish, with governance applied throughout.
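The sketch below is one rough way to implement this request-refine-publish flow, using Spark as the processing engine inside Hadoop; the paths, table names, JDBC target, and request parameters are all assumptions made for the example, not a prescribed design.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("refine-and-publish").getOrCreate()

# 1. Request: parameters a business user or analyst might choose (hypothetical).
request = {"region": "EMEA", "from_date": "2016-01-01"}

# 2. Refine: process and blend the relevant slice of raw data inside Hadoop.
orders = spark.read.parquet("hdfs:///raw/orders")        # assumed landing-zone path
customers = spark.read.parquet("hdfs:///raw/customers")
refined = (orders
           .filter((F.col("region") == request["region"]) &
                   (F.col("order_date") >= request["from_date"]))
           .join(customers, "customer_id")
           .groupBy("customer_segment")
           .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue")))

# 3. Publish: deliver the analytics-ready result to an analytic database via JDBC.
(refined.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://analytics-db.example.com:5432/marts")  # assumed target
    .option("dbtable", "emea_revenue_by_segment")
    .option("user", "etl").option("password", "***")
    .mode("overwrite")
    .save())

The design point is that only the refined, analytics-ready subset leaves the cluster, which keeps interactive query load off Hadoop while the lineage of how that subset was produced remains governed.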
Finally, it is key to integrate and operationalize advanced predictive and statistical
modeling into the broader big data pipeline. Despite their potential to create path-breaking
insights, data scientists often find themselves outside the broader enterprise data
integration and analytics production process. Further, the majority of time and effort
in a predictive analytics task is often spent preparing the data, rather than actually
building and applying models.
The more the data integration and analytics approach enables collaboration between
data scientists and the broader IT team, the quicker it will be to develop and implement
new models for forecasting, scoring, and more – leading to faster business benefits. In
particular, time to insight can be accelerated by allowing data scientists to develop
models in their framework of choice (R, Python, etc.) and apply those models directly within
the data transformation and preparation workflow. These models can then be more
easily embedded in regularly occurring business processes.
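A minimal sketch of that hand-off, assuming the data scientist delivers a serialized scikit-learn classifier and the preparation step is pandas-based (the file names and feature columns are hypothetical):

import pickle

import pandas as pd

# Model built by a data scientist in their framework of choice (scikit-learn here)
# and handed off as a serialized artifact -- the file name is an example.
with open("churn_model.pkl", "rb") as fh:
    model = pickle.load(fh)

def score_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Scoring step dropped into the regular data preparation workflow."""
    features = df[["tenure_months", "monthly_spend", "support_tickets"]]  # assumed columns
    df = df.copy()
    df["churn_score"] = model.predict_proba(features)[:, 1]
    return df

# Example: score a prepared batch and hand the result to the next pipeline step
# (for instance, loading into an analytic database or writing back to Hadoop).
prepared = pd.read_parquet("prepared_customers.parquet")
scored = score_batch(prepared)
scored.to_parquet("scored_customers.parquet")

Because the scoring step is just another stage in the preparation workflow, its output can flow into the same delivery and reporting paths as any other blended field.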
Key capabilities to look for include:
• Reporting and visual analysis tools that work on data in Hive and Impala
• End-to-end visibility into and trust in the data integration and analytics process, from raw data to visualizations on the glass
Big Solutions
TAKING A SOLUTION-ORIENTED APPROACH THAT LEVERAGES THE BEST OF BOTH TECHNOLOGY AND PEOPLE
As Ralph Kimball advises, “…plan for disruptive changes coming from every direction: new
data types, competitive challenges, programming approaches, hardware, networking
technology, and services offered by literally hundreds of new big data providers…”4
In this environment, organizations should look for open, adaptable platforms with
characteristics such as:
• Open architectures based on open standards that are easy for IT teams to understand
• Ability to easily leverage existing scripts and code across a variety of frameworks,
whether that means Java, Pig scripts, Python, or others
• Open APIs and well-defined SDKs that facilitate solution extensions to introduce add-
on data and analytics capabilities for specific use cases
• Seamless ability to embed reports, visualizations, and other analytic content into existing business applications and processes
In addition, it takes more than just the right technology platform: the ability to leverage
the right people is arguably more important to project success. Too often, organizations
experience delays and underwhelming results when it comes to Hadoop data integra-
tion and analytics. The problem isn’t always with the underlying technology – rather it is
very common that best practices solution architectures and implementation approaches
are not being followed. Working with a seasoned partner with deep expertise in Hadoop
data and analytics projects can help set teams on the right path from the start and avoid
costly course corrections (or worse) later on.
4 Ralph Kimball, Kimball Group, “Newly Emerging Best Practices for Big Data.”
Since Hadoop itself is so new, IT teams should place a premium on a data integration and
analytics provider’s track record of customer success with Hadoop-specific projects, not
just generic data integration and analytics projects. In addition to vendor service offer-
ings, organizations should consider the experience and expertise of the big data services
team members. These should span the entire project lifecycle, from solution visioning
and implementation workshops to in-depth training programs, architect-level support,
and technical account management.
Key capabilities to look for include:
• Open APIs and well-defined SDKs that let users easily create platform extensions for new use cases
Conclusion
Big data has the potential to solve big problems and create transformational business
benefits. While a whole ecosystem of tools have sprung up around Hadoop to handle and
analyze data, many of them are specialized to just one part of a larger process. In order
to fulfill the promise of Hadoop, organizations need to step back and take an end-to-end
view of their analytic data pipelines.
This means considering every phase of the process – from data ingestion to data
transformation to end analytic consumption, and even beyond to other applications and
systems where analytics must be embedded. It means not only tackling the tactical chal-
lenges like closing the big data development skills gap but also clearly determining how
Hadoop and big data will create value for the business. Whether this happens through
cost savings, revenue generation, better customer experiences or other objectives, taking
an end-to-end view of the data pipeline will help promote project success and enhanced
IT collaboration with the business.
In summary, organizations should keep the tenets of successful big data projects outlined
above top of mind.
What’s next
Ready to get serious about boosting the analytics experience for your users? Check out these helpful resources.
> See how Pentaho tackles Hadoop challenges head-on, from data integration to proven big data implementation patterns and expertise.
> See a quick demo of how Pentaho makes it easy to transform and blend data at scale on Hadoop.

About Pentaho, a Hitachi Group Company
Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments.

Copyright ©2016 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our website at pentaho.com.