Hadoop and the Analytic Data Pipeline
TABLE OF CONTENTS
Hadoop is Disruptive
Taking a Holistic View of Big Data Projects
Big Ingestion: Ensuring a Flexible and Scalable Approach to Data Ingestion and Onboarding Processes
Big Transformation: Driving Scalable Data Processing and Blending with Maximum Productivity and Fine Control for Developers and Data Analysts
Big Analytics: Delivering Complete Analytic Insights to the Business in a Dynamic, Governed Fashion
Big Solutions: Taking a Solution-Oriented Approach that Leverages the Best of Both Technology and People
Conclusion
INTRODUCTION
Hadoop is disruptive.
Over the last five years, there have been few more
disruptive forces in information technology than big data –
and at the center of this trend is the Hadoop ecosystem.
While everyone has a slightly different definition of big
data, Hadoop is usually the first technology that comes
to mind in big data discussions.
When organizations can effectively leverage Hadoop, putting to work frameworks like
MapReduce, YARN, and Spark, the potential IT and business benefits can be particularly
large. Over time, we’ve seen pioneering organizations achieve this type of success – and
they’ve established some repeatable value-added use case patterns along the way.
Examples include optimizing data warehouses by offloading less frequently used data
and heavy transformation workloads to Hadoop, as well as customer 360-degree view
projects that blend operational data sources together with big data to create on-demand
intelligence across key customer touch points. In some of these scenarios, organizations
have achieved what can best be described as “order of magnitude” benefits.
Given these potentially transformational results, you might ask – “Why isn’t every
organization doing this today?” One major reason is simply that Hadoop is hard. As with any
technology that is just beginning to mature, barriers to entry are high. Specifically, some
of the most common challenges to successfully implementing Hadoop for value-added
analytics are:
• A mismatch between the complex coding and scripting skillsets required to work
with Hadoop and the SQL-centric skillsets most organizations possess
• High cost of acquiring developers to work with Hadoop, coupled with the risk
of having to interpret and manage their code if they leave
These are some of the most readily apparent reasons why Hadoop projects may fail,
leaving IT organizations disillusioned that the expected massive ROI (return on
investment) has not been delivered. In fact, some experts expect the large majority of
Hadoop projects to fall short of their business goals for these very reasons.1
1 “Through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges,” Gartner analyst Nick Heudecker; infoworld.com, September 2015.
The good news is that traditional data integration software providers have begun to
update their tools to help ease the pain of Hadoop, letting ETL developers and data
analysts integrate and process data in a Hadoop environment with their existing skills.
However, leveraging existing ETL skill sets alleviates just one part of a much larger set
of big data challenges.
To be clear, user-friendly ETL tools for big data can certainly accelerate developer
productivity in Hadoop use cases. However, if there isn’t a clear plan that addresses the end-to-end
delivery of integrated, governed data, as well as analytics to address business goals, a lot
of the potential benefits of Hadoop will be left on the table. Organizations may achieve
some moderate cost take-out benefits, but transformative business results will be much
harder to achieve.
In order to maximize the ROI on their Hadoop investment, enterprises need to take
a “full pipeline” view of their big data projects. This means approaching Hadoop in the
context of end-to-end processes: starting at raw data sources, moving through data
engineering and preparation inside and outside Hadoop, and finally leading to the
delivery of analytic insights to various user roles, often as a part of existing business
processes and applications.
A HOLISTIC APPROACH TO THE BIG DATA PIPELINE AND PROJECT LIFECYCLE
1. Big Ingestion   2. Big Transformation   3. Big Analytics   4. Big Solutions
The remainder of this paper will dive into each of these categories in more detail.
Big Ingestion
ENSURING A FLEXIBLE AND SCALABLE APPROACH TO DATA INGESTION AND ONBOARDING PROCESSES
As such, a key need in Hadoop data and analytics projects is the ability to tap into a
variety of different data sources, types, and formats. Further, organizations need to
prepare not only for the data they want to integrate with Hadoop today, but also data
that will need to be handled for potential additional use cases in the future.
A wide variety of data source types and formats are often part of Hadoop analytics
projects.
Organizations are also finding that cost and efficiency pressures as well as other factors
are leading them to use cloud-computing environments more heavily. They may run
Hadoop distributions and other data stores on cloud infrastructure, and as a result,
may need data integration solutions to be cloud-friendly.
This can include running on the public cloud to take advantage of scalability and
elasticity, private clouds with connectivity to on-premises data sources, as well as hybrid
cloud deployments that span both.
As Hadoop projects evolve from small pilots to departmental use cases and, eventually,
enterprise shared service environments, scalability of the data ingestion and onboarding
processes becomes mission-critical. More data sources are introduced over time, individual
data sources change, and the frequency of ingestion can fluctuate. As this process extends
out to a hundred data sources or more, which could even be a range of similar files
in varying formats, maintaining the Hadoop data ingestion process can become
especially painful.
At this point, organizations desperately need to reduce manual effort, potential for
error, and amount of time spent on the care and feeding of Hadoop. They need to go
beyond manually designing data ingestion workflows to establish a dynamic and reusable
approach while also maintaining traceability and governance.
Being able to create dynamic data ingestion templates that apply metadata on-the-fly
for each new or changed source is one solution to this problem. A recent best practices
guide by Ralph Kimball advises organizations to “consider using a metadata-driven
codeless development environment to increase productivity and help insulate you from
underlying technology changes.”3 Not surprisingly, the earlier organizations can
anticipate these needs, the better.
3 Ralph Kimball, Kimball Group, “Newly Emerging Best Practices for Big Data.”
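As a rough illustration of the metadata-driven pattern described above, the sketch below shows a single reusable ingestion routine driven entirely by a per-source metadata registry, so that adding or changing a source means updating metadata rather than building another hand-crafted flow. This is a minimal Python sketch, not Pentaho functionality; the registry file, field names, and landing paths are all hypothetical, and writing to the local file system stands in for writing to HDFS.

import csv
import json
from pathlib import Path

# Hypothetical per-source metadata registry, maintained outside the code.
# Example entry:
# {"name": "orders_eu", "path": "landing/orders_eu", "delimiter": ";",
#  "columns": ["order_id", "customer_id", "amount", "ts"], "target": "raw/orders"}
SOURCES = json.loads(Path("sources.json").read_text())

def ingest(source: dict) -> None:
    """One generic template applied to any source described by metadata."""
    for path in Path(source["path"]).glob("*.csv"):
        with path.open(newline="") as fh:
            reader = csv.reader(fh, delimiter=source["delimiter"])
            rows = [dict(zip(source["columns"], row)) for row in reader]
        # Land the records (stubbed as local JSON Lines here); a real flow would
        # write to HDFS and record lineage metadata (source, file, load time).
        out = Path(source["target"]) / f"{source['name']}_{path.stem}.jsonl"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(json.dumps(r) for r in rows))

for src in SOURCES:
    ingest(src)

In a graphical tool such as Pentaho Data Integration, the same idea is typically expressed as a template transformation whose layout is injected from metadata rather than written by hand, but the principle is identical: the per-source details live in data, not in code.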
Big Transformation
DRIVING SCALABLE DATA PROCESSING AND BLENDING WITH MAXIMUM PRODUCTIVITY AND FINE CONTROL FOR DEVELOPERS AND DATA ANALYSTS
As touched on earlier, it is essentially a “table stakes” requirement to leverage an intui-
tive and easy-to-use data integration product to design and execute these types of data
integration workflows on the Hadoop cluster. Providing drag and drop Hadoop data
integration tools to ETL developers and data analysts allows enterprises to avoid hiring
expensive developers with Hadoop experience.
This is possible with the combination of a highly portable data transformation engine
(“write once, run anywhere”) and an intuitive graphical development environment for
data integration and orchestration workflow. Ideally, this joint set of capabilities is encap-
sulated entirely within one software platform. Overall, this approach not only boosts IT
productivity dramatically, but it also accelerates the delivery of actionable analytics to
business decision makers.
Ease of installation and configuration is a related element that enterprises can look to in
order to drive superior time to value in Hadoop data integration and analytics projects.
This is fairly intuitive – the more adapters, node-by-node installations, and separate
Hadoop component configurations required, the longer it will take to get up and running.
For instance, as more node-by-node software is installed and more cluster variables are
tuned, it is more likely that an approach will risk interfering with policies and rules set by
Hadoop administrators. Also, more onerous and cluster-invasive platform installation
requirements can create further operational problems.
An alternative approach to big data integration involves the use of code generation
tools, which output code that must then be separately run. In addition, because these
tools generate code, that code is often maintained, tuned, and debugged directly – which
can create additional overhead for Hadoop projects. Code generators may provide fine-
grained control, but they normally have a much steeper learning curve. Use of such code-
generators mandates iterative and repetitive access to highly skilled technical resources
familiar with coding and programming. As such, total cost of ownership (TCO) should be
carefully evaluated.
Key capabilities to look for include:
• Intuitive drag-and-drop design paradigm for big data jobs and transformations, with the ability to configure as needed
• Data integration run-time engine that is highly portable across different data storage and processing frameworks, drastically reducing the need to re-factor data workflows
Big Analytics
DELIVERING COMPLETE ANALYTIC INSIGHTS TO THE BUSINESS IN A DYNAMIC, GOVERNED FASHION
As data scientists and advanced analysts begin to query and explore blended data sets
in Hadoop, they will often make use of data warehouse and SQL-like layers on Hadoop,
such as Hive and Impala. Thanks to a familiar type of query language, these tools do not
take long to learn. As such, skilled data analysts should seek out data integration and
analytics platforms that provide operational reporting and visual analytics directly
on Hive and Impala.
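As a hedged illustration of this kind of direct access (not a Pentaho feature), the snippet below shows the sort of query an analyst might run against a Hive table from Python using the community PyHive client; the host, table, and columns are invented for the example, and a nearly identical query would work against Impala.

from pyhive import hive  # community client for HiveServer2

# Hypothetical connection details and table name -- adjust for your cluster.
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# A typical operational-report query run directly on blended data in Hadoop.
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM blended_orders
    WHERE order_date >= '2016-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, orders, revenue in cursor.fetchall():
    print(f"{region}: {orders} orders, {revenue:,.2f} revenue")
conn.close()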
While this is acceptable in the analytics prototyping phase, the performance and usability
are unlikely to satisfy the requirements of larger groups of analysts and business users
in production environments. The wrong query at the wrong time can potentially strain
Hadoop cluster resources, interfering with the completion of other integration processes.
Enterprises are used to providing business users with analytics tools that sit on top of
highly governed, pre-processed data warehouses. The data is, for the most part, trusted
and accurate, while pre-built analytic cubes offer fast answers to the business questions
that business users may want to ask of the data. Conversely, in the world of Hadoop, it
is a much greater challenge to provide direct analytics access at scale that is both highly
governed as well as easily and interactively consumed by the analytics end user. In many
cases, there may be so much data in the Hadoop cluster that it may not even make sense
to expose all of it directly to business users for interactive analysis.
This is another circumstance where considering Hadoop as part of the broader analytic
pipeline is crucial. Specifically, many organizations are already familiar with
high-performance relational databases that are optimized for interactive end-user analytics – or
“analytic databases.” Enterprises are finding that a highly effective way to unleash the
analytic power of Hadoop is to deliver refined data sets from Hadoop to these databases.
The most effective approach enables the business user or analyst to intuitively request
the subset of Hadoop data he or she would like to analyze. A user selection can trigger
on-demand data processing and blending in the Hadoop environment, followed by delivery
of an analytics-ready data set to the end user for ad hoc analysis and visualization.
An illustration of this process flow, starting at the upper left: Request → Refine → Publish, with governance applied throughout.
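The sketch below is one rough way to implement this request-refine-publish flow, using Spark as the processing engine inside Hadoop; the paths, table names, JDBC target, and request parameters are all assumptions made for the example, not a prescribed design.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("refine-and-publish").getOrCreate()

# 1. Request: parameters a business user or analyst might choose (hypothetical).
request = {"region": "EMEA", "from_date": "2016-01-01"}

# 2. Refine: process and blend the relevant slice of raw data inside Hadoop.
orders = spark.read.parquet("hdfs:///raw/orders")        # assumed landing-zone path
customers = spark.read.parquet("hdfs:///raw/customers")
refined = (orders
           .filter((F.col("region") == request["region"]) &
                   (F.col("order_date") >= request["from_date"]))
           .join(customers, "customer_id")
           .groupBy("customer_segment")
           .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue")))

# 3. Publish: deliver the analytics-ready result to an analytic database via JDBC.
(refined.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://analytics-db.example.com:5432/marts")  # assumed target
    .option("dbtable", "emea_revenue_by_segment")
    .option("user", "etl").option("password", "***")
    .mode("overwrite")
    .save())

The design point is that only the refined, analytics-ready subset leaves the cluster, which keeps interactive query load off Hadoop while the lineage of how that subset was produced remains governed.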
Finally, it is key to integrate and operationalize advanced predictive and statistical
modeling into the broader big data pipeline. Despite their potential to create path-breaking
insights, data scientists often find themselves outside the broader enterprise data
integration and analytics production process. Further, the majority of time and effort
in a predictive analytics task is often spent preparing the data, rather than actually
building and applying models.
The more the data integration and analytics approach enables collaboration between
data scientists and the broader IT team, the quicker it will be to develop and implement
new models for forecasting, scoring, and more – leading to faster business benefits. In
particular, time to insight can be accelerated by allowing data scientists to develop
models in their framework of choice (R, Python, etc.) and apply those models directly within
the data transformation and preparation workflow. These models can then be more
easily embedded in regularly occurring business processes.
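A minimal sketch of that hand-off, assuming the data scientist delivers a serialized scikit-learn classifier and the preparation step is pandas-based (the file names and feature columns are hypothetical):

import pickle

import pandas as pd

# Model built by a data scientist in their framework of choice (scikit-learn here)
# and handed off as a serialized artifact -- the file name is an example.
with open("churn_model.pkl", "rb") as fh:
    model = pickle.load(fh)

def score_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Scoring step dropped into the regular data preparation workflow."""
    features = df[["tenure_months", "monthly_spend", "support_tickets"]]  # assumed columns
    df = df.copy()
    df["churn_score"] = model.predict_proba(features)[:, 1]
    return df

# Example: score a prepared batch and hand the result to the next pipeline step
# (for instance, loading into an analytic database or writing back to Hadoop).
prepared = pd.read_parquet("prepared_customers.parquet")
scored = score_batch(prepared)
scored.to_parquet("scored_customers.parquet")

Because the scoring step is just another stage in the preparation workflow, its output can flow into the same delivery and reporting paths as any other blended field.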
Key capabilities to look for include:
• Reporting and visual analysis tools that work on data in Hive and Impala
• End-to-end visibility into and trust in the data integration and analytics process, from raw data to visualizations on the glass
Big Solutions
TAKING A SOLUTION-ORIENTED APPROACH THAT LEVERAGES THE BEST OF BOTH TECHNOLOGY AND PEOPLE
As Ralph Kimball advises, “…plan for disruptive changes coming from every direction: new
data types, competitive challenges, programming approaches, hardware, networking
technology, and services offered by literally hundreds of new big data providers…”4
In this environment, organizations should look for open, adaptable platforms with
characteristics such as:
• Open architectures based on open standards that are easy for IT teams to understand
• Ability to easily leverage existing scripts and code across a variety of frameworks,
whether that means Java, Pig scripts, Python, or others
• Open APIs and well-defined SDKs that facilitate solution extensions to introduce add-
on data and analytics capabilities for specific use cases
• Seamless ability to embed reports, visualizations, and other analytic content into existing business applications and processes
In addition, it takes more than just the right technology platform: the ability to leverage
the right people is arguably more important to project success. Too often, organizations
experience delays and underwhelming results when it comes to Hadoop data integra-
tion and analytics. The problem isn’t always with the underlying technology – rather it is
very common that best practices solution architectures and implementation approaches
are not being followed. Working with a seasoned partner with deep expertise in Hadoop
data and analytics projects can help set teams on the right path from the start and avoid
costly course corrections (or worse) later on.
4 Ralph Kimball, Kimball Group, “Newly Emerging Best Practices for Big Data.”
Since Hadoop itself is so new, IT teams should place a premium on a data integration and
analytics provider’s track record of customer success with Hadoop-specific projects, not
just generic data integration and analytics projects. In addition to vendor service offer-
ings, organizations should consider the experience and expertise of the big data services
team members. These should span the entire project lifecycle, from solution visioning
and implementation workshops to in-depth training programs, architect-level support,
and technical account management.
Key capabilities to look for include:
• Open APIs and well-defined SDKs that let users easily create platform extensions for new use cases
Conclusion
Big data has the potential to solve big problems and create transformational business
benefits. While a whole ecosystem of tools have sprung up around Hadoop to handle and
analyze data, many of them are specialized to just one part of a larger process. In order
to fulfill the promise of Hadoop, organizations need to step back and take an end-to-end
view of their analytic data pipelines.
This means considering every phase of the process – from data ingestion to data
transformation to end analytic consumption, and even beyond to other applications and
systems where analytics must be embedded. It means not only tackling the tactical chal-
lenges like closing the big data development skills gap but also clearly determining how
Hadoop and big data will create value for the business. Whether this happens through
cost savings, revenue generation, better customer experiences or other objectives, taking
an end-to-end view of the data pipeline will help promote project success and enhanced
IT collaboration with the business.
In summary, organizations should keep the tenets of successful big data projects outlined
above top of mind.
What’s next
Ready to get serious about boosting the analytics experience for your users? Check out these helpful resources.
> See how Pentaho tackles Hadoop challenges head-on, from data integration to proven big data implementation patterns and expertise.
> See a quick demo of how Pentaho makes it easy to transform and blend data at scale on Hadoop.

About Pentaho, a Hitachi Group Company
Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments.

Copyright ©2016 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our website at pentaho.com.