Metadata Management On A Hadoop Eco-System: Whitepaper by
Satya Nayak
Project Lead, Mphasis
Jansee Korapati
Module Lead, Mphasis
Introduction

The data lake stores large amounts of structured and unstructured data in various varieties at different transformed layers. While the data is growing to terabytes and petabytes, and your data lake is being used by the enterprise, you are likely to come across questions and challenges such as: what data is available in the data lake, how is it consumed/prepared/transformed, who is using this data, who is contributing to this data, how old is the data, etc.

A well-maintained metadata layer can effectively answer these kinds of queries and thus improve the usability of the data lake. This white paper presents the benefits of an effective metadata layer for a data lake implemented on a Hadoop cluster; information on various metadata management tools is presented, with their features and architecture.

A well-built metadata layer will allow an organization to harness the potential of the data lake and deliver the following mechanisms to the end users to access data and perform analysis:
- Self Service BI (SSBI)
- Data-as-a-Service (DaaS)
- Machine Learning-as-a-Service
- Data Provisioning (DP)

You can optimize your data lake to the fullest with metadata management.
Fig. 1: The metadata layer can access data – quality info, distribution info, data version – from multiple layers.
Various Metadata Management Tools

Here is a list of different metadata management tools that can capture metadata on a Hadoop cluster. There is no order of preference for this list; the tools are listed in a random order.

You can choose from metadata management tools widely available, as per your business requirement.

• Cloudera Navigator: Cloudera Navigator is a data governance solution for Hadoop, offering critical capabilities such as data discovery, continuous optimization, audit, lineage, metadata management, and policy enforcement. As part of Cloudera Enterprise, Cloudera Navigator is critical to enabling high-performance agile analytics, supporting continuous data architecture optimization, and meeting regulatory compliance requirements.

• Apache Atlas: Currently in the Apache incubator, this is a scalable and extensible component which can create, automate and define relationships on data for metadata in the data lake system. You can also export metadata to third-party systems from Atlas. It can be used for data discovery and lineage tracking.

• Apache Falcon: Falcon is aimed at making feed processing and feed management on Hadoop clusters easier for their end consumers.

• HCatalog: HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools like Pig and MapReduce to read and write data on the grid more easily.

• Loom: Loom provides metadata management and data preparation for Hadoop. The core of Loom is an extensible metadata repository for managing business and technical metadata, including data lineage, for all the data in Hadoop and surrounding systems. Loom's active scan framework automates the generation of metadata for Hadoop data by crawling HDFS to discover and introspect new files.

• Waterline: Waterline Data automates the creation and management of an inventory of data assets at the field level, empowering data architects to provide all the data the business needs through secure self-service. It ensures data governance policies are adhered to, by enabling data stewards to audit data lineage, protect sensitive data, and identify compliance issues.

• Ground Metadata (AMP Lab): Ground is a data context system, under development at the University of California, Berkeley. It is aimed at building a flexible, open source, vendor-neutral system that enables users to reason about what data they have, where that data is flowing to and from, who is using the data, when the data changed, and why and how the data is changing. Among other things, its developers believe a data context system is particularly useful for data inventory, data usage tracking, model-specific interpretation, reproducibility, interoperability, and collective governance.

Features of architecture of commonly used Metadata Tools

1. Cloudera Navigator

Cloudera Navigator is a proprietary tool from Cloudera for data management in the Hadoop Eco-System. It primarily provides two solutions in the area of Data Governance.

Fig. 2: A typical data lake layout – source systems feed data discovery, integration, data provisioning, transient/enrichment zones, a raw zone, the data hub, internal processing and external access, built over an Information Lifecycle Management layer, a Metadata Layer, and a Security & Governance Layer.

Data Management

Data management provides visibility into and control over the data residing in Hadoop data stores and the computations performed on that data. The features included here are:

• Auditing data access and verifying access privileges: The goal of auditing is to capture a complete and immutable record of all the activities within a system. Cloudera Navigator auditing features add secured, real-time audit components to key data and access frameworks. Cloudera Navigator allows compliance groups to configure, collect, and view audit events, and understand who accessed what data and how.

• Searching metadata and visualizing lineage - Cloudera Navigator metadata management features allow DBAs, data stewards, business analysts, and data scientists to define, search, and amend the properties of, and tag, data entities and to view relationships between datasets (a small query sketch follows Fig. 3).

• Policies - Cloudera Navigator policy features enable data stewards to specify automated actions based on data access or on a schedule to add metadata, create alerts, and move or purge data.

• Analytics - Cloudera Navigator analytics features enable Hadoop administrators to examine data usage patterns and create policies based on those patterns.

Data Encryption

Data encryption and key management provide a critical layer of protection against potential threats by malicious actors on the network or in the datacenter. Encryption and key management are also required for meeting key compliance initiatives and ensuring the integrity of your enterprise data. The following Cloudera Navigator components enable compliance groups to manage encryption:

• Cloudera Navigator Encrypt transparently encrypts and secures data at rest without requiring changes to your applications, and ensures there is minimal performance lag in the encryption or decryption process.

• Cloudera Navigator Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys and other security artifacts.

• Cloudera Navigator Key HSM allows Cloudera Navigator Key Trustee Server to seamlessly integrate with a hardware security module (HSM).
Cloudera Navigator Metadata Architecture

Fig. 3: Cloudera Navigator metadata architecture – sources such as HDFS, Hive, Impala, MapReduce, Oozie, Spark and YARN feed the Navigator Metadata Server (managed alongside Cloudera Manager), which serves the Navigator UI and Navigator API and persists metadata in the Navigator database and storage directory.
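As a hedged illustration of how the metadata search described above can be consumed programmatically, the sketch below queries the Navigator Metadata Server REST API for tagged HDFS entities. The host, port, API version, credentials and the Solr-style query are assumptions and should be verified against your Cloudera Navigator deployment.

# Hypothetical sketch: querying the Cloudera Navigator Metadata Server REST API
# for HDFS entities tagged as 'sensitive'. Host, port, API version and query
# syntax are assumptions -- verify them against your Navigator documentation.
import requests

NAVIGATOR_URL = "http://navigator-host.example.com:7187/api/v13/entities"  # assumed endpoint
AUTH = ("nav_user", "nav_password")  # Navigator uses HTTP basic authentication

def find_entities(query, limit=25):
    """Return metadata entities matching a Navigator (Solr-style) search query."""
    response = requests.get(
        NAVIGATOR_URL,
        params={"query": query, "limit": limit, "offset": 0},
        auth=AUTH,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Example: locate HDFS entities that data stewards tagged as 'sensitive'.
    for entity in find_entities("sourceType:HDFS AND tags:sensitive"):
        print(entity.get("originalName"), entity.get("type"))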
2. Apache Atlas

Atlas is a Data Governance initiative from Hortonworks for the Hadoop cluster. It was initially started by Hortonworks and then handed over to Apache as a top-level project. Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem.

Features

Data Classification
• Import or define taxonomy business-oriented annotations for data (a small tagging sketch follows this feature list)
• Define, annotate, and automate capture of relationships between datasets and underlying elements including source, target, and derivation processes
• Export metadata to third-party systems

Centralized Auditing
• Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities

Search & Lineage (Browse)
• Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locate relevant data and audit events across the data lake quickly and accurately
• Browse visualization of data set lineage allowing users to drill down into operational, security, and provenance related information

Security & Policy Engine
• Rationalize compliance policy at runtime based on data classification schemes, attributes and roles
• Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification) – Prohibitions
• Column and row level masking based on cell values and attributes
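As a minimal sketch of the tag-based classification feature above, the snippet below attaches a classification (e.g. "PII") to an existing entity through the Atlas v2 REST API, which tools such as Ranger can then act on. The host, credentials and GUID are placeholders; verify the endpoint path against your Atlas version.

# Minimal sketch: attach a classification (tag) to an Atlas entity via the v2
# REST API. Host, credentials and GUID are placeholders, not real values.
import requests

ATLAS = "http://atlas-host.example.com:21000"   # default Atlas server port
AUTH = ("admin", "admin")                        # placeholder credentials

def classify_entity(guid, classification):
    """Attach a classification (tag) to an existing Atlas entity by GUID."""
    url = f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications"
    resp = requests.post(url, json=[{"typeName": classification}], auth=AUTH, timeout=30)
    resp.raise_for_status()

# Tag-based policies (e.g. in Apache Ranger) can then key off the "PII" tag.
classify_entity("0f6e-example-guid", "PII")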
Fig. 4: Apache Atlas high-level view – a REST API in front of core services (knowledge store, taxonomies, data lifecycle management, tag-based policies) built on the type system.

Fig. 5: Apache Atlas components – REST API, bridges/connectors (Falcon, Sqoop and others), messaging framework, search and search DSL, type system, and a graph DB repository.
In terms of implementation, Atlas has the following components to accomplish the design.
• Web service: This exposes RESTful APIs and a web user interface to create, update and query metadata (a small search sketch follows this list).
• Metadata store: Metadata is modeled using a graph, implemented using the graph database Titan. Titan has options
for a variety of backing stores for persisting the graph, including an embedded
Berkeley DB, Apache HBase and Apache Cassandra. The choice of the backing store determines the level of service
availability.
• Index store: For powering full text searches on metadata, Atlas also indexes the metadata, again via Titan. For the full
text search feature, it can use backend systems like Elasticsearch or Apache Solr.
• Bridges/Hooks: To add metadata to Atlas, libraries called ‘hooks’ are enabled in various systems like Apache Hive,
Apache Falcon and Apache Sqoop, which capture metadata events in the respective systems and propagate them to
Atlas. The Atlas server consumes these events and updates its stores.
• Metadata notification events: Any updates to metadata in Atlas, either via the hooks or the API, are propagated from Atlas to downstream systems via events. Systems like Apache Ranger consume these events and allow administrators to act on them, e.g. to configure policies for access control.
• Notification server: Atlas uses Apache Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events. Events are written by the hooks and Atlas to different Kafka topics. Kafka enables a loosely coupled integration between these disparate systems.
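A small sketch of the web service described in the list above: a basic-search call against the Atlas v2 REST API returning Hive tables that carry a given classification. Host and credentials are placeholders, and the endpoint and response fields should be checked against your Atlas release.

# Sketch: Atlas v2 basic search for hive_table entities with a given tag.
import requests

ATLAS = "http://atlas-host.example.com:21000"
AUTH = ("admin", "admin")   # placeholder credentials

def search_tables(classification):
    """Run an Atlas v2 basic search for Hive tables carrying the given tag."""
    body = {"typeName": "hive_table", "classification": classification, "limit": 50}
    resp = requests.post(f"{ATLAS}/api/atlas/v2/search/basic", json=body, auth=AUTH, timeout=30)
    resp.raise_for_status()
    for entity in resp.json().get("entities", []):
        print(entity.get("guid"), entity.get("attributes", {}).get("qualifiedName"))

search_tables("PII")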
Bridges/Hooks

External components like Hive, Sqoop, Storm and Falcon should model their taxonomy using the type system and register the types with Atlas. For every entity created in such an external component, the corresponding entity should be registered in Atlas as well. This is typically done in a hook, which runs in the external component and is called for every entity operation. The hook generally processes the entity asynchronously using a thread pool to avoid adding latency to the main operation.
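The following sketch illustrates that asynchronous hook pattern: each entity operation is handed to a small thread pool, so the host component is not blocked while the entity is registered with Atlas over REST. Real hooks (Hive, Sqoop, Falcon) are Java components shipped with Atlas; the host, credentials and attributes here are placeholders, so treat this purely as an illustration of the design.

# Sketch of the asynchronous hook pattern: registration is offloaded to a
# thread pool so the main operation in the host component is not delayed.
from concurrent.futures import ThreadPoolExecutor
import requests

ATLAS = "http://atlas-host.example.com:21000"
AUTH = ("admin", "admin")                      # placeholder credentials
_pool = ThreadPoolExecutor(max_workers=4)      # keeps latency off the main operation

def _register(entity):
    """Synchronously create/update one entity in Atlas (runs on a worker thread)."""
    requests.post(f"{ATLAS}/api/atlas/v2/entity", json={"entity": entity},
                  auth=AUTH, timeout=30).raise_for_status()

def on_entity_operation(entity):
    """Called by the host component for every entity operation; returns immediately."""
    _pool.submit(_register, entity)

# Example: the host component just created a table, so the hook registers it.
# The type ("hive_table") must already be registered with Atlas.
on_entity_operation({
    "typeName": "hive_table",
    "attributes": {"qualifiedName": "sales.orders@cluster1", "name": "orders"},
})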
Atlas also exposes a notification interface, which hooks can use for reliable entity registration. The hook can send a notification message containing the list of entities to be registered. The Atlas service contains a hook consumer that listens to these messages and registers the entities.
Notification Server Design:
Notification is used for reliable entity registration from hooks and for entity/type change notifications. Atlas, by default, provides Kafka integration, but it is possible to provide other implementations as well. The Atlas service starts an embedded Kafka server by default. Atlas also provides a NotificationHookConsumer that runs in the Atlas service, listens to messages from the hooks, and registers the entities in Atlas.
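The sketch below shows a downstream notification listener playing the role Apache Ranger plays in Fig. 6, assuming Atlas' default ATLAS_ENTITIES Kafka topic and the third-party kafka-python client. The broker address, topic name and message layout are assumptions to verify for your Atlas version.

# Sketch of a downstream listener for Atlas entity-change notifications,
# using the kafka-python client. Topic name, broker address and message
# layout are assumptions for illustration only.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",                                # topic Atlas publishes entity changes to
    bootstrap_servers="kafka-host.example.com:9092",
    group_id="metadata-listener",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Entity-change notifications carry an operation type and the affected entity.
    print(event.get("type"), event.get("entity", {}).get("typeName"))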
Fig. 6: Entity/trait/type change notifications flow from hooks (e.g. Hive) through Apache Atlas to notification listeners such as Apache Ranger.
3. Apache Falcon
Apache Falcon addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracking by deploying a framework for data management and processing. Falcon centrally manages the data lifecycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters. It makes onboarding new workflows/pipelines simpler, with support for late data handling and retry policies. It allows users to easily define relationships between various data and processing elements and to integrate with a metastore/catalog such as Apache Hive/HCatalog. Finally, it also captures the lineage information for feeds and processes.
Following is the high level architecture of Apache Falcon.
Fig. 7: Apache Falcon high-level architecture – the Falcon engine coordinates replication, archival and retention of frequent feeds, feed monitoring, late data arrival, exception handling, lineage, and audit.
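To make the feed on-boarding described above concrete, here is a hedged sketch that submits a feed definition to Falcon's entity REST API and then schedules it, after which Falcon manages its retention and replication. The host, port (15000 is Falcon's usual embedded default), user name and feed XML file are placeholders; check the endpoint paths against the Falcon REST documentation for your release.

# Hedged sketch: submit and schedule a Falcon feed entity over REST.
# Host, port, user name and the feed XML file are placeholders.
import requests

FALCON = "http://falcon-host.example.com:15000"
PARAMS = {"user.name": "falcon"}   # Falcon's simple/pseudo authentication

def submit_and_schedule_feed(xml_path, feed_name):
    """Submit a feed entity definition to Falcon, then schedule it."""
    with open(xml_path, "rb") as handle:
        resp = requests.post(f"{FALCON}/api/entities/submit/feed",
                             params=PARAMS, data=handle.read(),
                             headers={"Content-Type": "text/xml"}, timeout=60)
    resp.raise_for_status()
    resp = requests.post(f"{FALCON}/api/entities/schedule/feed/{feed_name}",
                         params=PARAMS, timeout=60)
    resp.raise_for_status()

submit_and_schedule_feed("raw-clicks-feed.xml", "raw-clicks-feed")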
4. Waterline Data Inventory
Waterline Data is a data marketplace platform provider, combining an automated data inventory with a self-service data catalog. The inspiration for "Waterline" came from the metaphor of the data lake where data is hidden below the waterline. 80% of business value is created from Big Data by discovering data, and 80% of data is discovered by finding and understanding trusted data. The mission of Waterline Data is to accelerate time to value for data discovery by helping business analysts and data scientists find, understand, and provision information assets through self-service – without having to "dive" for data – and by helping data stewards provide agile data governance with automated and crowd-sourced business semantics and data lineage.
Fig. 8: Data hidden below the "waterline" of the data lake.
Enterprises are evolving their data lakes to encompass all enterprise information assets, creating a "data marketplace" for the business users – a logical data layer and catalog over physical data assets, to find, understand, and provision data assets at the speed of the business. Waterline Data is pioneering the "data marketplace" platform by providing smart data discovery to discover and crowd-source business metadata, data lineage, and infonomics, underlying a self-service business data catalog across on-premise and cloud data sources.
Features Summary
The table below lists the top features and the tools that support them.
Conclusion
In this digital age, the IT industry is heavily dependent on data and its innovative usage. As we capture enormous amounts of data and run analytics on it, a lot of inherent value is being realized every day in the business. As we keep storing data from various sources, managing it gets more difficult day by day. As data grows in complexity, metadata management becomes critical for companies. Community development is helping to create a variety of tools to address this challenge.
References
1. https://www.cloudera.com/products/cloudera-navigator.html
2. http://atlas.incubator.apache.org/
3. http://hortonworks.com/apache/atlas/
4. https://falcon.apache.org/
5. http://hortonworks.com/apache/falcon/
6. http://hortonworks.com/apache/ranger/
7. http://www.waterlinedata.com/
Satya Nayak
Project Lead, Mphasis
Satya Nayak has 11 years of experience in enterprise applications and delivering software solutions. Satya has provided a variety of solutions to various industry domains in his career. He has been working on Big Data and related technologies for 2.5 years. Currently, he is working as a Project Lead, which involves creating designs and building innovative solutions for client requirements in the Big Data space.
Satya enjoys exploring new domains and taking up challenging roles.
Satya is a Cloudera Certified Spark and Hadoop Developer (CCA-175) and a MapR Certified Spark Developer (MCSD).
Jansee Korapati
Module Lead, Mphasis
Jansee Korapati has 7+ years of experience in IT services, which includes
2 years in implementing Data Lakes using Big Data technologies like
Hadoop, Spark, Sqoop, Hive, Impala and Flume. She is an expert in
assessment and performance tuning of long-running queries in Hive on massive data.
Jansee has been working for Mphasis for 6 months as a Module Lead. She is responsible for the metadata management, ingestion and transformation layers of the current project.
About Mphasis
Mphasis is a global technology services and solutions company specializing in the areas of Digital, Governance and Risk & Compliance. Our solution focus and superior human capital propel our partnerships with large enterprise customers in their digital transformation journeys. We partner with global financial institutions in the execution of their risk and compliance strategies. We focus on next-generation technologies for differentiated solutions delivering optimized operations for clients.
460 Park Avenue South, Suite #1101, New York, NY 10016, USA. Tel.: +1 212 686 6655
226 Airport Parkway, San Jose, California 95110, USA
88 Wood Street, London EC2V 7RS, UK. Tel.: +44 20 8528 1000
Bagmane World Technology Center, Marathahalli Ring Road, Doddanakundhi Village, Mahadevapura, Bangalore 560 048, India. Tel.: +91 80 3352 5000
www.mphasis.com
Copyright © Mphasis Corporation. All rights reserved.