Cloudera Introduction
Important Notice
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service
names or slogans contained in this document are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part,
without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and
company names or logos mentioned in this document are the property of their
respective owners. Reference to any products, services, processes or other
information, by trade name, trademark, manufacturer, supplier or otherwise does
not constitute or imply endorsement, sponsorship or recommendation thereof by
us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Cloudera.
Cloudera, Inc.
1001 Page Mill Road Bldg 2
Palo Alto, CA 94304
info@cloudera.com
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: 5.4.x
Date: May 20, 2015
Table of Contents
CDH Overview.............................................................................................................7
Cloudera Impala Overview..........................................................................................................................7
Impala Benefits.........................................................................................................................................................8
How Cloudera Impala Works with CDH..................................................................................................................8
Primary Impala Features.........................................................................................................................................9
Cloudera Search Overview..........................................................................................................................9
How Cloudera Search Works.................................................................................................................................10
Cloudera Search and Other Cloudera Components............................................................................................11
Apache Sentry Overview...........................................................................................................................12
FAQs..........................................................................................................................39
Cloudera Express and Cloudera Enterprise Features............................................................................39
Cloudera Manager 5 Frequently Asked Questions................................................................................41
General Questions..................................................................................................................................................41
Cloudera Navigator 2 Frequently Asked Questions...............................................................................43
Cloudera Impala Frequently Asked Questions.......................................................................................44
Trying Impala...........................................................................................................................................................44
Impala System Requirements...............................................................................................................................45
Supported and Unsupported Functionality In Impala........................................................................................46
How do I?.................................................................................................................................................................47
Impala Performance...............................................................................................................................................48
Impala Use Cases....................................................................................................................................................50
Questions about Impala And Hive........................................................................................................................51
Impala Availability..................................................................................................................................................52
Impala Internals......................................................................................................................................................52
SQL...........................................................................................................................................................................55
Partitioned Tables..................................................................................................................................................56
HBase.......................................................................................................................................................................56
Cloudera Search Frequently Asked Questions.......................................................................................57
General.....................................................................................................................................................................57
Performance and Fail Over....................................................................................................................................58
Schema Management............................................................................................................................................59
Supportability..........................................................................................................................................................60
Getting Support........................................................................................................61
Cloudera Support.......................................................................................................................................61
Information Required for Logging a Support Case.............................................................................................61
Community Support..................................................................................................................................61
Get Announcements about New Releases.............................................................................................62
Report Issues.............................................................................................................................................62
About Cloudera Introduction
Documentation Overview
The following guides are included in the Cloudera documentation set:
Cloudera Introduction: Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise. Industry-leading Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.

Cloudera Release Guide: This guide contains release and download information for installers and administrators. It includes release notes as well as information about versions and downloads. The guide also provides a release matrix that shows which major and minor release version of a product is supported with which release version of Cloudera Manager, CDH and, if applicable, Cloudera Search and Cloudera Impala.

Cloudera QuickStart: This guide describes how to quickly install Cloudera software and create initial deployments for proof of concept (POC) or development. It describes how to download and use the QuickStart virtual machines, which provide everything you need to start a basic installation. It also shows you how to create a new installation of Cloudera Manager 5, CDH 5, and managed services on a cluster of four hosts. QuickStart installations should be used for demonstrations and POC applications only and are not recommended for production.

Cloudera Installation and Upgrade: This guide provides Cloudera software requirements and installation information for production deployments, as well as upgrade procedures. This guide also provides specific port information for Cloudera software.

Cloudera Administration: This guide describes how to configure and administer a Cloudera deployment. Administrators manage resources, availability, and backup and recovery configurations. In addition, this guide shows how to implement high availability, and discusses integration.

Cloudera Data Management: This guide describes how to perform data management using Cloudera Navigator. Data management activities include auditing access to data residing in HDFS and Hive metastores, reviewing and updating metadata, and discovering the lineage of data objects.

Cloudera Operation: This guide shows how to monitor the health of a Cloudera deployment and diagnose issues. You can obtain metrics and usage information and view processing activities. This guide also describes how to examine logs and reports to troubleshoot issues with cluster configuration and operation as well as monitor compliance.

Cloudera Security: This guide is intended for system administrators who want to secure a cluster using data encryption, user authentication, and authorization techniques. This guide also provides information about Hadoop security programs and shows you how to set up a gateway to restrict access.

Cloudera Impala Guide: This guide describes Cloudera Impala, its features and benefits, and how it works with CDH. It introduces Impala concepts, describes how to plan your Impala deployment, and provides tutorials for first-time users as well as more advanced tutorials that describe scenarios and specialized features. You will also find a language reference, performance tuning, instructions for using the Impala shell, troubleshooting information, and frequently asked questions.

Cloudera Search Guide: This guide provides Cloudera Search prerequisites, shows how to load and index data in Search, and shows how to use Search to query data. In addition, this guide provides a tutorial, various indexing references, and troubleshooting information.

Cloudera Glossary: This guide contains a glossary of terms for Cloudera components.
CDH Overview
CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH delivers
the core elements of Hadoop – scalable storage and distributed computing – along with a Web-based user
interface and vital enterprise capabilities. CDH is Apache-licensed open source and is the only Hadoop solution
to offer unified batch processing, interactive SQL and interactive search, and role-based access controls.
CDH provides:
• Flexibility - Store any type of data and manipulate it with a variety of different computation frameworks
including batch processing, interactive SQL, free text search, machine learning and statistical computation.
• Integration - Get up and running quickly on a complete Hadoop platform that works with a broad range of
hardware and software solutions.
• Security - Process and control sensitive data.
• Scalability - Enable a broad range of applications and scale and extend them to suit your requirements.
• High availability - Perform mission-critical business tasks with confidence.
• Compatibility - Leverage your existing IT infrastructure and investment.
Impala Benefits
Impala provides:
• Familiar SQL interface that data scientists and analysts already know
• Ability to interactively query big data stored in Apache Hadoop
• Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective
commodity hardware
• Ability to share data files between different components with no copy or export/import step; for example,
to write with Pig, transform with Hive and query with Impala
• Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just
for analytics
Unified management and monitoring with Cloudera Manager: Cloudera Manager provides unified and centralized management and monitoring for CDH and Cloudera Search. Cloudera Manager simplifies deployment, configuration, and monitoring of your search services. Many existing search solutions lack management and monitoring capabilities and fail to provide deep insight into utilization, system health, trending, and other supportability aspects.

Index storage in HDFS: Cloudera Search is integrated with HDFS for index storage. Indexes created by Solr/Lucene can be directly written in HDFS with the data, instead of to local disk, thereby providing fault tolerance and redundancy. Cloudera Search is optimized for fast read and write of indexes in HDFS while indexes are served and queried through standard Solr mechanisms. Because data and indexes are co-located, data processing does not require transport or separately managed storage.

Batch index creation through MapReduce: To facilitate index creation for large data sets, Cloudera Search has built-in MapReduce jobs for indexing data stored in HDFS. As a result, the linear scalability of MapReduce is applied to the indexing pipeline.

Real-time and scalable indexing at data ingest: Cloudera Search provides integration with Flume to support near real-time indexing. As new events pass through a Flume hierarchy and are written to HDFS, those events can be written directly to Cloudera Search indexers. In addition, Flume supports routing events, filtering, and annotation of data passed to CDH. These features work with Cloudera Search for improved index sharding, index separation, and document-level access control.

Easy interaction and data exploration through Hue: A Cloudera Search GUI is provided as a Hue plug-in, enabling users to interactively query data, view result files, and do faceted exploration. Hue can also schedule standing queries and explore index files. This GUI uses the Cloudera Search API, which is based on the standard Solr API.

Simplified data processing for Search workloads: Cloudera Search relies on Apache Tika for parsing and preparation of many of the standard file formats for indexing. Additionally, Cloudera Search supports Avro, Hadoop Sequence, and Snappy file format mappings, as well as Log file formats, JSON, XML, and HTML. Cloudera Search also provides data preprocessing using Morphlines, which simplifies index configuration for these formats. Users can use the configuration for other applications, such as MapReduce jobs.

HBase search: Cloudera Search integrates with HBase, enabling full-text search of stored data without affecting HBase performance. A listener monitors the replication event stream and captures each write or update-replicated event, enabling extraction and mapping. The event is then sent directly to Solr indexers and written to indexes in HDFS, using the same process as for other indexing workloads of Cloudera Search. The indexes can be served immediately, enabling near real-time search of HBase data.
during the mapping phase. Reducers use Solr to write the data as a single index or as index shards, depending
on your configuration and preferences. Once the indexes are stored in HDFS, they can be queried using standard
Solr mechanisms, as described earlier for the near-real-time indexing use case.
The Lily HBase Indexer Service is a flexible, scalable, fault tolerant, transactional, near real-time oriented system
for processing a continuous stream of HBase cell updates into live search indexes. Typically, the time between
data ingestion using the Flume sink to that content potentially appearing in search results is measured in
seconds, although this duration is tunable. The Lily HBase Indexer uses Solr to index data stored in HBase. As
HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the
HBase table contents, using standard HBase replication features. The indexer supports flexible custom
application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain
columnFamily:qualifier links back to the data stored in HBase. This way applications can use the Search
result set to directly access matching raw HBase cells. Indexing and searching do not affect operational stability
or write throughput of HBase because the indexing and searching processes are separate and asynchronous
to HBase.
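The listener-to-indexer flow described above can be sketched in miniature. The sketch below is illustrative only; the class and function names are invented here for explanation and are not the Lily HBase Indexer's actual API:

```python
# Toy model of the indexing flow: a replicated HBase cell event is mapped
# to a flat search document whose columnFamily:qualifier field name links
# search results back to the source cell. Illustrative names throughout.
from dataclasses import dataclass

@dataclass
class CellEvent:
    table: str
    row: bytes
    family: str
    qualifier: str
    value: bytes
    is_delete: bool = False

def to_search_doc(event):
    """Map a replicated cell event to a Solr-style document dict."""
    if event.is_delete:
        # Deletes keep the index consistent by removing the document.
        return {"delete": {"id": event.row.decode()}}
    return {
        "id": event.row.decode(),
        # columnFamily:qualifier field links the result back to HBase.
        f"{event.family}:{event.qualifier}": event.value.decode(),
    }

event = CellEvent("articles", b"row1", "content", "title", b"Search in CDH")
doc = to_search_doc(event)
```

Because the real indexer consumes the replication stream asynchronously, a mapping like this never sits on the HBase write path, which is why indexing does not affect HBase write throughput.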
Cloudera Manager 5 Overview
Terminology
To effectively use Cloudera Manager, you should first understand its terminology. The relationship between the
terms is illustrated below and their definitions follow:
Some of the terms, such as cluster and service, will be used without further explanation. Others, such as role
group, gateway, host template, and parcel are expanded upon in later sections.
Note: A common point of confusion is the type-instance nature of "service" and "role". Cloudera
Manager and this primer sometimes use the same term for type and instance. For example, the
Cloudera Manager Admin Console Clusters > ClusterName menu actually lists service instances. This
is similar to the practice in programming languages where for example the term "string" may indicate
either a type (java.lang.String) or an instance of that type ("hi there"). When it's necessary to
distinguish between types and instances, the word "type" is appended to indicate a type and the word
"instance" is appended to explicitly indicate an instance.
deployment
A configuration of Cloudera Manager and all the clusters it manages.
cluster
A logical entity that contains a set of hosts, a single version of CDH installed on the hosts, and the service and
role instances running on the hosts. A host can belong to only one cluster.
host
A physical or virtual machine that runs role instances.
rack
A physical entity that contains a set of physical hosts typically served by the same switch.
service
A category of managed functionality, which may be distributed or not, running in a cluster. Sometimes referred
to as a service type. For example: MapReduce, HDFS, YARN, Spark, Accumulo. Whereas in traditional environments
multiple services run on one host, in distributed systems a service runs on many hosts.
service instance
An instance of a service running on a cluster. A service instance spans many role instances. For example: "HDFS-1"
and "yarn".
role
A category of functionality within a service. For example, the HDFS service has the following roles: NameNode,
SecondaryNameNode, DataNode, and Balancer. Sometimes referred to as a role type.
role instance
An instance of a role running on a host. It typically maps to a Unix process. For example: "NameNode-h1" and
"DataNode-h1".
role group
A set of configuration properties for a set of role instances.
host template
A set of role groups. When a template is applied to a host, a role instance from each role group is created and
assigned to that host.
gateway
A role that designates a host that should receive a client configuration for a service when the host does not
have any role instances for that service running on it.
parcel
A binary distribution format that contains compiled code and meta-information such as a package description,
version, and dependencies.
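The relationships among these terms can be sketched with a few data structures. This is an explanatory toy model, not Cloudera Manager's API; the class names are invented here:

```python
# Illustrative model of the terminology: a cluster holds service instances,
# a service instance spans role instances, and each role instance runs on
# one host (typically mapping to a single Unix process).
from dataclasses import dataclass, field

@dataclass
class RoleInstance:
    name: str        # e.g. "NameNode-h1"
    role_type: str   # e.g. "NameNode"
    host: str

@dataclass
class ServiceInstance:
    name: str          # e.g. "HDFS-1"
    service_type: str  # e.g. "HDFS"
    roles: list = field(default_factory=list)

@dataclass
class Cluster:
    name: str
    cdh_version: str
    services: list = field(default_factory=list)

hdfs = ServiceInstance("HDFS-1", "HDFS", [
    RoleInstance("NameNode-h1", "NameNode", "h1"),
    RoleInstance("DataNode-h1", "DataNode", "h1"),
])
cluster = Cluster("Cluster 1", "CDH 5.4", [hdfs])
# A host can belong to only one cluster.
hosts = {r.host for s in cluster.services for r in s.roles}
```

Note how the type-instance distinction from the note above appears directly: "HDFS" is a service type, while "HDFS-1" is a service instance.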
The host tcdn501-1 is the "master" host for the cluster, so it has many more role instances, 21, compared with
the 7 role instances running on the other hosts. In addition to the CDH "master" role instances, tcdn501-1 also
has Cloudera Management Service roles:
Architecture
As depicted below, the heart of Cloudera Manager is the Cloudera Manager Server. The Server hosts the Admin
Console Web Server and the application logic. It is responsible for installing software, configuring, starting, and
stopping services, and managing the cluster on which the services run.
Heartbeating
Heartbeats are a primary communication mechanism in Cloudera Manager. By default the Agents send heartbeats
every 15 seconds to the Cloudera Manager Server. However, to reduce user latency the frequency is increased
when state is changing.
During the heartbeat exchange, the Agent notifies the Cloudera Manager Server of the actions it is performing. In
turn, the Cloudera Manager Server responds with the actions the Agent should be performing. Both the Agent
and the Cloudera Manager Server end up doing some reconciliation. For example, if you start a service, the Agent
attempts to start the relevant processes; if a process fails to start, the server marks the start command as
having failed.
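The reconciliation described above amounts to comparing the Agent's reported state with the Server's desired state. A minimal sketch, assuming set-like state on both sides (this is a toy model, not the actual heartbeat protocol):

```python
# Toy reconciliation: the Agent reports what it is running, the Server
# replies with what should be running, and the difference becomes the
# Agent's work list.
def reconcile(agent_running, server_desired):
    """Return (to_start, to_stop) for the Agent after a heartbeat."""
    to_start = sorted(server_desired - agent_running)
    to_stop = sorted(agent_running - server_desired)
    return to_start, to_stop

agent_running = {"DataNode", "NodeManager"}
server_desired = {"DataNode", "NodeManager", "HBase RegionServer"}
to_start, to_stop = reconcile(agent_running, server_desired)
```

If a process in `to_start` then fails to launch, the next heartbeat still reports it absent, which is how the Server comes to mark the start command as failed.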
State Management
The Cloudera Manager Server maintains the state of the cluster. This state can be divided into two categories:
"model" and "runtime", both of which are stored in the Cloudera Manager Server database.
Cloudera Manager models CDH and managed services: their roles, configurations, and inter-dependencies. Model
state captures what is supposed to run where, and with what configurations. For example, model state captures
the fact that a cluster contains 17 hosts, each of which is supposed to run a DataNode. You interact with the
model through the Cloudera Manager Admin Console configuration screens and API, and through operations such
as "Add Service".
Runtime state is what processes are running where, and what commands (for example, rebalance HDFS or
execute a Backup/Disaster Recovery schedule or rolling restart or stop) are currently being executed. The runtime
state includes the exact configuration files needed to run a process. When you press "Start" in Cloudera Manager
Admin Console, the server gathers up all the configuration for the relevant services and roles, validates it,
generates the configuration files, and stores them in the database.
When you update a configuration (for example, the Hue Server web port), you've updated the model state.
However, if Hue is running while you do this, it's still using the old port. When this kind of mismatch occurs, the
role is marked as having an "outdated configuration". To resynchronize, you restart the role (which triggers the
configuration re-generation and process restart).
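The "outdated configuration" flag is, in essence, a comparison between model state and runtime state. A hedged sketch (the property names and the comparison are illustrative, not Cloudera Manager's internal representation):

```python
# Toy model-vs-runtime check: when the configuration stored in the model
# no longer matches what the running process was started with, the role
# is flagged as outdated until it is restarted.
def config_status(model_config, runtime_config):
    return "UP_TO_DATE" if model_config == runtime_config else "OUTDATED"

model = {"hue_server_web_port": 8889}    # updated via the Admin Console
runtime = {"hue_server_web_port": 8888}  # port the running Hue still uses
status = config_status(model, runtime)

# Restarting the role regenerates its files from the model, resolving this.
runtime = dict(model)
```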
While Cloudera Manager models all of the reasonable configurations, inevitably there are some cases that require
special handling. To let you work around, for example, a bug or explore unsupported options, Cloudera
Manager supports an "advanced configuration snippet" mechanism that lets you add properties directly to the
configuration files.
Configuration Management
Cloudera Manager defines configuration at several levels:
• The service level may define configurations that apply to the entire service instance, such as an HDFS service's
default replication factor (dfs.replication).
• The role group level may define configurations that apply to the member roles, such as the DataNodes' handler
count (dfs.datanode.handler.count). This can be set differently for different groups of DataNodes. For
example, DataNodes running on more capable hardware may have more handlers.
• The role instance level may override configurations that it inherits from its role group. This should be used
sparingly, because it easily leads to configuration divergence within the role group. One example usage is to
temporarily enable debug logging in a specific role instance to troubleshoot an issue.
• Hosts have configurations related to monitoring, software management, and resource management.
• Cloudera Manager itself has configurations related to its own administrative operations.
Role Groups
It is possible to set configuration at the service instance (for example, HDFS) or role instance (for example, the
DataNode on host17). An individual role inherits the configurations set at the service level. Configurations made
at the role level override those inherited from the service level. While this approach offers flexibility, it is tedious
to configure a set of role instances in the same way.
Cloudera Manager supports role groups, a mechanism for assigning configurations to a group of role instances.
The members of those groups then inherit those configurations. For example, in a cluster with heterogeneous
hardware, a DataNode role group can be created for each host type and the DataNodes running on those hosts
can be assigned to their corresponding role group. That makes it possible to set the configuration for all the
DataNodes running on the same hardware by modifying the configuration of one role group. The HDFS service
discussed earlier has role groups defined for each of the service's roles.
In addition to making it easy to manage the configuration of subsets of roles, role groups also make it possible
to maintain different configurations for experimentation or managing shared clusters for different users and/or
workloads.
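The inheritance chain described in the last two sections can be sketched as a layered lookup: the role instance overrides its role group, which overrides the service level. The property names below are real HDFS settings; the resolution code itself is an illustration, not Cloudera Manager's implementation:

```python
# Layered configuration resolution: the most specific layer that defines
# a property wins. Instance-level overrides should be rare.
def resolve(service_cfg, role_group_cfg, instance_cfg, key):
    for layer in (instance_cfg, role_group_cfg, service_cfg):
        if key in layer:
            return layer[key]
    raise KeyError(key)

service = {"dfs.replication": 3}
# A role group for DataNodes on more capable hardware gets more handlers.
big_hw_group = {"dfs.datanode.handler.count": 10}
instance_override = {}  # e.g. temporary debug logging on one instance

value = resolve(service, big_hw_group, instance_override,
                "dfs.datanode.handler.count")
```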
Host Templates
In typical environments, sets of hosts have the same hardware and the same set of services running on them.
A host template defines a set of role groups (at most one of each type) in a cluster and provides two main
benefits:
• Adding new hosts to clusters easily - multiple hosts can have roles from different services created, configured,
and started in a single operation.
• Altering the configuration of roles from different services on a set of hosts easily - which is useful for quickly
switching the configuration of an entire cluster to accommodate different workloads or users.
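Applying a host template can be pictured as stamping out one role instance per role group onto the target host. A minimal sketch with invented names (not the real API):

```python
# Toy host-template application: the template names at most one role group
# per role type; applying it creates one role instance from each group.
def apply_template(template, host):
    """Return the names of role instances created on `host`."""
    return [f"{group['role_type']}-{host}" for group in template]

worker_template = [
    {"role_type": "DataNode", "role_group": "DataNode Default Group"},
    {"role_type": "NodeManager", "role_group": "NodeManager Default Group"},
]
created = apply_template(worker_template, "host17")
```

Each created instance inherits its configuration from the role group named in the template, which is what makes adding identically configured hosts a single operation.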
In contrast, the HDFS role instances (for example, NameNode and DataNode) obtain their configurations from
a private per-process directory, under /var/run/cloudera-scm-agent/process/unique-process-name. Giving
each process its own private execution and configuration environment allows Cloudera Manager to control each
process independently. For example, here are the contents of an example 879-hdfs-NAMENODE process directory:
$ tree -a /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
/var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
├── cloudera_manager_agent_fencer.py
├── cloudera_manager_agent_fencer_secret_key.txt
├── cloudera-monitor.properties
├── core-site.xml
├── dfs_hosts_allow.txt
├── dfs_hosts_exclude.txt
├── event-filter-rules.json
├── hadoop-metrics2.properties
├── hdfs.keytab
├── hdfs-site.xml
├── log4j.properties
├── logs
│   ├── stderr.log
│   └── stdout.log
├── topology.map
└── topology.py
There are several advantages to distinguishing between server and client configuration:
• Sensitive information in the server-side configuration, such as the password for the Hive Metastore RDBMS,
is not exposed to the clients.
• A service that depends on another service may deploy with customized configuration. For example, to get
good HDFS read performance, Cloudera Impala needs a specialized version of the HDFS client configuration,
which may be harmful to a generic client. This is achieved by separating the HDFS configuration for the
Impala daemons (stored in the per-process directory mentioned above) from that of the generic client
(/etc/hadoop/conf).
• Client configuration files are much smaller and more readable. This also avoids confusing non-administrator
Hadoop users with irrelevant server-side properties.
Process Management
In a non-Cloudera Manager managed cluster, you most likely start a role instance using an init script, for example,
service hadoop-hdfs-datanode start. Cloudera Manager does not use init scripts for the daemons it
manages; in a Cloudera Manager managed cluster, starting and stopping services using init scripts will not
work.
In a Cloudera Manager managed cluster you can only start or stop services via Cloudera Manager. Cloudera
Manager uses an open source process management tool called supervisord, which takes care of redirecting
log files, notifying of process failure, setting the effective user ID of the calling process to the right user, and so
on. Cloudera Manager supports automatically restarting a crashed process. It will also flag a role instance with
a bad state if it crashes repeatedly right after start up.
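The "crashes repeatedly right after start up" check can be sketched as simple flap detection. The thresholds below are invented for illustration; they are not Cloudera Manager's actual values:

```python
# Toy flap detection: if a process exits shortly after start several times
# in a row, flag the role instance as bad instead of restarting forever.
def flap_status(uptimes_seconds, quick_exit=60, max_quick_exits=3):
    quick = [u for u in uptimes_seconds if u < quick_exit]
    return "BAD" if len(quick) >= max_quick_exits else "RESTARTING"

status = flap_status([5, 8, 4])  # three crashes right after startup
```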
It is worth noting that stopping Cloudera Manager and the Cloudera Manager Agents will not bring down your
cluster; any running instances will keep running.
One of the Agent's main responsibilities is to start and stop processes. When the Agent detects a new process
from the heartbeat, the Agent creates a directory for it in /var/run/cloudera-scm-agent and unpacks the
configuration. These actions reflect an important point: a Cloudera Manager process never travels alone. In other
words, a process is more than just the arguments to exec()—it also includes configuration files, directories
that need to be created, and other information.
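The "a process never travels alone" point can be made concrete: the unit the Agent launches bundles the exec() arguments with the configuration files and directories it needs. The structure below is an illustrative sketch, not the Agent's real on-disk format:

```python
# Toy "process spec": before launching, the Agent materializes a private
# per-process directory containing everything the process depends on.
import os
import tempfile

process_spec = {
    "name": "879-hdfs-NAMENODE",
    "args": ["hdfs", "namenode"],              # what exec() would receive
    "config_files": {"hdfs-site.xml": "<configuration/>"},
    "dirs": ["logs"],
}

def unpack(spec, base):
    """Create the per-process directory and write its config files."""
    pdir = os.path.join(base, spec["name"])
    for d in spec["dirs"]:
        os.makedirs(os.path.join(pdir, d), exist_ok=True)
    for fname, body in spec["config_files"].items():
        with open(os.path.join(pdir, fname), "w") as f:
            f.write(body)
    return pdir

base = tempfile.mkdtemp()
pdir = unpack(process_spec, base)
```

Giving each process its own private directory, as in the NameNode listing shown earlier, is what lets Cloudera Manager control each process independently.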
The Agent itself is started by init.d at start-up. It, in turn, contacts the server and figures out what processes
should be running. The Agent is monitored as part of Cloudera Manager's host monitoring: if the Agent stops
heartbeating, the host will be marked as having bad health.
• Parcels can be installed at any location in the filesystem and by default are installed in
/opt/cloudera/parcels. In contrast, packages are installed in /usr/lib.
As a consequence of their unique properties, parcels offer a number of advantages over packages:
• CDH is distributed as a single object - In contrast to having a separate package for each part of CDH, when
using parcels there is just a single object to install. This is especially useful when managing a cluster that
isn't connected to the Internet.
• Internal consistency - All CDH components are matched so there isn't a danger of different parts coming
from different versions of CDH.
• Installation outside of /usr - In some environments, Hadoop administrators do not have privileges to install
system packages. In the past, these administrators had to fall back to CDH tarballs, which deprived them of
a lot of infrastructure that packages provide. With parcels, administrators can install to /opt or anywhere
else without having to step through all the additional manual steps of regular tarballs.
Note: With parcel software distribution, the path to the CDH libraries is
/opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib. You should not link /usr/lib/
elements to parcel deployed paths, as such links may confuse scripts that distinguish between
the two paths.
• Installation of CDH without sudo - Parcel installation is handled by the Cloudera Manager Agent running as
root so it's possible to install CDH without needing sudo.
• Decouples distribution from activation - Due to side-by-side install capabilities, it is possible to stage a new
version of CDH across the cluster in advance of switching over to it. This allows the longest-running part of
an upgrade to be done ahead of time without affecting cluster operations, consequently reducing the downtime
associated with an upgrade.
• Rolling upgrades - These are only possible with parcels, due to their side-by-side nature. Packages require
shutting down the old process, upgrading the package, and then starting the new process. This can be hard
to recover from in the event of errors and requires extensive integration with the package management
system to function seamlessly. When a new version is staged side-by-side, switching to a new minor version
is simply a matter of changing which version of CDH is used when restarting each process. It then becomes
practical to do upgrades with rolling restarts, where service roles are restarted in the right order to switch
over to the new version with minimal service interruption. Your cluster can continue to run on the existing
installed components while you stage a new version across your cluster, without impacting your current
operations. Note that major version upgrades (for example, CDH 4 to CDH 5) require full service restarts due
to the substantial changes between the versions. Finally, you can upgrade individual parcels, or multiple
parcels at the same time.
• Easy downgrades - Reverting to an older minor version can be as simple as upgrading. Note that some
CDH components may require explicit additional steps due to schema upgrades.
• Upgrade management - Cloudera Manager can fully manage all the steps involved in a CDH version upgrade.
In contrast, with packages, Cloudera Manager can only help with initial installation.
• Distributing additional components - Parcels are not limited to CDH. Cloudera Impala, Cloudera Search, LZO,
and add-on service parcels are also available.
• Compatibility with other distribution tools - If there are specific reasons to use other tools for download
and/or distribution, you can do so, and Cloudera Manager will work alongside your other tools. For example,
you can handle distribution with Puppet. Or, you can download the parcel to Cloudera Manager Server manually
(perhaps because your cluster has no Internet connectivity) and then have Cloudera Manager distribute the
parcel to the cluster.
Host Management
Cloudera Manager provides several features to manage the hosts in your Hadoop clusters. The first time you
run the Cloudera Manager Admin Console, you can search for hosts to add to the cluster; once the hosts are
selected, you can map the assignment of CDH roles to hosts. Cloudera Manager automatically deploys to the
hosts all software required to participate as a managed host in a cluster: the JDK, the Cloudera Manager Agent,
CDH, Impala, Solr, and so on.
Once the services are deployed and running, the Hosts area within the Admin Console shows the overall status
of the managed hosts in your cluster. The information provided includes the version of CDH running on the host,
the cluster to which the host belongs, and the number of roles running on the host. Cloudera Manager provides
operations to manage the life cycle of the participating hosts and to add and delete hosts. The Cloudera
Management Service Host Monitor role performs health tests and collects host metrics to allow you to monitor
the health and performance of the hosts.
Resource Management
Resource management helps ensure predictable behavior by defining the impact of different services on cluster
resources. The goals of resource management features are to:
• Guarantee completion in a reasonable time frame for critical workloads
• Support reasonable cluster scheduling between groups of users based on fair allocation of resources per
group
• Prevent users from depriving other users of access to the cluster
Cloudera Manager 4 introduced the ability to partition resources across HBase, HDFS, Impala, MapReduce, and
YARN services by allowing you to set configuration properties that were enforced by Linux control groups (Linux
cgroups). With Cloudera Manager 5, the ability to statically allocate resources using cgroups is configurable
through a single static service pool wizard. You allocate services a percentage of total resources and the wizard
configures the cgroups.
Static service pools isolate the services in your cluster from one another, so that load on one service has a
bounded impact on other services. Services are allocated a static percentage of total resources—CPU, memory,
and I/O weight—which are not shared with other services. When you configure static service pools, Cloudera
Manager computes recommended memory, CPU, and I/O configurations for the worker roles of the services
that correspond to the percentage assigned to each service. Static service pools are implemented per role group
within a cluster, using Linux control groups (cgroups) and cooperative memory limits (for example, Java maximum
heap sizes). Static service pools can be used to control access to resources by HBase, HDFS, Impala, MapReduce,
Solr, Spark, YARN, and add-on services. Static service pools are not enabled by default.
For example, the following figure illustrates static pools for HBase, HDFS, Impala, and YARN services that are
respectively assigned 20%, 30%, 20%, and 30% of cluster resources.
Using dynamic resource pools, Cloudera Manager lets you dynamically apportion the resources that are statically
allocated to YARN and Impala.
Depending on the version of CDH you are using, dynamic resource pools in Cloudera Manager support the
following resource management (RM) scenarios:
• (CDH 5) YARN Independent RM - YARN manages the virtual cores, memory, running applications, and scheduling
policy for each pool. In the preceding diagram, three dynamic resource pools - Dev, Product, and Mktg with
weights 3, 2, and 1 respectively - are defined for YARN. If an application starts and is assigned to the Product
pool, and other applications are using the Dev and Mktg pools, the Product resource pool will receive 30% x
2/6 (or 10%) of the total cluster resources. If there are no applications using the Dev and Mktg pools, the
YARN Product pool will be allocated 30% of the cluster resources.
• (CDH 5) YARN and Impala Integrated RM - YARN manages memory for pools running Impala queries; Impala
limits the number of running and queued queries in each pool. In the YARN and Impala integrated RM scenario,
Impala services can reserve resources through YARN, effectively sharing the static YARN service pool and
resource pools with YARN applications. The integrated resource management scenario, where both YARN
and Impala use the YARN resource management framework, requires the Impala Llama role.
In the following figure, the YARN and Impala services have a 50% static share which is subdivided among the
original resource pools with an additional resource pool designated for the Impala service. If YARN applications
are using all the original pools, and Impala uses its designated resource pool, Impala queries will have the
same resource allocation 50% x 4/8 = 25% as in the first scenario. However, when YARN applications are not
using the original pools, Impala queries will have access to 50% of the cluster resources.
Note:
When using YARN with Impala, Cloudera recommends using the static partitioning technique
(through a static service pool) rather than the combination of YARN and Llama. YARN is a central,
synchronous scheduler and thus introduces higher latency and variance, which makes it better suited
for batch processing than for interactive workloads like Impala (especially with higher concurrency).
Currently, YARN allocates memory throughout the query, making it hard to reason about
out-of-memory and timeout conditions.
• (CDH 5) YARN and Impala Independent RM - YARN manages the virtual cores, memory, running applications,
and scheduling policy for each pool; Impala manages memory for pools running queries and limits the number
of running and queued queries in each pool.
• (CDH 5 and CDH 4) Impala Independent RM - Impala manages memory for pools running queries and limits
the number of running and queued queries in each pool.
The scenarios where YARN manages resources, whether for independent RM or integrated RM, map to the YARN
scheduler configuration. The scenarios where Impala independently manages resources employ the Impala
admission control feature.
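The weighted fair-share arithmetic used in these scenarios can be sketched in a few lines. This is an illustrative sketch, not Cloudera code; the pool names and weights are taken from the Dev/Product/Mktg example above:

```python
def pool_share(service_share, weights, pool, active_pools):
    """Fraction of total cluster resources a pool receives: the service's
    static share, split among the currently active pools by weight."""
    total_weight = sum(weights[p] for p in active_pools)
    return service_share * weights[pool] / total_weight

weights = {"Dev": 3, "Product": 2, "Mktg": 1}

# All three pools busy: Product gets 30% x 2/6 = 10% of the cluster.
busy = pool_share(0.30, weights, "Product", ["Dev", "Product", "Mktg"])

# Product alone: it receives the YARN service's full 30% share.
alone = pool_share(0.30, weights, "Product", ["Product"])
```

The same function reproduces the integrated-RM figure's 50% x 4/8 = 25% result when the Impala pool is one of eight equally weighted active pools.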
User Management
Access to Cloudera Manager features is controlled by user accounts. A user account identifies how a user is
authenticated and determines what privileges are granted to the user.
Cloudera Manager provides several mechanisms for authenticating users. You can configure Cloudera Manager
to authenticate users against the Cloudera Manager database or against an external authentication service.
The external authentication service can be an LDAP server (Active Directory or an OpenLDAP compatible directory),
or you can specify another external service. Cloudera Manager also supports using the Security Assertion Markup
Language (SAML) to enable single sign-on.
A user account can be assigned one of the following roles:
• Auditor - View data and audit events in Cloudera Manager.
• Read-Only - View monitoring information, as well as data.
• Limited Operator - Decommission hosts, as well as view service and monitoring information and data.
• Operator - Decommission and recommission hosts and roles, as well as view service and monitoring
information and data.
• Configurator - Perform the Operator operations described above, configure services, enter and exit maintenance
mode, and manage dashboards.
• Cluster Administrator - View all data and perform all actions except the following: administer Cloudera
Navigator, perform replication and snapshot operations, view audit events, manage user accounts, and
configure external authentication.
• BDR Administrator - View service and monitoring information and data, and perform replication and snapshot
operations.
• Navigator Administrator - View service and monitoring information and data, view audit events, and administer
Cloudera Navigator.
• User Administrator - Manage user accounts and configure external authentication, as well as view service
and monitoring information and data.
• Full Administrator - View all data and do all actions, including reconfiguring and restarting services, and
administering other users.
For more information about the privileges associated with each of the Cloudera Manager user roles, see Cloudera
Manager User Roles.
Security Management
Cloudera Manager strives to consolidate security configurations across several projects.
Authentication
The purpose of authentication in Hadoop, as in other systems, is simply to prove that a user or service is who
he or she claims to be.
Typically, authentication in enterprises is managed through a single distributed system, such as a Lightweight
Directory Access Protocol (LDAP) directory. LDAP authentication consists of straightforward username/password
services backed by a variety of storage systems, ranging from file to database.
A common enterprise-grade authentication system is Kerberos. Kerberos provides strong security benefits
including capabilities that render intercepted authentication packets unusable by an attacker. It virtually
eliminates the threat of impersonation by never sending a user's credentials in cleartext over the network.
Several components of the Hadoop ecosystem are converging to use Kerberos authentication with the option
to manage and store credentials in LDAP or AD. For example, Microsoft's Active Directory (AD) is an LDAP directory
that also provides Kerberos authentication for added security.
Authorization
Authorization is concerned with who or what has access or control over a given resource or service. Because
Hadoop merges the capabilities of multiple, varied, and previously separate IT systems into an enterprise data
hub that stores and works on all data within an organization, it requires multiple authorization controls with
varying granularities. In such cases, Hadoop management tools simplify setup and maintenance by:
• Tying all users to groups, which can be specified in existing LDAP or AD directories.
• Providing role-based access control for similar interaction methods, like batch and interactive SQL queries.
For example, Apache Sentry permissions apply to Hive (HiveServer2) and Impala.
CDH currently provides the following forms of access control:
• Traditional POSIX-style permissions for directories and files, where each directory and file is assigned a single
owner and group. Each assignment has a basic set of permissions available; file permissions are simply read,
write, and execute, and directories have an additional permission to determine access to child directories.
• Extended Access Control Lists (ACLs) for HDFS that provide fine-grained control of permissions for HDFS
files by allowing you to set different permissions for specific named users and/or named groups.
• Apache HBase uses ACLs to authorize various operations (READ, WRITE, CREATE, ADMIN) by column,
column family, and column family qualifier. HBase ACLs are granted and revoked to both users and groups.
• Role-based access control with Apache Sentry.
Encryption
The goal of encryption is to ensure that only authorized users can view, use, or contribute to a data set. These
security controls add another layer of protection against potential threats by end-users, administrators and
other malicious actors on the network. Data protection can be applied at a number of levels within Hadoop:
• OS Filesystem-level - Encryption can be applied at the Linux operating system file system level to cover all
files in a volume. An example of this approach is Cloudera Navigator Encrypt (formerly Gazzang zNcrypt)
which is available for Cloudera customers licensed for Cloudera Navigator. Navigator Encrypt operates at the
Linux volume level, so it can encrypt cluster data inside and outside HDFS, such as temp/spill files,
configuration files and metadata databases (to be used only for data related to a CDH cluster). Navigator
Encrypt must be used with Navigator Key Trustee (formerly Gazzang zTrustee).
• HDFS-level - Encryption applied by the HDFS client software. HDFS Data At Rest Encryption operates at the
HDFS folder level, enabling encryption to be applied only to the HDFS folders where it is needed. It cannot
encrypt any data outside HDFS. To ensure reliable key storage (so that data is not lost), Navigator Key Trustee
should be used, while the default Java keystore can be used for test purposes.
• Network-level - Encryption can be applied to encrypt data just before it gets sent across a network and to
decrypt it as soon as it is received. In Hadoop this means coverage for data sent from client user interfaces
as well as service-to-service communication like remote procedure calls (RPCs). This protection uses
industry-standard protocols such as SSL/TLS.
Health Tests
Cloudera Manager monitors the health of the services, roles, and hosts that are running in your clusters via
health tests. The Cloudera Management Service also provides health tests for its roles. Role-based health tests
are enabled by default. For example, a simple health test is whether there's enough disk space in every NameNode
data directory. A more complicated health test may evaluate how long ago the last HDFS checkpoint occurred
compared to a threshold, or whether a DataNode is connected to a NameNode. Some of these health tests also
aggregate other health tests: in a distributed system like HDFS, it's normal to have a few DataNodes down
(assuming you've got dozens of hosts), so Cloudera Manager lets you set thresholds on what percentage of
hosts can be down before the entire service is marked as unhealthy.
Health tests can return one of three values: Good, Concerning, and Bad. A test returns Concerning health if the
test falls below a warning threshold. A test returns Bad if the test falls below a critical threshold. The overall
health of a service or role instance is a roll-up of its health tests. If any health test is Concerning (but none are
Bad) the role's or service's health is Concerning; if any health test is Bad, the service's or role's health is Bad.
In the Cloudera Manager Admin Console, health test results are indicated with colors: Good, Concerning,
and Bad.
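The roll-up rule described above amounts to taking the worst individual result. A minimal sketch of that rule, assuming only the three health values named in this section:

```python
# Ordering of health values from best to worst, per the roll-up rule.
SEVERITY = {"Good": 0, "Concerning": 1, "Bad": 2}

def roll_up(results):
    """Overall health of a role or service: the worst of its health tests.
    An empty result set is treated as Good."""
    if not results:
        return "Good"
    return max(results, key=SEVERITY.__getitem__)
```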
One common question is whether monitoring can be separated from configuration. One of the goals for monitoring
is to enable it without requiring additional configuration or additional tools (for example, Nagios). By having a
deep model of the configuration, Cloudera Manager is able to know which directories to monitor, which ports to
use, and what credentials to use for those ports. This tight coupling means that when you install Cloudera
Manager, all the monitoring is enabled.
Some metrics (for example, total_cpu_seconds) are counters, and the appropriate way to query them is to
take their rate over time, which is why many metrics queries contain the dt0 function. For example,
dt0(total_cpu_seconds). (The dt0 syntax is intended to remind you of derivatives. The 0 indicates that
because the counter is monotonically increasing, its rate should never be negative.)
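The behavior described for dt0 can be approximated as follows. This is a sketch of the semantics only, not Cloudera's implementation; the (timestamp, value) sample format is an assumption for illustration:

```python
def dt0(samples):
    """Approximate the dt0 rate function: the per-interval rate of change of
    a counter, clamped at zero so that counter resets or restarts never
    produce negative rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append(max(0.0, (v1 - v0) / (t1 - t0)))
    return rates
```

For example, a counter that climbs from 0 to 50 over ten seconds and then resets downward yields a rate of 5.0 followed by 0.0, not a negative rate.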
In addition to a link to the Home page, the Cloudera Manager Admin Console top
navigation bar provides the following features:
• Clusters > ClusterName
– Services - Display individual services, and the Cloudera Management Service. In these pages you can:
– View the status and other details of a service instance or the role instances associated with the service
– Make configuration changes to a service instance, a role, or a specific role instance
– Add and delete a service or role
– Stop, start, or restart a service or role.
– View the commands that have been run for a service or a role
– View an audit event history
– Deploy and download client configurations
– Decommission and recommission role instances
– Enter or exit maintenance mode
– Perform actions unique to a specific type of service. For example:
– Enable HDFS high availability or NameNode federation
– Run the HDFS Balancer
– Create HBase, Hive, and Sqoop directories
– Reports - Create reports about HDFS, MapReduce, YARN, and Impala usage; browse HDFS files; and
manage quotas for HDFS directories.
– ImpalaServiceName Queries - Query information about Impala queries running on your cluster.
– MapReduceServiceName Jobs - Query information about MapReduce jobs running on your cluster.
– YARNServiceName Applications - Query information about YARN applications running on your cluster.
• Hosts - Display the hosts managed by Cloudera Manager. In this page you can:
– View the status and a variety of detail metrics about individual hosts
– Make configuration changes for host monitoring
– View all the processes running on a host
– Run the Host Inspector
– Add and delete hosts
– Create and manage host templates
– Manage parcels
– Decommission and recommission hosts
– Make rack assignments
– Run the host upgrade wizard
• Diagnostics - Review logs, events, and alerts to diagnose problems. The subpages are:
– Events - Search for and display events and alerts that have occurred.
– Logs - Search logs by service, role, host, and search phrase as well as log level (severity).
– Server Log - Display the Cloudera Manager Server log.
• Audits - Query and filter audit events across clusters.
• Charts - Query for metrics of interest, display them as charts, and display personalized chart dashboards.
• Backup - Manage replication schedules and snapshot policies.
• Administration - Administer Cloudera Manager. The subpages are:
– Settings - Configure Cloudera Manager.
– Alerts - Display when alerts will be generated, configure alert recipients, and send test alert email.
– Users - Manage Cloudera Manager users.
– Kerberos - Generate Kerberos credentials and inspect hosts.
– License - Manage Cloudera licenses.
– Language - Set the language used for the content of activity events, health events, and alert email
messages.
– Peers - Connect multiple instances of Cloudera Manager.
• Indicators
– Parcel Indicator - indicates how many parcels are eligible for downloading or distribution.
– Running Commands Indicator - displays the number of commands currently running for all services
or roles.
• Search - Supports searching for services, roles, hosts, configuration properties, and commands. You can enter
a partial string and a drop-down list with up to sixteen entities that match will display.
• Support - Displays various support actions. The subcommands are:
– Send Diagnostic Data - Sends data to Cloudera Support to support troubleshooting.
– Support Portal (Cloudera Enterprise) - Displays the Cloudera Support portal.
– Mailing List (Cloudera Express) - Displays the Cloudera Manager Users list.
– Scheduled Diagnostics: Weekly - Configure the frequency of automatically collecting diagnostic data and
sending it to Cloudera Support.
– The following links open the latest documentation on the Cloudera web site:
– Help
– Installation Guide
– API Documentation
– Release Notes
– About - Version number and build details of Cloudera Manager and the current date and time stamp of
the Cloudera Manager server.
• Logged-in User Menu - The currently logged-in user. The subcommands are:
– Change Password - Change the password of the currently logged in user.
– Logout
Home Page
When you start the Cloudera Manager Admin Console on page 27, the Home page displays.
You can also navigate to the Home page by clicking Home in the top navigation bar.
Status
The default tab displayed when the Home page displays. It contains:
• Clusters - The clusters being managed by Cloudera Manager. Each cluster is displayed either in summary
form or in full form depending on the configuration of the Administration > Settings > Other > Maximum
Cluster Count Shown In Full property. When the number of clusters exceeds the value of the property, only
cluster summary information displays.
– Summary Form - A list of links to cluster status pages. Click Customize to jump to the Administration >
Settings > Other > Maximum Cluster Count Shown In Full property.
– Full Form - A separate section for each cluster containing a link to the cluster status page and a table
containing links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking the menu icon, and can
contain one or more of the following indicators:
– Configuration issue - Indicates that the service has at least one configuration issue. The indicator
shows the number of configuration issues at the highest severity level. If there are configuration errors,
the indicator is red. If there are no errors but configuration warnings exist, the indicator is yellow.
No indicator is shown if there are no configuration notifications.
– Restart Needed / Refresh Needed - Indicates that at least one of a service's roles is running with a
configuration that does not match the current configuration settings in Cloudera Manager. Click the
indicator to display the Stale Configurations page. To bring the cluster up-to-date, click the Refresh
or Restart button on the Stale Configurations page, or follow the instructions in Refreshing a Cluster,
Restarting a Cluster, or Restarting Services and Instances after Configuration Changes.
– Client configuration redeployment required - Indicates that the client configuration for a service should
be redeployed. Click the indicator to display the Stale Configurations page. To bring the cluster
up-to-date, click the Deploy Client Configuration button on the Stale Configurations page, or follow the
instructions in Manually Redeploying Client Configuration Files.
– Cloudera Management Service - A table containing a link to the Cloudera Management Service. The Cloudera
Management Service has a menu of actions that you select by clicking the menu icon.
– Charts - A set of charts (dashboard) that summarize resource utilization (IO, CPU usage) and processing
metrics.
Click a line, stack area, scatter, or bar chart to expand it into a full-page view with a legend for the individual
charted entities as well as more fine-grained axis divisions.
By default the time scale of a dashboard is 30 minutes. To change the time scale, click a duration link
at the top-right of the dashboard.
To set the dashboard type, click and select one of the following:
• Custom - displays a custom dashboard.
• Default - displays a default dashboard.
• Reset - resets the custom dashboard to the predefined set of charts, discarding any customizations.
Resources
• Quick Start
• Cloudera Manager API tutorial
• Cloudera Manager API documentation
• Python client
• Using the Cloudera Manager Java API for Cluster Automation on page 34
Obtaining Configuration Files
1. Obtain the list of a service's roles:
http://cm_server_host:7180/api/v10/clusters/clusterName/services/serviceName/roles
2. Obtain the process object for a role:
http://cm_server_host:7180/api/v10/clusters/clusterName/services/serviceName/roles/roleName/process
3. Obtain the contents of any of the configuration files listed in the process object:
http://cm_server_host:7180/api/v10/clusters/clusterName/services/serviceName/roles/roleName/process/
configFiles/configFileName
For example:
http://cm_server_host:7180/api/v10/clusters/Cluster%201/services/OOZIE-1/roles/
OOZIE-1-OOZIE_SERVER-e121641328fcb107999f2b5fd856880d/process/configFiles/oozie-site.xml
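The URL pattern above can be assembled programmatically. The sketch below only builds the URL; actually fetching it would additionally require HTTP basic authentication with a Cloudera Manager account, and the host and entity names shown are placeholders from the example:

```python
from urllib.parse import quote

def config_file_url(cm_host, cluster, service, role, config_file, api="v10"):
    """Build the API URL for one of a role's generated configuration files,
    percent-encoding path segments such as cluster names containing spaces."""
    path = "/".join(quote(part, safe="") for part in
                    ("clusters", cluster, "services", service,
                     "roles", role, "process", "configFiles", config_file))
    return f"http://{cm_host}:7180/api/{api}/{path}"
```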
To look up the internal name of a service property, retrieve the full service configuration:
http://cm_server_host:7180/api/v10/clusters/Cluster%201/services/service_name/config?view=FULL
Search the results for the display name of the desired property. For example, a search for the display name
HDFS Service Environment Advanced Configuration Snippet (Safety Valve) shows that the corresponding property
name is hdfs_service_env_safety_valve:
{
  "name" : "hdfs_service_env_safety_valve",
  "required" : false,
  "displayName" : "HDFS Service Environment Advanced Configuration Snippet (Safety Valve)",
  "description" : "For advanced use only, key/value pairs (one on each line) to be inserted into a role's environment. Applies to configurations of all roles in this service except client configuration.",
  "relatedName" : "",
  "validationState" : "OK"
}
Similar to finding service properties, you can also find host properties. First, get the host IDs for a cluster with
the URL:
http://cm_server_host:7180/api/v10/hosts
{
  "hostId" : "2c2e951c-adf2-4780-a69f-0382181f1821",
  "ipAddress" : "10.30.195.116",
  "hostname" : "cm_server_host",
  "rackId" : "/default",
  "hostUrl" : "http://cm_server_host:7180/cmf/hostRedirect/2c2e951c-adf2-4780-a69f-0382181f1821",
  "maintenanceMode" : false,
  "maintenanceOwners" : [ ],
  "commissionState" : "COMMISSIONED",
  "numCores" : 4,
  "totalPhysMemBytes" : 10371174400
}
Then obtain the host properties by including one of the returned host IDs in the URL:
http://cm_server_host:7180/api/v10/hosts/2c2e951c-adf2-4780-a69f-0382181f1821?view=FULL
Required Role:
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file where you want to save the configuration.
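The API call these parameters describe did not survive formatting here. Based on the parameter names, it is most likely an export of the Cloudera Manager deployment description via the /cm/deployment endpoint; this is a hedged reconstruction, so verify the endpoint against the API documentation for your Cloudera Manager version:

```python
from urllib.parse import urlunsplit

def deployment_export_url(cm_server_host, api="v10"):
    """URL that returns the full deployment description (clusters, services,
    roles, hosts, and their configuration) as a single JSON document."""
    return urlunsplit(("http", f"{cm_server_host}:7180",
                       f"/api/{api}/cm/deployment", "", ""))

# The JSON response would be fetched with admin_uname/admin_pass credentials
# and saved to path_to_file.
```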
Warning: If you do not stop the cluster before making this API call, the API call will stop all cluster
services before running the job. Any running jobs and data are lost.
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file containing the JSON configuration file.
To use the Java client, add this dependency to your project's pom.xml:
<project>
  <repositories>
    <repository>
      <id>cdh.repo</id>
      <url>https://repository.cloudera.com/content/groups/cloudera-repos</url>
      <name>Cloudera Repository</name>
    </repository>
    ...
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.cloudera.api</groupId>
      <artifactId>cloudera-manager-api</artifactId>
      <version>4.6.2</version> <!-- Set to the version of Cloudera Manager you use -->
    </dependency>
    ...
  </dependencies>
  ...
</project>
The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry
point is a handle to the root of the API, obtained from the client builder. From the root, you can traverse down
to all other resources. (It's called "v10" because that is the current Cloudera Manager API version, but the same
builder will also return a root from an earlier version of the API.) The tree view shows some key resources and
supported operations:
• RootResourceV10
– ClustersResourceV10 - host membership, start cluster
– ServicesResourceV10 - configuration, get metrics, HA, service commands
– RolesResource - add roles, get metrics, logs
– RoleConfigGroupsResource - configuration
– ParcelsResource - parcel management
// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource().readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}
To see a full example of cluster deployment using the Java client, see whirr-cm. Go to CmServerImpl#configure
to see the relevant code.
Cloudera Navigator 2 Overview
Related Information
• Installing Cloudera Navigator
• Upgrading Cloudera Navigator
• Cloudera Navigator Administration
• Cloudera Data Management
• Configuring Authentication in Cloudera Navigator
• Configuring SSL for Cloudera Navigator
• Cloudera Navigator User Roles
Cloudera Navigator UI
Cloudera Navigator UI is the web-based UI that you use to:
• Create and view audit reports
• Search entity metadata, view entity lineage, and modify business metadata
• Define policies for modifying business metadata when entities are extracted
Log into the Cloudera Navigator UI using the credentials assigned by your administrator. User groups are assigned
roles that constrain the features available to a logged-in user.
FAQs
General Questions
What are the differences between the Cloudera Express and the Cloudera Enterprise versions of Cloudera
Manager?
Cloudera Express includes a free version of Cloudera Manager. The Cloudera Enterprise version of Cloudera
Manager provides additional functionality. Both the Cloudera Express and Cloudera Enterprise versions automate
the installation, configuration, and monitoring of CDH 4 or CDH 5 on an entire cluster. See the matrix at Cloudera
Express and Cloudera Enterprise Features on page 39 for a comparison of the two versions.
The Cloudera Enterprise version of Cloudera Manager is available as part of the Cloudera Enterprise subscription
offering, and requires a license. You can also choose a Cloudera Enterprise Data Hub Edition Trial that is valid
for 60 days.
If you are not an existing Cloudera customer, contact Cloudera Sales using this form or call 866-843-7207 to
obtain a Cloudera Enterprise license. If you are already a Cloudera customer and you need to upgrade from
Cloudera Express to Cloudera Enterprise, contact Cloudera Support to obtain a license.
Warning: Cloudera Manager 3 and CDH 3 have reached End of Maintenance (EOM) as of June 20,
2013. Cloudera does not support or provide patches for Cloudera Manager 3 and CDH 3 releases.
For instructions on upgrading from CDH 3 to CDH 4, see Upgrading CDH 3 to CDH 4 in a Cloudera Manager
Deployment.
For instructions on upgrading CDH 4 to CDH 5, see Upgrading CDH 4 to CDH 5. For instructions on upgrading
CDH 4 to a newer version, see Upgrading CDH 4.
Where are CDH libraries located when I distribute CDH using parcels?
With parcel software distribution, the path to the CDH libraries is /opt/cloudera/parcels/CDH/lib/ instead
of the usual /usr/lib/.
What upgrade paths are available for Cloudera Manager, and what's involved?
For instructions about upgrading, see Upgrading Cloudera Manager.
Do worker hosts need access to the Cloudera public repositories for an install with Cloudera Manager?
Not necessarily. You can perform an installation or upgrade using the parcel format, and when using parcels, only
the Cloudera Manager Server requires access to the Cloudera public repositories. The Cloudera Manager Server
distributes the parcels directly to the worker hosts. See Parcels for more information. If you
want to install using the traditional packages, each host requires access to the installation files.
For both parcels and packages, it is also possible to create local repositories that serve these files to the hosts
that are being upgraded. If you have established local repositories, no access to the Cloudera public repository
is required. For more information, see Creating and Using a Remote Package Repository.
Can I use the service monitoring features of Cloudera Manager without the Cloudera Management Service?
No. To understand the desired state of the system, Cloudera Manager requires the global configuration that the
Cloudera Management Service roles gather and provide. The Cloudera Manager Agent acts as the agent for both
supervision and monitoring.
Can I run the Cloudera Management Service and the Hadoop services on the host where the Cloudera Manager
Server is running?
Yes. This is especially common in deployments that have a small number of hosts.
Is Cloudera Navigator included with a Cloudera Enterprise Data Hub Edition license?
Yes. Cloudera Navigator is included with a Cloudera Enterprise Data Hub Edition license, and can be selected as
an option with a Cloudera Enterprise Flex Edition license.
Can Cloudera Navigator be purchased standalone, that is, without Cloudera Manager?
No. Cloudera Navigator components are managed by Cloudera Manager. Therefore, Cloudera Manager is a prerequisite
for Cloudera Navigator.
What Cloudera Manager, CDH, and Cloudera Impala releases does Cloudera Navigator 2 work with?
See Cloudera Navigator 2 Requirements and Supported Versions.
How are Cloudera Navigator logs different from Cloudera Manager logs?
Cloudera Navigator tracks and aggregates only accesses to the data stored in CDH services, for use in audit
reports and analysis. Cloudera Manager monitors and logs all the activity performed by CDH services, which helps
administrators maintain the health of the cluster. The target audiences of these logs are different, but together
they provide better visibility into both the data access and the system activity of an enterprise cluster.
Trying Impala
What are the software and hardware requirements for running Impala?
For information on Impala requirements, see Cloudera Impala Requirements. Note that there is often a minimum
required level of Cloudera Manager for any given Impala version.
If a query includes an ORDER BY clause with LIMIT 1000 and your cluster has 50 nodes, then each of those 50
nodes transmits a maximum of 1000 rows back to the coordinator node. The coordinator node needs enough memory
to sort (LIMIT * cluster_size) rows, although in the end the final result set contains at most LIMIT rows,
1000 in this case.
If the query includes an ORDER BY clause without a LIMIT, then each node filters out a set of rows matching the
WHERE conditions, sorts the results (with no size limit), and sends the sorted intermediate rows back to the
coordinator node. The coordinator node might need substantial memory to sort the final result set, and so might
use a temporary disk work area for that final phase of the query.
• Whether the query contains any join clauses, GROUP BY clauses, analytic functions, or DISTINCT operators.
These operations all require some in-memory work areas that vary depending on the volume and distribution
of data. In Impala 2.0 and later, these kinds of operations utilize temporary disk work areas if memory usage
grows too large to handle. See SQL Operations that Spill to Disk for details.
• The size of the result set. When intermediate results are being passed around between nodes, the amount
of data depends on the number of columns returned by the query. For example, it is more memory-efficient
to query only the columns that are actually needed in the result set rather than always issuing SELECT *.
• The mechanism by which work is divided for a join query. You use the COMPUTE STATS statement, and query
hints in the most difficult cases, to help Impala pick the most efficient execution plan. See Performance
Considerations for Join Queries for details.
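As a brief sketch of the practices above (the orders table and its columns are hypothetical):

```sql
-- Select only the columns actually needed, rather than SELECT *,
-- to reduce the size of intermediate result sets:
SELECT order_id, order_total
FROM orders
WHERE order_date = '2015-05-20';

-- Gather table and column statistics so Impala can choose an
-- efficient execution plan for join queries:
COMPUTE STATS orders;
```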
See Hardware Requirements for more details and recommendations about Impala hardware prerequisites.
What features from relational databases or Hive are not available in Impala?
• Querying streaming data.
• Deleting individual rows. You delete data in bulk by overwriting an entire table or partition, or by dropping a
table.
• Indexing (not currently supported). LZO-compressed text files can be indexed outside of Impala, as described
in Using LZO-Compressed Text Files.
• Full text search on text fields. The Cloudera Search product is appropriate for this use case.
• Custom Hive Serializer/Deserializer classes (SerDes). Impala supports a set of common native file formats
that have built-in SerDes in CDH. See How Impala Works with Hadoop File Formats for details.
• Checkpointing within a query. That is, Impala does not save intermediate results to disk during long-running
queries. Currently, Impala cancels a running query if any host on which that query is executing fails. When
one or more hosts are down, Impala reroutes future queries to only use the available hosts, and Impala
detects when the hosts come back up and begins using them again. Because a query can be submitted
through any Impala node, there is no single point of failure. In the future, we will consider adding additional
work allocation features to Impala, so that a running query would complete even in the presence of host
failures.
• Encryption of data transmitted between Impala daemons.
• Hive indexes.
• Non-Hadoop data stores, such as relational databases.
For the detailed list of features that are different between Impala and HiveQL, see SQL Differences Between
Impala and Hive.
Is Avro supported?
Yes, Avro is supported. Impala has always been able to query Avro tables. You can use the Impala LOAD DATA
statement to load existing Avro data files into a table. Starting with Impala 1.4, you can create Avro tables with
Impala. Currently, you still use the INSERT statement in Hive to copy data from another table into an Avro table.
See Using the Avro File Format with Impala Tables for details.
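A minimal sketch of this workflow, assuming Impala 1.4 or higher (the table name and HDFS path are hypothetical):

```sql
-- Create an Avro table directly in Impala (1.4 and higher):
CREATE TABLE events_avro (event_id BIGINT, event_name STRING)
STORED AS AVRO;

-- Load existing Avro data files that are already in HDFS:
LOAD DATA INPATH '/user/cloudera/staging/events' INTO TABLE events_avro;
```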
How do I?
statestore.live-backends:3
statestore.live-backends.list:[host1:22000, host1:26000, host2:22000]
The number of impalad nodes is the number of list items referring to port 22000, in this case two. (Typically,
this number is one less than the number reported by the statestore.live-backends line.) If an impalad
node became unavailable or came back after an outage, the information reported on this page would change
appropriately.
Impala Performance
Are results returned as they become available, or all at once when a query completes?
Impala streams results as they become available, whenever possible. Certain SQL operations (such as aggregation
or ORDER BY) require all of the input to be ready before Impala can return results.
- BytesRead: 180.33 MB
- BytesReadLocal: 180.33 MB
- BytesReadShortCircuit: 180.33 MB
If BytesReadLocal is lower than BytesRead, something in your cluster is misconfigured, such as the impalad
daemon not running on all the data nodes. If BytesReadShortCircuit is lower than BytesRead, short-circuit
reads are not enabled properly on that node; see Post-Installation Configuration for Impala for instructions.
• If the table was just created, or this is the first query that accessed the table after an INVALIDATE METADATA
statement or after the impalad daemon was restarted, there might be a one-time delay while the metadata
for the table is loaded and cached. Check whether the slowdown disappears when the query is run again.
When doing performance comparisons, consider issuing a DESCRIBE table_name statement for each table
first, to make sure any timings only measure the actual query time and not the one-time wait to load the
table metadata.
• Is the table data in uncompressed text format? Check by issuing a DESCRIBE FORMATTED table_name
statement. A text table is indicated by the line:
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Although uncompressed text is the default format for a CREATE TABLE statement with no STORED AS clauses,
it is also the bulkiest format for disk storage and consequently usually the slowest format for queries. For
data where query performance is crucial, particularly for tables that are frequently queried, consider starting
with or converting to a compact binary file format such as Parquet, Avro, RCFile, or SequenceFile. For details,
see How Impala Works with Hadoop File Formats.
• If your table has many columns, but the query refers to only a few columns, consider using the Parquet file
format. Its data files are organized with a column-oriented layout that lets queries minimize the amount of
I/O needed to retrieve, filter, and aggregate the values for specific columns. See Using the Parquet File Format
with Impala Tables for details.
• If your query involves any joins, are the tables ordered so that the table or subquery returning the largest
number of rows comes first, followed by the smallest (most selective), then the second smallest, and so on?
That ordering allows Impala to optimize the way work is distributed among the nodes and how intermediate
results are routed from one node to another. For example, all other things being equal, the following join
order results in an efficient query:
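A hypothetical sketch of such an ordering, with the largest table first and progressively smaller (more selective) tables after it (all table names invented):

```sql
-- Largest table first, then progressively smaller, more selective ones:
SELECT some_col
FROM huge_table
JOIN big_table ON huge_table.id = big_table.id
JOIN small_table ON big_table.id = small_table.id
JOIN tiny_table ON small_table.id = tiny_table.id;
```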
See Performance Considerations for Join Queries for performance tips for join queries.
• Also for join queries, do you have table statistics for the table, and column statistics for the columns used
in the join clauses? Column statistics let Impala better choose how to distribute the work for the various
pieces of a join query. See How Impala Uses Statistics for Query Optimization for details about gathering
statistics.
• Does your table consist of many small data files? Impala works most efficiently with data files in the
multi-megabyte range; Parquet, a format optimized for data warehouse-style queries, uses large files
(originally 1 GB, now 256 MB in Impala 2.0 and higher) with a block size matching the file size. Use the
DESCRIBE FORMATTED table_name statement in impala-shell to see where the data for a table is located,
and use the hadoop fs -ls or hdfs dfs -ls Unix commands to see the files and their sizes. If you have
thousands of small data files, that is a signal that you should consolidate into a smaller number of large
files. Use an INSERT ... SELECT statement to copy the data to a new table, reorganizing into new data
files as part of the process. Prefer to construct large data files and import them in bulk through the LOAD
DATA or CREATE EXTERNAL TABLE statements, rather than issuing many INSERT ... VALUES statements;
each INSERT ... VALUES statement creates a separate tiny data file. If you have thousands of files all in
the same directory, but each one is megabytes in size, consider using a partitioned table so that each partition
contains a smaller number of files. See the following point for more on partitioning.
• If your data is easy to group according to time or geographic region, have you partitioned your table based
on the corresponding columns such as YEAR, MONTH, and/or DAY? Partitioning a table based on certain columns
allows queries that filter based on those same columns to avoid reading the data files for irrelevant years,
postal codes, and so on. (Do not partition down to too fine a level; try to structure the partitions so that there
is still sufficient data in each one to take advantage of the multi-megabyte HDFS block size.) See Partitioning
for Impala Tables for details.
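As hedged sketches of the last two points above (all table and column names are invented for illustration):

```sql
-- Consolidate many small text files into larger Parquet files:
CREATE TABLE logs_parquet STORED AS PARQUET
AS SELECT * FROM logs_text;

-- Partition by date columns so queries that filter on them skip
-- irrelevant data files entirely:
CREATE TABLE logs_by_day (msg STRING, severity STRING)
PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
STORED AS PARQUET;

-- Dynamic-partition insert: rows are routed to partitions based on the
-- trailing year, month, and day columns of the SELECT list:
INSERT INTO logs_by_day PARTITION (year, month, day)
SELECT msg, severity, year, month, day FROM logs_text;
```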
Does Impala performance improve as it is deployed to more hosts in a cluster in much the same way that Hadoop
performance does?
Yes. Impala scales with the number of hosts. It is important to install Impala on all the data nodes in the cluster,
because otherwise some of the nodes must do remote reads to retrieve data not available for local reads. Data
locality is an important architectural aspect for Impala performance. See this Impala performance blog post for
background. Note that this blog post refers to benchmarks with Impala 1.1.1; Impala has added even more
performance features in the 1.2.x series.
What are good use cases for Impala as opposed to Hive or MapReduce?
Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets. Hive and
MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.
Is MapReduce required for Impala? Will Impala continue to work as expected if MapReduce is stopped?
Impala does not use MapReduce at all.
Is Impala intended to handle real time queries in low-latency applications or is it for ad hoc queries for the
purpose of data exploration?
Ad hoc queries for data exploration are the primary use case for Impala, but we anticipate it being used in many
other situations where low latency is required. Whether Impala is appropriate for a particular use case depends
on the workload, data size, and query volume. See Impala Benefits on page 8 for the primary benefits you can
expect when using Impala.
Can I use Impala to query data already loaded into Hive and HBase?
Yes. There are no additional steps needed for Impala to query tables managed by Hive, whether they are stored in
HDFS or HBase. Make sure that Impala is configured to access the Hive metastore correctly and you should be ready
to go. Keep in mind that impalad, by default, runs as the impala user, so you might need to adjust some file
permissions depending on how strict your current permissions are.
See Using Impala to Query HBase Tables for details about querying data in HBase.
Hive itself is optional, and does not need to be installed on the same nodes as Impala. Currently, Impala supports
a wider variety of read (query) operations than write (insert) operations; you use Hive to insert data into tables
that use certain file formats. See How Impala Works with Hadoop File Formats for details.
Impala Availability
Can Impala and MapReduce jobs run on the same cluster without resource contention?
Yes. See Controlling Impala Resource Usage for how to control Impala resource usage using the Linux cgroup
mechanism, and Integrated Resource Management with YARN for how to use Impala with the YARN resource
management framework. Impala is designed to run on the DataNode hosts. Any contention depends mostly on
the cluster setup and workload.
For a detailed example of configuring a cluster to share resources between Impala queries and MapReduce jobs,
see Setting up a Multi-tenant Cluster for Impala and MapReduce.
Impala Internals
If Impala is not running on all the hosts containing replicas of those blocks, queries involving that data could
be very inefficient. In that case, the data must be transmitted from one host to another for processing by “remote
reads”, a condition Impala normally tries to avoid. See Impala Concepts and Architecture for details about the
Impala architecture. Impala schedules query fragments on all hosts holding data relevant to the query, if possible.
• Impala can more naturally disperse query plans instead of having to fit them into a pipeline of map and
reduce jobs. This enables Impala to parallelize multiple stages of a query and avoid overheads such as sort
and shuffle when unnecessary.
Impala uses a more efficient execution engine by taking advantage of modern hardware and technologies:
• Impala generates runtime code. Impala uses LLVM to generate assembly code for the query that is being
run. Individual queries do not have to pay the overhead of running on a system that needs to be able to
execute arbitrary queries.
• Impala uses available hardware instructions when possible. Impala uses the supplemental SSE3 (SSSE3)
instructions which can offer tremendous speedups in some cases. (Impala 2.0 and 2.1 required the SSE4.1
instruction set; Impala 2.2 and higher relax the restriction again so only SSSE3 is required.)
• Impala uses better I/O scheduling. Impala is aware of the disk location of blocks and is able to schedule the
order to process blocks to keep all disks busy.
• Impala is designed for performance. A lot of time has been spent in designing Impala with sound
performance-oriented fundamentals, such as tight inner loops, inlined function calls, minimal branching,
better use of cache, and minimal memory usage.
SQL
Why do I have to use REFRESH and INVALIDATE METADATA, and what do they do?
In Impala 1.2 and higher, there is much less need to use the REFRESH and INVALIDATE METADATA statements:
• The new impala-catalog service, represented by the catalogd daemon, broadcasts the results of Impala
DDL statements to all Impala nodes. Thus, if you do a CREATE TABLE statement in Impala while connected
to one node, you do not need to do INVALIDATE METADATA before issuing queries through a different node.
• The catalog service only recognizes changes made through Impala, so you must still issue a REFRESH statement
if you load data through Hive or by manipulating files in HDFS, and you must issue an INVALIDATE METADATA
statement if you create a table, alter a table, add or drop partitions, or do other DDL statements in Hive.
• Because the catalog service broadcasts the results of REFRESH and INVALIDATE METADATA statements to
all nodes, in the cases where you do still need to issue those statements, you can do that on a single node
rather than on every node, and the changes will be automatically recognized across the cluster, making it
more convenient to load balance by issuing queries through arbitrary Impala nodes rather than always using
the same coordinator node.
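For example (the table names are hypothetical):

```sql
-- After appending new data files to an existing table through Hive or
-- direct HDFS operations:
REFRESH sales;

-- After creating or altering a table, or adding or dropping partitions,
-- in Hive:
INVALIDATE METADATA new_hive_table;
```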
select 2+2;
select substr('hello',2,1);
select pow(10,6);
Partitioned Tables
HBase
What kinds of Impala queries or data are best suited for HBase?
HBase tables are ideal for queries where normally you would use a key-value store. That is, where you retrieve
a single row or a few rows, by testing a special unique key column using the = or IN operators.
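For example, assuming an HBase-backed table customer_profiles whose unique key column is cust_key (hypothetical names):

```sql
-- Efficient HBase lookups test the unique key column with = or IN:
SELECT * FROM customer_profiles WHERE cust_key = 'C1001';
SELECT * FROM customer_profiles WHERE cust_key IN ('C1001', 'C1002');
```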
HBase tables are not suitable for queries that produce large result sets with thousands of rows. HBase tables
are also not suitable for queries that perform full table scans because the WHERE clause does not request specific
values from the unique key column.
Use HBase tables for data that is inserted one row or a few rows at a time, such as by the INSERT ... VALUES
syntax. Loading data piecemeal like this into an HDFS-backed table produces many tiny files, which is a very
inefficient layout for HDFS data files.
If the lack of an UPDATE statement in Impala is a problem for you, you can simulate single-row updates by doing
an INSERT ... VALUES statement using an existing value for the key column. The old row value is hidden; only
the new row value is seen by queries.
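For example, assuming an HBase-backed table customer_profiles keyed on its first column (hypothetical names and values):

```sql
-- Inserting a new row with an existing key value simulates an UPDATE;
-- queries see only the newest version of the row for that key:
INSERT INTO customer_profiles VALUES ('C1001', 'new_email@example.com');
```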
HBase tables are often wide (containing many columns) and sparse (with most column values NULL). For example,
you might record hundreds of different data points for each user of an online service, such as whether the user
had registered for an online game or enabled particular account features. With Impala and HBase, you could
look up all the information for a specific customer efficiently in a single query. For any given customer, most of
these columns might be NULL, because a typical customer might not make use of most features of an online
service.
General
The following are general questions about Cloudera Search and the answers to those questions.
Do I need to configure Sentry restrictions for each access mode, such as for the admin console
and for the command line?
Sentry restrictions are consistently applied regardless of the way users attempt to complete actions. For example,
restricting access to data in a collection consistently restricts that access, whether queries come from the
command line, from a browser, or through the admin console.
Does Search support indexing data stored in JSON files and objects?
Yes, you can use the readJson and extractJsonPaths morphline commands that are included with the
CDK to access JSON data and files. For more information, see cdk-morphlines-json.
How can I set up Cloudera Search so that results include links back to the source that contains
the result?
You can use stored results fields to create links back to source documents. For information on data types,
including the option to set results fields as stored, see the Solr Wiki page on SchemaXml.
For example, with MapReduceIndexerTool you can take advantage of fields such as file_path. See
MapReduceIndexerTool Metadata for more information. The output from the MapReduceIndexerTool includes
file path information that can be used to construct links to source documents.
If you use the Hue UI, you can link to data in HDFS by inserting links of the form:
<a href="/filebrowser/download/{{file_path}}?disposition=inline">Download</a>
Why do I get the error “no field name specified in query and no default specified via 'df' param” when I query
a Schemaless collection?
Schemaless collections initially have no default or df setting. As a result, simple searches that might succeed
on non-Schemaless collections may fail on Schemaless collections.
When a user submits a search, it must be clear which field Cloudera Search should query. A default field, or df,
is often specified in solrconfig.xml, and when this is the case, users can submit queries that do not specify
fields. In such situations, Solr uses the df value.
When a new collection is created in Schemaless mode, there are initially no fields defined, so no field can be
chosen as the df field. As a result, when query request handlers do not specify a df, errors can result. There are
several ways to address this issue:
• Queries can specify any valid field name on which to search. In such a case, no df is required.
• Queries can specify a default field using the df parameter. In such a case, the df is specified in the query.
• You can uncomment the df section of the generated schemaless solrconfig.xml file and set the df
parameter to the desired field. In such a case, all subsequent queries can use the df field in solrconfig.xml
if no field or df value is specified.
How large of an index does Cloudera Search support per search server?
There are too many variables to provide a single answer to this question. Typically, a server can host from 10 to
300 million documents, with the underlying index as large as hundreds of gigabytes. To determine a reasonable
maximum document quantity and index size for servers in your deployment, prototype with realistic data and
queries.
Are there settings that can help avoid out of memory (OOM) errors during data ingestion?
A data ingestion pipeline can be made up of many connected segments, each of which may need to be evaluated
for sufficient resources. A common limiting factor is the relatively small default amount of permgen memory
allocated to the Flume JVM. Allocating additional memory to the Flume JVM may help avoid OOM errors. For
example, the following JVM settings for Flume are often sufficient:
-Xmx2048m -XX:MaxPermSize=256M
Schema Management
The following are questions about schema management in Cloudera Search and the answers to those questions.
What is Apache Avro and how can I use an Avro schema for more flexible schema evolution?
To learn more about Avro and Avro schemas, see the Avro Overview page and the Avro Specification page.
To see examples of how to implement inheritance, backwards compatibility, and polymorphism with Avro, see
this InfoQ article.
Supportability
The following are questions about supportability in Cloudera Search and the answers to those questions.
Which file formats does Cloudera Search support for indexing? Does it support searching
images?
Cloudera Search uses the Apache Tika library for indexing many standard document formats. In addition, Cloudera
Search supports indexing and searching Avro files and a wide variety of other file types such as log files, Hadoop
Sequence Files, and CSV files. You can add support for indexing custom file formats using a morphline command
plug-in.
Getting Support
This section describes how to get support.
Cloudera Support
Cloudera can help you install, configure, optimize, tune, and run CDH for large scale data processing and analysis.
Cloudera supports CDH whether you run it on servers in your own data center, or on hosted infrastructure
services such as Amazon EC2, Rackspace, SoftLayer, and VMware vCloud.
If you are a Cloudera customer, you can:
• Register for an account to create a support ticket at the support site.
• Visit the Cloudera Knowledge Base.
If you are not a Cloudera customer, learn how Cloudera can help you.
Community Support
There are several vehicles for community support. You can:
• Register for the Cloudera user groups.
• If you have any questions or comments about CDH, you can
– Send a message to the CDH user list: cdh-user@cloudera.org
– Visit the Using the Platform forum
• If you have any questions or comments about Cloudera Manager, you can
– Send a message to the Cloudera Manager user list: scm-users@cloudera.org.
– Visit the Cloudera Manager forum.
– Cloudera Express users can access the Cloudera Manager support mailing list from within the Cloudera
Manager Admin Console by selecting Support > Mailing List.
– Cloudera Enterprise customers can access the Cloudera Support Portal from within the Cloudera Manager
Admin Console, by selecting Support > Cloudera Support Portal. From there you can register for a support
account, create a support ticket, and access the Cloudera Knowledge Base.
Report Issues
Cloudera tracks software and documentation bugs and enhancement requests for CDH on issues.cloudera.org.
Your input is appreciated, but before filing a request:
• Search the Cloudera issue tracker
• Search the CDH Manual Installation, Using the Platform, and Cloudera Manager forums
• Send a message to the CDH user list, cdh-user@cloudera.org, CDH developer list, cdh-dev@cloudera.org, or
Cloudera Manager user list, scm-users@cloudera.org