
HADOOP HISTORY

By Rudra Pednekar
19167008
WHAT IS HADOOP?
Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that
you can process it in parallel. There are basically two components in Hadoop:

The first one is HDFS (Hadoop Distributed File System) for storage, which allows you to store data of
various formats across a cluster. The second one is YARN, for resource management in Hadoop. It
allows parallel processing over the data stored in HDFS.
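
As a quick illustration of these two components working together, below is a minimal word-count
sketch written against Hadoop's Java MapReduce API: the input and output paths (passed as
command-line arguments) are assumed to be HDFS directories, and YARN schedules the map and
reduce tasks across the cluster. The class name and paths are illustrative only, not part of any
particular Hadoop release.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word; instances run in parallel across the cluster.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}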

HISTORY OF HADOOP:
2002
It all started in the year 2002 with the Apache Nutch project.

In 2002, Doug Cutting and Mike Cafarella were working on Apache Nutch Project that aimed at
building a web search engine that would crawl and index websites.

After a lot of research, Mike Cafarella and Doug Cutting estimated that it would cost around
$500,000 in hardware with a monthly running cost of $30,000 for a system supporting a one-billion-
page index.

This project proved to be too expensive and was thus found infeasible for indexing billions of
webpages, so they started looking for a feasible solution that would reduce the cost.

2003
Meanwhile, in 2003 Google published a paper on the Google File System (GFS), which described an
architecture for storing large datasets in a distributed environment. This paper solved the problem
of storing the huge files generated as part of the web crawl and indexing process. But it was only
half of a solution to their problem.

2004
In 2004, Nutch’s developers set about writing an open-source implementation, the Nutch
Distributed File System (NDFS).

In 2004, Google introduced MapReduce to the world by releasing a paper on MapReduce. This paper
provided the solution for processing those large datasets. It gave a full solution to the Nutch
developers.

Google provided the idea for distributed storage and MapReduce. Nutch developers implemented
MapReduce in the middle of 2004.

2006
The Apache community realized that the implementation of MapReduce and NDFS could be used for
other tasks as well. In February 2006, they moved out of Nutch and formed an independent
subproject of Lucene called "Hadoop" (named after Doug Cutting's son's yellow toy elephant).
As the Nutch project was limited to clusters of 20 to 40 nodes, Doug Cutting joined Yahoo! in 2006
to scale the Hadoop project to clusters of thousands of nodes.

2007
In 2007, Yahoo started using Hadoop on a 1,000-node cluster.

2008
In January 2008, Hadoop confirmed its success by becoming a top-level project at Apache.

By this time, many other companies like Last.fm, Facebook, and the New York Times started using
Hadoop.
2011 – 2012
On 27 December 2011, Apache released Hadoop version 1.0, which includes support for security,
HBase, etc.

On 10 March 2012, release 1.0.1 was available. This is a bug fix release for version 1.0.

On 23 May 2012, the Hadoop 2.0.0-alpha version was released. This release contains YARN.

The second (alpha) version in the Hadoop-2.x series with a more stable version of YARN was
released on 9 October 2012.

2017 – now
On 13 December 2017, release 3.0.0 was available.

On 25 March 2018, Apache released Hadoop 3.0.1, which contains 49 bug fixes in Hadoop 3.0.0.

On 6 April 2018, Hadoop 3.1.0 was released, containing 768 bug fixes, improvements, and
enhancements since 3.0.0.

Later, in May 2018, Hadoop 3.0.3 was released.

On 8 August 2018, Apache Hadoop 3.1.1 was released.

LET US SEE ALL THESE VERSIONS IN DETAIL


Hadoop version 0.x
4 September, 2007: release 0.14.1 available

New features in release 0.14 include:

• Better checksums in HDFS. Checksums are no longer stored in parallel HDFS files, but are
stored directly by datanodes alongside blocks. This is more efficient for the namenode and
also improves data integrity.
• Pipes: A C++ API for MapReduce
• Eclipse Plugin, including HDFS browsing, job monitoring, etc.
• File modification times in HDFS.

There are many other improvements, bug fixes, optimizations and new features.
Performance and reliability are better than ever.

October, 2007: release 0.14.3 available

This release fixes critical bugs in release 0.14.2.

29 October 2007: release 0.15.0 available

This release contains many improvements, new features, bug fixes and optimizations. Notably, this
contains the first working version of HBase.

26 November, 2007: release 0.14.4 available

This release fixes critical bugs in release 0.14.3.

27 November, 2007: release 0.15.1 available

This release fixes critical bugs in release 0.15.0.

2 January, 2008: release 0.15.2 available

This release fixes critical bugs in release 0.15.1.

18 January, 2008: release 0.15.3 available

This release fixes critical bugs in release 0.15.2.

7 February, 2008: release 0.16.0 available

This release contains many improvements, new features, bug fixes and optimizations.

13 March, 2008: release 0.16.1 available

This release fixes critical bugs in release 0.16.0.

2 April, 2008: release 0.16.2 available

This release fixes critical bugs in release 0.16.1.


16 April, 2008: release 0.16.3 available

This release fixes critical bugs in release 0.16.2.

5 May, 2008: release 0.16.4 available

This release fixes 4 critical bugs in release 0.16.3.

20 May, 2008: release 0.17.0 available

This release contains many improvements, new features, bug fixes and optimizations. See the
Hadoop 0.17.0 Release Notes for details.

23 June, 2008: release 0.17.1 available

This release contains many improvements, new features, bug fixes and optimizations. See the
Hadoop 0.17.1 Notes for details.

19 August, 2008: release 0.17.2 available

This release contains several critical bug fixes. See the Hadoop 0.17.2 Notes for details.

22 August, 2008: release 0.18.0 available

This release contains many improvements, new features, bug fixes and optimizations.
See the Hadoop 0.18.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

17 September, 2008: release 0.18.1 available

This release contains several critical bug fixes.

3 November, 2008: release 0.18.2 available

This release contains several critical bug fixes.


See the Hadoop 0.18.2 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

21 November, 2008: release 0.19.0 available

This release contains many improvements, new features, bug fixes and optimizations.
See the Hadoop 0.19.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

29 January, 2009: release 0.18.3 available

This release contains many critical bug fixes.


See the Hadoop 0.18.3 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

24 February, 2009: release 0.19.1 available

This release contains many critical bug fixes, including some data loss issues. The release also
introduces an incompatible change by disabling the file append API (HADOOP-5224) until it can be
stabilized.

See the Hadoop 0.19.1 Release Notes for details. Alternatively, you can look at the complete change
log for this release or the Jira issue log for all releases.

22 April, 2009: release 0.20.0 available

This release contains many improvements, new features, bug fixes and optimizations.
See the Hadoop 0.20.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

23 July, 2009: release 0.19.2 available

This release contains several critical bug fixes.


See the Hadoop 0.19.2 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

14 September, 2009: release 0.20.1 available

This release contains several critical bug fixes.

See the Hadoop 0.20.1 Release Notes for details. Alternatively, you can look at the complete change
log for this release or the Jira issue log for all releases.

26 February, 2010: release 0.20.2 available

This release contains several critical bug fixes.


See the Hadoop 0.20.2 Release Notes for details. Alternatively, you can look at the complete change
log for this release or the Jira issue log for all releases.

23 August, 2010: release 0.21.0 available

This release contains many improvements, new features, bug fixes and optimizations.
It has not undergone testing at scale and should not be considered stable or suitable for production.
This release is being classified as a minor release, which means that it should be API compatible with
0.20.2.

See the Hadoop 0.21.0 Release Notes for details. Alternatively, you can look at the complete change
log for this release or the Jira issue log for all releases.

11 May, 2011: release 0.20.203.0 available


This release contains many improvements, new features, bug fixes and optimizations. It is stable and
has been deployed in large (4,500 machine) production clusters.

5 Sep, 2011: release 0.20.204.0 available

This release contains improvements, new features, bug fixes and optimizations. This release includes
rpms and debs for the first time.

See the Hadoop 0.20.204.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release.

Notes:

• The RPMs don't work with security turned on. (HADOOP-7599)


• The NameNode's edit log needs to be merged into the image by:
• putting the NameNode into safe mode
• running the dfsadmin -saveNamespace command
• performing a normal upgrade

See the Hadoop 0.20.203.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release or the Jira issue log for all releases.

17 Oct, 2011: release 0.20.205.0 available

This release contains improvements, new features, bug fixes and optimizations. This release includes
rpms and debs, all duly checksummed and securely signed.

See the Hadoop 0.20.205.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release.

Notes:

• This release includes a merge of append/hsynch/hflush capabilities from the 0.20-append
branch, to support HBase in secure mode.
• This release includes the new webhdfs file system, but webhdfs write calls currently
fail in secure mode.

11 Nov, 2011: release 0.23.0 available

This is the alpha version of the hadoop-0.23 major release. This is the first release we've made off
Apache Hadoop trunk in a long while. This release is alpha-quality and not yet ready for serious use.

hadoop-0.23 contains several major advances:

• HDFS Federation
• NextGen MapReduce (YARN)

It also has several major performance improvements to both HDFS and MapReduce. See the Hadoop
0.23.0 Release Notes for details.

10 December, 2011: release 0.22.0 available


This release contains many bug fixes and optimizations compared to its predecessor 0.21.0.
See the Hadoop 0.22.0 Release Notes for details. Alternatively, you can look at the complete
change log for this release

Notes:
The following features are not supported in Hadoop 0.22.0:

• Latest optimizations of the MapReduce framework introduced in the Hadoop 0.20.security
line of releases.
• Disk-fail-in-place.
• JMX-based metrics v2.
• Security.

Hadoop 0.22.0 features

• HBase support with hflush and hsync.


• New implementation of file append.
• Symbolic links.
• BackupNode and CheckpointNode.
• Hierarchical job queues.
• Job limits per queue/pool.
• Dynamically stop/start job queues.
• Advances in the new mapreduce API: Input/Output formats, ChainMapper/Reducer.
• TaskTracker blacklisting.
• DistributedCache sharing.

Problems with Version 0.x and its improvements:

• HADOOP-7156. Critical bug reported by tlipcon and fixed by tlipcon


getpwuid_r is not thread-safe on RHEL6

Due to the following bug in SSSD, functions like getpwuid_r are not thread-safe in
RHEL 6.0 if sssd is specified in /etc/nsswitch.conf (as it is by default):

https://fedorahosted.org/sssd/ticket/640

This causes many fetch failures in the case that the native libraries are available,
since the SecureIO functions call getpwuid_r as part of fstat. By enabling -Xcheck:jni
I get the following trace on JVM crash:

*** glibc detected *** /mnt/toolchain/JDK6u20-64bit/bin/java: free(): invalid pointer: 0x0000003575741d23 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3575675676]
/lib64/libnss_sss.so.2(_nss_sss_getpwuid_r+0x11b)[0x7fe716cb42cb]
/lib64/libc.so.6(getpwuid_r+0xdd)[0x35756a5dfd]

• HADOOP-7097. Blocker bug reported by nwatkins and fixed by nwatkins (build, native)
java.library.path missing basedir

My Hadoop installation is having trouble loading the native code library. It appears
from the log below that java.library.path is missing the basedir in its path. The
libraries are built, and present in the directory shown below (relative to hadoop-
common directory). Instead of seeing:

/build/native/Linux-amd64-64/lib

I would expect to see:

/path/to/hadoop-common/build/native/Linux-amd64-64/lib

I'm working in branch-0.22.

2011-01-10 17:09:27,695 DEBUG org.apache.hadoop.util.NativeCodeLoader: Failed to load
native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
2011-01-10 17:09:27,695 DEBUG org.apache.hadoop.util.NativeCodeLoader:
java.library.path=/build/native/Linux-amd64-64/lib
2011-01-10 17:09:27,695 WARN org.apache.hadoop.util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java classes where
applicable

• HADOOP-7091. Major bug reported by kzhang and fixed by kzhang (security)


reloginFromKeytab() should happen even if TGT can't be found

HADOOP-6965 introduced a getTGT() method and prevents reloginFromKeytab()
from happening when TGT is not found. This results in the RPC layer not being able
to refresh TGT after TGT expires. The reason is RPC layer only does relogin when the
expired TGT is used and an exception is thrown. However, when that happens, the
expired TGT will be removed from Subject. Therefore, getTGT() will return null and
relogin will not be performed. We observed, for example, JT will not be able to re-
connect to NN after TGT expires.

• HADOOP-6989. Major bug reported by jghoman and fixed by chris.douglas


TestSetFile is failing on trunk

Testsuite: org.apache.hadoop.io.TestSetFile
Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.015 sec
------------- Standard Output ---------------
2010-10-04 16:32:01,030 INFO io.TestSetFile (TestSetFile.java:generate(56)) -
generating 10000 records in memory
2010-10-04 16:32:01,249 INFO io.TestSetFile (TestSetFile.java:generate(63)) - sorting
10000 records
2010-10-04 16:32:01,350 INFO io.TestSetFile (TestSetFile.java:writeTest(72)) -
creating with 10000 records
------------- ---------------- ---------------

Testcase: testSetFile took 0.964 sec


Caused an ERROR
key class or comparator option must be set
java.lang.IllegalArgumentException: key class or comparator option must be set
at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:247)
at org.apache.hadoop.io.SetFile$Writer.<init>(SetFile.java:60)
at org.apache.hadoop.io.TestSetFile.writeTest(TestSetFile.java:73)
at org.apache.hadoop.io.TestSetFile.testSetFile(TestSetFile.java:45)

• HDFS-1562. Major test reported by eli and fixed by eli (name-node, test)
Add rack policy tests

The existing replication tests (TestBlocksWithNotEnoughRacks,
TestPendingReplication, TestOverReplicatedBlocks, TestReplicationPolicy,
TestUnderReplicatedBlocks, and TestReplication) are missing tests for rack policy
violations. This jira adds the following tests which I created when generating a new
patch for HDFS-15.

* Test that blocks that have a sufficient number of total replicas, but are not
replicated cross rack, get replicated cross rack when a rack becomes available.
* Test that new blocks for an underreplicated file will get replicated cross rack.
* Mark a block as corrupt, test that when it is re-replicated that it is still replicated
across racks.
* Reduce the replication factor of a file, making sure that the only block that is
across racks is not removed when deleting replicas.
* Test that when a block is replicated because a replica is lost due to host failure,
the rack policy is preserved.
* Test that when the excess replicas of a block are reduced due to a node re-joining
the cluster the rack policy is not violated.
* Test that rack policy is still respected when blocks are replicated due to node
decommissioning.
* Test that rack policy is still respected when blocks are replicated due to node
decommissioning, even when the blocks are over-replicated.

• HADOOP-7861. Major improvement reported by shv and fixed by shv (documentation)


changes2html.pl should generate links to HADOOP, HDFS, and MAPREDUCE jiras

changes2html.pl correctly generates links to HADOOP jiras only. This hasn't been
updated since projects split.

• HADOOP-7786. Major improvement reported by eli and fixed by eli


Remove HDFS-specific configuration keys defined in FsConfig

HADOOP-4952 added a couple HDFS-specific configuration values to common (the
block size and the replication factor) that conflict with the HDFS values (eg have the
wrong defaults, wrong key name), are not used by common or hdfs and should be
removed. After removing these I noticed the rest of FsConfig is only used once
outside a test, and isn't tagged as a public API, I think we can remove it entirely.

LET US LOOK AT VERSION 1.X


27 December, 2011: release 1.0.0 available

After six years of gestation, Hadoop reaches 1.0.0! This release is from the 0.20-security code line,
and includes support for:

• security
• HBase (append/hsynch/hflush, and security)
• webhdfs (with full support for security)
• performance enhanced access to local files for HBase
• other performance enhancements, bug fixes, and features

Please see the complete Hadoop 1.0.0 Release Notes for details.

27 Feb, 2012: release 0.23.1 available

This is the second alpha version of the hadoop-0.23 major release after the first alpha 0.23.0. This
release has significant improvements compared to 0.23.0 but should still be considered as alpha-
quality and not for production use.

hadoop-0.23.1 contains several major advances from 0.23.0:

• Lots of bug fixes and improvements in both HDFS and MapReduce


• Major performance work to make this release either match or exceed performance of
Hadoop-1 in most aspects of both HDFS and MapReduce.

• Several downstream projects like HBase, Pig, Oozie, Hive etc. are better integrated
with this release.

See the Hadoop 0.23.1 Release Notes for details.

3 Apr, 2012: Release 1.0.2 available

This is a bug fix release for version 1.0.


Bug fixes and feature enhancements in this minor release include:

• Snappy compressor/decompressor is available


• Occasional deadlock in metrics serving thread fixed
• 64-bit secure datanodes failed to start, now fixed
• Changed package names for 64-bit rpm/debs to use ".x86_64." instead of ".amd64."

16 May, 2012: Release 1.0.3 available

This is a bug fix release for version 1.0.


Bug fixes and feature enhancements in this minor release include:

• 4 patches in support of non-Oracle JDKs


• several patches to clean up error handling and log messages
• various production issue fixes

23 May, 2012: Release 2.0.0-alpha available

This is the first (alpha) version in the hadoop-2.x series.


This delivers significant major features over the currently stable hadoop-1.x series including:

• HDFS HA for NameNode (manual failover)


• YARN aka NextGen MapReduce
• HDFS Federation
• Performance
• Wire-compatibility for both HDFS and YARN/MapReduce (using protobufs)

12 October, 2012: Release 1.0.4 available

This is a Security Patch release for version 1.0.


There are four bug fixes and feature enhancements in this minor release:

• Security issue CVE-2012-4449: Hadoop tokens use a 20-bit secret


• HADOOP-7154 - set MALLOC_ARENA_MAX in hadoop-config.sh to resolve problems
with glibc in RHEL-6
• HDFS-3652 - FSEditLog failure removes the wrong edit stream when storage dirs
have same name
• MAPREDUCE-4399 - Fix (up to 3x) performance regression in shuffle

13 October, 2012: Release 1.1.0 available

This is a beta release for version 1.1.

This release has approximately 135 enhancements and bug fixes compared to Hadoop-1.0.4,
including:

• Many performance improvements in HDFS, backported from trunk


• Improvements in Security to use SPNEGO instead of Kerberized SSL for HTTP transactions
• Lower default minimum heartbeat for task trackers from 3 sec to 300msec to increase job
throughput on small clusters
• Port of Gridmix v3
• Set MALLOC_ARENA_MAX in hadoop-config.sh to resolve problems with glibc in RHEL-6
• Splittable bzip2 files
• Of course it also has the same security fix as release 1.0.4.

Bugs till Version 1.1.0 and improvements:


• HADOOP-5464. Major bug reported by rangadi and fixed by rangadi
DFSClient does not treat write timeout of 0 properly

Zero values for dfs.socket.timeout and dfs.datanode.socket.write.timeout are now
respected. Previously zero values for these parameters resulted in a 5 second timeout.

• HDFS-3518. Major bug reported by bikassaha and fixed by szetszwo (hdfs client)
Provide API to check HDFS operational state

Add a utility method HdfsUtils.isHealthy(uri) for checking if the given HDFS is healthy.

• HDFS-3522. Major bug reported by brandonli and fixed by brandonli (name-node)


If NN is in safemode, it should throw SafeModeException when getBlockLocations has zero
locations

getBlockLocations(), and hence open() for read, will now throw SafeModeException
if the NameNode is still in safe mode and there are no replicas reported yet for one
of the blocks in the file.

• MAPREDUCE-4087. Major bug reported by ravidotg and fixed by ravidotg


[Gridmix] GenerateDistCacheData job of Gridmix can become slow in some cases

Fixes the issue of GenerateDistCacheData job slowness.

• MAPREDUCE-4673. Major bug reported by arpitgupta and fixed by arpitgupta (test)


make TestRawHistoryFile and TestJobHistoryServer more robust

Fixed TestRawHistoryFile and TestJobHistoryServer to not write to /tmp.

• MAPREDUCE-4675. Major bug reported by arpitgupta and fixed by bikassaha (test)


TestKillSubProcesses fails as the process is still alive after the job is done

Fixed a race condition in TestKillSubProcesses caused by a recent commit.

• HADOOP-5836. Major bug reported by nowland and fixed by nowland (fs/s3)


Bug in S3N handling of directory markers using an object with a trailing "/" causes jobs to
fail

Some tools which upload to S3 use an object terminated with a "/" as a directory
marker, for instance "s3n://mybucket/mydir/". If asked to iterate that "directory"
via listStatus(), the current code will return an empty file "", which the
InputFormatter happily assigns to a split, and which later causes a task to fail, and
probably the job to fail.

• HADOOP-6527. Major bug reported by jghoman and fixed by ivanmi (security)


UserGroupInformation::createUserForTesting clobbers already defined group mappings

In UserGroupInformation::createUserForTesting the following code creates a new
groups instance, obliterating any groups that have been previously defined in the
static groups field.
{code} if (!(groups instanceof TestingGroups)) {
groups = new TestingGroups();
}
{code}
This becomes a problem in tests that start a Mini{DFS,MR}Cluster and then create a
testing user. The user that started the cluster (generally the real user running the test)
immediately has their groups wiped out and is...

• HADOOP-6995. Minor improvement reported by tlipcon and fixed by tlipcon (security)


Allow wildcards to be used in ProxyUsers configurations

When configuring proxy users and hosts, the special wildcard value "*" may be
specified to match any host or any user.

• HADOOP-8230. Major improvement reported by eli2 and fixed by eli


Enable sync by default and disable append

Append is not supported in Hadoop 1.x. Please upgrade to 2.x if you need append. If
you enabled dfs.support.append for HBase, you're OK, as durable sync (why HBase
required dfs.support.append) is now enabled by default. If you really need the
previous functionality, to turn on the append functionality set the flag
"dfs.support.broken.append" to true.

• HADOOP-8365. Blocker improvement reported by eli2 and fixed by eli


Add flag to disable durable sync

This patch enables durable sync by default. Installations that did not use HBase and
previously ran without setting "dfs.support.append" (or set it to false explicitly in the
configuration) must add a new flag "dfs.durable.sync" and set it to false to preserve the
previous semantics.

• HDFS-2465. Major improvement reported by tlipcon and fixed by tlipcon (data-node, performance)
Add HDFS support for fadvise readahead and drop-behind

HDFS now has the ability to use posix_fadvise and sync_data_range syscalls to
manage the OS buffer cache. This support is currently considered experimental, and
may be enabled by configuring the following keys:
dfs.datanode.drop.cache.behind.writes - set to true to drop data out of the buffer
cache after writing
dfs.datanode.drop.cache.behind.reads - set to true to drop data out of the buffer
cache when performing sequential reads
dfs.datanode.sync.behind.writes - set to true to trigger dirty page writeback
immediately after writing data
dfs.datanode.readahead.bytes - set to a non-zero value to trigger readahead for
sequential reads

• HDFS-2617. Major improvement reported by jghoman and fixed by jghoman (security)


Replaced Kerberized SSL for image transfer and fsck with SPNEGO-based solution

Due to the requirement that KSSL use weak encryption types for Kerberos tickets,
HTTP authentication to the NameNode will now use SPNEGO by default. This will
require users of previous branch-1 releases with security enabled to modify their
configurations and create new Kerberos principals in order to use SPNEGO. The old
behavior of using KSSL can optionally be enabled by setting the configuration option
"hadoop.security.use-weak-http-crypto" to "true".

HADOOP VERSION 2.X


14 February, 2013: Release 2.0.3-alpha available

This is the latest (alpha) version in the hadoop-2.x series.

This release delivers significant major features and stability over previous releases in hadoop-2.x
series:

• QJM for HDFS HA for NameNode


• Multi-resource scheduling (CPU and memory) for YARN
• YARN ResourceManager Restart
• Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so
far, at time of release)

This release, like previous releases in the hadoop-2.x series, is still considered alpha, primarily
because some of the APIs aren't fully baked and we expect some churn in the future. Furthermore,
please note that there are some API changes from the previous hadoop-2.0.2-alpha release, and
applications will need to recompile against hadoop-2.0.3-alpha.

6 June, 2013: Release 2.0.5-alpha available

This release delivers a number of critical bug-fixes for hadoop-2.x uncovered during integration
testing of previous release.

23 August, 2013: Release 2.0.6-alpha available

This release delivers a number of critical bug-fixes for hadoop-2.x uncovered during integration
testing of previous release.
25 August, 2013: Release 2.1.0-beta available

Apache Hadoop 2.1.0-beta is the beta release of Apache Hadoop 2.x.

Users are encouraged to immediately move to 2.1.0-beta, since this release is significantly more
stable and has a completely vetted set of APIs and wire-protocols for future compatibility.

In addition, this release has a number of other significant highlights:

• HDFS Snapshots
• Support for running Hadoop on Microsoft Windows
• YARN API stabilization
• Binary Compatibility for MapReduce applications built on hadoop-1.x
• Substantial amount of integration testing with rest of projects in the ecosystem

15 October, 2013: Release 2.2.0 available

Apache Hadoop 2.2.0 is the GA release of Apache Hadoop 2.x.


Users are encouraged to immediately move to 2.2.0 since this release is significantly more stable and
is guaranteed to remain compatible in terms of both APIs and protocols.

To recap, this release has a number of significant highlights compared to Hadoop 1.x:

• YARN - A general-purpose resource management system for Hadoop that allows
MapReduce and other data processing frameworks and services
• High Availability for HDFS
• HDFS Federation
• HDFS Snapshots
• NFSv3 access to data in HDFS
• Support for running Hadoop on Microsoft Windows
• Binary Compatibility for MapReduce applications built on hadoop-1.x
• Substantial amount of integration testing with rest of projects in the ecosystem

A couple of important points to note while upgrading to hadoop-2.2.0:

• HDFS - The HDFS community decided to push the symlinks feature out to a future 2.3.0
release; it is currently disabled.
• YARN/MapReduce - Users need to change ShuffleHandler service name from
mapreduce.shuffle to mapreduce_shuffle.

20 February, 2014: Release 2.3.0 available

Apache Hadoop 2.3.0 contains a number of significant enhancements such as:

• Support for Heterogeneous Storage hierarchy in HDFS.


• In-memory cache for HDFS data with centralized administration and management.
• Simplified distribution of MapReduce binaries via HDFS in YARN Distributed Cache.
30 June, 2014: Release 2.4.1 available

Apache Hadoop 2.4.1 is a bug-fix release for the stable 2.4.x line.

There is also a security bug fix in this minor release.

• CVE-2014-0229: Add privilege checks to HDFS admin sub-commands refreshNamenodes,
deleteBlockPool and shutdownDatanode.

Users are encouraged to immediately move to 2.4.1.

11 August, 2014: Release 2.5.0 available

Apache Hadoop 2.5.0 is a minor release in the 2.x release line.

The release includes the following major features and improvements:

• Authentication improvements when using an HTTP proxy server.


• A new Hadoop Metrics sink that allows writing directly to Graphite.
• Specification for Hadoop Compatible Filesystem effort.
• Support for POSIX-style filesystem extended attributes.
• OfflineImageViewer to browse an fsimage via the WebHDFS API.
• Supportability improvements and bug fixes to the NFS gateway.
• Modernized web UIs (HTML5 and Javascript) for HDFS daemons.
• YARN's REST APIs support submitting and killing applications.
• Kerberos integration for the YARN's timeline store.
• FairScheduler allows creating user queues at runtime under any specified parent queue.

Users are encouraged to try out 2.5.0

18 November, 2014: Release 2.6.0 available

Apache Hadoop 2.6.0 contains a number of significant enhancements such as:

• Hadoop Common
• HADOOP-10433 - Key management server (beta)
• HADOOP-10607 - Credential provider (beta)
• Hadoop HDFS
• Heterogeneous Storage Tiers - Phase 2
• HDFS-5682 - Application APIs for heterogeneous storage
• HDFS-7228 - SSD storage tier
• HDFS-5851 - Memory as a storage tier (beta)
• HDFS-6584 - Support for Archival Storage
• HDFS-6134 - Transparent data at rest encryption (beta)
• HDFS-2856 - Operating secure DataNode without requiring root access
• HDFS-6740 - Hot swap drive: support add/remove data node volumes without restarting
data node (beta)


• HDFS-6606 - AES support for faster wire encryption
• Hadoop YARN

• YARN-896 - Support for long running services in YARN

• YARN-913 - Service Registry for applications

• YARN-666 - Support for rolling upgrades


• YARN-556 - Work-preserving restarts of ResourceManager
• YARN-1336 - Container-preserving restart of NodeManager
• YARN-796 - Support node labels during scheduling
• YARN-1051 - Support for time-based resource reservations in Capacity Scheduler (beta)

• YARN-1964 - Support running of applications natively in Docker containers (alpha)

21 April 2015: Release 2.7.0 available

Apache Hadoop 2.7.0 contains a number of significant enhancements. A few of them are noted
below.

• IMPORTANT notes
• This release drops support for JDK6 runtime and works with JDK 7+ only.
• This release is not yet ready for production use. Critical issues are being ironed out
via testing and downstream adoption. Production users should wait for a 2.7.1/2.7.2
release.
• Hadoop Common

• HADOOP-9629 - Support Windows Azure Storage - Blob as a file system in Hadoop.

• Hadoop HDFS
• HDFS-3107 - Support for file truncate
• HDFS-7584 - Support for quotas per storage type
• HDFS-3689 - Support for files with variable-length blocks
• Hadoop YARN
• YARN-3100 - Make YARN authorization pluggable
• YARN-1492 - Automatic shared, global caching of YARN localized resources (beta)
• Hadoop MapReduce
• MAPREDUCE-5583 - Ability to limit running Map/Reduce tasks of a job
• MAPREDUCE-4815 - Speed up FileOutputCommitter for very large jobs with many
output files.
Why Version 2.x is better than Version 1.x

1. New Components and API
• Hadoop 1 – Introduced before Hadoop 2, so it has fewer components and APIs.
• Hadoop 2 – Introduced after Hadoop 1, so it has more components and APIs, such as the
YARN API, the YARN framework, and an enhanced Resource Manager.

2. Processing Model Support
• Hadoop 1 – Only supports the MapReduce processing model in its architecture; it does not
support non-MapReduce tools.
• Hadoop 2 – Allows working with the MapReduce model as well as other distributed computing
models like Spark, Hama, Giraph, MPI (Message Passing Interface) and HBase coprocessors.

3. Resource Management
• Hadoop 1 – MapReduce is responsible for both processing and cluster-resource management.
• Hadoop 2 – YARN is used for cluster resource management, while processing is handled by
different processing models.

4. Scalability
• Hadoop 1 – Limited to about 4,000 nodes per cluster.
• Hadoop 2 – Better scalability; scales up to about 10,000 nodes per cluster.

5. Implementation
• Hadoop 1 – Follows the concept of slots, which can be used to run only a Map task or a
Reduce task.
• Hadoop 2 – Follows the concept of containers, which can be used to run generic tasks.

6. Windows Support
• Hadoop 1 – No support for Microsoft Windows provided by Apache.
• Hadoop 2 – Apache added support for Microsoft Windows.
BUGS OF VERSION 2.X


• YARN-5462. Major bug reported by Eric Badger and fixed by Eric Badger
TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown fails
intermittently
• YARN-5353. Critical bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
ResourceManager can leak delegation tokens when they are shared across apps
• YARN-5262. Major bug reported by Rohith Sharma K S and fixed by Rohith Sharma K S
(resourcemanager)
Optimize sending RMNodeFinishedContainersPulledByAMEvent for every AM heartbeat
• YARN-5206. Minor bug reported by Steve Loughran and fixed by Steve Loughran (client ,
security)
RegistrySecurity includes id:pass in exception text if considered invalid
• YARN-5197. Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
RM leaks containers if running container disappears from node update
• YARN-5009. Major bug reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
NMLeveldbStateStoreService database can grow substantially leading to longer recovery
times
• YARN-4794. Critical bug reported by Sumana Sathish and fixed by Jian He
Deadlock in NMClientImpl
• YARN-4785. Major bug reported by Jayesh and fixed by Varun Vasudev (webapp)
inconsistent value type of the "type" field for LeafQueueInfo in response of RM REST API -
cluster/scheduler
• YARN-4773. Minor bug reported by Jason Lowe and fixed by Jun Gong (nodemanager)
Log aggregation performs extraneous filesystem operations when rolling log aggregation is
disabled
• YARN-4761. Major bug reported by Sangjin Lee and fixed by Sangjin Lee (fairscheduler)
NMs reconnecting with changed capabilities can lead to wrong cluster resource
calculations on fair scheduler
• YARN-4722. Major bug reported by Jason Lowe and fixed by Jason Lowe
AsyncDispatcher logs redundant event queue sizes

HADOOP VERSION 3.X


26 May 2017: Release 3.0.0-alpha3 available

This is a security release in the 3.0.0 release line. It consists of alpha2 plus security fixes, along with
necessary build-related fixes. Users on 3.0.0-alpha1 and 3.0.0-alpha2 are encouraged to upgrade to
3.0.0-alpha3.

Please note that alpha releases come with no guarantees of quality or API stability, and are not
intended for production use.
Users are encouraged to read the overview of major changes coming in 3.0.0. The alpha3 release
notes and changelog detail the changes since 3.0.0-alpha2.

07 July 2017: Release 3.0.0-alpha4 available

This is the fourth alpha release in the 3.0.0 release line. It consists of 814 bug fixes, improvements,
and other enhancements since 3.0.0-alpha3. This is planned to be the final alpha release, with the
next release being 3.0.0-beta1.

Please note that alpha releases come with no guarantees of quality or API stability, and are not
intended for production use.

Users are encouraged to read the overview of major changes coming in 3.0.0. The alpha4 release
notes and changelog detail the changes since 3.0.0-alpha3.

03 October 2017: Release 3.0.0-beta1 available

This is the first beta release in the 3.0.0 release line. It consists of 576 bug fixes, improvements, and
other enhancements since 3.0.0-alpha4. This is planned to be the only beta release, with the next
release being 3.0.0 GA.

Please note that beta releases are API stable but come with no guarantees of quality, and are not
intended for production use.

Users are encouraged to read the overview of major changes coming in 3.0.0. The beta1 release
notes and changelog detail the changes since 3.0.0-alpha4.

13 December 2017: Release 3.0.0 generally available

After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug
fixes, improvements, and other enhancements since 3.0.0-beta1. Altogether, 6242 issues were fixed
as part of the 3.0.0 release series since 2.7.0.

Users are encouraged to read the overview of major changes in 3.0.0. The GA release notes and
changelog detail the changes since 3.0.0-beta1.

25 March 2018: Release 3.0.1 available

This is the next release of Apache Hadoop 3.0 line. It contains 49 bug fixes, improvements and
enhancements since 3.0.0.

Please note: 3.0.0 is deprecated after 3.0.1 because HDFS-12990 changes NameNode default RPC
port back to 8020.

Users are encouraged to read the overview of major changes since 3.0.0. For details of the 49 bug
fixes, improvements, and other enhancements since the previous 3.0.0 release, please check the
release notes and changelog, which detail the changes since 3.0.0.

31 May 2018: Release 3.0.3 available


This is the next release of the Apache Hadoop 3.0 line. It contains 249 bug fixes, improvements and
other enhancements since 3.0.2.

Users are encouraged to read the overview of major changes since 3.0.2. For details of the 249 bug
fixes, improvements, and other enhancements since the previous 3.0.2 release, please check the
release notes and changelog, which detail the changes since 3.0.2.

Release 3.2.2 available

2021 Jan 9
This is the second stable release of Apache Hadoop 3.2 line. It contains 516 bug fixes, improvements
and enhancements since 3.2.1.
Users are encouraged to read the overview of major changes since 3.2.1. For details of the 516 bug
fixes, improvements, and other enhancements since the previous 3.2.1 release, please check the
release notes and changelog, which detail the changes since 3.2.1.

Release 3.1.4 available


2020 Aug 3
This is the second stable release of Apache Hadoop 3.1 line. It contains 308 bug fixes, improvements
and enhancements since 3.1.3.
Users are encouraged to read the overview of major changes since 3.1.3. For details of 308 bug fixes,
improvements, and other enhancements since the previous 3.1.3 release, please check release
notes and changelog.
Release 3.3.0 available
2020 Jul 14
This is the first release of Apache Hadoop 3.3 line. It contains 2148 bug fixes, improvements and
enhancements since 3.2.
Users are encouraged to read the overview of major changes. For details, please check the release
notes and changelog.

What is new in Version 3.x

1. JDK 8.0 is the Minimum JAVA Version Supported by Hadoop 3.x

Since Oracle ended public updates for JDK 7 in 2015, users have to upgrade their Java version to
JDK 8 or above to compile and run all the Hadoop 3 files. JDK versions below 8 are no longer
supported for Hadoop 3.

2. Erasure Coding is Supported

Erasure coding is used to recover data when a hard disk fails. It is a RAID-like (Redundant Array of
Independent Disks) technique used by many IT companies to recover their data. The Hadoop
Distributed File System (HDFS) uses erasure coding to provide fault tolerance in the Hadoop cluster.
Since commodity hardware is used to build Hadoop clusters, node failure is normal. Hadoop 2 uses a
replication mechanism to provide the same kind of fault tolerance that erasure coding provides in
Hadoop 3.
In Hadoop 2, replicas of the data blocks are made and stored on different nodes in the Hadoop
cluster. Erasure coding consumes about half the storage of Hadoop 2's replication while providing
the same level of fault tolerance. With the increasing amount of data in the industry, developers can
save a large amount of storage with erasure coding: it reduces the hard-disk requirement to roughly
50% storage overhead with similar resources provided.
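
From the client side, an erasure coding policy is applied per directory. The sketch below is a minimal,
hypothetical example using the DistributedFileSystem API of Hadoop 3.x; the NameNode URI, the
directory, and the RS-6-3-1024k policy name are assumptions for illustration, and the policy may
need to be enabled on the cluster before it can be set.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ErasureCodingExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode URI; replace with your cluster's address.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new URI("hdfs://namenode:8020"), conf);

    Path coldData = new Path("/cold-data");   // hypothetical directory
    dfs.mkdirs(coldData);

    // Reed-Solomon 6+3: files written under this directory are striped into
    // 6 data blocks plus 3 parity blocks instead of being replicated 3x.
    dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");

    System.out.println("Policy on " + coldData + ": " + dfs.getErasureCodingPolicy(coldData));
    dfs.close();
  }
}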

3. More Than Two NameNodes Supported

The previous version of Hadoop supports a single active NameNode and a single standby NameNode.
In the latest version of Hadoop, i.e. Hadoop 3.x, the NameNode edits are replicated among three
JournalNodes (JNs). With this, the Hadoop 3.x architecture is better able to handle fault tolerance
than its previous version, which makes Hadoop 3.x very useful for big data problems where high
fault tolerance is needed. In Hadoop 3.x, users can manage the number of standby NameNodes
according to their requirements, since the facility of multiple standby nodes is provided.
For example, developers can now easily configure three NameNodes and five JournalNodes, so that
the Hadoop cluster can tolerate the failure of two NameNodes rather than a single one.

4. Shell Script Rewriting

The Hadoop file system provides various shell commands that interact directly with HDFS and the
other file systems Hadoop supports, such as WebHDFS, the local FS, S3 FS, etc. Multiple Hadoop
functions are controlled by the shell. The shell scripts in the latest version of Hadoop, i.e. Hadoop
3.x, have been rewritten, fixing lots of bugs and making the scripts easier to extend and maintain.
5. Timeline Service v.2 for YARN
The YARN Timeline Service stores and retrieves application information (the information can be
ongoing or historical). Timeline Service v.2 is important for improving the reliability and scalability
of Hadoop, and system usability is enhanced with the help of flows and aggregation. With Timeline
Service v.1, users could only run a single instance of the reader/writer and storage, an architecture
that cannot be scaled further.
Timeline Service v.2 uses a distributed writer architecture where data read and write operations are
separable, and distributed collectors are provided for every YARN (Yet Another Resource
Negotiator) application. It uses HBase for storage, which can be scaled to massive size while
providing good response times for read and write operations.

6. Filesystem Connector Support

Hadoop 3.x now supports Azure Data Lake and the Aliyun Object Storage System, which are
additional options for Hadoop-compatible filesystems.
7. Default Multiple Service Ports Have Been Changed

In previous versions of Hadoop, several Hadoop service ports were in the Linux ephemeral port
range (32768-61000). With this kind of configuration, a conflict caused by some other application
could sometimes make a service fail to bind to its port. To overcome this problem, Hadoop 3.x has
moved the conflicting ports out of the Linux ephemeral port range, and new ports have been
assigned as shown below.
// The newly assigned ports
Namenode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Datanode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
Secondary NN ports: 50091 -> 9869, 50090 -> 9868

8. Intra-DataNode Balancer

DataNodes are used in the Hadoop cluster for storage, and a DataNode handles multiple disks at a
time. These disks get filled evenly during write operations, but adding or removing a disk can cause
significant skew within a DataNode. The existing HDFS balancer cannot handle this, since it concerns
itself with inter-, not intra-, DataNode skew. The new intra-DataNode balancing feature handles this
situation and is invoked via the HDFS disk balancer CLI.

9. Shaded Client Jars

New hadoop-client-api and hadoop-client-runtime artifacts are available in Hadoop 3.x, which
provide the Hadoop dependencies in a single package or jar file. In Hadoop 3.x, hadoop-client-api
has compile-time scope while hadoop-client-runtime has runtime scope, and both shade the
third-party dependencies that hadoop-client used to pull in. Developers can now bundle all the
dependencies in a single jar file and easily test the jars for any version conflicts. In this way,
leaking Hadoop's dependencies onto the application classpath can be avoided.

10. Task Heap and Daemon Management

Hadoop 3.x adds new ways to configure the Hadoop daemon heap sizes, including auto-tuning based
on the memory size of the host. Instead of HADOOP_HEAPSIZE, developers can use the
HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. The internal JAVA_HEAP_SIZE
variable has also been removed in Hadoop 3.x, and the hard-coded default heap sizes are gone,
allowing auto-tuning by the JVM (Java Virtual Machine). To use the older default, enable it by
configuring HADOOP_HEAPSIZE_MAX in the hadoop-env.sh file.

DIFFERENCE BETWEEN HADOOP 3.X AND 2.X


i. License

Hadoop 2.x – Apache 2.0, Open Source



Hadoop 3.x – Apache 2.0, Open Source

ii. Minimum supported version of Java

• Hadoop 2.x – Minimum supported version of java is java 7.


• Hadoop 3.x – Minimum supported version of java is java 8
iii. Fault Tolerance

• Hadoop 2.x – Fault tolerance is handled by replication (which wastes space).
• Hadoop 3.x – Fault tolerance is handled by erasure coding.
iv. Data Balancing

• Hadoop 2.x – For data balancing, uses the HDFS balancer.
• Hadoop 3.x – For data balancing, uses the intra-DataNode balancer, which is invoked via
the HDFS disk balancer CLI.
v. Storage Scheme

• Hadoop 2.x – Uses a 3x replication scheme.
• Hadoop 3.x – Supports erasure encoding in HDFS.
vi. Storage Overhead

• Hadoop 2.x – HDFS has 200% overhead in storage space.


• Hadoop 3.x – Storage overhead is only 50%.
vii. Storage Overhead Example

• Hadoop 2.x – If there are 6 blocks, 18 blocks of storage are occupied because of the 3x
replication scheme.
• Hadoop 3.x – If there are 6 blocks, only 9 blocks of storage are occupied: 6 data blocks and
3 parity blocks (see the worked calculation below).
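
A quick worked check of the overhead figures quoted above, with storage overhead defined as the
extra blocks divided by the data blocks:

\[
\text{overhead} = \frac{\text{blocks stored} - \text{data blocks}}{\text{data blocks}},\qquad
\text{3x replication: } \frac{18-6}{6} = 200\%,\qquad
\text{RS}(6,3)\text{ erasure coding: } \frac{9-6}{6} = 50\%.
\]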
viii. YARN Timeline Service

• Hadoop 2.x – Uses an old timeline service which has scalability issues.
• Hadoop 3.x – Uses Timeline Service v.2, which improves the scalability and reliability of
the timeline service.
ix. Default Ports Range

• Hadoop 2.x – In Hadoop 2.0 some default ports are in the Linux ephemeral port range, so at
startup they may fail to bind.
• Hadoop 3.x – In Hadoop 3.0 these ports have been moved out of the ephemeral range.
x. Tools

Hadoop 2.x – Uses Hive, Pig, Tez, Hama, Giraph and other Hadoop tools.

Hadoop 3.x – Hive, Pig, Tez, Hama, Giraph and other Hadoop tools are available.

xi. Compatible File System

• Hadoop 2.x – HDFS (default FS), the FTP file system (which stores all its data on remotely
accessible FTP servers), the Amazon S3 (Simple Storage Service) file system, and the
Windows Azure Storage Blobs (WASB) file system.
• Hadoop 3.x – It supports all of the previous ones as well as the Microsoft Azure Data Lake
filesystem.
xii. Datanode Resources

• Hadoop 2.x – DataNode resources are not dedicated to MapReduce; they can be used for
other applications.
• Hadoop 3.x – Here too, DataNode resources can be used for other applications.
xiii. MR API Compatibility

• Hadoop 2.x – The MR API is compatible, so Hadoop 1.x programs can execute on Hadoop 2.x.
• Hadoop 3.x – Here too, the MR API is compatible, so Hadoop 1.x programs can execute on
Hadoop 3.x.
xiv. Support for Microsoft Windows

• Hadoop 2.x – It can be deployed on Windows.
• Hadoop 3.x – It also supports Microsoft Windows.
xv. Slots/Container

• Hadoop 2.x – Hadoop 1 works on the concept of slots, but Hadoop 2.x works on the
concept of containers; through containers, we can run generic tasks.
• Hadoop 3.x – It also works on the concept of containers.
xvi. Single Point of Failure

• Hadoop 2.x – Has features to overcome the SPOF: whenever the NameNode fails, it recovers
automatically.
• Hadoop 3.x – Also has features to overcome the SPOF: whenever the NameNode fails, it
recovers automatically, with no manual intervention needed.
xvii. HDFS Federation

• Hadoop 2.x – In Hadoop 1.0 there was only a single NameNode to manage the entire
namespace, but Hadoop 2.0 supports multiple NameNodes for multiple namespaces.
• Hadoop 3.x – Hadoop 3.x also has multiple NameNodes for multiple namespaces.
xviii. Scalability

• Hadoop 2.x – We can scale up to 10,000 nodes per cluster.
• Hadoop 3.x – Better scalability; we can scale to more than 10,000 nodes per cluster.
xix. Faster Access to Data

• Hadoop 2.x – Due to DataNode caching, we can access the data quickly.
• Hadoop 3.x – Here too, DataNode caching allows quick access to the data.
xx. HDFS Snapshot

• Hadoop 2.x – Hadoop 2 adds support for snapshots, providing disaster recovery and
protection against user error.
• Hadoop 3.x – Hadoop 3 also supports the snapshot feature.
xxi. Platform

• Hadoop 2.x – Can serve as a platform for a wide variety of data analytics; it is possible to
run event processing, streaming, and real-time operations.
• Hadoop 3.x – Here too, it is possible to run event processing, streaming, and real-time
operations on top of YARN.
xxii. Cluster Resource Management

• Hadoop 2.x – For cluster resource management it uses YARN, which improves scalability,
high availability, and multi-tenancy.
• Hadoop 3.x – For cluster resource management it also uses YARN, with all of the same
features.

MAJOR BUGS AND IMPROVEMENTS


Zip Slip Vulnerability
“Zip Slip” is a widespread arbitrary file overwrite critical vulnerability, which typically results in
remote command execution. It was discovered and responsibly disclosed by the Snyk Security team
ahead of a public disclosure on June 5, 2018, and affects thousands of projects.

Cloudera has analyzed our use of zip-related software, and has determined that only Apache
Hadoop is vulnerable to this class of vulnerability in CDH 5. This has been fixed in upstream Hadoop
as CVE-2018-8009.

Products affected: Hadoop

Releases affected:

• CDH 5.12.x and all prior releases


• CDH 5.13.0, 5.13.1, 5.13.2, 5.13.3
• CDH 5.14.0, 5.14.2, 5.14.3
• CDH 5.15.0
Users affected: All

Date of detection: April 19, 2018

Detected by: Snyk

Severity: High

Impact: Zip Slip is a form of directory traversal that can be exploited by extracting files from an
archive. The premise of the directory traversal vulnerability is that an attacker can gain access to
parts of the file system outside of the target folder in which they should reside. The attacker can
then overwrite executable files and either invoke them remotely or wait for the system or user to
call them, thus achieving remote command execution on the victim’s machine. The vulnerability can
also cause damage by overwriting configuration files or other sensitive resources, and can be
exploited on both client (user) machines and servers.
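
To make the attack concrete: the bug class arises when an extraction routine trusts entry names
such as "../../etc/crontab". A typical guard (a generic sketch, not the actual CVE-2018-8009 patch)
canonicalizes each entry's destination and rejects anything that would land outside the extraction
directory:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SafeUnzip {
  // Extracts a zip archive while rejecting entries that would escape destDir
  // (the core of the Zip Slip defense: canonicalize paths and compare prefixes).
  public static void unzip(File archive, File destDir) throws IOException {
    String destPath = destDir.getCanonicalPath() + File.separator;
    try (ZipFile zip = new ZipFile(archive)) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        File target = new File(destDir, entry.getName());
        // An entry like "../../etc/crontab" canonicalizes to a path outside destDir.
        if (!target.getCanonicalPath().startsWith(destPath)) {
          throw new IOException("Blocked Zip Slip entry: " + entry.getName());
        }
        if (entry.isDirectory()) {
          target.mkdirs();
        } else {
          target.getParentFile().mkdirs();
          try (InputStream in = zip.getInputStream(entry)) {
            Files.copy(in, target.toPath());
          }
        }
      }
    }
  }
}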

CVE: CVE-2018-8009

Immediate action required: Upgrade to a version that contains the fix.


Addressed in release/refresh/patch: CDH 5.14.4, CDH 5.15.1

For the latest update on this issue, see the corresponding Knowledge article:

TSB: 2018-307: Zip Slip Vulnerability

Apache Hadoop MapReduce Job History Server (JHS) vulnerability CVE-2017-15713


A vulnerability in Hadoop's Job History Server allows a cluster user to expose private files owned by
the user running the MapReduce Job History Server (JHS) process. See
http://seclists.org/oss-sec/2018/q1/79 for reference.

Products affected: Apache Hadoop MapReduce

Releases affected: All releases prior to CDH 5.12.0. CDH 5.12.0, CDH 5.12.1, CDH 5.12.2, CDH 5.13.0,
CDH 5.13.1, CDH 5.14.0

Users affected: Users running the MapReduce Job History Server (JHS) daemon

Date/time of detection: November 8, 2017

Detected by: Man Yue Mo of lgtm.com

Severity (Low/Medium/High): High

Impact: The vulnerability allows a cluster user to expose private files owned by the user running the
MapReduce Job History Server (JHS) process. The malicious user can construct a configuration file
containing XML directives that reference sensitive files on the MapReduce Job History Server (JHS)
host.

CVE: CVE-2017-15713

Immediate action required: Upgrade to a release where the issue is fixed.

Addressed in release/refresh/patch: CDH 5.13.2, 5.14.2

Hadoop LdapGroupsMapping does not support LDAPS for self-signed LDAP server
Hadoop LdapGroupsMapping does not work with LDAP over SSL (LDAPS) if the LDAP server
certificate is self-signed. This use case is currently not supported even if Hadoop User Group
Mapping LDAP TLS/SSL Enabled, Hadoop User Group Mapping LDAP TLS/SSL Truststore, and Hadoop
User Group Mapping LDAP TLS/SSL Truststore Password are filled properly.

Bug: HADOOP-12862

Affected Versions: All CDH 5 versions.

Workaround: None.

HDFS

CVE-2018-1296 Permissive Apache Hadoop HDFS listXAttr Authorization Exposes Extended Attribute
Key/Value Pairs
HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level search
access to the directory rather than path-level read permission to the referent.

Products affected: Apache HDFS


Releases affected:
• CDH 5.4.0 - 5.15.1, 5.16.0
• CDH 6.0.0, 6.0.1, 6.1.0
Users affected: Users who store sensitive data in extended attributes, such as users of HDFS
encryption.

Date/time of detection: December 12, 2017

Detected by: Rushabh Shah, Yahoo! Inc., Hadoop committer

Severity (Low/Medium/High): Medium

Impact: HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level
search access to the directory rather than path-level read permission to the referent. This affects
features that store sensitive data in extended attributes.

CVE: CVE-2018-1296

Immediate action required:


• Upgrade: Update to a version of CDH containing the fix.
• Workaround: If a file contains sensitive data in extended attributes, users and admins need to change the permissions to prevent others from listing the directory that contains the file (a command sketch follows this list).
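
The following is a minimal sketch of that workaround, assuming a hypothetical directory /secure/data holding the sensitive files; the exact mode depends on which users and groups legitimately need access.

# Illustrative only: remove world access so other users cannot list the directory
# (and therefore cannot call listXAttrs on the files inside it)
hdfs dfs -chmod 750 /secure/data
hdfs dfs -ls -d /secure/data    # verify the resulting permissions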
Addressed in release/refresh/patch:
• CDH 5.15.2, 5.16.1
• CDH 6.1.1, 6.2.0

Clusters running CDH 5.16.1, 6.1.0, or 6.1.1 can lose some HDFS file permissions any time the
NameNode is restarted
When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with SELECT and/or INSERT privileges on
an Impala database or table have the REFRESH privilege added as part of the upgrade process.
HDFS ACLs for roles with the REFRESH privilege are set with empty permissions whenever the
NameNode is restarted. This can cause any jobs or queries run by users within affected roles to fail
because they will no longer be able to access the affected Impala databases or tables.

Products Affected: HDFS and components that access files in HDFS

Affected Versions: CDH 5.16.1, 6.1.0, 6.1.1

Users Affected: Clusters with Impala and HDFS ACLs managed by Sentry upgrading from any release
to CDH 5.16.1, 6.1.0, and 6.1.1.

Severity (Low/Medium/High): High

Root Cause and Impact: The new privilege REFRESH was introduced in CDH 5.16 and 6.1 and applies
to Impala databases and tables. When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with
SELECT or INSERT privileges on an Impala database or table will have the REFRESH privilege added
during the upgrade.

HDFS ACLs for roles with the REFRESH privilege get set with empty permissions whenever the
NameNode is restarted. The NameNode is restarted during the upgrade.
For example, if a group appdev is in role appdev_role and has SELECT access to the Impala table
"project", the HDFS ACLs prior to the upgrade would look similar to:

group: appdev

group::r--

After the upgrade the HDFS ACLs will be set with no permissions and will look like this:

group: appdev

group::---

Any jobs or queries run by users within affected roles will fail because they will no longer be able to
access the affected Impala databases or tables. This impacts any SQL client accessing the affected
databases and tables. For example, if a Hive client is used to access a table created in Impala, it will
also fail. Jobs accessing the files directly through HDFS, for example via Spark, will also be impacted.

The HDFS ACLs will get reset whenever the NameNode is restarted.
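
One way to check whether a cluster has been affected is to inspect the ACLs on an affected table directory after a NameNode restart. This is a minimal sketch; the warehouse path and table name are hypothetical.

# Illustrative check: after the restart, the affected group shows an empty entry
# (group:appdev:---) instead of its expected permissions
hdfs dfs -getfacl /user/hive/warehouse/project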

Immediate action required: If possible, do not upgrade to releases CDH 5.16.1, 6.1.0, or 6.1.1 if
Impala is used and Sentry manages HDFS ACLs within your environment. Subsequent CDH releases
will resolve the problem with a product fix under SENTRY-2490.

If an upgrade is being considered, reach out to your account team to discuss other possibilities, and
to receive additional insight into future product release schedules.

If an upgrade must be executed, contact Cloudera Support indicating the upgrade plan and why an
upgrade is being executed. Options are available to assist with the upgrade if necessary.

Addressed in release/refresh/patch: Patches for 5.16.1, 6.1.0 and 6.1.1 are available for major
supported operating systems. Customers are encouraged to contact Cloudera Support for a patch.
The patch should be applied immediately after upgrade to any of the affected versions.

The fix for this TSB will be included in 6.1.2, 6.2.0, 5.16.2, and 5.17.0.

Potential data corruption due to race conditions between concurrent block read and write
Under rare conditions, when an HDFS file is open for write, an application reading the same HDFS
blocks might read up-to-date block data of the partially written file while reading a stale checksum
that corresponds to the block data before the latest write. The block is incorrectly declared corrupt
as a result. Normally the HDFS NameNode schedules an additional replica for the same block from
other replicas if a replica is corrupted, but if the frequency of concurrent writes and reads is high
enough, there is a small probability that all replicas of a block can be declared corrupt, and the file
becomes corrupt and unrecoverable as well.

The DataNode might print an error log entry such as the following:

2017-10-18 11:23:46,627 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: ip-168-61-2-30:50010:DataXceiver error processing WRITE_BLOCK operation src: /168.61.2.32:48163 dst: /168.61.12.31:50010
java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1666924250-168.61.12.36-1494235758065:blk_1084584428_5057054 from /168.61.12.32:48163
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:604)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:894)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:794)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)

The bug is fixed by HDFS-11056, HDFS-11160 and HDFS-11229.

Products Affected: HDFS

Affected Versions:
• All CDH 5.4 releases and lower
• CDH 5.5.0, 5.5.1, 5.5.2, 5.5.4, 5.5.5
• CDH 5.6.0, 5.6.1
• CDH 5.7.0, 5.7.1, 5.7.2, 5.7.3, 5.7.4, 5.7.5
• CDH 5.8.0, 5.8.2, 5.8.3
• CDH 5.9.0, 5.9.1
Users Affected: Workloads that require reading a file while it’s being concurrently written to HDFS.

Severity (Low/Medium/High): Low

Impact: If the workload requires reading and writing the same file concurrently, there is a small
probability that all replicas of a block can be declared corrupt, and the file becomes corrupt as well.

Immediate action required: Customers are advised to upgrade to a CDH version containing the fix if
the workloads are susceptible to this bug.

Fixed in Versions:
• CDH 5.5.6 and higher
• CDH 5.7.6 and higher
• CDH 5.8.4 and higher
• CDH 5.9.2 and higher
• CDH 5.10.0 and higher

Cannot re-encrypt an encryption zone if a previous re-encryption on it was canceled


When canceling a re-encryption on an encryption zone, the status of the re-encryption may continue
to show "Processing". When this occurs, future re-encrypt commands for this encryption zone will
fail inside the NameNode, and the re-encryption will never complete.

Cloudera Bug: CDH-59073

Affected Versions: CDH 5.13.0

Fixed in Versions: CDH 5.13.1 and higher

Workaround: To halt the re-encryption or clear the "Processing" status for the encryption zone, re-issue the cancel
re-encryption command on the encryption zone. If a new re-encryption command is required for this
encryption zone, restart the NameNode before issuing the command.

Potential Block Corruption and Data Loss During Pipeline Recovery


A bug in the HDFS block pipeline recovery code can cause blocks to be unrecoverable due to
miscalculation of the block checksum. On a busy cluster where data is written and flushed
frequently, when a write pipeline recovery occurs, a node newly added to the write pipeline may
calculate the checksum incorrectly. This miscalculation is very rare, but when it does occur, the
replica becomes corrupted and data can be lost if all replicas are simultaneously affected.

Bug: HDFS-4660 , HDFS-9220

Detecting this known issue requires correlating multiple log messages. Below is an example
DataNode log error message captured at the time of block creation:

java.io.IOException: Terminating due to a checksum error.java.io.IOException:

Unexpected checksum mismatch while writing

BP-1800173197-10.x.y.z-1444425156296:blk_1170125248_96458336 from /10.x.y.z

Important: Absence of error messages in the DataNode logs does not necessarily mean that the
issue did not occur.
This issue affects these versions of CDH:
• CDH 5.0.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6
• CDH 5.1.0, 5.1.2, 5.1.3, 5.1.4, 5.1.5
• CDH 5.2.0, 5.2.1, 5.2.3, 5.2.4, 5.2.5, 5.2.6
• CDH 5.3.0, 5.3.1, 5.3.2, 5.3.3, 5.3.4, 5.3.5, 5.3.7, 5.3.8, 5.3.9, 5.3.10
• CDH 5.4.0, 5.4.1, 5.4.2, 5.4.3, 5.4.4, 5.4.5, 5.4.7, 5.4.8, 5.4.9, 5.4.10
• CDH 5.5.0, 5.5.1
Workaround: None. Upgrade to a CDH version that includes the fix: CDH 5.4.11, CDH 5.5.2, CDH 5.6.0 and higher.

Important: Upgrading does not fix existing corrupted block replicas. The HDFS block scanner runs every three weeks and captures "Checksum failed" WARN messages in the DataNode log. Look for these messages to identify and repair corrupted block replicas.
Users affected: All users running the affected CDH versions and using the HDFS file system.

Severity (Low/Medium/High): High

Impact: Potential loss of block data.

Immediate action required: Upgrade to a CDH version that includes the fix, specifically:
• CDH 5.4.11, CDH 5.5.2, CDH 5.6.0 and higher
DiskBalancer Occasionally Emits False Error Messages
Diskbalancer occasionally emits false error messages. For example:

2016-08-03 11:01:41,788 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer:

Disk Balancer is not enabled.

You can safely ignore this error message if you are not using DiskBalancer.

Affected Versions: CDH 5.8.1 and below.

Fixed in Versions: CDH 5.8.2 and higher.

Bug: HDFS-10588

Workaround: Use the following command against all DataNodes to suppress DiskBalancer logs:

hadoop daemonlog -setlevel <host:port> org.apache.hadoop.hdfs.server.datanode.DiskBalancer FATAL

Another workaround is to suppress the warning by setting the log level of DiskBalancer to FATAL.
Add the following to log4j.properties (DataNode Logging Advanced Configuration Snippet (Safety
Valve)) and restart your DataNodes:

log4j.logger.org.apache.hadoop.hdfs.server.datanode.DiskBalancer = FATAL

Upgrade Requires an HDFS Upgrade


Upgrading from any release earlier than CDH 5.2.0 to CDH 5.2.0 or later requires an HDFS Upgrade.

See Upgrading Unmanaged CDH Using the Command Line for further information.

Optimizing HDFS Encryption at Rest Requires Newer openssl Library on Some Systems
CDH 5.3 implements the Advanced Encryption Standard New Instructions (AES-NI), which provide
substantial performance improvements. To get these improvements, you need a recent version
of libcrypto.so on HDFS and MapReduce client hosts, that is, any host from which you originate
HDFS or MapReduce requests. Many OS versions ship an older version of the library that does not
support AES-NI.

See HDFS Transparent Encryption in the Encryption section of the Cloudera Security guide for
instructions for obtaining the right version.

Other HDFS Encryption Known Issues

Potentially Incorrect Initialization Vector Calculation in HDFS Encryption


A mathematical error in the calculation of the Initialization Vector (IV) for encryption and decryption
in HDFS could cause data to appear corrupted when read. The IV is a 16-byte value input to
encryption and decryption ciphers. The calculation of the IV implemented in HDFS was found to be
subtly different from that used by Java and OpenSSL cryptographic routines. The result is that data
could possibly appear to be corrupted when it is read from a file inside an Encryption Zone.
Fortunately, the probability of this occurring is extremely small. For example, the maximum size of a
file in HDFS is 64 TB; a file of that size would have a 1-in-4-million chance of hitting this condition.
A more typically sized file of 1 GB would have a roughly 1-in-274-billion chance of hitting the
condition.

Affected Versions: CDH 5.2.1 and below

Fixed in Versions: CDH 5.3.0 and higher

Cloudera Bug: CDH-23618

Workaround: If you are using the experimental HDFS encryption feature in CDH 5.2, upgrade to CDH
5.3 and verify the integrity of all files inside an Encryption Zone.

DistCp between unencrypted and encrypted locations fails


By default, DistCp compares checksums provided by the filesystem to verify that data was
successfully copied to the destination. However, when copying between unencrypted and encrypted
locations, the filesystem checksums will not match since the underlying block data is different.

Affected Versions: CDH 5.2.1 and below.

Fixed in Versions: CDH 5.2.2 and higher.

Bug: HADOOP-11343

Workaround: Specify the -skipcrccheck and -update distcp flags to avoid verifying checksums.
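
A minimal sketch of such a copy, assuming hypothetical source and destination paths where the destination sits inside an encryption zone:

# Illustrative only: -skipcrccheck disables the checksum comparison that would
# otherwise fail when copying between unencrypted and encrypted locations
hadoop distcp -update -skipcrccheck \
  hdfs://nn.example.com:8020/data/raw \
  hdfs://nn.example.com:8020/secure/raw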

Cannot move encrypted files to trash


With HDFS encryption enabled, you cannot move encrypted files or directories to the trash
directory.

Affected Versions: All CDH 5 versions

Bug: HDFS-6767

Workaround: To remove encrypted files or directories, bypass trash by using the -skipTrash flag, for example:

hadoop fs -rm -r -skipTrash /testdir

HDFS NFS gateway and CDH installation (using packages) limitation


HDFS NFS gateway works as shipped ("out of the box") only on RHEL-compatible systems, but not on
SLES, Ubuntu, or Debian. Because of a bug in native versions of portmap/rpcbind, the HDFS NFS
gateway does not work out of the box on SLES, Ubuntu, or Debian systems when CDH has been
installed from the command line using packages. It does work on supported versions of RHEL-compatible
systems on which rpcbind-0.2.0-10.el6 or later is installed, and it does work if you use
Cloudera Manager to install CDH, or if you start the gateway as root. For more information,
see supported versions.

Bug: 731542 (Red Hat), 823364 (SLES), 594880 (Debian)

Workarounds and caveats:


• On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
• On SLES, Debian, and Ubuntu systems, do one of the following:
o Install CDH using Cloudera Manager; or
o As of CDH 5.1, start the NFS gateway as root; or
o Start the NFS gateway without using packages; or
o You can use the gateway by running rpcbind in insecure mode, using the -i option, but keep in mind that this allows anyone from a remote host to bind to the portmap.

HDFS does not currently provide ACL support for the NFS gateway
Affected Versions: All CDH 5 versions

Bug: HDFS-6949

Cloudera Bug: CDH-26921

No error when changing permission to 777 on .snapshot directory


Snapshots are read-only; running chmod 777 on the .snapshot directory does not change this, but
does not produce an error (though other illegal operations do).

Affected Versions: All CDH 5 versions

Bug: HDFS-4981

Cloudera Bug: CDH-13062

Workaround: None

Snapshot operations are not supported by ViewFileSystem


Affected Versions: All CDH 5 versions

Cloudera Bug: CDH-12600

Workaround: None

Snapshots do not retain directories' quotas settings


Affected Versions: All CDH 5 versions

Bug: HDFS-4897

Workaround: None

Permissions for dfs.namenode.name.dir incorrectly set.


Hadoop daemons should set permissions for the dfs.namenode.name.dir (or dfs.name.dir )
directories to drwx------ (700), but in fact these permissions are set to the file-system default, usually
drwxr-xr-x (755).

Affected Versions: All CDH 5 versions

Bug: HDFS-2470
Workaround: Use chmod to set permissions to 700. See Configuring Local Storage Directories for
Use by HDFS for more information and instructions.
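
A minimal sketch of the workaround, assuming a hypothetical NameNode metadata directory /data/dfs/nn owned by the hdfs user:

# Illustrative only: restrict the NameNode metadata directory to its owner
chmod 700 /data/dfs/nn
ls -ld /data/dfs/nn    # should now show drwx------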

hadoop fsck -move does not work in a cluster with host-based Kerberos
Affected Versions: All CDH 5 versions

Cloudera Bug: CDH-7017

Workaround: Use hadoop fsck -delete

HttpFS cannot get delegation token without prior authenticated request


A request to obtain a delegation token cannot initiate an SPNEGO authentication sequence; it must
be accompanied by an authentication cookie from a prior SPNEGO authentication sequence.

Affected Versions: CDH 5.1 and below

Fixed in Versions: CDH 5.2 and higher

Bug: HDFS-3988

Cloudera Bug: CDH-8144

Workaround: Make another WebHDFS request (such as GETHOMEDIR) to initiate an SPNEGO authentication sequence and then make the delegation token request.
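
The following is a rough sketch of that sequence using curl against an HttpFS endpoint; the host name, port, and renewer value are hypothetical, and operation names may vary between WebHDFS versions.

# Authenticate once via SPNEGO and store the resulting authentication cookie
curl -i --negotiate -u : -c cookies.txt \
  "http://httpfs-host.example.com:14000/webhdfs/v1/?op=GETHOMEDIRECTORY"

# Reuse the cookie to request the delegation token
curl -i -b cookies.txt \
  "http://httpfs-host.example.com:14000/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=oozie"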

DistCp does not work between a secure cluster and an insecure cluster in some cases
See the upstream bug reports for details.

Affected Versions: All CDH 5 versions

Bug: HDFS-7037, HADOOP-10016, HADOOP-8828

Cloudera Bug: CDH-14945, CDH-18779

Workaround: None

Port configuration required for DistCp to Hftp from secure cluster (SPNEGO)
To copy files using DistCp to Hftp from a secure cluster using SPNEGO, you must configure
the dfs.https.port property on the client to use the HTTP port (50070 by default).

Affected Versions: All CDH 5 versions

Bug: HDFS-3983

Cloudera Bug: CDH-8118

Workaround: Configure dfs.https.port to use the HTTP port on the client
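
As a rough illustration, the property can also be passed on the command line of the copy itself; the host names and paths below are hypothetical.

# Illustrative only: set dfs.https.port to the NameNode's HTTP port for this client run
hadoop distcp -Ddfs.https.port=50070 \
  hftp://secure-nn.example.com:50070/data/src /data/dst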

Non-HA DFS Clients do not attempt reconnects


This problem means that streams cannot survive a NameNode restart or network interruption that
lasts longer than the time it takes to write a block.

Affected Versions: All CDH 5 versions


Bug: HDFS-4389

Cloudera Bug: CDH-10415

DataNodes may become unresponsive to block creation requests


DataNodes may become unresponsive to block creation requests from clients when the directory
scanner is running.

Affected Versions: CDH 5.2.1 and below

Fixed in Versions: CDH 5.2.2 and higher

Bug: HDFS-7489

Workaround: Disable the directory scanner by setting dfs.datanode.directoryscan.interval to -1 .

The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
The result is that the NameNodes fail to checkpoint until the upgrade is finalized.

Note: Rolling upgrade is supported only for clusters managed by Cloudera Manager; you cannot do a rolling upgrade in a command-line-only deployment.
Affected Versions: CDH 5.3.7 and below

Fixed in Versions: CDH 5.3.8 and higher

Bug: HDFS-7185

Workaround: None.

Block report can exceed maximum RPC buffer size on some DataNodes
On a DataNode with a large number of blocks, the block report may exceed the maximum RPC
buffer size.

Affected Versions: All CDH 5 versions

Bug: None

Workaround: Increase the value ipc.maximum.data.length in hdfs-site.xml:

<property>

<name>ipc.maximum.data.length</name>

<value>268435456</value>

</property>

Misapplied user-limits setting possible


The ulimits setting in /etc/security/limits.conf is applied to the wrong user when security is
enabled.

Affected Versions: CDH 5.2.0 and below

Bug: DAEMON-192
Anticipated Resolution: None

Workaround: To increase the ulimits applied to DataNodes, you must change the ulimit settings
for the root user, not the hdfs user.
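
A minimal sketch, run as root, assuming the goal is to raise the open-file limit inherited by the DataNodes (the values are placeholders, not recommendations):

# Illustrative only: because the limits are misapplied, the entries go under root,
# not under the hdfs user
echo 'root soft nofile 65536' >> /etc/security/limits.conf
echo 'root hard nofile 65536' >> /etc/security/limits.conf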

LAZY_PERSIST storage policy is experimental and not supported


Using this storage policy could potentially lead to data loss.

Affected versions: All CDH 5 versions

Bug: HDFS-8229

Workaround: None

MapReduce2, YARN

NodeManager fails because of the changed default location of container executor binary
The default location of the container-executor binary and .cfg files changed to /var/lib/yarn-ce; it
used to be /opt/cloudera/parcels/<CDH_parcel_version>. Because of this change, the NodeManager
can fail to start if /var is mounted with the -noexec or -nosuid options, even if /opt was not.

Affected versions: CDH 5.16.1, all CDH 6 versions

Workaround: Either remove the -noexec and -nosuid mount options on /var, or change the
container-executor binary and .cfg path using
the CMF_YARN_SAFE_CONTAINER_EXECUTOR_DIR environment variable.

YARN scheduler queue ACLs are not checked when performing MoveApplicationAcrossQueues
operations
The YARN moveApplicationAcrossQueues operation does not check ACLs on the target queue. This
allows a user to move an application to a queue that the user has no access to.

Affected Versions: All CDH 5 versions

Fixed Versions: CDH 6.0.0

Bug: YARN-5554

Cloudera Bug: CDH-43327

Workaround: N/A

Hadoop YARN Privilege Escalation CVE-2016-6811


A vulnerability in Hadoop YARN allows a user who can escalate to the yarn user to potentially run
arbitrary commands as the root user.

Products affected: Hadoop YARN

Releases affected:

• CDH 5.12.x and all prior releases
• CDH 5.13.0, 5.13.1, 5.13.2, 5.13.3
• CDH 5.14.0, 5.14.2, 5.14.3
• CDH 5.15.0
Users affected: Users running the Hadoop YARN service.

Detected by: Freddie Rice

Severity: High

Impact: The vulnerability allows a user who has access to a node in the cluster running a YARN
NodeManager, and who can escalate to the yarn user, to run arbitrary commands as the root user
even if the user is not allowed to escalate directly to the root user.

CVE: CVE-2016-6811

Upgrade: Upgrade to a release where the issue is fixed.

Workaround: The vulnerability can be mitigated by restricting access to the nodes where the YARN
NodeManagers are deployed, by removing su access to the yarn user, and by making sure no one
other than the yarn user is a member of the yarn group. Please consult your internal system
administration team and adhere to your internal security policy when evaluating the feasibility of
the above mitigation steps.
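
A rough sketch of how the group-membership part of that mitigation might be audited on a NodeManager host; how su access is restricted varies by distribution, so these commands are illustrative only.

getent group yarn              # list members of the yarn group; ideally only the yarn account appears
grep pam_wheel /etc/pam.d/su   # check whether su is restricted to a specific group (distro-dependent)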

Addressed in release/refresh/patch: CDH 5.14.4, 5.15.1

For the latest update on this issue, see the corresponding Knowledge article:

TSB: 2018-309: Hadoop YARN privilege escalation

Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when
filtering Parquet data written by Impala
Apache Hive and Apache Spark rely on Apache Parquet's parquet-mr Java library to perform
filtering of Parquet data stored in row groups. Those row groups contain statistics that make the
filtering efficient without having to examine every value within the row group.

Recent versions of the parquet-mr library contain a bug described in PARQUET-1217. This bug
causes filtering to behave incorrectly if only some of the statistics for a row group are written.
Starting in CDH 5.13, Apache Impala populates statistics in this way for Parquet files. As a result, Hive
and Spark may incorrectly filter Parquet data that is written by Impala.

In CDH 5.13, Impala started writing Parquet's null_count metadata field without writing
the min and max fields. This is valid, but it triggers the PARQUET-1217 bug in the predicate push-
down code of the Parquet Java library ( parquet-mr ). If the null_count field is set to a non-zero
value, parquet-mr assumes that min and max are also set and reads them without checking
whether they are actually there. If those fields are not set, parquet-mr reads their default value
instead.

For integer SQL types, the default value is 0 , so parquet-mr incorrectly assumes that
the min and max values are both 0 . This causes the problem when filtering data. Unless the
value 0 itself matches the search condition, all row groups are discarded due to the
incorrect min / max values, which leads to missing results.
Affected Products: The Parquet Java library ( parquet-mr ) and by extension, all Java applications
reading Parquet files, including, but not limited to:
• Hive
• Spark
• Pig
• Custom MapReduce jobs
Affected Versions:
• CDH 5.13.0, 5.13.1, 5.13.2, and 5.14.0
• CDS 2.2 Release 2 Powered by Apache Spark and earlier releases on CDH 5.13.0 and later
Who Is Affected: Anyone writing Parquet files with Impala and reading them back with Hive, Spark,
or other Java-based components that use the parquet-mr libraries for reading Parquet files.

Severity (Low/Medium/High): High

Impact: Parquet files containing null values for integer fields written by Impala produce missing
results in Hive, Spark, and other Java applications when filtering by the integer field.

Immediate Action Required:


• Upgrade

You should upgrade to one of the fixed maintenance releases mentioned below.

• Workaround

This issue can be avoided at the price of performance by disabling predicate push-down
optimizations:
o In Hive, use the following SET command:

SET hive.optimize.ppd = false;

o In Spark, disable the following configuration setting:

--conf spark.sql.parquet.filterPushdown=false

Addressed in the Following Releases:


• CDH 5.13.3 and higher
• CDH 5.14.2 and higher
• CDH 5.15.0 and higher
• CDS 2.3 Release 2 and higher
For the latest update on this issue, see the corresponding Knowledge Base article:

TSB:2018-300: Missing results in Hive, Spark, Pig, and other Java applications when filtering Parquet
data written by Impala

Apache Hadoop Yarn Fair Scheduler might stop assigning containers when preemption is on
In CDH 5.11.0 the preemption code was updated to improve preemption behavior. Further changes
were implemented in CDH 5.11.1 and CDH 5.12.0 to fix a remaining issue, described in YARN-6432
(FairScheduler: Reserve preempted resources for corresponding applications). This fix resulted
in two possible side effects:
• A race condition that results in the Fair Scheduler making duplicate reservations. The
duplicate reservations are never released and can result in an integer overflow stopping
container assignments.
• A possible deadlock in the event processing of the Fair Scheduler. This will stop all
updates in the Resource Manager.
Both side effects will ultimately cause the Fair Scheduler to stop processing resource requests.

Without the change from YARN-6432, resources that are released after being preempted are not
reserved for the starved application. This could result in the scheduler assigning the preempted
container to any application, not just the starved application. If no reservations are made on the
node for the starved application, preemption will be less effective in relieving the resource starvation.

Products affected: YARN

Releases affected: CDH 5.11.1, 5.12.0

Users affected: Users who have YARN configured with the FairScheduler and have turned
preemption on.

Severity (Low/Medium/High): Low

Impact: The ResourceManager will accept applications, but no application will change state or be
assigned containers, so no application makes progress.

Immediate action required:


• If you have not upgraded to the affected release and preemption in the FairScheduler is
in use, avoid upgrading to the affected releases.
• If you have already upgraded to the affected releases, choose from the following options:
o Upgrade to CDH 5.11.2 or 5.12.1
o Turn off preemption
Fixed in Versions: CDH 5.11.2 and 5.12.1

Yarn's Continuous Scheduling can cause slowness in Oozie


When Continuous Scheduling is enabled in YARN, Oozie can become slow due to long delays
in communicating with YARN. In Cloudera Manager 5.9.0 and higher, Enable Fair Scheduler
Continuous Scheduling is turned off by default.

Affected Versions: All CDH 5 versions

Bug: None

Cloudera Bug: CDH-60788

Workaround: Turn off Enable Fair Scheduler Continuous Scheduling in Cloudera Manager's YARN
Configuration. To keep equivalent benefits of this feature, turn on Fair Scheduler Assign Multiple
Tasks.
Rolling upgrades to 5.11.0 and 5.11.1 may cause application failures
Affected Versions: CDH versions that can be upgraded to 5.11.0 or 5.11.1

Fixed in Versions: CDH 5.11.2 and higher

Bug: None

Cloudera Bug: CDH-55284, TSB-241

Workaround: Upgrade to 5.11.2 or higher.

Name resolution issues can result in unresponsive Web UI and REST endpoints
Name resolution issues can cause the Web UI or the RM REST endpoints to consume all
ResourceManager request handling threads, leaving the Web UI and REST endpoints unresponsive.

Fixed in Versions: CDH 5.10.0 and higher

Bug: YARN-4767

Cloudera Bug: CDH-45597

Workaround: Restart the ResourceManager or kill the application that is being accessed or waiting
for the ResourceManager to complete the job.

Loss of connection to the Zookeeper cluster can cause problems with the ResourceManagers
Loss of connection to the Zookeeper cluster can cause the ResourceManagers to be in active-active
state for an extended period of time.

Fixed in Versions: CDH 5.10.0 and higher

Bug: YARN-5677, YARN-5694

Cloudera Bug: CDH-45210

Workaround: None.

If the YARN user is granted access to all keys in KMS, then files localized from an encryption zone can
be world readable
If the YARN user is granted access to all keys in KMS, then files localized from an encryption zone can
be world readable.

Fixed in Versions: CDH 5.7.7, 5.8.5, 5.9.2, 5.10.1, 5.11.0 and higher.

Bug: None

Cloudera Bug: CDH-47377

Workaround: Make sure files in an encryption zone do not have world-readable files modes if they
are going to be localized.

Zookeeper outage can cause the ResourceManagers to exit


Fixed in Versions: CDH 5.12.0 and higher

Bug: YARN-3742

Cloudera Bug: CDH-47439


Workaround: None.

FairScheduler might not Assign Containers


Under certain circumstances, turning on Fair Scheduler Assign Multiple
Tasks ( yarn.scheduler.fair.assignmultiple ) causes the scheduler to stop assigning containers to
applications. Possible symptoms are that running applications show no progress, and new
applications do not start, staying in an Assigned state, despite the availability of free resources on
the cluster.

Affected Versions: CDH 5.5.0, 5.5.1, 5.5.2, 5.5.3, 5.5.4, 5.5.5, 5.5.6, 5.6.0, and 5.6.1

Fixed in Versions: CDH 5.7.0 and higher

Bug: YARN-4477

Cloudera Bug: CDH-36686

Workaround: Turn off Fair Scheduler Assign Multiple Tasks ( yarn.scheduler.fair.assignmultiple ) and
restart the ResourceManager.

FairScheduler: AMs can consume all vCores leading to a livelock


When using FAIR policy with the FairScheduler, Application Masters can consume all vCores which
may lead to a livelock.

Fixed in Versions: CDH 5.7.3 and higher, except for CDH 5.8.0 and CDH 5.8.1

Bug: YARN-4866

Cloudera Bug: CDH-37529

Workaround: Use Dominant Resource Fairness (DRF) instead of FAIR; or make sure that the cluster
has enough vCores in proportion to the memory.

NodeManager mount point mismatch (YARN)


NodeManager may select a cgroups (Linux control groups) mount point that is not accessible to
user yarn , resulting in failure to start up. The mismatch occurs because YARN uses cgroups in
mount point /run/lxcfs/controllers , while Cloudera Manager typically
configures cgroups at /sys/fs/cgroups . This issue has occurred on Ubuntu 16.04 systems.

Fixed in Versions: CDH 5.11.1 and higher

Bug: YARN-6433

Cloudera Bug: CDH-52263

Workaround: Look through your YARN logs to identify the failing mount point, and unmount it:

$ umount errant_mount_point

You can also:


1. apt-get remove lxcfs
2. Reboot the node

JobHistory URL mismatch after server relocation


After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the
ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New
jobs started after the move are not affected.

Affected Versions: All CDH 5 versions.

Workaround: For any existing jobs that have the incorrect JobHistory Server URL, there is no option
other than to allow the jobs to roll off the history over time. For new jobs, make sure that all clients
have the updated mapred-site.xml that references the correct JobHistory Server.

Starting an unmanaged ApplicationMaster may fail


Starting a custom Unmanaged ApplicationMaster may fail due to a race in getting the necessary
tokens.

Affected Versions: CDH 5.1.5 and below.

Fixed in Versions: CDH 5.2 and higher.

Bug: YARN-1577

Cloudera Bug: CDH-17405

Workaround: Try to get the tokens again; the custom unmanaged ApplicationMaster should be able
to fetch the necessary tokens and start successfully.

Moving jobs between queues not persistent after restart


CDH 5 adds the capability to move a submitted application to a different scheduler queue. This
queue placement is not persisted across ResourceManager restart or failover, which resumes the
application in the original queue.

Affected Versions: All CDH 5 versions.

Bug: YARN-1558

Cloudera Bug: CDH-17408

Workaround: After ResourceManager restart, re-issue previously issued move requests.

Encrypted shuffle may fail (MRv2, Kerberos, TLS)


In MRv2, if the LinuxContainerExecutor is used (usually as part of Kerberos security),
and hadoop.ssl.enabled is set to true (see Configuring Encrypted Shuffle, Encrypted Web UIs, and
Encrypted HDFS Transport), then the encrypted shuffle does not work and the submitted job fails.

Affected Versions: All CDH 5 versions.

Bug: MAPREDUCE-4669

Cloudera Bug: CDH-8036

Workaround: Use encrypted shuffle with Kerberos security without encrypted web UIs, or use
encrypted shuffle with encrypted web UIs without Kerberos security.
ResourceManager-to-Application Master HTTPS link fails
In MRv2 (YARN), if hadoop.ssl.enabled is set to true (use HTTPS for web UIs), then the link from the
ResourceManager to the running MapReduce Application Master fails with an HTTP Error 500
because of a PKIX exception.

A job can still be run successfully, and, when it finishes, the link to the job history does work.

Affected Versions: CDH versions before 5.1.0.

Fixed Versions: CDH 5.1.0

Bug: YARN-113

Cloudera Bug: CDH-8014

Workaround: Do not use encrypted web UIs.

History link in ResourceManager web UI broken for killed Spark applications


When a Spark application is killed, the history link in the ResourceManager web UI does not work.

Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI
instead.

Affected Versions: All CDH versions

Apache Issue: None

Cloudera Issue: CDH-49165

Routable IP address required by ResourceManager


ResourceManager requires routable host:port addresses
for yarn.resourcemanager.scheduler.address , and does not support using the wildcard 0.0.0.0
address.

Bug: None

Cloudera Bug: CDH-6808

Workaround: Set the address, in the form host:port , either in the client-side configuration, or on
the command line when you submit the job.
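
As a rough sketch, the address can be supplied as a generic option at submit time, assuming the job's main class uses ToolRunner; the jar, class, host, and paths below are hypothetical, and 8030 is the usual default scheduler port.

# Illustrative only: point the client at a routable scheduler address for this submission
hadoop jar my-app.jar com.example.MyJob \
  -Dyarn.resourcemanager.scheduler.address=rm-host.example.com:8030 \
  /input /output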

Amazon S3 copy may time out


The Amazon S3 filesystem does not support renaming files, and performs a copy operation instead.
If the file to be moved is very large, the operation can time out because S3 does not report progress
to the TaskTracker during the operation.

Bug: MAPREDUCE-972

Cloudera Bug: CDH-17955

Workaround: Use -Dmapred.task.timeout=15000000 to increase the MR task timeout.
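
For example, the timeout can be raised for a single large copy; the bucket and paths are hypothetical, and the s3a scheme is used purely for illustration.

# Illustrative only: allow the S3 copy far more time before the task is declared dead
hadoop distcp -Dmapred.task.timeout=15000000 \
  hdfs:///user/etl/output s3a://my-bucket/backup/output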


Out-of-memory errors may occur with Oracle JDK 1.8
The total JVM memory footprint for JDK8 can be larger than that of JDK7 in some cases. This may
result in out-of-memory errors.

Bug: None

Workaround: Increase the maximum default heap size (-Xmx). In the case of MapReduce, for example,
increase Reduce Task Maximum Heap Size in Cloudera Manager (mapred.reduce.child.java.opts,
or mapreduce.reduce.java.opts for YARN) to avoid out-of-memory errors during the shuffle phase.
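
A minimal per-job sketch, assuming the job uses ToolRunner and that a 4 GB reduce heap is appropriate; the jar, class, paths, and sizes are placeholders.

# Illustrative only: raise the reduce-task heap and its container size together
hadoop jar my-app.jar com.example.MyJob \
  -Dmapreduce.reduce.java.opts=-Xmx4g \
  -Dmapreduce.reduce.memory.mb=5120 \
  /input /output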

MapReduce JAR file renamed (CDH 5.4.0)


As of CDH 5.4.0, hadoop-test.jar has been renamed to hadoop-test-mr1.jar . This JAR file contains
the mrbench , TestDFSIO , and nnbench tests.

Bug: None

Cloudera Bug: CDH-26521

Workaround: None.

Jobs in pool with DRF policy will not run if root pool is FAIR
If a child pool using DRF policy has a parent pool using Fairshare policy, jobs submitted to the child
pool do not run.

Affected Versions: All CDH 5 versions.

Bug: YARN-4212

Cloudera Bug: CDH-31358

Workaround: Change parent pool to use DRF.

Jobs with encrypted spills do not recover if the AM goes down


The fix for CVE-2015-1776 means there is not enough information to recover a job if the
Application Master fails. Releases with this security fix cannot tolerate Application Master failures.

Affected Versions: All CDH 5 versions.

Bug: MAPREDUCE-6638

Cloudera Bug: CDH-37412

Workaround: None. Fix to come in a later release.

Large TeraValidate data sets can fail with MapReduce


In a cluster using MapReduce, TeraValidate fails when run over large TeraGen/TeraSort data sets
(1TB and larger) with an IndexOutOfBoundsException . Smaller data sets do not show this issue.

Affected Versions: CDH 5.3.7 and lower

Fixed in Versions: CDH 5.3.8 and higher

Bug: MAPREDUCE-6481
Cloudera Bug: CDH-31871

Workaround: None.

MapReduce job failure and rolling upgrade (CDH 5.6.0)


MapReduce jobs might fail during a rolling upgrade to or from CDH 5.6.0. Cloudera recommends
that you avoid doing rolling upgrades to CDH 5.6.0.

Bug: None

Cloudera Bug: CDH-38587

Workaround: Restart failed jobs.

Unsupported Features
The following features are not currently supported:
• FileSystemRMStateStore: Cloudera recommends you use ZKRMStateStore (ZooKeeper-
based implementation) to store the ResourceManager's internal state for recovery on
restart or failover. Cloudera does not support the use of FileSystemRMStateStore in
production.
• ApplicationTimelineServer (also known as Application History Server): Cloudera does
not support ApplicationTimelineServer v1. ApplicationTimelineServer v2 is under
development and Cloudera does not currently support it.
• Scheduler Reservations: Scheduler reservations are currently at an experimental stage,
and Cloudera does not support their use in production.
• Scheduler node-labels: Node-labels are currently experimental with CapacityScheduler.
Cloudera does not support their use in production.
• CapacityScheduler. This is deprecated and will be removed from CDH in a future version.

MapReduce1

Oozie workflows not recovered after JobTracker failover on a secure cluster


Delegation tokens created by clients (via JobClient#getDelegationToken() ) do not persist when the
JobTracker fails over. This limitation means that Oozie workflows will not be recovered successfully
in the event of a failover on a secure cluster.

Bug: None

Cloudera Bug: CDH-8913

Workaround: Re-submit the workflow.

Hadoop Pipes should not be used in secure clusters


Hadoop Pipes should not be used in secure clusters. A shared password used by the framework for
parent-child communications is sent in the clear. A malicious user could intercept that password and
potentially use it to access private data in a running application.

Bug: None
No JobTracker becomes active if both JobTrackers are migrated to other hosts
If JobTrackers in a High Availability configuration are shut down, migrated to new hosts, and then
restarted, no JobTracker becomes active. The logs show a Mismatched address exception.

Bug: None

Cloudera Bug: CDH-11801

Workaround: After shutting down the JobTrackers on the original hosts, and before starting them on
the new hosts, delete the ZooKeeper state using the following command:

$ zkCli.sh rmr /hadoop-ha/<logical name>

Hadoop Pipes may not be usable in an MRv1 Hadoop installation done through tarballs
Under MRv1, MapReduce's C++ interface, Hadoop Pipes, may not be usable with a Hadoop
installation done through tarballs unless you build the C++ code on the operating system you are
using.

Bug: None

Cloudera Bug: CDH-7304

Workaround: Build the C++ code on the operating system you are using. The C++ code is present
under src/c++ in the tarball.

CONCLUSION:

Hadoop 3.x is the recommended version: it builds on the earlier releases surveyed above while improving scalability, storage efficiency, and resource management.

SITES USED

https://hadoop.apache.org/old/releases.pdf

https://www.tutorialspoint.com/difference-between-hadoop-1-and-hadoop-2

https://data-flair.training/blogs/hadoop-2-x-vs-hadoop-3-x-comparison/

https://hadoop.apache.org/docs/r0.23.11/hadoop-project-dist/hadoop-common/releasenotes.html

https://hadoop.apache.org/docs/r1.0.4/releasenotes.html

https://hadoop.apache.org/docs/r1.2.1/releasenotes.html
https://docs.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_hadoop_ki.html

https://svn.apache.org/repos/asf/hadoop/common/tags/release-0.22.0/common/src/docs/releasenotes.html

https://archive.apache.org/dist/hadoop/core/hadoop-1.1.0/releasenotes.html

https://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/releasenotes.html

https://www.geeksforgeeks.org/hadoop-version-3-0-whats-new/

https://hadoop.apache.org/release.html
