
MASTER OF COMPUTER APPLICATIONS BIG DATA ANALYTICS

UNIT-II
HADOOP
1. HISTORY OF HADOOP
2. THE HADOOP DISTRIBUTED FILE SYSTEM
3. COMPONENTS OF HADOOP
4. ANALYZING THE DATA WITH HADOOP
5. SCALING OUT
6. HADOOP STREAMING
7. DESIGN OF HDFS
8. JAVA INTERFACES TO HDFS BASICS

Technical terms:

Scaling out
  Literal meaning: to exit a position by selling in increments as the price of a stock climbs.
  Technical meaning: to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.

Streaming
  Literal meaning: a steady, continuous flow.
  Technical meaning: the continuous transfer of data from one or more sources at a steady, high speed for processing into specific outputs.

YARN
  Technical meaning: a resource management component of Hadoop (Yet Another Resource Negotiator).

Replica
  Literal meaning: an exact copy of something.
  Technical meaning: the process of creating and maintaining identical copies of data across multiple storage locations.

Latency
  Literal meaning: something inactive.
  Technical meaning: the time delay between when data is sent for use and when it produces the desired result.

1. HISTORY OF HADOOP
 Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.
There are mainly two problems with big data:
1. The first is to store such a huge amount of data.
2. The second is to process that stored data.
 Traditional approaches such as RDBMS are not sufficient due to the heterogeneity of the data.


 Hadoop comes as the solution to the problem of big data, i.e., storing and processing big data, with some extra capabilities.
History of Hadoop:
 Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002, when they both started to work on the Apache Nutch project.
 The Apache Nutch project was an effort to build a search engine system that could index 1 billion pages.
 After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive.
 So, they realized that their project architecture would not be capable enough to work with billions of pages on the web.
 So they were looking for a feasible solution which could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
The origin of the name of Hadoop:
 Doug Cutting separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop; it was the name of a yellow toy elephant owned by Doug Cutting's son, and it was easy to pronounce and was a unique word.)

Year Event

2003 Google released the paper on the Google File System (GFS).

2004 Google released a white paper on MapReduce.


2006 o Hadoop introduced.


o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.

2007 o Yahoo runs 2 clusters of 1000 machines.


o Hadoop includes HBase.

2008 o YARN JIRA opened


o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node cluster
within 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.

2009 o Yahoo runs 17 clusters of 24,000 machines.


o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subproject.

2010 o Hadoop added the support for Kerberos.


o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.

2011 o Apache Zookeeper released.


o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.


2023 Apache Hadoop 3.3 version released.

2024 Apache Hadoop 3.4.0 version released.

 Hadoop is an Apache project; all components are available via the Apache open
source License.
 Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and
MapReduce).
 HBase was originally developed at Powerset, now a department at Microsoft.
 Hive was originated and developed at Facebook.
 Pig, ZooKeeper, and Chukwa were originated and developed at Yahoo!
 Avro was originated at Yahoo! and is being co-developed with Cloudera.

HADOOP PROJECT COMPONENTS

2. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


 When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines.
Definition of HDFS:
 Filesystems that manage the storage across a network of machines are called distributed
filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands
for Hadoop Distributed Filesystem.

Where to use HDFS?


1. Very Large Files: Files should be hundreds of megabytes, gigabytes or more in size.
2. Streaming Data Access: The time to read the whole dataset is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
3. Commodity Hardware: It works on low-cost hardware.


Where not to use HDFS?


Low Latency data access: Applications that require very low latency access to the first record should not use HDFS, as it gives importance to reading the whole dataset rather than the time to fetch the first record.
Lots of Small Files: The name node holds the metadata of files in memory, and if the files are small in size, this metadata takes up a lot of the name node's memory, which is not feasible.
Multiple Writes: It should not be used when we have to write to a file multiple times.

Concepts of HDFS or Architecture of HDFS:


HDFS Concepts:
Blocks:
A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable (see the sketch below). Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary filesystem, if a file in HDFS is smaller than the block size, it does not occupy the full block size; e.g., a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
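A minimal sketch of how the block size can be changed per cluster (assuming the setting is placed in hdfs-site.xml; the value is given in bytes, and 134217728 bytes = 128 MB):

<!-- hdfs-site.xml: illustrative sketch only -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>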
An HDFS cluster has 2 types of nodes operating in a master-worker pattern:
1. Name node (the master)
2. A number of data nodes (workers)
Name Node:
 HDFS works in master-worker pattern where the name node acts as master.
 Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS:
1. The metadata information being file permissions
2. Names and locations of each block.
 The metadata is small, so it is stored in the memory of the name node, allowing faster access to data.
 Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine.
 The file system operations like opening, closing, renaming etc. are executed by it.
Data Node:
 They store and retrieve blocks when they are told to, by the client or the name node.
 They report back to the name node periodically, with the list of blocks that they are storing.
 The data node, being commodity hardware, also does the work of block creation, deletion and replication as stated by the name node.
Secondary Name Node:
 It is a separate physical machine which acts as a helper to the name node.
 It performs periodic checkpoints.
 It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.


(Figure: HDFS Name Node and Data Node architecture)

The command-line interface:


Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode.
Commands are given below.
To Format $ hadoop namenode -format
To Start $ start-dfs.sh

HDFS Basic File Operations


1. Putting data into HDFS from the local file system
o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the contents of the HDFS folder

$ hadoop fs -ls /user/test

2. Copying data from HDFS to local file system


o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt

Compare the files and see that both are the same


o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

3. Recursive deleting
o $ hadoop fs -rmr <arg>

Ex: $ hadoop fs -rmr /user/sonoo/


HDFS Other commands
The below is used in the commands
"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.


"<file>" means any filename.


"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file system

o put <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within
the DFS.

o copyFromLocal <localSrc><dest>
Identical to -put


o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest
within HDFS, and then deletes the local copy on success.

o get [-crc] <src><localDest>


Copies the file or directory in HDFS identified by src to the local file system path
identified by localDest.

o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>


Sets the target replication factor for files identified by path to rep. (The actual
replication factor will move toward the target over time)

o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.

o test -[ezd] <path>


Returns 1 if the path exists, has zero length, or is a directory; 0 otherwise.
o stat [format] <path>


Prints information about path. Format is a string which accepts file size in bytes (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
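For example (an illustrative command, assuming the data.txt file copied earlier still exists under /user/test):

$ hadoop fs -stat "%n %b %r %y" /user/test/data.txt

This prints the filename, size, replication factor and modification date of data.txt.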

3. COMPONENTS OF HADOOP
(https://www.ques10.com/p/2798/what-are-the-different-components-of-
hadoop-fram-1/)
 Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
 It is designed to scale up from single servers to thousands of machines, each providing
computation and storage.
 Rather than rely on hardware to deliver high-availability, the framework itself is
designed to detect and handle failures at the application layer, thus delivering a highly-
available service on top of a cluster of computers, each of which may be prone to
failures.
1. Hadoop Distributed File System (HDFS)
 HDFS is a distributed file system that provides high-throughput access to data.
 It provides a limited interface for managing the file system to allow it to scale and
provide high throughput.
 HDFS creates multiple replicas of each data block and distributes them on computers
throughout a cluster to enable reliable and rapid access.
The main components of HDFS are as described below:
 Name Node is the master of the system. It maintains the name system (directories and
files) and manages the blocks which are present on the Data Nodes.
 Data Nodes are the slaves which are deployed on each machine and provide the actual
storage. They are responsible for serving read and write requests for the clients.
 Secondary Name Node is responsible for performing periodic checkpoints. In the
event of Name Node failure, you can restart the NameNode using the checkpoint.
2. MapReduce
 MapReduce is a framework for performing distributed data processing using the
MapReduce programming paradigm.
 In the MapReduce paradigm, each job has a user-defined map phase (a parallel, share-nothing processing of the input), followed by a user-defined reduce phase (where the output of the map phase is aggregated).
 HDFS is the storage system for both input and output of the MapReduce jobs.
The main components of MapReduce:


 JobTracker is the master of the system which manages the jobs and resources in the
clusters (Task Trackers). The JobTracker tries to schedule each map as close to the
actual data being processed i.e. on the Task Tracker which is running on the same
DataNode as the underlying block.
 TaskTrackers are the slaves which are deployed on each machine. They are
responsible for running the map and reduce tasks as instructed by the JobTracker.
 JobHistoryServer is a daemon that serves historical information about completed applications. Typically, the JobHistory server can be co-deployed with the JobTracker, but it is recommended to run it as a separate daemon.
3. YARN Component
 The acronym YARN stands for “Yet Another Resource Negotiator”.

 It is a resource management and job scheduling unit.

 This Hadoop component manages the resources to run the processed data effectively.

 Multiple data processing jobs, such as MapReduce jobs, run on the Hadoop cluster simultaneously, and each of them needs some resources to complete its task.

 This is done with the help of a set of resources such as RAM, network bandwidth, and the CPU.
To process job requests and manage cluster resources in Hadoop, YARN consists of the following components:
1. Resource Manager
The resource manager serves as the master and assigns resources.
2. Node Manager
The node manager aids the resource manager and is responsible for handling the nodes and resource usage on the nodes.
3. Application Manager
The application manager is responsible for requesting containers from the node manager. Once the node manager gets the resources, it sends them to the resource manager.
4. Container
A container is a collection of physical resources that conducts the actual processing of data.


4. ANALYZING DATA WITH HADOOP

 To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job.
 After some local, small-scale testing, we will be able to run it on a cluster of machines.
MapReduce works by breaking the processing into two phases:
1. Map phase
2. Reduce phase
Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.
The programmer also specifies two functions: the map function and the reduce function.

 The input to our map phase is the raw NCDC (National Climatic Data Center) weather data.
 We choose a text input format that gives us each line in the dataset as a text value.
 The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.
 To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

0067011990999991950051507004…9999999N9+00001+99999999999…

0043011990999991950051512004…9999999N9+00221+99999999999…

0043011990999991950051518004…9999999N9–00111+99999999999…

0043012650999991949032412004…0500001N9+01111+99999999999…

0043012650999991949032418004…0500001N9+00781+99999999999…

These lines are presented to the map function as the key-value pairs:
(0,0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)


(212, 0043011990999991950051518004…9999999N9–00111+99999999999…)

(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)

(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)

The keys are the line offsets within the file, which we ignore in our map function.

The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)

(1950, 22)

(1950, −11)

(1949, 111)

(1949, 78)

 The output from the map function is processed by the MapReduce framework before

being sent to the reduce function.

 This processing sorts and groups the key-value pairs by key. So, continuing the example,

our reduce function sees the following input:

(1949, [111, 78])


(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to

do now is iterate through the list and pick up the maximum reading:

(1949, 111)

(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
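The following is an illustrative sketch of how such a job can be written against the standard org.apache.hadoop.mapreduce API (this is not the exact program from these notes; the character offsets used to pull out the year and temperature are assumptions about the full NCDC record layout):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Mapper: extracts (year, temperature) from each input line.
    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);                           // assumed position of the year
            int airTemperature = Integer.parseInt(line.substring(87, 92));  // assumed position of the temperature
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }

    // Reducer: receives (year, [temperatures]) and keeps only the maximum reading.
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }
}

With a suitable driver class, this kind of job produces the (year, maximum temperature) pairs shown above.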


5. SCALING OUT
 MapReduce works for large inputs, provided we scale out.
 To scale out, we need to store the data in a distributed filesystem, typically HDFS, to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.

Data Flow:
 A MapReduce job is a unit of work that the client wants to be performed.
 It consists of the input data, the MapReduce program and configuration information.

Hadoop runs the job by dividing it into tasks, of which there are 2 types:
Map tasks
Reduce tasks

There are 2 types of nodes that control the job execution process:
1. Job tracker:
 The job tracker coordinates the job run.
 The job tracker is a Java application whose main class is JobTracker.
 It coordinates all the jobs run on the system by scheduling tasks to run on task trackers.
2. Task tracker:
 It runs tasks and sends progress reports to the job tracker, which keeps a record of the overall progress of each job.
 If a task fails, the job tracker can reschedule it on a different task tracker.

Input Splits:
 Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
 Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Locality optimization:
 Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.


 It should now be clear why the optimal split size is the same as the block size:
 It is the largest size of input that can be guaranteed to be stored on a single node.

Whole data flow with a single reduce task:

 The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.
 The number of reduce tasks is not governed by the size of the input, but is specified independently.
 When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
 There can be many keys in each partition, but the records for any given key are all in a single partition.
 The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well (see the sketch after this list).
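The default partitioner in Hadoop is a hash partitioner. Its core logic is essentially the following (a minimal sketch that mirrors the behaviour of Hadoop's built-in HashPartitioner):

// The same key always hashes to the same partition, so all records for a key reach one reducer.
public class HashPartitionerSketch<K, V> extends org.apache.hadoop.mapreduce.Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then bucket by the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}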


MapReduce data flow with multiple reduce tasks:

 This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle", as each reduce task is fed by many map tasks.

6. HADOOP STREAMING
 Hadoop Streaming is another feature/utility of the Hadoop ecosystem that enables users to execute a MapReduce job with executable scripts as the Mapper and Reducer.
 Hadoop Streaming is often confused with real-time streaming, but it is simply a utility that runs executable scripts within the MapReduce framework.
 The executable script may contain code to perform real-time ingestion of data.
 The basic functionality of the Hadoop Streaming utility is that it takes the Mapper and Reducer executables, creates a MapReduce job, submits the MapReduce job to the cluster, and monitors the job until it completes.

Differences between Hadoop Streaming, Spark Streaming, and Kafka


Hadoop Streaming:
  True streaming: No
  Processing framework: MapReduce
  Performance: Relatively low
  Scalability: Scalable
  Hadoop framework: Doesn't run without Hadoop
  Data retention: No
  Languages: Python, Perl, C++

Spark Streaming:
  True streaming: No
  Processing framework: In-memory computation
  Performance: Relatively higher
  Scalability: Scalable
  Hadoop framework: Can work without Hadoop
  Data retention: No
  Languages: Java, Scala, Python

Kafka:
  True streaming: Yes
  Processing framework: Java-based
  Performance: High throughput of data
  Scalability: Scalable
  Hadoop framework: Can work without Hadoop
  Data retention: Configurable
  Languages: Java, Python

Hadoop Streaming architecture


The Mapper reads the input data from the Input Reader/Format in the form of key-value pairs, maps them as per the logic written in the code, and then passes them to the Reduce phase, which performs data aggregation and releases the data to the output.

Features of Hadoop Streaming:


1. Users can execute non-Java-programmed MapReduce jobs on Hadoop clusters. Supported
languages include Python, Perl, and C++.
2. Hadoop Streaming monitors the progress of jobs and provides logs of a job’s entire execution
for analysis.

3. Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility,


and security/authentication.
4. Hadoop Streaming jobs are quick to develop and don't require much programming (except
for executables).

Syntax:
 To submit a MapReduce job using Hadoop Streaming, a Mapper and Reducer defined in Python (or another executable) can be used, as sketched below.
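(A sketch of the general form of the command; the exact path of the Hadoop streaming jar depends on the installation and is an assumption here:)

$ hadoop jar /path/to/hadoop-streaming.jar \
    -input <input location in HDFS> \
    -output <output location in HDFS> \
    -mapper <mapper executable> \
    -reducer <reducer executable>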


Where:

input = Input location for Mapper from where it can read input
output = Output location for the Reducer to store the output
mapper = The executable file of the Mapper
reducer = The executable file of the Reducer

 We use Python to write the Mapper (which reads the input from the file or stdin) and the Reducer (which writes the output to the output location as specified).
 The Hadoop Streaming utility creates a MapReduce job, submits the job to the cluster, and monitors the job until completion.
 Depending upon the input file size, the Hadoop Streaming process launches a number of Mapper tasks (based on input splits) as separate processes.
 Each Mapper task then reads the input (which is in the form of key-value pairs as converted by the input reader format), applies the logic from the code (which is basically to map the input), and passes the result to the Reducers.


 Once all the mapper tasks are successful, the MapReduce job launches the Reducer task
(one, by default, unless otherwise specified) as a separate process.
 In the Reducer phase, data aggregation occurs and the Reducer emits the output in the
form of key-value pairs.

Hadoop Streaming use case:


A popular example is the WordCount program developed in Python and run on Hadoop
Streaming.

Program for Mapper.py:


#!/usr/bin/python
import sys

# Word Count Example
# input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()    # remove leading and trailing whitespace
    words = line.split()   # split the line into words and return them as a list
    for word in words:
        # write the results to standard output (STDOUT)
        print '%s %s' % (word, 1)    # emit the word with a count of 1

Program for reducer.py:

#!/usr/bin/python
import sys

# the input arrives from STDIN sorted by key (word), so equal words are adjacent
current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split(' ', 1)
    try:
        count = int(count)
    except ValueError:
        # ignore lines where the count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the total count for the previous word
            print '%s %s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print '%s %s' % (current_word, current_count)

Save the input file at the HDFS location /user/input/input.txt.

Submit the Mapreduce job with following command:
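(A sketch of the submission command, assuming the streaming jar path as above and that mapper.py and reducer.py are in the current local directory:)

$ hadoop jar /path/to/hadoop-streaming.jar \
    -input /user/input/input.txt \
    -output /user/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py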

Once the job is successful, you will see some part files in /user/output location:


7. DESIGN OF HDFS

What is the purpose of the design of HDFS?


The Design of HDFS
 HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Very large files:
 “Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access:
 HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
 A dataset is typically generated or copied from source, then various analyses are
performed on that dataset over time.
Commodity hardware:
 Hadoop doesn't require expensive, highly reliable hardware to run on.
 It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters.
Scalable distributed filesystem:
 So essentially, as you add disks you get scalable performance; and as you add more nodes, you are adding a lot of disks, and that scales out the performance.
Data Replication:
 Replication helps to handle hardware failures.
 HDFS tries to spread the data, keeping the same piece of data on different nodes.
Detecting faults:
 It has mechanisms in place to scan for and detect faults quickly and effectively, as it includes a large number of commodity hardware machines.
Name node:
 It is the master node.


 It assigns work to the data nodes.
 It executes filesystem operations like opening, closing, and renaming files and directories.
Data node:
 It is the actual worker.
 It does the reading, writing and processing.
 It performs deletion and replication upon instruction from the master.
 It is deployed on commodity hardware.

Areas where HDFS is not a good fit today:


Low-latency data access:
 Applications that require low-latency access to data, in the tens of milliseconds range,
will not work well with HDFS.
Lots of small files:
 Since the namenode holds the filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode (a rough worked example follows this list).
Multiple writers, arbitrary file modifications:
 Files in HDFS may be written to by a single writer.
 Writes are always made at the end of the file.
 There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
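As a rough worked example (using the commonly cited rule of thumb that each file, directory, and block takes about 150 bytes of namenode memory): one million files, each occupying a single block, need roughly 1,000,000 × (150 + 150) bytes ≈ 300 MB of namenode memory for the metadata alone, which is why billions of small files quickly become impractical.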
8. JAVA INTERFACES TO HDFS

How to read data from HDFS using java interface?


Reading Data from HDFS:
1. Create FileSystem
2. Open InputStream to a Path

3. Copy bytes using IOUtils


4. Close Stream

1. Create FileSystem:
FileSystem fs = FileSystem.get(new Configuration());
 If you run with the yarn command, a DistributedFileSystem (HDFS) instance will be created
 It utilizes the fs.default.name property from the configuration
 Recall that the Hadoop framework loads core-site.xml, which sets the property to HDFS (hdfs://localhost:8020)
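(A minimal sketch of the corresponding core-site.xml entry, assuming a single-node setup listening on port 8020:)

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>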

2. Open Input Stream to a Path:


InputStream input = null;
try
{ input = fs.open(fileToRead);}


 fs.open returns org.apache.hadoop.fs.FSDataInputStream.


 Other FileSystem implementations will return their own custom implementation of InputStream
 It opens the stream with a default buffer of 4 KB
 If you want to provide your own buffer size, use fs.open(Path f, int bufferSize)

3. Copy bytes using IOUtils


IOUtils.copyBytes(inputStream, outputStream, buffer);
 Copy bytes from InputStream to OutputStream
 Hadoop’s IOUtils makes the task simple – buffer parameter specifies number of bytes
to buffer at a time

4. Close Stream
finally { IOUtils.closeStream(input); }

 Utilize IOUtils to avoid boiler plate code that catches IOException

Example of ReadFile.java:

public class ReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/training/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream input = null;
        try {
            input = fs.open(fileToRead);
            IOUtils.copyBytes(input, System.out, 4096);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}

Output:
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.ReadFile
Hello from readme.txt


How to write data into HDFS?


Write Data into HDFS:
1. Create FileSystem instance
2. Open OutputStream

 FSDataOutputStream in this case


 Open a stream directly to a Path from FileSystem
 Creates all needed directories on the provided path

3. Copy data using IOUtils


public class WriteToFile {
    public static void main(String[] args) throws IOException {
        String textToWrite = "Hello HDFS! Elephants are awesome!\n";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(textToWrite.getBytes()));
        Path toHdfs = new Path("/training/playArea/writeMe.txt");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(toHdfs);
        IOUtils.copyBytes(in, out, conf);
    }
}

Output:
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.WriteToFile
$ hdfs dfs -cat /training/playArea/writeMe.txt
Hello HDFS! Elephants are awesome!

How to append data into HDFS?

Append:
 Appends to the end of an existing file. This is optional support by a concrete FileSystem; HDFS supports it.
 There is no support for writing in the middle of the file.

Example:
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.append(toHdfs, new Progressable() {
    @Override
    public void progress() { System.out.print(".."); }
});

How to delete data in HDFS?


Delete Data:
FileSystem.delete(Path path, boolean recursive)

Path toDelete = new Path("/training/playArea/writeMe.txt");
boolean isDeleted = fs.delete(toDelete, false);
System.out.println("Deleted: " + isDeleted);

Output:

$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.DeleteFile


Deleted: true
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.DeleteFile
Deleted: false

How to create new directory in HDFS:

FileSystem: mkdirs

 Create a directory. It will create all the parent directories.


Configuration conf = new Configuration();
Path newDir = new Path("/training/playArea/newDir");
FileSystem fs = FileSystem.get(conf);
boolean created = fs.mkdirs(newDir);
System.out.println(created);

Output:

$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.MkDir


true

How to seek a file using java interface?


Reading Data – Seek
FileSystem.open returns an FSDataInputStream, an extension of java.io.DataInputStream.
It supports random access and reading via the interfaces:
PositionedReadable: read chunks of the stream at a given position
Seekable: seek to a particular position in the stream
Program:
public class SeekReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/training/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream input = null;
        try {
            input = fs.open(fileToRead);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            input.seek(11);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            input.seek(0);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}
Output:
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.SeekReadFile
start position=0: Hello from readme.txt
start position=11: readme.txt
start position=0: Hello from readme.txt

UNIT-II COMPLETED
