Hadoop Configuration

This document provides a detailed step-by-step guide for installing Hadoop on a Linux system, emphasizing the significance of each step and the necessary configuration settings. It covers prerequisites, user creation, SSH configuration, downloading and extracting Hadoop, setting environment variables, configuring Hadoop, formatting the HDFS NameNode, starting Hadoop daemons, verifying the installation, and running a sample MapReduce job. It also elaborates on the four key configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml) that define Hadoop's operational parameters.

Hadoop installation on Linux, step by step, with the significance of each step and why each configuration setting is needed, where applicable.

Installing Hadoop on a Linux system involves several steps, each of which is
crucial for setting up a functional Hadoop cluster. Below is a step-by-step guide,
along with the significance of each step and the rationale behind the
configuration settings.

Step 1: Prerequisites
Actions:
1. Update the System:

sudo apt-get update


sudo apt-get upgrade

2. Install Java: Hadoop is written in Java, so a Java Development Kit (JDK) is required.

sudo apt-get install openjdk-8-jdk

3. Verify Java Installation:

java -version

Significance:
• System Update: Ensures that your system has the latest security
patches and software updates.
• Java Installation: Hadoop is built on Java, and it requires a compatible
JDK to run. OpenJDK 8 is commonly used because it is stable and widely
supported.
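
If you are unsure which path to use for JAVA_HOME in later steps, you can locate the installed JDK. This is a small sketch; the exact path may differ on your distribution (java-8-openjdk-amd64 is typical on Ubuntu/Debian):

readlink -f $(which java)
# typically prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java,
# so JAVA_HOME would be /usr/lib/jvm/java-8-openjdk-amd64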

Step 2: Create a Dedicated Hadoop User


Actions:
1. Create a New User:

sudo adduser hadoopuser

2. Add User to Sudoers (Optional):

sudo usermod -aG sudo hadoopuser

Significance:
• Security and Isolation: Running Hadoop under a dedicated user
account enhances security and isolates Hadoop processes from other
system processes.
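
The remaining steps are typically performed as the dedicated user, so switch to it before continuing (hadoopuser matches the account created above; adjust if you chose a different name):

su - hadoopuser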

Step 3: Configure SSH for Password-less Login


Actions:
1. Generate SSH Keys:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

2. Copy the Public Key to Authorized Keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. Test SSH:

ssh localhost

Significance:
• Password-less Login: Hadoop requires SSH access to manage its nodes.
Password-less login simplifies the process and allows Hadoop to start and
stop daemons without manual intervention.
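
If the SSH test still prompts for a password, a common cause is overly permissive modes on the key files. A hedged fix, assuming the default ~/.ssh location:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys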

Step 4: Download and Extract Hadoop


Actions:
1. Download Hadoop:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

2. Extract the Tarball:

tar -xzvf hadoop-3.3.1.tar.gz

3. Move Hadoop to a Directory:

sudo mv hadoop-3.3.1 /usr/local/hadoop

Significance:
• Hadoop Installation: This step installs Hadoop binaries on your system.
Extracting the tarball and moving it to a standard directory like
/usr/local/hadoop makes it easier to manage and configure.
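
Because the directory was moved with sudo, /usr/local/hadoop is typically owned by root at this point. Assuming the dedicated user from Step 2 is named hadoopuser, hand ownership to it so later steps can write configuration and data there:

sudo chown -R hadoopuser:hadoopuser /usr/local/hadoop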

Step 5: Configure Environment Variables


Actions:
1. Edit .bashrc or .bash_profile:

nano ~/.bashrc

2. Add the Following Lines:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

3. Reload the Shell:

source ~/.bashrc

Significance:
• Environment Variables: These variables help the system locate Hadoop
binaries and Java libraries. Setting JAVA_HOME is crucial because Hadoop
needs to know where Java is installed.
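
A quick sanity check after reloading the shell (this assumes the variables above were added correctly):

echo $HADOOP_HOME    # should print /usr/local/hadoop
hadoop version       # should report Hadoop 3.3.1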

Step 6: Configure Hadoop


Actions:
• Edit hadoop-env.sh:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add or modify:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

• Edit core-site.xml:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

• Edit hdfs-site.xml:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>

• Edit mapred-site.xml:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

• Edit yarn-site.xml:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Significance:
• Hadoop Configuration: These configuration files define how Hadoop
operates:
• core-site.xml: Sets the default file system and other core
parameters.
• hdfs-site.xml: Configures HDFS (Hadoop Distributed File System)
settings like replication factor and storage directories.
• mapred-site.xml: Configures the MapReduce framework to use YARN.
• yarn-site.xml: Configures YARN (Yet Another Resource Negotiator) settings, such as the shuffle service.

Step 7: Format the HDFS Namenode


Actions:
1. Format the Namenode:

hdfs namenode -format

Significance:
• Initialization: This step initializes the HDFS filesystem. It is required
before you can start using HDFS.
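
Note that the directories referenced in hdfs-site.xml (dfs.namenode.name.dir and dfs.datanode.data.dir) should exist and be writable by the Hadoop user before the daemons use them. A minimal sketch, assuming the paths configured in Step 6:

mkdir -p /usr/local/hadoop/data/namenode
mkdir -p /usr/local/hadoop/data/datanode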

---------------------------------------------------------------------------------------

Step 8: Start Hadoop Daemons


Actions:
1. Start HDFS:

start-dfs.sh

2. Start YARN:

start-yarn.sh

Significance:
• Daemon Startup: These commands start the necessary Hadoop
daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.)
to run the Hadoop cluster.
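
The matching shutdown scripts, useful when reconfiguring or rebooting the machine, are:

stop-yarn.sh
stop-dfs.sh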
---------------------------------------------------------------------------------------

Step 9: Verify the Installation


Actions:
1. Check Running Processes:

jps

2. Access Web Interfaces:

• NameNode: http://localhost:9870

• ResourceManager: http://localhost:8088

Significance:
• Verification: Ensures that all Hadoop components are running correctly.
The jps command lists all Java processes, including Hadoop daemons. The
web interfaces provide a graphical view of the cluster's status.
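
For a single-node setup, jps should list the following daemons in addition to Jps itself (process IDs will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager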
--------------------------------------------------------------------------------------------

Step 10: Run a Sample MapReduce Job


Actions:
1. Create an Input Directory:

hdfs dfs -mkdir /input

2. Copy Input Files to HDFS:

hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input

3. Run the MapReduce Job:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep /input /output 'dfs[a-z.]+'

4. View the Output:


hdfs dfs -cat /output/*

Significance:
• Testing: Running a sample job verifies that the entire Hadoop stack is
functioning correctly. It also demonstrates how to interact with HDFS and
submit MapReduce jobs.
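
If you want to re-run the job, MapReduce will refuse to write to an output directory that already exists, so remove it first:

hdfs dfs -rm -r /output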

Elaboration on the four configuration files (the XML files) described above
in Step 6
The four configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and
yarn-site.xml) are critical for setting up and customizing Hadoop's behavior.
Each file serves a specific purpose and contains key-value pairs that define how
Hadoop components operate. Below is a detailed explanation of each file, its
significance, and the common configurations used.

1. core-site.xml
This file contains configuration settings that are common across the entire
Hadoop ecosystem. It defines core parameters that are used by both HDFS and
MapReduce.

Key Properties:
• fs.defaultFS: Specifies the default filesystem URI. This is the address
where the NameNode runs, and it is used by clients to connect to HDFS.

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

• Why: This tells Hadoop where the NameNode is located. Clients use this
URI to interact with HDFS.
• hadoop.tmp.dir: Defines the base directory for temporary files used by
Hadoop.

<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>

• Why: This directory is used to store temporary data, such as
intermediate results during MapReduce jobs. It must be writable by
the Hadoop user.
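
If you set hadoop.tmp.dir as above, create the directory and make sure the Hadoop user owns it (a sketch, assuming the hadoopuser account from the installation steps):

sudo mkdir -p /usr/local/hadoop/tmp
sudo chown hadoopuser:hadoopuser /usr/local/hadoop/tmp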

Significance:
• Global Settings: core-site.xml defines settings that are shared across all
Hadoop components, making it a central configuration file.

2. hdfs-site.xml
This file contains configuration settings specific to HDFS (Hadoop Distributed
File System). It defines how the NameNode and DataNodes operate.

Key Properties:
• dfs.replication: Specifies the number of replicas for each block of data
stored in HDFS.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

• Why: Replication ensures data reliability. A value of 1 means a single copy with no extra replicas (suitable for a single-node setup), while a typical production value is 3.
• dfs.namenode.name.dir: Specifies the directory where the NameNode
stores its metadata.

<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>

• Why: This directory stores critical metadata about the filesystem, such as
the directory tree and file-to-block mappings.
• dfs.datanode.data.dir: Specifies the directory where DataNodes store
the actual data blocks.

<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>

• Why: This directory stores the actual data blocks that make up the
files in HDFS.

Significance:
• HDFS-Specific Settings: hdfs-site.xml configures the behavior of HDFS,
including data storage, replication, and metadata management.

3. mapred-site.xml
This file contains configuration settings for the MapReduce framework. It
defines how MapReduce jobs are executed and managed.

Key Properties:
• mapreduce.framework.name: Specifies the execution framework for
MapReduce jobs.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

• Why: Setting this to yarn ensures that MapReduce jobs are executed using the YARN resource manager, which is the modern way to run MapReduce in Hadoop (a commonly needed companion setting for Hadoop 3.x is sketched after this list).
• mapreduce.jobtracker.address: (Optional) Specifies the address of the
JobTracker (used in Hadoop 1.x). In Hadoop 2.x and later, this is replaced
by YARN.

<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:9001</value>
</property>

• Why: In Hadoop 1.x, this was used to specify the JobTracker's address. In Hadoop 2.x and later, YARN replaces the JobTracker.
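
On Hadoop 3.x, the bundled example MapReduce jobs often also need the MapReduce environment properties set in mapred-site.xml so that YARN containers can find the MapReduce classpath. A commonly used addition, sketched here assuming HADOOP_HOME is /usr/local/hadoop as in the installation steps:

<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>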

Significance:
• MapReduce Configuration: mapred-site.xml defines how MapReduce
jobs are executed, including resource management and job scheduling.

4. yarn-site.xml
This file contains configuration settings for YARN (Yet Another Resource
Negotiator), which is responsible for resource management and job scheduling
in Hadoop.

Key Properties:
• yarn.nodemanager.aux-services: Specifies the auxiliary services that
NodeManagers should run.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

• Why: The mapreduce_shuffle service is required for MapReduce jobs to
work with YARN. It handles the shuffling of data between map and reduce
tasks.
• yarn.resourcemanager.hostname: Specifies the hostname of the
ResourceManager.

<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>

• Why: This tells YARN where the ResourceManager is running. In a multi-
node cluster, this would be the hostname of the master node.
• yarn.nodemanager.resource.memory-mb: Specifies the amount of
memory available for YARN containers on each NodeManager.

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>

• Why: This limits the amount of memory YARN can use on each node,
preventing resource contention with other applications.

Significance:
• Resource Management: yarn-site.xml configures YARN, which is
responsible for managing resources (CPU, memory) and scheduling tasks
across the cluster.

Summary of Significance:
• core-site.xml: Defines global settings for the entire Hadoop ecosystem.

• hdfs-site.xml: Configures HDFS, including data storage, replication, and metadata management.
• mapred-site.xml: Configures the MapReduce framework, including job
execution and resource management.
• yarn-site.xml: Configures YARN, which manages resources and
schedules tasks in a Hadoop cluster.

Why These Files Are Important:


• Customization: These files allow you to tailor Hadoop's behavior to your
specific environment and requirements.
• Scalability: Proper configuration ensures that Hadoop can scale
efficiently across multiple nodes.
• Performance: Optimizing these settings can significantly improve the
performance of your Hadoop cluster.
• Reliability: Configurations like replication and resource management
ensure that your data and jobs are handled reliably.
By understanding and configuring these files, you can set up a Hadoop cluster
that is optimized for your use case, whether it's a single-node setup for
development or a multi-node cluster for production.
