Hadoop Configuration
This guide walks through each step of configuring Hadoop and explains why the
respective configuration settings are needed at each step, wherever applicable.
Step 1: Prerequisites
Actions:
1. Update the System and Install Java (the apt commands assume a Debian/Ubuntu
system, which the JAVA_HOME path used later also assumes):
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-8-jdk
java -version
Significance:
• System Update: Ensures that your system has the latest security
patches and software updates.
• Java Installation: Hadoop is built on Java, and it requires a compatible
JDK to run. OpenJDK 8 is commonly used because it is stable and widely
supported.
2. Create a Dedicated Hadoop User:
sudo adduser hadoop
(The username hadoop is assumed here and in the ownership commands later in
this guide.)
Significance:
• Security and Isolation: Running Hadoop under a dedicated user
account enhances security and isolates Hadoop processes from other
system processes.
3. Test SSH:
ssh localhost
Significance:
• Password-less Login: Hadoop requires SSH access to manage its nodes.
Password-less login simplifies the process and allows Hadoop to start and
stop daemons without manual intervention.
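If ssh localhost prompts for a password, password-less login still needs to be
set up first. A typical sketch for a single-node install, assuming OpenSSH is
already installed:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys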
Download Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Significance:
• Hadoop Installation: This step installs Hadoop binaries on your system.
Extracting the tarball and moving it to a standard directory like
/usr/local/hadoop makes it easier to manage and configure.
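The wget command above only fetches the tarball; extracting and relocating it
might look like this (the hadoop user in the chown command is an assumption
carried over from the dedicated-user step):
tar -xzf hadoop-3.3.1.tar.gz
sudo mv hadoop-3.3.1 /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop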
nano ~/.bashrc
Add:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
source ~/.bashrc
Significance:
• Environment Variables: These variables help the system locate Hadoop
binaries and Java libraries. Setting JAVA_HOME is crucial because Hadoop
needs to know where Java is installed.
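A quick sanity check that the variables took effect after source ~/.bashrc
(hadoop version is part of the standard distribution):
echo $HADOOP_HOME
hadoop version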
• Edit hadoop-env.sh:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add or modify:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
• Edit core-site.xml:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
• Edit hdfs-site.xml:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
• Edit mapred-site.xml:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
• Edit yarn-site.xml:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Significance:
• Hadoop Configuration: These configuration files define how Hadoop
operates:
• core-site.xml: Sets the default file system and other core
parameters.
• hdfs-site.xml: Configures HDFS (Hadoop Distributed File System)
settings like replication factor and storage directories.
• mapred-site.xml: Configures the MapReduce framework to use YARN.
• yarn-site.xml: Configures YARN services such as the MapReduce shuffle
handler.
Format the NameNode (required once, before HDFS is used for the first time):
hdfs namenode -format
Significance:
• Initialization: This step initializes the HDFS filesystem. It is required
before you can start using HDFS.
---------------------------------------------------------------------------------------
1. Start HDFS:
start-dfs.sh
2. Start YARN:
start-yarn.sh
Significance:
• Daemon Startup: These commands start the necessary Hadoop
daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.)
to run the Hadoop cluster.
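The matching shutdown commands, for when the cluster needs to be stopped
later, come from the same sbin scripts:
stop-yarn.sh
stop-dfs.sh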
---------------------------------------------------------------------------------------
Check the running Java processes:
jps
Then open the web interfaces:
• NameNode: http://localhost:9870
• ResourceManager: http://localhost:8088
Significance:
• Verification: Ensures that all Hadoop components are running correctly.
The jps command lists all Java processes, including Hadoop daemons. The
web interfaces provide a graphical view of the cluster's status.
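On a healthy single-node setup, jps would typically list something like the
following (process IDs omitted here, and they will differ per run):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps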
--------------------------------------------------------------------------------------------
Significance:
• Testing: Running a sample job verifies that the entire Hadoop stack is
functioning correctly. It also demonstrates how to interact with HDFS and
submit MapReduce jobs.
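As a concrete smoke test, the examples jar that ships with Hadoop 3.3.1 can be
used; a typical run (the /user/hadoop paths assume the dedicated hadoop user,
so adjust the username as needed):
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'
hdfs dfs -cat /user/hadoop/output/*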
Elaboration on the Four Configuration Files (the XML Files in Step 6 Above)
The four configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and
yarn-site.xml) are critical for setting up and customizing Hadoop's behavior.
Each file serves a specific purpose and contains key-value pairs that define how
Hadoop components operate. Below is a detailed explanation of each file, its
significance, and the common configurations used.
1. core-site.xml
This file contains configuration settings that are common across the entire
Hadoop ecosystem. It defines core parameters that are used by both HDFS and
MapReduce.
Key Properties:
• fs.defaultFS: Specifies the default filesystem URI. This is the address
where the NameNode runs, and it is used by clients to connect to HDFS.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
• Why: This tells Hadoop where the NameNode is located. Clients use this
URI to interact with HDFS.
• hadoop.tmp.dir: Defines the base directory for temporary files used by
Hadoop.
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
• Why: This directory is used to store temporary data, such as
intermediate results during MapReduce jobs. It must be writable by
the Hadoop user.
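Since hadoop.tmp.dir must be writable by the Hadoop user, a small preparation
sketch (again assuming the user is named hadoop), followed by a quick check
that fs.defaultFS resolves:
sudo mkdir -p /usr/local/hadoop/tmp
sudo chown -R hadoop:hadoop /usr/local/hadoop/tmp
hdfs dfs -ls /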
Significance:
• Global Settings: core-site.xml defines settings that are shared across all
Hadoop components, making it a central configuration file.
2. hdfs-site.xml
This file contains configuration settings specific to HDFS (Hadoop Distributed
File System). It defines how the NameNode and DataNodes operate.
Key Properties:
• dfs.replication: Specifies the number of replicas for each block of data
stored in HDFS.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
• Why: A replication factor of 1 suits a single-node setup with only one
DataNode; production clusters typically keep the default of 3 for fault
tolerance.
• dfs.namenode.name.dir: Specifies the directory where the NameNode stores
its metadata.
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
• Why: This directory stores critical metadata about the filesystem, such as
the directory tree and file-to-block mappings.
• dfs.datanode.data.dir: Specifies the directory where DataNodes store
the actual data blocks.
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
• Why: This directory stores the actual data blocks that make up the
files in HDFS.
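The metadata and block directories configured above must exist and be writable
before the NameNode is formatted; a typical preparation sketch (assuming the
dedicated hadoop user):
sudo mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
sudo chown -R hadoop:hadoop /usr/local/hadoop/data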
Significance:
• HDFS-Specific Settings: hdfs-site.xml configures the behavior of HDFS,
including data storage, replication, and metadata management.
3. mapred-site.xml
This file contains configuration settings for the MapReduce framework. It
defines how MapReduce jobs are executed and managed.
Key Properties:
• mapreduce.framework.name: Specifies the execution framework for
MapReduce jobs.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
• Why: Setting this to yarn ensures that MapReduce jobs are executed
using the YARN resource manager, which is the modern way to run
MapReduce in Hadoop.
• mapreduce.jobtracker.address: (Optional) Specifies the address of the
JobTracker (used in Hadoop 1.x). In Hadoop 2.x and later, this is replaced
by YARN.
<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:9001</value>
</property>
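On Hadoop 3.x, MapReduce jobs on YARN also need the MapReduce classpath
visible to containers; the Hadoop single-node setup guide adds a property
along these lines to mapred-site.xml:
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>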
Significance:
• MapReduce Configuration: mapred-site.xml defines how MapReduce
jobs are executed, including resource management and job scheduling.
4. yarn-site.xml
This file contains configuration settings for YARN (Yet Another Resource
Negotiator), which is responsible for resource management and job scheduling
in Hadoop.
Key Properties:
• yarn.nodemanager.aux-services: Specifies the auxiliary services that
NodeManagers should run.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
• Why: The mapreduce_shuffle service is required for MapReduce jobs to
work with YARN. It handles the shuffling of data between map and reduce
tasks.
• yarn.resourcemanager.hostname: Specifies the hostname of the
ResourceManager.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
• Why: This tells YARN where the ResourceManager is running. In a multi-
node cluster, this would be the hostname of the master node.
• yarn.nodemanager.resource.memory-mb: Specifies the amount of
memory available for YARN containers on each NodeManager.
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
• Why: This limits the amount of memory YARN can use on each node,
preventing resource contention with other applications.
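Alongside the shuffle service, the Hadoop 3.x single-node guide also
whitelists environment variables so that containers inherit them; a commonly
used addition to yarn-site.xml:
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_MAPRED_HOME,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
</property>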
Significance:
• Resource Management: yarn-site.xml configures YARN, which is
responsible for managing resources (CPU, memory) and scheduling tasks
across the cluster.
Summary of Significance:
• core-site.xml: Defines global settings for the entire Hadoop ecosystem.