
Assignment 11

Title:

To design a distributed application using MapReduce which processes a log file of a system.

Problem Statement:
Design a distributed application using MapReduce which processes a log file of a
system.

Objective:
By completing this task, students will learn the following:
1. Hadoop Distributed File System.
2. MapReduce Framework.

Software/Hardware Requirements:
64-bit open-source OS (Linux), Java Development Kit (JDK), Hadoop.

Theory:

Hadoop: Hadoop is an open-source distributed computing framework designed to handle and process large volumes of data across clusters of commodity hardware. It
was inspired by Google's MapReduce and Google File System (GFS) papers and is
written in Java. Apache Hadoop provides a scalable, reliable, and distributed
computing environment for processing and analyzing big data.
Map Reduce: MapReduce is a programming model and framework for processing
and generating large datasets in a distributed and parallel manner. It consists of two
main phases: the Map phase, where input data is divided into smaller chunks and
processed independently, and the Reduce phase, where the results from the Map
phase are aggregated and combined to produce the final output.
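For example (illustrative values only): if three log entries contain the error codes 404, 500, and 404, the Map phase emits the intermediate pairs <404, 1>, <500, 1>, and <404, 1>; the framework then groups these pairs by key, so the Reduce phase receives <404, [1, 1]> and <500, [1]> and produces the final counts <404, 2> and <500, 1>.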
Mapper Class: The Mapper class is a crucial component in a MapReduce job,
responsible for processing each input record and generating intermediate key-value
pairs.
In the context of processing a log file, the Mapper class parses each log entry and
extracts relevant information.
The typical steps involved in implementing a Mapper class for log file processing
include:
1. Input Parsing: Read each line of the log file.
2. Data Extraction: Extract relevant information from each log entry, such as
timestamps, error codes, or other data points of interest.
3. Data Transformation: Convert the extracted information into key-value pairs.
For example, if the goal is to analyze error frequencies, the Mapper might emit
<error_code, 1> pairs for each occurrence of an error code in the log entry.
4. Output Emission: Emit the key-value pairs to the MapReduce framework for
further processing.
The Mapper class typically extends the Mapper class provided by the MapReduce
framework and overrides the map() method to define custom logic for processing
input records.
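As a concrete illustration of these steps, here is a minimal Mapper sketch. It uses the newer org.apache.hadoop.mapreduce API; the class name LogErrorMapper and the assumption that the error/status code is the ninth space-separated field (as in an Apache access log) are hypothetical and would need to be adapted to your actual log format.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text errorCode = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed log layout: space-separated fields with the status/error code as the ninth field
        String[] fields = value.toString().split(" ");
        if (fields.length > 8) {
            errorCode.set(fields[8]);
            context.write(errorCode, one);   // emit <error_code, 1>
        }
    }
}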
Reducer Class: The Reducer class is another crucial component in a MapReduce job,
responsible for aggregating and processing the intermediate key-value pairs
generated by the Mapper class.
In the context of processing a log file, the Reducer class receives key-value pairs
where the key represents a unique identifier (e.g., an error code) and the value
represents the count of occurrences.
The typical steps involved in implementing a Reducer class for log file processing
include:
1. Input Aggregation: Receive key-value pairs grouped by key from the
MapReduce framework.
2. Data Aggregation: Aggregate the counts of occurrences for each unique key.
3. Output Generation: Produce the final output, which may include aggregated
statistics, summaries, or any other desired analysis results.
The Reducer class typically extends the Reducer class provided by the MapReduce
framework and overrides the reduce() method to define custom logic for aggregating
intermediate results.
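Continuing the same illustration, a matching Reducer sketch (again using the newer org.apache.hadoop.mapreduce API; LogErrorReducer is a placeholder name, and the assignment's own code later in this document uses the older org.apache.hadoop.mapred API instead) might look like:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogErrorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        // Sum the occurrence counts emitted by the mapper for this key (e.g., an error code)
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));   // emit <error_code, total_count>
    }
}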
Driver Class : The Driver class in a MapReduce application is responsible for
configuring the job, setting up input and output paths, specifying mapper and
reducer classes, and submitting the job for execution. Here's a breakdown of the key
components typically found in a Driver class:
1. Configuration Setup: In the Driver class, you initialize a Hadoop configuration
object (Configuration) which holds various settings and parameters for the
MapReduce job. This includes properties such as input/output paths, mapper
and reducer classes, and any other job-specific configurations.
2. Job Initialization: Using the configuration object, you create a Job object (Job),
which represents the entire MapReduce job to be executed. This involves
specifying the name of the job, setting input/output formats, and configuring
the mapper and reducer classes.
3. Input and Output Paths: Specify the input and output paths for the job. This
tells Hadoop where to find the input data (e.g., log file) and where to write the
output of the job.
4. Mapper and Reducer Classes: Set the mapper and reducer classes to be used
in the MapReduce job. This involves specifying the Java classes that
implement the Mapper and Reducer interfaces and defining the logic for data
processing and aggregation.
5. InputFormat and OutputFormat: Configure the input and output formats for
the job, which define how input data is read and how output data is written.
Hadoop provides default input and output formats, but you can also use
custom formats if needed.
6. Output Key-Value Types: Specify the types of keys and values that the mapper
and reducer classes will emit. These types should match the output types of
the mapper and reducer classes.
7. Job Submission: Finally, submit the MapReduce job to the Hadoop or
MapReduce framework for execution. This involves calling the
job.waitForCompletion() method, which initiates the job execution and waits
for it to complete.
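Putting these pieces together, a minimal Driver sketch along these lines (newer org.apache.hadoop.mapreduce API, reusing the hypothetical LogErrorMapper and LogErrorReducer classes sketched above; the assignment's own Driver later in this document uses JobConf/JobClient from the older API) could be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogErrorDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                // 1. configuration setup
        Job job = Job.getInstance(conf, "log error count");      // 2. job initialization
        job.setJarByClass(LogErrorDriver.class);
        job.setMapperClass(LogErrorMapper.class);                // 4. mapper and reducer classes
        job.setReducerClass(LogErrorReducer.class);
        job.setOutputKeyClass(Text.class);                       // 6. output key/value types
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // 3. input path (default TextInputFormat, see point 5)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // 3. output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // 7. submit the job and wait for completion
    }
}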
Log file : A log file is a file that records events, actions, or messages that occur within
a software application, operating system, or system component. These files are
commonly used for troubleshooting, debugging, monitoring, auditing, and analysis
purposes. Here's some key information about log files:
Log files serve various purposes, including:
Recording system events: Log files often record events such as system startups,
shutdowns, errors, warnings, and informational messages.
Debugging: Developers use log files to debug software by analyzing logs to identify
and fix issues.
Monitoring and performance analysis: System administrators use log files to monitor
system health, performance, and resource usage.
Auditing and compliance: Log files are sometimes used to track user activities for
auditing and compliance purposes.
Log files can be stored in various formats, including plain text, XML, JSON, and
structured formats. The choice of format depends on the logging framework or
application generating the logs and the requirements of downstream analysis tools.
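For instance, a single entry of an Apache-style access log (illustrative values only) might look like:
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202
The MapReduce code later in this assignment splits such lines on the "-" character and counts occurrences of the first field (the client IP address), i.e., the number of requests made by each client.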

Steps to install Hadoop (Linux):

Step 1 : Install Java Development Kit


The default Ubuntu repositories contain both Java 8 and Java 11. I am using Java 8 because Hive only works with this version. Use the following command to install it:
sudo apt update && sudo apt install openjdk-8-jdk

Step 2 : Verify the Java version :


Once you have successfully installed it, check the current Java version:
java -version

Step 3 : Install SSH :


SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality,
and allows for efficient distributed processing of data across the cluster.
sudo apt install ssh

Step 4 : Create the hadoop user :


All the Hadoop components will run as the user that you create for Apache Hadoop,
and the user will also be used for logging in to Hadoop’s web interface.
Run the following command to create the user and set a password:
sudo adduser hadoop

Step 5 : Switch user :


Switch to the newly created hadoop user:
su - hadoop

Step 6 : Configure SSH :


Now configure password-less SSH access for the newly created hadoop user. When prompted for the file in which to save the key and for a passphrase, simply press Enter to accept the defaults. Generate an SSH key pair first:
ssh-keygen -t rsa
Step 7 : Set permissions :
Copy the generated public key to the authorized key file and set the proper
permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

Step 8 : SSH to the localhost


ssh localhost

You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes
and hit Enter to authenticate the localhost.

Step 9 : Switch user


Switch to the hadoop user again:
su - hadoop

Step 10 : Install hadoop


Download hadoop 3.3.6
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Once you’ve downloaded the file, extract it to a folder.


tar -xvzf hadoop-3.3.6.tar.gz

Rename the extracted folder to remove version information. This is an optional step,
but if you don’t want to rename, then adjust the remaining configuration paths.
mv hadoop-3.3.6 hadoop

Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. Here I am using the nano editor: paste the code with Ctrl+Shift+V, then save and exit with Ctrl+X, Y, and Enter:
nano ~/.bashrc

Append the below lines to the file.


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration in the current environment.
source ~/.bashrc
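(Optional) To confirm that the variables were loaded into the current shell, you can print one of them:
echo $HADOOP_HOME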

You also need to configure JAVA_HOME in hadoop-env.sh file. Edit the Hadoop
environment variable file in the text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the “export JAVA_HOME” line and configure it:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 11 : Configuring Hadoop :
First, you will need to create the namenode and datanode directories inside the
Hadoop user home directory. Run the following command to create both directories:
cd hadoop/

mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

Next, edit the core-site.xml file and update with your system hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Change the following name as per your system hostname:


<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save and close the file.
Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the NameNode and DataNode directory paths as shown below:


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>

Then, edit the yarn-site.xml file:


nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save the file and close it.

Step 12 : Start Hadoop cluster:


Before starting the Hadoop cluster, you will need to format the NameNode as the hadoop user.
Run the following command to format the Hadoop NameNode:
hdfs namenode -format

Once the NameNode directory has been successfully formatted with the HDFS file system, you will see the message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted”.
Then start the Hadoop cluster with the following command.
start-all.sh

You can now check the status of all Hadoop services using the jps command:
jps

Steps to compile and run the MapReduce job on the log file:

Step 1) jps (check that all the Hadoop services are up and running)

Step 2) cd

Step 3) sudo mkdir mapreduce_bhargavi

Step 4) sudo chmod 777 -R mapreduce_bhargavi/

Step 5) sudo chown -R bhargavi mapreduce_bhargavi/

Step 6) sudo cp /home/bhargavi/Desktop/logfiles1/* ~/mapreduce_bhargavi/

Step 7) cd mapreduce_bhargavi/

Step 8) ls

Step 9) sudo chmod +r *.*

Step 10) export CLASSPATH="/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/common/hadoop-common-3.3.6.jar:~/mapreduce_bhargavi/SalesCountry/*:$HADOOP_HOME/lib/*"

Step 11) javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

Step 12) ls
Step 13) cd SalesCountry/

Step 14) ls (check if the class files are created)


Step 15) cd ..

Step 16) gedit Manifest.txt

(add the following line to it:
Main-Class: SalesCountry.SalesCountryDriver)

Step 17) jar -cfm mapreduce_bhargavi.jar Manifest.txt SalesCountry/*.class

Step 18) ls

Step 19) cd

Step 20) cd mapreduce_bhargavi/

Step 21) sudo mkdir /input200


Step 22) sudo cp access_log_short.csv /input200

Step 23) $HADOOP_HOME/bin/hdfs dfs -put /input200 /

Step 24) $HADOOP_HOME/bin/hadoop jar mapreduce_bhargavi.jar /input200 /output200

Step 25) hadoop fs -ls /output200

Step 26) hadoop fs -cat /output200/part-00000

Step 27) Now open the Mozilla browser and go to localhost:9870/dfshealth.html to check the NameNode interface (in Hadoop 3.x the NameNode web UI runs on port 9870).

Java Code to process logfile

Mapper Class:
package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        // Split the log entry on "-"; in an access log the first field is the client IP address
        String[] SingleCountryData = valueString.split("-");
        // Emit <first field, 1> so the reducer can count occurrences per key
        output.collect(new Text(SingleCountryData[0]), one);
    }
}

Reducer Class:

package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // Sum the counts emitted by the mapper for this key
        while (values.hasNext()) {
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        // Emit the total count for this key
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}

Driver Class:

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

    public static void main(String[] args) {
        JobClient my_client = new JobClient();

        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);

        // Set a name of the Job
        job_conf.setJobName("SalePerCountry");

        // Specify data type of output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Specify names of Mapper and Reducer Class
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

        // Specify formats of the data type of Input and output
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // Set input and output directories using command line arguments:
        // args[0] = name of input directory on HDFS, args[1] = name of output directory to be created to store the output file
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);

        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Note : The paths and directory names will change according to your folder. Change the names and

paths accordingly.

Conclusion :
In this assignment, we learnt how to process a log file using the Hadoop framework on a distributed system.
