
Department of Information Technology & Computer Applications

BIG DATA ANALYTICS LAB


(22IT805)

Regulation – R22

Lab manual for the Academic Year (2024-25)

IV B.Tech I Semester

DEPARTMENT OF
INFORMATION TECHNOLOGY & COMPUTER APPLICATIONS

Vadlamudi, 522 213, AP, India

Name of the Faculty: Dr. SRIKANTH YADAV.M


Asst. Professor,
Department of IT&CA

Vision:
The vision of the Big Data Analytics Lab is to empower students with advanced skills and
expertise in big data technologies, preparing them to become future-ready professionals in the rapidly
evolving field of data analytics. By providing hands-on experience with industry-leading tools such as
Hadoop, MapReduce, Hive, PIG, and Spark, the lab aims to cultivate a deep understanding of data
management, processing, and analysis. Through practical exercises and real-world applications, students
are encouraged to explore innovative solutions to complex data challenges, fostering a mindset of
continuous learning and adaptation to emerging trends in big data analytics. The lab seeks to inspire
students to become leaders and innovators in leveraging data-driven insights for impactful decision-
making and problem-solving across diverse industries and domains.

Mission:
 Equip students with comprehensive knowledge and practical skills in big data analytics, focusing on
handling and analyzing large datasets efficiently.
 Foster a deep understanding of various big data tools and technologies such as Hadoop, MapReduce,
Hive, PIG, and Spark through hands-on experiences and real-world applications.
 Empower students to apply advanced data analytics techniques to solve complex problems and make
informed decisions across different industry sectors.
 Prepare students to meet the growing demand for skilled data analysts and professionals capable of
harnessing the power of big data to drive innovation and business success.

Programme Educational Objectives (PEOs)


Graduates of Information Technology programme should be able to:
 PEO1: Evolve as globally competent computer professionals possessing leadership skills for
developing innovative solutions in multidisciplinary domains.
 PEO2: Excel as socially committed individual having high ethical values and empathy for the
needs of society.
 PEO3: To prepare students to succeed in employment or their profession, or to pursue postgraduate studies.
 PEO4: Involve in lifelong learning to adapt the technological advancements in the emerging areas
of computer applications.

Programme Specific Outcomes (PSOs)


 PSO1: Develop a competitive edge in basic technical skills of computer applications like
Programming Languages, Algorithms and Data Structures, Databases and Software Engineering.
 PSO2: Able to identify, analyze and formulate solutions for problems using computer
applications.
 PSO3: Attempt to work professionally with a positive attitude as an individual or in
multidisciplinary teams and to communicate effectively among the stakeholders.

Programme Outcomes (POs)


The graduates of Information Technology will be able to:
 PO1: Able to design and develop reliable software applications for social needs and excel in
IT-enabled services.
 PO2: Able to analyse and identify the customer requirements in multidisciplinary domains, create
high level design and implement robust software applications using latest technological skills.

 PO3: Proficient in successfully designing innovative solutions for solving real life business
problems and addressing business development issues with a passion for quality, competency and
holistic approach.
 PO4: Perform professionally with social, cultural and ethical responsibility as an individual as well
as in multifaceted teams with a positive attitude.
 PO5: Capable of adapting to new technologies and constantly upgrade their skills with an attitude
towards independent and lifelong learning.


PREFACE
Prerequisite: Data Mining
About the LAB:
The Big Data Analytics Lab is designed to equip students with practical skills and in-depth knowledge
required to handle and analyze massive datasets using contemporary big data tools and technologies. This lab
focuses on the critical aspects of big data, ranging from understanding different types of digital data to
implementing and managing big data frameworks like Hadoop, YARN, and MapReduce. Through hands-on
exercises, students will gain the ability to process and analyze structured, semi-structured, and unstructured
data, which are pivotal in various real-world applications such as social media analysis, sensor data processing,
and customer review analytics.
In the initial sessions, students will explore fundamental concepts such as the types of digital data, their
real-time applications, and the significant challenges associated with big data environments. By distinguishing
between structured, unstructured, and semi-structured data, students will understand how to categorize and
approach different data sets. Moreover, the lab will delve into real-time challenges like data volume, velocity,
variety, and veracity, providing students with strategies to address these issues effectively. The role of a data
analyst in the decision-making process and hardware support for processing huge amounts of data will also be
discussed, laying a strong foundation for the practical sessions.
As the lab progresses, emphasis will be placed on the practical aspects of big data analytics, including
the installation and configuration of Hadoop and its associated components. Students will learn to set up a
Hadoop environment, manage HDFS, and execute basic Hadoop commands, which are essential for data
storage and processing. The lab also covers the installation and management of YARN, which is crucial for
resource allocation and job scheduling in a Hadoop cluster. Furthermore, students will explore the advantages
and drawbacks of MapReduce, the modules of MapReduce, and the installation and configuration of
MapReduce. By mastering these technologies, students will become proficient in managing and processing
large datasets efficiently.
In addition to Hadoop and MapReduce, the lab will cover the installation and utilization of Hive, PIG,
and Spark. Students will learn how to install these tools, understand their running and execution modes, and
explore their real-time applications. Hive will be explored for its use in data warehousing and SQL-like query
execution. PIG will be examined for its data flow scripting capabilities, and Spark will be discussed for its
powerful in-memory data processing capabilities. By the end of the lab, students will be well-versed in the
complete ecosystem of big data tools, prepared to tackle complex data analytics tasks in both academic and
professional settings.

Relevance to industry:
The skills and knowledge gained in the Big Data Analytics Lab are highly relevant to the industry, as
they address the critical need for professionals capable of managing and analyzing vast amounts of data. In
today's data-driven world, businesses across various sectors, including finance, healthcare, retail, and
technology, rely on big data analytics to drive decision-making, optimize operations, and gain competitive
advantages. Proficiency in tools like Hadoop, MapReduce, Hive, PIG, and Spark, as well as a deep
understanding of data processing challenges and solutions, prepares students to meet the demands of the job
market. These skills enable them to efficiently handle big data projects, contribute to strategic initiatives, and
innovate in fields where data analysis is pivotal.

Latest Technologies:
1. Hadoop
2. MapReduce
3. Hive, PIG, Spark

Lab Evaluation Procedure:


The performance of a student in each lab is evaluated continuously during the semester. The marks awarded
through continuous evaluation are referred to as internal marks. A comprehensive end-semester examination is also
conducted, and the marks awarded for this evaluation are referred to as external marks.
The maximum sum of internal and external assessment marks is 100, split in the ratio of 50:50.

Internal Evaluation - 50 marks

External Evaluation - 50 marks
Internal Evaluation Criteria:


External Evaluation Criteria:


INDEX


22IT805 BIG DATA ANALYTICS LAB

Syllabus

1. HDFS basic command-line file operations.


2. HDFS monitoring User Interface.
3. WordCount Map Reduce program using Hadoop.
4. Implementation of word count with combiner Map Reduce program.
5. Practice on Map Reduce monitoring User Interface
6. Implementation of Sort operation using MapReduce
7. MapReduce program to count the occurrence of similar words in a file by using
partitioner.
8. Design a MapReduce solution to find the years whose average sales is greater than 30. The
input file format has the year, the sales of all months and the average sales: Year Jan Feb Mar
April May Jun July Aug Sep Oct Nov Dec Average
9. MapReduce program to find Dept wise salary. Empno EmpName Dept Salary
10. Creation of Database using Hive.
11. Creation of partitions and buckets using Hive.
12. Practice of advanced features in Hive Query Language: RC File & XML data
processing.
13. Install and Run Pig then write Pig Latin scripts to sort, group, join, project and filter
the data.
14. Implementation of Word count using Pig.
15. Implementation of word count using Spark RDDs.
16. Filter the log data using Spark RDDs.

Text Book:
1. Big Data Analytics 2ed, Seema Acharya, Subhashini Chellappan, Wiley Publishers, 2020
Reference Books:

1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions",
Wiley, ISBN: 9788126551071, 2015.
2. Chris Eaton, Dirk deRoos et al., "Understanding Big Data", McGraw Hill, 2012.
3. Tom White, "Hadoop: The Definitive Guide", O'Reilly, 2012.
4. Vignesh Prajapati, "Big Data Analytics with R and Hadoop", Packt Publishing, 2013.


Course Description & Objectives:


This course gives an overview of Big Data, i.e. the storage, retrieval and processing of big data. The focus
will be on the "technologies", i.e., the tools and algorithms that are available for the storage and processing of
Big Data and for a variety of analytics.

Course Outcomes: After completion of this course, a student will be able to:

COs  Course Outcomes  POs

1  Understand Big Data and its analytics in the real world  1
2  Use the Big Data frameworks like Hadoop and NOSQL to efficiently store and process Big Data to generate Analytics  2
3  Design of Algorithms to solve Data Intensive problems using Map Reduce Paradigm  3
4  Design and Implementation of Big Data Analytics using Pig and Spark to solve Data Intensive problems and to generate analytics  4
5  Analyse Big Data using Hive  5

Skills:
 Build and maintain reliable, scalable, distributed systems with Apache Hadoop.
 Develop Map-Reduce based Applications for Big Data.
 Design and build applications using Hive and Pig based Big data Applications.
 Learn tips and tricks for Big Data use cases and solutions

Mapping Of Course Outcomes with Program Outcomes:


PO1 PO2 PO3 PO4 PO5 PSO1 PSO2 PSO3
CO1 √ √ √ √ √
CO2 √ √ √ √ √
CO3 √ √ √ √ √
CO4 √ √ √ √ √
CO5 √ √ √ √ √


List of Experiments:

S.NO  PROGRAM NAME  CO

1  HDFS basic command-line file operations  2
2  HDFS monitoring User Interface  4
3  WordCount Map Reduce program using Hadoop  4
4  Implementation of word count with combiner Map Reduce program  3
5  Practice on Map Reduce monitoring User Interface  3
6  Implementation of Sort operation using MapReduce  3
7  MapReduce program to count the occurrence of similar words in a file by using partitioner  3
8  Design MapReduce solution to find the years whose average sales is greater than 30; the input file format has year, sales of all months and average sales (Year Jan Feb Mar April May Jun July Aug Sep Oct Nov Dec Average)  3
9  MapReduce program to find Dept wise salary (Empno EmpName Dept Salary)  5
10  Install and Run Pig then write Pig Latin scripts to sort, group, join, project and filter the data  1
11  Implementation of Word count using Pig  3
12  Creation of Database and tables using Hive query language  5
13  Creation of partitions and buckets using Hive  1
14  Practice of advanced features in Hive Query Language: RC File & XML data processing  3
15  Implementation of word count using Spark RDDs  5
16  Filter the log data using Spark RDDs  1

LAB Setup procedure:


1. Installation of a stable version of the Hadoop software
2. Installation of Hive, PIG, Spark
3. Installation of the Cloudera QuickStart VM
4. Installation of VMware Workstation / VMware Player (to run the VM)


Installation of Hadoop

Install and run Hadoop in standalone mode, pseudo-distributed mode and fully distributed cluster mode.

Step-1: Make sure Java is installed on your system


Open a terminal and run:
$ java -version
If Java is already installed, move on to Step 2. If not, install it:
$ sudo apt-get update
$ sudo apt-get install default-jdk
Now check the version once again:
$ java -version

Standalone mode.
Step-2: Download a stable Hadoop release from the Apache Hadoop downloads page, then extract it:
$ tar -xzvf hadoop-2.7.3.tar.gz
//Change the version number if needed to match the Hadoop version you have downloaded.//

Step-3: Move the extracted directory to /usr/local, a suitable location for local installs.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop

Step-4: Now go to the Hadoop distribution directory using terminal


$ cd /usr/local/hadoop
Let's see what's inside the Hadoop folder:
etc — has the configuration files for the Hadoop environment.
bin — has the various Hadoop commands, including the hadoop command itself.
share — has the jars that are required when you write a MapReduce job; it holds the Hadoop libraries.

Step-5: The hadoop command in the bin folder is used to run jobs in Hadoop.
$ bin/hadoop

Step-6: The jar subcommand is used to run MapReduce jobs on the Hadoop cluster:
$ bin/hadoop jar

Step-7: Now we will run an example MapReduce job to ensure that our standalone install works.


Create an input directory, place the input files in it, and run the MapReduce command on it. Here we
copy Hadoop's own configuration files and use them as the text input for our MapReduce job.
$ mkdir input
$ cp etc/hadoop/* input/

This is run using the bin/hadoop command. jar indicates that the MapReduce operation is specified in a
Java archive. Here we will use the hadoop-mapreduce-examples.jar file which comes along with the
installation. The jar name differs based on the version you are installing. Now move to your Hadoop
install directory and type:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
If you are not in the correct directory, you will get an error saying "Not a valid JAR" as shown
below. If this issue persists, check whether the location of the jar file is correct for your system.
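As a concrete check, the standard grep example from the Apache documentation can be run on that input
(the regular expression and the output directory name here are just an illustration, not part of the original manual):
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*
In standalone mode the output directory is created on the local filesystem, so it can be inspected with plain cat.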

Running the example to check the working of standalone mode.


This shows a MapReduce job that ran successfully on the standalone setup.


Experiment 01
Problem statement:
HDFS basic command-line file operations.

Objective: To understand the basic hadoop commands

1. To Check hadoop version

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hadoop version

2. To check the Java compiler version

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$javac -version

3. To update hadoop packages

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$sudo apt-get update

4. To format namenode

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs namenode -format

5. To start hadoop services

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$start-all.sh

6. To see the services started

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$jps

7. To Create new directory

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -mkdir /msy

Or

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hadoop fs -mkdir /msy

8. To Remove a file in the specified path:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -rm <src>

Eg. hdfs dfs -rm /msy/abc.txt

9. To Copy file from local file system to hdfs:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ hdfs dfs -copyFromLocal <src> <dst>


Eg. hdfs dfs -copyFromLocal /home/vignan/sample.txt /msy/abc1.txt

10. To display a list of contents in a directory:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -ls <path>

Eg. hdfs dfs -ls /msy

11. To display contents in a file:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -cat <path>

Eg. hdfs dfs -cat /msy/abc1.txt

12. To copy file from hdfs to the local file system:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -copyToLocal <src> <dst>

Eg. hdfs dfs -copyToLocal /msy/abc1.txt /home/vignan/Desktop/sample.txt

13. To display the last few lines of a file:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -tail <path>

Eg. hdfs dfs -tail /msy/abc1.txt

14. To display the aggregate length of the file in bytes:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -du <path>

Eg. hdfs dfs -du /msy

15. To count the number of directories, files, and bytes under the given path:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -count <path>

Eg. hdfs dfs -count /msy

o/p: 1 1 60

16. To remove a directory from hdfs:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -rm -r <path>

Eg. hdfs dfs -rm -r /chp

Experiment 02
HDFS monitoring User Interface


Let us access the HDFS web console.


1. Access the link http://MASTER_NODE:50070/ using your browser, and verify that you can
see the HDFS startup page. Here, replace MASTER_NODE with the IP address of the master
node running the HDFS NameNode.
2. The following screenshot shows the current status of the HDFS installation, including the
number of nodes, the total storage, and the storage used by each node. It also allows users to browse the
HDFS filesystem.
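The same summary can also be obtained from the terminal; a quick sanity check (not part of the web UI itself) is:

$ hdfs dfsadmin -report

This prints the configured capacity, the used and remaining space, and the status of each DataNode, which should
match what the NameNode web page shows.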

Experiment 03
Problem statement: Wordcount Map Reduce program using standalone Hadoop
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount
{
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable>
{

private IntWritable result = new IntWritable();


public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values) { sum+= val.get();
}
result.set(sum);
context.write(key, result);

}
}
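The listing above ends with the reducer; the driver that wires the job together is not shown. A minimal main
method, following the standard Hadoop WordCount example that the steps below invoke, would be added inside
the WordCount class (the final brace then closes the class):

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// args[0] is the HDFS input directory, args[1] the (non-existing) output directory
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}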

Steps:
1. Create one folder on the Desktop named "WordCountTutorial"
a. Paste the WordCount.java file into it
b. Create a folder named "Input_Data" -> Create an Input.txt file (enter some words)
c. Create a folder named "tutorial_classes"
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /WordCountTutorial
5. hadoop fs -mkdir /WordCountTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/WordCountTutorial/Input_Data/Input.txt
/WordCountTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/WordCountTutorial'
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:~/Desktop/WordCountTutorial$
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/WordCountTutorial/tutorial_classes
/home/hadoop/Desktop/WordCountTutorial/WordCount.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/WordCountTutorial/firstTutorial.jar WordCount
/WordCountTutorial/input /WordCountTutorial/output
11. See the output
hdfs dfs -cat /WordCountTutorial/output/*

Output:

Experiment 04
Problem statement:
Implementation of word count with combiner Map Reduce program

CODE: Combiner/Reducer Code


ReduceClass.java
package com.javacodegeeks.examples.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException
{
int sum = 0;
Iterator<IntWritable> valuesIt = values.iterator();
//For each key-value pair, get the value and add it to the sum
//to get the total occurrences of a word
while(valuesIt.hasNext())
{
sum = sum + valuesIt.next().get();
}
//Writes the word and total occurrences as a key-value pair to the context
context.write(key, new IntWritable(sum));
}}
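What turns this reducer into a combiner is the driver configuration, which the listing above omits. A minimal
driver sketch in the same package (MapClass here is a hypothetical tokenizing mapper, assumed to exist in the
same package and to emit (word, 1) pairs):

package com.javacodegeeks.examples.wordcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count with combiner");
job.setJarByClass(WordCountDriver.class);
job.setMapperClass(MapClass.class);       // hypothetical mapper emitting (word, 1) pairs
job.setCombinerClass(ReduceClass.class);  // the reducer above also runs as a per-mapper combiner
job.setReducerClass(ReduceClass.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

With the combiner in place, the counts are partially summed on each mapper before the shuffle, which reduces the
amount of intermediate data sent to the reducers.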

Output:

Experiment 05
Problem statement:
Practice on Map Reduce monitoring User Interface

Aim: Practice on Map Reduce monitoring User Interface


PROCEDURE:
Let us access the MapReduce web console.
1. Access the link http://MASTER_NODE:8088/ using your browser, and verify that you can
see the ResourceManager (YARN) web page. Here, replace MASTER_NODE with the IP address of the master
node running the YARN ResourceManager.
2. The following screenshot shows the current status of the various running applications: the
application id, completed applications, failed applications, and the resources used.
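The same information can be pulled from the command line; a quick cross-check (assuming YARN is running) is:

$ yarn application -list -appStates ALL

which lists the applications known to the ResourceManager together with their state and progress.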

Experiment 06
Problem statement:
Implementation of Sort operation using MapReduce
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Sort{
public static class SortMapper extends Mapper<LongWritable,Text,Text,Text>{
Text comp_key = new Text();
protected void map(LongWritable key, Text value, Context context) throws IOException,InterruptedException{
String[] token = value.toString().split(",");
// build a single composite key from the first two fields; calling set() twice would only keep the last value
comp_key.set(token[0] + "-" + token[1]);
context.write(comp_key,new Text(token[0]+"-"+token[1]+"-"+token[2]));
}
}
public static class SortReducer extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}
public static void main(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true)?0:1);
}
}



Steps:
1. Create one folder on the Desktop named "SortTutorial"
a. Paste the Sort.java file into it
b. Create a folder named "Input_Data" -> Create an Input.txt file (enter a few comma-separated records with three fields, since the mapper splits each line on commas)
c. Create a folder named "tutorial_classes"
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /SortTutorial
5. hadoop fs -mkdir /SortTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/SortTutorial/Input_Data/Input.txt
/SortTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/SortTutorial'
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/SortTutorial/tutorial_classes
/home/hadoop/Desktop/SortTutorial/Sort.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/SortTutorial/firstTutorial.jar Sort
/SortTutorial/input /SortTutorial/output
11. See the output
hdfs dfs -cat /SortTutorial/output/*

Output:

Experiment 07
Problem statement:
MapReduce program to count the occurrence of similar words in a file by using partitioner

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
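The listing above relies on Hadoop's default HashPartitioner, while the experiment asks for an explicit
partitioner. A minimal sketch written against the same old (mapred) API is shown below; the split-by-first-letter
rule and the nested class name are only an illustration, not part of the original program:

public static class WordPartitioner implements Partitioner<Text, IntWritable> {
public void configure(JobConf job) { }
public int getPartition(Text key, IntWritable value, int numReduceTasks) {
if (numReduceTasks == 0) return 0;
// illustrative rule: words starting with a-m go to one reducer, the rest to another
char c = Character.toLowerCase(key.toString().charAt(0));
int bucket = (c >= 'a' && c <= 'm') ? 0 : 1;
return bucket % numReduceTasks;
}
}

It would be registered in main() with:
conf.setPartitionerClass(WordPartitioner.class);
conf.setNumReduceTasks(2);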

Output:

Experiment 08
Problem statement:
Design MapReduce solution to find the years whose average sales is greater than 30. input file format has year,
sales of all months and average sales. The sample input file is,

package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator <IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())

{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Save the above program into ProcessUnits.java. The compilation and execution of the program is given below.
Compilation and Execution of ProcessUnits Program
Let us assume we are in the home directory of Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units
Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Download the jar from mvnrepository.com. Let us assume the download folder is /home/hadoop/.
Step 3 − The following commands are used to compile the ProcessUnits.java program and to create a jar for the
program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4 − The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5 − The following command is used to copy the input file named sample.txt in the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6 − The following command is used to verify the files in the input directory
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7 − The following command is used to run the ProcessUnits application by taking input files from the input
directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Output:

Experiment 09
Problem statement:
MapReduce program to find Dept wise salary. The sample input file is as follows.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Salary
{
public static class SalaryMapper extends Mapper <LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] token = value.toString().split(",");
int s = Integer.parseInt(token[2]);
IntWritable sal = new IntWritable();
sal.set(s);
context.write(new Text(token[1]),sal);
}
}
public static class SalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key,result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary");
job.setJarByClass(Salary.class);
job.setMapperClass(SalaryMapper.class);

job.setReducerClass(SalaryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Output:

Experiment 10
Problem statement:
Install and Run Pig then write Pig Latin scripts to sort, group, join, project and filter the data.
Apache Pig provides a high-level procedural language, Pig Latin, for querying large data sets using Hadoop and
the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output. These operators are the main tools Pig Latin provides to operate on the data. They
allow you to transform it by sorting, grouping, joining, projecting, and filtering. Let's create two files to run the
commands:
We have two files named 'first' and 'second'. The first file contains three fields: user, url & id.

The second file contains two fields: url & rating. These two files are CSV files.

The Apache Pig operators can be classified as: relational and diagnostic.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data. They allow you to transform the
data by sorting, grouping, joining, projecting and filtering. This section covers the basic relational operators.

LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
In this example, the Load operator loads data from file ‘first’ to form relation ‘loading1’.
The field names are user, url, id.
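Since the original manual shows the remaining operators only as screenshots, the following Pig Latin sketch
illustrates them on the two files described above (the file paths, delimiter and field types are assumptions):

-- load the two CSV files described above
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

sorted    = ORDER loading1 BY user;                    -- sort
grouped   = GROUP loading1 BY url;                     -- group
joined    = JOIN loading1 BY url, loading2 BY url;     -- join
projected = FOREACH loading1 GENERATE user, url;       -- project
filtered  = FILTER loading2 BY rating > 3;             -- filter

DUMP joined;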

Experiment 11
Problem statement:
Implementation of Word count using Pig.

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);


words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Experiment 12
Problem statement:
Creation of Database using hive

a. To create a database named “STUDENTS” with comments and database properties.

CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH


DBPROPERTIES ('creator' = 'JOHN');

b. To describe a database.

DESCRIBE DATABASE STUDENTS;

c. To drop database.

DROP DATABASE STUDENTS;

d. To create managed table named ‘STUDENT’.

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';

e. To create external table named ‘EXT_STUDENT’.

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT)


ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';

f. To load data into the table from file named student.tsv.

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE


EXT_STUDENT;

g. To retrieve the student details from “EXT_STUDENT” table.

SELECT * from EXT_STUDENT;

Experiment 13
Problem statement:
Creation of partitions and buckets using Hive.
Partition is of two types:
 STATIC PARTITION: It is up to the user to mention the partition (the segregation unit) into which the data from
the file is to be loaded.
 DYNAMIC PARTITION: The user is required to simply state the column on which the partitioning will
take place. Hive will then create partitions based on the unique values in the column on which the partition is to be
carried out.
 Partitions split the larger dataset into more meaningful chunks.
 Hive provides two kinds of partitions: Static Partition and Dynamic Partition.

a. To create static partition based on “gpa” column.

CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)


PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

b. Load data into partition table from table.

INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0) SELECT rollno,


name from EXT_STUDENT where gpa=4.0;
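Only the static case is shown above; a dynamic-partition sketch using the same tables (the settings below are the
usual Hive options for enabling dynamic partitioning, and the table name here is chosen for illustration) would be:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;

Here Hive creates one partition per distinct gpa value found in EXT_STUDENT, instead of the user naming each
partition by hand.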

Bucketing is similar to partition.


 However there is a subtle difference between partition and bucketing. In partition, you need to create
partition for each unique value of the column. This may lead to a situation where you may end up with
thousands of partitions.
 This can be avoided using Bucketing in which you can limit the number of buckets that will be created. A
bucket is a file whereas a partition is a directory.

a. To create a bucketed table having 3 buckets.

CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name STRING,grade FLOAT)


CLUSTERED BY (grade) into 3 buckets;

b. Load data to bucketed table.

FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno,name,gpa;
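On older Hive releases (before Hive 2.x) the insert above also needs bucketing enforcement switched on first,
otherwise the data is not split into the declared number of buckets:

SET hive.enforce.bucketing = true;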
c. To display the content of first bucket.

SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);

Hive supports aggregation functions like avg, count, etc.

a. To write the average and count aggregation function.

SELECT avg(gpa) FROM STUDENT;

SELECT count(*) FROM STUDENT;

b. To write group by and having function.

SELECT rollno, name,gpa


FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;

Experiment 14
Problem statement:
Practice of advanced features in Hive Query Language: RC File & XML data processing. (The SerDe, XML
processing, UDF and RCFile examples for this experiment are given in the closing sections of this manual.)

Experiment 15
Problem statement:
Implementation of word count using Spark RDDs.
val data=sc.textFile("sparkdata.txt")
data.collect;
val splitdata = data.flatMap(line => line.split(" "));
splitdata.collect;
val mapdata = splitdata.map(word => (word,1));
mapdata.collect;
val reducedata = mapdata.reduceByKey(_+_);
reducedata.collect;

(OR)

val textFile = sc.textFile("hdfs://...")


val counts = textFile.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Experiment 16
Problem statement:
Basic RDD operations in Spark
Login into spark environment
1. Open terminal in cloudera quickstart
2. Type spark-shell
Types of RDD creation
1. Using Parallelize
scala>val data = Array (1,2,3,4,5)
scala>val distdata = sc.parallelize(data)
2. External dataset
a. create one text file on desktop data.txt
b. create one directory in HDFS
hdfs dfs -mkdir /spark
c. Load the file from local to HDFS
hdfs dfs -put /home/cloudera/Desktop/data.txt /spark/data.txt
Basic RDD Transformations
1. To print elements of RDD
a. syntax: rdd.foreach(println)
Example: lines.foreach(println)
2. MAP
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
b, a, c
(b,1), (a,1), (c,1)
3. FILTER
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)

println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 3
4. FLATMAP
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 100, 42, 2, 200, 42, 3, 300, 42
5. GROUPBY
val x = sc.parallelize(
Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
output:
['John', 'Fred', 'Anna', 'James']
[('A',['Anna']),('J',['John','James']),('F',['Fred'])]
6. GROUPBYKEY
val x = sc.parallelize(
Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
[('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
[('A', [2, 3, 1]),('B',[5, 4])]
7. SAMPLE

val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
output:
[1, 2, 3, 4, 5]
[1, 3]
8. UNION
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
output:
[1, 2, 3]
[3, 4]
[[1], [2, 3], [3, 4]]
9. JOIN
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
output:
[("a", 1), ("b", 2)]
[("a", 3), ("a", 4), ("b", 5)]
[('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
10. DISTINCT
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))
output:
[1, 2, 3, 3, 4]

[1, 2, 3, 4]
Basic RDD ACTIONS
1. COLLECT
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)
output:
[[1], [2, 3]]
[1, 2, 3]
2. REDUCE
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
output:
[1, 2, 3, 4]
10
3. AGGREGATE
def seqOp = (data:(Array[Int], Int), item:Int) =>
(data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
(d1._1.union(d2._1), d1._2 + d2._2)

val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)
output:
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)

4. MAX
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
4
5. SUM
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
7
6. MEAN
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
2.3333333
7. STDEV
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]

1.2472191

Filter the log data using Spark RDDs.


from pyspark import SparkContext
words = sc.parallelize (
["python",
"java",
"hadoop",
"C"
]
)
words_map = words.map(lambda x: (x, 1))
mapping = words_map.collect()
print("Key value pair -> %s" % (mapping))

from pyspark import SparkContext


x = sc.parallelize([("pyspark", 1), ("hadoop", 3)])
y = sc.parallelize([("pyspark", 2), ("hadoop", 4)])
joined = x.join(y)
mapped = joined.collect()
print("Join RDD -> %s" % (mapped))

ASSIGNMENT 1: SPLIT

Objective: To learn about the SPLIT relational operator.
Problem Description: Write a Pig script to split customers for a reward program based on their life time values.

Customers  Life Time Value


Jack    25000
Smith   8000
David   35000
John    15000
Scott   10000
Joshi   28000
Ajay    12000
Vinay   30000
Joseph  21000

Input: Customers Life Time Value Jack 25000 Smith 8000 David 35000 John 15000 Scott 10000 Joshi 28000 Ajay 12000
Vinay 30000 Joseph 21000

If Life Time Value is >10000 and <=20000 ---> Silver Program.

If Life Time Value is >20000 ---> Gold Program.

ASSIGNMENT 2: GROUP

Objective: To learn about the GROUP relational operator.
Problem Description: Create a data file for the below schemas:

Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate


Customer: CustomerId, CustomerName, Address, City, State, Country
1. Load Order and Customer Data.
2. Write a Pig Latin Script to determine number of items bought by each customer.

ASSIGNMENT 3: COMPLEX DATA TYPE — BAG
Objective: To learn about the complex datatype — bag — in Pig.
Problem Description:

1. Create a file which contains bag dataset as shown below.

User ID   From                   To
user1001  user1001@sample.com    {(user003@sample.com),(user004@sample.com),
(user006@sample.com)}
user1002  user1002@sample.com    {(user005@sample.com), (user006@sample.com)}
user1003  user1003@sample.com    {(user001@sample.com),(user005@sample.com)}
2. Write a Pig Latin statement to display the names of all users who have sent emails and also a list of all the people that
they have sent the email to.

3. Store the result in a file.

ASSIGNMENT 1: HIVEQL

Objective: To learn about HiveQL statements.
Problem Description:

Create a data file for below schemas:


Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate
Customer: CustomerId, CustomerName, Address, City, State, Country
1. Create a table for Order and Customer Data.
2. Write a HiveQL to find number of items bought by each customer.

ASSIGNMENT 2: PARTITION
Objective: To learn about partitions in hive.
Problem Description: Create a partition table for customer schema to reward the customers based on their life time values.
Input:
Customer ID Customers Life Time Value
1001 Jack 25000
1002 Smith 8000
1003 David 12000
1004 John 15000
1005 Scott 12000
1006 Joshi 28000
1007 Ajay 12000
1008 Vinay 30000
1009 Joseph 21000

Create a partition table if life time value is 12000.


Create a partition table for all life time values.

SERDE
SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT statement).

Deserializer interface takes a binary representation or string of a record, converts it into a java object that Hive
can then manipulate. Serializer takes a java object that Hive has been working with and translates it into
something that Hive can write to HDFS.

Objective: To manipulate the XML data.


input:
<employee> <empid>1001</empid> <name>John</name> <designation>Team Lead</designation> </employee>
<employee> <empid>1002</empid> <name>Smith</name> <designation>Analyst</designation> </employee>

Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;

CREATE TABLE xpath_table AS


SELECT xpath_int(xmldata,'employee/empid'),
xpath_string(xmldata,'employee/name'),
xpath_string(xmldata,'employee/designation')
FROM xmlsample;
SELECT * FROM xpath_table;

USER-DEFINED FUNCTION (UDF)


In Hive, you can use custom functions by defining the User-Defined Function (UDF).
Objective: Write a Hive function to convert the values of a field to uppercase.
Act:
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
@Description(name = "SimpleUDFExample")
public final class MyUpperCase extends UDF {
public String evaluate(final String word)
{
return word.toUpperCase();
}
}
Note: Compile this Java program and package it into a JAR.
ADD JAR /root/hivedemos/UpperCase.jar;
CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
SELECT TOUPPERCASE(name) FROM STUDENT;


Outcome: hive> ADD JAR /root/hivedemos/UpperCase.jar;


Added [/root/hivedemos/UpperCase.jar] to class path
Added resources: [/root/hivedemos/upperCase.jar]
hive> CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
OK

RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store relational table on
computer clusters.

Objective: To work with RCFILE Format.

CREATE TABLE STUDENT_RC (rollno INT, name STRING, gpa FLOAT) STORED AS RCFILE;
INSERT OVERWRITE TABLE STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;
