Big Data Analytics (BDA) Lab Manual
Regulation – R22
IV B.Tech I Semester
DEPARTMENT OF
INFORMATION TECHNOLOGY & COMPUTER APPLICATIONS
Vision:
The vision of the Big Data Analytics Lab is to empower students with advanced skills and
expertise in big data technologies, preparing them to become future-ready professionals in the rapidly
evolving field of data analytics. By providing hands-on experience with industry-leading tools such as
Hadoop, MapReduce, Hive, PIG, and Spark, the lab aims to cultivate a deep understanding of data
management, processing, and analysis. Through practical exercises and real-world applications, students
are encouraged to explore innovative solutions to complex data challenges, fostering a mindset of
continuous learning and adaptation to emerging trends in big data analytics. The lab seeks to inspire
students to become leaders and innovators in leveraging data-driven insights for impactful decision-
making and problem-solving across diverse industries and domains.
Mission:
Equip students with comprehensive knowledge and practical skills in big data analytics, focusing on
handling and analyzing large datasets efficiently.
Foster a deep understanding of various big data tools and technologies such as Hadoop, MapReduce,
Hive, PIG, and Spark through hands-on experiences and real-world applications.
Empower students to apply advanced data analytics techniques to solve complex problems and make
informed decisions across different industry sectors.
Prepare students to meet the growing demand for skilled data analysts and professionals capable of
harnessing the power of big data to drive innovation and business success.
PO3: Proficient in successfully designing innovative solutions for solving real life business
problems and addressing business development issues with a passion for quality, competency and
holistic approach.
PO4: Perform professionally with social, cultural and ethical responsibility as an individual as well
as in multifaceted teams with a positive attitude.
PO5: Capable of adapting to new technologies and constantly upgrade their skills with an attitude
towards independent and lifelong learning.
PREFACE
Prerequisite: Data Mining
About the LAB:
The Big Data Analytics Lab is designed to equip students with practical skills and in-depth knowledge
required to handle and analyze massive datasets using contemporary big data tools and technologies. This lab
focuses on the critical aspects of big data, ranging from understanding different types of digital data to
implementing and managing big data frameworks like Hadoop, YARN, and MapReduce. Through hands-on
exercises, students will gain the ability to process and analyze structured, semi-structured, and unstructured
data, which are pivotal in various real-world applications such as social media analysis, sensor data processing,
and customer review analytics.
In the initial sessions, students will explore fundamental concepts such as the types of digital data, their
real-time applications, and the significant challenges associated with big data environments. By distinguishing
between structured, unstructured, and semi-structured data, students will understand how to categorize and
approach different data sets. Moreover, the lab will delve into real-time challenges like data volume, velocity,
variety, and veracity, providing students with strategies to address these issues effectively. The role of a data
analyst in the decision-making process and hardware support for processing huge amounts of data will also be
discussed, laying a strong foundation for the practical sessions.
As the lab progresses, emphasis will be placed on the practical aspects of big data analytics, including
the installation and configuration of Hadoop and its associated components. Students will learn to set up a
Hadoop environment, manage HDFS, and execute basic Hadoop commands, which are essential for data
storage and processing. The lab also covers the installation and management of YARN, which is crucial for
resource allocation and job scheduling in a Hadoop cluster. Furthermore, students will explore the advantages
and drawbacks of MapReduce, the modules of MapReduce, and the installation and configuration of
MapReduce. By mastering these technologies, students will become proficient in managing and processing
large datasets efficiently.
In addition to Hadoop and MapReduce, the lab will cover the installation and utilization of Hive, PIG,
and Spark. Students will learn how to install these tools, understand their running and execution modes, and
explore their real-time applications. Hive will be explored for its use in data warehousing and SQL-like query
execution. PIG will be examined for its data flow scripting capabilities, and Spark will be discussed for its
powerful in-memory data processing capabilities. By the end of the lab, students will be well-versed in the
complete ecosystem of big data tools, prepared to tackle complex data analytics tasks in both academic and
professional settings.
Relevance to industry:
The skills and knowledge gained in the Big Data Analytics Lab are highly relevant to the industry, as
they address the critical need for professionals capable of managing and analyzing vast amounts of data. In
today's data-driven world, businesses across various sectors, including finance, healthcare, retail, and
technology, rely on big data analytics to drive decision-making, optimize operations, and gain competitive
advantages. Proficiency in tools like Hadoop, MapReduce, Hive, PIG, and Spark, as well as a deep
understanding of data processing challenges and solutions, prepares students to meet the demands of the job
market. These skills enable them to efficiently handle big data projects, contribute to strategic initiatives, and
innovate in fields where data analysis is pivotal.
Latest Technologies:
1. Hadoop
2. MapReduce
3. Hive, PIG, Spark
INDEX
Syllabus
Text Book:
1. Seema Acharya and Subhashini Chellappan, Big Data Analytics, 2nd ed., Wiley, 2020.
Reference Books:
Course Outcomes: After completion of this course, a student will be able to:
Skills:
Build and maintain reliable, scalable, distributed systems with Apache Hadoop.
Develop Map-Reduce based Applications for Big Data.
Design and build Big Data applications using Hive and Pig.
Learn tips and tricks for Big Data use cases and solutions.
List of Experiments:
Installation of Hadoop
Install and run Hadoop in standalone mode, pseudo-distributed mode, and fully distributed cluster mode.
Standalone mode.
Step-2: Download a Hadoop release archive (for example, hadoop-2.7.3.tar.gz) from the Apache Hadoop downloads page and extract it.
$ tar -xzvf hadoop-2.7.3.tar.gz
# Change the version number if needed to match the Hadoop version you have downloaded.
Step-3: Now move the extracted directory to /usr/local, a location suitable for local installs.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop
Step-5: The hadoop command in the bin folder is used to run jobs in Hadoop.
$ bin/hadoop
Step-6: The jar subcommand is used to run MapReduce jobs, packaged as JAR files, on the Hadoop cluster.
$ bin/hadoop jar
Step-7: Now we will run an example MapReduce job to verify that the standalone installation works.
Create an input directory and copy the Hadoop configuration files into it; these configuration and script files that ship with Hadoop will serve as the text input for our MapReduce example.
$ mkdir input
$ cp etc/hadoop/* input/
The job is run using the bin/hadoop command. The jar subcommand indicates that the MapReduce program is packaged in a Java archive; here we use the hadoop-mapreduce-examples JAR file that comes with the installation. The JAR name differs based on the version you are installing. Now move to your Hadoop install directory and type:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
Run with no further arguments, this simply lists the example programs contained in the JAR. If you are not in the correct directory, you will get an error saying "Not a valid JAR" as shown below. If this issue persists, check whether the location of the JAR file is correct for your system.
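For instance, the grep example from the Apache Hadoop single-node setup guide can be run on the input directory created above to confirm that the standalone installation works end to end:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*
The second command prints the matched patterns from the local output directory created by the job.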
Experiment 01
Problem statement:
HDFS basic command-line file operations.
1. To check the Hadoop version:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ hadoop version
2. To check the Java compiler version:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$javac -version
4. To format the namenode:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ hdfs namenode -format
5. To start all Hadoop daemons:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ start-all.sh
6. To list the running Hadoop daemons:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ jps
15. To count the number of directories, files, and bytes under a given path:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ hadoop fs -count <path>
Output: 1 1 60 (directory count, file count, content size in bytes)
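The other operations in this experiment use the standard HDFS shell. A minimal set of the usual file commands is sketched below; the directory and file names are only placeholders:
$ hadoop fs -mkdir /user/vignan/demo                          # create a directory in HDFS
$ hadoop fs -put localfile.txt /user/vignan/demo              # copy a local file into HDFS
$ hadoop fs -ls /user/vignan/demo                             # list the contents of a directory
$ hadoop fs -cat /user/vignan/demo/localfile.txt              # display the contents of a file
$ hadoop fs -get /user/vignan/demo/localfile.txt copy.txt     # copy a file back to the local file system
$ hadoop fs -rm /user/vignan/demo/localfile.txt               # delete a file
$ hadoop fs -count /user/vignan/demo                          # count directories, files, and bytes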
Experiment 02
Problem statement:
HDFS monitoring User Interface.
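With the daemons running, HDFS can also be monitored from the browser. On a typical Hadoop 2.x installation the NameNode web UI is served at http://localhost:50070 (use the NameNode host name on a cluster); it shows the cluster summary, DataNode status, and a file-system browser under Utilities > Browse the file system. The exact port can differ if it has been changed in hdfs-site.xml.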
Experiment 03
Problem statement: Wordcount Map Reduce program using standalone Hadoop
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount
{
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable>
{
// Reducer: sums the counts for each word
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
// Driver: configures and submits the job
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Steps:
1. Create one folder on the Desktop named "WordCountTutorial":
a. Paste the WordCount.java file into it.
b. Create a folder named "Input_Data" and inside it an Input.txt file (enter some words).
c. Create a folder named "tutorial_classes".
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /WordCountTutorial
5. hadoop fs -mkdir /WordCountTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/WordCountTutorial/Input_Data/Input.txt
/WordCountTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/WordCountTutorial'
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:~/Desktop/WordCountTutorial$
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/WordCountTutorial/tutorial_classes
/home/hadoop/Desktop/WordCountTutorial/WordCount.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/WordCountTutorial/firstTutorial.jar WordCount
/WordCountTutorial/input /WordCountTutorial/output
11. See the output
hadoop fs -cat /WordCountTutorial/output/*
Output:
Experiment 04
Problem statement:
Implementation of word count with combiner Map Reduce program
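The word count with a combiner reuses the WordCount program from Experiment 03; the combiner is enabled in the driver by registering the reducer class as the combiner, for example:
// In WordCount.main(), after setting the mapper and reducer classes:
job.setCombinerClass(IntSumReducer.class);   // runs the reducer logic on each mapper's local output before the shuffle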
Output:
Experiment 05
Problem statement:
Practice on Map Reduce monitoring User Interface
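On a Hadoop 2.x (YARN) setup, running and completed MapReduce jobs can be monitored from the ResourceManager web UI, served at http://localhost:8088 by default; finished jobs are also visible in the JobHistory server UI at http://localhost:19888 when that service is running. Both pages show per-job progress, map/reduce task counters, and logs.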
Experiment 06
Problem statement:
Implementation of Sort operation using MapReduce
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Sort{
public static class SortMapper extends Mapper<LongWritable,Text,Text,Text>{
Text comp_key = new Text();
protected void map(LongWritable key, Text value, Context context) throws IOException,InterruptedException{
String[] token = value.toString().split(",");
// Composite sort key built from the first two fields of each record
comp_key.set(token[0] + "-" + token[1]);
context.write(comp_key,new Text(token[0]+"-"+token[1]+"-"+token[2]));
}
}
public static class SortReducer extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}
public static void main(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true)?0:1);
}
}
Steps:
1. Create one folder on the Desktop named "SortTutorial":
a. Paste the Sort.java file into it.
b. Create a folder named "Input_Data" and inside it an Input.txt file (enter a few comma-separated records, one per line).
c. Create a folder named "tutorial_classes".
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /SortTutorial
5. hadoop fs -mkdir /SortTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/SortTutorial/Input_Data/Input.txt
/SortTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/SortTutorial'
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/SortTutorial/tutorial_classes
/home/hadoop/Desktop/SortTutorial/Sort.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/SortTutorial/firstTutorial.jar Sort
/SortTutorial/input /SortTutorial/output
11. See the output
hadoop fs -cat /SortTutorial/output/*
Output:
Experiment 07
Problem statement:
MapReduce program to count the occurrence of similar words in a file by using partitioner
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
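The listing above relies on Hadoop's default HashPartitioner to distribute words among reducers. If an explicit partitioner is required for this experiment, it can be added inside the WordCount class along the following lines (a sketch in the same old mapred API already imported above; the split rule used here, the first letter of the word, is only an illustration):
public static class WordPartitioner implements Partitioner<Text, IntWritable> {
public void configure(JobConf job) { }
public int getPartition(Text key, IntWritable value, int numPartitions) {
// send words starting with a-m to one reducer and the rest to another
String w = key.toString();
if (w.isEmpty()) return 0;
char first = Character.toLowerCase(w.charAt(0));
return (first <= 'm' ? 0 : 1) % numPartitions;
}
}
It would be registered in main() before running the job, for example:
conf.setPartitionerClass(WordPartitioner.class);
conf.setNumReduceTasks(2);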
Output:
Experiment 08
Problem statement:
Design a MapReduce solution to find the years whose average sales is greater than 30. Each input record contains the year, the sales for all twelve months, and the average sales as the last field. The sample input file is:
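A hypothetical input in the expected layout (fields separated by tab characters: year, the twelve monthly sales figures, and the yearly average as the last field; only the last field is used by the mapper):
2018	21	23	25	27	26	28	24	22	25	27	29	26	25
2019	31	33	35	32	36	34	33	35	34	36	32	31	34
2020	40	42	38	41	39	43	44	40	42	41	39	38	41
With this input, only the years 2019 and 2020 would appear in the output, since their averages exceed 30.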
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("average_sales");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Save the above program into ProcessUnits.java. The compilation and execution of the program is given below.
Compilation and Execution of ProcessUnits Program
Let us assume we are in the home directory of Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units
Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Download the jar from mvnrepository.com. Let us assume the download folder is /home/hadoop/.
Step 3 − The following commands are used to compile the ProcessUnits.java program and to create a jar for the
program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4 − The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5 − The following command is used to copy the input file named sample.txt in the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6 − The following command is used to verify the files in the input directory
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7 − The following command is used to run the ProcessUnits application by taking input files from the input
directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Output:
Experiment 09
Problem statement:
MapReduce program to find the department-wise total salary. The sample input file is as follows:
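A hypothetical sample of the comma-separated input this program expects (employee id, department, salary), since the mapper keys on the second field and sums the third:
101,CSE,45000
102,IT,50000
103,CSE,40000
104,ECE,35000
105,IT,42000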
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Salary
{
public static class SalaryMapper extends Mapper <LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] token = value.toString().split(",");
int s = Integer.parseInt(token[2]);
IntWritable sal = new IntWritable();
sal.set(s);
context.write(new Text(token[1]),sal);
}
}
public static class SalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key,result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary");
job.setJarByClass(Salary.class);
job.setMapperClass(SalaryMapper.class);
job.setReducerClass(SalaryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
Experiment 10
Problem statement:
Install and Run Pig then write Pig Latin scripts to sort, group, join, project and filter the data.
Apache Pig provides a high-level procedural language, Pig Latin, for querying large data sets using Hadoop and the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. These operators are the main tools Pig Latin provides to operate on the data; they allow you to transform it by sorting, grouping, joining, projecting, and filtering. Let's create two files to run the commands:
We have two files named 'first' and 'second'. The first file contains three fields: user, url and id.
The second file contains two fields: url and rating. These two files are CSV files.
The Apache Pig operators can be classified as: Relational and Diagnostic.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data. They allow you to transform the data by sorting, grouping, joining, projecting and filtering. This section covers the basic relational operators.
LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
In this example, the Load operator loads data from file ‘first’ to form relation ‘loading1’.
The field names are user, url, id.
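The corresponding statements can be sketched as follows (the field types, the comma delimiter, and the rating threshold are assumptions; the field names follow the description above):
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);
-- FILTER: keep only the highly rated pages
good = FILTER loading2 BY rating > 3;
-- GROUP: group the first relation by user
grpd = GROUP loading1 BY user;
-- JOIN: join the two relations on url
jnd = JOIN loading1 BY url, loading2 BY url;
-- FOREACH ... GENERATE (projection): keep only user and url
proj = FOREACH loading1 GENERATE user, url;
-- ORDER (sort): sort the joined data by rating
srtd = ORDER jnd BY loading2::rating DESC;
DUMP srtd;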
Experiment 11
Problem statement:
Implementation of Word count using Pig.
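A minimal word-count script in Pig Latin, assuming a plain text input file already in HDFS (the path is illustrative):
lines = LOAD '/WordCountTutorial/input/Input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
wcount = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP wcount;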
Experiment 12
Problem statement:
Creation of Database using hive
a. To create a database.
b. To describe a database.
c. To drop a database.
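The corresponding HiveQL commands are, for example (the database name studentdb is only illustrative):
CREATE DATABASE IF NOT EXISTS studentdb;
DESCRIBE DATABASE studentdb;
DROP DATABASE IF EXISTS studentdb CASCADE;
The STUDENT table used in the later experiments is then created inside this database: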
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';
Experiment 13
Problem statement:
Creation of partitions and buckets using Hive.
Partitioning is of two types:
STATIC PARTITION: It is up to the user to specify the partition (the segregation unit) into which the data from the file is to be loaded.
DYNAMIC PARTITION: The user simply states the column on which partitioning is to take place; Hive then creates the partitions based on the unique values in that column.
Partitions split the larger dataset into more meaningful chunks.
Hive provides two kinds of partitions: Static Partition and Dynamic Partition.
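The tables used below can be created along the following lines (a sketch; it assumes a STUDENT source table with rollno, name, and grade columns, and the three buckets on grade match the TABLESAMPLE query shown later):
a. To create a partitioned table and load one static partition:
CREATE TABLE STUDENT_PART (rollno INT, name STRING) PARTITIONED BY (grade STRING);
INSERT OVERWRITE TABLE STUDENT_PART PARTITION (grade='A')
SELECT rollno, name FROM STUDENT WHERE grade='A';
b. To create a bucketed table with three buckets on grade and load it:
SET hive.enforce.bucketing = true;
CREATE TABLE STUDENT_BUCKET (rollno INT, name STRING, grade STRING)
CLUSTERED BY (grade) INTO 3 BUCKETS;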
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno,name,grade;
c. To display the content of the first bucket.
SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);
Experiment 14
Problem statement:
Practice of advanced features in Hive Query Language: RC File & XML data processing.
Experiment 15
Problem statement:
Implementation of word count using Spark RDDs.
val data=sc.textFile("sparkdata.txt")
data.collect;
val splitdata = data.flatMap(line => line.split(" "));
splitdata.collect;
val mapdata = splitdata.map(word => (word,1));
mapdata.collect;
val reducedata = mapdata.reduceByKey(_+_);
reducedata.collect;
(OR)
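One alternative is to chain the same steps into a single expression, for example:
val counts = sc.textFile("sparkdata.txt").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect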
Experiment 16
Problem statement:
Basic RDD operations in Spark
Log in to the Spark environment
1. Open a terminal in the Cloudera QuickStart VM
2. Type spark-shell
Types of RDD creation
1. Using Parallelize
scala>val data = Array (1,2,3,4,5)
scala>val distdata = sc.parallelize(data)
2. External dataset
a. create one text file on desktop data.txt
b. create one directory in HDFS
hdfs dfs -mkdir /spark
c. Load the file from local to HDFS
hdfs dfs -put /home/cloudera/Desktop/data.txt /spark/data.txt
Basic RDD Transformations
1. To print elements of RDD
a. syntax: rdd.foreach(println)
Example: lines.foreach(println)
2. MAP
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
b, a, c
(b,1), (a,1), (c,1)
3. FILTER
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 3
4. FLATMAP
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 100, 42, 2, 200, 42, 3, 300, 42
5. GROUPBY
val x = sc.parallelize(
Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
output:
['John', 'Fred', 'Anna', 'James']
[('A',['Anna']),('J',['John','James']),('F',['Fred'])]
6. GROUPBYKEY
val x = sc.parallelize(
Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
[('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
[('A', [2, 3, 1]), ('B', [5, 4])]
7. SAMPLE
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
output:
[1, 2, 3, 4, 5]
[1, 3]
8. UNION
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
output:
[1, 2, 3]
[3, 4]
[[1], [2, 3], [3, 4]]
9. JOIN
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
output:
[("a", 1), ("b", 2)]
[("a", 3), ("a", 4), ("b", 5)]
[('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
10. DISTINCT
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))
output:
[1, 2, 3, 3, 4]
[1, 2, 3, 4]
Basic RDD ACTIONS
1. COLLECT
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)
output:
[[1], [2, 3]]
[1, 2, 3]
2. REDUCE
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
output:
[1, 2, 3, 4]
10
3. AGGREGATE
def seqOp = (data:(Array[Int], Int), item:Int) =>
(data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
(d1._1.union(d2._1), d1._2 + d2._2)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)
output:
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)
4. MAX
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
4
5. SUM
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
7
6. MEAN
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
2.3333333
7. STDEV
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
1.2472191
ASSIGNMENT 1: SPLIT
Objective: To learn about the SPLIT relational operator.
Problem Description: Write a Pig script to split customers for a reward program based on their life time values.
Input (Customer, Life Time Value):
Jack 25000
Smith 8000
David 35000
John 15000
Scott 10000
Joshi 28000
Ajay 12000
Vinay 30000
Joseph 21000
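A sketch of a possible solution (the file name, space delimiter, and the reward threshold of 20000 are assumptions):
customers = LOAD 'customers.txt' USING PigStorage(' ') AS (name:chararray, ltv:int);
SPLIT customers INTO reward IF ltv >= 20000, normal IF ltv < 20000;
DUMP reward;
DUMP normal;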
ASSIGNMENT 2: GROUP
Objective: To learn about the GROUP relational operator.
Problem Description: Create a data file for the below schemas:
ASSIGNMENT 3: COMPLEX DATA TYPE — BAG
Objective: To learn the complex data type — bag — in Pig.
Problem Description:
User ID     From                   To
user1001    user1001@sample.com    {(user003@sample.com), (user004@sample.com), (user006@sample.com)}
user1002    user1002@sample.com    {(user005@sample.com), (user006@sample.com)}
user1003    user1003@sample.com    {(user001@sample.com), (user005@sample.com)}
2. Write a Pig Latin statement to display the names of all users who have sent emails and also a list of all the people that
they have sent the email to.
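A sketch, assuming the data is stored tab-separated with the To field as a Pig bag (the file name and schema spelling are assumptions):
emails = LOAD 'emails.txt' AS (userid:chararray, sender:chararray, to:bag{t:(addr:chararray)});
result = FOREACH emails GENERATE userid, to;
DUMP result;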
ASSIGNMENT 1: HIVEQL
ASSIGNMENT 2: PARTITION
Objective: To learn about partitions in hive.
Problem Description: Create a partition table for customer schema to reward the customers based on their life time values.
Input:
Customer ID    Customer    Life Time Value
1001 Jack 25000
1002 Smith 8000
1003 David 12000
1004 John 15000
1005 Scott 12000
1006 Joshi 28000
1007 Ajay 12000
1008 Vinay 30000
1009 Joseph 21000
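One possible way to set this up (the table and column names, the comma delimiter, and the use of a dynamic partition on a derived reward tier are assumptions):
CREATE TABLE CUSTOMER (custid INT, name STRING, ltv INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE CUSTOMER_REWARD (custid INT, name STRING, ltv INT)
PARTITIONED BY (tier STRING);
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE CUSTOMER_REWARD PARTITION (tier)
SELECT custid, name, ltv,
CASE WHEN ltv >= 20000 THEN 'gold' ELSE 'silver' END AS tier
FROM CUSTOMER;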
SERDE
SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT statement).
The Deserializer interface takes a binary or string representation of a record and converts it into a Java object that Hive can manipulate. The Serializer takes a Java object that Hive has been working with and translates it into something that Hive can write to HDFS.
Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
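Once the raw XML rows are loaded, individual elements can be pulled out with Hive's built-in xpath UDFs; for example (the element names rec/name and rec/rollno are assumptions about the structure of input.xml):
SELECT xpath_string(xmldata, 'rec/name'), xpath_int(xmldata, 'rec/rollno') FROM XMLSAMPLE;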
RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.
CREATE TABLE STUDENT_RC (rollno INT, name STRING, gpa FLOAT) STORED AS RCFILE;
INSERT OVERWRITE TABLE STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;