
L. D. College of Engineering
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015

Laboratory Manual
Odd Term: 2022-23
Branch: Computer Engineering

Big Data Analytics (3170722)
Professional Elective-VI
Semester: VII

Prepared By:
Hetal A. Joshiara
Big Data Analytics (3170722) B.E. SEMESTER VII (ODD-2022)

Computer Engineering Department


L. D. College of Engineering Ahmedabad

VISION

• To achieve academic excellence in Computer Engineering by providing value based education.

MISSION

• To produce graduates according to the needs of industry, government, society and scientific community.
• To develop partnership with industries, research and development
organizations and government sectors for continuous improvement of
faculties and students.
• To motivate students for participating in reputed conferences,
workshops, seminars and technical events to make them technocrats and
entrepreneurs.
• To enhance the ability of students to address the real life issues by
applying technical expertise, human values and professional ethics.
• To inculcate habit of using free and open source software, latest
technology and soft skills so that they become competent professionals.
• To encourage faculty members to upgrade their skills and qualification
through training and higher studies at reputed universities.


Sr. No. | AIM | Date | CO | Page No. | Sign
1 | Make a single node cluster in Hadoop. | | CO-2 | |
2 | Run Word count program in Hadoop with 250 MB size of Data Set. | | CO-2 | |
3 | Understand the Logs generated by MapReduce program. | | CO-2 | |
4 | Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs. | | CO-2 | |
5 | Develop Map Reduce Application to Sort a given file and do aggregation on some parameters. | | CO-2 | |
6 | Download any two Big Data Sets from authenticated website. | | CO-2 | |
7 | Explore Spark and Implement a Word count application using Spark. | | CO-3 | |
8 | Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive. | | CO-5 | |
9 | Implementation of Matrix algorithms in Spark SQL programming. | | CO-3 | |
10 | Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis. | | CO-3 | |
11 | Explore NoSQL database like MongoDB and perform basic CRUD operation. | | CO-3 | |
12 | Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs. | | CO-1 | |


L. D. College of Engineering, Ahmedabad

Department of Computer Engineering

Rubrics for Practical

SEMESTER: BE-VII Academic Term: July-Nov 2022-23(ODD)

Subject: Big Data Analytics (3170722) Elective Subject

Faculty: Hetal A. Joshiara

Rubrics ID | Criteria | Marks | Good (2) | Satisfactory (1) | Need Improvement (0)
RB1 | Regularity | 05 | High (>70%) | Moderate (40-70%) | Poor (0-40%)
RB2 | Problem Analysis & Development of the Solution | 05 | Apt & full identification of the problem & complete solution for the problem | Limited identification of the problem / incomplete solution for the problem | Very less identification of the problem / very less solution for the problem
RB3 | Testing of the Solution | 05 | Correct solution as required | Partially correct solution for the problem | Very less correct solution for the problem
RB4 | Documentation | 05 | Documentation completed neatly | Not up to standard | Proper format not followed, incomplete

Sign of Faculty


Sr. No. | AIM | CO | Major Objective
1 | Make a single node cluster in Hadoop. | CO-2 | Student will be able to install a single node Hadoop cluster.
2 | Run Word count program in Hadoop with 250 MB size of Data Set. | CO-2 | Student will be able to write a MapReduce program in Java and experiment with varying sizes of datasets.
3 | Understand the Logs generated by MapReduce program. | CO-2 | Student will be able to understand the logs generated by the MapReduce program.
4 | Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs. | CO-2 | Student will be able to experiment with datasets of varying size and compare the logs.
5 | Develop Map Reduce Application to Sort a given file and do aggregation on some parameters. | CO-2 | Student will be able to develop a MapReduce application for sorting and understand aggregation functions.
6 | Download any two Big Data Sets from authenticated website. | CO-2 | Student will be able to explore authenticated datasets from websites for future use.
7 | Explore Spark and Implement a Word count application using Spark. | CO-3 | Student will be able to install Spark and perform a small practical.
8 | Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive. | CO-5 | Student will be able to install Hive and perform queries on joins.
9 | Implementation of Matrix algorithms in Spark SQL programming. | CO-3 | Student will be able to implement matrix algorithms in Spark.
10 | Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis. | CO-3 | Student will be able to analyze Covid-19 data.


11 | Explore NoSQL database like MongoDB and perform basic CRUD operation. | CO-3 | Student will be able to perform basic operations of MongoDB and install it.
12 | Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs. | CO-1 | Student will be able to explore new technology and participate in a group.

Subject Name: Big Data Analytics(3170722)


Term: 2022-23
Enrollment No.:
Name:

Practical No. | CO Achieved | RB1 (5) | RB2 (5) | RB3 (5) | RB4 (5) | Total (20)
1 CO-2

2 CO-2

3 CO-2

4 CO-2

5 CO-2

6 CO-2

7 CO-3

8 CO-5

9 CO-3

10 CO-3

11 CO-3

12 CO-1


Practical 1: Make a single node cluster in Hadoop.

Prerequisite:
1. JAVA-Java JDK (installed)
2. HADOOP-Hadoop package (Downloaded)

Step 1: Verify the Java installed

javac -version

Step 2: Extract Hadoop at C:\Hadoop


Step 3: Setting up the HADOOP_HOME variable

Use the Windows environment variable settings to set the HADOOP_HOME path.

Step 4: Set JAVA_HOME variable

Use the Windows environment variable settings to set the JAVA_HOME path.

Step 5: Set Hadoop and Java bin directory path


Step 6: Hadoop Configuration

For Hadoop configuration we need to complete the six steps listed below:
1. Edit core-site.xml
2. Edit mapred-site.xml
3. Edit hdfs-site.xml
4. Edit yarn-site.xml
5. Edit hadoop-env.cmd
6. Create two folders, datanode and namenode

Step 6.1: Core-site.xml configuration

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 6.2: Mapred-site.xml configuration

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 6.3: Hdfs-site.xml configuration

<configuration>
<property>
<name>dfs.replication</name>


<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>

Step 6.4: Yarn-site.xml configuration

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 6.5: Hadoop-env.cmd configuration

Set "JAVA_HOME=C:\Java" (On C:\java this is path to file jdk.18.0)


Step 6.6: Create datanode and namenode folders

1. Create folder "data" under "C:\Hadoop-2.8.0"


2. Create folder "datanode" under "C:\Hadoop-2.8.0\data"
3. Create folder "namenode" under "C:\Hadoop-2.8.0\data"

Step 7: Format the namenode folder

Open a command window (cmd) and type the command "hdfs namenode -format"


Step 8: Testing the setup

Open a command window (cmd) and type the command "start-all.cmd"

Step 8.1: Testing the setup:

Ensure that the NameNode, DataNode, and ResourceManager are running

Step 9: Open: http://localhost:8088

Step 10:

Open: http://localhost:50070
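
If both web pages load correctly, the installation can also be checked from the command line. A minimal sketch of such a check (the /test directory name is only an illustration) is:

jps
hdfs dfs -mkdir /test
hdfs dfs -ls /

jps should list the NameNode, DataNode, ResourceManager and NodeManager processes, and the new /test directory should appear in the listing.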


Practical 2: Run Word count program in Hadoop with 250 MB size of Data Set.

File: WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}


File: WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>


{
public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int sum=0;
while (values.hasNext()) {
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}

File: WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;


import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
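
A typical way to package and run this job is sketched below; the jar name wordcount.jar, the input file name dataset.txt and the HDFS paths are illustrative placeholders, not names fixed by this manual:

hadoop fs -mkdir /input
hadoop fs -put dataset.txt /input
hadoop jar wordcount.jar com.javatpoint.WC_Runner /input/dataset.txt /output
hadoop fs -cat /output/part-00000

With the old mapred API used above, the word counts are written to part-00000 inside the chosen output directory, which must not exist before the job runs.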


Practical 3: Understand the Logs generated by MapReduce program.

Log files are an essential troubleshooting tool during testing and production and contain important
runtime information about the general health of workload daemons and system services.

You can configure the system to be able to retrieve application logs that are written on all compute hosts
on the grid from one central location, through the cluster management console.

Log files and location


The MapReduce framework in IBM Spectrum Symphony provides its own set of log files, which are located under the $PMR_HOME/logs/ directory.


Logs & File Location

• The default directory for Hadoop log files is $HADOOP_HOME/logs.

Several parts of the log generated by a MapReduce program during execution are described below:

1. Task Progress related Logs:

• This log shows process-related details such as the number of processes, the number of splits, and the progress of the map and reduce tasks.

2. File system related Logs:

• This log contains file-system related information, for example the number of bytes read and written, the number of read operations, etc.


3. Job related Logs:

• This part of the log contains information about the job that is running, such as the number of launched map and reduce tasks, the total time spent by the map and reduce tasks, the total memory taken by the map and reduce tasks, etc.

4. Map-Reduce related logs:

• This part contains information about the Map-Reduce tasks, such as shuffled maps, failed maps, peak virtual and physical memory for the map and reduce tasks, CPU time spent, total elapsed time, etc.


5. Logs related to Errors:

 This part shows the logs related to errors that occurred during processing.

 Information regarding the job on the Cluster Manager:
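
Besides the cluster manager web UI, the aggregated logs of a finished job can usually be pulled from YARN on the command line; a minimal sketch (the application id below is a placeholder) is:

yarn application -list -appStates FINISHED
yarn logs -applicationId application_1234567890123_0001

The first command lists finished applications with their ids; the second prints the collected container logs for one of them.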


Practical 4: Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.

First Dataset: Wikipedia Article dataset


▪ Program: Word Count

▪ Dataset Size: 900MB

 Logs from Command Line:


 Logs from Cluster Manager:

 From the logs we can see that the dataset of around 920 MB takes about 2 minutes to run the word count job.


Second Dataset: Movie Reviews dataset


▪ Program: Word Count

▪ Dataset Size: 3.65GB

 Logs from Command Line:


• Logs from Cluster Manager:

 From the logs we can see that the 3.65 GB dataset takes about 5.5 minutes to run the word count job.

Conclusion:
Running the same MapReduce job on two datasets of different sizes, we can see that the time taken to run the job increases as the dataset gets bigger.


Practical 5: Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.

MapReduce should be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys), so that all the records that reach a single reducer are sorted. The diagram below shows the internals of how the shuffle phase works in MapReduce.

Given that MapReduce already performs sorting between the map and reduce phases, sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). This is in fact what the sort example that is bundled with Hadoop does. You can look at how the example code works by examining the org.apache.hadoop.examples.Sort class. To use this example code to sort text files in Hadoop, you would use it as follows:

shell$ export HADOOP_HOME=/usr/lib/hadoop
shell$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples.jar sort \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outFormat org.apache.hadoop.mapred.TextOutputFormat \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  /hdfs/path/to/input \
  /hdfs/path/to/output


The Hadoop example sort can be accomplished with the hadoop-utils sort as follows:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  /hdfs/path/to/input \
  /hdfs/path/to/output

To bring sorting in MapReduce closer to the Linux sort, the --key and --field-separator options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). For example, imagine you had a file in HDFS called /input/300names.txt which contained first and last names:

shell$ hadoop fs -cat 300names.txt | head -n 5
roy franklin
mario gardner
willis romero

To sort on the last name you would run:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  --key 2 \
  /input/300names.txt \
  /hdfs/path/to/output

The syntax of --key is pos1[,pos2], where the first position (pos1) is required and the second position (pos2) is optional; if it is omitted, then pos1 through the rest of the line is used for sorting.


Practical 6: Download any two Big Data Sets from authenticated website.

1. Yelp Dataset
Website: https://www.yelp.com/dataset

2. Kaggle
Website: https://www.kaggle.com/datasets


3. UCI Machine Learning Repository


Website: http://archive.ics.uci.edu/ml/index.php
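
Once a dataset has been chosen from one of these sites, it can be downloaded and staged on HDFS for the later practicals. A hedged sketch of this (the URL and file names are placeholders; use the actual download link provided by the site):

wget -O dataset.json.gz "https://example.com/path/to/dataset.json.gz"
gunzip dataset.json.gz
hadoop fs -mkdir /datasets
hadoop fs -put dataset.json /datasets/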


Practical 7: Explore Spark and Implement Word count application using Spark.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based
on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of
computations, which includes interactive queries and stream processing. The main feature of Spark is
its in-memory cluster computing that increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since Feb 2014.
Features of Apache Spark
Apache Spark has the following features.
 Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk. It stores the intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python. Therefore, you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
 Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop components.


There are three ways of Spark deployment as explained below.


 Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS(Hadoop Distributed File System) and space is allocated for HDFS, explicitly. Here, Spark
and MapReduce will run side by side to cover all spark jobs on cluster.
 Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn without any pre-
installation or root access required. It helps to integrate Spark into Hadoop ecosystem or Hadoop
stack. It allows other components to run on top of stack.
 Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch spark job in addition to
standalone deployment. With SIMR, user can start Spark and uses its shell without any
administrative access.

Components of Spark
The following illustration depicts the different components of Spark.

Apache Spark Core


Spark Core is the underlying general execution engine for spark platform that all other functionality is
built upon. It provides In-Memory computing and referencing datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It
ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on
those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that benefits from Spark's distributed memory-based architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).


Word Count Program:

package com.journaldev.sparkdemo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class WordCounter {

    private static void wordCount(String fileName) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> inputFile = sparkContext.textFile(fileName);
        // flatMap expects an Iterator in Spark 2.x, so convert the word list
        JavaRDD<String> wordsFromFile = inputFile.flatMap(content -> Arrays.asList(content.split(" ")).iterator());
        // pair each word with 1 and sum the counts per word
        JavaPairRDD<String, Integer> countData = wordsFromFile.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey((x, y) -> x + y);
        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No files provided.");
            System.exit(0);
        }
        wordCount(args[0]);
    }
}
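
Assuming the class above is packaged into a jar called spark-wordcount.jar (an illustrative name), it can be run locally with spark-submit as follows; the results are written to the CountData directory named in the code:

spark-submit --class com.journaldev.sparkdemo.WordCounter --master "local[2]" spark-wordcount.jar input.txt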


Practical 8: Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.

Create a folder on HDFS under /user/cloudera HDFS Path

javachain~hadoop]$ hadoop fs -mkdir javachain

Move the text file from local file system into newly created folder called javachain

javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/

Create Empty table STUDENT in HIVE

hive> create table student
    > ( std_id int,
    > std_name string,
    > std_grade string,
    > std_addres string)
    > partitioned by (country string)
    > row format delimited
    > fields terminated by ','
    > ;
OK
Time taken: 0.349 seconds

Load Data from HDFS path into HIVE TABLE.

hive> load data inpath 'javachain/student.txt' into table student partition(country='usa');
Loading data to table default.student partition (country=usa)
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/student/country=usa/student.txt': User does not belong to hive
Partition default.student{country=usa} stats: [numFiles=1, numRows=0, totalSize=120, rawDataSize=0]
OK
Time taken: 1.048 seconds


Select the values in the Hive table.

hive> select * from student;
OK
(The query returns the five loaded rows, with std_id 101 to 105, std_name values 'JAVACHAIN', 'ANTO', 'PRABU', 'KUMAR' and 'jack', their grades and addresses, and the partition column country=usa.)
Time taken: 0.553 seconds, Fetched: 5 row(s)
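
The join examples below use CUSTOMERS and ORDERS tables that are not created earlier in this practical; a minimal sketch of such tables (the schemas are assumptions chosen only to match the columns referenced in the queries) is:

hive> CREATE TABLE CUSTOMERS (ID INT, NAME STRING, AGE INT, ADDRESS STRING, SALARY FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE ORDERS (OID INT, `DATE` STRING, CUSTOMER_ID INT, AMOUNT FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH 'customers.txt' INTO TABLE CUSTOMERS;
hive> LOAD DATA LOCAL INPATH 'orders.txt' INTO TABLE ORDERS;

Since DATE is a reserved word in newer Hive versions, it is escaped with backticks in the sketch above and may also need escaping in the queries that follow.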

JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. A plain JOIN behaves like an inner JOIN in SQL. A JOIN condition is raised using the primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in
the right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still
returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or
NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c


LEFT OUTER JOIN ORDERS o


ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

RIGHT OUTER JOIN


The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns
a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or
NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that
fulfil the JOIN condition. The joined table contains either all the records from both the tables, or fills in
NULL values for missing matches on either side.


The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:


Practical 9: Implementation of Matrix algorithms in Spark SQL programming.

Matrix_multiply.py

from pyspark import SparkConf, SparkContext
import sys, operator

def add_tuples(a, b):
    # element-wise addition of two flattened result rows
    return list(sum(p) for p in zip(a, b))

def permutation(row):
    # outer product of the row with itself, flattened into one list
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation

def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    row = sc.textFile(input).map(lambda row: row.split(' ')).cache()
    ncol = len(row.take(1)[0])

    # summing the per-row outer products gives the product matrix
    intermediateResult = row.map(permutation).reduce(add_tuples)

    outputFile = open(output, 'w')
    # regroup the flat result list into rows of ncol elements
    result = [intermediateResult[x:x + ncol] for x in range(0, len(intermediateResult), ncol)]
    for row in result:
        for element in row:
            outputFile.write(str(element) + ' ')
        outputFile.write('\n')
    outputFile.close()

    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)

if __name__ == "__main__":
    main()


matrix_multiply_sparse.py

from pyspark import SparkConf, SparkContext
import sys, operator

from scipy import *
from scipy.sparse import csr_matrix

def createCSRMatrix(input):
    # parse one line of "col:value" pairs into a 1 x 100 sparse row vector
    row = []
    col = []
    data = []
    for values in input:
        value = values.split(':')
        row.append(0)
        col.append(int(value[0]))
        data.append(float(value[1]))
    return csr_matrix((data, (row, col)), shape=(1, 100))

def multiplyMatrix(csrMatrix):
    # outer product of the sparse row with itself
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)

def formatOutput(indexValuePairs):
    return ' '.join(map(lambda pair: str(pair[0]) + ':' + str(pair[1]), indexValuePairs))

def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    # summing the per-row outer products gives the sparse product matrix
    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)

    outputFile = open(output, 'w')
    for row in range(len(sparseMatrix.indptr) - 1):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()

if __name__ == "__main__":
    main()
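
Both scripts are launched with spark-submit; the HDFS paths below are placeholders, and matrix_multiply_sparse.py additionally needs SciPy available on the worker nodes:

spark-submit matrix_multiply.py /hdfs/path/to/input /hdfs/path/to/output
spark-submit matrix_multiply_sparse.py /hdfs/path/to/input /hdfs/path/to/output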


Practical 10: Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis.

Building a data pipeline for Covid-19 data analysis using Big Data technologies and Tableau

• The purpose is to collect real-time streaming data from the COVID-19 open API every 5 minutes into the ecosystem using NiFi, to process it, and to store it in the data lake on AWS.
• Data processing includes parsing the data from complex JSON format to CSV format, then publishing it to Kafka for persistent delivery of messages into PySpark for further processing.
• The processed data is then fed into an output Kafka topic which is in turn consumed by NiFi and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the whole process is orchestrated using Airflow to run at every time interval. Finally, the KPIs are visualized in Tableau.

Data Architecture


Tools used:

1. Nifi -nifi-1.10.0

2. Hadoop -hadoop_2.7.3

3. Hive-apache-hive-2.1.0

4. Spark-spark-2.4.5

5. Zookeeper-zookeeper-2.3.5

6. Kafka-kafka_2.11-2.4.0

7. Airflow-airflow-1.8.1

8. Tableau
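
A minimal sketch of the PySpark stage of this pipeline is shown below. It assumes Kafka is reachable at localhost:9092, the topic names (covid_raw, covid_processed) and checkpoint path are illustrative, and the Kafka connector is supplied at submit time, for example with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Covid19Pipeline").getOrCreate()

# read the raw COVID-19 records that NiFi publishes to Kafka
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid_raw")
       .load())

# Kafka delivers the payload as bytes; cast it to string before parsing/aggregating
records = raw.selectExpr("CAST(value AS STRING) AS value")

# (parsing of the CSV payload and KPI aggregation with Spark SQL would go here)

# write the processed stream to the output topic that NiFi consumes and lands in HDFS
query = (records.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid_processed")
         .option("checkpointLocation", "/tmp/covid_checkpoint")
         .start())

query.awaitTermination()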


Practical 11: Explore NoSQL database like MongoDB and perform basic CRUD operation.

MongoDB

MongoDB is an open-source document-oriented database that is designed to store a large scale of data and also allows you to work with that data very efficiently. It is categorized as a NoSQL (Not only SQL) database because the storage and retrieval of data in MongoDB are not in the form of tables.
The MongoDB database is developed and managed by MongoDB Inc. under the SSPL (Server Side Public License) and was initially released in February 2009. It also provides official driver support for all the popular languages like C, C++, C#, .Net, Go, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, Swift and Mongoid, so that you can create an application using any of these languages. Nowadays there are many companies that use MongoDB, like Facebook, Nokia, eBay, Adobe, Google, etc., to store their large amounts of data.
Features of MongoDB –

 Schema-less Database: It is a great feature provided by MongoDB. A schema-less database means one collection can hold different types of documents. In other words, in the MongoDB database a single collection can hold multiple documents, and these documents may consist of different numbers of fields, content, and size. It is not necessary that one document be similar to another document, as it is in relational databases. Due to this feature, MongoDB provides great flexibility to databases.
 Document Oriented: In MongoDB, all the data is stored in documents instead of tables as in an RDBMS. In these documents, the data is stored in fields (key-value pairs) instead of rows and columns, which makes the data much more flexible in comparison to an RDBMS. Each document contains its own unique object id.
 Indexing: In a MongoDB database, every field in the documents can be indexed with primary and secondary indices, which makes it easier and faster to get or search data from the pool of data. If the data is not indexed, then the database searches each document against the specified query, which takes a lot of time and is not efficient.
 Scalability: MongoDB provides horizontal scalability with the help of sharding. Sharding means distributing data on multiple servers: a large amount of data is partitioned into data chunks using the shard key, and these data chunks are evenly distributed across shards that reside on many physical servers. New machines can also be added to a running database.
 Replication: MongoDB provides high availability and redundancy with the help of replication; it creates multiple copies of the data and sends these copies to different servers so that if one server fails, the data can be retrieved from another server.
 Aggregation: It allows you to perform operations on grouped data and get a single or computed result. It is similar to the SQL GROUP BY clause. It provides three different ways to aggregate: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.


 High Performance: MongoDB offers very high performance and data persistence compared to other databases, due to its features like scalability, indexing, replication, etc.

Advantages of MongoDB :

 It is a schema-less NoSQL database. You do not need to design the schema of the database when you are working with MongoDB.
 It does not support join operations.
 It provides great flexibility to the fields in the documents.
 It can contain heterogeneous data.
 It provides high performance, availability, and scalability.
 It supports geospatial data efficiently.
 It is a document-oriented database and the data is stored in BSON documents.
 It also supports multi-document ACID transactions (starting from MongoDB 4.0).
 It is not vulnerable to SQL injection.
 It is easily integrated with Big Data Hadoop.

Disadvantages of MongoDB:

 It uses high memory for data storage.
 You are not allowed to store more than 16 MB of data in a single document.
 The nesting of data in BSON is also limited: you are not allowed to nest data more than 100 levels deep.

MongoDB CRUD operations

As we know, we can use MongoDB for various things like building an application (including web and mobile), analysis of data, or administration of a MongoDB database. In all these cases we need to interact with the MongoDB server to perform certain operations like entering new data into the application, updating data in the application, deleting data from the application, and reading the data of the application. MongoDB provides a set of basic but essential operations that help you to easily interact with the MongoDB server; these operations are known as CRUD operations.


Create Operations –

The create or insert operations are used to insert or add new documents to the collection. If a collection does not exist, then a new collection will be created in the database. You can perform create operations using the following methods provided by MongoDB:

Method | Description
db.collection.insertOne() | It is used to insert a single document in the collection.
db.collection.insertMany() | It is used to insert multiple documents in the collection.
db.createCollection() | It is used to create an empty collection.

Example 1: In this example, we are inserting details of a single student in the form of document in the
student collection using db.collection.insertOne()
method.
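
A minimal sketch of such an insert in the mongo shell (the collection name, field names and values are illustrative) is:

> db.student.insertOne({ name: "Sumit", age: 22, course: "BE Computer", year: 2022 })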


Example 2: In this example, we are inserting details of the multiple students in the form of documents
in the student collection using db.collection.insertMany() method.
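
A corresponding sketch for inserting several student documents at once (again with illustrative values):

> db.student.insertMany([
    { name: "Priya", age: 21, course: "BE Computer", year: 2022 },
    { name: "Rahul", age: 23, course: "BE IT", year: 2022 }
  ])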

Read Operations –

The Read operations are used to retrieve documents from the collection, or in other words, read
operations are used to query a collection for a document. You can perform read operation using the
following method provided by the MongoDB:

Method | Description
db.collection.find() | It is used to retrieve documents from the collection.


Example: In this example, we are retrieving the details of students from the student collection using the db.collection.find() method.
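
A minimal sketch of such read operations in the mongo shell (the filter value is illustrative):

> db.student.find()
> db.student.find({ name: "Sumit" })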

Update Operations –

The update operations are used to update or modify the existing document in the collection. You can
perform update operations using the following methods provided by the MongoDB:

Method | Description
db.collection.updateOne() | It is used to update a single document in the collection that satisfies the given criteria.
db.collection.updateMany() | It is used to update multiple documents in the collection that satisfy the given criteria.
db.collection.replaceOne() | It is used to replace a single document in the collection that satisfies the given criteria.


Example 1: In this example, we are updating the age of Sumit in the student collection using the db.collection.updateOne() method.
Example 2: In this example, we are updating the year of course in all the documents in the student collection using the db.collection.updateMany() method.
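
Minimal sketches of these two updates in the mongo shell, matching the illustrative documents inserted above:

> db.student.updateOne({ name: "Sumit" }, { $set: { age: 23 } })
> db.student.updateMany({}, { $set: { year: 2023 } })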


Delete Operations –
The delete operations are used to delete or remove documents from a collection. You can perform delete operations using the following methods provided by MongoDB:

Method | Description
db.collection.deleteOne() | It is used to delete a single document from the collection that satisfies the given criteria.
db.collection.deleteMany() | It is used to delete multiple documents from the collection that satisfy the given criteria.

Example 1: In this example, we are deleting a document from the student collection using
db.collection.deleteOne()method.


Example 2: In this example, we are deleting all the documents from the student collection using
db.collection.deleteMany() method.
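
Minimal sketches of these delete operations in the mongo shell:

> db.student.deleteOne({ name: "Sumit" })
> db.student.deleteMany({})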


Practical 12: Case study based on the concept of Big Data Analytics.

Figures 1.1 to 1.36: Case study presentation slides (submitted PPT).