2022-23 BDA LAB Manual
College of Engineering
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015
Laboratory Manual, Odd Term: 2022-23
Branch: Computer Engineering
Semester: VII
Prepared By: Hetal A. Joshiara
Big Data Analytics (3170722) B.E. SEMESTER VII (ODD-2022)
Sr. No. | AIM | Page No. | Date | CO | Sign
Sign of Faculty
Sr. No. | AIM | CO | Major Objective
1 | Make a single-node cluster in Hadoop. | CO-2 | Student will be able to install a single-node Hadoop cluster.
2 | Run the Word Count program in Hadoop with a 250 MB data set. | CO-2 | Student will be able to write a MapReduce program in Java and experiment with varying sizes of datasets.
3 | Understand the logs generated by a MapReduce program. | CO-2 | Student will be able to understand the logs generated by the MapReduce program.
4 | Run two different data sets / different sizes of datasets on Hadoop and compare the logs. | CO-2 | Student will be able to experiment with datasets of varying size and compare the logs.
5 | Develop a MapReduce application to sort a given file and do aggregation on some parameters. | CO-2 | Student will be able to develop a MapReduce application for sorting and understand aggregation functions.
6 | Download any two big data sets from an authenticated website. | CO-2 | Student will be able to explore authenticated datasets from websites for future use.
7 | Explore Spark and implement a word count application using Spark. | CO-3 | Student will be able to install Spark and perform a small practical.
8 | Create HDFS tables, load them in Hive, and learn joining of tables in Hive. | CO-5 | Student will be able to install Hive and perform queries on joins.
9 | Implementation of matrix algorithms in Spark SQL programming. | CO-3 | Student will be able to implement a matrix algorithm in Spark.
10 | Create a data pipeline based on messaging using PySpark and Hive - Covid-19 analysis. | CO-3 | Student will be able to analyze Covid-19 data.
11 | Explore a NoSQL database like MongoDB and perform basic CRUD operations. | CO-3 | Student will be able to install MongoDB and perform its basic operations.
12 | Case study based on the concept of Big Data Analytics; prepare a presentation in the group. | CO-1 | Student will be able to explore new technology and participate in a group.
Prerequisite:
1. JAVA - Java JDK (installed)
2. HADOOP - Hadoop package (downloaded)
Check the Java installation with:
javac -version
For Hadoop configuration we need to modify the five files listed below and create two folders:
1. core-site.xml
2. mapred-site.xml
3. hdfs-site.xml
4. yarn-site.xml
5. hadoop-env.cmd
6. Create two folders, datanode and namenode
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 10:
Open http://localhost:50070 in a browser to check the NameNode web UI.
File: WC_Mapper.java
package com.javatpoint;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
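    // Method body as in the standard javatpoint word count example this class is taken
    // from: tokenize each input line and emit a (word, 1) pair for every token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}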
File: WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
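// Reducer class as in the standard javatpoint word count example the imports above
// belong to: it sums the counts emitted by the mapper for each word.
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}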
File: WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
Log files are an essential troubleshooting tool during testing and production and contain important
runtime information about the general health of workload daemons and system services.
You can configure the system to retrieve application logs written on all compute hosts on the grid from
one central location, through the cluster management console.
From the logs we can see that the 920 MB dataset takes around 2 minutes to run the word count job.
From the logs we can see that the 3.65 GB dataset takes around 5.5 minutes to run the word count job.
Conclusion:
For two datasets of different sizes running the same MapReduce job, we can say that the time taken to run the job increases as the dataset gets bigger.
MapReduce can be thought of as a ubiquitous sorting tool, since by design it sorts all the map output
records (using the map output keys), so that all the records that reach a single reducer are sorted. The
diagram below shows the internals of how the shuffle phase works in MapReduce.
Given that MapReduce already performs sorting between the map and reduce phases, sorting files
can be accomplished with an identity function (one where the inputs to the map and reduce phases are
emitted directly). This is in fact what the sort example that is bundled with Hadoop does. You can look at
how the example code works by examining the org.apache.hadoop.examples.Sort class. To use this
example code to sort text files in Hadoop, you would use it as follows:
The same sort as the Hadoop example can be accomplished with the hadoop-utils sort as follows:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  /hdfs/path/to/input \
  /hdfs/path/to/output

To bring sorting in MapReduce closer to the Linux sort, the --key and --field-separator options can be used
to specify one or more columns that should be used for sorting, as well as a custom separator
(whitespace is the default). For example, imagine you had a file in HDFS called /input/300names.txt which
contained first and last names:

shell$ hadoop fs -cat 300names.txt | head -n 5
roy franklin
mario gardner
willis romero

To sort on the last name you would run:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  --key 2 \
  /input/300names.txt \
  /hdfs/path/to/output

The syntax of --key is pos1[,pos2], where the first position (pos1) is required, and the second position
(pos2) is optional - if it's omitted then pos1 through the rest of the line is used for sorting.
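The aggregation part of this experiment can be handled in the same MapReduce style. The sketch below uses Hadoop Streaming with a Python mapper and reducer to sum a numeric field grouped by a key; the record layout (comma-separated fields), the column positions, and the script names are illustrative assumptions, not part of the original exercise.

# mapper.py - emit "key<TAB>value" pairs; here column 0 is the grouping key and
# column 2 the numeric value to aggregate (illustrative positions).
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) > 2:
        print('%s\t%s' % (fields[0], fields[2]))

# reducer.py - Hadoop Streaming delivers the mapper output sorted by key, so a
# running total per key is enough to compute the aggregate (a sum in this sketch).
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print('%s\t%s' % (current_key, total))
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print('%s\t%s' % (current_key, total))

The two scripts are submitted with the hadoop-streaming jar shipped with Hadoop, passing them as the -mapper and -reducer options along with the -input and -output paths.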
1. Yelp Dataset
Website: https://www.yelp.com/dataset
2. Kaggle
Website: https://www.kaggle.com/datasets
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based
on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of
computations, which include interactive queries and stream processing. The main feature of Spark is
its in-memory cluster computing that increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms,
interactive queries and streaming. Apart from supporting all these workloads in a respective system, it
reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It
was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013,
and Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark
Apache Spark has the following features.
Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible by reducing the number of read/write
operations to disk, as it stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python, so you can
write applications in different languages. Spark also comes with 80 high-level operators for
interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL
queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark
The following illustration depicts the different components of Spark.
package com.journaldev.sparkdemo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class WordCounter {

    private static void wordCount(String fileName) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("WordCounter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> inputFile = sparkContext.textFile(fileName);
        // split lines into words, pair each word with 1, then sum the counts per word
        JavaRDD<String> words = inputFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> countData =
                words.mapToPair(word -> new Tuple2<>(word, 1)).reduceByKey((x, y) -> x + y);
        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.exit(0);   // no input file supplied
        }
        wordCount(args[0]);
    }
}
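The same word count can also be written in a few lines of PySpark, which the later experiments in this manual use; this is only a sketch for comparison, and the input and output paths are illustrative.

# pyspark_wordcount.py - RDD-based word count, equivalent to the Java program above.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("WordCount"))
lines = sc.textFile("hdfs:///input/sample.txt")           # illustrative input path
counts = (lines.flatMap(lambda line: line.split())        # split each line into words
               .map(lambda word: (word, 1))               # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))          # sum the counts per word
counts.saveAsTextFile("hdfs:///output/wordcount")         # illustrative output path
sc.stop()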
Move the text file from the local file system into the newly created HDFS folder called javachain.
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN behaves the same as
INNER JOIN in SQL. A JOIN condition is specified using the primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
Matrix_multiply.py
def add_tuples(a, b):
def permutation(row):
    rowPermutation = []
    for e in range(len(row)):
        rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation
def main():
    input = sys.argv[1]
    output = sys.argv[2]
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    ncol = len(row.take(1)[0])
    intermediateResult = row.map(permutation).reduce(add_tuples)
    outputFile.write('\n')
    outputFile.close()
    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)
main()
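The listing above is an excerpt; the self-contained sketch below shows the same general idea end to end: every row of the input matrix contributes an outer product, and the element-wise sum of those outer products over all rows gives the product of the matrix's transpose with the matrix. The input format (whitespace-separated numbers, one row per line) and the file paths are illustrative assumptions.

# matrix_multiply_dense.py - per-row outer products summed into A^T * A.
import sys
from pyspark import SparkConf, SparkContext

def outer_product(row):
    # flattened n x n outer product of a row with itself
    return [a * b for a in row for b in row]

def add_vectors(a, b):
    # element-wise sum of two flattened matrices of equal size
    return [x + y for x, y in zip(a, b)]

def main(input_path, output_path):
    sc = SparkContext(conf=SparkConf().setAppName("MatrixMultiply"))
    rows = sc.textFile(input_path).map(lambda line: [float(v) for v in line.split()])
    n = len(rows.take(1)[0])
    flat = rows.map(outer_product).reduce(add_vectors)      # flattened n x n result
    result = [flat[i * n:(i + 1) * n] for i in range(n)]    # reshape into n rows
    sc.parallelize([' '.join(str(v) for v in r) for r in result], 1).saveAsTextFile(output_path)
    sc.stop()

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])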
matrix_multiply_sparse.py
from scipy import *
def createCSRMatrix(input):
    row = []
    col = []
    data = []
    value = values.split(':')
    row.append(0)
    col.append(int(value[0]))
    data.append(float(value[1]))
    return csr_matrix((data, (row, col)), shape=(1, 100))
def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)
def formatOutput(indexValuePairs):
def main():
    input = sys.argv[1]
    output = sys.argv[2]
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row+1]]
    data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row+1]]
    indexValuePairs = zip(col, data)
    formattedOutput = formatOutput(indexValuePairs)
    outputFile.write(formattedOutput + '\n')
main()
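The sparse listing above is also an excerpt. A minimal local sketch of its per-row step is shown below: each input line of "index:value" pairs is parsed into a 1 x 100 scipy CSR row (the shape and format are taken from the fragment), and multiplying the row's transpose by the row gives that row's contribution to the product. The sample line is illustrative.

# sparse_row_product.py - parse one "index:value" line into a 1 x 100 CSR row
# and form its 100 x 100 sparse outer product with itself.
from scipy.sparse import csr_matrix

def create_csr_row(line):
    row, col, data = [], [], []
    for pair in line.split():
        index, value = pair.split(':')
        row.append(0)                  # everything goes into the single row 0
        col.append(int(index))
        data.append(float(value))
    return csr_matrix((data, (row, col)), shape=(1, 100))

def multiply(csr_row):
    # (1 x 100)^T * (1 x 100) -> 100 x 100 sparse outer product
    return csr_row.transpose(copy=True) * csr_row

if __name__ == '__main__':
    sample = "3:1.5 7:2.0 42:0.5"      # illustrative input line
    product = multiply(create_csr_row(sample))
    print(product.nnz, "non-zero entries")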
Building a data pipeline for Covid-19 data analysis using Big Data technologies and Tableau
• The purpose is to collect real-time streaming data from the COVID-19 open API every 5
minutes into the ecosystem using NiFi, process it, and store it in the data lake on AWS.
• Data processing includes parsing the data from complex JSON format to CSV format, then
publishing it to Kafka for persistent delivery of messages into PySpark for further processing (a
PySpark sketch of this Kafka read/write step follows the tools list below).
• The processed data is then fed into an output Kafka topic, which is in turn consumed by NiFi and
stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is
orchestrated using Airflow to run at every time interval. Finally, KPIs are visualized in Tableau.
Data Architecture
Tools used:
1. Nifi -nifi-1.10.0
2. Hadoop -hadoop_2.7.3
3. Hive-apache-hive-2.1.0
4. Spark-spark-2.4.5
5. Zookeeper-zookeeper-2.3.5
6. Kafka-kafka_2.11-2.4.0
7. Airflow-airflow-1.8.1
8. Tableau
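The PySpark stage in the middle of this pipeline reads the raw JSON messages from the input Kafka topic, parses the fields of interest, and writes the result to the output Kafka topic. The Structured Streaming sketch below illustrates that step only; the broker address, topic names, and JSON fields are illustrative assumptions, and the spark-sql-kafka package must be available to Spark.

# covid_stream_processing.py - sketch of the PySpark stage: Kafka in -> parse JSON -> Kafka out.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, col, struct
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("Covid19Pipeline").getOrCreate()

schema = StructType([                                          # illustrative subset of the API fields
    StructField("state", StringType()),
    StructField("confirmed", LongType()),
    StructField("recovered", LongType()),
    StructField("deaths", LongType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")    # illustrative broker
       .option("subscribe", "covid_raw")                       # illustrative input topic
       .load())

parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("c"))
             .select("c.*"))

query = (parsed.select(to_json(struct("state", "confirmed", "recovered", "deaths")).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid_processed")                   # illustrative output topic
         .option("checkpointLocation", "/tmp/covid_checkpoint")
         .start())
query.awaitTermination()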
MongoDB
MongoDB is an open-source document-oriented database that is designed to store a large scale of data
and also allows you to work with that data very efficiently. It is categorized as a NoSQL (Not only SQL)
database because the storage and retrieval of data in MongoDB are not in the form of tables.
The MongoDB database is developed and managed by MongoDB Inc. under the SSPL (Server Side Public
License) and was initially released in February 2009. It also provides official driver support for all the
popular languages like C, C++, C#, .Net, Go, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala,
Swift and Mongoid, so that you can create an application using any of these languages. Nowadays there
are many companies that use MongoDB, like Facebook, Nokia, eBay, Adobe, Google, etc., to store their
large amounts of data.
Features of MongoDB –
Schema-less Database: This is a great feature provided by MongoDB. A schema-less database
means one collection can hold different types of documents in it. In other words, in the
MongoDB database, a single collection can hold multiple documents and these documents may
consist of different numbers of fields, content, and size. It is not necessary for one document to be
similar to another document, as in relational databases. Due to this feature, MongoDB provides
great flexibility to databases.
Document Oriented: In MongoDB, all the data is stored in documents instead of tables as in
RDBMS. In these documents, the data is stored in fields (key-value pairs) instead of rows and
columns, which makes the data much more flexible in comparison to RDBMS. Each document
contains its own unique object id.
Indexing: In the MongoDB database, every field in the documents can be indexed with primary and
secondary indices; this makes it easier and faster to get or search data from the pool of data.
If the data is not indexed, then the database searches each document against the specified query,
which takes a lot of time and is not efficient.
Scalability: MongoDB provides horizontal scalability with the help of sharding. Sharding means
distributing data on multiple servers; here a large amount of data is partitioned into data chunks
using the shard key, and these data chunks are evenly distributed across shards that reside across
many physical servers. New machines can also be added to a running database.
Replication: MongoDB provides high availability and redundancy with the help of replication. It
creates multiple copies of the data and sends these copies to different servers so that if one server
fails, the data can be retrieved from another server.
Aggregation: It allows you to perform operations on grouped data and get a single or computed
result, similar to the SQL GROUP BY clause. It provides three different ways to aggregate, i.e. the
aggregation pipeline, the map-reduce function, and single-purpose aggregation methods (a small
pymongo sketch of the aggregation pipeline follows this list).
High Performance: MongoDB offers very high performance and data persistence compared to other
databases, due to its features like scalability, indexing, replication, etc.
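As a small illustration of the aggregation pipeline mentioned above, the pymongo sketch below groups student documents by a field and counts them, much like SQL's GROUP BY; the database, collection, and field names are illustrative.

# aggregation_example.py - group student documents by course and count them.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school"]                                     # illustrative database name

pipeline = [
    {"$group": {"_id": "$course", "total_students": {"$sum": 1}}},
    {"$sort": {"total_students": -1}},
]
for doc in db.student.aggregate(pipeline):
    print(doc)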
Advantages of MongoDB:
It is a schema-less NoSQL database. You do not need to design the schema of the database when
you are working with MongoDB.
It does not support join operations.
It provides great flexibility to the fields in the documents.
It can contain heterogeneous data.
It provides high performance, availability, and scalability.
It supports geospatial data efficiently.
It is a document-oriented database and the data is stored in BSON documents.
It also supports multi-document ACID transactions (starting from MongoDB 4.0).
It is not susceptible to SQL injection.
It is easily integrated with Big Data Hadoop.
Disadvantages of MongoDB:

MongoDB CRUD Operations –
As we know, we can use MongoDB for various things like building an application (including web
and mobile), analysis of data, or administration of a MongoDB database. In all these cases we
need to interact with the MongoDB server to perform certain operations like entering new data into
the application, updating data in the application, deleting data from the application, and reading the
data of the application. MongoDB provides a set of basic but essential operations that help you
easily interact with the MongoDB server, and these operations are known as CRUD
operations.
Create Operations –
The create or insert operations are used to insert or add new documents in the collection. If a
collection does not exist, then it will create a new collection in the database. You can perform create
operations using the following methods provided by MongoDB:
Method Description
Example 1: In this example, we are inserting the details of a single student in the form of a document in the
student collection using db.collection.insertOne()
method.
Example 2: In this example, we are inserting the details of multiple students in the form of documents
in the student collection using db.collection.insertMany() method.
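The insertOne and insertMany calls shown in Examples 1 and 2 look like this when issued from Python with the pymongo driver; the database name and the field values are illustrative.

# insert_examples.py - insert_one / insert_many with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school"]                                     # illustrative database name

# Example 1: insert a single student document.
db.student.insert_one({"name": "Sumit", "age": 20, "course": "B.E.", "year": 3})

# Example 2: insert multiple student documents at once.
db.student.insert_many([
    {"name": "Akash", "age": 21, "course": "B.E.", "year": 3},    # illustrative values
    {"name": "Priya", "age": 22, "course": "B.E.", "year": 4},
])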
Read Operations –
The Read operations are used to retrieve documents from the collection, or in other words, read
operations are used to query a collection for a document. You can perform read operations using the
following method provided by MongoDB:
Method Description
Example : In this example, we are retrieving the details of students from the student collection using
db.collection.find() method.
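In pymongo the same read operation uses find() and find_one(); the database name and filter values below are illustrative.

# read_example.py - retrieve student documents with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school"]                                     # illustrative database name

for student in db.student.find():                         # every document in the collection
    print(student)

print(db.student.find_one({"name": "Sumit"}))             # one document matching a filter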
Update Operations –
The update operations are used to update or modify the existing document in the collection. You can
perform update operations using the following methods provided by MongoDB:
Method | Description
db.collection.replaceOne() | It is used to replace a single document in the collection that satisfies the given criteria.
Example 1: In this example, we are updating the age of Sumit in the student collection.
Example 2: In this example, we are updating the year of course in all the documents in the student
collection using db.collection.updateMany() method.
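The two updates above map to update_one and update_many in pymongo; the filter and the new values are illustrative.

# update_examples.py - update_one / update_many with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school"]                                     # illustrative database name

# Example 1: update the age of Sumit.
db.student.update_one({"name": "Sumit"}, {"$set": {"age": 21}})

# Example 2: update the year of course in all documents.
db.student.update_many({}, {"$set": {"year": 4}})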
Delete Operations –
The delete operations are used to delete or remove documents from a collection. You can perform
delete operations using the following methods provided by MongoDB:
Method | Description
db.collection.deleteOne() | It is used to delete a single document from the collection that satisfies the given criteria.
db.collection.deleteMany() | It is used to delete multiple documents from the collection that satisfy the given criteria.
Example 1: In this example, we are deleting a document from the student collection using
db.collection.deleteOne() method.
Example 2: In this example, we are deleting all the documents from the student collection using
db.collection.deleteMany() method.
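The corresponding pymongo calls are delete_one and delete_many; the database name and filter values are illustrative.

# delete_examples.py - delete_one / delete_many with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school"]                                     # illustrative database name

# Example 1: delete a single document matching the filter.
db.student.delete_one({"name": "Sumit"})

# Example 2: delete all documents from the student collection.
db.student.delete_many({})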