BDAV Internal Notes
Unit 1
Unit 2
Q1] HDFS vs HBase
Q2] HDFS
Q3] Partitioner and Combiner with respect to MapReduce
Q4] Explain Hadoop Distributed File System (HDFS) Architecture in detail.
Q5] What is Rack awareness in Hadoop HDFS? How to achieve fault tolerance with the help
of Rack awareness? Explain with the help of a suitable example
Q6] Explain matrix multiplication mapper and reducer functions for MapReduce with the help of a suitable example.
Q7] Discuss Matrix Multiplication by MapReduce. Perform Matrix Multiplication with the 1-step method.
Matrix A: [[1, 2], [3, 4]]   Matrix B: [[5, 6], [2, 3]]
Q8]
Unit 3
Q1] Illustrate NoSQL data architecture patterns with suitable example.
Q2] What is NoSQL? Explain various NoSQL Data architecture patterns.
Q3] Master-slave vs peer-to-peer architecture in NoSQL
Q5] Explain HBase Architecture and Components of HBase Architecture.
Q6] List and Explain types of NoSQL with example of each
Unit 4
Q2] Explain the architecture of Apache Pig with the help of a diagram
Q3] Hive
Q4] PIG
Q5] What is the significance of Apache Pig in the Hadoop context? Explain the main components and the working of Apache Pig with the help of a diagram.
Q6] Joins in Hive
Q7] Explain the significance of Pig in the Hadoop Ecosystem. Explain the Data Model in Pig Latin with a suitable example.
Q8] Illustrate Hive. Explain Hive architecture with suitable example.
Q9]
Unit 5
Q1] What is Apache Kafka? Explain the benefits and need of Apache Kafka.
Q2] Features of Apache Spark
Q6] Explain Bulk Synchronous Processing (BSP) and graph processing with respect to Apache Spark.
Q7] Discuss the Apache Kafka fundamentals. Explain the Kafka Cluster Architecture with a suitable diagram.
Q8] DataSets vs DataFrames
Q9] Explain Apache Spark. What are the advantages of Apache Spark over MapReduce?
Q10] Illustrate Apache Kafka. Explain with a suitable example the streaming of real-time data with respect to Apache Kafka.
Unit 6
Q3] What are the tools and benefits of Data Visualization? Also explain the challenges of Big Data Visualization.
Q4] Illustrate the characteristics of social media which make it suitable for Big Data Analytics.
Q5] Explain data visualization. List out and illustrate the various Data Visualization Tools.
Q6]
Answers:
Unit 1
Q1] Yarn Daemon
Ans :
Hadoop's resource management layer is Apache YARN, which stands for "Yet Another Resource Negotiator". YARN was first introduced in Hadoop 2.x. YARN enables various data processing engines, including graph processing, interactive processing, stream processing, and batch processing, to operate on and handle data stored in HDFS (Hadoop Distributed File System). YARN offers job scheduling in addition to resource management.
Refer Q9.
1.3.1 Structured: Structured data is one sort of big data. Structured data is defined as data that can be processed, stored, and retrieved in a consistent way. It refers to information that is neatly organized and can be easily and smoothly stored in and accessed from a database using basic search engine methods. For example, the employee table in a corporate database will be formatted such that personnel details, work positions, salary, and so on are all presented in an ordered manner.
1.3.2 Unstructured: Unstructured data is data that does not have any defined shape or organization. This makes processing and analyzing unstructured data extremely complex and time-consuming. Email is an example of unstructured data. Big data may be classified into two types: structured and unstructured. Today's enterprises have a lot of data at their disposal, but they don't know how to extract value from it since the data is in its raw form or unstructured format. Example: the output returned by a 'Google Search'.
Velocity: Only fresh updates tend to be of interest to users; they often delete old messages and focus on new changes. Data transfer is now nearly real-time, and the update window has shrunk to fractions of a second. Big Data is represented by this high-velocity data.
Variety: Data may be stored in a variety of formats. For example, it may be saved in a database, Excel, CSV, Access, or even a plain text file. Sometimes the data is not even in the typical format that we expect; it may be in the form of video, SMS, PDF, or something else that we haven't considered. It is the organization's responsibility to organize it and make it relevant. It would be simple if we had data in the same format, but this is not always the case. The real world has data in a variety of forms, and that is the challenge we must conquer with Big Data. This variety of data represents Big Data.
b) Semi-structured data: This data type isn't as organised as structured data but has some
level of structure, like XML or JSON files.
c) Unstructured data: A category representing the majority of data generated today, it
includes videos, images, social media posts, emails, and much more.
Q4] Explain Hadoop Ecosystem with core components. Explain its architecture ?
Ans :
The Hadoop Ecosystem is a robust platform designed to handle big data challenges by offering
a wide range of services for data processing, storage, analysis, and maintenance. It integrates
several Apache projects along with various commercial tools. The key components of the
Hadoop ecosystem are:
Core Components:
HDFS (Hadoop Distributed File System): A distributed file system that stores large volumes of
data across multiple nodes.
YARN (Yet Another Resource Negotiator): Manages computing resources and schedules tasks.
MapReduce: A programming model used for processing large datasets in parallel.
Hadoop Common: Provides common utilities and libraries needed by other Hadoop
components.
Q5] What are the different frameworks that run under YARN? Discuss the various YARN
Daemons
Ans : Refer Q9
Q6] What is Big Data and its types? List out the difference between Traditional Data Vs.
Big Data with the help of example ?
Ans :
Big Data :
Data which is very large in size is called Big Data. Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in the petabyte range (10^15 bytes) is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.
Types : Refer Q2
Q9]Explain the Workflow using Resource Manager, Application Master & Node Master
of Hadoop YARN with a suitable diagram
Ans :
The Apache YARN framework consists of a master daemon known as the "Resource Manager", a slave daemon called the Node Manager (one per slave node), and an Application Master (one per application).
● Resource Manager (RM): It is YARN's master daemon. The global allocation of resources (CPU and memory) across all applications is managed by the RM. It resolves conflicts over system resources between competing applications. To study the YARN Resource Manager in depth, refer to the Resource Manager guide. The main components of the Resource Manager are:
✔ Application Manager
✔ Scheduler
Application Manager: It handles the running of Application Masters in the cluster, which means it is in charge of launching Application Masters as well as monitoring and restarting them on other nodes in the event of a failure.
Scheduler: The scheduler is in charge of allocating resources to the currently running applications. It is a pure scheduler, which means it performs no monitoring or tracking of the application and makes no guarantees about restarting failed tasks due to application or hardware failures.
The Application Master (AM) obtains containers from the RM's Scheduler and then contacts the corresponding Node Managers (NMs) to launch the application's tasks.
Q10] Explain the distributed storage system of Hadoop with the help of a neat diagram
Ans :
The Hadoop Distributed File System (HDFS) is Hadoop's distributed storage layer. It is built with a master/slave architecture. This architecture consists of a single Name Node acting as the master and multiple Data Nodes acting as slaves.
Name Node and Data Node are both capable of running on commodity machines. HDFS is
written in the Java programming language.
As a result, the Name Node and Data Node software can be run on any machine that
supports the Java programming language. The Hadoop
Distributed File System (HDFS) is a distributed file system based on the Google File System
(GFS) that is designed to run on commodity hardware. It is very
similar to existing distributed file systems.
However, there are significant differences between it and other distributed file systems. It is
highly fault-tolerant and is intended for use on low-cost hardware. It allows for high-
throughput access to application data and is appropriate for applications with large
datasets. In the HDFS cluster, there is only one master
server.
Name Node:
● The Name Node is the master server; it manages the file system namespace and regulates clients' access to files.
Data Node:
● When instructed by the Name Node, it creates, deletes, and replicates blocks.
Job Tracker:
● The Job Tracker's role is to accept MapReduce jobs from clients and process the data using the Name Node.
Task Tracker:
● It receives the task and code from the Job Tracker and applies it to the file. This procedure is also known as a Mapper.
MapReduce Layer:
When the client application submits a MapReduce job to the Job Tracker, the MapReduce processing begins. The Job Tracker forwards the request to the relevant Task Trackers. A Task Tracker occasionally fails or times out; in such a case, that portion of the job is rescheduled.
Unit 2
Q1] HDFS Vs HBase
Ans :
Q2] HDFS
Ans : Refer from unit 1
"Hadoop" → [1]
"World" → [1]
Step 3: Reduce Phase
The Reducer takes each key (word) and sums up the values associated with it (the 1s).
For "Hello," the reducer adds 1 + 1 and outputs ("Hello", 2).
Example:
arduino
Copy code
("Hello", 2)
("Hadoop", 1)
("World", 1)
Final Output:
The result is a word count for each unique word in the input:
Copy code
Hello 2
Hadoop 1
World 1
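As a hedged illustration of the same word count, below is a minimal Hadoop Streaming sketch in Python (the script names mapper.py and reducer.py and the submission command are assumptions, not part of the original notes). The mapper reads lines from standard input and emits word<TAB>1 pairs; Hadoop sorts the pairs by key, and the reducer sums the counts per word.

# mapper.py -- emits one (word, 1) pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums counts for each word (input arrives sorted by word)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The two scripts would typically be submitted with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py, with the exact jar path depending on the installation.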
Q3] Partitioner and Combiner with respect to MapReduce
Ans :
A Partitioner partitions the key-value pairs of the map's intermediate output. It separates the data using a user-defined condition that works like a hash function. The total number of partitions equals the number of Reducer tasks for the job.
A Combiner, also called a semi-reducer, is an optional step used to optimize the MapReduce process. It is used to reduce the output at the node level. At this point, the duplicate keys' outputs from the map phase can be merged into a single output. The combine step speeds up the shuffle step by improving job performance.
The Combiner class is used between the Map class and the Reduce class to reduce the amount of data transferred between Map and Reduce. In general, the output of the map task is large, and the amount of data transferred to the reduce task is high.
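A small Python sketch (a simulation only, not the Hadoop API; the number of reducers and the sample pairs are assumptions) of how a hash-style partitioner routes map output and how a combiner pre-aggregates counts on a node before the shuffle:

from collections import defaultdict

NUM_REDUCERS = 3  # assumed number of reducer tasks for the job

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash partitioner: the same key always maps to the same reducer.
    # (Python's built-in hash() is salted per process; Hadoop's HashPartitioner uses a stable hash.)
    return hash(key) % num_reducers

def combine(map_output):
    # Combiner ("semi-reducer"): sum counts per key locally before data leaves the node.
    combined = defaultdict(int)
    for key, value in map_output:
        combined[key] += value
    return list(combined.items())

map_output = [("hello", 1), ("hadoop", 1), ("hello", 1)]
for key, count in combine(map_output):
    print(f"{key}: combined count = {count}, goes to reducer {partition(key)}")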
Q4] Explain Hadoop Distributed File System (HDFS) Architecture in detail.
Ans :
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have massive datasets. HDFS relaxes some POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop core project.
An HDFS cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients. It maintains the file system tree
and metadata for all the files and directories in the tree.
This information is stored on the local disk in the form of two files: the namespace image and
the edit log.
Along with it there are a some DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. The Data Nodes are also called as worker
nodes. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode also knows the DataNodes on which all the blocks for a given file are
located, however, it does not store block locations, since this
information is reconstructed from DataNodes when the system starts. The NameNode
executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
It is important to make the NameNode fault-tolerant, as without the NameNode the file system cannot be used. To make it fault-tolerant, the first step is taking backups of the persistent state files and metadata present in it. Hadoop can be configured so that the NameNode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to the local disk as well as a remote NFS mount.
It is also possible to run a secondary NameNode, which, despite its name, does not act as a NameNode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary NameNode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the NameNode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the NameNode failing. However, the state of the secondary NameNode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the NameNode's metadata files that are on NFS to the secondary and run it as the new primary.
Q) HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge data,
the files are stored across multiple machines. These files are stored in redundant fashion to
rescue the system from possible data losses in case of failure. HDFS also makes applications
available to parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of name node and data node help users to easily check the status
of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will be
a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be increased as per the need by changing the HDFS configuration.
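A small Python sketch (the file size is an assumed example) showing how a file is split into HDFS blocks and how many block copies the cluster keeps with the default replication factor of 3:

import math

FILE_SIZE_MB = 500        # assumed file size
BLOCK_SIZE_MB = 128       # dfs.blocksize (128 MB in Hadoop 2.x and later)
REPLICATION = 3           # default dfs.replication

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
print(f"{FILE_SIZE_MB} MB file -> {num_blocks} blocks")                 # 4 blocks
print(f"Block copies stored cluster-wide: {num_blocks * REPLICATION}")  # 12 copies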
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and
automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes
place near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
Q5] What is Rack awareness in Hadoop HDFS? How to achieve fault tolerance with the help
of Rack awareness? Explain with the help of a suitable example
Ans :
Most of us are familiar with the term rack. A rack is a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the Namenode chooses the closest Datanode to achieve maximum performance while performing read/write operations, which reduces the network traffic. A rack can have multiple data nodes storing the file blocks and their replicas. Hadoop itself is smart enough to automatically write the replicas of a particular file block to Datanodes in 2 different racks. If you want to store that block of data in more than 2 racks, you can do that; this feature is configurable and can be changed manually. Example of racks in a cluster:
As we all know, a large Hadoop cluster contains multiple racks, and in each rack there are lots of data nodes. Communication between the Datanodes on the same rack is much faster than communication between data nodes on 2 different racks. The Namenode has the ability to find the closest data node for faster performance; for that, the Namenode holds the IDs of all the racks present in the Hadoop cluster. This concept of choosing the closest data node for serving a request is Rack Awareness. Let's understand this with an example.
In the above image, we have 3 different racks in our Hadoop cluster, and each rack contains 4 Datanodes. Now suppose you have 3 file blocks (Block 1, Block 2, Block 3) that you want to put on these data nodes. As we all know, Hadoop has a feature of making replicas of the file blocks to provide high availability and fault tolerance. By default, the replication factor is 3, and Hadoop is smart enough to place the replicas of the blocks across racks in such a way that we can achieve good network bandwidth. For that, Hadoop has some rack awareness policies:
There should not be more than 1 replica on the same Datanode.
More than 2 replicas of a single block are not allowed on the same rack.
The number of racks used inside a Hadoop cluster must be smaller than the number of replicas.
Now let's continue with our example. In the diagram, we can easily see that we have Block 1 on the first Datanode of Rack 1 and 2 replicas of Block 1 on Datanodes 5 and 6 of another rack, which sums up to 3 copies. Similarly, we also have a replica distribution of the 2 other blocks across different racks, following the above policies. Benefits of implementing Rack Awareness in our Hadoop cluster:
With the rack awareness policies, we store the data in different racks, so there is no way to lose our data.
Rack awareness helps to maximize the network bandwidth because data block transfers stay within the racks where possible.
It also improves the cluster performance and provides high data availability.
HDFS Rack Awareness Example:
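The example figure itself is not reproduced in these notes, so below is a hedged Python sketch (the rack layout and node names are assumptions) that chooses replica locations following the default policy described above: the first replica on the writing node and the remaining two on different nodes of one other rack.

import random

# Assumed cluster layout: 3 racks with 4 Datanodes each.
racks = {
    "rack1": ["dn1", "dn2", "dn3", "dn4"],
    "rack2": ["dn5", "dn6", "dn7", "dn8"],
    "rack3": ["dn9", "dn10", "dn11", "dn12"],
}

def place_replicas(local_rack, local_node, replication=3):
    # Replica 1: the node writing the block.
    placement = [(local_rack, local_node)]
    # Replicas 2 and 3: two different nodes on one remote rack, so that
    # no node holds more than 1 replica and no rack holds more than 2.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    for node in random.sample(racks[remote_rack], replication - 1):
        placement.append((remote_rack, node))
    return placement

print(place_replicas("rack1", "dn1"))
# e.g. [('rack1', 'dn1'), ('rack2', 'dn5'), ('rack2', 'dn6')]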
Q6] Explain matrix multiplication mapper and reducer functions for MapReduce with the help of a suitable example.
Q7] Discuss Matrix Multiplication by MapReduce. Perform Matrix Multiplication with the 1-step method.
Matrix A: [[1, 2], [3, 4]]   Matrix B: [[5, 6], [2, 3]]
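No worked answer is recorded above, so here is a hedged Python sketch of the 1-step (single MapReduce job) method for the matrices listed, under the standard scheme: the mapper keys every A element to all output cells (i, k) in its row and every B element to all output cells in its column, and the reducer multiplies the matching pairs and sums them.

from collections import defaultdict

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # Matrix A (2x2)
B = {(0, 0): 5, (0, 1): 6, (1, 0): 2, (1, 1): 3}   # Matrix B (2x2)
m, n, p = 2, 2, 2                                  # A is m x n, B is n x p

def mapper():
    # Key = output cell (i, k); value = (matrix tag, join index j, element).
    for (i, j), a in A.items():
        for k in range(p):
            yield (i, k), ("A", j, a)
    for (j, k), b in B.items():
        for i in range(m):
            yield (i, k), ("B", j, b)

def reducer(grouped):
    result = {}
    for (i, k), values in grouped.items():
        a_vals = {j: v for tag, j, v in values if tag == "A"}
        b_vals = {j: v for tag, j, v in values if tag == "B"}
        result[(i, k)] = sum(a_vals[j] * b_vals[j] for j in range(n))
    return result

grouped = defaultdict(list)           # "shuffle": group mapper output by key
for key, value in mapper():
    grouped[key].append(value)

print(reducer(grouped))
# {(0, 0): 9, (0, 1): 12, (1, 0): 23, (1, 1): 30}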
Q8]
Unit 3
Q1] Illustrate NoSQL data architecture patterns with suitable example.
Ans :
NoSQL databases come in various types, each with its unique features and applications:
1. Key-Value Store
o Description: A simple storage system that uses unique keys to access
corresponding values. Data is organized as key-value pairs, where the key is a
constant identifier, and the value can be any type of data.
o Example: In a mobile phone database, a mobile phone could be the key, while its
color, model, and brand could be the values.
2. Document Databases
o Description: Stores data in collections of documents, typically in formats like
JSON or XML. Documents are treated as whole entities, maintaining their
hierarchical structure.
o Example: A sample document might look like this:
{
"Name": "ABC",
"Address": "Indore",
"Email": "abc@xyz.com",
"Contact": "98765"
}
3. Graph Database
o Description: Based on graph theory, these databases store data in nodes and
edges. They are designed to represent and work with relationships between data
points. Each node and edge has a unique identifier.
o Example: In a social network, users could be nodes, and their friendships could
be represented as edges connecting those nodes.
4. Column Store
Description: A column store organizes data in a sparse matrix format, using columns as keys.
Each storage block holds data from a single column, which saves space by not storing null
values. A column family is a group of rows, with each row having a unique identifier.
Example: In a column family for user data, columns might represent attributes like name,
email, and address, while each row corresponds to a different user.
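A compact Python analogy (purely illustrative data, not any particular database's API) of the four patterns above:

# Key-value store: a value retrieved by its unique key.
key_value = {"phone:123": {"color": "black", "model": "X1", "brand": "ABC"}}

# Document store: the whole JSON-like document is stored and queried as one unit.
document = {"Name": "ABC", "Address": "Indore", "Email": "abc@xyz.com", "Contact": "98765"}

# Column store / column family: rows identified by a row key, each with sparse columns.
column_family = {
    "user1": {"name": "Raja", "email": "raja@gmail.com"},
    "user2": {"name": "Mohammad", "address": "Indore"},   # different columns per row
}

# Graph store: nodes plus edges that model relationships (here, friendships).
nodes = {"u1": "Raja", "u2": "Mohammad"}
edges = [("u1", "u2", "FRIENDS_WITH")]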
Q2] What is NoSQL? Explain various NoSQL Data architecture patterns
Ans :
NoSQL is actually an acronym that stands for 'Not Only SQL'. Today NoSQL is used as an umbrella term for all the data stores and databases that do not follow traditional, well-established RDBMS principles. NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a primary focus on performance, reliability and agility. NoSQL databases are widely used in big data and other real-time web applications.
In contrast, Cassandra is a wide-column store database built for high availability and scalability.
It uses a flexible schema with rows and columns, which makes it easy to organize and manage
different types of data. A key advantage of Cassandra is that it continues to work even if some
servers fail, ensuring uninterrupted access to data. It is optimized for fast data writing, making it
ideal for applications that require quick data input, like IoT data storage and social media
platforms. Its distributed design also supports easy scaling, making it a strong choice for
handling large-scale applications
Refer Q1 also
Q5] Explain HBase Architecture and Components of HBase Architecture.
Ans :
Region Server: In HBase, a Region Server manages regions, which are sorted collections of
rows. It handles both read and write operations for these regions. Region Servers, also
known as Slave Nodes, are responsible for storing data and function within the cluster.
HBase Master (HMaster): The HBase Master controls a set of Region Servers that reside
on Data Nodes. It assigns regions to these servers during startup and can reassign them
for load balancing or recovery purposes. The HMaster coordinates the HBase cluster and
performs administrative duties. With the help of Zookeeper, it monitors the health of
Region Servers and facilitates recovery when a server fails. Additionally, the HMaster
manages table operations like creating and deleting tables.
Zookeeper: Zookeeper serves as a coordinator that helps the HMaster manage HBase's
distributed environment. It regularly receives heartbeat signals from both Region Servers
and the HMaster, ensuring they are operational. If Zookeeper fails to get these signals, it
sends out failure notifications and starts recovery processes. When a Region Server goes
down, Zookeeper informs the HMaster, which then reallocates the regions of the failed
server to other active Region Servers.
Q6] List and Explain types of NoSQL with example of each
Ans : Refer Q2
Q7] CAP theorem
Ans :
● Consistency: Making available the same single updated readable version of the data to all
the clients. Here, Consistency is concerned with multiple clients reading the same data from
replicated partitions and getting consistent results.
● High Availability: System remains functional even if some nodes fail. High Availability
means that the system is designed and implemented in such a way that it continues its read
and write operation even if nodes in a cluster crash or some hardware or software parts are
down due to upgrades. Internal
communication failures between replicated data should not prevent updates.
● Partition tolerance: Partition tolerance is the ability of the system to continue functioning in the presence of network partitions. This situation arises when network nodes cannot connect to each other (temporarily or permanently). The system remains operational even on a system split or communication malfunction. A single node failure should not cause the entire system to collapse.
Unit 4
Q1] Load and Store Functions of Apache Pig
Ans :
Apache Pig, in general, runs on top of Hadoop. It is a tool for analysing huge datasets that reside in the Hadoop File System. To use Apache Pig to analyse data, we must first load the data into Apache Pig. This section explains how to load data from HDFS into Apache Pig.
LOAD operator: The LOAD operator of Pig Latin can be used to load data into Apache Pig
from a file system (HDFS/ Local). The "=" operator divides the load statement into two
sections. On the left, we must specify the name of the relation in which we want to store
the data, and on the right, we must specify how we will store the data. The Load operator's
syntax is seen below.
Relation_name = LOAD 'Input file path' USING function as schema;
● relation_name: We have to mention the relation in which we want to store the data.
● Input file path: We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
● Function: We have to choose a function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
● Schema: We have to define the schema of the data. We can define the required
schema as follows:
(column1 : data type, column2 : data type, column3 : data type);
Store Operator:
You can store the loaded data in the file system using the STORE operator. This section explains how to store data in Apache Pig using the Store operator.
Syntax:
STORE Relation_name INTO ' required_directory_path ' [USING function];
In Loading and Storing Data, we have seen how to load data from external storage for processing in Pig. Storing the results is straightforward, too. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character: grunt> STORE A INTO 'out' USING PigStorage(':');
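As a hedged end-to-end illustration (the file names, fields, and local-mode invocation below are assumptions, not from the notes), this Python snippet writes a tiny tab-separated input file and a Pig Latin script that LOADs it, FILTERs it, and STOREs the result; the script would then be run with the Pig CLI in local mode.

# Sample input: tab-separated student records.
with open("student_data.txt", "w") as f:
    f.write("1\tRaja\t30\n2\tMohammad\t45\n3\tRani\t22\n")

# A Pig Latin script using the LOAD and STORE operators described above.
pig_script = """
students = LOAD 'student_data.txt' USING PigStorage('\\t')
           AS (id:int, name:chararray, age:int);
adults   = FILTER students BY age > 25;
STORE adults INTO 'adults_out' USING PigStorage(':');
"""
with open("filter_students.pig", "w") as f:
    f.write(pig_script)

# The script would then be executed in local mode with:
#   pig -x local filter_students.pig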
Q2] Explain the architecture of Apache Pig with the help of a diagram
Ans :
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
1. Parser: The parser handles any Pig scripts or instructions typed in the Grunt shell. The parser runs checks on the scripts, such as syntax checking, type checking, and a variety of other things. These checks produce a result in the form of a Directed Acyclic Graph (DAG), which contains the Pig Latin statements and logical operators. The logical operators of the script are nodes, and the data flows are edges, therefore the DAG has nodes that are connected by edges.
2. Optimizer: After parsing and DAG generation, the DAG is sent to the logical optimizer,
which performs logical optimizations such as projection and pushdown. By removing
extraneous columns or data and pruning the loader to only load the relevant column,
projection and pushdown increase query
performance.
3. Compiler: The compiler compiles the optimised logical plan provided above and
generates a series of Map-Reduce tasks. Essentially, the compiler will
transform pig jobs into MapReduce jobs and exploit optimization opportunities in scripts,
allowing the programmer to avoid manually tuning the software.
Pig's compiler can reorder the execution sequence to enhance performance if
the execution plan remains the same as the original programme because it is a data-flow
language.
4. Execution Engine: Finally, all of the MapReduce jobs generated by the compiler are sorted and submitted to Hadoop. Hadoop then executes the MapReduce jobs to produce the desired output.
5. Execution Modes: Pig has two execution modes, which are determined by the location of the script and the availability of data.
● Local Mode: For limited data sets, local mode is the best option. Pig runs on a single JVM because all files are installed and run on localhost, preventing parallel mapper execution. Pig will also look into the local file system while importing data.
● MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs access to and setup of a Hadoop cluster and an HDFS installation. In this mode, the data on which processing is done exists in HDFS. After execution of a Pig script in MR mode, the Pig Latin statements are converted into MapReduce jobs in the back-end to perform the operations on the data.
Apache Pig is a high-level data processing language used to analyze large data sets in Hadoop. The language used in Pig is called Pig Latin, which provides various data types and operators to perform data operations.
Pig Latin: The language allows users to write scripts using rich data types (tuples, bags, maps) and operators (for loading, transforming, filtering, and storing data), making it accessible for data analysts.
Execution Methods: To perform data processing tasks, programmers write scripts in Pig Latin. These scripts are executed using different methods, such as the Grunt shell or User Defined Functions (UDFs).
Architecture: The Pig framework parses the Pig Latin scripts into a logical plan, optimizes it, and translates it into a series of MapReduce jobs for execution on a Hadoop cluster.
Execution Process:
1. Parser:
o The parser checks the Pig script for errors, such as syntax or type issues.
o It creates a Directed Acyclic Graph (DAG) that represents the logical structure of the script, with nodes representing operations and edges representing data flow.
2. Optimizer:
o The optimizer performs logical optimizations, such as projection and pushdown, on the DAG.
3. Compiler:
o The compiler converts the optimized logical plan into a series of MapReduce jobs.
o The compiler can rearrange tasks for better performance, while maintaining the original execution plan.
4. Execution Engine:
o The generated MapReduce jobs are sent to Hadoop for execution, which processes the data and produces the desired output.
Execution Modes:
Pig offers two execution modes based on data location and availability:
1. Local Mode:
o Runs on a single JVM and processes files from the local file system without parallel execution.
2. MapReduce Mode:
o Requires a Hadoop cluster setup. The scripts are converted into MapReduce jobs that operate on data in HDFS.
Q3] Hive
Ans :
The term 'Big Data' refers to massive datasets with a high volume, high velocity, and a
diverse mix of data that is growing by the day. Processing Big Data using typical data
management solutions is tough. As a result, the Apache Software Foundation developed
Hadoop, a framework for managing and processing
large amounts of data. Hive is a Hadoop data warehouse infrastructure solution that allows
you to process structured data. It is built on top of Hadoop to
summarise Big Data and facilitate querying and analysis. Initially developed by Facebook,
Hive was eventually taken up by the Apache Software Foundation and developed as an
open-source project under the name Apache Hive. It is utilised by a variety of businesses.
1. What is Hive?: Apache Hive is a data warehouse tool built on top of Hadoop. It helps in
managing and querying large volumes of data stored in Hadoop.
2. User-Friendly: Hive uses a SQL-like language called HiveQL, making it easier for users
familiar with SQL to write queries and perform data analysis.
3. Designed for Big Data: Hive is specifically designed to handle large datasets efficiently,
allowing businesses to analyze structured data quickly.
4. Open Source: Initially created by Facebook, Hive is now an open-source project under the
Apache Software Foundation, which means it has a strong community for support and
development.
5. Real-World Applications: Many businesses use Hive for data analysis, reporting, and
business intelligence, helping them gain insights from their Big Data.
In essence, Hive simplifies the process of working with Big Data in a way that is accessible to users
without requiring in-depth technical skills.
● HIVE clients
● HIVE Services
● Storage and computing
HIVE Clients:
Hive offers a variety of drivers for interacting with various types of applications. It provides a Thrift client for communication in Thrift-based applications.
JDBC Drivers are available for Java-based applications. ODBC drivers are available for any
sort of application. In the Hive services, these Clients and Drivers communicate with the
Hive server.
HIVE Services:
Hive Services can be used by clients to interact with Hive. If a client wishes to do any
query-related actions in Hive, it must use Hive Services to do so. The command line
interface (CLI) is used by Hive to perform DDL (Data Definition Language) operations.
Hive services like Meta store, File system, and Job Client communicate with Hive storage
and carry out the following tasks.
● The Hive "Meta storage database" stores the metadata of tables generated in Hive.
● Query results and data loaded into tables will be stored on HDFS in a Hadoop cluster.
Apache Hive is a data warehousing solution built on top of the Hadoop ecosystem, allowing users
to process and analyze large datasets using a SQL-like query language. Understanding its key
components is essential for leveraging its functionalities effectively. The main components of
Hive include Hive Clients, Hive Services, and Storage and Computing systems.
1. Hive Clients
Hive provides various drivers for applications to interact with it, facilitating connectivity:
Thrift Client: This client allows communication with Hive for applications built using the
Thrift framework. It’s particularly useful for programming languages other than Java.
JDBC Driver: Designed for Java applications, this driver enables them to execute Hive
queries as they would in traditional relational databases, maintaining familiarity for Java
developers.
ODBC Driver: This driver offers a standardized way for numerous applications and
programming languages to connect with Hive, making it ideal for Business Intelligence (BI)
tools and applications that support ODBC.
These clients enable users to send queries to Hive, retrieve results, and interact seamlessly with
the stored data.
2. Hive Services
Hive Services serve as the bridge between clients and the Hive server, enabling query execution
and command operations:
Command Line Interface (CLI): The CLI allows users to perform Data Definition Language
(DDL) and Data Manipulation Language (DML) operations directly from the command line,
such as creating, altering, and dropping tables.
Main Driver: This core component manages the communication between Hive Clients
(JDBC, ODBC, Thrift) and the Hive server. It processes requests from clients and routes
them to the appropriate backend systems, facilitating smooth data flow.
Meta Store Interaction: The Main Driver also interacts with the Meta Store, which stores
information about database schemas, tables, and other metadata. This ensures that
queries have access to accurate data structure and definitions.
3. Storage and Computing
Hive utilizes a storage layer for managing data and a computing layer for processing queries:
o Meta Store: Stores metadata information about tables created in Hive, such as
schemas, locations, and other attributes.
o HDFS (Hadoop Distributed File System): The actual query results and the data
loaded into Hive tables are stored on HDFS, which forms the backbone of the
Hadoop ecosystem.
o Job Client: Manages the execution of jobs and serves as the communication link
between Hive and the underlying processing framework (like MapReduce).
Q4] PIG
Ans : Refer Q2
Q5] What is the significance of Apache Pig in the Hadoop context? Explain the main components and the working of Apache Pig with the help of a diagram.
Ans : Refer Q2 and Q7
Q6] Joins in Hive
Ans :
● A JOIN combines records from two tables based on a join condition; more than two tables can be joined in the same query.
● Basically, to offer more control over rows of the ON clause for which there is no match, LEFT, RIGHT, and FULL OUTER joins exist.
Types of Joins:
Left Outer Join: On defining HiveQL Left Outer Join, even if there are no matches in the right
table it returns all the rows from the left table. To be more
specific, even if the ON clause matches 0 (zero) records in the right table, then also this Hive
JOIN still returns a row in the result. Although, it returns with NULL in each column from the
right table. In addition, it returns all the values from the left table. Also, the matched values
from the right table, or NULL in case of no matching JOIN predicate. However, the below
query shows LEFT OUTER JOIN between CUSTOMER as well as ORDER tables:
Right Outer Join: Basically, even if there are no matches in the left table, HiveQL Right Outer Join
returns all the rows from the right table. To be more specific,
even if the ON clause matches 0 (zero) records in the left table, then also this Hive JOIN still
returns a row in the result. Although, it returns with NULL in
each column from the left table. In addition, it returns all the values from the right table.
Also, the matched values from the left table or NULL in case of no matching join
predicate.
Full Outer Join: The primary goal of this HiveQL Full outer Join is to merge the records from
both the left and right outside tables in order to satisfy the Hive JOIN requirement. In
addition, this connected table either contains all of the records from both tables or fills in
NULL values for any missing matches on
either side.
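The query referred to above is not reproduced in the notes, so here is a hedged sketch (assuming a HiveServer2 instance on localhost:10000 and illustrative customers/orders tables) that runs a LEFT OUTER JOIN from Python using the PyHive client:

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
# Customers with no matching order still appear, with NULL order columns.
cursor.execute("""
    SELECT c.id, c.name, o.amount
    FROM customers c
    LEFT OUTER JOIN orders o ON (c.id = o.customer_id)
""")
for row in cursor.fetchall():
    print(row)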
Q7] Explain the significance of Pig in Hadoop Ecosystem. Explain the Data Model in Pig
Latin with suitable example.
Ans :
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes
such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s
data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − 'raja' or '30'
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non- unique) is known
as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’.
It is similar to a table in RDBMS, but
unlike a table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag. Example − {Raja,
30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
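A small Python analogy (illustrative only, not Pig syntax) mapping the Pig Latin data model onto native Python types:

# Atom / field: a single value.
atom = "raja"

# Tuple: an ordered set of fields, like a row.
row = ("Raja", 30)

# Bag: an unordered collection of tuples; tuples may have different numbers of fields.
bag = [("Raja", 30), ("Mohammad", 45)]

# Map: key-value pairs with string (chararray) keys.
record_map = {"name": "Raja", "age": 30}

# Relation: a bag of tuples, the outermost structure that Pig operators work on.
relation = bag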
Significance of pig :
o Nested data types - The Pig provides a useful concept of nested data types like
tuple, bag, and map.
Q8] Illustrate Hive. Explain Hive architecture with suitable example.
Ans : Refer Q3
Unit 5
Q1] What is Apache Kafka? Explain the benefits and need of Apache Kafka.
Ans :
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits :
● Performance: Kafka has high throughput for both publishing and subscribing of messages. It maintains stable performance even when many TB of messages are stored.
Kafka is very fast and guarantees zero downtime and zero data loss.
Need
Kafka is a unified platform for handling all the real-time data feeds. Kafka supports low
latency message delivery and gives guarantee for fault tolerance in the presence of
machine failures. It has the ability to handle a large number of diverse consumers. Kafka is
very fast, performs 2 million writes/sec. Kafka persists all data to the disk, which essentially
means that all the writes go to the page cache of the OS (RAM). This makes it very efficient
to transfer data
from page cache to a network socket.
Q2] Features of Apache Spark
Ans :
As against a common belief, Spark is not a modified version of Hadoop and is not, really, dependent on Hadoop because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has
its own cluster management computation, it uses Hadoop for storage purpose only
● Speed:
Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory, and
10 times faster when running on disk. This is possible by reducing number of read/write
operations to disk. It stores the intermediate processing data in memory.
● Advanced Analytics:
Spark not only supports 'map' and 'reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Q3] Apache Kafka
Ans : Refer Q1
Q4] What is RDD? How is data partitioned in an RDD?
Ans :
Resilient Distributed Datasets: Resilient Distributed Datasets (RDD) is a
fundamental data structure of Spark. It is an immutable distributed collection of objects.
Each dataset in RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
Spark makes use of the concept of RDD to achieve faster and efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.
Data partitioning in RDD is a crucial factor in optimizing performance, as it allows Spark to
process data in parallel by dividing it into chunks called
partitions. Here's how partitioning works in RDDs:
1. Default Partitioning:
o By default, Spark decides the number of partitions based on the input data
source. For instance, if reading from an HDFS file, Spark uses the file’s block
size to determine partitions.
o Typically, each partition processes a subset of the data, enabling
parallelism. The number of partitions is usually the number of blocks in
the source file but can be adjusted.
2. Custom Partitioning:
o Users can define the number of partitions explicitly when creating an RDD, e.g., sc.textFile("path", minPartitions).
o Spark also provides partitioning functions such as repartition() and coalesce() to control data distribution. repartition() increases or decreases the number of partitions, while coalesce() only reduces the number of partitions, improving efficiency by minimizing data shuffling.
3. Hash Partitioning:
o For certain operations like joins, grouping, and aggregations, Spark uses hash
partitioning. Hash partitioning is achieved by hashing keys and placing them
into partitions.
o For example, if there are three partitions, keys are distributed based on
their hash value mod 3, ensuring similar keys are
grouped in the same partition for operations that benefit from it.
4. Range Partitioning:
o In range partitioning, data is divided based on ranges of values, which is
useful for ordered data. For example, numeric or sorted data can be
divided into specific ranges to optimize processing.
Example of Data Partitioning in an RDD
Consider a scenario where you have an RDD created from a large dataset containing
customer transactions. If you load this data into Spark with five
partitions, Spark divides the data into five chunks, each assigned to a separate node in the
cluster. Here’s how partitioning optimizes data processing:
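The code block that belonged here was not preserved, so below is a hedged PySpark sketch (a local master and a small in-memory sample in place of the transactions file are assumptions) showing an RDD created with five partitions, a key-based shuffle, and coalescing:

from pyspark import SparkContext

sc = SparkContext("local[*]", "partitioning-demo")

# Illustrative stand-in for the customer-transaction dataset.
transactions = [("cust1", 250.0), ("cust2", 99.5), ("cust1", 10.0), ("cust3", 42.0)]

rdd = sc.parallelize(transactions, 5)          # explicitly request 5 partitions
print(rdd.getNumPartitions())                  # 5

totals = rdd.reduceByKey(lambda a, b: a + b)   # hash-partitioned by key during the shuffle
print(totals.collect())

fewer = rdd.coalesce(2)                        # shrink partitions without a full shuffle
print(fewer.getNumPartitions())                # 2

sc.stop()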
Each partition can process the transactions independently and in parallel, leading to faster
and more efficient computation.
Advantages of Partitioning in RDDs
Q5] What are the advantages of Apache Spark over MapReduce?
Ans :
Real-time monitoring: Spark supports real-time data analysis, which can help you
respond quickly to new trends or issues.
Flexibility: Spark's speed and flexibility are useful when evaluating multiple
data models or strategies.
Scalability: Spark can expand systems and build scalable solutions quickly,
efficiently, and cost-effectively.
Data processing: Spark is a popular engine for large-scale data
processing.
Batch processing: Spark can process both batch and real-time
applications efficiently.
Low latency: Spark offers low latency due to reduced I/O operations
Q6] Explain Bulk Synchronous Processing (BSP) and graph processing with respect to Apache Spark.
Ans :
BSP in the Context of Apache Spark
Bulk Synchronous Parallel (BSP) is a model of parallel computation that proceeds in supersteps: each node computes on its local data independently, the nodes then exchange messages, and a barrier synchronization completes the superstep before the next one begins.
Apache Spark supports a similar paradigm, although it is not explicitly based on the BSP model. Spark's RDD transformations (such as map, reduce, and join) and actions (like collect or count) work on the concept of distributed computation, where each node performs operations independently and then synchronizes data when necessary.
Spark’s approach to processing large datasets aligns with BSP principles, especially when
performing distributed computations. While Spark doesn’t explicitly use the BSP model, it
follows similar steps to synchronize tasks and ensure data consistency across a distributed
system.
Parallel Computation: Each node in Spark processes a partition of the data
independently.
Communication: Tasks communicate across nodes when necessary, such as in shuffling data for operations like join, groupBy, or reduceByKey.
Synchronization: After each stage of computation, Spark ensures all tasks are
completed before moving to the next stage, similar to the synchronization
phase of BSP.
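A hedged PySpark sketch (local master and toy edge data assumed) of this BSP-like pattern: each partition is processed independently in the map step, data is exchanged during the shuffle of reduceByKey, and the stage boundary acts as the synchronization point before results are collected:

from pyspark import SparkContext

sc = SparkContext("local[*]", "bsp-style-demo")

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]   # toy graph as an edge list

out_degrees = (sc.parallelize(edges, 2)
                 .map(lambda edge: (edge[0], 1))     # parallel computation per partition
                 .reduceByKey(lambda a, b: a + b))   # communication + barrier at the shuffle

print(sorted(out_degrees.collect()))   # [('A', 2), ('B', 1), ('C', 1)]
sc.stop()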
Graph processing is the task of working with graph-structured data (nodes and edges), and
Apache Spark provides a powerful framework for handling graph data through GraphX, its
distributed graph processing library.
GraphX integrates graph processing with Spark’s RDD-based processing,
enabling the use of Spark’s in-memory capabilities and distributed computation for graph
algorithms.
Key Features of GraphX:
1. Graph Representation:
o In GraphX, a graph is represented as a combination of two RDDs: a vertex RDD (vertices and their properties) and an edge RDD (edges and their properties).
2. Property Manipulation:
o GraphX allows the manipulation of vertex and edge properties, which can be updated as part of graph processing. These properties can represent anything from user information (on vertices) to weights (on edges).
Q7] Discuss the Apache Kafka fundamentals. Explain the Kafka Cluster Architecture with
suitable diagram
Ans :
8.9.1 Topics:
A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal sizes.
8.9.2 Partition:
Topics may have many partitions, so it can handle an arbitrary amount of data.
8.9.5 Brokers:
● Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume, if there are N partitions in a topic and N brokers, each broker will have one partition.
● Assume if there are N partitions in a topic and more than N brokers (n + m), the first N brokers will have one partition each and the next M brokers will not have any partition for that particular topic.
● Assume if there are N partitions in a topic and fewer than N brokers (n - m), each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.
8.9.7 Producers:
Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message will be appended to a partition. Producers can also send messages to a partition of their choice.
8.9.8 Consumers:
Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
8.9.9 Leader:
Leader is the node responsible for all reads and writes for the given partition. Every
partition has one server acting as a leader.
8.9.10 Follower:
A node which follows the leader's instructions is called a follower. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer: it pulls messages and updates its own data store.
Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. One Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle TBs of messages without performance impact. Kafka broker leader election can be done by ZooKeeper.
ZooKeeper: ZooKeeper is used for managing and coordinating the Kafka brokers. The ZooKeeper service mainly notifies producers and consumers about the presence of a new broker in the Kafka system or about a broker failure.
Producers: Producers push data to brokers. When a new broker is started, all the producers search for it and automatically send messages to that new broker. The Kafka producer doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle.
Consumers:
Since Kafka brokers are stateless, the consumer has to keep track of how many messages have been consumed by using the partition offset. If the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages.
consumer has consumed all prior messages. The consumer issues an asynchronous pull
request to the broker to have a buffer of bytes ready to
consume. The consumers can rewind or skip to any point in a partition simply by supplying
an offset value. Consumer offset value is notified by ZooKeeper.
As of now, we discussed the core concepts of Kafka. Let us now throw some light on the
workflow of Kafka.
Kafka is simply a collection of topics split into one or more partitions. A Kafka partition is a linearly ordered sequence of messages, where each message is identified by its index (called the offset). All the data in a Kafka cluster is the disjoint union of these partitions. Incoming messages are written at the end of a partition and messages are sequentially read by consumers. Durability is provided by replicating messages to different brokers.
Kafka provides both pub-sub and queue-based messaging system in a fast, reliable,
persisted, fault-tolerance and zero downtime manner. In both cases, producers simply
send the message to a topic and consumer can choose any
one type of messaging system depending on their need. Let us follow the steps in the next
section to understand how the consumer can choose the messaging system of their choice.
Q8] DataSets vs DataFrames
Ans :
Q9] Explain Apache Spark. What are the advantages of Apache Spark over MapReduce?
Ans : Refer Q2 and Q5
Q10] Illustrate the Apache Kafka. Explain with suitable example the streaming of real
time data with respect to Apache Kafka.
Ans :
Refer Q1 for the Kafka fundamentals. For streaming real-time data, consider a scenario where a taxi service wants to track the real-time location of all its taxis.
Here’s how Apache Kafka would be used:
1. Topic:
o A topic, e.g. taxi_location, is created on the Kafka cluster to carry the taxis' location events.
2. Producer:
o Each taxi is equipped with a GPS device that periodically sends location
data (like latitude and longitude) to a Kafka producer.
o The Kafka producer, integrated with the taxi’s GPS, publishes this location
data to the taxi_location topic in real time.
o Data might look like this (illustrative fields): { "taxi_id": "T123", "latitude": 18.52, "longitude": 73.85, "timestamp": "2024-01-01T10:00:00Z" }
3. Broker:
o The Kafka broker receives the data from the producer and stores it in the
taxi_location topic.
o If there are multiple brokers, data is replicated and partitioned for scalability
and reliability.
4. Consumer:
o A consumer application, like a real-time map service, reads from the
taxi_location topic to display each taxi’s current location on the map.
5. Streaming Process:
o As taxis continuously publish location data, the consumer applications
receive real-time updates. This allows the map service to show current taxi
locations to customers or dispatchers.
o The data analytics service could also use the data stream to generate
real-time insights and reports on busy areas or peak hours for better
resource allocation.
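A hedged sketch of this pipeline with the kafka-python client (the broker address, topic name, and message fields are assumptions):

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: a taxi's GPS unit publishes its position to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("taxi_location", {"taxi_id": "T123", "latitude": 18.52, "longitude": 73.85})
producer.flush()

# Consumer side: the map service reads the stream and updates taxi positions.
consumer = KafkaConsumer(
    "taxi_location",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    print(f"Taxi {event['taxi_id']} is at ({event['latitude']}, {event['longitude']})")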
Key Benefits of Using Apache Kafka in Real-Time Streaming
Scalability: Kafka can handle large volumes of data by adding more brokers
and partitions as needed.
Fault Tolerance: Kafka’s data replication ensures high availability.
Low Latency: Kafka is optimized for low-latency data streaming, which is crucial for
real-time applications.
Data Retention: Kafka can retain data for a specified time, allowing
consumers to replay events if necessary.
Unit 6
D3, short for Data-Driven Documents, is a powerful JavaScript library designed for
manipulating documents based on data. It’s one of the most effective
frameworks for data visualization, enabling developers to create dynamic, interactive
visualizations in the browser using HTML, CSS, and SVG.
Data visualization refers to representing data in graphical or pictorial forms, making even
complex datasets easy to understand. Visualizations make it
easier to spot patterns and conduct comparative analysis, which aids decision- making with
minimal effort. Frameworks like D3.js excel in making these visual representations.
D3.js stands out for several reasons that make it the preferred choice for data visualization:
Web Standards Integration:
D3 leverages modern web standards like HTML, CSS, and SVG, allowing the creation of
powerful visualizations that work seamlessly in browsers.
Data-Driven Approach:
D3 enables users to pull data from different web nodes or servers, analyze it, and render
visualizations based on the data. It also supports processing static datasets.
Versatile Graphics Creation:
D3 provides tools ranging from simple tables to advanced charts, like pie charts, bar
graphs, and even complex GIS mapping. It supports customizable visualizations, making it
adaptable to various needs.
Support for Large Datasets:
D3 efficiently handles large datasets and allows users to reuse predefined libraries, making
the development process smoother and more efficient.
Transitions and Animations:
D3 simplifies the creation of animations and transitions, handling the logic implicitly.
Developers don’t need to manually manage transitions, and the library ensures
responsive and smooth animation rendering.
DOM Manipulation:
One of D3’s standout features is its ability to manipulate the Document Object Model (DOM)
dynamically, making it highly flexible for managing the properties of its handlers.
Refer Big Data from unit 1
Visualizing Big Data: Today, organizations generate and collect data each minute. The huge amount of generated data, known as Big Data, brings new challenges to visualization because of the speed, size and diversity of information that must be considered. The volume, variety and velocity of such data require an organization to leave its technological comfort zone to derive intelligence for effective decisions. New and more sophisticated visualization techniques, based on the core fundamentals of data analysis, take into account not only the cardinality, but also the structure and the origin of such data.
Q3] What are tools and benefits of Data Visualization also explain challenges of Big Data
Visualization
Ans :
Data visualization is actually a set of data points and information that are represented graphically to make it easy and quick for users to understand. Data visualization is good if it has a clear meaning and purpose, and is very easy to interpret without requiring context. Tools of data visualization provide an accessible way to see and understand trends, outliers, and patterns in data by using visual effects or elements such as charts, graphs, and maps.
Common visualization techniques (by SciForce):
Line Plot: The simplest technique, a line plot is used to plot the relationship or dependence
of one variable on another. To plot the relationship between the two variables, we can
simply call the plot function.
Bar Chart: Bar charts are used for comparing the quantities of different categories or groups. Values of a category are represented with the help of bars, and they can be configured with vertical or horizontal bars, with the length or height of each bar representing the value.
Pie and Donut Charts: There is much debate around the value of pie and donut charts. As a rule,
they are used to compare the parts of a whole and are most effective when there are limited
components and when text and percentages
are included to describe the content. However, they can be difficult to interpret because
the human eye has a hard time estimating areas and comparing visual angles.
Histogram Plot: A histogram, representing the distribution of a continuous variable over a given interval or period of time, is one of the most frequently used data visualization techniques in machine learning. It plots the data by chunking it into intervals called 'bins'. It is used to inspect the underlying frequency distribution, outliers, skewness, and so on.
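A hedged matplotlib sketch (the sample values are made up) producing the line plot, bar chart, and histogram described above:

import random
import matplotlib.pyplot as plt

x = list(range(10))
y = [value * 2 for value in x]
categories = ["A", "B", "C", "D"]
counts = [5, 9, 3, 7]
samples = [random.gauss(50, 10) for _ in range(500)]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, y)                # line plot: relationship between two variables
axes[1].bar(categories, counts)   # bar chart: comparing quantities of categories
axes[2].hist(samples, bins=20)    # histogram: frequency distribution over 'bins'
plt.tight_layout()
plt.show()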
Security and privacy:
● Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, it is necessary to perform security checks and observation in real time, because that is most beneficial.
● There is some information about a person which, when combined with external large data, may reveal facts about that person which may be secretive, and which he might not want the data owner to know.
● Some organizations collect information about people in order to add value to their business. This is done by making insights into their lives that they're unaware of.
Quality of data:
When there is a collection of a large amount of data and storage of this data, it comes at a cost.
Big companies, business leaders and IT leaders always want
large data storage.
● For better results and conclusions, Big data rather than having irrelevant data,
focuses on quality data storage.
● This further raises the question of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
Fault tolerance:
● Nowadays, newer technologies like cloud computing and big data always intend that whenever a failure occurs, the damage done should be within an acceptable threshold; that is, the whole task should not have to begin from scratch.
Scalability:
● Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
● It leads to various challenges like how to run and execute various jobs so that goal of
each workload can be achieved cost-effectively.
● It also requires dealing with the system failures in an efficient manner. This leads to a big
question again that what kinds of storage devices are to be used.
Q4] Illustrate the characteristics of social media which make it suitable for Big Data
Analytics
Ans :
Social media platforms are a significant source of big data due to their vast user base, high
engagement levels, and constant content generation. Here are the main characteristics of
social media that make it especially suitable for Big Data Analytics:
1. Volume
Social media generates a massive volume of data every second. Billions of users
interact daily, sharing text, images, videos, likes, comments, and reactions.
2. Velocity
Social media data is generated and shared in real-time or near-real-time, with posts,
shares, likes, and comments constantly being added.
This high speed of data generation allows organizations to perform real- time
analytics and gain insights immediately, which can be valuable for making timely
decisions.
3. Variety
Social media data comes in various forms, including text posts, images, videos, likes, comments, and reactions.
4. Veracity
Social media data can be noisy and contain inaccuracies, false
information, or irrelevant content (e.g., spam or bots).
Analytics must handle this "messiness" by filtering, validating, and verifying
data, which adds complexity to the analysis process.