BDAV Internal Notes
Unit 1
Unit 2
Q1] HDFS vs HBase
Q2] HDFS
Q3] Partitioner and Combiner with respect to MapReduce
Q4] Explain Hadoop Distributed File System (HDFS) Architecture in detail.
Q5] What is Rack awareness in Hadoop HDFS? How to achieve fault tolerance with the help
of Rack awareness? Explain with the help of a suitable example
Q6] Explain matrix multiplication mapper and reducer functions for MapReduce with the help of a suitable example.
Q7] Discuss Matrix Multiplication by MapReduce. Perform Matrix Multiplication with the 1-step method.
Matrix A: [[1, 2], [3, 4]]   Matrix B: [[5, 6], [2, 3]]
Q8]
Unit 3
Q1] Illustrate NoSQL data architecture patterns with suitable example.
Q2] What is NoSQL? Explain various NoSQL Data architecture patterns.
Q3] Master-slave vs peer-to-peer architecture in NoSQL
Q5] Explain HBase Architecture and Components of HBase Architecture.
Q6] List and Explain types of NoSQL with example of each
Unit 4
Q2] Explain the architecture of Apache Pig with the help of a diagram
Q3] Hive
Q4] PIG
Q5] What is the significance of Apache Pig in the Hadoop context? Explain the main components and the working of Apache Pig with the help of a diagram.
Q6] Joins in Hive
Q7] Explain the significance of Pig in the Hadoop Ecosystem. Explain the Data Model in Pig Latin with a suitable example.
Q8] Illustrate Hive. Explain Hive architecture with suitable example.
Q9]
Unit 5
Q1] What is Apache Kafka? Explain the benefits and need of Apache Kafka.
Q2] Features of Apache Spark
Q6] Explain Bulk Synchronous Processing (BSP) and graph processing with respect to Apache Spark.
Q7] Discuss the Apache Kafka fundamentals. Explain the Kafka Cluster Architecture with a suitable diagram.
Q8] DataSets vs DataFrames
Q9] Explain Apache Spark. What are the advantages of Apache Spark over MapReduce?
Q10] Illustrate Apache Kafka. Explain with a suitable example the streaming of real-time data with respect to Apache Kafka.
Unit 6
Q3] What are the tools and benefits of Data Visualization? Also explain the challenges of Big Data Visualization.
Q4] Illustrate the characteristics of social media which make it suitable for Big Data Analytics.
Q5] Explain data visualization. List out and illustrate the various Data Visualization Tools.
Q6]
Answers:
Unit 1
Q1] Yarn Daemon
Ans :
Hadoop's resource management layer is Apache YARN, which stands for "Yet Another Resource Negotiator". YARN was first introduced in Hadoop 2.x. YARN enables various data processing engines, including graph processing, interactive processing, stream processing, and batch processing, to operate on and handle data stored in HDFS (Hadoop Distributed File System). YARN offers job scheduling in addition to resource management.
Refer Q9.
1.3.1 Structured: Structured data is one sort of big data. Structured data is defined as data that can be processed, stored, and retrieved in a consistent way. It refers to information that is neatly organized and can be easily and smoothly stored in and accessed from a database using basic search engine methods. For example, the employee table in a corporate database will be formatted such that personnel details, work positions, salary, and so on are all presented in an ordered manner.
1.3.2 Unstructured: Unstructured data is data that does not have any defined shape or organization. This makes processing and analyzing unstructured data extremely complex and time-consuming. Email is an example of unstructured data. Big data may be classified into two types: structured and unstructured. Today's enterprises have a lot of data at their disposal, but they don't know how to extract value from it since the data is in its raw form or unstructured format. Example: the output returned by a 'Google Search'.
Velocity: Only fresh updates tend to be of interest to users; they often delete old messages and focus on new changes. Data transfer is now nearly real-time, and the update window has shrunk to fractions of a second. Big Data is represented by this high-velocity data.
Variety: Data may be stored in a variety of formats. For example, it may be saved in a database, Excel, CSV, Access, or even a plain text file. Sometimes the data is not even in the typical format that we expect; it may be in the form of video, SMS, PDF, or something else that we haven't considered. It is the organization's responsibility to organize it and make it relevant. It would be simple if we had data in the same format, but this is not always the case. The real world has data in a variety of forms, and that is the challenge we must conquer with Big Data. This variety of data represents Big Data.
b) Semi-structured data: This data type isn't as organised as structured data but has some
level of structure, like XML or JSON files.
c) Unstructured data: A category representing the majority of data generated today, it
includes videos, images, social media posts, emails, and much more.
Q4] Explain Hadoop Ecosystem with core components. Explain its architecture ?
Ans :
The Hadoop Ecosystem is a robust platform designed to handle big data challenges by offering
a wide range of services for data processing, storage, analysis, and maintenance. It integrates
several Apache projects along with various commercial tools. The key components of the
Hadoop ecosystem are:
Core Components:
HDFS (Hadoop Distributed File System): A distributed file system that stores large volumes of
data across multiple nodes.
YARN (Yet Another Resource Negotiator): Manages computing resources and schedules tasks.
MapReduce: A programming model used for processing large datasets in parallel.
Hadoop Common: Provides common utilities and libraries needed by other Hadoop
components.
Q5] What are the different frameworks that run under YARN? Discuss the various YARN
Daemons
Ans : Refer Q9
Q6] What is Big Data and its types? List out the difference between Traditional Data Vs.
Big Data with the help of example ?
Ans :
Big Data :
Data which is very large in size is called Big Data. Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in the petabyte range (10^15 bytes) is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.
Types : Refer Q2
Q9]Explain the Workflow using Resource Manager, Application Master & Node Master
of Hadoop YARN with a suitable diagram
Ans :
The Apache YARN framework consists of a master daemon known as the "Resource Manager", a slave daemon called the Node Manager (one per slave node), and an Application Master (one per application).
● Resource Manager (RM): It is YARN's master daemon. The global allocation of resources (CPU and memory) across all applications is managed by the RM. It resolves conflicts over system resources between competing applications. To study the YARN Resource Manager in depth, refer to the Resource Manager guide. The main components of the Resource Manager are:
✔ Application Manager
✔ Scheduler
Application Manager: It handles the running of Application Masters in the cluster, which means it is in charge of launching Application Masters as well as monitoring and restarting them on other nodes in the event of a failure.
Scheduler: The scheduler is in charge of allocating resources to the currently running applications. It is a pure scheduler, which means it performs no monitoring or tracking of the application and makes no guarantees about restarting failed tasks due to application or hardware failures.
The Application Master (AM) obtains containers from the RM's Scheduler and then contacts the corresponding Node Managers (NMs) to launch the application's tasks.
Q10] Explain the distributed storage system of Hadoop with the help of a neat diagram
Ans :
The Hadoop Distributed File System (HDFS) is Hadoop's distributed storage layer. It is built with a master/slave architecture. This architecture consists of a single Name Node acting as the master and multiple Data Nodes acting as slaves.
Name Node and Data Node are both capable of running on commodity machines. HDFS is
written in the Java programming language.
As a result, the Name Node and Data Node software can be run on any machine that
supports the Java programming language. The Hadoop
Distributed File System (HDFS) is a distributed file system based on the Google File System
(GFS) that is designed to run on commodity hardware. It is very
similar to existing distributed file systems.
However, there are significant differences between it and other distributed file systems. It is
highly fault-tolerant and is intended for use on low-cost hardware. It allows for high-
throughput access to application data and is appropriate for applications with large
datasets. In the HDFS cluster, there is only one master
server.
Name Node:
● The Name Node is the master server; it manages the file system namespace and regulates clients' access to files.
Data Node:
● When instructed by the Name Node, it creates, deletes, and replicates blocks.
Job Tracker:
● The Job Tracker's role is to accept MapReduce jobs from clients and process the data using the Name Node.
Task Tracker:
● It receives the task and code from the Job Tracker and applies it to the file. This procedure is also known as a Mapper.
MapReduce Layer:
When the client application submits a MapReduce job to the Job Tracker, the MapReduce processing begins. The Job Tracker forwards the request to the relevant Task Trackers. A Task Tracker occasionally fails or times out; in such a case, that portion of the job is rescheduled.
Unit 2
Q1] HDFS Vs HBase
Ans :
Q2] HDFS
Ans : Refer from unit 1
"Hadoop" → [1]
"World" → [1]
Step 3: Reduce Phase
The Reducer takes each key (word) and sums up the values associated with it (the 1s).
For "Hello," the reducer adds 1 + 1 and outputs ("Hello", 2).
Example:
arduino
Copy code
("Hello", 2)
("Hadoop", 1)
("World", 1)
Final Output:
The result is a word count for each unique word in the input:
Copy code
Hello 2
Hadoop 1
World 1
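As a hedged illustration of the same word count, below is a minimal Hadoop Streaming sketch in Python (the script names mapper.py and reducer.py and the submission command are assumptions, not part of the original notes). The mapper reads lines from standard input and emits word<TAB>1 pairs; Hadoop sorts the pairs by key, and the reducer sums the counts per word.

# mapper.py -- emits one (word, 1) pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums counts for each word (input arrives sorted by word)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The two scripts would typically be submitted with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py, with the exact jar path depending on the installation.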
Q3] Partitioner and Combiner with respect to MapReduce
Ans :
A Partitioner partitions the key-value pairs of the map's intermediate output. It separates the data using a user-defined condition that works like a hash function. The total number of partitions equals the number of Reducer tasks for the job.
A Combiner, also called a semi-reducer, is an optional step used to optimize the MapReduce process. It is used to reduce the output at the node level. At this point, the duplicate keys' outputs from the map phase can be merged into a single output. The combine step speeds up the shuffle step by improving job performance.
The Combiner class is used between the Map class and the Reduce class to reduce the amount of data transferred between Map and Reduce. In general, the output of the map task is large, and the amount of data transferred to the reduce task is high.
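A small Python sketch (a simulation only, not the Hadoop API; the number of reducers and the sample pairs are assumptions) of how a hash-style partitioner routes map output and how a combiner pre-aggregates counts on a node before the shuffle:

from collections import defaultdict

NUM_REDUCERS = 3  # assumed number of reducer tasks for the job

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash partitioner: the same key always maps to the same reducer.
    # (Python's built-in hash() is salted per process; Hadoop's HashPartitioner uses a stable hash.)
    return hash(key) % num_reducers

def combine(map_output):
    # Combiner ("semi-reducer"): sum counts per key locally before data leaves the node.
    combined = defaultdict(int)
    for key, value in map_output:
        combined[key] += value
    return list(combined.items())

map_output = [("hello", 1), ("hadoop", 1), ("hello", 1)]
for key, count in combine(map_output):
    print(f"{key}: combined count = {count}, goes to reducer {partition(key)}")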
Q4] Explain Hadoop Distributed File System (HDFS) Architecture in detail.
Ans :
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have massive datasets. HDFS relaxes some POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop core project.
An HDFS cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients. It maintains the file system tree
and metadata for all the files and directories in the tree.
This information is stored on the local disk in the form of two files: the namespace image and
the edit log.
Along with it there are a some DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. The Data Nodes are also called as worker
nodes. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode also knows the DataNodes on which all the blocks for a given file are
located, however, it does not store block locations, since this
information is reconstructed from DataNodes when the system starts. The NameNode
executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
It is important to make the NameNode fault-tolerant, as without the NameNode the file system cannot be used. To make it fault-tolerant, the first step is taking backups of the persistent state files and metadata present in it. Hadoop can be configured so that the NameNode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to the local disk as well as a remote NFS mount.
It is also possible to run a secondary NameNode, which, despite its name, does not act as a NameNode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary NameNode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the NameNode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the NameNode failing. However, the state of the secondary NameNode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the NameNode's metadata files that are on NFS to the secondary and run it as the new primary.
Q) HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge data,
the files are stored across multiple machines. These files are stored in redundant fashion to
rescue the system from possible data losses in case of failure. HDFS also makes applications
available to parallel processing.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of name node and data node help users to easily check the status
of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will be
a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be increased as per the need by changing the HDFS configuration.
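A small Python sketch (the file size is an assumed example) showing how a file is split into HDFS blocks and how many block copies the cluster keeps with the default replication factor of 3:

import math

FILE_SIZE_MB = 500        # assumed file size
BLOCK_SIZE_MB = 128       # dfs.blocksize (128 MB in Hadoop 2.x and later)
REPLICATION = 3           # default dfs.replication

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
print(f"{FILE_SIZE_MB} MB file -> {num_blocks} blocks")                 # 4 blocks
print(f"Block copies stored cluster-wide: {num_blocks * REPLICATION}")  # 12 copies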
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and
automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes
place near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
Q5] What is Rack awareness in Hadoop HDFS? How to achieve fault tolerance with the help
of Rack awareness? Explain with the help of a suitable example
Ans :
Most of us are familiar with the term rack. A rack is a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the Namenode chooses the closest Datanode to achieve maximum performance while performing read/write operations, which reduces the network traffic. A rack can have multiple data nodes storing the file blocks and their replicas. Hadoop itself is smart enough to automatically write the replicas of a particular file block to Datanodes in 2 different racks. If you want to store that block of data in more than 2 racks, you can do that; this feature is configurable and can be changed manually. Example of racks in a cluster:
As we all know, a large Hadoop cluster contains multiple racks, and in each rack there are lots of data nodes. Communication between the Datanodes on the same rack is much faster than communication between data nodes on 2 different racks. The Namenode has the ability to find the closest data node for faster performance; for that, the Namenode holds the IDs of all the racks present in the Hadoop cluster. This concept of choosing the closest data node for serving a request is Rack Awareness. Let's understand this with an example.
In the above image, we have 3 different racks in our Hadoop cluster, and each rack contains 4 Datanodes. Now suppose you have 3 file blocks (Block 1, Block 2, Block 3) that you want to put on these data nodes. As we all know, Hadoop has a feature of making replicas of the file blocks to provide high availability and fault tolerance. By default, the replication factor is 3, and Hadoop is smart enough to place the replicas of the blocks across racks in such a way that we can achieve good network bandwidth. For that, Hadoop has some rack awareness policies:
There should not be more than 1 replica on the same Datanode.
More than 2 replicas of a single block are not allowed on the same rack.
The number of racks used inside a Hadoop cluster must be smaller than the number of replicas.
Now let's continue with our example. In the diagram, we can easily see that we have Block 1 on the first Datanode of Rack 1 and 2 replicas of Block 1 on Datanodes 5 and 6 of another rack, which sums up to 3 copies. Similarly, we also have a replica distribution of the 2 other blocks across different racks, following the above policies. Benefits of implementing Rack Awareness in our Hadoop cluster:
With the rack awareness policies, we store the data in different racks, so there is no way to lose our data.
Rack awareness helps to maximize the network bandwidth because data block transfers stay within the racks where possible.
It also improves the cluster performance and provides high data availability.
HDFS Rack Awareness Example:
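The example figure itself is not reproduced in these notes, so below is a hedged Python sketch (the rack layout and node names are assumptions) that chooses replica locations following the default policy described above: the first replica on the writing node and the remaining two on different nodes of one other rack.

import random

# Assumed cluster layout: 3 racks with 4 Datanodes each.
racks = {
    "rack1": ["dn1", "dn2", "dn3", "dn4"],
    "rack2": ["dn5", "dn6", "dn7", "dn8"],
    "rack3": ["dn9", "dn10", "dn11", "dn12"],
}

def place_replicas(local_rack, local_node, replication=3):
    # Replica 1: the node writing the block.
    placement = [(local_rack, local_node)]
    # Replicas 2 and 3: two different nodes on one remote rack, so that
    # no node holds more than 1 replica and no rack holds more than 2.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    for node in random.sample(racks[remote_rack], replication - 1):
        placement.append((remote_rack, node))
    return placement

print(place_replicas("rack1", "dn1"))
# e.g. [('rack1', 'dn1'), ('rack2', 'dn5'), ('rack2', 'dn6')]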
Q6] Explain matrix multiplication mapper and reducer functions for MapReduce with the help of a suitable example.
Q7] Discuss Matrix Multiplication by MapReduce. Perform Matrix Multiplication with the 1-step method.
Matrix A: [[1, 2], [3, 4]]   Matrix B: [[5, 6], [2, 3]]
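No worked answer is recorded above, so here is a hedged Python sketch of the 1-step (single MapReduce job) method for the matrices listed, under the standard scheme: the mapper keys every A element to all output cells (i, k) in its row and every B element to all output cells in its column, and the reducer multiplies the matching pairs and sums them.

from collections import defaultdict

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # Matrix A (2x2)
B = {(0, 0): 5, (0, 1): 6, (1, 0): 2, (1, 1): 3}   # Matrix B (2x2)
m, n, p = 2, 2, 2                                  # A is m x n, B is n x p

def mapper():
    # Key = output cell (i, k); value = (matrix tag, join index j, element).
    for (i, j), a in A.items():
        for k in range(p):
            yield (i, k), ("A", j, a)
    for (j, k), b in B.items():
        for i in range(m):
            yield (i, k), ("B", j, b)

def reducer(grouped):
    result = {}
    for (i, k), values in grouped.items():
        a_vals = {j: v for tag, j, v in values if tag == "A"}
        b_vals = {j: v for tag, j, v in values if tag == "B"}
        result[(i, k)] = sum(a_vals[j] * b_vals[j] for j in range(n))
    return result

grouped = defaultdict(list)           # "shuffle": group mapper output by key
for key, value in mapper():
    grouped[key].append(value)

print(reducer(grouped))
# {(0, 0): 9, (0, 1): 12, (1, 0): 23, (1, 1): 30}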
Q8]
Unit 3
Q1] Illustrate NoSQL data architecture patterns with suitable example.
Ans :
NoSQL databases come in various types, each with its unique features and applications:
1. Key-Value Store
o Description: A simple storage system that uses unique keys to access
corresponding values. Data is organized as key-value pairs, where the key is a
constant identifier, and the value can be any type of data.
o Example: In a mobile phone database, a mobile phone could be the key, while its
color, model, and brand could be the values.
2. Document Databases
o Description: Stores data in collections of documents, typically in formats like
JSON or XML. Documents are treated as whole entities, maintaining their
hierarchical structure.
o Example: A sample document might look like this:
{
"Name": "ABC",
"Address": "Indore",
"Email": "abc@xyz.com",
"Contact": "98765"
}
3. Graph Database
o Description: Based on graph theory, these databases store data in nodes and
edges. They are designed to represent and work with relationships between data
points. Each node and edge has a unique identifier.
o Example: In a social network, users could be nodes, and their friendships could
be represented as edges connecting those nodes.
4. Column Store
Description: A column store organizes data in a sparse matrix format, using columns as keys.
Each storage block holds data from a single column, which saves space by not storing null
values. A column family is a group of rows, with each row having a unique identifier.
Example: In a column family for user data, columns might represent attributes like name,
email, and address, while each row corresponds to a different user.
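A compact Python analogy (purely illustrative data, not any particular database's API) of the four patterns above:

# Key-value store: a value retrieved by its unique key.
key_value = {"phone:123": {"color": "black", "model": "X1", "brand": "ABC"}}

# Document store: the whole JSON-like document is stored and queried as one unit.
document = {"Name": "ABC", "Address": "Indore", "Email": "abc@xyz.com", "Contact": "98765"}

# Column store / column family: rows identified by a row key, each with sparse columns.
column_family = {
    "user1": {"name": "Raja", "email": "raja@gmail.com"},
    "user2": {"name": "Mohammad", "address": "Indore"},   # different columns per row
}

# Graph store: nodes plus edges that model relationships (here, friendships).
nodes = {"u1": "Raja", "u2": "Mohammad"}
edges = [("u1", "u2", "FRIENDS_WITH")]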
Q2] What is NoSQL? Explain various NoSQL Data architecture patterns
Ans :
NoSQL is actually an acronym that stands for 'Not Only SQL'. Today NoSQL is used as an umbrella term for all the data stores and databases that do not follow traditional, well-established RDBMS principles. NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a primary focus on performance, reliability and agility. NoSQL databases are widely used in big data and other real-time web applications.
In contrast, Cassandra is a wide-column store database built for high availability and scalability.
It uses a flexible schema with rows and columns, which makes it easy to organize and manage
different types of data. A key advantage of Cassandra is that it continues to work even if some
servers fail, ensuring uninterrupted access to data. It is optimized for fast data writing, making it
ideal for applications that require quick data input, like IoT data storage and social media
platforms. Its distributed design also supports easy scaling, making it a strong choice for
handling large-scale applications
Refer Q1 also
Q5] Explain HBase Architecture and Components of HBase Architecture.
Ans :
Region Server: In HBase, a Region Server manages regions, which are sorted collections of
rows. It handles both read and write operations for these regions. Region Servers, also
known as Slave Nodes, are responsible for storing data and function within the cluster.
HBase Master (HMaster): The HBase Master controls a set of Region Servers that reside
on Data Nodes. It assigns regions to these servers during startup and can reassign them
for load balancing or recovery purposes. The HMaster coordinates the HBase cluster and
performs administrative duties. With the help of Zookeeper, it monitors the health of
Region Servers and facilitates recovery when a server fails. Additionally, the HMaster
manages table operations like creating and deleting tables.
Zookeeper: Zookeeper serves as a coordinator that helps the HMaster manage HBase's
distributed environment. It regularly receives heartbeat signals from both Region Servers
and the HMaster, ensuring they are operational. If Zookeeper fails to get these signals, it
sends out failure notifications and starts recovery processes. When a Region Server goes
down, Zookeeper informs the HMaster, which then reallocates the regions of the failed
server to other active Region Servers.
Q6] List and Explain types of NoSQL with example of each
Ans : Refer Q2
Q7] CAP theorem
Ans :
● Consistency: Making available the same single updated readable version of the data to all
the clients. Here, Consistency is concerned with multiple clients reading the same data from
replicated partitions and getting consistent results.
● High Availability: System remains functional even if some nodes fail. High Availability
means that the system is designed and implemented in such a way that it continues its read
and write operation even if nodes in a cluster crash or some hardware or software parts are
down due to upgrades. Internal
communication failures between replicated data should not prevent updates.
● Partition tolerance: Partition tolerance is the ability of the system to continue functioning in the presence of network partitions. This situation arises when network nodes cannot connect to each other (temporarily or permanently). The system remains operational even on a system split or communication malfunction. A single node failure should not cause the entire system to collapse.
Unit 4
Q1] Load and Store Functions of Apache Pig
Ans :
Apache Pig, in general, runs on top of Hadoop. It is a tool for analysing huge datasets that reside in the Hadoop File System. To use Apache Pig to analyse data, we must first load the data into Apache Pig. This section explains how to load data from HDFS into Apache Pig.
LOAD operator: The LOAD operator of Pig Latin can be used to load data into Apache Pig
from a file system (HDFS/ Local). The "=" operator divides the load statement into two
sections. On the left, we must specify the name of the relation in which we want to store
the data, and on the right, we must specify how we will store the data. The Load operator's
syntax is seen below.
Relation_name = LOAD 'Input file path' USING function as schema;
● relation_name: We have to mention the relation in which we want to store the data.
● Input file path: We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
● Function: We have to choose a function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
● Schema: We have to define the schema of the data. We can define the required
schema as follows:
(column1 : data type, column2 : data type, column3 : data type);
Store Operator:
You can store the loaded data in the file system using the STORE operator. This section explains how to store data in Apache Pig using the Store operator.
Syntax:
STORE Relation_name INTO ' required_directory_path ' [USING function];
In Loading and Storing Data, we have seen how to load data from external storage for processing in Pig. Storing the results is straightforward, too. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character: grunt> STORE A INTO 'out' USING PigStorage(':');
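As a hedged end-to-end illustration (the file names, fields, and local-mode invocation below are assumptions, not from the notes), this Python snippet writes a tiny tab-separated input file and a Pig Latin script that LOADs it, FILTERs it, and STOREs the result; the script would then be run with the Pig CLI in local mode.

# Sample input: tab-separated student records.
with open("student_data.txt", "w") as f:
    f.write("1\tRaja\t30\n2\tMohammad\t45\n3\tRani\t22\n")

# A Pig Latin script using the LOAD and STORE operators described above.
pig_script = """
students = LOAD 'student_data.txt' USING PigStorage('\\t')
           AS (id:int, name:chararray, age:int);
adults   = FILTER students BY age > 25;
STORE adults INTO 'adults_out' USING PigStorage(':');
"""
with open("filter_students.pig", "w") as f:
    f.write(pig_script)

# The script would then be executed in local mode with:
#   pig -x local filter_students.pig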
Q2] Explain the architecture of Apache Pig with the help of a diagram
Ans :
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
1. Parser: The parser handles any Pig scripts or instructions typed in the Grunt shell. The parser runs checks on the scripts, such as syntax checking, type checking, and a variety of other things. These checks produce a result in the form of a Directed Acyclic Graph (DAG), which contains the Pig Latin statements and logical operators. The logical operators of the script are nodes, and the data flows are edges, therefore the DAG has nodes that are connected by edges.
2. Optimizer: After parsing and DAG generation, the DAG is sent to the logical optimizer,
which performs logical optimizations such as projection and pushdown. By removing
extraneous columns or data and pruning the loader to only load the relevant column,
projection and pushdown increase query
performance.
3. Compiler: The compiler compiles the optimised logical plan provided above and
generates a series of Map-Reduce tasks. Essentially, the compiler will
transform pig jobs into MapReduce jobs and exploit optimization opportunities in scripts,
allowing the programmer to avoid manually tuning the software.
Pig's compiler can reorder the execution sequence to enhance performance if
the execution plan remains the same as the original programme because it is a data-flow
language.
4. Execution Engine: Finally, all of the MapReduce jobs generated by the compiler are sorted and submitted to Hadoop. Hadoop then executes the MapReduce jobs to produce the desired output.
5. Execution Modes: Pig has two execution modes, which are determined by the location of the script and the availability of data.
● Local Mode: For limited data sets, local mode is the best option. Pig runs on a single JVM because all files are installed and run on localhost, preventing parallel mapper execution. Pig will also look into the local file system while importing data.
● MapReduce Mode (MR Mode): In MapReduce mode, the programmer needs access to and setup of a Hadoop cluster and an HDFS installation. In this mode, the data on which processing is done exists in HDFS. After execution of a Pig script in MR mode, the Pig Latin statements are converted into MapReduce jobs in the back-end to perform the operations on the data.
Apache Pig is a high-level data processing language used to analyze large data sets in Hadoop. The language used in Pig is called Pig Latin, which provides various data types and operators to perform data operations.
Pig Latin: The language allows users to write scripts using rich data types (tuples, bags, maps) and operators (for loading, transforming, filtering, and storing data), making it accessible for data analysts.
Execution Methods: To perform data processing tasks, programmers write scripts in Pig Latin. These scripts are executed using different methods, such as the Grunt shell or User Defined Functions (UDFs).
Architecture: The Pig framework parses the Pig Latin scripts into a logical plan, optimizes it, and translates it into a series of MapReduce jobs for execution on a Hadoop cluster.
Execution Process:
1. Parser:
o The parser checks the Pig script for errors, such as syntax or type issues.
o It creates a Directed Acyclic Graph (DAG) that represents the logical structure of the script, with nodes representing operations and edges representing data flow.
2. Optimizer:
o The optimizer performs logical optimizations, such as projection and pushdown, on the DAG.
3. Compiler:
o The compiler converts the optimized logical plan into a series of MapReduce jobs.
o The compiler can rearrange tasks for better performance, while maintaining the original execution plan.
4. Execution Engine:
o The generated MapReduce jobs are sent to Hadoop for execution, which processes the data and produces the desired output.
Execution Modes:
Pig offers two execution modes based on data location and availability:
1. Local Mode:
o Runs on a single JVM and processes files from the local file system without parallel execution.
2. MapReduce Mode:
o Requires a Hadoop cluster setup. The scripts are converted into MapReduce jobs that operate on data in HDFS.
Q3] Hive
Ans :
The term 'Big Data' refers to massive datasets with a high volume, high velocity, and a
diverse mix of data that is growing by the day. Processing Big Data using typical data
management solutions is tough. As a result, the Apache Software Foundation developed
Hadoop, a framework for managing and processing
large amounts of data. Hive is a Hadoop data warehouse infrastructure solution that allows
you to process structured data. It is built on top of Hadoop to
summarise Big Data and facilitate querying and analysis. Initially developed by Facebook,
Hive was eventually taken up by the Apache Software Foundation and developed as an
open-source project under the name Apache Hive. It is utilised by a variety of businesses.
1. What is Hive?: Apache Hive is a data warehouse tool built on top of Hadoop. It helps in
managing and querying large volumes of data stored in Hadoop.
2. User-Friendly: Hive uses a SQL-like language called HiveQL, making it easier for users
familiar with SQL to write queries and perform data analysis.
3. Designed for Big Data: Hive is specifically designed to handle large datasets efficiently,
allowing businesses to analyze structured data quickly.
4. Open Source: Initially created by Facebook, Hive is now an open-source project under the
Apache Software Foundation, which means it has a strong community for support and
development.
5. Real-World Applications: Many businesses use Hive for data analysis, reporting, and
business intelligence, helping them gain insights from their Big Data.
In essence, Hive simplifies the process of working with Big Data in a way that is accessible to users
without requiring in-depth technical skills.
● HIVE clients
● HIVE Services
● Storage and computing
HIVE Clients:
Hive offers a variety of drivers for interacting with various types of applications. It provides a Thrift client for communication in Thrift-based applications.
JDBC Drivers are available for Java-based applications. ODBC drivers are available for any
sort of application. In the Hive services, these Clients and Drivers communicate with the
Hive server.
HIVE Services:
Hive Services can be used by clients to interact with Hive. If a client wishes to do any
query-related actions in Hive, it must use Hive Services to do so. The command line
interface (CLI) is used by Hive to perform DDL (Data Definition Language) operations.
Hive services like Meta store, File system, and Job Client communicate with Hive storage
and carry out the following tasks.
● The Hive "Meta storage database" stores the metadata of tables generated in Hive.
● Query results and data loaded into tables will be stored on HDFS in a Hadoop cluster.
Apache Hive is a data warehousing solution built on top of the Hadoop ecosystem, allowing users
to process and analyze large datasets using a SQL-like query language. Understanding its key
components is essential for leveraging its functionalities effectively. The main components of
Hive include Hive Clients, Hive Services, and Storage and Computing systems.
1. Hive Clients
Hive provides various drivers for applications to interact with it, facilitating connectivity:
Thrift Client: This client allows communication with Hive for applications built using the
Thrift framework. It’s particularly useful for programming languages other than Java.
JDBC Driver: Designed for Java applications, this driver enables them to execute Hive
queries as they would in traditional relational databases, maintaining familiarity for Java
developers.
ODBC Driver: This driver offers a standardized way for numerous applications and
programming languages to connect with Hive, making it ideal for Business Intelligence (BI)
tools and applications that support ODBC.
These clients enable users to send queries to Hive, retrieve results, and interact seamlessly with
the stored data.
2. Hive Services
Hive Services serve as the bridge between clients and the Hive server, enabling query execution
and command operations:
Command Line Interface (CLI): The CLI allows users to perform Data Definition Language
(DDL) and Data Manipulation Language (DML) operations directly from the command line,
such as creating, altering, and dropping tables.
Main Driver: This core component manages the communication between Hive Clients
(JDBC, ODBC, Thrift) and the Hive server. It processes requests from clients and routes
them to the appropriate backend systems, facilitating smooth data flow.
Meta Store Interaction: The Main Driver also interacts with the Meta Store, which stores
information about database schemas, tables, and other metadata. This ensures that
queries have access to accurate data structure and definitions.
3. Storage and Computing
Hive utilizes a storage layer for managing data and a computing layer for processing queries:
o Meta Store: Stores metadata information about tables created in Hive, such as
schemas, locations, and other attributes.
o HDFS (Hadoop Distributed File System): The actual query results and the data
loaded into Hive tables are stored on HDFS, which forms the backbone of the
Hadoop ecosystem.
o Job Client: Manages the execution of jobs and serves as the communication link
between Hive and the underlying processing framework (like MapReduce).
Q4] PIG
Ans : Refer Q2
Q5] What is the significance of Apache Pig in the Hadoop context? Explain the main components and the working of Apache Pig with the help of a diagram.
Ans : Refer Q2 and Q7
Q6] Joins in Hive
Ans :
● A JOIN combines records from two tables based on a join condition; more than two tables can be joined in the same query.
● Basically, to offer more control over rows of the ON clause for which there is no match, LEFT, RIGHT, and FULL OUTER joins exist.
Types of Joins:
Left Outer Join: On defining HiveQL Left Outer Join, even if there are no matches in the right
table it returns all the rows from the left table. To be more
specific, even if the ON clause matches 0 (zero) records in the right table, then also this Hive
JOIN still returns a row in the result. Although, it returns with NULL in each column from the
right table. In addition, it returns all the values from the left table. Also, the matched values
from the right table, or NULL in case of no matching JOIN predicate. However, the below
query shows LEFT OUTER JOIN between CUSTOMER as well as ORDER tables:
Right Outer Join: Basically, even if there are no matches in the left table, HiveQL Right Outer Join
returns all the rows from the right table. To be more specific,
even if the ON clause matches 0 (zero) records in the left table, then also this Hive JOIN still
returns a row in the result. Although, it returns with NULL in
each column from the left table. In addition, it returns all the values from the right table.
Also, the matched values from the left table or NULL in case of no matching join
predicate.
Full Outer Join: The primary goal of this HiveQL Full outer Join is to merge the records from
both the left and right outside tables in order to satisfy the Hive JOIN requirement. In
addition, this connected table either contains all of the records from both tables or fills in
NULL values for any missing matches on
either side.
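The query referred to above is not reproduced in the notes, so here is a hedged sketch (assuming a HiveServer2 instance on localhost:10000 and illustrative customers/orders tables) that runs a LEFT OUTER JOIN from Python using the PyHive client:

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
# Customers with no matching order still appear, with NULL order columns.
cursor.execute("""
    SELECT c.id, c.name, o.amount
    FROM customers c
    LEFT OUTER JOIN orders o ON (c.id = o.customer_id)
""")
for row in cursor.fetchall():
    print(row)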
Q7] Explain the significance of Pig in Hadoop Ecosystem. Explain the Data Model in Pig
Latin with suitable example.
Ans :
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes
such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s
data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − 'raja' or '30'
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non- unique) is known
as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’.
It is similar to a table in RDBMS, but
unlike a table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag. Example − {Raja,
30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
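A small Python analogy (illustrative only, not Pig syntax) mapping the Pig Latin data model onto native Python types:

# Atom / field: a single value.
atom = "raja"

# Tuple: an ordered set of fields, like a row.
row = ("Raja", 30)

# Bag: an unordered collection of tuples; tuples may have different numbers of fields.
bag = [("Raja", 30), ("Mohammad", 45)]

# Map: key-value pairs with string (chararray) keys.
record_map = {"name": "Raja", "age": 30}

# Relation: a bag of tuples, the outermost structure that Pig operators work on.
relation = bag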
Significance of pig :
o Nested data types - The Pig provides a useful concept of nested data types like
tuple, bag, and map.
Q8] Illustrate Hive. Explain Hive architecture with suitable example.
Ans : Refer Q3
Unit 5
Q1] What is Apache Kafka? Explain the benefits and need of Apache Kafka.
Ans :
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits :
● Performance: Kafka has high throughput for both publishing and subscribing of messages. It maintains stable performance even when many TB of messages are stored.
Kafka is very fast and guarantees zero downtime and zero data loss.
Need
Kafka is a unified platform for handling all the real-time data feeds. Kafka supports low
latency message delivery and gives guarantee for fault tolerance in the presence of
machine failures. It has the ability to handle a large number of diverse consumers. Kafka is
very fast, performs 2 million writes/sec. Kafka persists all data to the disk, which essentially
means that all the writes go to the page cache of the OS (RAM). This makes it very efficient
to transfer data
from page cache to a network socket.
Q2] Features of Apache Spark
Ans :
As against a common belief, Spark is not a modified version of Hadoop and is not, really, dependent on Hadoop because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has
its own cluster management computation, it uses Hadoop for storage purpose only
● Speed:
Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory, and
10 times faster when running on disk. This is possible by reducing number of read/write
operations to disk. It stores the intermediate processing data in memory.
● Advanced Analytics:
Spark not only supports 'map' and 'reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Q3] Apache Kafka
Ans : Refer Q1
Q4] What is RDD? How is data partitioned in an RDD?
Ans :
Resilient Distributed Datasets: Resilient Distributed Datasets (RDD) is a
fundamental data structure of Spark. It is an immutable distributed collection of objects.
Each dataset in RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
Spark makes use of the concept of RDD to achieve faster and efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.
Data partitioning in RDD is a crucial factor in optimizing performance, as it allows Spark to
process data in parallel by dividing it into chunks called
partitions. Here's how partitioning works in RDDs:
1. Default Partitioning:
o By default, Spark decides the number of partitions based on the input data
source. For instance, if reading from an HDFS file, Spark uses the file’s block
size to determine partitions.
o Typically, each partition processes a subset of the data, enabling
parallelism. The number of partitions is usually the number of blocks in
the source file but can be adjusted.
2. Custom Partitioning:
o Users can define the number of partitions explicitly when creating an RDD, e.g., sc.textFile("path", minPartitions).
o Spark also provides partitioning functions such as repartition() and coalesce() to control data distribution. repartition() increases or decreases the number of partitions, while coalesce() only reduces the number of partitions, improving efficiency by minimizing data shuffling.
3. Hash Partitioning:
o For certain operations like joins, grouping, and aggregations, Spark uses hash
partitioning. Hash partitioning is achieved by hashing keys and placing them
into partitions.
o For example, if there are three partitions, keys are distributed based on
their hash value mod 3, ensuring similar keys are
grouped in the same partition for operations that benefit from it.
4. Range Partitioning:
o In range partitioning, data is divided based on ranges of values, which is
useful for ordered data. For example, numeric or sorted data can be
divided into specific ranges to optimize processing.
Example of Data Partitioning in an RDD
Consider a scenario where you have an RDD created from a large dataset containing
customer transactions. If you load this data into Spark with five
partitions, Spark divides the data into five chunks, each assigned to a separate node in the
cluster. Here’s how partitioning optimizes data processing:
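The code block that belonged here was not preserved, so below is a hedged PySpark sketch (a local master and a small in-memory sample in place of the transactions file are assumptions) showing an RDD created with five partitions, a key-based shuffle, and coalescing:

from pyspark import SparkContext

sc = SparkContext("local[*]", "partitioning-demo")

# Illustrative stand-in for the customer-transaction dataset.
transactions = [("cust1", 250.0), ("cust2", 99.5), ("cust1", 10.0), ("cust3", 42.0)]

rdd = sc.parallelize(transactions, 5)          # explicitly request 5 partitions
print(rdd.getNumPartitions())                  # 5

totals = rdd.reduceByKey(lambda a, b: a + b)   # hash-partitioned by key during the shuffle
print(totals.collect())

fewer = rdd.coalesce(2)                        # shrink partitions without a full shuffle
print(fewer.getNumPartitions())                # 2

sc.stop()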
Each partition can process the transactions independently and in parallel, leading to faster
and more efficient computation.
Advantages of Partitioning in RDDs
Q5] What are the advantages of Apache Spark over MapReduce?
Ans :
Real-time monitoring: Spark supports real-time data analysis, which can help you
respond quickly to new trends or issues.
Flexibility: Spark's speed and flexibility are useful when evaluating multiple
data models or strategies.
Scalability: Spark can expand systems and build scalable solutions quickly,
efficiently, and cost-effectively.
Data processing: Spark is a popular engine for large-scale data
processing.
Batch processing: Spark can process both batch and real-time
applications efficiently.
Low latency: Spark offers low latency due to reduced I/O operations
Q6] Explain Bulk Synchronous Processing (BSP) and graph processing with respect to Apache Spark.
Ans :
BSP in the Context of Apache Spark
Bulk Synchronous Parallel (BSP) is a model of parallel computation that proceeds in supersteps: each node computes on its local data independently, the nodes then exchange messages, and a barrier synchronization completes the superstep before the next one begins.
Apache Spark supports a similar paradigm, although it is not explicitly based on the BSP model. Spark's RDD transformations (such as map, reduce, and join) and actions (like collect or count) work on the concept of distributed computation, where each node performs operations independently and then synchronizes data when necessary.
Spark’s approach to processing large datasets aligns with BSP principles, especially when
performing distributed computations. While Spark doesn’t explicitly use the BSP model, it
follows similar steps to synchronize tasks and ensure data consistency across a distributed
system.
Parallel Computation: Each node in Spark processes a partition of the data
independently.
Communication: Tasks communicate across nodes when necessary, such as in shuffling data for operations like join, groupBy, or reduceByKey.
Synchronization: After each stage of computation, Spark ensures all tasks are
completed before moving to the next stage, similar to the synchronization
phase of BSP.
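A hedged PySpark sketch (local master and toy edge data assumed) of this BSP-like pattern: each partition is processed independently in the map step, data is exchanged during the shuffle of reduceByKey, and the stage boundary acts as the synchronization point before results are collected:

from pyspark import SparkContext

sc = SparkContext("local[*]", "bsp-style-demo")

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]   # toy graph as an edge list

out_degrees = (sc.parallelize(edges, 2)
                 .map(lambda edge: (edge[0], 1))     # parallel computation per partition
                 .reduceByKey(lambda a, b: a + b))   # communication + barrier at the shuffle

print(sorted(out_degrees.collect()))   # [('A', 2), ('B', 1), ('C', 1)]
sc.stop()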
Graph processing is the task of working with graph-structured data (nodes and edges), and
Apache Spark provides a powerful framework for handling graph data through GraphX, its
distributed graph processing library.
GraphX integrates graph processing with Spark’s RDD-based processing,
enabling the use of Spark’s in-memory capabilities and distributed computation for graph
algorithms.
Key Features of GraphX:
1. Graph Representation:
o In GraphX, a graph is represented as a combination of two RDDs: a vertex RDD (vertices and their properties) and an edge RDD (edges and their properties).
2. Property Manipulation:
o GraphX allows the manipulation of vertex and edge properties, which can be updated as part of graph processing. These properties can represent anything from user information (on vertices) to weights (on edges).
Q7] Discuss the Apache Kafka fundamentals. Explain the Kafka Cluster Architecture with
suitable diagram
Ans :
8.9.1 Topics:
A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal sizes.
8.9.2 Partition:
Topics may have many partitions, so it can handle an arbitrary amount of data.
8.9.5 Brokers:
● Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume, if there are N partitions in a topic and N brokers, each broker will have one partition.
● Assume if there are N partitions in a topic and more than N brokers (n + m), the first N brokers will have one partition each and the next M brokers will not have any partition for that particular topic.
● Assume if there are N partitions in a topic and fewer than N brokers (n - m), each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.
8.9.7 Producers:
Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message will be appended to a partition. Producers can also send messages to a partition of their choice.
8.9.8 Consumers:
Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
8.9.9 Leader:
Leader is the node responsible for all reads and writes for the given partition. Every
partition has one server acting as a leader.
8.9.10 Follower:
A node which follows the leader's instructions is called a follower. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer: it pulls messages and updates its own data store.
Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. One Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle TBs of messages without performance impact. Kafka broker leader election can be done by ZooKeeper.
ZooKeeper: ZooKeeper is used for managing and coordinating the Kafka brokers. The ZooKeeper service mainly notifies producers and consumers about the presence of a new broker in the Kafka system or about a broker failure.
Producers: Producers push data to brokers. When a new broker is started, all the producers search for it and automatically send messages to that new broker. The Kafka producer doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle.
Consumers:
Since Kafka brokers are stateless, the consumer has to keep track of how many messages have been consumed by using the partition offset. If the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages.
consumer has consumed all prior messages. The consumer issues an asynchronous pull
request to the broker to have a buffer of bytes ready to
consume. The consumers can rewind or skip to any point in a partition simply by supplying
an offset value. Consumer offset value is notified by ZooKeeper.
As of now, we discussed the core concepts of Kafka. Let us now throw some light on the
workflow of Kafka.
Kafka is simply a collection of topics split into one or more partitions. A Kafka partition is a linearly ordered sequence of messages, where each message is identified by its index (called the offset). All the data in a Kafka cluster is the disjoint union of these partitions. Incoming messages are written at the end of a partition and messages are sequentially read by consumers. Durability is provided by replicating messages to different brokers.
Kafka provides both pub-sub and queue-based messaging system in a fast, reliable,
persisted, fault-tolerance and zero downtime manner. In both cases, producers simply
send the message to a topic and consumer can choose any
one type of messaging system depending on their need. Let us follow the steps in the next
section to understand how the consumer can choose the messaging system of their choice.
Q8] DataSets vs DataFrames
Ans :
Q9] Explain Apache Spark. What are the advantages of Apache Spark over MapReduce?
Ans : Refer Q2 and Q5
Q10] Illustrate the Apache Kafka. Explain with suitable example the streaming of real
time data with respect to Apache Kafka.
Ans :
Refer Q1 for the Kafka fundamentals. For streaming real-time data, consider a scenario where a taxi service wants to track the real-time location of all its taxis.
Here’s how Apache Kafka would be used:
1. Topic:
o A topic, e.g. taxi_location, is created on the Kafka cluster to carry the taxis' location events.
2. Producer:
o Each taxi is equipped with a GPS device that periodically sends location
data (like latitude and longitude) to a Kafka producer.
o The Kafka producer, integrated with the taxi’s GPS, publishes this location
data to the taxi_location topic in real time.
o Data might look like this (illustrative fields): { "taxi_id": "T123", "latitude": 18.52, "longitude": 73.85, "timestamp": "2024-01-01T10:00:00Z" }
3. Broker:
o The Kafka broker receives the data from the producer and stores it in the
taxi_location topic.
o If there are multiple brokers, data is replicated and partitioned for scalability
and reliability.
4. Consumer:
o A consumer application, like a real-time map service, reads from the
taxi_location topic to display each taxi’s current location on the map.
5. Streaming Process:
o As taxis continuously publish location data, the consumer applications
receive real-time updates. This allows the map service to show current taxi
locations to customers or dispatchers.
o The data analytics service could also use the data stream to generate
real-time insights and reports on busy areas or peak hours for better
resource allocation.
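A hedged sketch of this pipeline with the kafka-python client (the broker address, topic name, and message fields are assumptions):

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: a taxi's GPS unit publishes its position to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("taxi_location", {"taxi_id": "T123", "latitude": 18.52, "longitude": 73.85})
producer.flush()

# Consumer side: the map service reads the stream and updates taxi positions.
consumer = KafkaConsumer(
    "taxi_location",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    print(f"Taxi {event['taxi_id']} is at ({event['latitude']}, {event['longitude']})")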
Key Benefits of Using Apache Kafka in Real-Time Streaming
Scalability: Kafka can handle large volumes of data by adding more brokers
and partitions as needed.
Fault Tolerance: Kafka’s data replication ensures high availability.
Low Latency: Kafka is optimized for low-latency data streaming, which is crucial for
real-time applications.
Data Retention: Kafka can retain data for a specified time, allowing
consumers to replay events if necessary.
Unit 6
D3, short for Data-Driven Documents, is a powerful JavaScript library designed for
manipulating documents based on data. It’s one of the most effective
frameworks for data visualization, enabling developers to create dynamic, interactive
visualizations in the browser using HTML, CSS, and SVG.
Data visualization refers to representing data in graphical or pictorial forms, making even
complex datasets easy to understand. Visualizations make it
easier to spot patterns and conduct comparative analysis, which aids decision- making with
minimal effort. Frameworks like D3.js excel in making these visual representations.
D3.js stands out for several reasons that make it the preferred choice for data visualization:
Web Standards Integration:
D3 leverages modern web standards like HTML, CSS, and SVG, allowing the creation of
powerful visualizations that work seamlessly in browsers.
Data-Driven Approach:
D3 enables users to pull data from different web nodes or servers, analyze it, and render
visualizations based on the data. It also supports processing static datasets.
Versatile Graphics Creation:
D3 provides tools ranging from simple tables to advanced charts, like pie charts, bar
graphs, and even complex GIS mapping. It supports customizable visualizations, making it
adaptable to various needs.
Support for Large Datasets:
D3 efficiently handles large datasets and allows users to reuse predefined libraries, making
the development process smoother and more efficient.
Transitions and Animations:
D3 simplifies the creation of animations and transitions, handling the logic implicitly.
Developers don’t need to manually manage transitions, and the library ensures
responsive and smooth animation rendering.
DOM Manipulation:
One of D3’s standout features is its ability to manipulate the Document Object Model (DOM)
dynamically, making it highly flexible for managing the properties of its handlers.
Refer Big Data from unit 1
Visualizing Big Data: Today, organizations generate and collect data each minute. The huge amount of generated data, known as Big Data, brings new challenges to visualization because of the speed, size and diversity of information that must be considered. The volume, variety and velocity of such data require an organization to leave its technological comfort zone to derive intelligence for effective decisions. New and more sophisticated visualization techniques, based on the core fundamentals of data analysis, take into account not only the cardinality, but also the structure and the origin of such data.
Q3] What are tools and benefits of Data Visualization also explain challenges of Big Data
Visualization
Ans :
Data visualization is actually a set of data points and information that are represented graphically to make it easy and quick for users to understand. Data visualization is good if it has a clear meaning and purpose, and is very easy to interpret without requiring context. Tools of data visualization provide an accessible way to see and understand trends, outliers, and patterns in data by using visual effects or elements such as charts, graphs, and maps.
Common visualization techniques (by SciForce):
Line Plot: The simplest technique, a line plot is used to plot the relationship or dependence
of one variable on another. To plot the relationship between the two variables, we can
simply call the plot function.
Bar Chart: Bar charts are used for comparing the quantities of different categories or groups. Values of a category are represented with the help of bars, and they can be configured with vertical or horizontal bars, with the length or height of each bar representing the value.
Pie and Donut Charts: There is much debate around the value of pie and donut charts. As a rule,
they are used to compare the parts of a whole and are most effective when there are limited
components and when text and percentages
are included to describe the content. However, they can be difficult to interpret because
the human eye has a hard time estimating areas and comparing visual angles.
Histogram Plot: A histogram, representing the distribution of a continuous variable over a given interval or period of time, is one of the most frequently used data visualization techniques in machine learning. It plots the data by chunking it into intervals called 'bins'. It is used to inspect the underlying frequency distribution, outliers, skewness, and so on.
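A hedged matplotlib sketch (the sample values are made up) producing the line plot, bar chart, and histogram described above:

import random
import matplotlib.pyplot as plt

x = list(range(10))
y = [value * 2 for value in x]
categories = ["A", "B", "C", "D"]
counts = [5, 9, 3, 7]
samples = [random.gauss(50, 10) for _ in range(500)]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, y)                # line plot: relationship between two variables
axes[1].bar(categories, counts)   # bar chart: comparing quantities of categories
axes[2].hist(samples, bins=20)    # histogram: frequency distribution over 'bins'
plt.tight_layout()
plt.show()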
Security and privacy:
● Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, it is necessary to perform security checks and observation in real time, because that is most beneficial.
● There is some information about a person which, when combined with external large data, may reveal facts about that person which may be secretive, and which he might not want the data owner to know.
● Some organizations collect information about people in order to add value to their business. This is done by making insights into their lives that they're unaware of.
Quality of data:
When there is a collection of a large amount of data and storage of this data, it comes at a cost.
Big companies, business leaders and IT leaders always want
large data storage.
● For better results and conclusions, Big data rather than having irrelevant data,
focuses on quality data storage.
● This further raises the question of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
Fault tolerance:
● Nowadays, newer technologies like cloud computing and big data always intend that whenever a failure occurs, the damage done should be within an acceptable threshold; that is, the whole task should not have to begin from scratch.
Scalability:
● Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
● It leads to various challenges like how to run and execute various jobs so that goal of
each workload can be achieved cost-effectively.
● It also requires dealing with the system failures in an efficient manner. This leads to a big
question again that what kinds of storage devices are to be used.
Q4] Illustrate the characteristics of social media which make it suitable for Big Data
Analytics
Ans :
Social media platforms are a significant source of big data due to their vast user base, high
engagement levels, and constant content generation. Here are the main characteristics of
social media that make it especially suitable for Big Data Analytics:
1. Volume
Social media generates a massive volume of data every second. Billions of users
interact daily, sharing text, images, videos, likes, comments, and reactions.
2. Velocity
Social media data is generated and shared in real-time or near-real-time, with posts,
shares, likes, and comments constantly being added.
This high speed of data generation allows organizations to perform real- time
analytics and gain insights immediately, which can be valuable for making timely
decisions.
3. Variety
Social media data comes in various forms, including text posts, images, videos, likes, comments, and reactions.
4. Veracity
Social media data can be noisy and contain inaccuracies, false
information, or irrelevant content (e.g., spam or bots).
Analytics must handle this "messiness" by filtering, validating, and verifying
data, which adds complexity to the analysis process.