BDA Question Bank With Solutions

The document discusses various aspects of digital data, including its classifications into structured, semi-structured, and unstructured data. It elaborates on Big Data and its analytics, highlighting the differences between traditional Business Intelligence and Big Data, as well as the challenges businesses face in leveraging Big Data. Additionally, it covers Hadoop and MongoDB, explaining their roles, features, and differences from traditional databases.

Short Answer Questions and Solutions:

1. What are the different types of digital data?


Digital data can be classified into structured, semi-structured, and unstructured data.
2. How is digital data classified?
Digital data is classified based on its structure into structured (e.g., relational
databases), semi-structured (e.g., XML, JSON), and unstructured (e.g., images,
videos, social media posts).
3. Define Big Data.
Big Data refers to extremely large datasets that cannot be efficiently processed using
traditional data processing techniques.
4. What led to the evolution of Big Data?
The explosion of internet usage, social media, IoT devices, and advancements in
computing power led to the evolution of Big Data.
5. How does traditional Business Intelligence differ from Big Data?
Traditional BI focuses on structured data with predefined schemas, whereas Big Data
handles large volumes of diverse data in real-time.
6. Explain the coexistence of Big Data and Data Warehouse.
Big Data complements traditional data warehouses by handling large volumes of
unstructured data and real-time processing.
7. What is Big Data Analytics?
Big Data Analytics is the process of analyzing and extracting valuable insights from
large datasets using advanced technologies.
8. How is Big Data Analytics different from traditional data analysis?
Big Data Analytics processes large, complex datasets in real-time, while traditional
analysis focuses on smaller, structured data.
9. Why is there sudden hype around Big Data Analytics?
The growing volume of data, advancements in AI and ML, and business demand for
data-driven insights have fueled the hype.
10. List the different classifications of analytics.
○ Descriptive Analytics
○ Diagnostic Analytics
○ Predictive Analytics
○ Prescriptive Analytics
11. What are the biggest challenges preventing businesses from capitalizing on
Big Data?
○ Data security and privacy issues
○ High infrastructure costs
○ Lack of skilled professionals
○ Data integration difficulties
12. Mention some top challenges faced in Big Data.
○ Data storage and processing
○ Data quality and governance
○ Scalability and performance issues
○ Regulatory compliance
13. Why is Big Data Analytics important?
It helps businesses gain insights, improve decision-making, enhance customer
experience, and drive innovation.
14. Define Data Science in the context of Big Data.
Data Science is an interdisciplinary field that uses statistical techniques, algorithms,
and machine learning to extract insights from Big Data.
15. List some key terminologies used in Big Data environments.
○ Hadoop
○ NoSQL
○ Data Lake
○ Data Warehouse
○ Machine Learning
16. What are the key features of Hadoop?

○ Hadoop is scalable, fault-tolerant, and distributed in nature.

○ It follows the Master-Slave architecture using HDFS and MapReduce.

○ Supports processing large datasets in a cost-effective manner.

○ Provides high availability using data replication across nodes.

○ Works efficiently on commodity hardware, reducing infrastructure costs.

17. List some advantages of using Hadoop.

○ Handles Big Data efficiently using distributed processing.

○ Cost-effective as it runs on low-cost hardware.

○ Highly scalable, supporting thousands of nodes.

○ Fault-tolerant due to HDFS data replication.

○ Supports diverse data formats, including structured, semi-structured, and unstructured data.

18. What are the different versions of Hadoop?

○ Hadoop 1.0: Used MapReduce for processing but lacked efficient resource management.

○ Hadoop 2.0: Introduced YARN (Yet Another Resource Negotiator), improving scalability.

○ Hadoop 3.0+: Enhanced performance, introduced erasure coding, and reduced storage costs.

19. Briefly describe the Hadoop ecosystem.

○ The Hadoop ecosystem consists of HDFS, MapReduce, YARN, and supporting tools like Hive, Pig, HBase, and Spark that enhance Big Data processing.

20. What are some common Hadoop distributions?

○ Apache Hadoop (open-source version).

○ Cloudera CDH (Enterprise-ready, with additional tools).

○ Hortonworks HDP (Focused on open-source development).

○ MapR (Provides additional performance enhancements).


21. Why is Hadoop needed in modern data processing?

○ Traditional systems struggle with huge, complex, and unstructured data.

○ Hadoop processes data efficiently in a distributed manner, making it ideal for Big Data.

22. How does Hadoop differ from RDBMS?

○ Hadoop processes unstructured and large-scale data across distributed clusters.

○ RDBMS deals with structured data stored in tables with strict schemas.

23. What are the main challenges in distributed computing?

○ Data consistency

○ Load balancing

○ Fault tolerance

○ Efficient communication between nodes

24. Provide a brief history of Hadoop.

○ Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

○ It was created by Doug Cutting and Mike Cafarella, grew out of the Apache Nutch project, and became a separate Apache project in 2006 when Cutting joined Yahoo.

25. What are the core components of Hadoop?

● HDFS (Hadoop Distributed File System) for storage.

● MapReduce for processing data.

● YARN for resource management.

26. Explain the purpose of HDFS in Hadoop.

● HDFS stores large files by distributing them across multiple nodes, ensuring high
availability and reliability.

27. How does Hadoop achieve fault tolerance?

● Hadoop uses data replication to maintain multiple copies across nodes.

● It reassigns tasks when failures occur, ensuring continued processing.

28. What is the significance of the NameNode and DataNode in HDFS?

● NameNode manages metadata and file system namespace.


● DataNodes store actual data blocks and perform read/write operations.

29. What is the role of MapReduce in Hadoop?

● MapReduce is a parallel computing framework that processes large datasets by splitting tasks into Map and Reduce phases.

30. How does Hadoop handle large-scale data storage?

● Hadoop divides data into blocks and distributes them across nodes, ensuring
scalability and reliability.

Short Answer Questions and Solutions:

1. What is the role of Hadoop in data processing?


Hadoop enables distributed storage and parallel processing of large datasets,
making it efficient for handling Big Data.
2. Briefly explain how Hadoop processes large datasets.
Hadoop processes data using the MapReduce framework, which splits data into
chunks, processes them in parallel, and then combines the results.
3. What is MapReduce, and why is it important in Hadoop?
MapReduce is a programming model used in Hadoop for processing large-scale data
in a distributed environment.
4. Define the role of a Mapper in MapReduce.
The Mapper processes input data, extracts key-value pairs, and distributes them to
relevant reducers.
5. What is the function of a Reducer in MapReduce?
The Reducer aggregates the output from Mappers, performing operations like
summation, sorting, or categorization.
6. How does a Combiner optimize MapReduce jobs?
A Combiner acts as a mini-reducer, reducing the amount of data transferred between
Mappers and Reducers, improving efficiency.
7. What is the role of a Partitioner in MapReduce?
A Partitioner determines how data is distributed among Reducers, ensuring balanced
processing (see the MapReduce sketch after this question set).
8. List the different types of NoSQL databases.
○ Key-Value Stores (e.g., Redis, DynamoDB)
○ Document Stores (e.g., MongoDB, CouchDB)
○ Column-Family Stores (e.g., Cassandra, HBase)
○ Graph Databases (e.g., Neo4j, ArangoDB)
9. What are the key advantages of using NoSQL databases?
○ High scalability
○ Flexible schema
○ Efficient handling of unstructured data
○ High availability and fault tolerance
10. How is NoSQL different from traditional relational databases?
NoSQL databases are schema-less, horizontally scalable, and optimized for
distributed architectures, unlike traditional RDBMS.
11. Mention some common use cases of NoSQL in the industry.
○ E-commerce (product catalogs)
○ Social media (user interactions)
○ IoT applications (sensor data storage)
○ Real-time analytics (log analysis, recommendation systems)
12. How does NewSQL differ from NoSQL?
NewSQL retains relational database features while incorporating NoSQL's scalability
and distributed processing.
13. Compare NoSQL and SQL databases in terms of scalability.
SQL databases scale vertically, while NoSQL databases scale horizontally, making
them suitable for large-scale distributed environments.
14. What are the main categories of NoSQL databases?
○ Key-Value Stores
○ Document Stores
○ Column-Family Stores
○ Graph Databases
15. Why is NoSQL preferred for handling large-scale unstructured data?
NoSQL databases do not require a fixed schema and can handle heterogeneous
data types efficiently.
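
Tying together questions 4–7 above (Mapper, Reducer, Combiner, and Partitioner), the following is a minimal Python simulation of a MapReduce job that totals sales per product. It only illustrates the data flow; the records, split sizes, and function names are made up for this sketch and do not reflect Hadoop's actual Java API.

from collections import defaultdict

def mapper(split):
    # Map task: emit an intermediate (product, amount) pair for every record in its split.
    return [(product, amount) for product, amount in split]

def combiner(pairs):
    # Mini-reducer: pre-aggregates one mapper's output before it crosses the network.
    local = defaultdict(float)
    for product, amount in pairs:
        local[product] += amount
    return list(local.items())

def partitioner(key, num_reducers):
    # Decides which reducer receives a given key (simple hash partitioning).
    return hash(key) % num_reducers

def reducer(key, values):
    # Aggregates every value that arrived for one key into the final result.
    return key, sum(values)

splits = [
    [("pen", 10.0), ("book", 50.0), ("pen", 15.0)],   # input split handled by mapper 1
    [("book", 30.0), ("ink", 5.0), ("pen", 20.0)],    # input split handled by mapper 2
]
num_reducers = 2
partitions = [defaultdict(list) for _ in range(num_reducers)]

# Map + combine each split, then shuffle each key to its reducer's partition.
for split in splits:
    for key, value in combiner(mapper(split)):
        partitions[partitioner(key, num_reducers)][key].append(value)

# Reduce phase: each reducer works on its own partition independently.
for partition in partitions:
    for key, values in partition.items():
        print(reducer(key, values))

Notice how the combiner shrinks each mapper's output before the shuffle, and how the partitioner decides which reducer owns each key.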

What is MongoDB, and why is it used?


MongoDB is a NoSQL database that stores data in a flexible, document-oriented
format instead of traditional tables. It is widely used for handling large-scale, high-
volume data applications due to its scalability, high performance, and schema-less
nature.

Why is MongoDB considered a NoSQL database?


MongoDB is considered a NoSQL database because it does not follow the traditional
relational database model. Instead of storing data in rows and tables, it uses JSON-
like documents, making it more flexible for handling semi-structured and
unstructured data.

What are the key differences between MongoDB and RDBMS?

● MongoDB stores data as documents in collections, whereas RDBMS uses


tables and rows.

● MongoDB is schema-less, while RDBMS requires a predefined schema.

● MongoDB is horizontally scalable, whereas RDBMS typically scales vertically.

● Queries in MongoDB use JSON-like syntax, whereas RDBMS uses SQL.

Define a document in MongoDB.


A document in MongoDB is a JSON-like data structure that contains key-value pairs.
It is the basic unit of data storage in MongoDB. Example:

{
"name": "Alice",
"age": 25,
"city": "New York"
}

What is a collection in MongoDB?


A collection is a group of MongoDB documents, similar to a table in RDBMS. Unlike
tables, collections do not require a fixed schema.

How does MongoDB store data internally?


MongoDB stores data in BSON (Binary JSON) format, which is a binary
representation of JSON documents.

List some common data types used in MongoDB.

● String

● Integer

● Boolean

● Array

● Object (Embedded document)

● Date

● ObjectId

What is BSON, and how is it related to MongoDB?


BSON (Binary JSON) is the format MongoDB uses to store documents. It is a binary
representation of JSON that supports additional data types and is optimized for
speed.

Explain the term "schema-less" in MongoDB.


MongoDB does not enforce a fixed schema for its collections, meaning documents in
the same collection can have different structures. This provides flexibility in handling
various data formats.

What are the equivalents of a table, row, and column in MongoDB?

● Table → Collection

● Row → Document

● Column → Field

What is the difference between an embedded document and a referenced document in MongoDB?
Embedded Document: The related data is stored inside the same document.

{
"name": "Alice",
"address": {
"city": "New York",
"zip": "10001"
}
}

Referenced Document: The related data is stored in a separate document with a reference ID.

{
"name": "Alice",
"address_id": "6123abcd5678efgh"
}
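
As a hedged illustration of how the two patterns are read from application code, the sketch below uses PyMongo, the official Python driver. The connection string, database name, and collection names are hypothetical placeholders, not part of the examples above.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
db = client["demo"]                                 # hypothetical database name

# Embedded pattern: one query returns the document together with its address.
user = db.users_embedded.find_one({"name": "Alice"})
city = user["address"]["city"] if user else None

# Referenced pattern: a second query resolves the stored reference ID.
user = db.users_referenced.find_one({"name": "Alice"})
address = db.addresses.find_one({"_id": user["address_id"]}) if user else None

print(city, address)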

How does MongoDB handle indexing?


MongoDB uses indexes to improve query performance. By default, it creates an index
on the _id field, but additional indexes can be created using the createIndex()
function.
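
For example, a secondary index on a frequently queried field could be created as follows. This sketch uses PyMongo's create_index (the driver equivalent of the shell's createIndex()); the database, collection, and field names are illustrative assumptions.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
users = client["demo"]["users"]                     # hypothetical database/collection

# Single-field ascending index on "name"; MongoDB keeps the default _id index as well.
index_name = users.create_index([("name", ASCENDING)])
print("Created index:", index_name)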

What is the purpose of the ObjectId in MongoDB?


The ObjectId is a unique identifier assigned to each document by default. It ensures
uniqueness and contains information such as timestamp and machine ID.

How do you insert a document into a MongoDB collection?

db.users.insertOne({ "name": "Alice", "age": 25 });

What command is used to retrieve all documents from a MongoDB collection?

db.users.find();

How do you update an existing document in MongoDB?

db.users.updateOne({ "name": "Alice" }, { $set: { "age": 26 } });

What is the difference between find() and findOne() in MongoDB?

● find(): Returns all matching documents.

● findOne(): Returns only the first matching document.

How do you delete a document from a collection in MongoDB?

db.users.deleteOne({ "name": "Alice" });

What are the different types of queries available in MongoDB?

● Find Queries (find(), findOne())

● Update Queries (updateOne(), updateMany())

● Delete Queries (deleteOne(), deleteMany())

● Aggregation Queries

What is aggregation in MongoDB?


Aggregation is used for processing and analyzing data, similar to SQL's GROUP BY.
Example:

db.sales.aggregate([{ $group: { _id: "$product", totalSales: { $sum: "$amount" } } }]);

Long Answer Questions with Solutions

1) What is Big Data? Explain the types of data.

Big Data refers to extremely large and complex datasets, generated with high volume, velocity, and variety, that cannot be stored, processed, or analysed efficiently using traditional data processing techniques.

Different Types of Big Data
The types of data in Big Data are used to categorize the numerous kinds of data generated daily. Primarily, there are 3 types of data in Big Data analytics. These types, with examples, are explained below:

A. Structured Data

Any data that can be processed, is easily accessible, and can be stored in a fixed format is called structured data. In Big Data, structured data is the easiest to work with because it is highly organized into fields whose format and length are defined in advance. Its characteristics, examples, merits, and limitations are as follows:

Overview:

● Highly organized and easily searchable in databases.


● Follows a predefined schema (e.g., rows and columns in a table).
● Typically stored in relational databases (SQL).

Examples:

● Customer information databases (names, addresses, phone numbers).


● Financial data (transactions, account balances).
● Inventory management systems.
● Metadata (data about data).

Merits:

● Easy to analyze and query.


● High consistency and accuracy.
● Efficient storage and retrieval.
● Strong data integrity and validation.

Limitations:
● Limited flexibility (must adhere to a strict schema).
● Scalability issues with very large datasets.
● Less suitable for complex big data types.

B. Semi-structured Data

In Big Data, semi-structured data is a combination of both structured and unstructured types of data. This form of data has some features of structured data but also contains information that does not adhere to any formal data model or relational database schema. Some semi-structured data examples include XML and JSON.

Overview:

● Contains both structured and unstructured elements.


● Lacks a fixed schema but includes tags and markers to separate data
elements.
● Often stored in formats like XML, JSON, or NoSQL databases.

Examples:

● JSON files for web APIs.


● XML documents for data interchange.
● Email messages (headers are structured, body can be unstructured).
● HTML pages.

Merits:

● More flexible than structured data.


● Easier to parse and analyze than unstructured data.
● Can handle a wide variety of data types.
● Better suited for hierarchical data.
Limitations:

● More complex to manage than structured data.


● Parsing can be resource-intensive.
● Inconsistent data quality.

C. Unstructured Data

Unstructured data in Big Data consists of multitudes of files with no fixed format (images, audio, logs, and video). This form of data is considered complex because of its unfamiliar structure and relatively huge size. A stark example of unstructured data is the output returned by a web search such as ‘Google Search’ or ‘Yahoo Search.’

Overview:

● Data that does not conform to a predefined schema.


● Includes text, multimedia, and other non-tabular data types.
● Stored in data lakes, NoSQL databases, and other flexible storage solutions.

Examples:

● Text documents (Word files, PDFs).


● Multimedia files (images, videos, audio).
● Social media posts.
● Web pages.

Merits:
● Capable of storing vast amounts of diverse data.
● High flexibility in data storage.
● Suitable for complex data types like multimedia.
● Facilitates advanced analytics and machine learning applications.
Limitations:

● Difficult to search and analyze without preprocessing.


● Requires large storage capacities.
● Inconsistent data quality and reliability.
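
As a brief, hedged illustration in code, the snippet below represents the same customer once as a structured row with fixed columns, once as a semi-structured JSON document with tags but a flexible shape, and once as unstructured free text; all field names and values are invented for this example.

import csv, json, io

# Structured: fixed schema, every record has the same columns.
structured = io.StringIO("id,name,city\n1,Alice,New York\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, fields may vary per record.
semi_structured = json.loads('{"id": 1, "name": "Alice", "interests": ["cricket", "music"]}')

# Unstructured: free text with no schema; needs preprocessing (e.g., tokenizing) to analyse.
unstructured = "Alice posted: 'Loved the new phone, battery could be better!'"
tokens = unstructured.lower().split()

print(rows[0]["city"], semi_structured["interests"], len(tokens))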

2) Explain the 4 V's of Big Data.

Big Data was originally defined by the “3Vs” (Volume, Velocity and Variety); additional Vs such as Veracity are now also treated as characteristics of Big Data. The four main Vs are as follows:

1. Volume:

● The name ‘Big Data’ itself is related to a size which is enormous.


● Volume refers to the huge amount of data.
● To determine the value of data, the size of the data plays a very crucial role:
whether a particular dataset can actually be considered Big Data or not
depends largely on its volume.
● Hence while dealing with Big Data it is necessary to consider a
characteristic ‘Volume’.
● Example: In the year 2016, the estimated global mobile traffic was
6.2 exabytes (6.2 billion GB) per month, and by 2020 the world was
estimated to hold almost 40,000 exabytes of data.
2. Velocity:

● Velocity refers to the high speed of accumulation of data.


● In Big Data velocity data flows in from sources like machines,
networks, social media, mobile phones etc.
● There is a massive and continuous flow of data. This determines the
potential of data that how fast the data is generated and processed
to meet the demands.
● Sampling data can help in dealing with issues such as velocity.
● Example: More than 3.5 billion searches per day are made on
Google, and Facebook users have been growing by approximately
22% year over year.
3. Variety:
● It refers to the nature of data, that is, structured, semi-structured and
unstructured data.
● It also refers to heterogeneous sources.

● Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured, semi-
structured and unstructured.
● Structured data: This data is basically an organised data.
It generally refers to data that has defined the length and
format of data.
● Semi-structured data: This is basically semi-organised
data; it generally does not fully conform to the formal
structure of data models. Log files are examples of this
type of data.
● Unstructured data: This data basically refers to
unorganized data. It generally refers to data that doesn’t fit
neatly into the traditional row and column structure of the
relational database. Texts, pictures, videos etc. are the
examples of unstructured data which can’t be stored in the
form of rows and columns.
4. Veracity:
● It refers to inconsistencies and uncertainty in data, that is data which
is available can sometimes get messy and quality and accuracy are
difficult to control.
● Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types and sources.
● Example: Data in bulk could create confusion, whereas too little
data could convey only partial or incomplete information.

3) What is the importance of Big Data? Write about the main business drivers for the rise of Big Data.

Big Data and its Importance

The importance of big data does not revolve around how much data a company
has but how a company utilizes the collected data. Every company uses data in
its own way; the more efficiently a company uses its data, the more potential it
has to grow. The company can take data from any source and analyze it to find
answers. By analysing big data pools effectively, companies gain the following benefits:
Cost Savings :
o Some tools of Big Data like Hadoop can bring cost advantages to business
when large amounts of data are to be stored.
o These tools help in identifying more efficient ways of doing business.
Time Reductions :
o The high speed of tools like Hadoop and in-memory analytics can easily
identify new sources of data, which helps businesses analyze data
immediately.
o This helps businesses make quick decisions based on the learnings.
Understand the market conditions :
o By analyzing big data we can get a better understanding of current market
conditions.
o For example: By analyzing customers’ purchasing behaviours, a company
can find out the products that are sold the most and produce products
according to this trend. By this, it can get ahead of its competitors.
Control online reputation :
o Big data tools can do sentiment analysis.
o Therefore, you can get feedback about who is saying what about your
company.
o If you want to monitor and improve the online presence of your business,
then big data tools can help in all this.
Using Big Data Analytics to Boost Customer Acquisition(purchase) and
Retention :
o The customer is the most important asset any business depends on.
o No single business can claim success without first having to establish a solid
customer base.
o If a business is slow to learn what customers are looking for, then it is very
likely to deliver poor quality products.
o The use of big data allows businesses to observe various customer-related
patterns and trends.
Using Big Data Analytics to Solve Advertisers’ Problems and Offer
Marketing Insights :
o Big data analytics can help reshape all business operations, such as
matching customer expectations and changing the company’s
product line.
o It also helps ensure that marketing campaigns are powerful.

Drivers for Big Data


Big Data has quickly risen to become one of the most desired topics in the
industry.
The main business drivers for such rising demand for Big Data Analytics are :
1. The digitization of society
2. The drop in technology costs
3. Connectivity through cloud computing
4. Increased knowledge about data science
5. Social media applications
6. The rise of Internet-of-Things(IoT)
Example: A number of companies that have Big Data at the core of their
strategy, such as Apple, Amazon, Facebook and Netflix, became very
successful at the beginning of the 21st century.

1. The digitization of society


Big Data is largely consumer driven and consumer oriented. Most of the data
in the world is generated by consumers, who are nowadays ‘always-on’.
Most people now spend 4-6 hours per day consuming and generating data
through a variety of devices and (social) applications.
With every click, swipe or message, new data is created in a database
somewhere around the world.
Because everyone now has a smartphone in their pocket, the data creation
sums to incomprehensible amounts.
Some studies estimate that 60% of data was generated within the last two
years, which is a good indication of the rate with which society has digitized.
2. The drop in technology costs
Technology related to collecting and processing massive quantities of diverse (high variety)
data has become increasingly more affordable.
The costs of data storage and processors keep declining, making it possible for small
businesses and individuals to become involved with Big Data.
For storage capacity, the often-cited rule of thumb (analogous to Moore’s Law) still holds that storage density, and therefore capacity, roughly doubles every two years, so technology costs keep plummeting.

Besides the plummeting of the storage costs, a second key contributing factor to the
affordability of Big Data has been the development of open source Big Data software
frameworks.

The most popular software framework (nowadays considered the standard for Big Data) is
Apache Hadoop for distributed storage and processing.

Due to the high availability of these software frameworks in open sources, it has become
increasingly inexpensive to start Big Data projects in organizations.

3. Connectivity through cloud computing


Cloud computing environments (where data is remotely stored in distributed storage systems)
have made it possible to quickly scale up or scale down IT infrastructure and facilitate a pay-
as-you-go model.

This means that organizations that want to process massive quantities of data (and thus have
large storage and processing requirements) do not have to invest in large quantities of IT
infrastructure.

Instead, they can license the storage and processing capacity they need and only pay for the
amounts they actually use. As a result, most Big Data solutions leverage the possibilities of
cloud computing to deliver their solutions to enterprises.

4. Increased knowledge about data science


In the last decade, the term data science and data scientist have become tremendously
popular. In October 2012, Harvard Business Review called the data scientist “sexiest job of
the 21st century” and many other publications have featured this new job role in recent years.
The demand for data scientists (and similar job titles) has increased tremendously and many
people have actively become engaged in the domain of data science.

As a result, the knowledge and education about data science has greatly professionalized and
more information becomes available every day. While statistics and data analysis mostly
remained an academic field previously, it is quickly becoming a popular subject among
students and the working population.

5. Social media applications


Everyone understands the impact that social media has on daily life. However, in the
study of Big Data, social media plays a role of paramount importance. Not only because of
the sheer volume of data that is produced everyday through platforms such as Twitter,
Facebook, LinkedIn and Instagram, but also because social media provides nearly real-time
data about human behavior.

Social media data provides insights into the behaviors, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable to
anyone who is able to derive meaning from these large quantities of data. Social media data
can be used to identify customer preferences for product development, target new customers
for future purchases, or even target potential voters in elections. Social media data might even
be considered one of the most important business drivers of Big Data.

6. The upcoming internet of things (IoT)


The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and
other items embedded with electronics, software, sensors, actuators, and network connectivity
which enables these objects to connect and exchange data.

It is increasingly gaining popularity as consumer goods providers start including ‘smart’ sensors
in household appliances. Whereas the average household in 2010 had around 10 devices that
connected to the internet, this number was expected to rise to 50 per household by 2020.

Examples of these devices include thermostats, smoke detectors, televisions, audio systems and
even smart refrigerators.
Typical sources of such device-generated data include:
● Medical information, such as diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras across a city
● Mobile devices, which provide geospatial location data of their users as well as
metadata about text messages, phone calls, and application usage on smartphones
● Smart devices, which provide sensor-based collection of information
● Non-traditional IT devices, including RFID readers, GPS navigation systems,
and seismic processing equipment
These are some of the many sources from which Big Data is generated.

4) What are the differences between traditional data and Big Data?

Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is also generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day or more. | Big data is generated far more frequently, mainly per second.
The data source is centralized and managed in a centralized form. | The data source is distributed and managed in a distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data.
The size of the data is very small. | The size is far greater than that of traditional data.
Traditional database tools are required to perform database operations. | Special kinds of database tools are required to perform schema-based database operations.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Its data model is strictly schema-based and static. | Its data model is a flat schema and is dynamic.
Traditional data is stable, with known inter-relationships. | Big data is not stable, and its relationships are unknown.
Traditional data is of a manageable volume. | Big data is of huge volume, which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

5) Explain briefly the components of Hadoop (the Hadoop architecture).

Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming
models. The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.

The Hadoop Architecture Mainly consists of 4 components.


● MapReduce

● HDFS(Hadoop Distributed File System)

● YARN(Yet Another Resource Negotiator)

● Common Utilities or Hadoop Common

Let’s understand the role of each one of these components in detail.


1. MapReduce

MapReduce is a programming model and processing framework that, from Hadoop 2 onwards,
runs on top of the YARN framework. Its major feature is to perform distributed processing in
parallel across a Hadoop cluster, which makes Hadoop work so fast; when you are dealing with
Big Data, serial processing is no longer of any use. MapReduce has mainly 2 tasks, which are
divided phase-wise:

● Map task: reads the input data and converts it into intermediate key-value pairs.
● Reduce task: aggregates the intermediate key-value pairs produced by the Map task into the final output.
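
To make the two phases concrete, here is a minimal word-count sketch in Python. The sorted()/groupby() step merely stands in for the shuffle and sort that the Hadoop framework performs between the Map and Reduce stages, and the sample input lines are invented for illustration.

from itertools import groupby

def map_phase(lines):
    # Map task: emit an intermediate (word, 1) pair for every word in the split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce task: aggregate values per key; Hadoop sorts and groups keys between
    # the phases, which sorted() + groupby() imitate here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

splits = ["hadoop stores big data", "hadoop processes big data in parallel"]
for word, total in reduce_phase(map_phase(splits)):
    print(word, total)

In a real Hadoop Streaming job the same two functions would live in separate mapper and reducer scripts, with the framework handling the intermediate sort.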

How Does Hadoop Work?


It is quite expensive to build bigger servers with heavy configurations that handle
large scale processing, but as an alternative, you can tie together many commodity
computers with single-CPU, as a single functional distributed system and
practically, the clustered machines can read the dataset in parallel and provide a
much higher throughput. Moreover, it is cheaper than one high-end server. This is
the first motivational factor behind using Hadoop: it runs across clustered,
low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

● Data is initially divided into directories and files. Files are divided into
uniformly sized blocks of 128 MB or 64 MB (128 MB is the default in Hadoop 2
and later); see the block-count sketch after this list.
● These files are then distributed across various cluster nodes for further
processing.
● HDFS, being on top of the local file system, supervises the processing.
● Blocks are replicated for handling hardware failure.
● Checking that the code was executed successfully.
● Performing the sort that takes place between the map and reduce
stages.
● Sending the sorted data to a certain computer.
● Writing the debugging logs for each job.
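
As a back-of-the-envelope sketch of the splitting and replication described in this list, the snippet below computes how many 128 MB blocks a file occupies and how much raw storage the default replication factor of 3 consumes; the 1000 MB file size is an arbitrary example.

import math

BLOCK_SIZE_MB = 128      # HDFS default block size in Hadoop 2+
REPLICATION_FACTOR = 3   # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    # Number of blocks the file is split into (the last block may be smaller).
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Raw storage used across the cluster once every block is replicated.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)   # e.g., a 1000 MB file
print(f"Blocks: {blocks}, raw storage with replication: {raw} MB")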

Advantages of Hadoop

● The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work
across the machines, utilizing the underlying parallelism of the CPU cores.
● Hadoop does not rely on hardware to provide fault-tolerance and high
availability (FTHA), rather Hadoop library itself has been designed to
detect and handle failures at the application layer.
● Servers can be added or removed from the cluster dynamically and
Hadoop continues to operate without interruption.
● Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java based.

6) Draw and explain the architecture of the Hadoop ecosystem.

The Hadoop ecosystem is a platform or framework which helps in solving big data
problems. It comprises different components and services (for ingesting, storing,
analyzing, and maintaining data). Most of the services available in the
Hadoop ecosystem are there to supplement the four main core components of Hadoop,
which are HDFS, YARN, MapReduce and Common.

The Hadoop ecosystem includes both Apache open source projects and a wide
variety of commercial tools and solutions. Some of the well-known open source
examples include Spark, Hive, Pig, Sqoop and Oozie.

As we have got some idea about what is Hadoop ecosystem, what it does, and what
are its components, let’s discuss each concept in detail.

Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term i.e. Data. That’s
the beauty of Hadoop that it revolves around data and hence making its
synthesis easier.
HDFS:
● HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
● HDFS consists of two core components i.e.
1. Name node
2. Data Node
● Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data.
These data nodes are commodity hardware in the distributed environment.
Undoubtedly, making Hadoop cost effective.
● HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
● Yet Another Resource Negotiator, as the name implies, YARN is the one who helps
to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop System.
● Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
● Resource manager has the privilege of allocating resources for the applications in a
system whereas Node managers work on the allocation of resources such as CPU,
memory, bandwidth per machine and later on acknowledges the resource manager.
Application manager works as an interface between the resource manager and node
manager and performs negotiations as per the requirement of the two.
MapReduce:
● By making the use of distributed and parallel algorithms, MapReduce makes it
possible to carry over the processing’s logic and helps to write applications which
transform big data sets into a manageable one.
● MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of data and thereby organizing them in
the form of group. Map generates a key-value pair based result which is later
on processed by the Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating the
mapped data. In simple, Reduce() takes the output generated by Map() as
input and combines those tuples into smaller set of tuples.
PIG:
Pig was basically developed by Yahoo which works on a pig Latin language, which is Query
based language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
● Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
● Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
● With the help of SQL methodology and interface, HIVE performs reading and writing
of large data sets. However, its query language is called as HQL (Hive Query
Language).
● It is highly scalable as it allows real-time processing and batch processing both. Also,
all the SQL datatypes are supported by Hive thus, making the query processing
easier.
● Similar to the Query Processing frameworks, HIVE too comes with two components:
JDBC Drivers and HIVE Command Line.
● JDBC, along with ODBC drivers work on establishing the data storage permissions
and connection whereas HIVE Command line helps in the processing of queries.
Mahout:
● Mahout, allows Machine Learnability to a system or application. Machine Learning,
as the name suggests helps the system to develop itself based on some patterns,
user/environmental interaction or on the basis of algorithms.
● It provides various libraries or functionalities such as collaborative filtering, clustering,
and classification which are nothing but concepts of Machine learning. It allows
invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
● It’s a platform that handles all the process consumptive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, and visualization, etc.
● It consumes in memory resources hence, thus being faster than the prior in terms of
optimization.
● Spark is best suited for real-time data whereas Hadoop is best suited for structured
data or batch processing, hence both are used in most of the companies
interchangeably.
Apache HBase:
● It’s a NoSQL database which supports all kinds of data and thus capable of handling
anything of Hadoop Database. It provides capabilities of Google’s BigTable, thus
able to work on Big Data sets effectively.
● At times when we need to search for or retrieve a few small occurrences in a
huge database, the request must be processed within a very short span of time. At
such times, HBase comes in handy, as it gives us a fault-tolerant way of storing
and quickly reading such sparse data.
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
● Solr, Lucene: These are two services that perform the task of searching and
indexing with the help of Java libraries; Lucene in particular is a Java library that
also provides a spell-check mechanism. Solr is a search platform built on top of Lucene.
● Zookeeper: There was a huge issue of management of coordination and
synchronization among the resources or the components of Hadoop which resulted
in inconsistency, often. Zookeeper overcame all the problems by performing
synchronization, inter-component based communication, grouping, and maintenance.
● Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow
jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a
sequentially ordered manner, whereas Oozie coordinator jobs are triggered
when some data or external stimulus is given to them.
7) What is the relationship between cloud computing and Big Data? Explain.
Big Data and Cloud Computing

One of the vital issues that organisations face with the storage and management of Big Data
is the huge amount of investment to get the required hardware setup and software packages.
Some of these resources may be overutilized or underutilized as requirements vary over
time. We can overcome these challenges by providing a set of computing resources that
can be shared through cloud computing. These shared resources comprise applications,
storage solutions, computational units, networking solutions, development and deployment
platforms, business processes, etc. The cloud computing environment saves costs related to
infrastructure in an organization by providing a framework that can be optimized and
expanded horizontally. In order to operate in the real world, cloud implementation requires
common standardized processes and their automation.

In cloud-based platforms, applications can easily obtain the resources needed to perform
computing tasks. The cost of these resources is paid according to the resources acquired and
their actual use. In cloud computing, this feature of acquiring resources in accordance with
requirements, and paying only for what is used, is known as elasticity. Cloud computing makes
it possible for organisations to dynamically regulate the use of computing resources and access
them as per need, while paying only for those resources that are used. This facility of dynamic
use of resources provides flexibility; however, careless resource monitoring and control can
result in unexpectedly high costs.

The cloud computing technique uses data centres to collect data and ensures that data backup
and recovery are automatically performed to cater to the requirements of the business
community. Both cloud computing and Big Data analytics use the distributed computing
model in a similar manner.

Features of Cloud Computing

The following are some features of cloud computing that can be used to handle Big Data:

Scalability: Scalability means the addition of new resources to an existing infrastructure. An
increase in the amount of data being collected and analysed requires organisations to improve
their processing capability, which may mean supplementing or replacing existing hardware
with a new set of hardware components. The new hardware may not provide complete support
to software that used to run properly on the earlier set of hardware. We can solve such issues
by using cloud services that employ the distributed computing technique to provide scalability
to the architecture.

Elasticity: Elasticity in the cloud means hiring certain resources as and when required, and
paying only for the resources that have been used; no extra payment is required for acquiring
specific services. For example, a business expecting to use more data during an in-store
promotion could hire more resources to provide higher processing power.

Resource Pooling: Resource pooling is an important aspect of cloud services for Big Data
analytics. In resource pooling, multiple organisations, which use similar kinds of resources to
carry out computing practices, have no need to individually hire all the resources. The sharing
of resources is allowed in a cloud, which facilitates cost cutting through resource pooling.

Self Service: Cloud computing involves a simple user interface that helps customers decide on
and access the cloud services they want. The process of selecting the needed services does not
require intervention from human beings; services can be accessed automatically.

Low Cost: Careful planning, use, management, and control of resources help organisations
reduce the cost of acquiring hardware significantly. The cloud also offers customized solutions,
especially to organizations that cannot afford too much initial investment in purchasing the
resources used for computation in Big Data analytics. The cloud provides them a pay-as-you-use
option in which organizations sign up only for those resources that are actually needed. This
also helps the cloud provider harness the benefits of economies of scale and pass on part of this
benefit to their customers in terms of cost reduction.

Fault Tolerance: Cloud computing provides fault tolerance by offering uninterrupted


services to customers, especially in cases of component failure. The responsibility of
handling the workload is shifted to other components of the cloud.

8) Explain the cloud delivery models.


Most cloud computing services fall into five broad categories:
1. Software as a service (SaaS)

2. Platform as a service (PaaS)

3. Infrastructure as a service (IaaS)

4. Anything/Everything as a service (XaaS)

5. Function as a Service (FaaS)

These are sometimes called the cloud computing stack because they are built on top of one
another. Knowing what they are and how they are different, makes it easier to accomplish
your goals. These abstraction layers can also be viewed as a layered architecture where
services of a higher layer are composed of services of the underlying layers, i.e., a SaaS
offering can be built on top of PaaS and IaaS.

Software as a Service(SaaS)

Software-as-a-Service (SaaS) is a way of delivering services and applications over the


Internet. Instead of installing and maintaining software, we simply access it via the Internet,
freeing ourselves from the complex software and hardware management. It removes the
need to install and run applications on our own computers or in the data centers eliminating
the expenses of hardware as well as software maintenance.
SaaS provides a complete software solution that you purchase on a pay-as-you-go basis
from a cloud service provider. Most SaaS applications can be run directly from a web
browser without any downloads or installations required. The SaaS applications are
sometimes called Web-based software, on-demand software, or hosted software.

Advantages of SaaS
1. Cost-Effective: Pay only for what you use.
2. Reduced time: Users can run most SaaS apps directly from their web browser
without needing to download and install any software. This reduces the time spent in
installation and configuration and can reduce the issues that can get in the way of the
software deployment.
3. Accessibility: We can Access app data from anywhere.
4. Automatic updates: Rather than purchasing new software, customers rely on a SaaS
provider to automatically perform the updates.
5. Scalability: It allows the users to access the services and features on-demand.
The various companies providing Software as a service are Cloud9 Analytics,
Salesforce.com, Cloud Switch, Microsoft Office 365, Big Commerce, Eloqua, dropBox, and
Cloud Tran.
Disadvantages of SaaS:
1. Limited customization: SaaS solutions are typically not as customizable as on-
premises software, meaning that users may have to work within the constraints of the
SaaS provider’s platform and may not be able to tailor the software to their specific
needs.
2. Dependence on internet connectivity: SaaS solutions are typically cloud-based,
which means that they require a stable internet connection to function properly. This
can be problematic for users in areas with poor connectivity or for those who need to
access the software in offline environments.
3. Security concerns: SaaS providers are responsible for maintaining the security of the
data stored on their servers, but there is still a risk of data breaches or other security
incidents.
4. Limited control over data: SaaS providers may have access to a user’s data, which
can be a concern for organizations that need to maintain strict control over their data
for regulatory or other reasons.

Platform as a Service
PaaS is a category of cloud computing that provides a platform and environment to allow
developers to build applications and services over the internet. PaaS services are hosted in
the cloud and accessed by users simply via their web browser.
A PaaS provider hosts the hardware and software on its own infrastructure. As a result,
PaaS frees users from having to install in-house hardware and software to develop or run a
new application. Thus, the development and deployment of the application take place
independent of the hardware.
The consumer does not manage or control the underlying cloud infrastructure including
network, servers, operating systems, or storage, but has control over the deployed
applications and possibly configuration settings for the application-hosting environment. To
make it simple, take the example of an annual day function, you will have two options either
to create a venue or to rent a venue but the function is the same.

Advantages of PaaS:
1. Simple and convenient for users: It provides much of the infrastructure and other IT
services, which users can access anywhere via a web browser.
2. Cost-Effective: It charges for the services provided on a per-use basis thus
eliminating the expenses one may have for on-premises hardware and software.
3. Efficiently managing the lifecycle: It is designed to support the complete web
application lifecycle: building, testing, deploying, managing, and updating.
4. Efficiency: It allows for higher-level programming with reduced complexity thus, the
overall development of the application can be more effective.
The various companies providing Platform as a service are Amazon Web services Elastic
Beanstalk, Salesforce, Windows Azure, Google App Engine, cloud Bees and IBM smart
cloud.
Disadvantages of PaaS:
1. Limited control over infrastructure: PaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
2. Dependence on the provider: Users are dependent on the PaaS provider for the
availability, scalability, and reliability of the platform, which can be a risk if the
provider experiences outages or other issues.
3. Limited flexibility: PaaS solutions may not be able to accommodate certain types of
workloads or applications, which can limit the value of the solution for certain
organizations.

Infrastructure as a Service
Infrastructure as a service (IaaS) is a service model that delivers computer infrastructure on
an outsourced basis to support various operations. Typically IaaS is a service where
infrastructure is provided as outsourcing to enterprises such as networking equipment,
devices, database, and web servers.
It is also known as Hardware as a Service (HaaS). IaaS customers pay on a per-user basis,
typically by the hour, week, or month. Some providers also charge customers based on the
amount of virtual machine space they use.
It simply provides the underlying operating systems, security, networking, and servers for
developing such applications, and services, and deploying development tools, databases,
etc.

Advantages of IaaS:
1. Cost-Effective: Eliminates capital expense and reduces ongoing cost and IaaS
customers pay on a per-user basis, typically by the hour, week, or month.
2. Website hosting: Running websites using IaaS can be less expensive than traditional
web hosting.
3. Security: The IaaS Cloud Provider may provide better security than your existing
software.
4. Maintenance: There is no need to manage the underlying data center or the
introduction of new releases of the development or underlying software. This is all
handled by the IaaS Cloud Provider.
The various companies providing Infrastructure as a service are Amazon web services,
Bluestack, IBM, Openstack, Rackspace, and Vmware.
Disadvantages of IaaS:
1. Limited control over infrastructure: IaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
2. Security concerns: Users are responsible for securing their own data and
applications, which can be a significant undertaking.
3. Limited access: Cloud computing may not be accessible in certain regions and
countries due to legal policies.

Anything as a Service
It is also known as Everything as a Service. Most of the cloud service providers nowadays
offer anything as a service that is a compilation of all of the above services including some
additional services.
Advantages of XaaS:
1. Scalability: XaaS solutions can be easily scaled up or down to meet the changing
needs of an organization.
2. Flexibility: XaaS solutions can be used to provide a wide range of services, such as
storage, databases, networking, and software, which can be customized to meet the
specific needs of an organization.
3. Cost-effectiveness: XaaS solutions can be more cost-effective than traditional on-
premises solutions, as organizations only pay for the services.
Disadvantages of XaaS:
1. Dependence on the provider: Users are dependent on the XaaS provider for the
availability, scalability, and reliability of the service, which can be a risk if the provider
experiences outages or other issues.
2. Limited flexibility: XaaS solutions may not be able to accommodate certain types of
workloads or applications, which can limit the value of the solution for certain
organizations.
3. Limited integration: XaaS solutions may not be able to integrate with existing systems
and data sources, which can limit the value of the solution for certain organisations.

Function as a Service :
FaaS is a type of cloud computing service. It provides a platform for its users or customers to
develop, compute, run and deploy the code or entire application as functions. It allows the
user to entirely develop the code and update it at any time without worrying about the
maintenance of the underlying infrastructure. The developed code is executed in
response to a specific event. In this respect it is similar to PaaS.
FaaS is an event-driven execution model. It is implemented in the serverless container.
When the application is developed completely, the user will now trigger the event to execute
the code. Now, the triggered event makes response and activates the servers to execute it.
The servers are nothing but the Linux servers or any other servers which is managed by the
vendor completely. Customer does not have clue about any servers which is why they do not
need to maintain the server hence it is serverless architecture.
Both PaaS and FaaS are providing the same functionality but there is still some
differentiation in terms of Scalability and Cost.
FaaS, provides auto-scaling up and scaling down depending upon the demand. PaaS also
provides scalability but here users have to configure the scaling parameter depending upon
the demand.
In FaaS, users only have to pay for the actual execution time consumed. In PaaS, users
pay for the platform on an ongoing basis regardless of how much or how little
they use it.
Advantages of FaaS :
● Highly Scalable: Auto scaling is done by the provider depending upon the demand.
● Cost-Effective: Pay only for the number of events executed.
● Code Simplification: FaaS allows users to upload an entire application at once
or to write code as small, independent functions.
● Maintenance: only the code needs to be maintained; there is no need to worry
about the servers.
● Functions can be written in a wide range of programming languages.
The various companies providing Function as a Service include Amazon Web Services (AWS
Lambda, built on Firecracker), Google (Cloud Functions), Oracle (Fn Project), IBM (Apache
OpenWhisk), and the open source OpenFaaS project.
Disadvantages of FaaS :
1. Cold start latency: Since FaaS functions are event-triggered, the first request to a
new function may experience increased latency as the function container is created
and initialized.
2. Limited control over infrastructure: FaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
3. Security concerns: Users are responsible for securing their own data and
applications, which can be a significant undertaking.
4. Limited scalability for some workloads: providers impose limits on execution time, memory, and concurrency, so very high traffic or long-running workloads may not be handled well.

9) What is predictive analytics? Explain the types of predictive analytical models.


Predictive analytics is a branch of advanced analytics that makes predictions about future
outcomes using historical data combined with statistical modeling, data mining techniques
and machine learning. Companies employ predictive analytics to find patterns in this data to
identify risks and opportunities. Predictive analytics is often associated with big data and data
science.

predictive analytics models are designed to assess historical data, discover patterns, observe
trends, and use that information to predict future trends.

Predictive analytics can be deployed across various industries for different business
problems. Below are a few industry use cases to illustrate how predictive analytics can
inform decision-making within real-world situations.
Types of Predictive Analytical Models
Some of the most common techniques used in predictive analytics are regression, decision trees, neural networks, clustering, and time series models. Read more about each of these below.
Regression analysis
Regression is a statistical analysis technique that estimates relationships between variables.
Regression is useful to determine patterns in large datasets to determine the correlation
between inputs. It is best employed on continuous data that follows a known distribution.
Regression is often used to determine how one or more independent variables affect another, such as how a price increase will affect the sale of a product.
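As a rough illustration of the idea, the following self-contained Java sketch fits a one-variable regression line with ordinary least squares; the price and sales figures are made up purely for the example.

// Simple linear regression (ordinary least squares) on made-up price/sales data.
public class SimpleRegressionExample {
    public static void main(String[] args) {
        double[] price = {10, 12, 14, 16, 18};      // independent variable (hypothetical)
        double[] sales = {200, 185, 170, 160, 148}; // dependent variable (hypothetical)

        int n = price.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += price[i]; meanY += sales[i]; }
        meanX /= n;
        meanY /= n;

        // slope = covariance(price, sales) / variance(price)
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (price[i] - meanX) * (sales[i] - meanY);
            den += (price[i] - meanX) * (price[i] - meanX);
        }
        double slope = num / den;
        double intercept = meanY - slope * meanX;

        // Use the fitted line to predict sales at a new price point
        double newPrice = 20;
        double predicted = intercept + slope * newPrice;
        System.out.printf("sales = %.2f + %.2f * price; predicted sales at price %.0f: %.1f%n",
                intercept, slope, newPrice, predicted);
    }
}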
Decision trees
Decision trees are classification models that place data into different categories based on
distinct variables. The method is best used when trying to understand an individual's
decisions. The model looks like a tree, with each branch representing a potential choice, with
the leaf of the branch representing the result of the decision. Decision trees are typically easy
to understand and work well when a dataset has several missing variables.
Neural networks
Neural networks are machine learning methods that are useful in predictive analytics when
modeling very complex relationships. Essentially, they are powerhouse pattern recognition
engines. Neural networks are best used to determine nonlinear relationships in datasets,
especially when no known mathematical formula exists to analyze the data. Neural networks
can be used to validate the results of decision trees and regression models.
Cluster Models
Clustering describes the method of aggregating data that share similar attributes. Consider a
large online retailer like Amazon.
Amazon can cluster sales based on the quantity purchased or it can cluster sales based on the
average account age of its consumer. By separating data into similar groups based on shared
features, analysts may be able to identify other characteristics that define future activity.

Time Series Modeling


Sometimes, data is tied to time, and certain predictive models rely on the relationship between events and when they occur. These types of models assess inputs at specific frequencies such
as daily, weekly, or monthly iterations. Then, analytical models seek seasonality, trends, or
behavioral patterns based on timing. This type of predictive model can be useful to predict
when peak customer service periods are needed or when specific sales will be made.

10) What is the relationship between mobile business intelligence and big data?
Mobile business intelligence is a technology-enabled process of extracting meaningful
insights from data and delivering them to end-users via mobile devices. Mobile BI users can
conduct data analysis in real time using smartphones, tablets, and wearables to make quick
data-driven decisions.

● Mobile Business Intelligence (BI) is an evolution of traditional BI technologies,


enabling the delivery and synthesis of business data through mobile devices like
smartphones and tablets. Unlike traditional BI, which is often confined to
desktops and laptops, Mobile BI emphasizes agility, real-time access, and flexible
user experiences. It allows users to retrieve, interact with, and analyze business
data on the move, breaking the shackles of stationary data interaction.

● Why is Mobile BI Needed?

● In the current fast-paced business world, decision-makers are often on the move
and require immediate access to data and analytics. With the increasing
capabilities of mobile devices, including enhanced data storage, processing
power, and connectivity, Mobile BI has become a critical tool for timely and
effective decision-making. It allows for the constant flow of information, keeping
business leaders connected with their operations, sales, and customer interactions
in real-time, regardless of their physical location.
Advantages of mobile BI
1. Simple access

Mobile BI is not restricted to a single mobile device or a certain place. You can view your data at any time and from any location. Having real-time visibility into the firm improves productivity and the daily efficiency of the business. Obtaining a company-wide perspective with a single click simplifies the process.

2. Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to
stay ahead of the competition. Easy access to real-time data improves company
opportunities and raises sales and capital. This also aids in making the necessary
decisions as market conditions change.

3. Simple decision-making

As previously stated, mobile BI provides access to real-time data at any time and from any location. Mobile BI delivers information on demand, helping users obtain what they need when they need it. As a result, decisions are made quickly.

4. Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data when
they need it. Obtaining all of the corporate data with a single click frees up a significant
amount of time to focus on the smooth and efficient operation of the firm. Increased
productivity results in a smooth and quick-running firm.

Disadvantages of mobile BI
1. Stack of data

The primary function of mobile BI is to store data in a systematic manner and then present it to the user as required. As a result, Mobile BI stores all of the information and ends up with heaps of older data. The corporation may only need a small portion of that historical data, but the entire body of information has to be stored, which piles up into a large stack.

2. Expensive

Mobile BI can be quite costly at times. Large corporations can continue to pay for expensive services, but small businesses often cannot. Beyond the cost of the mobile BI software itself, we must also consider the cost of the IT staff needed to keep BI running smoothly, as well as the hardware costs involved.

However, larger corporations do not settle for just one Mobile BI provider for their
organisations; they require multiple. Even when doing basic commercial transactions,
mobile BI is costly.

3. Time consuming

Businesses prefer Mobile BI because they expect a quick turnaround; companies are rarely patient enough to wait long for data before acting on it. In today's fast-paced environment, anything that can produce results quickly is valuable. However, because the system is built on data from the warehouse, implementing BI in an enterprise can take more than 18 months.

4. Data breach

The biggest issue of the user when providing data to Mobile BI is data leakage. If you
handle sensitive data through Mobile BI, a single error can destroy your data as well as
make it public, which can be detrimental to your business.

Many Mobile BI providers are working to make it 100 percent secure to protect their
potential users' data. It is not only something that mobile BI carriers must consider, but
it is also something that we, as users, must consider when granting data access
authorization.

5. Poor quality data

Because we work online in every aspect, a lot of data ends up stored in Mobile BI, which can be a significant problem. A large portion of the data analysed by Mobile BI is irrelevant or completely useless, and this can slow down the entire process. You therefore need to select the data that is important and may be required in the future.

11) What is Hadoop? Explain the HDFS architecture and its components.


What is Hadoop?
Hadoop is a platform that provides both distributed storage and computational
capabilities. Hadoop was first conceived to fix a scalability issue that existed in
Nutch, an open source crawler and search engine.
At the time Google had published papers that described its novel distributed
filesystem, the Google File System (GFS), and Map-Reduce, a computational
framework for parallel processing.
The successful implementation of these papers’ concepts in Nutch resulted in its
split into two separate projects, the second of which became Hadoop, a first class
Apache project.
The Nutch project, and by extension Hadoop, was led by Doug Cutting and Mike
Cafarella.

Figure High-level Hadoop architecture


Core Hadoop components
To understand Hadoop’s architecture we’ll start by looking at the basics of HDFS.
HDFS

HDFS Explained
The Hadoop Distributed File System (HDFS) is fault-tolerant by design. Data is
stored in individual data blocks in three separate copies across multiple nodes and
server racks. If a node or even an entire rack fails, the impact on the broader system
is negligible.

DataNodes process and store data blocks, while NameNodes manage the many
DataNodes, maintain data block metadata, and control client access.
NameNode
Initially, data is broken into abstract data blocks. The file metadata for these blocks,
which include the file name, file permissions, IDs, locations, and the number of
replicas, are stored in a fsimage, on the NameNode local memory.
Should a NameNode fail, HDFS would not be able to locate any of the data sets
distributed throughout the DataNodes. This makes the NameNode the single point of
failure for the entire cluster. This vulnerability is resolved by implementing a
Secondary NameNode or a Standby NameNode.

Secondary NameNode
The Secondary NameNode served as the primary backup solution in early Hadoop
versions. The Secondary NameNode, every so often, downloads the current fsimage
instance and edit logs from the NameNode and merges them. The edited fsimage
can then be retrieved and restored in the primary NameNode.
The failover is not an automated process as an administrator would need to recover
the data from the Secondary NameNode manually.

Standby NameNode
The High Availability feature was introduced in Hadoop 2.0 and subsequent versions
to avoid any downtime in case of the NameNode failure. This feature allows you to
maintain two NameNodes running on separate dedicated master nodes.
The Standby NameNode is an automated failover in case an Active NameNode
becomes unavailable. The Standby NameNode additionally carries out the check-
pointing process. Due to this property, the Secondary and Standby NameNode are
not compatible. A Hadoop cluster can maintain either one or the other.
Zookeeper
Zookeeper is a lightweight tool that supports high availability and redundancy. A
Standby NameNode maintains an active session with the Zookeeper daemon.
If an Active NameNode falters, the Zookeeper daemon detects the failure and carries
out the failover process to a new NameNode. Use Zookeeper to automate failovers
and minimize the impact a NameNode failure can have on the cluster.

DataNode
Each DataNode in a cluster uses a background process to store the individual blocks
of data on slave servers.
By default, HDFS stores three copies of every data block on separate DataNodes.
The NameNode uses a rack-aware placement policy. This means that the
DataNodes that contain the data block replicas cannot all be located on the same
server rack.
A DataNode communicates and accepts instructions from the NameNode roughly
twenty times a minute. Also, it reports the status and health of the data blocks
located on that node once an hour. Based on the provided information, the
NameNode can request the DataNode to create additional replicas, remove them, or
decrease the number of data blocks present on the node.

Rack Aware Placement Policy


One of the main objectives of a distributed storage system like HDFS is to maintain
high availability and replication. Therefore, data blocks need to be distributed not
only on different DataNodes but on nodes located on different server racks.
This ensures that the failure of an entire rack does not terminate all data replicas.
The HDFS NameNode maintains a default rack-aware replica placement policy:
● The first data block replica is placed on the same node as the client.
● The second replica is automatically placed on a random DataNode on a
different rack.
● The third replica is placed in a separate DataNode on the same rack as the
second replica.
● Any additional replicas are stored on random DataNodes throughout the
cluster.
This rack placement policy maintains only one replica per node and sets a limit of
two replicas per server rack.
Rack failures are much less frequent than node failures. HDFS ensures high
reliability by always storing at least one data block replica in a DataNode on a
different rack.
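The placement policy is applied by HDFS itself, but a client can verify where the replicas of a file actually ended up. The following is a small sketch (not from the original text) using the Hadoop FileSystem API; the file path is hypothetical.

// Sketch: list the hosts and rack topology paths holding each block of a file.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Hosts holding replicas of this block, and the racks they sit on
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + Arrays.toString(block.getHosts())
                    + " racks=" + Arrays.toString(block.getTopologyPaths()));
        }
        fs.close();
    }
}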

12) Explain the YARN architecture. How does YARN work?
YARN (Yet Another Resource Negotiator) is the default cluster management
resource for Hadoop 2 and Hadoop 3. In previous Hadoop versions, MapReduce
used to conduct both data processing and resource allocation. Over time the
necessity to split processing and resource management led to the development of
YARN.
YARN’s resource allocation role places it between the storage layer, represented by
HDFS, and the MapReduce processing engine. YARN also provides a generic
interface that allows you to implement new processing engines for various data
types.
ResourceManager
The ResourceManager (RM) daemon controls all the processing resources in a
Hadoop cluster. Its primary purpose is to designate resources to individual
applications located on the slave nodes. It maintains a global overview of the
ongoing and planned processes, handles resource requests, and schedules and
assigns resources accordingly. The ResourceManager is vital to the Hadoop
framework and should run on a dedicated master node.
The RM sole focus is on scheduling workloads. Unlike MapReduce, it has no interest
in failovers or individual processing tasks. This separation of tasks in YARN is what
makes Hadoop inherently scalable and turns it into a fully developed computing
platform.

NodeManager
Each slave node has a NodeManager processing service and a DataNode storage
service. Together they form the backbone of a Hadoop distributed system.
The DataNode, as mentioned previously, is an element of HDFS and is controlled by
the NameNode. The NodeManager, in a similar fashion, acts as a slave to the
ResourceManager. The primary function of the NodeManager daemon is to track
processing-resources data on its slave node and send regular reports to the
ResourceManager.

Containers
Processing resources in a Hadoop cluster are always deployed in containers. A
container has memory, system files, and processing space.
A container deployment is generic and can run any requested custom resource on
any system. If a requested amount of cluster resources is within the limits of what’s
acceptable, the RM approves and schedules that container to be deployed.

The container processes on a slave node are initially provisioned, monitored, and
tracked by the NodeManager on that specific slave node.

Application Master
Every application running in the cluster has its own dedicated Application Master, and the Application Master itself is deployed in a container. Even MapReduce has an Application Master that manages the execution of its map and reduce tasks.

As long as it is active, an Application Master sends messages to the Resource


Manager about its current status and the state of the application it monitors. Based
on the provided information, the Resource Manager schedules additional resources
or assigns them elsewhere in the cluster if they are no longer needed.

The Application Master oversees the full lifecycle of an application, all the way from
requesting the needed containers from the RM to submitting container lease
requests to the NodeManager.

JobHistory Server
The JobHistory Server allows users to retrieve information about applications that
have completed their activity. The REST API provides interoperability and can
dynamically inform users on current and completed jobs served by the server in
question.

How Does YARN Work?


A basic workflow for deployment in YARN starts when a client application submits a
request to the ResourceManager.

1. The ResourceManager instructs a NodeManager to start an Application


Master for this request, which is then started in a container.
2. The newly created Application Master registers itself with the RM. The
Application Master proceeds to contact the HDFS NameNode and determine
the location of the needed data blocks and calculates the amount of map and
reduce tasks needed to process the data.
3. The Application Master then requests the needed resources from the RM and
continues to communicate the resource requirements throughout the life-
cycle of the container.
4. The RM schedules the resources along with the requests from all the other
Application Masters and queues their requests. As resources become
available, the RM makes them available to the Application Master on a
specific slave node.
5. The Application Master contacts the NodeManager for that slave node and
requests it to create a container by providing variables, authentication tokens,
and the command string for the process. Based on that request, the
NodeManager creates and starts the container.
6. The Application Master then monitors the process and reacts in the event of
failure by restarting the process on the next available slot. If it fails after four
different attempts, the entire job fails. Throughout this process, the
Application Master responds to client status requests.

Once all tasks are completed, the Application Master sends the result to the client
application, informs the RM that the application has completed its task, deregisters
itself from the Resource Manager, and shuts itself down.

The RM can also instruct the NodeManager to terminate a specific container during the process in case of a processing priority change.
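As a rough sketch of how a client program can talk to the ResourceManager outside of job submission, the snippet below uses the YarnClient API to list the NodeManagers the RM currently sees; treat it as an illustration of the RM/NodeManager relationship rather than a complete application.

// Sketch: query the ResourceManager for a report of running NodeManagers.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodesReport {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the ResourceManager for all NodeManagers in the RUNNING state
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " containers=" + node.getNumContainers()
                    + " capability=" + node.getCapability());
        }
        yarnClient.stop();
    }
}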

13) Explain every phase of MapReduce.

MapReduce is a programming algorithm that processes data dispersed across the


Hadoop cluster. As with any process in Hadoop, once a MapReduce job starts, the
ResourceManager requisitions an Application Master to manage and monitor the
MapReduce job lifecycle.

The Application Master locates the required data blocks based on the information
stored on the NameNode. The AM also informs the ResourceManager to start a
MapReduce job on the same node the data blocks are located on. Whenever
possible, data is processed locally on the slave nodes to reduce bandwidth usage
and improve cluster efficiency.

The input data is mapped, shuffled, and then reduced to an aggregate result. The
output of the MapReduce job is stored and replicated in HDFS.

The Hadoop servers that perform the mapping and reducing tasks are often referred
to as Mappers and Reducers.

The number of mappers to use depends on the size of the data being processed and the block (split) size, since each input split is typically assigned its own map task.

Map Phase
The mapping process ingests individual logical expressions of the data stored in the
HDFS data blocks. These expressions can span several data blocks and are called
input splits. Input splits are introduced into the mapping process as key-value pairs.
A mapper task goes through every key-value pair and creates a new set of key-value
pairs, distinct from the original input data. The complete assortment of all the key-
value pairs represents the output of the mapper task.

Based on the key from each pair, the data is grouped, partitioned, and shuffled to the
reducer nodes.

A client submitting a job to MapReduce


The role of the programmer is to define map and reduce functions, where the map
function outputs key/value tuples, which are processed by reduce functions to
produce the final output.

Shuffle and Sort Phase


Shuffle is a process in which the results from all the map tasks are copied to the
reducer nodes. The copying of the map task output is the only exchange of data
between nodes during the entire MapReduce job.

The output of a map task needs to be arranged to improve the efficiency of the
reduce phase. The mapped key-value pairs, being shuffled from the mapper nodes,
are arrayed by key with corresponding values. A reduce phase starts after the input
is sorted by key in a single input file.

The shuffle and sort phases run in parallel. Even as the map outputs are retrieved
from the mapper nodes, they are grouped and sorted on the reducer nodes.

MapReduce’s shuffle and sort


Reduce Phase
The map outputs are shuffled and sorted into a single reduce input file located on the
reducer node. A reduce function uses the input file to aggregate the values based on
the corresponding mapped keys. The output from the reduce process is a new key-
value pair. This result represents the output of the entire MapReduce job and is, by
default, stored in HDFS.
All reduce tasks take place simultaneously and work independently from one
another. A reduce task is also optional.

There can be instances where the result of a map task is the desired result and there
is no need to produce a single output value.

Figure A logical view of the reduce function
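To tie the phases together, below is the standard word-count example written against the Hadoop MapReduce Java API. The mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts; the driver/job setup is omitted here for brevity.

// Word count: the classic illustration of the map and reduce phases.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every word in the input split, emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}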

14) Explain how to move data in and out of Hadoop.

Data movement is one of those things that you aren’t likely to think too much
about until you’re fully committed to using Hadoop on a project, at which point
it becomes this big scary unknown that has to be tackled. How do you get your
log data sitting across thousands of hosts into Hadoop? What’s the most efficient
way to get your data out of your relational and No/NewSQL systems and into
Hadoop? How do you get Lucene indexes generated in Hadoop out to your
servers? And how can these processes be automated?
This topic starts by highlighting key data-movement properties. We'll start with some simple techniques, such as using the command line and Java for ingress, but we'll quickly move on to more advanced techniques like using NFS.
Ingress and egress refer to data movement into and out of a system, respectively.
Data egress refers to data leaving a network in transit to an external location.
Outbound email messages, cloud uploads, or files being moved to external
storage are simple examples of data egress
Data ingress in computer networking, including: Data that is downloaded from the internet to
a local computer. Email messages that are delivered to a mailbox. VoIP calls that come into a
network.
Once the low-level tooling is out of the way, we’ll survey higher-level tools that have
simplified the process of ferrying data into Hadoop. We’ll look at how you can automate the
movement of log files with Flume, and how Sqoop can be used to move relational data.
Moving data into Hadoop
The first step in working with data in Hadoop is to make it available to Hadoop. There are
two primary methods that can be used to move data into Hadoop: writing external data at the
HDFS level (a data push), or reading external data at the MapReduce level (more like a pull).
Reading data in MapReduce has advantages in the ease with which the operation can be
parallelized and made fault tolerant. Not all data is accessible from MapReduce, however,
such as in the case of log files, which is where other systems need to be relied on for
transportation, including HDFS for the final data hop.
In this section we’ll look at methods for moving source data into Hadoop. I’ll use the design
considerations in the previous section as the criteria for examining and understanding the
different tools.
Roll your own ingest
Hadoop comes bundled with a number of methods to get your data into HDFS. This section will examine various ways that these built-in tools can be used for your data movement needs. The first and potentially easiest tool you can use is the HDFS command line (for example, hdfs dfs -put to copy a local file into HDFS).
Picking the right ingest tool for the job
The low-level tools in this section work well for one-off file movement activities, or when working with legacy data sources and destinations that are file-based. But moving data in this way is quickly becoming obsolete given the availability of tools such as Flume and Kafka (covered later in this chapter), which offer automated data movement pipelines.
Kafka is a much better platform for getting data from A to B (and B can be a Hadoop cluster) than the old-school "let's copy files around!" approach. With Kafka, you only need to pump your data into Kafka, and you have the ability to consume the data in real time (such as via Storm) or in offline/batch jobs (such as via Camus).
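For a programmatic equivalent of the command-line approach, the sketch below pushes a file into HDFS (and pulls one back out) using the Hadoop FileSystem API; the local and HDFS paths are hypothetical.

// Sketch: ingress (copy into HDFS) and egress (copy out of HDFS) via the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngressEgress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ingress: push a local log file into an HDFS landing directory
        fs.copyFromLocalFile(new Path("/var/log/app/access.log"),
                             new Path("/data/landing/access.log"));

        // Egress: pull a result file out of HDFS to the local filesystem
        fs.copyToLocalFile(new Path("/data/output/part-r-00000"),
                           new Path("/tmp/part-r-00000"));
        fs.close();
    }
}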

15) Explain the input/output architecture of MapReduce.

Your data might be XML files sitting behind a number of FTP servers, text log files sitting on
a central web server, or Lucene indexes in HDFS. How does MapReduce support reading
and writing to these different serialization structures across the various storage mechanisms?
You’ll need to know the answer in order to support a specific serialization format.
Data input :-
The two classes that support data input in MapReduce are InputFormat and Record-Reader.
The InputFormat class is consulted to determine how the input data should be partitioned for
the map tasks, and the RecordReader performs the reading of data from the inputs.
INPUT FORMAT:-
Every job in MapReduce must define its inputs according to contracts specified in the
InputFormat abstract class. InputFormat implementers must fulfill three contracts: first, they
describe type information for map input keys and values; next, they specify how the input
data should be partitioned; and finally, they indicate the
RecordReader instance that should read the data from source

RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an input
split and provide each record in the form of a key/value pair for use by mappers. A task is
commonly created for each input split, and each task has a single RecordReader that’s
responsible for reading the data for that input split.

DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input data.Two
classes must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some
basic validation of the data sink properties, and the RecordWriter writes each reducer output
to the data sink.

OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class, as shown in figure 3.5, defines the
contracts that implementers must fulfill, including checking the information related to the job
output, providing a RecordWriter, and specifying an output committer, which allows writes to
be staged and then made “permanent” upon task and/or job success.

RECORD WRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink. It’s a
simple class.

SequenceFileInputFormat – Hadoop MapReduce is not restricted to processing textual data.


It has support for binary formats, too. Hadoop’s sequence file format stores sequences of
binary key- value pairs. Sequence files are well suited as a format for MapReduce data
because they are splittable (they have sync points so that readers can synchronize with record
boundaries from an arbitrary point in the file, such as the start of a split), they support
compression as a part of the format, and they can store arbitrary types using a variety of
serialization framework

SequenceFileAsTextInputFormat – SequenceFileAsTextInputFormat is a variant of

SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects.
The conversion is performed by calling toString() on the keys and values. This format makes
sequence files suitable input for Streaming

SequenceFileAsBinaryInputFormat – SequenceFileAsBinaryInputFormat is a variant of


SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary
objects. They are encapsulated as BytesWritable objects, and the application is free to
interpret the underlying byte array as it pleases.

FixedLengthInputFormat – FixedLengthInputFormat is for reading fixed-width binary


records from a file, when the records are not separated by delimiters. The record size must be
set via fixedlengthinputformat.record.length.
SequenceFileOutputFormat – its for writing binary Output. As the name indicates,
SequenceFileOutputFormat writes sequence files for its output. This is a good choice of
output if it forms the input to a further MapReduce job, since it is compact and is readily
compressed.

SequenceFileAsBinaryOutputFormat –Is the counterpart to SequenceFile As Binary Input


Format writes keys and values in raw binary format into a sequence file container.

MapFileOutputFormat – MapFileOutputFormat writes map files as output. The keys in a


MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted
order.
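The following sketch shows how the InputFormat and OutputFormat classes described above are wired into a job through the Job API; the paths are hypothetical and the mapper/reducer classes are omitted.

// Sketch: wiring input and output formats into a MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format wiring example");
        job.setJarByClass(FormatWiring.class);

        // Input side: how splits are formed and how records are read
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));      // hypothetical path

        // Output side: how reducer output is validated and written
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));   // hypothetical path

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // job.setMapperClass(...); job.setReducerClass(...);  // application classes omitted
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}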

16) What is data serialization in big data?


What is Serialization?
Serialization is the process of converting a data object—a combination of code and data
represented within a region of data storage—into a series of bytes that saves the state of the
object in an easily transmittable form. In this serialized form, the data can be delivered to
another data store (such as an in-memory computing platform), application, or some other
destination.

Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.

The reverse process—constructing a data structure or object from a series of bytes—is


deserialization. The deserialization process recreates the object, thus making the data easier to
read and modify as a native structure in a programming language.
Serialization and deserialization work together to transform/recreate data objects to/from a
portable format.

What Is Data Serialization in Big Data?

Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment:

● Structure: By inserting some schema or criteria for a data structure through


serialization on read, we can avoid reading data that misses mandatory fields, is
incorrectly classified, or lacks some other quality control requirement.
● Portability: Big data comes from a variety of systems and may be written in a variety
of languages. Serialization can provide the necessary uniformity to transfer such data
to other enterprise systems or applications.
● Versioning: Big data is constantly changing. Serialization allows us to apply version
numbers to objects for lifecycle
management.
Data serialization is the process of converting data objects present in complex data structures
into a byte stream for storage, transfer and distribution purposes on physical devices.

Computer systems may vary in their hardware architecture, OS, addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform-and-language-
neutral data format that all systems understand.

Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is carried
out. Reconstructed objects are clones of the original object.

Choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML, JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.

Computer data is generally organized in data structures such as arrays, tables, trees, and classes. When data structures need to be stored or transmitted to another location, such as across a network, they are serialized.
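In the Hadoop ecosystem, the Writable interface is the framework's own serialization contract. The sketch below defines a custom Writable whose write() and readFields() methods serialize and deserialize its fields to and from a byte stream; the type and its fields are hypothetical.

// Sketch: a custom Hadoop Writable that serializes a URL and a view count.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
    private String url;
    private long viewCount;

    public PageViewWritable() { }                      // no-arg constructor required by Hadoop

    public PageViewWritable(String url, long viewCount) {
        this.url = url;
        this.viewCount = viewCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(url);
        out.writeLong(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        url = in.readUTF();
        viewCount = in.readLong();
    }

    @Override
    public String toString() {
        return url + "\t" + viewCount;
    }
}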
17) Differentiate between RDBMS and Hadoop.
RDBMS (Relational Database Management System): RDBMS is an information management system which is based on a data model. In RDBMS, tables are used for information storage. Each row of the table represents a record and each column represents an attribute of the data. The organization of data and its manipulation processes differ in RDBMS from other databases. RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
Hadoop: It is an open-source software framework used for storing data and running applications on a group of commodity hardware. It has large storage capacity and high processing power. It can manage multiple concurrent processes at the same time. It is used in predictive analysis, data mining and machine learning. It can handle both structured and unstructured forms of data. It is more flexible in storing, processing, and managing data than traditional RDBMS. Unlike traditional systems, Hadoop enables multiple analytical processes on the same data at the same time. It supports scalability very flexibly.
Below is a table of differences between RDBMS and Hadoop:

S.No. | RDBMS | Hadoop
1. | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software framework used for storing data and running applications or processes concurrently.
2. | In this, structured data is mostly processed. | In this, both structured and unstructured data is processed.
3. | It is best suited for an OLTP environment. | It is best suited for Big Data.
4. | It is less scalable than Hadoop. | It is highly scalable.
5. | Data normalization is required in RDBMS. | Data normalization is not required in Hadoop.
6. | It stores transformed and aggregated data. | It stores huge volumes of data.
7. | It has no latency in response. | It has some latency in response.
8. | The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
9. | High data integrity available. | Lower data integrity than RDBMS.
10. | Cost applies for the licensed software. | Free of cost, as it is open-source software.
18) Explain the HDFS daemons and the components of HDFS.

HDFS Daemons:

Daemons mean processes. Hadoop daemons are a set of processes that run on Hadoop. Hadoop is a framework written in Java, so all these processes are Java processes.
Apache Hadoop 2 consists of the following daemons:

● NameNode

● DataNode

● Secondary Name Node

● Resource Manager

● Node Manager
Namenode, Secondary NameNode, and Resource Manager work on a Master System
while the Node Manager and DataNode work on the Slave machine.

1. NameNode

1. NameNode is the main central component of HDFS architecture framework.


2. NameNode is also known as Master node.
3. HDFS Namenode stores meta-data i.e. number of data blocks, file name, path, Block IDs, Block
location, no. of replicas, and also Slave related configuration. This meta-data is available in memory
in the master for faster retrieval of data.
4. NameNode keeps metadata related to the file system namespace in memory for quicker response times. Hence, more memory is needed, and the NameNode should be deployed on a reliable, well-provisioned machine.
5. NameNode maintains and manages the slave nodes, and assigns tasks to them.
6. NameNode has knowledge of all the DataNodes containing data blocks for a given
file.
7. NameNode coordinates with hundreds or thousands of data nodes and serves
the requests coming from client applications.

Two files ‘FSImage’ and the ‘EditLog’ are used to store metadata information.

FsImage: It is the snapshot of the file system when the Name Node is started. It is an "Image file". The FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It also contains a serialized form of all the directories and file inodes in the filesystem. Each inode is an internal representation of a file or directory's metadata.

EditLogs: It contains all the recent modifications made to the file system since the most recent FsImage. The NameNode receives create/update/delete requests from clients; each request is first recorded in the edits file.
Functions of NameNode in HDFS

1. It is the master daemon that maintains and manages the DataNodes (slave nodes).
2. It records the metadata of all the files stored in the cluster, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc.
3. It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
4. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
5. It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
6. The NameNode is also responsible for taking care of the replication factor of all the blocks.
7. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.

Features:
● It never stores the data that is present in the file.

● As Namenode works on the Master System, the Master system should


have good processing power and more RAM than Slaves.
● It stores the information of DataNode such as their Block id’s and Number of Blocks
How to start Name Node?
hadoop-daemon.sh start namenode
How to stop Name Node?
hadoop-daemon.sh stop namenode
The namenode daemon is a master daemon and is responsible for storing all the location information of the files present in HDFS. The actual data is never stored on a namenode. In other words, it holds the metadata of the files in HDFS.

The name node maintains the entire metadata in RAM, which helps clients receive
quick responses to read requests. Therefore, it is important to run name node from a
machine that has lots of RAM at its disposal. The higher the number of files in HDFS,
the higher the consumption of RAM. The name node daemon also maintains a
persistent checkpoint of the metadata in a file stored on the disk called the fsimage file.

Whenever a file is placed/deleted/updated in the cluster, an entry of this action is


updated in a file called the edits logfile. After updating the edits log, the metadata
present in-memory is also updated accordingly. It is important to note that the fsimage
file is not updated for every write operation.

In case the name node daemon is restarted, the following sequence of events occur at
name node boot up:

1. Read the fsimage file from the disk and load it into memory (RAM).
2. Read the actions that are present in the edits log and apply each action to the
in-memory representation of the fsimage file.
3. Write the modified in-memory representation to the fsimage file on the disk.

The preceding steps make sure that the in-memory representation is up to date.

The namenode daemon is a single point of failure in Hadoop 1.x, which means that if
the node hosting the namenode daemon fails, the filesystem becomes unusable. To
handle this, the administrator has to configure the namenode to write the fsimage file to
the local disk as well as a remote disk on the network. This backup on the remote disk
can be used to restore the namenode on a freshly installed server. Newer versions of
Apache Hadoop (2.x) now support High Availability (HA), which deploys two
namenodes in an active/passive configuration, wherein if the active namenode fails, the
control falls onto the passive namenode, making it active. This configuration reduces
the downtime in case of a namenode failure.

Since the fsimage file is not updated for every operation, it is possible the edits logfile
would grow to a very large file. The restart of namenode service would become very
slow because all the actions in the large edits logfile will have to be applied on the
fsimage file. The slow boot up time could be avoided using the secondary namenode
daemon.

The namespace image and the edit log store information about the data and the metadata. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure. The DataNode is a multiple-instance server; there can be any number of DataNode servers, depending on the type of network and the storage system.

The DataNode server stores and maintains the data blocks, while the NameNode server provisions the data blocks on the basis of the type of job submitted by the client.

DataNode also stores and retrieves the blocks when asked by clients or the NameNode.
Furthermore, it reads/writes requests and performs block creation, deletion, and replication of
instruction from the NameNode. There can be only one Secondary NameNode server in a cluster.
Note that you cannot treat the Secondary NameNode server as a disaster recovery server. However,
it partially restores the NameNode server in case of a failure.

Data Node
1. Data Node is also known as the Slave node.
2. In the Hadoop HDFS architecture, the Data Node stores the actual data in HDFS.
3. Data Nodes are responsible for serving read and write requests from the clients.
4. Data Nodes can be deployed on commodity hardware.
5. Data Nodes send information to the Name Node about the files and blocks stored in that node and respond to the Name Node for all filesystem operations.
6. When a Data Node starts up, it announces itself to the Name Node along with the list of blocks it is responsible for.
7. A Data Node is usually configured with a lot of hard disk space, because the actual data is stored in the Data Node.

Functions of Data Node in HDFS


1. These are slave daemons or processes which run on each slave machine.
2. The actual data is stored on Data Nodes.
3. The Data Nodes perform the low-level read and write requests from the file system's clients.
4. Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. When the Name Node does not receive a heartbeat from a Data Node for 10 minutes, it considers that particular Data Node dead and starts the process of block replication on some other Data Node.
5. All Data Nodes are synchronized in the Hadoop cluster in a way that they can communicate with one another and make sure of
i. Balancing the data in the system
ii. Moving data to keep replication high
iii. Copying data when required
Basic Operations of Datanode:
● Datanodes are responsible for storing the actual data.

● Upon instruction from the Namenode, they perform operations like
creation/replication/deletion of data blocks.
● When one of the Datanodes goes down, it does not affect the Hadoop cluster, due to replication.
● All Datanodes are synchronized in the Hadoop cluster in a way that they can communicate with each other for various operations.
What happens if one of the Datanodes fails in HDFS?
The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster. Every Datanode sends a heartbeat message every 3 seconds to the Namenode.
The health report is simply information about whether a particular Datanode is working properly or not. In other words, it tells whether that particular Datanode is alive or not.
A block report of a particular Datanode contains information about all the blocks that reside on the corresponding Datanode. When the Namenode doesn't receive any heartbeat message for 10 minutes (by default) from a particular Datanode, the corresponding Datanode is considered dead or failed by the Namenode.
Since blocks will then be under-replicated, the system starts the replication process from one Datanode to another by taking all block information from the block report of the corresponding Datanode. The data for replication transfers directly from one Datanode to another without passing through the Namenode.

How to start Data Node?


hadoop-daemon.sh start datanode
How to stop Data Node?
hadoop-daemon.sh stop datanode

The datanode daemon acts as a slave node and is responsible for storing the actual
files in HDFS. The files are split as data blocks across the cluster. The blocks are
typically 64 MB to 128 MB size blocks. The block size is a configurable parameter.
The file blocks in a Hadoop cluster also replicate themselves to other datanodes for
redundancy so that no data is lost in case a datanode daemon fails. The datanode
daemon sends information to the namenode daemon about the files and blocks stored
in that node and responds to the namenode daemon for all filesystem operations. The
following diagram shows how files are stored in the cluster:

File blocks of files A, B, and C are replicated across multiple nodes of the cluster for
redundancy. This ensures availability of data even if one of the nodes fail.
You can also see that blocks of file A are present on nodes 2, 4, and 6; blocks of file B
are present on nodes 3, 5, and 7; and blocks of file C are present on 4, 6, and 7. The
replication factor configured for this cluster is 3, which signifies that each file block is
replicated three times across the cluster. It is the responsibility of the namenode
daemon to maintain a list of the files and their corresponding locations on the cluster.
Whenever a client needs to access a file, the namenode daemon provides the location
of the file to client and the client, then accesses the file directly from the data node
daemon.

Secondary Name Node


What is Secondary Name Node?

Role of Secondary Namenode in Managing the Filesystem Metadata.

Each and every transaction that occurs on the file system is recorded within the edit log
file. At some point of time this file becomes very large.

The Secondary Name Node periodically gets the edit logs from the Name Node and applies them to the fsimage. The new fsimage is then copied back to the Name Node, which uses it for the next restart, reducing the startup time.
It is a helper node to the Name Node; to be precise, the Secondary Name Node's whole purpose is to perform checkpointing in HDFS, which helps the Name Node function effectively. Hence, it is also called the Checkpoint node.

Now there are two important files which reside in the namenode's current directory:

1. FsImage file: This file is the snapshot of the HDFS metadata at a certain point of time.
2. Edits Log file: This file stores the records for changes that have been made in the HDFS namespace.

The main function of the Secondary namenode is to store the latest copy of the FsImage and the Edits Log files.

How does it help?

When the namenode is restarted, the latest Edits Log files are applied to the FsImage file in order to keep the HDFS metadata up to date. So it becomes very important to store a copy of these two files, which is done by the secondary namenode.

To keep the latest versions of these two files, the secondary namenode takes checkpoints on an hourly basis, which is the default time gap.

Checkpoint:-

A checkpoint is nothing but the updating of the latest FsImage file by applying the latest Edits Log files to it. If the time gap between checkpoints is large, there will be too many Edits Log files generated and it will be very cumbersome and time consuming to apply them all at once on the latest FsImage file. This may lead to a long startup time for the primary namenode after a reboot.
However, the secondary namenode is just a helper to the primary namenode in an HDFS cluster, as it cannot perform all the functions of the primary namenode.

Note:-
There are two options which can be used along with the secondary namenode command:

1. -geteditsize: this option helps to find the current size of the in-progress edits file present in the namenode's current directory (the ongoing, in-progress Edits Log file).
2. -checkpoint [force]: this option forcefully checkpoints the secondary namenode to the latest state of the primary namenode, whatever the size of the Edits Log file may be. Ideally, the size of the Edits Log file should be greater than or equal to the checkpoint file size.

The Secondary Namenode helps to overcome the above issues by taking over the responsibility of merging the edit logs with the fsimage from the namenode:
1. It gets the edit logs from the namenode at regular intervals and applies them to the fsimage.
2. Once it has the new fsimage, it copies it back to the namenode.
3. The namenode will use this fsimage for the next restart, which will reduce the startup time.
Things have changed over the years, especially with Hadoop 2.x. The Namenode is now highly available with a failover feature. The Secondary Namenode is optional now, and a Standby Namenode is used for the failover process. The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.

The Secondary namenode is a helper node in Hadoop. To understand the functionality of the secondary namenode, let's first understand how the namenode works.

The name node stores metadata like file system namespace information, block information, etc. in memory. It also stores a persistent copy of the same on disk. The name node stores this information in two files:

fsimage: a snapshot of the file system, which stores information like modification time, access time, permissions, and replication.
Edit logs: details of all the activities/transactions being performed on HDFS.

When the namenode is in the active state, the edit log grows continuously, because the edit logs can only be applied to the fsimage at the time of a name node restart to obtain the latest state of HDFS. If the edit logs grow significantly and the name node tries to apply them to the fsimage at restart time, the process can take very long; here the secondary namenode comes into play.
The secondary namenode performs checkpointing for the name node: it reads the edit logs from the namenode at a specific interval and applies them to the secondary name node's copy of the fsimage. In this way the fsimage file will have the most recent state of HDFS.
The secondary namenode copies the new fsimage to the primary, so the fsimage is updated.

Since the fsimage is updated, there will be no overhead of applying the edit logs at the moment of restarting the cluster.
Secondary namenode is a helper node and can’t replace the name node.

Secondary Name Node is used for taking the hourly backup of the data.

In case the Hadoop cluster fails or crashes, the secondary Namenode will have taken the hourly backup or checkpoints of that data and stored it in a file named fsimage. This file then gets transferred to a new system.

A new Meta Data is assigned to that new system and a new Master is created with this
Meta Data, and the cluster is made to run again correctly. This is the benefit of
Secondary Name Node.

Now in Hadoop 2, we have the High-Availability and Federation features that minimize the importance of the Secondary Name Node.


Major Function Of Secondary NameNode:


● It merges the Edit logs and Fsimage from the NameNode together.

● It continuously reads the metadata from the RAM of the NameNode and
writes it to the hard disk.

As secondary NameNode keeps track of checkpoints in a Hadoop Distributed File


System, it is also known as the checkpoint Node.

The Hadoop Daemons' Ports

Name Node – 50070

Data Node – 50075

Secondary Name Node – 50090

These ports can be configured manually in the hdfs-site.xml and mapred-site.xml files.

4. Resource Manager

The Resource Manager is also known as the Global Master Daemon that works on the Master System.

The Resource Manager manages the resources for the applications that are running in a Hadoop cluster.
The Resource Manager mainly consists of 2 things:

1. Applications Manager
2. Scheduler

The Applications Manager is responsible for accepting requests from clients and also allocates a container on one of the slaves in the Hadoop cluster to host the Application Master.
The Scheduler is utilized for providing resources to applications in the Hadoop cluster and for monitoring them.

How to start ResourceManager?


yarn-daemon.sh start resourcemanager
How to stop ResourceManager?
yarn-daemon.sh stop resourcemanager

5. Node Manager

The Node Manager works on the slave systems and manages the memory and disk resources within its node. Each slave node in a Hadoop cluster has a single NodeManager daemon running on it. It also sends this monitoring information to the Resource Manager.
How to start Node Manager?
yarn-daemon.sh start nodemanager
How to stop Node Manager?
yarn-daemon.sh stop nodemanager

The Hadoop Daemon’s Port

ResourceManager 8088

NodeManager 8042
The below diagram shows how Hadoop works.

19) Explain the anatomy of file read and write in HDFS.


Anatomy of File Write and Read
Big data is nothing but a collection of data sets that are large, complex, and which are
difficult to store and process using available data management tools or traditional data
processing applications. Hadoop is a framework (open source) for writing, running,
storing, and processing large datasets in a parallel and distributed manner.
It is a solution that is used to overcome the challenges faced by big data.
Hadoop has two components:

● HDFS (Hadoop Distributed File System)

● YARN (Yet Another Resource Negotiator)

We focus on one of the components of Hadoop i.e., HDFS and the anatomy of file
reading and file writing in HDFS. HDFS is a file system designed for storing very
large files (files that are hundreds of megabytes, gigabytes, or terabytes in size) with
streaming data access, running on clusters of commodity hardware(commonly
available hardware that can be obtained from various vendors). In simple terms, the
storage unit of Hadoop is called HDFS.
Some of the characteristics of HDFS are:
● Fault tolerance
● Scalability
● Distributed storage
● Reliability
● High availability
● Cost-effectiveness
● High throughput
Building Blocks of Hadoop:
● Name Node
● Data Node
● Secondary Name Node (SNN)
● Job Tracker
● Task Tracker

Anatomy of File Read in HDFS

Let’s get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes with the help of a diagram. Consider the figure:

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block.
The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block.

This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
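To connect these steps to actual client code, here is a minimal, hedged sketch of a read using the Hadoop FileSystem API; the namenode URI and file path are made-up placeholders.

// Minimal sketch of an HDFS read via the FileSystem API.
// The URI and path below are illustrative placeholders.
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/example.txt"; // hypothetical file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);     // contacts the name node
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                    // Step 1: returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // Steps 3-5: streams blocks from data nodes
        } finally {
            IOUtils.closeStream(in);                        // Step 6: close() on the stream
        }
    }
}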
Anatomy of File Write in HDFS

Next, we’ll look at how files are written to HDFS. Consider figure 1.2 to get a better understanding of the concept.

Note: HDFS follows the Write Once, Read Many model. In HDFS we cannot edit files that are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node creates a record of the new file; otherwise, the file can’t be created and the client is thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.
Step 6: When the client finishes writing, it flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.

HDFS follows the Write Once, Read Many model. So, we can’t edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the system.
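For completeness, a corresponding minimal sketch of a write through create(); again, the URI, path, and content are placeholders for illustration.

// Minimal sketch of an HDFS write via the FileSystem API.
// The URI, path, and content are illustrative placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/output.txt";  // hypothetical file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // create() asks the name node to add the file to the namespace (Steps 1-2)
        FSDataOutputStream out = fs.create(new Path(uri));
        out.writeUTF("hello hdfs");   // packets flow through the data node pipeline (Steps 3-4)
        out.close();                  // flushes remaining packets and waits for acks (Steps 5-6)
    }
}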

20) Explain replication management policy of hdfs?


Blocks:
Now, as we know that the data in HDFS is scattered across the DataNodes as blocks.
Let’s have a look at what is a block and how is it formed?

Blocks are nothing but the smallest contiguous location on your hard drive where data is stored. In general, in any file system, you store data as a collection of blocks. Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.

It is not necessary that in HDFS, each file is stored in exact multiple of the
configured block size (128 MB, 256 MB etc.). Let’s take an example where I have a file “example.txt”
of size 514 MB as shown in above figure. Suppose that we are using the default configuration of block
size, which is 128 MB. Then, how many blocks will be created? 5, Right. The first four blocks will be
of 128 MB. But, the last block will be of 2 MB size only.

Now, you must be wondering why we need such a large block size, i.e. 128 MB.

Well, whenever we talk about HDFS, we talk about huge data sets, i.e. terabytes and petabytes of data. So, if we had a block size of, say, 4 KB, as in a Linux file system, we would have far too many blocks and therefore far too much metadata. Managing this huge number of blocks and the associated metadata would create a huge overhead, which is something we don’t want.

Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance. The default replication
factor is 3 which is again configurable. So, as you can see in the figure below where
each block is replicated three times and stored on different DataNodes (considering
the default replication factor):

Therefore, if you are storing a file of 128 MB in HDFS using the default configuration,
you will end up occupying a space of 384 MB (3*128 MB) as the blocks will be
replicated three times and each replica will be residing on a different DataNode.

Note: The NameNode collects block reports from the DataNodes periodically to maintain the replication factor. Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
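As a small, hedged illustration, the replication factor can also be inspected or changed per file from the command line (the path below is made up):

# Default replication factor is set cluster-wide via dfs.replication in hdfs-site.xml
# Change the replication factor of an existing file (-w waits until replication completes)
hdfs dfs -setrep -w 2 /user/demo/example.txt
# List the file; the replication factor appears in the listing output
hdfs dfs -ls /user/demo/example.txt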
Rack Awareness:

Anyways, moving ahead, let’s talk more about how HDFS places replica and what is
rack awareness? Again, the NameNode also ensures that all the replicas are not stored
on the same rack or a single rack. It follows an in-built Rack Awareness Algorithm to
reduce latency as well as provide fault tolerance.

Considering a replication factor of 3, the Rack Awareness Algorithm says that the first replica of a block will be stored on the local rack and the next two replicas will be stored on a different (remote) rack, but on different DataNodes within that remote rack, as shown in the figure above.

This is how an actual Hadoop production cluster looks like. Here, you have multiple
racks populated with DataNodes:

Advantages of Rack Awareness:

So, now you will be thinking why do we need a Rack Awareness algorithm? The
reasons are:
● To improve network performance: The communication between nodes residing on different racks is directed via a switch. In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks. So, Rack Awareness helps you reduce write traffic between different racks, providing better write performance. You also gain increased read performance because you are using the bandwidth of multiple racks.

● To prevent loss of data: We don’t have to worry about the data even if an
entire rack fails because of the switch failure or power failure. And if you think
about it, it will make sense, as it is said that never put all your eggs in the same
basket.

21) Explain about hdfs read and write architecture?


HDFS Read/Write Architecture:
Now let’s talk about how the data read/write operations are performed on HDFS. HDFS follows the Write Once – Read Many philosophy, so you can’t edit files already stored in HDFS, but you can append new data by re-opening the file.

HDFS Write Architecture:


Suppose a situation where an HDFS client wants to write a file named “example.txt” of size 248 MB.

Assume that the block size is configured to be 128 MB (the default). So, the client will divide the file “example.txt” into 2 blocks – one of 128 MB (Block A) and the other of 120 MB (Block B).

Now, the following protocol will be followed whenever the data is written into HDFS:

● At first, the HDFS client will reach out to the NameNode for a Write Request
against the two blocks, say, Block A & Block B.
● The NameNode will then grant the client the write permission and will provide
the IP addresses of the DataNodes where the file blocks will be copied
eventually.
● The selection of IP addresses of DataNodes is purely randomized based on
availability, replication factor and rack awareness that we have discussed
earlier.
● Let’s say the replication factor is set to default i.e. 3. Therefore, for each block
the NameNode will be providing the client a list of (3) IP addresses of
DataNodes. The list will be unique for each block.
● Suppose, the NameNode provided following lists of IP addresses to the client:
o For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}
o For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of
DataNode 9}

● Each block will be copied to three different DataNodes to keep the replication factor consistent throughout the cluster.
● Now the whole data copy process will happen in three stages:

1. Set up of Pipeline
2. Data streaming and replication
3. Shutdown of Pipeline (Acknowledgement stage)

1. Set up of Pipeline:

Before writing the blocks, the client confirms whether the DataNodes, present in each
of the list of IPs, are ready to receive the data or not. In doing so, the client creates a
pipeline for each of the blocks by connecting the individual DataNodes in the
respective list for that block. Let us consider Block A. The list of DataNodes provided
by the NameNode is:

For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.

So, for block A, the client will be performing the following steps to create a pipeline:

● The client will choose the first Data Node in the list (Data Node IPs for Block
A) which is Data Node 1 and will establish a TCP/IP connection.
● The client will inform Data Node 1 to be ready to receive the block. It will also
provide the IPs of next two Data Nodes (4 and 6) to the Data Node 1 where the
block is supposed to be replicated.
● The Data Node 1 will connect to Data Node 4. The DataNode 1 will inform
Data Node 4 to be ready to receive the block and will give it the IP of
DataNode 6. Then, Data Node 4 will tell Data Node 6 to be ready for receiving
the data.
● Next, the acknowledgement of readiness will follow the reverse sequence, i.e.
From the DataNode 6 to 4 and then to 1.
● At last DataNode 1 will inform the client that all the DataNodes are ready and
a pipeline will be formed between the client, DataNode 1, 4 and 6.
● Now pipeline set up is complete and the client will finally begin the data copy
or streaming process.

2. Data Streaming:

As the pipeline has been created, the client will push the data into the pipeline. Now,
don’t forget that in HDFS, data is replicated based on replication factor. So, here Block
A will be stored to three DataNodes as the assumed replication factor
is 3. Moving ahead, the client will copy the block (A) to DataNode 1 only. The replication is always
done by DataNodes sequentially.

So, the following steps will take place during replication:

● Once the block has been written to DataNode 1 by the client, DataNode 1 will connect to
DataNode 4.
● Then, DataNode 1 will push the block in the pipeline and data will be copied to DataNode 4.

● Again, DataNode 4 will connect to DataNode 6 and will copy the last replica of the block.

3. Shutdown of Pipeline or Acknowledgement stage:

Once the block has been copied into all the three DataNodes, a series of acknowledgements will take
place to ensure the client and NameNode that the data has been written successfully. Then, the client
will finally close the pipeline to end the TCP session.

As shown in the figure below, the acknowledgement happens in the reverse sequence i.e. from
DataNode 6 to 4 and then to 1. Finally, the DataNode 1 will push three acknowledgements

(including its own) into the pipeline and send it to the client. The client will inform NameNode that
data has been written successfully. The NameNode will update its metadata and the client will shut
down the pipeline.
Similarly, Block B will also be copied into the DataNodes in parallel with Block A. So, the
following things are to be noticed here:

● The client will copy Block A and Block B to the first DataNode simultaneously.

● Therefore, in our case, two pipelines will be formed for each of the block and all the process
discussed above will happen in parallel in these two pipelines.
● The client writes the block into the first DataNode and then the DataNodes will be
replicating the block sequentially.

As you can see in the above image, there are two pipelines formed for each block (A and B).
Following is the flow of operations that is taking place for each block in their respective pipelines:

● For Block A: 1A -> 2A -> 3A -> 4A

● For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B

HDFS Read Architecture:


HDFS Read architecture is comparatively easy to understand. Let’s take the above example again
where the HDFS client wants to read the file “example.txt” now.
Now, following steps will be taking place while reading the file:

● The client will reach out to NameNode asking for the block metadata for the file
“example.txt”.
● The NameNode will return the list of DataNodes where each block (Block A and B) are
stored.
● After that, the client will connect to the DataNodes where the blocks are stored.

● The client starts reading data parallel from the DataNodes (Block A from DataNode 1 and
Block B from DataNode 3).
● Once the client gets all the required file blocks, it will combine these blocks to form a file.

While serving read request of the client, HDFS selects the replica which is closest to the client. This
reduces the read latency and the bandwidth consumption. Therefore, that replica is selected which
resides on the same rack as the reader node, if possible.

Now, you should have a pretty good idea of the Apache Hadoop HDFS architecture. There is a lot of information here, and it may not be easy to absorb it all in one go.

18. Discuss the different types of NoSQL databases with examples.

Types of NoSQL Database


NoSQL databases can be classified into four main types, based on their data storage and retrieval methods:
1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases
Each type has unique advantages and use cases, making NoSQL a preferred choice for big data applications, real-
time analytics, cloud computing, and distributed systems.
1. Document-Based Database
A document-based database is a non-relational database. Instead of storing data in rows and columns (tables), it uses documents to store the data. A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in applications. In a document database, particular elements can be accessed using an index value assigned to them for faster querying.
Collections are groups of documents with similar contents. Documents within a collection do not need to share the same schema, because document databases have a flexible schema.

Key features of documents database:


● Flexible schema: Documents in the database have a flexible schema, meaning documents in a collection need not share the same schema.
● Faster creation and maintenance: Creating documents is easy, and minimal maintenance is required once a document is created.
● No foreign keys: There is no rigid relationship between two documents, so documents can be independent of one another. Hence, there is no requirement for a foreign key in a document database.
● Open formats: Documents are built using open formats such as XML and JSON.

Popular Document Databases & Use Cases


Database Use Case
MongoDB Content management, product catalogs, user profiles

CouchDB Offline applications, mobile synchronization

Firebase Firestore Real-time apps, chat applications
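As a quick illustration of the flexible schema, two documents with different fields can be inserted into the same (hypothetical) MongoDB collection:

// The "products" collection and its fields are made up for illustration.
db.products.insertMany([
  { name: "Apples", category: "food", qty: 150 },
  { name: "MacBook Air", category: "electronics", qty: 10,
    specifications: { storage: "256GB SSD", cpu: "8 Core" } }  // extra nested field is allowed
]);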

2. Key-Value Stores
A key-value store is a non-relational database and the simplest form of a NoSQL database. Every data element in the database is stored as a key-value pair. The data can be retrieved using the unique key allotted to each element in the database. The values can be simple data types like strings and numbers or complex objects. A key-value store is like a relational database with only two columns: the key and the value.

Key features of the key-value store:


● Simplicity: Data retrieval is extremely fast due to direct key access.
● Scalability: Designed for horizontal scaling and distributed storage.
● Speed: Ideal for caching and real-time applications.

Popular Key-Value Databases & Use Cases


Database Use Case

Redis Caching, real-time leaderboards, session storage

Memcached High-speed in-memory caching

Amazon DynamoDB Cloud-based scalable applications

3. Column Oriented Databases


A column-oriented database is a non-relational database that stores the data in columns instead of rows. That means
when we want to run analytics on a small number of columns, we can read those columns directly without consuming
memory with the unwanted data. Columnar databases are designed to read data more efficiently and retrieve the data
with greater speed. A columnar database is used to store a large amount of data.

Key features of Columnar Oriented Database


● High Scalability: Supports distributed data processing.
● Compression: Columnar storage enables efficient data compression.
● Faster Query Performance: Best for analytical queries.
Popular Column-Oriented Databases & Use Cases
Database Use Case

Apache Cassandra Real-time analytics, IoT applications

Google Bigtable Large-scale machine learning, time-series data

HBase Hadoop ecosystem, distributed storage

4. Graph-Based Databases
Graph-based databases focus on the relationships between elements. They store data in the form of nodes, and the connections between the nodes are called edges or relationships, making them ideal for complex relationship-based queries.
● Data is represented as nodes (objects) and edges (connections).
● Fast graph traversal algorithms help retrieve relationships quickly.
● Used in scenarios where relationships are as important as the data itself.

Key features of Graph Database


● Relationship-Centric Storage: Perfect for social networks, fraud detection, recommendation engines.
● Real-Time Query Processing: Queries return results almost instantly.
● Schema Flexibility: Easily adapts to evolving relationship structures
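A brief, hedged Cypher sketch (Neo4j) of the node-and-relationship model described above; the labels, relationship type, and properties are made up for illustration:

// Create two nodes and a relationship between them
CREATE (a:Person {name: 'Alice'})-[:FOLLOWS]->(b:Person {name: 'Bob'});

// Traverse the relationship: whom does Alice follow?
MATCH (:Person {name: 'Alice'})-[:FOLLOWS]->(p:Person)
RETURN p.name;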

19. Compare SQL and NoSQL databases in terms of structure, performance, and scalability.
Differences Between SQL and NoSQL

● Data Structure: SQL (relational) – tables with rows and columns; NoSQL (non-relational) – document-based, key-value, column-family, or graph-based.
● Schema: SQL – fixed schema (predefined structure); NoSQL – flexible schema (dynamic and adaptable).
● Scalability: SQL – vertically scalable (upgrading hardware); NoSQL – horizontally scalable (adding more servers).
● Data Integrity: SQL – ACID-compliant (strong consistency); NoSQL – BASE-compliant (more available, less consistent).
● Query Language: SQL – Structured Query Language; NoSQL – varies (e.g., MongoDB uses its own query language).
● Performance: SQL – efficient for complex queries and transactions; NoSQL – better for large-scale and fast read/write operations.
● Use Case: SQL – best for transactional systems (banking, ERP, etc.); NoSQL – ideal for big data, real-time web apps, and data lakes.
● Examples: SQL – MySQL, PostgreSQL, Oracle, MS SQL Server; NoSQL – MongoDB, Cassandra, CouchDB, Neo4j.

1. Type

SQL databases are primarily called Relational Databases (RDBMS); whereas NoSQL
databases are primarily called non-relational or distributed databases.

2. Language
SQL databases define and manipulate data using Structured Query Language (SQL). On one hand, this language is extremely powerful: SQL is one of the most versatile and widely used options available, which makes it a safe choice, especially for complex queries. On the other hand, it can be restrictive.
SQL requires you to use predefined schemas to determine the structure of your data before you work with it, and all of your data must follow the same structure. This can require significant up-front preparation, which means that a change in the structure would be both difficult and disruptive to your whole system.

3. Scalability

In almost all situations SQL databases are vertically scalable. This means that you can increase the load on a single server by increasing things like RAM, CPU, or SSD. On the other hand, NoSQL databases are horizontally scalable. This means that you handle more traffic by sharding, or adding more servers to, your NoSQL database.
It is similar to adding more floors to the same building versus adding more buildings to the neighborhood. Thus NoSQL can ultimately become larger and more powerful, making these databases the preferred choice for large or ever-changing data sets.

4. Structure

SQL databases are table-based on the other hand NoSQL databases are either key-value
pairs, document-based, graph databases, or wide-column stores. This makes relational
SQL databases a better option for applications that require multi-row transactions such as
an accounting system or for legacy systems that were built for a relational structure.
Here is a simple example of how structured data with rows and columns versus unstructured data without a fixed definition might look. A product table in a SQL DB might accept data looking like this:
{
  "id": "101",
  "category": "food",
  "name": "Apples",
  "qty": "150"
}

Whereas an unstructured NoSQL DB might save the products in many variations, without any constraint to change an underlying table structure:

Products = [
  {
    "id": "101",
    "category": "food",
    "name": "California Apples",
    "qty": "150"
  },
  {
    "id": "102",
    "category": "electronics",
    "name": "Apple MacBook Air",
    "qty": "10",
    "specifications": {
      "storage": "256GB SSD",
      "cpu": "8 Core",
      "camera": "1080p FaceTime HD camera"
    }
  }
]

5. Property followed

SQL databases follow ACID properties (Atomicity, Consistency, Isolation, and Durability), whereas NoSQL databases follow Brewer's CAP theorem (Consistency, Availability, and Partition tolerance).

6. Support

Great support is available for all SQL databases from their vendors, and there are also many independent consultants who can help you with SQL databases for very large-scale deployments. For some NoSQL databases, however, you still have to rely on community support, and only limited outside experts are available for setting up and deploying large-scale NoSQL deployments.

What is SQL?
SQL databases, also known as Relational Database Management Systems (RDBMS), use
structured tables to store data. They rely on a predefined schema that determines the
organization of data within tables, making them suitable for applications that require a
fixed, consistent structure.
● Structured Data: Data is organized in tables with rows and columns, making it

easy to relate different types of information.

● ACID Compliance: SQL databases follow the ACID properties (Atomicity,

Consistency, Isolation, Durability) to ensure reliable transactions and data

integrity.

● Examples: Popular SQL databases include MySQL, PostgreSQL, Oracle, and MS

SQL Server.

1. Explain MongoDB and its necessity in modern applications.


MongoDB is essential in modern applications due to its flexible schema, scalability, and ability to
handle large volumes of data. It is widely used in Big Data, real-time analytics, and cloud applications.

2. Compare MongoDB and RDBMS in terms of data storage, performance, and scalability.

○ Storage: MongoDB uses documents, whereas RDBMS uses tables.

○ Performance: MongoDB is optimized for high-speed read/write operations.

○ Scalability: MongoDB scales horizontally, whereas RDBMS scales vertically.

3. Describe the key differences between SQL and MongoDB queries with examples.

SQL:

SELECT * FROM users WHERE age > 25;

MongoDB:

db.users.find({ "age": { $gt: 25 } });

4. Discuss the various data types available in MongoDB with examples.


MongoDB supports various data types such as strings, numbers, arrays, and objects. Example:

{
"name": "Alice",
"age": 25,
"skills": ["Java", "MongoDB"]
}

5. Explain the structure of a MongoDB document and how it is different from a relational database row.
MongoDB documents are self-contained and flexible, unlike RDBMS rows that adhere to a strict
schema.

6. What are the important terms used in MongoDB, and how do they relate to RDBMS concepts?

○ Table → Collection

○ Row → Document

○ Column → Field

7. How does indexing improve performance in MongoDB? Explain different types of indexes.
Indexes let MongoDB locate matching documents without scanning the whole collection, which speeds up queries. Types include single-field, compound, and text indexes.
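A hedged sketch of creating these index types in the mongo shell; the collections and fields are illustrative:

db.users.createIndex({ age: 1 });              // single-field index
db.users.createIndex({ name: 1, age: -1 });    // compound index on two fields
db.articles.createIndex({ body: "text" });     // text index for keyword search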

8. Describe the CRUD operations in MongoDB with examples.

○ Create: insertOne(), insertMany()

○ Read: find(), findOne()

○ Update: updateOne(), updateMany()

○ Delete: deleteOne(), deleteMany()
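A minimal sketch of these operations in the mongo shell; the users collection and its fields are illustrative:

db.users.insertOne({ name: "Alice", age: 25 });                // Create
db.users.find({ age: { $gt: 20 } });                           // Read
db.users.updateOne({ name: "Alice" }, { $set: { age: 26 } });  // Update
db.users.deleteOne({ name: "Alice" });                         // Delete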

9. Explain how aggregation works in MongoDB and provide an example.


Aggregation is used for data analysis. Example:

db.sales.aggregate([{ $group: { _id: "$product", totalSales: { $sum: "$amount" } } }]);

10. Discuss the advantages and disadvantages of using MongoDB over traditional relational databases.
○ Advantages: Flexible schema, high scalability, faster queries.

○ Disadvantages: Limited multi-document transaction support (full ACID transactions were only added in MongoDB 4.0), higher memory usage due to data duplication.

1. What is R programming, and why is it widely used in data science?

Answer:
R is an open-source programming language primarily used for statistical computing and data visualization. It
is widely used in data science due to its powerful data analysis capabilities, extensive library support, and
ability to handle large datasets. Additionally, R provides various built-in functions for machine learning,
regression analysis, and data visualization, making it popular among statisticians and data scientists.

2. List some key features of R programming.

Answer:

● Open-source and free to use.

● Supports various data types and structures (vectors, matrices, data frames, etc.).

● Extensive libraries for statistics, data visualization, and machine learning.

● Strong graphical capabilities with ggplot2 and base R plotting functions.

● Compatible with other programming languages like Python, C, and Java.

● Provides interactive development environments (RStudio, Jupyter Notebook).

● Platform-independent and runs on Windows, macOS, and Linux.

3. What are the different types of operators in R?

Answer:
R supports the following types of operators:

● Arithmetic Operators (+, -, *, /, ^, %%, %/%) – Used for mathematical calculations.

● Relational (Comparison) Operators (>, <, >=, <=, ==, !=) – Used for comparisons.

● Logical Operators (&, |, !, &&, ||) – Used for logical operations.

● Assignment Operators (<-, =, <<-) – Used for assigning values to variables.

● Bitwise Operators (bitwAnd, bitwOr, bitwXor) – Used for bitwise operations.

● Miscellaneous Operators (%in%, :, $) – Used for checking membership, sequence generation, and
accessing list elements.
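A quick sketch showing a few of these operators in action:

x <- 10; y <- 3
x + y                # arithmetic: 13
x %% y               # modulus: 1
x %/% y              # integer division: 3
x > y                # relational: TRUE
(x > 5) & (y < 5)    # logical: TRUE
3 %in% c(1, 2, 3)    # membership: TRUE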

4. How do you assign values to variables in R?


Answer:
Values can be assigned to variables using the assignment operators <-, =, or <<-.

x <- 10
y = 20
z <<- 30

The <- operator is the most commonly used in R.

5. What are control statements in R? Provide examples.

Answer:
Control statements allow the execution of specific blocks of code based on conditions. Common control
statements in R include:

if statement

x <- 10
if (x > 5) {
print("x is greater than 5")
}

if-else statement

if (x > 5) {
print("x is greater than 5")
} else {
print("x is not greater than 5")
}

switch statement

y <- "two"
switch(y,
"one" = print("Selected One"),
"two" = print("Selected Two"),
"three" = print("Selected Three"))

6. How do conditional statements (if, else, switch) work in R?

Answer:

● if statement executes code only if the condition is true.

● if-else statement provides an alternative action if the condition is false.

● switch statement selects a value based on matching cases.

7. What are looping structures in R? Give examples.

Answer:
Loops allow repetitive execution of a block of code.

for loop

for (i in 1:5) {
print(i)
}

while loop

x <- 1
while (x <= 5) {
print(x)
x <- x + 1
}

repeat loop

x <- 1
repeat {
print(x)
x <- x + 1
if (x > 5) break
}

8. Define functions in R. How do you create a function?

Answer:
Functions are reusable blocks of code that perform specific tasks.

my_function <- function(a, b) {
return(a + b)
}
my_function(3, 5)

9. What is the difference between an in-built function and a user-defined function in R?

Answer:

● In-built functions are pre-defined in R (e.g., sum(), mean(), sqrt()).

● User-defined functions are created by users using the function keyword.

10. What is meant by interfacing with R?

Answer:
Interfacing with R means integrating it with external systems such as databases, C/C++ programs, and other
scripting languages.

11. How can you integrate R with databases?

Answer:
R can connect to databases using packages like:

● RMySQL for MySQL

● RSQLite for SQLite

● RODBC for ODBC-compliant databases


Example connection to MySQL:

library(RMySQL)
con <- dbConnect(MySQL(), user='root', password='password', dbname='database',
host='localhost')

12. What are vectors in R? How are they created?

Answer:
Vectors are one-dimensional arrays that store elements of the same type.

v <- c(1, 2, 3, 4, 5)

13. Explain how matrices are used in R.

Answer:
Matrices are two-dimensional data structures with the same data type.

m <- matrix(1:6, nrow=2, ncol=3)

14. What is a list in R? How does it differ from a vector?

Answer:
A list can store elements of different data types, unlike a vector.

l <- list(1, "text", TRUE)

15. What is a data frame in R? How is it different from a matrix?

Answer:
A data frame is a table-like structure with columns of different data types.

df <- data.frame(Name=c("A", "B"), Age=c(25, 30))

A matrix has only one data type, while a data frame can have multiple types.

16. What is the purpose of factors in R?

Answer:
Factors represent categorical data and improve efficiency.
factor(c("male", "female", "male"))

17. How do tables help in data manipulation in R?

Answer:
Tables store frequency distributions and make it easier to summarize data.

table(c("A", "B", "A", "B", "C"))

18. What are the different ways to take input in R?

Answer:

● readline() for single-line input

● read.csv() for CSV files

● scan() for multiple values

19. How do you write output to a file in R?

Answer:

write.csv(df, "output.csv")

20. What are different types of graphs available in R?

Answer:

● Bar plot (barplot())

● Histogram (hist())

● Scatter plot (plot())

● Boxplot (boxplot())
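A short sketch producing each of these plots on small made-up data:

values <- c(5, 8, 3, 9)
barplot(values)          # bar plot
hist(rnorm(100))         # histogram of 100 random values
plot(1:10, (1:10)^2)     # scatter plot
boxplot(values)          # boxplot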

21. What is the R apply family of functions?


Answer:
apply(), lapply(), sapply(), and tapply() are functions that apply a given function over data structures (matrix margins, list elements, or grouped values) without writing explicit loops; a combined example follows question 23.

22. What is the difference between apply() and lapply()?

Answer:

● apply() applies a function over the rows or columns (margins) of a matrix or data frame.

● lapply() applies a function to each element of a list or vector and always returns a list.

23. How does sapply() function work in R?

Answer:
sapply() is similar to lapply() but returns vectors/matrices instead of lists.
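A combined sketch of the apply family on small made-up data:

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                      # row sums of the matrix: 9 12
lapply(list(a = 1:3, b = 4:6), mean)  # returns a list of means
sapply(list(a = 1:3, b = 4:6), mean)  # simplifies the result to a named vector
tapply(c(1, 2, 3, 4), c("x", "y", "x", "y"), sum)  # sums grouped by factor: x = 4, y = 6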
