BDA Question Bank With Solutions
○ Hadoop 1.0: Used MapReduce for processing but lacked efficient resource
management.
○ RDBMS deals with structured data stored in tables with strict schemas.
○ Data consistency
○ Load balancing
○ Fault tolerance
● HDFS stores large files by distributing them across multiple nodes, ensuring high
availability and reliability.
● Hadoop divides data into blocks and distributes them across nodes, ensuring
scalability and reliability.
Example JSON document:
{
"name": "Alice",
"age": 25,
"city": "New York"
}
● String
● Integer
● Boolean
● Array
● Date
● ObjectId
● Table → Collection
● Row → Document
● Column → Field
Embedded (nested) document:
{
"name": "Alice",
"address": {
"city": "New York",
"zip": "10001"
}
}
Referenced document (linking by ID):
{
"name": "Alice",
"address_id": "6123abcd5678efgh"
}
Insert a document:
db.users.insertOne({ "name": "Alice", "age": 25 });
Retrieve documents:
db.users.find();
How do you update an existing document in MongoDB?
db.users.updateOne({ "name": "Alice" }, { $set: { "age": 26 } });
Delete a document:
db.users.deleteOne({ "name": "Alice" });
● Aggregation Queries
db.sales.aggregate([{ $group: { _id: "$product", totalSales: { $sum: "$amount" } } }]);
Primarily, there are three types of big data in analytics. The types of Big Data are described below.
A. Structured Data
Any data that can be processed, is easily accessible, and can be stored in a fixed format is called structured data. In Big Data, structured data is the easiest to work with because it has highly coordinated measurements that are defined by set parameters.
Overview:
Examples:
Merits:
Limitations:
● Limited flexibility (must adhere to a strict schema).
● Scalability issues with very large datasets.
● Less suitable for complex big data types.
B. Semi-structured Data
In Big Data, semi-structured data is a combination of both unstructured and structured types of big data. This form of data has some features of structured data but also contains information that does not adhere to any formal structure of data models or relational databases.
Overview:
Examples:
Merits:
C. Unstructured Data
Unstructured data in Big Data is data whose format consists of multitudes of unstructured files (images, audio, logs, and video). This form of data is classified as intricate data because of its unfamiliar structure and relatively huge size. A stark example of unstructured data is the output returned by a web search, which mixes text, links, images, and videos.
Overview:
Examples:
Merits:
● Capable of storing vast amounts of diverse data.
● High flexibility in data storage.
● Suitable for complex data types like multimedia.
● Facilitates advanced analytics and machine learning applications.
Limitations:
In recent years, Big Data was defined by the "3Vs", but now there are "6Vs" of Big Data, which are also termed the characteristics of Big Data, as follows:
3. Variety:
● Variety refers to the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured, or unstructured.
● Structured data: This is basically organised data. It generally refers to data with a defined length and format.
● Semi-structured data: This is basically partially organised data. It generally refers to data that does not conform to the formal structure of data models; log files are an example of this type of data.
● Unstructured data: This basically refers to unorganised data. It generally refers to data that does not fit neatly into the traditional row-and-column structure of a relational database. Text, pictures, videos, etc. are examples of unstructured data, which cannot be stored in the form of rows and columns.
4. Veracity:
● Veracity refers to inconsistency and uncertainty in data: the data that is available can be messy, and its quality and accuracy are difficult to control.
● Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
● Example: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.
3) What is the importance of big data? Write about the main business drivers for the rise of big data.
Big Data and its Importance
The importance of big data does not revolve around how much data a company has, but around how the company utilizes the collected data. Every company uses its data in its own way; the more efficiently a company uses its data, the more potential it has to grow. By analysing its big data pools effectively, a company can take data from any source and find answers that enable:
Cost Savings :
o Some tools of Big Data like Hadoop can bring cost advantages to business
when large amounts of data are to be stored.
o These tools help in identifying more efficient ways of doing business.
Time Reductions :
o The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately.
o This helps in making quick decisions based on what has been learned.
Understand the market conditions :
o By analyzing big data we can get a better understanding of current market
conditions.
o For example: By analyzing customers’ purchasing behaviours, a company
can find out the products that are sold the most and produce products
according to this trend. By this, it can get ahead of its competitors.
Control online reputation :
o Big data tools can do sentiment analysis.
o Therefore, you can get feedback about who is saying what about your
company.
o If you want to monitor and improve the online presence of your business,
then big data tools can help in all this.
Using Big Data Analytics to Boost Customer Acquisition (purchase) and Retention :
o The customer is the most important asset any business depends on.
o No single business can claim success without first having to establish a solid
customer base.
o If a business is slow to learn what customers are looking for, then it is very
likely to deliver poor quality products.
o The use of big data allows businesses to observe various customer-related
patterns and trends.
Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights :
o Big data analytics can help change all business operations.
o For example, it improves the ability to match customer expectations, to change the company's product line, and to ensure that marketing campaigns are powerful.
Besides plummeting storage costs, a second key contributing factor to the affordability of Big Data has been the development of open source Big Data software frameworks.
The most popular software framework (nowadays considered the standard for Big Data) is Apache Hadoop, for distributed storage and processing.
Because these software frameworks are freely available as open source, it has become increasingly inexpensive to start Big Data projects in organizations.
This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure.
Instead, they can license the storage and processing capacity they need and pay only for the amounts they actually use. As a result, most Big Data solutions leverage the possibilities of cloud computing to deliver their solutions to enterprises.
As a result, knowledge and education about data science have become greatly professionalized, and more information becomes available every day. Whereas statistics and data analysis previously remained a mostly academic field, they are quickly becoming a popular subject among students and the working population.
Social media data provides insights into the behaviors, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable to
anyone who is able to derive meaning from these large quantities of data. Social media data
can be used to identify customer preferences for product development, target new customers
for future purchases, or even target potential voters in elections. Social media data might even
be considered one of the most important business drivers of Big Data.
It is increasingly gaining popularity as consumer goods providers start including ‘smart’ sensors
in household appliances. Whereas the average household in 2010 had around 10 devices that
connected to the internet, this number is expected to rise to 50 per household by 2020.
Examples of these devices include thermostats, smoke detectors, televisions, audio systems and
even smart refrigerators.
● Medical information, such as diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras
across a city
● Mobile devices, which provide geospatial location data of users as well as metadata about text messages, phone calls, and application usage on smartphones
● Smart devices, which provide sensor-based collection of information
● Non-traditional IT devices, including RFID readers, GPS navigation systems, and seismic processing equipment
These are some of the many sources from which big data can be generated.
Traditional Data vs Big Data:
● Volume: traditional data ranges from gigabytes to terabytes, while big data ranges from petabytes to zettabytes or exabytes.
● Data types: traditional database systems deal with structured data, while big data systems deal with structured, semi-structured, and unstructured data.
● Generation rate: traditional data is generated per hour or per day, while big data is generated far more frequently, often every second.
● Schema: the traditional data model is strict-schema based and static, while the big data model is flat-schema based and dynamic.
● Relationships: traditional data is stable with known inter-relationships, while big data is unstable with unknown relationships.
● Sources: traditional data sources include ERP transaction data and CRM transaction data, while big data sources include social media, device data, and sensor data.
5) Explain briefly the components of Hadoop, or explain briefly the Hadoop architecture.
Hadoop is an Apache open source framework written in Java that allows distributed
processing of large datasets across clusters of computers using simple programming
models. The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from single server to thousands of machines, each offering local computation and
storage.
● Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 64 MB or 128 MB (preferably 128 MB); a small Java write sketch follows after this list.
● These files are then distributed across various cluster nodes for further
processing.
● HDFS, being on top of the local file system, supervises the processing.
● Blocks are replicated for handling hardware failure.
● Checking that the code was executed successfully.
● Performing the sort that takes place between the map and reduce
stages.
● Sending the sorted data to a certain computer.
● Writing the debugging logs for each job.
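To make the storage flow concrete, here is a minimal, hedged sketch of writing a file to HDFS through Hadoop's Java FileSystem API; the NameNode address and path are illustrative assumptions. HDFS transparently splits the file into blocks and replicates them across DataNodes, so the client only sees an ordinary output stream.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; real clusters usually set this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            // Block splitting and replication happen behind this write call.
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }
    }
}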
Advantages of Hadoop
The Hadoop ecosystem is a platform or framework that helps in solving big data problems. It comprises different components and services (for ingesting, storing, analyzing, and maintaining data). Most of the services available in the Hadoop ecosystem supplement the four core components of Hadoop: HDFS, YARN, MapReduce, and Hadoop Common.
The Hadoop ecosystem includes both Apache open source projects and a wide variety of commercial tools and solutions. Some well-known open source examples include Spark, Hive, Pig, Sqoop, and Oozie.
As we have got some idea about what is Hadoop ecosystem, what it does, and what
are its components, let’s discuss each concept in detail.
Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop; everything revolves around data, which makes its analysis easier.
HDFS:
● HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
● HDFS consists of two core components i.e.
1. Name node
2. Data Node
● The Name Node is the prime node which contains the metadata (data about data), requiring comparatively fewer resources than the Data Nodes that store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
● HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
● Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
● It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
● Resource manager has the privilege of allocating resources for the applications in a
system whereas Node managers work on the allocation of resources such as CPU,
memory, bandwidth per machine and later on acknowledges the resource manager.
Application manager works as an interface between the resource manager and node
manager and performs negotiations as per the requirement of the two.
MapReduce:
● By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
● MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates key-value pair results which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples (a minimal word-count sketch in Java follows below).
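As an illustration of this division of labour between Map() and Reduce(), here is a minimal word-count sketch using the Hadoop MapReduce Java API; the class and field names are made up for the example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map(): splits each input line into words and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): receives every count emitted for a word and sums them into one total.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}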
PIG:
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
● Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
● Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
● With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it supports both real-time and batch processing. All the SQL data types are supported by Hive, making query processing easier.
● Similar to other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
● JDBC and ODBC drivers establish the data storage permissions and connection, whereas the Hive command line helps in the processing of queries (a small JDBC connection sketch follows below).
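As a concrete illustration of the JDBC driver mentioned above, here is a hedged sketch of querying HiveServer2 from Java; it assumes the Hive JDBC driver is on the classpath, and the host, credentials, and table are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust for a real cluster.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; this query is purely illustrative.
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}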
Mahout:
● Mahout provides machine learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction, or algorithms.
● It provides various libraries or functionalities such as collaborative filtering, clustering,
and classification which are nothing but concepts of Machine learning. It allows
invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
● It is a platform that handles all the process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
● It uses in-memory resources and is therefore faster than MapReduce in terms of optimization.
● Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing; hence, many companies use the two together.
Apache HBase:
● It is a NoSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides capabilities similar to Google's BigTable and so can work on Big Data sets effectively.
● At times when we need to search for or retrieve a few small occurrences in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing and looking up such limited data.
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
● Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is based on Java and also provides a spell-check mechanism; Solr is built on top of Lucene.
● Zookeeper: There used to be huge issues with the management of coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
● Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus arrives.
7) What is the relationship between the cloud and big data? Explain.
Big Data and Cloud Computing
One of the vital issues that organisations face with the storage and management of Big Data
is the huge amount of investment to get the required hardware setup and software packages.
Some of these resources may be overutilized or underutilized as requirements vary over time. We can overcome these challenges by providing a set of computing resources that
can be shared through cloud computing. These shared resources comprise applications,
storage solutions, computational units, networking solutions, development and deployment
platforms, business processes, etc. The cloud computing environment saves costs related to
infrastructure in an organization by providing a framework that can be optimized and
expanded horizontally. In order to operate in the real world, cloud implementation requires
common standardized processes and their automation.
In cloud-based platforms, applications can easily obtain the resources to perform computing tasks, and the costs of these resources are paid according to what is acquired and actually used. In cloud computing, this acquisition of resources in line with requirements, with payment according to use, is known as elasticity. Cloud computing makes it possible for organisations to dynamically regulate the use of computing resources and access them as needed while paying only for the resources that are used. This dynamic use of resources provides flexibility, although careless resource monitoring and control can result in unexpectedly high costs.
The cloud computing technique uses data centres to collect data and ensures that data backup and recovery are automatically performed to cater to the requirements of the business community. Both cloud computing and Big Data analytics use the distributed computing model in a similar manner.
The following are some features of cloud computing that can be used to handle Big Data:
Elasticity - Elasticity in the cloud means hiring certain resources as and when required and paying only for the resources that have been used. No extra payment is required for acquiring specific services. For example, a business expecting to use more data during its in-store processes could hire more resources to provide higher processing power.
Self Service - Cloud computing involves a simple user interface that helps customers decide on and access the cloud services they want. The process of selecting the needed services does not require intervention from human beings, and the services can be accessed automatically.
Low Cost - Careful planning, use, management, and control of resources helps organizations reduce the cost of acquiring hardware significantly. Also, the cloud offers customized solutions, especially to organizations that cannot afford a large initial investment in purchasing the resources used for computation in Big Data analytics. The cloud provides them a pay-as-you-use option in which organizations need to sign up only for those resources that are actually used. This also helps the cloud provider harness the benefits of economies of scale and pass the benefit on to their customers in terms of cost reduction.
These service models are sometimes called the cloud computing stack because they are built on top of one another. Knowing what they are and how they differ makes it easier to accomplish your goals. These abstraction layers can also be viewed as a layered architecture where services of a higher layer can be composed of services of the underlying layer, i.e., a SaaS offering can be built on top of PaaS and IaaS infrastructure.
Software as a Service (SaaS)
Advantages of SaaS
1. Cost-Effective: Pay only for what you use.
2. Reduced time: Users can run most SaaS apps directly from their web browser
without needing to download and install any software. This reduces the time spent in
installation and configuration and can reduce the issues that can get in the way of the
software deployment.
3. Accessibility: We can Access app data from anywhere.
4. Automatic updates: Rather than purchasing new software, customers rely on a SaaS
provider to automatically perform the updates.
5. Scalability: It allows the users to access the services and features on-demand.
The various companies providing Software as a service are Cloud9 Analytics,
Salesforce.com, Cloud Switch, Microsoft Office 365, Big Commerce, Eloqua, dropBox, and
Cloud Tran.
Disadvantages of SaaS:
1. Limited customization: SaaS solutions are typically not as customizable as on-
premises software, meaning that users may have to work within the constraints of the
SaaS provider’s platform and may not be able to tailor the software to their specific
needs.
2. Dependence on internet connectivity: SaaS solutions are typically cloud-based,
which means that they require a stable internet connection to function properly. This
can be problematic for users in areas with poor connectivity or for those who need to
access the software in offline environments.
3. Security concerns: SaaS providers are responsible for maintaining the security of the
data stored on their servers, but there is still a risk of data breaches or other security
incidents.
4. Limited control over data: SaaS providers may have access to a user’s data, which
can be a concern for organizations that need to maintain strict control over their data
for regulatory or other reasons.
Platform as a Service
PaaS is a category of cloud computing that provides a platform and environment to allow
developers to build applications and services over the internet. PaaS services are hosted in
the cloud and accessed by users simply via their web browser.
A PaaS provider hosts the hardware and software on its own infrastructure. As a result,
PaaS frees users from having to install in-house hardware and software to develop or run a
new application. Thus, the development and deployment of the application take place
independent of the hardware.
The consumer does not manage or control the underlying cloud infrastructure including
network, servers, operating systems, or storage, but has control over the deployed
applications and possibly configuration settings for the application-hosting environment. To make it simple, take the example of an annual day function: you have two options, either to create a venue or to rent one, but the function remains the same.
Advantages of PaaS:
1. Simple and convenient for users: It provides much of the infrastructure and other IT
services, which users can access anywhere via a web browser.
2. Cost-Effective: It charges for the services provided on a per-use basis thus
eliminating the expenses one may have for on-premises hardware and software.
3. Efficiently managing the lifecycle: It is designed to support the complete web
application lifecycle: building, testing, deploying, managing, and updating.
4. Efficiency: It allows for higher-level programming with reduced complexity thus, the
overall development of the application can be more effective.
The various companies providing Platform as a service are Amazon Web services Elastic
Beanstalk, Salesforce, Windows Azure, Google App Engine, cloud Bees and IBM smart
cloud.
Disadvantages of PaaS:
1. Limited control over infrastructure: PaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
2. Dependence on the provider: Users are dependent on the PaaS provider for the
availability, scalability, and reliability of the platform, which can be a risk if the
provider experiences outages or other issues.
3. Limited flexibility: PaaS solutions may not be able to accommodate certain types of
workloads or applications, which can limit the value of the solution for certain
organizations.
Infrastructure as a Service
Infrastructure as a service (IaaS) is a service model that delivers computer infrastructure on
an outsourced basis to support various operations. Typically IaaS is a service where
infrastructure is provided as outsourcing to enterprises such as networking equipment,
devices, database, and web servers.
It is also known as Hardware as a Service (HaaS). IaaS customers pay on a per-user basis,
typically by the hour, week, or month. Some providers also charge customers based on the
amount of virtual machine space they use.
It simply provides the underlying operating systems, security, networking, and servers for developing applications and services and for deploying development tools, databases, etc.
Advantages of IaaS:
1. Cost-Effective: Eliminates capital expense and reduces ongoing cost and IaaS
customers pay on a per-user basis, typically by the hour, week, or month.
2. Website hosting: Running websites using IaaS can be less expensive than traditional
web hosting.
3. Security: The IaaS Cloud Provider may provide better security than your existing
software.
4. Maintenance: There is no need to manage the underlying data center or the
introduction of new releases of the development or underlying software. This is all
handled by the IaaS Cloud Provider.
The various companies providing Infrastructure as a service are Amazon web services,
Bluestack, IBM, Openstack, Rackspace, and Vmware.
Disadvantages of IaaS:
1. Limited control over infrastructure: IaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
2. Security concerns: Users are responsible for securing their own data and
applications, which can be a significant undertaking.
3. Limited access: Cloud computing may not be accessible in certain regions and
countries due to legal policies.
Anything as a Service
It is also known as Everything as a Service. Most of the cloud service providers nowadays
offer anything as a service that is a compilation of all of the above services including some
additional services.
Advantages of XaaS:
1. Scalability: XaaS solutions can be easily scaled up or down to meet the changing
needs of an organization.
2. Flexibility: XaaS solutions can be used to provide a wide range of services, such as
storage, databases, networking, and software, which can be customized to meet the
specific needs of an organization.
3. Cost-effectiveness: XaaS solutions can be more cost-effective than traditional on-
premises solutions, as organizations only pay for the services.
Disadvantages of XaaS:
1. Dependence on the provider: Users are dependent on the XaaS provider for the
availability, scalability, and reliability of the service, which can be a risk if the provider
experiences outages or other issues.
2. Limited flexibility: XaaS solutions may not be able to accommodate certain types of
workloads or applications, which can limit the value of the solution for certain
organizations.
3. Limited integration: XaaS solutions may not be able to integrate with existing systems
and data sources, which can limit the value of the solution for certain organisations.
Function as a Service :
FaaS is a type of cloud computing service. It provides a platform for its users or customers to develop, run, and deploy code or an entire application as functions. It allows the user to develop the code and update it at any time without worrying about the maintenance of the underlying infrastructure. The developed code is executed in response to specific events. It is similar to PaaS.
FaaS is an event-driven execution model, implemented in serverless containers. When the application is fully developed, the user triggers an event to execute the code; the triggered event then activates the servers that execute it. These servers are Linux or other servers managed completely by the vendor. The customer has no knowledge of the servers and does not need to maintain them, which is why this is called a serverless architecture (a small function-handler sketch follows below).
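To make the idea of deploying code as a function concrete, here is a minimal sketch of an event-driven handler. It assumes the AWS Lambda Java runtime interface (aws-lambda-java-core); the class name and payload shape are illustrative, and other FaaS providers expose similar handler interfaces.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;

// A single function deployed to a FaaS platform: the provider runs it only
// when an event arrives and scales instances up or down automatically.
public class GreetingFunction implements RequestHandler<Map<String, String>, String> {

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        // The event payload (a simple key/value map here) is supplied by the trigger,
        // e.g. an HTTP request or a message on a queue.
        String name = event.getOrDefault("name", "world");
        return "Hello, " + name + "!";
    }
}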
Both PaaS and FaaS provide similar functionality, but there is still some differentiation in terms of scalability and cost.
FaaS provides automatic scaling up and scaling down depending upon the demand. PaaS also provides scalability, but here users have to configure the scaling parameters depending upon the demand.
In FaaS, users only pay for the execution time actually consumed. In PaaS, users pay a pay-as-you-go price regardless of how much or how little they use.
Advantages of FaaS :
● Highly Scalable: Auto scaling is done by the provider depending upon the demand.
● Cost-Effective: Pay only for the number of events executed.
● Code Simplification: FaaS lets users write and deploy code as small, independent functions rather than as one monolithic application.
● Maintaining the code is enough; there is no need to worry about the servers.
● Functions can be written in any programming language.
● Less control over the system.
The various companies providing Function as a Service are Amazon Web Services – Firecracker, Google – Kubernetes, Oracle – Fn, Apache OpenWhisk – IBM, and OpenFaaS.
Disadvantages of FaaS :
1. Cold start latency: Since FaaS functions are event-triggered, the first request to a
new function may experience increased latency as the function container is created
and initialized.
2. Limited control over infrastructure: FaaS providers typically manage the underlying
infrastructure and take care of maintenance and updates, but this can also mean that
users have less control over the environment and may not be able to make certain
customizations.
3. Security concerns: Users are responsible for securing their own data and
applications, which can be a significant undertaking.
4. Limited scalability: FaaS functions may not be able to handle high traffic or large
number of requests.
Predictive analytics models are designed to assess historical data, discover patterns, observe trends, and use that information to predict future trends.
Predictive analytics can be deployed across various industries for different business
problems. Below are a few industry use cases to illustrate how predictive analytics can
inform decision-making within real-world situations.
Types of Predictive Analytical Models
There are three common techniques used in predictive analytics: Decision trees, neural
networks, and regression. Read more about each of these below.
Regression analysis
Regression is a statistical analysis technique that estimates relationships between variables. It is useful for finding patterns in large datasets and determining the correlation between inputs, and it is best employed on continuous data that follows a known distribution. Regression is often used to determine how one or more independent variables affect another, such as how a price increase will affect the sale of a product; a small worked sketch follows below.
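As a concrete illustration of the price-versus-sales example above, here is a self-contained ordinary least squares sketch; the numbers are made up purely for demonstration.

// Fitting y = a + b*x by ordinary least squares on a toy price/sales dataset.
public class SimpleLinearRegression {
    public static void main(String[] args) {
        double[] price = {10, 12, 14, 16, 18};          // independent variable
        double[] unitsSold = {200, 180, 165, 150, 140}; // dependent variable

        int n = price.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += price[i];
            sumY += unitsSold[i];
            sumXY += price[i] * unitsSold[i];
            sumXX += price[i] * price[i];
        }

        // Ordinary least squares estimates for the slope (b) and intercept (a).
        double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double a = (sumY - b * sumX) / n;

        System.out.printf("Fitted model: unitsSold = %.2f + %.2f * price%n", a, b);
        System.out.printf("Predicted sales at price 20: %.1f%n", a + b * 20);
    }
}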
Decision trees
Decision trees are classification models that place data into different categories based on
distinct variables. The method is best used when trying to understand an individual's
decisions. The model looks like a tree, with each branch representing a potential choice, with
the leaf of the branch representing the result of the decision. Decision trees are typically easy
to understand and work well when a dataset has several missing variables.
Neural networks
Neural networks are machine learning methods that are useful in predictive analytics when
modeling very complex relationships. Essentially, they are powerhouse pattern recognition
engines. Neural networks are best used to determine nonlinear relationships in datasets,
especially when no known mathematical formula exists to analyze the data. Neural networks
can be used to validate the results of decision trees and regression models.
Cluster Models
Clustering describes the method of aggregating data that share similar attributes. Consider a
large online retailer like Amazon.
Amazon can cluster sales based on the quantity purchased or it can cluster sales based on the
average account age of its consumer. By separating data into similar groups based on shared
features, analysts may be able to identify other characteristics that define future activity.
10) What is the relationship between mobile business intelligence and big data?
Mobile business intelligence is a technology-enabled process of extracting meaningful
insights from data and delivering them to end-users via mobile devices. Mobile BI users can
conduct data analysis in real time using smartphones, tablets, and wearables to make quick
data-driven decisions.
● In the current fast-paced business world, decision-makers are often on the move
and require immediate access to data and analytics. With the increasing
capabilities of mobile devices, including enhanced data storage, processing
power, and connectivity, Mobile BI has become a critical tool for timely and
effective decision-making. It allows for the constant flow of information, keeping
business leaders connected with their operations, sales, and customer interactions
in real-time, regardless of their physical location.
Advantages of mobile BI
1. Simple access
Mobile BI is not restricted to a single mobile device or a certain place. You can view
your data at any time and from any location. Having real-time visibility into a firm
improves production and the daily efficiency of the business. Obtaining a company's
perspective with a single click simplifies the process.
2. Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to
stay ahead of the competition. Easy access to real-time data improves company
opportunities and raises sales and capital. This also aids in making the necessary
decisions as market conditions change.
3. Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from any location. Mobile BI offers information on demand, which helps users obtain what they require at the time. As a result, decisions are made quickly.
4. Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data when
they need it. Obtaining all of the corporate data with a single click frees up a significant
amount of time to focus on the smooth and efficient operation of the firm. Increased
productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1. Stack of data
The primary function of mobile BI is to store data in a systematic manner and then present it to the user as required. As a result, Mobile BI stores all of the information and ends up with heaps of older data. The corporation only needs a small portion of the previous data, but it has to store the entire set, which piles up in the stack.
2. Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for these expensive services, but small businesses cannot. As if the cost of mobile BI itself were not enough, we must additionally consider the rates of the IT workers needed for the smooth operation of BI, as well as the hardware costs involved.
However, larger corporations do not settle for just one Mobile BI provider for their organisations; they require several. Even when doing basic commercial transactions, mobile BI is costly.
3. Time consuming
Businesses prefer Mobile BI since it is a quick procedure. Companies are not patient
enough to wait for data before implementing it. In today's fast-paced environment,
anything that can produce results quickly is valuable. The data from the warehouse is
used to create the system, hence the implementation of BI in an enterprise takes more
than 18 months.
4. Data breach
The biggest issue of the user when providing data to Mobile BI is data leakage. If you
handle sensitive data through Mobile BI, a single error can destroy your data as well as
make it public, which can be detrimental to your business.
Many Mobile BI providers are working to make it 100 percent secure to protect their
potential users' data. It is not only something that mobile BI carriers must consider, but
it is also something that we, as users, must consider when granting data access
authorization.
Because we work online in every aspect, we have a lot of data stored in Mobile BI, which might be a significant problem. A large portion of the data analysed by Mobile BI is irrelevant or completely useless, and this can slow down the entire procedure. This requires you to select the data that is important and may be required in the future.
HDFS Explained
The Hadoop Distributed File System (HDFS) is fault-tolerant by design. Data is
stored in individual data blocks in three separate copies across multiple nodes and
server racks. If a node or even an entire rack fails, the impact on the broader system
is negligible.
DataNodes process and store data blocks, while NameNodes manage the many
DataNodes, maintain data block metadata, and control client access.
NameNode
Initially, data is broken into abstract data blocks. The file metadata for these blocks, which includes the file name, file permissions, IDs, locations, and the number of replicas, is stored in an fsimage file in the NameNode's local memory.
Should a NameNode fail, HDFS would not be able to locate any of the data sets
distributed throughout the DataNodes. This makes the NameNode the single point of
failure for the entire cluster. This vulnerability is resolved by implementing a
Secondary NameNode or a Standby NameNode.
Secondary NameNode
The Secondary NameNode served as the primary backup solution in early Hadoop
versions. The Secondary NameNode, every so often, downloads the current fsimage
instance and edit logs from the NameNode and merges them. The edited fsimage
can then be retrieved and restored in the primary NameNode.
The failover is not an automated process as an administrator would need to recover
the data from the Secondary NameNode manually.
Standby NameNode
The High Availability feature was introduced in Hadoop 2.0 and subsequent versions
to avoid any downtime in case of the NameNode failure. This feature allows you to
maintain two NameNodes running on separate dedicated master nodes.
The Standby NameNode is an automated failover in case an Active NameNode
becomes unavailable. The Standby NameNode additionally carries out the check-
pointing process. Due to this property, the Secondary and Standby NameNode are
not compatible. A Hadoop cluster can maintain either one or the other.
Zookeeper
Zookeeper is a lightweight tool that supports high availability and redundancy. A
Standby NameNode maintains an active session with the Zookeeper daemon.
If an Active NameNode falters, the Zookeeper daemon detects the failure and carries
out the failover process to a new NameNode. Use Zookeeper to automate failovers
and minimize the impact a NameNode failure can have on the cluster.
DataNode
Each DataNode in a cluster uses a background process to store the individual blocks
of data on slave servers.
By default, HDFS stores three copies of every data block on separate DataNodes.
The NameNode uses a rack-aware placement policy. This means that the
DataNodes that contain the data block replicas cannot all be located on the same
server rack.
A DataNode communicates and accepts instructions from the NameNode roughly
twenty times a minute. Also, it reports the status and health of the data blocks
located on that node once an hour. Based on the provided information, the
NameNode can request the DataNode to create additional replicas, remove them, or
decrease the number of data blocks present on the node.
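The replica placement described above can also be observed programmatically. The hedged sketch below asks the NameNode which DataNodes hold each block of a file via the Java FileSystem API; the NameNode address and file path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // illustrative address

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

            // Each block is listed with the DataNodes (hosts) holding its replicas.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}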
12) Explain the YARN architecture. Write about how YARN works.
YARN (Yet Another Resource Negotiator) is the default cluster management
resource for Hadoop 2 and Hadoop 3. In previous Hadoop versions, MapReduce
used to conduct both data processing and resource allocation. Over time the
necessity to split processing and resource management led to the development of
YARN.
YARN’s resource allocation role places it between the storage layer, represented by
HDFS, and the MapReduce processing engine. YARN also provides a generic
interface that allows you to implement new processing engines for various data
types.
ResourceManager
The ResourceManager (RM) daemon controls all the processing resources in a
Hadoop cluster. Its primary purpose is to designate resources to individual
applications located on the slave nodes. It maintains a global overview of the
ongoing and planned processes, handles resource requests, and schedules and
assigns resources accordingly. The ResourceManager is vital to the Hadoop
framework and should run on a dedicated master node.
The RM's sole focus is on scheduling workloads. Unlike MapReduce, it has no interest in failovers or individual processing tasks. This separation of tasks in YARN is what makes Hadoop inherently scalable and turns it into a fully developed computing platform.
NodeManager
Each slave node has a NodeManager processing service and a DataNode storage
service. Together they form the backbone of a Hadoop distributed system.
The DataNode, as mentioned previously, is an element of HDFS and is controlled by
the NameNode. The NodeManager, in a similar fashion, acts as a slave to the
ResourceManager. The primary function of the NodeManager daemon is to track
processing-resources data on its slave node and send regular reports to the
ResourceManager.
Containers
Processing resources in a Hadoop cluster are always deployed in containers. A
container has memory, system files, and processing space.
A container deployment is generic and can run any requested custom resource on
any system. If a requested amount of cluster resources is within the limits of what’s
acceptable, the RM approves and schedules that container to be deployed.
The container processes on a slave node are initially provisioned, monitored, and
tracked by the NodeManager on that specific slave node.
Application Master
Every container on a slave node has its dedicated Application Master. Application
Masters are deployed in a container as well. Even MapReduce has an Application
Master that executes map and reduce tasks.
The Application Master oversees the full lifecycle of an application, all the way from
requesting the needed containers from the RM to submitting container lease
requests to the NodeManager.
JobHistory Server
The JobHistory Server allows users to retrieve information about applications that
have completed their activity. The REST API provides interoperability and can
dynamically inform users on current and completed jobs served by the server in
question.
Once all tasks are completed, the Application Master sends the result to the client
application, informs the RM that the application has completed its task, deregisters
itself from the Resource Manager, and shuts itself down.
The RM can also instruct the NodeManager to terminate a specific container during the process in case of a processing priority change.
The Application Master locates the required data blocks based on the information
stored on the NameNode. The AM also informs the ResourceManager to start a
MapReduce job on the same node the data blocks are located on. Whenever
possible, data is processed locally on the slave nodes to reduce bandwidth usage
and improve cluster efficiency.
The input data is mapped, shuffled, and then reduced to an aggregate result. The
output of the MapReduce job is stored and replicated in HDFS.
The Hadoop servers that perform the mapping and reducing tasks are often referred
to as Mappers and Reducers.
The ResourceManager decides how many mappers to use. This decision depends on
the size of the processed data and the memory block available on each mapper
server.
Map Phase
The mapping process ingests individual logical expressions of the data stored in the
HDFS data blocks. These expressions can span several data blocks and are called
input splits. Input splits are introduced into the mapping process as key-value pairs.
A mapper task goes through every key-value pair and creates a new set of key-value
pairs, distinct from the original input data. The complete assortment of all the key-
value pairs represents the output of the mapper task.
Based on the key from each pair, the data is grouped, partitioned, and shuffled to the
reducer nodes.
The output of a map task needs to be arranged to improve the efficiency of the
reduce phase. The mapped key-value pairs, being shuffled from the mapper nodes,
are arrayed by key with corresponding values. A reduce phase starts after the input
is sorted by key in a single input file.
The shuffle and sort phases run in parallel. Even as the map outputs are retrieved
from the mapper nodes, they are grouped and sorted on the reducer nodes.
There can be instances where the result of a map task is the desired result and there
is no need to produce a single output value.
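Tying the phases together, here is a hedged sketch of a driver that submits a MapReduce job to the cluster, reusing the WordCount mapper and reducer sketched in the MapReduce section above; the input and output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer from the earlier WordCount sketch; the reducer also
        // serves as a combiner to pre-aggregate counts on the mapper nodes.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Illustrative HDFS paths: the input is split for the mappers, and the
        // reduced output is written back to HDFS and replicated.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}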
Data movement is one of those things that you aren’t likely to think too much
about until you’re fully committed to using Hadoop on a project, at which point
it becomes this big scary unknown that has to be tackled. How do you get your
log data sitting across thousands of hosts into Hadoop? What’s the most efficient
way to get your data out of your relational and No/NewSQL systems and into
Hadoop? How do you get Lucene indexes generated in Hadoop out to your
servers? And how can these processes be automated?
This topic starts by highlighting key data-movement properties. We'll start with some simple techniques, such as using the command line and Java for ingress, but we'll quickly move on to more advanced techniques like using NFS.
Ingress and egress refer to data movement into and out of a system, respectively.
Data egress refers to data leaving a network in transit to an external location.
Outbound email messages, cloud uploads, or files being moved to external
storage are simple examples of data egress
Examples of data ingress in computer networking include data downloaded from the internet to a local computer, email messages delivered to a mailbox, and VoIP calls coming into a network.
Once the low-level tooling is out of the way, we’ll survey higher-level tools that have
simplified the process of ferrying data into Hadoop. We’ll look at how you can automate the
movement of log files with Flume, and how Sqoop can be used to move relational data.
Moving data into Hadoop
The first step in working with data in Hadoop is to make it available to Hadoop. There are
two primary methods that can be used to move data into Hadoop: writing external data at the
HDFS level (a data push), or reading external data at the MapReduce level (more like a pull).
Reading data in MapReduce has advantages in the ease with which the operation can be
parallelized and made fault tolerant. Not all data is accessible from MapReduce, however,
such as in the case of log files, which is where other systems need to be relied on for
transportation, including HDFS for the final data hop.
In this section we’ll look at methods for moving source data into Hadoop. I’ll use the design
considerations in the previous section as the criteria for examining and understanding the
different tools.
Roll your own ingest
Hadoop comes bundled with a number of methods to get your data into HDFS. This section will examine various ways that these built-in tools can be used for your data movement needs.
Picking the right ingest tool for the job
The low-level tools in this section work well for one-off file movement activities, or when working with legacy data sources and destinations that are file-based. But moving data in this way is quickly being made obsolete by the availability of tools such as Flume and Kafka (covered later in this chapter), which offer automated data-movement pipelines. Kafka is a much better platform for getting data from A to B (and B can be a Hadoop cluster) than the old-school "let's copy files around!" approach. With Kafka, you only need to pump your data into Kafka, and you have the ability to consume the data in real time (such as via Storm) or in offline/batch jobs (such as via Camus).
The first and potentially easiest tool you can use is the HDFS command line.
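From the command line this is typically done with hdfs dfs -put; the hedged Java equivalent below uses the FileSystem API to push a local file into HDFS, with the NameNode address and paths being illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // illustrative address

        try (FileSystem fs = FileSystem.get(conf)) {
            // One-off ingest: copy a local log file into HDFS.
            fs.copyFromLocalFile(new Path("/var/log/access.log"), new Path("/logs/access.log"));
        }
    }
}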
Your data might be XML files sitting behind a number of FTP servers, text log files sitting on a central web server, or Lucene indexes in HDFS. How does MapReduce support reading and writing to these different serialization structures across the various storage mechanisms? You'll need to know the answer in order to support a specific serialization format.
Data input :-
The two classes that support data input in MapReduce are InputFormat and Record-Reader.
The InputFormat class is consulted to determine how the input data should be partitioned for
the map tasks, and the RecordReader performs the reading of data from the inputs.
INPUT FORMAT:-
Every job in MapReduce must define its inputs according to contracts specified in the InputFormat abstract class. InputFormat implementers must fulfill three contracts: first, they describe type information for map input keys and values; next, they specify how the input data should be partitioned; and finally, they indicate the RecordReader instance that should read the data from the source.
RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an input
split and provide each record in the form of a key/value pair for use by mappers. A task is
commonly created for each input split, and each task has a single RecordReader that’s
responsible for reading the data for that input split.
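To make the InputFormat/RecordReader contracts concrete, here is a hedged sketch of a minimal InputFormat that reuses Hadoop's built-in LineRecordReader; the class name is made up for the example.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// The generic parameters declare the map input key/value types, isSplitable()
// controls how the input is partitioned, and createRecordReader() supplies the
// RecordReader that actually reads each input split.
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Treat every file as splittable; formats such as gzip would return false.
        return true;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Delegate to the built-in line reader: keys are byte offsets, values are lines.
        return new LineRecordReader();
    }
}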
DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input data.Two
classes must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some
basic validation of the data sink properties, and the RecordWriter writes each reducer output
to the data sink.
OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class defines the contracts that implementers must fulfill, including checking the information related to the job output, providing a RecordWriter, and specifying an output committer, which allows writes to be staged and then made "permanent" upon task and/or job success.
RECORD WRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink. It’s a
simple class.
There is also a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects. The conversion is performed by calling toString() on the keys and values, which makes sequence files suitable input for Streaming.
Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.
Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment:
Computer systems may vary in their hardware architecture, OS, addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform-and-language-
neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is carried
out. Reconstructed objects are clones of the original object.
Choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML, JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.
Computer data is generally organized in data structures such as arrays, tables, trees, and classes. When data structures need to be stored or transmitted to another location, such as across a network, they are serialized.
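As a small, self-contained illustration of serialization and deserialization, the sketch below uses Java's built-in object serialization rather than the cross-platform formats listed above; the User class is made up for the example.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {

    // A plain data-holding class; implementing Serializable lets Java turn
    // instances into a byte stream and back.
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    public static void main(String[] args) throws Exception {
        User original = new User("Alice", 25);

        // Serialization: object -> byte stream (could be written to disk or a socket).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // Deserialization: byte stream -> a clone of the original object.
        try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            User copy = (User) in.readObject();
            System.out.println(copy.name + ", " + copy.age);
        }
    }
}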
7) Differentiate between RDBMS and HADOOP?
RDBMS (Relational Database Management System): RDBMS is an information management system that is based on a data model. In RDBMS, tables are used for information storage: each row of the table represents a record and each column represents an attribute of the data. The organization of data and its manipulation processes differ in RDBMS from other databases. RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
Hadoop: It is an open-source software framework used for storing data and running applications on a group of commodity hardware. It has large storage capacity and high processing power, and it can manage multiple concurrent processes at the same time. It is used in predictive analysis, data mining, and machine learning. It can handle both structured and unstructured forms of data, and it is more flexible in storing, processing, and managing data than traditional RDBMS. Unlike traditional systems, Hadoop enables multiple analytical processes on the same data at the same time. It supports scalability very flexibly.
Below are some differences between RDBMS and Hadoop:
● In RDBMS, mostly structured data is processed, whereas in Hadoop both structured and unstructured data are processed.
● RDBMS is best suited for an OLTP environment, whereas Hadoop is best suited for big data.
● The data schema of RDBMS is static, whereas the data schema of Hadoop is dynamic.
HDFS Daemons:
Daemons mean processes. Hadoop daemons are a set of processes that run on Hadoop. Hadoop is a framework written in Java, so all these processes are Java processes.
Apache Hadoop 2 consists of the following daemons:
● NameNode
● DataNode
● Resource Manager
● Node Manager
Namenode, Secondary NameNode, and Resource Manager work on a Master System
while the Node Manager and DataNode work on the Slave machine.
1. NameNode
Two files ‘FSImage’ and the ‘EditLog’ are used to store metadata information.
FsImage: It is a snapshot of the file system when the NameNode is started. It is an "image file". The FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It also contains a serialized form of all the directories and file inodes in the filesystem. Each inode is an internal representation of a file's or directory's metadata.
EditLogs: It contains all the recent modifications made to the file system on the most
recent FsImage. NameNode receives a create/update/delete request from the client.
After that this request is first recorded to edits file.
Functions of NameNode in HDFS
1. It is the master daemon that maintains and manages the DataNodes (slave nodes).
2. It records the metadata of all the files stored in the cluster, e.g. the location of the blocks stored, the size of the files, permissions, hierarchy, etc. (a small illustration of this metadata is sketched after this list).
3. It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
4. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
5. It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
6. The NameNode is also responsible for maintaining the replication factor of all the blocks.
7. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
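As a small illustration of the metadata described in point 2 above, here is a hedged sketch using the Hadoop FileSystem API to list it from a client; the cluster URI hdfs://namenode:9000 and the path "/" are assumptions for the example, not values from the question bank.
java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists the per-file metadata (size, replication, block size, permissions)
// that the NameNode maintains and serves to clients.
public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath()
                    + " size=" + status.getLen()
                    + " replication=" + status.getReplication()
                    + " blockSize=" + status.getBlockSize()
                    + " permissions=" + status.getPermission());
        }
    }
}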
Features:
● It never stores the data that is present in the files; it maintains only their metadata.
The name node maintains the entire metadata in RAM, which helps clients receive quick responses to read requests. Therefore, it is important to run the name node on a machine that has lots of RAM at its disposal. The higher the number of files in HDFS, the higher the consumption of RAM. The name node daemon also maintains a persistent checkpoint of the metadata in a file stored on disk called the fsimage file. When the name node daemon is restarted, the following sequence of events occurs at name node boot-up:
1. Read the fsimage file from the disk and load it into memory (RAM).
2. Read the actions that are present in the edits log and apply each action to the
in-memory representation of the fsimage file.
3. Write the modified in-memory representation to the fsimage file on the disk.
The preceding steps make sure that the in-memory representation is up to date.
The namenode daemon is a single point of failure in Hadoop 1.x, which means that if
the node hosting the namenode daemon fails, the filesystem becomes unusable. To
handle this, the administrator has to configure the namenode to write the fsimage file to
the local disk as well as a remote disk on the network. This backup on the remote disk
can be used to restore the namenode on a freshly installed server. Newer versions of
Apache Hadoop (2.x) now support High Availability (HA), which deploys two
namenodes in an active/passive configuration, wherein if the active namenode fails, the
control falls onto the passive namenode, making it active. This configuration reduces
the downtime in case of a namenode failure.
Since the fsimage file is not updated for every operation, the edits logfile can grow into a very large file. The restart of the namenode service then becomes very slow, because all the actions in the large edits logfile have to be applied to the fsimage file. This slow boot-up time can be avoided by using the secondary namenode daemon.
The namespace image and the edit log store metadata about the data. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure. The DataNode is a multiple-instance server: there can be several DataNode servers, and their number depends on the type of network and the storage system.
The DataNode servers store and maintain the data blocks. The NameNode provisions the data blocks on the basis of the type of job submitted by the client. The DataNode also stores and retrieves blocks when asked to by clients or the NameNode. Furthermore, it serves read/write requests and performs block creation, deletion, and replication as instructed by the NameNode. There can be only one Secondary NameNode server in a cluster. Note that you cannot treat the Secondary NameNode server as a disaster recovery server; however, it can partially restore the NameNode server in case of a failure.
Data Node
1. The DataNode is also known as the slave node.
2. In the Hadoop HDFS architecture, the DataNode stores the actual data in HDFS.
3. DataNodes are responsible for serving read and write requests from clients.
4. DataNodes can be deployed on commodity hardware.
5. DataNodes send information to the NameNode about the files and blocks stored on that node and respond to the NameNode for all filesystem operations.
6. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
7. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNode.
The datanode daemon acts as a slave node and is responsible for storing the actual files in HDFS. The files are split into data blocks across the cluster; the blocks are typically 64 MB to 128 MB in size, and the block size is a configurable parameter.
The file blocks in a Hadoop cluster also replicate themselves to other datanodes for
redundancy so that no data is lost in case a datanode daemon fails. The datanode
daemon sends information to the namenode daemon about the files and blocks stored
in that node and responds to the namenode daemon for all filesystem operations. The
following diagram shows how files are stored in the cluster:
File blocks of files A, B, and C are replicated across multiple nodes of the cluster for redundancy. This ensures availability of the data even if one of the nodes fails. You can also see that blocks of file A are present on nodes 2, 4, and 6; blocks of file B are present on nodes 3, 5, and 7; and blocks of file C are present on nodes 4, 6, and 7. The replication factor configured for this cluster is 3, which signifies that each file block is replicated three times across the cluster. It is the responsibility of the namenode daemon to maintain a list of the files and their corresponding block locations in the cluster. Whenever a client needs to access a file, the namenode daemon provides the location of the file to the client, and the client then accesses the file directly from the datanode daemon.
Each and every transaction that occurs on the file system is recorded in the edit log file. Over time, this file can become very large.
The Secondary NameNode periodically fetches the edit logs from the NameNode and merges them into the fsimage. This new fsimage is copied back to the NameNode, which uses it on the next restart, reducing the startup time.
It is a helper node to the NameNode; to be precise, the Secondary NameNode's whole purpose is to perform checkpointing in HDFS, which helps the NameNode function effectively. Hence, it is also called the Checkpoint node.
There are two important files which reside in the NameNode's current directory:
1. FsImage file: this file is the snapshot of the HDFS metadata at a certain point in time.
2. Edits Log file: this file stores the records of changes that have been made in the HDFS namespace.
The main function of the Secondary NameNode is to store the latest copy of the FsImage and the Edits Log files. When the NameNode is restarted, the latest Edits Log files are applied to the FsImage file in order to bring the HDFS metadata up to date, so it is very important to store a copy of these two files, which is what the Secondary NameNode does. To keep the latest versions of these two files, the Secondary NameNode takes checkpoints on an hourly basis, which is the default time gap.
Checkpoint:-
A checkpoint is nothing but the updating of the latest FsImage file by applying the latest Edits Log files to it. If the time gap between checkpoints is large, too many Edits Log files are generated, and it becomes very cumbersome and time consuming to apply them all at once to the latest FsImage file. This can lead to a very long start-up time for the primary namenode after a reboot.
However, the secondary namenode is just a helper to the primary namenode in an HDFS cluster, as it cannot perform all the functions of the primary namenode.
Note:-
There are two options which can be used with the secondary namenode command:
1. -geteditsize: this option reports the current size of the edits_inprogress file present in the namenode's current directory; the edits_inprogress file is the ongoing, in-progress Edits Log file.
2. -checkpoint [force]: this option checkpoints the secondary namenode against the latest state of the primary namenode. Normally a checkpoint is taken only if the Edits Log size is greater than or equal to the configured checkpoint size; with force, the checkpoint is taken regardless of the Edits Log size.
fsimage: It is a snapshot of the file system; it stores information like modification time, access time, permissions, and replication.
Edit logs: They store details of all the activities/transactions being performed on the HDFS.
When the namenode is in the active state, the edit logs grow continuously, because edit logs can only be applied to the fsimage at the time of a namenode restart to get the latest state of HDFS. If the edit logs grow significantly and the namenode tries to apply them to the fsimage at restart time, the process can take very long; this is where the secondary namenode comes into play.
The secondary namenode keeps a checkpoint of the namenode: it reads the edit logs from the namenode at regular intervals and applies them to its own copy of the fsimage. In this way, the fsimage file always holds a recent state of HDFS. The secondary namenode then copies the new fsimage back to the primary, so the fsimage stays updated. Since the fsimage is updated, there is little overhead of replaying large edit logs at the moment of restarting the cluster.
The secondary namenode is a helper node and cannot replace the namenode.
The Secondary NameNode is used for taking an hourly checkpoint (backup) of the filesystem metadata. In case the Hadoop cluster fails or crashes, this hourly checkpoint, stored in a file named fsimage, can be transferred to a new system. The metadata is loaded on that new system, a new master is created with it, and the cluster is made to run again correctly. This is the benefit of the Secondary NameNode.
1. Resource Manager
The Resource Manager is also known as the global master daemon that works on the master system. The Resource Manager manages the resources for the applications that are running in a Hadoop cluster.
The Resource Manager mainly consists of two components:
1. Applications Manager
2. Scheduler
The Applications Manager is responsible for accepting job requests from clients and for negotiating a container (memory resource) on one of the slaves in the Hadoop cluster to host the ApplicationMaster.
The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster; it only schedules and does not itself monitor or track the status of those applications.
2. Node Manager
The Node Manager works on the slave systems and manages the resources (memory and disk) available on that node. Each slave node in a Hadoop cluster runs a single NodeManager daemon. It monitors resource usage on the node and sends this monitoring information to the Resource Manager.
How to start the Node Manager?
yarn-daemon.sh start nodemanager
How to stop the Node Manager?
yarn-daemon.sh stop nodemanager
Default web UI ports: ResourceManager 8088, NodeManager 8042.
The below diagram shows how Hadoop works.
We focus on one of the components of Hadoop, i.e., HDFS, and the anatomy of file reading and file writing in HDFS. HDFS is a file system designed for storing very large files (files that are hundreds of megabytes, gigabytes, or terabytes in size) with streaming data access, running on clusters of commodity hardware (commonly available hardware that can be obtained from various vendors). In simple terms, the storage unit of Hadoop is called HDFS.
Some of the characteristics of HDFS are:
● Fault tolerance
● Scalability
● Distributed storage
● Reliability
● High availability
● Cost-effectiveness
● High throughput
Building Blocks of Hadoop:
● Name Node
● Data Node
● Secondary Name Node (SNN)
● Job Tracker
● Task Tracker
Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
Step 2: The Distributed File System (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
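A minimal client-side sketch of this read path, using the Hadoop FileSystem API, is shown below; the NameNode URI and the file path are illustrative assumptions rather than values from the question bank.
java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // FileSystem.get() returns a DistributedFileSystem for an hdfs:// URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // open() asks the NameNode for block locations and returns an FSDataInputStream.
        try (InputStream in = fs.open(new Path("/data/sample.txt"))) {
            // read() streams the data block by block from the closest DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } // close() is called on the stream when the try block exits (Step 6).
    }
}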
Anatomy of File Write in HDFS
Next, we'll look at how files are written to HDFS. Consider figure 1.2 to get a better understanding of the concept.
Note: HDFS follows the Write Once Read Many model. In HDFS we cannot edit files which are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on the DistributedFileSystem (DFS).
Step 2: The DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node prepares a record of the new file; otherwise, the file can't be created and the client is thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".
Step 6: When the client finishes writing, it closes the stream. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model, so we can't edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the system.
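Correspondingly, a minimal client-side sketch of the write path looks like the following; again, the cluster URI and the output path are assumptions made for the example.
java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // create() makes the RPC to the NameNode that records the new file; the
        // returned FSDataOutputStream pushes packets into the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        } // closing the stream flushes the remaining packets and completes the file.
    }
}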
Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored. In general, in any file system, you store data as a collection of blocks. Similarly, HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
It is not necessary that in HDFS each file is stored in an exact multiple of the configured block size (128 MB, 256 MB, etc.). Let's take an example where I have a file "example.txt" of size 514 MB as shown in the above figure. Suppose that we are using the default block size configuration, which is 128 MB. Then, how many blocks will be created? 5, right? The first four blocks will be of 128 MB, but the last block will be of only 2 MB.
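The block-splitting arithmetic above can be sketched in a few lines of Java; the file size, block size, and replication factor are simply the figures used in this example.
java
// Computes how a 514 MB file is split under the default 128 MB block size,
// and how much raw space it occupies once every block is replicated 3 times.
public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 514;   // size of "example.txt" in the example above
        long blockSizeMb = 128;  // default HDFS block size in Hadoop 2.x
        int replication = 3;     // default replication factor

        long fullBlocks = fileSizeMb / blockSizeMb;                  // 4 blocks of 128 MB
        long lastBlockMb = fileSizeMb % blockSizeMb;                 // 2 MB left over
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);   // 5 blocks in total

        System.out.println("Blocks: " + totalBlocks + " (last block = " + lastBlockMb + " MB)");
        System.out.println("Raw space consumed: " + (fileSizeMb * replication) + " MB");
    }
}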
Now, you must be wondering why we need such a huge block size, i.e. 128 MB. Well, whenever we talk about HDFS, we talk about huge data sets, i.e. terabytes and petabytes of data. So, if we had a block size of, say, 4 KB, as in a Linux file system, we would have too many blocks and therefore too much metadata. Managing this huge number of blocks and their metadata would create huge overhead, which is something we don't want.
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance. The default replication
factor is 3 which is again configurable. So, as you can see in the figure below where
each block is replicated three times and stored on different DataNodes (considering
the default replication factor):
Therefore, if you are storing a file of 128 MB in HDFS using the default configuration,
you will end up occupying a space of 384 MB (3*128 MB) as the blocks will be
replicated three times and each replica will be residing on a different DataNode.
Note: The NameNode collects block reports from the DataNodes periodically to maintain the replication factor. Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
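The note above can be made concrete with a small, hedged sketch using the Hadoop FileSystem API: it prints which DataNodes hold each block of a file and then requests a different replication factor, leaving it to the NameNode to add or delete replicas. The cluster URI and file path are assumptions for the example.
java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInspector {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/data/example.txt");

        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation lists the DataNodes (hosts) holding replicas of one block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        // Ask for a new replication factor; the NameNode converges to this value
        // by adding or deleting replicas in the background.
        fs.setReplication(file, (short) 2);
    }
}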
Rack Awareness:
Anyways, moving ahead, let’s talk more about how HDFS places replica and what is
rack awareness? Again, the NameNode also ensures that all the replicas are not stored
on the same rack or a single rack. It follows an in-built Rack Awareness Algorithm to
reduce latency as well as provide fault tolerance.
Considering the replication factor is 3, the Rack Awareness algorithm says that the first replica of a block will be stored on a local rack and the next two replicas will be stored on a different (remote) rack, but on different DataNodes within that (remote) rack, as shown in the figure above.
This is how an actual Hadoop production cluster looks like. Here, you have multiple
racks populated with DataNodes:
So, now you might be wondering why we need a Rack Awareness algorithm. The reasons are:
● To improve network performance: The communication between nodes residing on different racks is directed via a switch. In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks. So, Rack Awareness helps you reduce write traffic between different racks and thus provides better write performance. You also gain increased read performance because you are using the bandwidth of multiple racks.
● To prevent loss of data: We don't have to worry about the data even if an entire rack fails because of a switch failure or a power failure. And if you think about it, this makes sense: never put all your eggs in the same basket.
Assume that the system block size is configured to 128 MB (default). So, the client will divide the file "example.txt" into two blocks: one of 128 MB (Block A) and the remaining data as a second block (Block B).
Now, the following protocol will be followed whenever the data is written into HDFS:
● At first, the HDFS client will reach out to the NameNode for a Write Request
against the two blocks, say, Block A & Block B.
● The NameNode will then grant the client the write permission and will provide
the IP addresses of the DataNodes where the file blocks will be copied
eventually.
● The selection of the DataNodes' IP addresses is based on their availability, the replication factor, and the rack awareness policy that we have discussed earlier.
● Let’s say the replication factor is set to default i.e. 3. Therefore, for each block
the NameNode will be providing the client a list of (3) IP addresses of
DataNodes. The list will be unique for each block.
● Suppose, the NameNode provided following lists of IP addresses to the client:
o For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}
o For Block B, set B = {IP of DataNode 3, IP of DataNode 7, IP of
DataNode 9}
1. Set up of Pipeline
2. Data streaming and replication
3. Shutdown of Pipeline (Acknowledgement stage)
1. Set up of Pipeline:
Before writing the blocks, the client confirms whether the DataNodes, present in each
of the list of IPs, are ready to receive the data or not. In doing so, the client creates a
pipeline for each of the blocks by connecting the individual DataNodes in the
respective list for that block. Let us consider Block A. The list of DataNodes provided by the NameNode is: list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.
So, for block A, the client will be performing the following steps to create a pipeline:
● The client will choose the first Data Node in the list (Data Node IPs for Block
A) which is Data Node 1 and will establish a TCP/IP connection.
● The client will inform Data Node 1 to be ready to receive the block. It will also
provide the IPs of next two Data Nodes (4 and 6) to the Data Node 1 where the
block is supposed to be replicated.
● The Data Node 1 will connect to Data Node 4. The DataNode 1 will inform
Data Node 4 to be ready to receive the block and will give it the IP of
DataNode 6. Then, Data Node 4 will tell Data Node 6 to be ready for receiving
the data.
● Next, the acknowledgement of readiness will follow the reverse sequence, i.e.
From the DataNode 6 to 4 and then to 1.
● At last DataNode 1 will inform the client that all the DataNodes are ready and
a pipeline will be formed between the client, DataNode 1, 4 and 6.
● Now pipeline set up is complete and the client will finally begin the data copy
or streaming process.
2. Data Streaming:
As the pipeline has been created, the client will push the data into the pipeline. Now,
don’t forget that in HDFS, data is replicated based on replication factor. So, here Block
A will be stored to three DataNodes as the assumed replication factor
is 3. Moving ahead, the client will copy the block (A) to DataNode 1 only. The replication is always
done by DataNodes sequentially.
● Once the block has been written to DataNode 1 by the client, DataNode 1 will connect to
DataNode 4.
● Then, DataNode 1 will push the block in the pipeline and data will be copied to DataNode 4.
● Again, DataNode 4 will connect to DataNode 6 and will copy the last replica of the block.
Once the block has been copied into all three DataNodes, a series of acknowledgements takes place to assure the client and the NameNode that the data has been written successfully. Then, the client finally closes the pipeline to end the TCP session.
As shown in the figure below, the acknowledgement happens in the reverse sequence i.e. from
DataNode 6 to 4 and then to 1. Finally, the DataNode 1 will push three acknowledgements
(including its own) into the pipeline and send it to the client. The client will inform NameNode that
data has been written successfully. The NameNode will update its metadata and the client will shut
down the pipeline.
Similarly, Block B will also be copied into the DataNodes in parallel with Block A. So, the
following things are to be noticed here:
● The client will copy Block A and Block B to the first DataNode simultaneously.
● Therefore, in our case, a pipeline will be formed for each of the two blocks, and all the processes discussed above will happen in parallel in these two pipelines.
● The client writes the block into the first DataNode and then the DataNodes will be
replicating the block sequentially.
As you can see in the above image, a pipeline is formed for each block (A and B), and the operations discussed above take place in parallel in the two pipelines.
Following is the flow of operations that takes place when the client reads the file back from HDFS:
● The client will reach out to the NameNode asking for the block metadata of the file "example.txt".
● The NameNode will return the list of DataNodes where each block (Block A and Block B) is stored.
● After that, the client will connect to the DataNodes where the blocks are stored.
● The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
● Once the client gets all the required file blocks, it will combine these blocks to form a file.
While serving a read request from the client, HDFS selects the replica which is closest to the client. This reduces the read latency and the bandwidth consumption. Therefore, the replica that resides on the same rack as the reader node is selected, if possible.
Now, you should have a pretty good idea about the Apache Hadoop HDFS architecture. I understand that there is a lot of information here and that it may not be easy to take it all in at one go.
● No foreign keys: There are no dynamic relationships between documents, so documents can be independent of one another; hence there is no requirement for a foreign key in a document database.
● Open formats: To build a document we use XML, JSON, and other open formats.
2. Key-Value Stores
A key-value store is a non-relational database, and the simplest form of a NoSQL database. Every data element in the database is stored as a key-value pair. The data can be retrieved by using the unique key allotted to each element in the database. The values can be simple data types like strings and numbers or complex objects. A key-value store is like a relational database with only two columns: the key and the value.
4. Graph-Based Databases
Graph-based databases focus on the relationship between the elements. It stores the data in the form of nodes in the
database. The connections between the nodes are called links or relationships, making them ideal for complex
relationship-based queries.
● Data is represented as nodes (objects) and edges (connections).
● Fast graph traversal algorithms help retrieve relationships quickly.
● Used in scenarios where relationships are as important as the data itself.
19. Compare SQL and NoSQL databases in terms of structure, performance, and scalability.
Differences Between SQL and NoSQL
● Data structure: SQL (relational) databases store data in tables with rows and columns, whereas NoSQL (non-relational) databases are document-based, key-value, column-family, or graph-based.
1. Type
SQL databases are primarily called Relational Databases (RDBMS); whereas NoSQL
databases are primarily called non-relational or distributed databases.
2. Language
SQL databases define and manipulate data using structured query language (SQL). On one hand, this language is extremely powerful: SQL is one of the most versatile and widely used options available, which makes it a safe choice, especially for complex queries. On the other hand, it can be restrictive.
SQL requires you to use predefined schemas to determine the structure of your data before you work with it, and all of your data must follow the same structure. This can require significant up-front preparation, and it means that a change in the structure would be both difficult and disruptive to your whole system.
3. Scalability
In almost all situations, SQL databases are vertically scalable. This means that you can increase the load on a single server by increasing things like RAM, CPU, or SSD. On the other hand, NoSQL databases are horizontally scalable. This means that you handle more traffic by sharding, i.e., adding more servers to your NoSQL database.
It is similar to adding more floors to the same building versus adding more buildings to the neighborhood. Thus NoSQL can ultimately become larger and more powerful, making these databases the preferred choice for large or ever-changing data sets.
4. Structure
SQL databases are table-based; on the other hand, NoSQL databases are either key-value pairs, document-based, graph databases, or wide-column stores. This makes relational SQL databases a better option for applications that require multi-row transactions, such as an accounting system, or for legacy systems that were built for a relational structure.
Here is a simple example of how structured data with rows and columns versus unstructured data without a fixed definition might look. A product table in a SQL database might accept data looking like this:
{
"id": "101",
"category": "food",
"name": "Apples",
"qty": "150"
}
Whereas an unstructured NoSQL DB might save the products in many variations, without constraints and without changing any underlying table structure:
Products = [
{
"id": "101",
"category": "food",
"name": "California Apples",
"qty": "150"
},
{
"id": "102",
"category": "electronics",
"qty": "10",
"specifications": {
"storage": "256GB SSD",
"cpu": "8 Core"
}
}
]
5. Property followed
SQL databases follow the ACID properties (Atomicity, Consistency, Isolation, and Durability), whereas NoSQL databases follow Brewer's CAP theorem (Consistency, Availability, and Partition tolerance).
6. Support
Great support is available for all SQL databases from their vendors. Also, a lot of independent consultants are there who can help you with SQL databases for very large-scale deployments, but for some NoSQL databases you still have to rely on community support, and only limited outside experts are available for setting up and deploying your large-scale NoSQL deployments.
What is SQL?
SQL databases, also known as Relational Database Management Systems (RDBMS), use
structured tables to store data. They rely on a predefined schema that determines the
organization of data within tables, making them suitable for applications that require a
fixed, consistent structure.
● Structured Data: Data is organized in tables with rows and columns, making it easy to query and manage.
● ACID compliance: Transactions preserve data integrity.
● Examples: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server.
2. Compare MongoDB and RDBMS in terms of data storage, performance, and scalability.
3. Describe the key differences between SQL and MongoDB queries with examples.
SQL:
sql
SELECT * FROM users WHERE age > 25;
MongoDB:
javascript
db.users.find({ "age": { $gt: 25 } });
json
{
"name": "Alice",
"age": 25,
"skills": ["Java", "MongoDB"]
}
4.
5. Explain the structure of a MongoDB document and how it is different from a relational database row.
MongoDB documents are self-contained and flexible, unlike RDBMS rows that adhere to a strict
schema.
6. What are the important terms used in MongoDB, and how do they relate to RDBMS concepts?
○ Table → Collection
○ Row → Document
○ Column → Field
7. How does indexing improve performance in MongoDB? Explain different types of indexes.
Indexes speed up queries by letting MongoDB locate matching documents without scanning the whole collection. Common types include single-field, compound, multikey, and text indexes, created with db.collection.createIndex().
javascript
db.sales.aggregate([{ $group: { _id: "$product", totalSales: { $sum:
"$amount" } } }]);
9.
10. Discuss the advantages and disadvantages of using MongoDB over traditional relational databases.
○ Advantages: Flexible schema, high scalability, and fast queries on denormalized documents.
○ Disadvantages: No enforced schema can lead to inconsistent data, joins across collections are limited, and multi-document transactions are more restricted than in an RDBMS.
Answer:
R is an open-source programming language primarily used for statistical computing and data visualization. It
is widely used in data science due to its powerful data analysis capabilities, extensive library support, and
ability to handle large datasets. Additionally, R provides various built-in functions for machine learning,
regression analysis, and data visualization, making it popular among statisticians and data scientists.
Answer:
● Supports various data types and structures (vectors, matrices, data frames, etc.).
Answer:
R supports the following types of operators:
● Arithmetic Operators (+, -, *, /, %%, %/%, ^) – Used for mathematical calculations.
● Relational (Comparison) Operators (>, <, >=, <=, ==, !=) – Used for comparisons.
● Logical Operators (&, |, !) – Used for combining conditions.
● Assignment Operators (<-, =, <<-) – Used for assigning values, as shown below.
● Miscellaneous Operators (%in%, :, $) – Used for checking membership, sequence generation, and accessing list elements.
r
x <- 10    # leftward assignment operator (most common)
y = 20     # '=' also assigns, and is the usual form for function arguments
z <<- 30   # '<<-' assigns in an enclosing environment (the global environment at top level)
Answer:
Control statements allow the execution of specific blocks of code based on conditions. Common control
statements in R include:
if statement
r
x <- 10
if (x > 5) {
print("x is greater than 5")
}
if-else statement
r
if (x > 5) {
print("x is greater than 5")
} else {
print("x is not greater than 5")
}
switch statement
r
y <- "two"
switch(y,
"one" = print("Selected One"),
"two" = print("Selected Two"),
"three" = print("Selected Three"))
Answer:
Loops allow repetitive execution of a block of code.
for loop
r
for (i in 1:5) {
print(i)
}
while loop
r
x <- 1
while (x <= 5) {
print(x)
x <- x + 1
}
repeat loop
r
x <- 1
repeat {
print(x)
x <- x + 1
if (x > 5) break
}
Answer:
Functions are reusable blocks of code that perform specific tasks.
r
my_function <- function(a, b) {
return(a + b)
}
my_function(3, 5)
Answer:
Interfacing with R means integrating it with external systems such as databases, C/C++ programs, and other
scripting languages.
Answer:
R can connect to databases using packages like:
r
library(RMySQL)
con <- dbConnect(MySQL(), user='root', password='password', dbname='database',
host='localhost')
Answer:
Vectors are one-dimensional arrays that store elements of the same type.
r
v <- c(1, 2, 3, 4, 5)
Answer:
Matrices are two-dimensional data structures with the same data type.
r
m <- matrix(1:6, nrow=2, ncol=3)
Answer:
A list can store elements of different data types, unlike a vector.
r
l <- list(1, "text", TRUE)
Answer:
A data frame is a table-like structure with columns of different data types.
r
df <- data.frame(Name=c("A", "B"), Age=c(25, 30))
A matrix has only one data type, while a data frame can have multiple types.
Answer:
Factors represent categorical data and improve efficiency.
r
factor(c("male", "female", "male"))
Answer:
Tables store frequency distributions and make it easier to summarize data.
r
table(c("A", "B", "A", "B", "C"))
Answer:
r
write.csv(df, "output.csv")
Answer:
● Histogram (hist())
● Boxplot (boxplot())
Answer:
sapply() is similar to lapply(), but it simplifies the result and returns a vector or matrix instead of a list.