BDA Assignment

Q.1 :- What is Big Data? Explain the characteristics of Big Data.

ANS :-

Wikipedia defines "Big Data" as a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.

In simple terms, "Big Data" consists of very large volumes of heterogeneous data that is generated, often at high speed. These data sets cannot be managed and processed using the traditional data management tools and applications at hand. Big Data requires a new set of tools, applications and frameworks to process and manage the data.

Characteristics of Big Data


Big data can be described by the following characteristics:

• Volume
• Variety
• Velocity
• Variability

(i) Volume – The name "Big Data" itself refers to an enormous size. The volume of data plays a crucial role in determining the value that can be extracted from it, and whether a particular data set can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and to the nature of the data itself, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses issues for storing, mining and analyzing the data.

(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors and mobile devices. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.

Q.4 :- Describe the Traditional vs. Big Data business approach. Explain the Challenges of Conventional Systems.
ANS :-

Challenges of conventional systems

• Big data is the storage and analysis of large data sets.
• These are complex data sets that can be either structured or unstructured.
• They are so large that it is not possible to work on them with traditional analytical tools.
• One of the major challenges of conventional systems is the uncertainty of the data management landscape.
• Big data is continuously expanding; new companies and technologies are being developed every day.
• A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
• These days, organizations are realizing the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency into their work environments.
Q.7 :- What are the advantages of Hadoop? Explain Hadoop Architecture and its Components with a proper diagram.

ANS :- Advantages of Hadoop :-

1. Varied Data Sources
2. Cost-effective
3. Performance
4. Fault-Tolerant (see the replication sketch after this list)
5. Highly Available
6. Low Network Traffic
7. High Throughput
8. Open Source
9. Scalable
10. Ease of Use
11. Compatibility
12. Multiple Languages Supported
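
Several of these advantages correspond to concrete, tunable behaviour. As a small illustration of fault tolerance, the sketch below uses Hadoop's Java FileSystem API to control the HDFS replication factor: dfs.replication determines how many copies of each block the cluster keeps, which is what lets data survive individual DataNode failures. This is a minimal sketch; the path /data/report.csv is hypothetical, and the client is assumed to have the cluster's configuration files on its classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();

    // Default replication for files created through this client:
    // with three copies, two DataNodes can fail without data loss.
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file: raise its replication to 5 so reads
    // tolerate even more DataNode failures.
    boolean accepted =
        fs.setReplication(new Path("/data/report.csv"), (short) 5);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}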
Other Ecosystem Components/Projects :- Besides HDFS and MapReduce, the Hadoop ecosystem includes projects such as Hive, Pig, HBase, Sqoop, Flume, Oozie and ZooKeeper.
Q.10 :- What are NameNode & DataNode in Hadoop Architecture?
ANS :-

NameNode :-
NameNode can be regarded as the master of the system. It keeps track of the file system tree and the metadata for all of the system's files and directories. This metadata is persisted in two files: the 'namespace image' and the 'edit log'. The NameNode knows which DataNodes hold the blocks of a given file, but it does not persist block locations; each time the system starts, this information is reconstructed from the block reports sent by the DataNodes.

The NameNode is the controller and manager of HDFS, since it knows the state and metadata of all HDFS files, including file permissions, names and block locations. Because the metadata is small, it is kept in the NameNode's memory, allowing for faster access. Moreover, because the HDFS cluster is accessed by many clients concurrently, all of this metadata is served by a single machine. The NameNode performs file system operations such as opening, closing and renaming files and directories.

DataNode :-
The DataNode is a commodity computer with the GNU/Linux operating system and DataNode software installed. In a cluster, there is a DataNode for each node (commodity hardware/system). These nodes are in charge of storing the system's data.

DataNodes serve read and write requests from the file system's clients. They also carry out operations such as block creation, deletion and replication, in accordance with the NameNode's instructions.
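
To make the division of labour between the two node types concrete, here is a minimal sketch using Hadoop's Java FileSystem API. It asks where the blocks of a file live: the NameNode answers this metadata query without any file data being read, and the hosts it returns are the DataNodes holding the replicas of each block. The path /data/sample.txt is hypothetical, and fs.defaultFS is assumed to point at the cluster's NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    // Loads the cluster configuration from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical HDFS file used for illustration.
    Path file = new Path("/data/sample.txt");
    FileStatus status = fs.getFileStatus(file);

    // A pure metadata call: the NameNode answers it from memory.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block reports the DataNodes holding a replica of it.
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }

    fs.close();
  }
}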
Q.12 :- Discuss the role of JobTracker and TaskTracker in processing data with Hadoop.

ANS :- JobTracker and TaskTracker


JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2), where they are replaced by the ResourceManager, ApplicationMaster and NodeManager daemons.

JobTracker –
1. The JobTracker process runs on a separate node, not usually on a DataNode.
2. JobTracker is an essential daemon for MapReduce execution in MRv1. It is replaced by ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the slots available to execute a task on a given node.
6. JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
7. The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
8. When the JobTracker is down, HDFS is still functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.

TaskTracker –
1. TaskTracker runs on DataNodes, usually on every DataNode.
2. TaskTracker is replaced by NodeManager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
4. TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.
5. TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker assigns the tasks it was executing to another node.
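
For context, the sketch below is the canonical WordCount job, written against the current MRv2 API (org.apache.hadoop.mapreduce) since the MRv1 classes discussed above are deprecated. Under MRv1, the JobTracker would schedule this job's Mapper and Reducer tasks onto TaskTrackers close to the input blocks; under MRv2 the ResourceManager and ApplicationMaster play those roles. The input and output directories are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner pre-aggregates map output locally,
    // reducing network traffic to the reducers.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once compiled into a JAR, the job would typically be submitted with a command of the form: hadoop jar wordcount.jar WordCount <input dir> <output dir>.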
