Bda Assignment
ANS :-
Wikipedia defines "Big Data" as a collection of data sets so large and complex that it becomes
difficult to process them using on-hand database management tools or traditional data processing
applications.
In simple terms,
"Big Data" consists of very large volumes of heterogeneous data that are being generated, often
at high speed. These data sets cannot be managed and processed using the traditional data
management tools and applications at hand. Big Data requires a new set of tools, applications
and frameworks to process and manage it. Its key characteristics are:
• Volume
• Variety
• Velocity
• Variability
(i) Volume – The name Big Data itself relates to a size that is enormous. The size of data
plays a very crucial role in determining its value. Whether a particular data set can actually
be considered Big Data or not also depends upon its volume.
Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big
Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. In earlier days, spreadsheets and databases were the only sources of data
considered by most applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
This variety of unstructured data poses certain issues for storing, mining and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast
the data is generated and processed to meet demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources such as business
processes, application logs, networks, social media sites, sensors, mobile devices, etc.
The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times,
hampering the ability to handle and manage the data effectively.
Q.4. :- Describe Traditional vs. Big Data business approach. Explain challenges of conventional
systems.
ANS :-
NameNode :-
NameNode can be regarded as the system’s master. It keeps track of the file system tree and
the metadata for all of the system’s files and folders. This metadata is stored in two files:
the ‘namespace image’ and the ‘edit log’. The NameNode knows which DataNodes hold the blocks
of a particular file, but it does not persist block locations; this information is rebuilt
from the DataNodes each time the system starts.
The NameNode is the HDFS controller and manager, since it is aware of the state and metadata of
all HDFS files, including file permissions, names and block locations. Because the metadata
is small, it is kept in the NameNode’s memory, allowing for faster access.
Furthermore, even though the HDFS cluster is accessed by several clients at the same time, all
of this metadata is handled by a single machine. The NameNode performs file system namespace
operations such as opening, closing, renaming, and so on.
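To make the NameNode's role concrete, the following is a minimal sketch using the standard Hadoop FileSystem Java API; the path used is purely illustrative, and the NameNode address is assumed to come from the cluster's core-site.xml. Every operation shown here is answered by the NameNode alone, since only metadata (names, permissions, block locations) is involved:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataDemo {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (the NameNode address) is read from the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/sample.txt"); // example path, not from the assignment
        FileStatus status = fs.getFileStatus(file);    // permissions, size, owner, etc.

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset()
                    + " hosts: " + String.join(",", b.getHosts()));
        }

        // Rename is also a pure metadata operation handled by the NameNode.
        fs.rename(file, new Path("/user/demo/sample-renamed.txt"));
    }
}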
DataNode :-
The DataNode is a commodity computer with the GNU/Linux operating system and the DataNode
software installed. In a cluster, there will be a DataNode for each node (commodity
hardware/system). These nodes are in charge of the system’s data storage.
DataNodes respond to client requests by performing read-write operations on the file system.
They also carry out actions such as block creation, deletion, and replication in accordance
with the NameNode’s instructions.
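As a rough sketch of where DataNodes come in, the snippet below writes and then reads a small file with the same FileSystem API; the client obtains block locations from the NameNode, but the actual bytes are streamed to and from the DataNodes (the path is again illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataNodeIoDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/hello.txt"); // example path

        // Write: blocks are pipelined to DataNodes and replicated
        // according to the NameNode's instructions.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: bytes are fetched from whichever DataNodes hold the blocks.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}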
Q.12. :- Discuss the role of JobTracker and TaskTracker in processing data with Hadoop.
ANS :-
Job Tracker –
1. JobTracker process runs on a separate node and not usually on a DataNode.
2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is
replaced by ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives the requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data
locality (proximity of the data) and the available slots to execute a task on a
given node.
6. JobTracker monitors the individual TaskTrackers and submits the overall status of
the job back to the client (a minimal job-submission sketch follows this list).
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce
execution.
8. When the JobTracker is down, HDFS will still be functional, but MapReduce
execution cannot be started and the existing MapReduce jobs will be halted.
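A minimal job-submission sketch in the classic MRv1 API is given below. The JobTracker address, the input/output paths and the WordCountReducer class are assumptions made for illustration; the WordCountMapper class is sketched after the TaskTracker list that follows:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitWordCount.class);
        conf.setJobName("wordcount");

        // Address of the JobTracker; host and port are placeholders.
        conf.set("mapred.job.tracker", "jobtracker-host:8021");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);   // sketched after the TaskTracker list
        conf.setReducerClass(WordCountReducer.class); // hypothetical, not shown

        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));   // example path
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output")); // example path

        // JobClient hands the job to the JobTracker, which schedules map and
        // reduce tasks on TaskTrackers; this call blocks until the JobTracker
        // reports the job as finished.
        JobClient.runJob(conf);
    }
}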
TaskTracker –
1. TaskTracker runs on DataNodes, typically one on every DataNode.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by
TaskTrackers (a minimal Mapper sketch follows this list).
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by
JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling
the progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to
another node.
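For completeness, here is a minimal word-count Mapper in the classic MRv1 API (the same WordCountMapper assumed in the job-submission sketch above); instances of this class are what a TaskTracker launches in child JVMs on a DataNode:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Split the input line into tokens and emit a (word, 1) pair for each.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }
}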