ABSTRACT
Big Data refers to the vast volumes of structured and unstructured data generated at high velocity
from various sources. Its characteristics include scale, diversity, and complexity, necessitating
advanced architectures and analytics to extract valuable insights.
Hadoop is a prominent open-source framework designed for distributed storage and processing
of big data across clusters, enabling scalability from single servers to thousands of machines. It
utilizes MapReduce for efficient data processing and HDFS (Hadoop Distributed File System)
for data storage, making it essential for handling massive datasets.
KEYWORDS: -
Big Data, RDBMS, MapReduce, Hadoop Distributed File System, Google File System,
Scheduling Algorithm, 9 V's.
INTRODUCTION: -
Big data encompasses vast, complex datasets characterized by volume, variety, and velocity.
These datasets often exceed the processing capabilities of traditional relational databases, making
them challenging to capture and analyze. Analysts typically define big data as datasets ranging
from 30-50 terabytes to multiple petabytes, with a petabyte equating to 1,000 terabytes. The
increasing generation of unstructured data from sources like IoT devices and social media further
complicates processing, necessitating advanced technologies like Hadoop for efficient handling.
Consequently, big data requires specialized tools for effective analysis and visualization.
Literature Survey: -
The term Big Data refers to massive datasets that traditional RDBMS techniques struggle to
manage due to their size and complexity. To address this, MapReduce, a programming model
within the Hadoop framework, was developed to facilitate efficient data processing through
parallelization. This model breaks down large datasets into smaller chunks, which are processed
concurrently across multiple servers, significantly speeding up analysis and reducing costs
compared to traditional methods. MapReduce’s ability to handle unstructured data and its fault
tolerance make it a vital tool for modern data management.
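To make the model concrete, the sketch below shows the canonical word-count job written against Hadoop's Java MapReduce API. It is a minimal illustration rather than part of the surveyed work; the input and output paths are taken from command-line arguments and are assumptions for this example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each input split is processed in parallel; emit one (word, 1) pair per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive sorted and grouped; sum them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map task handles one input split in parallel, and the framework sorts and groups the intermediate pairs by key before passing them to the reducers, which is exactly the chunk-wise parallelization and aggregation behaviour described above.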
BIG DATA: -
Big Data refers to vast and complex datasets that traditional data-processing software cannot
manage effectively. It encompasses structured, semi-structured, and unstructured data from
various sources like social media, sensors, and transaction records. The key characteristics of Big
Data are often summarized as the 9 Vs: Volume, Velocity, Variety, Variability, Validity,
Vulnerability, Volatility, Visualization, and Value. These datasets are crucial for businesses to
derive insights, improve decision-making, and enhance customer experiences in today’s digital
landscape.
HADOOP: -
Hadoop is an open-source framework for processing large data sets across distributed clusters of
computers. It consists of two main components: Hadoop Distributed File System (HDFS) for
storage and MapReduce for processing. HDFS employs a master/slave architecture with a Name
Node managing metadata and multiple Data Nodes storing data blocks, ensuring fault tolerance
through replication. Designed to run on commodity hardware, Hadoop efficiently handles
petabytes of data, leveraging data locality to enhance processing speed. Developed by the
Apache Software Foundation, it was inspired by Google’s MapReduce and Google File System.
Hadoop Distributed File System (HDFS): -
Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant file system designed for
large data sets on commodity hardware. It operates on a master/slave architecture, with a single
Name Node managing metadata and multiple Data Nodes storing actual data blocks. HDFS
supports high throughput for large files, typically ranging from gigabytes to petabytes, and
achieves reliability through data replication, usually with a default factor of three. It is optimized
for batch processing and streaming access, making it suitable for various applications without
predefined schemas.
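As a minimal sketch of how a client program might interact with this architecture, the example below writes and reads a small file through Hadoop's Java FileSystem API. The NameNode address, port, and file path are illustrative assumptions; in a real deployment they come from the cluster's core-site.xml and hdfs-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster settings; normally supplied by the site configuration files.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
    conf.set("dfs.replication", "3");  // the default replication factor mentioned above

    FileSystem fs = FileSystem.get(conf);

    // Write: the Name Node records the metadata, Data Nodes store the replicated blocks.
    Path file = new Path("/tmp/hdfs-example.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back as a stream (the streaming access pattern HDFS is optimized for).
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```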
HDFS: -
HDFS (Hadoop Distributed File System) is a distributed file system designed for high-
throughput access to large data sets, primarily used in batch processing applications. It supports
various applications beyond MapReduce, including HBase, Apache Mahout, and Apache Hive.
HDFS is fault-tolerant, scalable, and efficient for data-intensive tasks like log analysis,
marketing analytics, and machine learning. Many large technology companies rely on HDFS for
data mining and analytics because it can handle large volumes of structured
and unstructured data across commodity hardware.
SCHEDULING IN HADOOP:-
In Hadoop, task scheduling is crucial for efficient resource allocation and execution speed.
Scheduling algorithms are categorized into dynamic and static types.
Dynamic scheduling uses real-time data to make decisions, exemplified by the Fair Scheduler
and Capacity Scheduler, which adapt to workload changes. In contrast, static scheduling, like the
FIFO Scheduler, follows a fixed order without considering job priority or resource availability.
Each approach has its benefits and drawbacks, impacting overall performance and resource
utilization in Hadoop clusters.
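As a small, hedged illustration (not drawn from the surveyed paper), the scheduler used by a YARN ResourceManager is selected through a single configuration property; the sketch below simply reads that property, assuming yarn-site.xml is available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShowConfiguredScheduler {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath.
    Configuration conf = new YarnConfiguration();

    // The ResourceManager instantiates the class named by this property; common values are
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler      (static, submission order)
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler      (dynamic, share-based)
    String scheduler = conf.get("yarn.resourcemanager.scheduler.class");
    System.out.println("Configured scheduler: " + scheduler);
  }
}
```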
FIFO Scheduling:-
The FIFO (First In First Out) scheduler is the default job scheduling method in Apache Hadoop,
prioritizing tasks based on their submission order. While it is straightforward, it can lead to
inefficiencies, particularly for smaller jobs that may be delayed by longer-running tasks, as it
allocates resources strictly according to arrival time. For improved performance, alternatives like
the Fair Scheduler and Capacity Scheduler are recommended, as they dynamically allocate
resources and prioritize jobs more effectively, reducing wait times and optimizing resource
utilization in heterogeneous environments.
Fair Scheduler: -
The Fair Scheduler in Hadoop is designed to ensure Quality of Service (QoS) by allocating
resources fairly among multiple jobs. It organizes jobs into pools, assigning guaranteed
minimum resources to each pool. This allows for equitable distribution of resources, ensuring
that all applications receive an average share over time. Unlike the default FIFO scheduler, it
enables short jobs to complete quickly without starving longer jobs, while also considering job
priorities through weights for resource allocation. The Fair Scheduler can dynamically adapt to
varying workloads, enhancing overall cluster efficiency and responsiveness.
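As a hedged sketch of the client-side view, the snippet below tags a job with a queue (pool) name at submission time; the queue name "analytics" is hypothetical, and the per-pool minimum shares and weights themselves are defined by the cluster administrator in the Fair Scheduler's allocation file rather than in job code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitToFairSchedulerQueue {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // With the Fair Scheduler enabled on the ResourceManager, a job is placed
    // into a named queue/pool; "analytics" is an example name, not a default.
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "fair-scheduled job");
    // Mapper, reducer, and input/output paths would be set as in the word-count sketch above.
    // job.waitForCompletion(true);
  }
}
```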
Conclusions: -
The paper effectively outlines the concept of Big Data, emphasizing processing challenges such
as scale, heterogeneity, and privacy. It highlights the significance of the MapReduce architecture in
managing large datasets, where outputs from mapping tasks are sorted and input into reducing
tasks. The discussion extends to Hadoop as an open-source solution for Big Data processing,
addressing the need for cost-effective strategies across various domains. Overall, it underscores
that overcoming technical challenges is essential for efficient Big Data operations and achieving
organizational goals in data management.