
ABSTRACT: -

Big Data refers to the vast volumes of structured and unstructured data generated at high velocity
from various sources. Its characteristics include scale, diversity, and complexity, necessitating
advanced architectures and analytics to extract valuable insights.
Hadoop is a prominent open-source framework designed for distributed storage and processing
of big data across clusters, enabling scalability from single servers to thousands of machines. It
utilizes MapReduce for efficient data processing and HDFS (Hadoop Distributed File System)
for data storage, making it essential for handling massive datasets.

KEYWORDS: -
Big Data, RDBMS, MapReduce, Hadoop Distributed File System, Google File System,
scheduling algorithms, 9 Vs.

INTRODUCTION: -
Big data encompasses vast, complex datasets characterized by volume, variety, and velocity.
These datasets often exceed traditional processing capabilities of relational databases, making
them challenging to record and analyze. Analysts typically define big data as datasets ranging
from 30-50 terabytes to multiple petabytes, with a petabyte equating to 1,000 terabytes. The
increasing generation of unstructured data from sources like IoT devices and social media further
complicates processing, necessitating advanced technologies like Hadoop for efficient handling.
Consequently, big data requires specialized tools for effective analysis and visualization.

Literature Survey: -
The term Big Data refers to massive datasets that traditional RDBMS techniques struggle to
manage due to their size and complexity. To address this, MapReduce, a programming model
within the Hadoop framework, was developed to facilitate efficient data processing through
parallelization. This model breaks down large datasets into smaller chunks, which are processed
concurrently across multiple servers, significantly speeding up analysis and reducing costs
compared to traditional methods. MapReduce’s ability to handle unstructured data and its fault
tolerance make it a vital tool for modern data management.
BIG DATA: -
Big Data refers to vast and complex datasets that traditional data-processing software cannot
manage effectively. It encompasses structured, semi-structured, and unstructured data from
various sources like social media, sensors, and transaction records. The key characteristics of Big
Data are often summarized as the 9 Vs: Volume, Velocity, Variety, Validity, Vulnerability,
Volatility, Visualization, Value, and Veracity. These datasets are crucial for businesses to
derive insights, improve decision-making, and enhance customer experiences in today’s digital
landscape.

Figure 1: Challenges in Big Data


Big Data presents significant challenges characterized by the “9 Vs”: Volume, Velocity,
Variety, Validity, Vulnerability, Volatility, Visualization, Value, and Veracity. Key issues
include:
• Data Storage: Managing vast amounts of data requires scalable solutions beyond traditional
databases.
• Data Security: Protecting sensitive information is increasingly complex due to the scale
and speed of data.
• Data Quality: Ensuring accuracy and consistency is crucial for reliable insights.
• Data Analysis: Analyzing unstructured data demands innovative tools and methods.
These challenges necessitate advanced technologies and strategic approaches for effective
management.
Big Data presents significant privacy and security concerns: -
Data Privacy: The vast amounts of personal data generated raise issues about individual privacy
rights. Consumers often struggle to balance the convenience of big data applications with their
desire for privacy, especially as regulations like GDPR impose stricter compliance requirements.
Data Security: Protecting sensitive information involves multiple layers, including
infrastructure, computing, and application layers. Organizations face risks from data breaches,
unauthorized access, and internal threats. Effective measures like encryption, access control, and
continuous monitoring are crucial to safeguard data integrity and confidentiality.
Figure 2: Layered Architecture of a Big Data System

9 Vs of Big Data: -


Volume: - The concept of “big data” is primarily characterized by its volume. In 2010, Thomson
Reuters reported that the world was inundated with over 800 exabytes of data, while EMC
estimated it closer to 900 exabytes, predicting a 50% annual growth. This exponential increase in
data generation highlights the challenges of storage and management, as the total amount of
information continues to expand significantly each year.
Velocity: - Velocity refers to the speed at which large volumes of data are captured and
analyzed. This is crucial during sensitive periods when organizations must quickly identify and
respond to false or misleading information. The continuous influx of vast data streams
necessitates efficient processing to maximize its value, ensuring that businesses can leverage
real-time insights effectively. As data becomes increasingly persistent and complex, maintaining
high velocity in processing is essential for operational success.
Variety: - Variety refers to the diverse forms of data, both structured and
unstructured. Traditionally, data was stored in structured formats like spreadsheets and
databases. Today, however, it includes unstructured types such as emails, photos, videos, and
recordings. This complexity complicates data storage, mining, and analysis, as unstructured data
lacks the organization that facilitates easy retrieval and processing. Consequently, specialized
tools and techniques are necessary to manage this vast array of information effectively.
Validity: - It refers to the trustworthiness and quality of data, particularly in the context of large
volumes, velocities, and varieties of data. It encompasses the accuracy, consistency, and
reliability of data sources, acknowledging that not all data is perfect—some may be dirty or
contain noise. Essentially, validity ensures that data can be trusted for decision-making by
confirming its provenance and integrity, thus enabling organizations to manage and utilize data
effectively.
Vulnerability: - Big data presents significant security challenges, including unauthorized access,
data breaches, and compliance with privacy regulations. Organizations must implement strong
security measures like encryption, access controls, and regular audits to protect sensitive
information. Advanced analytics can enhance security by identifying vulnerabilities and
managing risks effectively. By prioritizing robust security practices, businesses can build trust
and a positive brand reputation, setting themselves apart in a competitive landscape.
Volatility: - Volatility concerns how long data remains useful. Data is generally considered irrelevant or historic once it no longer serves a purpose
for decision-making or analysis. While there is no universal timeframe, organizations often adopt
guidelines based on industry standards and data types.
Common practices include:
• Retention periods: Many businesses keep data for 3-7 years for compliance and
operational needs.
• Data lifecycle management: Regular reviews help determine if older data should be
archived or deleted based on relevance and usage patterns.
Visualization: - Visualizing a billion data points requires innovative techniques beyond
traditional graphs due to the complexity and volume of big data. Effective methods include:
• Data Clustering: Groups similar data points for clearer insights.
• Tree Maps & Sunbursts: Hierarchical visualizations that display proportions.
• Parallel Coordinates: Useful for exploring multi-dimensional data relationships.
• Circular Network Diagrams: Show connections between variables.
• Cone Trees: Represent hierarchical structures in a 3D space.
These methods help manage the variety and velocity of big data, making meaningful
visualizations more achievable.
Value: - Big data holds immense potential for businesses by enabling them to store and analyze
vast amounts of information. This capability is crucial for identifying trends, improving decision-
making, and enhancing operational efficiency. Companies can leverage big data to understand
customer behavior, optimize marketing strategies, and create innovative products, ultimately
leading to increased revenue and competitive advantage. As organizations integrate big data
analytics into their operations, they can respond more swiftly to market changes and customer
needs, driving growth and profitability.
Veracity: - Veracity refers to the trustworthiness of data, which is crucial for managers relying on
descriptive data. However, managers must recognize that all data inherently contain
discrepancies due to various factors such as collection methods and biases. Qualitative research
emphasizes four criteria to ensure trustworthiness: credibility, transferability, dependability, and
confirmability. Each criterion helps address potential discrepancies, ensuring that while data may
be descriptive, it is also rigorously validated and reliable for decision-making.

RDBMS (Relational Database Management System)


A Relational Database Management System (RDBMS) organizes data into tables linked by
common fields, primarily using SQL for data manipulation. RDBMSs are favored for their ease
of use and ability to handle structured data, making them suitable for applications like financial
records and personnel data. Despite challenges from object-oriented and XML databases in the
1980s and 1990s, RDBMSs maintain a dominant market share due to their robust features and
widespread adoption in large-scale systems.

HADOOP: -
Hadoop is an open-source framework for processing large data sets across distributed clusters of
computers. It consists of two main components: Hadoop Distributed File System (HDFS) for
storage and MapReduce for processing. HDFS employs a master/slave architecture with a Name
Node managing metadata and multiple Data Nodes storing data blocks, ensuring fault tolerance
through replication. Designed to run on commodity hardware, Hadoop efficiently handles
petabytes of data, leveraging data locality to enhance processing speed. Developed by the
Apache Software Foundation, it was inspired by Google’s MapReduce and Google File System.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant file system designed for
large data sets on commodity hardware. It operates on a master/slave architecture, with a single
Name Node managing metadata and multiple Data Nodes storing actual data blocks. HDFS
supports high throughput for large files, typically ranging from gigabytes to petabytes, and
achieves reliability through data replication, usually with a default factor of three. It is optimized
for batch processing and streaming access, making it suitable for various applications without
predefined schemas.
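To make this concrete, the following is a minimal sketch that uses Hadoop's Java FileSystem API to write a small file into HDFS, request the usual replication factor of three, and read it back. The NameNode address, file path, and contents are illustrative assumptions, not details taken from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; in a real cluster this usually comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt"); // hypothetical path

    // Write a small file; HDFS transparently splits large files into blocks.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Ask for the default replication factor of three for fault tolerance.
    fs.setReplication(file, (short) 3);

    // Read the file back through the same FileSystem abstraction.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}
```

Note that the client contacts the Name Node only for metadata; the file bytes themselves stream directly to and from the Data Nodes that hold the blocks.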

HDFS: -
HDFS (Hadoop Distributed File System) is a distributed file system designed for high-
throughput access to large data sets, primarily used in batch processing applications. It supports
various applications beyond MapReduce, including HBase, Apache Mahout, and Apache Hive.
HDFS is fault-tolerant, scalable, and efficient for data-intensive tasks like log analysis,
marketing analytics, and machine learning. Major companies like Amazon and Yahoo! utilize
HDFS for data mining and analytics due to its capability to handle large volumes of structured
and unstructured data across commodity hardware.

MapReduce Architecture: -


Hadoop’s MapReduce framework efficiently processes large datasets in parallel across clusters
of commodity hardware. It operates in two main phases: Map, where input data is split into
smaller chunks and transformed into intermediate key-value pairs, and Reduce, which aggregates
these pairs into a final output. This design allows for fault tolerance and scalability, making it the
core of Hadoop’s data processing capabilities. By executing tasks on nodes where the data
resides, MapReduce minimizes network traffic and enhances performance.
MapReduce is a programming model for processing large datasets in parallel across a distributed
cluster.
Key Terminologies
• Map: A function that processes input data and produces key-value pairs as intermediate
output.
• Reduce: A function that takes intermediate key-value pairs and aggregates them into a
final output.
• Job: The complete MapReduce program, consisting of one or more tasks.
• Task: A single unit of work, which can be either a map or reduce operation.
• Task Attempt: An instance of a task execution; multiple attempts may occur if a task
fails.
The MapReduce framework handles failures by rescheduling tasks on healthy nodes and
allowing configurable retry limits.
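To ground these terms, the listing below is a minimal word-count job written against Hadoop's Java MapReduce API, closely following the canonical example from the Apache Hadoop documentation; the class name and the command-line input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job: wires the map and reduce tasks together and submits them to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A run such as hadoop jar wordcount.jar WordCount /input /output lets the framework handle input splitting, the shuffle and sort between the Map and Reduce phases, and the task attempts described above.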

SCHEDULING IN HADOOP:-
In Hadoop, task scheduling is crucial for efficient resource allocation and execution speed.
Scheduling algorithms are categorized into dynamic and static types.
Dynamic scheduling uses real-time data to make decisions, exemplified by the Fair Scheduler
and Capacity Scheduler, which adapt to workload changes. In contrast, static scheduling, like the
FIFO Scheduler, follows a fixed order without considering job priority or resource availability.
Each approach has its benefits and drawbacks, impacting overall performance and resource
utilization in Hadoop clusters.
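Which of these schedulers the ResourceManager uses is itself a configuration choice. As a small illustration, the yarn-site.xml fragment below selects the Fair Scheduler; the property name and class are standard Hadoop YARN configuration, and choosing the Fair Scheduler here is only an example.

```xml
<!-- yarn-site.xml: tell the ResourceManager to use the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```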

FIFO Scheduling:-
The FIFO (First In First Out) scheduler is the default job scheduling method in Apache Hadoop,
prioritizing tasks based on their submission order. While it is straightforward, it can lead to
inefficiencies, particularly for smaller jobs that may be delayed by longer-running tasks, as it
allocates resources strictly according to arrival time. For improved performance, alternatives like
the Fair Scheduler and Capacity Scheduler are recommended, as they dynamically allocate
resources and prioritize jobs more effectively, reducing wait times and optimizing resource
utilization in heterogeneous environments.

Fair scheduler: -
The Fair Scheduler in Hadoop is designed to ensure Quality of Service (QoS) by allocating
resources fairly among multiple jobs. It organizes jobs into pools, assigning guaranteed
minimum resources to each pool. This allows for equitable distribution of resources, ensuring
that all applications receive an average share over time. Unlike the default FIFO scheduler, it
enables short jobs to complete quickly without starving longer jobs, while also considering job
priorities through weights for resource allocation. The Fair Scheduler can dynamically adapt to
varying workloads, enhancing overall cluster efficiency and responsiveness.
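As a sketch of how pools, guaranteed minimums, and weights are expressed, the allocation file below defines two queues (called pools in older Hadoop releases); the queue names and resource figures are invented for illustration.

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: hypothetical allocation file with two queues -->
<allocations>
  <queue name="production">
    <!-- guaranteed minimum share for this queue -->
    <minResources>20000 mb,10 vcores</minResources>
    <!-- a higher weight receives a larger share of spare capacity -->
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <minResources>5000 mb,2 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
```

The Fair Scheduler reloads this file periodically, so queue definitions can be adjusted without restarting the cluster.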

Conclusions: -
The paper effectively outlines the concept of Big Data, emphasizing processing challenges such
as scale, diversity, and privacy. It highlights the significance of the MapReduce architecture in
managing large datasets, where outputs from mapping tasks are sorted and input into reducing
tasks. The discussion extends to Hadoop as an open-source solution for Big Data processing,
addressing the need for cost-effective strategies across various domains. Overall, it underscores
that overcoming technical challenges is essential for efficient Big Data operations and achieving
organizational goals in data management.

