
ABSTRACT: -

Big Data refers to the vast volumes of structured and unstructured data generated at high velocity
from various sources. Its characteristics include scale, diversity, and complexity, necessitating
advanced architectures and analytics to extract valuable insights.
Hadoop is a prominent open-source framework designed for distributed storage and processing
of big data across clusters, enabling scalability from single servers to thousands of machines. It
utilizes MapReduce for efficient data processing and HDFS (Hadoop Distributed File System)
for data storage, making it essential for handling massive datasets.

KEYWORDS: -
Big Data, RDBMS, MapReduce, Hadoop Distributed File System, Google File System,
scheduling algorithms, 9 Vs.

INTRODUCTION: -
Big data encompasses vast, complex datasets characterized by volume, variety, and velocity.
These datasets often exceed traditional processing capabilities of relational databases, making
them challenging to record and analyze. Analysts typically define big data as datasets ranging
from 30-50 terabytes to multiple petabytes, with a petabyte equating to 1,000 terabytes. The
increasing generation of unstructured data from sources like IoT devices and social media further
complicates processing, necessitating advanced technologies like Hadoop for efficient handling.
Consequently, big data requires specialized tools for effective analysis and visualization.

Literature Survey: -
The term Big Data refers to massive datasets that traditional RDBMS techniques struggle to
manage due to their size and complexity. To address this, MapReduce, a programming model
within the Hadoop framework, was developed to facilitate efficient data processing through
parallelization. This model breaks down large datasets into smaller chunks, which are processed
concurrently across multiple servers, significantly speeding up analysis and reducing costs
compared to traditional methods. MapReduce’s ability to handle unstructured data and its fault
tolerance make it a vital tool for modern data management.
BIG DATA: -
Big Data refers to vast and complex datasets that traditional data-processing software cannot
manage effectively. It encompasses structured, semi-structured, and unstructured data from
various sources like social media, sensors, and transaction records. The key characteristics of Big
Data are often summarized as the 9 Vs: Volume, Velocity, Variety, Validity, Vulnerability,
Volatility, Visualization, Value, and Veracity. These datasets are crucial for businesses to
derive insights, improve decision-making, and enhance customer experiences in today’s digital
landscape.

Figure 1: Challenges in Big Data


Big Data presents significant challenges characterized by the “9 Vs”: Volume, Velocity,
Variety, Validity, Vulnerability, Volatility, Visualization, Value, and Veracity. Key issues
include:
• Data Storage: Managing vast amounts of data requires scalable solutions beyond traditional
databases.
• Data Security: Protecting sensitive information is increasingly complex due to the scale
and speed of data.
• Data Quality: Ensuring accuracy and consistency is crucial for reliable insights.
• Data Analysis: Analyzing unstructured data demands innovative tools and methods.
These challenges necessitate advanced technologies and strategic approaches for effective
management.
Big Data presents significant privacy and security concerns: -
Data Privacy: The vast amounts of personal data generated raise issues about individual privacy
rights. Consumers often struggle to balance the convenience of big data applications with their
desire for privacy, especially as regulations like GDPR impose stricter compliance requirements.
Data Security: Protecting sensitive information involves multiple layers, including
infrastructure, computing, and application layers. Organizations face risks from data breaches,
unauthorized access, and internal threats. Effective measures like encryption, access control, and
continuous monitoring are crucial to safeguard data integrity and confidentiality.
Figure 2: Layered Architecture of a Big Data System

9 Vs of Big Data: -


Volume: - The concept of “big data” is primarily characterized by its volume. In 2010, Thomson
Reuters reported that the world was inundated with over 800 exabytes of data, while EMC
estimated it closer to 900 exabytes, predicting a 50% annual growth. This exponential increase in
data generation highlights the challenges of storage and management, as the total amount of
information continues to expand significantly each year.
Velocity: - Velocity refers to the speed at which large volumes of data are captured and
analyzed. This is crucial during sensitive periods when organizations must quickly identify and
respond to false or misleading information. The continuous influx of vast data streams
necessitates efficient processing to maximize its value, ensuring that businesses can leverage
real-time insights effectively. As data becomes increasingly persistent and complex, maintaining
high velocity in processing is essential for operational success.
Variety: - Variety refers to the diverse forms of data, both structured and
unstructured. Traditionally, data was stored in structured formats like spreadsheets and
databases. Today, however, it includes unstructured types such as emails, photos, videos, and
recordings. This complexity complicates data storage, mining, and analysis, as unstructured data
lacks the organization that facilitates easy retrieval and processing. Consequently, specialized
tools and techniques are necessary to manage this vast array of information effectively.
Validity: - It refers to the trustworthiness and quality of data, particularly in the context of large
volumes, velocities, and varieties of data. It encompasses the accuracy, consistency, and
reliability of data sources, acknowledging that not all data is perfect—some may be dirty or
contain noise. Essentially, validity ensures that data can be trusted for decision-making by
confirming its provenance and integrity, thus enabling organizations to manage and utilize data
effectively.
Vulnerability: - Big data presents significant security challenges, including unauthorized access,
data breaches, and compliance with privacy regulations. Organizations must implement strong
security measures like encryption, access controls, and regular audits to protect sensitive
information. Advanced analytics can enhance security by identifying vulnerabilities and
managing risks effectively. By prioritizing robust security practices, businesses can build trust
and a positive brand reputation, setting themselves apart in a competitive landscape.
Volatility: - Volatility concerns how long data remains useful. Data is generally considered irrelevant or historic once it no longer serves a purpose
for decision-making or analysis. While there is no universal timeframe, organizations often adopt
guidelines based on industry standards and data types.
Common practices include:
• Retention periods: Many businesses keep data for 3-7 years for compliance and
operational needs.
• Data lifecycle management: Regular reviews help determine if older data should be
archived or deleted based on relevance and usage patterns.
Visualization: - Visualizing a billion data points requires innovative techniques beyond
traditional graphs due to the complexity and volume of big data. Effective methods include:
• Data Clustering: Groups similar data points for clearer insights.
• Tree Maps & Sunbursts: Hierarchical visualizations that display proportions.
• Parallel Coordinates: Useful for exploring multi-dimensional data relationships.
• Circular Network Diagrams: Show connections between variables.
• Cone Trees: Represent hierarchical structures in a 3D space.
These methods help manage the variety and velocity of big data, making meaningful
visualizations more achievable.
Value: - Big data holds immense potential for businesses by enabling them to store and analyze
vast amounts of information. This capability is crucial for identifying trends, improving decision-
making, and enhancing operational efficiency. Companies can leverage big data to understand
customer behavior, optimize marketing strategies, and create innovative products, ultimately
leading to increased revenue and competitive advantage. As organizations integrate big data
analytics into their operations, they can respond more swiftly to market changes and customer
needs, driving growth and profitability.
Veracity: - Veracity refers to the trustworthiness of data, which is crucial for managers relying on
descriptive data. However, managers must recognize that all data inherently contain
discrepancies due to various factors such as collection methods and biases. Qualitative research
emphasizes four criteria to ensure trustworthiness: credibility, transferability, dependability, and
confirmability. Each criterion helps address potential discrepancies, ensuring that while data may
be descriptive, it is also rigorously validated and reliable for decision-making.

RDBMS (Relational Database Management System)


A Relational Database Management System (RDBMS) organizes data into tables linked by
common fields, primarily using SQL for data manipulation. RDBMSs are favored for their ease
of use and ability to handle structured data, making them suitable for applications like financial
records and personnel data. Despite challenges from object-oriented and XML databases in the
1980s and 1990s, RDBMSs maintain a dominant market share due to their robust features and
widespread adoption in large-scale systems.

HADOOP: -
Hadoop is an open-source framework for processing large data sets across distributed clusters of
computers. It consists of two main components: Hadoop Distributed File System (HDFS) for
storage and MapReduce for processing. HDFS employs a master/slave architecture with a Name
Node managing metadata and multiple Data Nodes storing data blocks, ensuring fault tolerance
through replication. Designed to run on commodity hardware, Hadoop efficiently handles
petabytes of data, leveraging data locality to enhance processing speed. Developed by the
Apache Software Foundation, it was inspired by Google’s MapReduce and Google File System.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a scalable and fault-tolerant file system designed for
large data sets on commodity hardware. It operates on a master/slave architecture, with a single
Name Node managing metadata and multiple Data Nodes storing actual data blocks. HDFS
supports high throughput for large files, typically ranging from gigabytes to petabytes, and
achieves reliability through data replication, usually with a default factor of three. It is optimized
for batch processing and streaming access, making it suitable for various applications without
predefined schemas.
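To make this concrete, the following is a minimal sketch that uses Hadoop's Java FileSystem API to write a small file into HDFS, request the usual replication factor of three, and read it back. The NameNode address, file path, and contents are illustrative assumptions, not details taken from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; in a real cluster this usually comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt"); // hypothetical path

    // Write a small file; HDFS transparently splits large files into blocks.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Ask for the default replication factor of three for fault tolerance.
    fs.setReplication(file, (short) 3);

    // Read the file back through the same FileSystem abstraction.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}
```

Note that the client contacts the Name Node only for metadata; the file bytes themselves stream directly to and from the Data Nodes that hold the blocks.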

HDFS: -
HDFS (Hadoop Distributed File System) is a distributed file system designed for high-
throughput access to large data sets, primarily used in batch processing applications. It supports
various applications beyond MapReduce, including HBase, Apache Mahout, and Apache Hive.
HDFS is fault-tolerant, scalable, and efficient for data-intensive tasks like log analysis,
marketing analytics, and machine learning. Major companies like Amazon and Yahoo! utilize
HDFS for data mining and analytics due to its capability to handle large volumes of structured
and unstructured data across commodity hardware.

MapReduce Architecture: -


Hadoop’s MapReduce framework efficiently processes large datasets in parallel across clusters
of commodity hardware. It operates in two main phases: Map, where input data is split into
smaller chunks and transformed into intermediate key-value pairs, and Reduce, which aggregates
these pairs into a final output. This design allows for fault tolerance and scalability, making it the
core of Hadoop’s data processing capabilities. By executing tasks on nodes where the data
resides, MapReduce minimizes network traffic and enhances performance.
MapReduce is a programming model for processing large datasets in parallel across a distributed
cluster.
Key Terminologies
• Map: A function that processes input data and produces key-value pairs as intermediate
output.
• Reduce: A function that takes intermediate key-value pairs and aggregates them into a
final output.
• Job: The complete MapReduce program, consisting of one or more tasks.
• Task: A single unit of work, which can be either a map or reduce operation.
• Task Attempt: An instance of a task execution; multiple attempts may occur if a task
fails.
The MapReduce framework handles failures by rescheduling tasks on healthy nodes and
allowing configurable retry limits.
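To ground these terms, the listing below is a minimal word-count job written against Hadoop's Java MapReduce API, closely following the canonical example from the Apache Hadoop documentation; the class name and the command-line input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job: wires the map and reduce tasks together and submits them to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A run such as hadoop jar wordcount.jar WordCount /input /output lets the framework handle input splitting, the shuffle and sort between the Map and Reduce phases, and the task attempts described above.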

SCHEDULING IN HADOOP:-
In Hadoop, task scheduling is crucial for efficient resource allocation and execution speed.
Scheduling algorithms are categorized into dynamic and static types.
Dynamic scheduling uses real-time data to make decisions, exemplified by the Fair Scheduler
and Capacity Scheduler, which adapt to workload changes. In contrast, static scheduling, like the
FIFO Scheduler, follows a fixed order without considering job priority or resource availability.
Each approach has its benefits and drawbacks, impacting overall performance and resource
utilization in Hadoop clusters.
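Which of these schedulers the ResourceManager uses is itself a configuration choice. As a small illustration, the yarn-site.xml fragment below selects the Fair Scheduler; the property name and class are standard Hadoop YARN configuration, and choosing the Fair Scheduler here is only an example.

```xml
<!-- yarn-site.xml: tell the ResourceManager to use the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```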

FIFO Scheduling:-
The FIFO (First In First Out) scheduler is the default job scheduling method in Apache Hadoop,
prioritizing tasks based on their submission order. While it is straightforward, it can lead to
inefficiencies, particularly for smaller jobs that may be delayed by longer-running tasks, as it
allocates resources strictly according to arrival time. For improved performance, alternatives like
the Fair Scheduler and Capacity Scheduler are recommended, as they dynamically allocate
resources and prioritize jobs more effectively, reducing wait times and optimizing resource
utilization in heterogeneous environments.

Fair scheduler: -
The Fair Scheduler in Hadoop is designed to ensure Quality of Service (QoS) by allocating
resources fairly among multiple jobs. It organizes jobs into pools, assigning guaranteed
minimum resources to each pool. This allows for equitable distribution of resources, ensuring
that all applications receive an average share over time. Unlike the default FIFO scheduler, it
enables short jobs to complete quickly without starving longer jobs, while also considering job
priorities through weights for resource allocation. The Fair Scheduler can dynamically adapt to
varying workloads, enhancing overall cluster efficiency and responsiveness.
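As a sketch of how pools, guaranteed minimums, and weights are expressed, the allocation file below defines two queues (called pools in older Hadoop releases); the queue names and resource figures are invented for illustration.

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: hypothetical allocation file with two queues -->
<allocations>
  <queue name="production">
    <!-- guaranteed minimum share for this queue -->
    <minResources>20000 mb,10 vcores</minResources>
    <!-- a higher weight receives a larger share of spare capacity -->
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <minResources>5000 mb,2 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
```

The Fair Scheduler reloads this file periodically, so queue definitions can be adjusted without restarting the cluster.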

Conclusions: -
The paper effectively outlines the concept of Big Data, emphasizing processing challenges such
as scale, diversity, and privacy. It highlights the significance of the MapReduce architecture in
managing large datasets, where outputs from mapping tasks are sorted and input into reducing
tasks. The discussion extends to Hadoop as an open-source solution for Big Data processing,
addressing the need for cost-effective strategies across various domains. Overall, it underscores
that overcoming technical challenges is essential for efficient Big Data operations and achieving
organizational goals in data management.

