
BIG DATA

ABSTRACT

A huge repository of terabytes of data is generated each day from
modern information systems and digital technologies such as the
Internet of Things and cloud computing. Analyzing these massive data
requires substantial effort at multiple levels to extract knowledge
for decision making, which makes big data analysis an active area of
research and development. The basic objective of this paper is to
explore the challenges of big data and the various tools associated
with it. To that end, this article provides a platform to explore big
data at its numerous stages.
BIG DATA

Big data is a collection of data from
many different sources. The term refers
to data that is so large, fast, or complex
that it is difficult or impossible to
process using traditional methods. The
practice of accessing and storing large
amounts of information for analytics,
however, has been around for a long time.
TYPES OF BIG DATA

1. Structured Data
2. Unstructured Data
3. Semi-Structured Data
STRUCTURED DATA

Any data that can be stored, accessed, and
processed in a fixed format is termed
'structured' data. Over time, computer
science has achieved great success in
developing techniques for working with
this kind of data (where the format is
known in advance) and in deriving value
from it. Nowadays, however, we are seeing
issues as the size of such data grows to a
huge extent, with typical sizes in the
range of multiple zettabytes.
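
For illustration, a small hypothetical relational table (the names and
figures are made up): every record follows the same fixed schema of
columns and types, which is what makes the data structured.

EmployeeID | Name | Department | Salary
-----------+------+------------+-------
      1001 | Asha | Sales      |  52000
      1002 | Ravi | Finance    |  61000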
UNSTRUCTURED DATA

Any data whose form or structure is unknown
is classified as unstructured data. In addition
to its sheer size, unstructured data poses
multiple challenges when it comes to
processing it to derive value. A typical
example of unstructured data is a
heterogeneous data source containing a
combination of simple text files, images,
videos, etc. Nowadays organizations have a
wealth of data available to them but,
unfortunately, they often don't know how to
derive value from it, since the data is in its
raw, unstructured form.
SEMI-STRUCTURED DATA

Semi-structured data can contain
both of the other forms of data.
We can see semi-structured data
as structured in form, even though
its schema is not actually defined.
A typical example of semi-structured
data is data represented in an
XML file, as illustrated below.
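
For illustration, a minimal hypothetical XML fragment (tag and field
names are made up): each record describes itself through its tags, yet
no fixed schema forces every record to carry the same fields.

<employees>
  <employee id="1001">
    <name>Asha</name>
    <email>asha@example.com</email>
  </employee>
  <employee id="1002">
    <name>Ravi</name>
    <!-- this record has no email field -->
  </employee>
</employees>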
CHARACTERISTICS OF BIG DATA

• VOLUME
• VARIETY
• VELOCITY
• VERACITY
• VALUE

• Volume: the size and amount of big data that
companies manage and analyze.
• Variety: the diversity and range of different data
types, including unstructured data, semi-structured
data, and raw data.
• Velocity: the speed at which companies receive,
store, and manage data, e.g., the number of
social media posts or search queries received within
a day, an hour, or another unit of time.
• Veracity: the "truth" or accuracy of data and
information assets, which often determines
executive-level confidence.
• Value: the most important "V" from the
perspective of the business; the value of big
data usually comes from insight discovery
and pattern recognition that lead to more
effective operations, stronger customer
relationships, and other clear and
quantifiable business benefits.
TOOLS FOR BIG DATA

• Apache Hadoop and MapReduce
• Apache Mahout
• Apache Spark
• Dryad
• Storm
• Apache Drill
• Jaspersoft
• Splunk
APACHE HADOOP

Apache Hadoop is a collection of open-source
software utilities that facilitates using a network of
many computers to solve problems involving
massive amounts of data and computation. It
provides a software framework for distributed storage
and processing of big data using the MapReduce
programming model.
Hadoop is used to efficiently store and process
large datasets ranging in size from gigabytes to
petabytes. Instead of using one large computer to
store and process the data, Hadoop allows
clustering multiple computers to analyze massive
datasets in parallel, and therefore more quickly.
Hadoop consists of four main modules:
• Hadoop Distributed File System (HDFS): a
distributed file system that runs on standard or
low-end hardware. HDFS provides better data
throughput than traditional file systems, in
addition to high fault tolerance and native
support for large datasets.
• Yet Another Resource Negotiator (YARN):
manages and monitors cluster nodes and
resource usage, and schedules jobs and tasks.
• MapReduce: a framework that helps
programs perform parallel computation on data.
The map task takes input data and converts it
into a dataset of key-value pairs; the output of
the map tasks is consumed by reduce tasks,
which aggregate it and produce the desired
result (see the sketch after this list).
• Hadoop Common: provides common Java
libraries that can be used across all modules.
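
To make the map and reduce steps concrete, here is a minimal
word-count sketch against the standard Hadoop MapReduce Java API, the
customary introductory example: the mapper emits (word, 1) pairs and
the reducer sums the counts per word. The class name and the
input/output paths (taken from the command line) are illustrative; in
practice the class is packaged as a JAR and submitted to the cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}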
THANK YOU!
ANY QUERIES?
