Analysis and Processing of Massive Data Based on Hadoop Platform; A Perusal of Big Data Classification and Hadoop Technology

The document discusses big data classification and Hadoop technology. It covers topics like Hadoop components, limitations of Hadoop, and using Naive Bayes classifier for sentiment analysis on big data. It also discusses improving performance for storing massive small files in Hadoop.


1. Analysis and Processing of Massive Data Based on Hadoop Platform

Based on an analysis of the key technical foundations and existing research on distributed storage and computation combined with Hadoop cluster technology, as well as business needs and the actual hardware and software capabilities, the paper proposes a large-scale data analysis system.

It is urgent for enterprises to change their traditional architecture and to work out how to analyze these data and make full use of their value in the face of massive data. In massive data processing, mining potential value and transformation capability from mass data efficiently and quickly provides a basis for decision making and will become the core competitiveness of enterprises. But as data are generated ever faster and in ever larger volumes, data processing technology faces more and more challenges.

Taking Hadoop as the cloud research platform, the paper builds a preprocessing model for large volumes of Web log data and studies the massive-data processing performance of the Apriori algorithm based on distributed data mining, effectively improving the cloud platform and contributing to the development of big data processing technology.

Massive data generally refers to large amounts of unstructured and semi-structured data, for which loading into a relational database for analysis would cost too much time and money. Massive data analysis and cloud computing are often linked together, because real-time analysis of large data sets requires a framework such as MapReduce to distribute the work as tens, hundreds or even thousands of jobs across machines.

Technologies suitable for mass data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases and cloud computing platforms (the work appeared at the 2018 4th World Conference on Control, Electronics and Computer Engineering, WCCECE 2018).

2. A Perusal of Big Data Classification and Hadoop Technology

Big Data describes data so huge in size that it is arduous to analyze in conventional ways.

Infrastructure: big data storage is concerned with storing and managing data in a scalable way. Distributed File Systems: the Hadoop Distributed File System (HDFS) offers the ability to store huge amounts of unstructured data reliably on commodity hardware. NoSQL Databases: the most important family of big data storage technologies is NoSQL database management systems, along with NewSQL.

Analytics: big data analysis is about making "sense" of huge volumes of multifarious data that in raw form lack a data model to define what every element means in the context of the others. Summary Visualization: a recent class of visualization techniques processes such huge-scale data before rendering it to visualization routines.

Infrastructure is the foundation of Big Data architecture, and possessing the proper tools for storing, processing and analyzing your data is vital in any Big Data project. Hadoop is an open-source framework for processing, storing and analyzing data; related technologies include MapReduce, YARN, Spark, NoSQL databases, massively parallel processing and the cloud.

Security and Privacy in Big Data: as the enormous amount of data being collected continues to grow rapidly, more and more companies are building big data repositories to gather, aggregate and extract meaning from their data. Anonymization is the procedure of altering and masking personal data in such a way that individuals cannot be re-identified.

Hadoop accomplishes two tasks: massive data storage and distributed processing. It is a low-cost alternative to conventional data storage options, using commodity hardware to reliably store huge quantities of data, and application processing is protected against hardware failure.

Data Cleaning in Hadoop: DataCleaner helps with data quality, data ingestion, standardization and monitoring, and can leverage the computing power of your Hadoop cluster to overcome infrastructure and performance hurdles.

The Components of Hadoop: Hadoop Distributed File System (HDFS), Hadoop YARN, Hadoop MapReduce, Pig, Hive, HBase, Cassandra, HCatalog, Lucene, Hama, Crunch, Avro, Thrift, Drill, Mahout, Ambari, ZooKeeper, Oozie, Sqoop, Flume and Chukwa.

Limitations of Hadoop: Hadoop is an impressive platform for processing massive volumes of data with remarkable speed on low-cost commodity hardware, but it does have some notable limitations.
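Both papers lean on the MapReduce processing model listed among the Hadoop components above. As a purely illustrative sketch, the following Hadoop Streaming-style mapper and reducer count item occurrences in preprocessed Web log records, the kind of counting that forms the first pass of the Apriori mining mentioned in Section 1. The script name, the input field layout and the submission command are assumptions, not details taken from either paper.

```python
#!/usr/bin/env python3
# itemcount.py -- a minimal Hadoop Streaming-style mapper/reducer pair (a sketch,
# not code from either paper). Assumes each input line is a preprocessed Web log
# record whose first tab-separated field is the requested page or item.
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit (item, 1) for every log record.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1

def reducer(pairs):
    # Reduce phase: Hadoop delivers mapper output sorted by key, so all counts
    # for one item are adjacent and can be summed in a single pass.
    for item, group in groupby(pairs, key=lambda kv: kv[0]):
        yield item, sum(count for _, count in group)

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    if role == "map":
        for key, value in mapper(sys.stdin):
            print(f"{key}\t{value}")
    else:
        key_value_lines = (line.rstrip("\n").split("\t", 1)
                           for line in sys.stdin if "\t" in line)
        for key, total in reducer((k, int(v)) for k, v in key_value_lines):
            print(f"{key}\t{total}")
```

Submitted through Hadoop Streaming this would look roughly like `hadoop jar hadoop-streaming.jar -files itemcount.py -input weblogs -output counts -mapper "itemcount.py map" -reducer "itemcount.py reduce"`, where the jar location and HDFS paths are illustrative.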
3. A Method to Improve the Performance for Storing Massive Small Files in Hadoop

Hadoop is a popular distributed framework which mainly consists of a high-performance distributed computing platform; HDFS was originally designed to store oversized files. The proposed method mainly consists of three processes: merging small files, establishing a mapping index, and prefetching.

When small files need to be uploaded to HDFS, the client first creates a temporary file in memory and merges the small files into it. When the system accesses a small file, the client first queries HBase by the small file name and then locates the file within the corresponding merged file. The merging algorithm runs when small files are uploaded: it computes the current size of the temporary merge file, and if that size exceeds 64 MB, or the temporary merge file has been held in memory longer than a threshold T1, the merged file is uploaded to HDFS and its index is written to HBase. The algorithm is executed periodically (a sketch of this merge policy is given after Section 4).

The experiments take 100,000 small files as the data set; the files range from 1 KB to 50 KB in size and their formats are txt, doc and jpg. Groups of 20,000, 40,000, 60,000, 80,000 and 100,000 files are randomly selected from the file set, and each of the five groups is uploaded with both the improved scheme and the original HDFS. Uploading speed, NameNode memory usage and reading speed are compared. For the reading test, each group randomly generates 1,000 access logs, reads the small files named in those logs, and records the time taken; each group of the experiment is run three times.

4. Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier

Sentiment classification is useful for the business-to-consumer industry and for online recommendation systems. The important steps of the workflow are: 1) instruct the data parser about the format of the input data and the desired output; 2) transmit the source code to the name node and execute it; 3) trigger the result collector to collect the computing results once they are available on the Hadoop Distributed File System (HDFS). A review's class is then decided by the frequency of each word that appears in the model obtained from the training data set.

1) Pre-processing the raw data set: the data parser first pre-processes all reviews into a common format. After processing, each review is one line in the data set, prefixed with a document ID and a sentiment label (positive or negative). All pre-processed reviews are stored on the name node as a repository. The WFC and the data parser work together to prepare the input data sets for all test trials.

2) Sentiment classification using Hadoop: the sentiment classification is the key step in the workflow. Once the training data and the test data are ready in HDFS, the WFC starts the training job to build a model. The combining job then combines the test data with the model, resulting in an intermediate table, and the classify job assigns each review to one of the two document classes. This automatic scheduling method can easily be applied to other programs with minor changes to the parameters. The three jobs are the training job, the combining job and the classify job.

A virtual Hadoop cluster is a fast and easy way to test a Hadoop program in the cloud, although the performance might be weaker than on a physical cluster. To test the scalability of the Naive Bayes classifier, the size of the data set in the experiment varies from one thousand to one million reviews per class. The reported statistics include the classification accuracy, the computation time and the throughput of the system.

CONCLUSION: The authors believe that this work is just a beginning of employing machine learning technologies on large-scale data sets. Future work will include using the framework for information fusion over imagery and text, distributed robotics applications, and cyber analysis using cloud computing.
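As promised in Section 3, here is a minimal sketch of the small-file merge policy. It is illustrative only: the 64 MB threshold and the age threshold T1 come from the summary above, but the class, function and field names are hypothetical, and the HDFS upload and HBase index writes are reduced to injected placeholder callbacks.

```python
import time

MERGE_THRESHOLD_BYTES = 64 * 1024 * 1024   # flush once the merge buffer exceeds 64 MB
MAX_BUFFER_AGE_SECONDS = 300               # stand-in for the paper's T1 threshold (assumed value)

class SmallFileMerger:
    """Sketch of the write-side merge-and-index policy: buffer small files in
    memory, flush the merged block to HDFS, record per-file offsets in HBase."""

    def __init__(self, upload_to_hdfs, put_index_to_hbase):
        # upload_to_hdfs(merged_bytes) -> merged_file_name
        # put_index_to_hbase(small_file_name, merged_file_name, offset, length)
        self.upload_to_hdfs = upload_to_hdfs
        self.put_index_to_hbase = put_index_to_hbase
        self._reset()

    def _reset(self):
        self.buffer = bytearray()
        self.entries = []          # (small_file_name, offset, length)
        self.created_at = time.monotonic()

    def add(self, name, data):
        # Append the small file to the in-memory temporary merge file.
        self.entries.append((name, len(self.buffer), len(data)))
        self.buffer.extend(data)
        if len(self.buffer) >= MERGE_THRESHOLD_BYTES:
            self.flush()

    def tick(self):
        # Called periodically: flush if the buffer has been in memory longer than T1.
        if self.entries and time.monotonic() - self.created_at >= MAX_BUFFER_AGE_SECONDS:
            self.flush()

    def flush(self):
        # Upload the merged block and record each small file's location in the index.
        if not self.entries:
            return
        merged_name = self.upload_to_hdfs(bytes(self.buffer))
        for name, offset, length in self.entries:
            self.put_index_to_hbase(name, merged_name, offset, length)
        self._reset()

# Usage sketch with stub callbacks:
# merger = SmallFileMerger(lambda blob: "merged_0001.bin",
#                          lambda name, merged, off, length: None)
# merger.add("log_000001.txt", b"...")
```

In the paper's actual design the index lookup and prefetching also happen on the read path; the sketch covers only the write-side merge policy.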
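For the sentiment-classification workflow in Section 4, the underlying model is a word-frequency Naive Bayes classifier. The sketch below shows the core training and scoring arithmetic on pre-processed reviews (label plus text); it is a single-machine illustration of the idea with made-up example data, not the paper's distributed training, combining and classify jobs.

```python
import math
from collections import Counter, defaultdict

def train(reviews):
    """reviews: iterable of (label, text), label in {"positive", "negative"}.
    Returns per-class word counts, class priors and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in reviews:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    total_docs = sum(doc_counts.values())
    priors = {c: doc_counts[c] / total_docs for c in doc_counts}
    vocab = set()
    for c in word_counts:
        vocab.update(word_counts[c])
    return word_counts, priors, vocab

def classify(text, word_counts, priors, vocab):
    """Pick the class with the highest log-probability, using Laplace smoothing."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for word in text.lower().split():
            score += math.log((word_counts[c][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny usage example with made-up reviews:
model = train([("positive", "great movie loved it"),
               ("negative", "terrible plot waste of time")])
print(classify("loved the plot", *model))
```

On Hadoop, the same counting and scoring would be split across the training, combining and classify jobs described above.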

The evaluation also considers efficiency in terms of processing time and throughput. The system gives efficient results even on larger data sets, and the system throughput increases as the data size grows.
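Throughput here is simply the amount of work completed per unit of processing time. A minimal measurement helper (hypothetical, not taken from the papers) could look like this:

```python
import time

def measure_throughput(operation, items):
    """Run `operation` over each item; return (elapsed_seconds, items_per_second)."""
    start = time.perf_counter()
    for item in items:
        operation(item)
    elapsed = time.perf_counter() - start
    return elapsed, len(items) / elapsed if elapsed > 0 else float("inf")

# Example: time 1,000 simulated small-file reads (replace the lambda with real HDFS reads).
elapsed, throughput = measure_throughput(lambda name: None,
                                         [f"file_{i}.txt" for i in range(1000)])
print(f"{elapsed:.3f} s, {throughput:.0f} files/s")
```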
