Analysis and Processing of Massive Data Based on Hadoop Platform / A Perusal of Big Data Classification and Hadoop Technology
The document discusses big data classification and Hadoop technology. It covers topics such as Hadoop components, limitations of Hadoop, and using a Naive Bayes classifier for sentiment analysis on big data. It also discusses improving the performance of storing massive numbers of small files in Hadoop.
1. Analysis and Processing of Massive Data Based on Hadoop Platform

Big Data is used to describe data sets so huge that they are arduous to analyze in conventional ways. Massive data generally means large amounts of unstructured and semi-structured data, and loading such data into a relational database for analysis costs too much time and money. It is therefore urgent for enterprises to change their traditional architecture and to work out how to analyze these data and make full use of their value. In massive data processing, mining potential value and transformation capability from mass data efficiently and quickly provides a basis for decision making and becomes the core competitiveness of an enterprise, but with ever faster data generation and ever larger data volumes, data processing technology faces more and more challenges.

Based on an analysis of the key technical foundations and of existing research on distributed storage and computation with Hadoop cluster technology, together with business needs and the actual hardware and software capabilities, the paper takes the Hadoop cloud platform as its research platform and proposes a preprocessing model for large volumes of Web log data. It studies the massive data processing performance of the Apriori algorithm based on distributed data mining, effectively improving the cloud platform and contributing to the development of big data processing technology. Massive data analysis and cloud computing are often linked together, because real-time analysis of large data sets relies on the MapReduce framework to assign tens, hundreds or even thousands of jobs to computers. Technologies suited to mass data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases and cloud computing platforms (the paper appeared at the 2018 4th World Conference on Control, Electronics and Computer Engineering, WCCECE 2018).

2. A Perusal of Big Data Classification and Hadoop Technology

Infrastructure: big data storage is concerned with storing and managing data in a scalable way. Distributed file systems: the Hadoop File System (HDFS) can store huge amounts of unstructured data reliably on commodity hardware. NoSQL databases: the most important family of large-scale storage technologies is NoSQL database management systems, alongside NewSQL. Analytics: big data analysis is about making "sense" of huge volumes of multifarious data that in raw form lack a data model to define what every element means in the context of the others. Visualization: a recent class of visualization techniques processes such huge-scale data before handing it to the rendering routines. Infrastructure is the foundation of a Big Data architecture, and possessing the proper tools for storing, processing and analyzing data is vital in any Big Data project: Hadoop, an open-source framework for processing, storing and analyzing data, along with MapReduce, YARN, Spark, NoSQL, massively parallel processing and the cloud.

Security and privacy in big data: as the enormous amount of data being collected continues to grow rapidly, more and more companies are building big data repositories to gather, aggregate and extract meaning from their data. Anonymization is the procedure of altering and masking personal data in such a way that individuals cannot be re-identified. Hadoop accomplishes two tasks, massive data storage and distributed processing. It is a low-cost alternative to conventional data storage options, uses commodity hardware to store huge quantities of data reliably, and protects data and application processing against hardware failure.
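Both summaries above come down to the same mechanics: data lives in HDFS and the processing is expressed as MapReduce jobs. Below is a minimal Hadoop Streaming sketch in Python, not code from either paper: it assumes whitespace-separated web-log lines in roughly the common log format (URL in the seventh field) and simply counts requests per URL, the kind of counting pass that distributed mining such as Apriori builds on. The file name, field index and jar path are assumptions to adjust.

```python
#!/usr/bin/env python3
"""logcount.py -- hypothetical Hadoop Streaming job: count requests per URL.

Submit with something like (jar path and version vary by installation):
  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input /logs/raw -output /logs/counts \
      -mapper "logcount.py map" -reducer "logcount.py reduce" -file logcount.py
"""
import sys


def mapper():
    # Emit "url<TAB>1" per request; assumes the URL is the 7th whitespace-separated field.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")


def reducer():
    # Streaming hands the reducer mapper output sorted by key, so equal URLs are adjacent.
    current_url, count = None, 0
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value or 0)
    if current_url is not None:
        print(f"{current_url}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()
```

The same script can be tested locally with a shell pipeline (cat access.log | ./logcount.py map | sort | ./logcount.py reduce), which mimics the shuffle-and-sort step Hadoop performs between the two phases.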
The Data Cleaning of Hadoop: DataCleaner can help with data quality, data ingestion, and the standardizing and monitoring of data, and it can leverage the computing power of a Hadoop cluster to overcome infrastructure and performance hurdles.

The Components of Hadoop: Hadoop Distributed File System (HDFS), Hadoop YARN, Hadoop MapReduce, Pig, Hive, HBase, Cassandra, HCatalog, Lucene, Hama, Crunch, Avro, Thrift, Drill, Mahout, Ambari, ZooKeeper, Oozie, Sqoop, Flume, Chukwa.

Limitations of Hadoop: Hadoop is an impressive platform for processing massive volumes of data with remarkable speed on low-cost commodity hardware, but it does have some momentous limitations.

3. A Method to Improve the Performance for Storing Massive Small Files in Hadoop

Hadoop is a popular distributed framework whose core is a high-performance distributed computing platform, and HDFS was originally designed to store oversized files. The proposed method consists of three processes: merging the small files, establishing the mapping index, and prefetching. When small files need to be uploaded to HDFS, the client first creates a temporary file in memory and merges the small files into it; when the system later accesses a small file, the client first queries HBase by the small-file name and then locates the data through the merged file name. The merge algorithm runs as small files are uploaded: it computes the size of the current temporary merge file, and if that size exceeds 64 MB, or the file has been held in memory longer than a threshold T1, the merged file is uploaded to HDFS and its index is written to HBase. The algorithm is executed periodically.
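To make the merge-and-index scheme concrete, here is a small local simulation, a sketch under stated assumptions rather than the paper's implementation: it does not call the real HDFS or HBase APIs. Incoming small files are buffered in memory, the buffer is written out as one merged file once it exceeds 64 MB or has waited longer than T1 seconds, and a dictionary mapping each small-file name to (merged file, offset, length) stands in for the HBase index.

```python
import time
from pathlib import Path

BLOCK_LIMIT = 64 * 1024 * 1024   # 64 MB threshold from the paper
T1 = 30.0                        # hypothetical timeout in seconds


class SmallFileMerger:
    """Local stand-in for the merge/index/read path (no real HDFS or HBase calls)."""

    def __init__(self, out_dir="merged"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(exist_ok=True)
        self.buffer = []            # (name, bytes) pairs waiting to be merged
        self.buffer_size = 0
        self.buffer_started = None  # arrival time of the oldest buffered file
        self.index = {}             # name -> (merged_path, offset, length); plays the HBase role
        self.seq = 0

    def put(self, name, data):
        if self.buffer_started is None:
            self.buffer_started = time.time()
        self.buffer.append((name, data))
        self.buffer_size += len(data)
        # The paper runs this check periodically; here it piggybacks on each upload.
        if self.buffer_size >= BLOCK_LIMIT or time.time() - self.buffer_started > T1:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        merged = self.out_dir / f"merged_{self.seq:05d}.bin"
        offset = 0
        with open(merged, "wb") as out:
            for name, data in self.buffer:
                out.write(data)
                self.index[name] = (str(merged), offset, len(data))
                offset += len(data)
        self.seq += 1
        self.buffer, self.buffer_size, self.buffer_started = [], 0, None

    def get(self, name):
        # Index lookup first, then a ranged read from the merged file, as in the paper.
        merged_path, offset, length = self.index[name]
        with open(merged_path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Reading follows the same two-step pattern the paper describes: resolve the name through the index, then read only the needed byte range of the merged file, so the NameNode no longer has to track metadata for every tiny file.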
The experiments take 100,000 small files as the data set; the small files range from 1 KB to 50 KB in size and their formats are txt, doc and jpg. Groups of 20,000, 40,000, 60,000, 80,000 and 100,000 files are randomly selected from the file set, and each of the five groups is uploaded with both the improved scheme and the original HDFS to compare uploading speed, NameNode memory usage and reading speed. For reading, each group randomly generates 1,000 access logs, reads the small files named in those logs and records the time taken; each group of the experiment is run three times.

4. Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier

Sentiment classification is useful for the business-to-consumer industry and for online recommendation systems. The important steps of the workflow are: 1) instruct the data parser about the format of the input data and the desired output; 2) transmit the source code to the name node and execute it; 3) trigger the result collector to collect the computing results once they are available on the Hadoop Distributed File System (HDFS). A review's class is then decided by the frequency of each word that appears in the model obtained from the training data set.

Pre-processing the raw data set: the data parser first pre-processes all reviews into a common format. After processing, each review is one line in the data set, prefixed with a document ID and a sentiment label (positive or negative), and all pre-processed reviews are stored on the name node as a repository. The WFC and the data parser work together to prepare the input data sets for all test trials.

Sentiment classification using Hadoop: the classification is the key step in the workflow, and there are only two classes of documents. The workflow consists of a training job, a combining job and a classify job: once the training data and test data are ready in HDFS, the WFC starts the training job to build a model, and the combining job then combines the test data with the model, producing an intermediate table. This automatic scheduling method can easily be applied to other programs with minor changes to the parameters. A virtual Hadoop cluster is a fast and easy way to test a Hadoop program in the cloud, although the performance may be weaker than on a physical cluster. To test the scalability of the Naive Bayes classifier, the size of the data set is varied from one thousand to one million reviews per class, and the reported statistics include classification accuracy, computation time and system throughput.
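As an illustration of the classification step itself, here is a single-machine Naive Bayes sketch, not the paper's Hadoop training/combining/classify jobs: it builds a word-frequency model with add-one (Laplace) smoothing and scores a review by summing log word likelihoods, matching the idea that a review's class is decided by the frequencies of its words in the trained model. The tab-separated "label<TAB>review text" layout is an assumption.

```python
import math
from collections import Counter, defaultdict


def train(lines):
    """lines: iterable of 'label\ttext' strings, label 'positive' or 'negative' (assumed format)."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    doc_counts = Counter()               # label -> number of reviews
    for line in lines:
        label, _, text = line.partition("\t")
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    total_docs = sum(doc_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train([
        "positive\tgreat movie loved the acting",
        "negative\tterrible plot wasted my time",
    ])
    print(classify("loved the acting", *model))   # -> positive
```

Roughly, the training job in the paper corresponds to building these per-class word counts at scale, the combining job to joining test reviews with that model, and the classify job to the final scoring loop.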
The evaluation considers efficiency in terms of processing time and throughput: the system gives efficient results even on larger data sets, and system throughput increases as the data size grows.

CONCLUSION: The authors view this work as only a beginning for employing machine learning technologies on large-scale data sets. Future work will include using the framework for information fusion over imagery and text, distributed robotics applications, and cyber analysis using cloud computing.