
What Is Hadoop

Hadoop is an open source software platform that addresses the problems of storing and analyzing large volumes of data. It uses HDFS for inexpensive and reliable data storage by breaking files into blocks and storing multiple redundant copies across clusters of commodity servers. Hadoop also uses MapReduce to analyze both structured and unstructured data by distributing the work across nodes and processing data in parallel, leveraging HDFS's data distribution to improve performance. These techniques allow Hadoop to cost-effectively manage exabytes of data and gain insights that help control processes, predict demand, and build better products and services.

Uploaded by

krishnanand
Copyright
© Attribution Non-Commercial (BY-NC)

What Is Hadoop?

Managing Big Data in the Enterprise
Introduction
Data volumes are growing much faster than compute power. This growth demands new strategies for processing and analyzing information. According to IDC [1], the amount of digital information produced in 2011 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be unstructured, complex data poorly suited to management by structured storage systems like relational databases.

Unstructured data comes from many sources and takes many forms: web logs, text files, sensor readings, user-generated content like product reviews or text messages, audio, video and still imagery, and more.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do RFID tag reads show anomalies in the movement of goods during distribution? Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers? Can a collection of nucleotide sequences be assembled into a single gene? Companies that can extract facts like these from the huge volume of data can better control processes and costs, can better predict demand and can build better products.

Dealing with big data requires two things: inexpensive, reliable storage; and new tools for analyzing unstructured and structured data.

Apache Hadoop is a powerful open source software platform that addresses both of these problems. Hadoop is an Apache Software Foundation project. Cloudera offers commercial support and services to Hadoop users.

[1] An Updated Forecast of Worldwide Information Growth Through 2011, IDC, March 2008.

What is Hadoop? Big Data in the Enterprise

Reliable Storage: HDFS

Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers:
Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it manages. Because the cluster stores several copies of every block, more clients can read them at the same time without creating bottlenecks.

Other fault-tolerant storage systems are often more expensive than HDFS.

Of course there are many other redundancy techniques, including the various strategies employed by RAID machines. HDFS offers two key advantages over RAID: it requires no special hardware, since it can be built from commodity servers, and it can survive more kinds of failure: a disk, a node on the network or a network interface.

The one obvious objection to HDFS, its consumption of three times the necessary storage space for the files it manages, is not so serious, given the plummeting cost of storage. In addition, HDFS offers some real advantages for data processing, as the next section will show.
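The block-and-replica scheme described in this section can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function names, the server names, the 100-byte block size and the round-robin placement rule are all assumptions chosen to keep the example short, and real HDFS placement is more sophisticated. The sketch splits a file into blocks, places three copies of each block on distinct servers, and checks that the file stays readable when any two servers fail.

```python
from collections import Counter
from itertools import combinations


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str], replicas: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replicas` distinct servers.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replicas)]
        for block in range(num_blocks)
    }


def file_available(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """The file is readable if every block has at least one surviving copy."""
    return all(any(s not in failed for s in holders) for holders in placement.values())


def survives_any_failures(placement: dict[int, list[str]], servers: list[str], k: int) -> bool:
    """Check every possible combination of k simultaneous server failures."""
    return all(file_available(placement, set(down)) for down in combinations(servers, k))


# Hypothetical example: a 500-byte file, 100-byte blocks, five servers.
file_data = b"x" * 500
file_blocks = split_into_blocks(file_data, 100)
server_names = ["s1", "s2", "s3", "s4", "s5"]
block_placement = place_replicas(len(file_blocks), server_names)

blocks_per_server = Counter(s for holders in block_placement.values() for s in holders)
stored_bytes = sum(len(file_blocks[b]) * len(holders) for b, holders in block_placement.items())

print("placement:", block_placement)
print("survives any two failures:", survives_any_failures(block_placement, server_names, 2))
print("storage used vs. file size:", stored_bytes, "vs.", len(file_data))
```

With three copies per block, the stored bytes come to three times the file size, which is the storage overhead discussed above. Losing three servers at once can remove every copy of some block, so three replicas guarantee survival of two failures but not three.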

Hadoop for Big Data Analysis

Many popular tools for enterprise data management, relational database systems for example, are designed to make simple queries run quickly. They use techniques like indexing to examine just a small portion of all the available data in order to answer a question.

Hadoop is designed for large-scale analyses that need to examine all the data in a repository.

Hadoop is a different sort of tool. Hadoop is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly.

In the previous section, we saw that HDFS distributes blocks from a single file among a large number of servers for reliability. Hadoop takes advantage of this data distribution by pushing the work involved in an analysis out to many different servers. Each of the servers runs the analysis on its own block from the file. Results are collated and digested into a single result after each piece has been analyzed.
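The pattern of analyzing each block independently and then collating the partial results can be illustrated with a small word count. This is a sketch under stated assumptions, not the Hadoop MapReduce API: the function names and sample blocks are hypothetical, and a local thread pool stands in for the servers of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_block(block: str) -> Counter:
    """Map step: analyze one block on its own (here, count its words)."""
    return Counter(block.split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results into one."""
    return left + right


def run_word_count(blocks: list[str]) -> Counter:
    """Run the map step on every block in parallel, then collate the results."""
    # Each worker thread plays the role of a server holding one block.
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(map_block, blocks))
    return reduce(merge_counts, partial_results, Counter())


# Hypothetical input: three blocks of one text file.
text_blocks = ["big data big", "data storage", "big storage cluster"]
word_totals = run_word_count(text_blocks)
print(dict(word_totals))
```

Because each call to `map_block` needs only its own block, the map step can run wherever that block is stored, and only the small partial results have to travel to the place where they are merged.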

Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.

Figure 2: Hadoop pushes work out to the data

Running the analysis on the nodes that actually store the data delivers much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server with a copy of the data.
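The idea of sending each task to a server that already holds the data, and moving the task to another copy when a server is lost, can be sketched as follows. This is a simplified illustration rather than Hadoop's actual scheduler: the placement table, the server names A to E, the function names and the least-loaded selection rule are all assumptions. A node that is running very slowly could be handled the same way as the failed node here.

```python
def pick_server(holders: list[str], live: set[str], load: dict[str, int]) -> str:
    """Choose the least-loaded live server that stores a copy of the block."""
    candidates = [s for s in holders if s in live]
    if not candidates:
        raise RuntimeError("no live server holds a copy of this block")
    return min(candidates, key=lambda s: (load[s], s))


def schedule(placement: dict[int, list[str]], live: set[str]) -> dict[int, str]:
    """Assign every block's task to a live server that stores that block."""
    load = {s: 0 for s in live}
    assignment = {}
    for block in sorted(placement):
        server = pick_server(placement[block], live, load)
        assignment[block] = server
        load[server] += 1
    return assignment


def reassign_lost_work(placement: dict[int, list[str]], assignment: dict[int, str],
                       live: set[str], failed: str) -> dict[int, str]:
    """Restart only the tasks that ran on the failed server, on other replicas."""
    survivors = live - {failed}
    new_assignment = {b: s for b, s in assignment.items() if s != failed}
    load = {s: 0 for s in survivors}
    for server in new_assignment.values():
        load[server] += 1
    for block in sorted(assignment):
        if assignment[block] == failed:
            server = pick_server(placement[block], survivors, load)
            new_assignment[block] = server
            load[server] += 1
    return new_assignment


# Hypothetical placement: five blocks, each stored on three of five servers.
replica_map = {
    1: ["B", "D", "E"],
    2: ["A", "C", "D"],
    3: ["B", "C", "E"],
    4: ["A", "B", "C"],
    5: ["A", "D", "E"],
}
all_servers = {"A", "B", "C", "D", "E"}

first_plan = schedule(replica_map, all_servers)
second_plan = reassign_lost_work(replica_map, first_plan, all_servers, failed="A")
print("initial plan:", first_plan)
print("after losing server A:", second_plan)
```

Only the tasks that were running on the lost server move; the rest of the plan is left alone, and every task still runs on a server that stores its block.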

Summary
Hadoop's MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
+1 650 362 0488
http://www.cloudera.com/
