JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
B.Tech. in COMPUTER SCIENCE AND ENGINEERING (DS)
COURSE STRUCTURE & SYLLABUS (R18), Applicable from the 2020-21 Admitted Batch

BIG DATA ANALYTICS LAB MANUAL
B.Tech. CSE (DS), I Sem

Course Objectives:
1. The purpose of this course is to provide the students with the knowledge of Big Data Analytics principles and techniques.
2. This course is also designed to give an exposure to the frontiers of Big Data Analytics.

Course Outcomes: After the completion of the course the student will be able to:
1. Use Excel as an analytical and visualization tool.
2. Program using Hadoop and MapReduce.
3. Perform data analytics using ML in R.
4. Use Cassandra to perform social media analytics.

List of Experiments
1. Implement a simple map-reduce job that builds an inverted index on the set of input documents (Hadoop).
2. Process big data in HBase.
3. Store and retrieve data in Pig.
4. Perform social media analysis using Cassandra.
5. Buyer event analytics using Cassandra on suitable product sales data.
6. Using Power Pivot (Excel), perform the following on any dataset: a) Big Data Analytics b) Big Data Charting.
7. Use R-Project to carry out statistical analysis of big data.
8. Use R-Project for data visualization of social media data.

Experiment 1: Implement a simple map-reduce job that builds an inverted index on the set of input documents (Hadoop)

Hadoop MapReduce Inverted Index. This is a Java program for creating an inverted index of words occurring in a large set of documents extracted from web pages, using Hadoop MapReduce and Google Dataproc.

As our dataset, we use a subset of 74 files from a total of 408 files (text extracted from HTML tags) derived from the Stanford WebBase project. It was obtained from a web crawl done in February 2007 and is one of the largest such collections, totaling more than 100 million web pages from more than 50,000 websites. This version has already been cleaned.

In this project we first set up a sample Hadoop cluster in Local (Standalone) Mode, and then implement the actual project in Fully-Distributed Mode on Google Dataproc on the full dataset.

Local (Standalone) Mode

Setup Hadoop
We need to set up the project on GNU/Linux, as it is the supported development and production platform for Hadoop. To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors. Please note that this project was deployed and tested using hadoop-3.1.1.

Unpack the downloaded Hadoop distribution. In the distribution folder, edit the file etc/hadoop/hadoop-env.sh to define environment variables as follows:

## add these lines to hadoop-env.sh or set them in the terminal
export JAVA_HOME=/usr/java/latest
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Then try the following command:

$ bin/hadoop

This should display the usage documentation for the Hadoop script. Now you are ready to start your Hadoop cluster in standalone mode.

Setup and run a simple Hadoop job
This simple Hadoop job takes two text files from the input folder as the arguments of the Mapper:

file01: 5722018411 Hello World Bye World
file02: 6722018415 Hello Hadoop Goodbye Hadoop

By submitting a Hadoop job and applying the Reduce step, it generates an inverted index as below:

bye     5722018411:1
goodbye 6722018415:1
hadoop  6722018415:2
hello   5722018411:1 6722018415:1
world   5722018411:2
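The InvertedIndex.java used by this project is not reproduced at this point in the manual. The following is a minimal sketch, not the project's original source, of a mapper/reducer pair that produces the word -> docID:count output shown above; the class and method names are illustrative, and it assumes each input line begins with a document ID followed by that document's words.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class TokenMapper extends Mapper<Object, Text, Text, Text> {
        private final Text word = new Text();
        private final Text docId = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            if (!itr.hasMoreTokens()) return;
            docId.set(itr.nextToken());              // first token on the line is the document ID
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, docId);          // emit (word, docID) once per occurrence
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<>();   // docID -> occurrence count
            for (Text v : values) {
                counts.merge(v.toString(), 1, Integer::sum);
            }
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (sb.length() > 0) sb.append(" ");
                sb.append(e.getKey()).append(":").append(e.getValue());
            }
            context.write(key, new Text(sb.toString()));     // word -> "docID:count docID:count ..."
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}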
To submit a Hadoop job, the MapReduce implementation should be packaged as a jar file. To do so, copy the InvertedIndex.java file of this project to the root of the Hadoop distribution folder, and while you are still there run the following commands to compile InvertedIndex.java and create a jar file:

$ bin/hadoop com.sun.tools.javac.Main InvertedIndex.java
$ jar cf invertedindex.jar InvertedIndex*.class

Copy input/file01 and input/file02 of this project and place them inside the input folder of the Hadoop distribution folder. While you are still there, run the following command to submit the job, read the input files from the input folder, generate the inverted index and store the output in the output folder:

$ bin/hadoop jar invertedindex.jar InvertedIndex input output

And finally, to see the output, run the command below:

$ bin/hadoop dfs -cat output/part-r-00000

Fully-Distributed Mode
In this section we create a cluster with 3 worker nodes on Google Dataproc and run the invertedindex.jar job on the actual dataset.

Google Cloud Platform Setup
First we need an account on Google Cloud Platform.

Setting up Your Initial Machine
In the Google Cloud Console, either create a new project or select an existing one. For this exercise we will use Dataproc. Using Dataproc, we can quickly create a cluster of compute instances running Hadoop. The alternative to Dataproc would be to individually set up each compute node, install Hadoop on it, set up HDFS, set up the master node, and so on. Dataproc automates this grueling process for us.

Creating a Hadoop Cluster on Google Cloud Platform
In the Google Cloud Console, select Dataproc from the navigation list on the left. If this is the first time you are using Dataproc, you might need to enable the Dataproc API first. Clicking on 'Create Cluster' will take you to the cluster configuration section. Give any unique name to your cluster and select a desired zone. You need to create a master and 3 worker nodes. Select the default processor configuration (n1-standard-4: 4 vCPUs, 15 GB memory) for the master and each worker, and reduce the storage to 32 GB HDD. Change the number of worker nodes to 3. Leave everything else at the defaults and click on 'Create'.

Now that the cluster is set up, we'll have to configure it a little before we can run jobs on it. Select the cluster you just created from the list of clusters under the Cloud Dataproc section on your console. Go to the VM Instances tab and click on the SSH button next to the instance with the Master role. Clicking on the SSH button will take you to a command-line interface (CLI) like an xterm or terminal. All the commands in the following steps are to be entered in the CLI.

There is no home directory on HDFS for the current user, so we'll have to set this up before proceeding further. (To find out your user name, run whoami.)

$ hadoop fs -mkdir -p /user/<username>

Set up environment variables
JAVA_HOME has already been set up and we don't need to set it again. Please note that this step has to be done each time you open a new SSH terminal. To eliminate this step you can also set JAVA_HOME, PATH and HADOOP_CLASSPATH in etc/hadoop/hadoop-env.sh.

$ export PATH=${JAVA_HOME}/bin:${PATH}
$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Now run:

$ hadoop fs -ls

If there is no error, your cluster was successfully set up. If you do encounter an error, it is most likely due to a missing environment variable or the user home directory not being set up right.
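If you prefer the command line over the Console, a roughly equivalent cluster can also be created with the gcloud CLI; this is only a sketch, and the cluster name, region and zone below are placeholders:

$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 --zone=us-central1-a \
    --master-machine-type=n1-standard-4 --master-boot-disk-size=32GB \
    --num-workers=3 --worker-machine-type=n1-standard-4 --worker-boot-disk-size=32GB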
To ensure that the environment variables are set, run the command env.

NOTE: Please disable the billing for the cluster when you are not using it. Leaving it running will cost extra credits.

Upload data to Google Cloud Storage
Download the dataset from this link and unzip the contents. You will find two folders inside, named 'development' and 'full data'. The 'development' data can be used for development and testing purposes.

Click on 'Dataproc' in the left navigation menu. Next, locate the address of the default Google Cloud Storage staging bucket for your cluster. Go to the Storage section in the left navigation bar and select your cluster's default bucket from the list of buckets. Click on the Upload Folder button and upload the devdata and fulldata folders individually.

Submit Hadoop job and generate output
Now we are ready to submit a Hadoop job to run our MapReduce implementation on the actual data. Either use SSH or nano/vi to copy or create InvertedIndex.java on the master node. Run the following commands to create the jar file:

$ hadoop com.sun.tools.javac.Main InvertedIndex.java
$ jar cf invertedindex.jar InvertedIndex*.class

Now you have a jar file for your job. You need to place this jar file in the default cloud bucket of your cluster. Just create a folder called JAR in your bucket and upload the jar to that folder. If you created your jar file on the cluster's master node itself, use the following commands to copy it to the JAR folder:

$ hadoop fs -copyFromLocal ./invertedindex.jar
$ hadoop fs
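With the jar in the bucket, the job can be submitted to the cluster either from the Dataproc 'Jobs' page in the Console or, as a sketch, from the gcloud CLI; the cluster name, region, bucket name and input/output paths below are placeholders:

$ gcloud dataproc jobs submit hadoop --cluster=my-cluster --region=us-central1 \
    --jar=gs://my-bucket/JAR/invertedindex.jar -- gs://my-bucket/fulldata gs://my-bucket/output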

Model 1: Inverted index on employee data (by first name)

We have three sample data files, employee_info_1.csv, employee_info_2.csv and employee_info_3.csv. We will be creating an inverted index as below, so that it is faster to search employee details based on the first name:

AARON   employee_info_1.csv employee_info_2.csv
ABAD JR employee_info_1.csv
ABARCA  employee_info_1.csv

The input files have the following format:

First Name,Last Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
dubert,tomasz,paramedic i/c,fire,f,salary,,91080.00,
edwards,tim p,lieutenant-fire,fire,f,salary,,114846.00,
elkins,eric,sergeant,police,f,salary,,104628.00,
estrada,luis f,police officer,police,f,salary,,96060.00,
ewing,marie a,clerk iii,police,f,salary,,53076.00,
finn,sean p,firefighter,fire,f,salary,,87006.00,
fitch,jordan m,law clerk,law,f,hourly,35,,14.51

Mapper Code
In the mapper class we split the input data using a comma as the delimiter and then check for invalid data, ignoring it in the if condition. The first name of the employee is stored at index 0, so we fetch the first name using index 0. We also require the file name to store as the value against the first name, so we fetch the name of the file being processed in the mapper using ((FileSplit) context.getInputSplit()).getPath().getName() and use it as the value.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexNameMapper extends Mapper<Object, Text, Text, Text> {

    private Text nameKey = new Text();
    private Text fileNameValue = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String data = value.toString();
        String[] field = data.split(",", -1);
        String firstName = null;
        if (null != field && field.length == 9 && field[0].length() > 0) {
            firstName = field[0];
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            nameKey.set(firstName);
            fileNameValue.set(fileName);
            context.write(nameKey, fileNameValue);
        }
    }
}

Reducer Code
The reducer iterates through the set of input values and appends each file name to a String, delimited by a space character. The input key is output along with this concatenation. We also do a check to make sure we are not printing a duplicate file name for the same first name.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexNameReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Text value : values) {
            if (first) {
                first = false;
            } else {
                sb.append(" ");
            }
            if (sb.lastIndexOf(value.toString()) < 0) {
                sb.append(value.toString());
            }
        }
        result.set(sb.toString());
        context.write(key, result);
    }
}

Driver Code
Finally, we use the driver class to test that everything is working as expected.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexNameDriver {
    public static void main(String[] args) throws Exception {
        // driver: configures and submits the job using the mapper and reducer defined above
        Job job = Job.getInstance(new Configuration(), "inverted index on first name");
        job.setJarByClass(InvertedIndexNameDriver.class);
        job.setMapperClass(InvertedIndexNameMapper.class);
        job.setReducerClass(InvertedIndexNameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Sample Output

AARON   employee_info_1.csv employee_info_2.csv
ABAD JR employee_info_1.csv
ABARCA  employee_info_1.csv

Model 2: Inverted index
In this assignment you have to implement a simple map-reduce job that builds an inverted index on the set of input documents.
An inverted index maps each word to a list of documents that contain the word, and additionally records the position of each occurrence of the word within the document. For the purpose of this assignment, the position will be based on counting words, not characters.

Example: assume the input documents below.

file1 = {"data is good."}
file2 = {"data is not good?"}

data  {(file1,1)(file2,1)}
good  {(file1,3)(file2,4)}
is    {(file1,2)(file2,2)}
not   {(file2,3)}

For more details on inverted indices, you can check out the Wikipedia page on inverted indices.

Now in this assignment you need to implement the above map-reduce job.

Input: a set of documents.
Output:
Map: word1 (filename, position); word2 (filename, position); word3 (filename, position); and so on for each occurrence of each word.
Reduce: word1 {(filename, position)(filename, position)}; word2 {(filename, position)}; and so on for each word.

Code for getting the file name in Hadoop, which can be used in the Map function:

String filename = null;
filename = ((FileSplit) context.getInputSplit()).getPath().getName();

/* Marking scheme:
Assignment 8: Inverted index
Solution
Map function and Reduce function: 4 marks each, Job setup 2 marks.
Full marks if the solution is correct or almost so, half of the max marks for any reasonable attempt.
More detailed marks breakup at the discretion of the coordinator.
*/

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
//import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        int count = 1;
        String filename = null;

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            Text word = new Text();    // type of output key
            Text output = new Text();  // type of output value
            // code for getting the file name
            filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            // split the line into word tokens, dropping punctuation
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,.?()[]\"'@#$%&*:;<>/");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());  // set word as each input keyword
                output.set("(" + filename + "," + Integer.toString(count) + ")");
                System.out.println(output + "\n");
                context.write(word, output); // create a (word, (filename,position)) pair
                ++count;
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Text value = new Text(); // type of output value
            String result = "{";
            for (Text val : values) {
                result += val.toString();
            }
            result += "}";
            value.set(result);
            context.write(key, value);
        }
    }

    // Driver program
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); // get all args
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }
        // create a job with name "InvertedIndex"
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // uncomment the following line to add the Combiner
        // job.setCombinerClass(Reduce.class);
        // set output key type
        job.setOutputKeyClass(Text.class);
        // set output value type
        job.setOutputValueClass(Text.class);
        // set the HDFS path of the input data
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // set the HDFS path for the output
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // wait till job completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } // end of main method
} // end of class InvertedIndex
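To try this solution on the standalone setup from earlier, the same compile-and-submit steps apply; the input folder here would contain file1 and file2:

$ bin/hadoop com.sun.tools.javac.Main InvertedIndex.java
$ jar cf invertedindex.jar InvertedIndex*.class
$ bin/hadoop jar invertedindex.jar InvertedIndex input output
$ bin/hadoop fs -cat output/part-r-00000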
Experiment 2: Processing Big Data in HBase

What is HBase
HBase is an open-source, sorted-map database built on Hadoop. It is column-oriented and horizontally scalable, and is based on Google's BigTable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System.

Why HBase
RDBMS get exponentially slow as the data becomes large.
RDBMS expect data to be highly structured, i.e. able to fit in a well-defined schema.
Any change in schema might require downtime.
For sparse datasets, there is too much overhead in maintaining NULL values.

Features of HBase
Horizontally scalable: you can add any number of columns anytime.
Automatic failover: automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
Integration with the Map/Reduce framework: all the commands and Java code internally implement Map/Reduce to do the task, and it is built over the Hadoop Distributed File System.
HBase is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key and timestamp. It is often referred to as a key-value store or column-family-oriented database, or as storing versioned maps of maps.
Fundamentally, it is a platform for storing and retrieving data with random access.
It doesn't care about datatypes (you can store an integer in one row and a string in another for the same column).
It doesn't enforce relationships within your data.
It is designed to run on a cluster of computers built using commodity hardware.

Installing HBase
We can install HBase in any of three modes: standalone mode, pseudo-distributed mode, and fully distributed mode.

Installing HBase in Standalone Mode
Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and extract it using the "tar -zxvf" command. See the following commands.

$ cd /usr/local/
$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-hadoop2-bin.tar.gz
$ tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz

Shift to super user mode and move the HBase folder to /usr/local as shown below.

$ su
$ password: enter your password here
mv hbase-0.99.1/* Hbase/

Configuring HBase in Standalone Mode
Before proceeding with HBase, you have to edit the following files and configure HBase.

hbase-env.sh
Set the Java home for HBase by opening the hbase-env.sh file from the conf folder. Edit the JAVA_HOME environment variable and change the existing path to your current JAVA_HOME value as shown below.

cd /usr/local/Hbase/conf
gedit hbase-env.sh

This will open the env.sh file of HBase. Now replace the existing JAVA_HOME value with your current value as shown below.

export JAVA_HOME=/usr/lib/jvm/java-1.7.0

hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an appropriate location by opening the HBase home folder in /usr/local/HBase. Inside the conf folder you will find several files; open the hbase-site.xml file as shown below.

# cd /usr/local/HBase/
# cd conf
# gedit hbase-site.xml

Inside the hbase-site.xml file you will find the <configuration> and </configuration> tags. Within them, set the HBase directory under the property key with the name "hbase.rootdir" as shown below.

<configuration>
   <!-- Here you have to set the path where you want HBase to store its files. -->
   <property>
      <name>hbase.rootdir</name>
      <value>file:/home/hadoop/HBase/HFiles</value>
   </property>
   <!-- Here you have to set the path where you want HBase to store its built-in zookeeper files. -->
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/zookeeper</value>
   </property>
</configuration>

With this, the HBase installation and configuration part is successfully complete. We can start HBase by using the start-hbase.sh script provided in the bin folder of HBase. For that, open the HBase home folder and run the HBase start script as shown below.

$ cd /usr/local/HBase/bin
$ ./start-hbase.sh

If everything goes well, when you run the HBase start script it will prompt you with a message saying that HBase has started.

starting master, logging to /usr/local/HBase/bin/../logs/hbase-tpmaster-localhost.localdomain.out

Installing HBase in Pseudo-Distributed Mode
Let us now check how HBase is installed in pseudo-distributed mode.

Configuring HBase
Before proceeding with HBase, configure Hadoop and HDFS on your local system or on a remote system and make sure they are running. Stop HBase if it is running.

hbase-site.xml
Edit the hbase-site.xml file to add the following property.

<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>

It mentions in which mode HBase should be run. In the same file, change hbase.rootdir from the local file system to your HDFS instance address, using the hdfs://<host>:<port>/<path> URI syntax. We are running HDFS on the localhost at port 8030.

<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost:8030/hbase</value>
</property>

Starting HBase
After configuration is over, browse to the HBase home folder and start HBase using the following commands.

$ cd /usr/local/HBase
$ bin/start-hbase.sh

Note: before starting HBase, make sure Hadoop is running.

Checking the HBase Directory in HDFS
HBase creates its directory in HDFS. To see the created directory, browse to the Hadoop bin folder and type the following command.

$ ./bin/hadoop fs -ls /hbase

If everything goes well, it will give you the following output.

Found 7 items
drwxr-xr-x - hbase users  0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users  0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users  0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users  0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users  7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users  0 2014-06-25 21:49 /hbase/oldWALs

Starting and Stopping a Master
Using "local-master-backup.sh" you can start up to 10 backup master servers. Open the home folder of HBase and execute the following command to start a backup master.

$ ./bin/local-master-backup.sh 2 4

To kill a backup master, you need its process id, which will be stored in a file named "/tmp/hbase-USER-X-master.pid". You can kill the backup master using the following command.

$ cat /tmp/hbase-user-1-master.pid | xargs kill -9

Starting and Stopping RegionServers
You can run multiple region servers from a single system using the following command.

$ ./bin/local-regionservers.sh start 2 3

To stop a region server, use the following command.

$ ./bin/local-regionservers.sh stop 3

Starting the HBase Shell
After installing HBase successfully, you can start the HBase shell. Below is the sequence of steps to be followed to start the HBase shell. Open the terminal and log in as super user.

Start the Hadoop File System
Browse through the Hadoop home sbin folder and start the Hadoop file system as shown below.

$ cd $HADOOP_HOME/sbin
$ start-all.sh

Start HBase
Browse through the HBase root directory bin folder and start HBase.

$ cd /usr/local/HBase
$ ./bin/start-hbase.sh

Start the HBase Master Server
This will be in the same directory. Start it as shown below.

$ ./bin/local-master-backup.sh start 2 (the number signifies a specific server)

Start the Region Server
Start the region server as shown below.

$ ./bin/local-regionservers.sh start 3

Start the HBase Shell
You can start the HBase shell using the following commands.

$ cd bin
$ ./hbase shell

This will give you the HBase shell prompt as shown below.

2014-12-09 14:24:27,526 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri ...
hbase(main):001:0>

HBase Web Interface
To access the web interface of HBase, type the following URL in the browser.

http://localhost:60010

This interface lists your currently running region servers, backup masters and HBase tables (HBase Region Servers and Backup Masters; Dead Region Servers).

Setting the Java Environment
We can also communicate with HBase using Java libraries, but before accessing HBase using the Java API you need to set the classpath for those libraries.

Setting the Classpath
Before proceeding with programming, set the classpath to the HBase libraries in the .bashrc file. Open .bashrc in any of the editors as shown below.

$ gedit ~/.bashrc

Set the classpath for the HBase libraries (the lib folder in HBase) in it as shown below.

export CLASSPATH=$CLASSPATH://home/hadoop/hbase/lib/*

This is to prevent the "class not found" exception while accessing HBase using the Java API.
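As a minimal sketch of what such Java access looks like (this example is not part of the original manual; it assumes the hbase-client 1.x/2.x API rather than the 0.98 shell examples below, and it assumes the 'education' table with column family 'guru99' has already been created with create 'education', 'guru99'), a put followed by a get could be written as:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseJavaExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("education"))) {
            // write one cell: row "r1", column family "guru99", qualifier "course"
            Put put = new Put(Bytes.toBytes("r1"));
            put.addColumn(Bytes.toBytes("guru99"), Bytes.toBytes("course"), Bytes.toBytes("big data"));
            table.put(put);
            // read it back
            Result result = table.get(new Get(Bytes.toBytes("r1")));
            byte[] value = result.getValue(Bytes.toBytes("guru99"), Bytes.toBytes("course"));
            System.out.println("r1/guru99:course = " + Bytes.toString(value));
        }
    }
}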

General commands
In HBase, the general commands are: status, version, table_help (help on table commands such as scan, drop, get, put, disable, etc.) and whoami.

Table management commands
These commands allow programmers to create tables and table schemas with rows and column families. The table management commands are: create, list, describe, disable, disable_all, enable, enable_all, drop, drop_all, show_filters, alter, alter_status.

Let us look into various command usages in HBase with examples.

Create
Syntax: create <'tablename'>, <'columnfamilyname'>

hbase(main):001:0> create 'education', 'guru99'
0 row(s) in 0.312 seconds

Describe
Syntax: describe <'tablename'>

hbase(main):010:0> describe 'education'

Disable
Syntax: disable <'tablename'>

hbase(main):011:0> disable 'education'

Disable All
Syntax: disable_all <"matching regex">

Enable
Syntax: enable <'tablename'>

hbase(main):012:0> enable 'education'

Show Filters
Syntax: show_filters

Drop
Syntax: drop <'tablename'>

hbase(main):017:0> drop 'education'

Drop all
Syntax: drop_all <"regex">

is_enabled
Syntax: is_enabled <'tablename'>

is_enabled 'education'

Alter
Syntax: alter <'tablename'>, NAME => <'columnfamilyname'>, VERSIONS => 5

hbase> alter 'education', {NAME => 'guru99', VERSIONS => 5}
hbase> alter 'education', NAME => 'f1', METHOD => 'delete'
hbase> alter 'education', 'delete' => 'guru99_1'
hbase> alter <'tablename'>, MAX_FILESIZE => '132545224'
hbase> alter 'education', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

Alter_status
Syntax: alter_status <'tablename'>

alter_status 'education'

hbase> alter 'edu', 'guru99_1', {NAME => 'guru99_2', IN_MEMORY => true}, {NAME => 'guru99_3', ...

Data manipulation commands
The commands that come under these are: count, put, get, delete, deleteall, truncate, scan.

Count
Syntax: count <'tablename'>, CACHE => 1000

hbase> count 'guru99', CACHE => 1000
hbase> count 'guru99', INTERVAL => 100000
hbase> count 'guru99', INTERVAL => 10, CACHE => 1000
hbase> g.count INTERVAL => 100000
hbase> g.count INTERVAL => 10, CACHE => 1000

Put
Syntax: put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>

hbase> put 'guru99', 'r1', 'c1', 'value', 10
hbase> g.put 'guru99', 'r1', 'c1', 'value', 10

create 'guru99', {NAME => 'Edu', VERSIONS => 213423443}
put 'guru99', 'r1', 'Edu:c1', 'value', 10
put 'guru99', 'r1', 'Edu:c1', 'value', 15
put 'guru99', 'r1', 'Edu:c1', 'value', 30

Get
Syntax: get <'tablename'>, <'rowname'>, {<Additional parameters>}

hbase> get 'guru99', 'r1', {COLUMN => 'c1'}
hbase> get 'guru99', 'r1'
hbase> get 'guru99', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 'guru99', 'r1', {COLUMN => ['c1', 'c2', 'c3']}

Delete
Syntax: delete <'tablename'>, <'rowname'>, <'columnname'>

hbase(main):020:0> delete 'guru99', 'r1', 'c1'

Delete All
Syntax: deleteall <'tablename'>, <'rowname'>

hbase> deleteall 'guru99', 'r1', 'c1'

Truncate
Syntax: truncate <'tablename'>

Scan
Syntax: scan <'tablename'>, {Optional parameters}

scan 'guru99'
scan 'guru99', {RAW => true, VERSIONS => 1000}

Experiment 3: Store and Retrieve Data in Pig

You can store loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the STORE operator.

Syntax:
STORE Relation_name INTO 'required_directory_path' [USING function];

Example
Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Output
After executing the STORE statement, you will get the following output. A directory is created with the specified name and the data will be stored in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-05 13:03:03  2015-10-05 13:05:05  UNKNOWN

Success!

Job Stats (time in seconds):
JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReduceTime  Alias    Feature   Outputs
job_14459_06  1     0        n/a         n/a         n/a         n/a            0              0              0              0                 student  MAP_ONLY  hdfs://localhost:9000/pig_Output/

Input(s):
Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"

Output(s):
Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Verification
You can verify the stored data as shown below.

Step 1
First of all, list out the files in the directory named pig_Output using the ls command as shown below.

$ hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r-- 1 Hadoop supergroup   0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r-- 1 Hadoop supergroup 224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the STORE statement.

Step 2
Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. To analyze data using Apache Pig, we have to initially load the data into Apache Pig. The following explains how to load data into Apache Pig from HDFS.

Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

Student ID  First Name  Last Name    Phone       City
001         Rajiv       Reddy        9848022337  Hyderabad
002         siddarth    Battacharya  9848022338  Kolkata
003         Rajesh      Khanna       9848022339  Delhi
004         Preethi     Agarwal      9848022330  Pune
005         Trupthi     Mohanthy     9848022336  Bhuwaneshwar
006         Archana     Mishra       9848022335  Chennai

It contains the personal details, like id, first name, last name, phone number and city, of six students.

Step 1: Verifying Hadoop
First of all, verify the installation using the Hadoop version command, as shown below.
$ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output.

Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba95dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS
Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file system) as shown below.

$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name Pig_Data in the required path as shown below.

$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data

Step 4: Placing the data in HDFS
The input file of Pig contains each tuple/record on an individual line, and the entities of the record are separated by a delimiter (in our example we used ","). In the local file system, create an input file student_data.txt containing data as shown below.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Now, move the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)

$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/

Verifying the file
You can use the cat command to verify whether the file has been moved into HDFS, as shown below.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output
You can see the content of the file as shown below.

15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

The Load Operator
You can load data into Apache Pig from the file system (HDFS/local) using the LOAD operator of Pig Latin.

Syntax
The load statement consists of two parts divided by the "=" operator. On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how we load the data. Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;

Where,
relation_name – the relation in which we want to store the data.
Input file path – the HDFS directory where the file is stored (in MapReduce mode).
function – a function chosen from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
schema – the schema of the data, defined as follows: (column1 : data type, column2 : data type, column3 : data type);

Note – We can also load the data without specifying the schema. In that case, the columns are addressed positionally as $0, $1, etc.

Example
As an example, let us load the data in student_data.txt into Pig under the schema named student using the LOAD command.

Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

$ pig -x mapreduce

It will start the Pig Grunt shell as shown below.

15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
grunt>

Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Following is the description of the above statement.

Relation name: We have stored the data in the schema student.
Input file path: We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function: We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes '\t' as the parameter.
Schema: We have stored the data using the following schema.

column   : id   firstname  lastname   phone      city
datatype : int  chararray  chararray  chararray  chararray

Experiment 4: Perform Social Media Analysis using Cassandra

2 BACKGROUND

2.1 Cassandra
Cassandra [22] is a non-relational, column-oriented, distributed database. It was originally developed by Facebook and is now an open-source Apache project. Cassandra is designed to store large datasets over a set of commodity machines by using a peer-to-peer cluster structure to promote horizontal scalability. In the column-oriented data model of Cassandra, a column is the smallest component of data. Columns associated with a certain key constitute a row. Each row can contain any number of columns. A collection of rows forms a column family, which is similar to a table in a relational database. Records in the column families are stored in sorted order by row keys, in separate files. The keyspace congregates one or more column families in the application, similar to a schema in a relational database.
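To make the keyspace/column-family terminology concrete for the social media analysis experiment, the sketch below creates a keyspace and a simple posts table and inserts one row. It is not part of the paper excerpt above: the keyspace, table and column names are illustrative, and it assumes a newer stack (Cassandra 2.2+ with the DataStax Java driver 3.x on the classpath) than the 1.1.6 cluster used in the excerpt's experiments.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SocialMediaCassandraSketch {
    public static void main(String[] args) {
        // connect to a local Cassandra node
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // a keyspace plays the role of a schema; the column family (table) holds the rows
            session.execute("CREATE KEYSPACE IF NOT EXISTS socialmedia "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS socialmedia.posts ("
                    + "user_id text, post_time timestamp, post_text text, "
                    + "PRIMARY KEY (user_id, post_time))");
            session.execute("INSERT INTO socialmedia.posts (user_id, post_time, post_text) "
                    + "VALUES ('user1', toTimestamp(now()), 'hello cassandra')");
            // read back this user's posts from the partition keyed by user_id
            ResultSet rs = session.execute(
                    "SELECT post_time, post_text FROM socialmedia.posts WHERE user_id = 'user1'");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("post_time") + " : " + row.getString("post_text"));
            }
        }
    }
}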
Interesting aspects of the Cassandra framework include independence from any additional file systems like HDFS, scalability, replication support for fault tolerance, balanced data partitioning, and MapReduce support with a Hadoop plug-in. In [12], we present a detailed discussion on the attributes of Cassandra that make it interesting for MapReduce processing.

2.2 MapReduce
Taking inspiration from functional programming, MapReduce starts with the idea of splitting an input dataset over a set of commodity machines, called workers, and processes these data splits in parallel with user-defined map and reduce functions. MapReduce abstracts away from the application programmer the details of input distribution, parallelization, scheduling and fault tolerance.

2.2.1 Hadoop & Hadoop Streaming
Apache Hadoop [1] is the leading open-source MapReduce implementation. Hadoop relies on two fundamental components: the Hadoop Distributed File System (HDFS) [26] and the Hadoop MapReduce Framework, for data management and job execution respectively. A Hadoop JobTracker, running on the master node, is responsible for resolving job details (i.e., the number of mappers/reducers) and monitoring the job progress and worker status. Once a dataset is put into HDFS, it is split into data chunks and distributed throughout the cluster. Each worker hosting a data split runs a process called a DataNode and a TaskTracker that is responsible for processing the data splits owned by the local DataNode. Hadoop is implemented in Java and requires the map and reduce operations to also be implemented in Java and use the Hadoop API. This creates a challenge for legacy applications where it may be impractical to rewrite the applications in Java or where the source code is no longer available. Hadoop Streaming is designed to address this need by allowing users to create MapReduce jobs where any executable (written in any language or script) can be specified as the map or reduce operation. Although Hadoop Streaming has a restricted model [11], it is commonly used to run numerous scientific applications from various disciplines. It allows domain scientists to use legacy applications for complex scientific processing, or to use simple scripting languages that eliminate the sharp learning curve needed to write scalable MapReduce programs for Hadoop in Java. Protein sequence comparison, tropical storm detection, atmospheric river detection and numerical linear algebra are a few examples of domain scientists using Hadoop Streaming on NERSC [5] systems [25].

2.2.2 MARISSA
In earlier work, we highlighted both the performance penalty and application challenges of Hadoop Streaming and introduced MARISSA to address these shortcomings [11], [29]. MARISSA leaves the input management to the underlying shared file system to solely focus on processing. In [21] we explain the details of MARISSA and provide a comparison to Hadoop Streaming under various application requirements. Unlike Hadoop Streaming, MARISSA does not require processes like TaskTrackers and DataNodes for the execution of MapReduce operations. Once the input data is split by the master node using the Splitter module and placed into the shared file system, each worker node has access to the input chunks awaiting execution.

Fig. 1. The MapReduce streaming pipeline: (a) Data Preparation, in which each worker node exports the dataset from the local Cassandra servers to the shared file system (for the Hadoop setups the exported data is then put into HDFS); (b) Data Transformation (MR1); (c) Data Processing (MR2).
Fig. 2. Three different streaming approaches to process Cassandra datasets with MapReduce using non-Java executables: (a) the architecture of using MARISSA for the MapReduce streaming pipeline of Fig. 1, where the data is first downloaded from the database servers to the shared file system, pre-processed for the target application, and at the final stage processed with the user's non-Java executables; (b) the layout of using Hadoop Streaming in such a setting, where the dataset is also placed into HDFS; (c) the structure of Hadoop-C*, which we use to process Cassandra data directly from the local database servers using Hadoop with non-Java executables.

In the Data Preparation stage, each worker node exports the Cassandra dataset as JSON formatted files. As our assumption is that the target executables are legacy applications which are either impossible or impractical to modify, the input data needs to be converted into a format that is expected by these target applications. For this reason, our software pipeline includes a MapReduce stage, Fig. 1b, where the JSON data can be transformed into other formats. This phase simply processes each input record and converts it to another format, writing the results into intermediate output files. This step does not involve any data or processing dependencies between nodes and is therefore a great fit for the MapReduce model. In fact, we only initiate the map stage of MapReduce, since no reduce operations are needed. If necessary for the conversion of the JSON files to the proper format, a reduce step may be added conveniently. We implemented this stage in Python scripts that can be run using either MARISSA or Hadoop Streaming without any modifications.

As this is the first of a series of iterative MapReduce operations whose output will be used as the input by the following MapReduce streaming operations, we simply call this stage MR1. Our system not only allows users to convert the dataset into the desired format but also makes it possible to specify the columns of interest. This is especially useful when only a vertical subset of the dataset is sufficient for the actual data processing. This stage helps to reduce the data size, in turn affecting the performance of the next MapReduce stage in a positive manner. This performance gain is a result of fewer I/O and data parsing operations. In the following sections of this paper we refer to this stage either as MR1 or as Data Transformation. Section 4.2 provides a comparison between the performance of Data Transformation using MARISSA and Hadoop Streaming.

3.1.3 Data Processing (MR2)
This is the final step of the MapReduce streaming pipeline shown in Fig. 1. In Fig. 1c we run the non-Java executables, which were the initial target applications, over the output data of MR1, as the data is now in a format that can be processed. We use MARISSA and Hadoop Streaming to run executables as map and reduce operations. Since this is the second MapReduce stage in our pipeline, we name it MR2. Any MapReduce streaming job run after MR1 is considered an MR2 step. In Section 4.3, we first compare the performance of MARISSA and Hadoop Streaming based only on this stage under various application scenarios. Later, in order to show the full operation span, we include the time taken for Data Preparation and Data Transformation under each MapReduce framework and repeat our comparisons.
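For reference, a Hadoop Streaming job of the kind described above is launched by handing the streaming jar an input path, an output path, and the executable to use as the mapper; the script names and paths below are placeholders rather than the paper's own, and the reduce count is set to zero to mirror the map-only MR1 stage:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input /data/json_export -output /data/mr1_out \
    -mapper transform_mapper.py -file transform_mapper.py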
3.2 MapReduce Streaming Pipeline with MARISSA
As explained in Section 3.1.1, the Splitter module of MARISSA has been modified such that each worker connects to the local database server to take a snapshot of the input dataset in JSON format and place it into the shared file system. After the Data Preparation stage shown in Fig. 1a, the input is split and ready for Data Transformation. Fig. 2a shows the architecture of MARISSA. It allows each non-Java executable to interact with the corresponding input splits directly, without needing to mediate this interaction. In the Data Transformation stage, each MARISSA mapper runs an executable to convert the JSON data files to the user-specified input format. These converted files are placed back into the shared file system. MARISSA then runs the user-given executables to create the next MapReduce stage, which we call Data Processing. This is accomplished using the previous output as the input. There is no re-distribution or re-creation of splits required, since MARISSA is designed to allow iteration of MapReduce operations where the output of one operation is fed as input to the next.

3.3 MapReduce Streaming Pipeline with Hadoop Streaming
In the Data Preparation stage, each Hadoop worker connects to the local Cassandra server and exports the input dataset as JSON formatted files. Next, these files are placed into HDFS using the put command. This distributes the input files among the DataNodes of HDFS, and later they are used as input for the Data Transformation stage. HDFS is a non-POSIX-compliant file system that requires the use of the HDFS API to interact with the files. Since Hadoop Streaming uses non-Java applications for map and/or reduce, the assumption is that these executables do not use this API and therefore do not have immediate access to the input splits. So, Hadoop TaskTrackers read the input from HDFS, feed it into the executables for processing, and collect the results to write back to HDFS. In the Data Transformation step shown in Fig. 1b, Hadoop Streaming uses our input conversion code to transform the input to the desired format, and later Data Processing is performed on the output of this stage. Note that at the Data Processing stage the input is already in HDFS, as it is the output of the previous MapReduce job.

3.4 Hadoop-C*
Hadoop-C* is the setup where a Hadoop cluster is co-located with a Cassandra cluster to provide an input source and an output placement alternative to the MapReduce operations. This setup, illustrated in Fig. 2c, allows users to leave the input dataset on its own local Cassandra servers. We use Hadoop TaskTrackers to read the input records directly from the local servers to ensure data locality. That is, there is no need for taking a snapshot of the dataset and placing it into the file system for MapReduce processing. Therefore, no Data Preparation or Data Transformation steps are required. Before starting any of the map operations, each Hadoop mapper starts the user-specified non-Java executable, and later in each map it reads an input record from the database and converts it to the expected format, streaming it to the running application using stdin. Later, the output is collected back from this application using stdout, and is then turned into a database record and written back to the Cassandra data store.
This design has the limitation that the user-specified applications should start beforehand and should expect an input record from stdin and write the output to stdout. Although we explain the limitations of such a model in [11], in this case we use it to provide in-place processing of Cassandra data for cases when it is not practical to constantly download data to the file system in order to process the most up-to-date version. Fig. 2c shows that DataNodes are running on each worker node; however, they are not used for input management. DataNodes are required since HDFS is used for dependency jars and other static and intermediary data. In the following sections we refer to this setup as Hadoop-C*. Furthermore, we will use the notation Hadoop-C*-FS for the cases when Hadoop TaskTrackers read the input records directly from the local Cassandra servers, but the output is collected in the shared file system.

4 PERFORMANCE RESULTS
The following experiments were performed on the Grid and Cloud Computing Research Lab cluster at Binghamton University: 8 nodes connected via Gigabit Ethernet, each of which has two 2.6 GHz Intel Xeon CPUs, 8 GB of RAM, 8 cores, dual local 73 GB 15K RPM RAID-striped SAS drives, and a 64-bit version of Linux 2.6.15; and a single headless file server connected via 10 Gb Fiber Ethernet and Gigabit Ethernet, which has four 2.0 GHz Intel Xeon CPUs, 128 GB of RAM, 8 cores and 6.0 TB of SAS RAID, provides NFSv4, and runs a 64-bit version of Linux 2.6.15. Apache Cassandra version 1.1.6 is installed on each Hadoop slave node. Hadoop version 1.2.1 is installed on each node. In Tables 1 and 2, we present the important configuration parameters that are likely to affect the performance. In addition to these parameters, for both Cassandra and Hadoop, we used the default parameters shipped with the specified distributions. In the following tests we use the three different setups that we explain in Section 3 to perform MapReduce operations over a dataset that resides in a Cassandra distributed database cluster.

4.1 Data Preparation
Fig. 3 shows the performance of taking a snapshot of the input dataset from Cassandra into the shared file system and into HDFS for processing with MARISSA and Hadoop Streaming respectively. The cost of moving data from the Cassandra servers expectedly increases with growing data sizes. Moving 256 million input records takes nearly 50 times longer than moving four million. Fig. 3 also shows the disparity in the cost of Data Preparation for Hadoop Streaming and MARISSA. At four million records, Data Preparation for Hadoop Streaming is 1.1 times faster than MARISSA, and it is over 1.2 times faster at 64 and 256 million records. This performance variation of Data Preparation for each system can be explained by the inefficiencies, laid out in Section 3.1.1, of the MARISSA Splitter module, which is responsible for creating the data splits for the individual cores of the worker nodes. We plan to address this inefficiency in MARISSA in future work.

Fig. 3. The overhead of moving data from Cassandra into the file system for MapReduce processing. The cost increases with growing data size.

4.2 Cluster Size Upgrade
In the next set of experiments we changed cluster sizes and performed the Data Preparation and Data Transformation (MR1) stages followed by the Data Processing (MR2) stage. We ran the target non-Java executables on four different cluster sizes for 64 million input records.
We show the performance of running write-intensive and read-intensive applications with Hadoop Streaming in the Data Processing (MR2) phase. Additionally, we compare Hadoop Streaming with Hadoop-C*, where we directly read the input records from the Cassandra database without the need for the Data Preparation and Data Transformation (MR1) stages.

Fig. 10. Running read-intensive workloads over Cassandra data using alternative streaming models for non-Java applications: (a) Read <= Write, NOT including Data Preparation and MR1; (b) Read > Write, NOT including Data Preparation and MR1; (c) Read >> Write, including Data Preparation and MR1.

The data is distributed through the nodes and, expectedly, the number of HDFS blocks to be processed increases, while the amount of data stays the same. Therefore, the number of map tasks spawned increases with the larger cluster sizes, and the cost of setting up more map tasks amortizes the benefit of distributing work across the mappers. Fig. 10b shows the performance comparison of Hadoop Streaming and Hadoop-C* with the performance data for Data Preparation and Data Transformation (MR1) included, under four cluster sizes. At the 2-node setup, Hadoop-C* performs 2.2 times faster than Hadoop Streaming. Hadoop Streaming suffers the cost of the Data Preparation and Data Transformation stages. Increasing the cluster size allows both setups to benefit from the distributed approach. Doubling the cluster size from 2 to 4 nodes speeds up Hadoop-C* by 48 percent and Hadoop Streaming by 45 percent. However, the speedup decreases for Hadoop Streaming if we expand the cluster size from 4 to 8 nodes: Hadoop Streaming speeds up only 38 percent, while Hadoop-C* achieves a speedup of 46 percent. For a fixed number of records, increasing the number of nodes causes each node to have less data to process.
