Parallel Data Processing in The Cloud
INTRODUCTION
In recent years, parallel data processing has emerged as one of the
important applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs on top of such a framework.
EXISTING SYSTEM
Today a growing number of companies have to process huge amounts of data in a cost-efficient manner. Classic representatives of these companies are operators of Internet search engines, such as Google, Yahoo, or Microsoft.
The vast amount of data they have to deal with every day has made
traditional database solutions prohibitively expensive. Cloud computing has emerged as a promising alternative: instead of operating expensive dedicated hardware, companies can rent computing resources on demand and process their data with parallel processing frameworks.
Current data processing frameworks like Google's MapReduce or Microsoft's Dryad engine have been designed for cluster environments.
The processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. The problem with these frameworks is that resource allocation for large submitted jobs is inefficient: the jobs take more time to process and incur more cost. The disadvantages of the existing systems are:
Expensive
Complex
Adds extra database-organization overhead to the cloud
PROPOSED SYSTEM
The system as a whole is designed to overcome the major weaknesses of Map/Reduce. Here we use a new framework called Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Based on this new framework, we perform extended evaluations of a MapReduce-inspired processing system in the cloud.
The task manager processes the tasks and stores the results in the cloud. In this way we can reduce the load on the main cloud. The advantages of the proposed system are listed below; a small illustrative sketch follows the list.
Dynamic resource allocation
Parallelism is implemented
Designed to run data analysis jobs on large amounts of data
Many-Task Computing (MTC) has been developed
Less expensive
More effective
Faster
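To make the idea of dynamic resource allocation concrete, the following minimal sketch (illustrative Java with hypothetical names such as TaskVertex, instanceType, and connectTo; this is not Nephele's actual API) shows how a job description can declare, per task, which VM type and degree of parallelism it needs, so that instances can be acquired from the IaaS cloud only while the corresponding task runs:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a Nephele-style job description. The key idea:
    // every task carries the VM type and degree of parallelism it needs, so a
    // scheduler can allocate instances on demand and release them when the
    // task finishes, instead of holding a statically sized cluster.
    public class NepheleStyleJobSketch {

        static class TaskVertex {
            final String name;
            final String instanceType;   // e.g. an IaaS VM flavor
            final int parallelism;       // number of parallel subtasks
            final List<TaskVertex> successors = new ArrayList<>();

            TaskVertex(String name, String instanceType, int parallelism) {
                this.name = name;
                this.instanceType = instanceType;
                this.parallelism = parallelism;
            }

            TaskVertex connectTo(TaskVertex next) {
                successors.add(next);    // channel between processing stages
                return next;
            }
        }

        public static void main(String[] args) {
            // Input and output stages run on a few cheap instances; the
            // compute stage requests more powerful VMs, allocated only
            // for as long as it is actually running.
            TaskVertex input   = new TaskVertex("read from cloud storage", "small", 2);
            TaskVertex process = new TaskVertex("process records",         "large", 8);
            TaskVertex output  = new TaskVertex("write results",           "small", 2);
            input.connectTo(process).connectTo(output);
        }
    }

Because the instance type travels with each vertex, the expensive VMs can be deallocated as soon as the compute stage completes, which is the cost advantage over the static cluster setups assumed by existing frameworks.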
LITERATURE SURVEY
Title: MapReduce: Simplified Data Processing on Large Clusters Author: Jeffrey Dean and Sanjay Ghemawat
Description:
MapReduce is an approach that helps us perform cluster computing. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
"Map" step: The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node
"Reduce" step: The master node then collects the answers to all the sub-
problems and combines them in some way to form the output the answer to
the problem it was originally trying to solve.
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
All values with the same key are reduced together.
map() functions run in parallel, creating different intermediate values from different input data sets.
reduce() functions also run in parallel, each working on a different output key.
All values are processed independently, as the word-count example below illustrates.
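As a concrete instance of these signatures, here is a minimal word-count job written against Hadoop's Java MapReduce API (Hadoop is used here as a representative open-source implementation; the paper itself describes Google's internal C++ system):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // "Map" step: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Reduce" step: all values with the same key arrive together and are summed.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map() calls run in parallel over input splits; the framework then groups all intermediate values by key and hands each group to reduce(), matching the Map(k1, v1) → list(k2, v2) and Reduce(k2, list(v2)) → list(v3) signatures given above.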
Implementations:
At Google: index construction for Google Search, article clustering for Google News, statistical machine translation.
At Yahoo!: Web map powering Yahoo! Search, spam detection for Yahoo! Mail.
At Facebook: data mining, ad optimization.
Advantages
This approach has several advantages, namely low initial cost and ease of maintenance (through cheap replacement of faulty machines):
Fault-tolerant
Automatic parallelization and distribution
Provides status and monitoring tools
Clean abstraction for programmers
Simple and easy to use
Flexible
Disadvantages
Cluster computing itself can be defined as the use of a large number of low-end machines to form a cluster, instead of a smaller number of high-end machines, although their purpose in the MapReduce framework is not the same as in their original forms. Furthermore, the key contribution of the MapReduce framework is not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine.
Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Google has now outgrown MapReduce; the main reason is that MapReduce was hindering their ability to provide near-real-time updates to their index. The next phase of operations cannot start until the first one finishes. If you want to build a system based on a series of map-reduces, there is a certain probability that something will go wrong, and this probability grows with the number of operations: for example, if each stage completes without failure with probability 0.99, a chain of 50 stages succeeds with probability 0.99^50 ≈ 0.61.
Title: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks Author: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
Description:
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. The two main components of Dryad are: the job manager, which coordinates the job and constructs the communication graph, and a daemon process that runs on each of the cluster machines and executes the vertices of the graph.
Disadvantages
Globally, the cost of licensing both Windows servers (DryadLINQ was meant for Windows servers) and DryadLINQ, compared to Unix servers and Hadoop (free software developed by Apache), is significantly higher.
The lack of a real distributed file system: in order to support large inputs, Dryad needs to create a graph of virtual nodes, and the communication is done via local write/distant read, as sketched below.
Graphs are manually constructed.
Dryad is not a database engine; it does not include a query planner or optimizer.
No way of defining dynamic graphs.
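The "local write/distant read" channel mentioned above can be illustrated with a small sketch (hypothetical Java code, not Dryad's actual API; Dryad itself is written in C++ and also supports TCP pipes and shared-memory FIFOs as channel types):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Hypothetical illustration of a Dryad-style graph edge: the producer
    // vertex writes its output to a local file, and the consumer vertex
    // later reads that file, possibly from another machine.
    public class FileChannelSketch {

        // Producer vertex: computes records and writes them to local disk.
        static Path produce(List<String> records) throws IOException {
            Path channel = Files.createTempFile("edge-", ".tmp");
            Files.write(channel, records);
            return channel;   // the graph edge is just this file's location
        }

        // Consumer vertex: reads the producer's file when it is scheduled.
        static List<String> consume(Path channel) throws IOException {
            return Files.readAllLines(channel);
        }

        public static void main(String[] args) throws IOException {
            Path edge = produce(List.of("a", "b", "c"));
            System.out.println(consume(edge));   // prints [a, b, c]
        }
    }

Because the edge is materialized as a file, a failed consumer vertex can simply be re-run against the same file, which is how this channel type supports fault tolerance at the cost of extra I/O.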
Title: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing Author: Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke
Description:
The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function.
The paper describes the PACT programming model for the Nephele system. The PACT programming model extends the concepts from map/reduce, but is applicable to more complex operations. The paper provides methods to compile PACT programs into parallel data flows for the Nephele system, which is a flexible execution engine for parallel data flows.
Programming Model: The PACTs are second-order functions that define properties on the input and output data of their associated user functions (UFs). Here, the type of the second-order function is referred to as the Input Contract. The properties of the output data are described by an attached Output Contract.
Input Contract: It defines how the input data is organized into subsets that can be processed independently, and hence in a data-parallel fashion, by independent instances of the UF.
Output Contract: It denotes certain properties of the UF's output data. Output Contracts are attached to the second-order function by the programmer. They describe additional semantic information about the UFs, which is exploited for optimization in order to generate efficient parallel data flows.
Comparison between Map/Reduce and PACT programming: The PACT programming model adds additional functions that fit many problems which are not naturally expressible as a map or reduce function. In Map/Reduce systems like Hadoop, the programming model and the execution model are tightly coupled: each job is executed with a static plan that follows the steps map/combine/shuffle/sort/reduce. In contrast, the PACT system separates the programming model from the execution and uses a compiler to generate the execution plan from the program. For several of the new PACTs, multiple parallelization strategies are available. Map/Reduce loses all semantic information from the application, except the information that a function is either a map or a reduce. The PACT model preserves more semantic information through both a larger set of second-order functions and the attached Output Contracts; one of these additional functions is sketched below.
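To make the contrast concrete, the following sketch (hypothetical interfaces, not the actual PACT API; MatchFunction and the match helper are illustrative names) shows the semantics of a "Match"-style Input Contract, which pairs records from two inputs that share a key and hands each pair to an independent UF instance, something a plain map or reduce cannot express directly:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of a PACT-style "Match" Input Contract.
    public class MatchContractSketch {

        // The UF sees exactly one record from each input, already matched by key.
        interface MatchFunction<A, B, O> {
            O match(A left, B right);
        }

        // Sequential reference semantics of the contract; a real engine would
        // partition both inputs by key and run UF instances in parallel,
        // choosing a strategy (e.g. repartition vs. broadcast) at compile time.
        static <K, A, B, O> List<O> match(Map<K, List<A>> left,
                                          Map<K, List<B>> right,
                                          MatchFunction<A, B, O> uf) {
            List<O> out = new ArrayList<>();
            for (Map.Entry<K, List<A>> e : left.entrySet()) {
                List<B> candidates = right.get(e.getKey());
                if (candidates == null) continue;      // no key match, no UF call
                for (A a : e.getValue())
                    for (B b : candidates)
                        out.add(uf.match(a, b));       // each pair is independent
            }
            return out;
        }
    }

Because each matched pair can be processed independently, the contract exposes exactly the semantic information (a key-equality join) that a Map/Reduce system would lose, and this is what the PACT compiler exploits when selecting a parallelization strategy.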
DATAFLOW DIAGRAM
CONCLUSION
In this paper we have discussed the challenges and opportunities for efficient parallel data processing in cloud environments and presented Nephele, the first data processing framework to exploit the dynamic resource provisioning offered by today's IaaS clouds. Cloud computing has emerged as a promising approach for such frameworks. The processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. The main goal of our project is to decrease the overload on the main cloud and increase the performance of the cloud, so we have implemented Nephele's architecture.
The system as a whole is designed to overcome the major weaknesses of Map/Reduce. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution.