Parallel Data Processing in The Cloud
INTRODUCTION
In recent years, parallel data processing has emerged as one of the
important applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs on top of such a framework.
EXISTING SYSTEM
Today a growing number of companies have to process huge amounts of data in a cost-efficient manner. Classic representatives of these companies are operators of Internet search engines, such as Google, Yahoo, or Microsoft.
The vast amount of data they have to deal with every day has made
traditional database solutions prohibitively expensive. Cloud computing has emerged as a promising alternative: instead of operating expensive dedicated hardware, companies can rent computing resources on demand and process their data with parallel processing frameworks.
Current data processing frameworks like Google's MapReduce or Microsoft's Dryad engine have been designed for cluster environments.
The processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. The problem with these frameworks is that resource allocation for large submitted jobs is inefficient: the jobs take more time to process and incur more cost. The disadvantages of the existing systems are:
Expensive
Complex
Adds extra database-organization overhead to the cloud
PROPOSED SYSTEM
The system as a whole is designed to overcome the major weaknesses of Map/Reduce. Here we use a new framework called Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Based on this new framework, we perform extended evaluations of a MapReduce-inspired processing system in the cloud.
The task manager processes the tasks and stores the results in the cloud. In this way we can reduce the load on the main cloud. The advantages of the proposed system are listed below; a small illustrative sketch follows the list.
Dynamic resource allocation
Parallelism is implemented
Designed to run data analysis jobs on large amounts of data
Many-Task Computing (MTC) has been developed
Less expensive
More effective
Faster
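To make the idea of dynamic resource allocation concrete, the following minimal sketch (illustrative Java with hypothetical names such as TaskVertex, instanceType, and connectTo; this is not Nephele's actual API) shows how a job description can declare, per task, which VM type and degree of parallelism it needs, so that instances can be acquired from the IaaS cloud only while the corresponding task runs:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a Nephele-style job description. The key idea:
    // every task carries the VM type and degree of parallelism it needs, so a
    // scheduler can allocate instances on demand and release them when the
    // task finishes, instead of holding a statically sized cluster.
    public class NepheleStyleJobSketch {

        static class TaskVertex {
            final String name;
            final String instanceType;   // e.g. an IaaS VM flavor
            final int parallelism;       // number of parallel subtasks
            final List<TaskVertex> successors = new ArrayList<>();

            TaskVertex(String name, String instanceType, int parallelism) {
                this.name = name;
                this.instanceType = instanceType;
                this.parallelism = parallelism;
            }

            TaskVertex connectTo(TaskVertex next) {
                successors.add(next);    // channel between processing stages
                return next;
            }
        }

        public static void main(String[] args) {
            // Input and output stages run on a few cheap instances; the
            // compute stage requests more powerful VMs, allocated only
            // for as long as it is actually running.
            TaskVertex input   = new TaskVertex("read from cloud storage", "small", 2);
            TaskVertex process = new TaskVertex("process records",         "large", 8);
            TaskVertex output  = new TaskVertex("write results",           "small", 2);
            input.connectTo(process).connectTo(output);
        }
    }

Because the instance type travels with each vertex, the expensive VMs can be deallocated as soon as the compute stage completes, which is the cost advantage over the static cluster setups assumed by existing frameworks.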
LITERATURE SURVEY
Title: MapReduce: Simplified Data Processing on Large Clusters Author: Jeffrey Dean and Sanjay Ghemawat
Description:
MapReduce is an approach that helps us perform cluster computing. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
"Map" step: The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node
"Reduce" step: The master node then collects the answers to all the sub-
problems and combines them in some way to form the output the answer to
the problem it was originally trying to solve.
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
All values with the same key are reduced together.
map() functions run in parallel, creating different intermediate values from different input data sets.
reduce() functions also run in parallel, each working on a different output key.
All values are processed independently, as the word-count example below illustrates.
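As a concrete instance of these signatures, here is a minimal word-count job written against Hadoop's Java MapReduce API (Hadoop is used here as a representative open-source implementation; the paper itself describes Google's internal C++ system):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // "Map" step: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Reduce" step: all values with the same key arrive together and are summed.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map() calls run in parallel over input splits; the framework then groups all intermediate values by key and hands each group to reduce(), matching the Map(k1, v1) → list(k2, v2) and Reduce(k2, list(v2)) → list(v3) signatures given above.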
Implementations:
At Google: index construction for Google Search, article clustering for Google News, statistical machine translation.
At Yahoo!: Web map powering Yahoo! Search, spam detection for Yahoo! Mail.
At Facebook: data mining, ad optimization.
Advantages
This approach has several advantages, namely low initial cost and ease of maintenance (through cheap replacement of faulty machines):
Fault-tolerant
Automatic parallelization and distribution
Provides status and monitoring tools
Clean abstraction for programmers
Simple and easy to use
Flexible
Disadvantages
Cluster computing itself can be defined as the use of a large number of low-end machines to form a cluster, instead of a smaller number of high-end machines, although their purpose in the MapReduce framework is not the same as in their original forms. Furthermore, the key contribution of the MapReduce framework is not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine.
Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Google has now outgrown MapReduce; the main reason is that MapReduce was hindering their ability to provide near-real-time updates to their index. The next phase of operations cannot start until the first one finishes. If you want to build a system based on a series of map-reduces, there is a certain probability that something will go wrong, and this probability grows with the number of operations: for example, if each stage completes without failure with probability 0.99, a chain of 50 stages succeeds with probability 0.99^50 ≈ 0.61.
Title: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks Author: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
Description:
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. The two main components of Dryad are: the job manager, which coordinates the job and constructs the communication graph, and a daemon process that runs on each of the cluster machines and executes the vertices of the graph.
Disadvantages
Globally, the cost of licensing both Windows servers (DryadLINQ was meant for Windows servers) and DryadLINQ, compared to Unix servers and Hadoop (free software developed by Apache), is significantly higher.
The lack of a real distributed file system: in order to support large inputs, Dryad needs to create a graph of virtual nodes, and the communication is done via local write/distant read, as sketched below.
Graphs are manually constructed.
Dryad is not a database engine; it does not include a query planner or optimizer.
No way of defining dynamic graphs.
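The "local write/distant read" channel mentioned above can be illustrated with a small sketch (hypothetical Java code, not Dryad's actual API; Dryad itself is written in C++ and also supports TCP pipes and shared-memory FIFOs as channel types):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Hypothetical illustration of a Dryad-style graph edge: the producer
    // vertex writes its output to a local file, and the consumer vertex
    // later reads that file, possibly from another machine.
    public class FileChannelSketch {

        // Producer vertex: computes records and writes them to local disk.
        static Path produce(List<String> records) throws IOException {
            Path channel = Files.createTempFile("edge-", ".tmp");
            Files.write(channel, records);
            return channel;   // the graph edge is just this file's location
        }

        // Consumer vertex: reads the producer's file when it is scheduled.
        static List<String> consume(Path channel) throws IOException {
            return Files.readAllLines(channel);
        }

        public static void main(String[] args) throws IOException {
            Path edge = produce(List.of("a", "b", "c"));
            System.out.println(consume(edge));   // prints [a, b, c]
        }
    }

Because the edge is materialized as a file, a failed consumer vertex can simply be re-run against the same file, which is how this channel type supports fault tolerance at the cost of extra I/O.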
Title: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing Author: Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke
Description:
The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function.
The paper describes the PACT programming model for the Nephele system. The PACT programming model extends the concepts from map/reduce, but is applicable to more complex operations. The paper provides methods to compile PACT programs into parallel data flows for the Nephele system, which is a flexible execution engine for parallel data flows.
Programming Model: The PACTs are second-order functions that define properties on the input and output data of their associated user functions (UFs). Here, the type of the second-order function is referred to as the Input Contract. The properties of the output data are described by an attached Output Contract.
Input Contract: It defines how the input data is organized into subsets that can be processed independently, and hence in a data-parallel fashion, by independent instances of the UF.
Output Contract: It denotes certain properties of the UF's output data. Output Contracts are attached to the second-order function by the programmer. They describe additional semantic information about the UFs, which is exploited for optimization in order to generate efficient parallel data flows.
Comparison between Map/Reduce and PACT programming: The PACT programming model adds additional functions that fit many problems which are not naturally expressible as a map or reduce function. In Map/Reduce systems like Hadoop, the programming model and the execution model are tightly coupled: each job is executed with a static plan that follows the steps map/combine/shuffle/sort/reduce. In contrast, the PACT system separates the programming model from the execution and uses a compiler to generate the execution plan from the program. For several of the new PACTs, multiple parallelization strategies are available. Map/Reduce loses all semantic information from the application, except the information that a function is either a map or a reduce. The PACT model preserves more semantic information through both a larger set of second-order functions and the attached Output Contracts; one of these additional functions is sketched below.
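To make the contrast concrete, the following sketch (hypothetical interfaces, not the actual PACT API; MatchFunction and the match helper are illustrative names) shows the semantics of a "Match"-style Input Contract, which pairs records from two inputs that share a key and hands each pair to an independent UF instance, something a plain map or reduce cannot express directly:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of a PACT-style "Match" Input Contract.
    public class MatchContractSketch {

        // The UF sees exactly one record from each input, already matched by key.
        interface MatchFunction<A, B, O> {
            O match(A left, B right);
        }

        // Sequential reference semantics of the contract; a real engine would
        // partition both inputs by key and run UF instances in parallel,
        // choosing a strategy (e.g. repartition vs. broadcast) at compile time.
        static <K, A, B, O> List<O> match(Map<K, List<A>> left,
                                          Map<K, List<B>> right,
                                          MatchFunction<A, B, O> uf) {
            List<O> out = new ArrayList<>();
            for (Map.Entry<K, List<A>> e : left.entrySet()) {
                List<B> candidates = right.get(e.getKey());
                if (candidates == null) continue;      // no key match, no UF call
                for (A a : e.getValue())
                    for (B b : candidates)
                        out.add(uf.match(a, b));       // each pair is independent
            }
            return out;
        }
    }

Because each matched pair can be processed independently, the contract exposes exactly the semantic information (a key-equality join) that a Map/Reduce system would lose, and this is what the PACT compiler exploits when selecting a parallelization strategy.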
DATAFLOW DIAGRAM
CONCLUSION
In this paper we have discussed the challenges and opportunities for efficient parallel data processing in cloud environments and presented Nephele, the first data processing framework to exploit the dynamic resource provisioning offered by today's IaaS clouds. Cloud computing has emerged as a promising approach for such frameworks. The processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. The main goal of our project is to decrease the overload on the main cloud and increase the performance of the cloud, so we have implemented Nephele's architecture.
The system as a whole is designed to overcome the major weaknesses of Map/Reduce. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution.