
Application Architecture

Lecture 6
Srinivas Narayana
http://www.cs.rutgers.edu/~sn624/553-S25

Review: Offline and Online components
[Diagram: a user's request reaches the online, real-time component (services, database updates, ML inference); updates flow to offline processing: batch processing, stream processing of events, ML training]
Partition-Aggregate
Processing interactive search queries
Review: Google search architecture
[Figure from "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro '03]
Review of the Web Search workload
• Depending on the user's query, decompress a part of an index, then search for document IDs there
• Depending on the user's query, collect snippets from within Web documents
• Data-dependent accesses
• High branch misprediction
• Blocks randomly accessed (OK within a block)
• Fewer opportunities for instruction-level parallelism; faster/"better" servers aren't much better here
How to use parallelism?
Assume: one thread == one core
• Few fast cores with a high-speed interconnect, or more slow cores?
• Cost per query processed is dominated by capital server costs
• Power draw and cooling: faster == denser
• Discussion applies to both hyperthreaded and multicore
[Diagram: a server rack built from a few fast cores vs. one built from many slow cores]
Two kinds of parallelism
• Data parallelism: independent compute over shards of data
• Fast interconnects not as critical
• Stateless: little coordination within a request
• Request parallelism: independent compute across requests
• More machines for more requests
• The shard itself can be replicated for throughput
• Need lower latency? Compensate for slow cores with smaller shards (add more shards)
• Each shard becomes more available
• Turn throughput into a latency advantage
[Diagram: data partitioned into shards 1-6, with each shard replicated (copies a, b)]
Google search
[Diagram: a user's query goes through DNS to a Google Front End (GFE), then a load balancer (LB) picks one Google Web Server (GWS). The GWS queries index servers (word → doc), which return relevant document IDs, then doc servers ((doc, word) → snippet), which return snippet results. All-up SLO: 300 ms.]
Many apps can use partition-aggregate
• Need low latency, but single-threaded low latency is hard
• Data parallelism
• Little coordination across shards
• Inexpensive merges across partial results from shards
• Query parallelism
• More replicas/machines for more requests
• Use commodity (not fancy) hardware
• Turn high throughput into a latency advantage
• Focus on price per unit performance
• Significant problems: cooling for many compute servers
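
To make the pattern concrete, here is a minimal single-process sketch of partition-aggregate in Python; query_shard() and the shard list are hypothetical stand-ins, not from the lecture (a real system would issue RPCs to index servers):

# Minimal partition-aggregate sketch. query_shard() and SHARDS are
# hypothetical stand-ins for RPCs to real index servers.
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-%d" % i for i in range(8)]  # assumed 8 shards

def query_shard(shard, query):
    # Stand-in for an RPC: return this shard's top partial results
    # as (doc_id, score) pairs.
    return [("%s/doc%d" % (shard, i), hash((shard, query, i)) % 100)
            for i in range(3)]

def partition_aggregate(query, k=5):
    # Fan the query out to all shards in parallel (data parallelism).
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: query_shard(s, query), SHARDS)
    # Inexpensive merge of partial results: keep the global top-k.
    merged = [r for p in partials for r in p]
    return sorted(merged, key=lambda kv: kv[1], reverse=True)[:k]

print(partition_aggregate("example query"))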
Tail performance becomes important
• With partition-aggregate, each machine may serve many requests within a single user-level request
• A single user sends multiple requests over a session
• Many shards are queried for a single user-level query
• Delays in any one of them can delay the entire response
• Leaving the shard out degrades the result
• Example: 1000 shards × 10 user requests per session
• 1 delay in 10,000 machine-level responses can be visible to a user
• 99.99th percentile delay matters
• Much effort goes into cutting the tail: hedging, duplication, …
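
A quick back-of-the-envelope check of that claim (assuming shard delays are independent):

# If each machine-level response is slow with probability p, a session
# that fans out to n responses sees at least one slow response with
# probability 1 - (1 - p)**n (assuming independence).
p = 1e-4           # 99.99th percentile: 1 in 10,000 responses is slow
n = 1000 * 10      # 1000 shards x 10 user requests per session
print(1 - (1 - p) ** n)  # ~0.63: most sessions hit at least one slow reply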
MapReduce
Batch processing with simple abstractions
Example: Batch data processing
• Server access log: want to get top-5 URLs visited
192.0.2.1 - - [07/Dec/2021:11:45:26 -0700] "GET /index.html HTTP/1.1" 200 4310

• Analytics (not a real-time user query). How would you go about it?
• One way: a shell script
# Field $7 of each log line is the requested URL
cat /var/log/nginx/access.log \
  | awk '{print $7}' \
  | sort \
  | uniq -c \
  | sort -r -n \
  | head -n 5
Example: Batch data processing
Another way: a Python script
counts = {}
for line in open("/var/log/access.log"):
    url = line.split()[6]                    # field 7: the requested URL
    counts[url] = counts.get(url, 0) + 1     # avoid KeyError on new URLs
sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(sorted_counts[0:5])

Which method would you use, and why?
What do we want from an implementation?
• Process large log files, even when they don't fit into memory
• Ability to experiment with different processing steps
• Without corrupting the original data
• Unix principles help!
• Programs/tools that do one thing well (e.g., sort)
• Separate logic from wiring
• Any tool can produce for, or consume from, any other tool (pipe |)
• Inputs come from standard input or a file. Immutable inputs
• A choice to inspect data or write to disk anywhere (e.g., tee)
• Inspect output at any point (e.g., less)
Map-Reduce
• One way to think about it: a distributed implementation of Unix processing pipelines for large batch processing
• Large data sets: data comes from a distributed filesystem (GFS, HDFS)
• Large computations: want to use multiple servers since data-intensive
• Examples:
• Distributed grep, term frequencies, distributed sort
• Output?
• A data structure, e.g., a search index
• A set of pre-computed values for faster reads, e.g., a key-value cache
• Input to load into a traditional relational database (SQL) or a view
Distributed system considerations
• Data resides on multiple machines
• How to bring data together? How to compute with parallel machines?
• Network bandwidth between servers is a significant consideration
• How to handle failures?
• Machine failures: what happens to partial computations? Should we replicate compute?
• What happens to intermediate results? Should you persist them? Replicate them?

Algorithm developers == Distributed system experts? MapReduce's answer: they shouldn't have to be.
• Abstraction borrowed from functional programming
• Many different implementations exist
• Key advantage of MapReduce: it handles the distributed-system issues!
[Diagram: map functions take records in one key space to a different (intermediate) key space for the reducers]
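
For the top-URLs example earlier, the two user-supplied functions might look like the sketch below (hypothetical signatures; shuffling the intermediate pairs between them is the framework's job):

# User-supplied map and reduce functions for counting URL hits.
# The framework shuffles intermediate (key, value) pairs so that all
# values for one key reach the same reducer.

def map_fn(line):
    # One log line in; (url, 1) out, in the intermediate key space.
    yield (line.split()[6], 1)

def reduce_fn(url, counts):
    # One intermediate key and all of its values in; final pair out.
    yield (url, sum(counts))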
Processing steps in MapReduce
• Input data is consumed from a distributed filesystem
• The master ships code to the worker node closest to the data, if possible (CPU, memory constraints permitting)
• Each mapper runs the map function, then partitions its output by the reducer key
• Typically through a hash function, e.g., hash(key) mod R == r
• Output data is sorted by key within each partition
• Reducers are informed of the partial result at each mapper
• A reducer pulls its files from the mappers through RPC
• Output is persisted to the distributed filesystem (typically involves replication)
• Result: R output files in the DFS (one per reducer partition)
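
A single-machine simulation of these steps, reusing the hypothetical map_fn/reduce_fn sketched above:

# Simulates map -> partition by hash(key) mod R -> sort -> reduce,
# all in one process (a real framework spreads this over machines).
from collections import defaultdict
from itertools import groupby

R = 4  # number of reducer partitions

def run_mapreduce(lines, map_fn, reduce_fn):
    # Map phase: run map_fn, partition its output by the reducer key.
    partitions = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            partitions[hash(key) % R].append((key, value))
    # Reduce phase: sort each partition by key, group equal keys, reduce.
    results = []
    for r in range(R):
        for key, group in groupby(sorted(partitions[r]), key=lambda kv: kv[0]):
            results.extend(reduce_fn(key, [v for _, v in group]))
    return results  # in a real system: R output files in the DFS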
Implementation Key Principles
• Data locality
• Reduce network bandwidth: ship code to data
• Locally persist (not in the DFS) intermediate results
• Handle failures by re-doing compute
• No fancy hardware fault tolerance (e.g., RAID)
• Mapper failure: restart the map job
• Assume deterministic operations
• Reducer failure (after completion): no problem (output is in the DFS)
• Identify and skip shards with deterministic faults
• Mitigate stragglers through eager replication of compute close to job completion
• Combiners at the mapper: a preliminary reduce for associative and commutative functions
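
Since counting is associative and commutative, the reduce function can double as a combiner at each mapper; a sketch:

# Combiner sketch: locally pre-aggregate (url, 1) pairs at the mapper
# before the shuffle. Valid because addition is associative and
# commutative, so partial sums can safely be summed again at the reducer.
from collections import Counter

def combine(mapper_output):
    local = Counter()
    for url, count in mapper_output:
        local[url] += count
    # Only one pair per distinct URL per mapper crosses the network.
    return list(local.items())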
More examples of using map-reduce
• Database joins (sketched after this list)
• Example: joining user activity (e.g., URLs visited) with user information (e.g., age)
• Grouping (GROUP BY) aggregations:
• Count, sum, etc.
• Creating the sequence of events in a user session, e.g., determining whether a new version of a web page resulted in better sales
• Large distributed sorting
• Output sorting after the mapper: important!
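
For the join, both inputs can be mapped to the same key (the user ID) so that one reducer sees every record for that user. A sketch with hypothetical record formats:

# Reduce-side join sketch. Record formats are hypothetical:
# activity records are (user_id, url), profile records are (user_id, age).

def map_activity(record):
    user_id, url = record
    yield (user_id, ("activity", url))

def map_profile(record):
    user_id, age = record
    yield (user_id, ("profile", age))

def reduce_join(user_id, tagged_values):
    # All tagged records for one user arrive at the same reducer.
    tagged_values = list(tagged_values)
    age = next(v for tag, v in tagged_values if tag == "profile")
    for tag, v in tagged_values:
        if tag == "activity":
            yield (user_id, v, age)  # activity joined with user info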
Building on Map-Reduce: (1) Workflows
• One Map-Reduce job isn't usually enough
• Google web search index: a pipeline of 10 jobs; recommendation systems: 50–100
• Workflows: chains of map-reduce jobs
• E.g., one MR job for counting requests by URL; another to sort by count
• Explicit output files from each?
• Like writing to a file at the end of each tool in a Unix pipeline
• Is materialization of the intermediate results needed?
• Stragglers make workflows slower
• Separate systems are needed just to orchestrate the workflows correctly
Building on Map-Reduce: (2) Dataflow
• Dataflow engines: handle the entire workflow
• "Operators": chain map-reduce functions
• Only persist intermediate outputs to the DFS when necessary
• Chain reducers (no explicit mappers) when the key is the same
• Don't wait for stragglers of the previous job
• Stream processing
• Incremental execution of batch jobs when new data arrives
• Selectively materialize or recompute intermediate results
• Lineage (RDDs in Spark) or checkpointing
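
As an illustration, the whole top-URLs pipeline from earlier becomes one chained job in a dataflow engine; a sketch in PySpark (assuming a running Spark installation; the input path is illustrative):

# Dataflow sketch with Spark RDDs: operators are chained, and intermediate
# results live in lineage (recomputable) rather than explicit DFS files.
from pyspark import SparkContext

sc = SparkContext(appName="top-urls")
top5 = (sc.textFile("hdfs:///logs/access.log")      # illustrative path
          .map(lambda line: (line.split()[6], 1))   # url -> (url, 1)
          .reduceByKey(lambda a, b: a + b)          # count per url
          .takeOrdered(5, key=lambda kv: -kv[1]))   # top-5 by count
print(top5)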
