
CIT650 Introduction to Big Data

Introduction to SPARK

1
Distributed Computing vs. Parallel Computing

Parallel Computing: processors access shared memory.
Distributed Computing: processors usually have their own private memory.

2
Distributed Computing Benefits

• Distributed systems are inherently scalable as they work across multiple independent machines and scale horizontally.

• Distributed systems offer fault tolerance as independent nodes can fail without affecting system integrity.

• Distributed systems provide redundancy that enables business continuity.

3
Spark for Distributed Computing

• Spark supports a computing framework for large-scale data processing and analysis.
• Spark provides parallel distributed data processing capabilities.
• Spark provides scalability.
• Spark provides fault-tolerance on commodity hardware.
• Spark enables in-memory processing.
• Spark enables programming flexibility with easy-to-use Python,
Scala, and Java APIs.

4
Spark vs. MapReduce

[Figure: Spark vs. MapReduce]

5
Apache Spark [1]

A general-purpose, fast, large-scale data processing engine
Started in the AMPLab at UC Berkeley, now developed by Databricks
Written in Scala
Considered a third-generation distributed data processing system
Hadoop is considered second generation
Why would we need a new generation?
Limitations of MapReduce (Hadoop)
Materialized intermediate results
Abstraction limited to Map and Reduce
Applications can only be written in Java
Poor support for real-time data processing
Exploit advancements in hardware
Memory is much cheaper
Multi-core is now a commodity
[1] Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 2010.
6
Third Generation Distributed Systems

Handle both batch and real-time processing


Exploit RAM as much as disk
Multiple-core aware
Do not reinvent the wheel
Use HDFS for storage
Apache Mesos/YARN for execution
Plays well with Hadoop
Iterative processing

7
Functional Programming

• Mathematical function programming style

• Follows a declarative programming model

• Emphasizes what to compute instead of how to compute it

• Uses expressions instead of statements

8
Functional Programming

Scala: the code defines a lambda expression with two parameters x and y of type Int; the lambda expression is stored in the variable add.
Python: the code defines a lambda function with two parameters x and y; the lambda function is assigned to the variable add.
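
The code on this slide appears only as an image; a minimal Python sketch of what it describes (the Scala equivalent, val add = (x: Int, y: Int) => x + y, appears on a later slide):

# A lambda with two parameters, assigned to the variable add
add = lambda x, y: x + y
print(add(2, 3))  # 5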

9
Functional Programming – Passing Operations

Scala

// Define the function


def performOperation(x: Int, y: Int, operation: (Int, Int) => Int): Int = { operation(x, y) }

// Call the function with a lambda


val result = performOperation(5, 3, (a, b) => a * b)
println(result) // Output: 15

Python

# Define the function


def perform_operation(x, y, operation): return operation(x, y)

# Call the function with a lambda


result = perform_operation(5, 3, lambda a, b: a * b)
print(result) # Output: 15

10
Functional Programming – using Arrays

Scala

val add = (x: Int, y: Int) => x + y


val numbers = List((1,2), (3,4), (5,6))
val sums = numbers.map (pair => add(pair._1, pair._2))
println(sums) // Prints: List(3, 7, 11)

Python

add = lambda x, y: x + y
numbers = [(1,2), (3,4), (5,6)]
sums = list(map(lambda pair: add(pair[0], pair[1]), numbers))
print(sums) # Prints: [3, 7, 11]

11
Unified Platform for Big Data Processing

12
Why Unification?

Good for developers: One platform to learn


Good for users: Take apps everywhere
Good for distributions: More applications
Is based on a common abstraction

13
Spark Abstractions

Spark core abstraction is Resilient Distributed Dataset (RDD)


Resilient: fault tolerant and can be recomputed when recovering from a
failure
Distributed: processing takes place over several nodes in parallel, like
MapReduce
Dataset: initial data can come from files, memory, or created
programmatically
Immutable: once created cannot be changed
Lineage: each RDD knows about its parents
Spark applications are a series of operations that transform input RDDs into output RDDs or final values

14
Word count In Spark

Recall how we defined the code for word count in MapReduce?


How does it look in Spark?

import sys
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1])

counts = (lines.flatMap(lambda x: x.split(' '))    # Split lines into words
               .map(lambda x: (x, 1))              # Map each word to a tuple (word, 1)
               .reduceByKey(lambda a, b: a + b))   # Reduce by key, sum occurrences

output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
sc.stop()
15
RDD Operations

Two main types of RDD operations
Transformations: result in a new RDD
Can be chained, forked, and joined
Actions: return values, no more RDDs
One action at the end of each transformation chain (see the sketch below)
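
A minimal sketch, assuming a small in-memory collection, showing two transformations followed by a single action:

rdd = sc.parallelize(["spark", "hadoop", "spark", "hive"])
pairs = rdd.map(lambda w: (w, 1))               # transformation: new RDD
counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation: new RDD
print(counts.collect())                         # action: returns values to the driver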

16
RDD Transformations

Transformations create a new RDD from an existing one


RDDs are immutable
Apply a series of transformations to modify the data as needed
Common transformations
map(function): 1-to-1 mapping
flatMap(function): 1-to-many mapping
filter(function): 1-to-1 mapping with selectivity (keeps only the elements for which the function returns true)
Special transformations
reduceByKey
groupByKey

17
Map vs. FlatMap

Aspect: Functionality
  map: transforms each element of a collection independently.
  flatMap: transforms each element and then flattens the result.
Aspect: Input example
  map: "Hello world", "This is a test"
  flatMap: "Hello world", "This is a test"
Aspect: Operation
  map: splits each line into words.
  flatMap: splits each line into words and merges them into a single collection.
Aspect: Output
  map: [["Hello", "world"], ["This", "is", "a", "test"]]
  flatMap: ["Hello", "world", "This", "is", "a", "test"]
Aspect: Structure
  map: collection of collections (array of arrays).
  flatMap: single flat collection (array).
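
A quick sketch of the comparison above, assuming the slide's example input is held in an RDD:

lines = sc.parallelize(["Hello world", "This is a test"])
print(lines.map(lambda l: l.split(" ")).collect())
# [['Hello', 'world'], ['This', 'is', 'a', 'test']]
print(lines.flatMap(lambda l: l.split(" ")).collect())
# ['Hello', 'world', 'This', 'is', 'a', 'test']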

18
Transformation: flatMap

lines = lines.flatMap(lambda x: x.split(' '))

19
Transformation: filter

filtered = lines.filter(lambda x: x not in ['by', 'very', 'to', 'the'])

20
Transformation: map

word = filtered.map(lambda x: (x, 1))

21
Transformation: reduceByKey

counts = word.reduceByKey(lambda x, y: x + y)

22
RDD Actions

Actions trigger execution of transformation chains


No further RDD transformations
Common actions
collect(): returns an array of all the elements
take(n): returns an array of the first n elements
count(): returns the number of elements in the RDD
saveAsTextFile(): saves the data to the file system, either HDFS or local
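
A minimal sketch of these actions on a small hypothetical RDD (the output directory name is illustrative):

nums = sc.parallelize([5, 3, 1, 4, 2])
print(nums.take(3))                # [5, 3, 1]
print(nums.count())                # 5
nums.saveAsTextFile("counts_out")  # writes one part file per partition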

23
Action: collect

output = counts.collect()

24
Lazy Execution

Spark follows a lazy execution scheme


No RDDs are computed until an action is specified
Why?
Helps optimize the execution plan
Only the lineage is recorded as you move from one transformation to the next
To inspect the lineage of a chain of transformations, call toDebugString() on the RDD you are interested in (see the sketch below)
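
A minimal sketch of lazy execution, assuming a hypothetical input file:

lines = sc.textFile("file.txt")          # nothing is read yet
upper = lines.map(lambda l: l.upper())   # still nothing computed, only lineage
print(upper.count())                     # the action triggers the whole chain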

25
RDD Lineage

Application
mydata_filt = (sc.textFile('file.txt')
               .map(lambda line: line.upper())
               .filter(lambda line: line.startswith('I')))

Lineage
print(mydata_filt.toDebugString())

26
Pipelining

When possible, Spark will pass individual outputs of each transformation directly to the next.
In Hadoop, all intermediate results are completely calculated before the next step begins.
Pipelining helps reduce latency.

27
Creating RDDs

We learned that RDDs can be created as a result of transformations on parent RDDs
What about root RDDs?
They can be created from:
Data in memory (collections)
Files, for example from text files as we saw earlier

30
Creating RDDs from Collections

SparkContext.parallelize(collection)
mydata = ['Alice', 'Jack', 'Andrew', 'Frank']
myRDD = sc.parallelize(mydata)
myRDD.take(2)
# output: ['Alice', 'Jack']
Useful for:
Testing
Integration

31
Creating RDDs from Files

So far, we saw sc.textFile("file")
Accepts a single file, a wildcard list of files, or a comma-separated list of file names
Examples:
sc.textFile("myfile.txt")
sc.textFile("mydata/*.log")
sc.textFile("myfile1.txt,myfile2.txt")
textFile only works with line-delimited text files
Each line in the file is a separate record in the RDD
Files are referenced by relative or absolute URI
Absolute URI: file:/home/training/myfile.txt or hdfs://localhost/loudacre/myfile.txt
Relative URI (uses the default file system): myfile.txt
What about other file formats?

32
Creating RDDs from Other File Formats

Spark uses Hadoop’s InputFormat and OutputFormat Java classes


TextInputFormat/TextOutputFormat
SequenceFileInputFormat/SequenceFileOutputFormat
FixedLengthInputFormat
Support for other formats
AvroInputFormat/AvroOutputFormat

33
Using Input/output Formats

Define the input format using sc.hadoopFile
Or sc.newAPIHadoopFile for new-API classes
Define the output format using rdd.saveAsHadoopFile
Or rdd.saveAsNewAPIHadoopFile for new-API classes

Example:
input_rdd = sc.newAPIHadoopFile(
    "path/to/textfile.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")

34
Whole-file-based RDDs
sc.textFile puts each line as a separate element
What if you are processing XML or JSON files?
sc.wholeTextFiles(directory) creates a single element in the RDD for the whole content of each file in the input directory
Creates a special type of RDD (pair RDDs), which we discuss later
Works for files with small sizes (elements must fit in memory)
35
RDD Content

An RDD can hold elements of any type:
Primitive data types
Sequence types
Scala/Java objects (if serializable)
Mixed types
Special RDDs
Pair RDDs: consist of key-value pairs; recall the map step of the word count example, or sc.wholeTextFiles
Double RDDs: RDDs consisting of numeric data (see the sketch below)
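
Double RDDs expose numeric helpers; a minimal sketch with illustrative data:

nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.sum())    # 10.0
print(nums.mean())   # 2.5
print(nums.stdev())  # population standard deviation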

36
Other General RDD Transformations

Single-RDD transformations
distinct: removes duplicate values
sortBy: sorts by the given function
Multi-RDD transformations
intersection: outputs the common elements of the input RDDs
union: adds all elements from the input RDDs to the output RDD
zip: pairs up corresponding elements of the two input RDDs

[1, 2, 3] zip ['a', 'b', 'c'] -> [(1, 'a'), (2, 'b'), (3, 'c')]
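
A quick sketch of these transformations on illustrative data (element order in the results may vary):

a = sc.parallelize([1, 2, 2, 3])
b = sc.parallelize([3, 4])
print(a.distinct().collect())            # [1, 2, 3]
print(a.sortBy(lambda x: -x).collect())  # [3, 2, 2, 1]
print(a.intersection(b).collect())       # [3]
print(a.union(b).collect())              # [1, 2, 2, 3, 3, 4]
nums = sc.parallelize([1, 2, 3])
letters = sc.parallelize(['a', 'b', 'c'])
print(nums.zip(letters).collect())       # [(1, 'a'), (2, 'b'), (3, 'c')]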

37
Pair RDDs

A special form of RDDs


Elements must be a tuple of two elements (key, value)
Keys and values can be of any type
Why use pair RDDs?
To get the benefits of MapReduce
The ability to scale by forwarding tuples with the same key to the same processing node (shuffling)
Additional built-in transformations, e.g., sorting, joining, grouping, counting, etc.

38
Creating Pair RDDs

You can have your root RDD as a pair RDD, e.g., sc.wholeTextFiles
You can use a transformation to put the data in pair RDDs
map
flatMap/flatMapValues
keyBy

39
Example

Create a pair RDD from a tab-delimited file

# Either split each line and map the fields into a (key, value) tuple ...
users = (sc.textFile(file)
           .map(lambda line: line.split('\t'))
           .map(lambda elems: (elems[0], elems[1])))

# ... or use keyBy, which keeps the whole line as the value
users = sc.textFile(file).keyBy(lambda line: line.split('\t')[0])

40
Transformation: reduceByKey

counts = word.reduceByKey(lambda x, y: x + y)

41
Other Pair RDD Transformations

countByKey
Returns, for each key, the count of its occurrences; note that in Spark this is an action that returns a map of key to count rather than an RDD
groupByKey
Similar to the input of a Hadoop Reducer: (key, [list of values])
sortByKey(ascending=True/False)
Returns a pair RDD sorted by the key
join
Takes two input pair RDDs that share the key: (key, value1), (key, value2)
Returns (key, (value1, value2))

42
Examples

43
Example: join by key

orders = orderItems.join(orderTotals)
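
A small sketch with illustrative contents for orderItems and orderTotals (the data values are assumptions; result order may vary):

orderItems = sc.parallelize([(1, "laptop"), (1, "mouse"), (2, "phone")])
orderTotals = sc.parallelize([(1, 1200.0), (2, 800.0)])
orders = orderItems.join(orderTotals)
print(orders.collect())
# [(1, ('laptop', 1200.0)), (1, ('mouse', 1200.0)), (2, ('phone', 800.0))]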

44
Running A Spark Job on YARN

45
RDD Partitions

Data is partitioned over the worker nodes
E.g., to follow the blocks of an HDFS file
Partitioning is done automatically by Spark
Optionally, you can control the number of partitions
You can specify the minimum number of partitions; the default is 2
sc.textFile("My File", 3)
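
A minimal sketch for checking how an RDD is partitioned (the file name is hypothetical):

rdd = sc.textFile("myfile.txt", 3)  # request a minimum of 3 partitions
print(rdd.getNumPartitions())       # actual number of partitions, typically at least 3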

46
Parallel Operations on Partitions

Spark tries to maximize the locality of data processing
It groups all transformations that can be processed on the same data partition
Some transformations are partition-preserving
E.g., map, flatMap, filter
Some transformations repartition the data
E.g., reduceByKey, groupByKey, sortByKey

47
Stages

All operations that can work on the same data partition are grouped
into a stage.
Tasks within a stage are pipelined together
Spark divides the DAG of the job into stages
How does Spark calculate stages? Based on RDD dependencies
Narrow dependencies
Only one child depends on the RDD
No shuffle required
Wide (shuffle) dependencies
Multiple children depend on the RDD
Defines a new stage

*DAG = Directed Acyclic Graph


48
Example: Average Word Length By First Letter

We have the following chain of operations


avglength = (sc.textFile(file)
             .flatMap(lambda line: line.split())
             .map(lambda word: (word[0], len(word)))
             .groupByKey()
             .map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1]))))

49
Tasks Pipelining

We have the following chain of operations


avglength = (sc.textFile(file)
             .flatMap(lambda line: line.split())
             .map(lambda word: (word[0], len(word)))
             .groupByKey()
             .map(lambda kv: (kv[0], sum(kv[1]) / len(kv[1]))))

50
RDD Persistence

Spark maintains the lineage of RDDs by storing a reference to the parent RDD in the child one
Each time an action is called on an RDD, Spark recursively traverses the lineage and performs the transformations
This can be costly, especially in case of disk access
Persistence makes Spark keep the content of RDDs, by default in memory
Useful for iterative processing (e.g., machine learning) and interactive processing (see the sketch below)
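
A minimal sketch of persisting an RDD that is reused across several actions (the file name is hypothetical):

from pyspark import StorageLevel

words = sc.textFile("file.txt").flatMap(lambda l: l.split())
words.persist(StorageLevel.MEMORY_ONLY)  # or simply words.cache()
print(words.count())                     # first action computes and caches the RDD
print(words.distinct().count())          # reuses the cached data instead of re-reading the file
words.unpersist()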
51
