PySpark RDD Cheat Sheet

This document provides examples of common RDD (Resilient Distributed Dataset) operations in PySpark, including: 1) retrieving basic RDD information such as the number of partitions and element counts; 2) transforming and aggregating RDDs with operations such as reduce, reduceByKey, countByKey and collectAsMap; and 3) grouping RDDs with groupByKey and computing statistics such as sum, mean, max and min.

Python For Data Science
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
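For example, several dependency files can be listed after --py-files, separated only by commas (a minimal sketch; the spark://host:7077 master URL and the helpers.zip archive are placeholder names):

$ ./bin/pyspark --master spark://host:7077 --py-files code.py,helpers.zip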


> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])

External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
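As a quick sanity check after loading, the classic word count combines flatMap, map and reduceByKey from the sections below (a minimal sketch reusing the textFile RDD defined above):

>>> words = textFile.flatMap(lambda line: line.split()) #Split each line into words
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b) #Sum the 1s per word
>>> counts.take(5) #First 5 (word, count) pairs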

> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)

> Applying Functions

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3 (withReplacement=False, fraction=0.15, seed=81)
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']

> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
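Note that foreach applies g on the executors, so on a real cluster the printed lines end up in the worker logs rather than the driver console; to print on the driver, collect first (a minimal sketch):

>>> for x in rdd.collect(): #Bring the elements to the driver, then print locally
...     g(x)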
> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0,add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
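To make the aggregate call above concrete: the (0,0) zero value carries a running (sum, count) pair, seqOp folds one element of a partition into it, and combOp merges the per-partition pairs, which is why rdd3.aggregate((0,0),seqOp,combOp) returns (4950,100). A minimal sketch recovering the mean from it:

>>> total, count = rdd3.aggregate((0,0), seqOp, combOp) #(sum of elements, number of elements)
>>> total / float(count) #Same value as rdd3.mean()
49.5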
> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect() #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d',1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a',2),('b',1),('d',1)]

> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
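Both calls return a new RDD rather than repartitioning rdd in place, so the result has to be assigned and can be checked with getNumPartitions() (a minimal sketch; rdd_wide and rdd_narrow are placeholder names):

>>> rdd_wide = rdd.repartition(4)
>>> rdd_wide.getNumPartitions()
4
>>> rdd_narrow = rdd_wide.coalesce(1)
>>> rdd_narrow.getNumPartitions()
1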
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
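saveAsTextFile writes a directory of part files (one per partition), so the saved data can be reloaded with the textFile() call from the Loading Data section (a minimal sketch reusing the same "rdd.txt" path):

>>> reloaded = sc.textFile("rdd.txt") #Each line is the string form of one element
>>> reloaded.count()
3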

> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
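Any self-contained script can be submitted the same way. A minimal sketch (the count_pairs.py file name is a placeholder; it reuses the sample pairs from this sheet):

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="CountPairs")             #Create the context inside the script
    pairs = sc.parallelize([('a',7),('a',2),('b',2)])   #Same sample data as above
    print(pairs.reduceByKey(lambda x,y: x+y).collect()) #[('a',9),('b',2)]
    sc.stop()                                           #Release the resources when done

$ ./bin/spark-submit --master local[2] count_pairs.py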

Learn Data Skills Online at www.DataCamp.com
