PySpark RDD Cheat Sheet

This document provides examples of common RDD (Resilient Distributed Dataset) operations in PySpark, including: 1) retrieving basic RDD information such as the number of partitions and element counts; 2) transforming and aggregating RDDs with operations such as reduce, reduceByKey, countByKey and collectAsMap; and 3) grouping RDDs with groupByKey and computing statistics such as sum, mean, max and min.

Python For Data Science
Learn PySpark RDD online at www.DataCamp.com

> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python.

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
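For example, several dependency files can be listed after --py-files, separated only by commas (a minimal sketch; the spark://host:7077 master URL and the helpers.zip archive are placeholder names):

$ ./bin/pyspark --master spark://host:7077 --py-files code.py,helpers.zip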


> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])

External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
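As a quick sanity check after loading, the classic word count combines flatMap, map and reduceByKey from the sections below (a minimal sketch reusing the textFile RDD defined above):

>>> words = textFile.flatMap(lambda line: line.split()) #Split each line into words
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b) #Sum the 1s per word
>>> counts.take(5) #First 5 (word, count) pairs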

> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)

> Applying Functions

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3 (withReplacement=False, fraction=0.15, seed=81)
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']

> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
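Note that foreach applies g on the executors, so on a real cluster the printed lines end up in the worker logs rather than the driver console; to print on the driver, collect first (a minimal sketch):

>>> for x in rdd.collect(): #Bring the elements to the driver, then print locally
...     g(x)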
> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0,add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
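To make the aggregate call above concrete: the (0,0) zero value carries a running (sum, count) pair, seqOp folds one element of a partition into it, and combOp merges the per-partition pairs, which is why rdd3.aggregate((0,0),seqOp,combOp) returns (4950,100). A minimal sketch recovering the mean from it:

>>> total, count = rdd3.aggregate((0,0), seqOp, combOp) #(sum of elements, number of elements)
>>> total / float(count) #Same value as rdd3.mean()
49.5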
> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect() #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d',1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a',2),('b',1),('d',1)]

> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
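Both calls return a new RDD rather than repartitioning rdd in place, so the result has to be assigned and can be checked with getNumPartitions() (a minimal sketch; rdd_wide and rdd_narrow are placeholder names):

>>> rdd_wide = rdd.repartition(4)
>>> rdd_wide.getNumPartitions()
4
>>> rdd_narrow = rdd_wide.coalesce(1)
>>> rdd_narrow.getNumPartitions()
1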
> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
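saveAsTextFile writes a directory of part files (one per partition), so the saved data can be reloaded with the textFile() call from the Loading Data section (a minimal sketch reusing the same "rdd.txt" path):

>>> reloaded = sc.textFile("rdd.txt") #Each line is the string form of one element
>>> reloaded.count()
3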

> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
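Any self-contained script can be submitted the same way. A minimal sketch (the count_pairs.py file name is a placeholder; it reuses the sample pairs from this sheet):

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="CountPairs")             #Create the context inside the script
    pairs = sc.parallelize([('a',7),('a',2),('b',2)])   #Same sample data as above
    print(pairs.reduceByKey(lambda x,y: x+y).collect()) #[('a',9),('b',2)]
    sc.stop()                                           #Release the resources when done

$ ./bin/spark-submit --master local[2] count_pairs.py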

Learn Data Skills Online at www.DataCamp.com
