ML Cheatsheets

The document collects quick-reference cheat sheets for the core Python data science stack (NumPy, pandas, Seaborn, TensorFlow, scikit-learn, SciPy, and PySpark) plus the R ggplot2 visualization sheets. The opening NumPy sheet summarizes key functions for inspecting, subsetting, slicing, indexing, and performing arithmetic operations on multidimensional arrays: how to import NumPy, create arrays of different dimensions, read properties such as shape, size, and data type, select subsets of elements, and perform element-wise operations such as addition, subtraction, multiplication, and division.


Python For Data Science Cheat Sheet: NumPy Basics
Learn Python for Data Science Interactively at www.DataCamp.com

NumPy
The NumPy library is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

Use the following import convention:
>>> import numpy as np

NumPy Arrays
A 1D array has a single axis (axis 0); a 2D array has rows (axis 0) and columns (axis 1); a 3D array adds a third axis (axis 2).

Creating Arrays
>>> a = np.array([1,2,3])
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype=float)
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], dtype=float)

Initial Placeholders
>>> np.zeros((3,4))                   Create an array of zeros
>>> np.ones((2,3,4), dtype=np.int16)  Create an array of ones
>>> d = np.arange(10,25,5)            Create an array of evenly spaced values (step value)
>>> np.linspace(0,2,9)                Create an array of evenly spaced values (number of samples)
>>> e = np.full((2,2), 7)             Create a constant array
>>> f = np.eye(2)                     Create a 2x2 identity matrix
>>> np.random.random((2,2))           Create an array with random values
>>> np.empty((3,2))                   Create an uninitialized array

I/O

Saving & Loading On Disk
>>> np.save('my_array', a)
>>> np.savez('array.npz', a, b)
>>> np.load('my_array.npy')

Saving & Loading Text Files
>>> np.loadtxt("myfile.txt")
>>> np.genfromtxt("my_file.csv", delimiter=',')
>>> np.savetxt("myarray.txt", a, delimiter=" ")

Data Types
>>> np.int64       Signed 64-bit integer type
>>> np.float32     Standard single-precision floating point
>>> np.complex128  Complex number represented by two 64-bit floats
>>> np.bool_       Boolean type storing True and False values
>>> np.object_     Python object type
>>> np.string_     Fixed-length string type
>>> np.unicode_    Fixed-length unicode type

Inspecting Your Array
>>> a.shape        Array dimensions
>>> len(a)         Length of array
>>> b.ndim         Number of array dimensions
>>> e.size         Number of array elements
>>> b.dtype        Data type of array elements
>>> b.dtype.name   Name of data type
>>> b.astype(int)  Convert an array to a different type

Asking For Help
>>> np.info(np.ndarray.dtype)

Array Mathematics

Arithmetic Operations
>>> g = a - b           Subtraction
array([[-0.5,  0. ,  0. ],
       [-3. , -3. , -3. ]])
>>> np.subtract(a,b)    Subtraction
>>> b + a               Addition
array([[ 2.5,  4. ,  6. ],
       [ 5. ,  7. ,  9. ]])
>>> np.add(b,a)         Addition
>>> a / b               Division
array([[ 0.66666667,  1. ,  1. ],
       [ 0.25      ,  0.4,  0.5]])
>>> np.divide(a,b)      Division
>>> a * b               Multiplication
array([[  1.5,   4. ,   9. ],
       [  4. ,  10. ,  18. ]])
>>> np.multiply(a,b)    Multiplication
>>> np.exp(b)           Exponentiation
>>> np.sqrt(b)          Square root
>>> np.sin(a)           Element-wise sine
>>> np.cos(b)           Element-wise cosine
>>> np.log(a)           Element-wise natural logarithm
>>> e.dot(f)            Dot product
array([[ 7.,  7.],
       [ 7.,  7.]])

Comparison
>>> a == b              Element-wise comparison
array([[False,  True,  True],
       [False, False, False]])
>>> a < 2               Element-wise comparison with a scalar
array([ True, False, False])
>>> np.array_equal(a, b)   Array-wise comparison

Aggregate Functions
>>> a.sum()             Array-wise sum
>>> a.min()             Array-wise minimum value
>>> b.max(axis=0)       Maximum value of each array column
>>> b.cumsum(axis=1)    Cumulative sum of the elements
>>> a.mean()            Mean
>>> np.median(b)        Median
>>> np.corrcoef(a)      Correlation coefficient
>>> np.std(b)           Standard deviation

Copying Arrays
>>> h = a.view()        Create a view of the array with the same data
>>> np.copy(a)          Create a copy of the array
>>> h = a.copy()        Create a deep copy of the array

Sorting Arrays
>>> a.sort()            Sort an array
>>> c.sort(axis=0)      Sort the elements along an array's axis

Subsetting, Slicing, Indexing (also see Lists)

Subsetting
>>> a[2]                Select the element at index 2
3
>>> b[1,2]              Select the element at row 1, column 2 (equivalent to b[1][2])
6.0

Slicing
>>> a[0:2]              Select items at index 0 and 1
array([1, 2])
>>> b[0:2,1]            Select items at rows 0 and 1 in column 1
array([ 2.,  5.])
>>> b[:1]               Select all items at row 0 (equivalent to b[0:1, :])
array([[ 1.5,  2. ,  3. ]])
>>> c[1,...]            Same as c[1,:,:]
array([[[ 3.,  2.,  1.],
        [ 4.,  5.,  6.]]])
>>> a[::-1]             Reversed array a
array([3, 2, 1])

Boolean Indexing
>>> a[a<2]              Select elements of a less than 2
array([1])

Fancy Indexing
>>> b[[1, 0, 1, 0],[0, 1, 2, 0]]     Select elements (1,0), (0,1), (1,2) and (0,0)
array([ 4. ,  2. ,  6. ,  1.5])
>>> b[[1, 0, 1, 0]][:,[0,1,2,0]]     Select a subset of the matrix's rows and columns
array([[ 4. ,  5. ,  6. ,  4. ],
       [ 1.5,  2. ,  3. ,  1.5],
       [ 4. ,  5. ,  6. ,  4. ],
       [ 1.5,  2. ,  3. ,  1.5]])

Array Manipulation

Transposing Array
>>> i = np.transpose(b)     Permute array dimensions
>>> i.T                     Permute array dimensions

Changing Array Shape
>>> b.ravel()               Flatten the array
>>> g.reshape(3,-2)         Reshape, but don't change data

Adding/Removing Elements
>>> h.resize((2,6))         Resize h to shape (2,6) in place
>>> np.append(h,g)          Append items to an array
>>> np.insert(a, 1, 5)      Insert items in an array
>>> np.delete(a, [1])       Delete items from an array

Combining Arrays
>>> np.concatenate((a,d), axis=0)   Concatenate arrays
array([ 1,  2,  3, 10, 15, 20])
>>> np.vstack((a,b))        Stack arrays vertically (row-wise)
array([[ 1. ,  2. ,  3. ],
       [ 1.5,  2. ,  3. ],
       [ 4. ,  5. ,  6. ]])
>>> np.r_[e,f]              Stack arrays vertically (row-wise)
>>> np.hstack((e,f))        Stack arrays horizontally (column-wise)
array([[ 7.,  7.,  1.,  0.],
       [ 7.,  7.,  0.,  1.]])
>>> np.column_stack((a,d))  Create stacked column-wise arrays
array([[ 1, 10],
       [ 2, 15],
       [ 3, 20]])
>>> np.c_[a,d]              Create stacked column-wise arrays

Splitting Arrays
>>> np.hsplit(a,3)          Split the array horizontally into 3 sub-arrays
[array([1]), array([2]), array([3])]
>>> np.vsplit(c,2)          Split the array vertically at the 2nd index

DataCamp: Learn Python for Data Science Interactively
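
The entries above are terse, so here is a short, self-contained sketch that ties a few of them together: broadcasting in the arithmetic operations, boolean and fancy indexing, and reshaping. The array m is introduced here purely for illustration.

import numpy as np

a = np.array([1, 2, 3])                 # shape (3,)
b = np.array([(1.5, 2, 3), (4, 5, 6)])  # shape (2, 3)

print(b + a)            # a is broadcast across both rows of b
print(a < 2)            # element-wise comparison with a scalar -> [True, False, False]

m = np.arange(12).reshape(3, 4)   # 3x4 matrix holding the values 0..11
print(m[1, 2])                    # element at row 1, column 2 -> 6
print(m[m % 2 == 0])              # boolean indexing: all even elements
print(m[[2, 0], [3, 1]])          # fancy indexing: elements (2,3) and (0,1) -> [11, 1]
print(m.ravel())                  # flattened 1D view of the data
print(m.reshape(4, -1))           # reshape to 4x3 without changing the data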
Data Wrangling with pandas Cheat Sheet
http://pandas.pydata.org

Tidy Data – A foundation for wrangling in pandas
In a tidy data set, each variable is saved in its own column and each observation is saved in its own row. Tidy data complements pandas's vectorized operations: pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.

Syntax – Creating DataFrames
df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = [1, 2, 3])
Specify values for each column.

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
Specify values for each row.

df = pd.DataFrame(
    {"a" : [4, 5, 6],
     "b" : [7, 8, 9],
     "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)],
        names=['n', 'v']))
Create a DataFrame with a MultiIndex.

Method Chaining
Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.
df = (pd.melt(df)
        .rename(columns={
            'variable' : 'var',
            'value' : 'val'})
        .query('val >= 200')
     )

Reshaping Data – Change the layout of a data set
pd.melt(df)                             Gather columns into rows.
df.pivot(columns='var', values='val')   Spread rows into columns.
pd.concat([df1,df2])                    Append rows of DataFrames.
pd.concat([df1,df2], axis=1)            Append columns of DataFrames.
df.sort_values('mpg')                   Order rows by values of a column (low to high).
df.sort_values('mpg', ascending=False)  Order rows by values of a column (high to low).
df.rename(columns = {'y':'year'})       Rename the columns of a DataFrame.
df.sort_index()                         Sort the index of a DataFrame.
df.reset_index()                        Reset the index of a DataFrame to row numbers, moving the index to columns.
df.drop(['Length','Height'], axis=1)    Drop columns from a DataFrame.

Subset Observations (Rows)
df[df.Length > 7]         Extract rows that meet logical criteria.
df.drop_duplicates()      Remove duplicate rows (only considers columns).
df.sample(frac=0.5)       Randomly select a fraction of rows.
df.sample(n=10)           Randomly select n rows.
df.iloc[10:20]            Select rows by position.
df.head(n)                Select first n rows.
df.tail(n)                Select last n rows.
df.nlargest(n, 'value')   Select and order top n entries.
df.nsmallest(n, 'value')  Select and order bottom n entries.

Logic in Python (and pandas)
<                                Less than
>                                Greater than
==                               Equals
<=                               Less than or equals
>=                               Greater than or equals
!=                               Not equal to
df.column.isin(values)           Group membership
pd.isnull(obj)                   Is NaN
pd.notnull(obj)                  Is not NaN
&, |, ~, ^, df.any(), df.all()   Logical and, or, not, xor, any, all

Subset Variables (Columns)
df[['width','length','species']]   Select multiple columns with specific names.
df['width'] or df.width            Select a single column with a specific name.
df.filter(regex='regex')           Select columns whose name matches the regular expression regex.
df.loc[:,'x2':'x4']                Select all columns between x2 and x4 (inclusive).
df.iloc[:,[1,2,5]]                 Select columns in positions 1, 2 and 5 (first column is 0).
df.loc[df['a'] > 10, ['a','c']]    Select rows meeting a logical condition, and only the specified columns.

regex (Regular Expressions) Examples
'\.'               Matches strings containing a period '.'
'Length$'          Matches strings ending with the word 'Length'
'^Sepal'           Matches strings beginning with the word 'Sepal'
'^x[1-5]$'         Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
'^(?!Species$).*'  Matches strings except the string 'Species'
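
Before moving on to summarizing and joining, here is a small self-contained sketch of the subsetting and method-chaining patterns above; the species, length and width columns are made up for illustration.

import pandas as pd

df = pd.DataFrame(
    {"species": ["setosa", "setosa", "virginica"],
     "length":  [5.1, 4.9, 6.3],
     "width":   [3.5, 3.0, 2.9]},
    index=[1, 2, 3])

long_rows = df[df.length > 5]            # rows meeting a logical criterion
some_cols = df[["length", "width"]]      # subset of columns by name
tidy = (pd.melt(df, id_vars="species")   # gather columns into rows, then chain
          .rename(columns={"variable": "var", "value": "val"})
          .query("val > 3"))
print(tidy)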
Summarize Data
df['w'].value_counts()   Count the number of rows with each unique value of a variable.
len(df)                  Number of rows in the DataFrame.
df['w'].nunique()        Number of distinct values in a column.
df.describe()            Basic descriptive statistics for each column (or GroupBy).

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:
sum()                   Sum values of each object.
count()                 Count non-NA/null values of each object.
median()                Median value of each object.
quantile([0.25,0.75])   Quantiles of each object.
apply(function)         Apply function to each object.
min()                   Minimum value in each object.
max()                   Maximum value in each object.
mean()                  Mean value of each object.
var()                   Variance of each object.
std()                   Standard deviation of each object.

Handling Missing Data
df.dropna()        Drop rows with any column having NA/null data.
df.fillna(value)   Replace all NA/null data with value.

Make New Columns
df.assign(Area=lambda df: df.Length*df.Height)   Compute and append one or more new columns.
df['Volume'] = df.Length*df.Height*df.Depth      Add a single column.
pd.qcut(df.col, n, labels=False)                 Bin a column into n buckets.

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:
max(axis=1)                 Element-wise max.
min(axis=1)                 Element-wise min.
clip(lower=-10, upper=10)   Trim values at input thresholds.
abs()                       Absolute value.

Group Data
df.groupby(by="col")      Return a GroupBy object, grouped by values in column named "col".
df.groupby(level="ind")   Return a GroupBy object, grouped by values in index level named "ind".
All of the summary functions listed above can be applied to a group. Additional GroupBy functions:
size()           Size of each group.
agg(function)    Aggregate a group using function.

The vector functions below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.
shift(1)                Copy with values shifted by 1.
shift(-1)               Copy with values lagged by 1.
rank(method='dense')    Ranks with no gaps.
rank(method='min')      Ranks. Ties get min rank.
rank(pct=True)          Ranks rescaled to the interval [0, 1].
rank(method='first')    Ranks. Ties go to first value.
cumsum()                Cumulative sum.
cummax()                Cumulative max.
cummin()                Cumulative min.
cumprod()               Cumulative product.

Windows
df.expanding()   Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)    Return a Rolling object allowing summary functions to be applied to windows of length n.

Plotting
df.plot.hist()                  Histogram for each column.
df.plot.scatter(x='w', y='h')   Scatter chart using pairs of points.

Combine Data Sets
Example DataFrames: adf with columns x1, x2 (rows A 1, B 2, C 3) and bdf with columns x1, x3 (rows A T, B F, D T).

Standard Joins
pd.merge(adf, bdf, how='left', on='x1')    Join matching rows from bdf to adf.
pd.merge(adf, bdf, how='right', on='x1')   Join matching rows from adf to bdf.
pd.merge(adf, bdf, how='inner', on='x1')   Join data. Retain only rows in both sets.
pd.merge(adf, bdf, how='outer', on='x1')   Join data. Retain all values, all rows.

Filtering Joins
adf[adf.x1.isin(bdf.x1)]    All rows in adf that have a match in bdf.
adf[~adf.x1.isin(bdf.x1)]   All rows in adf that do not have a match in bdf.

Set-like Operations
Example DataFrames: ydf with rows A 1, B 2, C 3 and zdf with rows B 2, C 3, D 4 (columns x1, x2).
pd.merge(ydf, zdf)                Rows that appear in both ydf and zdf (intersection).
pd.merge(ydf, zdf, how='outer')   Rows that appear in either or both ydf and zdf (union).
(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(['_merge'], axis=1))     Rows that appear in ydf but not zdf (set difference).

http://pandas.pydata.org/  This cheat sheet was inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.
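
To make the join types and grouping concrete, a short sketch using the same small adf and bdf frames described above:

import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

left  = pd.merge(adf, bdf, how="left", on="x1")    # keep every row of adf
inner = pd.merge(adf, bdf, how="inner", on="x1")   # keep only keys present in both
semi  = adf[adf.x1.isin(bdf.x1)]                   # filtering join: rows of adf with a match

grouped = adf.groupby("x1")["x2"].sum()            # one summary value per group
print(left, inner, semi, grouped, sep="\n\n")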
Data Visualization with ggplot2 Cheat Sheet

Basics
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms (visual marks that represent data points), and a coordinate system.

To display data values, map variables in the data set to aesthetic properties of the geom, like size, color, and x and y locations.

Build a graph with qplot() or ggplot():
qplot(x = cty, y = hwy, color = cyl, data = mpg, geom = "point")
    Creates a complete plot with given data, geom, and mappings. Supplies many useful defaults.
ggplot(data = mpg, aes(x = cty, y = hwy))
    Begins a plot that you finish by adding layers to. No defaults, but provides more control than qplot().

Add layers with "+". Each layer combines data, aesthetic mappings, a geom with its default stat, layer-specific mappings, and additional elements:
ggplot(mpg, aes(hwy, cty)) +
  geom_point(aes(color = cyl)) +
  geom_smooth(method = "lm") +
  coord_cartesian() +
  scale_color_gradient() +
  theme_bw()

Add a new layer to a plot with a geom_*() or stat_*() function. Each provides a geom, a set of aesthetic mappings, and a default stat and position adjustment.

last_plot()
    Returns the last plot.
ggsave("plot.png", width = 5, height = 5)
    Saves the last plot as a 5" x 5" file named "plot.png" in the working directory. Matches the file type to the file extension.

Geoms
Use a geom to represent data points, and use the geom's aesthetic properties to represent variables. Each function returns a layer.

Graphical Primitives
c <- ggplot(map, aes(long, lat))
c + geom_polygon(aes(group = group))    x, y, alpha, color, fill, linetype, size
d <- ggplot(economics, aes(date, unemploy))
d + geom_path(lineend = "butt", linejoin = "round", linemitre = 1)    x, y, alpha, color, linetype, size
d + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900))    x, ymax, ymin, alpha, color, fill, linetype, size
e <- ggplot(seals, aes(x = long, y = lat))
e + geom_segment(aes(xend = long + delta_long, yend = lat + delta_lat))    x, xend, y, yend, alpha, color, linetype, size
e + geom_rect(aes(xmin = long, ymin = lat, xmax = long + delta_long, ymax = lat + delta_lat))    xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size

One Variable - Continuous
a <- ggplot(mpg, aes(hwy))
a + geom_area(stat = "bin")    x, y, alpha, color, fill, linetype, size
b + geom_area(aes(y = ..density..), stat = "bin")
a + geom_density(kernel = "gaussian")    x, y, alpha, color, fill, linetype, size, weight
b + geom_density(aes(y = ..count..))
a + geom_dotplot()    x, y, alpha, color, fill
a + geom_freqpoly()    x, y, alpha, color, linetype, size
b + geom_freqpoly(aes(y = ..density..))
a + geom_histogram(binwidth = 5)    x, y, alpha, color, fill, linetype, size, weight
b + geom_histogram(aes(y = ..density..))

One Variable - Discrete
b <- ggplot(mpg, aes(fl))
b + geom_bar()    x, alpha, color, fill, linetype, size, weight

Two Variables - Continuous X, Continuous Y
f <- ggplot(mpg, aes(cty, hwy))
f + geom_blank()
f + geom_jitter()    x, y, alpha, color, fill, shape, size
f + geom_point()    x, y, alpha, color, fill, shape, size
f + geom_quantile()    x, y, alpha, color, linetype, size, weight
f + geom_rug(sides = "bl")    alpha, color, linetype, size
f + geom_smooth(method = lm)    x, y, alpha, color, fill, linetype, size, weight
f + geom_text(aes(label = cty))    x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

Two Variables - Discrete X, Continuous Y
g <- ggplot(mpg, aes(class, hwy))
g + geom_bar(stat = "identity")    x, y, alpha, color, fill, linetype, size, weight
g + geom_boxplot()    lower, middle, upper, x, ymax, ymin, alpha, color, fill, linetype, shape, size, weight
g + geom_dotplot(binaxis = "y", stackdir = "center")    x, y, alpha, color, fill
g + geom_violin(scale = "area")    x, y, alpha, color, fill, linetype, size, weight

Two Variables - Discrete X, Discrete Y
h <- ggplot(diamonds, aes(cut, color))
h + geom_jitter()    x, y, alpha, color, fill, shape, size

Two Variables - Continuous Bivariate Distribution
i <- ggplot(movies, aes(year, rating))
i + geom_bin2d(binwidth = c(5, 0.5))    xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size, weight
i + geom_density2d()    x, y, alpha, colour, linetype, size
i + geom_hex()    x, y, alpha, colour, fill, size

Two Variables - Continuous Function
j <- ggplot(economics, aes(date, unemploy))
j + geom_area()    x, y, alpha, color, fill, linetype, size
j + geom_line()    x, y, alpha, color, linetype, size
j + geom_step(direction = "hv")    x, y, alpha, color, linetype, size

Two Variables - Visualizing Error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
k <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))
k + geom_crossbar(fatten = 2)    x, y, ymax, ymin, alpha, color, fill, linetype, size
k + geom_errorbar()    x, ymax, ymin, alpha, color, linetype, size, width (also geom_errorbarh())
k + geom_linerange()    x, ymin, ymax, alpha, color, linetype, size
k + geom_pointrange()    x, y, ymin, ymax, alpha, color, fill, linetype, shape, size

Two Variables - Maps
data <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- map_data("state")
l <- ggplot(data, aes(fill = murder))
l + geom_map(aes(map_id = state), map = map) + expand_limits(x = map$long, y = map$lat)    map_id, alpha, color, fill, linetype, size

Three Variables
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))
m + geom_contour(aes(z = z))    x, y, z, alpha, colour, linetype, size, weight
m + geom_raster(aes(fill = z), hjust = 0.5, vjust = 0.5, interpolate = FALSE)    x, y, alpha, fill
m + geom_tile(aes(fill = z))    x, y, alpha, color, fill, linetype, size
Stats - An Alternative Way to Build a Layer
Some plots visualize a transformation of the original data set. Use a stat to choose a common transformation to visualize, e.g. a + geom_bar(stat = "bin").

Each stat creates additional variables to map aesthetics to. These variables use a common ..name.. syntax. stat functions and geom functions both combine a stat with a geom to make a layer, i.e. stat_bin(geom = "bar") does the same as geom_bar(stat = "bin").

Layer-specific mappings can use the created variables, with the geom and the stat parameters given explicitly, e.g.:
i + stat_density2d(aes(fill = ..level..), geom = "polygon", n = 100)

1D distributions
a + stat_bin(binwidth = 1, origin = 10)    x, y | ..count.., ..ncount.., ..density.., ..ndensity..
a + stat_bindot(binwidth = 1, binaxis = "x")    x, y | ..count.., ..ncount..
a + stat_density(adjust = 1, kernel = "gaussian")    x, y | ..count.., ..density.., ..scaled..

2D distributions
f + stat_bin2d(bins = 30, drop = TRUE)    x, y, fill | ..count.., ..density..
f + stat_binhex(bins = 30)    x, y, fill | ..count.., ..density..
f + stat_density2d(contour = TRUE, n = 100)    x, y, color, size | ..level..

3 Variables
m + stat_contour(aes(z = z))    x, y, z, order | ..level..
m + stat_spoke(aes(radius = z, angle = z))    angle, radius, x, xend, y, yend | ..x.., ..xend.., ..y.., ..yend..
m + stat_summary_hex(aes(z = z), bins = 30, fun = mean)    x, y, z, fill | ..value..
m + stat_summary2d(aes(z = z), bins = 30, fun = mean)    x, y, z, fill | ..value..

Comparisons
g + stat_boxplot(coef = 1.5)    x, y | ..lower.., ..middle.., ..upper.., ..outliers..
g + stat_ydensity(adjust = 1, kernel = "gaussian", scale = "area")    x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width..

Functions
f + stat_ecdf(n = 40)    x, y | ..x.., ..y..
f + stat_quantile(quantiles = c(0.25, 0.5, 0.75), formula = y ~ log(x), method = "rq")    x, y | ..quantile.., ..x.., ..y..
f + stat_smooth(method = "auto", formula = y ~ x, se = TRUE, n = 80, fullrange = FALSE, level = 0.95)    x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..

General Purpose
ggplot() + stat_function(aes(x = -3:3), fun = dnorm, n = 101, args = list(sd = 0.5))    x | ..y..
f + stat_identity()
ggplot() + stat_qq(aes(sample = 1:100), distribution = qt, dparams = list(df = 5))    sample, x, y | ..x.., ..y..
f + stat_sum()    x, y, size | ..size..
f + stat_summary(fun.data = "mean_cl_boot")
f + stat_unique()

Scales
Scales control how a plot maps data values to the visual values of an aesthetic. To change the mapping, add a custom scale. A scale name is built from the scale_ prefix, the aesthetic to adjust, and the prepackaged scale to use, followed by scale-specific arguments.

n <- b + geom_bar(aes(fill = fl))
n + scale_fill_manual(
      values = c("skyblue", "royalblue", "blue", "navy"),    range of values to include in mapping
      limits = c("d", "e", "p", "r"),
      breaks = c("d", "e", "p", "r"),                        breaks to use in legend/axis
      name = "fuel",                                         title to use in legend/axis
      labels = c("D", "E", "P", "R"))                        labels to use in legend/axis

General Purpose scales (use with any aesthetic: alpha, color, fill, linetype, shape, size)
scale_*_continuous()           map continuous values to visual values
scale_*_discrete()             map discrete values to visual values
scale_*_identity()             use data values as visual values
scale_*_manual(values = c())   map discrete values to manually chosen visual values

X and Y location scales (use with x or y aesthetics; x shown here)
scale_x_date(labels = date_format("%m/%d"), breaks = date_breaks("2 weeks"))    treat x values as dates; see ?strptime for label formats
scale_x_datetime()    treat x values as date times; uses the same arguments as scale_x_date()
scale_x_log10()       plot x on a log10 scale
scale_x_reverse()     reverse the direction of the x axis
scale_x_sqrt()        plot x on a square root scale

Color and fill scales
Discrete: n <- b + geom_bar(aes(fill = fl))
n + scale_fill_brewer(palette = "Blues")    for palette choices: library(RColorBrewer); display.brewer.all()
n + scale_fill_grey(start = 0.2, end = 0.8, na.value = "red")
Continuous: o <- a + geom_dotplot(aes(fill = ..x..))
o + scale_fill_gradient(low = "red", high = "yellow")
o + scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 25)
o + scale_fill_gradientn(colours = terrain.colors(6))    also: rainbow(), heat.colors(), topo.colors(), cm.colors(), RColorBrewer::brewer.pal()

Shape scales
p <- f + geom_point(aes(shape = fl))
p + scale_shape(solid = FALSE)
p + scale_shape_manual(values = c(3:7))    shape values are the standard R point codes 0-25 (reference chart not reproduced)

Size scales
q <- f + geom_point(aes(size = cyl))
q + scale_size_area(max = 6)    value mapped to the area of the circle (not the radius)

Coordinate Systems
r <- b + geom_bar()
r + coord_cartesian(xlim = c(0, 5))    xlim, ylim. The default cartesian coordinate system.
r + coord_fixed(ratio = 1/2)    ratio, xlim, ylim. Cartesian coordinates with a fixed aspect ratio between x and y units.
r + coord_flip()    xlim, ylim. Flipped Cartesian coordinates.
r + coord_polar(theta = "x", direction = 1)    theta, start, direction. Polar coordinates.
r + coord_trans(ytrans = "sqrt")    xtrans, ytrans, limx, limy. Transformed cartesian coordinates; set xtrans and ytrans to the name of a window function.
z + coord_map(projection = "ortho", orientation = c(41, -74, 0))    projection, orientation, xlim, ylim. Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.).

Position Adjustments
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge")     arrange elements side by side
s + geom_bar(position = "fill")      stack elements on top of one another, normalize height
s + geom_bar(position = "stack")     stack elements on top of one another
f + geom_point(position = "jitter")  add random noise to the X and Y position of each element to avoid overplotting
Each position adjustment can be recast as a function with manual width and height arguments:
s + geom_bar(position = position_dodge(width = 1))

Faceting
Facets divide a plot into subplots based on the values of one or more discrete variables.
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
t + facet_grid(. ~ fl)       facet into columns based on fl
t + facet_grid(year ~ .)     facet into rows based on year
t + facet_grid(year ~ fl)    facet into both rows and columns
t + facet_wrap(~ fl)         wrap facets into a rectangular layout
Set scales to let axis limits vary across facets:
t + facet_grid(y ~ x, scales = "free")    x and y axis limits adjust to individual facets; "free_x" adjusts x axis limits only, "free_y" adjusts y axis limits only
Set labeller to adjust facet labels:
t + facet_grid(. ~ fl, labeller = label_both)                    e.g. fl: c, fl: d, ...
t + facet_grid(. ~ fl, labeller = label_bquote(alpha ^ .(x)))
t + facet_grid(. ~ fl, labeller = label_parsed)

Labels
t + ggtitle("New Plot Title")    Add a main title above the plot
t + xlab("New X label")          Change the label on the X axis
t + ylab("New Y label")          Change the label on the Y axis
t + labs(title = "New title", x = "New x", y = "New y")    All of the above
Use scale functions to update legend labels.

Legends
t + theme(legend.position = "bottom")    Place the legend at "bottom", "top", "left", or "right"
t + guides(color = "none")               Set the legend type for each aesthetic: colorbar, legend, or none (no legend)
t + scale_fill_discrete(name = "Title", labels = c("A", "B", "C"))    Set the legend title and labels with a scale function

Themes
r + theme_bw()         White background with grid lines
r + theme_classic()    White background, no gridlines
r + theme_grey()       Grey background (the default theme)
r + theme_minimal()    Minimal theme
ggthemes               Package with additional ggplot2 themes

Zooming
Without clipping (preferred):
t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
With clipping (removes unseen data points):
t + xlim(0, 100) + ylim(10, 20)
t + scale_x_continuous(limits = c(0, 100)) + scale_y_continuous(limits = c(0, 100))

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com
Learn more at docs.ggplot2.org • ggplot2 0.9.3.1 • Updated: 3/15
Python For Data Science Cheat Sheet: Seaborn
Statistical Data Visualization With Seaborn
Learn Data Science Interactively at www.DataCamp.com

The Python visualization library Seaborn is based on matplotlib and provides a high-level interface for drawing attractive statistical graphics.

Make use of the following aliases to import the libraries:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

The basic steps to creating plots with Seaborn are:
1. Prepare some data
2. Control figure aesthetics
3. Plot with Seaborn
4. Further customize your plot
5. Show or save the plot

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> tips = sns.load_dataset("tips")                                  Step 1
>>> sns.set_style("whitegrid")                                       Step 2
>>> g = sns.lmplot(x="tip", y="total_bill", data=tips, aspect=2)     Step 3
>>> g = (g.set_axis_labels("Tip","Total bill(USD)")
...        .set(xlim=(0,10), ylim=(0,100)))                          Step 4
>>> plt.title("title")
>>> plt.show(g)                                                      Step 5

1. Data (also see Lists, NumPy & Pandas)
>>> import pandas as pd
>>> import numpy as np
>>> uniform_data = np.random.rand(10, 12)
>>> data = pd.DataFrame({'x':np.arange(1,101), 'y':np.random.normal(0,4,100)})
Seaborn also offers built-in data sets:
>>> titanic = sns.load_dataset("titanic")
>>> iris = sns.load_dataset("iris")

2. Figure Aesthetics (also see Matplotlib)
>>> f, ax = plt.subplots(figsize=(5,6))   Create a figure and one subplot

Seaborn styles
>>> sns.set()                             (Re)set the seaborn defaults
>>> sns.set_style("whitegrid")            Set the matplotlib parameters
>>> sns.set_style("ticks", {"xtick.major.size":8, "ytick.major.size":8})   Set the matplotlib parameters
>>> sns.axes_style("whitegrid")           Return a dict of params, or use with "with" to temporarily set the style

Context Functions
>>> sns.set_context("talk")               Set context to "talk"
>>> sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth":2.5})   Set context to "notebook", scale font elements and override parameter mapping

Color Palette
>>> sns.set_palette("husl",3)             Define the color palette
>>> sns.color_palette("husl")             Use with "with" to temporarily set the palette
>>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
>>> sns.set_palette(flatui)               Set your own color palette

3. Plotting With Seaborn

Axis Grids
>>> g = sns.FacetGrid(titanic, col="survived", row="sex")   Subplot grid for plotting conditional relationships
>>> g = g.map(plt.hist, "age")
>>> sns.factorplot(x="pclass", y="survived", hue="sex", data=titanic)   Draw a categorical plot onto a FacetGrid
>>> sns.lmplot(x="sepal_width", y="sepal_length", hue="species", data=iris)   Plot data and regression model fits across a FacetGrid
>>> h = sns.PairGrid(iris)                Subplot grid for plotting pairwise relationships
>>> h = h.map(plt.scatter)
>>> sns.pairplot(iris)                    Plot pairwise bivariate distributions
>>> i = sns.JointGrid(x="x", y="y", data=data)   Grid for a bivariate plot with marginal univariate plots
>>> i = i.plot(sns.regplot, sns.distplot)
>>> sns.jointplot("sepal_length", "sepal_width", data=iris, kind='kde')   Plot bivariate distribution

Categorical Plots
Scatterplot
>>> sns.stripplot(x="species", y="petal_length", data=iris)   Scatterplot with one categorical variable
>>> sns.swarmplot(x="species", y="petal_length", data=iris)   Categorical scatterplot with non-overlapping points
Bar Chart
>>> sns.barplot(x="sex", y="survived", hue="class", data=titanic)   Show point estimates and confidence intervals as rectangular bars
Count Plot
>>> sns.countplot(x="deck", data=titanic, palette="Greens_d")   Show count of observations
Point Plot
>>> sns.pointplot(x="class", y="survived", hue="sex", data=titanic,
...               palette={"male":"g", "female":"m"},
...               markers=["^","o"], linestyles=["-","--"])     Show point estimates and confidence intervals with scatterplot glyphs
Boxplot
>>> sns.boxplot(x="alive", y="age", hue="adult_male", data=titanic)   Boxplot
>>> sns.boxplot(data=iris, orient="h")                                Boxplot with wide-form data
Violinplot
>>> sns.violinplot(x="age", y="sex", hue="survived", data=titanic)    Violin plot

Regression Plots
>>> sns.regplot(x="sepal_width", y="sepal_length", data=iris, ax=ax)  Plot data and a linear regression model fit

Distribution Plots
>>> plot = sns.distplot(data.y, kde=False, color="b")   Plot a univariate distribution

Matrix Plots
>>> sns.heatmap(uniform_data, vmin=0, vmax=1)   Heatmap

4. Further Customizations (also see Matplotlib)

Axisgrid Objects
>>> g.despine(left=True)                  Remove the left spine
>>> g.set_ylabels("Survived")             Set the labels of the y-axis
>>> g.set_xticklabels(rotation=45)        Set the tick labels for x
>>> g.set_axis_labels("Survived", "Sex")  Set the axis labels
>>> h.set(xlim=(0,5), ylim=(0,5), xticks=[0,2.5,5], yticks=[0,2.5,5])   Set the limits and ticks of the x- and y-axis

Plot
>>> plt.title("A Title")         Add plot title
>>> plt.ylabel("Survived")       Adjust the label of the y-axis
>>> plt.xlabel("Sex")            Adjust the label of the x-axis
>>> plt.ylim(0,100)              Adjust the limits of the y-axis
>>> plt.xlim(0,10)               Adjust the limits of the x-axis
>>> plt.setp(ax, yticks=[0,5])   Adjust a plot property
>>> plt.tight_layout()           Adjust subplot params

5. Show or Save Plot (also see Matplotlib)
>>> plt.show()                              Show the plot
>>> plt.savefig("foo.png")                  Save the plot as a figure
>>> plt.savefig("foo.png", transparent=True)   Save a transparent figure

Close & Clear (also see Matplotlib)
>>> plt.cla()     Clear an axis
>>> plt.clf()     Clear an entire figure
>>> plt.close()   Close a window

DataCamp: Learn Python for Data Science Interactively
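
As a compact worked example of the five steps above, a minimal sketch using the bundled tips data set (its day, total_bill and smoker columns ship with Seaborn; the chosen plot type is illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                  # step 1: prepare some data

sns.set_style("whitegrid")                       # step 2: control figure aesthetics
ax = sns.boxplot(x="day", y="total_bill",        # step 3: plot with Seaborn
                 hue="smoker", data=tips)
ax.set_title("Total bill by day")                # step 4: further customization
plt.ylim(0, 60)
plt.show()                                       # step 5: show the plot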
TensorFlow Cheat Sheet

About TensorFlow
TensorFlow™ is an open source software library for numerical computation using data flow graphs. TensorFlow was originally developed for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Skflow
Scikit Flow provides a set of high-level model classes that you can use to easily integrate with your existing Scikit-learn pipeline code. Scikit Flow is a simplified interface for TensorFlow, to get people started on predictive analytics and data mining. Scikit Flow has been merged into TensorFlow since version 0.8 and is now called TensorFlow Learn.

Keras
Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.

Installation
How to install a new package in Python:
pip install <package-name>
Example: pip install requests

How to install TensorFlow?
device = cpu/gpu
python_version = cp27/cp34
sudo pip install https://storage.googleapis.com/tensorflow/linux/$device/tensorflow-0.8.0-$python_version-none-linux_x86_64.whl

How to install Skflow
pip install sklearn

How to install Keras
pip install keras
Then update ~/.keras/keras.json: replace "theano" with "tensorflow".

Helpers
Python helper: important functions
type(object)    Get the object type
help(object)    Get help for an object (list of available methods, attributes, signatures and so on)
dir(object)     Get the list of object attributes (fields, functions)
str(object)     Transform an object to a string
object?         Show documentation about the object (IPython)
globals()       Return the dictionary containing the current scope's global variables
locals()        Update and return a dictionary containing the current scope's local variables
id(object)      Return the identity of an object; guaranteed to be unique among simultaneously existing objects

Other built-in functions:
import __builtin__
dir(__builtin__)

TensorFlow

Main classes
tf.Graph()
tf.Operation()
tf.Tensor()
tf.Session()

Some useful functions
tf.get_default_session()
tf.get_default_graph()
tf.reset_default_graph()
ops.reset_default_graph()
tf.device("/cpu:0")
tf.name_scope(value)
tf.convert_to_tensor(value)

TensorFlow Optimizers
GradientDescentOptimizer
AdadeltaOptimizer
AdagradOptimizer
MomentumOptimizer
AdamOptimizer
FtrlOptimizer
RMSPropOptimizer

Reduction
reduce_sum
reduce_prod
reduce_min
reduce_max
reduce_mean
reduce_all
reduce_any
accumulate_n

Activation functions (tf.nn)
relu
relu6
elu
softplus
softsign
dropout
bias_add
sigmoid
tanh
sigmoid_cross_entropy_with_logits
softmax
log_softmax
softmax_cross_entropy_with_logits
sparse_softmax_cross_entropy_with_logits
weighted_cross_entropy_with_logits
etc.

Skflow
Main classes
TensorFlowClassifier
TensorFlowRegressor
TensorFlowDNNClassifier
TensorFlowDNNRegressor
TensorFlowLinearClassifier
TensorFlowLinearRegressor
TensorFlowRNNClassifier
TensorFlowRNNRegressor

Version 1.09. Get the latest version at:
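
The sheet lists the building blocks (graphs, sessions, optimizers, reductions) without showing how they fit together. Below is a minimal sketch of the graph-and-session workflow, assuming the TensorFlow 1.x API that this sheet targets (tf.placeholder and tf.Session were removed in TensorFlow 2.x); the tiny linear-regression task and all variable names are purely illustrative.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 1])   # input feature
y = tf.placeholder(tf.float32, shape=[None, 1])   # target
w = tf.Variable(tf.zeros([1, 1]))                 # weight to learn
b = tf.Variable(tf.zeros([1]))                    # bias to learn

pred = tf.matmul(x, w) + b
loss = tf.reduce_mean(tf.square(pred - y))        # one of the reduce_* ops listed above
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:                        # the default session inside this block
    sess.run(tf.global_variables_initializer())
    data_x = np.array([[1.0], [2.0], [3.0]])
    data_y = 2 * data_x                           # learn y = 2x
    for _ in range(100):
        sess.run(train_op, feed_dict={x: data_x, y: data_y})
    print(sess.run([w, b]))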
Python For Data Science Cheat Sheet: Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com

Scikit-learn
Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)

Loading The Data (also see NumPy & Pandas)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as a Pandas DataFrame, are also acceptable.
>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0

Training And Test Data
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Preprocessing The Data

Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)

Imputing Missing Values
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit_transform(X_train)

Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)

Create Your Model

Supervised Learning Estimators
Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators
Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting
Supervised learning
>>> lr.fit(X, y)                              Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)
Unsupervised learning
>>> k_means.fit(X_train)                      Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)    Fit to the data, then transform it

Prediction
Supervised estimators
>>> y_pred = svc.predict(np.random.random((2,5)))   Predict labels
>>> y_pred = lr.predict(X_test)                     Predict labels
>>> y_pred = knn.predict_proba(X_test)              Estimate probability of a label
Unsupervised estimators
>>> y_pred = k_means.predict(X_test)                Predict labels in clustering algorithms

Evaluate Your Model's Performance

Classification Metrics
Accuracy Score
>>> knn.score(X_test, y_test)                    Estimator score method
>>> from sklearn.metrics import accuracy_score   Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Classification Report
>>> from sklearn.metrics import classification_report   Precision, recall, f1-score and support
>>> print(classification_report(y_test, y_pred))
Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)

Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Cross-Validation
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))

Tune Your Model

Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
...           "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
...           "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,
...                              cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)

DataCamp: Learn Python for Data Science Interactively
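
A minimal end-to-end sketch that chains the pieces above (scaling, a KNN classifier, evaluation, and a grid search) into a single pipeline; it uses only the iris data bundled with scikit-learn, and the parameter range is illustrative.

from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

pipe = Pipeline([("scale", StandardScaler()),       # standardize inside the CV folds
                 ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 10))}, cv=4)
grid.fit(X_train, y_train)                          # fits and refits the best model

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))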
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
>>> B = np.asmatrix(b) Subtraction
SciPy >>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
>>> np.divide(A,D) Division
scientific computing that provides mathematical Basic Matrix Routines Multiplication
algorithms and convenience functions built on the >>> A @ D Multiplication operator
Inverse
NumPy extension of Python. >>> A.I Inverse (Python 3)
>>> np.multiply(D,A) Multiplication
>>> linalg.inv(A) Inverse
>>> np.dot(A,D) Dot product
Interacting With NumPy Also see NumPy Transposition >>> np.vdot(A,D) Vector dot product
>>> import numpy as np >>> A.T Tranpose matrix >>> np.inner(A,D) Inner product
>>> A.H Conjugate transposition >>> np.outer(A,D) Outer product
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)]) Trace >>> np.tensordot(A,D) Tensor dot product
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
Norm Exponential Functions
Index Tricks >>> linalg.expm(A) Matrix exponential
>>> linalg.norm(A) Frobenius norm
>>> np.mgrid[0:5,0:5] Create a dense meshgrid >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,1) L1 norm (max column sum)
>>> np.ogrid[0:2,0:2] Create an open meshgrid >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> linalg.norm(A,np.inf) L inf norm (max row sum) decomposition)
>>> np.r_[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise)
>>> np.c_[b,c] Create stacked column-wise arrays Rank Logarithm Function
>>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
Shape Manipulation Determinant Trigonometric Functions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
>>> np.transpose(b) Permute array dimensions
Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> b.flatten() Flatten the array >>> linalg.tanm(A) Matrix tangent
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.solve(A,b) Solver for dense matrices
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) >>> E = np.mat(a).T Solver for dense matrices Hyperbolic Trigonometric Functions
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index >>> linalg.lstsq(F,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hypberbolic matrix sine
>>> np.vpslit(d,2) Split the array vertically at the 2nd index equation >>> linalg.coshm(D) Hyperbolic matrix cosine
Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
Polynomials >>> linalg.pinv(C) Compute the pseudo-inverse of a matrix Matrix Sign Function
(least-squares solver) >>> np.signm(A) Matrix sign function
>>> from numpy import poly1d
>>> p = poly1d([3,4,5]) Create a polynomial object >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix Matrix Square Root
(SVD) >>> linalg.sqrtm(A) Matrix square root
Vectorizing Functions Creating Sparse Matrices Arbitrary Functions
>>> def myfunc(a): >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
if a < 0: >>> F = np.eye(3, k=1) Create a 2X2 identity matrix
return a*2
else:
>>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a/2 >>> C[C > 0.5] = 0
>>> np.vectorize(myfunc) Vectorize functions
>>> H = sparse.csr_matrix(C) Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
Type Handling >>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
>>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> np.real(b) Return the real part of the array elements >>> v[:,1] Second eigenvector
>>> np.imag(b) Return the imaginary part of the array elements
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Sparse Matrix Routines >>> linalg.eigvals(A) Compute eigenvalues
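A quick sanity check on what linalg.eig returns is to verify the defining relation A v = la v for the first eigenpair; a hedged sketch using the square matrix A assumed earlier:

>>> la, v = linalg.eig(A)                               # eigenvalues and column eigenvectors
>>> np.allclose(np.asarray(A) @ v[:,0], la[0]*v[:,0])   # first eigenpair satisfies A v = lambda v
True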
>>> np.cast['f'](np.pi) Cast object to a data type Singular Value Decomposition
Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions
Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> sparse.linalg.norm(I) Norm LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values Solving linear problems >>> P,L,U = linalg.lu(C) LU Decomposition
(number of samples)
>>> g[3:] += np.pi >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale)
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on
Sparse Matrix Functions >>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
conditions >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
>>> misc.factorial(a) Factorial
>>> misc.comb(10,3,exact=True) Combine N things taken k at a time
>>> misc.central_diff_weights(3) Weights for Np-point central derivative Asking For Help DataCamp
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Retrieving RDD Information Reshaping Data
Basic Information Reducing
PySpark - RDD Basics >>> rdd.getNumPartitions() List the number of partitions
>>> rdd.reduceByKey(lambda x,y : x+y)
.collect()
Merge the rdd values for
each key
Learn Python for data science Interactively at www.DataCamp.com >>> rdd.count() Count RDD instances [('a',9),('b',2)]
3 >>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
>>> rdd.countByKey() Count RDD instances by key ('a',7,'a',2,'b',2)
defaultdict(<type 'int'>,{'a':2,'b':1}) Grouping by
>>> rdd.countByValue() Count RDD instances by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
Spark defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() Return (key,value) pairs as a
.mapValues(list)
.collect()
PySpark is the Spark Python API that exposes {'a': 2,'b': 2} dictionary >>> rdd.groupByKey() Group rdd by key
>>> rdd3.sum() Sum of RDD elements .mapValues(list)
the Spark programming model to Python. 4950 .collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
True
Initializing Spark Summary
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
SparkContext >>> rdd3.max() Maximum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate RDD elements of each
99 (4950,100) partition and then the results
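The (0,0) accumulator used with seqOp and combOp keeps a running (sum, count) pair, so the mean can be recovered from the aggregate result; a small illustrative sketch:

>>> total, count = rdd3.aggregate((0,0), seqOp, combOp)   # -> (4950, 100) for range(100)
>>> total / count                                         # same value as rdd3.mean()
49.5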
>>> from pyspark import SparkContext >>> rdd3.min() Minimum value of RDD elements
>>> sc = SparkContext(master = 'local[2]') >>> rdd.aggregateByKey((0,0),seqOp,combOp) Aggregate values of each RDD key
0
>>> rdd3.mean() Mean value of RDD elements .collect()
Inspect SparkContext 49.5 [('a',(9,2)), ('b',(2,1))]
>>> rdd3.stdev() Standard deviation of RDD elements >>> rdd3.fold(0,add) Aggregate the elements of each
>>> sc.version Retrieve SparkContext version 28.866070047722118 4950 partition, and then the results
>>> sc.pythonVer Retrieve Python version >>> rdd3.variance() Compute variance of RDD elements >>> rdd.foldByKey(0, add) Merge the values for each key
>>> sc.master Master URL to connect to 833.25 .collect()
>>> str(sc.sparkHome) Path where Spark is installed on worker nodes >>> rdd3.histogram(3) Compute histogram by bins [('a',9),('b',2)]
>>> str(sc.sparkUser()) Retrieve name of the Spark User running ([0,33,66,99],[33,33,34])
>>> rdd3.stats() Summary statistics (count, mean, stdev, max & >>> rdd3.keyBy(lambda x: x+x) Create tuples of RDD elements by
SparkContext
>>> sc.appName Return application name min) .collect() applying a function
>>> sc.applicationId Retrieve application ID
>>> sc.defaultParallelism Return default level of parallelism
>>> sc.defaultMinPartitions Default minimum number of partitions for Applying Functions Mathematical Operations
RDDs >>> rdd.map(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd.subtract(rdd2) Return each rdd value not contained
.collect() .collect() in rdd2
Configuration [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] [('b',2),('a',7)]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> from pyspark import SparkConf, SparkContext and flatten the result .collect() with no matching key in rdd
>>> conf = (SparkConf() >>> rdd5.collect() [('d', 1)]
.setMaster("local") ['a',7,7,'a','a',2,2,'a','b',2,2,'b'] >>> rdd.cartesian(rdd2).collect() Return the Cartesian product of rdd
.setAppName("My app") >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each (key,value) and rdd2
.set("spark.executor.memory", "1g")) .collect() pair of rdd4 without changing the keys
>>> sc = SparkContext(conf = conf) [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
Sort
Using The Shell Selecting Data >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
In the PySpark shell, a special interpreter-aware SparkContext is already Getting [('d',1),('b',1),('a',2)]
created in the variable called sc. >>> rdd.collect() Return a list with all RDD elements >>> rdd2.sortByKey() Sort (key, value) RDD by key
[('a', 7), ('a', 2), ('b', 2)] .collect()
$ ./bin/spark-shell --master local[2] >>> rdd.take(2) Take first 2 RDD elements [('a',2),('b',1),('d',1)]
$ ./bin/pyspark --master local[4] --py-files code.py [('a', 7), ('a', 2)]
>>> rdd.first() Take first RDD element
Set which master the context connects to with the --master argument, and
('a', 7) Repartitioning
add Python .zip, .egg or .py files to the runtime path by passing a >>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)] >>> rdd.repartition(4) New RDD with 4 partitions
comma-separated list to --py-files. >>> rdd.coalesce(1) Decrease the number of partitions in the RDD to 1
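The same flags work for batch jobs submitted with spark-submit; a hypothetical invocation (the script and dependency names below are placeholders) might be:

$ ./bin/spark-submit --master local[4] \
      --py-files deps.zip,helpers.py \
      my_job.py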
Sampling
Loading Data >>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97] Saving
Parallelized Collections Filtering >>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)]) .collect() 'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) [('a',7),('a',2)]
>>> rdd3 = sc.parallelize(range(100)) >>> rdd5.distinct().collect() Return distinct RDD values
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
("b",["p", "r"])])
['a',2,'b',7]
>>> rdd.keys().collect() Return (key,value) RDD's keys
Stopping SparkContext
['a', 'a', 'b'] >>> sc.stop()
External Data
Read either one text file from HDFS, a local file system or any Iterating Execution
Hadoop-supported file system URI with textFile(), or read in a directory >>> def g(x): print(x)
>>> rdd.foreach(g) Apply a function to all RDD elements $ ./bin/spark-submit examples/src/main/python/pi.py
of text files with wholeTextFiles().
('a', 7)
>>> textFile = sc.textFile("/my/directory/*.txt") ('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
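Both calls return ordinary RDDs, so the usual transformations and actions apply; a short, hypothetical follow-up:

>>> textFile.count()                                        # number of lines across the matched files
>>> textFile.filter(lambda line: "spark" in line).take(5)   # first few lines containing "spark"
>>> textFile2.keys().collect()                              # file paths of the (path, content) pairs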
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \
.show()
in the groups
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F
Select Filter
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age
Spark SQL is Apache Spark's module for "age", and type
Sort
explode("phoneNumber") \
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
Initializing SparkSession "firstName",
"age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used to create DataFrames, register DataFrames as tables, .show() .collect()
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
>>> df.select(df['age'] > 24).show()
When
Show all entries where age >24 Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omitting rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
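As described above, the same session object can read files, register and cache tables, and run SQL over them; a hedged sketch (the file and view names are hypothetical):

>>> users_df = spark.read.parquet("users.parquet")     # hypothetical parquet file
>>> users_df.createOrReplaceTempView("users")          # register the DataFrame as a SQL view
>>> spark.catalog.cacheTable("users")                  # cache the registered table
>>> spark.sql("SELECT COUNT(*) FROM users").show()     # run SQL over the cached view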
Creating DataFrames Like
.collect()
From RDDs
>>> df.select("firstName", Show firstName, and lastName is
df.lastName.like("Smith")) \ TRUE if lastName is like Smith
Repartitioning
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
.show() Running SQL Queries Programmatically
Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+
| name|age|
Add, Update & Remove Columns >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
|    Mine| 28|
|   Filip| 29|
Adding Columns .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
From Spark Data Sources .withColumn('state',df.address.state) \
.withColumn('streetAddress',df.address.streetAddress) \
Output
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25|     John|   Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21|     Jane|     Doe|[[322 888-1234,ho...|
Updating Columns DataFrame
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() Count the number of rows in df >>> spark.stop()
>>> df.head() Return first n rows
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.take(2) Return the first n rows >>> df.printSchema() Print the schema of df DataCamp
>>> df.schema Return the schema of df >>> df.explain() Print the (logical and physical) plans
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Model Architecture Inspect Model
>>> model.output_shape Model output shape
Sequential Model
Keras >>> from keras.models import Sequential
>>> model.summary() Model summary representation
>>> model.get_config() Model configuration
Learn Python for data science Interactively at www.DataCamp.com >>> model = Sequential() >>> model.get_weights() List all weight tensors in the model
>>> model2 = Sequential()
>>> model3 = Sequential() Compile Model
Multilayer Perceptron (MLP) MLP: Binary Classification
Keras Binary Classification >>> model.compile(optimizer='adam',
loss='binary_crossentropy',
Keras is a powerful and easy-to-use deep learning library for >>> from keras.layers import Dense metrics=['accuracy'])
Theano and TensorFlow that provides a high-level neural >>> model.add(Dense(12, MLP: Multi-Class Classification
input_dim=8, >>> model.compile(optimizer='rmsprop',
networks API to develop and evaluate deep learning models. kernel_initializer='uniform', loss='categorical_crossentropy',
activation='relu')) metrics=['accuracy'])
A Basic Example >>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
MLP: Regression
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid')) >>> model.compile(optimizer='rmsprop',
>>> import numpy as np loss='mse',
>>> from keras.models import Sequential Multi-Class Classification metrics=['mae'])
>>> from keras.layers import Dense >>> from keras.layers import Dropout
>>> data = np.random.random((1000,100)) >>> model.add(Dense(512,activation='relu',input_shape=(784,))) Recurrent Neural Network
>>> labels = np.random.randint(2,size=(1000,1)) >>> model.add(Dropout(0.2)) >>> model3.compile(loss='binary_crossentropy',
>>> model = Sequential() optimizer='adam',
>>> model.add(Dense(512,activation='relu')) metrics=['accuracy'])
>>> model.add(Dense(32, >>> model.add(Dropout(0.2))
activation='relu', >>> model.add(Dense(10,activation='softmax'))
input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ide- >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
ally, you split the data in training and test sets, for which you can also resort batch_size=32)
>>> model2.add(Dropout(0.25))
to the train_test_split module of sklearn.model_selection.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64,(3, 3)))
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('model_file.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.layers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data[:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2) DataCamp
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2) Learn Python for Data Science Interactively