Google Earth Engine: Planetary-scale geospatial analysis for everyone

Article history: Received 9 July 2016; received in revised form 5 June 2017; accepted 27 June 2017; available online xxxx

Keywords: Cloud computing; Big data; Analysis; Platform; Data democratization; Earth Engine

Abstract

Google Earth Engine is a cloud-based platform for planetary-scale geospatial analysis that brings Google's massive computational capabilities to bear on a variety of high-impact societal issues including deforestation, drought, disaster, disease, food security, water management, climate monitoring and environmental protection. It is unique in the field as an integrated platform designed to empower not only traditional remote sensing scientists, but also a much wider audience that lacks the technical capacity needed to utilize traditional supercomputers or large-scale commodity cloud computing resources.

© 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
⁎ Corresponding author. E-mail address: gorelick@google.com (N. Gorelick).
http://dx.doi.org/10.1016/j.rse.2017.06.031
0034-4257/© 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Please cite this article as: Gorelick, N., et al., Google Earth Engine: Planetary-scale geospatial analysis for everyone, Remote Sensing of Environment (2016), http://dx.doi.org/10.1016/j.rse.2017.06.031

1. Introduction

Supercomputers and high-performance computing systems are becoming abundant (Cossu et al., 2010; Nemani et al., 2011) and large-scale cloud computing is universally available as a commodity. At the same time, petabyte-scale archives of remote sensing data have become freely available from multiple U.S. Government agencies including NASA, the U.S. Geological Survey, and NOAA (Woodcock et al., 2008; Loveland and Dwyer, 2012; Nemani et al., 2011), as well as the European Space Agency (Copernicus Data Access Policy, 2016), and a wide variety of tools have been developed to facilitate large-scale processing of geospatial data, including TerraLib (Câmara et al., 2000), Hadoop (Whitman et al., 2014), GeoSpark (Yu et al., 2015), and GeoMesa (Hughes et al., 2015).

Unfortunately, taking full advantage of these resources still requires considerable technical expertise and effort. One major hurdle is basic information technology (IT) management: data acquisition and storage; parsing obscure file formats; managing databases, machine allocations, jobs and job queues, CPUs, GPUs, and networking; and using any of the multitude of geospatial data processing frameworks.

This burden can put these tools out of the reach of many researchers and operational users, restricting access to the information contained within many large remote-sensing datasets to remote-sensing experts with special access to high-performance computing resources.

Google Earth Engine is a cloud-based platform that makes it easy to access high-performance computing resources for processing very large geospatial datasets, without having to suffer the IT pains currently surrounding either. Additionally, and unlike most supercomputing centers, Earth Engine is also designed to help researchers easily disseminate their results to other researchers, policy makers, NGOs, field workers, and even the general public. Once an algorithm has been developed on Earth Engine, users can produce systematic data products or deploy interactive applications backed by Earth Engine's resources, without needing to be an expert in application development, web programming or HTML.

2. Platform overview

Earth Engine consists of a multi-petabyte analysis-ready data catalog co-located with a high-performance, intrinsically parallel computation service. It is accessed and controlled through an Internet-accessible application programming interface (API) and an associated web-based interactive development environment (IDE) that enables rapid prototyping and visualization of results.

The data catalog houses a large repository of publicly available geospatial datasets, including observations from a variety of satellite and aerial imaging systems in both optical and non-optical wavelengths, environmental variables, weather and climate forecasts and hindcasts, land cover, topographic and socio-economic datasets. All of this data is preprocessed to a ready-to-use but information-preserving form that allows efficient access and removes many barriers associated with data management.

Users can access and analyze data from the public catalog as well as their own private data using a library of operators provided by the Earth Engine API. These operators are implemented in a large parallel
processing system that automatically subdivides and distributes computations, providing high-throughput analysis capabilities. Users access the API either through a thin client library or through a web-based interactive development environment built on top of that client library (Fig. 1).

Users can sign up for access at the Earth Engine homepage, https://earthengine.google.com, and access the user interface, as well as a user's guide, tutorials, examples, training videos, a function reference, and educational curricula. While prior experience with GIS, remote sensing and scripting makes it easier to get started, it is not strictly required, and the user's guide is oriented towards domain novices. Accounts come with a quota for uploading personal data and saving intermediate products, and any inputs or results can be downloaded for offline use.

3. The data catalog

The Earth Engine public data catalog is a multi-petabyte curated collection of widely used geospatial datasets. The bulk of the catalog is made up of Earth-observing remote sensing imagery, including the entire Landsat archive as well as complete archives of data from Sentinel-1 and Sentinel-2, but it also includes climate forecasts, land cover data and many other environmental, geophysical and socio-economic datasets (Table 1). The catalog is continuously updated at a rate of nearly 6000 scenes per day from active missions, with a typical latency of about 24 h from scene acquisition time. Users can request the addition of new datasets to the public catalog, or they can upload their own private data via a REST interface using either browser-based or command-line tools and share it with other users or groups as desired.

Earth Engine uses a simple and highly general data model based on 2D gridded raster bands in a lightweight "image" container. Pixels in an individual band must be homogeneous in data type, resolution and projection. However, images can contain any number of bands, and the bands within an image need not have uniform data types or projections. Each image can also have associated key/value metadata containing information such as the location, acquisition time, and the conditions under which the image was collected or processed.

Related images, such as all of the images produced by a single sensor, are grouped together and presented as a "collection". Collections provide fast filtering and sorting capabilities that make it easy for users to search through millions of individual images to select data that meets specific spatial, temporal or other criteria. For example, a user can easily select daytime images from the Landsat 7 sensor that cover any part of Iowa, collected on day-of-year 80 to 104, in the years 2010 to 2012, with less than 70% cloud cover.

Images ingested into Earth Engine are pre-processed to facilitate fast and efficient access. First, images are cut into tiles in the image's original projection and resolution and stored in an efficient, replicated tile database. A tile size of 256 × 256 was chosen as a practical trade-off between loading unneeded data and the overhead of issuing additional reads. In contrast to conventional "data cube" systems, this data ingestion process is information-preserving: the data are always maintained in their original projection, resolution and bit depth, avoiding the degradation that would be inherent in resampling all data to a fixed grid that may or may not be appropriate for any particular application.

Additionally, to enable fast visualization during algorithm development, a pyramid of reduced-resolution tiles is created for each image and stored in the tile database. Each level of the pyramid is created by downsampling the previous level by a factor of two until the entire image fits into a single tile. When downsampling, continuous-valued bands are typically averaged, while discrete-valued bands, such
Table 1
Frequently used datasets in the Earth Engine data catalog (dataset, spatial resolution, revisit/cadence, time span, coverage).

Landsat
Landsat 8 OLI/TIRS 30 m 16 day 2013–Now Global
Landsat 7 ETM+ 30 m 16 day 2000–Now Global
Landsat 5 TM 30 m 16 day 1984–2012 Global
Landsat 4–8 surface reflectance 30 m 16 day 1984–Now Global
Sentinel
Sentinel 1 A/B ground range detected 10 m 6 day 2014–Now Global
Sentinel 2A MSI 10/20 m 10 day 2015–Now Global
MODIS
MOD08 atmosphere 1° Daily 2000–Now Global
MOD09 surface reflectance 500 m 1 day/8 day 2000–Now Global
MOD10 snow cover 500 m 1 day 2000–Now Global
MOD11 temperature and emissivity 1000 m 1 day/8 day 2000–Now Global
MCD12 Land cover 500 m Annual 2000–Now Global
MOD13 Vegetation indices 500/250 m 16 day 2000–Now Global
MOD14 Thermal anomalies & fire 1000 m 8 day 2000–Now Global
MCD15 Leaf area index/FPAR 500 m 4 day 2000–Now Global
MOD17 Gross primary productivity 500 m 8 day 2000–Now Global
MCD43 BRDF-adjusted reflectance 1000/500 m 8 day/16 day 2000–Now Global
MOD44 veg. cover conversion 250 m Annual 2000–Now Global
MCD45 thermal anomalies and fire 500 m 30 day 2000–Now Global
ASTER
L1 T radiance 15/30/90 m 1 day 2000–Now Global
Global emissivity 100 m Once 2000–2010 Global
Other imagery
PROBA-V top of canopy reflectance 100/300 m 2 day 2013–Now Global
EO-1 hyperion hyperspectral radiance 30 m Targeted 2001–Now Global
DMSP-OLS nighttime lights 1 km Annual 1992–2013 Global
USDA NAIP aerial imagery 1m Sub-annual 2003–2015 CONUS
Topography
Shuttle Radar Topography Mission 30 m Single 2000 60°N–54°S
USGS National Elevation Dataset 10 m Single Multiple United States
USGS GMTED2010 7.5″ Single Multiple 83°N–57°S
GTOPO30 30″ Single Multiple Global
ETOPO1 1′ Single Multiple Global
Landcover
GlobCover 300 m Non-periodic 2009 90°N–65°S
USGS National Landcover Database 30 m Non-periodic 1992–2011 CONUS
UMD global forest change 30 m Annual 2000–2014 80°N–57°S
JRC global surface water 30 m Monthly 1984–2015 78°N–60°S
GLCF tree cover 30 m 5 year 2000–2010 Global
USDA NASS cropland data layer 30 m Annual 1997–2015 CONUS
Weather, precipitation & atmosphere
Global precipitation measurement 6′ 3h 2014–Now Global
TRMM 3B42 precipitation 15′ 3h 1998–2015 50°N–50°S
CHIRPS precipitation 3′ 5 day 1981–Now 50°N–50°S
NLDAS-2 7.5′ 1h 1979–Now North America
GLDAS-2 15′ 3h 1948–2010 Global
NCEP reanalysis 2.5° 6h 1948–Now Global
ORNL DAYMET weather 1 km Annual 1980–Now North America
GRIDMET 4 km 1 day 1979–Now CONUS
NCEP global forecast system 15′ 6h 2015–Now Global
NCEP climate forecast system 12′ 6h 1979–Now Global
WorldClim 30″ 12 images 1960–1990 Global
NEX downscaled climate projections 1 km 1 day 1950–2099 North America
Population
WorldPop 100 m 5 year 2010–2015 Multiple
GPWv4 30″ 5 year 2000–2020 85°N–60°S
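The factor-of-two pyramiding that Section 3 describes for ingested images can be sketched in a few lines. The toy below is a simplified pure-Python illustration of that rule — average continuous-valued bands, take the mode for discrete-valued ones, and stop once the image fits in a single tile. All names are hypothetical; Earth Engine's actual ingestion pipeline is internal and not reproduced here.

```python
from collections import Counter

def downsample(band, discrete=False):
    """Halve resolution: each output pixel summarizes a 2x2 input block."""
    h, w = len(band), len(band[0])
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            block = [band[y][x]
                     for y in (i, min(i + 1, h - 1))
                     for x in (j, min(j + 1, w - 1))]
            if discrete:  # e.g. classification labels: use the mode
                row.append(Counter(block).most_common(1)[0][0])
            else:         # continuous values: average
                row.append(sum(block) / len(block))
        out.append(row)
    return out

def build_pyramid(band, tile_size=256, discrete=False):
    """Level 0 is full resolution; each level halves it until one tile suffices."""
    levels = [band]
    while len(levels[-1]) > tile_size or len(levels[-1][0]) > tile_size:
        levels.append(downsample(levels[-1], discrete))
    return levels
```

Because each level is one quarter the size of the one below it, the whole pyramid adds only about a third more storage while keeping every power-of-two scale ready for web-map-style access.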
as classification labels, are sampled using one of min, mode, max or fixed sampling. When a portion of data from an image is requested for computation at a reduced resolution, only the relevant tiles from the most appropriate pyramid level need to be retrieved from the tile database. This power-of-two downscaling makes data available at a variety of scales without introducing significant storage overhead, and aligns with common usage patterns in web-based mapping.

4. System architecture

Earth Engine is built on top of a collection of enabling technologies available within the Google data center environment, including the Borg cluster management system (Verma et al., 2015); the Bigtable (Chang et al., 2008) and Spanner (Corbett et al., 2013) distributed databases; Colossus, the successor to the Google File System (Ghemawat et al., 2003; Fikes, 2010); and the FlumeJava framework for parallel pipeline execution (Chambers et al., 2010). Earth Engine also interoperates with Google Fusion Tables (Gonzalez et al., 2010), a web-based database that supports tables of geometric data (points, lines, and polygons) with attributes.

A simplified system architecture is shown in Fig. 2. The Earth Engine Code Editor and third-party applications use client libraries to send interactive or batch queries to the system through a REST API. On-the-fly requests are handled by Front End servers that forward complex sub-queries to Compute Masters, which manage computation distribution among a pool of Compute Servers. The batch system operates in a
Table 2
Earth Engine function summary (operation group, example operations, and the granularity at which operations are applied).

Numerical operations
Primitive operations add, subtract, multiply, divide, etc. Per pixel/per feature
Trigonometric operations cos, sin, tan, acos, asin, atan, etc.
Standard functions abs, pow, sqrt, exp, log, erf, etc.
Logical operations eq, neq, gt, gte, lt, lte, and, or
Bit/bitwise operations and, or, xor, not, bit shift, etc.
Numeric casting int, float, double, uint8, etc.
Array/matrix operations
Elementwise operations (numeric operations as above) Per pixel/per feature
Array manipulation Get, length, cat, slice, sort, etc.
Array construction Identity, diagonal, etc.
Matrix operations Product, determinant, transpose, inverse, pseudoinverse, decomposition, etc.
Reduce and accumulate Reduce, accum
Machine learning
Supervised classification and regression Bayes, CART, Random Forest, SVM, Perceptron, Mahalanobis, etc. Per pixel/per feature
Unsupervised classification K-Means, LVQ, Cobweb, etc.
Other per-pixel image operations
Spectral operations Unmixing, HSV transform, etc. Per pixel
Data masking Unmask, update mask, etc.
Visualization Min/max, color palette, gamma, SLD, etc.
Location Pixel area, pixel coordinates, etc.
Kernel operations
Convolution Convolve, blur, etc. Per image tile
Morphology Min, max, mean, distance, etc.
Texture Entropy, GLCM, etc.
Simple shape kernels Circle, rectangle, diamond, cross, etc.
Standard kernels Gaussian, Laplacian, Roberts, Sobel, etc.
Other kernels Euclidean, Manhattan and Chebyshev distance, arbitrary kernels and combinations
Other Image Operations
Band manipulation Add, select, rename, etc. Per image
Metadata properties Get, set, etc.
Derivative Pixel-space derivative, spatial gradient
Edge detection Canny, Hough transform
Terrain operations Slope, aspect, hillshade, fill minima, etc.
Connected components Components, component size
Image clipping Clip
Resampling Bilinear, bicubic, etc.
Warping Translate, changeProj
Image registration Register, displacement, displace
Other tile-based operations Cumulative cost, medial axis, reduce resolution with arbitrary reducers, etc.
Image aggregations Sample region(s), reduce region(s) with arbitrary reducers
Reducers
Simple Count, distinct, first, etc. Context-dependent
Mathematical sum, product, min, max, etc.
Logical Logical and/or, bitwise and/or
Statistical Mean, median, mode, percentile, standard deviation, covariance, histogram, etc.
Correlation Kendall, Spearman, Pearson, Sen's slope
Regression Linear regression, robust linear regression
Geometry Operations
Types Point, LineString, Polygon, etc. Per-feature
Measurements Length, area, perimeter, distance, etc.
Constructive operations Intersection, union, difference, etc.
Predicates Intersects, contains, withinDistance, etc.
Other operations Buffer, centroid, transform, simplify, etc.
Table/collection operations
Basic manipulation Sort, merge, size, first, limit, distinct, flatten, remap, etc. Streaming
Property filtering eq, neq, gt, lt, date range, and, or, not, etc.
Spatial filtering Intersects, contains, withinDistance, etc.
Parallel processing Map, reduce, iterate
Joins Simple, inner, grouping, etc.
Vector/raster operations
Rasterization Paint/draw, distance Per tile
Spatial interpolation Kriging, IDW interpolation
Vectorization reduceToVectors Scatter/gather
Other data types
Number, string, list, dictionary, date, daterange, projection, etc. Context-dependent
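Several of the groups in Table 2 — property filtering, spatial filtering, and reducers — are designed to compose: a collection is first narrowed by filters, then collapsed by a reducer. The sketch below mimics that pattern in plain Python over a toy metadata "catalog", echoing the Landsat 7/Iowa selection example from Section 3. The helpers, field names and values here are hypothetical stand-ins for illustration, not the Earth Engine API.

```python
def filter_collection(images, **criteria):
    """Keep images whose metadata satisfies every criterion.

    A criterion value that is callable is treated as a predicate
    (e.g. cloud_cover=lambda c: c < 70); otherwise equality is required.
    """
    kept = []
    for img in images:
        matches = all(
            want(img.get(key)) if callable(want) else img.get(key) == want
            for key, want in criteria.items()
        )
        if matches:
            kept.append(img)
    return kept

def reduce_collection(images, key, reducer):
    """Collapse one property across a collection with a reducer function."""
    return reducer([img[key] for img in images])

# A toy "collection": per-scene metadata records (invented values).
catalog = [
    {"sensor": "L7", "year": 2010, "doy": 85, "cloud_cover": 40, "ndvi": 0.61},
    {"sensor": "L7", "year": 2011, "doy": 90, "cloud_cover": 80, "ndvi": 0.42},
    {"sensor": "L8", "year": 2014, "doy": 82, "cloud_cover": 10, "ndvi": 0.55},
    {"sensor": "L7", "year": 2012, "doy": 99, "cloud_cover": 15, "ndvi": 0.70},
]

# Landsat 7 scenes, day-of-year 80-104, 2010-2012, under 70% cloud cover...
subset = filter_collection(
    catalog,
    sensor="L7",
    year=lambda y: 2010 <= y <= 2012,
    doy=lambda d: 80 <= d <= 104,
    cloud_cover=lambda c: c < 70,
)
# ...then collapse the selection with a mean reducer.
mean_ndvi = reduce_collection(subset, "ndvi", lambda v: sum(v) / len(v))
```

In Earth Engine itself the reducer step is what gets parallelized (Sections 5.2 and 5.3); the filter step only manipulates metadata and is cheap by comparison.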
more complex calculation without requiring the user to pre-specify which pixels will be needed from it. Reprojection and resampling to the requested output projection is by default performed using nearest-neighbor resampling of the input(s), to preserve spectral integrity, selecting pixels from the next-highest-resolution pyramid level of each input. However, when the user has preferences for how this reprojection is managed, they have the option of precisely controlling the projection grid and can choose from bilinear and bicubic sampling modes.

This approach encourages an interactive and iterative mode of data exploration and algorithm development. Once a user has developed an algorithm that they would like to apply at scale, they may submit a
batch-processing request to Earth Engine to compute the complete result and materialize it either as an image in Earth Engine or as one or more image, table, or video files for download.

5. Data distribution models

The functions in the Earth Engine library utilize several built-in parallelization and data distribution models to achieve high performance. Each of these models is optimized for a different data access pattern.

5.1. Image tiling

Many raster processing operations used in remote sensing are local: the computation of any particular output pixel depends only on input pixels within some fixed distance. Examples include per-pixel operations such as band math or spectral unmixing, as well as neighborhood operations such as convolution or texture analysis. These operations can easily be processed in parallel by subdividing an area into tiles and computing each independently. Processing each output tile usually requires retrieving only one or a small number of tiles for each input. This fact, combined with pyramided inputs and judicious caching, allows for fast computation of results at any requested scale or projection. As previously mentioned, inputs are reprojected on the fly as needed to match the requested output projection. However, if the user determines that using downsampled or reprojected inputs is undesired, they are free to explicitly specify computation in the input's projection and scale.

Most tile-based operations are implemented in Earth Engine using one of two strategies, depending on their computational cost. Expensive operations, and operations that benefit significantly from computing an entire tile at once, write results into a tile-sized output buffer. Tiles are typically 256 × 256 pixels, to match the tiling size of the input pre-processing.

Inexpensive per-pixel operations are implemented using a pixel-at-a-time interface in which the image processing operations in a graph directly invoke one another. This structure is designed to take advantage of the fact that these operations execute in a Java Virtual Machine (JVM) environment with a Just-In-Time (JIT) compiler that extracts and compiles sequences of function calls that occur repeatedly. The result is that in many cases, arbitrary chains of primitive image operations such as band math can execute almost as efficiently as hand-built compiled code. Experiments detailing these efficiency gains are discussed in Section 6.

5.2. Spatial aggregations

Just as some classes of computation are inherently local, others are inherently non-local, such as the computation of regional or global statistics, raster-to-vector conversion, or sampling an image to train a classifier. These operations, or portions of them, can often still be performed in parallel, but computing the final result requires aggregating together many sub-results. For example, computing the mean value of an entire image can be performed by subdividing the image, computing sums and counts in parallel over each portion, and then summing these partial sums and counts to compute the desired result.

In Earth Engine these types of computations are executed as distributed processes using a scatter-gather model. The spatial region over which an aggregation is to be performed is divided into subregions that are assigned to workers in a distributed worker pool, to be evaluated in batches. Each worker fetches or computes the input pixels that it needs and then runs the desired accumulation operation to compute its partial results. These results are sent back to the master for this computation, which combines them and transforms the result into the final form. For example, when computing a mean value each worker will compute sums and counts, the master collects and sums these intermediates, and the final result is the total sum divided by the total count.

This model is very similar to a traditional Map/Reduce with a fixed pool of mappers and a single reducer; however, the user need not be aware of this implementation and need only specify the map projection, resolution and spatial region in which to perform the operation, which in turn determine the grid in which the input pixels will be computed and the number of subregions. Typically each subregion is a multiple of the default input tile size (usually 1024 × 1024 pixels) to minimize RPC overhead during these computations. However, because of the large range of computational complexity of the intermediate products over which users might be attempting to aggregate, controls were introduced to allow users to adjust this multiple should their computation require it, e.g. due to per-worker memory limitations.

5.3. Streaming collections

Another common operation in the processing of large remote-sensing datasets is time-series analysis. The same statistical aggregation operations that can be applied spatially can also be applied temporally over the images in a collection to compute per-pixel statistics of an entire image stack through time. These operations are performed using a combination of tiling and aggregation. Each output tile is computed in
parallel using lazy image evaluation, in the manner described above. Within each tile, an aggregation operation is performed at each pixel. Tiles of pixel data from the images in the input collection are requested in batches and "streamed" one at a time through the per-pixel aggregators. Once all inputs that intersect the output tile have been processed, the final transformation is applied at each pixel to generate the output result.

This distribution model can be fast and efficient for aggregations that have a small intermediate state (e.g. computing the minimum value), but it can be prohibitively memory-intensive for those that don't (e.g. Pearson's correlation, which requires storing the complete data series at each pixel prior to computing the final result). Streaming through even very large collections can still be fast as long as the size of a tile is significantly smaller than a full image. For example, the entire stack of Landsat 5, 7 and 8 collections, collectively containing more than 5 million images, is less than 2000 tiles deep at any point, and only about 500 deep on average.

5.4. Caching and common sub-expression elimination

Many processing operations in Earth Engine can be expensive and data-intensive, so it can be valuable to avoid redundant computation. For example, a single user viewing results on a map will trigger multiple independent requests for output tiles, all of which frequently depend on one or more common sub-expressions, such as a large spatial aggregation or the training of a supervised classifier. To avoid re-computing values that have already been requested, expensive intermediate results are stored in a distributed cache using a hash of the subgraph as a cache key.

While it is possible that multiple users could share an item in the cache, it is uncommon for two separate users to independently make identical queries. However, it is very common for a single user to repeat the same queries during incremental algorithm development and thus to benefit from this mechanism. The cache is also used as a form of shared memory during distributed execution of a single query, storing the intermediate results corresponding to subgraphs of the query.

When subsequent requests for the same computation arrive, the earlier computation may already have completed or it may still be in progress. Previously computed results are simply retrieved and returned by checking the cache prior to starting expensive operations. To handle the case in which the earlier computation is still in progress, all computations are sent to distributed workers via a small number of computation master servers. These servers track the computations that are executing in the cluster at any given moment. When a new query arrives that depends on some computation already in progress, that query will simply join the original query in waiting for the computation to complete. Should a compute master fail, handles to in-progress computations could be lost, possibly allowing redundant computations to start, but only if a query is re-requested before the existing ones finish.

6. Efficiency, performance, and scaling

Earth Engine takes advantage of the Java Just-In-Time (JIT) compiler to optimize the execution of chains of per-pixel operations that are common in image processing. To evaluate the efficiency gains provided by the JIT compiler, a series of experiments were conducted to compare the performance of three execution models: executing a computation graph in Java using the JIT compiler; executing a graph using a similar general implementation in C++; and finally, specialized native C++ code in which the same calls are made directly instead of through a graph, thereby avoiding function virtualization. Five test cases, each testing a different type of image computation graph, were explored:

• SingleNode: A trivial graph with a single node consisting of an image data buffer. This test simply computes the sum of all the values in the buffer.
• NormalizedDifference: A graph that computes the normalized difference of two input buffers. This small-graph scenario contains five nodes in total: two input nodes, one sum, one product, and one quotient.
• DeepProduct: A graph that consists of 64 binary product nodes in a chain, computing the product of 65 input nodes.
• DeepCosineSum: A graph with the same structure as DeepProduct, but where each node computes the more expensive binary operation cos(a + b).
• SumOfProducts: A graph that computes the sum over all pairwise products of 40 inputs. This graph has 40 input nodes, 780 product nodes, and a tree of 779 sum nodes. Here the total number of nodes is much larger than the number of input nodes, allowing us to evaluate the performance of complex graphs of primitive operations on fixed amounts of input data, a common real-world scenario.

Each of these tests was performed on a single 256 × 256-pixel tile using a single-threaded execution environment on an Intel Sandy Bridge processor at 2.6 GHz, a configuration that is representative of commercial cloud data center environments, with all non-essential system services disabled to minimize profiling noise. The results, summarized in Table 3, show that in 4 out of 5 of these common test cases, Java with the JIT compiler outperforms similar dynamic graph-based computation in C++ by as much as 50%, and in one case it even outperforms a direct C++ implementation.

Table 3
Results from Java JIT vs. C++ efficiency tests.

Test case — C++ function — C++ graph — Java graph
SingleNode 0.057 ms 0.17 ms 0.056 ms
NormalizedDifference 0.40 ms 0.95 ms 0.41 ms
DeepProduct 18 ms 55 ms 43 ms
DeepCosineSum 160 ms 200 ms 240 ms
SumOfProducts 110 ms 790 ms 360 ms

6.1. System throughput performance

In the Google data center, CPUs are abundant. In this environment raw efficiency, while still important, matters less than the ability to efficiently distribute complex computations across many machines, and much of Earth Engine's performance is due to its ability to marshal and manage a large number of CPUs on a user's behalf. There is a hard upper limit on the efficiencies that can be achieved through code or query optimization, but there are fewer limits on the additional computing resources that can be brought to bear.

Experiments were conducted to demonstrate Earth Engine's ability to scale horizontally (Fig. 4). In this test, two large collections of Landsat images were reprojected to a common projection, temporally aggregated on a per-pixel basis and spatially aggregated down to a single number while varying the number of CPUs per run. The two collections consisted of all available Landsat 8 Level-1T images acquired from 2014-01-01 to 2016-12-31, covering CONUS (26,961 scenes, 1.21 trillion pixels) and Africa (77,528 scenes, 3.14 trillion pixels). Tests were run using shared production resources over multiple days and times to capture the natural variability due to load on the fleet. The results show nearly linear scaling of throughput with the number of machines.
topics such as global forest change (Hansen et al., 2013), global surface water change (Pekel et al., 2016), crop yield estimation (Lobell et al., 2015), rice paddy mapping (Dong et al., 2016), urban mapping (Zhang et al., 2015; Patel et al., 2015), flood mapping (Coltin et al., 2016), fire recovery (Soulard et al., 2016) and malaria risk mapping (Sturrock et al., 2014). It has also been integrated into a number of third-party applications, for example analyzing species habitat ranges (Map of Life, 2016), monitoring climate (Climate Engine, 2016), and assessing land use change (Collect Earth, 2016). The details of a few of these applications will illustrate how Earth Engine's capabilities are being leveraged.

Hansen et al. (2013) characterized forest extent, loss, and gain from 2000 to 2012 using decision trees generated from an extensive set of training data and a deep stack of metrics computed from large collections of Landsat scenes. Filtering operations supported by the data catalog reduced the 1.3 M Landsat scenes available at the time to 654,178 growing-season scenes from the study period. These images were then screened for clouds, cloud shadows, and water, and converted from raw Landsat digital numbers to normalized top-of-atmosphere reflectance. All necessary data access, format conversion, cartographic reprojection, and resampling were handled automatically by the system. Operations in the API were used to compute input metrics, such as per-band percentile values and linear regressions of reflectance values versus image date. These metrics, along with training data, were used to generate decision trees, which were applied to the metrics globally to produce the final output data. Those results were used for publication and made available as part of the Earth Engine catalog for further analysis by others.

Many other users, both scientific and operational, have since successfully built on the Hansen results to produce derivative results using Earth Engine. Global Forest Watch (2014) incorporated it into an interactive analysis application, using Earth Engine to perform on-the-fly calculations of summary statistics. Joshi et al. (2016) used it to track changes in tiger habitat by extracting forest loss within protected areas for each year, finding that the areas most suitable for doubling the wild tiger population were also the best protected.

In another example, Lobell et al. (2015) related the output of hundreds of crop model simulations to vegetation indices, such as the green chlorophyll vegetation index (GCVI), that are measurable with satellite data. They then related simulated yields to measured vegetation indices and weather for a set of dates early in the growing season and late in the growing season. This resulted in a table of regression coefficients for each pairwise combination of early/late dates. They used Earth Engine to select, on a per-pixel basis, the best Landsat scenes for the early and late periods, by first calculating reflectance of the Landsat scenes using LEDAPS (Masek et al., 2006), automatically removing cloudy scenes using Earth Engine's SimpleCloudScore function, computing GCVI values, and finally selecting the scenes with the highest GCVI. Once the best pair of Landsat scenes was determined for a given pixel, weather data stored in Earth Engine and the GCVI were used to compute the predicted yield. This method was applied to roughly 6.75 million hectares of maize and soy fields in the Midwestern United States to compute annual yields from 2008 to 2012. The total computation completed in approximately 2 min per 10,000 km² per year.

8. Challenges and future work

One of the benefits of using Earth Engine is that the user is almost completely shielded from the details of working in a parallel processing environment. The system handles and hides nearly every aspect of how a computation is managed, including resource allocation, parallelism, data distribution, and retries. These decisions are purely administrative, and none of them can affect the result of a query, only the speed at which it is produced. The price of liberation from these details is that the user is unable to influence them: the system is entirely responsible for deciding how to run a computation. This results in some interesting challenges in both the design and use of the system.

8.1. Scaling challenges

The Earth Engine system as a whole can manage extremely large computations, but the underlying infrastructure is ultimately clusters of low-end servers (Barroso et al., 2013). In this environment, the option of configuring arbitrarily large machines is not available, and there is a hard limit on the amount of data that can be brought into any individual server. This means that users can only express large computations by using the parallel processing primitives provided in the Earth Engine library, and some non-parallelizable operations simply cannot be performed effectively in this environment. Additionally, the requirement to express computations using the Earth Engine library means that existing algorithms and workflows have to be converted to use the Earth Engine API to utilize the platform at all.

Earth Engine's API by design makes it easy to express extremely large computations. For example, it would take only a few lines of Earth Engine code to request a global aggregation of the 800-billion-pixel Hansen forest cover map: while this computation is straightforward, simply retrieving all the input pixels from storage involves a substantial quantity of resources for a significant length of time. By chaining operations on large collections of data over a wide range of spatial scales, it is easy to express queries that vary in cost by many orders of magnitude and to describe computations that are impractical even in an advanced parallel computing environment.

Since Earth Engine is a shared computing resource, limits and other defenses are necessary to ensure that users do not monopolize the system. For interactive sessions, Earth Engine imposes limits on the maximum duration of requests (currently 270 s), the total number of simultaneous requests per user (40), and the number of simultaneous executions of certain expensive operations such as spatial aggregations (25).
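This style of admission control can be sketched as follows. The class and method names (`InteractiveLimits`, `try_admit`) are invented for illustration; only the numeric limits come from the text, and Earth Engine's actual enforcement mechanism is not described at this level of detail.

```python
# A minimal sketch of per-user admission control of the kind described
# above. InteractiveLimits and try_admit are invented names; the default
# limits (270 s, 40 requests, 25 aggregations) are the interactive limits
# quoted in the text.
import threading

class InteractiveLimits:
    """Caps simultaneous requests and expensive operations per user."""
    def __init__(self, max_requests=40, max_aggregations=25, timeout_s=270):
        self.requests = threading.BoundedSemaphore(max_requests)
        self.aggregations = threading.BoundedSemaphore(max_aggregations)
        self.timeout_s = timeout_s  # per-request deadline, enforced elsewhere

    def try_admit(self, expensive=False):
        """Admit a request if capacity remains.

        Returns a release callback on success, or None if the user is at
        a limit (the request would be rejected rather than queued).
        """
        if not self.requests.acquire(blocking=False):
            return None
        if expensive and not self.aggregations.acquire(blocking=False):
            self.requests.release()
            return None
        def release():
            if expensive:
                self.aggregations.release()
            self.requests.release()
        return release

# Small capacities to make the behavior visible:
limits = InteractiveLimits(max_requests=2, max_aggregations=1)
r1 = limits.try_admit(expensive=True)
r2 = limits.try_admit(expensive=True)   # second aggregation refused
assert r1 is not None and r2 is None
r3 = limits.try_admit()                  # a plain request is still admitted
assert r3 is not None
```

Rejecting over-limit interactive requests immediately, rather than queuing them, keeps the shared pool responsive; queuing is reserved for the batch context described below.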
For illustrative purposes, the interactive computational time-limit is sufficient to complete the following workflow within a single timeout: retrieve all Landsat 8 images covering the states of California and Nevada for one year (1177 scenes), use them to compute a maximum-NDVI composite, and from that an average peak-NDVI for each of the 17 IGBP land cover classes over the region (735,000 km²). Note that the bulk of the time for this example is spent in transferring the raw pixels for the full-resolution spatial aggregation; simply creating and displaying the maximum-NDVI composite completes in a few seconds.

None of the interactive limits apply when queries are invoked in a batch context, and jobs that are orders of magnitude larger can run there, but there is still a limit to what each individual machine can accommodate when a request involves tile-based computations that cannot be streamed or further distributed using the current data models. This memory limit does not translate directly into a specific spatial or temporal limit, but a common rule of thumb for the maximum size of these sorts of requests is a stack depth of about 2000 bytes per pixel. The current RPC and caching system imposes an additional limitation that applies in both interactive and batch cases: individual objects to be cached cannot exceed 100 MB in size. This limit is most often encountered when the output of an aggregation operation is large, such as when extracting data to train a machine-learning algorithm, where it may limit the total number of points in a training set.

Batch jobs are each run independently, making it much harder for them to negatively impact each other, but to prevent monopolization, jobs are still managed using a shared queuing system, and under heavy load, jobs may wait in the queue until resources become available.

8.2. Computational model mismatch

While parallelizable operations are very common in remote sensing, there are of course many other classes of operations that are not parallelizable or are not accommodated by the parallel computation constructions available in Earth Engine. The platform is well suited to per-pixel and finite-neighborhood operations such as band math, morphological operations, spectral unmixing, template matching and texture analysis, as well as long chains (hundreds to thousands) of these sorts of operations. It is also highly optimized for statistical operations that can be applied to streaming data, such as computing statistics on a time-series stack of images, and can easily handle very deep stacks this way (i.e., millions of images; trillions of pixels). It performs poorly for operations in which a local value can be influenced by arbitrarily distant inputs, such as watershed analysis or classical clustering algorithms; operations that require a large amount of data to be in hand at the same time, such as training many classical machine learning models; and operations that involve long-running iterative processes, such as finite element analysis or agent-based models. Additionally, data-intensive models that require large volumes of data not already available in Earth Engine could require substantial additional effort to ingest.

These computational techniques can still be applied in Earth Engine, but often with sharp scaling limits. Extending Earth Engine to support new computational models is an active area of research and development. Users with problems that do not match Earth Engine's computational model can run computations elsewhere in the Google Cloud Platform, capitalizing on proximity to the underlying data while still taking advantage of Earth Engine for its data catalog, pre-processing, post-processing, and visualization.

8.3. Client/server programming model

Earth Engine users are often unfamiliar with the client-server programming model. The Earth Engine client libraries attempt to provide a more familiar procedural programming environment, but this can lead to confusion when the user forgets that their local programming environment (e.g. a Python script) is not performing any of the computation itself. The entire chain of operations is recorded by client-side proxy objects and sent to the server for execution, but this means that it is not possible to mix Earth Engine library calls with standard local processing idioms. This includes some basic language features like conditionals and loops that depend on computed values, as well as standard numerical packages. Users can still use these external tools, but they cannot apply them directly to Earth Engine proxy objects, sometimes leading to confusion. Fortunately, these programming errors are usually easy to resolve once identified.

It is worth noting that this style of programming model is becoming increasingly common for large-scale cloud-based computing; it is also used in TensorFlow (Abadi et al., 2016) when constructing and executing graphs.

8.4. Advancing the state of the art

The overarching goal of Earth Engine is to make progress on society's biggest challenges by making it not just possible, but easy, to monitor, track and manage the Earth's environment and resources. Doing so requires providing access not just to vast quantities of data and computational power, but also to increasingly sophisticated analysis techniques, and making them easy to use.

To this end, experiments are ongoing in the integration of deep learning techniques (Abadi et al., 2016) and in facilitating easy access to other scalable infrastructures such as Google Compute Engine (Gonzalez and Krishnan, 2015) and BigQuery (Tigani and Naidu, 2014).
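The proxy-object mechanism described in Section 8.3 can be modeled in a few lines. In this sketch, all names (`ProxyImage`, `constant`, `compute`) are invented for illustration and are not the Earth Engine client API: method calls on a proxy only record nodes in a graph locally, and a stand-in `compute` function plays the role of the server.

```python
# Toy model of client-side proxy objects. ProxyImage, constant, and
# compute are invented names, not the Earth Engine client API. Calls on
# a proxy only record graph nodes; no computation happens until the
# graph is handed to the stand-in "server".

class ProxyImage:
    def __init__(self, op, args):
        self.op, self.args = op, args

    def add(self, other):
        return ProxyImage('add', [self, other])

    def multiply(self, other):
        return ProxyImage('multiply', [self, other])

def constant(value):
    """Wrap a plain value as a leaf node of the graph."""
    return ProxyImage('constant', [value])

def compute(node):
    """Stand-in for the server: evaluate the recorded graph."""
    if node.op == 'constant':
        return node.args[0]
    left, right = (compute(a) for a in node.args)
    if node.op == 'add':
        return left + right
    if node.op == 'multiply':
        return left * right
    raise ValueError('unknown op: %s' % node.op)

expr = constant(2).add(constant(3)).multiply(constant(4))
assert isinstance(expr, ProxyImage)  # still just a recorded graph
assert compute(expr) == 20           # a value exists only after "execution"
```

The sketch also reproduces the confusion described above: a local `if expr > 10:` raises an error, because `expr` is a recorded graph node rather than a number; the comparison would itself have to be expressed as a graph operation and evaluated server-side.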
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., et al., 2016. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Barroso, L.A., Clidaras, J., Hölzle, U., 2013. The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth. Lect. Comput. Archit. 8 (3), 1–154.
Câmara, G., Souza, R., Pedrosa, B., Vinhas, L., Monteiro, A.M.V., Paiva, et al., 2000. TerraLib: technology in support of GIS innovation. Proc. II Brazilian Symposium on GeoInformatics, GeoInfo2000. 2, pp. 1–8.
Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R.R., Bradshaw, R., Weizenbaum, N., 2010. FlumeJava: easy, efficient data-parallel pipelines. ACM SIGPLAN Not. 45 (6), 363–375.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., et al., 2008. Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26 (2), 4.
Climate Engine, 2016. Desert Research Institute, University of Idaho. http://climateengine.org (accessed July 2016).
Collect Earth, 2016. United Nations Food and Agriculture Organization. http://www.openforis.org/tools/collect-earth.html (accessed July 2016).
Coltin, B., McMichael, S., Smith, T., Fong, T., 2016. Automatic boosted flood mapping from satellite data. Int. J. Remote Sens. 37 (5), 993–1015.
Copernicus Data Access Policy, 2016. http://www.copernicus.eu/main/data-access (accessed June 30, 2016).
Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J.J., Ghemawat, S., et al., 2013. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst. (TOCS) 31 (3), 8.
Cossu, R., Petitdidier, M., Linford, J., Badoux, V., Fusco, L., Gotab, B., Hluchy, L., et al., 2010. A roadmap for a dedicated Earth science grid platform. Earth Sci. Inf. 3 (3).
Dong, J., Xiao, X., Menarguez, M.A., Zhang, G., Qin, Y., Thau, D., Biradar, C., Moore, B., 2016. Mapping paddy rice planting area in northeastern Asia with Landsat 8 images, phenology-based algorithm and Google Earth Engine. Remote Sens. Environ. 185, 142–154.
Fikes, A., 2010. Storage Architecture and Challenges. http://goo.gl/pF6kmz (accessed June 30, 2016).
Ghemawat, S., Gobioff, H., Leung, S., 2003. The Google file system. Proc. SOSP 29–43.
Global Forest Watch, 2014. World Resources Institute, Washington, DC. http://www.globalforestwatch.org (accessed June 30, 2016).
Gonzalez, J.U., Krishnan, S.P.T., 2015. Building Your Next Big Thing with Google Cloud Platform: A Guide for Developers and Enterprise Architects. Apress.
Gonzalez, H., Halevy, A.Y., Jensen, C.S., Langen, A., Madhavan, J., Shapley, R., et al., 2010. Google Fusion Tables: web-centered data management and collaboration. ACM SIGMOD 1061–1066.
Hansen, M.C., Potapov, P.V., Moore, R., Hancher, M., Turubanova, S.A., Tyukavina, A., et al., 2013. High-resolution global maps of 21st-century forest cover change. Science 342, 850–853.
Hughes, J.N., Annex, A., Eichelberger, C.N., Fox, A., Hulbert, A., Ronquest, M., 2015. GeoMesa: a distributed architecture for spatio-temporal fusion. SPIE Defense + Security (pp. 94730F–94730F). Int. Soc. Optics Photonics.
Joshi, A.R., Dinerstein, E., Wikramanayake, E., et al., 2016. Tracking changes and preventing loss in critical tiger habitat. Sci. Adv. 2 (4), e1501675.
Lobell, D., Thau, D., Seifert, C., Engle, E., Little, B., 2015. A scalable satellite-based crop yield mapper. Remote Sens. Environ. 164, 324–333.
Loveland, T.R., Dwyer, J.L., 2012. Landsat: building a strong future. Remote Sens. Environ. 122, 22–29.
Map of Life, 2016. http://www.mol.org (accessed June 30, 2016).
Masek, J.G., Vermote, E.F., Saleous, N.E., Wolfe, R., Hall, F.G., Huemmrich, K.F., et al., 2006. A Landsat surface reflectance dataset for North America, 1990–2000. IEEE Geosci. Remote Sens. Lett. 3, 68–72.
Nemani, R., Votava, P., Michaelis, A., Melton, F., Milesi, C., 2011. Collaborative supercomputing for global change science. EOS Trans. Am. Geophys. Union 92 (13), 109–110.
Patel, N., Angiuli, E., Gamba, P., Gaughan, A., Lisini, G., Stevens, F., Tatem, A., Trianni, A., 2015. Multitemporal settlement and population mapping from Landsat using Google Earth Engine. Int. J. Appl. Earth Obs. Geoinf. 35, 199–208.
Pekel, J.F., Cottam, A., Gorelick, N., Belward, A.S., 2016. High-resolution mapping of global surface water and its long-term changes. Nature.
Soulard, C.E., Albano, C.M., Villarreal, M.L., Walker, J.J., 2016. Continuous 1985–2012 Landsat monitoring to assess fire effects on meadows in Yosemite National Park, California. Remote Sens. 8 (5), 371.
Sturrock, H.J., Cohen, J.M., Keil, P., Tatem, A.J., Le Menach, A., Ntshalintshali, N.E., Hsiang, M.S., Gosling, R.D., 2014. Fine-scale malaria risk mapping from routine aggregated case data. Malar. J. 13 (1), 1.
Tigani, J., Naidu, S., 2014. Google BigQuery Analytics. John Wiley & Sons.
Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J., 2015. Large-scale cluster management at Google with Borg. Proc. EuroSys 10, 18. ACM.
Whitman, R.T., Park, M.B., Ambrose, S.M., Hoel, E.G., 2014. Spatial indexing and analytics on Hadoop. Proc. 22 ACM SIGSPATIAL. pp. 73–82.
Woodcock, C.E., Allen, A.A., Anderson, M., Belward, A.S., Bindschadler, R., Cohen, W.B., et al., 2008. Free access to Landsat imagery. Science 320, 1011.
Yu, J., Wu, J., Sarwat, M., 2015. GeoSpark: a cluster computing framework for processing large-scale spatial data. Proc. 23 SIGSPATIAL International Conference on Advances in Geographic Information Systems (p. 70). ACM.
Zhang, Q., Li, B., Thau, D., Moore, R., 2015. Building a better urban picture: combining day and night remote sensing imagery. Remote Sens. 7 (9), 11887–11913.