GCP Data Engineer Master Cheat Sheet
At a testing facility you will have 2 hours to solve 50 multiple choice questions on a
computer. You can mark questions you are unsure about and can review them towards
the end of the test. After you submit your test, you will see a provisional result of either
PASS or FAIL. The official results will be sent to you in the next few days. If everything
goes well you'll receive this certification. Good luck!
Ingestion
Transfer Appliance
Transfer large amounts of data quickly and cost-effectively to GCP.
Transfers data into a GCS staging bucket (from there it can be loaded into BigQuery).
Use for offline data transfer when the data size is >= 20 TB or an online upload would take more than 1 week.
Workflow
Receive Transfer Appliance and configure it and connect it to your network.
Before data is stored on the appliance, it is deduplicated, compressed, and encrypted with the AES-256
algorithm using a password and passphrase specified by the user.
Data integrity check is performed.
Transfer Appliance is shipped back to Google.
Encrypted data is copied to GCS staging bucket.
Use Cases
Data Collection
o Geographical, environmental, medical, or financial data for analysis.
o Need to transfer data from researchers, vendors, or other sites to GCP.
Data Replication
o Supporting current operations with existing on prem infrastructure but
experimenting with cloud.
o Allows you to decommission duplicate datasets, test cloud infrastructure, and expose
data to machine learning analysis.
Data Migration
o Offline data transfer is suited for moving large amounts of existing backup
images and archives to ultra-low-cost, highly durable, and highly available
archival storage (Nearline/Coldline).
Storage Transfer Service
Use cases:
Backup data to GCS from other storage providers
Move data from one GCS bucket to another (enables availability to different groups of
users or applications)
Periodically move data as part of a processing pipeline or analytical workflow
Storage
A bucket's default storage class can be changed, but existing objects (files) within it retain their
storage class.
GCS is a common storage layer that both Dataproc and BigQuery can work with.
Signed URLs
Give temporary, time-limited access to an object; users do not need to be GCP users.
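A minimal sketch of generating a signed URL with the google-cloud-storage Python client; the bucket and object names are hypothetical, and the credentials used must include a private key for signing:

```python
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("reports/q1.pdf")  # hypothetical names

# V4 signed URL valid for 15 minutes; the requester needs no Google account.
url = blob.generate_signed_url(
    version="v4", expiration=timedelta(minutes=15), method="GET"
)
print(url)
```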
Storage Classes
Multi-Regional
Serving website content, interactive workloads, mobile game/gaming applications
Highest availability
Geo-redundant: Stores data in at least 2 regions separated by at least 100 miles within
the multi-regional location of the bucket.
Regional
Storing data used by Compute Engine
Better performance for data-intensive computation
Nearline
Accessed once a month max
30 day min. storage duration
Ex. Data backup, disaster recovery, archival storage
Coldline
Accessed once a year max
90 day min. storage duration
Ex. Data stored for legal or regulatory reasons
Versioning
Needs to be enabled
Things this enables:
o List archived versions of an object
o Restore live version of an object from an older state
o Permanently delete an archived version
Archived versions retain their ACLs and do not necessarily have the same permissions as the live
version of the object.
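A small sketch, assuming the google-cloud-storage Python client and a hypothetical bucket name, of enabling versioning and listing archived generations:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")  # hypothetical bucket

# Enable object versioning; overwritten/deleted objects become archived generations.
bucket.versioning_enabled = True
bucket.patch()

# List archived generations alongside the live object.
for blob in client.list_blobs("my-bucket", versions=True):
    print(blob.name, blob.generation)
```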
IAM vs ACLs
IAM
o Apply to all objects within a bucket.
o Standard Roles
Storage.objectCreator
Storage.objectViewer
Storage.objectAdmin
Storage.admin – full control over buckets
Can apply to a specific bucket.
o Primitive Roles
Encryption
Server-side encryption
Layers on top of default encryption
Occurs after GCS receives data, but before written to disk
Generate and manage keys using Cloud Key Management Service (KMS)
KMS can be independent from the project that contains buckets (separation of duties)
Uses service accounts to encrypt/decrypt
Cloud SQL exports to GCS and Dataflow do not support this currently
Client-side encryption
Cloud SQL
Managed/No ops relational database (PostgreSQL, MySQL)
o Low latency
o Doesn’t scale well beyond GB’s
o Data structures and underlying infrastructure required
Use Case:
o Medical Records
o Blogs
Read Replicas
IAM
Cloudsql.admin
Cloudsql.editor
o Can’t see or modify permissions, users, or ssl certs.
o No ability to import data or restore from backup, nor clone, delete, or promote
instances.
o No ability to delete databases, replicas, or backups.
Cloudsql.viewer
o Read only access to all Cloud SQL instances.
Cloudsql.client
o Connectivity access to Cloud SQL instances from App Engine and the Cloud SQL
Proxy.
o Not required for accessing an instance using IP addresses.
Cloud Spanner
Distributed and scalable solution for RDBMS (more expensive)
Horizontal scaling: Add more machines
Use when:
o Need high availability
o Strong consistency
o Transactional support for reads and writes (especially writes)
Don’t use when:
o Data is not relational, or not even structured
o Want an open source RDBMS
Data Model
Specifies a parent-child relationship for efficient storage
Interleaved representation (like HBase)
Interleaving
Rows are stored in sorted order of primary key values
Child rows are inserted between parent rows with that key prefix
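A sketch of the parent-child (interleaved) data model using the google-cloud-spanner Python client; the instance, database, and table names are hypothetical:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical IDs

# Albums rows are physically interleaved under their parent Singers row,
# so a parent and its children share the primary-key prefix (SingerId).
op = database.update_ddl([
    """CREATE TABLE Singers (
         SingerId INT64 NOT NULL,
         Name STRING(MAX)
       ) PRIMARY KEY (SingerId)""",
    """CREATE TABLE Albums (
         SingerId INT64 NOT NULL,
         AlbumId INT64 NOT NULL,
         Title STRING(MAX)
       ) PRIMARY KEY (SingerId, AlbumId),
       INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
])
op.result()  # wait for the schema change to finish
```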
Hotspotting
Need to choose primary keys carefully (like HBase)
Do not use monotonically increasing values, otherwise successive writes will all go to the same location (hotspot).
o No timestamps (also sequential)
Use descending order if timestamps are required.
Use hash of key value if using naturally monotonically ordered keys (serial in postgres)
Splits
Parent-child relationship can get complicated (i.e. 7 layers deep)
Spanner is distributed – uses “splits”
Split – Range of rows that can be moved around independent of other rows
Added to distribute high read-write data (to break up hotspots)
Secondary Indices
Key-based storage ensures fast sequential scan of keys (like HBase)
Data Types
Transactions
Supports serializability
o All transactions appear if they were executed in a serial order, even if some
operations of distinct transactions actually occurred in parallel.
Stronger than traditional ACID
o Transactions commit in an order that is reflected in their commit timestamps
o Commit timestamps are “real time”
2 Transaction Modes
o Locking read-write
Slow
Only one that supports writing data
o Read-only
Fast
Does not acquire locks
If making a one-off read use “Single Read Call”
o Fastest, no transaction checks needed!
Staleness
Can set timestamp bounds
o Strong: Read latest data
o Bounded Staleness: Read version no later than …
Could be in past or future
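A sketch of strong vs. stale reads with the google-cloud-spanner Python client (hypothetical instance, database, and table names):

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical IDs

# Strong read: returns the latest committed data.
with database.snapshot() as snap:
    rows = list(snap.execute_sql("SELECT SingerId, Name FROM Singers"))

# Stale read: returns data as of 15 seconds ago, which can reduce latency.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snap:
    stale_rows = list(snap.execute_sql("SELECT SingerId, Name FROM Singers"))
```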
Multitenancy
Replicas
Paxos-based replication scheme in which voting replicas take a vote on every write
request before it is committed.
Writes
o Client write requests always go to leader replica first, even if a non-leader is
closer geographically.
o Leader logs incoming write, forwards it in parallel to other replicas that are
eligible to vote.
o Each replica completes its write and then responds back to the leader with a vote on
whether the write should be committed.
o Write is committed when a quorum agrees.
Reads
o Reads that are part of a read-write transaction are served from the leader replica,
since the leader maintains the locks required to enforce serializability.
o Single read and reads in a read-only transaction might require communication
with leader, depending on concurrency mode.
Single-region instances can only use read-write replicas. (3 in prod)
Types
o Read-write
Maintain a full copy of your data.
Can vote, can become leader, can serve reads
o Read-only
Maintain a full copy of your data, which is replicated from read-write
replicas.
Can serve reads
Do not participate in voting to commit writes -> location of read-only
replicas never contribute to write latency.
Allow scaling of read capacity without increasing quorum size needed for
writes (reduces total time of network latency for writes)
o Witness
Can vote
Easier to achieve quorums for writes without the storage and compute
resources required by read-write replicas to store a full copy of data and
serve reads.
Production Environment
At least 3 nodes
Best performance when each CPU is under 75% utilization
Architecture
Nodes handle computation for queries, similar to that of BigTable.
o Each node serves up to 2 TB of storage.
o More nodes = more CPU/RAM = increased throughput.
Storage is replicated across zones (and regions, where applicable).
o Like BigTable, storage is separate from computing nodes.
Whenever an update is made to a database in one zone/region, it is automatically
replicated across zones/regions.
o Automatic synchronous replications.
When data is written, you know it has been written.
Any reads guarantee data accuracy.
IAM
Project, instance, or database level
Roles/spanner._____
o Admin – Full access to Spanner resources
o Database Admin – Create/edit/delete databases, grant access to databases
o Database Reader – Read databases and execute SQL queries and view schema.
o Database User – Read and write to DB, execute sql on DB including DML and
Partitioned DML, view and update schema.
o Viewer – View that instances and databases exist
Cannot modify or read from database.
Datastore
Typically, not used for either OLTP or OLAP
Serverless
Its specialty is that query performance depends on the size of the returned result set,
not the size of the data set.
Suitable for:
o Atomic transactions
Can execute a set of operations where all succeed, or none occur.
o ACID transactions, SQL-like queries.
o Structured data.
o Hierarchical document storage such as HTML and XML
Query
Performance
Comparison to RDBMS
Entity Groups
Hierarchical relationship between entities.
Ancestor Paths and Child Entities.
Index Types
Built in – default option
o Allows single property queries
Composite – specified with index configuration file (index.yaml)
Deleting Index
datastore indexes cleanup
o Deletes all indexes for the production Datastore mode instance that are not
mentioned in the local version of index.yaml.
Exploding Indexes
Default – create entry for every possible combination of property values
Results in higher storage and degraded performance
Solutions
o Use custom index.yaml file to narrow index scope
o Do not index properties that don’t need indexing
Full Indexing
Built-in indices on each property (~column) of each entity (~row) of each kind (~table).
Composite indices on multiple property values.
Can exclude properties from indexing if certain it will never be queried.
Each query is evaluated using its “perfect index”
Perfect Index
Given a query, which is the index that most optimally returns query results?
Depends on following (in order)
o Equality filter
o Inequality filter
o Sort conditions if any specified.
Multi-Tenancy
Separate data partitions for each client organization.
Can use the same schema for all clients, but vary the values.
Specified via a namespace (inside which kinds and entities can exist)
Transaction Support
Can optionally use transactions – not required
Stronger than BigQuery and BigTable
Consistency
Strongly consistent
o Return up to date result, however long it takes
o Ancestor query
Those that execute against an entity group
Can set the read policy of a query to make this eventually consistent.
o key-value operations
Eventually consistent
o Faster, but might return stale data
o Global queries/projections
Exporting Entities
Deploy App Engine service that calls Datastore mode managed export feature.
Can run this service on a schedule with an App Engine Cron Service.
Cloud Firestore
IAM Roles
Datastore.owner with Appengine.appAdmin
o Full access to Datastore mode.
Datastore.owner without Appengine.appAdmin
o Cannot enable Admin access
o Cannot see if Datastore mode Admin is enabled
o Cannot disable Datastore mode writes
o Cannot see if Datastore mode writes are disabled.
Datastore.user
o Read/write access to data in Datastore mode database.
o Intended for application developers and service accounts.
Datastore.viewer
o Read access to all Datastore mode resources.
Datastore.importExportAdmin
o Full access to manage import and exports.
Datastore.indexAdmin
o Full access to manage index definitions.
BigTable
HBase equivalent
o Work with it using HBase API
o Advantages over HBase
Scalability (storage autoscales)
Low ops/admin burden
Cluster resizing without downtime
Many more column families before performance drops (~100K)
Stored on Google’s internal store Colossus
Not transactional (can handle petabytes of data)
Fast scanning of sequential key values
Column oriented NoSQL database
o Good for sparse data
Sensitive to hot spotting (like Spanner)
o Data is sorted on key value and then sequential lexicographically similar values
are stored next to each other.
o Need to design key structure carefully.
Designed for Sparse Tables
o Traditional RDBMS issues with sparse data
Can’t ignore with petabytes of data.
Null cells still occupy space.
Use BigTable When:
o Very fast scanning and high throughput
Throughput has linear growth with node count if correctly balanced.
o Non-structured key/value data
o Each data item is < 10MB and total data > 1TB
o Writes are infrequent/unimportant (no ACID) but fast scans crucial
o Time Series data
Avoid BigTable When:
o Need transaction support
o Less than 1TB data (can’t parallelize)
o Analytics/BI/data warehousing
o Documents or highly structured hierarchies
o Immutable blobs > 10MB each
Row-Key
o Uniquely identifies a row
o Can be primitives, structures, arrays
o Represents internally as a byte array
o Sorted in ascending order
o NOTE - Can only query against this key.
Column Family
o Table name in RDBMS
o All rows have the same set of column families
o Each column family is stored in a separate data file
o Set up at schema definition time
Columns can be added on the fly
o Can have different columns for each row
Column
o Columns are units within a column family.
Timestamp
o Support for different versions based on timestamps of same data item. (like
Spanner)
o Omit timestamp gets you the latest data.
Hotspotting
Overloading a node with requests.
Row keys to Use
o Field Promotion
Move the fields you need to search against out of the column data and into the row key.
Use in reverse URL order like Java package names
Keys have similar prefixes, but different endings
o Salting
Hash the key value
o Timestamps as suffix in key (reverse timestamp)
Row Keys to Avoid
o Domain names (as opposed to field promotion)
Causes the common portion to be at the end of the row key, so adjacent
values are not logically related.
o Sequential numeric values.
o Timestamps alone
o Timestamps as prefix of row-key.
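A sketch of writing a row whose key uses field promotion plus a reverse timestamp, with the google-cloud-bigtable Python client; the project, instance, table, column family, and key layout are all hypothetical:

```python
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-data")  # hypothetical IDs

# Field promotion: the searched-on fields (region, device) lead the key.
# Reverse-timestamp suffix: newest readings for a device sort first, and the
# keys are not monotonically increasing, which spreads writes across nodes.
now = datetime.datetime.utcnow()
reverse_ts = 10**13 - int(now.timestamp() * 1000)
row_key = f"eu-west#device-42#{reverse_ts}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("metrics", b"temperature", b"21.7", timestamp=now)
row.commit()
```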
Schema Design
Each table has just 1 index – row key
Rows sorted lexicographically by row key
All operations are atomic at row level
Keep all entity info in a single row.
Related entities should be in adjacent rows
o More efficient reads.
Tables are sparse: Empty columns don’t take up any space.
o Create a very large number of columns even if most are empty in most rows.
Key Visualizer
Provides daily scans that show usage patterns for each table in a cluster.
Makes it easy to check whether your usage patterns are causing undesirable results, such
as hotspots on specific row keys or excessive CPU utilization.
Data Update
Deleting/updating actually write a new row with the desired data.
Append only, cannot update a single field
Tables should be tall and narrow
o Tall – Store changes by appending new rows
o Narrow – Collapse flags into a single column
Prod:
o Standard instance with 1-2 clusters
o 3 or more nodes in each cluster
Use replication to provide high availability
o Replication available, throughput guarantee
Development:
o Low cost instance with 1 node cluster
o No replication
Create Compute Engine instance in same zone as Big Table instance
Resizing
Add and remove nodes and clusters with no downtime.
Architecture
Entire BigTable project is called an instance.
A BigTable instance comprises clusters and nodes
Tables belong to instances
o If multiple clusters, you cannot assign a table to an individual cluster
IAM
Project wide or instance level
Bigtable.admin
Bigtable.user
o App developer or service accounts.
Bigtable.reader
o Data scientists, dashboard generators, and other data analytics.
Bigtable.viewer
o Provides no data access.
o Minimal set of conditions to access the GCP Console for BigTable.
BigQuery
Hive equivalent
No ACID properties
Great for analytics/business intelligence/data warehouse (OLAP)
Fully managed data warehouse
Has connectors to BigTable, GCS, Google Drive, and can import from Datastore backups,
CSV, JSON, and AVRO.
Performance
o Petabyte scale
o High latency
Worse than BigTable and DataStore
Data Model
Dataset = set of tables and views
Table must belong to dataset
Dataset must belong to a project
Table Schema
Can be specified at creation time
Can also specify schema during initial load
Can update schema later too
Query
Standard SQL (preferred) or Legacy SQL (old)
o Standard
Table names can be referenced with backticks
Needed for wildcards
o Cannot use both Legacy and SQL2011 in same query.
o Table partitioning
o Distributed writing to file for output (i.e. file-0001-of-0002)
o User defined functions in JS (UDFJS)
Temporary – Can only use for current query or command line session.
o Query jobs are actions executed asynchronously to load, export, query, or copy
data.
o If you use the LIMIT clause, BigQuery will still process the entire table.
o Avoid SELECT * (full scan), select only columns needed (SELECT * EXCEPT)
o Denormalized Data Benefits
Increases query speed
Makes queries simpler
BUT: Normalization makes dataset better organized, but less performance
optimized.
o Types
Interactive (default)
Query executed immediately
Counts towards
Daily usage
Concurrent usage
Batch
Scheduled to run whenever possible (idle resources)
Don’t count towards limit on concurrent usage.
Data Import
Data is converted into columnar format for Capacitor.
Batch (free)
o web console (local files), GCS, GDS, Datastore backups (particularly logs)
o Other Google services (i.e. Google Ad Manager, Google Ads)
Streaming (costly)
o Data via Cloud Dataflow, Cloud Logging, or POST calls
o High volume event tracking logs
o Realtime dashboards
o Can stream data to datasets in both the US and EU
o Streaming into ingestion-time partitioned tables:
Use tabledata.insertAll requests
Destination partition is inferred from current date based on UTC time.
Can override destination partition using a decorator like
so: mydataset.table$20170301
Newly arriving data will be associated with the UNPARTITIONED partition
while in the streaming buffer.
A query can therefore exclude data in the streaming buffer from a query
by filtering out the NULL values from the UNPARTITIONED partition by
using one of the pseudo-columns ([_PARTITIONTIME]) or
[_PARTITIONDATE] depending on preferred data type.
o Streaming to a partitioned table:
Can stream data into a table partitioned on a DATE or TIMESTAMP
column that is between 1 year in the past and 6 months in the future.
Data between 7 days prior and 3 days in the future is placed in the
streaming buffer, and then extracted to corresponding partitions.
Data outside that window (but within 1 year, 6 month range) is extracted
to the UNPARTITIONED partition and loaded to corresponding partitions
when there’s enough data.
o Creating tables automatically using template tables
Common usage pattern for streaming is to split a logical table into many
smaller tables to create smaller sets of data.
To create smaller tables by date -> partitioned tables
To create smaller tables that are not date based -> template tables
BQ will create the tables for you.
Add templateSuffix parameter to your insertAll request.
Generated tables are named <target_table_name> + <templateSuffix>.
Only need to update the template table schema; all subsequently
generated tables will use the updated schema.
o Quotas
Max row size: 1MB
HTTP request size limit: 10MB
Max rows per second: 100,000 rows/s for all tables combined.
Max bytes per second: 100MB/s per table
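A sketch of a streaming insert into a specific ingestion-time partition with the google-cloud-bigquery Python client; the table ID is hypothetical, and passing the $YYYYMMDD decorator in the table ID string is assumed to be accepted by insert_rows_json (which maps to tabledata.insertAll):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; the decorator targets the 2017-03-01 partition
# instead of letting BigQuery infer the partition from the current UTC date.
table_id = "my-project.mydataset.mytable$20170301"
rows = [{"event": "signup", "user_id": 123}]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Insert errors:", errors)
```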
Raw Files
o Federated data source, CSV/JSON/Avro on GCS, Google sheets
Google Drive
o Loading is not currently supported.
o Can query data in Drive using an external table.
Expects all source data to be UTF-8 encoded.
To support occasionally changing schemas, you can use schema auto-detection (not the default
setting).
o Available while:
Loading data
Querying external data
Web UI limitations
o Cannot upload a file greater than 10MB in size
o Cannot upload multiple files at the same time (the CLI can)
o Cannot upload a file in SQL format
Partitions
Improves query performance => reduces costs
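A sketch of creating a column-partitioned table with the google-cloud-bigquery Python client (hypothetical project, dataset, table, and schema); queries that filter on the partitioning column then scan only the relevant partitions:

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.events",  # hypothetical table ID
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "INT64"),
    ],
)
# Partition by day on the event_ts column.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
client.create_table(table)
```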
Windowing
Window functions increase the efficiency and reduce the complexity of queries that
analyze partitions (windows) of a dataset by providing complex operations without the
need for many intermediate calculations.
Reduce the need for intermediate tables to store temporary data.
Bucketing
Like partitioning, but each split/partition should be the same size and is based on the
hash function of a column.
Each bucket is a separate file, which makes for more efficient sampling and joining data.
Anti-Patterns
Avoid self joins
Partition/Skew
o Avoid unequally sized partitions
o Skew: some values occur far more often than others.
Cross-Join
o Joins that generate more outputs than inputs
Update/Insert Single Row/Column
o Avoid a specific DML, instead batch updates/inserts
Anti-Patterns: https://cloud.google.com/bigtable/docs/schema-design
Table Types
Native Tables
o Backed by native BQ storage
External Tables
o Backed by storage external to BQ (federated data source)
o BigTable, Cloud Storage, Google Drive
Views
o Virtual tables defined by SQL query.
o Logical – not materialized
o Underlying query will execute each time the view is accessed.
o Benefits:
Reduce query complexity
Restrict access to data
Construct different logical tables from same physical table
o Cons:
Can’t export data from a view
Can’t use JSON API to retrieve data
Can’t mix standard and legacy SQL
e.g. standard sql cannot access legacy sql view
No user-defined functions allowed
No wildcard table references
Due to partitioning
Limit of 1000 authorized views per dataset
Caching
No charge for a query that retrieves results from cache.
Results are cached for 24 hours.
Caching is per user only.
bq query --nouse_cache '<query>'
Cached by Default unless
o A destination table is specified.
o If any referenced tables or logical views have changed since the results were previously cached.
o If any referenced tables have recently received streaming inserts even if no new
rows have arrived.
o If the query uses non-deterministic functions such as CURRENT_TIMESTAMP(),
NOW(), CURRENT_USER()
o Querying multiple tables using a wildcard
o If the query runs against an external data source.
Export
Destination has to be GCS.
o Can copy table to another BigQuery dataset though.
Can be exported as JSON/CSV/Avro
o Default is CSV
Only compression option: GZIP
o Not supported for Avro
To export > 1 GB
o Need to put a wildcard in destination filename
o Up to 1 GB of table data in a single file
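A sketch of exporting a table to GCS with a wildcard destination using the google-cloud-bigquery Python client (hypothetical table and bucket names):

```python
from google.cloud import bigquery

client = bigquery.Client()

# The * wildcard lets BigQuery shard the export into multiple files,
# which is required once the export exceeds 1 GB (each file holds up to 1 GB).
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = "CSV"
job_config.compression = "GZIP"

extract_job = client.extract_table(
    "my-project.analytics.events",            # hypothetical table
    "gs://my-export-bucket/events-*.csv.gz",  # hypothetical bucket
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish
```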
Slots
Unit of computational capacity needed to run queries.
BQ calculates the number of slots required on the basis of query size and complexity
Usually default slots are sufficient
Might need to be expanded over time, complex queries
Subject to quota policies ($$)
Can use StackDriver Monitoring to track slot usage.
Clustered Tables
Order of columns determines sort order of data.
Think of Clustering Columns in Cassandra
When to use:
o Data is already partitioned on date or timestamp column.
o You commonly use filters or aggregation against particular columns in your
queries.
Does not work if the clustered column is used in a complex filter (used in a function in
the filter expression)
BigQuery ML
Best Practices
Costs
Avoid SELECT *
o Query only columns you need.
Sample data using preview options
o Don’t run queries to explore or preview table data.
Price your queries before running them.
o Before running queries, preview them to estimate costs.
Limit query costs by restricting the number of bytes billed.
o Use the maximum bytes billed setting to limit query costs.
LIMIT doesn’t affect cost
o Do not use LIMIT clause as a method of cost control as it does not affect the
amount of data that is read.
View costs using a dashboard and query your audit logs
o Create a dashboard to view your billing data so you can make adjustments to
your BigQuery usage. Also consider streaming audit logs to BigQuery to analyze
usage patterns.
Partition data by date
Materialize query results in stages
o Break large query into stages where each stage materializes the results by writing
to a destination table.
o Querying smaller destination table reduces amount of data that is read and
lowers costs.
Consider cost of large result sets
o Use default table expiration time to remove data when not needed.
o Good for when writing large query results to a destination table.
Use streaming inserts with caution
o Only use if the data needs to be immediately available.
Query Performance
Input data and data sources (I/O)
o Control projection – Avoid SELECT *
o Prune partitioned queries
Use partition columns to filter
o Denormalize data when possible
JSON, Parquet, or Avro
When creating nested/repeated fields, specify the Type in the schema as RECORD
o Use external data sources appropriately
If performance is a top priority, do not use external source
o Avoid excessive wildcard tables
Use most granular prefix possible
Communication between nodes (shuffling)
o Reduce data before using a JOIN
o Do not treat WITH clauses as prepared statements
o Avoid tables sharded by date
Use time-based partitioned tables instead
Copy of schema and metadata is maintained for each sharded
table.
BQ might have to verify permissions for each queried table.
(overhead)
o Avoid oversharding tables
Computation
o Avoid repeatedly transforming data via SQL queries
o Avoid JavaScript user-defined functions.
Use native UDFs instead.
o Use approximate aggregation functions
COUNT(DISTINCT) vs. APPROX_COUNT_DISTINCT()
o Order query operations to maximize performance
Use ORDER BY only in the outermost query or within window clauses.
Push complex operations to the end of the query.
o Optimize join patterns
Start with the largest table
Storage Optimization
Use expiration settings to remove unneeded tables and partitions
o Configure default table expiration for datasets
o Configure expiration time for tables
o Configure partition expiration for partitioned tables
Take advantage of long term storage
o Untouched tables (90 days) are as cheap as GCS Nearline
o Each partition is considered separately.
Use pricing calculator to estimate storage costs
Architecture
Jobs (queries) can scale up to thousands of CPU’s across many nodes, but the process is
completely invisible to end user.
Storage and compute are separated, connected by petabit network.
Columnar data store
o Separates records into column values, stores each value on different storage
volume.
o Poor writes (BQ does not update existing records)
Components
Dremel - Execution Engine
Borg - Compute
Cost
Based on:
o storage (amount of data stored)
o querying (amount of data/number of bytes processed by query)
o streaming inserts.
Storage options are active and long term
o Active vs. long term depends on whether the table has been modified in the past 90 days
Query options are on-demand and flat-rate
IAM
Security can be applied at project and dataset level, but not at table or view level.
Predefined roles BQ
o Admin – Full access
o Data owner – Full dataset access
o Data editor – edit dataset tables
o Data viewer – view datasets and tables
o Job User – run jobs
o User – run queries and create datasets (but not tables)
o metaDataViewer
o readSessionUser – Create and use read sessions within project.
Authorized views allow you to share query results with particular users/groups without
giving them access to underlying data.
o Restrict access to particular columns or rows
o Create a separate dataset to store the view.
o How:
Grant IAM role for data analysts (bigquery.user)
They won’t have access to query data, view table data, or view
table schema details for datasets they did not create.
(In source dataset) Share the dataset, In permissions go to Authorized
views tab.
View gets access to source data, not analyst group.
Data Processing
Streams Introduction
How can MapReduce be used to maintain a running summary of real-time data from
sensors?
o Send temp readings every 5 minutes
Batches
Bounded datasets
Slow pipeline from data ingestion to analysis
Periodic updates as jobs complete
Order of data received unimportant
Single global state of the world at any point in time
Typically small/singular source
Low latency not important
Often stored in storage services GCS, Cloud SQL, BigQuery
Streams
Unbounded datasets
Processing immediate, as data is received
Continuous updates as jobs run constantly
Order important, but out of order arrival tracked
No global state, only history of events received
Typically many sources sending tiny (KB) amounts of data
Requires low latency
Typically paired with Pub/Sub (ingest) and Dataflow (real-time processing)
Stream-First Architecture
Data items can come from multiple sources
o Files, DBs, but at least one from a Stream
All files are aggregated and buffered in one way by a Message Transport (Queue)
o i.e. Kafka, PubSub
Passed to Stream Processing system
o Flink or Spark Streaming
Micro-batches
Message Transport
o Buffer for event data
o Performant and persistent
o Decoupling multiple source from processing
Stream Processing
o High throughput, low latency
o Fault tolerant with low overhead
o Manage out of order events
o Easy to use, maintainable
o Replay streams
A good approximation of stream processing is the use of micro-batches
o Group data items (time they were received)
o If small enough it approximates real-time stream processing
Advantages
o Exactly once semantics, replay micro-batches
o Latency-throughput trade off based on batch sizes
Can adjust to use case
Low latency better
High throughput better
Dataflow
Executes Apache Beam Pipelines
Use when:
Templates
System Lag
o Max time an element has been waiting for processing in this stage of the
pipeline.
Wall Time
o Approximate time spent in a pipeline step, across all workers and threads, on initializing,
processing data, shuffling data, and terminating.
Cancelling
Immediately stop and abort all data ingestion and processing.
Buffered data may be lost.
Draining
Cease ingestion but will attempt to finish processing any remaining buffered data.
Pipeline resources will be maintained until buffered data has finished processing and any
pending output has finished writing.
Pipeline Update
Replace an existing pipeline in-place with a new one and preserve Dataflow's exactly-once
processing guarantee.
When updating pipeline manually, use DRAIN instead of CANCEL to maintain in flight
data.
o Drain command is supported for streaming pipelines only
Pipelines cannot share data or transforms.
Windowing
Can apply windowing to streams, e.g. to compute a rolling average over the window, the max in a
window, etc.
Sliding Windows
Overlapping time
Number of entities differ within a window
Window Interval: How large window is
Sliding Interval: How much window moves over
Session Windows
Changing window size based on session data
No overlapping time
Number of entities differ within a window
Session gap determines window size
Per-key basis
Useful for data that is irregularly distributed with respect to time.
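A minimal Apache Beam (Python) sketch of the window types above, with made-up keys and timestamps:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    events = (
        p
        | beam.Create([("user1", 5.0), ("user1", 65.0), ("user2", 10.0)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))  # assign event-time timestamps
    )

    fixed = events | "fixed" >> beam.WindowInto(window.FixedWindows(60))            # 60s windows
    sliding = events | "sliding" >> beam.WindowInto(window.SlidingWindows(60, 30))  # 60s window, slides every 30s
    sessions = events | "sessions" >> beam.WindowInto(window.Sessions(600))         # 10-min gap, per key

    # e.g. count events per key within each session window
    _ = sessions | beam.combiners.Count.PerKey() | beam.Map(print)
```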
Triggers
Determines when a Window's contents should be output, based on a certain criterion being met.
o Allows specifying a trigger to control when (in processing time) results for the
given window can be produced.
o If unspecified, the default behavior is to trigger first when the watermark passes
the end of the window, and then trigger again every time there is late arriving
data.
Time-Based Trigger
Operate on the processing time – the time when the data element is processed at any
given stage in the pipeline.
Data-Driven Trigger
Operate by examining the data as it arrives in each window, and firing when that data
meets a certain property.
Currently, only support firing after a certain number of data elements.
Composite Triggers
Combine multiple triggers in various ways.
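A Beam (Python) sketch combining the trigger types above: early processing-time firings, an on-time firing when the watermark passes the window, and per-element firings for late data:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([("page", 12.0), ("page", 48.0), ("page", 130.0)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | beam.WindowInto(
            window.FixedWindows(60),
            # Speculative result every 30s of processing time, a result when the
            # watermark passes the end of the window, then one firing per late element.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(30),
                late=trigger.AfterCount(1),
            ),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=3600,  # accept data up to 1 hour late
        )
        | beam.combiners.Count.PerKey()
        | beam.Map(print)
    )
```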
Watermarks
System’s notion of when all data in a certain window can be expected to have arrived in
the pipeline.
Tracks watermark because data is not guaranteed to arrive in a pipeline in order or at
predictable intervals.
No guarantees about ordering.
Indicates all windows ending before or at this timestamp are closed.
No longer accept any streaming entities that are before this timestamp.
For unbounded data, results are emitted when the watermark passes the end of the
window, indicating that the system believes all input data for that window has been
processed.
Used with Processing Time
IAM
Project-level only – all pipelines in the project (or none)
Pipeline data access separate from pipeline access.
Dataflow Admin
o Full pipeline access
o Machine type/storage bucket config access
Dataflow.developer
o Full pipeline access
o No machine type/storage bucket access (data privacy)
Dataflow Viewer
o View permissions only.
Dataflow.worker
o Enables service account to execute work units for a Dataflow pipeline in Compute
Engine.
o Dataflow API also needs to be enabled.
Dataproc
Managed Hadoop (Spark, SparkML, Hive, Pig, etc…)
Code/Query only
Think in terms of a ‘job specific resource’ – for each job, create a cluster and then
delete it.
o Allow necessary web ports access via firewall rules, and limit access to your
network.
Tcp:8088 (Cluster Manager / YARN Resource Manager web UI)
Tcp:50070 (Connect to HDFS name node)
o OR SOCKS proxy (routes through an SSH tunnel for secure access)
gcloud compute ssh [master_node_name]
Cost
Storage
Can use on disk (HDFS) or GCS
HDFS
Split up on the cluster, but requires cluster to be up.
GCS
Allows for the use of preemptible machines that can reduce costs significantly.
o Do not need to configure startup and shutdown scripts to gracefully handle
shutdown; Dataproc already handles this.
o Cluster MUST have at least 2 standard worker nodes however.
Separate cluster and storage.
Cloud Storage Connector
o Allows you to run Hadoop or Spark jobs directly on GCS.
o Quick startup
In HDFS, a MapReduce job can’t start until the NameNode is out of safe
mode.
With GCS, can start job as soon as the task nodes start, leading to
significant cost savings over time.
Restartable Jobs
Updating Clusters
Can only change # workers/preemptible VM’s/labels/toggle graceful decommission.
o Graceful Decommissioning
Finish work in progress on a worker before it is removed from Dataproc
cluster.
Incorporates graceful YARN decommissioning.
May fail for preemptible workers.
Can forcefully decommission preemptible workers at any time.
Will always work with primary workers.
gcloud dataproc clusters update --graceful-decommission-timeout
Default to “0s” – forceful decommissioning.
Need to provide a time.
Max value 1 day.
o Automatically reshards data for you.
Connectors
BQ/BigTable (copies data to GCS) /CloudStorage
Optional Components
Anaconda, Druid
Hive WebHCat
Jupyter
Kerberos
Presto
Zeppelin
Zookeeper
IAM
Project level only (primitive and predefined roles)
Cloud Dataproc Editor, Viewer, and Worker
o Editor – Full access to create/edit/delete clusters/jobs/workflows
o Viewer – View access only
o Worker – Assigned to service accounts
Read/write GCS, write to Cloud Logging
Hadoop
Distributed Computing
Lots of cheap hardware
o HDFS
Replication and Fault Tolerance
o YARN
Distributed Computing
o MapReduce
HDFS
GCS is used on GCP.
o Don’t use HDFS as you would have to pay for a VM on Compute Engine.
Suited for batch processing.
o Data access has high throughput rather than low latency.
Architecture
Name Node
o 1 master node
o Contains YARN resource manager
o Manages overall file system
o Stores
The directory structure
Metadata on the files
Data Nodes
o Physically stores the data in the files.
Storing Data
Break data into blocks of equal size
o Different length files are treated the same way
o Storage is simplified
o Unit for replication and fault tolerance
Blocks are of size 128 MB
o Larger -> Reduces parallelism
High Availability
Can have multiple name nodes.
Kept in sync with Zookeeper
MapReduce
Map
An operation performed in parallel, on small portions of dataset.
Outputs KV pairs
Reduce
Mapper outputs become one final output.
Architecture
Resource Manager
Node Manager
Application Master
Container
Location Constraint
Assign a process to the same node where the data to be processed lives.
If CPU/Memory not available, WAIT!
Scheduling Policies
FIFO Scheduler
o Queue
Capacity Scheduler
o Priority Queue
Fair Scheduler
o Jobs assigned equal share of all resources
HBase
Database management system on top of Hadoop.
Integrates with your application just like a traditional database.
Columnar Store
Advantages
o Sparse Tables
No wastage of space when storing data.
o Dynamic Attributes
Update attributes dynamically without changing storage structure.
Do not need to change schema.
Denormalized Storage
Column names repeat across rows.
Normalization Reduces data duplication => Optimizes storage.
o Storage is cheap in a distributed file system.
o Optimize number of disk seeks instead by denormalization.
Don’t have to join tables.
Read a single record to get all details about an employee in one read operation.
o No group by
o No order by
No operations involving multiple tables
No indexes on tables
No constraints
Hive
Provides a SQL interface to Hadoop.
Bridge to Hadoop for people without OOP exposure.
Not suitable for very low latency apps due to HDFS.
HiveQL ~= SQL
Wrapper on top of MapReduce
Metastore
HCatalog
Bridge between HDFS and Hive
Stores metadata for all tables in Hive
Maps the files and directories in Hive to tables
Holds the definitions and the schema for each table
Any database with a JDBC driver can be used as a metastore.
Development
o Use built-in Derby database
o Embedded metastore
o Only one session can connect.
Production
o Local metastores
Allow multiple sessions to connect to Hive
DB is a separate process and can be on separate host.
o Remote metastores
On a join, one table is held in memory while the other is read from
disk
Hold smaller in memory
Structuring Joins as Map-Only Operation
Filter queries (only these rows)
Mapper needs to use null as key
o Windowing in Hive
A suite of functions which are syntactic sugar for complex queries.
e.g. What revenue percentile did this supplier fall into this quarter?
Window = 1 quarter
Operation = Percentile on revenue
Pig
ETL
A data manipulation language
Transforms unstructured data into a structured format
Query this structured data using interfaces like Hive.
Raw Data -> Pig -> Warehouse -> HiveQL -> Analytics
Pig Latin
o A procedural, data flow language to extract, transform and load.
Procedural
Uses a series of well-defined steps to perform operations.
No if statements or for loops.
Specifies exactly how data is to be modified at each step.
Data Flow
Focused on transformations applied to the data.
Written with a series of data operations in mind.
Nodes in a DAG
o Data from one or more sources can be read, processed and stored in parallel.
o Cleans data, precomputes common aggregates before storing in a data
warehouse.
Pig on Hadoop
o Optimizes operations before MapReduce jobs are run, to speed operations up.
Works better with Apache Tez and Spark.
Oozie
Apache Spark
A distributed computing engine used along with Hadoop
Interactive shell to quickly process datasets
Has a bunch of built in libraries for machine learning, stream processing, graph
processing, …, etc.
Dataflow
General purpose
o Exploring
o Cleaning and Preparing
o Applying machine learning
o Building data applications
Interactive
o Provides a REPL environment
Read Evaluate Print Loop
Reduces boilerplate of standard MapReduce Java code.
Lazy Evaluation
Materialize only when necessary
Spark Core
Basic functionality of Spark
Written in Scala
Runs on a Storage System and Cluster Manager
o Plug and play components
o Can be HDFS and YARN
Spark ML
MLlib is Spark’s machine learning library.
Provides tools such as:
o ML Algorithms: classification, regression, clustering, collaborative filtering.
o Featurization: feature extraction, transformation, dimensionality reduction, and
selection
o Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
o Persistence: saving and load algorithms, models, and Pipelines
o Utilities: linear algebra, statistics, data handling, etc.
PubSub
Server-less messaging “middleware”
Many to many asynchronous messaging
Decouples sender and receiver
Attributes can be set by sender (KV pairs)
Glue that connects all components
Order not guaranteed
Encoding as a Bytestring (utf-8) required for publishing.
Publishers: Any app that can make HTTPS requests to googleapis.com
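A sketch of publishing with the google-cloud-pubsub Python client, showing the required byte-string encoding and optional sender-set attributes (project, topic, and attribute names are hypothetical):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-events")  # hypothetical IDs

# Message data must be a bytestring; attributes are optional string key-value pairs.
future = publisher.publish(
    topic_path,
    data="temp=21.7".encode("utf-8"),
    sensor_id="device-42",
    event_time="2021-03-01T12:00:00Z",
)
print(future.result())  # server-assigned message ID
```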
Message Flow
Publisher app creates a topic object and sends a message to the topic.
Messages persisted in message store until acknowledged by subscribers
Architecture
Data Plane
o Handles moving messages between publishers and subscribers
o Forwarders
Control Plane
o Handles assignment of publishers and subscribers to server on the data plane.
o Routers
Use Cases
Balancing workloads in a network cluster
Implementing async workflows
Distributing event notifications
Refreshing distributed caches
o i.e. An app can publish invalidation events to update the IDs of objects that have
changed
Logging to multiple systems
Data streaming from various processes or devices
Reliability improvement
o i.e. a single-zone GCE service can operate in additional zones by subscribing to a
common topic, to recover from failures in a zone or region.
Deduplicate
Database table to store hash value and other metadata for each data entry.
Message_id can be used to detect duplicate messages
Handling Order
Order does not matter at all
o i.e. queue of independent tasks, collection of statistics on events
o Perfect for PubSub. No extra work needed.
Order in final result matters
o i.e. Logs, state updates
o Can attach a timestamp to every event in the publisher and make the subscriber
store the messages in some underlying data store (such as Datastore) that allows
storage or retrieval by the sorted timestamp.
Order of processed messages matters
o i.e. transactional data where thresholds must be enforced
o Subscriber must either:
Know the entire list of outstanding messages and the order in which they
must be processed, or
Assigning each message a unique identifier and storing in some
persistent place (Datastore) the order in which messages should
be processed.
Subscriber check persistent storage to know the next message it
must process and ensure that it only processes that message next.
Have a way to determine all messages it has currently received whether or
not there are messages it has not received that it needs to process first.
Cloud Monitoring to keep track of the
oldest_unacked_message_age metric.
Cost
Data volume used per month (per GB)
IAM
Control access at project, topic, or subscription level
Resource types: Topic, Subscription, Project
Service accounts are best practice.
Pubsub.publisher
Pubsub.subscriber
Pubsub.viewer or viewer
Pubsub.editor or editor
Pubsub.admin or admin
Apache Kafka
Stream processing for unbounded datasets.
Similar to PubSub
Kafka Connect
o A tool for scalably and reliably streaming data between Apache Kafka and other
systems.
o There are connectors to PubSub/Dataflow/BigQuery
Compared to PubSub
Can have exactly-once delivery with the Spark direct connector, in addition to at-least-once.
Cloud Dataprep
Intelligent data preparation
Partnered with Trifecta for data cleaning/processing service
Fully managed, serverless, and web based
User friendly interface
o Clean data by clicking on it
Supported file types
o Inputs
CSV, JSON, Plain Text, Excel, LOG, TSV, and Avro
o Outputs
CSV, JSON, Avro, BQ Table
CSV/JSON can be compressed or uncompressed
How it works
Backed by Cloud Dataflow
o After preparation, Dataflow processes the data via an Apache Beam pipeline
o “User-friendly Dataflow pipeline”
Dataprep Process
o Import data
o Transform sampled data with recipes
o Run Dataflow job on transformed dataset
Batch Job
Every recipe step is its own transform
o Export results (GCS, BigQuery)
Intelligent Suggestions
o Selecting data will often automatically give the best suggestion
o Can manually create recipes, however simple tasks (remove outliers, de-duplicate)
should just use auto-suggestions.
IAM
Dataprep.projects.user
o Run Dataprep in a project
Dataprep.serviceAgent
o Gives Trifecta necessary access to project resources.
Access GCS buckets, Dataflow Developer, BQ user/data editor
Necessary for cross-project access + GCE service account
Cost
1.16 * cost of a Dataflow job
Flows
Add or import datasets to process with recipes
Public Bucket for testing: gs://dataprep-samples
For large datasets:
o UI only shows a sample to work with
o Recipe created is then applied to entirety of dataset
Jobs
Create a dataset in BQ first
Click on Run Job
o Default option is CSV in GCS bucket
o Choose BQ dataset instead
o Name table
o Run Job: Create Apache Beam pipeline with Dataflow
Cloud Composer
Fully managed workflow orchestration service based on Apache Airflow.
o No need to provision resources.
Pipelines are configured as DAGs
Workflows can live on-premises, in multiple clouds, or fully within GCP
Provides ability to author, schedule, and monitor your workflows in a unified manner.
Multi-cloud
Can use Python to dynamically author and schedule workflows.
Environments
Airflow is a micro-service architected framework.
o To deploy in a distributed setup, Cloud Composer provisions several GCP
components, collectively known as an Environment.
Can create one or more inside a project.
Architecture
Distributes the environment's resources between a Google-managed tenant project and a
customer project.
For unified Cloud IAM access control and an additional layer of data security, Cloud
Composer deploys Cloud SQL and App Engine in the tenant project.
Tenant Project
Cloud SQL
o Stores Airflow metadata.
o Composer limits database access to the default or specified custom service
account used to create the environment.
o Metadata backed up daily.
o Cloud SQL proxy in GKE cluster
Used to remotely authorize access to your Cloud SQL database from an
application, client, or other GCP service.
App Engine
o Hosts the Airflow web server.
o Integrated with Cloud IAM by default.
o Assign composer.user role to grant access only to Airflow web server.
o Can deploy a self-managed Airflow web server in customer project (for orgs with
additional access-control reqs)
Customer Project
Cloud Storage
o Used for staging DAGs, plugins, data dependencies, and logs.
o To deploy workflows (DAGs), copy files to the bucket for your environment.
o Composer takes care of synchronizing DAGs among workers, schedulers, and the
web server.
GKE
o Scheduler, worker nodes, and CeleryExecutor here.
o Redis
Stackdriver
Integrates to have a central place to view all Airflow service and workflow logs.
Can view logs of scheduler and worker emit immediately instead of waiting for Airflow
logging module synchronization (due to streaming nature of Stackdriver)
Airflow
DAGs
o Composed of Tasks
o Connects independent tasks and executes in specified sequence.
Tasks
o Created by instantiating an Operator class.
o Logical unit of code.
o Link Tasks/Operators in your DAG python code.
Operators
o Template to wrap and execute a task.
o BashOperator is used to execute a bash script.
o PythonOperator is used to execute python code (see the sketch after this list).
o Specify DAG when instantiating Operator.
o Sensors
Special type of Operator that will keep running until a certain criterion is
met.
Task Instance
o Represents a specific run of a task and is characterized by a combination of a DAG, a
task, and a point in time.
o Has a state (running, success, failed, skipped, up for retry, etc)
CeleryExecutor
o Used to execute multiple DAGs in parallel
o Requires a message broker.
o SequentialExecutor is used for one DAG at a time.
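A minimal sketch of the concepts above (a DAG whose tasks are created from a BashOperator and a PythonOperator); the DAG ID, schedule, and commands are made up, and the import paths assume Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_etl",             # hypothetical DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting source files",
    )

    def _transform():
        print("transforming data")

    transform = PythonOperator(task_id="transform", python_callable=_transform)

    extract >> transform  # transform runs only after extract succeeds
```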
IAM
Composer.Admin
Composer.environmentAndStorageObjectAdmin
Composer.environmentAndStorageObjectViewer
Composer.user
Composer.worker – service accounts
Machine Learning
Basics of Machine Learning
Types of Problems:
Classification
Regression
Clustering
Rule Extraction
Supervised Learning
Labels associated with the training data are used to correct the algorithm.
Unsupervised Learning
The model has to be set up right to learn the structure in the data.
Deep Learning
Algorithms that learn what features matter.
Neural Networks
o Most common class of deep learning algorithms.
o Used to build representation learning systems.
o Composed of neurons (binary classifiers)
o Wide
Better for memorization
o Deep
Better for generalization
Neurons
Apply 2 functions on inputs.
Best values of W and b found by using cost function, optimizer, and training data.
Back Propagation
Activation Function
Helps to model non linear functions. (Logistic regression)
Vanishing Gradients
Gradients for lower layers (closer to input) can become very small.
Leads to very slow training, if at all.
ReLu as an activation function can help prevent vanishing gradients.
Exploding Gradients
When weights in a network are very large, the gradients for the lower layers involve products
of many large terms.
Gradients get too large to converge.
Batch normalization and/or lowering learning rate can prevent this.
Reducing Loss
Convergence: When loss stops changing or at least changes extremely slowly.
Gradient is a vector.
Learning rate is a scalar.
Gradient is multiplied by the learning rate.
Hyperparameters
Configuration settings used to tune how the model is trained.
Steps
o Total number of training iterations. One step calculates the loss from one batch
and uses that value to modify the model’s weights once.
Batch Size
o Number of examples (chosen at random) for a single step.
o Total # of trained examples = Batch Size * Steps
Learning rate
Periods
The # of training examples in each period = batch size * steps / periods
Controls granularity of reporting.
o If periods = 7 and steps = 70, the loss value will be output every 10 steps.
Modifying period value does not alter what model learns.
Generalization
The less complex an ML model, the more likely that a good empirical result is not just
due to the peculiarities of the sample.
Overfitting occurs when a model tries to fit the training data so closely that it does not
generalize well to new data.
Identify Overfitting
o Loss for the validation set is significantly higher than for the training set. (look at
loss curve (loss/iterations))
o Validation loss eventually increases with iterations.
If the key assumptions of supervised ML are not met, then we lose important theoretical
guarantees on our ability to predict new data.
3 Basic Assumptions
o We draw examples independently and identically at random from the
distribution. I.e. examples don’t influence each other.
o The distribution is stationary; that is it does not change within the data set.
o We draw examples from partitions from the same distribution.
Update hyperparams
Repeat
Finally test on test set
Representation
Process of mapping data to useful features.
Discrete feature
o A feature with a finite set of possible values.
o Categorical feature are an example
One-Hot Encoding
A sparse vector in which:
o One element is set to 1
o All other elements are set to 0
Commonly used to represent strings or identifiers that have a finite set of possible
values.
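A tiny illustration of one-hot encoding a categorical value against a hypothetical vocabulary:

```python
vocab = ["red", "green", "blue"]  # hypothetical finite set of values

def one_hot(value, vocabulary):
    # Sparse vector: 1 at the index of the value, 0 everywhere else.
    return [1 if v == value else 0 for v in vocabulary]

print(one_hot("green", vocab))  # [0, 1, 0]
```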
Feature Engineering
Process of determining which features might be useful in training a model, and then
converting raw data from log files and other sources into said features.
Sometimes called feature extraction.
If no value, it is set to -1
Create a Boolean feature to indicate if quality rating was defined.
o Replace “magical” values as follows
For a variable that take a finite set of values (discrete variables), add a new
value to the set and use it to signify that feature value is missing.
For continuous variables, ensure missing values do not affect the model
by using the mean value of the feature’s data.
Account for upstream instability
o Definition of a feature shouldn’t change over time.
Cleaning Data
Scaling feature vectors
o Converting floating point feature values from their natural range (100 to 900) to a
standard range (0 to 1 or -1 to 1)
o Scaling ~= Normalization
o If only 1 feature, little to no practical benefit.
o Multiple features, great benefits
Helps gradient descent converge more quickly
Helps avoid NaN traps
One number in the model becomes a NaN (value exceeds floating
point precision limit during training) and due to math operations,
every other number in the model also eventually becomes NaN.
Helps the model learn appropriate weights for each feature. Without
scaling, the model pays too much attention to features having a wider
range.
Handling extreme outliers
o Log scaling
Still leaves a tail on distribution
o Cap or Clipping
Reduce feature values that are greater than a set maximum value down to
that maximum value.
Also, increasing feature values that are less than a specific minimum value
up to that minimum value.
o Binning (Bucketing)
Converting a (usually continuous) feature into multiple binary features
called buckets or bins, typically based on a value range.
Scrubbing
o Data can be unreliable due to:
Omitted values
Duplicate examples
Bad labels
Bad feature values
o “Fix” by removing them from data set.
o Omitted and duplicate easy to detect.
o Detecting bad data in aggregate by using Histograms
o Stats can also help identifying bad data:
Max and Min
Mean and Median
Standard Deviation
Follow These Rules:
o Keep in mind what your data should look like
o Verify that the data meets these expectations
Or that you can explain why it doesn’t
o Double check that the training data agrees with other sources
i.e. dashboards
Feature Crosses
A synthetic feature formed by crossing (Cartesian product) individual binary features
obtained from categorical data or from continuous features via bucketing.
Helps represent nonlinear relationships.
Encoding Nonlinearity
Crossing One-Hot Vectors
Regularization
Minimize loss + complexity
o Structural Risk Minimization
o Penalizes complexity to prevent overfitting
2 Common Ways to Think About Model Complexity
o As a function of the weights of all the features in the model
L2 Regularization
A feature weight with a high absolute value is more complex than one
with a low absolute value.
L2 = w1^2 + w2^2 + … + wn^2
Consequences of L2 Regularization
Encourages weight values toward 0 (but not exactly 0)
Early Stopping
Ending training before the model reaches convergence (training loss finishes decreasing).
End model training when loss on a validation dataset starts to increase, that is, when
generalization performance worsens.
Sparsity
Sparse vectors often contain many dimensions.
Creating a feature cross results in even more dimensions.
High Dimensionality -> Large Model Size -> Large RAM reqs
L1 Regularization
o Penalizes absolute value of weights. (|weight|)
o Derivative of L1 is a constant, k. (2 * weight for L2)
A force that subtracts some constant value from the weight every time.
o Pushes weights toward 0
Efficient for wide models.
Reduces # of features -> smaller model size
o May cause informative features to get a weight of exactly 0:
Weakly informative features
Strongly informative features on different scales
Informative features strongly correlated with other similarly informative
features.
Dropout
Logistic Regression
A model that generates a probability for each possible discrete label value in
classification problems by applying a sigmoid function to a linear prediction.
Often used in binary classification problems, but can also be used in multi-class
classification problems (multinomial regression)
Sigmoid Function
o Maps logistic or multinomial regression output (log odds) to probabilities,
returning a value between 0 and 1.
o Can serve as an activation function in neural networks.
Loss and Regularization
o Loss function is Log Loss
o Regularization
L2 or Early Stopping
One vs All
o Classification problem with N possible solutions.
o A one-vs-all solution consists of N separate binary classifiers.
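A minimal sketch of turning a linear prediction (log odds) into a probability with the sigmoid; the weights and example are made up:
# Sketch of logistic regression inference for one example.
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
w = np.array([0.8, -0.4])   # hypothetical learned weights
b = 0.1                     # hypothetical bias
x = np.array([2.0, 1.0])    # one example
z = np.dot(w, x) + b        # linear prediction (log odds)
p = sigmoid(z)              # probability of the positive class, between 0 and 1
print(p)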
Classification
Classification Threshold (Decision Threshold)
o Determines what the probability output from logistic regression is classified as.
Accuracy
Number of correct predictions over total number of predictions
(TP + TN) / (TP + TN + FP + FN)
Confusion Matrix
An NxN table that summarizes how successful a classification model’s predictions were.
Useful when calculating precision and recall
Precision
Identifies the frequency with which the model was correct when predicting the positive
class.
TP / (TP + FP)
i.e. how many predicted cats are actually cats
Raising classification threshold reduces FP, thus improving precision.
Recall
Out of all the possible positive labels, how many did the model correctly identify.
TP / (TP + FN)
i.e. number of predicted cats out of all cats
Raising classification threshold will cause # of TP to decrease or stay the same and will
cause the # of FN to increase or stay the same. Thus recall will either stay constant or
decrease.
Improving precision often reduces recall and vice versa.
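A small sketch computing these metrics from hypothetical confusion-matrix counts:
# Sketch: accuracy, precision, and recall from made-up counts.
tp, fp, fn, tn = 30, 10, 5, 55
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found
print(accuracy, precision, recall)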
ROC Curve
Receiver Operating Characteristic Curve
Shows performance of classification model at all classification thresholds.
TP rate (TP / (TP + FN)) vs. FP rate (FP / (FP + TN))
Lowering the classification threshold increases both TP and FP.
AUC
Area Under the ROC Curve
Provides an aggregate measure of performance across all possible classification
thresholds.
0 – worst model
1 – best model
Desirable Because:
o Scale Invariant
Measures how well predictions are ranked, rather than their absolute
values.
o Classification Threshold Invariant
Measures the quality of the model’s predictions irrespective of what
classification threshold is chosen.
Limitations
o Scale invariance is not always desirable
We may need well-calibrated probability outputs, and AUC won't tell us
that.
o Classification threshold invariance is not always desirable
In cases where there are wide disparities in the cost of false negatives vs.
false positives, it may be critical to minimize one type of classification
error.
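A minimal sketch computing AUC with scikit-learn; the labels and scores are made up:
# Sketch: AUC measures how well the scores rank positives above negatives.
from sklearn.metrics import roc_auc_score
labels = [0, 0, 1, 1]             # true labels
scores = [0.1, 0.4, 0.35, 0.8]    # model probabilities for the positive class
print(roc_auc_score(labels, scores))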
Prediction Bias
= average of predictions – average of labels
Different from the bias term, b, in wx + b
Possible root causes of prediction bias:
o Incomplete feature set
o Noisy data set
o Buggy pipeline
o Biased training sample
o Overly strong regularization
Avoid Calibration Layer as a fix
o Fixing symptoms rather than cause.
o Builds a more brittle system that you must now keep up to date.
Examine prediction bias on a bucket of examples
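A tiny sketch of the prediction-bias formula with made-up values:
# Sketch: prediction bias = mean of predictions - mean of labels.
predictions = [0.8, 0.6, 0.9, 0.7]
labels      = [1, 0, 1, 1]
prediction_bias = sum(predictions) / len(predictions) - sum(labels) / len(labels)
print(prediction_bias)   # a large non-zero value points at the root causes listed above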
Embeddings
D-Dimensional Embeddings
o Assumes something can be explained by d aspects.
Map items to low-dimensional real vectors in a way that similar items are close to each
other.
ML APIs
Pre-trained ML APIs
Sight
Vision AI
Image Recognition/analysis
Label Detection
o Extracts info in image across categories
Text Detection (OCR)
o Detect and extract text from images
Safe Search
o Recognize explicit content
Landmark Detection
Logo Detection
Image Properties
o Dominant colors, pixel counts
Crop Hints
o Crop coordinates of dominant object/face
Web Detection
o Find matching web entries
Object Localizer
o Returns labels and bounding boxes for detected objects.
Product Search
o Uses image and specific region(s) or largest object of interest to return matching
items from product set.
AutoML Vision
Object Detection
o Bounding box smart multi-object detection, Google Vision API on steroids.
Edge
o The IoT version of Vision detection for Edge Devices.
o Optimized to achieve high accuracy for low latency use cases on memory-
constrained devices.
o Use Edge Connect to securely deploy the AutoML model to IoT devices (such as
Edge TPUs, GPUs, and mobile devices) and run predictions locally on the device.
Language
Translation API
Detect and translate languages
Beta:
o Glossary
o Batch translations
AutoML Translation
Upload translated language pairs -> Train -> Evaluate
Conversation
Cloud AutoML
Enables developers with limited machine learning expertise to train high-quality models
specific to their business needs.
Relies on transfer learning and neural architecture search technology.
AutoML Tables
Workflow:
o Table input
o Define data schema and labels
o Analyze input features
o Train (automatic)
Feature engineering
Normalize and bucketize numeric features
Create one-hot encoding and embeddings for categorical features
Perform basic processing for text features
Extract date- and time-related features from Timestamp columns.
Model selection
Parallel model testing
Linear
Feedforward deep neural network
Gradient Boosted Decision Tree
AdaNet
Ensembles of various model architectures
Hyperparameter tuning
o Evaluate model behavior
o Deploy
Structured Data
o Can use data from BigQuery or GCS (CSV)
Structured Data
AutoML Tables
Cloud Inference API
o Quickly run large-scale correlations over typed time-series data.
Recommendations AI (Beta)
BigQuery ML (beta)
Cost
Pay per API request per feature
Feature as in Landmark Detection
AI Platform
Can use multiple ML platforms such as TensorFlow, scikit-learn and XGBoost
Workflow
Source and prepare data
o Data analysis
Join data from multiple sources and rationalize it into one dataset.
Visualize and look for trends.
Use data centric languages and tools to find patterns in data.
Identify features in your data.
Clean the data to find any anomalous values caused by errors in data
entry or measurement.
o Data preprocessing
Transform valid, clean data into the format that best suits the needs of
your model.
Examples
Normalizing numeric data to a common scale.
Applying formatting rules to data. Ex. removing HTML tagging
from a text feature.
Reducing data redundancy through simplification. Ex. converting a
text feature to a bag of words representation.
Representing text numerically. Ex. assigning values to each
possible value in a categorical feature (or 1 hot).
Assigning key values to data instances.
o Develop model
o Train an ML model on your data
Benefits of Training Locally
Quick iteration
No charge for cloud resources
o Deploy trained model
Upload to GCS bucket
Create a model resource in AI Platform specifying GCS path
Preparing Data
Gather data
Clean data
o Clean data by column (attribute)
o Instances with missing features.
o Multiple methods of representing a feature.
Length measurement in different scale/format
o Features with values far out of the typical range (outliers)
o Significant change in data over distances in time, geographic location, or other
recognizable characteristics.
o Incorrect labels or poorly defined labeling criteria.
Split data
o Train, Validation, Test
o Better to randomly sample the subsets from one big dataset than use pre-divided
data. Otherwise could be non-uniform => overfitting.
o Size of datasets: training > validation > test
Engineer data features
o Can combine multiple attributes to make one generalizable feature.
Address and timestamp => position of sun
o Can use feature engineering to simplify data.
o Can get useful features and reduce number of instances in dataset by
engineering across instances. I.e. calculate frequency of something.
Preprocess features
Training Overview
Upload datasets already split (training, validation) into something AI Platform can read
from.
Sets up resources for your job. One or more virtual machines (training instances)
o Applying standard machine image for the version of AI Platform your job uses.
o Loading application package and installing it with pip.
o Installing any additional packages that you specify as dependencies.
Distributed Training Structure
o Running job on a given node => replica
o Each replica given a single role or task in distributed training:
Master
Exactly 1 replica
Manages others and reports status for the job as a whole.
Status of master signals overall job status.
Single process job => the sole replica is the master for the job
Worker(s)
1 or more replicas
Do work as designated in job configuration.
Parameter Servers
1 or more replicas
Coordinate shared model state between the workers.
o Tiers
Scale tiers
Number and types of machines you need.
CUSTOM tier
Allows you to specify the number of Workers and parameter
servers.
Add these to TrainingInput object in job configuration.
o Exception
The training service runs until your job succeeds or encounters an
unrecoverable error.
Distributed Case – status of the master replica that signals the overall
status.
Running a Cloud ML Engine training job locally (gcloud ml-engine local
train) is especially useful in the case of testing distributed models.
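A hedged sketch of what a CUSTOM-tier job configuration might look like as a Python dict; the job id, machine types, bucket, and module name are all hypothetical, not a definitive AI Platform configuration:
# Hedged sketch of a training job configuration using the CUSTOM scale tier.
training_job = {
    "jobId": "my_training_job_001",            # hypothetical job id
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-8",         # exactly 1 master replica
        "workerType": "n1-standard-8",
        "workerCount": 4,                      # worker replicas
        "parameterServerType": "n1-standard-4",
        "parameterServerCount": 2,             # coordinate shared model state
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # hypothetical package
        "pythonModule": "trainer.task",
        "region": "us-central1",
    },
}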
Start training
o Package application with any dependencies required
o 2 ways
Hyperparameter Tuning
--config hptuning_config.yaml
Hyperparameter: Data that governs the training process itself.
o DNN
Number of layers
Number of nodes for each layer
Usually constant during training.
How it works:
o Running multiple trials in a single training job.
o Each trial is a complete execution of your training application with values for the
chosen hyperparameters, set within the limits you specify.
Tuning optimizes a single target variable (hyperparameter metric)
o Multiple params per metric.
Default name is training/hptuning/metric
o Recommended to change to custom name.
o Must set hyperparameterMetricTag value in HyperparameterSpec object in job
request to match custom name.
How to actually tune?
o Define a command line argument in main training module for each tuned
hyperparameter.
o Use value passed in those arguments to set the corresponding hyperparameter in
application’s TensorFlow code.
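A minimal sketch of exposing tuned hyperparameters as command-line arguments; the argument names are hypothetical:
# Sketch: one command-line argument per tuned hyperparameter.
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--hidden-units", type=int, default=64)
args, _ = parser.parse_known_args()
# Use the passed-in values to configure the model in your training code.
print(args.learning_rate, args.hidden_units)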
Types
o Double
o Integer
o Categorical
o Discrete – List of values in ascending order.
Scaling
o Recommended for Double and Integer types.
o Linear, Log, or Reverse Log Scale
Search Algorithm
o Unspecified
Same behavior as when you don't specify a search algorithm: AI Platform uses
Bayesian optimization.
o Grid Search
Useful when specifying a number of trials that is more than the number of
points in the feasible space.
In such cases the AI Platform default algorithm may generate duplicate
suggestions.
Cannot be used if any parameters are of type Double.
o Random Search
Prediction
Online
Optimized to minimize the latency of serving predictions.
Predictions returned in the response message.
Input passed directly as a JSON string.
Returns as soon as possible.
Runs on runtime version and in region selected when deploying model.
Can serve predictions from a custom prediction routine.
Can generate logs if model is configured to do so. Must specify option when creating
model resource.
o onlinePredictionLogging or --enable-logging (gcloud)
Use when making requests in responses to application input or in other situations where
timely inference is needed.
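A hedged sketch of an online prediction request with the Google API Python client; the project, model, and instance fields are hypothetical:
# Hedged sketch: send JSON instances directly and get predictions in the response.
from googleapiclient import discovery
service = discovery.build("ml", "v1")
name = "projects/my-project/models/my_model"        # hypothetical resource name
body = {"instances": [{"feature_1": 1.0, "feature_2": "a"}]}
response = service.projects().predict(name=name, body=body).execute()
print(response)   # predictions come back in the response message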
Batch
Optimized to handle a high volume of instances in a job and to run more complex
models.
Predictions written to output files in Cloud Storage location that you specify.
o Can verify predictions before applying them. (sanity check)
Input data is passed as one or more URIs of files in Cloud Storage locations.
Asynchronous request.
Can run in any available region, using any runtime version.
o Should run with defaults for deployed model versions.
Only TensorFlow is supported (not XGBoost or scikit-learn).
Ideal for processing accumulated data when you don’t need immediate results.
o i.e. a periodic job that gets predictions for all data collected since the last job.
Generates logs that can be viewed on Stackdriver.
Slow because AI Platform allocates and initializes resources for a batch prediction job
when the request is sent.
Batch
Scales nodes to minimize elapsed time job takes.
Allocates some nodes to handle your job when you start it.
Scales the number of nodes during the job in an attempt to optimize efficiency.
Shuts down nodes as soon as job is done.
Online
Scales nodes to maximize number of requests it can handle without too much latency.
Allocates some nodes the first time you request predictions after a long pause in
requests.
Scales number of nodes in response to request traffic, adding nodes when traffic
increases, removing them when there are fewer requests.
Keeps at least 1 node ready over a period of several minutes, to handle requests even
when there are none to handle.
Scales down to zero after model version goes several minutes without a prediction
request.
IAM
Project Roles
o ml.admin
o ml.developer
o ml.viewer
Model Roles
o ml.modelOwner
o ml.modelUser
Tensorflow
Open-source machine learning / deep learning platform
Lazy evaluation during graph build, full evaluation during execution.
TensorFlow Estimator API
o High level object oriented API
o Makes it easy to build models.
o Specifies predefined architectures, such as linear regressors or neural networks.
tf.layers, tf.losses, tf.metrics
o Reusable libraries for common model components.
Python TensorFlow
o Provides Ops, which wrap C++ Kernels
Can run on CPU, GPU, or TPU
o Kernels work on more than one platform.
Feature Engineering
o Often means converting raw log file entries to tf.Example protocol buffers. See
also tf.Transform
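A hedged sketch of wrapping one record as a tf.Example protocol buffer; the feature names and values are made up:
# Sketch: one raw record as a tf.Example, ready for a TFRecord file.
import tensorflow as tf
example = tf.train.Example(features=tf.train.Features(feature={
    "user_id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"u123"])),
    "clicks":  tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "score":   tf.train.Feature(float_list=tf.train.FloatList(value=[0.83])),
}))
serialized = example.SerializeToString()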
Kubeflow
Helps orchestrate machine learning training pipelines across on-prem and cloud-based
resources.
Can containerize training and serving infrastructure.
Components
Support for distributed TensorFlow training via the TFJob CRD
o TFJob is a Kubernetes custom resource used to run TensorFlow training jobs on
Kubernetes.
Exploration/Visualization
Datalab
Managed Jupyter notebooks
Powerful interactive tool to explore, analyze, transform and visualize data and
build machine learning models on GCP.
o In Cell
%%bq query --name queryname
SQL underneath (see the example cell after this list)
o datalab create
o datalab-network (VPC) is created
o datalab connect
o Cloud Source Repository
Used for sharing notebook between users
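A hedged example of the in-cell query magic mentioned above; the query name and table are hypothetical:
%%bq query --name daily_counts
SELECT date, COUNT(*) AS n
FROM `my-project.my_dataset.events`   -- hypothetical table
GROUP BY date
ORDER BY date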
3 Ways to Run:
Locally
o Good if only one person using
Docker on GCE
o Better
o Use by multiple people through SSH or CloudShell
o Uses resources on GCE
Docker + Gateway
o Best
o Uses a gateway and proxy
o Runs locally
Notebooks
Can be in a Cloud Source Repository (git repo)
o Use ungit to commit changes to notebooks
Persistent Disk
Notebooks can be cloned from GCS to VM persistent disk.
This clone => workspace => add/remove/modify files
Notebooks autosave, but you need to commit.
Kernel
Opening a notebook => Backend kernel process manages session and variables.
Each notebook has 1 python kernel
Connecting
SSH tunnels to notebook on port 8081
datalab connect
o RSA key is passphrase
Cost
Free
Only pay for GCE resources Datalab runs on and other GCP services you interact with.
Data Studio
Easy to use data visualization and dashboards.
Cost
Free
BQ access incurs normal query costs
Basic Process
Connect to data source
Visualize data
Share with others
Creating Charts
Use combinations of dimensions and metrics
Create custom fields if needed
Add date range filters with ease
Security
Cloud Identity and Access Management (IAM)
Provides administrators the ability to manage cloud resources centrally by controlling
who can take what action on specific resources.
Primitive Roles
o Owner, Editor, Viewer
Predefined Roles
o Finer grained access control than primitive roles.
Custom Roles
o Create to tailor permissions to the needs of your org when predefined do not
meet your needs.
Least Privilege
Avoid primitive roles, instead grant predefined roles
Grant primitive in the following cases:
o GCP service does not provide a predefined role.
o You want to grant broader permissions for a project. (dev or test environments)
o Work in a small team where the team members don’t need granular permissions.
Treat each component of your application as a separate trust boundary.
o If multiple services require different permissions, create a service account for
each service and then grant only the required permissions to each SA.
Remember that a policy set on a child resource cannot restrict access granted on its
parent.
Grant roles at the smallest scope needed.
Restrict who can act as service accounts. Users granted the Service Account Actor role for
a SA can access all the resources to which the SA has access.
Restrict who has access to create and manage SAs in your project.
Granting the Project IAM Admin and Folder IAM Admin predefined roles will allow access
to modify Cloud IAM policies without also allowing direct read, write, and administrative
access to all resources.
o Granting Owner role to a member will allow them to access and modify almost all
resources, including modifying Cloud IAM policies.
Auditing
Use Cloud Audit Logging logs to regularly audit changes to your Cloud IAM policy.
Export audit logs to GCS to store for long periods of time.
Audit who has the ability to change your IAM policies on your project.
Restrict access to logs using Logging roles.
Apply the same access policies to the GCP resource that you use to export logs as
applied to the logs viewer.
Use Cloud Audit Logging logs to regularly audit access to SA keys.
Policy Management
Set organization level Cloud IAM policies to grant access to all projects in your
organization.
Grant roles to a Google group instead of individual users when possible.
o Easier to add/remove members to/from a group.
If you need to grant multiple roles to allow a particular task, create a Google group,
grant the roles to that group, and then add users to that group.
Billing
Billing Account Administrator role
o Allows management of payments and invoices without granting permission to
view the project contents.
Billing Account User role
o Gives SA permissions to enable billing and therefore permit the SA to enable APIs
that require billing to be enabled.
Billing Account Creator role
o Allows developers to create new billing accounts and to attach billing accounts to
the projects.
Viewer role
o Allows developers to view the expenses for the projects they own.
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations#identity-and-access-management
https://cloud.google.com/iam/docs/understanding-roles
Legal Compliance
Health Insurance Portability and Accountability Act (HIPAA)
Children’s Online Privacy Protection Act (COPPA)
FedRAMP
General Data Protection Regulation (GDPR)
Machine Learning
https://developers.google.com/machine-learning/glossary/
https://developers.google.com/machine-learning/crash-course/ml-intro
https://developers.google.com/machine-learning/guides/rules-of-ml/
https://gcp.solutions/
Cloud Next
https://cloud.google.com/blog/products/ai-machine-learning/all-ai-announcements-from-google-next19-the-smartest-laundry-list
Apache Beam
https://cloud.google.com/blog/products/gcp/why-apache-beam-a-google-perspective
Compute
Compute Engine
Example
Features:
Preemptible instances can run a custom shutdown script. The instance is given 30 seconds.
App Engine
Platform as a Service (PaaS).
Features:
Kubernetes Engine
Hosted Kubernetes engine. Super simple to get a Kubernetes cluster started.
Features:
GKE On-Prem
Reliable, efficient, and secure way to run Kubernetes clusters anywhere. Integration
using IAM. Similar features to GKE.
Cloud Functions
Automatically scales, highly available and fault tolerant. No servers to provision.
Access and IAM. Functions can be given VPC access, but a VPC cannot be connected directly to Cloud
Functions. IAM controls who can invoke the function; --member allows you to control which users can
invoke the function, and each call is checked for an appropriate identity.
Serverless containers are coming soon.
Knative
Kubernetes + Serverless addon for GKE.
Shielded VMs
Shielded VMs:
verify VM identity
quickly protect VMs against advanced threats: rootkits, etc.
protect secrets
Containers Security
patches applied regularly via frequent builds and image security reviews
containers have less code and a smaller attack surface
isolate processes working with resources (separate storage from networking)
Storage
storage
o multi-regional (you pay for data transfer between regions)
reduce latency and increase redundancy
o regional (higher performance local access and high frequency analytics
workloads)
backup
o nearline highly durable storage for data accessed less than once a month
o coldline highly durable storage for data accessed less than once a year
Requester Pays: you can require the requester of the data to pay for transfer costs.
Controlling Access:
Redundancy is
Persistent disk performance is predictable and scales linearly with provisioned capacity
until the limits for an instance’s provisioned vCPUs are reached.
Cloud Storage for Firebase
Free to get started. Uses Google Cloud Storage behind the scenes. Easy way to provide
access to files to users based on authentication. Trigger functions to process these files.
Client SDKs are provided for reliable uploads on spotty connections. Targets mobile.
Cloud Filestore
Connects to Compute Engine instances and Kubernetes Engine instances. Low-latency file
operations. Performance equivalent to a typical HDD. Can get SSD performance for a
premium.
Size must be between 1 TB and 64 TB. Priced per gigabyte per hour. About 5x more
expensive than object storage. About 2-3x more expensive than block storage.
Is an NFS fileshare.
Migration
Covers sending files to google.
Storage Transfer Service allows you to quickly import online data into Cloud
Storage. You can also set up a repeating schedule for transferring data, as well as
transfer data within Cloud Storage, from one bucket to another.
o schedule one-time transfer operations or recurring transfer operations
o delete existing objects in the destination bucket if they don't have a
corresponding object in the source
o delete source objects after transferring them
o schedule periodic synchronization from source to destination (with filters)
Transfer Appliance
install the appliance locally, move data onto it, and ship it back to Google
BigQuery Data Transfer Service
data sources:
Campaign Manager, Cloud Storage, Google Ad Manager, Google Ads, Google Play,
YouTube channel reports, YouTube content owner reports.
Databases
Cloud SQL
Fully managed MySQL and PostgreSQL service. Sustained usage discount. Data
replication between zones in a region.
First Generation instances support MySQL 5.5 or 5.6, and provide up to 16 GB of RAM
and 500 GB data storage.
Create and manage instances in the Google Cloud Platform Console.
Instances available in US, EU, or Asia.
Customer data encrypted on Google’s internal networks and in database tables,
temporary files, and backups.
Support for secure external connections with the Cloud SQL Proxy or with the SSL/TLS
protocol.
Support for private IP (beta) (private services access).
Data replication between multiple zones with automatic failover.
Import and export databases using mysqldump, or import and export CSV files.
Support for MySQL wire protocol and standard MySQL connectors.
Automated and on-demand backups, and point-in-time recovery.
Instance cloning.
Integration with Stackdriver logging and monitoring.
ISO/IEC 27001 compliant.
Performance
reverse domain names (domain names should be written in reverse), e.g. com.google
string identifiers (do not hash)
Cloud Bigtable
Fully managed.
Cloud Datastore
Atomic transactions. Cloud Datastore can execute a set of operations where either all
succeed, or none occur.
High availability of reads and writes. Cloud Datastore runs in Google data centers, which
use redundancy to minimize impact from points of failure.
Massive scalability with high performance. Cloud Datastore uses a distributed
architecture to automatically manage scaling. Cloud Datastore uses a mix of indexes and
query constraints so your queries scale with the size of your result set, not the size of
your data set.
Flexible storage and querying of data. Cloud Datastore maps naturally to object-oriented
and scripting languages, and is exposed to applications through multiple clients. It also
provides a SQL-like query language.
Balance of strong and eventual consistency. Cloud Datastore ensures that entity lookups
by key and ancestor queries always receive strongly consistent data. All other queries are
eventually consistent. The consistency models allow your application to deliver a great
user experience while handling large amounts of data and users.
Encryption at rest. Cloud Datastore automatically encrypts all data before it is written to
disk and automatically decrypts the data when read by an authorized user. For more
information, see Server-Side Encryption.
Fully managed with no planned downtime. Google handles the administration of the
Cloud Datastore service so you can focus on your application. Your application can still
use Cloud Datastore when the service receives a planned upgrade.
Ideal for low-latency data that must be shared between workers. Failover is to a separate
zone. The application must be tolerant of failed writes.
Could this be used for Jupyter notebooks? Probably not due to restrictions… can't see
exactly what text has changed.
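A hedged sketch of basic Cloud Datastore usage with the Python client library; the kind and properties are hypothetical:
# Sketch: write one entity atomically, then run an eventually consistent query.
from google.cloud import datastore
client = datastore.Client()                       # uses the current project's credentials
key = client.key("Task")                          # incomplete key; Datastore assigns the id
entity = datastore.Entity(key=key)
entity.update({"description": "write report", "done": False})
client.put(entity)
fetched = list(client.query(kind="Task").fetch(limit=5))
print(fetched)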
Networking
Get private access to Google services such as Storage, big data services, etc. without having to
assign a public IP.
firewall rules
routes (how to send traffic)
forwarding rules
ip addresses
Ways to connect:
Global:
Global has a single anycast address, health checks, cookie-based affinity, autoscaling,
and monitoring.
Regional:
Cloud CDN
Low-latency, low-cost content delivery using Google's global network. Recently ranked
the fastest CDN. 90 cache sites. Always close to users. Cloud CDN comes with SSL/TLS.
invalidation
take down cached content in minutes
Cloud Interconnect
Connect directly to Google.
Cloud DNS
Scalable, reliable, and managed DNS system. Guarantees 100% availability. Can create
millions of DNS records. Managed through API.
Network Telemetry
Allows for monitoring of network traffic patterns.
Management Tools
Stackdriver
o monitoring
o logging
o error reporting with triggers
o trace
Cloud Console
web interface for working with Google Cloud resources
Cloud shell
command line management for web browser
Apigee API Platform
create API proxies from OpenAPI specifications and deploy them in the cloud
protect APIs: OAuth 2.0, SAML, TLS, and protection from traffic spikes
o dynamic routing, caching, and rate limiting policies
publish APIs to a developer portal for developers to be able to explore
measure performance and usage by integrating with Stackdriver
Free trial, then a $500 quickstart and larger plans later. Monetization.
API Monetization
Tools for creating billing reports for users. Flexible report models, etc. This is done through
Apigee.
Cloud Endpoints
Google Cloud Endpoints allows a shared backend. Cloud endpoints annotations. Will
generate client libraries for the different languages.
Nginx based proxy. Open API specification and provides insight with stackdriver,
monitoring, trace, and logging.
Control who has access to your API and validate every call with JSON Web Tokens and
Google API keys. Integration with Firebase Authentication and Auth0.
Generate API keys in GCP console and validate on every API call.
Developer Portal
Have dashboards and places for developers to easily test the API.
Developer Tools
Cloud SDK
bq
kubectl
management of kubernetes
gsutil
command line access to manage cloud storage buckets and objects
Container Registry
More than a private docker repository.
Allows:
Cloud Build
Cloud Build lets you commit to deploy in containers quickly.
Cloud Scheduler
Cron job scheduler. Batch and big data jobs, cloud infrastructure operations. Automate
everything with retries; manage all automated tasks in one place.
Data Analytics
Big Query
Free up to 1 TB of data analyzed each month and 10 GB stored.
Data Warehouses
Business Intelligence
Big Query
Big Query ML
o Adding labels in SQL query and training model in SQL
o linear regression, classification (logistic regression), ROC curve, model weight
inspection.
o feature distribution analysis
o integrations with Data Studio, Looker, and Tableau
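A hedged sketch of training a BigQuery ML logistic regression model from Python; the dataset, table, and column names are hypothetical:
# Sketch: the model is created and trained by a single SQL statement.
from google.cloud import bigquery
client = bigquery.Client()
sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT churned, tenure_months, monthly_spend
FROM `my_dataset.customers`
"""
client.query(sql).result()   # waits for the training job to finish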
Accessible through a REST API and client libraries, including a command-line tool.
Data is stored in the Capacitor columnar data format and offers the standard database
concepts of tables, partitions, columns, and rows.
batch loads
o Cloud Storage
o other Google services (for example, Google Ad Manager)
o readable data source
o streaming inserts
o DML (data manipulation language) statements
o Google BigQuery I/O transformation (Dataflow)
streaming
avoid SELECT *
use summary routines to only show a few rows
use the --dry_run flag; it will report how much data the query would process (and thus the cost)
no charge for regular (batch) loading of data. Streaming inserts do cost money.
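A hedged sketch of a dry run from Python to estimate query cost before running it; the table name is hypothetical:
# Sketch: a dry run reports bytes processed without running (or billing) the query.
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT name FROM `my_dataset.users`", job_config=job_config)
print("Bytes that would be processed:", job.total_bytes_processed)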
Cloud Dataflow
Fully managed service for transforming and reacting to data.
cleaning up data
triggering events
writing data to destinations: Cloud SQL, BigQuery, etc.
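A hedged sketch of a minimal Apache Beam pipeline (the SDK behind Dataflow), run locally here; the same code can target the Dataflow runner:
# Sketch: create -> clean up -> write (print stands in for a real sink such as BigQuery).
import apache_beam as beam
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["  alpha ", "beta", "  gamma"])
        | "Clean"  >> beam.Map(lambda s: s.strip())
        | "Print"  >> beam.Map(print)
    )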
Regional Endpoints
Cloud Dataproc
Managed Apache Hadoop and Apache Spark instances.
Cloud Datalab
Jupyter notebooks + magics that work with Google Cloud.
Cloud Pub/Sub
Simple, reliable, scalable foundation for analytics and event-driven computing systems.
Features:
Details:
message
the data (and optional attributes) that a publisher sends to a topic and that is delivered to subscribers
topic
a named entity that represents a feed of messages
subscription
an entity interested in receiving messages on a particular topic
publisher
creates messages and sends (publishes) them to a specific topic
subscriber (consumer)
receives messages on a specified subscription
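A hedged sketch of publishing a message with the Python client library; the project and topic names are hypothetical:
# Sketch: publish one message (data plus an attribute) to a topic.
from google.cloud import pubsub_v1
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")
future = publisher.publish(topic_path, b"hello", origin="cheat-sheet")
print(future.result())   # the message id once the publish succeeds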
Performance (scalability):
Cloud Composer
Managed Apache Airflow. (differences)
dataflow
bigquery
storage operators
spanner
sql
support in many other clouds
workflow orchestration solution
Operators.
Built-in support for services outside GCP: HTTP, SFTP, bash, Python, AWS, Azure,
Databricks, JIRA, Qubole, Slack, Hive, Mongo, MySQL, Oracle, Vertica.
Kubernetes PodOperator.
Integrates with the gcloud composer commands. Cloud SQL is used to store the Airflow metadata. App
Engine serves the Airflow web interface. Cloud Storage is used for storing Python plugins and
DAGs, etc. All running inside of GKE. Stackdriver is used for collecting all logs.
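A hedged sketch of a minimal Airflow DAG of the kind you would upload to Composer's DAG folder; the ids, schedule, and command are hypothetical:
# Sketch of a one-task DAG (Airflow 1.10-style imports, as used by Composer at the time).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
with DAG(dag_id="daily_export",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    export = BashOperator(task_id="export", bash_command="echo exporting data")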
Genomics
Process in parallel.
With the SDK in your app, you get performance metrics of your app as seen by users.
AI Hub
A collection of end-to-end AI pipelines and out of the box algorithms for solving
specific machine learning problems.
Cloud AutoML
Makes machine learning approachable even if you have minimal experience.
Products:
natural language
translation
vision
Use case: you would like to train your own model to detect features that are
more specific than the ones Google's pretrained models provide.
Only limited ML experience is needed to train and make predictions. Full REST API; scalable
training for labeling features.
Cloud TPU
Much faster for machine learning computations and numerical algorithms.
Dialog Flow
conversational interfaces
o chatbots for example
o text to speech
o 20+ languages supported
Cloud Natural Language
You can use Cloud Natural Language to extract information about people, places,
events, and much more mentioned in text documents, news articles, or blog posts. You
can use it to understand sentiment about your product on social media or parse intent
from customer conversations happening in a call center or a messaging app. You can
analyze text uploaded in your request or integrate with your document storage on
Google Cloud Storage.
Cloud Text-to-Speech
Google Cloud Text-to-Speech enables developers to synthesize natural-sounding
speech with 30 voices, available in multiple languages and variants. It applies
DeepMind’s groundbreaking research in WaveNet and Google’s powerful neural
networks to deliver high fidelity audio. With this easy-to-use API, you can create lifelike
interactions with your users, across many applications and devices.
Cloud Translation
Cloud Translation offers both an API that uses pretrained models and the ability to build
custom models specific to your needs, using AutoML Translation.
The Translation API provides a simple programmatic interface for translating an arbitrary
string into any supported language using state-of-the-art Neural Machine Translation. It
is highly responsive, so websites and applications can integrate with Translation API for
fast, dynamic translation of source text from the source language to a target language
(such as French to English). Language detection is also available in cases where the
source language is unknown. The underlying technology is updated constantly to
include improvements from Google research teams, which results in better translations
and new languages and language pairs.
Cloud Vision
Cloud Vision offers both pretrained models via an API and the ability to build custom
models using AutoML Vision to provide flexibility depending on your use case.
Calculate correlations between data that you are getting from sensors, etc. For example,
using Bigtable.
Firebase Predictions
Group users based on predictive behavior.
Security
Learn if necessary.
Cloud IAM
Identity. Access is granted to members. Members can be of several types:
Resource:
Permissions:
You can set a Cloud IAM policy at any level in the resource hierarchy: the organization
level, the folder level, the project level, or the resource level. Resources inherit the
policies of the parent resource. If you set a policy at the organization level, it is
automatically inherited by all its children projects, and if you set a policy at the project
level, it’s inherited by all its child resources. The effective policy for a resource is the
union of the policy set at that resource and the policy inherited from higher up in the
hierarchy.
Best Practices:
Service Accounts: