07 Resource Monitoring
Resource Monitoring
In this module, I’ll give you an overview of the resource monitoring options in Google
Cloud.
The features covered in this module rely on Google Cloud’s operations suite, a
service that provides monitoring, logging, and diagnostics for your applications.
Proprietary + Confidential
Agenda
02 Monitoring
Lab: Resource Monitoring
03 Logging
04 Error Reporting
05 Tracing
06 Profiling
In this module we are going to explore the Cloud Monitoring, Cloud Logging, Error
Reporting, Cloud Trace, and Cloud Profiler services. You will have the opportunity to
apply some of these services in the lab of this module.
Let me start by giving you a high-level overview of Google Cloud’s operations suite
and its features.
01 Google Cloud’s Operations Suite
Google Cloud’s operations suite
● Integrated monitoring, logging, diagnostics
● Manages across platforms
○ Google Cloud and AWS
○ Dynamic discovery of Google Cloud with smart defaults
○ Open-source agents and integrations
This provides you with access to powerful data and analytics tools plus collaboration
with many different third-party software providers.
● Monitoring
● Logging
● Error Reporting
● Trace
● Profiler
As we mentioned earlier, Google Cloud’s operations suite has services for monitoring,
logging, error reporting, fault tracing, and profiling. You only pay for what you use, and
there are free usage allotments so that you can get started with no upfront fees or
commitments. For more information about pricing, please refer to the documentation.
Now, in most other environments, these services are handled by completely different
packages, or by a loosely integrated collection of software. When you see these
functions working together in a single, comprehensive, and integrated service, you'll
realize how important that is to creating reliable, stable, and maintainable
applications.
Partner integrations
Google Cloud’s operations suite also supports a rich and growing ecosystem of
technology partners, as shown on this slide. This helps expand the IT ops, security,
and compliance capabilities available to Google Cloud customers. For more
information about integrations, please refer to the documentation.
02 Monitoring
Now that you understand Google Cloud’s operations suite from a high-level
perspective, let’s look at Cloud Monitoring.
SRE hierarchy (top to base): Product, Development, Capacity Planning, Testing + Release Procedures, Incident Response, Monitoring
If you want to learn more about SRE, we recommend exploring the free book written
by members of Google’s SRE team.
Monitoring
● Dynamic config and intelligent defaults
● Uptime/health checks
● Dashboards
● Alerts
Cloud Monitoring dynamically configures monitoring after resources are deployed and
has intelligent defaults that allow you to easily create charts for basic monitoring
activities.
This allows you to monitor your platform, system, and application metrics by ingesting
data, such as metrics, events, and metadata. You can then generate insights from this
data through dashboards, charts, and alerts.
For example, you can configure and measure uptime and health checks that send
alerts via email.
[Diagram: metrics scope. Labels: Hosts, Monitors, AWS Account #1]
A metrics scope is the root entity that holds monitoring and configuration information
in Cloud Monitoring. Each metrics scope can have between 1 and 375 monitored
projects, and the monitoring data for every project in the scope is visible from it.
A metrics scope contains the custom dashboards, alerting policies, uptime checks,
notification channels, and group definitions that you use with your monitored projects.
A metrics scope can access metric data from its monitored projects, but the metrics
data and log entries remain in the individual projects.
The first monitored Google Cloud project in a metrics scope is called the hosting
project, and it must be specified when you create the metrics scope. The name of that
project becomes the name of your metrics scope. To access an AWS account, you
must configure a project in Google Cloud to hold the AWS Connector.
https://cloud.google.com/monitoring/settings#concept-scope
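As a rough mental model only, the relationships described above can be sketched as a small Python data structure. The class, field, and project names here are hypothetical and are not part of any Cloud Monitoring client library; the 375-project limit is the figure quoted in these notes.

```python
from dataclasses import dataclass, field

MAX_MONITORED_PROJECTS = 375  # limit quoted in the notes above


@dataclass
class MetricsScope:
    """Hypothetical model of a metrics scope (not a Cloud Monitoring API)."""

    hosting_project: str
    monitored_projects: list = field(default_factory=list)
    dashboards: list = field(default_factory=list)
    alerting_policies: list = field(default_factory=list)
    uptime_checks: list = field(default_factory=list)

    def __post_init__(self):
        # The hosting project is the first monitored project in the scope.
        if self.hosting_project not in self.monitored_projects:
            self.monitored_projects.insert(0, self.hosting_project)

    @property
    def name(self):
        # The scope takes its name from the hosting project.
        return self.hosting_project

    def add_project(self, project_id):
        """Add a monitored project, enforcing the per-scope limit."""
        if project_id in self.monitored_projects:
            return
        if len(self.monitored_projects) >= MAX_MONITORED_PROJECTS:
            raise ValueError("a metrics scope can monitor at most 375 projects")
        self.monitored_projects.append(project_id)


scope = MetricsScope(hosting_project="ops-host-project")
scope.add_project("web-frontend-project")
```

The dashboards, alerting policies, and uptime checks live on the scope object, while the metric data itself would stay with each project, which mirrors the split described above.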
● Consider using separate metrics scopes for data and control isolation.
Because metrics scopes can monitor all your Google Cloud projects in a single place,
a metrics scope is a “single pane of glass” through which you can view resources
from multiple Google Cloud projects and AWS accounts. All users of Google Cloud’s
operations suite with access to that metrics scope have access to all data by default.
This means that a role assigned to one person on one project applies equally to all
projects monitored by that metrics scope.
In order to give people different roles per-project and to control visibility to data,
consider placing the monitoring of those projects in separate metrics scopes.
Cloud Monitoring allows you to create custom dashboards that contain charts of the
metrics that you want to monitor. For example, you can create charts that display your
instances’ CPU utilization, the packets or bytes sent and received by those instances,
and the packets or bytes dropped by the firewall of those instances.
In other words, charts provide visibility into the utilization and network traffic of your
VM instances, as shown on this slide. These charts can be customized with filters to
remove noise, groups to reduce the number of time series, and aggregates to group
multiple time series together.
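To show what filtering, grouping, and aggregating mean in practice, here is a minimal local sketch in plain Python. It does not call Cloud Monitoring; the sample data, the label names, and the helper functions are all made up for illustration, and CPU utilization is expressed as whole percentages.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: one entry per time series, with labels and
# CPU utilization points (percent) taken at the same two timestamps.
series = [
    {"labels": {"instance": "web-1", "zone": "us-central1-a"}, "points": [20, 40]},
    {"labels": {"instance": "web-2", "zone": "us-central1-a"}, "points": [60, 80]},
    {"labels": {"instance": "web-3", "zone": "europe-west1-b"}, "points": [10, 30]},
    {"labels": {"instance": "batch-1", "zone": "us-central1-a"}, "points": [95, 99]},
]


def filter_series(all_series, prefix):
    """Filter: keep only series whose instance name starts with `prefix`."""
    return [s for s in all_series if s["labels"]["instance"].startswith(prefix)]


def group_and_aggregate(selected, label):
    """Group series by `label`, then aggregate each group into one mean series."""
    groups = defaultdict(list)
    for s in selected:
        groups[s["labels"][label]].append(s["points"])
    # zip(*points) lines up the values that share the same timestamp position.
    return {
        key: [mean(column) for column in zip(*points)]
        for key, points in groups.items()
    }


web_only = filter_series(series, "web-")        # removes the noisy batch VM
by_zone = group_and_aggregate(web_only, "zone")  # four series become two
```

The filter drops the batch instance, and grouping by zone reduces the three remaining series to one averaged series per zone.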
Now, although charts are extremely useful, they can only provide insight while
someone is looking at them. But what if your server goes down in the middle of the
night or over the weekend? Do you expect someone to always look at dashboards to
determine whether your servers are available or have enough capacity or bandwidth?
If not, you want to create alerting policies that notify you when specific conditions are
met.
For example, as shown on this slide, you can create an alerting policy when the
network egress of your VM instance goes above a certain threshold for a specific
timeframe. When this condition is met, you or someone else can be automatically
notified through email, SMS, or other channels in order to troubleshoot this issue.
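The "above a threshold for a specific timeframe" condition can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cloud Monitoring evaluates alerting policies; the function name, the sample values, and the units are assumptions.

```python
def breaches_threshold(samples, threshold, duration_s):
    """Return True if values stay above `threshold` for at least `duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # the condition reset before the duration elapsed
    return False


# Hypothetical network egress samples in bytes per second, one per minute.
egress = [(0, 900), (60, 1500), (120, 1600), (180, 1700), (240, 800)]
should_alert = breaches_threshold(egress, threshold=1000, duration_s=120)
```

Requiring the breach to last for a duration keeps a single short spike from notifying anyone.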
You can also create an alerting policy that monitors your usage of Google Cloud’s
operations suite and alerts you when you approach the threshold for billing. For more
information about this, please refer to the documentation.
Here is an example of what creating an alerting policy looks like. On the left, you can
see an HTTP check condition on the summer01 instance. This will send an email that
is customized with the content of the documentation section on the right.
Uptime checks can be configured to test the availability of your public services from
locations around the world, as you can see on this slide. The type of uptime check
can be set to HTTP, HTTPS, or TCP. The resource to be checked can be an App
Engine application, a Compute Engine instance, a URL of a host, or an AWS instance
or load balancer.
For each uptime check, you can create an alerting policy and view the latency of each
global location.
Here is an example of an HTTP uptime check. The resource is checked every minute
with a 10-second timeout. Uptime checks that do not get a response within this
timeout period are considered failures.
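To make the timeout rule concrete, here is a minimal Python sketch of a single HTTP check. It is not how Cloud Monitoring implements uptime checks; the function name `http_uptime_check`, the injectable `opener` parameter, and the rule that only a 2xx status counts as success are assumptions for illustration.

```python
import urllib.request

CHECK_INTERVAL_S = 60  # the example above checks every minute
TIMEOUT_S = 10.0       # and allows 10 seconds for a response


def http_uptime_check(url, timeout_s=TIMEOUT_S, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx status within `timeout_s` seconds."""
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except OSError:
        # Timeouts, connection errors, and urllib's URLError/HTTPError are all
        # OSError subclasses; any of them counts as a failed check.
        return False
```

A scheduler would call this function once every `CHECK_INTERVAL_S` seconds and feed the results into an alerting condition. The `opener` parameter exists only so the check can be exercised without network access.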
Custom metrics
Slide callouts: metric type, descriptor name, predefined, custom
Code fragment (the opening lines of the snippet are missing from the source):
    metric_kind=monitoring.MetricKind.GAUGE,
    value_type=monitoring.ValueType.DOUBLE,
    description='This is a simple example of a custom metric.')
descriptor.create()
If the standard metrics provided by Cloud Monitoring do not fit your needs, you can
create custom metrics.
For example, imagine a game server that has a capacity of 50 users. What metric
indicator might you use to trigger scaling events? From an infrastructure perspective,
you might consider using CPU load or perhaps network traffic load as values that are
somewhat correlated with the number of users. But with a Custom Metric, you could
actually pass the current number of users directly from your application into Cloud
Monitoring.
To get started with creating custom metrics, please refer to the documentation.
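Once the user count is available as a metric, the scaling decision itself is simple arithmetic. The sketch below is a hypothetical illustration in plain Python and does not call Cloud Monitoring or any autoscaler; the target of 40 users per server is an assumed value chosen to leave headroom below the capacity of 50.

```python
SERVER_CAPACITY = 50          # users per game server, from the example above
TARGET_USERS_PER_SERVER = 40  # hypothetical scaling target, below full capacity


def servers_needed(current_users, users_per_server=TARGET_USERS_PER_SERVER):
    """Return how many game servers the reported user count calls for."""
    if current_users < 0:
        raise ValueError("current_users cannot be negative")
    if not 0 < users_per_server <= SERVER_CAPACITY:
        raise ValueError("users_per_server must be between 1 and the server capacity")
    # Integer ceiling division; always keep at least one server running.
    return max(1, -(-current_users // users_per_server))
```

Because the input is the real number of users rather than CPU or network load, the result does not depend on how closely those infrastructure metrics happen to track player activity.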
Lab Intro
Resource Monitoring
Let’s take some of the monitoring concepts that we just discussed and apply them in a
lab.
Lab objectives
In this lab, you learn how to use Cloud Monitoring to gain insight into applications that
run on Google Cloud. Specifically, you will enable Cloud Monitoring, add charts to
dashboards and create alerts, resource groups, and uptime checks.
03 Logging
Monitoring is the basis of Google Cloud’s operations suite, but the service also
provides logging, error reporting, and tracing. Let’s learn about logging.
Logging
● Platform, systems, and application logs
○ API to write to logs
○ 30-day retention
● Log search/view/filter
● Log-based metrics
Cloud Logging allows you to store, search, analyze, monitor, and alert on log data and
events from Google Cloud and AWS. It is a fully managed service that performs at
scale and can ingest application and system log data from thousands of VMs.
Logging includes storage for logs, a user interface called Logs Explorer, and an API to
manage logs programmatically. The service lets you read and write log entries, search
and filter your logs, and create log-based metrics.
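A log-based metric is essentially a count of the log entries that match a filter. The following plain-Python sketch illustrates that idea on in-memory data; the entry fields, the severity ordering, and the function name are assumptions for this example and do not reflect the Cloud Logging API or its filter syntax.

```python
from collections import Counter

SEVERITY_ORDER = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def count_matching(entries, min_severity="ERROR"):
    """Counter-style log-based metric: matching entries per resource."""
    floor = SEVERITY_ORDER.index(min_severity)
    counts = Counter()
    for entry in entries:
        if SEVERITY_ORDER.index(entry["severity"]) >= floor:
            counts[entry["resource"]] += 1
    return counts


# Hypothetical log entries from two VMs.
entries = [
    {"resource": "vm-1", "severity": "INFO", "message": "request served"},
    {"resource": "vm-1", "severity": "ERROR", "message": "database timeout"},
    {"resource": "vm-2", "severity": "CRITICAL", "message": "disk full"},
    {"resource": "vm-1", "severity": "ERROR", "message": "database timeout"},
]
error_counts = count_matching(entries)
```

The resulting counts could then be charted or used in an alerting condition like any other metric.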
Logs are only retained for 30 days, but you can export your logs to Cloud Storage
buckets, BigQuery datasets, and Pub/Sub topics.
Exporting logs to Cloud Storage makes sense for storing logs for more than 30 days,
but why should you export to BigQuery or Pub/Sub?
Exporting logs to BigQuery allows you to analyze logs and even visualize them in
Looker Studio.
BigQuery runs extremely fast SQL queries on gigabytes to petabytes of data. This
allows you to analyze logs, such as your network traffic, so that you can better
understand traffic growth to forecast capacity, network usage to optimize network
traffic expenses, or network forensics to analyze incidents.
For example, in this screenshot we queried our logs to identify the top IP addresses
that have exchanged traffic with our web server. Depending on where these IP
addresses are and who they belong to, we could relocate part of our infrastructure to
save on networking costs or deny some of these IP addresses if we don’t want them
to access our web server.
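The same "top IP addresses" question can be illustrated locally. The slide uses a BigQuery SQL query, which is not reproduced in these notes; the Python sketch below is only an analogue of that aggregation, and the record fields and sample addresses (taken from the documentation-reserved ranges) are made up.

```python
from collections import Counter


def top_talkers(records, n=3):
    """Return the `n` remote IPs with the most bytes exchanged, largest first."""
    totals = Counter()
    for record in records:
        totals[record["remote_ip"]] += record["bytes_sent"] + record["bytes_received"]
    return totals.most_common(n)


# Hypothetical exported log records for one web server.
records = [
    {"remote_ip": "203.0.113.5", "bytes_sent": 1000, "bytes_received": 200},
    {"remote_ip": "198.51.100.7", "bytes_sent": 300, "bytes_received": 100},
    {"remote_ip": "203.0.113.5", "bytes_sent": 500, "bytes_received": 300},
    {"remote_ip": "192.0.2.9", "bytes_sent": 50, "bytes_received": 25},
]
busiest = top_talkers(records, n=2)
```

On real data volumes this grouping and sorting is what you would push down to BigQuery rather than run in application code.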
If you want to visualize your logs, we recommend connecting your BigQuery tables to
Looker Studio. Looker Studio transforms your raw data into the metrics and
dimensions that you can use to create easy-to-understand reports and dashboards.
We mentioned that you can also export logs to Pub/Sub. This enables you to stream
logs to applications or endpoints.
04 Error Reporting
Let’s learn about another feature of Google Cloud’s operations suite: Error Reporting.
Error Reporting
Aggregate and display errors for running cloud services
● Error notifications
● Error dashboard
Error Reporting counts, analyzes, and aggregates the errors in your running cloud
services. A centralized error management interface displays the results with sorting
and filtering capabilities, and you can even set up real-time notifications when new
errors are detected.
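Counting and aggregating errors comes down to grouping similar events and tracking when each group was seen. The sketch below illustrates that idea in plain Python; the grouping key (exception type plus code location), the event fields, and the function names are assumptions, and Error Reporting's actual grouping logic is not described in these notes.

```python
from collections import defaultdict


def group_errors(events):
    """Aggregate error events by (exception type, code location)."""
    groups = defaultdict(lambda: {"count": 0, "first_seen": None, "last_seen": None})
    for event in events:
        group = groups[(event["exception"], event["location"])]
        group["count"] += 1
        ts = event["timestamp"]
        if group["first_seen"] is None or ts < group["first_seen"]:
            group["first_seen"] = ts
        if group["last_seen"] is None or ts > group["last_seen"]:
            group["last_seen"] = ts
    return dict(groups)


def new_error_keys(known_keys, groups):
    """Return the groups not seen before, which would trigger a notification."""
    return sorted(set(groups) - set(known_keys))


# Hypothetical error events with timestamps in seconds.
events = [
    {"exception": "KeyError", "location": "cart.py:42", "timestamp": 100},
    {"exception": "KeyError", "location": "cart.py:42", "timestamp": 160},
    {"exception": "TimeoutError", "location": "db.py:17", "timestamp": 130},
]
groups = group_errors(events)
fresh = new_error_keys({("KeyError", "cart.py:42")}, groups)
```

Sorting or filtering the groups by count, first seen, or last seen gives the kind of view an error dashboard presents.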
Currently, Error Reporting is generally available for App Engine on both standard and
flexible environments, Apps Script, Compute Engine, Cloud Functions, Cloud Run,
Google Kubernetes Engine, and Amazon EC2.
05 Tracing
Tracing is another Cloud Operations feature integrated into Google Cloud.
Tracing
Trace
Tracing system
● Displays data in near real-time
● Latency reporting
● Per-URL latency sampling
Cloud Trace is a distributed tracing system that collects latency data from your
applications and displays it in the Google Cloud console. You can track how requests
propagate through your application and receive detailed near real-time performance
insights.
Managing the amount of time it takes for your application to handle incoming requests
and perform operations is an important part of managing overall application
performance. Cloud Trace is actually based on the tools used at Google to keep our
services running at extreme scale.
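Per-URL latency reporting can be pictured as grouping request timings by URL and summarizing each group. The sketch below is a local illustration in plain Python and does not use the Cloud Trace SDKs; the sample timings, the nearest-rank percentile method, and the function names are assumptions.

```python
import math
from collections import defaultdict


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_report(samples):
    """Summarize (url, latency_ms) samples into per-URL latency statistics."""
    by_url = defaultdict(list)
    for url, latency_ms in samples:
        by_url[url].append(latency_ms)
    report = {}
    for url, values in by_url.items():
        values.sort()
        report[url] = {
            "count": len(values),
            "p50_ms": percentile(values, 0.50),
            "p95_ms": percentile(values, 0.95),
            "max_ms": values[-1],
        }
    return report


# Hypothetical sampled request latencies in milliseconds.
samples = [
    ("/checkout", 120), ("/home", 20), ("/checkout", 80),
    ("/checkout", 400), ("/home", 30), ("/checkout", 95),
]
report = latency_report(samples)
```

Looking at percentiles rather than averages makes slow outliers, such as the 400 ms checkout request here, stand out.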
06 Profiling
Finally, let’s cover the last feature of Google Cloud’s operations suite in this module,
which is Cloud Profiler.
Profiling
Profiler
● Continuously analyze the performance of CPU or
memory-intensive functions executed across an application.
Poorly performing code increases the latency and cost of applications and web
services every day. Cloud Profiler continuously analyzes the performance of CPU or
memory-intensive functions executed across an application.
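To get a feel for what a profiler reports, you can profile a function locally with Python's built-in `cProfile` module. This is a stand-in for illustration only: it is a one-off, local measurement rather than the continuous profiling that Cloud Profiler performs, and the functions `slow_sum_of_squares` and `handle_request` are hypothetical.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately CPU-heavy function standing in for a hot code path."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_request():
    """Hypothetical request handler that spends its time in one function."""
    return slow_sum_of_squares(200_000)


def profile_call(func):
    """Run `func` under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(handle_request)
```

Printing `report` lists the functions by cumulative time, which is how you would spot that `slow_sum_of_squares` dominates the cost of `handle_request`.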
Quiz
Question #1
Question
What is the foundational process at the base of Google’s Site Reliability Engineering (SRE)?
A. Capacity planning
B. Testing and release procedures
C. Monitoring
D. Root cause analysis
Question #1
Answer
What is the foundational process at the base of Google’s Site Reliability Engineering (SRE)?
C. Monitoring
Explanation:
Before you can take any of the other actions, you must first be monitoring the system.
Question #2
Question
Question #2
Answer
Explanation:
Cloud Trace provides latency sampling and reporting for App Engine, Google HTTPS
load balancers, and applications instrumented with the Cloud Trace SDKs. Reporting
includes per-URL statistics and latency distributions.
Question #3
Question
Question #3
Answer
Explanation:
Cloud Operations integration streamlines and unifies these traditionally independent
services, making it much easier to establish procedures around them and to use them
in continuous ways.
Review:
Resource Monitoring
In this module, we gave you an overview of Google Cloud’s operations suite and its
monitoring, logging, error reporting, fault tracing, and profiling features. Having all of
these integrated into Google Cloud allows you to operate and maintain your
applications, a practice known as site reliability engineering, or SRE.
If you’re interested in learning more about SRE, you can explore the book or some of
our SRE courses.