Enterprise Roadmap to SRE
How to Build and Sustain
an SRE Function
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Enterprise Roadmap to SRE, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the
publisher’s views. While the publisher and the authors have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Google. See our statement
of editorial independence.
978-1-098-11771-9
Table of Contents
Preface
3. SRE Principles
  Embracing Risk (SRE Book Chapter 3)
  Service-Level Objectives (SRE Book Chapter 4)
  Eliminating Toil (SRE Book Chapter 5)
  Monitoring Distributed Systems (SRE Book Chapter 6)
  The Evolution of Automation at Google (SRE Book Chapter 7)
  Release Engineering (SRE Book Chapter 8)
  Simplicity (SRE Book Chapter 9)
  How Do You Map These Principles to Your Existing Organization?
  Preventing Org-Destroying Mistakes
  Create a Safe-to-Fail Environment for Your Adoption Journey
  Beware Diverging Priorities
  How Do You Get Buy-In to These Principles, with the Critical Sign-Off and Backing You Need?
4. SRE Practices
  Where to Start?
  Where Are You Going?
  How to Get There
  What Makes SRE Possible?
  Building a Platform of Capabilities
  Leadership
  Staffing and Retention
  Upskilling
Conclusion
Preface
CHAPTER 1
Getting Started with Enterprise SRE
SRE Practices Can Coexist with the ITIL Framework
The Information Technology Infrastructure Library, or ITIL, is a set
of detailed practices for IT activities, such as IT service management
(ITSM). Not every enterprise uses ITIL, but if you do have some
level of ITIL adoption in your organization, then be prepared for
there to be substantial overlap between SRE and ITIL practices.
Also, because ITIL is a framework, your particular implementation
may have wide variations from the library.
There are some common antipatterns that will prove challenging to reconcile with SRE. A change advisory board (CAB) is a common pattern for change control. The SRE approach to continuous delivery means streamlining this body and making it strategic; you can read more in the article on streamlining change approval from Google’s DevOps Research & Assessment (DORA) program. Similarly, a network operations center (NOC) should move from an event-driven model to a more proactive approach centered on automation and enablement. In both cases, focus on evolving the model rather than trying to replace it immediately.
DevOps/Agile/Lean
DevOps has a multitude of definitions. For simplicity, we’ll assume
it includes the relevant parts of other methodologies such as Agile
(Scaled Agile Framework [SAFe], Disciplined Agile Delivery [DAD],
and Large Scale Scrum [LeSS]) and Lean (Six Sigma, Kanban).
Google’s DORA research shows that SRE and DevOps are complementary, so if you have some level of DevOps adoption in your organization, you can build on it.
Outline Your Expectations and Vision
Next, it’s important to understand what outcomes you expect. SRE has a number of technical and cultural components, but they all share a common goal of meeting reliability targets. You should expect to spend meaningful time and effort defining how these components interact with your existing frameworks. Simply stating “better reliability” isn’t going to work. Similarly, if you are expecting outcomes not related to reliability (e.g., cost, velocity), then be prepared to do some extra work adapting SRE practices to your overall vision.
we feel its presence in customer satisfaction (CSAT) scores, third-party sites like Downdetector, and the overall trend of moving more of our lives and businesses onto the internet. During the COVID-19 pandemic, many software as a service (SaaS) products experienced huge upticks in business and had to dramatically increase their reliability expectations.
Apart from availability, the commonly understood proxy for reliability, we can also consider durability, data residency, speed or performance under load, consistency, and quality of results as facets of reliability that consumers and customers of internet services implicitly expect.
Once we understand that reliability is actually a highly desired
feature of a product, we might even take the leap to state that it
is the most desired feature of a product. Because if the product is
not available, none of its features can be leveraged. If they cause
frustration due to performance or quality, they won’t be satisfying. If
they cannot be used during peak, critical moments, they might not
be worth having at all.
Google Search is famously “always up,” to the point that its presence feels ubiquitous. The availability of Google Search is a key differentiating factor when compared to its competitors, alongside speed, quality, ease of use, and user experience. This is not an accident, but a deliberate choice and investment made by Google for over a decade.
operate safely within well-understood guardrails. They should also
contain sensible defaults to nudge behavior in the right direction.
We’re going to give a quick overview of each principle from the SRE
book and how to translate that to your org. For more details, we
recommend reading the referenced chapters in the Site Reliability
Engineering book.
Antipattern: We’ll support any risks you take as long as they are
successful
Real risk budgets mean accepting some failures.
Antipattern: Giving up too soon. For example, trying SRE for six
months, then stopping after no immediate wins
This doesn’t mean that you need to achieve everything up
front, but there has to be a clear narrative around moving in
the right direction after a couple of quarters.
to “insufficient impact” by making the ongoing value of your SRE adoption visible. A proven way to do so is to find the right metrics for your organization to demonstrate this return on investment. For example, in retail you might focus on maximizing sales during Black Friday; in healthcare, on continuous compliance and availability; and in finance, on the throughput of a trading system or the speed to complete an analysis pipeline.
situation, are extremely helpful for getting people more comfortable
with touching production in a stress-free environment. Reliving a
recent outage is an easy way to start. If one team member can
play the part of the Dungeon Master and present the evidence as it
played out in real life, other team members can talk through what
they would have done and/or directly use tooling to investigate the
system as it was during the event.
Teams should also be encouraged to learn more from development teams about the systems they are operating. This is not only a good exercise to better understand the existing system but also an opportunity to directly introduce new instrumentation and to discuss and plan changes to the system, such as performance improvements or addressing scalability or consistency concerns. These conversations tend to be highly valuable in developing trust between teams.
A team’s capabilities can also be expanded by introducing new third-party tooling, adopting open source tools, or having teams write their own.
Where to Start?
When adding capabilities to a team, where do you begin? The problem space of reliability and SRE is vast, and not all capabilities are appropriate at the same time. We suggest starting with a set of practices that allow a team to learn what to work on next. Abstractly, we refer to a model called plan–do–check–act (PDCA). By basing each step on how the system is currently working, your next step will always be relevant. We explain later in this chapter how to build a platform of these capabilities and where to start. This set of early capabilities will form a flywheel, so your teams won’t have to guess at what to build or adopt next; it will develop naturally from their observations of the system.
• Institute SLOs.
• Formalize incident response.
• Practice blameless postmortems and reviews.
• Use risk modeling to perform prioritization.
• Burn through your reliability backlog based on error budget or
other risk tolerance methods.
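The SLO and error-budget steps above can be sketched in a few lines of Python. The `Slo` class and the halt-releases policy here are illustrative assumptions of ours, not a prescribed implementation:

```python
from dataclasses import dataclass

@dataclass
class Slo:
    target: float        # e.g., 0.999 for "three 9s" of availability
    window_minutes: int  # evaluation window, e.g., a rolling 30 days

    def error_budget_minutes(self) -> float:
        # The budget is the fraction of the window allowed to be "bad".
        return (1.0 - self.target) * self.window_minutes

def releases_allowed(slo: Slo, bad_minutes: float) -> bool:
    # Illustrative policy: halt feature releases once the budget is spent.
    return bad_minutes < slo.error_budget_minutes()

slo = Slo(target=0.999, window_minutes=30 * 24 * 60)
print(releases_allowed(slo, bad_minutes=50.0))  # False: budget exhausted, halt
```

The point is not the arithmetic but the decision rule: the same budget that tolerates failure also tells you, unambiguously, when to prioritize reliability work over features.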
Let this cycle be your flywheel to spin out new capabilities. For
example, if you have an outage in which a deployment introduced
a bug that crashed every server in the fleet, you’ll want to develop a
way to reduce that risk, possibly through something like blast radius
reduction, using canary releases, experiment frameworks, or other
forms of progressive rollouts. Similarly, if you find that a memory
leak is introduced, you might add a new form of load test to the
predeployment pipeline. Each of these is a capability that is added
to your platform, which can provide benefit and protection for
each service running on the platform. One-off fixes become rare as
generic mitigation strategies show their value.
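Blast-radius reduction via canary releases can be pictured as a gate that compares canary health against a baseline before promoting a rollout. The function name, threshold, and error-rate inputs below are hypothetical, a sketch rather than a real release system:

```python
# Hypothetical canary gate: run a release on a small slice of the fleet,
# and promote it only if the canary error rate stays near the baseline.
def canary_gate(baseline_error_rate: float,
                canary_error_rate: float,
                tolerance: float = 0.01) -> str:
    """Return the next rollout action for a release."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"  # blast radius was limited to the canary slice
    return "promote"       # continue the progressive rollout

print(canary_gate(0.001, 0.002))  # promote: within tolerance
print(canary_gate(0.001, 0.050))  # rollback: canary is misbehaving
```

A crashing deployment caught by such a gate takes out one slice of the fleet instead of all of it, which is exactly the generic protection the flywheel is meant to produce.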
Knowing If It Is Working
A well-run organization that understands and values reliability will exhibit a few observable traits. First is the ability to slow or halt feature delivery in the face of reliability concerns. When velocity and shipping are the only goals, reliability and other nonfunctional demands will always suffer. Do reliability efforts always get deprioritized in favor of features? Are projects proposed but never finished due to “not enough time”? An important caveat here is that this should not be seen as slowing down the code delivery pipeline: you should keep your foot on the gas.
Another indicator of success is when individual heroism is no longer praised, but instead actively discouraged. When the success of a system rests on the shoulders of a small set of people, teams create an unsustainable culture of heroism that is bound to collapse. Heroes are incentivized to keep sole control of their knowledge and unmotivated to systematically eliminate the need for that knowledge. This is similar to the character Brent in The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford (IT Revolution Press). Not only is it inefficient to have a Brent, it can also be downright dangerous. A team has to actively discourage individual heroism while maintaining shared responsibility, because heroism can feel like a rational approach in short-term thinking.
Another sign of a well-functioning team is that reliability efforts are funded before outages, as a part of proactive planning. In poorly behaved teams, we see that an investment in reliability is used to treat an outage or series of outages. Although this might be a necessary increase, the investment needs to be maintained over time, not just treated as a one-off and abandoned or clawed back “once things get better.”
To illustrate this further, consider a simplification of your organization’s approach to reliability as two modes: “peacetime” and “wartime.” Respectively: “things are fine” and “everybody knows it’s all about to fall apart.” By considering these two modes distinctly, you’re able to make a choice about investment. During wartime, you spend more time and money on hidden features of your platform, infrastructure, process, and training. During peacetime, you don’t abandon that work, but you certainly invest far less.
However, who decides when a company is in wartime? How is that decision made? How is it communicated throughout the company in ways that don’t cause panic or attrition? One method is to use priority codes, such as Code Yellow or Code Red, two organizational practices that aid teams in prioritizing work. Code Yellow implies a technical problem that will become a business emergency within a quarter. Code Red implies the same within days, or it is used for an already-present threat. These codes should have well-defined criteria that are understood and agreed to by your whole leadership team, and their declaration must be approved by leadership for the intended effect to take place. The outcome of such codes should be a change in team priorities, potentially the cessation of existing work (as in the case of a Code Red), the approval of large expenditures, and the ability to pull other teams in to help directly. Priority codes are expensive operations for an enterprise, so you should make sure there are explicit outcomes. These should be defined from the outset as exit criteria and clearly articulated upon completion. Without this, teams will experience signal fatigue and no longer respond appropriately.
Making Decisions
By setting up reliability as an investment into a stronger product,
you’re able to make longer-term plans that have far greater impact.
Traditional models treat IT as a cost center, focusing entirely on
reducing that cost over time. At the end of the day, it doesn’t
matter that the service is cheap if it’s not up. You can still apply
cost reduction, but you should consider it after you’ve achieved
reliability goals. If you’re finding that the cost of maintaining your stated reliability goals is too high, you can explicitly redefine those goals (i.e., “drop a 9”) and evaluate the trade-offs that result.
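To see what “dropping a 9” actually trades away, here is a back-of-the-envelope calculation of the downtime each common availability target permits over a 30-day window; it is a sketch for discussion, not a policy recommendation:

```python
# Allowed "bad" minutes in a 30-day window for common availability targets.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999):
    budget = (1 - target) * WINDOW_MINUTES
    print(f"{target:.2%} availability allows {budget:.1f} minutes of downtime")
# Dropping from four 9s to three 9s relaxes the budget from ~4.3 to ~43 minutes.
```

Framing the decision in minutes makes the trade-off concrete for a governing board: each added 9 costs roughly 10x the engineering effort for 10x less tolerated downtime.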
To achieve all these goals, you’ll likely need to persuade some governing board, group of decision makers, or executives. You’ll need their buy-in to staff and maintain a team over time, provision resources, and train and further develop team members. This should be seen as a long-term investment and explicitly funded accordingly, not as a hidden line item in some other budget.
Antipattern: Ignoring Ulysses

When it comes to reliability, a common antipattern is to let outages or other “bad news” affect your planning cycle, even when they’re expected. Often, it’s tempting for leadership to feel the need to “do something” in the face of bad news, and “sticking to the plan” often doesn’t feel impactful. However, given a plan that expects outages to happen, unless a significant change in the understanding of a system occurs, “sticking to the plan” is exactly the right thing to do.

The term Ulysses pact can be a useful illustration here. This is where a leader (Odysseus) tells his team (his sailors) to stick to the plan (sail past the Sirens as he is tied to the mast). When his team sticks to the plan (despite his thrashing and begging to stop), he congratulates them. They didn’t get tempted by short-term thinking. Their plan considered the long-term impact, and they had time to make a clear plan before the chaos started.

By allowing a team to make in-the-moment decisions, you’re often choosing to ignore a good plan and make emotional or ego-driven choices instead. A classic example of this is a leader “walking into an outage” and taking over command without having full context, despite a capable team already being in control. This is often the outcome of a company’s culture. A culture of HiPPOs (decisions based on the Highest Paid Person’s Opinion) can have a drastically bad effect on incident management and reliability in general.

Instead, listen to Odysseus, stick to the plan, and don’t abandon ship. This applies not only to incident response but also to things like error budget exhaustion or tracking SLOs in the face of “really bad” incidents. If your plan is to halt a feature release when the error budget is exhausted, but you make an exception for “this important feature” every time, your leadership will be severely undercut. An effective practice to improve this is the introduction of “silver bullets,” in which a leader is granted three silver bullets to be used sparingly as an override to the expected plan. By introducing this artificial scarcity, leaders are required to make explicit trade-offs.

Similarly, if a single bad event wipes out an SLO, don’t ignore it. Gather the team to analyze how this changes your collective understanding of the system. Was this type of failure never considered before? Was the response to an outage insufficient?
Upskilling
When growing and transitioning existing staff into SREs, it is critical to build an upskilling plan. This includes both the what and the how: that is, what skills are needed in the role and how you’ll go about enabling staff to acquire those skills. Tools like skills gap analyses and surveys are useful here to check assumptions about the foundational skills that are required for the job. These skills, often not talked about specifically in SRE literature, are nevertheless essential to allow SREs to scale their contributions organization-wide. For example, it is not unheard of for traditional operations teams to be unfamiliar with software engineering fundamentals such as version control, unit testing, and software design patterns. Ensuring that these baselines are a part of your upskilling plan, and that they are tailored to each learner profile, is crucial not just to establish a critical mass of skill on the team but to provide a smooth on-ramp for individuals into the new expectations of their role (and thus help reduce team churn).
Once you’ve decided that SRE is worth pursuing for your organization and resolved to invest in it, it’s important to ensure that your investment is a successful one. It’s always hard to introduce change into a system, but it’s even harder to make that change stick. Here are some tips on how to keep SRE working in your organization.
Google internally uses shared objectives and key results (OKRs) to align teams and set goals when it’s not always clear how they’ll be achieved. Your organization might have its own processes to do this, but they must be extended to include explicit iterations and periodic reviews of SRE team metrics (toil, alerting, software engineering impact, capacity plans, etc.). The nonlinear nature of adoption means your progress will always include setbacks, so setbacks should be treated as a normal part of the process.
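One of those team metrics, toil, can be made reviewable with a simple measurement: the fraction of team time spent on manual, repetitive work. The categories and bookkeeping below are our own illustrative assumptions; the guideline of keeping toil under roughly 50% is common SRE practice:

```python
# Illustrative toil review: what fraction of team time went to manual,
# repetitive work? SRE practice commonly aims to keep toil below ~50%.
TOIL_CATEGORIES = {"tickets", "manual_releases", "oncall_interrupts"}

def toil_fraction(hours_by_category: dict[str, float]) -> float:
    toil = sum(h for cat, h in hours_by_category.items()
               if cat in TOIL_CATEGORIES)
    return toil / sum(hours_by_category.values())

week = {"tickets": 10, "manual_releases": 6, "oncall_interrupts": 4,
        "project_work": 20}
print(f"{toil_fraction(week):.0%}")  # 50%: at a threshold worth reviewing
```

Tracking this number quarter over quarter gives the periodic OKR review something concrete to act on, such as funding automation when the fraction trends up.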
1. Sublinear Scaling
We’ve mentioned this earlier, but it’s important to clarify that this isn’t about “doing more with less” but, rather, about using automation and a culture of continuous improvement to change the way we approach reliability problems. SRE is explicitly designed not to scale through headcount, so resist the temptation to add more people to existing steps in your software assembly line; use SREs to automate or eliminate those steps instead.
Healthcare // Joseph
Joseph Bironas has been leading SRE adoption in several healthcare organizations since his time as a Google SRE leader. As such, he was able to provide an industry-wide view of how implementing SRE in this space differs from other tech and startup cultures. Due to the nature of its life-critical workflows, reliability is often top-of-mind. However, the healthcare industry faces specific challenges that span organizational models, culture, budgeting, and regulatory requirements.
After working with a company that focused on very tight margins in areas like medical device manufacturing, as well as FDA-regulated fields, Joseph observed that reliability is understood as a requirement, but that the cost–benefit ratio of SRE is far from well understood in the industry. As a result, SRE and infrastructure teams can find themselves doing “catch-all engineering,” pulled into an IT cost center with their scope increased dramatically.
What’s wrong with an SRE team being managed under an IT cost center, you might ask? When enterprises are used to managing through broad IT frameworks like ITIL, it’s hard to make value judgments about SRE, which covers only a subset of ITIL’s scope; ITIL also handles things like hardware procurement, on which SRE has no opinion. More to the point, a CIO who manages all of corporate and production IT is not in the best position to make judgments on software systems reliability. Instead, rolling up to a software-focused leader, such as an SVP of Engineering or perhaps a CTO, makes more sense.
Organizations in this field often face the steep curve of hoping to adopt SRE when they haven’t yet adopted DevOps practices. For example, one organization was releasing software once a month, with very little CI/CD automation, due to the necessary complexity around regulations and organization-wide compliance controls. Many healthcare organizations simply don’t want to deploy quickly: for some customers, deploying too fast implies inadequate testing or insufficient safety.
The willingness to implement changes—such as a meaningful pivot
to SRE—varies widely across the industry, perhaps due to differing
leadership priorities and styles. Joseph described one scenario in
which a team was able to send designers out into the field to
gather requirements, build new workflows, and revolutionize care
through a better product. In another scenario, a different team was
only incentivized to be better than the incumbents, which didn’t
require the same level of investment. In a third scenario, a team was
plagued by inertia, waiting for a top-down mandate before making
any change or investment. In Joseph’s experience, more progressive
leadership tends to be more sensitive to customer demands for
reliability.
In contrast to startup culture, change is very slow for some of these teams. One particular team questioned whether they could accomplish “anything” (e.g., adopting SRE) in 18 months, an eternity for a startup. When considering significant changes in organizations that move at this pace, you need models to help understand planned returns on investment. Knowing about the J curve (see roofshots versus moonshots) is important here to avoid abandoning an effort in its trough, before the real return. Joseph recommended checking in quarterly with teams to keep a steady cadence on progress. He recommends starting the transition to SRE with incident response.
At the end of the day, though, there was value in the fact that people
started measuring things they were not measuring before. SLOs gave
the teams a way to ask each other a question they were not asking
before. This all goes to show that it’s important to talk to each other
and help each other make good decisions based on the data in front
of us.
We hope this provides some insight into how your enterprise might adopt SRE, and where the challenges may lie. We think your chances of success are higher if you clearly define your SRE principles, map those to practices and capabilities, and prioritize the growth and nurturing of those within your team. We also showed some examples of teams that have gone through the process of spinning up an SRE practice within an enterprise, and the specific challenges they faced and overcame.
We think this report will help your adoption of SRE and lead to a more reliable technology experience for everyone. And we hope that through this adoption, operations teams can become more sustainable, services can become more scalable, and development velocity can increase.
“May the queries flow and the pager be silent.”
About the Authors
James Brookbank is a cloud solutions architect at Google. Solutions architects help make cloud easier for Google’s customers by solving complex technical problems and providing expert architectural guidance. Before joining Google, James worked at a number of large enterprises with a focus on IT infrastructure and financial services.
Steve McGhee is a reliability advocate, helping teams understand how best to build and operate world-class, reliable services. Before that, he spent more than 10 years as an SRE within Google, learning how to scale global systems in Search, YouTube, Android, and Cloud. He managed multiple engineering teams in California, Japan, and the UK. Steve also spent some time with a California-based enterprise to help them transition onto the Cloud.