The State of Resilience 2025
Confronting Outages, Downtime,
and Organizational Readiness
The State of Resilience 2025: Confronting Outages, Downtime, and Organizational Readiness 1
Executive summary
What 1,000 senior cloud and technology executives from
all over the world are saying about their organization’s
operational resilience — and their strategies for increasing it
Outages happen. Despite almost universal awareness of this fact, a shocking majority of businesses find themselves
dangerously exposed to serious consequences when an outage occurs.
Fallout following the recent CrowdStrike global outage jolted many organizations into action — 94% of technical executives
in this survey said that the event has catalyzed their companies to reassess their operational resilience. At the same
time, leaders at the global enterprise companies surveyed here, in “The State of Operational Resilience 2025,” report that
entrenched resistance to change, misaligned internal priorities, outdated systems, and budgetary gridlock prevent many from
implementing meaningful — sometimes even desperately needed — operational resilience measures.
1. Leaders are worried: 93% of leaders are concerned about the financial and organizational impacts of outages, and 95% are aware of operational weaknesses that leave them vulnerable. At the same time, however, 48% say their organizations aren’t doing enough to improve resilience.

2. The high cost of service disruption: 100% of companies surveyed experienced revenue losses from outages in the past 12 months, with per-outage losses ranging from at least US$10,000 to well over US$1,000,000.

The data also shows that the larger the organization, the larger the annual revenue loss. For companies over 1,000 employees and/or US$500 million ARR, outage-related losses averaged US$495,000, though a handful of these large enterprise organizations (8%) reported losses of US$1 million or higher over the last 12 months.

3. Outages are the new normal: On average, companies report 86 outages per year, translating to 324 minutes of weekly downtime. 55% experience weekly outages, while 14% report daily outages.

53% of banking and financial services companies report experiencing service disruptions at least weekly, as do 60% of retail and ecommerce enterprises.

These are not minor incidents. 70% of large enterprise companies¹ report that their outages typically take 60 minutes or more to resolve. Overall, nearly half of all respondents report that their average downtime lasts two or more hours before resolution, with 10% reporting the loss of a full workday or more before they are able to resume operations. The average outage time across all geographies, ARR, company sizes, and industries is 196 minutes, or more than three hours of service disruption.
1 Defined for this report as organizations with more than $500m ARR and 1,000+ employees
4. Unplanned outages cause more than economic losses: Externally, they cause the loss of consumer and business partner confidence, damaging the organization’s reputation. Internally, they diminish trust in the technical IT teams responsible for preventing or mitigating outages. An engineering group lacking the trust of the larger organization will have a difficult time recruiting and retaining quality technical staff.

Frequent outages also create a risk of staff burnout and high turnover when teams are forced to miss other deadlines (39%) and pile up a backlog of requests (43%), while staying later or working weekends (48%), to fire-fight outages.

5. Preparation is spotty: Only 20% of respondents describe their organization as fully prepared for outages. Only 33% have an organized response approach, and fewer than a third conduct regular failover testing.

6. Resilience investments are overdue: Many organizations have known weaknesses, with 49% planning to invest in automation, AI, and cloud infrastructure to boost resilience.

7. New operational resilience regulations loom large: A majority of all the technology executives surveyed (79%) admit their organization is not completely prepared to comply with new operational resilience governance regulations like DORA (going into effect in Jan 2025) and the NIS2 directive, opening them up to consequences. Nearly half (44%) say they are losing sleep over the regulatory fines and penalties that come from unplanned downtime or outages.

Executives in EMEA (Europe, the Middle East and Africa) (85%) are more likely than executives in APAC (Asia-Pacific) (76%) and North America (75%) to admit not being completely prepared to comply with new regulations regarding unplanned downtime and outages.
Survey Methodology
The Operational Resilience in Enterprises Survey was conducted by Cockroach Labs and Wakefield Research
among 1,000 Senior Cloud Architects, Engineering, & Technology Executives, with a minimum seniority of Vice
President in three regions: North America (US, Canada), EMEA (Germany, Italy, France, UK), and APAC (India,
Australia, Singapore), between August 29th and September 10th, 2024, using an online survey.
Results of any sample are subject to sampling variation. The magnitude of the variation is measurable and is
affected by the number of interviews and the level of the percentages expressing the results. For the interviews
conducted in this particular study, the chances are 95 in 100 that a survey result does not vary, plus or minus,
by more than 3.1 percentage points in the global sample, 6.9 percentage points in the United States, and 9.8
percentage points in each of the remaining markets from the result that would be obtained if interviews had been
conducted with all persons in the universe represented by the sample.
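The quoted margins can be reproduced with the textbook formula for a simple random sample at 95% confidence. The sketch below is illustrative only: it assumes the worst-case proportion p = 0.5 and a z-score of 1.96, and the per-market sample sizes of roughly 200 (United States) and 100 (each other market) are inferred from the stated margins, not reported in the methodology.

```python
import math

Z_95 = 1.96  # z-score for a two-sided 95% confidence level


def margin_of_error(sample_size: int, proportion: float = 0.5) -> float:
    """Return the 95% margin of error, in percentage points."""
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    return Z_95 * standard_error * 100


# 1,000 is the reported global sample; 200 and 100 are assumed sizes.
samples = [
    ("global", 1000),
    ("United States (assumed n)", 200),
    ("each other market (assumed n)", 100),
]
for label, n in samples:
    print(f"{label}: n={n}, margin = {margin_of_error(n):.1f} points")
```

Under these assumptions the three results round to 3.1, 6.9, and 9.8 percentage points, matching the figures stated above.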
Introduction
Just a few months have passed since the biggest global software failure to date. The CrowdStrike
outage damage was widespread and instantaneous: As millions of devices crashed, millions of
banking customers were cut off from their accounts. Airlines grounded flights by the thousands,
hospitals canceled surgeries, and universities canceled classes. Emergency police, fire, or medical
aid became unavailable in some areas as 911 call centers went offline. The cause of this global
disaster was every technologist’s worst nightmare: a single faulty section of code in a single
software update from a single company.
How can you ensure your company is prepared for next time?
We conducted this survey because operational resilience has, until now, been generally overlooked. Most companies focus on disaster recovery (DR), which is important after disruptions occur. However, operational resilience focuses on preventing service interruptions before they happen.

In today’s astonishingly complex technical architectures, with their interdependence of sophisticated digital services, anything that can go wrong, will go wrong eventually. The results of this survey bear this out: 100% of companies surveyed experienced revenue losses from outages in the past 12 months, with per-outage losses ranging from at least US$10,000 to well over US$1,000,000. Across all sizes and sectors, the enterprises in this survey experience, on average, 86 outages per year; the average length of an outage is 196 minutes, or over three hours of downtime.

[...] can dictate how, and even if, a company is allowed to operate. The numbers also fail to represent the reputational damage a company will experience following a major outage, including potentially a decrease in stock value when investor confidence gets rattled.

As the data from this survey demonstrates, outages carry more than financial costs, and can cause long-term repercussions. These seasoned technical leaders from every type of industry and from all around the globe divulge the current state of operational readiness in their own organizations, and their strategies for increasing their operational resilience without introducing new risks.
The report in six parts
Outages happen. Yet data collected in this survey reveals that, despite almost universal awareness of risks, a majority of
businesses find themselves dangerously exposed to serious consequences when an outage happens. The CrowdStrike
outage has jolted organizations into action, yet entrenched resistance to change, misaligned priorities, outdated systems, and
budgetary gridlock prevent many from implementing meaningful operational resilience measures. This report is in six parts:
Part 5: Key takeaways and emerging enterprise operational resilience strategies summarizes this report’s findings. What do they mean for technical leaders trying to operate in a business-as-usual environment that’s far more complex than ever before, while implementing positive changes without introducing new risk?

Part 6: How distributed SQL helps organizations achieve operational resilience uses data points and takeaways from this report to show how distributed application architecture helps enterprises increase their organizational resilience by mitigating the technical risks and weaknesses that lead to outages.
PART 1
Operational resilience in 2025
Are tech leaders worried about their operational resilience?
1.1 Real-world outage costs & impacts
100% of the technology executives surveyed in this report say that their company lost money due to
outages over the past 12 months.
Loss of trust or damage to our reputation with customers or prospective customers 37%
Dollars down the drain
In the past twelve months, across companies of all sizes and annual revenues in all sectors, nearly one-third (32%) of respondents lost US$100K or more due to outages.

Fear of firing: An overall 82% of leaders surveyed (87% in the US and APAC; 77% in EU) expressed fear that they or members of their technical teams could lose their jobs following a significant outage or downtime event.
1.2 Outages by the numbers: frequency & duration
100% of respondents report experiencing unplanned, unexpected downtime in the
past 12 months.
60% of Retail/Ecommerce companies report experiencing outages or service interruptions weekly.
2 Time to Recovery (or Time to Resolution) is the full time of an outage — from the time the system or product fails to
the time that it becomes fully functional again so that normal business operations can resume.
1.3 The worries keeping tech leadership awake at night
93% of executives in this survey say they are losing sleep over the impacts of
unplanned downtime.
And the data gathered for this report show that they have
good reason: 86% say that every minute of unplanned
downtime is a minute they risk losing customers,
perhaps permanently. Beyond the obvious financial fallout,
there are many other sources for executive anxiety as well:
Potential punishment: 82% of leaders acknowledge that their teams are worried technical staff will be fired for unplanned downtime or outages. These numbers are higher still at companies experiencing more frequent outages (several times per month or more, 88%), or where average outages last a significant amount of time (two hours or more, 88%).
1.4 Causes of downtime
The breathtaking technical complexity of modern applications means that a multitude of
faults can act as the root problem behind any outage or unplanned downtime — and that
the cause can lie both inside and outside of the afflicted organization.
Causes of outages reported by respondents:
Network issues: 38%
Software issues: 36%
Cyberattacks: 36%
Cloud service provider reliability: 35%
Third-party service failures: 33%
Environmental factors: 31%
Human error: 31%
Capacity issues: 30%
Hardware failures: 30%
Interestingly, the frequency of the causes behind the outages respondents have experienced was remarkably consistent across all cohorts examined in this survey, regardless of company revenue, size, longevity, or location.
This is good news, actually. When the root causes of downtime strike so similarly and consistently across the board, the
solutions and strategies for defeating downtime and improving operational resilience will be equally consistent for
every organization, everywhere.
PART 2
Enterprise disaster readiness
The current state of operational resilience challenges and response strategies
Are organizations more proactive or reactive when an outage happens? How are they working to prevent unplanned downtime in the first place?
2.1 The state of outage response strategies (or lack thereof)
Globally, 95% of the executives surveyed say they are aware of at least one existing,
unresolved operational weakness within their technical estate that puts their
organization at risk.
Nearly three-quarters (72%) say their organization has multiple operational weaknesses
that impact their organization’s ability to meet business and technical objectives.
2.2 The challenges tech leaders report facing when tasked with improving operational resilience
94% of leaders worldwide admit that the CrowdStrike outage has forced them to reevaluate
their business continuity strategies.
While leaders are saying almost unanimously that the global fallout from the CrowdStrike
outage has catalyzed their organizations to get serious about implementing operational
resilience strategies, some are worried that their company may not be able to change course.
PART 3
The road to resilience
What tactics do companies use to prevent unplanned downtime and service outages, and what are their strategies for improving resilience?
3.1 How organizations report they are “doing” operational resilience now
All organizations surveyed (100%) conduct at least some sort of resiliency testing to mitigate unplanned downtime or outages.
Other 0%
My organization does not conduct any resilience tests to mitigate unplanned downtime or outages 0%
3.2 Where organizations are targeting future investments to increase their resilience
Overall, results show that organizations believe that no one area of investment is the
most important for improving their operational resilience. These numbers are consistent
across companies worldwide, regardless of whether they are long-established
enterprises or smaller recent startups.
Chart categories: automation and AI-driven solutions; hardware (physical resources); cloud infrastructure and services; skills (capabilities); people (human capital); disaster recovery and business planning; other; none of these.
49% also feel automation and AI-driven solutions would be valuable.

49% believe their cloud infrastructure and services need additional investments.

Mind over machine?
In section 1.4, we saw that 31% of companies report experiencing an unplanned service outage due to human error.

[...] to increase the overall reliability of their systems.

A confluence of factors: Companies that say they are both (1) proactive in outage prevention and (2) committed to improving their operational resilience are the most likely to be investing in both human and artificial intelligence. This is a small cohort within the overall 1,000 leaders surveyed, but within this granular demographic respondents are significantly above average in seeking to increase their internal training (72%) and seeking automation and AI support to bolster their system reliability (68%).
PART 4
Operational resilience and regulatory risks
79% of technology leaders describe their organization as “not completely prepared” to comply with new regulations regarding unplanned downtime and outages.
It’s no secret that operational resilience has been moving up the list of regulatory
priorities for governments around the globe. Why? The probability of disruption —
i.e., outages — and the potential impacts from those disruptions are both steadily
increasing. Modern application architecture, reliant on digital technologies and
third-party platforms and services, has simultaneously expanded the threat surface
for disruptions. With no reason to believe any of these trendlines will reverse soon,
regulatory authorities are stepping in to act.
Technical leaders are worried. Very worried.

And for good reason. This survey, the first to examine enterprise readiness and concerns regarding regulatory compliance as these laws come into full effect, shows that many simply are not ready.
79% of executives overall describe their organization as “not completely prepared” to comply with new regulations regarding unplanned downtime and outages, while 44% say they have “significant concern” over their organization’s abilities to comply with operational resilience regulations.

Separately, 60% worry specifically about data loss or corruption occurring during an outage that could result in significant fines and penalties, not just under DORA and NIS2, but also GDPR and other data privacy regulations that impose penalties for data-related noncompliance.

There is, fortunately, technology that can help enterprises protect themselves. Distributed systems, and particularly distributed SQL databases, significantly contribute to improving organizational resilience by mitigating the technical risks and weaknesses that lead to outages.

See Part 6 of this report (“How distributed SQL helps organizations achieve operational resilience”) for information on how a distributed SQL database in particular can help organizations align with operational resilience regulations and achieve their overall operational resilience goals.
How to get in compliance with operational resilience regulatory [...]

DORA and NIS2 focus on mitigating the risks associated with disruption in essential or important digital services. GDPR and the alphabet soup of other current global data privacy laws (CCPA, PIPEDA, POPI, LGPD, HIPAA, PCI-DSS, to name but a few) focus on data integrity as an important component of overall operational resilience. Companies found to be out of compliance with these regulations can experience real and lasting consequences, both financial and operational, and no enterprise is immune.

It’s not just the EU’s DORA; there are many similar regulations on the horizon, from governments around the globe attempting to address operational risks, resilience, and business continuity. In the UK and Australia, new operational resilience requirements take effect in 2025. The governments of Hong Kong and Singapore are actively drafting new resilience regulations, while in the US representatives from the Federal Reserve, Treasury Department, and other agencies are currently developing a set of operational resilience standards as a blueprint for future legislation.

When “guaranteed uptime” is still downtime

A significant, and also very tactical, component of operational resilience lies in the service level agreements (SLAs) from each vendor regarding availability for services, platforms, or databases in your application stack. An availability SLA states the percentage of time that the cloud platform, database, or other service is operational. The goal is 100%, but even large and critical systems (such as the VISA card payments network or Amazon Web Services, for example) don’t promise 100% availability. This is because when a modern cloud application processes thousands, even tens of thousands, of queries per second, adverse events will inevitably occur. In the real world, the gold standard SLA is known as “five nines” or 99.999% uptime.

Results from companies in this survey show that they, on average across all organization sizes and sectors, experience 86 outages per year. The average outage lasts 196 minutes. This equates to an average of roughly 280 hours of downtime per year. This sounds like a lot of downtime, but it is close to what a service level agreement (SLA) guaranteeing 97% uptime permits: 3% of a year is about 263 hours.
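The arithmetic behind these figures can be checked with a short script. This is a rough sketch under stated assumptions: a 365-day year, outages that never overlap, and helper names invented here for illustration. The inputs (86 outages per year, 196 minutes each) come from the survey.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap, 365-day year


def annual_downtime_minutes(outages_per_year: float, avg_minutes: float) -> float:
    """Total yearly downtime, assuming outages never overlap."""
    return outages_per_year * avg_minutes


def allowed_downtime_minutes(uptime_pct: float) -> float:
    """Yearly downtime budget permitted by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)


def effective_uptime_pct(downtime_minutes: float) -> float:
    """Availability implied by a given amount of yearly downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


downtime = annual_downtime_minutes(86, 196)  # survey averages
print(f"Yearly downtime: {downtime / 60:.1f} hours")
print(f"Weekly downtime: {downtime / 52:.0f} minutes")
print(f"Implied uptime:  {effective_uptime_pct(downtime):.2f}%")
for sla in (97.0, 99.9, 99.999):
    print(f"{sla}% SLA allows {allowed_downtime_minutes(sla) / 60:.2f} hours/year")
```

Under these assumptions the survey averages work out to about 281 hours per year (about 324 minutes per week) and an implied availability of about 96.8%, while a 97% SLA permits roughly 263 hours per year and a five-nines SLA only about five minutes.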
A key tactic for improving operational resilience is, naturally, choosing solutions that offer the highest possible uptime.³

Using database-as-a-service (DBaaS) SLAs as an example, only two distributed SQL databases currently offer 99.999% uptime (which allows roughly 5 minutes, 15 seconds of downtime per year): Google Spanner and CockroachDB. Both are globally scalable, synchronously replicated, cloud-native distributed relational databases that offer extremely high uptime SLAs out of the box.

Spanner can only run on Google Cloud, which introduces a single point of failure for an application. A serious Google Cloud outage will take the database down along with it, even if the application itself is hosted on another cloud provider like AWS or Azure. CockroachDB, on the other hand, is cloud agnostic and can be deployed on any, or even all, of the major cloud providers, as well as self-hosted on premises and in hybrid cloud/on-prem installations.
PART 5
Key takeaways and emerging enterprise operational resilience strategies
In 2025, ensuring operational resilience is no longer optional. It’s imperative.
The cost of outages is high for companies. For 86% of executives, every minute of unplanned downtime is a minute they could
lose a customer. Across all geographies, company sizes, and industries, the average annual outage-related revenue loss was
US$222,323 over the past twelve months.
The findings from this survey of 1,000 global enterprise technology leaders highlight existing critical gaps organizations
currently experience in terms of operational resilience within their digital estates. Despite near-universal recognition of the
importance of uptime, many organizations remain dangerously vulnerable to the risks associated with unplanned downtime
and outages. And, as new regulations like DORA and NIS2 come into effect (with others on the horizon), ensuring compliance
and resilience has gone from something companies should do to something they must do — or face a very real risk of serious
operational, financial, and regulatory consequences.
Key takeaways from this report emphasize both the urgency and complexity of addressing these challenges:
High frequency and cost of outages: A majority of enterprises face frequent outages, with 69%
reporting service interruptions once or more per week, or even every day. These events are costly:
overall, 32% report financial losses of US$100,000 or more annually. But the damage is more than
dollars down the drain: 100% of organizations in this survey say that outages have had significant
negative impacts on their technical teams and staff. Frequent outages can also cause lasting reputational
damage, costing the organization customers as well as the confidence of the C-suite and external investors.
Regulatory pressure is increasing: With 79% of organizations admitting they are not fully prepared
to comply with operational resilience regulations, many executives are losing sleep over the potential
financial penalties. This situation underscores the pressing need for improved resilience strategies
to meet regulations like DORA and NIS2, as well as the emerging legislative actions currently under
consideration by governments around the world.
Automation, AI, and cloud investments: Nearly half (49%) of the respondents see investment
in automation, AI, and cloud infrastructure as critical steps towards enhancing resilience. These
technologies will help organizations improve system reliability, reduce human error, and recover more
quickly from outages.
Emerging enterprise strategies for improved resilience
In response to these challenges, enterprises are adopting a
range of strategies to bolster their resilience:
Downtime is inevitable. To deal with this truth, these strategies highlight a forward-looking approach where technology and
governance go hand in hand to secure the operational resilience of modern enterprises.
By aligning their technical estates with distributed architecture, automation, and regulatory requirements, companies can
improve their uptime and their ability to adapt to an increasingly complex and regulated digital environment — while putting
less stress on their technical teams and the business’s bottom line.
PART 6
How distributed SQL helps organizations achieve operational resilience
Here are key ways distributed SQL databases help organizations improve their operational resilience and align with regulatory requirements, current or future:
Planned downtime is still downtime: Beyond preventing outages, organizations dealing with frequent
planned downtime due to software upgrades or application changes can recapture that uptime with the
capabilities available in some distributed SQL solutions. For example, CockroachDB offers online updates
and live schema changes for zero downtime of any kind in production.
Distributed SQL databases provide high availability via a fault-tolerant design and active-active replication
architecture. They include disaster recovery capabilities to keep any downtime that does occur to
the absolute minimum defined by an organization’s RPO and RTO goals — and operational resilience
regulatory requirements.
Automated failover and disaster recovery: Operational continuity — keeping systems and
services available and online, even in adverse events — is a core tenet of operational resilience. Because
machines can almost always react faster than humans, intelligent automation of systems is critical to
optimizing resilience. Recent regulatory requirements for business continuity during system failures or
cyber incidents are rooted in the fact that many current databases do not offer sophisticated capabilities
for automating resilience.
Distributed SQL systems like CockroachDB automatically replicate data across geographic locations to
provide continuity, even in the face of entire-region outages. If a node fails, the system automatically
routes traffic to healthy nodes, ensuring continuity without manual intervention. Other cross-cluster
replication capabilities can be used to achieve a low RPO/RTO depending on deployment topology and
latency requirements.
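The failover behavior described above can be illustrated with a toy model. This is a deliberately simplified sketch, not CockroachDB's implementation: the `Node` and `Router` classes, the region names, and the round-robin policy are all hypothetical and exist only to show requests skipping a failed node without manual intervention.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Node:
    name: str
    healthy: bool = True


class Router:
    """Round-robin router that skips nodes marked unhealthy."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes
        self._ring = cycle(nodes)

    def route(self) -> Node:
        # Try each node at most once per request before giving up.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


# Hypothetical three-region deployment.
nodes = [Node("us-east"), Node("eu-west"), Node("ap-south")]
router = Router(nodes)
nodes[1].healthy = False  # simulate a regional failure
served_by = [router.route().name for _ in range(4)]
print(served_by)
```

In this toy run the failed region never receives a request and the remaining two regions share the traffic. A real distributed database also has to keep the replicas consistent while this happens, which the sketch does not attempt to model.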
Data integrity and compliance: Data integrity is a second core tenet of operational resilience —
and 36% of executives in this survey say they worry about outage-related data loss and corruption.
Distributed databases allow enterprises to maintain control over where specific data resides, helping
to meet the data residency requirements of data privacy laws such as GDPR while also optimizing
availability. For instance, CockroachDB supports features that enforce data sovereignty by keeping
specific data in servers located in different geographic locations while simultaneously enabling high
availability across regions. In addition, CockroachDB supports data encryption in flight and at rest to
enhance security, and authentication and authorization (AuthN/AuthZ) mechanisms to control who can access what information.
Faster recovery times: The survey notes that 57% of organizations are only moderately organized in
their response to unplanned downtime and meeting their RPO/RTO goals. Distributed SQL databases
offer built-in mechanisms for faster recovery times, and the most mature ones, like CockroachDB, fully
automate these mechanisms. These capabilities align with DORA’s requirement for prompt incident
management and recovery protocols.
By minimizing downtime, enabling quick recovery, assisting with compliance with data residency laws, and
enhancing overall operational resilience, distributed systems and SQL databases can help enterprises meet
DORA requirements — and help the 79% of executives who admit their organizations risk punitive fines and
other penalties because they are simply not prepared to comply with operational resilience regulations.
cockroachlabs.com