The State of Resilience 2025

Uploaded by

Miguel Martínez
The State of

Resilience 2025
Confronting Outages, Downtime,
and Organizational Readiness

The State of Resilience 2025: Confronting Outages, Downtime, and Organizational Readiness 1
Executive summary
What 1,000 senior cloud and technology executives from
all over the world are saying about their organization’s
operational resilience — and their strategies for increasing it

Outages happen. Despite almost universal awareness of this fact, a shocking majority of businesses find themselves
dangerously exposed to serious consequences when an outage occurs.

Fallout following the recent CrowdStrike global outage jolted many organizations into action — 94% of technical executives
in this survey said that the event has catalyzed their companies to reassess their operational resilience. At the same
time, leaders at the global enterprise companies surveyed here, in “The State of Operational Resilience 2025,” report that
entrenched resistance to change, misaligned internal priorities, outdated systems, and budgetary gridlock prevent many from
implementing meaningful — sometimes even desperately needed — operational resilience measures.

1. Leaders are worried: 93% of leaders are concerned about the financial and organizational impacts of outages, and 95% are aware of operational weaknesses that leave them vulnerable. At the same time, however, 48% say their organizations aren’t doing enough to improve resilience.

2. The high cost of service disruption: 100% of companies surveyed experienced revenue losses from outages in the past 12 months, with per-outage losses ranging from at least US$10,000 to well over US$1,000,000.

The data also shows that the larger the organization, the larger the annual revenue loss. For companies with over 1,000 employees and/or US$500 million ARR, outage-related losses averaged US$495,000, though a handful of these large enterprise organizations (8%) reported losses of US$1 million or higher over the last 12 months.

3. Outages are the new normal: On average, companies report 86 outages per year, translating to 324 minutes of weekly downtime. 55% experience weekly outages, while 14% report daily outages.

53% of banking and financial services companies report experiencing service disruptions at least weekly, as do 60% of retail and ecommerce enterprises.

These are not minor incidents. 70% of large enterprise companies¹ report that their outages typically take 60 minutes or more to resolve. Overall, nearly half of all respondents report that their average downtime lasts two or more hours before resolution, with 10% reporting the loss of a full workday or more before they are able to resume operations. The average outage time across all geographies, ARR, company sizes, and industries is 196 minutes, or more than three hours of service disruption.
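
The 324-minute weekly downtime figure appears to follow directly from the two averages quoted in this summary (86 outages per year, 196 minutes per outage). The short sketch below reproduces that arithmetic. The 52-week year and the function name are assumptions made for illustration, not something the report states.

```python
# Averages quoted in the report
OUTAGES_PER_YEAR = 86        # average outages per company per year
MINUTES_PER_OUTAGE = 196     # average outage length in minutes

# Assumption for this sketch: a 52-week year
WEEKS_PER_YEAR = 52


def weekly_downtime_minutes(outages_per_year, minutes_per_outage,
                            weeks_per_year=WEEKS_PER_YEAR):
    """Average downtime per week implied by a yearly outage count and mean duration."""
    return outages_per_year * minutes_per_outage / weeks_per_year


annual_minutes = OUTAGES_PER_YEAR * MINUTES_PER_OUTAGE
weekly_minutes = weekly_downtime_minutes(OUTAGES_PER_YEAR, MINUTES_PER_OUTAGE)

print(f"Annual downtime: {annual_minutes} minutes (~{annual_minutes / 60:.0f} hours)")
print(f"Weekly downtime: {weekly_minutes:.0f} minutes")
```

Plugging in an organization's own outage count and mean duration gives a comparable weekly figure to benchmark against the survey average.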

1 Defined for this report as organizations with more than $500m ARR and 1,000+ employees

4. Unplanned outages cause more than economic losses: Externally, they cause the loss of consumer and business partner confidence, damaging the organization’s reputation. Internally, they diminish trust in the technical IT teams responsible for preventing or mitigating outages. An engineering group lacking the trust of the larger organization will have a difficult time recruiting and retaining quality technical staff.

Frequent outages also create a risk of staff burnout and high turnover when teams are forced to miss other deadlines (39%) and pile up a backlog of requests (43%), all while staying later or working weekends (48%) to fire-fight outages.

5. Preparation is spotty: Only 20% of respondents describe their organization as fully prepared for outages. Only 33% have an organized response approach, and less than a third conduct regular failover testing.

6. Resilience investments are overdue: Many organizations have known weaknesses, with 49% planning to invest in automation, AI, and cloud infrastructure to boost resilience.

7. New operational resilience regulations loom large: A majority of all the technology executives surveyed (79%) admit their organization is not completely prepared to comply with new operational resilience governance regulations like DORA (going into effect in Jan 2025) and the NIS2 directive, opening them up to consequences. Nearly half (44%) say they are losing sleep over the regulatory fines and penalties that come from unplanned downtime or outages.

Executives in EMEA (Europe, the Middle East and Africa) (85%) are more likely than executives in APAC (Asia-Pacific) (76%) and North America (75%) to admit not being completely prepared to comply with new regulations regarding unplanned downtime and outages.

Survey Methodology
The Operational Resilience in Enterprises Survey was conducted by Cockroach Labs and Wakefield Research
among 1,000 Senior Cloud Architects, Engineering, & Technology Executives, with a minimum seniority of Vice
President in three regions: North America (US, Canada), EMEA (Germany, Italy, France, UK), and APAC (India,
Australia, Singapore), between August 29th and September 10th, 2024, using an online survey.

Results of any sample are subject to sampling variation. The magnitude of the variation is measurable and is
affected by the number of interviews and the level of the percentages expressing the results. For the interviews
conducted in this particular study, the chances are 95 in 100 that a survey result does not vary, plus or minus,
by more than 3.1 percentage points in the global sample, 6.9 percentage points in the United States, and 9.8
percentage points in each of the remaining markets from the result that would be obtained if interviews had been
conducted with all persons in the universe represented by the sample.
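
The margins quoted above are consistent with the standard worst-case margin-of-error formula for a simple random sample at 95% confidence. The sketch below is an illustration only: the report states just the global sample size (1,000), so the per-market sizes of 200 and 100 are assumptions chosen because they reproduce the stated 6.9 and 9.8 points. The function name is hypothetical as well.

```python
import math

Z_95 = 1.96  # z-score for a two-sided 95% confidence level


def margin_of_error(sample_size, proportion=0.5, z=Z_95):
    """Margin of error in percentage points for a simple random sample.

    proportion=0.5 is the worst case; margins shrink as results move
    toward 0% or 100% (the "level of the percentages" in the text).
    """
    return 100 * z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Only the global n=1000 is stated in the report; the other sizes are assumed.
samples = {
    "Global sample": 1000,
    "United States (assumed n)": 200,
    "Each remaining market (assumed n)": 100,
}

for label, n in samples.items():
    print(f"{label}: n={n}, +/-{margin_of_error(n):.1f} percentage points")
```

The same function shows why small regional cuts of the data carry wider uncertainty than the global headline numbers.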

Introduction
Just a few months have passed since the biggest global software failure to date. The CrowdStrike
outage damage was widespread and instantaneous: As millions of devices crashed, millions of
banking customers were cut off from their accounts. Airlines grounded flights by the thousands,
hospitals canceled surgeries, and universities canceled classes. Emergency police, fire, or medical
aid became unavailable in some areas as 911 call centers went offline. The cause of this global
disaster was every technologist’s worst nightmare: a single faulty section of code in a single
software update from a single company.

How can you ensure your company is prepared for next time?

We conducted this survey because operational resilience has, until now, been generally overlooked. Most companies focus on disaster recovery (DR), which is important after disruptions occur. However, operational resilience focuses on preventing service interruptions before they happen.

In today’s astonishingly complex technical architectures, with their interdependence of sophisticated digital services, anything that can go wrong will go wrong eventually. The results of this survey bear this out: 100% of companies surveyed experienced revenue losses from outages in the past 12 months, with per-outage losses ranging from at least US$10,000 to well over US$1,000,000. Across all sizes and sectors, the enterprises in this survey experience, on average, 86 outages per year; the average length of an outage is 196 minutes, over three hours of downtime.

These numbers, startling as they are, still don’t represent the true cost of downtime. These figures do not include any additional fines or penalties that may be levied as part of new operational resilience regulatory actions in the EU and beyond. Such fines can easily double the initial costs of the outage itself and, because these regulations carry the force of law, they allow government entities to penalize out-of-compliance companies and to dictate how, and even if, a company is allowed to operate. The numbers also fail to represent the reputational damage a company will experience following a major outage, including a potential decrease in stock value when investor confidence gets rattled.

As the data from this survey demonstrates, outages carry more than financial costs and can cause long-term repercussions. These seasoned technical leaders from every type of industry and from all around the globe divulge the current state of operational readiness in their own organizations, along with their strategies for increasing their operational resilience without introducing new risks.
The report in six parts
Outages happen. Yet data collected in this survey reveals that, despite almost universal awareness of risks, a majority of
businesses find themselves dangerously exposed to serious consequences when an outage happens. The CrowdStrike
outage has jolted organizations into action, yet entrenched resistance to change, misaligned priorities, outdated systems, and
budgetary gridlock prevent many from implementing meaningful operational resilience measures. This report is in six parts:

Part 1: Operational resilience in 2025 reveals the current state of operational resilience in 2025. How often do organizations experience outages? What are the common causes behind this downtime, and what exactly are the financial and operational impacts when a service disruption happens?

Part 2: Enterprise disaster readiness queries how companies respond to outages, and how they work to prevent them in the first place. What challenges do they face when tasked with improving their organization’s operational resilience, and what factors potentially block their progress?

Part 3: The road to resilience is a look at the tactics companies use to prevent unplanned downtime and service outages and their strategies for making improvements. Where are organizations targeting new investments aimed at increasing their current operational resilience?

Part 4: Operational resilience and regulatory risks examines the average organization’s readiness to comply with new legal mandates around operational resilience and ability to comply with data privacy regulations in the event of data loss or corruption following an outage.

Part 5: Key takeaways and emerging enterprise operational resilience strategies summarizes this report’s findings. What do they mean for technical leaders trying to operate in a business-as-usual environment that’s far more complex than ever before, while implementing positive changes without introducing new risk?

Part 6: How distributed SQL helps organizations achieve operational resilience uses data points and takeaways from this report to show how distributed application architecture helps enterprises increase their organizational resilience by mitigating the technical risks and weaknesses that lead to outages.

PART 1

Operational
resilience in 2025
Are tech leaders worried about their operational
resilience?

In a word, yes — and for good reason.

1.1 Real-world outage costs & impacts
100% of the technology executives surveyed in this report say that their company lost money due to
outages over the past 12 months.

Regulatory fines and penalties 44%

Erosion of trust in my IT organization by leadership 44%

Recovery costs 43%

Compensation costs 39%

Loss of trust or damage to our reputation with customers or prospective customers 37%

Data loss 36%

Lost revenue 34%

Dollars down the drain
In the past twelve months, across all companies of all sizes and annual revenues in all sectors, nearly one-third (32%) of respondents lost US$100K or more due to outages.

And, the bigger the player, the bigger the penalty: companies with higher revenue (US$500m+) are 256% more likely to incur outage-related financial losses of over US$1M per year. On the surface, this reflects the fact that more people are impacted when a larger company suffers an outage, but it also illustrates just how high the financial stakes can be.

Losing more than money
100% of organizations in this survey say that outages have had significant negative impacts on their technical teams and staff.

Drop everything: Beyond causing financial losses, unplanned downtime or outages disrupt normal business operations in other ways. In fact, 92% of the executives surveyed report their teams must occasionally deprioritize essential work in order to address unplanned downtime or outages. Two thirds (66%) say they are forced to deprioritize everyday tasks like improvements, maintenance, or administrative tasks frequently or even all the time.

Fear of firing: An overall 82% of leaders surveyed (87% in the US and APAC; 77% in EU) expressed fear that they or members of their technical teams could lose their jobs following a significant outage or downtime event.

Fire-fighting fallout: Beyond dropping workday responsibilities to fire-fight an outage, an overall 48% of executives say their teams also must work overtime and on weekends to fully restore normal operations. The same number (48%) also say that unplanned downtime has derailed their tech teams from meeting objectives, which in turn creates even more unplanned work required to prepare for and conduct post-mortems and investigations (39%).

Stressed technical teams: Not surprisingly, this frequently leads to friction and finger-pointing among technical teams (43% overall, 48% in North America). The majority of respondents (91%) say that overtime and increasing work backlog due to outages are a significant stress factor on their teams, which in turn can lead to potential burnout and higher turnover.

1.2 Outages by the numbers: frequency & duration
100% of respondents report experiencing unplanned, unexpected downtime in the
past 12 months.

Despite unanimous recognition that outages cost companies both financially and organizationally, the data gathered for this report shows that unplanned downtime and service outages are not isolated incidents. A significant majority of companies (69%) experience outages/service interruptions at least weekly. For about one in seven companies (14%), outages are a daily occurrence.

53% of Banking and Financial Services companies report experiencing outages or service interruptions weekly or more often, as do 60% of Retail/Ecommerce companies.

Time is money. The clock is ticking when an outage occurs; it’s a situation where time is literally money down the drain. Stunningly, when an outage happens, the overall average time to recovery (TTR)² is 151 minutes (unless you’re in India, where the average outage lasts 211 minutes).

Overall, just 2% of companies surveyed say they are able to resolve an unplanned outage in 60 seconds or less.

2 Time to Recovery (or Time to Resolution) is the full time of an outage, from the time the system or product fails to the time that it becomes fully functional again so that normal business operations can resume.

1.3 The worries keeping tech leadership awake at night
93% of executives in this survey say they are losing sleep over the impacts of
unplanned downtime.

And the data gathered for this report show that they have good reason: 86% say that every minute of unplanned downtime is a minute they risk losing customers, perhaps permanently. Beyond the obvious financial fallout, there are many other sources of executive anxiety as well:

• Regulatory retaliation: 44% overall (85% in EMEA) are worried about the regulatory fines and penalties that can be levied when an outage occurs, which can go as high as €5,000,000 (approximately US$5.5 million), and about the risk that regulatory authorities can limit a non-compliant entity’s ability to conduct business until they comply.

• Data disaster: More than one in three leaders (36%) worry about data getting corrupted, or even completely lost or destroyed, during unplanned downtime. Data loss haunts SMB startups and venerable global enterprises alike: this percentage is remarkably consistent regardless of company size, longevity, revenue, or location.

• Potential punishment: 82% of leaders acknowledge that their teams are worried technical staff will be fired for unplanned downtime or outages. These numbers are higher still at companies experiencing more frequent outages (several times per month or more, 88%), or where the average outage lasts a significant amount of time (two hours or more, 88%).

• Career concerns: Almost half (44%; 49% in NA) worry that unplanned downtime erodes internal trust in their technical teams and leadership. This lack of confidence from the C-suite, in turn, impedes their ability to get crucial buy-in for investing in infrastructure improvements.

1.4 Causes of downtime
The breathtaking technical complexity of modern applications means that a multitude of
faults can act as the root problem behind any outage or unplanned downtime — and that
the cause can lie both inside and outside of the afflicted organization.

Network issues 38%

Software issues 36%

Cyberattacks 36%

Cloud service provider reliability 35%

Third-party service failures 33%

Environmental factors 31%

Human error 31%

Capacity issues 30%

Hardware failures 30%

Interestingly, the frequency of occurrence of the causes behind the outages respondents have experienced was remarkably
consistent across all cohorts examined in this survey, regardless of company revenue, size, longevity, or location.

This is good news, actually. When the root causes of downtime strike so similarly and consistently across the board, the
solutions and strategies for defeating downtime and improving operational resilience will be equally consistent for
every organization, everywhere.

PART 2

Enterprise disaster
readiness
The current state of operational resilience
challenges and response strategies
Are organizations more proactive or reactive when
an outage happens? How are they working to prevent
unplanned downtime in the first place?

2.1 The state of outage response strategies (or lack thereof)
Globally, 95% of the executives surveyed say they are aware of at least one existing,
unresolved operational weakness within their technical estate that puts their
organization at risk.

Nearly three-quarters (72%) say their organization has multiple operational weaknesses
that impact their organization’s ability to meet business and technical objectives.

Proactive vs. reactive organizations
39% of executives describe their outage handling as “reactive.” They respond to outages as they occur, with no formal protocols or response planning in place. Larger enterprises (1,000+ employees), though, are significantly more likely (49%) than smaller organizations and startups to have formal protocols and preventive measures like continuous monitoring in place to minimize unplanned downtime and service disruption.

Despite acknowledging the inevitability of unplanned downtime, these reactive organizations say they rely on a few key people to respond to and manage outages when they happen: essentially, a single-point-of-failure strategy.

Outage response strategies
“Controlled” chaos: Regardless of whether they are proactive or reactive, most organizations (57%) report being only “moderately organized” when reacting to unplanned downtime or outages, and they recognize there are gaps in protocols that need to be addressed.

Reinventing the (broken) wheel: 10% of organizations describe their response to unplanned downtime or outages as “chaotic.” For them, it’s a scramble to determine how best to address each new problem as it happens.

2.2 The challenges tech leaders report facing when tasked
with improving operational resilience

94% of leaders worldwide admit that the CrowdStrike outage has forced them to reevaluate
their business continuity strategies.

While leaders are saying almost unanimously that the global fallout from the CrowdStrike
outage has catalyzed their organizations to get serious about implementing operational
resilience strategies, some are worried that their company may not be able to change course.

Higher priority given to other teams’ issues 38%

Budget constraints limiting our ability to address vulnerabilities 36%

Complexity of existing systems making fixes challenging 34%

Lack of comprehensive training for the team 34%

Insufficient staffing to tackle issues 33%

Lack of leadership buy-in for necessary changes 32%

Outdated or inadequate monitoring tools 30%

Too many vulnerabilities to manage 29%

Outdated or inadequate technological resources 29%

A new sense of urgency
The call is coming from inside the house. Leaders worldwide say that fallout from the CrowdStrike outage has increased their organization’s sense of urgency: overall, almost half (46%) say they are “significantly” improving planning, while another 48% are making “some” improvements.

But will anyone answer the call? Survey participants expressed mixed feelings about their organizations’ ability to change course.

A full 50% believe their organization will increase their preparedness (and decrease their outage response times) over the next 12 months. Among the remaining half, 38% of all respondents predict they will maintain the status quo. The situation is more dire for the 12% of executives that fear they will fall even further behind the operational readiness curve.

The factors blocking progress
Survey respondents say there can be significant obstacles and challenges in the way of improving their organization’s resilience. What is blocking their progress?

• Competing priorities: The top reason that recognized vulnerabilities have not been addressed is that other teams’ needs get prioritized instead (38%).

• Financial constraints: 36% report that budgetary challenges limit their ability to address vulnerabilities.

• Lack of leadership confidence: Those in charge may not be taking the issues as seriously as technical executives would hope: lack of executive leadership buy-in for their proposed solutions is a roadblock for 32%.

PART 3

The road to
resilience
What tactics do companies use to prevent
unplanned downtime and service outages, and
what are their strategies for improving resilience?

3.1 How organizations report they are “doing” operational
resilience now

Virtually all organizations (100%) conduct at least some sort of resiliency testing to mitigate
unplanned downtime or outages.

Security vulnerability assessments 42%

Regular system backups and restorations 38%

Load and stress testing 38%

Incident response drills 37%

Scalability testing 37%

Disaster recovery simulations 33%

Notification system testing 33%

Penetration (Pen) testing 30%

Failover testing 29%

Other 0%

My organization does not conduct any resilience tests to mitigate unplanned downtime or outages 0%

Testing, testing
The tests conducted most often are security vulnerability assessments (42%), which help organizations protect their systems and data from unauthorized access and data breaches.

Other proactive assessments include regular system backups and restorations (38%) and load and stress testing (38%) to see how a system would respond under extreme conditions.

Unforced errors
Some things are just basic best practices. But a surprising number of organizations admit they don’t always practice good technical hygiene when it comes to resilience fundamentals:

A shocking 62% of organizations in this survey fail to do regular system backups and restoration exercises.

71% do no failover testing to ensure their outage prevention protocols are working.

3.2 Where organizations are targeting future investments
to increase their resilience
Overall, results show that organizations believe that no one area of investment is the
most important for improving their operational resilience. These numbers are consistent
across companies worldwide, regardless of whether they are long-established
enterprises or smaller recent startups.

Automation and AI-driven solutions 49%

Hardware (physical resources) 49%

Cloud infrastructure and services 49%

Skills (capabilities) 46%

People (human capital) 41%

Disaster recovery and business planning 38%

Other 1%

None of these -

A multi-threaded path to prevention
Instead of focusing on a single approach, executives report pursuing a multi-threaded strategy of targeted investments in a variety of areas. Overall there is consistent agreement on the technologies that will benefit their continuity:

• 49% believe they need to make more investments in their hardware to improve resiliency.

• 49% also feel automation and AI-driven solutions would be valuable.

• 49% believe their cloud infrastructure and services need additional investments.

Mind over machine?
In section 1.4, we saw that 31% of companies report experiencing an unplanned service outage due to human error.

People power for outage prevention: Overall, 46% of respondents believe they should be investing in training and skills within their organization. 41% are also looking to hire new headcount for positions to directly support operational resilience initiatives.

Automation through AI: Larger companies are more likely to be looking at AI solutions, especially if they are in the US or India, both locations where 57% of respondents indicate they are actively investigating AI-driven automation to increase the overall reliability of their systems.

A confluence of factors: Companies that say they are both (1) proactive in outage prevention and (2) committed to improving their operational resilience are the most likely to be investing in both human and artificial intelligence. This is a small cohort within the overall 1,000 leaders surveyed, but within this granular demographic respondents are significantly above average in seeking to increase their internal training (72%) and seeking automation and AI support to bolster their system reliability (68%).

PART 4

Operational
resilience and
regulatory risks
79% of technology leaders describe their org as “not
completely prepared” to comply with new regulations
regarding unplanned downtime and outages

It’s no secret that operational resilience has been moving up the list of regulatory
priorities for governments around the globe. Why? The probability of disruption —
i.e., outages — and the potential impacts from those disruptions are both steadily
increasing. Modern application architecture, reliant on digital technologies and
third-party platforms and services, has simultaneously expanded the threat surface
for disruptions. With no reason to believe any of these trendlines will reverse soon,
regulatory authorities are stepping in to act.

The very real regulatory costs of downtime
Operational resilience is an increasing target of government legislation and regulatory efforts like DORA (the EU’s Digital Operational Resilience Act) and the security-focused NIS2 Directive, and technology leaders are worried. DORA specifies the technical requirements aimed at ensuring operational resilience for financial institutions and critical services like utilities, logistics platforms, and healthcare. Similarly, the NIS2 Directive, a security-focused set of operational resilience requirements, spells out how organizations must implement technical, operational, and organizational measures to manage cybersecurity risks.

NIS2 became law for all companies operating in the EU as of October 2024; DORA regulations take full effect on January 17, 2025. Though these laws originate in the European Union, it’s important to understand that DORA and NIS2 cover any company operating in the EU in any form, even if they are headquartered outside of EU borders.

No matter where they’re based, it’s not just financial services enterprises and those in other named sectors that need to worry. The regulations also cover third-party ICT (Information & Communications Technology) organizations: namely, digital and data services providers. Companies like CrowdStrike, for example, as well as cloud service providers, database companies, and other SaaS providers all face the same fines and penalties when they experience an outage.

These fines are potentially fierce: under DORA, financial entities and third-party ICT providers that are considered critical by the European Supervisory Authorities (ESAs) can be fined up to €5,000,000 or 1% of their global annual revenue, whichever is higher. NIS2 is even more punitive: the directive requires EU Member States to provide for maximum fines of at least €7,000,000 or 1.4% of global annual revenue, whichever is higher.

Beyond fines, other DORA/NIS2 outage-related penalties are also potentially severe. These include operational restrictions, where regulatory authorities can limit a non-compliant organization’s ability to conduct business until they comply. EU member states can impose additional penalties, such as audits, suspensions, cease-and-desist orders, and public notices.

Even if an event escapes the widespread havoc of the CrowdStrike global outage, publicly documented DORA violations can still damage a company’s brand and reputation. Beyond any customers lost as a consequence of an unplanned downtime incident, companies found to be in violation risk losing future customers and investor confidence.
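
The “whichever is higher” rule quoted for DORA and NIS2 fines can be sketched as a small calculation. This is only an illustration using the caps and percentages as stated in this report. The function name, the basis-point representation, and the sample revenue figures are assumptions made for the example, and actual penalties depend on the regulator and the specifics of each case.

```python
def max_fine_eur(global_annual_revenue_eur, fixed_cap_eur, revenue_share_bps):
    """Return the larger of a fixed cap and a share of global annual revenue.

    revenue_share_bps is the share in basis points (100 bps = 1%), which
    keeps the arithmetic in whole euros.
    """
    revenue_based = global_annual_revenue_eur * revenue_share_bps // 10_000
    return max(fixed_cap_eur, revenue_based)


# (fixed cap in EUR, share of revenue in basis points), as quoted in this report
DORA = (5_000_000, 100)   # EUR 5,000,000 or 1%
NIS2 = (7_000_000, 140)   # EUR 7,000,000 or 1.4%

# Hypothetical companies, for illustration only
for revenue in (100_000_000, 2_000_000_000):
    dora = max_fine_eur(revenue, *DORA)
    nis2 = max_fine_eur(revenue, *NIS2)
    print(f"Revenue EUR {revenue:,}: DORA up to EUR {dora:,}, NIS2 up to EUR {nis2:,}")
```

For smaller companies the fixed cap dominates; once revenue passes EUR 500 million, the percentage-based figure becomes the larger of the two under both sets of numbers.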

Technical leaders are worried. Very worried.
And for good reason. This survey, the first to examine enterprise readiness and concerns regarding regulatory compliance as these laws come into full effect, shows that many simply are not ready.

79% of executives overall describe their organization as “not completely prepared” to comply with new regulations regarding unplanned downtime and outages, while 44% say they have “significant concern” over their organization’s abilities to comply with operational resilience regulations.

Separately, 60% worry specifically about data loss or corruption occurring during an outage that could result in significant fines and penalties, not just under DORA and NIS2 but also under GDPR and other data privacy regulations that impose penalties for data-related noncompliance.

There is, fortunately, technology that can help enterprises protect themselves. Distributed systems, and particularly distributed SQL databases, significantly contribute to improving organizational resilience by mitigating the technical risks and weaknesses that lead to outages.

See the next section of this report (“How distributed SQL helps organizations achieve operational resilience”) for information on how a distributed SQL database in particular can help organizations align with operational resilience regulations and achieve their overall operational resilience goals.
How to get in compliance with operational resilience regulatory

DORA and NIS2 focus on mitigating the risks associated with disruption in essential or important digital services. GDPR and the alphabet soup of other current global data privacy laws (CCPA, PIPEDA, POPI, LGPD, HIPAA, PCI-DSS, to name but a few) focus on data integrity as an important component of overall operational resilience. Companies found to be out of compliance with these regulations can experience real and lasting consequences, both financial and operational, and no enterprise is immune.

It’s not just the EU’s DORA; there are many similar regulations on the horizon, from governments around the globe attempting to address operational risks, resilience, and business continuity. In the UK and Australia, new operational resilience requirements take effect in 2025. The governments of Hong Kong and Singapore are actively drafting new resilience regulations, while in the US representatives from the Federal Reserve, Treasury Department, and other agencies are currently developing a set of operational resilience standards as a blueprint for future legislation.

When “guaranteed uptime” is still downtime

A significant, and also very tactical, component of operational resilience lies in the service level agreements (SLAs) from each vendor regarding availability for services, platforms, or databases in your application stack. Availability SLAs state the percentage of time that the cloud platform, database, or other service is operational. The goal is 100%, but even large and critical systems (such as the VISA card payments network or Amazon Web Services) don’t promise 100% availability. This is because when a modern cloud application processes thousands, or even tens of thousands, of queries per second, adverse events will inevitably occur. In the real world, the gold standard SLA is known as “five nines,” or 99.999% uptime.

Results from this survey show that companies, on average across all organization sizes and sectors, experience 86 outages per year. The average outage lasts 196 minutes. This equates to an average of roughly 280 hours of downtime per year, or about 96.8% availability. This sounds like a lot of downtime, but it is actually within the limits of a service level agreement (SLA) guaranteeing 96.75% uptime (about 282 hours per year), and only slightly over the roughly 261 hours allowed at 97%.

Availability SLA    Downtime per year
99.999%             5 minutes, 18 seconds
99.990%             52 minutes, 9.8 seconds
99.900%             8 hours, 41 minutes, 38 seconds
99.000%             3 days, 14 hours, 56 minutes, 18 seconds (87 hours)
98.000%             7 days, 5 hours, 52 minutes, 35 seconds (174 hours)
97.000%             10 days, 20 hours, 48 minutes, 53 seconds (261 hours)
96.750%             11 days, 18 hours, 32 minutes, 57 seconds (282 hours)
96.000%             14 days, 11 hours, 45 minutes, 10 seconds (348 hours)
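
The relationship between an availability SLA and its yearly downtime budget can be checked with a few lines of arithmetic. The following is a minimal sketch, not the calculator used for this report: the function names are illustrative, and it assumes a 365-day year, so its results may differ by a few minutes from the figures in the table. It also converts the survey averages quoted in this report (86 outages per year, 196 minutes each) into an implied availability.

```python
# Assumption: a 365-day year. The calculator cited in this report may use a
# slightly different year length, so budgets can differ by a few minutes.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(sla_percent: float) -> float:
    """Yearly downtime budget, in hours, permitted by an availability SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def observed_downtime_hours(outages_per_year: int, avg_outage_minutes: float) -> float:
    """Total yearly downtime implied by an outage count and average duration."""
    return outages_per_year * avg_outage_minutes / 60


def implied_availability(downtime_hours: float) -> float:
    """Availability percentage corresponding to a given yearly downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


# Survey averages quoted in this report: 86 outages per year, 196 minutes each.
survey_hours = observed_downtime_hours(86, 196)
survey_availability = implied_availability(survey_hours)

print(f"Observed downtime: {survey_hours:.1f} hours per year")
print(f"Implied availability: {survey_availability:.2f}%")

# Compare the survey average against the budget of several SLA levels.
for sla in (99.999, 99.99, 99.9, 99.0, 97.0, 96.75):
    budget = allowed_downtime_hours(sla)
    verdict = "within" if survey_hours <= budget else "exceeds"
    print(f"{sla}% SLA allows {budget:.2f} h/yr: survey average {verdict} budget")
```

Under these assumptions, the survey average lands between the 97% and 96.75% budgets, which is why the table includes a 96.75% row.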

A key tactic for improving operational resilience is, naturally, choosing solutions that offer the highest possible uptime.³

Using database-as-a-service (DBaaS) SLAs as an example, only two distributed SQL databases currently offer 99.999% uptime (5 minutes, 18 seconds per year): Google Spanner and CockroachDB. Both are globally scalable, synchronously replicated, cloud-native distributed relational databases that offer extremely high uptime SLAs out of the box.

However, only one of them is truly fit for optimal operational resilience.

Spanner can only run on Google Cloud, which introduces a single point of failure for an application. A serious Google Cloud outage will take the database down along with it, even if the application itself is hosted on another cloud provider such as AWS or Azure. CockroachDB, on the other hand, is cloud agnostic and can be deployed on any, or even all, of the major cloud providers, as well as self-hosted on premises and in hybrid cloud/on-prem installations.

3 Calculation source: uptime.is SLA and Uptime Calculator

PART 5

Key takeaways and emerging enterprise operational resilience strategies

Priorities for changes to be made, and strategies organizations are considering

In 2025, ensuring operational resilience is no longer optional. It’s imperative.

The cost of outages is high for companies. For 86% of executives, every minute of unplanned downtime is a minute they could
lose a customer. Across all geographies, company sizes, and industries, the average annual outage-related revenue loss was
US$222,323 over the past twelve months.

The findings from this survey of 1,000 global enterprise technology leaders highlight critical gaps in the operational
resilience of organizations’ digital estates. Despite near-universal recognition of the
importance of uptime, many organizations remain dangerously vulnerable to the risks associated with unplanned downtime
and outages. And, as new regulations like DORA and NIS2 come into effect (with others on the horizon), ensuring compliance
and resilience has gone from something companies should do to something they must do — or face a very real risk of serious
operational, financial, and regulatory consequences.

Key takeaways from this report emphasize both the urgency and complexity of addressing these challenges:

High frequency and cost of outages: A majority of enterprises face frequent outages, with 69%
reporting service interruptions once or more per week, or even every day. These events are costly:
overall, 32% report financial losses of US$100,000 or more annually. But the damage is more than
dollars down the drain: 100% of organizations in this survey say that outages have had significant
negative impacts on their technical teams and staff. Frequent outages can also cause lasting reputational
damage, costing companies customers as well as the confidence of the C-suite and external investors.

Regulatory pressure is increasing: With 79% of organizations admitting they are not fully prepared
to comply with operational resilience regulations, many executives are losing sleep over the potential
financial penalties. This situation underscores the pressing need for improved resilience strategies
to meet regulations like DORA and NIS2, as well as the emerging legislative actions currently under
consideration by governments around the world.

Automation, AI, and cloud investments: Nearly half (49%) of respondents see investment
in automation, AI, and cloud infrastructure as a critical step toward enhancing resilience. These
technologies can help organizations improve system reliability, reduce human error, and recover more
quickly from outages.

Emerging enterprise strategies for improved resilience

In response to these challenges, enterprises are adopting a range of strategies to bolster their resilience:

Embracing AI-driven automation: Automation through AI is becoming a top priority for many enterprises as it offers proactive solutions for detecting and addressing operational risks before they cause downtime. AI-driven monitoring and self-healing systems reduce the need for human intervention during outages.

Strengthening regulatory compliance: Organizations are recognizing the need for more robust disaster recovery and compliance protocols to meet the stringent requirements of regulations like DORA and NIS2. This includes investing in secure, compliant data architectures that support data sovereignty across multiple regions.

Adopting distributed systems: Enterprises are increasingly investing in distributed systems to ensure uninterrupted service even during system failures. These systems are designed to handle failures automatically by keeping services available and by replicating data across multiple nodes and geographic locations, thereby shrinking the threat surface for potential service disruptions and outages.

Downtime is inevitable. These strategies reflect a forward-looking approach in which technology and
governance go hand in hand to secure the operational resilience of modern enterprises.

By aligning their technical estates with distributed architecture, automation, and regulatory requirements, companies can
improve their uptime and their ability to adapt to an increasingly complex and regulated digital environment — while putting
less stress on their technical teams and the business’s bottom line.

PART 6

How distributed SQL helps organizations achieve operational resilience

Applying data and results from this report demonstrates how a distributed SQL database can help enterprises increase their organizational resilience by mitigating the technical risks and weaknesses that lead to outages.

Here are key ways distributed SQL databases help organizations improve their operational resilience and align with regulatory requirements, current or future:

Minimizing unplanned downtime: According to the survey, unplanned downtime is a major
concern, with 93% of tech executives losing sleep over the financial impact of outages. Meanwhile, more
than half of the financial services companies and banks (and 60% of ecommerce companies) surveyed
report experiencing service interruptions weekly or even more frequently.

Planned downtime is still downtime: Beyond preventing outages, organizations dealing with frequent
planned downtime due to software upgrades or application changes can recapture that uptime with the
capabilities available in some distributed SQL solutions. For example, CockroachDB offers online updates
and live schema changes for zero downtime of any kind in production.

Distributed SQL databases provide high availability via a fault-tolerant design and active-active replication
architecture. They include disaster recovery capabilities to keep any downtime that does occur to
the absolute minimum defined by an organization’s RPO and RTO goals — and operational resilience
regulatory requirements.
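
RPO and RTO goals can be made concrete with a small calculation. The sketch below is illustrative only and is not drawn from the survey or from any product: the `Incident` record, its field names, and the example timestamps and targets are all hypothetical. It treats the recovery point as the gap between the last durably stored write and the start of the outage, and the recovery time as the gap between the start of the outage and the restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one outage, used only for this illustration."""
    last_durable_write: datetime  # last write known to be safely persisted
    outage_start: datetime        # moment the service became unavailable
    service_restored: datetime    # moment the service was available again


def data_loss_window(incident: Incident) -> timedelta:
    """Span of writes at risk: compared against the RPO."""
    return incident.outage_start - incident.last_durable_write


def recovery_time(incident: Incident) -> timedelta:
    """Time taken to restore service: compared against the RTO."""
    return incident.service_restored - incident.outage_start


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the recovery point and recovery time targets are met."""
    return data_loss_window(incident) <= rpo and recovery_time(incident) <= rto


# Made-up example: 30 seconds of writes at risk, 12 minutes to restore service.
example = Incident(
    last_durable_write=datetime(2025, 1, 20, 9, 59, 30),
    outage_start=datetime(2025, 1, 20, 10, 0, 0),
    service_restored=datetime(2025, 1, 20, 10, 12, 0),
)

print(meets_objectives(example, rpo=timedelta(minutes=1), rto=timedelta(minutes=15)))
print(meets_objectives(example, rpo=timedelta(minutes=1), rto=timedelta(minutes=10)))
```

The same incident passes a 15-minute RTO and fails a 10-minute one, which shows why the targets themselves, and not only the technology, determine whether a given outage counts as acceptable.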

Automated failover and disaster recovery: Operational continuity — keeping systems and
services available and online, even in adverse events — is a core tenet of operational resilience. Because
machines can almost always react faster than humans, intelligent automation of systems is critical to
optimizing resilience. Recent regulatory requirements for business continuity during system failures or
cyber incidents are rooted in the fact that many current databases do not offer sophisticated capabilities
for automating resilience.

Distributed SQL systems like CockroachDB automatically replicate data across geographic locations to
provide continuity, even in the face of entire-region outages. If a node fails, the system automatically
routes traffic to healthy nodes, ensuring continuity without manual intervention. Other cross-cluster
replication capabilities can be used to achieve a low RPO/RTO depending on deployment topology and
latency requirements.
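
The routing behavior described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not how CockroachDB or any other database implements failover: the `Node` class, the `route` function, and the node and region names are hypothetical, and node health is a simple flag rather than the result of real health checks.

```python
from dataclasses import dataclass, replace


@dataclass
class Node:
    """Hypothetical database node; 'healthy' stands in for a real health check."""
    name: str
    region: str
    healthy: bool = True


def route(nodes: list[Node], preferred_region: str) -> Node:
    """Pick a healthy node, preferring the caller's region when one is available."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    local = [n for n in healthy if n.region == preferred_region]
    # Fall back to any healthy node when the preferred region is down.
    return (local or healthy)[0]


# Made-up three-region cluster.
cluster = [
    Node("node-1", "eu-west"),
    Node("node-2", "us-east"),
    Node("node-3", "ap-south"),
]

# Normal operation: traffic stays in the preferred region.
print(route(cluster, "eu-west").name)

# Simulate a regional outage on a copy of the cluster: eu-west is marked unhealthy.
degraded = [replace(n, healthy=(n.region != "eu-west")) for n in cluster]
print(route(degraded, "eu-west").name)
```

In this model the caller never needs to know which node failed; the selection logic simply skips unhealthy nodes, which is the property that removes the need for manual intervention.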

Data integrity and compliance: Data integrity is a second core tenet of operational resilience —
and 36% of executives in this survey say they worry about outage-related data loss and corruption.
Distributed databases allow enterprises to maintain control over where specific data resides, helping
to meet the data residency requirements of data privacy laws such as GDPR while also optimizing
availability. For instance, CockroachDB supports features that enforce data sovereignty by keeping
specific data on servers in designated geographic locations while simultaneously enabling high
availability across regions. In addition, CockroachDB supports data encryption in flight and at rest to
enhance security, and authentication and authorization mechanisms to control who can access what information.

Faster recovery times: The survey notes that 57% of organizations are only moderately organized in
their response to unplanned downtime and in meeting their RPO/RTO goals. Distributed SQL databases
offer built-in mechanisms for faster recovery times, and the most mature ones, like CockroachDB, fully
automate these mechanisms. These capabilities align with DORA’s requirement for prompt incident
management and recovery protocols.

By minimizing downtime, enabling quick recovery, assisting with compliance with data residency laws, and
enhancing overall operational resilience, distributed systems and distributed SQL databases can help enterprises meet
DORA requirements. They can also help the 79% of executives who admit their organizations risk punitive fines and
other penalties because they are simply not prepared to comply with operational resilience regulations.

cockroachlabs.com
