
Data Protection and Disaster Recovery
Nutanix Best Practices

Version 4.1 • February 2019 • BP-2005



Copyright
Copyright 2019 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws.
Nutanix is a trademark of Nutanix, Inc. in the United States and/or other jurisdictions. All other
marks and names mentioned herein may be trademarks of their respective companies.


Contents

1. Executive Summary.................................................................................6

2. Introduction.............................................................................................. 7
2.1. Audience.........................................................................................................................7
2.2. Purpose.......................................................................................................................... 7

3. Nutanix Enterprise Cloud Overview...................................................... 9


3.1. Nutanix Acropolis Architecture.....................................................................................10
3.2. Nutanix Xi Cloud Services........................................................................................... 10

4. Web-Scale Data Protection................................................................... 11

5. Deployment Overview........................................................................... 13
5.1. Native Nutanix Snapshots............................................................................................13
5.2. Two-Way Mirroring....................................................................................................... 13
5.3. Many-to-One.................................................................................................................13
5.4. To the Cloud.................................................................................................................14
5.5. Single-Node Backup.................................................................................................... 14
5.6. Leap: DR Orchestration............................................................................................... 15

6. Local Backup with Snapshots............................................................. 16


6.1. Native Snapshots......................................................................................................... 16
6.2. Crash-Consistent vs. Application-Consistent Snapshots............................................. 16

7. Protection Domains...............................................................................19
7.1. Consistency Groups ................................................................................................... 19

8. Backup and Disaster Recovery on Remote Sites...............................20


8.1. Remote Site Setup.......................................................................................................20
8.2. Scheduling Full Snapshots and Asynchronous Replication.........................................25
8.3. Scheduling LWS and Near-Sync Replication...............................................................27


8.4. Cross-Hypervisor Disaster Recovery........................................................................... 28


8.5. Single-Node Backup Target......................................................................................... 28
8.6. Cloud Connect............................................................................................................. 29

9. Leap: Disaster Recovery Orchestration.............................................. 30


9.1. Required Infrastructure for DR Orchestration.............................................................. 30
9.2. Availability Zones......................................................................................................... 30
9.3. Protection Policies........................................................................................................31
9.4. Recovery Plans............................................................................................................ 31

10. Sizing Space.........................................................................................35


10.1. Full Local Snapshots................................................................................................. 35
10.2. Asynchronous Replication..........................................................................................36
10.3. Lightweight Snapshots (LWS).................................................................................... 37
10.4. Near-Sync Replication............................................................................................... 39

11. Bandwidth............................................................................................. 40
11.1. Seeding.......................................................................................................................40

12. Failover: Migrate vs. Activate.............................................................42


12.1. Protection Domain Cleanup....................................................................................... 43

13. Self-Service File Restore.................................................................... 44

14. Third-Party Backup Products............................................................. 45

15. Conclusion............................................................................................46

Appendix..........................................................................................................................47
Best Practices Checklist......................................................................................................47
PowerShell Scripts.............................................................................................................. 52
About the Author................................................................................................................. 57
About Nutanix...................................................................................................................... 57

List of Figures................................................................................................................ 58


List of Tables.................................................................................................................. 59


1. Executive Summary
The Nutanix Enterprise Cloud is a hyperconverged infrastructure system delivering storage,
compute, and virtualization services for any application. Designed to support multiple
virtualized environments, including Nutanix AHV, VMware ESXi, Microsoft Hyper-V, and Citrix
Hypervisor, Nutanix invisible infrastructure is exceptionally robust and provides many ways to
achieve your required recovery point objectives (RPOs).
Enterprises are increasingly vulnerable to data loss and downtime during disasters as they come
to rely on virtualized applications and infrastructure that their legacy data protection and disaster
recovery (DR) solutions can no longer adequately support. This best practices guide discusses
the optimal configuration for achieving data protection using the DR capabilities integrated
into Acropolis and the Leap DR orchestration features available both on-premises and in Xi.
Whatever your use case, you can protect your applications with drag-and-drop functionality.
The Nutanix Prism UI streamlines management so you can configure the shortest recovery time
objectives (RTOs) possible and build out complex DR workflows at a moment's
notice. With Leap built in, Prism Central allows you to apply protection policies across all of
your managed clusters. Once the business has decided on the required RPO, you can activate
recovery plans to validate, test, migrate, and fail over in a seamless fashion. Recovery plans can
protect availability zones both on-premises and hosted in Xi.
As application requirements change and grow, Nutanix can easily adapt to business needs.
Nutanix is uniquely positioned to protect and operate in environments with minimal administrative
effort because of its web-scale architecture and commitment to enterprise cloud operations.


2. Introduction

2.1. Audience
We intend this guide for IT administrators and architects who want more information about the
data protection and disaster recovery features built into the Nutanix Enterprise Cloud. Consumers
of this document should have basic familiarity with Acropolis.

2.2. Purpose
This document provides best practice guidance for data protection solutions implementation on
Nutanix servers running Acropolis 5.10. We present the following concepts:
• Scalable metadata.
• Backup.
• Crash-consistent versus application-consistent snapshots.
• Protection domains.
• Protection policies.
• Recovery plans.
• Scheduling snapshots and asynchronous replication.
• Sizing disk space for local snapshots and replication.
• Scheduling lightweight snapshots (LWS) and near-sync replication.
• Sizing disk space for LWS and near-sync replication.
• Determining bandwidth requirements.
• File-level restore.

Table 1: Document Version History

Version Number | Published      | Notes
1.0            | December 2014  | Original publication.
2.0            | March 2016     | Updated recommendations for current best practices throughout.
2.1            | June 2016      | Updated Backup and Disaster Recovery on Remote Sites section.
2.2            | July 2016      | Updated bandwidth sizing information.
2.3            | December 2016  | Updated for AOS 5.0.
2.4            | May 2017       | Updated information on sizing SSD space on a remote cluster.
3.0            | December 2017  | Updated for AOS 5.5.
3.1            | September 2018 | Updated overview and Remote Site Setup section.
4.0            | December 2018  | Updated for AOS 5.10 and Xi Leap.
4.1            | February 2019  | Updated Sizing Space section and Leap product details.


3. Nutanix Enterprise Cloud Overview


Nutanix delivers a web-scale, hyperconverged infrastructure solution purpose-built for
virtualization and cloud environments. This solution brings the scale, resilience, and economic
benefits of web-scale architecture to the enterprise through the Nutanix Enterprise Cloud
Platform, which combines three product families—Nutanix Acropolis, Nutanix Prism, and Nutanix
Calm.
Attributes of this Enterprise Cloud OS include:
• Optimized for storage and compute resources.
• Machine learning to plan for and adapt to changing conditions automatically.
• Self-healing to tolerate and adjust to component failures.
• API-based automation and rich analytics.
• Simplified one-click upgrade.
• Native file services for user and application data.
• Native backup and disaster recovery solutions.
• Powerful and feature-rich virtualization.
• Flexible software-defined networking for visualization, automation, and security.
• Cloud automation and life cycle management.
Nutanix Acropolis provides data services and can be broken down into three foundational
components: the Distributed Storage Fabric (DSF), the App Mobility Fabric (AMF), and AHV.
Prism furnishes one-click infrastructure management for virtual environments running on
Acropolis. Acropolis is hypervisor agnostic, supporting three third-party hypervisors—ESXi,
Hyper-V, and Citrix Hypervisor—in addition to the native Nutanix hypervisor, AHV.

Figure 1: Nutanix Enterprise Cloud


3.1. Nutanix Acropolis Architecture


Acropolis does not rely on traditional SAN or NAS storage or expensive storage network
interconnects. It combines highly dense storage and server compute (CPU and RAM) into a
single platform building block. Each building block delivers a unified, scale-out, shared-nothing
architecture with no single points of failure.
The Nutanix solution requires no SAN constructs, such as LUNs, RAID groups, or expensive
storage switches. All storage management is VM-centric, and I/O is optimized at the VM virtual
disk level. The software solution runs on nodes from a variety of manufacturers that are either
all-flash for optimal performance, or a hybrid combination of SSD and HDD that provides a
combination of performance and additional capacity. The DSF automatically tiers data across the
cluster to different classes of storage devices using intelligent data placement algorithms. For
best performance, algorithms make sure the most frequently used data is available in memory or
in flash on the node local to the VM.

3.2. Nutanix Xi Cloud Services


Xi Cloud Services offers a native extension to the Nutanix Enterprise Cloud, delivering an
integrated public cloud environment that customers can instantly provision and automatically
configure. The first service available in Xi Cloud Services, Xi Leap, provides disaster recovery
as a service (DRaaS). Leap rapidly and intelligently protects the applications and data in your
Nutanix environment without the need to purchase and maintain a separate infrastructure stack.
To learn more about the Nutanix Enterprise Cloud, please visit the Nutanix Bible and
Nutanix.com.


4. Web-Scale Data Protection


One of the key architectural differentiators for Nutanix is the ability to scale. Nutanix is not
bound by the limitations that dual-controller architectures, or federations relying on special
hardware like NVRAM or custom ASICs for performance, may face. When it comes to
snapshots and disaster recovery, scaling metadata becomes a key part of delivering performance
while ensuring availability and reliability. Each Nutanix node is responsible for a subset of the
overall platform’s metadata. All nodes in the cluster serve and manipulate metadata entirely
through software, eliminating traditional bottlenecks.
As each node has its own virtual storage controller and access to local metadata, replication
scales along with the system. Every node participates in replication to reduce hotspots
throughout the cluster.
Nutanix uses two different forms of snapshots: full snapshots for asynchronous replication (when
the RPO is 60 minutes or greater), and lightweight snapshots (LWS) for near-sync replication
(when the RPO is between 15 minutes and 1 minute). Full snapshots keep system resource
usage low even when you retain many snapshots over an extended
period of time. The LWS feature reduces metadata management overhead and increases storage
performance by decreasing the high number of storage I/O operations that long snapshot chains
can cause.

Figure 2: Scalable Replication as You Scale

In asynchronous replication, every node can replicate four files, up to an aggregate of 100 MB/
s at one time. Thus, in a four-node configuration, the cluster can replicate 400 MB/s or 3.2 Gb/s.
As you grow the cluster, the virtual storage controllers keep replication traffic distributed. In many-
to-one deployments, as when remote branch offices communicate with a main datacenter, the

main datacenter can use all its available resources to handle increased replication load from the
branch offices. When the main site is scalable and reliable, administrators don't have multiple
replication targets to maintain, monitor, and manage. You can protect both VMs and volume
groups with asynchronous replication.
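The aggregate figure above scales linearly with node count. As an illustration only, a minimal PowerShell sketch (the node count is a hypothetical value):

# Hypothetical sketch: aggregate asynchronous replication throughput,
# based on the 100 MB/s-per-node figure cited above.
$nodeCount = 8                          # assumed cluster size
$clusterMBps = $nodeCount * 100         # MB/s
$clusterGbps = $clusterMBps * 8 / 1000  # convert MB/s to Gb/s
"{0} nodes -> {1} MB/s (~{2} Gb/s) of aggregate replication throughput" -f $nodeCount, $clusterMBps, $clusterGbps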
Near-sync replication offers unbound throughput. Because all writes go to SSD, we want to make
sure that the performance tier does not fill up. Near-sync replication, which covers both VMs and
volume groups, is supported for bidirectional replication between two clusters.
Nutanix also provides cross-hypervisor disaster recovery natively via asynchronous replication.
Existing vSphere clusters can target AHV-based clusters as their DR and backup targets. Thanks
to true VM mobility, Nutanix customers can place their workloads on the platform that best meets
their needs.


5. Deployment Overview
Nutanix meets real-world requirements with native backup and replication infrastructure and
management features that support a wide variety of enterprise topologies.

5.1. Native Nutanix Snapshots


Per-VM or per-volume group snapshots enable instant recovery. Depending on the workload and
associated SLAs, customers can tune the snapshot schedule and retention periods to meet the
appropriate RPOs. With the intuitive UI snapshot browser, you can perform restore and cloning
operations instantly on the local cluster.

5.2. Two-Way Mirroring


The ability to mirror VM and volume group replication between multiple sites is necessary in
environments where all sites must support active traffic. Consider a two-site example. Site B
is the data protection target for selected workloads running on site A. At the same time, site A
serves as the target for designated workloads running on site B. While asynchronous replication
is supported for all workflows listed in this section, two-way mirroring is the only supported
topology for near-sync.

Figure 3: Two-Way Mirroring

5.3. Many-to-One
In a many-to-one or hub-and-spoke architecture, you can replicate workloads running on sites
A and B, for example, to a central site C. Centralizing replication to a single site may improve
operational efficiency for geographically dispersed environments. Remote and branch offices
(ROBO) are a classic many-to-one topology use case.


Figure 4: Many-to-One Architecture

5.4. To the Cloud


With Cloud Connect, customers can now use the public cloud as a destination for backing up
their on-cluster VMs and volume groups. At this time, Nutanix supports Amazon Web Services
(AWS) as the cloud destination. This option is particularly suitable for customers who do not
have an offsite location for their backups, or who are currently relying on tapes for storing their
backups offsite. Cloud Connect provides customers with backup options for both Hyper-V and
ESXi using the same Prism UI.

Figure 5: Using the Public Cloud as a Backup Destination

5.5. Single-Node Backup


Along with this variety of deployment topologies, Nutanix has added support for single-node
backup as a cost-efficient solution for providing full native backups to branch offices and SMBs.
Using the same underlying disaster recovery technology, the single node can be either on-site or
remote. Nutanix protects data on the node from single drive failure and provides native backup
end to end. Single-node systems can also run VMs for testing or remote branch offices.


5.6. Leap: DR Orchestration


Prism Central provides a single web console for monitoring and managing multiple clusters. From
AOS 5.10 onward, Nutanix adds protection policies and recovery plans to Prism Central, offering
an easy way to orchestrate operations around migrations and unplanned failures. Now you can
apply orchestration policies from a central location, ensuring consistency across all of your sites
and clusters.
To help manage these new protection policies and recovery plans, Nutanix uses a construct
called availability zones. On-premises, an availability zone includes all the Nutanix clusters
managed by one Prism Central. An availability zone can also represent a region in Nutanix
Xi Cloud Services. For DR, availability zones exist in pairs—either on-prem to on-prem or on-
prem to Xi. Once you have paired your on-prem environment to a Xi-based availability zone,
you can take advantage of Xi Leap, Nutanix DR as a service. Multiple Xi Leap subscription plans
are available, so you avoid the up-front cost of buying a secondary cluster while also saving the
time it would take to manage and operate that infrastructure.

Figure 6: On-Prem and Xi-Based Availability Zones


6. Local Backup with Snapshots

6.1. Native Snapshots


Nutanix native snapshots provide production-level data protection without sacrificing
performance. Nutanix utilizes a redirect-on-write algorithm to dramatically improve system
efficiency for snapshots. Native snapshots operate at the VM level, and our crash-consistent
snapshot implementation is the same across hypervisors. Implementation varies for application-
consistent snapshots due to differences in the hypervisor layer. Nutanix can create local backups
and recover data instantly to meet a wide range of data protection requirements.
Best practices:
• All VM files should sit on Nutanix storage. If you make non-Nutanix storage available to store
files (VMDKs), the storage should have the same file path on both the source and destination
clusters.
• Remove all external devices, including ISOs and floppy devices.

6.2. Crash-Consistent vs. Application-Consistent Snapshots


VM snapshots are by default crash-consistent, which means that the vDisks captured are
consistent with a single point in time. The snapshot represents the on-disk data as if the VM
crashed or the power cord was pulled from the server—it doesn’t include anything that was in
memory when the snapshot was taken. Today, most applications can recover well from crash-
consistent snapshots.
Application-consistent snapshots capture the same data as crash-consistent snapshots, with the
addition of all data in memory and all transactions in process. Because of their extra content,
application-consistent snapshots are the most involved and take the longest to perform.
While most organizations find crash-consistent snapshots to be sufficient, Nutanix also supports
application-consistent snapshots. The Nutanix application-consistent snapshot uses Nutanix
Volume Shadow Copy Service (VSS) to quiesce the file system for ESXi and AHV prior to taking
the snapshot. You can configure which type of snapshot each protection domain should maintain.

VSS Support with Nutanix Guest Tools


The Nutanix Guest Tools (NGT) software package for VMs plays an important role in application-
consistent snapshots. ESXi and AHV-based snapshots call the Nutanix VSS provider from the
Nutanix Guest Agent, which is one component of NGT. VMware tools talk to the guest VM’s
VSS writers. Application-consistent snapshots quiesce all I/O, complete all open transactions,
and flush the caches so everything is at the same point. VSS freezes write I/O while the native
Nutanix snapshot takes place, so all data and metadata is written in a consistent manner. Once
the Nutanix snapshot takes place, VSS thaws the system and allows queued writes to occur.
Application-consistent snapshots do not snapshot the OS memory during this process.
Requirements for Nutanix VSS snapshots:
• Configure an external cluster IP address.
• Guest VMs should be able to reach the external cluster IP on port 2074.
• Guest VMs should have an empty IDE CD-ROM for attaching NGT.
• Only available for ESXi and AHV.
• Virtual disks must use the SCSI bus type.
• For near-sync support, NGT version 1.3 must be installed.
• VSS is not supported with near-sync, or if the VM has any delta disks (hypervisor snapshots).
• Only available for these supported versions:
⁃ Windows 7
⁃ Windows Server 2008 R2 and later
⁃ CentOS 6.5 and 7.0
⁃ Red Hat Enterprise Linux (RHEL) 6.5 and 7.0
⁃ Oracle Linux 6.5 and 7.0
⁃ SUSE Linux Enterprise Server (SLES) 11 SP4 and 12
• VSS must be running on the guest VM. Check the PowerShell Scripts section of the Appendix
for a script that verifies whether the service is running.
• The guest VM must support the use of VSS writers. Check the PowerShell Scripts section of
the Appendix for a script that ensures VSS writer stability (Windows only). A minimal sketch of
both checks follows this list.
• VSS is not supported for volume groups.
• You cannot include volume groups in a protection domain that is configured for Metro
Availability.
• You cannot include volume groups in a protected VStore.
• You cannot use Nutanix native snapshots to protect VMs on which VMware fault tolerance is
enabled.
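The appendix contains the full versions of the service and writer checks referenced above. Purely as an illustration, a minimal PowerShell sketch run inside a Windows guest VM could perform both checks (the service name and vssadmin are standard Windows components, not Nutanix-specific):

# Minimal sketch: confirm the Volume Shadow Copy service is running and review writer state.
$vss = Get-Service -Name VSS -ErrorAction SilentlyContinue
if ($null -eq $vss -or $vss.Status -ne 'Running') {
    Write-Output 'VSS service is not running.'
}
else {
    Write-Output 'VSS service is running.'
}

# Writers should report a stable state with no last error (requires an elevated prompt).
vssadmin list writers | Select-String -Pattern 'Writer name', 'State', 'Last error'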
For ESXi, if you haven't installed NGT, the process falls back to VMware Tools. Because the
VMware Tools method creates and deletes an ESXi-based snapshot whenever it creates a native
Nutanix snapshot, it generates more I/O stress. To eliminate this stress, we strongly recommend
installing NGT.
Best practices:
• Schedule application-consistent snapshots during off-peak hours. NGT takes less time to
quiesce applications than VMware Tools, but application-consistent snapshots still take longer
than crash-consistent snapshots.
• Increase cluster heartbeat settings when using Windows Server Failover Cluster.
• To avoid accidental cluster failover when performing a vMotion, follow VMware best practices
to increase heartbeat probes:
⁃ Change the tolerance of missed heartbeats from the default of 5 to 10.
⁃ Increase the number to 20 if your servers are on different subnets.
⁃ If you’re running Windows Server 2012, adjust the RouteHistoryLength to double the
CrossSubnetThreshold value.

Table 2: MS Failover Settings Adjusted for Using VSS

(get-cluster).SameSubnetThreshold = 10
(get-cluster).CrossSubnetThreshold = 20
(get-cluster).RouteHistoryLength = 40
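Before applying the values above, it can help to record the current settings. A minimal sketch, assuming the FailoverClusters PowerShell module is installed on the cluster node:

# Review the existing heartbeat-related values before changing them.
Import-Module FailoverClusters
Get-Cluster | Format-List SameSubnetThreshold, CrossSubnetThreshold, RouteHistoryLength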

VSS on Hyper-V
Hyper-V on Nutanix supports VSS only through third-party backup applications, not snapshots.
Because the Microsoft VSS framework requires a full share backup for every virtual disk
contained in the share, Nutanix recommends limiting the number of VMs on any container
utilizing VSS backup.
Best practices:
• Create different containers for VMs needing VSS backup support. Do not exceed 50 VMs on
each container.
• Create a separate large container for crash-consistent VMs.


7. Protection Domains
A protection domain is a group of VMs or volume groups that can be either snapshotted locally or
replicated to one or more clusters, when you have configured a remote site. Prism Element uses
protection domains when replicating between remote sites.
Best practices:
• Protection domain names must be unique across sites.
• No more than 200 VMs per protection domain.
⁃ VMware Site Recovery Manager and Metro Availability protection domains are limited to
3,200 files.
⁃ Near-sync is not currently supported with VMware Site Recovery Manager and Metro
Availability.
⁃ No more than 10 VMs per protection domain with LWS.
• Group VMs with similar RPO requirements.
⁃ Near-sync can only have one schedule, so be sure to place near-sync VMs in their own
protection domain.

7.1. Consistency Groups


Administrators can create a consistency group for VMs and volume groups that are part of a
protection domain and need to be snapshotted in a crash-consistent manner.
Best practices:
• Keep consistency groups as small as possible. Collect dependent applications or service VMs
into a consistency group to ensure that they are recovered in a consistent state (for example,
a webserver and database would be put into the same consistency group).
• For all hypervisors, try to limit consistency groups to fewer than 10 VMs following the above
best practices. Although we have tested consistency groups with up to 50 VMs, it is more
efficient to have smaller consistency groups.
• Each consistency group using application-consistent snapshots can contain only one VM.
• When providing disaster recovery for VDI using VMware View Composer or Machine Creation
Services, place each protected VM in its own consistency group (including the golden image)
inside a single protection domain.


8. Backup and Disaster Recovery on Remote Sites


Nutanix allows administrators to set up remote sites and select whether they use those remote
sites for simple backup or for both backup and disaster recovery.
Remote sites are a logical construct. Any Acropolis cluster—either physical or based in the cloud
—used as the destination for storing snapshots must first be configured as a “remote site” from
the perspective of the source cluster. Similarly, on this secondary cluster, you must configure the
primary cluster as a “remote site” before snapshots from the secondary cluster start replicating to
it.
Configuring the backup option on Nutanix allows an organization to use its remote site as a
replication target. This means that you can back up data to this site and retrieve snapshots from
it to restore locally, but failover protection (that is, running failover VMs directly from the remote
site) is not enabled. Backup supports using multiple hypervisors; as an example, an enterprise
might have ESXi in the main data center but use Hyper-V at a remote location. With the backup
option configured, the Hyper-V cluster could use storage on the ESXi cluster for backup. Using
this method, Nutanix can also back up to Amazon Web Services from Hyper-V or ESXi.
Configuring the disaster recovery option allows you to use the remote site both as a backup
target and as a source for dynamic recovery. In this arrangement, failover VMs can run directly
from the remote site. Nutanix provides cross-hypervisor disaster recovery between ESXi and
AHV clusters. Today, Hyper-V clusters can only provide disaster recovery to other Hyper-V-based
clusters.
For data replication to succeed, ensure that you configure forward (DNS A) and reverse (DNS
PTR) DNS entries for each ESXi management host on the DNS servers used by the Nutanix
cluster.

8.1. Remote Site Setup


You can customize a number of options when setting up a remote site. Protection domains inherit
all remote site properties during replication.


Figure 7: Setup Options for a Remote Site

Address
Use the external cluster IP as the address for the remote site. The external cluster IP is highly
available because it acts as a virtual IP address across all of the virtual storage controllers. You can
configure the external cluster IP in the Prism UI under cluster details.
Other recommendations include:
• Try to keep both sites at the same Acropolis version. If both sites require compression, both
must have the compression feature licensed and enabled.
• Open the following ports between both sides: 2009 TCP, 2020 TCP, 9440 TCP, and 53 UDP. If
using the SSH tunnel described below, also open 22. Use the external cluster IP address for
source and destination. Cloud Connect uses a port between 3000–3099, but that setup occurs
automatically. All CVM IPs must be allowed to pass replication traffic between sites with the
ports detailed above. To simplify firewall rules, you can use the proxy described below.
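As an illustration only, you can spot-check TCP reachability of these ports from a Windows host toward the remote external cluster IP. The address below is a placeholder, and UDP 53 (DNS) requires a separate check:

# Hypothetical sketch: verify the TCP replication ports against the remote external cluster IP.
$remoteClusterIP = '10.10.10.50'    # placeholder address
foreach ($port in 2009, 2020, 9440) {
    Test-NetConnection -ComputerName $remoteClusterIP -Port $port |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded
}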


Enable Proxy
The enable proxy option redirects all egress remote replication traffic through one node. It’s
important to note that this remote site proxy is different from the Prism proxy. With “enable proxy”
selected, replication traffic goes to the remote site proxy, which then forwards it to other nodes in
the cluster. This arrangement significantly reduces the number of firewall rules you need to set up
and maintain.
Best practice:
• Use proxy in conjunction with the external address.

SSH Tunnel
An SSH tunnel is a point-to-point connection—one node in the primary cluster connects to a node
in the remote cluster. By enabling proxy, we force replication traffic to go over this node pair. You
can use the SSH tunnel between Cloud Connect and physical Nutanix clusters when you can’t
set up a virtual private network (VPN) between the two clusters. We recommend using an SSH
tunnel only as a fallback when a VPN is not an option.
Best practices:
• To use SSH tunnel, select enable proxy.
• Open port 22 between external cluster IPs.
• Only use SSH tunnel for testing—not production. Use a VPN between remote sites or a Virtual
Private Cloud (VPC) with Amazon Web Services.

Capabilities
The disaster recovery option requires that both sites either support cross-hypervisor disaster
recovery or have the same hypervisor. Today, Nutanix supports only ESXi and AHV for cross-
hypervisor disaster recovery with full snapshots. When using the backup option, the sites can use
different hypervisors, but you can’t restore VMs on the remote side. The backup option is also
used with backing up to AWS and Azure.

Bandwidth Throttling
Max bandwidth is set to throttle traffic between sites when no network device can limit replication
traffic. The max bandwidth option allows for different settings throughout the day, so you can
assign a max bandwidth policy when your sites are busy with production data and disable the
policy when they’re not as busy. Max bandwidth does not imply a maximum observed throughput.
When talking with your networking teams, it's important to note that this setting is in MB/s,
not Mb/s. Near-sync does not currently honor maximum bandwidth thresholds.
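A minimal sketch of the unit conversion, using a hypothetical link speed:

# Hypothetical example: convert an available WAN rate in megabits per second (Mb/s)
# into the megabytes per second (MB/s) value the max bandwidth policy expects.
$linkMbps = 800                      # assumed WAN bandwidth in Mb/s
$maxBandwidthMBps = $linkMbps / 8    # 800 Mb/s -> 100 MB/s
"Enter {0} MB/s as the max bandwidth value" -f $maxBandwidthMBps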


Figure 8: Max Bandwidth

Remote Container
VStore name mapping identifies the container on the remote cluster used as the replication
target. When establishing the VStore mapping, we recommend creating a new, separate remote
container with no VMs running on it on the remote side. This configuration allows the hypervisor
administrator to distinguish failed-over VMs quickly and to apply policies on the remote side easily
in case of a failover.


Figure 9: VStore-Container Mappings for Replication

Best practices:
• Create a new remote container as the target for the VStore mapping.
• If many clusters are backing up to one destination cluster, use only one destination container if
the source containers have similar advanced settings.
• Enable MapReduce compression if licensing permits.
• If you are using vCenter Server to manage both the primary and remote sites, do not have
storage containers with the same name on both sites.
If the aggregate incoming bandwidth required to maintain the current change rate is less than
500 Mb/s, we recommend skipping the performance tier. This setting saves your flash for other
workloads while also saving on SSD write endurance.
To skip the performance tier, use the following command from the nCLI:

Table 3: Skip the Performance Tier

ncli ctr edit sequential-io-priority-order=DAS-SATA,SSD-SATA,SSD-PCIe name=<container-name>


You can reverse this command at any time.

Network Mapping
Acropolis supports network mapping for disaster recovery migrations moving to and from AHV.
Best practice:
• Whenever you delete or change the network attached to a VM specified in the network map,
modify the network map accordingly.

8.2. Scheduling Full Snapshots and Asynchronous Replication


The snapshot schedule should be equal to your desired RPO. In practical terms, the RPO
determines how much data you can afford to lose in the event of a failure. The failure could be
due to a hardware, human, or environmental issue. Taking a snapshot every 60 minutes for a
server that changes infrequently, or when you don’t need a low RPO, takes up resources that
could benefit more critical services.
The RPO is set from the local site. If you set a schedule to take a snapshot every hour,
bandwidth and available space at the remote site determine if you can achieve the RPO. In
constrained environments, limited bandwidth may cause the replication to take longer than the
one-hour RPO, thus increasing the RPO. We list guidelines for sizing bandwidth and capacity to
avoid this scenario later in this document.


Figure 10: Multiple Schedules for a Protection Domain

You can create multiple schedules for a protection domain (PD) using full snapshots, and you can
have multiple protection domains. The figure above shows seven daily snapshots, four weekly
snapshots, and three monthly snapshots to cover a three-month retention policy. This policy is
more efficient in managing metadata on the cluster than using a daily snapshot with a 180-day
retention policy would be.
Best practices:
• Stagger replication schedules across PDs. If you have a PD starting at the top of the hour,
stagger the PDs by half of the most commonly used RPO. The goal is to spread out replication
impact on performance and bandwidth.
• Configure snapshot schedules to retain the lowest number of snapshots while still meeting the
retention policy, as shown in the previous figure.
Remote snapshots implicitly expire based on how many snapshots there are and how frequently
they are taken. For example, if you take daily snapshots and keep a maximum of five, on the
sixth day the first snapshot expires. At that point, you can’t recover from the first snapshot
because the system deletes it automatically.
In case of a prolonged network outage, Nutanix always retains the last snapshot to ensure that
you don’t ever lose all of the snapshots. You can modify the retention schedule from nCLI by
changing the min-snap-retention-count. This value ensures that you retain at least the specified
number of snapshots, even if all of the snapshots have reached the expiry time. This setting
works at the PD level.

8.3. Scheduling LWS and Near-Sync Replication


Nutanix offers near-sync with a telescopic schedule (time-based retention). When you set the
RPO to be ≤ 15 minutes and ≥ one minute, you have the option to save your snapshots for X
number of weeks or months. Once you select near-sync, you cannot add any more schedules.
The table below presents the default telescopic schedule if you want to save recovery points for
one month.

Table 4: Default Telescopic Schedule for One Month

Type              | Frequency      | Retention
Minute increments | Every minute   | 15 minutes
Hourly            | Every hour     | 6 hours
Daily             | Every 24 hours | 7 days
Weekly            | Every week     | 4 weeks
Monthly           | Every month    | 1 month

Limitations:
• Linked clone VMs (typically nonpersistent View desktops) are not supported.
• Metro and SRM-protected containers are not supported.
• Don’t configure near-sync for Hyper-V.
• Ensure that you have enough bandwidth to support the change rate.
• Deduplication on the source container is not supported.
• Enabling near-sync requires at least three nodes in the cluster at both the primary and remote
sites.
• Do not enable near-sync on a cluster where you have any node with more than 40 TB of
storage (either all SSDs, or a combination of SSDs and HDDs).
• Minimum 1.2 TB SSDs are needed in hybrid systems; 1.9 TB SSDs are preferred.
Best practice:


• Limit the number of VMs to 10 or fewer per protection domain. If it’s possible to do so,
maintaining one VM per protection domain can help you transition back to near-sync if you run
out of LWS reserve storage.

8.4. Cross-Hypervisor Disaster Recovery


Nutanix provides cross-hypervisor disaster recovery for migrating between ESXi and AHV
clusters. The migration works with one click and uses the Prism data protection workflow listed
above. Once you’ve installed the mobility drivers through Nutanix Guest Tools (NGT), a VM can
move freely between the hypervisors.
Best practices:
• Configure CVM external IP address.
• Obtain the mobility driver from Nutanix Guest Tools.
• Don’t migrate VMs:
⁃ With delta disks (hypervisor-based snapshots).
⁃ Using SATA disks.
• Ensure that protected VMs have an empty IDE CD-ROM attached.
• Ensure that network mapping is complete.

8.5. Single-Node Backup Target


Nutanix offers the ability to use an NX-1155 or NX-1175 appliance as a single-node backup
target for an existing Nutanix cluster. Because this target has different resources than the original
cluster, its primary use case is to provide backup for a small set of VMs. This utility gives SMB
and ROBO customers a fully integrated backup option.
Best practices:
• Combined, all protection domains should be under 30 VMs.
• Limit backup retention to a three-month policy.
⁃ A recommended policy would include seven daily, four weekly, and three monthly backups.
• Only map an NX-1155 or NX-1175 to one physical cluster.
• Snapshot schedule should be greater than or equal to six hours.
• Turn off deduplication.


8.6. Cloud Connect


The CVM running in AWS and Azure has limited SSD space, so we recommend following the
best practices below when sizing.
Best practices:
• Try to limit each protection domain to one VM to speed up restores. This approach also saves
money, as it limits the amount of data going across the WAN.
• The RPO should not be lower than four hours.
• Turn off deduplication.
• Try to use Cloud Connect to protect workloads that have an average change rate of less than
0.5 percent.


9. Leap: Disaster Recovery Orchestration


The following best practices for DR orchestration cover both on-premises environments and Xi.
Any differences between the two are noted below.

9.1. Required Infrastructure for DR Orchestration


• Deploy Prism Central in each on-prem site.
• Deploy Prism Central onto a subnet that will not fail over.
• Place the CVM and hypervisor IPs on a subnet different from the subnets used by VMs.
• The test network for on-prem DR orchestration requires a nonroutable VLAN.

9.2. Availability Zones


Paired availability zones synchronize the following DR configuration entities:
• Protection policies.
• Recovery plans.
• Categories used in protection policies and recovery plans.
Issues such as a loss of network connectivity between paired availability zones or user actions
such as unpairing availability zones followed by pairing those availability zones again can affect
entity synchronization. Pairing previously unpaired availability zones triggers an automatic
synchronization event. If you do not update entities before a connectivity issue is resolved or
before you pair the availability zones again, the synchronization behavior described above
resumes. If you update entities in either or both availability zones before such issues are resolved
or before unpaired availability zones are paired again, entity synchronization is not possible. In
such a scenario, you can force the entities in one availability zone to synchronize with the paired
availability zone. This forced synchronization overwrites entities at the paired availability zone.
Observe the following recommendations to avoid inconsistencies and the resulting
synchronization issues:
• During network connectivity issues, do not update an entity at both the availability zones in a
pair. You can safely make updates at any one location. After the connectivity issue is resolved,
force synchronization from the availability zone in which you made the updates. Failure to
adhere to this recommendation results in synchronization failures.


• You can safely create new entities in either or both availability zones as long as you do not
assign the same name to entities in both availability zones. After the connectivity issue is
resolved, force synchronization from the availability zone in which you created entities.
• If one of the availability zones becomes unavailable, or if a service in the paired availability
zone is down, perform a forced sync from the paired availability zone after the issue is
resolved.

9.3. Protection Policies


A protection policy automates the creation and replication of snapshots. When configuring a
protection policy for creating local snapshots, simply specify the RPO, retention policy, and
the entities that you want to protect. If you want to automate snapshot replication to a remote
location, you can also specify the remote location. Protection policies replace the need to use
protection domains.
Requirements for protection policies:
• A VM can belong to either a protection domain or a protection policy, but not both.
• If you are not using Nutanix AHV IPAM and need to retain your IP addresses, install NGT into
the VMs to be protected.
Best practices:
• Apply protection policies by using categories.
• Apply only one protection policy per VM.
• Include only up to 200 VMs in a category.
• For on-premises Leap, create the same container name on both sides. If the container name
does not match on both sides, data replicates by default to the SelfServiceContainer.

9.4. Recovery Plans


A recovery plan orchestrates restoring protected VMs at a backup location. Recovery plans
can either recover all specified VMs at once or, via what is essentially a runbook functionality,
use power-on sequences with optionally configurable interstage delays to recover applications
gracefully and in the required order. Recovery plans that restore applications in Xi can also create
the required networks during failover and can assign public-facing IP addresses to VMs.
Requirements for recovery plans:
• To enable DR orchestration, you must have set up a Prism Element external data services IP.
• Prism Central must run on a Nutanix cluster with the external data services IP.


Xi Leap limitations:
• Xi Leap does not allow you to create a recovery plan when the recovery network that you
specify in the plan exists with the same name on multiple clusters but those networks have
different IP address spaces.
• If you add a VM to multiple recovery plans and perform failover simultaneously on those
recovery plans, each recovery plan creates an instance of the VM at the recovery location.
You must manually clean up the additional instances.
Best practices:
• For on-premises availability zones, create a nonroutable network for testing failovers.
• Run the Validate workflow after making changes to recovery plans.
• After running the Test workflow, run the Clean-Up workflow instead of manually deleting VMs.
• A recovery plan should cover a maximum of 200 VMs at any one time.
• Maximum of 20 categories in a recovery plan.
• Maximum of 20 stages in a recovery plan.
• Maximum of 15 categories per stage in a recovery plan.
• Maximum of 5 recovery plans can be executed in parallel.

Network Mapping

Figure 11: Network Mapping for Xi Leap


Virtual Networks in On-Premises Clusters


Virtual networks in on-premises Nutanix clusters are virtual subnets that are bound to a single
VLAN. At physical locations, including the recovery location, administrators must create these
virtual subnets manually, with separate virtual subnets created for production and test purposes.
You must create these virtual subnets before configuring recovery plans.
When configuring a recovery plan, map the virtual subnets at the source location to the virtual
subnets at the recovery location.

Virtual Networks in Xi Cloud Services


Xi Cloud Services features two built-in virtual networks: Production and Test. These virtual
networks are not analogous to the virtual subnets used in on-premises Nutanix clusters; these
virtual networks only serve to provide two separate IP address spaces for production and testing
so that activities performed in one do not affect the other. These separate Production and Test IP
address spaces contain virtual subnets that are analogous to the virtual subnets in on-premises
Nutanix clusters.
The Production virtual network contains subnets that are used for production workloads. These
production workloads can be either workloads created and maintained in Xi Cloud Services or
workloads that have failed over from a paired physical location.
The Test virtual network contains the subnets on which you want to recover VMs when testing
failover from the virtual subnets at a paired physical location. When configuring a recovery plan,
map the virtual subnets on your on-premises clusters to virtual subnets within the Production and
Test networks in Xi Cloud Services.
Best practices:
• Set up administrative distances on VLANs for subnets that will completely fail over. If you don’t
set up administrative distances, shut down the VLAN on the source side after failover if the
VPN connection is maintained between the two sites. If you are failing over to a new subnet,
set up the subnet beforehand so you can test the routing.
• The prefix length for network mappings at the source and the destination must be the same.
• If you’re not using Nutanix IPAM, you must install the NGT software package to maintain a
static address.
• To maintain a static address for Linux VMs that aren’t using Nutanix IPAM, the VMs must
have the NetworkManager command-line tool (nmcli) version 0.9.10.0 or later installed.
Additionally, you must use NetworkManager to manage the network for the Linux VMs. To
enable NetworkManager on a Linux VM, set the value of the NM_CONTROLLED field to yes
in the interface configuration file (for example, in CentOS, the file is /etc/sysconfig/network-
scripts/ifcfg-eth0). After setting this field, restart the network service on the VM.
For networking best practices specific to Xi Leap, refer to the Xi Connectivity Tech Note.


Xi Leap Hypervisor Support


• Xi Leap only supports clusters running AHV.

Xi Leap Virtual Machine Configuration Limitations


VMs with the following configurations cannot power on:
• VMs configured with a GPU resource.
• VMs configured with four vNUMA sockets.


10. Sizing Space

10.1. Full Local Snapshots


To size space for local snapshots, you need to account for the rate of change in your environment
and how long you plan to keep your snapshots on the cluster. It is important to understand that
reduced snapshot frequency may increase the rate of change due to the greater chance of
common blocks changing before the next snapshot.
To find the space needed to meet your RPO, you can use the following formula. As you lower the
RPO for asynchronous replication, you may need to account for an increased rate of transformed
garbage. Transformed garbage is space that was allocated for I/O optimization or space that was
assigned but to which the metadata no longer refers.

Table 5: Local Full Snapshot Reserve Formula

snapshot reserve = (frequency of snapshots * change rate per frequency) +
(change rate per frequency * # of snapshots in a full curator scan * 0.1)

Note: A full curator scan runs every six hours.

You can look at your backups and compare the incremental difference between them to find
the change rate. You could also take a conservative approach and start with a low snapshot
frequency and a short expiry policy, then gauge the size difference between backups before
consuming too much space.


Figure 12: Example Snapshot Schedule: Taking a Snapshot at Noon and 6 PM

Using the local snapshot reserve formula presented above, assuming for demonstration
purposes that the change rate is 35 GB of data every six hours and that we keep ten snapshots:
snapshot reserve = (frequency of snapshots * change rate per frequency) +
(change rate per frequency * # of snapshots in a full curator scan * 0.1)
= (10 * 35,980 MB) + (35,980 MB * 1 * 0.1)
= 359,800 + (35,980 * 1 * 0.1)
= 359,800 + 3,598
= 363,398 MB
= 363 GB
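The same arithmetic can be scripted. The helper below is a hypothetical sketch (the function and parameter names are our own, not part of any Nutanix tooling) that reproduces the figures above:

# Hypothetical helper that evaluates the local full-snapshot reserve formula above.
function Get-SnapshotReserveMB {
    param(
        [int]$SnapshotCount,            # number of snapshots retained
        [double]$ChangeRatePerFreqMB,   # change rate per snapshot interval, in MB
        [int]$SnapshotsPerCuratorScan = 1
    )
    ($SnapshotCount * $ChangeRatePerFreqMB) +
        ($ChangeRatePerFreqMB * $SnapshotsPerCuratorScan * 0.1)
}

# Worked example above: 10 snapshots, 35,980 MB of change per interval.
Get-SnapshotReserveMB -SnapshotCount 10 -ChangeRatePerFreqMB 35980    # 363,398 MB (~363 GB)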

10.2. Asynchronous Replication


Asynchronous replication uses the same process as above, but you must include the first full
copy of the protection domain (PD) plus delta changes based on the set schedule.


Table 6: Remote Snapshot Reserve Formula

snapshot reserve = (frequency of snapshots * change rate per frequency) +
(change rate per frequency * # of snapshots in a full curator scan * 0.2) +
total size of the source PD

For the minimum amount of space needed at the remote side, 130 percent of the PD is a good
average to work from.
If the remote target is also running a workload, note that incoming replication uses the
performance tier. If you’re using a hybrid cluster, be sure to size for the additional hot data. You
can also skip the performance tier by creating a separate container for incoming replication and
following the steps provided in the Remote Container section above.

10.3. Lightweight Snapshots (LWS)


Sizing for LWS is very similar to sizing for full snapshots in that you must account for change
rate and retention time. During the LWS retention time, you don’t have to size for transformed
garbage space, but LWS does use additional SSD space, for which we do have to size. By
default, 7 percent of each node’s SSD space forms the LWS reserve, which is an additional
factor.
Duplicating the workload from the full snapshot example above, let’s change the RPO to one
minute. LWS data exists for 75 minutes plus one extra snapshot based on the RPO value.
Because the frequency rate in this scenario is high, it’s very important to account for overwrites in
the change rate. Because all data goes through the oplog, it is compressed; the type and amount
of compression varies by workload. Using inline compression, we typically see a 2:1 compression
rate.

Example Using 35 GB Change Rate Over 6 Hours


If your change rate is 35 GB of data every six hours, add 5 GB to account for overwrites. 40
GB every six hours equals a change rate of approximately 114 MB per minute. If the cluster is
running with replication factor 2, then we must account for the total physical space.

Table 7: LWS Reserve

LWS reserve = (frequency of snapshots * change rate per frequency) * oplog compression * replication factor
= (75 * 114 MB/minute) * 0.50 * 2
= 8,550 MB
= 8.6 GB

If we were running a 3460 hybrid system with two 1.9 TB SSDs, our LWS reserve would be:

Table 8: Cluster LWS Reserve

LWS cluster reserve = ((total SSD capacity per node - (CVM + metadata + oplog + cache
overhead)) * number of nodes) * LWS reserve percentage
= (3,539 GiB - (120 GiB + 30 GiB + 200 GiB + 40 GiB)) * 4 * 0.07
= 3,149 GiB * 4 * 0.07
= 881.72 GiB
= ~945 GB

We apply the 7 percent to the SSD space used by the extent store after we have accounted for
the rest of the system. Because our LWS cluster reserve is 945 GB, this system has plenty of room
for the workload as well as for additional business-critical applications.
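The cluster-level reserve can be reproduced the same way; the per-node overhead values below
are the assumed figures from the example above, so substitute your own platform's numbers.

$SsdPerNodeGiB = 3539
$OverheadGiB   = 120 + 30 + 200 + 40     # CVM + metadata + oplog + cache overhead
$NodeCount     = 4
$LwsReservePct = 0.07

($SsdPerNodeGiB - $OverheadGiB) * $NodeCount * $LwsReservePct    # 881.72 GiB, about 945 GB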
As we discussed in the Scheduling LWS and Near-Sync Replication section above, a telescopic
schedule would have a total of six hourly, seven daily, four weekly, and one monthly snapshots.
You only need to account for additional garbage space for the hourly snapshots, due to their higher frequency.
Using the change rate above, we can calculate each separate schedule and add them all
together. Because full snapshots occur less often than LWS, we don’t have to be concerned with
overwrites and can use the original 35 GB change rate for every six hours.


Example of a Telescopic Near-Sync Snapshot Schedule


Hourly change rate: 5,833 MB
Daily change rate: 140,000 MB
Weekly change rate: 980,000 MB
Monthly change rate: 3,920,000 MB
snapshot overall capacity reserve = (hourly schedule + daily schedule + weekly schedule +
monthly schedule)
For the hourly schedule:
(frequency of snapshots * change rate per frequency) + (change rate per frequency * # of
snapshots in a full curator scan * 0.1)
The remaining schedules:
= (frequency of snapshots * change rate per frequency)
snapshot overall capacity reserve = ((6 * 5,833) + (5,833 * 6 * 0.1)) + (7 * 140,000) + (4 *
980,000) + (2 * 3,920,000)
= 34,998 + 3,500 + 980,000 + 3,920,000 + 7,840,000
= 12,778,498 MB
= ~12.8 TB
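A short PowerShell sketch of the same telescopic total, with each tier broken out; only the hourly
tier carries the garbage factor, and the change rates are the example values above.

$HourlyMB  = (6 * 5833) + (5833 * 6 * 0.1)
$DailyMB   = 7 * 140000
$WeeklyMB  = 4 * 980000
$MonthlyMB = 2 * 3920000

$HourlyMB + $DailyMB + $WeeklyMB + $MonthlyMB    # ~12.78 million MB, roughly 12.8 TB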

For most small and medium businesses, a daily change of 140 GB would be considered high.
This near-sync example highlights the difference between keeping a lot of snapshots around and
keeping only the 10 snapshots in the full snapshot example.

10.4. Near-Sync Replication


Near-sync replication uses the same process as the snapshot process outlined above, but you
must include the first full copy of the protection domain as well as delta changes based on the set
schedule.
For the minimum amount of space needed at the remote site, start with 130 percent of the
protection domain plus the required LWS reserve space.
Best practices:
• Use drives with the same or greater capacity at the remote site compared to those in your
primary cluster to ensure that there is enough LWS reserve space.
• Do not use Metro or VMware SRM-based containers with near-sync.
• Do not enable deduplication on the source container for near-sync.


11. Bandwidth
You must have enough available bandwidth to keep up with the replication schedule. If a
replication job is still running when the next snapshot is scheduled, the current job finishes first;
the newest outstanding snapshot then replicates next, so the most recent data reaches the
remote side as quickly as possible. To help replication run faster when you have limited
bandwidth, you can seed data on a secondary cluster at the primary site before shipping that
cluster to the remote site.

11.1. Seeding
To seed data for a new site:
• Set up a secondary cluster with local IPs at the primary site.
• Enable compression on the remote site within the protection domain.
• Set the initial retention time to “3 months.”
• Once setup completes, reconfigure the secondary cluster with remote IPs.
• Shut down the secondary cluster and ship it to the remote site.
• Power on the remote cluster and update the remote site on the primary cluster to the new IP.
If you are not able to seed the protection domain at the local site, you can create the remote
cluster as a normal install and turn on compression over the wire. Manually create a one-time
replication with retention time set to “3 months.” We recommend this retention time setting due to
the extra time it takes to replicate the first data set across the wire.
To figure out the needed throughput, you must know your RPO. If you set the RPO to one hour,
you must be able to replicate the changes within that time.
Assuming you know your change rate based on incremental backups or local snapshots, you can
calculate the bandwidth needed. The next example uses a change rate of 15 GB and an RPO
of one hour. We do not use deduplication in the calculation, partly so the dedupe savings can
serve as a buffer in the overall calculation, and partly because the one-time cost for deduped
data going over the wire has less of an impact once the data is present at the remote site. We’re
assuming an average of 30 percent bandwidth savings for compression on the wire.

Table 9: Bandwidth Sizing

Bandwidth needed = (RPO change rate * (1 - compression on wire savings %)) / RPO


Example:
= (15 GB * (1 - 0.3)) / 3,600 s
= (15 GB * 0.7) / 3,600 s
= 10.5 GB / 3,600 s
= (10.5 * 1,000 MB) / 3,600 s (converting to MB)
= 10,500 MB / 3,600 s
= 2.92 MB/s
Bandwidth needed = 2.92 MB/s * 8 bits per byte = 23.33 Mb/s

You can easily perform the calculation online using http://www.wolframalpha.com/input/?i=(15+GB+*+(1-30%25))%2F(1+hour).
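Alternatively, a minimal PowerShell sketch of the same calculation (the values match the example
above):

$ChangeRateGB       = 15
$CompressionSavings = 0.30
$RpoSeconds         = 3600

$MBperSec = ($ChangeRateGB * 1000 * (1 - $CompressionSavings)) / $RpoSeconds
$MbPerSec = $MBperSec * 8
"{0:N2} MB/s = {1:N2} Mb/s" -f $MBperSec, $MbPerSec    # 2.92 MB/s = 23.33 Mb/s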
If you keep missing your replication schedule, either increase your bandwidth or try increasing
your RPO. To allow more RPO flexibility, you can run different schedules on the same protection
domain; for example, have one daily replication schedule and create a separate schedule to take
local snapshots every couple of hours.


12. Failover: Migrate vs. Activate


Nutanix offers two options for handling VMs and volume groups during failover: migrate and
activate.
Use migrate when you know that both the production site and the remote site are still healthy.
This option is only available on the active production site. Migrate shuts down the VM or volume
group, takes another snapshot, and replicates the VM or volume group to the selected remote
site.
The activate option for restoring a VM is only available on the remote site. When you select
activate, the system uses the last snapshot on the remote side to bring up the VM regardless of
whether the active production site is healthy. You won’t be able to sync any outstanding changes
to the VM from the production site.
If you have to activate a PD because the primary site is down, but the primary site comes back
up after the failover, you can have a split-brain scenario. To resolve this situation, deactivate the
PD on the former primary site. The following command is hidden from nCLI because it deletes
the VMs, but it resolves the split while keeping the existing snapshots:

Table 10: Asynchronous Replication Recovery from a Split-Brain Scenario

ncli> pd deactivate_and_destroy_vms name=<protection_Domain_Name>

You can test a VM at the remote site without breaking replication using the restore or clone
functionality.

Figure 13: Cloning a VM for Testing Using the Local Snapshot Browser at the Remote Site

Using the local snapshot browser on the inactive protection domain at the remote site, choose
the restore option to clone a VM to the datastore. Add a prefix to the VM's path.
Best practices:


• When activating PDs on the remote site, use intelligent placement for Hyper-V and DRS for
ESXi clusters. Intelligent placement evenly spreads out the VMs on boot during a failover.
Acropolis powers on VMs uniformly at boot time.
• Install Nutanix Guest Tools (NGT) on machines using volume groups.
• Configure the data services cluster IP on the remote cluster.

12.1. Protection Domain Cleanup


Because inactive protection domains still consume space in existing snapshots, you should
remove any unused PDs to reclaim that space.
To remove a PD from a cluster, follow these steps:
1. Remove existing schedules.
2. Remove both local and remote snapshots.
3. Remove the VMs from the active PD.
4. Delete the PD.


13. Self-Service File Restore


Self-service file restore is available for ESXi and AHV. The goal is to offer self-service file-level
restore with minimal support from infrastructure administrators. Once you’ve installed Nutanix
Guest Tools (NGT), you can open a web browser and go to http://localhost:5000 to browse
snapshots sorted by today, last week, last month, and user-defined criteria.
• Guest VMs should be:
⁃ Windows 2008, Windows 7, or a later version
⁃ CentOS 6.5+ and 7.0+
⁃ Red Hat 6.5+ and 7.0+
⁃ OEL 6.5+
⁃ SLES 11+
• Install VMware Tools.
• Remove JRE 1.8 or later if installed with a previous release.
• Configure a cluster external IP address.
• Add the VM to a protection domain.
• Use the default disk.EnableUUID = true for the VM in advanced settings for ESXi.
• The VM must have an IDE-based CD-ROM configured for installation. On newer versions of
ESXi, a SATA-based CD-ROM is added by default.
• Detach the mounted disk after restoring your files.
Limitations:
• For Linux VMs, logical volumes spanning multiple disks are not supported.


14. Third-Party Backup Products


Nutanix provides its own changed region tracking API that vendors can access via REST API.
Currently, Commvault, Rubrik, and HYCU support the API. The API is hypervisor-agnostic, but
these companies are the first to provide backup support specifically for AHV. Nutanix systems
can integrate with any backup vendor that supports vStorage APIs for Data Protection for ESXi.
Nutanix also offers a VSS provider for Hyper-V backup software vendors.
Nutanix has published recommendations for optimizing and scaling third-party backup solutions
in best practices guides focused on an enhanced disk-to-disk backup architecture. Find these
guides in the best practices section of Nutanix.com.

Note: Nutanix AHV guest VMs and Nutanix Files are not certified for backup via
Cohesity because of the current Cohesity architectural implementation. Issues
related to Cohesity backups and configuration are not supported by Nutanix. For a list
of backup partners currently supported by and certified with Nutanix AHV, refer to the
Nutanix Technology Alliance Partners program.


15. Conclusion
Nutanix offers granular data protection based on the required recovery point objectives across
many different deployment models. As your application requirements change and your cluster
grows, you have the flexibility to add and remove VMs from protection domains. Sizing capacity
and bandwidth are key to achieving optimal data protection and maintaining cluster health.
Snapshot frequency and the daily change rate affect the capacity and bandwidth needed
between sites to meet the needs of the business.
With data protection features that are purpose-built, fully integrated, and 100 percent software
defined, Nutanix provides the ultimate in adaptability and flexibility for meeting your enterprise’s
backup and recovery needs.


Appendix

Best Practices Checklist


Following are high-level best practices for backup and disaster recovery on Nutanix.
1. General
a. All VM files should sit on Nutanix storage. If files are stored externally on non-Nutanix
storage, use the same file path on both sides.
b. Remove all external devices, including ISOs or floppy devices.
2. Nutanix Native VSS Snapshots
a. Configure an external cluster IP address.
b. Guest VMs should be able to reach the external cluster IP on port 2074.
c. Guest VMs should have an empty IDE CD-ROM for attaching Nutanix Guest Tools.
d. Only available for ESXi and AHV.
e. Virtual disks must use the SCSI bus type.
f. VSS is not supported with near-sync, or if the VM has any delta disks (VMware
snapshots).
g. VSS must be running in the guest VM. The appendix has a Check for VSS Service
PowerShell script that verifies whether the service is running.
h. The guest VM needs to support the use of VSS writers. This Appendix has a Check VSS
Writers PowerShell script that makes sure that the VM writers are stable.
i. Schedule application-consistent snapshots during off-peak hours or ensure that
additional I/O performance is available. If you take a VSS snapshot during peak usage,
then the delta disk from the hypervisor-based snapshot could become large. When you
delete the hypervisor-based snapshot, collapsing it takes additional I/O; account for this
additional I/O to avoid affecting performance.
3. Hyper-V VSS Provider
a. VSS support is only for backup.
b. Create different containers for VMs needing VSS backup support. Limit the number of
VMs on each container to 50.
c. Create a separate large container for crash-consistent VMs.
4. Protection Domains
a. Protection domain names must be unique across sites.
b. Group VMs with similar RPO requirements.
c. Maximum of 200 VMs per protection domain for full snapshots.


d. Maximum of 10 VMs per protection domain for LWS.


e. VMware Site Recovery Manager and Metro Availability protection domains are limited to
50 VMs. (These types of protection domains are not currently supported for near-sync.)
f. Linked clone VMs (typically nonpersistent View desktops) are not supported with near-
sync.
g. Remove unused protection domains to reclaim space.
h. If you have to activate a protection domain rather than migrate it, deactivate the old
primary protection domain when the site comes back up.
5. Consistency Groups
a. Keep consistency groups as small as possible. Keep dependent applications or service
VMs in one consistency group to ensure that they are recovered in a consistent state (for
example, App and DB).
b. Each consistency group using application-consistent snapshots can contain only one VM.
6. Disaster Recovery and Backup
a. Ensure that you configure forward (DNS A) and reverse (DNS PTR) DNS entries for each
ESXi management host on the DNS servers used by the Nutanix cluster.
7. Remote Sites
a. Use the external cluster IP as the address for the remote site.
b. Use DR proxy to limit firewall rules.
c. Use max bandwidth to limit replication traffic.
d. When activating protection domains, use intelligent placement for Hyper-V and DRS for
ESXi clusters on the remote site. Intelligent placement evenly spreads out the VMs on
boot during a failover. Acropolis powers on VMs uniformly at boot time.
e. If you are using vCenter Server to manage both the primary and remote sites, do not use
storage containers with the same name on both sites.
8. Remote Containers
a. Create a new remote container as the target for the VStore mapping.
b. When backing up many clusters to one destination cluster, use only one destination
container if the source containers have similar advanced settings.
c. Enable MapReduce compression if licensing permits.
d. If the aggregate incoming bandwidth required to maintain the current change rate is
< 500 Mb/s, we recommend skipping the performance tier to save flash capacity and
increase device longevity.
9. Network Mapping
a. Whenever you delete or change the network attached to a VM specified in the network
map, modify the network map accordingly.
10. Scheduling


a. In order to spread out replication impact on performance and bandwidth, stagger
replication schedules across PDs. If you have a PD starting at the top of the hour, stagger
the PDs by half of the most commonly used RPO.
b. Configure snapshot schedules to retain the smallest number of snapshots while still
meeting the retention policy.
c. Metro and SRM-protected containers are not supported.
d. Deduplication on the source container is not currently supported for near-sync.
11. Cross-Hypervisor Disaster Recovery
a. Configure CVM external IP address.
b. Obtain the mobility driver from Nutanix Guest Tools.
c. Don’t migrate VMs with delta disks (hypervisor-based snapshots), using SATA disks, or
using volume groups.
d. Ensure that protected VMs have an empty IDE CD-ROM attached.
e. Ensure that network mapping is complete.
12. DR Orchestration
a. Deploy Prism Central to each on-prem site.
b. Deploy Prism Central onto a subnet that will not fail over.
c. Place CVM and hypervisor IPs on a subnet different from the subnets used by VMs.
d. On-prem DR orchestration requires a nonroutable VLAN for the test network.
13. Availability Zones
a. If one of the availability zones becomes unavailable, or if a service in the paired
availability zone is down, perform a forced sync from the paired availability zone after the
issue is resolved.
14. Protection Policies
a. A VM can belong to either a protection domain or a protection policy, but not both.
b. If you are not using Nutanix AHV IPAM and need to retain your IP addresses, install NGT
into the VMs to be protected.
c. Apply protection policies by using categories.
d. Apply only one protection policy per VM.
e. Include only up to 200 VMs in a category.
f. For on-premises Leap, create the same container name on both sides. If the
container name does not match on both sides, data replicates by default to the
SelfServiceContainer.
15. Recovery Plans
a. For on-premises availability zones, create a nonroutable network for testing failovers.
b. Run the Validate workflow after making changes to recovery plans.


c. After running the Test workflow, run the Clean-Up workflow instead of manually deleting
VMs.
d. A recovery plan should cover a maximum of 200 VMs at any one time.
e. Maximum of 20 categories in a recovery plan.
f. Maximum of 20 stages in a recovery plan.
g. Maximum of 15 categories per stage in a recovery plan.
h. Maximum of 5 recovery plans can be executed in parallel.
16. Network Mapping
a. Set up administrative distances on VLANs for subnets that will completely fail over. If
you don’t set up administrative distances, shut down the VLAN on the source side after
failover if the VPN connection is maintained between the two sites. If you’re failing over to
a new subnet, set up the subnet beforehand so you can test the routing.
b. The prefix length for network mappings at the source and the destination must be the
same.
c. If you’re not using Nutanix IPAM, you must install the NGT software package to a
maintain static address.
d. To maintain a static address for Linux VMs that aren’t using Nutanix IPAM, the VMs must
have the NetworkManager command-line tool (nmcli) version 0.9.10.0 or later installed.
Additionally, you must use NetworkManager to manage the network for the Linux VMs.
To enable NetworkManager on a Linux VM, set the value of the NM_CONTROLLED field
to yes in the interface configuration file (for example, in CentOS, the file is /etc/sysconfig/
network-scripts/ifcfg-eth0). After setting the field, restart the network service on the VM.
17. Xi Leap Hypervisor Support
a. Xi Leap only supports clusters running AHV.
18. Xi Leap Virtual Machine Configuration Restrictions
a. Cannot power on VMs configured with a GPU resource.
b. Cannot power on VMs configured with four vNUMA sockets.
19. Single-Node Backup
a. Combined, all protection domains should be under 30 VMs.
b. Limit backup retention to a three-month policy. A recommended policy would be seven
daily, four weekly, and three monthly backups.
c. Only map an NX-1155 to one physical cluster.
d. Snapshot schedule should be greater than or equal to six hours.
e. Turn off deduplication.
20. Cloud Connect
a. Try to limit each protection domain to one VM to speed up restores. This approach also
saves money, as it limits the amount of data going across the WAN.
b. The RPO should not be lower than four hours.


c. Turn off deduplication.


d. Try to use Cloud Connect to protect workloads that have an average change rate of less
than 0.5 percent.
21. Sizing
a. Size local and remote snapshot usage using the application’s change rate.
b. Remember to either size the performance tier for hybrid clusters to accommodate
incoming data or bypass the performance tier and write directly to disk.
22. Bandwidth
a. Seed locally for replication if WAN bandwidth is limited.
b. Set a high initial retention time for the first replication when seeding.
23. File-Level Restore
a. Guest VMs should be Windows 2008, Windows 7, or a later version.
b. Install VMware Tools.
c. Install JRE 1.8 or later.
d. Configure a cluster external IP address.
e. Add the VM to a protection domain.
f. Use the default disk.EnableUUID = true for the VM in advanced settings.
g. The VM must have a CD-ROM configured.
h. Detach the mounted disk after restoring your files.


PowerShell Scripts
Check for VSS Service
This PowerShell script checks whether the VSS service is running as required for application-
consistent snapshots.
#Load the Nutanix cmdlets first; make sure your local version matches the cluster version.
Add-PSSnapin NutanixCmdletsPSSnapin
#Connect to the Nutanix cluster of your choice, try to use the external address.
Connect-NutanixCluster -AcceptInvalidSSLCerts -server External_cluster_IP -UserName admin
#Get a list of all Consistency Groups
$pdvss = Get-NTNXProtectionDomainConsistencyGroup
#Build an array of all the appConsistentVMs (the consistency group name is the VM name)
$appConsistentVM = @()
Foreach ($vssVM in $pdvss)
{
    if ($vssVM.appConsistentSnapshots)
    {
        $appConsistentVM += $vssVM.consistencyGroupName
    }
}
#Check the VSS service state on each application-consistent VM
Get-Service -Name VSS -ComputerName $appConsistentVM |
    Format-Table -Property MachineName, Status, Name, DisplayName -Auto


Check VSS Writers


This PowerShell script checks whether VSS writers are stable for VMs that are running
application-consistent snapshots.
function Get-VssWriters {
<#
.Synopsis
Function to get information about VSS Writers on one or more computers
.Description
Function will parse information from VSSAdmin tool and return object containing
WriterName, StateID, StateDesc, and LastError
Function will display a progress bar while it retrieves information from different
computers.
.Parameter ComputerName
This is the name (not IP address) of the computer.
If absent, localhost is assumed.
.Example
Get-VssWriters
This example will return a list of VSS Writers on localhost
.Example
# Get VSS Writers on localhost, sort list by WriterName
$VssWriters = Get-VssWriters | Sort "WriterName"
$VssWriters | FT -AutoSize # Displays it on screen
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\myReport.csv" -NoTypeInformation # Exports it to CSV
.Example
# Get VSS Writers on the list of $Computers, sort list by ComputerName
$Computers = "xHost11","notThere","xHost12"
$VssWriters = Get-VssWriters -ComputerName $Computers -Verbose | Sort "ComputerName"
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\myReport.csv" -NoTypeInformation # Exports it to CSV
.Example
# Reports any errors on VSS Writers on the computers listed in MyComputerList.txt, sorts
list by ComputerName


$Computers = Get-Content ".\MyComputerList.txt"


$VssWriters = Get-VssWriters $Computers -Verbose |
Where { $_.StateDesc -ne 'Stable' } | Sort "ComputerName"
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\myReport.csv" -NoTypeInformation # Exports it to CSV
.Example
# Get VSS Writers on all computers in current AD domain, sort list by ComputerName
$Computers = (Get-ADComputer -Filter *).Name
$VssWriters = Get-VssWriters $Computers -Verbose | Sort "ComputerName"
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\myReport.csv" -NoTypeInformation # Exports it to CSV
.Example
# Get VSS Writers on all Hyper-V hosts in current AD domain, sort list by ComputerName
$FilteredComputerList = $null
$Computers = (Get-ADComputer -Filter *).Name
Foreach ($Computer in $Computers) {
if (Get-WindowsFeature -ComputerName $Computer -ErrorAction SilentlyContinue |
where { $_.Name -eq "Hyper-V" -and $_.InstallState -eq "Installed"}) {
$FilteredComputerList += $Computer
}
}
$VssWriters = Get-VssWriters $FilteredComputerList -Verbose | Sort "ComputerName"
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\myReport.csv" -NoTypeInformation # Exports it to CSV
.OUTPUTS
Script returns a PS object with the following properties:
ComputerName
WriterName
StateID
StateDesc
LastError


.Link
https://superwidgets.wordpress.com/category/powershell/
.Notes
Function by Sam Boutros
v1.0 - 09/17/2014
#>
[CmdletBinding(SupportsShouldProcess=$true,ConfirmImpact='Low')]
Param(
[Parameter(Mandatory=$false,
ValueFromPipeLine=$true,
ValueFromPipeLineByPropertyName=$true,
Position=0)]
[ValidateNotNullorEmpty()]
[String[]]$ComputerName = $env:COMPUTERNAME
)
$Writers = @()
$k = 0
foreach ($Computer in $ComputerName) {
try {
Write-Verbose "Getting VssWriter information from computer $Computer"
$k++
$Progress = "{0:N0}" -f ($k*100/$ComputerName.count)
Write-Progress -Activity "Processing computer $Computer ... $k out of $($ComputerName.count) computers" `
    -PercentComplete $Progress -Status "Please wait" -CurrentOperation "$Progress% complete"
$RawWriters = Invoke-Command -ComputerName $Computer -ErrorAction Stop -ScriptBlock {
return (VssAdmin List Writers)
}
for ($i=0; $i -lt ($RawWriters.Count-3)/6; $i++) {
    $Writer = New-Object -TypeName psobject
    $Writer | Add-Member "ComputerName" $Computer
    $Writer | Add-Member "WriterName" $RawWriters[($i*6)+3].Split("'")[1]
    $Writer | Add-Member "StateID" $RawWriters[($i*6)+6].SubString(11,1)
    $Writer | Add-Member "StateDesc" $RawWriters[($i*6)+6].SubString(14, $RawWriters[($i*6)+6].Length - 14)
    $Writer | Add-Member "LastError" $RawWriters[($i*6)+7].SubString(15, $RawWriters[($i*6)+7].Length - 15)
    $Writers += $Writer
}
Write-Debug "Done"
} catch {
Write-Warning "Computer $Computer is offline, does not exist, or cannot be contacted"
}
}
return $Writers
}
#Load the Nutanix cmdlets first; make sure your local version matches the cluster version.
Add-PSSnapin NutanixCmdletsPSSnapin
#Connect to the Nutanix cluster of your choice, try to use the external address.
Connect-NutanixCluster -AcceptInvalidSSLCerts -server External_cluster_IP -UserName admin
#Get a list of all Consistency Groups
$pdvss = Get-NTNXProtectionDomainConsistencyGroup
#Build an array of all the appConsistentVMs (the consistency group name is the VM name)
$appConsistentVM = @()
Foreach ($vssVM in $pdvss)
{
    if ($vssVM.appConsistentSnapshots)
    {
        $appConsistentVM += $vssVM.consistencyGroupName
    }
}
#Report any application-consistent VMs whose VSS writers are not stable
$VssWriters = Get-VssWriters $appConsistentVM -Verbose |
    Where { $_.StateDesc -ne 'Stable' } | Sort "ComputerName"
$VssWriters | Out-GridView # Displays it in GridView
$VssWriters | Export-CSV ".\vssWriterReport.csv" -NoTypeInformation # Exports it to CSV


About the Author


Dwayne Lessner is a Senior Technical Marketing Engineer on the Product and Technical
Marketing team at Nutanix, Inc. In this role, Dwayne helps design, test, and build solutions on the
Nutanix Enterprise Cloud Platform. Dwayne is always willing to help customers and partners
build the right solution to fit the project.
Dwayne has worked in healthcare and oil and gas for over ten years in various roles. A strong
background in server and desktop virtualization has given Dwayne the opportunity to work
with many different application frameworks and architectures. Dwayne has been a speaker at
BriForums and various VMware User Group events and conferences.
Follow Dwayne on Twitter at @dlink7.

About Nutanix
Nutanix makes infrastructure invisible, elevating IT to focus on the applications and services that
power their business. The Nutanix Enterprise Cloud OS leverages web-scale engineering and
consumer-grade design to natively converge compute, virtualization, and storage into a resilient,
software-defined solution with rich machine intelligence. The result is predictable performance,
cloud-like infrastructure consumption, robust security, and seamless application mobility for a
broad range of enterprise applications. Learn more at www.nutanix.com or follow us on Twitter
@nutanix.


List of Figures
Figure 1: Nutanix Enterprise Cloud................................................................................... 9

Figure 2: Scalable Replication as You Scale...................................................................11

Figure 3: Two-Way Mirroring............................................................................................13

Figure 4: Many-to-One Architecture.................................................................................14

Figure 5: Using the Public Cloud as a Backup Destination............................................. 14

Figure 6: On-Prem and Xi-Based Availability Zones....................................................... 15

Figure 7: Setup Options for a Remote Site..................................................................... 21

Figure 8: Max Bandwidth................................................................................................. 23

Figure 9: VStore-Container Mappings for Replication..................................................... 24

Figure 10: Multiple Schedules for a Production Domain................................................. 26

Figure 11: Network Mapping for Xi Leap.........................................................................32

Figure 12: Example Snapshot Schedule: Taking a Snapshot at Noon and 6 PM............ 36

Figure 13: Cloning a VM for Testing Using the Local Snapshot Browser at the Remote
Site................................................................................................................................42


List of Tables
Table 1: Document Version History................................................................................... 7

Table 2: MS Failover Settings Adjusted for Using VSS................................................... 18

Table 3: Skip the Performance Tier................................................................................. 24

Table 4: Default Telescopic Schedule for One Month......................................................27

Table 5: Local Full Snapshot Reserve Formula...............................................................35

Table 6: Remote Snapshot Reserve Formula..................................................................37

Table 7: LWS Reserve..................................................................................................... 37

Table 8: Cluster LWS Reserve.........................................................................................38

Table 9: Bandwidth Sizing................................................................................................40

Table 10: Asynchronous Replication Recovery from a Split-Brain Scenario....................42
