Lecture#5: IT Service Continuity Management
IT Service Continuity Management
Contents
Introduction
Objectives
Process
Relationship with the other Processes and Functions
Activities
Process Reports
Critical success factors
Performance indicators
Functions and Roles
Costs
Problems
Introduction
Disaster - an event that affects a service or system such that
significant effort is required to restore the original performance level.
Service analysis:
Once the reasons for initiating ITSCM have been identified, an analysis is
made of the IT services that are essential to the business (e.g. information
systems, office applications, accounting applications, e-mail, etc.) and which
must be available in accordance with the Service Level Agreements.
For some nonessential services, it may be agreed to provide an emergency
service with limited capacity and availability.
The service levels during disaster recovery may only be modified in
agreement with the customer.
For critical services, a balance has to be struck between prevention and
recovery options.
2. Business Impact Analysis
Infrastructure
A service analysis is followed by an assessment of the dependencies between
services and IT resources. Availability Management information is used to
analyze the extent to which IT resources perform a critical function in
supporting the IT services discussed earlier. Capacity Management provides
information about the required capacity.
It is also determined to what extent these services may be disrupted, from
the loss of service to its restoration.
Later, this information will be used to identify the recovery options for each
service.
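The dependency assessment described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the service
names, the resource names, and the helper functions resources_by_dependents
and shared_resources are hypothetical and do not come from ITIL or from any
Availability Management tool.

```python
from collections import defaultdict

# Hypothetical mapping of business services to the IT resources they rely on.
SERVICE_DEPENDENCIES = {
    "e-mail": ["mail-server", "core-network", "storage-array"],
    "accounting": ["db-server", "core-network", "storage-array"],
    "office-applications": ["file-server", "core-network"],
}


def resources_by_dependents(dependencies):
    """Invert the mapping: for each resource, list the services that need it."""
    dependents = defaultdict(list)
    for service, resources in dependencies.items():
        for resource in resources:
            dependents[resource].append(service)
    return {resource: sorted(services) for resource, services in dependents.items()}


def shared_resources(dependencies, min_services=2):
    """Return resources whose loss would disrupt at least min_services services."""
    by_resource = resources_by_dependents(dependencies)
    return sorted(r for r, s in by_resource.items() if len(s) >= min_services)
```

Calling shared_resources(SERVICE_DEPENDENCIES) lists the resources that
support several services at once; these are natural candidates for closer
attention when recovery options are chosen.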
3. Risk Assessment
A risk analysis can help identify the risks a business is exposed to. Such an
analysis will provide management with valuable information by identifying the
threats and vulnerabilities and relevant prevention measures.
Because maintaining a disaster recovery plan is relatively expensive, the
prevention measures should be taken first.
Once such measures have been taken against most risks, it is determined if
there are any remaining risks that may necessitate a Contingency plan.
The figure below shows the links between Risk Analysis and Risk Management; it
is based on the CCTA Risk Analysis and Management Method (CRAMM).
3. Risk Assessment
Risk Analysis
First, the relevant IT components (assets) must be identified, such as
buildings, systems, data, etc. Effective asset identification means that the
owner and purpose of each component must be documented.
The next step is to analyze the threats and dependencies and to estimate
the likelihood (high, medium, low) that a disaster will occur. An example is
an unreliable main power supply combined with a location in an area with many
storms and thunderstorms.
Next, the vulnerabilities are identified and classified (high, medium, and low).
A lightning conductor will provide some protection against lightning strikes, but
they can still seriously affect the network and the computer systems.
Finally, the threats and vulnerabilities are evaluated in the context of the IT
components, to provide an estimate of the risks.
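The four steps above can be sketched as a small risk register in Python. This
is a minimal sketch, not the CRAMM method itself: the scoring rule (multiplying
the likelihood level by the vulnerability level), the thresholds, the AssetRisk
class, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed numeric scale for the high/medium/low classification.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass
class AssetRisk:
    asset: str          # IT component (building, system, data, ...)
    owner: str          # documented owner of the component
    purpose: str        # documented purpose of the component
    threat: str         # threat being assessed
    likelihood: str     # "high", "medium" or "low"
    vulnerability: str  # "high", "medium" or "low"


def risk_rating(likelihood, vulnerability):
    """Combine likelihood and vulnerability into a rating (assumed rule)."""
    score = LEVELS[likelihood] * LEVELS[vulnerability]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


def rate_register(entries):
    """Return {asset: risk rating} for every entry in the register."""
    return {e.asset: risk_rating(e.likelihood, e.vulnerability) for e in entries}


# Hypothetical sample register.
REGISTER = [
    AssetRisk("computer room", "Facilities", "hosts servers",
              "power failure during storms", "high", "medium"),
    AssetRisk("network", "IT Operations", "links offices",
              "lightning strike", "medium", "medium"),
    AssetRisk("paper archive", "Finance", "legal records",
              "water leak", "low", "low"),
]
```

With these assumed values, rate_register(REGISTER) rates the computer room as
high, the network as medium, and the paper archive as low. A real assessment
would use the organization's own scales and weighting.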
4. IT Service Continuity Strategy
Most businesses will aim to strike a balance between risk
reduction and recovery planning.
There is a distinction between risk reduction, business
recovery activities, and IT recovery options.
The relationship between risk reduction (prevention) and
recovery planning (recovery options) is discussed below.
Threats can never be fully eliminated. For example, a fire in a
building nearby may also damage your building.
Reducing one risk might also increase another risk. For
example, outsourcing might increase security risks.
4. IT Service Continuity Strategy
Prevention Measures
Prevention measures can be taken on the basis of the risk analysis, while carefully
considering the costs and risks. The measures may aim to reduce the likelihood or
impact of contingencies, and therefore narrow the scope of the recovery plan.
Measures can be taken against dust, excessively high or low temperatures, fire, leaks,
power outages, and burglary. The remaining risks are then covered by the recovery
plan.
The Stronghold/Fortress Approach is the most extensive form of prevention. It
eliminates most vulnerabilities, for example by building a bunker with its own power and
water supply. However, this may introduce other vulnerabilities, such as the risk of
network failure or roadblocks, as off-site recovery will now be even more difficult.
The stronghold/fortress approach is suitable for large computer centers that are too
complex for a recovery plan. It is vital nowadays to complement a stronghold/fortress
approach with a skirmish capability, i.e., an organizational capability to go where the
problem is and deal with it promptly before it spirals out of control.
4. IT Service Continuity Strategy
Do nothing:
Few businesses can afford this approach. It is more likely to
indicate a head-in-the-sand attitude.
Departments which think that they can survive without IT
recovery facilities may give the impression they mean so little
to the business that they are dispensable after a contingency.
Nevertheless, it can be investigated for each service whether this
option is acceptable.
4. IT Service Continuity Strategy
Reciprocal Agreements:
This option can be used if two organizations have similar
hardware and agree to provide each other with facilities in the
event of a disaster.
For this option, the two businesses have to conclude an
agreement and ensure that changes are coordinated so that both
environments remain interchangeable.
Capacity Management should ensure that the reserved capacity is
not used for other purposes, or can be released quickly.
This option is less attractive nowadays due to the increasing use
of online systems such as ATMs and online banking, as these
systems have to be available 24 hours a day, 7 days a week.
4. IT Service Continuity Strategy
Combinations of options:
In many cases, a Contingency plan can provide for a more
expensive option to bridge the introduction of a cheaper
option. For example, a trailer with an operating computer center
(mobile hot start) can provide a temporary solution until
portable facilities have been set up and the new host
computers have been delivered (mobile cold start). Normal
operations are restored after the building has been refurbished
and the new host computers have been moved into it.
5. Organization and implementation
planning
Once the business strategy has been determined and choices have
been made, the ITSCM has to be implemented and the plans for the
IT facilities have to be developed in detail. An organization will have
to be set up to implement the ITSCM process.
This could include management (Crisis Manager), coordination, and
recovery teams for each service.
At the highest level there should be an overall plan addressing the
following issues:
Emergency response plan.
Damage assessment plan.
Recovery plan.
Vital records plan (what to do with data, including paper records).
Crisis Management and PR plans.
5. Organization and implementation
planning
All these plans are used to assess emergencies and to respond to
them. It can then be decided if the business recovery process should
be initiated, in which case the next level of plans has to be activated,
including the:
Accommodation and services plan.
Computer system and network plan.
Telecommunications plan (accessibility and links).
Security plan (integrity of the data and networks).
Personnel plan.
Financial and administrative plans.
6. Prevention measures and recovery
options
This is when the prevention measures and recovery options
identified earlier are put into practice.
Prevention measures to reduce the impact of an incident are
taken together with Availability Management, and may include:
Use of UPS and backup power supplies
Fault-tolerant systems
Off-site storage and RAID systems, etc.
6. Prevention measures and recovery
options
Work should also begin on introducing stand-by
agreements. These should cover personnel, buildings and
telecommunications. Even during the contingency period, work
can begin on restoring the normal situation and ordering
new IT components. Dormant contracts can be made in advance
with suppliers. This means that signed orders are available for
the components to be supplied at an agreed price.
When the disaster occurs, the supplier can process the order
without having to issue quotations. Such dormant contracts
should be updated every year as prices and models will change.
The Configuration Management baselines should be considered
when updating these contracts.
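The yearly update of dormant contracts can be tracked with a simple check,
sketched below in Python. This is a sketch under assumptions: the
DormantContract fields, the 365-day interval used for "every year", and the
function contracts_due_for_review are hypothetical and not part of any
specific contract or Configuration Management system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed interpretation of "updated every year".
REVIEW_INTERVAL = timedelta(days=365)


@dataclass
class DormantContract:
    supplier: str
    component: str       # component to be supplied when the order is activated
    agreed_price: float  # price fixed in the signed order
    last_reviewed: date  # date the price and model were last confirmed


def contracts_due_for_review(contracts, today):
    """Return contracts whose last review is at least one interval old."""
    return [c for c in contracts if today - c.last_reviewed >= REVIEW_INTERVAL]


# Hypothetical sample contracts.
CONTRACTS = [
    DormantContract("Supplier A", "host computer", 120000.0, date(2023, 1, 10)),
    DormantContract("Supplier B", "network switch", 8000.0, date(2023, 11, 1)),
]
```

For example, contracts_due_for_review(CONTRACTS, date(2024, 3, 1)) returns only
the Supplier A contract, which would then be checked against the current
Configuration Management baseline before it is renewed.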
6. Prevention measures and recovery
options
The following activities can be carried out to set up stand-by
agreements:
Negotiating off-site recovery facilities with third parties
Maintaining and equipping the recovery facility
Purchasing and installing stand-by hardware (dormant
contracts)
Managing dormant contracts
7. Developing plans and procedures for
recovery
The plans should be detailed and formal, as a recovery plan
requires maintenance and changes must be approved by those
concerned. These issues also need to be communicated. The
major problems relate to changes in the infrastructure and the
agreed service levels. For example, migration to a new
midrange platform could mean that there is no equivalent unit
at the backup facility for a warm, external start. For this
reason, Configuration Management plays an important role in
monitoring the baseline configurations referred to in the
recovery plan. The plan should also identify the procedures
needed to support it.
7. Developing plans and procedures for
recovery
Recovery Plan
The recovery plan should include all elements relevant to
restoring the business activities and IT services, including:
Introduction - describes the structure of the plan and envisaged recovery
facilities.
Updating - discusses the procedures and agreements for maintaining the plan, and
tracks changes to the infrastructure.
Routing list - the plan is divided into sections, each specifying the actions to be
undertaken by a specific group. The routing list shows which sections should be
sent to which personnel.
Recovery initiation - describes when and under what conditions the plan is
invoked.
Contingency classification - if the plan describes procedures for different
contingencies, they should be described here in terms of seriousness (minor,
medium, major), duration (day, week, weeks), and damage (minor, limited, serious).
7. Developing plans and procedures for
recovery
Recovery Plan
Specialist sections - the plan should be divided into sections based on the six areas and
groups covered by the plan:
o Administration - how and when is the plan invoked, which managers and personnel are
involved, and where is the control center based?
o IT infrastructure - hardware, software, and telecommunications to be provided by the
recovery system, recovery procedures, and dormant contracts for the purchase of new
IT components.
o Personnel - personnel required at the recovery facility, possibly transport to the
facility, and accommodation if the facility is located far from the business.
o Security - instructions for protection against burglary, fires and explosions at both the
home site and the remote site, and information about external storage facilities such as
warehouses and vaults.
o Recovery sites - information about contracts, personnel with specified functions,
security, and transport.
o Restoration - procedures to restore the normal situation (e.g. the building), conditions
under which these procedures are invoked, and dormant contracts.
7. Developing plans and procedures for
recovery
Procedures
The recovery plan provides a framework for drafting the
procedures. It is essential to develop effective procedures,
such that anyone can undertake the recovery by following the
procedures. These should address:
Installing and testing hardware and network components
Restoring applications, databases, and data
These and other relevant procedures are attached to the
recovery plan.
8. Initial Testing
Board:
•Initiating BCM
•Allocating personnel and resources
•Defining policies
•Defining process authority
•Crisis management
•Taking corporate/business decisions