SANOG34 Tutorials Datacentre
Session 1
[Figure: Core requirements around a data centre: power quality, power backup, security, 24/7/365 operation, access and support, redundancy, cooling and Internet connectivity.]
Data Centre in a Business Process
Operations of Data Centre
Data Centre Operations is a broad term that includes all processes and operations performed
within a data centre. Typically, data centre operations are distributed across several categories,
such as:
Infrastructure Operations
Installing, maintaining, monitoring, patching and updating server, storage & network resources
Security
Processes, tools and technologies that ensure physical and logical security in the data centre
premises
Management
Creation, enforcement and monitoring of policies and procedures within data centre processes
Cost of Down Time and Its Calculation
According to Gartner, the average cost of network downtime is around USD 5,600 per minute, which works out to over USD 300,000 per hour for an average-sized ISP. For any business, USD 300,000 per hour is a lot on the line.
Beyond the monetary costs, IT downtime wears on your business's productivity. Every time you get interrupted, it takes on average 23 minutes to refocus on your prior task.
Network failures and power outages aren't the only culprits when it comes to downtime either; several other factors contribute as well.
So how do you know where you stand when it comes to downtime costs? Here is a simple way
to calculate how your business could be affected:
Cost of Down Time and Its Calculation
Cost of Downtime (/Hr) = Lost Revenue + Lost Productivity + Recovery Cost + Intangible Cost
Recovery Cost
These are the costs accrued while fixing the issue. They can include but are not limited to:
• Repair services
• Replacement parts
• Lost data recovery
• Other costs due to loss of data
These may not be as tangible as revenue and productivity costs, but they are equally important when deducing your real downtime costs.
Intangible Cost
These are the costs that can sting the most over the long term. They occur when downtime damages your reputation or your brand, and they hit hardest for businesses that rely heavily on uptime. Including intangible costs in the Total Downtime Cost formula gives a better understanding of the long-term consequences of downtime.
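To make the formula concrete, here is a minimal sketch that adds up the four components for one hour of downtime; the function name and all input figures are hypothetical and should be replaced with your own business data.

```python
# Minimal sketch of the downtime-cost formula above.
# All input figures are hypothetical examples, not data from the tutorial.

def downtime_cost_per_hour(lost_revenue, lost_productivity,
                           recovery_cost, intangible_cost):
    """Cost of Downtime (/Hr) = Lost Revenue + Lost Productivity
                               + Recovery Cost + Intangible Cost"""
    return lost_revenue + lost_productivity + recovery_cost + intangible_cost

# Hypothetical hourly figures in USD for a small ISP
cost = downtime_cost_per_hour(
    lost_revenue=120_000,      # billable services not delivered
    lost_productivity=40_000,  # staff idle or firefighting
    recovery_cost=15_000,      # repair services, replacement parts, data recovery
    intangible_cost=25_000,    # reputation / brand damage, customer churn
)
print(f"Estimated cost of downtime: USD {cost:,.0f} per hour")
```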
Cost of Down Time : Ponemon Institute
o Average cost of data centre downtime across industries in 2018 was approximately USD
9,000 per minute for an average of 2,500 m² of white space. (It was USD 5,600 in 2010.)
o Average reported incident length was 86 minutes in 2018, resulting in average cost per
incident of approximately USD 790,000 (In 2010 it was 97 minutes at USD 505,500)
o For a total data centre outage, which had an average recovery time of 119 minutes in 2018,
average costs were USD 1,901,500 (In 2010, it was 134 minutes at USD 680,700)
o For a partial data centre outage, which averaged 56 minutes in length in 2018, average
costs were USD 750,400 (In 2010, it was 59 minutes at approximately USD 258,000)
o In 2018, the majority of survey respondents (91%) reported having experienced an
unplanned data centre outage in the past 24 months, a slight decrease from the 95% of
respondents who reported unplanned outages in the 2010 study.
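As a quick sanity check on the figures above, the per-incident cost is roughly the per-minute cost multiplied by the incident length; the short sketch below does that arithmetic for the 2018 averages (survey averages will not multiply out exactly).

```python
# Rough cross-check of the Ponemon averages quoted above:
# per-incident cost is approximately per-minute cost x incident length.
per_minute_cost_usd = 9_000     # 2018 average cost per minute
avg_incident_minutes = 86       # 2018 average incident length

estimate = per_minute_cost_usd * avg_incident_minutes
print(f"Estimated 2018 cost per incident: USD {estimate:,}")   # ~774,000
print("Reported 2018 average cost per incident: USD ~790,000")
```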
Example of Down Time : 2017
In May 2017, over 75,000 people had their three-day weekend plans stalled when British
Airways suffered a massive data centre failure. Along with the chaos of lost luggage,
broken trust and frustration over cancelled and delayed flights, the UK's largest
airline also had to absorb a loss of over $75 million. The outage was caused by a single
engineer who disconnected and reconnected the power supply in a disorganized fashion,
triggering a power surge that disrupted the operations of the entire BA infrastructure.
Even the data centre behemoths aren't immune to outages, including Amazon Web
Services (AWS), which is considered the world's biggest cloud provider and is home to
some of the biggest names on the Internet, such as Slack, Quora, Netflix and Airbnb. In 2017,
a mistyped command entered by an AWS engineer caused many sites to shut down for
several hours, prompting a loss of over $150 million.
A similar incident echoed a year later at one of AWS's biggest rivals. In September 2018, a
lightning strike caused Microsoft Azure to suffer an outage: the resulting voltage surge
damaged hardware, including network devices and storage units.
Example of Down Time : Google Cloud
In July 2018, Google Cloud suffered an outage due to a bug in a new security feature added
to the Google Front End (GFE) architecture layer. The bug had not been identified earlier
despite the extensive testing procedures in place, and was triggered only when the
configuration changes were introduced in the production environment. The affected services
included Google App Engine, Stackdriver, Dialogflow and the Global Load Balancers.
Customers including Spotify, Discord, Pokémon Go and Snapchat rely on these cloud
networking services to reach a global audience, so the impact cascaded globally. The outage
lasted around 30 minutes, and up to 87 percent of customers experienced some form of error
on the App Engine, HTTPS Load Balancer or TCP/SSL Proxy Load Balancer solutions.
The affected customers were given credit refunds as per the Service Level Agreement
(SLA), the usual compensation offered by cloud vendors. However, the true cost of data
centre downtime, which averaged around $750,000 per incident as of 2015 according to a
Ponemon Institute research report, far outweighed the offered compensation.
Example of Down Time : CenturyLink
The CenturyLink incident was the headline network outage of 2018: it left millions
of users without the ability to call 911, make ATM withdrawals, access sensitive patient
healthcare records, use Verizon mobile data services or even run lottery drawings. The
incident later led to an FCC investigation, given the "unacceptable" downtime
affecting emergency services such as 911 as well as ATM withdrawals. The outage lasted
two days and was caused by an issue with a single network management card.
The device was found to be transmitting invalid frame packets across the
infrastructure. Despite the multiple layers of redundancy in place, the issue cascaded
across CenturyLink's nationwide communication infrastructure. Once the infrastructure
systems crashed, CenturyLink had limited visibility into its network to troubleshoot the issue.
Example of Down Time : 2019
Outages like these are not anomalies. The rate and severity of these incidents grew
significantly over the last year (2018), according to the Uptime Institute's Global
Data Centre Survey. What is surprising is that 80% of the survey respondents believed
that their most recent outage was preventable.
In March 2019, the Nordic metals firm Norsk Hydro suffered a ransomware attack,
LockerGoga, that shut down its global operations and left its 35,000 employees around
the world unable to carry on with their work. At this point, the company is still working
to calculate the financial impact of the attack, including lost wages, lost productivity
and the drop in its stock price.
Uptime Institute Industry Survey, 2018
Complexity of DC even without ICT
Risk Factors for Data Centre
Prominent Cause of Downtime/Failure
Summary
Rated-4 and Tier-IV Design
'Fault Tolerant' is the philosophy behind Rated-4 and Tier-IV conformity. It requires that
every capacity component in either distribution path can carry the full-load operation of
the facility. Hence, both the capacity components and the distribution paths can tolerate a
fault anywhere in the system, even while the facility is undergoing planned down-time /
maintenance, without disrupting ICT capabilities to the end user.
It applies to all active and passive components of the MEP infrastructure; Architecture-Civil,
Fire Suppression and Safety-Security provisions are out of this scope. Software tools for
remote operation are required.
Furthermore, it requires that each distribution path for power, cooling and ICT be physically
separated. Specifically, the transformer, generator, UPS, battery, chiller plant, carrier room /
meet-me room and rack space should remain 2 (two) hour fire-separated from each other.
Additionally, no sharing of PDU, fire suppression or cooling is allowed, nor is manual
fail-over switching of electrical power.
Hybrid Topology : Requirement
# Site Selection Requirement
1 Ground floor should be high enough to sustain any flash flood, based on 50 (fifty) years of flood history
2 Distance from airport should be 8 km / 5 miles
3 Distance from rail station should be 0.8 km / 0.5 miles
4 Within 3,050 m / 10,000 feet of sea level
5 Capability to handle seismic activity based on 'Zone' requirement
6 Away from chemical plants, power generation plants and establishments which could be categorized as 'Potential Target of Attack'
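The checklist above can also be applied programmatically when comparing candidate sites. Below is a minimal, hypothetical screening sketch; the field names and the candidate's values are assumptions for illustration, and the distances are read here as minimum required distances.

```python
# Hypothetical site-screening sketch based on the checklist above.
# Thresholds mirror the table; the candidate values are made up.

candidate = {
    "ground_floor_above_50yr_flood": True,
    "distance_to_airport_km": 9.5,           # requirement: at least 8 km
    "distance_to_rail_km": 1.2,              # requirement: at least 0.8 km
    "elevation_m": 45,                       # requirement: at most 3,050 m above sea level
    "meets_seismic_zone_code": True,
    "away_from_high_risk_neighbours": True,  # chemical / power plants, likely targets
}

checks = [
    candidate["ground_floor_above_50yr_flood"],
    candidate["distance_to_airport_km"] >= 8,
    candidate["distance_to_rail_km"] >= 0.8,
    candidate["elevation_m"] <= 3050,
    candidate["meets_seismic_zone_code"],
    candidate["away_from_high_risk_neighbours"],
]
print("Site passes screening" if all(checks) else "Site fails screening")
```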
Cooling Technology for Network Node
[Figure: Cooling technology options for a network node: comfort AC, VRF AC, precision AC, DX (± DF), air-cooled and water-cooled chilled-water plants (ACWC / WCWC ± DF), natural free cooling and immersion cooling, mapped against power per rack and capacity ranges (1 – 20 TR through 400+ TR, 30 – 130 kW) within the power → heat → cooling cycle.]
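The capacity ranges in the figure mix tons of refrigeration (TR) with kW. As a unit reminder, 1 TR is roughly 3.517 kW of heat removal; the small sketch below converts the range boundaries.

```python
# Unit reminder: 1 ton of refrigeration (TR) ≈ 3.517 kW of heat removal.
TR_TO_KW = 3.517

for tr in (20, 100, 200, 400):   # range boundaries from the figure above
    print(f"{tr:>4} TR ≈ {tr * TR_TO_KW:,.0f} kW of cooling")
```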
Plenum Cooling Technology
Session 3
SUPS : Static UPS + Battery Bank + Battery Room PAC + PFI + Many Distribution Panels + AVR +
Generator + UPS and Gen-Set Synchronizers + Phase Plotter
Cross-Link Configuration
Power System Configuration
Tier-IV Certification with N+1 Setup [Assuring 'N' after any Failure]
1. Static UPS : Not Applicable
2. Rotary UPS (LV only) : Applicable only with Isolated Parallel (IP Bus) Configuration
Session 4
Do Not Forget to Add the Weight of the Raised Floor to Building Floor Load
Signal Reference Grid and Grounding
Raised Floor Guidelines
1. Height : 300 mm – 1 m
2. Ramp Slope = 1 : 12
3. Ramp Width : 600 mm
4. Aisle Width : 600 mm
5. Wheelchair Path : 1 m
6. Hand Rail beside Ramp
7. No Plumbing (Optional)
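A 1 : 12 ramp slope means 12 units of horizontal run for every unit of rise, so the ramp length follows directly from the raised-floor height. A minimal sketch for the height range given above:

```python
# Ramp length for a 1 : 12 slope: run = 12 x rise.
SLOPE_RUN_PER_RISE = 12

for floor_height_mm in (300, 600, 1000):   # height range from the guidelines above
    ramp_length_m = floor_height_mm * SLOPE_RUN_PER_RISE / 1000
    print(f"{floor_height_mm} mm raised floor -> {ramp_length_m:.1f} m ramp")
```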
Bonding-Earthing Guidelines
1. Individual Device Bond
2. Serial Bonding : NA
3. IEEE-1100 to Follow
4. Ground < 1 Ohm [9 Hole]
5. Code of AHJ to Follow
TIA-942 Cabling Standard
• ER : Carrier
• MDA : MMR
• HDA : DH-MMR
• ZDA : End of Row
• HDA : Top of Rack
• EDA : Rack
[Figure: Integrated security management with a 24/7/365 operations centre: perimeter security, reception and patrolling; baggage scan, metal detector and man trap; rodent repellent and pest control; fire detection and suppression system; access control and surveillance system.]
Access Control & Surveillance System
Layer 1 : Perimeter
Layer 2 : Clear Zone and Parking
Layer 3 : Façade & Reception
Layer 4 : Turnstile / Man Trap
Layer 5 : Hall Way / Gray Space
Layer 6 : White Space / Cage / Containment
Layer 7 : Rack Doors [Front & Back]
Layer 4 : Tailgating and Piggybacking
Data Centre Infrastructure Management
Fire Triangle
Fire Suppressant Gas : Modus Operandi
There are four means by which suppressant agents extinguish a fire; each acts on one element of the "fire tetrahedron": fuel, heat, oxygen or the chemical chain reaction.
[Table: Comparison of five suppressant agents by flooding time (10 – 60 seconds or less), whether discharge into the fire zone is possible, working principle (oxygen reduction or heat absorption), NOAEL and design concentration, status in data centres (banned, partially banned or in use), and look / smell.]
Fire Suppression System : Selection
Iris / Finger Print + PIN + Access ID; NAF S-125 / Water Mist + VESDA
Technology Selection and Options
Continuous Bus Way + Cast Resin BBT; Slab Floor, Chimney Return
UPS and Electrical Efficiency + Cooling Efficiency → Improvement in PUE
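PUE itself is not defined in the slides; as a reminder, PUE = total facility energy ÷ IT equipment energy, so gains in UPS, electrical and cooling efficiency show up directly as a lower ratio. A minimal sketch with hypothetical load figures:

```python
# PUE = total facility energy / IT equipment energy (hypothetical figures).
def pue(it_load_kw, cooling_kw, electrical_losses_kw, other_kw):
    total = it_load_kw + cooling_kw + electrical_losses_kw + other_kw
    return total / it_load_kw

before = pue(it_load_kw=500, cooling_kw=300, electrical_losses_kw=75, other_kw=25)  # 1.80
after  = pue(it_load_kw=500, cooling_kw=180, electrical_losses_kw=50, other_kw=20)  # 1.50
print(f"PUE before: {before:.2f}, after efficiency improvements: {after:.2f}")
```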
Quiz : Tier Topology Misconception
Question, Comment, Feedback, Advice