NIA - Chapter 13
Another group of machines in your organization is the server fleet. This population of
machines is used to provide application services, web services, databases, batch
computation, back-office processing, and so on.
A single machine might provide one service or many. For example, a single server might be
a dedicated file server, or it might be a file server, a DNS server, and a wiki server while
performing many other functions. Alternatively, some services might be too large to fit on a
single machine. Instead, the work of the service may be distributed over many machines.
For example, Google’s Gmail service is distributed over thousands of machines, each
doing a small fraction of the work.
By definition a server has dependents, usually many. A single server may have hundreds of
clients relying on it. A web server may have thousands of users depending on it. Contrast
this to a laptop or desktop, which generally has just a single user.
In a model that has one big server, or just a few, any investment made in a server is
amortized over all the dependents or users. We justify any additional expense to improve
performance or increase reliability by looking at the dependents’ needs. In this model
servers are also expected to last many years, so paying for spare capacity or expandability
is an investment in extending its life span.
This chapter’s advice applies to a typical enterprise organization that mostly purchases
off-the-shelf hardware and software, with a limited number of home-grown applications.
Volume 2 of this book series is more specifically geared toward web-based applications
and large clusters of machines.
There are many strategies for providing server resources. Most organizations use a mix of
these strategies.
• All eggs in one basket: One machine used for many purposes
• Beautiful snowflakes: A separate machine for each service, each configured differently
• Buy in bulk, allocate fractions: Large machines partitioned into many smaller virtual
machines using virtualization or containers
• Grid computing: Many identical machines managed as a single unit
• Blade servers: A hardware architecture that places many machines in one chassis
• Cloud-based compute services: Capacity rented on someone else’s infrastructure
• Software as a service (SaaS): Web-based applications hosted by a provider
• Server appliances: Preconfigured devices dedicated to a single task
What Is a Server?
The term “server” is overloaded and ambiguous. It means different things in different
contexts. The phrase “web server” might refer to a host being used to provide a web site (a
machine) or the software that implements the HTTP protocol (Apache HTTPD).
Unless specified, this book uses the term “server” to mean a machine. It refers to a
“service” as the entire hardware/software combination that provides the service users
receive. For example, an email server (the hardware) runs MS Exchange (the software) to
provide the email service for the department.
Note that Volume 2 of this book series deals with this ambiguity differently. It uses the
term “server” to mean the server of a client–server software arrangement—for example, an
Apache HTTPD server. When talking about the hardware it uses the word “machine.”
The most basic strategy is to purchase a single server and use it for many services. For
example, a single machine might serve as the department’s DNS server, DHCP server,
email server, and web server. This is the “all eggs in one basket” approach. We don’t
recommend this approach. Nevertheless, we see it often enough that we felt we should
explain how to make the best of it.
If you are going to put all your eggs in one basket, make sure you have a really, really, really
strong basket. Any hardware problems the machine has will affect many services. Buy
top-of-the-line hardware and a model that has plenty of expansion slots. You’ll want this
machine to last a long time.
In this setting it becomes critical to ensure data integrity. Hard disks fail. Expecting them
to never fail is irrational. Therefore RAID should be used so that a single disk can fail and
the system will continue to run. Our general rule of thumb is to use a 2-disk mirror (RAID 1)
for the boot disk. Additional storage should be RAID 5, 6, or 10 as needed. RAID 0 offers
zero data integrity protection (the “zero” in the name reflects this). Obviously, different
applications demand different configurations but this fits most situations. Putting plain
disks without RAID 1 or higher on a server like this is asking for trouble. Any disk
malfunction will result in a bad day as you restore from backups and deal with any data
loss. When RAID was new, it was expensive and rare. Now it is inexpensive and should be
the default for all systems that need reliability.
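To make the capacity trade-offs concrete, here is a short sketch that computes approximate usable capacity for the RAID levels mentioned above. The disk counts and sizes are made-up examples, not recommendations.

```python
# Rough usable-capacity arithmetic for common RAID levels.
# Disk counts and sizes below are illustrative examples only.

def usable_tb(level: str, disks: int, size_tb: float) -> float:
    """Return approximate usable capacity, ignoring formatting overhead."""
    if level == "RAID0":
        return disks * size_tb          # striping: full capacity, zero protection
    if level == "RAID1":
        return size_tb                  # mirror: capacity of a single disk
    if level == "RAID5":
        return (disks - 1) * size_tb    # one disk's worth of parity
    if level == "RAID6":
        return (disks - 2) * size_tb    # two disks' worth of parity
    if level == "RAID10":
        return (disks // 2) * size_tb   # striped mirrors: half the raw capacity
    raise ValueError(f"unknown RAID level: {level}")

for level, disks in [("RAID1", 2), ("RAID5", 4), ("RAID6", 6), ("RAID10", 4)]:
    print(f"{level}: {disks} x 4 TB -> {usable_tb(level, disks, 4.0):.0f} TB usable")
```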
Data integrity also requires regular data backups to tape or another system. These
backups should be tested periodically. As explained in Section 43.3.3, RAID is not a
backup strategy. If this system dies in a fire, RAID will not protect the data.
If a server is expected to last a long time, it will likely need to be expanded. Over time most
systems will need to handle more users, more data, or more applications. Additional slots
for memory, interfaces, and hard drive bays make it easy to upgrade the system without
replacing it. Upgradable hardware is more expensive than fixed-configuration equivalents,
but that extra expense is an insurance policy against more costly upgrades later. It is much
less expensive to extend the life of a machine via an incremental upgrade such as adding
RAM than to do a wholesale replacement of one machine with another.
Forklift Upgrades
The term forklift upgrade is industry slang for a wholesale replacement. In such a
situation you are removing one machine with a metaphorical forklift and dropping a
replacement in its place.
Another problem with the “one basket” strategy is that it makes OS upgrades very difficult
and risky. Upgrading the operating system is an all-or-nothing endeavor. The OS cannot be
upgraded or patched until all the services are verified to work on the new version. One
application may hold back all the others.
Upgrading individual applications also becomes more perilous. For example, what if
upgrading one application installs the newest version of a shared library, but another
application works only with the older version? Now you have one application that isn’t
working and the only way to fix it is to downgrade it, which will make the other application
fail. This Catch-22 is known as dependency hell.
Often we do not discover these incompatibilities until after the newer software is installed
or the operating system is upgraded. Some technologies can be applied to help in this
situation. For example, some disk storage systems have the ability to take a snapshot of
the disk. If an upgrade reveals incompatibility or other problems, we can revert to a
previous snapshot. RAID 1 mirrors and file systems such as ZFS and Btrfs have
snapshotting capabilities.
Upgrading a single machine with many applications is complex and is the focus of Chapter
33, “Server Upgrades.”
The more applications a big server is running, the more difficult it becomes to schedule
downtime for hardware upgrades. Very large systems permit hardware upgrades to be
performed while they remain running, but this is rare for components such as RAM and
CPUs. We can delay the need for downtime by buying extra capacity at the start, but this
usually leads to more dependencies on the machine, which in turn exacerbates the
scheduling problem we originally tried to solve.
A better strategy is to use a separate machine for each service. In this model we purchase
servers as they are needed, ordering the exact model and configuration that is right for the
application.
Each machine is sized for the desired application: RAM, disk, number and speeds of NICs,
and enough extra capacity, or expansion slots, for projected growth during the expected
life of the machine. Vendors can compete to provide their best machine that meets these
specifications.
The benefit of this strategy is that the machine is the best possible choice that meets the
requirements. The downside is that the result is a fleet of unique machines. Each is a
beautiful, special little snowflake.
While snowflakes are beautiful, nobody enjoys a blizzard. Each new system adds
administrative overhead proportionally. For example, it would be a considerable burden if
each new server required learning an entirely new RAID storage subsystem. Each one
would require learning how to configure it, replace disks, upgrade the firmware, and so on.
If, instead, the IT organization standardized on a particular RAID product, each new
machine would simply benefit from what was learned earlier.
A variety of inventory applications are available, many of which are free or low cost. Having
even a simple inventory system is better than none. A spreadsheet is better than keeping
this information in your head. A database is better than a spreadsheet.
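As a minimal illustration of the “database is better than a spreadsheet” point, the sketch below keeps a tiny server inventory in SQLite using only Python’s standard library. The columns and sample row are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

# Minimal server inventory: a step up from a spreadsheet.
# The columns here are illustrative; record whatever your site needs.
conn = sqlite3.connect("inventory.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS servers (
        hostname TEXT PRIMARY KEY,
        owner    TEXT,
        purpose  TEXT,
        model    TEXT,
        os       TEXT,
        location TEXT
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO servers VALUES (?, ?, ?, ?, ?, ?)",
    ("web01", "webteam@example.com", "department web server",
     "Vendor X 1U", "Debian 12", "rack 4, slot 12"),
)
conn.commit()

for row in conn.execute("SELECT hostname, owner, purpose FROM servers"):
    print(row)
```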
Always be on the lookout for opportunities to reduce the number of variations in platforms
or technologies being supported. Discourage gratuitous variations by taking advantage of
the fact that people lean toward defaults. Make right easy: Make sure that the lazy path is
the path you want people to take. Select a default hardware vendor, model, and operating
system and make it super easy to order. Provide automated OS installation, configuration,
and updates. Provide a wiki with information about recommended models and options,
sales contact information, and assistance. We applaud you for being able to keep all that
information in your head, but putting it on a wiki helps the entire organization stick to
standards. That helps you in the long term.
The techniques discussed in Part II, “Workstation Fleet Management,” for managing
variation also apply to servers. In particular, adopt a policy of supporting a limited number
of generations of hardware. When a new generation of hardware is introduced, the oldest
generation is eliminated. This practice also works for limiting the number of OS revisions in
use.
Google’s SysOps (internal IT) organization had a policy of supporting at most two Linux
releases at any given time, and a limited number of hardware generations. For a new OS
release or hardware model to be introduced into the ecosystem, the oldest one had to be
eliminated.
Eliminating the old generation didn’t happen automatically. People were not continuing to
use it because of laziness; there was always some deep technical reason, dependency,
lack of resources, or political issue preventing the change. Finding and eliminating the
stragglers took focused effort.
To accelerate the process, a project manager and a few SAs would volunteer to form a
team that would work to eliminate the stragglers. These teams were called “death
squads,” an insensitive name that could have been invented only by people too young to
remember the politics of Central America in the late twentieth century.
The team would start by publishing a schedule: the date the replacement was available,
the date the old systems would no longer receive updates, and the date the old systems
would be disconnected from the network. There were no exceptions. Management had
little sympathy for whining. Employees sometimes complained the first time they
experienced this forced upgrade process but soon they learned that restricting
technologies to a small number of generations was a major part of the Google operational
way and that services had to be built with the expectation of being easy to upgrade or
rebuild.
The team would identify machines that still used the deprecated technology. This would
be published as a shared spreadsheet, viewable and editable by all. Columns would be
filled in to indicate the status: none, owner contacted, owner responded, work in progress,
work complete. As items were complete, they would be marked green. The system was
very transparent. Everyone in the company could see the list of machines, who had
stepped up to take ownership of each, and their status. It was also very visible if a team
was not taking action.
The death squad would work with all the teams involved and offer support and assistance.
Some teams already had plans in place. Others needed a helping hand. Some needed
cajoling and pressure from management. The death squad would collaborate and assist
until the older technology was eliminated.
This system was used often and was a key tactic in preventing Google’s very large fleet
from devolving into a chaotic mess of costly snowflakes.
While it sounds efficient to customize each machine to the exact needs of the service it
provides, the result tends to be an unmanageable mess.
The one-off hardware might be rationalized by the customer stating that spares aren’t
required, that firmware upgrades can be skipped, or that the machine will be self-
supported by the customer. These answers are nice in theory, but usually lead to even
more work for the IT department. When there is a hardware problem, the IT department
will be blamed for any difficulties. Sadly the IT department must usually say no, which
makes this group look inflexible.
During a software upgrade the system stopped working. The professor was unable to fix
the problem, and other professors who had come to rely on the system were dead in the
water.
After a month of downtime, the other professors complained to the department chair, who
complained to the IT department. While the department chair understood the agreement
in place, he’d hear none of it. This was an emergency and IT’s support was needed. The IT
group knew that once they got involved, any data loss would be blamed on them. Their
lack of experience with the device increased that risk.
The end of this story is somewhat anticlimactic. While the IT group was establishing an
action plan in cooperation with the vendor, the professor went rogue and booted the
storage server with a rescue disk, wiped and reinstalled the system, and got it working
again. Only then did he concede that there was no irreplaceable data on the system.
Unfortunately, this experience taught the professor that he could ignore IT’s hardware
guidance since he’d get support when he needed it, and that he didn’t really need it since
he fixed the problem himself.
The other faculty who had come to depend on this professor’s system realized the value of
the support provided by the IT group. While the previous system was “free,” losing access
for a month was an unacceptable loss of research time. The potential for catastrophic
data loss was not worth the savings.
What the IT department learned was that there was no such thing as self-support. In the
future, they developed ways to be involved earlier on in purchasing processes so that they
could suggest alternatives before decisions had been made in the minds of their
customers.
The next strategy is to buy computing resources in bulk and allocate fractions of it as
needed.
One way to do this is through virtualization. That is, an organization purchases large
physical servers and divides them up for use by customers by creating individual virtual
machines (VMs). A virtualization cluster can grow by adding more physical hardware as
more capacity is needed.
VMs can also be resized. You can add RAM, vCPUs, and disk space to a VM via an API call
instead of a visit to the datacenter. If customers request more memory and you add it
using a management app on your iPhone while sitting on the beach, they will think you are
doing some kind of magic.
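The exact call depends on your virtualization platform. The sketch below assumes a hypothetical REST management API (the URL, token, and field names are placeholders) purely to illustrate that resizing a VM is an API request rather than a trip to the datacenter.

```python
import requests

# Hypothetical management API: endpoint, token, and payload fields are
# placeholders. Consult your platform's actual API documentation.
API = "https://vmm.example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}

def resize_vm(vm_id: str, ram_gb: int, vcpus: int) -> None:
    """Request more RAM and vCPUs for an existing VM."""
    resp = requests.patch(
        f"{API}/vms/{vm_id}",
        headers=HEADERS,
        json={"ram_gb": ram_gb, "vcpus": vcpus},
        timeout=30,
    )
    resp.raise_for_status()

resize_vm("vm-1234", ram_gb=32, vcpus=8)
```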
Virtualization improves computing efficiency. Physical machines today are so powerful
that applications often do not need the full resources of a single machine. The excess
capacity is called stranded capacity because it is unusable in its current form. Sharing a
large physical machine’s power among many smaller virtual machines helps reduce
stranded capacity, without getting into the “all eggs in one basket” trap.
Stranded capacity could also be mitigated by running multiple services on the same
machine. However, virtualization provides better isolation than simple multitasking. The
benefits of isolation include
• Independence: Each VM can run a different operating system. On a single physical host
there could be a mix of VMs running a variety of Microsoft Windows releases, Linux
releases, and so on.
• Resource isolation: The disk and RAM allocated to a VM are committed to that VM and
not shared. Processes running on one VM can’t access the resources of another VM. In
fact, programs running on a VM have little or no awareness that they are running on VMs,
sharing a larger physical machine.
• Granular security: A person with root access on one VM does not automatically have
privileged access on another VM. Suppose you had five services, each run by a different
team. If each service was on its own VM, each team could have administrator or root
access for its VM without affecting the security of the other VMs. If all five services were
running on one machine, anyone needing root or administrator access would have
privileged access for all five services.
• Reduced dependency hell: Each machine has its own operating system and system
libraries, so they can be upgraded independently.
13.3.1 VM Management
As with the other strategies, keeping a good inventory of VMs is important. In fact, it is more
important since you can’t walk into a datacenter and point at the machine. The cluster
management software will keep an inventory of which VMs exist, but you need to maintain
an inventory of who owns each VM and its purpose.
Some clusters are tightly controlled, only permitting the IT team to create VMs with the
care and planning reminiscent of the laborious process previously used for physical
servers. Other clusters are general-purpose compute farms providing the ability for
customers to request new machines on demand. In this case, it is important to provide a
self-service way for customers to create new machines. Since the process can be fully
automated via the API, not providing a self-service portal or command-line tool for
creating VMs simply creates more work for you and delays VM creation until you are
available to process the request. Being able to receive a new machine when the SAs are
unavailable or busy reduces the friction in getting the compute resources customers
need. In addition to creating VMs, users should be able to reboot and delete their own
VMs.
There should be limits in place so that customers can’t overload the system by creating
too many VMs. Typically limits are based on existing resources, daily limits, or per-
department allocations.
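The quota check behind such a portal can be very simple. The following sketch, with hypothetical limits and in-memory data, shows the kind of per-department allocation test a request handler might perform before creating a VM.

```python
# Hypothetical per-department quota check for a self-service VM portal.
# Limits and current usage would normally come from the inventory database.
QUOTA = {"engineering": 200, "finance": 50, "marketing": 25}   # max VMs allowed
IN_USE = {"engineering": 187, "finance": 12, "marketing": 25}  # VMs already created

def may_create(department: str, count: int = 1) -> bool:
    """Return True if the department is still within its VM allocation."""
    limit = QUOTA.get(department, 0)
    used = IN_USE.get(department, 0)
    return used + count <= limit

print(may_create("engineering", 5))   # True: 192 <= 200
print(may_create("marketing"))        # False: allocation exhausted
```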
Users can become confused if you permit them to select any amount of disk space and
RAM. They often do not know what is required or reasonable. One strategy is to simply
offer reasonable defaults for each OS type. Another strategy is to offer a few options:
small, medium, large, and custom.
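One way to present those options is as a short menu of named flavors. The resource numbers below are illustrative defaults, not recommendations; adjust them to your own hardware and workloads.

```python
# Illustrative flavor menu for a self-service portal; numbers are placeholders.
FLAVORS = {
    "small":  {"vcpus": 2, "ram_gb": 4,  "disk_gb": 50},
    "medium": {"vcpus": 4, "ram_gb": 8,  "disk_gb": 100},
    "large":  {"vcpus": 8, "ram_gb": 16, "disk_gb": 200},
}

def describe(flavor: str) -> str:
    spec = FLAVORS[flavor]
    return (f"{flavor}: {spec['vcpus']} vCPU, "
            f"{spec['ram_gb']} GB RAM, {spec['disk_gb']} GB disk")

for name in FLAVORS:
    print(describe(name))
```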
As in the other strategies, it is important to limit the amount of variation. Apply the
techniques described previously in this chapter and in Part II, “Workstation Fleet
Management.” In particular, adopt a policy of supporting a limited number of OS versions
at any given time. For a new version to be introduced, the oldest generation is eliminated.
Most virtual machine cluster management systems permit live migration of VMs, which
means a VM can be moved from one physical host to another while it is running. Aside
from a brief performance reduction during the transition, the users of the VM do not even
know they’re being moved.
Live migration makes management easier. It can be used to rebalance a cluster, moving
VMs off overloaded physical machines to others that are less loaded. It also lets you work
around hardware problems. If a physical machine is having a hardware problem, its VMs
can be evacuated to another physical machine. The owners of the VM can be blissfully
unaware of the problem. They simply benefit from the excellent uptime.
The architecture of a typical virtualization cluster includes many physical machines that
share a SAN for storage of the VM’s disks. By having the storage external to any particular
machine, the VMs can be easily migrated between physical machines.
Shared storage is depicted in Figure 13.1. The VMs run on the VM servers and the disk
volumes they use are stored on the SAN. The VM servers have little disk space of their
own.
Figure 13.1: VM Servers A, B, and C are connected by a network to a Storage Area Network (SAN). The SAN holds the disk volumes for VM1 through VM11; VM Server A runs VM1 through VM7, VM Server B runs VM8 and VM9, and VM Server C runs VM10 and VM11.
In Figure 13.2 we see VM7 is being migrated from VM Server A to VM Server B. Because
VM7’s disk image is stored on the SAN, it is accessible from both VM servers as the
migration process proceeds. The migration process simply needs to copy the RAM and
CPU state between A and B, which is a quick operation. If the entire disk image had to be
copied as well, the migration could take hours.
Figure 13.2: The same cluster as in Figure 13.1, with VM7 being migrated from VM Server A to VM Server B. VM7’s disk image remains on the SAN, reachable from both servers, so only the VM’s RAM and CPU state move during the migration.
13.3.3 VM Packing
While VMs can reduce the amount of stranded compute capacity, they do not eliminate it.
VMs cannot span physical machines. As a consequence, we often get into situations
where the remaining RAM on a physical machine is not enough for a new VM.
The best way to avoid this is to create VMs that are standard sizes that pack nicely. For
example, we might define the small configuration such that exactly eight VMs fit on a
physical machine with no remaining stranded space. We might define the medium
configuration to fit four VMs per physical machine, and the large configuration to fit two
VMs per physical machine. In other words, the sizes are 1/8, 1/4, and 1/2. Since the
denominators are powers of 2, combinations of small, medium, and large configurations
will pack nicely.
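To make the packing argument concrete, here is a small first-fit placement check that assumes the 1/8, 1/4, and 1/2 sizes described above. The host count and VM mix are made-up examples.

```python
from fractions import Fraction

# Flavor sizes as fractions of one physical host (per the sizing above).
SIZE = {"small": Fraction(1, 8), "medium": Fraction(1, 4), "large": Fraction(1, 2)}

def first_fit(vms, hosts):
    """Place each VM on the first host with room; return placements or None."""
    free = [Fraction(1) for _ in range(hosts)]
    placement = []
    for flavor in vms:
        for i, room in enumerate(free):
            if SIZE[flavor] <= room:
                free[i] -= SIZE[flavor]
                placement.append((flavor, i))
                break
        else:
            return None  # no host had enough remaining capacity
    return placement

# Example: 2 large + 2 medium + 4 small pack exactly onto two hosts.
print(first_fit(["large", "large", "medium", "medium"] + ["small"] * 4, hosts=2))
```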
One strategy is to keep one physical machine entirely idle so that, when needed, the VMs
from the machine to be repaired can all be migrated to this machine. This scheme is
depicted in Figure 13.3(a). When physical machine A, B, C, or D needs maintenance, its
VMs are transferred to machine E. Any one machine can be down at a time. It doesn’t
matter which combination of VMs are on the machine, as long as the spare is as large as
any of the other machines.
Figure 13.3: Four arrangements of the same VMs across physical machines A through E: (a) machine E kept entirely idle as a spare, (b) spare capacity distributed around the cluster, (c) spare capacity too fragmented for any single machine to accept the largest VM, and (d) the VMs packed so that any single machine can be evacuated.
Of course, this arrangement means that the spare machine is entirely unused. This seems
like a waste since that capacity could be used to alleviate I/O pressure such as contention
for network bandwidth, disk I/O, or other bottlenecks within the cluster.
Another strategy is to distribute the spare capacity around the cluster so that the
individual VMs can share the extra I/O bandwidth. When a machine needs to be evacuated
for repairs, the VMs are moved into the spare capacity that has been spread around the
cluster.
This scheme is depicted in Figure 13.3(b). If machine A requires maintenance, its VMs can
be moved to B, C, D, and E. Similar steps can be followed for the other machines.
Unfortunately, this causes a new problem: There might not be a single machine with
enough spare capacity to accept a VM that needs to be evacuated from a failing physical
host. This situation is depicted in Figure 13.3(c). If machine A needs maintenance, there is
no single machine with enough capacity to receive the largest of its VMs. VMs cannot
straddle two physical machines. This is not a problem for machines B through E, whose
VMs can be evacuated to the remaining space.
If machine A needed to be evacuated, first we would need to consolidate free space onto a
single physical machine so that the larger VM has a place to move to. This is known as a
Towers of Hanoi situation, since moves must be done recursively to enable other moves.
These additional VM moves make the evacuation process both more complex and longer.
The additional VM migrations affect VMs that would otherwise not have been involved.
They now must suffer the temporary performance reduction that happens during
migration, plus the risk that the migration will fail and require a reboot.
Figure 13.3(d) depicts the same quantity and sizes of VMs packed to permit any single
machine to be evacuated for maintenance.
Some VM cluster management systems include a tool that will calculate the minimum
number of VM moves to evacuate a physical machine. Other systems simply refuse to
create a new VM if it will create a situation where the cluster is no longer N + 1 redundant.
Either way, virtual clusters should be monitored for loss of N + 1 redundancy, so that you
are aware when the situation has happened and it can be corrected proactively.
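A basic N + 1 check can follow the same packing logic: remove each host in turn and see whether its VMs fit in the free space that remains. The sketch below is a simplification that considers RAM only, with a hypothetical three-host cluster; a real check would also consider CPU, disk, and network.

```python
# Simplified N+1 check: can every host's VMs be evacuated to the others?
# Considers RAM only; capacities and VM sizes (in GB) are hypothetical.
cluster = {
    "host-a": {"capacity": 256, "vms": [64, 64, 32]},
    "host-b": {"capacity": 256, "vms": [128, 32]},
    "host-c": {"capacity": 256, "vms": [64, 32, 32]},
}

def can_evacuate(failed: str) -> bool:
    """True if every VM on `failed` fits (largest first) on the other hosts."""
    free = {h: d["capacity"] - sum(d["vms"])
            for h, d in cluster.items() if h != failed}
    for vm in sorted(cluster[failed]["vms"], reverse=True):
        # Pick the host with the most free RAM that can still take this VM.
        target = next((h for h, f in sorted(free.items(), key=lambda x: -x[1])
                       if f >= vm), None)
        if target is None:
            return False
        free[target] -= vm
    return True

for host in cluster:
    print(host, "evacuable" if can_evacuate(host) else "NOT N+1 redundant")
```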
Most sites end up with two entirely different ways to request, allocate, and track VMs and
non-VMs. It can be beneficial to have one system that manages both. Some cluster
management systems will manage a pool of bare-metal machines using the same API as
VMs. Creating a machine simply allocates an unused machine from the pool. Deleting a
machine marks the machine for reuse.
Another way to achieve this is to make everything a VM, even if that means offering an
extra-large size: a VM that fills the entire physical machine (a size of 1/1). While such
machines will have a slight performance reduction due to the VM overhead, unifying all
machine management within one process benefits customers, who now have to learn only
one system, and makes management easier.
13.3.6 Containers
Containers are another virtualization technique. They provide isolation at the process level
instead of the machine level. While a VM is a machine that shares physical hardware with
other VMs, each container is a group of processes that run in isolation on the same
machine. All of the containers run under the same operating system, but each container is
self-contained as far as the files it uses. Therefore there is no dependency hell.
Containers are much lighter weight and permit more services to be packed on fewer
machines. Docker, Mesos, and Kubernetes are popular systems for managing large
numbers of containers. They all use the same container format, which means once a
container is created, it can be used on a desktop, server, or huge farm of servers.
Pros and cons of virtualization and containerization, as well as technical details of how
they work, can be found in Volume 2, Chapter 3, “Selecting a Service Platform,” of this
book series.
One bad situation people get into with containers is a lack of reproducibility. After using
the system for a while, there comes a day when a container needs to be rebuilt from
scratch to upgrade a library or other file. Suddenly the team realizes no one is around who
remembers how the container was made. The way to prevent this is to make the
container’s creation similar to compiling software: A written description of what is to be in
the container is passed through automation that reads the description and outputs the
container. The description is tracked in source code control like any other source code.
Containers should be built using whatever continuous integration (CI) system is used for
building other software in your organization.
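The principle is simply “description in, image out.” As a minimal sketch, the snippet below rebuilds and pushes an image from the build description checked into source control; the image name and directory layout are assumptions about a hypothetical repository, and in practice this step would run inside your CI system rather than by hand.

```python
import subprocess

# Rebuild a container image from the description kept in source control.
# The image name and repository layout below are hypothetical.
IMAGE = "registry.example.com/team/app:latest"
BUILD_CONTEXT = "."   # directory containing the checked-in build description

subprocess.run(["docker", "build", "-t", IMAGE, BUILD_CONTEXT], check=True)
subprocess.run(["docker", "push", IMAGE], check=True)
```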
Grid computing takes many similar machines and manages them as a single unit. For
example, four racks of 40 servers each would form a grid of 160 machines. Each one is
configured exactly alike—same hardware and software.
To use the grid, a customer specifies how many machines are needed and which software
package to run. The grid management system allocates the right number of machines,
installs the software on them, and runs the software. When the computation is done, the
results are uploaded to a repository and the software is de-installed.
A big part of grid management software is the scheduling algorithm. Your job is held until
the requested number of machines is available. When you request many machines, that
day may never come. You would be very disappointed if your job, which was expected to
run for an hour, took a week to start.
To start a big request, one has to stall small requests until enough machines are available,
which wastes resources for everyone. Suppose your job required 100 machines. The
scheduler would stall any new jobs until 100 machines are idle. Suppose at first there are
50 machines idle. An hour later, another job has completed and now there are 75
machines idle. A day later a few more jobs have completed and there are 99 machines
idle. Finally another job completes and there are 100 free machines, enough for your job to
run. During the time leading up to when your job could run, many machines were sitting
idle. In fact, more than 75 compute-days were lost in this example.
The scheduler must be very smart and mix small and big requests to both minimize wait
time and maximize utilization. A simple algorithm is to allocate half the machines for big
jobs. They will stay busy if there is always a big job ready to run. Allocate the other half of
the machines for smaller jobs; many small and medium jobs will pass through those
machines.
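Here is a toy version of that split: a hypothetical grid is partitioned into a big-job pool and a small-job pool, and a job is admitted only if its pool has enough idle machines. The machine counts and the big/small threshold are illustrative assumptions, not a real scheduler.

```python
# Toy admission policy: half the grid for big jobs, half for small ones.
# Machine counts and the big/small threshold are illustrative assumptions.
GRID_SIZE = 160
BIG_POOL = GRID_SIZE // 2          # machines reserved for big jobs
SMALL_POOL = GRID_SIZE - BIG_POOL
BIG_THRESHOLD = 20                 # jobs needing >= this many machines are "big"

idle = {"big": BIG_POOL, "small": SMALL_POOL}

def try_schedule(job_id: str, machines: int) -> bool:
    """Admit the job if its pool has enough idle machines, else queue it."""
    pool = "big" if machines >= BIG_THRESHOLD else "small"
    if idle[pool] >= machines:
        idle[pool] -= machines
        print(f"{job_id}: started on {machines} machines from the {pool} pool")
        return True
    print(f"{job_id}: queued (only {idle[pool]} idle in the {pool} pool)")
    return False

try_schedule("batch-1", 100)   # big job: exceeds the 80-machine big pool, queued
try_schedule("batch-2", 60)    # big job: fits in the big pool
try_schedule("batch-3", 4)     # small job: fits in the small pool
```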
More sophisticated algorithms are invented and tested all the time. At Google the
scheduling algorithms have become so complex that a simulator was invented to make it
possible to experiment with new algorithms without having to put them into production.
Typically grids are very controlled systems. All allocations are done through the grid
management and scheduling system. Each machine runs the same OS configuration.
When entropy is detected, a machine is simply wiped and reloaded.
Because many machines are being purchased, shaving a few dollars off the cost of each
machine pays big dividends. For example, a video card is not needed since these
machines will not be used for interactive computing. For a 1,000-machine cluster, this can
save thousands of dollars. RAID cards, fancy plastic front plates with awesome LED
displays, redundant power supplies, and other add-ons are eliminated to save a few
dollars here and there. This multiplies out to thousands or millions of dollars overall.
Data integrity systems such as RAID cards are generally not installed in grid hardware
because the batch-oriented nature of jobs means that if a machine dies, the batch can be
rerun.
While one wants a grid that is reliable, compute efficiency is more important. Efficiency is
measured in dollars per transaction. Rather than examining the initial purchase price, the
total cost of ownership (TCO) is considered. For example, ARM chips are less expensive
than x86 Intel chips, but they are slower processors. Therefore you need more of them to
do the same job. Is it better to have 1,000 high-powered Intel-based machines or 2,000
ARM-based machines? The math can be fairly complex. The ARM chips use less power but
there are more of them. The amount of power used is related directly to how much cooling
is needed. The additional machines require more space in the datacenter. What if the
additional machines would result in a need to build more datacenters? TCO involves
taking all of these factors into account during the evaluation.
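A back-of-the-envelope TCO comparison might be shaped like the sketch below. Every number in it (prices, wattage, electricity cost, machine counts) is a made-up placeholder used to show the form of the calculation, not real data about any processor or vendor.

```python
# Back-of-the-envelope TCO comparison. All numbers are illustrative
# placeholders, not real prices or power figures.
YEARS = 3
HOURS = YEARS * 365 * 24
POWER_COST_PER_KWH = 0.12          # dollars per kWh (placeholder)
COOLING_OVERHEAD = 1.5             # each watt of compute needs 0.5 W of cooling

def tco(machines: int, unit_price: float, watts_per_machine: float) -> float:
    """Purchase price plus power and cooling over the machine's lifetime."""
    capex = machines * unit_price
    energy_kwh = machines * watts_per_machine / 1000 * HOURS * COOLING_OVERHEAD
    opex = energy_kwh * POWER_COST_PER_KWH
    return capex + opex

option_a = tco(machines=1000, unit_price=4000, watts_per_machine=350)  # fewer, faster
option_b = tco(machines=2000, unit_price=1500, watts_per_machine=150)  # more, slower

print(f"Option A (1,000 machines): ${option_a:,.0f}")
print(f"Option B (2,000 machines): ${option_b:,.0f}")
```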
Grid computing has other constraints. Often there is more bandwidth between machines
on the same rack, and less bandwidth between racks. Therefore some jobs will execute
faster if rack locality (putting all related processes in one rack) can be achieved.
Grid computing is more efficient than virtualization because it eliminates the virtualization
overhead, which is typically a 5 to 10 percent reduction in performance.
Grids are easier to manage because what is done for one machine is done for all
machines. They are fungible units—each one can substitute for the others. If one machine
dies, the scheduler can replace it with another machine in the grid.
Like many web sites, Yahoo! builds mammoth clusters of low-cost 1U PC servers. Racks
are packed with as many servers as possible, with dozens or hundreds configured to
provide each service required. Yahoo! once reported that when a unit died, it was more
economical to power it off and leave it in the rack rather than repair the unit. Removing
dead units might accidentally cause an outage if other cables were loosened in the
process. Eventually the machine would be reaped and many dead machines would be
repaired en masse.
A blade server has many individual slots that take motherboards, called blades, that
contain either a computer or storage. Each blade can be installed quickly and easily
because you simply slide a card into a slot. There are no cables to connect; the blade’s
connector conveys power, networking, and I/O connectivity. Additional capacity is added
by installing more blades, or replacing older blades with newer, higher-capacity models.
Blades can be used to implement many of the strategies listed previously in this chapter.
Computers within the blade chassis can be allocated individually, or used to create a
virtualization cluster, or used to create a grid.
Another benefit of blade systems is that they are software configurable. Assigning a disk to
a particular computer or a computer to a particular network is done through a
management console or API. Does a particular blade computer need an additional hard
drive? A few clicks and it is connected to storage from another blade. No trip to the
datacenter is required.
Blade systems are most cost-effective when the chassis lasts many years, enabling you to
upgrade the blades for the duration of its lifetime. This amortizes the cost of the chassis
over many generations of blades. The worst-case scenario is that shortly after investing in
a chassis, a newer, incompatible chassis model becomes available. When you buy a
blade-based system, you are betting that the chassis will last a long time and that new
blades will be available for its entire lifespan. If you lose the bet, a forklift upgrade can be
very expensive.
Grid Uniformity
A division of a large multinational company was planning on replacing its aging multi-CPU
server with a large grid computing environment, implemented as a farm of blade servers.
The application would be recoded so that instead of using multiple processes on a single
machine, it would use processes spread over the blade farm. Each blade would be one
node of a vast compute farm to which jobs could be submitted, with the results
consolidated on a controlling server.
This had wonderful scalability, since a new blade could be added to the farm and be
usable within minutes. No direct user logins were needed, and no SA work would be
needed beyond replacing faulty hardware and managing which blades were assigned to
which applications. To this end, the SAs engineered a tightly locked-down minimal-access
solution that could be deployed in minutes. Hundreds of blades were purchased and
installed, ready to be purposed as the customer required.
The problem came when application developers found themselves unable to manage their
application. They couldn’t debug issues without direct access. They demanded shell
access. They required additional packages. They stored unique state on each machine, so
automated builds were no longer viable. All of a sudden, the SAs found themselves
managing 500 individual servers rather than one unified blade farm. Other divisions had
also signed up for the service and made the same demands.
A number of things could have prevented this problem. Management should have required
more discipline. Once the developers started requesting access, management should
have set limits that would have prevented the system from devolving into hundreds of
custom machines. Deployment of software into production should have been automated
using a continuous integration/continuous deployment (CI/CD) system. If the only way to
put things in production is by automated deployment, then repeatability can be more
easily achieved. There should have been separate development and production
environments, with fewer access restrictions in the development environment. More
attention to detail at the requirements-gathering stage might have foreseen the need for
developer access, which would have led to the requirement for and funding of a
development environment.
Another strategy is to not own any machines at all, but rather to rent capacity on someone
else’s system. Such cloud-based computing lets you benefit from the economies of scale
that large warehouse-size datacenters can provide, without the expense or expertise to
run them. Examples of cloud-based compute services include Amazon AWS, Microsoft
Azure, and Google Compute Engine.
There are three common definitions for the cloud, each coming from different
communities:
• Consumers: When a typical consumer uses the term the cloud, they mean putting their
data on a web-based platform. The primary benefit is that this data becomes accessible
from anywhere. For example, consumers might have all their music stored in the cloud; as
a result their music can be played on any device that has Internet access.
• Business people: Typically business people think of the cloud as some kind of rented
computing infrastructure that is elastic. That is, they can allocate one or thousands of
machines; use them for a day, week, or year; and give them back when they are done. They
like the fact that this infrastructure is a pay-as-you-go and on-demand system. The on-
demand nature is the most exciting because they won’t have to deal with IT departments
that could take months to deliver a single new machine, or simply reject their request.
Now with a credit card, they have a partner that always says yes.
• IT professionals: When all the hype is removed (and there is a lot of hype), cloud
computing comes down to someone else maintaining hardware and networks so that
customers can focus on higher-level abstractions such as the operating system and
applications. It requires software that is built differently and new operational methods. IT
professionals shift from being the experts in how to install and set up computers to being
the experts who understand the full stack and become valued for their architectural
expertise, especially regarding how the underlying infrastructure affects performance and
reliability of the application, and how to improve both.
This book uses a combination of the last two definitions. The on-demand nature of the
cloud benefits us by providing an elastic computing environment: the ability to rapidly and
dynamically grow and shrink. As IT professionals, our role changes to focus on
performance and reliability, and to be the architecture experts.
Over the years computer hardware has grown less expensive due to Moore’s law and other
factors. Conversely, the operation of computers has become increasingly expensive. The
increased operational cost diminishes and often eliminates the cost reductions.
The one place where operational cost has been going down instead of up is in large grids
or clusters. There the entire infrastructure stack of hardware, software, and operations can
be vertically integrated to achieve exceptional cost savings at scale.
Another way to think about this issue is that of all the previously discussed strategies, the
“buy in bulk, allocate fractions” strategy described earlier is generally the most
economical and proves the most flexible. Cloud-based compute services take that
strategy to a larger scale than most companies can achieve on their own, which enables
these smaller companies to take advantage of these economies.
As these cost trends continue, it will become difficult for companies to justify maintaining
their own computers. Cloud computing will go from the exception that few companies are
able to take advantage of, to the default. Companies that must maintain their own
hardware will be an exception and will be at an economic disadvantage.
Adoption of cloud computing is also driven by another cost: opportunity cost. Opportunity
cost is the revenue lost due to missed opportunities. If a company sees an opportunity but
the competition beats them to it, that could be millions of potential dollars lost.
Some companies miss opportunities because they cannot hire people fast enough to staff
a project. Other companies do not even attempt to go after certain opportunities because
they know they will not be able to ramp up the IT resources fast enough to address the
opportunity. Companies have missed multi-million-dollar opportunities because an IT
department spent an extra month to negotiate a slightly lower price from the vendor.
If cloud computing enables a company to spin up new machines in minutes, without any
operations staff, the ability to address opportunities is greatly enhanced. At many
companies it takes months, and sometimes an entire year, from the initial request to
realizing a working server in production. The actual installation of the server may be a few
hours, but it is surrounded by months of budget approvals, bureaucratic approvals,
security reviews, and IT managers who think it is their job to always say “no.”
We are optimistic about cloud computing because it enables new applications that just
can’t be done any other way. Kevin McEntee, vice-president at Netflix, put it succinctly:
“You can’t put a price on nimble.” In his 2012 talk at re:Invent, McEntee gave examples of
how the elastic ability of cloud computing enabled his company to process video in new
ways. He extolled the value of elastic computing’s ability to let Netflix jump on
opportunities that enabled it to be a part of the iPad launch and the Apple TV launch, and
to quickly launch new, large libraries of videos to the public. Netflix has been able to make
business deals that its competition couldn’t. “If we were still building out datacenters in
the old model we would not have been able to jump on those opportunities” (McEntee
2012).
There are legal and technical challenges to putting your data on other people’s hardware.
That said, do not assume that HIPAA compliance and similar requirements automatically
disqualify you from using cloud-based services. Cloud vendors have a variety of compliant
options and ways of managing risk. A teleconference between your legal compliance
department and the provider’s sales team may lead to surprising results.
A related strategy is software as a service (SaaS): relying on web-based applications hosted by a provider rather than running the services yourself. This strategy was impossible until the early 2010s. Since then, ubiquitous fast Internet
connections, HTML5’s ability to create interactive applications, and better security
features have made this possible.
When a company adopts such a strategy, the role of the IT department becomes that of an
IT coordinator and integrator. Rather than running clients and services, someone is
needed to coordinate vendor relationships, introduce new products into the company,
provide training, and be the first stop for support before the provider is contacted directly.
Technical work becomes focused on high-level roles such as software development for
integrating the tools, plus low-level roles such as device support and repair management.
An appliance is a device designed specifically for a particular task. Toasters make toast.
Blenders blend. One could do these things using general-purpose devices, but there are
benefits to using a device designed to do one task very well.
The computer world also has appliances: file server appliances, web server appliances,
email appliances, DNS/DHCP appliances, and so on. The first appliance was the
dedicated network router. Some scoffed, “Who would spend all that money on a device
that just sits there and pushes packets when we can easily add extra interfaces to our VAX
and do the same thing?” It turned out that quite a lot of people would. It became obvious
that a box dedicated to doing a single task, and doing it well, was in many cases more
valuable than a general-purpose computer that could do many tasks. And, heck, it also
meant that you could reboot the VAX without taking down the network for everyone else.
A server appliance brings years of experience together in one box. Architecting a server is
difficult. The physical hardware for a server has all the requirements listed earlier in this
chapter, as well as the system engineering and performance tuning that only a highly
experienced expert can do. The software required to provide a service often involves
assembling various packages, gluing them together, and providing a single, unified
administration system for it all. It’s a lot of work! Appliances do all this for you right out of
the box.
Although a senior SA can engineer a system dedicated to file service or email out of a
general-purpose server, purchasing an appliance can free the SA to focus on other tasks.
Every appliance purchased results in one less system to engineer from scratch, plus
access to vendor support in case of an outage. Appliances also let organizations without
that particular expertise gain access to well-designed systems.
The other benefit of appliances is that they often have features that can’t be found
elsewhere. Competition drives the vendors to add new features, increase performance,
and improve reliability. For example, NetApp Filers have tunable file system snapshots
that allow end users to “cd back in time,” thus eliminating many requests for file restores.
Use what is most appropriate. Small companies often can’t justify a grid. For large
companies it is best to build a private, in-house cloud. If a company needs to get started
quickly, cloud computing permits it to spin up new machines without the delay of
specifying, purchasing, and installing machines.
Often one platform is selected as the default and exceptions require approval. Many IT
departments provide VMs as the default, and bare-metal requests require proof that the
application is incompatible with the virtualization system. Some companies have a “cloud
first” or “cloud only” strategy.
Depending on the company and its culture, di erent defaults may exist. The default
should be what’s most efficient, not what’s least expensive.
The only strategy we recommend against is an organization trying to deploy all of these
strategies simultaneously. It is not possible to be good at all strategies. Pick one, or a few,
and get very good at them. This will be better than providing mediocre service because you are
spread too thin.
13.9 Summary
Servers are the hardware used to provide services, such as file service, mail service,
applications, and so on. There are three general strategies to manage servers.
All eggs in one basket has one machine that is used for many purposes, such as a
departmental server that provides DNS, email, web, and file services. This puts many
critical services in one place, which is risky.
With beautiful snowflakes, there are many machines, each configured differently. Each
machine is configured exactly as needed for its purpose, which sounds optimal but is a
management burden. It becomes important to manage variations, reducing the number of
types of things that are managed. We can do this many ways, including adopting a policy
of eliminating one generation of products before adopting a new one.
There are several related strategies. Grid computing provides hundreds or thousands of
machines managed as one large computer for large computational tasks. Blade servers
make hardware operations more efficient by providing individual units of computer power
or storage in a special form factor. Cloud-based computing rents time on other people’s
server farms, enabling one to acquire additional compute resources dynamically.
Software as a service (SaaS) eliminates the need for infrastructure by relying on web-
based applications. Server appliances eliminate the need for local engineering knowledge
by providing premade, preconfigured hardware solutions.
Organizations use a mixture of these strategies. Most have one primary strategy and use
the others for specific instances. For example, it is very common to see a company use
virtualization as the default, physical machines as the exception, and SaaS for designated
applications.
Exercises
2. What are the pros and cons of the three primary hardware strategies? Name two
situations where each would be the best fit.
3. What is the hardware strategy used in your organization? Why was it selected?
4. How does your organization benefit from the hardware strategy it uses? Why is it better
or worse than the other two strategies?
5. Why is data integrity so important in the all eggs in one basket strategy?
8. With beautiful snowflakes, each server is exactly what is needed for the application.
This sounds optimal. How can it be a bad thing?
9. In your environment, how many different server vendors are used? List them. Do you
consider this to be a lot of vendors? What would be the benefits and problems of
increasing the number of vendors? Decreasing this number?
10. What are some of the types of variations that can exist in a server fleet? What
management overhead does each of them carry?
11. What are some of the ways an organization can reduce the number and types of
hardware variations in its server fleet?
12. In what way is buy in bulk, allocate fractions more efficient than the other strategies?
14. Which server appliances are in your environment? What kind of engineering would you
have to do if you had instead purchased a general-purpose machine to do the same
function?
15. Which services in your environment would be good candidates for replacement with a
server appliance (whether or not such an appliance is available)? Why are they good
candidates?
16. Live migration sounds like magic. Research how it works and summarize the process.
17. If your organization uses virtualization, how quickly can a new VM be created? Request
one and time the duration from the original request to when you are able to use it.
18. Section 13.3 says that it can take minutes to create a VM. If the answer to Exercise 17
was longer than a few minutes, investigate where all the time was spent.
19. The cloud means different things to different groups of people. How are these
definitions related? Which definition do you use?
20. What can cloud-based computing do that a small or medium-size company cannot do
for itself?