UNIT 2
1
Computing in the Cloud
• The most basic cloud service model is known in the cloud industry as infrastructure as a service (IaaS) because it provides virtualized infrastructure to its users.
2
Virtual Machines and Containers
• A virtual machine is just the software image of a
complete machine that can be loaded onto the server
and run like any other program.
3
• To the hypervisor, all VMs look the same and can be
managed in a uniform way.
• The cloud management system (sometimes called the fabric
controller) can select which server to use to run the
requested VM instances, and it can monitor the health of
each VM.
• NATIVE HYPERVISORS / TYPE 1 / BARE METAL (VMware ESXi from VMware, Hyper-V from Microsoft, Oracle VM from Oracle)
• HOSTED HYPERVISORS / TYPE 2 (VMware GSX from VMware, VirtualBox from Oracle, Virtual PC from Microsoft)
• HYBRID
• Bare metal or native hypervisors run directly on the hardware, providing all the features (e.g.,
I/O) needed by the guests.
• Hosted hypervisors run on top of an existing OS and leverage the features of the underlying
OS.
• Virtual machines run on top of the hosted hypervisor, which runs on top of an existing OS.
4
Containers
• Containers are similar to VMs but are based on a
different technology and serve a slightly different
purpose.
• Rather than run a full OS, a container is layered on top
of the host OS and uses that OS’s resources in a clever
way.
• Containers allow you to package up an application
and all of its library dependencies and data into a
single, easy-to-manage unit.
• When you launch the container, the application can be
configured to start up, go through its initialization, and
be running in seconds.
6
Virtual machines vs. containers on a
typical cloud server
7
Virtual Machines vs. Containers

Virtual machines | Containers
A VM is a piece of software that allows you to install other software inside it, so you control it virtually instead of installing the software directly on the computer. | A container is software that allows the different functionalities of an application to run independently.
Applications running on different VMs can run different OSes. | Applications running in a container environment share a single OS.
A VM virtualizes the complete computer system. | Containers virtualize the operating system only.
VM images are very large. | Container images are very light, i.e., a few megabytes.
VMs are useful when we require all OS resources to run various applications. | Containers are useful when we want to maximize the number of running applications on minimal servers.
Examples of VMs: KVM, Xen, VMware. | Examples of containers: RancherOS, Photon OS, and Docker containers.
8
Virtual Machines vs. Containers

Virtual machines | Containers
Heavyweight | Lightweight
Fully isolated, hence more secure | Process-level isolation, hence less secure
No automation for configuration | Script-driven configuration
Slow deployment | Rapid deployment
Easy port and IP address mappings | More abstract port and IP mappings
Custom images not portable across clouds | Completely portable
9
Advanced Computing Services
• A common issue of concern to scientists and engineers is
scale.
• VMs and containers are a great way to virtualize a single
machine image.
• Most high-performance parallel applications are based on
the Message Passing Interface (MPI) standard.
• For example, many-task (MT) parallelism is used to tackle problems in which you have hundreds of similar tasks to run, each (largely) independent of the others.
• Another method is called MapReduce, made popular by the Hadoop computational framework.
• MapReduce is related to a style of parallel computing known as bulk synchronous parallelism (BSP).
10
• Google has released a service called Cloud Datalab,
based on Jupyter, for interactive control of its data
analytics cloud.
• The Microsoft Cloud Business Intelligence (Cloud
BI) tool supports interactive access to data queries and
visualization
11
• MPI stands for Message Passing Interface.
• MPI is used to send messages from one process
(computer, workstation etc.) to another.
• These messages can contain data ranging from
primitive types (integers, strings and so forth) to
actual objects
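A minimal sketch of such message passing, using the mpi4py Python binding (mpi4py and the two-process setup are assumptions, not part of the original slides):

# run with: mpirun -np 2 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # send an object (a dict here) to process 1
    comm.send({"value": 42, "note": "hello"}, dest=1)
else:
    msg = comm.recv(source=0)   # receive the message from process 0
    print("process", rank, "received", msg)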
15
• Bulk Synchronous Parallel (BSP) is a programming model
and computation framework for parallel computing.
• Computation is divided into a sequence of SUPERSTEPS.
• In each superstep, a set of processes, running the same
code, executes concurrently and creates messages that
are sent to other processes.
• The superstep ends when all the computation in the
superstep is complete and all messages have been sent.
• A barrier synchronization at the end of the superstep ensures
that all messages have been transmitted (but not yet
delivered to the processes).
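A minimal BSP-style sketch, again using mpi4py as an assumed binding: every process runs the same code, exchanges a message with its ring neighbors, and a barrier ends each superstep.

# run with: mpirun -np 4 python bsp_ring.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

value = rank
for superstep in range(3):
    value = value * 2                  # compute phase: same code on all processes
    dest = (rank + 1) % size           # create a message for another process
    src = (rank - 1) % size
    value = comm.sendrecv(value, dest=dest, source=src)
    comm.Barrier()                     # barrier synchronization ends the superstep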
16
Serverless Computing
• Serverless is a cloud application development and
execution model that lets developers build and run code
without managing servers, and without paying for
idle cloud infrastructure.
• Serverless does not mean 'NO SERVERS'.
• The term ‘serverless’ is somewhat misleading, as there
are still servers providing these backend services,
but all of the server space and infrastructure concerns
are handled by the vendor.
• Serverless means that the developers can do their work
without having to worry about servers at all.
https://www.ibm.com/cloud/learn/serverless
19
• Serverless architecture is largely based on a Functions as a
Service (FaaS) model that allows cloud platforms to execute
code without the need for fully provisioned infrastructure
instances.
• FaaS functions, also known as Compute as a Service (CaaS), are stateless, server-side functions that are event-driven, scalable, and fully managed by cloud providers.
• Serverless computing is the abstraction of servers,
infrastructure, and operating systems.
• When you build serverless apps you don’t need to provision and
manage any servers, so you can take your mind off
infrastructure concerns.
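A minimal sketch of a FaaS-style function, written as a Python AWS Lambda handler (the event fields are hypothetical; handler(event, context) is the standard Lambda signature):

import json

def handler(event, context):
    # the platform invokes this on each event; there is no server to manage
    name = event.get("name", "world")
    return {"statusCode": 200,
            "body": json.dumps({"message": "hello, " + name})}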
20
21
Pros and Cons of Public Cloud Computing

PROS: Cost, Scalability, Access, Configurability, Variety, Upgradeability, Security, Simplicity
CONS: Cost, Variety, Security, Dependence
22
PROS Description
Cost If you need a resource for only a few hours or days, the cloud is
much cheaper than buying a new machine
Scalability You are starting a new lab and want to start with a small number of
servers, but as your research group grows, you want to be able to
expand easily without the hassle of managing your own racks of
servers.
Access A researcher in a small university or an engineer in a small company
may not even have a computer room or a lab with room for racks of
machines.
The cloud may be the only choice.
Configurability For many scientific disciplines, you can get complete virtual machines or containers with all the standard software you need pre-installed.
23
PROS Description
Variety Public cloud systems provide access to a growing diversity of
computer systems.
Amazon and Azure each provide dozens of machine configurations,
ranging from a single core with a gigabyte of memory to multicore
systems with GPU accelerators and massive amounts of
memory.
Security • Commercial cloud providers have excellent security.
• They also make it easy to create a virtual network that integrates
cloud resources into your network
Upgradeability • Cloud hardware is constantly upgraded.
• Hardware that you buy is out of date the day that it is delivered,
and becomes obsolete quickly
Simplicity • You can manage your cloud resources from a web portal that is easy to navigate.
• Managing your own private cluster may require sophisticated system administration skills.
24
CONS Description
Cost • You pay for public cloud by the hour and the byte.
• Computing the total cost of ownership of a cluster of machines
housed in a university lab or data center is not easy.
• In many environments, power and system administration are
subsidized by the institution.
• If you need to pay only for hardware, then running your own
cluster may be cheaper than renting the same services in the
cloud.
Variety • The cloud does not provide every type of computing that you may
require, at least not today.
• In particular, it is not a proper substitute for a large
supercomputer
Security • Your research may concern highly sensitive data, such as medical information, that cannot be moved outside your firewall.
Dependence • Dependence on one cloud vendor (often referred to as vendor
lock-in).
• As the public clouds converge in many of their standard offerings
and compete on price, moving applications between cloud vendors
has become easier
25
Sample Scenario
• In today’s world, there are various use cases for serverless technologies. Let’s take a simple example: imagine you are a manager at Coca-Cola, whose adoption of serverless in its vending machines resulted in significant savings. Whenever a beverage is purchased, details such as the user name and payment information need to be shared with the organization. The serverless model lets the application ignore the entire complexity of the underlying infrastructure.
26
Flow of the application
• Built on Amazon Web Services (AWS) serverless architecture, the new contactless Coca-Cola Freestyle solution enables consumers to choose and pour drinks from their phones in just a few seconds, without having to create an account or download an app.
• The mobile experience is currently rolling out to all Coca-Cola Freestyle dispensers across the United States.
28
AWS based services enabled for all
scenario based questions
• AWS Lambda: Run code without thinking about servers or clusters (MANAGE COMPUTE)
• Amazon S3: Object storage built to retrieve any amount of data from anywhere (MANAGE STORAGE)
• Amazon API Gateway: Create, maintain, and secure APIs at any scale (MANAGE API)
• AWS Fargate: Serverless compute for containers (MANAGE CONTAINERS)
• Amazon DynamoDB: Fast, flexible NoSQL database service for single-digit millisecond performance at any scale (MANAGE NOSQL)
29
• Amazon RDS: Set up, operate, and scale a relational database in the cloud with just a few clicks (MANAGE SQL)
• Amazon SNS: Fully managed pub/sub service for A2A and A2P messaging (MANAGE MESSAGING)
• Amazon Virtual Private Cloud (Amazon VPC): Define and launch AWS resources in a logically isolated virtual network (MANAGE PRIVACY)
• Amazon Athena: Analyze petabyte-scale data where it lives with ease and flexibility (MANAGE ANALYTICS)
30
• Amazon EMR: Easily run and scale Apache Spark, Hive, Presto, and other big data workloads (MANAGE BIG DATA COMPUTE)
• Amazon Kinesis: Easily collect, process, and analyze video and data streams in real time (MANAGE REAL-TIME DATA)
31
Serverless Architecture
32
Using and Managing Virtual
Machines
33
Historical Roots
37
• PRIVILEGE LEVEL OR PROTECTION RING
• The idea behind privilege levels is that all instructions that modify the physical hardware configuration are permitted only at the highest level.
• At lower levels, only restricted sets of instructions can be executed.
• Programs in ring 0 have the highest privileges and are allowed to execute any instructions or access any physical resources such as memory pages or I/O devices.
• The hypervisor sets the current privilege level (CPL) register of the processor to 3 before starting execution of the guest.
• If the guest tries to access a protected resource, such as an I/O device, an interrupt takes place, and the hypervisor regains control.
• The hypervisor then emulates the I/O operation for the guest.
38
• The x86 architecture provides four levels of protection, 0, 1, 2, and 3, with 0 used by the kernel, 3 by application software, and 1 and 2 unused.
39
• Hardware Support for Virtualization
• VT-x, an Intel technology that helps virtualize Intel x86 processors.
• An overview of a technique called EXTENDED PAGE TABLES (EPT), which helps virtualize memory.
• Followed by VT-d, a technology to assist in the virtualization of I/O.
40
• Hardware Support for Processor Virtualization
• VMX root operation and VMX non-root operation
• VT-x makes use of a new data structure called the Virtual Machine
Control Structure (VMCS).
• Examples: Lustre and GlusterFS virtualize storage at the storage level, while the IBM SAN Volume Controller virtualizes storage at the network level.
42
Network-Based Virtualization
• Network Virtualization (NV) refers to abstracting
network resources that were traditionally delivered in
hardware to software.
• Fibre Channel Storage Area Network (SAN).
• There are broadly two categories, based on where the virtualization functions are implemented: either in switches (routers) or in appliances (servers).
43
AWS Elastic Compute Cloud
EC2 provides:
• Virtual computing environments, known as instances
• Preconfigured templates for your instances, known as Amazon Machine Images (AMIs)
• Various configurations of CPU, memory, storage, and networking capacity for your instances, known as instance types
• Secure login information for your instances using key pairs (AWS stores the public key, and you store the private key in a secure place)
46
• To connect to your instance, you need to use a secure shell (ssh) command.
• On Windows the tool to use is called PuTTY.
• Connect as ec2-user@IPAddress, where IPAddress is the IP address you can find in the Portal Instance View.
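On Linux or a Mac, the equivalent command might look like this (the key file name is a placeholder for your key pair's private key):

ssh -i my-keypair.pem ec2-user@IPAddress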
49
• The following listing uses the Python Boto3 SDK to
create an Amazon EC2 VM instance.
• It creates an ec2 resource, which requires your
aws_access_key_id and aws_secret_access_key,
unless you have these stored in your .aws directory.
• The ImageId argument specifies the VM image that
is to be started and the MinCount and MaxCount
arguments the number of instances needed.
50
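The listing itself appeared on the following slide as an image; a minimal sketch of what it might look like (the ImageId, instance type, and key name are placeholders):

import boto3

# create an ec2 resource; the keys can be omitted if stored in your .aws directory
ec2 = boto3.resource('ec2',
                     aws_access_key_id='YOUR-ACCESS-KEY-ID',
                     aws_secret_access_key='YOUR-SECRET-ACCESS-KEY',
                     region_name='us-west-2')

instances = ec2.create_instances(
    ImageId='ami-XXXXXXXX',     # the VM image to be started (placeholder)
    MinCount=1, MaxCount=1,     # the number of instances needed
    InstanceType='t2.micro',
    KeyName='my-keypair')       # placeholder key pair name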
Attaching Storage
• Three kinds of storage can be attached to a VM: instance storage, Elastic Block Store, and Elastic File System.
• Instance storage is what comes with each VM instance.
• Elastic Block Store (EBS) storage independent of a
VM and then attach it to a running VM.
• EBS volumes persist and thus are good for databases
and other data collections that we want to keep beyond
the life of a VM.
• To create an EBS volume, go to the volumes tab of the
EC2 Management console and click Create Volume
52
• We selected the us-west-2b availability zone.
• The volume is then attached to an instance via the Actions tab in the volume management console.
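The same steps can be scripted with Boto3; a minimal sketch (the size, zone, and instance ID are placeholders):

import boto3

ec2 = boto3.resource('ec2', region_name='us-west-2')

# create a 10 GiB EBS volume in the us-west-2b availability zone
volume = ec2.create_volume(AvailabilityZone='us-west-2b', Size=10)

# once the volume's state is 'available', attach it to a running instance
volume.attach_to_instance(InstanceId='i-0123456789abcdef0',   # placeholder
                          Device='/dev/xvdf')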
54
• This function requires the path to your private key,
the IP address of the instance, and the command
script as a string.
• If you want a volume to be shared with multiple
instances, then you can use the third type of instance
storage, called Elastic File System, that implements
the Network File System (NFS) standard
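The function referred to above is not shown on the slides; a minimal sketch of such a helper, using the paramiko SSH library (the library choice and function name are assumptions):

import paramiko

def run_command(key_path, ip_address, command):
    # run a command script on a remote instance over ssh
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip_address, username='ec2-user', key_filename=key_path)
    stdin, stdout, stderr = client.exec_command(command)
    output = stdout.read().decode()
    client.close()
    return output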
56
EC2 Computational Resources
Features | Descriptions
Computing resources | The computing resources available on EC2, referred to as EC2 instances, consist of combinations of computing power together with other resources such as memory. The EC2 Compute Unit (CU) is a standard measure of computing power, in the same way that bytes are a standard measure of storage.
Software | Amazon Machine Images (AMIs). The required AMI has to be specified when requesting the EC2 instance, as seen earlier. The AMI running on an EC2 instance is also called the root AMI.
Operating systems | Red Hat Enterprise Linux and SuSE, Windows Server, and Solaris.
Regions and Availability Zones | EC2 offers regions, which are the same as the S3 regions described in the section S3 Administration.
Load Balancing and Scaling | EC2 provides the Elastic Load Balancer, a service that balances the load across multiple servers.
57
Azure VMs
58
• Google Cloud VM Services
• Usage of Virtual Machines:
1. Create development and test environments
2. Enable workload migration
3. Improve disaster recovery and business continuity
4. Create a hybrid environment
5. Consolidate servers
59
• Google Cloud Compute Engine
• Google Compute Engine is Google’s Infrastructure-as-
a-Service virtual machine offering.
• It allows customers to use virtual machines in the cloud
as server resources instead of acquiring and
managing server hardware.
• Google Compute Engine offers virtual machines running
in Google’s data centers connected to the worldwide
fibre network.
• The tooling and workflow offered by Compute Engine enable scaling from single instances to global, load-balanced cloud computing.
60
• Applications of Compute Engine
• Virtual Machine (VM) migration to Compute Engine
• Genomics Data Processing
61
• BYOL or Bring Your Own License images
62
• Google Compute Engine Features
• Machine Types
• Persistent Disks
• Local SSD
• GPU Accelerators
• Images
• Global Load Balancing
63
Jetstream VM Services
64
Example Jetstream VM images include:
• Wrangler iRODS 4.1, with a setup script for easy generation of the iRODS client environment on XSEDE resources
• Docker, the platform for launching Docker containers
• EPIC Modeling and Simulations: the Explicit Planetary Isentropic-Coordinate (EPIC) atmospheric model, based on Ubuntu 14.04.3
• Numerous Ubuntu images
65
Using and Managing Containers
• Containers are a common option for deploying and
managing software in the cloud.
• Containers are used to abstract applications from the
physical environment in which they are running.
• A container packages all dependencies related to a software
component, and runs them in an isolated environment.
• Containers have become one of the most interesting and
versatile alternatives to virtual machines for
encapsulating applications for cloud execution.
• We focus on Docker container technology because it is the most widely known and used, is easy to download and install, and is free.
67
Need for Docker
Tailor-made | Docker in cloud computing enables its clients to use Docker to organize their software infrastructure.
Accessibility | Docker is a cloud framework, so it is accessible from anywhere, anytime, and has high efficiency.
Operating system support | It takes less space; containers are lightweight, and several can operate simultaneously.
Performance | Containers have better performance as they are hosted in a single Docker engine.
Speed | No requirement for an OS to boot; applications are made online in seconds. As the business environment is constantly changing, technological upgrades need to keep pace for smoother workplace transitions.
Flexibility | Docker is a very agile container platform. It is deployed easily across clouds, providing users with an integrated view of all their applications across different environments, and is easily portable across different platforms.
Scalable | It helps create immediate impact by saving on recoding time, reducing costs, and limiting the risk of operations.
Automation | Docker works on software-as-a-service and platform-as-a-service models, which enable organizations to streamline and automate diverse applications.
68
Example : Scenario based
• A company needs to develop a Java Application.
• In order to do so, the developer will set up an environment with a Tomcat server installed in it.
• Once the application is developed, it needs to be tested by the tester.
• Now the tester will again set up a Tomcat environment from scratch to test the application.
• Once the application testing is done, it will be deployed on the production server.
• Again, production needs an environment with Tomcat installed on it, so that it can host the Java application.
• Notice that the same Tomcat environment setup has been done three times.
69
• Problems: there is a loss of time and effort.
• There could be a version mismatch in the different setups; i.e., the developer and tester may have installed Tomcat 7, whereas the system admin installed Tomcat 9 on the production server.
https://www.edureka.co/blog/what-is-docker-container
71
• Docker allocates exactly the amount of memory that is required by the container.
• Using Docker containers, we can set up many instances of Jenkins, Puppet, and many more, all running in the same container or running in different containers that can interact with one another by just running a few commands. Docker can also easily scale up by creating multiple copies of these containers.
72
Docker Engine & Docker Image
73
Docker Container & Registry
• Docker Containers are the ready applications
created from Docker Images
74
Docker Image
75
Steps in Docker
Workflow shown in the original diagram:
• Moving the container infrastructure to the cloud
• Monitoring our environment
• Configuring the system
• Adding Vault (/vault/code)
• Run application: deploy to machine, start machine, restart machine
• Inspect and view state
76
Container Basics
• For a long time, the best (indeed, in many cases, only) way to encapsulate software for deployment in the cloud was to create a VM image.
• Docker allows applications to be provisioned in containers
that encapsulate all application dependencies.
• The application sees a complete, private process
space, file system, and network interface isolated
from applications in other containers on the same host
operating system.
• Docker isolation provides a way to factor large
applications, as well as simple ways for running
containers to communicate with each other
77
• To understand containers, it helps to see how the file system in a container is layered on top of the existing host services.
• The key is the Union File System (more precisely, the ADVANCED MULTILAYERED UNIFICATION FILE SYSTEM, AUFS) and a special property called copy-on-write that allows the system to reuse many data objects in multiple containers.
• Docker images are composed of layers in the Union
File System.
• The image is itself a stack of read-only directories.
• The base is a simplified Linux or Windows file system
78
• The Docker Union File System is layered on a standard
base.
• As an application in the container executes, it uses the
WRITABLE LAYER.
• If it needs to modify an object in the read-only layers, it
copies those objects into the writable layer.
• Otherwise, it uses the data in the read-only layer, which
is shared with other container instances
79
Basic Docker commands
docker –version get the currently installed version of docker
docker pull Usage: docker pull <image name>
to pull images from the docker repository(hub.docker.com)
docker run Usage: docker run -it -d <image name>
used to create a container from an image
docker ps used to list the running containers
docker ps -a used to show all the running and exited containers
docker exec Usage: docker exec -it <container id> bash
used to access the running container
docker stop Usage: docker stop <container id>
stops a running container
docker kill Usage: docker kill <container id>
command kills the container by stopping its execution immediately.
The difference between ‘docker kill’ and ‘docker stop’ is that ‘docker
stop’ gives the container time to shutdown gracefully
docker commit Usage: docker commit <container id> <username/imagename>
creates a new image of an edited container on the local system
docker login used to log in to the Docker Hub repository
80
Basic Docker commands
81
The original diagram, which you can use as a guide for running your application, shows what you need:
• A DOCKERFILE: a text file with a set of commands; this file is needed for performing the creation of a container
• A DOCKER IMAGE: the libraries that we implement for the work
• The CODE YOU'VE WRITTEN FOR YOUR APPLICATION
• The DOCKER PLATFORM
• A DOCKER CONTAINER: open your command line and write "docker run ..." to start it, and stop the container when done
82
Docker and the Hub
• Install Jupyter with Docker on your laptop.
• First install Docker on your machine; the details differ on Linux, Mac, and PC.
• The installation is a simple process, similar to that of installing a new browser or other desktop application.
• Docker does not have a graphical interface; it is BASED ON A COMMAND LINE API.
• Open a "powershell" or "terminal" window on your machine.
• The docker commands are then the same on Linux, Mac, and Windows.
83
• Once you have installed Docker, you can verify that it is
running by executing the docker ps command, which
tells you which containers are running.
C:\> docker ps
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
C:\>
• Launch Jupyter with the docker run command.
84
• The first flag, -it, causes the printing of a URL with a
token that you can use to connect to the new Jupyter
instance.
• The second flag, -p 8888:8888, binds port 8888 in the
container’s IP stack to port 8888 on our machine.
• Finally, the command specifies the name of the container,
jupyter/scipy-notebook, as it can be found in the
Docker Hub
C:\> docker run -it -p 8888:8888 jupyter/scipy-notebook
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
http://localhost:8888/?token=b9fc19aa8762a6c781308bb8dae27a…
85
• Rerunning the docker ps command shows that our newly
started Jupyter notebook is now running.
87
Standard Docker features
flag -it | connects the container's standard I/O to the shell that ran the docker command
flag -d | makes the container run in detached mode
flag -v localdir:/containername | mounts a local directory on your laptop as a volume on the Docker container file system
-it and -v together | use the docker command on a Mac to launch a Linux Ubuntu container with the Mac's /tmp directory mounted as /localtmp; due to -it, we are presented with a command prompt for the newly started Ubuntu container (see the example below)
flag -e | passes environment flags on the run command through to Jupyter
-e GEN_CERT=yes | tells Jupyter to generate a self-signed SSL certificate and to use HTTPS instead of HTTP for access
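The -it and -v example just described might look like this (the ubuntu image tag and bash command are assumptions):

docker run -it -v /tmp:/localtmp ubuntu bash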
88
• Let’s assume that we also want to mount a local directory, c:/tmp/docmnt, as a directory inside the container.
• Jupyter has a user called jovyan, and the working directory is /home/jovyan/work.
• You can then log in to Jupyter via HTTPS with your new password.
• When the container is up, you can connect to it via HTTPS at your host’s IP address and port 8888.
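Putting the flags together, the launch might look like this (a sketch assembled from the flags described above; the mount target under /home/jovyan/work is an assumption):

docker run -d -p 8888:8888 -e GEN_CERT=yes -v c:/tmp/docmnt:/home/jovyan/work/docmnt jupyter/scipy-notebook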
90
Containers for Science
• Radio astronomy (LOFAR, PyImager, and MeqTrees)
• Bioinformatics (Galaxy toolkit, Genome toolkit)
• Mathematics and statistics (R and Python)
• Machine learning (Spark, the Vowpal Wabbit tools, and scikit-learn)
• Geospatial data (container with GeoServer)
• iPlant Consortium
• UberCloud
91
Creating Your Own Container
• Creating your own container image and storing it in
the Docker Hub is simple.
• Scenario :Suppose you have a Python application that
opens a web server, waits for you to provide input, and
then uses that input to pull up an image and display it.
• Now, let’s build this little server and its image data as a
container
92
• The server is a Python application based on the Bottle framework for creating the web server.
• Assume the images are all stored as jpg files in a directory called images.
• We need the SciPy tools, the Amazon Boto3 SDK, and a file named Dockerfile:
FROM jupyter/scipy-notebook
MAINTAINER your name <yourname@gmail.com>
RUN pip install bottle
COPY images /images
COPY bottleserver.py /
ENTRYPOINT ["ipython", "/bottleserver.py"]
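The bottleserver.py file itself is not shown; a minimal sketch of what it might contain (routes and page content are assumptions):

from bottle import Bottle, request, static_file

app = Bottle()

@app.route('/')
def form():
    # a simple form asking which image to display
    return '<form action="/show"><input name="img"><input type="submit"></form>'

@app.route('/show')
def show():
    name = request.query.img                      # image name typed by the user
    return static_file(name + '.jpg', root='/images')

app.run(host='0.0.0.0', port=8000)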
93
jupyter/scipy-notebook | a well-maintained container in the Docker Hub
pip install | installs Boto3 and Bottle
ENTRYPOINT | tells Docker what to execute when the container runs
docker build | downloads all the components for jupyter/scipy-notebook, Boto3, and Bottle
docker run -d -p 8000:8000 yourname/bottlesamp | runs the new container
docker push yourname/bottlesamp | after you create a free Docker account, saves your container to the Docker Hub
94
Scaling Deployments
95
Paradigms of Parallel Computing in the Cloud
96
• MapReduce style made famous by Google and now
widely used in its Hadoop and Spark realizations.
• The GRAPH EXECUTION MODEL, in which computation
is represented by a directed, usually acyclic, graph of
tasks.
• Execution begins at a source of the graph.
• Each node is scheduled for execution when all
incoming edges to that node come from task nodes that
have completed.
• Graphs can be constructed by hand or alternatively
generated by a compiler from a more traditional-looking
program that describes the graph either implicitly or
explicitly
97
• The DATA ANALYTICS TOOL Spark and the Spark Streaming, Apache Flink, Storm, and Google Dataflow systems.
• The graph execution model is also used in machine learning tools, such as the Google TensorFlow and Microsoft Cognitive Toolkit systems.
• Microservices and actors.
• In the actor model of parallel programming, computing is performed by many actors that communicate via messages.
• Each actor has its own internal private memory and goes into action when it receives a message.
• Based on the message, it can change its internal state and then send messages to other actors (see the sketch below).
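A minimal actor sketch in Python, using only the standard library (the CounterActor name and its messages are illustrative):

import queue, threading

class CounterActor:
    def __init__(self):
        self.inbox = queue.Queue()
        self.count = 0                       # private internal state
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self.inbox.put(msg)                  # actors communicate via messages

    def _run(self):
        while True:
            msg = self.inbox.get()           # the actor goes into action on a message
            if msg == "inc":
                self.count += 1              # change internal state

actor = CounterActor()
actor.send("inc")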
98
SPMD and HPC-style Parallelism
• Message Passing Interface in the cloud
• GPUs in the cloud
• Accelerators in supercomputing
• Deep neural networks (DNNs)
• Deploying an HPC cluster on Amazon
99
MPI cloud computing for proton
therapy
• The example illustrates how specialized node types (in
this case, GPU-equipped nodes with 10 gigabit/s
interconnect) allow cloud computing to deliver
transformational computing power for a time-sensitive
medical application.
• It also involves the use of MPI for inter-instance communication
• Apache Mesos for acquiring and configuring virtual clusters
• Globus for data movement between hospitals and the cloud
100
• Cloud computing is used here to reconstruct three-dimensional proton computed tomography (pCT) images in support of proton cancer treatment.
• Proton computed tomography.
• Protons pass left to right through sensor planes and
traverse the target before stopping in the detector at the
far right.
101
https://pubs.acs.org/doi/10.1021/acs.jcim.9b01152
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4003902/
102
• A single reconstruction can require the analysis of
around two billion 50-byte proton histories, resulting in
an input dataset of ~100 GB.
• The reconstruction process is complex, involving
multiple processes and multiple stages.
• Each participating process reads a subset of proton
histories into memory and performs some preliminary
calculations to remove abnormal histories.
• Filtered back projection is used to estimate an initial
reconstruction solution, from which most likely paths
(MLPs) are estimated for the proton through the
target.
103
• The voxels (volume elements) of the MLP for each proton identify the nonzero coefficients in a set of nonlinear equations that must then be iteratively solved to construct the image.
• This solution phase takes the bulk of the time and can be
accelerated by using GPUs and by caching the MLP paths (up
to 2 TB for a 2-billion-history image) to avoid
recomputation.
• The MPI-based parallel reconstruction code, when run on a standalone cluster with 60 GPU-equipped compute nodes, can reconstruct two billion histories in seven minutes.
• Deploy on each instance a VM image configured to run the
pCT software plus associated dependencies (e.g., MPI for
inter-instance communication).
104
• The APACHE MESOS scheduler for task scheduling
• The GLOBUS TRANSFER service for data movement
• Amazon spot instances
• Revolutionary new capability, considering that the
alternative is for each hospital with a proton therapy
system to acquire, install, and operate a
dedicated HPC cluster.
105
For scenario based questions
106
• Amazon Elastic Compute Cloud:EC2 has
incorporated cluster computing instances aimed
toward HPC computing applications.
• pCT Reconstruction Instance Types: the GPU-enabled MPI application's identified requirements were high-CPU, high-memory, and GPU-enabled instances, as well as low latency between instances.
• Amazon EC2 GPU-enhanced cluster compute
instance, termed CG1.
• CG1 instances include two Intel Xeon X5570, quad-core
CPUs with hyperthreading, 22.5GB of RAM, and two
NVIDIA Tesla M2050 GPUs, each containing 3GB of RAM.
107
• Shared File System: GlusterFS, an open-source distributed file system that provides scalable and high-performance access to files.
• The GlusterFS model relies on one or more
storage bricks (or servers) that allow client
applications, in this case the pCT reconstruction worker
nodes, to mount the data source
108
• Data Upload/Download: Globus moves data between Globus endpoints, the name given to a resource on which a Globus agent is installed.
• Globus automatically TUNES PARAMETERS
• Maximize bandwidth usage
• Manages security configurations
• Provides automatic fault recovery
• Encrypts data channels
• Notifies users of completion and problems
• Ensures that files are transferred reliably by matching
checksums
109
• Cloud Images and Elastic Scale-Out: on Amazon, these snapshots are referred to as Amazon Machine Images (AMIs).
https://ieeexplore.ieee.org/ielaam/6245519/8307205/7160740-aam.pdf?tag=1
110
AWS Services used in CfnCluster
• CfnCluster (“cloud formation cluster”) is a framework that
deploys and maintains high performance computing clusters on
Amazon Web Services (AWS)
• AWS CloudFormation
• AWS Identity and Access Management (IAM)
• Amazon SNS (Amazon Simple Notification Service)
• Amazon SQS (Amazon Simple Queue Service)
• Amazon EC2
• Auto Scaling
• Amazon EBS
• Amazon S3
111
• Amazon DynamoDB
112
Components in cfncluster
• AWS Systems Manager Agent (SSM Agent) is Amazon software that runs on Amazon Elastic Compute Cloud (Amazon EC2) instances, edge devices, and on-premises servers.
• DCV is a high-performance remote display protocol that
provides customers with a SECURE WAY to deliver remote
desktops and application streaming from any cloud or data
center to any device, over varying network conditions.
• Slurm Workload Manager, formerly known as Simple Linux
Utility for Resource Management (SLURM), or simply
Slurm, is a free and open-source job scheduler for Linux and
Unix-like kernels, used by many of the world's supercomputers
and computer clusters
• AWS ParallelCluster is tested with Slurm configuration
parameters, which are provided by default
113
Deploying an HPC Cluster on
Amazon
• CloudFormation service, which enables the automated
deployment of complex collections of related services,
such as multiple EC2 instances, load balancers,
special subnetworks connecting these components,
and security groups that apply across the
collection.
• CfnCluster (CloudFormation Cluster) Python scripts
that you can install and run on your Linux, Mac, or
Windows computer to invoke CloudFormation, as follows,
to build a private, custom HPC cluster
• sudo pip install cfncluster
• cfncluster configure
114
cfncluster create mycluster
115
• CloudFormation steps involved in launching a private HPC
cloud from a CfnCluster template.
• The “create” command returns a Ganglia URL.
• Ganglia is a well-known and frequently USED CLUSTER
MONITORING TOOL.
• Following that link takes you to a Ganglia view of your HPC
cluster.
• The default settings for a new cluster are AUTOSCALE
compute nodes and the gridEngine scheduler.
117
• Deploy a new cluster with better compute nodes and a
better scheduler.
• cfncluster delete mycluster
118
Slurm Workload Manager
119
The cluster configuration includes settings such as:

compute_instance_type = c3.xlarge
initial_queue_size = 4
maintain_initial_size = true
scheduler = slurm

• The c3.xlarge instance type supports what Amazon calls enhanced networking, which means that it runs on hardware and with software that support Single-Root I/O Virtualization (SR-IOV).
• The default VM image contains all libraries needed for HPC MPI-style computing.
• maintain_initial_size = true says that you want the compute nodes to stay around and not be managed by autoscale, and scheduler = slurm says that you want Slurm to be the scheduler.

cfncluster create mycluster

• Log in to the cluster's head node using the key pair that we used to create the cluster.
• On a PC you can use PuTTY, and on a Mac you can use ssh from the command line. The user is ec2-user.
• First you need to set up some PATH INFORMATION:

export PATH=/usr/lib64/mpich/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/mpich/lib
export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi.so
121
To run MPI programs, you need to know the local IP addresses of your compute nodes. Create a file called ip-print.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for gethostname */
#include <mpi.h>

int main(int argc, char **argv)
{
    char hostname[1024];
    gethostname(hostname, 1024);
    printf("%s\n", hostname);
    return 0;
}

Compile it, then use the Slurm command srun to run copies of this program across the entire cluster; for example, if your cluster has 16 nodes:

mpicc ip-print.c
srun -n 16 /home/ec2-user/a.out > machines

The output file machines should then contain multiple IP addresses of the form 10.0.1.x, where x is a number between 1 and 255, one for each of your compute nodes.

Next, compile and run an MPI program, ring.c. The program starts with MPI node 0 sending the number -1 to MPI node 1; MPI node 1 sends 0 to node 2, node 2 sends 1 to node 3, and so on.

mpicc ring.c
mpirun -np 7 -machinefile ./machines /home/ec2-user/a.out
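The slides describe ring.c but do not show it; a minimal sketch consistent with that description (message tags and the printed output are assumptions):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        value = -1;   /* node 0 sends -1 to node 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node %d received %d\n", rank, value);
        if (rank + 1 < size) {
            value = value + 1;   /* node 1 sends 0, node 2 sends 1, ... */
            MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}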
122
Deploying an HPC Cluster on Azure
• The first is to use Azure’s service deployment
orchestration service, Quick Start. (FIRST APPROACH)
• Like Amazon CloudFormation, it is based on templates.
• The templates are stored in GitHub and can be
invoked directly from the GitHub page
124
• When you click Deploy to Azure, you are taken to the
Azure login page
• Then directly to the web form for completing the
SLURM CLUSTER DEPLOYMENT.
• At this point
• Enter the names for the new resource group that
defines your cluster
• The number and type of compute nodes
• Few details about the network.
125
• The second approach to HPC computing on Azure IS AZURE BATCH, which supports the management of large pools of VMs that can handle large batch jobs, such as many task-parallel jobs.
• Azure Batch is a managed tool that you can use to AUTOSCALE DEPLOYMENTS and set policies for job scheduling.
• The Azure Batch service handles provisioning, assignment,
runtimes, and monitoring of your workloads.
• Upload your application binaries and input data to Azure storage.
• Define the pool of compute VMs that you want to use, specifying
the desired VM size and OS image
• Define a Job, a container for tasks executed in your VM pool.
• Create tasks that are loaded into the Job and executed in the VM
pool.
126
Sample Diagram for Cluster
formation in Azure
127
Steps for 2nd Approach
128
Steps in creating and executing an
Azure batch job
129
Scaling Further
• A quantity called the network bisection bandwidth is a
measure of how much data traffic can flow from one
half of the supercomputer to the other half in a
specified period of time.
• The networks used in supercomputers have an
extremely HIGH BISECTION BANDWIDTH.
• The first cloud data centers had LOW BISECTION
BANDWIDTH.
130
• Another difference is the service level agreement (SLA) that cloud providers make with their users.
• Supercomputers commit to a specific:
• Processor type
• Network bandwidth
• Bisection width
• Latency
• This allows the user to predict an application's performance profile with a fair degree of certainty.
131
Many Task Parallelism
• Analyze many data samples.
• Each analysis task can be performed independently of
all the other tasks.
• You place all data samples in a queue in the cloud,
and then start a large number of worker VMs or
containers.
• We refer to this as many task parallelism, but it is also
known as bag of tasks parallelism and manager
worker parallelism.
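A minimal sketch of such a worker, pulling samples from an Amazon SQS queue (the queue name and analyze() are hypothetical stand-ins):

import boto3

def analyze(body):
    return len(body)        # stand-in for the real per-sample analysis

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='samples')   # hypothetical queue

while True:
    for msg in queue.receive_messages(WaitTimeSeconds=10):
        result = analyze(msg.body)   # each task is independent of the others
        msg.delete()                 # remove the task once it completes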
132
Simple many task execution model
133
MapReduce and Bulk Synchronous
Parallelism
• A slightly more sophisticated approach to parallelism is
based on a concept called bulk-synchronous parallelism
(BSP).
• This is important when worker tasks must
periodically synchronize and exchange data with
each other.
• The point of synchronization in a BSP computation is called a BARRIER, because no computation is allowed to proceed until all computations reach the synchronization point.
134
• MapReduce is a special case of BSP computing.
• Say you have a sequence of data objects x_i for i = 1..n and you want to apply a function f(x) to each element.
• Assume the result is a value in an associative ring, like the real numbers, for which we can compose objects, and we want to compute the sum f(x_1) + f(x_2) + … + f(x_n).
135
• A MapReduce computation starts with a distributed data
collection partitioned into non-overlapping blocks.
• It then maps a supplied function over each block
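A minimal plain-Python sketch of this map-then-reduce pattern (the data and f are stand-ins):

from functools import reduce

x = list(range(1, 101))    # the data collection x_1 .. x_n
f = lambda v: v * v        # the function applied in the map phase

mapped = map(f, x)                           # apply f to every element
total = reduce(lambda a, b: a + b, mapped)   # combine partial results with +
print(total)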
136
Graph Dataflow Execution and Spark
137
DAG Visualization in Spark
Stage 1: Parallelize → Filter → Map
Stage 2: ReduceByKey → Map
Stage 3: Join
138
Name Description
HADOOP 1. Apache Hadoop is a platform that handles large datasets in a
distributed fashion.
2. The framework uses MapReduce to split the data into blocks and
assign the chunks to nodes across a cluster.
3. MapReduce then processes the data in parallel on each node to produce
a unique output.
4. Every machine in a cluster both stores and processes data.
5. Hadoop stores the data to disks using HDFS.
SPARK 1. It is designed for fast performance and uses RAM for caching and
processing data.
2. Spark performs different types of big data workloads.
3. This includes MapReduce-like batch processing, as well as real-time
stream processing, machine learning, graph computation, and
interactive queries
4. The data structure that Spark uses is called Resilient Distributed
Dataset (RDD).
139
Key Apache Spark Hadoop MapReduce
Features
Speed 10–100 times faster Slower
than MapReduce
Analytics Supports streaming, Comprises simple Map and
Machine Learning, Reduce tasks
complex analytics, etc.
141
Why we need a DAG with Spark
• Each MapReduce operation is independent of the others, and HADOOP has no idea which MapReduce would come next.
• Sometimes, for some iterations, it is irrelevant to read and write back the intermediate result between two map-reduce jobs.
• In such cases, the memory in stable storage (HDFS) or disk memory gets wasted.
• In a multiple-step pipeline, all the jobs are blocked from the beginning until the previous job completes.
143
• As a result, COMPLEX COMPUTATION CAN REQUIRE a long time even with small data volumes.
• In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed.
• In this way, the execution plan is optimized, e.g., to minimize shuffling data around.
• In MapReduce, by comparison, this is done manually by tuning each MapReduce step.
144
• Spark (spark.apache.org) is a popular example of this style of computation.
• In Spark the control flow program is a version of SQL, Scala, or Python and, consequently, you can easily execute Spark programs from a Jupyter notebook.
• The dataflow graph is defined by the program, and the parallelism is unrolled during execution.
• Spark is also part of Microsoft's HDInsight toolkit and is supported on Amazon Elastic MapReduce.
• Microsoft documentation describes how to deploy Spark on Azure with Linux or Windows and from a Jupyter notebook.
• Spark is used primarily for data analytics.
145
Agents and Microservices
• The cloud is designed to host applications that are
realized as scalable services, such as a web server
or the backend of a mobile app.
• Such applications accept connections from remote
clients and, based on the client request, perform some
computation and return a response
146
• SCENARIO: Another scenario is an application that processes
events from REMOTE SENSORS, with the goal of informing a
control system on how to respond, as when geosensors
detecting ground motion tremors occurring in a significant
pattern sound an earthquake warning.
• PROBLEM: Multiple components: sensor signal decoders,
pattern analysis integrators, database searches, and alarm
system interfaces.
• SOLUTION: structure the PARALLEL PROGRAM as an asynchronous swarm of communicating processes or services distributed over a virtual network in the cloud.
• The individual processes may be stateless, such as a simple web service, or stateful, as in the actor programming model.
147
Conceptual view of a swarm of
communicating microservices or actors.
148
• The problem addressed by microservices is that of HOW TO
DESIGN AND BUILD A LARGE, HETEROGENEOUS
APPLICATION THAT IS SECURE, MAINTAINABLE, FAULT
TOLERANT, AND SCALABLE.
• It is particularly important for LARGE ONLINE SERVICES
that need to support thousands of concurrent users
• The microservice solution to this challenge is to partition
the application into small, independent service
components communicating with simple, lightweight
mechanisms.
• The microservice paradigm design rules dictate that each
microservice must be able to be managed, replicated,
scaled, upgraded, and deployed independently of other
microservices.
149
• Each microservice must have a single function and
operate in a bounded context
• It has limited responsibility and limited dependence on
other services.
• The communication mechanisms used by microservice systems are varied:
• REST web service calls
• RPC mechanisms such as Google's gRPC
• The Advanced Message Queuing Protocol (AMQP)
150
Microservices and Container
Resource Managers
• Amazon ECS container service, Google Kubernetes,
Apache Mesos, and Mesosphere on Azure
151
Managing Identity in a Swarm
• Scenario:If some microservices need to access a queue of
events that you own, and others need to interact with a
database that you created, then you need to pass part of
your authority to those services so that they may make
invocations on your behalf.
• Solution 1:To pass these values as runtime parameters
through a secure channel to your remotely running
application or microservice
• Problems with Solution 1: first, microservices are designed to be shut down when not needed and scaled up in number when the demand is high.
• You therefore need to automate the process of fetching the credentials for each microservice reboot.
152
• Second, by PASSING THESE CREDENTIALS you endow your remote service with all of your authority; you would prefer to pass only a limited authority.
153
A Simple Microservices Example
Scenario:
• When scientists send technical papers to scientific journals, the abstracts of these
papers often make their way onto the Internet as a stream of news items, to which one
can subscribe via RSS feeds.
• A major source of high-quality streamed science data is arXiv arxiv.org, a
collection of more than one million open-access documents.
• Other sources include the Public Library of Science (PLOS one), Science, and Nature, as
well as online news sources.
• We have downloaded a small collection of records from arXiv, each containing a paper
title, an abstract, and, for some, a scientific topic as determined by a curator.
Solution:Our goal is to build a system that pulls document abstracts from the
various feeds and then uses one set of microservices to classify those abstracts
into the major topics of physics, biology, math, finance, and computer science, and a
second set to classify them into subtopic areas
154
A Simple Microservices Example
Requirements:
• Push documents from a Jupyter notebook into a cloud-based message queue
• A NoSQL table
• Figure: online scientific document classifier example, showing two levels and subcategories for biology and computer science
• Figure: document classifier version 1, showing the multiple predictor microservices
155
Amazon EC2 Container Service
• The Amazon EC2 Container Service (ECS) is a system to
manage clusters of servers devoted to launching and
managing microservices based on Docker containers.
• One or more sets of EC2 instances, each a logical unit
called a cluster.
• One default cluster is created for you; if required, we can add more.
• Task definitions, which specify information about the
containers in your application, such as how many containers
are part of your task, what resources they are to use, how
they are linked, and which host ports they are to use.
156
• Amazon Elastic Container Service (ECS), also known as Amazon EC2 Container Service, is a managed service that allows users to run Docker-based applications packaged as containers across a cluster of EC2 instances.
• Running simple containers on a single EC2 instance is simple, but running these applications on a cluster of instances and managing the cluster is an administratively heavy process.
157
• Amazon-hosted Docker image repositories.
• Storing your images here may make them faster to load
when needed, but you can also use the public Docker
Hub repository.
• Amazon refers to the EC2 VM instances in a cluster
as container instances.
• Amazon Identity and Access Management (IAM)
system to address the identity management issues.
158
• The IAM link in the Security subarea of the AWS management console takes you to the IAM Dashboard.
• Create a role, name it containerservice, and then select the role type.
• You need two roles: one for the container service (which actually refers to the VMs in our cluster) and one for the actual Docker services that we deploy.
• Scroll down the list of role types and look for Amazon EC2 Container Service Role and Amazon EC2 Container Service Task Role.
• Select the Container Service Role for containerservice.
• Save this, and now create a second role and call it mymicroservices.
159
• On the panel on the left is the link Roles.
• Select your containerservice role, and click Roles.
• You should now be able to attach various access policies to your role.
• The Attach Policy button exposes a list of over 400 access policies that you can attach.
• Add three policies:
• AmazonBigtableServiceFullAccess
• AmazonEC2ContainerServiceforEC2role
• AmazonEC2ContainerServiceRole
160
• The mymicroservices role needs policies for the Amazon Simple Queue Service (SQS) and DynamoDB: AmazonSQSFullAccess and AmazonDynamoDBFullAccess.
• Creating a cluster is now easy.
• From the Amazon ECS console, simply click Create Cluster, and then give it a name.
• Select the EC2 instance type you want and provide the number of instances.
• Under Container instance IAM ROLE, you should see the "containerservice" role.
• Select this, and select Create.
• You will then see the cluster listed on the cluster console, with the container instances running.
161
import boto3

client = boto3.client('ecs')

response = client.register_task_definition(
    family='predict',
    networkMode='bridge',
    taskRoleArn='arn:aws:iam::01233456789123:role/mymicroservices',
    containerDefinitions=[
        {
            'name': 'predict',
            'image': 'cloudbook/predict',
            'cpu': 20,
            'memoryReservation': 400,
            'essential': True,
        },
    ],
)

response = client.create_service(
    cluster='cloudbook',
    serviceName='predictor',
    taskDefinition='predict:5',
    desiredCount=8,
    deploymentConfiguration={
        'maximumPercent': 100,
        'minimumHealthyPercent': 50
    }
)
162
Eight instances of the predictor and two
instances of the table service running
163
queue = sqs.get_queue_by_name(QueueName='bookque')
abstracts, sites, titles = load_docs("path-to-documents", "sciml_data_arxiv")
for i in range(1330, 1430):
    queue.send_message(MessageBody='boto3',
        MessageAttributes={
            'Title': {'StringValue': titles[i], 'DataType': 'String'},
            'Source': {'StringValue': sites[i], 'DataType': 'String'},
            'Abstract': {'StringValue': abstracts[i], 'DataType': 'String'}
        })
164
Google’s Kubernetes
• This service can be both installed on a third-party
cloud and accessed within the Google Cloud.
• Creating a Kubernetes cluster on the Google Cloud is
easy.
• Google CLOUD PUB/SUB supports both push and pull subscribers.
166
• Google Kubernetes Engine (also known as GKE) is a
managed, production-ready environment for
running Docker containers in the Google cloud.
• It permits you to form multiple-node clusters while also providing access to all Kubernetes features.
167
168
• Deploy an instance of the open source queue service RabbitMQ (rabbitmq.com) on a VM running on Jetstream.
• Use the Python package called CELERY to communicate with the queue service.
• Celery is a distributed remote procedure call system for Python programs.
• The Celery view of the world is that you have a set of worker processes running on remote machines and a client process that invokes functions that are executed on the remote machines.
• Celery uses the ADVANCED MESSAGE QUEUING PROTOCOL (AMQP).
171
> celery worker -A predictor -b 'amqp://guest@brokerIPaddr'

from celery import Celery

app = Celery('predictor',
             broker='amqp://guest@brokerIPaddr',
             backend='amqp')

@app.task
def predict(statement):
    return ["stub call"]

res = predict.apply_async(["this is a science document…"])
print(res.get())
172
Mesos and Mesosphere
• Mesosphere (from Mesosphere.com) is a DATA CENTER
OPERATING SYSTEM (DCOS) based on the original
Berkeley Mesos system for managing clusters.
• The Apache Mesos distributed system kernel.
• The Marathon init system, which monitors
applications and services and, like Amazon ECS and
Kubernetes, automatically heals any failures.
• Mesos-DNS, a service discovery utility.
• ZooKeeper, a high-performance coordination
service to manage the installed DCOS services.
173
174
• When Mesosphere is deployed, it has A MASTER NODE, a backup master, and a set of workers that run the service containers.
• Azure supports the Mesosphere components listed above as well as another container management service called DOCKER SWARM.
• Azure also provides a set of DCOS command line tools.
175
• Mesosphere also provides excellent interactive service
management consoles.
• When you bring up Mesos on Azure through the Azure
Container Services, the console presents a view
of your service health, current CPU and memory
allocations, and current failure rate.
176
HTCondor
• The HTCondor (research.cs.wisc.edu/htcondor) high-throughput computing system is a particularly mature technology for scientific computing in the cloud.
• Globus Genomics system uses HTCondor to schedule
large numbers of bioinformatics pipelines on the
Amazon cloud
• GeoDeepDive geodeepdive.org, part of the NSF
EarthCube project, is an infrastructure for text and data
mining that uses HTCondor for large analyses of
massive text collections
177
• PEGASUS is a workflow system for managing large scientific computations on top of HTCondor.
• The HEPCLOUD PROJECT used HTCondor to process data from a high energy physics experiment.
178