PARAM_Rudra_User's_Manual-IITB-V1 (1)
USER MANUAL
Copyright Notice
Any technical documentation that is made available by C-DAC (Centre for Development of Advanced
Computing) is the copyrighted work of C-DAC and is owned by C-DAC. This technical documentation is being
delivered to you as is, and C-DAC makes no warranty as to its accuracy or use. Any use of the technical
documentation or the information contained therein is at the risk of the user. C-DAC reserves the right to
make changes without prior notice.
No part of this publication may be copied without the express written permission of C-DAC.
Trademarks
Other brands and product names mentioned in this manual may be trademarks or registered trademarks of
their respective companies and are hereby acknowledged.
Intended Audience
Typographic Conventions
Symbol: Blue underlined text
Meaning: A hyperlink or link you can click to go to a related section in this document or to a URL in your web browser.
Getting help
DISCLAIMER
The information contained in this document is subject to change without notice. C-DAC shall not be liable for errors contained herein or for incidental or consequential damages in connection with the performance or use of this manual.
PARAM Rudra – User Manual
Table of Contents
Introduction
System Architecture and Configuration
  System Hardware Specifications
  Login Nodes
  Service Nodes
  CPU Compute Nodes
  GPU Compute Nodes
  High Memory Compute Nodes
  Storage
  PARAM Rudra Architecture Diagram
  Operating System
  Primary Interconnection Network
  Secondary Interconnection Network
  Software Stack
Monitoring jobs
Getting Node and Partition details
Accounting
Investigating a job failure
I am familiar with PBS/TORQUE. How do I migrate to SLURM?
Addressing Basic Security Concerns
Introduction
This document is the user manual for the PARAM Rudra Supercomputing facility at IIT
Bombay. It covers a wide range of topics ranging from a detailed description of the
hardware infrastructure to the information required to utilize the supercomputer, such as
information about logging on to the supercomputer, submitting jobs, retrieving the results
onto the user's laptop/desktop, etc. In short, the manual describes all that one needs to
know to effectively utilize PARAM Rudra.
Login Nodes
Login nodes are typically used for administrative tasks such as editing, writing scripts,
transferring files, managing your jobs and the like. You will always get connected to one of
the login nodes. From the login nodes you can get connected to a Compute Node and
execute an interactive job or submit batch jobs through the batch system (SLURM) to run
your jobs on compute nodes. For all users, the PARAM Rudra login nodes are the entry points and hence are shared. By default, there is a limit on the CPU time a user can consume on a login node, as well as a per-user memory limit. If any of these limits is exceeded, the job will be terminated.
Login Nodes: 10
Service Nodes
PARAM Rudra is an aggregation of a large number of nodes connected through networks.
Management nodes play a crucial role in managing and monitoring every component of
PARAM Rudra cluster. This includes monitoring the health, load, and utilization of individual
components, as well as providing essential services such as security, management, and
monitoring to ensure the cluster functions smoothly.
Management nodes: 8
Visualization nodes: 2
Storage
● Based on Lustre parallel file system
● Total useable capacity of 4.4 PiB Primary Storage
● Throughput 100 GB/s
Operating System
The operating system on PARAM Rudra is Linux (AlmaLinux 8.9).
Network infrastructure
A robust network infrastructure is essential for implementing the basic functionalities of a
cluster. These functionalities include:
Technically, all of the above functionalities can be implemented in a single network. However, for optimal performance, economic suitability, and meeting specific requirements, these functionalities are implemented using two different networks based on different technologies, as explained below:
InfiniBand: NDR
Computing nodes of PARAM Rudra are interconnected by a high-bandwidth, low-latency interconnect network, specifically InfiniBand NDR. InfiniBand is a high-performance interconnect standard, with adapters and switches supplied by NVIDIA (Mellanox); it offers low communication latency, low power consumption and high throughput. All CPU nodes are connected via the InfiniBand interconnect network.
Software Stack
Software Stack is an aggregation of software components that work together to accomplish
various tasks. These tasks can range from facilitating users in executing their jobs to
enabling system administrators to manage the system efficiently. Each software component
within the stack is equipped with the necessary tools to achieve its specific task, and there
may be multiple components of different flavors for different sub-tasks. Users have the
flexibility to mix and match these components according to their preferences. For users, the
primary focus is on preparing executables, executing them with their datasets, and
visualizing the output. This typically involves compiling codes, linking them with
communication libraries, math libraries, and numerical algorithm libraries, preparing
executables, running them with desired datasets, monitoring job progress, collecting results,
and visualizing output.
System administrators, on the other hand, are concerned with ensuring optimal resource
utilization. To achieve this, they may require installation tools, health-check tools for all
components, efficient schedulers, and tools for resource allocation and usage monitoring.
The software stack provided with this system has a wide range of software components that
meet the needs of both users and administrators. Figure 2 illustrates the components of the
software stack.
Architecture: x86_64
Recommended process for creating a user account to access the PARAM Rudra:
1. Visit nsmindia.in
2. Navigate to the "How to access NSM HPC System" section, where you will find a link to
the User Creation portal.
3. Click on the provided link to access the registration page.
4. Fill in all required information on the registration page.
5. Select IITB as the institute.
6. Upload the necessary documents as instructed.
7. Once the form is complete, submit the details.
8. The NSM committee will review the submission.
9. If accepted, users will receive an email containing their user credentials and allocated
cluster.
1. PuTTY is the most popular open-source SSH client application for Windows. Following are the steps:
a) Download PuTTY from its official website.
b) Install PuTTY on your computer.
c) Launch PuTTY from your desktop or Start menu.
d) In the dialog, locate the "Hostname or IP Address" input field.
e) Enter the hostname of the cluster: paramrudra.iitb.ac.in
f) For all users, use port 4422 as the SSH port.
g) Select open, then enter your username
h) Enter the captcha when prompted, then input your password.
i) Press Enter to proceed with the connection.
2. Another popular tool is MobaXterm, a freely available third-party tool that can be used to access the HPC system and transfer files to the PARAM Rudra system from your local system (laptop/desktop). Here are the steps:
a) Download MobaXterm from its official website.
b) Install MobaXterm on your computer.
c) Launch MobaXterm from your desktop or Start menu.
d) Click on the "Session" button in MobaXterm.
e) Enter the hostname, along with your username.
f) For all users, use port 4422 as the SSH port.
g) Enter the captcha when prompted, then input your password.
h) Press Enter to proceed with the connection.
This is a native tool for Windows machines which can be used to transfer data from the
PARAM Rudra system through your local systems (laptop/desktop).
This is a native tool for Windows machines which can be used to access the PARAM Rudra system and transfer data from it through your local system (laptop/desktop).
ssh -p 4422 [username]@[hostname]
After entering captcha, you will be prompted for a password. Once entered, you will be
connected to the server.
After getting your credentials you may access the cluster; please remember the following points:
• When you log in to the cluster, you will land on one of the login nodes. The login node
serves as the primary gateway to the rest of the cluster, housing a job scheduler
(called SLURM) and other applications for creating and submitting jobs.
You can submit jobs to the queue, and they will execute when the required
resources become available.
• Please refrain from running jobs directly on the login node. Login nodes are intended
for compiling codes, transferring data and submitting jobs. If you run your job
directly on the login node, it will be terminated.
• By default, two directories are available (i.e. /home and /scratch). These directories
are available on the login node as well as the other nodes on the cluster. /scratch is
for temporary data storage, generally used to store data required for running jobs.
Users are requested to regularly back up their own data in the scratch directory. As per
policy, any files not accessed in the last three months will be permanently deleted.
First login
Whenever a newly created user on PARAM Rudra attempts to log in with the user ID and
temporary password provided via email by PARAM Rudra support, it is mandatory for the
user to change the password to one of their choosing. This ensures the security of your
account. It is recommended to use a strong password containing a combination of
lowercase and uppercase letters, numbers, and a few special characters that are easy for
you to remember.
Your password will be valid for 90 days. After 90 days, you will be prompted to change your password when you next attempt to log in, and you must provide a new password.
Forgot Password?
Please open a ticket regarding this issue, and the support team will assist you with your problem. Follow the steps below:
1. Visit the PARAM Rudra support site, which is the ticketing tool, by clicking on the following link: paramrudra.iitb.ac.in/support
2. Log in using your username or registered email ID.
3. Raise a ticket to request a password reset.
4. The support team will respond with an email for verification.
5. Once you acknowledge the email, the password will be reset for you, and you will receive an email confirming the same.
6. You can then log in using the temporary password provided and set a new password of your choice.
Users need to have their data and applications related to their project or research work on
PARAM Rudra. To store the data, special directories named “home” have been made
available to the users. While these directories are common to all the users, each user will
have their own directory with their username in the “/home/” directory, where they can
store their data.
/home/<username>/: This directory is generally used by the user to store their data and if
needed install their own applications.
However, there is a limit to the storage provided to users. The limits have been defined
according to quota over these directories, and all users will be allotted the same quota by
default. When a user wishes to transfer data from their local system (laptop/desktop) to the
HPC system, they can use various methods and tools.
A user using the ‘Windows’ operating system will have access to methods and tools native
to Microsoft Windows, as well as tools that can be installed on their Windows machine.
Linux operating system users, however, do not require any tool. They can simply use the
“scp” command on their terminal. Here’s how:
Example:
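The example in the original is a screenshot; the sketch below shows typical invocations. The filenames and username are illustrative, and note that scp takes the non-standard port with a capital -P:

```shell
# Copy a local file to your home directory on the cluster
scp -P 4422 input.dat username@paramrudra.iitb.ac.in:/home/username/

# Copy a results directory from the cluster back to the local machine
scp -P 4422 -r username@paramrudra.iitb.ac.in:/home/username/results ./
```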
The same command can be used to transfer data from the HPC system to another HPC system, or to your own system.
Note: The local system (laptop/desktop) must be connected to a network that allows access
to the HPC system. Additionally, please ensure that the firewall settings on your laptop are
configured to allow access from the HPC system.
Users are advised to keep a copy of their data once their project or research work is completed by transferring the data from PARAM Rudra to their local system (laptop/desktop). The tools described below can be used for these file transfers.
Tools
WinSCP is a popular, freely available tool that is very often used to transfer data between a Windows machine and a Linux machine. It is GUI based, which makes it very user-friendly.
Figure 6 - A snapshot of the "WinSCP" tool to transfer file to and from remote computer.
Note: The port used for the SFTP connection is 4422, not 22. Please change it to 4422 in the tool's settings.
Resource Management
This section explains how you interact with the resource manager. It covers information
about the resource manager, the definition of nodes within partitions, job policies,
scheduler information, the process of submitting jobs to the cluster, monitoring active jobs
and getting useful information about resource usage.
A cluster is a group of computers that work together to solve complex computational tasks
and presents itself to the user as a single system. For the resources of a cluster (e.g. CPUs,
GPUs, memory) to be used efficiently, a resource manager (also called workload manager or
batch-queuing system) is important. While there are many different resource managers
available, the resource manager at PARAM Rudra is SLURM. After submitting a job to the
cluster, SLURM will try to fulfill the job’s resource request by allocating resources to the job.
If the requested resources are already available, the job can start immediately. Otherwise,
the start of the job is delayed (pending) until enough resources are available. SLURM allows
you to monitor active (pending, running) jobs and to retrieve statistics about finished jobs.
SLURM Partitions
A partition is a logical grouping of nodes that share similar characteristics or resources.
Partitions help manage and allocate resources efficiently based on the specific
requirements of jobs or users. PARAM Rudra consists of three types of computational
nodes: CPU-only nodes, high-memory nodes (with 768 GB of memory), and GPU-enabled
nodes.
The following partitions/queues have been defined to meet different user requirements:
1. standard: By default, all user jobs will be submitted to the standard partition which
contains 572 nodes. These nodes consist of CPU and High Memory (HM) nodes.
2. CPU: This partition is specifically designed for nodes that only have CPU resources.
3. GPU: The GPU partition includes nodes equipped with NVIDIA A100 GPUs. Jobs
submitted to this partition will run on nodes that can leverage the high-performance
computing capabilities of A100 GPU cards for parallel processing tasks. The GPU
partition exclusively contains GPU nodes. If a user wishes to submit a job only on
GPU nodes, they need to specify the number of GPU cards along with the partition name.
4. hm: The High Memory partition is intended for nodes with a substantial amount of
RAM. Specifically, it accommodates CPU nodes that are equipped with 768 GB of
RAM, allowing jobs requiring large memory resources to be executed efficiently.
Walltime
The walltime parameter defines how long your job will run, with the maximum runtime
determined by the QoS Policy. The default walltime for every job is 2 hours, so users are
requested to explicitly specify the walltime in their scripts. If more than 4 days are required,
users can raise a query on the support portal of PARAM Rudra, and it will be addressed on a
case-by-case basis. If a job exceeds the specified walltime in the script, it will be terminated.
Specifying the appropriate walltime improves scheduling efficiency, resulting in enhanced
throughput for all jobs, including yours.
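For example, a job that needs four hours would carry a directive like the following in its script (the value itself is illustrative):

```shell
# Request 4 hours of walltime (acceptable formats include HH:MM:SS and D-HH:MM:SS)
#SBATCH --time=04:00:00
```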
Scheduling Type
PARAM Rudra has been configured with Slurm's backfill scheduling policy. This policy ensures higher system utilization: it starts lower-priority jobs if doing so does not delay the expected start time of any higher-priority job. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well.
Job Priority
The job's priority at any given time will be a weighted sum of all the factors that have been
enabled in the slurm.conf file. Job priority can be expressed as:
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (priority_job_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
All of the factors in this formula are floating point numbers that range from 0.0 to 1.0. The
weights are unsigned, 32-bit integers. The larger the number, the higher the job will be
positioned in the queue, and the sooner the job will be scheduled. A job's priority, and
hence its order in the queue, can vary over time. For example, the longer a job sits in the
queue, the higher its priority will grow when the age weight is non-zero.
Age Factor: The age factor represents the length of time a job has been sitting in the queue
and eligible to run.
Association Factor: Each association can be assigned an integer priority. The larger the
number, the greater the job priority will be for jobs that request this association. This
priority value is normalized to the highest priority of all the associations to become the
association factor.
Job Size Factor: The job size factor correlates to the number of nodes or CPUs the job has
requested.
Nice Factor: Users can adjust the priority of their own jobs by setting the nice value on their
jobs. Like the system nice, positive values negatively impact a job's priority and negative
values increase a job's priority. Only privileged users can specify a negative value.
Partition Factor: Each node partition can be assigned an integer priority. The larger the
number, the greater the job priority will be for jobs that request to run in this partition.
Quality of Service (QOS) Factor: Each QOS can be assigned an integer priority. The larger
the number, the greater the job priority will be for jobs that request this QOS.
Fair-share Factor: The fair-share component to a job's priority influences the order in which
a user's queued jobs are scheduled to run based on the portion of the computing resources
they have been allocated and the resources their jobs have already consumed.
Job Submission
We can submit jobs either through a SLURM script or by using the interactive method.
Creating a SLURM script is the optimal way to submit a job to the cluster.
#!/bin/bash
$ sbatch slurm-job.sh
Note that the Slurm -J option is used to give the job a name.
#!/bin/bash
#SBATCH -p standard
#SBATCH -J simple
sleep 60
Submit the job:
$ sbatch simple.sh
Submitted batch job 149
Now we'll submit another job that's dependent on the previous job. There are many ways to
specify the dependency conditions, but the "singleton" method is the simplest. The Slurm -d
singleton argument tells Slurm not to dispatch this job until all previous jobs with the same
name have completed.
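Assuming the same script name as in the example above, the dependent job could be submitted like this:

```shell
# Give the job the same name and make it wait until all earlier
# jobs with that name have completed before it is dispatched
$ sbatch -J simple -d singleton simple.sh
```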
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
150 standard simple user1 R 0:31 1 rpcn001
Slurm has the ability to reserve resources for jobs being executed by select users and/or
select bank accounts. A resource reservation identifies the resources in that reservation and
a time period during which the reservation is available. The resources which can be reserved
include cores and nodes.
Use the command given below to check the reservation name allocated to your user
account. If your user account is associated with any reservation, the command will show it.
For example, suppose the given reservation name is user_11. Use the command given below
to make use of this reservation.
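The commands themselves appear as screenshots in the original; the standard SLURM equivalents would be as follows (the reservation name user_11 and script name are the illustrative values from the text):

```shell
# Check which reservations exist and whether your account is listed
$ scontrol show reservation

# Submit a job into the reservation
$ sbatch --reservation=user_11 job.sh
```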
A SLURM job array is a collection of jobs that differ from each other only by a single index
parameter. Job arrays can be used to submit and manage a large number of jobs with
similar settings.
-N1 specifies the number of nodes you want to use for your job (for example, -N1 for one node, -N4 for four nodes). Instead of tmp here, you can use the example script below.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=48
#SBATCH --error=job.%A_%a.err
#SBATCH --output=job.%A_%a.out
#SBATCH --time=01:00:00
#SBATCH --partition=standard
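The script above can then be submitted as an array; the index range below is illustrative. In the filename patterns, %A expands to the parent job ID and %a to the array index:

```shell
# Run 10 copies of the script, with SLURM_ARRAY_TASK_ID set to 1..10 in each
$ sbatch --array=1-10 array-job.sh
```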
The following command asks for a single core in one hour with the default amount of
memory.
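The command is shown as a screenshot in the original; a typical equivalent using salloc would be:

```shell
# Ask for 1 core on 1 node for one hour in the standard partition
$ salloc -N 1 --ntasks-per-node=1 --time=01:00:00 --partition=standard
```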
The command prompt of the allocated compute node will appear as soon as the job starts.
If the job is waiting for the resources, then this is how it will look :
If, after a while, resources are allocated, it will look like this:
If you exceed the time or memory limit, the job will also abort.
Please note that PARAM Rudra is NOT meant for executing interactive jobs. However, it can
be utilized to quickly verify the successful execution of a job before submitting a larger
batch job with a high iteration count. It can also be used for running small jobs. However,
it's important to consider that other users may also be utilizing this node, so it's advisable
not to inconvenience them by running large jobs.
There are various use cases for requesting interactive resources, such as debugging
(launching a job, adjusting setup parameters like compile options, relaunching the job, and
making further adjustments) and interactive interfaces (inspecting a node, etc.).
#!/bin/bash
# Number of nodes
#SBATCH -N 1
# Number of cores per node
#SBATCH --ntasks-per-node=1
# Name of error file
#SBATCH --error=job.%J.err
# Name of output file
#SBATCH --output=job.%J.out
# Walltime required to execute the program
#SBATCH --time=01:00:00
# Partition (queue) name; standard is the default, used if you do not
# specify any partition. For other partitions you can specify hm or gpu
#SBATCH --partition=standard
#!/bin/bash
# Number of nodes
#SBATCH -N 1
# Number of cores per node
#SBATCH --ntasks-per-node=48
# Name of error file
#SBATCH --error=job.%J.err
# Name of output file
#SBATCH --output=job.%J.out
# Walltime required to execute the program
#SBATCH --time=01:00:00
# Partition name
#SBATCH --partition=cpu
#!/bin/sh
export I_MPI_FALLBACK=disable
export I_MPI_FABRICS=shm:dapl
# Level of MPI debug verbosity
export I_MPI_DEBUG=9
#!/bin/sh
export I_MPI_FABRICS=shm:dapl
# The total number of MPI ranks will then typically be the total number of
# cores (here 16 nodes x 48 cores/node = 768) divided by the number of
# OpenMP threads per MPI rank (24), i.e. 32 ranks
export OMP_NUM_THREADS=24
Listing Partitions
sinfo displays information about nodes and partitions, allowing users to view the available
nodes in each partition of the cluster.
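Typical invocations:

```shell
# Summarize all partitions and node states
$ sinfo

# Show only a specific partition, e.g. gpu
$ sinfo -p gpu
```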
Monitoring jobs
Monitoring jobs on SLURM can be done using the squeue command, which provides
high-level information about jobs in the Slurm scheduling queue (state information,
allocated resources, runtime, etc.).
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
106 standard slurm-jo user1 R 0:04 1 rpcn001
The command scontrol provides even more detailed information about jobs and job steps.
It will report more detailed information about nodes, partitions, jobs, job steps, and
configuration.
The scontrol update command changes attributes of a submitted job, such as the time limit, node list, or number of nodes. For example:
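The job ID and new time limit below are illustrative:

```shell
# Inspect a job in detail
$ scontrol show job 106

# Extend the job's time limit to two hours
$ scontrol update JobId=106 TimeLimit=02:00:00
```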
Deleting jobs:
Use the scancel command to delete active jobs. Users can only cancel their own jobs.
$ scancel <jobid>
$ scancel 135
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Holding a job:
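The hold command appears as a screenshot in the original; the standard SLURM equivalent is (the job ID is illustrative):

```shell
# Prevent a pending job from starting; it stays in the queue in a held state
$ scontrol hold 139
```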
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 standard simple user1 PD 0:00 1 (Dependency)
138 standard simple user1 R 0:16 1 rpcn001
Releasing a job:
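A held job is released the same way (the job ID is illustrative):

```shell
# Allow a previously held job to be scheduled again
$ scontrol release 139
```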
scontrol show partition <partition name> shows detailed information about a specific
partition.
Accounting
The accounting system tracks and manages HPC resource usage. As jobs complete or
resources are utilized, accounts are charged and resource usage is recorded. The accounting
policy works like a banking system: each department can be allocated a predefined budget
for CPU usage on a quarterly basis. As resources are utilized, the amount is deducted. The
allocation is reset half-yearly. Depending on the policy, users will be informed when their
account is created about how many CPU hours have been allocated to them.
sacct
This command can report resource usage for running or completed jobs, including individual
tasks, which can be useful for detecting load imbalance between tasks.
$ sacct -j <jobid>
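A more detailed report can be requested with --format; the field list here is one reasonable choice:

```shell
$ sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS
```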
This section discusses methods to gather information and find ways to avoid common
issues.
It is important to collect error and output messages by either writing this information to the
default location or specifying specific locations using the --error/--output option.
Redirecting the error/output stream to /dev/null should be avoided unless you fully
understand its implications, as error and output messages serve as the initial point for
investigating job failures.
If a job exceeds the runtime or memory limit, it will get killed by SLURM.
Software Errors
The exit code of a job is captured by Slurm and saved as part of the job record. For sbatch
jobs the exit code of the batch script is captured. For srun, the exit code will be the return
value of the executed command. Any non-zero exit code is considered a job failure, and
results in job state of FAILED. When a signal was responsible for a job/step termination, the
signal number will also be captured, and displayed after the exit code (separated by a
colon).
Per user
• Every user will have a soft-limit quota of 50 GB in the HOME file system (/home) and
200 GB in the SCRATCH file system.
• Users are recommended to copy their execution environment and input files to the
scratch file system (/scratch/<username>) while their jobs run, and to copy output data
back to the HOME area.
• A file retention policy has been implemented on the Lustre storage for the "/scratch" file
system. As per the policy, any files that have not been accessed for the last 3 months
will be deleted permanently.
It is important to note:
• Compilations are performed on the login node. Only the execution is scheduled via
SLURM on the compute nodes.
• It is important to collect error/output messages, either by writing such information
to the default location or by specifying specific locations using the --error or --output
option. Error and output messages serve as the starting point for investigating a job
failure. If not specified, the Job Id is also appended to the output and error
filenames.
• Submitting a series of similar jobs as an array job instead of one by one is crucial for
improving backfilling performance and thus job throughput.
• Users have to specify #SBATCH --gres=gpu:1 (or --gres=gpu:2) in their job script if they
want to use 1 or 2 GPU cards on GPU nodes.
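A minimal sketch of a GPU job script following this rule (the resource values are illustrative, and ./my_gpu_app stands in for your actual application):

```shell
#!/bin/bash
# One node, one task, one GPU card in the gpu partition
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --time=01:00:00

# Replace with your actual GPU application
./my_gpu_app
```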
PARAM Rudra extensively uses Spack. The purpose of Spack is to provide freedom to users
for loading required applications or packages of specific versions with all its dependencies in
the user environment. Users can find the list of all installed packages with their specific
versions and dependencies. This also specifies which version of the application is available
for a given session. All applications and libraries are made available through Spack. A user
has to load the appropriate package from the available packages.
Introduction
Spack automates the download-build-install process for software - including dependencies -
and provides convenient management of versions and build configurations. It is designed to
support multiple versions and configurations of software on a wide variety of platforms and
environments. It is designed for large supercomputing centers, where many users and
application teams share common installations of software on clusters with exotic
architectures, using libraries that do not have a standard ABI. Spack is non-destructive:
installing a new version does not break existing installations, so many configurations can
coexist on the same system.
As shown in the screenshot above, source the line below (including the initial dot):
$ . /home/apps/spack/share/spack/setup-env.sh
The spack find command is used to query installed packages on PARAM Rudra. Note that
some packages appear identical with the default output. The -l flag shows the hash of each
package, and the -f flag shows any non-empty compiler flags of those packages.
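For example, typical queries look like the following; the package names shown are illustrative, not an exact listing of what is installed:

```shell
$ spack find              # list all installed packages
$ spack find -l gcc       # -l: show the short hash of each gcc installation
$ spack find -lf openmpi  # -f: additionally show non-empty compiler flags
```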
spack compilers
Spack manages a list of available compilers on the system, detected automatically from the
user's PATH variable. The spack compilers command is an alias for spack compiler list.
$ spack compilers
==> Available compilers
-- gcc almalinux8-x86_64 ----------------------------------------
gcc@8.5.0 gcc@14.2.0 gcc@13.3.0 gcc@12.4.0
spack list
The spack list command can also take a query string. Spack automatically adds wildcards to
both ends of the string, or you can add your own wildcards.
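For instance, the following queries are equivalent ways of searching the package list; the exact matches depend on the Spack version installed:

```shell
$ spack list mpi          # implicit wildcards: matches openmpi, mpich, ...
$ spack list 'py-*'       # explicit wildcard: all packages starting with py-
```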
spack install
The above command installs gromacs version 2020.5 with BLAS and CUDA support and without
MPI support. For BLAS there are multiple providers, such as OpenBLAS, Intel MKL, amdblis,
and essl; ^intel-mkl tells Spack to use intel-mkl for the BLAS routines.
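The command itself appears as a screenshot in the original; a spec matching that description would be written as follows (variant names as in the Spack gromacs package, to the best of our knowledge):

```shell
$ spack install gromacs@2020.5 +blas +cuda ~mpi ^intel-mkl
```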
Operators in Spack
Uninstalling Packages
Earlier we installed many configurations of zlib. Now we will go through and uninstall some
of those packages that we didn’t really need.
Using Environments
Spack has an environments feature with which you can group installed software. You can
install software with different versions and dependencies in each environment, and switch
the software in use at once by changing environments. You can create a Spack environment
with the spack env create command; you can create multiple environments by specifying
different environment names.
To activate the created environment, type spack env activate. Adding the -p option will
display the currently activated environment on your console. Then install the software
you need into the activated environment.
You can deactivate the environment with spack env deactivate. To switch to another
environment, type spack env activate with its name to activate it.
Use spack env list to display the list of created Spack environments.
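The full cycle, using a hypothetical environment name myproject:

```shell
$ spack env create myproject         # create a new environment
$ spack env activate -p myproject    # activate it; -p shows it in the prompt
$ spack install zlib                 # installs into the active environment
$ spack env deactivate               # leave the environment
$ spack env list                     # list all created environments
```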
Once we have specified a package's recipe, users of the recipe can ask Spack to build the
software with different features on any of the supported systems. Refer to the Packaging
Guide in the Spack documentation for a detailed understanding of Spack packaging.
In the spec file below we have used Linewidth, an IISc-developed code. See the bold lines
for comments related to the preceding lines in the spec file of the Spack package named
IiscLinewidth:
The source code of Linewidth was not available through a public repository like GitHub, so
the os package needed to be imported.
1. version - os.getcwd() expects the source tar to be present in the current working
directory. sha256 - to check the sha256 checksum we added it in the version clause; as a
placeholder we have given the version as 1.
manual_download = True means Spack will not try to download the source code for the
package.
name - make sure that the name of the tar file is the same as that used inside the
package recipe.
2. variant - Users can control the behavior of the application being built through this
clause, e.g. to enable MPI support we defined it to be true by default.
3. depends_on() - This clause defines all dependencies required to build the given
application.
4. @property - With this decorator we can define properties for the build system, like
edit, build, and install.
5. build_targets property - Defines the logic of building the source for the native
platform.
6. install property - Defines the install procedure to be used after building the source
code, e.g. in our example we define the prefix path.
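A skeletal recipe along those lines is sketched below. This is an illustration only, not the actual IiscLinewidth recipe; the class body just shows where each clause described above would go, and the file and target names are placeholders.

```python
# package.py sketch for a manually downloaded, Makefile-based code
import os
from spack.package import *


class Iisclinewidth(MakefilePackage):
    """Linewidth code developed at IISc (illustrative sketch)."""

    manual_download = True                     # Spack will not fetch the source
    version("1", sha256="<tarball-checksum>")  # placeholder version + checksum

    variant("mpi", default=True, description="Enable MPI support")

    depends_on("mpi", when="+mpi")
    depends_on("intel-mkl")

    def url_for_version(self, version):
        # expect the source tar in the current working directory
        return "file://{0}/iisclinewidth-{1}.tar.gz".format(os.getcwd(), version)

    @property
    def build_targets(self):
        # logic for building the source for the native platform
        return ["all"]

    def install(self, spec, prefix):
        # install procedure after the build; we define the prefix path here
        mkdirp(prefix.bin)
        install("linewidth", prefix.bin)
```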
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p cpu
## gpu/standard
#SBATCH --exclusive
#SBATCH -t 1:00:00
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURM_NTASKS"=$SLURM_NTASKS
export SPACK_ROOT=/home/apps/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load intel-mpi@2019.10.317
spack load intel-mkl@2020.4.304
spack load intel-oneapi-compilers@2025.0.1
spack load gcc@12.4.0
(time <executable_path>)
The compilations are done on the login node, whereas the execution happens on the
compute nodes via the scheduler (SLURM).
Note: Compilation and execution must be done with the same libraries and matching
versions to avoid unexpected results.
Steps:
The directory contains a few sample programs and their sample job submission scripts. The
compilation and execution instructions are described in the beginning of the respective files.
Users can copy the directory to their home directory and try compiling and executing
these sample codes. The command for copying is as follows:
cp -r /home/apps/Docs/samples/ ~/.
It is recommended to use the Intel compilers, since they are better optimized for the
hardware.
Compilers
Optimization Flags
Optimization flags are meant for uniprocessor optimization, wherein the compiler tries to
optimize the program on the basis of the level of optimization. The optimization flags may
also change the precision of the output produced by the executable. The optimization flags
can be explored further on the respective compiler pages. A few examples are given below.
Given next is a brief description of compilation and execution of the various types of
programs. However, for certain bigger applications, loading of additional dependency
libraries might be required.
C Program:
Setting up of environment:
spack load gcc@13.3.0/3wdpf6l
C + OpenMP Program:
Setting up of environment:
spack load gcc@13.2.0/3wdpf6l
spack load intel-oneapi-compilers@2024.2.1
Compilation: icc -O3 -xHost -qopenmp <<prog_name.c>>
Execution: ./a.out
C + MPI Program:
Setting up of environment:
spack load gcc@13.2.0/3wdpf6l
spack load intel-oneapi-compilers@2024.2.1
Compilation: mpiicc -O3 -xHost <<prog_name.c>>
Execution: mpirun -n <<num_procs>> ./a.out
C + MKL Program:
Setting up of environment:
spack load gcc@13.2.0/3wdpf6l
spack load intel-oneapi-compilers@2024.2.1
Compilation: icc -O3 -xHost -mkl <<prog_name.c>>
Execution: ./a.out
CUDA Program:
Setting up of environment:
spack load gcc@12.4.0/3wdpf6l
spack load cuda/ezhcbzk
Example (1)
Compilation: nvcc -arch=sm_80 <<prog_name.cu>>
Execution: ./a.out
Example (2)
Compilation: nvcc -arch=sm_80 /home/apps/Docs/samples/mm_blas.cu -lcublas
Execution: ./a.out
CUDA + OpenMP Program:
Setting up of environment:
spack load gcc@12.4.0/3wdpf6l
spack load cuda/ezhcbzk
Example (1)
Compilation: nvcc -arch=sm_80 -Xcompiler="-fopenmp" -lgomp
/home/apps/Docs/samples/mm_blas_omp.cu -lcublas
Execution: ./a.out
Example (2)
Compilation: g++ -fopenmp /home/apps/Docs/samples/mm_blas_omp.c \
-I/home/apps/spack/opt/spack/linux-almalinux8-cascadelake/gcc-12.4.0/cuda-12.6.3-ezhcbzkhxdihcdrq6lp4df3stnmrza4b/include \
-L/home/apps/spack/opt/spack/linux-almalinux8-cascadelake/gcc-12.4.0/cuda-12.6.3-ezhcbzkhxdihcdrq6lp4df3stnmrza4b/lib64 -lcublas
Execution: ./a.out
OpenACC Program:
Setting up of environment:
spack load pgi@19.10 cuda@10.1
Introduction
A debugger or debugging tool is a computer program that is used to test and debug other
programs (the "target" program).
When the program "traps" or reaches a preset condition, the debugger typically shows the
location in the original code if it is a source-level debugger or symbolic debugger, commonly
now seen in integrated development environments.
Debuggers also offer more sophisticated functions such as running a program step by step
(single-stepping or program animation), stopping (breaking) (pausing the program to
examine the current state) at some event or specified instruction by means of a breakpoint,
and tracking the values of variables.
Some debuggers have the ability to modify program state while it is running. It may also be
possible to continue execution at a different location in the program to bypass a crash or
logical error.
Basics: How To
Compilation
Compilation with the additional flag '-g' is required, since the program needs to be
compiled with debugging symbols.
gcc -g <program_name.c>
gdb <executable.out>
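Putting the two commands together, a typical interactive session looks like the sketch below; the file prog.c and the line number are hypothetical:

```shell
$ gcc -g prog.c -o prog       # compile with debugging symbols
$ gdb ./prog                  # load the binary into the debugger
(gdb) break 13                # set a breakpoint at line 13
(gdb) run                     # start execution; stops at the breakpoint
(gdb) print i                 # inspect a variable
(gdb) continue                # resume until the next break or program end
(gdb) quit
```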
Start:
Starts the program execution and stops at the first line of the main procedure. Command
line arguments may be provided, if any.
Run:
Starts the program execution without stopping at the beginning; it stops only when an
error or program trap occurs. Command line arguments may be provided, if any.
Help:
Prints the list of commands available. Specifying 'help' followed by a command (e.g. 'help
run') displays more information about that command.
File <filename>:
Loads a binary program that is compiled with ‘-g’ flag for debugging.
List [line_no]:
Displays the source code (nearby 10 lines) of the program in execution where the execution
stopped. If ‘line_no’ is specified, it displays the source code (10 lines) at the specified line.
Info:
Displays more information about the set of utilities and saved information by the debugger.
For example; ‘info breakpoints’ will list all the breakpoints, similarly ‘info watchpoints’ will
list all the watchpoints set by the user while debugging their programs.
Print <expression>:
Prints the values of variables / expression at the current running instance of the program.
Step N:
Steps the program one (or 'N') instructions ahead, or until the program stops for any
reason. Steps into each and every instruction, even if it is a function call (only for
functions or instructions compiled with debugging flags).
next:
This command also steps through the instructions of the program. Unlike the ‘step’
command, if the current source code line calls a subroutine, this command does not enter
the subroutine, but instead steps over the call, in effect treating it as a single source line.
Continue:
This command continues the stopped program till the next breakpoint has occurred or till
the end of the program. It is used to continue from a paused/debug point state.
Break <line_no>:
Stops the program at the specified line number and provides a breakpoint for the user. A
specific source code file and a condition-based breakpoint can also be set for specific
cases. You can view the list of breakpoints set by using the 'info breakpoints' command.
watch <expression>:
A watchpoint means break the program or stop the execution of the program when the
value of the expression provided is changed. Using watch command specific variables can be
watched for value changes. You can also view the list of watchpoints by using the ‘info
watchpoints’ command.
Delete:
Deletes a breakpoint or a watchpoint that has been set by the user while debugging the
program.
Backtrace:
Prints the backtrace of all stack frames of the program, providing the call stack and
other information about the running program.
These are some of the most useful utilities for debugging your programs with gdb. gdb is
not limited to these commands; it contains a rich feature set that also allows you to
debug multi-threaded programs. All the commands, including the ones listed above, have
many variants for more in-depth control; these can be explored through gdb's help pages.
Things to note:
1) We have a few libraries included for the functions that are used in the program.
2) We have two ‘#define’ statements:
a. 'N' for the number of iterations the 'rand_fract' function spends
calculating the random number.
b. ‘N_LEN’ for the length of the final random number string generated.
Currently it is set to ‘100’ which means that the long random number will be
of length 100.
3) Then we have a function named 'rand_fract' that iterates over two loops and, using
the values of the iterators ('i' and 'j'), calculates a small random number. Since the
'rand()' function is used for the outer loop, its number of iterations cannot be
predicted, which gives the function its random nature.
4) The next function is as simple as its name: it just takes an unsigned integer and
returns its factorial.
PART 2:
Things to note:
The program ended up with a core dump without giving much information but just ‘Floating
point exception’. Now let’s compile the code with debugging information and run the
program simply with gdb.
Here we compiled the code using '-g' and then used the 'run' command we studied earlier to
run the program. You can observe that the debugger stopped at line number 13, where the
'Floating point exception (SIGFPE)' occurred. At this point we could even go and check the
code at line number 13, but for now let's check what other information we can get from the
debugger. Let's check the values of the variables 'i' and 'j' at this point.
The values of both ‘i’ and ‘j’ appear to be ‘0’ and thus a divide by zero exception is what
caused our program to terminate. Let’s update the code such that the value of ‘i’ and ‘j’ will
never become ‘0’. This is the modified code:
We simply updated the loop index variables to start from '1' instead of '0'. Using gdb, it
was very simple to identify the point where the error occurred. Let's re-run our updated
code and check what we get.
What!? This is unexpected. We just cured the faulty part of our program and are still
getting an FPE. Let's go through the debugger and check where the error point is now.
The debugger output shows that the error occurred on the same line as earlier. But in this
case, the value of ‘i’ and ‘j’ are not ‘0,0’ but they are ‘1, -1’ which is causing the denominator
at line 13 to be ‘0’ and thus, causing an FPE. In addition to print commands, we have also
issued the ‘list’ command which shows the nearby 10 lines of the code where the program
stopped.
You can observe that some bugs in the programs are easier to debug but some aren’t.
We will have to dig much deeper to find out what is going on. Also, note that our inner
loop iterates from 1 to N (which is 100), but the value of 'j' is printed as '-1'. How is
this even possible!? A sharp-eyed programmer may already have spotted the problem, but
let's stick to the basics of gdb. Let us use the 'break' command, set a breakpoint at line
number 13, and observe what is going on.
Thus, using the command 'break 13' we set a breakpoint at line number 13, which was
verified using the 'info breakpoints' command. Then we reran the program with the 'run'
command. At line 13 the program stopped, and using the 'print' command we checked the
values of 'i' and 'j'. At this point, all seems to be well. Now let's proceed further. For
stepping one instruction we can use the 'step' command. Let's do that and observe the
value of 'j'.
You can observe the usage of the ‘step’ command. We are going through the program line
by line and checking the values of the variable ‘j’.
There seems to be a lot of writing/typing of the ‘step’ command just to proceed with the
program. Since we have already set a breakpoint at line 13, we can use another command
called ‘continue’. This command continues the program till the next breakpoint or the end
of the program.
You can see that we replaced typing the 'step' command three times with a single
'continue' command. But this still has us writing 'continue' and 'print' multiple times.
Let us use another utility in gdb known as 'data breakpoints', also called watchpoints.
But before that, let us delete the existing breakpoint using the 'delete' command.
Thus, using the command ‘watch j’ we have set a watchpoint over ‘j’. Now every time when
the value of ‘j’ changes, a break will occur. You can also note the old and new values of ‘j’
printed out at each break. Another point to note is that after having one ‘continue’
command, the program had a break. Further, by just pressing the ‘Enter/Return’ button on
the keyboard, the continue command was repeated. Thus, by pressing the ‘Enter/Return’
button, the last command is repeated. At this point, we have learned much about the
debugger, but we are still not able to proceed fast with our error. Is there any other way to
proceed? Well, yes!!
We want to observe at the point where the value of ‘j’ reaches closer to ‘N i.e. 100’. Which
means that we are only concerned about what happens after ‘j’ reaches 99. Here, we land
up on using what are called conditional breakpoints. First, we will delete our watchpoint and
then make use of the conditional breakpoint.
You can observe another variant of the ‘break’ command. We have explicitly stated the file
and the line number along with a condition to stop. This is useful, when the source code is
large and has multiple files. After setting a conditional break, we stopped at the point where
the value of ‘j’ becomes ‘99’. Now, let us see what happens next. Since, this is a critical point
at which we could observe the program, it is better if we step in the program using the
‘step’ command instead of relying on any break/watch points.
This is unexpected!! The value of 'j' should never be 100 or anything above it.
By observation, we figure out that the condition itself is wrong: it should have been
'j < N' instead of 'i < N'. This is a silly programmer's mistake that led us to all this
effort.
Also, the value of 'j' that was observed as '-1' was an outcome of 'short' data type
overflow, i.e. the value of 'j' went from 1 up to 32767 (assuming a 2-byte short) and
then wrapped around from -32768 back towards -1.
Finally, a hard programming bug was discovered. Let us correct this error and rerun the
program.
Figure 30 – Again, Dumping Core!! Things are getting interesting or frustrating or both!!
This is strange!!
Sometimes the program is getting the correct output, but sometimes, we are getting a
segmentation fault. Debugging such a program may be tricky since the occurrence of the
bug is low. We will proceed with our standard debugger steps to identify the error.
We compiled the code and ran it using the debugger. But the program completed
successfully. Let us rerun it till a point where the program fails.
Here we observe a point where the program exited at the function ‘factorial’.
This is a point where the debugger didn't give much information about what the value of
the variable 'x' was; it just pointed out that the program failed in the function named
'factorial'. That's it!
Another reason for this kind of output is the recursive nature of the function: the stack
frame where 'factorial' failed could be deep in a long nest of recursive calls. At such
points, it is better to inspect the program at an earlier point and look for errors. Let
us set a breakpoint before the 'factorial' function is called and view the values of the
parameters passed to it.
Thus, we set a breakpoint before the call to 'factorial' and ran the program. For the
value 'f1 = 8' the 'factorial' call seems to exit normally. Let us rerun.
Unexpectedly, we got the value of 'f1' as '-8' and the program seems to have crashed. Let
us observe the 'rand_fract' and 'factorial' functions once again, and study where the
functions could produce a negative number.
The 'rand_fract' function returns the datatype 'short', while the calculation of the
return value can be significantly large and may overflow a 'short', thus producing a
negative result.
The function 'factorial' expects a value of type 'unsigned int'. Since the value passed
to it is negative, the implicit conversion from a negative number to an unsigned number
means that a very large value is passed to the factorial function.
Also, since the 'factorial' function is recursive, passing a very large number to it
causes a very deep chain of calls to the same function, eventually overflowing the stack
provided to the user.
Now let us step further into our program and see whether what we are discussing is the
same behavior that is being observed.
Stepping in further reveals the recursive behavior of the 'factorial' function, i.e. each
call makes a sub-call to the same function with a value one less. So what should be done
in such cases? Assume you have a large code base where these functions are called from
multiple locations. Modifying the signature of either function means changing the code
everywhere the function is called, which is not feasible. In such cases, a choice has to
be made about where to patch the code while preserving the semantics of the program.
Let us observe a piece of code where this change can be made and then test our program
for the expected results.
By observing the code, we find that the expected value of 'f1' is between 0 and 9
(because of the modulo 10 operation).
Thus, without changing the signature of any function, we inserted a patch (the
highlighted portion) that maintains the semantics of the code and also cures the problem
we had. Now let us rerun and check our final program.
Figure 38 – Resolved
Conclusion
We started with a program that we assumed to be functional, but it ended up with bugs
that were not straightforward. We then explored the power of the debugger and the various
ways to identify bugs in a program. We looked at the easy cases first and slowly migrated
towards the types of bugs that are not easily traceable. Finally, we identified and
corrected all the bugs in our program with the help of the debugger and arrived at
bug-free code.
Points to Note
● A bug in a program is not necessarily a compilation error.
● One type of error can be caused by multiple bugs in the same line of code.
● Sometimes it is not possible to change the code even when the problem is
identified. The best way to cure this is to study the behavior of the code and apply
patches wherever necessary.
● Simple utilities from the 'GNU Debugger' can help in getting rid of bugs in large
programs.
Most of the popular Python-based ML/DL libraries are installed on the PARAM Rudra system.
While developing and testing their applications, users can use the conda-based Python
installation.
Different modules are prepared for the conda environments. Users can check the list of
modules with the "module avail" command. Shown below is an example of loading conda
environments in the current bash shell and continuing with application development.
Once logged into the PARAM Rudra HPC cluster, check which libraries are available and
loaded in the current shell. To check the list of modules loaded in the current shell,
use the command given below:
$ module list
To check all modules available on the system, but not loaded currently, use the command
given below:
$ module avail
A default, framework-specific conda environment has been made available for users to
start application development; it is installed with most of the popular Python packages,
as shown below.
To use the base conda environment, first access and load the miniconda module, which
provides the base environment installed with the default packages:
To see the list of other packages installed, use the command given below:
$ conda list
We provide multiple conda environments that include basic machine learning packages, as
well as common image processing and natural language processing packages, for your
machine learning projects.
The following table shows currently available conda environments with their version (all
include GPU support):
Pytorch 0.28.1
To activate any one of the environments on PARAM Rudra, load the module "ENV_NAME" as
shown below:
Once the "ENV_NAME" module is loaded, end users can use all the libraries inside their
Python programs. Users can load those libraries using the "module load" command and use
them in their applications.
Example: to activate the Pytorch environment on PARAM Rudra, use module load Pytorch as
shown below:
This will activate the Pytorch environment, in which users can use the pytorch library
and its related functionalities.
After loading the module, you will have access to conda commands, including:
You have two options to install your own Python packages in our machine learning
environment:
Consider the benefits and disadvantages of each method, before choosing which works best
for you.
NOTE: Use Conda primarily for environment management, especially in scientific computing
and data science projects where non-Python dependencies are common.
Use pip for installing Python packages from PyPI when you don't need the advanced
environment management features provided by Conda.
Creating an environment can take up a significant portion of your disk quota, depending on
the packages installed. To ensure that you can use your conda environment properly, please
familiarize yourself with all the basic conda commands.
Conda-based installation provides the latest version of each DL framework; however, users
can install their own choice of DL framework or library version locally by following the
steps below.
Step 3. Create the local environment myenv (myenv is the environment name; you can give
any name of your choice).
Step 5. Install your own DL framework / Python library; <package-name> is replaced by the
desired package that the user wants to install.
Now you can use the newly installed package in your python program.
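The elided steps are screenshots in the original; a typical end-to-end sequence (environment and package names are placeholders) is:

```shell
$ module load miniconda                 # make conda available
$ conda create -n myenv python=3.10     # create the local environment "myenv"
$ conda activate myenv                  # switch into it
$ pip install <package-name>            # install the desired framework/library
$ conda deactivate                      # leave the environment when done
```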
#!/bin/bash -x
#SBATCH -N 1
#SBATCH --ntasks-per-node=<np>
#SBATCH -p cpu
#SBATCH -J <job_name>
#SBATCH -t 05:00:00
#SBATCH -o %j.out # name of stdout output file(--output)
#SBATCH -e %j.err # name of stderr error file(--error)
cd $SLURM_SUBMIT_DIR # change to the directory the job was submitted from
module purge
module load miniconda # load the module and environment
conda activate <env_name> # load working environment
python <script>.py # run python script
conda deactivate # deactivate environment
# end of script
NOTE: To launch the Jupyter notebook from a GPU node, first find which GPU node your job
was assigned by running the command below on the login node.
$ squeue --me
Now ssh to the node assigned to you. For example, in the screenshot below, you can see
that gpu007 was assigned to the user.
$ ssh gpu007
Now to launch the notebook from the gpu node, follow the below steps.
For example,
Note: The token number displayed on the screen will later be used to log in to the
Jupyter notebook through your local web browser.
3. From another terminal on your MobaXterm, create an SSH tunnel between your local
machine and the remote system by executing the below command.
For example,
Note: Use the port number and gpu node that is assigned by slurm.
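The command shown in the original is a screenshot; the tunnel generally takes the following form, where the port number, GPU node name, and login address are placeholders for what Slurm and your site assign:

```shell
# run on your local machine (e.g. a MobaXterm terminal):
ssh -L 8888:gpu007:8888 <username>@<login-node-address>
```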
4. Type the address below in your local browser to access the Jupyter notebook.
http://localhost:<PORT_NO>
For example,
http://localhost:8888
5. The Jupyter notebook can now be opened after entering the valid token.
lfs setstripe -c 4 .
After this has been done, all new files created in the current directory will be spread
over 4 storage arrays, each holding 1/4th of the file. The file can be accessed as
normal; no special action needs to be taken. When striping is set this way, it is defined
on a per-directory basis, so different directories can have different stripe setups in
the same file system; new subdirectories inherit the striping of their parent at the time
of creation.
We recommend that users set the stripe count so that each chunk is approximately
200-300 GB, for example:
Once a file is created with a stripe count, it cannot be changed. Users can also set the
stripe size and stripe count for their own directories, and can check the set stripe size
and stripe count with the following command:
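The commands themselves appear as screenshots in the original; they take this general form, with the directory name as a placeholder:

```shell
$ lfs setstripe -c 4 <dir>   # stripe new files in <dir> across 4 OSTs
$ lfs getstripe <dir>        # show the stripe count and size set on <dir>
```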
The options used in the above command have these respective functions:
• -c to set the stripe count; 0 means use the system default (usually 1) and -1 means
stripe over all available OSTs (Lustre Object Storage Targets).
• -s to set the stripe size; 0 means use the system default (usually 1 MB); otherwise
use k, m or g for KB, MB or GB respectively.
1. Do NOT run any job which is longer than a few minutes on the login nodes. The login
nodes are meant for compilation; jobs must be run on the compute nodes.
2. It is recommended to go through the beginner's guide in /home/apps/Docs/samples;
this should serve as a good starting point for new users.
3. Use the same compiler to compile different parts/modules/library-dependencies of an
application. Using different compilers (e.g. pgcc + icc) to compile different parts of an
application may cause linking or execution issues.
4. Choosing appropriate compiler switches/flags/options (e.g. –O3) may increase the
performance of the application substantially (accuracy of output must be verified).
Please refer to documentation of compilers (online / docs present inside compiler
installation path / man pages etc.)
5. Modules/libraries used for execution should be the same as that used for compilations.
This can be specified in the Job submission script.
6. Be aware of the amount of disk space utilized by your job(s). Do an estimate before
submitting multiple jobs.
7. Please submit jobs preferably in $SCRATCH. You can back up your results/summaries in
your $HOME.
8. $SCRATCH is NOT backed up! Please download all your data to your Desktop/ Laptop.
9. Before installing any software in your home, ensure that it is from a reliable and safe
source. Ransomware is on the rise!
10. Please do not use spaces while creating the directories and files.
11. Please inform PARAM Rudra support when you notice something strange - e.g.
unexpected slowdowns, files missing/corrupted etc.
Installed Applications/Libraries
The following is a list of a few of the applications, from various domains of science and
engineering, installed on the system.
LAMMPS Applications
LAMMPS is an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is
extensively used in the fields of Materials Science, Physics, Chemistry and many others.
More information about LAMMPS may be found at https://lammps.sandia.gov.
1. The LAMMPS input is the in.lj file, which contains the parameters below.
#!/bin/sh
#SBATCH -N 8
#SBATCH --ntasks-per-node=40
#SBATCH --time=08:50:20
#SBATCH --job-name=lammps
#SBATCH --error=job.%J.err_8_node_40
#SBATCH --output=job.%J.out_8_node_40
#SBATCH --partition=standard
spack load intel-oneapi-compilers/jtvke3n
spack load intel-oneapi-mpi/2db2e7t
spack load gcc@13.2.0/3wdooxp
source /home/apps/spack/opt/spack/linux-almalinux8-cascadelake/gcc-13.2.0/intel-oneapi-mkl-2024.0.0-yq4keqsjr44rf5ffroiiim2iklxg4let/setvars.sh intel64
export I_MPI_FALLBACK=disable
export I_MPI_FABRICS=shm:ofa
#export I_MPI_FABRICS=shm:tmi
#export I_MPI_FABRICS=shm:dapl
export I_MPI_DEBUG=5
#Enter your working directory or use SLURM_SUBMIT_DIR
cd /home/manjunath/NEW_LAMMPS/lammps-7Aug19/bench
export OMP_NUM_THREADS=1
time mpiexec.hydra -n $SLURM_NTASKS -genv OMP_NUM_THREADS 1 <path of lammps executable> -in in.lj
GROMACS APPLICATION
GROningen MAchine for Chemical Simulations (GROMACS) is a molecular dynamics package designed mainly for simulations of proteins, lipids, and nucleic acids. It was originally developed in the Biophysical Chemistry department of the University of Groningen, and is now maintained by contributors in universities and research centres worldwide. GROMACS is one of the fastest and most popular software packages available, and can run on central processing units (CPUs) and graphics processing units (GPUs).
Submission Script:
#!/bin/sh
#SBATCH -N 10
#SBATCH --ntasks-per-node=48
##SBATCH --time=03:05:30
#SBATCH --job-name=gromacs
#SBATCH --error=job.16.%J.err
#SBATCH --output=job.16.%J.out
#SBATCH --partition=standard

source /home/apps/spack/share/spack/setup-env.sh
export I_MPI_DEBUG=5
export OMP_NUM_THREADS=1

# Prepare the run input (grompp is a serial preprocessing step)
gmx_mpi grompp -f pme.mdp -c conf.gro -p topol.top

# Run mdrun on all allocated ranks; -nsteps 50000 matches the output snippet
mpirun -np $SLURM_NTASKS gmx_mpi mdrun -s topol.tpr -nsteps 50000
Output Snippet:
Number of logical cores detected (48) does not match the number reported by OpenMP (1).
Consider setting the launch configuration manually!
Running on 10 nodes with total 192 cores, 480 logical cores
Cores per node: 0 - 48
Logical cores per node: 48
Hardware detected on host cn072 (the node of MPI rank 0):
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
Reading file /home/shweta/Gromacs/water-cut1.0_GMX50_bare/3072/topol.tpr, VERSION 5.1.4 (single precision)
Changing nstlist from 10 to 20, rlist from 1 to 1.032
The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 1 (and the command-line setting agreed with that)
NOTE: KMP_AFFINITY set, will turn off gmx mdrun internal affinity setting as the two can conflict and cause performance degradation. To keep using the gmx mdrun internal affinity setting, set the KMP_AFFINITY=disabled environment variable.
Overriding nsteps with value passed on the command line: 50000 steps, 100 ps
Will use 360 particle-particle and 120 PME only ranks
This is a guess, check the performance at the end of the log file
Using 480 MPI processes
Using 1 OpenMP thread per MPI process
Back Off! I just backed up ener.edr to ./#ener.edr.2#
starting mdrun 'Water'
50000 steps, 100.0 ps.
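The rank breakdown in the log above can be cross-checked: the particle-particle and PME-only ranks must add up to the total MPI process count, which in turn equals nodes times tasks per node. A quick check of the numbers reported above:

```shell
# Rank decomposition reported in the GROMACS log above
pp_ranks=360
pme_ranks=120
total=$((pp_ranks + pme_ranks))
echo "$total"                       # 480 MPI processes in total

# This matches the Slurm request: 10 nodes x 48 tasks per node
nodes=10
tasks_per_node=48
echo "$((nodes * tasks_per_node))"  # 480
```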
If you use supercomputers and services provided under the National Supercomputing Mission, Government of India, please let us know of any published results, including student theses, conference papers, journal papers and patents obtained.
Also, please submit copies of dissertations, reports, reprints and URLs in which “National Supercomputing Mission, Government of India” is acknowledged to:
HPC Technologies,
Centre for Development of Advanced Computing,
CDAC Innovation Park,
S.N. 34/B/1,
Panchavati, Pashan,
Pune – 411008
Maharashtra
We suggest that you refer to these easy steps to generate a ticket related to the issue you are experiencing.
Your ticket will be attended to by the Rudra support team. The ticket will be closed only when the related issue gets resolved.
You can generate a new ticket for any new issue that you experience.
3. Sign in using the username and password that you use for logging in to the cluster. Refer to Fig 37 for the same.
4. Select a Help Topic from the dropdown and then click on Create Ticket. Refer to Fig 38 for the same.
5. Please fill in the details of your issue in the given fields and then click on Create Ticket.
Once the ticket is generated, an acknowledgement e-mail will be sent to your official e-mail address. The e-mail will also contain the ticket number along with a reference to the ticket that you have generated.
In case of any difficulty while accessing Rudra Support, you can reach us via e-mail at rudrasupport@iitb.ac.in
To get access to this HPC Facility, proceed with registration on the Portal through the link
provided below:
Link: https://services.nsmindia.in/userportal/account
Once registered, you will receive an email outlining the next steps to be followed for your
User Creation Request.
The User Creation Portal streamlines the user data collection process, enabling multiple users to submit user creation requests simultaneously. Physical form maintenance is eliminated; users need only provide accurate official details and an e-mail address for procedural notifications. Users can track their account creation status and the remaining steps, and identify any necessary actions. Administrative and higher authorities can access all user details through secure login to the portal.
Process/Steps
Users initiate registration on the portal by entering their e-mail address, city, and institute name. An e-mail will be sent to verify the provided e-mail address; upon verification, a link to the registration form is sent for completing the user account request. Users have the option to preview and edit the form before final submission. Upon submission, a link for document upload is provided, where documents such as ID proof, the User Creation Form and any other needed documents are uploaded. Once documents are uploaded, modifications are not possible, as the documents will undergo verification.
User details and documents are verified by the user's institute HOD/PI. Upon approval, a verification e-mail is sent to the coordinator. The coordinator selects the appropriate cluster for the user based on document verification and requirements. Final approval is granted by a higher authority, resulting in acceptance of the user request.
If you have any queries, refer to the User Creation Manual and flowcharts available in the Help section of the User Creation Portal. Common questions are also addressed in the FAQ section, located beside the Help section. If you have any additional inquiries or require assistance, feel free to reach out to us at nsmsupport@cdac.in.
Note: Kindly use your official email address for registration to avoid the possibility of your
request being declined.
Once you have completed your research work and no longer need to use PARAM Rudra, please close your account on PARAM Rudra. Raise a ticket at https://paramrudra.iitb.ac.in/support and the system administrator will guide you through the “Closure Procedure”. You will need clearance from your project coordinator/supervisor/Head of the Department, confirming that you have surrendered this resource, in order to obtain a “no dues” certificate from the institute.
***