Revision as of 20:32, 24 January 2024

Clusters Ada and Lovelace - Instituto de Física UFRGS

The clusters are located at Instituto de Física da UFRGS, in Porto Alegre.

Infrastructure

Management Software

Slurm Workload Manager

Site: https://slurm.schedmd.com/

Hardware in the ada nodes

CPU: 16 x86_64 nodes
RAM: between 8 GB and 16 GB per node
GPU: 3 nodes with NVIDIA CUDA
Storage: Asustor storage, 12 TB
Inter-node connection: Gigabit

Hardware in the lovelace nodes

CPU: Ryzen (32 and 2*24 cores)
RAM: 64 GB each
GPU: two nodes have NVIDIA CUDA
Storage: Dell storage, 12 TB
Inter-node connection: Gigabit

Software in the nodes

OS: Debian 8 (in cluster ada)
OS: Debian 11 (in cluster lovelace)
Basic packages installed:
 GCC
 gfortran
 python2
 python3

How to use

Connect to cluster-slurm

The clusters are accessible through the server cluster-slurm.if.ufrgs.br (or ada.if.ufrgs.br). To connect from a Unix-like system, use:

ssh <user>@cluster-slurm.if.ufrgs.br

or

ssh <user>@ada.if.ufrgs.br

On Windows, you may use WinSCP.

If you are not registered, request an account by sending an email to fisica-ti@ufrgs.br.

Using software in the cluster

To run a program in a cluster job, the program must:

1. Be already installed

OR

2. Be copied to the user home

Ex:

scp my_program <user>@cluster-slurm.if.ufrgs.br:~/

If you compile your program on the cluster, one option is to use gcc.

Ex:

scp -r source-code/ <user>@cluster-slurm.if.ufrgs.br:~/
ssh <user>@cluster-slurm.if.ufrgs.br
cd source-code
gcc main.c funcoes.c

This generates the file a.out, which is the executable.

Once the program is available through method 1 or 2, it can be run on the cluster as a job.

Note: If you run your executable without submitting it as a job, it runs on the server, not on the nodes. This is not recommended: the server's computational capacity is limited, and you will slow it down for everyone else.

Creating and executing a job

Slurm manages jobs and each job represents a program or task being executed.

To submit a new job, you must create a script file describing the requirements and characteristics of the job.

A typical submission script looks like this:

Ex: job.sh

#!/bin/bash
#SBATCH -n 1 # Number of tasks, one CPU each by default (despite the leading #, these SBATCH lines are read by Slurm!)
#SBATCH -N 1 # Number of nodes to be allocated (not every option is required; disable one by starting the line with ##)
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -p long # Partition to be submitted
#SBATCH --qos qos_long # QOS 
  
# Your program execution commands
./a.out
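As a variation on the script above, the sketch below writes a job script that also names the job and sends its output to a per-job log file. The file name job_named.sh, the job name my_job and the log file pattern are illustrative assumptions; -J and -o are standard sbatch options, and %j expands to the job ID.

```shell
# Write an illustrative job script (file and job names are assumptions).
cat > job_named.sh <<'EOF'
#!/bin/bash
#SBATCH -J my_job          # Job name shown by squeue (hypothetical name)
#SBATCH -o my_job-%j.out   # Output file; %j is replaced by the job ID
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 0-00:05
#SBATCH -p long
#SBATCH --qos qos_long

./a.out
EOF

# Check the script for shell syntax errors before submitting it.
bash -n job_named.sh && echo "job_named.sh: syntax OK"
```

The syntax check only validates the shell part of the script; Slurm validates the #SBATCH options when the job is submitted.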

In the --qos option, use the partition name with the "qos_" prefix:

partition: short -> qos: qos_short -> limit: 2 weeks

partition: long -> qos: qos_long -> limit: 3 months
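The naming rule above can be captured in a small helper so that the partition and QOS never get out of sync. This is only a sketch: the function name qos_for_partition is hypothetical, and the block prints the resulting sbatch command instead of running it.

```shell
# Map a partition name to its QOS name by adding the "qos_" prefix.
qos_for_partition() {
    printf 'qos_%s\n' "$1"
}

partition=short                          # or: long
qos=$(qos_for_partition "$partition")

# Show the command that would be used to submit to this partition.
echo "sbatch -p $partition --qos $qos job.sh"
```

Options given on the sbatch command line take precedence over the matching #SBATCH lines in the script, so the same job.sh can be sent to either partition this way.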

To run on a GPU in cluster ada, request the "generic resource" (gres) gpu:

#!/bin/bash 
#SBATCH -n 1 
#SBATCH -N 1
#SBATCH -t 0-00:05 
#SBATCH -p long 
#SBATCH --qos qos_long # QOS 
#SBATCH --gres=gpu:1
  
# Your program execution commands
./a.out
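Inside a GPU job it can help to log which devices were actually granted. The sketch below assumes that the cluster's GPU configuration exports CUDA_VISIBLE_DEVICES to the job, which is common with Slurm but not confirmed for these clusters; the function name report_gpu is hypothetical.

```shell
# Print the GPU indices visible to this job, or a notice if none are set.
# Assumption: Slurm exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu.
report_gpu() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "GPUs visible to this job: $CUDA_VISIBLE_DEVICES"
    else
        echo "No GPU assigned (CUDA_VISIBLE_DEVICES is empty or not set)"
    fi
}

report_gpu
```

Placed before ./a.out in the job script, this writes one line to the job's output file that shows whether a GPU was allocated.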

To request a specific GPU model:

#SBATCH --constraint="gtx970"

To submit the job, execute:

sbatch job.sh
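On success, sbatch prints a confirmation line of the form "Submitted batch job <id>". When a later command needs the job ID, it can be extracted from that line. The sketch below uses a hypothetical helper, job_id_from_sbatch, and demonstrates it on a sample line rather than a real submission.

```shell
# Extract the job ID from sbatch's confirmation line,
# which has the form "Submitted batch job <id>".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}'
}

# Demonstration with a sample line; on the cluster use:
#   job_id=$(sbatch job.sh | job_id_from_sbatch)
job_id=$(echo "Submitted batch job 12345" | job_id_from_sbatch)
echo "$job_id"
```

The captured ID can then be passed to commands that take a job ID. Alternatively, sbatch --parsable prints the ID without the surrounding text.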

Useful commands

To list jobs:

 squeue
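To check a single job from a script, squeue can be asked for just that job's state. The sketch below wraps this in a hypothetical function, job_state; the label NOT_IN_QUEUE is also an assumption, used here when squeue reports nothing for the given ID (for example because the job has already finished).

```shell
# Print the state of one job (e.g. PENDING, RUNNING), or NOT_IN_QUEUE
# if squeue returns nothing for that job ID.
#   -h  suppresses the header line
#   -j  selects the job by ID
#   -o  "%T" prints only the job state
job_state() {
    local state
    state=$(squeue -h -j "$1" -o "%T" 2>/dev/null)
    echo "${state:-NOT_IN_QUEUE}"
}

# Usage on the cluster (the job ID is illustrative):
#   job_state 12345
```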

To list all jobs running in the cluster now:

 sudo squeue

To cancel a job:

 scancel [job_id]

To list available partitions:

 sinfo

To list the GPUs in the nodes:

 sinfo -o "%N %f"

To list the characteristics of all nodes:

 sinfo -Nel
