Revision as of 10:46, 4 April 2024
Cluster Lovelace - Instituto de Física UFRGS
The cluster is located at Instituto de Física da UFRGS, in Porto Alegre.
Management Committee
The cluster is managed by professors representing the fields of Astronomy, Theoretical Physics, and Experimental Physics, together with a staff member from the Physics Institute's IT department.
Astronomy: Rogério Riffel
Theoretical Physics: Leonardo Brunnet
Experimental Physics: Pedro Grande
IT staff: Gustavo Feller
Users Committee
Users have two channels for communication and discussion:
1. The fis-linux-if@grupos.ufrgs.br mailing list
2. Direct messages to the IT department at fisica-ti@ufrgs.br
Infrastructure
Management Software
[https://slurm.schedmd.com/ Slurm Workload Manager]
Number of jobs per user controlled on demand.
Number of users on 1/24/2023: 150
Account requests: email fisica-ti@ufrgs.br
Hardware in lovelace nodes
CPU: Ryzen (32 and 2*24 cores) + AMD 16 cores
RAM: 64 GB each
GPU: Three nodes with NVIDIA CUDA
Storage: Dell storage, 12 TB
Inter-node connection: Gigabit
Installed Software
OS: Debian 12
Basic packages installed: gcc, gfortran, python (torch, numba), julia, conda, compucel3d, espresso, gromacs, lammps, mesa, openmpi, povray, quantum-espresso, vasp
Rules for scheduling, access control, and usage of the research infrastructure
Online scheduling
The cluster is accessible over the UFRGS virtual private network (VPN) through the server lovelace.if.ufrgs.br. To connect from a Unix-like system, use:
ssh <user>@lovelace.if.ufrgs.br
On Windows, you can configure WinSCP to connect to the address lovelace.if.ufrgs.br.
If you are not registered, request an account by sending an email to fisica-ti@ufrgs.br
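To avoid retyping the full address, an entry can be added to the OpenSSH client configuration. The sketch below only prints such an entry; the alias lovelace-cluster and the user name myuser are hypothetical placeholders, not names defined by the cluster.

```shell
# Print an OpenSSH client config entry for the login server.
# "lovelace-cluster" is a hypothetical alias; $1 is your cluster user name.
ssh_config_entry() {
    printf 'Host lovelace-cluster\n'
    printf '    HostName lovelace.if.ufrgs.br\n'
    printf '    User %s\n' "$1"
}

# Review the output, then append it to ~/.ssh/config by hand.
ssh_config_entry myuser
```

Once the entry is in ~/.ssh/config, ssh lovelace-cluster opens the same connection as the full command.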
Using software on the cluster
To run a program in a cluster job, the program must:
1. Be already installed
OR
2. Be copied to the user's home directory
Ex:
scp my_program <user>@cluster-slurm.if.ufrgs.br:~/
If you are compiling your program on the cluster, one option is to use gcc.
Ex:
scp -r source-code/ <user>@cluster-slurm.if.ufrgs.br:~/
ssh <user>@cluster-slurm.if.ufrgs.br
cd source-code
gcc main.c funcoes.c
This will generate the file a.out, which is the executable.
Once the program is available through option 1 or 2, it can be run on the cluster by submitting a job.
Note: If you run your executable without submitting it as a job, it runs on the server, not on the nodes. This is not recommended: the server's computational capacity is limited, and you will slow it down for everyone else.
Creating and executing a Job
Slurm manages jobs and each job represents a program or task being executed.
To submit a new job, you must create a script file describing the requirements and characteristics of the job.
A typical submission script is shown below.
Ex: job.sh
#!/bin/bash
#SBATCH -n 1 # Number of CPUs to allocate (despite the leading #, these SBATCH lines are read by the Slurm manager!)
#SBATCH -N 1 # Number of nodes to allocate (you don't have to use every option; comment one out with ##)
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -p long # Partition to submit to
#SBATCH --qos qos_long # QOS
# Your program execution commands
./a.out
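As a variation on the script above, the sketch below also names the job and sends its output to per-job files. The job name my_job and the file names job_%j.out and job_%j.err are hypothetical choices; -J, -o and -e are standard sbatch options, and %j expands to the job ID.

```shell
#!/bin/bash
#SBATCH -J my_job          # Job name shown by squeue (hypothetical name)
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 0-00:05
#SBATCH -p long
#SBATCH --qos qos_long
#SBATCH -o job_%j.out      # Standard output file; %j expands to the job ID
#SBATCH -e job_%j.err      # Standard error file

# Slurm exports SLURM_JOB_ID inside a job; fall back to "none" when run by hand.
job_id="${SLURM_JOB_ID:-none}"
header="Job ${job_id} running on $(uname -n)"
echo "$header"

# Run the program only if it is present and executable.
if [ -x ./a.out ]; then
    ./a.out
fi
```

Printing the node name at the top of the output makes it easier to tell afterwards where a job actually ran.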
For the --qos option, use the partition name with the "qos_" prefix:
partition: short -> qos: qos_short -> limit 2 weeks
partition: long -> qos: qos_long -> limit 3 months
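The naming rule can be captured in a small helper so the partition and QOS never get out of sync. This is only a sketch: qos_for_partition and submit_cmd are hypothetical helper names, and the command is printed rather than run.

```shell
# Derive the QOS name from a partition name, following the "qos_" prefix rule.
qos_for_partition() {
    echo "qos_$1"
}

# Build (but do not run) the matching sbatch command line for job.sh.
submit_cmd() {
    echo "sbatch -p $1 --qos $(qos_for_partition "$1") job.sh"
}

submit_cmd short
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the same job.sh can be sent to either partition this way.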
To run on a GPU, request the "generic resource" gpu in cluster ada:
#!/bin/bash
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 0-00:05
#SBATCH -p long
#SBATCH --qos qos_long # QOS
#SBATCH --gres=gpu:1
# Your program execution commands:
./a.out
To request a specific GPU model:
#SBATCH --constraint="gtx970"
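Inside a GPU job it can be worth checking that a device was actually assigned before starting a long run. The sketch below assumes the site's Slurm GPU configuration sets CUDA_VISIBLE_DEVICES in the job environment, which is common but not confirmed for this cluster; gpu_report is a hypothetical helper name.

```shell
# Report which GPUs the job can see.
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs submitted with --gres=gpu:N.
gpu_report() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "GPUs visible: ${CUDA_VISIBLE_DEVICES}"
    else
        echo "No GPU visible"
    fi
}

gpu_report
```

Calling gpu_report near the top of the job script puts the result in the job's output file.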
To submit the job, execute:
sbatch job.sh
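On success, sbatch prints a confirmation line of the form "Submitted batch job <id>". A minimal sketch for capturing that ID in a script follows; job_id_from_sbatch is a hypothetical helper name and 12345 is a made-up ID used only for the demonstration.

```shell
# Extract the numeric job ID from sbatch's confirmation line,
# which has the form "Submitted batch job <id>".
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On the cluster:  out=$(sbatch job.sh); id=$(job_id_from_sbatch "$out")
job_id_from_sbatch "Submitted batch job 12345"
```

The captured ID can then be passed to scancel, or to squeue -j to check on that one job.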
Useful commands
- To list jobs:
squeue
- To list all jobs running in the cluster now:
sudo squeue
- To delete a running job:
scancel [job_id]
- To list available partitions:
sinfo
- To list GPUs in the nodes:
sinfo -o "%N %f"
- To list characteristics of all nodes:
sinfo -Nel
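These commands combine well with standard text tools. As a sketch, the helper below summarizes how many jobs are in each state; count_states is a hypothetical name, and it expects one state per line, as printed by squeue -h -o "%T" (-h drops the header, %T is the job state).

```shell
# Count jobs per state; expects one state name per line on standard input.
count_states() {
    sort | uniq -c | awk '{print $2 "=" $1}'
}

# On the cluster:  squeue -h -u "$USER" -o "%T" | count_states
# Demonstration with sample input:
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
```

The -u option restricts squeue to the jobs of a single user.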