Cluster: differences between revisions

From Instituto de Física - UFRGS
 
(51 intermediate revisions by 3 users not shown)
Line 1: Line 1:
= Cluster Ada - Instituto de Física UFRGS =
= Cluster Lovelace - Instituto de Física UFRGS =


The cluster is located at the Instituto de Física da UFRGS, in Porto Alegre.


== Infrastructure ==
The cluster is located at Instituto de Física da UFRGS, in Porto Alegre.


=== Management software ===
== Management Committee ==


<pre>
<pre>
Slurm Workload Manager


Site: https://slurm.schedmd.com/
The cluster is managed by professors representing the fields of Astronomy, Theoretical Physics, and Experimental Physics, in addition to an IT department employee from the Physics Institute.
 
Astronomy: Rogério Riffel
 
Theoretical Physics: Leonardo Brunnet
 
Experimental Physics: Pedro Grande
 
IT employee: Gustavo Feller
 
</pre>
 
== Users Committee ==
 
<pre>
 
Users have two channels for communication/discussion:
 
1) The fis-linux-if@grupos.ufrgs.br mailing list
 
2) Direct messages to the IT department via the email fisica-ti@ufrgs.br.
 
</pre>
</pre>


=== Node hardware ===
== Infrastructure ==
 
=== Management Software ===
 
The system of queues and scheduling of tasks is controlled by the [https://slurm.schedmd.com/ Slurm Workload Manager].


<pre>
<pre>
CPU: x86_64
 
RAM: ranges from 4 GB to 8 GB
Number of jobs per user: controlled on demand.
GPU: some nodes have NVIDIA CUDA
 
Storage: network storage with a 50 GB quota per user; the nodes have no local hard disk
Number of users on 24 January 2023: 150
 
Account requests: email fisica-ti@ufrgs.br
</pre>
</pre>


=== Software on the nodes ===
=== Hardware in Lovelace nodes ===


<pre>
<pre>
OS: Debian 8 (Jessie) x86_64
CPU: Ryzen (32 and 2*24 cores) + AMD 16 cores
Installed packages:
RAM: 64 GB each
gcc
GPU: Three nodes with NVIDIA CUDA
docker
Storage: Dell storage, 12 TB
Inter-node connection: Gigabit
</pre>
</pre>


== How to use ==
=== Installed Software ===


=== Connecting to cluster-slurm ===
<pre>
OS: Debian 12
Basic packages installed:
gcc
gfortran
python: torch, numba
julia
conda
compucel3d
espresso
gromacs
lammps
mesa
openmpi
povray
quantum-espresso
vasp
</pre>


The cluster is accessible through the server cluster-slurm. To access the server via SSH, use:
== Rules for scheduling, access control, and usage of the research infrastructure ==
 
=== Online scheduling ===
 
The cluster is accessible over the UFRGS virtual private network ([https://www1.ufrgs.br/CatalogoServicos/servicos/servico?servico=3178 VPN]) through the server lovelace.if.ufrgs.br.
 
To connect from a Unix-like system, use:
<pre>
<pre>
ssh usuario@cluster-slurm.if.ufrgs.br
ssh <user>@lovelace.if.ufrgs.br
</pre>
</pre>


If you are not registered or are not affiliated with the Instituto de Física, request an account by sending an email to fisica-ti@ufrgs.br.
On Windows, you can configure WinSCP to connect to the address lovelace.if.ufrgs.br.


=== Using software on the cluster ===
If you are not registered, request an account by sending an email to fisica-ti@ufrgs.br.


To run a program in a job on the cluster, the program must:
=== Using software on the cluster ===


1. Already be installed
To run a program in a cluster job, the program must:


OR
1. Already be installed
OR


2. Be copied to your home (your user's directory)
2. Be copied to your home directory


Ex:
Ex:
<pre>
<pre>
scp meu_executavel usuario@cluster-slurm.if.ufrgs.br:~/
scp my_program <user>@cluster-slurm.if.ufrgs.br:~/
</pre>
</pre>


If you want to compile the program for use on the cluster, one option is to use <code>gcc</code>.
If you compile your program on the cluster, one option is to use <code>gcc</code>.


Ex:
Ex:
<pre>
<pre>
scp -r source-code/ usuario@cluster-slurm.if.ufrgs.br:~/
scp -r source-code/ <user>@cluster-slurm.if.ufrgs.br:~/
ssh usuario@cluster-slurm.if.ufrgs.br
ssh <user>@cluster-slurm.if.ufrgs.br
cd source-code
cd source-code
gcc main.c
gcc main.c funcoes.c
</pre>
</pre>
This generates the file <code>a.out</code>, which is the executable.
This generates the file <code>a.out</code>, which is the executable.
 
Once available through method 1 or 2, the program can be run on the cluster through a <strong>JOB</strong>.


Note: If you run the program without submitting it as a <strong>JOB</strong>, it does not run on the nodes, only on the server itself (cluster-slurm), which has very limited processing capacity.
Once available through method 1 or 2, the program can be run on the cluster through a <strong>JOB</strong>.


Note: If you run your program without submitting it as a <strong>JOB</strong>, it runs on the server, not on the nodes. This is not recommended, since the server's computational capacity is limited and you would slow it down for everyone else.


=== Creating and running a job ===
=== Creating and running a job ===


Slurm manages jobs, and each job represents a program or task being executed.
Slurm manages jobs, and each job represents a program or task being executed.


To submit a new job, create a script file describing the job's requirements and execution characteristics.
To submit a new job, create a script file describing the job's requirements and characteristics.


File format below.
A typical submission script looks like this:


Ex: <code>job.sh</code>
Ex: <code>job.sh</code>
Line 85: Line 135:
<pre>
<pre>
#!/bin/bash  
#!/bin/bash  
#SBATCH -n 1 # Number of CPU cores to allocate
#SBATCH -n 1 # Number of tasks, one CPU core each by default (despite the leading #, Slurm reads these SBATCH lines)
#SBATCH -N 1 # Number of nodes to allocate
#SBATCH -N 1 # Number of nodes to allocate (not every option is required; disable a line with ##)
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -p long # Partition (queue) to submit to
#SBATCH -p long # Partition to submit to
#SBATCH --qos qos_long # QOS
#SBATCH --qos qos_long # QOS
    
    
# Commands to run your program:
# Commands to run your program
./a.out
./a.out
</pre>
</pre>


For the --qos option, use the partition name with the "qos_" prefix:
For the --qos option, use the partition name with the "qos_" prefix:


partition: short -> qos: qos_short -> limit: 2 weeks
partition: short -> qos: qos_short -> limit: 2 weeks


partition: long -> qos: qos_long -> limit: 3 months
partition: long -> qos: qos_long -> limit: 3 months
    
    
To run on a GPU, request the "generic resource" gpu on cluster Ada:


To run on a GPU, you must specify the queue and explicitly request the ''generic resource'' gpu:
<pre>
<pre>
#!/bin/bash  
#!/bin/bash  
#SBATCH -n 1 # Number of CPU cores to allocate
#SBATCH -n 1
#SBATCH -N 1 # Number of nodes to allocate
#SBATCH -N 1
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -t 0-00:05
#SBATCH -p long # Partition (queue) to submit to
#SBATCH -p long  
#SBATCH --qos qos_long # QOS  
#SBATCH --qos qos_long # QOS  
#SBATCH --gres=gpu:1
#SBATCH --gres=gpu:1
Line 116: Line 166:
</pre>
</pre>


To request a specific GPU, use a constraint by adding the line:
To request a specific GPU:
<pre>
<pre>
#SBATCH --constraint="gtx970"
#SBATCH --constraint="gtx970"
</pre>
</pre>


To submit the job, run the command
To submit the job, execute:


<pre>
<pre>
Line 127: Line 177:
</pre>
</pre>


== Useful commands ==
== Useful commands ==
* To list your jobs:
* To list jobs:
   squeue
   squeue


* To delete a job:
* To list all jobs running in the cluster now:
  sudo squeue
 
* To delete a running job:
   scancel [job_id]
   scancel [job_id]


* To list the available partitions:
* To list available partitions:
   sinfo
   sinfo


* To list the GPUs present in the nodes:
* To list the GPUs in the nodes:
   sinfo -o "%N %f"
   sinfo -o "%N %f"


* To list a summary of all nodes:
* To list the characteristics of all nodes:
   sinfo -Nel
   sinfo -Nel

Latest revision as of 10:52, 4 April 2024

Cluster Lovelace - Instituto de Física UFRGS

The cluster is located at Instituto de Física da UFRGS, in Porto Alegre.

Management Committee


The cluster is managed by professors representing the fields of Astronomy, Theoretical Physics, and Experimental Physics, in addition to an IT department employee from the Physics Institute.

Astronomy: Rogério Riffel

Theoretical Physics: Leonardo Brunnet

Experimental Physics: Pedro Grande

IT employee: Gustavo Feller

Users Committee


Users have two channels for communication/discussion: 

1) The fis-linux-if@grupos.ufrgs.br mailing list

2) Direct messages to the IT department via the email fisica-ti@ufrgs.br.

Infrastructure

Management Software

The system of queues and scheduling of tasks is controlled by the Slurm Workload Manager.


Number of jobs per user: controlled on demand.

Number of users on 24 January 2023: 150

Account requests: email fisica-ti@ufrgs.br

Hardware in Lovelace nodes

CPU: Ryzen (32 and 2*24 cores) + AMD 16 cores
RAM: 64 GB each
GPU: Three nodes with NVIDIA CUDA
Storage: Dell storage, 12 TB
Inter-node connection: Gigabit

Installed Software

OS: Debian 12 
Basic packages installed:
gcc
gfortran
python: torch, numba
julia
conda
compucel3d
espresso
gromacs
lammps
mesa
openmpi
povray
quantum-espresso
vasp

Rules for scheduling, access control, and usage of the research infrastructure

Online scheduling

The cluster is accessible over the UFRGS virtual private network (VPN) through the server lovelace.if.ufrgs.br.

To connect from a Unix-like system, use:

ssh <user>@lovelace.if.ufrgs.br

On Windows, you can configure WinSCP to connect to the address lovelace.if.ufrgs.br.

If you are not registered, request an account by sending an email to fisica-ti@ufrgs.br.

Using software on the cluster

To run a program in a cluster job, the program must:

1. Already be installed

OR

2. Be copied to your home directory

Ex:

scp my_program <user>@cluster-slurm.if.ufrgs.br:~/

If you compile your program on the cluster, one option is to use gcc.

Ex:

scp -r source-code/ <user>@cluster-slurm.if.ufrgs.br:~/
ssh <user>@cluster-slurm.if.ufrgs.br
cd source-code
gcc main.c funcoes.c

This generates the file a.out, which is the executable.
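
As a minimal sketch of the compilation step above, the commands below name the executable explicitly with -o instead of relying on the default a.out. The file names hello.c and hello and the temporary directory are illustrative assumptions, not part of the cluster setup, and the compile step is skipped when gcc is not on the PATH.

```shell
#!/bin/bash
# Work in a scratch directory so nothing in the home directory is overwritten.
workdir="$(mktemp -d)"

# A tiny C program used only for illustration (hypothetical file name).
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>

int main(void) {
    printf("hello from the cluster\n");
    return 0;
}
EOF

# -O2 enables optimization; -o names the executable instead of a.out.
if command -v gcc >/dev/null 2>&1; then
    gcc -O2 -o "$workdir/hello" "$workdir/hello.c"
    echo "built $workdir/hello"
else
    echo "gcc not found; skipping compilation"
fi
```

With a named executable, the last line of the submission script would call that name (for example ./hello) instead of ./a.out.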

Once available through method 1 or 2, the program can be run on the cluster through a JOB.

Note: If you run your program without submitting it as a JOB, it runs on the server, not on the nodes. This is not recommended, since the server's computational capacity is limited and you would slow it down for everyone else.

Creating and running a job

Slurm manages jobs, and each job represents a program or task being executed.

To submit a new job, create a script file describing the job's requirements and characteristics.

A typical submission script looks like this:

Ex: job.sh

#!/bin/bash 
#SBATCH -n 1 # Number of tasks, one CPU core each by default (despite the leading #, Slurm reads these SBATCH lines)
#SBATCH -N 1 # Number of nodes to allocate (not every option is required; disable a line with ##)
#SBATCH -t 0-00:05 # Execution time limit (D-HH:MM)
#SBATCH -p long # Partition to submit to
#SBATCH --qos qos_long # QOS 
  
# Your program execution commands
./a.out
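
A submission script like the one above can also carry a job name and an explicit log file, which makes jobs easier to tell apart. The sketch below writes such a script to a file; --job-name and --output are standard sbatch options and %j expands to the job ID, while the file name job_named.sh, the job name my_test and the log file pattern are illustrative assumptions.

```shell
#!/bin/bash
# Write an illustrative submission script (file name and job name are assumptions).
cat > job_named.sh <<'EOF'
#!/bin/bash
#SBATCH -n 1                    # Number of tasks
#SBATCH -N 1                    # Number of nodes
#SBATCH -t 0-00:05              # Time limit (D-HH:MM)
#SBATCH -p long                 # Partition
#SBATCH --qos qos_long          # QOS matching the partition
#SBATCH --job-name=my_test      # Name shown by squeue
#SBATCH --output=slurm-%j.out   # Log file; %j expands to the job ID

./a.out
EOF

echo "wrote job_named.sh"
```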

For the --qos option, use the partition name with the "qos_" prefix:

partition: short -> qos: qos_short -> limit: 2 weeks

partition: long -> qos: qos_long -> limit: 3 months
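
The naming rule above can be sketched as a small shell helper that derives the QOS from the partition name, so the two values cannot drift apart. The function name qos_for_partition is hypothetical, and the helper only knows the two partitions listed here.

```shell
#!/bin/bash
# Map a partition name to its QOS by adding the "qos_" prefix described above.
qos_for_partition() {
    local partition="$1"
    case "$partition" in
        short|long) echo "qos_${partition}" ;;
        *) echo "unknown partition: $partition" >&2; return 1 ;;
    esac
}

# Print the two matching directives for a chosen partition.
partition="short"
qos="$(qos_for_partition "$partition")"
echo "#SBATCH -p $partition"
echo "#SBATCH --qos $qos"
```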

To run on a GPU, request the "generic resource" gpu on cluster Ada:

#!/bin/bash 
#SBATCH -n 1 
#SBATCH -N 1
#SBATCH -t 0-00:05 
#SBATCH -p long 
#SBATCH --qos qos_long # QOS 
#SBATCH --gres=gpu:1
  
# Commands to run your program
./a.out

To request a specific GPU:

#SBATCH --constraint="gtx970"
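
To check that a job actually received a GPU, the job body can print what it sees before starting the real program. This is only a sketch: it assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs requesting --gres=gpu (common, but dependent on the cluster configuration) and that nvidia-smi is installed on the GPU nodes. Neither is stated on this page, and the function name report_gpus is hypothetical.

```shell
#!/bin/bash
# Report which GPUs, if any, are visible to the current shell or job.
report_gpus() {
    # Slurm commonly sets CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu
    # (assumption: this depends on the cluster configuration).
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"

    # nvidia-smi -L lists the GPUs; skip it where the tool is absent.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found on ${HOSTNAME:-this host}"
    fi
}

report_gpus
```

Placing the call to report_gpus before ./a.out in the job script puts this information at the top of the job's log.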

To submit the job, execute:

sbatch job.sh
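
On success, sbatch prints a line of the form "Submitted batch job <id>". A minimal sketch for capturing that ID, so it can later be passed to scancel, is shown below. The function name job_id_from_sbatch_output is hypothetical, and on machines without Slurm the snippet falls back to parsing a sample line.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch_output() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Submit only where sbatch and job.sh exist; otherwise parse a sample line.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    job_id="$(sbatch job.sh | job_id_from_sbatch_output)"
else
    job_id="$(echo 'Submitted batch job 12345' | job_id_from_sbatch_output)"  # sample line
fi
echo "job id: $job_id"
```

Unless the script sets --output, Slurm by default writes the job's output to slurm-<id>.out in the directory the job was submitted from.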

Useful commands

  • To list jobs:
 squeue
  • To list all jobs running in the cluster now:
 sudo squeue
  • To delete a running job:
 scancel [job_id]
  • To list available partitions:
 sinfo
  • To list the GPUs in the nodes:
 sinfo -o "%N %f"
  • To list the characteristics of all nodes:
 sinfo -Nel
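
The commands above can be combined with standard text tools. As a sketch, the snippet below summarizes your own jobs by state, using the squeue options -h (no header), -u (one user) and -o "%T" (print only the job state). The helper name count_states is hypothetical, and a sample input is used where squeue is not available.

```shell
#!/bin/bash
# Count jobs per state from a list of state names, one per line.
count_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header, -u limits to one user, -o "%T" prints only the job state.
    squeue -h -u "$USER" -o "%T" | count_states
else
    # Sample input for machines without Slurm.
    printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
fi
```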