SLURM - Job status

The squeue command lists the jobs in the queue in several columns, including ST and NODELIST(REASON). The ST column gives the state of the job. For a running job, NODELIST(REASON) lists the nodes allocated to it; for a job that has not started, it gives the reason, in parentheses, why the job is still waiting.

# squeue
JOBID  PARTITION         NAME        USER   ST       TIME  NODES   NODELIST(REASON)
10674     n01_25    particule   emmetbrown   R      17:00      4   n13-[01-04]
10668       node    quantique   martymcfly   R   17:45:45      1   node-01
10669       node    fibonacci   martymcfly   R    1:30:19      1   node-01
10673       node     timemach   emmetbrown   R      17:08      1   node-02
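The listing above can be turned into records for further processing. The following is a minimal sketch, not an official Slurm tool: the function name `parse_squeue` and the `SAMPLE` text are illustrative, and it assumes the default whitespace-separated layout shown above, with no spaces inside job names.

```python
# Sample text copied from the squeue listing above.
SAMPLE = """\
JOBID  PARTITION         NAME        USER   ST       TIME  NODES   NODELIST(REASON)
10674     n01_25    particule   emmetbrown   R      17:00      4   n13-[01-04]
10668       node    quantique   martymcfly   R   17:45:45      1   node-01
10669       node    fibonacci   martymcfly   R    1:30:19      1   node-01
10673       node     timemach   emmetbrown   R      17:08      1   node-02
"""


def parse_squeue(text):
    """Parse squeue's default output into a list of dicts keyed by column name.

    Assumption: columns are separated by whitespace and only the last
    column, NODELIST(REASON), may itself contain spaces.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the number of splits so the last column stays in one piece.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


jobs = parse_squeue(SAMPLE)
for job in jobs:
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

On a cluster, the same function could be fed the captured output of `squeue` instead of `SAMPLE`.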

Job states

A job usually goes through several states during its lifetime. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING and COMPLETED. The ST column shows them as the following compact codes:

PD : PENDING (the job is waiting for resources to be allocated)
S : SUSPENDED (the job has an allocation, but execution is suspended and its CPUs have been released for other jobs)
R : RUNNING (the job currently has an allocation)
CD : COMPLETED (all processes terminated with an exit code of zero)
CG : COMPLETING (the job is finishing; some processes may still be active)
F : FAILED (non-zero exit code or other failure condition)
BF : BOOT_FAIL (job terminated because of a launch failure, typically a hardware failure)
CA : CANCELLED (job cancelled by the user or the system administrator)
CF : CONFIGURING (the job has been allocated resources, but is waiting for them to become ready for use)
NF : NODE_FAIL (job terminated because one or more of its nodes failed)
PR : PREEMPTED (job terminated because of preemption)
SE : SPECIAL_EXIT (job requeued in a special state, which can be set by the user)
ST : STOPPED (the job has an allocation, but execution was stopped with the SIGSTOP signal; its CPUs are retained)
TO : TIMEOUT (job terminated upon reaching its time limit)
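The code table above lends itself to a simple lookup. The sketch below is illustrative only: `STATE_NAMES`, `state_name` and `count_states` are hypothetical names, and the table covers only the codes listed above, so any other code is reported as UNKNOWN.

```python
from collections import Counter

# Compact ST codes from the list above, mapped to their long names.
STATE_NAMES = {
    "PD": "PENDING",
    "S": "SUSPENDED",
    "R": "RUNNING",
    "CD": "COMPLETED",
    "CG": "COMPLETING",
    "F": "FAILED",
    "BF": "BOOT_FAIL",
    "CA": "CANCELLED",
    "CF": "CONFIGURING",
    "NF": "NODE_FAIL",
    "PR": "PREEMPTED",
    "SE": "SPECIAL_EXIT",
    "ST": "STOPPED",
    "TO": "TIMEOUT",
}


def state_name(code):
    """Return the long name for a compact ST code, or 'UNKNOWN'."""
    return STATE_NAMES.get(code.strip().upper(), "UNKNOWN")


def count_states(codes):
    """Count jobs per long state name from an iterable of ST codes."""
    return Counter(state_name(code) for code in codes)


# Example: the ST values of a hypothetical queue.
summary = count_states(["R", "R", "PD", "CG"])
for name, count in sorted(summary.items()):
    print(f"{name}: {count}")
```

Such a summary gives a quick view of how many jobs are waiting compared with how many are running.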

Values of the NODELIST(REASON) column

For a job that is not running, the NODELIST(REASON) column can show the following reasons:

AssociationJobLimit : The job’s association has reached its maximum job count.
AssociationResourceLimit : The job’s association has reached some resource limit.
AssociationTimeLimit : The job’s association has reached its time limit.
BadConstraints : The job’s constraints cannot be satisfied.
BeginTime : The job’s earliest start time has not yet been reached.
BlockFreeAction : An IBM BlueGene block is being freed and cannot allow more jobs to start.
BlockMaxError : An IBM BlueGene block has too many cnodes in error state to allow more jobs to start.
Cleaning : The job is being requeued and still cleaning up from its previous execution.
Dependency : The job is waiting for a job it depends on to complete.
FrontEndDown : No front end node is available to execute this job.
InactiveLimit : The job reached the system InactiveLimit.
InvalidAccount : The job’s account is invalid.
InvalidQOS : The job’s QOS is invalid.
JobHeldAdmin : The job is held by a system administrator.
JobHeldUser : The job is held by the user.
JobLaunchFailure : The job could not be launched. This may be due to a file system problem, invalid program name, etc.
Licenses : The job is waiting for a license.
NodeDown : A node required by the job is down.
NonZeroExitCode : The job terminated with a non-zero exit code.
PartitionDown : The partition required by this job is in a DOWN state.
PartitionInactive : The partition required by this job is in an Inactive state and not able to start jobs.
PartitionNodeLimit : The number of nodes required by this job is outside of its partition’s current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit : The job’s time limit exceeds its partition’s current time limit.
Priority : One or more higher-priority jobs exist for this partition or advanced reservation.
Prolog : The job’s PrologSlurmctld program is still running.
QOSJobLimit : The job’s QOS has reached its maximum job count.
QOSResourceLimit : The job’s QOS has reached some resource limit.
QOSTimeLimit : The job’s QOS has reached its time limit.
ReqNodeNotAvail : Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job’s “reason” field as “UnavailableNodes”. Such nodes will typically require the intervention of a system administrator to make available.
Reservation : The job is waiting for its advanced reservation to become available.
Resources : The job is waiting for resources to become available.
SystemFailure : Failure of the Slurm system, a file system, the network, etc.
TimeLimit : The job exhausted its time limit.
QOSUsageThreshold : Required QOS threshold has been breached.
WaitingForScheduling : No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
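Because the same column holds either a node list or a reason, a script has to tell the two apart. The sketch below assumes that squeue wraps the reason in parentheses, as in `(Resources)`, and that extra detail such as the “UnavailableNodes” list follows the reason name after a comma. The function name `split_nodelist_reason` is hypothetical.

```python
def split_nodelist_reason(field):
    """Split a NODELIST(REASON) value into nodes, reason and detail.

    Assumption: a value wrapped in parentheses is a reason; anything
    else is the list of nodes allocated to the job.
    """
    field = field.strip()
    if field.startswith("(") and field.endswith(")"):
        # Text after the first comma, if any, is kept as free-form detail.
        reason, _, detail = field[1:-1].partition(",")
        return {
            "nodes": None,
            "reason": reason.strip(),
            "detail": detail.strip() or None,
        }
    return {"nodes": field or None, "reason": None, "detail": None}


# The last example value is hypothetical, built from the node names used above.
for value in ("n13-[01-04]", "(Priority)",
              "(ReqNodeNotAvail, UnavailableNodes:node-01)"):
    print(value, "->", split_nodelist_reason(value))
```

The returned reason can then be looked up in the list above to explain why a job is still waiting.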

Documentation

https://manpages.org/squeue