SLURM - Stats & Diagnostics

Three commands are available for running diagnostics and pulling statistics in SLURM:

# sdiag
# sacctmgr show stats
# sacctmgr show problem

The sdiag command

Purpose

The sdiag command reports a wide range of information about:

  • the number of jobs (started, canceled, completed, failed, etc.)
  • RPC calls and who issues them
  • processing times

Example

Sample output:

# sdiag

Server thread count: 3
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 523
Jobs started: 523
Jobs completed: 501
Jobs canceled: 3
Jobs failed: 19
Main schedule statistics (microseconds):
Last cycle: 16
Max cycle: 53
Total cycles: 59
Mean cycle: 20
Mean depth cycle: 11
Cycles per minute: 1
Last queue length: 0
Backfilling stats
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total cycles: 28
Last cycle when: Wed Dec 30 15:33:18 2018
Last cycle: 93
Max cycle: 5433
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 15 microseconds
Remote Procedure Call statistics by message type
REQUEST_RESOURCE_ALLOCATION (4001) count:5 ave_time:1880 total_time:94042
REQUEST_JOB_READY (4019) count:5 ave_time:490 total_time:24520
Remote Procedure Call statistics by user
student (1002) count:32 ave_time:1405 total_time:44973
root ( 0) count:0 ave_time:0 total_time:0
Pending RPC statistics
No pending RPCs
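
The job counters in this output lend themselves to a quick health check. The sketch below is only an illustration: sdiag_job_rates is a hypothetical helper (it is not part of Slurm) that reads sdiag output on standard input and assumes the "Jobs started", "Jobs canceled" and "Jobs failed" lines look like the ones in the sample above.

```shell
# Hypothetical helper (not shipped with Slurm): reads sdiag output on stdin
# and prints the share of started jobs that failed or were canceled.
# Assumes counter lines of the form "Jobs failed: 19", as in the sample.
sdiag_job_rates() {
  awk -F': *' '
    $1 == "Jobs started"  { started  = $2 }
    $1 == "Jobs canceled" { canceled = $2 }
    $1 == "Jobs failed"   { failed   = $2 }
    END {
      if (started > 0)
        printf "failed=%.1f%% canceled=%.1f%%\n",
               100 * failed / started, 100 * canceled / started
      else
        print "no jobs started"
    }
  '
}

# On a live cluster (requires access to sdiag):
#   sdiag | sdiag_job_rates
```

With the counters from the sample (523 started, 19 failed, 3 canceled), the helper prints failed=3.6% canceled=0.6%.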

The sacctmgr command

The show stats option

The sacctmgr show stats command reports a wide range of information about:

  • rollup statistics
  • the database calls executed, and by whom

Example

Sample output:

# sacctmgr show stats

Rollup statistics
Hour count:746 ave_time:307765 max_time:223576198 total_time:229593291
Day count:31 ave_time:2429 max_time:10972 total_time:75328
Month count:1 ave_time:32007 max_time:32007 total_time:32007
Remote Procedure Call statistics by message type
DBD_CLUSTER_TRES ( 1407) count:8948 ave_time:57837 total_time:517531274
DBD_JOB_COMPLETE ( 1424) count:5 ave_time:17972 total_time:89864
DBD_FINI ( 1401) count:5 ave_time:256 total_time:1284
SLURM_PERSIST_INIT ( 6500) count:4 ave_time:341 total_time:1367
DBD_STEP_START ( 1442) count:3 ave_time:4617 total_time:13852
DBD_SEND_MULT_MSG ( 1474) count:3 ave_time:1579 total_time:4738
DBD_STEP_COMPLETE ( 1441) count:3 ave_time:1252 total_time:3757
DBD_SEND_MULT_JOB_START ( 1472) count:3 ave_time:3527 total_time:10581
DBD_JOB_START ( 1425) count:2 ave_time:1146 total_time:2292
DBD_NODE_STATE ( 1432) count:2 ave_time:2427 total_time:4854
DBD_GET_USERS ( 1415) count:1 ave_time:510 total_time:510
DBD_GET_ASSOCS ( 1410) count:1 ave_time:1768 total_time:1768
DBD_GET_RES ( 1478) count:1 ave_time:274 total_time:274
DBD_REGISTER_CTLD ( 1434) count:1 ave_time:1065 total_time:1065
DBD_GET_TRES ( 1486) count:1 ave_time:357 total_time:357
DBD_GET_FEDERATIONS ( 1494) count:1 ave_time:477 total_time:477
DBD_GET_QOS ( 1448) count:1 ave_time:208 total_time:208
DBD_GET_STATS ( 1489) count:1 ave_time:387 total_time:387
DBD_GET_CONFIG ( 1466) count:1 ave_time:99 total_time:99
Remote Procedure Call statistics by user
agil ( 1000) count:8987 ave_time:57601 total_time:517669008
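
When the list of message types is long, it can help to rank them by cumulative time. The following is a minimal sketch, not an official tool: sacctmgr_top_rpcs is a hypothetical helper that reads sacctmgr show stats output on standard input and assumes the section headers and the total_time:<value> fields appear as in the sample above.

```shell
# Hypothetical helper (not shipped with Slurm): reads "sacctmgr show stats"
# output on stdin and prints the N message types with the largest total_time
# (N defaults to 5). Only the "by message type" section is considered.
sacctmgr_top_rpcs() {
  awk '
    /Remote Procedure Call statistics by message type/ { in_section = 1; next }
    /Remote Procedure Call statistics by user/         { in_section = 0 }
    in_section {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^total_time:/) {
          split($i, kv, ":")
          print kv[2], $1      # total_time first, so sort can rank on it
        }
    }
  ' | sort -rn | head -n "${1:-5}"
}

# On a live cluster (requires access to the accounting database):
#   sacctmgr show stats | sacctmgr_top_rpcs 3
```

Each output line holds the total_time followed by the message type name, so the most expensive calls appear first.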

The show problem option

The sacctmgr show problem command lists the problems Slurm has encountered, such as users that have no UID.

Example

Sample output:

# sacctmgr show problem

Cluster Account User Problem
------- ------- ---- ------------------------
                jim  User does not have a uid
                joe  User does not have a uid
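
For scripting, the column layout is awkward to parse. sacctmgr accepts -n (--noheader) and -P (--parsable2) to produce pipe-delimited output; the sketch below assumes that show problem honors these flags and emits cluster|account|user|problem lines, which should be checked on your own installation. problem_users is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with Slurm): reads pipe-delimited lines
# of the assumed form cluster|account|user|problem on stdin and prints
# "user: problem" for every line that names a user.
problem_users() {
  awk -F'|' 'NF >= 4 && $3 != "" { print $3 ": " $4 }'
}

# On a live cluster (assumes show problem honors -n and -P):
#   sacctmgr -n -P show problem | problem_users
```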

Documentation

http://loxop4biz.minibird.jp/slurm2002/SLUG19/Troubleshooting.pdf
https://manpages.org/sacctmgr
