Hello,
We are trying to run some PIConGPU codes on a machine with 8x H100 GPUs, using Slurm. But the jobs don't run and are not in the queue. In the slurmd logs I have:
[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory extracted from credential for StepId=1079.batch job_mem_limit= 648000
[2024-10-24T09:50:40.934] Launching batch job 1079 for UID 1009
[2024-10-24T09:50:40.938] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2024-10-24T09:50:40.939] debug: gres/gpu: init: loaded
[2024-10-24T09:50:41.022] [1079.batch] debug: cgroup/v2: init: Cgroup v2 plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] debug: CPUs:192 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:2
[2024-10-24T09:50:41.026] [1079.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] CPU_BIND: Memory extracted from credential for StepId=1079.batch job_mem_limit=648000 step_mem_limit=648000
[2024-10-24T09:50:41.027] [1079.batch] debug: laying out the 8 tasks on 1 hosts mihaigpu2 dist 2
[2024-10-24T09:50:41.027] [1079.batch] gres_job_state gres:gpu(7696487) type:(null)(0) job:1079 flags:
[2024-10-24T09:50:41.027] [1079.batch] total_gres:8
[2024-10-24T09:50:41.027] [1079.batch] node_cnt:1
[2024-10-24T09:50:41.027] [1079.batch] gres_cnt_node_alloc[0]:8
[2024-10-24T09:50:41.027] [1079.batch] gres_bit_alloc[0]:0-7 of 8
[2024-10-24T09:50:41.027] [1079.batch] debug: Message thread started pid = 459054
[2024-10-24T09:50:41.027] [1079.batch] debug: Setting slurmstepd(459054) oom_score_adj to -1000
[2024-10-24T09:50:41.027] [1079.batch] debug: switch/none: init: switch NONE plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: core enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:2063720M allowed:100%(enforced), swap:0%(permissive), max:100%(2063720M) max+swap:100%(4127440M) min:30M kmem:100%(2063720M permissive) min:30M
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: memory enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] cred/munge: init: Munge credential signature plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: job_container/none: init: job_container none plugin loaded
[2024-10-24T09:50:41.030] [1079.batch] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: job: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: step: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.064] [1079.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'
[2024-10-24T09:50:41.064] [1079.batch] starting 1 tasks
[2024-10-24T09:50:41.064] [1079.batch] task 0 (459058) started 2024-10-24T09:50:41
[2024-10-24T09:50:41.069] [1079.batch] _set_limit: RLIMIT_NOFILE : reducing req:1048576 to max:131072
[2024-10-24T09:51:23.066] debug: _rpc_terminate_job: uid = 64030 JobId=1079
[2024-10-24T09:51:23.067] debug: credential for job 1079 revoked
[2024-10-24T09:51:23.067] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.067] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=18
[2024-10-24T09:51:23.068] [1079.batch] Sent signal 18 to StepId=1079.batch
[2024-10-24T09:51:23.068] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.068] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=15
[2024-10-24T09:51:23.068] [1079.batch] error: *** JOB 1079 ON mihaigpu2 CANCELLED AT 2024-10-24T09:51:23 ***
[2024-10-24T09:51:23.069] [1079.batch] Sent signal 15 to StepId=1079.batch
[2024-10-24T09:51:23.069] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.071] [1079.batch] task 0 (459058) exited. Killed by signal 15.
[2024-10-24T09:51:23.090] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.141] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.241] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.741] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:24.073] [1079.batch] debug: signaling condition
[2024-10-24T09:51:24.073] [1079.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: get_exit_code task 0 killed by cmd
[2024-10-24T09:51:24.073] [1079.batch] job 1079 completed with slurm_rc = 0, job_rc = 15
[2024-10-24T09:51:24.075] [1079.batch] debug: Message thread exited
[2024-10-24T09:51:24.154] [1079.batch] done with job
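For context, our submission script looks roughly like the sketch below; this is not the exact script, and the job name, time limit, and executable path are placeholders. The resource requests mirror what the log reports (8 tasks on 1 host, total_gres:8, job_mem_limit=648000, 64 allocated cores):

#!/bin/bash
# Sketch of the batch script (placeholders, not the real one).
#SBATCH --job-name=picongpu
#SBATCH --nodes=1
#SBATCH --ntasks=8              # one MPI rank per GPU
#SBATCH --gres=gpu:8            # all 8 GPUs on the node
#SBATCH --cpus-per-task=8       # 8 tasks x 8 CPUs = 64 cores, as in the log
#SBATCH --mem=648000M           # matches job_mem_limit in the slurmd log
#SBATCH --time=01:00:00

srun ./bin/picongpu             # actual executable path and arguments omitted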
Anyone have any idea what could be the problem?
Thank you, Mihai