Hello,
We are trying to run some PIConGPU simulations on a machine with 8x NVIDIA H100 GPUs, using Slurm. The jobs never actually run: they start, are killed less than a minute later, and are then no longer in the queue. For context, our batch script looks roughly like the sketch below; the job name, paths, and exact srun line are approximations on my part, but the resource requests match what slurmd reports for job 1079.
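#!/bin/bash
# Rough sketch of our submission script -- job name, binary path, and
# srun arguments are approximations; the resource requests match what
# slurmd logs for job 1079 below.
#SBATCH --job-name=picongpu      # approximate
#SBATCH --nodes=1                # matches node_cnt:1
#SBATCH --ntasks=8               # one rank per GPU; matches "laying out the 8 tasks"
#SBATCH --cpus-per-task=8        # 64 cores total; matches abstract cores '0-63'
#SBATCH --gres=gpu:8             # matches total_gres:8
#SBATCH --mem=648000M            # matches job_mem_limit=648000

srun ./bin/picongpu ...          # actual PIConGPU command line elided

In the slurmd log on the compute node I have: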
[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory extracted from credential for StepId=1079.batch job_mem_limit= 648000
[2024-10-24T09:50:40.934] Launching batch job 1079 for UID 1009
[2024-10-24T09:50:40.938] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2024-10-24T09:50:40.938] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2024-10-24T09:50:40.939] debug: gres/gpu: init: loaded
[2024-10-24T09:50:41.022] [1079.batch] debug: cgroup/v2: init: Cgroup v2 plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] debug: CPUs:192 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:2
[2024-10-24T09:50:41.026] [1079.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2024-10-24T09:50:41.026] [1079.batch] CPU_BIND: Memory extracted from credential for StepId=1079.batch job_mem_limit=648000 step_mem_limit=648000
[2024-10-24T09:50:41.027] [1079.batch] debug: laying out the 8 tasks on 1 hosts mihaigpu2 dist 2
[2024-10-24T09:50:41.027] [1079.batch] gres_job_state gres:gpu(7696487) type:(null)(0) job:1079 flags:
[2024-10-24T09:50:41.027] [1079.batch] total_gres:8
[2024-10-24T09:50:41.027] [1079.batch] node_cnt:1
[2024-10-24T09:50:41.027] [1079.batch] gres_cnt_node_alloc[0]:8
[2024-10-24T09:50:41.027] [1079.batch] gres_bit_alloc[0]:0-7 of 8
[2024-10-24T09:50:41.027] [1079.batch] debug: Message thread started pid = 459054
[2024-10-24T09:50:41.027] [1079.batch] debug: Setting slurmstepd(459054) oom_score_adj to -1000
[2024-10-24T09:50:41.027] [1079.batch] debug: switch/none: init: switch NONE plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: core enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:2063720M allowed:100%(enforced), swap:0%(permissive), max:100%(2063720M) max+swap:100%(4127440M) min:30M kmem:100%(2063720M permissive) min:30M
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: memory enforcement enabled
[2024-10-24T09:50:41.027] [1079.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] cred/munge: init: Munge credential signature plugin loaded
[2024-10-24T09:50:41.027] [1079.batch] debug: job_container/none: init: job_container none plugin loaded
[2024-10-24T09:50:41.030] [1079.batch] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.030] [1079.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142,144,146,148,150,152,154,156,158'
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: job: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.031] [1079.batch] task/cgroup: _memcg_initialize: step: alloc=648000MB mem.limit=648000MB memsw.limit=unlimited
[2024-10-24T09:50:41.064] [1079.batch] debug levels are stderr='error', logfile='debug', syslog='quiet'
[2024-10-24T09:50:41.064] [1079.batch] starting 1 tasks
[2024-10-24T09:50:41.064] [1079.batch] task 0 (459058) started 2024-10-24T09:50:41
[2024-10-24T09:50:41.069] [1079.batch] _set_limit: RLIMIT_NOFILE : reducing req:1048576 to max:131072
[2024-10-24T09:51:23.066] debug: _rpc_terminate_job: uid = 64030 JobId=1079
[2024-10-24T09:51:23.067] debug: credential for job 1079 revoked
[2024-10-24T09:51:23.067] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.067] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=18
[2024-10-24T09:51:23.068] [1079.batch] Sent signal 18 to StepId=1079.batch
[2024-10-24T09:51:23.068] [1079.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2024-10-24T09:51:23.068] [1079.batch] debug: _handle_signal_container for StepId=1079.batch uid=64030 signal=15
[2024-10-24T09:51:23.068] [1079.batch] error: *** JOB 1079 ON mihaigpu2 CANCELLED AT 2024-10-24T09:51:23 ***
[2024-10-24T09:51:23.069] [1079.batch] Sent signal 15 to StepId=1079.batch
[2024-10-24T09:51:23.069] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.071] [1079.batch] task 0 (459058) exited. Killed by signal 15.
[2024-10-24T09:51:23.090] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.141] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.241] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:23.741] [1079.batch] debug: Handling REQUEST_STATE
[2024-10-24T09:51:24.073] [1079.batch] debug: signaling condition
[2024-10-24T09:51:24.073] [1079.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2024-10-24T09:51:24.073] [1079.batch] debug: get_exit_code task 0 killed by cmd
[2024-10-24T09:51:24.073] [1079.batch] job 1079 completed with slurm_rc = 0, job_rc = 15
[2024-10-24T09:51:24.075] [1079.batch] debug: Message thread exited
[2024-10-24T09:51:24.154] [1079.batch] done with job
From the log it looks like the job starts fine, but about 40 seconds later slurmd receives _rpc_terminate_job from uid 64030 (which I assume is our SlurmUser, i.e. the controller itself revoked the credential and cancelled the job), and task 0 is killed with signal 15. I cannot find anything in the slurmd log that says why the job was cancelled. Does anyone have an idea what the problem could be?
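If more output would help, I can run and post things like the following (the slurmctld log path is an assumption; ours may differ):

sacct -j 1079 --format=JobID,JobName,State,ExitCode,DerivedExitCode,Elapsed,Timelimit
scontrol show job 1079        # only while the record is still in slurmctld's memory
grep 1079 /var/log/slurm/slurmctld.log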
Thank you,
Mihai