[slurm-users] Job cancelled by root - why?
Torkil Svensgaard
torkil at drcmr.dk
Tue May 19 10:59:55 UTC 2020
Hi
One of my users reported a job cancelled before it completed. She got this:
"
slurmstepd: *** JOB 390031 ON bigger4 CANCELLED AT 2020-05-18T22:27:04 ***
"
The job was apparently cancelled by root:
"
sacct -j 390031 --format="jobid,state%30"
       JobID                          State
------------ ------------------------------
390031                       CANCELLED by 0
390031.batch                      CANCELLED
"
I can only find this in the logs:
"
[2020-05-18T22:27:03.954] debug2: _slurm_rpc_dump_partitions, size=542 usec=87
[2020-05-18T22:27:04.032] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 390031 uid 0
[2020-05-18T22:27:04.032] debug3: User (null)(1501) doesn't have a default account
[2020-05-18T22:27:04.032] debug3: cons_res: _rm_job_from_res: job 390031 action 0
[2020-05-18T22:27:04.032] debug3: cons_res: removed job 390031 from part HPC row 0
[2020-05-18T22:27:04.032] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-05-18T22:27:04.033] _job_signal: 9 of running JobID=390031 State=0x8004 NodeCnt=4 successful 0x8004
...
[2020-05-18T22:27:19.143] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=390031
[2020-05-18T22:27:19.143] job_complete: JobID=390031 State=0x8004 NodeCnt=1 WTERMSIG 15
[2020-05-18T22:27:19.143] debug2: _slurm_rpc_complete_batch_script JobId=390031: Job/step already completing or completed
"
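The only line above that names a requester is the REQUEST_KILL_JOB one, where "uid 0" matches the "CANCELLED by 0" from sacct. As a rough sketch for pulling such lines out of a larger slurmctld log, something like the following might help. It assumes the log format is exactly as in the excerpt above (other Slurm versions may word it differently), and the function name find_kill_requests is made up for illustration:

```python
import re

# Matches the slurmctld line recording who asked for the kill, e.g.
# "[<timestamp>] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job <id> uid <uid>".
# \s+ also matches newlines, so lines wrapped by a mail client still match.
KILL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+_slurm_rpc_kill_job2:\s+REQUEST_KILL_JOB\s+"
    r"job\s+(?P<job>\d+)\s+uid\s+(?P<uid>\d+)"
)


def find_kill_requests(log_text, job_id):
    """Return (timestamp, uid) pairs for kill requests against job_id."""
    hits = []
    for match in KILL_RE.finditer(log_text):
        if int(match.group("job")) == job_id:
            hits.append((match.group("ts"), int(match.group("uid"))))
    return hits


if __name__ == "__main__":
    # Sample line copied from the log excerpt in this message.
    sample = (
        "[2020-05-18T22:27:04.032] _slurm_rpc_kill_job2: "
        "REQUEST_KILL_JOB job 390031 uid 0\n"
    )
    print(find_kill_requests(sample, 390031))
    # prints [('2020-05-18T22:27:04.032', 0)]
```

This only shows which uid sent the request and when, not why it was sent.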
How do I determine why the job was cancelled? Usually this only happens
when the OOM killer strikes, but that doesn't seem to be the case here.
Thanks,
Torkil