[slurm-users] Job cancelled by root - why?

Torkil Svensgaard torkil at drcmr.dk
Tue May 19 10:59:55 UTC 2020


Hi

One of my users reported a job cancelled before it completed. She got this:

"
slurmstepd: *** JOB 390031 ON bigger4 CANCELLED AT 2020-05-18T22:27:04 ***
"

The job was apparently cancelled by root:

"
sacct -j 390031 --format="jobid,state%30"
        JobID                          State
------------ ------------------------------
390031                       CANCELLED by 0
390031.batch                      CANCELLED
"

I can only find this in the logs:

"
[2020-05-18T22:27:03.954] debug2: _slurm_rpc_dump_partitions, size=542 usec=87
[2020-05-18T22:27:04.032] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 390031 uid 0
[2020-05-18T22:27:04.032] debug3: User (null)(1501) doesn't have a default account
[2020-05-18T22:27:04.032] debug3: cons_res: _rm_job_from_res: job 390031 action 0
[2020-05-18T22:27:04.032] debug3: cons_res: removed job 390031 from part HPC row 0
[2020-05-18T22:27:04.032] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-05-18T22:27:04.033] _job_signal: 9 of running JobID=390031 State=0x8004 NodeCnt=4 successful 0x8004
...
[2020-05-18T22:27:19.143] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=390031
[2020-05-18T22:27:19.143] job_complete: JobID=390031 State=0x8004 NodeCnt=1 WTERMSIG 15
[2020-05-18T22:27:19.143] debug2: _slurm_rpc_complete_batch_script JobId=390031: Job/step already completing or completed
"
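
For anyone searching their own logs for the same thing, here is a rough sketch of how the kill request can be pulled out of the slurmctld log. The names `find_kill_requests` and `sample` are made up for illustration, and the pattern assumes the message format seen in the excerpt above, which may differ between Slurm versions. It only reports when the request arrived and from which uid, not what triggered it.

```python
import re

# Assumed format, taken from the log excerpt above:
# [timestamp] <function>: REQUEST_KILL_JOB job <id> uid <uid>
KILL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \S+: REQUEST_KILL_JOB job (?P<job>\d+) uid (?P<uid>\d+)"
)


def find_kill_requests(lines, job_id):
    """Return (timestamp, uid) pairs for kill requests targeting job_id."""
    hits = []
    for line in lines:
        match = KILL_RE.search(line)
        if match and int(match.group("job")) == job_id:
            hits.append((match.group("ts"), int(match.group("uid"))))
    return hits


# Two lines copied from the excerpt above, with the mail wrapping undone.
sample = [
    "[2020-05-18T22:27:03.954] debug2: _slurm_rpc_dump_partitions, size=542 usec=87",
    "[2020-05-18T22:27:04.032] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 390031 uid 0",
]

for timestamp, uid in find_kill_requests(sample, 390031):
    print(f"job 390031: kill requested at {timestamp} by uid {uid}")
```

On a real system the `lines` argument would be an open file handle on the slurmctld log; the path depends on the site configuration, so it is left out here.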

How do I determine why the job was cancelled? Usually this only happens 
when the OOM killer strikes, but that doesn't seem to be the case here.

Thanks,

Torkil



