[slurm-users] slurm status says jobs are running but they aren't

c b breedthoughts.www at gmail.com
Mon Mar 2 16:05:50 UTC 2020


I have a bunch of jobs that, according to the Slurm status, have been
running for 30+ minutes but in reality aren't running.  When I go to the
node where a job is supposed to be, its processes aren't there (they don't
show up in top or ps) and the job's stdout/stderr logs are empty.  I know
it's not a problem with the job definition because I can run it myself on
the node in question without any problem, and if it were running correctly
it would be printing to stdout almost immediately.  Does anyone know what
could be happening?  Below are the snippets from my slurmctld and slurmd
logs for this job.


[2020-02-29T15:28:03.832] _slurm_rpc_submit_batch_job: JobId=7784818 InitPrio=673 usec=20715
[2020-03-01T11:24:39.744] sched: _hold_job_rec: hold on JobId=7784818 by uid 10234
[2020-03-01T11:24:39.744] sched: _update_job: set priority to 0 for
[2020-03-01T11:24:39.744] _slurm_rpc_update_job: complete JobId=7784818 uid=10234 usec=717
[2020-03-01T13:04:08.006] _slurm_rpc_update_job: complete JobId=7784818 uid=10234 usec=501
[2020-03-02T10:06:43.326] sched: _release_job_rec: release hold on JobId=7784818 by uid 10234
[2020-03-02T10:06:43.326] _slurm_rpc_update_job: complete JobId=7784818 uid=10234 usec=286461
[2020-03-02T10:06:49.626] sched: Allocate JobId=7784818 NodeList=node1 #CPUs=1 Partition=debug

[2020-03-02T10:06:49.649] debug:  task_p_slurmd_batch_request: 7784818
[2020-03-02T10:06:49.650] _run_prolog: prolog with lock for job 7784818 ran for 0 seconds
[2020-03-02T10:06:49.650] Launching batch job 7784818 for UID 10234
[2020-03-02T10:06:49.668] [7784818.batch] debug:  Job accounting gather NOT_INVOKED plugin loaded
[2020-03-02T10:06:49.669] [7784818.batch] debug:  laying out the 1 tasks on 1 hosts node1 dist 2
[2020-03-02T10:06:49.669] [7784818.batch] debug:  Message thread started pid = 8684
[2020-03-02T10:06:49.672] [7784818.batch] debug:  task NONE plugin loaded
[2020-03-02T10:06:49.673] [7784818.batch] debug:  Checkpoint plugin loaded:
[2020-03-02T10:06:49.674] [7784818.batch] Munge credential signature plugin
[2020-03-02T10:06:49.676] [7784818.batch] debug:  job_container none plugin
[2020-03-02T10:06:49.676] [7784818.batch] debug:  spank: opening plugin stack /usr/local/install/slurm-19.05.2/etc/plugstack.conf
[2020-03-02T10:06:49.680] [7784818.batch] debug level = 2
[2020-03-02T10:06:49.680] [7784818.batch] starting 1 tasks
[2020-03-02T10:06:49.680] [7784818.batch] task 0 (8690) started
[2020-03-02T10:06:49.680] [7784818.batch] debug:  task_p_pre_launch_priv:
[2020-03-02T10:06:49.681] [7784818.batch] debug:  task_p_pre_launch: 7784818.4294967294, task 0
[2020-03-02T10:31:43.142] [7784818.batch] debug:  Handling REQUEST_STATE
[2020-03-02T10:31:43.142] debug:  _fill_registration_msg: found apparently running job 7784818
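
The slurmd log above records the PID of the launched task ("task 0 (8690) started"), so the "processes aren't there" check can be made against that exact PID instead of scanning top or ps by eye. The following is only a minimal sketch of that cross-check, assuming it is run on the compute node itself. The helper names task_pids and pid_alive and the regular expression are hypothetical, written against the log format shown in this post, and are not part of Slurm.

```python
import os
import re

# Matches slurmstepd lines of the form seen in the log above, e.g.
#   [2020-03-02T10:06:49.680] [7784818.batch] task 0 (8690) started
# Assumption: the pattern is inferred from this post, not from Slurm docs.
TASK_STARTED = re.compile(r"\[(\d+)\.batch\] task (\d+) \((\d+)\) started")


def task_pids(log_lines, job_id):
    """Return the PIDs logged as started for the batch step of job_id."""
    pids = []
    for line in log_lines:
        match = TASK_STARTED.search(line)
        if match and int(match.group(1)) == job_id:
            pids.append(int(match.group(3)))
    return pids


def pid_alive(pid):
    """Return True if a process with this PID currently exists on this host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Two lines copied from the slurmd log in this post.
    sample = [
        "[2020-03-02T10:06:49.680] [7784818.batch] starting 1 tasks",
        "[2020-03-02T10:06:49.680] [7784818.batch] task 0 (8690) started",
    ]
    for pid in task_pids(sample, 7784818):
        print(pid, "alive" if pid_alive(pid) else "not found")
```

A PID that is reported alive may have been reused by an unrelated process since the job was launched, so a hit still needs to be confirmed with ps on the node.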