Hello
At site RO-14-ITIM, after a power failure, I get the following problem:
2024-08-06 15:53:04 Finished - job id: c9INDmclYv5ngvuSSqSAreymYz3jwmOETUEmV71LDmABFKDm7KNpMn, unix user: 1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: "/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-egi@cro-ngi.hr", lrms: SLURM, queue: debug, lrmsid: 274399, failure: "LRMS error: (-1) Job missing from SLURM."
2024-08-06 15:53:04 Finished - job id: tjJNDmclYv5ngvuSSqSAreymYz3jwmOETUEmd71LDmABFKDmePf7To, unix user: 1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: "/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-egi@cro-ngi.hr", lrms: SLURM, queue: debug, lrmsid: 274400, failure: "LRMS error: (-1) Job missing from SLURM."
2024-08-06 15:53:04 Finished - job id: kiJNDmclYv5ngvuSSqSAreymYz3jwmOETUEml71LDmABFKDmCmwifm, unix user: 1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: "/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-egi@cro-ngi.hr", lrms: SLURM, queue: debug, lrmsid: 274398, failure: "LRMS error: (-1) Job missing from SLURM."
The jobs cannot be seen in sinfo or squeue.
Any indication of where to look to diagnose the problem?
Thank you
Felix
Felix,
Finished jobs roll off the list shown in squeue, so that may be no surprise (depending on settings). If there was a power failure that caused the nodes to restart, it could also be that the job had not been written to slurmdbd, making it unavailable to sacct as well.
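To check whether the accounting database still has a record of these jobs, one option is to ask sacct for them by ID. The following is only a minimal sketch: it assumes the lrmsid values from the log are the Slurm job IDs, and the helper name `sacct_command` is hypothetical. The `-j`, `--format`, and `--parsable2` options are standard sacct options.

```python
import shutil
import subprocess


def sacct_command(job_ids):
    """Build an sacct invocation for the given Slurm job IDs (hypothetical helper)."""
    ids = ",".join(str(j) for j in job_ids)
    return [
        "sacct",
        "-j", ids,
        "--format=JobID,JobName,State,ExitCode,Start,End",
        "--parsable2",  # pipe-delimited output without a trailing delimiter
    ]


if __name__ == "__main__":
    # Assumption: the lrmsid values in the front-end log are the Slurm job IDs.
    cmd = sacct_command([274398, 274399, 274400])
    if shutil.which("sacct"):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
    else:
        print("sacct not found; would run:", " ".join(cmd))
```

If sacct returns only the header line for these IDs, that would be consistent with the jobs never having been written to slurmdbd.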
Your logs appear to come from a front-end system that interfaces with Slurm, and they do not seem to show the actual Slurm job ID, unless those are the lrmsid values 274398, 274399, and 274400. If so, you could search the slurmctld logs for those job IDs to see what may have happened.
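As a rough illustration of that search, the sketch below pulls the lrmsid values out of the front-end log and prints the slurmctld log lines that mention them. It rests on assumptions: that the lrmsid values are the Slurm job IDs, and that slurmctld writes them as `JobId=<number>` (older Slurm versions may word this differently). The function names are hypothetical, and both log paths must be supplied by you, since the slurmctld log location depends on the SlurmctldLogFile setting in slurm.conf.

```python
import re
import sys

# Matches the "lrmsid: 274399" field seen in the front-end log lines.
LRMSID_RE = re.compile(r"lrmsid:\s*(\d+)")


def extract_lrmsids(frontend_log_text):
    """Return the unique lrmsid values found in the text, sorted numerically."""
    return sorted(set(LRMSID_RE.findall(frontend_log_text)), key=int)


def find_job_lines(log_lines, job_ids):
    """Return log lines that mention any of the given job IDs as JobId=<id>.

    Assumption: slurmctld logs jobs as "JobId=<number>". The (?!\\d) lookahead
    keeps 274399 from matching inside a longer ID such as 2743990.
    """
    if not job_ids:
        return []
    alternatives = "|".join(re.escape(str(j)) for j in job_ids)
    pattern = re.compile(r"JobId=(?:%s)(?!\d)" % alternatives, re.IGNORECASE)
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


if __name__ == "__main__":
    if len(sys.argv) == 3:
        frontend_log_path, slurmctld_log_path = sys.argv[1], sys.argv[2]
        with open(frontend_log_path, encoding="utf-8", errors="replace") as f:
            ids = extract_lrmsids(f.read())
        with open(slurmctld_log_path, encoding="utf-8", errors="replace") as f:
            for line in find_job_lines(f, ids):
                print(line)
    else:
        print("usage: script.py FRONTEND_LOG SLURMCTLD_LOG")
```

Lines around the time of the power failure are the ones most likely to show what slurmctld did with each job when it came back up.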
Brian Andrus
On 8/6/2024 5:57 AM, Felix via slurm-users wrote: