[slurm-users] unable to kill namd3 process

Shaghuf Rahman shaghuf at gmail.com
Wed May 3 10:15:19 UTC 2023


Hi,

As an update, we tried one approach; please find it below:

We added the script below to our epilog script to kill any remaining
namd3 processes of the job.

# Kill remaining processes of the job.
#
if [ "$SLURM_UID" = "1234" ]; then
        STUCK_PID=$(${SLURM_BIN}scontrol listpids "$SLURM_JOB_ID" \
                | awk '{print $1}' | grep -v PID)
        for kpid in $STUCK_PID
        do
                kill -9 "$kpid"
        done
fi
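
For reference, the same lookup can be tested by hand on the compute node
while a job is active (12345 is a placeholder job id):

scontrol listpids 12345

One caveat: by the time the epilog runs, the job's steps may already be
gone from slurmd's tracking, in which case listpids has nothing left to
report.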

In our case it did not work out: "scontrol listpids" was unable to fetch
the required PIDs from inside the epilog.

It looks like slurmd had a problem with a job step that did not end
correctly, and it was not able to kill the step even after the timeout
was reached.
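
For completeness, the timeout in question is set in slurm.conf: when a
step survives past UnkillableStepTimeout, slurmd drains the node with
the reason "Kill task failed". The two relevant settings look like this
(the value and the script path /etc/slurm/unkillable.sh are only
illustrative, not our actual config):

# slurm.conf (illustrative values only)
UnkillableStepTimeout=180
UnkillableStepProgram=/etc/slurm/unkillable.sh

UnkillableStepProgram runs on the node when slurmd gives up on a step,
so it may be a better hook than the epilog for application-specific
cleanup of stuck namd3 processes.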

Any help would be much appreciated.

Thanks,
Shaghuf Rahman


On Tue, Apr 25, 2023 at 8:32 PM Shaghuf Rahman <shaghuf at gmail.com> wrote:

> Hi,
>
> Also, I forgot to mention: the process is still running after the user does
> scancel, and the epilog does not clean up when one job finishes during a
> multiple-job submission.
> We tried the unkillable option, but it did not work. The process still
> remains until it is killed manually.
>
>
>
> On Tue, 25 Apr 2023 at 19:57, Shaghuf Rahman <shaghuf at gmail.com> wrote:
>
>> Hi,
>>
>> We are facing an issue in our environment, and the behaviour looks strange
>> to me. It is specifically associated with the namd3 application.
>> The issue is described below, along with the cases we have tested.
>>
>> I am trying to understand how to kill the processes of a namd3 job
>> submitted through sbatch without putting the node into the drain state.
>>
>> What I observed is that when a user submits a single job on a node and
>> then does scancel on the namd3 job, the job is killed, the node returns to
>> the idle state, and everything looks as expected.
>> But when the user submits multiple jobs on a single node and does scancel
>> on one of them, the node goes into the drain state. The other jobs,
>> however, keep running without any issue.
>>
>> Because of this, multiple nodes end up in the drain state whenever a user
>> does scancel on a namd3 job.
>>
>> Note: when the user does not perform scancel, all jobs run successfully
>> and the node states are also fine.
>>
>> This does not happen with any other application, so we suspect the issue
>> is with the namd3 application itself.
>> Kindly suggest a solution or share any ideas on how to fix this issue.
>>
>> Thanks in advance,
>> Shaghuf Rahman
>>
>>