<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>The epilog script does have exit 0 set at the end, and epilogs exit
cleanly when run.</p>
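<p>To be clear about what I mean by "exit 0 set at the end": the script ends
with an explicit exit 0 after its cleanup, along the lines of the minimal
sketch below. This is only an illustration (the log path is made up), not the
actual /cm/local/apps/cmd/scripts/epilog.</p>
<pre>
#!/bin/bash
# Minimal epilog sketch. Site-specific cleanup would go here; the final
# line forces a zero return code so slurmd never treats the epilog
# itself as having failed.
echo "$(date) epilog finished for job ${SLURM_JOB_ID}" >> /tmp/epilog-test.log
exit 0
</pre>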
<p>With the log level set to debug5, I get the following results for any
scancel call.</p>
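<p>For completeness, this is roughly how the extra logging can be turned on;
a sketch of the standard knobs only, not necessarily the exact steps used on
this cluster:</p>
<pre>
# Persistent: set both daemons to debug5 in slurm.conf, then push the change.
#   SlurmctldDebug=debug5
#   SlurmdDebug=debug5
scontrol reconfigure

# Controller only, on the fly (reverts at the next restart or reconfigure):
scontrol setdebug debug5
</pre>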
<p>Submit host slurmctld.log:</p>
<blockquote type="cite">[2020-11-18T10:19:34.944]
_slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191<br>
[2020-11-18T10:19:35.231] debug: sched: Running job scheduler<br>
[2020-11-18T10:19:35.231] sched: Allocate JobId=110
NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute<br>
[2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
JobId=110 is complete<br>
[2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
JobId=110 uid 0<br>
[2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
successful 0x8004</blockquote>
<blockquote type="cite">[2020-11-18T10:20:59.282] Resending
TERMINATE_JOB request JobId=110 Nodelist=martin-cn-01</blockquote>
<blockquote type="cite">[2020-11-18T10:21:10.000] update_node:
node martin-cn-01 reason set to: Kill task failed<br>
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 state
set to DRAINING<br>
[2020-11-18T10:21:10.001] error: slurmd error running JobId=110
on node(s)=martin-cn-01: Kill task failed<br>
[2020-11-18T10:21:10.124] cleanup_completing: JobId=110
completion process took 91 seconds<br>
</blockquote>
<p>Compute node slurmd.log:</p>
<blockquote type="cite">[2020-11-18T10:19:35.388]
task_p_slurmd_batch_request: 110<br>
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask
for node: 0x00000000000000000000000000000003<br>
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW
mask for node: 0x00000000000000010000000000000001<br>
[2020-11-18T10:19:35.388] debug: [job 110] attempting to run
prolog [/cm/local/apps/cmd/scripts/prolog]<br>
[2020-11-18T10:19:35.392] _run_prolog: run job script took
usec=4085<br>
[2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job
110 ran for 0 seconds<br>
[2020-11-18T10:19:35.393] Launching batch job 110 for UID 0<br>
[2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin
loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin
loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE
plugin loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE
plugin loaded<br>
[2020-11-18T10:19:35.398] debug: switch NONE plugin loaded<br>
[2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded<br>
[2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node
as a socket<br>
[2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
Sockets:8 CoresPerSocket:8 ThreadsPerCore:2<br>
[2020-11-18T10:19:35.400] [110.batch] debug: Job accounting
gather cgroup plugin loaded<br>
[2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't
been set yet not running poll<br>
[2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1
tasks on 1 hosts martin-cn-01 dist 2<br>
[2020-11-18T10:19:35.401] [110.batch] debug: Message thread
started pid = 267029<br>
[2020-11-18T10:19:35.401] [110.batch] task affinity plugin
loaded with CPU mask
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated cores<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
task/cgroup/memory: total:515440M allowed:100%(enforced),
swap:0%(enforced), max:100%(515440M) max+swap:100%(1030880M)
min:30M kmem:100%(515440M enforced) min:30M swappiness:100(set)<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated memory<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup:
unable to open
/cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
No such file or directory<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated devices<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup:
loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
loaded: checkpoint/none<br>
[2020-11-18T10:19:35.402] [110.batch] Munge credential signature
plugin loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug: job_container none
plugin loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm'
already exists<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.403] [110.batch] debug: spank: opening
plugin stack /etc/slurm/plugstack.conf<br>
[2020-11-18T10:19:35.403] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm'
already exists<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
abstract cores are '0'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
abstract cores are '0'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
physical cores are '0,64'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
physical cores are '0,64'<br>
[2020-11-18T10:19:35.403] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.406] [110.batch] task/cgroup:
/slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB
memsw.limit=4096MB<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.406] [110.batch] task/cgroup:
/slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
memsw.limit=4096MB<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
_oom_event_monitor: started.<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.438] [110.batch] debug level = 2<br>
[2020-11-18T10:19:35.438] [110.batch] starting 1 tasks<br>
[2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
2020-11-18T10:19:35<br>
[2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
oom_adj to -1000<br>
[2020-11-18T10:19:35.438] [110.batch] debug:
jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid
4294967294 taskid 0 max_task_id 0<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid
4294967294 taskid 0 max_task_id 0<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
already exists<br>
[2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
affinity jobid 110.4294967294, task:0 bind:256<br>
[2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
sched_affinity for tasks<br>
[2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450<br>
[2020-11-18T10:19:39.770] debug:
task_p_slurmd_release_resources: affinity jobid 110<br>
[2020-11-18T10:19:39.770] debug: credential for job 110 revoked<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_SIGNAL_CONTAINER<br>
[2020-11-18T10:19:39.771] [110.batch] debug:
_handle_signal_container for step=110.4294967294 uid=450
signal=18<br>
[2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to
110.4294967294<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_SIGNAL_CONTAINER<br>
[2020-11-18T10:19:39.771] [110.batch] debug:
_handle_signal_container for step=110.4294967294 uid=450
signal=15<br>
[2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***<br>
[2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to
110.4294967294<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.792] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.842] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.942] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:40.442] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:41.442] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:42.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:43.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:44.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:45.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:46.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:47.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:48.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:49.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:50.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:00.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.445] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:20:09.445] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.465] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.515] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.616] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:10.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:20:12.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:12.116] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
</blockquote>
<p>SIGKILL signals continue until the unkillable time limit is reached,
at which point Slurm closes the job, leaving the processes running, with
the following output:</p>
<blockquote type="cite">[2020-11-18T10:20:59.282] debug:
_rpc_terminate_job, uid = 450<br>
[2020-11-18T10:20:59.282] debug:
task_p_slurmd_release_resources: affinity jobid 110<br>
[2020-11-18T10:21:00.121] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:21:00.121] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
ENDING WITH SIGNALS ***<br>
[2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task
0 killed by cmd<br>
[2020-11-18T10:21:10.000] [110.batch] job 110 completed with
slurm_rc = 4001, job_rc = -1<br>
[2020-11-18T10:21:10.000] [110.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1<br>
[2020-11-18T10:21:10.001] [110.batch] debug: Message thread
exited<br>
[2020-11-18T10:21:10.001] [110.batch] done with job<br>
[2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog
to complete<br>
[2020-11-18T10:21:10.121] debug: Finished wait for job 110's
prolog to complete<br>
[2020-11-18T10:21:10.121] debug: [job 110] attempting to run
epilog [/cm/local/apps/cmd/scripts/epilog]<br>
[2020-11-18T10:21:10.124] debug: completed epilog for jobid 110<br>
[2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete
msg: rc = 0<br>
</blockquote>
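<p>For what it's worth, the window before that "JOB NOT ENDING WITH SIGNALS"
message is governed by the unkillable-step settings in slurm.conf. A quick way
to see what is in effect is shown below; the commented values are only an
example of a default install, not this cluster's configuration:</p>
<pre>
# Show the unkillable-step settings the daemons are actually running with.
scontrol show config | grep -i unkillable

# Example output on a default install (not this cluster's values):
#   UnkillableStepProgram   = (null)
#   UnkillableStepTimeout   = 60 sec
</pre>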
<div class="moz-signature">
<p>Willy Markuske<br>
HPC Systems Engineer<br>
Research Data Services<br>
P: (858) 246-5593</p>
</div>
<div class="moz-cite-prefix">On 11/18/20 9:41 AM, mercan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9f9643ff-bf36-facd-d228-7569d43a5301@uhem.itu.edu.tr">Hi;
<br>
<br>
Check the epilog return value, which comes from the return value of the
last line of the epilog script. Also, you can add an "exit 0" line as
the last line of the epilog script to ensure you get a zero return
value for testing purposes.
<br>
<br>
Ahmet M.
<br>
<br>
<br>
On 18.11.2020 20:00, William Markuske wrote:
<br>
<blockquote type="cite">
<br>
Hello,
<br>
<br>
I am having an odd problem where users are unable to kill their
jobs with scancel. Users can submit jobs just fine, and when the
task completes it closes correctly. However, if a user
attempts to cancel a job via scancel, the SIGKILL signals are
sent to the step but don't take effect. Slurmd then continues to
send SIGKILL requests until the UnkillableStepTimeout is hit, the
Slurm job exits with an error, the node enters a draining
state, and the spawned processes continue to run on the node.
<br>
<br>
I'm at a loss, because jobs can complete without issue, which
seems to suggest it's not a networking or permissions issue
preventing Slurm from doing its job accounting tasks. A user can
ssh to the node once a job is submitted and kill the subprocesses
manually, at which point Slurm completes the epilog and the node
returns to idle.
<br>
<br>
Does anyone know what may be causing such behavior? Please let
me know any slurm.conf or cgroup.conf settings that would be
helpful to diagnose this issue. I'm quite stumped by this one.
<br>
<br>
-- <br>
<br>
Willy Markuske
<br>
<br>
HPC Systems Engineer
<br>
<br>
<br>
<br>
Research Data Services
<br>
<br>
P: (858) 246-5593
<br>
<br>
</blockquote>
</blockquote>
</body>
</html>