<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>The epilog script does end with an explicit exit 0, and epilogs
      exit cleanly when run.<br>
    </p>
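    <p>The tail of the epilog is just that plain exit 0, and a quick way to
      sanity-check its return code by hand is something like the following
      (this assumes the script can be run standalone outside a real Slurm
      job environment):</p>
    <pre>
# tail of /cm/local/apps/cmd/scripts/epilog -- site-specific cleanup above, omitted here
exit 0

# manual sanity check on a compute node
/cm/local/apps/cmd/scripts/epilog; echo "epilog exit status: $?"
</pre>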
    <p>With logging set to debug5, I get the following results for any
      scancel call.</p>
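    <p>For reference, the debug level was raised along these lines; the
      snippet below is indicative rather than a literal copy of our
      slurm.conf:</p>
    <pre>
# slurm.conf (illustrative excerpt) -- raise daemon log verbosity
SlurmctldDebug=debug5
SlurmdDebug=debug5

# then push the change to the running daemons
scontrol reconfigure
</pre>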
    <p>Submit host slurmctld.log:</p>
    <p>
      <blockquote type="cite">[2020-11-18T10:19:34.944]
        _slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191<br>
        [2020-11-18T10:19:35.231] debug:  sched: Running job scheduler<br>
        [2020-11-18T10:19:35.231] sched: Allocate JobId=110
        NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute<br>
        [2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
        JobId=110 is complete<br>
        [2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
        JobId=110 uid 0<br>
        [2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
        successful 0x8004</blockquote>
      <blockquote type="cite">[2020-11-18T10:20:59.282] Resending
        TERMINATE_JOB request JobId=110 Nodelist=martin-cn-01</blockquote>
      <blockquote type="cite">[2020-11-18T10:21:10.000] update_node:
        node martin-cn-01 reason set to: Kill task failed<br>
        [2020-11-18T10:21:10.000] update_node: node martin-cn-01 state
        set to DRAINING<br>
        [2020-11-18T10:21:10.001] error: slurmd error running JobId=110
        on node(s)=martin-cn-01: Kill task failed<br>
        [2020-11-18T10:21:10.124] cleanup_completing: JobId=110
        completion process took 91 seconds<br>
      </blockquote>
      Compute node slurmd.log:</p>
    <p>
      <blockquote type="cite">[2020-11-18T10:19:35.388]
        task_p_slurmd_batch_request: 110<br>
        [2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask
        for node: 0x00000000000000000000000000000003<br>
        [2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW
        mask for node: 0x00000000000000010000000000000001<br>
        [2020-11-18T10:19:35.388] debug:  [job 110] attempting to run
        prolog [/cm/local/apps/cmd/scripts/prolog]<br>
        [2020-11-18T10:19:35.392] _run_prolog: run job script took
        usec=4085<br>
        [2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job
        110 ran for 0 seconds<br>
        [2020-11-18T10:19:35.393] Launching batch job 110 for UID 0<br>
        [2020-11-18T10:19:35.398] debug:  AcctGatherEnergy NONE plugin
        loaded<br>
        [2020-11-18T10:19:35.398] debug:  AcctGatherProfile NONE plugin
        loaded<br>
        [2020-11-18T10:19:35.398] debug:  AcctGatherInterconnect NONE
        plugin loaded<br>
        [2020-11-18T10:19:35.398] debug:  AcctGatherFilesystem NONE
        plugin loaded<br>
        [2020-11-18T10:19:35.398] debug:  switch NONE plugin loaded<br>
        [2020-11-18T10:19:35.398] debug:  init: Gres GPU plugin loaded<br>
        [2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node
        as a socket<br>
        [2020-11-18T10:19:35.400] [110.batch] debug:  CPUs:128 Boards:1
        Sockets:8 CoresPerSocket:8 ThreadsPerCore:2<br>
        [2020-11-18T10:19:35.400] [110.batch] debug:  Job accounting
        gather cgroup plugin loaded<br>
        [2020-11-18T10:19:35.401] [110.batch] debug:  cont_id hasn't
        been set yet not running poll<br>
        [2020-11-18T10:19:35.401] [110.batch] debug:  laying out the 1
        tasks on 1 hosts martin-cn-01 dist 2<br>
        [2020-11-18T10:19:35.401] [110.batch] debug:  Message thread
        started pid = 267029<br>
        [2020-11-18T10:19:35.401] [110.batch] task affinity plugin
        loaded with CPU mask
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now
        constraining jobs allocated cores<br>
        [2020-11-18T10:19:35.402] [110.batch] debug: 
        task/cgroup/memory: total:515440M allowed:100%(enforced),
        swap:0%(enforced), max:100%(515440M) max+swap:100%(1030880M)
        min:30M kmem:100%(515440M enforced) min:30M swappiness:100(set)<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now
        constraining jobs allocated memory<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup:
        unable to open
        /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
        No such file or directory<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now
        constraining jobs allocated devices<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup:
        loaded<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  Checkpoint plugin
        loaded: checkpoint/none<br>
        [2020-11-18T10:19:35.402] [110.batch] Munge credential signature
        plugin loaded<br>
        [2020-11-18T10:19:35.402] [110.batch] debug:  job_container none
        plugin loaded<br>
        [2020-11-18T10:19:35.402] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm'
        already exists<br>
        [2020-11-18T10:19:35.402] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.403] [110.batch] debug:  spank: opening
        plugin stack /etc/slurm/plugstack.conf<br>
        [2020-11-18T10:19:35.403] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm'
        already exists<br>
        [2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job
        abstract cores are '0'<br>
        [2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step
        abstract cores are '0'<br>
        [2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job
        physical cores are '0,64'<br>
        [2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step
        physical cores are '0,64'<br>
        [2020-11-18T10:19:35.403] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
        already exists<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
        [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
        /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB
        memsw.limit=4096MB<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
        exists<br>
        [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
        /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
        memsw.limit=4096MB<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        _oom_event_monitor: started.<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm'
        already exists<br>
        [2020-11-18T10:19:35.406] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.438] [110.batch] debug level = 2<br>
        [2020-11-18T10:19:35.438] [110.batch] starting 1 tasks<br>
        [2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
        2020-11-18T10:19:35<br>
        [2020-11-18T10:19:35.438] [110.batch] debug:  Setting slurmstepd
        oom_adj to -1000<br>
        [2020-11-18T10:19:35.438] [110.batch] debug: 
        jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid
        4294967294 taskid 0 max_task_id 0<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm'
        already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already
        exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid
        4294967294 taskid 0 max_task_id 0<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
        already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
        already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
        exists<br>
        [2020-11-18T10:19:35.439] [110.batch] debug: 
        xcgroup_instantiate: cgroup
        '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
        already exists<br>
        [2020-11-18T10:19:35.440] [110.batch] debug:  task_p_pre_launch:
        affinity jobid 110.4294967294, task:0 bind:256<br>
        [2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
        sched_affinity for tasks<br>
        [2020-11-18T10:19:39.770] debug:  _rpc_terminate_job, uid = 450<br>
        [2020-11-18T10:19:39.770] debug: 
        task_p_slurmd_release_resources: affinity jobid 110<br>
        [2020-11-18T10:19:39.770] debug:  credential for job 110 revoked<br>
        [2020-11-18T10:19:39.771] [110.batch] debug:  Handling
        REQUEST_SIGNAL_CONTAINER<br>
        [2020-11-18T10:19:39.771] [110.batch] debug: 
        _handle_signal_container for step=110.4294967294 uid=450
        signal=18<br>
        [2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to
        110.4294967294<br>
        [2020-11-18T10:19:39.771] [110.batch] debug:  Handling
        REQUEST_SIGNAL_CONTAINER<br>
        [2020-11-18T10:19:39.771] [110.batch] debug: 
        _handle_signal_container for step=110.4294967294 uid=450
        signal=15<br>
        [2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
        martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***<br>
        [2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to
        110.4294967294<br>
        [2020-11-18T10:19:39.771] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:39.792] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:39.842] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:39.942] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:40.442] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:41.442] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:42.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:43.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:44.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:45.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:46.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:47.443] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:48.444] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:49.444] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:19:50.444] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:00.444] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:09.445] [110.batch] debug:  Handling
        REQUEST_STEP_TERMINATE<br>
        [2020-11-18T10:20:09.445] [110.batch] debug:  _handle_terminate
        for step=110.4294967294 uid=0<br>
        [2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
        110.4294967294<br>
        [2020-11-18T10:20:09.445] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:09.465] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:09.515] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:09.616] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:10.116] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:11.116] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:11.116] [110.batch] debug:  Handling
        REQUEST_STEP_TERMINATE<br>
        [2020-11-18T10:20:11.116] [110.batch] debug:  _handle_terminate
        for step=110.4294967294 uid=0<br>
        [2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
        110.4294967294<br>
        [2020-11-18T10:20:12.116] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:20:12.116] [110.batch] debug:  Handling
        REQUEST_STEP_TERMINATE<br>
        [2020-11-18T10:20:12.116] [110.batch] debug:  _handle_terminate
        for step=110.4294967294 uid=0<br>
      </blockquote>
      SIGKILL signals continue until the unkillable step timeout is
      reached; Slurm then closes the job but leaves the processes running
      on the node, with the following output (the relevant timeout
      settings are sketched after the log below):<br>
    </p>
    <p>
      <blockquote type="cite">[2020-11-18T10:20:59.282] debug: 
        _rpc_terminate_job, uid = 450<br>
        [2020-11-18T10:20:59.282] debug: 
        task_p_slurmd_release_resources: affinity jobid 110<br>
        [2020-11-18T10:21:00.121] [110.batch] debug:  Handling
        REQUEST_STATE<br>
        [2020-11-18T10:21:00.121] [110.batch] debug:  Handling
        REQUEST_STEP_TERMINATE<br>
        [2020-11-18T10:21:00.121] [110.batch] debug:  _handle_terminate
        for step=110.4294967294 uid=0<br>
        [2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
        110.4294967294<br>
        [2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
        TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
        ENDING WITH SIGNALS ***<br>
        [2020-11-18T10:21:10.000] [110.batch] debug:  get_exit_code task
        0 killed by cmd<br>
        [2020-11-18T10:21:10.000] [110.batch] job 110 completed with
        slurm_rc = 4001, job_rc = -1<br>
        [2020-11-18T10:21:10.000] [110.batch] sending
        REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1<br>
        [2020-11-18T10:21:10.001] [110.batch] debug:  Message thread
        exited<br>
        [2020-11-18T10:21:10.001] [110.batch] done with job<br>
        [2020-11-18T10:21:10.121] debug:  Waiting for job 110's prolog
        to complete<br>
        [2020-11-18T10:21:10.121] debug:  Finished wait for job 110's
        prolog to complete<br>
        [2020-11-18T10:21:10.121] debug:  [job 110] attempting to run
        epilog [/cm/local/apps/cmd/scripts/epilog]<br>
        [2020-11-18T10:21:10.124] debug:  completed epilog for jobid 110<br>
        [2020-11-18T10:21:10.124] debug:  Job 110: sent epilog complete
        msg: rc = 0<br>
      </blockquote>
    </p>
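    <p>For completeness, the timeout behaviour above is governed by the
      unkillable-step settings in slurm.conf; the values and hook path
      below are examples only, not necessarily what we are running:</p>
    <pre>
# slurm.conf (example values only)
UnkillableStepTimeout=120                  # seconds slurmd waits after SIGKILL before
                                           # declaring "Kill task failed" and draining the node
UnkillableStepProgram=/path/to/debug-hook  # hypothetical hook run when a step is deemed unkillable
</pre>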
    <div class="moz-signature">
      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td width="150" valign="top" height="30" align="left">
              <p style="font-size:14px;">Willy Markuske</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">HPC Systems Engineer</p>
            </td>
            <td rowspan="3" width="180" valign="center" height="42"
              align="center"><tt><img moz-do-not-send="false"
                  src="cid:part1.152B416D.527E59F0@sdsc.edu" alt=""
                  width="168" height="48"></tt> </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">Research Data Services</p>
            </td>
          </tr>
          <tr>
            <td style="border-right: 1px solid #000;" align="left">
              <p style="font-size:12px;">P: (858) 246-5593</p>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 11/18/20 9:41 AM, mercan wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:9f9643ff-bf36-facd-d228-7569d43a5301@uhem.itu.edu.tr">Hi;
      <br>
      <br>
      Check the epilog return value, which comes from the return value of
      the last line of the epilog script. Also, you can add an "exit 0"
      line as the last line of the epilog script to ensure a zero return
      value for testing purposes.
      <br>
      <br>
      Ahmet M.
      <br>
      <br>
      <br>
      On 18.11.2020 20:00, William Markuske wrote:
      <br>
      <blockquote type="cite">
        <br>
        Hello,
        <br>
        <br>
        I am having an odd problem where users are unable to kill their
        jobs with scancel. Users can submit jobs just fine, and when a
        task completes it closes correctly. However, if a user attempts
        to cancel a job via scancel, the SIGKILL signals are sent to the
        step but never take effect. Slurmd then continues to send SIGKILL
        requests until the UnkillableTimeout is hit, the Slurm job exits
        with an error, the node enters a draining state, and the spawned
        processes continue to run on the node.
        <br>
        <br>
        I'm at a loss, because jobs can complete without issue, which
        seems to suggest it's not a networking or permissions problem
        preventing Slurm from doing its job accounting tasks. A user can
        ssh to the node once a job is submitted and kill the subprocesses
        manually, at which point Slurm completes the epilog and the node
        returns to idle.
        <br>
        <br>
        Does anyone know what may be causing this behavior? Please let
        me know which slurm.conf or cgroup.conf settings would be
        helpful for diagnosing this issue. I'm quite stumped by this one.
        <br>
        <br>
        -- <br>
        <br>
        Willy Markuske
        <br>
        <br>
        HPC Systems Engineer
        <br>
        <br>
        Research Data Services
        <br>
        <br>
        P: (858) 246-5593
        <br>
        <br>
      </blockquote>
    </blockquote>
  </body>
</html>