<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="text-align:left; direction:ltr;">
<div>Oh, and I forgot to mention that we are using Slurm version 20.11.3.</div>
<div>Best,</div>
<div><br>
</div>
<div>Thomas</div>
<div><br>
</div>
<div>On Wed, 14 Apr 2021 at 09:23 +0200, Thomas Arildsen wrote:</div>
<blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex">
<pre>I administer a Slurm cluster with many users. The cluster currently</pre>
<pre>appears to operate normally for all users except one. Every command</pre>
<pre>this user runs through Slurm is killed after 20-25 seconds (I suspect</pre>
<pre>the cause is another job rather than the elapsed time; see further</pre>
<pre>down).</pre>
<pre>The following minimal example reproduces the error:</pre>
<pre><br></pre>
<pre> $ sudo -u &lt;the_user&gt; srun --pty sleep 25</pre>
<pre> srun: job 110962 queued and waiting for resources</pre>
<pre> srun: job 110962 has been allocated resources</pre>
<pre> srun: Force Terminated job 110962</pre>
<pre> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.</pre>
<pre> slurmstepd: error: *** STEP 110962.0 ON &lt;node&gt; CANCELLED AT 2021-04-09T16:33:35 ***</pre>
<pre> srun: error: &lt;node&gt;: task 0: Terminated</pre>
<pre><br></pre>
<pre>When this happens, I find this line in the slurmctld log:</pre>
<pre><br></pre>
<pre> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid &lt;the_users_uid&gt;</pre>
<pre><br></pre>
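<pre>As a rough sketch of how to pull the requesting uid out of the log for</pre>
<pre>a given job, assuming the line format quoted above with a numeric uid</pre>
<pre>(the function name kill_request_uids is made up for illustration):</pre>
<pre><br></pre>
<pre>
```shell
# Hypothetical helper: print the uid from every REQUEST_KILL_JOB line that
# refers to the given job id. Reads slurmctld log text on stdin.
# Assumes the log line format "... REQUEST_KILL_JOB JobId=N uid U".
kill_request_uids() {
    jobid="$1"
    sed -n "s/.*REQUEST_KILL_JOB JobId=${jobid} uid \([0-9][0-9]*\).*/\1/p"
}

# Example (log path as used on this cluster; adjust as needed):
#   sudo cat /var/log/slurm/slurmctld.log | kill_request_uids 110962
```
</pre>
<pre><br></pre>
<pre>The result can then be compared with the output of 'id -u' for the</pre>
<pre>affected user to see whether the kill request was issued under that</pre>
<pre>user's own uid.</pre>
<pre><br></pre>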
<pre>It happens only for '&lt;the_user&gt;' and, as far as I know, for no other</pre>
<pre>user. A very similar but shorter-running example works fine:</pre>
<pre><br></pre>
<pre> $ sudo -u &lt;the_user&gt; srun --pty sleep 20</pre>
<pre> srun: job 110963 queued and waiting for resources</pre>
<pre> srun: job 110963 has been allocated resources</pre>
<pre><br></pre>
<pre>Note that when I run 'srun --pty sleep 20' as myself, srun does not</pre>
<pre>print the two 'srun: job ...' lines. To me, this is a further</pre>
<pre>indication that srun is subject to different settings for</pre>
<pre>'&lt;the_user&gt;'.</pre>
<pre>All settings that I have been able to inspect are identical for</pre>
<pre>'&lt;the_user&gt;' and for other users. I have checked that 'MaxWall' is</pre>
<pre>not set for this user or for any other user. Other users belonging to</pre>
<pre>the same Slurm account do not experience this problem.</pre>
<pre><br></pre>
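<pre>One more place where per-user srun settings could hide is the</pre>
<pre>environment, since srun reads some of its options from SLURM_*</pre>
<pre>variables. A minimal sketch for listing them (the helper name</pre>
<pre>slurm_env_vars and the exact prefix list are assumptions):</pre>
<pre><br></pre>
<pre>
```shell
# Hypothetical helper: keep only Slurm-related variables from 'env' output.
# Reads NAME=value lines on stdin and prints the matching ones, sorted.
# The prefix list is an assumption and may need extending.
slurm_env_vars() {
    grep -E '^(SLURM|SBATCH|SALLOC)_[A-Za-z0-9_]*=' | sort
}

# Example: list the variables a login shell of the affected user would see
# (replace the_user with the real account name):
#   sudo -u the_user -i env | slurm_env_vars
```
</pre>
<pre><br></pre>
<pre>Running the same pipeline for an unaffected user and comparing the two</pre>
<pre>listings would show whether any such variable differs.</pre>
<pre><br></pre>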
<pre>When this unfortunate user's jobs get allocated, I see messages like</pre>
<pre>this in '/var/log/slurm/slurmctld.log':</pre>
<pre><br></pre>
<pre> sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=&lt;node&gt;</pre>
<pre><br></pre>
<pre>and shortly after, I see this message:</pre>
<pre><br></pre>
<pre> select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0</pre>
<pre><br></pre>
<pre>Job 110722_* is an array job by another user that is pending due to</pre>
<pre>'QOSMaxGRESPerUser'. One pending task of this array job (110722_57)</pre>
<pre>eventually ends up taking over job 111855's CPU cores when 111855 gets</pre>
<pre>killed. This leads me to believe that 110722_57 causes 111855 to be</pre>
<pre>killed. However, 110722_57 remains pending afterwards.</pre>
<pre>Some of the things I fail to understand here are:</pre>
<pre> - Why does a pending job kill another job, yet remain pending</pre>
<pre>afterwards?</pre>
<pre> - Why does the pending job even have the privileges to kill another</pre>
<pre>job in the first place?</pre>
<pre> - Why does this affect only '&lt;the_user&gt;'s jobs and not those of</pre>
<pre>other users?</pre>
<pre><br></pre>
<pre>None of this is intended to happen. I am guessing it must be caused by</pre>
<pre>some settings specific to '&lt;the_user&gt;', but I cannot figure out what</pre>
<pre>they are. If these are settings we admins somehow caused, it was</pre>
<pre>unintended.</pre>
<pre><br></pre>
<pre>NB: some details have been anonymized as &lt;something&gt; above.</pre>
<pre><br></pre>
<pre>I hope someone has a clue what is going on here. Thanks in advance,</pre>
<pre><br></pre>
<pre>Thomas Arildsen</pre>
</blockquote>
</body>
</html>