<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p
{mso-style-priority:99;
margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle19
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle20
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body bgcolor="white" lang="EN-CA" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Sorry everybody, I made a few mistakes concerning the Slurm versions in the summary of what’s working and what’s not. Here is the corrected text:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.8 (of course without CommunicationParameters=block_null_hash): working for a long time<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">21.08.8-2 with CommunicationParameters=block_null_hash: intermittent problems<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">21.08.8-2 without CommunicationParameters=block_null_hash: intermittent problems<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.9 without CommunicationParameters=block_null_hash: working<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.9 with CommunicationParameters=block_null_hash: working<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">So, judging from the symptoms I described in my previous message, the intermittent problems really look like buggy behavior (a race condition
or a communication problem). And from the results above, this bug looks like a *<b>regression</b>* from 20.11.[8-9] to 21.08.8-2.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Sorry for the confusion.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Martin<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif"> slurm-users [mailto:slurm-users-bounces@lists.schedmd.com]
<b>On Behalf Of </b>Audet, Martin<br>
<b>Sent:</b> June 3, 2022 10:16<br>
<b>To:</b> Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> Re: [slurm-users] slurmctld loose connection with slurmd for no reason after upgrading from 20.11.8 to 21.08.8-2<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Hello Slurm user community,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">I would like to share my experience with the updates I did following the security fixes published last month (May 4<sup>th</sup>),
as it may help other users (and hopefully attract the attention of the responsible developers).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">As I explained in my previous message, we had been running Slurm 20.11.8 since July 2021 without any problems.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">After the security patches were published, we decided to update at the same time to the latest version (at that time), that is, 21.08.8-2.
We then ran into problems with jobs composed of a large number of steps (e.g. 283). We tried modifying the configuration, without success. We then decided to go back to an earlier version (one that is not vulnerable, of course), 20.11.9, using a conservative configuration,
and it worked! We then modified the configuration to use a parameter recommended for security, and it continued working.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Here is a summary of what’s working and what’s not working:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.7 (of course without CommunicationParameters=block_null_hash): working<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">21.08.8-2 with CommunicationParameters=block_null_hash: intermittent problems<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">21.08.8-2 without CommunicationParameters=block_null_hash: intermittent problems<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.8 without CommunicationParameters=block_null_hash: working<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:Consolas;color:#1F497D;mso-fareast-language:EN-US">20.11.8 with CommunicationParameters=block_null_hash: working<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">So, judging from the symptoms I described in my previous message, the intermittent problems really look like buggy behavior (a race condition
or a communication problem). And from the results above, this bug looks like a *<b>regression</b>* from 20.11.[7-8] to 21.08.8-2.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">To Slurm developers:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">I know we are not paying for maintenance. I would like to see my organisation support a useful and good
open source project such as Slurm instead of buying yet another Windows-centered commercial software package. But with our small 24-node cluster, the support fees would be too high for our operational budget. If I were you, I would still try to investigate this problem,
as it is of general interest: it may uncover a real regression bug affecting many users (both with and without a support contract).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US">Martin Audet <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif"> slurm-users [<a href="mailto:slurm-users-bounces@lists.schedmd.com">mailto:slurm-users-bounces@lists.schedmd.com</a>]
<b>On Behalf Of </b>Audet, Martin<br>
<b>Sent:</b> May 20, 2022 8:53<br>
<b>To:</b> <a href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a><br>
<b>Subject:</b> [slurm-users] slurmctld loose connection with slurmd for no reason after upgrading from 20.11.8 to 21.08.8-2<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Hello Slurm community,<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">It seems that since we upgraded Slurm from 20.11.8 to 21.08.8-2, and since a user started a special type of job with many short steps (e.g. 1000) lasting from one to five minutes
each, slurmctld loses contact after many steps (283 in this case) with the slurmd daemons running on the compute nodes for this job, wrongly thinking that the nodes were resuming and had exceeded their ResumeTimeout of 300s (this is not possible, since the job had been
running for almost 24 hours on the same four nodes and no node "resumed"). Slurm then decides to mark the nodes as "down" and "power_save", and immediately afterwards it notices that the four nodes used by the job are "now responding".<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">But by then it is too late! The decision has been taken: the job will soon be rescheduled on another set of nodes, and the old nodes are marked "down~".<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Note that on the compute nodes themselves everything appears normal. The only message I see is "credential for job 47647 revoked". The slurmd daemon is running fine according to
systemctl, and I have seen nothing in /var/log/messages indicating an error (e.g. a communication error) on the head node or the compute nodes. Communication between slurmctld and slurmd goes over IPoIB interfaces (rather than GbE), and our InfiniBand network seems fine.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">In the last three days this problem has happened three times on our small 24-node cluster.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Am I the only one with this problem?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">In my opinion, this intermittent problem really looks like a race condition or a communication bug.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Note that slurm.conf contains:<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> CommunicationParameters=block_null_hash<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> MaxStepCount=100000000<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> SlurmctldDebug=debug<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> SlurmdDebug=debug<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Also, the update was done cleanly by stopping all jobs using a maintenance reservation. The daemons were stopped and disabled. The RPMs were built from source. The old ones (20.11.8)
were removed and the new ones (21.08.8-2) installed. Finally, the daemons were started and enabled.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Job 47647 was allocated four nodes: cn[9-12]. Each step is short and uses all allocated CPUs, and the steps run one after the other.<o:p></o:p></span></p>
</div>
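<div>
<p class="MsoNormal"><span style="font-family:Calibri,sans-serif;color:black">For readers who want to try a similar workload, a job of this shape could be driven roughly as in the following Python sketch, run inside the job allocation (for example from the batch script). This is not the user's actual job: the executable name ./short_step, the function names and the stop-on-first-failure policy are all assumptions made for illustration.<o:p></o:p></span></p>
</div>
<pre>
```python
import subprocess
import sys


def build_step_command(step_index, program="./short_step"):
    """Build the srun command line for one job step.

    './short_step' is a hypothetical executable standing in for the real
    workload; it receives the step index as its only argument.
    """
    return ["srun", program, str(step_index)]


def run_steps(step_count, runner=subprocess.run):
    """Run step_count steps one after the other.

    Stops at the first step that exits with a non-zero code and returns
    the number of steps that completed successfully.
    """
    for step_index in range(step_count):
        result = runner(build_step_command(step_index))
        if result.returncode != 0:
            print(f"step {step_index} failed with code {result.returncode}",
                  file=sys.stderr)
            return step_index
    return step_count
```
</pre>
<div>
<p class="MsoNormal"><span style="font-family:Calibri,sans-serif;color:black">Calling run_steps(1000) from within the allocation would launch 1000 sequential steps; the runner argument exists only so the loop can be exercised without a cluster.<o:p></o:p></span></p>
</div>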
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Thanks in advance,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Martin Audet<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Here are the messages I had from slurmctld:<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"> <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn9 not resumed by ResumeTimeout(300) - marking down and power_save<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: requeue job JobId=47647 due to failure of node cn9<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Requeuing JobId=47647<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn10 not resumed by ResumeTimeout(300) - marking down and power_save<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn11 not resumed by ResumeTimeout(300) - marking down and power_save<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node cn12 not resumed by ResumeTimeout(300) - marking down and power_save<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: sched: Updated reservation=limite users=root nodes=cn[1-24] cores=576 licenses=(null) tres=cpu=1152 watts=4294967294
start=2022-06-30T23:59:00 end=2022-07-01T00:00:00 MaxStartDelay=<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn10 now responding<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn11 now responding<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn12 now responding<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node cn9 now responding<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 WTERMSIG 15<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: _job_complete: JobId=47647 cancelled by interactive user<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: job_step_signal: JobId=47647 is in state PENDING, cannot signal steps<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: step_partial_comp: JobId=47647 pending<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug: job_epilog_complete: JobId=47647 complete response from DOWN node cn12<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug: job_epilog_complete: JobId=47647 complete response from DOWN node cn10<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug: job_epilog_complete: JobId=47647 complete response from DOWN node cn11<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: debug: job_epilog_complete: JobId=47647 complete response from DOWN node cn9<o:p></o:p></span></p>
</div>
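<div>
<p class="MsoNormal"><span style="font-family:Calibri,sans-serif;color:black">The telltale pattern in these messages is a node marked down by ResumeTimeout and then reported as "now responding" almost immediately. A minimal Python sketch for spotting that pattern in a longer log follows. It assumes syslog-style lines like the ones above; the function names, the 5-second window and the default year (syslog timestamps carry none) are illustrative assumptions, not part of Slurm.<o:p></o:p></span></p>
</div>
<pre>
```python
import re
from datetime import datetime, timedelta

# Patterns taken from the slurmctld messages quoted above.
DOWN_RE = re.compile(r"node (\S+) not resumed by ResumeTimeout\((\d+)\)")
UP_RE = re.compile(r"Node (\S+) now responding")


def parse_time(line, year=2022):
    """Parse the leading syslog timestamp, e.g. 'May 19 11:13:08'.

    The year is an assumption because syslog lines do not include it.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_spurious_downs(lines, window_seconds=5):
    """Return (node, down_time, up_time) for every node that was marked
    down and reported 'now responding' within window_seconds afterwards."""
    window = timedelta(seconds=window_seconds)
    marked_down = {}  # node name -> time it was marked down
    hits = []
    for line in lines:
        down = DOWN_RE.search(line)
        if down:
            marked_down[down.group(1)] = parse_time(line)
            continue
        up = UP_RE.search(line)
        if up and up.group(1) in marked_down:
            node = up.group(1)
            down_time = marked_down.pop(node)
            up_time = parse_time(line)
            if timedelta(0) <= up_time - down_time <= window:
                hits.append((node, down_time, up_time))
    return hits
```
</pre>
<div>
<p class="MsoNormal"><span style="font-family:Calibri,sans-serif;color:black">Passing an open log file (or any iterable of lines) to find_spurious_downs returns one tuple per node that went down and came back within the window.<o:p></o:p></span></p>
</div>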
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black">Here are the messages I had from slurmd on cn9 (the other 3 show similar messages):<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: launch task StepId=47647.283 request from UID:171971265 GID:171971265 HOST:10.2.1.109 PORT:43544<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: Checking credential with 572 bytes of sig data<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: lllp_distribution: JobId=47647 binding: cores, dist 2<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [47647]: mask_cpu, 0x000001000001,0x001000001000,0x000002000002,0x002000002000,0x000004000004,0x004000004000,0x000008000008,0x008000008000,0x000010000010,0x010000010000,0x000020000020,0x020000020000,0x000040000040,0x040000040000,0x000080000080,0x080000080000,0x000100000100,0x100000100000,0x000200000200,0x200000200000,0x000400000400,0x400000400000,0x000800000800,0x800000800000<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: Waiting for job 47647's prolog to complete<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: Finished wait for job 47647's prolog to complete<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: _rpc_terminate_job: uid = 1001 JobId=47647<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: credential for job 47647 revoked<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: Waiting for job 47647's prolog to complete<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: Finished wait for job 47647's prolog to complete<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: completed epilog for jobid 47647<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:Consolas;color:black"> May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug: JobId=47647: sent epilog complete msg: rc = 0<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<p><span style="font-family:"Calibri",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
</body>
</html>