<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-GB" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Never mind, I found the problem. The rebuilt nodes were still listed in the config of my other cluster (running Slurm 19), so that cluster was still sending them status check messages, which they couldn’t respond to. I’ve tidied up that config and the messages
have disappeared.<o:p></o:p></p>
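<p class="MsoNormal">For anyone trying to work out what a number like 8960 refers to, here is a minimal sketch of decoding it. It assumes Slurm packs the protocol version as a 16-bit integer with a release counter in the high byte and a minor revision in the low byte, which is how the protocol version constants in the Slurm source appear to be defined. The function name is hypothetical, and the sketch does not map the counter to a release name.</p>
<pre>
```python
def decode_protocol_version(version: int) -> tuple[int, int]:
    """Split a logged protocol_version into (high byte, low byte).

    Assumption: the value is a 16-bit integer packed as high * 256 + low.
    """
    if version not in range(0x10000):
        raise ValueError(f"not a 16-bit value: {version}")
    return divmod(version, 256)


if __name__ == "__main__":
    # 8960 is the value from the unpack_header error in this thread.
    high, low = decode_protocol_version(8960)
    print(f"protocol_version 8960: high byte {high}, low byte {low}")
```
</pre>
<p class="MsoNormal">For 8960 this gives a high byte of 35 and a low byte of 0. If the high byte differs from the one your own daemons use, the message most likely comes from a daemon or client built from a different Slurm release, and the bracketed host in the slurm_receive_msg_and_forward line appears to name the sender.</p>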
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="mso-fareast-language:EN-GB">From:</span></b><span lang="EN-US" style="mso-fareast-language:EN-GB"> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt;
<b>On Behalf Of </b>Mark Holliman<br>
<b>Sent:</b> 29 November 2022 11:53<br>
<b>To:</b> Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> [slurm-users] protocol_version 8960 not supported<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Hello,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’ve just finished building and installing Slurm 22.05.6 from source on a head node and a couple of workers. I installed the same RPMs on all the nodes, and the slurmdbd, slurmctld, and slurmd daemons have all come online and appear healthy
(test jobs can be submitted to partitions and run successfully on the nodes). But I’m seeing these errors at regular intervals in the Slurm logs:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">[2022-11-29T11:29:49.683] error: unpack_header: protocol_version 8960 not supported<o:p></o:p></p>
<p class="MsoNormal">[2022-11-29T11:29:49.683] error: unpacking header<o:p></o:p></p>
<p class="MsoNormal">[2022-11-29T11:29:49.683] error: destroy_forward: no init<o:p></o:p></p>
<p class="MsoNormal">[2022-11-29T11:29:49.684] error: slurm_receive_msg_and_forward: [[sdc-uk]:53026] failed: Message receive failure<o:p></o:p></p>
<p class="MsoNormal">[2022-11-29T11:29:49.694] error: service_connection: slurm_receive_msg: Message receive failure<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">My slurm.conf is based on my previous (still existing) cluster config, and I’ve already encountered one or two issues with plugins not working. I can’t find anything online listing the Slurm protocol_version numbers to check what is causing
this error, though I’m assuming it’s plugin related (slurmdbd maybe?). Turning up the debug level on the Slurm logs doesn’t help in finding the issue. Does anyone here know what protocol_version 8960 relates to?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Relevant slurm.conf lines are:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">MpiDefault=none<o:p></o:p></p>
<p class="MsoNormal">ProctrackType=proctrack/pgid<o:p></o:p></p>
<p class="MsoNormal">ReturnToService=2<o:p></o:p></p>
<p class="MsoNormal">SlurmUser=slurm<o:p></o:p></p>
<p class="MsoNormal">StateSaveLocation=/var/spool/slurm/slurmctld<o:p></o:p></p>
<p class="MsoNormal">SwitchType=switch/none<o:p></o:p></p>
<p class="MsoNormal">TaskPlugin=task/affinity,task/cgroup<o:p></o:p></p>
<p class="MsoNormal"># Job cleanup<o:p></o:p></p>
<p class="MsoNormal">Epilog=/etc/slurm/slurm.epilog.clean<o:p></o:p></p>
<p class="MsoNormal">UnkillableStepTimeout=120<o:p></o:p></p>
<p class="MsoNormal">UnkillableStepProgram=/root/unkillableJobStepScript.sh<o:p></o:p></p>
<p class="MsoNormal"># SCHEDULING<o:p></o:p></p>
<p class="MsoNormal">#FastSchedule=0<o:p></o:p></p>
<p class="MsoNormal">SchedulerType=sched/backfill<o:p></o:p></p>
<p class="MsoNormal">SchedulerParameters=nohold_on_prolog_fail<o:p></o:p></p>
<p class="MsoNormal">SelectType=select/cons_res<o:p></o:p></p>
<p class="MsoNormal">SelectTypeParameters=CR_Core_Memory<o:p></o:p></p>
<p class="MsoNormal">PriorityType=priority/multifactor<o:p></o:p></p>
<p class="MsoNormal">PriorityWeightPartition=1000<o:p></o:p></p>
<p class="MsoNormal">PreemptMode=SUSPEND,GANG<o:p></o:p></p>
<p class="MsoNormal">PreemptType=preempt/partition_prio<o:p></o:p></p>
<p class="MsoNormal"># LOGGING AND ACCOUNTING<o:p></o:p></p>
<p class="MsoNormal">AccountingStorageType=accounting_storage/slurmdbd<o:p></o:p></p>
<p class="MsoNormal">JobCompType=jobcomp/none<o:p></o:p></p>
<p class="MsoNormal">JobAcctGatherFrequency=40<o:p></o:p></p>
<p class="MsoNormal">JobAcctGatherType=jobacct_gather/linux<o:p></o:p></p>
<p class="MsoNormal">SlurmctldDebug=5<o:p></o:p></p>
<p class="MsoNormal">SlurmctldLogFile=/var/log/slurm/slurmctld.log<o:p></o:p></p>
<p class="MsoNormal">SlurmdDebug=5<o:p></o:p></p>
<p class="MsoNormal">SlurmdLogFile=/var/log/slurm/slurmd.log<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">Cheers,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB"> Mark<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">-------------------------------<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">Mark Holliman<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">Senior Data Systems Specialist<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">Wide Field Astronomy Unit<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">Institute for Astronomy<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">University of Edinburgh<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">--------------------------------<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-GB">The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.</span><o:p></o:p></p>
</div>
</body>
</html>