[slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

Tue Apr 7 20:55:36 UTC 2020

OK, when I restart slurmd on the nodes I get these errors:

Apr 07 16:52:33 node001 systemd[1]: Starting Slurm node daemon...
Apr 07 16:52:33 node001 slurmd[299181]: Message aggregation disabled
Apr 07 16:52:33 node001 slurmd[299181]: WARNING: A line in gres.conf for GRES mps has 400 more configured than expected in slurm.conf. Ignoring extra GRES.
Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service: control process exited, code=exited status=1
Apr 07 16:52:33 node001 systemd[1]: Failed to start Slurm node daemon.
Apr 07 16:52:33 node001 systemd[1]: Unit slurmd.service entered failed state.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service failed.

Apr 07 16:43:27 node002 slurmd[273406]: error: GresPlugins changed from gpu,mic to gpu,mic,mps ignored
Apr 07 16:43:27 node002 slurmd[273406]: error: Restart the slurmctld daemon to change GresPlugins
Apr 07 16:43:27 node002 slurmd[273406]: error: Ignoring gres.conf record, invalid name: mps
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: select plugin cons_tres not found
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: unpack error
Apr 07 16:44:06 node002 slurmd[273406]: error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
Apr 07 16:44:06 node002 slurmd[273406]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
Apr 07 16:44:06 node002 slurmd[273406]: error: service_connection: slurm_receive_msg: Header lengths are longer than dat...ceived

So the "WARNING: A line in gres.conf for GRES mps has 400 more configured
than expected in slurm.conf" message must come from the Name=mps entry in
gres.conf:
NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
Name=mps Count=400
AutoDetect=nvml

Perhaps I'm misunderstanding the Count option?
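
A minimal sketch of one reading of that warning, assuming it means the
total Count in gres.conf exceeds what the node's Gres= value in slurm.conf
declares. This is plain Python, not part of Slurm; the function names and
the Gres= strings passed in are hypothetical (the slurm.conf node line is
not shown in this thread), and Count values are assumed to be plain
integers with no K/M suffixes.

```python
import re

def parse_pairs(line):
    """Split a 'Key=Value Key=Value' config line into a dict (lowercase keys)."""
    line = line.split("#", 1)[0]  # drop comments
    return {k.lower(): v for k, v in re.findall(r"(\w+)=(\S+)", line)}

def gres_conf_counts(text):
    """Total count per GRES name declared in gres.conf-style text."""
    totals = {}
    for line in text.splitlines():
        pairs = parse_pairs(line)
        name = pairs.get("name")
        if name is None:
            continue  # e.g. AutoDetect=nvml carries no Name=
        # Simplification: a line without Count= is treated as one device.
        count = int(pairs.get("count", "1"))
        totals[name] = totals.get(name, 0) + count
    return totals

def slurm_conf_counts(gres_value):
    """Count per GRES name from a Gres= value such as 'gpu:v100:1,mps:400'."""
    totals = {}
    for item in gres_value.split(","):
        parts = item.split(":")
        has_count = len(parts) > 1 and parts[-1].isdigit()
        count = int(parts[-1]) if has_count else 1
        totals[parts[0]] = totals.get(parts[0], 0) + count
    return totals

def extra_in_gres_conf(gres_conf_text, slurm_gres_value):
    """GRES names whose gres.conf total exceeds the slurm.conf total."""
    declared = gres_conf_counts(gres_conf_text)
    expected = slurm_conf_counts(slurm_gres_value)
    return {name: count - expected.get(name, 0)
            for name, count in declared.items()
            if count > expected.get(name, 0)}

# The gres.conf lines quoted above.
GRES_CONF = """\
NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
Name=mps Count=400
AutoDetect=nvml
"""

# Hypothetical slurm.conf Gres= values, for illustration only.
print(extra_in_gres_conf(GRES_CONF, "gpu:v100:1"))
print(extra_in_gres_conf(GRES_CONF, "gpu:v100:1,mps:400"))
```

Under that reading, a node line that lists only the GPU would leave all 400
mps units unaccounted for, while a node line that also lists mps:400 would
not. Whether that is what slurmd actually checks should be confirmed
against the gres.conf documentation.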

On Tue, Apr 7, 2020 at 4:34 PM Davide Vanzo <Davide.Vanzo at utsouthwestern.edu>
wrote:

> Robert,
>
>
>
> That error is typically due to slurmd/slurmctld version mismatch or
> different configuration. I would not be surprised if you need to restart
> slurmd too after changing the SelectType configuration.
>
> Also, do not forget this warning from the documentation when it comes to
> modifying SelectType:
>
>
>
> *Changing this value can only be done by restarting the slurmctld daemon
> and will result in the loss of all job information (running and pending)
> since the job state save format used by each plugin is different.*
>
>
>
> --
>
> *Davide Vanzo, PhD*
>
> *Computer Scientist*
>
> BioHPC – Lyda Hill Dept. of Bioinformatics
>
> UT Southwestern Medical Center
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
> Of* Robert Kudyba
> *Sent:* Tuesday, April 7, 2020 3:26 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] Header lengths are longer than data received
> after changing SelectType & GresTypes to use MPS
>
>
> Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the
> following options to enable MPS:
> SelectType=select/cons_tres
> GresTypes=gpu,mic,mps
>
> I restarted slurmctld and ran scontrol reconfigure; however, all jobs get
> the error below:
> [2020-04-07T15:29:00.741] debug:  backfill: no jobs to backfill
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3056 Nodelist=node[001-002]
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3061 Nodelist=node003
> [2020-04-07T15:29:03.051] debug:  sched: Running job scheduler
> [2020-04-07T15:29:03.063] agent/is_node_resp: node:node003 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node002 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node001 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
>
> Do any other options need changing? What causes these header length
> errors?