[slurm-users] squeue reports ReqNodeNotAvail but node is available
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Jul 13 06:20:57 UTC 2020
Hi Janna,
If you're running an old Slurm version, you may be hitting bugs that have
already been fixed in later versions. You can search for bugs with
ReqNodeNotAvail in the title:
https://bugs.schedmd.com/buglist.cgi?quicksearch=ReqNodeNotAvail
For example, this one might be relevant:
https://bugs.schedmd.com/show_bug.cgi?id=9257
Upgrading to Slurm 20.02 is highly recommended.
/Ole
On 7/12/20 3:36 PM, Ole Holm Nielsen wrote:
> In case your ARP cache is the problem, there is some advice on this Wiki page:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
>
>
> I think there are other possible causes of ReqNodeNotAvail, for example
> the node being allocated to other jobs. Running "scontrol show node" and
> "scontrol show job" should reveal more details.
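
As a rough illustration of the "scontrol show node" suggestion above, the following Python sketch pulls the State and Reason fields out of that command's output. The helper name parse_node_state and the sample text are hypothetical and made up for this example; real output has many more fields and may differ between Slurm versions.

```python
import re

def parse_node_state(text):
    """Extract the State and Reason fields from `scontrol show node` text.

    Returns None for a field that is not present in the text.
    """
    # State values contain no spaces (e.g. IDLE, IDLE+DRAIN, ALLOCATED).
    state = re.search(r"\bState=(\S+)", text)
    # Reason may contain spaces, so capture up to the end of the line.
    reason = re.search(r"\bReason=(.*)", text)
    return {
        "State": state.group(1) if state else None,
        "Reason": reason.group(1).strip() if reason else None,
    }

# Hypothetical, heavily shortened sample output (not from a real cluster).
SAMPLE = """NodeName=gpu01 Arch=x86_64 CoresPerSocket=8
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=testing [root@2020-07-10T12:00:00]
"""

info = parse_node_state(SAMPLE)
print(info["State"], "|", info["Reason"])
```

On a real cluster the text could come from subprocess.run(["scontrol", "show", "node", "gpu01"], capture_output=True, text=True).stdout, with gpu01 replaced by the actual node name. A state other than plain IDLE, or a non-empty Reason, may hint at why the scheduler considers the node unavailable.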
>
> /Ole
>
>
> On 11-07-2020 06:00, mercan wrote:
>> Hi Janna;
>>
>> It sounds like an ARP cache table problem to me. If your Slurm head node
>> can reach ~1000 or more network devices (all connected network cards,
>> switches, etc., even if they are reachable through different ports of
>> the server), you need to increase some network settings on the head node
>> and on any servers that can reach the same number of network devices:
>>
>> http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm
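
To check whether the ARP (neighbour) table is actually close to its limits, a small Python sketch like the one below could be run on the head node. It assumes a Linux system that exposes /proc/net/arp and the gc_thresh1, gc_thresh2 and gc_thresh3 settings under /proc/sys/net/ipv4/neigh/default. The function names and the sample numbers are illustrative only and are not taken from the linked pages.

```python
from pathlib import Path

# Assumed location of the IPv4 neighbour-table settings on Linux.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
THRESHOLD_NAMES = ("gc_thresh1", "gc_thresh2", "gc_thresh3")

def count_arp_entries(arp_table_text):
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [ln for ln in arp_table_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)

def read_thresholds(neigh_dir=NEIGH_DIR):
    """Read the three garbage-collection thresholds as integers."""
    return {name: int((neigh_dir / name).read_text()) for name in THRESHOLD_NAMES}

def arp_pressure(entries, thresholds):
    """Return the names of the thresholds that the entry count meets or exceeds."""
    return [name for name, limit in sorted(thresholds.items()) if entries >= limit]

# Illustrative numbers only: 600 entries against sample thresholds.
sample_thresholds = {"gc_thresh1": 128, "gc_thresh2": 512, "gc_thresh3": 1024}
print(arp_pressure(600, sample_thresholds))
```

On a live system the inputs would be count_arp_entries(Path("/proc/net/arp").read_text()) and read_thresholds(). If the entry count approaches gc_thresh3, raising the thresholds as described in the linked pages may be worth trying.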
>>
>>
>> There is also some advice for big clusters in the Slurm documentation:
>>
>> https://slurm.schedmd.com/big_sys.html
>>
>> Regards,
>>
>> Ahmet M.
>>
>>
>> On 11.07.2020 01:34, Janna Ore Nugent wrote:
>>>
>>> Hi All,
>>>
>>> I’ve got an intermittent situation with gpu nodes that sinfo says are
>>> available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve
>>> cycled the nodes to restart services but it hasn’t helped. Any
>>> suggestions for resolving this or digging into it more deeply?