[slurm-users] squeue reports ReqNodeNotAvail but node is available
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Sun Jul 12 13:36:46 UTC 2020
In case your ARP cache is the problem, there is some advice on the Wiki
page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
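
As a rough way to check whether the ARP cache is under pressure on a given
host, the sketch below compares the number of entries in /proc/net/arp with
the kernel's gc_thresh1/2/3 limits. It assumes a Linux host with the standard
/proc layout; the function names and the short result labels are made up for
illustration, and the Wiki page above remains the reference for which values
to set.

```python
from pathlib import Path

# Standard Linux location of the IPv4 neighbour (ARP) table limits.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(arp_table: str) -> int:
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [ln for ln in arp_table.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def arp_pressure(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Classify the table size against the three gc_thresh limits."""
    if entries >= thresh3:
        return "hard-limit"   # table is full; new entries can fail
    if entries >= thresh2:
        return "soft-limit"   # garbage collector trims the table aggressively
    if entries >= thresh1:
        return "gc-active"    # garbage collector may start removing entries
    return "ok"               # below gc_thresh1; nothing is collected


def check_local_host() -> str:
    """Read the live values on this host (Linux only)."""
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    limits = [int((NEIGH_DIR / f"gc_thresh{i}").read_text()) for i in (1, 2, 3)]
    return f"{entries} entries, limits {limits}: {arp_pressure(entries, *limits)}"
```

Calling check_local_host() on the head node and printing the result gives a
quick indication; anything other than "ok" suggests the limits are worth
raising.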
I think there are other possible causes of ReqNodeNotAvail, for example
the node being allocated to other jobs. The output of "scontrol show
node" and "scontrol show job" should reveal more details.
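
When many nodes need checking, the State and Reason fields can be pulled out
of the "scontrol show node" output with a few lines of Python. This is only a
sketch: it assumes the usual multi-line Key=Value layout with Reason= on its
own line, and the helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple


def node_state_and_reason(scontrol_output: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract the State= and Reason= fields from 'scontrol show node' text."""
    state = re.search(r"\bState=(\S+)", scontrol_output)
    # Reason can contain spaces, so take the rest of its line.
    reason = re.search(r"^[ \t]*Reason=(.*)$", scontrol_output, re.MULTILINE)
    return (
        state.group(1) if state else None,
        reason.group(1).strip() if reason else None,
    )


def show_node(name: str) -> str:
    """Run 'scontrol show node <name>' and return its output as text."""
    result = subprocess.run(
        ["scontrol", "show", "node", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, node_state_and_reason(show_node("gpu01")), where gpu01 is a
placeholder node name, returns a (state, reason) pair; a state containing
DRAIN or DOWN together with a non-empty reason would explain why squeue
regards the node as unavailable even though it looks idle.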
/Ole
On 11-07-2020 06:00, mercan wrote:
> Hi Janna;
>
> It sounds like an ARP cache table problem to me. If your Slurm head node
> can reach ~1000 or more network devices (all connected network
> cards, switches etc., even if they are reachable through different ports
> of the server), you need to increase some network settings on the head
> node and on any servers that can reach the same number of network devices:
>
> http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm
>
>
> There is also some advice for big clusters in the Slurm documentation:
>
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
>
> Ahmet M.
>
>
> On 11.07.2020 01:34, Janna Ore Nugent wrote:
>>
>> Hi All,
>>
>> I’ve got an intermittent situation with gpu nodes that sinfo says are
>> available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve
>> cycled the nodes to restart services but it hasn’t helped. Any
>> suggestions for resolving this or digging into it more deeply?