[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

Brian Johanson bjohanso at psc.edu
Tue Jan 21 14:20:48 UTC 2020


On 1/21/2020 12:32 AM, Chris Samuel wrote:
> On 20/1/20 3:00 pm, Dean Schulze wrote:
>
>> There's either a problem with the source code I cloned from github, 
>> or there is a problem when the controller runs on Ubuntu 19 and the 
>> node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to 
>> see if that solves the problem.
>
> I've run the master branch on a Cray XC without issues, and I concur 
> with what the others have said and suggest it's worth checking the 
> slurmd and slurmctld logs to find out why communications is not right 
> between them.
>
And if the logs do not have enough information, run the daemon in the 
foreground with increased verbosity:

slurmd -D -v -v -v

As another said, check whether the connections are available with telnet. 
From server to client, run 'telnet node1 6818' (6818 is the default slurmd 
port), and do the same from the compute node to the server (6817 is the 
default slurmctld port).
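
If telnet is not installed on a minimal build, the same reachability test 
can be sketched in Python. This is only a rough sketch, not part of Slurm: 
port_open is a hypothetical helper name, node1 and ctl-host are placeholder 
hostnames, and 6818/6817 are the default slurmd and slurmctld ports, which 
your slurm.conf may override.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example calls (hostnames are placeholders, ports are the Slurm defaults):
#   on the controller:   port_open("node1", 6818)     # reach slurmd
#   on the compute node: port_open("ctl-host", 6817)  # reach slurmctld
```

A False result in only one direction would be consistent with a firewall 
that blocks new inbound connections on that host, though a daemon that is 
not running gives the same result.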

Are these new host builds?  Is there a firewall enabled?  Kinda sounds 
like a firewall on the client that allows outbound connections (the 
initial connection to slurmctld) but not new inbound ones (the slurmctld 
ping).

-b



