[slurm-users] Node can't run simple job when STATUS is up and STATE is idle
Brian Johanson
bjohanso at psc.edu
Tue Jan 21 14:20:48 UTC 2020
On 1/21/2020 12:32 AM, Chris Samuel wrote:
> On 20/1/20 3:00 pm, Dean Schulze wrote:
>
>> There's either a problem with the source code I cloned from github,
>> or there is a problem when the controller runs on Ubuntu 19 and the
>> node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to
>> see if that solves the problem.
>
> I've run the master branch on a Cray XC without issues, and I concur
> with what the others have said and suggest it's worth checking the
> slurmd and slurmctld logs to find out why communications is not right
> between them.
>
And if the logs do not have enough information, run the daemon in the
foreground with increased verbosity (-D keeps slurmd in the foreground,
each -v raises the log level):
slurmd -D -v -v -v
As another said, check whether the connections are available with telnet.
From the server to the client: 'telnet node1 6818' (6818 is the default
slurmd port). Then do the same from the compute node to the server, using
the slurmctld port (6817 by default).
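
If telnet is not installed on the nodes, the same reachability check can be scripted. This is only a rough sketch: the function name port_open is made up for illustration, "controller" is a hypothetical host name, and 6817/6818 are the Slurm defaults, so adjust them if slurm.conf sets SlurmctldPort or SlurmdPort to something else.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Any failure (refused, timed out, unresolvable host name) counts as
    "not reachable" and returns False.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and DNS errors alike.
        return False
```

Run port_open("node1", 6818) on the controller and port_open("controller", 6817) on the compute node. If one direction returns True and the other False, that points at a firewall or routing problem rather than at Slurm itself.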
Are these new host builds? Is there a firewall enabled? Kinda sounds
like a firewall on the client that allows outbound connections (the
initial connection to slurmctld) but not new inbound ones (the slurmctld
ping).
-b