[slurm-users] errors requesting gpus

Tue Oct 26 15:10:19 UTC 2021

Hi,

I'm setting up a Slurm cluster in which a subset of the compute nodes have GPUs. My slurm.conf contains, among other lines:

[...]
GresTypes=gpu
[...]
Include /etc/slurm/slurm.conf.d/allnodes
[...]

and the /etc/slurm/slurm.conf.d/allnodes file included above has the line

NodeName=gpu1601 CPUs=12 RealMemory=257840 Gres=gpu:gtx1080:4

On the host gpu1601, the file /etc/slurm/gres.conf contains

NodeName=gpu1601 Name=gpu Type=gtx1080 File=/dev/nvidia[0-3]
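
As a sanity check that the Gres count in the node definition agrees with the device files listed in gres.conf, here is a minimal sketch in Python. The helper names expand_range and gres_count are my own, not part of Slurm, and the range expansion only handles the simple [lo-hi] form used above:

```python
import re

def expand_range(spec):
    """Expand a simple bracket range such as /dev/nvidia[0-3] (hypothetical helper)."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if not m:
        return [spec]  # no bracket range: treat as a single device file
    prefix, lo, hi, suffix = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]

def gres_count(gres):
    """Return the trailing count of a name:type:count Gres value (hypothetical helper)."""
    return int(gres.rsplit(":", 1)[1])

devices = expand_range("/dev/nvidia[0-3]")   # File= value from gres.conf
count = gres_count("gpu:gtx1080:4")          # Gres= value from the node definition
print(devices)
print("counts match:", len(devices) == count)
```

With the two lines quoted above, the range expands to four device files, which matches the count of 4, so the mismatch does not appear to be there.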

However, when I try to srun something that requests 1 GPU, I get:

srun: error: gres_plugin_job_state_unpack: no plugin configured to unpack data type 7696487 from job 22. This is likely due to a difference in the GresTypes configured in slurm.conf on different cluster nodes.
srun: gres_plugin_step_state_unpack: no plugin configured to unpack data type 7696487 from StepId=22.0
srun: error: fwd_tree_thread: can't find address for host gpu1601, check slurm.conf
srun: error: Task launch for StepId=22.0 failed on node gpu1601: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

I'm not sure whether the relevant error is the "no plugin configured" part or the "Can't find an address" part. "gpu1601" is pingable from both the submit host and the controller host. The slurm daemons seem to be running without errors.
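
One observation on the "data type 7696487" number: if I assume the ID is just the GRES name packed one byte per character, low byte first (that is my reading, not something the error message states), it decodes to "gpu". A minimal sketch in Python, with function names of my own choosing:

```python
def decode_gres_id(plugin_id):
    """Unpack an integer into characters, lowest byte first (assumed encoding)."""
    chars = []
    while plugin_id:
        chars.append(chr(plugin_id & 0xFF))
        plugin_id >>= 8
    return "".join(chars)

def encode_gres_name(name):
    """Inverse of the above: add each character shifted by 8 bits per position."""
    value = 0
    for i, ch in enumerate(name):
        value += ord(ch) << (8 * (i % 4))  # wrap the shift every 4 bytes (assumption)
    return value & 0xFFFFFFFF

print(decode_gres_id(7696487))   # gpu
print(encode_gres_name("gpu"))   # 7696487
```

If that reading is right, the first two messages say that srun on the submit host has no "gpu" GRES type configured, which would fit the hint in the error text about GresTypes differing between hosts.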

Am I missing something stupidly obvious?

Thanks,
~~ bnacar

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621