[slurm-users] errors requesting gpus

Thu Oct 28 18:44:21 UTC 2021

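Found my problem. I had synced the /etc/slurm/* files to all controllers 
and compute hosts, but not to the submit host. srun on the submit host 
was therefore reading a stale slurm.conf, which presumably lacked both 
the GresTypes=gpu line and the gpu1601 node definition; that would 
account for both the "no plugin configured" and the "can't find address" 
errors. Making a note of it here in case it helps anyone else.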
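
For anyone who wants to catch this kind of drift early, here is a minimal
sketch of a consistency check. It assumes you have already collected the
text of the same config file from each host by whatever means you use
(scp, ssh, config management). The function names and the hostnames
"ctl1" and "submit1" are made up for illustration. Comments and blank
lines are ignored, so only meaningful differences count.

```python
import hashlib


def config_digest(text: str) -> str:
    """Hash a config file's text, ignoring '#' comments and blank lines."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def find_stale_hosts(configs: dict[str, str], reference: str) -> list[str]:
    """Return hosts whose config differs from the reference host's copy."""
    want = config_digest(configs[reference])
    return sorted(h for h, t in configs.items() if config_digest(t) != want)


if __name__ == "__main__":
    # Hypothetical file contents, keyed by hypothetical hostnames.
    good = "GresTypes=gpu\nNodeName=gpu1601 Gres=gpu:gtx1080:4\n"
    configs = {
        "ctl1": good,
        "gpu1601": "# synced copy\n" + good,
        "submit1": "NodeName=gpu1601\n",  # stale: no GresTypes line
    }
    print(find_stale_hosts(configs, reference="ctl1"))
```

Since slurm.conf here pulls in files via Include, the same check would
need to be run on each included file (or on all of them concatenated in
a fixed order), not only on slurm.conf itself.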

~~ bnacar

On 10/26/21 11:10 AM, Benjamin Nacar wrote:
> Hi,
> 
> I'm setting up a Slurm cluster in which a subset of the compute nodes will have GPUs. My slurm.conf contains, among other lines:
> 
> [...]
> GresTypes=gpu
> [...]
> Include /etc/slurm/slurm.conf.d/allnodes
> [...]
> 
> and the above-mentioned /etc/slurm/slurm.conf.d/allnodes file has the line
> 
> NodeName=gpu1601 CPUs=12 RealMemory=257840 Gres=gpu:gtx1080:4
> 
> On the host gpu1601, the file /etc/slurm/gres.conf contains
> 
> NodeName=gpu1601 Name=gpu Type=gtx1080 File=/dev/nvidia[0-3]
> 
> However, when I try to srun something with 1 gpu, I get:
> 
> srun: error: gres_plugin_job_state_unpack: no plugin configured to unpack data type 7696487 from job 22. This is likely due to a difference in the GresTypes configured in slurm.conf on different cluster nodes.
> srun: gres_plugin_step_state_unpack: no plugin configured to unpack data type 7696487 from StepId=22.0
> srun: error: fwd_tree_thread: can't find address for host gpu1601, check slurm.conf
> srun: error: Task launch for StepId=22.0 failed on node gpu1601: Can't find an address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> 
> I'm not sure whether the relevant error is the "no plugin configured" part or the "Can't find an address" part. "gpu1601" is pingable from both the submit host and the controller host. The slurm daemons seem to be running without errors.
> 
> Am I missing something stupidly obvious?
> 
> Thanks,
> ~~ bnacar
> 
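A note on the "data type 7696487" in the first error line: the number
is not arbitrary. Read as bytes, low byte first, it spells "gpu". The
sketch below shows the arithmetic. That Slurm derives its GRES plugin
ids by packing the name this way is an assumption on my part, not
something stated in this thread, and the helper names are made up. The
sketch only handles names of up to four characters.

```python
def gres_id_to_name(plugin_id: int) -> str:
    """Unpack an integer into characters, lowest byte first."""
    chars = []
    while plugin_id:
        chars.append(chr(plugin_id & 0xFF))
        plugin_id >>= 8
    return "".join(chars)


def gres_name_to_id(name: str) -> int:
    """Pack a short name (up to 4 characters), one byte per character."""
    if len(name) > 4:
        raise ValueError("this sketch only handles names up to 4 characters")
    return sum(ord(c) << (8 * i) for i, c in enumerate(name))


if __name__ == "__main__":
    print(gres_id_to_name(7696487))
    print(gres_name_to_id("gpu"))
```

Under that reading, the message says that the job carries "gpu" GRES
data which the receiving side has no plugin for, which fits a
slurm.conf without GresTypes=gpu on the host running srun.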

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621