[slurm-users] A job submitted with a wrong core-GPU binding pends with reason "Resources", which impacts the main scheduler

Tue Jan 26 05:56:27 UTC 2021

Hi list,
I am new to Slurm and have run into an issue with Slurm 19.05.
When I submit a job that requests 16 cores and 1 GPU, the job stays in the PD (pending) state with reason "Resources". This interferes with the main scheduler's handling of lower-priority jobs (pending with reason "Priority") in the same partition. The open ticket is: https://bugs.schedmd.com/show_bug.cgi?id=10697
My gres.conf looks like this:
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=8-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=16-23
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=24-31
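
To make the mismatch concrete: each GPU above is bound to 8 cores, so a request for 16 cores and 1 GPU cannot be met from the cores bound to a single GPU. The following is only a rough sketch of that arithmetic, assuming the binding is enforced for the job. The struct and function names are hypothetical and are not part of Slurm.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one gres.conf line: a GPU and its bound core range. */
struct gpu_binding {
	int first_core;
	int last_core;
};

/* Number of cores in the inclusive range bound to one GPU. */
static int bound_cores(const struct gpu_binding *g)
{
	return g->last_core - g->first_core + 1;
}

/*
 * Return true if want_cores cores can be taken from the cores bound to
 * want_gpus GPUs. Greedy sketch: pick the GPUs with the most bound cores.
 * Assumes disjoint core ranges and at most 64 GPUs per node.
 */
static bool request_fits_binding(const struct gpu_binding *gpus, size_t n_gpus,
				 int want_cores, size_t want_gpus)
{
	unsigned long long used = 0;
	int total = 0;

	if (want_gpus == 0 || want_gpus > n_gpus || n_gpus > 64)
		return false;

	for (size_t k = 0; k < want_gpus; k++) {
		int best = -1;
		size_t best_i = 0;

		for (size_t i = 0; i < n_gpus; i++) {
			int c;

			if (used & (1ULL << i))
				continue;
			c = bound_cores(&gpus[i]);
			if (c > best) {
				best = c;
				best_i = i;
			}
		}
		used |= 1ULL << best_i;
		total += best;
	}
	return want_cores <= total;
}
```

With the four ranges from the gres.conf above, a request for 16 cores and 1 GPU does not fit, while 8 cores with 1 GPU or 16 cores with 2 GPUs does.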
After investigating, I found the following:
1. Why can this job be submitted?
In my environment several nodes are powered down, so they never send a registration message to the controller, and slurmctld has no core-to-GPU binding information for them. The job is therefore accepted, and it would normally pend with reason "ReqNotAvail", which does not affect the main scheduler.
2. Why is this job pending with "Resources"?
There was a running job submitted with --exclusive, so the global variable share_node_bitmap had the bits for that job's nodes cleared. When I submit the problem job, _pick_best_nodes sets nodes_busy to true because of the exclusive job. After select_g_job_test() is run against the list of nodes that exist in any state, it reports that the job can be submitted, so execution reaches logic 1 below. Because nodes_busy is true, logic 1 is skipped and logic 2 applies. That is the cause of the problem.

	// logic 1 in _pick_best_nodes
	else if (!runable_avail && !nodes_busy) {
		error_code = ESLURM_NODE_NOT_AVAIL;
	}
	// logic 2 in _pick_best_nodes
	if (error_code == SLURM_SUCCESS) {
		error_code = ESLURM_NODES_BUSY;
		*select_bitmap = possible_bitmap;
	} else {
		FREE_NULL_BITMAP(possible_bitmap);
	}
	return error_code;
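
The effect of nodes_busy on the two fragments quoted above can be reproduced in isolation. This is a simplified standalone model, not the actual Slurm code: the enum values are stand-ins for the real error codes, the branch that precedes the "else if" in the real function is omitted, and the bitmap handling is left out.

```c
#include <stdbool.h>

/* Stand-in values for this sketch only; the real codes live in Slurm's headers. */
enum model_rc {
	MODEL_SUCCESS,
	MODEL_NODE_NOT_AVAIL,	/* stands in for ESLURM_NODE_NOT_AVAIL */
	MODEL_NODES_BUSY	/* stands in for ESLURM_NODES_BUSY */
};

/*
 * Simplified model of the tail of _pick_best_nodes as quoted above.
 * Only the two flags that decide the outcome are kept.
 */
static enum model_rc model_pick_result(bool runable_avail, bool nodes_busy)
{
	enum model_rc error_code = MODEL_SUCCESS;

	/* logic 1: only taken when no node is busy */
	if (!runable_avail && !nodes_busy)
		error_code = MODEL_NODE_NOT_AVAIL;

	/* logic 2: anything still "successful" is reported as busy */
	if (error_code == MODEL_SUCCESS)
		error_code = MODEL_NODES_BUSY;

	return error_code;
}
```

In this model, runable_avail == false with nodes_busy == true yields the "busy" result, while the same job with nodes_busy == false yields the "not available" result. That matches the behaviour described above: one unrelated exclusive job is enough to change the outcome.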
To avoid this problem I am considering the following options. Please advise.
(1) I don't think nodes_busy is a good condition to check. How about replacing "!nodes_busy" with "bit_super_set(possible_bitmap, share_node_bitmap)" in logic 1? I tested this in my local environment and it resolves the problem.
(2) How about removing gres.conf to drop the resource binding?
(3) How about using cli_filter to reject jobs with invalid bindings?
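
Option (1) can be illustrated with a small standalone model. It is a sketch under stated assumptions, not a patch: a 64-bit word stands in for Slurm's bitmap type, mask_is_subset() mirrors what bit_super_set(b1, b2) is understood to test (every bit set in b1 is also set in b2), and all names and values below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in result codes for this sketch only. */
enum sketch_rc {
	SKETCH_SUCCESS,
	SKETCH_NODE_NOT_AVAIL,	/* stands in for ESLURM_NODE_NOT_AVAIL */
	SKETCH_NODES_BUSY	/* stands in for ESLURM_NODES_BUSY */
};

/* One bit per node; a 64-bit word stands in for a Slurm bitmap. */
typedef uint64_t node_mask_t;

/* True when every bit set in b1 is also set in b2. */
static bool mask_is_subset(node_mask_t b1, node_mask_t b2)
{
	return (b1 & ~b2) == 0;
}

/*
 * Logic 1 with the proposed condition from option (1), followed by
 * logic 2 unchanged. possible models possible_bitmap and share models
 * share_node_bitmap.
 */
static enum sketch_rc sketch_pick_result(bool runable_avail,
					 node_mask_t possible,
					 node_mask_t share)
{
	enum sketch_rc error_code = SKETCH_SUCCESS;

	if (!runable_avail && mask_is_subset(possible, share))
		error_code = SKETCH_NODE_NOT_AVAIL;

	if (error_code == SKETCH_SUCCESS)
		error_code = SKETCH_NODES_BUSY;

	return error_code;
}
```

With this condition the result depends on whether the candidate nodes themselves overlap the nodes cleared from the share mask, rather than on a single flag for the whole selection.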
I would appreciate your response.

wenxiaoll at 126.com