[slurm-users] Node resource is under-allocated

Tue Aug 27 17:53:47 UTC 2019

Hi,

Can someone help me understand what this error is?

select/cons_res: node cn95 memory is under-allocated (125000-135000) for JobId=23544043

We get a lot of these from time to time, and I don't understand what it's about.

Looking at the code it doesn't make sense for this to be happening on running jobs.

plugins/select/cons_res/select_cons_res.c

/*
 * deallocate resources previously allocated to the given job
 * - subtract 'struct job_resources' resources from 'struct part_res_record'
 * - subtract job's memory requirements from 'struct node_res_record'
 *
 * if action = 0 then subtract cores, memory + GRES (running job was terminated)
 * if action = 1 then subtract memory + GRES (suspended job was terminated)
 * if action = 2 then only subtract cores (job is suspended)
 */
static int _rm_job_from_res(struct part_res_record *part_record_ptr,
                            struct node_use_record *node_usage,
                            struct job_record *job_ptr, int action)

...
if (action != 2) {
        if (node_usage[i].alloc_memory <
            job->memory_allocated[n]) {
                error("%s: node %s memory is under-allocated (%"PRIu64"-%"PRIu64") for %pJ",
                      plugin_type, node_ptr->name,
                      node_usage[i].alloc_memory,
                      job->memory_allocated[n],
                      job_ptr);
                node_usage[i].alloc_memory = 0;
        } else
                node_usage[i].alloc_memory -=
                        job->memory_allocated[n];
}
...
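
Going by the format string, the two numbers in the log line are the node's tracked alloc_memory (125000) and the job's memory_allocated (135000), in that order. Below is a minimal standalone sketch of how that branch reads, not Slurm code: the function name release_job_memory and its return value are hypothetical, and fprintf stands in for Slurm's error().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the memory branch quoted above.
 * node_alloc: memory the node is currently recorded as having allocated
 * job_alloc:  memory recorded as allocated to the job being removed
 * Returns 1 if the "under-allocated" condition was hit, 0 otherwise.
 */
static int release_job_memory(uint64_t *node_alloc, uint64_t job_alloc)
{
        assert(node_alloc != NULL);

        if (*node_alloc < job_alloc) {
                /* Subtracting would wrap the unsigned counter, so the
                 * mismatch is logged and the counter is reset to 0. */
                fprintf(stderr,
                        "memory is under-allocated (%" PRIu64 "-%" PRIu64 ")\n",
                        *node_alloc, job_alloc);
                *node_alloc = 0;
                return 1;
        }

        /* Normal case: give the job's memory back to the node. */
        *node_alloc -= job_alloc;
        return 0;
}
```

With the values from the log line, release_job_memory(&alloc, 135000) with alloc = 125000 takes the first branch and leaves alloc at 0. Read this way, the message is about the node's recorded allocation being smaller than the job's recorded allocation at the moment the job is removed, not about how much memory the job actually used.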

It appears to me that this function should only be called when a job has ended or been suspended. Yet these errors are being printed for running jobs. Is Slurm actually deallocating resources for that job, so that there is more memory available for other jobs? I don't think that is the case.

Anyone have a thought here?

My initial feeling is: who cares if the node is under-allocated? Yes, it would be great if users actually came close to using the memory/resources they asked for so that nothing is wasted, but this typically doesn't happen. Is this error there to tell sysadmins that they should perhaps overprovision memory? Or maybe there is a config issue on our side? I don't think the latter is the case.

Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167