[slurm-users] slurm jobs are pending but resources are available
Marius.Cetateanu at sony.com
Marius.Cetateanu at sony.com
Mon Apr 16 04:35:16 MDT 2018
Hi,
I'm having some trouble with resource allocation: based on my reading of the documentation and the way I applied it in the config file, I expect behaviour that does not actually happen.
Here is the relevant excerpt from the config file:
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
...
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP
According to the above I have the backfill scheduler enabled, with CPUs and memory configured as the consumable resources. I have 56 CPUs and 256 GB of RAM in my resource pool. I would expect the backfill
scheduler to try to allocate resources so that as many of the cores as possible are filled whenever
multiple jobs ask for more resources than are available. In my case I have the following queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2361 main_comp training mcetatea PD 0:00 1 (Resources)
2356 main_comp skrf_ori jhanca R 58:41 1 cn_burebista
2357 main_comp skrf_ori jhanca R 44:13 1 cn_burebista
Jobs 2356 and 2357 are asking for 16 CPUs each and job 2361 is asking for 20 CPUs, i.e. 52 CPUs in total.
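For reference, job 2361 was submitted with something along these lines (reconstructed from the job record further down; the exact command line may have differed):

sbatch --partition=main_compute --ntasks=1 --cpus-per-task=20 train_classifier.sh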
As seen above, job 2361 (submitted by a different user than the two running jobs) is marked as pending due to lack of resources, although there are plenty of CPUs and memory still available. "scontrol show nodes cn_burebista" gives me the following:
NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
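Doing the arithmetic on that output: CPUAlloc=32 out of CPUTot=56 leaves 24 CPUs unallocated, and AllocMem=64000 out of RealMemory=256000 leaves 192000 MB unallocated, so on paper the 20 CPUs requested by job 2361 should fit alongside the two running jobs.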
I'm going through the documentation again and again but I cannot figure out what I am doing wrong.
Why do I end up in the above situation? What should I change in my config to make this work?
"scontrol show -dd job <jobid>" shows me the following for the pending job:
JobId=2361 JobName=training_carlib
UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
Priority=4294901726 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main_compute AllocNode:Sid=zalmoxis:23690
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cn_burebista
NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
StdIn=/dev/null
StdOut=/home/mcetateanu/workspace/CarLib/src/_out
I also changed my config to specify the number of CPUs explicitly, rather than letting Slurm compute them
from Sockets, CoresPerSocket, and ThreadsPerCore (a sketch of that node line follows the job output below). The two jobs I am trying to run then show the following
in "scontrol show -dd job <jobid>", but the one asking for 20 CPUs is still pending due to lack of resources:
NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
  TRES=cpu=16,mem=32000M,node=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  Nodes=cn_burebista CPU_IDs=0-15 Mem=32000
  MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0

NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
  TRES=cpu=20,node=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
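For completeness, the node definition with the CPU count given explicitly looks roughly like this (reconstructed; the exact line in my slurm.conf may differ):

NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN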
Thank you
-------------------------------------------------------------------------------------------
Marius Cetateanu
Senior Embedded Software Engineer
Engineering Department 1, Driver & Embedded
Sony Depthsensing Solutions
Tel: +32 (0)28992171
email: Marius.Cetateanu at sony.com
Sony Depthsensing Solutions
11 Boulevard de la Plaine, 1050 Brussels, Belgium