[slurm-users] [EXT] Jobs Immediately Fail for Certain Users

Jason Simms simmsj at lafayette.edu
Wed Jul 8 00:57:45 UTC 2020


Now that is interesting. If I do:

loginctl enable-linger weissp

Then I get the following error:

Failed to look up user weissp: No such process

This is one of the users whose jobs always fail. But if I run it for myself with:

loginctl enable-linger simmsj

Everything works (as expected).
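
For reference, here is the sort of check I can run directly on the node to see
whether the account resolves at all. This is only my own diagnostic sketch; it
assumes the affected accounts come from a central directory (LDAP/SSSD), which
is an assumption on my part:

getent passwd weissp                    # does NSS resolve the account on this node?
id weissp                               # do the uid/group lookups succeed?
loginctl enable-linger weissp; echo $?  # the same non-zero exit status the prolog sees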

Any thoughts?

Warmest regards,
Jason

On Tue, Jul 7, 2020 at 8:47 PM Sean Crosby <scrosby at unimelb.edu.au> wrote:

> Hi Jason,
>
> What happens when you try to run that command on the node? Is the exit
> status of the command 0?
>
> e.g. for my servers, where lingering is masked, I get
>
> [root at thespian-gpgpu001 ~]# loginctl enable-linger scrosby
> Could not enable linger: Unit is masked.
> [root at thespian-gpgpu001 ~]# echo $?
> 1
>
> Sean
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Wed, 8 Jul 2020 at 01:14, Jason Simms <simmsj at lafayette.edu> wrote:
>
>> Hello all,
>>
>> Two users on my system experience job failures every time they submit a
>> job via sbatch. When I run their exact submission script, or when I create
>> a local system user and launch from there, the jobs run fine. Here is an
>> example of what I see in the slurmd log:
>>
>> [2020-07-06T15:02:41.284] task_p_slurmd_batch_request: 1421
>> [2020-07-06T15:02:41.284] task/affinity: job 1421 CPU input mask for node: 0x00000F0000
>> [2020-07-06T15:02:41.284] task/affinity: job 1421 CPU final HW mask for node: 0x00000F0000
>> [2020-07-06T15:02:41.295] _run_prolog: prolog with lock for job 1421 ran for 0 seconds
>> [2020-07-06T15:02:41.295] error: [job 1421] prolog failed status=1:0
>> [2020-07-06T15:02:41.295] Job 1421 already killed, do not launch batch job
>>
>> The prolog file is simply:
>>
>> #!/bin/bash
>> loginctl enable-linger $SLURM_JOB_USER
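>>
>> (A slightly more verbose variant of that prolog, purely a sketch I have
>> not deployed, would at least record why the call fails before returning
>> the non-zero status that drains the node:)
>>
>> #!/bin/bash
>> # Sketch only, untested: log the failure before exiting non-zero.
>> if ! loginctl enable-linger "$SLURM_JOB_USER"; then
>>     logger -t slurm-prolog "enable-linger failed for $SLURM_JOB_USER on $(hostname)"
>>     exit 1
>> fi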
>>
>> There seems to be some reason why certain users always encounter this,
>> but I can't figure out why. Their accounts are no "different" from anyone
>> else's (not in a different group, etc.), so I don't think permissions are
>> an issue.
>>
>> Anyway, the job failure immediately puts the node into a DRAINED/DRAINING
>> state (which is expected). But for now, these users cannot submit any jobs
>> at all.
>>
>> Any insights would be welcomed!
>>
>> Warmest regards,
>> Jason
>>
>> --
>> *Jason L. Simms, Ph.D., M.P.H.*
>> Manager of Research and High-Performance Computing
>> XSEDE Campus Champion
>> Lafayette College
>> Information Technology Services
>> 710 Sullivan Rd | Easton, PA 18042
>> Office: 112 Skillman Library
>> p: (610) 330-5632
>>
>

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632