[slurm-users] slurmstepd: error: _is_a_lwp

Marcus Boden mboden at gwdg.de
Wed Feb 5 07:25:05 UTC 2020


We had this issue recently. Some googling led me to the NERSC FAQs,
which state:
 > _is_a_lwp is a function called internally for Slurm job accounting. The message indicates a rare error situation with a function call. But the error shouldn't affect anything in the user job. Please ignore the message.
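
To illustrate why such a message can be harmless, here is a minimal Python sketch of the kind of check the function name suggests: deciding whether a PID is a lightweight process (a thread) by reading /proc/<pid>/status. This is not Slurm's code. The names is_lwp and proc_root are made up for the example, and it assumes the check amounts to comparing the Tgid field with the PID. The point is only that a short-lived process can exit between being listed and being inspected, which would give exactly the two failures in the quoted logs (open() failing with "No such file or directory", read() failing with "No such process").

```python
import os
from typing import Optional


def is_lwp(pid: int, proc_root: str = "/proc") -> Optional[bool]:
    """Hypothetical helper, not part of Slurm.

    Returns True if pid looks like a thread (Tgid differs from the PID),
    False if it is a thread-group leader, and None if the process
    disappeared before its status file could be read.
    """
    path = os.path.join(proc_root, str(pid), "status")
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("Tgid:"):
                    return int(line.split()[1]) != pid
    except (FileNotFoundError, ProcessLookupError):
        # ENOENT on open() or ESRCH on read(): the process already exited.
        # For accounting purposes there is simply nothing left to sample.
        return None
    # Status file was readable but had no Tgid line.
    return None
```

A caller that treats None as "skip this PID" loses at most one accounting sample for a process that is already gone, which would fit the NERSC advice that the message can be ignored.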

Looking through our log files, this error seems to appear more or less
at random and does not cause any jobs to fail (every occurrence I found
belonged to a job that ran perfectly fine).
In your case, the job got cancelled about a minute after that message
(14:10:42 vs. 14:11:36 in your slurmd.log).
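
One way to check whether the error and the cancellations actually go together is to tally both per job from slurmd.log. The sketch below is only an assumption about how one might do that: the function name summarize and the regular expression are invented for this example, and the pattern is based solely on the line format visible in the quoted log excerpts.

```python
import re
from collections import defaultdict

# Matches step-prefixed lines such as:
# [2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<job>\d+)\.(?P<step>[^\]]+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Map job ID -> {"lwp_errors": count, "cancelled": bool}."""
    jobs = defaultdict(lambda: {"lwp_errors": 0, "cancelled": False})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # lines without a [job.step] prefix are skipped
        job = int(m.group("job"))
        msg = m.group("msg")
        if "_is_a_lwp" in msg:
            jobs[job]["lwp_errors"] += 1
        elif "CANCELLED" in msg:
            jobs[job]["cancelled"] = True
    return dict(jobs)
```

Feeding it the open log file, for example summarize(open("/var/log/slurmd.log")), gives a dictionary that can be scanned for jobs with lwp_errors > 0 and cancelled set to False. If most affected jobs were never cancelled, the error is unlikely to be the cause.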

It is curious, though, that it seems to happen to only one user in
your case.

Best,
Marcus

On 20-02-04 20:50, Luis Huang wrote:
> We have a user who keeps encountering this error with one type of her jobs. Sometimes her jobs get cancelled and other times they run fine.
> 
> slurmstepd: error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
> slurmstepd: error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
> 
> [root@pe2dc5-0007 ~]# grep 17534 /var/log/slurmd.log
> [2020-01-23T14:10:12.789] task_p_slurmd_batch_request: 17534
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU input mask for node: 0x03000000000000
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU final HW mask for node: 0x02000000200000
> [2020-01-23T14:10:12.790] _run_prolog: prolog with lock for job 17534 ran for 0 seconds
> [2020-01-23T14:10:12.875] Launching batch job 17534 for UID 50321
> [2020-01-23T14:10:16.937] [17534.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
> [2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
> [2020-01-23T14:11:37.394] [17534.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
> [2020-01-23T14:11:37.396] [17534.batch] done with job
> 
> I'm also seeing lots of spam in the slurmd.log files on the compute nodes themselves whenever this user's jobs land on them.
> 
> [2020-02-04T15:29:11.073] [43816.batch] error: _is_a_lwp: 1 read() attempts on /proc/234796/status failed: No such process
> [2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() /proc/74338/status failed: No such file or directory
> [2020-02-04T15:40:42.064] [43916.batch] error: _is_a_lwp: open() /proc/87034/status failed: No such file or directory
> [2020-02-04T15:41:11.304] [43840.batch] error: _is_a_lwp: open() /proc/151191/status failed: No such file or directory
> 
> Has anyone seen this issue before?
> 
> Regards,
> 
> 
> Luis Huang | Systems Administrator II, Research Computing
> New York Genome Center
> 101 Avenue of the Americas
> New York, NY 10013
> O: (646) 977-7291
> lhuang at nygenome.org

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.:   +49 (0)551 201-2191
E-Mail: mboden at gwdg.de
---------------------------------------
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL:    http://www.gwdg.de
E-Mail: gwdg at gwdg.de
Tel.:   +49 (0)551 201-1510
Fax:    +49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender:
Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
---------------------------------------