Hello,
At site RO-14-ITIM we have a very strange problem.
A ticket was raised with the following:
Timo"
The problem is that I cannot find any of the failing jobs on the site.
For example, the last job:
https://aipanda023.cern.ch/condor_logs_2/25-03-14_09/grid.19602200.0.log
with stdlog:
000 (19602200.000.000) 2025-03-14 09:28:03 Job submitted from host: <137.138.31.125:38090?addrs=137.138.31.125-38090+[2001-1458-d00-19--75]-38090&alias=aipanda023.cern.ch>
...
027 (19602200.000.000) 2025-03-14 09:28:13 Job submitted to grid resource
GridResource: arc arcn-node.itim-cj.ro:443
GridJobId: arc arcn-node.itim-cj.ro:443 PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
...
001 (19602200.000.000) 2025-03-14 09:40:51 Job executing on host: arc arcn-node.itim-cj.ro:443
...
012 (19602200.000.000) 2025-03-14 09:40:58 Job was held.
ARC job failed: LRMS error: (-1) Job missing from SLURM
Code 0 Subcode 0
...
009 (19602200.000.000) 2025-03-14 11:03:03 Job was aborted.
Python-initiated action. (by user atlpan)
...
009 (19602200.000.000) 2025-03-14 11:03:17 Job was aborted.
Python-initiated action. (by user atlpan)
This job ID:
PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
is nowhere to be found on my server:
[root@arcn-node log]# updatedb
[root@arcn-node log]# locate PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
[root@arcn-node log]#
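Note that locate only matches file paths, not file contents, so an empty result does not rule out the ID appearing inside a log or status file. The following is a minimal sketch of a content search, not an ARC tool: the function name find_job_id is hypothetical, and the directory to scan (for example the log directory shown in the prompt above) is an assumption that has to be adapted to the local ARC configuration.

```python
import os


def find_job_id(root, job_id):
    """Return sorted paths under root whose name or contents mention job_id.

    Hypothetical helper for illustration; 'root' is whatever directory
    holds the CE logs or job control files on the local installation.
    """
    needle = job_id.encode()
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Directory names may carry the job ID (per-job session directories).
        for name in dirnames:
            if job_id in name:
                hits.append(os.path.join(dirpath, name))
        for name in filenames:
            path = os.path.join(dirpath, name)
            if job_id in name:
                hits.append(path)
                continue
            # Skip sockets, FIFOs and broken symlinks so open() cannot block.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    # The ID contains no newline, so a line-wise scan suffices.
                    if any(needle in line for line in fh):
                        hits.append(path)
            except OSError:
                # Unreadable file: ignore and keep scanning.
                continue
    return sorted(hits)


# Example call (the directory is an assumption, adjust as needed):
# find_job_id("/var/log", "PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n")
```

An empty list from such a scan over the relevant directories would be stronger evidence that the CE never recorded the job than the empty locate output alone.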
I also see this kind of error on my site:
The affected ENDPOINT is arcn-node.itim-cj.ro (ARC-CE).
It became Critical at 2025-03-14T09:16:44Z due to METRIC org.nordugrid.ARC-CE-sw-gcc,
which is reported as OK again in the message one minute later.
Can you please advise where to look for any clue to solve this mystery?
Thank you,
Felix
--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular Technology
IT Department, Cluj-Napoca, Romania
Mobile: +40742195323