[slurm-users] slurmstepd: error: Too many levels of symbolic links
b.h.mevik at usit.uio.no
Fri Dec 3 09:49:13 UTC 2021
Adrian Sevcenco <Adrian.Sevcenco at spacescience.ro> writes:
> On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
>> In the end we had to give up
>> using automount, and implement a manual procedure that mounts/umounts
>> the needed nfs areas.
> Thanks a lot for the info! Manual as in "script" or as in "systemd.mount service"?
Script. We mount (if needed) in the prolog. Then in the healthcheck
(run every 5 mins), we check whether a job that needs the mount is
still running on the node, and unmount it if not. (We could have done
it in the epilog, but feared that could lead to a lot of mount/umount
cycles if a set of jobs failed immediately. Hence we put it in the
healthcheck.)
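The mount-in-prolog / unmount-in-healthcheck logic could be sketched
roughly as below. This is a hypothetical illustration, not the actual
site scripts: the export, mount point, and the squeue-based "any job on
this node" check are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of the prolog/healthcheck mount handling.
# NFS_EXPORT and MOUNTPOINT are placeholders, not real site paths.

NFS_EXPORT="nfsserver:/export/home"   # assumed NFS export
MOUNTPOINT="/nfs/home"                # assumed mount point

# Prolog side: mount the area if it is not already mounted.
mount_if_needed() {
    local mp="$1"
    if ! mountpoint -q "$mp"; then
        mount -t nfs "$NFS_EXPORT" "$mp"
    fi
}

# Healthcheck side: unmount if no job on this node still needs it.
# (Assumption: "needs it" is approximated here by "any job running
# on this node"; a real check might inspect the jobs more closely.)
umount_if_idle() {
    local mp="$1"
    if [ -z "$(squeue --noheader --nodelist="$(hostname -s)")" ]; then
        if mountpoint -q "$mp"; then
            umount "$mp"
        fi
    fi
}
```

Running the unmount from the periodic healthcheck rather than the
epilog, as described above, avoids churning the mount when many short
jobs start and fail in quick succession.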
I don't have much experience with the systemd.mount service, but it is
possible it would work fine (and be less hackish than our solution :).
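For reference, a systemd-based alternative would pair a .mount unit
with a .automount unit, so the area is mounted on first access and
unmounted after an idle period. This is an untested sketch; the unit
names and paths are assumptions (systemd requires the unit file name to
match the escaped mount path):

```
# nfs-home.mount (hypothetical; name must match the mount path /nfs/home)
[Mount]
What=nfsserver:/export/home
Where=/nfs/home
Type=nfs

# nfs-home.automount (hypothetical)
[Automount]
Where=/nfs/home
TimeoutIdleSec=600
```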
> Also, the big and only advantage that autofs had over static mounts was
> that whenever there was a problem with the server, once the glitch had passed,
> autofs would re-mount the target...
That's the theory. :) Our experience in practice is that if the client
is actively using the NFS-mounted area when the problem arises, you
will often have to reboot the client to resolve the disk waits. (I
*think* it has something to do with NFS using longer and longer
timeouts when it cannot reach the server, so eventually it takes too
long to time out and return an error to the running applications.)
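The retry behaviour alluded to here is controlled by the NFS mount
options. A hypothetical /etc/fstab entry (server and paths are
placeholders) showing the relevant knobs:

```
# hard (the default): retry indefinitely; blocked processes sit in
#   uninterruptible disk wait ("D" state) until the server returns.
# soft: give up after retrans retries and return an error to the
#   application instead (at some risk of data corruption on writes).
# timeo: initial timeout in tenths of a second; retrans: retry count.
nfsserver:/export/home  /nfs/home  nfs  hard,timeo=600,retrans=2  0 0
```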
> I'm not sure that a static nfs mount has this capability ... did you also
> bake a recovery part into your manual procedure?
No, we simply pretend it will not happen. :) In fact, I think we've
only had this type of problem once or twice in the last four or five
years. But that might be because we only mount the home directories
over NFS, so most of the time the jobs are not actively using the
NFS-mounted area. (Most of the activity happens in BeeGFS- or
GPFS-mounted areas.)