Hi,
In our SLURM cluster, we are using the job_container/tmpfs plugin to
ensure that each user can use /tmp and that it gets cleaned up after them.
Currently, we map /tmp into the node's RAM, which means that cgroups
make sure users can only use a certain amount of storage inside /tmp.
Now we would like to use the node's local SSD instead of its RAM to
hold the files in /tmp. I have seen people define local storage as a GRES,
but I am wondering how to make sure that users do not exceed the storage
space they requested in a job. Does anyone have an idea how to configure
local storage as a properly tracked resource?
Thanks a lot in advance!
Best,
Tim
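For illustration, one way local scratch is sometimes made a countable resource is a plugin-less GRES; a minimal sketch, where the GRES name "tmp", the node name, and the sizes are assumptions (note that a plain GRES count is only scheduler-side bookkeeping and does not enforce how much a job actually writes):

    # slurm.conf (existing node line shortened; "tmp" is an arbitrary name):
    GresTypes=tmp
    NodeName=node01 ... Gres=tmp:800G

    # gres.conf on the node:
    NodeName=node01 Name=tmp Count=800G

    # job submission:
    sbatch --gres=tmp:100G job.sh

The enforcement gap is exactly the question raised here, which is what the quota-based discussion below is about.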
Hi Magnus,
I understand. Thanks a lot for your suggestion.
Best,
Tim
On 06.02.24 15:34, Hagdorn, Magnus Karl Moritz wrote:
> Hi Tim,
> in the end the InitScript didn't contain anything useful because
>
> slurmd: error: _parse_next_key: Parsing error at unrecognized key:
> InitScript
>
> At this stage I gave up. This was with SLURM 23.02. My plan was to
> setup the local scratch directory with XFS and then get the script to
> apply a project quota, i.e. a quota attached to the directory.
>
> I would start by checking if slurm recognises the InitScript option.
>
> Regards
> magnus
>
> On Tue, 2024-02-06 at 15:24 +0100, Tim Schneider wrote:
>> Hi Magnus,
>>
>> thanks for your reply! If you can, would you mind sharing the
>> InitScript
>> of your attempt at getting it to work?
>>
>> Best,
>>
>> Tim
>>
>> On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:
>>> Hi Tim,
>>> we are using the container/tmpfs plugin to map /tmp to a local NVMe
>>> drive, which works great. I did consider setting up directory quotas.
>>> I thought the InitScript [1] option should do the trick. Alas, I
>>> didn't get it to work. If I remember correctly, slurm complained about
>>> the option being present. In the end we recommend that our users make
>>> exclusive use of a node if they are going to use a lot of local
>>> scratch space. I don't think this happens very often, if at all.
>>> Regards
>>> magnus
>>>
>>> [1]
>>> https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript
>>>
>>>
>>> On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users
>>> wrote:
>>>> Hi,
>>>>
>>>> In our SLURM cluster, we are using the job_container/tmpfs plugin to
>>>> ensure that each user can use /tmp and it gets cleaned up after them.
>>>> Currently, we are mapping /tmp into the node's RAM, which means that
>>>> the cgroups make sure that users can only use a certain amount of
>>>> storage inside /tmp.
>>>>
>>>> Now we would like to use the node's local SSD instead of its RAM to
>>>> hold the files in /tmp. I have seen people define local storage as
>>>> GRES, but I am wondering how to make sure that users do not exceed
>>>> the storage space they requested in a job. Does anyone have an idea
>>>> how to configure local storage as a proper tracked resource?
>>>>
>>>> Thanks a lot in advance!
>>>>
>>>> Best,
>>>>
>>>> Tim
>>>>
>>>>
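For reference, the kind of per-directory project quota described above could in principle be applied by a small script; this is only a sketch, assuming the scratch filesystem is XFS mounted with prjquota, that InitScript (or a job prolog) actually runs on your Slurm version, and that the per-job directory and job ID are available as shown (the paths and the 100g limit are placeholders):

    #!/bin/bash
    # Sketch: apply an XFS project quota to a per-job scratch directory.
    # MOUNT, JOB_DIR, and the 100g limit are placeholders; whether
    # SLURM_JOB_ID is set in this context is an assumption.
    MOUNT=/local/scratch
    JOB_DIR=$MOUNT/$SLURM_JOB_ID
    PROJ_ID=$SLURM_JOB_ID

    mkdir -p "$JOB_DIR"
    # Tag the directory tree with the project ID, then cap its block usage.
    xfs_quota -x -c "project -s -p $JOB_DIR $PROJ_ID" "$MOUNT"
    xfs_quota -x -c "limit -p bhard=100g $PROJ_ID" "$MOUNT"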
Hi Magnus,
thanks for your reply! If you can, would you mind sharing the InitScript
of your attempt at getting it to work?
Best,
Tim
On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote:
> Hi Tim,
> we are using the container/tmpfs plugin to map /tmp to a local NVMe
> drive, which works great. I did consider setting up directory quotas.
> I thought the InitScript [1] option should do the trick. Alas, I didn't
> get it to work. If I remember correctly, slurm complained about the
> option being present. In the end we recommend that our users make
> exclusive use of a node if they are going to use a lot of local scratch
> space. I don't think this happens very often, if at all.
> Regards
> magnus
>
> [1]
> https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript
>
>
> On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:
>> Hi,
>>
>> In our SLURM cluster, we are using the job_container/tmpfs plugin to
>> ensure that each user can use /tmp and it gets cleaned up after them.
>> Currently, we are mapping /tmp into the node's RAM, which means that
>> the cgroups make sure that users can only use a certain amount of
>> storage inside /tmp.
>>
>> Now we would like to use the node's local SSD instead of its RAM to
>> hold the files in /tmp. I have seen people define local storage as
>> GRES, but I am wondering how to make sure that users do not exceed
>> the storage space they requested in a job. Does anyone have an idea
>> how to configure local storage as a proper tracked resource?
>>
>> Thanks a lot in advance!
>>
>> Best,
>>
>> Tim
>>
>>
Hello
I have the following scenario:
I need to submit a sequence of up to 400 jobs where each even job depends
on the preceding odd job finishing, and each odd job depends on the presence
of a file generated by the preceding even job (availability of the file for
the first of those 400 jobs is guaranteed).
If I just submit all those jobs in a loop using dependencies, I end up with
a lot of pending jobs that might later not even run because no output file
has been produced by the preceding jobs. Is there a way to pause the
submission loop until the required file has been generated, so that at most
two jobs are submitted at the same time?
Here is a sample submission script showing what I want to achieve.
for i in {1..200}; do
    FILE=GHM_paramset_${i}.dat
    # How can I pause the submission loop until the FILE has been created?
    #if test -f "$FILE"; then
    jobid4=$(sbatch --parsable --dependency=afterok:$jobid3 job4_sub $i)
    jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub $i)
    #fi
done
Any help will be appreciated
Amjad
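One possible approach is to poll for the file inside the loop; a sketch reusing the names from the script above (the 60-second interval and the handling of the unset jobid3 on the first iteration are additions):

    for i in {1..200}; do
        FILE=GHM_paramset_${i}.dat
        # Block the submission loop until the preceding even job has
        # written the file (poll every 60 seconds; interval is arbitrary).
        until [ -f "$FILE" ]; do
            sleep 60
        done
        # ${jobid3:+...} drops the dependency on the very first iteration,
        # where jobid3 is not yet set.
        jobid4=$(sbatch --parsable ${jobid3:+--dependency=afterok:$jobid3} job4_sub "$i")
        jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub "$i")
    done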
Hi,
I am a little new to this, so please pardon my ignorance.
I have configured slurm in my cluster and it works fine with local
users. But I am not able to get it working with LDAP/SSSD authentication.
User logins using ssh are working fine. An LDAP user can log in to the
login, slurmctld, and compute nodes, but when they try to submit jobs,
slurmctld logs an error about an invalid account or partition for the user.
Someone said we need to add the user manually into the database using the
sacctmgr command. But I am not sure we need to do this for each and every
LDAP user. Yes, it does work if we add the LDAP user manually using
sacctmgr, but I am not convinced this manual way is the right way to do it.
The documentation is not very clear about using LDAP accounts.
Saw somewhere in the list about using UsePAM=1 and copying or creating a
softlink for the slurm PAM module under /etc/pam.d, but it didn't work for me.
Saw somewhere else that we need to specify
LaunchParameters=enable_nss_slurm in the slurm.conf file and put the slurm
keyword in the passwd/group entries in the /etc/nsswitch.conf file. Did
these, but they didn't help either.
I am bereft of ideas at present. If anyone has real world experience and
can advise, I will be grateful.
Thank you,
Richard
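For what it's worth, the manual registration mentioned above usually comes down to two sacctmgr calls, and it can be scripted over an LDAP group; a sketch, with made-up account, user, and group names:

    # Create an account once, then attach a user to it (-i skips the
    # confirmation prompt):
    sacctmgr -i add account research Description="LDAP users" Organization=dept
    sacctmgr -i add user alice Account=research

    # Bulk-add every member of an LDAP group (group name is hypothetical):
    for u in $(getent group hpcusers | cut -d: -f4 | tr ',' ' '); do
        sacctmgr -i add user "$u" Account=research
    done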
If I use the sbatch(1) option --export=NONE, wipe the environment with "env -i /usr/bin/sbatch ...", or use --export=NIL, then the environment is not properly constructed and I see
the following messages in the /var/log/*slurm* files:
[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment
This occurs at 120 seconds no matter whether I add --get-user-env=3600 or adjust many slurm.conf time-related parameters. It is easy to reproduce by adding "sleep 100" to a .cshrc file and submitting the following file with sbatch(1):
#!/bin/csh
#SBATCH --export=NONE --propagate=NONE --get-user-env=3600L
printenv HOME
printenv USER
printenv PATH
env
I have adjusted MANY time-related limits in the slurm.conf file to no avail. When the system is unresponsive or heavily loaded, or when users have prologues that set up complex environments via module commands (which can be notoriously slow), jobs fail or produce errors.
If I configure Slurm so that jobs that time out are requeued instead of run, then a user with a slow login setup can submit a large number of jobs and basically shut down a cluster, because this option not only requeues the jobs that fail but also puts the nodes they ran on in a DRAIN state.
We see this as very dangerous, as by default jobs proceed to execute even when their environment is not properly constructed.
I can see that "slurmrestd getenv" and the procedure get_user_env(3c) are involved, and a preliminary scan of the code suggested the --get-user-env=NNNN value is being parsed,
but I did not see a reason why the setup always times out at 120 seconds (at least on my system).
Does anyone know how to make the time allowed for building the default user environment honor the value of the --get-user-env option when no environment is exported to a job?
This is showing up sporadically and causing intermittent failures that are very confusing and disturbing to the users it occurs with.
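Since the log shows slurmd waiting on /bin/su, one quick diagnostic (a sketch, run as root on the affected node; "jsu" is the user from the log above) is to time how long building that user's login environment actually takes:

    # Mimic what --get-user-env does via /bin/su and measure its duration;
    # anything approaching two minutes would explain the timeouts above.
    time su - jsu -c /usr/bin/env > /dev/null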
The log files use many different strings to identify a job, and some job-related messages contain no job ID at all.
NUMBER=$SLURM_JOBID
egrep "\.\<$NUMBER\>\] |\<$NUMBER\>\.batch|jobid \<$NUMBER\>|JobId=\<$NUMBER\>|job id \<$NUMBER\>|job\.\<$NUMBER\>|job \<$NUMBER\>|jobid \[\<$NUMBER\>\]|task_p_slurmd_batch_request: \<$NUMBER\>" /var/log/slurm*
Even that misses crucial data that does not contain the job ID at all:
[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment
It would be very useful if all messages related to a job had a consistent string in them for grepping the log files;
even better might be a command like "scontrol show jobid=NNNN log_messages".
But I could not find what I wanted (an easy way to find all daemon log messages related to a specific job). I would find it particularly useful if there were a way to automatically append such information to the stdout of the job at job termination so users would automatically get information about job failures or warnings.
Is there such a feature available I have missed?
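In the meantime, wrapping the pattern list above in a small shell function keeps it reusable (a sketch; the function name is made up):

    # Grep the Slurm logs for every line that mentions a given job ID,
    # using the pattern variants listed above.
    slurm_job_log() {
        local id="$1"
        egrep "\.\<${id}\>\] |\<${id}\>\.batch|jobid \<${id}\>|JobId=\<${id}\>|job id \<${id}\>|job\.\<${id}\>|job \<${id}\>|jobid \[\<${id}\>\]|task_p_slurmd_batch_request: \<${id}\>" /var/log/slurm*
    }
    # Example: slurm_job_log "$SLURM_JOBID"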
I finally had downtime on our cluster running 20.11.3 and decided to
upgrade SLURM. All daemons were stopped on nodes and master.
The Rocky 8 Linux OS was updated but not changed configuration-wise
in any way.
On the master, when I first installed 23.11.1 and tried to run
slurmdbd -D -vvv at the command line, it balked because the jump was more
than two major versions. I then installed 22.05.6 and ran slurmdbd -D -vvv,
which took a long time but completed fine. I reinstalled 23.11.1
and ran slurmdbd -D -vvv, which went quickly. I then stopped it and
ran slurmdbd normally with systemctl.
When I ran slurmctld, it complained about incompatible state files.
I didn't care about losing state, so I ran it with the -i option to
ignore them. It seemed happy.
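For anyone following the same path, the database/daemon sequence described above boils down to roughly the following (a sketch; the package installation steps depend on your packaging, and -i throws away controller state):

    # Stop daemons, then step the database through an intermediate release.
    systemctl stop slurmctld slurmdbd
    # ... install 22.05.x ...
    slurmdbd -D -vvv     # first conversion (slow); stop it once finished
    # ... install 23.11.x ...
    slurmdbd -D -vvv     # second conversion (quick); stop it again
    systemctl start slurmdbd
    slurmctld -i         # -i ignores the incompatible state files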
I then started up slurmd on my various GPU and non-GPU nodes.
I ran a test interactive job just to see if I would get a shell
with the expected SLURM environment on the nodes, and it worked,
except that on the GPU nodes I got errors about not being able
to get/set GPU frequencies. I removed GpuFreqDef=medium from slurm.conf
and that error went away.
I had my people start running jobs last night.
This morning I came in and saw that a lot of nodes are stuck in the
completing state. All of these are GPU nodes.
For example, on node 'sevilla' squeue reports:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3679433 lcna100 sjob_123 sg1526 CG 8:29 1 sevilla
3679432 lcna100 sjob_123 sg1526 CG 8:35 1 sevilla
3679431 lcna100 sjob_123 sg1526 CG 8:42 1 sevilla
In slurmd.log on sevilla I see the following lines every minute:
[2024-01-28T09:37:52.003] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679432
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679431
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
Back at the time job 3679433 was cancelled by the user, I see:
[2024-01-28T00:27:48.658] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T00:27:48.658] debug: credential for job 3679433 revoked
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: sent SUCCESS for 3679433, waiting for prolog to finish
[2024-01-28T00:27:48.658] debug: Waiting for job 3679433's prolog to complete
On sevilla itself there are no processes currently running for user
sg1526, so there is no sign of these jobs. There ARE processes running
for user mu40, though:
[root@sevilla ~]# pgrep -f slurm_script | xargs -n1 pstree -p | grep ^slurm
slurm_script(30691)---starter-suid(30764)-+-python3(30784)-+-{python3}(30817)
slurm_script(30861)---starter-suid(30873)-+-python3(30888)-+-{python3}(30915)
slurm_script(30963)---starter-suid(30975)-+-python3(30992)-+-{python3}(31020)
[root@sevilla ~]# strings /proc/30691/cmdline
/bin/sh
/var/slurm/spool/d/job3679220/slurm_script
/usr/bin/sacct -p --jobs=3679220
--format=JobID,User,ReqTRES,NodeList,Start,End,Elapsed,CPUTime,State,ExitCode
JobID|User|ReqTRES|NodeList|Start|End|Elapsed|CPUTime|State|ExitCode|
3679220|mu40|billing=6,cpu=3,gres/gpu=1,mem=32G,node=1|sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|NODE_FAIL|1:0|
3679220.batch|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
3679220.extern|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
At the time the job started, slurmd.log has lines like:
[2024-01-27T22:33:28.679] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.092] [3679220.extern] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30650, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.166] [3679220.batch] debug2: profile signaling type Task
[2024-01-27T22:33:58.168] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.170] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30764, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.172] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30784, probably hasn't started yet or has already finished
...
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967292: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967292
[2024-01-27T22:46:39.018] debug: _step_connect: connect() failed for /var/slurm/spool/d/sevilla_3679220.4294967291: Connection refused
[2024-01-27T22:46:39.018] debug: Cleaned up stray socket /var/slurm/spool/d/sevilla_3679220.4294967291
[2024-01-27T22:46:39.018] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679220/slurm_script
Then later, at the time of termination:
[2024-01-27T23:26:39.042] debug: _rpc_terminate_job: uid = 1150 JobId=3679220
[2024-01-27T23:26:39.042] debug: credential for job 3679220 revoked
[2024-01-27T23:26:39.042] debug2: No steps in jobid 3679220 to send signal 998
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 18
[2024-01-27T23:26:39.043] debug2: No steps in jobid 3679220 to send signal 15
[2024-01-27T23:26:39.044] debug2: set revoke expiration for jobid 3679220 to 1706416119 UTS
[2024-01-27T23:26:39.044] debug: Waiting for job 3679220's prolog to complete
[2024-01-27T23:26:39.044] debug: Finished wait for job 3679220's prolog to complete
[2024-01-27T23:26:39.068] debug: completed epilog for jobid 3679220
[2024-01-27T23:26:39.071] debug: JobId=3679220: sent epilog complete msg: rc = 0
So I have a node stuck in the 'comp' state that claims to have
completing jobs that are definitely NOT running on the box, BUT
there are jobs running on the box that SLURM thinks are done.
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
Hey folks -
The mailing list will be offline for about an hour as we upgrade the
host, upgrade the mailing list software, and change the mail
configuration around.
As part of these changes, the "From: " field will no longer be the
original sender, but instead use the mailing list ID itself. This is to
comply with DMARC sending options, and allow us to start DKIM signing
messages to ensure deliverability once Google and Yahoo impose new
policy changes in February.
This is the last post on the current (mailman2) list. I'll send a
welcome message on the upgraded (mailman3) list once finished, and when
the list is open to new traffic again.
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support