Jeff,
Dang, that's really old. I'm not sure I would run one that old, to be
honest; it's missing too many security fixes and later-added features.
It's never been that hard to do a 'git clone' and the normal
configure/make/make install process with Slurm.
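For what it's worth, that flow is roughly the following. I'm guessing at
the exact tag name for 21.08.5, so check 'git tag' in the clone first,
and pick whatever install prefix you prefer:
  git clone https://github.com/schedmd/slurm.git
  cd slurm
  git checkout -b slurm-21.08.5 slurm-21-08-5-1   # tag name assumed; confirm with 'git tag'
  ./configure --prefix=/opt/slurm/21.08.5
  make -j$(nproc)
  sudo make install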
Someone else made me aware of this, in case it's easier:
https://slurm.schedmd.com/quickstart_admin.html#debuild
Lloyd
On 5/15/24 08:57, Jeffrey Layton wrote:
> Lloyd,
>
> Good to hear from you! I was hoping to avoid the use of git but that
> may be the only way. The version is 21.08.5. I checked the "old"
> packages from SchedMD and they begin part way through 2024 so that
> won't work.
>
> I'm very surprised Ubuntu let a package through without a source
> package for it. I'm hoping I'm not missing the forest for the trees
> in finding that package.
>
> Thanks for the help!
>
> Jeff
>
>
> On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users
> <slurm-users(a)lists.schedmd.com> wrote:
>
> Jeff,
>
> I'm not sure what version is in the Ubuntu packages, as I don't
> think they're provided by SchedMD, and I'm having trouble finding
> the right one on packages.ubuntu.com. Having said that, SchedMD is
> pretty good about using tags in their GitHub repo
> (https://github.com/schedmd/slurm) to represent the releases.
> For example, the "slurm-23-11-6-1" tag corresponds to release
> 23.11.6. It's pretty straightforward to clone the repo, and do
> something like "git checkout -b MY_LOCAL_BRANCH_NAME TAG_NAME" to
> get the version you're after.
>
>
> Lloyd
>
> --
> Lloyd Brown
> HPC Systems Administrator
> Office of Research Computing
> Brigham Young University
> http://rc.byu.edu
>
> On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
>> Good morning,
>>
>> I have an Ubuntu 22.04 server where I installed Slurm from the
>> Ubuntu packages. I now want to install pyxis but it says I need
>> the Slurm sources. In Ubuntu 22.04, is there a package that has
>> the source code? If not, how do I download the sources I need from GitHub?
>>
>> Thanks!
>>
>> Jeff
>
>
> --
> slurm-users mailing list -- slurm-users(a)lists.schedmd.com
> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
>
--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu
We installed slurm 23.11.5 and we are receiving "JobId=n has invalid
account" for every sbatch job.
We are not using the slurm accounting/user database; we are using uniform
UIDs and GIDs across the cluster.
The jobs run and complete; can these invalid account errors be ignored or
silenced?
Job Submission Environment:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)
Slurm Worker Node:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)
slurmctld log:
[2024-04-18T09:46:40.000] sched: JobId=18 has invalid account
scontrol show job 18
JobId=18 JobName=simplejob.sh
UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
Submit another sbatch job and update the job to include an Account
scontrol update jobid=19 Account=joteumer
[2024-04-18T09:56:05.126] _slurm_rpc_submit_batch_job: JobId=19 InitPrio=1
usec=485
[2024-04-18T09:56:06.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.000] debug: set_job_failed_assoc_qos_ptr: Filling in
assoc for JobId=19 Assoc=0
[2024-04-18T09:56:17.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.588] debug: set_job_failed_assoc_qos_ptr: Filling in
assoc for JobId=19 Assoc=0
[2024-04-18T09:56:27.505] _slurm_rpc_update_job: complete JobId=19 uid=0
usec=110
[2024-04-18T09:56:28.000] sched: JobId=19 has invalid account
scontrol show job 19
JobId=19 JobName=simplejob.sh
UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
JOBID  PARTITION  NAME       USER      STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
19     SPG        simplejob  joteumer  PENDING  0:00  18:00:00   1      (InvalidAccount)
I am using the latest Slurm. It runs fine for scripts, but if I give it a
container, it kills the job as soon as I submit it. Is Slurm cleaning up
$XDG_RUNTIME_DIR before it should? This is the log:
[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[2]=crun --rootless=true --root=/run/user/1000/ state
slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug: _get_container_state: RunTimeQuery
rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery
failed rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] debug: container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC:
REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug: _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug: credential for job 90 revoked
We are pleased to announce the availability of Slurm release candidate
24.05.0rc1.
To highlight some new features coming in 24.05:
- (Optional) isolated Job Step management. Enabled on a job-by-job basis
with the --stepmgr option, or globally through
SlurmctldParameters=enable_stepmgr.
- Federation - Allow for client command operation while SlurmDBD is
unavailable.
- New MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser QOS limits.
- New USER_DELETE reservation flag.
- New Flags=rebootless option on Features for node_features/helpers
which indicates the given feature can be enabled without rebooting the node.
- Cloud power management options: New "max_powered_nodes=<limit>" option
in SlurmctldParameters, and new SuspendExcNodes=<nodes>:<count> syntax
allowing for <count> nodes out of a given node list to be excluded.
- StdIn/StdOut/StdErr now stored in SlurmDBD accounting records for
batch jobs.
- New switch/nvidia_imex plugin for IMEX channel management on NVIDIA
systems.
- New RestrictedCoresPerGPU option at the Node level, designed to ensure
GPU workloads always have access to a certain number of CPUs even when
nodes are running non-GPU workloads concurrently.
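To illustrate a few of the slurmctld options above, a sketch with
placeholder node names and limits (see the 24.05 documentation for the
full syntax):
  SlurmctldParameters=enable_stepmgr,max_powered_nodes=64
  SuspendExcNodes=cloud[001-100]:2
The per-job form of step management is instead requested at submission
time with the --stepmgr option.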
This is the first release candidate of the upcoming 24.05 release
series; it represents the end of development for this release and the
finalization of the RPC and state file formats.
If any issues are identified with this release candidate, please report
them through https://bugs.schedmd.com against the 24.05.x version and we
will address them before the first production 24.05.0 release is made.
Please note that the release candidates are not intended for production use.
A preview of the updated documentation can be found at
https://slurm.schedmd.com/archive/slurm-master/ .
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hi,
If I understand it correctly, the MUNGE and SACK authentication modules naturally require that no one can get access to the key. This means we should not run any jobs on our normal workstations, to which our users have physical access, nor could our users use those workstations to submit jobs to the compute nodes. They would have to ssh to a specific submit node, and only then could they schedule their jobs.
Is there an elegant way to enable job submission from any computer (possibly requiring that users type their password for the submit node – or for their ssh key – at some point)? (All computers/users use the same LDAP server for logins.)
Best
/rike
I have installed Slurm and Podman. As per the documentation, I have replaced
Podman's default runtime with "slurm". The documentation says I need to choose
one oci.conf:
https://slurm.schedmd.com/containers.html#example
Which one should I use? runc? crun? nvidia?
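For reference, the rootless crun variant on that page looks roughly like
the following (quoted from memory, so please copy the exact example for
your Slurm version from the page above rather than from here):
  EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
  RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
  RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
  RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
  RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"
The runc variant appears to follow the same structure with runc in place
of crun.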
We use Bright Cluster Manager with Slurm 23.02 on RHEL 9. I know about
pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does
not appear to come by default with the Bright 'cm' package of Slurm.
Currently ssh to a node gets:
Login not allowed: no running jobs and no WLM allocations
We have 8 GPUs per node, so when we drain a node, which can be running a
job of up to 5 days, no new jobs can run on it. And since we have 20+ TB
(yes, TB) local drives, researchers have their work and files on them to
retrieve.
Is there a way to use /etc/security/access.conf to work around this at
least temporarily until the reboot, after which we can revert?
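For reference, pam_access rules in /etc/security/access.conf take the
form 'permission : users/groups : origins', so I was imagining something
like the following (the group name is just an example), assuming
pam_access sits ahead of the Slurm check in the node's PAM stack - which
I have not verified on the Bright images:
  + : @researchers : ALL
  - : ALL : ALL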
Thanks!
Rob
Hello Slurm users,
Some of you may be interested in the new major version of Slurm-web, v3.0.0, an open source web dashboard for Slurm: https://slurm-web.com
Slurm-web provides a reactive & responsive web interface to track jobs, with intuitive insights and advanced visualizations to monitor the status of the HPC supercomputers in your organization. The software is released under GPLv3 [1].
This new version is based on the official Slurm REST API (slurmrestd) and adopts modern web technologies to provide many features:
- Instant jobs filtering and sorting
- Live jobs status update
- Advanced visualization of node status with racking topology
- Intuitive visualization of QOS and advanced reservations
- Multi-clusters support
- LDAP authentication
- Advanced RBAC permissions management
- Transparent caching
For the next releases, a roadmap is published with many feature ideas [2].
Quick start guide to install: http://docs.rackslab.io/slurm-web/install/quickstart.html
RPM and deb packages are published for easy installation and upgrade on the most popular Linux distributions.
I hope you will like it!
[1] https://github.com/rackslab/Slurm-web
[2] https://slurm-web.com/roadmap/
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Greetings Slurm gurus --
I've been having an issue where, very occasionally, an srun-launched OpenMPI job will die during startup within MPI_Init(), e.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun does not experience the issue, e.g. mpirun -n 64 -H cn01,... ./hello_world_mpi. The failure rate seems to be in the 0.5% - 1.0% range when using srun for launch.
SW stack is self-built with:
* Dual socket AMD nodes
* RHEL 9.3 base system + tools
* Single 100 Gb card per host
* hwloc 2.9.3
* pmix 4.2.9 (5.0.2 also tried but continued to see the same issues)
* slurm 23.11.6 (started with 23.11.5 - update did not change the behavior)
* openmpi 5.0.3
The MPI code is a simple hello_world_mpi.c, but the specific code does not seem to matter - anything that goes through startup via srun can hit it. The application core dump looks like the following regardless of which test is running:
[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***
More than one rank can die with the same stack trace on a node when this happens - I've seen as many as 6. One other interesting note: if I change my srun command line to include strace (e.g. srun -N 8 --ntasks-per-node=8 strace <strace-options> ./hello_world_mpi), the issue appears to go away - 0 failures in ~2500 runs. Another thing that seems to help is disabling cgroups in the slurm.conf; after that change, I saw 0 failures in >6100 hello_world_mpi runs.
The changes in the slurm.conf were - original:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
Changed to:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux
My cgroup.conf file contains:
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=95
Curious if anyone has any thoughts on next steps to help figure out what might be going on and how to resolve it. Currently I'm planning to back down to the 23.02.7 release and see how that goes, but I'm open to other suggestions.
Thanks,
Brent