- slurm-users - lists.schedmd.com

Re: slurmdbd not connecting to mysql (mariadb)
by Radhouane Aniba 31 May '24

31 May '24

manually running it through sudo slurmdbd -D /path/to/conf is very quick on my fresh install trying to start the slurmdbd through systemctl take 3 minutes and then crashes and fail Is there an alternative to systemctl to start the slurmdbd in the background ? But most importantly I wanted to know why it takes so long through systemctl. Maybe I can increase the timeout limit ? On Thu, May 30, 2024 at 11:54 PM Ryan Novosielski <novosirj(a)rutgers.edu> wrote: > It may take longer to start than systemd allows for. How long does it take > to start from the command line? It’s common to need to run it manually for > upgrades to complete. > > -- > #BlackLivesMatter > ____ > || \\UTGERS, |---------------------------*O*--------------------------- > ||_// the State | Ryan Novosielski - novosirj(a)rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Research Computing - MSB > A555B, Newark > `' > > On May 30, 2024, at 20:24, Radhouane Aniba via slurm-users < > slurm-users(a)lists.schedmd.com> wrote: > > Ok I made some progress here. > > I removed and purged slurmdbd mysql mariadb etc .. and started from > scratch. > I added the recommended mysqld requirements > > Started slurmdbd manually : sudo slurmdbd -D /path/to/conf and everything > worked well > > When I tried to start the service sudo systemctl start slurmdbd.service > it didnt work > > sudo systemctl status slurmdbd.service > ● slurmdbd.service - Slurm DBD accounting daemon > Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor > preset: enabled) > Active: failed (Result: timeout) since Fri 2024-05-31 00:21:30 UTC; > 2min 5s ago > Process: 6258 ExecStart=/usr/sbin/slurmdbd -D > /etc/slurm-llnl/slurmdbd.conf (code=exited, status=0/SUCCESS) > > May 31 00:20:00 hannibal-hn systemd[1]: Starting Slurm DBD accounting > daemon... > May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: start operation > timed out. Terminating. > May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: Failed with > result 'timeout'. > May 31 00:21:30 hannibal-hn systemd[1]: Failed to start Slurm DBD > accounting daemon. > > Even though it is the same command ?! > > Any idea ? > > > On Thu, May 30, 2024 at 5:02 PM Radhouane Aniba <aradwen(a)gmail.com> wrote: > >> Thank you Ahmet and Brian, >> >> Ahmet, which conf in particular slurmdbd is readiugn from, I parsed all >> the cnf files for mysql and I cannot find the data it is displaying here >> >> slurmdbd: debug2: Attempting to connect to localhost:3306 >> slurmdbd: debug2: innodb_buffer_pool_size: 134217728 >> slurmdbd: debug2: innodb_log_file_size: 50331648 >> slurmdbd: debug2: innodb_lock_wait_timeout: 50 >> slurmdbd: error: Database settings not recommended values: >> innodb_buffer_pool_size innodb_lock_wait_timeout >> >> >> sudo tree /etc/mysql/* >> /etc/mysql/conf.d >> ├── mysql.cnf >> └── mysqldump.cnf >> /etc/mysql/debian.cnf >> /etc/mysql/debian-start >> /etc/mysql/FROZEN >> /etc/mysql/mariadb.cnf >> /etc/mysql/mariadb.conf.d >> ├── 50-client.cnf >> ├── 50-mysql-clients.cnf >> ├── 50-mysqld_safe.cnf >> └── 50-server.cnf >> /etc/mysql/my.cnf >> /etc/mysql/my.cnf.fallback >> /etc/mysql/mysql.cnf >> /etc/mysql/mysql.conf.d >> ├── mysql.cnf >> └── mysqld.cnf >> >> On Thu, May 30, 2024 at 12:21 PM Brian Andrus via slurm-users < >> slurm-users(a)lists.schedmd.com> wrote: >> >>> That SIGTERM message means something is telling slurmdbd to quit. >>> >>> Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told >>> to shutdown. If you are running in the foreground, a ^C does that. If you >>> run a kill or killall on it, you will get that same message. >>> >>> Brian Andrus >>> On 5/30/2024 6:53 AM, Radhouane Aniba via slurm-users wrote: >>> >>> Yes I can connect to my database using mysql --user=slurm >>> --password=slurmdbpass slurm_acct_db and there is no firewall blocking >>> mysql after checking the firewall question >>> >>> ALso here is the output of slurmdbd -D -vvv (note I can only run this as >>> sudo ) >>> >>> sudo slurmdbd -D -vvv >>> slurmdbd: debug: Log file re-opened >>> slurmdbd: debug: Munge authentication plugin loaded >>> slurmdbd: debug2: mysql_connect() called for db slurm_acct_db >>> slurmdbd: debug2: Attempting to connect to localhost:3306 >>> slurmdbd: debug2: innodb_buffer_pool_size: 134217728 >>> slurmdbd: debug2: innodb_log_file_size: 50331648 >>> slurmdbd: debug2: innodb_lock_wait_timeout: 50 >>> slurmdbd: error: Database settings not recommended values: >>> innodb_buffer_pool_size innodb_lock_wait_timeout >>> slurmdbd: Accounting storage MYSQL plugin loaded >>> slurmdbd: debug2: ArchiveDir = /tmp >>> slurmdbd: debug2: ArchiveScript = (null) >>> slurmdbd: debug2: AuthAltTypes = (null) >>> slurmdbd: debug2: AuthInfo = (null) >>> slurmdbd: debug2: AuthType = auth/munge >>> slurmdbd: debug2: CommitDelay = 0 >>> slurmdbd: debug2: DbdAddr = localhost >>> slurmdbd: debug2: DbdBackupHost = (null) >>> slurmdbd: debug2: DbdHost = hannibal-hn >>> slurmdbd: debug2: DbdPort = 7032 >>> slurmdbd: debug2: DebugFlags = (null) >>> slurmdbd: debug2: DebugLevel = 6 >>> slurmdbd: debug2: DebugLevelSyslog = 10 >>> slurmdbd: debug2: DefaultQOS = (null) >>> slurmdbd: debug2: LogFile = /var/log/slurmdbd.log >>> slurmdbd: debug2: MessageTimeout = 100 >>> slurmdbd: debug2: Parameters = (null) >>> slurmdbd: debug2: PidFile = /run/slurmdbd.pid >>> slurmdbd: debug2: PluginDir = /usr/lib/x86_64-linux-gnu/slurm-wlm >>> slurmdbd: debug2: PrivateData = none >>> slurmdbd: debug2: PurgeEventAfter = 1 months* >>> slurmdbd: debug2: PurgeJobAfter = 12 months* >>> slurmdbd: debug2: PurgeResvAfter = 1 months* >>> slurmdbd: debug2: PurgeStepAfter = 1 months >>> slurmdbd: debug2: PurgeSuspendAfter = 1 months >>> slurmdbd: debug2: PurgeTXNAfter = 12 months >>> slurmdbd: debug2: PurgeUsageAfter = 24 months >>> slurmdbd: debug2: SlurmUser = root(0) >>> slurmdbd: debug2: StorageBackupHost = (null) >>> slurmdbd: debug2: StorageHost = localhost >>> slurmdbd: debug2: StorageLoc = slurm_acct_db >>> slurmdbd: debug2: StoragePort = 3306 >>> slurmdbd: debug2: StorageType = accounting_storage/mysql >>> slurmdbd: debug2: StorageUser = slurm >>> slurmdbd: debug2: TCPTimeout = 2 >>> slurmdbd: debug2: TrackWCKey = 0 >>> slurmdbd: debug2: TrackSlurmctldDown= 0 >>> slurmdbd: debug2: acct_storage_p_get_connection: request new connection >>> 1 >>> slurmdbd: debug2: Attempting to connect to localhost:3306 >>> slurmdbd: slurmdbd version 19.05.5 started >>> slurmdbd: debug2: running rollup at Thu May 30 13:50:08 2024 >>> slurmdbd: debug2: Everything rolled up >>> >>> >>> It goes like this for some time and then it crashes with this message >>> >>> slurmdbd: Terminate signal (SIGINT or SIGTERM) received >>> slurmdbd: debug: rpc_mgr shutting down >>> >>> >>> On Thu, May 30, 2024 at 8:18 AM mercan <ahmet.mercan(a)uhem.itu.edu.tr> >>> wrote: >>> >>>> Did you try to connect database using mysql command? >>>> >>>> mysql --user=slurm --password=slurmdbpass slurm_acct_db >>>> >>>> C. Ahmet Mercan >>>> >>>> On 30.05.2024 14:48, Radhouane Aniba via slurm-users wrote: >>>> >>>> Thank you Ahmet, >>>> I dont have a firewall active. >>>> And because slurmdbd cannot connect to the database I am not able to >>>> getting it to be activated through systemctl I will share the output for >>>> slurmdbd -D -vvv shortly but overall it is always saying trying to connect >>>> to the db and then retries a couple of times and crashes >>>> >>>> R. >>>> >>>> >>>> >>>> >>>> On Thu, May 30, 2024 at 2:51 AM mercan <ahmet.mercan(a)uhem.itu.edu.tr> >>>> wrote: >>>> >>>>> Hi; >>>>> >>>>> Did you check can you connect db with your conf parameters from >>>>> head-node: >>>>> >>>>> mysql --user=slurm --password=slurmdbpass slurm_acct_db >>>>> >>>>> Also, check and stop firewall and selinux, if they are running. >>>>> >>>>> Last, you can stop slurmdbd, then run run terminal with: >>>>> >>>>> slurmdbd -D -vvv >>>>> >>>>> Regards; >>>>> >>>>> C. Ahmet Mercan >>>>> >>>>> On 30.05.2024 00:05, Radhouane Aniba via slurm-users wrote: >>>>> >>>>> Hi everyone >>>>> I am trying to get slurmdbd to run on my local home server but I am >>>>> really struggling. >>>>> Note : am a novice slurm user >>>>> my slurmdbd always times out even though all the details in the conf >>>>> file are correct >>>>> >>>>> My log looks like this >>>>> >>>>> [2024-05-29T20:51:30.088] Accounting storage MYSQL plugin loaded >>>>> [2024-05-29T20:51:30.088] debug2: ArchiveDir = /tmp >>>>> [2024-05-29T20:51:30.088] debug2: ArchiveScript = (null) >>>>> [2024-05-29T20:51:30.088] debug2: AuthAltTypes = (null) >>>>> [2024-05-29T20:51:30.088] debug2: AuthInfo = (null) >>>>> [2024-05-29T20:51:30.088] debug2: AuthType = auth/munge >>>>> [2024-05-29T20:51:30.088] debug2: CommitDelay = 0 >>>>> [2024-05-29T20:51:30.088] debug2: DbdAddr = localhost >>>>> [2024-05-29T20:51:30.088] debug2: DbdBackupHost = (null) >>>>> [2024-05-29T20:51:30.088] debug2: DbdHost = head-node >>>>> [2024-05-29T20:51:30.088] debug2: DbdPort = 7032 >>>>> [2024-05-29T20:51:30.088] debug2: DebugFlags = (null) >>>>> [2024-05-29T20:51:30.088] debug2: DebugLevel = 6 >>>>> [2024-05-29T20:51:30.088] debug2: DebugLevelSyslog = 10 >>>>> [2024-05-29T20:51:30.088] debug2: DefaultQOS = (null) >>>>> [2024-05-29T20:51:30.088] debug2: LogFile = /var/log/slurmdbd.log >>>>> [2024-05-29T20:51:30.088] debug2: MessageTimeout = 100 >>>>> [2024-05-29T20:51:30.088] debug2: Parameters = (null) >>>>> [2024-05-29T20:51:30.088] debug2: PidFile = /run/slurmdbd.pid >>>>> [2024-05-29T20:51:30.088] debug2: PluginDir = >>>>> /usr/lib/x86_64-linux-gnu/slurm-wlm >>>>> [2024-05-29T20:51:30.088] debug2: PrivateData = none >>>>> [2024-05-29T20:51:30.088] debug2: PurgeEventAfter = 1 months* >>>>> [2024-05-29T20:51:30.088] debug2: PurgeJobAfter = 12 months* >>>>> [2024-05-29T20:51:30.088] debug2: PurgeResvAfter = 1 months* >>>>> [2024-05-29T20:51:30.088] debug2: PurgeStepAfter = 1 months >>>>> [2024-05-29T20:51:30.088] debug2: PurgeSuspendAfter = 1 months >>>>> [2024-05-29T20:51:30.088] debug2: PurgeTXNAfter = 12 months >>>>> [2024-05-29T20:51:30.088] debug2: PurgeUsageAfter = 24 months >>>>> [2024-05-29T20:51:30.088] debug2: SlurmUser = root(0) >>>>> [2024-05-29T20:51:30.089] debug2: StorageBackupHost = (null) >>>>> [2024-05-29T20:51:30.089] debug2: StorageHost = localhost >>>>> [2024-05-29T20:51:30.089] debug2: StorageLoc = slurm_acct_db >>>>> [2024-05-29T20:51:30.089] debug2: StoragePort = 3306 >>>>> [2024-05-29T20:51:30.089] debug2: StorageType = >>>>> accounting_storage/mysql >>>>> [2024-05-29T20:51:30.089] debug2: StorageUser = slurm >>>>> [2024-05-29T20:51:30.089] debug2: TCPTimeout = 2 >>>>> [2024-05-29T20:51:30.089] debug2: TrackWCKey = 0 >>>>> [2024-05-29T20:51:30.089] debug2: TrackSlurmctldDown= 0 >>>>> [2024-05-29T20:51:30.089] debug2: acct_storage_p_get_connection: >>>>> request new connection 1 >>>>> [2024-05-29T20:51:30.089] debug2: Attempting to connect to >>>>> localhost:3306 >>>>> [2024-05-29T20:51:30.090] slurmdbd version 19.05.5 started >>>>> [2024-05-29T20:51:30.090] debug2: running rollup at Wed May 29 >>>>> 20:51:30 2024 >>>>> [2024-05-29T20:51:30.091] debug2: Everything rolled up >>>>> [2024-05-29T20:51:49.673] Terminate signal (SIGINT or SIGTERM) >>>>> received >>>>> [2024-05-29T20:51:49.673] debug: rpc_mgr shutting down >>>>> >>>>> >>>>> >>>>> my config file looks like this >>>>> >>>>> ArchiveEvents=yes >>>>> ArchiveJobs=yes >>>>> ArchiveResvs=yes >>>>> ArchiveSteps=no >>>>> ArchiveSuspend=no >>>>> ArchiveTXN=no >>>>> ArchiveUsage=no >>>>> PurgeEventAfter=1month >>>>> PurgeJobAfter=12month >>>>> PurgeResvAfter=1month >>>>> PurgeStepAfter=1month >>>>> PurgeSuspendAfter=1month >>>>> PurgeTXNAfter=12month >>>>> PurgeUsageAfter=24month >>>>> # Authentication info >>>>> AuthType=auth/munge >>>>> # slurmDBD info >>>>> DbdAddr=localhost >>>>> DbdHost=head-node >>>>> DbdPort=7032 >>>>> SlurmUser=root >>>>> MessageTimeout=100 >>>>> DebugLevel=5 >>>>> #DefaultQOS=normal,standby >>>>> LogFile=/var/log/slurmdbd.log >>>>> PidFile=/run/slurmdbd.pid >>>>> #PrivateData=accounts,users,usage,jobs >>>>> #TrackWCKey=yes >>>>> # >>>>> # Database info >>>>> StorageType=accounting_storage/mysql >>>>> StorageHost=localhost >>>>> StoragePort=3306 >>>>> StoragePass=slurmdbpass >>>>> StorageUser=slurm >>>>> StorageLoc=slurm_acct_db >>>>> I used standard names and passwords to get started and I will change >>>>> later >>>>> >>>>> but everytime I try to start slurmdbd.service it crashes and I have >>>>> that log that I shared with you >>>>> >>>>> I use these versions >>>>> >>>>> slurmdbd -V >>>>> slurm-wlm 19.05.5 >>>>> mysql Ver 15.1 Distrib 10.3.39-MariaDB, for debian-linux-gnu (x86_64) >>>>> using readline 5.2 >>>>> Everything else Is working properly except I cannot get slurmdbd to >>>>> work and at this point I exhausted all my possible trials :) looking for >>>>> some expert insights :) >>>>> >>>>> >>>>> Any idea what I am doing wrong here ? Also I didn't compile any slurm >>>>> package. I used the binary from apt repos >>>>> >>>>> Any help will be appreciated >>>>> >>>>> Cheers >>>>> >>>>> Rad >>>>> >>>>> -- >>>>> >>>>> >>>>> >>>> >>> >>> -- >>> *Rad Aniba, PhD* >>> >>> >>> >>> -- >>> slurm-users mailing list -- slurm-users(a)lists.schedmd.com >>> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com >>> >> >> >> -- >> *Rad Aniba, PhD* >> >> > > -- > *Rad Aniba, PhD* > > > -- > slurm-users mailing list -- slurm-users(a)lists.schedmd.com > To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com > > > -- *Rad Aniba, PhD*

2 3

Slurm version 24.05.0 is now available
by Marshall Garey 30 May '24

30 May '24

We are pleased to announce the availability of Slurm 24.05.0. To highlight some new features in 24.05: - Isolated Job Step management. Enabled on a job-by-job basis with the --stepmgr option, or globally through SlurmctldParameters=enable_stepmgr. - Federation - Allow for client command operation while SlurmDBD is unavailable. - New MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser QOS limits. - New USER_DELETE reservation flag. - New Flags=rebootless option on Features for node_features/helpers which indicates the given feature can be enabled without rebooting the node. - Cloud power management options: New "max_powered_nodes=<limit>" option in SlurmctldParamters, and new SuspendExcNodes=<nodes>:<count> syntax allowing for <count> nodes out of a given node list to be excluded. - StdIn/StdOut/StdErr now stored in SlurmDBD accounting records for batch jobs. - New switch/nvidia_imex plugin for IMEX channel management on NVIDIA systems. - New RestrictedCoresPerGPU option at the Node level, designed to ensure GPU workloads always have access to a certain number of CPUs even when nodes are running non-GPU workloads concurrently. The Slurm documentation has also been updated to the 24.05 release. (Older versions can be found in the archive, linked from the main documentation page.) Slurm can be downloaded from https://www.schedmd.com/downloads.php . -- Marshall Garey Release Management, Support, and Development SchedMD LLC - Commercial Slurm Development and Support

1 0

Executing srun -n X where X is greater than total CPU in entire cluster
by Dan Healy 30 May '24

30 May '24

Hi there, SLURM community, I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 cpu each (384 CPU total). When I run `srun -n 500 hostname`, the task gets queued since there's not 500 available CPU. Wasn't there an option that allows for this to be run where the first 384 tasks execute, and then the remaining execute when resources free up? Here's my conf: # Slurm Cgroup Configs used on controllers and workersslurm_cgroup_config: CgroupAutomount: yes ConstrainCores: yes ConstrainRAMSpace: yes ConstrainSwapSpace: yes ConstrainDevices: yes# Slurm conf file settingsslurm_config: AccountingStorageType: "accounting_storage/slurmdbd" AccountingStorageEnforce: "limits" AuthAltTypes: "auth/jwt" ClusterName: "cluster" AccountingStorageHost : "{{ hostvars[groups['controller'][0]].ansible_hostname }}" DefMemPerCPU: 1024 InactiveLimit: 120 JobAcctGatherType: "jobacct_gather/cgroup" JobCompType: "jobcomp/none" MailProg: "/usr/bin/mail" MaxArraySize: 40000 MaxJobCount: 100000 MinJobAge: 3600 ProctrackType: "proctrack/cgroup" ReturnToService: 2 SelectType: "select/cons_tres" SelectTypeParameters: "CR_Core_Memory" SlurmctldTimeout: 30 SlurmctldLogFile: "/var/log/slurm/slurmctld.log" SlurmdLogFile: "/var/log/slurm/slurmd.log" SlurmdSpoolDir: "/var/spool/slurm/d" SlurmUser: "{{ slurm_user.name }}" SrunPortRange: "60000-61000" StateSaveLocation: "/var/spool/slurm/ctld" TaskPlugin: "task/affinity,task/cgroup" UnkillableStepTimeout: 120 -- Thanks, Daniel Healy

2 2

Jobs showing running but not running
by Sushil Mishra 29 May '24

29 May '24

Hi All, I'm managing a cluster with Slurm, consisting of 4 nodes. One of the compute nodes appears to be experiencing issues. While the front node's 'squeue' command indicates that jobs are running, upon connecting to the problematic node, I observe no active processes and GPUs are not being utilized. [sushil@ccbrc ~]$ sinfo -Nel Wed May 29 12:00:08 2024 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON gag 1 defq* mixed 48 2:24:1 370000 0 1 (null) none gag 1 glycore mixed 48 2:24:1 370000 0 1 (null) none glyco1 1 defq* completing* 128 2:64:1 500000 0 1 (null) none glyco1 1 glycore completing* 128 2:64:1 500000 0 1 (null) none glyco2 1 defq* mixed 128 2:64:1 500000 0 1 (null) none glyco2 1 glycore mixed 128 2:64:1 500000 0 1 (null) none mannose 1 defq* mixed 24 2:12:1 180000 0 1 (null) none mannose 1 glycore mixed 24 2:12:1 180000 0 1 (null) none On glyco1 (affected node!): squeue # gets stuck sudo systemctl restart slurmd # gets stuck I tried the following to clear the jobs stuck in CG state, but any new job appears to be stuck in a 'running' state without actually running. scontrol update nodename=glyco1 state=down reason=cg scontrol update nodename=glyco1 state=resume reason=cg There is no I/O issue in that node, and all file systems are under 30% in use. Any advice on how to resolve this without rebooting the machine? Best, Sushil

3 2

dynamical configuration || meta configuration mgmt
by Heckes, Frank 29 May '24

29 May '24

Hello all, I’m sorry if this has been asked and answered before, but I couldn’t find anything related. Does anyone know whether a framework of sorts exists that allow to change certain SLURM configuration parameters provided some conditions in the batch system’s state are detected and of course are revert if the state became the old one again? (To be more concrete: We like to raise or unset maxjobPU to run as much small jobs as possible to allocate all nodes as soon as certain threshold of free nodes are available and of course some other scenarios) Many thanks in advance. Cheers, -Frank Max-Planck-Institut für Sonnensystemforschung Justus-von-Liebig-Weg 3 D-37077 Göttingen Phone: [+49] 551 – 384 979 320 E-Mail: <mailto:heckes@mps.mpg.de> heckes(a)mps.mpg.de

2 1

Configuring sacct to report state=OUT_OF_MEMORY
by Lee 29 May '24

29 May '24

Hello, *Background :* I am working on a small cluster that is managed by Base Command Manager v10.0 using Slurm 23.02.7 with Ubuntu 22.04.2. I have a small testing script that simply consumes memory and processors. I run my test script, it consumes more memory than allocated by Slurm and as expected it gets killed by OOM killer. In /var/log/slurmd, I see entries like : [2024-05-29T08:53:04.975] Launching batch job 65 for UID 1001 [2024-05-29T08:53:05.016] [65.batch] task/cgroup: _memcg_initialize: job: alloc=10868MB mem.limit=10868MB memsw.limit=10868MB job_swappiness=1 [2024-05-29T08:53:05.016] [65.batch] task/cgroup: _memcg_initialize: step: alloc=10868MB mem.limit=10868MB memsw.limit=10868MB job_swappiness=1 [2024-05-29T08:53:19.530] [65.batch] task/cgroup: task_cgroup_memory_check_oom: StepId=65.batch hit memory+swap limit at least once during execution. This may or may not result in some failure. [2024-05-29T08:53:19.563] [65.batch] done with job Inspecting with sacct, I see : $ sacct -j 65 --format="jobid,jobname,state,exitcode" JobID JobName State ExitCode ------------ ---------- ---------- -------- 65 wrap FAILED 9:0 65.batch batch FAILED 9:0 Based on my previous experience with a RHEL based Slurm cluster, I would expect the state to be listed as OUT_OF_MEMORY and the exitcode to be 0:125. *Question : * 1. How do I configure [slurm,cgroup].conf such that when Slurm kills a job due to exceeding the allocated memory, sacct reports the state as OUT_OF_MEMORY? See below for my configuration : *Configuration:* /etc/default/grub : ~# grep -v "^#" /etc/default/grub GRUB_DEFAULT=0 GRUB_TIMEOUT_STYLE=hidden GRUB_TIMEOUT=10 GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian` GRUB_CMDLINE_LINUX_DEFAULT="" GRUB_CMDLINE_LINUX="biosdevname=0 cgroup_enable=memory swapaccount=1" GRUB_GFXMODE="1024x768,800x600,auto" GRUB_BACKGROUND="/boot/grub/bcm.png" slurm.conf : ~# grep -v "^#" /cm/shared/apps/slurm/var/etc/bcm10-slurm/slurm.conf SlurmUser=slurm SlurmctldPort=6817 SlurmdPort=6818 AuthType=auth/munge SlurmdSpoolDir=/cm/local/apps/slurm/var/spool SwitchType=switch/none MpiDefault=pmix SlurmctldPidFile=/var/run/slurmctld.pid SlurmdPidFile=/var/run/slurmd.pid ProctrackType=proctrack/cgroup ReturnToService=2 TaskPlugin=task/cgroup InactiveLimit=0 KillWait=30 MinJobAge=300 SlurmctldTimeout=300 SlurmdTimeout=300 Waittime=0 SlurmctldDebug=info SlurmctldLogFile=/var/log/slurmctld SlurmdDebug=info SlurmdLogFile=/var/log/slurmd JobAcctGatherType=jobacct_gather/cgroup JobAcctGatherFrequency=30 AccountingStorageType=accounting_storage/slurmdbd AccountingStorageUser=slurm SlurmctldHost=bcm10-h01 AccountingStorageHost=master NodeName=bcm10-n[01,02] Procs=4 CoresPerSocket=4 RealMemory=15988 SocketsPerBoard=1 ThreadsPerCore=1 Boards=1 MemSpecLimit=5120 Feature=location=local PartitionName="defq" Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=bcm10-n[01,02] ClusterName=bcm10-slurm SchedulerType=sched/backfill StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/bcm10-slurm PrologFlags=Alloc GresTypes=gpu Prolog=/cm/local/apps/cmd/scripts/prolog Epilog=/cm/local/apps/cmd/scripts/epilog SelectType=select/cons_tres SelectTypeParameters=CR_Core_Memory cgroup.conf ~# grep -v "^#" /cm/shared/apps/slurm/var/etc/bcm10-slurm/cgroup.conf CgroupMountpoint="/sys/fs/cgroup" CgroupAutomount=no ConstrainCores=yes ConstrainRAMSpace=yes ConstrainSwapSpace=yes ConstrainDevices=yes AllowedRamSpace=100.00 AllowedSwapSpace=0.00 MemorySwappiness=1 MaxRAMPercent=100.00 MaxSwapPercent=100.00 MinRAMSpace=30 Best regards, Lee

1 0

sbatch problem
by Mihai Ciubancan 29 May '24

29 May '24

Hello, My name is Mihai and a have an issue with a small GPU cluster manage with slurm 22.05.11. I got 2 different output when I'm trying to find out the name of the nodes(one correct and one wrong). The script is: #!/bin/bash #SBATCH --job-name=test #SBATCH --output=/data/mihai/res.txt #SBATCH --partition=eli #SBATCH --nodes=2 srun echo Running on host: $(hostname) srun hostname srun sleep 15 And the output look like this: cat res.txt Running on host: mihai-x8640 Running on host: mihai-x8640 mihaigpu2 mihai-x8640 As you can see the output of the command 'srun echo Running on host: $(hostname)' is the same, as the jobs was running twice on the same node, while command 'srun hostname' it's giving me the correct output. Do you have any idea why the outputs of the 2 commands are different? Thank you, Mihai

2 6

slurm_bufs_sendto failed
by shaobo liu 29 May '24

29 May '24

slurm 23.11.0 , the slurmd error is as follows，what is the reason for this? [2024-05-29T09:21:32.450] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 32259 [2024-05-29T09:21:32.451] task/affinity: batch_bind: job 32259 CPU input mask for node: 0xFFFFFFFFFFFFFFFF [2024-05-29T09:21:32.451] task/affinity: batch_bind: job 32259 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFF [2024-05-29T09:21:32.451] Launching batch job 32259 for UID 66003 [2024-05-29T09:21:32.868] launch task StepId=32259.0 request from UID:66003 GID:66003 HOST:192.168.87.45 PORT:31104 [2024-05-29T09:21:32.868] task/affinity: lllp_distribution: JobId=32259 implicit auto binding: cores,one_thread, dist 2 [2024-05-29T09:21:32.868] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic [2024-05-29T09:21:32.868] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [32259]: mask_cpu,one_thread, 0x0000000000000001,0x0000000100000000,0x0000000000000002,0x0000000200000000,0x0000000000000004,0x0000000400000000,0x0000000000000008,0x0000000800000000,0x0000000000000010,0x0000001000000000,0x0000000000000020,0x0000002000000000,0x0000000000000040,0x0000004000000000,0x0000000000000080,0x0000008000000000,0x0000000000000100,0x0000010000000000,0x0000000000000200,0x0000020000000000,0x0000000000000400,0x0000040000000000,0x0000000000000800,0x0000080000000000,0x0000000000001000,0x0000100000000000,0x0000000000002000,0x0000200000000000,0x0000000000004000,0x0000400000000000,0x0000000000008000,0x0000800000000000,0x0000000000010000,0x0001000000000000,0x0000000000020000,0x0002000000000000,0x0000000000040000,0x0004000000000000,0x0000000000080000,0x0008000000000000,0x0000000000100000,0x0010000000000000,0x0000000000200000,0x0020000000000000,0x0000000000400000,0x0040000000000000,0x0000000000800000,0x0080000000000000,0x0000000001000000,0x0100000000000000,0x0000000002000000,0x0200000000000000,0x0000000004000000,0x0400000000000000,0x0000000008000000,0x0800000000000000,0x0000000010000000,0x1000000000000000,0x0000000020000000,0x2000000000000000,0x0000000040000000,0x4000000000000000,0x0000000080000000,0x8000000000000000 [2024-05-29T09:32:04.541] [32259.batch] done with step [2024-05-29T09:33:31.000] error: slurm_send_node_msg: [socket:[7270627]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:34:02.000] error: slurm_send_node_msg: [socket:[7270629]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:34:03.016] [32259.extern] done with step [2024-05-29T09:34:41.000] error: slurm_send_node_msg: [socket:[7270635]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:35:13.000] error: slurm_send_node_msg: [socket:[7270637]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:35:52.000] error: slurm_send_node_msg: [socket:[7270639]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:36:01.000] error: slurm_send_node_msg: [socket:[7270630]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:36:24.000] error: slurm_send_node_msg: [socket:[7270641]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:37:03.000] error: slurm_send_node_msg: [socket:[7270644]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:37:31.000] [32259.0] error: *** STEP 32259.0 STEPD TERMINATED ON node045 AT 2024-05-29T09:37:30 DUE TO JOB NOT RUNNING *** [2024-05-29T09:37:33.000] [32259.0] stepd_cleanup: done with step (rc[0x0]:No error, cleanup_rc[0xfbb]:Job step not running) [2024-05-29T09:37:35.000] error: slurm_send_node_msg: [socket:[7270646]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:38:14.000] error: slurm_send_node_msg: [socket:[7270649]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:38:46.000] error: slurm_send_node_msg: [socket:[7270651]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error [2024-05-29T09:39:25.000] error: slurm_send_node_msg: [socket:[7270657]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error

1 0

Final Call for SLUG Presenters & Early Bird Registration!
by Victoria Hobson 28 May '24

28 May '24

Slurm User Group (SLUG) 2024 is set for September 12-13 at the University of Oslo in Oslo, Norway. Registration information and a high-level schedule can be found here:https://slug24.splashthat.com/ The last day to register at the early bird pricing is this Friday, May 31st. Friday is also the deadline to submit a presentation abstract. We do not intend to extend this deadline. If you are interested in presenting your own usage, developments, site report, tutorial, etc about Slurm, please fill out the following form:https://forms.gle/N7bFo5EzwuTuKkBN7 Notifications of final presentations accepted will go out by Friday, June 14th. -- Victoria Hobson SchedMD LLC Vice President of Marketing

1 0

slurmdbd archive format
by O'Neal, Doug (NIH/NCI) [C] 28 May '24

28 May '24

My organization needs to access historic job information records for metric reporting and resource forecasting. slurmdbd is archiving only the job information, which should be sufficient for our numbers, but is using the default archive script. In retrospect, this data should have been migrated to a secondary MariaDB instance, but that train has passed. The format of the archive files is not well documented. Does anyone have a program (python/C/whatever) that will read a job_table_archive file and decode it into a parsable structure? Douglas O'Neal, Ph.D. (contractor) Manager, HPC Systems Administration Group, ITOG Frederick National Laboratory for Cancer Research Leidos Biomedical Research, Inc. Phone: 301-228-4656 Email: Douglas.O'Neal(a)nih.gov<mailto:Doug%20O'Neal%20%3cDouglas.O'Neal(a)nih.gov%3e>

2 2

2025

2024

slurm-users ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users