Hello, I’m trying to figure out why we’ve been seeing an increase in network traffic in our AWS-based cluster, which uses Amazon’s parallel cluster tool. After an incident a couple weeks ago, I turned on debug2 logging on the slurmd processes, and I’m seeing huge numbers of `REQUEST_GETPW` and `REQUEST_GETGR` requests going to the slurmd processes. I briefly turned on debug2 logging for `slurmctld` as well, and I’m seeing lots of RPC calls, but not as many as the `REQUEST_GETPW` requests that I’ve seen on compute node slurmd processes.

Here’s a sample from the slurmctld log:

```
[2025-04-28T15:11:05.436] debug2: _slurm_rpc_dump_partitions, size=1253 usec=20
[2025-04-28T15:11:05.450] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2971
[2025-04-28T15:11:05.451] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2971
[2025-04-28T15:11:05.451] debug2: _slurm_rpc_dump_partitions, size=1253 usec=16
[2025-04-28T15:11:05.461] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2788
[2025-04-28T15:11:05.461] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2788
[2025-04-28T15:11:05.461] debug2: _slurm_rpc_dump_partitions, size=1253 usec=12
[2025-04-28T15:11:05.517] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2916
[2025-04-28T15:11:05.518] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2916
[2025-04-28T15:11:05.518] debug2: _slurm_rpc_dump_partitions, size=1253 usec=14
[2025-04-28T15:11:05.628] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3405
[2025-04-28T15:11:05.629] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3405
[2025-04-28T15:11:05.629] debug2: _slurm_rpc_dump_partitions, size=1253 usec=14
[2025-04-28T15:11:05.740] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2189
[2025-04-28T15:11:05.740] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2189
[2025-04-28T15:11:05.740] debug2: _slurm_rpc_dump_partitions, size=1253 usec=15
[2025-04-28T15:11:05.845] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2209
[2025-04-28T15:11:05.846] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2209
[2025-04-28T15:11:05.846] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=4106
[2025-04-28T15:11:05.846] debug2: _slurm_rpc_dump_partitions, size=1253 usec=14
[2025-04-28T15:11:05.847] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=4106
[2025-04-28T15:11:05.847] debug2: _slurm_rpc_dump_partitions, size=1253 usec=11
[2025-04-28T15:11:05.938] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3400
[2025-04-28T15:11:05.938] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3400
[2025-04-28T15:11:05.938] debug2: _slurm_rpc_dump_partitions, size=1253 usec=18
[2025-04-28T15:11:06.903] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3449
[2025-04-28T15:11:06.904] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3449
[2025-04-28T15:11:06.904] debug2: _slurm_rpc_dump_partitions, size=1253 usec=15
[2025-04-28T15:11:07.175] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3722
[2025-04-28T15:11:07.176] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3722
[2025-04-28T15:11:07.177] debug2: _slurm_rpc_dump_partitions, size=1253 usec=254
[2025-04-28T15:11:07.205] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=4040
[2025-04-28T15:11:07.206] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=4040
[2025-04-28T15:11:07.206] debug2: _slurm_rpc_dump_partitions, size=1253 usec=17
[2025-04-28T15:11:07.237] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2990
[2025-04-28T15:11:07.238] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2990
[2025-04-28T15:11:07.239] debug2: _slurm_rpc_dump_partitions, size=1253 usec=15
[2025-04-28T15:11:07.284] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2920
[2025-04-28T15:11:07.285] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2920
[2025-04-28T15:11:07.285] debug2: _slurm_rpc_dump_partitions, size=1253 usec=15
[2025-04-28T15:11:07.370] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3236
[2025-04-28T15:11:07.371] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3236
[2025-04-28T15:11:07.371] debug2: _slurm_rpc_dump_partitions, size=1253 usec=17
[2025-04-28T15:11:08.463] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2848
[2025-04-28T15:11:08.464] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2848
[2025-04-28T15:11:08.464] debug2: _slurm_rpc_dump_partitions, size=1253 usec=14
[2025-04-28T15:11:08.691] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=2627
[2025-04-28T15:11:08.692] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=2627
[2025-04-28T15:11:08.692] debug2: _slurm_rpc_dump_partitions, size=1253 usec=18
[2025-04-28T15:11:08.873] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3729
[2025-04-28T15:11:08.874] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3729
[2025-04-28T15:11:08.875] debug2: _slurm_rpc_dump_partitions, size=1253 usec=196
[2025-04-28T15:11:08.881] debug2: Processing RPC: REQUEST_JOB_INFO_SINGLE from UID=3461
[2025-04-28T15:11:08.882] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=3461
[2025-04-28T15:11:08.882] debug2: _slurm_rpc_dump_partitions, size=1253 usec=10
```

And from slurmd:

```
[2025-04-27T19:45:01.353] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.475] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.491] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.496] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.496] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:02.497] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:02.497] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:02.501] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.504] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.507] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.513] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.518] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.518] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:02.606] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:02.607] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:04.988] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:04.992] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:04.995] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:04.999] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.011] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.016] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.033] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.045] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.048] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.057] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.073] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.077] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.110] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.143] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.143] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.144] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:05.152] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.152] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:05.159] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.159] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.159] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:05.167] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.167] [59253.batch] debug: Handling REQUEST_GETGR
[2025-04-27T19:45:05.170] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.172] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.174] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.203] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.204] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.207] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.316] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.318] [59253.batch] debug: Handling REQUEST_GETPW
[2025-04-27T19:45:05.321] [59253.batch] debug: Handling REQUEST_GETPW
```

This level of debugging makes the logs pretty huge, but if seeing a whole log file is helpful, I can make something available. Any ideas on next steps for figuring out what’s going on? It seems like something is asking for authentication a whole lot, but it’s not clear to me what or why. We do use munge for Slurm authentication, and SSSD to work with LDAP for user authentication.

-Jeremy Guillette

—
Jeremy Guillette
Software Engineer, FAS Academic Technology | Academic Technology Harvard University Information Technology P: (617) 998-1826 | W: huit.harvard.edu (he/him/his)
Hi, why do you think these are authentication requests? As far as I understand, multiple UIDs are asking for job and partition info. It's unlikely that all of them would make that kind of request in the same way and at the same time, so I think you should look for some external program that may be doing that - e.g. a monitoring tool? Or a reporting tool? I'm not sure whether API calls are also registered as RPCs in the controller logs.
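One way to narrow that down is `sdiag` on the controller host, which reports cumulative RPC counts broken down by message type and by user. A rough sketch (the exact output format varies by Slurm version):

```
# On the slurmctld host: dump RPC statistics, including per-user counts
sdiag
# Optionally reset the counters, wait a bit, and look again to see
# which UIDs and message types accumulate fastest
sdiag --reset
sleep 300 && sdiag
```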
A dirty (but maybe effective) way of discovering what is making all of those calls is to set the RPC rate limit to a low value and see what stops working ;) https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable
Regards Patryk.
On 5/7/25 09:57, Patryk Bełzak via slurm-users wrote:
Hi, why do you think these are authentication requests? As far as I understand, multiple UIDs are asking for job and partition info. It's unlikely that all of them would make that kind of request in the same way and at the same time, so I think you should look for some external program that may be doing that - e.g. a monitoring tool? Or a reporting tool? I'm not sure whether API calls are also registered as RPCs in the controller logs.
A dirty (but maybe effective) way of discovering what is making all of those calls is to set the RPC rate limit to a low value and see what stops working ;) https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable
IMHO the RPC rate limiting should be considered a best practice, and I wouldn't think that it's a "dirty" configuration. You need Slurm 23.02 or later for this. Some details are discussed in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#rpc-rate-limi...
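As a sketch of what that configuration can look like (the rl_* option names come from the SlurmctldParameters section of the slurm.conf man page; the numbers below are purely illustrative, not recommendations):

```
# slurm.conf -- per-user RPC rate limiting (requires Slurm 23.02 or later)
# rl_bucket_size   = maximum burst of RPCs a user may send at once
# rl_refill_rate   = tokens added back each refill period
# rl_refill_period = seconds between refills
# rl_log_freq      = how often rate-limited RPCs are logged
SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1,rl_log_freq=10
```

Once enabled, slurmctld should log the UIDs that are being told to back off, which can also help pinpoint where the bursts of REQUEST_JOB_INFO_SINGLE/REQUEST_PARTITION_INFO calls come from.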
IHTH, Ole
Hi,
Speaking of RPC rate limiting, we recently encountered an issue with Snakemake making excessive requests to sacct. It seems that the current rate limiting only applies to controller RPCs. Is there a way to also limit the rate of sacct calls?
Thanks, Guillaume
On 5/7/25 10:28, Guillaume COCHARD wrote:
Hi,
Speaking of RPC rate limiting, we recently encountered an issue with Snakemake making excessive requests to sacct. It seems that the current rate limiting only applies to controller RPCs. Is there a way to also limit the rate of sacct calls?
sacct connects to the slurmdbd daemon, so I suppose this is not the same kind of RPC(?). I don't think slurmdbd has any rate-limiting feature. However, clients asking for too much may be limited by this parameter in slurmdbd.conf:
MaxQueryTimeRange
    Return an error if a query is against too large of a time span, to prevent ill-formed queries from causing performance problems within SlurmDBD. Default value is INFINITE which allows any queries to proceed. Accepted time formats are the same as the MaxTime option in slurm.conf. Operator and higher privileged users are exempt from this restriction. Note that queries which attempt to return over 3GB of data will still fail to complete with ESLURM_RESULT_TOO_LARGE.
We use a value of 60 days as documented in https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_database/#setting-maxqueryti...
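For example, the 60-day limit would look like this in slurmdbd.conf (illustrative; the value uses the same days-hours:minutes:seconds format as MaxTime in slurm.conf):

```
# slurmdbd.conf -- reject accounting queries spanning more than 60 days
MaxQueryTimeRange=60-00:00:00
```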
Best regards, Ole
IMHO the RPC rate limiting should be considered a best practice, and I wouldn't think that it's a "dirty" configuration. You need Slurm 23.02 or later for this. Some details are discussed in this Wiki page:
Dirty in the sense that you set the limits so low that they break some other service, just to determine which service is making those calls. You know, breaking things isn't best practice ;) I totally agree that RPC rate limiting is a good practice overall, and perhaps it should be enabled by default in Slurm.
Regards Patryk.
This level of debugging makes the logs pretty huge, but if seeing a whole log file is helpful, I can make something available. Any ideas on next steps for figuring out what’s going on? It seems like something is asking for authentication a whole lot, but it’s not clear to me what or why. We do use munge for Slurm authentication, and SSSD to work with LDAP for user authentication. -Jeremy Guillette —
Jeremy Guillette
Software Engineer, FAS Academic Technology | Academic Technology Harvard University Information Technology P: (617) 998-1826 | W: huit.harvard.edu (he/him/his)
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
Getting back to the original question - I just noticed that there is a special option, AuditRPCs, in DebugFlags for the controller, so perhaps you can determine the source of the RPC calls without breaking things. https://slurm.schedmd.com/slurm.conf.html#OPT_AuditRPCs
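For reference, turning that flag on might look roughly like this (a minimal sketch; check the slurm.conf page above for the exact Slurm version that introduced AuditRPCs):
```
# slurm.conf on the controller: add AuditRPCs to any existing DebugFlags list
DebugFlags=AuditRPCs

# or toggle it at runtime without restarting slurmctld
scontrol setdebugflags +auditrpcs
# ...and turn it off again when finished
scontrol setdebugflags -auditrpcs
```
With the flag set, slurmctld should record each incoming RPC along with its type and originating UID, without having to leave the whole daemon at debug2.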
Regards, Patryk.
On 25/05/07 10:47AM, Patryk Bełzak via slurm-users wrote:
IMHO the RPC rate limiting should be considered a best practice, and I wouldn't think that it's a "dirty" configuration. You need Slurm 23.02 or later for this. Some details are discussed in this Wiki page:
Dirty in the sense that the limits are set so low that they break some other service, in order to determine which service is making those calls. You know, breaking things isn't best practice ;) I totally agree that RPC rate limiting overall is a good practice, and perhaps it should be enabled by default in Slurm.
Regards, Patryk.
On 25/05/07 10:13AM, Ole Holm Nielsen via slurm-users wrote:
On 5/7/25 09:57, Patryk Bełzak via slurm-users wrote:
Hi, why do you think these are authentication requests? As far as I understand, multiple UIDs are asking for job and partition info. It's unlikely that all of them would make that kind of request in the same way and at the same time, so I think you should look for some external program that is doing it - i.e. some monitoring tool? Or a reporting tool? (A quick way to narrow down which accounts are responsible is sketched below.) I'm not sure if API calls are also registered as RPCs in the controller logs.
A dirty (but maybe effective) way of discovering what is making all of these calls is to set the RPC rate limit to some low value and see what stops working ;) https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable
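One way to act on that suggestion - a minimal sketch, assuming the slurmctld log lives at /var/log/slurmctld.log and uses the format shown in the sample above - is to count the RPCs per UID and then map the busiest UIDs back to accounts:
```
# Count REQUEST_JOB_INFO_SINGLE / REQUEST_PARTITION_INFO calls per UID
grep -oE 'REQUEST_(JOB_INFO_SINGLE|PARTITION_INFO) from UID=[0-9]+' /var/log/slurmctld.log \
  | awk -F'UID=' '{print $2}' | sort | uniq -c | sort -rn | head

# Resolve the top offenders to usernames (UIDs here are just examples from the log sample)
getent passwd 2971 2788 2916
```
If the busiest UIDs turn out to be ordinary user accounts, the next place to look is whatever those users run from cron, dashboards, or job-monitoring wrappers.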
IMHO the RPC rate limiting should be considered a best practice, and I wouldn't think that it's a "dirty" configuration. You need Slurm 23.02 or later for this. Some details are discussed in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#rpc-rate-limi...
IHTH, Ole
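For completeness, here is a minimal sketch of what enabling the rate limiter might look like in slurm.conf; the rl_* option names come from the SlurmctldParameters section of the slurm.conf man page (23.02 and later), and the numbers are purely illustrative rather than recommendations:
```
# slurm.conf on the controller: per-user token-bucket limiting of client RPCs
SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1
```
As I understand it, clients that exceed the limit get a retry-later response rather than a hard failure, and slurmctld logs the offending UID, which by itself can help identify what is flooding the controller.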
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com