Releases: SchedMD/slurm

v25.11.5

14 Apr 21:19

Changes in 25.11.5

  • slurmctld - Prevent crash when deleting the only node in the cluster which also belongs to an inactive reservation.
  • Fix assoc corruption on account add race condition.
  • slurmctld - Re-enforce accounting policy limits when updating a job's QOS/assoc/partition.
  • Prevent double call to requeue logic when PrologSlurmctld fails leading to extra records in database.
  • Fix backfill to honor partition OverSubscribe=EXCLUSIVE.
  • stepmgr - Avoid leaking MPI ports when jobs that use the stepmgr are allocated nonconsecutive ports.
  • Fix always showing 0 for slurm_cpus_alloc, slurm_nodes_alloc and slurm_memory_alloc in the metrics/jobs endpoint.
  • Fix BPF token support compilation on systems with glibc >= 2.36 by using <sys/mount.h> where available instead of <linux/mount.h>.
  • Fix a regression in 25.11.0 that could cause a bounded hang after hitting conmgr_max_connections.
  • Fix Insufficient Size error in NVML library call for long GPU names.
  • slurmctld - Correct race condition during reconfigure and creating new cluster in slurmdbd that could cause both daemons to deadlock.
  • slurmctld - Reject all job submissions as reserved user or group nobody(99).
  • sbatch,srun,salloc - Reject arg --uid=99.
  • sbatch,srun,salloc - Reject arg --gid=99.
  • Jobs that complete quickly will not be marked as runaway.
  • Correctly identify whether a job is in the DB.
  • slurmctld - Avoid possible race condition during shutdown that could cause a crash in the HTTP handling logic.
  • slurmctld - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmstepd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • srun - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • slurmdbd - Avoid race condition during shutdown that could cause a crash due to tree forwarding.
  • Fix race condition with cgroups not migrating slurmd process quickly, which caused EBUSY errors on startup.
  • Fix slurmd reconfigure failure with cgroup/v2.
  • Fix a regression added in 25.05.0 concerning how the slurmctld inherits /run/slurmctld/sack.socket when using AuthType=auth/slurm to prevent clients that connected during a reconfigure from hanging indefinitely.
  • slurmctld - Wait for forwarding threads to complete before shutdown to avoid crashing due to NULL dereferences or using unloaded plugins.
  • Avoid failure for spank options that do not require arguments.
  • Allow archive load of qos_usage tables.
  • namespace/linux - fix memory leak in slurmstepd when namespace_p_recv_stepd() fails.
  • namespace/linux - Fix potential crash on failure if mmap() or sem_init() fails during namespace construction.
  • namespace/linux - fix unlikely error that could cause SIGKILL to be sent to a job during shutdown.
  • namespace/linux - fix failure to detect namespace setup problems when launching a job.
  • Fix slurmctld crash when querying the metrics endpoint after a partition is deleted with finished jobs still present.
  • reservations - Fix creation with NodeCnt and Flags=IGNORE_JOBS failing when partition nodes are occupied.
  • cons_tres - Prevent slurmctld SIGFPE during node selection.

v25.11.4

12 Mar 20:59

Changes in 25.11.4

  • slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
  • Prevent "period_start should already be set" errors when purging slurmdbd data and fix file names for archives of purged slurmdbd data.
  • Skip x11 shutdown when x11 functionality was not requested.
  • Fix build errors with recent versions of libcurl (8.16+).
  • Fix scrun segfault with step_mgr when the environment is set.
  • Fix two memory leaks located in the job info struct.
  • Fix sacct not accepting -R flag.
  • switch/nvidia_imex - Fix parsing of --network=unique-channel-per-segment option.
  • topology/block - Fix parsing of --network=unique-channel-per-segment option.
  • Fix compile errors when building against glibc-2.43.
  • Prevent potential race that could cause process/script completion to go undetected. In the case of prolog/epilog, this would leave jobs stuck in CG state on nodes running many concurrent jobs. In the case of --get-user-env, it may time out resulting in jobs being requeued and held.
  • switch/nvidia_imex - fix use-after-free when switch plugin debug logging is enabled.
  • Fix bad umask() if switch/nvidia_imex fails to initialize.
  • switch/nvidia_imex - fix memory leak if imex_dev_major is set.
  • switch/nvidia_imex - fix potential memory leaks when unpacking the jobinfo structure.
  • switch/nvidia_imex - prevent job from starting when imex channel allocation fails.
  • When bf_continue is set, prevent backfill from potentially ending its cycle early due to the reason "System state changed" because of a node state change.
  • Fix underflow in GRES selection when RestrictedCoresPerGPU is configured and the job is exclusive.
  • Fix race on reconfigure that caused slurmctld to crash.
  • Docs - Update the version constraints for libjwt to reflect the fact that only 1.x may be used with Slurm.
  • Fix case when using sacctmgr where user assoc failed to be removed when removing an account with parent specified.
  • cgroup/v2 - Fix issue which caused memory.peak to be inconsistently used.
  • Prevent flex reservations from taking nodes from other reservations if those reservations do not request full nodes.
  • Fix slurmctld crash situation with srun --overcommit.
  • Add log message to notify the user of queries that are too large.

v25.05.7

12 Mar 20:58

Changes in 25.05.7

  • Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
  • slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
  • Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
  • Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
  • slurmrestd - Remove ExecReload from unit file since the daemon does not handle SIGHUP (reload would terminate the process).
  • Fix compile errors when building against glibc-2.43.
  • Fix race on reconfigure that caused slurmctld to crash.

v25.11.3

19 Feb 22:13

Changes in 25.11.3

  • Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
  • Fixed issue where RestrictedCoresPerGPU with shared gres is limited to using restricted cores on one job per sharing gres.
  • slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
  • Fix "sacctmgr show conf" to properly display CommitDelay in seconds instead of as a boolean.
  • Fix cron/requeued jobs being incorrectly reported as runaway.
  • slurmctld - Prevent the double-removal of accounting usage for jobs being requeued that are in the COMPLETED or COMPLETING state.
  • When deleting a QOS from the DB, also remove it from partition QOS, AllowQOS and DenyQOS fields.
  • Fixed bug that could cause the detected CPU count to be lower than actual available CPU count. This bug could have resulted in the default value for conmgr_threads being lower than the number of available CPUs in sackd, scrun, slurmctld, slurmscriptd, slurmd, slurmstepd, slurmdbd, and slurmrestd when the assigned CPUs are not sequential.
  • slurmdbd - Prevent the following slurmdbd.conf options from overriding the default values of any in the list not specified: AllowNoDefAcct, AllResourcesAbsolute, DisableCoordDBD, DisableArchiveCommands.
  • salloc/sbatch - Nesting a non-stepmgr salloc or sbatch inside an existing job allocation that enabled the stepmgr will no longer result in the inner job's steps failing to launch.
  • Prevent slurmd -G from initializing sack processing thread.
  • Added SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP environment variables when a step is launched.
  • slurmctld - Prevent marking external nodes as being unresponsive when reconfiguring if SlurmctldParameters=enable_configless is used.
  • Fix potential segfault when attempting to look up the controller address via DNS in configless mode.
  • Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
  • slurmrestd - Avoid memory leak on authentication failures with invalid bearer tokens.
  • Fix potential deadlock in _x11_signal_handler() during stepd_cleanup().
  • slurmctld - Fix reservations AllowedPartitions logic leading to incorrect purge of valid reservations in some use-cases.
  • slurmctld - Avoid persistent connection hangs when enable_async_reply is configured.
  • Prevent potential controller segfault when reconfiguring after gres file updates.
  • Reparent slurmd to a subcgroup to avoid conflicting with systemd.
  • Fix sprio regression not handling comma separated list of jobids.
  • slurmctld,slurmd - Fix memory leak when container ID is populated.
  • slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments.
  • namespace/linux - add disable_bpf_token option.
  • slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero.
  • slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete.
  • slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option.
  • interfaces/topology - When modifying a node's topology with the Topology option in slurm.conf or with slurmd --conf Topology, change the topology to fully match the new topology.
  • slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled.
  • slurmctld - Prevent backfill from combining future timeslots if they have different license reservations.
  • Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
  • slurmdbd - Avoid race condition that could cause a hang during shutdown when incoming connection fails.
  • slurmdbd - Avoid crash during shutdown due to sacctmgr shutdown request.
  • Fix slurmctld assertion when using "enable_async_reply" and certmgr is used for a TLS enabled cluster.
  • Fix potential slurmd process leak when handling --get-user-env.
  • slurmctld - Avoid race condition that could cause the StateSaveLocation updates to be missed during shutdown.
  • slurmctld - Avoid race condition that could cause slurmctld to hang during shutdown before updating StateSaveLocation.
  • slurmctld - Avoid race condition that could cause shutdown to wait on the wrong thread.
  • Fix handling of 0 node test allocations in topology/block.
  • slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail.

v25.11.2

26 Jan 19:20

Changes in 25.11.2

  • slurmstepd - Revert regression that would apply job environment to container runtime invocation.
  • Fix issue where reservations may start while required GRES resources are still being used by jobs.
  • Fix slurmctld segfault when using --consolidate-segments.
  • Expose slurm.CONSOLIDATE_SEGMENTS flag in lua.
  • Expose the job record's segment_size in lua.
  • job_submit/lua - Expose the job_desc's segment_size in lua.
  • Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server.
  • Clarify warning when BPF tokens are not supported.
  • slurmctld - Ensure we close already accepted conn before RPC flush check.
  • slurmctld - Fix rpc_queue feature causing statesave corruption during shutdown.
  • slurmctld - Ensure backfill has finished before saving state.
  • slurmctld - Ensure main scheduler has finished before saving state.
  • slurmctld - Fix error message when shutting down and state cannot be saved.
  • Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin.
  • Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place.
  • slurmctld - Fix crash after failed reconfiguration while running jobs and priority/multifactor enabled.
  • slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
  • Guess NodeName with gethostname instead of gethostname_short.
  • Fix allowing job submissions when EnforcePartLimits=NO and the requested minimum number of nodes exceeds the total nodes in the specified partition(s).
  • Fix double unlock issue in _slurm_rpc_job_sbcast_cred().
  • srun - fix bug where some input/output/error filename format identifiers were not expanded.
  • Fix detecting restricted cores with SlurmdSpecOverride in nodes with more than one socket.
  • slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
  • Fix average calculation in latency timers to show more accurate timing logs.

v25.05.6

26 Jan 19:18

Changes in 25.05.6

  • Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
  • slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
  • Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
  • Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
  • slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
  • slurmctld - Fixed segfault when using newly added remote licenses.
  • Fix memory leak on slurmctld for jobs that use --exclusive=topo.
  • Fix double unlock issue in _slurm_rpc_job_sbcast_cred().
  • slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
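
The TreeWidth threshold of 46341 in the memory corruption fix above happens to be the smallest integer whose square exceeds the largest signed 32-bit value. The release note does not state the root cause, so reading the bug as a 32-bit overflow is an assumption. The sketch below only checks the arithmetic and uses no Slurm code; the variable names are illustrative.

```python
import math

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer (2147483647)

# Largest width whose square still fits in a signed 32-bit integer.
largest_safe_width = math.isqrt(INT32_MAX)

# First width whose square no longer fits; this equals the TreeWidth
# threshold quoted in the release note. Whether Slurm actually squares
# TreeWidth internally is an assumption, not something the note states.
first_overflowing_width = largest_safe_width + 1

print(largest_safe_width, largest_safe_width**2 <= INT32_MAX)
print(first_overflowing_width, first_overflowing_width**2 > INT32_MAX)
```

If that reading is right, sites with a TreeWidth below 46341 would not have been exposed to this particular bug.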

v25.11.1

26 Jan 19:19

Changes in 25.11.1

  • data_parser/v0.0.41 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.42 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.43 - Prevent memory leaks when freeing parsed lists.
  • data_parser/v0.0.44 - Prevent memory leaks when freeing parsed lists.
  • slurmctld - Prevent a fatal when min_exempt_priority is not the last option listed in PreemptParameters.
  • Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
  • When debugflags=script is set in slurm.conf, Lua runtime error message will be logged with backtrace.
  • slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
  • When GrpTRES and MaxTRESPU are set on different QOSes and both QOSes are applied to a job, ensure that both limits are honored.
  • Fix issue where a cli command or process could get stuck indefinitely when trying to retrieve a slurm.conf from slurmctld.
  • Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
  • Fix pam_slurm_adopt when using namespace/linux plugin.
  • topology/tree - Prevent overflow error when calculating fanout depth.
  • The state string for nodes in the MIXED+FAIL state will now appear as "FAILING" rather than just "FAIL", similar to what is already done for nodes in the ALLOCATED+FAIL state.
  • slurmctld - Prevent a divide by zero crash by fataling if the following SlurmctldParameters have a value of less than or equal to 0: rl_table_size, rl_bucket_size, rl_refill_rate, and rl_refill_period.
  • Fix missing updates to reservation TRES and accounting when node(s) replaced due to REPLACE or REPLACE_DOWN flags.
  • slurmctld - Cancel interactive job if prolog RPC never reaches its receiver.
  • slurmctld - Cancel interactive jobs that never ran the prolog in the purge jobs logic.
  • Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
  • NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
  • slurmctld - Prevent the controller from believing it has a job's federation cluster lock when it does not.
  • Fix jobs incorrectly stuck waiting for resources when launched with specific client flag combinations containing "--hint=nomultithread".
  • Fix allocated licenses still showing after removing all allocated licenses.
  • accounting_storage/mysql - Disallow creating users if requested user list is empty or usernames are empty strings.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
  • slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
  • slurmrestd - Revert regression that changed the error from "Authentication failure" to "Authentication does not apply to request" when a HTTP request lacks any authentication credentials.
  • When a job requests multiple partitions and cannot run in one of them due to topology, allow the main scheduler to evaluate jobs in the other requested partitions.
  • slurmctld - Acquire the node write lock instead of the node read lock when querying 'GET /metrics/nodes' and 'GET /metrics/partitions' endpoints.
  • slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
  • Remove error output for missing optional spank plugin.
  • slurmctld - when unable to schedule a job with preferred node features, don't exclude the partition from further scheduling attempts in the same iteration.
  • Fix issue with RestrictedCoresPerGPU with shared gres.
  • Fix rpmbuild --with libcurl option.
  • Add new JobAcctGatherParams=no_file_cache to change how memory usage (RSS) is reported when using cgroup/v2. With this flag set we will subtract active_file and inactive_file from the value reported in memory.current to avoid counting the file cache. memory.peak will then not be used to get the MaxRSS and getting memory spikes will depend on the JobAcctGatherFrequency parameter.
  • namespace/linux - fix bug that could leave defunct processes in the jobs namespace.
  • namespace/linux - kill and reap the namespace process during job teardown.
  • namespace/linux - Fix issue with user_ns_script that may result in STDIN closing, which may result in 'Unable to receive "ok ack"' error on slurmstepd or other undefined behavior.
  • Fix error reading /proc/0/* when calling the api outside the step namespace.
  • slurmctld - Fixed segfault when using newly added remote licenses.
  • Fix SIGCHLD not being sent to tasks.
  • Fix bitmap2node_name() not being cleaned up properly when reservation logging is enabled.
  • Fix issue with jobs running on slurmds with version 25.05.x or older getting aborted when slurmd re-registers with slurmctld.
  • Fix memory leak on slurmctld for jobs that use --exclusive=topo.
  • Prevent jobs that cannot fit in the reservation's time limit from being attracted to a magnetic reservation.
  • Fix slurmstepd segfault for older versioned batch jobs (25.05 and older) submitted without using -o/--output on submission.
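
The JobAcctGatherParams=no_file_cache entry above describes an arithmetic adjustment: active_file and inactive_file are subtracted from memory.current. A minimal sketch of that calculation follows. The sample values, the function names, and the clamp to zero are assumptions made for illustration; this is not Slurm's implementation.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.stat content ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def rss_without_file_cache(memory_current: int, memory_stat_text: str) -> int:
    """Subtract active_file and inactive_file from memory.current."""
    stats = parse_memory_stat(memory_stat_text)
    file_cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    # Clamping at zero is an assumption added here for safety.
    return max(memory_current - file_cache, 0)


# Hypothetical sample data, in bytes, in the memory.stat layout.
SAMPLE_MEMORY_STAT = """\
anon 400000000
file 600000000
inactive_file 450000000
active_file 150000000
"""
SAMPLE_MEMORY_CURRENT = 1_050_000_000

adjusted = rss_without_file_cache(SAMPLE_MEMORY_CURRENT, SAMPLE_MEMORY_STAT)
print(adjusted)
```

Because the adjusted value is derived from samples of memory.current rather than from memory.peak, short spikes between samples can be missed, which is why the entry ties spike detection to JobAcctGatherFrequency.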

v25.05.5

26 Jan 19:18

Changes in 25.05.5

  • Fix slurmdbd error triggered by "sreport user topusage" when trying to get data from monthly usage tables.
  • scontrol - fix regression where "scontrol update jobid= qos=" was not considered a valid command.
  • slurmstepd - Prevent the slurmstepd from segfaulting if the switch/hpe_slingshot plugin is enabled and SwitchParameters is not specified.
  • Avoid deadlock that occurs on a failed reconfigure when there are issues with slurmdbd connections and AccountingStoreFlags is set with job_script or job_env.
  • slurmctld - Avoid regression that caused POSIX signals to be ignored after quiesce timeout triggers.
  • Fix potential file descriptor leak to child processes.
  • slurmctld - Prevent a fatal when min_exempt_priority is not the last option listed in PreemptParameters.

v25.11.0

26 Jan 19:19

Changes in 25.11.0

  • namespace/linux - move directory creation for bind mounts to before the init script is called.
  • namespace/linux - add SLURM_JOB_MEM to script environments when able.
  • Fix an error when printing sdiag rpc stats in json format when hostlists strings are too long.
  • Add --no-trunc argument to sdiag to print in full the long hostlists that are truncated to 80 characters by default.
  • Add infinite (-1) layer support to HRes mode 3.
  • Fix ESLURM_RETRY_EVAL handling in common_topo_choose_nodes().
  • Fix HRes MODE_3 when used with --gpus.
  • Fix enforcing of MODE_3 with --distribution=arbitrary.
  • slurmrestd - Fix regression that caused rejected HTTP requests to not include a descriptive error message.
  • slurmrestd - Fix regression that caused requests for unknown or unsupported URL paths to not include a descriptive error.

v25.11.0rc2

26 Jan 19:19

Pre-release

Changes in 25.11.0rc2

  • Avoid deadlock that occurs on a failed reconfigure when there are issues with slurmdbd connections and AccountingStoreFlags is set with job_script or job_env.
  • Use rename() to atomically replace the heartbeat state file.
  • scrun - Fix memory leak from invalid incoming messages.
  • scrun - Avoid regression that would cause shutdown to hang.
  • scrun - Fix race condition that could cause scrun to crash during shutdown.
  • Set SLURM_JOB_SELINUX_CONTEXT in Prolog, Epilog, PrologSlurmctld, and EpilogSlurmctld with the selinux_context.
  • Avoid printing "JobID=Invalid" or "SLUID=Invalid" to the logs. Print both when both are set, otherwise print whichever is set.
  • slurmctld - Avoid regression that caused POSIX signals to be ignored after quiesce timeout triggers.
  • Fix potential file descriptor leak to child processes.
  • Add expediting state to job metrics.
  • Fix federated jobs not getting SLUID set.
  • Fix memory corruption on federated sibling submissions.
  • Add SLURM_JOB_QOS to PrologSlurmctld/EpilogSlurmctld environment.
  • namespace/linux - fix potential error with chown at job startup.
  • Fix use after free in namespace/linux on an error condition.
  • namespace/linux - fix potential invalid close() of file descriptors.
  • slurmctld,slurmd - Reject incoming RPC connections with TLS required error to help misconfigured clients.
  • Add requeue_delay option to SchedulerParameters.
  • RPCs that are keyed by SLUID no longer fall back to looking up the job by JobId. This should avoid (rare) edge cases where a node reconnects to the cluster and attempts to cancel requeued jobs.
  • Add %S as a filename replacement pattern for SLUID.
  • Add %r as a filename replacement pattern for restart count for batch jobs.
  • Add topology.yaml manpage to debian packages.
  • Add GET /metrics endpoint to list all metric-related endpoints.
  • Export SLURM_JOB_SLUID in the environment for Prolog/Epilog. Remove the undocumented SLURM_SLUID environment variable.
  • Export SLURM_JOB_SLUID in the environment for PrologSlurmctld/EpilogSlurmctld.
  • namespace/linux - Default to 10 seconds for clone_ns_script_wait and clone_ns_epilog_wait if their values are not configured.
  • namespace/linux - The namespace/linux plugin no longer reads job_container.conf. Instead it parses namespace.yaml.
  • Prevent potential segfault when providing hostlist_push() with an incorrectly formatted hostlist string.