[slurm-users] Errors after removing partition

Jeffrey Frey frey at udel.edu
Fri Jul 26 15:28:35 UTC 2019


If you check the source code (src/slurmctld/job_mgr.c), this error is indeed emitted when slurmctld unpacks job state files.  Tracing through read_slurm_conf() -> load_all_job_state() -> _load_job_state():


		part_ptr = find_part_record (partition);
		if (part_ptr == NULL) {
			char *err_part = NULL;
			part_ptr_list = get_part_list(partition, &err_part);
			if (part_ptr_list) {
				part_ptr = list_peek(part_ptr_list);
				if (list_count(part_ptr_list) == 1)
					FREE_NULL_LIST(part_ptr_list);
			} else {
				verbose("Invalid partition (%s) for JobId=%u",
					err_part, job_id);
				xfree(err_part);
				/* not fatal error, partition could have been
				 * removed, reset_job_bitmaps() will clean-up
				 * this job */
			}
		}


The comment after the error message indicates that this is not a fatal problem and that it is expected specifically when a partition has been removed; according to that comment, reset_job_bitmaps() will clean up the affected job.
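
To make the control flow concrete, here is a minimal, self-contained sketch of the same pattern: look up each restored job's partition, log a message if it no longer exists, and keep going rather than aborting. Everything named demo_* is hypothetical and written only for illustration; none of it is Slurm's actual data structures or API, and the real _load_job_state() does considerably more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a saved job record (NOT Slurm's job_record). */
struct demo_job {
	unsigned int job_id;
	const char *partition;	/* partition name stored in the state file */
	int part_found;		/* set to 1 if the partition is still configured */
};

/* Return 1 if `name` is among the currently configured partitions, else 0. */
static int demo_find_part(const char *const *parts, size_t n_parts,
			  const char *name)
{
	for (size_t i = 0; i < n_parts; i++) {
		if (strcmp(parts[i], name) == 0)
			return 1;
	}
	return 0;
}

/*
 * Restore every job. A missing partition is logged but is not fatal:
 * the job is only flagged so that a later clean-up pass can deal with it.
 * Returns the number of jobs whose partition was not found.
 */
static size_t demo_load_jobs(struct demo_job *jobs, size_t n_jobs,
			     const char *const *parts, size_t n_parts)
{
	size_t orphaned = 0;

	assert(jobs != NULL || n_jobs == 0);
	for (size_t i = 0; i < n_jobs; i++) {
		jobs[i].part_found = demo_find_part(parts, n_parts,
						    jobs[i].partition);
		if (!jobs[i].part_found) {
			fprintf(stderr,
				"Invalid partition (%s) for JobId=%u\n",
				jobs[i].partition, jobs[i].job_id);
			orphaned++;
		}
	}
	return orphaned;
}
```

Calling demo_load_jobs() with a job that still names a removed partition prints one "Invalid partition" line to stderr and returns a non-zero count, while the remaining jobs load normally. That mirrors why the message shows up at every slurmctld start for as long as the saved state still references the removed partition.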




> On Jul 26, 2019, at 11:15 AM, Brian Andrus <toomuchit at gmail.com> wrote:
> 
> All,
> 
> I have a cloud-based cluster using Slurm 19.05.0-1.
> I removed one of the partitions, but now every time I start slurmctld I get some errors:
> 
> slurmctld[63042]: error: Invalid partition (mpi-h44rs) for JobId=52545
> slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-01
> slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-01
> .
> .
> slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-05
> slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-05
> slurmctld[63042]: error: Invalid nodes (mpi-h44rs-[01-05]) for JobId=52545
> 
> I suspect this is in the saved state directory, and if I were to down the entire cluster and delete those files, it would clear up, but I would prefer not to have to down the cluster...
> 
> Is there a way to clean up "phantom" nodes and partitions that were deleted?
> 
> Brian Andrus 


::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::



