<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
fyi… Joe is there now staining front entrance & fixing a few minor touchups, nailing baseboard in basement…
<div class="">Lock box is on the house now w/ key in it…</div>
<div class=""><br class="">
<div><br class="">
<div class="">On Jul 26, 2019, at 11:28 AM, Jeffrey Frey <<a href="mailto:frey@udel.edu" class="">frey@udel.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
If you check the source code (src/slurmctld/job_mgr.c), this error is indeed thrown when slurmctld unpacks job state files. Tracing through read_slurm_conf() -> load_all_job_state() -> _load_job_state():
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre">    </span>part_ptr = find_part_record (partition);<br class="">
<span class="Apple-tab-span" style="white-space:pre">    </span>if (part_ptr == NULL) {<br class="">
<span class="Apple-tab-span" style="white-space:pre">        </span>char *err_part = NULL;<br class="">
<span class="Apple-tab-span" style="white-space:pre">        </span>part_ptr_list = get_part_list(partition, &err_part);<br class="">
<span class="Apple-tab-span" style="white-space:pre">        </span>if (part_ptr_list) {<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span>part_ptr = list_peek(part_ptr_list);<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span>if (list_count(part_ptr_list) == 1)<br class="">
<span class="Apple-tab-span" style="white-space:pre">                </span>FREE_NULL_LIST(part_ptr_list);<br class="">
<span class="Apple-tab-span" style="white-space:pre">        </span>} else {<br class="">
<b class=""><span class="Apple-tab-span" style="white-space:pre">            </span>verbose("Invalid partition (%s) for JobId=%u",<br class="">
<span class="Apple-tab-span" style="white-space:pre">                </span>err_part, job_id);<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span>xfree(err_part);<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span>/* not fatal error, partition could have been<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span> * removed, reset_job_bitmaps() will clean-up<br class="">
<span class="Apple-tab-span" style="white-space:pre">            </span> * this job */</b><br class="">
<span class="Apple-tab-span" style="white-space:pre">        </span>}<br class="">
<span class="Apple-tab-span" style="white-space:pre">    </span>}</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">The comment after the log call implies that this is not really a problem: the condition is not fatal, it occurs specifically when a partition has been removed, and reset_job_bitmaps() is expected to clean up the affected job.</div>
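<div class="">To make the "log and keep going" behavior concrete, here is a minimal standalone sketch of the same pattern. It is not Slurm code: part_record_t, known_parts, lookup_part() and restore_job() are hypothetical names made up for illustration, and the partition names "debug" and "batch" are assumed. It only shows that a failed partition lookup gets reported without aborting the load.</div>
<pre class="">
```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a partition record; not Slurm's real struct. */
typedef struct {
    const char *name;
} part_record_t;

/* Assumed set of partitions still defined in the configuration. */
part_record_t known_parts[] = { { "debug" }, { "batch" } };

/* Return the matching record, or NULL if the partition is not configured. */
part_record_t *lookup_part(const char *name)
{
    size_t n = sizeof(known_parts) / sizeof(known_parts[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(known_parts[i].name, name) == 0)
            return &known_parts[i];
    }
    return NULL;
}

/* Restore one saved job. A missing partition is logged and counted in
 * *warnings, but the function still returns 0 so that loading continues. */
int restore_job(unsigned job_id, const char *partition, int *warnings)
{
    part_record_t *part_ptr = lookup_part(partition);
    if (part_ptr == NULL) {
        fprintf(stderr, "Invalid partition (%s) for JobId=%u\n",
                partition, job_id);
        (*warnings)++;
    }
    return 0;
}
```
</pre>
<div class="">With this table, calling restore_job(52545, "mpi-h44rs", &warnings) prints the message to stderr, increments warnings, and still returns 0, which mirrors the "not fatal error" branch quoted above.</div>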
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">On Jul 26, 2019, at 11:15 AM, Brian Andrus <<a href="mailto:toomuchit@gmail.com" class="">toomuchit@gmail.com</a>> wrote:<br class="">
<br class="">
All,<br class="">
<br class="">
I have a cloud-based cluster using Slurm 19.05.0-1.<br class="">
I removed one of the partitions, but now every time I start slurmctld I get some errors:<br class="">
<br class="">
slurmctld[63042]: error: Invalid partition (mpi-h44rs) for JobId=52545<br class="">
slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-01<br class="">
slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-01<br class="">
.<br class="">
.<br class="">
slurmctld[63042]: error: _find_node_record(756): lookup failure for mpi-h44rs-05<br class="">
slurmctld[63042]: error: node_name2bitmap: invalid node specified mpi-h44rs-05<br class="">
slurmctld[63042]: error: Invalid nodes (mpi-h44rs-[01-05]) for JobId=52545<br class="">
<br class="">
I suspect this is in the saved state directory, and that if I were to down the entire cluster and delete those files, it would clear it up, but I would prefer not to have to down the cluster...<br class="">
<br class="">
Is there a way to clean up "phantom" nodes and partitions that were deleted?<br class="">
<br class="">
Brian Andrus <br class="">
</blockquote>
<br class="">
<div class=""><br class="">
::::::::::::::::::::::::::::::::::::::::::::::::::::::<br class="">
Jeffrey T. Frey, Ph.D.<br class="">
Systems Programmer V / HPC Management<br class="">
Network & Systems Services / College of Engineering<br class="">
University of Delaware, Newark DE 19716<br class="">
Office: (302) 831-6034 Mobile: (302) 419-4976<br class="">
::::::::::::::::::::::::::::::::::::::::::::::::::::::<br class="">
<br class="">
<br class="">
<br class="">
</div>
<br class="">
</div>
</div>
</div>
</div>
<br class="">
</div>
</body>
</html>