[slurm-users] New Bright Cluster Slurm issue for AD users

Wed Feb 13 13:57:05 UTC 2019

One method I've used many times in Bright is to integrate a compute
node in the same way as the master and logins (I normally use realm
join...) and then grab the changes back into the image in cmsh. If you
are worried, you can clone into a new image.

Then you can make sure your compute nodes are all using that image and
reboot them.

It's not perfect, but if you only need the UIDs and GIDs to authenticate
against the external AD server then it works fine.
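
As a rough sketch of how to confirm this, something like the following
could be run on a compute node booted from the new image. It uses Python's
standard pwd module, which performs the node's normal passwd (NSS) lookup.
The UID 10952 is the one from the srun error in the original post; the
helper name resolve_uid is made up for illustration.

```python
import pwd


def resolve_uid(uid, lookup=pwd.getpwuid):
    """Return the user name for uid, or None if the node cannot resolve it."""
    try:
        return lookup(uid).pw_name
    except KeyError:
        # pwd.getpwuid raises KeyError when no passwd entry exists for the UID.
        return None


if __name__ == "__main__":
    uid = 10952  # UID taken from the srun error in the original post
    name = resolve_uid(uid)
    if name is None:
        print(f"UID {uid} does not resolve on this node")
    else:
        print(f"UID {uid} resolves to {name}")
```

If the UID does not resolve here, the node is probably not yet integrated
with AD, and Slurm is unlikely to resolve it either.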

If you get DNS issues then your head node isn't forwarding DNS queries to
the right DNS servers.
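
One hedged way to test for that from a compute node is to check whether
the AD server's host name resolves at all. The sketch below uses only the
standard socket module; the function name can_resolve is an illustrative
assumption, and the host names to test are passed on the command line
because the real AD server names are not given in this thread.

```python
import socket
import sys


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves via the node's configured resolver."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False


if __name__ == "__main__":
    # Usage (script name is hypothetical): python3 check_dns.py <ad-server-hostname> ...
    for host in sys.argv[1:]:
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(f"{host}: {status}")
```

A name that resolves on the head node but not on a compute node would
point towards the forwarding problem described above.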

Antony

On Wed, 13 Feb 2019 at 13:11, Yugendra Guvvala <
yguvvala at cambridgecomputer.com> wrote:

> Thanks Guys. I will go through all resources and report back how it goes.
>
> Thanks,
> Yugi
>
> On Feb 13, 2019, at 7:58 AM, John Hearns <hearnsj at googlemail.com> wrote:
>
> Please have a look at section 6.3 of the Bright Admin Manual.
> Have you run updateprovisioners and then rebooted the nodes?
>
>
> Configuring The Cluster To Authenticate Against An External LDAP Server
>
> The cluster can be configured in different ways to authenticate against an
> external LDAP server. For smaller clusters, a configuration where LDAP
> clients on all nodes point directly to the external server is recommended.
> An easy way to set this up is as follows:
>
> • On the head node:
>   – In distributions that are:
>     * derived from prior to RHEL 6: the URIs in /etc/ldap.conf, and in the
>       image file /cm/images/default-image/etc/ldap.conf are set to point
>       to the external LDAP server.
>     * derived from the RHEL 6.x series: the file /etc/ldap.conf does not
>       exist. The files in which the changes then need to be made are
>       /etc/nslcd.conf and /etc/pam_ldap.conf. To implement the changes,
>       the nslcd daemon must then be restarted, for example with
>       service nslcd restart.
>     * derived from RHEL 7.x series: the file /etc/ldap.conf does not
>       exist. The files in which the changes then need to be made are
>       /etc/nslcd.conf and /etc/openldap/ldap.conf. To implement the
>       changes, the nslcd daemon must then be restarted, for example with
>       service nslcd restart.
>   – the updateprovisioners command (section 5.2.4) is run to update any
>     other provisioners.
> • Then, to update configurations on the regular nodes so that they are able
>   to do LDAP lookups:
>   – They can simply be rebooted to pick up the updated configuration,
>     along with the new software image.
>   – Alternatively, to avoid a reboot, the imageupdate command (section
>     5.6.2) can be run to pick up the new software image from a
>     provisioner.
>
> On Wed, 13 Feb 2019 at 12:55, Antony Cleave <antony.cleave at gmail.com>
> wrote:
>
>> Can you ssh in as root and then su to the AD user to make sure that the
>> node is integrated correctly?
>>
>> If you cannot su to an AD user on the node then Slurm will not be able to
>> resolve the UID either, as they use the same methods.
>>
>> On Wed, 13 Feb 2019, 12:35 Yugendra Guvvala, <
>> yguvvala at cambridgecomputer.com> wrote:
>>
>>> No, we can't ssh to compute nodes. This is by design: no one other than
>>> root should be able to ssh to compute nodes.
>>>
>>> I figure that munge is not configured for AD. We have configured our
>>> login image for AD, and the Slurm and munge configurations are on the
>>> head node. I'm not sure how to integrate these.
>>>
>>> Thanks,
>>> Yugi
>>>
>>> On Feb 13, 2019, at 7:27 AM, Antony Cleave <antony.cleave at gmail.com>
>>> wrote:
>>>
>>> Can you ssh to the compute node that job was trying to run on as the
>>> AD user in question?
>>>
>>> I've seen similar issues on AD integrated systems where some nodes boot
>>> from a different image that has not yet been joined to the domain.
>>>
>>> Antony
>>>
>>> On Wed, 13 Feb 2019 at 04:58, Yugendra Guvvala <
>>> yguvvala at cambridgecomputer.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are bringing a new cluster online. We installed SLURM through Bright
>>>> Cluster Manager; however, we are running into an issue here.
>>>>
>>>> We are able to run jobs as the root user and as users created using
>>>> Bright Cluster (cmsh commands). However, we use AD authentication for
>>>> all our users, and when we try to submit jobs to Slurm as AD users we
>>>> get the following error message.
>>>>
>>>>
>>>> srun: fatal: Invalid user id: 10952
>>>> srun: fatal: Invalid user id: 10952
>>>> srun: error: cnode001: task 0: Exited with exit code 1
>>>>
>>>> Attached is the slurm.conf file for reference. Please let us know if you
>>>> have any insight into this.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Yugi
>>>>
>>>> *Yugendra Guvvala | HPC Technologist ** |** Cambridge Computer ** |** "Artists
>>>> in Data Storage" *
>>>> *Direct:* 781-250-3273  | *Cell*: 806-773-4464  |
>>>> yguvvala at cambridgecomputer.com  | www.cambridgecomputer.com
>>>>