[slurm-users] Major newbie - Slurm/jupyterhub

Tue May 5 00:24:09 UTC 2020

I have a single server with two CPUs, 384 GB of memory, and four GPUs (GeForce RTX 2080 Ti).

It will be used for GPU-based ML computing and Python-based data science.

One faculty member wants Jupyter notebooks. Another is used to using CUDA for GPU work but has only done so on a workstation in his lab with a GUI. A new faculty member coming in has used nvidia-docker containers for GPU work (I think on a large cluster; we are just getting started).

I'm charged with making all this work, hopefully all at once. Right now I'll take one thing working.

So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7 (SELinux enabled). I posted once before about having trouble getting that combination correct, and I finally worked that out. Most of the tests in the test suite seem to run okay. I'm trying to start with a very basic Slurm configuration, so I haven't enabled accounting.

For reference, here is my slurm.conf:

# slurm.conf file generated by configurator easy.html.

# Put this file on all nodes of your cluster.

# See the slurm.conf man page for more information.

#

SlurmctldHost=cs-host

#authentication

AuthType=auth/munge

CacheGroups = 0

CryptoType=crypto/munge

#Add GPU support

GresTypes=gpu

#

#MailProg=/bin/mail

MpiDefault=none

#MpiParams=ports=#-#

#service

ProctrackType=proctrack/cgroup

ReturnToService=1

SlurmctldPidFile=/var/run/slurmctld.pid

#SlurmctldPort=6817

SlurmdPidFile=/var/run/slurmd.pid

#SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

#SlurmdUser=root

StateSaveLocation=/var/spool/slurmctld

SwitchType=switch/none

TaskPlugin=task/affinity

#

#

# TIMERS

#KillWait=30

#MinJobAge=300

#SlurmctldTimeout=120

SlurmdTimeout=1800

#

#

# SCHEDULING

SchedulerType=sched/backfill

SelectType=select/cons_tres

SelectTypeParameters=CR_Core_Memory

PriorityType=priority/multifactor

PriorityDecayHalfLife=3-0

PriorityMaxAge=7-0

PriorityFavorSmall=YES

PriorityWeightAge=1000

PriorityWeightFairshare=0

PriorityWeightJobSize=125

PriorityWeightPartition=1000

PriorityWeightQOS=0

#

#

# LOGGING AND ACCOUNTING

AccountingStorageType=accounting_storage/none

ClusterName=cs-host

#JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurmctld.log

#SlurmdDebug=info

SlurmdLogFile=/var/log/slurmd.log

#

#

# COMPUTE NODES

NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4

#PARTITIONS

PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP

PartitionName=faculty  Priority=10 Default=YES

I have JupyterHub running as part of Red Hat SCL. It works fine with no Slurm integration. Now I'm trying to use batchspawner to start a server for the user. For now I'm trying just one configuration from within jupyterhub_config.py and keeping it simple (see below).

When I connect I get this error:
500: Internal Server Error
Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.

In the jupyterhub.log:

[I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin

[I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (csadmin at 127.0.0.1) 227.13ms

[I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job using sudo -E -u csadmin sbatch --parsable

[I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted script:

    #!/bin/bash

    #SBATCH --partition=faculty

    #SBATCH --time=8:00:00

    #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log

    #SBATCH --job-name=jupyterhub-spawner

    #SBATCH --cpus-per-task=1

    #SBATCH --chdir=/home/csadmin

    #SBATCH --uid=csadmin

    env

    which jupyterhub-singleuser

    batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0

[I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: sudo -E -u csadmin sbatch --parsable output: 7117

[W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job  neither pending nor running.

[E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting csadmin's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.

[W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.

[E 2020-05-04 19:47:59.521 JupyterHub log:166] {

      "X-Forwarded-Host": "localhost:8000",

      "X-Forwarded-Proto": "http",

      "X-Forwarded-Port": "8000",

      "X-Forwarded-For": "127.0.0.1",

      "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",

      "Accept-Language": "en-US,en;q=0.9",

      "Accept-Encoding": "gzip, deflate, br",

      "Referer": "http://localhost:8000/hub/login",

      "Sec-Fetch-Dest": "document",

      "Sec-Fetch-User": "?1",

      "Sec-Fetch-Mode": "navigate",

      "Sec-Fetch-Site": "same-origin",

      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",

      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",

      "Upgrade-Insecure-Requests": "1",

      "Cache-Control": "max-age=0",

      "Connection": "close",

      "Host": "localhost:8000"

    }

[E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (csadmin at 127.0.0.1) 842.87ms

[I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes

[I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes

In the slurmd.log (which I don't see as helpful):

[2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117

[2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003

[2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001

[2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473

[2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds

[2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001

[2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks

[2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512

[2020-05-04T19:47:58.982] [7117.batch] done with job
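
A note on the status:32512 value in the REQUEST_COMPLETE_BATCH_SCRIPT line: it looks like a raw wait status rather than a plain exit code. Assuming the usual POSIX encoding (exit code in the high byte, signal number in the low bits), a small sketch to decode it follows. The function name decode_wait_status is made up for illustration.

```python
from typing import Tuple


def decode_wait_status(status: int) -> Tuple[int, int]:
    """Split a POSIX-style wait status into (exit_code, signal_number).

    Assumption: the value logged by slurmd uses the standard encoding.
    """
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the process
    signal_number = status & 0x7F     # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(32512)
print(exit_code, signal_number)
```

Under that assumption the batch script exited with code 127 and no signal. Bash uses 127 for "command not found", so one of the commands in the script may not be on the PATH inside the job.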

In the jupyterhub_config.py (just the part for batchspawner):

c = get_config()

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

# Even though not used, needed to register batchspawner interface

import batchspawner

c.Spawner.http_timeout = 120

c.SlurmSpawner.req_nprocs = '1'

c.SlurmSpawner.req_runtime = '8:00:00'

c.SlurmSpawner.req_partition = 'faculty'

c.SlurmSpawner.req_memory = '128gb'

c.SlurmSpawner.start_timeout = 240

c.SlurmSpawner.batch_script = '''#!/bin/bash

#SBATCH --partition={partition}

#SBATCH --time={runtime}

#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log

#SBATCH --job-name=jupyterhub-spawner

#SBATCH --cpus-per-task={nprocs}

#SBATCH --chdir=/home/{username}

#SBATCH --uid={username}

env

which jupyterhub-singleuser

{cmd}

'''

I will admit that I don't understand all of this completely, as I haven't written many bash scripts. My understanding is that some of the things in {} are environment variables and others come from within this file, and that they must be specifically defined somewhere in the batchspawner software.

Is the last piece trying to find the path of jupyterhub-singleuser and then launch it with {cmd}?
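
Judging from the script that batchspawner logged, the {} fields do not look like environment variables read by bash. They appear to be template placeholders that are filled in on the Python side before sbatch ever sees the script. Below is a minimal sketch of that substitution, assuming plain str.format-style templating (the real batchspawner code may do this differently). The fields dict is hypothetical; its values are copied from the logged script above.

```python
# Shortened copy of the batch_script template from jupyterhub_config.py.
batch_script = '''#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --cpus-per-task={nprocs}
#SBATCH --chdir=/home/{username}
which jupyterhub-singleuser
{cmd}
'''

# Hypothetical field values, copied from the script shown in jupyterhub.log.
fields = {
    "partition": "faculty",
    "runtime": "8:00:00",
    "homedir": "/home/csadmin",
    "nprocs": "1",
    "username": "csadmin",
    "cmd": "batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0",
}

# %j is left alone by str.format; Slurm expands it to the job ID later.
rendered = batch_script.format(**fields)
print(rendered)
```

In the rendered script, the "which jupyterhub-singleuser" line only prints the path as a debugging aid; the line produced from {cmd} is the one that launches the server.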

Feel free to tell me to go read the docs, but be gentle 🙂 Because of the request to make ALL of this work ASAP, I've been skimming, trying to pick up as much as I can, and then working from examples to make this work. I have a feeling that this command: sudo -E -u csadmin sbatch --parsable output: 7117 is what is incorrect and causing the problems. Clearly something isn't starting that should be.
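
For what it's worth, the "output: 7117" at the end of that log line is what sbatch printed, not part of the command. With --parsable, sbatch prints only the job ID, optionally followed by a semicolon and the cluster name, so getting a number back suggests the submission itself went through. A small sketch of reading that output follows; the function name parse_parsable is hypothetical.

```python
from typing import Optional, Tuple


def parse_parsable(output: str) -> Tuple[int, Optional[str]]:
    """Parse `sbatch --parsable` output of the form 'jobid' or 'jobid;cluster'."""
    job_id, _, cluster = output.strip().partition(";")
    return int(job_id), (cluster or None)


# The value reported in jupyterhub.log above.
job_id, cluster = parse_parsable("7117")
print(job_id, cluster)
```

The job ID here matches the 7117 that appears in slurmd.log, so the job was submitted and started; the failure seems to happen inside the script.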

If you can shed any light on this, or point me to anything online that might help, I'd much appreciate it. I'm really beating my head over this one, and I know inexperience isn't helping.

Once I figure out this simple config, I want to set up profiles with several settings that the user can select from.

One other basic question: I'm assuming that in Slurm terms my server has 24 CPUs (counting cores and threads), so that for any Slurm setting such as CPUs per task I could specify up to 24 if a user wanted. Also, in this case the node count will always be 1 since we only have one server.
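
The figure of 24 follows from the NodeName line in the slurm.conf above. A quick check of the arithmetic, with the values copied from that line and the CPU mask copied from slurmd.log:

```python
# Values copied from the NodeName line in slurm.conf.
sockets = 2
cores_per_socket = 6
threads_per_core = 2

# Logical CPUs as Slurm counts them: sockets x cores x threads.
logical_cpus = sockets * cores_per_socket * threads_per_core
print(logical_cpus)

# "CPU final HW mask" that slurmd logged for job 7117.
final_hw_mask = 0x001001
threads_in_mask = bin(final_hw_mask).count("1")
print(threads_in_mask)
```

The product is 24, matching CPUs=24. The mask has two bits set even though the job asked for one CPU per task, which would be consistent with SelectTypeParameters=CR_Core_Memory handing out whole cores (both hardware threads) rather than single threads; that reading is an assumption, not something the logs confirm.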

Thanks!

***************************************************************

Lisa Weihl Systems Administrator

Computer Science, Bowling Green State University
Tel: (419) 372-0116   |    Fax: (419) 372-8061
lweihl at bgsu.edu
www.bgsu.edu