[slurm-users] Major newbie - Slurm/jupyterhub

Guy Coates guy.coates at gmail.com
Tue May 5 08:59:01 UTC 2020


Hi Lisa,

Below is my jupyterhub slurm config. It uses profiles, which allow you
to spawn different sized jobs. I found the most useful things for debugging
were to make sure that the --output option is being honoured (any jupyter
python errors will end up there) and to explicitly set the python
environment at the start of the script. (The example below uses conda;
replace it with whatever makes sense in your environment.)

Hope that helps,

Guy


c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'

# Extend timeouts to deal with slow job launch
c.Spawner.start_timeout = 120
c.Spawner.term_timeout = 20
c.Spawner.http_timeout = 120

# Set up the various sizes of job
c.ProfilesSpawner.profiles = [
    ("Local server: (Run on local machine)", "local",
     "jupyterhub.spawner.LocalProcessSpawner", {'ip': '0.0.0.0'}),
    ("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
    ("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
    ("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M")),
    ("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
]
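
Each profile's req_options string just gets substituted into the {options}
placeholder on the "#SBATCH {options}" line of the batch script further down,
so picking the "Single GPU" profile above should render a job header roughly
like this (a sketch of the substitution, not copied from a real log; <user>
stands in for the actual account):

#SBATCH --output=/home/<user>/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir=/home/<user>
#SBATCH  -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1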

#Configure the batch job. Make sure --output is set and explicitly set up
#the jupyterhub python environment
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=L
#SBATCH {options}
trap 'echo SIGTERM received' TERM
. /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate /usr/local/jupyterhub/jupyterhub
which jupyterhub-singleuser
{cmd}
echo "jupyterhub-singleuser ended gracefully"
"""

On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl <lweihl at bgsu.edu> wrote:

> I have a single server with 2 CPUs, 384 GB of memory, and 4 GPUs (GeForce RTX 2080 Ti).
>
> It is to be used for GPU ML computing and Python-based data science.
>
> One faculty member wants Jupyter notebooks, another is used to using CUDA for GPU work but has only done it on a workstation in his lab with a GUI, and a new faculty member coming in has used nvidia-docker containers for GPU (I think on a large cluster; we are just getting started).
>
> I'm charged with making all of this work, hopefully all at once. Right now I'll settle for getting one thing working.
>
> So I managed to get Slurm 20.02.1 installed with CUDA 10.2 on CentOS 7 (SELinux enabled). I posted once before about having trouble getting that combination correct, and I finally worked that out. Most of the tests in the test suite seem to run okay. I'm trying to start with a very basic Slurm configuration, so I haven't enabled accounting.
>
> *For reference here is my slurm.conf*
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmctldHost=cs-host
>
> #authentication
> AuthType=auth/munge
> CacheGroups = 0
> CryptoType=crypto/munge
>
> #Add GPU support
> GresTypes=gpu
>
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
>
> #service
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> SlurmdTimeout=1800
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=3-0
> PriorityMaxAge=7-0
> PriorityFavorSmall=YES
> PriorityWeightAge=1000
> PriorityWeightFairshare=0
> PriorityWeightJobSize=125
> PriorityWeightPartition=1000
> PriorityWeightQOS=0
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cs-host
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=info
> SlurmdLogFile=/var/log/slurmd.log
> #
> # COMPUTE NODES
> NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
>
> #PARTITIONS
> PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> PartitionName=faculty Priority=10 Default=YES
>
>
> I have jupyterhub running as part of Red Hat SCL. It works fine with no
> integration with Slurm. Now I'm trying to use batchspawner to start a
> server for the user. Right now I'm just trying one configuration from
> within jupyterhub_config.py and trying to keep it simple (see below).
>
> *When I connect I get this error:*
> 500: Internal Server Error
> Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job
> has disappeared while pending in the queue or died immediately after
> starting.
>
> *In the jupyterhub.log:*
>
> [I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin
> [I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (csadmin at 127.0.0.1) 227.13ms
> [I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job using sudo -E -u csadmin sbatch --parsable
> [I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted script:
>
>     #!/bin/bash
>     #SBATCH --partition=faculty
>     #SBATCH --time=8:00:00
>     #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log
>     #SBATCH --job-name=jupyterhub-spawner
>     #SBATCH --cpus-per-task=1
>     #SBATCH --chdir=/home/csadmin
>     #SBATCH --uid=csadmin
>
>     env
>     which jupyterhub-singleuser
>     batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0
>
> [I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: sudo -E -u csadmin sbatch --parsable output: 7117
> [W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job  neither pending nor running.
> [E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting csadmin's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> [W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> [E 2020-05-04 19:47:59.521 JupyterHub log:166] {
>       "X-Forwarded-Host": "localhost:8000",
>       "X-Forwarded-Proto": "http",
>       "X-Forwarded-Port": "8000",
>       "X-Forwarded-For": "127.0.0.1",
>       "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",
>       "Accept-Language": "en-US,en;q=0.9",
>       "Accept-Encoding": "gzip, deflate, br",
>       "Referer": "http://localhost:8000/hub/login",
>       "Sec-Fetch-Dest": "document",
>       "Sec-Fetch-User": "?1",
>       "Sec-Fetch-Mode": "navigate",
>       "Sec-Fetch-Site": "same-origin",
>       "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
>       "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
>       "Upgrade-Insecure-Requests": "1",
>       "Cache-Control": "max-age=0",
>       "Connection": "close",
>       "Host": "localhost:8000"
>     }
> [E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (csadmin at 127.0.0.1) 842.87ms
> [I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes
> [I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes
>
>
>
> *In the slurmd.log (which I don't see as helpful):*
>
>
> [2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117
> [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003
> [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001
> [2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473
> [2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds
> [2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001
> [2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512
> [2020-05-04T19:47:58.982] [7117.batch] done with job
>
>
> *In the jupyterhub_config.py (just the part for batchspawner):*
>
> c = get_config()
> c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
>
> # Even though not used, needed to register batchspawner interface
> import batchspawner
>
> c.Spawner.http_timeout = 120
> c.SlurmSpawner.req_nprocs = '1'
> c.SlurmSpawner.req_runtime = '8:00:00'
> c.SlurmSpawner.req_partition = 'faculty'
> c.SlurmSpawner.req_memory = '128gb'
> c.SlurmSpawner.start_timeout = 240
> c.SlurmSpawner.batch_script = '''#!/bin/bash
> #SBATCH --partition={partition}
> #SBATCH --time={runtime}
> #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> #SBATCH --job-name=jupyterhub-spawner
> #SBATCH --cpus-per-task={nprocs}
> #SBATCH --chdir=/home/{username}
> #SBATCH --uid={username}
> env
> which jupyterhub-singleuser
> {cmd}
> '''
>
> I will admit that I don't understand all of this completely, as I haven't
> written a lot of bash scripts. I gather that some of the things in {} are
> environment variables and others come from within this file, and it seems
> they must be specifically defined somewhere in the batchspawner software.
>
> Is the last piece trying to find the path of jupyterhub-singleuser and
> then launch it with {cmd}?
>
> Feel free to tell me to go read the docs, but be gentle 🙂 Because of the
> request to make ALL of this work ASAP, I've been skimming, trying to pick
> up as much as I can, and then working from examples. I have a feeling that
> this command (sudo -E -u csadmin sbatch --parsable, output: 7117) is what
> is incorrect and causing the problems. Clearly something isn't starting
> that should be.
>
> If you can shed any light on this, or point me to any info online that
> might help, I'd much appreciate it. I'm really beating my head against
> this one, and I know inexperience isn't helping.
>
> Once I figure out this simple config, I want to set up profiles with
> several configurations and let the user select one.
>
> One other basic question: am I right that, in Slurm terms, my server is
> considered to have 24 CPUs (counting cores and threads), so for any Slurm
> setting that refers to something like CPUs per task I could specify up to
> 24 if a user wanted? Also, in this case the node count will always be 1,
> since we only have 1 server.
>
> Thanks!
>
> ***************************************************************
>
> Lisa Weihl, Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116 | Fax: (419) 372-8061
> lweihl at bgsu.edu
> www.bgsu.edu
>


-- 
Dr. Guy Coates
+44(0)7801 710224