[slurm-users] Major newbie - Slurm/jupyterhub

Renfro, Michael Renfro at tntech.edu
Tue May 5 16:22:47 UTC 2020


Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job basis. Ours includes a basic Miniconda installation, and the users can make their own environments from there [3]. I very rarely install a system-wide Python module.
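
As a very rough sketch (the install prefix, modulefile location, and version are just placeholders for wherever CUDA actually lives on your system), a minimal Environment Modules setup for CUDA can be as small as:

# Sketch only: create a modulefile tree and a cuda/10.2 modulefile
mkdir -p /opt/modulefiles/cuda
cat > /opt/modulefiles/cuda/10.2 <<'EOF'
#%Module1.0
set root /usr/local/cuda-10.2
prepend-path PATH            $root/bin
prepend-path LD_LIBRARY_PATH $root/lib64
EOF
# Users then opt in per shell or per job with:
#   module use /opt/modulefiles && module load cuda/10.2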

[1] http://modules.sourceforge.net
[2] https://lmod.readthedocs.io/
[3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook
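
For the Miniconda piece, the workflow is roughly the following (install prefix, installer URL, and package list are only an example, not a prescription):

# One shared (or per-user) Miniconda install
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local/miniconda3
# A user builds their own environment and registers it as a notebook kernel
source /usr/local/miniconda3/etc/profile.d/conda.sh
conda create -y -n datasci python=3.6 numpy scikit-learn cudatoolkit ipykernel
conda activate datasci
python -m ipykernel install --user --name datasci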

> On May 5, 2020, at 9:37 AM, Lisa Kay Weihl <lweihl at bgsu.edu> wrote:
> 
> Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my home directory.  That enabled me to find out that it could not find the path for batchspawner-singleuser. 
> 
> 
> So I added this to jupyterhub_config.py
> export PATH=/opt/rh/rh-python36/root/bin:$PATH
> 
> 
> That now seems to allow the server to launch for the user I use for all the configuration work. I get errors (see below) but the notebook loads. The problem is I'm not sure how to kill the job in the Slurm queue, or the notebook server, if I finish before the job times out and kills it. Logging out doesn't seem to do it.
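> 
> I assume I can kill it by hand from the command line with something like this, but I'm not sure that's the right way to shut the notebook server down cleanly:
> 
> squeue -u csadmin      # find the spawner job's id
> scancel 7117           # cancel it (7117 was the job id from my test)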
> 
> It still doesn't work for a regular user (see below)
> 
> I think my problems all have to do with Slurm/jupyterhub finding python. So I have some questions about the best way to set it up for multiple users and make it work for this.
> 
> I use CentOS so that if the university admins ever have to take over, it will match the Red Hat setups they use. I know that on all Linux distros you need to leave the system Python 2 install alone. It looks like as of CentOS 7.7 there is now a python3 in the repository. I didn't go that route; in the past I installed Python from the Red Hat Software Collections, which is what I did this time.
> I don't know if that's the best route for this use case. They also say not to use sudo pip3 to install global packages, but does that mean becoming root with sudo and then running pip3 is okay?
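> 
> Or is a per-user install the better way? Something like this is just my guess (the package list is only what my test notebooks need):
> 
> # install into ~/.local for one user instead of system-wide
> scl enable rh-python36 -- pip3 install --user numpy xgboost scikit-learn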
> 
> When I test and faculty don't give me code, I go to the web and try to find examples. I also wanted to try to test the GPUs from within the notebook. I have 2 examples:
> 
> Example 1 uses these modules:
> import numpy as np
> import xgboost as xgb
> from sklearn import datasets
> from sklearn.model_selection import train_test_split
> from sklearn.datasets import dump_svmlight_file
> from sklearn.externals import joblib
> from sklearn.metrics import precision_score
> 
> It gives error: cannot load library '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: cannot open shared object file: No such file or directory
> 
> libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib
> 
> Does this mean I need LD_LIBRARY_PATH set also? CUDA was installed with the typical NVIDIA instructions using their repo.
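> 
> For example, would something like this in the job script (or the user's environment) be the right idea? Though I notice the error wants the 9.2 version of libcudart while I have 10.2 installed.
> 
> # guess on my part, pointing at the directory where libcudart.so actually lives
> export LD_LIBRARY_PATH=/usr/local/cuda-10.2/targets/x86_64-linux/lib:$LD_LIBRARY_PATH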
> 
> Example 2 uses these modules:
> import numpy as np
> from numba import vectorize
> 
> And gives error:  NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
> library nvvm not found
> 
> I don't have conda installed. Will that interfere with pip3?
> 
> Part II - using jupyterhub with a regular user gives a different error
> 
> I'm assuming this is a python path issue?
> 
>  File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in <module>
>     __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution was not found and is required by the application
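> 
> I'm guessing I can check whether that Python can even see the package by running something like this as the affected user:
> 
> scl enable rh-python36 -- python3 -c 'import batchspawner; print(batchspawner.__file__)'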
> 
> Thanks again for any help especially if you can help clear up python configuration.
> 
> 
> ***************************************************************
> Lisa Weihl, Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116   |    Fax: (419) 372-8061
> lweihl at bgsu.edu
> www.bgsu.edu
> 
> From: Guy Coates <guy.coates at gmail.com>
> Date: Tue, 5 May 2020 09:59:01 +0100
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub
> 
> Hi Lisa,
> 
> Below is my jupyterhub slurm config. It uses profiles, which allow you
> to spawn different-sized jobs. I found the most useful thing for debugging
> is to make sure that the --output option is being honoured (any jupyter
> python errors will end up there), and to explicitly set the python
> environment at the start of the script. (The example below uses conda;
> replace with whatever makes sense in your environment.)
> 
> Hope that helps,
> 
> Guy
> 
> 
> #Extend timeouts to deal with slow job launch
> c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
> c.Spawner.start_timeout=120
> c.Spawner.term_timeout=20
> c.Spawner.http_timeout = 120
> 
> # Set up the various sizes of job
> c.ProfilesSpawner.profiles = [
> ("Local server: (Run on local machine)", "local",
> "jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
> ("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
>  dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
> ("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1",
> "batchspawner.SlurmSpawner",
>  dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
> ("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1",
> "batchspawner.SlurmSpawner",
>  dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M")),
> ("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1",
> "batchspawner.SlurmSpawner",
>  dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
> ]
> 
> #Configure the batch job. Make sure --output is set and explicitly set up
> #the jupyterhub python environment
> c.SlurmSpawner.batch_script = """#!/bin/bash
> #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> #SBATCH --job-name=spawner-jupyterhub
> #SBATCH --chdir={homedir}
> #SBATCH --export={keepvars}
> #SBATCH --get-user-env=L
> #SBATCH {options}
> trap 'echo SIGTERM received' TERM
>  . /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
> conda activate /usr/local/jupyterhub/jupyterhub
> which jupyterhub-singleuser
> {cmd}
> echo "jupyterhub-singleuser ended gracefully"
> """
> 
> On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl <lweihl at bgsu.edu> wrote:
> 
> > I have a single server with 2 CPUs, 384 GB of memory and 4 GPUs (GeForce RTX
> > 2080 Ti).
> >
> > Use is to be for GPU ML computing and python based data science.
> >
> > One faculty wants jupyter notebooks, other faculty member is used to using
> > CUDA for GPU but has only done it on a workstation in his lab with a GUI.
> > New faculty member coming in has used nvidia-docker container for GPU (I
> > think on a large cluster, we are just getting started)
> >
> > I'm charged with making all this work and hopefully all at once. Right now
> > I'll take one thing working.
> >
> > So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7
> > (SELinux enabled). I posted once before about having trouble getting that
> > combination correct, and I finally worked that out. Most of the tests in the
> > test suite seem to run okay. I'm trying to start with a very basic Slurm
> > configuration, so I haven't enabled accounting.
> >
> > *For reference here is my slurm.conf*
> >
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > SlurmctldHost=cs-host
> > #authentication
> > AuthType=auth/munge
> > CacheGroups = 0
> > CryptoType=crypto/munge
> > #Add GPU support
> > GresTypes=gpu
> > #
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > #service
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurmd
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/var/spool/slurmctld
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> > #
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > SlurmdTimeout=1800
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core_Memory
> > PriorityType=priority/multifactor
> > PriorityDecayHalfLife=3-0
> > PriorityMaxAge=7-0
> > PriorityFavorSmall=YES
> > PriorityWeightAge=1000
> > PriorityWeightFairshare=0
> > PriorityWeightJobSize=125
> > PriorityWeightPartition=1000
> > PriorityWeightQOS=0
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cs-host
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=info
> > SlurmctldLogFile=/var/log/slurmctld.log
> > #SlurmdDebug=info
> > SlurmdLogFile=/var/log/slurmd.log
> > #
> > # COMPUTE NODES
> > NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
> > #PARTITIONS
> > PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> > PartitionName=faculty  Priority=10 Default=YES
> >
> >
> > I have jupyterhub running as part of RedHat SCL. It works fine with no
> > integration with Slurm. Now I'm trying to use batchspawner to start a
> > server for the user.  Right now I'm just trying one configuration from
> > within the jupyterhub_config.py and trying to keep it simple (see below).
> >
> > *When I connect I get this error:*
> > 500: Internal Server Error
> > Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job
> > has disappeared while pending in the queue or died immediately after
> > starting.
> >
> > *In the jupyterhub.log:*
> >
> > [I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin
> >
> > [I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next=
> > -> /hub/spawn (csadmin at 127.0.0.1) 227.13ms
> >
> > [I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting
> > job using sudo -E -u csadmin sbatch --parsable
> >
> > [I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted
> > script:
> >
> >     #!/bin/bash
> >
> >     #SBATCH --partition=faculty
> >
> >     #SBATCH --time=8:00:00
> >
> >     #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log
> >
> >     #SBATCH --job-name=jupyterhub-spawner
> >
> >     #SBATCH --cpus-per-task=1
> >
> >     #SBATCH --chdir=/home/csadmin
> >
> >     #SBATCH --uid=csadmin
> >
> >
> >
> >     env
> >
> >     which jupyterhub-singleuser
> >
> >     batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0
> >
> >
> >
> > [I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted.
> > cmd: sudo -E -u csadmin sbatch --parsable output: 7117
> >
> > [W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job  neither
> > pending nor running.
> >
> >
> >
> > [E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting
> > csadmin's server: The Jupyter batch job has disappeared while pending in
> > the queue or died immediately after starting.
> >
> > [W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn
> > (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The
> > Jupyter batch job has disappeared while pending in the queue or died
> > immediately after starting.
> >
> > [E 2020-05-04 19:47:59.521 JupyterHub log:166] {
> >
> >       "X-Forwarded-Host": "localhost:8000",
> >
> >       "X-Forwarded-Proto": "http",
> >
> >       "X-Forwarded-Port": "8000",
> >
> >       "X-Forwarded-For": "127.0.0.1",
> >
> >       "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret];
> > jupyterhub-session-id=[secret]",
> >
> >       "Accept-Language": "en-US,en;q=0.9",
> >
> >       "Accept-Encoding": "gzip, deflate, br",
> >
> >       "Referer": "http://localhost:8000/hub/login",
> >
> >       "Sec-Fetch-Dest": "document",
> >
> >       "Sec-Fetch-User": "?1",
> >
> >       "Sec-Fetch-Mode": "navigate",
> >
> >       "Sec-Fetch-Site": "same-origin",
> >
> >       "Accept":
> > "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
> >
> >       "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)
> > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
> >
> >       "Upgrade-Insecure-Requests": "1",
> >
> >       "Cache-Control": "max-age=0",
> >
> >       "Connection": "close",
> >
> >       "Host": "localhost:8000"
> >
> >     }
> >
> > [E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (
> > csadmin at 127.0.0.1) 842.87ms
> >
> > [I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes
> >
> > [I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes
> >
> >
> >
> > *In the slurmd.log (which I don't see as helpful):*
> >
> >
> > [2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001
> > [2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473
> > [2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds
> > [2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001
> > [2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks
> > [2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512
> > [2020-05-04T19:47:58.982] [7117.batch] done with job
> >
> >
> > *In the jupyterhub_config.py (just the part for batchspawner):*
> >
> > c = get_config()
> > c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
> > # Even though not used, needed to register batchspawner interface
> > import batchspawner
> > c.Spawner.http_timeout = 120
> > c.SlurmSpawner.req_nprocs = '1'
> > c.SlurmSpawner.req_runtime = '8:00:00'
> > c.SlurmSpawner.req_partition = 'faculty'
> > c.SlurmSpawner.req_memory = '128gb'
> > c.SlurmSpawner.start_timeout = 240
> > c.SlurmSpawner.batch_script = '''#!/bin/bash
> > #SBATCH --partition={partition}
> > #SBATCH --time={runtime}
> > #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> > #SBATCH --job-name=jupyterhub-spawner
> > #SBATCH --cpus-per-task={nprocs}
> > #SBATCH --chdir=/home/{username}
> > #SBATCH --uid={username}
> > env
> > which jupyterhub-singleuser
> > {cmd}
> > '''
> >
> > I will admit that I don't understand all of this completely, as I haven't
> > written a lot of bash scripts. I gather that some of the things in {} are
> > environment variables and others come from within this file, and it seems
> > they must be specifically defined somewhere in the batchspawner software.
> >
> > Is the last piece trying to find the path of jupyterhub-singleuser and
> > then launch it with {cmd}?
> >
> > Feel free to tell me to go read the docs, but be gentle. Because of the
> > request to make ALL of this work ASAP, I've been skimming, trying to pick
> > up as much as I can, and then going off examples to try to make this work.
> > I have a feeling that this command, sudo -E -u csadmin sbatch --parsable
> > (which returned job id 7117), is what is incorrect and causing the problems.
> > Clearly something isn't starting that should be.
> >
> > If you can shed any light on anything, or point me to any info online that
> > might help, I'd much appreciate it. I'm really beating my head over this
> > one and I know inexperience isn't helping.
> >
> > When I figure out this simple config, I want to move to the profiles setup
> > where I can define several configurations and have the user select one.
> >
> > One other basic question: I'm assuming that in Slurm terms my server is
> > considered to have 24 CPUs (counting cores and threads), so for any of the
> > Slurm settings that refer to things like CPUs per task I could specify up
> > to 24 if a user wanted. Also, in this case the node count will always be 1
> > since we only have 1 server.
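> >
> > I'm guessing something like this would confirm what Slurm actually detects
> > for the hardware, but I haven't tried it yet:
> >
> > slurmd -C     # prints the node's physical configuration as Slurm sees it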
> >
> > Thanks!
> >
> > ***************************************************************
> >
> > Lisa Weihl, Systems Administrator
> > Computer Science, Bowling Green State University
> > Tel: (419) 372-0116   |    Fax: (419) 372-8061
> > lweihl at bgsu.edu
> > http://www.bgsu.edu
> >
> 
> 
> -- 
> Dr. Guy Coates
> +44(0)7801 710224
> 


