[slurm-users] Major newbie - Slurm/jupyterhub

Lisa Kay Weihl lweihl at bgsu.edu
Tue May 5 14:37:58 UTC 2020


Thanks, Guy. I did find a jupyterhub_slurmspawner log in my home directory, and it showed that the job could not find batchspawner-singleuser on its PATH.


So I added this to jupyterhub_config.py:

export PATH=/opt/rh/rh-python36/root/bin:$PATH
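(For context, a minimal sketch of where I assume that export has to live -- inside the c.SlurmSpawner.batch_script template in jupyterhub_config.py, ahead of the single-user launch command -- rather than in the hub's own environment:)

    export PATH=/opt/rh/rh-python36/root/bin:$PATH   # make the SCL python visible inside the job
    which batchspawner-singleuser                    # sanity check: should now resolve
    {cmd}                                            # batchspawner substitutes the launch command here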


That now lets the server launch for the account I use for all the configuration work. I still get errors (see below), but the notebook loads. The problem is that I'm not sure how to kill the job in the Slurm queue, or the notebook server itself, if I finish before the job times out and kills it. Logging out doesn't seem to do it.
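(I assume I can always fall back on the usual Slurm commands and cancel it by hand, something like the lines below, but I'd like logout to clean it up automatically.)

    squeue -u csadmin --name=jupyterhub-spawner   # find the spawner job's ID
    scancel 7117                                  # cancel it by ID; 7117 is just the example from my log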

It still doesn't work for a regular user (see Part II below).

I think my problems all come down to Slurm/JupyterHub finding Python, so I have some questions about the best way to set Python up for multiple users and make this work.

I use CentOS so that, if the university admins ever have to take this over, it will match the Red Hat setups they use. I know that on all Linux distros you need to leave the system Python 2 install alone. It looks like CentOS 7.7 now ships a python3 package in its repositories, but I didn't go that route; in the past I've installed Python from the Red Hat Software Collections, which is what I did this time. I don't know if that's the best route for this use case.
They also say not to use sudo pip3 to install global packages, but does that mean switching to root and then running pip3 is okay?
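(For what it's worth, this is what I have been doing, which I assume is the "global" case those warnings are about, alongside the per-user variant; the package names are only examples:)

    # As root: install into the SCL Python's site-packages (global, all users see it)
    scl enable rh-python36 'pip3 install jupyterhub batchspawner'
    # As an ordinary user: install into ~/.local instead, leaving the system site-packages alone
    scl enable rh-python36 'pip3 install --user numpy xgboost scikit-learn'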

When I'm testing and faculty haven't given me code, I go to the web and look for examples. I also wanted to test the GPUs from within a notebook. I have two examples:

Example 1 uses these modules:
import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score

It gives error: cannot load library '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: cannot open shared object file: No such file or directory

libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib

Does this mean I also need LD_LIBRARY_PATH set? CUDA was installed the typical way, from the NVIDIA repo following their instructions.
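(If it is just a linker-path problem, I assume something like the following in the job script or my shell profile would cover it -- though the .so.9.2 in the error makes me wonder whether that librmm build expects CUDA 9.2 rather than the 10.2 I have installed:)

    # Assumed fix: add the CUDA 10.2 runtime directory to the dynamic linker path
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
    # Check which runtime versions are actually present on disk
    ls /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so*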

Example 2 uses these modules:
import numpy as np
from numba import vectorize

And it gives this error: NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
library nvvm not found

I don't have conda installed. Will installing it interfere with pip3?
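(My understanding is that a Miniconda install in its own prefix shouldn't interfere with the SCL pip3 packages as long as it isn't put on everyone's PATH. Before going that route, though, I assume Numba can be pointed at the repo-installed toolkit, since libnvvm ships with it -- something like this, with the paths guessed from the default CUDA 10.2 layout:)

    # Assumed environment for Numba to locate NVVM from the NVIDIA repo install
    export CUDA_HOME=/usr/local/cuda-10.2
    export LD_LIBRARY_PATH=$CUDA_HOME/nvvm/lib64:$CUDA_HOME/lib64:$LD_LIBRARY_PATH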

Part II - using JupyterHub as a regular user gives a different error

I'm assuming this is a python path issue?


 File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in <module>

    __import__('pkg_resources').require('batchspawner==1.0.0rc0')

and later

pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution was not found and is required by the application
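(My guess is that the regular user's job ends up with a Python or site-packages that doesn't contain batchspawner, e.g. if it was only installed with --user for csadmin. I assume I can check per user with something like this, where REGULAR_USER is a placeholder:)

    # Which interpreter does the launcher script expect?
    head -1 /opt/rh/rh-python36/root/bin/batchspawner-singleuser
    # Can that user's environment import batchspawner from the SCL python?
    # (adjust the path if python3 lives under root/usr/bin instead)
    sudo -u REGULAR_USER /opt/rh/rh-python36/root/bin/python3 -c 'import batchspawner; print(batchspawner.__file__)'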

Thanks again for any help, especially if you can help clear up the Python configuration.


***************************************************************

Lisa Weihl, Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116  |  Fax: (419) 372-8061
lweihl at bgsu.edu
www.bgsu.edu

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of slurm-users-request at lists.schedmd.com <slurm-users-request at lists.schedmd.com>
Sent: Tuesday, May 5, 2020 4:59 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [EXTERNAL] slurm-users Digest, Vol 31, Issue 8

Date: Tue, 5 May 2020 09:59:01 +0100
From: Guy Coates <guy.coates at gmail.com>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub

Hi Lisa,

Below is my JupyterHub Slurm config. It uses profiles, which allow you to
spawn different-sized jobs. The most useful things I found for debugging are
to make sure that the --output option is being honoured (any Jupyter Python
errors will end up there) and to explicitly set the Python environment at the
start of the script. (The example below uses conda; replace it with whatever
makes sense in your environment.)

Hope that helps,

Guy


#Extend timeouts to deal with slow job launch
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.start_timeout=120
c.Spawner.term_timeout=20
c.Spawner.http_timeout = 120

# Set up the various sizes of job
c.ProfilesSpawner.profiles = [
("Local server: (Run on local machine)", "local",
"jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
 dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M")),
("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
]

#Configure the batch job. Make sure --output is set and explicitly set up
#the jupyterhub python environment
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=L
#SBATCH {options}
trap 'echo SIGTERM received' TERM
 . /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate /usr/local/jupyterhub/jupyterhub
which jupyterhub-singleuser
{cmd}
echo "jupyterhub-singleuser ended gracefully"
"""

On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl <lweihl at bgsu.edu> wrote:

> I have a single server with 2 cpu, 384gb memory and 4 gpu (GeForce RTX
> 2080 Ti).
>
> Use is to be for GPU ML computing and python based data science.
>
> One faculty wants jupyter notebooks, other faculty member is used to using
> CUDA for GPU but has only done it on a workstation in his lab with a GUI.
> New faculty member coming in has used nvidia-docker container for GPU (I
> think on a large cluster, we are just getting started)
>
> I'm charged with making all this work and hopefully all at once. Right now
> I'll take one thing working.
>
> So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7 (SE
> Linux enabled). I posted once before about having trouble getting that
> combination correct and I finally worked that out. Most of the tests in the
> test suite seem to run okay. I'm trying to start with very basic Slurm
> configuration so I haven't enabled accounting.
>
> For reference, here is my slurm.conf:
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmctldHost=cs-host
>
> #authentication
> AuthType=auth/munge
> CacheGroups = 0
> CryptoType=crypto/munge
>
> #Add GPU support
> GresTypes=gpu
>
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
>
> #service
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/affinity
>
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> SlurmdTimeout=1800
>
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=3-0
> PriorityMaxAge=7-0
> PriorityFavorSmall=YES
> PriorityWeightAge=1000
> PriorityWeightFairshare=0
> PriorityWeightJobSize=125
> PriorityWeightPartition=1000
> PriorityWeightQOS=0
>
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cs-host
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=info
> SlurmdLogFile=/var/log/slurmd.log
>
> # COMPUTE NODES
> NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
>
> #PARTITIONS
> PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> PartitionName=faculty Priority=10 Default=YES
>
> I have JupyterHub running as part of the Red Hat SCL. It works fine with no
> integration with Slurm. Now I'm trying to use batchspawner to start a
> single-user server for each user. Right now I'm trying just one configuration
> from within jupyterhub_config.py and keeping it simple (see below).
>
> When I connect I get this error:
>
> 500: Internal Server Error
> Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
>
> In the jupyterhub.log:
>
> [I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin
> [I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (csadmin at 127.0.0.1) 227.13ms
> [I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job using sudo -E -u csadmin sbatch --parsable
> [I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted script:
>     #!/bin/bash
>     #SBATCH --partition=faculty
>     #SBATCH --time=8:00:00
>     #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log
>     #SBATCH --job-name=jupyterhub-spawner
>     #SBATCH --cpus-per-task=1
>     #SBATCH --chdir=/home/csadmin
>     #SBATCH --uid=csadmin
>
>     env
>     which jupyterhub-singleuser
>     batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0
>
> [I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: sudo -E -u csadmin sbatch --parsable output: 7117
> [W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job neither pending nor running.
> [E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting csadmin's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> [W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> [E 2020-05-04 19:47:59.521 JupyterHub log:166] {
>       "X-Forwarded-Host": "localhost:8000",
>       "X-Forwarded-Proto": "http",
>       "X-Forwarded-Port": "8000",
>       "X-Forwarded-For": "127.0.0.1",
>       "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",
>       "Accept-Language": "en-US,en;q=0.9",
>       "Accept-Encoding": "gzip, deflate, br",
>       "Referer": "http://localhost:8000/hub/login",
>       "Sec-Fetch-Dest": "document",
>       "Sec-Fetch-User": "?1",
>       "Sec-Fetch-Mode": "navigate",
>       "Sec-Fetch-Site": "same-origin",
>       "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
>       "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
>       "Upgrade-Insecure-Requests": "1",
>       "Cache-Control": "max-age=0",
>       "Connection": "close",
>       "Host": "localhost:8000"
>     }
> [E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (csadmin at 127.0.0.1) 842.87ms
> [I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes
> [I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes
>
> In the slurmd.log (which I don't see as helpful):
>
> [2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117
> [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003
> [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001
> [2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473
> [2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds
> [2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001
> [2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512
> [2020-05-04T19:47:58.982] [7117.batch] done with job
>
> In the jupyterhub_config.py (just the part for batchspawner):
>
> c = get_config()
> c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
> # Even though not used, needed to register batchspawner interface
> import batchspawner
>
> c.Spawner.http_timeout = 120
> c.SlurmSpawner.req_nprocs = '1'
> c.SlurmSpawner.req_runtime = '8:00:00'
> c.SlurmSpawner.req_partition = 'faculty'
> c.SlurmSpawner.req_memory = '128gb'
> c.SlurmSpawner.start_timeout = 240
> c.SlurmSpawner.batch_script = '''#!/bin/bash
> #SBATCH --partition={partition}
> #SBATCH --time={runtime}
> #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> #SBATCH --job-name=jupyterhub-spawner
> #SBATCH --cpus-per-task={nprocs}
> #SBATCH --chdir=/home/{username}
> #SBATCH --uid={username}
>
> env
> which jupyterhub-singleuser
> {cmd}
> '''
>
> I will admit that I don't understand all of this completely, as I haven't
> written a lot of bash scripts. I gather that some of the things in {} are
> environment variables and others come from within this file, and it seems
> they must be specifically defined somewhere in the batchspawner software.
>
> Is the last piece trying to find the path of jupyterhub-singleuser and then
> launch it with {cmd}?
>
> Feel free to tell me to go read the docs, but be gentle. Because of the
> request to make ALL of this work ASAP, I've been skimming, trying to pick up
> as much as I can, and then going off examples to try to make this work. I
> have a feeling that this command: sudo -E -u csadmin sbatch --parsable
> (output: 7117) is what is incorrect and causing the problems. Clearly
> something isn't starting that should be.
>
> If you can shed any light on this, or point me to anything online that might
> help, I'd much appreciate it. I'm really beating my head over this one and I
> know inexperience isn't helping.
>
> Once I figure out this simple config, I want to set up the profiles so I can
> define several job sizes and let the user select one.
>
> One other basic question: I'm assuming that in Slurm terms my server is
> considered to have 24 CPUs, counting cores and threads, so for any Slurm
> setting that refers to something like CPUs per task I could specify up to 24
> if a user wanted. Also, in this case the node count will always be 1, since
> we only have one server.
>
> Thanks!
>
> ***************************************************************
>
> Lisa Weihl, Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116  |  Fax: (419) 372-8061
> lweihl at bgsu.edu
> http://www.bgsu.edu/
>


--
Dr. Guy Coates
+44(0)7801 710224