[slurm-users] Fw: slurm-users Digest, Vol 31, Issue 50

Abhinandan Patil abhinandan_patil_1414 at yahoo.com
Thu May 14 01:15:53 UTC 2020


Thank you, Michael, for pitching in to troubleshoot the config file.
My config file now looks like this:
ClusterName=linux
ControlMachine=abhi-Latitude-E6430
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
SwitchType=switch/none
MpiDefault=none
ProctrackType=proctrack/pgid
Epilog=/usr/local/slurm/sbin/epilog
Prolog=/usr/local/slurm/sbin/prolog
SlurmdSpoolDir=/var/tmp/slurmd.spool
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
NodeName=abhi-Lenovo-ideapad-330-15IKB CPUS=4
NodeName=abhi-HP-EliteBook-840-G2 CPUS=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
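
The same file is copied to all three machines and the daemons restarted. A minimal sketch of how that can be done (the /etc/slurm-llnl path is an assumption for the Ubuntu packages, and it assumes ssh access and sudo on each node):

    # push the config from the controller to each compute node, then restart slurmd there
    for host in abhi-Lenovo-ideapad-330-15IKB abhi-HP-EliteBook-840-G2; do
        scp /etc/slurm-llnl/slurm.conf "$host":/tmp/slurm.conf
        ssh "$host" 'sudo mv /tmp/slurm.conf /etc/slurm-llnl/slurm.conf && sudo systemctl restart slurmd'
    done
    # restart the controller so it picks up the new node definitions
    sudo systemctl restart slurmctld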

abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:11:32 IST; 2h 28min ago
       Docs: man:slurmd(8)
    Process: 977 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1028 (slurmd)
      Tasks: 2
     Memory: 3.9M
     CGroup: /system.slice/slurmd.service
             └─1028 /usr/sbin/slurmd

abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:18:51 IST; 2h 24min ago
       Docs: man:slurmd(8)
    Process: 1313 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1372 (slurmd)
      Tasks: 2
     Memory: 3.8M
     CGroup: /system.slice/slurmd.service
             └─1372 /usr/sbin/slurmd

abhi@abhi-Latitude-E6430:~$ service slurmctld status
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-05-14 04:11:21 IST; 2h 32min ago
       Docs: man:slurmctld(8)
    Process: 1208 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1306 (slurmctld)
      Tasks: 7
     Memory: 6.7M
     CGroup: /system.slice/slurmctld.service
             └─1306 /usr/sbin/slurmctld
However, sinfo still shows:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  down* abhi-Lenovo-ideapad-330-15IKB

My investigation is still inconclusive.
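
The next checks I plan to run on the controller, based on the earlier replies (a sketch; the node name is the one from my config):

    # ask Slurm why it marked the node down
    sinfo -R
    scontrol show node abhi-Lenovo-ideapad-330-15IKB

    # once the cause (e.g. a RealMemory mismatch) is fixed, return the node to service
    sudo scontrol update NodeName=abhi-Lenovo-ideapad-330-15IKB State=RESUME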

Best Regards,
Abhinandan H. Patil, +919886406214
https://www.AbhinandanHPatil.info

 

----- Forwarded message -----
From: "slurm-users-request at lists.schedmd.com" <slurm-users-request at lists.schedmd.com>
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Sent: Thursday, 14 May 2020, 2:39:40 am GMT+5:30
Subject: slurm-users Digest, Vol 31, Issue 50

Today's Topics:

  1. Re: Ubuntu Cluster with Slurm (Renfro, Michael)
  2. Re: sacct returns nothing after reboot (Roger Mason)
  3. Re: Reset TMPDIR for All Jobs (Ellestad, Erik)
  4. Re: additional jobs killed by scancel. (Alastair Neil)


----------------------------------------------------------------------

Message: 1
Date: Wed, 13 May 2020 14:05:21 +0000
From: "Renfro, Michael" <Renfro at tntech.edu>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Ubuntu Cluster with Slurm
Message-ID: <B4E26014-E420-4506-8A7F-DCDF01E4AAD3 at tntech.edu>
Content-Type: text/plain; charset="utf-8"

I'd compare the RealMemory part of "scontrol show node abhi-HP-EliteBook-840-G2" to the RealMemory part of your slurm.conf:

> Nodes which register to the system with less than the configured resources (e.g. too little memory), will be placed in the "DOWN" state to avoid scheduling jobs on them.

-- https://slurm.schedmd.com/slurm.conf.html
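
For example (an untested sketch against your node names), the values a node actually detects can be printed with slurmd itself and compared with what the controller recorded:

    # on the compute node: print the hardware configuration slurmd detects
    slurmd -C

    # on the controller: what the node registered with, and why it is down
    scontrol show node abhi-HP-EliteBook-840-G2 | grep -E 'RealMemory|CPUTot|State|Reason'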

As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be able to make something work with OpenCL. No idea if that would give performance improvements over the CPUs, though.
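
For completeness: if a node did have a GPU you wanted Slurm to schedule, it would be declared as a generic resource. A minimal sketch, assuming one device per node and that the device file path is right for your hardware (this only lets Slurm track and allocate the device; the job itself still has to use OpenCL/ROCm/CUDA to get any speedup):

    # slurm.conf (same file everywhere)
    GresTypes=gpu
    NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2 Gres=gpu:1

    # gres.conf on that node (device path is an assumption)
    Name=gpu File=/dev/dri/card0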

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601    / Tennessee Tech University

> On May 13, 2020, at 8:42 AM, Abhinandan Patil <abhinandan_patil_1414 at yahoo.com> wrote:
> 
> Dear All,
> 
> Preamble
> ----------
> I want to form simple cluster with three laptops:
> abhi-Latitude-E6430  //This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB //Compute Node
> abhi-HP-EliteBook-840-G2 //Compute Node
> 
> 
> Aim
> -------------
> I want to make use of CPU+GPU+RAM on all the machines when I execute JAVA programs or Python programs.
> 
> 
> Implementation
> ------------------------
> Now let us look at the slurm.conf
> 
> On Machine abhi-Latitude-E6430
> 
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
> 
> Same slurm.conf is copied to all the Machines.
> 
> 
> Observations
> --------------------------------------
> Now when I do
> abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
>        Docs: man:slurmd(8)
>    Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 98253 (slurmd)
>      Tasks: 2
>      Memory: 2.2M
>      CGroup: /system.slice/slurmd.service
>              └─98253 /usr/sbin/slurmd
> 
> abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
>        Docs: man:slurmd(8)
>    Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 71734 (slurmd)
>      Tasks: 2
>      Memory: 2.0M
>      CGroup: /system.slice/slurmd.service
>              └─71734 /usr/sbin/slurmd
> 
> abhi@abhi-Latitude-E6430:~$ service slurmctld status
> ● slurmctld.service - Slurm controller daemon
>      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>      Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
>        Docs: man:slurmctld(8)
>    Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>    Main PID: 97116 (slurmctld)
>      Tasks: 7
>      Memory: 2.6M
>      CGroup: /system.slice/slurmctld.service
>              └─97116 /usr/sbin/slurmctld
> 
>              
> However abhi@abhi-Latitude-E6430:~$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*      up  infinite      1  down* abhi-Lenovo-ideapad-330-15IKB
> 
> 
> Advice needed
> ------------------------
> Please let me know why I am seeing only one node.
> Further, how is the total memory calculated? Can Slurm make use of GPU processing power as well?
> Please let me know if I have missed something in my configuration or explanation.
> 
> Thank you all
> 
> Best Regards,
> Abhinandan H. Patil, +919886406214
> https://www.AbhinandanHPatil.info
> 
> 


------------------------------

Message: 2
Date: Wed, 13 May 2020 12:20:11 -0230
From: Roger Mason <rmason at mun.ca>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] sacct returns nothing after reboot
Message-ID: <y65sgg399ek.fsf at mun.ca>
Content-Type: text/plain

Hello,

Marcus Boden <mboden at gwdg.de> writes:

> the default time window starts at 00:00:00 of the current day:
> -S, --starttime
>          Select jobs in any state after the specified  time.  Default
>          is  00:00:00  of  the  current  day, unless the '-s' or '-j'
>          options are used. If the  '-s'  option  is  used,  then  the
>          default  is  'now'. If states are given with the '-s' option
>          then only jobs in this state at this time will be  returned.
>          If  the  '-j'  option is used, then the default time is Unix
>          Epoch 0. See the DEFAULT TIME WINDOW for more details.

Thank you!  Obviously I did not read far enough down the man page.
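
For the archives, passing an explicit window does what I wanted, e.g. (dates adjusted to taste):

    # one line per job allocation since the start of the month
    sacct -X -S 2020-05-01 -E now -o JobID,JobName,State,Elapsed,ExitCode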

Roger



------------------------------

Message: 3
Date: Wed, 13 May 2020 15:18:09 +0000
From: "Ellestad, Erik" <Erik.Ellestad at ucsf.edu>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Message-ID:
    <BY5PR05MB690060B056D48D6B031A2DA99ABF0 at BY5PR05MB6900.namprd05.prod.outlook.com>
    
Content-Type: text/plain; charset="utf-8"

Woo!

Thanks Marcus, that works.

Though, ahem, SLURM/SchedMD, if you're listening, would it hurt to cover this in the documentation regarding prolog/epilog, and maybe give an example?

https://slurm.schedmd.com/prolog_epilog.html

Just a thought,

Erik

--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF


-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
Sent: Tuesday, May 12, 2020 10:08 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs

Hi Erik,

the output of the task prolog is sourced/evaluated (not really sure how) in
the job environment.

Thus you don't have to export a variable in task-prolog, but echo the 
export, e.g.

echo export TMPDIR=/scratch/$SLURM_JOB_ID

The variable will then be set in the job's environment; a sketch follows.
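
A minimal task prolog along those lines (a sketch; it assumes TaskProlog= in slurm.conf points at this script and that /scratch exists on every node):

    #!/bin/bash
    # slurmd applies stdout lines of the form "export NAME=value" to the
    # task's environment; lines starting with "print " go to the task's stdout.
    echo "export TMPDIR=/scratch/${SLURM_JOB_ID}"

Creating (and later cleaning up) the directory itself is usually done in the node Prolog/Epilog, which run as root.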


Best
Marcus

Am 12.05.2020 um 17:40 schrieb Ellestad, Erik:
> I wanted to set TMPDIR from /tmp to a per-job directory I create in
> local /scratch/$SLURM_JOB_ID (for example).
> 
> This bug suggests I should be able to do this in a task-prolog.
> 
> https://bugs.schedmd.com/show_bug.cgi?id=2664
> 
> However, adding the following to task-prolog doesn't seem to affect the 
> variables the job script is running with.
> 
> unset TMPDIR
> 
> export TMPDIR=/scratch/$SLURM_JOB_ID
> 
> It does work if it is done in the job script, rather than the task-prolog.
> 
> Am I missing something?
> 
> Erik
> 
> --
> 
> Erik Ellestad
> 
> Wynton Cluster SysAdmin
> 
> UCSF
> 


------------------------------

Message: 4
Date: Wed, 13 May 2020 17:08:55 -0400
From: Alastair Neil <ajneil.tech at gmail.com>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] additional jobs killed by scancel.
Message-ID:
    <CA+SarwpQMepkhWLC_RUqSi1SzaNb8MHk77wCSFQAFFyTB7fx2Q at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

invalid field requested: "reason"
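
(So that field isn't recognized by the sacct version here; a variant using fields it should accept -- untested:)

    sacct -j 533900,533902 -o JobID,State,ExitCode,DerivedExitCode,Elapsed,End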

On Tue, 12 May 2020 at 16:47, Steven Dick <kg4ydw at gmail.com> wrote:

> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
> On Tue, May 12, 2020 at 4:12 PM Alastair Neil <ajneil.tech at gmail.com>
> wrote:
> >
> >  The log is continuous and has all the messages logged by slurmd on the
> node for all the jobs mentioned, below are the entries from the slurmctld
> log:
> >
> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=533898 uid 1224431221
> >>
> >> [2020-05-10T00:26:03.098] email msg to sshres2 at masonlive.gmu.edu:
> Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED,
> ExitCode 0
> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898
> successful 0x8004
> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
> >> [2020-05-10T00:26:05.204] email msg to sshres2 at masonlive.gmu.edu:
> Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
> >> [2020-05-10T00:26:05.210] email msg to sshres2 at masonlive.gmu.edu:
> Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
> >
> >
> > it is curious, that all the jobs were running on the same processor,
> perhaps this is a cgroup related failure?
> >
> > On Tue, 12 May 2020 at 10:10, Steven Dick <kg4ydw at gmail.com> wrote:
> >>
> >> I see one job cancelled and two jobs failed.
> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
> >> exiting/failing, so the real error is not here.
> >>
> >> It might also be helpful to look through slurmctld's log starting from
> >> when the first job was canceled, looking at any messages mentioning
> >> the node or the two failed jobs.
> >>
> >> I've had nodes do strange things on job cancel.  Last one I tracked
> >> down to the job epilog failing because it was NFS mounted and nfs was
> >> being slower than slurm liked, so it took the node offline and killed
> >> everything on it.
> >>
> >> On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.tech at gmail.com>
> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > We are using slurm 18.08 and had a weird occurrence over the
> weekend.  A user canceled one of his jobs using scancel, and two additional
> jobs of the user running on the same node were killed concurrently.  The
> jobs had no dependency, but they were all allocated 1 gpu. I am curious to
> know why this happened,  and if this is a known bug is there a workaround
> to prevent it happening?  Any suggestions gratefully received.
> >> >
> >> > -Alastair
> >> >
> >> > FYI
> >> > The cancelled job (533898) has this at the end of the .err file:
> >> >
> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT
> 2020-05-10T00:26:03 ***
> >> >
> >> >
> >> > both of the killed jobs (533900 and 533902)  have this:
> >> >
> >> >> slurmstepd: error: get_exit_code task 0 died by signal
> >> >
> >> >
> >> > here is the slurmd log from the node and the show-job output for each
> job:
> >> >
> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job
> 533898 ran for 0 seconds
> >> >> [2020-05-09T19:49:46.754] ====================
> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> >> >> [2020-05-09T19:49:46.758] ====================
> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID
> 1224431221
> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job
> 533900 ran for 0 seconds
> >> >> [2020-05-09T19:53:14.080] ====================
> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> >> >> [2020-05-09T19:53:14.084] ====================
> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID
> 1224431221
> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job
> 533902 ran for 0 seconds
> >> >> [2020-05-09T19:55:26.304] ====================
> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> >> >> [2020-05-09T19:55:26.307] ====================
> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID
> 1224431221
> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON
> NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
> >> >
> >> >
> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> >> >> JobId=533898 JobName=r18-relu-ent
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=CANCELLED Reason=None Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
> >> >>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
> >> >>  AccrueTime=2020-05-09T19:49:45
> >> >>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:49:46
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >>
> >> >> JobId=533900 JobName=r18-soft
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
> >> >>  AccrueTime=2020-05-09T19:53:13
> >> >>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:53:14
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >>
> >> >> JobId=533902 JobName=r18-soft-ent
> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
> >> >>  AccrueTime=2020-05-09T19:55:26
> >> >>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>  LastSchedEval=2020-05-09T19:55:26
> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>  ReqNodeList=(null) ExcNodeList=(null)
> >> >>  NodeList=NODE056
> >> >>  BatchHost=NODE056
> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>  Features=(null) DelayBoot=00:00:00
> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
> >> >>  StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
> >> >>  Power=
> >> >>  TresPerNode=gpu:1
> >> >
> >> >
> >> >
> >>
>
>

End of slurm-users Digest, Vol 31, Issue 50
*******************************************
  

