Hi,
when rerunning a canu-2.1.1 job on a different machine I realized that canu picks up the total number of CPU cores available locally instead of respecting what I reserved through the queuing system. Here is how I started the job:
#PBS -l select=1:ncpus=240:mem=6000gb:scratch_local=12tb,walltime=48:00:00
...
canu useGrid=false ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high ovsMemory=1024 ovsConcurrency=5
-- Detected 504 CPUs and 10074 gigabytes of memory.
-- Detected PBSPro '19.0.0' with 'pbsnodes' binary in /opt/pbs/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 8 CPUs x 63 jobs 4032.000 GB 504 CPUs (k-mer counting)
-- Local: hap 16.000 GB 63 CPUs x 8 jobs 128.000 GB 504 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 14 CPUs x 36 jobs 2304.000 GB 504 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: cor 24.000 GB 4 CPUs x 126 jobs 3024.000 GB 504 CPUs (read correction)
-- Local: ovb 4.000 GB 1 CPU x 504 jobs 2016.000 GB 504 CPUs (overlap store bucketizer)
-- Local: ovs 1024.000 GB 1 CPU x 5 jobs 5120.000 GB 5 CPUs (overlap store sorting)
-- Local: red 64.000 GB 9 CPUs x 56 jobs 3584.000 GB 504 CPUs (read error detection)
-- Local: oea 8.000 GB 1 CPU x 504 jobs 4032.000 GB 504 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 8 CPUs x - jobs -.--- GB - CPUs (consensus)
It picked up 504 CPU cores and 10 TB of RAM, although I have in the environment:
PBS_NCPUS=240
PBS_NGPUS=0
PBS_NUM_NODES=1
PBS_NUM_PPN=240
PBS_RESC_MEM=6442450944000
PBS_RESC_SCRATCH_SSD=13194139533312
PBS_RESC_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_MEM=6442450944000
PBS_RESC_TOTAL_PROCS=240
PBS_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_WALLTIME=172800
SCRATCH=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCHDIR=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCH_TYPE=ssd
SCRATCH_VOLUME=13194139533312
TORQUE_RESC_MEM=6442450944000
TORQUE_RESC_PROC=240
TORQUE_RESC_SCRATCH_SSD=13194139533312
TORQUE_RESC_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_MEM=6442450944000
TORQUE_RESC_TOTAL_PROCS=240
TORQUE_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_WALLTIME=172800
I see some code in canu/src/utility/src/utility/system.C, but although the comments there mention more PBSPro variables, only PBS_NUM_PPN is looked up (in theory).
Could it be that this code is skipped altogether because I started canu with useGrid=false? That would be bad. I just wanted to avoid submitting child jobs to the queuing system, but of course I expected canu to understand that it is still running under a job scheduler, on an exec host picked by me, and to respect its limits (6 TB RAM and only 240 CPUs).
-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/my_genome.ovlStore using:
-- 147 buckets
-- 616 slices
-- using at most 29 GB memory each
-- Finished stage 'cor-overlapStoreConfigure', reset canuIteration.
--
-- Running jobs. First attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Thu Mar 4 09:11:40 2021 with 214055.941 GB free disk space (147 processes; 504 concurrently)
cd correction/my_genome.ovlStore.BUILDING
./scripts/1-bucketize.sh 1 > ./logs/1-bucketize.000001.out 2>&1
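In the meantime, my understanding is that canu's maxThreads and maxMemory options can cap the detected resources explicitly. A sketch of deriving those caps from the PBS environment (the variable values match the dump above; the canu line is commented out and shown only as an example invocation):

```shell
# Derive canu resource caps from the PBS reservation instead of
# letting canu probe the host.  PBS_RESC_TOTAL_MEM is in bytes.
PBS_NUM_PPN=240
PBS_RESC_TOTAL_MEM=6442450944000

mem_gb=$(( PBS_RESC_TOTAL_MEM / 1024 / 1024 / 1024 ))

echo "maxThreads=${PBS_NUM_PPN} maxMemory=${mem_gb}g"
# prints: maxThreads=240 maxMemory=6000g

# canu useGrid=false maxThreads=${PBS_NUM_PPN} maxMemory=${mem_gb}g ...
```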