
Getting a number of CPU cores on PBS_pro #1912

@mmokrejs

Description


Hi,
when rerunning a canu-2.1.1 job on a different machine, I realized that canu picks up the total number of CPU cores available locally instead of respecting what I reserved through the queuing system. Here is how I started the job:

#PBS -l select=1:ncpus=240:mem=6000gb:scratch_local=12tb,walltime=48:00:00

...

canu useGrid=false ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high ovsMemory=1024 ovsConcurrency=5
-- Detected 504 CPUs and 10074 gigabytes of memory.
-- Detected PBSPro '19.0.0' with 'pbsnodes' binary in /opt/pbs/bin/pbsnodes.
-- Grid engine and staging disabled per useGrid=false option.
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  63 jobs  4032.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs   128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   14 CPUs x  36 jobs  2304.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: cor       24.000 GB    4 CPUs x 126 jobs  3024.000 GB 504 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 504 jobs  2016.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs      1024.000 GB    1 CPU  x   5 jobs  5120.000 GB   5 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    9 CPUs x  56 jobs  3584.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 504 jobs  4032.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

It picked up 504 CPU cores and 10 TB of RAM, although I have in the environment:

PBS_NCPUS=240
PBS_NGPUS=0
PBS_NUM_NODES=1
PBS_NUM_PPN=240
PBS_RESC_MEM=6442450944000
PBS_RESC_SCRATCH_SSD=13194139533312
PBS_RESC_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_MEM=6442450944000
PBS_RESC_TOTAL_PROCS=240
PBS_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
PBS_RESC_TOTAL_WALLTIME=172800
SCRATCH=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCHDIR=/scratch.ssd/mmokrejs/job_2227881.cerit-pbs.cerit-sc.cz
SCRATCH_TYPE=ssd
SCRATCH_VOLUME=13194139533312
TORQUE_RESC_MEM=6442450944000
TORQUE_RESC_PROC=240
TORQUE_RESC_SCRATCH_SSD=13194139533312
TORQUE_RESC_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_MEM=6442450944000
TORQUE_RESC_TOTAL_PROCS=240
TORQUE_RESC_TOTAL_SCRATCH_VOLUME=13194139533312
TORQUE_RESC_TOTAL_WALLTIME=172800
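For illustration, a minimal sketch (not canu's actual code) of how the CPU limit could be derived from these variables rather than from the hardware count; PBS_NCPUS and PBS_NUM_PPN are the variables from the job environment above:

```shell
# Sketch only: prefer the PBSPro reservation over the hardware CPU count.
# pbs_cpu_limit echoes PBS_NCPUS if set, else PBS_NUM_PPN, else the
# number of online processors (what canu currently detects).
pbs_cpu_limit() {
  hw=$(getconf _NPROCESSORS_ONLN)
  echo "${PBS_NCPUS:-${PBS_NUM_PPN:-$hw}}"
}

# With the job environment above this prints 240, not the node's 504:
( PBS_NCPUS=240; pbs_cpu_limit )
```

The same fallback chain could be applied to PBS_RESC_TOTAL_MEM for the memory limit.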

I see some code in canu/src/utility/src/utility/system.C; although the comments there mention more PBSPro variables, only PBS_NUM_PPN is actually looked up (in theory).

Could it be that this code is skipped altogether because I started canu with useGrid=false? That would be bad. I only wanted to avoid submitting child jobs to the queuing system; of course I still expected canu to recognize that it is running under a job scheduler, on an exec host assigned to me, and to respect the job's limits (6 TB of RAM and only 240 CPUs).
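As a possible workaround (untested here), canu's maxThreads and maxMemory options could pin the limits to the reservation explicitly, with the values derived from the same PBS environment (6442450944000 bytes is exactly 6000 GB):

```shell
# Workaround sketch (untested): pass the reservation to canu explicitly
# instead of relying on autodetection. The fallback defaults mirror the
# values set in the job environment shown above.
ncpus=${PBS_NCPUS:-240}
mem_gb=$(( ${PBS_RESC_TOTAL_MEM:-6442450944000} / 1024 / 1024 / 1024 ))
echo "canu useGrid=false maxThreads=${ncpus} maxMemory=${mem_gb}g ..."
```

With the job above this yields maxThreads=240 and maxMemory=6000g, matching the reservation rather than the node's full hardware.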

-- BEGIN CORRECTION
--
--
-- Creating overlap store correction/my_genome.ovlStore using:
--    147 buckets
--    616 slices
--        using at most 29 GB memory each
-- Finished stage 'cor-overlapStoreConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'ovB' concurrent execution on Thu Mar  4 09:11:40 2021 with 214055.941 GB free disk space (147 processes; 504 concurrently)

    cd correction/my_genome.ovlStore.BUILDING
    ./scripts/1-bucketize.sh 1 > ./logs/1-bucketize.000001.out 2>&1
