Slow DAG postprocessing on high-latency filesystems #2920

@benjeffery

Description

I'm running a workflow with ~100k jobs on an HPC where filesystem latency is relatively high. (Compute and filesystem are not co-located for reasons outside my control).

DAG building for this workflow takes ~45 minutes, almost all of it in DAG.update_needrun while it performs file existence and mtime checks. For the existence checks, I've hacked in a cache (only active during update_needrun) that uses os.listdir in _IOFile.exists_local:

    async def exists_local(self):
        # Fall back to a direct stat when caching is disabled.
        if not cache_on:
            return os.path.exists(self.file)
        _dir = os.path.dirname(self.file)
        if _dir not in _local_cache:
            if not os.path.exists(_dir):
                # Missing directory: none of its files can exist.
                _local_cache[_dir] = set()
            else:
                # One listdir call replaces a stat per file in this directory;
                # a set gives O(1) membership checks.
                _local_cache[_dir] = set(os.listdir(_dir))
        return os.path.basename(self.file) in _local_cache[_dir]

Using os.listdir greatly reduces the number of filesystem hits and halves the runtime of update_needrun; the remaining time is spent on mtime checks.
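The remaining mtime time could plausibly be cached the same way: os.scandir yields DirEntry objects whose stat() results are often served from the directory scan itself (on Linux, at least), so one pass per directory can answer both existence and mtime queries. A minimal sketch of the idea, standalone rather than hooked into Snakemake, with a hypothetical DirCache class:

```python
import os


class DirCache:
    """Cache one os.scandir() pass per directory, serving both
    existence and mtime lookups from a single filesystem round trip."""

    def __init__(self):
        # directory path -> {entry name: mtime (or None if stat failed)}
        self._dirs = {}

    def _load(self, directory):
        entries = {}
        try:
            with os.scandir(directory) as it:
                for entry in it:
                    try:
                        # On most platforms this reuses data from the
                        # scandir call rather than issuing a new stat.
                        entries[entry.name] = entry.stat().st_mtime
                    except OSError:
                        # e.g. broken symlink; record presence only.
                        entries[entry.name] = None
        except FileNotFoundError:
            pass  # missing directory -> no entries
        self._dirs[directory] = entries
        return entries

    def exists(self, path):
        directory, name = os.path.split(path)
        entries = self._dirs.get(directory)
        if entries is None:
            entries = self._load(directory)
        return name in entries

    def mtime(self, path):
        """Return the cached mtime, or None if the file is absent
        or its stat failed during the scan."""
        directory, name = os.path.split(path)
        entries = self._dirs.get(directory)
        if entries is None:
            entries = self._load(directory)
        return entries.get(name)
```

As with the existence cache, this is only safe while no rule is writing into the scanned directories, so it would need the same "active only during update_needrun" guard.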

Does a mechanism like this make sense? I would be happy to work up a proper PR if so.

Many thanks!

Metadata

Labels: enhancement (New feature or request)