Slow DAG postprocessing on high-latency filesystems #2920

@benjeffery

Description

I'm running a workflow with ~100k jobs on an HPC where filesystem latency is relatively high. (Compute and filesystem are not co-located for reasons outside my control).

DAG building for this workflow takes ~45 minutes, almost all of it in DAG.update_needrun while it performs file existence and mtime checks. For the existence checks, I've hacked in a cache (only active during update_needrun) that uses os.listdir in _IOFile.exists_local:

    async def exists_local(self):
        # Fall back to a direct stat when caching is disabled.
        if not cache_on:
            return os.path.exists(self.file)
        _dir = os.path.dirname(self.file)
        if _dir not in _local_cache:
            if not os.path.exists(_dir):
                # Missing directory: none of its files can exist.
                _local_cache[_dir] = set()
            else:
                # One listdir call replaces a stat per file in this directory;
                # a set gives O(1) membership checks.
                _local_cache[_dir] = set(os.listdir(_dir))
        return os.path.basename(self.file) in _local_cache[_dir]

Using os.listdir greatly reduces the number of filesystem hits and halves the runtime of update_needrun; the remaining time is spent on mtime checks.
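The remaining mtime time could plausibly be cached the same way: os.scandir yields DirEntry objects whose stat() results are often served from the directory scan itself (on Linux, at least), so one pass per directory can answer both existence and mtime queries. A minimal sketch of the idea, standalone rather than hooked into Snakemake, with a hypothetical DirCache class:

```python
import os


class DirCache:
    """Cache one os.scandir() pass per directory, serving both
    existence and mtime lookups from a single filesystem round trip."""

    def __init__(self):
        # directory path -> {entry name: mtime (or None if stat failed)}
        self._dirs = {}

    def _load(self, directory):
        entries = {}
        try:
            with os.scandir(directory) as it:
                for entry in it:
                    try:
                        # On most platforms this reuses data from the
                        # scandir call rather than issuing a new stat.
                        entries[entry.name] = entry.stat().st_mtime
                    except OSError:
                        # e.g. broken symlink; record presence only.
                        entries[entry.name] = None
        except FileNotFoundError:
            pass  # missing directory -> no entries
        self._dirs[directory] = entries
        return entries

    def exists(self, path):
        directory, name = os.path.split(path)
        entries = self._dirs.get(directory)
        if entries is None:
            entries = self._load(directory)
        return name in entries

    def mtime(self, path):
        """Return the cached mtime, or None if the file is absent
        or its stat failed during the scan."""
        directory, name = os.path.split(path)
        entries = self._dirs.get(directory)
        if entries is None:
            entries = self._load(directory)
        return entries.get(name)
```

As with the existence cache, this is only safe while no rule is writing into the scanned directories, so it would need the same "active only during update_needrun" guard.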

Does a mechanism like this make sense? I would be happy to work up a proper PR if so.

Many thanks!

Metadata

Labels: enhancement (New feature or request)