Open
Labels: enhancement (New feature or request)
Description
I'm running a workflow with ~100k jobs on an HPC cluster where filesystem latency is relatively high. (Compute and filesystem are not co-located, for reasons outside my control.)
DAG building for this is taking ~45 minutes, almost all of which is in DAG.update_needrun while it performs file existence and mtime checks. For the file existence checks I've hacked in a cache (only active during update_needrun) which uses os.listdir in _ioFile.exists_local:

    async def exists_local(self):
        if not cache_on:
            return os.path.exists(self.file)
        _dir = os.path.dirname(self.file)
        if _dir not in _local_cache:
            if not os.path.exists(_dir):
                _local_cache[_dir] = []
            else:
                _local_cache[_dir] = os.listdir(_dir)
        return os.path.basename(self.file) in _local_cache[_dir]

Using os.listdir greatly reduces the number of hits to the filesystem: one directory listing replaces a separate existence check for every file in that directory. This halves the runtime of update_needrun; the remaining time is spent on mtime checks.
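To make the idea easier to discuss outside of the Snakemake internals, here is a minimal standalone sketch of the same directory-listing cache. The class name `DirListingCache` and its `exists`/`clear` methods are hypothetical and not part of Snakemake; the sketch also assumes paths have no trailing slash and that nothing creates or deletes files while the cache is live (as during update_needrun). It stores names in a set rather than a list, and treats a bare filename as living in the current directory.

```python
import os


class DirListingCache:
    """Answer existence checks from one os.listdir call per directory."""

    def __init__(self):
        # Maps a directory path to the frozenset of entry names found in it.
        self._entries = {}

    def exists(self, path):
        # os.path.dirname returns "" for a bare filename; use "." instead.
        directory = os.path.dirname(path) or "."
        names = self._entries.get(directory)
        if names is None:
            try:
                names = frozenset(os.listdir(directory))
            except (FileNotFoundError, NotADirectoryError):
                # A missing directory cannot contain the file.
                names = frozenset()
            self._entries[directory] = names
        return os.path.basename(path) in names

    def clear(self):
        # Drop all cached listings, e.g. once DAG building has finished.
        self._entries.clear()
```

One difference from os.path.exists worth noting: os.listdir also returns broken symlinks, so this sketch would report them as existing, whereas os.path.exists would not.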
Does a mechanism like this make sense? I would be happy to work up a proper PR if so.
Many thanks!