Fix: Not all or the wrong files are unstaged from the temporary directory to the work directory#2035
Fix: Not all or the wrong files are unstaged from the temporary directory to the work directory#2035Lehmann-Fabian wants to merge 13 commits intonextflow-io:fix-globstar-patternfrom
Conversation
… are failing for cp and mv Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…tory to the work directory, adjusted tests accordingly Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
Thanks for this PR, however I have the impression that the use of |
|
Thank you @pditommaso for reviewing my PR. I tested your proposed solution with the following code. However, the problem is, that this solution copies more files than necessary, as the removeGlobStar method makes the search term less precise. This will lead to additional IO, as more files are moved or copied to a shared file system. For the move mode, this could also lead to data loss, if a second output is moved to another directory. Moreover, it seems that the script fails, if no file matches the pattern. I think, my proposed glob to regex conversion is a very powerful way to deal with that glob / unix inconsistencies introduced in Nextflow. But you are right, it adds more complexity. But in my opinion, it is the best solution of all three (current state (which does not work for many cases) and adopting the way as you do it for AWS (which copies/moves to much files)). |
|
Why it should copy more files than expected? if I'm understanding correctly your approach converts to glob to a regex and then evaluate it with |
|
Correct me, if I’m wrong, but to evaluate the glob within the |
|
Got your point, I was missing the need for |
|
This looks really promising. I just had some tests and it should work. I will adjust my solution on Monday. However, the globstar requires Bash 4.0, does that violate some assumptions of Nextflow? |
|
the best would be trying setting it and fallback to default otherwise, since it would not break it |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…ccordingly and added some new Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
I updated the code in a way that it uses Bash’s glob instead of converting Java’s glob to a regex. Furthermore, I added more test cases. In contrast to the regex approach it requires Bash 4.0 instead of Shell. |
There was a problem hiding this comment.
Thanks for this PR and my apology for the late reply. I think this contribution is valuable, the only concern is about possible side effects that are not possible to predict at this time that could affect existing pipelines.
My proposal is to enable this enhancement as an opt-in option that could be controlled via the stageInModestageOutMode. it could be introduced, for example, the option copy-allow-globs that would enable the mechanism you are proposing, and continuing to use the current implementation for all other cases.
This would allow us to start using this feature without worrying about backward compatibility and promote to default in the future.
What do you think ?
|
Thanks @pditommaso for your feedback. |
|
The main problem of this approach is that uses options that are not portable e.g |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
Thanks for the input, I was not aware of both. I removed the |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…already copied. Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
@pditommaso, I changed the copying logic to avoid the |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
16be72d to
b7fdffa
Compare
|
I needed to make some refactoring to fix another bug (#2141) that overlap with some of this logic. The plan to address the problem in this PR is to change the unstage output code as shown below If the For each of the file resolved, a bash function is invoked depending the stage mode used (copy, move, rsync) The benefit of this approach is that it follows exactly the same pattern used for aws, azure and google cloud providers. This makes much easier to maintain and in the future, it could be managed in a common abstraction. Still haven't time to check if it addresses all use cases included in this PR, but I think there's space to improve this implementation. What do you think? For details check the branch fix-globstar-pattern, in particular the commit 700e4de |
|
Hello @pditommaso, this seems to be exactly what I implemented before. Sure, I can put my logic into methods instead of writing always the copy command, but I do not see more differences. |
|
I've realised that I've forgotten to remove the use |
|
Not surprising there are conflicts. If you have a chance to resolve them it would be great, otherwise, I'll look at my earliest convenience. |
|
I try to merge it next week. |
|
I partially merged this on master 👉 2d4c3df adding a tests for the case reported above However, I've stripped the support for extended globs, ie. |
|
Sorry, @pditommaso, I did not yet find the time to merge it, I planned to do it tomorrow. |
|
Sure, I understand that. Just wanted to notify you on the current status. I agree, that at this stage it would be better to open a new PR. Thanks a lot. |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…proposed in PR nextflow-io#2035. Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…extflow-io#2035 Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…io#2035 Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
Thanks |
Using Nextflow with Kubernetes I realized that sometimes files are not found, or the wrong ones.
I found the issue in the copying/moving strategy. The current strategy used, does not support the glob notation.
For example, if I define
“work/*/*.tif”as an output, it generates the following unstage command:mkdir -p /workdir/work/ea/f44f2e36fe3bba9eae4d4ecbdb18d3/mask/* && cp -fRL mask/*/*.tif /workdir/work/ea/f44f2e36fe3bba9eae4d4ecbdb18d3/mask/* || trueThat leads to a folder mask/mask/ with all the tifs in it.
Furthermore, running that script:
results in:
/workdir/work/c9/9254056eb2a43f396ef6c564b7ee54/a/*/2.txt
changing the output to
“?/*/*.txt”results again in one file:/workdir/work/61/708e9652d22b6970511dc8b09eb1d4/?/*/2.txt
To overcome these issues, I changed the unstage strategy and adjusted the tests accordingly. Moreover, I wrote some tests to confirm also the bash command’s execution results. That was missing until now, why the following test cases were excepted as valid, however, they will generate the wrong directory hierarchy.
While my new proposed solution passes all the new tests, the old one failed in 21 out of 31 case for ‘should move the right files’. The same applies to ‘should copy the right files’.
I have tested my fix only in regard to the usage of temporary directories in Kubernetes, however, it should work for all execution engines, where temporary directories are used. I assume the same errors occur there, I have observed it in Kubernetes. This fix should also apply to them.
Additional notes:
In the getUnstageOutputFilesScript method I do not normalize the Glob anymore. This is now done while generating the stageOutCommand. I wrote a new method, transforming a glob expression to a regex expression find understands.
An important aspect is, that the cp command does not yet overwrite files, to avoid copying files more than once. This is fine, if a file is copied into an empty work directory. However, if this command is used to copy into a directory already containing files, new files, with the same name, are not copied. Depending, where this method will also be used, this has to be changed.
Moreover, as the current implementation resolves symlinks, my proposed solution does it again. This is problematic, if the output matches a file, that was staged in. Adding the -P option to cp would overcome this. Otherwise, it could generate a lot of IO-overhead for huge files.
How to evaluate my solution:
My version can be tested, by just adding this to an execution on Kubernetes: -pod-image fabianlehmann/nextflow:fix
If you have any questions, do not hesitate to come back to me.