Skip to content

Fix: Not all or the wrong files are unstaged from the temporary directory to the work directory#2035

Closed
Lehmann-Fabian wants to merge 13 commits intonextflow-io:fix-globstar-patternfrom
Lehmann-Fabian:master
Closed

Fix: Not all or the wrong files are unstaged from the temporary directory to the work directory#2035
Lehmann-Fabian wants to merge 13 commits intonextflow-io:fix-globstar-patternfrom
Lehmann-Fabian:master

Conversation

@Lehmann-Fabian
Copy link
Contributor

Using Nextflow with Kubernetes I realized that sometimes files are not found, or the wrong ones.
I found the issue in the copying/moving strategy. The current strategy used, does not support the glob notation.
For example, if I define “work/*/*.tif” as an output, it generates the following unstage command:
mkdir -p /workdir/work/ea/f44f2e36fe3bba9eae4d4ecbdb18d3/mask/* && cp -fRL mask/*/*.tif /workdir/work/ea/f44f2e36fe3bba9eae4d4ecbdb18d3/mask/* || true
That leads to a folder mask/mask/ with all the tifs in it.

Furthermore, running that script:

process a {

    echo true

    output:
    file("a/*/*.txt") into found

    """
    mkdir -p a/a b/b c/c
    touch a/1.txt
    touch b/1.txt
    touch c/1.txt
    touch a/a/2.txt
    touch b/b/2.txt
    touch c/c/2.txt
    """

}

found.view()

results in:
/workdir/work/c9/9254056eb2a43f396ef6c564b7ee54/a/*/2.txt

changing the output to “?/*/*.txt” results again in one file:
/workdir/work/61/708e9652d22b6970511dc8b09eb1d4/?/*/2.txt

To overcome these issues, I changed the unstage strategy and adjusted the tests accordingly. Moreover, I wrote some tests to confirm also the bash command’s execution results. That was missing until now, why the following test cases were excepted as valid, however, they will generate the wrong directory hierarchy.

mkdir -p /to/dir/path_name && cp -fRL path_name/ /to/dir/path_name
mkdir -p /to/dir/path_name && mv -f path_name/ /to/dir/path_name 

While my new proposed solution passes all the new tests, the old one failed in 21 out of 31 case for ‘should move the right files’. The same applies to ‘should copy the right files’.

I have tested my fix only in regard to the usage of temporary directories in Kubernetes, however, it should work for all execution engines, where temporary directories are used. I assume the same errors occur there, I have observed it in Kubernetes. This fix should also apply to them.

Additional notes:

In the getUnstageOutputFilesScript method I do not normalize the Glob anymore. This is now done while generating the stageOutCommand. I wrote a new method, transforming a glob expression to a regex expression find understands.

An important aspect is, that the cp command does not yet overwrite files, to avoid copying files more than once. This is fine, if a file is copied into an empty work directory. However, if this command is used to copy into a directory already containing files, new files, with the same name, are not copied. Depending, where this method will also be used, this has to be changed.
Moreover, as the current implementation resolves symlinks, my proposed solution does it again. This is problematic, if the output matches a file, that was staged in. Adding the -P option to cp would overcome this. Otherwise, it could generate a lot of IO-overhead for huge files.

How to evaluate my solution:

My version can be tested, by just adding this to an execution on Kubernetes: -pod-image fabianlehmann/nextflow:fix
If you have any questions, do not hesitate to come back to me.

… are failing for cp and mv

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…tory to the work directory, adjusted tests accordingly

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@pditommaso
Copy link
Member

Thanks for this PR, however I have the impression that the use of find is overkill and adds a lot of complexity to map to glob to the regex. My gut feeling the approach shown below should solve the problem

IFS=\$'\\n'
for name in \$(eval "ls -1d ${escape.join(' ')}" | sort | uniq); do
uploads+=("nxf_s3_upload '\$name' s3:/${Escape.path(targetDir)}")
done
unset IFS

@Lehmann-Fabian
Copy link
Contributor Author

Thank you @pditommaso for reviewing my PR. I tested your proposed solution with the following code.

protected String stageOutCommand( String source, String target, String mode ) {
        final String pattern = removeGlobStar( source )

        if( !pattern )
            return null

        final String escape = Escape.path( pattern )

        return """\
            IFS=\$'\\n'
            for name in \$(eval "ls -1d ${escape.replaceAll(" ","\\ ")}" | sort | uniq); do
                cp -fRLn --parents "\$name" ${Escape.path(target)}
            done
            unset IFS""".stripIndent(true)
}

However, the problem is, that this solution copies more files than necessary, as the removeGlobStar method makes the search term less precise. This will lead to additional IO, as more files are moved or copied to a shared file system. For the move mode, this could also lead to data loss, if a second output is moved to another directory. Moreover, it seems that the script fails, if no file matches the pattern.
For this proposed solution 6 of the 24 cases in the 'should copy the right files' test fail (One due to some special characters, which could be fixed), the others due to copying to many files.
These problems will also apply to the AWSBatchFileCopyStrategy, where more data stored on S3 lead to direct costs.

I think, my proposed glob to regex conversion is a very powerful way to deal with that glob / unix inconsistencies introduced in Nextflow. But you are right, it adds more complexity. But in my opinion, it is the best solution of all three (current state (which does not work for many cases) and adopting the way as you do it for AWS (which copies/moves to much files)).
What do you think?

@pditommaso
Copy link
Member

Why it should copy more files than expected? if I'm understanding correctly your approach converts to glob to a regex and then evaluate it with find, I'm suggesting evaluating the glob with ls command. Avoid the intermediate format would help prevent possible inconsistencies.

@Lehmann-Fabian
Copy link
Contributor Author

Correct me, if I’m wrong, but to evaluate the glob within the ls command, I have to use the removeGlobStar method?
The removeGlobStar method transforms something like 'some/**/path' to 'some*' which will also apply for 'some/a/b', which should not be copied.
That is why I transform the glob into a regex. The regex supports '**' and '*' as they are intended.
Changing the removeGlobStar method in a way, that it replaces '**' with '*' would also lead to cases where more files match the pattern than intended.
'a/**/b*.txt' would then also match 'a/path/blob/data.txt'.
Is that what you were asking for?

@pditommaso
Copy link
Member

pditommaso commented Apr 22, 2021

Got your point, I was missing the need for **. What about relying on the bash globstar option?

@Lehmann-Fabian
Copy link
Contributor Author

This looks really promising. I just had some tests and it should work. I will adjust my solution on Monday. However, the globstar requires Bash 4.0, does that violate some assumptions of Nextflow?

@pditommaso
Copy link
Member

the best would be trying setting it and fallback to default otherwise, since it would not break it

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…ccordingly and added some new

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian
Copy link
Contributor Author

I updated the code in a way that it uses Bash’s glob instead of converting Java’s glob to a regex. Furthermore, I added more test cases. In contrast to the regex approach it requires Bash 4.0 instead of Shell.
I’m looking forward to your feedback.

Copy link
Member

@pditommaso pditommaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR and my apology for the late reply. I think this contribution is valuable, the only concern is about possible side effects that are not possible to predict at this time that could affect existing pipelines.

My proposal is to enable this enhancement as an opt-in option that could be controlled via the stageInMode stageOutMode. it could be introduced, for example, the option copy-allow-globs that would enable the mechanism you are proposing, and continuing to use the current implementation for all other cases.

This would allow us to start using this feature without worrying about backward compatibility and promote to default in the future.

What do you think ?

@Lehmann-Fabian
Copy link
Contributor Author

Thanks @pditommaso for your feedback.
I understand your concerns regarding unexpected issues, but as I wrote in the PR description the behavior of the staging in Kubernetes is currently in most cases different to the local environment. As it copies either wrong files or the right ones to another path. It only works, if we copy files in the first hierarchy or without glob in path.
For most of the users the behavior I create with that PR is the expected one. That’s why I do not think we should put it into a new, not default option.
To avoid problems with older bash versions, while keeping it working for the current cases, I would suggest making the glob notation optional adding || true:
shopt -s globstar extglob || true
shopt -u globstar extglob || true
Accordingly, it will work for files on the first hierarchy and for files with a concrete path as it does now.

@pditommaso
Copy link
Member

The main problem of this approach is that uses options that are not portable e.g cp -n is not working with the cp command distributed with alpine and busybox. cp --parents does not work in mac. This is was suggesting to add this as an opt-in feature via stageOutMode.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian
Copy link
Contributor Author

Thanks for the input, I was not aware of both. I removed the -n as it was not required anymore since the last changes where I process all outputs at once.
As the unstage method is only used in cluster environments I do not see issues with MacOS, as this will never be Macs. Am I wrong? If this is the case, it is possible to use rsync instead of cp, if it runs on Mac. Unfortunately I have no Apple available to test something like this.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…already copied.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian
Copy link
Contributor Author

@pditommaso, I changed the copying logic to avoid the --parents attribute, therefore I simply used the logic which I implemented for the mv command already.
Furthermore, I added a check whether a file was already copied, as it could happen, that we have a glob like **/a/ and a structure like a/a/a/. I forgot this in my last comment.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 16be72d to b7fdffa Compare May 17, 2021 07:07
@Lehmann-Fabian Lehmann-Fabian requested a review from pditommaso May 21, 2021 14:43
@pditommaso
Copy link
Member

pditommaso commented Jun 13, 2021

I needed to make some refactoring to fix another bug (#2141) that overlap with some of this logic. The plan to address the problem in this PR is to change the unstage output code as shown below

final patterns = normalizeGlobStarPaths(outputFiles)
// create a bash script that will copy the out file to the working directory
log.trace "Unstaging file path: $patterns"
if( !patterns )
return null
final escape = new ArrayList(outputFiles.size())
for( String it : patterns )
escape.add( Escape.path(it) )
final mode = stageoutMode ?: ( workDir==targetDir ? 'copy' : 'move' )
return """\
IFS=\$'\\n'
for name in \$(shopt -s globstar extglob; eval "ls -1d ${escape.join(' ')}" | sort | uniq); do
${stageOutCommand('$name', targetDir, mode)} || true
done
unset IFS""".stripIndent(true)

If the shopt -s globstar extglob is not supported, the subprocess just ignore it reporting a warning message.

For each of the file resolved, a bash function is invoked depending the stage mode used (copy, move, rsync)

nxf_fs_copy() {
local source=$1
local target=$2
local basedir=$(dirname $1)
mkdir -p $target/$basedir
cp -fRL $source $target/$basedir
}
nxf_fs_move() {
local source=$1
local target=$2
local basedir=$(dirname $1)
mkdir -p $target/$basedir
mv -f $source $target/$basedir
}
nxf_fs_rsync() {
rsync -rRl $1 $2
}

The benefit of this approach is that it follows exactly the same pattern used for aws, azure and google cloud providers. This makes much easier to maintain and in the future, it could be managed in a common abstraction.

Still haven't time to check if it addresses all use cases included in this PR, but I think there's space to improve this implementation. What do you think?

For details check the branch fix-globstar-pattern, in particular the commit 700e4de

@Lehmann-Fabian
Copy link
Contributor Author

Hello @pditommaso, this seems to be exactly what I implemented before. Sure, I can put my logic into methods instead of writing always the copy command, but I do not see more differences.
However, my branch adds also test cases, which perhaps would not all pass here. There are some weird special cases, with special characters in file names. These cases are addressed in my commit.
Maybe you can merge my commit into the fix-globstar-pattern Branch, merging the idea of having everything in methods with my logic around it and the test cases. I think, then we have a very stable solution. Otherwise, I continue working in this branch, and it does not match with yours in the end, as we are really implementing the same. And merging becomes failure-prone.

@pditommaso
Copy link
Member

I've realised that I've forgotten to remove the use normalizeGlobStarPaths. Let's continue this PR using fix-globstar-pattern as base. I'll change it now.

@pditommaso pditommaso changed the base branch from master to fix-globstar-pattern June 14, 2021 12:34
@pditommaso
Copy link
Member

Not surprising there are conflicts. If you have a chance to resolve them it would be great, otherwise, I'll look at my earliest convenience.

@Lehmann-Fabian
Copy link
Contributor Author

I try to merge it next week.

@pditommaso
Copy link
Member

pditommaso commented Jun 21, 2021

I partially merged this on master 👉 2d4c3df adding a tests for the case reported above

http://github.com/nextflow-io/nextflow/blob/89ce364c93d0a1a2c668004e64ef7c63e496043f/tests/output-globs.nf#L1-L1

However, I've stripped the support for extended globs, ie. **, since I could not tests carefully. Likely makes more sense to close this PR and eventually reopen for the part regarding the extended globs.

@Lehmann-Fabian
Copy link
Contributor Author

Sorry, @pditommaso, I did not yet find the time to merge it, I planned to do it tomorrow.
For me, especially the ** is an important feature, why your solution does not fix all my problems.
I think it is a big issue if Nextflow does not find the same files in a distributed environment than locally. That makes it very hard to debug a Nextflow Workflow.
I see you copied some of my ideas, but you didn’t merge them, so you missed the test cases. I think especially my newly added tests are important to improve Nextflow’s stability. As the current tests only check the generated scripts, but never their results.
I propose that I open a new PR, adding my test cases. The test cases might fail, due to the missing ** support. Furthermore, I spend a lot of time, doing my contribution, checking for special char combinations - I assume also some failures here, but let’s see.
After I reincluded the tests, we can look together for a solution to solve the ** problem, as you didn’t like my proposed one.
How do you think about that? I hope to start tomorrow.

@pditommaso
Copy link
Member

Sure, I understand that. Just wanted to notify you on the current status. I agree, that at this stage it would be better to open a new PR. Thanks a lot.

Lehmann-Fabian added a commit to Lehmann-Fabian/nextflow that referenced this pull request Jun 23, 2021
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Lehmann-Fabian added a commit to Lehmann-Fabian/nextflow that referenced this pull request Jun 23, 2021
…proposed in PR nextflow-io#2035.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Lehmann-Fabian added a commit to Lehmann-Fabian/nextflow that referenced this pull request Jun 23, 2021
…extflow-io#2035

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Lehmann-Fabian added a commit to Lehmann-Fabian/nextflow that referenced this pull request Jun 23, 2021
…io#2035

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian
Copy link
Contributor Author

I opened two new PRs, #2181 adding the test cases from this PR and #2182 where I implemented the fix for the extended glob.
I think this PR can be closed.

@pditommaso
Copy link
Member

Thanks

@pditommaso pditommaso closed this Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants