Conversation
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…proposed in PR nextflow-io#2035. Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…extflow-io#2035 Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
…io#2035 Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
I'm not sure why it is failing. |
| pathes=`ls -1d ${escape.join(' ')} | sort | uniq` | ||
| shopt -u globstar extglob || true | ||
| set -f | ||
| for name in \$pathes; do |
There was a problem hiding this comment.
My understanding is this prevents the wildcards expansion. Why is needed? they should have already been resolved by the ls -1d command
There was a problem hiding this comment.
You are right, set -f prevents it. This is needed, if file names contain a special character like a question mark, which would otherwise be resolved again. There is one test case which fails otherwise.
There was a problem hiding this comment.
Sorry, not fully understanding this. Can you provide a minimal example NF script showing how this works?
There was a problem hiding this comment.
Sure, the following process should only find one file - the one with a question mark in the file name.
process a {
echo true
output:
file("ab\\?.txt") into found
"""
touch abc.txt
touch ab\\?.txt
ls
"""
}
found.view()
Don't disable wildcard expansion would lead to a new expansion in the for loop and once again in the copy/move method. Accordingly, both files will be moved/copied.
There was a problem hiding this comment.
Hi @pditommaso, I’m not sure whether this helped you understand my contribution or you need an additional example. I look forward to your feedback.
There was a problem hiding this comment.
I think a problem can come from the fact that NF expects the use of the glob: option to switch off the matching of the ? special character. My tests:
without glon option
process a {
output:
path "ab?.txt" into found
"""
touch abc.txt
touch ab\\?.txt
ls
"""
}
found.flatten().view()
shell:
» nextflow run test.nf
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test.nf` [elated_nobel] - revision: 198b348c2b
executor > local (1)
[23/6a07e6] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/23/6a07e6c49fe17b96cab4c476a2b83f/ab?.txt
/Users/pditommaso/Projects/nextflow/work/23/6a07e6c49fe17b96cab4c476a2b83f/abc.txt
OK
using glob option
process a {
output:
path "ab?.txt", glob: false into found
"""
touch abc.txt
touch ab\\?.txt
ls
"""
}
found.flatten().view()
Shell:
» nextflow run test.nf
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test.nf` [curious_brown] - revision: 4cae485aee
executor > local (1)
[56/bda665] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/56/bda665ba766227bbecf15e8611cbf6/ab?.txt
OK
glob option + enable scratch option
process a {
output:
path "ab?.txt", glob: false into found
"""
touch abc.txt
touch ab\\?.txt
ls
"""
}
found.flatten().view()
Shell command:
» nextflow run test.nf -process.scratch true
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test.nf` [high_minsky] - revision: 4cae485aee
executor > local (1)
[82/ef2225] process > a [100%] 1 of 1 ✔
/Users/pditommaso/Projects/nextflow/work/82/ef22257c5eb503b4f57e993c8aca51/ab?.txt
OK, though
Work dir:
» ls -lh work/82/ef22257c5eb503b4f57e993c8aca51
total 0
-rw-r--r-- 1 pditommaso staff 0B Oct 6 16:28 ab?.txt
-rw-r--r-- 1 pditommaso staff 0B Oct 6 16:28 abc.txt
I think your concern is the above, that since the ab?.txt was requested, both files were copied from the task scratch copied. Is that right?
There was a problem hiding this comment.
You are right. My concern is that the file is copied/moved anyways. That is why I prevent the second wildcard expansion since the absolute file paths are known then.
a) If you do not see a problem, we can delete the set -f. But this will generate more I/O.
Anyway, I see my commit does not consider the option glob: false. In that case, we should also say set -f instead of shopt -s globstar extglob || true to activate the glob.
b) How do you think about that?
If you agree to both, I will change a and b as proposed.
There was a problem hiding this comment.
I see two problems: 1) handling correctly glob:false 2) handle ** wildcard
For 1, I think it could be addressed replacing Escape.path with Escape.wildcard when glob==false here
For 2, I think it would be preferable to keep eval subshell, I mean having
for name in \$(eval "ls -1d ${escape.join(' ')}" | sort | uniq); do
...
There was a problem hiding this comment.
Let me start with 2).
I found the problem you meant.
There was a problem if the ** was within paths like a/dir**file.txt, but I fixed it in the last commit.
I prepared 4 test cases and tested them three times each.
- with the current Nextflow version
- with the current Nextflow version and
-process.scratch true - with my fixed Nextflow version and
-process.scratch true
For portability, the second and third test case should find the same files as the first one. However, the second and third should only copy the files found.
Testcase 1
process a {
output:
path "{a,b,c}/*" into found
"""
mkdir a b c d
touch file.txt
touch a/file.txt
touch c/file.txt
touch d/file.txt
ls
"""
}
found.flatten().view()
1.
$ ./nextflow run test1.nf
N E X T F L O W ~ version 21.04.3
Launching `test1.nf` [lonely_sanger] - revision: 06c25e47e3
executor > local (1)
[21/781741] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/21/781741acd0b0369ababfe7c881ebc1/a/file.txt
/home/fabian/extendedGlob/work/21/781741acd0b0369ababfe7c881ebc1/c/file.txt
Files in workdir:
work/21/781741acd0b0369ababfe7c881ebc1/
├── a
│ └── file.txt
├── b
├── c
│ └── file.txt
├── d
│ └── file.txt
└── file.txt
4 directories, 4 files
2.
$ ./nextflow run test1.nf -process.scratch true
N E X T F L O W ~ version 21.04.3
Launching `test1.nf` [intergalactic_dijkstra] - revision: 06c25e47e3
executor > local (1)
[5b/b09417] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/a
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/b
/home/fabian/extendedGlob/work/5b/b09417152861aa02b76b6b756f92ae/c/file.txt
Files in workdir:
work/5b/b09417152861aa02b76b6b756f92ae/
├── a
├── b
└── c
├── a
├── b
└── file.txt
5 directories, 1 file
For the current Nextflow version, the files are copied into the wrong hierarchy. Accordingly, the wrong and not all files are found.
3.
$ ./nextflow-fix run test1.nf -process.scratch true
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test1.nf` [gigantic_stone] - revision: 06c25e47e3
executor > local (1)
[89/51d3b7] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/89/51d3b76e56ae498416cc307209a3aa/a/file.txt
/home/fabian/extendedGlob/work/89/51d3b76e56ae498416cc307209a3aa/c/file.txt
Files in workdir:
work/89/51d3b76e56ae498416cc307209a3aa/
├── a
│ └── file.txt
└── c
└── file.txt
2 directories, 2 files
Testcase 2
process a {
output:
path "a/*/c/*" into found
"""
mkdir -p a/b/c/ a/b/d/c/
touch a/b/c/file.txt
touch a/b/c/file2.txt
touch a/b/d/c/file2.txt
touch a/b/file.txt
touch a/file.txt
touch file.txt
ls
"""
}
found.flatten().view()
1.
$ ./nextflow run test2.nf
N E X T F L O W ~ version 21.04.3
Launching `test2.nf` [modest_venter] - revision: 53280a4646
executor > local (1)
[3f/eba9de] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/3f/eba9de27063e039e2d4e5351f15817/a/b/c/file.txt
/home/fabian/extendedGlob/work/3f/eba9de27063e039e2d4e5351f15817/a/b/c/file2.txt
Files in workdir:
work/3f/eba9de27063e039e2d4e5351f15817/
├── a
│ ├── b
│ │ ├── c
│ │ │ ├── file2.txt
│ │ │ └── file.txt
│ │ ├── d
│ │ │ └── c
│ │ │ └── file2.txt
│ │ └── file.txt
│ └── file.txt
└── file.txt
5 directories, 6 files
2.
$ ./nextflow run test2.nf -process.scratch true
N E X T F L O W ~ version 21.04.3
Launching `test2.nf` [romantic_dijkstra] - revision: 53280a4646
executor > local (1)
[23/f55e4c] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/23/f55e4c82bf437bc047c8b670c26457/a/*/c/file.txt
/home/fabian/extendedGlob/work/23/f55e4c82bf437bc047c8b670c26457/a/*/c/file2.txt
Files in workdir:
work/23/f55e4c82bf437bc047c8b670c26457/
└── a
└── *
└── c
├── file2.txt
└── file.txt
3 directories, 2 files
This time the * is somehow used as the output file name. Accordingly, some files are overwritten, and not all files are found in the end.
3.
$ ./nextflow-fix run test2.nf -process.scratch true
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test2.nf` [festering_jang] - revision: 53280a4646
executor > local (1)
[8c/cfc9ff] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/8c/cfc9ffcc0949937b0982ea120e5303/a/b/c/file.txt
/home/fabian/extendedGlob/work/8c/cfc9ffcc0949937b0982ea120e5303/a/b/c/file2.txt
Files in workdir:
work/8c/cfc9ffcc0949937b0982ea120e5303/
└── a
└── b
└── c
├── file2.txt
└── file.txt
3 directories, 2 files
Testcase 3
process a {
output:
path "a/dir*/file.txt" into found
"""
mkdir -p a/dir1/a a/dir2
touch a/dir1/file.txt
touch a/dir2/file.txt
touch a/dir1/a/file.txt
ls
"""
}
found.flatten().view()
1.
$ ./nextflow run test3.nf
N E X T F L O W ~ version 21.04.3
Launching `test3.nf` [gigantic_hodgkin] - revision: 3c4fcfe408
executor > local (1)
[3b/88992b] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/3b/88992b05421a86ecae9cdee5d415d8/a/dir1/file.txt
/home/fabian/extendedGlob/work/3b/88992b05421a86ecae9cdee5d415d8/a/dir2/file.txt
Files in workdir:
work/3b/88992b05421a86ecae9cdee5d415d8/
└── a
├── dir1
│ ├── a
│ │ └── file.txt
│ └── file.txt
└── dir2
└── file.txt
4 directories, 3 files
2.
$ ./nextflow run test3.nf -process.scratch true
N E X T F L O W ~ version 21.04.3
Launching `test3.nf` [magical_rubens] - revision: 3c4fcfe408
executor > local (1)
[f2/adbebd] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/f2/adbebd320ec29728872033ebb09bf4/a/dir*/file.txt
Files in workdir:
work/f2/adbebd320ec29728872033ebb09bf4/
└── a
└── dir*
└── file.txt
2 directories, 1 file
More or less the same behavior as for test case 2.
3.
$ ./nextflow-fix run test3.nf -process.scratch true
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test3.nf` [admiring_blackwell] - revision: 3c4fcfe408
executor > local (1)
[14/06cc88] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/14/06cc88cf25a6e76ec18ef9789c3478/a/dir1/file.txt
/home/fabian/extendedGlob/work/14/06cc88cf25a6e76ec18ef9789c3478/a/dir2/file.txt
Files in workdir:
work/14/06cc88cf25a6e76ec18ef9789c3478/
└── a
├── dir1
│ └── file.txt
└── dir2
└── file.txt
3 directories, 2 files
Testcase 4
process a {
output:
path "a/dir**/file.txt" into found
"""
mkdir -p a/dir1/a/b/ a/dir2 a/dir
touch a/dir/file.txt
touch a/dir1/file.txt
touch a/dir2/file.txt
touch a/dir1/a/file.txt
touch a/dir1/a/b/file.txt
ls
"""
}
found.flatten().view()
1.
$ ./nextflow run test4.nf
N E X T F L O W ~ version 21.04.3
Launching `test4.nf` [lonely_descartes] - revision: 623cbb1c8b
executor > local (1)
[f5/36c1a7] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir1/file.txt
/home/fabian/extendedGlob/work/f5/36c1a762b733e32d9e1bf20cb66780/a/dir2/file.txt
Files in workdir:
work/f5/36c1a762b733e32d9e1bf20cb66780/
└── a
├── dir
│ └── file.txt
├── dir1
│ ├── a
│ │ ├── b
│ │ │ └── file.txt
│ │ └── file.txt
│ └── file.txt
└── dir2
└── file.txt
6 directories, 5 files
2.
$ ./nextflow run test4.nf -process.scratch true
N E X T F L O W ~ version 21.04.3
Launching `test4.nf` [insane_rosalind] - revision: 623cbb1c8b
executor > local (1)
[2f/b19602] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir1/file.txt
/home/fabian/extendedGlob/work/2f/b19602d56e03194db570eb45e697bd/a/dir2/file.txt
Files in workdir:
work/2f/b19602d56e03194db570eb45e697bd/
└── a
├── dir
│ └── file.txt
├── dir1
│ ├── a
│ │ ├── b
│ │ │ └── file.txt
│ │ └── file.txt
│ └── file.txt
└── dir2
└── file.txt
6 directories, 5 files
3.
$ ./nextflow-fix run test4.nf -process.scratch true
N E X T F L O W ~ version 21.09.0-SNAPSHOT
Launching `test4.nf` [happy_golick] - revision: 623cbb1c8b
executor > local (1)
[ba/6f94f6] process > a [100%] 1 of 1 ✔
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/a/b/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/a/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir1/file.txt
/home/fabian/extendedGlob/work/ba/6f94f6ff172a2cd39f4deeea0d42fd/a/dir2/file.txt
Files in workdir:
work/ba/6f94f6ff172a2cd39f4deeea0d42fd/
└── a
├── dir
│ └── file.txt
├── dir1
│ ├── a
│ │ ├── b
│ │ │ └── file.txt
│ │ └── file.txt
│ └── file.txt
└── dir2
└── file.txt
6 directories, 5 files
For ** it also works in my fixed version.
From my tests, you can see that my implementation is consistent with the behavior running without a scratch dir.
I implemented many more test cases here
The tests check if all files are found, but not more.
Let us talk about 1): How to deal with glob:false.
I have not yet implemented it.
Have you a good idea how to get the information whether an output pattern has to be treated as glob:false here:
My idea would be to change this:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskRun.groovy
Lines 426 to 434 in b9edc92
to:
List<Tuple2<String, Boolean>> getOutputFilesNames() {
def result = []
for( FileOutParam param : getOutputsByType(FileOutParam).keySet() ) {
result.addAll( new Tuple2(param.getFilePatterns(context, workDir), param.glob) )
}
return result.unique()
}
and adjust all calls accordingly.
But I think there is a better way?
There was a problem hiding this comment.
commenting in the main thread to have more space
30b2ef4 to
cc77472
Compare
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
I think most (all) of the problems you are reporting using the current master version (think 21.08.0-edge) is also ok. I've repeated the tests for your convenience Test case 11. plainOK 2. with scratchOK Test case 21. plain2. with scratchOK Testcase 31. plainOK 2. with scracthTestcase 41. plain2. with scratchAlso when using |
|
Regarding |
|
Ahh, somehow, Nextflow did not execute the latest version, even when I downloaded it. Testcase 11.Files in workdir: 2.Files in workdir: Here to much data is copied. This could be problematic, as it could be dozens of gigabytes of temporary data. My solution only copies the required files: 3.Files in workdir: Testcase 21.Files in workdir: 2.Files in workdir: In this case the wrong file is copied, while my solution finds it: 3.Files in workdir: Testcase 31.Files in workdir: 2.Files in workdir: Again, the file is not copied and accordingly the pipeline crashed. My solution worked well: 3.Files in workdir: |
6564123 to
c78965b
Compare
|
regarding the first, should it traverse dirs? I'm getting this |
|
Yes, the first run for the first test case is without With that change globstar always traverses through all levels. |
acb53ee to
2c6665e
Compare
|
Hi @pditommaso. As the PR is still open, I wanted to ask you how to proceed. Otherwise, we can close the PR. |
|
I think there are at least two issues here: 1) the correct handling of
|
|
|
Good point. Yes, I think the support for extended glob should be done as PR forked from #2379. Regarding the second point ideally, it should be the default behaviour i.e. case a) in your comment. Though, I don't have a clean idea of the impact of this. |
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
|
Ok, thanks |
This PR solves the problems with the extended glob, which can be seen in #2181. Now it passes all tests.
I promised this fix in #2035.