Add a general SAM/BAM/CRAM input filter option (HTS_OPT_SAM_FILTER).#1181
Add a general SAM/BAM/CRAM input filter option (HTS_OPT_SAM_FILTER).#1181jkbonfield wants to merge 7 commits intosamtools:developfrom
Conversation
|
The Examples: |
|
Things that may need work.
|
|
I think we should use the names in the specification, as the ones in the source are probably less well known. Or possibly we could allow aliases. The lack of a regex library in Windows is a bit annoying. We could paper over it by using PCRE, but I suspect it might result in regexps that work differently between Windows and everything else. Other alternatives available in MSYS2 might be TRE or Henry Spencer's library. |
|
For now I'll just do some configure checks and get it to build without regex support. It's useful anyway, and people can kludge in pcre themselves via CPPFLAGS and LDFLAGS if they wish in which case configure will reenable the support. |
|
Actually, bcftools uses regcomp, and has a Windows solution already. So it doesn't look to be a big problem after all. |
expr.c
Outdated
| static int expression(sam_filter_t *filt, void *data, sym_func *fn, | ||
| char *str, char **end, fexpr_t *res); | ||
|
|
||
| // Copied from htslib's textutils_internal.h. Delete after migration |
There was a problem hiding this comment.
You need to follow the instructions in this comment!
There was a problem hiding this comment.
Oops. Also now done.
|
Updates.
|
This uses an expression language and SAM format fields to specify
rules which must pass (so filter-in rather than filter-out).
The code structure is expr.[ch] implements a generic expression
language which accepts a function callback for performing variable
lookup, with sam.c:bam_sym_lookup() being that function.
The sam_read1 function has been changed to keep looping until the
filter passes, or we run out of data.
This works for all file formats, but does so by adding a new member to
htsFile. Technically it's an ABI change, but at least our own usage
of it never uses arrays of them and it's stated in the header to be an
opaque type which in theory should also ban that, albeit without any
compiler checks). If this is a problem, would could add the same
member to BGZF, cram_fd and hFILE as they're all private, but that
feels like needless hoop jumping.
Syntax supports
Grouping: (, ), eg "(1+2)*3"
Values: integers, floats, strings or variables
Unary ops: +, -, !, ~ eg -10 +10, !10 (0), ~5 (bitwise not)
Math ops: *, /, % [TODO: add // for floor division?]
Math ops: +, -
Bit-wise: &, |, ^ [NB as 3 precedence levels, in that order]
Conditionals: >, >=, <, <=,
Equality: ==, !=, =~ !~
Boolean: &&, ||
Data types are numerics (integers or doubles, eg 10, -0x2B, 10.2e-7)
and strings "str" with backslash as an escape char within strings.
The key-words currently supported are:
flag int
flag.paired int
flag.proper_pair int
flag.unmap int
flag.munmap int
flag.reverse int
flag.mreverse int
flag.read1 int
flag.read2 int
flag.secondary int
flag.qcfail int
flag.dup int
flag.supplementary int
mapq int
mpos / pnext int
mrefid int
mrname / rnext string
ncigar int
pos int
qlen int
qname string
refid int
rlen int
rname string
tlen int
[XX] tag int / string
For example, to return only records where the mate reference name
differs to the mapped query name and the read is aligned with high
mapping quality:
test_view -i 'sam_filter=mqual >= 30 && mrname != rname' in.bam
Filters to find significant deletions:
cigar =~ "[0-9]{2}D"
rlen - qlen > 10
Duplicates that aren't part of a "proper pair" (equivalent to
"samtools view -F 2 -f 0x400")
sam_filter=flag.dup && !flag.proper_pair
|
While this is specifically for SAM filtering should we be designing it with an eye for adding VCF filters later, even if we don't want to add them now? For example, we could have |
|
I actually started with HTS_OPT_FILTER but renamed it because the filter is explicitly for SAM with sam keywords. Indeed it was thinking about other format filters that made me rename it to HTS_OPT_SAM_FILTER. Good point about data types though. I'll rename those. The design was always to separate language from data-specific keywords. Hence expr.c as its own file. Note however the bcftools filtering is far more powerful, with functions, arrays, etc. |
|
Wow — I thought I saw this gestating as a samtools-specific thing… Is the reason for moving this to htslib to make the functionality available to other programs if they added appropriate function calls, or to make it available to other programs (via user-settable format options on the command line) even without their knowledge or changes being made to their code? The latter might not work too well as not too many programs make input file format options accessible to the user, so they would need source code changes to enable this anyway. The former (i.e., making this available via an additional API call like (Example of the lack of flexibility of the latter within- At present, the PR implements the latter via Re naming, and potentially implementing this for non-SAM So if pursuing the implicit filtering-as-format-option approach, I think either there should be |
|
Thanks for the comments. I agree with the general thrust of it and I think it's relatively easy modifications to make it a little more flexible so it can encompass both the way it works now as well as a more programmatic API instead.
It started off a week ago in samtools as a draft. However there are considerable advantages to putting it into htslib.
Note this doesn't mean it won't also be available by easier and friendlier command line options. I'll probably still keep something like
All of our programs do. Any third party tool that doesn't make these options available isn't going to break. It simply won't be able to support filtering. That'd be true though if we added it to samtools view via its own dedicated command line option too, so I don't see any difference there.
What you're describing is already how it works, but it's just a matter of deciding whether we wish to expose that interface. Right now the option is calling That in turn uses
Why does it have to be either/or? Yes you have more flexibility by calling
Gah, I chose Obviously I need to fix that. I know it's a little hacky by design. The choice was either a few lines added to the top and end, a mammoth reindentation of the entire function making PR checking hard, or restructuring into multiple functions (that's probably long overdue tbh!). I went with the simplest to review, but I'll consider a wider restructure then.
Agreed, and Rob's comments above were similar. I deliberately designed it to have a non-SAM specific component with a view to using in other formats in the future, but somewhere along the line forgot this in my function naming. At present it's too simple to replace the bcftools filtering, but it doesn't have to arrive fully complete on day 1 as long as the design doesn't forbid future work. I'll edit expr.c to use hts prefixes instead of sam prefixes. Good point on doing introspection on the file format type to decide which format specific symbol-lookup function to pass in. That may mean going back to the generalised option is better again. I'll make the other changes and then see how it's looking. |
|
Hopefully this now addresses all the comments. Thanks for raising them. It's now more public than it was before, to facilitate being able to call in and do your own filtering where and when you want instead of being required to use the hts_opt string interface. This exposes more, so I've done a lot of data type and function renaming. The general structure is: |
- Rename all the sam_* types and functions in expr.c to hts_*. This file is intended to be file format agnostic. A few key functions here are now externally visible too. - Renamed fexpr_t to hts_expr_val_t. Similarly the other bits named after fexpr. It was a poor choice as it's not a filter expression, but a *value*. - Renamed expr.[ch] to hts_expr.[ch]. The header is now public too, but included by hts.h so it does not need to be explicitly included. - Fixed a bug where the multi-region CRAM iterator wasn't using the filter (it calls cram_readrec). All other iterators, both single and multi, worked fine. - Added a simpler sam_passes_filter function that combines hts_filter_eval with the internal bam_sym_lookup. This is an externally visible function. - Rename sym_func to hts_expr_sym_func. - Add more documentation. - Renamed HTS_OPT_SAM_FILTER back to HTS_OPT_FILTER. It doesn't yet check if you attempt to specify this on e.g. a VCF file, but neither does it break as it's simply ignored. - Restructured sam_read1 into a series of internal format-specific functions. This makes the body of sam_read1 itself far simpler. This now means we can do sensible looping without concern over indenting an enormous chunk of code, and the loop is now small enough to see the logic in a single page.
|
I need some help to make it work on Windows though. I don't understand the nuances of our symbol export code. I've added HTSLIB_EXPORT macros, but it's still not sufficient for Windows. I included hts_defs.h too which defines it, but that's not enough either. Guidance please! The error is: |
|
Thoughts on some of the terms we filter on.
Should I add There are other potentially useful functions. Eg we could have |
"seq", "qual" and "library" are string variables, so we can now match
them with e.g. seq =~ "(AT){10}".
We have basic functions that operate on strings, interpreting them as
arrays of characters. (This could be updated for B aux arrays and
larger objects in the future.) Functions are length, min, max and
avg. For example avg(qual) or length(seq). Note qual here is binary,
so NOT the ASCII phred+33. I wouldn't advise trying to match it in a
regexp.
"qlen" was previously length of sequence, and not the value of
cigar2qlen. This differs when we have CIGAR but seq is "*". Use the
cigar2qlen definition so it is symmetric with our "rlen" variable.
To permit the previous definition, we can now do "length(seq)".
Note that min and max of empty string is NAN. This isn't treated as
an error. This permits us to do "avg(qual) > 10" where some entries
have seq and qual both as "*" (ie zero length).
|
In absence of any comment either way, I decided to see how complex it is to add basic functions and it's easy enough. Not an entirely clean design, but good enough for now. (It accepts So I implemented the list of thoughts above. They're not brilliantly efficient, but if you really want to do filtering on sequence by regexp then you now can! By quality may be a tad heroic though... |
|
I think your appveyor failure here is caused by a lack of See for example https://github.com/samtools/htslib/blob/238600/hts.c#L26 Also, please could you make the list of header files in the Makefile dependency list be in the same order as they are in the .c file? It makes comparing them easier. |
jmarshall
left a comment
There was a problem hiding this comment.
A few minor infrastructural comments. Also need to add the new hts_expr.h and hts_expr.c to HTSLIB_PUBLIC_HEADERS and HTSLIB_ALL in htslib.mk.
htslib/hts.h
Outdated
|
|
||
| #include "hts_defs.h" | ||
| #include "hts_log.h" | ||
| #include "hts_expr.h" |
There was a problem hiding this comment.
If the few declarations in hts_expr.h are worthy of being in a separate header (FWIW +1 they are IMHO), might consider just forward declaring struct hts_filter_t; rather than #including the separate header here.
There was a problem hiding this comment.
Can do so, but it then also needs a similar modification to sam.h (or the include). I'm not sure on what's best policy, but no strong views either way so will work on this.
htslib/hts_expr.h
Outdated
| #ifndef HTS_EXPR_H | ||
| #define HTS_EXPR_H | ||
|
|
||
| #include <htslib/kstring.h> |
There was a problem hiding this comment.
| #include <htslib/kstring.h> | |
| #include "kstring.h" |
htslib/hts_expr.h
Outdated
| * Take care when negating this. "[NM] != 0" will be true when | ||
| * [NM] is absent, thus consider "[NM] && [NM] != 0". | ||
| */ | ||
| typedef struct { |
There was a problem hiding this comment.
| typedef struct { | |
| typedef struct hts_expr_val_t { |
htslib_vars.mk
Outdated
| htslib_faidx_h = $(HTSPREFIX)htslib/faidx.h $(htslib_hts_defs_h) $(htslib_hts_h) | ||
| htslib_hfile_h = $(HTSPREFIX)htslib/hfile.h $(htslib_hts_defs_h) | ||
| htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) | ||
| htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h) |
There was a problem hiding this comment.
Same order as #include lines (but see previous comment about whether to add this inclusion)
| htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h) | |
| htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_hts_expr_h) $(htslib_kstring_h) $(htslib_kroundup_h) |
htslib_vars.mk
Outdated
| htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h) | ||
| htslib_hts_defs_h = $(HTSPREFIX)htslib/hts_defs.h | ||
| htslib_hts_endian_h = $(HTSPREFIX)htslib/hts_endian.h | ||
| htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h |
There was a problem hiding this comment.
| htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h | |
| htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h $(htslib_kstring_h) $(htslib_hts_defs_h) |
Makefile
Outdated
| hfile_s3.o hfile_s3.pico: hfile_s3.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_kstring_h) | ||
| hts.o hts.pico: hts.c config.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) | ||
| hts.o hts.pico: hts.c config.h $(htslib_hts_expr_h) $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) | ||
| hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(htslib_kstring_h) $(textutils_internal_h) |
There was a problem hiding this comment.
| hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(htslib_kstring_h) $(textutils_internal_h) | |
| hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(textutils_internal_h) |
It doesn't have to be either/or, but providing automatic filtering inside However with the explicit version added and enabling more advanced usage, FWIW I'm happy enough either way. The splitting out of
|
|
The problem is we need a transparent way to retrofit this to other tools. I don't think requiring code changes to them is justified purely by the desire to be tidy. Also why is the clue in the name? I don't think the core purpose has changed either. It does fetch a single record, one at a time. It's simply now one record matching your criteria if you specified them. At it's heart nothing has changed, but optionally you have more control over it. I don't see that as a bad thing. Thanks for comments on tidying various bits up though. I'll do these today. |
|
Thanks John, and great catch on the lack of check in Hopefully I've addressed all the issues now. |
valeriuo
left a comment
There was a problem hiding this comment.
Only a few minor aspects to be addressed.
|
|
||
| /* | ||
| * mul_expr | ||
| * : unary_expr ( |
There was a problem hiding this comment.
mul_expr: unary_expr | (
or
mul_expr: unary_expr ( ( '*' | '/' | '%' ) unary_expr )*
Otherwise it leaves room for something like !5 -10/2, which presumably is not an accepted expression.
There was a problem hiding this comment.
!5 -10/2 is an accepted expression, it evaluates to -5 (in both test/test_expr and C).
|
|
||
| /* | ||
| * add_expr | ||
| * : mul_expr ( |
There was a problem hiding this comment.
Same here: : mul_expr | (
There was a problem hiding this comment.
This is wrong however, and mul_expr above. They state:
add_expr
: mul_expr (
mul_expr '+' mul_expr
| mul_expr '-' mul_expr
)*
The * on the end there implies 1 2+3 is valid, which it obviously isn't. The code got rewritten at some point without the comment changing. I can't just use mul_expr | as that'd then permit 1+2 3+4. It appears to be doing the right thing, which is:
add_expr: mul_expr ( ('+' | '-') mul_expr )*
So X, X + X, X + X - X are all valid (where X is anything accepted by mul_expr).
The intention for all of these is that they can match nothing themselves other recursively descending down to the next level in the parser. So add_expr may contain +/- but it's not a requirement.
Amending the comment.
Edit: or alternatively define recursively, as is done in bitand_expr and similar. e.g.
add_expr
: mul_expr
| add_expr '+' mul_expr
| add_expr '-' mul_expr
htslib/hts_expr.h
Outdated
| * | ||
| * The type of the returned result may be numeric of string, as defined by | ||
| * the is_str member. It can also be explicitly defined to be true even | ||
| * for a null value. This may be used to check for the existance of |
test/test_expr.c
Outdated
|
|
||
| { 1, NULL, " (2*3)&7 > 4"}, | ||
| { 0, NULL, " (2*3)&(7 > 4)"}, // C precedence equiv | ||
| { 1, NULL, "((2*3)&7) > 4"}, // Python precendece equiv |
| if (ks_resize(&res->s, b->core.l_qseq+1) < 0) | ||
| return -1; | ||
| memcpy(res->s.s, bam_get_qual(b), b->core.l_qseq); | ||
| res->s.l = b->core.l_qseq; |
There was a problem hiding this comment.
res->s.s[b->core.l_qseq] = 0; as used for seq.
There was a problem hiding this comment.
This isn't needed. Here bam_get_qual returns the binary quality, with values 0 upwards. Thus the string may already contain nul chars. Nul terminating it would lead people to assume it's a printable string if they looked into the source.
It's returned so that things like expr_func_avg can be used on the quality string. This uses .l to determine the length and doesn't need nul termination.
The nul terminating for seq is probably just for niceties sake when debugging, but I doubt that's required either.
|
Pushed some new changes. Thanks for the comments. |
This bug was introduced during samtools#1181 where the additional HTS_OPT_FILTER meant the automatic fall-through from HTS_OPT_COMPRESSION_LEVEL no longer applied. I decided against continuing the previous obscurity of moving HTS_OPT_COMPRESSION_LEVEL to the end of the switch and having no return (thus falling into the cram_set_voption call), instead favouring an explicit call. An example command that demonstrated the problem was: samtools view -O cram,level=1 in.bam -o out.cram
This bug was introduced during #1181 where the additional HTS_OPT_FILTER meant the automatic fall-through from HTS_OPT_COMPRESSION_LEVEL no longer applied. I decided against continuing the previous obscurity of moving HTS_OPT_COMPRESSION_LEVEL to the end of the switch and having no return (thus falling into the cram_set_voption call), instead favouring an explicit call. An example command that demonstrated the problem was: samtools view -O cram,level=1 in.bam -o out.cram
This uses an expression language and SAM format fields to specify
rules which must pass (so filter-in rather than filter-out).
The code structure is expr.[ch] implements a generic expression
language which accepts a function callback for performing variable
lookup, with sam.c:bam_sym_lookup() being that function.
The sam_read1 function has been changed to keep looping until the
filter passes, or we run out of data.
This works for all file formats, but does so by adding a new member to
htsFile. Technically it's an ABI change, but at least our own usage
of it never uses arrays of them and it's stated in the header to be an
opaque type which in theory should also ban that, albeit without any
compiler checks). If this is a problem, would could add the same
member to BGZF, cram_fd and hFILE as they're all private, but that
feels like needless hoop jumping.
Syntax supports
Data types are numerics (integers or doubles, eg 10, -0x2B, 10.2e-7)
and strings "str" with backslash as an escape char within strings.
The key-words currently supported are:
For example, to return only records where the mate reference name
differs to the mapped query name and the read is aligned with high
mapping quality:
Filters to find significant deletions:
Duplicates that aren't part of a "proper pair" (equivalent to
"samtools view -F 2 -f 0x400")