Skip to content

Add a general SAM/BAM/CRAM input filter option (HTS_OPT_SAM_FILTER).#1181

Closed
jkbonfield wants to merge 7 commits intosamtools:developfrom
jkbonfield:filter
Closed

Add a general SAM/BAM/CRAM input filter option (HTS_OPT_SAM_FILTER).#1181
jkbonfield wants to merge 7 commits intosamtools:developfrom
jkbonfield:filter

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

This uses an expression language and SAM format fields to specify
rules which must pass (so filter-in rather than filter-out).

The code structure is expr.[ch] implements a generic expression
language which accepts a function callback for performing variable
lookup, with sam.c:bam_sym_lookup() being that function.

The sam_read1 function has been changed to keep looping until the
filter passes, or we run out of data.

This works for all file formats, but does so by adding a new member to
htsFile. Technically it's an ABI change, but at least our own usage
of it never uses arrays of them and it's stated in the header to be an
opaque type which in theory should also ban that, albeit without any
compiler checks). If this is a problem, would could add the same
member to BGZF, cram_fd and hFILE as they're all private, but that
feels like needless hoop jumping.

Syntax supports

Grouping:      (, ),   eg "(1+2)*3"
Values:        integers, floats, strings or variables
Unary ops:     +, -, !, ~  eg -10 +10, !10 (0), ~5 (bitwise not)
Math ops:      *, /, %  [TODO: add // for floor division?]
Math ops:      +, -
Bit-wise:      &, |, ^  [NB as 3 precedence levels, in that order]
Conditionals:  >, >=, <, <=,
Equality:      ==, !=, =~ !~
Boolean:       &&, ||

Data types are numerics (integers or doubles, eg 10, -0x2B, 10.2e-7)
and strings "str" with backslash as an escape char within strings.

The key-words currently supported are:

flag                    int
flag.paired             int
flag.proper_pair        int
flag.unmap              int
flag.munmap             int
flag.reverse            int
flag.mreverse           int
flag.read1              int
flag.read2              int
flag.secondary          int
flag.qcfail             int
flag.dup                int
flag.supplementary      int
isize                   int
mqual                   int
mtid                    int
mrname                  string
ncigar                  int
pos                     int
qlen                    int
qname                   string
rlen                    int
rname                   string
tid                     int
[XX] tag                int / string

For example, to return only records where the mate reference name
differs to the mapped query name and the read is aligned with high
mapping quality:

test_view -i 'sam_filter=mqual >= 30 && mrname != rname' in.bam

Filters to find significant deletions:

 cigar =~ "[0-9]{2}D"
 rlen - qlen > 10

Duplicates that aren't part of a "proper pair" (equivalent to
"samtools view -F 2 -f 0x400")

flag.dup && !flag.proper_pair

@jkbonfield
Copy link
Copy Markdown
Contributor Author

jkbonfield commented Nov 23, 2020

The test/test_expr tool is both a battery of unit tests (when run without arguments), but also a general purpose expression calculator.

Examples:

$ ./test/test_expr '2*3 + 4*5'
26
$ ./test/test_expr '3&2 > 5&1';  # This differs to C precedence.  It is equiv to (3&2) > (5&1)
1
$ ./test/test_expr '"foo" =~ "[aeiou]{2}"'
1
$ ./test/test_expr '"a\\b\"c"'
a\b"c
$ ./test/test_expr '17/8'
2.125

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Things that may need work.

@daviesrob
Copy link
Copy Markdown
Member

I think we should use the names in the specification, as the ones in the source are probably less well known. Or possibly we could allow aliases.

The lack of a regex library in Windows is a bit annoying. We could paper over it by using PCRE, but I suspect it might result in regexps that work differently between Windows and everything else. Other alternatives available in MSYS2 might be TRE or Henry Spencer's library.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

For now I'll just do some configure checks and get it to build without regex support. It's useful anyway, and people can kludge in pcre themselves via CPPFLAGS and LDFLAGS if they wish in which case configure will reenable the support.

@daviesrob
Copy link
Copy Markdown
Member

Actually, bcftools uses regcomp, and has a Windows solution already. So it doesn't look to be a big problem after all.

expr.c Outdated
static int expression(sam_filter_t *filt, void *data, sym_func *fn,
char *str, char **end, fexpr_t *res);

// Copied from htslib's textutils_internal.h. Delete after migration
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to follow the instructions in this comment!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Also now done.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

jkbonfield commented Nov 24, 2020

Updates.

  • Check for regcomp function (-lregex) for Mingw.

  • Rename isize to tlen and mqual to mapq (as per SAM)

  • Add rnext and pnext keywords (but keeping mrname and mpos as aliases), also as per SAM.

  • Renamed tid to refid and mtid to mrefid. Loosely based on BAM spec, although that has next_refid. I don't like that as we're mixing both ordering and with/without underscores: pnext vs next_refid. There's no obvious ?next phrase to go along side pnext and rnext, hence keeping the "m" prefix terms.

This uses an expression language and SAM format fields to specify
rules which must pass (so filter-in rather than filter-out).

The code structure is expr.[ch] implements a generic expression
language which accepts a function callback for performing variable
lookup, with sam.c:bam_sym_lookup() being that function.

The sam_read1 function has been changed to keep looping until the
filter passes, or we run out of data.

This works for all file formats, but does so by adding a new member to
htsFile.  Technically it's an ABI change, but at least our own usage
of it never uses arrays of them and it's stated in the header to be an
opaque type which in theory should also ban that, albeit without any
compiler checks).  If this is a problem, would could add the same
member to BGZF, cram_fd and hFILE as they're all private, but that
feels like needless hoop jumping.

Syntax supports

    Grouping:      (, ),   eg "(1+2)*3"
    Values:        integers, floats, strings or variables
    Unary ops:     +, -, !, ~  eg -10 +10, !10 (0), ~5 (bitwise not)
    Math ops:      *, /, %  [TODO: add // for floor division?]
    Math ops:      +, -
    Bit-wise:      &, |, ^  [NB as 3 precedence levels, in that order]
    Conditionals:  >, >=, <, <=,
    Equality:      ==, !=, =~ !~
    Boolean:       &&, ||

Data types are numerics (integers or doubles, eg 10, -0x2B, 10.2e-7)
and strings "str" with backslash as an escape char within strings.

The key-words currently supported are:

    flag                    int
    flag.paired             int
    flag.proper_pair        int
    flag.unmap              int
    flag.munmap             int
    flag.reverse            int
    flag.mreverse           int
    flag.read1              int
    flag.read2              int
    flag.secondary          int
    flag.qcfail             int
    flag.dup                int
    flag.supplementary      int
    mapq                    int
    mpos / pnext            int
    mrefid                  int
    mrname / rnext          string
    ncigar                  int
    pos                     int
    qlen                    int
    qname                   string
    refid                   int
    rlen                    int
    rname                   string
    tlen                    int
    [XX] tag                int / string

For example, to return only records where the mate reference name
differs to the mapped query name and the read is aligned with high
mapping quality:

    test_view -i 'sam_filter=mqual >= 30 && mrname != rname' in.bam

Filters to find significant deletions:

     cigar =~ "[0-9]{2}D"
     rlen - qlen > 10

Duplicates that aren't part of a "proper pair" (equivalent to
"samtools view -F 2 -f 0x400")

    sam_filter=flag.dup && !flag.proper_pair
@daviesrob
Copy link
Copy Markdown
Member

While this is specifically for SAM filtering should we be designing it with an eye for adding VCF filters later, even if we don't want to add them now? For example, we could have hts_filter_t instead of sam_filter_t, just filter for the filter-setting option and HTS_OPT_FILTER for the enumerated type. Attempting to set a filter on VCF files could be met with a polite refusal until we get around to implementing it.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

I actually started with HTS_OPT_FILTER but renamed it because the filter is explicitly for SAM with sam keywords. Indeed it was thinking about other format filters that made me rename it to HTS_OPT_SAM_FILTER.

Good point about data types though. I'll rename those.

The design was always to separate language from data-specific keywords. Hence expr.c as its own file.

Note however the bcftools filtering is far more powerful, with functions, arrays, etc.

@jmarshall
Copy link
Copy Markdown
Member

jmarshall commented Nov 24, 2020

Wow — I thought I saw this gestating as a samtools-specific thing…

Is the reason for moving this to htslib to make the functionality available to other programs if they added appropriate function calls, or to make it available to other programs (via user-settable format options on the command line) even without their knowledge or changes being made to their code?

The latter might not work too well as not too many programs make input file format options accessible to the user, so they would need source code changes to enable this anyway.

The former (i.e., making this available via an additional API call like int sam_passes_filter(const bam1_t *, filter_expr_something_t *) that programs would add explicitly into their read loops) has the advantage that it is more explicit and more flexible.

(Example of the lack of flexibility of the latter within-sam_read1() approach: samtools view would not be able to use this as a filtering option whose rejected reads are diverted to the -U file.)

At present, the PR implements the latter via hts_set_opt() and adding a rather hacky loop inside sam_read1(). Fortunately just about everything, including most iterators, does its reading via sam_read1(). However filters would not be applied for e.g. a CRAM file read via a multi-region iterator (i.e., when cram_readrec() is used).

Re naming, and potentially implementing this for non-SAM htsFiles in future: the hts_set_filter_expression() function added takes an expression as a string and does not appear to be SAM-specific. So it would appear to be (in principle) usable for SAM/VCF/whatever and users might expect it to internally check the fp's file format and parse the given expression appropriately (or return “not implemented for this format”). Similarly the hts_opt_apply() option just takes a string, independent of the file type.

So if pursuing the implicit filtering-as-format-option approach, I think either there should be HTS_OPT_SAM_FILTER + sam_set_filter_expression() and equivalents in due course for VCF; or there should be HTS_OPT_FILTER + hts_set_filter_expression() that check applicability in the innermost implementation (i.e., inside hts_set_filter_expression()). Probably the latter, as the argument here is just a string, not a sam_filter_expr_t or something. At the moment, there's a mixture of the two.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Thanks for the comments. I agree with the general thrust of it and I think it's relatively easy modifications to make it a little more flexible so it can encompass both the way it works now as well as a more programmatic API instead.

Wow — I thought I saw this gestating as a samtools-specific thing…

Is the reason for moving this to htslib to make the functionality available to other programs if they added appropriate function calls, or to make it available to other programs (via user-settable format options on the command line) even without their knowledge or changes being made to their code?

It started off a week ago in samtools as a draft. However there are considerable advantages to putting it into htslib.

  • We don't need to make adjustments to every single samtools sub-command to permit filtering options. Everything that currently has an input-fmt-option magically gains this ability, even including test_view.
  • We don't need to use samtools view | samtools sub-command2 as a way of filtering. That can have advantages (and it's still an option), but it also has disadvantages.
  • It's available to other users as htslib.

Note this doesn't mean it won't also be available by easier and friendlier command line options. I'll probably still keep something like samtools view -z "filter-expression".

The latter might not work too well as not too many programs make input file format options accessible to the user, so they would need source code changes to enable this anyway.

All of our programs do. Any third party tool that doesn't make these options available isn't going to break. It simply won't be able to support filtering. That'd be true though if we added it to samtools view via its own dedicated command line option too, so I don't see any difference there.

The former (i.e., making this available via an additional API call like int sam_passes_filter(const bam1_t *, filter_expr_something_t *) that programs would add explicitly into their read loops) has the advantage that it is more explicit and more flexible.

What you're describing is already how it works, but it's just a matter of deciding whether we wish to expose that interface. Right now the option is calling hts_set_filter_expression(fp, expr). We could make that public as a friendlier programmatic interface.

That in turn uses sam_filter_t *sam_filter_init(const char *str) and int sam_filter_eval(...) in expr.c that creates a filter and checks whether a bam record passes that filter. Again we could make these public if desired. I think they need renaming though from sam to hts and create a second sam_filter function that explicitly passes in the bam_sym_lookup function.

(Example of the lack of flexibility of the latter within-sam_read1() approach: samtools view would not be able to use this as a filtering option whose rejected reads are diverted to the -U file.)

Why does it have to be either/or?

Yes you have more flexibility by calling sam_read1 yourself and then optionally calling e.g. sam_passes_filter on one or more sets of filters. However you lose the automatic filtering that can be added to any tool without needing to change code. I don't see it as one or the other. It would make sense to be able to specify and automatic filter, as well as provide the ability to do more custom on-the-fly ones precisely where you wish in your code.

At present, the PR implements the latter via hts_set_opt() and adding a rather hacky loop inside sam_read1(). Fortunately just about everything, including most iterators, does its reading via sam_read1(). However filters would not be applied for e.g. a CRAM file read via a multi-region iterator (i.e., when cram_readrec() is used).

Gah, I chose sam_read1 as I thought it was the one central place that everything went via. I wasn't aware the cram iterator had it's own reading code that avoided that function. :-(

Obviously I need to fix that.

I know it's a little hacky by design. The choice was either a few lines added to the top and end, a mammoth reindentation of the entire function making PR checking hard, or restructuring into multiple functions (that's probably long overdue tbh!). I went with the simplest to review, but I'll consider a wider restructure then.

Re naming, and potentially implementing this for non-SAM htsFiles in future: the hts_set_filter_expression() function added takes an expression as a string and does not appear to be SAM-specific. So it would appear to be (in principle) usable for SAM/VCF/whatever and users might expect it to internally check the fp's file format and parse the given expression appropriately (or return “not implemented for this format”). Similarly the hts_opt_apply() option just takes a string, independent of the file type.

Agreed, and Rob's comments above were similar. I deliberately designed it to have a non-SAM specific component with a view to using in other formats in the future, but somewhere along the line forgot this in my function naming. At present it's too simple to replace the bcftools filtering, but it doesn't have to arrive fully complete on day 1 as long as the design doesn't forbid future work.

I'll edit expr.c to use hts prefixes instead of sam prefixes. Good point on doing introspection on the file format type to decide which format specific symbol-lookup function to pass in. That may mean going back to the generalised option is better again. I'll make the other changes and then see how it's looking.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Hopefully this now addresses all the comments. Thanks for raising them.

It's now more public than it was before, to facilitate being able to call in and do your own filtering where and when you want instead of being required to use the hts_opt string interface. This exposes more, so I've done a lot of data type and function renaming.

The general structure is:

* denotes new public API

hts_expr.c    	  General filter expression language
       *hts_filter_init	       Compile
       *hts_filter_free	       Free
       *hts_filter_eval	       Validate

htslib/hts.h
	"hts_filter_t *filter;" a new member of htsFile

htslib/hts_expr.h  Header for things exposed in hts_expr.c.  Automatically included by htslib/hts.h
	hts_filter_* funcs above
	struct hts_filter_t hts_filter_t	Opaque type
	typedef struct {...} hts_expr_val_t

hts.c
	hts_opt_add / hts_opt_apply: "filter" -> HTS_OPT_FILTER
	hts_set_opt: hts_set_filter_expression call
	hts_close: hts_filter_free call
      * hts_set_filter_expression (easier programmatic interface than hts_opt_add)

sam.c
	bam_sym_lookup		Handles SAM columns; callback for hts_filter
       *sam_passes_filter	Wraps up hts_filter_eval + bam_sym_lookup
	sam_read1		Calls hts_filter_eval
	cram_readrec		Calls hts_filter_eval

- Rename all the sam_* types and functions in expr.c to hts_*.
  This file is intended to be file format agnostic.

  A few key functions here are now externally visible too.

- Renamed fexpr_t to hts_expr_val_t.
  Similarly the other bits named after fexpr.  It was a poor choice as
  it's not a filter expression, but a *value*.

- Renamed expr.[ch] to hts_expr.[ch].  The header is now public too,
  but included by hts.h so it does not need to be explicitly
  included.

- Fixed a bug where the multi-region CRAM iterator wasn't using the
  filter (it calls cram_readrec). All other iterators, both single and
  multi, worked fine.

- Added a simpler sam_passes_filter function that combines
  hts_filter_eval with the internal bam_sym_lookup.

  This is an externally visible function.

- Rename sym_func to hts_expr_sym_func.

- Add more documentation.

- Renamed HTS_OPT_SAM_FILTER back to HTS_OPT_FILTER.  It doesn't yet
  check if you attempt to specify this on e.g. a VCF file, but neither
  does it break as it's simply ignored.

- Restructured sam_read1 into a series of internal format-specific
  functions.  This makes the body of sam_read1 itself far simpler.
  This now means we can do sensible looping without concern over
  indenting an enormous chunk of code, and the loop is now small
  enough to see the logic in a single page.
@jkbonfield
Copy link
Copy Markdown
Contributor Author

I need some help to make it work on Windows though.

I don't understand the nuances of our symbol export code. I've added HTSLIB_EXPORT macros, but it's still not sufficient for Windows. I included hts_defs.h too which defines it, but that's not enough either. Guidance please!

The error is:

Checking shared library exports
hts_filter_eval
hts_filter_free
hts_filter_init
Error: Found unexported symbols (listed above)

@jkbonfield
Copy link
Copy Markdown
Contributor Author

jkbonfield commented Nov 25, 2020

Thoughts on some of the terms we filter on.

  • library (this doesn't exist yet). Samtools can filter on this, but we can only do RG.

  • rlen is from bam_cigar2rlen so it's simply the number of reference bases consumed by CIGAR.

  • My qlen is quite simply length(seq). If we had functions we could do that explicitly.

    This is not the same as bam_cigar2qlen however as unmapped data returns 0 there. Otherwise obviously it would be.

    Note samtools view -m documents qlen as the length of query sequence consumed by CIGAR, but inexplicably explicitly codes it to also include hard-clips. There's no comment or obvious reason why. Maybe it's just a long standing bug?

  • seq and qual (don't exist yet). There's not a huge amount of point in this, but potentially people may wish to (inefficiently) select on them.

Should I add seq_len as a token, or bite the bullet and do basic functions, although that's more work. I think it'd make more sense though for qlen and rlen to be defined consistently.

There are other potentially useful functions. Eg we could have min(qual), max(qual), avg(qual), etc. I was delaying that for another PR though given the bigger they are the longer (much much longer) they ever take to be reviewed and merged and it doesn't all have to be done from day 1.

"seq", "qual" and "library" are string variables, so we can now match
them with e.g. seq =~ "(AT){10}".

We have basic functions that operate on strings, interpreting them as
arrays of characters.  (This could be updated for B aux arrays and
larger objects in the future.)  Functions are length, min, max and
avg.  For example avg(qual) or length(seq).  Note qual here is binary,
so NOT the ASCII phred+33.  I wouldn't advise trying to match it in a
regexp.

"qlen" was previously length of sequence, and not the value of
cigar2qlen.  This differs when we have CIGAR but seq is "*".  Use the
cigar2qlen definition so it is symmetric with our "rlen" variable.

To permit the previous definition, we can now do "length(seq)".

Note that min and max of empty string is NAN.  This isn't treated as
an error.  This permits us to do "avg(qual) > 10" where some entries
have seq and qual both as "*" (ie zero length).
@jkbonfield
Copy link
Copy Markdown
Contributor Author

In absence of any comment either way, I decided to see how complex it is to add basic functions and it's easy enough. Not an entirely clean design, but good enough for now. (It accepts length("foo") but not length ("foo").)

So I implemented the list of thoughts above. They're not brilliantly efficient, but if you really want to do filtering on sequence by regexp then you now can! By quality may be a tad heroic though...

@daviesrob
Copy link
Copy Markdown
Member

I think your appveyor failure here is caused by a lack of #define HTS_BUILDING_LIBRARY in the new hts_expr.c file. It's needed on Windows to set up the macros that export the symbols, as opposed to importing them.

See for example https://github.com/samtools/htslib/blob/238600/hts.c#L26

Also, please could you make the list of header files in the Makefile dependency list be in the same order as they are in the .c file? It makes comparing them easier.

Copy link
Copy Markdown
Member

@jmarshall jmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor infrastructural comments. Also need to add the new hts_expr.h and hts_expr.c to HTSLIB_PUBLIC_HEADERS and HTSLIB_ALL in htslib.mk.

htslib/hts.h Outdated

#include "hts_defs.h"
#include "hts_log.h"
#include "hts_expr.h"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the few declarations in hts_expr.h are worthy of being in a separate header (FWIW +1 they are IMHO), might consider just forward declaring struct hts_filter_t; rather than #including the separate header here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do so, but it then also needs a similar modification to sam.h (or the include). I'm not sure on what's best policy, but no strong views either way so will work on this.

#ifndef HTS_EXPR_H
#define HTS_EXPR_H

#include <htslib/kstring.h>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <htslib/kstring.h>
#include "kstring.h"

* Take care when negating this. "[NM] != 0" will be true when
* [NM] is absent, thus consider "[NM] && [NM] != 0".
*/
typedef struct {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typedef struct {
typedef struct hts_expr_val_t {

htslib_vars.mk Outdated
htslib_faidx_h = $(HTSPREFIX)htslib/faidx.h $(htslib_hts_defs_h) $(htslib_hts_h)
htslib_hfile_h = $(HTSPREFIX)htslib/hfile.h $(htslib_hts_defs_h)
htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h)
htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same order as #include lines (but see previous comment about whether to add this inclusion)

Suggested change
htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h)
htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_hts_expr_h) $(htslib_kstring_h) $(htslib_kroundup_h)

htslib_vars.mk Outdated
htslib_hts_h = $(HTSPREFIX)htslib/hts.h $(htslib_hts_defs_h) $(htslib_hts_log_h) $(htslib_kstring_h) $(htslib_kroundup_h) $(htslib_hts_expr_h)
htslib_hts_defs_h = $(HTSPREFIX)htslib/hts_defs.h
htslib_hts_endian_h = $(HTSPREFIX)htslib/hts_endian.h
htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h
htslib_hts_expr_h = $(HTSPREFIX)htslib/hts_expr.h $(htslib_kstring_h) $(htslib_hts_defs_h)

Makefile Outdated
hfile_s3.o hfile_s3.pico: hfile_s3.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_kstring_h)
hts.o hts.pico: hts.c config.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h)
hts.o hts.pico: hts.c config.h $(htslib_hts_expr_h) $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h)
hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(htslib_kstring_h) $(textutils_internal_h)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(htslib_kstring_h) $(textutils_internal_h)
hts_expr.o hts_expr.pico: hts_expr.c config.h $(htslib_hts_expr_h) $(textutils_internal_h)

@jmarshall
Copy link
Copy Markdown
Member

Why does it have to be either/or?

Yes you have more flexibility by calling sam_read1 yourself and then optionally calling e.g. sam_passes_filter on one or more sets of filters. However you lose the automatic filtering that can be added to any tool without needing to change code.

It doesn't have to be either/or, but providing automatic filtering inside sam_read1() does involve making rather hacky changes and diluting its core purpose of reading one SAM record (the clue is, as they say, in the name). The HTSlib API is a very explicit API (consider e.g. the addition of an explicit header argument to sam_read1() compared to the old samtools API function), and this is very implicit filtering.

However with the explicit version added and enabling more advanced usage, FWIW I'm happy enough either way.

The splitting out of sam_read1() into subfunctions looks clearer; some more comments:

  • Perhaps the new helpers could be static inline.
  • The cram_readrec() changes would benefit from looking more like those in sam_read1(): s/filtered/pass_filter/ and needs to call sam_passes_filter() only if there's an active filter.
  • sam_read1() case sam can lose the extra braces now.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

jkbonfield commented Dec 2, 2020

The problem is we need a transparent way to retrofit this to other tools. I don't think requiring code changes to them is justified purely by the desire to be tidy.

Also why is the clue in the name? sam_read1 is named thus not because it reads 1 record, but because it's version 1 of the API (as in returning bam1_t). Or at least that's what I thought it meant, although I haven't ever seen it spelt out anywhere.

I don't think the core purpose has changed either. It does fetch a single record, one at a time. It's simply now one record matching your criteria if you specified them. At it's heart nothing has changed, but optionally you have more control over it. I don't see that as a bad thing.

Thanks for comments on tidying various bits up though. I'll do these today.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Thanks John, and great catch on the lack of check in cram_readrec. That lead to a crash. (Note to self - we need to do some coverage analysis and work out which paths are not exercised by the test harness at some point.)

Hopefully I've addressed all the issues now.

Copy link
Copy Markdown
Contributor

@valeriuo valeriuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few minor aspects to be addressed.


/*
* mul_expr
* : unary_expr (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mul_expr: unary_expr | (
or
mul_expr: unary_expr ( ( '*' | '/' | '%' ) unary_expr )*

Otherwise it leaves room for something like !5 -10/2, which presumably is not an accepted expression.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!5 -10/2 is an accepted expression, it evaluates to -5 (in both test/test_expr and C).


/*
* add_expr
* : mul_expr (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: : mul_expr | (

Copy link
Copy Markdown
Contributor Author

@jkbonfield jkbonfield Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong however, and mul_expr above. They state:

add_expr                                                                     
    : mul_expr (                                                             
          mul_expr '+' mul_expr                                              
        | mul_expr '-' mul_expr                                              
      )*                                                                                                                                                  

The * on the end there implies 1 2+3 is valid, which it obviously isn't. The code got rewritten at some point without the comment changing. I can't just use mul_expr | as that'd then permit 1+2 3+4. It appears to be doing the right thing, which is:

add_expr: mul_expr ( ('+' | '-') mul_expr )*

So X, X + X, X + X - X are all valid (where X is anything accepted by mul_expr).
The intention for all of these is that they can match nothing themselves other recursively descending down to the next level in the parser. So add_expr may contain +/- but it's not a requirement.

Amending the comment.

Edit: or alternatively define recursively, as is done in bitand_expr and similar. e.g.

add_expr
    : mul_expr
    | add_expr '+' mul_expr
    | add_expr '-' mul_expr

*
* The type of the returned result may be numeric of string, as defined by
* the is_str member. It can also be explicitly defined to be true even
* for a null value. This may be used to check for the existance of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existence

test/test_expr.c Outdated

{ 1, NULL, " (2*3)&7 > 4"},
{ 0, NULL, " (2*3)&(7 > 4)"}, // C precedence equiv
{ 1, NULL, "((2*3)&7) > 4"}, // Python precendece equiv
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precedence

if (ks_resize(&res->s, b->core.l_qseq+1) < 0)
return -1;
memcpy(res->s.s, bam_get_qual(b), b->core.l_qseq);
res->s.l = b->core.l_qseq;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res->s.s[b->core.l_qseq] = 0; as used for seq.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed. Here bam_get_qual returns the binary quality, with values 0 upwards. Thus the string may already contain nul chars. Nul terminating it would lead people to assume it's a printable string if they looked into the source.

It's returned so that things like expr_func_avg can be used on the quality string. This uses .l to determine the length and doesn't need nul termination.

The nul terminating for seq is probably just for niceties sake when debugging, but I doubt that's required either.

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Pushed some new changes. Thanks for the comments.

@valeriuo
Copy link
Copy Markdown
Contributor

valeriuo commented Jan 6, 2021

Rebased, squashed and merged as 999d181 to c7c433f

@valeriuo valeriuo closed this Jan 6, 2021
jkbonfield added a commit to jkbonfield/htslib that referenced this pull request Jan 14, 2021
This bug was introduced during samtools#1181 where the additional
HTS_OPT_FILTER meant the automatic fall-through from
HTS_OPT_COMPRESSION_LEVEL no longer applied.

I decided against continuing the previous obscurity of moving
HTS_OPT_COMPRESSION_LEVEL to the end of the switch and having no
return (thus falling into the cram_set_voption call), instead
favouring an explicit call.

An example command that demonstrated the problem was:

    samtools view -O cram,level=1 in.bam -o out.cram
daviesrob pushed a commit that referenced this pull request Jan 18, 2021
This bug was introduced during #1181 where the additional
HTS_OPT_FILTER meant the automatic fall-through from
HTS_OPT_COMPRESSION_LEVEL no longer applied.

I decided against continuing the previous obscurity of moving
HTS_OPT_COMPRESSION_LEVEL to the end of the switch and having no
return (thus falling into the cram_set_voption call), instead
favouring an explicit call.

An example command that demonstrated the problem was:

    samtools view -O cram,level=1 in.bam -o out.cram
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants