Skip to content

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals#1097

Merged
romseygeek merged 8 commits intoapache:masterfrom
romseygeek:intervals-gaps
Feb 6, 2020
Merged

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals#1097
romseygeek merged 8 commits intoapache:masterfrom
romseygeek:intervals-gaps

Conversation

@romseygeek
Copy link
Copy Markdown
Contributor

If you have repeating intervals in an ordered or unordered interval source, you currently get somewhat confusing behaviour:

  • ORDERED(a, a, b) will return an extra interval over just a b if it first matches a a b, meaning that you can get incorrect results if used in a CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on the document a x a b y
  • UNORDERED(a, a) will match on documents that just containg a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats within ordered and unordered sources. It also changes the way that gaps are calculated within ordered and unordered sources, by using a new width() method on IntervalIterator. The default implementation just returns end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches now always returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds gaps() and width() methods so that submatches can be combined in the same way that subiterators are. Extra checks have been added to checkIntervals() to ensure that the same intervals are returned by both iterator and matches, and a fix to DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between the two methods.

Copy link
Copy Markdown
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the code becomes a bit difficult to follow but it's great that we can handle repeats correctly. This is another advantage over Span queries so +1 to merge.

@romseygeek romseygeek merged commit 7c1ba1a into apache:master Feb 6, 2020
@romseygeek romseygeek deleted the intervals-gaps branch February 6, 2020 14:44
asfgit pushed a commit that referenced this pull request Feb 6, 2020
…als (#1097)

If you have repeating intervals in an ordered or unordered interval source, you currently 
get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just a b if it first matches a a b, meaning
that you can get incorrect results if used in a `CONTAINING` filter - 
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that just containg a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats within 
ordered and unordered sources. It also changes the way that gaps are calculated within 
ordered and unordered sources, by using a new width() method on IntervalIterator. The 
default implementation just returns end() - start() + 1, but RepeatingIntervalsSource 
instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering 
on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches now always 
returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds 
gaps() and width() methods so that submatches can be combined in the same way that 
subiterators are. Extra checks have been added to checkIntervals() to ensure that the same 
intervals are returned by both iterator and matches, and a fix to 
DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes 
its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between 
the two methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants