New API, BufferedSource.indexOf(ByteString, fromIndex, toIndex)#1626
New API, BufferedSource.indexOf(ByteString, fromIndex, toIndex)#1626swankjesse merged 3 commits intomasterfrom
Conversation
This is surprisingly interesting. To minimize unnecessary reads for toIndex it is necessary to check whether a prefix of the query matches a suffix of the currently-loaded data. This read-avoidance is useful in practice. When doing HTTP multipart decoding the caller may scan for a boundary separator with a bounded range, and we don't want to block reading when doing so won't impact the result of the call.
| val b0 = targetByteArray[0] | ||
| val bytesSize = bytes.size | ||
| val resultLimit = size - bytesSize + 1L | ||
| val resultLimit = minOf(toIndex, size - bytesSize + 1L) |
There was a problem hiding this comment.
I don’t think so. Here toIndex is the exclusive upper bound on the search, and size is the data size. They’re both independent, and you can pass toIndex that is out-of-bounds on the buffer.
| source.skip(5L) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
What about the lookback case?
There was a problem hiding this comment.
I called it indexOfByteStringLoadsOnlyWhatIsRequiredWhenNotFoundWithFromIndex()
|
|
||
| // Load new data if a prefix of 'bytes' matches a suffix of 'buffer'. | ||
| val limit = minOf(bytes.size, buffer.size - fromIndex + 1).toInt() | ||
| for (i in limit - 1 downTo 1) { |
There was a problem hiding this comment.
Is it worth documenting that the complexity relates to the length of bytes now?
There was a problem hiding this comment.
I guess while there are two O(i) loops here (for and rangeEquals), in practice, it's really unlikely to have pathological behaviour on each iteration, so maybe it's on average fine with long strings.
There was a problem hiding this comment.
Yeah, good point. I think I’d need to do Boyer-Moore to do faster than N*M.
In practice, indexOf() is allowed to do N*M, whether or not we’re doing load avoidance. The simplest such example is searching for xxxxx in a string like xxxxOxxxxOxxxxOxxxxOxxxx.
yschimke
left a comment
There was a problem hiding this comment.
Looks like a good fit for square/okhttp#8665
But minimal okio review from me.
| override fun indexOf(bytes: ByteString, fromIndex: Long): Long | ||
| override fun indexOf(bytes: ByteString, fromIndex: Long, toIndex: Long): Long | ||
| override fun indexOfElement(targetBytes: ByteString): Long | ||
| override fun indexOfElement(targetBytes: ByteString, fromIndex: Long): Long |
There was a problem hiding this comment.
This variant should probably also have an overload with a toIndex for symmetry.
This is surprisingly interesting. To minimize unnecessary reads for toIndex it is necessary to check whether a prefix of the query matches a suffix of the currently-loaded data.
This read-avoidance is useful in practice. When doing HTTP multipart decoding the caller may scan for a boundary separator with a bounded range, and we don't want to block reading when doing so won't impact the result of the call.