Skip to content

Fix/long >string#941

Merged
dpetran merged 5 commits intomainfrom
fix/long->string
Nov 27, 2024
Merged

Fix/long >string#941
dpetran merged 5 commits intomainfrom
fix/long->string

Conversation

@dpetran
Copy link
Contributor

@dpetran dpetran commented Nov 22, 2024

This fixes an OutOfMemoryError that occurred when we exhaust the heap space while trying to turn the long array on a SID back into a string. The IRI in question in this case was is http://dbpedia.org/resource/Permian%E2%80%93Triassic_extinction_event.

I've started adding property tests to ensure that all UTF-8 strings can be roundtripped properly, but that has exposed a few more uncovered cases around how we deal with padding. In the interest of making this fix available quickly, this PR doesn't include those (failing) tests yet.

As long as the fields are public you don't need a getter to access them, and since
there's no abstraction at all I removed them while troubleshooting.

The one place we instantiate a SID we pass a long array, no need for varargs.
We encountered a OOM error while trying to turn this IRI into a SID:
http://dbpedia.org/resource/Permian–Triassic_extinction_event

(note the unicode em-dash instead of a hyphen)

This reimplementation avoids an infinite loop. It also preserves a bug in the original
impl wherein zero bytes are removed, regardless of whether they were padding or not. A
fix for that will appear in a subsequent PR.
@dpetran dpetran requested a review from a team November 22, 2024 21:44
this.nameCodes = nameCodes;
}

public int getNamespaceCode() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the getters need to be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed them during troubleshooting, I just saw no reason to add them back, they add no meaningful abstraction, and you can access public fields by their name just fine. Removing them also makes my Clojure-bound wariness about hiding data happier : )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add them back. This is a Java class, not Clojure. Those fields should not be public, and they should not be settable. This change makes them settable.

I don't think anything about the Java class needed to change to fix this bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that final indicates that the field cannot be mutated and is more than sufficient for safety, but I've gone ahead and reverted the changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should strive to write code as idiomatically as possible in whatever language we're writing in. I would not have written a clojure namespace hiding data behind a function, but that is the Java way, and this is a Java class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bold take is that idiomatic java is in this case just wrong - getters are a bad design and the convention for using them comes from a time when final did not exist.

Especially since this is in performance sensitive code my thought is why waste a stack frame on something we don't need. Same thought with changing the signature from long... to long[], we only ever supply an array so no need to generate vararg handling. But in the end it's only a stack frame and I'm sure the JIT will handle both cases correctly anyways.

n''))))))
[l]
(->> [56 48 40 32 24 16 8 0] ;; byte offsets
(map (fn [i] (bit-and (bit-shift-right l i) 0xFF)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably doesn't matter in practice, but I think we should keep the loop structure here instead of using map + remove. This code ideally should be as fast as possible because it's often called repeatedly when processing a query.

There's a lot of other things to optimize, and the functional implementation is still way faster than having to do an async query the way it was before, so I don't feel that strongly about changing it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise a good point, I've rewritten it to be a loop. criterium/quick-bench pegs it as a little more than twice as fast now - ~300ns per long vs ~850ns with the seq version. I've also added a property test to make sure that it's working for inputs we didn't consider.

Using criterium/quick-bench to verify, this implementation is about twice as fast as the
seq-based one.

old way: Execution time mean : 863.869922 ns
new way: Execution time mean : 318.420721 ns
We do not properly handle zero bytes - we indiscriminately remove them all. However,
zero bytes are not valid in an IRI, so in practice this ok. The only zero bytes that we
have are the ones we add ourselves for padding.
@dpetran dpetran merged commit fe48427 into main Nov 27, 2024
@dpetran dpetran deleted the fix/long->string branch November 27, 2024 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants