Fix/long >string by dpetran · Pull Request #941 · fluree/db

dpetran · 2024-11-22T21:39:01Z

This fixes an OutOfMemoryError that occurred when we exhausted the heap space while trying to turn the long array on a SID back into a string. The IRI in question was http://dbpedia.org/resource/Permian%E2%80%93Triassic_extinction_event.

I've started adding property tests to ensure that all UTF-8 strings can be roundtripped properly, but that has exposed a few more uncovered cases around how we deal with padding. In the interest of making this fix available quickly, this PR doesn't include those (failing) tests yet.

As long as the fields are public you don't need a getter to access them, and since the getters add no abstraction at all I removed them while troubleshooting. In the one place we instantiate a SID we pass a long array, so there is no need for varargs.

We encountered an OOM error while trying to turn this IRI into a SID: http://dbpedia.org/resource/Permian–Triassic_extinction_event (note the Unicode en dash, percent-encoded as %E2%80%93, instead of a hyphen). This reimplementation avoids an infinite loop. It also preserves a bug in the original implementation wherein zero bytes are removed, regardless of whether they were padding or not. A fix for that will appear in a subsequent PR.
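To make the conversion under discussion concrete, here is a minimal Java sketch of packing a string's UTF-8 bytes into longs and unpacking them again. It is not the code from this PR: the class name `NameCodeSketch`, both method names, the big-endian byte order, and the right-padding of the last long with zero bytes are all assumptions chosen for illustration. The unpacking side drops every zero byte, which mirrors the behavior the description says is preserved here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Fluree implementation.
final class NameCodeSketch {

    // Pack UTF-8 bytes into longs, 8 bytes per long, most significant byte first.
    // Assumption: the last long is right-padded with zero bytes.
    static long[] stringToLongs(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        long[] out = new long[(bytes.length + 7) / 8];
        for (int i = 0; i < bytes.length; i++) {
            int shift = 56 - 8 * (i % 8);
            out[i / 8] |= (bytes[i] & 0xFFL) << shift;
        }
        return out;
    }

    // Unpack longs back into a string. Every zero byte is dropped, whether it
    // was padding or real data, so a string containing U+0000 will not survive.
    static String longsToString(long[] longs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(longs.length * 8);
        for (long l : longs) {
            // The loop bound is fixed at 8 iterations per long, so it always terminates.
            for (int shift = 56; shift >= 0; shift -= 8) {
                int b = (int) ((l >>> shift) & 0xFF);
                if (b != 0) {
                    buf.write(b);
                }
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Under these assumptions the IRI with the en dash round-trips, because its three-byte UTF-8 encoding (E2 80 93) contains no zero byte, while a string such as `"a\0b"` comes back as `"ab"`.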

zonotope · 2024-11-26T02:57:03Z

src/java/fluree/db/SID.java

        this.nameCodes = nameCodes;
    }

-    public int getNamespaceCode() {


Why did the getters need to be removed?

I removed them during troubleshooting and just saw no reason to add them back: they add no meaningful abstraction, and you can access public fields by their name just fine. Removing them also makes my Clojure-bound wariness about hiding data happier : )

We need to add them back. This is a Java class, not Clojure. Those fields should not be public, and they should not be settable. This change makes them settable.

I don't think anything about the Java class needed to change to fix this bug.

My understanding is that final indicates that the field cannot be mutated and is more than sufficient for safety, but I've gone ahead and reverted the changes.

I think we should strive to write code as idiomatically as possible in whatever language we're writing in. I would not have written a Clojure namespace hiding data behind a function, but that is the Java way, and this is a Java class.

My bold take is that idiomatic Java is in this case just wrong - getters are a bad design and the convention for using them comes from a time when final did not exist.

Especially since this is in performance-sensitive code, my thought is: why waste a stack frame on something we don't need? Same thought with changing the signature from long... to long[]: we only ever supply an array, so there's no need to generate vararg handling. But in the end it's only a stack frame, and I'm sure the JIT will handle both cases correctly anyway.
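For readers following the thread, the Java language rules being debated can be sketched as below. `SidSketch` is a hypothetical stand-in, not the real `fluree.db.SID` class, and whether the real fields are declared `final` is an assumption taken from the comment above. The sketch shows three things that hold regardless of which side of the style argument one takes: a `final` field cannot be reassigned, the contents of a `final long[]` can still be modified through either the field or a getter, and a `long...` parameter accepts an existing array as well as individual values.

```java
// Hypothetical stand-in for the class under review; field names follow the diff.
final class SidSketch {
    public final int namespaceCode;
    public final long[] nameCodes;

    // A varargs parameter accepts either a long[] or a list of individual longs.
    SidSketch(int namespaceCode, long... nameCodes) {
        this.namespaceCode = namespaceCode;
        this.nameCodes = nameCodes; // no defensive copy: the caller's array is shared
    }

    public int getNamespaceCode() {
        return namespaceCode;
    }

    public long[] getNameCodes() {
        return nameCodes; // returns the same array instance, not a copy
    }

    public static void main(String[] args) {
        SidSketch fromArray = new SidSketch(7, new long[] {1L, 2L});
        SidSketch fromVarargs = new SidSketch(7, 1L, 2L);

        // fromArray.namespaceCode = 8;      // would not compile: final field
        // fromArray.nameCodes = new long[0]; // would not compile: final field

        // Array contents remain mutable through the field and through the getter.
        fromArray.nameCodes[0] = 10L;
        fromVarargs.getNameCodes()[0] = 10L;
        System.out.println(fromArray.nameCodes[0] == fromVarargs.nameCodes[0]);
    }
}
```

Preventing modification of the array contents would need a defensive copy or an immutable wrapper, which neither public fields nor plain getters provide on their own.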

This reverts commit ed9bb63.

zonotope · 2024-11-26T19:34:58Z

src/clj/fluree/db/util/bytes.cljc

-                 n''))))))
+  [l]
+  (->> [56 48 40 32 24 16 8 0] ;; byte offsets
+       (map (fn [i] (bit-and (bit-shift-right l i) 0xFF)))


This probably doesn't matter in practice, but I think we should keep the loop structure here instead of using map + remove. This code ideally should be as fast as possible because it's often called repeatedly when processing a query.

There are a lot of other things to optimize, and the functional implementation is still way faster than having to do an async query the way it was before, so I don't feel that strongly about changing it back.

You raise a good point, I've rewritten it to be a loop. criterium/quick-bench pegs it as a little more than twice as fast now - ~300ns per long vs ~850ns with the seq version. I've also added a property test to make sure that it's working for inputs we didn't consider.
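The two shapes being compared, a sequence pipeline over the byte offsets versus an explicit loop, can be sketched in Java as follows. This is an illustration only: the PR's code is Clojure, the names `LongToBytesSketch`, `viaStream`, and `viaLoop` are hypothetical, and the benchmark figures quoted in this thread apply to the Clojure versions, not to this sketch. Both methods extract the eight bytes of a long from most to least significant and drop zero bytes.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration of "map + remove" versus a loop; not the PR's code.
final class LongToBytesSketch {

    // Pipeline style: map each bit offset to a byte value, then filter out zeros.
    static byte[] viaStream(long l) {
        int[] kept = IntStream.of(56, 48, 40, 32, 24, 16, 8, 0)
                .map(shift -> (int) ((l >>> shift) & 0xFF))
                .filter(b -> b != 0)
                .toArray();
        byte[] out = new byte[kept.length];
        for (int i = 0; i < kept.length; i++) {
            out[i] = (byte) kept[i];
        }
        return out;
    }

    // Loop style: one pass over the eight offsets, with a single trim at the end.
    static byte[] viaLoop(long l) {
        byte[] tmp = new byte[8];
        int n = 0;
        for (int shift = 56; shift >= 0; shift -= 8) {
            byte b = (byte) (l >>> shift);
            if (b != 0) {
                tmp[n++] = b;
            }
        }
        return Arrays.copyOf(tmp, n);
    }
}
```

The two methods return the same bytes for any input, so a choice between them is about allocation and call overhead rather than behavior.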

Using criterium/quick-bench to verify, this implementation is about twice as fast as the seq-based one.
old way: Execution time mean : 863.869922 ns
new way: Execution time mean : 318.420721 ns

We do not properly handle zero bytes - we indiscriminately remove them all. However, zero bytes are not valid in an IRI, so in practice this is OK. The only zero bytes that we have are the ones we add ourselves for padding.
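The reasoning above rests on a property of UTF-8: the only code point whose encoding contains a zero byte is U+0000 itself. A small randomized check in Java can illustrate this. It is a sketch in the spirit of a property test, not the test added in this PR, and the names `ZeroByteSketch`, `hasZeroByte`, and `randomStringWithoutNul` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical illustration; not the property test from this PR.
final class ZeroByteSketch {

    // True if the UTF-8 encoding of s contains at least one 0x00 byte.
    static boolean hasZeroByte(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    // Build a string of `length` random code points, excluding U+0000 and surrogates.
    static String randomStringWithoutNul(Random rnd, int length) {
        StringBuilder sb = new StringBuilder();
        int added = 0;
        while (added < length) {
            int cp = 1 + rnd.nextInt(Character.MAX_CODE_POINT); // range 1..0x10FFFF
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates are not valid scalar values
            }
            sb.appendCodePoint(cp);
            added++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random(1L); // fixed seed so runs are repeatable
        for (int i = 0; i < 10_000; i++) {
            String s = randomStringWithoutNul(rnd, 16);
            if (hasZeroByte(s)) {
                throw new AssertionError("unexpected zero byte in: " + s);
            }
        }
        System.out.println("no zero bytes found in strings without U+0000");
    }
}
```

So stripping zero bytes loses information only for strings that contain U+0000, which is consistent with the claim that the padding bytes are the only zero bytes seen in practice.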

dpetran added 2 commits November 22, 2024 15:32

remove varargs, getters on SID class

ed9bb63

dpetran requested a review from a team November 22, 2024 21:44

zonotope reviewed Nov 26, 2024

Revert "remove varargs, getters on SID class"

4ac72d8

zonotope approved these changes Nov 26, 2024

use loop to translate longs to bytes

f25d2cd

dpetran force-pushed the fix/long->string branch from 6d3006d to 3ca2c42 Compare November 26, 2024 20:52

add property tests for string->longs roundtrip

0db07fc

dpetran force-pushed the fix/long->string branch from 3ca2c42 to 0db07fc Compare November 26, 2024 21:00

dpetran merged commit fe48427 into main Nov 27, 2024

dpetran deleted the fix/long->string branch November 27, 2024 17:01

aaj3f mentioned this pull request Dec 3, 2024

Fix Long->String fluree/server#102

Merged

dpetran merged 5 commits into main from fix/long->string
