SI-8774 - set LinkedHashMap/LinkedHashSet node's link fields to null when removed#3911
alextg-github wants to merge 1 commit into scala:2.10.x from
Conversation
|
PLS REBUILD/pr-scala@efb7f68 |
|
Very good analysis, and I really like your test case. It makes me think that we should add an additional From the correctness standpoint, all looks good to me. (Btw, where can I see the GC test file?) LGTM |
|
Thanks for the review! At the moment, the test scala file is attached to the PR and to the thread I started on scala-internals: https://groups.google.com/forum/#!topic/scala-internals/g51SUrGnxas Being able to test the number of (full) GCs after a point in the program (here, after phase 2) would be great. This could then be added as a test. I'm currently trying to rebuild and rerun the pull-requests tests on Jenkins. They failed the first time around but looking at the logs, the failures look unrelated to the change. |
|
I'm not competent to review, but it's really interesting, after years of thinking I didn't have to help the GC. The build runs -optimize, so it would be funny if the failing test for -optimize inlining revealed a reliance on a weak ref not getting collected, or similar. (Without knowing anything about it.) |
|
som-snytt: Thanks! For the most part, one doesn't need to help the collector. However, it is possible to end up with either temporary or actual "memory leaks". In this case, the collections are holding onto long chains of younger objects because the older (now dead) objects are in a part of the heap that is conservatively being considered live. |
|
My first set of tests on Jenkins failed with the files/run/private-inline.scala test in the pr-scala-test runs. I looked at the backtrace and at the code (line 42) for the assert. It seemed unrelated to my change, so I asked for a rebuild. The rebuild also failed, but this time the pr-scala-test run passed (including private-inline.scala). Instead, the pr-scala-rangepos run, which had passed on the first attempt, failed, and it failed on the same files/run/private-inline.scala test (with the same backtrace). Is this test known to be buggy or racy? If it isn't, I will dig deeper to see how my change could have affected it. If it is, should I request another rebuild? |
|
FTR, test-opt passed on my machine this morning, but that doesn't prove any negatives. The failure is that it detects a closure class (Class forName $anon$fun) that should have been inlined. So maybe how the backend decides whether to emit the class is affected in a racy way. (Just a guess.) |
|
PLS REBUILD/pr-scala@efb7f68 |
|
With the latest round of tests, it passes everything except pr-scala-integrate-ide, which is reported as broken since #3915. This also does not appear to be due to this change set. Any advice? Should I file a ticket? Looking at scala-internals, Paolo Giarrusso is reporting a similar issue with that test today. |
|
PLS REBUILD/pr-scala@efb7f68 |
|
PLS REBUILD/pr-scala@efb7f68 |
|
PLS REBUILD ALL |
|
PLS REBUILD ALL |
|
PLS REBUILD ALL The test failure seems unrelated to the change - it tests elimination of inlined closures. I could not reproduce it locally. |
|
PLS REBUILD ALL |
|
PLS REBUILD ALL |
Several collection types have var link fields in their nodes. Examples include HashMap, LinkedHashMap, LinkedHashSet, and ParHashMap. As implemented, removal of a value from one of these structures unlinks the node from the collection but leaves that node's link fields unchanged. This can lead to a form of nepotism causing unnecessary promotion of objects in generational garbage collection and can lead to sequences of frequent full garbage collections.

How does this happen? These fields allow older (previously allocated) objects to point to younger objects in the heap. If a removed node happens to be in an older generation, the fact that it refers to younger objects causes those objects to be promoted even if they are otherwise unreachable. This then causes an instance of one of these collections (especially if used as a queue) to repeatedly promote objects that should have been collected in the young generation. The problem is not a semantic one but a limitation of incremental tracing garbage collection, which collects part of the heap while treating another part, conservatively, as if it were reachable.

The PR includes a simple benchmark (LinkedHashMapTest.scala) that shows the issue. The benchmark has three phases. In the first phase, a LinkedHashMap[Int, Int] is populated. In the second phase, that collection instance (and all its nodes) is promoted by invoking System.gc(). In the final phase, nodes are repeatedly added and removed as if the data structure were a queue. The test is not included in this changeset because it needs to be run in a way that can quantify the number of full GCs that occur during the third phase.

When run on the existing code base, the test shows frequent repeated full GCs after phase three starts. The fix is relatively straightforward: add code in the remove operations to null out the link fields. With the fix, there are only young GCs during that final phase.

Aside: This all came out of the observation of occasional/unpredictable spikes in GC activity in some of our services. We found the same issue in the Java LinkedHashMap class and patched our 1.7 JDK to null out these link fields. We've verified that this fixes the pathologies we saw. We also found that, independently, the OpenJDK 1.8 class files for these collection classes have been modified (by Oracle) to null out these link fields. We'd now like to do the same for the Scala types.

testing: "test/partest --all" passes all tests
|
PLS REBUILD ALL |
|
PLS REBUILD/pr-scala@533d5b0 |
|
Axel22, others: I managed to get the change into one commit (new to git) and it now passes all tests. However, I posted a question in the ticket about the semantics of unsynchronized iteration. The reason my first change (which also modified HashMap) caused the private-inline.scala test to fail 26% of the time is that the semantics of (unsynchronized) iteration change with this request. I backed out that part of the change, but the remaining change affects LinkedHashMap and LinkedHashSet iteration in a similar manner. In particular, by nulling the link fields on removal of a node, we no longer have the property that an unsynchronized iteration will visit all nodes not removed during that iteration. Instead, the iteration may end early. My question is: is this acceptable? Note that the same change has been made to JDK 8's class files for these classes, so it, too, has made this change in behavior. |
|
I'm not sure whether this is acceptable (it sounds reasonable to me for regular hash maps, but I'm less sure about linked hash maps), but I want to point out that it's not just "unsynchronized" iteration (implying multiple threads). You can do something like so: This is true entirely within a single thread. There's nothing racy about it. (Note that removing I'd recommend making this policy more explicit and sane for 2.12 (e.g. by using this patch). I'd have to defer to others regarding 2.10 and 2.11. |
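To make the single-threaded point concrete, here is a hypothetical example of the pattern being discussed (this is my own illustration, not the snippet elided above): an entry is removed from a `LinkedHashMap` while iterating over it, on one thread, with nothing racy involved. Whether the entries after the removed one are still visited depends on whether removal nulls the node's link fields.

```scala
import scala.collection.mutable

// Illustrative only: remove an entry mid-iteration, single-threaded.
// With link-nulling removal, the traversal can silently end early
// instead of visiting the remaining (non-removed) entries.
object RemoveDuringIteration {
  def run(): (List[Int], mutable.LinkedHashMap[Int, String]) = {
    val m = mutable.LinkedHashMap(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
    var visited = List.empty[Int]
    for ((k, _) <- m) {
      visited ::= k
      if (k == 2) m -= k // mutate the map mid-iteration
    }
    (visited.reverse, m)
  }
}
```

The map itself ends up without key 2 either way; it is only the set of keys *visited* after the removal that becomes implementation-dependent.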
|
The docs of ConcurrentModificationException are worth a read in this context. They define "Concurrent" to include interleaved, single threaded ops. Our docs for
We should probably expand this to include "calling a mutating method on the underlying collection". |
|
Ichoran: you identified the problem that causes private-inline.scala to fail. It isn't a case of concurrency. It is a case of a remove() (or -=) operation occurring in the closure passed to a foreach (or iterator). I added asserts to the code and got the following backtrace: When I rewrote the foreachEntry method from (simplified): to be: the private-inline.scala test passes 100% of the time for 100 runs. The challenge is that one then needs to be aware of this effect of removing nodes causing link fields to change, and to write hasNext/next loops that similarly read the next value before executing the closure. This doesn't seem to be acceptable (and shows that hasNext/next is not a nice interface in some ways). The fixed code is also still open to the concurrency issue I mentioned. But it is good to understand why this broke scalac. |
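The two elided snippets can be reconstructed from the surrounding description; the names here (`Node`, `foreachEntryBefore`/`foreachEntryAfter`) are simplified stand-ins, not the exact library source. The original shape reads the `later` link after invoking the closure; the rewrite captures it before, so a closure that removes (and link-nulls) the current node no longer truncates the walk.

```scala
// Stand-in for the real entry type: a value plus a mutable forward link.
class Node[A](val value: A) { var later: Node[A] = null }

object Walk {
  // Original shape: reads cur.later AFTER running f, so if f removes cur
  // (and removal nulls cur's links), the traversal terminates early.
  def foreachEntryBefore[A](first: Node[A])(f: Node[A] => Unit): Unit = {
    var cur = first
    while (cur ne null) {
      f(cur)
      cur = cur.later // may have been nulled by f
    }
  }

  // Rewritten shape: captures the successor BEFORE running f.
  def foreachEntryAfter[A](first: Node[A])(f: Node[A] => Unit): Unit = {
    var cur = first
    while (cur ne null) {
      val next = cur.later // read the link before executing the closure
      f(cur)
      cur = next
    }
  }
}
```

With a closure that nulls the current node's link (simulating removal), the first version stops at that node while the second visits the whole chain.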
Doing so relies on implementation details which might change. See scala#3911 / SI-8774.
|
I've submitted #4018 to decouple that spot in the backend from the implementation details of I wonder if we also need to build in a fail-fast mechanism like |
|
Shall we merge this patch too? I would like to include this in 2.11.3 (as a side effect of merging 2.10.x into 2.11.x branch). @alextg-github, WDYT? |
|
I feel uncomfortable merging this change as it stands, especially in a minor release. How many more #4018's will show up in the wild after someone updates to 2.11.3? That said, I'd like to get this into 2.12. It clearly leads to hard-to-diagnose performance problems for our users, and it's one that Java's collections are engineered to avoid. Our decision matrix is: When? 2.11 or 2.12 |
|
+1 for fail fast |
|
+1 to fail fast. This PR targets 2.10 at the moment. I'd be willing to risk merging this into 2.11. I think #4018 is a good example of what can go wrong but doesn't give us a sense how often such code would be seen in the wild. I think the only way to find out is to try. It's ok to risk a regression from time to time. |
|
+1 for fail fast and 2.12, unless there are better ways to reduce the chance of changed behavior (e.g. by sticking in a sentinel value when removing, and caching the next value in the iterator so you can use it if you read the sentinel) for 2.11. For 2.12 maybe we just want the simpler/higher-performance implementation, but I'm wary of things exploding otherwise. |
|
I wanted to provide a little more motivation for why I think this is a dangerous change for 2.11. The Scala library doesn't have Java's facilities for iterators that mutate, and doesn't have a full set of modify-in-place methods to match filter, collect, etc. So a very natural way to filter in place is something like Unfortunately, this is exactly the sort of thing that looks maybe reasonable and works now, but will fail if the null-out-next-pointer patch goes through. In the case of HashMap it could be particularly pernicious in that it will manifest seemingly randomly, only when hash keys collide and you happen to take out an early one. Because of this, I think it's a dangerous change to make at all unless the sentinel approach is used, and it is certainly too perilous for 2.11 or 2.10. With cached next and a sentinel, the original behavior would be retained (albeit with a performance penalty for iteration). One does have to be careful to modify every iterator that is created, or you'll end up with random ClassCastExceptions instead as the iterator blindly returns the sentinel. |
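A hypothetical reconstruction of the kind of filter-in-place pattern meant above (the elided snippet is not preserved, so this is my own guess at its shape), next to a safe variant that snapshots the keys before mutating. Only the safe variant's result is implementation-independent.

```scala
import scala.collection.mutable

object FilterInPlace {
  // Looks reasonable and works today, but is exactly what breaks if
  // removal nulls the removed node's link fields: the walk can end early.
  def risky(m: mutable.LinkedHashMap[Int, Int]): Unit =
    for ((k, _) <- m if k % 2 == 0) m -= k

  // Snapshot the keys first, then mutate: immune to iteration surprises.
  def safe(m: mutable.LinkedHashMap[Int, Int]): Unit =
    for (k <- m.keys.toList if k % 2 == 0) m -= k
}
```

The `safe` form pays for a key snapshot but never traverses a structure it is mutating.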
|
Apologies for dropping off. I had been researching C++ and Java specs for collections and iterators in early September but got knocked for a loop (literally: I got clipped by a car while running and lost several weeks; fortunately, the injuries/concussion were minor but forced time off). Here is my current take (which I think matches Ichoran). Java only allows remove on an iterator, not on other forms of iteration over a collection. It defines this as part of the specification for both collection iteration and for the remove() operation itself. It implements this (per collection type) with state that carries between hasNext() and next() (as well as remove). It coordinates ConcurrentModificationException, typically, via counters that are inc'd by modifying operations and checked during iteration of any kind. It is a fairly intrusive set of changes. Because of the size of those changes, I agree with Ichoran and Jason that this should probably be fixed completely in 2.12. The fix includes fixing the specification to spell this out and going through the various collection types to implement ConcurrentModificationException. We can then fix the nulling of link fields after that is in place. What I am likely to do inside Twitter is publish locally modified jar files that fix LinkedHashMap for Scala 2.10.x and 2.11.x, since there I need only worry about whether this breaks our own stack. But I think it is too close to the 2.10.5 deadline to push anything there. Should I leave this SI open for the 2.12 work? The current patch is inadequate for what should be done. |
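For reference, the counter-based fail-fast mechanism described above can be sketched in Scala. This is illustrative only (the `FailFastBuffer` name and structure are invented here; java.util implements the idea per collection class): mutators bump a modification count, and iterators snapshot it at creation and check it on every step, throwing rather than silently misbehaving.

```scala
import java.util.ConcurrentModificationException

// Minimal sketch of the java.util-style fail-fast idea: any structural
// modification invalidates every iterator created before it.
class FailFastBuffer[A] {
  private var elems = Vector.empty[A]
  private var modCount = 0

  def add(a: A): Unit = { elems = elems :+ a; modCount += 1 }
  def removeFirst(): Unit = { elems = elems.tail; modCount += 1 }

  def iterator: Iterator[A] = new Iterator[A] {
    private val expected = modCount // snapshot at iterator creation
    private var i = 0
    def hasNext: Boolean = { check(); i < elems.length }
    def next(): A = { check(); val a = elems(i); i += 1; a }
    private def check(): Unit =
      if (modCount != expected) throw new ConcurrentModificationException
  }
}
```

As in Java, "concurrent" here includes single-threaded interleaving: mutating the buffer between two `next()` calls on the same thread trips the check.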
|
Hi @alextg-github! Sorry to hear about your accident. At the same time, it's great to hear that it turned out fairly minor and you're back to your regular activities! Thanks for your detailed response! Shall we close this PR? Would you be interested in cleaning up the matters you mentioned in your response and submitting a PR against the 2.12.x branch? |
|
Recent list about difficulties with iterator's |
|
We need to close this one to retarget and flesh out the fix as discussed. @alextg-github: yep, please leave the JIRA ticket open, and update it with a summary of our discussion here.
Several collection types have var link fields in their nodes. Examples
include LinkedHashMap and LinkedHashSet. As implemented, removal
of a value from one of these structures unlinks the node from the
collection but leaves that node's link fields unchanged. This can lead to
a form of nepotism causing unnecessary promotion of objects in
generational garbage collection and can lead to sequences of frequent
full garbage collections.
How does this happen? These fields allow older (previously allocated)
objects to point to younger objects in the heap. If a removed node
happens to be in an older generation, the fact that it refers to younger
objects causes these objects to be promoted even if they are otherwise
unreachable. This then causes an instance of one of these collections
(especially if used as a queue) to repeatedly promote objects that
should have been collected in the young generation. The problem is not
a semantic one but a limitation of incremental tracing garbage
collection, which collects part of the heap while treating another
part, conservatively, as if it were reachable.
The PR includes a simple benchmark (LinkedHashMapTest.scala) that shows
the issue. The benchmark has three phases. In the first phase, a
LinkedHashMap[Int, Int] is populated. In the second phase, that collection
instance (and all its nodes) is promoted by invoking System.gc(). In the
final phase, nodes are repeatedly added and removed as if the data structure
were a queue. The test is not included in this changeset because it needs to be
run in a way that can quantify the number of full GCs that occur during
the third phase.
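The actual LinkedHashMapTest.scala is attached to the PR rather than included here; a rough sketch of the three phases described above (the object name and the sizes are made up for illustration) might look like:

```scala
import scala.collection.mutable

// Sketch of the three-phase benchmark described above (sizes are made up).
object LinkedHashMapGcSketch {
  def run(n: Int, ops: Int): mutable.LinkedHashMap[Int, Int] = {
    val m = mutable.LinkedHashMap.empty[Int, Int]

    // Phase 1: populate the map.
    for (i <- 0 until n) m(i) = i

    // Phase 2: promote the map and all of its nodes by forcing a full GC.
    System.gc()

    // Phase 3: queue-like churn. Removed old-generation nodes whose link
    // fields are not nulled keep dragging young neighbors into old space.
    var next = n
    for (_ <- 0 until ops) {
      m -= (next - n)   // remove the oldest remaining key
      m(next) = next    // add a fresh (young) entry
      next += 1
    }
    m
  }
}
```

Running something like this with -verbose:gc (or -Xlog:gc on newer JVMs) is one way to compare full-GC counts during phase three before and after the fix.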
When run on the existing code base, the test shows frequent repeated
full GCs after phase three starts. The fix is relatively straightforward:
add code in the remove operations to null out the link fields. With the
fix, there are only young GCs during that final phase.
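The remove-side change can be sketched as follows. This is an illustrative stand-in (the `Node` and `LinkedChain` names and structure are invented here), not the actual scala.collection.mutable source: removal unlinks the node as before, and additionally clears the removed node's own link fields so a dead-but-promoted node cannot act as a root for its younger neighbors.

```scala
// Stand-in for a LinkedHashMap-style doubly-linked entry.
class Node[A](var value: A) {
  var earlier: Node[A] = null
  var later: Node[A] = null
}

class LinkedChain[A] {
  var first: Node[A] = null
  var last: Node[A] = null

  def add(v: A): Node[A] = {
    val n = new Node(v)
    if (last eq null) { first = n; last = n }
    else { last.later = n; n.earlier = last; last = n }
    n
  }

  def remove(n: Node[A]): Unit = {
    // Unlink the node from the chain (the pre-existing behavior).
    if (n.earlier eq null) first = n.later else n.earlier.later = n.later
    if (n.later eq null) last = n.earlier else n.later.earlier = n.earlier
    // The fix under discussion: also null the removed node's links so an
    // old-generation dead node no longer keeps younger nodes reachable.
    n.earlier = null
    n.later = null
  }
}
```

The real change also covers the hash-bucket `next` field in the map case; the chain links shown here are the part responsible for the queue-shaped nepotism.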
Aside: This all came out of the observation of occasional/unpredictable
spikes in GC activity in some of our services. We found the same issue
in the Java LinkedHashMap class and patched our 1.7 JDK to null out
these link fields. We've verified that this fixes the
pathologies we saw. We also found that, independently, the OpenJDK 1.8
class files for these collection classes have also been modified (by
Oracle) to null out these link fields. We'd now like to do the same for
the Scala types.
testing: "test/partest --all" passes all tests