SI-8774 - set LinkedHashMap/LinkedHashSet node's link fields to null when removed by alextg-github · Pull Request #3911 · scala/scala

alextg-github · 2014-08-04T01:36:33Z

Several collection types have var link fields in their nodes. Examples
include LinkedHashMap and LinkedHashSet. As implemented, removal
of a value from one of these structures unlinks the node from the
collection but leaves that node's link fields unchanged. This can lead to
a form of nepotism causing unnecessary promotion of objects in
generational garbage collection and can lead to sequences of frequent
full garbage collections.

How does this happen? These fields allow older (previously allocated
objects) to point to younger objects in the heap. If a removed node
happens to be in an older generation, the fact that it refers to younger
objects causes these objects to be promoted even if they are otherwise
unreachable. This then causes an instance of one of these collections
(especially if used as a queue) to repeatedly promote objects that
should have been collected in the young generation. The problem is not
a semantic one, but instead, a limitation of incremental tracing garbage
collection in which it collects part of the heap while treating another
part, conservatively, as if it were reachable.

The PR includes a simple benchmark (LinkedHashMapTest.scala) that shows
the issue. The benchmark has three phases. In the first phase, a
LinkedHashMap[Int, Int] is populated. In the second phase, that collection
instance (and all its nodes) is promoted by invoking System.gc(). In the
final phase, nodes are repeatedly added and removed as if the data structure
were a queue. The test not included in this changeset because it needs to be
run in a way that can quantify the number of full GCs that occur during
the third phase.

When run on the existing code base, the test shows frequent repeated
full GCs after phase three starts. The fix is relatively straightforward:
add code in the remove operations to null out the link fields. With the
fix, there are only young GCs during that final phase.

Aside: This all came out of the observation of occasional/unpredictable
spikes in GC activity in some of our services. We also found and fixed
these issues with the Java LinkedHashMap class and fixed our 1.7 JDK to
also null out these link fields. We've verified that this fixes the
pathologies we saw. We also found that, independently, the OpenJDK 1.8
class files for these collection classes have also been modified (by
Oracle) to null out these link fields. We'd now like to do the same for
the Scala types.

testing: "test/partest --all" passes all tests

alextg-github · 2014-08-04T01:41:05Z

review @Ichoran @phaller @axel22

alextg-github · 2014-08-04T14:38:00Z

PLS REBUILD/pr-scala@efb7f68

scala-jenkins · 2014-08-04T14:41:08Z

(kitty-note-to-self: ignore 51068312)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

axel22 · 2014-08-04T14:56:50Z

Very good analysis, and I really like your test case. It makes me think that we should add an additional Measurer to ScalaMeter, which counts GC cycles - exactly for these kinds of situations.

From the correctness standpoint all looks good to me.

(Btw, where can I see the GC test file LinkedHashMapTest.scala?)

LGTM

alextg-github · 2014-08-04T15:21:58Z

Thanks for the review! At the moment, the test scala file is attached to the PR and to the thread I started on scala-internals: https://groups.google.com/forum/#!topic/scala-internals/g51SUrGnxas

Being able to test the number of (full) GCs after a point in the program (here, after phase 2) would be great. This could then be added as a test.

I'm currently trying to rebuild and rerun the pull-requests tests on Jenkins. They failed the first time around but looking at the logs, the failures look unrelated to the change.

som-snytt · 2014-08-04T15:43:48Z

I'm not competent to review, but it's really interesting, after years of thinking I didn't have to help the GC.

The build runs -optimize, so it would be funny if the failing test for -optimize inlining revealed a reliance on a weak ref not getting collected, or similar. (Without knowing anything about it.)

alextg-github · 2014-08-04T17:14:46Z

som-snytt: Thanks! For the most part, one doesn't need to help the collector. However, it is possible to end up with either temporary or actual "memory leaks". In this case, the collections are holding onto long chains of younger objects because the older (now dead) objects are in a part of the heap that is conservatively being considered live.

alextg-github · 2014-08-04T17:20:22Z

My first set of tests on Jenkins failed with the files/run/private-inline.scala test in the pr-scala-test runs. I looked at the backtrace and at the code (line 42) for the assert. It seemed unrelated to my change so I asked for a rebuild.

The rebuild also failed. However, this time the pr-scala-test run passed (including private-inline.scala). But unlike the first attempt where it passed, the pr-scala-rangepos run failed and it also failed on the files/run/private-inline.scala test (with the same backtrace).

Is this test known to be buggy or racy? If it isn't, I will dig deeper to see how my change could have affected it. If it is, should I request another rebuild?

som-snytt · 2014-08-04T17:30:56Z

FTR, test-opt passed on my machine this morning, but that doesn't prove any negatives.

The failure is that it detects a closure class (Class forName $anon$fun) that should have been inlined. So maybe how the backend decides whether to emit the class is affected in a racy way. (Just a guess.)

alextg-github · 2014-08-05T00:26:42Z

PLS REBUILD/pr-scala@efb7f68

scala-jenkins · 2014-08-05T00:31:10Z

(kitty-note-to-self: ignore 51135705)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

alextg-github · 2014-08-05T02:54:36Z

With the latest round of tests, it passes everything except pr-scala-integrate-ide which it says has been broken since #3915. This also does not appear to be due to this change set. Any advice? Should I file a ticket?

Looking at scala-internal, Paolo Giarrusso is reporting a similar issue with that test today.
https://groups.google.com/forum/#!searchin/scala-internals/pr-scala-integrate-ide/scala-internals/XKJa5WffDXg/LuRgQ4KPSsoJ

alextg-github · 2014-08-05T11:33:56Z

PLS REBUILD/pr-scala@efb7f68

scala-jenkins · 2014-08-05T11:41:12Z

(kitty-note-to-self: ignore 51185230)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

alextg-github · 2014-08-06T12:56:55Z

PLS REBUILD/pr-scala@efb7f68

scala-jenkins · 2014-08-06T13:01:16Z

(kitty-note-to-self: ignore 51330199)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

retronym · 2014-08-06T21:27:51Z

PLS REBUILD ALL

scala-jenkins · 2014-08-06T21:31:17Z

(kitty-note-to-self: ignore 51399838)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

alextg-github · 2014-08-07T00:25:48Z

PLS REBUILD ALL

scala-jenkins · 2014-08-07T00:31:17Z

(kitty-note-to-self: ignore 51415901)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

lrytz · 2014-08-07T15:04:25Z

PLS REBUILD ALL

The test failure seems unrelated to the change - it tests elimination of inlined closures. I could not reproduce it locally.

scala-jenkins · 2014-08-07T15:11:19Z

(kitty-note-to-self: ignore 51484250)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

alextg-github · 2014-08-14T01:34:15Z

PLS REBUILD ALL

scala-jenkins · 2014-08-14T01:41:43Z

(kitty-note-to-self: ignore 52133998)
🐱 Roger! Rebuilding pr-scala for efb7f68, 568b4cb, 5197fd2, 6f629e3, 2a94053. 🚨

alextg-github · 2014-08-14T16:57:43Z

PLS REBUILD ALL

scala-jenkins · 2014-08-14T17:01:45Z

(kitty-note-to-self: ignore 52210718)
🐱 Roger! Rebuilding pr-scala for efb7f68, 568b4cb, 5197fd2, 6f629e3, 2a94053. 🚨

Several collection types have var link fields in their nodes. Examples include HashMap, LinkedHashMap, LinkedHashSet, and ParHashMap. As implemented, removal of a value from one of these structures unlinks the node from the collection but leaves that node's link fields unchanged. This can lead to a form of nepotism causing unnecessary promotion of objects in generational garbage collection and can lead to sequences of frequent full garbage collections. How does this happen? These fields allow older (previously allocated objects) to point to younger objects in the heap. If a removed node happens to be in an older generation, the fact that it refers to younger objects causes these objects to be promoted even if they are otherwise unreachable. This then causes an instance of one of these collections (especially if used as a queue) to repeatedly promote objects that should have been collected in the young generation. The problem is not a semantic one, but instead, a limitation of incremental tracing garbage collection in which it collects part of the heap while treating another part, conservatively, as if it were reachable. The PR includes a simple benchmark (LinkedHashMapTest.scala) that shows the issue. The benchmark has three phases. In the first phase, a LinkedHashMap[Int, Int] is populated. In the second phase, that collection instance (and all its nodes) is promoted by invoking System.gc(). In the final phase, nodes are repeatedly added and removed as if the data structure were a queue. The test not included in this changeset because it needs to be run in a way that can quantify the number of full GCs that occur during the third phase. When run on the existing code base, the test shows frequent repeated full GCs after phase three starts. The fix is relatively straightforward: add code in the remove operations to null out the link fields. With the fix, there are only young GCs during that final phase. Aside: This all came out of the observation of occasional/unpredictable spikes in GC activity in some of our services. We also found and fixed these issues with the Java LinkedHashMap class and fixed our 1.7 JDK to also null out these link fields. We've verified that this fixes the pathologies we saw. We also found that, independently, the OpenJDK 1.8 class files for these collection classes have also been modified (by Oracle) to null out these link fields. We'd now like to do the same for the Scala types. testing: "test/partest --all" passes all tests

alextg-github · 2014-08-15T00:59:07Z

PLS REBUILD ALL

scala-jenkins · 2014-08-15T01:01:46Z

(kitty-note-to-self: ignore 52263874)
🐱 Roger! Rebuilding pr-scala for 533d5b0. 🚨

alextg-github · 2014-08-15T02:13:03Z

PLS REBUILD/pr-scala@533d5b0

scala-jenkins · 2014-08-15T02:21:46Z

(kitty-note-to-self: ignore 52267775)
🐱 Roger! Rebuilding pr-scala for 533d5b0. 🚨

alextg-github · 2014-08-18T17:42:47Z

Axel22, others: I managed to get the change in one commit (new to git) and it now passes all tests. However, I posted a question in the ticket about the semantics of unsynchronized iteration. The reason my first change (that also modified HashMap) caused the private-inline.scala test to fail 26% of the time was that the semantics of (unsynchronized) iteration change with this request. I backed out that part of the change but the remaining test affects LinkedHashMap and LinkedHashSet iteration in a similar manner. In particular, by nulling the link fields on removal of a node, we no longer have the property that an unsynchronized iteration will visit all nodes not removed during that iteration. Instead, the iteration may end early. My question is: is this acceptable?

Understand that the same change has been made to JDK8's classfiles for these classes. So it also has made this change in behavior.

Ichoran · 2014-08-19T01:05:16Z

I'm not sure whether this is acceptable (it sounds reasonable to me for regular hash maps, but I'm less sure about linked hash maps), but I want to point out that it's not just "unsynchronized" iteration (implying multiple threads). You can do something like so:

case class Deg(i: Int) { override hashCode = 0 }
val m = collection.mutable.HashMap[Deg, Int](Deg(0) -> 1, Deg(1) -> 2)
val it = m.iterator
it.next  // Deg(1) -> 2
it -= Deg(1)
it.next  // Now is Deg(0) -> 1

This is true entirely within a single thread. There's nothing racy about it. (Note that removing Deg(0) also is not visible to the iterator, but that removing something later would be--so it's pretty questionable behavior to rely upon.)

I'd recommend making this policy more explicit and sane for 2.12 (e.g. by using this patch). I'd have to defer to others regarding 2.10 and 2.11.

retronym · 2014-08-19T01:16:10Z

The docs of ConcurrentModificationException are worth a read in this context. They define "Concurrent" to include interleaved, single threaded ops.

Our docs for Iterator mention:

It is of particular importance to note that, unless stated otherwise, ''one should never
use an iterator after calling a method on it''. The two most important exceptions
are also the sole abstract methods: next and hasNext.

We should probably expand this to include "calling a mutating method on the underlying collection".

alextg-github · 2014-08-19T18:00:20Z

lchoran: you identified the problem that causes private-inline.scala to fail. It isn't a case of concurrency. It is a case of a remove() (or -=) operation occurring in the closure to a foreach (or iterator). I added asserts to the code and got the following backtrace:

Possible compiler crash during test of: test/files/run/private-inline.scala
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at scala.collection.mutable.HashMap.$minus$eq(HashMap.scala:97)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:90)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:87)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:234)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:101)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:87)
    at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
    at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1662)

When I rewrote the foreachEntry method from (simplified):

         while (es != null) {
               f(...)
               es = es.next
          }

to be:

         while (es != null) {
               val next1 = es.next
               f(...)
               val next = es.next
               es = if (next ne null) next else next1
          }

the private-inline.scala test passes 100% of the time for 100 runs.

The challenge is that one then needs to be aware of this effect of removing nodes causing link fields to change and to write loops with hasNext/next to similarly read the next value before executing the closure. This doesn't seem to be acceptable (and shows that hasNext/next is not a nice interface in some ways).

The fixed code is also still open to the concurrent issue I mentioned. But it is good to understand why this broke scalac.

Doing so relies on implementation details which might change. See scala#3911 / SI-8774.

retronym · 2014-09-30T13:25:03Z

I've submitted #4018 to decouple that spot in the backend from the implementation details of HashMap. Great detective work, BTW!

I wonder if we also need to build in a fail-fast mechanism like java.util.HashMap's
ConcurrentModificationException to let us go ahead with this change (even if we defer it 2.12.)

gkossakowski · 2014-09-30T15:55:43Z

Shall we merge this patch too? I would like to include this in 2.11.3 (as a side effect of merging 2.10.x into 2.11.x branch).

@alextg-github, WDYT?

retronym · 2014-10-02T11:40:04Z

I feel uncomfortable merging this change as it stands, especially in a minor release. How many more #4018's will show up in the wild after someone updates to 2.11.3?

That said, I'd like to get this into 2.12. It is clearly leads to hard-to-diagnose performance problems for our users, and its one that Java's collections are engineered to avoid.

Our decision matrix is:

When? 2.11 or 2.12
What? Status Quo / This PR / This PR + fail fast ConcurrentModificationException / (or something else?)

lrytz · 2014-10-02T11:40:54Z

+1 for fail fast

gkossakowski · 2014-10-02T14:21:15Z

+1 to fail fast. This PR targets 2.10 at the moment. I'd be willing to risk merging this into 2.11. I think #4018 is a good example of what can go wrong but doesn't give us a sense how often such code would be seen in the wild. I think the only way to find out is to try. It's ok to risk a regression from time to time.

Ichoran · 2014-10-02T19:09:01Z

+1 for fail fast and 2.12, unless there are better ways to reduce the chance of changed behavior (e.g. by sticking in a sentinel value when removing, and caching the next value in the iterator so you can use it if you read the sentinel) for 2.11. For 2.12 maybe we just want the simpler/higher-performance implementation, but I'm wary of things exploding otherwise.

Ichoran · 2014-10-02T20:08:25Z

I wanted to provide a little more motivation for why I think this is a dangerous change for 2.11. The Scala library doesn't have Java's facilities for iterators that mutate, and doesn't have a full set of modify-in-place methods to match filter, collect, etc.. So a very natural way to filter in place is something like

myMap.foreach{ case (k,v) => if (!p(k)) myMap -= k }

Unfortunately, this is exactly the sort of thing that looks maybe reasonable and works now, but will fail if the null-out-next-pointer patch goes through. In the case of HashMap it could be particularly pernicious in that it will manifest seemingly randomly only when hash keys collide and you happen to take out an early one.

Because of this, I think it's a somewhat dangerous change at all unless the sentinel approach is used, and is certainly too perilous for 2.11 or 2.10. With cached next and a sentinel, the original behavior would be retained (albeit with a performance penalty for iteration).

One does have to be careful to modify every iterator that is created or you'll end up with random ClassCastExceptions instead as the iterator blindly returns the sentinel.

alextg-github · 2014-10-03T20:22:43Z

Apologies for dropping off. I had been researching C++ and Java specs for collections and iterators in early September but got knocked for a loop (literally, got clipped by a car while running and lost several weeks; fortunately, injuries/concussion were minor but forced time off).

Here is my current take (which I think matches lchoran). Java only allows remove on an iterator, not on other forms of iteration over a collection. It defines this as part of the specification for both collection iteration and for the remove() operator, itself. It implements this (per collection type) with state that carries between hasNext() and next() (as well as remove). It coordinates ConcurrentModificationException, typically, via counters that are inc'd by modifying operations and checked during iteration of any kind. It is a fairly intrusive set of changes.

Because of the size of those changes, I agree with lchoran and Jason that this should probably be fixed completely in 2.12. The fix includes fixing the specification to spell this out and going through the various collection types to implement ConcurrentModificationException. We can then fix the nulling of link fields after that is in place.

What I am likely to do inside Twitter is publish locally modified jar files that fix LinkedHashMap for scala 2.10.x and 2.11.x since here I need only worry about whether this breaks our own stack. But I think it is too close to the 2.10.5 deadline to push anything there.

Should I leave this SI open for the 2.12 work? The current patch is inadequate for what should be done.

gkossakowski · 2014-10-06T15:12:40Z

Hi @alextg-github! Sorry to hear about your accident. At the same time, it's great to hear that it turned fairly minor and your back to your regular activities!

Thanks for your detailed response! Shall we close this PR? Would be interested cleaning up the matters you mentioned in your response and submitting a PR against 2.12.x branch?

lrytz · 2014-10-08T05:51:59Z

Recent list about difficulties with iterator's ConcurrentModificationException: http://stackoverflow.com/questions/26249635/scala-mutablelist-when-foreach-add-element-why-not-throw-exception/26249934#26249934

retronym · 2014-10-09T06:56:21Z

We need to close this one to retarget and flesh out the fix as discussed.

@alextg-github yep, please leave the JIRA ticket open, and update it the summary of our discussion here.

scala-jenkins added the needs-attention label Aug 4, 2014

alextg-github changed the title ~~SI-8774 - set HashMap/HashSet node's link fields to null when removed~~ SI-8774 - set LinkedHashMap/LinkedHashSet node's link fields to null when removed Aug 13, 2014

scala-jenkins added tested and removed needs-attention labels Aug 15, 2014

retronym added a commit to retronym/scala that referenced this pull request Sep 30, 2014

Don't remove elements from a map during iteration.

c048e19

Doing so relies on implementation details which might change. See scala#3911 / SI-8774.

retronym mentioned this pull request Sep 30, 2014

Don't remove elements from a map during iteration. #4018

Merged

retronym closed this Oct 9, 2014

adriaanm mentioned this pull request Jul 18, 2016

SI-8774 mutable LinkedHashMap/LinkedHashSet nodes' link fields should be nulled out when removed scala/scala-dev#170

Closed

szeiger mentioned this pull request Jul 21, 2016

SI-8774 Null link fields in mutable LinkedHashMap (and friends) on remove #5295

Merged

varming mentioned this pull request Jul 22, 2016

Throw ConcurrentModificationException exception's in scala collections? scala/scala-dev#188

Closed

scabug mentioned this pull request Apr 7, 2017

mutable LinkedHashMap/LinkedHashSet nodes' link fields should be nulled out when removed scala/bug#8774

Closed

Conversation

alextg-github commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 4, 2014

Uh oh!

scala-jenkins commented Aug 4, 2014

Uh oh!

axel22 commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 4, 2014

Uh oh!

som-snytt commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 4, 2014

Uh oh!

som-snytt commented Aug 4, 2014

Uh oh!

alextg-github commented Aug 5, 2014

Uh oh!

scala-jenkins commented Aug 5, 2014

Uh oh!

alextg-github commented Aug 5, 2014

Uh oh!

alextg-github commented Aug 5, 2014

Uh oh!

scala-jenkins commented Aug 5, 2014

Uh oh!

alextg-github commented Aug 6, 2014

Uh oh!

scala-jenkins commented Aug 6, 2014

Uh oh!

retronym commented Aug 6, 2014

Uh oh!

scala-jenkins commented Aug 6, 2014

Uh oh!

alextg-github commented Aug 7, 2014

Uh oh!

scala-jenkins commented Aug 7, 2014

Uh oh!

lrytz commented Aug 7, 2014

Uh oh!

scala-jenkins commented Aug 7, 2014

Uh oh!

alextg-github commented Aug 14, 2014

Uh oh!

scala-jenkins commented Aug 14, 2014

Uh oh!

alextg-github commented Aug 14, 2014

Uh oh!

scala-jenkins commented Aug 14, 2014

Uh oh!

alextg-github commented Aug 15, 2014

Uh oh!

scala-jenkins commented Aug 15, 2014

Uh oh!

alextg-github commented Aug 15, 2014

Uh oh!

scala-jenkins commented Aug 15, 2014

Uh oh!

alextg-github commented Aug 18, 2014

Uh oh!

Ichoran commented Aug 19, 2014

Uh oh!

retronym commented Aug 19, 2014

Uh oh!

alextg-github commented Aug 19, 2014

Uh oh!

retronym commented Sep 30, 2014

Uh oh!

gkossakowski commented Sep 30, 2014

Uh oh!

retronym commented Oct 2, 2014

Uh oh!

lrytz commented Oct 2, 2014

Uh oh!

gkossakowski commented Oct 2, 2014