
SI-8774 - set LinkedHashMap/LinkedHashSet node's link fields to null when removed#3911

Closed
alextg-github wants to merge 1 commit into scala:2.10.x from twitter-forks:issue/SI-8774

Conversation

@alextg-github

Several collection types have var link fields in their nodes. Examples
include LinkedHashMap and LinkedHashSet. As implemented, removal
of a value from one of these structures unlinks the node from the
collection but leaves that node's link fields unchanged. This can lead to
a form of nepotism causing unnecessary promotion of objects in
generational garbage collection and can lead to sequences of frequent
full garbage collections.

How does this happen? These fields allow older (previously allocated
objects) to point to younger objects in the heap. If a removed node
happens to be in an older generation, the fact that it refers to younger
objects causes these objects to be promoted even if they are otherwise
unreachable. This then causes an instance of one of these collections
(especially if used as a queue) to repeatedly promote objects that
should have been collected in the young generation. The problem is not
a semantic one, but instead, a limitation of incremental tracing garbage
collection in which it collects part of the heap while treating another
part, conservatively, as if it were reachable.

The PR includes a simple benchmark (LinkedHashMapTest.scala) that shows
the issue. The benchmark has three phases. In the first phase, a
LinkedHashMap[Int, Int] is populated. In the second phase, that collection
instance (and all its nodes) is promoted by invoking System.gc(). In the
final phase, nodes are repeatedly added and removed as if the data structure
were a queue. The test is not included in this changeset because it needs to be
run in a way that can quantify the number of full GCs that occur during
the third phase.
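A rough sketch of those three phases in Scala (sizes and names here are illustrative, not taken from the actual LinkedHashMapTest.scala):

```scala
// Illustrative sketch of the three-phase benchmark described above.
import scala.collection.mutable.LinkedHashMap

val m = LinkedHashMap.empty[Int, Int]

// Phase 1: populate the map.
for (i <- 0 until 1000) m(i) = i

// Phase 2: promote the map (and all of its nodes) into the old generation.
System.gc()

// Phase 3: use the map as a queue, repeatedly removing the oldest key and
// adding a new one. Without the fix, a tenured dead node still points at
// younger nodes, so otherwise-dead young nodes keep getting promoted.
var next = 1000
for (_ <- 0 until 10000) {
  m -= (next - 1000)
  m(next) = next
  next += 1
}
```

A real run would be driven with GC logging enabled (e.g. -verbose:gc) so the number of full collections during phase 3 can be counted.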

When run on the existing code base, the test shows frequent repeated
full GCs after phase three starts. The fix is relatively straightforward:
add code in the remove operations to null out the link fields. With the
fix, there are only young GCs during that final phase.
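For illustration only (a toy structure, not the actual patch), the essence of the fix on a doubly linked list of entries:

```scala
// Toy doubly linked structure illustrating the fix: remove() both unlinks
// the entry and nulls its own link fields, so a dead entry sitting in the
// old generation can no longer keep its (younger) neighbors reachable.
class Entry(val key: Int) {
  var earlier: Entry = null
  var later: Entry = null
}

class TinyLinkedSet {
  var firstEntry: Entry = null
  var lastEntry: Entry = null

  def add(key: Int): Entry = {
    val e = new Entry(key)
    if (firstEntry eq null) firstEntry = e
    else { lastEntry.later = e; e.earlier = lastEntry }
    lastEntry = e
    e
  }

  def remove(e: Entry): Unit = {
    if (e.earlier eq null) firstEntry = e.later else e.earlier.later = e.later
    if (e.later eq null) lastEntry = e.earlier else e.later.earlier = e.earlier
    e.earlier = null  // the fix: clear the removed node's link fields
    e.later = null
  }
}
```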

Aside: This all came out of the observation of occasional/unpredictable
spikes in GC activity in some of our services. We also found and fixed
these issues with the Java LinkedHashMap class and fixed our 1.7 JDK to
also null out these link fields. We've verified that this fixes the
pathologies we saw. We also found that, independently, the OpenJDK 1.8
class files for these collection classes have also been modified (by
Oracle) to null out these link fields. We'd now like to do the same for
the Scala types.

testing: "test/partest --all" passes all tests

@alextg-github
Author

review @Ichoran @phaller @axel22

@alextg-github
Author

PLS REBUILD/pr-scala@efb7f68

@scala-jenkins

(kitty-note-to-self: ignore 51068312)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@axel22
Contributor

axel22 commented Aug 4, 2014

Very good analysis, and I really like your test case. It makes me think that we should add an additional Measurer to ScalaMeter, which counts GC cycles - exactly for these kinds of situations.

From the correctness standpoint all looks good to me.

(Btw, where can I see the GC test file LinkedHashMapTest.scala?)

LGTM

@alextg-github
Author

Thanks for the review! At the moment, the test scala file is attached to the PR and to the thread I started on scala-internals: https://groups.google.com/forum/#!topic/scala-internals/g51SUrGnxas

Being able to test the number of (full) GCs after a point in the program (here, after phase 2) would be great. This could then be added as a test.

I'm currently trying to rebuild and rerun the pull-requests tests on Jenkins. They failed the first time around but looking at the logs, the failures look unrelated to the change.

@som-snytt
Contributor

I'm not competent to review, but it's really interesting, after years of thinking I didn't have to help the GC.

The build runs -optimize, so it would be funny if the failing test for -optimize inlining revealed a reliance on a weak ref not getting collected, or similar. (Without knowing anything about it.)

@alextg-github
Author

som-snytt: Thanks! For the most part, one doesn't need to help the collector. However, it is possible to end up with either temporary or actual "memory leaks". In this case, the collections are holding onto long chains of younger objects because the older (now dead) objects are in a part of the heap that is conservatively being considered live.

@alextg-github
Author

My first set of tests on Jenkins failed with the files/run/private-inline.scala test in the pr-scala-test runs. I looked at the backtrace and at the code (line 42) for the assert. It seemed unrelated to my change so I asked for a rebuild.

The rebuild also failed. However, this time the pr-scala-test run passed (including private-inline.scala). But unlike the first attempt where it passed, the pr-scala-rangepos run failed and it also failed on the files/run/private-inline.scala test (with the same backtrace).

Is this test known to be buggy or racy? If it isn't, I will dig deeper to see how my change could have affected it. If it is, should I request another rebuild?

@som-snytt
Contributor

FTR, test-opt passed on my machine this morning, but that doesn't prove any negatives.

The failure is that it detects a closure class (Class forName $anon$fun) that should have been inlined. So maybe how the backend decides whether to emit the class is affected in a racy way. (Just a guess.)

@alextg-github
Author

PLS REBUILD/pr-scala@efb7f68

@scala-jenkins

(kitty-note-to-self: ignore 51135705)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@alextg-github
Author

With the latest round of tests, it passes everything except pr-scala-integrate-ide which it says has been broken since #3915. This also does not appear to be due to this change set. Any advice? Should I file a ticket?

Looking at scala-internal, Paolo Giarrusso is reporting a similar issue with that test today.
https://groups.google.com/forum/#!searchin/scala-internals/pr-scala-integrate-ide/scala-internals/XKJa5WffDXg/LuRgQ4KPSsoJ

@alextg-github
Author

PLS REBUILD/pr-scala@efb7f68

@scala-jenkins

(kitty-note-to-self: ignore 51185230)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@alextg-github
Author

PLS REBUILD/pr-scala@efb7f68

@scala-jenkins

(kitty-note-to-self: ignore 51330199)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@retronym
Member

retronym commented Aug 6, 2014

PLS REBUILD ALL

@scala-jenkins

(kitty-note-to-self: ignore 51399838)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@alextg-github
Author

PLS REBUILD ALL

@scala-jenkins

(kitty-note-to-self: ignore 51415901)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@lrytz
Member

lrytz commented Aug 7, 2014

PLS REBUILD ALL

The test failure seems unrelated to the change - it tests elimination of inlined closures. I could not reproduce it locally.

@scala-jenkins

(kitty-note-to-self: ignore 51484250)
🐱 Roger! Rebuilding pr-scala for efb7f68. 🚨

@alextg-github alextg-github changed the title SI-8774 - set HashMap/HashSet node's link fields to null when removed SI-8774 - set LinkedHashMap/LinkedHashSet node's link fields to null when removed Aug 13, 2014
@alextg-github
Author

PLS REBUILD ALL

@scala-jenkins

(kitty-note-to-self: ignore 52133998)
🐱 Roger! Rebuilding pr-scala for efb7f68, 568b4cb, 5197fd2, 6f629e3, 2a94053. 🚨

@alextg-github
Author

PLS REBUILD ALL

@scala-jenkins

(kitty-note-to-self: ignore 52210718)
🐱 Roger! Rebuilding pr-scala for efb7f68, 568b4cb, 5197fd2, 6f629e3, 2a94053. 🚨

@alextg-github
Author

PLS REBUILD ALL

@scala-jenkins

(kitty-note-to-self: ignore 52263874)
🐱 Roger! Rebuilding pr-scala for 533d5b0. 🚨

@alextg-github
Author

PLS REBUILD/pr-scala@533d5b0

@scala-jenkins

(kitty-note-to-self: ignore 52267775)
🐱 Roger! Rebuilding pr-scala for 533d5b0. 🚨

@alextg-github
Author

@axel22, others: I managed to get the change into one commit (new to git) and it now passes all tests. However, I posted a question in the ticket about the semantics of unsynchronized iteration. The reason my first change (which also modified HashMap) caused the private-inline.scala test to fail 26% of the time is that the semantics of (unsynchronized) iteration change with this request. I backed out that part of the change, but the remaining change affects LinkedHashMap and LinkedHashSet iteration in a similar manner. In particular, by nulling the link fields on removal of a node, we no longer have the property that an unsynchronized iteration will visit all nodes not removed during that iteration. Instead, the iteration may end early. My question is: is this acceptable?

Note that the same change has been made to JDK 8's class files for these classes, so it has also made this change in behavior.

@Ichoran
Contributor

Ichoran commented Aug 19, 2014

I'm not sure whether this is acceptable (it sounds reasonable to me for regular hash maps, but I'm less sure about linked hash maps), but I want to point out that it's not just "unsynchronized" iteration (implying multiple threads). You can do something like so:

case class Deg(i: Int) { override def hashCode = 0 }
val m = collection.mutable.HashMap[Deg, Int](Deg(0) -> 1, Deg(1) -> 2)
val it = m.iterator
it.next  // Deg(1) -> 2
m -= Deg(1)
it.next  // Now is Deg(0) -> 1

This is true entirely within a single thread. There's nothing racy about it. (Note that removing Deg(0) is also not visible to the iterator, but removing something later would be, so it's pretty questionable behavior to rely upon.)

I'd recommend making this policy more explicit and sane for 2.12 (e.g. by using this patch). I'd have to defer to others regarding 2.10 and 2.11.

@retronym
Member

The docs of ConcurrentModificationException are worth a read in this context. They define "Concurrent" to include interleaved, single threaded ops.

Our docs for Iterator mention:

It is of particular importance to note that, unless stated otherwise, ''one should never
use an iterator after calling a method on it''. The two most important exceptions
are also the sole abstract methods: next and hasNext.

We should probably expand this to include "calling a mutating method on the underlying collection".

@alextg-github
Author

Ichoran: you identified the problem that causes private-inline.scala to fail. It isn't a case of concurrency. It is a case of a remove() (or -=) operation occurring in the closure passed to a foreach (or iterator). I added asserts to the code and got the following backtrace:

Possible compiler crash during test of: test/files/run/private-inline.scala
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at scala.collection.mutable.HashMap.$minus$eq(HashMap.scala:97)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:90)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:87)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:234)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:101)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:87)
    at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
    at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1662)

When I rewrote the foreachEntry method from (simplified):

    while (es != null) {
      f(...)
      es = es.next
    }

to be:

    while (es != null) {
      val next1 = es.next   // cache the successor before invoking f
      f(...)
      val next = es.next    // if f removed es, its link field was nulled
      es = if (next ne null) next else next1
    }

the private-inline.scala test passes 100% of the time for 100 runs.

The challenge is that one then needs to be aware that removing nodes causes link fields to change, and to write hasNext/next loops that similarly read the next value before executing the closure. This doesn't seem acceptable (and shows that hasNext/next is not a nice interface in some ways).

The fixed code is also still open to the concurrent issue I mentioned. But it is good to understand why this broke scalac.

retronym added a commit to retronym/scala that referenced this pull request Sep 30, 2014
Doing so relies on implementation details which might change.

See scala#3911 / SI-8774.
@retronym
Member

I've submitted #4018 to decouple that spot in the backend from the implementation details of HashMap. Great detective work, BTW!

I wonder if we also need to build in a fail-fast mechanism like java.util.HashMap's ConcurrentModificationException to let us go ahead with this change (even if we defer it to 2.12).
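For illustration, a fail-fast guard in that style might look like the following sketch (FailFastMap is an invented name, not a proposed API):

```scala
// Sketch of a java.util-style fail-fast wrapper: structural modifications
// bump modCount, and each iterator step checks it before proceeding.
import java.util.ConcurrentModificationException
import scala.collection.mutable.LinkedHashMap

class FailFastMap[K, V] {
  private val underlying = LinkedHashMap.empty[K, V]
  private var modCount = 0

  def update(k: K, v: V): Unit = { modCount += 1; underlying(k) = v }
  def remove(k: K): Unit = { modCount += 1; underlying -= k }

  def iterator: Iterator[(K, V)] = new Iterator[(K, V)] {
    private val expected = modCount
    private val it = underlying.iterator
    private def check(): Unit =
      if (modCount != expected) throw new ConcurrentModificationException
    def hasNext: Boolean = { check(); it.hasNext }
    def next(): (K, V) = { check(); it.next() }
  }
}
```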

@gkossakowski
Contributor

Shall we merge this patch too? I would like to include this in 2.11.3 (as a side effect of merging the 2.10.x branch into 2.11.x).

@alextg-github, WDYT?

@retronym
Member

retronym commented Oct 2, 2014

I feel uncomfortable merging this change as it stands, especially in a minor release. How many more #4018's will show up in the wild after someone updates to 2.11.3?

That said, I'd like to get this into 2.12. It clearly leads to hard-to-diagnose performance problems for our users, and it's one that Java's collections are engineered to avoid.

Our decision matrix is:

When? 2.11 or 2.12
What? Status Quo / This PR / This PR + fail fast ConcurrentModificationException / (or something else?)

@lrytz
Member

lrytz commented Oct 2, 2014

+1 for fail fast

@gkossakowski
Contributor

+1 to fail fast. This PR targets 2.10 at the moment. I'd be willing to risk merging this into 2.11. I think #4018 is a good example of what can go wrong but doesn't give us a sense how often such code would be seen in the wild. I think the only way to find out is to try. It's ok to risk a regression from time to time.

@Ichoran
Contributor

Ichoran commented Oct 2, 2014

+1 for fail fast and 2.12, unless there are better ways to reduce the chance of changed behavior (e.g. by sticking in a sentinel value when removing, and caching the next value in the iterator so you can use it if you read the sentinel) for 2.11. For 2.12 maybe we just want the simpler/higher-performance implementation, but I'm wary of things exploding otherwise.

@Ichoran
Contributor

Ichoran commented Oct 2, 2014

I wanted to provide a little more motivation for why I think this is a dangerous change for 2.11. The Scala library doesn't have Java's facilities for iterators that mutate, and doesn't have a full set of modify-in-place methods to match filter, collect, etc. So a very natural way to filter in place is something like

myMap.foreach{ case (k,v) => if (!p(k)) myMap -= k }

Unfortunately, this is exactly the sort of thing that looks maybe reasonable and works now, but will fail if the null-out-next-pointer patch goes through. In the case of HashMap it could be particularly pernicious in that it will manifest seemingly randomly only when hash keys collide and you happen to take out an early one.
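A side note: a pattern that stays safe under either implementation is to snapshot the doomed keys first and only then remove them (a sketch, not part of this PR):

```scala
// Filter a mutable map in place without removing during iteration:
// snapshot the keys to drop, then remove them afterwards.
import scala.collection.mutable

val p: Int => Boolean = _ % 2 == 0  // keep even keys (illustrative predicate)
val myMap = mutable.HashMap(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")

val doomed = myMap.keys.filterNot(p).toList  // toList forces the snapshot
doomed.foreach(k => myMap -= k)
```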

Because of this, I think it's a somewhat dangerous change unless the sentinel

One does have to be careful to modify every iterator that is created or you'll end up with random ClassCastExceptions instead as the iterator blindly returns the sentinel.

@alextg-github
Author

Apologies for dropping off. I had been researching C++ and Java specs for collections and iterators in early September but got knocked for a loop (literally, got clipped by a car while running and lost several weeks; fortunately, injuries/concussion were minor but forced time off).

Here is my current take (which I think matches Ichoran's). Java only allows remove on an iterator, not on other forms of iteration over a collection. It defines this as part of the specification both for collection iteration and for the remove() operator itself. It implements this (per collection type) with state that carries between hasNext() and next() (as well as remove). It coordinates ConcurrentModificationException, typically, via counters that are incremented by modifying operations and checked during iteration of any kind. It is a fairly intrusive set of changes.

Because of the size of those changes, I agree with Ichoran and Jason that this should probably be fixed completely in 2.12. The fix includes updating the specification to spell this out and going through the various collection types to implement ConcurrentModificationException. We can then fix the nulling of link fields after that is in place.

What I am likely to do inside Twitter is publish locally modified jar files that fix LinkedHashMap for scala 2.10.x and 2.11.x since here I need only worry about whether this breaks our own stack. But I think it is too close to the 2.10.5 deadline to push anything there.

Should I leave this SI open for the 2.12 work? The current patch is inadequate for what should be done.

@gkossakowski
Contributor

Hi @alextg-github! Sorry to hear about your accident. At the same time, it's great to hear that it turned out fairly minor and you're back to your regular activities!

Thanks for your detailed response! Shall we close this PR? Would you be interested in cleaning up the matters you mentioned in your response and submitting a PR against the 2.12.x branch?

@lrytz
Member

lrytz commented Oct 8, 2014

A recent example of the difficulties with iterators and ConcurrentModificationException: http://stackoverflow.com/questions/26249635/scala-mutablelist-when-foreach-add-element-why-not-throw-exception/26249934#26249934

@retronym
Member

retronym commented Oct 9, 2014

We need to close this one to retarget and flesh out the fix as discussed.

@alextg-github yep, please leave the JIRA ticket open, and update it with the summary of our discussion here.
