Java app issues/workarounds #3435

@Nashatyrev

Describe the issue

I have been adapting the Ethereum Java client Teku to run inside Shadow (specifically with Ethshadow). In this issue I would like to record the problems and challenges I faced along the way. This could be useful for anyone who attempts to run their own Java app in Shadow.

  1. java.net.NetworkInterface.networkInterfaces() throws java.net.SocketException: Invalid argument (ioctl(SIOCGIFCONF) failed)
    Shadow logs the following warning:

    [shadow_rs::host::descriptor::socket::inet::udp] (LOG_ONCE) We do not yet handle ioctl request SIOCGIFCONF on tcp sockets
    

    This prevented me from starting Teku from Gradle, because Gradle calls the method internally. Teku itself also invokes the method, but I was able to work around that by explicitly setting the bind socket address.

  2. (see Stubs for UDP socket options SO_BROADCAST, SO_REUSEADDR, SO_REUSEPORT #3434) The UDP SO_BROADCAST option is not supported: the Netty library (a widely adopted JVM networking library) requests this socket option before binding a UDP port. I found no workaround on the client code side.

  3. (see Stubs for UDP socket options SO_BROADCAST, SO_REUSEADDR, SO_REUSEPORT #3434) The UDP SO_REUSEADDR and SO_REUSEPORT options are not supported: the Teku app itself sets them when it checks that the specified port is free, so that it can fail fast if the port is busy. I'm not sure the options are strictly necessary here, but they make sense because the port is promptly closed and then reopened (when the actual port binding occurs). This could potentially fail because of the socket close delay in the OS (possibly only on some specific OSes).

  4. The Java property os.name has the value shadowsys when running under Shadow. To make various native libraries work correctly, I had to set this property explicitly: -Dos.name=Linux

  5. The most challenging problem was the Java process hanging. After the simulation had been running normally for some minutes, it would get stuck at a specific simulation time, and top showed the java process consuming 100% of a single CPU core. Debugging highlighted two threads from the same Netty thread pool:

    • one was inside a native JIT compiler method that was recompiling a hot Netty method, but had yielded control to Shadow via some lock inside the JIT
    • the other showed no native or Java stack details, but it was exactly this thread that was consuming 100% of a CPU core

    The exact cause is still unknown (I also could not reproduce it in a standalone test app). My best guess is that when the JIT recompiles a method from a lower to a higher optimization level, it temporarily injects something like a spin lock that is supposed to wait for the compilation running in another thread to complete. There is currently no known way to handle spin locks.

    In any case, the following JVM options did the trick and the problem no longer appeared: -XX:-TieredCompilation or -XX:TieredStopAtLevel=1. The option -Xint worked as well, but it was obviously far too slow. (It is worth mentioning that the options -XX:-UseOnStackReplacement and -XX:ActiveProcessorCount=1 did not help.)

    UPD: With -XX:-TieredCompilation the problem is still observed (though less frequently). With -XX:TieredStopAtLevel=1 everything has worked so far.

  6. Runs were non-deterministic because of the persistent temp directory (/tmp): Java's File.createTempFile() creates a temp file by generating a random number and adding a prefix and suffix. If the file already exists, it generates the next random number. Since the random seed is always the same, the number of iterations grows from run to run, which leads to non-determinism.

    I tried adding the Java option -Djava.io.tmpdir=. but some external library expects an absolute path, so this solution fails. My current workaround is simply to clean /tmp before every run.
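For item 6, one way to avoid both the stale files and the relative-path problem might be to reset a dedicated directory before each run and pass its absolute path via -Djava.io.tmpdir. The following is only a minimal sketch under assumptions, not something taken from Teku or Ethshadow: the class name FreshTempDir and the default directory name shadow-tmp are hypothetical, and the helper is meant to run outside the simulation, before Shadow starts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper: gives every run an empty temp directory with an absolute path.
public final class FreshTempDir {

    /** Deletes dir recursively if it exists, recreates it empty, and returns its absolute path. */
    static Path reset(Path dir) throws IOException {
        Path abs = dir.toAbsolutePath().normalize();
        if (Files.exists(abs)) {
            try (Stream<Path> walk = Files.walk(abs)) {
                // Reverse order visits children before their parent directories.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        return Files.createDirectories(abs);
    }

    public static void main(String[] args) throws IOException {
        // "shadow-tmp" is an assumed default, not a Shadow or Ethshadow convention.
        Path tmp = reset(Path.of(args.length > 0 ? args[0] : "shadow-tmp"));
        // Prints a JVM flag with an absolute path that could be appended to JAVA_OPTS.
        System.out.println("-Djava.io.tmpdir=" + tmp);
    }
}
```

Because the directory starts empty on every run, File.createTempFile() should need the same number of attempts each time, and the printed path is absolute, which is what the external library mentioned above expects.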
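For item 1, the workaround of setting the bind address explicitly can be illustrated with a small sketch of the pattern: use the configured address when there is one and only enumerate interfaces otherwise, tolerating the SocketException that Shadow currently causes. This is not Teku's actual code. The class name BindAddressResolver, the method name, and the loopback fallback are assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: avoids interface enumeration when an address is configured.
public final class BindAddressResolver {

    static List<InetAddress> localAddresses(String configuredAddress) throws UnknownHostException {
        if (configuredAddress != null && !configuredAddress.isBlank()) {
            // An explicit address skips NetworkInterface entirely.
            return List.of(InetAddress.getByName(configuredAddress));
        }
        try {
            return NetworkInterface.networkInterfaces()
                    .flatMap(NetworkInterface::inetAddresses)
                    .collect(Collectors.toList());
        } catch (SocketException e) {
            // Under Shadow the SIOCGIFCONF ioctl fails; falling back to loopback is an assumption.
            return List.of(InetAddress.getLoopbackAddress());
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(localAddresses(args.length > 0 ? args[0] : null));
    }
}
```

Whether loopback is a sensible fallback depends on the application; failing with a clear message that asks for an explicit bind address may be the better choice.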

Debugging

D1. I haven't managed to use jstack: when it is run as jstack <pid>, the target java process starts consuming all available memory and then crashes. jstack then only reports that the process doesn't respond.

D2. I have no idea whether (or how) it is possible to use general Java debuggers and profilers. They usually need a TCP port for communication, and I'm not sure how that could be done with Shadow.

D3. What did work:

  • GDB, which can give some low-level debug information such as native thread stacks
  • jhsdb clhsdb --pid <java pid>: at least the Java stack is available here
  • enabling core files in the OS, calling kill -SEGV <java pid>, then analysing the core dump
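Since attaching jstack from outside failed (D1), an in-process thread dump through the standard ThreadMXBean API might be another option to try. This is an untested idea under Shadow, not something used in this investigation: the class name ThreadDump is hypothetical, and the dump only helps if the JVM can still schedule the thread that calls it, which may not hold for the hang described in item 5.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: collects Java stacks without attaching an external tool.
public final class ThreadDump {

    /** Returns the name, state, and Java stack of every live thread as text. */
    static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked monitors and synchronizers to keep the call cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

An application could call dump() from a periodic task or a signal hook and write the result to its log, so the stacks end up in the host's output files.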

To Reproduce

  • Build Shadow from the branch of Stubs for UDP socket options SO_BROADCAST, SO_REUSEADDR, SO_REUSEPORT #3434 (if a working sample is wanted)
  • Build Ethshadow from the temporary teku branch:
    git clone https://github.com/dknopik/ethshadow -b teku
  • Build Teku:
    git clone https://github.com/Consensys/teku
    cd teku
    ./gradlew installDist
  • Run the following Ethshadow config with ethshadow sample-teku.yaml:
general:
  stop_time: 120m
  progress: true

ethereum:
  validators: 120
  nodes:
    - location: europe
      reliability: reliable
      tag: boot
      clients:
        cl: lighthouse_bootnode
    - location: europe
      reliabilities: reliable
      count:
        total: 2
      clients:
        cl: teku
  clients:
    teku:
      type: teku
      use_unsafe_test_stub: true
      executable: ~/teku/build/install/teku/bin/teku
      environment:
        JAVA_OPTS: -XX:ActiveProcessorCount=1 -XX:-TieredCompilation -Dos.name=Linux 
      extra_args: "--ignore-weak-subjectivity-period-enabled --logging=INFO"

Operating System (please complete the following information):
Both platforms behave the same way:

  • Linode Virtual box:
> lsb_release -d
No LSB modules are available.
Description:    Ubuntu 24.04.1 LTS
> uname -a
Linux localhost 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Windows WSL:
> lsb_release -d
Description:    Ubuntu 22.04.1 LTS
> uname -a
Linux DESKTOP-ANTON-N 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Shadow (please complete the following information):

Shadow 3.2.0 — v3.2.0-173-g4a1d8ac8d-dirty 2024-10-23--17:04:43
GLib 2.80.0
Built on 2024-11-05--09:39:27
Built from git branch main
Shadow was built with PROFILE=release, OPT_LEVEL=3, DEBUG=false, RUSTFLAGS="-C force-frame-pointers=y", CFLAGS="-std=gnu11 -O3 -ggdb -fno-omit-frame-pointer -Wreturn-type -Wswitch -DNDEBUG"
  • Which processes you are trying to run inside the Shadow simulation:
    java

Labels: Type: Enhancement (New functionality or improved design)
