JRuby behaves differently from MRI when unmarshaling symbols. The unmarshaled symbol always seems to have the US-ASCII encoding, even if it contains non-ASCII Unicode characters.
To reproduce this, I needed two separate scripts. (It seems that the state of JRuby's symbol table affects how Marshal.load behaves.)
In test1.rb, I have:
# coding: UTF-8
mu = 'µ'.to_sym
File.open('mu.dat', 'wb') { |f| f.write(Marshal.dump(mu)) }
In test2.rb, I have:
dump = File.open('mu.dat', 'rb') { |f| f.read }
p dump.bytes.to_a
mu = Marshal.load(dump)
puts mu.to_s.encoding
Here is the output I get from running these scripts, and also information about the versions of Ruby I am using:
$ jruby -v && jruby test1.rb && jruby test2.rb
jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM
1.7.0_07-b10 [Windows 8-amd64]
[4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]
US-ASCII
$ ruby -v && ruby test1.rb && ruby test2.rb
ruby 2.0.0p0 (2013-02-24) [x64-mingw32]
[4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]
UTF-8
From this we can see that JRuby and MRI marshal the data identically, but when JRuby unmarshals it, it sets the symbol's encoding to US-ASCII instead of UTF-8.
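For reference, here is a rough breakdown of the 12 bytes printed above, based on my reading of the Marshal 4.8 format. The variable names are only for illustration. The point is that the dump carries an `E` instance variable set to true, which is how Marshal records UTF-8, so the encoding information is present in the file and appears to be lost on load.

```ruby
# coding: UTF-8
# Bytes printed by test2.rb (identical under JRuby and MRI).
bytes = [4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84]

version    = bytes[0, 2]        # [4, 8]: Marshal format version 4.8
ivar_tag   = bytes[2].chr       # "I": the next object carries instance variables
symbol_tag = bytes[3].chr       # ":": the object is a symbol
name_len   = bytes[4] - 5       # small positive integers are stored offset by 5, so 7 means 2
name       = bytes[5, name_len].pack('C*').force_encoding('UTF-8')  # the two UTF-8 bytes of the symbol name
ivar_count = bytes[7] - 5       # 6 means 1 instance variable
key_tag    = bytes[8].chr       # ":": the instance variable name is itself a symbol
key_len    = bytes[9] - 5       # 6 means 1 byte
key        = bytes[10, key_len].pack('C*')  # "E": the encoding flag
value_tag  = bytes[11].chr      # "T": true, meaning UTF-8 ("F" would mean US-ASCII)

puts "version #{version.join('.')}, symbol name #{name.inspect}"
puts "encoding flag #{key}=#{value_tag}"
```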
This issue came up because I am trying to use YARD to generate documentation for JRuby code that has non-ASCII characters in a few method alias names. When I run "yard doc", the data about those methods is marshaled and written to disk, and when I run "yard server --reload" it is unmarshaled with the wrong encoding.
One workaround for this issue is to create a symbol with the proper encoding before running Marshal.load.
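A minimal sketch of that workaround, assuming the symbol name is known ahead of time. To keep it self-contained it rebuilds the dump from the bytes shown above instead of reading mu.dat. I use `to_sym` rather than a symbol literal here because literals may have an encoding problem of their own.

```ruby
# coding: UTF-8
# Same bytes that test1.rb wrote to mu.dat.
dump = [4, 8, 73, 58, 7, 194, 181, 6, 58, 6, 69, 84].pack('C*')

# Workaround: put a UTF-8 encoded symbol into the symbol table
# before Marshal.load sees the same name.
'µ'.to_sym

mu = Marshal.load(dump)
puts mu.to_s.encoding
```

With the pre-created symbol in place, the encoding printed under JRuby should match what MRI reports.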
Sorry if this is a duplicate. This could be related to the issue with symbol literal encoding that I just reported, #1328. I also see there is another open issue about method that is probably related to symbol encoding: #914.