I noticed that JRuby's symbol table cannot seem to store two different symbols when they have the same bytes, even if their encodings differ.
Here is a script demonstrating the problem:
sym1 = "ab".force_encoding("UTF-16").to_sym
sym2 = "ab".to_sym
puts sym2.encoding
sym3 = "cd".to_sym
sym4 = "cd".force_encoding("UTF-16").to_sym
puts sym4.encoding
Here is the output from my shell, showing that MRI gets the encodings of sym2 and sym4 right, while JRuby gets them wrong because a symbol with the same bytes already exists in the symbol table:
$ ruby -v && ruby test_symbol_table.rb
ruby 2.0.0p0 (2013-02-24) [x64-mingw32]
US-ASCII
UTF-16
$ source use_jruby_179.sh
$ jruby -v && jruby test_symbol_table.rb
jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18 [Windows 8-amd64]
UTF-16
US-ASCII
I think I will work on making a pull request to fix this; advice and objections are welcome.
I have made some progress on issue #1329 (properly setting the encoding of unmarshaled symbols), but before I can truly succeed in fixing that, I think I need to fix this and maybe a few other fundamental things about symbols.
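To illustrate what I suspect is going on, here is a minimal sketch of two toy symbol tables. It is only a model and not JRuby's actual implementation: the names Entry, BytesOnlyTable and BytesAndEncodingTable are hypothetical, and I am assuming the real table looks symbols up by bytes alone. The first table reproduces the JRuby output above (the first encoding registered for a byte sequence wins), while the second, keyed on bytes plus encoding, behaves like MRI.

```ruby
# Hypothetical model of a symbol table; NOT JRuby's real implementation.
Entry = Struct.new(:bytes, :encoding)

# Keyed on raw bytes only: the first encoding registered for a byte
# sequence wins, and later lookups with another encoding get it back.
class BytesOnlyTable
  def initialize
    @entries = {}
  end

  def intern(str)
    @entries[str.bytes] ||= Entry.new(str.bytes, str.encoding)
  end
end

# Keyed on bytes plus encoding: the same bytes under different
# encodings produce distinct entries.
class BytesAndEncodingTable
  def initialize
    @entries = {}
  end

  def intern(str)
    key = [str.bytes, str.encoding]
    @entries[key] ||= Entry.new(str.bytes, str.encoding)
  end
end

# Same bytes, different encodings. US-ASCII is forced explicitly here
# so the model does not depend on the source file's encoding.
utf16 = "ab".dup.force_encoding("UTF-16")
ascii = "ab".dup.force_encoding("US-ASCII")

broken = BytesOnlyTable.new
broken.intern(utf16)
puts broken.intern(ascii).encoding  # UTF-16, like sym2 under JRuby

fixed = BytesAndEncodingTable.new
fixed.intern(utf16)
puts fixed.intern(ascii).encoding   # US-ASCII, like sym2 under MRI
```

If this model is right, a fix would need the lookup key to include the encoding (or an equivalent comparison on a hash collision), not only the bytes.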