feat(Rust): support basic type se/de aligned with java #2585

urlyy · 2025-09-07T20:14:18Z

What does this PR do?

Updated the legacy read/write methods to align with Java.
Started adding some register_by_name functionality; not finished yet, will continue in the next PR.
Added a new test file to facilitate testing across different languages conveniently.
Basic types, primitive arrays, string, list, set, and map are now aligned with Java. The string compression algorithm and the map header still need improvement.

Related issues

#2539

TODOs:

Enum derive.
Complete register_by_name.
Align with Java metashare, test on struct.
Implement serialization for Box<dyn Any> (may be challenging).

Does this PR introduce any user-facing change?

Does this PR introduce any public API change?

// Several values can now be read from one buffer through a shared ReadContext:
let reader = Reader::new(bytes.as_slice());
let fory = Fory::default().mode(Compatible).xlang(true);
let mut context = ReadContext::new(&fory, reader);
let a = fory.deserialize_with_context(&mut context);
let b = fory.deserialize_with_context(&mut context);
let c = fory.deserialize_with_context(&mut context);

We should also add scoped_meta_context in the future.

Does this PR introduce any binary protocol compatibility change?

java/fory-core/src/main/java/org/apache/fory/resolver/XtypeResolver.java

chaokunyang · 2025-09-10T16:46:17Z

java/fory-core/src/test/java/org/apache/fory/WIPCrossLanguageTest.java

+
+/** Tests in this class need fory python/rust installed. */
+@Test
+public class WIPCrossLanguageTest extends ForyTestBase {


Rename it to RustXlangTest and skip it by default. We don't want every contributor to install Rust just to develop Java.

You can control whether it is enabled with an environment variable. This test should only run in the Rust CI. You can update the Rust CI to install Fory Java and then run: cd java/fory-core && mvn test -Dtest=org.apache.fory.RustXlangTest

chaokunyang · 2025-09-11T06:39:31Z

java/fory-core/src/test/java/org/apache/fory/RustXlangTest.java

+
+  @BeforeClass
+  public void isPyforyInstalled() {
+    // TestUtils.verifyPyforyInstalled();


We can configure an env variable such as FORY_RUST_JAVA_CI, and in the isRustJavaCIEnabled method check it and skip the whole RustXlangTest if the variable is not set to 1.

Please also update the Rust CI in .github/workflows/ci.yaml to install Fory Java and run the test command mvn test -Dtest=org.apache.fory.RustXlangTest

chaokunyang · 2025-09-11T06:39:46Z

java/fory-core/src/test/java/org/apache/fory/RustXlangTest.java

+  private static final int RUST_TESTCASE_INDEX = 4;
+
+  @BeforeClass
+  public void isPyforyInstalled() {


Suggested change

-  public void isPyforyInstalled() {
+  public void isRustJavaCIEnabled() {

chaokunyang · 2025-09-11T06:40:12Z

java/fory-core/src/test/java/org/apache/fory/RustXlangTest.java

+    // TestUtils.verifyPyforyInstalled();
+  }
+
+  @Test(enabled = false)


Remove the enabled flag and use isRustJavaCIEnabled instead.

chaokunyang · 2025-09-11T06:45:33Z

rust/fory-core/src/serializer/string.rs

    fn write(&self, context: &mut WriteContext) {
-        context.writer.var_int32(self.len() as i32);
-        context.writer.bytes(self.as_bytes());
+        let encoding = best_coder(self);


For write, we could keep using UTF-8 instead. This minimizes the string encoding cost, and other languages support decoding UTF-8.

ci/tasks/rust.py

chaokunyang · 2025-09-11T13:32:34Z

@urlyy
String serialization in Rust needs to support Latin-1 encoding. You can take the following code as an example of a SIMD Latin-1 check:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Checks if a UTF-8 string can be losslessly encoded as Latin-1, using SIMD if available.
///
/// This is true if all characters in the string have a Unicode code point <= 255.
/// This translates to a byte-level check: the string must not contain any byte >= 0xC4,
/// because code points up to 0xFF use only lead bytes 0xC2/0xC3 and continuation
/// bytes in 0x80..=0xBF.
pub fn can_be_latin1(s: &str) -> bool {
    let bytes = s.as_bytes();

    // Runtime feature detection to select the best implementation.
    // The SIMD functions are guarded by `#[target_feature]`, so the compiler
    // generates optimized code for each case. The detection result is cached
    // by the standard library after the first query.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { can_be_latin1_avx2(bytes) };
        }
        if is_x86_feature_detected!("sse2") {
            return unsafe { can_be_latin1_sse2(bytes) };
        }
    }

    // Fallback for non-x86_64 architectures or if no SIMD features are available.
    can_be_latin1_scalar(bytes)
}

/// Scalar fallback implementation. Checks byte by byte.
#[inline]
fn can_be_latin1_scalar(bytes: &[u8]) -> bool {
    // A simple iterator-based check is clean and often optimized well by the compiler.
    !bytes.iter().any(|&b| b >= 0xC4)
}

/// Implementation using SSE2 intrinsics (16-byte vectors).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn can_be_latin1_sse2(bytes: &[u8]) -> bool {
    const CHUNK_SIZE: usize = 16;
    // `_mm_cmpgt_epi8` is a signed comparison, but we need the unsigned test
    // `byte > 0xC3` (equivalent to `byte >= 0xC4`). XORing both operands with
    // 0x80 maps the unsigned range [0, 255] onto the signed range [-128, 127]
    // while preserving order, so the signed comparison becomes valid.
    let sign_bit = _mm_set1_epi8(0x80u8 as i8);
    let limit = _mm_set1_epi8((0xC3u8 ^ 0x80) as i8); // 0x43 = 67

    let mut i = 0;
    while i + CHUNK_SIZE <= bytes.len() {
        // Load 16 bytes of data. `loadu` handles unaligned memory.
        let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const _);

        // Bias the data the same way as the limit, then compare.
        // Without the bias, ASCII bytes (0x00..=0x7F) would compare as greater
        // than `0xC3 as i8` (-61) and every ASCII string would be rejected.
        let biased = _mm_xor_si128(chunk, sign_bit);
        let comparison = _mm_cmpgt_epi8(biased, limit);

        // Create a bitmask from the most significant bit of each byte in the result.
        // If any byte in `chunk` was > 0xC3, the corresponding byte in `comparison`
        // will be all 1s (0xFF), and its MSB will be 1.
        // `movemask` will be non-zero if any invalid byte was found.
        if _mm_movemask_epi8(comparison) != 0 {
            return false;
        }

        i += CHUNK_SIZE;
    }

    // Handle the remainder
    can_be_latin1_scalar(&bytes[i..])
}

/// Implementation using AVX2 intrinsics (32-byte vectors).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn can_be_latin1_avx2(bytes: &[u8]) -> bool {
    const CHUNK_SIZE: usize = 32;
    // Same sign-bit bias as the SSE2 version: we test `byte > 0xC3` with a
    // signed comparison after XORing both operands with 0x80.
    let sign_bit = _mm256_set1_epi8(0x80u8 as i8);
    let limit = _mm256_set1_epi8((0xC3u8 ^ 0x80) as i8);

    let mut i = 0;
    while i + CHUNK_SIZE <= bytes.len() {
        // Load 32 bytes of data.
        let chunk = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const _);

        // Bias, then perform a signed "greater than" comparison.
        let biased = _mm256_xor_si256(chunk, sign_bit);
        let comparison = _mm256_cmpgt_epi8(biased, limit);

        // Create a 32-bit mask. If any byte was > 0xC3, the mask will be non-zero.
        if _mm256_movemask_epi8(comparison) != 0 {
            return false;
        }

        i += CHUNK_SIZE;
    }

    // Handle the remainder (fewer than 32 bytes) with the SSE2 implementation,
    // which falls back to the scalar check for its own tail.
    can_be_latin1_sse2(&bytes[i..])
}
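
The 0xC4 threshold used above can be checked independently of any SIMD code. The following is a small sketch, not part of the PR: the function names are made up for illustration. It compares the byte-level rule against the character-level definition for every Unicode scalar value. Both predicates are "all elements pass" checks, so agreement on every single-character string implies agreement on every string.

```rust
/// Byte-level rule from the review comment: no UTF-8 byte may be >= 0xC4.
pub fn no_byte_at_or_above_c4(s: &str) -> bool {
    s.bytes().all(|b| b < 0xC4)
}

/// Character-level definition: every code point fits in one Latin-1 byte.
pub fn all_chars_fit_latin1(s: &str) -> bool {
    s.chars().all(|c| c as u32 <= 0xFF)
}

/// Exhaustively compares the two predicates on every single-character string.
pub fn rules_agree_for_all_chars() -> bool {
    (0u32..=0x10FFFF)
        // `char::from_u32` returns `None` for surrogates, which are skipped.
        .filter_map(char::from_u32)
        .all(|c| {
            let mut tmp = [0u8; 4];
            let s: &str = c.encode_utf8(&mut tmp);
            no_byte_at_or_above_c4(s) == all_chars_fit_latin1(s)
        })
}
```

A check like this is also a convenient oracle for testing the SSE2 and AVX2 paths against the scalar one.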

chaokunyang · 2025-09-11T14:27:34Z

rust/fory-core/src/buffer.rs

+            assert!(b <= 0xFF, "Non-Latin1 character found");
+            self.u8(b as u8);
+        }
+        s.chars().count()


This will parse and iterate the string twice, since s.chars() returns an iterator. We'd better use a counter to track the length while writing.
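
A minimal sketch of the single-pass version suggested here. `SketchWriter` is a hypothetical stand-in for the PR's buffer writer (only a `u8` method is modeled), and `latin1_string` is assumed to return the number of bytes written, as the hunk above suggests.

```rust
/// Hypothetical stand-in for the real buffer writer; only `u8` is modeled.
pub struct SketchWriter {
    pub buf: Vec<u8>,
}

impl SketchWriter {
    pub fn new() -> Self {
        SketchWriter { buf: Vec::new() }
    }

    pub fn u8(&mut self, b: u8) {
        self.buf.push(b);
    }

    /// Writes `s` as Latin-1 and returns the number of bytes written.
    /// The count is accumulated during the single pass over `s.chars()`,
    /// so the string is decoded only once.
    pub fn latin1_string(&mut self, s: &str) -> usize {
        let mut written = 0;
        for c in s.chars() {
            let code_point = c as u32;
            assert!(code_point <= 0xFF, "Non-Latin1 character found");
            self.u8(code_point as u8);
            written += 1;
        }
        written
    }
}
```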

chaokunyang · 2025-09-11T14:28:29Z

rust/fory-core/src/buffer.rs

+    }
+
+    pub fn utf16_string(&mut self, s: &str) -> usize {
+        let units: Vec<u16> = s.encode_utf16().collect();


The collect introduces an extra memory allocation. We'd better keep using the iterator and use a counter to count the written byte length.
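
A sketch of the allocation-free variant, under two assumptions that are not confirmed by the hunk above: the output is a plain `Vec<u8>` rather than the real writer, and code units are written little-endian. `write_utf16_le` is a name chosen for illustration.

```rust
/// Writes `s` as UTF-16 code units (little-endian, an assumption of this sketch)
/// and returns the number of bytes written.
/// `encode_utf16()` is consumed lazily, so no intermediate `Vec<u16>` is allocated.
pub fn write_utf16_le(out: &mut Vec<u8>, s: &str) -> usize {
    let mut written = 0;
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
        written += 2; // each UTF-16 code unit occupies two bytes
    }
    written
}
```

Characters outside the Basic Multilingual Plane yield two code units (a surrogate pair), so the returned byte count can exceed twice the number of characters.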

chaokunyang · 2025-09-11T14:40:25Z

rust/fory-core/src/serializer/string.rs

+        let mut buf = Writer::default();
+        let len;
+        let bitor;
+        if is_latin(self.as_str()) {


Maybe not in this PR, but we could implement a get_latin1_length that returns -1 if the string is not Latin-1. Then we can write the length and encoding first, and write the string byte by byte directly into the writer.
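
One way this could look, as a rough sketch only. It returns `Option<usize>` instead of -1, which is a substitution made here because it is more idiomatic in Rust. The header layout (one encoding byte plus a 4-byte little-endian length) and the `ENCODING_*` constants are invented for illustration and are not Fory's wire format.

```rust
/// Hypothetical encoding tags, used only by this sketch.
pub const ENCODING_LATIN1: u8 = 0;
pub const ENCODING_UTF8: u8 = 2;

/// Returns the Latin-1 byte length of `s`, or `None` if any character is
/// above U+00FF (the review comment proposes -1 for that case).
pub fn get_latin1_length(s: &str) -> Option<usize> {
    let mut len = 0;
    for c in s.chars() {
        if c as u32 > 0xFF {
            return None;
        }
        len += 1;
    }
    Some(len)
}

/// Writes a header first, then streams the payload straight into `out`,
/// so no temporary buffer is needed to learn the length.
pub fn write_string(out: &mut Vec<u8>, s: &str) {
    match get_latin1_length(s) {
        Some(len) => {
            out.push(ENCODING_LATIN1);
            out.extend_from_slice(&(len as u32).to_le_bytes());
            // Every char is <= U+00FF here, so the cast to u8 is lossless.
            out.extend(s.chars().map(|c| c as u8));
        }
        None => {
            out.push(ENCODING_UTF8);
            out.extend_from_slice(&(s.len() as u32).to_le_bytes());
            out.extend_from_slice(s.as_bytes());
        }
    }
}
```

The trade-off is a second pass over Latin-1 strings (one to measure, one to write) in exchange for dropping the temporary `Writer` seen in the hunk above.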

chaokunyang

LGTM

try to test with java

8ec69b4

urlyy requested review from chaokunyang and theweipeng as code owners September 7, 2025 20:14

Merge branch 'apache:main' into sync_with_java

af57746

urlyy marked this pull request as draft September 8, 2025 21:06

urlyy added 3 commits September 10, 2025 07:58

Merge branch 'apache:main' into sync_with_java

56631d0

resolve conflict

503b8c0

support sync string & list & set with java

b622848

urlyy marked this pull request as ready for review September 10, 2025 00:38

fix something

1c9d215

urlyy changed the title ~~feat(Rust): Try to test with java~~ feat(Rust): Support basic type to aligned with java Sep 10, 2025

urlyy changed the title ~~feat(Rust): Support basic type to aligned with java~~ feat(Rust): support basic type se/de aligned with java Sep 10, 2025

chaokunyang reviewed Sep 10, 2025

java/fory-core/src/main/java/org/apache/fory/resolver/XtypeResolver.java (outdated)

Merge branch 'apache:main' into sync_with_java

4ba180a

chaokunyang reviewed Sep 10, 2025

bugfix

eada6f4

chaokunyang mentioned this pull request Sep 11, 2025

feat(java/python): support enum xlang serialization #2603

Merged

2 tasks

chaokunyang reviewed Sep 11, 2025

chaokunyang and others added 3 commits September 11, 2025 17:11

Merge branch 'main' into sync_with_java

d9db32c

fix ci

44d7164

fix ci

3172d59

chaokunyang reviewed Sep 11, 2025

ci/tasks/rust.py (outdated)

Update ci/tasks/rust.py

9cd1585

chaokunyang reviewed Sep 11, 2025

ci/tasks/rust.py (outdated)

chaokunyang and others added 2 commits September 11, 2025 20:26

Update ci/tasks/rust.py

565705a

ci format fix

12b69a5

urlyy added 2 commits September 11, 2025 20:55

ci format fix

48f4127

code clean

88a1b8e

buffix is_latin

6d6ebfb

chaokunyang reviewed Sep 11, 2025

perf Buffer::write_string

f6bb38a

chaokunyang reviewed Sep 11, 2025

perf string header write

57f9fee

chaokunyang approved these changes Sep 12, 2025

chaokunyang merged commit 477b646 into apache:main Sep 12, 2025
59 checks passed
