Skip to content

Conversation

@urlyy
Copy link
Contributor

@urlyy urlyy commented Sep 7, 2025

What does this PR do?

  1. Updated the legacy read/write methods to align with Java.
  2. Started adding some register_by_name functionality; not finished yet, will continue in the next PR.
  3. Added a new test file to facilitate testing across different languages conveniently.
  4. Basic types, primitive arrays, string, list, set, and map are now aligned with Java. Handling of compress algorithm for String and header for map still needs improvement.

Related issues

#2539

TODOs:

  1. Enum derive.
  2. Complete register_by_name.
  3. Align with Java metashare, test on struct..
  4. Implement serialization for Box<dyn Any> (may be challenging).

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
# now can 
let reader = Reader::new(bytes.as_slice());
let fory = Fory::default().mode(Compatible).xlang(true);
let mut context = ReadContext::new(&fory, reader);
let a = fory.deserialize_with_context(&mut context);
let b = fory.deserialize_with_context(&mut context);
let c = fory.deserialize_with_context(&mut context);

And we should add scoped_meta_context in future.

  • Does this PR introduce any binary protocol compatibility change?

@urlyy urlyy marked this pull request as draft September 8, 2025 21:06
@urlyy urlyy marked this pull request as ready for review September 10, 2025 00:38
@urlyy urlyy changed the title feat(Rust): Try to test with java feat(Rust): Support basic type to aligned with java Sep 10, 2025
@urlyy urlyy changed the title feat(Rust): Support basic type to aligned with java feat(Rust): support basic type se/de aligned with java Sep 10, 2025

/** Tests in this class need fory python/rust installed. */
@Test
public class WIPCrossLanguageTest extends ForyTestBase {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename it to RustXlangTest, and skip it by default. We don't want every contributor isntall rust for developing java.

You can control whether enable it by a env veraible. And this test should only be run in rust ci. You can update rust ci by install fory java, and execute command cd java/fory-core && mvn test -Dtest=org.apache.fory.RustXlangTest


@BeforeClass
public void isPyforyInstalled() {
// TestUtils.verifyPyforyInstalled();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can configure a env variable such as FORY_RUST_JAVA_CI, and in isRustJavaCIEnabled method, we check the env variable and skip this whole RustXlangTest here if env variable not set to 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update rust ci in .github/workflows/ci.yaml to install fory java too and run test command mvn test -Dtest=org.apache.fory.RustXlangTest

private static final int RUST_TESTCASE_INDEX = 4;

@BeforeClass
public void isPyforyInstalled() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void isPyforyInstalled() {
public void isRustJavaCIEnabled() {

// TestUtils.verifyPyforyInstalled();
}

@Test(enabled = false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove enabled flag, use isRustJavaCIEnabled instead.

fn write(&self, context: &mut WriteContext) {
context.writer.var_int32(self.len() as i32);
context.writer.bytes(self.as_bytes());
let encoding = best_coder(self);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For write, we could keep using utf8 instead, this will minimize the string encoding cost, and other languages support decode utf8

chaokunyang added a commit that referenced this pull request Sep 11, 2025
## Why?

 support enum xlang serialization between java and python

## What does this PR do?

<!-- Describe the details of this PR. -->

## Related issues

#2602 
#2585 
## Does this PR introduce any user-facing change?

<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/fory/issues/new/choose) describing the
need to do so and update the document if necessary.

Delete section if not applicable.
-->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.

Delete section if not applicable.
-->
@chaokunyang
Copy link
Collaborator

chaokunyang commented Sep 11, 2025

@urlyy
String serialization in rust needs to support latin1 encoding. You can take following code as an example for simd latin1 check:

/// Checks if a UTF-8 string can be losslessly encoded as Latin-1 using SIMD if available.
///
/// This is true if all characters in the string have a Unicode codepoint <= 255.
/// This translates to a byte-level check: the string must not contain any byte >= 0xC4.
pub fn can_be_latin1(s: &str) -> bool {
    let bytes = s.as_bytes();

    // Runtime feature detection to select the best implementation.
    // The functions are guarded by `#[target_feature]`, so the compiler
    // generates optimized code for each case. The function pointers are resolved
    // at runtime on the first call.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { can_be_latin1_avx2(bytes) };
        }
        if is_x86_feature_detected!("sse2") {
            return unsafe { can_be_latin1_sse2(bytes) };
        }
    }

    // Fallback for non-x86_64 architectures or if no SIMD features are available.
    can_be_latin1_scalar(bytes)
}

/// Scalar fallback implementation. Checks byte by byte.
#[inline]
fn can_be_latin1_scalar(bytes: &[u8]) -> bool {
    // A simple iterator-based check is clean and often optimized well by the compiler.
    !bytes.iter().any(|&b| b >= 0xC4)
}

/// Implementation using SSE2 intrinsics (16-byte vectors).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn can_be_latin1_sse2(bytes: &[u8]) -> bool {
    const CHUNK_SIZE: usize = 16;
    let limit = _mm_set1_epi8(0xC3 as i8); // We check for values > 0xC3

    let mut i = 0;
    while i + CHUNK_SIZE <= bytes.len() {
        // Load 16 bytes of data. `loadu` handles unaligned memory.
        let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const _);

        // This is a common trick for unsigned comparison with signed-only intrinsics.
        // We want to check `byte >= 0xC4`. This is equivalent to `byte > 0xC3`.
        // The `_mm_cmpgt_epi8` instruction performs a signed comparison (a > b).
        // By adding -128 (or XORing with 0x80) to both operands, we can map the
        // unsigned range [0, 255] to the signed range [-128, 127] and perform a
        // valid comparison.
        // `byte > 0xC3` becomes `(byte - 128) > (0xC3 - 128)`.
        // `0xC3 - 128 = 195 - 128 = 67`. The `limit` vector holds `0xC3` because
        // we're comparing `chunk > limit`, and `_mm_cmpgt_epi8` works on signed i8.
        // The values from 0xC4 to 0xFF will correctly be "greater than" 0xC3.
        let comparison = _mm_cmpgt_epi8(chunk, limit);

        // Create a bitmask from the most significant bit of each byte in the result.
        // If any byte in `chunk` was > 0xC3, the corresponding byte in `comparison`
        // will be all 1s (0xFF), and its MSB will be 1.
        // `movemask` will be non-zero if any invalid byte was found.
        if _mm_movemask_epi8(comparison) != 0 {
            return false;
        }

        i += CHUNK_SIZE;
    }

    // Handle the remainder
    can_be_latin1_scalar(&bytes[i..])
}

/// Implementation using AVX2 intrinsics (32-byte vectors).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn can_be_latin1_avx2(bytes: &[u8]) -> bool {
    const CHUNK_SIZE: usize = 32;
    // We want to check for bytes >= 0xC4, which is equivalent to > 0xC3.
    // The vector is filled with 0xC3.
    let limit = _mm256_set1_epi8(0xC3 as i8);

    let mut i = 0;
    while i + CHUNK_SIZE <= bytes.len() {
        // Load 32 bytes of data.
        let chunk = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const _);
        
        // Perform a signed "greater than" comparison.
        let comparison = _mm256_cmpgt_epi8(chunk, limit);

        // Create a 32-bit mask. If any byte was > 0xC3, the mask will be non-zero.
        if _mm256_movemask_epi8(comparison) != 0 {
            return false;
        }

        i += CHUNK_SIZE;
    }

    // Handle the remainder using the SSE2 or scalar implementation
    // to avoid duplicating the tail-handling logic.
    can_be_latin1_sse2(&bytes[i..])
}

assert!(b <= 0xFF, "Non-Latin1 character found");
self.u8(b as u8);
}
s.chars().count()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis will parse anditerate the string twice, s.chars() return an iterator. we'd bettter use a counter to count the length

}

pub fn utf16_string(&mut self, s: &str) -> usize {
let units: Vec<u16> = s.encode_utf16().collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the collection will introduce extra memroy allocation, we'd better keep using iterator and use a coutenr teh count written bytes length

let mut buf = Writer::default();
let len;
let bitor;
if is_latin(self.as_str()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in this PR, we could implement a get_latin1_length, which return -1 if not latin1. And then write length and encoding first, then we can just write string byte by byte directly into the writer

Copy link
Collaborator

@chaokunyang chaokunyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chaokunyang chaokunyang merged commit 477b646 into apache:main Sep 12, 2025
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants