[ruff] use bitcode instead of bincode #23544
Conversation
Typing conformance results: No changes detected ✅

Current numbers: The percentage of diagnostics emitted that were expected errors held steady at 85.05%. The percentage of expected errors that received a diagnostic held steady at 78.05%. The number of fully passing files held steady at 63/132.

Memory usage report: Memory usage unchanged ✅
ntBre
left a comment
Thanks for looking into this! I'll be curious to get Micha's thoughts next week.
I just had a couple of questions on the code itself, probably from my unfamiliarity with bitcode.
crates/ruff/src/cache.rs
Outdated
}

#[derive(Debug, bitcode::Encode, bitcode::Decode)]
struct SerializedFileCache {
Encode and Decode seem to be implemented for all of these fields. Do we need the into_runtime and from_runtime methods for these?
from_runtime loads the atomic and puts the plain u64 into the serializable struct; into_runtime, when reading from disk, wraps the u64 back into AtomicU64::new(...).
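That conversion can be sketched with std types alone (the struct and function names here are illustrative, not the ones in this PR):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Runtime form: shared and mutated concurrently while linting.
struct RuntimeEntry {
    last_seen: AtomicU64,
}

// Serialized form: a plain integer that derive-based encoders handle directly.
struct SerializedEntry {
    last_seen: u64,
}

fn to_serialized(e: &RuntimeEntry) -> SerializedEntry {
    // `load` snapshots the atomic into a plain u64 for encoding.
    SerializedEntry {
        last_seen: e.last_seen.load(Ordering::Relaxed),
    }
}

fn into_runtime(e: SerializedEntry) -> RuntimeEntry {
    // When reading from disk, wrap the u64 back into an atomic.
    RuntimeEntry {
        last_seen: AtomicU64::new(e.last_seen),
    }
}

fn main() {
    let runtime = RuntimeEntry { last_seen: AtomicU64::new(42) };
    let on_disk = to_serialized(&runtime);
    assert_eq!(on_disk.last_seen, 42);
    let restored = into_runtime(on_disk);
    assert_eq!(restored.last_seen.load(Ordering::Relaxed), 42);
    println!("round trip ok");
}
```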
Sorry, yes: I saw in the bitcode docs that Decode and Encode are implemented for AtomicU64. I see now that that requires target_has_atomic=64, though.
https://docs.rs/bitcode/latest/bitcode/trait.Decode.html#impl-Decode%3C'a%3E-for-AtomicU64
crates/ruff/src/cache.rs
Outdated
#[derive(Debug, bitcode::Encode, bitcode::Decode)]
struct SerializedPackageCache {
    package_root: String,
    files: Vec<(String, SerializedFileCache)>,
Would #[bitcode(with_serde)] help to avoid this intermediate struct here? It seems a bit unfortunate to have to go through to_string_lossy for these paths. I think if the conversion ends up being lossy, we might as well not even cache those files.
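For context on why to_string_lossy is a concern here: on Unix a path may contain bytes that are not valid UTF-8, and to_string_lossy replaces them with U+FFFD, so the round trip cannot restore the original path. A minimal Unix-only demonstration:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

fn main() {
    // b"fo\xff" is not valid UTF-8, but is a legal Unix file name.
    let raw: &[u8] = b"fo\xff";
    let path = Path::new(OsStr::from_bytes(raw));

    // to_string_lossy substitutes the invalid byte with U+FFFD ...
    let lossy = path.to_string_lossy();
    assert_eq!(lossy, "fo\u{FFFD}");

    // ... so converting back does not recover the original bytes.
    assert_ne!(Path::new(lossy.as_ref()).as_os_str().as_bytes(), raw);
    println!("lossy round trip changed the path");
}
```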
diff --git a/crates/ruff/Cargo.toml b/crates/ruff/Cargo.toml
index f279b40756..0d85f4c0ed 100644
--- a/crates/ruff/Cargo.toml
+++ b/crates/ruff/Cargo.toml
@@ -39,7 +39,7 @@ ruff_workspace = { workspace = true }
anyhow = { workspace = true }
argfile = { workspace = true }
-bitcode = { workspace = true }
+bitcode = { workspace = true, features = ["serde"] }
bitflags = { workspace = true }
cachedir = { workspace = true }
clap = { workspace = true, features = ["derive", "env", "wrap_help"] }
diff --git a/crates/ruff/src/cache.rs b/crates/ruff/src/cache.rs
index 1a29789732..3830db4e10 100644
--- a/crates/ruff/src/cache.rs
+++ b/crates/ruff/src/cache.rs
@@ -109,7 +109,7 @@ impl Cache {
}
};
- let package: SerializedPackageCache = match bitcode::decode(&serialized) {
+ let mut package: PackageCache = match bitcode::deserialize(&serialized) {
Ok(package) => package,
Err(err) => {
warn_user!("Failed parse cache file `{}`: {err}", path.display());
@@ -117,8 +117,6 @@ impl Cache {
}
};
- let mut package = package.into_runtime();
-
// Sanity check.
if package.package_root != package_root {
warn_user!(
@@ -169,7 +167,7 @@ impl Cache {
// Serialize to in-memory buffer because hyperfine benchmark showed that it's faster than
// using a `BufWriter` and our cache files are small enough that streaming isn't necessary.
- let serialized = bitcode::encode(&SerializedPackageCache::from_runtime(&self.package));
+ let serialized = bitcode::serialize(&self.package).context("Failed to serialize cache")?;
temp_file
.write_all(&serialized)
.context("Failed to write serialized cache to temporary file.")?;
@@ -298,69 +296,8 @@ impl Cache {
}
}
-/// On disk representation of a cache of a package.
-#[derive(Debug, bitcode::Encode, bitcode::Decode)]
-struct SerializedPackageCache {
- package_root: String,
- files: Vec<(String, SerializedFileCache)>,
-}
-
-impl SerializedPackageCache {
- fn from_runtime(cache: &PackageCache) -> Self {
- Self {
- package_root: cache.package_root.to_string_lossy().into_owned(),
- files: cache
- .files
- .iter()
- .map(|(path, file)| {
- (
- path.to_string_lossy().into_owned(),
- SerializedFileCache::from_runtime(file),
- )
- })
- .collect(),
- }
- }
-
- fn into_runtime(self) -> PackageCache {
- PackageCache {
- package_root: PathBuf::from(self.package_root),
- files: self
- .files
- .into_iter()
- .map(|(path, file)| (PathBuf::from(path), file.into_runtime()))
- .collect(),
- }
- }
-}
-
-#[derive(Debug, bitcode::Encode, bitcode::Decode)]
-struct SerializedFileCache {
- key: u64,
- last_seen: u64,
- data: FileCacheData,
-}
-
-impl SerializedFileCache {
- fn from_runtime(cache: &FileCache) -> Self {
- Self {
- key: cache.key,
- last_seen: cache.last_seen.load(Ordering::Relaxed),
- data: cache.data.clone(),
- }
- }
-
- fn into_runtime(self) -> FileCache {
- FileCache {
- key: self.key,
- last_seen: AtomicU64::new(self.last_seen),
- data: self.data,
- }
- }
-}
-
/// Runtime representation of a cache of a package.
-#[derive(Debug)]
+#[derive(Debug, serde::Serialize, serde::Deserialize)]
struct PackageCache {
/// Path to the root of the package.
///
@@ -372,7 +309,7 @@ struct PackageCache {
}
/// Runtime representation of the cache per source file.
-#[derive(Debug)]
+#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub(crate) struct FileCache {
/// Key that determines if the cached item is still valid.
key: u64,
@@ -392,7 +329,11 @@ impl FileCache {
}
}
-#[derive(Debug, Default, Clone, bitcode::Encode, bitcode::Decode)]
+#[derive(
+ Debug, Default, Clone,
+ bitcode::Encode, bitcode::Decode,
+ serde::Serialize, serde::Deserialize,
+)]
struct FileCacheData {
linted: bool,
    formatted: bool,

I think we could do something like this
Do we have a benchmark? The bitcode docs say the serde integration "is slower, produces slightly larger output", but don't quantify how much, so I'm curious how much that would affect caching performance.
Would #[bitcode(with_serde)] help to avoid this intermediate struct here? It seems a bit unfortunate to have to go through to_string_lossy for these paths. I think if the conversion ends up being lossy, we might as well not even cache those files.
diff --git a/crates/ruff/src/cache.rs b/crates/ruff/src/cache.rs
index 1a29789732..0fa283d4bc 100644
--- a/crates/ruff/src/cache.rs
+++ b/crates/ruff/src/cache.rs
@@ -1,4 +1,5 @@
use std::fmt::Debug;
+use std::ffi::OsString;
use std::fs;
use std::hash::Hasher;
use std::io::{self, Write};
@@ -109,7 +110,9 @@ impl Cache {
}
};
- let package: SerializedPackageCache = match bitcode::decode(&serialized) {
+ let mut package: PackageCache = match bitcode::decode::<SerializedPackageCache>(&serialized)
+ .map(SerializedPackageCache::into_runtime)
+ {
Ok(package) => package,
Err(err) => {
warn_user!("Failed parse cache file `{}`: {err}", path.display());
@@ -117,8 +120,6 @@ impl Cache {
}
};
- let mut package = package.into_runtime();
-
// Sanity check.
if package.package_root != package_root {
warn_user!(
@@ -298,37 +299,77 @@ impl Cache {
}
}
+/// On-disk representation of a path, preserving non-UTF-8 data.
+#[derive(Debug, Clone, bitcode::Encode, bitcode::Decode)]
+struct SerializedPath {
+ #[cfg(unix)]
+ components: Vec<u8>,
+ #[cfg(windows)]
+ components: Vec<u16>,
+}
+
+impl SerializedPath {
+ fn from_path(path: &Path) -> Self {
+ #[cfg(unix)]
+ {
+ use std::os::unix::ffi::OsStrExt;
+
+ Self {
+ components: path.as_os_str().as_bytes().to_vec(),
+ }
+ }
+
+ #[cfg(windows)]
+ {
+ use std::os::windows::ffi::OsStrExt;
+
+ Self {
+ components: path.as_os_str().encode_wide().collect(),
+ }
+ }
+ }
+
+ fn into_path_buf(self) -> PathBuf {
+ #[cfg(unix)]
+ {
+ use std::os::unix::ffi::OsStringExt;
+ PathBuf::from(OsString::from_vec(self.components))
+ }
+
+ #[cfg(windows)]
+ {
+ use std::os::windows::ffi::OsStringExt;
+ PathBuf::from(OsString::from_wide(&self.components))
+ }
+ }
+}
+
/// On disk representation of a cache of a package.
#[derive(Debug, bitcode::Encode, bitcode::Decode)]
struct SerializedPackageCache {
- package_root: String,
- files: Vec<(String, SerializedFileCache)>,
+ package_root: SerializedPath,
+ files: Vec<(SerializedPath, SerializedFileCache)>,
}
impl SerializedPackageCache {
fn from_runtime(cache: &PackageCache) -> Self {
Self {
- package_root: cache.package_root.to_string_lossy().into_owned(),
+ package_root: SerializedPath::from_path(&cache.package_root),
files: cache
.files
.iter()
- .map(|(path, file)| {
- (
- path.to_string_lossy().into_owned(),
- SerializedFileCache::from_runtime(file),
- )
- })
+ .map(|(path, file)| (SerializedPath::from_path(path), SerializedFileCache::from_runtime(file)))
.collect(),
}
}
fn into_runtime(self) -> PackageCache {
PackageCache {
- package_root: PathBuf::from(self.package_root),
+ package_root: self.package_root.into_path_buf(),
files: self
.files
.into_iter()
- .map(|(path, file)| (PathBuf::from(path), file.into_runtime()))
+ .map(|(path, file)| (path.into_path_buf(), file.into_runtime()))
.collect(),
}
}

we could use something like this instead of .to_string_lossy()
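The SerializedPath idea above, reduced to a minimal Unix-only round trip (helper names here are illustrative, not the ones in the diff): serializing the path as its raw bytes via OsStrExt preserves non-UTF-8 data that to_string_lossy would destroy.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::{OsStrExt, OsStringExt};
use std::path::{Path, PathBuf};

// Serialize a path as raw bytes (Unix only; on Windows the analogue
// is encode_wide() into a Vec<u16>, as in the diff above).
fn path_to_bytes(path: &Path) -> Vec<u8> {
    path.as_os_str().as_bytes().to_vec()
}

fn bytes_to_path(bytes: Vec<u8>) -> PathBuf {
    PathBuf::from(OsString::from_vec(bytes))
}

fn main() {
    // A path containing a non-UTF-8 byte survives the round trip unchanged.
    let original = PathBuf::from(OsString::from_vec(b"/tmp/f\xff".to_vec()));
    let restored = bytes_to_path(path_to_bytes(&original));
    assert_eq!(original, restored);
    println!("lossless round trip ok");
}
```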
Is bench-current the bitcode version using serde?
A little later, I'll try the option with rkyv.
The performance of bitcode with serde sounds promising. Do you want to update this PR to that version?
rkyv might be a bit more involved and I suggest we open a separate PR for it. We can then decide which of the two PRs we want to merge. We can also decide to leave it at this and merge the bitcode PR now.
Just in case, I'll try to do a few more benchmarks locally and get back to you with the results later.
Does this affect compatibility with existing caches generated by bincode, or do they use the same on-disk format?
We don't try to reuse caches across ruff versions from what I remember, so there's no problem with breaking the cache format between releases, even patch ones. If you have a long-lived .ruff_cache directory, it should have versioned sub-directories.
The benchmark was performed on the repository https://github.com/apache/airflow.
Cache size:
https://gist.github.com/chirizxc/aa7eefd17b507575c1bff97c94e2bebf
Thanks for the detailed benchmarks. Does the rkyv benchmark use its zero-copy deserialization?
In this benchmark, I did not. If I understand correctly, to use zero-copy we would have to slightly change the cache design itself. Currently, the cache is designed around data ownership and mutation.
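To illustrate the distinction with std types only (this is not rkyv's actual API): an owned deserialization copies data out of the buffer, while a zero-copy view borrows from it, which is why zero-copy would require the cache structs to hold borrowed, immutable data tied to the buffer's lifetime.

```rust
// Owned "deserialization": allocates a new String; the buffer can be dropped
// afterwards and the result can be mutated freely.
fn decode_owned(buf: &[u8]) -> String {
    String::from_utf8(buf.to_vec()).expect("valid UTF-8")
}

// Zero-copy "view": borrows directly from the buffer, so the result is
// tied to the buffer's lifetime and is immutable.
fn decode_borrowed(buf: &[u8]) -> &str {
    std::str::from_utf8(buf).expect("valid UTF-8")
}

fn main() {
    let buf = b"cached entry".to_vec();
    let owned = decode_owned(&buf);
    let borrowed = decode_borrowed(&buf);
    assert_eq!(owned, borrowed);
    // The borrowed view points into `buf` itself: no bytes were copied.
    assert_eq!(borrowed.as_ptr(), buf.as_ptr());
    println!("zero-copy view ok");
}
```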
MichaReiser
left a comment
Thanks for the detailed analysis. This looks good to me. I only have a small comment and this should then be ready to merge.
Did you happen to notice that …
No, I was not aware of this, and I agree with the sentiment.




Summary
See: #22284
This repository contains benchmarks, and bitcode shows good results in them.
Test Plan