-
Notifications
You must be signed in to change notification settings - Fork 620
Fix data corruption during compression of large chunks #11525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Web viewer built successfully.
Note: This comment is updated whenever you push a commit. |
| } | ||
| crate::codec::Compression::LZ4 => re_protos::log_msg::v1alpha1::Compression::Lz4 as i32, | ||
| }, | ||
| uncompressed_size: payload.uncompressed_size as i32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original bug: Payload::uncompressed_size was a usize that could definitely grow beyond i32::MAX, so that silenced cast leads to disaster.
9393571 to
d84b0f2
Compare
|
@rerun-bot full-check |
|
Started a full build: https://github.com/rerun-io/rerun/actions/runs/18493924895 |
| Compression compression = 2; | ||
|
|
||
| int32 uncompressed_size = 3; | ||
| uint64 uncompressed_size = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a breaking change? If so, can we make it a non-breaking change by adding a new field instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, see PR description
|
|
||
| // Being able to log fast isn't particularly useful if the data happens to be corrupt at the | ||
| // other end, so make sure we can encode/decode everything that was logged. | ||
| if check && let Some(storage) = storage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this run on CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't believe any of this stuff runs on CI. Maybe I'll have a look in a follow-up.
|
merge & pray |
There are 3 problems involved:
1. `ArrowMsg::uncompressed_size` is defined as an `int32`. That issue
speaks for itself.
2. The encoding path tracks payload sizes using `usize`, and then
silently cast them to `int32` when encoding the Protobuf data, leading
to gibberish and crashes during decoding.
3. `lz4_flex` has similar casting bugs: it can encode 64bit values but
cannot decode them. Once again the `as` keyword is involved.
This PR fixes 1 and 2. All the changes are backward and forward
compatible because:
* In protobuf, `uint64` is a strict superset of `int32`: both are
unsigned varints on the wire.
* There exist no valid RRD files out there with negative values for
`uncompressed_size`.
We can already log much bigger chunks than before, but not too big,
otherwise `lz4` itself will during decoding:
```
thread 'decompress' panicked at 'attempt to add with overflow'
lz4_flex-0.11.5/src/block/decompress_safe.rs:39
stack backtrace:
6: core::panicking::panic_fmt
at core/src/panicking.rs:75:14
7: core::panicking::panic_const::panic_const_add_overflow
at core/src/panicking.rs:175:17
8: std::sys::backtrace::__rust_begin_short_backtrace
9: core::ops::function::FnOnce::call_once{{vtable.shim}}
10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at alloc/src/boxed.rs:1966:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at alloc/src/boxed.rs:1966:9
std::sys::pal::unix::thread::Thread::new::thread_start
at std/src/sys/pal/unix/thread.rs:97:17
11: <unknown>
12: <unknown>
```
I'll have a follow-up PR for the `lz4_flex` issue.
---
* Half of fixing #11516
* The other half requires patching `lz4_flex` (on it)
---
Checks:
* [x] `rerun` with `--nb_scans 2000`
* [x] `rerun rrd stats` with `--nb_scans 2000`
* [x] ⛔ `rerun` with `--nb_scans 3000` -> bug in lz4
* [x] ⛔ `rerun rrd stats` with `--nb_scans 3000` -> bug in lz4
* [x] All dataplatform test suites pass
* [x] Can load old RRDs no problem
This is a follow up to #11525. It fixes the remaining half of #11516. We don't want to merge it as is yet. Let's see if we can get the fix merged upstream first, so we don't have to depend on yet another fork: * PSeitz/lz4_flex#192 --- * Fixes #11516 --- Checks: * [x] `rerun` with `--nb_scans 3000` now works * [x] `rerun rrd stats` with `--nb_scans 3000` now works
This is a follow up to #11525. It fixes the remaining half of #11516. We don't want to merge it as is yet. Let's see if we can get the fix merged upstream first, so we don't have to depend on yet another fork: * PSeitz/lz4_flex#192 --- * Fixes #11516 --- Checks: * [x] `rerun` with `--nb_scans 3000` now works * [x] `rerun rrd stats` with `--nb_scans 3000` now works
There are 3 problems involved:
ArrowMsg::uncompressed_sizeis defined as anint32. That issue speaks for itself.usize, and then silently cast them toint32when encoding the Protobuf data, leading to gibberish and crashes during decoding.lz4_flexhas similar casting bugs: it can encode 64bit values but cannot decode them. Once again theaskeyword is involved.This PR fixes 1 and 2. All the changes are backward and forward compatible because:
uint64is a strict superset ofint32: both are unsigned varints on the wire.uncompressed_size.We can already log much bigger chunks than before, but not too big, otherwise
lz4itself will during decoding:I'll have a follow-up PR for the
lz4_flexissue.capacity overflow#11516lz4_flex(on it)Checks:
rerunwith--nb_scans 2000rerun rrd statswith--nb_scans 2000rerunwith--nb_scans 3000-> bug in lz4rerun rrd statswith--nb_scans 3000-> bug in lz4