-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Make sure ObjectBuilder and ListBuilder to be finalized before its parent builder #7843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| inner_list.finish(); | ||
| outer_list.finish(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bug. Caught by the change. These builders were not called finish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome -- i also think it is a great example of how this change improves the variant builders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the test is still broken/useless, if it lacks any checking to verify the builder actually produced what it was asked to build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the test just wants to test field validation for object builders, one failure case, one success case. It doesn't care about what it actually builds.
I said "bug" doesn't mean that the test is useless but it lacks the finish calls making it looks not following the insert - finish pattern that is the what the issue proposed to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, it can achieve the effort to address the original issue without the worries.
it does seem like the empty drop helps this one usecase
I will try and write a test that shows what happens when a builder isn't finishd and we can evaluate the issues
| /// Finalize object | ||
| /// | ||
| /// This consumes self and writes the object to the parent buffer. | ||
| pub fn finish(mut self) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't actually remove finish as there are still examples where the builders are in same scope with top-level builder. So explicit call of finish is still necessary for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's said, and, for the case these builders are in the same scope as the top-level builder, after this PR, if such a builder is not called finish, it will be caught by the compiler now, e.g.,
error[E0505]: cannot move out of `builder` because it is borrowed
--> parquet-variant/src/builder.rs:1510:9
|
1469 | let mut builder = VariantBuilder::new().with_validate_unique_fields(true);
| ----------- binding `builder` declared here
...
1485 | let mut outer_list = builder.new_list();
| ------- borrow of `builder` occurs here
...
1510 | builder.finish();
| ^^^^^^^ move out of `builder` occurs here
1511 | }
| - borrow might be used here, when `outer_list` is dropped and runs the `Drop` code for type `ListBuilder`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this sounds like a great idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! The compiler is normally willing to end the lifetime of the orphaned outer_list just before the builder.finish call, but providing an impl Drop somehow force-extends the lifetime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler is normally willing to end the lifetime of the orphaned outer_list just before the builder.finish call
No, outer_list's lifetime normally will be ended (if no finish call) after the scope as usual. That's why the compiler complains because it borrows builder. So builder.finish call cannot be happened before the end of outer_list lifetime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That only happens if there's an impl Drop tho:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=fe049aba262d590948887ee08f43618c
Somehow, the compiler is able to shorten the lifetime of outer_list if impl Drop doesn't get in the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That only happens if there's an impl Drop tho:
Isn't what this PR does?
parquet-variant/src/builder.rs
Outdated
| /// | ||
| /// This doesn't consume self but just writes the object to the parent buffer. | ||
| fn _finish(&mut self) -> Result<(), ArrowError> { | ||
| assert!(!self.finished, "ObjectBuilder has already been finished"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prevents that dropping a finished builder causes re-finishing it.
|
cc @alamb |
|
👋 @viirya -- I am checking this one out |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| inner_list.finish(); | ||
| outer_list.finish(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome -- i also think it is a great example of how this change improves the variant builders
|
FYI @friendlymatthew and @scovich and @micoo227 who just added the validation code today! |
Add comments and examples to clarify panic behavior on drop
Thank you. I merged it. |
|
This change makes me nervous, after having been bitten in the past -- multiple times in multiple projects and multiple languages -- by unconditional finalizations like this that don't consider whether the the finalization is actually desirable. The most recent was a bad interaction between spark and hadoop, where the upload stream's unconditional auto-close behavior meant that early return due to an exception uploaded a partial object to cloud storage (= a torn write that caused havoc to readers who don't expect torn writes in cloud storage to be possible). The offending code was buried many levels deep, which made it a real treat to root-cause, and even more "fun" to workaround. Having any object attempt to "finish" in spite of a panic=unwind, or even a If the worry is that somebody might forget to invoke the |
|
Making sense. I forgot that this is not a good practice to put the finalization login into |
|
|
|
Hmm, actually, with an empty In other words, it can achieve the effort to address the original issue without the worries. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm neutral about this change. Not convinced it's actually useful in practice, but it also seems harmless enough so there's no reason to block it.
Against: Outside unit tests, recursive builders will almost always live in a nested scope vs. their parent -- usually a called function or method, or at least a match clause -- which would trivially defeat the drop sentinel.
For: The sentinel is helpful when it is able to catch an issue, and harmless the rest of the time. In the uncommon case that somebody actually wants to avoid calling finish, they can just drop the builder (which also documents the intent).
For: Like it or not, unit tests will end up being a very significant consumer of the variant API, so "quality of life" features that make unit tests easier to write and maintain are not necessarily a bad thing.
Against: Suppose a unit test does have a bug in forgetting to finish a builder. I'm very unclear on what was being tested, if the resulting variant wasn't checked in a way that would notice big swaths of missing data? Even if the drop sentinel catches it, the test is still arguably broken/useless due to incomplete verification of what it claims to test.
Against: A quick grep over the rest of arrow-rs suggests that none of the other crates' builders employ drop sentinels like this. The only matches for impl.* Drop are for buffers and FFI wrappers (to manually delete internal pointers).
| /// Finalize object | ||
| /// | ||
| /// This consumes self and writes the object to the parent buffer. | ||
| pub fn finish(mut self) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! The compiler is normally willing to end the lifetime of the orphaned outer_list just before the builder.finish call, but providing an impl Drop somehow force-extends the lifetime?
| inner_list.finish(); | ||
| outer_list.finish(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the test is still broken/useless, if it lacks any checking to verify the builder actually produced what it was asked to build?
|
So I personally prefer the automatic finalize on drop behavior -- I can understand how that behavior can cause trouble when external resources are affected but in this case I think it makes this code much easier to use for most cases. If we don't finalize the object on drop of the builder, for example, the resulting Variant object built by the builder seems to be malformed. Here is an example that would work with finalize on drop, but not with this PR as is #[test]
fn test_object_list_no_finish() {
let mut builder = VariantBuilder::new();
let mut list_builder = builder.new_list();
{
let mut object_builder = list_builder.new_object();
object_builder.insert("id", 1);
object_builder.finish().unwrap();
}
{
let mut object_builder = list_builder.new_object();
object_builder.insert("id", 2);
// NOTE object_builder.finish() is not called here
}
list_builder.finish();
let (metadata, value) = builder.finish();
let variant = Variant::try_new(&metadata, &value).unwrap();
let list = variant.as_list().unwrap();
}If object_builder.finish() is not called, then this line fails: let list = variant.as_list().unwrap();The error is like this (which is pretty unintuitive to me): |
|
So TLDR I still prefer the finalize on drop behavior |
I don't understand what you mean by defeat the drop sentinel. Since the Maybe I am missing something here and an example would help 🤔
My example above shows that not dropping the builder can lead to a corrupted output so I don't think we support the case of avoiding calling finish without some more work (to be clear I am not sure the "rollback a partially built object" s a valuable API yet)
I agree the tests could be improved and we should add a test showing what happens if you don't call
Another difference to the other builders is they don't have this model of nested nested builders -- so there is no "parent" state to finalize. Variant is somewhat different than the other arrow types because its nesting level camn be arbitary, which is what these nested builders allow) |
From what I understood, the example you gave above would trivially defeat an let mut builder = VariantBuilder::new();
let mut list_builder = builder.new_list();
{
let mut object_builder = list_builder.new_object();
object_builder.insert("id", 1);
// NOTE object_builder.finish() is not called here
}
list_builder.finish();
let (metadata, value) = builder.finish(); |
Doesn't that mean our current implementation is not even safe in the presence of early return by e.g. {
let mut object_builder = list_builder.new_object();
object_builder.insert("id", 2);
// NOTE object_builder.finish() is not called here
}imagine this: fn try_append(...) -> Result<(), ArrowError> {
let mut object_builder = list_builder.new_object();
object_builder.insert("id", 2);
// some fallible operation whose `?` could return early
object_builder.finish()?; // an err here also renders the parent unusable
Ok(())
}With the current behavior, the caller of
Honestly, now that the nested builders store their bytes in a separate vec, I would have expected rollback safety for free. But I guess we're still storing something in the parent vec. I would imagine that's easily addressed by having the constructor grab a marker, which No need to remove extra entries from the metadata dictionary, they're harmless. |
Relating to the |
Looking at the code, I think the bad behavior comes from the |
The idea of a logically lost object being hard to debug makes sense to me. Thank you -- I missed that Propsal
That way:
To be clear, it doesn't mean that @viirya needs to fix this in this PR (though if he does that would be great) |
|
Also, I filed a separate ticket for corrupted objects if build() is not called; |
|
NOTE: The sentinel |
|
Then maybe we can merge this now? |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this one in and continue on #7865
) # Which issue does this PR close? - Closes #7863 - Closes #7798 # Rationale for this change Reviews and testing on #7843 exposed the fact that creating a variant list or object builder has side effects that leave the parent in an inconsistent/invalid state if the child builder is never finalized. Rework the finalization logic to be more direct so that child builders have no effect on their parents before their `finish` method is called. # What changes are included in this PR? * Define a new `ParentState` enum that tracks the necessary information for a child to fully finalize its parent. * Remove the `pending` machinery from builders # Are these changes tested? Existing unit tests mostly cover this change. Added new tests to verify that failing to call `finish` leaves the parent unmodified. # Are there any user-facing changes? No.
Which issue does this PR close?
Dropimpl #7780.Rationale for this change
This minor patch adds some comments to ObjectBuilder and ListBuilder and fixes one existing test.
This patch makes sure ObjectBuilder and ListBuilder to be finalized before its parent builder is finalized. If
finishis forgotten to be called on them, the compiler will show error.What changes are included in this PR?
Are these changes tested?
Updated tests.
Are there any user-facing changes?
No