journaling: track dependencies of elided entries#21576
Conversation
fdc5778 to
84d7288
Compare
In the journaling code, when we have journal entries we can elide, e.g. sames where the only change is the stack trace, we immediately confirm them and return no error. This is fine, because even if they don't get stored on the server, nothing of major value is lost, which is why we were eliding them in the first place. However if this happens, and a subsequent journal entry depends on the elided entry, this can cause the journal to become non-replayable. For example, we could have a "same" for the root stack, that fails to be saved for whatever reason. Later registering the stack outputs would be an operation of "RemoveNew" of the root stack entry, and replacing it with the new state. However since we never saw that "new" entry, the replayer doesn't know what to do with it and crashes. This situation can either happen because the original entry itself failed, or a batch where the entry was in had an issue. The proposed solution here is to still return quickly when we get an elided entry, but to keep track of them. If we have an entry depending on the elided one, we wait for the result from the service, and check if there was any error. If there was we also just error out, and let the engine handle the failure.
84d7288 to
cb4ba6f
Compare
04f5c6a to
8f8749c
Compare
|
Curious if the test that you added fails without these changes. I think that what you've identified is indeed a possibility in two cases:
In the first case the possibility for missing entries is clear: we wouldn't wait for an elided entry, so its dependents would be allowed to proceed before the elided entry has been persisted. In the second case, I think the failure to persist and the continuance in the face of failure is critical. Regardless of whether an entry is elided, the current scheme ensures that the entry will be queued for persistence ahead of any of its dependent entries. Therefore, we cannot have missing entries unless an elided entry fails to persist and a dependent entry is later sent. All in all, I think this change is good b/c it future proofs us against case (1). But I think it might be good to include an explanation (feel free to take mine) of the necessary conditions for hitting this with our current approach. |
|
(oh and if the test you've added doesn't fail without these changes, I'd love to have a test that does fail) |
Right, this is not something we need to worry about right now because of how batching works, but if we ever end up being in a non-batching mode, this can become a problem. But by chance, this PR also solves this.
Yup. This shouldn't happen in theory, the cloud should always be able to store the journal entries. But in the real world that's of course not the case.
It does fail locally with the changes to (with the revert)
Added an explanation of why we need these pending entries. |
To be merged after #21727 and #21764 made it in. Tentative changelog: ## (2026-02-10) ### Features - [cli] Show environment variables that were set if a snapshot integrity error happens [#21709](#21709) - [engine] Pass replacement trigger through to Construct [#21408](#21408) - [engine] Add EnvVarMappings resource option for provider resources, allowing environment variables to be remapped before being passed to the provider [#21572](#21572) - [pkg] BREAKING: Deprecate github.com/pulumi/pulumi/pkg/v3/codegen/dotnet in favor of github.com/pulumi/pulumi-dotnet/pulumi-language-dotnet/v3/codegen. This package will be removed from pulumi/pulumi soon! [#21720](#21720) ### Bug Fixes - [cli] Retry `yarn install` when it fails (e.g. during `pulumi install`) [#21707](#21707) - [engine] Deal with errors in elided journal entries correctly [#21576](#21576) - [sdk/python] Fix `_LazyModule` to not trigger full module load for introspection attributes [#21620](#21620) - [sdkgen/python] Remove workaround for slow typechecking with MyPy and PyCharm [#21722](#21722) ### Miscellaneous - [cli] Write logfile location if verbosity is >= 1 to stderr instead of stdout [#21663](#21663)
|
This PR has been shipped in release v3.220.0. |
In the journaling code, when we have journal entries we can elide, e.g. sames where the only change is the stack trace, we immediately confirm them and return no error. This is fine, because even if they don't get stored on the server, nothing of major value is lost, which is why we were eliding them in the first place.
However if this happens, and a subsequent journal entry depends on the elided entry, this can cause the journal to become non-replayable. For example, we could have a "same" for the root stack, that fails to be saved for whatever reason. Later registering the stack outputs would be an operation of "RemoveNew" of the root stack entry, and replacing it with the new state. However since we never saw that "new" entry, the replayer doesn't know what to do with it and crashes.
This situation can either happen because the original entry itself failed, or a batch where the entry was in had an issue.
The proposed solution here is to still return quickly when we get an elided entry, but to keep track of them. If we have an entry depending on the elided one, we wait for the result from the service, and check if there was any error. If there was we also just error out, and let the engine handle the failure.
Fixes https://github.com/pulumi/pulumi-service/issues/37150