
journaling: track dependencies of elided entries #21576

Merged
tgummerer merged 3 commits into master from tg/wait-on-elided on Feb 6, 2026

Conversation

@tgummerer
Collaborator

In the journaling code, when we have journal entries we can elide, e.g. "same" entries where the only change is the stack trace, we immediately confirm them and return no error. This is fine: even if they don't get stored on the server, nothing of major value is lost, which is why we elide them in the first place.

However, if a subsequent journal entry depends on the elided entry and the elided entry fails to persist, the journal can become non-replayable. For example, we could have a "same" for the root stack that fails to be saved for whatever reason. Registering the stack outputs later would then be a "RemoveNew" operation on the root stack entry, replacing it with the new state. But since the replayer never saw that "new" entry, it doesn't know what to do with the "RemoveNew" and crashes.
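To make the failure mode concrete, here is a minimal, hypothetical replay sketch. The entry kinds, ids, and the `replay` function are illustrative only; the real journal types in pkg/backend/httpstate/journal differ.

```go
package main

import "fmt"

// Hypothetical entry kinds, loosely modeled on the scenario above; the
// real journal types in pkg/backend/httpstate/journal differ.
type entryKind int

const (
	kindNew entryKind = iota
	kindRemoveNew
)

type entry struct {
	kind entryKind
	id   string // stands in for the resource URN in the real journal
}

// replay applies entries to an in-memory state map. A RemoveNew that
// references an id the replayer has never seen has no defined meaning.
func replay(entries []entry) error {
	state := map[string]bool{}
	for _, e := range entries {
		switch e.kind {
		case kindNew:
			state[e.id] = true
		case kindRemoveNew:
			if !state[e.id] {
				return fmt.Errorf("RemoveNew for unknown entry %q: journal is not replayable", e.id)
			}
			delete(state, e.id)
		}
	}
	return nil
}

func main() {
	// The elided "same" for the root stack failed to persist, so the
	// replayer only ever sees the dependent RemoveNew and bails out.
	fmt.Println(replay([]entry{{kind: kindRemoveNew, id: "root-stack"}}))
}
```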

This situation can happen either because persisting the original entry itself failed, or because a batch the entry was part of had an issue.

The proposed solution is to still return quickly when we get an elided entry, but to keep track of it. If a later entry depends on an elided one, we wait for the service's result for that elided entry and check whether it errored. If it did, we error out as well and let the engine handle the failure.
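A minimal sketch of that idea, with illustrative names rather than the actual journal API (`journaler`, `recordElided`, and `recordDependent` are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// journaler is a sketch of the "track pending elided entries" idea; the
// names are illustrative, not the actual API in the journaling package.
type journaler struct {
	mu      sync.Mutex
	pending map[string]chan error // elided entries whose persistence is still in flight
}

// recordElided confirms the entry immediately (returns no error) but
// remembers that its persistence has not completed yet.
func (j *journaler) recordElided(id string, persist func() error) {
	done := make(chan error, 1)
	j.mu.Lock()
	j.pending[id] = done
	j.mu.Unlock()
	go func() { done <- persist() }()
}

// recordDependent waits for any elided entry it depends on and surfaces
// the persistence error, letting the engine handle the failure.
func (j *journaler) recordDependent(dependsOn string) error {
	j.mu.Lock()
	done, ok := j.pending[dependsOn]
	j.mu.Unlock()
	if !ok {
		return nil // the dependency was not elided; nothing to wait for
	}
	if err := <-done; err != nil {
		return fmt.Errorf("elided dependency %q failed to persist: %w", dependsOn, err)
	}
	return nil
}

func main() {
	j := &journaler{pending: map[string]chan error{}}
	j.recordElided("root-stack", func() error { return fmt.Errorf("service unavailable") })
	fmt.Println(j.recordDependent("root-stack"))
}
```

A real implementation also needs to handle several dependents waiting on the same elided entry; the one-shot channel here is only for illustration.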

Fixes https://github.com/pulumi/pulumi-service/issues/37150

@tgummerer tgummerer requested a review from a team as a code owner January 28, 2026 09:32
@tgummerer tgummerer force-pushed the tg/wait-on-elided branch 3 times, most recently from fdc5778 to 84d7288 on January 29, 2026 10:15
Comment thread on pkg/backend/httpstate/journal/snapshot.go (outdated)
@pgavlin
Member

pgavlin commented Feb 6, 2026

Curious if the test that you added fails without these changes.

I think that what you've identified is indeed a possibility in two cases:

  1. if we're sending entries concurrently, or
  2. if a failure persisting a batch of journal entries does not terminate the journaler with the current sequentially-batched approach

In the first case the possibility for missing entries is clear: we wouldn't wait for an elided entry, so its dependents would be allowed to proceed before the elided entry has been persisted.

In the second case, I think the failure to persist and the continuation in the face of that failure are both critical. Regardless of whether an entry is elided, the current scheme ensures that the entry will be queued for persistence ahead of any of its dependent entries. Therefore, we cannot have missing entries unless an elided entry fails to persist and a dependent entry is later sent.
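A minimal sketch of that guarantee (illustrative only, not the actual journaler loop): with strictly sequential persistence that terminates on failure, a dependent entry can never be sent after its dependency failed to persist.

```go
package main

import "fmt"

// runJournaler persists entries strictly in submission order and stops
// on the first failure. Terminating on failure is what rules out case
// (2): once an entry fails, none of its dependents are ever sent.
func runJournaler(entries []string, persist func(string) error) error {
	for _, e := range entries {
		if err := persist(e); err != nil {
			return fmt.Errorf("persisting %q: %w", e, err)
		}
	}
	return nil
}

func main() {
	err := runJournaler(
		[]string{`same(root-stack)`, `removeNew(root-stack)`},
		func(e string) error {
			if e == `same(root-stack)` {
				return fmt.Errorf("service unavailable")
			}
			return nil
		},
	)
	fmt.Println(err) // the dependent removeNew is never sent
}
```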

All in all, I think this change is good because it future-proofs us against case (1). But I think it might be good to include an explanation (feel free to take mine) of the necessary conditions for hitting this with our current approach.

@pgavlin
Member

pgavlin commented Feb 6, 2026

(oh and if the test you've added doesn't fail without these changes, I'd love to have a test that does fail)

@tgummerer
Collaborator Author

> In the first case the possibility for missing entries is clear: we wouldn't wait for an elided entry, so its dependents would be allowed to proceed before the elided entry has been persisted.

Right, this is not something we need to worry about right now because of how batching works, but if we ever end up in a non-batching mode, it can become a problem. Conveniently, this PR solves that case as well.

> if a failure persisting a batch of journal entries does not terminate the journaler with the current sequentially-batched approach

Yup. In theory this shouldn't happen, since the cloud should always be able to store the journal entries, but in the real world that's of course not always the case.

> Curious if the test that you added fails without these changes.

It does fail locally with the changes to journal/snapshot.go reverted. Even if there was an error, the specific error message doesn't exist without these changes, so the test would always fail:

```
$ go test -run TestDependingOnElidedEntries\$ .
--- FAIL: TestDependingOnElidedEntries (0.00s)
    snapshot_test.go:590:
        	Error Trace:	/home/tgummerer/work/pulumi/wait-on-elided/pkg/backend/httpstate/journal/snapshot_test.go:590
        	Error:      	An error is expected but got nil.
        	Test:       	TestDependingOnElidedEntries
FAIL
FAIL	github.com/pulumi/pulumi/pkg/v3/backend/httpstate/journal	0.045s
FAIL
```

(with the revert)

> But I think it might be good to include an explanation (feel free to take mine) of the necessary conditions for hitting this with our current approach.

Added an explanation of why we need these pending entries.

@tgummerer tgummerer enabled auto-merge February 6, 2026 11:36
@tgummerer tgummerer added this pull request to the merge queue Feb 6, 2026
Merged via the queue into master with commit 8fd6fd0 Feb 6, 2026
212 of 215 checks passed
@tgummerer tgummerer deleted the tg/wait-on-elided branch February 6, 2026 13:58
@tgummerer tgummerer mentioned this pull request Feb 10, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Feb 10, 2026
To be merged after #21727 and #21764 have made it in.

Tentative changelog:

##  (2026-02-10)


### Features

- [cli] Show environment variables that were set if a snapshot integrity error happens
  [#21709](#21709)

- [engine] Pass replacement trigger through to Construct
  [#21408](#21408)

- [engine] Add EnvVarMappings resource option for provider resources, allowing environment variables to be remapped before being passed to the provider
  [#21572](#21572)

- [pkg] BREAKING: Deprecate github.com/pulumi/pulumi/pkg/v3/codegen/dotnet in favor of github.com/pulumi/pulumi-dotnet/pulumi-language-dotnet/v3/codegen. This package will be removed from pulumi/pulumi soon!
  [#21720](#21720)


### Bug Fixes

- [cli] Retry `yarn install` when it fails (e.g. during `pulumi install`)
  [#21707](#21707)

- [engine] Deal with errors in elided journal entries correctly
  [#21576](#21576)

- [sdk/python] Fix `_LazyModule` to not trigger full module load for introspection attributes
  [#21620](#21620)

- [sdkgen/python] Remove workaround for slow typechecking with MyPy and PyCharm
  [#21722](#21722)


### Miscellaneous

- [cli] Write logfile location if verbosity is >= 1 to stderr instead of stdout
  [#21663](#21663)
@pulumi-bot
Contributor

This PR has been shipped in release v3.220.0.
