
journaling: track dependencies of elided entries #21576

Merged
tgummerer merged 3 commits into master from tg/wait-on-elided on Feb 6, 2026

Conversation

@tgummerer
Collaborator

In the journaling code, when we have journal entries we can elide, e.g. "same" entries where the only change is the stack trace, we immediately confirm them and return no error. This is fine: even if they don't get stored on the server, nothing of major value is lost, which is why we elide them in the first place.

However, if a subsequent journal entry depends on the elided entry and the elided entry fails to persist, the journal can become non-replayable. For example, we could have a "same" for the root stack that fails to be saved for whatever reason. Registering the stack outputs later would then be a "RemoveNew" operation on the root stack entry, replacing it with the new state. But since the replayer never saw that "new" entry, it doesn't know what to do with the "RemoveNew" and crashes.
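To make the failure mode concrete, here is a minimal, hypothetical replay sketch. The entry kinds, ids, and the `replay` function are illustrative only; the real journal types in pkg/backend/httpstate/journal differ.

```go
package main

import "fmt"

// Hypothetical entry kinds, loosely modeled on the scenario above; the
// real journal types in pkg/backend/httpstate/journal differ.
type entryKind int

const (
	kindNew entryKind = iota
	kindRemoveNew
)

type entry struct {
	kind entryKind
	id   string // stands in for the resource URN in the real journal
}

// replay applies entries to an in-memory state map. A RemoveNew that
// references an id the replayer has never seen has no defined meaning.
func replay(entries []entry) error {
	state := map[string]bool{}
	for _, e := range entries {
		switch e.kind {
		case kindNew:
			state[e.id] = true
		case kindRemoveNew:
			if !state[e.id] {
				return fmt.Errorf("RemoveNew for unknown entry %q: journal is not replayable", e.id)
			}
			delete(state, e.id)
		}
	}
	return nil
}

func main() {
	// The elided "same" for the root stack failed to persist, so the
	// replayer only ever sees the dependent RemoveNew and bails out.
	fmt.Println(replay([]entry{{kind: kindRemoveNew, id: "root-stack"}}))
}
```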

This situation can happen either because persisting the original entry itself failed, or because a batch the entry was part of had an issue.

The proposed solution is to still return quickly when we get an elided entry, but to keep track of it. If a later entry depends on an elided one, we wait for the service's result for that elided entry and check whether it errored. If it did, we error out as well and let the engine handle the failure.
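A minimal sketch of that idea, with illustrative names rather than the actual journal API (`journaler`, `recordElided`, and `recordDependent` are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// journaler is a sketch of the "track pending elided entries" idea; the
// names are illustrative, not the actual API in the journaling package.
type journaler struct {
	mu      sync.Mutex
	pending map[string]chan error // elided entries whose persistence is still in flight
}

// recordElided confirms the entry immediately (returns no error) but
// remembers that its persistence has not completed yet.
func (j *journaler) recordElided(id string, persist func() error) {
	done := make(chan error, 1)
	j.mu.Lock()
	j.pending[id] = done
	j.mu.Unlock()
	go func() { done <- persist() }()
}

// recordDependent waits for any elided entry it depends on and surfaces
// the persistence error, letting the engine handle the failure.
func (j *journaler) recordDependent(dependsOn string) error {
	j.mu.Lock()
	done, ok := j.pending[dependsOn]
	j.mu.Unlock()
	if !ok {
		return nil // the dependency was not elided; nothing to wait for
	}
	if err := <-done; err != nil {
		return fmt.Errorf("elided dependency %q failed to persist: %w", dependsOn, err)
	}
	return nil
}

func main() {
	j := &journaler{pending: map[string]chan error{}}
	j.recordElided("root-stack", func() error { return fmt.Errorf("service unavailable") })
	fmt.Println(j.recordDependent("root-stack"))
}
```

A real implementation also needs to handle several dependents waiting on the same elided entry; the one-shot channel here is only for illustration.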

Fixes https://github.com/pulumi/pulumi-service/issues/37150

@tgummerer tgummerer requested a review from a team as a code owner January 28, 2026 09:32
@tgummerer tgummerer force-pushed the tg/wait-on-elided branch 3 times, most recently from fdc5778 to 84d7288 on January 29, 2026 10:15
Comment thread on pkg/backend/httpstate/journal/snapshot.go (outdated)
@pgavlin
Member

pgavlin commented Feb 6, 2026

Curious if the test that you added fails without these changes.

I think that what you've identified is indeed a possibility in two cases:

  1. if we're sending entries concurrently, or
  2. if a failure persisting a batch of journal entries does not terminate the journaler with the current sequentially-batched approach

In the first case the possibility for missing entries is clear: we wouldn't wait for an elided entry, so its dependents would be allowed to proceed before the elided entry has been persisted.

In the second case, I think the failure to persist and the continuation in the face of that failure are both critical. Regardless of whether an entry is elided, the current scheme ensures that the entry will be queued for persistence ahead of any of its dependent entries. Therefore, we cannot have missing entries unless an elided entry fails to persist and a dependent entry is later sent.
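A minimal sketch of that guarantee (illustrative only, not the actual journaler loop): with strictly sequential persistence that terminates on failure, a dependent entry can never be sent after its dependency failed to persist.

```go
package main

import "fmt"

// runJournaler persists entries strictly in submission order and stops
// on the first failure. Terminating on failure is what rules out case
// (2): once an entry fails, none of its dependents are ever sent.
func runJournaler(entries []string, persist func(string) error) error {
	for _, e := range entries {
		if err := persist(e); err != nil {
			return fmt.Errorf("persisting %q: %w", e, err)
		}
	}
	return nil
}

func main() {
	err := runJournaler(
		[]string{`same(root-stack)`, `removeNew(root-stack)`},
		func(e string) error {
			if e == `same(root-stack)` {
				return fmt.Errorf("service unavailable")
			}
			return nil
		},
	)
	fmt.Println(err) // the dependent removeNew is never sent
}
```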

All in all, I think this change is good because it future-proofs us against case (1). But I think it might be good to include an explanation (feel free to take mine) of the necessary conditions for hitting this with our current approach.

@pgavlin
Member

pgavlin commented Feb 6, 2026

(oh and if the test you've added doesn't fail without these changes, I'd love to have a test that does fail)

@tgummerer
Collaborator Author

> In the first case the possibility for missing entries is clear: we wouldn't wait for an elided entry, so its dependents would be allowed to proceed before the elided entry has been persisted.

Right, this is not something we need to worry about right now because of how batching works, but if we ever end up in a non-batching mode, it can become a problem. Conveniently, this PR solves that case as well.

> if a failure persisting a batch of journal entries does not terminate the journaler with the current sequentially-batched approach

Yup. In theory this shouldn't happen, since the cloud should always be able to store the journal entries, but in the real world that's of course not always the case.

> Curious if the test that you added fails without these changes.

It does fail locally with the changes to journal/snapshot.go reverted. Even if there was an error, the specific error message doesn't exist without these changes, so the test would always fail:

```
$ go test -run TestDependingOnElidedEntries\$ .
--- FAIL: TestDependingOnElidedEntries (0.00s)
    snapshot_test.go:590:
        	Error Trace:	/home/tgummerer/work/pulumi/wait-on-elided/pkg/backend/httpstate/journal/snapshot_test.go:590
        	Error:      	An error is expected but got nil.
        	Test:       	TestDependingOnElidedEntries
FAIL
FAIL	github.com/pulumi/pulumi/pkg/v3/backend/httpstate/journal	0.045s
FAIL
```

(with the revert)

> But I think it might be good to include an explanation (feel free to take mine) of the necessary conditions for hitting this with our current approach.

Added an explanation of why we need these pending entries.

@tgummerer tgummerer enabled auto-merge February 6, 2026 11:36
@tgummerer tgummerer added this pull request to the merge queue Feb 6, 2026
Merged via the queue into master with commit 8fd6fd0 Feb 6, 2026
212 of 215 checks passed
@tgummerer tgummerer deleted the tg/wait-on-elided branch February 6, 2026 13:58
@tgummerer tgummerer mentioned this pull request Feb 10, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Feb 10, 2026
To be merged after #21727 and #21764 have made it in.

Tentative changelog:

##  (2026-02-10)


### Features

- [cli] Show environment variables that were set if a snapshot integrity error happens
  [#21709](#21709)

- [engine] Pass replacement trigger through to Construct
  [#21408](#21408)

- [engine] Add EnvVarMappings resource option for provider resources, allowing environment variables to be remapped before being passed to the provider
  [#21572](#21572)

- [pkg] BREAKING: Deprecate github.com/pulumi/pulumi/pkg/v3/codegen/dotnet in favor of github.com/pulumi/pulumi-dotnet/pulumi-language-dotnet/v3/codegen. This package will be removed from pulumi/pulumi soon!
  [#21720](#21720)


### Bug Fixes

- [cli] Retry `yarn install` when it fails (e.g. during `pulumi install`)
  [#21707](#21707)

- [engine] Deal with errors in elided journal entries correctly
  [#21576](#21576)

- [sdk/python] Fix `_LazyModule` to not trigger full module load for introspection attributes
  [#21620](#21620)

- [sdkgen/python] Remove workaround for slow typechecking with MyPy and PyCharm
  [#21722](#21722)


### Miscellaneous

- [cli] Write logfile location if verbosity is >= 1 to stderr instead of stdout
  [#21663](#21663)
@pulumi-bot
Contributor

This PR has been shipped in release v3.220.0.
