Skip to content

fix(spanner): context canceled in ExecuteStreamingSql#12630

Closed
yordis wants to merge 5 commits intogoogleapis:mainfrom
fanatics-live:yordis/fix-context-canceled
Closed

fix(spanner): context canceled in ExecuteStreamingSql#12630
yordis wants to merge 5 commits intogoogleapis:mainfrom
fanatics-live:yordis/fix-context-canceled

Conversation

@yordis
Copy link
Copy Markdown

@yordis yordis commented Aug 1, 2025

related to #11854

@yordis yordis requested review from a team August 1, 2025 05:42
@product-auto-label product-auto-label bot added the api: spanner Issues related to the Spanner API. label Aug 1, 2025
@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

@rahul2393 here, I am adding a test file in a few, cleaning up a bit

@yordis yordis force-pushed the yordis/fix-context-canceled branch 7 times, most recently from 33c109f to 12c16cd Compare August 1, 2025 05:58
@rahul2393
Copy link
Copy Markdown
Contributor

@yordis can you fix go vet issues?

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

@rahul2393 working on making CI to pass, yes

@yordis yordis force-pushed the yordis/fix-context-canceled branch 2 times, most recently from 31e6f7d to a88eb08 Compare August 1, 2025 06:29
…r for performance

Signed-off-by: Yordis Prieto <yordis.prieto@gmail.com>
@yordis yordis force-pushed the yordis/fix-context-canceled branch from a88eb08 to ed8dae6 Compare August 1, 2025 06:34
@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

It is 3AM, so I am going to sleep, I will come back to this around 11 AM EST

rahul2393
rahul2393 previously approved these changes Aug 1, 2025
@rahul2393 rahul2393 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2025
@rahul2393
Copy link
Copy Markdown
Contributor

@yordis TestSpannerTracesWithOpenTelemetry is now failing means trace status shows error which was fixed by calling the last Recv in go routine.

@rahul2393
Copy link
Copy Markdown
Contributor

Calling last Recv in go routine was indeed to fix that #12235

@rahul2393 rahul2393 self-requested a review August 1, 2025 09:15
@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

Screenshot 2025-08-01 at 2 53 56 PM

With the PR changes I again started to see error status in traces, which was fixed earlier in 1.81.1

@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

@yordis Can you share what does the trace look like at your end in case of context cancelled error, for me in mainline code status is always unset in traces, does it happen for all ExecuteStreamingSQL traces or just few at random?

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

I dont see anything weird, it will help a ton to get #12631 so I can have better data (well, the downside I can not upgrade 😭)

Screenshot 2025-08-01 at 1 20 47 PM

@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

@yordis Is it possible for you to test this patch and see if context cancel still happens

go mod edit -replace cloud.google.com/go/spanner=github.com/googleapis/google-cloud-go/spanner@fix-issue-12630
go mod tidy
go mod download

@rahul2393
Copy link
Copy Markdown
Contributor

TestSpannerTracesWithOpenTelemetry

Just run this test on your local and check if it passes.

@rahul2393
Copy link
Copy Markdown
Contributor

does it happen for all ExecuteStreamingSQL traces or just few at random?

Random, but it could be the case for the same query. What do you have in mind? Picking your brain will help me to understand what to look for

I think context cancel is happening because of this #11854 (comment), we should not cancel in iter.Stop if it's already planned to be cancelled in separate go routine in case of skipping trailers.

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

we should not cancel in iter.Stop if it's already planned to be cancelled in separate go routine in case of skipping trailers.

That make sense to me. What do you think the route should be here? I do not feel qualify to understand the entire situation at this point. I am just trying to be proactive, but I rather not be a distraction or slowness here

@rahul2393
Copy link
Copy Markdown
Contributor

Is there a way for you to verify in your test environment this patch

go mod edit -replace cloud.google.com/go/spanner=github.com/googleapis/google-cloud-go/spanner@fix-issue-12630
go mod tidy
go mod download

and confirm issue is not happening, if yes then I can include the change in next release

@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

@yordis Another case is when application does not iterate over all the rows and call iter.Stop, then context cancel will appear in traces, and that's a valid case.

Are we sure we are not skipping rows? example code which can cause "context cancel" issue is below.

// sql query will return 4 rows
iter := client.Single().Query(context.Background(), spanner.NewStatement(sql))
// By calling Stop() after consuming only one row, we guarantee that the
// iterator's cancel function is still active. This will cancel the
// underlying stream while the background goroutine is waiting for the
// delayed trailer, thus reliably reproducing the race condition.
_, err := iter.Next()
if err != nil && err != iterator.Done {
t.Fatalf("iter.Next() failed unexpectedly: %v", err)
}
iter.Stop()

Above code is wrong since we called iter.Stop without iterating over all the rows
OR, if we pass context with a deadline and it takes more time to process the result.

// Example with Timeout
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()

iter := client.Single().Query(ctx, ...) // Pass the cancellable context
// If the iter.Do() takes longer than 500ms, the context will be canceled.
err := iter.Do(func(r *spanner.Row) error {
    time.Sleep(1 * time.Second) // Simulate long processing
    return nil
})
// err will be context.DeadlineExceeded, which is a form of cancellation.

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

Another case is when application does not iterate over all the rows and call iter.Stop, then context cancel will appear in traces, and that's a valid case.

Let me dig into this, I feel this is the case!

In 1.80, did you fix reporting to the OTEL, correct?

I want to confirm because before we did not have any error in the trace, but now we do, so it could be the case that the app was "broken" before, and we are chasing ghosts here.

@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

In 1.81.1 we fixed the reporting of error in traces, specifically this commit #12235

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

Well, I was using 1.83.0, so could be the case that wasn't fixed properly?

And what I mean by "fix" is that, I wouldn't see a "failure" in my trace. Otherwise we would have "error" signalling when it is expected behavior

@rahul2393
Copy link
Copy Markdown
Contributor

rahul2393 commented Aug 1, 2025

Well, I was using 1.83.0, so could be the case that wasn't fixed properly?

It is fixed in 1.81.1 and I am not able to see context cancel errors for valid cases, but if we try to do this #12630 (comment) then it will happen in traces and that's valid case, one should not call iter.Stop without iterating over all the rows.

@rahul2393
Copy link
Copy Markdown
Contributor

Can you write a test replicate the context cancel span behaviour using the public methods available? I am not able to replicate your case.

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

Let me report back, I think that we actually have a "bug" in our end, and we dont need to fix anything here.

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

application does not iterate over all the rows and call iter.Stop,

What if I don't want to iterate over the whole list and we do not want to see an issue?

Disclaimer, we are using https://github.com/googleapis/go-gorm-spanner so we aren't calling iter.Stop() or dealing with the situation at all,

Here is the one that is actually failing:

func apiCall(ctx context.Context, db *gorm.DB, shopSellerId uuid.UUID) (uuid.UUID, error) {
	var shopSeller models.ShopSeller
	err := db.WithContext(ctx).
		Where("id = ? AND deleted_at IS NULL", shopSellerId).
		First(&shopSeller).Error

	if err != nil {
		if errors.Is(err, gorm.ErrRecordNotFound) {
			return uuid.Nil, fmt.Errorf("...", ErrShopSellerResolutionFailed, shopSellerId)
		}
		return uuid.Nil, fmt.Errorf("...", err)
	}

	shopId, err := uuid.Parse(shopSeller.ShopID)
	if err != nil {
		return uuid.Nil, fmt.Errorf("...", err)
	}

	return shopId, nil
}

The interesting thing is that, I am unsure why Gorm would decide to use "streaming" version for this query; I dont understand Spanner, but it feels streaming would be for long-running queries, or something that wouldn't be selecting a single row

Screenshot 2025-08-01 at 3 12 13 PM

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

Could be we need https://github.com/googleapis/go-gorm-spanner/releases/tag/v1.8.1 (the weird thing is that I can not see file changes from the PR ...)

@rahul2393
Copy link
Copy Markdown
Contributor

There is no file changes because it just update the spanner dependency and dummy PR was to run the release process there, let me dig into GORM now since so far I was just looking into spanner client, what version of GORM spanner are you using?

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

I just tested in latest everything just in case

cloud.google.com/go/spanner v1.83.0
github.com/googleapis/go-gorm-spanner v1.8.2

So I am gonna test my actual branch by now ... sorry, just trying to gather more data just in case

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

btw @rahul2393 would be better to jump into a Google Meet? Your team should have our contact info if I am not mistaken

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

With the following change:

replace cloud.google.com/go/spanner => github.com/googleapis/google-cloud-go/spanner v1.83.1-0.20250801175611-daf851f83e36

The issue didn't go away 100% they are few calls still, but notice the drop on that particular report

@rpc.method:ExecuteStreamingSql @error.message:"context canceled" 
Screenshot 2025-08-01 at 3 39 58 PM

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

Since I do not have #12631 it is really difficult to see what queries are those; but the fact that it is the very last one, makes me think the issue still present

Screenshot 2025-08-01 at 3 42 05 PM

@rahul2393
Copy link
Copy Markdown
Contributor

The issue didn't go away 100% they are few calls still, but notice the drop on that particular report

I assume you wanted to attach before and after report, but there is only one report is it before?

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 1, 2025

I assume you wanted to attach before and after report, but there is only one report is it before?

See the drop in the graph

@rahul2393
Copy link
Copy Markdown
Contributor

@yordis I was able to replicate and track what's happening
I added this patch add-gcp-attribute-traces...add-gcp-attribute-traces-patch to log sequence of events for stream close in mainline

In case of context cancel error
Screenshot 2025-08-02 at 4 53 49 AM

In case of success
Screenshot 2025-08-02 at 5 01 28 AM

@yordis
Copy link
Copy Markdown
Author

yordis commented Aug 2, 2025

So what should be the resolution here based on your understanding thus far?

@rahul2393
Copy link
Copy Markdown
Contributor

@yordis
Can you try this patch and see if you still see context cancelled errors

go mod edit -replace cloud.google.com/go/spanner=github.com/googleapis/google-cloud-go/spanner@fix-issue-12630
go mod tidy
go mod download

@rahul2393
Copy link
Copy Markdown
Contributor

Closing in favor of #12635

@rahul2393 rahul2393 closed this Aug 5, 2025
@yordis yordis deleted the yordis/fix-context-canceled branch August 5, 2025 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: spanner Issues related to the Spanner API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants