Skip to content

fix: async InstanceStart calls#869

Merged
acouvreur merged 2 commits into
mainfrom
loading-page-on-start
Apr 16, 2026
Merged

fix: async InstanceStart calls#869
acouvreur merged 2 commits into
mainfrom
loading-page-on-start

Conversation

@acouvreur

Copy link
Copy Markdown
Member

provider.InstanceStart has no guarantee of execution time. When the call is hanging, then the whole request and loading page is hanging which defeat the purpose of a waiting page.

Closes #782

provider.InstanceStart has no guarantee of execution time. When the call is hanging, then the whole request and loading page is hanging which defeat the purpose of a waiting page.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent the waiting page request from hanging by dispatching provider.InstanceStart asynchronously and tracking in-flight starts to avoid duplicate start calls (per #782).

Changes:

  • Added pendingStarts tracking on Sablier to deduplicate concurrent InstanceStart calls.
  • Updated InstanceRequest to dispatch starts asynchronously instead of synchronously calling InstanceStart (and no longer immediately InstanceInspect on first request).
  • Added unit tests covering async dispatch and deduplication behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
pkg/sablier/sablier.go Adds mutex + pendingStarts map to track in-flight starts.
pkg/sablier/instance_request.go Introduces requestStart helper and switches InstanceRequest to async start dispatch.
pkg/sablier/instance_request_test.go Adds tests for async start dispatch, deduplication, and error/retry behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +22 to +38
if ps, exists := s.pendingStarts[name]; exists {
select {
case <-ps.done:
// Goroutine completed
if ps.err != nil {
// Return the unconsumed error and clear the entry so the next call retries
err := ps.err
delete(s.pendingStarts, name)
return InstanceInfo{}, fmt.Errorf("instance start failed: %w", err)
}
// Succeeded previously but instance is no longer in store — start a new one
delete(s.pendingStarts, name)
default:
// Still running — don't start another goroutine
s.l.DebugContext(ctx, "instance start already in progress", slog.String("instance", name))
return NotReadyInstanceState(name, 0, 1), nil
}

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pendingStarts entries are only deleted when requestStart is called again (i.e., when sessions.Get returns ErrKeyNotFound). In the normal flow, this method stores a not-ready state immediately, so subsequent requests won't hit ErrKeyNotFound and completed entries (and any ps.err) will never be consumed/cleared. This can both leak memory (one map entry per instance) and prevent async start errors from ever surfacing. Consider cleaning up finished entries in the goroutine (under pendingMu) and/or checking/consuming pending completion/errors on the state.Status != ready path before InstanceInspect.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +47
go func() {
defer close(ps.done)
if err := s.provider.InstanceStart(context.Background(), name); err != nil {
ps.err = err

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The async goroutine calls InstanceStart(context.Background(), name) with no timeout/cancellation. If the provider call hangs indefinitely (the reported issue), the goroutine can leak forever and the instance will be permanently considered "start in progress". Consider using a bounded context (e.g., context.WithTimeout) and, on timeout/error, record the failure and clear pendingStarts[name] to allow retries.

Copilot uses AI. Check for mistakes.
Comment thread pkg/sablier/instance_request.go Outdated
Comment on lines +43 to +48

go func() {
defer close(ps.done)
if err := s.provider.InstanceStart(context.Background(), name); err != nil {
ps.err = err
s.l.ErrorContext(ctx, "async instance start failed", slog.String("instance", name), slog.Any("error", err))

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goroutine captures and later uses the request-scoped ctx for logging. If InstanceStart hangs, that can retain request-scoped values (potentially including the HTTP request) in memory for a long time. Prefer logging with context.Background() or context.WithoutCancel(ctx) to keep correlation fields without holding the cancellable request context.

Suggested change
go func() {
defer close(ps.done)
if err := s.provider.InstanceStart(context.Background(), name); err != nil {
ps.err = err
s.l.ErrorContext(ctx, "async instance start failed", slog.String("instance", name), slog.Any("error", err))
logCtx := context.WithoutCancel(ctx)
go func() {
defer close(ps.done)
if err := s.provider.InstanceStart(context.Background(), name); err != nil {
ps.err = err
s.l.ErrorContext(logCtx, "async instance start failed", slog.String("instance", name), slog.Any("error", err))

Copilot uses AI. Check for mistakes.
Comment thread pkg/sablier/instance_request.go
Comment thread pkg/sablier/instance_request_test.go Outdated
Comment on lines +140 to +168
sessions.EXPECT().Get(ctx, "nginx").Return(sablier.InstanceInfo{}, store.ErrKeyNotFound).AnyTimes()
sessions.EXPECT().Put(ctx, sablier.NotReadyInstanceState("nginx", 0, 1), time.Minute).Return(nil).AnyTimes()

// First call's goroutine fails immediately
provider.EXPECT().InstanceStart(gomock.Any(), "nginx").Return(errors.New("connection refused")).Times(1)

// First call succeeds (returns not-ready, goroutine starts)
info, err := manager.InstanceRequest(ctx, "nginx", time.Minute)
assert.NilError(t, err)
assert.Equal(t, info.Status, sablier.InstanceStatus(sablier.InstanceStatusNotReady))

// Allow the goroutine to finish setting the error and closing ps.done
assert.Assert(t, checkWithTimeout(100*time.Millisecond, 5*time.Second, func() bool {
_, err = manager.InstanceRequest(ctx, "nginx", time.Minute)
return err != nil
}), "expected error to be surfaced")
assert.ErrorContains(t, err, "instance start failed: connection refused")
}

func TestInstanceRequest_RetryAfterErrorConsumed(t *testing.T) {
manager, sessions, provider := setupSablier(t)
ctx := t.Context()

secondDone := make(chan struct{})

// Allow multiple Get calls: the polling helper may call InstanceRequest several times
sessions.EXPECT().Get(ctx, "nginx").Return(sablier.InstanceInfo{}, store.ErrKeyNotFound).AnyTimes()
sessions.EXPECT().Put(ctx, sablier.NotReadyInstanceState("nginx", 0, 1), time.Minute).Return(nil).AnyTimes()

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests stub sessions.Get to return ErrKeyNotFound on every call (AnyTimes()), but InstanceRequest always does a Put immediately after dispatching the async start. With a real store, the next Get should return the stored not-ready state, so the error-consumption path in requestStart likely won't run and async start errors may never be surfaced/retried. Consider updating the mocks to reflect the real Get/Put sequence (2nd call returns the value written by Put) and asserting the intended error surfacing/retry behavior in that scenario.

Copilot uses AI. Check for mistakes.
@github-actions

github-actions Bot commented Apr 16, 2026

Copy link
Copy Markdown
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Diff between sablier and sablier                                                                         │
├──────────┬───────────────────────────────────────────────────────────────┬──────────┬──────────┬─────────┤
│ PERCENT  │ NAME                                                          │ OLD SIZE │ NEW SIZE │ DIFF    │
├──────────┼───────────────────────────────────────────────────────────────┼──────────┼──────────┼─────────┤
│ +2.89%   │ github.com/sablierapp/sablier                                 │ 221 kB   │ 228 kB   │ +6.4 kB │
│ +1.76%   │ go.opentelemetry.io/otel                                      │ 336 kB   │ 342 kB   │ +5.9 kB │
│ +175.07% │ golang.org/x/sync                                             │ 2.2 kB   │ 6.0 kB   │ +3.8 kB │
│ +0.73%   │ github.com/klauspost/compress                                 │ 491 kB   │ 495 kB   │ +3.6 kB │
│ +0.17%   │ go.podman.io/image/v5                                         │ 1.3 MB   │ 1.3 MB   │ +2.2 kB │
│ +0.10%   │ <autogenerated>                                               │ 746 kB   │ 746 kB   │ +775 B  │
│ +0.14%   │ text/template                                                 │ 269 kB   │ 270 kB   │ +388 B  │
│ +0.11%   │ golang.org/x/sys                                              │ 76 kB    │ 76 kB    │ +81 B   │
│ +0.04%   │ sync                                                          │ 79 kB    │ 79 kB    │ +31 B   │
│ +0.00%   │ google.golang.org/grpc                                        │ 1.1 MB   │ 1.1 MB   │ +16 B   │
│ +0.00%   │ google.golang.org/protobuf                                    │ 1.7 MB   │ 1.7 MB   │ +9 B    │
│ +0.00%   │ golang.org/x/crypto                                           │ 686 kB   │ 686 kB   │ +9 B    │
│ +0.00%   │ go.podman.io/storage                                          │ 1.5 MB   │ 1.5 MB   │ +6 B    │
│ +0.00%   │ golang.org/x/text                                             │ 150 kB   │ 150 kB   │ +4 B    │
│ +0.00%   │ crypto                                                        │ 1.8 MB   │ 1.8 MB   │ +3 B    │
│ +0.00%   │ github.com/opencontainers/cgroups                             │ 123 kB   │ 123 kB   │ +2 B    │
│ +0.00%   │ github.com/lmittmann/tint                                     │ 34 kB    │ 34 kB    │ +1 B    │
│ +0.01%   │ github.com/skeema/knownhosts                                  │ 14 kB    │ 14 kB    │ +1 B    │
│ +0.00%   │ archive/tar                                                   │ 80 kB    │ 80 kB    │ +1 B    │
│ -0.00%   │ k8s.io/client-go                                              │ 13 MB    │ 13 MB    │ -1 B    │
│ -0.00%   │ github.com/docker/docker                                      │ 390 kB   │ 390 kB   │ -1 B    │
│ -0.00%   │ github.com/spf13/cast                                         │ 107 kB   │ 107 kB   │ -1 B    │
│ -0.00%   │ github.com/kevinburke/ssh_config                              │ 64 kB    │ 64 kB    │ -1 B    │
│ -0.00%   │ k8s.io/apimachinery                                           │ 1.6 MB   │ 1.6 MB   │ -3 B    │
│ -0.01%   │ github.com/spf13/viper                                        │ 64 kB    │ 64 kB    │ -4 B    │
│ -0.01%   │ go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp │ 70 kB    │ 70 kB    │ -5 B    │
│ -0.00%   │ github.com/json-iterator/go                                   │ 412 kB   │ 412 kB   │ -6 B    │
│ -0.19%   │ google.golang.org/genproto/googleapis/rpc                     │ 4.3 kB   │ 4.3 kB   │ -8 B    │
│ -0.00%   │ golang.org/x/net                                              │ 814 kB   │ 814 kB   │ -8 B    │
│ -0.01%   │ github.com/vbatts/tar-split                                   │ 76 kB    │ 76 kB    │ -10 B   │
│ -0.00%   │ github.com/pelletier/go-toml/v2                               │ 207 kB   │ 207 kB   │ -10 B   │
│ -0.00%   │ k8s.io/api                                                    │ 16 MB    │ 16 MB    │ -15 B   │
│ -0.01%   │ github.com/BurntSushi/toml                                    │ 156 kB   │ 156 kB   │ -17 B   │
│ -0.02%   │ github.com/quic-go/quic-go                                    │ 1.2 MB   │ 1.2 MB   │ -279 B  │
├──────────┼───────────────────────────────────────────────────────────────┼──────────┼──────────┼─────────┤
│ +0.04%   │ .rodata                                                       │ 9.3 MB   │ 9.3 MB   │ +3.9 kB │
│ +0.10%   │ .data                                                         │ 262 kB   │ 262 kB   │ +256 B  │
│ +0.22%   │ .go.buildinfo                                                 │ 14 kB    │ 14 kB    │ +32 B   │
│ +0.03%   │ .typelink                                                     │ 110 kB   │ 110 kB   │ +28 B   │
├──────────┼───────────────────────────────────────────────────────────────┼──────────┼──────────┼─────────┤
│ +0.04%   │ sablier                                                       │ 66 MB    │ 66 MB    │ +29 kB  │
│          │ sablier                                                       │          │          │         │
└──────────┴───────────────────────────────────────────────────────────────┴──────────┴──────────┴─────────┘

@github-actions

Copy link
Copy Markdown

✅ All tests successful. No failed tests were found.

📣 Thoughts on this report? Let Codecov know! | Powered by Codecov

@codecov

codecov Bot commented Apr 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.93103% with 7 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
pkg/sablier/instance_request.go 87.50% 6 Missing and 1 partial ⚠️
Files with missing lines Coverage Δ
pkg/sablier/sablier.go 60.00% <100.00%> (+4.44%) ⬆️
pkg/sablier/instance_request.go 86.48% <87.50%> (+57.91%) ⬆️

... and 1 file with indirect coverage changes

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
7.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@acouvreur acouvreur requested a review from Copilot April 16, 2026 14:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +59 to +65
// Succeeded previously but instance is no longer in store — start a new one
delete(s.pendingStarts, name)
default:
// Still running — don't start another goroutine
s.l.DebugContext(ctx, "instance start already in progress", slog.String("instance", name))
return NotReadyInstanceState(name, 0, 1), nil
}

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestStart returns NotReadyInstanceState(name, 0, 1) with a hard-coded desired replica count. This makes the waiting UI show incorrect (current/desired) values for providers where desired replicas can differ from 1 (e.g., Kubernetes names encode replicas), and it can also mark an already-running instance as not-ready on a cold store miss. Consider doing a quick provider.InstanceInspect (before or immediately after dispatching the async start) and returning that inspected state instead of a placeholder, while still keeping InstanceStart asynchronous.

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +81
go func() {
defer cancel()
defer close(ps.done)
if err := s.provider.InstanceStart(startCtx, name); err != nil {
ps.err = err
s.l.Error("async instance start failed", slog.String("instance", name), slog.Any("error", err))
} else {

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the goroutine error path, the pendingStarts entry is never removed unless a future request calls consumePendingError/requestStart for the same instance. If the first request triggers a failed start and the client never polls again, this leaves a completed entry in-memory indefinitely (one per failed instance). Consider cleaning up the map entry in the goroutine after setting ps.err (optionally keeping the error somewhere with a TTL if it still needs to be surfaced later), or adding an expiry-based cleanup for completed entries.

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +252
// Wait for goroutine to finish and self-clean
select {
case <-startDone:
case <-time.After(5 * time.Second):
t.Fatal("InstanceStart goroutine never completed")
}
// Small settle time for the goroutine to acquire the lock and clean up
time.Sleep(50 * time.Millisecond)

// 2nd call: store returns not-ready, no pending entry exists, goes straight to inspect
sessions.EXPECT().Get(ctx, "nginx").Return(notReady, nil)
provider.EXPECT().InstanceInspect(ctx, "nginx").Return(ready, nil)
sessions.EXPECT().Put(ctx, ready, time.Minute).Return(nil)

info, err = manager.InstanceRequest(ctx, "nginx", time.Minute)
assert.NilError(t, err)
assert.Equal(t, info.Status, sablier.InstanceStatus(sablier.InstanceStatusReady))

Copilot AI Apr 16, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses a fixed time.Sleep(50ms) to wait for the async start goroutine to clean up the pending entry. That timing can be flaky on slower CI machines (and can fail spuriously if the goroutine hasn't acquired pendingMu yet). Prefer waiting deterministically (e.g., poll until InstanceRequest takes the inspect path, or expose/observe a synchronization point via a channel) instead of sleeping for an arbitrary duration.

Suggested change
// Wait for goroutine to finish and self-clean
select {
case <-startDone:
case <-time.After(5 * time.Second):
t.Fatal("InstanceStart goroutine never completed")
}
// Small settle time for the goroutine to acquire the lock and clean up
time.Sleep(50 * time.Millisecond)
// 2nd call: store returns not-ready, no pending entry exists, goes straight to inspect
sessions.EXPECT().Get(ctx, "nginx").Return(notReady, nil)
provider.EXPECT().InstanceInspect(ctx, "nginx").Return(ready, nil)
sessions.EXPECT().Put(ctx, ready, time.Minute).Return(nil)
info, err = manager.InstanceRequest(ctx, "nginx", time.Minute)
assert.NilError(t, err)
assert.Equal(t, info.Status, sablier.InstanceStatus(sablier.InstanceStatusReady))
// Wait for goroutine to finish its start call.
select {
case <-startDone:
case <-time.After(5 * time.Second):
t.Fatal("InstanceStart goroutine never completed")
}
// Poll until the pending entry is cleaned up and InstanceRequest takes the
// inspect path.
sessions.EXPECT().Get(ctx, "nginx").Return(notReady, nil).AnyTimes()
provider.EXPECT().InstanceInspect(ctx, "nginx").Return(ready, nil).Times(1)
sessions.EXPECT().Put(ctx, ready, time.Minute).Return(nil).Times(1)
var lastInfo sablier.InstanceInfo
var lastErr error
assert.Assert(t, checkWithTimeout(50*time.Millisecond, 5*time.Second, func() bool {
lastInfo, lastErr = manager.InstanceRequest(ctx, "nginx", time.Minute)
return lastErr == nil && lastInfo.Status == sablier.InstanceStatus(sablier.InstanceStatusReady)
}), "expected pending entry cleanup to allow inspect path")
assert.NilError(t, lastErr)
assert.Equal(t, lastInfo.Status, sablier.InstanceStatus(sablier.InstanceStatusReady))

Copilot uses AI. Check for mistakes.
@acouvreur acouvreur merged commit 3c1430f into main Apr 16, 2026
12 of 14 checks passed
@acouvreur acouvreur deleted the loading-page-on-start branch April 16, 2026 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Async response with waiting screen

2 participants