Conversation
💻 Deploy preview available (loki.write: implement sharding): |
Force-pushed from `4326897` to `6f05b9e`
Pull Request Overview
This PR implements queue-based sharding for the loki.write component, introducing a new architecture that distributes log entries across multiple parallel queues based on label fingerprints. The implementation unifies the handling of both normal and WAL-enabled clients through a shared shards structure, eliminating significant code duplication. The queue_config block, previously WAL-only, now applies to all endpoints and controls both queue capacity and shard count.
Key changes:
- Introduces `shards.go` with a new sharding architecture for parallel processing via multiple queues
- Refactors WAL and fanout clients to use the shared shards implementation, removing ~500 lines of duplicated code
- Adds a `min_shards` configuration parameter to control the parallelism level
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/component/loki/write/types.go | Adds MinShards field to QueueConfig and updates documentation to reflect queue config is now always used |
| internal/component/common/loki/client/shards.go | New file implementing the core sharding logic with queue management and parallel batch sending |
| internal/component/common/loki/client/shards_test.go | Comprehensive test coverage for queue operations including append, drain, and flush/shutdown scenarios |
| internal/component/common/loki/client/consumer_wal.go | Refactored to delegate batching and sending to the shards implementation, significantly simplified |
| internal/component/common/loki/client/consumer_fanout.go | Refactored to use shards implementation, removing duplicated send/batch logic |
| internal/component/common/loki/client/config.go | Adds MinShards field definition to QueueConfig struct |
| docs/sources/reference/components/loki/loki.write.md | Documents the new min_shards parameter and clarifies queue config usage |
thampiotr
left a comment
I still think we need to get a coherent story with naming established and then make sure it is reflected in docs and in the code. But we're on the right track now.
```markdown
### `queue_config`

{{< docs/shared lookup="stability/experimental_feature.md" source="alloy" version="<ALLOY_VERSION>" >}}
```
Shouldn't this continue to be experimental?
Yeah, we could keep this as experimental, but we would always use this config after this PR.
What would be considered experimental would be the naming and changing the defaults, I guess.
```markdown
| Name | Type | Description | Default | Required |
| ---- | ---- | ----------- | ------- | -------- |
| `capacity` | `string` | Controls the size of the underlying send queue buffer. This setting should be considered a worst-case scenario of memory consumption, in which all enqueued batches are full. | `10MiB` | no |
```
```diff
- | `capacity` | `string` | Controls the size of the underlying send queue buffer. This setting should be considered a worst-case scenario of memory consumption, in which all enqueued batches are full. | `10MiB` | no |
+ | `capacity` | `string` | Controls the size of the underlying send queue buffer of each shard. Consider this setting as the worst-case scenario of memory consumption, in which all enqueued batches are full. | `10MiB` | no |
```
What does it even mean 'all enqueued batches are full'? Shouldn't it say that it's the total size of all the enqueued batches instead?
Yeah, this was there before and I did not check or alter it.
But essentially, whenever the capacity is reached, the queue of batches is full and we cannot enqueue another one, so we would block here until we get more capacity.
This setting is per-shard, right? We should clarify this here.
In ffb2bec I renamed the two client implementations we have to `endpoint` and `walEndpoint`. I will look into the other ways we discussed structuring this, but will keep this as a fallback if those don't work out.
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.
Comments suppressed due to low confidence (1)
internal/component/common/loki/client/consumer_wal.go:135
`sync.WaitGroup` does not have a `Go` method. The standard library's `sync.WaitGroup` has `Add()`, `Done()`, and `Wait()` methods. Instead of:

```go
stopWG.Go(func() {
	pair.Stop(drain)
})
```

this should be:

```go
stopWG.Add(1)
go func() {
	defer stopWG.Done()
	pair.Stop(drain)
}()
```
Force-pushed from `6132ec6` to `eaac392`
@thampiotr I updated the PR with an attempt to share code between the non-WAL and WAL implementations. We now have one endpoint struct that handles shards. This implementation has one method that we can use directly in Fanout, so we no longer need channels between fanout and endpoint. For the WAL implementation, I renamed it `walEndpointAdapter`. This implements the interface that the watcher expects and just calls enqueue on endpoint, which internally handles retries etc. Naming could be a bit off, but this would be an option we can go with. Let me know what you think :)
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
internal/component/common/loki/client/consumer_wal.go:134
`sync.WaitGroup` does not have a `Go()` method. This code will not compile:

```go
stopWG.Go(func() {
	pair.Stop(drain)
})
```

You should either:
- Use the `stopWG.Add(1)` and `go func(p endpointWatcherPair) { defer stopWG.Done(); p.Stop(drain) }(pair)` pattern (note: capture `pair` as a parameter to avoid closure issues)
- Or use `golang.org/x/sync/errgroup`, which has a `Go()` method
Force-pushed from `c889a4a` to `c55de0c`
Nothing really stands out for the small doc changes. It's good as-is. |
Force-pushed from `9454993` to `bf7d130`
Force-pushed from `bf7d130` to `e94d632`
queue. This would deadlock, because we would not be able to drain, and a hard shutdown would not cancel it.
Force-pushed from `e94d632` to `ea9f263`
💻 Deploy preview available (feat: add sharding for loki.write): |
```go
var defaultQueueConfig = QueueConfig{
	Capacity:     10 * units.MiB, // considering the default BatchSize of 1MiB, this gives us a default buffered channel of size 10
	MinShards:    1,
	DrainTimeout: 15 * time.Second,
}
```
No: `BatchSize` is configured to 1MB, but capacity is 10MB in the docs.
```go
if err := validateConfigStabilityLevel(o, args); err != nil {
	return nil, err
}
```
Don't we call `Update` anyway, where this check will happen?
```go
for !c.shards.enqueue(entry, segmentNum) {
	backoff.Wait()
	if !backoff.Ongoing() {
		return false
	}
}
```
If this blocks forever due to some issues in enqueue or other functions downstream from here, what would be the symptoms? Would we have any logs or metrics or some other way of debugging this?
Not more than we have today. If we get stuck here for whatever reason, `loki_write_sent_bytes_total` would go down to 0 and `loki_write_dropped_bytes_total` would be 0.
I have an issue for adding more useful metrics for sharding here: #4838. I could also add useful logs in the same PR.
```go
case <-s.softShutdown:
	return false
default:
	return s.queues[shard].append(tenantID, entry, segmentNum)
```
Do we still need to hold the lock when calling `append` here? Or does it not really matter?
We don't, because `append` has its own locking, but it doesn't matter right now because we can only enqueue one item at a time, due to the limitations of the pipeline.
I think it's safest to hold the lock for the full duration for now. I'll need to revisit this if we change the pipeline.
The capacity required for all queues scales with the number of shards.
💻 Deploy preview deleted (feat: Add sharding for loki.write). |
thampiotr
left a comment
I can't see any obvious problems here, so I think we should merge it and then try to validate thoroughly in dev. Hope it's going to work well! As mentioned offline, I'd like to invest more in integration tests to be able to make such changes with more confidence.
PR Description
This PR implements `queue_config` for the `loki.write` component, enabling users to configure queue-based batching and parallel processing. The implementation introduces a new sharding architecture that distributes log entries across multiple parallel queues based on label fingerprints. This implementation is based on Prometheus remote-write sharding. The shards implementation is used with both "normal" clients and "WAL" clients, so we get rid of a lot of duplicated logic.

Before this PR we had a `queue_config` block that was only used when WAL was enabled. It is now always used and will affect clients regardless.

Currently no automatic "resharding" is implemented. Implementing this without the WAL will most likely be pretty primitive, so for now `min_shards` is the only configurable value until we address this.

Ideally we would move a couple of attributes from the `endpoint` block to the `queue_config` block to closer match `prometheus.remote_write`, but we can't do that without a breaking change. These attributes are:
- `retry_on_http_429`
- `max_backoff_period`
- `min_backoff_period`
- `batch_size`
- `batch_wait`

Which issue(s) this PR fixes
Part of: #4728
Notes to the Reviewer
I moved WAL writer ownership into `client.Manager`. No need to expose it to the `loki.write` component.

I plan to work on resharding in a follow-up PR.
PR Checklist