
Snapshot backup failures due to S3 rate limiting handling error #19196

@vldmit

Description


Bug Report

We observe consistent snapshot backup failures (when run in the default configuration on 8.5.2) on a relatively large cluster (1TiB+ per store, NN stores).

TiKV logs the following errors, which suggest it is hitting AWS S3 rate limits. Note that the failed upload_part request is marked retry?=false:

[2025/12/09 20:01:14.321 +00:00] [WARN] [util.rs:110] ["aws request fails"] [uuid=redacted] [context=upload_part] [retry?=false] [err="aws-sdk error: ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some(\"SlowDown\"), message: Some(\"Please reduce your request rate.\"), extras: Some({\"s3_extended_request_id\": \"redacted\", \"aws_request_id\": \"redacted\"}) }, meta: ErrorMetadata { code: Some(\"SlowDown\"), message: Some(\"Please reduce your request rate.\"), extras: Some({\"s3_extended_request_id\": \"redacted\", \"aws_request_id\": \"redacted\"}) } }), raw: Response { status: StatusCode(503), headers: Headers { headers: {\"x-amz-request-id\": HeaderValue { _private: H0(\"redacted\") }, \"x-amz-id-2\": HeaderValue { _private: H0(\"redacted\") }, \"content-type\": HeaderValue { _private: H0(\"application/xml\") }, \"transfer-encoding\": HeaderValue { _private: H0(\"chunked\") }, \"date\": HeaderValue { _private: H0(\"Tue, 09 Dec 2025 20:01:13 GMT\") }, \"connection\": HeaderValue { _private: H0(\"close\") }, \"server\": HeaderValue { _private: H0(\"AmazonS3\") }} }, body: SdkBody { inner: Once(Some(b\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><RequestId>redacted</RequestId><HostId>redacted</HostId></Error>\")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } } })"] [thread_id=326]
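The `retry?=false` field above suggests the S3 client treats the throttling response (code `SlowDown`, HTTP 503) as non-retryable in the `upload_part` path, even though the raw response body itself carries `retryable: true`. A minimal sketch of the expected behavior, classifying throttling responses as retryable and retrying with exponential backoff. All names here (`is_retryable`, `retry_with_backoff`) are hypothetical illustrations, not TiKV's actual retry code:

```rust
use std::time::Duration;

/// Hypothetical classification: S3 throttling responses
/// ("SlowDown", HTTP 503) should be treated as retryable.
fn is_retryable(code: &str, status: u16) -> bool {
    matches!(code, "SlowDown" | "RequestTimeout") || status == 503
}

/// Retry an operation with exponential backoff, up to `max_retries` retries.
/// `op` returns Ok(value) or Err((error_code, http_status)).
fn retry_with_backoff<T, F>(mut op: F, max_retries: u32) -> Result<T, String>
where
    F: FnMut() -> Result<T, (String, u16)>,
{
    let mut delay = Duration::from_millis(100);
    for attempt in 0..=max_retries {
        match op() {
            Ok(v) => return Ok(v),
            Err((code, status)) if is_retryable(&code, status) && attempt < max_retries => {
                // A real implementation would sleep here (with jitter);
                // kept symbolic so the sketch runs instantly.
                eprintln!("attempt {} got {code} ({status}); backing off {delay:?}", attempt + 1);
                delay = delay.saturating_mul(2); // exponential backoff
            }
            Err((code, status)) => return Err(format!("gave up: {code} ({status})")),
        }
    }
    unreachable!("loop always returns")
}

fn main() {
    // Simulate an upload_part call that is throttled twice, then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 {
                Err(("SlowDown".to_string(), 503))
            } else {
                Ok("part uploaded")
            }
        },
        5,
    );
    assert_eq!(result, Ok("part uploaded"));
    assert_eq!(calls, 3);
    println!("succeeded after {calls} calls");
}
```

With backoff like this, transient SlowDown responses would be absorbed by the storage layer instead of bubbling up to BR as a fatal I/O error.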

What version of TiKV are you using?

8.5.2

What operating system and CPU are you using?

Steps to reproduce

Create a snapshot backup. Observe that the backup fails with reason: BackupDataToRemoteFailed

BR logs in the failed job:

        [2025/12/10 20:01:18.320 +00:00] [ERROR] [client.go:117] ["store backup failed"] [round=1] [storeID=103] [error="rpc error: code = Canceled desc = context canceled"] [stack="github.com/pingcap/tidb/br/pkg/backup.(*MainBackupSender).SendAsync.func1\n\t/tidb/br/pkg/backup/client.go:117"]
        [2025/12/10 20:01:18.724 +00:00] [ERROR] [backup.go:57] ["failed to backup"] [error="error happen in store 2261919624: unknown error, retried too many times, give up: [BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 2261919624: unknown error, retried too many times, give up\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).OnBackupResponse\n\t/tidb/br/pkg/backup/client.go:1213\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).RunLoop\n\t/tidb/br/pkg/backup/client.go:341\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges\n\t/tidb/br/pkg/backup/client.go:1126\ngithub.com/pingcap/tidb/br/pkg/task.RunBackup\n\t/tidb/br/pkg/task/backup.go:689\nmain.runBackupCommand\n\t/tidb/br/cmd/br/backup.go:56\nmain.newFullBackupCommand.func1\n\t/tidb/br/cmd/br/backup.go:148\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:985\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041\nmain.main\n\t/tidb/br/cmd/br/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700"] [stack="main.runBackupCommand\n\t/tidb/br/cmd/br/backup.go:57\nmain.newFullBackupCommand.func1\n\t/tidb/br/cmd/br/backup.go:148\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:985\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041\nmain.main\n\t/tidb/br/cmd/br/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"]
        [2025/12/10 20:01:18.725 +00:00] [ERROR] [main.go:38] ["br failed"] [error="error happen in store 2261919624: unknown error, retried too many times, give up: [BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 2261919624: unknown error, retried too many times, give up\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).OnBackupResponse\n\t/tidb/br/pkg/backup/client.go:1213\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).RunLoop\n\t/tidb/br/pkg/backup/client.go:341\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges\n\t/tidb/br/pkg/backup/client.go:1126\ngithub.com/pingcap/tidb/br/pkg/task.RunBackup\n\t/tidb/br/pkg/task/backup.go:689\nmain.runBackupCommand\n\t/tidb/br/cmd/br/backup.go:56\nmain.newFullBackupCommand.func1\n\t/tidb/br/cmd/br/backup.go:148\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:985\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041\nmain.main\n\t/tidb/br/cmd/br/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700"] [stack="main.main\n\t/tidb/br/cmd/br/main.go:38\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"]
        Error: error happen in store 2261919624: unknown error, retried too many times, give up: [BR:KV:ErrKVStorage]tikv storage occur I/O error
        , err: exit status 1

What did you expect?

The backup to succeed.

What actually happened?

The backup failed with reason BackupDataToRemoteFailed after TiKV gave up on throttled S3 upload_part requests (see the TiKV and BR logs above).

Metadata



    Labels

    affects-8.5: This bug affects the 8.5.x (LTS) versions.
    component/backup-restore: Component: backup, import, external_storage
    contribution: This PR is from a community contributor.
    severity/major
    type/bug: The issue is confirmed as a bug.
