BR restore / S3: Stalled stream protection interferes with QoS-based rate limiting #18846

@kennytm

Bug Report

What version of TiKV are you using?

v8.5.3, v8.5.2

What operating system and CPU are you using?

Steps to reproduce

Run BR restore on a cluster of 20 TiKV nodes, with Huawei OBS (an S3-compatible service) as the data source.

The upstream service implemented rate limiting through a token-bucket filter, forcing the total download speed from the entire TiKV cluster to be below 1 GiB/s.
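
For readers unfamiliar with token-bucket limiting, here is a minimal sketch of the idea. It is not how OBS is implemented; the `TokenBucket` type, the 1 GiB burst size, and the client request sizes are all assumptions chosen for illustration.

```rust
/// Illustrative token bucket (hypothetical, not the OBS implementation).
struct TokenBucket {
    capacity: u64, // maximum burst, in bytes
    rate: u64,     // refill rate, in bytes per second
    tokens: u64,   // bytes currently available
    last_ms: u64,  // time of the last refill, in milliseconds
}

impl TokenBucket {
    fn new(capacity: u64, rate: u64) -> Self {
        // The bucket starts full.
        TokenBucket { capacity, rate, tokens: capacity, last_ms: 0 }
    }

    /// Request `bytes` at time `now_ms`. Returns the number of bytes
    /// granted, which is 0 once the bucket is exhausted.
    fn take(&mut self, now_ms: u64, bytes: u64) -> u64 {
        // Refill in proportion to the elapsed time, capped at `capacity`.
        let elapsed_ms = now_ms.saturating_sub(self.last_ms);
        let refill = self.rate.saturating_mul(elapsed_ms) / 1000;
        self.tokens = self.tokens.saturating_add(refill).min(self.capacity);
        self.last_ms = now_ms.max(self.last_ms);

        let granted = bytes.min(self.tokens);
        self.tokens -= granted;
        granted
    }
}

fn main() {
    const MIB: u64 = 1 << 20;
    // Assumed figures: 1 GiB/s refill with a 1 GiB burst allowance.
    let mut bucket = TokenBucket::new(1024 * MIB, 1024 * MIB);
    // 20 clients each request 128 MiB at t = 0; count those that get nothing.
    let starved = (0..20).filter(|_| bucket.take(0, 128 * MIB) == 0).count();
    println!("clients that received nothing at t=0: {starved}");
}
```

Once the bucket is empty, later requesters receive zero bytes until it refills, which from the client side looks like a stream that has stopped making progress.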

The AWS SDK for Rust has a feature known as "stalled stream protection" (awslabs/aws-sdk-rust#1146), which kills the request if the throughput stays below 1 B/s for a grace period (supposed to be 20s, but the actual timeout appears to be 6s).

So if the total download speed is high enough to exhaust the token bucket, and the resulting penalty lasts longer than the grace period (6s), the AWS SDK kills the connection, forcing TiKV to restart the download from the beginning. This both worsens resource usage and leads to #18839.
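
The interaction can be illustrated with a simplified model. This is only a sketch: the SDK's real detector is more involved, and `abort_time`, the per-second trace, and the 8s throttling window are hypothetical. Only the 1 B/s threshold and the 6s and 20s grace periods come from the description above.

```rust
/// Returns the second at which a stall check would abort the stream, or
/// `None` if the download survives. `bytes_per_sec[i]` is what arrived in
/// second `i`; a second counts as stalled when it is below `min_bps`.
/// Simplified model, not the AWS SDK's actual algorithm.
fn abort_time(bytes_per_sec: &[u64], min_bps: u64, grace_secs: usize) -> Option<usize> {
    let mut stalled = 0; // consecutive stalled seconds so far
    for (t, &bytes) in bytes_per_sec.iter().enumerate() {
        if bytes < min_bps {
            stalled += 1;
            // Abort once the stall has outlasted the grace period.
            if stalled > grace_secs {
                return Some(t);
            }
        } else {
            stalled = 0;
        }
    }
    None
}

fn main() {
    // Hypothetical trace: 2 s of data, 8 s throttled to zero, then data again.
    let mut trace = vec![1_000_000u64; 2];
    trace.extend(vec![0u64; 8]);
    trace.extend(vec![1_000_000u64; 2]);

    // Threshold of 1 B/s, as described above.
    println!("grace 6 s:  {:?}", abort_time(&trace, 1, 6));
    println!("grace 20 s: {:?}", abort_time(&trace, 1, 20));
}
```

In this model the 6s grace period aborts a download that would have resumed on its own, while the 20s grace period lets the same trace complete.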

What did you expect?

Speed limit should not kill the download request (at least not within 6s).

What happened?

The download request failed with "streaming error".

Metadata

    Labels

    affects-8.1: This bug affects the 8.1.x (LTS) versions.
    affects-8.5: This bug affects the 8.5.x (LTS) versions.
    affects-9.0: This bug affects the 9.0.x versions.
    component/backup-restore: Component: backup, import, external_storage
    severity/moderate
    type/bug: The issue is confirmed as a bug.
