Freeing Scroll Context can Result in the Store Getting Closed on a Transport Thread  #83515

@original-brownbear

Description

I looked into transport worker slow logging in 7.17, and one outstanding and recurring issue is log lines like the one below:

[instance-0000000000] handling inbound transport message [InboundMessage{Header{1325}{7.17.0}{241879}{true}{false}{false}{false}{indices:data/read/search[free_context/scroll]}}] took [6004ms] which is above the warn threshold of [5000ms]

I believe this is caused by the fact that the underlying action decrements the store ref count. If it turns out to be the last to decrement the ref count here, then the store close (including acquiring the shard lock) runs on a transport thread.
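
To illustrate the mechanism, here is a minimal standalone sketch, not Elasticsearch's actual `Store` code. The class `RefCountedStoreSketch`, the thread names `transport_worker-0` and `generic-0`, and the optional close executor are all hypothetical and exist only for this example. It shows how whichever thread performs the last decrement ends up running the close, and how handing the close to a separate executor would keep it off the transport thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a ref-counted store; not the real Elasticsearch class.
class RefCountedStoreSketch {
    // Starts at 1: the reference held by the shard itself.
    private final AtomicInteger refCount = new AtomicInteger(1);
    // Records the name of the thread that ran the close (for illustration only).
    final AtomicReference<String> closedOn = new AtomicReference<>();
    // When null, the close runs inline on whichever thread did the last decRef.
    private final ExecutorService closeExecutor;

    RefCountedStoreSketch(ExecutorService closeExecutor) {
        this.closeExecutor = closeExecutor;
    }

    void incRef() {
        refCount.incrementAndGet();
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            if (closeExecutor == null) {
                closeInternal(); // runs on the caller, possibly a transport thread
            } else {
                closeExecutor.execute(this::closeInternal); // offloaded
            }
        }
    }

    private void closeInternal() {
        // Real code would acquire the shard lock and do IO here.
        closedOn.set(Thread.currentThread().getName());
    }

    // Replays the scenario from this issue and returns the name of the closing thread.
    static String closingThread(boolean offload) {
        ExecutorService generic = Executors.newSingleThreadExecutor(r -> new Thread(r, "generic-0"));
        try {
            RefCountedStoreSketch store = new RefCountedStoreSketch(offload ? generic : null);
            store.incRef(); // a scroll context takes a reference
            store.decRef(); // the shard relocates away and drops its own reference
            // The free-context handler does the last decrement on a "transport" thread.
            Thread transport = new Thread(store::decRef, "transport_worker-0");
            transport.start();
            transport.join();
            generic.shutdown();
            generic.awaitTermination(5, TimeUnit.SECONDS);
            return store.closedOn.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            generic.shutdownNow();
        }
    }
}
```

Calling `RefCountedStoreSketch.closingThread(false)` reports the transport-named thread as the one that closed the store, while `closingThread(true)` reports the executor thread. Whether offloading like this is the right fix here is an open question; the sketch is only meant to make the problem concrete.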

I think this can only happen if there's a concurrent relocation or similar (though it happens quite a bit in Cloud logs), but regardless, IO should never run on transport workers.

I wonder if we have other spots where this occurs and the last decrement of the store ref count happens via a search action on a transport thread. It might be worth adding an assertion for not running the store close on a transport worker when fixing this.
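
A rough sketch of what such an assertion could look like, written as a standalone example. The class `StoreCloseAssertionSketch`, its methods, and the idea of recognising transport workers by a `transport_worker` marker in the thread name are assumptions made for this sketch, not a description of the existing code.

```java
// Hypothetical helper; names and the thread-name convention are assumptions.
class StoreCloseAssertionSketch {
    // Assumption: transport worker threads carry this marker in their name.
    static final String TRANSPORT_WORKER_MARKER = "transport_worker";

    static boolean isTransportThread(Thread t) {
        return t.getName().contains(TRANSPORT_WORKER_MARKER);
    }

    // Returns true so it can be wrapped in `assert` and cost nothing when assertions are off.
    static boolean assertNotTransportThread(String reason) {
        Thread t = Thread.currentThread();
        assert isTransportThread(t) == false
            : "must not run [" + reason + "] on transport thread [" + t.getName() + "]";
        return true;
    }

    static void closeStore() {
        // Trips in tests (run with -ea) if the close ever lands on a transport worker.
        assert assertNotTransportThread("store close");
        // ... the IO-heavy close, including taking the shard lock, would go here ...
    }
}
```

With assertions enabled in tests, any code path where the last decrement lands on a transport worker would fail fast instead of only showing up as a slow log in production.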

Metadata

Labels

:Distributed/Engine (Anything around managing Lucene and the Translog in an open shard.)
:Search Foundations/Search (Catch all for Search Foundations)
>bug
Team:Distributed (Meta label for distributed team.)
Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)
