Retry follow task when remote connection queue full#55314

Merged
dnhatn merged 4 commits intoelastic:masterfrom
dnhatn:remote-connect-queue
Apr 17, 2020

Conversation

@dnhatn (Member) commented Apr 16, 2020

If more than 100 shard-follow tasks are trying to connect to the remote cluster, some of them will abort with "connect listener queue is full". This is because we retry on EsRejectedExecutionException, but not on its superclass RejectedExecutionException, which is what the remote connection queue throws.
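The distinction matters because Elasticsearch's EsRejectedExecutionException is a subclass of java.util.concurrent.RejectedExecutionException, so a subclass-only check never matches the plain rejection thrown when the connect listener queue fills up. A minimal sketch of the retry predicate (the class and method names here are hypothetical stand-ins, not the actual CCR retry code):

```java
import java.util.concurrent.RejectedExecutionException;

public class RetryCheck {
    // Hypothetical stand-in for Elasticsearch's EsRejectedExecutionException,
    // which extends java.util.concurrent.RejectedExecutionException.
    static class EsRejectedExecutionException extends RejectedExecutionException {
        EsRejectedExecutionException(String msg) { super(msg); }
    }

    // Before the fix: only the ES-specific subclass was treated as retryable.
    static boolean retryableBefore(Exception e) {
        return e instanceof EsRejectedExecutionException;
    }

    // After the fix: any RejectedExecutionException is retryable, including the
    // plain one thrown when the remote connect listener queue is full.
    static boolean retryableAfter(Exception e) {
        return e instanceof RejectedExecutionException;
    }

    public static void main(String[] args) {
        Exception queueFull = new RejectedExecutionException("connect listener queue is full");
        System.out.println(retryableBefore(queueFull)); // false: shard-follow task aborts
        System.out.println(retryableAfter(queueFull));  // true: task retries instead
    }
}
```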

@dnhatn dnhatn added >bug :Distributed/CCR Issues around the Cross Cluster State Replication features v8.0.0 v7.6.3 v6.8.9 v7.8.0 v7.7.1 labels Apr 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)


// this setting is intentionally not registered, it is only used in tests
public static final Setting<Integer> REMOTE_MAX_CONNECTION_QUEUE_SIZE =
Setting.intSetting("cluster.remote.max_connection_queue_size", 100, Setting.Property.NodeScope);
Contributor


I don't think there was a lot of thought put into the connection listener limit. If there is a strong reason to increase it past 100, we could probably do that. Also, does this name make sense? We only allow a single connection round at a time. Should the name be cluster.remote.max_pending_connection_listeners?

Member Author

@dnhatn dnhatn Apr 16, 2020


> Should the name be cluster.remote.max_pending_connection_listeners?

++. I renamed it in f9c807f.

> I don't think there was a lot of thought to the connection listener limit. If there is a strong reason to increase it past 100 we could probably do that.

Yeah, I think we chose this value quite arbitrarily. I think it's fine to increase this value as we should not have many concurrent remote searches, and CCR will retry on this error anyway. I've increased this to 1000. WDYT?
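The failure mode under discussion can be sketched with a bounded queue: once the pending-listener queue reaches capacity, further enqueue attempts are rejected rather than blocked. A toy model, assuming the capacity plays the role of cluster.remote.max_pending_connection_listeners (the PendingListeners class itself is hypothetical, not the actual RemoteConnectionManager code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;

// Toy model of a bounded pending-connection-listener queue; the constructor
// argument stands in for cluster.remote.max_pending_connection_listeners.
class PendingListeners {
    private final ArrayBlockingQueue<Runnable> queue;

    PendingListeners(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Rejects (rather than blocks) when the queue is full, mirroring the
    // "connect listener queue is full" failure seen by shard-follow tasks.
    void enqueue(Runnable listener) {
        if (queue.offer(listener) == false) {
            throw new RejectedExecutionException("connect listener queue is full");
        }
    }

    int pending() {
        return queue.size();
    }
}
```

With a capacity of 100, the 101st concurrent shard-follow task hits the rejection; the PR both raises the default to 1000 and makes the rejection retryable, so either change alone would have masked most occurrences.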

@dnhatn dnhatn requested a review from Tim-Brooks April 16, 2020 17:30
@Tim-Brooks (Contributor) left a comment


LGTM


dnhatn commented Apr 17, 2020

@tbrooks8 Thanks for reviewing.

@dnhatn dnhatn merged commit 5216bd2 into elastic:master Apr 17, 2020
@dnhatn dnhatn deleted the remote-connect-queue branch April 17, 2020 04:10
dnhatn added a commit that referenced this pull request Apr 17, 2020
If more than 100 shard-follow tasks are trying to connect to the remote
cluster, then some of them will abort with "connect listener queue is
full". This is because we retry on EsRejectedExecutionException, but not
on RejectedExecutionException.
dnhatn added a commit that referenced this pull request Apr 21, 2020
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 1, 2020
dnhatn added a commit that referenced this pull request May 2, 2020

Backport of #55314
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021

Labels

>bug :Distributed/CCR Issues around the Cross Cluster State Replication features v6.8.9 v7.6.3 v7.7.1 v7.8.0 v8.0.0-alpha1


4 participants