Skip to content

Add possibility to acquire permits on primary shards with different checks#119794

Merged
arteam merged 4 commits intoelastic:mainfrom
arteam:acquire-primary-permit-custom-check
Jan 10, 2025
Merged

Add possibility to acquire permits on primary shards with different checks#119794
arteam merged 4 commits intoelastic:mainfrom
arteam:acquire-primary-permit-custom-check

Conversation

@arteam
Copy link
Copy Markdown
Contributor

@arteam arteam commented Jan 8, 2025

Since #42241 we check that the shard must be in a primary mode for acquiring a primary permit on it. We would like customize this check and an option to perform different checks before running the onPermitAcquired listener. For example, we would to skip the primary mode check when we acquire primary permits during recovering of a hollow indexing shard.

See ES-10487

@arteam arteam added >non-issue :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Jan 8, 2025
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine added Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. v9.0.0 labels Jan 8, 2025
…hecks

Since elastic#42241 we check that the shard must be in a primary mode for acquiring a
primary permit on it. We would like customize this check and an option to
perform different checks before running the `onPermitAcquired` listener.
For example, we would to skip the primary mode check when we acquire primary permits
during recovering of a hollow indexing shard.

See ES-10487
@arteam arteam force-pushed the acquire-primary-permit-custom-check branch from 23917ed to 94631d5 Compare January 8, 2025 20:12
@arteam arteam requested review from fcofdez and kingherc January 9, 2025 07:55
Copy link
Copy Markdown
Contributor

@kingherc kingherc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. A couple of comments.

Copy link
Copy Markdown
Contributor

@kingherc kingherc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arteam arteam merged commit 6ca7e75 into elastic:main Jan 10, 2025
@arteam arteam deleted the acquire-primary-permit-custom-check branch January 10, 2025 13:04
/**
* Check to run before running the primary permit operation
*/
public enum PrimaryPermitCheck {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit late, but wouldn't be possible to somehow relax the assertion in #isPrimaryMode when the engine is hollow instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fcofdez ! It's possible, but it's unclear what's the best way to do it since the hollow info/logic is in serverless code. In the POC, I remember I simply relaxed the assertion by allowing it if the shard is in the recovery process. Not sure though it's better than this PR since it was more generic (recovering shards). Feel free to shoot a specific proposal though to see if it's better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fcofdez I believe the reason was that IndexShard doesn't know about HollowEngine which is a serverless concept, so you would need to expose this abstraction to Engine. I think disabling the primary mode check explicitly seems to be a good first approach and we can revert to the engine check if that makes more sense.

@fcofdez
Copy link
Copy Markdown
Contributor

fcofdez commented Jan 14, 2025

I've been chatting with @kingherc and I think that we could avoid this change if we acquire the permits in the target node during the primary context handover in serverless as that's called before the shard is marked as started in

@Override
public void onRecoveryDone(
final RecoveryState state,
ShardLongFieldRange timestampMillisFieldRange,
ShardLongFieldRange eventIngestedMillisFieldRange
) {
shardStateAction.shardStarted(
shardRouting,
primaryTerm,
"after " + state.getRecoverySource(),
timestampMillisFieldRange,
eventIngestedMillisFieldRange,
ActionListener.noop()
);
}
. That would avoid having to change this API.

@kingherc
Copy link
Copy Markdown
Contributor

breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
We added support for acquiring permits on primary shards without the routing mode check in elastic#119794, now we should be able to acquire permits for hollow shards.

We also have to release the permits in order for the test cluster to be properly shut down.

See
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
* [WIP] Add support for relocating hollow shards

* Add a test for relocating hollow shard

* Run spotless

* Use logger instead of System.out.println

* Don't check activeOperations if we hold permits

* Test relocating hollow shard

* Add more testing for relocating hollowable shards

* Acquire primary permits on hollow shards on recovery

We added support for acquiring permits on primary shards without the routing
mode check in elastic#119794, now
we should be able to acquire permits for hollow shards.

We also have to release them in order for the test cluster properly shit down.

* Add releasePrimaryPermits call when shutting down HollowIndexEngine

* Make releasing primary permits package-private

* Add a comment about HollowShardsService being available for DI

* Remove extra isLastCommitHollow check

* Support relocating hollow shards with HollowIndexEngine

* Run spotless

* Move stateless compound checks

* Add tests for swapping hollowed shards to HollowEngine on failures

* Swap the engine to HollowIndexEngine if we were unable to relocate the shard

* Always reset engine while holding primary permits

* Extract switch to hollow shard in a separate method

* Acquire primary permits after we initialized the shard with hollow index engine

* Add additional arePrimaryPermitsHeld checks after the relocation

* Revert to debug for "obtained primary context" message

* Use resetEngine

* Don't swap to HollowIndexEngine on source node when relocation fails

* Add a link to JIRA ticket about swapping the engine on relocation failures

* Rollback "acquiring all primary operation permits" to DEBUG

* Reference in the relocation fail test

* Use handOffPrimaryPermits for primaryPermits

* use handOffPrimaryPermits

* Simplify breaking of relocations

Just throw exceptions instead of disconecting nodes

* Use AtomicReference for primaryPermits since we mutate it

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
We added support for acquiring permits on primary shards without the routing mode check in elastic#119794, now we should be able to acquire permits for hollow shards.

We also have to release the permits in order for the test cluster to be properly shut down.

See
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
* [WIP] Add support for relocating hollow shards

* Add a test for relocating hollow shard

* Run spotless

* Use logger instead of System.out.println

* Don't check activeOperations if we hold permits

* Test relocating hollow shard

* Add more testing for relocating hollowable shards

* Acquire primary permits on hollow shards on recovery

We added support for acquiring permits on primary shards without the routing
mode check in elastic#119794, now
we should be able to acquire permits for hollow shards.

We also have to release them in order for the test cluster properly shit down.

* Add releasePrimaryPermits call when shutting down HollowIndexEngine

* Make releasing primary permits package-private

* Add a comment about HollowShardsService being available for DI

* Remove extra isLastCommitHollow check

* Support relocating hollow shards with HollowIndexEngine

* Run spotless

* Move stateless compound checks

* Add tests for swapping hollowed shards to HollowEngine on failures

* Swap the engine to HollowIndexEngine if we were unable to relocate the shard

* Always reset engine while holding primary permits

* Extract switch to hollow shard in a separate method

* Acquire primary permits after we initialized the shard with hollow index engine

* Add additional arePrimaryPermitsHeld checks after the relocation

* Revert to debug for "obtained primary context" message

* Use resetEngine

* Don't swap to HollowIndexEngine on source node when relocation fails

* Add a link to JIRA ticket about swapping the engine on relocation failures

* Rollback "acquiring all primary operation permits" to DEBUG

* Reference in the relocation fail test

* Use handOffPrimaryPermits for primaryPermits

* use handOffPrimaryPermits

* Simplify breaking of relocations

Just throw exceptions instead of disconecting nodes

* Use AtomicReference for primaryPermits since we mutate it

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants