Conversation
henningandersen
left a comment
There was a problem hiding this comment.
A couple initial comments, I'd like to see the missing class before I continue.
| final Collection<Object> components = new ArrayList<>(); | ||
| components.add(new ReindexSslConfig(services.environment().settings(), services.environment(), services.resourceWatcherService())); | ||
| components.add(new ReindexMetrics(services.telemetryProvider().getMeterRegistry())); | ||
| return (components); |
There was a problem hiding this comment.
This can be simplified to:
return List.of(new ReindexSslConfig(services.environment().settings(), services.environment(), services.resourceWatcherService()), new ReindexMetrics(services.telemetryProvider().getMeterRegistry()));
| ScriptService scriptService, | ||
| ReindexSslConfig reindexSslConfig | ||
| ReindexSslConfig reindexSslConfig, | ||
| ReindexMetrics reindexMetrics |
There was a problem hiding this comment.
I think you missed adding this class when commiting the changes.
There was a problem hiding this comment.
Ah Yes! Sorry about that.
|
Hi Team, this is an initial implementation to measure the total time taken by a reindex request and externalize it as a histogram via APM. Please take a close look since I am new to both Reindexing module and the metering infrastructure. Any and all feedback is welcome. |
|
I am looking into the test failures. |
henningandersen
left a comment
There was a problem hiding this comment.
Overall looks good. Next step seems to be to add a (or amend an existing) test to verify that we get some took times out.
| transportService, | ||
| new ReindexSslConfig(settings, environment, watcherService) | ||
| new ReindexSslConfig(settings, environment, watcherService), | ||
| null |
There was a problem hiding this comment.
Will this throw a NullPointerException when this special reindex task ends? I think we may have to pass an object here.
There was a problem hiding this comment.
Hi Henning, do you expect the NullPointerException to be thrown inside Reindexer->execute() ? Here, I did handle the case where we pass in a null ReindexMetrics object to the Reindexer class constructor. But perhaps you are thinking of some other scenario ?
Right now I pass in null ReindexMetrics object to TransportReindexAction when called from TransportEnrichReindexAction, but that also means we will not capture metrics for reindexing operations done by the EnrichPlugin. So I also wanted your opinion on whether we want that behavior or not.
The thing is that a metric can only be registered once on a node. So I think we can only instantiate one ReindexMetrics object for the node. If we want to record reindexing metrics from the EnrichPlugin as well, I believe we will have to instantiate the ReindexMetrics object in NodeConstruction here instead of as a component of ReindexPlugin like I do right now.
There was a problem hiding this comment.
I think it should work if you accept it as a parameter in TransportEnrichReindexAction. You expose it from the ReindexPlugin as a component and those are automatically injected whenever needed.
I'd prefer to do that over the null check. If there is a problem, perhaps you can let me know how it fails?
There was a problem hiding this comment.
Hi Henning, if I simply try to pass ReindexMetrics as a parameter to TransportEnrichReindexAction constructor, I run into the following failure. Could it have anything to do with the order in which the ReindexPlugin and the EnrichPlugin are created ? I have made the change and uploaded it here in case it is easier to inspect the failure here.
Task :docs:yamlRestTest FAILED
=== Log output of node node{:docs:yamlRestTest-0} ===
» ↓ errors and warnings from /Users/ankitakumar/elastic/elasticsearch/docs/build/testclusters/yamlRestTest-0/logs/es.out ↓
» [2024-08-14T16:09:26.012689Z] [BUILD] Starting Elasticsearch process
» Enter password for the elasticsearch keystore : Aug 14, 2024 12:09:32 PM sun.util.locale.provider.LocaleProviderAdapter
» WARNING: COMPAT locale provider will be removed in a future release
» [2024-08-14T12:10:03,451][ERROR][o.e.b.Elasticsearch ] [node-0] fatal exception while booting Elasticsearch org.elasticsearch.injection.guice.CreationException: Guice creation errors:
»
» 1) Could not find a suitable constructor in org.elasticsearch.reindex.ReindexMetrics. Classes must have either one (and only one) constructor annotated with @org.elasticsearch.injection.guice.Inject or a zero-argument constructor that is not private.
» at org.elasticsearch.reindex.ReindexMetrics.class(Unknown Source)
» while locating org.elasticsearch.reindex.ReindexMetrics
» for parameter 11 at org.elasticsearch.xpack.enrich.action.TransportEnrichReindexAction.(Unknown Source)
» at unknown
»
» 1 error
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.injection.guice.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:312)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.injection.guice.InjectorBuilder.initializeStatically(InjectorBuilder.java:115)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.injection.guice.InjectorBuilder.build(InjectorBuilder.java:73)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.injection.guice.Guice.createInjector(Guice.java:69)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.injection.guice.ModulesBuilder.createInjector(ModulesBuilder.java:36)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.construct(NodeConstruction.java:1163)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:270)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.node.Node.(Node.java:193)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.(Elasticsearch.java:239)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:239)
» at org.elasticsearch.server@8.16.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:74)
»
» ERROR: Elasticsearch did not exit normally - check the logs at /Users/ankitakumar/elastic/elasticsearch/docs/build/testclusters/yamlRestTest-0/logs/yamlRestTest.log
»
» ERROR: Elasticsearch died while starting up, with exit code 1
» ↓ last 40 non error or warning messages from /Users/ankitakumar/elastic/elasticsearch/docs/build/testclusters/yamlRestTest-0/logs/es.out ↓
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded module [x-pack-eql]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-ukrainian]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-kuromoji]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [discovery-gce]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [mapper-murmur3]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-phonetic]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-icu]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [discovery-ec2]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [mapper-annotated-text]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-smartcn]
» [2024-08-14T12:09:43,219][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [mapper-size]
» [2024-08-14T12:09:43,220][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [discovery-azure-classic]
» [2024-08-14T12:09:43,220][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [store-smb]
» [2024-08-14T12:09:43,220][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-nori]
» [2024-08-14T12:09:43,220][INFO ][o.e.p.PluginsService ] [node-0] loaded plugin [analysis-stempel]
» [2024-08-14T12:09:43,512][INFO ][o.e.c.u.FeatureFlag ] [node-0] The current build is a snapshot, feature flag [failure_store] is enabled
» [2024-08-14T12:09:44,529][INFO ][o.e.c.u.FeatureFlag ] [node-0] The current build is a snapshot, feature flag [ccs_telemetry] is enabled
» [2024-08-14T12:09:44,836][INFO ][o.e.e.NodeEnvironment ] [node-0] using [1] data paths, mounts [[/System/Volumes/Data (/dev/disk3s5)]], net usable_space [851.2gb], net total_space [926.3gb], types [apfs]
» [2024-08-14T12:09:44,836][INFO ][o.e.e.NodeEnvironment ] [node-0] heap size [512mb], compressed ordinary object pointers [true]
» [2024-08-14T12:09:44,858][INFO ][o.e.n.Node ] [node-0] node name [node-0], node ID [41A1Dw_hTsGdR6iragiuYQ], cluster name [yamlRestTest], roles [remote_cluster_client, master, data_warm, data_content, transform, data_hot, ml, data_frozen, ingest, data_cold, data]
» [2024-08-14T12:09:48,502][INFO ][o.e.c.u.FeatureFlag ] [node-0] The current build is a snapshot, feature flag [connector_secrets] is enabled
» [2024-08-14T12:09:49,400][INFO ][o.e.i.r.RecoverySettings ] [node-0] using rate limit [40mb] with [default=40mb, read=0b, write=0b, max=0b]
There was a problem hiding this comment.
If we don't care to record metrics from the TransportEnrichReindexAction, we could also call the TransportReindexAction constructor with a NOOP like so:
new ReindexMetrics(TelemetryProvider.NOOP.getMeterRegistry())
Is that better than passing a null ?
There was a problem hiding this comment.
Discussed this with Ryan here https://elastic.slack.com/archives/C8UUBNASY/p1724102081167389
|
Thanks for taking a look Henning. Will work on a test today. |
henningandersen
left a comment
There was a problem hiding this comment.
Couple more comments.
| listener | ||
| ActionListener.runAfter(listener, () -> { | ||
| long elapsedTime = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - startTime); | ||
| if (reindexMetrics != null) reindexMetrics.recordTookTime(elapsedTime); |
There was a problem hiding this comment.
We normally prefer the curly-brace form, even for one-line if statements:
| if (reindexMetrics != null) reindexMetrics.recordTookTime(elapsedTime); | |
| if (reindexMetrics != null) { | |
| reindexMetrics.recordTookTime(elapsedTime); | |
| } |
| transportService, | ||
| new ReindexSslConfig(settings, environment, watcherService) | ||
| new ReindexSslConfig(settings, environment, watcherService), | ||
| null |
There was a problem hiding this comment.
I think it should work if you accept it as a parameter in TransportEnrichReindexAction. You expose it from the ReindexPlugin as a component and those are automatically injected whenever needed.
I'd prefer to do that over the null check. If there is a problem, perhaps you can let me know how it fails?
|
Adding @Tim-Brooks as a reviewer since Henning and Francisco are out this week. |
fcofdez
left a comment
There was a problem hiding this comment.
Looks good, I left a few comments
| new UpdateByQueryMetrics(services.telemetryProvider().getMeterRegistry()) | ||
| ); | ||
|
|
||
| /* |
There was a problem hiding this comment.
nit: we should get rid of this commented code
| private final LongHistogram reindexTimeSecsHistogram; | ||
|
|
||
| public ReindexMetrics(MeterRegistry meterRegistry) { | ||
| this(meterRegistry.registerLongHistogram(TOOK_TIME_HISTOGRAM, "Time to reindex by search", "seconds")); |
There was a problem hiding this comment.
are we sure that we want to use seconds granularity? we might end up reporting 0 seconds for fast reindex operations
There was a problem hiding this comment.
The max span of APM histograms is 100,000. That was the reason for choosing seconds granularity as per the discussion here: https://elastic.slack.com/archives/C05ENM6MEDD/p1722522943486389?thread_ts=1722522639.141469&cid=C05ENM6MEDD
Perhaps we can capture in centiseconds ? We probably don't expect operations to take more than 15 min ?
There was a problem hiding this comment.
Ah right, I didn't remember that limitation. Let's keep it in seconds then 👍
modules/reindex/src/main/java/org/elasticsearch/reindex/TransportDeleteByQueryAction.java
Show resolved
Hide resolved
| ScriptService scriptService, | ||
| ClusterService clusterService | ||
| ClusterService clusterService, | ||
| UpdateByQueryMetrics updateByQueryMetrics |
There was a problem hiding this comment.
maybe we should annotate this variable as @Nullable too?
| import org.elasticsearch.telemetry.metric.MeterRegistry; | ||
|
|
||
| public class UpdateByQueryMetrics { | ||
| public static final String TOOK_TIME_HISTOGRAM = "es.update_by_query.took_time.histogram"; |
There was a problem hiding this comment.
maybe we should name this metric es.update_by_query.duration.histogram? instead of using took_time?
| ); | ||
| assertHitCount(prepareSearch("source").setSize(0), 4); | ||
|
|
||
| TestTelemetryPlugin testTelemetryPlugin = getTestTelemetryPlugin(); |
There was a problem hiding this comment.
I think that we should write a new test case for the newly introduced metrics instead of updating an existing test
There was a problem hiding this comment.
ok sure. I will write a new test. Where should the test be located. Should it go inside the reindex package or somewhere else ? @henningandersen mentioned that ReindexBasicTests was not in the right folder.
There was a problem hiding this comment.
yes, it looks like it's an integration test. Integration tests should go into the internalClusterTest folder.
There was a problem hiding this comment.
I think we can move the ReindexBasicTests into internalClusterTest and write a new test case, wdyt?
There was a problem hiding this comment.
I think we can move the ReindexBasicTests into internalClusterTest and write a new test case, wdyt?
I've just noticed that you wrote a new test, we can do what I suggested here in a follow-up PR.
| testTelemetryPlugin.resetMeter(); | ||
| // Copy all the docs | ||
| ReindexRequestBuilder copy = reindex().source("source").destination("dest").refresh(true); | ||
| testTelemetryPlugin.collect(); |
There was a problem hiding this comment.
we're calling collect() before the reindex is executed, we should call that after
There was a problem hiding this comment.
Thanks! I wasn't quite sure how collect works.
fcofdez
left a comment
There was a problem hiding this comment.
Looks good, left a few more comments about the test
| import static org.hamcrest.Matchers.equalTo; | ||
|
|
||
| @ESIntegTestCase.ClusterScope(numDataNodes = 0, numClientNodes = 0, scope = ESIntegTestCase.Scope.TEST) | ||
| public class ReindexPluginMetricsTests extends ESIntegTestCase { |
There was a problem hiding this comment.
can we move this test into internalClusterTest?
There was a problem hiding this comment.
Sure, which folder inside internalClusterTest should I move it to ?
Perhaps server/src/internalClusterTest/java/org/elasticsearch/plugins ?
Or should I create a new folder inside server/src/internalClusterTest/java/org/elasticsearch/ called reindex ?
There was a problem hiding this comment.
You should move it into modules/reindex/src/internalClusterTest/java/org/elasticsearch/index/reindex/
There was a problem hiding this comment.
Ah ok inside modules
| assertThat(copy.get(), matcher().created(2)); | ||
|
|
||
| // wait for all threads to complete so that we get deterministic results | ||
| waitUntil(() -> (tps[0] = tp.stats()).stats().stream().allMatch(s -> s.active() == 0)); |
There was a problem hiding this comment.
usually what we do when an assertion might need to wait until something completes in the background is use assertBusy which asserts a few times until either the assertion is true or fails once the timeout is reached.
There was a problem hiding this comment.
Ah ok, so I don't need to do the explicit wait if I use assertBusy () ?
There was a problem hiding this comment.
No, you don't need to do the explicit wait, assertBusy waits implicitly.
fcofdez
left a comment
There was a problem hiding this comment.
Looks good, just a few minor comments about the tests before approving 👍
| ); | ||
| assertHitCount(prepareSearch("source").setSize(0), 4); | ||
|
|
||
| final String dataNodeName = internalCluster().getRandomNodeName(); |
There was a problem hiding this comment.
since this test uses a single node started with internalCluster().startNode(); can we get the name from that node instead of using getRandomNode that way it's clearer that we'll be fetching the metrics from that node.
There was a problem hiding this comment.
Yes! Let me make the change.
| copy = reindex().source("source").destination("none").filter(termQuery("foo", "no_match")).refresh(true); | ||
| assertThat(copy.get(), matcher().created(0)); | ||
| assertHitCount(prepareSearch("none").setSize(0), 0); | ||
| // wait for all threads to complete so that we get deterministic results |
There was a problem hiding this comment.
can we remove this unused code?
There was a problem hiding this comment.
Yes! Sorry I wanted to keep it around until I made sure the assertBusy() was working :-). I did the ci build multiple times to make sure results are deterministic.
| // Limit with maxDocs | ||
| copy = reindex().source("source").destination("dest_size_one").maxDocs(1).refresh(true); | ||
| assertThat(copy.get(), matcher().created(1)); | ||
| assertHitCount(prepareSearch("dest_size_one").setSize(0), 1); |
There was a problem hiding this comment.
Since this tests are meant to test that we do collect the metrics on the different reindex operations I wonder if we should assert that reindex and related operations work as expected, maybe we can simplify this a bit and remove these assertions? wdyt?
There was a problem hiding this comment.
The reason I kept those asserts is because I thought they were very basic asserts to check if the operation went through as expected. And I thought it would sort of make sense to verify that in conjunction with the metrics.
Do we want to remove them so the test would run faster or because it really doesn't matter if the operation succeeds or not - I guess metrics should get collected even for failed operations, right ?
There was a problem hiding this comment.
Do we want to remove them so the test would run faster or because it really doesn't matter if the operation succeeds or not - I guess metrics should get collected even for failed operations, right ?
Yes, that's my thinking, since we're just interested in testing the metric collection in these tests.
There was a problem hiding this comment.
Do we want to remove them so the test would run faster or because it really doesn't matter if the operation succeeds or not - I guess metrics should get collected even for failed operations, right ?
Yes, that's my thinking, since we're just interested in testing the metric collection in these tests.
Sure let me remove it. I can also remove the refresh(true) option in that case.
Refresh to latest
This PR adds metrics for the Reindexing plugin, to measure the end-to-end time taken by a reindex request, update-by-query request and delete-by-query request.
This PR adds metrics for the Reindexing plugin, to measure the end-to-end time taken by a reindex request.
This PR aims to resolve ES-8851