HLRC: ML Post Data by benwtrent · Pull Request #33443 · elastic/elasticsearch

benwtrent · 2018-09-05T20:45:33Z

Adds the ability to Post Data to an ML job in the HLRC.

This relates to (#29827)

elasticmachine · 2018-09-05T20:45:34Z

Pinging @elastic/es-core-infra

elasticmachine · 2018-09-05T20:45:35Z

Pinging @elastic/ml-core

droberts195

This is a great start. I left a few notes and ideas. I'll also ask the rest of the team to think about some of these design decisions.

droberts195 · 2018-09-06T09:09:51Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+    /**
+     * Specifies the end of the bucket resetting range
+     *
+     * @param resetEnd String representation of a timestamp; may be an epoch seconds, epoch millis or an ISO string


Might be best to say ISO 8601 instead of just ISO.

Also, this can be set to the string "now".

droberts195 · 2018-09-06T09:22:01Z

client/rest-high-level/src/test/java/org/elasticsearch/client/MachineLearningIT.java

+        for(int i = 0; i < 10; i++) {
+            Map<String, Object> hashMap = new HashMap<>();
+            hashMap.put("total", randomInt(1000));
+            postDataRequest.addDoc(hashMap);


It would be better if these docs contained the job's time field, with the loop generating some ascending values for it. Then the server could validate that they were received in ascending order.

To make this change an overload of buildJob() will have to be added that sets the time field and time format to specific values rather than random values, so that the loop knows what to add.

droberts195 · 2018-09-06T09:26:56Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+            }
+            for (Map<String, Object> objectMap : objectMaps) {
+                builder.map(objectMap);
+            }


There's a problem here because if data has been added as a mixture of bytes references and maps then all the maps will be sent after all the bytes references, and that could mean the data is not sent in ascending time order.

Two possible solutions I can think of are:

Document that all data must be supplied in the same format - either bytes references or maps - and enforce this in the addDoc() methods.

Convert the maps to bytes references in addDoc() so that there are only bytes references being stored.

This also shows that we should say somewhere in the Javadocs that docs will be processed by the job in the order they're added to the request and that therefore they should be added to the request in ascending time order.

I was thinking of doing option 2 and simply calling the bytesReferences overload of addDoc from the object map one.

The downside of serializing down to simply use the BytesReference overload is performance. Though, I may be preoptimizing here.

droberts195 · 2018-09-06T09:38:14Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+     * @param bytesReference document to add to bulk request, format must match the set XContentType
+     */
+    public void addDoc(BytesReference bytesReference) {
+        this.bytesReferences.add(Objects.requireNonNull(bytesReference, "bytesReferences must not be null"));


Should we throw an exception if content != null? Or otherwise return early to ignore as the Javadoc says.

Good thought, Since they set the the whole bulk content earlier, there is no need to continue to collect the individual docs.

droberts195 · 2018-09-06T09:38:21Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+     * @param objectMap document object to add to bulk request
+     */
+    public void addDoc(Map<String, Object> objectMap) {
+        this.objectMaps.add(Objects.requireNonNull(objectMap, "objectMap must not be null"));


Should we throw an exception if content != null? Or otherwise return early to ignore as the Javadoc says.

droberts195 · 2018-09-06T09:46:59Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+    }
+
+    /**
+     * Set the total content to post.


Maybe add again that this takes precedence over any individual docs already added.

It could also .clear() the lists of individual docs.

An alternative to consider would be to throw an exception if individual docs have already been added. It seems like client code is badly designed if it adds individual docs then wipes them out by overriding with externally formatted content.

dimitris-athanasiou · 2018-09-06T11:19:50Z

I've had a thought about this. Here is what I propose:

PostDataRequest should have the following constructors:

PostDataRequest(BytesReference, XContentType): this accepts the full load and is equivalent to what we currently have (checks BWC)
PostDataRequest(byte[], XContentType): this accepts the full load as a byte[]. This is nice to add as client users should not have to learn elasticsearch's BytesReference class set.
PostDataRequest(JsonBuilder): This specifically allows using a new JsonBuilder which allows doc-by-doc creation in a user-friendly but error-free way.

All these constructors end up building the final BytesReference that we'll end up sending over.

Then the JsonBuilder - an inner static class - has the following methods:

public JsonBuilder addDoc(String): this allows a String which is convenient for users who have their JSON stored as a String directly
public JsonBuilder addDoc(byte[]): for it is possibly a common format users will be able to convert their docs into
public JsonBuilder addDoc(Map<String, Object>): for supporting maps
private BytesReference build(): which combines all docs and returns a single BytesReference. This will only be called by the constructor of the PostDataRequest.

The above keeps means the users cannot possibly create an invalid PostDataRequest, keeps BWC and improves usability for JSON.

What do you think?

droberts195

The new way of structuring the request object is much better. I just left a few more minor comments.

droberts195 · 2018-09-06T16:50:30Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+    public PostDataRequest(String jobId, XContentType xContentType, BytesReference content) {
+        this.jobId = Objects.requireNonNull(jobId, "job_id must not be null");
+        this.xContentType = Objects.requireNonNull(xContentType, "content_type must not be null");
+        this.content = content;


Should we also requireNonNull for content?

droberts195 · 2018-09-06T16:51:51Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+        this.xContentType = Objects.requireNonNull(xContentType, "content_type must not be null");
+        ByteBuffer buffer = ByteBuffer.wrap(content);
+        ByteBuffer[] buffers = new ByteBuffer[]{ buffer };
+        this.content = BytesReference.fromByteBuffers(buffers);


Can you avoid the ByteBuffer[] by using this.content = new ByteArray(content)? IIRC ByteArray extends BytesReference.

droberts195 · 2018-09-06T16:54:45Z

client/rest-high-level/src/main/java/org/elasticsearch/client/ml/PostDataRequest.java

+
+    @Override
+    public int hashCode() {
+        return Objects.hash(jobId, resetStart, resetEnd, xContentType);


I think it would be good to add a comment that content is deliberately left out as it is on the server side too. (I'm not convinced that was a good decision as it means two radically different posts can be the equal, but we are where we are.) But at least by having a comment it avoids someone adding it in in one place but not the other in the future.

Same for the comparison in equals() below.

droberts195

LGTM

* HLRC: ML Post data

* master: (30 commits) Include fallback settings when checking dependencies (elastic#33522) [DOCS] Fixing formatting issues in breaking changes CRUD: Disable wait for refresh tests with delete Test: Fix test name (elastic#33510) HLRC: split ingest request converters (elastic#33435) Logging: Configure the node name when we have it (elastic#32983) HLRC: split xpack request converters (elastic#33444) HLRC: split watcher request converters (elastic#33442) HLRC: add enable and disable user API support (elastic#33481) [DOCS] Fixes formatting error TEST: Ensure merge triggered in _source retention test (elastic#33487) [ML] Add a file structure determination endpoint (elastic#33471) HLRC: ML Forecast Job (elastic#33506) HLRC: split migration request converters (elastic#33436) HLRC: split snapshot request converters (elastic#33439) Make Watcher validation message copy/pasteable Removes redundant test method in SQL tests (elastic#33498) HLRC: ML Post Data (elastic#33443) Pass Directory instead of DirectoryService to Store (elastic#33466) Collapse package structure for metrics aggs (elastic#33463) ...

benwtrent added 6 commits September 5, 2018 15:19

HLRC: ML Post data

a8283b3

Merge branch 'master' into feature/hlrc-ml-post-data

e42a3a2

fixing comment

5b916d6

minor syntax changes adding headers

43fc933

Making builder creation a try-with-resource

9b474f2

minor formatting changes

61b620e

benwtrent added >enhancement review v7.0.0 :ml Machine learning v6.5.0 labels Sep 5, 2018

droberts195 reviewed Sep 6, 2018

View reviewed changes

benwtrent added 2 commits September 6, 2018 07:57

Merge branch 'master' into feature/hlrc-ml-post-data

b26474f

Updating class for incrimental bulk creation

2e667a6

elasticmachine mentioned this pull request Sep 6, 2018

Java high-level REST client completeness for xpack APIs #29827

Closed

98 tasks

droberts195 reviewed Sep 6, 2018

View reviewed changes

benwtrent added 2 commits September 6, 2018 12:55

Cleaning up constructors adding note about equals

a8b5290

Merge branch 'master' into feature/hlrc-ml-post-data

f93eb47

droberts195 approved these changes Sep 7, 2018

View reviewed changes

benwtrent merged commit 9230a48 into elastic:master Sep 7, 2018

benwtrent deleted the feature/hlrc-ml-post-data branch September 7, 2018 12:04

benwtrent added a commit that referenced this pull request Sep 7, 2018

HLRC: ML Post Data (#33443)

a203e47

* HLRC: ML Post data

benwtrent removed the :ml Machine learning label Oct 25, 2018

jimczi added the v7.0.0-beta1 label Feb 7, 2019

jimczi removed the v7.0.0 label Feb 7, 2019

Conversation

benwtrent commented Sep 5, 2018

Uh oh!

elasticmachine commented Sep 5, 2018

Uh oh!

elasticmachine commented Sep 5, 2018

Uh oh!

droberts195 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimitris-athanasiou commented Sep 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

droberts195 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

droberts195 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dimitris-athanasiou commented Sep 6, 2018 •

edited

Loading