Skip to content

rgw/s3select: json integration: identifying new data-source and execute a JSON flow.#49411

Merged
cbodley merged 1 commit intomainfrom
json_rgw_integration
Mar 3, 2023
Merged

rgw/s3select: json integration: identifying new data-source and execute a JSON flow.#49411
cbodley merged 1 commit intomainfrom
json_rgw_integration

Conversation

@galsalomon66
Copy link
Contributor

@galsalomon66 galsalomon66 commented Dec 13, 2022

  • integration of JSON flow

  • integration of SQL limit operator

  • scan range: it enables the user to select a range(start --> end) for SQL processing.
    thus, the user has the flexibility to process a specific range within the object.
    the significant use of the scan-range option is running parallel requests on the same object.
    the user must "know" how to merge the return results from the parallel requests (Trino is one example). (https://trino.io/docs/current/connector/hive.html).

  • Trino alignments (1) upon pushing down the Trino-engine converts the SQL-statements(different from AWS), the s3select-parser handles that. (2) Trino rejects(HIVE_CURSOR_ERROR) some of s3select responds due to its size. RGW/S3select limits the size of responses.

s3select submodule

Signed-off-by: Albin Antony aantony@redhat.com
Signed-off-by: galsalomon66 gal.salomon@gmail.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the rgw label Dec 13, 2022
@albin-antony albin-antony mentioned this pull request Dec 13, 2022
14 tasks
@cbodley cbodley changed the title json integration: identifying new data-source and execute a JSON flow. rgw/s3select: json integration: identifying new data-source and execute a JSON flow. Dec 13, 2022
s3select_syntax.get_error_description().c_str(),
s3select_resource_id);
ldpp_dout(this, 10) << "s3-select query: wrong json format; {" << s3select_syntax.get_error_description() << "}" << dendl;
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use posix error codes like -EINVAL

Comment on lines +749 to +810
if (s->obj_size == 0) {
status = run_s3select_on_json(m_sql_query.c_str(), nullptr, 0);
} else {
auto bl_len = bl.get_num_buffers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status is unchecked here, then we immediately go on to call status = run_s3select_on_json(...) below

do we need this special case for obj_size == 0 at all? it should be fine to loop over an empty bl.buffers()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for the obj_size == 0 i recall we needed a specific flow for CSV upon zero size object.
it will be re-check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status is unchecked here, then we immediately go on to call status = run_s3select_on_json(...) below

do we need this special case for obj_size == 0 at all? it should be fine to loop over an empty bl.buffers()

upon empty-object(zero size), the bl = buffer-list is empty, thus buffer-list loop does not execute, no call to s3select-function, which results with a wrong response.

if (s->obj_size == 0) { ... } returns the appropriate empty response.

m_aws_response_handler.update_processed_size(it.length());
status = run_s3select_on_json(m_sql_query.c_str(), &(it)[0], it.length());
if(status<0) {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we return this error, instead of overwriting it with status = run_s3select_on_json(...) just below?

@galsalomon66
Copy link
Contributor Author

a good start with teuthology
8 of 8 had passed with success

http://pulpito.front.sepia.ceph.com/gsalomon-2022-12-13_20:26:47-rgw:verify-json_rgw_integration-distro-default-smithi/

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66 galsalomon66 force-pushed the json_rgw_integration branch 2 times, most recently from a1825e0 to d0b40ef Compare January 1, 2023 18:29
@mattbenjamin
Copy link
Contributor

I don't think commit subject we intend to merge should have "WIP:" prefixed

@galsalomon66
Copy link
Contributor Author

galsalomon66 commented Jan 1, 2023 via email

@galsalomon66
Copy link
Contributor Author

the last teuthology run had failed and pointed to valgrind issues.
on work.

@galsalomon66 galsalomon66 force-pushed the json_rgw_integration branch 2 times, most recently from 1c32c2a to 8a531b5 Compare January 19, 2023 15:37
@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

http://pulpito.front.sepia.ceph.com/gsalomon-2023-01-29_15:19:28-rgw:verify-json_rgw_integration-distro-default-smithi/
all tests had failed (16), it seems related to selinux
there is no failure related to s3select
13 s3select s3-tests had passed with success

@galsalomon66 galsalomon66 requested a review from a team as a code owner February 16, 2023 08:52
@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

@galsalomon66
Copy link
Contributor Author

http://pulpito.front.sepia.ceph.com/gsalomon-2023-02-26_12:32:11-rgw:verify-json_rgw_integration-distro-default-smithi/
22 had passed with success
the rest of the jobs had failed, it does not seem related to s3select tests(no failures to s3select)

@galsalomon66 galsalomon66 force-pushed the json_rgw_integration branch from dbd262b to 781ff4d Compare March 2, 2023 08:06
s3select submodule

integrate the limit-operator into RGW. current changes is for CSV format

integrate the limit-operator with CSV-flow, and JSON-flow. an update of s3select submodule

a fix for the input-serialization-type selection

debug functionality.

fix for error handling

adding the scan-range feature. i.e. enables the user to define the range of processing(text only)

remove the temp variable(the s3select-layer handles the sql-result setting upon JSON flow)

adding documentation. for JSON, SQL limit-operator, scan-range option.
a fix for JSON test framework in s3select module

investigate crash. replace len with it.length()

editorial changes; non-initialized variables

adding validation for offset&length. (it seems that the offset&length causes Invalid reads, in some cases)

changes related to trino integration.(1) syntax issues related to trino-engine statement conbersions(s3select parser handles that) (2).
(2). trino rejects(HIVE_CURSUR_ERROR) some of the s3select responses do to its size, the change is about control the size of the respond.

update JSON documentation

trino changes: response in paging (limitation on size), trino syntax issues.

Signed-off-by: galsalomon66 <gal.salomon@gmail.com>
@galsalomon66 galsalomon66 force-pushed the json_rgw_integration branch from 781ff4d to 9f20e75 Compare March 2, 2023 22:28
@cbodley
Copy link
Contributor

cbodley commented Mar 2, 2023

i see that all of those failed jobs were using objectstore/filestore-xfs.yaml. filestore was removed recently in #49528; maybe you had rebased your ceph branch but not your suite-branch?

since rgw/verify looks clean, i think this just needs to run against the full rgw suite? once you rebase the suite-branch, the rgw suite at --subset 1/3 is only ~44 jobs in total

@galsalomon66
Copy link
Contributor Author

https://pulpito.ceph.com/gsalomon-2023-03-03_04:37:38-rgw:verify-json_rgw_integration-distro-default-smithi/
1 dead, all the rest had passed with success.
24 jobs were scheduled (no limit)

@cbodley
Copy link
Contributor

cbodley commented Mar 3, 2023

https://pulpito.ceph.com/gsalomon-2023-03-03_04:37:38-rgw:verify-json_rgw_integration-distro-default-smithi/ 1 dead, all the rest had passed with success. 24 jobs were scheduled (no limit)

this is still just the rgw:verify subsuite. can you please run the rgw suite at --subset 1/3?

@cbodley
Copy link
Contributor

cbodley commented Mar 3, 2023

@cbodley cbodley merged commit d08115f into main Mar 3, 2023
@cbodley cbodley deleted the json_rgw_integration branch March 3, 2023 19:35
@cbodley
Copy link
Contributor

cbodley commented Mar 3, 2023

@galsalomon66 can you please make sure that a tag gets added to the s3select repo corresponding to the commit our submodule now points to?

@galsalomon66
Copy link
Contributor Author

anthonyeleven added a commit to anthonyeleven/ceph that referenced this pull request Mar 5, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
anthonyeleven added a commit to anthonyeleven/ceph that referenced this pull request Mar 5, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
anthonyeleven added a commit to anthonyeleven/ceph that referenced this pull request Mar 6, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
zdover23 added a commit that referenced this pull request Mar 7, 2023
doc/radosgw: Redd up s3select.rst as followup to #49411

Reviewed-by: Zac Dover <zac.dover@proton.me>
zdover23 pushed a commit to zdover23/ceph that referenced this pull request Mar 7, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit 78b42ee)
zdover23 pushed a commit to zdover23/ceph that referenced this pull request Mar 7, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit 78b42ee)
zdover23 pushed a commit to zdover23/ceph that referenced this pull request Mar 7, 2023
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit 78b42ee)
zdover23 added a commit that referenced this pull request Mar 11, 2023
…0384-to-reef

reef: doc/radosgw: Redd up s3select.rst as followup to #49411

Reviewed-by: Cole Mitchell <cole.mitchell.ceph@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants