Skip to content

osd/PrimaryLogPG: erase gather op during op cancel#55410

Closed
NitzanMordhai wants to merge 1 commit intoceph:mainfrom
NitzanMordhai:wip-nitzan-cls-gather-ops-erase-on-cancel
Closed

osd/PrimaryLogPG: erase gather op during op cancel#55410
NitzanMordhai wants to merge 1 commit intoceph:mainfrom
NitzanMordhai:wip-nitzan-cls-gather-ops-erase-on-cancel

Conversation

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai commented Feb 1, 2024

The erase exist, but it try to delete iter and not the name of the soid

Fixes: https://tracker.ceph.com/issues/64258
Signed-off-by: Nitzan Mordechai nmordech@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@NitzanMordhai NitzanMordhai requested a review from a team as a code owner February 1, 2024 12:19
@github-actions github-actions bot added the core label Feb 1, 2024
@NitzanMordhai
Copy link
Contributor Author

jenkins test api

@athanatos
Copy link
Contributor

@NitzanMordhai What's wrong with invoking erase on the iterator? Invoking erase on the iterator should have the same effect as invoking it on the key, but much more efficiently (constant time rather than logarithmic). I don't see a mechanism that would allow this to fix the bug linked in the ticket.

@athanatos athanatos self-requested a review February 2, 2024 21:32
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment.

@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai What's wrong with invoking erase on the iterator? Invoking erase on the iterator should have the same effect as invoking it on the key, but much more efficiently (constant time rather than logarithmic). I don't see a mechanism that would allow this to fix the bug linked in the ticket.

seeing that. will rework that.

@NitzanMordhai
Copy link
Contributor Author

@athanatos it looks like the peering process didn't call start_peering_interval which is supposed to call on_change which triggers the cleanup and cancel of the gather. we also have finish methode in the gather (struct C_gather) that getting r=-125) but doesn't clear the cls_gather_ops map, that can solve the issue as well

The erase exist, but it try to delete iter and not the name of the soid

Fixes: https://tracker.ceph.com/issues/64258
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-cls-gather-ops-erase-on-cancel branch from b4948fc to a324759 Compare February 12, 2024 09:49
@github-actions github-actions bot added the tests label Feb 12, 2024
@NitzanMordhai
Copy link
Contributor Author

@athanatos Since we are not have new interval, the cancel was not invoked, i added logic to cancel the ops if we already have them in the map (just like copy ops)

@NitzanMordhai
Copy link
Contributor Author

jenkins test make check

@athanatos
Copy link
Contributor

@athanatos it looks like the peering process didn't call start_peering_interval which is supposed to call on_change which triggers the cleanup and cancel of the gather. we also have finish methode in the gather (struct C_gather) that getting r=-125) but doesn't clear the cls_gather_ops map, that can solve the issue as well

If peering really skipped start_peering_interval, that would be an incredibly huge bug. Do you have logs?

@NitzanMordhai
Copy link
Contributor Author

If peering really skipped start_peering_interval, that would be an incredibly huge bug. Do you have logs?

@athanatos i was able to recreate the error with my local build, can you check the logs on folio03? /home/nmordech/output_gather

or should i move them to somewhere else?

@athanatos
Copy link
Contributor

@NitzanMordhai See my comment https://tracker.ceph.com/issues/64258 -- there was no interval change. The entry is there because the op was retried and the previous attempt was still in progress. This feature was implemented as an experiment, but I'm not aware of any serious users. For now, I suggest:

  1. move this test into an experimental suite and don't consider it for normal testing
  2. mark cls_gather deprecated in squid. If someone steps up to finish the implementation and maintain it going foward we can un-deprecate it. Otherwise, we'll remove it in T.

@NitzanMordhai Can you take care of the above?

@athanatos athanatos closed this Feb 14, 2024
@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai Can you take care of the above?

@athanatos thanks for reviewing it, will do that

@rzarzynski
Copy link
Contributor

Just to the record:

$ grep -r cls_cxx_gather 
crimson/osd/objclass.cc:int cls_cxx_gather(cls_method_context_t hctx, const std::set<std::string> &src_objs, const std::string& pool,
Binary file crimson/osd/.objclass.cc.swp matches
objclass/objclass.h:extern int cls_cxx_gather(cls_method_context_t hctx, const std::set<std::string> &src_objs, const std::string& pool,
cls/test_remote_reads/cls_test_remote_reads.cc:    r = cls_cxx_gather(hctx, src_objects, pool, cls.c_str(), method.c_str(), *in);
osd/objclass.cc:int cls_cxx_gather(cls_method_context_t hctx, const std::set<std::string> &src_objs, const std::string& pool
$ grep -r cls_cxx_get_gathered_data ./
./crimson/osd/objclass.cc:int cls_cxx_get_gathered_data(cls_method_context_t hctx, std::map<std::string, bufferlist> *results)
Binary file ./crimson/osd/.objclass.cc.swp matches
./objclass/objclass.h:extern int cls_cxx_get_gathered_data(cls_method_context_t hctx, std::map<std::string, bufferlist> *results);
./cls/test_remote_reads/cls_test_remote_reads.cc:  int r = cls_cxx_get_gathered_data(hctx, &src_obj_buffs);
./osd/objclass.cc:int cls_cxx_get_gathered_data(cls_method_context_t hctx, std::map<std::string, bufferlist> *results)

@rzarzynski rzarzynski mentioned this pull request Mar 4, 2024
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants