
pybind/snap_schedule: add initial interface #29489

Merged
vshankar merged 26 commits into ceph:master from jan--f:cephfs-snap-schedule
Aug 28, 2020

Conversation

@jan--f
Contributor

@jan--f jan--f commented Aug 5, 2019

This is an RFC PR for the proposed interface for the cephfs snapshot schedule module.

This is based on a CDM discussion from June 5th (https://youtu.be/EG-3rcvqvGI?list=PLrBUGiINAakNbcSOvOM0IJHqqv5dzusZ6&t=1812); more details here: https://pad.ceph.com/p/cephfs-snap-mirroring.

A snap_schedule is simply a multiplier and a time spec (m[inute], h[our], ...) to snapshot a path every so often. The start of this period is the time at which set was called. A (not yet added) start argument would allow more control over the timing. IMHO this is preferable to adding a start time to the snap_schedule spec (something like 24h@<start_time>).
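As a sketch, such a spec could be parsed like this (the grammar and the set of unit letters here are illustrative assumptions based on the examples above, not the final interface):

```python
from datetime import timedelta

# Assumed unit letters, following the m[inute], h[our], ... examples above.
UNITS = {'m': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}

def parse_snap_schedule(spec: str) -> timedelta:
    """Parse a schedule like '24h' (multiplier + time spec) into the
    period between two scheduled snapshots."""
    mult, unit = spec[:-1], spec[-1]
    if unit not in UNITS or not mult.isdigit():
        raise ValueError(f'invalid schedule spec: {spec!r}')
    return timedelta(**{UNITS[unit]: int(mult)})
```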

A prune_schedule is optional. By default a prune is attempted after every scheduled snapshot. If specified, it has the same specification as snap_schedule.

One thing not included yet is allowing multiple snap (or prune) schedules per path. This would allow complex snapshot schedules without complicating the snap_schedule specification. I think this can easily be added to the interface at a later stage (index the snapshot schedules per path and add an optional index arg to the get, set and rm calls).

@jan--f jan--f added cephfs Ceph File System pybind DNM labels Aug 5, 2019
@batrick
Member

batrick commented Aug 5, 2019

@jan--f
Contributor Author

jan--f commented Aug 5, 2019

cc @oliveiradan

@jan--f
Contributor Author

jan--f commented Aug 7, 2019

Should we allow setting snapshot schedules for not (yet) existing directories? This might be handy for use cases where provisioning scripts want to create snapshot schedules but not the full directory tree.
Going hand in hand with this, we then wouldn't necessarily have to clean up snapshot schedules when a directory is deleted.

@l-mb
Member

l-mb commented Aug 7, 2019

Should we be able to set a jitter period to avoid multiple snaps triggering at exactly the same time?
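For illustration, such a jitter could be as simple as shifting each scheduled snapshot by a bounded random offset (a sketch only, not part of the proposed interface):

```python
import random
from datetime import datetime, timedelta

def next_snapshot_time(scheduled: datetime, max_jitter_s: float) -> datetime:
    """Shift a scheduled snapshot time by a random offset so that many
    paths scheduled for the same instant don't all snapshot at once."""
    return scheduled + timedelta(seconds=random.uniform(0, max_jitter_s))
```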

@liewegas
Member

liewegas commented Aug 8, 2019

I don't really understand why pruning has a schedule at all. It seems like each snap schedule item would have some retention property (how many of these), and that determines what gets pruned. Whether a prune happens 2 seconds or 2 hours after the count associated with a schedule is changed isn't something a user should ever have to think about; normally the prune would happen after a new snap is taken anyway.

Note that having a snap schedule have a single period means that if you want (say) daily and weekly snapshots, then you'll have 2 snapshots every 7th day that sort of overlap. This is why the original proposal was trying to specify a more complicated retention policy that would thin out the snapshot history over time as part of the same schedule. I still think this would be a nicer experience, as long as we can come up with a way of describing that policy in an understandable way.

@jan--f
Contributor Author

jan--f commented Aug 10, 2019

> I don't really understand why pruning has a schedule at all. It seems like each snap schedule item would have some retention property (how many of these), and that determines what gets pruned. Whether a prune happens 2 seconds or 2 hours after the count associated with a schedule is changed isn't something a user should ever have to think about; normally the prune would happen after a new snap is taken anyway.

Right. That would of course be the default (the prune_schedule is optional). I thought that might be something users would want to configure, like "only prune at night" or so. But I'm not feeling strongly about this... probably best we leave it out for now.

@jan--f
Contributor Author

jan--f commented Aug 12, 2019

> Note that having a snap schedule have a single period means that if you want (say) daily and weekly snapshots, then you'll have 2 snapshots every 7th day that sort of overlap. This is why the original proposal was trying to specify a more complicated retention policy that would thin out the snapshot history over time as part of the same schedule. I still think this would be a nicer experience, as long as we can come up with a way of describing that policy in an understandable way.

We could combine the arguments for snap-schedule set and prune-schedule set, in other words tie the prune schedule spec directly to a snapshot schedule but keep the explicit --keep-* arguments in order to keep things understandable.

I can alter the interface accordingly and there would be some UI workflows to iron out (like how to set/change a pruning schedule without changing the snapshot schedule) but I don't anticipate any show stoppers.

How does that sound?

@liewegas
Member

Something about the --keep-* args being explicitly tied to fixed periods (day, week, month) bugs me, but I'm not sure many/any users would need something different.

@jan--f
Contributor Author

jan--f commented Aug 16, 2019

> Something about the --keep-* args being explicitly tied to fixed periods (day, week, month) bugs me, but I'm not sure many/any users would need something different.

Consider it the trade-off for avoiding a complex specification string. IMHO this is pretty accessible, if a bit verbose. Glancing over the options should be enough to figure out what's going on.

@jan--f
Contributor Author

jan--f commented Aug 21, 2019

I updated the draft accordingly. One thing that is missing in this variant is how to alter pruning schedules explicitly. Not sure whether a separate subcommand or re-using set is better, though that should be easy enough to hash out later.

@oliveiradan
Contributor

@jan--f, I was actually wondering if something like this would be interesting: oliveiradan@6ef977c, either for a .json file, or even later on using RADOS. The commented JSON at the top gives an idea.

@jan--f jan--f force-pushed the cephfs-snap-schedule branch from 1a3b45c to 2271273 Compare September 18, 2019 12:38
@jan--f jan--f force-pushed the cephfs-snap-schedule branch from 2271273 to ea056d3 Compare October 24, 2019 11:25
@jan--f
Contributor Author

jan--f commented Oct 24, 2019

I added the interface implementation. I changed the set interface somewhat to accommodate the lack of kwarg passing to mgr modules. The retention policy is for now just a string. I think passing something like keep-weekly=10,keep-monthly=4 would be good to have (the module would parse that accordingly), but we could also come up with something more succinct.
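A retention string in that form would be straightforward to parse; a minimal sketch (the keep-* key names follow the example above and are not a final format):

```python
def parse_retention(spec: str) -> dict:
    """Parse a retention spec like 'keep-weekly=10,keep-monthly=4'
    into {'weekly': 10, 'monthly': 4}."""
    ret = {}
    for field in spec.split(','):
        key, _, value = field.partition('=')
        if not key.startswith('keep-') or not value.isdigit():
            raise ValueError(f'invalid retention field: {field!r}')
        ret[key[len('keep-'):]] = int(value)
    return ret
```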

Some open questions:

  • Should we allow setting schedules on not-yet-existing paths?
  • Should we allow setting a schedule without a retention policy?

and TODOs:

  • Remove two instances of recursive calls in the ScheduleIndex
  • Add volume/subvolume resolution, probably just translate that to a full path?
  • Add more tests
  • Add logging
  • Improve output formatting

@liewegas
Member

liewegas commented Oct 24, 2019 via email

@jan--f
Contributor Author

jan--f commented Oct 24, 2019

cc @batrick @vshankar

@jan--f
Contributor Author

jan--f commented Oct 24, 2019

> Would it simplify things to say no to both?

> • Should we allow setting schedules on not-yet-existing paths?

Not allowing this for now is probably fine. I could see it coming back as a feature request at some point, though probably in a more complex form, e.g. the ability to set a schedule for a path glob covering all dirs under /home/. But we can certainly cross that bridge when we come to it.

> • Should we allow setting a schedule without a retention policy?

Allowing no retention policy is probably harder, since I assume we don't really want a large number of snapshots accumulating. So we'd need some kind of health check or similar. But this is probably easier solved by setting a default policy (say keep the last 15 snapshots).
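Such a default policy would amount to something like this sketch (the keep-the-last-15 default mirrors the suggestion above; names are illustrative):

```python
def prune_candidates(snapshots: list, keep_last: int = 15) -> list:
    """Given snapshot names ordered oldest first, return the ones a
    default keep-the-last-N retention policy would prune."""
    return snapshots[:-keep_last] if len(snapshots) > keep_last else []
```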

Jan Fajerski and others added 21 commits August 27, 2020 15:55
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
…ules per path

This commit adds a bunch of interesting metadata fields to the DB and
refactors the Schedule class so that it concentrates all the DB queries
in one place. It also enables setting multiple schedules per path as long
as the (repeat interval, start time) pairs are unique.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
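A uniqueness constraint like the one described could look roughly like this in sqlite (table and column names are illustrative, not the module's actual schema):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('''CREATE TABLE schedules (
    path TEXT NOT NULL,
    repeat_interval TEXT NOT NULL,  -- e.g. '1d'
    start TEXT NOT NULL,            -- ISO timestamp
    UNIQUE (path, repeat_interval, start)
)''')
con.execute("INSERT INTO schedules VALUES ('/a', '1d', '2020-01-01T00:00:00')")
# a second schedule on the same path with a different interval is fine
con.execute("INSERT INTO schedules VALUES ('/a', '1h', '2020-01-01T00:00:00')")
try:
    # but the same (path, repeat interval, start time) triple is rejected
    con.execute("INSERT INTO schedules VALUES ('/a', '1d', '2020-01-01T00:00:00')")
except sqlite3.IntegrityError:
    pass
```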
This should make it easy for CLI tools to consume.
It also simplifies some internals.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
This allows 1) activating schedules that have previously been
deactivated because the path did not exist and 2) users to
explicitly deactivate a schedule for debugging or maintenance purposes.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
This separates the Schedule class from the SnapSchedClient class.
Motivation is cleaner code layout and simpler testing.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
…onal

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
This adds a simple test for Schedule.store_schedule. The test
exposed a bug (a schedule couldn't be added on different paths with the
same repeat and start time), which is also fixed.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
To enable the use of a minutely repeat interval in the snap schedules,
this adds a guard that checks for the SNAP_SCHED_TEST env variable.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
fromisoformat was introduced in Python 3.7. On older versions this can
be worked around by installing
https://github.com/movermeyer/backports.datetime_fromisoformat. If a
pre-3.7 Python is found and the backport module is not installed,
snap_schedule falls back to parsing a fully specified timestamp
format. This commit adds the necessary code.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
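The fallback described above can be sketched as follows (illustrative only, not the module's actual code):

```python
from datetime import datetime

try:
    parse_timestamp = datetime.fromisoformat  # Python >= 3.7
except AttributeError:
    try:
        # backport module, if installed
        from backports.datetime_fromisoformat import MonkeyPatch
        MonkeyPatch.patch_fromisoformat()
        parse_timestamp = datetime.fromisoformat
    except ImportError:
        # last resort: require a fully specified timestamp format
        def parse_timestamp(ts: str) -> datetime:
            return datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S')
```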
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
While UPDATE <table_name> AS <alias> ... should be valid SQL syntax, it
seems only some sqlite versions support it.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
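For reference, dropping the alias and writing the table name directly works across sqlite versions; a runnable sketch (schema is illustrative):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE schedules (path TEXT, active INTEGER)")
con.execute("INSERT INTO schedules VALUES ('/a', 0)")
# 'UPDATE schedules AS s SET ...' is rejected by older sqlite versions;
# the alias-free form below works everywhere:
con.execute("UPDATE schedules SET active = 1 WHERE path = '/a'")
```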
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
…ox runs

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
@tchaikov
Contributor

tchaikov commented Aug 31, 2020

@vshankar
Contributor

> @vshankar how did you test this change? my batch now looks like https://pulpito.ceph.com/kchai-2020-08-29_13:54:19-rados-wip-kefu-testing-2020-08-29-1945-distro-basic-smithi/

With this PR: #34552, plus the tox tests that run on PR updates. Are you suspecting this part of the change to be the issue: https://github.com/ceph/ceph/pull/29489/files#diff-af4be246b95af4d67b3fc62e30a49178

@tchaikov
Contributor

tchaikov commented Aug 31, 2020

> @vshankar how did you test this change? my batch now looks like https://pulpito.ceph.com/kchai-2020-08-29_13:54:19-rados-wip-kefu-testing-2020-08-29-1945-distro-basic-smithi/
>
> With this PR: #34552, plus the tox tests that run on PR updates. Are you suspecting this part of the change to be the issue: https://github.com/ceph/ceph/pull/29489/files#diff-af4be246b95af4d67b3fc62e30a49178

@vshankar I don't think the tox-based tests have good coverage for these changes, or that they are able to find most of the issues. In future, I'd suggest running all non-trivial changes through the rados suite. See my analysis at https://tracker.ceph.com/issues/47185

@tchaikov
Contributor

tchaikov commented Aug 31, 2020

@vshankar please add a Reviewed-by line in your merge commit when merging a PR in future.

@vshankar
Contributor

> @vshankar please add a Reviewed-by line in your merge commit when merging a PR in future.

ACK.

