Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jun 24, 2021

This is to resolve ARROW-13054.
This will be needed for casting timezone-naive timestamps ARROW-13033 and defining starting day of the week.

@github-actions
Copy link

@rok rok force-pushed the ARROW-13054 branch 2 times, most recently from 3647567 to 65c3019 Compare June 24, 2021 22:22
@rok
Copy link
Member Author

rok commented Jun 24, 2021

Is this missing an option we will need in the near future? Should we remove something.
@jorisvandenbossche @thisisnic

@thisisnic
Copy link
Member

thisisnic commented Jun 25, 2021

Is this missing an option we will need in the near future? Should we remove something.
@jorisvandenbossche @thisisnic

Thanks for this! I've taken a look at the lubridate package in R, and in terms of options that are implemented in lubridate (specifically for functions which are used for extracting components of date-time objects as I am assuming that is the scope of this - correct me if I'm wrong though!) but not here, there is:

  • fiscal_start - this is an argument to quarter which indicates the starting month of a fiscal year - I think this would be useful to implement

  • with_year - an argument to quarter which defines whether the returned value is just the quarter or the quarter and the year - I'm not sure about this one - it feels like it's doing something that can be achieved by combining the year/quarter returned values but if it's widely used in other implementations, perhaps we should consider it?

  • label - an argument to month and wday (aka day_of_week in Arrow) which defines whether the returned value is an integer or character representation of the month - as discussed on https://issues.apache.org/jira/browse/ARROW-13133, this can be achieved via a separate strftime kernel

  • abbr - works with label and determines whether the label is abbreviated - I think this belongs to the strftime kernel

  • locale - locale to use for month/week day names if label is set to TRUE- I think this belongs to the strftime kernel

So, yeah, I'm thinking fiscal_start is the only other accessor option that would definitely be good to add, but it'd be good to get input from @nealrichardson on this.

@rok
Copy link
Member Author

rok commented Jun 25, 2021

Hey @thisisnic! Thanks for the input :).

  • fiscal_start - this is an argument to quarter which indicates the starting month of a fiscal year - I think this would be useful to implement

Is this just an arbitrary month user sets or is there a standard?

  • with_year - an argument to quarter which defines whether the returned value is just the quarter or the quarter and the year - I don't think we need this as it makes the outputs ambiguous and can be achieved at the R layer
  • label - an argument to month and wday (aka day_of_week in Arrow) which defines whether the returned value is an integer or character representation of the month - as discussed on https://issues.apache.org/jira/browse/ARROW-13133, this can be achieved via a separate strftime kernel
  • abbr - works with label and determines whether the label is abbreviated - I think this belongs to the strftime kernel
  • locale - locale to use for month/week day names if label is set to TRUE- I think this belongs to the strftime kernel

I was wondering about locale as well. Would it be a good idea to be able to set an arbitrary locale in some cases? Pandas appears not to offer this option. It would be useful for strftime but could cause misinterpretations for say day_of_week.

Added strftime jira.

@thisisnic
Copy link
Member

  • fiscal_start - this is an argument to quarter which indicates the starting month of a fiscal year - I think this would be useful to implement

Is this just an arbitrary month user sets or is there a standard?

The default value in lubridate is 1 (the calendar year quarter), and I think it would make sense for that to be the default unless otherwise specified.

I was wondering about locale as well. Would it be a good idea to be able to set an arbitrary locale in some cases? Pandas appears not to offer this option. It would be useful for strftime but could cause misinterpretations for say day_of_week.

Again, I'd like input from others on this (I am not experienced in this and may be missing something!), but my initial thoughts are that if we do choose to extract the day of week and month as integers only (and not implement the label argument), then all the locale stuff can be handled by strftime in terms of converting it from an integer to a text representation, so the locale is not relevant to TemporalOptions specifically within the context of extracting components from datetime objects.

Added strftime jira.

Thanks!

@rok
Copy link
Member Author

rok commented Jun 25, 2021

Again, I'd like input from others on this (I am not experienced in this and may be missing something!), but my initial thoughts are that if we do choose to extract the day of week and month as integers only (and not implement the label argument), then all the locale stuff can be handled by strftime in terms of converting it from an integer to a text representation, so the locale is not relevant to TemporalOptions specifically within the context of extracting components from datetime objects.

Well my thinking is to enable this day_of_week(timestamp_monday, "en_US") == 2 and day_of_week(timestamp_monday, "en_GB") == 1. I don't know how useful this is but it looks like matlab is doing it in this way.

@jorisvandenbossche
Copy link
Member

I was wondering about locale as well. Would it be a good idea to be able to set an arbitrary locale in some cases? Pandas appears not to offer this option. It would be useful for strftime but could cause misinterpretations for say day_of_week.

Pandas actually has a different method (the one you link is an attribute that has been there for a long time, so but pandas added an additional method so it could have arguments) that takes a locale arguments: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.day_name.html

Personally, I would (at least for now) leave all locale-specific handling to the bindings / downstream applications (except for strftime). As Nic says, if we only return integers (and not labels) for now, we shouldn't need locale handling (I wouldn't let the number depend on the locale, that seems very easy to miss).


About the actual options in this PR: that seems like a good start, but I don't think all those options would be needed in a single Options struct? For example, the ambiguous/non-existent handling, those can be put in a specific options struct for the kernel that needs it?

@rok
Copy link
Member Author

rok commented Jun 25, 2021

Pandas actually has a different method (the one you link is an attribute that has been there for a long time, so but pandas added an additional method so it could have arguments) that takes a locale arguments: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.day_name.html

Oh, nice to know!

Personally, I would (at least for now) leave all locale-specific handling to the bindings / downstream applications (except for strftime). As Nic says, if we only return integers (and not labels) for now, we shouldn't need locale handling (I wouldn't let the number depend on the locale, that seems very easy to miss).

Yeah, leaving locale out of this does seem like the sane option for now.

About the actual options in this PR: that seems like a good start, but I don't think all those options would be needed in a single Options struct? For example, the ambiguous/non-existent handling, those can be put in a specific options struct for the kernel that needs it?

I'm ok with single Options struct or multiple.
If we go for multiple we'd probably split this into TemporalStrftimeOptions (ARROW-13174), TemporalComponentExtractionOptions, TemporalLocalizationOptions (ARROW-13033). Am I missing something?

@thisisnic
Copy link
Member

thisisnic commented Jun 25, 2021

About the actual options in this PR: that seems like a good start, but I don't think all those options would be needed in a single Options struct? For example, the ambiguous/non-existent handling, those can be put in a specific options struct for the kernel that needs it?

I'm ok with single Options struct or multiple.
If we go for multiple we'd probably split this into TemporalStrftimeOptions (ARROW-13174), TemporalComponentExtractionOptions, TemporalLocalizationOptions (ARROW-13033). Am I missing something?

Those make sense to me. I think there may be other Options structs we might need; for example, ones for kernels that do maths with dates. However, we don't need to know all of these in advance, right?

@rok
Copy link
Member Author

rok commented Jun 25, 2021

However, we don't need to know all of these in advance, right?

Indeed. Even the ones I listed above should be implemented just-in-time with their respective kernels IMO. So I'll only implement TemporalComponentExtractionOptions here.
Meaning we only get index_of_monday parameter from this PR.

@rok rok force-pushed the ARROW-13054 branch 3 times, most recently from e0f2ab7 to 3fb529f Compare June 28, 2021 15:38
@rok rok changed the title ARROW-13054: [C++] Add TemporalOptions ARROW-13054: [C++] Add TemporalComponentExtractionOptions Jun 28, 2021
@jorisvandenbossche
Copy link
Member

@rok you mention this PR now only has index_of_monday, but in the diff I currently see start_index, which is something else?
Also the explanation of it, "Index of the first day of the week", is a bit ambiguous (what is the first day of the week?)

@rok
Copy link
Member Author

rok commented Jun 28, 2021

@rok you mention this PR now only has index_of_monday, but in the diff I currently see start_index, which is something else?
Also the explanation of it, "Index of the first day of the week", is a bit ambiguous (what is the first day of the week?)

@jorisvandenbossche well the beginning of the week could be e.g.: Monday=0, Monday=1, Sunday=0 or Sunday=1. Meaning we really need two parameters:

  • which day (Sunday/Monday) starts the week - this is normally (eventually) given by locale but in the scope of this PR it's Monday.
  • what index does the week start at - this would here be set by start_index

Alternatively we can have index_of_monday and calculate kinda like (day_of_week(timestamp) + index_of_monday) % 7.

How about "Index of the first day of the week" to "Offset of day of week"?

@thisisnic
Copy link
Member

thisisnic commented Jun 28, 2021

@rok you mention this PR now only has index_of_monday, but in the diff I currently see start_index, which is something else?
Also the explanation of it, "Index of the first day of the week", is a bit ambiguous (what is the first day of the week?)

@jorisvandenbossche well the beginning of the week could be e.g.: Monday=0, Monday=1, Sunday=0 or Sunday=1. Meaning we really need two parameters:

* which day (Sunday/Monday) starts the week - this is normally (eventually) given by locale but in the scope of this PR it's Monday.

* what index does the week start at - this would here be set by `start_index`

Alternatively we can have index_of_monday and calculate kinda like (day_of_week(timestamp) + index_of_monday) % 7.

How about "Index of the first day of the week" to "Offset of day of week"?

I'm not sure if this might be a little confusing combining them. Could we perhaps define ourselves which integer maps to which day, and then just have a single parameter, start_index, similarly to how week_start works in R's lubridate package (https://lubridate.tidyverse.org/reference/day.html)?

@rok
Copy link
Member Author

rok commented Jun 28, 2021

@thisisnic I like week_start. So then week starts on Monday and we have only have week_start as a parameter.
The Sunday vs Monday week start can be revisited later.

@thisisnic
Copy link
Member

@rok Sounds good to me. Let me know if you want me to take a look at the code that will need to be added to fix the R builds which are now failing because you've implemented those options.

@jorisvandenbossche
Copy link
Member

@rok it's still unclear to me what the week_start option in the PR now means / does (eg how should the "Index of the first day" be expressed. A number relative to what?). Please try to make the docstring more explicit.

There are two aspects that could be controlled:

  • On which day do we start counting (eg typically Sunday vs Monday) -> "start day"
  • At which number do we start counting (eg typically 0 vs 1) -> "start index"

ISO starts counting at 1 on Monday, C++ starts counting at 0 on Sunday (but date.h has a c_encoding() and iso_encoding() to get the other), lubridate starts counting at 1 on Sunday, Python starts counting at 0 on Monday, (to cover all the variations .. ;-))

lubridate's week_start actually covers the start day (first point, on which day do we start counting, for the rest it always uses numbers 1-7)). While I think what you implemented here covers the start index (the second point, at which number to start counting)?

Taking a step back: for the "start index", would users ever want something else as 0 or 1 ? (I don't think that counting from 2 to 8 would ever make sense). In addition, going from 0-indexed to 1-indexed is a simple operation (addition +1). On the other hand, the "start day" is more difficult to change afterwards (since you need to wrap around, not a simple +1, but eg 7 needs to become 0). So if we want to add some option, "start day" seems the more useful one to add?

@rok rok force-pushed the ARROW-13054 branch 5 times, most recently from e102fb0 to 0e0a5a1 Compare June 30, 2021 10:25
@rok
Copy link
Member Author

rok commented Jun 30, 2021

Thanks for the input @thisisnic & @jorisvandenbossche!

I've refactored this to use week_start and one_based_numbering parameters to control the starting day and Monday=0 vs Monday=1. I've also updated documentation.
Please check if this now makes sense.

@rok rok force-pushed the ARROW-13054 branch 3 times, most recently from a359103 to 73c0abb Compare July 1, 2021 17:39
@rok
Copy link
Member Author

rok commented Jul 2, 2021

@pitrou ping :)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok to me as far as C++ changes are concerned. There are a couple nits below.

I'll let @jorisvandenbossche validate the rest and then merge when desirable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you usually pass options as const&.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This member doesn't seem used anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! Removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could be const&.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@rok
Copy link
Member Author

rok commented Jul 5, 2021

Thanks for the review @pitrou! I've implemented the suggestions.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just 3 tiny last comments

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@rok
Copy link
Member Author

rok commented Jul 6, 2021

@jorisvandenbossche Thanks for the review! I've pushed the suggested changes.

@jorisvandenbossche
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants