Skip to content

improved ComponentMeta system and Timezone conversion#249

Merged
N-Coder merged 30 commits intomainfrom
timezones
Aug 18, 2021
Merged

improved ComponentMeta system and Timezone conversion#249
N-Coder merged 30 commits intomainfrom
timezones

Conversation

@N-Coder
Copy link
Copy Markdown
Member

@N-Coder N-Coder commented Jun 13, 2020

This PR contains two big things:

  1. a revamped ComponentMeta / Converter system that uses less magic than before, but should be easier to extend and set-up reliably (see f333e1a, initialize_converters; and the customizations of Alarm and Calendar for examples)
  2. some work towards reliable parsing, interpretation and serialization of Timezone IDs (i.e. Olson and Windows) and tzinfo objects (i.e. builtin, dateutil and pytz) in ca00173. This should fix Add VTIMEZONE / tzinfo Serialization and better TZID Parsing #129, but there's also still quite a lot of work to do, mostly handling all the special cases correctly, working on how to include the required data files and ensuring that everything does what it should.

I also started to include tests against regressions of prior github issues (see e8e13e4, where mostly Timezone-related issues are covered) and some more general polish.

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Oct 31, 2020

@aureooms asked:

Can you explain precisely what kind of layer would ics_vtimezones add between ics and these data files. What are the sources of these files? Everything comes from Olson timezone database? Why not call it olson instead of ics_vtimezones?
I can see a few python packages that seem to bundle these files (https://pypi.org/search/?q=olson). None of these are usable?

I explained my initial thoughts here. The README of the ics_vtimezones project also has some relevant info on what's packaged there.

Basically, to get really portable standards-compliant ics files, you need to include the full vTimezone (i.e. usually the recurring changes of UTC-offset when changing between daylight-saving and normal time for that region, but also any political decisions changing the offset at specific points in time) specification of any timezone used within the file. There are two common sources for this information: the IANA time zone database (also informally known as Olson database) or the time zone database shipped by windows. Please note that both databases change over time - e.g. there have been 4 updated versions of the IANA time zone database released in 2020 alone. The IANA time zone database is shipped on most unix systems at different paths and also used in many programming languages (e.g. the pytz and the new stdlib 3.9 zoneinfo and tzdata modules for python, see also the list here), but is not always available on windows systems. While there is a mapping between windows and IANA timezone names, most ecosystems ship their own portable package with the IANA database:

The Olson timezone IDs are also used by the Unicode Common Locale Data Repository (CLDR) and International Components for Unicode (ICU). For example, the CLDR Windows–Tzid table maps Microsoft Windows time zone IDs to the standard Olson names, although such a mapping cannot be perfect because the number of time zones in Windows systems is significantly lower that those in the IANA TZ database.

So you are right, most UNIX systems already have (some revision of) that database installed, there are at least two python modules that package this information and starting from python 3.9, the database will also be used by the standard library.

The issue is that just having the database files available doesn't help us, as those files are usually in a specific binary format, which you first need to parse and then convert into an ics-compatible vTimezone representation. IANA has a long list of tools that can be useful for this. Here's a list of stuff that might be useful for us:

  • A python parser implementation is only since recently widely available with the backported python 3.9 zoneinfo module.
  • pytz ships the timezone data precompiled into python objects and I couldn't find what they use for parsing and precompiling.
  • The other python ics library vobject seems to use PyICU, which is "a Python extension wrapping the ICU C++ libraries", and then probe any given tzinfo objects for transitions.
  • Apple's python calendar server has its own python converter implementation.
  • The C library libical has its own C program for parsing and generating the ics vTimezone definitions it ships, vzic.
  • The Java ical4j library does a similar thing (I also listed both in my initial thoughts posting).
  • Babel includes the CLDR timezone mapping, but unfortunately only allows lookups in the wrong direction.

The issue with zoneinfo and pytz is, similar to any other tzinfo implementation, that the underlying information is not directly available (in the case of the C implementation of zoneinfo, this information is actually not available from python at all). In case of pytz and the pure-python version of zoneinfo, we could still try to grab this internal data and hope that its representation never changes, breaking our somewhat hacky code. As an alternative, we could also ship our own parser to be sure that the interface never changes. In both cases, we could then reuse the logic to convert the data to ics vTimezones at runtime from other projects. Of course, both approaches would need a lot more code on our side than just using the ready-made vzic tool to pregenerate the data and ship it as its own package. On the other hand, this package also need regular maintenance whenever there are new releases of the IANA database. I went with the packaged approach before the python maintainers decided to go with a standard zoneinfo and tzdata module, so I might give generating this data at runtime another try and see how much additional (and how hacky) code on our side that needs.

Base automatically changed from master to main January 17, 2021 11:46
@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 7, 2021

I created an overview how different other tools handle the inclusion of vTimezone definitions when writing ics files:
https://github.com/N-Coder/vtimezone-examples

There seem to be three different broad categories

  1. don't include vTimezone definition at all:
    us currently, python-icalendar, google calendar
  2. include some specific transition dates (as available in the compiled files in /usr/share/zoneinfo/):
    python-icalendar-khal, python-vobject, outlook (with the Windows TZ names)
  3. include the actual transition rrules of the timezone (which are usually only available from the sources of the olson database, which doesn't ship with Unix distros or the zoneinfo module - see tzdb.txt vs zoneinfo.txt):
    apple-ccs-pycalendar-zonal, tzurl(-outlook), vzic, apple calendar, evolution (Ximian)

Using some minor introspection and the publicly available API, I was able to generate all the data for case 2. from pytz and zoneinfo instances (see zoneinfo.txt and convert_zoneinfo.py), which pretty much covers the years 1880 up to 2040, but gives all the transition dates explicitly, as the compiled timezone files don't contain meta information about the rules behind the transition dates. As the two versions of the tzurl files show, all those different transition dates might confuse Outlook. To get to the cleaner and shorter rrule definitions serialized in case 3, you need to have the Olson tzdb source files at hand - which is what I did to create the data for the ics_vtimezones project. As an alternative, these ics files are also served by the tzurl project via HTTP and could be downloaded on the fly, offloading the work to keep them up to date to someone else. In both cases, vzic is used to convert the tzdb rule files to ics vTimezones.

To summarize, none of the three different behaviours seem perfect. Maybe we should select a sensible default for ics.py and provide opt-in configurations for the alternatives:

  1. don't include vTimezone definitions
  2. include vTimezone defintions obtained by introspection from the pytz / zoneinfo tzinfo instances, i.e. the compiled timezone files only containing specific instances
  3. include proper rrule-based vTimezone definitions...
    a. provided by the ics_vtimezones package, if installed
    b. download tzurl vTimezone files on the fly

Note this only applies to writing ics files, interpreting and also serializing vTimezone definitions when they were already included in a parsed ics file is pretty straight-forward.

@waynew
Copy link
Copy Markdown

waynew commented May 7, 2021

I think a graceful upgrade is a good approach - IIUC, the vtimezone needs to specify RRULE and whatnot?

Though, if you can specify e.g. 'America/Chicago' then that should be the default.

If not, dumping as UTC with a warning about losing daylight info would be appropriate, and gracefully upgrading to use a library that can add the vtimezone info

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 22, 2021

Codecov Report

❗ No coverage uploaded for pull request base (main@5ea3e44). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #249   +/-   ##
=======================================
  Coverage        ?   75.80%           
=======================================
  Files           ?       30           
  Lines           ?     2703           
  Branches        ?        0           
=======================================
  Hits            ?     2049           
  Misses          ?      654           
  Partials        ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ea3e44...1e3a099. Read the comment docs.

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented May 23, 2021

Hmm, that was closer to a usable state than I thought. I kept the code using ics_vtimezones to provide the vTimezone definitions in, as I figured that making it optional and allowing to switch these providers would make things a lot more complicated. So we should be good for now with depending on ics_vtimezones, but we can think about allowing alternatives later. Also, the current version on PyPI is already slightly outdated, but it should still be good for almost all use-cases. I guess the built-in zoneinfo module doesn't get updates every few months either. Still, we could reduce the maintenance overhead for ics_vtimezones and ensure that it actually stays up to date by using a weekly-scheduled GitHub CI action that checks for new data and automatically publishes a new version if there is any, see ics_vtimezones#2.

The code could still use some more testing and the component system would largely benefit from a self-contained documentation, but the core of this PR should be good to go. So I'd merge this soonish so that other work can easily build on this, meaning that we can also switch to working on the remaining points in #245 (i.e. mostly testing and documentation) more step-by-step in more fine grained PRs. @C4ptainCrunch due to the size of this PR I'm against again squashing into one uber-commit, I'd prefer fast-forwarding master to contain all the commits separately, but unfortunately I can't do that due to missing permissions. Could you make me co-owner of the organisation/repo so that I can merge this PR properly and also continue working on things like your PR #274?

@N-Coder N-Coder marked this pull request as ready for review May 23, 2021 16:12
@make-github-pseudonymous-again
Copy link
Copy Markdown
Contributor

I agree with not squashing.

@plammens
Copy link
Copy Markdown

plammens commented Jul 28, 2021

When serializing a timezone, the following is raising an AttributeError since the tzinfo object doesn't have a _name attribute:

return Timezone_from_offset(tzinfo._name, tzinfo._offset) # type: ignore[attr-defined]

Minimal example:

import datetime
import ics

# make a datetime with an "anonymous" timezone
dt = datetime.datetime.fromisoformat("2017-10-01T12:00:00+09:00")
print(dt.tzinfo)  # datetime.timezone(datetime.timedelta(seconds=32400))
print(hasattr(dt.tzinfo, "_name"))  # False
print(hasattr(dt.tzinfo, "_offset"))  # False

event = ics.Event(begin=dt, duration=datetime.timedelta(minutes=30))
event.serialize()  # AttributeError

I'm using Python 3.9.

I would suggest staying away from using datetime's internals and use the public methods available for these purposes, namely datetime.datetime.tzname() and datetime.datetime.utcoffset() . Since for some (probably good) reason the timezone info methods are on the datetime.datetime objects and not the datetime.timezone objects themselves, the ics.Timezone constructor would have to take a datetime.datetime, e.g.:

@attr.s(frozen=True, repr=False)
class Timezone(Component, _tzinfo):
    ...
    
    @classmethod
    def from_datetime(dt: datetime.datetime):
        return Timezone_from_offset(dt.tzname(), dt.utcoffset())

By the way, if Timezone.from_tzinfo is kept instead, the cache mechanism with the context parameter could be probably replaced with @functools.lru_cache decoration, to simplify the implementation of Timezone_from_tzinfo.

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Jul 28, 2021

Unfortunately, tzname returns the name for that specific datetime, which doesn't actually indicate the desired timezone including any transitions:

>>> from dateutil.tz import gettz
>>> from datetime import datetime, timedelta
>>> now = datetime.now(tz=gettz("Europe/Berlin"))
>>> later = now + timedelta(days=183)
>>> now.tzname(), now.utcoffset(), now.tzinfo
('CEST', datetime.timedelta(seconds=7200), tzfile('/usr/share/zoneinfo/Europe/Berlin'))
>>> later.tzname(), later.utcoffset(), later.tzinfo
('CET', datetime.timedelta(seconds=3600), tzfile('/usr/share/zoneinfo/Europe/Berlin'))

So for more complex timezone objects, we get 'CEST' or 'CET' depending on the timestamp in this case, while the user would want and expect 'Europe/Berlin' - working with 'CEST' or 'CET' might even yield wrong results once you start modifying datetimes based on that fixed timezone. So to get the information we actually want, we unfortunately need to touch some internals as there is no suitable API to get this information. See here and here for more background information on why there seems to be no other way how to do this.

It seems that in your specific case (maybe starting with Python 3.9), the class was replaced with a built-in / compiled C++ implementation that replaces the python implementation for efficiency reasons:

>>> dt = datetime.fromisoformat("2017-10-01T12:00:00+09:00")
>>> dt.tzname(), dt.utcoffset(), dt.tzinfo
('UTC+09:00', datetime.timedelta(seconds=32400), datetime.timezone(datetime.timedelta(seconds=32400)))
>>> print(inspect.getsource(type(dt.tzinfo)))
class timezone(tzinfo):
    __slots__ = '_offset', '_name'
    ...
>>> dt.tzinfo.utcoffset
<built-in method utcoffset of datetime.timezone object at 0x7f917c2a9750>

Fortunately, in this case we can use the class methods of the datetime.timezone type instead of direct access to the properties to get suitable information.

>>> dt.tzinfo.tzname(None)
'UTC+09:00'
>>> dt.tzinfo.utcoffset(None)
datetime.timedelta(seconds=32400)

I'll push a small fix for this, could you then try again?


Thanks for the hint with @functools.lru_cache! The information which timezones are used in one Calendar and how they are defined is very local to the respective Calendar instance, so reusing them between calls might lead to various problems (not only regarding multi-threading, but also name-clashes, different versions of the same timezone, ...), so having a (de-)serialization-local context seems to be the way to go.

N-Coder and others added 10 commits August 18, 2021 13:54
Components are now fully decoupled from their Converters, they only save their ics NAME and never directly reference their converters.
The only exception to this rule are the convenience methods Component.from_container, populate, to_container and serialize and Timezone.from_tzid and from_tzinfo, but all of them import their Converters locally.
When loading the converters package, a registry mapping Component Classes to their respective Converters is set up.
This mapping can easily be changed later to use customized converters for certain Components.
@N-Coder N-Coder merged commit 42a1354 into main Aug 18, 2021
@N-Coder N-Coder deleted the timezones branch August 18, 2021 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

5 participants