Skip to content

Conversation

@jbrockmendel
Copy link
Contributor

I think with this fix, the errors exposed by #540 should be fixed.

@pganssle
Copy link
Member

pganssle commented Dec 6, 2017

Hm... This is probably OK, except that Z is actually an artifact of a specific way of specifying time zone offsets that's broader and always uses a single character.

See Alpha Time from this list. I'm not entirely sure I want to foreclose these as a general rule. Plus, eventually we'll want to allow tzinfo strategies to somehow advertise the valid time zone name ranges, and we don't want these other things to break if you supply some 1-length time zone abbreviations. Probably best to do this on the tokenization step maybe?

@jbrockmendel
Copy link
Contributor Author

Probably best to do this on the tokenization step maybe?

I've been giving this some thought (#541); catching e.g. "A.M." in the tokenization step would be really nice, not sure it would handle the other broken case here.

Also if in the future we want to identify e.g. "US/Eastern" as a tzname, that may be easier to catch pre-tokenization since that would become ["US", "/", "Eastern"].

eventually we'll want to allow tzinfo strategies to somehow advertise the valid time zone name ranges, and we don't want these other things to break if you supply some 1-length time zone abbreviations

That's reasonable. Maybe if we pass tzinfos through so that the appropriate check can be done in _could_be_tzname ?

@pganssle
Copy link
Member

pganssle commented Dec 6, 2017

@jbrockmendel Maybe take a look at how that last part is done in #227, since AFAICT that actually does work.

I do want to apologize that mostly my attitude towards this parser stuff is to sorta reject a lot of these incremental changes. It's a weird situation because I have only a general sense about general techniques in parsing, and I think this sort of heavily contextualized parsing with ambiguous constraints and such is one of the harder kinds of parsing (compared to well-defined grammars). I haven't found the time to like.. read a book on parsing, so my answer to a lot of these strategy proposals is more or less, "I'm not sure how I'll feel about that after I have a reasonable framework to understand parsing problems in general, so it's probably not worth commenting on now."

@jbrockmendel
Copy link
Contributor Author

my attitude towards this parser stuff is to sorta reject a lot of these incremental changes

I totally understand where you're coming from, but appreciate that you get that it can be frustrating on this end.

#227 looks like a reasonable approach (though parts of it would be easier if we caught some strings at tokenization-time). Is it high enough in the queue that I should wait for it to fix this bug?

In it's current condition, this PR basically trades one bug for another. It avoids incorrectly identifying 1-character timezones, but will incorrectly reject Alpha Time in the (probably rare) case that a user chooses that. I think the thing to do is to pass tzinfos to _parse and then to _could_be_tzname and Do It Right. Thoughts?

@pganssle
Copy link
Member

pganssle commented Dec 7, 2017

As a general rule, when the bugs are both at least somewhat rare, I tend to put a good deal of weight on the status quo, but also reversibility (i.e. I'm more comfortable changing the behavior when you can get the old behavior by jumping through some simple hoops).

Considering that I don't see any obvious seam where a user can reverse this behavior and we have some general, vague plans to fix this the right way later, I don't see any reason to make this particular trade-off.

@jbrockmendel
Copy link
Contributor Author

I tend to put a good deal of weight on the status quo

Absolutely agree. I'm currently working on updating this PR to fix this bug without introducing the new one.

def test_alpha_tzname_ampm(self):
# If the Alpha-style timezone is "A" it gets incorrectly
# identified as part of AM/PM
dstr = '2017-12-07 10:27:15A'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. I would guess that 10:27:15A means AM/PM more often than it means Alpha, though I guess they did explicitly set the tzinfos...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah I'm not 100% sure what to do in this case, leaning towards agreeing on the "well if they were explicit...". Also I think "AM" and "A.M." are way more common than "A" so that should be reasonably safe.

tzname is None and
tzoffset is None and
len(token) <= 5 and
all(x in string.ascii_uppercase for x in token))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is the right thing to do. If the user does pass a tzinfos, should we be strict in this branch and only accept token in tzinfos (and maybe UTC, GMT, Z or the parserinfo equivalents)? Should the token in tzinfo check also require the hour is not None and tzname is None and tzoffset is None?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants