Less of a PR than an exposition/discussion #419
Conversation
Carry around a set of indexes and re-combine them into skipped tokens at the end of the process. Removes the need for last_skipped_index_i
Make (not-yet-used) stubs to do the same with dstridx, ystridx
…evel conditionals
…inue statements and incrementings of i
…lues that are already None
Just pushed a bunch more commits. Again, these are best viewed commit-by-commit. The ones this morning were intended to be obviously-correct, easy to review, and avoid making any serious design decisions. In contrast, these meander and in some cases border on silly. A few highlights:
By the way, with regards to reducing the

```python
>>> a = 'Hello'
>>> a[0:3]
'Hel'
>>> a[5:6]
''
```

So if you want to compare a specific index of a string or something and don't want to be bothered to check if it exists:

```python
>>> a[4:5] == 'o'
True
>>> a[5:6] == 'o'
False
```

The downsides to using this are that it may be harder to read and I haven't checked what impact it would have on performance. In our case, you can imagine
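For illustration only (this helper is hypothetical, not code from the PR), the same trick applied to the parser's token list might look like this; an out-of-range slice compares unequal instead of raising an IndexError:

```python
# Hypothetical sketch, not from the PR: replacing an explicit bounds check
# with a slice comparison when peeking at the next token in a token list.
tokens = ['10', ':', '30']

def next_is_colon_checked(l, i):
    # Conventional version: guard against running off the end of the list.
    return i + 1 < len(l) and l[i + 1] == ':'

def next_is_colon_sliced(l, i):
    # Slice version: an out-of-range slice is just an empty list, never an IndexError.
    return l[i + 1:i + 2] == [':']

assert next_is_colon_checked(tokens, 0) == next_is_colon_sliced(tokens, 0) == True
assert next_is_colon_checked(tokens, 2) == next_is_colon_sliced(tokens, 2) == False
```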
I've actually decided to roll back the
@jbrockmendel Your last message was not 100% clear; is this ready for review?
Yes.
I did not realize initially that these were all being moved to global scope. It's not a big deal, but all the components of the parser should stay as member functions of the parser. The final goal will be to make them public so that downstream users can tweak the fine-grained behavior of the parser by subclassing (like virtual functions in C++). Don't worry about making the change, though; I think it might be a pain to keep this and #425 in sync. I'm still working through this, but barring anything actually conceptually wrong, we can merge this and then #425, and then I'll go through and move these functions to member functions.
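As a rough sketch of that end state (a hypothetical subclass; it assumes the helpers have already been moved onto the parser class, which is not yet the case in this PR), the virtual-function-style customization might look like:

```python
from dateutil import parser

# Hypothetical sketch only: assumes _could_be_tzname has been moved onto the
# parser class as a method, which is the stated end goal rather than the
# current state of this PR. Downstream users could then tweak fine-grained
# behavior by overriding individual pieces, like virtual functions in C++.
class PickyTznameParser(parser.parser):
    def _could_be_tzname(self, hour, tzname, tzoffset, token):
        # Example tweak: only accept 3+ character names, but still allow "Z".
        if token != "Z" and len(token) < 3:
            return False
        return super(PickyTznameParser, self)._could_be_tzname(
            hour, tzname, tzoffset, token)
```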
dateutil/parser.py
Outdated
| """ | ||
| skipped_tokens = [] | ||
| for idx in sorted(list(skipped_idxs)): |
sorted(list(skipped_idxs)) is redundant. The first thing sorted does is cast the iterable to a list.
(Assuming this function is kept - see my comment above where I have a version of this that operates on a list to start with.)
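For reference, sorted accepts any iterable (including a set) directly and always returns a new list, so the inner list call does no additional work:

```python
# sorted() consumes any iterable and returns a new list either way, so the
# explicit list() wrapper is redundant.
skipped_idxs = {5, 0, 2, 1}
assert sorted(skipped_idxs) == sorted(list(skipped_idxs)) == [0, 1, 2, 5]
```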
Good catch.
```python
return index


def append(self, val):
def append(self, val, label=None):
```
I don't really care for this label stuff, but it's not part of the public interface, so we can just change it later, I suppose. So consider this comment to be me making cranky noises, not anything that needs action.
```python
# consecutively skipped tokens (-2 for when i begins at 0).
last_skipped_token_i = -2
skipped_tokens = []
skipped_idxs = set()
```
I guess this is switched over to being a set because of the _recombine_skipped algorithm below. I think it's not really necessary if we know that these skipped index lists are going to be sorted (maybe the idea is that that might change later with retrospective analysis?).
Here's a new algorithm for _recombine_skipped that takes a list instead of a set. My testing seems to indicate that it's about the same speed (actually it says it's maybe 8x faster than the set version, but that there's a huge disparity between the first and subsequent runs, indicating caching):
```python
def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_tokens:
        if idx_queue and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(tokens[i] for i in idx_queue))
            idx_queue.clear()
        else:
            idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(tokens[i] for i in idx_queue))

    return skipped_tokens
```
IIRC the main reason I used a set here was because of the if idx-1 in skipped_idxs check.
The secondary concern was idempotency. I think that eventually we're going to have to do multiple passes or look-backs or something.
The version you posted doesn't work for me. I'm assuming the for idx in skipped_tokens: is supposed to be for idx in skipped_idxs:. If I make that change, then in py3 I get ['foo bar'] and in py2 I get an AttributeError because there is no list.clear.
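For context, the set-based version being defended here isn't quoted in full in this thread; a reconstruction of the membership-test idea (an approximation, not the exact PR code) would be roughly:

```python
def _recombine_skipped_set(tokens, skipped_idxs):
    # Approximation of the set-based approach described above (not the exact
    # code from the PR): a group continues whenever the previous index was
    # also skipped, which is what the `idx - 1 in skipped_idxs` check buys.
    skipped_tokens = []
    for idx in sorted(skipped_idxs):
        if idx - 1 in skipped_idxs:
            skipped_tokens[-1] += tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])
    return skipped_tokens

tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
assert _recombine_skipped_set(tokens, {0, 1, 2, 5}) == ["foo bar", "baz"]
```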
Ah, the version I had was indeed a bit off and not Python 2.7 compatible. Here's an updated version:
```python
def _recombine_skipped_queue(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_idxs:
        if len(idx_queue) and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))
            idx_queue = []

        idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))

    return skipped_tokens
```

Interestingly, I couldn't figure out why this was consistently slower than your version: since strings are immutable, I was expecting growing a string incrementally with + to be much slower than queuing up all the parts and combining them at the end. But lo and behold, if I switch over to using the += style method, the list version is faster than your version! I believe that's because CPython has special-cased this sort of string extension to try to extend the string in-place. See this StackOverflow question. Here's the faster version:
```python
def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = [0, 1, 2, 5]
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    for i, idx in enumerate(skipped_idxs):
        if i > 0 and idx - 1 == skipped_idxs[i - 1]:
            skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])

    return skipped_tokens
```

Despite the fact that this is not part of the Python spec, it seems that it's been implemented in Python 2.7 and 3.6 as well as pypy2 and pypy3. Using a loop that randomly generates token strings and skipped indices to test this:
Python 2.7:
$ python2.7 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.884 ms per loop, 1.767 us per set
Running queue timing test
1000 loops, 500 sets: 1.263 ms per loop, 2.526 us per set
Running set timing test
1000 loops, 500 sets: 1.192 ms per loop, 2.384 us per set
Python 3.6:
$ python3.6 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 1.075 ms per loop, 2.150 us per set
Running queue timing test
1000 loops, 500 sets: 1.469 ms per loop, 2.938 us per set
Running set timing test
1000 loops, 500 sets: 1.238 ms per loop, 2.477 us per set
pypy2:
$ pypy test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.132 ms per loop, 0.263 us per set
Running queue timing test
1000 loops, 500 sets: 0.425 ms per loop, 0.851 us per set
Running set timing test
1000 loops, 500 sets: 0.313 ms per loop, 0.627 us per set
pypy3:
$ pypy3 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.149 ms per loop, 0.297 us per set
Running queue timing test
1000 loops, 500 sets: 0.339 ms per loop, 0.677 us per set
Running set timing test
1000 loops, 500 sets: 0.236 ms per loop, 0.472 us per set
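The test_recombination_speed.py script itself isn't included in the thread; a minimal harness along these lines (an assumption about its shape, not the actual file) would reproduce the comparison:

```python
# Hypothetical stand-in for the test_recombination_speed.py script referenced
# above: build random token/skipped-index cases and time each variant on them.
import random
import string
import timeit

def make_case(n_tokens=20):
    tokens = [''.join(random.choice(string.ascii_lowercase) for _ in range(3))
              for _ in range(n_tokens)]
    skipped = sorted(random.sample(range(n_tokens), n_tokens // 2))
    return tokens, skipped

CASES = [make_case() for _ in range(500)]

def run(func, label, loops=1000):
    elapsed = timeit.timeit(
        lambda: [func(tokens, skipped) for tokens, skipped in CASES],
        number=loops)
    print('%s: %.3f ms per loop, %.3f us per set'
          % (label, 1000.0 * elapsed / loops, 1e6 * elapsed / (loops * len(CASES))))

# Assumes the functions defined earlier in the thread are in scope:
# run(_recombine_skipped, 'extend')
# run(_recombine_skipped_queue, 'queue')
```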
I'll merge as is and make a separate PR to switch back to my string-based version. Even if we later decide that the order of the indices isn't guaranteed to be sorted, we're sorting in the set version anyway, so it's not a big deal to sort a list instead of a set. Running the same tests where the extend version sorts first gives similar speeds between list and set.
There's some interesting stuff here. The pypy results are impressive.
```python
# Check weekday
value = info.weekday(l[i])
if value is not None:
elif info.weekday(l[i]) is not None:
```
Not sure I understand why this can be changed to elif here. Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?
If this is being switched for performance reasons, I guess we could check empirically to see which one runs the test suite faster.
> Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?

AFAICT, yes. I was pretty sure that the big if blocks were mutually exclusive, but the number of continue statements, returns of (None, None), and exceptions made for enough code paths that I wanted to (n+1)-check. I chose to take the performance hit of a few extra dict lookups in order to make this obvious and have only one non-raising path through the loop.
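As a toy illustration of that trade-off (simplified, not the actual parser code), the rewrite prefers one explicit if/elif chain over cached lookups followed by continue:

```python
# Toy example (not the parser itself): the branches are mutually exclusive,
# so an if/elif chain makes that explicit even though each classifier may be
# looked up twice, which is the small performance hit being accepted.
MONTHS = {'jan': 1, 'feb': 2}
WEEKDAYS = {'mon': 0, 'tue': 1}

def classify(token):
    if MONTHS.get(token) is not None:
        return ('month', MONTHS[token])
    elif WEEKDAYS.get(token) is not None:
        return ('weekday', WEEKDAYS[token])
    else:
        return ('skip', None)

assert classify('feb') == ('month', 2)
assert classify('mon') == ('weekday', 0)
assert classify('xyz') == ('skip', None)
```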
```python
res.tzoffset *= signal
res.tzoffset = signal * (hour_offset*3600 + min_offset*60)

# TODO: Are we not requiring that res.tzname be None?
```
I think that there's a question of adjudication if you find two things that have equal claim to being a time zone in the string. Right now I guess it just takes the most recent thing that it finds that matches?
```python
all(x in string.ascii_uppercase for x in l[i+4])):  # TODO: merge this with _could_be_tzname
# -0300 (BRST)
res.tzname = l[i+2]
res.tzname = l[i+4]
```
Are these differences accumulating because you're reducing the number of passes through the loop?
These are because I am delaying incrementing i for as long as possible. This decreases the total number of places where i is incremented at all (41, down from 51). The closer i is to being fixed within each iteration of the loop, the easier I find it to reason about.
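Schematically (toy code, not the parser), the difference is between every branch bumping i itself versus each branch reporting how many tokens it consumed so i advances in exactly one place:

```python
# Toy sketch of delaying the increment: a branch returns how many tokens it
# consumed, and i is advanced in a single spot at the bottom of the loop.
def consumed(tokens, i):
    # Pretend "HH : MM" is handled by one branch that eats three tokens;
    # everything else eats one.
    if tokens[i + 1:i + 2] == [':']:
        return 3
    return 1

def walk(tokens):
    starts = []
    i = 0
    while i < len(tokens):
        starts.append(i)
        i += consumed(tokens, i)   # the only line that changes i
    return starts

assert walk(['10', ':', '30', 'UTC']) == [0, 3]
```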
dateutil/parser.py
Outdated
```python
class InvalidDateError(InvalidDatetimeError): pass
class InvalidTimeError(InvalidDatetimeError): pass

class ProgrammingError(AssertionError):
```
This seems unused, and I'm somewhat dubious of its utility.
Agreed, removing it.
```python
# TODO: requre len(token) >= 3 like we do for the between-parens version?
# do some other validation here instead of putting it off? As of now, "Q"
# will be accepted as a timezone...
def _could_be_tzname(hour, tzname, tzoffset, token):
```
require is misspelled here.
Also, we absolutely cannot require len(token) >= 3, since Z is in the ISO8601 spec for Zulu time (UTC), unless we want to special-case this.
That said, when we fix this up, we may well be able to determine exactly the minimum and maximum length of all valid time zones. We can start by checking the length of all the keys of tzinfos.
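A rough sketch of that bounds check (assuming tzinfos is the usual mapping of timezone names to offsets or tzinfo objects passed to the parser):

```python
# Sketch: derive plausible tzname length bounds from the keys of a tzinfos
# mapping, remembering to keep the one-character "Z" (Zulu/UTC) from ISO 8601.
from dateutil import tz

tzinfos = {"BRST": -7200, "EST": tz.gettz("America/New_York")}

known_names = set(tzinfos) | {"Z", "UTC", "GMT"}   # plus a few built-in names
min_len = min(len(name) for name in known_names)   # 1, because of "Z"
max_len = max(len(name) for name in known_names)   # 4 for this mapping

assert (min_len, max_len) == (1, 4)
```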
```python
if fuzzy:
    val_is_ampm = False
else:
    raise ValueError('No hour specified with AM or PM flag.')
```
Are these caught and re-raised in another form? Otherwise, they should be some ValueError subclass.
At the moment this still gets caught by the except (IndexError, ValueError, AssertionError) wrapping the big loop. When we eventually get rid of that catch-all, I agree that something like InvalidTimeError (which subclasses ValueError) would be more appropriate.
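A minimal sketch of that eventual shape (the exception classes follow the diff above; the helper name is made up for illustration):

```python
# Sketch only: once the blanket except (IndexError, ValueError, AssertionError)
# around the main loop goes away, raising a ValueError subclass keeps existing
# callers working while letting new ones catch the specific failure.
class InvalidDatetimeError(ValueError):
    pass

class InvalidTimeError(InvalidDatetimeError):
    pass

def _ampm_valid(hour, fuzzy):          # hypothetical helper name
    if hour is None:
        if fuzzy:
            return False               # fuzzy parse: quietly skip the stray AM/PM
        raise InvalidTimeError('No hour specified with AM or PM flag.')
    return True
```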
I think there are a few changes in here to be made, but none of them affect the public interface at all, so I don't care if these hit master and the changes are done in a separate PR. @jbrockmendel Would you prefer to just update this PR, or do a separate one with the changes? If you're going to update this PR, it's probably best to update it to include the change where these global functions get moved onto the
I just pushed fixes to the small things: "require" typo, get rid of
Merging now; I'll clean up some of the things I've mentioned here later.
I don't really expect/intend this to be merged; I just want to show the places that I think make for easy refactoring wins. These will be easiest to view commit-by-commit.
The heuristic for refactoring candidates was: does it only touch res or ymd?
- _build_tzinfo and _ampm_validity just cut/paste a chunk of code from a method into a function.
- _adjust_ampm and _parse_min_sec are each only ~4 lines, but appear several times each.
- Commits 5 and 8 are just small cleanups.
The first commit deprecates _skip_token. Instead of carrying around skipped_tokens and last_skipped_index_i, this is replaced with a set skipped_idxs, and at the end of the parsing _recombine_skipped is called to create skipped_tokens.

The second commit makes mstridx an attribute of _ymd and moves validation into ymd.append. This was discussed briefly in #414.

Those were all of the candidates I saw that didn't alter any of the flow, i.e. they should be fairly uncontroversial. The next goals I'll suggest are
reducing the number of places where i is incremented (currently 51) and cleaning up the big while True block.