Less of a PR than an exposition/discussion #419
Conversation
Carry around a set of indexes and re-combine them into skipped tokens at the end of the process. Removes the need for last_skipped_index_i
Make (not-yet-used) stubs to do the same with dstridx, ystridx
…evel conditionals
…inue statements and incrementings of i
…lues that are already None
Just pushed a bunch more commits. Again, these are best viewed commit-by-commit. The ones this morning were intended to be obviously-correct, easy to review, and avoid making any serious design decisions. In contrast, these meander and in some cases border on silly. A few highlights:
By the way, with regards to reducing the

```python
>>> a = 'Hello'
>>> a[0:3]
'Hel'
>>> a[5:6]
''
```

So if you want to compare a specific index of a string or something and don't want to be bothered to check if it exists:

```python
>>> a[4:5] == 'o'
True
>>> a[5:6] == 'o'
False
```

The downsides to using this are that it may be harder to read and I haven't checked what impact it would have on performance. In our case, you can imagine
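For illustration only (this helper is hypothetical, not code from the PR), the same trick applied to the parser's token list might look like this; an out-of-range slice compares unequal instead of raising an IndexError:

```python
# Hypothetical sketch, not from the PR: replacing an explicit bounds check
# with a slice comparison when peeking at the next token in a token list.
tokens = ['10', ':', '30']

def next_is_colon_checked(l, i):
    # Conventional version: guard against running off the end of the list.
    return i + 1 < len(l) and l[i + 1] == ':'

def next_is_colon_sliced(l, i):
    # Slice version: an out-of-range slice is just an empty list, never an IndexError.
    return l[i + 1:i + 2] == [':']

assert next_is_colon_checked(tokens, 0) == next_is_colon_sliced(tokens, 0) == True
assert next_is_colon_checked(tokens, 2) == next_is_colon_sliced(tokens, 2) == False
```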
I've actually decided to roll back the
@jbrockmendel Your last message was not 100% clear; is this ready for review?
Yes.
I did not realize initially that these were all being moved to global scope. It's not a big deal, but all the components of the parser should stay as member functions of the parser. The final goal will be to make them public so that downstream users can tweak the fine-grained behavior of the parser by subclassing (like virtual functions in C++). Don't worry about making the change, though; I think it might be a pain to keep this and #425 in sync. I'm still working through this, but barring anything actually conceptually wrong, we can merge this and then #425, and then I'll go through and move these functions to member functions.
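As a rough sketch of that end state (a hypothetical subclass; it assumes the helpers have already been moved onto the parser class, which is not yet the case in this PR), the virtual-function-style customization might look like:

```python
from dateutil import parser

# Hypothetical sketch only: assumes _could_be_tzname has been moved onto the
# parser class as a method, which is the stated end goal rather than the
# current state of this PR. Downstream users could then tweak fine-grained
# behavior by overriding individual pieces, like virtual functions in C++.
class PickyTznameParser(parser.parser):
    def _could_be_tzname(self, hour, tzname, tzoffset, token):
        # Example tweak: only accept 3+ character names, but still allow "Z".
        if token != "Z" and len(token) < 3:
            return False
        return super(PickyTznameParser, self)._could_be_tzname(
            hour, tzname, tzoffset, token)
```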
dateutil/parser.py
Outdated
| """ | ||
| skipped_tokens = [] | ||
| for idx in sorted(list(skipped_idxs)): |
sorted(list(skipped_idxs)) is redundant. The first thing sorted does is cast the iterable to a list.
(Assuming this function is kept - see my comment above where I have a version of this that operates on a list to start with.)
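For reference, sorted accepts any iterable (including a set) directly and always returns a new list, so the inner list call does no additional work:

```python
# sorted() consumes any iterable and returns a new list either way, so the
# explicit list() wrapper is redundant.
skipped_idxs = {5, 0, 2, 1}
assert sorted(skipped_idxs) == sorted(list(skipped_idxs)) == [0, 1, 2, 5]
```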
Good catch.
```python
return index


def append(self, val):
def append(self, val, label=None):
```
I don't really care for this label stuff, but it's not part of the public interface, so we can just change it later, I suppose. So consider this comment to be me making cranky noises, not anything that needs action.
```python
# consecutively skipped tokens (-2 for when i begins at 0).
last_skipped_token_i = -2
skipped_tokens = []
skipped_idxs = set()
```
I guess this is switched over to being a set because of the _recombine_skipped algorithm below. I think it's not really necessary if we know that these skipped index lists are going to be sorted (maybe the idea is that that might change later with retrospective analysis?).
Here's a new algorithm for _recombine_skipped that takes a list instead of a set. My testing seems to indicate that it's about the same speed (actually it says it's maybe 8x faster than the set version, but that there's a huge disparity between the first and subsequent runs, indicating caching):
```python
def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_tokens:
        if idx_queue and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(tokens[i] for i in idx_queue))
            idx_queue.clear()
        else:
            idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(tokens[i] for i in idx_queue))

    return skipped_tokens
```
IIRC the main reason I used a set here was because of the if idx-1 in skipped_idxs check.
The secondary concern was idempotency. I think that eventually we're going to have to do multiple passes or look-backs or something.
The version you posted doesn't work for me. I'm assuming the for idx in skipped_tokens: is supposed to be for idx in skipped_idxs:. If I make that change, then in py3 I get ['foo bar'] and in py2 I get an AttributeError because there is no list.clear.
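For context, the set-based version being defended here isn't quoted in full in this thread; a reconstruction of the membership-test idea (an approximation, not the exact PR code) would be roughly:

```python
def _recombine_skipped_set(tokens, skipped_idxs):
    # Approximation of the set-based approach described above (not the exact
    # code from the PR): a group continues whenever the previous index was
    # also skipped, which is what the `idx - 1 in skipped_idxs` check buys.
    skipped_tokens = []
    for idx in sorted(skipped_idxs):
        if idx - 1 in skipped_idxs:
            skipped_tokens[-1] += tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])
    return skipped_tokens

tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
assert _recombine_skipped_set(tokens, {0, 1, 2, 5}) == ["foo bar", "baz"]
```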
Ah, the version I had was indeed a bit off and not Python 2.7 compatible. Here's an updated version:
```python
def _recombine_skipped_queue(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_idxs:
        if len(idx_queue) and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))
            idx_queue = []

        idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))

    return skipped_tokens
```

Interestingly, I couldn't figure out why this was consistently slower than your version: since strings are immutable, I was expecting growing a string incrementally with + to be much slower than queuing up all the parts and combining them at the end. But lo and behold, if I switch over to using the += style method, the list version is faster than your version! I believe that's because CPython has special-cased this sort of string extension to try to extend the string in-place. See this StackOverflow question. Here's the faster version:
```python
def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = [0, 1, 2, 5]
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
    # This groups consecutive values
    skipped_tokens = []
    for i, idx in enumerate(skipped_idxs):
        if i > 0 and idx - 1 == skipped_idxs[i - 1]:
            skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])

    return skipped_tokens
```

Despite the fact that this is not part of the Python spec, it seems that it's been implemented in Python 2.7 and 3.6 as well as pypy2 and pypy3. Using a loop that randomly generates token strings and skipped indices to test this:
Python 2.7:
$ python2.7 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.884 ms per loop, 1.767 us per set
Running queue timing test
1000 loops, 500 sets: 1.263 ms per loop, 2.526 us per set
Running set timing test
1000 loops, 500 sets: 1.192 ms per loop, 2.384 us per set
Python 3.6:
$ python3.6 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 1.075 ms per loop, 2.150 us per set
Running queue timing test
1000 loops, 500 sets: 1.469 ms per loop, 2.938 us per set
Running set timing test
1000 loops, 500 sets: 1.238 ms per loop, 2.477 us per set
pypy2:
$ pypy test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.132 ms per loop, 0.263 us per set
Running queue timing test
1000 loops, 500 sets: 0.425 ms per loop, 0.851 us per set
Running set timing test
1000 loops, 500 sets: 0.313 ms per loop, 0.627 us per set
pypy3:
$ pypy3 test_recombination_speed.py
Running extend timing test
1000 loops, 500 sets: 0.149 ms per loop, 0.297 us per set
Running queue timing test
1000 loops, 500 sets: 0.339 ms per loop, 0.677 us per set
Running set timing test
1000 loops, 500 sets: 0.236 ms per loop, 0.472 us per set
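The test_recombination_speed.py script itself isn't included in the thread; a minimal harness along these lines (an assumption about its shape, not the actual file) would reproduce the comparison:

```python
# Hypothetical stand-in for the test_recombination_speed.py script referenced
# above: build random token/skipped-index cases and time each variant on them.
import random
import string
import timeit

def make_case(n_tokens=20):
    tokens = [''.join(random.choice(string.ascii_lowercase) for _ in range(3))
              for _ in range(n_tokens)]
    skipped = sorted(random.sample(range(n_tokens), n_tokens // 2))
    return tokens, skipped

CASES = [make_case() for _ in range(500)]

def run(func, label, loops=1000):
    elapsed = timeit.timeit(
        lambda: [func(tokens, skipped) for tokens, skipped in CASES],
        number=loops)
    print('%s: %.3f ms per loop, %.3f us per set'
          % (label, 1000.0 * elapsed / loops, 1e6 * elapsed / (loops * len(CASES))))

# Assumes the functions defined earlier in the thread are in scope:
# run(_recombine_skipped, 'extend')
# run(_recombine_skipped_queue, 'queue')
```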
I'll merge as is and make a separate PR to switch back to my string-based version. Even if we later decide that the order of the indices isn't guaranteed to be sorted, we're sorting in the set version anyway, so it's not a big deal to sort a list instead of a set. Running the same tests where the extend version sorts first gives similar speeds between list and set.
There's some interesting stuff here. The pypy results are impressive.
```python
# Check weekday
value = info.weekday(l[i])
if value is not None:
elif info.weekday(l[i]) is not None:
```
Not sure I understand why this can be changed to elif here. Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?
If this is being switched for performance reasons, I guess we could check empirically to see which one runs the test suite faster.
> Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?

AFAICT, yes. I was pretty sure that the big if blocks were mutually exclusive, but the number of continue statements, returns of (None, None), and exceptions made for enough code paths that I wanted to (n+1)-check. I chose to take the performance hit of a few extra dict lookups in order to make this obvious and have only one non-raising path through the loop.
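As a toy illustration of that trade-off (simplified, not the actual parser code), the rewrite prefers one explicit if/elif chain over cached lookups followed by continue:

```python
# Toy example (not the parser itself): the branches are mutually exclusive,
# so an if/elif chain makes that explicit even though each classifier may be
# looked up twice, which is the small performance hit being accepted.
MONTHS = {'jan': 1, 'feb': 2}
WEEKDAYS = {'mon': 0, 'tue': 1}

def classify(token):
    if MONTHS.get(token) is not None:
        return ('month', MONTHS[token])
    elif WEEKDAYS.get(token) is not None:
        return ('weekday', WEEKDAYS[token])
    else:
        return ('skip', None)

assert classify('feb') == ('month', 2)
assert classify('mon') == ('weekday', 0)
assert classify('xyz') == ('skip', None)
```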
```python
res.tzoffset *= signal
res.tzoffset = signal * (hour_offset*3600 + min_offset*60)

# TODO: Are we not requiring that res.tzname be None?
```
I think that there's a question of adjudication if you find two things that have equal claim to being a time zone in the string. Right now I guess it just takes the most recent thing that it finds that matches?
```python
all(x in string.ascii_uppercase for x in l[i+4])):  # TODO: merge this with _could_be_tzname
# -0300 (BRST)
res.tzname = l[i+2]
res.tzname = l[i+4]
```
Are these differences accumulating because you're reducing the number of passes through the loop?
These are because I am delaying incrementing i for as long as possible. This decreases the total number of places where i is incremented at all (41, down from 51). The closer i is to being fixed within each iteration of the loop, the easier I find it to reason about.
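Schematically (toy code, not the parser), the difference is between every branch bumping i itself versus each branch reporting how many tokens it consumed so i advances in exactly one place:

```python
# Toy sketch of delaying the increment: a branch returns how many tokens it
# consumed, and i is advanced in a single spot at the bottom of the loop.
def consumed(tokens, i):
    # Pretend "HH : MM" is handled by one branch that eats three tokens;
    # everything else eats one.
    if tokens[i + 1:i + 2] == [':']:
        return 3
    return 1

def walk(tokens):
    starts = []
    i = 0
    while i < len(tokens):
        starts.append(i)
        i += consumed(tokens, i)   # the only line that changes i
    return starts

assert walk(['10', ':', '30', 'UTC']) == [0, 3]
```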
dateutil/parser.py
Outdated
```python
class InvalidDateError(InvalidDatetimeError): pass
class InvalidTimeError(InvalidDatetimeError): pass

class ProgrammingError(AssertionError):
```
This seems unused, and I'm somewhat dubious of its utility.
Agreed, removing it.
```python
# TODO: requre len(token) >= 3 like we do for the between-parens version?
# do some other validation here instead of putting it off? As of now, "Q"
# will be accepted as a timezone...
def _could_be_tzname(hour, tzname, tzoffset, token):
```
require is misspelled here.
Also, we absolutely cannot require len(token) >= 3, since Z is in the ISO8601 spec for Zulu time (UTC), unless we want to special-case this.
That said, when we fix this up, we may well be able to determine exactly the minimum and maximum length of all valid time zones. We can start by checking the length of all the keys of tzinfos.
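A rough sketch of that bounds check (assuming tzinfos is the usual mapping of timezone names to offsets or tzinfo objects passed to the parser):

```python
# Sketch: derive plausible tzname length bounds from the keys of a tzinfos
# mapping, remembering to keep the one-character "Z" (Zulu/UTC) from ISO 8601.
from dateutil import tz

tzinfos = {"BRST": -7200, "EST": tz.gettz("America/New_York")}

known_names = set(tzinfos) | {"Z", "UTC", "GMT"}   # plus a few built-in names
min_len = min(len(name) for name in known_names)   # 1, because of "Z"
max_len = max(len(name) for name in known_names)   # 4 for this mapping

assert (min_len, max_len) == (1, 4)
```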
```python
if fuzzy:
    val_is_ampm = False
else:
    raise ValueError('No hour specified with AM or PM flag.')
```
Are these caught and re-raised in another form? Otherwise, they should be some ValueError subclass.
At the moment this still gets caught by the except (IndexError, ValueError, AssertionError) wrapping the big loop. When we eventually get rid of that catch-all, I agree that something like InvalidTimeError (which subclasses ValueError) would be more appropriate.
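A minimal sketch of that eventual shape (the exception classes follow the diff above; the helper name is made up for illustration):

```python
# Sketch only: once the blanket except (IndexError, ValueError, AssertionError)
# around the main loop goes away, raising a ValueError subclass keeps existing
# callers working while letting new ones catch the specific failure.
class InvalidDatetimeError(ValueError):
    pass

class InvalidTimeError(InvalidDatetimeError):
    pass

def _ampm_valid(hour, fuzzy):          # hypothetical helper name
    if hour is None:
        if fuzzy:
            return False               # fuzzy parse: quietly skip the stray AM/PM
        raise InvalidTimeError('No hour specified with AM or PM flag.')
    return True
```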
I think there are a few changes in here to be made, but none of them affect the public interface at all, so I don't care if these hit master and the changes are done in a separate PR. @jbrockmendel Would you prefer to just update this PR, or do a separate one with the changes? If you're going to update this PR, it's probably best to update it to include the change where these global functions get moved onto the
I just pushed fixes to the small things: "require" typo, get rid of
Merging now; I'll clean up some of the things I've mentioned here later.
I don't really expect/intend this to be merged; I just want to show the places that I think make for easy refactoring wins. These will be easiest to view commit-by-commit.
The heuristic for refactoring candidates was: does it only touch res or ymd?
- _build_tzinfo and _ampm_validity just cut/paste a chunk of code from a method into a function.
- _adjust_ampm and _parse_min_sec are each only ~4 lines, but appear several times each.
- Commits 5 and 8 are just small cleanups.
The first commit deprecates _skip_token. Instead of carrying around skipped_tokens and last_skipped_index_i, this is replaced with a set skipped_idxs, and at the end of the parsing _recombine_skipped is called to create skipped_tokens.

The second commit makes mstridx an attribute of _ymd and moves validation into ymd.append. This was discussed briefly in #414.

Those were all of the candidates I saw that didn't alter any of the flow, i.e. they should be fairly uncontroversial. The next goals I'll suggest are
reducing the number of places where i is incremented (currently 51) and cleaning up the big while True block.