
Conversation

@jbrockmendel
Contributor

I don't really expect/intend this to be merged, just want to show the places that I think make for easy refactoring wins. These will be easiest to view commit-by-commit.

The heuristic for refactoring candidates was

  • Can this be made a function without passing res or ymd?
  • Is it obvious what this function would be named?

_build_tzinfo and _ampm_validity just cut/paste a chunk of code from a method into a function.

_adjust_ampm and _parse_min_sec are each only ~4 lines, but each appears several times.

Commits 5 and 8 are just small cleanups.

The first commit deprecates _skip_token. Instead of carrying around skipped_tokens and last_skipped_index_i, these are replaced with a set skipped_idxs; at the end of parsing, _recombine_skipped is called to create skipped_tokens.

The second commit makes mstridx an attribute of _ymd and moves validation into ymd.append. This was discussed briefly in #414.

Those were all of the candidates I saw that didn't alter any of the flow, i.e. they should be fairly uncontroversial. The next goals I'll suggest are

  • Reducing the number of places where i is incremented (currently 51).
  • Moving away from the while True block.

Carry around a set of indexes and re-combine them into skipped tokens at the end
of the process.  Removes the need for last_skipped_index_i
Make (not-yet-used) stubs to do the same with dstridx, ystridx
@jbrockmendel
Contributor Author

Just pushed a bunch more commits. Again, these are best viewed commit-by-commit. The ones this morning were intended to be obviously-correct, easy to review, and avoid making any serious design decisions. In contrast, these meander and in some cases border on silly.

A few highlights:

  • Implement InvalidDatetimeError(ValueError) and a couple others.
  • Decrease the set of paths we can take through The Loop
    • Raise appropriate ValueErrors instead of returning (None, None)
    • Remove continue statements and instead change a bunch of ifs to elifs. The cost of this is a few extra dictionary lookups.
    • Delay i += 1 wherever possible. Fewer state changes make reasoning easier (for me at least), and reduce the number of args/returns needed as pieces of the loop are refactored out into their own functions.
  • Refactor out the while True loop into _hms_block. In this case I still don't like the function it became (and don't like passing res or returning i), but separating it gives hope that it may be de-whiled.
  • The silly one: override l.__getitem__ to return an empty string for out-of-bounds access. That way we can drop the len_l checks from a bunch of things like if i+3 < len_l and info.month(l[i+3])
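
A minimal sketch of what such an override could look like (illustrative names; this is not the PR's actual implementation):

class _TokenList(list):
    # Out-of-bounds reads yield '' instead of raising IndexError, so
    # callers can compare l[i+3] against something without a len_l guard.
    def __getitem__(self, idx):
        try:
            return list.__getitem__(self, idx)
        except IndexError:
            return ''

l = _TokenList(['19', ':', '30'])
print(l[1] == ':')   # True
print(l[5] == ':')   # False, and no IndexError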

@pganssle
Member

By the way, with regards to reducing the i < len_l checks to avoid IndexError, there's a possibly useful fact that in certain circumstances can be used to avoid index checks: out-of-range indexes are ignored in slices:

>>> a = 'Hello'
>>> a[0:3]
'Hel'
>>> a[5:6]
''

So if you want to compare a specific index of a string or something and don't want to be bothered to check if it exists:

>>> a[4:5] == 'o'
True
>>> a[5:6] == 'o'
False

The downsides to using this are that it may be harder to read and I haven't checked what impact it would have on performance. In our case, you can imagine if i + 1 < len_l and l[i] == ':' might be refactored into if l[i:i+1] == [':'], but I think that the second statement is definitely slower than the first (it involves creating a list literal instead of a string literal, for one thing). Presumably we can check the performance of a proposed l.__getitem__ solution against this.
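
For concreteness, here is a small sketch contrasting the two guard styles on a token list (variable names are illustrative):

l = ['10', ':', '30']
i = 5

# Explicit bounds check before indexing:
print(i < len(l) and l[i] == ':')   # False, and no IndexError

# Slice comparison: out-of-range slice bounds are clamped, so l[i:i+1]
# is simply [] here; note the comparison is against a list literal.
print(l[i:i+1] == [':'])            # False, and no IndexError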

@jbrockmendel
Contributor Author

I've actually decided to roll back the __getitem__ patch locally. It's kind of a cute space-saver, but the amount of non-standardness it causes isn't worth the tradeoff. After I push that change, this will be more or less a real PR.

@pganssle
Member

@jbrockmendel Your last message was not 100% clear, is this ready for review?

@jbrockmendel
Contributor Author

Yes.

@pganssle
Member

pganssle commented Aug 1, 2017

I did not realize initially that these were all being moved to global scope. It's not a big deal, but all the components of the parser should stay as member functions of the parser. The final goal will be to make them public so that downstream users can tweak the fine-grained behavior of the parser by subclassing (like virtual functions in C++).

Don't worry about making the change, though. I think it might be a pain to keep this and #425 in sync. I'm still working through this, but barring anything actually conceptually wrong, we can merge this and then #425, then I'll go through and move these functions to member functions.
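
A hypothetical sketch of that end state, assuming a helper such as _could_be_tzname eventually becomes a parser method with the signature it has in this PR (not the current public API):

from dateutil import parser

class LoggingParser(parser.parser):
    # Override one piece of the parsing logic, C++-virtual-function style.
    def _could_be_tzname(self, hour, tzname, tzoffset, token):
        result = super(LoggingParser, self)._could_be_tzname(
            hour, tzname, tzoffset, token)
        print('tzname candidate %r -> %s' % (token, result))
        return result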

"""
skipped_tokens = []
for idx in sorted(list(skipped_idxs)):
@pganssle
Member

sorted(list(skipped_idxs)) is redundant. The first thing sorted does is cast the iterable to a list.

(Assuming this function is kept - see my comment above where I have a version of this that operates on a list to start with).

@jbrockmendel
Contributor Author

Good catch.

return index

-def append(self, val):
+def append(self, val, label=None):
@pganssle
Member

I don't really care for this label stuff, but it's not part of the public interface, so we can just change it later, I suppose. So consider this comment to be me making cranky noises, not anything that needs action.

-# consecutively skipped tokens (-2 for when i begins at 0).
-last_skipped_token_i = -2
-skipped_tokens = []
+skipped_idxs = set()
@pganssle
Member

I guess this is switched over to being a set because of the _recombine_skipped algorithm below. I think it's not really necessary if we know that these skipped index lists are going to be sorted (maybe the idea is that that might change later with retrospective analysis?).

Here's a new algorithm for _recombine_skipped that takes a list instead of a set. My testing seems to indicate that it's about the same speed (actually it says it's maybe 8x faster than the set version, but that there's a huge disparity between the first and subsequent runs, indicating caching):

def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]

    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []

    for idx in skipped_tokens:
        if idx_queue and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(tokens[i] for i in idx_queue))
            idx_queue.clear()
        else:
            idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(tokens[i] for i in idx_queue))

    return skipped_tokens

@jbrockmendel
Contributor Author

IIRC the main reason I used a set here was because of the if idx-1 in skipped_idxs check.

The secondary concern was idempotency. I think that eventually we're going to have to do multiple passes or look-backs or something.

The version you posted doesn't work for me. I'm assuming the for idx in skipped_tokens: is supposed to be for idx in skipped_idxs:. If I make that change, then in py3 I get ['foo bar'] and in py2 I get an AttributeError because there is no list.clear.

@pganssle
Member

Ah, the version I had was indeed a bit off and not Python 2.7 compatible. Here's an updated version:

def _recombine_skipped_queue(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]

    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_idxs:
        if len(idx_queue) and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))
            idx_queue = []

        idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))

    return skipped_tokens

Interestingly, I couldn't figure out why this queue version was consistently slower than your set version. Since strings are immutable, I was expecting growing a string incrementally with + to be much slower than queuing up all the parts and combining them at the end, but lo and behold, if I switch over to the += style, the list version is faster than your version! I believe that's because CPython special-cases this sort of string extension to try to extend the string in place. See this StackOverflow question. Here's the faster version:

def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = [0, 1, 2, 5]
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]

    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for i, idx in enumerate(skipped_idxs):
        if i > 0 and idx - 1 == skipped_idxs[i - 1]:
            skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])

    return skipped_tokens

Despite the fact that this optimization is not part of the Python spec, it seems to be present in Python 2.7 and 3.6 as well as pypy2 and pypy3. Using a loop that randomly generates token strings and skipped indices to test this:

Python 2.7:

$ python2.7 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.884 ms per loop, 1.767 us per set
Running queue timing test
1000 loops, 500 sets: 1.263 ms per loop, 2.526 us per set
Running set timing test
1000 loops, 500 sets: 1.192 ms per loop, 2.384 us per set

Python 3.6:

$ python3.6 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 1.075 ms per loop, 2.150 us per set
Running queue timing test
1000 loops, 500 sets: 1.469 ms per loop, 2.938 us per set
Running set timing test
1000 loops, 500 sets: 1.238 ms per loop, 2.477 us per set

pypy2:

$ pypy test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.132 ms per loop, 0.263 us per set
Running queue timing test
1000 loops, 500 sets: 0.425 ms per loop, 0.851 us per set
Running set timing test
1000 loops, 500 sets: 0.313 ms per loop, 0.627 us per set

pypy3:

$ pypy3 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.149 ms per loop, 0.297 us per set
Running queue timing test
1000 loops, 500 sets: 0.339 ms per loop, 0.677 us per set
Running set timing test
1000 loops, 500 sets: 0.236 ms per loop, 0.472 us per set

I'll merge as is and make a separate PR to switch back to my string-based version. Even if we later decide that the order of the indices isn't guaranteed to be sorted, we're sorting in the set version anyway, so it's not a big deal to sort a list instead of a set. Running the same tests where the extend version sorts first gives similar speeds between list and set.
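
For reference, the += effect can be sanity-checked with a minimal timeit sketch like the one below (this is not the quoted test script, and which form wins depends on run length and interpreter):

import timeit

setup = "parts = ['tok%d' % i for i in range(500)]"

stmt_plus = '''
s = ''
for p in parts:
    s += p   # CPython may extend s in place when its refcount is 1
'''
stmt_join = "s = ''.join(parts)"

print('+=  :', min(timeit.repeat(stmt_plus, setup, number=1000)))
print('join:', min(timeit.repeat(stmt_join, setup, number=1000)))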

@jbrockmendel
Contributor Author

There's some interesting stuff here. The pypy results are impressive.

# Check weekday
-value = info.weekday(l[i])
-if value is not None:
+elif info.weekday(l[i]) is not None:
@pganssle
Member

Not sure I understand why this can be changed to elif here. Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?

If this is being switched for performance reasons, I guess we could check empirically to see which one runs the test suite faster.

@jbrockmendel
Contributor Author

Was the huge if statement being pointlessly broken up just so that info.weekday(l[i]) didn't need to be called twice?

AFAICT, yes. I was pretty sure that the big if blocks were mutually exclusive, but the number of continue statements, return (None, None), and exceptions made for enough code paths that I wanted to (n+1)-check. I chose to take the performance hit of a few extra dict lookups in order to make this obvious and have only one non-raising path through the loop.

-res.tzoffset *= signal
+res.tzoffset = signal * (hour_offset*3600 + min_offset*60)

+# TODO: Are we not requiring that res.tzname be None?
@pganssle
Member

I think that there's a question of adjudication if you find two things that have equal claim to being a time zone in the string. Right now I guess it just takes the most recent thing that it finds that matches?

all(x in string.ascii_uppercase for x in l[i+4])): # TODO: merge this with _could_be_tzname
# -0300 (BRST)
-res.tzname = l[i+2]
+res.tzname = l[i+4]
@pganssle
Member

Are these differences accumulating because you're reducing the number of passes through the loop?

@jbrockmendel
Contributor Author

These are because I am delaying incrementing i for as long as possible. This decreases the total number of places where i is incremented at all (41, down from 51). The closer i stays to fixed within each iteration of the loop, the easier I find it to reason about.

class InvalidDateError(InvalidDatetimeError): pass
class InvalidTimeError(InvalidDatetimeError): pass

class ProgrammingError(AssertionError):
@pganssle
Member

This seems unused, and I'm somewhat dubious of its utility.

@jbrockmendel
Contributor Author

Agreed, removing it.

# TODO: requre len(token) >= 3 like we do for the between-parens version?
# do some other validation here instead of putting it off? As of now, "Q"
# will be accepted as a timezone...
def _could_be_tzname(hour, tzname, tzoffset, token):
@pganssle
Member

require is misspelled here.

Also, we absolutely cannot require len(token) >= 3, since Z is in the ISO8601 spec for Zulu time (UTC), unless we want to special-case this.

That said, when we fix this up, we may well be able to determine exactly the minimum and maximum length of all valid time zones. We can start by checking the length of all the keys of tzinfos.
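
As a sketch of that starting point, with a made-up tzinfos mapping of the kind users pass to parse():

tzinfos = {'BRST': -10800, 'EST': -18000, 'Z': 0}

key_lens = [len(key) for key in tzinfos]
print(min(key_lens), max(key_lens))   # 1 4 -- 'Z' alone rules out a >= 3 minimum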

if fuzzy:
val_is_ampm = False
else:
raise ValueError('No hour specified with AM or PM flag.')
@pganssle
Member

Are these caught and re-raised in another form? Otherwise, they should be some ValueError subclass.

@jbrockmendel
Contributor Author

At the moment this still gets caught by the except (IndexError, ValueError, AssertionError) wrapping the big loop. When we eventually get rid of that catch-all, I agree that something like InvalidTimeError (which subclasses ValueError) would be more appropriate.
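
A tiny sketch of why such a subclass stays compatible with the existing catch-all in the meantime (illustrative only):

class InvalidTimeError(ValueError):
    pass

try:
    raise InvalidTimeError('No hour specified with AM or PM flag.')
except (IndexError, ValueError, AssertionError) as e:
    # The subclass is still caught via its ValueError base.
    print('caught:', e)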

@pganssle
Member

pganssle commented Aug 1, 2017

I think there are a few changes in here to be made, but none of them affect the public interface at all, so I don't care if these hit master and the changes are done in a separate PR.

@jbrockmendel Would you prefer to just update this PR or do a separate one with the changes?

If you're going to update this PR, probably best to update it to include the change where these global functions get moved onto the parser class.

@jbrockmendel
Contributor Author

I just pushed fixes to the small things: "require" typo, get rid of ProgrammingError, sorted(list --> sorted(. Other than that I want to avoid tinkering for the time being.

@pganssle
Member

pganssle commented Aug 2, 2017

Merging now, I'll clean up some of the things I've mentioned here later.
