Optimize _expand_named_fields by Luffbee · Pull Request #139 · r1chardj0n3s/parse

Luffbee · 2021-12-02T03:18:36Z

The original re.match call does a simple job, so using str.find and slicing is more efficient.
I found this problem when parsing large files where my patterns only use simple field names like 'aaa'. (I think simple names are the common case, which should be optimized.) What I did with the large file looks like this:

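import parse

pat = parse.compile("some pattern with {simple_name}")
# fname is the path to the large input file
with open(fname, "r") as f:
    for line in f:  # iterate lazily instead of loading every line with readlines()
        res = pat.parse(line)
        # use the res to construct some simple objects
        # ...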
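
The change itself can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the helper names `split_with_regex` and `split_with_find` are hypothetical, and the regular expression is an assumption about what the original code matched (a base name followed by optional `[...]` subkeys).

```python
import re


def split_with_regex(field):
    """Split 'aaa[bbb][ccc]' into ('aaa', '[bbb][ccc]') using re.match.

    The pattern is an assumed stand-in for the original code's regex.
    """
    return re.match(r"([^\[]+)(.*)", field).groups()


def split_with_find(field):
    """Perform the same split with str.find and slicing, with no regex call."""
    n = field.find("[")
    if n == -1:
        # Simple name such as 'aaa': there are no subkeys.
        return field, ""
    return field[:n], field[n:]


# For names that start with a base name, both helpers agree.
for name in ("aaa", "aaa[bbb]", "aaa[bbb][ccc]"):
    assert split_with_regex(name) == split_with_find(name)
```

For a simple name the find-based version does a single substring search and returns early, which is the case the PR aims to make cheap.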

Here are the timing and profiling results from IPython:

# timing (ipython %timeit)

# original code with re.match
6.34 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR with find and slicing
5.02 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# profiling (ipython %prun, truncated)

# original code with re.match
         49321473 function calls in 13.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.693    0.000   10.139    0.000 parse.py:961(evaluate_result)
   657504    1.189    0.000    3.676    0.000 parse.py:941(_expand_named_fields)
        1    1.098    1.098   13.070   13.070 rate.py:46(from_log)
  3945024    1.038    0.000    1.038    0.000 {method 'match' of 're.Pattern' objects}
  4602528    1.007    0.000    1.388    0.000 re.py:289(_compile)
  1315008    0.921    0.000    2.028    0.000 parse.py:537(__call__)
  3287520    0.710    0.000    2.216    0.000 re.py:188(match)
  3945024    0.603    0.000    0.843    0.000 parse.py:985(<genexpr>)
  8547552    0.593    0.000    0.593    0.000 {built-in method builtins.isinstance}
  3287520    0.554    0.000    0.735    0.000 parse.py:1289(__getitem__)
   657504    0.403    0.000   11.083    0.000 parse.py:886(parse)


# this PR with find and slicing
         36171393 function calls in 10.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.544    0.000    7.208    0.000 parse.py:966(evaluate_result)
        1    0.974    0.974   10.001   10.001 rate.py:46(from_log)
  1315008    0.917    0.000    2.043    0.000 parse.py:537(__call__)
   657504    0.654    0.000    0.946    0.000 parse.py:941(_expand_named_fields)
  3945024    0.584    0.000    0.801    0.000 parse.py:990(<genexpr>)
  3287520    0.514    0.000    0.707    0.000 parse.py:1294(__getitem__)
   657504    0.481    0.000    0.481    0.000 {method 'match' of 're.Pattern' objects}
   657504    0.389    0.000    8.148    0.000 parse.py:886(parse)
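
A report in the same format as the ones above can be produced outside IPython with the standard-library cProfile and pstats modules. The sketch below assumes a hypothetical workload, `expand_fields`, which stands in for the real parsing loop, and a hypothetical helper, `profile_report`. It shows only how the "Ordered by: internal time" view is obtained, not the numbers from this PR.

```python
import cProfile
import io
import pstats


def expand_fields(fields):
    """Hypothetical workload: split each field name at its first '['."""
    out = []
    for field in fields:
        n = field.find("[")
        out.append((field, "") if n == -1 else (field[:n], field[n:]))
    return out


def profile_report(func, *args, limit=10):
    """Profile func(*args) and return a report sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "tottime" is the time spent inside each function, excluding callees.
    stats.sort_stats("tottime").print_stats(limit)
    return stream.getvalue()


report = profile_report(expand_fields, ["aaa", "bbb[0]"] * 10000)
print(report)
```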

wimglenn · 2023-11-09T03:04:03Z

Seems reasonable. This is not exactly logically equivalent: for example, if the input was '[aaa]', the existing code will raise, but this code will return basename, subkeys as "", "[aaa]". However, I don't see any handling around the AttributeError, so I don't think it should cause any issue.
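
The edge case can be reproduced with a small sketch. The helper names `old_split` and `new_split` and the regex are assumptions for illustration, not copied from the PR.

```python
import re


def old_split(field):
    # Assumed regex: requires at least one non-'[' character at the start.
    # For '[aaa]' re.match returns None, so .groups() raises AttributeError.
    return re.match(r"([^\[]+)(.*)", field).groups()


def new_split(field):
    # find-and-slice variant: a leading '[' yields an empty base name.
    n = field.find("[")
    if n == -1:
        return field, ""
    return field[:n], field[n:]


try:
    old_split("[aaa]")
    old_raised = False
except AttributeError:
    old_raised = True

print(old_raised)          # whether the regex version raised
print(new_split("[aaa]"))  # result of the find-based version
```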

optimize _expand_named_fields

71ee32b

wimglenn merged commit 286bcb1 into r1chardj0n3s:master Nov 9, 2023

wimglenn merged 1 commit into r1chardj0n3s:master from Luffbee:master
