
Optimize _expand_named_fields #139

Merged
wimglenn merged 1 commit into r1chardj0n3s:master from Luffbee:master on Nov 9, 2023

Luffbee (Contributor) commented Dec 2, 2021

The original re.match call only does a simple job; using find and slicing is more efficient.
I found this problem while parsing large files with patterns that only use simple field names like 'aaa'. (I think simple names are the common case, which should be optimized.) What I did with the large file looks like this:

import parse

pat = parse.compile("some pattern with {simple_name}")
with open(fname, "r") as f:
    # iterate over the file directly instead of loading every line with readlines()
    for line in f:
        res = pat.parse(line)
        # use res to construct some simple objects
        # ...
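For readers who want to see the idea in isolation, here is a minimal sketch of the two ways of splitting a field name such as 'aaa[bbb][ccc]' into a base name and its subkeys. The function names `split_with_regex` and `split_with_find` are hypothetical, and the regular expression is an assumption about the general shape of the original code, not a copy of parse.py.

```python
import re


def split_with_regex(field):
    # Regex approach (assumed shape): the base name is everything before
    # the first '[', and the subkeys are whatever follows.
    return re.match(r"([^\[]+)(.*)", field).groups()


def split_with_find(field):
    # find-and-slice approach: locate the first '[' and slice around it.
    n = field.find("[")
    if n == -1:
        # Simple name with no subkeys, the common case.
        return field, ""
    return field[:n], field[n:]


# Both approaches agree on simple and subscripted names.
for name in ("aaa", "aaa[bbb]", "aaa[bbb][ccc]"):
    assert split_with_regex(name) == split_with_find(name)
```

For a simple name the find-and-slice version does one substring search and returns without involving the regex engine at all.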

Here are the timing and profiling results from IPython:

# timing (ipython %timeit)

# original code with re.match
6.34 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR with find and slicing
5.02 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# profiling (ipython %prun, truncated)

# original code with re.match
         49321473 function calls in 13.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.693    0.000   10.139    0.000 parse.py:961(evaluate_result)
   657504    1.189    0.000    3.676    0.000 parse.py:941(_expand_named_fields)
        1    1.098    1.098   13.070   13.070 rate.py:46(from_log)
  3945024    1.038    0.000    1.038    0.000 {method 'match' of 're.Pattern' objects}
  4602528    1.007    0.000    1.388    0.000 re.py:289(_compile)
  1315008    0.921    0.000    2.028    0.000 parse.py:537(__call__)
  3287520    0.710    0.000    2.216    0.000 re.py:188(match)
  3945024    0.603    0.000    0.843    0.000 parse.py:985(<genexpr>)
  8547552    0.593    0.000    0.593    0.000 {built-in method builtins.isinstance}
  3287520    0.554    0.000    0.735    0.000 parse.py:1289(__getitem__)
   657504    0.403    0.000   11.083    0.000 parse.py:886(parse)


# this PR with find and slicing
         36171393 function calls in 10.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.544    0.000    7.208    0.000 parse.py:966(evaluate_result)
        1    0.974    0.974   10.001   10.001 rate.py:46(from_log)
  1315008    0.917    0.000    2.043    0.000 parse.py:537(__call__)
   657504    0.654    0.000    0.946    0.000 parse.py:941(_expand_named_fields)
  3945024    0.584    0.000    0.801    0.000 parse.py:990(<genexpr>)
  3287520    0.514    0.000    0.707    0.000 parse.py:1294(__getitem__)
   657504    0.481    0.000    0.481    0.000 {method 'match' of 're.Pattern' objects}
   657504    0.389    0.000    8.148    0.000 parse.py:886(parse)
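The per-call difference can also be checked in isolation with a small micro-benchmark. This is only a sketch: the field names in `FIELDS`, the helper names, and the regular expression are all assumptions made for illustration, and the absolute numbers will depend on the machine and Python version.

```python
import re
import timeit

# Hypothetical simple field names, standing in for a real pattern's fields.
FIELDS = ["aaa", "bbb", "ccc", "ddd", "eee"]


def split_all_regex():
    # Module-level re.match looks up the compiled pattern in re's cache on
    # every call, then runs the match.
    return [re.match(r"([^\[]+)(.*)", f).groups() for f in FIELDS]


def split_all_find():
    # One str.find per field, plus slicing only when a '[' is present.
    out = []
    for f in FIELDS:
        n = f.find("[")
        out.append((f, "") if n == -1 else (f[:n], f[n:]))
    return out


regex_time = timeit.timeit(split_all_regex, number=20_000)
find_time = timeit.timeit(split_all_find, number=20_000)
print(f"re.match:   {regex_time:.3f} s")
print(f"find/slice: {find_time:.3f} s")
```

The re.py:289(_compile) and re.py:188(match) rows in the first profile correspond to that cache lookup and wrapper call, which the find-and-slice version avoids entirely.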

wimglenn (Collaborator) commented Nov 9, 2023

Seems reasonable. This is not exactly logically equivalent: for example, if the input were '[aaa]', the existing code would raise, whereas this code returns basename, subkeys as "", "[aaa]". But I don't see any handling of the AttributeError, so I don't think it should cause any issues.
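The edge case can be reproduced with a short sketch. The helper names and the regular expression below are assumptions chosen to match the behaviour described in this comment; they are not copied from parse.py.

```python
import re


def split_with_regex(field):
    # Assumed shape of the original code: re.match returns None when the
    # field starts with '[', so calling .groups() raises AttributeError.
    return re.match(r"([^\[]+)(.*)", field).groups()


def split_with_find(field):
    # Assumed shape of the new code: an index of 0 is treated like any
    # other position, which yields an empty base name.
    n = field.find("[")
    if n == -1:
        return field, ""
    return field[:n], field[n:]


try:
    split_with_regex("[aaa]")
except AttributeError as exc:
    regex_outcome = type(exc).__name__
else:
    regex_outcome = "no error"

find_outcome = split_with_find("[aaa]")

print(regex_outcome)  # AttributeError
print(find_outcome)   # ('', '[aaa]')
```

Since nothing catches that AttributeError, the difference should only matter for a field name that begins with '['.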

wimglenn merged commit 286bcb1 into r1chardj0n3s:master on Nov 9, 2023