pygmentize performance bad when lexer not specified thanks to pkg_resources and disk IO  #2126

@YoloClin

Hi Pygments team

I'll start by saying that I use pygmentize for probably more than I should, but it's of huge value to me. I currently have less aliased to a catfilter script, which runs pygmentize based on some manual rules (e.g., forcing certain lexers based on file extension).

I noticed that when lexer autodetection is used (i.e., no -l option is given), the filter runs noticeably slower:

time pygmentize -l py generate.py > /dev/null
0.07s user 0.00s system 99% cpu 0.075 total

time pygmentize generate.py >/dev/null      
0.41s user 0.03s system 99% cpu 0.436 total

I asked a junior recruit, @Blackwolf499, to see if he could figure out what was taking so long. He determined with strace that a huge number of libraries were being scanned / imported from the Python path. I have a fast SSD, but I suspect the slowdown is an IO issue caused by that scanning.

Together, we then stepped through the code and found that the speed issue was caused by https://github.com/pygments/pygments/blob/master/pygments/plugin.py#L49 , which subsequently calls pkg_resources (which, as far as I can tell, ships with setuptools rather than the Python standard library). We were able to derive a PoC which reproduces the issue:

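import pkg_resources

group_name = 'pygments.lexers'

# Enumerate every entry point registered under the group, then load each one.
for entry_point in pkg_resources.iter_entry_points(group_name):  # This line costs about 130ms
    print(entry_point)
    print(entry_point.load())  # This line costs about 230ms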

The timings above were measured on my 8-core i9 with a Samsung 2TB Pro SSD, with ipython, ipython3 and ipythonconsole installed.
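To separate the cost of discovering entry points from the cost of loading them, a small timing harness along these lines might help. This is only a sketch: it uses importlib.metadata from the standard library instead of pkg_resources, so its numbers will not match the ones above, and the helper names find_entry_points and time_entry_points are made up for illustration.

```python
import time
from importlib.metadata import entry_points


def find_entry_points(group):
    """Return the entry points registered under `group` as a list."""
    eps = entry_points()
    if hasattr(eps, "select"):        # Python 3.10+ selection API
        return list(eps.select(group=group))
    return list(eps.get(group, []))   # Python 3.8 / 3.9 return a plain dict


def time_entry_points(group):
    """Time discovery and loading separately.

    Returns (found, scan_seconds, load_seconds).
    """
    start = time.perf_counter()
    found = find_entry_points(group)
    scan_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for ep in found:
        try:
            ep.load()  # imports the module that provides the plugin
        except Exception as exc:  # a broken plugin should not abort the timing
            print(f"could not load {ep.name}: {exc}")
    load_seconds = time.perf_counter() - start
    return found, scan_seconds, load_seconds


if __name__ == "__main__":
    found, scan_s, load_s = time_entry_points("pygments.lexers")
    print(f"{len(found)} entry points, scan {scan_s * 1000:.1f} ms, "
          f"load {load_s * 1000:.1f} ms")
```

Running it once with a cold cache and once again straight afterwards should give a rough idea of how much of the time is disk IO rather than import work.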

I'm unclear on exactly why a filesystem scan has to occur. I suspect it is needed to find and load plugin lexers, which must be loaded in order to auto-determine which lexer should be used. As such, I'm unsure how to go about fixing this, and I think it might require more context/knowledge about Pygments than I have.

I'm probably not in a position to recommend solutions, but one or more of the following might help:

  • some form of caching
  • the ability to disable plugin scanning, or hardcode plugin paths (e.g., some form of manually managed caching, such as pygmentize --plugin-scan > plugins.txt; pygmentize --plugin-scan-path plugins.txt test.py)
  • reviewing the actual purpose / necessity of the pkg_resources dep - I have no understanding of why it's used, but possibly alternatives are faster?
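As a rough illustration of the caching idea, the sketch below stores the result of one entry point scan in a JSON file and rebuilds the entry points from that file on later runs. It is not how Pygments works today: the function names and the cache file are hypothetical, it uses importlib.metadata rather than pkg_resources, and it deliberately ignores cache invalidation (the file would go stale whenever a plugin is installed or removed).

```python
import json
import os
from importlib.metadata import EntryPoint, entry_points


def scan_entry_points(group):
    """Do the (potentially slow) scan of installed package metadata."""
    eps = entry_points()
    if hasattr(eps, "select"):        # Python 3.10+ selection API
        return list(eps.select(group=group))
    return list(eps.get(group, []))   # Python 3.8 / 3.9 return a plain dict


def cached_entry_points(group, cache_path):
    """Return entry points for `group`, scanning only when no cache exists.

    Assumption: the caller deletes `cache_path` whenever plugins change.
    """
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            records = json.load(fh)
        # Rebuild lightweight EntryPoint objects without touching site-packages.
        return [
            EntryPoint(name=r["name"], value=r["value"], group=group)
            for r in records
        ]

    found = scan_entry_points(group)
    records = [{"name": ep.name, "value": ep.value} for ep in found]
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return found
```

Only the lookup is cached here; calling load() on a returned entry point still imports the plugin module, so this addresses the scanning cost but not the import cost.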

I don't have a lot of time personally, but I would really appreciate seeing this fixed, as it would save me a small amount of time and a huge amount of frustration. Happy to throw a $50 donation or something at the project to see it fixed if that helps :).
