Memoize compiled regexps in fileutils.regexpMatch#20088
Memoize compiled regexps in fileutils.regexpMatch#20088pwaller wants to merge 1 commit intomoby:masterfrom
Conversation
My docker build context construction was taking ~10 CPU seconds. This patch reduces it to ~2 CPU seconds. The observation was made using pprof that the majority of the time was spent in regexp.Compile, so I just made it so we only do that once per expression in the least intrusive way possible. It creates a global map of compiled regular expressions which are never deleted, so it could constitute a leak if this is used repeatedly. I currently assume that there aren't enough of them for this to be a worry since in any case it only lasts as long as the the docker client is running a build. Signed-off-by: Peter Waller <p@pwaller.net>
51979fd to
1ffc971
Compare
|
/cc @duglin @buddhamagnet @vdemeester who may be interested because they have also edited code nearby. (Apologies if you just moved it). FWIW, I made this discovery while developing Further notes: Can someone confirm if those test failures are my fault? I cannot tell. |
|
@pwaller Dunno :) seems like unrelated because experimental passed. However, I think that your patch can lead to uncontrollable memory growth on many different builds. It would be preferrable to replace regexp with something else. I saw gobwas/glob library announcement which does similar things, but not sure if it'll fit our case. |
|
Sure. I'm investigating some other solutions such as using similar logic to git, which is considerably more efficient. I'm currently hacking on this. It looks like fixing this will very substantially reduce the build time for us (where the majority of layers are cached). @LK4D4 I realised the uncontrollable memory growth. This was more of a quick and dirty proof of concept to start the conversation and show the speedup. I'd prefer to see a proper solution. @LK4D4 what's the procedure for including an external library such as gobwas/glob? OK, new statistic: If I make Edit: To reduce the rapid fire spam I've collapsed several posts together (with the separator). |
|
I've now made a whole new implementation which brings the runtime down from 10s to 45ms, so it's a pretty good win for us if we can move things in this direction. It also speeds up the docker/docker context build by 10x, but that's only from 200ms to 20ms because @LK4D4 I took your suggestion and implemented something using gobwas/glob. Not sure if it is yet acceptable, looking for advice. I've blown the time budget on it for me for now. Positive noises will encourage me to make more time. But here it is: https://github.com/docker/docker/compare/master...pwaller:new-dockerignore-logic?expand=1 See in particular the new logic in this commit. It simplifies the archive creation and moves the logic out into a file called Excluder. I wrote it looking at the git implementation of file ignores, but I'm sure I have diverged significantly. At this point, I would have closed this PR and opened up a proposal issue, but I'm looking at the contribution guidelines and they seem to be out of date. In particular, they refer to a label Remaining work to do:
|
|
@pwaller do you mind closing this PR and opening a new one with that other implementation? It looks very promising. |
|
@duglin, @LK4D4, @calavera: has anyone worked on build performance recently? It's still unreasonably slow. A There is no reason I have also optimised my software layout as much as I can reasonably do so, so the only way I'm going to get decent improvements here are to speed up docker. Would love to see this fixed. |
|
Afaik, there's not been recent work in this area. Perhaps you can open an issue for tracking (or are you still interested in working on this?) |
|
@thaJeztah, thanks. Filed #26816. |
Headline figure: For my large project it reduces the build context construction phase of
docker buildfrom a run time of ~10s to ~2s.The observation was made using pprof that the majority of the time was
spent in regexp.Compile, so I just made it so we only do that once per
expression in the least intrusive way possible.
It creates a global map of compiled regular expressions which are
never deleted, so it could constitute a leak if this is used repeatedly.
I currently assume that there aren't enough of them for this to be a
worry since in any case it only lasts as long as the the docker client
is running a build.
Epilogue:
Before this fix, in my build context where I have 13,949 files and 4,417
non-ignored files (and 15
.dockerignoreexpressions), 43% of runtimewas spent in regexp.Compile (via fileutils.OptimizedMatches), which was
dominating. After this fix, regexp.Compile is invisible and the pprof
results don't make any sense to me any more. (go1.6rc2). I'm sure there
is more low hanging fruit but it is a bit harder to determine where it
might be now.