Skip to content

Memoize compiled regexps in fileutils.regexpMatch#20088

Closed
pwaller wants to merge 1 commit intomoby:masterfrom
pwaller:memoize-build-regexps
Closed

Memoize compiled regexps in fileutils.regexpMatch#20088
pwaller wants to merge 1 commit intomoby:masterfrom
pwaller:memoize-build-regexps

Conversation

@pwaller
Copy link
Copy Markdown
Contributor

@pwaller pwaller commented Feb 7, 2016

Headline figure: For my large project it reduces the build context construction phase of docker build from a run time of ~10s to ~2s.

The observation was made using pprof that the majority of the time was
spent in regexp.Compile, so I just made it so we only do that once per
expression in the least intrusive way possible.

It creates a global map of compiled regular expressions which are
never deleted, so it could constitute a leak if this is used repeatedly.
I currently assume that there aren't enough of them for this to be a
worry since in any case it only lasts as long as the the docker client
is running a build.

Epilogue:

Before this fix, in my build context where I have 13,949 files and 4,417
non-ignored files (and 15 .dockerignore expressions), 43% of runtime
was spent in regexp.Compile (via fileutils.OptimizedMatches), which was
dominating. After this fix, regexp.Compile is invisible and the pprof
results don't make any sense to me any more. (go1.6rc2). I'm sure there
is more low hanging fruit but it is a bit harder to determine where it
might be now.

@GordonTheTurtle GordonTheTurtle added status/0-triage dco/no Automatically set by a bot when one of the commits lacks proper signature labels Feb 7, 2016
My docker build context construction was taking ~10 CPU seconds.

This patch reduces it to ~2 CPU seconds.

The observation was made using pprof that the majority of the time was
spent in regexp.Compile, so I just made it so we only do that once per
expression in the least intrusive way possible.

It creates a global map of compiled regular expressions which are
never deleted, so it could constitute a leak if this is used repeatedly.
I currently assume that there aren't enough of them for this to be a
worry since in any case it only lasts as long as the the docker client
is running a build.

Signed-off-by: Peter Waller <p@pwaller.net>
@pwaller pwaller force-pushed the memoize-build-regexps branch from 51979fd to 1ffc971 Compare February 7, 2016 15:25
@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Feb 7, 2016
@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Feb 7, 2016

/cc @duglin @buddhamagnet @vdemeester who may be interested because they have also edited code nearby. (Apologies if you just moved it).

FWIW, I made this discovery while developing docker-show-context and noticing that the build time got slower the more files I ignored.


Further notes: docker build is actually still considerably slower than docker-show-context, it appears to sit there and do nothing for ~10 seconds before it even starts saying sending build context to docker daemon. I believe it is this later phase that I measure using docker-show-context. I have no idea what is happening before that point.


Can someone confirm if those test failures are my fault? I cannot tell.

@LK4D4
Copy link
Copy Markdown
Contributor

LK4D4 commented Feb 7, 2016

@pwaller Dunno :) seems like unrelated because experimental passed. However, I think that your patch can lead to uncontrollable memory growth on many different builds. It would be preferrable to replace regexp with something else. I saw gobwas/glob library announcement which does similar things, but not sure if it'll fit our case.
Also, I think it's possible to detect some basic cases early and return just filepath.Match.

@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Feb 7, 2016

Sure. I'm investigating some other solutions such as using similar logic to git, which is considerably more efficient. I'm currently hacking on this. It looks like fixing this will very substantially reduce the build time for us (where the majority of layers are cached).


@LK4D4 I realised the uncontrollable memory growth. This was more of a quick and dirty proof of concept to start the conversation and show the speedup. I'd prefer to see a proper solution.


@LK4D4 what's the procedure for including an external library such as gobwas/glob?


OK, new statistic:

If I make OptimizedMatches a straight map lookup (where I cheat and precompute the set of included files), the runtime for building the whole tar archive goes from 10 seconds to 0.2 seconds. It really seems worth getting this right!


Edit: To reduce the rapid fire spam I've collapsed several posts together (with the separator).

@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Feb 7, 2016

I've now made a whole new implementation which brings the runtime down from 10s to 45ms, so it's a pretty good win for us if we can move things in this direction. It also speeds up the docker/docker context build by 10x, but that's only from 200ms to 20ms because docker/docker doesn't currently ignore many files.

@LK4D4 I took your suggestion and implemented something using gobwas/glob. Not sure if it is yet acceptable, looking for advice. I've blown the time budget on it for me for now. Positive noises will encourage me to make more time.

But here it is: https://github.com/docker/docker/compare/master...pwaller:new-dockerignore-logic?expand=1

See in particular the new logic in this commit. It simplifies the archive creation and moves the logic out into a file called Excluder. I wrote it looking at the git implementation of file ignores, but I'm sure I have diverged significantly.

At this point, I would have closed this PR and opened up a proposal issue, but I'm looking at the contribution guidelines and they seem to be out of date. In particular, they refer to a label kind/proposal which appears to be empty, so I have nothing to draw inspiration from.

Remaining work to do:

  • Run the existing tests
  • Make the tests pass
  • Rip out old code used for performing ignores
  • Ensure that no regressions are introduced
  • Fix any regressions that are found
  • Write tests

@LK4D4
Copy link
Copy Markdown
Contributor

LK4D4 commented Feb 7, 2016

@pwaller from my point of view you did a great job in your branch. You can just open PR, and we'll discuss it there(I'll put design-review label on it). Also @duglin as the author of original feature can help a lot too.

@calavera
Copy link
Copy Markdown
Contributor

calavera commented Feb 8, 2016

@pwaller do you mind closing this PR and opening a new one with that other implementation? It looks very promising.

@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Feb 8, 2016

@calavera done, see #20128.

@pwaller pwaller closed this Feb 8, 2016
@pwaller pwaller deleted the memoize-build-regexps branch February 8, 2016 23:23
@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Sep 22, 2016

@duglin, @LK4D4, @calavera: has anyone worked on build performance recently? It's still unreasonably slow. A NOP (edit: I meant cached, not NOP) docker build on my project takes 33 CPU-seconds (client side measurement, dominated by "Sending build context to daemon") where a docker-show-context takes 0.8 seconds.

There is no reason docker build's should take much longer than docker-show-context takes, in principle. docker-show-context even uses the same code as docker to construct the archive!

I have also optimised my software layout as much as I can reasonably do so, so the only way I'm going to get decent improvements here are to speed up docker.

Would love to see this fixed.

@thaJeztah
Copy link
Copy Markdown
Member

Afaik, there's not been recent work in this area. Perhaps you can open an issue for tracking (or are you still interested in working on this?)

@pwaller
Copy link
Copy Markdown
Contributor Author

pwaller commented Sep 22, 2016

@thaJeztah, thanks. Filed #26816.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants