Allow to use a dedicated process for each job #3563
yguedidi wants to merge 6 commits into rectorphp:main
Conversation
Challenge results: I wrote this PR in Chrome while running Rector from PhpStorm 😎 🎉
Could you fix the PHPStan notice https://github.com/rectorphp/rector-src/actions/runs/4613330600/jobs/8155197723?pr=3563?
@samsonasik not sure it's worth it for now: this is a proof of concept, it will not be merged as is for sure, and the implementation will completely change if the idea gets accepted 🙂
when using default
Of course! I meant a fourth parameter
Wow, really cool 💙 When I wasn't able to run Rector on our project in one go (after my performance and memory fixes), I thought about the same approach to keep the memory consumption stable in runs on larger codebases, but didn't have the time to actually try it out. So really cool that you had that idea as well and were able to get it working 👍
It's not only the cost to start a process: under the hood Rector and PHPStan also use a lot of internal caching and memoization, and with each new process those caches are empty again. That's why it's all about finding the balance between reusing a process and starting a new one. My initial idea was to not tie the behaviour to the number of files per job, but to configure a maximum amount of memory per process. So in the case of a machine with 12 cores and 16 GiB memory you could configure $rectorConfig->parallel(maxNumberOfProcesses: 12, maxMemoryPerProcess: '1G', jobSize: 20); By default you could set … What do you think?
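To make the suggestion concrete, a worker-loop sketch of such a memory-based process lifetime could look like this (purely illustrative: `maxMemoryPerProcess` is only proposed in this comment, and `$jobQueue`/`$worker` are made-up placeholders, not Rector's actual API):

```php
// Illustrative sketch: keep taking jobs until the memory limit is near,
// then exit so the pool replaces this worker with a fresh process.
$maxMemoryPerProcess = 1 * 1024 * 1024 * 1024; // '1G'

while ($job = $jobQueue->pop()) {   // $jobQueue is a made-up placeholder
    $worker->run($job);             // $worker is a made-up placeholder

    if (memory_get_usage(true) >= $maxMemoryPerProcess) {
        // internal caches are lost, but all memory is reclaimed by the OS
        exit(0);
    }
}
```

This way a worker that stays under the limit keeps its caches warm, and only memory-hungry workers pay the restart cost.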
Thank you for your kind words!
To be honest, I hadn't thought about this idea! My first impression: usually when you set a max memory limit, it's because you want the memory usage to never go above it. Maybe it's just a matter of parameter wording to avoid this confusion. The next point is that we would need userland memory usage management; is that a good thing? No strong opinion. I know about the Symfony Stopwatch component that could be used. Finally, your idea looks like changing the concept of a "job": the current one is file-list based, this new one is memory-usage based.
@TomasVotruba @samsonasik to mention a real business case for being able to run Rector on a big project in one go: being able to set up Rector as an automated code review tool on an existing project.
@TomasVotruba @samsonasik any feedback on the approach here?
Hi @yguedidi, thanks for this feature proposal! There is quite a long discussion to follow, so I'll only share my 2 cents: we mimic the PHPStan parallelism, as it has to match its workflow and is easier to maintain in case of PHPStan architecture changes, past and future. Before going into code, I'd like to ask you 3 things:
Thank you
I'll make it pass if this alternative way of running in parallel gets accepted.
From the description:
will depend on the parallel configuration, so with the following one
with a job size of 100 (
This alternative way of parallelism better fits big projects configured with a high job size
👍 from my side for this proposal, it basically solves the issues described in rectorphp/rector#7806 as well.
@TomasVotruba would you accept this if I finalize it by making the alternative way configurable with a 4th optional parameter |
Instead of having one more parameter, I would choose something more complex with more flexibility. I tested the PR with one of the smaller, cleaner code paths I have.
Memory usage is estimated by looking at total memory usage (while other apps are also running). So in my case, this PR really has an immense benefit on memory consumption, but it also loses a lot of efficiency by discarding everything loaded too often. I was not able to figure out the added code at first look. If you provide me a place for a simple counter, so it only restarts after 500 parsed files, I would like to do another test run.
@Flyingmana I agree with you! I proposed #3602 a few days back to prepare for a potential increase of parallelism configuration options, but it was rejected.
Force-pushed 525b3b9 to 01d03bb
@TomasVotruba @samsonasik this is now ready for review! The new behavior is now configurable, disabled by default. If this gets accepted I plan to work on a follow-up PR to improve DX by warning when the provided configuration leads to a lot of processes being created, impacting performance a lot, as you experienced @Flyingmana.
@Flyingmana to mitigate the performance impact, when using a separate process per job, I also increase the job size in the parallel configuration, to have fewer processes created, and adapt the timeout accordingly, for example.
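A hedged illustration of such tuning, using the positional `parallel()` signature visible elsewhere in this thread (timeout, max processes, job size, then the new flag; the exact numbers here are made up):

```php
// Illustrative tuning only — values are not a recommendation.
// Long-lived workers: many small jobs are fine, spawn cost is paid once.
// $rectorConfig->parallel(120, 16, 20);

// Separate process per job: use a larger job size so fewer processes
// are spawned, and a larger timeout since each job now takes longer.
$rectorConfig->parallel(600, 16, 200, true);
```

The trade-off is that each freshly spawned process starts with cold caches, so the job has to be big enough to amortize both the spawn cost and the cache warm-up.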
@TomasVotruba @samsonasik this is better reviewed commit by commit; I did my best to integrate the new logic into the existing code as much as possible, to minimize the diff.
$jobs = $jobByProcessIdentifier;
$jobIdentifiers = array_keys($jobByProcessIdentifier);
When use separate process per job is enabled, $jobs becomes persistent, with each job indexed by its process identifier.
The new $jobIdentifiers is the stack from which, on each exit, the next process identifier to run is popped.
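In other words, a minimal sketch of the data shapes involved (the identifiers and file names here are made up for illustration):

```php
// $jobs: one job per process, keyed by a process identifier (persistent)
$jobByProcessIdentifier = [
    'process-1' => ['file1.php', 'file2.php'],
    'process-2' => ['file3.php', 'file4.php'],
];

// $jobIdentifiers: a stack of process identifiers still waiting to start;
// on each worker exit, the next identifier is popped and its process started
$jobIdentifiers = array_keys($jobByProcessIdentifier);
$next = array_pop($jobIdentifiers); // 'process-2'
```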
}

- if ($jobs === []) {
+ if ($useSeparateProcessPerJob || $jobs === []) {
When use separate process per job is enabled, always quit the current process after its job is done; don't take another job to work on.
if (! $useSeparateProcessPerJob) {
    return;
}

if ($jobIdentifiers === []) {
    return;
}

$nextProcessIdentifier = array_pop($jobIdentifiers);
$nextParallelProcess = $this->processPool->getProcess($nextProcessIdentifier);

$onDataCallable = $onDataCallableProvider($nextProcessIdentifier, $nextParallelProcess);
$onExitCallable = $onExitCallableProvider($nextProcessIdentifier);

$nextParallelProcess->start(
    // 1. callable on data
    $onDataCallable,

    // 2. callable on error
    $handleErrorCallable,

    // 3. callable on exit
    $onExitCallable
);
When use separate process per job is enabled, here is the magic: start the next available job, by its process identifier, when the current one exits.
): array {
    $jobs = array_reverse($schedule->getJobs());
    $jobIdentifiers = [];
    $jobProcesses = [];
In the last force-push diff I had to introduce this because I found an issue: when a process times out, $this->processPool->quitAll(); is called, and it breaks when there are attached processes that have not been started yet.
So I can't use the process pool to store the job processes.
I'm going through this, but I'm unable to review/accept, as there are many concepts changed at once. Seeing #3563 (comment), it also seems the performance got 6 times slower, which is a no-go. On the other hand, the series of well-targeted and verified improvements is now released in Rector 0.16. Thanks for understanding.
@TomasVotruba the draft at that time did have this performance penalty, but with the later-described increased job size it should in theory get similar results to my approach. I now had time to do a proper retest with some open source code as an example.

Retest setup, used config:

return static function (RectorConfig $rectorConfig): void {
    $rectorConfig->paths([
        __DIR__ . '/src',
        __DIR__ . '/../symfony',
    ]);
    // usual config
    //$rectorConfig->parallel(120*30, 12, 200);
    // config for yguedidi
    $rectorConfig->parallel(120*30, 12, 200, true);
    $rectorConfig->rule(\Rector\Php74\Rector\Ternary\ParenthesizeNestedTernaryRector::class);
};

On a current symfony/symfony checkout, 9629 files (without installed vendor).
So the performance degradation is not there anymore for a big enough job size.
Disclaimer: as the title says, this is a proof of concept, to open the discussion. It's of course not aimed to be merged as is!
The story
I'm running Rector on a big project, and it often breaks because of running out of memory...
It's impossible to run a full configuration on the project to clean it entirely based on the rules we decided to use.
We could argue that, as a refactoring tool, Rector aims to be run in one shot to do just a few rule migrations, and that actually works.
But still, I find myself obliged to close PhpStorm and Chrome to free some memory for Rector.
The challenge
I challenged myself to be able to run Rector without memory issues until the end, with PhpStorm and Chrome still open!
Here is the actual system monitoring before running Rector (other stuff needed for the project, like Docker, is running too):
The investigation
I realized that Rector starts the configured number of processes and keeps using them for each job.
So I had the idea of killing them after they finish their job, so that one process = one parallel job.
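The idea in sketch form (illustrative pseudocode only; `startWorkerProcess`, `sendJob`, and `waitForExit` are made-up helpers, and Rector's real worker protocol is asynchronous and more involved):

```php
// Before: N long-lived workers each process many jobs in a row.
// After: each job gets its own short-lived worker process.
foreach ($jobs as $job) {
    $process = startWorkerProcess(); // made up: spawns a fresh PHP process
    $process->sendJob($job);         // made up: hand the job's file list over
    $process->waitForExit();         // worker exits after one job,
                                     // so the OS reclaims all its memory
}
```

The key property is that no worker lives longer than one job, so memory usage is bounded by the largest single job rather than growing over the whole run.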
The first results
On rector-src at master, on my machine, with a configuration of 3 threads and a job size of 10, it takes around 5m 10s to run.
But it consumes a lot of memory!
On this branch, with the same configuration, it takes around 15m 20s!!! Three times the initial time!
But after changing the job size to 150, it's down to around 5m 20s, which isn't that bad to me, when on master it takes around 4m 30s.
Starting processes has a cost!
The balance
This alternative doesn't fit a parallel configuration with lots of small jobs, as that leads to lots of processes being created, exploding the run time.
But tweaked with a bigger job size, and thus fewer processes to run, it's not that bad from a time perspective.
So the parallel configuration could depend on the project size, the number of rules configured, and the running intentions of the developer.
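A rough back-of-the-envelope calculation makes the trade-off concrete (sketch only; the file count is the symfony/symfony figure mentioned in this thread, the job sizes are arbitrary):

```php
// With a dedicated process per job, the number of processes spawned
// is roughly ceil(files / jobSize).
$files = 9629;
foreach ([10, 100, 200] as $jobSize) {
    $processes = (int) ceil($files / $jobSize);
    echo "jobSize {$jobSize} => ~{$processes} processes\n";
}
// jobSize 10  => ~963 processes
// jobSize 100 => ~97 processes
// jobSize 200 => ~49 processes
```

Going from a job size of 10 to 200 cuts the process count by roughly 20x, which is why the run time recovers once the job size is big enough.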
The proposal
Have this alternative way of running parallel jobs available through configuration, so it's up to developers to decide what best fits their project and intention.
Random ideas
- a $useSeparateProcessPerJob parameter in ->parallel() that defaults to false
- warn when it's true and the number of jobs is high

@TomasVotruba @samsonasik what do you think?
With rectorphp/rector#7806 in mind, @keulinho, you may be interested in this too.