Allow to use a dedicated process for each job #3563

Closed

yguedidi wants to merge 6 commits into rectorphp:main from yguedidi:improve-parallel

Conversation

@yguedidi
Contributor

@yguedidi yguedidi commented Apr 4, 2023

Disclaimer: as the title says, this is a proof of concept, meant to open the discussion. It is of course not intended to be merged as is!

The story

I'm running Rector on a big project, and it often breaks because it runs out of memory...
It's impossible to run the full configuration on the project to clean it entirely based on the rules we decided to use.
One could argue that, as a refactoring tool, Rector aims to be run in one shot to apply just a few rule migrations, and that actually works.
But I still find myself obliged to close PhpStorm and Chrome to free some memory for Rector.

The challenge

I challenged myself to run Rector to the end without memory issues, with PhpStorm and Chrome still open!
Here is the actual system monitoring before running Rector (other things the project needs, like Docker, are running too):

[Image: system monitor screenshot taken before running Rector]

The investigation

I realized that Rector starts the configured number of processes and keeps reusing them for each job.
So I had the idea of killing them once they finish their job, so that one process = one parallel job.
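In pseudocode terms, the difference between the two strategies looks roughly like this (a sketch with hypothetical names like `fetchNextJob()` and `handleJob()`, not Rector's actual internals):

```php
<?php

// Current behavior: long-lived workers keep taking jobs until the queue
// is empty, so each worker's memory (and the caches it builds up)
// accumulates over the whole run.
//
// worker loop (sketch):
while (($job = fetchNextJob()) !== null) {
    handleJob($job); // caches grow; memory is never returned to the OS
}

// Proposed behavior: one process per job; the worker handles exactly one
// job and exits, so its memory is reclaimed immediately, at the cost of
// paying process startup for every job.
//
// worker lifecycle (sketch):
handleJob(fetchNextJob());
exit(0); // the scheduler spawns a fresh process for the next job
```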

The first results

On rector-src at master, on my machine, with a configuration using 3 threads and a job size of 10, it takes around 5m 10s to run.
But it consumes a lot of memory!

On this branch, with the same configuration, it takes around 15m 20s!!! Three times the initial time!
But with the job size changed to 150, it's down to around 5m 20s, which is not that bad to me, given that on master it takes around 4m 30s.

Starting processes has a cost!

The balance

This alternative doesn't fit a parallel configuration with lots of small jobs, as that leads to lots of processes being created, which explodes the run time.
But tweaked with a bigger job size, and therefore fewer processes to run, it's not that bad from a time perspective.

So the parallel configuration could depend on the project size, the number of rules configured, and the intentions of the developer running it.

The proposal

Make this alternative way of running parallel jobs available through configuration, so it's up to developers to decide what best fits their project and intent.

Random ideas

  • Make it configurable with a fourth boolean parameter $useSeparateProcessPerJob that defaults to false in ->parallel()
  • Show a warning when it's true and the number of jobs is high
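Under that first idea, the opt-in could look something like this (a hedged config sketch: the parameter name comes from this PR's description, and the first three arguments mirror the existing timeout / process count / job size defaults mentioned later in the thread):

```php
<?php

use Rector\Config\RectorConfig;

return static function (RectorConfig $rectorConfig): void {
    // Keep the defaults for timeout, process count and job size, and
    // opt in to the proposed one-process-per-job mode via the new
    // fourth parameter (hypothetical until the PR is finalized):
    $rectorConfig->parallel(120, 16, 20, useSeparateProcessPerJob: true);
};
```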

@TomasVotruba @samsonasik what do you think?
With rectorphp/rector#7806 @keulinho you may be interested in this too

@yguedidi
Contributor Author

yguedidi commented Apr 4, 2023

Challenge results: I wrote this PR on Chrome while running Rector from PhpStorm 😎 🎉

@samsonasik
Member

@yguedidi
Contributor Author

yguedidi commented Apr 4, 2023

Could you fix the phpstan notice https://github.com/rectorphp/rector-src/actions/runs/4613330600/jobs/8155197723?pr=3563?

@samsonasik I'm not sure it's worth it for now: this is a proof of concept, it will certainly not be merged as is, and the implementation will completely change if the idea gets accepted 🙂
I'd prefer to first validate the idea with you maintainers.

@samsonasik
Member

when using the default $rectorConfig->parallel(), it should be similarly, or at least equally, fast; if it requires changing the job size, that will be bad for DX.

@yguedidi
Contributor Author

yguedidi commented Apr 5, 2023

when using the default $rectorConfig->parallel(), it should be similarly, or at least equally, fast; if it requires changing the job size, that will be bad for DX.

Of course! I meant a fourth parameter $useSeparateProcessPerJob that defaults to false, to keep the current behavior.
I updated the description to mention that.
Users will have to opt in to use the alternative.

@keulinho
Contributor

keulinho commented Apr 5, 2023

Wow really cool 💙

When I wasn't able to run Rector on our project in one go (after my performance and memory fixes), I thought about the same approach to keep memory consumption stable in runs on larger codebases, but didn't have the time to actually try it out. So it's really cool that you had the idea as well and were able to get it working 👍

Starting processes has a cost!

It's not only the cost of starting a process: under the hood, Rector and PHPStan also use a lot of internal caching and memoization, and with each new process those caches are empty again. That's why it's all about finding the balance between reusing a process and starting a new one.

My initial idea was to not tie the behaviour to the number of files per job, but to configure a maxMemoryPerProcess in the parallel() configuration, so that after each job is handled we check how much memory the process has already consumed; if it is over the configured threshold, the process is killed and a new one is spawned. In theory this should limit starting new processes to when it's actually necessary (given that you configure a sensible max memory limit).

So in the case of a machine with 12 cores and 16 GiB of memory, you could configure a maxNumberOfProcesses of 12 and a maxMemoryPerProcess of 1G.

$rectorConfig->parallel(maxNumberOfProcesses: 12, maxMemoryPerProcess: '1G', jobSize: 20);

By default you could set maxMemoryPerProcess to 0, meaning the feature is basically off and everything works like before, so you actively have to opt in.

What do you think?
From glancing at the code I have no idea how easy or hard it would be to implement, but maybe you can give the idea a try and see how it performs?
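The worker-side check for that idea could be as small as this (a sketch: memory_get_usage() is PHP's built-in, but the surrounding loop and the fetchNextJob()/handleJob() helpers are illustrative, not Rector's actual code):

```php
<?php

// Hypothetical worker job loop with the maxMemoryPerProcess idea: after
// each job, the worker checks its own memory footprint and exits instead
// of asking for another job once the threshold is crossed; the main
// process then spawns a fresh replacement with empty caches.
$maxMemoryPerProcess = 1 * 1024 * 1024 * 1024; // '1G' parsed to bytes; 0 = feature off

while (($job = fetchNextJob()) !== null) { // fetchNextJob() is illustrative
    handleJob($job);

    if ($maxMemoryPerProcess > 0 && memory_get_usage(true) > $maxMemoryPerProcess) {
        exit(0); // the scheduler detects the exit and starts a new worker
    }
}
```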

@yguedidi
Contributor Author

yguedidi commented Apr 5, 2023

Thank you for your kind words!

What do you think?

To be honest, I hadn't thought about this idea!
It's good to have another alternative in the discussion! 👍🏼

My first impression: usually when you set a max memory limit, it's because you want memory usage to never go above it.
If I understood your idea correctly, each process would actually be allowed to go above it for its last processed file.
An edge case could be that this file is a very big one.
So it's a bit confusing to me as is.

Maybe it's just a matter of parameter wording to avoid this confusion. $newProcessThreshold maybe?

The next point is that we would need userland memory usage management; is that a good thing? No strong opinion. I know about the Symfony Stopwatch component, which could be used for this.

Finally, your idea looks like it changes the concept of a "job": the current one is file-list based, this new one is memory-usage based.

@yguedidi
Contributor Author

yguedidi commented Apr 5, 2023

@TomasVotruba @samsonasik to mention a real business case for being able to run Rector on a big project in one go: being able to set up Rector as an automated code review tool on an existing project.
On CI machines it breaks too, so it's impossible to complete a first run that would build the cache for the following PRs (which run on changed files only, so they're OK).

@yguedidi
Contributor Author

yguedidi commented Apr 12, 2023

@TomasVotruba @samsonasik any feedback on the approach here?

@TomasVotruba
Member

Hi @yguedidi ,

thanks for this feature proposal! There is quite a long discussion to follow, so I'll only share my 2 cents: we mimic PHPStan's parallelism, as it has to match its workflow and is easier to maintain in case of PHPStan architecture changes, past and future.

Before going into code, I'd like to ask you 3 things:

  • make the CI pass, so we can test it in case of merge without any further reviews or delays
  • could you share 2-3 bullet points on the bottlenecks this solves?
  • what are the before/after numbers on one exact project?

Thank you

@yguedidi
Contributor Author

make the CI pass, so we can test it in case of merge without any further reviews or delays

I'll make it pass if this alternative way of running in parallel gets accepted.
Let's assume it's accepted: what would be the best way to configure Rector to use this alternative instead of the current one that mimics PHPStan, so I can finalize the PR?
I would have thought about configuring it with a new config method on ParallelConfig from #3602, but that got rejected.
A fourth boolean parameter $useSeparateProcessPerJob in ->parallel(), as suggested in the description, maybe?
There is of course no goal to remove the current way of doing parallelism like PHPStan!

could you share 2-3 bullet points on the bottlenecks this solves?

from description:

  • no more excessive memory usage
  • allows running Rector fully in one go on big projects

what are the before/after numbers on one exact project?

it will depend on the parallel configuration, so with the following one, ->parallel(300, 3, 200), on rector-src:

  • on main, each persistent process exceeded 2G (between 2G and 2.5G) of memory consumption, with a total execution time of ~4m 30s
  • on this branch, memory consumption remains stable below 1.5G per process, at the cost of moving to ~5m 10s of execution time

with job size of 100 (->parallel(300, 3, 100)):

  • on main, same memory behavior with execution time ~4m 20s (-10s from the previous run: less communication with the main process)
  • on this branch, each process doesn't exceed 1G, with execution time ~5m 30s (+20s from the previous run: more processes to create)

This alternative way of parallelism better fits big projects configured with a high job size.

@keulinho
Contributor

👍 from my side for this proposal; it basically solves the issues described in rectorphp/rector#7806 as well.

@yguedidi
Contributor Author

@TomasVotruba would you accept this if I finalize it by making the alternative way configurable with a 4th optional parameter $useSeparateProcesses = false in ->parallel()?
It may help in rectorphp/rector#7891 :)

@Flyingmana
Contributor

Make it configurable with a fourth boolean parameter $useSeparateProcessPerJob that defaults to false in ->parallel()

Instead of having one more parameter, I would choose something more complex with more flexibility.
Working with parallelization and configurable restart conditions gets complex fast.
As soon as you have it, there will be a desire to restart on reaching a memory threshold (as was already suggested; but please don't call it a limit in that case, if it's not setting an actual PHP limit for the process).
Restarting after $n iterations is also not uncommon.
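The restart-after-$n-iterations variant mentioned here would be an equally small change on the worker side (again a sketch with illustrative names, not Rector's actual code):

```php
<?php

// Hypothetical worker loop that restarts after a fixed number of handled
// files instead of after a memory threshold: deterministic, and easy to
// reason about when tuning job size against restart cost.
$filesPerRestart = 500;
$handledFiles = 0;

while (($job = fetchNextJob()) !== null) { // fetchNextJob() is illustrative
    $handledFiles += handleJob($job);      // assume it returns the file count

    if ($handledFiles >= $filesPerRestart) {
        exit(0); // the scheduler spawns a fresh process for the remaining jobs
    }
}
```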

I tested the PR with one of the smaller, cleaner code paths I have.
The number of files is 3114.

|                  |              | Run A        | Run B   |
| ---------------- | ------------ | ------------ | ------- |
| version          | last release | last release | this PR |
| parallelization  | 1            | 3            | 3       |
| Time Real        | 2m 17s       | 1m 15s       | 12m 58s |
| Time User        | 2m 02s       | 2m 43s       | 32m 23s |
| total memory max | ~4G          | ~6G          | ~0.5G   |

Memory usage is estimated by looking at total memory usage (while other apps are also running).

So in my case, this PR really has an immense benefit on memory consumption, but it also loses a lot of efficiency by discarding everything it loaded too often.

I was not able to figure out the added code at first look. If you point me to a place for a simple counter, so it only restarts after 500 parsed files, I would like to do another test run.

@yguedidi
Contributor Author

@Flyingmana I agree with you! I proposed #3602 a few days back to prepare for a potential increase of parallelism configuration options, but it was rejected.

@yguedidi yguedidi force-pushed the improve-parallel branch 3 times, most recently from 525b3b9 to 01d03bb on May 1, 2023 16:34
@yguedidi yguedidi changed the title from "[PoC] An alternative way to do parallelism" to "Allow to use a dedicated process for each job" on May 1, 2023
@yguedidi yguedidi force-pushed the improve-parallel branch from 01d03bb to fc36aa1 on May 1, 2023 16:49
@yguedidi yguedidi marked this pull request as ready for review on May 1, 2023 16:58
@yguedidi yguedidi requested a review from TomasVotruba as a code owner on May 1, 2023 16:58
@yguedidi
Contributor Author

yguedidi commented May 1, 2023

@TomasVotruba @samsonasik this is now ready for review! The new behavior is now configurable, and disabled by default.
@keulinho @Flyingmana if you want to have a look too :)

If this gets accepted, I plan to work on a follow-up PR to improve DX by warning when the provided configuration leads to lots of processes being created, impacting performance a lot, as you experienced @Flyingmana.
I just need to find a sane threshold for the warning.

@yguedidi
Contributor Author

yguedidi commented May 1, 2023

@Flyingmana to mitigate the performance impact when using a separate process per job, I also increase the job size in the parallel configuration, to have fewer processes created, and adapt the timeout accordingly, for example ->parallel(1200, 16, 200) (10 times the defaults).

@yguedidi
Contributor Author

yguedidi commented May 1, 2023

@TomasVotruba @samsonasik this is better reviewed commit by commit; I did my best to integrate the new logic into the existing code as much as possible, to minimize the diff.

Comment on lines +119 to +123
$jobs = $jobByProcessIdentifier;
$jobIdentifiers = array_keys($jobByProcessIdentifier);
Contributor Author


When "use separate process per job" is enabled, $jobs becomes persistent, with each job indexed by its process identifier.
The new $jobIdentifiers is the stack from which each process pops the next job to run on exit.

}

if ($jobs === []) {
if ($useSeparateProcessPerJob || $jobs === []) {
Contributor Author

@yguedidi yguedidi May 1, 2023


When "use separate process per job" is enabled, always quit the current process execution after its job is done; don't take another job to work on.

Comment on lines +186 to +215
if (! $useSeparateProcessPerJob) {
    return;
}

if ($jobIdentifiers === []) {
    return;
}

$nextProcessIdentifier = array_pop($jobIdentifiers);
$nextParallelProcess = $this->processPool->getProcess($nextProcessIdentifier);

$onDataCallable = $onDataCallableProvider($nextProcessIdentifier, $nextParallelProcess);
$onExitCallable = $onExitCallableProvider($nextProcessIdentifier);

$nextParallelProcess->start(
    // 1. callable on data
    $onDataCallable,

    // 2. callable on error
    $handleErrorCallable,

    // 3. callable on exit
    $onExitCallable
);
Contributor Author

@yguedidi yguedidi May 1, 2023


When "use separate process per job" is enabled, here is the magic: start the next available job, by its process identifier, when the current one exits.

@yguedidi yguedidi force-pushed the improve-parallel branch from fc36aa1 to de3ae27 on May 1, 2023 19:40
): array {
$jobs = array_reverse($schedule->getJobs());
$jobIdentifiers = [];
$jobProcesses = [];
Contributor Author


In the last force-push diff I had to introduce this because I found an issue when a process times out: $this->processPool->quitAll(); is called, and it breaks when there are attached processes that haven't started yet.
So I can't use the process pool to store the job processes.

@TomasVotruba
Member

I'm going through this, but I'm unable to review/accept as there are many concepts changed at once.

Seeing #3563 (comment), it also seems the performance got 6× slower, which is a no-go.

On the other hand, a series of well-targeted and verified improvements is now released in Rector 0.16.
Please check those PRs as inspiration for further PRs; they are a good example for review and merge 👍

Thanks for understanding

@Flyingmana
Contributor

@TomasVotruba the draft at that time did have this performance penalty, but with the increased job size described later, it should in theory get similar results to my approach.

I now had time to do a proper retest with some open source code as an example.

Retest

Setup:
WSL2, Linux, php 8.1
cpu: physical cores 6, logical 12
Ram: 32 GB (25GB allowed for WSL2)

used config:

return static function (RectorConfig $rectorConfig): void {
    $rectorConfig->paths([
        __DIR__ . '/src',
        __DIR__ . '/../symfony',
    ]);

    // usual config
    //$rectorConfig->parallel(120 * 30, 12, 200);
    // config for yguedidi
    $rectorConfig->parallel(120 * 30, 12, 200, true);

    $rectorConfig->rule(\Rector\Php74\Rector\Ternary\ParenthesizeNestedTernaryRector::class);
};

On a current symfony/symfony checkout, 9629 files (without installed vendor)
last release: rector 0.16.0

|                   | Run A        | Run B        | Run C                          | Run D        | Run E       | Run F        | Run G        | Run H        |
| ----------------- | ------------ | ------------ | ------------------------------ | ------------ | ----------- | ------------ | ------------ | ------------ |
| version           | last release | last release | last release                   | last release | pr-3563     | pr-3563      | pr-3722      | pr-3722      |
| special commands  |              |              | --no-diffs --no-progress-bar   |              |             |              |              |              |
| parallelization   | 16           | 6            | 6                              | 12           | 12          | 12           | 12           | 12           |
| job size          | 20           | 20           | 20                             | 500          | 500         | 200          | 20           | 10           |
| files per restart | x            | x            | x                              | x            | x           | x            | 200          | 100          |
| Time Real         | 02m 28s      | 02m 04s      | 02m 06s                        | 01m 40s      | 01m 54s     | 02m 12s      | 01m 55s      | 02m 21s      |
| Time User         | 26m 49s      | 11m 48s      | 12m 02s                        | 14m 44s      | 17m 28s     | 22m 39s      | 21m 24s      | 26m 12s      |
| process memory    | 2G           | 2G ~ 4G      | 2G ~ 4G                        | 0.6G ~ 3G    | 0.3G ~ 3G   | 0.2G ~ 2.8G  | 0.2G ~ 1.8G  | 0.2G ~ 0.7G  |
| total memory max  | ~24G         | ~18G         | ~18G                           | ~14G         | ~9G         | ~5G          | ~5G          | ~4G          |

So the performance degradation is no longer there for a big enough job size.
Total memory usage still improves a lot compared to the latest release; also, the increased job size seems to make little difference compared with my PR #3722, which instead limits by iterations.
