Skip to content

Alternative to #11437: warn, do not fail, if sandbox cannot be deleted#11566

Closed
nojb wants to merge 2 commits intoocaml:mainfrom
nojb:alternative_to_11437
Closed

Alternative to #11437: warn, do not fail, if sandbox cannot be deleted#11566
nojb wants to merge 2 commits intoocaml:mainfrom
nojb:alternative_to_11437

Conversation

@nojb
Copy link
Copy Markdown
Collaborator

@nojb nojb commented Mar 25, 2025

An alternative to #11437: simply warn if the sandbox cannot be deleted, instead of failing.

Yet another possibility could be to warn and retry (as is done in #11437) the Path.rm_rf operation.

For discussion.

…eleted

Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
@Alizter
Copy link
Copy Markdown
Collaborator

Alizter commented Mar 25, 2025

I think this fix makes sense for the time being. It's not really the users fault if a sandbox cannot be deleted, and I don't think it is important enough for a build to fail.

If you look above, we already try to delete it when we create a sandbox. If for whatever reason that fails on Windows on the off chance that the sandbox wasn't cleared in a previous build and the virus checker has locked the file, then we can error out. The user should in that case just clear the _build directory manually and rebuild.

I don't know if it is worth adding some more sophisticated logic in the future for Windows to avoid this situation entirely, but from reading the previous discussion it appears to be rare and non-fatal.

@Alizter
Copy link
Copy Markdown
Collaborator

Alizter commented Mar 25, 2025

Before you update the test and merge, let's give a headsup to the JS people since this patch will affect them more directly.

cc @rgrinberg @snowleopard

@nojb
Copy link
Copy Markdown
Collaborator Author

nojb commented Mar 25, 2025

Thanks for the quick review @Alizter. @MSoegtropIMC: would this fix be OK for you?

@MSoegtropIMC
Copy link
Copy Markdown
Contributor

Reading the comment of @Alizter I would think that avoiding issues with sandbox clean up would be better. "Manually clean up and restart the build" is not helpful for CI, while my solution fixed all CI issues I had. How about having both changes - the retry on file deletes from my PR and this?

@MSoegtropIMC
Copy link
Copy Markdown
Contributor

So what happens if a previous sandbox could not be cleaned up and then another sandbox needs to be created? Will this lead to a build failure? Or do sandbox folders have unique names?

@Alizter
Copy link
Copy Markdown
Collaborator

Alizter commented Mar 25, 2025

So what happens if a previous sandbox could not be cleaned up and then another sandbox needs to be created? Will this lead to a build failure? Or do sandbox folders have unique names?

The sandbox directories contain a digest of their contents, so unless a rule is being rerun, which won't happen in a single build, there should be no conflicts. The behaviour I am mentioning would only happen on a rebuild, and seems too unlikely to cause major issues. We could perhaps have a better error message in that case however. I can have a look later today at improving that.

@MSoegtropIMC
Copy link
Copy Markdown
Contributor

I still don't see advantages of this solution over my suggestion. As discussed it is more of a hack because it ignores errors and leaves things in an unclean state while mine actually fixes the issue. Regearding the critics on my solution - that it might slow down builds - I was about to say that this is purely theoretical, but thinking about this it is not even this. So I would prefer my solution - which is unlike this solution also heavily tested in CI since > 1 month.

Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
@nojb
Copy link
Copy Markdown
Collaborator Author

nojb commented Mar 25, 2025

So I would prefer my solution - which is unlike this solution also heavily tested in CI since > 1 month.

I think this is a good point, and personally I would be ready to merge #11437 in its current state. The only small issue I see with the way the implementation of #11437 is that if the user is bitten by this problem, the logic kicks in completely silently without reporting what is going on (which may be useful information).

So as a last proposal, I pushed 9a539d8 which implements the same retry logic as in #11437 but in Sandbox.destroy instead of Fpath.unlink which allows us to tell the user what is going on. Could this variant satisfy everyone?

@Alizter
Copy link
Copy Markdown
Collaborator

Alizter commented Mar 25, 2025

To be clear, I am just expressing my opinion on fpath vs this patch. At the end of the day, it won't affect me. So I'm not going to block anything. I think the last commit @nojb pushed is good since it will also give the user some indication of what is going on unlike before.

@MSoegtropIMC will you be able to check that this works for you, or is this something that is difficult to reproduce over a short timescale?

@MSoegtropIMC
Copy link
Copy Markdown
Contributor

The question is if this issue can also happen in non sandbox mode. As far as I can see there is another call to this function in Dune/dune/src/dune_engine/build_system.ml in "execute_action_for_rule", a failing unlink could have strange effects there as well.

Since the main issue appears to be logging and the function is called in only few places, how about passing a logging function to the unlink function?

Another point: there are quite a few occurrences of Unix.unlink in the dune source tree. Shouldn't these be replaced by the safe unlink function?

@MSoegtropIMC
Copy link
Copy Markdown
Contributor

@MSoegtropIMC will you be able to check that this works for you, or is this something that is difficult to reproduce over a short timescale?

Putting it into Coq Platform CI is easy, but there it would take a week or so to be 99.9% conclusive. Local tests would be faster (a few hours), but more manual work.

~warn
t.dir
(Unix_error.Detailed.pp (Unix_error.Detailed.create error ~syscall ~arg));
loop (cnt - 1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should sleep here a bit before looping, to avoid trying to delete the sandbox in a tight loop

Comment on lines +335 to +339
let warn = error = Unix.EACCES && cnt > 0 in
failed_to_delete_sandbox
~warn
t.dir
(Unix_error.Detailed.pp (Unix_error.Detailed.create error ~syscall ~arg));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says "simply warn if the sandbox cannot be deleted, instead of failing" but this part will fail after retrying 30 times.

My understanding is that we don't want the build to fail in case the sandbox files can't be deleted.

rusv-simcorp added a commit to rusv-simcorp/dune that referenced this pull request Mar 28, 2025
If we have retried the maximum number of times then simply
continue without doing anything. Also, sleep before retrying
to avoid retrying in a tight loop.

Implements my own change suggestions in ocaml#11566
@rusv-simcorp
Copy link
Copy Markdown

@nojb FYI, I have implemented my above review comments in rusv-simcorp@9f114b4

@nojb
Copy link
Copy Markdown
Collaborator Author

nojb commented Mar 28, 2025

Let's go with #11437 and keep these alternative approaches in mind for the future if needed.

@nojb nojb closed this Mar 28, 2025
@nojb nojb deleted the alternative_to_11437 branch March 28, 2025 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants