Alternative to #11437: warn, do not fail, if sandbox cannot be deleted by nojb · Pull Request #11566 · ocaml/dune

nojb · 2025-03-25T12:56:07Z

An alternative to #11437: simply warn if the sandbox cannot be deleted, instead of failing.

Yet another possibility could be to warn and retry (as is done in #11437) the Path.rm_rf operation.

For discussion.

…eleted Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

Alizter · 2025-03-25T13:32:56Z

I think this fix makes sense for the time being. It's not really the users fault if a sandbox cannot be deleted, and I don't think it is important enough for a build to fail.

If you look above, we already try to delete it when we create a sandbox. If for whatever reason that fails on Windows on the off chance that the sandbox wasn't cleared in a previous build and the virus checker has locked the file, then we can error out. The user should in that case just clear the _build directory manually and rebuild.

I don't know if it is worth adding some more sophisticated logic in the future for Windows to avoid this situation entirely, but from reading the previous discussion it appears to be rare and non-fatal.

Alizter · 2025-03-25T13:35:11Z

Before you update the test and merge, let's give a headsup to the JS people since this patch will affect them more directly.

cc @rgrinberg @snowleopard

nojb · 2025-03-25T13:53:50Z

Thanks for the quick review @Alizter. @MSoegtropIMC: would this fix be OK for you?

MSoegtropIMC · 2025-03-25T14:20:56Z

Reading the comment of @Alizter I would think that avoiding issues with sandbox clean up would be better. "Manually clean up and restart the build" is not helpful for CI, while my solution fixed all CI issues I had. How about having both changes - the retry on file deletes from my PR and this?

MSoegtropIMC · 2025-03-25T14:30:32Z

So what happens if a previous sandbox could not be cleaned up and then another sandbox needs to be created? Will this lead to a build failure? Or do sandbox folders have unique names?

Alizter · 2025-03-25T14:38:36Z

So what happens if a previous sandbox could not be cleaned up and then another sandbox needs to be created? Will this lead to a build failure? Or do sandbox folders have unique names?

The sandbox directories contain a digest of their contents, so unless a rule is being rerun, which won't happen in a single build, there should be no conflicts. The behaviour I am mentioning would only happen on a rebuild, and seems too unlikely to cause major issues. We could perhaps have a better error message in that case however. I can have a look later today at improving that.

MSoegtropIMC · 2025-03-25T17:01:15Z

I still don't see advantages of this solution over my suggestion. As discussed it is more of a hack because it ignores errors and leaves things in an unclean state while mine actually fixes the issue. Regearding the critics on my solution - that it might slow down builds - I was about to say that this is purely theoretical, but thinking about this it is not even this. So I would prefer my solution - which is unlike this solution also heavily tested in CI since > 1 month.

Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

nojb · 2025-03-25T17:23:01Z

So I would prefer my solution - which is unlike this solution also heavily tested in CI since > 1 month.

I think this is a good point, and personally I would be ready to merge #11437 in its current state. The only small issue I see with the way the implementation of #11437 is that if the user is bitten by this problem, the logic kicks in completely silently without reporting what is going on (which may be useful information).

So as a last proposal, I pushed 9a539d8 which implements the same retry logic as in #11437 but in Sandbox.destroy instead of Fpath.unlink which allows us to tell the user what is going on. Could this variant satisfy everyone?

Alizter · 2025-03-25T18:03:53Z

To be clear, I am just expressing my opinion on fpath vs this patch. At the end of the day, it won't affect me. So I'm not going to block anything. I think the last commit @nojb pushed is good since it will also give the user some indication of what is going on unlike before.

@MSoegtropIMC will you be able to check that this works for you, or is this something that is difficult to reproduce over a short timescale?

MSoegtropIMC · 2025-03-26T08:40:18Z

The question is if this issue can also happen in non sandbox mode. As far as I can see there is another call to this function in Dune/dune/src/dune_engine/build_system.ml in "execute_action_for_rule", a failing unlink could have strange effects there as well.

Since the main issue appears to be logging and the function is called in only few places, how about passing a logging function to the unlink function?

Another point: there are quite a few occurrences of Unix.unlink in the dune source tree. Shouldn't these be replaced by the safe unlink function?

MSoegtropIMC · 2025-03-26T10:06:54Z

@MSoegtropIMC will you be able to check that this works for you, or is this something that is difficult to reproduce over a short timescale?

Putting it into Coq Platform CI is easy, but there it would take a week or so to be 99.9% conclusive. Local tests would be faster (a few hours), but more manual work.

rusv-simcorp · 2025-03-27T12:26:59Z

src/dune_engine/sandbox.ml

+          ~warn
+          t.dir
+          (Unix_error.Detailed.pp (Unix_error.Detailed.create error ~syscall ~arg));
+        loop (cnt - 1)


We should sleep here a bit before looping, to avoid trying to delete the sandbox in a tight loop

rusv-simcorp · 2025-03-27T12:38:34Z

src/dune_engine/sandbox.ml

+        let warn = error = Unix.EACCES && cnt > 0 in
+        failed_to_delete_sandbox
+          ~warn
+          t.dir
+          (Unix_error.Detailed.pp (Unix_error.Detailed.create error ~syscall ~arg));


The PR description says "simply warn if the sandbox cannot be deleted, instead of failing" but this part will fail after retrying 30 times.

My understanding is that we don't want the build to fail in case the sandbox files can't be deleted.

If we have retried the maximum number of times then simply continue without doing anything. Also, sleep before retrying to avoid retrying in a tight loop. Implements my own change suggestions in ocaml#11566

rusv-simcorp · 2025-03-28T08:49:46Z

@nojb FYI, I have implemented my above review comments in rusv-simcorp@9f114b4

nojb · 2025-03-28T13:20:31Z

Let's go with #11437 and keep these alternative approaches in mind for the future if needed.

Alternative to ocaml#11437: warn, do not fail, if sandbox cannot be d…

ac0d2fe

…eleted Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

nojb mentioned this pull request Mar 25, 2025

Windows: retry a file remove - a virus checker might lock files #11437

Merged

Alizter approved these changes Mar 25, 2025

View reviewed changes

Add retry logic

9a539d8

Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

rusv-simcorp reviewed Mar 27, 2025

View reviewed changes

nojb closed this Mar 28, 2025

nojb deleted the alternative_to_11437 branch March 28, 2025 13:20

Conversation

nojb commented Mar 25, 2025

Uh oh!

Alizter commented Mar 25, 2025

Uh oh!

Alizter commented Mar 25, 2025

Uh oh!

nojb commented Mar 25, 2025

Uh oh!

MSoegtropIMC commented Mar 25, 2025

Uh oh!

MSoegtropIMC commented Mar 25, 2025

Uh oh!

Alizter commented Mar 25, 2025

Uh oh!

MSoegtropIMC commented Mar 25, 2025

Uh oh!

nojb commented Mar 25, 2025

Uh oh!

Alizter commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MSoegtropIMC commented Mar 26, 2025

Uh oh!

MSoegtropIMC commented Mar 26, 2025

Uh oh!

rusv-simcorp Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

rusv-simcorp Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

rusv-simcorp commented Mar 28, 2025

Uh oh!

nojb commented Mar 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Alizter commented Mar 25, 2025 •

edited

Loading