Skip to content

Build remove outdated bin#1550

Merged
Lestropie merged 7 commits intodevfrom
build_remove_outdated_bin
Oct 3, 2019
Merged

Build remove outdated bin#1550
Lestropie merged 7 commits intodevfrom
build_remove_outdated_bin

Conversation

@maxpietsch
Copy link
Copy Markdown
Member

@maxpietsch maxpietsch commented Feb 5, 2019

Addresses #1042: remove binary files in the target bin folder if no corresponding .cpp file can be found. The check whether a file is binary should be conservative but check_encoding makes it a rather expensive operation and could be relaxed if need be.

Removal is triggered for every build, irrespective of the build target.

@maxpietsch maxpietsch changed the base branch from master to dev February 5, 2019 12:58
@Lestropie Lestropie added the build label Feb 5, 2019
@maxpietsch
Copy link
Copy Markdown
Member Author

Feel free to merge if you're happy with it.

maxpietsch and others added 2 commits February 7, 2019 18:39
… for speed reasons. Files without shebang in first line or non UTF-8 encoding in first 1024 bytes get deleted on build.
@maxpietsch
Copy link
Copy Markdown
Member Author

maxpietsch commented Feb 7, 2019

@jdtournier suggested checking for the presence of a shebang line in the file Instead of detecting whether a file is binary. This approach deletes everything that does not have a corresponding .cpp file, is not decodable as UTF-8, or does not start with !#.

For giggles, I timed 20 build runs:

/build && time for i in {1..20}; do ./build; done && git checkout 8b5d8d0f868e54b93bbdc8b760e4a5fe5b17a3ef && ./build && time for i in {1..20}; do ./build; done;
old (8b5d8d0):
real	1m17.403s
user	0m59.189s
sys	0m16.300s
new (9d868a2):
real	1m17.988s
user	0m59.210s
sys	0m16.731s

@Lestropie
Copy link
Copy Markdown
Member

@jdtournier suggested checking for the presence of a shebang line in the file Instead of detecting whether a file is binary. This approach deletes everything that does not have a corresponding .cpp file, is not decodable as UTF-8, or does not start with !#.

Sounds familiar 😅 👍

Only risk is that if a bash script were added to bin/ that didn't contain a shebang, it'd get yeeted; it would be pretty bad practise, but still... Would it be safer (given we're talking irreversible deletion of files) to use detection of a shebang as confirmation to definitely keep a file, but not explicitly require the presence of such to avoid deletion?

@thijsdhollander
Copy link
Copy Markdown
Contributor

Would it be safer (given we're talking irreversible deletion of files) to use detection of a shebang as confirmation to definitely keep a file, but not explicitly require the presence of such to avoid deletion?

Yeah, I was thinking along the same lines as well. I think I'd be more comfortable with that.

@jdtournier
Copy link
Copy Markdown
Member

Sounds familiar 😅 👍

I knew that rang a bell... 🔔

Otherwise, I'd tend to agree with @Lestropie & @thijsdhollander on the potential for scripts to get clobbered because the developer forgot to include the shebang line, and hadn't committed his changes yet. Note that my previous suggestion of relying on git would suffer from the same problem.

As far as I can tell, the current approach would remove a script that's missing a shebang? Even if it was valid utf8? Maybe it would be better to:

  • first check for shebang in first 2 characters, and keep if present
  • check first 1024 characters for valid utf8, and keep if valid
  • assume binary otherwise, check for corresponding cpp in cmd/ folder, and keep if found
  • otherwise remove file

Other idea might be to move 'removed' files to a dedicated folder (say invalid_files or something) and provide a clear warning that this has happened and where to find the file? This would guarantee no loss of data, and effectively take out any outdated binaries. If we go along with this idea, I reckon we don't even need the valid utf8 check - any scripts missing a shebang will need to be amended, but at least they won't disappear...

Also: we should use this approach in ./build clean too - basically would need to replace the list_untracked_bin_files() function.

What do you reckon?

@jdtournier
Copy link
Copy Markdown
Member

Other idea might be to move 'removed' files to a dedicated folder (say invalid_files or something)

On that note, maybe purged_files would be better...?

@maxpietsch
Copy link
Copy Markdown
Member Author

Agreed, moving instead of removing is a better solution. Any preference on purged_files/, tmp/purged_files/ or bin/purged_files/? I think I prefer bin/purged_files/ as that would be the place where I'd look for a missing binary / script.

As far as I can tell, the current approach would remove a script that's missing a shebang? Even if it was valid utf8?

Yes.

I reckon we don't even need the valid utf8 check - any scripts missing a shebang will need to be amended

This would require handling the byte order mark (see here). Any reason not to decode as utf8? If so, it might be better to fall back to binary detection (8b5d8d0) which also does not have the problem of removing scripts / text files that might miss a shebang.

Also: we should use this approach in ./build clean too

Not sure I understand, are you referring to moving objects instead of deleting or replacing git ls-files with detecting binary files via absence of shebang or utf8 encoding? The latter would change the behaviour as currenty ./build clean also deletes scripts (valid shebang or not) that are not in the git tree.

@jdtournier
Copy link
Copy Markdown
Member

This would require handling the byte order mark

I didn't know about that, good catch. However, we don't use it, and I've just tried inserting it at the beginning of the build script with this line:

$ sed -i '1s/^/\xef\xbb\xbf/' build

verified that it inserts the right character (0xEF 0xBB 0xBF) using hexdump:

$ hexdump -n 8 -x build
0000000    bbef    23bf    2f21    7375
0000008

looks like it's reading pairs of bytes back to front, but it checks out: vim and less open the file without problem, and don't show the BOM, and vim shows that it's interpreted the BOM as expected (:set fileencodings returns ucs-bom,utf-8,default,latin1).

But bash won't run the file (at least not on my MSYS2 / Win10 laptop):

$ ./build
./build: line 1: #!/usr/bin/env: No such file or directory
./build: line 3: usage_string: command not found
./build: line 87: import: command not found
./build: line 90: syntax error near unexpected token `and'
./build: line 90: `if sys.executable[0].isalpha() and sys.executable[1] == ':':'

So to ensure compatibility at least with MSYS2, it looks like we can't use the BOM. Besides, from that same wikipedia page:

Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8.


Regarding the ./build clean issue: currently, it does indeed rely on git ls-files and removes anything not under version control. This means there is also scope for data loss if ./build clean is invoked with a non-versioned script in bin/. It also can't work on standalone, non-git setups. So I reckon we can simply use the same approach as here in the clean as well, and also move non-shebang files to the purged folder (whatever name we want to give it).

What do you reckon?

@maxpietsch
Copy link
Copy Markdown
Member Author

maxpietsch commented Feb 8, 2019

This version now moves files out of bin into bin/purged_files if it does not match an expected name (using .cpp files in cmd_dir and exe_suffix) unless any of the following are true, checked in that order:
- b'\x23\x21' is at the beginning of the file
- the first 1024 bytes can be decoded as utf8

This binary file detection requires knowing the exe_suffix so I moved the config file parsing up before the clean part of the code.
Would you change store_current_build as well? Then we could get rid of list_untracked_bin_files and replace it with list_unexpected_bin_files but that requires moving config parsing before config handling.

@Lestropie
Copy link
Copy Markdown
Member

Open question: If one runs ./build clean, and bin/purged_files/ already exists, currently the script will move any additional new unknown files it finds in bin/ into bin/purged_files/. Given the user would already have received a prior warning regarding the moving of unknown files into that location before they run ./build clean, should the pre-existing contents of that directory be wiped at that point, and then an empty directory constructed if necessary?

@Lestropie
Copy link
Copy Markdown
Member

Bumping and adding "question" tag; I tested & was happy to merge save for the question in the prior comment.

@maxpietsch
Copy link
Copy Markdown
Member Author

No strong preference but I'd suggest keeping the contents of bin/purged_files/, it feels unexpected that ./build clean would be deleting files previously moved because they are not part of MRtrix3. Other than taking up disk space I see no harm in keeping them in that folder.

@Lestropie Lestropie merged commit 59271fe into dev Oct 3, 2019
@Lestropie Lestropie deleted the build_remove_outdated_bin branch October 3, 2019 07:03
Lestropie added a commit that referenced this pull request Jan 26, 2020
Resolve conflict between #1550 and #1613, which resulted in compiled Python file "bin/mrtrix3.pyc" being moved to bin/purged_files/.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants