@cmb69 cmb69 commented Jan 22, 2021

The default encoding of filenames in a ZIP archive is IBM Code Page
437. Phar, however, only supports UTF-8 filenames. Therefore we have
to mark non-ASCII filenames as being stored in UTF-8 by setting
general purpose bit 11 (the language encoding flag).

The effect of not setting this bit for non-ASCII filenames can be seen
in popular tools like 7-Zip and UnZip, but not when extracting the
archives via ext/phar (which is agnostic to the filename encoding) or
via ext/zip (which guesses the encoding). Thus we add a somewhat
brittle low-level test case.
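
To illustrate the idea described above, here is a minimal standalone sketch of marking a filename as UTF-8 only when it contains non-ASCII bytes. It is not the code from ext/phar: `ZIP_FLAG_UTF8`, `name_is_ascii`, `put_le16`, and `flags_for_name` are hypothetical names chosen for this example, and only the meaning of bit 11 is taken from the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the general purpose bit flag: filename is stored as UTF-8.
 * (Hypothetical macro name, used only in this sketch.) */
#define ZIP_FLAG_UTF8 (1u << 11) /* 0x0800 */

/* Returns 1 if every byte is in the 7-bit ASCII range, 0 otherwise. */
static int name_is_ascii(const char *name, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)name[i] & 0x80) {
            return 0;
        }
    }
    return 1;
}

/* Stores a 16-bit value in little-endian order, as ZIP headers expect. */
static void put_le16(uint8_t out[2], uint16_t value)
{
    out[0] = (uint8_t)(value & 0xff);
    out[1] = (uint8_t)(value >> 8);
}

/* Returns the flags for an entry, adding the UTF-8 bit for non-ASCII names. */
static uint16_t flags_for_name(uint16_t flags, const char *name, size_t len)
{
    if (!name_is_ascii(name, len)) {
        flags |= ZIP_FLAG_UTF8;
    }
    return flags;
}
```

In a real writer the same flags value would presumably have to go into both the local file header and the central directory entry, so that tools reading either one agree on the encoding.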

@cmb69 cmb69 added the Bug label Jan 22, 2021
ext/phar/zip.c Outdated
memcpy(central.datestamp, local.datestamp, sizeof(local.datestamp));
PHAR_SET_16(central.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
PHAR_SET_16(local.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0));
if (!is_ascii(entry->filename, entry->filename_len)) {
Member

Just wondering: would setting the flag unconditionally be fine? ASCII and UTF-8 are identical when only ASCII characters are used.

Member Author

I don't see a real problem with doing this unconditionally; if a ZIP tool doesn't honor that flag, there still shouldn't be a difference for ASCII-only filenames. OTOH, setting the flag conditionally wouldn't cause any behavioral change for ASCII-only filenames.

Member Author

I pushed a commit which would set the flag unconditionally. I'm fine with either solution.

Member

Always setting the flag is less code, so if that works, let's go for it :)
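
For comparison, the unconditional variant discussed in this thread could look roughly like the sketch below. It is an illustration only: `sketch_zip_header`, `GPB_UTF8_NAMES`, and `sketch_set_utf8_flag` are hypothetical names, and the real header layout in ext/phar is not reproduced here.

```c
#include <stdint.h>

/* General purpose bit 11: filename is UTF-8 (hypothetical macro name). */
#define GPB_UTF8_NAMES (1u << 11)

/* Hypothetical stand-in for the two-byte flags field of a ZIP header. */
typedef struct {
    uint8_t flags[2]; /* stored little-endian */
} sketch_zip_header;

/* Sets the UTF-8 bit regardless of the filename, keeping other bits intact. */
static void sketch_set_utf8_flag(sketch_zip_header *h)
{
    uint16_t flags = (uint16_t)(h->flags[0] | (h->flags[1] << 8));
    flags |= GPB_UTF8_NAMES;
    h->flags[0] = (uint8_t)(flags & 0xff);
    h->flags[1] = (uint8_t)(flags >> 8);
}
```

No ASCII check is needed in this form, which is the "less code" trade-off mentioned above.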

@php-pulls php-pulls closed this in 6a0b889 Jan 26, 2021
@cmb69 cmb69 deleted the cmb/70091 branch July 13, 2021 13:48