ASCIIPropertyListParser: handle non-7b-ASCII chars#47

Closed
matvore wants to merge 1 commit into 3breadt:master from matvore:nonasc


@matvore

matvore commented Aug 17, 2018

Contributor

Currently, ASCIIPropertyListParser takes a byte[] and casts each byte to
char, which effectively pads it with an extra 0x00 byte to get UTF-16.
If the byte is >= 0x80, the cast pads it with 0xff instead. This means
that if the bytes are in the 7-bit ASCII range, everything is fine. But
if not, 0x80 for example becomes 0xff80 (half-width TA katakana), which
I don't believe corresponds to any real encoding system.
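
To make the padding behavior concrete, here is a minimal Java sketch of what a plain byte-to-char cast does. The class and method names are hypothetical and are not taken from the parser; the masked variant is shown only for contrast and amounts to ISO-8859-1 decoding, which is not one of the options proposed here.

```java
public class ByteToCharCast {
    // A Java byte is signed, so the cast sign-extends before narrowing
    // to char: 0x80 (-128) becomes 0xFF80.
    static char naiveCast(byte b) {
        return (char) b;
    }

    // Masking with 0xFF keeps only the low 8 bits: 0x80 becomes 0x0080.
    static char maskedCast(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.printf("naive:  U+%04X%n", (int) naiveCast(b));  // U+FF80
        System.out.printf("masked: U+%04X%n", (int) maskedCast(b)); // U+0080
    }
}
```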

The options are to:

  • convert using the default system encoding
  • convert using UTF-8

I think UTF-8 is a better default. Using the default system encoding
would be good for backwards compatibility, but non-7-bit-ASCII input
has never worked at all before, so there is no existing behavior to
preserve. This can also be made configurable if the need presents itself.
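
As a rough illustration of the UTF-8 option, the bytes can be decoded with an explicit charset before parsing instead of being cast one by one. This is only a sketch under that assumption: `Utf8DecodeSketch` and `decode` are hypothetical names, not part of the library.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
    // Decode the whole input as UTF-8, then hand the parser a char[].
    static char[] decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).toCharArray();
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] input = { '"', (byte) 0xC3, (byte) 0xA9, '"' };
        char[] chars = decode(input);
        // The two-byte sequence collapses into a single char.
        System.out.printf("%d chars, middle is U+%04X%n", chars.length, (int) chars[1]);
    }
}
```

Swapping `StandardCharsets.UTF_8` for `Charset.defaultCharset()` would give the other option listed above.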

@3breadt

3breadt commented Aug 19, 2018

Owner

I didn't know char casting did that; it was not intended behavior.

So I redesigned the approach for parsing ASCII property lists. The parser now works on a char array instead of a byte array. An encoding can be specified explicitly; otherwise the parser attempts to detect it (UTF-8, UTF-16, UTF-32 or ASCII). I created a feature branch for this reworked parser: https://github.com/3breadt/dd-plist/tree/asciipropertylist-configurable-encoding

What do you think?
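
For readers wondering what encoding detection could look like, here is one common technique: sniffing a byte order mark (BOM) and falling back to UTF-8, which also covers plain ASCII. This is a hypothetical sketch, not the code on the feature branch; `BomSniffer`, `detect` and `startsWith` are made-up names, and the actual detection logic may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Guess the charset from a leading BOM. The UTF-32 checks come
    // before UTF-16 because FF FE 00 00 also starts with FF FE.
    static Charset detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // Assumption: with no BOM, treat the input as UTF-8 (a superset of ASCII).
        return StandardCharsets.UTF_8;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that decoding with the returned charset does not necessarily strip the BOM itself, so a caller would likely skip those leading bytes before decoding.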

@matvore

matvore commented Aug 19, 2018

Contributor Author

That's great! That commit would definitely fit my requirements.

2 participants