Why do I get different UTF-8 representations of an NSString depending on string construction or when running in different environments?

Question

I have some very simple Objective-C code that allocates and initialises an NSString and then gets the UTF-8 const char * representation of that string as follows:

const char *s = [[[NSString alloc] initWithFormat:@"%s", "£"] UTF8String];

I then print out the hex values of the code units that make up this string using this code:

while(*s)
    printf("%02x ", (unsigned int) *s++);

and I get the following output:

ffffffc2 ffffffac ffffffc2 ffffffa3

This is unexpected as I'd assume I'd just get ffffffc2 ffffffa3, seeing as the £ character is made up of two code units, represented in hex as c2 followed by a3, as you can see here.

Here's a screenshot of this output in the simplest iOS app imaginable running locally on my laptop:

Note that the output is the same if I create the NSString as follows:

[[NSString alloc] initWithFormat:@"%s", "\xc2\xa3"]

If I instead use an NSString as the argument to be interpolated into the format string then I get the expected output of ffffffc2 ffffffa3:

[[NSString alloc] initWithFormat:@"%@", @"£"]

What's even stranger to me is that exactly the same failing code as I have above (the first version) seems to work as I'd expect when run on an online Objective-C codepen-type site I found, which you can see here.

Why are the extra code units being added to the UTF-8 representation of the string when I use the initWithFormat:@"%s" version of the code, and seemingly only when I run it on my machine?

From String Format Specifiers: "Because the %s specifier causes the characters to be interpreted in the system default encoding, the results can be variable, especially with right-to-left languages. For example, with RTL, %s inserts direction markers when the characters are not strongly directional. For this reason, it’s best to avoid %s and specify encodings explicitly."
– Willeke, Commented Feb 12, 2020 at 23:45
Yeah I read the docs about string format specifiers and it does seem like that's relevant. That said, how do I know what encoding is being used as my default? Further, how do I explicitly specify an encoding? Is the expectation that I would instead create the NSString with [[NSString alloc] initWithUTF8String: str]?
– hamchapman, Commented Feb 13, 2020 at 9:25
@Willeke I assume that it's MacOSRoman, as that appears to be what's returned if I call [NSString defaultCStringEncoding]
– hamchapman, Commented Feb 13, 2020 at 14:25
Is the encoding of the "£" argument of initWithFormat MacOSRoman?
– Willeke, Commented Feb 13, 2020 at 17:01

CRD · Accepted Answer · 2020-02-13 22:27:40Z

The C language does not specify the encoding of strings; rather, it specifies a set of characters that must be included in the source character set and requires that each of those characters fit in a byte.

When compiling (Objective-)C, the Apple Clang compiler appears to follow this: the encoding of the characters in a C string is based on the encoding of the source file. The default encoding for source files is UTF-8, so the C string literal "£" is stored as the bytes c2, a3, 00, that is, the UTF-8 encoding of "£" followed by a null byte.

As @Willeke remarked, the %s format specifier interprets its argument according to the system default encoding (documentation). This default encoding appears to be MacOSRoman. In that encoding the byte c2 is the character "¬" and the byte a3 is the character "£", so the string you produce from initWithFormat: has those two characters in it. When you then ask for UTF8String, "¬" is encoded as c2 ac and "£" as c2 a3, which is exactly the four bytes you are seeing.

As you have already suggested in your comments, you can address your problem by using initWithUTF8String:, which will work provided your source file encoding is UTF-8. If your source file uses a different encoding, you should instead use initWithCString:encoding: and specify the encoding of your source file.

If you are unsure of your source file encoding, select the file in Xcode and look at the inspector pane; there you can see and change the encoding (either reinterpreting or converting the existing bytes).

Note: If in your real code the C string is not being formed from a string literal in the same file, you will have to determine the encoding of that string.

HTH
