Merged
40 commits
2b32e5e
Add article about WideStrings.
clalancette Mar 7, 2017
1ab6141
one line per sentence
sloretz Sep 11, 2018
4f9c3c4
ROS 2.0 -> ROS 2
sloretz Sep 11, 2018
93dd672
String UTF-8 WString UTF-16
sloretz Sep 12, 2018
354da47
Encodings are required but not enforced
sloretz Sep 12, 2018
7d2ca06
Partial code points not guaranteed to work
sloretz Sep 12, 2018
e4cd96a
Remove 'embedded systems'
sloretz Sep 12, 2018
45a8a50
Remove unnecessary sentence
sloretz Sep 12, 2018
ddf9901
UTF-8 can be 3 bytes
sloretz Sep 12, 2018
9c4e7b5
microcontroller 'could' instead of 'should'
sloretz Sep 12, 2018
ae17066
DDS sections and expectation to switch to Char16
sloretz Sep 12, 2018
5124c5a
Clarify str is used for string and wstring
sloretz Sep 26, 2018
3f1d09a
use unicode in python example + simplify
sloretz Sep 26, 2018
4102e8e
Title and abstract about unicode
sloretz Sep 26, 2018
4a21e09
Mention other strategies for invalid data across bridge
sloretz Sep 26, 2018
b23f680
Shrink background and introduction
sloretz Sep 26, 2018
ba70141
Re-wording to make clearer
sloretz Sep 26, 2018
2d75e5c
Move sentences to abstract
sloretz Sep 26, 2018
f86745a
ascii -> ASCII
sloretz Sep 26, 2018
7056c54
Remove unnecessary words 'or not'
sloretz Sep 26, 2018
3cb8cb5
Fix sentence
sloretz Sep 26, 2018
6938417
comma and remove todo
sloretz Sep 26, 2018
c8f6de1
Say dropping invalid wstring is default
sloretz Sep 26, 2018
6965fac
restrinct -> restrict
sloretz Sep 26, 2018
c41298d
two sentences to 1 long one
sloretz Sep 26, 2018
d3f0086
invalid data -> invalid strings
sloretz Sep 26, 2018
638b521
contrained -> constrained
sloretz Sep 26, 2018
b5b0bec
add permalink
dirk-thomas May 31, 2019
f82dfd9
Add umlaut
sloretz May 31, 2019
5f3c97f
Use std::printf
sloretz May 31, 2019
9818383
Add note that bridging wstrings is not yet supported
sloretz May 31, 2019
3996b8d
Add note about byte order marks with MSVC
sloretz May 31, 2019
ecf689b
Moved note to limit scope to wstring bridging
sloretz May 31, 2019
5588980
mentioning that the BOM is only necessary for Windows
dirk-thomas May 31, 2019
f6644bb
Small spelling and grammar fixes.
clalancette May 31, 2019
e1a7dc8
out of band feedback: removed utf-32 bit since it adds nothing now th…
sloretz May 31, 2019
5a8d8fe
Merge branch 'wstring' of github.com:ros2/design into wstring
sloretz May 31, 2019
7103b43
Python code works without warnings on Dashing
sloretz May 31, 2019
39763f4
C++ code compiles without warnings in Dashing
sloretz May 31, 2019
48a2b56
Simplified c++ example and BOM note
sloretz May 31, 2019
182 changes: 182 additions & 0 deletions articles/WideStrings.md
@@ -0,0 +1,182 @@
---
layout: default
title: Unicode Support
permalink: articles/wide_strings.html
abstract:
This article describes how ROS 2 will support sending multi-byte character data using the [Unicode](https://en.wikipedia.org/wiki/Unicode) standard.
It also describes how such data will be sent over the ROS 1 bridge.
author: '[Chris Lalancette](https://github.com/clalancette)'
published: true
---

- This will become a table of contents (this text will be scraped).
{:toc}

# {{ page.title }}

<div class="abstract" markdown="1">
{{ page.abstract }}
</div>

Original Author: {{ page.author }}

## Background

Some users would like to send text data in languages that cannot be represented by ASCII characters.
Currently ROS 1 only supports ASCII data in the string field but [allows users to populate it with UTF-8](http://wiki.ros.org/msg).

Note that topic names cannot use multi-byte characters as they are disallowed by the DDS specification.
See the [topic and service name](/articles/topic_and_service_names.html) design document for more information.

The following links have more information about multi-byte characters and the history of character encodings.

* [http://kunststube.net/encoding/](http://kunststube.net/encoding/)
* [http://stackoverflow.com/questions/4588302/why-isnt-wchar-t-widely-used-in-code-for-linux-related-platforms](http://stackoverflow.com/questions/4588302/why-isnt-wchar-t-widely-used-in-code-for-linux-related-platforms)
* [http://www.diveintopython3.net/strings.html](http://www.diveintopython3.net/strings.html)
* [http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring)
* [http://utf8everywhere.org/](http://utf8everywhere.org/)


## Unicode Characters in Strings
Two goals for ROS 2 strings are to be compatible with ROS 1 strings, and compatible with the DDS wire format.
ROS 1 says string fields are to contain ASCII encoded data, but allows UTF-8.
DDS-XTYPES mandates UTF-8 be used as the encoding of IDL type `string`.
To be compatible with both, in ROS 2 the content of a `string` is expected to be UTF-8.

## Wide Strings
ROS 2 messages will have a new [primitive field type](/articles/interface_definition.html) `wstring`.
The purpose is to allow ROS 2 nodes to communicate with non-ROS DDS entities using an IDL containing a `wstring` field.
The encoding of data in this type should be UTF-16 to match DDS-XTYPES 1.2.
Since both UTF-8 and UTF-16 can encode the same code points, new ROS 2 messages should prefer `string` over `wstring`.

## Encodings are Required but not Guaranteed to be Enforced
`string` and `wstring` are required to be UTF-8 and UTF-16, but the requirement may not be enforced.
Since ROS 2 is targeting resource-constrained systems, it is left to the rmw implementation to choose whether to enforce the encoding.
Further, since many users will write code to check that a string contains valid data, checking again in lower layers may not be necessary in some cases.

If a `string` or `wstring` field is populated with the wrong encoding then the behavior is undefined.
It is possible the rmw implementation may allow invalid strings to be passed through to subscribers.
Each subscriber is responsible for detecting invalid strings and deciding how to handle them.
For example, subscribers like `ros2 topic echo` may echo the bytes in hexadecimal.
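
As a rough illustration of subscriber-side handling, the sketch below decodes a payload when it is valid UTF-8 and falls back to a hexadecimal dump otherwise.
The helper name `describe_payload` is hypothetical and not part of any ROS 2 API, and the sketch assumes the subscriber has access to the undecoded bytes of the field.

```python
def describe_payload(raw: bytes) -> str:
    """Return the decoded text if raw is valid UTF-8, else a hex dump."""
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Hypothetical fallback: show the bytes instead of guessing a meaning.
        return 'invalid UTF-8: ' + raw.hex()


print(describe_payload('Hello Wörld'.encode('utf-8')))
print(describe_payload(b'\xff\xfe'))
```

Showing the raw bytes avoids silently replacing characters, which could change the meaning of the string.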

The IDL specification forbids `string` from containing `NULL` values.
To be compatible, a ROS message `string` field must not contain zero bytes, and a `wstring` field must not contain zero words.
This restriction will be enforced.
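
The difference between a zero byte and a zero word can be sketched as follows.
The function names are hypothetical, and the UTF-16 data is assumed to be little-endian with no byte order mark; how an rmw implementation actually enforces the restriction is not specified here.

```python
import struct


def has_zero_byte(utf8: bytes) -> bool:
    """Return True if a UTF-8 `string` payload contains a zero byte."""
    return 0 in utf8


def has_zero_word(utf16le: bytes) -> bool:
    """Return True if a UTF-16 `wstring` payload contains a zero 16-bit word.

    Assumes little-endian data without a byte order mark.
    """
    if len(utf16le) % 2:
        raise ValueError('UTF-16 data must have an even number of bytes')
    # Interpret the payload as unsigned 16-bit code units.
    words = struct.unpack('<{}H'.format(len(utf16le) // 2), utf16le)
    return 0 in words


# 'A' in UTF-16 is the bytes 0x41 0x00: a zero byte, but not a zero word.
print(has_zero_word('A'.encode('utf-16-le')))
print(has_zero_word('A\x00B'.encode('utf-16-le')))
```

Checking `wstring` data byte by byte would reject ordinary ASCII text, which is why the restriction for `wstring` is stated in words.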

## Unicode Strings Across ROS 1 Bridge

Since ROS 1 and 2 both allow `string` to be UTF-8, the ROS 1 bridge will pass values unmodified between them.
If a message with a string field fails to serialize because the content is not legal UTF-8 then the default behavior will be to drop the entire message.
Other strategies like replacing invalid bytes could unintentionally change the meaning, so they will be opt-in if available at all.


**Note:** Bridging `wstring` fields is not yet implemented.
See [ros2/ros1_bridge#203](https://github.com/ros2/ros1_bridge/issues/203).

If a ROS 2 message has a field of type `wstring` then the bridge will attempt to convert it from UTF-16 to UTF-8.
The resulting UTF-8 encoded string will be published as a `string` type.
If the conversion fails then the bridge will by default not publish the message.
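
The conversion and the default drop-on-failure behavior could look roughly like the sketch below.
It is written in Python for illustration only and is not the bridge implementation; the function name `wstring_to_ros1_string` is hypothetical, and the input is assumed to be little-endian UTF-16 without a byte order mark.

```python
from typing import Optional


def wstring_to_ros1_string(utf16le: bytes) -> Optional[bytes]:
    """Convert UTF-16 data to UTF-8, or return None if it is not valid UTF-16."""
    try:
        text = utf16le.decode('utf-16-le')
    except UnicodeDecodeError:
        # Mirrors the default described above: signal that the message is dropped.
        return None
    return text.encode('utf-8')


valid = 'Hello Wörld'.encode('utf-16-le')
unpaired_surrogate = b'\x00\xd8A\x00'  # high surrogate followed by 'A'

print(wstring_to_ros1_string(valid))
print(wstring_to_ros1_string(unpaired_surrogate))
```

Returning `None` rather than substituting replacement characters keeps the caller in control of whether to publish anything at all.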

## Size of a Wide String

Both UTF-8 and UTF-16 are variable width encodings.
To minimize the amount of memory used, the `string` and `wstring` types are to be stored in client libraries as sequences of the smallest code unit of their encoding.
This means `string` must be specified as a sequence of bytes, and `wstring` is to be specified as a sequence of 16-bit words.

Some DDS implementations currently use 32-bit types to store wide string values.
This may be due to DDS-XTYPES 1.1 section 7.3.1.5 specifying `wchar` as a 32-bit value.
However, this changes in DDS-XTYPES 1.2 section 7.3.1.4 to be a 16-bit value.
It is expected that most DDS implementations will switch to 16-bit character storage in the future.
ROS 2 will aim to be compatible with DDS-XTYPES 1.2 and use 16-bit storage for wide characters.
Generated code for ROS 2 messages will automatically handle the conversion when a message is serialized or deserialized.

### Bounded wide strings

Message definitions may restrict the maximum size of a string.
These are referred to as bounded strings.
Their purpose is to restrict the amount of memory used, so the bounds must be specified as units of memory.
If a `string` field is bounded then the size is given in bytes.
Similarly the size of a bounded `wstring` is to be specified in words.
It is the responsibility of whoever populates a bounded `string` or `wstring` to make sure it contains whole code points only.
Partial code points are indistinguishable from invalid code points, so a bounded string whose last code point is incomplete is not guaranteed to be published.
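
One way a publisher might keep a bounded `string` within its byte limit without cutting a code point in half is sketched below.
The helper `truncate_utf8` is hypothetical and not provided by any ROS 2 client library; it assumes the input is a Python `str` and the bound is given in bytes.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode text as UTF-8 using at most max_bytes, keeping whole code points."""
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return encoded
    # Cutting at max_bytes may leave an incomplete sequence at the end.
    # The input was valid, so errors='ignore' only drops that trailing fragment.
    return encoded[:max_bytes].decode('utf-8', errors='ignore').encode('utf-8')


# 'ö' needs two bytes, so a bound of 2 bytes only has room for 'W'.
print(truncate_utf8('Wörld', 2))
print(truncate_utf8('Wörld', 3))
```

This only protects code points; a user-perceived character made of several code points can still be split.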

## Runtime impact of wide string

Dealing with wide strings puts more strain on the software of a system, both in terms of speed and code size.
UTF-8 and UTF-16 are both variable width encodings: a code point takes 1 to 4 bytes in UTF-8, and 2 or 4 bytes in UTF-16.
It may take multiple code points to represent a single user-perceived character.
One of the goals of ROS 2 is to support microcontrollers that are constrained by both code size and processor speed.
Some wide string operations like splitting a string on a user-perceived character may not be possible on these devices.
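
The relationship between code points, bytes, and 16-bit words can be checked with a short sketch.
The helper `encoded_sizes` is hypothetical and exists only for this illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 words) for text."""
    code_points = len(text)
    utf8_bytes = len(text.encode('utf-8'))
    # Each UTF-16 code unit is one 16-bit word, i.e. two bytes.
    utf16_words = len(text.encode('utf-16-le')) // 2
    return (code_points, utf8_bytes, utf16_words)


print(encoded_sizes('A'))           # ASCII letter
print(encoded_sizes('\u00f6'))      # 'ö' as a single precomposed code point
print(encoded_sizes('o\u0308'))     # 'ö' as 'o' plus a combining diaeresis
print(encoded_sizes('\U0001F600'))  # a code point outside the Basic Multilingual Plane
```

The two spellings of 'ö' look identical to a user but differ in code points, bytes, and words, which is what makes splitting on user-perceived characters expensive.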

However, whole string equality checking is the same whether using wide strings or not.
Further, splitting a UTF-8 string on an ASCII character is identical to splitting an ASCII string on an ASCII character.
If code on a microcontroller must do string manipulation then it could assert that a `string` only contains ASCII data by ceasing to process a string when it encounters a byte greater than 127.
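
The two ideas above can be sketched as follows.
The code is Python for readability, whereas a microcontroller would more likely use C or C++; the helper `ascii_prefix` is hypothetical.

```python
def ascii_prefix(data: bytes) -> bytes:
    """Return the leading bytes of data up to the first byte greater than 127."""
    for index, byte in enumerate(data):
        if byte > 127:
            # Stop processing at the first non-ASCII byte.
            return data[:index]
    return data


# In UTF-8 every byte of a multi-byte sequence is greater than 127,
# so splitting on an ASCII delimiter never cuts a code point.
fields = 'Wörld,ROS'.encode('utf-8').split(b',')

print(ascii_prefix('Wörld'.encode('utf-8')))
print(fields)
```

Neither operation needs a Unicode table, so both remain cheap in code size and processing time.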

## What does the API look like to a user of ROS 2?

### Python 3

In Python the `str` type will be used for both strings and wide strings.
Bytes of a known encoding should be converted to a `str` using [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) before being assigned to a field.

**Example**

```python
import rclpy
from test_msgs.msg import WStrings


if __name__ == '__main__':
    rclpy.init()

    node = rclpy.create_node('talker')

    chatter_pub = node.create_publisher(WStrings, 'chatter', 1)

    msg = WStrings()
    msg.wstring_value = 'Hello Wörld'
    print('Publishing: "{0}"'.format(msg.wstring_value))
    chatter_pub.publish(msg)
    node.destroy_node()
    rclpy.shutdown()

```
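
When the text arrives as bytes, for example from a file or a serial device, the `bytes.decode` step could look like the sketch below.
The helper `to_field_value` is hypothetical, and the caller is assumed to know the encoding of the incoming bytes.

```python
def to_field_value(raw: bytes, encoding: str) -> str:
    """Decode bytes of a known encoding into a str suitable for a message field."""
    # Strict decoding raises UnicodeDecodeError if raw does not match encoding.
    return raw.decode(encoding)


latin1_bytes = b'Hello W\xf6rld'  # 'ö' is the single byte 0xF6 in Latin-1
print(to_field_value(latin1_bytes, 'latin-1'))
```

The resulting `str` can be assigned to either a `string` or a `wstring` field, since Python uses `str` for both.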

### C++

In C++, the `wchar_t` type used by `std::wstring` has different sizes on different platforms (2 bytes on Windows, 4 bytes on Linux).
Instead ROS 2 will use `char16_t` for characters of wide strings, and `std::u16string` for wide strings themselves.

**Example**

```cpp
/*
 * Note that C++ source files containing Unicode characters must begin with a byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark .
 * Failure to do so can result in an incorrect encoding of the characters on Windows.
 * For an example, see https://github.com/ros2/system_tests/pull/362#issue-277436162
 */
#include <cstdio>
#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"

#include "test_msgs/msg/w_strings.hpp"

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);

  auto node = std::make_shared<rclcpp::Node>("talker");

  auto chatter_pub = node->create_publisher<test_msgs::msg::WStrings>("chatter", 10);

  test_msgs::msg::WStrings msg;
  std::u16string hello(u"Hello Wörld");
  msg.wstring_value = hello;
  chatter_pub->publish(msg);
  rclcpp::spin_some(node);

  return 0;
}
```