identify: length limit for unicode fields by lidel · Pull Request #491 · libp2p/specs

lidel · 2022-12-06T20:27:57Z

This PR formalizes ~~normalization and~~ a length limit per string field (agentVersion, protocolVersion, and the protocols array).

The goal is to reduce surprises and unify behavior across implementations.

Kubo PR: fix(cmds): cleanup unicode identify strings ipfs/kubo#9465

Specify length limit to unify version string handling across implementations. License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

marten-seemann

Can we make this UTF-8? I'd really like to not introduce any surprises here.

lidel · 2022-12-06T22:57:22Z

@marten-seemann a̹̖̪͔͖̝͘̕͞r̛̛̫̞̬̝͎̘̹͞ę̴͉͉̼̀͟ ̛͏̴͜͏͉̣͉̮͖̳̝͍̭̮̮͍͔͈̲͉͙̰ͅy̞̠̮̥̯̞̦͈̮̣̗̤͚͓̻͝ǫ̶̸̻̫͍̞͎̘͡u̦̮̭̬͔̝͇̠͟͡ ̦͇̘̤͔̝̥̜̘̙̭͙̱́͢s̸̛̠̝͓͚̬̦͇͘͜u͠͏̧̪͇̙͖̳̘̘͓͞͝ͅr̶̷̛̪̩̭̮̱͖̼̙̬̣͔̫͇̳̳̦͢ͅe̸̡̧̯̝͖̮̱͖̲̜̙̹͜͝ͅ ̧̰̟̻̳̗̱͕̰̜̠͇̘͔́͜͟y̡̢͉͙͎͎̹̖̲͇̰͟͢o̶̷͍̼̙̦̫̹̯̭͓̺̞͢ù͏̡̠̭̤̗͖͈͕͙̭͢ ̡̨̛͇͚̦̠̗͢͢w͏̦̗͍̫̀á̢͔͚̰̬͔͕̼̕ņ̴̛̲̜̮͈͙̰̀͘t̶̤̜̯͙̝̺̜̠̘̼͙̞͍̖̟͓̝̤͘͞ ͜͞͏͏͕̣͍̝͍̜͉̳̙̘̥͜t̛̪͔̯̱͓͝ǫ̸̨̢̤̘̰͉̬̤͓̙̼̖̞͖͟ ̷̨̢̳̣͈̪͙̦̻͉̗̹͓̤͖͕͘͢a̢̛͈̲͉͎̲͕͘l͠҉̧̛̞̥̟̜̠̯̰͙͎̬͜ͅl̸̵͍͓͓̼̬͙͍̰̭̟̖̪͍̀o͡͏̡̜̩̗̤͚̪̟̘̥̲̘̥͕̘͎̗̬͉͘͟w̡̡͖̼̟̣͙͕̥͈̮̜̩̟͈̫͘̕͠ ̘̻͕̬͓̗̠̕͞u̸̗͎̜̗͍̝͔̘͎̙̠̭̗͕t̴̤̺̫̮̩̀́͟͞ͅf̸͙̞̞̣̦͟8͏̞̥̮̙̙̲͉̰͖͉͚̤͇͠ ?

marten-seemann · 2022-12-06T23:11:36Z

Nice UTF-8 art! :)

I don't really see a reason not to. Limiting ourselves to ASCII is so 1990s style.

Winterhuman · 2022-12-07T13:39:25Z

Perhaps this could be phrased as: "Implementations should discard non-ASCII characters and trim the string to 64 characters, but may choose to allow UTF-8 characters if potential for UTF-8 art/mimicry is acceptable"

I definitely wouldn't want UTF-8 support to be outright gone; using UTF-8 in protocol names (maybe containing CIDs with UTF-8 encodings) could have a lot of potential use-cases. For the agent and version strings though, that's perfectly understandable.

Winterhuman · 2022-12-08T19:47:34Z

In fact, maybe better idea, what about:

"Implementations should trim the string to 64 characters. Implementations MAY allow UTF-8 characters in the string, however, these strings should be visible to users as both UTF-8 and ASCII punycode (per IETF RFC 3492) to protect against UTF-8 mimicry."

marten-seemann · 2022-12-08T20:52:50Z

I'd say let's fully embrace UTF-8. This is 2022, and we finally have a standard encoding that's universally supported.

Building on @Winterhuman's proposal:

"Strings are UTF-8 encoded. Implementations MAY trim the string to 64 characters. When made visible to users, implementations MAY output both UTF-8 and ASCII punycode (per IETF RFC 3492) to protect against UTF-8 mimicry."

@lidel

Adds guidance for safely handling untrusted Unicode strings from remote peers in the identify protocol:

- Add 128 rune (Unicode code points) limit as SHOULD requirement
- Add Unicode Sanitization section with guidance to:
  - Remove dangerous Unicode categories (Cc, Cf, Co, Cs)
  - Preserve legitimate international characters and emojis
  - Reference Unicode Standard Annex #44 for formal definitions
  - Note that sanitization is for display safety, not protocol restriction
- Update revision to r2, 2025-09-12
- Add @lidel to Interest Group

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel · 2025-09-12T18:11:35Z

Agreed: we need to keep Unicode to ensure this is not limited to the Latin alphabet.

👉 Changed this PR to effectively only introduce that limit of 128 Unicode runes per identify value – we DO need a ceiling here for security reasons, but it does not have to be super low.
- 💭 Take it or leave it – mainly wanted to write this down somewhere so it can be referenced in the future. Maybe this PR could be replaced with a single sentence, but I feel that if we don't spell out all the risks, like control sequences, implementers will not think about them when printing strings to a terminal.
Leaving it up to implementations to decide if they want to further sanitize (e.g. Kubo will strip control characters to avoid display issues in terminals etc), but included "Recommended sanitization steps" as SHOULD (strong recommendation).
- Proof of concept in fix(cmds): cleanup unicode identify strings ipfs/kubo#9465 only strips special control characters but allows Unicode up to 128 runes per value.
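
To make the "128 runes, not 128 bytes" distinction concrete, here is a minimal Go sketch of capping an untrusted identify value. The names `truncateCodePoints` and `maxCodePoints` are hypothetical and are not taken from the spec text or from the Kubo PR. The sketch also assumes that truncation is how the ceiling gets enforced, which this thread leaves to implementations.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxCodePoints mirrors the 128 code point ceiling discussed in this PR.
// (Hypothetical constant name, chosen for this illustration.)
const maxCodePoints = 128

// truncateCodePoints returns s cut to at most limit Unicode code points.
// It counts code points rather than bytes, so a multi-byte UTF-8 sequence
// is never split in the middle.
func truncateCodePoints(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	seen := 0
	for i := range s { // i is the byte offset where each code point starts
		if seen == limit {
			return s[:i]
		}
		seen++
	}
	return s // s already fits within the limit
}

func main() {
	// 200 copies of U+00E9, which takes two bytes in UTF-8.
	agent := strings.Repeat("\u00e9", 200)
	short := truncateCodePoints(agent, maxCodePoints)
	fmt.Println(utf8.RuneCountInString(short), len(short)) // prints "128 256"
}
```

A byte-based cut such as `s[:128]` would keep only 64 of these characters and could land in the middle of a multi-byte sequence for other inputs, which is why the limit is expressed in code points.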

MarcoPolo · 2025-09-12T18:16:43Z

This could be a good reference: https://www.rfc-editor.org/rfc/rfc9839.html

- add RFC 9839 reference for unicode string handling
- replace problematic characters with U+FFFD instead of removing
- remove private use characters (Co) from restricted list per RFC
- add note that deletion is a known security risk per RFC 9839

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>

remove whitespace trimming to align with RFC 9839's principle of avoiding silent deletion License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel · 2025-09-12T18:44:27Z

@MarcoPolo thanks! Pushed changes to reference that RFC and follow its idea: replacement with U+FFFD instead of silent removal.
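
A minimal Go sketch of the replace-instead-of-delete approach, for illustration only. `sanitizeForDisplay` is a hypothetical name, not the function used in Kubo or ipfs-check. The category list (Cc, Cf, Cs, with private use Co left alone) follows the commit messages in this thread. The sketch treats every Cf code point alike, including zero-width joiners used in some emoji sequences; whether the spec text makes exceptions is not stated here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeForDisplay replaces code points in the Unicode general categories
// Cc (control), Cf (format) and Cs (surrogate) with U+FFFD instead of
// deleting them. Private use code points (Co) pass through unchanged.
// Hypothetical helper for display safety; it does not restrict what is
// sent on the wire.
func sanitizeForDisplay(s string) string {
	return strings.Map(func(r rune) rune {
		// Go decodes invalid UTF-8 (including encoded surrogates) as
		// U+FFFD already, so the Cs check is kept only for clarity.
		if unicode.In(r, unicode.Cc, unicode.Cf, unicode.Cs) {
			return unicode.ReplacementChar // U+FFFD
		}
		return r
	}, s)
}

func main() {
	// An agent string carrying an ANSI escape sequence (ESC is U+001B, Cc).
	raw := "kubo/0.1\x1b[31m"
	// %+q escapes non-ASCII output so the replacement is visible in a log.
	fmt.Printf("%+q\n", sanitizeForDisplay(raw))
}
```

Because each offending code point is replaced one-for-one, the code point count of the value does not change, and a reader can still see that something was there.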

replace problematic unicode characters with U+FFFD instead of deleting them, following RFC 9839 guidance that deletion is a security risk. moves sanitization to version.go and adds comprehensive tests. refs: https://www.rfc-editor.org/rfc/rfc9839.html refs: libp2p/specs#491

Shows the AgentVersion from libp2p identify protocol below successful connections. Basic sanitization inspired by libp2p/specs#491

MarcoPolo

I'd just call them Unicode code points rather than also call them runes.

MarcoPolo · 2025-09-13T03:47:41Z

besides that, this lgtm

License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel · 2025-09-14T12:17:26Z

done

preserve private use characters as specified in libp2p/specs#491 enforce 128 rune limit on untrusted peer data

lidel requested a review from mxinden December 6, 2022 20:27

identify: ASCII-only version strings

9f717e2

Specify length limit to unify version string handling across implementations. License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel force-pushed the identify-version-string-limits branch from 7142a33 to 9f717e2 Compare December 6, 2022 20:37

lidel requested a review from marten-seemann December 6, 2022 20:48

marten-seemann reviewed Dec 6, 2022

View reviewed changes

lidel mentioned this pull request Dec 6, 2022

fix(cmds): cleanup unicode identify strings ipfs/kubo#9465

Merged

lidel changed the title ~~identify: ASCII-only version strings~~ identify: specify length limit per opaque string value Sep 12, 2025

lidel added 2 commits September 12, 2025 19:28

Merge remote-tracking branch 'origin/master' into identify-version-st…

2bf7e2c

…ring-limits License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel changed the title ~~identify: specify length limit per opaque string value~~ identify: length limit for unicode fields Sep 12, 2025

lidel requested review from MarcoPolo and sukunrt and removed request for mxinden September 12, 2025 18:11

lidel added 2 commits September 12, 2025 20:29

docs(identify): remove trim step per RFC 9839

0b667d2

remove whitespace trimming to align with RFC 9839's principle of avoiding silent deletion License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

lidel mentioned this pull request Sep 12, 2025

feat: display peer agent version in diagnostics ipfs/ipfs-check#105

Merged

lidel added a commit to ipfs/ipfs-check that referenced this pull request Sep 13, 2025

feat: display peer agent version in diagnostics (#105)

0f85a42

Shows the AgentVersion from libp2p identify protocol below successful connections. Basic sanitization inspired by libp2p/specs#491

MarcoPolo reviewed Sep 13, 2025

View reviewed changes

docs(identify): use 'Unicode code points' instead of 'runes'

86ddaf3

License: MIT Signed-off-by: Marcin Rataj <lidel@lidel.org>

MarcoPolo approved these changes Sep 15, 2025

View reviewed changes

lidel added a commit to ipfs/kubo that referenced this pull request Sep 19, 2025

fix(cmds): cleanup unicode identify strings (#9465)

f6a9b34

preserve private use characters as specified in libp2p/specs#491 enforce 128 rune limit on untrusted peer data

lidel wants to merge 6 commits into master from identify-version-string-limits
