Skip to content

identify: length limit for unicode fields#491

Open
lidel wants to merge 6 commits intomasterfrom
identify-version-string-limits
Open

identify: length limit for unicode fields#491
lidel wants to merge 6 commits intomasterfrom
identify-version-string-limits

Conversation

@lidel
Copy link
Copy Markdown
Member

@lidel lidel commented Dec 6, 2022

This PR formalizes normalization and length limit per string field ( agentVersion, protocolVersion and protocols array).

The goal is to reduce surprises and unify behavior across implementations.

@lidel lidel requested a review from mxinden December 6, 2022 20:27
Specify length limit to unify version string handling across implementations.

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
@lidel lidel force-pushed the identify-version-string-limits branch from 7142a33 to 9f717e2 Compare December 6, 2022 20:37
@lidel lidel requested a review from marten-seemann December 6, 2022 20:48
Copy link
Copy Markdown
Contributor

@marten-seemann marten-seemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this UTF-8? I'd really like to not introduce any surprises here.

@lidel
Copy link
Copy Markdown
Member Author

lidel commented Dec 6, 2022

@marten-seemann a̹̖̪͔͖̝͘̕͞r̛̛̫̞̬̝͎̘̹͞ę̴͉͉̼̀͟ ̛͏̴͜͏͉̣͉̮͖̳̝͍̭̮̮͍͔͈̲͉͙̰ͅy̞̠̮̥̯̞̦͈̮̣̗̤͚͓̻͝ǫ̶̸̻̫͍̞͎̘͡u̦̮̭̬͔̝͇̠͟͡ ̦͇̘̤͔̝̥̜̘̙̭͙̱́͢s̸̛̠̝͓͚̬̦͇͘͜u͠͏̧̪͇̙͖̳̘̘͓͞͝ͅr̶̷̛̪̩̭̮̱͖̼̙̬̣͔̫͇̳̳̦͢ͅe̸̡̧̯̝͖̮̱͖̲̜̙̹͜͝ͅ ̧̰̟̻̳̗̱͕̰̜̠͇̘͔́͜͟y̡̢͉͙͎͎̹̖̲͇̰͟͢o̶̷͍̼̙̦̫̹̯̭͓̺̞͢ù͏̡̠̭̤̗͖͈͕͙̭͢ ̡̨̛͇͚̦̠̗͢͢w͏̦̗͍̫̀á̢͔͚̰̬͔͕̼̕ņ̴̛̲̜̮͈͙̰̀͘t̶̤̜̯͙̝̺̜̠̘̼͙̞͍̖̟͓̝̤͘͞ ͜͞͏͏͕̣͍̝͍̜͉̳̙̘̥͜t̛̪͔̯̱͓͝ǫ̸̨̢̤̘̰͉̬̤͓̙̼̖̞͖͟ ̷̨̢̳̣͈̪͙̦̻͉̗̹͓̤͖͕͘͢a̢̛͈̲͉͎̲͕͘l͠҉̧̛̞̥̟̜̠̯̰͙͎̬͜ͅl̸̵͍͓͓̼̬͙͍̰̭̟̖̪͍̀o͡͏̡̜̩̗̤͚̪̟̘̥̲̘̥͕̘͎̗̬͉͘͟w̡̡͖̼̟̣͙͕̥͈̮̜̩̟͈̫͘̕͠ ̘̻͕̬͓̗̠̕͞u̸̗͎̜̗͍̝͔̘͎̙̠̭̗͕t̴̤̺̫̮̩̀́͟͞ͅf̸͙̞̞̣̦͟8͏̞̥̮̙̙̲͉̰͖͉͚̤͇͠ ?

@marten-seemann
Copy link
Copy Markdown
Contributor

Nice UTF-8 art! :)

I don't really see reason not to. Limiting ourselves to ASCII is so 1990s style.

@Winterhuman
Copy link
Copy Markdown

Perhaps this could be phrased as: "Implementations should discard non-ASCII characters and trim the string to 64 characters, but may choose to allow UTF-8 characters if potential for UTF-8 art/mimicry is acceptable"

I definitely wouldn't want UTF-8 support to be outright gone, using UTF-8 in protocol names (maybe containing CIDs with UTF-8 encodings) could have a lot of potential use-cases. For the agent and version strings though, that's perfectly understandable

@Winterhuman
Copy link
Copy Markdown

Winterhuman commented Dec 8, 2022

In fact, maybe better idea, what about:

"Implementations should trim the string to 64 characters. Implementations MAY allow UTF-8 characters in the string, however, these strings should be visible to users as both UTF-8 and ASCII punycode (per IETF RFC 3492) to protect against UTF-8 mimicry."

@marten-seemann
Copy link
Copy Markdown
Contributor

I'd say let's fully embrace UTF-8. This is 2022, and we finally have a standard encoding that's universally supported.

Building on @Winterhuman's proposal:

"Strings are UTF-8 encode. Implementations MAY trim the string to 64 characters. When made visible to users, implementations MAY output both UTF-8 and ASCII punycode (per IETF RFC 3492) to protect against UTF-8 mimicry."

@lidel lidel changed the title identify: ASCII-only version strings identify: specify length limit per opaque string value Sep 12, 2025
…ring-limits

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
Adds guidance for safely handling untrusted Unicode strings from remote
peers in the identify protocol:

- Add 128 rune (Unicode code points) limit as SHOULD requirement
- Add Unicode Sanitization section with guidance to:
  - Remove dangerous Unicode categories (Cc, Cf, Co, Cs)
  - Preserve legitimate international characters and emojis
  - Reference Unicode Standard Annex #44 for formal definitions
- Note that sanitization is for display safety, not protocol restriction
- Update revision to r2, 2025-09-12
- Add @lidel to Interest Group

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
@lidel lidel changed the title identify: specify length limit per opaque string value identify: length limit for unicode fields Sep 12, 2025
@lidel lidel requested review from MarcoPolo and sukunrt and removed request for mxinden September 12, 2025 18:11
@lidel
Copy link
Copy Markdown
Member Author

lidel commented Sep 12, 2025

Agreed: we need to keep unicode to ensure this is not limited to latin alphabet.

  • 👉 Changed this PR to effectively only introduce that limit of 128 unicode runes per identify value – we DO need a ceiling here for security reasons, but it does not have to be super low.

    • 💭 Take it or leave it – mainly wanted to write this down somewhere so it can be referenced inthe future. Maybe this PR could be replaced with a single sentence, but I feel if we dont spell out all the risks like control sequences, implementers will not thing about them when printing strings to temrinal.
  • Leaving it up to implementations to decide if they want to further sanitize (e.g. Kubo will strip control characters to avoid display issues in terminals etc), but included "Recommended sanitization steps" as SHOULD (strong recommendation).

@MarcoPolo
Copy link
Copy Markdown
Contributor

This could be a good reference: https://www.rfc-editor.org/rfc/rfc9839.html

- add RFC 9839 reference for unicode string handling
- replace problematic characters with U+FFFD instead of removing
- remove private use characters (Co) from restricted list per RFC
- add note that deletion is a known security risk per RFC 9839

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
remove whitespace trimming to align with RFC 9839's principle
of avoiding silent deletion

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
@lidel
Copy link
Copy Markdown
Member Author

lidel commented Sep 12, 2025

@MarcoPolo thanks! pushed changes to reference and follow idea from that RFC: replacement with U+FFFD instead of silent removal.

lidel added a commit to ipfs/ipfs-check that referenced this pull request Sep 12, 2025
replace problematic unicode characters with U+FFFD instead of deleting
them, following RFC 9839 guidance that deletion is a security risk.
moves sanitization to version.go and adds comprehensive tests.

refs: https://www.rfc-editor.org/rfc/rfc9839.html
refs: libp2p/specs#491
lidel added a commit to ipfs/ipfs-check that referenced this pull request Sep 13, 2025
Shows the AgentVersion from libp2p identify protocol below successful
connections. 

Basic sanitization inspired by libp2p/specs#491
Copy link
Copy Markdown
Contributor

@MarcoPolo MarcoPolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just call them Unicode code points rather than also call them runes.

@MarcoPolo
Copy link
Copy Markdown
Contributor

besides that, this lgtm

License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
@lidel
Copy link
Copy Markdown
Member Author

lidel commented Sep 14, 2025

done

lidel added a commit to ipfs/kubo that referenced this pull request Sep 19, 2025
preserve private use characters as specified
in libp2p/specs#491
enforce 128 rune limit on untrusted peer data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Triage

Development

Successfully merging this pull request may close these issues.

4 participants