Skip to content
This repository was archived by the owner on Nov 1, 2020. It is now read-only.

Conversation

@nattress
Copy link
Contributor

@nattress nattress commented Nov 10, 2016

Fix issue #1964 and provide robustness for mangled type names.

// restricted to that use only. Replace them if they happened to be used in any identifiers in
// the compilation input.
return _mangleForCplusPlus
? santizedName.Replace(EnterNameScopeSequence, "_AA_").Replace(ExitNameScopeSequence, "_VV_")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be easier to use a single character that is also acceptable as part of a C++ identifier? E.g. acute accent and cedilla are allowed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Or use < and > for when we're writing an object file, and the weird characters for C++ - basically I'm trying to see if we can get rid of the double String.Replace. If you look at the GC allocation profile of the compiler right now, we allocate a ton of strings because of how inefficient NameMangler already is)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep the C++ to ASCII as much as possible, partially because different compilers have different levels of support for unicode character identifiers and this emitted C++ code is supposed to be ultra portable. I'll do a quick survey of the current support in gcc / llvm without requiring -fextended-identifiers switch and see what we can expect to work broadly.

@jkotas
Copy link
Member

jkotas commented Nov 10, 2016

you look at the GC allocation profile of the compiler right now, we allocate a ton of strings

Agree ... the name mangling needs revamp, for both correctness and overall efficiency (do not allocate as much to construct the mangled names; and make the mangled names shorter):

Here are some things that I have been wondering about:

  • Do we need to include the namespaces in the typenames? What if we instead just use the type simple name; and disambiguate it if there is conflict - similar to what we do for methods.
  • Use shortcut for the system module or frequently used system types. E.g. The reference to System.String type today is something like: System_Private_CoreLib__System__String. What if it is just something like #String?
  • CppCodeGen and native codegen name mangling can be quite different
  • Store the mangled strings as UTF8
  • ...

I assume that this PR is fixing a problem that Simon is running into. I would be ok with it to get things unblocked; and make the overall cleanup separately later.

@nattress Could you please add a case to tests\src\Simple\Generics\Generics.cs that fails before this change and passes with it?

@jkotas jkotas mentioned this pull request Nov 10, 2016
@jkotas
Copy link
Member

jkotas commented Nov 10, 2016

Opened #2178

Copy link
Member

@MichalStrehovsky MichalStrehovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to keep SanitizeName allocation free probably doesn't matter given we're going to dig through everything anyway. LGTM

Fix issue 1964 and provide robustness for mangling type names.

Previously, there was no scoping around generic instantiation argument
lists in a mangled type name. That would allow situations where the
following two types would have the same mangled name:
[test]Gen1<[Test.CoreLib]System.Object[]>[]
[test]Gen1<[Test.CoreLib]System.Object[][]>

These would both mangle as
test_Foo_Gen_1__Test_CoreLib_System_Object__Array__Array.

To fix this, use reserved character sequences to denote a range of
generic instantiation arguments and for the appended marker for arrays,
byrefs, and pointer types.  For RyuJIT, use standard angle brackets
since object files allow them. For C++, we need to generate valid
identifiers, so use "_A_" to denote a beginning marker, and "_V_" for an
ending marker. If you tilt your head sideways, they sort of look like
angle brackets.  To prevent input code from clashing with these markers,
when sanitizing names, replace them with something else: _A_ => _AA_;
_V_ => _VV_.  We don't need to worry about that with RyuJIT since angle
brackets are already stripped during sanitization.

// Emit the Unicode byte order mark to prevent the MS C++ compiler treating the emitted
// C++ code as codepage 1251
_out.Write('\uFEFF');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just use the other constructor of StreamWriter here? The one that takes an explicit Encoding parameter and pass it Encoding.Utf8 - that one should emit a BOM.

You might also be fixing #663 with this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants