Detect data section string literal hash collisions#77061
Detect data section string literal hash collisions#77061jjonescz merged 8 commits intodotnet:mainfrom
Conversation
dd41ab3 to
f4c6064
Compare
src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs
Outdated
Show resolved
Hide resolved
| // If there is a hash collision, we cannot fallback to normal string literal emit strategy | ||
| // because the selection of which literal would get which emit strategy would not be deterministic. | ||
| var messageProvider = @this.ModuleBuilder.CommonCompilation.MessageProvider; | ||
| diagnostics.Add(messageProvider.CreateDiagnostic(messageProvider.ERR_DataSectionStringLiteralHashCollision, syntaxNode.GetLocation(), previousText)); |
|
Thinking more about this, I wonder if it's better to not emit an error.
|
| if (previousText != text) | ||
| { | ||
| // If there is a hash collision, we cannot fallback to normal string literal emit strategy | ||
| // because the selection of which literal would get which emit strategy would not be deterministic. |
There was a problem hiding this comment.
Would it be possible to delay the assignment of the type name that gets emitted into the binary until after the typedef token is assigned? Assuming that typedef tokens are deterministic, the name generated from the typedef token would be deterministic as well. Also, the unique names generated from the typedef tokens can be a lot shorter than the hash.
There was a problem hiding this comment.
It might be possible, but I'm not sure the compiler's internal object model is ready for that, the Name of the symbol is likely used in other places before the typedef token is assigned.
Chuck has suggested another alternative where we would collect the string literals during binding, sort them by length and content, and then assign indices to them (and names based on that) just before emit.
Also, as I mentioned above, currently we share the machinery for synthesizing data fields with array initializers and u8 literals, and these fields are named using sha256. So ideally we would change those too so they also get names based on indices instead of hashes.
I can add that to the spec as future work and go with an error for now.
|
It is an error to have duplicate type names in an assembly. From ECMA-335: "There shall be no duplicate rows in the TypeDef table, based on TypeNamespace+TypeName (unless this is a nested type - see below) [ERROR]". The runtime behavior for malformed binaries is undefined. I understand why it happens to work fine in the current runtime. I do not think it is a good idea for the compiler to generate malformed binaries silently even if it happens to work at the moment. It is better to produce an error. We have number of similar corner-case situations where the user needs to alter their code to workaround the internal compiler limitations. For example, very complex expression may fail to compile and users need to alter their code to make it work. I have checked the behavior of a few tools on duplicate type names: ildasm/ilasm roundtrip fails, native aot compilation happens to handle it gracefully. I would not be surprised if we find a tool with silent bad codegen for malformed input with duplicate type names. |
Motivated by this discussion: #76139 (comment)
TODO: