Skip to content

normalize targets to 80 char strings for ATN serialization segments.#3438

Merged
parrt merged 2 commits into
antlr:masterfrom
parrt:normalize-getSerializedATNSegmentLimit
Dec 27, 2021
Merged

normalize targets to 80 char strings for ATN serialization segments.#3438
parrt merged 2 commits into
antlr:masterfrom
parrt:normalize-getSerializedATNSegmentLimit

Conversation

@parrt

@parrt parrt commented Dec 27, 2021

Copy link
Copy Markdown
Member

@ericvergnaud @KvanTTT How does this look? Tests appear to pass locally.

Comment thread tool/src/org/antlr/v4/codegen/Target.java
@parrt

parrt commented Dec 27, 2021

Copy link
Copy Markdown
Member Author

Well, @ericvergnaud is the one who cares and suggests 80 so let's go old school :)

@parrt

parrt commented Dec 27, 2021

Copy link
Copy Markdown
Member Author

crap. i got the -q removal stuck in this PR too. :( oops. already in master. oh well.

@parrt parrt merged commit a16b5ef into antlr:master Dec 27, 2021
@ericvergnaud

Copy link
Copy Markdown
Contributor

well not sure what's going on but it seems that the value returned by getSerializedATNSegmentLimit is not used...

@parrt

parrt commented Dec 28, 2021

Copy link
Copy Markdown
Member Author

For which targets? Wow. SerializedATN.getSegments() isn't called anywhere (unless reflection). Looks like constructor does all the work to get string:

	public SerializedATN(OutputModelFactory factory, ATN atn) {
		super(factory);
		IntegerList data = ATNSerializer.getSerialized(atn);
		serialized = new ArrayList<String>(data.size());
		for (int c : data.toArray()) {
			String encoded = factory.getGenerator().getTarget().encodeIntAsCharEscape(c == -1 ? Character.MAX_VALUE : c);
			serialized.add(encoded);
		}
//		System.out.println(ATNSerializer.getDecoded(factory.getGrammar(), atn));
	}

oh! the template can call <segments>. Here's PHP:

private const SERIALIZED_ATN =
	<model.segments:{segment| "<segment; wrap={" .<\n>"}>"}; separator=" .\n">;

and java:

<model.segments:{segment|private static final String _serializedATNSegment<i0> =
	"<segment; wrap={"+<\n><\t>"}>";}; separator="\n">
public static final String _serializedATN = Utils.join(
	new String[] {
		<model.segments:{segment | _serializedATNSegment<i0>}; separator=",\n">
	},
	""
);

@KvanTTT

KvanTTT commented Jan 21, 2022

Copy link
Copy Markdown
Member

well not sure what's going on but it seems that the value returned by getSerializedATNSegmentLimit is not used...

Actually, it's called internally by StringTemplate, and value from getSerializedATNSegmentLimit is used (I don't know how it works exactly).

@KvanTTT

KvanTTT commented Jan 27, 2022

Copy link
Copy Markdown
Member

Unfortunately, in the end, it's a very bad change. It's limit not for line length (it's set up internally by StringTemplate during wrapping) but for the maximum length of string or char array. And now in generated code, we have a lot of useless small 80-element arrays for targets that consider ATN segment limit (Dart, C++, PHP).

Also, this limit is actual only for Java target since other targets don't have 65535 limit.

I suggest reverting 80 to Integer.MAX_VALUE by default.

@parrt

parrt commented Jan 29, 2022

Copy link
Copy Markdown
Member Author

The getSerializedATNSegmentLimit is used to prevent strings longer than 65535 characters for the serialized ATN. If we have a serialized ATN that requires something larger than that, it has to be split into multiple strings that are joined at runtime. An obvious case where this will happen is when the number of ATN states goes beyond 16 bits.

I believe this also has the effect of generating strings on multiple lines instead of one contiguous long line. @ericvergnaud wants to continue keeping it for the various targets he manages because it makes it easier to see the generated code in the editor. In the case of Java, it would slow things down a little bit and cause memory allocation if we set it to 80 or whatever. In my case, I prefer the full 64k string to avoid multiple allocations.

@KvanTTT

KvanTTT commented Jan 29, 2022

Copy link
Copy Markdown
Member

The getSerializedATNSegmentLimit is used to prevent strings longer than 65535 characters for the serialized ATN.

Yes, that's why it should not be changed to 80. It creates a lot of unwanted small fragments for targets that respect segments (PHP, C++, Dart). I've fixed it in the latest PR. It looks like this parameter is only relevant for Java.

I believe this also has the effect of generating strings on multiple lines instead of one contiguous long line.

It doesn't affect string splitting because StringTemplate is responsible for string wrapping.

In the case of Java, it would slow things down a little bit and cause memory allocation if we set it to 80 or whatever.

Only on the compiler step. Java compiler folds constant strings into a single one in .class file.

@ericvergnaud

ericvergnaud commented Jan 30, 2022 via email

Copy link
Copy Markdown
Contributor

@ericvergnaud

ericvergnaud commented Jan 30, 2022 via email

Copy link
Copy Markdown
Contributor

@KvanTTT

KvanTTT commented Jan 30, 2022

Copy link
Copy Markdown
Member

The limit is not about multiline strings, it's about maximum string length that is only actual for Java because of 65535 limit.
It's worth changing the default value back to Int.MaxValue. Is that ok for you that C++ has a lot of 80-length string arrays for ATN? https://gist.github.com/KvanTTT/58496ebf63e32498ff644c43deb5bd96#file-largelexer-cpp-L1183

@parrt

parrt commented Jan 30, 2022

Copy link
Copy Markdown
Member Author

My own preference would be to see one big array of integers wrapped appropriately at some line length, rather than that just with lots of different segments. Unless a target has a limit, we should probably set the segment length to max integer.

I think we should separate the idea of line length, which can be handled in templates, from the length of the ATN serialization list of integers. I don't think any language but Java has a string length issue in terms of compilation or storing in a generated binary file. @ericvergnaud points out that most targets support multi line strings so they can be built with appropriate template actions.

In the end, my vote is to set the segment length to be max int and then change the templates to solve the issue of wrapping that Eric originally pointed out. E.g., the java target already does that trivially:

public static final String _serializedATN =
	"<model.serialized; wrap={"+<\n><\t>"}>";

@ericvergnaud how will it affect JavaScript or other targets if we moved to max int? Can we tweak the templates to satisfy your needs?

@parrt

parrt commented Jan 30, 2022

Copy link
Copy Markdown
Member Author

(ANTLR history/info Is slowly filtering back into my old brain) woot!

@KvanTTT

KvanTTT commented Jan 30, 2022

Copy link
Copy Markdown
Member

I've checked, all targets have wrapping (over string or arrays).

@parrt

parrt commented Jan 30, 2022

Copy link
Copy Markdown
Member Author

I've checked, all targets have wrapping (over string or arrays).

excellent. Ok, so this covers the primary concern that @ericvergnaud had originally. I will look at this after we figure out the UU ID and ATN serialization version thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants