Skip to content

[java] Symbol reference framework#1566

Merged
jsotuyod merged 43 commits into
pmd:java-grammarfrom
oowekyala:sym-reference-framework
Oct 23, 2019
Merged

[java] Symbol reference framework#1566
jsotuyod merged 43 commits into
pmd:java-grammarfrom
oowekyala:sym-reference-framework

Conversation

@oowekyala

@oowekyala oowekyala commented Jan 9, 2019

Copy link
Copy Markdown
Member

This is the first third of #1536. It contains interfaces to represent code declarations (e.g. a class/ method/ field declaration) that abstracts over whether the class was found through reflection or by parsing a source file.

The goals are

  • To unify the representation of Symbol table and type resolution. For now they barely communicate at all, but with a common representation, they could do much more, and better split their responsibilities.
    • Symbol table uses AST nodes, nothing else
    • Type resolution uses Class and other reflection APIs, it even reads the bytecode of the source files.
    • Symbols try to bridge the gap
  • That comprises a rewrite of the symbol table framework. The documentation of JSymbolTable explains a little where the current one falls short of being useful, and wouldn't scale. Eventually type resolution will probably have to be touched as well.
  • To provide a solution to [java] Resolve explicit types using FQCNs, without hitting the classloader #1207 (eventually), using JResolvableClassSymbol, and that more powerful symbol table. JResolvableClassSymbol allows a "best effort" strategy while we don't have multifile analysis.
  • To provide something very abstract that may be used to implement multifile analysis eventually. Symbols don't provide guarantees about being backed either by a Class instance or by a Node. Similarly, they could be backed by serialized instances, a database, whatever. At the very least I'd like to see soon a global in-memory cache of JClassSymbols to avoid hitting the classloader so much (which in particular will be useful when looking for symbols through type hierarchies). Changing the back-end of that cache to an on-disk store can be though of as a possible way to achieve multifile analysis.

For comparison, this kind of framework is e.g. found in Spoon (code analysis and transformation processor). The explanation on that page is very relevant to us. IntelliJ does some voodoo stuff with symbolic AST nodes backed by an on-disk serialized index, that trigger a full file reparse if you ask them a question that the index doesn't know... we probably don't need that do we

About VCS history:

  • New symbol table #1536 was not ready and already big, so I split it in two: the reference framework first, and then the new symbol table implementations.
  • I then added tests on the reference framework and fleshed out my implementation, so that it again became too big for a reasonable PR. So I resplit it in two, isolating interfaces from implementation.
  • So basically:
    • This PR only contains the interfaces I plan to work with, plus all the changes a bit everywhere that I was lazy to hunt down. Since I worked on the implementation directly, many comments or code changes may seem irrelevant for now, I'm sorry about that.
    • The next set of PRs contains the implementations of e.g. JClassSymbol and the like, but not JSymbolTable. I'll try to split that into one PR for each symbol impl or so. The removals to be reverted are in 66b7f03. I'll open an independent PR to revert a96c570 ([java] Share utilities defined in ClassTypeResolverTest #1567)
    • The following PR adds implementations for JSymbolTable. The changes to be reverted are in ae2e94d, but it needs some adjustments. I included the interface so that you have a clearer view of what I'm talking about, though without the tests I understand it's hard to follow.

Navigating the changes

  • JDeclarationSymbol is the main interface, the other interfaces are more specific.
  • JSymbolTable is the interface for the new symbol tables.

@ghost

ghost commented Jan 9, 2019

Copy link
Copy Markdown
1 Message
📖 No java rules are changed!

Generated by 🚫 Danger

@jsotuyod

Copy link
Copy Markdown
Member

Thanks for the PR, I'm still trying to fully grasp your vision as for where this is going, so please, bare with me…

  • Symbol table uses AST nodes, nothing else

This is not exactly true... for overloaded methods, you need type resolution to known which method is actually being called (which in turn may affect the return type of that method, which feeds back to the type resolution).

Symbol table and type resolution are very intertwined; I believe the best approach is to have a few self-contained helper objects for specific tasks:

  • modeling scopes
  • type resolver (ie: TypeSet), which abstracts the order and rules for resolving a reference to a Class
  • looking up symbols (simple matching, considering scopes, but it may need external disambiguation)
  • etc.

But a standalone object will have to drive the whole process, calling upon these helpers at specific points on it's own; having these operations as part of the logic, and not separate standalone stages.

I'm a little worried about this... the current status has 2 issues I think are very important to address:

  1. I don't see the doc being explicit as to when a caller can rely on the underlaying node being available. As a rule developer, if I rely on it, my rule may not be matching properly if it's not present; and if I can't rely on it, I'd probably avoid that method altogether… maybe removing that method is for the best? or be very clear as to when the node will be available.
  2. The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

It's not explicit, but you are thinking of having 2 implementations for each interface, right? one using the node (where the overhead is simply the wrapper object), and another one where all fields are populated (ie: taken from a file storage)…

So, on multifile, you would need a first pass on all files to populate this cache? we should be aware of not keeping all files' AST in memory.

@oowekyala

Copy link
Copy Markdown
Member Author

I believe the best approach is to have a few self-contained helper objects for specific tasks:

  • modeling scopes
  • type resolver (ie: TypeSet), which abstracts the order and rules for resolving a reference to a Class
  • looking up symbols (simple matching, considering scopes, but it may need external disambiguation)

The meaning of a name depends on the context the name is found in, and so on the declarations currently in scope, and on their relative precedence, w.r.t. shadowing rules. So essentially, resolving names can't be separated from the scope model. JSymbolTable addresses your three points with the single mechanism of delegation, and encodes the precedence rules into the structure of the symbol table stack. That obviates the need for a master algorithm that knows about all the rules, and is easier to unit test I think.

But a standalone object will have to drive the whole process, calling upon these helpers at specific points on it's own; having these operations as part of the logic, and not separate standalone stages.

The approach I'm thinking of for now is the following:

  • A first stage resolves symbol tables and attach them to the relevant AST nodes.
  • Then a second stage can come-in and disambiguate syntactically ambiguous names based on the symbol table.
  • Then ClassTypeResolver can come in and resolve the types of expressions.

Conceptually, the third stage (type resolution) can easily be made lazy, provided the scope information has already been built. The typing of an expression only depends on the typing of the symbols it references (and its syntactic context, but we have an AST node), so that if references can be resolved, we can resolve types on-demand. Reference resolving needs a more global approach that takes the structure of the whole compilation unit into account, which is why I don't think the first stage can be optional.

maybe removing that method is for the best? or be very clear as to when the node will be available.

Yes it maybe would be best to remove it, I don't expect it to be very useful. Ideally the JSymbolElement would present all useful information directly. That is internal API for now so we can decide on that later I think

The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

I don't think this will be very noticeable, especially if we try to cache symbols/ be somewhat lazy. We'll have to be careful but I'm not too worried. I would think that AST nodes and Tokens would be orders of magnitude heavier

It's not explicit, but you are thinking of having 2 implementations for each interface, right?

In fact what I've implemented uses the node or Class at the time of construction, and so uses a single implementation for each interface. What you propose may be better, idk. I'd rather talk about that on the implementation PRs

@jsotuyod

jsotuyod commented Jan 22, 2019

Copy link
Copy Markdown
Member

The meaning of a name depends on the context the name is found in, and so on the declarations currently in scope, and on their relative precedence, w.r.t. shadowing rules. So essentially, resolving names can't be separated from the scope model.

Yes, I would expect the scope object to handle this, ie: qualify reference types with the proper parent types from the current file. Actually, some of this is already done within the current Scope implementation if I recall correctly… (although probably missing some edge cases).

However, going from this "qualified" name to the actual type may be split apart if needed.

JSymbolTable addresses your three points with the single mechanism of delegation, and encodes the precedence rules into the structure of the symbol table stack. That obviates the need for a master algorithm that knows about all the rules, and is easier to unit test I think.

I'll dig deeper into it and get back to you.

The approach I'm thinking of for now is the following:

  • A first stage resolves symbol tables and attach them to the relevant AST nodes.
  • Then a second stage can come-in and disambiguate syntactically ambiguous names based on the symbol table.
  • Then ClassTypeResolver can come in and resolve the types of expressions.

I'm unsure I follow. As already stated, disambiguation needs type resolution, how can that be a third stage? how is the second stage going to disambiguate without extra info?

Conceptually, the third stage (type resolution) can easily be made lazy, provided the scope information has already been built. The typing of an expression only depends on the typing of the symbols it references (and its syntactic context, but we have an AST node), so that if references can be resolved, we can resolve types on-demand. Reference resolving needs a more global approach that takes the structure of the whole compilation unit into account, which is why I don't think the first stage can be optional.

Actually, I think the one thing you need to do beforehand is creating the scope hierarchy, noting all declarations and "qualifying" types.

Both, type resolution and symbol-usage finding can be made on-demand (for performance we may want to index all occurrences of symbol names, and on-demand simply disambiguate among them). If type resolution results are stored into the nodes (as currently done), then these can be memoized to have each node resolved at most once.

The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

I don't think this will be very noticeable, especially if we try to cache symbols/ be somewhat lazy. We'll have to be careful but I'm not too worried. I would think that AST nodes and Tokens would be orders of magnitude heavier

Bare in mind, we keep the AST of a single file per thread in memory at a time. When a thread moves on to the next file, the whole AST and tokens are dropped for garbage collection. We never keep all the files analyzed in memory. For multifile analysis keeping these objects for all files is orders of magnitude more memory than we currently use.

@oowekyala

oowekyala commented Jan 22, 2019

Copy link
Copy Markdown
Member Author

However, going from this "qualified" name to the actual type may be split apart if needed.

Yes, there is a semantic gap between names and actual types. JSymbolTable only cares about resolving unqualified names, types can be resolved later-on.

I'll dig deeper into it and get back to you.

I wrote some more extensive documentation about how those shadowing rules can be encoded in the stack on JSymbolTable, but I planned to introduce it in some later PR. Here are some links:

Yes, I would expect the scope object to handle this, ie: qualify reference types with the proper parent types from the current file.

Actually, I think the one thing you need to do beforehand is creating the scope hierarchy, noting all declarations and "qualifying" types.

JSymbolTable doesn't qualify anything in itself. It only holds the knowledge to find the symbol denoted by a given simple name, taking care of the shadowing rules. That way later stages like actual type resolution and disambiguation drive their analysis while supporting them on the table when a reference needs to be resolved, but the table itself is not e.g. a type resolver, just a support system.

Tables also can resolve for type names, value names, and method names, provided they're unqualified. The shadowing rules for those different kinds of names are so similar that we can shoot the three birds with one stone.

I'm unsure I follow. As already stated, disambiguation needs type resolution, how can that be a third stage? how is the second stage going to disambiguate without extra info?

You're wrong about that. Disambiguation occurs for names, not types, so it doesn't need type resolution. It only needs to occur for names that were syntactically classified as ambiguous. The JLS goes into detail about how reclassification is done, and notice it doesn't mention types at all, only names, and the scope of declarations. Notice that the reclassification algorithm is iterative and starts from the leftmost name, ie. the unqualified one. That one can be resolved to a symbol by the symbol table.

The first phase models the scope of declarations, so that after it, we have all the info we need to perform disambiguation. Having disambiguation occur in a second phase allows to separate concerns and take care of obscuring rules separately from shadowing rules. Shadowing rules can be encoded in the symbol table stack because they apply to a whole extent of program points indiscriminately. Obscuring depends on the actual context of individual name references, and only occurs if a type and a value with the same name are in scope at the same time, and they could both occur in the position of the reference. Since JSymbolTable has separate channels to find a type name and a value name, this second pass can use it without the symbol table having to care.

Btw symbol tables use the auxclasspath to know about members that were declared in e.g. superclasses of the current class. This is distinct from type resolution itself.

If type resolution results are stored into the nodes (as currently done)

Yes this should be the case in general for expressions. But symbols should also carry type information

For multifile analysis keeping these objects for all files is orders of magnitude more memory than we currently use.

I never implied keeping all ASTs in memory, just keeping some symbols in memory. If we keep the getNode method of JElementSymbol, then it would be backed by a weak reference so as not to prevent garbage collection of the AST. If we remove it it's even simpler. Basically, the AST has references to JSymbolTables, and JSymbolTables create new JElementSymbols on demand, or reuse the ones that were already found (bc then the type resolution would not need to be duplicated).

@oowekyala

Copy link
Copy Markdown
Member Author

@jsotuyod May I merge this? Not in the main 7.0.x branch if you still have reservations. I think bringing in pieces of the implementation will drive this discussion forward, and I wouldn't like us to stall much longer on the very first PR.

@jsotuyod jsotuyod left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had agreed to proceed with this, but just in case…

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place where we use a stream instead of a list, any reason for that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, looking up a method may involve looking up all the supertypes of the class in scope (and enclosing classes and their supertypes, for inner/anonymous classes). So since this might be costly, doing it lazily and stopping the hierarchy exploration as soon as we find a matching method might improve performance. A return type that is not a list also documents that the methods may not actually be prefetched.

But maybe all of this should be mentioned on the doc of the method.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, unless the method I'm looking for has 0 args / I have an EXACT match on all parameters, I'll always need all to determine which overload best fits my arguments and that requires the complete list.

Probably most methods will fit that bill 'though…

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be... Tbh it's too soon to say, we'll have to profile this for different scenarios when it's implemented and used by type resolution.

@oowekyala oowekyala changed the base branch from pmd/7.0.x to java-grammar October 6, 2019 23:15
@oowekyala oowekyala force-pushed the sym-reference-framework branch from b976d22 to b5ab656 Compare October 6, 2019 23:15
@oowekyala

oowekyala commented Oct 6, 2019

Copy link
Copy Markdown
Member Author

@jsotuyod I rebased this branch onto java-grammar, since it missed a lot of updates.

I'd actually like to merge this only after #2034, since there will be some more conflicts

@oowekyala oowekyala force-pushed the sym-reference-framework branch from 4f8f68f to 12a52bc Compare October 7, 2019 15:05
Comment thread pmd-java/src/main/java/net/sourceforge/pmd/lang/java/JavaLanguageHandler.java Outdated
@jsotuyod

jsotuyod commented Oct 7, 2019

Copy link
Copy Markdown
Member

@oowekyala build is failing due to a missing symbol… maybe a bad rebase?

@oowekyala oowekyala force-pushed the sym-reference-framework branch from 799b517 to cb65fa3 Compare October 7, 2019 23:47
@jsotuyod jsotuyod merged commit b5caa8c into pmd:java-grammar Oct 23, 2019
@oowekyala oowekyala deleted the sym-reference-framework branch December 9, 2019 11:20
@oowekyala oowekyala added the in:symbol-table Affects the symbol table code label Apr 26, 2020
@adangel adangel mentioned this pull request Jan 23, 2023
55 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

in:symbol-table Affects the symbol table code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants