[java] Symbol reference framework by oowekyala · Pull Request #1566 · pmd/pmd

oowekyala · 2019-01-09T01:00:55Z

This is the first third of #1536. It contains interfaces to represent code declarations (e.g. a class/ method/ field declaration) that abstracts over whether the class was found through reflection or by parsing a source file.

The goals are

To unify the representation of Symbol table and type resolution. For now they barely communicate at all, but with a common representation, they could do much more, and better split their responsibilities.
- Symbol table uses AST nodes, nothing else
- Type resolution uses Class and other reflection APIs, it even reads the bytecode of the source files.
- Symbols try to bridge the gap
That comprises a rewrite of the symbol table framework. The documentation of JSymbolTable explains a little where the current one falls short of being useful, and wouldn't scale. Eventually type resolution will probably have to be touched as well.
To provide a solution to [java] Resolve explicit types using FQCNs, without hitting the classloader #1207 (eventually), using JResolvableClassSymbol, and that more powerful symbol table. JResolvableClassSymbol allows a "best effort" strategy while we don't have multifile analysis.
To provide something very abstract that may be used to implement multifile analysis eventually. Symbols don't provide guarantees about being backed either by a Class instance or by a Node. Similarly, they could be backed by serialized instances, a database, whatever. At the very least I'd like to see soon a global in-memory cache of JClassSymbols to avoid hitting the classloader so much (which in particular will be useful when looking for symbols through type hierarchies). Changing the back-end of that cache to an on-disk store can be though of as a possible way to achieve multifile analysis.

For comparison, this kind of framework is e.g. found in Spoon (code analysis and transformation processor). The explanation on that page is very relevant to us. IntelliJ does some voodoo stuff with symbolic AST nodes backed by an on-disk serialized index, that trigger a full file reparse if you ask them a question that the index doesn't know... we probably don't need that do we

About VCS history:

New symbol table #1536 was not ready and already big, so I split it in two: the reference framework first, and then the new symbol table implementations.
I then added tests on the reference framework and fleshed out my implementation, so that it again became too big for a reasonable PR. So I resplit it in two, isolating interfaces from implementation.
So basically:
- This PR only contains the interfaces I plan to work with, plus all the changes a bit everywhere that I was lazy to hunt down. Since I worked on the implementation directly, many comments or code changes may seem irrelevant for now, I'm sorry about that.
- The next set of PRs contains the implementations of e.g. JClassSymbol and the like, but not JSymbolTable. I'll try to split that into one PR for each symbol impl or so. The removals to be reverted are in 66b7f03. I'll open an independent PR to revert a96c570 ([java] Share utilities defined in ClassTypeResolverTest #1567)
- The following PR adds implementations for JSymbolTable. The changes to be reverted are in ae2e94d, but it needs some adjustments. I included the interface so that you have a clearer view of what I'm talking about, though without the tests I understand it's hard to follow.

Navigating the changes

JDeclarationSymbol is the main interface, the other interfaces are more specific.
JSymbolTable is the interface for the new symbol tables.

ghost · 2019-01-09T01:14:31Z

	1 Message
📖	No java rules are changed!

Generated by 🚫 Danger

jsotuyod · 2019-01-21T19:10:14Z

Thanks for the PR, I'm still trying to fully grasp your vision as for where this is going, so please, bare with me…

Symbol table uses AST nodes, nothing else

This is not exactly true... for overloaded methods, you need type resolution to known which method is actually being called (which in turn may affect the return type of that method, which feeds back to the type resolution).

Symbol table and type resolution are very intertwined; I believe the best approach is to have a few self-contained helper objects for specific tasks:

modeling scopes
type resolver (ie: TypeSet), which abstracts the order and rules for resolving a reference to a Class
looking up symbols (simple matching, considering scopes, but it may need external disambiguation)
etc.

But a standalone object will have to drive the whole process, calling upon these helpers at specific points on it's own; having these operations as part of the logic, and not separate standalone stages.

To provide a solution to [java] Resolve explicit types using FQCNs, without hitting the classloader #1207 (eventually), using JResolvableClassSymbol, and that more powerful symbol table. JResolvableClassSymbol allows a "best effort" strategy while we don't have multifile analysis.

I'm a little worried about this... the current status has 2 issues I think are very important to address:

I don't see the doc being explicit as to when a caller can rely on the underlaying node being available. As a rule developer, if I rely on it, my rule may not be matching properly if it's not present; and if I can't rely on it, I'd probably avoid that method altogether… maybe removing that method is for the best? or be very clear as to when the node will be available.
The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

It's not explicit, but you are thinking of having 2 implementations for each interface, right? one using the node (where the overhead is simply the wrapper object), and another one where all fields are populated (ie: taken from a file storage)…

So, on multifile, you would need a first pass on all files to populate this cache? we should be aware of not keeping all files' AST in memory.

oowekyala · 2019-01-21T23:33:35Z

I believe the best approach is to have a few self-contained helper objects for specific tasks:

modeling scopes

type resolver (ie: TypeSet), which abstracts the order and rules for resolving a reference to a Class

looking up symbols (simple matching, considering scopes, but it may need external disambiguation)

The meaning of a name depends on the context the name is found in, and so on the declarations currently in scope, and on their relative precedence, w.r.t. shadowing rules. So essentially, resolving names can't be separated from the scope model. JSymbolTable addresses your three points with the single mechanism of delegation, and encodes the precedence rules into the structure of the symbol table stack. That obviates the need for a master algorithm that knows about all the rules, and is easier to unit test I think.

But a standalone object will have to drive the whole process, calling upon these helpers at specific points on it's own; having these operations as part of the logic, and not separate standalone stages.

The approach I'm thinking of for now is the following:

A first stage resolves symbol tables and attach them to the relevant AST nodes.
Then a second stage can come-in and disambiguate syntactically ambiguous names based on the symbol table.
Then ClassTypeResolver can come in and resolve the types of expressions.

Conceptually, the third stage (type resolution) can easily be made lazy, provided the scope information has already been built. The typing of an expression only depends on the typing of the symbols it references (and its syntactic context, but we have an AST node), so that if references can be resolved, we can resolve types on-demand. Reference resolving needs a more global approach that takes the structure of the whole compilation unit into account, which is why I don't think the first stage can be optional.

maybe removing that method is for the best? or be very clear as to when the node will be available.

Yes it maybe would be best to remove it, I don't expect it to be very useful. Ideally the JSymbolElement would present all useful information directly. That is internal API for now so we can decide on that later I think

The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

I don't think this will be very noticeable, especially if we try to cache symbols/ be somewhat lazy. We'll have to be careful but I'm not too worried. I would think that AST nodes and Tokens would be orders of magnitude heavier

It's not explicit, but you are thinking of having 2 implementations for each interface, right?

In fact what I've implemented uses the node or Class at the time of construction, and so uses a single implementation for each interface. What you propose may be better, idk. I'd rather talk about that on the implementation PRs

jsotuyod · 2019-01-22T03:30:35Z

The meaning of a name depends on the context the name is found in, and so on the declarations currently in scope, and on their relative precedence, w.r.t. shadowing rules. So essentially, resolving names can't be separated from the scope model.

Yes, I would expect the scope object to handle this, ie: qualify reference types with the proper parent types from the current file. Actually, some of this is already done within the current Scope implementation if I recall correctly… (although probably missing some edge cases).

However, going from this "qualified" name to the actual type may be split apart if needed.

JSymbolTable addresses your three points with the single mechanism of delegation, and encodes the precedence rules into the structure of the symbol table stack. That obviates the need for a master algorithm that knows about all the rules, and is easier to unit test I think.

I'll dig deeper into it and get back to you.

The approach I'm thinking of for now is the following:

A first stage resolves symbol tables and attach them to the relevant AST nodes.

Then a second stage can come-in and disambiguate syntactically ambiguous names based on the symbol table.

Then ClassTypeResolver can come in and resolve the types of expressions.

I'm unsure I follow. As already stated, disambiguation needs type resolution, how can that be a third stage? how is the second stage going to disambiguate without extra info?

Conceptually, the third stage (type resolution) can easily be made lazy, provided the scope information has already been built. The typing of an expression only depends on the typing of the symbols it references (and its syntactic context, but we have an AST node), so that if references can be resolved, we can resolve types on-demand. Reference resolving needs a more global approach that takes the structure of the whole compilation unit into account, which is why I don't think the first stage can be optional.

Actually, I think the one thing you need to do beforehand is creating the scope hierarchy, noting all declarations and "qualifying" types.

Both, type resolution and symbol-usage finding can be made on-demand (for performance we may want to index all occurrences of symbol names, and on-demand simply disambiguate among them). If type resolution results are stored into the nodes (as currently done), then these can be memoized to have each node resolved at most once.

The defined interfaces duplicate lots of the properties the node already have (as defined in AccessNode, JavaQualifiableNode, etc.). Aren't we going to get to memory intensive on this process?

I don't think this will be very noticeable, especially if we try to cache symbols/ be somewhat lazy. We'll have to be careful but I'm not too worried. I would think that AST nodes and Tokens would be orders of magnitude heavier

Bare in mind, we keep the AST of a single file per thread in memory at a time. When a thread moves on to the next file, the whole AST and tokens are dropped for garbage collection. We never keep all the files analyzed in memory. For multifile analysis keeping these objects for all files is orders of magnitude more memory than we currently use.

oowekyala · 2019-01-22T11:44:07Z

However, going from this "qualified" name to the actual type may be split apart if needed.

Yes, there is a semantic gap between names and actual types. JSymbolTable only cares about resolving unqualified names, types can be resolved later-on.

I'll dig deeper into it and get back to you.

I wrote some more extensive documentation about how those shadowing rules can be encoded in the stack on JSymbolTable, but I planned to introduce it in some later PR. Here are some links:

About what's currently implemented and more precise spec
About type declaration symbol tables
The tests of the current implementation are quite explicit at showing how each symbol table manages its own set of declaration independently of both upstream and downstream, and how nesting them allows transparently encoding shadowing

Yes, I would expect the scope object to handle this, ie: qualify reference types with the proper parent types from the current file.

Actually, I think the one thing you need to do beforehand is creating the scope hierarchy, noting all declarations and "qualifying" types.

JSymbolTable doesn't qualify anything in itself. It only holds the knowledge to find the symbol denoted by a given simple name, taking care of the shadowing rules. That way later stages like actual type resolution and disambiguation drive their analysis while supporting them on the table when a reference needs to be resolved, but the table itself is not e.g. a type resolver, just a support system.

Tables also can resolve for type names, value names, and method names, provided they're unqualified. The shadowing rules for those different kinds of names are so similar that we can shoot the three birds with one stone.

I'm unsure I follow. As already stated, disambiguation needs type resolution, how can that be a third stage? how is the second stage going to disambiguate without extra info?

You're wrong about that. Disambiguation occurs for names, not types, so it doesn't need type resolution. It only needs to occur for names that were syntactically classified as ambiguous. The JLS goes into detail about how reclassification is done, and notice it doesn't mention types at all, only names, and the scope of declarations. Notice that the reclassification algorithm is iterative and starts from the leftmost name, ie. the unqualified one. That one can be resolved to a symbol by the symbol table.

The first phase models the scope of declarations, so that after it, we have all the info we need to perform disambiguation. Having disambiguation occur in a second phase allows to separate concerns and take care of obscuring rules separately from shadowing rules. Shadowing rules can be encoded in the symbol table stack because they apply to a whole extent of program points indiscriminately. Obscuring depends on the actual context of individual name references, and only occurs if a type and a value with the same name are in scope at the same time, and they could both occur in the position of the reference. Since JSymbolTable has separate channels to find a type name and a value name, this second pass can use it without the symbol table having to care.

Btw symbol tables use the auxclasspath to know about members that were declared in e.g. superclasses of the current class. This is distinct from type resolution itself.

If type resolution results are stored into the nodes (as currently done)

Yes this should be the case in general for expressions. But symbols should also carry type information

For multifile analysis keeping these objects for all files is orders of magnitude more memory than we currently use.

I never implied keeping all ASTs in memory, just keeping some symbols in memory. If we keep the getNode method of JElementSymbol, then it would be backed by a weak reference so as not to prevent garbage collection of the AST. If we remove it it's even simpler. Basically, the AST has references to JSymbolTables, and JSymbolTables create new JElementSymbols on demand, or reuse the ones that were already found (bc then the type resolution would not need to be duplicated).

oowekyala · 2019-02-17T23:33:06Z

@jsotuyod May I merge this? Not in the main 7.0.x branch if you still have reservations. I think bringing in pieces of the implementation will drive this discussion forward, and I wouldn't like us to stall much longer on the very first PR.

jsotuyod

I thought we had agreed to proceed with this, but just in case…

jsotuyod · 2019-10-06T21:39:57Z

This is the only place where we use a stream instead of a list, any reason for that?

Possibly, looking up a method may involve looking up all the supertypes of the class in scope (and enclosing classes and their supertypes, for inner/anonymous classes). So since this might be costly, doing it lazily and stopping the hierarchy exploration as soon as we find a matching method might improve performance. A return type that is not a list also documents that the methods may not actually be prefetched.

But maybe all of this should be mentioned on the doc of the method.

The thing is, unless the method I'm looking for has 0 args / I have an EXACT match on all parameters, I'll always need all to determine which overload best fits my arguments and that requires the complete list.

Probably most methods will fit that bill 'though…

May be... Tbh it's too soon to say, we'll have to profile this for different scenarios when it's implemented and used by type resolution.

oowekyala · 2019-10-06T23:17:25Z

@jsotuyod I rebased this branch onto java-grammar, since it missed a lot of updates.

I'd actually like to merge this only after #2034, since there will be some more conflicts

jsotuyod · 2019-10-07T19:32:48Z

@oowekyala build is failing due to a missing symbol… maybe a bad rebase?

This is only done to split the work into two PRs. TODO revert me

Some utilities used by the symbol table tests were deleted in this commit to reduce the diff. TODO restore them with the next PR

About Optional for example

Fix PMD warnings Remove unused method

oowekyala mentioned this pull request Jan 9, 2019

[java] Share utilities defined in ClassTypeResolverTest #1567

Merged

oowekyala mentioned this pull request Jan 16, 2019

[ui] Hide noisy attributes + styling #1591

Merged

adangel added this to the 7.0.0 milestone Jan 19, 2019

oowekyala mentioned this pull request Jan 21, 2019

[java] Resolve explicit types using FQCNs, without hitting the classloader #1207

Closed

oowekyala mentioned this pull request Apr 9, 2019

[java] New expression and type grammar #1759

Merged

This was referenced Sep 30, 2019

[java] Align method and constructor declaration grammar #2034

Merged

[RFC] [POC] - report missing classes #1371

Closed

jsotuyod approved these changes Oct 6, 2019

View reviewed changes

oowekyala changed the base branch from pmd/7.0.x to java-grammar October 6, 2019 23:15

oowekyala force-pushed the sym-reference-framework branch from b976d22 to b5ab656 Compare October 6, 2019 23:15

oowekyala force-pushed the sym-reference-framework branch from 4f8f68f to 12a52bc Compare October 7, 2019 15:05

jsotuyod reviewed Oct 7, 2019

View reviewed changes

Comment thread pmd-java/src/main/java/net/sourceforge/pmd/lang/java/JavaLanguageHandler.java Outdated

oowekyala added 10 commits October 8, 2019 01:47

Add refs, JScope interface

ec670cd

Create symbolic class reference

a59c216

Create import scopes

5dd2f6d

Finish import on demand based on an existing resolver

3ad9567

Make SamePackageScope

52bbd09

Accessibility check

8ee8b05

Document

5e3afe2

Some improvements

0a5a6a1

Update JavaNode interface

a11490d

Refactor ClassTypeResolverTest

2dfcd9d

oowekyala added 25 commits October 8, 2019 01:47

Test laziness

3300753

Fix static-single-import should import types

17684b0

Move files around

242713d

Reduce visibility

f5cc4f9

Remove symbol table implementations and tests

aab1a19

This is only done to split the work into two PRs. TODO revert me

Cleanup the symbol impls

b2ca9b1

Test field and method symbols

7063fde

Organise test utils

a5dc9ea

Some utilities used by the symbol table tests were deleted in this commit to reduce the diff. TODO restore them with the next PR

Add getEnclosingClass

6433118

Update tests

d123869

Add owner symbol to type parameters

423e1c3

Use weak reference for node

df693dc

Test class symbol

3acf89f

Make getEnclosingClass always resolved

6a6f20d

Add some insights about global caching

f535a23

Remove irrelevant changes

b154c5e

Extract interfaces

dd66e43

REVERT ME Remove implementation and tests

7702bf4

REVERT ME remove changes to ClassTypeResolverTest

bd8fafe

Fix JSymbolTable

ce55da9

Fix checkstyle

f16193f

Remove type parameter

2a11c98

Fix rebase

003af13

Update symbol API to newer guidelines

2133d5a

About Optional for example

Cleanup

cb65fa3

Fix PMD warnings Remove unused method

oowekyala force-pushed the sym-reference-framework branch from 799b517 to cb65fa3 Compare October 7, 2019 23:47

jsotuyod merged commit b5caa8c into pmd:java-grammar Oct 23, 2019

oowekyala deleted the sym-reference-framework branch December 9, 2019 11:20

oowekyala added the in:symbol-table Affects the symbol table code label Apr 26, 2020

adangel mentioned this pull request Jan 23, 2023

PMD 7 Tracking Issue #3898

Closed

55 tasks

Uh oh!

Conversation

oowekyala commented Jan 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Jan 9, 2019 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jsotuyod commented Jan 21, 2019

Uh oh!

oowekyala commented Jan 21, 2019

Uh oh!

jsotuyod commented Jan 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oowekyala commented Jan 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oowekyala commented Feb 17, 2019

Uh oh!

jsotuyod left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jsotuyod Oct 6, 2019

Choose a reason for hiding this comment

Uh oh!

oowekyala Oct 6, 2019

Choose a reason for hiding this comment

Uh oh!

jsotuyod Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

oowekyala Oct 7, 2019

Choose a reason for hiding this comment

Uh oh!

oowekyala commented Oct 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jsotuyod commented Oct 7, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

oowekyala commented Jan 9, 2019 •

edited

Loading

ghost commented Jan 9, 2019 •

edited by ghost

Loading

jsotuyod commented Jan 22, 2019 •

edited

Loading

oowekyala commented Jan 22, 2019 •

edited

Loading

oowekyala commented Oct 6, 2019 •

edited

Loading