Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Syntactic indexing non local refs java#63822

Merged
keynmol merged 34 commits into
mainfrom
syntactic-indexing-non-local-refs-java
Jul 25, 2024
Merged

Syntactic indexing non local refs java#63822
keynmol merged 34 commits into
mainfrom
syntactic-indexing-non-local-refs-java

Conversation

@keynmol

@keynmol keynmol commented Jul 15, 2024

Copy link
Copy Markdown
Contributor

Fixes GRAPH-649

This PR adds non-local references support to Java (and more generally to our locals handling code).

The main idea is to cherrypick nodes that can contain global or resolved (see below) references, handle them first, and then mark everything else as locals.

  • Pure Global references are formed from types of definitions that can't be locals in our code. Currently it's only methods in Java that we treat as always global references

  • Pure local references - these references should only be resolved against the locals scope tree and definitions within it - never emitting a non-local references

  • Resolved references - these are first resolved as locals, and if that doesn't succeed, a global reference is emitted. These will be used most frequently, as TS grammars don't carry enough information for us to identify more pure global references. For example, in Java's type bounds class Hello<N extends Number>, Number can refer to both the local type parameter of an enclosing class (in which case we should emit a local reference), or it can refer to java.lang.Number (in which case we should emit a global reference)

Test plan

  • New snapshot results
  • New evaluation results

Changelog

@cla-bot cla-bot Bot added the cla-signed label Jul 15, 2024
@github-actions github-actions Bot added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform labels Jul 15, 2024
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated

@kritzcreek kritzcreek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few drive-by comments

Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
@keynmol keynmol marked this pull request as ready for review July 16, 2024 11:04
Comment on lines 40 to 44

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a first step towards unifying the analysis into a single pass, with the ability to enable/disable certain features.

By default everything is enabled, and we can choose to disable certain features (like disabling global references for highlighting code).

This necessarily leads to --no-<feature> flags in the CLI. I don't have a way not to have them, nor do I think they're a problem (this came up in the past).

Comment thread docker-images/syntax-highlighter/crates/scip-syntax/benches/symbol_parsing.rs Outdated
@kritzcreek kritzcreek self-requested a review July 17, 2024 03:31
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/lib.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment on lines 638 to 627

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the Nothing case here? No visibility modifier can still produce a local reference

Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment thread docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs Outdated
Comment on lines 823 to 824

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use scip_strict here (means you'd need to move it from the scip-syntax crate into syntax-analysis).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scip_strict this as well while we're at it.

Comment on lines 25 to 26

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, this one's weird. Overlapping def and ref might be a bug in scope tree construction that existed before this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly at this point I'm not sure how to fix it without extra logic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've attempted to identify overlapping nodes in the simplest possible way, and while this does help, it breaks Perl:

   11    11 │           ref @$self (8, 13)-(8, 19)
   12       │-          ref $self (8, 14)-(8, 19)
    7     7 │       if (@_ == 2) {
    8     8 │           my $self = shift;
    9     9 │ //           ^^^^^ definition local 2
   10    10 │           push(@$self, shift);
   11       │-//              ^^^^^ reference local 2

In the queries we extract both $self and @$self as references, but @$self comes first and contains within it $self. Problem is - @$self cannot be resolved to a local definition.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the GOAT when it comes to Perl, but I think the reference and definition we should actually be targeting there is self. See https://perldoc.perl.org/perlintro#Arrays where you can declare an array as @animals and refer to it via $animals[0].

(Not saying go fix the Perl locals, just don't let them stop you from implementing a simpler solution if you've found one)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simple solution for choosing the longest reference node and ignore all matches contained within:

diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs b/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
index ba64c9362aa..4c57e85b56f 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
@@ -587,6 +587,7 @@ impl<'a> LocalResolver<'a> {
         let mut definitions: Vec<DefCapture> = vec![];
         let mut references: Vec<RefCapture<'a>> = vec![];
         let mut skip_references_at_offsets: HashSet<usize> = HashSet::new();
+        let mut min_reference_start: usize = 0;
 
         for match_ in cursor.matches(&config.query, tree.root_node(), source_bytes) {
             let properties = config.query.property_settings(match_.pattern_index);
@@ -622,7 +623,11 @@ impl<'a> LocalResolver<'a> {
                 } else if capture_name.eq_ignore_ascii_case("reference") {
                     let offset = capture.node.start_byte();
 
-                    if skip_references_at_offsets.contains(&offset) {
+                    if offset < min_reference_start {
+                        continue;
+                    }
+
+                    if self.skip_occurrences_at_offsets.contains(&offset) {
                         continue;
                     }
 
@@ -642,11 +647,13 @@ impl<'a> LocalResolver<'a> {
                         other => other,
                     };
 
-                    if self.skip_occurrences_at_offsets.contains(&offset) {
-                        continue;
-                    }
+                    eprintln!(
+                        "{:?} {offset} {min_reference_start} {:?}",
+                        capture.node.utf8_text(source_bytes),
+                        capture.node.kind()
+                    );
 
-                    skip_references_at_offsets.insert(offset);
+                    min_reference_start = capture.node.end_byte();
 
                     let descriptor = match visibility {
                         Some(Visibility::Global) => kind_property

This fixes Java imports and is somewhat generic.

re:Perl - indeed the reference should be to self, but our current grammar does a very crude selection of entire (array) node. I'm exploring switching back to upstream or syncing at least part of the changes into our fork to make this possible

@keynmol keynmol force-pushed the syntactic-indexing-non-local-refs-java branch from e1e27f4 to 3b094e2 Compare July 23, 2024 15:36

@kritzcreek kritzcreek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is in good enough shape to merge. The extra reference to lang. is not going to matter, because search is going to give us identifiers on word boundaries anyway. (and having a few false positives is not the end of the world, it's certainly better than false negatives)

I'll open an issue for some follow-up scip_strict usage, and any simplification I think we could make in the future.

If there are other issues we know of, please open issues for those as well, and we can look at them independent of this PR.

@keynmol

keynmol commented Jul 25, 2024

Copy link
Copy Markdown
Contributor Author

For future reference, here's the latest attempt to prefer a longer reference in case there are referenced subparts contained entirely within it.

It causes issues in perl because we cast a wide net when marking identifiers as reveferences in the grammar, producing @$self and $self references from same occurrence of @$self

I attempted to update the grammar we have (sourcegraph/tree-sitter-perl#4) but it causes other issues because of how it's structured.

diff --git a/docker-images/syntax-highlighter/Cargo.lock b/docker-images/syntax-highlighter/Cargo.lock
index 7f33669d4d5..985bc58e8ae 100644
--- a/docker-images/syntax-highlighter/Cargo.lock
+++ b/docker-images/syntax-highlighter/Cargo.lock
@@ -2633,7 +2633,7 @@ dependencies = [
 [[package]]
 name = "tree-sitter-perl"
 version = "0.0.1"
-source = "git+https://github.com/sourcegraph/tree-sitter-perl?rev=e1b4844afd17b7dc019a436b1ac890568d79a1f2#e1b4844afd17b7dc019a436b1ac890568d79a1f2"
+source = "git+https://github.com/sourcegraph/tree-sitter-perl?rev=25473ee75c2fc0757b019d5be54cba0cbe412437#25473ee75c2fc0757b019d5be54cba0cbe412437"
 dependencies = [
  "cc",
  "tree-sitter",
diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/queries/java/scip-locals.scm b/docker-images/syntax-highlighter/crates/syntax-analysis/queries/java/scip-locals.scm
index fcf8e3c67d8..1a91b868483 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/queries/java/scip-locals.scm
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/queries/java/scip-locals.scm
@@ -91,18 +91,28 @@
 
 ; REFERENCES
 
-
+; NOTE: commented out because current implementation produces overlapping
+; references (in the example below, java.util and util will be produced)
 ; import java.util.HashSet
 ;        ^^^^^^^^^ namespace
 ;                  ^^^^^^^ type (could also be a constant, but type is more common)
+; (import_declaration
+;   (scoped_identifier
+;     scope: ((_) @reference (#set! "kind" "global.namespace"))
+;     ))
+
+; (import_declaration
+;   (scoped_identifier
+;     name: (_) @reference (#set! "kind" "global.type")
+;   ))
 (import_declaration
   (scoped_identifier
-    scope: ((_) @reference (#set! "kind" "global.namespace"))
+    scope: (_) @occurrence.skip
     ))
 
 (import_declaration
   (scoped_identifier
-    name: (_) @reference (#set! "kind" "global.type")
+    name: (_) @occurrence.skip
   ))
 
 (field_access object: (identifier) @reference)
diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/queries/perl/scip-locals.scm b/docker-images/syntax-highlighter/crates/syntax-analysis/queries/perl/scip-locals.scm
index 80fd51e2eaf..3d0c04bd8ee 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/queries/perl/scip-locals.scm
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/queries/perl/scip-locals.scm
@@ -5,11 +5,56 @@
 
 ;; TODO: Add `state` variables once we've updated the grammar. Only
 ;; `our` variables are non-local, so we must not include them here
-(variable_declaration "my" (_) @definition.term)
-(for_statement my_var: (_) @definition.term)
-
-((scalar) @reference (#set! "kind" "local"))
-((array) @reference (#set! "kind" "local"))
-((arraylen) @reference (#set! "kind" "local"))
-((hash) @reference (#set! "kind" "local"))
-((glob) @reference (#set! "kind" "local"))
+(variable_declaration
+  "my"
+  variable: [
+    (scalar (varname) @definition.term)
+    (array (varname) @definition.term)
+    (arraylen (varname) @definition.term)
+    (hash (varname) @definition.term)
+    (glob (varname) @definition.term)
+  ]
+                      ; (_) @definition.term
+)
+
+(variable_declaration
+  "my"
+  variables: ([
+    (scalar ((varname) @definition.term))
+    (array ((varname) @definition.term))
+    (arraylen ((varname) @definition.term))
+    (hash ((varname) @definition.term))
+    (glob ((varname) @definition.term))
+  ])*
+)
+
+(for_statement
+  my_var: [
+    (scalar (varname) @definition.term)
+    (array (varname) @definition.term)
+    (arraylen (varname) @definition.term)
+    (hash (varname) @definition.term)
+    (glob (varname) @definition.term)
+  ]
+)
+
+(scalar
+    ((varname) @reference (#set! "kind" "local"))
+)
+
+(array
+  (varname) @reference (#set! "kind" "local")
+)
+
+(arraylen
+  (varname) @reference (#set! "kind" "local")
+)
+
+(hash
+  (varname) @reference (#set! "kind" "local")
+)
+
+(glob
+  (varname) @reference (#set! "kind" "local")
+)
+
diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs b/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
index ba64c9362aa..a5cf70b6e1c 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/src/locals.rs
@@ -587,6 +587,7 @@ impl<'a> LocalResolver<'a> {
         let mut definitions: Vec<DefCapture> = vec![];
         let mut references: Vec<RefCapture<'a>> = vec![];
         let mut skip_references_at_offsets: HashSet<usize> = HashSet::new();
+        let mut skip_references_until: Option<usize> = None;
 
         for match_ in cursor.matches(&config.query, tree.root_node(), source_bytes) {
             let properties = config.query.property_settings(match_.pattern_index);
@@ -626,6 +627,11 @@ impl<'a> LocalResolver<'a> {
                         continue;
                     }
 
+                    match skip_references_until {
+                        Some(end_offset) if offset < end_offset => continue,
+                        _ => (),
+                    }
+
                     let kind_property = properties
                         .iter()
                         .find(|p| p.key.as_ref() == "kind")
@@ -647,6 +653,7 @@ impl<'a> LocalResolver<'a> {
                     }
 
                     skip_references_at_offsets.insert(offset);
+                    skip_references_until = Some(capture.node.end_byte());
 
                     let descriptor = match visibility {
                         Some(Visibility::Global) => kind_property
@@ -666,6 +673,7 @@ impl<'a> LocalResolver<'a> {
                 } else if capture_name == "occurrence.skip" {
                     let offset = capture.node.start_byte();
                     self.skip_occurrences_at_offsets.insert(offset);
+                    skip_references_until = Some(capture.node.end_byte());
                 } else {
                     debug_assert!(false, "Discarded capture: {capture_name}")
                 }
diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_occurrences.snap b/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_occurrences.snap
index 924113e1d29..f58a8ac1117 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_occurrences.snap
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_occurrences.snap
@@ -5,16 +5,8 @@ expression: dumped
   package com.hello;
   
   import java.lang.AutoCloseable;
-//       ^^^^^^^^^ reference scip-syntax `java.lang`/
-//            ^^^^ reference scip-syntax lang.
-//                 ^^^^^^^^^^^^^ reference scip-syntax AutoCloseable#
   import java.util.*;
-//       ^^^^ reference scip-syntax java/
-//            ^^^^ reference scip-syntax util#
   import java.util.stream.*;
-//       ^^^^^^^^^ reference scip-syntax `java.util`/
-//            ^^^^ reference scip-syntax util.
-//                 ^^^^^^ reference scip-syntax stream#
   
   @Deprecated
 // ^^^^^^^^^^ reference scip-syntax Deprecated#
diff --git a/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_scopes.snap b/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_scopes.snap
index 9c6eba09de3..457d5d7317e 100644
--- a/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_scopes.snap
+++ b/docker-images/syntax-highlighter/crates/syntax-analysis/src/snapshots/syntax_analysis__locals__test__java_scopes.snap
@@ -3,14 +3,6 @@ source: crates/syntax-analysis/src/locals.rs
 expression: scope_tree
 ---
 scope global (0, 0)-(121, 0)
-  ref java.lang (2, 7)-(2, 16)
-  ref lang (2, 12)-(2, 16)
-  ref AutoCloseable (2, 17)-(2, 30)
-  ref java (3, 7)-(3, 11)
-  ref util (3, 12)-(3, 16)
-  ref java.util (4, 7)-(4, 16)
-  ref util (4, 12)-(4, 16)
-  ref stream (4, 17)-(4, 23)
   scope scope (6, 0)-(120, 1)
     ref Deprecated (6, 1)-(6, 11)
     def Container (7, 20)-(7, 29)
diff --git a/docker-images/syntax-highlighter/crates/tree-sitter-all-languages/Cargo.toml b/docker-images/syntax-highlighter/crates/tree-sitter-all-languages/Cargo.toml
index 19d7905ab28..c99a0f5f9c8 100644
--- a/docker-images/syntax-highlighter/crates/tree-sitter-all-languages/Cargo.toml
+++ b/docker-images/syntax-highlighter/crates/tree-sitter-all-languages/Cargo.toml
@@ -26,7 +26,7 @@ tree-sitter-kotlin = { git = "https://github.com/fwcd/tree-sitter-kotlin", rev =
 tree-sitter-magik = { git = "https://github.com/sourcegraph/tree-sitter-magik", rev = "dfabbd4e70ff311939cf48fdf90d340f07758d37" }
 tree-sitter-matlab = { git = "https://github.com/acristoffers/tree-sitter-matlab", rev = "6071891a8c39600203eba20513666cf93b4d650a" }
 tree-sitter-nickel = { git = "https://github.com/nickel-lang/tree-sitter-nickel", rev = "d6c7eeb751038f934b5b1aa7ff236376d0235c56" }
-tree-sitter-perl = { git = "https://github.com/sourcegraph/tree-sitter-perl", rev = "e1b4844afd17b7dc019a436b1ac890568d79a1f2" }
+tree-sitter-perl = { git = "https://github.com/sourcegraph/tree-sitter-perl", rev = "25473ee75c2fc0757b019d5be54cba0cbe412437" }
 tree-sitter-pkl = { git = "https://github.com/apple/tree-sitter-pkl", rev = "b79c8c4d2419e82324d9aca31e9de47ed8304f1f" }
 tree-sitter-pod = { git = "https://github.com/sourcegraph/tree-sitter-pod", rev = "f422a0dca6847c692e811f06fd92c6a75d647222" }
 tree-sitter-xlsg = { git = "https://github.com/sourcegraph/tree-sitter-xlsg", rev = "d956b54ea151b12f19c945f7be421c3dcd3a77ba" }

@keynmol keynmol merged commit b84636f into main Jul 25, 2024
@keynmol keynmol deleted the syntactic-indexing-non-local-refs-java branch July 25, 2024 10:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) team/product-platform

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants