This repository was archived by the owner on Sep 30, 2024. It is now read-only.
client: Avoid complex tokenization in ref panel code#58954
Merged
Conversation
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the code mirror blob view always provides a full token range, which can be used directly, instead of attempting to recompute the token boundaries. For older URLs, we fallback to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation.
59cfc99 to
5186cc6
Compare
Contributor
Author
fkling
approved these changes
Dec 14, 2023
Comment on lines
+102
to
+105
| interface OneBasedPosition { | ||
| line: number | ||
| character: number | ||
| } |
Contributor
There was a problem hiding this comment.
I've also run into quite a few issues with 0-based vs 1-based positions. I think it's useful to have separate types to express intent, but just note that the type checker won't prevent you from passing a ZeroBasedPostition as a OneBasedPosition, because TS uses structural equivalence (or whatever it is called).
Contributor
Author
There was a problem hiding this comment.
Do you think I should use classes here? I'd be happy to introduce new vocabulary types for Positions and Ranges in a central place that can be reused elsewhere.
Contributor
Author
There was a problem hiding this comment.
OK, let me attempt to do that in a follow-up PR. Thanks for flagging this, I forgot that interfaces have structural subtyping.
Contributor
Author
varungandhi-src
added a commit
that referenced
this pull request
Jan 16, 2024
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the code mirror blob view always provides a full token range, which can be used directly, instead of attempting to recompute the token boundaries. For older URLs, we fallback to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2)
Closed
varungandhi-src
added a commit
that referenced
this pull request
Jan 17, 2024
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the code mirror blob view always provides a full token range, which can be used directly, instead of attempting to recompute the token boundaries. For older URLs, we fallback to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2)
varungandhi-src
added a commit
that referenced
this pull request
Jan 23, 2024
Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the code mirror blob view always provides a full token range, which can be used directly, instead of attempting to recompute the token boundaries. For older URLs, we fallback to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2)
BolajiOlajide
pushed a commit
that referenced
this pull request
Jan 24, 2024
…58954) (#59636) * client: Minor cleanup for search-based code intel (#58331) The separation of the logic into different functions makes it clearer what the order of searches is. It also makes it clearer that for some reason, we're only using the locals information from the SCIP Document for 'Find references', and not for 'Go to definition'. Using the SCIP Document for for 'Go to definition' too could avoid a network request. (cherry-picked from e955cddec490d0cc2b5eba36be2ec4958ba06bf8) * client: Avoid complex tokenization in ref panel code (#58954) Previously, we relied on detecting the language from file paths, then using various regexes associated with the language to identify token boundaries. However, the code mirror blob view always provides a full token range, which can be used directly, instead of attempting to recompute the token boundaries. For older URLs, we fallback to simple identifiers, which should work for the vast majority of languages and identifiers. We cannot yet remove the language detection here because the file extensions associated with the language are later used for search-based code navigation. This patch also makes the language spec optional for search-based code intel, as we do not have a solution to #56376 which would guarantee that we always have a language available. If a language is not available, search-based code intel falls back to searching other files with the same extension as a best effort guess. Locally tested for MATLAB code. The ref panel shows up correctly, unlike the error earlier. (cherry-picked from c42cad2) * Fix lint error due to short variable name
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Previously, we relied on detecting the language from file paths,
then using various regexes associated with the language to identify
token boundaries. However, the code mirror blob view always provides
a full token range, which can be used directly, instead of attempting
to recompute the token boundaries.
For older URLs, we fallback to simple identifiers, which should
work for the vast majority of languages and identifiers.
We cannot yet remove the language detection here because the file
extensions associated with the language are later used for search-based
code navigation.
This patch also makes the language spec optional for search-based
code intel, as we do not have a solution to #56376 which would
guarantee that we always have a language available. If a language
is not available, search-based code intel falls back to searching
other files with the same extension as a best effort guess.
Fixes https://github.com/sourcegraph/sourcegraph/issues/58548
Test plan
Locally tested for MATLAB code. The ref panel shows up correctly,
unlike the error earlier in #58548