search jobs: switch from CSV to JSON by stefanhengl · Pull Request #59619 · sourcegraph/sourcegraph-public-snapshot

stefanhengl · 2024-01-16T10:45:23Z

Closes #59329
Relates to #59352

We switch the format of the results download from CSV to line-separated JSON.
Each line corresponds to a JSON object containing chunk matches.
The JSON object has the same format as the matches served by the Stream API.

example.json
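For readers new to line-separated JSON, here is a minimal sketch of reading such a download. Only the `chunkMatches` and `ranges` keys come from the test script in this PR; the `path` key and the sample records are assumptions invented for illustration, not the actual contents of example.json.

```python
import io
import json


def iter_matches(lines):
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample: two lines, each a standalone JSON object.
sample = io.StringIO(
    '{"path": "a.go", "chunkMatches": [{"ranges": [{}, {}]}]}\n'
    '{"path": "b.go", "chunkMatches": [{"ranges": [{}]}]}\n'
)

paths = [m["path"] for m in iter_matches(sample)]
print(paths)
```

Because every line is a complete JSON document, a consumer can process the file as a stream without loading it all into memory.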

This is a breaking change motivated by customer feedback.
Search Jobs is still released as EAP so a breaking change is acceptable.

Pros:

richer information (matches, content, positions)
supports all result types
same format as Stream API

Cons:

Requires more storage
Not as easy to parse by a human as a CSV

The commits can be reviewed separately.
The first 2 commits contain the core of this change.
In the third commit I delete code and update tests.

Next: update documentation

Test plan

updated and new unit tests
manual testing
- I used this Python script to get the result counts from the downloaded JSON and compared them to the counts in the web app for a couple of examples.

"""
Usage:
    python main.py <path/to/json>
"""
import json
import argparse


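def count_results(filename):
    # Each line is one JSON object. Sum the match ranges over all chunks.
    # Results without a 'chunkMatches' key count as zero.
    with open(filename) as f:
        return sum(
            len(chunk['ranges'])
            for line in f
            for chunk in json.loads(line).get('chunkMatches', [])
        )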


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', type=str, help='Search Jobs result json')
    args = parser.parse_args()

    print(f'{count_results(args.filename)} results')

stefanhengl · 2024-01-16T11:06:32Z

@@ -72,13 +72,13 @@ func (h *exhaustiveSearchRepoRevHandler) Handle(ctx context.Context, logger log.
 		return err
 	}



This is the "write" part. We replace the CSV writer with the JSON writer.

stefanhengl · 2024-01-16T11:07:45Z

 	}
 }

+func writeSearchJobJSON(ctx context.Context, iter *iterator.Iterator[string], uploadStore uploadstore.Store, w io.Writer) (int64, error) {


This is the "read" part. We simply concatenate the result blobs from the various repo-revisions we searched.
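Plain concatenation works here because each blob is already line-separated JSON, so joining blobs yields another valid line-separated file (joining JSON arrays would not). A rough Python sketch of the idea, not the Go implementation: `concat_blobs` and the sample blobs are hypothetical, and it assumes every blob ends with a newline.

```python
import io
import json


def concat_blobs(blobs, out):
    """Copy each blob to out unchanged; return the number of bytes written."""
    written = 0
    for blob in blobs:
        written += out.write(blob)
    return written


# Hypothetical per-repo-revision blobs, each already line-separated JSON.
blobs = [b'{"n": 1}\n{"n": 2}\n', b'{"n": 3}\n']
out = io.BytesIO()
total = concat_blobs(blobs, out)

# The concatenation still parses line by line.
records = [json.loads(line) for line in out.getvalue().splitlines()]
print(total, records)
```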

stefanhengl · 2024-01-16T11:10:18Z


 type MatchJSONWriter struct {
-	w *http.JSONArrayBuf
+	w *bufferedWriter


Switching to line-separated JSON for the internal format. See discussion
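To illustrate why line-separated JSON suits a buffered writer, here is a hedged Python analog. The class name, the `flush_size` parameter, and the flush-before-overflow policy are assumptions for illustration; the Go `bufferedWriter` in this PR may behave differently.

```python
import io
import json


class BufferedJSONLines:
    """Hypothetical analog: buffers JSON lines, flushing at a size limit."""

    def __init__(self, out, flush_size):
        self.out = out
        self.flush_size = flush_size
        self.buf = bytearray()

    def append(self, value):
        line = json.dumps(value).encode("utf-8") + b"\n"
        # Flush first if adding this line would push the buffer past the limit.
        if self.buf and len(self.buf) + len(line) > self.flush_size:
            self.flush()
        self.buf += line

    def flush(self):
        if self.buf:
            self.out.write(bytes(self.buf))
            self.buf.clear()


out = io.BytesIO()
w = BufferedJSONLines(out, flush_size=16)
for n in range(3):
    w.append({"n": n})
w.flush()
lines = out.getvalue().splitlines()
print(lines)
```

Unlike a JSON array, no opening bracket, separators, or closing bracket have to be tracked across flushes, so each flushed chunk is valid on its own.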

keegancsmith

LGTM

keegancsmith · 2024-01-16T14:00:34Z

 	m.Path("/insights/export/{id}").Methods("GET").Handler(trace.Route(handlers.CodeInsightsDataExportHandler))
 	m.Path("/search/stream").Methods("GET").Handler(trace.Route(frontendsearch.StreamHandler(db)))
-	m.Path("/search/export/{id}.csv").Methods("GET").Handler(trace.Route(handlers.SearchJobsDataExportHandler))
+	m.Path("/search/export/{id}.json").Methods("GET").Handler(trace.Route(handlers.SearchJobsDataExportHandler))


I guess technically we could support something which converts the JSON into CSV in the future? I quite liked the CSV support, but we should actually get feedback that it is useful.

Agreed. I also like the CSV. However, I think it might get unwieldy quickly once we want to support multiple result types or return individual result chunks with their position.
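As a rough sketch of what such a JSON-to-CSV conversion could look like, the following flattens each match range into one CSV row. Only `chunkMatches` and `ranges` appear in this PR; the `repository` and `path` keys, the column names, and the function `jsonl_to_csv` are assumptions for illustration.

```python
import csv
import io
import json


def jsonl_to_csv(lines, out):
    """Write one CSV row per match range; return the number of data rows."""
    writer = csv.writer(out)
    writer.writerow(["repository", "path", "chunk_index", "range_index"])
    rows = 0
    for line in lines:
        if not line.strip():
            continue
        match = json.loads(line)
        # Result types without chunk matches contribute no rows in this sketch.
        for ci, chunk in enumerate(match.get("chunkMatches", [])):
            for ri, _ in enumerate(chunk.get("ranges", [])):
                writer.writerow(
                    [match.get("repository", ""), match.get("path", ""), ci, ri]
                )
                rows += 1
    return rows


# Hypothetical input: one content match and one result without chunk matches.
sample = [
    '{"repository": "r", "path": "a.go", "chunkMatches": [{"ranges": [{}, {}]}]}',
    '{"repository": "r"}',
]
out = io.StringIO()
row_count = jsonl_to_csv(sample, out)
print(out.getvalue())
```

The second sample record shows the concern raised above: result types that carry no chunk matches have no natural row in this layout, so a real converter would need a column scheme per result type.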

keegancsmith · 2024-01-16T14:13:11Z

+func (j *bufferedWriter) Append(v any) error {
+	oldLen := j.buf.Len()
+
+	enc := json.NewEncoder(&j.buf)


I was wondering if it was worth saving the returned encoder for use between calls. But after reading the implementation, it seems all the real state that can be reused between calls is stored in a sync.Pool. So the only downside of this is a tiny allocation which won't outlive this function call. LGTM.


Relates to https://github.com/sourcegraph/sourcegraph/pull/59619. We have moved from CSV to JSON as the new export format. Note: I also fixed some typos and tweaked the copy a bit. Several customers have asked about the ENVs for the blobstore, so I made it more explicit that the blobstore doesn't require setting any ENVs.

stefanhengl added 3 commits January 16, 2024 10:49

return JSON

e75143b

update copy

43b38b8

remove CSVWriter, port tests

06c03b4

cla-bot Bot added the cla-signed label Jan 16, 2024

stefanhengl commented Jan 16, 2024

update CHANGELOG

03802c4

stefanhengl marked this pull request as ready for review January 16, 2024 11:33

stefanhengl requested a review from a team January 16, 2024 11:33

Merge branch 'main' into sh/search-jobs/switch-to-json

36615b5

keegancsmith approved these changes Jan 16, 2024

stefanhengl added 2 commits January 17, 2024 10:45

Merge remote-tracking branch 'origin/main' into sh/search-jobs/switch…

879600f

…-to-json

add new language field to test output

d9f4c7a

stefanhengl merged commit d29948e into main Jan 17, 2024

stefanhengl deleted the sh/search-jobs/switch-to-json branch January 17, 2024 12:19

stefanhengl mentioned this pull request Jan 17, 2024

search-jobs: document new result format sourcegraph/docs#40

Merged

stefanhengl merged 7 commits into main from sh/search-jobs/switch-to-json
