Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

search jobs: switch from CSV to JSON#59619

Merged
stefanhengl merged 7 commits into
mainfrom
sh/search-jobs/switch-to-json
Jan 17, 2024
Merged

search jobs: switch from CSV to JSON#59619
stefanhengl merged 7 commits into
mainfrom
sh/search-jobs/switch-to-json

Conversation

@stefanhengl

@stefanhengl stefanhengl commented Jan 16, 2024

Copy link
Copy Markdown
Member

Closes #59329
Relates to #59352

We switch the format of the results download from CSV to line-separated JSON.
Each line corresponds to a JSON object containing chunk matches.
The JSON object has the same format as the matches served by the Stream API.

example.json

This is a breaking change motivated by customer feedback.
Search Jobs is still released as EAP so a breaking change is acceptable.

Pros:

  • richer information (matches, content, positions)
  • supports all result types
  • same format as Stream API

Cons:

  • Requires more storage
  • Not as easy to parse by a human as a CSV

The commits can be reviewed separatedly.
The first 2 commits contain the core of this change.
In the third commit I delete code and update tests

Next: update documentation

Test plan

  • updated and new units tests
  • manual testing
    • I used this Python script to get the result counts from the downloaded json and compared them to the counts in the web app for a couple of examples.
"""
Usage:
    python main.py <path/to/json>
"""
import json
import argparse


def count_results(filename):
    return sum(len(chunks['ranges']) for line in open(filename) for chunks in json.loads(line)['chunkMatches'])


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', type=str, help='Search Jobs result json')
    args = parser.parse_args()

    print(f'{count_results(args.filename)} results')

@cla-bot cla-bot Bot added the cla-signed label Jan 16, 2024
@@ -72,13 +72,13 @@ func (h *exhaustiveSearchRepoRevHandler) Handle(ctx context.Context, logger log.
return err
}

@stefanhengl stefanhengl Jan 16, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "write" part. We replace the CSV writer with the JSON writer.

}
}

func writeSearchJobJSON(ctx context.Context, iter *iterator.Iterator[string], uploadStore uploadstore.Store, w io.Writer) (int64, error) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "read" part. We simply concatenate the result blobs from the various repo-revisions we searched.


type MatchJSONWriter struct {
w *http.JSONArrayBuf
w *bufferedWriter

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switching to line-separted JSON for the internal format. See discussion

@stefanhengl stefanhengl marked this pull request as ready for review January 16, 2024 11:33
@stefanhengl stefanhengl requested a review from a team January 16, 2024 11:33

@keegancsmith keegancsmith left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

m.Path("/insights/export/{id}").Methods("GET").Handler(trace.Route(handlers.CodeInsightsDataExportHandler))
m.Path("/search/stream").Methods("GET").Handler(trace.Route(frontendsearch.StreamHandler(db)))
m.Path("/search/export/{id}.csv").Methods("GET").Handler(trace.Route(handlers.SearchJobsDataExportHandler))
m.Path("/search/export/{id}.json").Methods("GET").Handler(trace.Route(handlers.SearchJobsDataExportHandler))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess technically we could support something which converts the json into csv in the future? I quite liked the csv support, but we should actually get feedback it is useful.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I also like the CSV. However, I think it might get a unwieldy quickly once we want to support multiple result types or return individual result chunks with their position.

func (j *bufferedWriter) Append(v any) error {
oldLen := j.buf.Len()

enc := json.NewEncoder(&j.buf)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was wondering if it was worth saving the returned encoder for use between calls. But after reading the implementation, it seems all the real state that can be inferred between calls is stored in a sync.pool. So the only downside of this is a tiny allocation which won't outlive this function call. LGTM.

@stefanhengl stefanhengl merged commit d29948e into main Jan 17, 2024
@stefanhengl stefanhengl deleted the sh/search-jobs/switch-to-json branch January 17, 2024 12:19
stefanhengl referenced this pull request in sourcegraph/docs Jan 17, 2024
Relates to https://github.com/sourcegraph/sourcegraph/pull/59619

We have moved from CSV to JSON as the new export format.

Note: I also fixed some typos and tweaked the copy a bit. Several
customers have asked about the ENVs for the blobstore so I made it more
explicit that the blobstore doesn't require setting any ENVs.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Search jobs: Switch from CSV to JSON

2 participants