Skip to content

Conversation

@hiddeco
Copy link
Contributor

@hiddeco hiddeco commented Jul 20, 2023

What this PR does / why we need it:

This adds support for creating an index.yaml in JSON format using helm repo index [...] --json.

Fixes #10542
Supersedes #12151

Special notes for your reviewer:

Given that it does not create unnecessary HTTP requests to discover index.json and/or index.yaml files hosted by a repository server, nor requires rigorous changes to the caching logic, and is fully backwards compatible due to YAML being a superset of JSON. This appears to be the least invasive way to add support for the JSON format.

Benchmark using Bitnami's infamous "full index" archive:

$  benchstat old.txt new.txt                                  
goos: linux
goarch: amd64
pkg: helm.sh/helm/v3/pkg/repo
cpu: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
                      │   old.txt    │               new.txt               │
                      │    sec/op    │   sec/op     vs base                │
LoadIndexFile/YAML-12     2.692 ± 1%    2.770 ± 3%        ~ (p=0.242 n=20)
LoadIndexFile/JSON-12   2318.3m ± 0%   582.6m ± 1%  -74.87% (p=0.000 n=20)
geomean                   2.498         1.270       -49.15%

                      │    old.txt    │               new.txt                │
                      │     B/op      │     B/op      vs base                │
LoadIndexFile/YAML-12    836.0Mi ± 0%   836.0Mi ± 0%        ~ (p=0.355 n=20)
LoadIndexFile/JSON-12   671.99Mi ± 0%   93.84Mi ± 0%  -86.03% (p=0.000 n=20)
geomean                  749.5Mi        280.1Mi       -62.63%

                      │   old.txt    │               new.txt               │
                      │  allocs/op   │  allocs/op   vs base                │
LoadIndexFile/YAML-12    12.03M ± 0%   12.03M ± 0%        ~ (p=0.683 n=20)
LoadIndexFile/JSON-12   10.015M ± 0%   1.078M ± 0%  -89.23% (p=0.000 n=20)
geomean                  10.98M        3.601M       -67.19%

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

When an index is in a JSON format, the `sigs.k8s.io/yaml` package uses
an inefficient approach to unmarshaling the data, as it does an
unnecessary roundtrip on the data to transform the YAML to valid JSON.

To prevent this from happening, detect if the bytes which we attempt
to load contain valid JSON, and unmarshal them directly using
`json.Unmarshal` instead.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 20, 2023
This adds support for generating the repository index file in JSON
format using the `--json` flag. The index itself is still written
to `index.yaml`, which is fully backwards compatible as YAML is a
superset of JSON.

For big indexes (think multiple megabytes), this approach is however
more efficient in combination with the changes to the load logic,
as it prevents a YAML -> JSON roundtrip during decoding.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Copy link

@pjbgf pjbgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, it will make a massive difference processing large helm indexes.

@sabre1041
Copy link
Contributor

This is a really good enhancement and verified that the enhancement does function correctly. However, the use of index.yaml for both JSON and YAML content is confusing as to the actual format of content included in this file.

It might be interesting to consider two potential files that could contain repository metadata: index.yaml and index.json (where index.yaml would take precedence if both files were found)

@hiddeco
Copy link
Contributor Author

hiddeco commented Jul 24, 2023

It might be interesting to consider two potential files that could contain repository metadata: index.yaml and index.json (where index.yaml would take precedence if both files were found)

In an iteration before creating this PR, I actually had an approach like this where ChartRepository#DownloadIndexFile would loop over possible URLs (first JSON, then YAML), while breaking the loop on the first non-error response (or returning a collection of errors if none was found).

Besides this approach causing potentially lots of HTTP 404s for the server when a JSON isn't served, after taking a closer look at the caching logic (more precisely the various CacheIndexFile calls throughout the code base, and required changes to certain error types to be factually correct). Getting this done right appeared more painful than leveraging YAML being a superset of JSON.

Another upside of the approach taken is that it's backwards compatible without having to serve (and thus store) two different file types. A Helm repository administrator can simply decide to start serving JSON, and any version of Helm will happily be able to download and work with the new index format (albeit only the Hem version which would include this change with memory consumption improvements).

Copy link
Contributor

@joejulian joejulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... the use of index.yaml for both JSON and YAML content is confusing ...

I disagree. Since json is a subset of yaml, it seems perfectly valid to me.

The rest of hiddeco's rational is also very compelling.

I tested this on my use cases and it works for me.

@joejulian joejulian added the Has One Approval This PR has one approval. It still needs a second approval to be merged. label Jul 27, 2023
@sabre1041
Copy link
Contributor

@hiddeco @joejulian My concern with mixing JSON and YAML content is centered around two primary areas:

  1. Understanding content types within index.yaml file. Yes YAML is a subset of JSON, but if there was an assumption that the contents included in these files were one format or the other, unexpected parsing errors could occur
  2. Backwards compatibility. Do we break it by implementing such a change?

@joejulian
Copy link
Contributor

Other way around. Yaml is a superset of json. Anything in json is 100% compatible with yaml. Yaml has features that are not in json.

So any json content in a yaml file is guaranteed compatible.

@hiddeco
Copy link
Contributor Author

hiddeco commented Aug 2, 2023

YAML is a subset of JSON, but if there was an assumption that the contents included in these files were one format or the other, unexpected parsing errors could occur

The actual parser Helm uses converts YAML to JSON during unmarshalling. In addition, from the YAML 1.2 spec (2009, revised in 2021):

[...] Its primary focus was making YAML a strict superset of JSON. It also removed [...]


Backwards compatibility. Do we break it by implementing such a change?

As noted in my previous comment, the exact opposite is true. As introducing an index.json would mean you either:

  1. Have to continue to publish both an index.yaml and index.json file as a Helm repository administrator, as otherwise older versions of Helm are unable to work with your repository. It's also worth to note this makes publishing itself more complicated, as you now have to ensure both files are in sync.
  2. Stop publishing an index.yaml, which would yield your Helm repository useless for older versions of Helm.

Copy link
Contributor

@sabre1041 sabre1041 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hiddeco @joejulian Reevaluated code changes, comments and LGTM

@joejulian joejulian added this to the 3.13.0 milestone Sep 11, 2023
@joejulian joejulian merged commit e7bb860 into helm:main Sep 11, 2023
@hiddeco hiddeco deleted the json-index branch September 11, 2023 17:16
@thernstig
Copy link

@hiddeco @sabre1041 this is amazing! Should it not be mentioned in the docs?

@hiddeco
Copy link
Contributor Author

hiddeco commented Sep 11, 2023

They are partly automatically documented: https://github.com/helm/helm-www#updating-the-helm-cli-reference-docs

@mattfarina
Copy link
Collaborator

Note, while writing the release notes I discovered this doesn't fully work. I filed a bug at #12422.

@ArtemShevelyukhin
Copy link

ArtemShevelyukhin commented May 27, 2025

@hiddeco sorry, I don't understand, how this PR should help with #9931

I have helm repository in artifactory and large (~50 Mb) index.yaml file and my "helm repo update" command consumes a lot of memory (~900 Mb). And as far as I know artifactory doesn't support json format for index.yaml file

So, could you explain does this PR help with "helm repo update" or not ?

@hiddeco
Copy link
Contributor Author

hiddeco commented Jun 6, 2025

It will not help if you are not in control of the published format of the repository index (which appears to be the case based on what you describe).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Has One Approval This PR has one approval. It still needs a second approval to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

index.json - necessary for large projects

7 participants