specs: include source provenance in spec.json and package hash #32312

Open
tgamblin wants to merge 20 commits into develop from show-sha256-in-spec

Conversation

@tgamblin
Member

@tgamblin tgamblin commented Aug 22, 2022

We've included a package hash in Spack since #7193 for CI, and we started using it on the spec in #28504. However, what goes into the package hash is opaque.

We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a package.py file. Why?

  1. We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
  2. Right now you can look at the version on a Spec, then look up the checksum for a version from the package.py, but that's not reliable in the long term. We may remove versions from package.py (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
  3. Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.

Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.

Here's what spec.json looked like before:

{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "package_hash": "pthf7iophdyonixxeed7gyqiksopxeklzzjbxtjrw7nzlkcqleba====",
        "hash": "ke4alug7ypoxp37jb6namwlxssmws4kp"
      }
    ]
  }
}

The package_hash there is a hash of the concatenation of:

  • A canonical hash of the package.py recipe, as implemented in Make the package hash consistent #28156;
  • sha256's of patches applied to the spec; and
  • Archive sha256 sums of archives or commits/revisions of repos used to build the spec.
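To make the conflation concrete, the pre-PR scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not Spack's actual implementation; the helper name and inputs are placeholders:

```python
import hashlib

# Rough sketch (not Spack's actual code) of the pre-PR package_hash:
# one digest over the concatenation of the canonical recipe hash,
# patch sha256s, and source-archive sha256s. Because everything is
# folded into a single digest, a changed patch or archive is
# indistinguishable from a changed package.py recipe.
def old_style_package_hash(recipe_hash, patch_sha256s, archive_sha256s):
    combined = recipe_hash + "".join(patch_sha256s) + "".join(archive_sha256s)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

h = old_style_package_hash(
    "deadbeef",       # canonical package.py hash (placeholder)
    ["0d382343"],     # patch checksum (truncated placeholder)
    ["91844808"],     # archive checksum (truncated placeholder)
)
```

Any change to any of the three inputs changes the single output hash, which is why the old spec.json could not tell you *which* of them changed.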

There are some issues with this: patches are counted twice in this spec (once in patches and once in the package_hash), the hashes of sources used in the build are conflated with the package.py hash, and we don't actually include resources anywhere.

With this PR, I've expanded the package hash out in the spec.json body. Here is the "same" spec with the new fields:

{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
        "sources": [
          {
            "type": "archive",
            "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"
          }
        ],
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein"
      }
    ]
  }
}

Now:

  • Patches and archive hashes are no longer included in the package_hash;
  • Artifacts used in the build go in sources, and we tell you their checksum in the spec.json;
  • sources will include resources for packages that have them;
  • Patches are the same as before -- but only represented once; and
  • The package_hash is a base32-encoded sha1, like other hashes in Spack, and it only
    tells you that the package.py changed.
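The base32-encoded sha1 format mentioned above is why Spack hash strings are 32 lowercase characters: a sha1 digest is 20 bytes, and 20 is a multiple of base32's 5-byte block size, so encoding yields exactly 32 characters with no padding. A minimal illustration (placeholder input, not Spack's canonicalization logic):

```python
import base64
import hashlib

def b32_sha1(data: bytes) -> str:
    # A 20-byte sha1 digest is a whole number of 5-byte base32 blocks,
    # so the encoding is exactly 32 characters with no '=' padding.
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii").lower()

print(len(b32_sha1(b"contents of package.py")))  # 32
```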

The behavior of the DAG hash (which includes the package_hash) is basically the same as before, except now resources are included, and we can see differences in archives and resources directly in the spec.json.

Note that we do not need to bump the spec meta version on this, as past versions of Spack can still read the new specs; they just will not notice the new fields (which is fine, since we currently do not do anything with them).

Among other things, this will more easily allow us to convert Spack specs to SBOM and track relevant security information (like sha256's of archives). For example, we could do continuous scanning of a Spack installation based on these hashes, and if the sha256's become associated with CVE's, we'll know we're affected. (@vsoch FYI)
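As a sketch of the scanning idea, a scanner could pull (package, sha256) pairs straight out of a spec.json without consulting any package.py. This is a hypothetical helper, not part of this PR; the field names are taken from the spec.json example above:

```python
import json

def source_checksums(spec_json_text: str):
    """Collect (package name, sha256) pairs from a spec.json for scanning."""
    data = json.loads(spec_json_text)
    return [
        (node["name"], src["sha256"])
        for node in data["spec"]["nodes"]
        for src in node.get("sources", [])
        if "sha256" in src
    ]

example = """{"spec": {"_meta": {"version": 3}, "nodes": [
  {"name": "zlib", "version": "1.2.12",
   "sources": [{"type": "archive", "sha256": "9184"}]}]}}"""
print(source_checksums(example))  # [('zlib', '9184')]
```

The resulting pairs could then be matched against a CVE feed keyed on source checksums.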

  • Add a method, spec_attrs(), to FetchStrategy that can be used to describe a fetcher for a spec.json.
  • Simplify the way package_hash() is handled in Spack. Previously, it was handled as a special-case spec hash in hash_types.py, but it really doesn't belong there. Now, it's handled as part of Spec._finalize_concretization() and hash_types.py is much simpler.
  • Change PackageBase.content_hash() to PackageBase.artifact_hashes(), and include more information about artifacts in it.
  • Update package hash tests and make them check for artifact and resource hashes.

@spackbot-app spackbot-app bot added labels: core (PR affects Spack core functionality), fetching, new-version, resources, tests (General test capability(ies)), utilities — Aug 22, 2022
@iarspider
Contributor

@tgamblin since you will be working on package hashes, can you also try fixing #30720 ?

@tgamblin tgamblin force-pushed the show-sha256-in-spec branch from 05c1543 to 82073b5 (October 2, 2022 23:12)
@iarspider
Contributor

Quick question: will this fix the following problem? If Spack has a recipe for a package whose sources are taken from Git (or any other version control, I think), and I have a local repository of the same package with a different git URL, Spack will blindly use sources from the Spack mirror.

@tgamblin tgamblin added this to the v0.20.0 milestone Nov 28, 2022
@tgamblin tgamblin force-pushed the show-sha256-in-spec branch from 82073b5 to a9675b0 (December 4, 2022 07:11)
@tgamblin tgamblin force-pushed the show-sha256-in-spec branch from 2f03429 to d322cb9 (December 4, 2022 18:25)
@tgamblin
Member Author

tgamblin commented Dec 4, 2022

@spackbot run pipeline

@spackbot-app

spackbot-app bot commented Dec 4, 2022

I've started that pipeline for you!

@tgamblin tgamblin requested a review from scheibelp December 5, 2022 19:06
@scheibelp scheibelp self-assigned this Dec 7, 2022
Member

@scheibelp scheibelp left a comment

Point for consideration:

  1. This is trending toward listing out more information individually in the node dict: what is the point of that? I don’t perceive a benefit other than that it becomes more-readable; IMO there are already mechanisms for getting a better interface to this information, in particular reading it in as a Spec object.

(I only have one point to make here but I numbered it because I reference this in one of the PR comments)

The representation may only include certain dependency types, and it may optionally
include a canonicalized hash of the ``package.py`` for each node in the graph.

We currently use different hashes for different use cases."""
Member

(question) are you removing this because it's no longer true, or because you don't think it's the right place to document this?

(we've discussed using a single hash everywhere for a while, but I don't think this PR handles that, so what I'm asking is if it already happened and this PR is just tidying up the docstring)

Member Author

It's redundant with the text above it, but if you think it's necessary, we could keep it.

value = getattr(self, attr, None)
if value:
attrs[attr] = value
return attrs
Member

It seems odd to embed configuration-YAML-awareness into the fetchers, what is this doing that source_id wasn't?

If this function is in fact necessary, I think it could use a better name and docstring (which I can suggest pending an answer to the above question).

Member Author

I tried to document this and remembered that I did not remove source_id. See what you think of the new source_provenance function.

assert getattr(spec, ht.package_hash.attr)
# TODO: make artifact hashes a static method
artifact_hashes = pkg.artifact_hashes()
spec._package_hash = artifact_hashes.pop("package_hash")
Member

It seems like it would be cleaner to call package_hash() directly here vs. depending on artifact_hashes() and then modifying the resulting dictionary: it seems like that's the direction that artifact_hashes is going (i.e. the name suggests we could exclude the contents of the package.py).

Member Author

Did you read the docstring for this method?

Member

My point was: however this was gotten in pkg.artifact_hashes(), get it directly here instead. I only see this called in one place in this PR, so I would have assumed calling

spack.util.package_hash(self)

directly (i.e. here, in this function) would make more sense than delegating it to artifact_hashes and then deconstructing it after the fact here.

return syaml.syaml_dict(sorted(source.items()))
return source

d[key] = [order(source) for source in source_list]
Member

From looking through this PR it looks like the only thing that would ever need to be ordered is a dictionary. But FYI I had to spend some time cross-referencing to confirm that. If another editor ever adds a list of things, then this has the potential to become unordered (i.e. inconsistent hash).

You could add a consideration for lists in the locally-defined order, but that might be defeated by gaps in the type checking. Ideally if _artifact_hashes values are guaranteed to be iterables of comparable elements, then you could just call sorted without constructing a new type around it (i.e. don't create an syaml_dict: admittedly this interferes with the representation as listed in the PR description, which relates to point [1] of my review).

Member Author

I'm not sure I understand this comment. If the item is a list, it doesn't need to be deterministically ordered. It should already be. We sort dictionaries but list order is up to the creator.

Member

Everything artifact_hashes calls (i.e. right now just _get_needed_resources) orders things, so this is fine for now. My worry is that this behavior is disconnected from the need: I had to dig a couple layers down to confirm this, vs. just seeing it enforced here where it really matters; that seems fragile to me.

You may be arguing that we depend on this for other info we add in to_node_dict, but most of that is part of _cmp_node (which presumably tests order), and _artifact_hashes isn't.
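The ordering concern in this thread boils down to making the serialized form independent of dictionary insertion order before hashing, while list order must be deterministic at the point where the list is built. A generic illustration of the principle (using JSON for brevity; Spack's actual code uses syaml):

```python
import hashlib
import json

def stable_digest(obj) -> str:
    # sort_keys canonicalizes dict key order before hashing; list order
    # is preserved as-is, so lists must already be deterministically
    # ordered by whoever constructed them.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

a = {"patches": ["p1", "p2"], "sources": [{"type": "archive"}]}
b = {"sources": [{"type": "archive"}], "patches": ["p1", "p2"]}
assert stable_digest(a) == stable_digest(b)        # key order canonicalized
assert stable_digest(["p1", "p2"]) != stable_digest(["p2", "p1"])  # list order matters
```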

@DarylGrunau
Contributor

Just pinging here to see what the status/blockers are for this MR to be merged. Having the capability to produce an SBOM is becoming highly desirable at LANL (perhaps there's another MR that does this?).

@vsoch
Member

vsoch commented May 4, 2023

I had done https://github.com/spack/spack-sbom and a follow up PR here to add a command in 2021 (since closed) but you probably want the changes in first that @tgamblin thinks are important.

@tgamblin tgamblin modified the milestones: v0.20.0, v0.21.0 May 12, 2023
@tgamblin
Member Author

@scheibelp: RE:

This is trending toward listing out more information individually in the node dict:

Yes.

what is the point of that? I don’t perceive a benefit other than that it becomes more-readable; IMO there are already mechanisms for getting a better interface to this information, in particular reading it in as a Spec object.

There are not reliable interfaces for retrieving this from an old installation. package.py files may drift; we need an authoritative place for this stuff. The Spec is where authoritative provenance goes.

I put some motivation in the description, but here it is for posterity:

We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a package.py file. Why?

  1. We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
  2. Right now you can look at the version on a Spec, then look up the checksum for a version from the package.py, but that's not reliable in the long term. We may remove versions from package.py (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
  3. Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.

Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.

@tgamblin tgamblin mentioned this pull request Sep 8, 2023
@tgamblin tgamblin removed this from the v0.21.0 milestone Nov 5, 2023
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
`installer.py` currently swallows the traceback and preserves only
the error message if a build process fails.

- [x] preserve exceptions from failed build processes
- [x] print a full traceback for each one when running with `spack -d`

Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Patches variant needs to stay on the spec even if we also have resources.

Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
The submodules argument for git versions supports callbacks (which are used on some key
packages to conditionally fetch certain submodules). Callbacks can't be serialized to
JSON, so we need to ensure that these things are resolved (by calling them) at
concretization time.

- [x] ensure that submodule callbacks are called when creating spec JSON
- [x] add tests that try serializing all types of git versions
- [x] don't allow calling `spec.package_hash()` on abstract specs
- [x] get rid of last few uses of `ht.package_hash`, which was removed.

Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
@tgamblin tgamblin force-pushed the show-sha256-in-spec branch from c0c13ff to bb77a77 (October 28, 2024 01:14)
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>