specs: include source provenance in spec.json and package hash#32312
specs: include source provenance in spec.json and package hash#32312
spec.json and package hash#32312Conversation
05c1543 to
82073b5
Compare
|
Quick question: will this fix the following problem - if Spack has a recipe for package, whose sources are taken from Git (or any other version control, I think), and I have a local repository with the same package but with different git URL, Spack will blindly use sources from Spack mirror? |
82073b5 to
a9675b0
Compare
2f03429 to
d322cb9
Compare
|
@spackbot run pipeline |
|
I've started that pipeline for you! |
scheibelp
left a comment
There was a problem hiding this comment.
Point for consideration:
- This is trending toward listing out more information individually in the node dict: what is the point of that? I don’t perceive a benefit other than that it becomes more-readable; IMO there are already mechanisms for getting a better interface to this information, in particular reading it in as a Spec object.
(I only have one point to make here but I numbered it because I reference this in one of the PR comments)
| The representation may only include certain dependency types, and it may optionally | ||
| include a canonicalized hash of the ``package.py`` for each node in the graph. | ||
|
|
||
| We currently use different hashes for different use cases.""" |
There was a problem hiding this comment.
(question) are you removing this because it's no longer true, or because you don't think it's the right place to document this?
(we've discussed using a single hash everywhere for a while, but I don't think this PR handles that, so what I'm asking is if it already happened and this PR is just tidying up the docstring)
There was a problem hiding this comment.
It's redundant with the text above it, but if you think it's necessary, we could keep it.
| value = getattr(self, attr, None) | ||
| if value: | ||
| attrs[attr] = value | ||
| return attrs |
There was a problem hiding this comment.
It seems odd to embed configuration-YAML-awareness into the fetchers, what is this doing that source_id wasn't?
If this function is in fact necessary, I think it could use a better name and docstring (which I can suggest pending an answer to the above question).
There was a problem hiding this comment.
I tried to document this and remembered that I did not remove source_id. See what you think of the new source_provenance function.
| assert getattr(spec, ht.package_hash.attr) | ||
| # TODO: make artifact hashes a static method | ||
| artifact_hashes = pkg.artifact_hashes() | ||
| spec._package_hash = artifact_hashes.pop("package_hash") |
There was a problem hiding this comment.
It seems like it would be cleaner to call package_hash() directly here vs. depending on artifact_hashes() and then modifying the resulting dictionary: it seems like that's the direction that artifact_hashes is going (i.e. the name suggests we could exclude the contents of the package.py).
There was a problem hiding this comment.
Did you read the docstring for this method?
There was a problem hiding this comment.
My point was: however this was gotten in pkg.artifact_hashes(), get it directly here instead. I only see this called in one place in this PR, so I would have assumed calling
spack.util.package_hash(self)
directly (i.e. here, in this function) would make more sense than delegating it to artifact_hashes and then deconstructing it after the fact here.
| return syaml.syaml_dict(sorted(source.items())) | ||
| return source | ||
|
|
||
| d[key] = [order(source) for source in source_list] |
There was a problem hiding this comment.
From looking through this PR it looks like the only thing that would ever need to be ordered is a dictionary. But FYI I had to spend some time cross-referencing to confirm that. If another editor ever adds a list of things, then this has the potential to become unordered (i.e. inconsistent hash).
You could add a consideration for lists in the locally-defined order, but that might be defeated by gaps in the type checking. Ideally if _artifact_hashes values are guaranteed to be iterables of comparable elements, then you could just call sorted without constructing a new type around it (i.e. don't create an syaml_dict: admittedly this interferes with the representation as listed in the PR description, which relates to point [1] of my review).
There was a problem hiding this comment.
I'm not sure I understand this comment. If the item is a list, it doesn't need to be deterministically ordered. It should already be. We sort dictionaries but list order is up to the creator.
There was a problem hiding this comment.
Everything artifact_hashes calls (i.e. right now just _get_needed_resources) orders things, so this is fine for now. My worry is that this behavior is disconnected from the need: I had to dig a couple layers down to confirm this, vs. just seeing it enforced here where it really matters; that seems fragile to me.
You may be arguing that we depend on this for other info we add in to_node_dict, but most of that is part of _cmp_node (which presumably tests order), and _artifact_hashes isn't.
|
Just pinging here to see what the status/blockers are for this MR to be merged. Having the capability to produce an SBOM is becoming highly desirable at LANL (perhaps there's another MR that does this?). |
|
I had done https://github.com/spack/spack-sbom and a follow up PR here to add a command in 2021 (since closed) but you probably want the changes in first that @tgamblin thinks are important. |
|
@scheibelp: RE:
Yes.
There are not reliable interfaces for retrieving this from an old installation. I put some motivation in the description, but here it is for posterity:
|
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
`installer.py` currently swallows the traceback and preserves only the error messaege if a build process fails. - [x] preserve exceptions from failed build processes - [x] print a full traceback for each one when running with `spack -d` Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Patches variant needs to stay on the spec even if we also have resources. Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
The submodules argument for git versions supports callbacks (which are used on some key packages to conditionally fetch certain submodules). Callbacks can't be serialized to JSON, so we need to ensure that these things are resolved (by calling them) at concretization time. - [x] ensure that submodule callbacks are called when creating spec JSON - [x] add tests that try serializing all types of git versions
- [x] don't allow calling `spec.package_hash()` on abstract specs - [x] get rid of last few uses of `ht.package_hash`, which was removed. Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
c0c13ff to
bb77a77
Compare
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
Signed-off-by: Todd Gamblin <tgamblin@llnl.gov>
We've included a package hash in Spack since #7193 for CI, and we started using it on the spec in #28504. However, what goes into the package hash is opaque.
We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a
package.pyfile. Why?Spec, then look up the checksum for a version from thepackage.py, but that's not reliable in the long term. We may remove versions frompackage.py(in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.
Here's what
spec.jsonlooked like before:{ "spec": { "_meta": { "version": 3 }, "nodes": [ { "name": "zlib", "version": "1.2.12", ... "patches": [ "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76" ], "package_hash": "pthf7iophdyonixxeed7gyqiksopxeklzzjbxtjrw7nzlkcqleba====", "hash": "ke4alug7ypoxp37jb6namwlxssmws4kp" } ] } }The
package_hashthere is a hash of the concatenation of:package.pyrecipe, as implemented in Make the package hash consistent #28156;sha256's of patches applied to the spec; andsha256sums of archives or commits/revisions of repos used to build the spec.There are some issues with this: patches are counted twice in this spec (in
patchesand in thepackage_hash), the hashes of sources used to build are conflated with thepackage.pyhash, and we don't actually include resources anywhere.With this PR, I've expanded the package hash out in the
spec.jsonbody. Here is the "same" spec with the new fields:{ "spec": { "_meta": { "version": 3 }, "nodes": [ { "name": "zlib", "version": "1.2.12", ... "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4", "sources": [ { "type": "archive", "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9" } ], "patches": [ "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76" ], "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein" } ] } }Now:
package_hash;sources, and we tell you their checksum in thespec.json;sourceswill include resources for packages that have it;package_hashis a base32-encodedsha1, like other hashes in Spack, and it onlytells you that the
package.pychanged.The behavior of the DAG hash (which includes the
package_hash) is basically the same as before, except now resources are included, and we can see differences in archives and resources directly in thespec.jsonNote that we do not need to bump the spec meta version on this, as past versions of Spack can still read the new specs; they just will not notice the new fields (which is fine, since we currently do not do anything with them).
Among other things, this will more easily allow us to convert Spack specs to SBOM and track relevant security information (like
sha256's of archives). For example, we could do continuous scanning of a Spack installation based on these hashes, and if thesha256's become associated with CVE's, we'll know we're affected. (@vsoch FYI)spec_attrs()toFetchStrategythat can be used to describe a fetcher for aspec.json.hash_types.py, but it really doesn't belong there. Now, it's handled as part ofSpec._finalize_concretization()andhash_types.pyis much simpler.PackageBase.content_hash()toPackageBase.artifact_hashes(), and include more information about artifacts in it.