Conversation
This reverts commit fb87411. According to w3c/json-ld-syntax#343 (comment), references should point to the final published versions on https://www.w3.org/TR/json-ld11/
* upstream/master: (113 commits)
  - [DOC] Auto-generate changelog entry for PR #152
  - [DOC] Auto-generate changelog entry for PR #467
  - Specify that suffix must be alphanumeric
  - ENH: make NOT RECOMMENDED stronger (SHOULD NOT) for zero padding for uniqueness
  - ENH: Include leading . within definition of the file extension
  - ENH: provide an example for a suffix based on an _eeg.vhdr filename
  - [DOC] Auto-generate changelog entry for PR #477
  - [DOC] Auto-generate changelog entry for PR #460
  - Also ignore users urls on github
  - Quote regexp in command line
  - [INFRA] linkchecker - ignore github pull and tree URLs
  - Apply suggestions from code review
  - replace purview with scope
  - label -> index
  - Apply suggestions from code review
  - drop _part-, introduce _split-
  - Apply SA feedback and amended to purview
  - [DOC] Auto-generate changelog entry for PR #459
  - Add Domain Expert to Maintainers Group
  - [DOC] Auto-generate changelog entry for PR #465
  - ...
* upstream/master:
  - [DOC] Auto-generate changelog entry for PR #470
  - Clarify precision of fractional seconds.
  - Separate sentences.
  - Square brackets seem to be a problem.
  - Wrangle line lengths.
  - Add optional fractional seconds to scans file datetimes.
Enh/prov
Co-authored-by: Stefan Appelhoff <stefan.appelhoff@mailbox.org>
Co-authored-by: Julia Guiomar Niso Galán <guiomar.niso@ctb.upm.es>
> }
> }
> which inputs from the BIDS dataset were used together with what software was
> run in what environment and with what parameters.
Not sure what these lines at the end of the examples mean?
> "commandLine": "mri_convert ..."
> }
> }
> involved in a study.
> involved in a study.

They seem unfinished?
> }
> }
> appropriate attribution to the original dataset generators as well as
> future transformers.
> 5. For datasets and derivatives, provenance can also include details of
There's something unclosed here, so all the following text is rendering in the example/comment grey mode. Maybe it's just a missing ``` to close the code block, or something similar. I'm not a super expert in this formatting :)
Also, the numbering goes from 1 to 5, so maybe something is missing in the middle?
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of bids-standard#371 (small atomic ENHs), and is based on the current state of the specification, where we have GeneratedBy to describe how a BIDS derivative dataset came into existence.

## Rationale

As I had previously stated in many (face-to-face, when it was still possible ;)) conversations, in my view any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is the result of some process (let's call it a pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide placement for such original data (as "raw" in terms of processing, but "raw"er in terms of its relation to the actual data acquired by the equipment), and 2) `code/` to provide placement for scripts used to produce or "tune" the dataset. Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other format (e.g. NIfTI) which is then either converted or just renamed into the BIDS layout.

When encountering a new BIDS dataset, ATM it requires forensics and/or data archaeology to discover how this BIDS dataset came about, to e.g. possibly figure out the source of buggy (meta)data it contains. At the level of individual files, some tools already add ad-hoc fields during conversion into the sidecar .json files they produce.

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

</details>

ATM I need to add such metadata to datasets produced by heudiconv to make sure that in case of incremental conversions there is no switch in versions of the software.
**yarikoptic** left a comment
just minor comments while I was looking at the PR
> <sup>1</sup>Storing actual source files with the data is preferred over links to
> external source repositories to maximize long term preservation (which would
> suffer if an external repository would not be available anymore).
Suggested change:

```suggestion
suffer if an external repository would not be available anymore), and to remove ambiguity
in which version of the code (in the repository) was actually used for a given dataset.
```
Although I'm not sure such a recommendation could/should be generally followed, since it remains ambiguous -- e.g. should people copy the entire heudiconv or any other BIDS converter?
ha -- this is the Code section, hence my comment. But upon rereading I wonder if the intent was to talk about sourcedata/? Anyway, the language should be clarified. And I'm not 100% sure this addition relates to BEP028 really -- it seems generic and could be submitted as an independent PR
> Optional: Yes
> ### Rationale
this is a specification and not a textbook. I have not found any other Rationale section, and thus I'm not sure such a section (even though concise) should be here.
Unfortunately there is no convention yet in BIDS on storing such information in a standardized way. bids-standard/bids-specification#440 proposes to add GeneratedBy (within dataset_description.json), which could provide detailed high-level information which should then be consistent throughout the dataset (so we would need to add safeguards). bids-standard/bids-specification#487 is WiP to introduce PROV into the BIDS standard, which would allow establishing _prov.json with all needed gory details.

For now, since fields in sidecar .json files are not strictly regulated, I think it would be beneficial to the user to have the heudiconv version stored there along with other "Version" fields, such as

```shell
$> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
  "ConversionSoftware": "dcm2niix",
  "ConversionSoftwareVersion": "v1.0.20211006",
  "SoftwareVersions": "syngo MR E11",
```

and although strictly speaking heudiconv is a "conversion software", since dcm2niix decided to use that pair, I have decided to leave it alone and just come up with yet another descriptive key: `"HeudiconvVersion": "0.10.0",`
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
(NOTE: I'll cross-post this message across several BEP threads) Hi there, just a quick notification that we have just merged #918 and it may be interesting to look at the implications for this BEP. We are introducing "BIDS URIs", which unify the way we refer to and point to files in BIDS datasets (as opposed to "dataset-relative" or "subject-relative" or "file-relative" links). If the diff and discussion in the PR are unclear, you can also read the rendered version: https://bids-specification.readthedocs.io/en/latest/02-common-principles.html#bids-uri Perhaps there are things in the BEP that need adjusting now, but perhaps also not -- in any case it's good to be aware of this new feature! Let me know if there are any questions, comments, or concerns.
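For orientation, a BIDS URI has the shape `bids:[<dataset>]:<relative-path>`, where an empty dataset label refers to the current dataset (see the rendered spec section linked above). A minimal parsing sketch, using a helper of my own rather than any existing BIDS library function:

```python
def parse_bids_uri(uri):
    """Split 'bids:[dataset]:relative/path' into (dataset, path);
    an empty dataset label refers to the current dataset."""
    scheme, _, rest = uri.partition(":")
    if scheme != "bids" or ":" not in rest:
        raise ValueError(f"not a BIDS URI: {uri!r}")
    dataset, _, relative_path = rest.partition(":")
    return dataset, relative_path
```

A provenance record could then reference inputs as e.g. `bids::sub-01/anat/sub-01_T1w.nii.gz` instead of an ambiguous relative path.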
FYI: added a link to the HTML version of the BEP in the top comment of this PR.
maintenance note: added a link to the rendered BEP in the top message of this PR.
while listening to @cmaumet giving a presentation at the BIDS derivatives meeting, I decided to add a reference to a related effort in unrelated other sciences: https://academic.oup.com/gji/article/207/2/1003/2583765 "An Adaptable Seismic Data Format"; ASDF incorporates the W3C PROV standard to record provenance information
> ```Text
> - [Dataset level] prov.jsonld
> - [File level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov
> ```
Thinking about the many types of "applications" operating on entire subjects/sessions, it indeed makes sense to have it at the subject/session level (which is what is presented here, not the file level) and to also allow a file-level one to "complement" it (similarly to how it is done in the inheritance principle for other metadata?):
Suggested change:

```suggestion
- [Subject[/session] level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov
- [File level] sub-<label>/[ses-<label>/]<modality>/sub-<label>[_ses-<label>][_entities-<values>]_<suffix>.prov
```
I am also not sure why it is prov.jsonld at the top and .prov at the lower levels? Why not .prov.jsonld at the bottom levels too?
What about pretty much allowing .jsonld at any level where appropriate? E.g. it could be for a subject, a session, or specific modality processing, etc. All of those are to be joined into a single graph anyway, right?
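The "joined into a single graph" idea could be approximated naively as below. This is a sketch under strong assumptions of my own (each provenance file is a JSON-LD document with a shared @context and a top-level @graph list); real merging would also need @id reconciliation and context handling, e.g. via a JSON-LD library.

```python
def merge_prov_graphs(documents):
    """Concatenate the @graph lists of several JSON-LD documents,
    keeping the first @context encountered."""
    merged = {"@context": None, "@graph": []}
    for document in documents:
        if merged["@context"] is None:
            merged["@context"] = document.get("@context")
        merged["@graph"].extend(document.get("@graph", []))
    return merged
```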
pushed a fix for .prov
@cmaumet advised that this PR is outdated and shouldn't be the "reference" for BEP028 ATM and the google doc should be the one to look after ATM.
> There are no limitations or recommendations on the language and/or
> code organization of these scripts at the moment.
> <sup>1</sup>Storing actual source files with the data is preferred over links to
it seems I can't do suggestions any longer (because the PR is closed?) via the button:
Suggested change:

```suggestion
Storing actual source files with the data is preferred over links to
```
since there is no footnote 2.
> they can be aggregated without the need to apply any inheritance principle.
> iv. The provenance file MAY be used to reflect the provenance of a dataset,
> a collection of files or a specific file at any level_of the bids hierarchy.
Suggested change:

```suggestion
a collection of files or a specific file at any level of the bids hierarchy.
```
This replaces #439.
ping @cmaumet @remiadon
This is a branch directly owned by the bids-specification repository, allowing any maintainer to merge PRs to it.
Link to the HTML render of this BEP:
https://bids-specification--487.org.readthedocs.build/en/487/03-modality-agnostic-files.html#provenance-of-bids-datasets-files-and-derivatives