BEP028 - Provenance by satra · Pull Request #487 · bids-standard/bids-specification

satra · 2020-05-27T23:50:59Z

@cmaumet advised that this PR is outdated and should not be treated as the "reference" for BEP028 at the moment; the Google Doc is the one to follow for now.

This replaces #439.

ping @cmaumet @remiadon

This is a branch directly owned by the bids-specification allowing any maintainer to merge PRs to this.

Link to the HTML render of this BEP:
https://bids-specification--487.org.readthedocs.build/en/487/03-modality-agnostic-files.html#provenance-of-bids-datasets-files-and-derivatives

* upstream/master: (113 commits)
  [DOC] Auto-generate changelog entry for PR #152
  [DOC] Auto-generate changelog entry for PR #467
  Specify that suffix must be alphanumeric
  ENH: make NOT RECOMMENDED stronger (SHOULD NOT) for zero padding for uniqueness
  ENH: Include leading . within definition of the file extension
  ENH: provide an example for a suffix based on an _eeg.vhdr filename
  [DOC] Auto-generate changelog entry for PR #477
  [DOC] Auto-generate changelog entry for PR #460
  Also ignore users urls on github
  Quote regexp in command line
  [INFRA] linkchecker - ignore github pull and tree URLs
  Apply suggestions from code review
  replace purview with scope
  label -> index
  Apply suggestions from code review
  drop _part-, introduce _split-
  Apply SA feedback and amended to purview
  [DOC] Auto-generate changelog entry for PR #459
  Add Domain Expert to Maintainers Group
  [DOC] Auto-generate changelog entry for PR #465
  ...

* upstream/master:
  [DOC] Auto-generate changelog entry for PR #470
  Clarify precision of fractional seconds.
  Separate sentences.
  Square brackets seem to be a problem.
  Wrangle line lengths.
  Add optional fractional seconds to scans file datetimes.

guiomar · 2021-07-22T16:01:28Z

src/03-modality-agnostic-files.md

+                      }
+    }
+which inputs from the BIDS dataset were used together with what software was 
+run in what environment and with what parameters.


Not sure what these lines at the end of the examples mean?

guiomar · 2021-07-22T16:01:58Z

src/03-modality-agnostic-files.md

+                      "commandLine": "mri_convert ..."
+                      }
+    }
+involved in a study.


Suggested change

involved in a study.

They seem unfinished?

guiomar · 2021-07-22T16:02:22Z

src/03-modality-agnostic-files.md

+       }
+    }
+appropriate attribution to the original dataset generators as well as 
+future transformers.  


guiomar · 2021-07-22T16:06:20Z

src/03-modality-agnostic-files.md

+    }
+appropriate attribution to the original dataset generators as well as 
+future transformers.  
+5. For datasets and derivatives, provenance can also include details of 


Something is unclosed here, so all of the following text renders in the grey example/comment style. Maybe it's just missing a ``` to close the block, or something similar. I'm no expert in this formatting :)

Also, the numbering jumps from 1 to 5, so maybe something is missing in between?

yarikoptic

Just minor comments from while I was looking at the PR.

yarikoptic · 2021-10-11T18:37:41Z

src/03-modality-agnostic-files.md


+<sup>1</sup>Storing actual source files with the data is preferred over links to
+external source repositories to maximize long term preservation (which would
+suffer if an external repository would not be available anymore).


Suggested change

suffer if an external repository would not be available anymore).

suffer if an external repository would not be available anymore), and to remove ambiguity

in which version of the code (in the repository) was actually used for a given dataset.

Although I am not sure whether such a recommendation could/should be generally followed, since it remains ambiguous -- e.g. should people copy the entire heudiconv or any other BIDS converter?

Ha -- this is the Code section, hence my comment. But upon rereading, I wonder if the intent was to talk about sourcedata/? Anyway -- the language should be clarified. And I am not 100% sure this addition really relates to BEP028 -- it seems generic and could be submitted as an independent PR.

yarikoptic · 2021-10-11T18:39:18Z

src/03-modality-agnostic-files.md

+
+Optional: Yes
+
+### Rationale


This is a specification and not a textbook. I have not found any other Rationale section, so I am not sure whether such a section (even though concise) should be here.

src/03-modality-agnostic-files.md

bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of bids-standard#371 (small atomic ENHs), and is based on the current state of the specification, where we have GeneratedBy to describe how a BIDS derivative dataset came into existence.

## Rationale

As I have previously stated in many (face-to-face, when it was still possible ;)) conversations, in my view any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is the result of some process (let's call it a pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide a place for such original data (as "raw" in terms of processing, but "raw"er in terms of its relation to the actual data acquired by equipment), and 2) `code/` to provide a place for scripts used to produce or "tune" the dataset. Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other format (e.g. nifti) which is then either converted or just renamed into BIDS layout.

When encountering a new BIDS dataset ATM, it requires forensics and/or data archaeology to discover how this BIDS dataset came about, e.g. to figure out the source of the buggy (meta)data it contains. At the level of individual files, some tools already add ad-hoc fields during conversion to the sidecar .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json: "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json: "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

</details>

ATM I need to add such metadata to datasets produced by heudiconv to make sure that, in case of incremental conversions, there is no switch in the versions of the software.

Unfortunately there is no convention yet in BIDS for storing such information in a standardized way.

- bids-standard/bids-specification#440 proposes to add GeneratedBy (within dataset_description.json), which could provide detailed high-level information that should then be consistent throughout the dataset (so we would need to add safeguards).
- bids-standard/bids-specification#487 is WiP to introduce PROV into the BIDS standard, which would allow establishing _prov.json with all the needed gory details.

For now, since fields in sidecar .json files are not strictly regulated, I think it would be beneficial to users to have the heudiconv version stored there along with the other "Version" fields, such as

    $> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
      "ConversionSoftware": "dcm2niix",
      "ConversionSoftwareVersion": "v1.0.20211006",
      "SoftwareVersions": "syngo MR E11",

and although, strictly speaking, Heudiconv is a "conversion software", since dcm2niix decided to use that pair, I have decided to leave it alone and just come up with yet another descriptive key: "HeudiconvVersion": "0.10.0",
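To illustrate the "no switch in versions" check described above, here is a minimal Python sketch that gathers version-like fields from sidecar .json files and reports whether a given key takes more than one value across a dataset. The function names `collect_version_fields` and `has_version_switch` are hypothetical, and the rule "any key containing `Version`, plus `ConversionSoftware`" is an assumption based only on the keys quoted in this message; it is not part of heudiconv or of BIDS.

```python
import json
from pathlib import Path


def collect_version_fields(dataset_root):
    """Map each sidecar path (relative to the root) to its version-like fields.

    Hypothetical helper: the key-matching rule is an assumption for illustration.
    """
    root = Path(dataset_root)
    found = {}
    for sidecar in sorted(root.rglob("*.json")):
        try:
            metadata = json.loads(sidecar.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON
        if not isinstance(metadata, dict):
            continue  # a sidecar is expected to be a JSON object
        versions = {
            key: value
            for key, value in metadata.items()
            if "Version" in key or key == "ConversionSoftware"
        }
        if versions:
            found[str(sidecar.relative_to(root))] = versions
    return found


def has_version_switch(found, key):
    """Return True if `key` takes more than one distinct value across sidecars."""
    # Serialize values so that lists or nested objects can be compared too.
    values = {
        json.dumps(fields[key], sort_keys=True)
        for fields in found.values()
        if key in fields
    }
    return len(values) > 1
```

Calling `has_version_switch(collect_version_fields("path/to/dataset"), "HeudiconvVersion")` after an incremental conversion would flag a dataset whose sessions were converted with different heudiconv releases.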

sappelhoff · 2022-07-25T09:46:12Z

(NOTE: I'll cross-post this message across several BEP threads)

Hi there, just a quick notification that we have just merged #918 and it may be interesting to look at the implications for this BEP.

We are introducing "BIDS URIs", which unify the way we refer to and point to files in BIDS datasets (as opposed to "dataset-relative" or "subject-relative" or "file-relative" links).

If the diff and discussion in the PR are unclear, you can also read the rendered version: https://bids-specification.readthedocs.io/en/latest/02-common-principles.html#bids-uri

Perhaps there are things in the BEP that need adjusting now, but perhaps also not -- in any case it's good to be aware of this new feature!

Let me know if there are any questions, comments, or concerns.
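For readers weighing what BIDS URIs would mean for provenance records, a small Python sketch of splitting such a URI into its parts follows. It assumes the form `bids:[<dataset-name>]:<relative-path>` as I read the rendered section linked above, with an empty dataset name referring to the current dataset; the names `BidsUri` and `parse_bids_uri` are hypothetical and not part of any BIDS tool.

```python
from typing import NamedTuple


class BidsUri(NamedTuple):
    dataset_name: str  # empty string is taken to mean "the current dataset"
    relative_path: str  # path relative to that dataset's root


def parse_bids_uri(uri):
    """Split a URI of the assumed form bids:[<dataset-name>]:<relative-path>."""
    scheme, sep, rest = uri.partition(":")
    if scheme != "bids" or not sep:
        raise ValueError(f"not a BIDS URI: {uri!r}")
    dataset_name, sep, relative_path = rest.partition(":")
    if not sep or not relative_path:
        raise ValueError(f"missing relative path in: {uri!r}")
    return BidsUri(dataset_name, relative_path)
```

A provenance record could then store `bids::sub-01/anat/sub-01_T1w.nii.gz` for an input in the same dataset, and a named form for inputs that live in another dataset.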

Remi-Gau · 2023-02-13T09:14:37Z

FYI: added a link to the HTML version of the BEP in the top comment of this PR.

Remi-Gau · 2023-02-13T10:14:32Z

maintenance note: added a link to the rendered BEP in the top message of this PR.

yarikoptic · 2023-06-22T07:20:37Z

While listening to @cmaumet give a presentation at the BIDS derivatives meeting, I decided to add a reference to a related effort in another, unrelated field: https://academic.oup.com/gji/article/207/2/1003/2583765 ("An Adaptable Seismic Data Format"). ASDF incorporates the W3C PROV standard to record provenance information.

yarikoptic · 2023-06-22T07:27:08Z

src/03-modality-agnostic-files.md

+
+```Text
+- [Dataset level] prov.jsonld
+- [File level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov


Thinking about the many types of "applications" that operate on entire subjects/sessions, it does make sense to have it at the subject/session level (which is what is presented here, not the file level), and to allow a file-level one too to "complement" it (similar to how the inheritance principle works for other metadata?):

Suggested change

- [File level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov

- [Subject[/session] level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov

- [File level] sub-<label>/[ses-<label>/]<modality>/sub-<label>[_ses-<label>][_entities-<values>]_<suffix>.prov

I am also not sure why it is prov.jsonld at the top level and .prov at the lower levels. Why not .prov.jsonld at the lower levels too?

What about allowing .jsonld at pretty much any level where appropriate? E.g. it could be for a subject, a session, or a specific modality's processing, etc. All of those are to be joined into a single graph anyway, right?

pushed a fix for .prov
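To make the "any level, joined into a single graph" idea concrete, here is a rough Python sketch. It assumes the naming discussed in this thread (a dataset-level `prov.jsonld` and lower-level files ending in `.prov.jsonld`), which was not settled, and it assumes every file keeps its nodes under a top-level `"@graph"` key. Plain concatenation is a simplification: proper JSON-LD merging would resolve each file's `@context` first. The function names `find_prov_files` and `join_graphs` are hypothetical.

```python
import json
from pathlib import Path


def find_prov_files(dataset_root):
    """Return provenance files at any level of the hierarchy, shallowest first.

    The file-name rule is an assumption taken from the discussion above.
    """
    root = Path(dataset_root)
    matches = [
        path
        for path in root.rglob("*.jsonld")
        if path.name == "prov.jsonld" or path.name.endswith(".prov.jsonld")
    ]
    return sorted(matches, key=lambda p: (len(p.relative_to(root).parts), str(p)))


def join_graphs(prov_files):
    """Naively concatenate the "@graph" nodes of every file into one list."""
    nodes = []
    for path in prov_files:
        document = json.loads(Path(path).read_text())
        if isinstance(document, dict):
            nodes.extend(document.get("@graph", []))
    return nodes
```

Because every file contributes its nodes directly, nothing in this sketch depends on an inheritance rule between levels; whether that holds for the final BEP is for the specification to decide.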

yarikoptic · 2023-06-22T15:14:09Z

@cmaumet advised that this PR is outdated and should not be treated as the "reference" for BEP028 at the moment; the Google Doc is the one to follow for now.

yarikoptic · 2024-11-21T19:09:40Z

src/03-modality-agnostic-files.md

 There are no limitations or recommendations on the language and/or
 code organization of these scripts at the moment.

+<sup>1</sup>Storing actual source files with the data is preferred over links to


It seems I can't make suggestions via the button any longer (because the PR is closed?):

Suggested change

<sup>1</sup>Storing actual source files with the data is preferred over links to

Storing actual source files with the data is preferred over links to

since there is no 2.

yarikoptic · 2024-11-21T19:14:53Z

src/03-modality-agnostic-files.md

+they can be aggregated without the need to apply any inheritance principle. 
+
+iv. The provenance file MAY be used to reflect the provenance of a dataset, 
+a collection of files or a specific file at any level_of the bids hierarchy. 


Suggested change

a collection of files or a specific file at any level_of the bids hierarchy.

a collection of files or a specific file at any level of the bids hierarchy.

Remi Adon and others added 10 commits March 25, 2020 10:08

[ADD] bids provencance proposal

bbe42c9

lint bids-prov markdown

35fccbb

simplified uri in chapter 03

cd31b35

mardkown uri : try removing link in name

66bd405

[RM] link causing link checker to fail

fb87411

Revert "[RM] link causing link checker to fail" with tuned up URL

4a06044

This reverts commit fb87411. According to w3c/json-ld-syntax#343 (comment) references should point to final published versions on https://www.w3.org/TR/json-ld11/

reorganize a bit w.r.t. comments on google doc

1bbafe1

Merge pull request #486 from satra/enh/prov

0d08b8c

Enh/prov

satra mentioned this pull request May 27, 2020

[WIP] [ENH] Provenance BEP028 #439

Closed

satra commented May 27, 2020 (five review comments on src/03-modality-agnostic-files.md, all marked outdated)

apply changes from linter on travis

fed1152

sappelhoff reviewed May 28, 2020 (src/03-modality-agnostic-files.md, marked outdated)

Update src/03-modality-agnostic-files.md

43946d0

Co-authored-by: Stefan Appelhoff <stefan.appelhoff@mailbox.org>

satra mentioned this pull request Jun 2, 2020

[ENH] Storing basic provenance on JSON sidecars of derivatives #300

Closed

sappelhoff mentioned this pull request Jun 7, 2020

Provenance enhancement for BIDS #368

Closed

tsalo mentioned this pull request Jun 7, 2020

[INFRA] Convert entity table to yaml #475

Merged

5 tasks

remiadon mentioned this pull request Jun 12, 2020

BIDS-prov examples ohbm/hackathon2020#181

Open

14 tasks

sappelhoff added the BEP label Jul 4, 2020

cmaumet mentioned this pull request Aug 6, 2020

Create contributing.md bids-standard/BEP028_BIDSprov#27

Merged

sappelhoff marked this pull request as draft May 22, 2021 16:40

sappelhoff changed the title ~~[WIP][ENH] BEP028 - Provenance~~ BEP028 - Provenance May 22, 2021

guiomar reviewed Jul 22, 2021 (src/03-modality-agnostic-files.md, marked outdated)

Update src/03-modality-agnostic-files.md

5a4d3a4

guiomar reviewed Jul 22, 2021 (src/03-modality-agnostic-files.md, marked outdated)

satra and others added 3 commits July 22, 2021 11:58

Update src/03-modality-agnostic-files.md

e93110e

Co-authored-by: Julia Guiomar Niso Galán <guiomar.niso@ctb.upm.es>

Update src/03-modality-agnostic-files.md

e18b349

Co-authored-by: Julia Guiomar Niso Galán <guiomar.niso@ctb.upm.es>

Merge branch 'master' into bep-028

88aa1f6

guiomar reviewed Jul 22, 2021

yarikoptic reviewed Oct 11, 2021

yarikoptic mentioned this pull request Oct 11, 2021

ENH: add HeudiconvVersion to sidecar .json files nipy/heudiconv#529

Merged

Update src/03-modality-agnostic-files.md

f3aa34d

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

Remi-Gau mentioned this pull request Dec 10, 2022

A JSON-LD version of the dataset_description.json metadata for harmonizing with other data standards? #1370

Open

yarikoptic mentioned this pull request Mar 13, 2023

[ENH] add _proc-<label> to all modalities having _rec (anat, fmap, func, perf, and pet) #105

Open

yarikoptic reviewed Jun 22, 2023

yarikoptic closed this Jun 22, 2023

yarikoptic mentioned this pull request Jun 22, 2023

Revert "update bep028 link" to point back to google doc bids-standard/bids-website#313

Merged

effigies mentioned this pull request Sep 28, 2023

Next steps for BIDS-Prov bids-standard/BEP028_BIDSprov#125

Closed

yarikoptic mentioned this pull request Nov 14, 2024

Allow for GeneratedBy in any sidecar .json file to collect their provenance #1970

Open

yarikoptic reviewed Nov 21, 2024

6 participants