Skip to content

BEP028 - Provenance#487

Closed
satra wants to merge 17 commits intomasterfrom
bep-028
Closed

BEP028 - Provenance#487
satra wants to merge 17 commits intomasterfrom
bep-028

Conversation

@satra
Copy link
Copy Markdown
Collaborator

@satra satra commented May 27, 2020

@cmaumet advised that this PR is outdated and shouldn't be the "reference" for BEP028 ATM and the google doc should be the one to look after ATM.


This replaces #439.

ping @cmaumet @remiadon

This is a branch directly owned by the bids-specification allowing any maintainer to merge PRs to this.

Link to the HTML render of this BEP:
https://bids-specification--487.org.readthedocs.build/en/487/03-modality-agnostic-files.html#provenance-of-bids-datasets-files-and-derivatives

Remi Adon and others added 10 commits March 25, 2020 10:08
This reverts commit fb87411.

According to w3c/json-ld-syntax#343 (comment)
references should point to final published versions on
https://www.w3.org/TR/json-ld11/
* upstream/master: (113 commits)
  [DOC] Auto-generate changelog entry for PR #152
  [DOC] Auto-generate changelog entry for PR #467
  Specify that suffix must be alphanumeric
  ENH: make NOT RECOMMENDED stronger (SHOULD NOT) for zero padding for uniqueness
  ENH: Include leading . within definition of the file extension
  ENH: provide an example for a suffix based on an _eeg.vhdr filename
  [DOC] Auto-generate changelog entry for PR #477
  [DOC] Auto-generate changelog entry for PR #460
  Also ignore users urls on github
  Quote regexp in command line
  [INFRA] linkchecker - ignore github pull and tree URLs
  Apply suggestions from code review
  replace purview with scope
  label -> index
  Apply suggestions from code review
  drop _part-, introduce _split-
  Apply SA feedback and amended to purview
  [DOC] Auto-generate changelog entry for PR #459
  Add Domain Expert to Maintainers Group
  [DOC] Auto-generate changelog entry for PR #465
  ...
* upstream/master:
  [DOC] Auto-generate changelog entry for PR #470
  Clarify precision of fractional seconds.
  Separate sentences.
  Square brackets seem to be a problem.
  Wrangle line lengths.
  Add optional fractional seconds to scans file datetimes.
Co-authored-by: Stefan Appelhoff <stefan.appelhoff@mailbox.org>
@tsalo tsalo mentioned this pull request Jun 7, 2020
5 tasks
@sappelhoff sappelhoff added the BEP label Jul 4, 2020
@sappelhoff sappelhoff marked this pull request as draft May 22, 2021 16:40
@sappelhoff sappelhoff changed the title [WIP][ENH] BEP028 - Provenance BEP028 - Provenance May 22, 2021
satra and others added 3 commits July 22, 2021 11:58
Co-authored-by: Julia Guiomar Niso Galán <guiomar.niso@ctb.upm.es>
Co-authored-by: Julia Guiomar Niso Galán <guiomar.niso@ctb.upm.es>
}
}
which inputs from the BIDS dataset were used together with what software was
run in what environment and with what parameters.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what these lines at the end of the examples mean?

"commandLine": "mri_convert ..."
}
}
involved in a study.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
involved in a study.

They seem like not finished?

}
}
appropriate attribution to the original dataset generators as well as
future transformers.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here

}
appropriate attribution to the original dataset generators as well as
future transformers.
5. For datasets and derivatives, provenance can also include details of
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's something unclosed here so all the following text is reading like in the example/comment grey mode. Maybe it's just missing ``` to close the type of text or something similar. I'm not super expert in this formatting :)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the numbering goes from 1 to 5, so maybe something missing in the middle?

yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Oct 11, 2021
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of bids-standard#371
(small atomic ENHs), and is based on current state of the specification where
we have GeneratedBy to describe how a BIDS derivative dataset came to its
existence.

## Rationale

As I had  previously stated in many (face-to-face when it was still
possible ;)) conversations, in my view, any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is a
result of some process (let's call it pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to provide
placement for such original (as "raw" in terms of processing, but "raw"er in
terms of its relation to actual data acquired by equipment), and 2) `code/` to
provide placement for scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other formats (e.g. nifti) which is then either converted or just renamed into
BIDS layout. When encountering a new BIDS dataset ATM it requires forensics
and/or data archaeology to discover how this BIDS dataset came about, to e.g.
possibly figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
during conversion into side car .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Oct 11, 2021
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of bids-standard#371
(small atomic ENHs), and is based on current state of the specification where
we have GeneratedBy to describe how a BIDS derivative dataset came to its
existence.

## Rationale

As I had  previously stated in many (face-to-face when it was still
possible ;)) conversations, in my view, any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is a
result of some process (let's call it pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to provide
placement for such original (as "raw" in terms of processing, but "raw"er in
terms of its relation to actual data acquired by equipment), and 2) `code/` to
provide placement for scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other formats (e.g. nifti) which is then either converted or just renamed into
BIDS layout. When encountering a new BIDS dataset ATM it requires forensics
and/or data archaeology to discover how this BIDS dataset came about, to e.g.
possibly figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
during conversion into side car .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
Copy link
Copy Markdown
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just minor comments while was looking at the PR


<sup>1</sup>Storing actual source files with the data is preferred over links to
external source repositories to maximize long term preservation (which would
suffer if an external repository would not be available anymore).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
suffer if an external repository would not be available anymore).
suffer if an external repository would not be available anymore), and to remove ambiguity
in which version of the code (in the repository) was actually used for a given dataset.

Although not sure if such recommendation could/should be generally followed since it remains ambigous -- e.g. should people copy entire heudiconv or any other BIDS converter?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha -- this is Code section, hence my comment. But upon rereading I wonder if intent was to talk about sourcedata/? Anyways -- language should be clarified. And not 100% sure this addition relates to BEP028 really -- seems generic and could be submitted as independent PR


Optional: Yes

### Rationale
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a specification and not a text book. I have not found any other Rationale section, and thus not sure if such (even though concise) section should be here.

yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Oct 11, 2021
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of bids-standard#371
(small atomic ENHs), and is based on current state of the specification where
we have GeneratedBy to describe how a BIDS derivative dataset came to its
existence.

## Rationale

As I had  previously stated in many (face-to-face when it was still
possible ;)) conversations, in my view, any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is a
result of some process (let's call it pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to provide
placement for such original (as "raw" in terms of processing, but "raw"er in
terms of its relation to actual data acquired by equipment), and 2) `code/` to
provide placement for scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other formats (e.g. nifti) which is then either converted or just renamed into
BIDS layout. When encountering a new BIDS dataset ATM it requires forensics
and/or data archaeology to discover how this BIDS dataset came about, to e.g.
possibly figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
during conversion into side car .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Oct 11, 2021
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of bids-standard#371
(small atomic ENHs), and is based on current state of the specification where
we have GeneratedBy to describe how a BIDS derivative dataset came to its
existence.

## Rationale

As I had  previously stated in many (face-to-face when it was still
possible ;)) conversations, in my view, any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is a
result of some process (let's call it pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to provide
placement for such original (as "raw" in terms of processing, but "raw"er in
terms of its relation to actual data acquired by equipment), and 2) `code/` to
provide placement for scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other formats (e.g. nifti) which is then either converted or just renamed into
BIDS layout. When encountering a new BIDS dataset ATM it requires forensics
and/or data archaeology to discover how this BIDS dataset came about, to e.g.
possibly figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
during conversion into side car .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
yarikoptic added a commit to dbic/heudiconv that referenced this pull request Oct 11, 2021
Unfortunately there is no convention yet in BIDS on storing such information in
a standardized way.

bids-standard/bids-specification#440
 proposes to add GeneratedBy (within dataset_description.json)
 which could provide detailed high level information which should then be
 consistent through out dataset (so we would need to add safeguards)

bids-standard/bids-specification#487
 is WiP to introduce PROV into BIDS standard, which would allow to establish
 _prov.json with all needed gory details.

For now, since fields in side car .json files are not strictly regulated,
I think it would be benefitial to user to have heudiconv version stored there
along with other "Version" fields, such as

	$> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
	  "ConversionSoftware": "dcm2niix",
	  "ConversionSoftwareVersion": "v1.0.20211006",
	  "SoftwareVersions": "syngo MR E11",

and although strictly speaking Heudiconv is a "conversion software", since
dcm2niix decided to use that pair, I have decided to leave it alone and just
come up with yet another descriptive key

  "HeudiconvVersion": "0.10.0",
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@sappelhoff
Copy link
Copy Markdown
Member

(NOTE: I'll cross-post this message across several BEP threads)

Hi there, just a quick notification that we have just merged #918 and it may be interesting to look at the implications for this BEP.

We are introducing "BIDS URIs", which unify the way we refer to and point to files in BIDS datasets (as opposed to "dataset-relative" or "subject-relative" or "file-relative" links).

If the diff and discussion in the PR is unclear, you can also read the rendered version: https://bids-specification.readthedocs.io/en/latest/02-common-principles.html#bids-uri

Perhaps there are things in the BEP that need adjusting now, but perhaps also not -- in any case it's good to be aware of this new feature!

Let me know if there are any questions, comments, or concerns.

@Remi-Gau
Copy link
Copy Markdown
Collaborator

FYI: added a link to the HTML version of the BEP in the top comment of this PR.

@Remi-Gau
Copy link
Copy Markdown
Collaborator

maintenance note: added a link to the rendered BEP in the top message of this PR.

@yarikoptic
Copy link
Copy Markdown
Collaborator

while listening to @cmaumet giving a presentation at BIDS derivatives meeting decided to add a reference to a related effort in unrelated other sciences: https://academic.oup.com/gji/article/207/2/1003/2583765 "An Adaptable Seismic Data Format" , ASDF incorporates the W3C PROV standard to record provenance information


```Text
- [Dataset level] prov.jsonld
- [File level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about many types of "applications" operating on entire subject/sessions, indeed it makes sense to have it at the subj/session level (which is presented here, not file level) and have possibility for file level too to "complement" that (similarly how done in inheritance principle for other metadata?):

Suggested change
- [File level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov
- [Subject[/session] level] sub-<label>/[ses-<label>/]sub-<label>[_ses-<label>]_<suffix>.prov
- [File level] sub-<label>/[ses-<label>/]<modality>/sub-<label>[_ses-<label>][_entities-<values>]_<suffix>.prov

I am also not sure why it is prov.jsonld on top and .prov and the lower? why not .prov.jsonld at the bottom levels too?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about pretty much allowing .jsonld at any level where appropriate? E.g. could be for subject session or specific modality processing etc. All of those to be joined into a single graph anyways, right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a fix for .prov

@yarikoptic
Copy link
Copy Markdown
Collaborator

@cmaumet advised that this PR is outdated and shouldn't be the "reference" for BEP028 ATM and the google doc should be the one to look after ATM.

There are no limitations or recommendations on the language and/or
code organization of these scripts at the moment.

<sup>1</sup>Storing actual source files with the data is preferred over links to
Copy link
Copy Markdown
Collaborator

@yarikoptic yarikoptic Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems can't do suggestions any longer (because closed?) via button:

Suggested change
<sup>1</sup>Storing actual source files with the data is preferred over links to
Storing actual source files with the data is preferred over links to

since there is no 2.

they can be aggregated without the need to apply any inheritance principle.

iv. The provenance file MAY be used to reflect the provenance of a dataset,
a collection of files or a specific file at any level_of the bids hierarchy.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a collection of files or a specific file at any level_of the bids hierarchy.
a collection of files or a specific file at any level of the bids hierarchy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants