Outcomes from OHBM Hackathon 2025#15

Merged
Lestropie merged 32 commits into master from hackathon
Jul 8, 2025

Conversation

@Lestropie Lestropie commented Jun 23, 2025

This content is being posted as a draft Pull Request in order to demonstrate the volume of changes generated during the OHBM 2025 Hackathon.
The code still requires further modification before the tool can be considered ready for broader uptake.


The following is a list of items flagged within the code base that either need to be addressed prior to making this first merge to master, or need to have a standalone Issue created so that they can be addressed later.

  • Check expected behaviour of metadata files other than JSON

    • Do all such files obey the "use only the nearest" rule under the Inheritance Principle,
      or are there some where it would be a violation for there to be more than one potential match
      regardless of what is stated within the Inheritance Principle?

    • Does this behaviour change as a function of Inheritance Principle ruleset?

    • Are there metadata file types outside of the MRI space I'm not aware of?

  • Refactor detection of IP violations between full-graph and standalone-file functions

    Currently, the code responsible for generating a list of applicable metadata files per data file takes partial responsibility for detecting violations of the specified Inheritance Principle ruleset while the list is being generated.
    This however turned out to be the wrong location for much of that code.
    What I want instead is:

    1. Function that generates, for a specified data file, all possibly applicable metadata files, without observation of any specific IP ruleset.

    2. Function that, for a specified data file:

      1. Generates the list of candidate metadata files (step 1).
      2. Checks for violations of the IP for just that mapping.
      3. Prunes the lists if necessary: depending on metadata file extension, it may be only the "closest" metadata file that is ultimately deemed applicable.

    3. Class Graph, which:

      1. Generates the list of candidate metadata files for all data files (step 1).
      2. Generates the full inverse mapping.
      3. Checks for all violations of the IP, since step 1 will no longer perform a partial check of such.
      4. Option to in-place prune the whole graph.

    This will have the added benefit of better distinguishing between errors that relate to general parsing of the BIDS dataset, and those that relate specifically to the Inheritance Principle.
    Currently these are distinguished by their respective Exception classes, but I think it would be better if they occurred at different points during processing.

    This may also come with corresponding changes to function names; eg. metadata "applicability".
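The proposed three-stage split can be sketched as follows. This is a minimal illustration and not the package's code: every name (`parse_name`, `candidates`, `check`, `prune`, `IPViolation`, `NEAREST_ONLY`, the attributes of `Graph`) is hypothetical, file name matching is reduced to "same suffix, metadata entities are a subset of the data file's entities", and the single rule in `check` is a toy stand-in for a real ruleset.

```python
from pathlib import PurePosixPath

# Hypothetical: extensions for which only the nearest file is applicable.
NEAREST_ONLY = {".bvec", ".bval"}


class IPViolation(Exception):
    """Hypothetical exception for Inheritance Principle violations."""


def parse_name(path):
    """Split a BIDS-style file name into (entities, suffix, extension)."""
    stem, _, extension = PurePosixPath(path).name.partition(".")
    *entities, suffix = stem.split("_")
    return frozenset(entities), suffix, "." + extension


def distance_key(path):
    """Sort key: shallower directories first, then fewer entities."""
    return len(PurePosixPath(path).parts), len(parse_name(path)[0])


def candidates(datafile, metafiles):
    """Stage 1: all possibly applicable metadata files; no ruleset applied."""
    data_entities, data_suffix, _ = parse_name(datafile)
    data_dir = PurePosixPath(datafile).parent
    found = []
    for metafile in metafiles:
        meta_entities, meta_suffix, _ = parse_name(metafile)
        meta_dir = PurePosixPath(metafile).parent
        in_scope = meta_dir == data_dir or meta_dir in data_dir.parents
        if in_scope and meta_suffix == data_suffix and meta_entities <= data_entities:
            found.append(metafile)
    return sorted(found, key=distance_key)  # most distant first, nearest last


def check(datafile, found):
    """Toy ruleset: two same-extension candidates may not be equidistant."""
    seen = set()
    for metafile in found:
        key = (parse_name(metafile)[2], distance_key(metafile))
        if key in seen:
            raise IPViolation(f"ambiguous metadata for {datafile}: {metafile}")
        seen.add(key)


def prune(found):
    """Keep only the nearest file for extensions listed in NEAREST_ONLY."""
    nearest = {parse_name(m)[2]: m for m in found}  # later (nearer) entries win
    return [m for m in found
            if parse_name(m)[2] not in NEAREST_ONLY or nearest[parse_name(m)[2]] == m]


def metafiles_for_datafile(datafile, metafiles):
    """Stage 2: candidates, then check, then prune, for one data file."""
    found = candidates(datafile, metafiles)
    check(datafile, found)
    return prune(found)


class Graph:
    """Stage 3: forward and inverse mappings for the whole dataset."""

    def __init__(self, datafiles, metafiles):
        self.m4d = {d: candidates(d, metafiles) for d in datafiles}
        self.d4m = {m: [d for d in datafiles if m in self.m4d[d]] for m in metafiles}

    def check(self):
        for datafile, found in self.m4d.items():
            check(datafile, found)

    def prune(self):
        self.m4d = {d: prune(found) for d, found in self.m4d.items()}
        self.d4m = {m: [d for d in self.m4d if m in self.m4d[d]] for m in self.d4m}
```

In this arrangement, building a `Graph` can only fail on general parsing problems, while `IPViolation` can only arise from the explicit `check` step, which is the separation described above.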

  • Implement check for currently overlooked Inheritance Principle rule.

    It is a violation for a metadata file to appear applicable to some data file based on their file names alone, yet to reside in neither the same directory as that data file nor one of its parent directories.
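A check for this rule could look roughly like the sketch below. The helper names (`name_matches`, `location_permits`, `misplaced_metafiles`) are hypothetical and are not the package's `utils.applicability` functions; name matching is simplified to "same suffix, and the metadata file's entities are a subset of the data file's".

```python
from pathlib import PurePosixPath


def _entities_and_suffix(path):
    """Split a BIDS-style file name into its entity set and suffix."""
    stem = PurePosixPath(path).name.partition(".")[0]
    *entities, suffix = stem.split("_")
    return set(entities), suffix


def name_matches(datafile, metafile):
    """True if the file names alone suggest the metadata file is applicable."""
    data_entities, data_suffix = _entities_and_suffix(datafile)
    meta_entities, meta_suffix = _entities_and_suffix(metafile)
    return meta_suffix == data_suffix and meta_entities <= data_entities


def location_permits(datafile, metafile):
    """True if the metadata file sits beside or above the data file."""
    data_dir = PurePosixPath(datafile).parent
    meta_dir = PurePosixPath(metafile).parent
    return meta_dir == data_dir or meta_dir in data_dir.parents


def misplaced_metafiles(datafile, metafiles):
    """Metadata files that match by name but reside in a disallowed location."""
    return [m for m in metafiles
            if name_matches(datafile, m) and not location_permits(datafile, m)]
```

For a data file at `sub-01/func/sub-01_task-rest_bold.nii.gz`, a file `sub-01/anat/task-rest_bold.json` would be reported, whereas `task-rest_bold.json` at the dataset root would not.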

  • Improve checks of metadata association graph equivalency

    For proposal ruleset "I1195" where highly complex inheritance is permitted, it is possible for two metadata files that reside in the same directory and have an equal number of entities to one another to both be applicable to the same data file (as long as they do not have any metadata key-value clashes).
    This however poses a problem for comparing file association graphs for equivalence, where metadata files associated with a data file are inferred to be ordered.
    A more robust check would involve either:

    • Enforcing an equivalent order of these lists when that order is consequential, and not enforcing such an order when it is not.
    • For all lists, enforce equivalent order of entries only when that order can be disambiguated; for entries in the same directory and with the same number of entities, do not interpret permutation of those entries as inequivalence of the graphs.
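One way to realise the second option is to collapse each ordered list into a sequence of unordered groups, where a group holds consecutive entries that cannot be ordered relative to one another. The sketch below assumes that "distance" is fully captured by directory depth and entity count; the function names are hypothetical.

```python
from itertools import groupby
from pathlib import PurePosixPath


def distance_key(path):
    """(directory depth, number of entities) of a metadata file path."""
    p = PurePosixPath(path)
    n_entities = p.name.partition(".")[0].count("_")
    return len(p.parts), n_entities


def as_ordered_groups(metafiles):
    """Collapse runs of equidistant entries into unordered sets."""
    return [frozenset(group) for _, group in groupby(metafiles, key=distance_key)]


def lists_equivalent(one, two):
    """Order matters between groups, but not within a group."""
    return as_ordered_groups(one) == as_ordered_groups(two)


def graphs_equivalent(one, two):
    """Compare two {data file: [metadata files]} mappings."""
    return one.keys() == two.keys() and all(
        lists_equivalent(one[key], two[key]) for key in one
    )
```

Because `groupby` only merges adjacent entries, swapping two files that differ in depth or entity count still registers as inequivalence.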

  • More elegant solution for loading non-JSON metadata

    For instance, the attributes ascribed to each metadata file extension
    could include a property indicating that files of that extension can be read as plaintext matrix data
    (eg. .bvec / .bval).
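As a sketch of that idea, a per-extension table of traits could drive the choice of loader. The `ExtensionTraits` class, its field names, and the `TRAITS` table are all hypothetical, and the matrix parser assumes whitespace-delimited numeric rows.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionTraits:
    mergeable_keyvalue: bool  # key-value content that can be merged across files
    plaintext_matrix: bool    # whitespace-delimited rows of numbers


# Hypothetical table; a real one would cover every supported extension.
TRAITS = {
    ".json": ExtensionTraits(mergeable_keyvalue=True, plaintext_matrix=False),
    ".bvec": ExtensionTraits(mergeable_keyvalue=False, plaintext_matrix=True),
    ".bval": ExtensionTraits(mergeable_keyvalue=False, plaintext_matrix=True),
}


def parse_metadata(text, extension):
    """Parse file content according to the traits of its extension."""
    traits = TRAITS[extension]  # KeyError for extensions not in the table
    if traits.plaintext_matrix:
        return [[float(value) for value in line.split()]
                for line in text.splitlines() if line.strip()]
    return json.loads(text)
```

Adding support for another plaintext matrix format would then only require a new table entry rather than a new branch in the loading code.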

  • Change how key-value metadata overrides are identified?

    Currently this is done through a stand-alone function.
    That works, and is technically faster and uses less RAM than a full metadata load, but isn't as general.
    Having a function return a tuple containing firstly all of the metadata, and secondly the set of overrides, would allow both of these steps to be done in one go.
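A combined function of that kind might look like the sketch below. The name `merge_with_overrides` is hypothetical, the input is assumed to be already-loaded dictionaries ordered from most distant to nearest, and a key is assumed to count as overridden only when a nearer file redefines it with a different value.

```python
def merge_with_overrides(dicts):
    """Merge key-value metadata and report which keys were overridden.

    `dicts` is ordered from the most distant metadata file to the nearest,
    so later values take precedence.
    """
    merged = {}
    overrides = set()
    for contents in dicts:
        for key, value in contents.items():
            if key in merged and merged[key] != value:
                overrides.add(key)
            merged[key] = value
    return merged, overrides
```

Callers that only need the overrides can discard the first element of the tuple, at the cost of the extra memory noted above.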

  • Implement checks comparing behaviour of full graph representation and individual functions

    While evaluation of the full graph makes sense for validation of datasets and for evaluating IP rulesets, it is likely to be a common use case of this package to simply query what metadata are associated with a nominated data file. The testing therefore needs to ensure that the outcomes from running these functions individually are exactly the same as what is captured by the full graph.

  • Check ability of individual fetch functions to detect IP violations

    Checking whether a dataset violates the IP ruleset in some way is much easier when the full association graph is stored in memory. It is however desired that if a developer chooses to make direct use of the functions that access the metadata per data file / the data files to which a metadata file applies, any violations of the IP under some nominated ruleset will nevertheless be captured.

    My expectation is that it won't be possible for those functions to catch all such violations. I do however want to confirm that the set of violations that they fail to identify is acceptable.

@Lestropie Lestropie self-assigned this Jun 23, 2025
3.9-slim omitted numpy, but would attempt to bump Python itself to 3.11 in order to install it, so moved to the non-slim version.
Precludes getting an unhandled Exception regarding subscripting of types in Python 3.8 and earlier.
- In 1.1.x, make sure that metadata files that are subject-specific are not placed in the subject-agnostic root directory, and that metadata files that are subject-agnostic (by name) are not placed in a subject-specific directory.
- In 1.7.0, ensure that rule 3 of the Inheritance Principle is not violated.
- Replace ruleset "1.x" with "1.1.x" and "1.7.x", since those two versions of the specification differ in the exact criteria applied.
- Perform testing of new example datasets that demonstrate these properties.
- Fix comparison between data file and metadata file suffixes happening in utils.applicability.is_applicable() but not utils.applicability.is_applicable_nameonly().
The version of this file provided in the BIDS Apps example repository is dependent on bootstrapping from a Docker container of a fixed name inside the "bids" organisation.
Lestropie added 18 commits June 24, 2025 08:56
Evaluation of the validity of the data file - metadata file association graph is deferred until after the full graph is generated.
For each sample dataset, the graph is constructed only once; that graph is then considered immutable as it is tested under different rule sets.
This collapses the graph information for those metadata files that do not have key-value dictionaries that can be merged, instead choosing only the nearest metadata file given the filesystem hierarchy and BIDS file names.
Some test data have been correspondingly updated so that the comparison is performed against the pruned graph.
Lots of other changes along the way, including defining class BIDSFilePathList to have ability to encapsulate functions that operate exclusively on such data.
Previously, two data file - metadata file association graphs were compared simply by ensuring that all files in each list were present in the other. This however failed to account for the fact that in many circumstances these lists must be interpreted in an ordered way, and so ensuring that their orders are equivalent is important. The lists however cannot simply be tested for equivalence. A ruleset may permit a single data file to load key-value metadata from multiple JSON files that are equivalent in both filesystem directory location and number of entities. In this scenario, the order within that set of equidistant metadata files is not of consequence in a test of equivalence. This commit replaces the earlier placeholder implementation with this more robust algorithm.
- Multiple fixes to datafiles_for_metafile() when a ruleset is specified.
- During testing, evaluate whether associating a data file with metadata files (or vice versa) is capable of detecting potential IP issues. Not all IP issues can be reasonably caught when operating in this way, so a small number of such tests are skipped.
De-escalate the presence of key-value overrides for rulesets 1.1.x and 1.7.x; while these specifications recommend against using such overrides extensively, they do not recommend against the practice itself.
- New ruleset 1.11.x, which reflects the change in #1834 of the BIDS specification, wherein it will be recommended to never use key-value overriding; this should therefore result in the issuing of a warning.