Flexible fusion part 2: core functionality#864

Merged
osma merged 11 commits intomainfrom
issue813-flexible-fusion-core-functionality
Aug 6, 2025

@osma osma commented Jul 24, 2025

2nd part of #813, after PR #863

This PR implements the core functionality needed for experimenting with "flexible fusion", i.e. making it possible to apply Annif backends not just to document text but also to metadata fields. It enables the following:

  • The CSV short text format can now contain extra columns/fields for metadata. Internally, they are stored in the new metadata field within Document objects.
  • A new transform select can be used to select which parts of a document (main text and/or metadata fields) are given as input to the backend. For example, select(title) or select(title,description,text).
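
To illustrate the first point, here is a minimal sketch of how extra CSV columns could end up in a per-document metadata dict. It is not the code from this PR: the column names `text` and `subject_uris`, the sample data and the `read_rows` helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are assumptions, not taken from the PR.
SAMPLE_CSV = """text,subject_uris,title,description
Full text of the first document,<http://example.org/s1>,First title,First description
Full text of the second document,<http://example.org/s2>,Second title,Second description
"""

# Columns treated as core fields in this sketch; everything else becomes metadata.
RESERVED_COLUMNS = {"text", "subject_uris"}


def read_rows(csv_text):
    """Split each CSV row into main text, subject URIs and a metadata dict."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        metadata = {k: v for k, v in row.items() if k not in RESERVED_COLUMNS}
        rows.append(
            {
                "text": row["text"],
                "subject_uris": row["subject_uris"].split(),
                "metadata": metadata,
            }
        )
    return rows


if __name__ == "__main__":
    for parsed in read_rows(SAMPLE_CSV):
        print(parsed["text"], parsed["metadata"])
```

With data shaped like this, a transform such as select(title) only needs to look up the named keys in the metadata dict.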

Implementation notes

  • I had to change the namedtuple Document into a real class in order to better support optional fields. Now only text is a mandatory field, while subject_set and metadata are optional (defaulting to empty set and empty dict, respectively)
  • I extended the transform functionality. Previously transforms could only operate on text. Now transforms can implement either transform_text (to transform text only, as before) or transform_doc (to transform whole Document objects). The latter was needed to implement the select transform.
  • I had to change all the various suggest methods on both projects and backends to accept a Document (or batch of Document objects) instead of text. This led to a lot of churn, but it's all in a single commit b0bb163.
  • Currently, this PR targets the branch of PR #863 (Flexible fusion part 1: CSV short-text document corpus format) because it is built on top of it. It needs to be merged after #863.
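
The notes above can be sketched roughly as follows. This is a simplified stand-in, not the actual Annif implementation: `BaseTransform`, `TruncateTransform` and `SelectTransform` are hypothetical names, `subject_set` is modelled as a plain Python set, and joining the selected fields with newlines is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the new Document class: only text is mandatory."""

    text: str
    subject_set: set = field(default_factory=set)  # plain set used here for simplicity
    metadata: dict = field(default_factory=dict)


class BaseTransform:
    """Hypothetical base class: subclasses override one of the two hooks."""

    def transform_text(self, text):
        return text

    def transform_doc(self, doc):
        # Default behaviour: apply the text-only hook, keep the other fields.
        return Document(
            text=self.transform_text(doc.text),
            subject_set=doc.subject_set,
            metadata=doc.metadata,
        )


class TruncateTransform(BaseTransform):
    """Text-only transform: keeps the first max_length characters."""

    def __init__(self, max_length):
        self.max_length = max_length

    def transform_text(self, text):
        return text[: self.max_length]


class SelectTransform(BaseTransform):
    """Whole-document transform: builds new text from the chosen fields."""

    def __init__(self, *fields):
        self.fields = fields

    def transform_doc(self, doc):
        parts = []
        for name in self.fields:
            # "text" refers to the main text; other names are metadata keys.
            value = doc.text if name == "text" else doc.metadata.get(name, "")
            if value:
                parts.append(value)
        # Assumption: selected fields are joined with newlines.
        return Document(
            text="\n".join(parts),
            subject_set=doc.subject_set,
            metadata=doc.metadata,
        )


if __name__ == "__main__":
    doc = Document(text="body", metadata={"title": "T", "description": "D"})
    print(SelectTransform("title", "text").transform_doc(doc).text)
```

Because every transform exposes transform_doc in this sketch, a caller can pass Document objects through a chain of transforms and hand the result to a suggest method without caring which hook each transform overrides.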

What is still missing

This PR implements only the minimum needed for experimentation with custom metadata and select transforms: it should now be possible to train and evaluate models with custom document metadata, as long as the CSV short text corpus format is used.

These parts are still missing, to be implemented in future PRs:

  • support for metadata in the full text corpus format (for example via YAML front matter, or a separate file for metadata)
  • CLI suggest command should allow specifying extra metadata fields, probably as command line arguments
  • REST API suggest method should allow specifying extra metadata fields in the request + the http backend should support it
  • Web UI should allow entering metadata fields

@osma osma self-assigned this Jul 24, 2025
codecov bot commented Jul 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.65%. Comparing base (777e138) to head (be13776).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #864   +/-   ##
=======================================
  Coverage   99.65%   99.65%           
=======================================
  Files          99      100    +1     
  Lines        7446     7520   +74     
=======================================
+ Hits         7420     7494   +74     
  Misses         26       26           

@osma osma changed the title WIP: Flexible fusion core functionality Flexible fusion part 2: core functionality Jul 24, 2025
@osma osma marked this pull request as ready for review July 24, 2025 14:43
@osma osma requested a review from juhoinkinen July 24, 2025 14:43
@osma osma mentioned this pull request Jul 24, 2025

@juhoinkinen juhoinkinen left a comment

👍

@osma osma changed the base branch from issue813-flexible-fusion-csv-short-text-corpus to main August 6, 2025 10:31
@osma osma added this to the 1.4 milestone Aug 6, 2025
sonarqubecloud bot commented Aug 6, 2025

@osma osma merged commit 1c48f91 into main Aug 6, 2025
14 checks passed
@osma osma deleted the issue813-flexible-fusion-core-functionality branch August 6, 2025 10:40