Flexible fusion part 1: CSV short-text document corpus format #863

Merged
osma merged 4 commits into main from
issue813-flexible-fusion-csv-short-text-corpus
Aug 6, 2025
Conversation

Member

@osma osma commented Jul 24, 2025

This PR implements the first part of #813 (flexible fusion backend), which is independent of the rest and may be useful on its own: a new CSV short-text corpus format. It is very similar to the existing TSV short-text format, with two important differences:

  1. It is CSV-based, so columns are separated by commas rather than tabs. Any CSV file that the Python csv module can read without custom options should be OK.
  2. There is a mandatory header row that defines the columns. Thus the column order is flexible and it's possible to add custom columns.

Right now, only two columns are really supported (and required): text and subject_uris. Any other columns are silently ignored. My plan is to build on this when implementing #813 and enable the use of custom columns for additional document metadata; see #813 (comment) for details. However, this PR can be merged on its own.

The code supports both uncompressed CSV files (file names must have the extension .csv, case-insensitive) and gzipped CSV files (.csv.gz, case-insensitive).
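
To illustrate the format described above, here is a minimal sketch of a reader, not the code from this PR. It uses only the standard csv and gzip modules. The function names open_csv_text and read_short_text_csv are hypothetical, and the sketch assumes UTF-8 input and that the subject_uris cell holds whitespace-separated URIs, optionally wrapped in angle brackets; neither assumption is stated in the description above.

```python
import csv
import gzip

# The two columns the PR description names as required.
REQUIRED_COLUMNS = {"text", "subject_uris"}


def open_csv_text(path):
    """Open a .csv or .csv.gz file in text mode (hypothetical helper).

    The extension is matched case-insensitively; UTF-8 is an assumption.
    """
    lower = path.lower()
    if lower.endswith(".csv.gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    if lower.endswith(".csv"):
        return open(path, "r", encoding="utf-8", newline="")
    raise ValueError(f"unsupported file extension: {path}")


def read_short_text_csv(path):
    """Yield (text, [subject URIs]) pairs from a CSV short-text corpus."""
    with open_csv_text(path) as handle:
        # DictReader uses the mandatory header row, so column order is free.
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        for row in reader:
            # Assumption: URIs are whitespace-separated, optionally in <...>.
            # A short row leaves the cell as None, hence the "or ''".
            cell = row["subject_uris"] or ""
            uris = [uri.strip("<>") for uri in cell.split()]
            # Columns other than the two required ones are simply not read.
            yield row["text"], uris
```

Because the rows are looked up by header name, a file whose header reads `title,subject_uris,text` is parsed the same way as one that reads `text,subject_uris`, and the extra `title` column is ignored.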

Implementation notes:

@osma osma self-assigned this Jul 24, 2025

codecov bot commented Jul 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.65%. Comparing base (6bae2e5) to head (d5bec87).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #863   +/-   ##
=======================================
  Coverage   99.64%   99.65%           
=======================================
  Files          99       99           
  Lines        7349     7446   +97     
=======================================
+ Hits         7323     7420   +97     
  Misses         26       26           

@osma osma changed the title WIP: CSV short-text document corpus format Support new CSV short-text document corpus format Jul 24, 2025
@osma osma marked this pull request as ready for review July 24, 2025 09:05
@osma osma requested a review from juhoinkinen July 24, 2025 09:05
@osma osma changed the title Support new CSV short-text document corpus format Flexible fusion part 1: CSV short-text document corpus format Jul 24, 2025
@osma osma mentioned this pull request Jul 24, 2025
Member

@juhoinkinen juhoinkinen left a comment

LGTM, please see the comment about lines without a comma.

sonarqubecloud bot commented Aug 6, 2025

@osma osma added this to the 1.4 milestone Aug 6, 2025
@osma osma merged commit 777e138 into main Aug 6, 2025
19 of 22 checks passed
@osma osma deleted the issue813-flexible-fusion-csv-short-text-corpus branch August 6, 2025 10:40
@osma osma mentioned this pull request Aug 15, 2025