Alternative articles processing flavors #1202

Merged
lfoppiano merged 88 commits into master from
feature/segmentation-light
Jan 10, 2025

@lfoppiano lfoppiano commented Nov 21, 2024

This PR implements two alternative segmentation flavors for scientific articles:

  • article/light: Segments the document into header and body. Only the title, authors, DOI, and publication date are extracted, when available. The body is segmented into paragraphs only.
  • article/light-ref: Segments the document into header, body, and references. The title, authors, DOI, and publication date are extracted, when available. The body is segmented into paragraphs only, and references are processed as usual.
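
The difference between the two flavors can be summarized as a small lookup table. This is only an illustrative sketch in Python: `FlavorSpec`, `FLAVORS`, and `summarize` are hypothetical names invented for this example and are not part of GROBID's code or API. Only the flavor names and the listed segments and fields come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple

# Header fields both flavors try to extract, per the description above.
HEADER_FIELDS = ("title", "authors", "doi", "publication_date")


@dataclass(frozen=True)
class FlavorSpec:
    """Hypothetical summary of one flavor; not a GROBID class."""
    segments: Tuple[str, ...]       # top-level parts the document is split into
    header_fields: Tuple[str, ...]  # extracted from the header when available
    body_unit: str                  # granularity of the body segmentation
    parse_references: bool          # whether references are processed as usual


FLAVORS = {
    "article/light": FlavorSpec(
        segments=("header", "body"),
        header_fields=HEADER_FIELDS,
        body_unit="paragraph",
        parse_references=False,
    ),
    "article/light-ref": FlavorSpec(
        segments=("header", "body", "references"),
        header_fields=HEADER_FIELDS,
        body_unit="paragraph",
        parse_references=True,
    ),
}


def summarize(name: str) -> str:
    """Return a one-line, human-readable summary of a flavor."""
    spec = FLAVORS[name]
    refs = "with references" if spec.parse_references else "without references"
    return f"{name}: {' + '.join(spec.segments)} ({refs})"


if __name__ == "__main__":
    for flavor_name in FLAVORS:
        print(summarize(flavor_name))
```

The only difference captured here is the extra references segment in article/light-ref; header extraction and body granularity are the same in both.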

The article's body is then composed of two paragraphs:

  • The first paragraph contains the leftovers from the header. Since the header extraction may be sparse, this avoids losing data.
  • The second paragraph contains the full body.

The article body is segmented into head and paragraphs. Tables and figures are embedded in the paragraphs too.

PR #1151 was tested as part of this PR.
