GSoC meta issue: OCR Integration in JabRef #13267

@Siedlerchr

Is your suggestion for improvement related to a problem? Please describe.
As a user, I want to extract text from PDF documents and use it for searching.

Describe the solution you'd like
A solution that extracts text via OCR and makes it searchable.

Additional context

https://jabref.github.io/GSoC/posts/ocr/

JabRef, a comprehensive literature management application, currently handles metadata and text-based PDF documents. Scanned PDFs, particularly historical articles, are a significant limitation: because they are image-based, they are not text-searchable. This project aims to close this gap by integrating OCR (Optical Character Recognition), enabling full-text search in scanned PDFs.

Useful links:

Some aspects:

  • Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local install
  • Define a common interface to support multiple OCR engines
  • Provide a good default set of settings for the OCR engines
  • Support expert configuration of the settings
  • Add the extracted text as a layer to the PDF so that Apache Lucene can parse it
  • Add an option to further process the text with Grobid or a language model for metadata extraction
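
To make the "common interface" aspect more concrete, here is a minimal sketch of what such an abstraction could look like. All names (`OcrEngine`, `OcrResult`, `OcrEngineRegistry`, `FixedTextEngine`) are hypothetical and not part of JabRef; the stub engine only returns a fixed string and stands in for a real local or cloud backend.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical result type: the recognized text plus the name of the engine that produced it.
record OcrResult(String engineName, String text) {}

// Hypothetical common interface that every OCR backend (local or cloud) would implement.
interface OcrEngine {
    String name();

    OcrResult recognize(Path pdf);
}

// Stand-in engine for illustration only; it ignores the file and returns a fixed text.
final class FixedTextEngine implements OcrEngine {
    private final String name;
    private final String text;

    FixedTextEngine(String name, String text) {
        this.name = name;
        this.text = text;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public OcrResult recognize(Path pdf) {
        return new OcrResult(name, text);
    }
}

// Registry that lets the rest of the application pick an engine by name.
final class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> find(String name) {
        return Optional.ofNullable(engines.get(name));
    }
}
```

With this shape, adding another engine would only require a new `OcrEngine` implementation and one `register` call; callers keep looking engines up by name.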

Expected outcome:

A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
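
For outcome B, one possible way to combine good defaults with expert overrides is an immutable settings object. This is only a sketch under assumptions: the class name `OcrSettings`, the keys `language` and `dpi`, and the default values are invented for illustration and do not come from JabRef or any specific OCR engine.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical settings holder: defaults for most users, per-key overrides for experts.
final class OcrSettings {
    // Assumed default values, chosen only for illustration.
    private static final Map<String, String> DEFAULTS = Map.of(
            "language", "eng",
            "dpi", "300");

    private final Map<String, String> values;

    private OcrSettings(Map<String, String> values) {
        this.values = Map.copyOf(values);
    }

    static OcrSettings defaults() {
        return new OcrSettings(DEFAULTS);
    }

    // Returns a new settings object with one key overridden; the original stays unchanged.
    OcrSettings with(String key, String value) {
        Map<String, String> copy = new HashMap<>(values);
        copy.put(key, value);
        return new OcrSettings(copy);
    }

    // Returns the value for the key, or null if the key is not set.
    String get(String key) {
        return values.get(key);
    }
}
```

An expert user could then start from `OcrSettings.defaults()` and call, for example, `with("language", "deu")` for a German scan, while everyone else never touches the settings.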

Skills required:

  • Proficiency in Java programming.
  • A keen interest and curiosity in document processing and AI technologies.

Possible mentors:

@InAnYan, @calixtus, @subhramit, @Stewori

Project size:
90h (small)
