Is your suggestion for improvement related to a problem? Please describe.
As a user, I want to extract text from scanned PDF documents and use it for searching.
Describe the solution you'd like
A feature that extracts text from PDFs via OCR and makes it searchable.
Additional context
https://jabref.github.io/GSoC/posts/ocr/
JabRef, a comprehensive literature management tool, currently handles both metadata and text-based PDF documents. However, scanned PDFs, particularly historical articles, are image-based and therefore not text-searchable. This project aims to close this gap by integrating OCR (Optical Character Recognition) technology to enable full-text search in scanned PDFs.
Useful links:
Some aspects:
- Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
- Define a common interface to support multiple OCR engines
- Provide a good default set of settings for the OCR engines
- Support expert configuration of the settings
- Add the extracted text as a layer to the PDF so that Apache Lucene can parse it
- Add an option to further process the text with Grobid or a language model for metadata extraction
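To make the "common interface" aspect more concrete, here is a minimal sketch of what such an abstraction might look like. All names in it (`OcrEngine`, `OcrEngineRegistry`, `PlaceholderEngine`) are hypothetical and not part of JabRef; the placeholder engine only returns canned text so that the example runs without any OCR installation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class OcrEngineSketch {

    /** Hypothetical common contract that every OCR backend (local or cloud) would implement. */
    interface OcrEngine {
        /** Identifier used to select the engine, e.g. in the preferences. */
        String name();

        /** Returns the text recognized in one scanned PDF. */
        String extractText(Path scannedPdf) throws IOException;
    }

    /** Placeholder engine (assumption): returns canned text instead of running real OCR. */
    static final class PlaceholderEngine implements OcrEngine {
        @Override
        public String name() {
            return "placeholder";
        }

        @Override
        public String extractText(Path scannedPdf) {
            return "recognized text of " + scannedPdf.getFileName();
        }
    }

    /** Keeps the available engines by name so callers never depend on a concrete engine. */
    static final class OcrEngineRegistry {
        private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

        void register(OcrEngine engine) {
            engines.put(engine.name(), engine);
        }

        Optional<OcrEngine> find(String name) {
            return Optional.ofNullable(engines.get(name));
        }
    }

    /** Looks up an engine by name and runs it on the given file. */
    static String recognize(OcrEngineRegistry registry, String engineName, Path pdf) throws IOException {
        OcrEngine engine = registry.find(engineName)
                .orElseThrow(() -> new IllegalArgumentException("Unknown OCR engine: " + engineName));
        return engine.extractText(pdf);
    }

    /** Registers the placeholder engine and runs it on a file name that is never opened. */
    static String demo() {
        OcrEngineRegistry registry = new OcrEngineRegistry();
        registry.register(new PlaceholderEngine());
        try {
            return recognize(registry, "placeholder", Paths.get("scan.pdf"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a design along these lines, adding another engine would only mean providing one more `OcrEngine` implementation and registering it; the calling code stays unchanged.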
Expected outcome:
A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
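One possible way to approach outcome B is to layer expert overrides on top of a fixed set of defaults. The sketch below is only an illustration under assumptions: the setting keys (`language`, `dpi`, `deskew`) and their default values are made up for this example and do not come from JabRef or from any specific OCR engine.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

public class OcrSettingsSketch {

    /** Hypothetical defaults that would work for most users without any configuration. */
    static final Map<String, String> DEFAULTS = Map.of(
            "language", "eng",
            "dpi", "300",
            "deskew", "true");

    /**
     * Merges expert overrides over the defaults.
     * Unknown keys are rejected so that typos do not silently fall back to a default.
     */
    static Map<String, String> effectiveSettings(Map<String, String> expertOverrides) {
        for (String key : expertOverrides.keySet()) {
            if (!DEFAULTS.containsKey(key)) {
                throw new IllegalArgumentException("Unknown OCR setting: " + key);
            }
        }
        Map<String, String> merged = new TreeMap<>(DEFAULTS);
        merged.putAll(expertOverrides);
        return Collections.unmodifiableMap(merged);
    }

    public static void main(String[] args) {
        // A regular user changes nothing and gets the defaults.
        System.out.println(effectiveSettings(Map.of()));
        // An expert user raises the resolution for a low-quality scan.
        System.out.println(effectiveSettings(Map.of("dpi", "600")));
    }
}
```

Keeping the defaults in one place and validating override keys would let the preferences dialog expose only what an engine actually supports.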
Skills required:
- Proficiency in Java programming.
- A keen interest and curiosity in document processing and AI technologies.
Possible mentors:
@InAnYan, @calixtus, @subhramit, @Stewori
Project size:
90h (small)