GSoC meta issue: OCR Integration in JabRef #13267

**Is your suggestion for improvement related to a problem? Please describe.**
As a user, I want to extract text from PDF documents and use it for searching.

**Describe the solution you'd like**
A solution that extracts text via OCR and makes it searchable.

**Additional context**

https://jabref.github.io/GSoC/posts/ocr/

JabRef, a comprehensive literature management tool, currently handles both metadata and text-based PDF documents. However, scanned PDFs, particularly historical articles, are not text-searchable because their pages are stored as images. This project aims to close this gap by integrating OCR (Optical Character Recognition), enabling full-text search in scanned PDFs.

### Useful links:

- A Document AI Package: [deepdoctection](https://github.com/deepdoctection/deepdoctection)
- Hand-written text recognition in historical documents: [SimpleHTR#handwritten-text-recognition-with-tensorflow](https://github.com/githubharald/SimpleHTR#handwritten-text-recognition-with-tensorflow)
- Java OCR with Tesseract: [Baeldung Guide](https://www.baeldung.com/java-ocr-tesseract)
- Tesseract OCR Library: [Official Documentation](https://tesseract-ocr.github.io/)
- OCRmyPDF Installation and Usage: [GitHub Repository](https://github.com/ocrmypdf/OCRmyPDF#installation)
- ChatOCR and ChatGPT Integration: [Blog Article](https://www.blogmojo.de/chatgpt-plugin/chatocr/)
- AI-Powered OCR: [Addepto Blog](https://addepto.com/blog/ai-powered-ocr-optical-character-recognition-enhancing-accuracy-and-efficiency-in-document-analysis/)
- Tika OCR Integration: [Apache Tika Wiki](https://cwiki.apache.org/confluence/display/tika/tikaocr)
- Surya, an AI-powered OCR toolkit that claims to be better than Tesseract but is written in Python: [VikParuchuri/Surya](https://github.com/VikParuchuri/surya)
- State-of-the-art (as of October 2025) language model for OCR: [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL); supported by llama.cpp with [PR 16701](https://github.com/ggml-org/llama.cpp/pull/16701)
- Docling (Parsing and document pre-processing): [GitHub](https://github.com/docling-project/docling)

### Some aspects:

- Add an option to call an OCR engine from JabRef, e.g., a cloud-based service or a local installation
- Define a common interface to support multiple OCR engines
- Provide a good default set of settings for the OCR engines
- Support expert configuration of the settings
- Add the extracted text as a layer to the PDF so that Apache Lucene can parse it
- Add an option to further process the text with Grobid or a language model for metadata extraction
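To make the "common interface" and "default settings plus expert configuration" points more concrete, here is a minimal sketch in plain Java (records, so Java 16 or newer). Every name in it (`OcrEngine`, `OcrSettings`, `OcrEngineRegistry`, `FakeOcrEngine`) is hypothetical and not part of JabRef; the default values (`eng`, 300 dpi) are assumptions chosen for illustration only, and the fake engine performs no real OCR.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical settings holder: defaults plus free-form expert options. */
record OcrSettings(List<String> languages, int dpi, Map<String, String> expertOptions) {
    OcrSettings {
        // Defensive copies keep the settings immutable.
        languages = List.copyOf(languages);
        expertOptions = Map.copyOf(expertOptions);
    }

    /** Assumed defaults for illustration; real defaults would need tuning per engine. */
    static OcrSettings defaults() {
        return new OcrSettings(List.of("eng"), 300, Map.of());
    }

    /** Returns a copy with one additional engine-specific option set. */
    OcrSettings withExpertOption(String key, String value) {
        Map<String, String> merged = new LinkedHashMap<>(expertOptions);
        merged.put(key, value);
        return new OcrSettings(languages, dpi, merged);
    }
}

/** Hypothetical common interface that each engine adapter would implement. */
interface OcrEngine {
    String name();

    /** Whether the engine can be used right now (binary installed, API key set, ...). */
    boolean isAvailable();

    /** Returns the recognized text; implementations may throw an unchecked exception on failure. */
    String extractText(Path pdf, OcrSettings settings);
}

/** Keeps engines in registration order and picks the first usable one. */
final class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> firstAvailable() {
        return engines.values().stream().filter(OcrEngine::isAvailable).findFirst();
    }
}

/** Stand-in engine that only exercises the interface; it performs no real OCR. */
final class FakeOcrEngine implements OcrEngine {
    private final String name;
    private final boolean available;

    FakeOcrEngine(String name, boolean available) {
        this.name = name;
        this.available = available;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean isAvailable() {
        return available;
    }

    @Override
    public String extractText(Path pdf, OcrSettings settings) {
        return "[" + name + "] " + pdf.getFileName() + " @" + settings.dpi() + "dpi " + settings.languages();
    }
}
```

A cloud-based or locally installed engine would each get its own adapter behind `OcrEngine`, while expert users could pass engine-specific keys through `withExpertOption` without the interface having to know about them.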

### Expected outcome:

A) Develop a common interface within JabRef to accommodate multiple OCR engines, ensuring flexibility and expandability.
B) Enable expert users to fine-tune OCR settings, catering to specific needs or document formats.
C) Incorporate the OCR-extracted text as a searchable layer in PDFs, allowing Apache Lucene to index and search the content.
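One possible route to outcome C with a local install is to delegate to the OCRmyPDF command-line tool from the links above, which writes a copy of the PDF with a text layer added. The sketch below only assembles and launches that command; the class name `OcrMyPdfCommand` and its methods are hypothetical, it assumes an `ocrmypdf` executable on the `PATH`, and it uses just the `-l` (language) and `--skip-text` (leave pages that already contain text untouched) options. How the resulting file would be fed to JabRef's indexer is left open.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that shells out to a locally installed OCRmyPDF. */
final class OcrMyPdfCommand {
    private OcrMyPdfCommand() {
    }

    /** Builds the argument list for: ocrmypdf [-l langs] [--skip-text] input output. */
    static List<String> build(Path input, Path output, List<String> languages, boolean skipPagesWithText) {
        List<String> command = new ArrayList<>();
        command.add("ocrmypdf"); // assumption: the executable is on the PATH
        if (!languages.isEmpty()) {
            command.add("-l");
            // Multiple languages are joined with '+', e.g. eng+deu.
            command.add(String.join("+", languages));
        }
        if (skipPagesWithText) {
            command.add("--skip-text");
        }
        command.add(input.toString());
        command.add(output.toString());
        return List.copyOf(command);
    }

    /** Runs the command, forwarding its console output; an exit code of 0 means success. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Building the argument list separately from running it keeps the command testable without OCRmyPDF installed, and it gives an expert-settings dialog a single place to append further options.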

### Skills required:

- Proficiency in Java programming.
- A keen interest in document processing and AI technologies.

### Possible mentors:

[@InAnYan](https://github.com/InAnYan/), [@calixtus](https://github.com/calixtus), [@subhramit](https://github.com/subhramit), [@Stewori](https://github.com/Stewori)

### Project size:

90h (small)
