Replace PDFContentImporter by another library #169

As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or as plain text on the title page (author, title, DOI, ...; possibly included BibTeX, see https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.

Currently, JabRef uses a self-written importer (`PDFContentImporter`). It works acceptably for LNCS and IEEE papers, but not for other publishers.

## Solution Sketch

GROBID is already integrated into JabRef and should be used. Apache Tika should be checked, too.

Steps:

1. Check if the first PDF page contains `@article` or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage). If yes -> use that. Stop. Else continue.
2. Check if a `.bib` file is embedded in the PDF (created by https://ctan.org/pkg/authorarchive?lang=de). If yes -> use that. Stop. Else continue.
3. Check if XMP data is available. If yes -> use that. Stop. Else continue.
4. Look for DOI in the first page. If present -> use that. Stop. Else continue.
5. Use Apache Tika/GROBID to extract the metadata from the PDF. Use that data.
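
The steps above form a "first match wins" chain. The following is a minimal sketch of that control flow, not JabRef's actual implementation: `PdfSnapshot`, `ImportResult`, and `PdfImportChain` are hypothetical names, the PDF contents are assumed to be already extracted into plain strings, the two regular expressions are deliberately simplified, and the GROBID/Tika call is represented only by a `fallback` function supplied by the caller.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for data already read from a PDF; null means "not present". */
final class PdfSnapshot {
    final String firstPageText;
    final String embeddedBib;
    final String xmpMetadata;

    PdfSnapshot(String firstPageText, String embeddedBib, String xmpMetadata) {
        this.firstPageText = firstPageText == null ? "" : firstPageText;
        this.embeddedBib = embeddedBib;
        this.xmpMetadata = xmpMetadata;
    }
}

/** Hypothetical result: which step produced the data, plus the raw payload. */
final class ImportResult {
    final String source;
    final String payload;

    ImportResult(String source, String payload) {
        this.source = source;
        this.payload = payload;
    }
}

final class PdfImportChain {
    // Simplified patterns for illustration; real BibTeX and DOI syntax is broader.
    private static final Pattern BIBTEX_ENTRY = Pattern.compile("@\\w+\\s*\\{");
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    // Step 1: BibTeX printed on the first page.
    static Optional<ImportResult> bibtexOnFirstPage(PdfSnapshot pdf) {
        Matcher m = BIBTEX_ENTRY.matcher(pdf.firstPageText);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new ImportResult("first-page BibTeX", pdf.firstPageText.substring(m.start())));
    }

    // Step 2: a .bib file embedded in the PDF.
    static Optional<ImportResult> embeddedBib(PdfSnapshot pdf) {
        return Optional.ofNullable(pdf.embeddedBib).map(bib -> new ImportResult("embedded .bib", bib));
    }

    // Step 3: XMP metadata.
    static Optional<ImportResult> xmp(PdfSnapshot pdf) {
        return Optional.ofNullable(pdf.xmpMetadata).map(x -> new ImportResult("XMP", x));
    }

    // Step 4: a DOI on the first page; trailing sentence punctuation is stripped.
    static Optional<ImportResult> doiOnFirstPage(PdfSnapshot pdf) {
        Matcher m = DOI.matcher(pdf.firstPageText);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new ImportResult("DOI", m.group().replaceAll("[.,;)]+$", "")));
    }

    /** Runs steps 1-4 in order and stops at the first hit; step 5 is the caller-supplied fallback. */
    static ImportResult importPdf(PdfSnapshot pdf, Function<PdfSnapshot, ImportResult> fallback) {
        List<Function<PdfSnapshot, Optional<ImportResult>>> steps = List.of(
                PdfImportChain::bibtexOnFirstPage,
                PdfImportChain::embeddedBib,
                PdfImportChain::xmp,
                PdfImportChain::doiOnFirstPage);
        for (Function<PdfSnapshot, Optional<ImportResult>> step : steps) {
            Optional<ImportResult> result = step.apply(pdf);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return fallback.apply(pdf);
    }
}
```

Keeping the steps in an ordered list makes the priority explicit and lets each step be tested against its own PDF fixture.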

## Improvement possibility

Offer a merge dialog that combines the results of the different options (e.g., XMP + PDF scraping via GROBID).
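
As a rough sketch of what such a dialog would need as input, the code below collects the distinct values each source proposes per field and reports the fields where the sources disagree. `CandidateMerger` and the plain `Map<String, String>` representation of an entry are assumptions made for illustration; JabRef's own entry and merge classes are not used here.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CandidateMerger {
    /** Collects, per field, the distinct (trimmed) values proposed by the sources, in source order. */
    static Map<String, Set<String>> candidates(List<Map<String, String>> sources) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map<String, String> fields : sources) {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                result.computeIfAbsent(field.getKey(), k -> new LinkedHashSet<>())
                      .add(field.getValue().trim());
            }
        }
        return result;
    }

    /** Fields with more than one distinct value; only these need a decision from the user. */
    static Set<String> conflicts(Map<String, Set<String>> candidates) {
        Set<String> conflicting = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> candidate : candidates.entrySet()) {
            if (candidate.getValue().size() > 1) {
                conflicting.add(candidate.getKey());
            }
        }
        return conflicting;
    }
}
```

Fields on which all sources agree could be accepted silently, so the dialog only has to show the conflicting ones.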

## Challenges

- Cover different cases (BibTeX text on the first page, `.bib` embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...))
- Good test cases
  - Create test PDFs
 
## Side notes

Check the current drag'n'drop behavior. In 3.8.2, the user was asked whether they want to create a new entry or link the PDF.

In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.

Refs https://github.com/JabRef/jabref/pull/7209
