As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as text on the title page (author, title, DOI, ..., possibly included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.
Currently, JabRef uses a self-written extraction routine. It works acceptably for LNCS and IEEE papers, but not for other publishers.
Solution Sketch
GROBID is already integrated and should be used for this. Apache Tika should be checked, too.
Steps:
- If the first PDF page contains @article or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage), use that.
- Check if a .bib file is embedded in the PDF (created by https://ctan.org/pkg/authorarchive?lang=de). If yes, use this one.
- Check if XMP data is available. If yes -> use that. Stop. Else continue.
- Look for DOI in the first page. If present -> use that. Stop. Else continue.
- Use Apache Tika/GROBID to extract metadata from the PDF. Use that data.
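The steps above amount to a fallback chain: try each source in order and stop at the first one that yields data. Below is a minimal Java sketch of that control flow, with the two text-based checks filled in. All names (`PdfMetadataChain`, `findBibtex`, `findDoi`, `firstHit`) are hypothetical and not part of JabRef's API. The embedded .bib, XMP, and GROBID/Tika steps are only stubbed as suppliers, and both regular expressions are simple heuristics, not complete grammars.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfMetadataChain {

    // Heuristic: a BibTeX entry starts with "@type{", e.g. "@article{".
    private static final Pattern BIBTEX_START = Pattern.compile("@[A-Za-z]+\\s*\\{");

    // Heuristic DOI pattern: "10.", 4-9 digits, "/", then a non-whitespace suffix.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/\\S+");

    /** Returns the first-page text from the first "@type{" onward, if any. */
    static Optional<String> findBibtex(String firstPageText) {
        Matcher m = BIBTEX_START.matcher(firstPageText);
        return m.find() ? Optional.of(firstPageText.substring(m.start())) : Optional.empty();
    }

    /** Returns the first DOI-like token, with trailing punctuation stripped. */
    static Optional<String> findDoi(String firstPageText) {
        Matcher m = DOI.matcher(firstPageText);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group().replaceAll("[.,;)\\]]+$", ""));
    }

    /** Tries each source in order and stops at the first non-empty result. */
    static <T> Optional<T> firstHit(List<Supplier<Optional<T>>> sources) {
        for (Supplier<Optional<T>> source : sources) {
            Optional<T> result = source.get();
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "A Study of Things\nJane Doe\nhttps://doi.org/10.1000/xyz123.\nAbstract ...";
        List<Supplier<Optional<String>>> steps = List.of(
            () -> findBibtex(page),      // BibTeX text on the first page
            () -> Optional.empty(),      // embedded .bib and XMP (stubbed here)
            () -> findDoi(page),         // DOI on the first page
            () -> Optional.of("grobid")  // GROBID/Tika fallback (stubbed here)
        );
        System.out.println(firstHit(steps).orElse("nothing found"));
    }
}
```

Using suppliers keeps the expensive steps (such as a GROBID call) from running when an earlier, cheaper step already succeeded.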
Improvement possibility
Offer a merge dialog for the results of the different options (e.g., XMP + PDF scraping via GROBID).
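One way to feed such a dialog is to gather, per field, the distinct values the sources offer and flag the fields where they disagree. The following Java sketch illustrates that idea only. `CandidateMerge`, `collectValues`, and `conflicts` are hypothetical names, entries are simplified to plain field/value maps with non-null values, and values are compared as exact strings after trimming.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CandidateMerge {

    /** Collects, per field, the distinct non-empty values in the order the sources are given. */
    static Map<String, List<String>> collectValues(List<Map<String, String>> sources) {
        Map<String, List<String>> byField = new LinkedHashMap<>();
        for (Map<String, String> source : sources) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                List<String> values = byField.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                String value = e.getValue().trim();
                if (!value.isEmpty() && !values.contains(value)) {
                    values.add(value);
                }
            }
        }
        return byField;
    }

    /** Fields for which the sources disagree, i.e. where the user would have to choose. */
    static List<String> conflicts(Map<String, List<String>> byField) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byField.entrySet()) {
            if (e.getValue().size() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fromXmp = Map.of("title", "A Study of Things", "year", "2020");
        Map<String, String> fromGrobid = Map.of("title", "A study of things", "author", "Doe, Jane");
        Map<String, List<String>> merged = collectValues(List.of(fromXmp, fromGrobid));
        // Only fields with more than one distinct value need a decision in the dialog.
        System.out.println("Needs user decision: " + conflicts(merged));
    }
}
```

Fields with a single value could be accepted without asking, so the dialog only has to show the conflicting ones.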
Challenges
- Cover different cases (BibTeX text on the first page, .bib embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...))
- Good test cases
Side notes
Check current drag'n'drop behavior. In 3.8.2, the user was asked whether (s)he wants to create a new entry or link the PDF.
In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.
Refs JabRef#7209