
general paragraph text wrongly recognized as "figDesc/div/p" #1077

@sawyerzheng


I am using the official Docker image, pulled with docker pull lfoppiano/grobid:0.8.0.

v0.7.3 was also tested.

  • What is your Java version (java --version)?

I only used the official Docker image: lfoppiano/grobid

  • In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.

I do not have this log file, since I am running GROBID in Docker.


Problem

  1. General paragraph text that does not belong to a figure is wrongly recognized as a figDesc.
  2. Part of the text wrongly recognized as figDesc also appears in the general paragraph text, "body/div/p".
    • This means it is repeated in two parts of the TEI XML: "body/figure/figDesc/div/p" and "body/div/p".
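
To make the duplication easier to check on other documents, here is a minimal sketch in Python that scans a TEI result for figDesc text that also occurs in a regular body paragraph. It is only an illustration of the symptom described above, not part of GROBID: the function name `find_duplicated_figdesc_text`, the `min_chars` threshold, and the plain substring comparison are assumptions of this sketch, and it assumes the usual TEI namespace `http://www.tei-c.org/ns/1.0` with figures placed under `body`.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def normalize(text):
    """Collapse whitespace so that line breaks do not hide a match."""
    return " ".join(text.split())


def element_text(elem):
    """Return all text inside an element, including nested children."""
    return normalize("".join(elem.itertext()))


def find_duplicated_figdesc_text(tei_xml, min_chars=40):
    """Return (figure id, text) pairs whose figDesc text also occurs in body/div/p.

    Hypothetical helper: the overlap test is a simple substring check,
    and min_chars is an arbitrary cut-off to skip very short captions.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []

    # Regular paragraphs: body/div/p (this path does not reach into figDesc).
    body_paragraphs = [
        element_text(p) for p in body.findall("./tei:div/tei:p", TEI_NS)
    ]

    hits = []
    for figure in body.findall(".//tei:figure", TEI_NS):
        for fig_desc in figure.findall("./tei:figDesc", TEI_NS):
            # Use figDesc/div/p paragraphs when present, else the whole figDesc.
            paragraphs = fig_desc.findall(".//tei:p", TEI_NS) or [fig_desc]
            for chunk in map(element_text, paragraphs):
                if len(chunk) < min_chars:
                    continue
                duplicated = any(
                    chunk in para
                    or (len(para) >= min_chars and para in chunk)
                    for para in body_paragraphs
                )
                if duplicated:
                    hits.append((figure.get(XML_ID), chunk))
    return hits
```

For a result file saved to disk, the text can be read with `open(path, encoding="utf-8").read()` and passed to the function; any returned pair points at a figDesc whose text is repeated in "body/div/p".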

Original PDF area

[image]

Extracted XML

[image]

Reference materials

PDF used

176_liu2010.pdf

Resulting TEI XML

Note: GitHub does not accept .xml files, so I changed the suffix to .txt.

176_liu2010.pdf.tei.xml.txt

Labels

bug, error cases, implemented, models:fulltext
