Extracting from pdfs #1279

@efuae

Hello, I am using Grobid for my project, which works with PDF drug labels. I have noticed two problems when a PDF is extracted into XML:

  1. It often does not extract the text that comes right after an image.
  2. It sometimes merges a new section header into the preceding one. For example, after extracting section 12.3, it extracts the header of section 12.4 as a continuation of the 12.3 header.
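
A minimal sketch for locating the second problem in an extracted TEI file, offered as a diagnostic aid rather than a fix. It assumes the section number may appear either in the `n` attribute of a `<head>` element or inside the head text, and it flags any head that carries more than one distinct section number. The function name `find_merged_heads`, the regular expression, and the `SAMPLE` document are all hypothetical illustrations, not actual Grobid output, and the pattern can produce false positives (for example a header that mentions a table number).

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumption: section numbers look like "12.3" (digits, a dot, digits).
SECTION_NUMBER = re.compile(r"(?<![\d.])\d+\.\d+(?![\d.])")


def find_merged_heads(tei_xml: str) -> list[str]:
    """Return the text of every <head> that carries more than one section number."""
    root = ET.fromstring(tei_xml)
    suspicious = []
    for head in root.iterfind(".//tei:head", TEI_NS):
        # Collapse whitespace so multi-line heads compare cleanly.
        text = " ".join("".join(head.itertext()).split())
        numbers = SECTION_NUMBER.findall(text)
        if head.get("n"):
            numbers.insert(0, head.get("n"))
        # Two distinct numbers in one head suggest two headers were merged.
        if len(set(numbers)) > 1:
            suspicious.append(text)
    return suspicious


# Hypothetical TEI fragment written by hand to illustrate the symptom.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head n="12.2">Pharmacodynamics</head><p>...</p></div>
<div><head n="12.3">Pharmacokinetics 12.4 Microbiology</head><p>...</p></div>
</body></text></TEI>"""

if __name__ == "__main__":
    for merged in find_merged_heads(SAMPLE):
        print(merged)
```

Running the same function over the TEI produced for a real drug label would give a list of candidate headers to attach to this report.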

Could this be looked at please?
