Whitespace issues in extract_text() #42

@SharmileeS

I am not able to read the extracted text: formatting and spaces are not preserved during extraction. For example:

PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.

Is it true that PyPDF2 is not format-aware, as claimed here: http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/ ?
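
For context on why words can run together like this: a PDF usually stores text as positioned runs of glyphs and often contains no explicit space characters, so an extractor has to infer spaces from the horizontal gaps between runs. The snippet below is only a minimal sketch of that idea and is not PyPDF2's actual implementation. The `Fragment` tuple, `join_fragments`, `gap_threshold`, and the coordinates are all hypothetical names and values made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment: (x_start, x_end, text) for one run of glyphs on a line.
Fragment = Tuple[float, float, str]


def join_fragments(fragments: List[Fragment], gap_threshold: float = 1.0) -> str:
    """Join positioned fragments, inserting a space where the horizontal gap is large."""
    parts: List[str] = []
    prev_end: Optional[float] = None
    # Process fragments left to right by their starting x coordinate.
    for x_start, x_end, text in sorted(fragments, key=lambda f: f[0]):
        # A gap wider than the threshold is treated as a word boundary.
        if prev_end is not None and x_start - prev_end > gap_threshold:
            parts.append(" ")
        parts.append(text)
        prev_end = x_end
    return "".join(parts)


# Made-up coordinates: each word is separated by a gap of 3.0 units.
line = [
    (0.0, 52.0, "Preemptive"),
    (55.0, 112.0, "Information"),
    (115.0, 165.0, "Extraction"),
]

print(join_fragments(line, gap_threshold=1.0))   # gaps exceed the threshold, spaces inserted
print(join_fragments(line, gap_threshold=10.0))  # threshold too large, words run together
```

With a threshold that is too large for the document's font size and spacing, the second call reproduces the kind of output shown above, which is why getting whitespace right is sensitive to how the gap heuristic is tuned.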

Metadata

    Labels

    is-feature: A feature request
    whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow