Closed
Labels
is-feature: A feature request
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
I am not able to read the extracted text: formatting and spaces are not preserved during extraction. Sample output:
PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.
Is it true that PyPDF2 is not format-aware, as stated here? http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/
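For context, one common reason for run-together output like the sample above is that a PDF page stores positioned runs of glyphs and need not contain explicit space characters, so an extractor has to infer spaces from the horizontal gaps between runs. The sketch below illustrates that idea only. It is not PyPDF2's implementation; the `Fragment` type, the `join_fragments` helper, the `gap_ratio` threshold, and all coordinates are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text with a position, as a layout-aware extractor might see it (hypothetical type)."""
    text: str
    x: float      # left edge of the run, in page units
    width: float  # total advance width of the run


def join_fragments(fragments, gap_ratio=0.25):
    """Join fragments left to right, inserting a space when the gap is wide enough.

    gap_ratio is an assumed heuristic: a space is inserted when the gap
    exceeds this fraction of the previous fragment's average character width.
    """
    out = []
    prev = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev is not None:
            gap = frag.x - (prev.x + prev.width)
            avg_char = prev.width / max(len(prev.text), 1)
            if gap > gap_ratio * avg_char:
                out.append(" ")
        out.append(frag.text)
        prev = frag
    return "".join(out)


# Made-up coordinates for the first three words of the sample output.
fragments = [
    Fragment("Preemptive", 0.0, 50.0),
    Fragment("Information", 53.0, 55.0),
    Fragment("Extraction", 111.0, 50.0),
]

# Position-unaware concatenation reproduces the run-together text.
naive = "".join(f.text for f in fragments)

# Gap-aware joining recovers the word boundaries.
spaced = join_fragments(fragments)

print(naive)
print(spaced)
```

With these made-up positions, the position-unaware join yields `PreemptiveInformationExtraction`, while the gap-aware join yields `Preemptive Information Extraction`. Real documents are harder, since the right threshold depends on font, kerning, and justification, which is part of why getting the right number of whitespaces is hard.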