.extractText() reads / as 1. #789

@dakotabartell

I'm trying to automate sorting PDFs by the date printed on each PDF. However, the issue I keep running into is that the slashes in the dates are consistently read as 1s. That wouldn't be a problem about 90% of the time, but unfortunately a lot of January and November dates come out identical:

1/11/2022
11/1/2022

Both end up as 111112022
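
The collision can be reproduced without a PDF. Below is a minimal sketch, assuming dates are printed as M/D/YYYY without zero padding and that every slash is extracted as a 1. The helpers `garble` and `candidate_dates` are hypothetical names for illustration and are not part of PyPDF2.

```python
from datetime import date


def garble(d: date) -> str:
    """Render a date as M/D/YYYY, then replace each '/' with '1' (the misread)."""
    return f"{d.month}/{d.day}/{d.year}".replace("/", "1")


def candidate_dates(garbled: str, year: int) -> list[date]:
    """Return every date in `year` whose garbled form equals `garbled`."""
    matches = []
    for month in range(1, 13):
        for day in range(1, 32):
            try:
                d = date(year, month, day)
            except ValueError:
                continue  # skip impossible dates such as 2/30
            if garble(d) == garbled:
                matches.append(d)
    return matches


# Both January 11 and November 1 match, so the string alone cannot be decoded.
print(candidate_dates("111112022", 2022))
```

A string that maps back to exactly one date is safe to sort on; one that maps back to several would need another source of truth.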

I tried to get new PDFs issued in a zero-padded format such as 01/11/2022, but they aren't able to do that. Is there a way to fix this?

from PyPDF2 import PdfReader

reader = PdfReader("TestPackingSlip637860440227283947.pdf")
print(f"Total pages= {len(reader.pages)}")

for i, page in enumerate(reader.pages, start=1):
    print(f"Page: {i}")
    print(page.extract_text())

The info on the PDF I'm uploading is randomized and does not represent anyone's real info.

TestPackingSlip637860440227283947.pdf

Labels: Has MCVE, is-bug, workflow-text-extraction
