Euro sign not being recognized by extractText #443

@disimone

Hi,

I am using PyPDF2 to extract text from a PDF file, and I am having problems with the Euro sign.

This is what the PDF looks like:
image

Copying and pasting from Acrobat Reader correctly gives back the Euro sign.
Extracting with pdftotext also yields the correct character:

image

PyPDF2, however, recognises it as a bullet (U+2022):

image

Is there anything I can do to fix this? I cannot find any encoding options to tweak in extractText.

Thanks for your help,

Andrea.
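
One possible explanation for the symptom above, offered as an assumption and not as a confirmed diagnosis of this file: the byte 0x80 is the Euro sign in Windows code page 1252 (which WinAnsiEncoding closely follows) but the bullet in PDFDocEncoding, so decoding the byte with the wrong table would produce exactly this substitution. The sketch below illustrates the mismatch and a heuristic post-processing workaround. The names `PDFDOC_HIGH` and `bullet_to_euro` are hypothetical helpers written for this example and are not part of PyPDF2.

```python
import re

# The same byte means different things in two encodings a PDF may use.
raw = b"\x80"

# Windows code page 1252 maps 0x80 to the Euro sign.
as_winansi = raw.decode("cp1252")

# PDFDocEncoding maps 0x80 to the bullet. Python ships no codec for it,
# so this one-entry table is a hand-written stand-in (hypothetical).
PDFDOC_HIGH = {0x80: "\u2022"}
as_pdfdoc = "".join(PDFDOC_HIGH.get(b, chr(b)) for b in raw)

# Heuristic workaround (assumption: a bullet next to a digit was meant
# to be a Euro sign). It will also rewrite genuine bullets that directly
# precede a number, so check the output before relying on it.
_BULLET_NEAR_DIGIT = re.compile(
    r"\u2022(?=\s?\d)"   # bullet followed by an optional space and a digit
    r"|(?<=\d)\u2022"    # bullet directly after a digit
    r"|(?<=\d\s)\u2022"  # bullet after a digit and one whitespace character
)


def bullet_to_euro(text: str) -> str:
    """Replace bullets adjacent to digits with the Euro sign."""
    return _BULLET_NEAR_DIGIT.sub("\u20ac", text)
```

The helper would be applied to the string returned by extractText. It only masks the problem in the extracted text and does not change how the PDF's fonts are decoded.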

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow