Cannot extract_text from weasyprint generated PDF #242

@mattjmorrison-imt

Generating a PDF with the following code produces a document for which extract_text returns no text.

"""
PyPDF2==2.1.0
WeasyPrint==55.0
"""

from io import BytesIO
from PyPDF2 import PdfReader

# Create example
from weasyprint import HTML
stream = BytesIO()
HTML(string="""
<html>
<body>
<div>Hello World</div>
</body>
</html>
""").write_pdf(stream)
stream.seek(0)

# Try to read "Hello World"
reader = PdfReader(stream)
print(reader.pages[0].extract_text())

In Kozea/WeasyPrint/issues/290, @liZe points out that other tools are able to extract the text from such PDFs.
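
To help narrow down whether the text is actually present in the generated file, here is a rough, standard-library-only sketch that counts the PDF text-showing operators (Tj and TJ) in a file's streams. The helper name count_text_operators is hypothetical and not part of PyPDF2 or WeasyPrint. It assumes streams are either uncompressed or Flate-compressed, and it is a crude heuristic rather than a real PDF parser, so binary streams such as embedded fonts could produce stray matches.

```python
import re
import zlib

# Matches the body between "stream" and "endstream" in raw PDF bytes.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Matches Tj/TJ when they directly follow a string or array operand.
TEXT_OP_RE = re.compile(rb"[)>\]]\s*T[jJ]\b")


def count_text_operators(pdf_bytes: bytes) -> int:
    """Count Tj/TJ operators in all streams (hypothetical diagnostic helper)."""
    total = 0
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # Assumption: compressed streams use FlateDecode (zlib).
            # decompressobj tolerates trailing bytes such as the final newline.
            body = zlib.decompressobj().decompress(body)
        except zlib.error:
            # Not zlib data: scan the raw bytes instead.
            pass
        total += len(TEXT_OP_RE.findall(body))
    return total
```

With the example above, this could be called as count_text_operators(stream.getvalue()). A nonzero count would suggest that the page does contain text-showing operators, which points toward the extraction step rather than the generated file, although it does not prove it.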

Labels: Has MCVE, workflow-text-extraction
