Read PDF changed from text to random symbols #654

@mathieuboudreau

Hi there,

I've been using this script regularly on PDFs for work:

import PyPDF2
from pathlib import Path
import re
import os

search_terms = ["DATA AVAILABILITY STATEMENT",
                "open source",
                "open-source",
                "opensource",
                "open science",
                "github",
                " git ",
                "osf",
                "jupyter",
                "notebook",
                "octave",
                "available online",
                "released",
                "shared",
                " code ",
                "numerical phantom",
                "bitbucket",
                "sourceforge",
                "xnat",
                "reproducible research",
                "julia",
                "image set",
                "image sets",
                "raw k-space data",
                "SHA-1",
                "gitlab",
                "Docker",
                "container",
                "MyBinder",
                "Binder",
                "mrhub",
                "MR-Hub",
                "codeocean",
                "Code Ocean"]

folder_path = '.'

for filename in sorted(os.listdir(Path(folder_path))):
    if filename.endswith(".pdf"):
        # open the pdf file (joined with folder_path so it works from any working directory)
        reader = PyPDF2.PdfFileReader(str(Path(folder_path) / filename))

        # get number of pages
        num_pages = reader.getNumPages()

        # extract the text of each page once, rather than once per keyword
        page_texts = [reader.getPage(page_index).extractText()
                      for page_index in range(num_pages)]

        found_keywords = []
        # search through keywords
        for keyword in search_terms:
            # re.escape makes the keyword match literally, not as a regex pattern
            pattern = re.escape(keyword)

            if any(re.search(pattern, page_text) for page_text in page_texts):
                found_keywords.append(keyword)
        
        if found_keywords:
            print(filename + " contains " + str(found_keywords))

Until a few months ago, the page text of every PDF I used was correctly read as text (e.g. the PDF downloadable here: https://onlinelibrary.wiley.com/doi/10.1002/mrm.28965).

However, the pages of more recent PDFs (like this one: https://onlinelibrary.wiley.com/doi/10.1002/mrm.29078) are now being read as random symbols, like this (generated by adding print(page_text)):

ƒ
˙−
ˇˇ

ƒ
ˇ˘

−

ˇ˙
ˇ˝−
˜˚
˜ˇ

˜˜


˜˘−

“

“–‡”
“‹
“‹
⁄”


©



−
©

“‹
ƒ
−
ƒ

“‹

−



“‹

ƒ
⁄‡‡”


‹

⁄‡
−
“‹



Žƒ

So clearly, my keyword detection isn't working anymore.
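
One way to make this failure visible, instead of the script silently reporting no keywords, is to flag pages whose extracted text is mostly symbols. The following is only a minimal sketch, assuming English-language papers whose real text is mostly ASCII letters and digits; `looks_garbled` and the 0.5 threshold are hypothetical choices for illustration, not part of PyPDF2:

```python
def looks_garbled(page_text, min_alnum_ratio=0.5):
    """Return True when the text is non-empty but mostly not ASCII letters/digits.

    Assumption: genuine page text is predominantly ASCII alphanumeric.
    The threshold is an arbitrary starting point and may need tuning.
    """
    visible = [ch for ch in page_text if not ch.isspace()]
    if not visible:
        # A page with no extracted text is not treated as garbled here.
        return False
    ascii_alnum = sum(1 for ch in visible if ch.isascii() and ch.isalnum())
    return ascii_alnum / len(visible) < min_alnum_ratio


# Symbols like the ones printed above versus an ordinary sentence.
garbled_sample = "ƒ\n˙−\nˇˇ\n“–‡”\n“‹\n⁄”"
normal_sample = "The code is available on github and was released as open source."
print(looks_garbled(garbled_sample), looks_garbled(normal_sample))
```

Calling this on each `page_text` inside the loop would let the script print a warning for affected files rather than skipping them quietly.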

I can't seem to find a difference in the PDF files (they are both Adobe InDesign 15.1 (Windows), Adobe PDF Library 15.0; modified using iText 4.2.0 by 1T3XT). Any clue on how to resolve this for the newer PDFs I'm using?
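
The producer metadata does not describe how each page's fonts are encoded, which is one place where two otherwise similar PDFs can differ. Whether that is the cause here is an assumption, not something confirmed for these two files. A rough sketch for comparing font entries side by side follows; `summarize_fonts` is a hypothetical helper, and the two example dictionaries are made-up values, not taken from the papers above. With PyPDF2, the font dictionaries would typically be read from a page's `/Resources` and `/Font` entries after resolving indirect references, which is left out here:

```python
def summarize_fonts(fonts):
    """Return a sorted list of (name, subtype, encoding, has_tounicode) tuples.

    `fonts` is assumed to map font resource names to already-resolved
    font dictionaries; plain dicts are used below for illustration.
    """
    summary = []
    for name, font in fonts.items():
        summary.append((
            str(name),
            str(font.get("/Subtype")),
            str(font.get("/Encoding")),
            "/ToUnicode" in font,  # a ToUnicode map links glyph codes to characters
        ))
    return sorted(summary)


# Made-up font entries standing in for an "old" and a "new" PDF.
old_fonts = {"/F1": {"/Subtype": "/Type1", "/Encoding": "/WinAnsiEncoding"}}
new_fonts = {"/F1": {"/Subtype": "/Type0", "/Encoding": "/Identity-H",
                     "/ToUnicode": "stream"}}

for label, fonts in (("old", old_fonts), ("new", new_fonts)):
    print(label, summarize_fonts(fonts))
```

If the summaries differ between a PDF that extracts correctly and one that does not, that would at least narrow down where to look.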


Labels: PdfReader (The PdfReader component is affected); is-bug (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF); workflow-text-extraction (From a users perspective, text extraction is the affected feature/workflow)
