Read PDF changed from text to random symbols

Hi there,

I've been using this script regularly on PDFs for work:

```python
import PyPDF2
from pathlib import Path
import re
import os

search_terms = ["DATA AVAILABILITY STATEMENT",
                "open source",
                "open-source",
                "opensource",
                "open science",
                "github",
                " git ",
                "osf",
                "jupyter",
                "notebook",
                "octave",
                "available online",
                "released",
                "shared",
                " code ",
                "numerical phantom",
                "bitbucket",
                "sourceforge",
                "xnat",
                "reproducible research",
                "julia",
                "image set",
                "image sets",
                "raw k-space data",
                "SHA-1",
                "gitlab",
                "Docker",
                "container",
                "MyBinder",
                "Binder",
                "mrhub",
                "MR-Hub",
                "codeocean",
                "Code Ocean"]

folder_path = '.'

for filename in sorted(os.listdir(Path(folder_path))):
    if filename.endswith(".pdf"):
        # open the pdf file (named `reader` to avoid shadowing the built-in `object`)
        reader = PyPDF2.PdfFileReader(filename)

        # get number of pages
        num_pages = reader.getNumPages()

        found_keywords = []
        # search through keywords
        for keyword in search_terms:
            # extract text and do the search
            for page_index in range(0, num_pages):
                page_obj = reader.getPage(page_index)
                page_text = page_obj.extractText()

                search_result = re.search(keyword, page_text)

                if search_result is not None:
                    found_keywords.append(keyword)
                    break
        
        if found_keywords:
            print(filename + " contains " + str(found_keywords))
```

Until a few months ago, the page text of every PDF I processed was extracted correctly as text (e.g. the PDF downloadable here: https://onlinelibrary.wiley.com/doi/10.1002/mrm.28965).

However, for recent PDFs (like this one: https://onlinelibrary.wiley.com/doi/10.1002/mrm.29078) the pages are now extracted as random symbols, like this (output generated by adding `print(page_text)`):

```
ƒ
˙−
ˇˇ

ƒ
ˇ˘

−

ˇ˙
ˇ˝−
˜˚
˜ˇ

˜˜


˜˘−

“

“–‡”
“‹
“‹
⁄”


©



−
©

“‹
ƒ
−
ƒ

“‹

−



“‹

ƒ
⁄‡‡”


‹

⁄‡
−
“‹



Žƒ
```

So clearly, my keyword detection isn't working anymore.

I can't seem to find a difference in the PDF files (they are both Adobe InDesign 15.1 (Windows), Adobe PDF Library 15.0; modified using iText 4.2.0 by 1T3XT). Any clue on how to resolve this for the newer PDFs I'm using?
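
As a stopgap, a check along these lines might at least flag which files are affected, so they are not silently reported as containing no keywords. This is only a rough sketch: `looks_garbled` and the 0.5 threshold are hypothetical choices of mine, not part of PyPDF2, and the heuristic assumes the papers are mostly English text.

```python
def looks_garbled(page_text: str, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether extracted text is symbol soup rather than readable prose.

    Hypothetical helper: compares the share of ASCII letters/digits among
    the non-whitespace characters against an arbitrary threshold.
    """
    visible = [ch for ch in page_text if not ch.isspace()]
    if not visible:
        # An empty page gives no evidence either way, so it is not flagged.
        return False
    alnum_count = sum(1 for ch in visible if ch.isascii() and ch.isalnum())
    return alnum_count / len(visible) < min_alnum_ratio
```

In the loop above, this could be called on `page_text` right after extraction, printing a warning with `filename` whenever it returns `True`.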
