Closed
Labels
PdfReader: The PdfReader component is affected
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
Hi there,
I've been using this script regularly on PDFs for work:
import PyPDF2
from pathlib import Path
import re
import os
search_terms = ["DATA AVAILABILITY STATEMENT",
"open source",
"open-source",
"opensource",
"open science",
"github",
" git ",
"osf",
"jupyter",
"notebook",
"octave",
"available online",
"released",
"shared",
" code ",
"numerical phantom",
"bitbucket",
"sourceforge",
"xnat",
"reproducible research",
"julia",
"image set",
"image sets",
"raw k-space data",
"SHA-1",
"gitlab",
"Docker",
"container",
"MyBinder",
"Binder",
"mrhub",
"MR-Hub",
"codeocean",
"Code Ocean"]
folder_path = '.'
for filename in sorted(os.listdir(Path(folder_path))):
    if filename.endswith(".pdf"):
        # open the pdf file
        object = PyPDF2.PdfFileReader(filename)
        # get number of pages
        num_pages = object.getNumPages()
        found_keywords = []
        # search through keywords
        for keyword in search_terms:
            # extract text and do the search
            for page_index in range(0, num_pages):
                page_obj = object.getPage(page_index)
                page_text = page_obj.extractText()
                search_result = re.search(keyword, page_text)
                if search_result is not None:
                    found_keywords.append(keyword)
                    break
        if found_keywords:
            print(filename + " contains " + str(found_keywords))

Until a few months ago, the page text of every PDF I processed was extracted correctly as text (e.g. the PDF downloadable here: https://onlinelibrary.wiley.com/doi/10.1002/mrm.28965).
However, the pages of recent PDFs (like this one: https://onlinelibrary.wiley.com/doi/10.1002/mrm.29078) are now extracted as random symbols, like this (output generated by adding print(page_text)):
ƒ
˙−
ˇˇ
ƒ
ˇ˘
−
ˇ˙
ˇ˝−
˜˚
˜ˇ
˜˜
˜˘−
“
“–‡”
“‹
“‹
⁄”
©
−
©
“‹
ƒ
−
ƒ
“‹
−
“‹
ƒ
⁄‡‡”
‹
⁄‡
−
“‹
Žƒ
So clearly, my keyword detection isn't working anymore.
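Because the script only prints files that match a keyword, a PDF whose pages extract as symbols is silently skipped. One way to surface such files might be a rough readability check on each page's text before searching it. The sketch below is only a heuristic, not part of PyPDF2: the helper names `letter_ratio` and `looks_garbled` and the 0.5 threshold are assumptions made for illustration.

```python
def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # No visible characters at all: nothing readable on this page.
        return 0.0
    plain = sum(1 for ch in visible if ch.isascii() and ch.isalnum())
    return plain / len(visible)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text in which fewer than `threshold` of the visible
    characters are plain ASCII letters or digits. The threshold is an
    arbitrary assumption and may need tuning; empty text is flagged too."""
    return letter_ratio(text) < threshold
```

In the loop above, calling `looks_garbled(page_text)` right after the text is extracted and printing a warning with the filename would at least distinguish "no keywords found" from "text could not be read". It does not fix the extraction itself.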
I can't find any difference between the two PDF files: both report Adobe InDesign 15.1 (Windows) and Adobe PDF Library 15.0, modified using iText 4.2.0 by 1T3XT. Any clue on how to resolve this for the newer PDFs I'm using?