Closed
Description
See the description below. This looks like a bug to me. I can work around it with the following edit to the function compute_space_width in _cmap.py: on line 19 of the code below (line 412 of the original file), change st = w[0] to st = w[0] if isinstance(w[0], int) else w[0].get_object(). Since I am not at all familiar with the low-level details of the PDF format, I am not sure whether this is really a bug or whether my fix makes sense:
def compute_space_width(
    ft: DictionaryObject, space_code: int, space_width: float
) -> float:
    sp_width: float = space_width * 2.0  # default value
    w = []
    w1 = {}
    st: int = 0
    if "/DescendantFonts" in ft:  # ft["/Subtype"].startswith("/CIDFontType"):
        ft1 = ft["/DescendantFonts"][0].get_object()  # type: ignore
        try:
            w1[-1] = cast(float, ft1["/DW"])
        except Exception:
            w1[-1] = 1000.0
        if "/W" in ft1:
            w = list(ft1["/W"])
        else:
            w = []
        while len(w) > 0:
            # st = w[0]
            # above commented out line is the original, below is my edit:
            st = w[0] if isinstance(w[0], int) else w[0].get_object()
            second = w[1].get_object()
            if isinstance(second, int):
                for x in range(st, second):
                    w1[x] = w[2]
                w = w[3:]
            elif isinstance(second, list):
                for y in second:
                    w1[st] = y
                    st += 1
                w = w[2:]
            else:
                logger_warning(
                    "unknown widths : \n" + (ft1["/W"]).__repr__(),
                    __name__,
                )
                break
        try:
            sp_width = w1[space_code]
        except Exception:
            sp_width = (
                w1[-1] / 2.0
            )  # if using default we consider space will be only half size
    elif "/Widths" in ft:
        w = list(ft["/Widths"])  # type: ignore
        try:
            st = cast(int, ft["/FirstChar"])
            en: int = cast(int, ft["/LastChar"])
            if st > space_code or en < space_code:
                raise Exception("Not in range")
            if w[space_code - st] == 0:
                raise Exception("null width")
            sp_width = w[space_code - st]
        except Exception:
            if "/FontDescriptor" in ft and "/MissingWidth" in cast(
                DictionaryObject, ft["/FontDescriptor"]
            ):
                sp_width = ft["/FontDescriptor"]["/MissingWidth"]  # type: ignore
            else:
                # will consider width of char as avg(width)/2
                m = 0
                cpt = 0
                for x in w:
                    if x > 0:
                        m += x
                        cpt += 1
                sp_width = m / max(1, cpt) / 2
    return sp_width
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-148-generic-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
import pypdf

f_path = "data/Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf"
with open(f_path, "rb") as pdf_file_obj:
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
    print(p)
The pdf file:
Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "/export/home/***/try_parse_pdf.py", line 12, in <module>
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 89, in build_char_map_from_dict
    sp_width = compute_space_width(ft, sp, space_width)
  File "/export/home/cuda00042/***/***/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 415, in compute_space_width
    for x in range(st, second):
TypeError: 'IndirectObject' object cannot be interpreted as an integer
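The TypeError above arises because range() needs real integers, while the first element of the widths array here is a reference that has not been resolved yet. The sketch below reproduces that failure mode and the resolve-before-use pattern from my edit without needing pypdf or the PDF file. FakeIndirect, resolve and demo are hypothetical names made up for illustration; FakeIndirect is only a stand-in for pypdf's IndirectObject, and I am assuming that anything needing resolution exposes a get_object() method, as the code in compute_space_width suggests.

```python
from typing import Any, List


class FakeIndirect:
    """Hypothetical stand-in for pypdf's IndirectObject (not the real class)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        # Return the object this reference points to.
        return self._target


def resolve(obj: Any) -> Any:
    """Return obj unchanged, or its target if it exposes get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def demo(w: List[Any]) -> List[int]:
    # Resolve both bounds before handing them to range(); the proposed
    # edit does the equivalent for w[0].
    st = resolve(w[0])
    second = resolve(w[1])
    return list(range(st, second))


# A widths-like list whose first two entries are references, not ints.
w = [FakeIndirect(3), FakeIndirect(6), 500]

try:
    range(w[0], w[1])  # mirrors the failing call in the traceback
    failed = False
except TypeError:
    failed = True

codes = demo(w)
print(failed, codes)
```

The same helper could in principle be applied to the width value w[2] and to the entries of a nested list, which my one-line edit does not touch; I have not checked whether those can also be indirect in this PDF.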