IndexError: index out of range when encountering a digital certificate/signature #1245

@Bryan-Fagan

Description

I'll start by saying I'm very new to Python and PyPDF2. I'm trying to collect all of the form fields in a PDF into a dataframe. Eventually I want to process thousands of PDFs that all have the same structure (form) as the baseline and place their fields into the dataframe. This code works great on a PDF without a digital certificate/signature. However, when I run it on a PDF with a digital certificate/signature, I get an error.

I don't really need the digital signature/certificate part of the document, so I think the easiest approach is to just skip that field of the PDF. However, I don't know how to do that, since the PyPDF2 package looks at every field.

I was able to get around the error with try/except, but then nothing was captured from the PDF (i.e. the result was blank).

Environment

Plotly Dash Workspace

$ python -m platform
# TODO: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-debian-buster-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
# TODO: 2.10.0

Code + PDF

import os

import pandas as pd
import PyPDF2 as pypdf

directory = 'files'
df = pd.DataFrame()  # accumulates one row of field data per PDF

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        print(f)
        pdf = pypdf.PdfFileReader(f, strict=False)
        print(pdf)
        #information = pdf.getFormTextFields()
        information = pdf.getFields()  # raises IndexError on the signed PDF
        print(information)
        output = pd.DataFrame([information])
        df = pd.concat([df, output], ignore_index=True)

I'll have to play around with the PDF to see if I can post it, since it contains PII.

Traceback

Traceback (most recent call last):
  File "/workspace/app.py", line 77, in <module>
    information = pdf.getFields()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 526, in getFields
    return self.get_fields(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 510, in get_fields
    self._build_field(field, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 535, in _build_field
    self._check_kids(field, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 499, in get_fields
    self._check_kids(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 503, in get_fields
    self._build_field(tree, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 547, in _build_field
    retval[key] = Field(field)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 1626, in __init__
    self[NameObject(attr)] = data[attr]
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 679, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 251, in get_object
    obj = self.pdf.get_object(self)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1167, in get_object
    retval, indirect_reference.idnum, indirect_reference.generation
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 741, in decrypt_object
    return cf.decrypt_object(obj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
    obj[i] = self.decrypt_object(obj[i])
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 176, in decrypt_object
    data = self.strCrypt.decrypt(obj.original_bytes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 88, in decrypt
    return d[: -d[-1]]
IndexError: index out of range
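
For reference, here is a minimal sketch of how the expression in the last frame (`d[: -d[-1]]`) can raise this error. It assumes the decrypted buffer `d` is empty, which I have not confirmed for this PDF. The function names `strip_padding` and `strip_padding_guarded` are made up for illustration and are not part of PyPDF2.

```python
def strip_padding(d: bytes) -> bytes:
    """Same expression as the failing frame: drop as many trailing
    bytes as the value of the last byte indicates."""
    return d[: -d[-1]]


def strip_padding_guarded(d: bytes) -> bytes:
    """Hypothetical guarded variant: leave empty input untouched
    instead of indexing into it."""
    if not d:
        return d
    return d[: -d[-1]]


# Well-formed input: five data bytes followed by three padding bytes of value 3.
padded = b"hello" + bytes([3, 3, 3])
print(strip_padding(padded))

# Empty input: d[-1] has nothing to index, so an IndexError is raised.
try:
    strip_padding(b"")
except IndexError as exc:
    print("IndexError:", exc)

# The guarded variant returns the empty buffer unchanged.
print(strip_padding_guarded(b""))
```

If an empty value inside the signature dictionary is what triggers the error, a guard of this kind would avoid the crash, but that is only a guess based on the traceback.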

TODO
I believe the best solution would be for the getFields() or getFormTextFields() methods to skip a field when they encounter a digital signature/certificate.
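
To make the idea concrete, here is a rough sketch of the skipping behaviour I have in mind. It does not use PyPDF2 at all: plain dicts stand in for the PDF field dictionaries, and `iter_non_signature_fields` is a hypothetical helper. The only PDF facts it relies on are the standard keys `/T` (field name), `/FT` (field type, with `/Sig` for signature fields), `/V` (value), and `/Kids` (child fields). The sample data is invented.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_non_signature_fields(
    node: Dict[str, Any], prefix: str = ""
) -> Iterator[Tuple[str, Any]]:
    """Walk a field tree and yield (qualified name, value) pairs,
    skipping signature fields without ever reading their value."""
    name = node.get("/T", "")
    full_name = f"{prefix}.{name}" if prefix and name else (prefix or name)

    # Signature fields have the type /Sig: skip them before touching /V.
    if node.get("/FT") == "/Sig":
        return

    kids = node.get("/Kids", [])
    if kids:
        for kid in kids:
            yield from iter_non_signature_fields(kid, full_name)
    elif name:
        yield full_name, node.get("/V")


# Invented stand-in for a form: two text fields and one signature field.
form = {
    "/Kids": [
        {"/T": "Name", "/FT": "/Tx", "/V": "Alice"},
        {"/T": "Signature1", "/FT": "/Sig", "/V": {"/Contents": b""}},
        {
            "/T": "Address",
            "/Kids": [{"/T": "City", "/FT": "/Tx", "/V": "Springfield"}],
        },
    ]
}

fields = dict(iter_non_signature_fields(form))
print(fields)
```

The point of the sketch is the order of operations: the field type is checked first, so the signature's value is never looked at. Whether the same check is possible inside PyPDF2 before the signature object is decrypted is something I can't tell from the traceback.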
