I'm very new to Python and PyPDF2. I'm trying to collect all of the form fields in a PDF into a dataframe. Eventually I want to process thousands of PDFs that all share the same form structure as this baseline and load their fields into the dataframe. My code works well on a PDF without a digital certificate/signature. However, when I run it on a PDF that has a digital certificate/signature, I get an error.
I don't really need the digital signature/certificate part of the document, so I think the easiest approach is to skip that field. However, I don't know how to do that, since PyPDF2 looks at every field.
I was able to get around the error with try/except, but then nothing was captured from the PDF (i.e. the result was blank).
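For what it's worth, the blank result is probably because my try/except wrapped the whole getFields() call, so one bad field threw away everything. A minimal sketch of catching the error per field instead is below. The names read_each_safely and read_one are hypothetical helpers of mine, not part of PyPDF2, and read_one stands in for whatever code reads a single field.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def read_each_safely(
    names: Iterable[str],
    read_one: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Read every field on its own so one failure does not blank the rest.

    `read_one` is a hypothetical callable that returns the value of one field.
    Returns (values, errors), where errors maps a field name to the exception text.
    """
    values: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name in names:
        try:
            values[name] = read_one(name)
        except Exception as exc:  # broad on purpose: keep going and record the failure
            errors[name] = repr(exc)
    return values, errors
```

The errors dict shows which fields failed, so the signature field can be confirmed as the culprit instead of guessed at.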
I'll have to see whether I can post the PDF, as it contains PII.
Traceback (most recent call last):
File "/workspace/app.py", line 77, in <module>
information = pdf.getFields()
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 526, in getFields
return self.get_fields(tree, retval, fileobj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 510, in get_fields
self._build_field(field, retval, fileobj, field_attributes)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 535, in _build_field
self._check_kids(field, retval, fileobj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
self.get_fields(kid.get_object(), retval, fileobj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 499, in get_fields
self._check_kids(tree, retval, fileobj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
self.get_fields(kid.get_object(), retval, fileobj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 503, in get_fields
self._build_field(tree, retval, fileobj, field_attributes)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 547, in _build_field
retval[key] = Field(field)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 1626, in __init__
self[NameObject(attr)] = data[attr]
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 679, in __getitem__
return dict.__getitem__(self, key).get_object()
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 251, in get_object
obj = self.pdf.get_object(self)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1167, in get_object
retval, indirect_reference.idnum, indirect_reference.generation
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 741, in decrypt_object
return cf.decrypt_object(obj)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
obj[i] = self.decrypt_object(obj[i])
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 176, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 88, in decrypt
return d[: -d[-1]]
IndexError: index out of range
Environment
Plotly Dash Workspace
Code + PDF
I'll have to see whether I can post the PDF, as it contains PII.
Traceback
See the full traceback above; it ends in IndexError: index out of range.
I believe the best solution would be for getFields() or getFormFields() to skip a field when it encounters a digital signature/certificate.
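To make the idea concrete, here is a rough sketch of the behaviour I have in mind: walk the field tree by hand and skip any field whose /FT is /Sig before its /V is ever read. The function names are hypothetical, not PyPDF2 API. I am assuming (from the traceback) that the failure happens when the signature field's value is resolved and decrypted, and I have not been able to confirm this against the real file.

```python
from typing import Any, Dict, Iterable, Optional


def _resolve(obj: Any) -> Any:
    """Return obj.get_object() if the object has one (indirect reference), else obj."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def collect_field_values(
    nodes: Iterable[Any],
    prefix: str = "",
    inherited_type: Optional[str] = None,
    out: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Map qualified field names to /V values, skipping signature fields.

    Hypothetical helper: it only needs dict-like nodes with /T, /FT, /V, /Kids.
    """
    if out is None:
        out = {}
    for node in nodes:
        node = _resolve(node)
        # /FT may be inherited from the parent field.
        field_type = _resolve(node.get("/FT", inherited_type))
        if field_type == "/Sig":
            continue  # skip before the signature's /V or children are touched
        partial = _resolve(node.get("/T"))
        if partial is None:
            name = prefix
        else:
            name = f"{prefix}.{partial}" if prefix else str(partial)
        kids = [_resolve(kid) for kid in _resolve(node.get("/Kids", []))]
        # Kids without /T are widget annotations, not sub-fields.
        subfields = [kid for kid in kids if "/T" in kid]
        if subfields:
            collect_field_values(subfields, name, field_type, out)
        elif partial is not None:
            out[name] = _resolve(node.get("/V"))
    return out
```

With PyPDF2 the starting list would presumably be reader.trailer["/Root"]["/AcroForm"]["/Fields"], assuming the document has an AcroForm. The resulting dict has one entry per field, so one dict per PDF could become one dataframe row.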