Garbage output when parsing law PDF files #523

@sunn-e

I have been trying to extract text from some court judgment PDF files with PyPDF2, but extractText() returns garbage.

from PyPDF2 import PdfFileReader

with open('17343_2008_Order_09-Jan-2019.pdf', 'rb') as fd:
    pdf = PdfFileReader(fd)
    # Page indices are zero-based, so getPage(2) is the third page.
    p1 = pdf.getPage(2)
    print(p1.extractText())

The file:
17343_2008_Order_09-Jan-2019.pdf

Actual output:

$ !#--.&!* #* #(*!*$ !0!#/+(+)+))!(-!
%!3#%*$+$ !)#-$$ #$$ !
,(-,*!($+--.%%!*,($ !#3%,-.1$.%#1),!1*0!#%!
+)$ !6,!0$ #$,$0+.1*'!%!#&+(#'1!$+/%!&."!
$ #$$ !0!#/+(+)+))!(-!C#&&,0#&
#6#,1#'1!0#&'7#--.&!*$+
-+"",$$ !$ !%!1.-$#(-!+)$ !
$+#--!/$$ !/%!&!(-!+)$ !#--.&!*
#$$ !&/+$#(*+.%#--!/$#(-!+)$ !&#,*
),(*,(3-#(%!#&+(#'171!#*$+#-+(-1.&,+($ #$
$ !,(-,*!($ #*+--.%%!*0,$ +.$#(7/%!"!*,$#$,+(
+($ !/#%$+)$ !*!#$ $++9/1#-!
)+.%*#7&#)$!%$ !#11!3!*$ !&!
-,%-."&$#(-!&8,)+(!1!#%(!*2.*3!+)$ ,&
 #*$#9!($ !6,!0$ #$)+.%$ !F-!/$,+($+
,&#$$%#-$!*80!#%!+)$ !6,!0$ #$$ !
&#"!& +.1*-+""!(*$+.&)+%#--!/$#(-!,($ !
+)!F-!/$,+(
ˇ$+B
 +",-,*!,&(+$".%*!%,),$
,&-+"",$$!*0,$ +.$/%!"!*,$#$,+(,(
#&.**!(),3 $,($ ! !#$+)/#&&,+(
./+(#&.**!(<.#%%!1#(*0,$ +.$$ !
+))!(*!% #6,(3$#9!(.(*.!#*6#($#3!

Expected output:

IN THE SUPREME COURT OF INDIA
CRIMINAL APPELLATE JURISDICTION
CRIMINAL APPEAL  NO(S).  2094/2008
AJIT SINGH ...APPELLANT(S) VERSUS
THE STATE OF PUNJAB ...RESPONDENT(S)
ORDER 
1. The matter has been referred to this
Bench due to a difference of opinion between the
two learned judges of this Court who had heard the
appeal; one learned judge holding the offence to be
one under Section 304 Part I IPC and the second
learned judge holding the said offence to be one
covered by Section 302 IPC.
2. It appears that following the aforesaid
order the accused has been released from custody in
September, 2011 on the strength of a warrant of
release issued by the jurisdictional Sessions
Judge.  Though we fail to understand how the
accused could have been released, pending a
resolution of the difference of opinion between the
two learned judges of this Court, we are not
inclined to go into the said issue and instead deem
it appropriate to go into the core issue arising.

I tried different pages but the problem persists.
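To check every page instead of spot-checking a few, a rough heuristic can flag extracted text that is mostly symbols. This is only a sketch, assuming the expected text is ordinary English prose. The names `alnum_ratio` and `looks_garbled` and the 0.6 threshold are hypothetical choices for illustration and are not part of PyPDF2.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """Return True when the text is non-empty and mostly punctuation/symbols.

    The threshold is an assumed cut-off, not a calibrated value.
    """
    if not text.strip():
        return False  # an empty page is not treated as garbled
    return alnum_ratio(text) < threshold


# Sample lines copied from the actual and expected output in this report.
garbled_line = "$ !#--.&!* #* #(*!*$ !0!#/+(+)+))!(-!"
readable_line = "CRIMINAL APPEAL  NO(S).  2094/2008"

print(looks_garbled(garbled_line))
print(looks_garbled(readable_line))
```

The same check could be run on the string returned by `extractText()` for each page to see whether any page of the file extracts cleanly.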

Labels: Has MCVE, is-bug, workflow-text-extraction
