Fetch Text from Image & PDF Using Selenium Java | Devstringx

In this blog, we will learn how we can fetch data from images and PDFs.

This Blog Contains:

  • Read Text From Image Using OCR with Tesseract (tess4j)
  • Reading PDF Text Using PDFUtil
  • Save PDF as Image Using PDFUtil
  • Extract Images From PDF Using PDFUtil

Fetch Text From Image In Selenium

To read text from an image in Selenium, we use Optical Character Recognition (OCR) with Tesseract, through its Java wrapper Tess4J (tess4j). Tesseract supports Unicode (UTF-8) text.

  • First, create a folder named "tesseract" in your project and put the trained data in that folder. You can find trained data for any supported language at the URL below:

https://github.com/tesseract-ocr/tessdata

For English, download eng.traineddata and put it into the tesseract folder of your project.
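
A missing or misplaced trained data file is a common reason for OCR to fail at run time, so it can help to check the folder before running a test. The sketch below is only an illustration: the class name TessdataCheck and its methods are hypothetical helpers, not part of tess4j, and it assumes the usual <language>.traineddata file naming.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of tess4j): checks a trained-data folder. */
final class TessdataCheck {

    private TessdataCheck() {}

    /** Builds the expected file path, e.g. tesseract/eng.traineddata. */
    static Path trainedDataFile(Path dataFolder, String language) {
        return dataFolder.resolve(language + ".traineddata");
    }

    /** Returns true when the trained-data file exists and is not empty. */
    static boolean hasTrainedData(Path dataFolder, String language) {
        Path file = trainedDataFile(dataFolder, language);
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            // Treat an unreadable file the same as a missing one
            return false;
        }
    }

    public static void main(String[] args) {
        // "tesseract" is the folder name used in this post
        Path folder = Paths.get("tesseract");
        if (hasTrainedData(folder, "eng")) {
            System.out.println("eng.traineddata found");
        } else {
            System.out.println("eng.traineddata is missing in " + folder.toAbsolutePath());
        }
    }
}
```

Calling hasTrainedData at the start of a test gives a clear message about the setup instead of an OCR error later on.
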

  • Add the Maven dependency below for Tesseract (tess4j):
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>
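  • Below is the Java code to fetch text from the image:
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

ITesseract image = new Tesseract();
image.setDatapath("Location for TessData Folder");
image.setLanguage("eng");
// doOCR throws TesseractException if recognition fails
String str1 = image.doOCR(new File("Location Of Image"));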
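
OCR output usually contains line breaks and extra spaces, so comparing it directly with an expected string in a test can fail even when the text was read correctly. The sketch below shows one possible way to normalize the text first. The class OcrText and its methods are hypothetical helpers written for this post, not part of tess4j or Selenium.

```java
import java.util.Locale;

/** Hypothetical helper: cleans up OCR output before it is compared in a test. */
final class OcrText {

    private OcrText() {}

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space and trims. */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("\\s+", " ").trim();
    }

    /** Checks whether the OCR output contains the expected phrase, ignoring case and layout. */
    static boolean containsIgnoringLayout(String ocrOutput, String expected) {
        String haystack = normalize(ocrOutput).toLowerCase(Locale.ROOT);
        String needle = normalize(expected).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }
}
```

For example, OcrText.containsIgnoringLayout(str1, "Order Confirmed") could be used in an assertion on the text returned by doOCR. This only handles whitespace and letter case; it does not correct characters that the OCR engine misread.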

Fetch Text From PDF

  • Add the Maven dependency below for PDFUtil:
<dependency>
    <groupId>com.testautomationguru.pdfutil</groupId>
    <artifactId>pdf-util</artifactId>
    <version>0.0.3</version>
</dependency>
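  • Below Java code is used to read text from a PDF:
import com.testautomationguru.utility.PDFUtil;

String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// getText can throw IOException if the file cannot be read
String text = pdfUtil.getText(pdfLocation);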
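
In a Selenium test, the PDF is often a file the browser has just downloaded, and it may be incomplete or may actually be an HTML error page saved with a .pdf name. A quick check before reading it is to look for the "%PDF-" marker that a PDF file starts with. The sketch below is an assumption about how such a check could look; the class PdfFileCheck is a hypothetical helper and not part of PDFUtil.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of PDFUtil): basic sanity check for a PDF file. */
final class PdfFileCheck {

    private PdfFileCheck() {}

    /** Returns true when the file exists and its first bytes are "%PDF-". */
    static boolean looksLikePdf(Path file) {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        try (InputStream in = Files.newInputStream(file)) {
            // Read until the header buffer is full or the file ends
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == magic.length && Arrays.equals(head, magic);
        } catch (IOException e) {
            // An unreadable file is treated as "not a PDF"
            return false;
        }
    }
}
```

A call such as PdfFileCheck.looksLikePdf(Paths.get(pdfLocation)) before getText makes a bad download easier to spot. It only inspects the header, so it does not prove that the rest of the file is valid.
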
  • Below Java code is used to save a PDF as an image:
String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Images are written to the folder set here
pdfUtil.setImageDestinationPath(folderLocation);
pdfUtil.savePdfAsImage(pdfLocation);
  • Below Java code is used to extract images from a PDF:
String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Extracted images are written to the folder set here
pdfUtil.setImageDestinationPath(folderLocation);
pdfUtil.extractImages(pdfLocation);
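
Once the images are in the destination folder, a test usually needs to find them again, for example to pass each one to Tesseract. The sketch below lists the image files in a folder using only the standard library. The class ImageFolder is a hypothetical helper, and the file extensions it accepts (.png, .jpg, .jpeg) are an assumption about what the folder contains.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds image files in the folder where images were saved. */
final class ImageFolder {

    private ImageFolder() {}

    /** Returns the image files directly inside the folder, sorted by path. */
    static List<Path> listImages(Path folder) {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(ImageFolder::hasImageExtension)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Extension check; the accepted extensions are an assumption. */
    private static boolean hasImageExtension(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".png") || name.endsWith(".jpg") || name.endsWith(".jpeg");
    }
}
```

Each returned path can be turned into a File with toFile() and passed to doOCR, which combines the PDF and OCR parts of this post.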
