Extract text from PDF file with Java using Selenium WebDriver

15 April, 2014
We may need to verify PDF content while testing. WebDriver has no direct methods for reading the content of a PDF file, but we can use the Apache PDFBox API in our tests to extract it.

Download and Configure
We need to download the JAR file and add it to the Eclipse classpath (the project's build path) before running the test that extracts content from the PDF file. The latest release at the time of writing this article was pdfbox-app-1.8.4.jar.
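A quick way to confirm that the JAR really ended up on the classpath is to try loading one of its classes by name. The snippet below is only a minimal sketch: the class name ClasspathCheck and the method isOnClasspath are hypothetical helpers invented for illustration, not part of PDFBox or Selenium.

```java
// Hypothetical helper (not part of PDFBox or Selenium): reports whether a class can be loaded.
final class ClasspathCheck {
    private ClasspathCheck() {}

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

For example, calling ClasspathCheck.isOnClasspath("org.apache.pdfbox.pdfparser.PDFParser") before the test runs should return false when the PDFBox JAR has not been added to the build path.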

Below is the sample script which extracts text from the PDF file located at http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.TimeUnit;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class ReadPdfFile {
  WebDriver driver;

  @BeforeTest
  public void setUpDriver() {
    driver = new FirefoxDriver();
  }

  @Test
  public void start() throws IOException {
    driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    // Open the PDF in the browser, then read the same URL directly as a stream.
    driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");
    URL url = new URL(driver.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
    try {
      // parse() reads the stream and populates the COSDocument object,
      // which is the in-memory representation of the PDF document.
      PDFParser parser = new PDFParser(fileToParse);
      parser.parse();

      // getPDDocument() returns the parsed document. Call close() on it
      // when you are done to release resources.
      PDDocument document = parser.getPDDocument();
      try {
        // PDFTextStripper extracts all of the text and ignores the formatting.
        String output = new PDFTextStripper().getText(document);
        System.out.println(output);
      } finally {
        document.close();
      }
    } finally {
      fileToParse.close();
    }
  }

  @AfterTest
  public void tearDown() {
    driver.quit();
  }
}

The output of the above code is:

EarthBox a Day Giveaway 
EarthBox wanted to engage their Facebook 
audience with an Earth Day promotion that would 
also increase their Facebook likes. They needed a 
simple solution that would allow them to create a 
sweepstakes application themselves. 

EarthBox utilized the Votigo 
platform to create a like-
gated sweepstakes. Utilizing a 
theme and uploading a custom graphic they 
were able to create a branded promotion. 

• 1 prize awarded each day for the entire Month of April  
• A grand prize given away on Earth Day  
• Daily winner announcements on Facebook 
• Promoted through email newsletter blast  
Results (4 weeks) 
• 6,550 entries 
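
Since the goal is to verify PDF content, the extracted text usually has to be compared against expected phrases. As the output above shows, sentences are broken across lines, so a plain contains() check on the raw string can miss a phrase that spans a line break. The following is a minimal sketch of one way to handle this, assuming that collapsing whitespace is acceptable for the comparison. The class PdfTextCheck and its methods are hypothetical helpers written for this illustration, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): whitespace-tolerant phrase checks on extracted text.
final class PdfTextCheck {
    private PdfTextCheck() {}

    /** Collapses every run of whitespace, including line breaks, into a single space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the expected phrases that do not occur in the extracted text. */
    static List<String> missingPhrases(String extractedText, String... expectedPhrases) {
        String haystack = normalize(extractedText);
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

In a TestNG test this could be used as Assert.assertTrue(PdfTextCheck.missingPhrases(output, "Earth Day promotion").isEmpty()). Note that this sketch does not rejoin words hyphenated across lines, such as "like-" followed by "gated" in the output above, so such phrases would need separate handling.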
