Pdftotext (Xpdf)

Pdftotext excels at extracting data from conventional tables, stumbles with more complex tasks

Program is a handy addition to any journalist's toolbox, provided you don't try anything fancy.

Overall:

Not suited for overly complex tasks, but speed, compatibility, open-source nature make it a must-have

Documentation:

Excellent documentation, installation options

Usability:

Navigating command line tricky for uninitiated; however, embedded instruction commands are helpful

Community:

No official forums; however, helpful journalist-friendly tutorials are plentiful

Performance:

Converts large PDFs in the blink of an eye; suited for conventionally formatted tables; multiple headers, embedded fonts cause issues

Product:

Pdftotext (Xpdf)

Company:

Glyph & Cog, LLC

Cost:

Free

Pdftotext does a great job of speedily converting PDF documents into delimiter-friendly text files, making it a valuable addition to every journalist's toolbox. However, the spartan functionality of this command-line software means it's ill-suited for documents with multiple headers and complex layouts.

For journalists with little or no coding experience, installing and running Pdftotext will probably be the most challenging aspect of using it. While Xpdf's site provides plenty of documentation and installation options (as well as a list of commonly experienced problems), it may initially stump journalists used to double-click installations. Thankfully, plenty of tutorials written by other journalists exist online; we recommend this excellent startup guide by The Edmonton Journal's Lucas Timmons or this IRE tip sheet from San Diego State University's Chris Milholland (IRE membership required).

Once you have Pdftotext up and running, converting PDFs to text files is incredibly simple. A list of options available at the command prompt itself provides quick reference for the different extraction choices, such as converting only specific pages. However, the one option that was absolutely crucial to our success with Pdftotext was "-layout," which preserves the original layout structure of the data. On PDFs structured like spreadsheets, this worked beautifully; the output text files were easily converted to spreadsheets using Excel's fixed-width import option.
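For readers who would rather script the conversion than type it each time, here is a minimal Python sketch of the kind of call described above. It assembles a pdftotext command with -layout and an optional page range (-f and -l are pdftotext's first-page and last-page options). The helper names and file names are made up for illustration, and the sketch assumes the pdftotext program is installed and on your PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, first_page=None,
                            last_page=None, layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the page's original physical layout
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, **options):
    """Run pdftotext; raises CalledProcessError if the program reports failure."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options),
                   check=True)


# Show the command that would be run for pages 2 through 5 of a made-up file.
print(" ".join(build_pdftotext_command("report.pdf", "report.txt",
                                       first_page=2, last_page=5)))
```

Calling convert("report.pdf", "report.txt") would then write the layout-preserving text file next to the PDF, assuming both paths are valid on your machine.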

But in cases where the PDFs contained multiple subheadings or used embedded fonts, Pdftotext returned text files that were seriously jumbled or, in the case of embedded fonts, complete gibberish.

It is worth noting that regardless of the output, Pdftotext was able to convert large PDFs (some of which contained upwards of 10,000 rows) in less than five seconds.

Pdftotext is by no means the be-all and end-all of PDF converters. However, its ability to quickly extract fairly accurate output files from PDFs with conventionally formatted spreadsheets is a huge, cost-effective boon for reporters covering all beats, and should be particularly useful for Freedom of Information/Access to Information-savvy journalists.

 
Version Tested:

3.03

Release Date:

August 15, 2011

OS Tested:

Microsoft Windows 7 x64

Open Sourced:

Yes

Demo Available:

No

Obsolete:

No

 

How Pdftotext (Xpdf) performed on our tests

Verdict:

Converts PDF to near-perfect text file in seconds; some post-conversion cleanup required

Pdftotext quickly, fairly accurately converts lined spreadsheet into text file

Pdftotext easily converts this list of former Madoff customers into a text file well suited to a fixed-width import in Excel. The program's -layout option, which produces an output file that mirrors the layout of the original, is key to this result.
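For those who prefer to skip Excel, the fixed-width step can be approximated in a few lines of Python. The sketch below is not how Excel does it; it simply assumes that columns in the -layout output are separated by at least two spaces on every line, and treats those shared gaps as column boundaries. The function names and the sample rows are invented for illustration and are not taken from the Madoff list.

```python
import csv
import io


def find_column_starts(lines):
    """Find offsets where a column begins after a gap of 2+ blank positions."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is "blank" only if every line has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    starts = []
    for i in range(width):
        if blank[i]:
            continue
        if not starts:
            starts.append(i)  # first column of text
        elif blank[i - 1] and blank[i - 2]:
            starts.append(i)  # text resuming after a wide shared gap
    return starts


def split_fixed_width(text):
    """Split layout-preserved text into rows of trimmed cell values."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    starts = find_column_starts(lines)
    ends = starts[1:] + [None]
    return [[line[s:e].strip() for s, e in zip(starts, ends)] for line in lines]


# Invented sample that mimics -layout output: padded, space-aligned columns.
records = [("Name", "City", "Amount"),
           ("John Smith", "New York", "1200"),
           ("Mary Jones", "Boca Raton", "950")]
sample = "\n".join(f"{a:<16}{b:<14}{c:>6}" for a, b, c in records)

rows = split_fixed_width(sample)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # CSV text that a spreadsheet can open
print(buffer.getvalue())
```

Requiring a gap of two or more spaces keeps values such as "New York" in one cell, but it will mis-split tables whose columns are only one space apart, so the result still deserves a visual check.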


Verdict:

Serious manual cleanup required; output text retains some initial formatting

Pdftotext partly converts database report, requires serious clean-up

Pdftotext stumbled with the report-style format of this list of housing violations in Washington, D.C., requiring significant manual cleanup before it can be used for analysis.


Verdict:

Jumbled formatting renders output useless for sorting

Multiple, separate subheadings in complex table stump Pdftotext

Pdftotext cannot accurately process the multiple headers and complicated formatting of this list of appointments made by the Clinton administration.


Verdict:

Data entries garbled beyond recognition; maintained document format

Pdftotext returns garbled characters due to embedded font

Pdftotext fails to preserve any of the data in this list of contributors to Arizona Gov. Jan Brewer's proposed border fence initiative, so it would be impossible to get any relevant information from the document using the program.
