Pdftotext does a great job of speedily converting PDF documents into delimiter-friendly text files, making it a valuable addition to every journalist's toolbox. However, the spartan functionality of this command-line software means it's ill-suited for documents with multiple headers and complex layouts.
For journalists with little or no coding experience, installing and running Pdftotext will probably be the most challenging aspect of using it. While Xpdf's site provides plenty of documentation and installation options (as well as a list of commonly experienced problems), it may initially stump journalists used to double-click installations. Thankfully, other journalists have written plenty of online tutorials; we recommend this excellent startup guide by The Edmonton Journal's Lucas Timmons or this IRE tip sheet from San Diego State University's Chris Milholland (IRE membership required).
Once you have Pdftotext up and running, converting PDFs to text files is incredibly simple. A list of options displayed at the command prompt itself provides a quick reference for the different extraction settings, such as extracting specific pages. However, the one option that was absolutely crucial to our success with Pdftotext was "-layout," which preserves the original layout structure of the data. On PDFs structured like spreadsheets, this worked beautifully; the output text files were easily converted to spreadsheets using Excel's fixed-width import option.
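For readers who would rather script the conversion than type it at the prompt each time, here is a minimal Python sketch of the workflow described above. It assumes Pdftotext is installed and on the system path, and it uses the program's "-layout" option along with its "-f" and "-l" options for the first and last page to extract. The function names and file names are our own hypothetical choices, not part of Pdftotext, and splitting on runs of two or more spaces is only a rough stand-in for Excel's fixed-width import.

```python
import re
import subprocess


def build_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Assemble a pdftotext command line that preserves the page layout."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    return cmd + [pdf_path, txt_path]


def convert(pdf_path, txt_path, first_page=None, last_page=None):
    """Run pdftotext; raises CalledProcessError if the program reports failure."""
    cmd = build_command(pdf_path, txt_path, first_page, last_page)
    subprocess.run(cmd, check=True)


def split_columns(line):
    """Split one layout-preserved line on runs of two or more whitespace characters.

    Assumption: columns are separated by at least two spaces and no cell
    is empty. Blank cells would shift later values to the left.
    """
    return re.split(r"\s{2,}", line.strip())


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    convert("customers.pdf", "customers.txt", first_page=1, last_page=5)
    with open("customers.txt", encoding="utf-8", errors="replace") as handle:
        rows = [split_columns(line) for line in handle if line.strip()]
    print(len(rows), "rows extracted")
```

The column split is deliberately simple. For tables with blank cells or single-space gaps between columns, a true fixed-width import (in Excel or elsewhere) that relies on character positions is likely to be more reliable.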
But in cases where the PDFs contained multiple subheadings or featured embedded text, Pdftotext returned text files that were seriously jumbled, or in the case of embedded text, complete gibberish.
It is worth noting that regardless of the output, Pdftotext was able to convert large PDFs (some of which contained upwards of 10,000 rows) in less than five seconds.
Pdftotext is by no means the be-all and end-all of PDF converters. However, its ability to quickly extract fairly accurate output files from PDFs with conventionally formatted spreadsheets is a huge, cost-effective boon for reporters covering all beats, and should be particularly useful for Freedom of Information/Access to Information-savvy journalists.
3.03
//August 15, 2011
//Free
//Yes
//No
//No
Pdftotext easily converts this list of former Madoff customers into a text file well suited to fixed-width import in Excel. The program's -layout option, which creates an output file that mirrors the format of the original, is key to this result.
Pdftotext stumbled with the report-style format of this list of housing violations in Washington, D.C., requiring significant manual cleanup before it can be used for analysis.
Pdftotext cannot accurately process the multiple headers and complicated formatting of this list of appointments made by the Clinton administration.
Because Pdftotext fails to preserve any data, it would be impossible to get any relevant information from this list of contributors for Arizona Gov. Jan Brewer's proposed border fence initiative using the program.