Document: Your Seat at the Table

Use OCR to make scanned-in memos searchable

Difficulty:

Few minor formatting challenges, such as italics

The Your Seat at the Table documents provide a treasure trove of information supplied to the Obama-Biden transition team by lobbyists, governmental organizations, citizens and other sources. All of these documents are available in PDF format, but not all of them appear as searchable text, which reporters would need if they were looking for specific information buried in reams of pages.

Download the following specific documents from the overall set and run them through OCR software to make the text searchable. All documents are typed and, aside from a few basic formatting challenges (italics, for example), should be relatively easy for most software to handle.

DESIRED OUTCOME: Samples from the output should match four pre-selected samples that have already been transcribed into text.

To test the accuracy of your output, first download the comparison sample file, where you'll find the locations of the selections in the documents above. Copy these selections from your results and paste them directly into a new text document. Then run "Compare Documents" (available in most word processors) using the sample file and your results file.

The OCR software's performance on this task should be judged based on how well the two documents match up.
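For readers who would rather script the comparison than use a word processor, the check can be approximated with Python's standard difflib module. This is only a sketch of one way to do it, not the method used in these tests: the function name compare_texts and the sample strings are made up for illustration, and comparing word by word (ignoring line breaks and spacing) is an assumption about what should count as a mismatch.

```python
import difflib


def compare_texts(expected: str, actual: str):
    """Compare two text samples word by word.

    Returns a (ratio, diff) pair: ratio is 1.0 for a perfect match and
    drops toward 0.0 as the texts diverge; diff lists the differing words.
    """
    # Splitting on whitespace means line wrapping and extra spaces
    # are not counted as OCR errors (an assumption of this sketch).
    expected_words = expected.split()
    actual_words = actual.split()

    matcher = difflib.SequenceMatcher(
        None, expected_words, actual_words, autojunk=False
    )
    diff = list(
        difflib.unified_diff(
            expected_words,
            actual_words,
            fromfile="sample",
            tofile="results",
            lineterm="",
        )
    )
    return matcher.ratio(), diff


if __name__ == "__main__":
    # Hypothetical strings standing in for a transcribed selection
    # and the same selection copied from OCR output.
    sample = "The transition team welcomes your ideas."
    results = "The transition tearn welcomes your ideas."

    ratio, diff = compare_texts(sample, results)
    print(f"Similarity: {ratio:.2%}")
    for line in diff:
        print(line)
```

In practice, the two strings would be read from the sample file and from a results file built as described above. A ratio of 1.0 means the words match exactly; anything lower points to the words listed in the diff.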

 

Test Results

Verdict:

Perfect text recognition; works fast

OmniPage error-free in recognizing scanned memo text

OmniPage had no trouble with these scanned memos from the Obama-Biden transition team, flawlessly recognizing the text despite a variety of different scan qualities.


Verdict:

Quick, nearly accurate conversion

FineReader makes scanned memos searchable with few errors

ABBYY FineReader 11 easily converted scanned memos submitted to the Obama-Biden transition team into searchable PDFs, getting the text mostly right.


Verdict:

Conversion handles variety of text styles with few errors

Scanned memos easy fodder for Acrobat's OCR

This collection of scanned-in memos from the Obama-Biden transition was no match for Adobe Acrobat, which recognized the text with only a few minor errors. It even handled italics well.


Verdict:

Effectively and accurately converts memos to text

Able2Extract converts scanned memos with few errors

Able2Extract takes these scanned-in memos and quickly and reliably makes them searchable.


Verdict:

Handles most documents easily; italics a problem

Aside from italics, DocumentCloud tackles scanned-in memos

DocumentCloud recognized the text in three of the four memos we tested from the Obama-Biden transition team with few problems. But the service did have a harder time with italicized text.


Verdict:

Flawlessly converted small documents; didn't accept larger files

Google Drive flawless with scanned-in memos, but only if small files

Due to its file size limit, Google Drive was only able to recognize text in half the memos we tested from the Obama administration's Your Seat at the Table site.

