Document: N.C. legislature's statements of economic interest

Use OCR to make form-based PDF searchable

Difficulty:

Form structure shouldn't cause problems; some scans sloppy

PDFs from the state ethics commission aren't much good to reporters unless they can search for the companies that might pose conflicts of interest for legislators. To do that, convert these scanned statements of economic interest into PDFs with searchable text.

It's only necessary to convert the documents from 2011.

Because the forms are consistently structured and scanned at decent quality, extracting text from them shouldn't be too hard. Some of the type is tiny, though, and some forms are sloppily scanned.

DESIRED OUTCOME: Successfully search for the following public companies to find which candidates and their families held stock in them in 2011 (include stock symbol in your search):

BB&T
Bill Faison, Linda Garrou, Michael Walters, William Brent Jackson, Diane Parfitt, Timothy Spear

Pfizer
William Brent Jackson, James Crawford Jr.

IBM
William Brent Jackson, Diane Parfitt, Charles McGrady, David Martin, Thomas Tillis
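
Once a tool has produced searchable text, the check described above can be sketched in a few lines of Python. This is only an illustration, not the method used in these tests: it assumes the OCR output has already been split into one text string per filer, and the ticker symbols (BBT, PFE, IBM) and the long-form IBM name are search variants supplied here as assumptions rather than taken from the forms. The names `SEARCH_TERMS`, `build_pattern` and `find_filers` are hypothetical.

```python
import re

# Assumed search variants: company name plus ticker symbol.
# The tickers are not listed in the task description; they are supplied here.
SEARCH_TERMS = {
    "BB&T": ["BB&T", "BBT"],
    "Pfizer": ["Pfizer", "PFE"],
    "IBM": ["IBM", "International Business Machines"],
}


def build_pattern(variants):
    """Compile one case-insensitive regex matching any variant as a whole token."""
    alternatives = "|".join(re.escape(v) for v in variants)
    # Lookarounds rather than \b, so a name containing "&" is handled uniformly
    # and "IBMX" or "Pfeiffer" do not count as hits.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def find_filers(text_by_filer, search_terms=SEARCH_TERMS):
    """Return {company: sorted list of filers whose OCR text mentions it}."""
    results = {}
    for company, variants in search_terms.items():
        pattern = build_pattern(variants)
        results[company] = sorted(
            filer for filer, text in text_by_filer.items() if pattern.search(text)
        )
    return results


if __name__ == "__main__":
    # Made-up sample text standing in for OCR output; not real filings.
    sample = {
        "Filer A": "Stock: BB&T Corp (BBT), 500 shares",
        "Filer B": "Pfizer Inc. (PFE); IBM",
        "Filer C": "No holdings to report",
    }
    print(find_filers(sample))
```

Searching for both the name and the symbol matters with OCR output, because a misread character in one variant ("BB8T", for instance) can still leave the other intact.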

Test Results

Verdict:

Text captured accurately

FineReader accurately detects text on scanned forms

FineReader made quick work of these reports from lawmakers, which could reveal conflicts of interest. The resulting PDFs made it easy to find the search terms.

Verdict:

Recognized about half of desired search results

DocumentCloud spotty on recognizing text in form-based PDF

It took two tries to upload these disclosure forms from North Carolina legislators to DocumentCloud, and the service correctly recognized only half of the terms we were looking for in our tests.
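
A figure such as "half of the terms" can be made concrete by scoring a tool's search hits against the expected results listed under DESIRED OUTCOME. The sketch below is one possible way to do that, not necessarily how this test was scored: it counts each company and filer pair as one target, which is an assumption. The function name `hit_rate` is hypothetical; the `EXPECTED` names are copied from the list earlier in this document.

```python
# Expected holders per company, as listed under DESIRED OUTCOME.
EXPECTED = {
    "BB&T": [
        "Bill Faison", "Linda Garrou", "Michael Walters",
        "William Brent Jackson", "Diane Parfitt", "Timothy Spear",
    ],
    "Pfizer": ["William Brent Jackson", "James Crawford Jr."],
    "IBM": [
        "William Brent Jackson", "Diane Parfitt", "Charles McGrady",
        "David Martin", "Thomas Tillis",
    ],
}


def hit_rate(expected, found):
    """Fraction of expected (company, filer) pairs that a search recovered."""
    expected_pairs = {(c, f) for c, filers in expected.items() for f in filers}
    found_pairs = {(c, f) for c, filers in found.items() for f in filers}
    if not expected_pairs:
        return 0.0
    # Hits outside the expected list (false positives) are ignored here.
    return len(expected_pairs & found_pairs) / len(expected_pairs)
```

Passing `EXPECTED` as both arguments returns 1.0, and an empty result returns 0.0. False positives would need a separate count, since this measure only looks at how many expected pairs were found.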

Verdict:

Slow and imperfect, but mostly effective

Form-based PDF provides a challenge for Able2Extract

Able2Extract took a very long time to process this large PDF of nearly 1,700 pages of scanned forms, some with handwritten responses. After nearly four hours, it produced a legible, search-ready document, but it missed a few terms.

Verdict:

Translated most text, tripped up on some symbols

Acrobat makes scanned forms searchable with inaccuracies

Acrobat took its time wading through this 1,600-page collection of forms from North Carolina legislators. Its attempt to make the text of the poorly scanned document searchable was not fully accurate, but it shows the program is a good first step when trying to locate keywords in lengthy PDFs.

Verdict:

Converted text in most search targets accurately

Drive quickly recognizes text in scanned forms

Google Drive handled this test with relative ease, uploading and recognizing text in 174 political candidate disclosure forms in about 40 minutes.

Verdict:

Some misses and false positives; big file takes time, causes glitches

Software crashes bring down passable performance on scanned forms

Processing these scanned disclosure forms from North Carolina legislators is time-consuming with OmniPage. Although it recognized most of the text accurately, software crashes triggered by this 1,700-page file made the results harder to wrangle.
