PDFs from the state ethics commission aren't much good to reporters unless they can search for the companies that might pose conflicts of interest for legislators. To do that, convert these scanned statements of economic interest into PDFs with searchable text.
It's only necessary to convert the documents from 2011.
Because the forms are consistently structured and most are scanned at decent quality, recognizing the text in them shouldn't be too hard. Some of the type is tiny, though, and some forms are sloppily scanned.
DESIRED OUTCOME: Successfully search for the following public companies to find which candidates and their families held stock in them in 2011 (include stock symbol in your search):
BB&T: Bill Faison, Linda Garrou, Michael Walters, William Brent Jackson, Diane Parfitt, Timothy Spear
Pfizer: William Brent Jackson, James Crawford Jr.
IBM: William Brent Jackson, Diane Parfitt, Charles McGrady, David Martin, Thomas Tillis
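Once a tool has produced searchable text, a check like the one above can be scripted. The following is a minimal Python sketch, not part of the original test. It assumes the OCR output has already been extracted into (filer, page text) pairs. The names `SEARCH_TERMS`, `term_pattern` and `find_filers` are hypothetical, the ticker symbols are supplied here as an assumption (the challenge asks for symbols but does not list them), and the sample pages are invented placeholders rather than real filings.

```python
import re

# Company names from the desired outcome. The ticker symbols are an
# assumption added for illustration; the source does not list them.
SEARCH_TERMS = {
    "BB&T": ["BB&T", "BBT"],
    "Pfizer": ["Pfizer", "PFE"],
    "IBM": ["IBM"],
}


def term_pattern(term):
    """Build a case-insensitive pattern for one search term.

    OCR often inserts spaces around '&', so "BB & T" should still match
    "BB&T". The lookarounds keep a short term such as "IBM" from matching
    inside a longer alphanumeric string.
    """
    parts = [re.escape(part) for part in term.split("&")]
    body = r"\s*&\s*".join(parts)
    return re.compile(
        r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE
    )


def find_filers(pages):
    """Map each company to the sorted list of filers whose text mentions it.

    `pages` is an iterable of (filer_name, page_text) pairs.
    """
    hits = {company: set() for company in SEARCH_TERMS}
    for filer, text in pages:
        for company, variants in SEARCH_TERMS.items():
            if any(term_pattern(v).search(text) for v in variants):
                hits[company].add(filer)
    return {company: sorted(names) for company, names in hits.items()}


# Invented placeholder pages, not real disclosure data.
sample_pages = [
    ("Filer A", "Stock: BB & T Corp (BBT), 500 shares"),
    ("Filer B", "Pfizer Inc. common stock"),
    ("Filer C", "No reportable holdings"),
]

if __name__ == "__main__":
    for company, filers in find_filers(sample_pages).items():
        print(company, filers)
```

Comparing the output for each company against the expected names listed above gives a rough recognition score for an OCR tool. Matching is deliberately loose (case-insensitive, tolerant of spaces around the ampersand), so any hit should still be confirmed against the scanned page.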
FineReader made quick work of these reports from lawmakers, which could potentially show conflicts of interest. The resulting PDFs made finding search terms easy.
It took two tries to upload these disclosure forms from North Carolina legislators to DocumentCloud, and the service correctly recognized only half of the terms we were looking for in our tests.
Able2Extract took a very long time to process this large PDF of nearly 1,700 pages of scanned-in forms, some with handwritten responses. After nearly four hours, it did eventually produce a legible, search-ready document, but missed a few terms.
Acrobat took its time wading through this 1,600-page collection of forms from North Carolina legislators, but its attempt to make the poorly scanned document text searchable shows the program is a good first step when trying to locate keywords in lengthy PDFs.
Google Drive handled this test with relative ease, uploading and recognizing text in 174 political candidate disclosure forms in about 40 minutes.
Processing these scanned disclosure forms from North Carolina legislators is time-consuming with OmniPage, and although it recognized most of the text accurately, a software hangup prompted by this 1,700-page file made the results more difficult to wrangle.