This document, a list of reports due to Congress from the executive branch, is a challenge for optical character recognition because only pieces of it appear as searchable text.
Page numbers, annotations and other fields in a different orientation than most of the data (portrait vs. landscape) may also complicate the conversion.
DESIRED OUTCOME: Make text completely searchable.
For each of the following, search your output to connect a given report with the corresponding law or policy that authorizes it.
For each of the following, search your output to find when these reports are due to Congress.
For each of the following, search your output to connect a given law or policy with the report it authorizes.
Despite issues with recognizing combined strings of numbers and letters and some miscues from this poorly scanned list of congressional reports, DocumentCloud was a good first step when trying to sift through this massive PDF file.
Although FineReader recognized much of the text on these congressional reports, the software had trouble with the poor scan quality and missed quite a bit.
The relatively poor quality of this partial-text PDF of congressional reports was too difficult for Adobe Acrobat to translate into a fully text-enabled, searchable document. Much of the text was garbled and misinterpreted, so be wary of using this product if your document was hastily scanned or features unclear text.
OmniPage completely ignores the relevant text in this low-resolution PDF index of Congress reports containing partial text. Even with ample options for recognizing text, the software only manages to capture page numbers and annotations -- worthless in this context.
Google Drive failed completely in our attempt to recognize text in this list of executive branch reports required by Congress.
Able2Extract completely fails with this difficult document, a PDF already partially processed with OCR.