Document: Executive branch congressional reports

Use OCR to make a partial-text PDF completely searchable

Difficulty:

Partially searchable text may gum up some software

This document, a list of reports due to Congress from the executive branch, is a challenge for optical character recognition because only pieces of it appear as searchable text.

Page numbers, annotations and other fields that are oriented differently from most of the data (portrait vs. landscape) may also complicate the conversion.

DESIRED OUTCOME: Make text completely searchable. For each of the following, search the output to connect a given report with the corresponding law or policy that authorizes it.

  • Certification that a shrimp harvesting nation has adopted a regulatory program governing the incidental taking of certain sea turtles.
    Answer (on page 23): Pub. L. 101-162, Sec. 609(b)(2) (103 Stat. 1038)
  • Library of Congress Register of Copyrights: Balance achieved between the rights of creators and the needs of users when copies are made by libraries
    Answer (on page 13): 17 U.S.C. 108(i)
  • An evaluation of the financial impact of the CCA program, of changes in access to physicians and other health care providers and beneficiary satisfaction
    Answer (on page 97): 42 U.S.C. 1395w-29 Pub. L. 108-173, Sec. 241(a)
  • A report on Federal Trade Commission's experience with authority to regulate spam and spyware
    Answer (on page 188): 15 U.S.C. 44 Pub. L. 109-455, Sec. 14

For each of the following, search your output to find when these reports are due to Congress.

  • Expenditures incurred by the U.S. Government directly attributable to the exercise of emergency war powers or authorities
    Answer (on page 34): During each emergency; final report within 9 days after its termination.
  • A report on the established procedure for expeditiously clearing individuals whose names have been mistakenly placed on a terrorist list or who may have names identical or similar to individuals on a terrorist database list
    Answer (on page 111): Not later than 6 months after the date of enactment of this section

For each of the following, search your output to connect a given law or policy with the report it authorizes.

  • 10 U.S.C. 1071 Pub. L. 106-65, Sec. 723
    Answer (on page 62): Report on the quality of health care furnished under the health care programs of the Department of Defense
  • Pub. L 108-199, Sec. Div. D, Title II (118 Stat. 160)
    Answer (on page 143): A determination that Israel is not being denied its right to participate in the activities of the IAEA
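
The lookups above can be approximated in code once the OCR output is available as plain text, one string per page. The following is a minimal sketch under that assumption, not the method used in these tests. The function names, the 0.85 threshold and the sliding-window step are hypothetical choices for illustration. It first tries an exact match after normalizing case and whitespace, then falls back to a fuzzy comparison so that a citation with a few misread characters (for example, the letter O in place of a zero) can still be located.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_pages(pages, query, threshold=0.85):
    """Return (page_number, score) pairs for pages that likely contain query.

    pages: list of strings, one per page of OCR output (page 1 first).
    threshold: minimum similarity ratio; 0.85 is an assumed starting point.
    """
    needle = normalize(query)
    hits = []
    for number, text in enumerate(pages, start=1):
        haystack = normalize(text)
        # Exact match after normalization scores 1.0.
        if needle in haystack:
            hits.append((number, 1.0))
            continue
        # Otherwise slide a query-sized window over the page and keep
        # the best similarity ratio seen.
        best = 0.0
        step = max(1, len(needle) // 4)
        last_start = max(1, len(haystack) - len(needle) + 1)
        for start in range(0, last_start, step):
            window = haystack[start:start + len(needle)]
            ratio = SequenceMatcher(None, needle, window).ratio()
            best = max(best, ratio)
        if best >= threshold:
            hits.append((number, round(best, 2)))
    return hits


if __name__ == "__main__":
    # Made-up sample pages, not text from the actual document.
    sample = [
        "nothing relevant here",
        "Report due under 17 U.S.C.   108(i) annually",
        "Pub. L. 1O1-162, Sec. 6O9(b)(2)",
    ]
    print(find_pages(sample, "17 U.S.C. 108(i)"))
    print(find_pages(sample, "Pub. L. 101-162, Sec. 609(b)(2)"))
```

The window comparison is slow on long pages and can miss matches that straddle window boundaries, so the step size and threshold would need tuning against a real document.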
 

Test Results

Verdict:

Poor scan quality hinders text recognition; strings of numbers, letters an issue

Searching partial-text PDF clumsy, but doable with DocumentCloud

Despite trouble recognizing mixed strings of numbers and letters, and some miscues caused by the poor scan of this list of congressional reports, DocumentCloud was a good first step for sifting through this massive PDF file.

Verdict:

Some terms searchable; others not found

Docs with partial text prove hit or miss for FineReader

Although FineReader recognized much of the text on these congressional reports, the software had trouble with the poor scan quality and missed quite a bit.

Verdict:

Inaccuracy makes finding keywords unlikely

Acrobat's OCR incorrectly translates partial-text PDF

The relatively poor quality of this partial-text PDF of congressional reports was too difficult for Adobe Acrobat to translate into a fully text-enabled, searchable document. Much of the text was garbled and misinterpreted, so be wary of using this product if your document was hastily scanned or features unclear text.

Verdict:

Fails to recognize text; returns only headers, footers

Low-resolution text stumps OmniPage's text recognition

OmniPage completely ignores the relevant text in this low-resolution, partial-text PDF index of congressional reports. Even with ample options for recognizing text, the software manages to capture only page numbers and annotations, which are worthless in this context.

Verdict:

Couldn't upload large file

Google Drive fails to convert large partial-text PDF

Google Drive failed completely in our attempt to recognize text in this list of executive branch reports required by Congress.

Verdict:

Can't handle badly scanned report

Able2Extract fails to make partial-text PDF searchable at all

Able2Extract completely fails with this difficult document, a PDF already partially processed with OCR.
