These transcripts from Combatant Status Review Tribunals for high-value detainees, which appear as image PDFs, should be relatively easy for most optical character recognition (OCR) software to handle, since the scans are high-quality and the material is typed.
Use OCR software to process the documents and produce searchable text that is accurate and complete enough to search for important names, events and other key phrases.
DESIRED OUTCOME: Samples from the output should match five pre-selected samples that have already been transcribed into text.
To test the accuracy of your output, first download the comparison sample file, where you'll find the locations of the five selections. Copy these selections from your results and paste them directly into a new text document. Then run "Compare Documents" (available in most word processors) on the comparison sample file and your new document.
The OCR software's performance on this task should be judged based on how much the two documents match up.
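If you would rather put a number on the match than eyeball a word processor's comparison, a short script can do a similar job. What follows is a minimal sketch, not part of the original test procedure: it uses Python's standard difflib module, the function names are our own, and the two sample strings are invented for illustration rather than taken from the transcripts.

```python
import difflib


def to_words(text: str) -> list[str]:
    """Split on any whitespace so line-wrap differences are not counted as errors."""
    return text.split()


def match_score(reference: str, ocr_output: str) -> float:
    """Return a word-level similarity between 0.0 (nothing matches) and 1.0 (identical)."""
    matcher = difflib.SequenceMatcher(
        None, to_words(reference), to_words(ocr_output), autojunk=False
    )
    return matcher.ratio()


def mismatches(reference: str, ocr_output: str) -> list[tuple[str, str]]:
    """List (reference words, OCR words) pairs wherever the two texts differ."""
    ref_words = to_words(reference)
    ocr_words = to_words(ocr_output)
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    return [
        (" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Invented sample text, standing in for one pre-transcribed selection
    # and the matching passage copied from the OCR results.
    reference = "the detainee was present"
    ocr_output = "the detainec was\npresent"

    print(f"match score: {match_score(reference, ocr_output):.2f}")
    for expected, found in mismatches(reference, ocr_output):
        print(f"expected {expected!r}, OCR produced {found!r}")
```

Comparing word by word rather than character by character means a single misread letter counts as one wrong word, which is closer to what matters when a reporter searches for a name. Run it once per selection, or join all five selections into one string on each side.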
OmniPage Pro whipped through these transcripts from combatant tribunals fast, recognizing the text in all but a few areas with lower-quality scans.
FineReader made this document set, a collection of transcripts from combatant tribunals, into searchable PDFs almost perfectly. A reporter searching for specific words would have no trouble using this product when converting such basic documents.
Able2Extract was nearly perfect in converting these scanned, typed transcripts of detainee interviews into searchable text.
Acrobat's OCR feature was able to fill this PDF collection with searchable text with only a handful of missing words. Even most of the Arabic names in the original document were rendered accurately in the resulting PDF.
Formatting and white space posed problems for DocumentCloud in its attempt to recognize text in these typed transcripts from Iraq-Afghanistan combatant tribunals. It garbled some text and mixed up content split into columns.
Google Drive couldn't get out of the starting gate in this test of combatant tribunal transcripts because of its file size restrictions.