[FULL DISCLOSURE: DocumentCloud is a project by Investigative Reporters and Editors, where Reporters' Lab Director Sarah Cohen serves as a member of the board of directors.]
DocumentCloud is invaluable for reporters who want a free, Web-based service to share and mark up documents, or even publish them. Although its optical character recognition and entity extraction functions are far from perfect, they're suitable in most cases for a first pass at tricky documents.
DocumentCloud's core is a suite of tools for document sharing, markup and publication. It allows users to define a set of colleagues allowed to view and annotate a document. Notes are easily accessible by collaborators through the document or direct links, and private annotations also allow for personal notes.
Other features allow users to insert or reorder pages, create sections and edit document data. Reporters can also use DocumentCloud to publish or embed a document, and redact portions of it before publication. So far, users cannot break or divide a document into multiple documents, though a developer said that feature was under consideration.
DocumentCloud's optical character recognition feature, which uses open-source Tesseract, performed decently in our tests. It didn't do as well as industry-standard Adobe Acrobat or other cheaper options, but provided useful results in more than half the cases. In our tests, it did very well with clean and regularly formatted documents.
But it struggles noticeably when dealing with unusual font styles, like italics, or smaller font sizes. It also has trouble with dense blocks of text, like long paragraphs with no breaks, and irregular formats with expanses of white space seem to stump it.
However, it does allow easy switching between the OCR text and original document. That lets users cast a wide net with a search in the OCR version, then check the accurate wording against the original. Basic users can correct text recognized via OCR through an interactive window. Advanced users can work with DocumentCloud's API to "train" it to recognize and replace terms.
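To give a sense of what "recognize and replace" can mean in practice, here is a minimal Python sketch of term correction applied to OCR text after the fact. It does not use DocumentCloud's API, and we are not suggesting this is how the service works internally. The `CORRECTIONS` table and the `correct_ocr_text` function are hypothetical names invented for illustration.

```python
import re

# Hypothetical lookup table of known OCR misreads and their corrections.
# These entries are illustrative; build your own from the errors you see.
CORRECTIONS = {
    "Departrnent": "Department",
    "Cornmittee": "Committee",
}


def correct_ocr_text(text, corrections=CORRECTIONS):
    """Replace whole-word OCR misreads using a lookup table."""
    if not corrections:
        return text
    # Match any known misread, but only as a complete word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in corrections) + r")\b"
    )
    return pattern.sub(lambda match: corrections[match.group(1)], text)


if __name__ == "__main__":
    print(correct_ocr_text("The Departrnent of Justice"))
```

A lookup table like this only fixes misreads you have already spotted, so it complements checking the original scan rather than replacing it.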
DocumentCloud features include entity analysis -- in effect an automated index of all the organizations, people, places and terms in the document. It uses Thomson Reuters' OpenCalais, which is lightning-fast and useful, as long as you recognize the limitations.
First, OpenCalais struggles with larger documents; a DocumentCloud developer said it probably breaks such files into sections and processes them separately. That may have led to the poor result in our test of a large document set, where the entity analysis appeared to skip dozens of pages of the document. It performed much better on a smaller file, but ignored an obvious entity, the U.S. Department of Justice.
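One way to spot this kind of skipped stretch is to look for runs of pages where no entity was found at all. The Python sketch below assumes you have already collected, by hand or otherwise, the page numbers on which entities appear. The function name `find_uncovered_ranges` and the input format are our own assumptions, not part of DocumentCloud or OpenCalais.

```python
def find_uncovered_ranges(total_pages, pages_with_entities):
    """Return inclusive (start, end) page ranges that have no entity hits."""
    covered = set(pages_with_entities)
    gaps = []
    start = None  # first page of the gap currently being tracked
    for page in range(1, total_pages + 1):
        if page not in covered:
            if start is None:
                start = page
        elif start is not None:
            # A covered page closes the open gap.
            gaps.append((start, page - 1))
            start = None
    if start is not None:
        # The document ended while a gap was still open.
        gaps.append((start, total_pages))
    return gaps


if __name__ == "__main__":
    # Hypothetical 10-page document with entities on pages 1, 2, 3 and 8.
    print(find_uncovered_ranges(10, [1, 2, 3, 8]))
```

A long gap does not prove the analysis skipped those pages, since some pages may simply contain no entities, but it tells a reporter where to look first.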
While that's not good enough for a publishable analysis, it could be useful for reporters making a first pass at a document and looking for terms that are clearly included. That could form the basis for a more thorough look via DocumentCloud's search functions. The feature does have a slick display, producing a "timeline" of the document with hashmarks indicating where the entity name appears. Hover on a hashmark and you get the text where the term appears.
That said, the entity extraction has some annoying restrictions. You can only analyze one document at a time, not a set of related documents. Analyzing a portion of a document requires a manual scan of the full-document results. DocumentCloud doesn't allow you to merge or split documents, which would make those tasks simpler.
Still, DocumentCloud puts a slew of tools in the hands of journalists and gives them a terrific primer in what's possible and what they can attempt to do. For the price (free), there's no reason not to use it for every project.
Investigative Reporters and Editors // 2009 // Free
It took two tries to upload these disclosure forms from North Carolina legislators to DocumentCloud, and the service correctly recognized only half of the terms we were looking for in our tests.
DocumentCloud only spotted about half the references to organizations mentioned in these memos from the Obama-Biden transition team -- although it performed better than our annotator on those it did catch.
DocumentCloud recognized the text in three of four memos from the Obama-Biden transition team we tested with few problems. But the service did have a harder time with italicized text.
Despite issues with recognizing combined strings of numbers and letters and some miscues from this poorly scanned list of congressional reports, DocumentCloud was a good first step when trying to sift through this massive PDF file.
Formatting and white space posed problems for DocumentCloud in its attempt to recognize text in these typed transcripts from Iraq-Afghanistan combatant tribunals. It garbled some text and mixed up content split into columns.
Even after multiple passes, DocumentCloud was only able to catch a small fraction of the names we were looking to highlight in an analysis of emails related to the Deepwater Horizon oil spill.
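For readers who want to run a similar check on their own documents, here is a minimal Python sketch that measures what share of a list of expected terms shows up in recognized text. It is not the scoring method used in the tests summarized above. The function name `term_recall` and the simple case-insensitive substring match are our assumptions.

```python
def term_recall(expected_terms, recognized_text):
    """Return (found, missed, share_found) for a list of expected terms.

    Matching is a case-insensitive substring check, so it will not catch
    terms that the OCR garbled even slightly.
    """
    haystack = recognized_text.lower()
    found = [term for term in expected_terms if term.lower() in haystack]
    missed = [term for term in expected_terms if term.lower() not in haystack]
    share = len(found) / len(expected_terms) if expected_terms else 0.0
    return found, missed, share


if __name__ == "__main__":
    # Hypothetical terms and text, for illustration only.
    terms = ["Department of Justice", "Coast Guard"]
    text = "A memo from the department of justice on the spill response."
    found, missed, share = term_recall(terms, text)
    print(found, missed, share)
```

The list of missed terms is usually more useful than the percentage, because it tells you which names to search for manually in the original scan.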