DocumentCloud

Despite flaws, DocumentCloud a good start to reporting projects

OCR and entity analysis struggle with complex files, but document management features are great.

Overall:

Great for the price; doc management stellar; OCR, entity extraction need refinement

Documentation:

No manual or video, but help section includes guides, how-tos; API available

Usability:

Easy to grasp, learn; lots of useful features

Community:

Forum discusses bugs, fixes, dream features; responsive staff

Performance:

Fast; handles basic jobs, but OCR and entity extraction have hard-to-explain gaps and flaws; not quite as good as proprietary competitors

Product:

DocumentCloud

Company:

Investigative Reporters and Editors

Cost:

Free

[FULL DISCLOSURE: DocumentCloud is a project by Investigative Reporters and Editors, where Reporters' Lab Director Sarah Cohen serves as a member of the board of directors.]

DocumentCloud is invaluable for reporters who want a free, Web-based service to share and mark up documents, or even publish them. Although its optical character recognition and entity extraction functions are far from perfect, they're suitable in most cases for a first pass at tricky documents.

DocumentCloud's core is a suite of tools for document sharing, markup and publication. It lets users define a set of colleagues who can view and annotate a document. Collaborators can reach notes easily through the document or direct links, and private annotations let users keep personal notes.

Other features allow users to insert or reorder pages, create sections and edit document data. Reporters can also use DocumentCloud to publish or embed a document, and redact portions of it before publication. So far, users cannot split a document into multiple documents, though a developer said that feature was under consideration.

OCR SPOTTY, BUT USEFUL

DocumentCloud's optical character recognition feature, which uses open-source Tesseract, performed decently in our tests. It didn't do as well as industry-standard Adobe Acrobat or other cheaper options, but provided useful results in more than half the cases. In our tests, it did very well with clean and regularly formatted documents.

But it struggles noticeably when dealing with unusual font styles, like italics, or smaller font sizes. It also has trouble with dense blocks of text, like long paragraphs with no breaks. Irregular formats with expanses of white space seem to stump it as well.

However, it does allow easy switching between the OCR text and original document. That lets users cast a wide net with a search in the OCR version, then get the accurate text from the original. Basic users can correct text recognized via OCR through an interactive window. Advanced users can work with DocumentCloud's API to "train" it to recognize and replace terms.
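For reporters who export the OCR text and want to cast that wide net offline, here is a minimal sketch of a fuzzy search that tolerates typical OCR slips, such as "rn" read in place of "m." It uses only Python's standard library and does not touch DocumentCloud's API. The function name fuzzy_find, the 0.8 cutoff and the sample sentence are our own assumptions for illustration.

```python
import difflib
import re


def fuzzy_find(term, ocr_text, cutoff=0.8):
    """Return (word_index, word) pairs whose spelling is close to `term`.

    Hypothetical helper for illustration; the cutoff is an assumed
    starting point and should be tuned against real OCR output.
    """
    words = re.findall(r"[A-Za-z0-9']+", ocr_text)
    target = term.lower()
    hits = []
    for index, word in enumerate(words):
        # ratio() is 1.0 for identical strings and drops as they diverge.
        score = difflib.SequenceMatcher(None, target, word.lower()).ratio()
        if score >= cutoff:
            hits.append((index, word))
    return hits


# Made-up sample with a typical OCR error ("rn" in place of "m").
sample = "The Departrnent of Justice reviewed the rnemo. The department declined."
hits = fuzzy_find("department", sample)
for index, word in hits:
    print(index, word)
```

Every hit is only a candidate. As with DocumentCloud's own search, the wording should be confirmed against the original page image before it is quoted.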

ENTITY EXTRACTION ERRATIC

DocumentCloud features include entity analysis -- in effect an automated index of all the organizations, people, places and terms in the document. It uses Thomson Reuters' OpenCalais, which is lightning-fast and useful, as long as you recognize the limitations.

First, OpenCalais struggles with larger documents; a DocumentCloud developer said it probably breaks such files into sections and processes them separately. That may have led to the poor result in our test of a large document set, where the entity analysis appeared to skip dozens of pages of the document. It performed much better on a smaller file, but ignored an obvious entity, the U.S. Department of Justice.

While that's not good enough for a publishable analysis, it could be useful for reporters making a first pass at a document and looking for terms that are clearly included. That could form the basis for a more thorough look via DocumentCloud's search functions. The feature does have a slick display, producing a "timeline" of the document with hashmarks indicating where the entity name appears. Hover on a hashmark and you get the text where the term appears. 
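The idea behind that timeline can be approximated on exported page text, which is also a quick way to check whether a known entity was missed. The sketch below is not DocumentCloud's or OpenCalais' implementation. It is a plain case-insensitive string search, and the function name mention_timeline, the 20-character context window and the sample pages are assumptions made for illustration.

```python
def mention_timeline(pages, entity, context=20):
    """List (page_number, offset, snippet) for each mention of `entity`.

    `pages` is a list of page texts in document order. Matching is
    case-insensitive and exact, so OCR misspellings will not be found.
    """
    needle = entity.lower()
    timeline = []
    for page_number, text in enumerate(pages, start=1):
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            # Keep a little surrounding text, like the hover preview.
            end = start + len(needle)
            snippet = text[max(0, start - context):end + context].strip()
            timeline.append((page_number, start, snippet))
            start = haystack.find(needle, end)
    return timeline


# Made-up pages for illustration only.
pages = [
    "The Department of Justice opened a review.",
    "No agencies are named here.",
    "Staff briefed the department of justice twice.",
]
timeline = mention_timeline(pages, "Department of Justice")
for page_number, offset, snippet in timeline:
    print(page_number, offset, snippet)
```

Pages that never show up in the result are the ones to read by hand, since an empty page can mean either no mention or unreadable OCR text.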

That said, the entity extraction has some annoying restrictions. You can only analyze one document at a time, not a set of related documents. Analyzing a portion of a document requires a manual scan of the full-document results. DocumentCloud doesn't allow you to merge or split documents, which would make those tasks simpler.

Still, DocumentCloud puts a slew of tools in the hands of journalists and gives them a terrific primer in what's possible and what they can attempt to do. For the price (free), there's no reason not to use it for every project.

 
Product:

DocumentCloud

Company:

Investigative Reporters and Editors

Version Tested:

N/A

Release Date:

2009

OS Tested:

Web Based

Cost:

Free

Open Sourced:

Yes

Demo Available:

No

Obsolete:

No

 

How DocumentCloud performed on our tests

Verdict:

Recognized about half of desired search results

DocumentCloud spotty on recognizing text in form-based PDF

It took two tries to upload these disclosure forms from North Carolina legislators to DocumentCloud, and the service correctly recognized only half of the terms we were looking for in our tests.


Verdict:

Good for a first pass, not enough for rock-solid analysis

DocumentCloud performs erratically when searching memos

DocumentCloud spotted only about half the references to organizations mentioned in these memos from the Obama-Biden transition team -- although it performed better than our annotator on those it did catch.


Verdict:

Handles most documents easily; italics a problem

Aside from italics, DocumentCloud tackles scanned-in memos

DocumentCloud recognized the text with few problems in three of the four memos from the Obama-Biden transition team that we tested. But the service did have a harder time with italicized text.


Verdict:

Poor scan quality hinders text recognition; strings of numbers, letters an issue

Searching partial-text PDF clumsy, but doable with DocumentCloud

Despite issues with recognizing combined strings of numbers and letters and some miscues from this poorly scanned list of congressional reports, DocumentCloud was a good first step when trying to sift through this massive PDF file.


Verdict:

Some text recognition stumped by formatting

DocumentCloud hit or miss with OCR on typed transcripts

Formatting and white space posed problems for DocumentCloud in its attempt to recognize text in these typed transcripts from Iraq-Afghanistan combatant tribunals. It garbled some text and mixed up content split into columns.


Verdict:

Provides useful first read, but not comprehensive

DocumentCloud flaws hurt entity analysis of emails

Even after multiple passes, DocumentCloud was only able to catch a small fraction of the names we were looking to highlight in an analysis of emails related to the Deepwater Horizon oil spill.
