Even after multiple passes, DocumentCloud caught only a small fraction of the names we wanted to highlight in an analysis of emails related to the Deepwater Horizon oil spill.
The first time we processed the PDF, DocumentCloud's entity extraction service skipped more than half of the pages we sampled in our test. On the remaining pages, it did pull out instances of three of the six names, but whiffed on the rest.
Developer Jeremy Ashkenas said this may be an issue with OpenCalais, which DocumentCloud uses to perform entity analysis. The service breaks large documents into chunks and processes them separately, which may cause uneven results across sections of the document.
At his suggestion, we reprocessed the document and got somewhat better results.
In the second run, DocumentCloud found about half of the total occurrences of one name. It found a few more appearances of four other names in the test group, but still completely missed one name.
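For readers who want to run a similar check, here is a minimal sketch of how a per-name hit rate could be tallied, chunk by chunk. It is not DocumentCloud's or OpenCalais's code. The function names, the chunk size and the use of plain string matching as the baseline count are all assumptions made for illustration.

```python
from collections import Counter


def chunk_pages(pages, chunk_size):
    """Split a list of page texts into consecutive chunks of at most chunk_size pages.

    chunk_size is a hypothetical parameter; the real service's chunking rules are not known here.
    """
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def count_names(pages, names):
    """Count case-insensitive occurrences of each name across the given pages.

    Plain substring matching stands in for a manual count of where each name appears.
    """
    counts = Counter({name: 0 for name in names})
    for text in pages:
        lowered = text.lower()
        for name in names:
            counts[name] += lowered.count(name.lower())
    return counts


def hit_rate(found, expected):
    """Return, per name, the share of expected occurrences that were found.

    A name with no expected occurrences gets None rather than a misleading 0 or 1.
    """
    return {
        name: (found.get(name, 0) / total if total else None)
        for name, total in expected.items()
    }
```

Running `count_names` on each chunk from `chunk_pages`, then comparing those totals with what the extraction service returned for the same pages, would show whether misses cluster in particular sections of a document.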