Needlebase

Needlebase effective, expensive scraping solution

When it comes to scraping websites, Needlebase is hard to beat.

Overall: Versatile and effective if users can afford it

Documentation: Adequate documentation with some tutorials; some material dated

Usability: Straightforward, but moderate learning curve

Community: Only forum fairly small; company tweets infrequently

Performance: Fast, gathering data in the background; accomplished most scraping tasks

Product: Needlebase

Company: ITA Software

Cost: Free to $999/month

[Editor's note: The Needlebase team announced its technology would be retired June 1, 2012, as team members work to integrate it with Google. Read more on the future of Needlebase and the state of Web scraping solutions here. -TD]

If you can afford the cost and the time to handle the learning curve, Needlebase can solve most common scraping problems in the newsroom, aside from gathering and downloading files.

This isn't the type of product you can just pick up and play. It takes a little time to figure out how to set up models for data, which are oriented around loosely connected tags.

Mediocre documentation does little to flatten that learning curve. The video tutorial is helpful, but it features an older version with a slightly different interface. Several text tutorials appear to be more up to date and more useful. There's not much of a user community to turn to for help either -- the forum had only about 80 posts at the time of this review.

Once users master the basics, though, Needlebase becomes a powerful tool. Its real strength is building a custom scraper through its visual interface. Just provide the URL of the page you want to scrape and let it load. You can then begin tagging.

The interface is mostly intuitive. It can handle form fields and pagination links with simple clicks, and Needlebase almost always guesses correctly what users need.

It also handles detail pages well. Once you tell it which links to follow, the application gets a good idea of what to scrape.

Needlebase performed well on our scraping tasks, although it was unable to handle a site that required browser cookies. This is an unfortunate side effect of a hosted solution, and is hard to troubleshoot.

While the system isn't designed to gather files like PDFs or images, Needlebase will collect links to these files, making it easier to use another tool to download what you need.
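As an illustration of that second step, here is a minimal Python sketch that saves every file in a list of links, such as a column of PDF URLs exported from a scraper. It is not part of Needlebase; the function names (`filename_for`, `fetch_bytes`, `download_all`) and the fallback filename are hypothetical, and it assumes the files are reachable with a plain HTTP GET, without a login or cookies.

```python
import os
import urllib.request
from urllib.parse import urlparse, unquote


def filename_for(url, fallback="download"):
    """Derive a local filename from the last segment of the URL path."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or fallback


def fetch_bytes(url):
    """Default fetcher: a plain HTTP GET using the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL into dest_dir and return the list of paths written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(dest_dir, filename_for(url))
        with open(path, "wb") as out:
            out.write(fetch(url))
        saved.append(path)
    return saved
```

Calling `download_all(links, "pdfs")` would write each linked file into a `pdfs` folder. Two links that end in the same filename would overwrite each other in this sketch, so a real run may need to rename files as they arrive.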

Despite its strengths, many news organizations may find Needlebase too expensive. Users are charged on a per-cell basis, so larger databases mean larger fees. The free version lets users collect 100,000 cells from 5,000 pages a month, but the data must be made public. Costs peak at $999 a month.
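To get a rough sense of whether a project would fit the free version, the two limits quoted above can be checked with some quick arithmetic. The Python sketch below assumes that one cell is one field value in one record, which the review does not spell out; the function name `fits_free_tier` and its parameters are hypothetical.

```python
FREE_CELLS_PER_MONTH = 100_000  # free-version limit quoted in the review
FREE_PAGES_PER_MONTH = 5_000    # free-version limit quoted in the review


def fits_free_tier(records, fields_per_record, pages):
    """Return True if a month's scrape stays within both free limits.

    Assumes one cell = one field value in one record (an assumption,
    not something the review confirms).
    """
    cells = records * fields_per_record
    return cells <= FREE_CELLS_PER_MONTH and pages <= FREE_PAGES_PER_MONTH
```

Under that assumption, 8,000 records with 12 fields each come to 96,000 cells, just under the cell limit. If every record sits on its own detail page, though, the scrape touches at least 8,000 pages and exceeds the page limit, so `fits_free_tier(8000, 12, 8000)` returns `False`.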

 
Version Tested: --

OS Tested: Web-based

Open Sourced: No

Demo Available: Yes

Obsolete: Yes

 

How Needlebase performed on our tests

Verdict: Easily downloads content using search forms, master/detail pattern

Needlebase plows through search fields to perfectly copy database

With its ability to automate searches and handle multilevel databases, Needlebase can easily root through and capture this collection of teacher information.

Verdict: Effortlessly converts website into any format

Needlebase has no trouble with simple data table

Needlebase effortlessly pulls down structured data from the South Dakota lobbyist database within minutes.

Verdict: Captures metadata, but can't download files

Lack of download option leaves PDF database test half-finished

Needlebase proves more than adequate in tagging and saving the metadata associated with these Obama transition team documents. But it wasn't able to download the PDFs and missed some information.

Verdict: Fails to connect to site, returning no results

Needlebase won't connect to site, preventing scrape

Needlebase was simply not able to perform any part of this test, failing to connect to the page at all. But since it's a hosted solution, it's hard to figure out why.
