Crawler-commons project gets started

December 3, 2009
Back in November we helped put together a small gathering of web crawler developers at ApacheCon 2009. One of the key topics was how to share development effort, rather than having each project independently implement similar functionality.

Out of this was born the crawler-commons project. As the main page says:

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

There’s a long list of functionality that is identical, or nearly so, across the various projects. The project wiki has a more detailed write-up from the ApacheCon meeting, but a short list includes:

  • robots.txt parsing
  • URL normalization
  • URL filtering
  • Domain name manipulation
  • HTML page cleaning
  • HttpClient configuration
  • Text similarity

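To make one of these items concrete, here is a minimal sketch of the kind of URL normalization a shared component might provide: lower-casing the scheme and host, dropping default ports, and stripping fragments so that equivalent URLs compare equal. The class and method names below are illustrative only, not the actual crawler-commons API.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Hypothetical illustration of a shared URL-normalization component;
    // these names are not part of the crawler-commons codebase.
    public class UrlNormalizerSketch {

        // Normalize a URL string: lower-case scheme and host, drop the
        // default port, strip the fragment, and default an empty path to "/".
        public static String normalize(String url) throws URISyntaxException {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
            String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
            int port = uri.getPort();
            if (("http".equals(scheme) && port == 80)
                    || ("https".equals(scheme) && port == 443)) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }
            String path = (uri.getPath() == null || uri.getPath().isEmpty())
                    ? "/" : uri.getPath();
            URI normalized = new URI(scheme, null, host, port, path, uri.getQuery(), null);
            return normalized.toString();
        }

        public static void main(String[] args) throws URISyntaxException {
            // Both forms reduce to the same canonical URL.
            System.out.println(normalize("HTTP://Example.COM:80/index.html#section"));
            System.out.println(normalize("http://example.com/index.html"));
        }
    }

Every crawler ends up writing some variant of this; pooling one well-tested implementation is exactly the kind of duplication the project aims to remove.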
It’s still early, but some initial code has been submitted to the Google Code SVN repository. Anybody with an interest in Java web crawlers can use this feed to track project updates.
