Crawler-commons project gets started

December 3, 2009
Back in November we helped put together a small gathering of web crawler developers at ApacheCon 2009. One of the key topics was how to share development effort, rather than having each project independently implement similar functionality.

Out of this was born the crawler-commons project. As the main page says:

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

There’s a long list of functionality that is identical, or nearly so, across the various projects. The project wiki has a more detailed write-up from the ApacheCon meeting, but a short list includes:

  • robots.txt parsing
  • URL normalization
  • URL filtering
  • Domain name manipulation
  • HTML page cleaning
  • HttpClient configuration
  • Text similarity

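To make one of these items concrete, here is a minimal sketch of the kind of URL normalization a shared component might provide: lower-casing the scheme and host, dropping default ports, and stripping fragments so that equivalent URLs compare equal. The class and method names below are illustrative only, not the actual crawler-commons API.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Hypothetical illustration of a shared URL-normalization component;
    // these names are not part of the crawler-commons codebase.
    public class UrlNormalizerSketch {

        // Normalize a URL string: lower-case scheme and host, drop the
        // default port, strip the fragment, and default an empty path to "/".
        public static String normalize(String url) throws URISyntaxException {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
            String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
            int port = uri.getPort();
            if (("http".equals(scheme) && port == 80)
                    || ("https".equals(scheme) && port == 443)) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }
            String path = (uri.getPath() == null || uri.getPath().isEmpty())
                    ? "/" : uri.getPath();
            URI normalized = new URI(scheme, null, host, port, path, uri.getQuery(), null);
            return normalized.toString();
        }

        public static void main(String[] args) throws URISyntaxException {
            // Both forms reduce to the same canonical URL.
            System.out.println(normalize("HTTP://Example.COM:80/index.html#section"));
            System.out.println(normalize("http://example.com/index.html"));
        }
    }

Every crawler ends up writing some variant of this; pooling one well-tested implementation is exactly the kind of duplication the project aims to remove.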
It’s still early, but some initial code has been submitted to the Google Code SVN repository. Anybody with an interest in Java web crawlers can use this feed to track project updates.
