nutch | Linuxaria

Crawling in Open Source, Part 1

Feb 17, 2011

Today I present an excellent and comprehensive article on Nutch, an open source search engine. You can find the original article, with the code examples, here.

After reading this article, readers should be somewhat familiar with basic crawling concepts and the core MapReduce jobs in Nutch.

What is a web crawler?

A web crawler is a computer program that discovers and downloads content from the web, usually over HTTP. The discovery process is simple and straightforward. A crawler is first given a set of URLs, often called seeds. It downloads the content at those URLs, extracts the hyperlinks (URLs) from the downloaded content, and then repeats the process on the newly found URLs. This is much the same thing that happens when a person uses a web browser: starting from a homepage, they click a link, then a link on the page that follows, one after another.
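The loop described above (start from seeds, download, extract links, repeat) can be sketched in a few lines of Python. This is only a minimal single-process illustration, not how Nutch implements crawling. The names `crawl`, `http_fetch`, `LinkExtractor` and the `max_pages` limit are hypothetical, chosen for this example, and the sketch ignores things a real crawler needs, such as robots.txt handling and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=http_fetch, max_pages=50):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is any function mapping a URL to HTML text (an assumption
    of this sketch, so the loop can be tried without network access).
    Returns the URLs that were downloaded, in visit order.
    """
    frontier = deque(seeds)   # URLs discovered but not yet downloaded
    seen = set(seeds)         # every URL ever queued, to avoid repeats
    visited = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue          # skip pages that cannot be downloaded
        visited.append(url)

        # Extract hyperlinks and queue the ones not seen before.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)

    return visited
```

Calling `crawl(["http://example.com/"])` would download the seed page and then follow the links it finds, up to `max_pages` pages. Passing a different `fetch` function, for example one that reads from a dictionary of canned pages, lets the same loop run offline.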