Crawling in Open Source, Part 1

Feb 172011

Today I present you this excellent and comprehensive article on an open source search engine: Nutch, you can find the original article with the code examples here

After reading this article readers should be somewhat familiar with the basic crawling concepts and core MapReduce jobs in Nutch.

What is a web crawler?

A Web Crawler is a computer program that usually discovers and downloads content from the web via an HTTP protocol. The discovery process of a crawler is usually simple and straightforward. A crawler is first given a set of URLs, often called seeds. Next the crawler goes and downloads the content from those URLs and then extracts hyperlinks or URLs from the downloaded content. This is exactly the same thing that happens in the real world when a human is interfacing with a web browser and clicks on links from a homepage, and pages that follow, one after another.

When a web crawler discovers new URLs to follow, it then goes to fetch the content from those URLs – just like a normal browser fetches the content when links on a web page are clicked. The download part is even simpler. In the simplified case of HTTP, a network connection to the host is opened, an HTTP request is sent, an HTTP response is read from the wire and then the network connection is closed. A crawler might also support other means of getting the content, making it capable of finding and fetching content from various different places like local or shared file systems, ftp servers, mail boxes etc.

Crawler use cases

Generally speaking, crawlers are used to find and bring in the content that interests you. Often the reason is that you want to make that information searchable by indexing it. The use case and context for crawling can vary greatly depending on the task at hand. For example the most popular web search engines, like Google and Yahoo, deploy a web crawler that primarily finds content that is publicly available via the web. The main challenge for such a generic web crawler is obviously scalability. The crawler architecture needs to be efficient because the amount of content to crawl is enormous and growing all the time at ever increasing speeds. Another example, of a completely different kind of use case, is a price comparison site where users are presented with products and their price information from various online shops.

To gather such information one could implement a crawler that continuously re-crawls content from a predefined set of web sites, and is mainly interested in just a few specific URLs from those sites. A crawler in an enterprise search context possesses yet another set of features important for that context. Usually, an enterprise crawler needs to understand and extract information from various different document formats and access those documents from many different places like mail boxes, CRM systems, and content management systems. Understanding many different document formats plays an important role.

And there are many other examples of crawling needs that lie in between these examples that each require a slightly different set of requirements for the crawler to fulfill.

Essential crawler components

There are a few features or characteristics that are very important for a crawler. The most important single feature every crawler should contain is a sense of politeness. Since crawling content from web sites uses resources from the target server, like network bandwidth and processor time, it is mandatory to limit the frequency a crawler accesses particular content or else the target server would soon become inaccessible to real users. Owners of a website may also want to expose only some part of the available content to web crawlers. For such cases there exists a method called Robots Exclusion Standard that gives the owner of a web site the means to control what and how often something is fetched from the server. A polite crawler respects such rules. Politeness is not just following the rules, but respecting a site’s resources even when such rules are not explicitly specified by limiting the frequency of your crawler’s visits. The most common components of a crawler include a queue, fetcher, extractor and content repository.

The queue contains URLs to be fetched. It may be a simple memory based, first in, first out queue, but usually it’s more advanced and consists of host based queues, a way to prioritize fetching of more important URLs, an ability to store parts or all of the data structures on disk and so on. The fetcher is a component that does the actual work of getting a single piece of content, for example one single HTML page. The extractor is a component responsible for finding new URLs to fetch, for example by extracting that information from an HTML page. The newly discovered URLs are then normalized and queued to be fetched. The content repository is a place where you store the content. Later the content is processed and finally indexed. I am not going to go into more detail about crawler architectures and implementation details, but will instead take a closer look at one open source java crawler implementation: Apache Nutch.

Apache Nutch

Nutch was originally implemented by Doug Cutting and Michael Cafarella et al. in around 2002. The goal was to make Nutch a web scale crawler and search application capable of fetching billions of URLs per month, maintain an index of these pages and allow searching of that index 1000 times per second. The original design was proven to scale up to 100 million documents but became impractical with problems of maintenance at such a scale. During 2004, after the incubation process, Nutch became part of Apache. Soon after Google published its MapReduce paper in 2004, the Nutch architecture was rewritten to take advantage of MapReduce, and a new data storage system called NutchDistributedFileSystem (NDFS) was implemented.

This new architecture allowed Nutch to be run on a large cluster of machines, making the architecture scalable both in processing power and data storage. Later both the MapReduce execution framework and NDFS were promoted to a top level Apache project called Apache Hadoop. Hadoop solves much of the maintenance issues Nutch had in the pre-MapReduce era. After setting up a Hadoop cluster you can control Nutch crawling from one single machine despite the size of the cluster you are running Nutch on. Nutch as it exists today is still pretty much an application that helps you to build a generic web search engine. It supports fetching content with various protocols such as HTTP, HTTPS, FTP and local file system. Nutch can also extract textual content from several document formats like HTML, RSS, ATOM, PDF, ms formats (doc, excel, ppt), etc right out of the box. Nutch is not for everyone as some quite low level skills are still required to run a crawl and maintain the indexes.

Apache Nutch Internals

Running Nutch consists of executing several commands from the shell in sequence. Those commands each launch one or more MapReduce jobs that are executed by Hadoop. Next I will walk you through some of the most important commands and how they are constructed as MapReduce jobs and tasks. I’ll use Python-like pseudocode to describe the functions in simplified examples for illustration purposes, but this should still capture the most essential aspects of each job. Before going into the individual commands,the core terms and components that I will refer to later in this article need to be explained. The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about. In Hadoop terms it’s a Sequence file (meaning all records are stored in sequential manner) consisting of tuples of URL and CrawlDatum. Many other data structures in Nutch are of similar structure and no relational databases are used. The rationale behind this kind of data structure is scalability.

The model of simple, flat data storage works well in a distributed environment. Each node gets a one or more split of the whole data and operates on that (The Map phase of MapReduce). Data can be stored inside the Hadoop Distributed File System so nodes can access the splits from the nearest host that contains a replica of it. Operations (like inserts, deletes and updates) in Crawl Database and other data are processed in batch mode. Here is an example of the contents of crawldb:

http://www.example.com/page1.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
http://www.example.com/page2.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
http://www.example.com/page3.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...

The Fetch List is a data structure (SequenceFile, URL->Crawldatum) that contains the URL, crawldatum tuples that are going to be fetched in one batch, usually the Fetch list contents are a subset of CrawlDB that was created by the generate command.

FetchList is stored inside Segment. Segment is basically a folder containing all the data related to one fetching batch. Besides the Fetch List, the fetched content itself will be stored there in addition to the extracted plain text version of the content, anchor texts and URLs of outlinks, protocol and document level metadata etc.

The Link Database is a data structure (Sequence file, URL -> Inlinks) that contains all inverted links. In the parsing phase Nutch can extract outlinks from a document and store them in format source url -> target_url,anchor_text. In the process of inversion we invert the order and combine all instances making the data records in Link Database look like: targetURL -> anchortext[] text so we can use that information later when individual documents are indexed.

Inject

The Inject command in Nutch has one responsibility: inject more URLs into Crawl Database. Normally you should collect a set of URLs to add and then process them in one batch to keep the time of a single insert small.

Generate

The Generate command in Nutch is used to generate a list of URLs to fetch from Crawl Database URLs with the highest scores are preferred.

Fetch

Fetcher is responsible for fetching content from URLs and writing them to disk. It also optionally parses the content. URLs are read from a Fetch List generated by Generator.

Parse

Parser reads raw fetched content, parses it and stores the results.

Invert links

Inverts link information so we can use anchor texts from other documents that point to a document together with the rest of the document data.

Final words

I have now gone through the core MapReduce jobs in Nutch that are related to crawling. There are also many other data processing jobs in Nutch like indexing, page scoring calculation, removing duplicate content from index etc. The best way to get familiar with the remaining algorithms to look at the following resources (the link is in in references). In the next article of this series I will take a look at other open source crawlers.

Resources

Introduction
to Information Retrieval has a section on crawling

Wikipedia
article on crawling

A Standard for
Robot Exclusion Protocol

Partial
list of open source crawlers implemented in Java

Nutch
source code

Google
MapReduce Paper

Apache Hadoop

About the author

Lucid Imagination is the commercial company dedicated to Apache Lucene technology. The company provides value-added software, documentation, commercial-grade support, training, high-level consulting, and free certified distributions, for Lucene and Solr. Lucid Imagination’s goal is to serve as a central resource for the entire Lucene community and marketplace, to make enterprise search application developers more productive. Customers include AT&T, Sears, Ford, Verizon, Elsevier, Zappos, The Motley Fool, Macy’s, Cisco and many other household names. To learn more please visit http://www.lucidimagination.com