5. Web Crawler Writeup
Theory:
Web Crawler
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated
manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots,
or Web scutters.
ARCHITECTURE OF A WEB CRAWLER
1. URL Frontier: Contains the URLs to be fetched in the current crawl. At first, a seed set is stored in the URL
Frontier, and the crawler begins by taking a URL from the seed set.
2. DNS: Domain Name System resolution. Looks up the IP address for a domain name.
3. Fetch: Generally uses the HTTP protocol to fetch the URL.
4. Parse: The page is parsed. Text, media (images, videos, etc.) and links are extracted.
5. Content Seen: Tests whether a web page with the same content has already been seen at another
URL. This requires a way to compute a fingerprint of a web page (a simple fingerprinting and URL-normalization sketch follows this list).
6. URL Filter: Decides whether an extracted URL should be excluded from the frontier (e.g. because of robots.txt).
URLs should also be normalized: relative links are resolved against the page they appear on.
For example, on en.wikipedia.org/wiki/Main_Page the link
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
normalizes to en.wikipedia.org/wiki/Wikipedia:General_disclaimer.
7. Dup URL Elim: The URL is checked against the set of URLs already in the frontier or already crawled, so duplicates are eliminated.
8. Other issues:
Housekeeping tasks:
Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds).
Checkpointing: a snapshot of the crawler's state (the URL frontier) is committed to disk
(every few hours).
Priority of URLs in the URL frontier: determined by change rate and quality.
Politeness: avoid repeated fetch requests to a host within a short time span; otherwise the crawler
may get blocked (a per-host politeness sketch appears after the architecture figure below).
[Figure: Web crawler architecture. www -> Fetch -> Parse -> Content Seen? -> URL Filter -> Dup URL Elim -> URL Frontier, with DNS, doc fingerprints, robots templates, and the URL set as supporting data stores.]
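The politeness rule from item 8 is typically enforced per host: a URL is only handed out for fetching if enough time has passed since the last request to that host. A minimal sketch, assuming an in-memory map of last-fetch times (the class name PolitenessGate and the 2-second gap are illustrative choices, not taken from any particular crawler):

import java.net.URI;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-host politeness: wait at least a fixed gap between fetches to the same host. */
public class PolitenessGate {

    private final Duration minGap;                         // minimum gap between fetches per host
    private final Map<String, Instant> lastFetch = new HashMap<>();

    PolitenessGate(Duration minGap) {
        this.minGap = minGap;
    }

    /** Returns true if the URL's host may be fetched now, and records the fetch time if so. */
    synchronized boolean mayFetch(String url) {
        String host = URI.create(url).getHost();
        Instant now = Instant.now();
        Instant last = lastFetch.get(host);
        if (last != null && Duration.between(last, now).compareTo(minGap) < 0) {
            return false;                                  // too soon: defer this URL in the frontier
        }
        lastFetch.put(host, now);
        return true;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate(Duration.ofSeconds(2));
        System.out.println(gate.mayFetch("https://example.com/a")); // true
        System.out.println(gate.mayFetch("https://example.com/b")); // false: same host, too soon
        System.out.println(gate.mayFetch("https://example.org/c")); // true: different host
    }
}

A full-scale crawler usually keeps per-host queues inside the frontier so that deferred URLs wait their turn instead of being polled repeatedly.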
Basic crawling algorithm (pseudocode):
While the to-search list is not empty
{
    Take a URL from the to-search list and add it to the already-searched list.
    See whether there's a robots.txt file at this site that includes a "Disallow" statement.
    (If so, break out of the loop, back to "While".)
    Try to "open" the URL (that is, retrieve that document from the Web).
    If it's not an HTML file, break out of the loop, back to "While".
    Step through the HTML file. While the HTML text contains another link,
    {
        Validate the link's URL and make sure robots are allowed (just as in the outer loop).
        If it's an HTML file,
            If the URL isn't present in either the to-search list or the already-searched list,
                add it to the to-search list.
        Else if it's the type of file the user requested,
            Add it to the list of files found.
    }
}
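As a rough illustration of the algorithm above, here is a minimal single-threaded crawler sketch using only the Java standard library (java.net.http.HttpClient, Java 11+). The seed URL https://example.com/, the 50-page budget and the 1-second delay are placeholder assumptions, and the robots.txt check and file-type handling from the pseudocode are omitted for brevity.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal single-threaded crawler: URL frontier + seen set + fetch + link extraction. */
public class SimpleCrawler {

    private static final Pattern LINK = Pattern.compile("href\\s*=\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();   // the "to-search" list (URL frontier)
        Set<String> seen = new HashSet<>();            // the "already-searched" list
        frontier.add("https://example.com/");          // seed set (placeholder URL)
        seen.addAll(frontier);

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        int fetched = 0;
        while (!frontier.isEmpty() && fetched < 50) {  // small page budget for the demo
            String url = frontier.poll();
            HttpResponse<String> resp;
            try {
                resp = client.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                                   HttpResponse.BodyHandlers.ofString());
            } catch (Exception e) {
                continue;                              // skip URLs that fail to fetch
            }
            fetched++;

            // Only HTML pages are parsed for further links.
            String type = resp.headers().firstValue("Content-Type").orElse("");
            if (resp.statusCode() != 200 || !type.contains("text/html")) continue;

            // Extract href links and resolve relative URLs against the current page.
            Matcher m = LINK.matcher(resp.body());
            while (m.find()) {
                try {
                    String link = URI.create(url).resolve(m.group(1).trim()).toString();
                    if (link.startsWith("http") && seen.add(link)) {
                        frontier.add(link);            // new URL enters the frontier
                    }
                } catch (IllegalArgumentException badHref) {
                    // ignore malformed hrefs
                }
            }
            System.out.println("Crawled " + url + " (frontier size: " + frontier.size() + ")");
            Thread.sleep(1000);                        // crude politeness delay between requests
        }
    }
}

Because the frontier is a FIFO queue, this performs a breadth-first crawl; prioritizing URLs by change rate and quality, as noted in item 8, would replace the queue with a priority queue.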
NUTCH:
Apache Nutch is an open source Web crawler written in Java. Nutch is coded entirely in the Java
programming language, but data is written in language-independent formats. It has a highly modular
architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and
clustering. Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the
multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a
MapReduce facility and a distributed file system. The two facilities were later spun out into their own
subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of
Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of
the Apache Software Foundation. Using Nutch, we can find Web page hyperlinks in an automated manner,
reduce maintenance work (for example, checking for broken links), and create a copy of all the visited
pages for searching over.
Conclusion:
Thus, we successfully implemented a web crawler in Java.