
Assignment no.: 4

Title: WEB CRAWLER

Aim: To implement a simple Web Crawler in Java.

Objective: To study the working of Web Crawler.

Theory:

Web Crawler

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web
spiders, Web robots, or Web scutters.

 This process is called Web crawling or spidering. Many sites, in particular search engines, use
spidering as a means of providing up-to-date data.
 Web crawlers are mainly used to create a copy of all the visited pages for later processing by a
search engine that will index the downloaded pages to provide fast searches.
 Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or
validating HTML code.
 Also, crawlers can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses (usually for sending spam).
 A Web crawler is one type of bot, or software agent.
 In general, it starts with a list of URLs to visit, called the seeds.
 As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list
of URLs to visit, called the crawl frontier.
 URLs from the frontier are recursively visited according to a set of policies.
 A crawler must carefully choose at each step which pages to visit next.
 There are important characteristics of the Web that make crawling very difficult:
1. Its large volume,
2. Its fast rate of change,
3. Dynamic page generation.
 The large volume implies that the crawler can download only a limited number of Web pages
within a given time, so it needs to prioritize its downloads. The high rate of change implies that,
by the time the crawler reaches them, pages might already have been updated or even deleted.
 The number of possible crawlable URLs generated by server-side software has also made it
difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP
GET (URL-based) parameters exist, of which only a small selection will actually return unique
content.
 The behavior of a Web crawler is the outcome of a combination of policies:
 A selection policy that states which pages to download.
As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the
downloaded fraction contains the most relevant pages and not just a random sample of the Web.
This requires a metric of importance for prioritizing Web pages. The importance of a page is a
function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL. It
must work with partial information, as the complete set of Web pages is not known during
crawling.
 A re-visit policy that states when to check for changes to the pages.
The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or
months. By the time a Web crawler has finished its crawl, many events could have happened,
including creations, updates and deletions. From the search engine's point of view, there is a cost
associated with not detecting an event, and thus having an outdated copy of a resource. The
most-used cost functions are freshness and age. The objective of the crawler is to keep the
average freshness of pages in its collection as high as possible, or to keep the average age of
pages as low as possible. Two simple re-visiting policies are:
Uniform policy: This involves re-visiting all pages in the collection with the same frequency,
regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change more
frequently.
 A politeness policy that states how to avoid overloading Web sites.
Crawlers can retrieve data much more quickly and in greater depth than human users. If a single
crawler performs multiple requests per second and/or downloads large files, a server can have a
hard time keeping up with requests from multiple crawlers. (A minimal sketch illustrating the
frontier, duplicate elimination and politeness is given after this list.)
 A parallelization policy that states how to coordinate distributed Web crawlers.
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize
the download rate while minimizing the overhead from parallelization and to avoid repeated
downloads of the same page. To avoid downloading the same page more than once, the crawling
system requires a policy for assigning the new URLs discovered during the crawling process, as
the same URL can be found by two different crawling processes.
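To make the frontier, duplicate-elimination and politeness ideas above concrete, the following is a minimal Java sketch (the class name CrawlFrontier and the one-second delay are illustrative assumptions, not taken from the text): it keeps a FIFO queue of URLs to visit, a set of already-seen URLs, and a per-host timestamp used to space out requests to the same host.

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Queue;
    import java.util.Set;

    // Illustrative frontier: FIFO selection, de-duplication, and per-host politeness delay.
    public class CrawlFrontier {
        private static final long POLITENESS_DELAY_MS = 1000; // assumed: 1 request/second per host

        private final Queue<String> toVisit = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        private final Map<String, Long> lastFetchPerHost = new HashMap<>();

        // Add a URL only if it has not been seen before (duplicate elimination).
        public void add(String url) {
            if (seen.add(url)) {
                toVisit.add(url);
            }
        }

        public boolean isEmpty() {
            return toVisit.isEmpty();
        }

        // Return the next URL, waiting if its host was contacted too recently (politeness).
        public String next() throws InterruptedException {
            String url = toVisit.poll();
            if (url == null) {
                return null;
            }
            String host = URI.create(url).getHost();
            if (host != null) {
                long last = lastFetchPerHost.getOrDefault(host, 0L);
                long wait = POLITENESS_DELAY_MS - (System.currentTimeMillis() - last);
                if (wait > 0) {
                    Thread.sleep(wait);
                }
                lastFetchPerHost.put(host, System.currentTimeMillis());
            }
            return url;
        }
    }

A real crawler would replace the FIFO ordering with a priority derived from the selection policy (for example, estimated page importance or change rate), but the queue / seen-set / per-host-timestamp structure stays the same.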
Common Uses
a. Essentially a web crawler may be used by anyone seeking to collect information out on the Internet.
b. Search engines frequently use web crawlers to collect information about what is available on public
web pages. Their primary purpose is to collect data so that when Internet surfers enter a search term
on their site, they can quickly provide the surfer with relevant web sites.
c. Linguists may use a web crawler to perform a textual analysis; that is, they may comb the Internet to
determine what words are commonly used today.
d. Market researchers may use a web crawler to determine and assess trends in a given market.

Different Web Crawlers available


 Bingbot is the name of Microsoft's Bing WebCrawler.
 World Wide Web Worm, a crawler used to build a simple index of document titles and URLs.
 WebRACE is a crawling and caching module implemented in Java, and used as a part of a more
generic system called eRACE. The system receives requests from users for downloading web pages,
so the crawler acts in part as a smart proxy server. The system also handles requests for
"subscriptions" to Web pages that must be monitored: when the pages change, they must be
downloaded by the crawler and the subscriber must be notified.

ARCHITECTURE OF WEB-CRAWLER
1. URL Frontier: Contains URLs to be fetched in the current crawl. At first, a seed set is stored in URL
Frontier, and a crawler begins by taking a URL from the seed set.
2. DNS: Domain Name System resolution. Look up the IP address for each domain name.
3. Fetch: Generally use the HTTP protocol to fetch the URL.
4. Parse: The page is parsed, and its text and links are extracted (other content such as images and
videos may also be recorded).
5. Content Seen: Tests whether a web page with the same content has already been seen at another
URL. This requires a way to compute a fingerprint (e.g., a hash) of the page content.
6. URL Filter: Decides whether an extracted URL should be excluded from the frontier (e.g., because
it is disallowed by robots.txt). URLs should also be normalized, since many links are relative.
For example, on en.wikipedia.org/wiki/Main_Page the link
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
must be resolved against the base URL before it can be added to the frontier (see the sketch after
the architecture figure).
7. Dup URL Elim: The URL is checked against the set of URLs already known, so that duplicates are
eliminated.
8. Other issues:
 Housekeeping tasks:
 Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds).
 Checkpointing: a snapshot of the crawler’s state (the URL frontier) is committed to disk
(every few hours).
 Priority of URLs in the URL frontier: based on change rate and quality.
 Politeness: avoid repeated fetch requests to a host within a short time span; otherwise the
crawler may be blocked.

Fig: Web crawler architecture – www → Fetch (with DNS lookup) → Parse → Content Seen? (doc
fingerprints) → URL Filter (robots templates) → Dup URL Elim (URL set) → URL Frontier → back to Fetch.
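To make steps 5–7 concrete, here is a small Java sketch (the class and method names CrawlFilters, normalize, isContentSeen and isNewUrl are illustrative): it resolves a relative link such as the Wikipedia example above against its base URL with java.net.URI, and uses a SHA-256 hash of the page text as a simple content fingerprint.

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative helpers for the "Content Seen?", "URL Filter" and "Dup URL Elim" steps.
    public class CrawlFilters {
        private final Set<String> contentFingerprints = new HashSet<>();
        private final Set<String> knownUrls = new HashSet<>();

        // Step 6 (normalization): resolve a possibly relative href against the base page URL.
        public static String normalize(String baseUrl, String href) {
            // e.g. normalize("https://en.wikipedia.org/wiki/Main_Page",
            //                "/wiki/Wikipedia:General_disclaimer")
            //  -> "https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer"
            return URI.create(baseUrl).resolve(href).normalize().toString();
        }

        // Step 5 (content seen?): fingerprint the page text and check for an earlier copy.
        public boolean isContentSeen(String pageText) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(pageText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // add() returns false if this fingerprint was already present
            return !contentFingerprints.add(hex.toString());
        }

        // Step 7 (dup URL elimination): only admit URLs not seen before.
        public boolean isNewUrl(String normalizedUrl) {
            return knownUrls.add(normalizedUrl);
        }
    }

The robots.txt check that is also part of the URL Filter step is omitted here; it would require fetching and parsing the site's robots.txt file.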

How Web Crawlers Work?


 When a search engine's web crawler visits a web page, it "reads" the visible text, the hyperlinks, and
the content of the various tags used in the site.
 The crawler starts by parsing a specified web page, noting any hypertext links on that page that point
to other web pages. It then parses those pages for new links, and so on, recursively.
 Using the information gathered from the crawler, a search engine will then determine what the site is
about and index the information. The website is then included in the search engine's database and its
page ranking process.
 A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to
other machines on the Internet, just as a web browser does when the user clicks on links. All the
crawler really does is to automate the process of following links.
 A web crawler may operate one time only. If its purpose is long-term, as is the case with search
engines, it may be programmed to comb through the Internet periodically to determine whether there
have been any significant changes. If a site is experiencing heavy traffic or technical difficulties, the
crawler may be programmed to note that and revisit the site later.
Fig: Flow of a sequential basic crawler
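As an illustration of the fetch-and-parse step just described, here is a minimal sketch using only the JDK (the class name PageFetcher and the regex-based link extraction are illustrative simplifications; a production crawler would use a proper HTML parser and handle redirects, timeouts, and character encodings):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative fetcher: download a page over HTTP and pull out its href values.
    public class PageFetcher {
        // Very rough pattern for href="..."; a real crawler should use an HTML parser.
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        // Fetch the page body as text, just as a browser does when a link is clicked.
        public static String fetch(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", "SimpleCrawler/1.0"); // identify the bot
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        // Extract raw link targets from the HTML; they still need to be normalized.
        public static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }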
Pseudo Code for a Web Crawler
Get the user's input: the starting URL and the desired file type.
Add the URL to the currently empty list of URLs to search.
While the list of URLs to search is not empty,
{
    Get the first URL in the list.
    Move the URL to the list of URLs already searched.
    Check the URL to make sure its protocol is HTTP
        (If not, break out of the loop, back to "While").

    See whether there's a robots.txt file at this site that includes a "Disallow" statement.
        (If so, break out of the loop, back to "While".)

    Try to "open" the URL (that is, retrieve that document from the Web).
    If it's not an HTML file, break out of the loop, back to "While".
    Step through the HTML file. While the HTML text contains another link,
    {
        Validate the link's URL and make sure robots are allowed (just as in the outer loop).
        If it's an HTML file,
            If the URL isn't present in either the to-search list or the already-searched list,
                add it to the to-search list.
        Else if it's the type of the file the user requested,
            Add it to the list of files found.
    }
}
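The pseudo code above can be turned into a compact Java program. The following is a minimal sketch under simplifying assumptions (the class name SimpleCrawler, the page limit, and the regex-based link extraction are illustrative; the robots.txt check and the user's file-type filter from the pseudo code are left out for brevity):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URI;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative breadth-first crawler following the pseudo code above.
    // robots.txt checking and the "desired file type" handling are omitted for brevity.
    public class SimpleCrawler {
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"#]+)\"", Pattern.CASE_INSENSITIVE);
        private static final int MAX_PAGES = 20; // keep the demo small

        public static void main(String[] args) throws Exception {
            String seed = args.length > 0 ? args[0] : "https://example.com/"; // starting URL
            Queue<String> toSearch = new ArrayDeque<>(); // list of URLs to search
            Set<String> searched = new HashSet<>();      // list of URLs already searched
            toSearch.add(seed);

            while (!toSearch.isEmpty() && searched.size() < MAX_PAGES) {
                String url = toSearch.poll();            // get the first URL in the list
                if (!searched.add(url)) continue;        // move it to the already-searched list
                if (!url.startsWith("http")) continue;   // make sure its protocol is HTTP(S)
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                    conn.setRequestProperty("User-Agent", "SimpleCrawler/1.0");
                    String type = conn.getContentType();
                    if (type == null || !type.contains("text/html")) continue; // only parse HTML

                    StringBuilder html = new StringBuilder();
                    try (BufferedReader in = new BufferedReader(
                            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                        String line;
                        while ((line = in.readLine()) != null) html.append(line).append('\n');
                    }
                    System.out.println("Crawled: " + url);

                    // Step through the HTML; normalize each link and queue it if unseen.
                    Matcher m = HREF.matcher(html);
                    while (m.find()) {
                        try {
                            String link = URI.create(url).resolve(m.group(1)).normalize().toString();
                            if (link.startsWith("http") && !searched.contains(link)
                                    && !toSearch.contains(link)) {
                                toSearch.add(link);
                            }
                        } catch (IllegalArgumentException ignored) {
                            // skip links that are not valid URIs
                        }
                    }
                } catch (Exception e) {
                    System.out.println("Skipping " + url + ": " + e.getMessage());
                }
            }
        }
    }

Compile with javac SimpleCrawler.java and run with java SimpleCrawler <seed-url>.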
NUTCH:

Apache Nutch is an open source Web crawler written in Java. Nutch is coded entirely in the Java
programming language, but data is written in language-independent formats. It has a highly modular
architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and
clustering. Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the
multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a
MapReduce facility and a distributed file system. The two facilities have been spun out into their own
subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of
Lucene in June of that same year. Since April 2010, Nutch has been considered an independent, top-level
project of the Apache Software Foundation. By using it, we can find Web page hyperlinks in an
automated manner, reduce a lot of maintenance work (for example, checking broken links), and create a
copy of all the visited pages for searching over.

Advantages of Nutch over a simple fetcher include:

 a highly scalable and relatively feature-rich crawler
 features like politeness, which obeys robots.txt rules
 robustness and scalability - you can run Nutch on a cluster of 100 machines
 quality - you can bias the crawling to fetch “important” pages first

Conclusion:
Thus we have successfully implemented a simple web crawler in Java.
