The Ultra Search crawler is a powerful tool for discovering information on Web sites in an organization's intranet. This consideration is especially relevant to Web crawling; the other data source types are well defined, so the crawler does not follow links to documents that you may not be aware of.

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you will have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.
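For illustration only, the following sketch shows the idea behind such a discovery pass: fetch one well-known seed page and collect the hosts that its links point to, without indexing anything. This is not an Ultra Search API; the seed URL and the crude link-extraction regex are assumptions made for the example.

```java
// Illustration only -- not an Ultra Search API. Fetch a hypothetical seed page
// and record the hosts its links point to, without indexing anything.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostDiscovery {
    // Crude link extraction; a real crawler would parse the HTML properly.
    private static final Pattern HREF = Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    public static void main(String[] args) throws Exception {
        String seed = "http://portal.mycompany.com/";   // hypothetical well-known site
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(seed)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        Set<String> hosts = new TreeSet<>();
        Matcher m = HREF.matcher(response.body());
        while (m.find()) {
            try {
                String host = URI.create(m.group(1)).getHost();
                if (host != null) {
                    hosts.add(host);                     // keep only the host name
                }
            } catch (IllegalArgumentException ignored) {
                // skip malformed links
            }
        }
        // Each discovered host is a candidate for its own Web source.
        hosts.forEach(System.out::println);
    }
}
```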
However, in reality, the process of discovering and crawling your organization's intranet is an iterative one, characterized by periodic analysis of crawling results and modification of crawling parameters to direct the crawling process.
For example, if you observe that the crawler is spending days crawling one Web host, you might want to exclude that host from crawling or limit the crawling depth.
You can monitor the crawling process by using a combination of methods, including the crawler statistics described below.
URL looping refers to the scenario where, for some reason, a large number of unique URLs all point to the same document. Although the document is never indexed more than once, it still must be retrieved from the Web server each time for analysis.

One particularly difficult situation is a site that contains a large number of pages, where each page contains links to every other page in the site. Ordinarily this would not be a problem, because the crawler eventually completes its analysis of all documents in the site.
However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document. For example, http://mycompany.com/somedocument.html?p_origin_page=10 might refer to the same document as http://mycompany.com/somedocument.html?p_origin_page=13, but with a different p_origin_page value, because the referring pages are different. If a large number of parameters are specified and the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This scenario is one example of how URL looping can occur.
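To make the effect concrete, the following sketch (illustrative only, not Ultra Search code) normalizes such URLs by stripping the p_origin_page parameter from the example above, so that two URLs differing only in that tracking value compare as the same document. Whether a given crawler performs this kind of normalization depends on its implementation; the remedies described below work at the crawl-configuration level instead.

```java
// Illustration only: shows how two URLs that differ only by a tracking
// parameter (the hypothetical p_origin_page) can refer to the same document.
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.stream.Collectors;

public class UrlLoopingDemo {
    // Strip the p_origin_page tracking parameter so that logically identical
    // URLs compare as equal.
    static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String query = uri.getQuery();
        if (query != null) {
            query = Arrays.stream(query.split("&"))
                    .filter(p -> !p.startsWith("p_origin_page="))
                    .collect(Collectors.joining("&"));
            if (query.isEmpty()) {
                query = null;
            }
        }
        return new URI(uri.getScheme(), uri.getHost(), uri.getPath(), query, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String a = "http://mycompany.com/somedocument.html?p_origin_page=10";
        String b = "http://mycompany.com/somedocument.html?p_origin_page=13";
        System.out.println(a.equals(b));                       // false: two distinct URLs
        System.out.println(normalize(a).equals(normalize(b))); // true: one document
    }
}
```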
You can monitor the crawler statistics in the Ultra Search administration tool to see which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, you might want to do one of the following:
Note: Be sure to restart the crawler after altering any parameters on the Crawler Page. Your changes take effect only after the crawler is restarted.
- Exclude the Web server
Excluding the Web server prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)

- Reduce the crawling depth
Reducing the crawling depth limits the number of levels of links the crawler follows. If you observe URL looping effects on a particular host, take a visual survey of the site to estimate the depth of its leaf pages. Leaf pages are pages that do not have links to other pages. As a general guideline, add three to the leaf page depth and set the crawling depth to this value. For example, if the deepest leaf pages are five links from the site's home page, set the crawling depth to eight. (A depth-limited crawl is sketched below.)
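The following sketch (illustrative only, not Ultra Search code) simulates a looping site in which every generated link carries a fresh p_origin_page value. Because each URL is unique, a visited-URL check alone would never let the crawl finish, but a depth limit bounds the number of link levels followed and so bounds the crawl.

```java
// Illustration only -- not Ultra Search code. Every link on a page carries a
// freshly generated tracking value, so all generated URLs are unique even
// though they refer to the same two documents.
import java.util.ArrayDeque;
import java.util.Queue;

public class DepthLimitDemo {
    static String[] linksOn(int pageId) {
        return new String[] {
                "http://mycompany.com/a.html?p_origin_page=" + pageId,
                "http://mycompany.com/b.html?p_origin_page=" + pageId };
    }

    public static void main(String[] args) {
        int depthLimit = 5;                    // e.g. leaf depth 2 plus 3, per the guideline
        record Item(String url, int depth) {}
        Queue<Item> queue = new ArrayDeque<>();
        queue.add(new Item("http://mycompany.com/index.html", 0));
        int fetched = 0;
        int pageId = 0;

        while (!queue.isEmpty()) {
            Item item = queue.remove();
            fetched++;                         // every URL is unique, so none is skipped as visited
            if (item.depth() == depthLimit) {
                continue;                      // do not follow links beyond the depth limit
            }
            for (String link : linksOn(++pageId)) {
                queue.add(new Item(link, item.depth() + 1));
            }
        }
        System.out.println("URLs fetched with depth limit " + depthLimit + ": " + fetched);
        // Without the depth limit, this loop would generate new unique URLs forever.
    }
}
```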