Setting Up and Starting the Crawl

This section describes how to configure the initial crawl of your content files. For more information on the crawl and how to configure it, see Administering Crawl for Web and File Share Content.

To set up and start the crawl:

1. In the left-hand menu, click Crawl and Index > Crawl URLs.

2. In the Start Crawling from the Following URLs field, type one or more start URLs.

3. For initial setup and testing, enter a start URL that does not require a login or user authentication.

Start URLs must be fully qualified URLs, in the following format:

protocol://host[:port]/[path]/

The parts in square brackets are optional. For example:

http://dracula:2346/content

4. In the Follow and Crawl Only URLs with the Following Patterns field, copy all start URLs from the Start Crawling from the Following URLs field.

· If you enter the URL pattern for a directory, the URL must terminate in a forward slash (/). Use only the server part of the URL.

· If a URL refers to a specific page, only that page is crawled.

For more information on URL patterns, click the Help link or see Administering Crawl for Web and File Share Content.

5. In the Do Not Crawl URLs with the Following Patterns field, review the list of patterns that can be excluded from the crawl.

· Many file formats are excluded from the crawl by default, including common graphic formats such as .jpg.

· If you want a particular format crawled, remove its pattern from the list or comment the pattern out with the comment symbol (#).

· If you do not want a particular document type to be crawled, remove the comment symbol from the corresponding pattern.

For example, to exclude all Microsoft Word files (.doc) from the crawl, remove the # sign in front of “.doc$”.

· You can also add specific URL patterns to this area to prevent the URLs that match the patterns from being crawled.

6. Click Save URLs to Crawl.

7. In the left-hand menu, click Status and Reports > Crawl Status.

8. Click Resume Crawl.

The search appliance starts to crawl the URLs according to the URL patterns you entered.

While the search appliance is crawling content, the graphic on the Crawl Status page shows multicolored balls in motion.

You do not have to pause the crawl before making changes on the Crawl URLs page.
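
The URL rules in the procedure above can be checked with a short Python sketch before you enter the values. This is an illustration only, not the search appliance's own logic. The function names are hypothetical, and pattern matching is reduced to two assumed cases: a pattern that ends in $ matches the end of a URL, and any other pattern is treated as a plain substring.

```python
from urllib.parse import urlsplit


def is_fully_qualified(url):
    """Check for the shape protocol://host[:port]/[path] (hypothetical helper)."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.hostname and parts.path.startswith("/"))


def active_patterns(lines):
    """Keep only patterns that are not blank and not commented out with '#'."""
    stripped = (line.strip() for line in lines)
    return [p for p in stripped if p and not p.startswith("#")]


def is_blocked(url, patterns):
    """Simplified matching (an assumption, not the appliance's full syntax):
    'text$' means the URL ends with 'text'; anything else is a substring test."""
    for pattern in patterns:
        if pattern.endswith("$"):
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            return True
    return False


# Start URLs: the first is fully qualified, the second lacks a protocol.
print(is_fully_qualified("http://dracula:2346/content"))  # True
print(is_fully_qualified("dracula/content"))              # False

# Do Not Crawl list: ".doc$" is commented out, so .doc files are still crawled.
do_not_crawl = active_patterns([".jpg$", "# .doc$"])
print(is_blocked("http://dracula:2346/content/logo.jpg", do_not_crawl))    # True
print(is_blocked("http://dracula:2346/content/report.doc", do_not_crawl))  # False
```

Removing the # from the ".doc$" line in the example list would make the last call return True, which mirrors the effect of removing the comment symbol in step 5.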