Tuesday, July 10, 2007

The Yahoo SLURP Crawler - The Robot

SLURP crawls websites, scans their content and meta tags, and follows the links on each page. It then brings that information back for the search engine to index. Yahoo SLURP 2.0 stores the full text of each page it crawls and returns it to Yahoo’s searchable database. This is one of SLURP's more distinctive features; not all search engine crawlers store the entire text of the pages they crawl.

While SLURP has some features of its own, it also obeys the robots.txt file. This file is very important because it gives you control over which pages the crawler visits and indexes. You can use it to keep crawlers away from pages you would rather not see in search results, or pages you don’t want indexed at all (for whatever reason). Bear in mind that robots.txt is only a request to well-behaved crawlers, and the file itself is publicly readable, so it does not secure anything. Pages that are truly sensitive need real access control, such as password protection.

Another good thing about the robots.txt file is that it lets you exclude specific robots, so you can block Googlebot from a particular page while still letting SLURP crawl it. This is useful if you have optimized different pages for separate search engines. Separate pages give you flexibility, but a search engine that sees both versions may decide you have duplicate pages and penalize you. Careful use of the robots.txt file should therefore be on your list of ways to make your website more search engine friendly. So how do you use the robots.txt file? Open Notepad and type in the following lines:

  User-agent: Slurp
  Disallow: /whatsisname.html
  Disallow: /page_optimized_for_google.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html

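Save it as robots.txt and upload it to the root directory of your site. Each Disallow path starts with a slash because it is matched against the URL path from the site root. You can disallow as many pages for each crawler as you want, but to disallow pages for another crawler, you start a new group with its own User-agent line.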

  User-agent: Slurp
  Disallow: /whatsisname.html
  Disallow: /page_optimized_for_google.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html

  User-agent: Googlebot
  Disallow: /page_optimized_for_yahoo.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html
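
If you want to check that a file like this does what you expect before uploading it, one option is Python's standard urllib.robotparser module. The sketch below is only an illustration: www.example.com is a placeholder domain, the helper names build_parser and allowed are made up for this example, and the paths are written with the leading slash that robots.txt rules need in order to match a URL path. Python's parser is just one implementation of the rules, so real crawlers such as SLURP or Googlebot may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# The two-crawler file from the post, with each path starting at the site root.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /whatsisname.html
Disallow: /page_optimized_for_google.html
Disallow: /credit_card_list.html
Disallow: /whatnot.html

User-agent: Googlebot
Disallow: /page_optimized_for_yahoo.html
Disallow: /credit_card_list.html
Disallow: /whatnot.html
"""

SITE = "http://www.example.com"  # placeholder domain, not a real site


def build_parser(robots_text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, agent, page):
    """Return True if `agent` may fetch `page` according to the parsed rules."""
    return parser.can_fetch(agent, SITE + "/" + page)


parser = build_parser(ROBOTS_TXT)

for agent in ("Slurp", "Googlebot"):
    for page in ("page_optimized_for_google.html",
                 "page_optimized_for_yahoo.html",
                 "credit_card_list.html"):
        verdict = "allowed" if allowed(parser, agent, page) else "blocked"
        print(agent, page, verdict)
```

With these rules, the Google-optimized page is blocked for Slurp but allowed for Googlebot, and the reverse holds for the Yahoo-optimized page, which is the split the file is meant to produce.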

If you want your rules to apply to all crawlers, replace the name of the user agent with the wildcard (*).
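
As a rough check of the wildcard, the sketch below feeds a wildcard group to Python's standard urllib.robotparser and asks about a few crawler names. The domain www.example.com, the helper make_parser and the name SomeOtherBot are placeholders invented for this example, and other parsers may behave differently in edge cases.

```python
from urllib.robotparser import RobotFileParser

# One group for every crawler: * stands in for the user agent name.
WILDCARD_RULES = """\
User-agent: *
Disallow: /credit_card_list.html
Disallow: /whatnot.html
"""


def make_parser(robots_text):
    """Parse robots.txt text held in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


wildcard_parser = make_parser(WILDCARD_RULES)

BLOCKED_PAGE = "http://www.example.com/whatnot.html"  # placeholder URL
OPEN_PAGE = "http://www.example.com/index.html"       # placeholder URL

# Every crawler name gets the same answer, because none has a group of its own.
results = {
    agent: (wildcard_parser.can_fetch(agent, BLOCKED_PAGE),
            wildcard_parser.can_fetch(agent, OPEN_PAGE))
    for agent in ("Slurp", "Googlebot", "SomeOtherBot")
}
print(results)
```

Each crawler is refused the disallowed page and allowed the page that is not listed.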

A well-kept robots.txt file helps you avoid search engine penalties, and it can also be used to spot crawlers when they come calling. Ordinary visitors' browsers do not ask for robots.txt, so requests for it almost always come from crawlers, and these requests show up in the server logs.
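
To pick those requests out of a log, something like the following sketch could work. It assumes your server writes the common or combined access log format, which you should confirm for your own setup. The sample lines, IP addresses and the ExampleCrawler name are invented purely for illustration, and robots_txt_requests is a made-up helper name.

```python
import re

# Assumed layout: common/combined log format, e.g.
# ip - - [time] "METHOD path PROTOCOL" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def robots_txt_requests(lines):
    """Return (ip, time, user_agent) for every request to /robots.txt."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if path == "/robots.txt":
            hits.append((match.group("ip"),
                         match.group("time"),
                         match.group("agent") or ""))
    return hits


# Invented sample data, not taken from a real server.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jul/2007:06:25:14 +0000] '
    '"GET /robots.txt HTTP/1.0" 200 214 "-" "ExampleCrawler/1.0"',
    '198.51.100.7 - - [10/Jul/2007:06:26:02 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = robots_txt_requests(SAMPLE_LOG)
for ip, time, agent in hits:
    print(ip, time, agent)
```

Only the first sample line is reported, since the second is an ordinary page request. A user agent string can be faked, so treat the result as a hint about who is crawling rather than proof.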
