
Tuesday, July 10, 2007

How Search Engines Work (and Sometimes Don’t)

You know how important it is to score high in the SERPs. But your site isn't reaching the first three pages, and you don't understand why. It could be that you're confusing the web crawlers that are trying to index it. How can you find out? Keep reading.

You have a masterful website, with lots of relevant content, but it isn’t coming up high in the search engine results pages (SERPs). You know that if your site isn’t on those early pages, searchers probably won’t find you. You can’t understand why you’re apparently invisible to Google and the other major search engines. Your rivals hold higher spots in the SERPs, and their sites aren’t nearly as nice as yours.

Search engines aren’t people. In order to handle the tens of billions of web pages that comprise the World Wide Web, search engine companies have almost completely automated their processes. A software program isn’t going to look at your site with the same “eyes” as a human being. This doesn’t mean that you can’t have a website that is a joy to behold for your visitors. But it does mean that you need to be aware of the ways in which search engines “see” your site differently, and plan around them.

Despite the complexity of the web and the sheer volume of data they must handle at speed, search engines actually perform a short list of operations in order to return relevant results to their users. Each of these four operations can go wrong in certain ways. It isn’t so much that the search engine itself has failed; it may simply have encountered something it was not programmed to deal with, or the way it was programmed to handle what it encountered led to less than desirable results.

Understanding how search engines operate will help you understand what can go wrong. All search engines perform the following four tasks (a toy sketch of the whole pipeline follows the list):

  • Web crawling. Search engines send out automated programs, sometimes called “bots” or “spiders,” which use the web’s hyperlink structure to “crawl” its pages. According to some of our best estimates, search engine spiders have crawled maybe half of the pages that exist on the Internet.
  • Document indexing. After spiders crawl a page, its content needs to be put into a format that makes it easy to retrieve when a user queries the search engine. Thus, pages are stored in a giant, tightly managed database that makes up the search engine’s index. These indexes contain billions of documents, which are delivered to users in mere fractions of a second.
  • Query processing. When a user queries a search engine, which happens hundreds of millions of times each day, the engine examines its index to find documents that match. Queries that look superficially the same can yield very different results. For example, searching for the phrase “field and stream magazine,” without quotes around it, yields more than four million results in Google. Do the same search with the quote marks, and Google returns only 19,600 results. This is just one of many modifiers a searcher can use to give the database a better idea of what should count as a relevant result.
  • Ranking results. Google isn’t going to show you all 19,600 results on the same page – and even if it did, it needs some way to decide which ones should show up first. Thus, the search engine runs an algorithm on the results to calculate which ones are most relevant to the query. These are shown first, with all the others in descending order of relevance.
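
To make these four steps concrete, here is a deliberately tiny sketch in Python. It is not how any real search engine is built; the toy pages, the one-word-at-a-time index, and the crude scoring rule are all invented for illustration. But it walks through the same crawl, index, query, and rank sequence in miniature.

    from collections import defaultdict

    # A toy "web": page URL -> (text content, list of outgoing links).
    # These pages and their contents are made up purely for illustration.
    PAGES = {
        "/home":    ("field and stream magazine subscriptions", ["/fishing", "/hunting"]),
        "/fishing": ("stream fishing tips for the field", ["/home"]),
        "/hunting": ("hunting season field guide", ["/home", "/fishing"]),
    }

    def crawl(start):
        """1. Web crawling: follow links outward from the start page."""
        seen, queue, docs = set(), [start], {}
        while queue:
            url = queue.pop(0)
            if url in seen or url not in PAGES:
                continue
            seen.add(url)
            text, links = PAGES[url]
            docs[url] = text
            queue.extend(links)
        return docs

    def build_index(docs):
        """2. Document indexing: map each word to the pages containing it."""
        index = defaultdict(set)
        for url, text in docs.items():
            for word in text.split():
                index[word].add(url)
        return index

    def search(index, docs, query):
        """3. Query processing and 4. ranking: find matching pages, then
        order them by a crude relevance score (query-word occurrences)."""
        words = query.lower().split()
        matches = set().union(*(index.get(w, set()) for w in words))
        def score(url):
            page_words = docs[url].split()
            return sum(page_words.count(w) for w in words)
        return sorted(matches, key=score, reverse=True)

    docs = crawl("/home")                       # crawl
    index = build_index(docs)                   # index
    print(search(index, docs, "field stream"))  # query and rank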

Now that you have some idea of the processes involved, it’s time to take a closer look at each one. This should help you understand how things go right, and how and why these tasks can go “wrong.” This article will focus on web crawling, while a later article will cover the remaining processes.

 

You’re probably thinking chiefly of your human visitors when you set up your website’s navigation, as well you should. But certain kinds of navigation structures will trip up spiders, making it less likely for those visitors to find your site in the first place. As an added bonus, many of the changes that make it easier for a spider to find your content will also make it easier for visitors to navigate your site.

It’s worth keeping in mind, by the way, that you might not want spiders to be able to index everything on your site. If you own a site with content that users pay a fee to access, you probably don’t want a Google bot to grab that content and show it to anyone who enters the right keywords. There are ways to deliberately block spiders from such content. In keeping with the rest of this article, which is intended mainly as an introduction, they will only be mentioned briefly here.

Dynamic URLs are one of the biggest stumbling blocks for search engine spiders. In particular, pages with two or more dynamic parameters will give a spider fits. You know a dynamic URL when you see it; it usually has a lot of “garbage” in it such as question marks, equal signs, ampersands (&) and percent signs. These pages are great for human users, who usually get to them by setting certain parameters on a page. For example, typing a zip code into a box at weather.com will return a page that describes the weather for a particular area of the US – and a dynamic URL as the page location.
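
If you are not sure what counts as a dynamic parameter, the short Python snippet below pulls them out of a made-up URL of the kind described above (the address is invented for illustration, not taken from weather.com or any real site), and shows the sort of "static-looking" address that URL-rewriting tools produce instead.

    from urllib.parse import urlparse, parse_qs

    # A made-up dynamic URL: the question mark, equal signs, and ampersand
    # pass two parameters to a server-side script.
    dynamic_url = "http://www.example.com/weather/report.php?zip=90210&units=F"

    parts = urlparse(dynamic_url)
    print(parts.path)              # /weather/report.php
    print(parse_qs(parts.query))   # {'zip': ['90210'], 'units': ['F']}

    # The same page behind a rewritten, static-looking URL that spiders
    # (and people) handle much more happily:
    static_url = "http://www.example.com/weather/90210/"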

There are other ways in which spiders don’t like complexity. For example, a page with more than 100 unique links to other pages on the same site can tire a spider out at a single glance, and it may not follow every link. If you are trying to build a site map, there are better ways to organize it.
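
If you want a rough idea of how many unique same-site links one of your pages carries, a standard-library script along these lines will count them. The example.com address and the three-link sample HTML are stand-ins; feed it the HTML of a real page in practice.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkCounter(HTMLParser):
        """Collect the unique same-site URLs that a page links to."""
        def __init__(self, base_url):
            super().__init__()
            self.base = base_url
            self.site = urlparse(base_url).netloc
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base, href)
                if urlparse(absolute).netloc == self.site:
                    self.links.add(absolute)

    # Tiny made-up page; in practice, feed() a real page's HTML.
    sample = '<a href="/about">A</a> <a href="/contact">B</a> <a href="/about">C</a>'
    counter = LinkCounter("http://www.example.com/")
    counter.feed(sample)
    print(len(counter.links))   # 2 unique internal links; worry as this nears 100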

Pages that are buried more than three clicks from your website’s home page also might not be crawled. Spiders don’t like to go that deep. For that matter, many humans can get “lost” on a website with that many levels of links if there isn’t some kind of navigational guidance.

Pages that require a “Session ID” or cookie to enable navigation also might not be spidered. Spiders aren’t browsers, and don’t have the same capabilities. They may not be able to retain these forms of identification.

Another stumbling block for spiders is pages that are split into “frames.” Many web designers like frames; they allow the page navigation to stay in one place even when a user scrolls through content. But spiders find pages with frames confusing. To them, content is content, and they have no way of knowing which pages should go in the search results. Frankly, many users don’t like pages with frames either; rather than providing a cleaner interface, such pages often look cluttered.

 

Most of the stumbling blocks above are ones you may have accidentally put in the way of spiders. This next set of stumbling blocks includes some that website owners might use on purpose to block a search engine spider. While I mentioned one of the most obvious reasons for blocking a spider above (content that users must pay to see), there are certainly others: the content itself might be free, but should not be easily available to everyone, for example.

Pages that can be accessed only after filling out a form and hitting “Submit” might as well be closed doors to spiders; think of spiders as unable to push buttons or type. Likewise, pages that can only be reached through a drop-down menu might not be spidered, and the same holds true for documents that can only be accessed via a search box.

Documents that are purposefully blocked will usually not be spidered. This can be handled with a robots meta tag or robots.txt file. You can find other articles that discuss the robots.txt file on SEO Chat.
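
If you do fence part of a site off, Python’s standard library can show you what a well-behaved spider will conclude from your robots.txt. The file contents below are a made-up example that blocks every crawler from a /members/ area; the parsing calls are the standard urllib.robotparser ones.

    from urllib.robotparser import RobotFileParser

    # A made-up robots.txt that blocks all crawlers from the paid-content area.
    robots_txt = """
    User-agent: *
    Disallow: /members/
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("Googlebot", "http://www.example.com/members/article.html"))  # False
    print(rp.can_fetch("Googlebot", "http://www.example.com/free/article.html"))     # True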

Pages that require a login block search engine spiders. Remember the “spiders can’t type” observation above. Just how are they going to log in to get to the page?

Finally, I’d like to make a special note of pages that redirect before showing content. Not only will that not get your page indexed, it could get your site banned. Search engines refer to this tactic as “cloaking” or “bait-and-switch.” You can check Google’s guidelines for webmasters (http://www.google.com/intl/en/webmasters/guidelines.html) if you have any questions about what is considered legitimate and what isn’t.

Now that you know what will make spiders choke, how do you encourage them to go where you want them to? The key is to provide direct HTML links to each page you want the spiders to visit. Also, give them a shallow pool to play in. Spiders usually start on your home page; if any part of your site cannot be accessed from there, chances are the spider won’t see it. This is where use of a site map can be invaluable.
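
One way to check how shallow that pool really is: treat your site as a graph of pages and links and measure how many clicks each page sits from the home page. The little site structure below is invented for illustration; on a real site you would build the graph from your own pages, but the breadth-first walk is the same.

    from collections import deque

    # Made-up site structure: each page and the pages it links to.
    SITE = {
        "/": ["/products", "/blog"],
        "/products": ["/products/widgets"],
        "/products/widgets": ["/products/widgets/blue"],
        "/products/widgets/blue": ["/products/widgets/blue/specs"],
        "/blog": [],
    }

    def click_depths(start="/"):
        """Breadth-first walk from the home page, recording each page's depth."""
        depths = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for linked in SITE.get(page, []):
                if linked not in depths:
                    depths[linked] = depths[page] + 1
                    queue.append(linked)
        return depths

    for page, depth in sorted(click_depths().items(), key=lambda item: item[1]):
        note = "  <- more than three clicks from home" if depth > 3 else ""
        print(depth, page + note)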
