Spiders and Webcrawlers and spamdexing, oh my!
Saturday, April 26, 2008

Welcome to another episode of TechMan's when-he-feels-like-it series "I Want to Know About Technology But Not Too Much."

You know how a search engine works -- you enter a term and, in an instant, results are returned. "But how could it search something as big as the Internet that fast?" you think.

Actually, it didn't. Search engines have programs called "spiders" or "Webcrawlers" that are constantly creeping around the Internet gathering information about Web pages.
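For the curious, here is a rough sketch in Python of how a spider creeps from page to page. It is only an illustration: the tiny "Web" below (TOY_WEB) and the function name crawl are made up for this example, and a real crawler would fetch live pages over the network rather than read them from a dictionary.

```python
from collections import deque

# A made-up miniature "Web": each page has some text and a list of links.
TOY_WEB = {
    "a.example": {"text": "spiders crawl the web", "links": ["b.example", "c.example"]},
    "b.example": {"text": "search engines build an index", "links": ["a.example"]},
    "c.example": {"text": "web pages link to each other", "links": ["b.example", "d.example"]},
    "d.example": {"text": "this page is only reachable through c", "links": []},
}


def crawl(start, web):
    """Visit every page reachable from `start`, following links breadth-first."""
    seen = {start}            # pages already queued, so none is visited twice
    queue = deque([start])    # pages waiting to be visited
    collected = {}            # url -> text gathered so far
    while queue:
        url = queue.popleft()
        page = web[url]
        collected[url] = page["text"]
        for link in page["links"]:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


pages = crawl("a.example", TOY_WEB)
print(list(pages))  # ['a.example', 'b.example', 'c.example', 'd.example']
```

The spider starts at one page, notes what it finds, and adds every new link to its to-do list, which is how it eventually reaches pages it was never told about.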

Google's crawlers look for words in titles and text. In the past, Google gave high preference to metadata, the descriptive words a page's creator attaches to it. But this practice led some unscrupulous types to repeat popular search words in their metadata so their pages would show up in the results of more searches.

For example, a malware site creator might repeat "Britney Spears" hundreds of times in the metadata because that has been one of Google's top searches. So Google changed its system to look more deeply.

After collecting the information, a search engine indexes it in a database. When a request is made, it searches the index to find the appropriate Web sites, instead of searching the entire Web.

Think of it like this: If you had to search every page of a book for a term it would take forever. Instead you search the index -- much quicker.
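The book-index idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine does it: the sample pages and the function names build_index and search are invented for the example, and the "index" is just a lookup table from each word to the pages that contain it.

```python
from collections import defaultdict

# Made-up sample pages standing in for what a crawler has collected.
pages = {
    "a.example": "Spiders crawl the Web",
    "b.example": "Search engines index the Web",
    "c.example": "Spiders are not insects",
}


def build_index(pages):
    """Map each lowercase word to the set of pages where it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())  # keep only pages with this word too
    return results


index = build_index(pages)
print(sorted(search(index, "spiders web")))  # ['a.example']
```

The slow work of reading every page happens once, when the index is built. Each search after that is only a handful of quick lookups.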

Now the search engine must present the results of your search. Which results come first? This is critical, for if a result is returned on Page 33 of the list, few are likely to see it.

This is where Google made its real technological breakthrough: a patented process called PageRank that determines the order in which results are presented.

PageRank is based on the number and quality of other sites that link to a site. The idea is that the more "popular" a Web site is in terms of other quality sites linking to it, the more relevant the information on that site is.
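Here is a bare-bones Python sketch of the link-counting idea, in the simplified form usually shown in textbooks. It is not Google's actual system. The site names are invented, and the damping value of 0.85 and the 50 rounds of repetition are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its links.

    `links` maps every page to the list of pages it links to.
    `damping` and `iterations` are illustrative assumptions.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # everyone starts equal
    for _ in range(iterations):
        # Every page gets a small baseline score each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# A made-up four-page Web: three pages link to "home", nothing links to "lonely".
links = {
    "home": ["news", "blog"],
    "news": ["home"],
    "blog": ["home"],
    "lonely": ["home"],
}

scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy example "home" comes out on top because three pages link to it, while "lonely" lands at the bottom because none do. A link from a high-scoring page is worth more than a link from a low-scoring one, which is the "quality" part of the idea.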

But the unscrupulous began creating Web pages called "link farms" -- nothing more than a group of pages that all link to each other.

The idea was to spam the index (sometimes called spamdexing) to get a higher page rank. To counter that, Google looks for signs of link farms.

There are whole books on effective searching techniques, but here are a few tips. They apply to Google, but similar features exist in other search engines.

• State your preferences. Next to the Google search window, click preferences. Two of the most useful preference settings are "Do not filter my search results" and "Open search results in a new browser window." They do exactly what they say. But be aware that turning off filtering might result in some objectionable sites being returned.

• Advanced is not that advanced. If you click on "advanced search" next to the search box, you get a form that narrows your searches in useful ways. For example, you can search in only a certain domain, such as .edu or .gov. There are individual commands for these, but if you use advanced search, you won't have to memorize them.

• Say "yes" to negation. One very useful command is negation (it also can be selected in advanced search). Negation excludes certain words from a search when you put a minus sign before them. If you are researching missile defense systems, you will narrow your search faster if you request "star wars" -movie.

• Use special terms. Using the term phonebook: followed by a phone number (with area code) will look up a name for that number; define: followed by a word will give you definitions; movie: followed by a ZIP code gives you movie times; music: followed by a song name gives you lists of albums and songs.
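
To see how the minus sign in the negation tip might work behind the scenes, here is a small Python sketch. It is a guess at the general idea, not Google's code: the function names parse_query and matches and the sample headlines are invented for illustration.

```python
import shlex


def parse_query(query):
    """Split a query into required terms and excluded (minus-sign) terms.

    shlex.split keeps a quoted phrase such as "star wars" together.
    """
    required, excluded = [], []
    for term in shlex.split(query):
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:].lower())
        else:
            required.append(term.lower())
    return required, excluded


def matches(text, required, excluded):
    """True if the text has every required term and no excluded word."""
    lowered = text.lower()
    words = set(lowered.split())
    has_required = all(term in lowered for term in required)  # simple substring check
    has_excluded = any(term in words for term in excluded)    # whole-word check
    return has_required and not has_excluded


# Made-up headlines to search through.
headlines = [
    "Star Wars missile defense program history",
    "Star Wars movie tickets on sale",
]

required, excluded = parse_query('"star wars" -movie')
print([h for h in headlines if matches(h, required, excluded)])
```

Only the missile defense headline survives, because the second one contains the excluded word "movie."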

These tips and much more can be found in an excellent book, "Google Hacks, 3rd Edition," by Rael Dornfest, Paul Bausch and Tara Calishain, published by O'Reilly.

Want to send a question to TechMan? Just fire an e-mail to techman@post-gazette.com. Please include your name, hometown and a daytime phone number.
First published on April 26, 2008 at 12:00 am