The term "search engine" is often used generically
to describe both crawler-based search engines
and human-powered directories. These two types
of search engines gather their listings in radically
different ways.

Crawler-Based Search Engines
Crawler-based search engines, such as Google,
create their listings automatically. They "crawl"
or "spider" the web, then people search
through what they have found.

If you change your web pages, crawler-based search
engines eventually find these changes, and that
can affect how you are listed. Page titles,
body copy and other elements all play a role.

Human-Powered Directories
A human-powered directory, such as the Open
Directory, depends on humans for its listings.
You submit a short description to the directory
for your entire site, or editors write one for
sites they review. A search looks for matches
only in the descriptions submitted.
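
As a rough illustration of a search that looks only at submitted descriptions, here is a minimal Python sketch. The `Listing` class, the `search_directory` function and the sample sites are hypothetical names invented for this example; they do not describe how the Open Directory or any real directory is built.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    url: str
    description: str  # short, human-written summary of the whole site


def search_directory(listings, query):
    """Return URLs whose description contains every word of the query.

    Only the description is examined; the site's pages are never read.
    """
    words = query.lower().split()
    if not words:
        return []
    return [
        item.url
        for item in listings
        if all(w in item.description.lower().split() for w in words)
    ]


# Hypothetical sample data.
listings = [
    Listing("http://example.com/", "Guides to collecting vintage stamps"),
    Listing("http://example.org/", "Online shop for garden tools"),
]

print(search_directory(listings, "stamps"))  # matches the first listing
print(search_directory(listings, "rare"))    # no description mentions "rare"
```

Because the pages themselves are never consulted, a site that mentions "rare" on every page would still not match unless the word appears in its description.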

Changing your web pages has no effect on your listing.
Things that are useful for improving a listing
with a search engine have nothing to do with
improving a listing in a directory. The only
exception is that a good site, with good content,
might be more likely to get reviewed for free
than a poor site.

"Hybrid Search Engines" Or Mixed Results
In the web's early days, a search engine typically
presented either crawler-based results or
human-powered listings. Today, it is extremely
common for both types of results to be presented.
Usually, a hybrid search engine will favor one
type of listing over the other.
For example, MSN Search is more likely to present
human-powered listings from LookSmart. However,
it does also present crawler-based results (as
provided by Inktomi), especially for more obscure
queries.
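
One simple way to picture "favoring" one type of listing is to place those results first and fill the remaining slots from the other source. The sketch below assumes exactly that. The function name `hybrid_results` and the `limit` parameter are hypothetical, and nothing here is meant to describe how MSN Search actually combined its sources.

```python
def hybrid_results(directory_hits, crawler_hits, limit=10):
    """Merge two result lists, preferring directory listings.

    Directory hits come first; crawler hits fill whatever room is left.
    Duplicate URLs are kept only at their first position.
    """
    merged = []
    for url in directory_hits + crawler_hits:
        if url not in merged:
            merged.append(url)
    return merged[:limit]


# A common query: the directory has listings, so they lead.
print(hybrid_results(["a", "b"], ["b", "c"]))

# An obscure query: the directory has nothing, so crawler results fill the page.
print(hybrid_results([], ["x", "y"]))
```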

The Parts of a Crawler-Based Search Engine
Crawler-based search engines have three major
elements. First is the spider, also called the
crawler. The spider visits a web page, reads
it, and then follows links to other pages within
the site. This is what it means when someone
refers to a site being "spidered" or "crawled."
The spider returns to
the site on a regular basis, such as every month
or two, to look for changes.
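
The visit, read and follow-links loop can be sketched in a few lines of Python. To keep the example runnable without network access, it crawls a small in-memory "web" instead of fetching real pages. The `WEB` dictionary, its URLs and the `crawl_site` function are all assumptions made for illustration, not part of any real spider.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": (
        "Welcome to our stamp shop",
        ["http://example.com/about", "http://other.example/"],
    ),
    "http://example.com/about": (
        "About our stamp experts",
        ["http://example.com/"],
    ),
    "http://other.example/": ("An unrelated site", []),
}


def crawl_site(start_url, web=WEB):
    """Visit start_url, read it, then follow links to pages on the same site."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to read
        text, links = web[url]
        pages[url] = text  # "reading" the page
        for link in links:
            # Stay within the site, and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


print(sorted(crawl_site("http://example.com/")))
```

A real spider would also schedule return visits to pick up changes; calling `crawl_site` again later and comparing the returned text is the simplest stand-in for that.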

Everything the spider finds goes into the second part of
the search engine, the index. The index, sometimes
called the catalog, is like a giant book containing
a copy of every web page that the spider finds.
If a web page changes, then this book is updated
with new information.

Sometimes it can take a while for new pages or changes
that the spider finds to be added to the index.
Thus, a web page may have been "spidered"
but not yet "indexed." Until it is
indexed -- added to the index -- it is not available
to those searching with the search engine.
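
The gap between "spidered" and "indexed" can be modeled as two stores: one holding copies the spider has fetched, and one holding copies that searches can actually see. The `Index` class below is a hypothetical sketch of that idea; its method names are invented for this example.

```python
class Index:
    """Toy index: a copy of each page, keyed by URL."""

    def __init__(self):
        self.pages = {}    # url -> stored copy, visible to searches
        self.pending = {}  # url -> spidered copy, not yet visible

    def record_crawl(self, url, text):
        """The spider found a new or changed page; queue it for indexing."""
        self.pending[url] = text

    def commit(self):
        """Add everything queued to the index, replacing older copies."""
        self.pages.update(self.pending)
        self.pending.clear()

    def is_searchable(self, url):
        return url in self.pages


index = Index()
index.record_crawl("http://example.com/", "Welcome to our stamp shop")
print(index.is_searchable("http://example.com/"))  # spidered, not yet indexed
index.commit()
print(index.is_searchable("http://example.com/"))  # now indexed
```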

Search engine software is the third part of a search
engine. This is the program that sifts through
the millions of pages recorded in the index
to find matches to a search and rank them in
order of what it believes is most relevant.
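
A minimal sketch of this matching and ranking step follows. It assumes the index is a plain dictionary of URL to page text and scores each page by how often the query words appear. That scoring rule is an assumption chosen for brevity; real search engines weigh many more signals, and the `search` function here is hypothetical.

```python
def search(index, query):
    """Return matching URLs, most relevant first.

    A page matches if it contains every query word. Its score is the
    total number of times those words occur in the page.
    """
    words = query.lower().split()
    scored = []
    for url, text in index.items():
        tokens = text.lower().split()
        if words and all(w in tokens for w in words):
            score = sum(tokens.count(w) for w in words)
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical index contents.
index = {
    "a": "stamps stamps shop",
    "b": "stamps guide",
    "c": "garden tools",
}
print(search(index, "stamps"))  # "a" ranks above "b": it mentions the word twice
```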