|
Indexing spiders (sometimes called robots, bots, or crawlers)
are the secret agents behind the results you enjoy
when performing searches. Spider programs, just like browsers, request
and retrieve documents from web servers; unlike browsers, however, they do
so not for viewing by humans but for automatic indexing and inclusion
in their databases. They do it tirelessly, in hardly imaginable
amounts (millions of pages per day), around the clock and without days off.
Spiders are what set search engines apart from directories
(one of the most prominent directories is Yahoo). Directories don't
keep their pet spiders because all links in a directory are
discovered (or submitted), examined, and annotated by humans.
This difference makes the hand-picked resources of directories, on
average, much more valuable but much less voluminous than the
homogeneous heap of links in a search engine.
Each new document encountered by the spider is scanned for links,
and these links are either traversed immediately or scheduled for
later retrieval. Theoretically, by following all links starting from a
representative initial set of documents, a spider will end up having
indexed the whole Web.
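
The scan-and-schedule loop described above can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical stand-in for real HTTP requests, and the names `LinkExtractor` and `crawl` are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, seeds):
    """Breadth-first traversal; returns URLs in the order they were indexed."""
    frontier = deque(seeds)   # links scheduled for later retrieval
    seen = set(seeds)         # never schedule the same URL twice
    indexed = []
    while frontier:
        url = frontier.popleft()
        html = pages.get(url)  # assumption: a dict lookup replaces the HTTP request
        if html is None:
            continue           # document could not be retrieved
        indexed.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


# Hypothetical four-page site; nothing links to /orphan.html.
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/index.html">Home</a>',
    "/b.html": '<a href="/a.html">A</a> <a href="/missing.html">?</a>',
    "/orphan.html": "Nobody links here.",
}

print(crawl(PAGES, ["/index.html"]))
```

Starting from `/index.html`, the sketch indexes `/index.html`, `/a.html`, and `/b.html`. It never finds `/orphan.html`, because a spider can discover only what some already-visited document links to.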
In practice, however, indexing the whole Web is unachievable. To begin with,
lots of documents on the web are generated dynamically, most
often in response to input from a form. Although spiders
can follow links, they have no idea what to put into the fields of a
form, so any data served only in response to form input is inaccessible to
search spiders (unless some alternative access mechanism is provided).
Various web-accessible databases belong in this category, including
search engines themselves.
Also, spiders never see pages that are customized via cookies,
or any content that depends on JavaScript or Java tricks. Some
spiders cannot even understand frames (see "Frames," later in this chapter). As
you might have guessed, search engines cannot yet make heads or
tails of images, audio, or video clips, so these bits of
information are wasted (in fact, spiders don't even request
them). What remains is pure HTML source, from which
spiders then strip all markup and tags to get to the
bare-bones plain text.
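
To make the markup-stripping step concrete, here is a minimal sketch that uses Python's standard `html.parser` module. The class name `TextExtractor`, the function `plain_text`, and the sample page are assumptions made for this illustration; a real spider's text extraction is certainly more elaborate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps character data and drops tags, scripts, and style sheets."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def plain_text(html):
    """Return the document's text with all markup removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())


# Made-up sample page; the image is never fetched, only its tag is skipped.
SAMPLE = (
    "<html><head><title>Web Spiders</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Indexing</h1>"
    "<p>Spiders &amp; <b>robots</b> crawl.</p>"
    '<img src="web.gif" alt="a web"></body></html>'
)

print(plain_text(SAMPLE))
```

Joining the text chunks with spaces keeps words from adjacent elements apart, at the cost of occasionally splitting a word that is only partly wrapped in a tag. For indexing purposes, that is the more tolerable error of the two.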
Even with these economies, boxing up the entire Web
into a single database turns out to be practically infeasible. It
might have been possible just a year ago, but not now that the Web has grown
so large. That's why search engines are now moving from the
strategy of swallowing everything they see to various selection
techniques.
Ideally, this selection should aim at improving the quality of the
database by discarding junk and scanning only the premier web content.
In reality, of course, this kind of discernment is impossible because no
automatic program is smart enough to separate the wheat from the chaff. The only
way to narrow things down is to impose some rather arbitrary
restrictions.
One search engine that admits to "sampled spidering" is Alta Vista. It has been
claimed that Alta Vista's spider has a quota of no more than 600
documents per domain. If true, this means that
large domains such as geocities.com or even microsoft.com are
severely underrepresented in Alta Vista's database. It
remains open to speculation whether other search engines employ
similar sampling techniques or whether the size of their databases is
limited only by their technical capacity.
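
A per-domain quota of this kind could be enforced with nothing more than a counter. The sketch below is purely illustrative and says nothing about how Alta Vista actually implements its limit: the function name `sample_by_domain` and the example.com URLs are hypothetical, and treating the URL's host name as the "domain" is a simplifying assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def sample_by_domain(urls, quota):
    """Keep at most `quota` URLs per host, in first-seen order."""
    taken = Counter()
    accepted = []
    for url in urls:
        host = urlsplit(url).hostname or ""  # assumption: host name stands in for domain
        if taken[host] < quota:
            taken[host] += 1
            accepted.append(url)
    return accepted


# Hypothetical URLs: one large site and one small one.
URLS = [
    "http://big.example.com/page1.html",
    "http://big.example.com/page2.html",
    "http://big.example.com/page3.html",
    "http://big.example.com/page4.html",
    "http://small.example.com/index.html",
]

print(sample_by_domain(URLS, quota=2))
```

With a quota of 2, only the first two pages of the large site survive, while the small site is indexed in full. A fixed quota penalizes large domains in exactly this way.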
All search engines allow users to submit their URLs
for spidering. Some engines retrieve submitted documents immediately;
others schedule them for future scanning. Either way, submitting lets you
at least make sure that your domain isn't missed. You're supposed
to submit only the root URL of your site; using this mechanism
to register each and every page has been criticized as a sort
of "spamming." On the other hand, given the selective nature of
spidering, it's not a bad idea to register at least all key pages of
your site. (Be careful, however: some search engines limit the number
of submissions per domain.)
Another important question is how often spiders update their
databases by revisiting sites they've already indexed. This interval
varies significantly between engines, with quoted update periods
ranging from one week to several months. You can estimate this aspect
of search engines' performance independently: analyze your
server's access logs to see when you were visited by spiders and what
documents they requested. A helpful Perl script for this purpose,
called BotWatch, is available at
http://www.tardis.ed.ac.uk/~sxw/robots/botwatch.html.
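
If you would rather roll your own, the basic idea can be sketched in a few lines. This is not BotWatch and makes no attempt to reproduce it. It assumes that your server writes the NCSA combined log format (with the user agent as the last quoted field), and it guesses at spiders by looking for a few illustrative substrings in the agent name. The log entries shown are made up.

```python
import re
from collections import defaultdict

# Assumption: NCSA combined log format, user agent in the last quoted field.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative markers only; extend the tuple for the agents you actually see.
AGENT_MARKERS = ("bot", "spider", "crawler", "scooter", "slurp")


def spider_visits(lines):
    """Map each spider-like user agent to its (timestamp, path) requests."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent")
        if any(marker in agent.lower() for marker in AGENT_MARKERS):
            visits[agent].append((match.group("when"), match.group("path")))
    return dict(visits)


# Made-up log entries for illustration.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Mar/1998:04:15:02 +0000] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/1.0"',
    '192.0.2.10 - - [12/Mar/1998:04:15:09 +0000] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "Scooter/1.0"',
    '192.0.2.77 - - [12/Mar/1998:09:30:41 +0000] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.04 [en] (Win95; I)"',
]

for agent, requests in spider_visits(SAMPLE_LOG).items():
    for when, path in requests:
        print(agent, when, path)
```

Comparing the timestamps of successive visits by the same agent gives a rough estimate of that engine's update period for your site.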
Many search engines have problems with sites in languages other
than English, especially if these languages use character sets
different from ISO 8859-1 (see Chapter 41, "Internationalizing
Your HTML"). For example, HotBot returns nothing when queried
with keywords in Russian, nor can it properly display summaries for
documents in Russian. This makes it useless for Russian surfers, despite
the fact that HotBot's spider routinely scans a good
share of all web sites in Russia.
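
One way such a failure can arise is a character set mismatch. The sketch below illustrates the general problem and makes no claim about HotBot's internals: it assumes a page served in KOI8-R (a common encoding for Russian) and an indexer that treats every document as ISO 8859-1. The function name `find` is invented for this example.

```python
def find(query, stored_bytes, assumed_charset):
    """Decode stored bytes under an assumed charset, then search for the query."""
    return query in stored_bytes.decode(assumed_charset)


# A Russian phrase ("search on the net") as a server would send it in KOI8-R.
document = "поиск в сети".encode("koi8-r")
query = "поиск"

print(find(query, document, "koi8-r"))      # True: charset correctly identified
print(find(query, document, "iso-8859-1"))  # False: charset wrongly assumed
print(document.decode("iso-8859-1"))        # what a summary would look like
```

Because every byte is a valid ISO 8859-1 character, the wrong decoding raises no error. The text silently turns into accented Latin letters, so a Russian query never matches and summaries come out garbled.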
|