
Searching the World Wide Web...

Return to David's faculty home page
 
How Searching the World Wide Web Works
 

When the World Wide Web opened up the Internet to the general public, searching for information with search engines became a normal part of our work, entertainment and study routines. Web surfing was born. However, one recent survey reported that 84% of users were dissatisfied with their ability to find information on the Web.

There are two things that allow us to find and retrieve information:

  1. Search engine programs to find the information on the WWW and build the database to store it
  2. Search engine software to allow us users to see and search a database and retrieve information from it

Web Search Retrieval Software - Search engines are Information Retrieval (IR) systems. IR is a branch of computer science concerned with letting users (you at your client machine) retrieve specific information from large databases. Search engines are Web search tools that build a database and then allow users to ask for information using a web page interface. The search engine software then retrieves information about webpages that fit the search criteria and returns a web page of results or "hits." Some allow submissions by businesses and organizations that create websites. Often there are no fees for a submission, small fees for a company to submit for you to anywhere from 20 to hundreds of search engines, or fees charged by the search engines for high placement in the results that are returned.

The user types a query or search expression into a small window on screen and the search engine does its best to find Web pages (hits) and their URLs that seem to match the query. Often a description of the Web page is included with the results. Each search engine searches its own database. No one search engine comes close to categorizing the entire Web and no two search engines have the same database of pages.
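
As a rough illustration of a search engine searching "its own database," here is a minimal Python sketch of a tiny word-to-page index and a lookup that returns hits with their URLs and descriptions. The pages, URLs, and function names are invented for this example; real search engines are far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical pages a bot might have collected: URL -> (description, page text).
PAGES = {
    "http://example.com/otters": ("Sea otter facts", "Sea otters live in Monterey Bay"),
    "http://example.com/bay": ("Monterey Bay guide", "Visit Monterey Bay and the aquarium"),
    "http://example.com/dogs": ("Pet chow", "Discount pet food and chew toys"),
}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose page text contains it."""
    index = defaultdict(set)
    for url, (_description, text) in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, pages, query):
    """Return (url, description) hits for pages containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    urls = set.intersection(*(index.get(w, set()) for w in words))
    return [(url, pages[url][0]) for url in sorted(urls)]

INDEX = build_index(PAGES)
for url, description in search(INDEX, PAGES, "monterey bay"):
    print(url, "-", description)
```

A second search engine with a different PAGES collection would return different hits for the same query, which is why no two engines give the same results.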

Elements of a good search Interface
Indexed versus Search Zones - Some search engines let you search their entire indexed database of sites, while others create search zones to help users narrow a site search to pertinent information and pages.
Another element of a well-planned search engine interface is to recognize the user's information needs and provide search capability for your site based upon those needs, including the method of indexing your site and the choice of indexing software to load or purchase. Users will do one or more of the following:

1. Search for a known item - a clearly defined search; likely providing a single, correct answer.
2. Existence Search - the user knows what they want but not how to describe it.
3. Exploratory Search - users know how to phrase the question but are unsure of what they will find. They are exploring and learning and will do multiple searches.
4. Comprehensive Search - research; users want everything available on the topic

A last consideration is to build a good user interface for the search tool and for the returned results. The search tool needs to be little more than a small window on the web page, but often the interface includes tons of ads and other info. Compare Yahoo with Google.

 

 

How well do search engines cover the Internet? - In 1997, the top 11 search engines covered about 60% of the Internet. As of 2004, the top 11 coverage has dropped to about 42% of the Internet.

Despite claims made by the search engines, no search engine comes close to covering the entire Internet or all of its web pages. The Internet is growing faster than search engines can find it and index or categorize it.

Additionally, much of the data that a user might want is "Hidden" from the search bots in databases. This is known as the Hidden Internet, the Deep Web or the Invisible Web/Internet.

The Invisible Web is comprised of information stored in databases, according to Chris Sherman, Webmaster of About.com's Web Search. Spiders and robots cannot enter these databases.

"It's as if they've run smack into the entrance of a massive library with securely bolted doors," Sherman said. "Spiders can record the library's address, but can tell you nothing about the books, magazines or other documents."

What else makes up the Invisible Web?

* Non-HTML files (PDF files, etc.)
* Webbed databases
* Sites requiring registration or login
* Archives (newspapers and magazines, etc.)
* Dynamically created Web pages
* Interactive tools (calculators, etc.)

http://www.libraryspot.com/features/invisibleweb.htm

Search Engine Coverage of the Internet
http://www.cabrillo.edu/~tsmalley/searchengines.html

Search Engine Statistics
http://searchenginewatch.com
http://searchenginewatch.com/3632382

Search Engine Ranks by Pages Indexed
http://searchenginewatch.com/reports/article/php/215481

Popularity of Search Engines
http://www.silurian.com/sitepos/coverage.htm

Search Engines outside the U.S. - Despite calling the web part of the Internet the World Wide Web, our viewpoint is generally focused on the U.S. and ignores the rest of the world. The search engines we use are U.S. based and focus on U.S. websites.

There are search engines that are based in other countries that focus on the websites of those countries and their continents.

It's the World Wide Web. See more:
European search engines
Search Engine Colossus

Queries may be in the form of:

  • key word searches - specific words or phrases
  • advanced searches using Boolean logic (Boolean operators such as AND, OR, NOT to search more specifically)
  • fuzzy queries - where you type in full sentences (Ask Jeeves)
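
The Boolean operators AND, OR and NOT correspond to simple set operations on the pages that contain each term. The following is a minimal Python sketch under that assumption; the mini-database and page numbers are made up for illustration.

```python
# Hypothetical mini-database: term -> set of page numbers containing that term.
INDEX = {
    "buddhist": {1, 2, 5},
    "monastery": {2, 3, 5},
    "hotel": {5},
}

def pages_with(term):
    """Look up the pages that contain a term (case-insensitive)."""
    return INDEX.get(term.lower(), set())

# AND: pages containing both terms (set intersection).
both = pages_with("Buddhist") & pages_with("Monastery")
# OR: pages containing either term (set union).
either = pages_with("Buddhist") | pages_with("Monastery")
# NOT: pages matching the AND search but without the excluded term (set difference).
without = both - pages_with("hotel")

print(sorted(both))     # Buddhist AND Monastery
print(sorted(either))   # Buddhist OR Monastery
print(sorted(without))  # Buddhist AND Monastery NOT hotel
```

AND narrows a search, OR widens it, and NOT removes unwanted pages from whatever was found.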

The databases are built by automated searching tools. Private Internet companies like Yahoo! and Lycos have developed powerful software systems and computer programs called spiders, crawlers, web robots, or just bots for short that search the Internet for web pages and enter them into the giant databases. The bots automatically:

  • find new information and Web pages
  • find updated information
  • delete web pages that no longer are on the Web
  • update the database
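
The four bot tasks above can be sketched in Python with a toy, in-memory "Web" instead of real network requests. Everything here (the WEB dictionary, the page names, the crawl function) is a hypothetical stand-in meant only to show the idea of following links and keeping a database in sync.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, list of linked URLs).
WEB = {
    "a.html": ("Home page", ["b.html", "c.html"]),
    "b.html": ("About otters", ["a.html"]),
    "c.html": ("Pet chow", []),
}

def crawl(web, start, database):
    """Follow links breadth-first from start, updating database in place.

    Returns (added, updated, deleted) lists of URLs.
    """
    added, updated = [], []
    seen = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in web:
            continue  # already visited, or a dead link
        seen.add(url)
        text, links = web[url]
        if url not in database:
            added.append(url)        # new page found
        elif database[url] != text:
            updated.append(url)      # page content has changed
        database[url] = text
        queue.extend(links)
    # Pages still in the database that the bot could not reach are removed.
    deleted = sorted(set(database) - seen)
    for url in deleted:
        del database[url]
    return added, updated, deleted

database = {}
print(crawl(WEB, "a.html", database))

# Simulate the Web changing: one page is edited and one disappears.
WEB["b.html"] = ("About sea otters", ["a.html"])
del WEB["c.html"]
print(crawl(WEB, "a.html", database))
```

The first crawl adds every page it can reach; the second reports one updated page and one deleted page.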

 

How do search engines index a webpage?

http://www.topsy.org/HowaWebPageGetsIndexed.html

 

The databases that are created by the bots and searchable by users come in four general formats or categories:

  1. search engines - uncategorized databases that search large parts of the Web.
  2. subject lists, indexes and directories - databases sorted into categories
  3. meta-search engines - one search engine is used to search several other search engines, collate the results and return them in one organized report
  4. Other Web resources - whatis, Internic's whois, Cruzio's domain lookup, bigfoot, whowhere, Yahoo! People Search, Yahoo Yellow Pages, mapquest, Yellow Pages, Usenet newsgroups, and so on.
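
To illustrate item 3, here is a minimal Python sketch of how a meta-search engine might collate results. The two "engines" are hypothetical functions that return fixed lists of invented URLs; a real meta-search engine would send the query to actual search engines over the network.

```python
# Hypothetical stand-ins for real search engines: each returns a ranked list of URLs.
def engine_one(query):
    return ["http://example.com/a", "http://example.com/b"]

def engine_two(query):
    return ["http://example.com/b", "http://example.com/c"]

def meta_search(query, engines):
    """Query every engine, merge duplicate URLs, and list widely found URLs first."""
    found_by = {}  # URL -> names of the engines that returned it
    for name, engine in engines.items():
        for url in engine(query):
            found_by.setdefault(url, []).append(name)
    # Sort by number of engines (most first), then alphabetically by URL.
    return sorted(found_by.items(), key=lambda item: (-len(item[1]), item[0]))

ENGINES = {"one": engine_one, "two": engine_two}
for url, names in meta_search("cheap airline tickets", ENGINES):
    print(url, "found by", ", ".join(names))
```

Ranking by how many engines returned a page is only one possible way to organize the combined report.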

A comparative search engine chart for selected search engines
http://libwww.cabrillo.cc.ca.us/html/searchengchart.html

A comparative chart of four search engines
Google - The largest. Particularly good at ranking of results. Automatically puts an AND between your terms so you don't have to, e.g., "cheap airline tickets" Europe. No truncation function. Try the Advanced search mode, where you can use date and domain filters.
AltaVista - Particularly powerful in the Advanced search mode, where you can do a Boolean search. AltaVista is the only search engine to offer the operator NEAR, which will retrieve words within 10 words of each other, e.g., "cheap airline tickets" NEAR Barcelona.
AllTheWeb - A very large, very fast search engine. Simulate Boolean searches by using + signs to require words in the results list, e.g., +"cheap airline tickets" +Europe. Try the Advanced Search mode. No truncation.
HotBot - Easy to search for particular kinds of files. Takes Boolean search statements when you change the option box, e.g., "cheap airline" AND (Spain OR Austria). You can specify date and time filters, and also file types, from the opening search page. Use Advanced Search for more options. Can truncate using *.

To get to a list (with descriptions) of the Search Tools:

1. Be on the Cabrillo College Library homepage
2. Click on Search the Internet
3. Click on Search Engines

The Big Question: How will anyone know you're there? Well, they won't unless you shout it out. It takes marketing, registering and promoting to get customers to your Web site.

Visitors typically either:

  1. Hear about you from advertising, write down or memorize your URL, and type it into their browser's "Location box." Two years ago, about 70% of all business eCommerce sites were found through Internet searches; now, about 65% are found through traditional advertising channels.
  2. Follow a link from another site or online ad to your site.
  3. Are referred by a friend or acquaintance.
  4. "Hit" on your site through a search.

The reality of the Internet today is that your website is one of millions of websites. It is unrealistic and even foolish to expect customers to find you through search engines and directories. You will need to rely upon targeted marketing efforts. That means you must make and then implement a marketing plan that begins with identifying and targeting customers and includes both Web and traditional marketing efforts to reach those customers with informative and persuasive messages.

Putting Out the Welcome Mat and Inviting Customers In.
First things first: Configure your Web page so search engines can find you, describe and categorize you and include you in directories and indexes.
  • Page Titles - Make sure your page titles make sense. AltaVista, for instance, puts the title at the top of your listing and uses it for the link to your page. Search engines index all the words in the title.
  • Page Content - Many search engines use the first few text lines in a page as a kind of "abstract" of the page. Focus on using words that will be important to a customer "searching" the Web.
  • Meta Tags - keywords and descriptions. 

Keywords are META tag values used by search engines to index and categorize your page. Many search engines (spiders) limit the number of keywords you can use. Carefully choose 10-15 keywords at most. Example: <meta name="keywords" content="puppies, dog food, chew toys, discount pet food, pet food">

Descriptions are META tag values used in directories along with the link. Some search engines will use descriptions to index or categorize your site. Example: <meta name="description" content="The galaxy's best and cheapest pet chow.">

  • Text Comments are used by some search engines that don't use Meta Tags and also by those that do use Meta Tags as a description for your site. Example: <!-- The galaxy's best and cheapest pet chow. Keep your puppy healthy and happy. Order online today. -->
  • ALT Text in Images - some search engines index the ALT attributes in your IMG tags to get a sense of a page and to index graphics.
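
To see roughly what a spider might pull out of a page, here is a minimal Python sketch that uses the standard library's html.parser module to collect the title, the keywords and description META tags, and the ALT text of images. The PageSummary class and the sample page are invented for illustration, and each real search engine decides for itself which of these elements it uses.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the title, META keywords/description, and IMG ALT text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical sample page built from the examples above.
PAGE = """<html><head><title>Pet Chow Online</title>
<meta name="keywords" content="puppies, dog food, chew toys">
<meta name="description" content="The galaxy's best and cheapest pet chow.">
</head><body><img src="puppy.gif" alt="Happy puppy"></body></html>"""

summary = PageSummary()
summary.feed(PAGE)
print("Title:", summary.title)
print("Keywords:", summary.meta["keywords"].split(", "))
print("Description:", summary.meta["description"])
print("ALT text:", summary.alt_texts)
```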
Conducting Searches

Every search engine has slightly or greatly different rules for searches. Despite the variations, three rules do apply across all the search engines:

  • Use quotation marks (" ") to keep words in phrases together
  • Use a plus sign (+) in front of a term (no space) to require it in the search results. In some search engines, now, the plus sign is not required -- but you're not penalized for using it.
  • Use a minus sign (-) in front of a term (no space) to disallow it in the search results
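
The three rules above can be sketched in Python as a small query parser and matcher. This is only an illustration: the function names are invented, and it assumes that terms without a sign are also required (an implicit AND), which is how some but not all search engines behave.

```python
import re

def parse_query(query):
    """Split a query into (required, excluded, unsigned) lists of terms.

    A term is a bare word or a "quoted phrase", optionally prefixed by + or -.
    """
    required, excluded, unsigned = [], [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).strip().lower()
        if not term:
            continue
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            unsigned.append(term)
    return required, excluded, unsigned

def contains(words, term):
    """True if the words of term appear next to each other, in order, in words."""
    target = term.split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def matches(text, query):
    """True if text has every wanted term and none of the excluded terms."""
    required, excluded, unsigned = parse_query(query)
    words = re.findall(r"[a-z0-9]+", text.lower())
    wanted = required + unsigned  # assumption: unsigned terms are required too
    return (all(contains(words, t) for t in wanted)
            and not any(contains(words, t) for t in excluded))

PAGE = "Cheap airline tickets to Europe, including Spain and Austria"
print(parse_query('+"cheap airline tickets" +Europe -Spain'))
print(matches(PAGE, '+"cheap airline tickets" +Europe'))
print(matches(PAGE, '"cheap airline tickets" -Spain'))
print(matches(PAGE, '"airline cheap"'))
```

Note that the quoted phrase only matches when the words appear together and in the same order.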

Try This!
You are asking for Web pages that have the words: Monterey and bay and sea and otters.

Click on Google Search. Type in the words and see what is returned. This is not a very precise search...you will probably have retrieved about 25,300 resources.

Now, try varying the search. Here's one idea. Type in

"Monterey Bay" "sea otters"

and then click on Google Search.

There's a big lesson here: You use quotation marks to hold words in phrases together. This reduces our results from 25,300 or so to 10,300 or so. In this second search, you are specifying that the Web pages you retrieve should have the phrase "Monterey Bay" and the phrase "sea otters."

Try some other variations -- try making the word otter singular, for example.

Try searching for "otters in Monterey Bay." Each time, note how literal the computer is.

Advanced searches

Boolean Searches - for example: Buddhist AND Monastery AND "santa cruz"
http://www.cabrillo.edu/~tsmalley/Boolean.html

Some Common Boolean Operators
  • AND
  • OR
  • NOT
  • BOTH
  • AND NOT
  • ( ) = or?
  • truncation - start* finds anything that begins with start
  • ADJ - adjacent, like NEAR
  • NEAR - within 10 words
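
Truncation and NEAR are easy to picture with a short Python sketch. The function names and sample sentence are invented for illustration, and the default of 10 words for NEAR simply follows the description above; actual search engines may count differently.

```python
import re

def words_of(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def truncation_matches(text, pattern):
    """Return the words in text that begin with the part of pattern before the *."""
    prefix = pattern.rstrip("*").lower()
    return [w for w in words_of(text) if w.startswith(prefix)]

def near(text, first, second, distance=10):
    """True if first and second occur within `distance` words of each other."""
    words = words_of(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= distance
               for i in first_positions for j in second_positions)

TEXT = "Start here: the startup started selling cheap tickets to Barcelona"
print(truncation_matches(TEXT, "start*"))
print(near(TEXT, "tickets", "Barcelona"))
print(near(TEXT, "start", "Barcelona", distance=3))
```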

1. Go to Lycos

You've heard that there is a Buddhist monastery somewhere in Santa Cruz County. Find it using Boolean operators and describe how you did it.

How did you find it?

______________________________________________________________

______________________________________________________________

______________________________________________________________

Advanced search modes particular to some of the search engines.

1. Go to Google (remember, to get to a list of the search engines, it's Cabrillo College Library -> Searching the Internet -> Search Engines).

Click on Advanced Search -- it's over to the right of the search box

Search for Web pages that meet these criteria:


You want information about the best bike trails in Santa Cruz

You want Web pages updated within the last 3 months

You want Web pages that only come from educational domains (i.e., have .edu in their domain name) (You figure those college kids would know more than others about great bike trails)

What did you find?

______________________________________________________________

______________________________________________________________
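
The date and domain filters used in the Google exercise above can be pictured as a simple filter over a list of results. This Python sketch uses invented URLs and dates, and the filter_results function is hypothetical; it only shows the kind of test an Advanced Search page applies for you.

```python
from datetime import date, timedelta
from urllib.parse import urlparse

# Hypothetical search results: (URL, date the page was last updated).
RESULTS = [
    ("http://www.example.edu/trails.html", date(2004, 5, 1)),
    ("http://www.example.com/bikes.html", date(2004, 5, 20)),
    ("http://old.example.edu/trails.html", date(2003, 1, 15)),
    ("http://www.example.com.au/melbourne.html", date(2004, 4, 10)),
]

def filter_results(results, domain_suffix, today, max_age_days=90):
    """Keep URLs whose host ends with domain_suffix and that were updated recently."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for url, updated in results:
        host = urlparse(url).hostname or ""
        if host.endswith(domain_suffix) and updated >= cutoff:
            kept.append(url)
    return kept

today = date(2004, 6, 1)  # fixed date so the example always gives the same answer
print(filter_results(RESULTS, ".edu", today))
print(filter_results(RESULTS, ".au", today))
```

The same function works for country codes such as .au, because a country code is just the last part of the host name.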

2. Go to AllTheWeb. Click on ADVANCED SEARCH.

Search for Web pages that meet these criteria:


You want information about the best places for surfing in Santa Cruz

You want Web pages that include images

You want Web pages updated after 1 January 2001

What did you find?

______________________________________________________________

______________________________________________________________

 

 
World Search Engines

Play around a bit with the international search engines so you become familiar with them. My guess is that these will get better and better in the near future.


Search Engine Colossus

European Search Engines

Search Engines Worldwide

Country-Based Search Engines

Using international search engines

1. Search for Web pages that meet these criteria:

You want information about the best bike trails in Melbourne (Australia)

You want Web pages updated within the last 3 months

You want Web pages that only come from Australia. The top level domain code for Australia is .au.

Here's a list of all the top level domain and country codes.

What did you find?

______________________________________________________________

InfoSeek is now Go.com

NorthernLights is gone!