Searching the Web
Author: Bhupinder S. Sran, DeVry Institute
Email: bsran@admin.nj.devry.edu
This document is meant for personal, educational, or non-profit use only.
What is the World Wide Web?
- The World Wide Web is a collection of millions of linked hypertext documents stored on thousands of servers on the Internet.
Types of documents available through the World Wide Web
- Hypertext
- Gopher Text
- Multimedia
- FTP files
Methods for finding documents on the Web
- Browsing. Example: New Jersey NIE Home Page
- Hot lists of "popular" documents
- Hierarchical Web Directories
- Automated Web Search Engines
- Alta Vista - "The Largest Web Index." Search the Web and Usenet news.
- InfoSeek - Search by keyword and phrases. Rated the best in several reviews.
- Lycos - Search by keyword. Catalog covers "over 90%" of the web. Very thorough.
- WebCrawler - Search by keyword/phrases in document title and content. Pretty fast.
- Hybrid Search Tools: Combination of web directories and search engines
- Yahoo - Search by subject and keyword. Catalog is relatively small, leading you to the "major" web sites quickly.
- Excite - Search by keywords and "concepts".
- Asking other users via Listservs (e.g. Best-of-Web Listserv)
Main Components of Automated Web Search Tools
- Spider (or Robot) - programs that search the Web every day, creating a catalog of documents.
- Catalog - a database of information about Web documents that have been visited by the spiders.
- Search Engine - a program that takes a user query and matches it against the catalog and displays information about the "relevant" documents to the user.
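The three components above can be sketched in a few lines of Python. This is only an illustration, not how any of the tools listed here is implemented: a hard-coded dictionary stands in for the Web, and the names PAGES, build_catalog, and search are hypothetical.

```python
import re

# A stand-in for the Web: URL -> document text (hypothetical data).
PAGES = {
    "http://example.com/a": "Water pollution in New Jersey rivers",
    "http://example.com/b": "New Jersey newspapers in education",
    "http://example.com/c": "Computer icons and desktop pictures",
}

def build_catalog(pages):
    """Spider + catalog: visit each page and record which words it contains."""
    catalog = {}
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            catalog.setdefault(word, set()).add(url)
    return catalog

def search(catalog, query):
    """Search engine: return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(catalog.get(w, set()) for w in words))
    return sorted(matches)

catalog = build_catalog(PAGES)
print(search(catalog, "new jersey"))
```

A real spider would fetch pages over the network and follow their links; here the "visit" is just a loop over the dictionary, which keeps the division of labor between the three parts visible.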
Lycos
- Operated by Lycos, Inc.
- Developed by Dr. Michael L. Mauldin, Center for Machine Translation, Carnegie Mellon University.
- One of the two largest catalogs on the Internet.
- Has indexed over 18,000,000 documents
- Maintained by automated robots (spiders).
- Has been generally ranked second (behind InfoSeek) in several studies.
- Info stored about each document:
- URL (Uniform Resource Locator)
- Title
- Outline: First 200 characters in headings and subheadings
- Keys: 100 most "weighty" words
- Excerpt: The smaller of the first 20 lines or 20% of the document.
- Date the document was last downloaded
- Date the document was last modified
- Size of document in bytes
- Number of words
- Description: Up to 16 lines of "hyperlink" text from other documents to this document.
- Info about matching documents is displayed according to a relevance score
- Relevance is based on how many external documents point to the given document.
- "Lycos adds, deletes, and updates about 50,000 documents a day in its catalog."
- The catalog is growing "faster than the Internet itself."
- URLs submitted by users are guaranteed to be indexed within one week.
- Accessible in Lynx by typing the "s" command.
- Lycos Search summary:
- Put a period after any search term that should be treated as a word and not as a substring. Example - icon. computer
- Note: Other rules may apply. However, they were not available in the current Lycos documentation.
- Use the detailed search form to:
- control the number of search terms that must be in the document;
- specify the desired "closeness" of the match;
- specify display options.
- Demonstration of Lycos
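Two of the Lycos rules listed above, the excerpt rule and relevance by incoming links, can be sketched as follows. This is a rough illustration under stated assumptions, not Lycos's actual code: the 20% is assumed to be measured in lines, an "external" document is assumed to mean any document other than the target itself, and the link graph LINKS and both function names are hypothetical.

```python
from collections import Counter

def excerpt(text, max_lines=20):
    """Return the smaller of the first 20 lines or 20% of the document.

    Assumption: the 20% is counted in lines (rounded down, minimum 1).
    """
    lines = text.splitlines()
    limit = min(max_lines, max(1, len(lines) // 5))
    return "\n".join(lines[:limit])

def inbound_link_counts(links):
    """Count, for each URL, how many other documents link to it.

    `links` maps a source URL to the list of URLs it points to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each source once per target
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical link graph.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}

counts = inbound_link_counts(LINKS)
matches = ["a.html", "b.html", "c.html"]
# Documents with more incoming links are listed first.
ranked = sorted(matches, key=lambda url: counts[url], reverse=True)
print(ranked)
```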
Yahoo!
- Developed by David Filo and Jerry Yang when they were Ph.D. candidates at Stanford University in April 1994.
- YAHOO is an acronym for "Yet Another Hierarchical Officious Oracle"
- Yahoo! is a privately funded company providing the Yahoo! service.
- Maintained by humans (not computers). Links are submitted by users.
- A robot also looks for new site announcements at various places.
- Has indexed over 100,000 documents
- Two ways of finding information:
- Browse through subject categories
- Search by keywords globally or within a selected category
- Information stored about each document:
- Document Title
- URL
- Comments supplied by users
- Search results are displayed alphabetically by the category they were found in.
- Search results consist of:
- names of documents that match the keywords
- names of subject categories that the matching documents belong to
- names of subject categories that match the keywords
- the "best" documents are often identified with a Cool tag
- the [Xtra!] tag leads to the Reuters newsfeed for that subject.
- the New tag indicates that the document was added in the last three days.
- the "@" tag at the end of a subject category indicates that the category appears in multiple places. Clicking on the heading takes you to the main category.
- "Headlines" on the top page provide you with the latest news stories, updated hourly.
- Good for a user looking for "major sites"
- Yahoo search options:
- make the search case-sensitive or case-insensitive
- specify whether you want matches to contain all of your keywords or at least one of your keywords.
- specify whether the keywords should be considered as substrings or whole words.
- limit the number of matches found
- these search options are available on the detailed search screen only.
- Demonstration of Yahoo!
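The Yahoo search options listed above (case sensitivity, all or any keywords, substrings or whole words, a limit on matches) map naturally onto function parameters. The sketch below is an assumption about how such options could behave, not Yahoo's implementation; keyword_search, its parameter names, and the TITLES list are hypothetical.

```python
import re

def keyword_search(titles, keywords, match_all=True, whole_words=False,
                   case_sensitive=False, limit=None):
    """Filter titles using options like those on a detailed search screen."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = []
    for kw in keywords:
        body = re.escape(kw)
        if whole_words:
            # \b anchors the keyword at word boundaries.
            body = r"\b" + body + r"\b"
        patterns.append(re.compile(body, flags))

    combine = all if match_all else any
    hits = [t for t in titles if combine(p.search(t) for p in patterns)]
    return hits if limit is None else hits[:limit]

TITLES = [
    "Iconography of the Renaissance",
    "Computer Icon Archive",
    "Desktop icon design for computers",
]

# Substring matching: "computer" also matches "computers".
print(keyword_search(TITLES, ["icon", "computer"]))
# Whole-word matching: only the title with both exact words remains.
print(keyword_search(TITLES, ["icon", "computer"], whole_words=True))
```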
Infoseek
- Maintained by automated robots.
- Many pages have also been reviewed by humans.
- Infoseek has been rated the "best" search engine in several tests.
- Infoseek has a relatively sophisticated search language.
- Infoseek does not support Boolean search operators.
- Infoseek also offers fee-based searching of the Web and other databases.
- Search results are sorted by relevance and displayed 10 at a time.
- Factors affecting the relevance score for a document:
- number of times the keywords appear in the document
- more weight is given to the existence of more discriminating terms. Ex. "Clinton" is likely to generate a higher relevance score than "president".
- existence of phrases being searched for generates a higher relevance score.
- InfoSeek has a relatively powerful search language.
- InfoSeek Search Examples.
- John Wayne -- Capital letters indicate proper names
- "artificial intelligence" -- double quotes indicate that the words must appear next to each other.
- computer-system -- hyphen indicates that the words must be within one word of each other
- [sun ultraviolet] -- square brackets indicate that the words must be within 100 words of each other.
- icon +computer -- + indicates that this word must appear in the document.
- etc.
- Use discriminating keywords.
- Do not expect to find documents that would appear in the "lower" levels of the Web.
- Demonstration of Infoseek
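The three relevance factors listed for Infoseek (keyword counts, extra weight for discriminating terms, a boost for phrases) can be combined in a toy score. Infoseek's real formula is not given here, so everything below is an assumption: the rarity weight is a swapped-in inverse-document-frequency style term, the phrase bonus of 1.0 is arbitrary, and score, tokenize, and CORPUS are hypothetical names.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, doc, corpus, phrase_bonus=1.0):
    """Toy relevance: term counts weighted by rarity, plus a phrase bonus."""
    query_words = tokenize(query)
    doc_words = tokenize(doc)
    if not query_words:
        return 0.0
    total = 0.0
    for term in set(query_words):
        containing = sum(1 for d in corpus if term in tokenize(d))
        if containing == 0:
            continue
        # Rarer (more discriminating) terms get a larger weight.
        weight = math.log(1 + len(corpus) / containing)
        total += doc_words.count(term) * weight
    # Reward documents that contain the query words as an exact phrase.
    if f" {' '.join(query_words)} " in f" {' '.join(doc_words)} ":
        total += phrase_bonus
    return total

CORPUS = [
    "The president met the press",
    "President Clinton signed the bill",
    "The club president resigned",
    "Clinton spoke in New Jersey",
]

for doc in CORPUS:
    print(round(score("president clinton", doc, CORPUS), 3), doc)
```

In this small corpus "clinton" appears in fewer documents than "president", so a document mentioning only "Clinton" outscores one mentioning only "president", which mirrors the example in the list above.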
Alta Vista
- The newest automated search engine.
- Developed by Digital Equipment Corporation (DEC).
- Gives access to "all 8 billion words found in over 16 million Web pages."
- A very thorough search language allows you to:
- Search for phrases by putting them in double quotes. example: "water pollution"
- Specify required keywords, prohibited keywords and wildcards. example: +president* -"foreign policy"
- Find out who has links to a given URL. Example: To find out which external documents have links to Stevens Institute of Technology home page: +link:http://www.stevens-tech.edu -url:http://www.stevens-tech.edu
- Search for words in the titles of documents only. Example: To find all documents that have "New Jersey" in the title tag: title:"New Jersey"
- etc.
- Demonstration of Alta Vista
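To make the Alta Vista operators above concrete, here is a minimal sketch of a matcher for required (+) terms, prohibited (-) terms, quoted phrases, and trailing * wildcards. It is not Alta Vista's parser: field prefixes such as link: and title: are left out, and the names parse, term_matches, matches, and DOCS are hypothetical.

```python
import re

# An optional sign followed by either a quoted phrase or a bare word.
TERM = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse(query):
    """Split a query into (required, prohibited, optional) term lists."""
    required, prohibited, optional = [], [], []
    for sign, phrase, word in TERM.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue
        bucket = {"+": required, "-": prohibited}.get(sign, optional)
        bucket.append(term)
    return required, prohibited, optional

def term_matches(term, text):
    """True if the word or phrase occurs in text; a trailing * is a wildcard."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def matches(query, text):
    required, prohibited, optional = parse(query)
    if any(term_matches(t, text) for t in prohibited):
        return False
    if not all(term_matches(t, text) for t in required):
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or any(term_matches(t, text) for t in optional)

DOCS = [
    "Presidential debates focus on the economy",
    "The president outlined his foreign policy",
    "Water pollution in New Jersey",
]
QUERY = '+president* -"foreign policy"'
print([d for d in DOCS if matches(QUERY, d)])
```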
WebCrawler
- Operated by America Online
- Developed by Brian Pinkerton at the University of Washington
- Maintained by automated robots.
- Has indexed over 250,000 documents.
- Info stored about each document:
- Titles (only!) of matching documents are displayed according to a relevance score
- Relevance score is calculated by taking "the total number of times each of the words in your query appears in the document and dividing it by the total number of words in the document."
- Maximum number of hits shown: unlimited
- Uses stop words in indexing
- Does not have a rich search language.
- One advantage: You can see more hits on one page because only titles are displayed.
- Demonstration of WebCrawler
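The WebCrawler relevance score quoted above is simple enough to compute directly. The sketch below follows the quoted description and nothing more: how WebCrawler actually splits text into words or handles stop words is not stated, so the tokenizer and the names words and relevance are assumptions.

```python
import re

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(query, document):
    """Occurrences of the query words divided by the document's word count."""
    doc_words = words(document)
    if not doc_words:
        return 0.0
    hits = sum(doc_words.count(w) for w in set(words(query)))
    return hits / len(doc_words)

doc = "Web robots index the Web so that users can search the Web"
# "web" appears 3 times and "search" once in a 12-word document.
print(relevance("web search", doc))
```

Because the score divides by document length, a short page that repeats the query words can outrank a long, thorough one.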
Some Information Management Issues
- The web is growing very rapidly.
- Human generated catalogs may not be able to keep up with the growth
- Information in catalogs may become obsolete
- The search space is not clearly defined and is constantly changing
- The document set is highly heterogeneous with respect to type, content, and style.
- User Interface design for search engines
- Standard for Robot Exclusion.
- Use of the META tag.
- Security of information.
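The Standard for Robot Exclusion mentioned above lets a site publish a robots.txt file telling spiders which paths to stay out of. As a small illustration, Python's standard library can interpret such a file; the robots.txt contents, the example.com URLs, and the spider name "ExampleSpider" below are all made up for the sketch.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a site might publish at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("http://example.com/index.html",
            "http://example.com/private/notes.html"):
    # can_fetch reports whether the named robot may retrieve the URL.
    print(url, parser.can_fetch("ExampleSpider", url))
```

The file is advisory: a well-behaved spider checks it before fetching, but nothing in the standard prevents a robot from ignoring it.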
General Search Strategies
- Understand the characteristics of each search tool
- Learn and use the search language of the search tool
- Hierarchical Web directories and hybrid tools: Best for searching the "top" of the Web.
- With Yahoo, select the proper category before doing the search.
- Automated search engines (e.g. Lycos, InfoSeek, Alta Vista, WebCrawler): Best for finding documents on unusual topics.
- Use discriminating terms in your query to reduce the number of matches. Avoid one word queries.
The End