A running joke in the software development field is that software engineering is the only branch of engineering in which adding a new wing to a building is considered "maintenance." By some estimates, software maintenance changes that are made to the product after it is released for use account for almost 90 percent of the lifetime cost of the product.
Software maintenance occurs for several reasons. Some maintenance is performed to fix latent defects (bugs). Sometimes maintenance must be performed to keep software up-to-date with new standards, or with changes in other components of the system. Of course, sometimes a product is changed to add a new feature that is requested by the users.
Regardless of why it occurs, changing a product that is already in the field is expensive. Thorough testing can reduce the number of defects, eliminating some maintenance costs. Netscape ONE applications incorporate not only dynamic content (such as any program might have), but also static content (HTML). Validating HTML to make sure it meets the standard makes it less likely that a change in some browser will force the developer to recode the page. Reducing these costs allows the developer to offer site development at lower cost, and the site owner can spend more on content maintenance, which builds traffic.
In early 1995, the fledgling Web marketplace was dominated by Mosaic. By the end of 1995, Netscape Navigator had acquired over 70 percent of the market, and many people believed that number inevitably would move to 100 percent. By mid-1996, Microsoft had entered the market. Their product, Microsoft Internet Explorer (MSIE), is largely a Navigator clone and has seized back about 30 percent of the market from Navigator.
Each of the graphical Web browsers (Navigator, MSIE, as well as Mosaic and others) interprets HTML into images and text on the screen of a Windows, Macintosh, or other desktop computer. This chapter concentrates on the portions of HTML that are common to all Web browsers. Chapter 4, "Netscape Enhancements to HTML," describes tags and attributes that are supported by Navigator and its clones, which are not (yet) part of standard HTML.
Netscape Communications Corporation has repeatedly announced its commitment to open standards, including HTML. While its products support a superset of the open standards, Netscape has participated in the standardization process; Netscape has presented its enhancements to the Web standards community. In many cases, these enhancements have been adopted into the standard; HTML 3.2 includes several concepts that were first introduced in early versions of Navigator.
Document Type Definitions and Why You Care About Them
The Hypertext Markup Language, or HTML, is not a programming language or a desktop publishing language. It is a language for describing the structure of a document. Using HTML, users can identify headlines, paragraphs, and major divisions of a work.
HTML is the result of many hours of work by members of various working groups of the Internet Engineering Task Force (IETF), with support from the World Wide Web Consortium (W3C). Participation in these working groups is open to anyone who wishes to volunteer. Any output of the working groups is submitted to international standards organizations as a proposed standard. Once enough time has passed for public comment, the proposed standard becomes a draft, and eventually might be published as a standard. HTML Level 2 has been approved by the Internet Engineering Steering Group (IESG) to be released as Proposed Standard RFC 1866. (As if the open review process weren't clear enough, RFC in proposed standard names stands for Request For Comments.)
The developers of HTML used the principles of a meta-language, the Standard Generalized Markup Language (SGML). SGML may be thought of as a toolkit for markup languages. One feature of SGML is the capability to identify within the document which of many languages and variants was used to build the document.
Each SGML language has a formal description designed to be read by computer. These descriptions are called Document Type Definitions (DTDs). An HTML document can declare for which level of HTML it was written by using a DOCTYPE tag as its first line. For example, an HTML 3.0 document starts with the following:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
The DOCTYPE tag is read by validators and other software. It's available for use by browsers and SGML-aware editors, although it's not generally used by those kinds of software. If the DOCTYPE tag is missing, the software reading the document assumes that the document is HTML 2.0.
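For example, here is a minimal sketch of an HTML 2.0 document that announces its DTD on the first line (the title and body text are only placeholders):

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>A Minimal Document</TITLE>
</HEAD>
<BODY>
<H1>Hello, Web</H1>
<P>This document declares the DTD it was written against.</P>
</BODY>
</HTML>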
DOCTYPE tags are used to
cue document readers about what type of markup language is being
used. Table 3.1 lists the most common DOCTYPE
lines and their corresponding HTML levels.
DOCTYPE | Level
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> | 2.0
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN"> | 3.0
<!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN"> | Netscape
The best kind of maintenance is the kind that improves the site-by adding new content and features that attract new visitors and encourage people to come back again and again. This kind of maintenance usually takes a lower priority compared to the tasks of defect removal and keeping the site up-to-date with the browsers. One key to building an effective site is to keep the maintenance costs low so plenty of resources are available to improve the site and, consequently, build traffic.
On the Web, severe software defects are rare. One reason for this is that HTML is not a programming language, so many opportunities a programmer might have to introduce defects are eliminated. Another reason is that browsers are forgiving by design. If you write bad C++ and feed it to a C++ compiler, chances are high that the compiler will issue a warning or even an error. If you write bad HTML, on the other hand, a browser will try its best to put something meaningful on-screen. This behavior is commonly referred to as the Internet robustness principle: "Be liberal about what you accept, and conservative about what you produce."
The Internet robustness principle can be a good thing. If you write poor HTML and don't want your clients to know, this principle can hide many of your errors. In general, though, living at the mercy of a browser's error-handling routines is bad for the following reasons:
If you could write each page once, and leave it alone forever, then maybe you could take the time to perfect each line of HTML. If your site is being actively used, however, then it is being changed-or should be.
The most effective Web sites are those that invite two-way communication with the visitor. Remember the principle: content is king. Web visitors crave information from your site. One way to draw them back to the site is to offer new, fresh information regularly. If a client posts new information every few weeks, people will return to the site. If the client posts new information daily, people will stampede back to the site. The expert Webmaster must deal with all the new content, while still ensuring that each page is valid, high-quality HTML.
This section shows how to use various tools to ensure that your HTML is as perfect as possible when the site is initially developed. Use these same tools regularly to make sure your maintenance activities haven't "broken" the page. Some of these tools also check external links to make sure that pages referenced by your site have not moved or "gone dark."
Strictly speaking, "validation" refers to ensuring that the HTML code complies with approved standards. More generally, validator-like tools are available to check for consistency and good practice as well as compliance with the standards.
The fastest and easiest way to validate a Web site is to submit each page to an online program known as a "validator." This section shows how the first validator, known as HALsoft, works. Although there are other validators that are better for most Webmasters, understanding HALsoft gives you an appreciation of the newer validators such as Gerald Oskoboiny's Kinder Gentler Validator.
HALsoft and the WebTech Validator
As the original Web validator, the WebTech validator is the standard by which other validators are judged. Unfortunately, the output of the WebTech program is not always clear. It reports errors in terms of the SGML standard-not a particularly useful reference for most Web designers.
ON THE WEB
http://www.webtechs.com/html-val-svc/ The HALsoft validator was the first formal validator widely available on the Web. In January 1996, the HALsoft validator moved to WebTech and is now available at this site.
Listing 3.1 gives an example of a piece of flawed HTML and the corresponding error messages that were returned from the WebTech validator.
Listing 3.1 -An Example of Invalid HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
<HEAD>
<TITLE>Test</TITLE>
<BODY BACKGROUND="Graphics/white.gif>
<H1>This is header one</H1>
<P>
This document is about nothing at all.
<P>
But the HTML is not much good!
</BODY>
</HTML>

produces the following errors:

sgmls: SGML error at -, line 4 at "B": Possible attributes treated as data because none were defined
The Netscape attribute (BACKGROUND) is flagged by the validator as an unrecognizable attribute. The missing closing tag for the HEAD doesn't help much, either, but it's not an error (because the standard states that the HEAD is implicitly closed by the beginning of the BODY). Even though it's not a violation of the standard, it's certainly poor practice-this kind of problem will be flagged by Weblint, described later in this chapter.
The WebTech validator gives you the option of validating against any of several standards, including:
HTML Level 2 is "plain vanilla" HTML. There were once HTML Level 0 and Level 1 standards, but the current base for all popular browsers is HTML Level 2 (also known as RFC 1866).
Each level of HTML tries to maintain backward compatibility with its predecessors, but using older features is rarely wise. The HTML working groups regularly deprecate features of previous levels. The notation Strict on a language level says that deprecated features are not allowed. Validators allow you to specify "strict" checking, generally with a check box.
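For example, a page written to the strict version of HTML 2.0 can say so in its DOCTYPE line:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">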
HTML Level 3 represents a bit of a problem. Shortly after HTML
Level 2 stabilized, developers put together a list of good ideas
that didn't make it into Level 2. This list became known as HTML+.
The HTML Working Group used HTML+ as the starting point for developing
HTML Level 3. A written description and a DTD were prepared for
HTML Level 3, but it quickly became apparent that there were more
good ideas than there was time or volunteers to implement them.
In March 1995, the HTML Level 3 draft was allowed to expire and
the components of HTML Level 3 were divided among several working
groups. Some of these groups, like the one on tables, released
recommendations quickly. The tables portion of the standard has
been adopted by several popular browsers. Other groups, such as
the one on style sheets, have been slower to release a stable
recommendation. A version of the Cascading Style Sheet level 1
standard (CSS1) has been adopted by Netscape in Navigator 4.0,
and by Microsoft in Internet Explorer 3.0.
ON THE WEB
http://www.w3.org/pub/WWW/TR/ Visit this site for the latest information on HTML recommendations and working drafts (including CSS1). Also see http://www.w3.org/pub/WWW/MarkUp/Wilbur/ for information on HTML 3.2, the World Wide Web Consortium's new specification for HTML. HTML 3.2 contains such features as tables, applets, text flow around images, superscripts, and subscripts.
The DTD for Netscape is even more troublesome. Netscape Communications has not released a DTD for its extension to HTML. The patient people at HALsoft reverse-engineered a DTD for validation purposes, but as new browser versions are released, there's no guarantee that the DTDs will be updated.
Gerald Oskoboiny's Kinder, Gentler Validator
During the brightest days of the HALsoft validator's reign, the two most commonly heard cries among Web developers were "We have to validate" and "Can anybody tell me what this error code means?"
Gerald Oskoboiny, at the University of Alberta, was a champion of HTML Level 3 validation and was acutely aware that the HALsoft validator did not make validation a pleasant experience. He developed his Kinder, Gentler Validator (KGV) to meet the validation needs of the developer community while also providing more intelligible error messages.
The KGV is available at
http://ugweb.cs.ualberta.ca/~gerald/validate/
To run it, just enter the URL of the page to be validated. KGV examines the page and displays any lines that have failed, with convenient arrows pointing to the approximate point of failure. The error codes are in real English, not SGML-ese.
Figure 3.1 is an example of KGV's treatment of the same code that was validated above by the WebTech validator:
Notice that each message contains an explanation link. The additional information in these explanations is useful.
Given the fact that KGV uses the same underlying validation engine as WebTech's program, there's no reason not to use KGV as your primary validation tool.
There are many reasons that pages won't validate, and you can do something to resolve each of them. The following sections cover the problems in detail.
Netscapeisms
Netscape Communications Corporation has elected to introduce new, HTML-like tags and attributes to enhance the appearance of pages when viewed through their browser. The strategy was a good one-in February 1996, BrowserWatch reported that over 90 percent of the visitors to their site used some form of Netscape. (Even after Microsoft offered their competing product, MSIE, for free, Netscape still maintained more than 70 percent of market share.)
There is much to be said for enhancing a site with Netscape tags, but unless the site is validated against the Netscape DTD (which has its own set of problems), the Netscape tags will cause the site to fail validation.
Table 3.2 is a list of some popular Netscape-specific tags. Later
in this chapter, a strategy for dealing with these tags is described.
A section in Chapter 4, "Netscape
Enhancements to HTML," describes how to get the best of both
worlds-putting up pages that take advantage of Netscape, while
displaying acceptable quality to other browsers that follow the
standard more closely.
Tag | Attribute
<BODY> | BGCOLOR, TEXT, LINK, ALINK, VLINK
Multiple <BODY> tags |
<CENTER> |
Table caption with embedded headers (for example, <TABLE><CAPTION><H2>...</H2></CAPTION>...) |
<TABLE WIDTH=400> |
<UL TYPE=Square> |
<HR SIZE=3 NOSHADE WIDTH=75% ALIGN=Center> |
<FONT...> |
<BLINK> |
<NOBR> |
<FRAME>, <FRAMESET>, <NOFRAME> |
<EMBED> | No longer supported by Netscape
Using Quotation Marks
A generic HTML tag consists of three parts:
<TAG ATTRIBUTE=value>
You might have no attribute, one attribute, or more than one attribute.
The value of the attribute must be enclosed in quotation marks if the text of the attribute contains any characters except A through Z, a through z, 0 through 9, or a few others such as the period. When in doubt, quote. Thus, format a hypertext link something like this:
<A HREF="http://www.whitehouse.gov">
It is an error to leave off the quotation marks in this example because the forward slashes in the URL are not permitted in an unquoted attribute value.
It is also a common mistake to forget the final quotation mark:
<A HREF="http://www.whitehouse.gov>
The syntax in this example was accepted by Navigator 1.1, but in Navigator 2.0 and later versions, the text after the link doesn't display. Therefore, a developer who doesn't validate-and who instead checks the code with a browser-would have seen no problem in 1995 when putting up this code and checking it with the then-current Netscape 1.1. By 1996, though, when Netscape 2.0 began shipping, that same developer's pages would break.
Keeping Tags Balanced
Most HTML tags come in pairs. For every <H1> there should be an </H1>. For every <EM> there must be an </EM>. It's easy to forget the trailing tag, and even easier to forget the slash in the trailing tag, leaving something like the following:
<EM>This text is emphasized.<EM>
Occasionally, one also sees mismatched headers like the following:
<H1>This is the headline.</H2>
Validators catch these problems.
Typos
Spelling checkers catch many typographical errors, but desktop spelling checkers don't know about HTML tags, so it's difficult to use them on Web pages. It's possible to save a page as text and then check it.
ON THE WEB
http://www.eece.ksu.edu/~spectre/WebSter/spell.html Use the tool at this site to spell check the copy online.
What can be done, however, about spelling errors inside the HTML itself? Here's an example:
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINKS="#0000FF" ALINKS="#FF0000" VLINKS="#FF00FF">
The human eye does a pretty good job of reading right over the errors. This tag is wrong-the LINK, ALINK, and VLINK attributes are typed incorrectly. A good browser just ignores anything it doesn't understand (in accordance with the Internet Robustness Principle), so the browser acts as though it sees the following:
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
Validators report incorrect tags such as these so that the developer can correct them.
Incorrect Nesting
Every tag has a permitted context. The structure of an HTML document is shown here:

<HTML>
  <HEAD>
    Various head tags, such as TITLE, BASE, and META
  </HEAD>
  <BODY>
    Various body tags, such as <H1>...</H1>, and paragraphs <P>...</P>
  </BODY>
</HTML>
While most developers don't make the mistake of putting paragraphs in the header, some inadvertently do something like the following example.
Suppose a developer writes these three lines on a page:
<P><STRONG>Here is a key point.</STRONG>
<P>This text explains the key point.
<P><EM>Here is another point</EM>
These lines are valid HTML. As the site is developed, the author decides to change the emphasized paragraphs to headings. The developer's intent is that the strongly emphasized paragraph will become an H1; the emphasized paragraph will become an H2. Here is the result:
<H1>Here is a key point.
<P>This text explains the key point.
<H2>Here is another point.</H1>
</H2>
Even the best browser would become confused by this code, but fortunately, a validator catches this error so the developer can clarify the intent.
Forgotten Tags
Developers frequently omit "unnecessary" tags. For example, the following code is legal HTML 2.0:

<P>Here is a paragraph.
<P>Here is another.
<P>And here is a third.
Under the now-obsolete HTML 1.0, <P> was a paragraph separator. It was an unpaired tag that typically was interpreted by browsers as a request for a bit of white space. Many pages still are written this way:
Here is a paragraph.<P>
Here is another.<P>
And here is a third.<P>
But starting with HTML 2.0, <P> became a paired tag, with strict usage calling for the formatting shown here:
<P>
Here is a paragraph.
</P>
<P>
Here is another.
</P>
<P>
And here is a third.
</P>
While the new style calls for a bit more typing, and is not required, it serves to mark clearly where paragraphs begin and end. This style helps some coders and serves to clarify things for browsers. Thus, it often is useful to write pages by using strict HTML and to validate them with strict DTDs.
Validation is intended to give some assurance that the code will
display correctly in any browser. By definition, browser-specific
extensions will display correctly only in one browser. Netscape
draws the most attention, of course, because that browser has
such a large market share. Netscape Communications has announced
that when HTML 3.0 is standardized, Netscape will support the
standard. Indeed, many of the tags and attributes in HTML 3.2
originally appeared in Navigator.
Note
Many other browsers, such as Microsoft's Internet Explorer, currently support some or all of the Netscape extensions.
Thus, you may decide it's reasonable to validate against HTML Level 2 Strict, then add enough HTML Level 3 features to give your page the desired appearance. The resulting page should validate successfully against the HTML Level 3 standard.
Finally, if the client wants a particular effect (such as a change in font size) that can be accomplished only by using Netscape, you have to use the Netscape tags and do three things:
If the desired page (as enhanced for Netscape) doesn't look acceptable in other browsers, don't just mark the page "Enhanced for Netscape!" For many reasons, at least five percent of the market does not use Navigator or a Navigator clone such as MSIE. Many of these users use browsers supplied by commercial online services (such as NetCruiser, from NetCom). These users are often among the least knowledgeable when it comes to understanding why Web pages have a certain appearance.
Various estimates place the size of the Web audience at around
30,000,000 people. Putting "Enhanced for Netscape!"
on a site turns away over one million potential customers. A better
solution is to redesign the page so that it takes advantage of
Netscape-specific features, but still looks good in other browsers.
Failing that, you might need to prepare more than one version
of the page and use META REFRESH
or another technique to serve up browser-specific versions of
the page. This is a lot of extra work, but it is better than turning
away five percent of the potential customers, or having them see
shoddy work.
Tip
One of the fastest ways to separate Netscape Navigator and its clones from less sophisticated browsers is to include a line like <META HTTP-EQUIV="Refresh" CONTENT="0; URL=/some/url.html"> in the <HEAD> section. Navigator and other high-end browsers recognize the META REFRESH sequence and immediately call up the page named in the <META> tag. Other browsers ignore the <META> tag, so they display the contents of the original page.
The good news is that most pages can be made to validate under HTML 3.2 and then can be enhanced for Netscape without detracting from their appearance in other browsers. Chapter 4, "Netscape Enhancements to HTML," discusses techniques for preparing such pages.
WebTech and KGV are formal validators-they report places where a document does not conform to the DTD. A document can be valid HTML, though, and still be poor HTML.
Part of what validators don't catch is content-related. Content problems are caught by copywriters, graphic artists, and human evaluators, as well as review by the client and developer. There are some other problems that can be caught by software, even though they are perfectly legal HTML.
Lack of ALT Tags
Here's an example of code that passes validation, but is nonetheless broken:
<IMG SRC="Graphics/someGraphic.gif" HEIGHT=50 WIDTH=100>
The problem here is a missing ALT attribute. When users visit this site with Lynx, or with a graphical browser with image loading turned off, they see a browser-specific placeholder. In Navigator, they see a "broken graphic" icon. In Lynx, they see [IMAGE].
If you add the ALT attribute, browsers that cannot display the graphic display the ALT text instead.
<IMG SRC="Graphics/someGraphic.gif" ALT="[Some Graphic]" HEIGHT=50 WIDTH=100>
Out-of-Sequence Headings
It's not an error to skip heading levels, but it's a poor idea. Some search engines look for <H1>, then <H2>, and so on to prepare an outline of the document. Yet, the code shown here is perfectly valid:

<H2>This is not the top level heading</H2>
<P>
Here is some text that is not the top-level heading.
</P>
<H1>This text should be the top level heading, but it is buried inside the document</H1>
<P>
Here is some more text.
</P>
Some designers skip levels, going from H1 to H3. This technique is a bad idea, too. First, the reason people do this is often to get a specific visual effect, but no two browsers render headers in quite the same way, so this technique is not reliable for that purpose. Second, automated tools (like some robots) that attempt to build meaningful outlines may become confused by missing levels.
There are several software tools available online that can help locate problems like these, including Doctor HTML and Weblint.
Once you have ensured that your pages validate against each of the DTDs you have selected, it's time to give your site a more rigorous workout.
One of the best online checkout tools is Doctor HTML, located at http://imagiware.com/RxHTML.cgi. Written by Thomas Tongue and Imagiware, Doctor HTML can run eight different tests on a page. The following list explains the tests in detail.
Figure 3.4: Doctor HTML's Hyperlink Analysis Test shows which links are suspect.
Caution
This test has a difficult time with on-page named anchors such as <A HREF="#more">.
Sometimes a link returns an unusually small message, such as This site has moved. Doctor HTML shows the size of the returned page so that such small messages can be tested manually.
Figure 3.5: Doctor HTML's Summary Report contains a wealth of information about the page.
Another online tool is the Perl script Weblint, written by Neil Bowers of Khoral Research. Weblint is distinctive in that it's available online at
http://www.unipress.com/web-lint/
and also can be copied from the Internet to a developer's local machine. The gzipped tar file of Weblint is available from
ftp://ftp.khoral.com/pub/weblint/weblint-1.014.tar.gz
A ZIPped version is available at
ftp://ftp.khoral.com/pub/weblint/weblint.zip
The Weblint home page is
http://www.khoral.com/staff/neilb/weblint.html
Tip
KGV (described earlier) offers an integrated Weblint with a particularly rigorous mode called the "pedantic option." You'll find it worthwhile to use this service.
What Is a Lint?
The original C compilers on UNIX let programmers get away with many poor practices. The language developers decided not to try to enforce good style in the compilers. Instead, compiler vendors wrote a lint, a program designed to "pick bits of fluff" from the program under inspection.
Weblint Warning Messages
Weblint is capable of performing 24 separate checks of an HTML document. The following list is adapted from the README file of Weblint 1.014, by Neil Bowers.
Weblint can check the document for the following:
When Weblint is run from the command line, the following combination of checks gives a document the most thorough workout:
weblint -pedantic -e upper-case,bad-link,require-doctype [filename]
The -pedantic switch turns
on all warnings except case,
bad-link, and require-doctype.
Note
The documentation says that -pedantic turns on all warnings except case, but that's incorrect.
The -e upper-case switch enables a warning about tags that aren't completely in uppercase. While there's nothing wrong with using lowercase, it's useful to be consistent. If you know that every occurrence of the BODY tag is <BODY> and never <body>, <Body>, or <BoDy>, then you can build automated tools that look at your document without worrying about tags that are in non-standard format.
The -e ..., bad-link switch enables a warning about missing links in the local directory. Consider the following example:
<A HREF="http://www.whitehouse.gov/">The White House</A> <A HREF="theBrownHouse.html">The Brown House</A> <A HREF="#myHouse">My House</A>
If you write this, Weblint (with the bad-link warning enabled) checks for the existence of the local file theBrownHouse.html. Links that begin with http:, news:, or mailto: are not checked. Neither are named anchors such as #myHouse.
The -e ..., require-doctype switch enables a warning about a missing <!DOCTYPE ...> tag.
Notice that the -x netscape switch is not included. Leave it off to show exactly which lines hold Netscape-specific tags. Never consider a page done until you're satisfied that you've eliminated as much Netscape-specific code as possible, and that you (and your client) can live with the rest. See Chapter 4, "Netscape Enhancements to HTML," for more specific recommendations.
If we use the Weblint settings in this section, and the sample code we tested earlier in the chapter with the WebTech validator and KGV, Weblint gives us these warning messages:
line 2: <HEAD> must immediately follow <HTML>
line 2: outer tags should be <HTML> .. </HTML>.
line 4: odd number of quotes in element <BODY BACKGROUND="Graphics/white.gif>.
line 4: <BODY> must immediately follow </HEAD>
line 4: <BODY> cannot appear in the HEAD element.
line 5: <H1> cannot appear in the HEAD element.
line 6: <P> cannot appear in the HEAD element.
line 8: <P> cannot appear in the HEAD element.
line 11: unmatched </HTML> (no matching <HTML> seen).
line 0: no closing </HEAD> seen for <HEAD> on line 2.

HTML source listing:
1.<!-- select doctype above... -->
2.<HEAD>
3.<TITLE>Test</TITLE>
4.<BODY BACKGROUND="Graphics/white.gif>
5.<H1>This is header one</H1>
6.<P>
7.This document is about nothing at all.
8.<P>
9.But the HTML is not much good!
10.</BODY>
11.</HTML>
Because Weblint is a Perl script and is available for download, you should pull it down onto the development machine. Here is an efficient process for delivering high-quality validated pages using a remote server:
Note
For this step, the -x netscape option is turned on. This option allows Weblint to read Netscape-specific tags without issuing a warning.
Figure 3.6: Weblint is aggressive and picky-just what you want in a lint.
The HTML Source Listing
With some online tools, such as KGV, any problematic source line is printed by the tool. With others, such as Weblint, it isn't. The forms interface for Weblint, available through
http://www.ccs.org/validate/
turns on the source listing by default. It's best if you leave it at that setting.
One of the advantages of using LiveWire is that the Site Manager includes integrated tests of every page on the site. These tests are a subset of the tests run by KGV, Weblint, and Doctor HTML, so you can't use Site Manager to replace these tests. In fact, Site Manager's checks are confined to various checks of link integrity-a subset of the tests made by Doctor HTML. But, because the Site Manager runs quickly and locally, it's a nice supplement.
Webmasters who have developed sites before LiveWire know that the single most difficult task a Webmaster faces is keeping the links working. As files are moved, copied, and renamed, one link or another inevitably ends up with the wrong URL. Site Manager offers three tools to help the Webmaster deal with this problem: automated link reorganization, a link validity checker, and the capability to automatically correct case mismatches.
Internal Link Reorganization
Suppose you have two pages that are under management with Site Manager: bar.html and baz.html. baz.html contains a link to bar.html, and vice versa. You decide to change the names of the files to something more meaningful, like first.html and last.html. In a conventional development environment, you change the file names and then painstakingly go through the files and change the links. In Site Manager, you have an easier way.
Make sure that bar.html and baz.html are under management. (Look
for the red triangle on the file icon in the left pane in Site
Manager.) Select the icon for bar.html and choose File, Rename
to change the name to first.html. Do the same thing with baz.html,
changing its name to last.html. Now examine both files with Netscape
Navigator. The links are updated to reflect the new names.
Caution
Site Manager keeps the links up-to-date, but does not change your content. If you have a link that points to toc.html and the link text says "Table of Contents," changing toc.html to index.html changes the link, but the text still reads "Table of Contents."
Site Manager should be the focal point of your development process.
Use Site Manager to add, delete, and modify your pages.
Caution
If you change a file outside Site Manager while Site Manager is running, the links are not updated. Try to avoid making changes outside Site Manager.
Checking the Links-Internal Links First
Even with the help of Site Manager, some links inevitably break. You can check quickly for these links by using Site Manager's Check Links menu items. Site Manager defines internal links as those within the site. External links go outside the site, even if they link to other sites on the same server.
On most sites, it's a good idea to start the links check by checking internal links. Not only do you have more control over these links, but these are also the links that are most likely to be broken during the early development effort. Internal link verification is also faster than external link verification because internal links link to pages on your hard drive-but external links may have to be exercised over the Internet.
Select the site's development directory in the left pane. Have Site Manager test the internal links by choosing Site, Check Internal Links. Then open the Site Links tab in the right pane. Turn off external links, if necessary, to reduce the clutter. Resize the panes and columns of the Site Links tab so that you can see the information presented. Figure 3.7 shows the resulting window.
Figure 3.7: Use the Site Links tab to see the invalid internal links.
To fix the invalid links, select one of the links and choose Edit, Modify Links. In the dialog box that appears, change the link to one that is valid. As you correct broken links, the link disappears from the Site Links tab.
The field at the bottom of the Site Links tab shows which pages contain the invalid link that is selected in the top pane. You can use this information to make more specific changes on a page-by-page basis if the problem is more complex than just a typographic error.
Repairing Links that Have Mismatched Case
The most common reason for invalid internal links is mismatched case. One person builds a page and calls it toc.html. The next person includes a link to TOC.html. The link is invalid because TOC.html doesn't exist.
Before LiveWire, Webmasters spent a lot of time on problems like this one. Now, Site Manager can fix these problems quickly. Choose Site, Repair Case Sense Problems-Site Manager puts up a list of links that would work, if only they were the right case. Figure 3.8 shows such a list.
Figure 3.8: Site Manager can quickly repair all links broken because of a case mismatch.
To see which page Site Manager thinks is the proper destination for the link, select the link with the case problem. Site Manager's choice is shown in the field at the bottom of the dialog box. If you want, you can edit that choice. When you're satisfied that Site Manager will do the right thing, click Fix Links.
Checking External Links
After all of the internal links are working, you're ready to move on to external links. Checking external links often takes longer because the connection over the network is slower than the hard drive, but usually there are fewer external links than internal ones.
Choose Site, Check External Links to start the checking process. When it completes, open the Site Links tab in the right pane and check the External Links check box. Use the pop-up menu to restrict the view to Invalid links only if you need to reduce the clutter. Figure 3.9 shows a typical list of external links.
If a more sophisticated fix is needed, use Modify Links from the Edit menu to fix links or edit the page with the invalid link.
Links that Cannot Be Checked
You may notice that the status of some links says "Unchecked" or "Never Checked." Unchecked links should be rare-they represent links that have been added since you last selected Site, Check External Links. "Never Checked" links are links to non-Web servers, such as mailto links. The only way to verify these links is to send mail-which Site Manager leaves up to you.
HTML forms are most programmers' first introduction to server-side scripts such as CGI or server-side JavaScript. Matt's Script Archive, online at http://www.worldwidemart.com/scripts/, contains formmail.pl, a CGI script that reads the contents of a form and sends it on to a designated recipient. As the complexity of the form grows, some Webmasters want to split it so that each page of the form depends upon the answers to the page before it. In order to build a multipart form, the concept of "state" must be added to HTTP.
The Hypertext Transfer Protocol, HTTP, the protocol of the Web, is stateless-that is, the server sees each request as a stand-alone transaction. When a user submits page 2 of a multipart form, the server and CGI scripts have no built-in mechanism for associating this user's page 2 with his or her page 1. These state-preserving mechanisms have to be grafted onto HTTP by using any of several techniques, including modified URLs, hidden fields, and Netscape "cookies."
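For example, one common approach before LiveWire was to carry a session identifier forward in a hidden form field, so that each page of the multipart form hands the identifier back to the server. This is only an illustrative sketch; the script name, field names, and value are hypothetical:

<FORM METHOD="POST" ACTION="page2.cgi">
<INPUT TYPE="hidden" NAME="sessionid" VALUE="a1b2c3">
<INPUT TYPE="text" NAME="address">
<INPUT TYPE="submit" VALUE="Continue">
</FORM>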
When you write CGI scripts, you must choose one of these mechanisms and laboriously add it. Fortunately, when you use LiveWire (Netscape's application development environment), the Application Manager takes care of this work for you.
The nature of HTTP is that each request stands alone. If the server receives a series of requests for related files from the same machine, that's all well and good, but as far as the server is concerned, it has no reason to suspect that these requests come from the same user. If you're trying to build, say, a shopping cart application, little things like keeping the right user with the right shopping cart become important. And HTTP provides no support for this task. None.
Prior to Netscape's introduction of LiveWire, all the work of state preservation was up to the CGI programmer.
This section shows how HTTP works and why it's impossible to remember state with HTTP alone. This section also shows some mechanisms that can be grafted onto HTTP to meet the need for state preservation.
Anyone who has entered an URL has wondered about the letters "http" and why they're omnipresent on the Web. HTTP is a series of handshakes that are exchanged between a browser, like Netscape Navigator, and the server.
You can find many different servers. CERN, the research center in Switzerland that did the original development of the Web, has one. So does the National Center for Supercomputer Applications (NCSA), the organization that did much of the early work on the graphical portions of the Web. Of course, Netscape sells two second-generation Web servers; one for entry-level use and one for high-volume sites and the Internet. The one thing all Web servers have in common is that they speak HTTP.
The definitive description of HTTP is found at
http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v10-spec-03.html
That document contains a detailed memo from the HTTP Working Group of the Internet Engineering Task Force. The current version, HTTP/1.0, is the standard for how all communication is accomplished over the Web.
Communication on the Internet takes place by using a set of protocols named TCP/IP, which stands for Transmission Control Protocol/Internet Protocol. Think of TCP/IP as being similar to the telephone system and HTTP as a conversation that two people have over the phone.
The Request
When a user enters an URL, such as http://www.xyz.com/index.html, TCP/IP on the user's machine talks to the network name servers to find out the IP address of the xyz.com server. TCP/IP then opens a conversation with the machine named www at that domain. TCP defines a set of ports-each of which can provide some service-on a server. By default, the http server (commonly named httpd) is listening on port 80.
The client software (a browser like Netscape Navigator) starts the conversation. To get the file named index.html from www.xyz.com, the browser says the following to the designated port on the designated server:
GET /index.html HTTP/1.0
Note
After this line, the browser sends optional headers, followed by a second <CRLF> that causes the server to process the request.
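For illustration, a complete request with a couple of optional headers might look something like the following; the header values here are only examples, and each browser sends its own set:

GET /index.html HTTP/1.0
User-Agent: Mozilla/3.0 (WinNT; I)
Accept: text/html, image/gif, image/jpeg, */*
<CRLF>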
Formally, index.html is an
instance of a uniform resource identifier (URI). A uniform
resource locator (URL) is a type of URI.
Note
Web specifications include provisions for identifiers to specify a particular document, regardless of where that document is located. Other provisions can enable a browser to recognize that two documents are different versions of the same original-differing in language, perhaps, or in format (for example, one may be plain text, and another might be in Adobe Portable Document Format, PDF). For now, most servers and browsers know about only one type of URI: the URL.
The GET method asks the server
to return whatever information is indicated by the URI. If the
URI represents a file (for example, index.html), then the contents
of the file are returned. If the URI represents a process (such
as formmail.cgi), then the server runs the process and sends the
output.
Note
This explanation is a bit simplified, since the server has to be configured to run CGI scripts. Because this book concentrates on server-side JavaScript rather than CGI, this section does not describe how to configure a server for CGI.
Most commonly, the URI is expressed in terms relative to the document root of the server. For example, the server can be configured to serve pages starting at
/usr/local/etc/httpd/htdocs
If the user wants a file, for instance, whose full path is
/usr/local/etc/httpd/htdocs/hypertext/WWW/TheProject.html
the client sends the following instruction:
GET /hypertext/WWW/TheProject.html HTTP/1.0
The HTTP/1.0 at the end of the line indicates to the server which version of HTTP the client is able to accept. As the HTTP standard evolves, this field is used to provide backwards compatibility to older browsers.
The Response
When the server gets a request, it generates a response. The response a client wants usually looks something like this:

HTTP/1.0 200 OK
Date: Mon, 19 Feb 1996 17:24:19 GMT
Server: Apache/1.0.2
Content-type: text/html
Content-length: 5244
Last-modified: Tue, 06 Feb 1996 19:23:01 GMT

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
<HTML>
<HEAD>
.
.
.
</BODY>
</HTML>
The first line is called the status line. It contains three elements, separated by spaces: the HTTP version, a status code, and a reason phrase.
When the server is able to find and return an entity associated with the requested URI, the server returns status code 200, which has the reason phrase OK.
The first digit of the status code (the code returned by
the Web server in the status line) defines the class of response.
Table 3.3 lists the five classes.
Class | Meaning
1xx Informational | These codes are not used, but are reserved for future use.
2xx Success | The request was successfully received, understood, and accepted.
3xx Redirection | Further action must be taken in order to complete the request.
4xx Client error | The request contained bad syntax or could not be fulfilled through no fault of the server.
5xx Server error | The server failed to fulfill an apparently valid request.
Table 3.4 shows the individual values of all status codes presently
in use and a typical reason phrase for each code. Reason
phrases are associated with status codes to provide a human-readable
explanation of the status. These phrases are given as examples
in the standard-each site or server can replace these phrases
with local equivalents.
Status Code | Reason Phrase
200 | OK
201 | Created
202 | Accepted
203 | Partial Information
204 | No Content
301 | Moved Permanently
302 | Moved Temporarily
303 | Method
304 | Not Modified
400 | Bad Request
401 | Unauthorized
402 | Payment Required
403 | Forbidden
404 | Not Found
500 | Internal Server Error
501 | Not Implemented
502 | Server Temporarily Overloaded (Bad Gateway)
503 | Server Unavailable (Gateway Timeout)
The most common responses are 200, 204, 302, 401, 404, and 500. These and other status codes are discussed more fully in the document located at
http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTRESP.html
Status code 200 was described earlier in this section. It means that the request has succeeded and data is coming.
Code 204 means that the document has been found, but it is completely empty. This code is returned if the developer associated an empty file with an URL, perhaps as a placeholder. The most common browser response when code 204 is returned is to leave the current data on-screen and put up an alert dialog box that says Document contains no data or something to that effect.
When a document has been moved, a code 3xx is returned. Code 302 is most commonly used when the URI is a CGI script that outputs something like the following:
Location: http://www.xyz.com/newPage.html
Typically, this line is followed by two line feeds. A server-side JavaScript programmer initially sends a 302 response when he or she uses the redirect() function.
Most browsers recognize code 302 and look in the Location: line to see which URL to retrieve; they then issue a GET to the new location. Chapter 6, "LiveWire and Server-Side JavaScript," contains details about outputting Location: using redirect().
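In server-side JavaScript, that exchange can be as simple as the following sketch; the destination URL is just the example used above:

<SERVER>
// Sends a 302 response whose Location: header points at the new page
redirect("http://www.xyz.com/newPage.html");
</SERVER>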
Status code 401 is seen when the user accesses a protected directory. The response includes a WWW-Authenticate header field with a challenge. Typically, a browser interprets a code 401 by giving the user an opportunity to enter a user name and password.
Status code 402 has some tantalizing possibilities. So far, it has not been implemented in any common browsers or servers. Chapter 18, "Learning More About Netscape ONE Technology," describes Netscape's plans to offer an online digital wallet that enables the user to pay a site owner.
When working on new CGI scripts, the developer frequently sees code 500. The most common explanation of code 500 is that the script has a syntax error or is producing a malformed header. LiveWire applications are much less likely to generate status code 500 because the header is generated by LiveWire itself, and not by the user's application.
Other Requests
The preceding examples involve GET, the most common request. A client can also send requests involving HEAD, POST, and "conditional GET."
Note
The HTTP standard also provides for a PUT method. Although PUT is not commonly used on the Web, Netscape uses it to implement the "One-button Publishing" feature of Netscape Navigator Gold. With one-button publishing, a person using Navigator Gold can send a Web page to the server with a single click of the mouse, without resorting to complex FTP software.
The HEAD request is just like the GET request, except no data is returned. HEAD can be used by special programs called proxy servers to test URIs, either to see whether an updated version is available or to ensure that the URI is available at all. Proxy servers are special server configurations that collect Web pages from standard servers, as though they were a Web client, and serve them back to Web clients, as though they were a conventional server.
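For example, a proxy server that cares only about the headers for index.html could send this request:

HEAD /index.html HTTP/1.0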
See "Livewire and Server-Side
JavaScript," p. 157
See "Learning More About Netscape ONE Technology,"
p. 455
POST is like GET
in reverse; POST is used
to send data to the server. Developers use POST
most frequently when writing CGI scripts and applications to handle
form output.
Note
As the LiveWire application developer, you don't see any difference between GET and POST, so it may be difficult for you to choose which method to use. The rule of thumb is this-some platforms put a limit on the number of characters that can be passed in environment variables, which is the method by which GET is implemented. STDIN-the mechanism used by POST-is not subject to such a limit. Unless you know that the number of characters is small, always use POST.
Typically, a POST request brings a code 200 or code 204 response.
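To make the direction of the data flow concrete, here is a sketch of what a POST of a small form might look like on the wire. The script name echoes the formmail.pl example mentioned earlier in this chapter; the path and the field names are only illustrative:

POST /cgi-bin/formmail.pl HTTP/1.0
Content-type: application/x-www-form-urlencoded
Content-length: 22
<CRLF>
name=Pat&comment=Hello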
Requests Through Proxy Servers
Some online services, like America Online, set up machines to be proxy servers. A proxy server sits between the client and the real server. When the client sends a GET request to, say, www.xyz.com, the proxy server checks to see whether it has the requested data stored locally. This local storage is called a cache.
If the requested data is available in the cache, the proxy server determines whether to return the cached data or the version that's on the real server. This decision usually is made on the basis of time-if the proxy server has a recent copy of the data, it can be more efficient to return the cached copy.
To find out whether the data on the real server has been updated, the proxy server can send a conditional GET, like this:
GET /index.html HTTP/1.0
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
<CRLF>
If the request would not normally succeed, the response is the same as though the request were a GET. The request is processed as a GET if the date is invalid (including a date that's in the future). The request also is processed as a GET if the data has been modified since the specified date.
If the data has not been modified since the requested date, the server returns status code 304 (Not Modified).
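In that case, the response consists of headers only, something like the following illustration (the date and server name are placeholders):

HTTP/1.0 304 Not Modified
Date: Mon, 19 Feb 1996 17:30:00 GMT
Server: Apache/1.0.2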
If the proxy server sends a conditional GET, it either gets back data, or it doesn't. If it gets data, it updates the cache copy. If it gets code 304, it sends the cached copy to the user. If it gets any other code, it passes that code back to the client.
Header Fields
If-Modified-Since is an example of a header field. Here are the four types of header fields: general headers, request headers, response headers, and entity headers.
General headers may be used on a request or on the data. Data can flow both ways. On a GET request, data comes from the server to the client. On a POST request, data goes to the server from the client. In either case, the data is known as the entity.
Here are the three general headers defined in the standard: Date, MIME-Version, and Pragma.
By convention, the server should send its current date with the response. By the standard, only one Date header is allowed.
Although HTTP does not conform to the MIME standard, it is useful to report content types by using MIME notation. To avoid confusion, the server may send the MIME Version that it uses. MIME Version 1.0 is the default.
Optional behavior can be described in Pragma directives. HTTP/1.0 defines the no-cache directive on request messages to tell proxy servers to ignore their cached copy and GET the entity from the server.
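For example, a request that insists on a fresh copy of index.html might look like this:

GET /index.html HTTP/1.0
Pragma: no-cache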
Request header fields are sent by the browser software. The request header fields include Authorization, From, If-Modified-Since, Referer, and User-Agent.
Referer can be used by LiveWire applications to determine the preceding link. For example, if an application developer announces a client's site to a major search engine, he or she can keep track of the Referer variable to see how often users follow that link to get to the client's site.
User-Agent is sent by the browser to report which software and version the user is running. This field ultimately appears in the request.agent property and can be used to return pages with browser-specific code.
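As a sketch of how a LiveWire application might use this value (the test and the messages are hypothetical):

<SERVER>
// request.agent holds the User-Agent string the browser reported
if (request.agent.indexOf("Mozilla") >= 0)
    write("<P>Welcome, Navigator (or Navigator-clone) user.</P>");
else
    write("<P>Welcome.</P>");
</SERVER>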
Response header fields appear in server responses and can be used by the browser software. Here are the valid response header fields: Location, Server, and WWW-Authenticate.
Location is the same "Location" mentioned earlier in this chapter, in the section entitled "The Response." Most browsers expect to see a Location field in a response with a 3xx code, and interpret it by requesting the entity at the new location.
Server gives the name and version number of the server software.
WWW-Authenticate is included in responses with status code 401. The syntax is
WWW-Authenticate: 1#challenge
The browser reads the challenge(s)-there must be at least one-and asks the user to respond. Most popular browsers handle this process with a dialog box that prompts the user for a user name and password. Figure 3.10 shows the Netscape FastTrack server and Netscape Navigator challenging a user for authentication information.
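For example, a server protecting a directory with basic authentication might respond with headers like these (the realm name is only an illustration):

HTTP/1.0 401 Unauthorized
WWW-Authenticate: Basic realm="Protected Pages"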
Entity header fields contain information about the data. Recall that the data is called the entity; information about the contents of the entity body, or metainformation, is sent in entity header fields. Much of this information can be supplied in an HTML document by using the <META> tag. The earlier section of this chapter, "Validating and Checking HTML," shows one use of the <META> tag.
The entity header fields include Allow, Content-Encoding, Content-Length, Content-Type, Expires, and Last-Modified.
In addition, new field types can be added to an entity without
extending the protocol. It's up to the author to determine what
software (if any) will recognize the new type. Client software
ignores entity headers that it doesn't recognize.
Note
Netscape uses this mechanism to implement the HTTP-EQUIV field in the <META> tag.
The Expires header is used as another mechanism to keep caches up-to-date. For example, an HTML document might contain the following line:
<META http-equiv="Expires" Contents="Thu, 01 Dec 1994 16:00:00 GMT">
This line means that a proxy server should discard the document
at the indicated time and should not send out data after that
time without retrieving a fresh copy from the server.
Note
The exact format of the date is specified by the standard, and the date must always be in Greenwich Mean Time (GMT).
Nothing in HTTP associates the sender of one request with any other request, past or future. But suppose you want to implement a multipart form, like the one shown in Figures 3.11 and 3.12.
Figure 3.11: The user fills in the first page of the Mortgage Advisor form.
Here's another example. The first page, shown in Figure 3.13, is part of a shopping application-the user places items into a shopping cart.
Figure 3.13: The shopper adds items to the shopping cart.
Later, when the shopper reviews the order, the items displayed must match the ones the shopper has been putting in the cart. Figure 3.14 shows the order, presented for review.
After the user is finished shopping, the user checks out, as shown in Figure 3.15. The system must ensure that the shopping cart follows the user to the checkout page so that the order fulfillment portion of the site knows what to tell the site owner to ship.
Clearly, these and other applications will work only if you can find a way to graft state preservation onto the stateless HTTP.
Fundamentally, state information is retained in just two places: the client or the server. This section describes the mechanisms available to the LiveWire application developer.
Application Manager enables the user to choose from five techniques for state preservation. These choices appear in the right pane of Application Manager in the field entitled Client Object Maintenance, shown in Figure 3.16.
Figure 3.16: Application Manager affords the developer five techniques for preserving client state.
The client-based choices are client-cookie and client-url.
The server-based choices are server-ip, server-cookie, and server-url.
The remainder of this section describes how these options work and identifies when they are appropriate choices.
Client URL
To see how Client URL state preservation works, go to Application Manager and select the World application. Follow the Modify link; when the right pane shows the modify frame, change the Client Object Maintenance type to client-url. Figure 3.17 shows this change in progress.
Figure 3.17: Change the Client Object Maintenance type through Application Manager.
Now, choose http://your.server.domain/world/ to run the application. Enter your name in the field and press Enter-and watch the URL at the top of the window. You sent the browser to http://your.server.domain/world/-the default page is actually hello.html-but the browser shows that you are at http://your.server.domain/world/hello.html?NETSCAPE_LIVEWIRE.oldname=null. Run the application again, and the URL changes to http://your.server.domain/world/hello.html?NETSCAPE_LIVEWIRE.oldname=yourName. What is going on here?
Open your editor to the source file for hello.html. (Don't just do a View, Document Source. You need to see what's between the <SERVER> tags.)
The operative line is the one that reads
client.oldname = request.newname;
When you run the application the first time, you have not yet submitted the form, so the properties of the request object that come from the form are null. The assignment statement stuffs that null into a property that the programmer defined on the client object: oldname. When the application finishes with the request, it has to store the property oldname somewhere so that it can reconstruct the client object on the next request.
Where will it store client's properties? Where you told it to-in the URL.
When the programmer writes the source code, he or she shouldn't have to worry about which mechanism you are going to choose for state preservation. So, the programmer tells the application to submit the form to hello.html (in the ACTION attribute of the FORM tag). How did hello.html transform into /hello.html?NETSCAPE_LIVEWIRE.oldname=yourName?
If you have a UNIX machine, you can find out by going to the command line. (The procedure is a bit different on a Windows NT server, but the process is the same.) Change directories to the location of the hello.html source file. Now enter
lwcomp -d -c hello.html | more
For now, it's enough to know that the -d compiler switch causes the compiler to show the code it produced from the input file. The -c switch tells the compiler not to produce a new output file-just check the code. The pipe to more is useful because the line you're interested in is near the top of the file, and you don't want the output to scroll off the top of the screen.
Look at the lines that write the <FORM...> tag:
write<"\n\n<h3> Enter your name... </h3>\n<form method=\"post\" action=\""); writeURL("hello.html"); write("\">\n<input type=...
writeURL() is a special server-side JavaScript function that knows enough to look at the current state preservation mechanism before it runs. If the state preservation mechanism is set to client-url, as it is now, writeURL() appends the client information to the URL.
To see this, go back to the browser window that is running World and do a View, Document Source. Here you see the line that the LiveWire compiler actually produced in response to the code that contained writeURL("hello.html"):
<form method="post" action="hello.html?NETSCAPE_LIVEWIRE.oldname="null">
When the server gets the request, it is in the form
POST /hello.html?NETSCAPE_LIVEWIRE.oldname=null HTTP/1.0
followed by the contents of the field on the form. The server runs the LiveWire application that is associated with the hello.html page-LiveWire pulls off the oldname parameter and attaches it to the client object.
Tip |
In most cases, the LiveWire compiler can figure out where the URLs are and substitute writeURL() for write(). If you know that your application is building a URL dynamically, be sure to use writeURL(). You can check to see which function the compiler is using by looking at the compiler's output with the -d switch. Always use writeURL() for dynamic URLs, even if you plan to use a state-preserving mechanism that doesn't rely on URLs. That way, a Webmaster can safely change Client Object Maintenance to client-url or server-url without having to worry about whether the application will run correctly. |
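For instance, if a page assembles a link target at run time, pass the finished string through writeURL() so that the client information can ride along when a URL-based mechanism is selected. A brief sketch, assuming a hypothetical review.html page and itemCount property:
<SERVER>
// Build the link target dynamically, then emit it with writeURL()
// so LiveWire can append NETSCAPE_LIVEWIRE properties if needed.
var target = "review.html?display=" + client.itemCount;
write("<a href=\"");
writeURL(target);
write("\">Review your order</a>");
</SERVER>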
The principal advantage of the client-url method is that it works with all browsers. Most Web sites receive the largest percentage of visits from Netscape browsers, but still find that 10 to 30 percent of their visitors use non-Netscape browsers.
The principal disadvantage of this approach is that, as the number of parameters grows, the URL can become quite long. Some applications have five, ten, or more parameters attached to their client object. Using the client to remember these parameters can take up a fair amount of bandwidth.
Caution |
Note that all of the encoding for the URL must be in place before the content is sent back to the client machine. After the page is returned to the client, no more opportunity exists to add or change properties on the client object. Try to finish the setup of the client object before you begin to write to the client. As a minimum, recall that output to the client is written to a 64K buffer-after you've written 64K, the buffer is sent. (The buffer is sent before that time if your application calls flush().) If you use client-URL encoding, you must finish setting the properties in the client object before the buffer is sent. |
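In practice, that means doing all of the client assignments at the top of the page, before any output is generated. Here is a sketch of the safe ordering; the property names cartId and itemCount are hypothetical:
<SERVER>
// Set every client property first, while nothing has been
// written to the output buffer...
client.cartId = request.cartId;
client.itemCount = request.itemCount;

// ...and only then start generating the page body.
write("<p>Your cart holds " + client.itemCount + " items.</p>");
</SERVER>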
Caution |
Another drawback of both the client-url and server-url methods is that the client must stay in the application to preserve the information in the URL. If a user is in the middle of the application, and then goes out to another site, his or her URL information can be lost. |
Server URL Go back to Application Manager and change Client Object Maintenance to server-url. Now go back to the World application and repeat the process of entering your name in the field.
This time, the URL says something like
http://your.server.domain/world/hello.html?NETSCAPE_LIVEWIRE_ID=002822754597
The URL doesn't hold the contents of the client object. Instead, it holds an ID number that points to a bit of shared memory on the server. The client maintains the pointer just like it maintained the data itself when you used client-url. When the user submits the form, LiveWire strips off the ID and uses it to look up the client properties and set up the client object.
The server-url mechanism for preserving state offers the same advantage as client-url-it works with any browser, not just Netscape Navigator. And it consumes far less bandwidth because only the ID number is passed back and forth to the server.
Server IP Another approach that works with all browsers-though not with all service providers-is the server-ip mechanism.
If you rerun the experiment from the last two sections with Client Object Maintenance set to server-ip, you won't see anything unusual appearing in the URL. Instead, the server keeps track of which user is which by examining the client's Internet Protocol (IP) address. This approach works well as long as each client has a unique fixed IP address. On the Internet, though, that assumption often breaks down.
Note |
Are We Running out of IP Addresses? On the Internet, many people connect through service providers that dynamically allocate IP addresses. An IP address is a 32-bit number, usually written as four 8-bit numbers in dotted-decimal form, like this: 207.20.8.1. An eight-bit number can express 256 different values, so there are at most 256^4 unique IP addresses, which works out to a theoretical maximum of 4,294,967,296. The practical limit is much lower-some numbers are reserved for special purposes, and the numbers are broken into five classes, depending on how many computers a company is placing on the Net. By some estimates, nearly 10 million computers are on the Internet today on a full-time basis. However, the rate of growth is fast enough and the practical limit low enough that valid concern exists about running out of IP addresses. All of the huge class A addresses have been allocated, and most of the class Bs are in use, so many large companies are making do with multiple class C addresses. One stop-gap measure is for an ISP to dynamically allocate its assigned addresses to users as they connect. Suppose an ISP has around 2,000 subscribers, but at any given moment only about 200 of them are online. Instead of tying up 2,000 IP addresses, the ISP may request a single Class C address-a block of 254 usable addresses-and give each user an IP address when he or she connects. As long as the number of users connected never exceeds the size of the block, the ISP can service all of its subscribers. Some systems use CIDR (Classless Interdomain Routing) or DHCP (Dynamic Host Configuration Protocol) to help with this problem. Others are holding out for the next generation of IP, called IPng. When IPng becomes a reality, perhaps all machines will have a unique address. Until that day, the server-ip mechanism is reliable only in a controlled environment, such as an intranet. |
Intranets often have most of their machines permanently online. A large company may have a single Class B license, with over 65,000 unique addresses. For most intranets, server-ip can offer all of the advantages of the URL-based methods, yet consumes no extra bandwidth at all.
Of course, many intranets are large enough for applications to be accessed through in-house proxy servers. This design can break the server-ip method as well because each request comes to the application from the IP address of the proxy server.
In short, feel free to use server-ip if you can, but be aware of the restrictions.
The remaining methods are based on the Netscape cookie-a browser-specific mechanism introduced by Netscape and now used by about a dozen browsers. This section describes how cookies work in general and shows how they are used by LiveWire.
To start using a cookie, a server application must ask the user's browser to set up a cookie. The server sends a header like this:
Set-Cookie: NAME=VALUE; expires=DATE; path=PATH; domain=DOMAIN_NAME; secure
If the server application is a CGI script, the programmer has to manage each of these fields directly. If the application is a LiveWire application, the installer just has to set Client Object Maintenance to one of the cookie mechanisms. Nevertheless, understanding each field enables you to know what LiveWire can do for you.
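For example, a CGI-style survey application at xyz.com might send a header with every field filled in; the values here are purely illustrative:
Set-Cookie: PRODUCT=BaffleBlaster; expires=Mon, 03-Jun-96 00:00:00 GMT; path=/survey; domain=xyz.com; secure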
Tip |
The latest specification for Netscape cookies is available online at http://www.netscape.com/newsref/std/cookie_spec.html. |
NAME The application sets the name to something meaningful-LiveWire always uses NETSCAPE_LIVEWIRE.propName=propValue; to avoid conflicts with other applications such as CGI scripts. In a multipage survey for the XYZ company, NAME may be set to PRODUCT=BaffleBlaster. NAME is the only required field in Set-Cookie.
expires After a server asks the browser to set up a cookie, that cookie remains on the user's system until the cookie expires. When the user visits the site again, the browser presents its cookie, and the application can read the information stored in it. For some applications, a cookie may be useful for an indefinite period. For others, the cookie has a definite lifetime. In the example of the survey, the cookie is not useful after the survey ends. Using the standard HTTP date notation in Greenwich Mean Time (GMT), an application can force the cookie to expire by sending an expiration date, as shown throughout this chapter. Here is an example:
Set-Cookie: NAME=XYZSurvey12; expires=Mon, 03-Jun-96 00:00:00 GMT;
After the expiration date is reached, the cookie is no longer stored or given out. If no expiration date is given, the cookie expires when the user exits the browser. LiveWire applications give the client object a default expiration of ten minutes but leave the cookie's expires field empty.
Unexpired cookies are deleted from the client's disk if certain internal limits are hit. For example, Navigator has a limit of 300 cookies, with no more than 20 cookies per path and domain. The maximum size of one cookie is 4K.
domain Each cookie has a domain for which it is valid. When an application asks a browser to set up or send its cookie, the browser compares the URL of the server with the domain attributes of its cookies. The browser looks for a tail match. That is, if the cookie domain is xyz.com, the domain matches www.xyz.com, or pluto.xyz.com, or mercury.xyz.com. If the domain is one of the seven special top-level domains, the browser expects at least two periods in the matching domain. If the domain is not one of the special seven, there must be at least three periods. The seven special domains are COM, EDU, NET, ORG, GOV, MIL, and INT. Thus, www.xyz.com matches xyz.com, but atl.ga.us does not match ga.us.
If no domain is specified, the browser uses the name of the server as the default domain name. LiveWire does not set the domain.
Order is important in Set-Cookie. If you set up your own cookies (as CGI scripts do), do not put the domain before the name, or the browser becomes confused.
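The tail-matching rule itself is simple enough to express in a few lines of JavaScript. This sketch is only an illustration of the rule (it ignores the period-count requirement described above); it is not a function that the browser or LiveWire exposes:
// Returns true if host falls within cookieDomain under tail matching.
// For example, tailMatch("www.xyz.com", "xyz.com") is true.
function tailMatch(host, cookieDomain) {
    var start = host.length - cookieDomain.length;
    return (start >= 0 && host.substring(start, host.length) == cookieDomain);
}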
path If the server domain's tail matches a cookie's domain attribute, the browser performs a path match. The purpose of path-matching is to allow multiple cookies per server. For example, a user visiting www.xyz.com may take a survey at http://www.xyz.com/survey/ and get a cookie named XYZSurvey12. That user may also report a tech support problem at http://www.xyz.com/techSupport/ and get a cookie called XYZTechSupport. Each of these cookies should set the path so that the appropriate cookie is retrieved later.
Tip |
Note that, because of a defect in Netscape Navigator 1.1 and earlier, cookies that have an expires attribute must have their path explicitly set to "/" in order for the cookie to be saved correctly. As the old versions of Netscape disappear from the Internet, this fact will become less significant. |
Paths match from the top down. A cookie with path /techSupport matches a request on the same domain from /techSupport/wordProcessingProducts/.
By default, the path attribute is set to the path of the URL that responded with the Set-Cookie request. For example, when you access the World application at http://your.server.com/world/, LiveWire sets the path to /world.
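Path matching is an equally simple prefix test. Again, this is just an illustration of the rule, not an exposed API:
// Returns true if requestPath falls under cookiePath. For example,
// pathMatch("/techSupport/wordProcessingProducts/", "/techSupport") is true.
function pathMatch(requestPath, cookiePath) {
    return (requestPath.indexOf(cookiePath) == 0);
}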
secure A cookie is marked secure by putting the word secure at the end of the request. A secure cookie is sent only to a server offering HTTPS (HTTP over SSL).
By default, cookies are sent in the clear over nonsecure channels. The current version of LiveWire does not use the secure field.
Making Cookies Visible Cookies are stored in a disk file on the client computer. For example, Netscape Navigator stores cookies on a Windows computer in a file called cookies.txt. On a UNIX machine, Navigator uses the file name cookies. On the Mac, the file is called MagicCookie. These files are simple ASCII text-you can examine them with a text editor. Don't change them, though, or you can confuse the applications that wrote them.
Most LiveWire application cookies don't make it into the cookies file, however, because they are set to expire when the user quits the browser. To see LiveWire application cookies, you have to pretend to be the browser.
Note |
As an application programmer, you can set the client to expire a given number of seconds after the last request. The default is 600 seconds, or 10 minutes. |
If an application calls the client.expiration() method and uses one of the cookie-based Client state maintenance mechanisms, the browser saves the cookie to the hard drive.
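For example, a page in a LiveWire application can extend the client object's lifetime to a full day with a single call; the 24-hour figure is just an illustration:
<SERVER>
// Keep this visitor's client object (and its cookie) alive for
// 86,400 seconds-24 hours-after the last request.
client.expiration(86400);
</SERVER>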
Set Client Object Maintenance for the World application to client-cookie. Then use telnet to connect to your Web server.
By default, Web servers listen to port 80. If the URL for the World application is http://your.server.domain/world/, connect to port 80. If the URL looks like http://your.server.domain:somePort/world/, connect to the indicated port.
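For example, if the World application lives on the default port, the telnet command looks like this:
telnet your.server.domain 80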
After you are connected, send
HEAD /world/hello.html HTTP/1.0
Press the Enter key twice after typing that line-once to end the line, and once to tell the server that you've sent all the header lines (in this case, none). Note that the HEAD method is used because you don't want to see all the lines of the page-just the header.
Your server responds with something like this:
HTTP/1.0 200 OK
Server: Netscape-FastTrack/2.0a
Date: Sat, 29 Jun 1996 10:52:32 GMT
Set-cookie: NETSCAPE_LIVEWIRE.number=0; path=/world
Set-cookie: NETSCAPE_LIVEWIRE.oldname=null; path=/world
Content-type: text/html
You'll recognize many of these header lines from the earlier discussion of HTTP. The Set-cookie lines tell the browser to remember this information and send it back with subsequent requests.
Caution |
Like the URL mechanism, cookies must be sent to the client before the page contents. Try to set up the client object's properties before sending content. You must set up the client's properties before the buffer flushes at 64K, or you'll miss your chance to put the properties in the cookie. |
Like client-URLs, client-cookies can start consuming a fair amount of bandwidth. More importantly, large numbers of client properties can overflow the browser cookie table. Recall that Navigator has a limit of 300 cookies, with no more than 20 cookies per path and domain. The maximum size of one cookie is 4K. If your application requires more client properties, consider switching to short cookies, called server-cookies in Application Manager.
Using the short cookie technique, the cookie contains an ID number that identifies where on the server the data may be found. For example, a shopping cart script might use a cookie on the user's machine to store the fact that this shopper is working on order 142. When the shopper connects to the server, the server looks up its record of order 142 to find out what items are on the order.
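In LiveWire terms, that usually means keeping only a small key on the client and parking the bulky data in the project object on the server. The property names in this sketch (orderId and the order entries) are hypothetical, not part of the World sample:
<SERVER>
// The client carries only a key; the short cookie stores client.orderId.
client.orderId = "142";

// The order details stay in server memory, indexed by that key.
project["order" + client.orderId] = "3 Baffle Blasters, 1 sprocket";

// On a later request, the key from the cookie recovers the order.
write("<p>Your order: " + project["order" + client.orderId] + "</p>");
</SERVER>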
Change Client Object Maintenance for the World application once more-this time to server-cookie. Now go back to telnet and again make a HEAD request to World. This time you see:
HTTP/1.0 200 OK
Server: Netscape-FastTrack/2.0a
Date: Sat, 29 Jun 1996 19:03:44 GMT
Set-cookie: NETSCAPE_LIVEWIRE_ID=002823513984; path=/world; expires=Sat, 29 Jun 1996 19:13:44 GMT
Content-type: text/html
Note first that the short cookie is an ID number-it works the same way the ID did in the server URL. Note, too, the expiration date. A user can get this cookie, exit the browser, and then restart the browser, and still join his or her old client object.
On intranets, where all of the browsers are cookie-aware, server-cookies have a lot of advantages. Because they write only one cookie (with the ID) to the browser, the network overhead is negligible, and the browser's cookie table is unlikely to overflow.
On the Internet, however, Webmasters still have to deal with the possibility of getting visits from browsers that don't know about cookies. These sites may be best served by a mechanism like server-url that is browser independent.
Tip |
After you select a state-preserving mechanism that you will use most of the time, go back to Application Manager, choose the Config link, and select that mechanism as the default. |
In general, servers are more powerful machines than the desktop clients. Often, you can take advantage of this fact by distributing processing between the two machines and piggybacking information for the client onto the client's state-preservation mechanism.
Here's an example of how that works: Suppose you are using client cookies as your state preservation mechanism. Then the client-side JavaScript property document.cookie contains the cookies whose names begin with NETSCAPE_LIVEWIRE. Listing 3.2 shows a function that gives the client access to these cookies. Listing 3.3 shows a function to set or change a cookie.
Listing 3.2 -Read a Netscape Cookie from the Client Script
function getCookie(Name) {
    var search = "NETSCAPE_LIVEWIRE." + Name + "="
    var RetStr = ""
    var offset = 0
    var end = 0
    if (document.cookie.length > 0) {
        offset = document.cookie.indexOf(search)
        if (offset != -1) {
            offset += search.length
            end = document.cookie.indexOf(";", offset)
            if (end == -1)
                end = document.cookie.length
            // unescape() reverses the escape() encoding applied in setCookie()
            RetStr = unescape(document.cookie.substring(offset, end))
        }
    }
    return (RetStr);
}
Listing 3.3 -Set a Netscape Cookie from the Client Script
function setCookie(Name, Value, Expire) {
    // escape() encodes the value; getCookie() reverses it with unescape()
    document.cookie = "NETSCAPE_LIVEWIRE." + Name + "=" + escape(Value) +
        ((Expire == null) ? "" : ("; expires=" + Expire.toGMTString()))
}
Listing 3.4 shows how to use these two functions.
Listing 3.4 -Use a Netscape Cookie from the Client Script
var Kill = new Date()
Kill.setDate(Kill.getDate() + 7)
var value = getCookie("answer")
if (value == "")
    setCookie("answer", "42", Kill)
else
    document.write("The answer is ", value)
For several months, the move from HTML 2.0 to HTML 3.0 languished. The draft version of the HTML 3.0 standard was allowed to expire while working groups debated various aspects of the standard. Finally, the World Wide Web Consortium has announced a new specification (HTML 3.2) developed in cooperation with the leading browser vendors (including Netscape, Microsoft, and Sun).
HTML 3.2 includes several new features that have been part of the day-to-day HTML world for several months, including tables, applets, and the <!DOCTYPE...> tag.
ON THE WEB |
http://www.w3.org/pub/WWW/MarkUp/Wilbur/ This site contains an overview of the new HTML 3.2 specification, including links to the "Features at a Glance" page and the working draft of the specification. |
ON THE WEB |
http://www.w3.org/pub/WWW/TR/WD-html32.html Go directly to the latest draft of the HTML 3.2 specification. The HTML 3.2 standard is a "work in progress," but has advanced sufficiently so that many browser vendors are using it to guide their development efforts. |
HTML 3.2 is by no means the last word in HTML standardization. Indeed, it is only a working draft, and discussions continue about exactly which features will appear in the final version. Look for new developments in scripting, forms, frames, and "meta-math"-a proposal for an extensible notation for math that can be processed by symbolic math systems.
ON THE WEB |
http://www.w3.org/pub/WWW/TR/ This site contains links to technical reports and publications of the World Wide Web Consortium. Use this site to follow the working drafts of various proposed HTML features. |
Membership in the World Wide Web Consortium's HTML working group is open to all interested parties. You can join the HTML Working Group by sending a subscription request to [email protected]. A subscription request should contain the word subscribe in the Subject field of the message. (If you want to subscribe under a different address, put that address in the Reply-To field of the message.) You can also get help about the list, or information about the archives, by putting help or archive help in the Subject field instead of subscribe.
ON THE WEB |
http://www.w3.org/pub/WWW/MarkUp/HTML-WG/ This site contains information about the HTML Working Group of the Internet Engineering Task Force. Visit here to learn more about participating in the ongoing maintenance of the HTML standard. |
ON THE WEB |
http://www.w3.org/pub/WWW/MarkUp/Activity/ Visit this site to see the World Wide Web Consortium's statement of direction concerning HTML. The index page here includes links to practical advice on which HTML tags can be considered "standard." |
Netscape has publicly stated its commitment to support standard HTML in Navigator. Netscape participates in the standardization process; many of the tags and attributes in HTML 3.2 appeared first as Netscape extensions.
To learn more about Netscape's view of standard HTML, and about HTML in general, download the HTML Reference from the Netscape ONE SDK site.
ON THE WEB |
http://developer.netscape.com/library/documentation/htmlguid/index.htm Reach this online guide through the Netscape ONE SDK site. It describes a range of HTML features, with emphasis on the elements added by Netscape. |