This section takes you step by step through the process of constructing an XML document. Along the way, you'll gain experience with the XML components you'll typically use to create your data structures.
Writing a Simple XML File
You'll start by writing the kind of XML data you can use for a slide presentation. To become comfortable with the basic format of an XML file, you'll use your text editor to create the data. You'll use this file and extend it in later exercises.
Creating the File
Using a standard text editor, create a file called slideSample.xml.
Note: Here is a version of it that already exists: slideSample01.xml. (The browsable version is slideSample01-xml.html.) You can use this version to compare your work or just review it as you read this guide.
Writing the Declaration
Next, write the declaration, which identifies the file as an XML document. The declaration starts with the characters <?, which is also the standard XML identifier for a processing instruction. (You'll see processing instructions later in this tutorial.)
<?xml version='1.0' encoding='utf-8'?>
This line identifies the document as an XML document that conforms to version 1.0 of the XML specification and says that it uses UTF-8, the Unicode character-encoding scheme based on 8-bit code units. (For information on encoding schemes, see Appendix A.)
Because the document has not been specified as standalone, the parser assumes that it may contain references to other documents. To see how to specify a document as standalone, see The XML Prolog.
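As a rough illustration of what a parser reads from the declaration, the following sketch uses the expat binding in the Python standard library. Python is not the language of this tutorial; it is used here only because the check is short. The function name read_declaration is hypothetical, and the empty <slideshow/> element is added only because a parser needs a root element to accept the input.

```python
import xml.parsers.expat


def read_declaration(data: bytes) -> dict:
    """Return the fields the parser reports for the XML declaration.

    Hypothetical helper for illustration; not part of the tutorial's code.
    """
    seen = {}

    def on_decl(version, encoding, standalone):
        # expat reports standalone as 1 (yes), 0 (no), or -1 (not specified).
        seen.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(data, True)
    return seen


# The root element is a placeholder so that the input is a complete document.
decl = read_declaration(b"<?xml version='1.0' encoding='utf-8'?><slideshow/>")
print(decl)
```

Because the declaration does not mention standalone, the parser reports that setting as unspecified rather than as yes or no.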
Adding a Comment
Comments are ignored by XML parsers. A program will never see them unless you activate special settings in the parser. To put a comment into the file, add the following highlighted text.
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
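The following sketch, written in Python as an assumption of convenience rather than as the tutorial's own tooling, shows both behaviors: a default parse that drops the comment, and a parse with a comment callback switched on. The root element is included only so that the input is a complete document.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

DOC = b"""<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow>
</slideshow>"""

# Default behavior: the tree holds the root element and no trace of the comment.
root = ET.fromstring(DOC)
default_children = list(root)

# Opt-in behavior: register a comment callback to receive the comment text.
comments = []
parser = xml.parsers.expat.ParserCreate()
parser.CommentHandler = comments.append
parser.Parse(DOC, True)

print(root.tag, default_children, comments)
```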
Defining the Root Element
After the declaration, every XML file defines exactly one element, known as the root element. Any other elements in the file are contained within that element. Enter the following highlighted text to define the root element for this file, slideshow:
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow>
</slideshow>
Note: XML element names are case-sensitive. The end tag must exactly match the start tag.
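One way to check the two rules above (a single root element, and an end tag that matches the start tag exactly) is to hand small documents to a parser and see which ones it accepts. This is a minimal sketch in Python; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<slideshow></slideshow>"))   # matching tags
print(is_well_formed("<slideshow></SlideShow>"))   # case differs in the end tag
print(is_well_formed("<slideshow></slideshow><slideshow></slideshow>"))  # two roots
```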
Adding Attributes to an Element
A slide presentation has a number of associated data items, none of which requires any structure. So it is natural to define these data items as attributes of the slideshow element. Add the following highlighted text to set up some attributes:
...
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>
</slideshow>
When you create a name for a tag or an attribute, you can use hyphens (-), underscores (_), colons (:), and periods (.) in addition to letters and digits, although a name cannot begin with a digit, a hyphen, or a period. Unlike HTML, values for XML attributes are always in quotation marks, and multiple attributes are never separated by commas.
Note: Colons should be used with care or avoided, because they are used when defining the namespace for an XML document.
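The attribute rules can be tried out with a short Python sketch (Python and the helper name parses are assumptions for illustration). It reads the three attributes back from the slideshow element and then shows that unquoted values and comma-separated attributes are rejected.

```python
import xml.etree.ElementTree as ET

DOC = """<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
</slideshow>"""

root = ET.fromstring(DOC)
attrs = dict(root.attrib)
print(attrs)


def parses(text: str) -> bool:
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<slideshow title='Single quotes work too'/>"))
print(parses("<slideshow title=Sample></slideshow>"))         # unquoted value
print(parses('<slideshow title="a", date="b"></slideshow>'))  # comma separator
```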
Adding Nested Elements
XML allows for hierarchically structured data, which means that an element can contain other elements. Add the following highlighted text to define a slide element and a title element contained within it:
<slideshow
...
>
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
</slideshow>
Here you have also added a type attribute to the slide. The idea of this attribute is that you can earmark slides for a mostly technical or mostly executive audience using type="tech" or type="exec", or identify them as suitable for both audiences using type="all".
More importantly, this example illustrates the difference between things that are more usefully defined as elements (the title element) and things that are more suitable as attributes (the type attribute). The visibility heuristic is primarily at work here. The title is something the audience will see, so it is an element. The type, on the other hand, is something that never gets presented, so it is an attribute. Another way to think about that distinction is that an element is a container, like a bottle. The type is a characteristic of the container (tall or short, wide or narrow). The title is a characteristic of the contents (water, milk, or tea). These are not hard-and-fast rules, of course, but they can help when you design your own XML structures.
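To see how a program might use the type attribute, consider this minimal Python sketch. The two extra slides with type="tech" and type="exec" are invented for the example, and titles_for is a hypothetical helper; the tutorial's own file contains only type="all" slides at this point.

```python
import xml.etree.ElementTree as ET

# The "tech" and "exec" slides below are hypothetical additions for illustration.
DOC = """<slideshow>
    <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
    <slide type="tech"><title>Hypothetical technical slide</title></slide>
    <slide type="exec"><title>Hypothetical executive slide</title></slide>
</slideshow>"""

root = ET.fromstring(DOC)


def titles_for(root, audience: str) -> list:
    """Return the titles of slides meant for the audience or for everyone."""
    return [
        slide.findtext("title")           # element content: what gets shown
        for slide in root.findall("slide")
        if slide.get("type") in (audience, "all")  # attribute: selection metadata
    ]


print(titles_for(root, "tech"))
print(titles_for(root, "exec"))
```

The attribute drives the selection and never appears in the result, while the element content is what reaches the audience.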
Adding HTML-Style Text
Because XML lets you define any tags you want, it makes sense to define a set of tags that look like HTML. In fact, the XHTML standard does exactly that. You'll see more about that toward the end of the SAX tutorial. For now, type the following highlighted text to define a slide with a couple of list item entries that use an HTML-style <em> tag for emphasis (usually rendered as italicized text):
...
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>
Note that defining a title element conflicts with the XHTML element that uses the same name. Later in this tutorial, we discuss the mechanism that produces the conflict (the DTD), along with possible solutions.
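An item that mixes plain text with an <em> element is sometimes said to have mixed content. How a program sees that content depends on the API; the sketch below uses Python's ElementTree, chosen here only for brevity, which splits the text into the part before the child element, the child's own text, and the part that follows it.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item>Why <em>WonderWidgets</em> are great</item>")
em = item.find("em")

# ElementTree stores text before the first child in .text and text after a
# child element in that child's .tail attribute.
parts = (item.text, em.text, em.tail)

# itertext() walks all text in document order, so joining it drops the markup.
full_text = "".join(item.itertext())

print(parts)
print(full_text)
```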
Adding an Empty Element
One major difference between HTML and XML is that all XML must be well formed, which means that every tag must have an ending tag or be an empty tag. By now, you're getting pretty comfortable with ending tags. Add the following highlighted text to define an empty list item element with no contents:
...
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>
Note that any element can be an empty element. All it takes is ending the tag with /> instead of >. You could do the same thing by entering <item></item>, which is equivalent.
Note: Another factor that makes an XML file well formed is proper nesting. So <b><i>some_text</i></b> is well formed, because the <i>...</i> sequence is completely nested within the <b>...</b> tag. This sequence, on the other hand, is not well formed: <b><i>some_text</b></i>.
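Both points can be checked with a parser. This Python sketch (the helper name is_well_formed is hypothetical) compares the two spellings of an empty element and then tries the properly and improperly nested sequences from the note.

```python
import xml.etree.ElementTree as ET

short = ET.fromstring("<item/>")
paired = ET.fromstring("<item></item>")

# Both forms parse to an element with no text and no children, and both
# serialize to the same bytes.
same_serialization = ET.tostring(short) == ET.tostring(paired)
print(short.text, paired.text, len(short), len(paired), same_serialization)


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<b><i>some_text</i></b>"))   # properly nested
print(is_well_formed("<b><i>some_text</b></i>"))   # overlapping tags
```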
The Finished Product
Here is the completed version of the XML file:
<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>
Save a copy of this file as slideSample01.xml so that you can use it as the initial data structure when experimenting with XML programming operations.
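As a quick sanity check on the saved file, a sketch along these lines writes the finished document to a temporary directory and reads it back. Python is an assumption of convenience here, and the temporary location stands in for wherever you keep slideSample01.xml.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

SLIDES = """<?xml version='1.0' encoding='utf-8'?>
<!-- A SAMPLE set of slides -->
<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
    <!-- TITLE SLIDE -->
    <slide type="all">
        <title>Wake up to WonderWidgets!</title>
    </slide>
    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>
</slideshow>
"""

with tempfile.TemporaryDirectory() as tmp:
    # Temporary stand-in for the real location of slideSample01.xml.
    path = Path(tmp) / "slideSample01.xml"
    path.write_text(SLIDES, encoding="utf-8")
    root = ET.parse(str(path)).getroot()

titles = [slide.findtext("title") for slide in root.iter("slide")]
item_count = len(root.findall("./slide/item"))
print(titles, item_count)
```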
Writing Processing Instructions
It sometimes makes sense to code application-specific processing instructions in the XML data. In this exercise, you'll add a processing instruction to your slideSample.xml file.
Note: The file you'll create in this section is slideSample02.xml. (The browsable version is slideSample02-xml.html.)
As you saw in Processing Instructions, the format for a processing instruction is <?target data?>, where target is the application that is expected to do the processing, and data is the instruction or information for it to process. Add the following highlighted text to add a processing instruction for a mythical slide presentation program that will query the user to find out which slides to display (technical, executive-level, or all):
<slideshow
...
>
<!-- PROCESSING INSTRUCTION -->
<?my.presentation.Program QUERY="exec, tech, all"?>
<!-- TITLE SLIDE -->
Notes:
* The data portion of the processing instruction can contain spaces, or it can even be empty. But there cannot be any space between the initial <? and the target identifier.
* The data begins after the first space.
* It makes sense to fully qualify the target with the complete web-unique package prefix, to preclude any conflict with other programs that might process the same data.
* For readability, it seems like a good idea to include a colon (:) after the name of the application:
<?my.presentation.Program: QUERY="..."?>
The colon makes the target name into a kind of "label" that identifies the intended recipient of the instruction. However, even though the W3C spec allows a colon in a target name, some versions of Internet Explorer 5 (IE5) consider it an error. For this tutorial, then, we avoid using a colon in the target name.
Save a copy of this file as slideSample02.xml so that you can use it when experimenting with processing instructions.
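To see how a parser splits a processing instruction into target and data, you can try a sketch like the following. It uses the expat binding in the Python standard library as a stand-in for whatever parser your application uses; read_instructions and rejects are hypothetical helper names.

```python
import xml.parsers.expat


def read_instructions(data: bytes) -> list:
    """Return (target, data) pairs for each processing instruction found."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, text: found.append((target, text))
    )
    parser.Parse(data, True)
    return found


def rejects(data: bytes) -> bool:
    """Return True if the parser refuses the input (hypothetical helper)."""
    try:
        read_instructions(data)
        return False
    except xml.parsers.expat.ExpatError:
        return True


DOC = b"""<slideshow>
    <?my.presentation.Program QUERY="exec, tech, all"?>
</slideshow>"""

instructions = read_instructions(DOC)
print(instructions)

# A space between <? and the target is not allowed.
print(rejects(b'<slideshow><? my.presentation.Program QUERY="x"?></slideshow>'))
```

The parser hands over the data as one uninterpreted string; splitting QUERY from its value is left to the application named by the target.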
Introducing an Error
The parser can generate three kinds of errors: a fatal error, an error, and a warning. In this exercise, you'll make a simple modification to the XML file to introduce a fatal error. Later, you'll see how it's handled in the Echo application.
Note: The XML structure you'll create in this exercise is in slideSampleBad1.xml. (The browsable version is slideSampleBad1-xml.html.)
One easy way to introduce a fatal error is to remove the final / from the empty item element to create a tag that does not have a corresponding end tag. That constitutes a fatal error, because all XML documents must, by definition, be well formed. Do the following:
1. Copy slideSample02.xml to slideSampleBad1.xml.
2. Edit slideSampleBad1.xml and remove the / from the empty item element (<item/>) shown here:
...
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
...
This change produces the following:
...
<item>Why <em>WonderWidgets</em> are great</item>
<item>
<item>Who <em>buys</em> WonderWidgets</item>
...
Now you have a file that you can use to generate an error in any parser, any time. (XML parsers are required to generate a fatal error for this file, because the lack of an end tag for the <item> element means that the XML structure is no longer well formed.)
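Before you reach the Echo application, you can confirm the error with any parser you have at hand. The sketch below uses Python's ElementTree purely as a convenient stand-in; parse_error is a hypothetical helper, and the exact wording of the message varies from parser to parser.

```python
import xml.etree.ElementTree as ET

# The overview slide after the / has been removed from the empty item element.
BAD = """<slideshow>
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide>
</slideshow>"""

# Restoring the / gives the well-formed version back.
GOOD = BAD.replace("<item>\n", "<item/>\n")


def parse_error(text: str):
    """Return the parser's error message, or None if the text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


message = parse_error(BAD)
print(message)
print(parse_error(GOOD))
```

The error is reported at the </slide> end tag rather than at the <item> start tag, because that is the first point at which the parser can tell that an element was left open.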
Substituting and Inserting Text
In this section, you'll learn about
* Handling special characters (<, &, and so on)
* Handling text with XML-style syntax
