Tuesday, September 23, 2008

Generating XML Data

This section takes you step by step through the process of constructing an XML document. Along the way, you'll gain experience with the XML components you'll typically use to create your data structures.

Writing a Simple XML File


You'll start by writing the kind of XML data you can use for a slide presentation. To become comfortable with the basic format of an XML file, you'll use your text editor to create the data. You'll use this file and extend it in later exercises.

Creating the File


Using a standard text editor, create a file called slideSample.xml.

Note: Here is a version of it that already exists: slideSample01.xml. (The browsable version is slideSample01-xml.html.) You can use this version to compare your work or just review it as you read this guide.

Writing the Declaration


Next, write the declaration, which identifies the file as an XML document. The declaration starts with the characters <?, which is also the standard XML identifier for a processing instruction. (You'll see processing instructions later in this tutorial.)

<?xml version='1.0' encoding='utf-8'?>

This line identifies the document as an XML document that conforms to version 1.0 of the XML specification and says that it uses UTF-8, the 8-bit Unicode character-encoding scheme. (For information on encoding schemes, see Appendix A.)

Because the document has not been specified as standalone, the parser assumes that it may contain references to other documents. To see how to specify a document as standalone, see The XML Prolog.
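
As a rough illustration of what a parser learns from the declaration, the following sketch uses the minidom parser from Python's standard library (an assumption made for brevity; the programs later in this tutorial use Java). The version, encoding, and standalone attributes are minidom implementation details, and the one-element document is only a stand-in for slideSample.xml.

```python
from xml.dom.minidom import parseString

# Bytes input, so the parser reads the encoding named in the declaration.
source = b"<?xml version='1.0' encoding='utf-8'?><slideshow/>"
doc = parseString(source)

print(doc.version)     # version taken from the declaration
print(doc.encoding)    # encoding taken from the declaration
print(doc.standalone)  # stays None when standalone is not specified
```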

Adding a Comment

Comments are ignored by XML parsers. A program will never see them unless you activate special settings in the parser. To put a comment into the file, add the following highlighted text.

<?xml version='1.0' encoding='utf-8'?>

<!-- A SAMPLE set of slides -->
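
To see what "ignored unless you activate special settings" can look like in practice, here is a minimal sketch using ElementTree from Python's standard library (an assumption; it is not the parser used later in this tutorial). The insert_comments option requires Python 3.8 or later, and the comment is placed inside the root element because ElementTree does not keep comments that precede it.

```python
import xml.etree.ElementTree as ET

source = "<slideshow><!-- TITLE SLIDE --><slide type='all'/></slideshow>"

# Default parser: the comment is dropped, so only the slide element is visible.
default_root = ET.fromstring(source)
default_children = [child.tag for child in default_root]

# Opt-in setting: ask the tree builder to keep comments in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
commented_root = ET.fromstring(source, parser=parser)
comments = [child.text for child in commented_root if child.tag is ET.Comment]

print(default_children)
print(comments)
```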

Defining the Root Element

After the declaration, every XML file defines exactly one element, known as the root element. Any other elements in the file are contained within that element. Enter the following highlighted text to define the root element for this file, slideshow:

<?xml version='1.0' encoding='utf-8'?>

<!-- A SAMPLE set of slides -->

<slideshow>

</slideshow>

Note: XML element names are case-sensitive. The end tag must exactly match the start tag.
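
The single-root and case-sensitivity rules can be checked with any XML parser. The sketch below assumes Python's standard-library ElementTree; is_well_formed is a hypothetical helper name introduced for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

matching = is_well_formed("<slideshow></slideshow>")
case_mismatch = is_well_formed("<slideshow></Slideshow>")
second_root = is_well_formed("<slideshow></slideshow><extra></extra>")

print(matching, case_mismatch, second_root)
```

Only the first document is accepted: the second has an end tag that differs in case, and the third defines two top-level elements.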

Adding Attributes to an Element

A slide presentation has a number of associated data items, none of which requires any structure. So it is natural to define these data items as attributes of the slideshow element. Add the following highlighted text to set up some attributes:

...
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>
</slideshow>
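
Once the attributes are in place, a program can read them by name. This is a minimal sketch that assumes Python's standard-library ElementTree rather than the Java APIs used later in the tutorial.

```python
import xml.etree.ElementTree as ET

source = """
<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
</slideshow>
"""

root = ET.fromstring(source)

title = root.get("title")          # value of one attribute
names = sorted(root.attrib)        # all attribute names on the element
missing = root.get("subtitle")     # None: no such attribute was defined

print(title, names, missing)
```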

When you create a name for a tag or an attribute, you can use hyphens (-), underscores (_), colons (:), and periods (.) in addition to letters and digits. Unlike HTML, values for XML attributes are always in quotation marks, and multiple attributes are never separated by commas.

Note: Colons should be used with care or avoided, because they are used when defining the namespace for an XML document.

Adding Nested Elements

XML allows for hierarchically structured data, which means that an element can contain other elements. Add the following highlighted text to define a slide element and a title element contained within it:

<slideshow
...
>

<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>

</slideshow>

Here you have also added a type attribute to the slide. The idea of this attribute is that you can earmark slides for a mostly technical or mostly executive audience using type="tech" or type="exec", or identify them as suitable for both audiences using type="all".

More importantly, this example illustrates the difference between things that are more usefully defined as elements (the title element) and things that are more suitable as attributes (the type attribute). The visibility heuristic is primarily at work here. The title is something the audience will see, so it is an element. The type, on the other hand, is something that never gets presented, so it is an attribute. Another way to think about that distinction is that an element is a container, like a bottle. The type is a characteristic of the container (tall or short, wide or narrow). The title is a characteristic of the contents (water, milk, or tea). These are not hard-and-fast rules, of course, but they can help when you design your own XML structures.
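
One way a presentation program might use the type attribute is sketched below, assuming Python's standard-library ElementTree. The "tech" and "exec" slides and the titles_for helper are hypothetical additions for this illustration; only the title slide comes from the file you are building.

```python
import xml.etree.ElementTree as ET

# The "tech" and "exec" slides are hypothetical additions for this sketch.
source = """
<slideshow>
  <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
  <slide type="tech"><title>How it Works</title></slide>
  <slide type="exec"><title>Financial Forecast</title></slide>
</slideshow>
"""

def titles_for(root, audience):
    """Return titles of slides earmarked for this audience or for everyone."""
    return [
        slide.findtext("title")
        for slide in root.iter("slide")
        if slide.get("type") in (audience, "all")
    ]

root = ET.fromstring(source)
tech_titles = titles_for(root, "tech")
exec_titles = titles_for(root, "exec")

print(tech_titles)
print(exec_titles)
```

The type attribute steers the selection but never appears in the output, while the title element supplies what the audience sees.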

Adding HTML-Style Text

Because XML lets you define any tags you want, it makes sense to define a set of tags that look like HTML. In fact, the XHTML standard does exactly that. You'll see more about that toward the end of the SAX tutorial. For now, type the following highlighted text to define a slide with a couple of list item entries that use an HTML-style <em> tag for emphasis (usually rendered as italicized text):

...
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>

<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>

</slideshow>

Note that defining a title element conflicts with the XHTML element that uses the same name. Later in this tutorial, we discuss the mechanism that produces the conflict (the DTD), along with possible solutions.
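
Items that mix text with inline tags such as <em> are delivered to a program in pieces. The following sketch, which assumes Python's standard-library ElementTree, shows one way those pieces can be inspected and joined back together.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item>Why <em>WonderWidgets</em> are great</item>")
em = item.find("em")

leading = item.text                   # text before the <em> element
emphasized = em.text                  # text inside the <em> element
trailing = em.tail                    # text after </em>, still inside <item>
plain = "".join(item.itertext())      # all text with the markup removed

print(repr(leading), repr(emphasized), repr(trailing))
print(plain)
```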

Adding an Empty Element

One major difference between HTML and XML is that all XML must be well formed, which means that every tag must have an ending tag or be an empty tag. By now, you're getting pretty comfortable with ending tags. Add the following highlighted text to define an empty list item element with no contents:

...
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>

</slideshow>

Note that any element can be an empty element. All it takes is ending the tag with /> instead of >. You could do the same thing by entering <item></item>, which is equivalent.

Note: Another factor that makes an XML file well formed is proper nesting. So <b><i>some_text</i></b> is well formed, because the <i>...</i> sequence is completely nested within the <b>...</b> tag. This sequence, on the other hand, is not well formed: <b><i>some_text</b></i>.
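
Both points can be confirmed with a parser. This sketch assumes Python's standard-library ElementTree: it compares the two spellings of an empty element and then feeds the improperly nested sequence to the parser.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<item/>")
long_form = ET.fromstring("<item></item>")

# Both spellings yield an element with the same name, no text, and no children.
same = (short_form.tag, short_form.text, len(short_form)) == (
    long_form.tag, long_form.text, len(long_form))

# Overlapping tags are rejected outright.
try:
    ET.fromstring("<b><i>some_text</b></i>")
    nesting_error = None
except ET.ParseError as error:
    nesting_error = str(error)

print(same)
print(nesting_error)
```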

The Finished Product

Here is the completed version of the XML file:

<?xml version='1.0' encoding='utf-8'?>

<!-- A SAMPLE set of slides -->
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>

<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>

<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>

Save a copy of this file as slideSample01.xml so that you can use it as the initial data structure when experimenting with XML programming operations.

Writing Processing Instructions

It sometimes makes sense to code application-specific processing instructions in the XML data. In this exercise, you'll add a processing instruction to your slideSample.xml file.

Note: The file you'll create in this section is slideSample02.xml. (The browsable version is slideSample02-xml.html.)

As described in Processing Instructions, the format for a processing instruction is <?target data?>, where target is the application that is expected to do the processing, and data is the instruction or information for it to process. Add the following highlighted text to add a processing instruction for a mythical slide presentation program that will query the user to find out which slides to display (technical, executive-level, or all):

<slideshow
...
>
<!-- PROCESSING INSTRUCTION -->
<?my.presentation.Program QUERY="exec, tech, all"?>
<!-- TITLE SLIDE -->

Notes:

* The data portion of the processing instruction can contain spaces, or it can even be empty. But there cannot be any space between the initial <? and the target identifier.
* The data begins after the first space.
* It makes sense to fully qualify the target with the complete web-unique package prefix, to preclude any conflict with other programs that might process the same data.
* For readability, it seems like a good idea to include a colon (:) after the name of the application:

<?my.presentation.Program: QUERY="..."?>

The colon makes the target name into a kind of "label" that identifies the intended recipient of the instruction. However, even though the W3C spec allows a colon in a target name, some versions of Internet Explorer 5 (IE5) consider it an error. For this tutorial, then, we avoid using a colon in the target name.

Save a copy of this file as slideSample02.xml so that you can use it when experimenting with processing instructions.
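
To see how a parser splits a processing instruction into its target and data, consider this minimal sketch. It assumes the minidom parser from Python's standard library and a shortened copy of the slide show; the mythical presentation program itself is not involved.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

source = """<slideshow>
  <!-- PROCESSING INSTRUCTION -->
  <?my.presentation.Program QUERY="exec, tech, all"?>
  <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
</slideshow>"""

doc = parseString(source)

# Collect (target, data) pairs for processing instructions under the root.
instructions = [
    (node.target, node.data)
    for node in doc.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

Everything up to the first space becomes the target, and the remainder is handed over as a single uninterpreted data string.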

Introducing an Error

The parser can generate three kinds of errors: a fatal error, an error, and a warning. In this exercise, you'll make a simple modification to the XML file to introduce a fatal error. Later, you'll see how it's handled in the Echo application.

Note: The XML structure you'll create in this exercise is in slideSampleBad1.xml. (The browsable version is slideSampleBad1-xml.html.)

One easy way to introduce a fatal error is to remove the final / from the empty item element to create a tag that does not have a corresponding end tag. That constitutes a fatal error, because all XML documents must, by definition, be well formed. Do the following:

1. Copy slideSample02.xml to slideSampleBad1.xml.
2. Edit slideSampleBad1.xml and remove the final / from the empty <item/> tag shown here:

...
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
...

This change produces the following:

...
<item>Why <em>WonderWidgets</em> are great</item>
<item>
<item>Who <em>buys</em> WonderWidgets</item>
...

Now you have a file that you can use to generate an error in any parser, any time. (XML parsers are required to generate a fatal error for this file, because the lack of an end tag for the <item> element means that the XML structure is no longer well formed.)
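
The three error levels map onto three callbacks in SAX-style parsers. The sketch below assumes the xml.sax module from Python's standard library rather than the Echo application, and uses a shortened copy of the broken slide; RecordingErrorHandler is a hypothetical name.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

class RecordingErrorHandler(ErrorHandler):
    """Remember fatal errors, then re-raise them as the default handler would."""

    def __init__(self):
        self.fatal = []

    def fatalError(self, exception):
        self.fatal.append((exception.getLineNumber(), exception.getMessage()))
        raise exception

# The second <item> has lost its closing slash, as in slideSampleBad1.xml.
bad = b"""<slide type="all">
<item>Why <em>WonderWidgets</em> are great</item>
<item>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>"""

handler = RecordingErrorHandler()
try:
    xml.sax.parseString(bad, ContentHandler(), handler)
except xml.sax.SAXParseException:
    pass  # parsing stops at the first fatal error

print(handler.fatal)
```

The error is reported where the parser meets </slide> while the stray <item> is still open, not on the line where the slash was removed.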

Substituting and Inserting Text

In this section, you'll learn about

* Handling special characters (<, &, and so on)
* Handling text with XML-style syntax
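
As a preview of the first topic, here is one common way to handle special characters programmatically. The sketch assumes the escape function from Python's standard-library xml.sax.saxutils module; the sample text is hypothetical.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "if (a < b && c > d)"

# Replace &, <, and > with the predefined entity references.
escaped = escape(raw)

# A parser turns the references back into the original characters.
item = ET.fromstring(f"<item>{escaped}</item>")

print(escaped)
print(item.text)
```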

How Can You Use XML?

There are several basic ways to use XML:

* Traditional data processing, where XML encodes the data for a program to process
* Document-driven programming, where XML documents are containers that build interfaces and applications from existing components
* Archiving--the foundation for document-driven programming--where the customized version of a component is saved (archived) so that it can be used later
* Binding, where the DTD or schema that defines an XML data structure is used to automatically generate a significant portion of the application that will eventually process that data

Traditional Data Processing

XML is fast becoming the data representation of choice for the web. It's terrific when used in conjunction with network-centric Java platform programs that send and retrieve information. So a client-server application, for example, could transmit XML-encoded data back and forth between the client and the server.

In the future, XML is potentially the answer for data interchange in all sorts of transactions, as long as both sides agree on the markup to use. (For example, should an email program expect to see tags named <FIRST> and <LAST>, or <FIRSTNAME> and <LASTNAME>?) The need for common standards will generate a lot of industry-specific standardization efforts in the years ahead. In the meantime, mechanisms that let you "translate" the tags in an XML document will be important. Such mechanisms include projects such as the Resource Description Framework initiative (RDF), which defines meta tags, and the Extensible Stylesheet Language specification (XSL), which lets you translate XML tags into other XML tags.

Document-Driven Programming


The newest approach to using XML is to construct a document that describes what an application page should look like. The document, rather than simply being displayed, consists of references to user interface components and business-logic components that are "hooked together" to create an application on-the-fly.

Of course, it makes sense to use the Java platform for such components. To construct such applications, you can use JavaBeans components for interfaces and Enterprise JavaBeans components for the business logic. Although none of the efforts undertaken so far is ready for commercial use, much preliminary work has been done.

Note: The Java programming language is also excellent for writing XML-processing tools that are as portable as XML. Several visual XML editors have been written for the Java platform. For a listing of editors, see http://www.xml.com/pub/pt/3. For processing tools and other XML resources, see Robin Cover's SGML/XML web page at http://xml.coverpages.org/software.html.

Binding


After you have defined the structure of XML data using either a DTD or one of the schema standards, a large part of the processing you need to do has already been defined. For example, if the schema says that the text data in a <date> element must follow one of the recognized date formats, then one aspect of the validation criteria for the data has been defined; it only remains to write the code. Although a DTD specification cannot go to the same level of detail, a DTD (like a schema) provides a grammar that tells which data structures can occur and in what sequences. That specification tells you how to write the high-level code that processes the data elements.

But when the data structure (and possibly format) is fully specified, the code you need to process it can just as easily be generated automatically. That process is known as binding--creating classes that recognize and process different data elements by processing the specification that defines those elements. As time goes on, you should find that you are using the data specification to generate significant chunks of code, and you can focus on the programming that is unique to your application.

Archiving


The Holy Grail of programming is the construction of reusable, modular components. Ideally, you'd like to take them off the shelf, customize them, and plug them together to construct an application, with a bare minimum of additional coding and additional compilation.

The basic mechanism for saving information is called archiving. You archive a component by writing it to an output stream in a form that you can reuse later. You can then read it and instantiate it using its saved parameters. (For example, if you saved a table component, its parameters might be the number of rows and columns to display.) Archived components can also be shuffled around the web and used in a variety of ways.

When components are archived in binary form, however, there are some limitations on the kinds of changes you can make to the underlying classes if you want to retain compatibility with previously saved versions. If you could modify the archived version to reflect the change, that would solve the problem. But that's hard to do with a binary object. If an object's state were archived in text form using XML, on the other hand, then anything and everything in it could be changed as easily as you can say, "Search and replace." Such considerations have prompted a number of investigations into using XML for archiving.

XML's text-based format could also make it easier to transfer objects between applications written in different languages. For all these reasons, there is a lot of interest in XML-based archiving.
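
The idea can be sketched in a few lines. The example below assumes Python's standard library; the Table component, its XML layout, and the archive and restore helpers are all hypothetical and do not represent any particular archiving standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Table:
    """Hypothetical component whose saved parameters are rows and columns."""
    rows: int
    columns: int

def archive(table):
    """Write the component's parameters as a small XML document."""
    element = ET.Element("table", rows=str(table.rows), columns=str(table.columns))
    return ET.tostring(element, encoding="unicode")

def restore(text):
    """Instantiate a component from its archived parameters."""
    element = ET.fromstring(text)
    return Table(rows=int(element.get("rows")), columns=int(element.get("columns")))

saved = archive(Table(rows=3, columns=4))
restored = restore(saved)

# Because the archive is text, it can be changed with search and replace.
edited = restore(saved.replace('rows="3"', 'rows="5"'))

print(saved)
print(restored, edited)
```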

Summary

XML is pretty simple and very flexible. It has many uses yet to be discovered, and we are only beginning to scratch the surface of its potential. It is the foundation for a great many standards yet to come, providing a common language that different computer systems can use to exchange data with one another. As each industry group comes up with standards for what it wants to say, computers will begin to link to each other in ways previously unimaginable.

Why Is XML Important?

There are a number of reasons for XML's surging acceptance. This section lists a few of the most prominent.

Plain Text

Because XML is not a binary format, you can create and edit files using anything from a standard text editor to a visual development environment. That makes it easy to debug your programs, and it makes XML useful for storing small amounts of data. At the other end of the spectrum, an XML front end to a database makes it possible to efficiently store large amounts of XML data as well. So XML provides scalability for anything from small configuration files to a company-wide data repository.

Data Identification

XML tells you what kind of data you have, not how to display it. Because the markup tags identify the information and break the data into parts, an email program can process it, a search program can look for messages sent to particular people, and an address book can extract the address information from the rest of the message. In short, because the different parts of the information have been identified, they can be used in different ways by different applications.

Stylability

When display is important, the stylesheet standard, XSL, lets you dictate how to portray the data. For example, consider this XML:

<to>you@yourAddress.com</to>

The stylesheet for this data can say

1. Start a new line.
2. Display "To:" in bold, followed by a space.
3. Display the destination data.

This set of instructions produces:

To: you@yourAddress.com

Of course, you could have done the same thing in HTML, but you wouldn't be able to process the data with search programs and address-extraction programs and the like. More importantly, because XML is inherently style-free, you can use a completely different stylesheet to produce output in PostScript, TeX, PDF, or some new format that hasn't even been invented. That flexibility amounts to what one author described as "future proofing" your information. The XML documents you author today can be used in future document-delivery systems that haven't even been imagined.

Inline Reusability

One of the nicer aspects of XML documents is that they can be composed from separate entities. You can do that with HTML, but only by linking to other documents. Unlike HTML, XML entities can be included "inline" in a document. The included sections look like a normal part of the document: you can search the whole document at one time or download it in one piece. That lets you modularize your documents without resorting to links. You can single-source a section so that an edit to it is reflected everywhere the section is used, and yet a document composed from such pieces looks for all the world like a one-piece document.
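
A small-scale version of this single-sourcing can be tried with an entity declared inside the document itself. The sketch assumes Python's standard-library ElementTree, which expands internally declared entities while parsing; the product entity and the slide titles are hypothetical, and entities stored in separate files are not shown.

```python
import xml.etree.ElementTree as ET

# The entity is defined once and referenced wherever its text is needed.
source = """<!DOCTYPE slideshow [
  <!ENTITY product "WonderWidgets">
]>
<slideshow>
  <slide><title>Wake up to &product;!</title></slide>
  <slide><title>Who buys &product;</title></slide>
</slideshow>"""

root = ET.fromstring(source)
titles = [title.text for title in root.iter("title")]

print(titles)
```

Changing the entity definition changes every title that references it, yet the parsed document reads as one piece.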

Linkability

Thanks to HTML, the ability to define links between documents is now regarded as a necessity. Appendix B discusses the link-specification initiative. This initiative lets you define two-way links, multiple-target links, expanding links (where clicking a link causes the targeted information to appear inline), and links between two existing documents that are defined in a third.

Easily Processed

As mentioned earlier, regular and consistent notation makes it easier to build a program to process XML data. For example, in HTML a <dt> tag can be delimited by </dt>, another <dt>, <dd>, or </dl>. That makes for some difficult programming. But in XML, the <dt> tag must always have a </dt> terminator, or it must be an empty tag such as <dt/>. That restriction is a critical part of the constraints that make an XML document well formed. (Otherwise, the XML parser won't be able to read the data.) And because XML is a vendor-neutral standard, you can choose among several XML parsers, any one of which takes the work out of processing XML data.

Hierarchical

Finally, XML documents benefit from their hierarchical structure. Hierarchical document structures are, in general, faster to access because you can drill down to the part you need, as if you were stepping through a table of contents. They are also easier to rearrange, because each piece is delimited. In a document, for example, you could move a heading to a new location and drag everything under it along with the heading, instead of having to page down to make a selection, cut, and then paste the selection into a new location.

What Is XML?

XML is a text-based markup language that is fast becoming the standard for data interchange on the web. As with HTML, you identify data using tags (identifiers enclosed in angle brackets: <...>). Collectively, the tags are known as markup.

But unlike HTML, XML tags identify the data rather than specify how to display it. Whereas an HTML tag says something like, "Display this data in bold font" (<b>...</b>), an XML tag acts like a field name in your program. It puts a label on a piece of data that identifies it (for example, <message>...</message>).

Note: Because identifying the data gives you some sense of what it means (how to interpret it, what you should do with it), XML is sometimes described as a mechanism for specifying the semantics (meaning) of the data.

In the same way that you define the field names for a data structure, you are free to use any XML tags that make sense for a given application. Naturally, for multiple applications to use the same XML data, they must agree on the tag names they intend to use.

Here is an example of some XML data you might use for a messaging application:

<message>
<to>you@yourAddress.com</to>
<from>me@myAddress.com</from>
<subject>XML Is Really Cool</subject>
<text>
How many ways is XML cool? Let me count the ways...
</text>
</message>

Note: Throughout this tutorial, we use boldface text to highlight things we want to bring to your attention. XML does not require anything to be in bold!

The tags in this example identify the message as a whole, the destination and sender addresses, the subject, and the text of the message. As in HTML, the <to> tag has a matching end tag: </to>. The data between the tag and its matching end tag defines an element of the XML data. Note, too, that the content of the <to> tag is contained entirely within the scope of the <message>..</message> tag. It is this ability for one tag to contain others that lets XML represent hierarchical data structures.

Again, as with HTML, whitespace is essentially irrelevant, so you can format the data for readability and yet still process it easily with a program. Unlike HTML, however, in XML you can easily search a data set for messages containing, say, "cool" in the subject, because the XML tags identify the content of the data rather than specify its representation.
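
Such a search might look like the following sketch, which assumes Python's standard-library ElementTree. The mailbox wrapper element, the second message, and the subjects_containing helper are hypothetical additions for this illustration.

```python
import xml.etree.ElementTree as ET

# The <mailbox> wrapper and the second message are hypothetical.
source = """
<mailbox>
  <message>
    <to>you@yourAddress.com</to>
    <from>me@myAddress.com</from>
    <subject>XML Is Really Cool</subject>
    <text>How many ways is XML cool? Let me count the ways...</text>
  </message>
  <message>
    <to>you@yourAddress.com</to>
    <from>me@myAddress.com</from>
    <subject>Lunch on Friday</subject>
    <text>Cool with you?</text>
  </message>
</mailbox>
"""

def subjects_containing(root, word):
    """Return subjects that contain word, ignoring case and ignoring body text."""
    wanted = word.lower()
    return [
        subject.text
        for subject in root.iter("subject")
        if wanted in (subject.text or "").lower()
    ]

matches = subjects_containing(ET.fromstring(source), "cool")
print(matches)
```

The second message mentions "cool" only in its body, so it is not matched: the tags let the search target the subject alone.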

Tags and Attributes


Tags can also contain attributes--additional information included as part of the tag itself, within the tag's angle brackets. The following example shows an email message structure that uses attributes for the to, from, and subject fields:

<message to="you@yourAddress.com" from="me@myAddress.com"
subject="XML Is Really Cool">
<text>
How many ways is XML cool? Let me count the ways...
</text>
</message>

As in HTML, the attribute name is followed by an equal sign and the attribute value, and multiple attributes are separated by spaces. Unlike HTML, however, in XML commas between attributes are not ignored; if present, they generate an error.

Because you can design a data structure such as <message> equally well using either attributes or tags, it can take a considerable amount of thought to figure out which design is best for your purposes. Designing an XML Data Structure includes ideas to help you decide when to use attributes and when to use tags.

Empty Tags


One big difference between XML and HTML is that an XML document is always constrained to be well formed. There are several rules that determine when a document is well formed, but one of the most important is that every tag has a closing tag. So, in XML, the </to> tag is not optional. The <to> element is never terminated by any tag other than </to>.

Note: Another important aspect of a well-formed document is that all tags are completely nested. So you can have <message>..<to>..</to>..</message>, but never <message>..<to>..</message>..</to>. A complete list of requirements is contained in the list of XML frequently asked questions (FAQ) at http://www.ucc.ie/xml/#FAQ-VALIDWF. (This FAQ is on the W3C "Recommended Reading" list at http://www.w3.org/XML/.)

Sometimes, though, it makes sense to have a tag that stands by itself. For example, you might want to add a tag that flags the message as important: <flag/>.

This kind of tag does not enclose any content, so it's known as an empty tag. You create an empty tag by ending it with /> instead of >. For example, the following message contains an empty flag tag:

<message to="you@yourAddress.com" from="me@myAddress.com"
subject="XML Is Really Cool">
<flag/>
<text>
How many ways is XML cool? Let me count the ways...
</text>
</message>

Note: Using the empty tag saves you from having to code <flag></flag> in order to have a well-formed document. You can control which tags are allowed to be empty by creating a schema or a document type definition, or DTD. If there is no DTD or schema associated with the document, then it can contain any kinds of tags you want, as long as the document is well formed.

Comments in XML Files


XML comments look just like HTML comments:

<message to="you@yourAddress.com" from="me@myAddress.com"
subject="XML Is Really Cool">
<!-- This is a comment -->
<text>
How many ways is XML cool? Let me count the ways...
</text>
</message>

The XML Prolog

To complete this basic introduction to XML, note that an XML file always starts with a prolog. The minimal prolog contains a declaration that identifies the document as an XML document:

<?xml version="1.0"?>

The declaration may also contain additional information:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

The XML declaration is essentially the same as the HTML header, <html>, except that it uses <?..?> and it may contain the following attributes:

* version: Identifies the version of the XML markup language used in the data. This attribute is not optional.
* encoding: Identifies the character set used to encode the data. ISO-8859-1 is Latin-1, the Western European and English language character set. (The default is 8-bit Unicode: UTF-8.)
* standalone: Tells whether or not this document references an external entity or an external data type specification. If there are no external references, then "yes" is appropriate.

The prolog can also contain definitions of entities (items that are inserted when you reference them from within the document) and specifications that tell which tags are valid in the document. Both are declared in a document type definition (DTD), which can be defined directly within the prolog or referenced through pointers to external specification files. But those are the subject of later tutorials. For more information on these and many other aspects of XML, see the Recommended Reading list on the W3C XML page at http://www.w3.org/XML/.

Note: The declaration is actually optional, but it's a good idea to include it whenever you create an XML file. The declaration should have the version number, at a minimum, and ideally the encoding as well. That practice simplifies things if the XML standard is extended in the future and if the data ever needs to be localized for different geographical regions.

Everything that comes after the XML prolog constitutes the document's content.

Processing Instructions


An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data. Processing instructions have the following format:

<?target instructions?>

target is the name of the application that is expected to do the processing, and instructions is a string of characters that embodies the information or commands for the application to process.

Because the instructions are application-specific, an XML file can have multiple processing instructions that tell different applications to do similar things, although in different ways. The XML file for a slide show, for example, might have processing instructions that let the speaker specify a technical- or executive-level version of the presentation. If multiple presentation programs were used, the program might need multiple versions of the processing instructions (although it would be nicer if such applications recognized standard instructions).

Note: The target name "xml" (in any combination of upper- or lowercase letters) is reserved for XML standards. In one sense, the declaration is a processing instruction that fits that standard. (However, when you're working with the parser later, you'll see that the method for handling processing instructions never sees the declaration.)
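
The point about the declaration can be observed with a SAX-style parser. The following sketch assumes the xml.sax module from Python's standard library rather than the Java parser used later; InstructionCollector is a hypothetical handler name.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class InstructionCollector(ContentHandler):
    """Record every processing instruction the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def processingInstruction(self, target, data):
        self.seen.append((target, data))

source = b"""<?xml version="1.0"?>
<?my.presentation.Program QUERY="exec, tech, all"?>
<slideshow/>"""

collector = InstructionCollector()
xml.sax.parseString(source, collector)

print(collector.seen)
```

Only the application-specific instruction reaches the handler; the declaration on the first line is consumed by the parser itself.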