Atomik Roundtrip 2.1: Working with Roundtrip > Chapter 20 A brief guide to XML

20.3 XML Structure and DTDs

It is important when you’re creating a publication for print that it is constructed in a consistent and coherent manner. Publishers will normally create a ‘style guide’ which defines the style and structure of the magazine (and usually editorial style too), to provide rules to which designers should conform in order that the publication is clear and consistent.

If XML is representing the same sort of structure as is present in your printed content, then you’d expect there to be an XML equivalent of the style guide, defining the structure to which the XML will conform. This role is fulfilled by either a DTD (Document Type Definition) or an XML schema. Although these are two different types of description file, they are just different methods of describing the same structure.

For now, we will concentrate on the DTD, which is the most common way of describing XML structures.

<!ELEMENT Review (GameTitle, Standfirst, ReviewText, Score)>
<!ELEMENT GameTitle (#PCDATA)>
<!ELEMENT Standfirst (#PCDATA)>
<!ELEMENT ReviewText (Paragraph)+>
<!ELEMENT Paragraph (#PCDATA | EMail | URL)*>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
<!ELEMENT Score (#PCDATA)>

Above is the DTD for the review XML which we have already been looking at. The DTD defines the structure to which the XML conforms. If the XML conforms to the structure without deviation, it is said to be valid. If XML doesn’t conform exactly to these rules, but it does have a hierarchy which is correctly ‘balanced’ (i.e. every opening tag is matched by a corresponding closing tag, and the elements are properly nested), then it is referred to as well-formed XML. Of course, if it doesn’t meet this criterion, it’s not XML at all!
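As a rough illustration of the difference, here is a small sample of our own (it is not one of the Roundtrip tutorial files). The DTD has been placed at the top of the file so that the sample is self-contained. The XML is well-formed, because every tag is closed and properly nested, but it is not valid, because the Score element appears before the ReviewText element and the DTD lists them the other way round.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Review [
<!ELEMENT Review (GameTitle, Standfirst, ReviewText, Score)>
<!ELEMENT GameTitle (#PCDATA)>
<!ELEMENT Standfirst (#PCDATA)>
<!ELEMENT ReviewText (Paragraph)+>
<!ELEMENT Paragraph (#PCDATA | EMail | URL)*>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
<!ELEMENT Score (#PCDATA)>
]>
<!-- Well-formed, but NOT valid: Score appears before ReviewText -->
<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>A short standfirst.</Standfirst>
<Score>45</Score>
<ReviewText>
<Paragraph>A single paragraph of review text.</Paragraph>
</ReviewText>
</Review>
```

Moving the Score element back below the closing ReviewText tag would make the same file valid.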

The DTD can be vocalised as follows:

“A Review consists of a GameTitle followed by a Standfirst, a ReviewText and a Score. ReviewText consists of one or more Paragraphs. Paragraphs contain a combination of text data, EMail elements and URL elements, in any order. GameTitle, Standfirst, Score, EMail and URL elements each consist of text data.”

You’ll notice that these rules could be applied equally well to the printed page (and it’s probably the sort of thing you’d hear a Production Manager drilling into new QuarkXPress operators on their first day...).

This simple DTD contains a declaration for every element, and simply defines what that element will contain. The simplest entries are the ones for which the element contains text data:

<!ELEMENT GameTitle (#PCDATA)>

Each element declaration starts with the marker <!ELEMENT, because it’s an element which is being declared. Next comes the name of the element, ‘GameTitle’ in the example above. Then the content of the element is listed within parentheses. Text data is declared as #PCDATA, which is short for Parsed Character Data. Finally, the declaration ends with a closing angle bracket. All XML tags and DTD declarations must be enclosed in angle brackets, as that’s how any software reading the XML can tell which text is markup and which is content.

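Data types in XML <!ELEMENT ... > declarations
Other elements: The element contains other elements, which are listed by name.
(#PCDATA): Parsed character data. Text which does not contain any XML markup.
CDATA: Character data. Text which could contain XML markup (or any other use of the punctuation marks which would normally be interpreted as XML markup). Note that in XML, CDATA is used as an attribute type (in <!ATTLIST ... > declarations) and for CDATA sections within a document; it cannot be given as the content of an <!ELEMENT ... > declaration.
ANY: Any type of data. Declaring an element’s data type as ANY means that text and any of the elements declared in the DTD would be valid content.
EMPTY: The element doesn’t contain anything at all (although it can still have attributes applied to it).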
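The sketch below shows how EMPTY and ANY declarations might look alongside the more familiar #PCDATA. The element names Sidebar, Heading, Rule and Notes are hypothetical and have been invented purely for this illustration; they are not part of the review DTD.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Sidebar [
<!-- Sidebar, Heading, Rule and Notes are hypothetical names used only for this sketch -->
<!ELEMENT Sidebar (Heading, Rule, Notes)>
<!ELEMENT Heading (#PCDATA)>
<!-- EMPTY: the element has no content, so it is written as a single self-closing tag -->
<!ELEMENT Rule EMPTY>
<!-- ANY: text and any declared elements are allowed, in any order -->
<!ELEMENT Notes ANY>
]>
<Sidebar>
<Heading>Also this month</Heading>
<Rule/>
<Notes>Free text, or even a nested <Heading>heading</Heading>, is accepted here.</Notes>
</Sidebar>
```

ANY is convenient while a DTD is still being drafted, but it tells you very little about the structure, so it is usually worth replacing with a more specific declaration once the content is understood.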

The XML tags should be named in such a way that they describe their content. There is no dictionary of pre-defined XML tag names to which you must conform: your element names (and therefore your tags) are specified in plain English. There are a few rules, however. The element name must be all one word, and contain no spaces. The convention for element names which need two words to convey their meaning is to leave out the space and capitalise the first letter of each word (‘GameTitle’, ‘ReviewText’, for example). You should note that XML is case sensitive, so ‘Review’, ‘review’ and ‘REVIEW’ would be seen as three completely different element names. You must ensure that your use of element names is consistent throughout your DTD and XML (or the XML will be invalid).

As we’ve already seen, XML elements can contain other elements, and this structure can be seen in other DTD definitions.

<!ELEMENT ReviewText (Paragraph)+>

The ‘ReviewText’ element contains no text in its own right, simply multiple ‘Paragraph’ elements : so its DTD declaration, above, probably seems quite logical. You’ll notice the ‘+’ sign after the brackets. This means that a ReviewText element can contain one or more Paragraph elements, but must always contain at least one.

<!ELEMENT Review (GameTitle, Standfirst, ReviewText, Score)>

We’ve already seen in the XML that some elements can contain more than one other element, and when this is the case, the content of the element is expressed as a list.

The ‘Review’ element, shown above, consists of multiple XML elements, as we’ve already seen in the XML:

<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>The excesses of life are sometimes too much to handle for the Princess.</Standfirst>
<ReviewText>
<Paragraph>Two Jabberwockies laughed, and umpteen fountains perused two bourgeois mats, but bureaux cleverly tastes five obese Macintoshes. </Paragraph>
<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.
<EMail>neverland@linsoft.com</EMail>
The quite bourgeois sheep incinerated two Macintoshes, and five dogs partly drunkenly telephoned
<URL>www.linesoft.com</URL>
umpteen irascible botulisms. The aardvark comfortably kisses umpteen quixotic chrysanthemums, although two mostly irascible dwarves abused the progressive chrysanthemums. Bureaux auctioned off the sheep.</Paragraph>
...
</ReviewText>
<Score>45</Score>
</Review>

You’ll notice that the elements appear in the XML in precisely the order in which they are listed in the ‘Review’ element’s declaration. This kind of list is called a Sequence. A sequence is a strict content model: if one element is missing, or if the elements within ‘Review’ appear in a different order to that defined in the DTD, then the XML will not be valid.

A sequence list is separated by commas. Greater flexibility can be specified for the structure by adding one of the three modifier symbols after an element name. These are:

DTD ‘modifier’ symbols
+ Element must appear once, or more than once.
? Element is optional, and can appear once, or not at all. The element cannot appear more than once.
* Element can appear any number of times, including not at all.
Without one of these marks, the element must appear exactly once: no more and no less.
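To see the modifier symbols working together, consider the following sketch. Byline and Screenshot are hypothetical elements, and the review has been simplified so that Paragraph sits directly inside Review; neither change is part of the DTD discussed in this chapter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Review [
<!-- Byline and Screenshot are hypothetical elements added for this sketch -->
<!ELEMENT Review (GameTitle, Byline?, Screenshot*, Paragraph+)>
<!ELEMENT GameTitle (#PCDATA)>
<!ELEMENT Byline (#PCDATA)>
<!ELEMENT Screenshot (#PCDATA)>
<!ELEMENT Paragraph (#PCDATA)>
]>
<Review>
<!-- GameTitle has no modifier: exactly one is required -->
<GameTitle>The Neverland</GameTitle>
<!-- Byline? is optional and has been left out here -->
<!-- Screenshot* may appear any number of times: twice in this case -->
<Screenshot>neverland_01.tif</Screenshot>
<Screenshot>neverland_02.tif</Screenshot>
<!-- Paragraph+ must appear at least once -->
<Paragraph>First paragraph.</Paragraph>
<Paragraph>Second paragraph.</Paragraph>
</Review>
```

Removing both Paragraph elements, or adding a second GameTitle, would make this sample invalid; removing both Screenshot elements would not.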

If you needed to make this DTD more flexible (for example, if the ‘Score’ element only appears in some reviews), you could amend it like so:

<!ELEMENT Review (GameTitle, Standfirst, ReviewText, Score?)>

The sequence always imposes a fixed ordering on elements, and whilst you can choose to make certain elements optional (by adding a ‘?’ symbol after them), it is always fairly rigid. A more flexible option is the choice list.

<!ELEMENT Paragraph (#PCDATA | EMail | URL)*>

A choice list states that the element is made up of any one of the listed items. The items are separated by the vertical bar character (‘|’) which, on most keyboards, is positioned on the same key as the backslash (‘\’). This can be made more flexible by adding a ‘*’ modifier after the list, as in the example above. This changes the definition from ‘any one of the listed items, once only’ to ‘any of the listed items, in any order, any number of times’.
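For comparison, here is a sketch of a choice list with no modifier at all. Contacts and Contact are hypothetical element names invented for this illustration. Each Contact must hold exactly one of the two listed elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Contacts [
<!-- Contacts and Contact are hypothetical names used only for this sketch -->
<!ELEMENT Contacts (Contact)+>
<!-- No modifier on the list: each Contact holds exactly one EMail OR one URL -->
<!ELEMENT Contact (EMail | URL)>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
]>
<Contacts>
<Contact><EMail>neverland@linsoft.com</EMail></Contact>
<Contact><URL>www.linesoft.com</URL></Contact>
</Contacts>
```

A Contact containing both an EMail and a URL would be invalid here. Note also that when a choice list mixes #PCDATA with element names, as the Paragraph declaration does, XML requires #PCDATA to come first and the list to carry the ‘*’ modifier.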

This reveals an important point. When you place one of the three modifier symbols after a single element, it applies to that element alone; if you place the modifier outside the enclosing brackets of an entire list, then it applies to the entire list. It’s possible to combine the structure imposed by sequence lists with the flexibility offered by choice lists by including one list within another. For example, if the ‘ReviewText’ and ‘Score’ elements could appear in the XML in either order, the DTD could be amended as follows so that the XML is still valid (note that this content model also allows either element to be repeated, or left out altogether):

<!ELEMENT Review (GameTitle, Standfirst, (ReviewText | Score)*)>

It is possible to nest any number of lists within other lists, allowing for incredibly complex DTDs to be created.
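As a sketch of a choice list nested inside a sequence, the sample below uses a Review declaration whose last item is the choice list (ReviewText | Score)*. The content is our own shortened sample. The Score element comes before the ReviewText element, and the file is still valid.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Review [
<!ELEMENT Review (GameTitle, Standfirst, (ReviewText | Score)*)>
<!ELEMENT GameTitle (#PCDATA)>
<!ELEMENT Standfirst (#PCDATA)>
<!ELEMENT ReviewText (Paragraph)+>
<!ELEMENT Paragraph (#PCDATA | EMail | URL)*>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
<!ELEMENT Score (#PCDATA)>
]>
<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>A short standfirst.</Standfirst>
<!-- Score now comes before ReviewText, which the nested choice list allows -->
<Score>45</Score>
<ReviewText>
<Paragraph>Write to <EMail>neverland@linsoft.com</EMail> for details.</Paragraph>
</ReviewText>
</Review>
```

GameTitle and Standfirst are still governed by the outer sequence, so they must each appear once, in that order, before anything else.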

When constructing a DTD to define the structure of your XML, it is important to find a balance between rigid structure and flexibility. If the DTD imposes too much structure, it could be difficult to fit your content into that structure. If there is too little structure, the XML becomes less useful.

Not all XML files have the same structure. Because XML is so flexible, and only has to abide by the rules which you define in your DTD, it’s very important that each XML file states which DTD it’s designed to work with. You’ll notice that XML files written to a DTD have a line close to the top known as the Doctype declaration.

<!DOCTYPE Magazine SYSTEM "Easy_Magazine.dtd">

This statement declares the DTD to which the XML complies (“Easy_Magazine.dtd” in this example), and the root element of the XML. The root element is the foundation from which the rest of the hierarchy descends. In this example it’s ‘Magazine’: the XML which is written to this DTD represents the content of different parts of a magazine publication, and all XML written to this DTD will start with a ‘Magazine’ element, from which all other elements in the XML file descend.

The DTD can be specified as a file name, a full path to the file, or a relative path to the file. When creating XML to use with Import, all that really matters is the name of the DTD. Import will extract the filename from any path, and look for that file in the ‘Default DTD location’ which you’ve specified in your preferences (remember, this was the first thing you did in Tutorial 1).

The identifier in this example is known as a private or SYSTEM identifier. This means that the XML requires the presence of the specified file (“Easy_Magazine.dtd” in this example) in order to be interpreted. One way around this would be to include the whole DTD within the XML file, effectively replacing the reference to the file with its content:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE Magazine [
<!ELEMENT Magazine (Section)+>
<!ELEMENT Section (SectionName,Author,Review+)>
<!ELEMENT SectionName (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Review (GameTitle, Standfirst, ReviewText, Score)>
<!ELEMENT GameTitle (#PCDATA)>
<!ELEMENT Standfirst (#PCDATA)>
<!ELEMENT ReviewText (Paragraph)+>
<!ELEMENT Paragraph (#PCDATA | EMail | URL)*>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
<!ELEMENT Score (#PCDATA)>
]>
<Magazine>
<Section>
<SectionName>Review Official Roundup</SectionName>
<Author>Writers Stan Laurel, Oliver Hardy</Author>
<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>The excesses of life are sometimes too much to handle for the Princess.</Standfirst>
<ReviewText>
<Paragraph>Two Jabberwockies laughed, and umpteen fountains perused two bourgeois mats, but bureaux cleverly tastes five obese Macintoshes. </Paragraph>
<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.
<EMail>neverland@linsoft.com</EMail>
The quite bourgeois sheep incinerated two Macintoshes, and five dogs partly drunkenly telephoned
<URL>www.linesoft.com</URL>
umpteen irascible botulisms. The aardvark comfortably kisses umpteen quixotic chrysanthemums, although two mostly irascible dwarves abused the progressive chrysanthemums. Bureaux auctioned off the sheep.</Paragraph>
...
</ReviewText>
<Score>45</Score>
</Review>
</Section>
</Magazine>

Of course, whilst this is quite common practice, if your DTD is big or complex, it’s going to make the XML files larger and harder to manage.

Another form of doctype declaration which you may commonly see in XML is the public identifier:

<!DOCTYPE Magazine PUBLIC "-//Easypress//DTD Magazine//EN" "Easy_Magazine.DTD">

The public identifier is, in many ways, a hang-over from XML’s history, and its ‘parent’ format SGML. Public identifiers are a reference to a Catalog File (sometimes referred to as an XCatalog file), which can be used as a central location through which large XML projects consisting of multiple files are connected. This catalog file gives the context for other file references made within the declaration.

A public identifier (identified by the word PUBLIC) specifies a specially formatted reference, in the format:

registered//Company name//Description of resource//Language

The ‘registered’ part of this reference shows a ‘+’ for those references which have been officially registered, ‘-’ for those which are not, and a full ISO number for official ISO definitions.
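To give an idea of how a catalog file ties a public identifier to an actual file, here is a minimal sketch. It assumes the OASIS XML Catalogs format, which is only one of the catalog formats in use and may not be the one your own tools expect; the mapping to a local copy of the DTD is also an assumption made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A sketch of a catalog file, assuming the OASIS XML Catalogs format -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- Maps the unregistered ('-') public identifier to a local copy of the DTD -->
<public publicId="-//Easypress//DTD Magazine//EN" uri="Easy_Magazine.DTD"/>
</catalog>
```

Software which understands catalogs looks up the public identifier from the Doctype declaration in this file, and uses the matching entry to locate the DTD.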

However, public identifiers are largely irrelevant to Import, as it will always look for the specified file in the ‘Default DTD Folder’ location specified in your Roundtrip preferences, regardless of whether the declaration is a SYSTEM or PUBLIC one. So long as the name of your DTD is listed in the <!DOCTYPE declaration, and the actual file is in your ‘Default DTD Folder’ location, Atomik Roundtrip will be able to find it.