Atomik Roundtrip 2.1: Working with Roundtrip > Chapter 20 A brief guide to XML

20.2 So what is XML then?

Before we address the obvious question, ‘What is XML?’, let’s first consider another: ‘Why do we need XML?’

In the past, content was designed purely with one delivery medium in mind - usually print. As such, the content and its presentation were inextricably linked. A headline is a headline because it is bold and larger than the surrounding text; a standfirst usually appears just beneath the headline, smaller, but still differentiated from the majority of the body copy. Without any knowledge of typographic theory, anyone can understand and navigate a printed page by the relative styling and positioning of the text.

Over the last few years, things have changed. We now have multiple ways of accessing information: print, the internet, WAP mobile phones, handheld and integrated devices, to name only a few. So information providers need to provide the same information in many diverse formats. One of the first realisations people had when producing content for the internet and other new media was that typography which works well on a printed page does not necessarily have the appropriate impact in a different medium (and a 40pt headline won’t fit on the screen of a cellphone). The design must be changed so that it is styled appropriately for the device it will be displayed on, while users retain the ability to navigate and understand the different parts of the content.

Whatever means is used to achieve this, the stumbling block is that once the presentation is removed from the content, there is nothing left to distinguish the different parts of the content. There needs to be some way of expressing what type of content a particular piece of text is when it has no presentation applied to it.

That’s exactly what XML does.

XML, short for ‘Extensible Markup Language’, is a system of ‘tags’ which identify content, specifying what the content is, rather than what it looks like.

<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>The excesses of life are sometimes too much to handle for the Princess.</Standfirst>
<ReviewText>
<Paragraph>Two Jabberwockies laughed, and umpteen fountains perused two bourgeois mats, but bureaux cleverly tastes five obese Macintoshes. </Paragraph>
<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.
<EMail>neverland@linsoft.com</EMail>
he quite bourgeois sheep incinerated two Macintoshes, and five dogs partly drunkenly telephoned
<URL>www.linesoft.com</URL>
umpteen irascible botulisms. The aardvark comfortably kisses umpteen quixotic chrysanthemums, although two mostly irascible dwarves abused the progressive chrysanthemums. Bureaux auctioned off the sheep.</Paragraph>
...
</ReviewText>
<Score>45</Score>
</Review>

Take a look at the example above. It shows an article in a magazine layout (the same article we’ve been working with in Tutorials 1 to 6) and an XML representation of part of that article. In contrast to the print version, the XML text has no styling applied to it to differentiate between different types of content. Instead of different styling, you’ll see that there are words surrounded by angle brackets separating the content. Take the title, ‘The Neverland’, for example. It is preceded in the XML text by an XML tag, <GameTitle>, and is followed by another XML tag, </GameTitle>.

These tags both separate the different types of content and specify what each piece of content is. Every piece of content in an XML file needs to be enclosed by a pair of tags like these. Content enclosed between tags in this way is referred to as an Element. The first tag consists of the name of the element between angle brackets, and the second, which appears after the content, contains the name of the element with a slash character ‘/’ in front of it.

<GameTitle>The Neverland</GameTitle>
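To make the pairing of tags concrete, here is a minimal sketch in Python. It is not part of Roundtrip; it assumes only the standard library's xml.etree.ElementTree module, and the variable names are our own. It reads the element shown above, reports its name and its content, and then shows that a parser rejects an element whose closing tag is missing.

```python
import xml.etree.ElementTree as ET

# The element from the example: an opening tag, the content, a closing tag.
snippet = "<GameTitle>The Neverland</GameTitle>"

element = ET.fromstring(snippet)

element_name = element.tag      # the name between the angle brackets
element_content = element.text  # the text enclosed by the pair of tags

print(element_name)
print(element_content)

# An opening tag with no matching closing tag is not well-formed XML,
# so the parser refuses to read it.
try:
    ET.fromstring("<GameTitle>The Neverland")
    unclosed_accepted = True
except ET.ParseError:
    unclosed_accepted = False

print(unclosed_accepted)
```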

You’ll notice that before the opening <GameTitle> tag there is another tag: <Review>. XML is designed as a way of describing a document’s structure without requiring any formatting. In a print layout, content is grouped into blocks by its formatting, and as readers we understand which pieces of text belong together by their proximity to one another: a particular headline belongs with a particular piece of body copy because the two sit close together. Similarly, we recognise that these items together represent a logical block of content: an article. In doing so, we are inferring a structural hierarchy from the printed page. The page contains multiple articles; each article contains a headline, a standfirst and body copy; and the body copy in each article is made up of multiple paragraphs.

XML can represent a structural hierarchy in a very simple way: elements are not restricted to containing textual content; they can also contain other elements. In the example above, our text is from the reviews page of a gaming magazine. The page is logically divided up into multiple blocks, as you can see from the picture of the full spread. Each one of these blocks represents a separate review of a game. The XML file represents this logical structure by grouping all of the elements which are part of one logical block inside a ‘Review’ element.

If you take a look back to the XML, you’ll see that it starts with a <Review> tag, and ends with a </Review> tag. All the elements which appear between these tags are considered part of that one ‘Review’ element.

In XML terminology, the ‘Review’ element is referred to as the parent element of the ‘GameTitle’ element. The ‘GameTitle’ element is referred to as a child of ‘Review’. This ‘family tree’ analogy is used frequently to describe XML hierarchies.

<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>The excesses of life are sometimes too much to handle for the Princess.</Standfirst>
...
</Review>

In the XML, the ‘GameTitle’ element is followed by another element which, as we can see from its tagging, is a ‘Standfirst’. You’ll notice, if you compare the XML with the image of the page, that this contains the standfirst text from the review. This element is also enclosed within the ‘Review’ element, and is considered to be a child of ‘Review’ and a sibling of ‘GameTitle’ (continuing the family analogy).
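The family relationships described above can be checked with a short Python sketch. This is an illustration only, using the standard library's xml.etree.ElementTree module; the names review_xml and parent_of are hypothetical and chosen for this example. ElementTree does not store a link from a child back to its parent, so the sketch builds that lookup itself.

```python
import xml.etree.ElementTree as ET

review_xml = """
<Review>
<GameTitle>The Neverland</GameTitle>
<Standfirst>The excesses of life are sometimes too much to handle for the Princess.</Standfirst>
</Review>
"""

review = ET.fromstring(review_xml)

# Iterating over an element yields its direct children, in document order.
child_names = [child.tag for child in review]

# Build a child-to-parent lookup by visiting every element in the tree.
parent_of = {child: parent for parent in review.iter() for child in parent}

game_title = review.find("GameTitle")
standfirst = review.find("Standfirst")

# Siblings are elements that share the same parent.
are_siblings = parent_of[game_title] is parent_of[standfirst]

print(child_names)
print(parent_of[game_title].tag)
print(are_siblings)
```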

<ReviewText>
<Paragraph>Two Jabberwockies laughed, and umpteen fountains perused two bourgeois mats, but bureaux cleverly tastes five obese Macintoshes. </Paragraph>
<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.
<EMail>neverland@linsoft.com</EMail>
he quite bourgeois sheep incinerated two Macintoshes, and five dogs partly drunkenly telephoned
<URL>www.linesoft.com</URL>
umpteen irascible botulisms. The aardvark comfortably kisses umpteen quixotic chrysanthemums, although two mostly irascible dwarves abused the progressive chrysanthemums. Bureaux auctioned off the sheep.</Paragraph>
...
</ReviewText>

The next tag in the text is ‘ReviewText’, but you’ll notice that, like the ‘Review’ tag, it is not immediately followed by content; rather, it is followed by another tag: <Paragraph>. If you look at the image of the printed page again, you’ll see that the body copy is divided into paragraphs. In the same way that the ‘Review’ element contains a grouping of other elements, the ‘ReviewText’ element contains a group of multiple ‘Paragraph’ elements, all of which contain text. It’s important to note, however, that although the ‘ReviewText’ element is broken down into these multiple ‘Paragraph’ elements, this is not the same as putting a carriage return into the text: the elements describe structure, whereas a carriage return is simply a character within the text.

So far we’ve seen that elements can contain either text or other elements, allowing hierarchies of content to be represented within the XML structure.
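One way to see the hierarchy is to print it as an indented outline. The sketch below is an assumption-laden illustration rather than anything Roundtrip provides: it uses Python's standard xml.etree.ElementTree module, a hypothetical helper called outline, and an abbreviated copy of the review in which the paragraph text has been shortened.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the review; the paragraph text is shortened here.
review_xml = """
<Review>
<GameTitle>The Neverland</GameTitle>
<ReviewText>
<Paragraph>Two Jabberwockies laughed.</Paragraph>
<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.</Paragraph>
</ReviewText>
<Score>45</Score>
</Review>
"""

def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(review_xml)
structure = outline(root)
print("\n".join(structure))

# Elements that contain other elements, as opposed to text only.
containers = [e.tag for e in root.iter() if len(e) > 0]

# 'ReviewText' groups several 'Paragraph' elements.
paragraph_count = len(root.findall("ReviewText/Paragraph"))
```

Each level of indentation in the printed outline corresponds to one generation in the family tree: ‘Review’ at the top, its children beneath it, and the ‘Paragraph’ elements one level further down inside ‘ReviewText’.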

<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum.
<EMail>neverland@linsoft.com</EMail>
he quite bourgeois sheep incinerated two Macintoshes, and five dogs partly drunkenly telephoned
<URL>www.linesoft.com</URL>
umpteen irascible botulisms. The aardvark comfortably kisses umpteen quixotic chrysanthemums, although two mostly irascible dwarves abused the progressive chrysanthemums. Bureaux auctioned off the sheep.</Paragraph>

Take a look at the next ‘Paragraph’ element. This element contains both text and other elements. Elements like this are known as ‘Mixed’ elements, and allow for even more complex hierarchies to be developed. In this case, the second ‘Paragraph’ element contains a URL (Uniform Resource Locator - an internet web page address) and an email address. These need to be identified, both so that they can be displayed differently, should that be appropriate, and so that they can have some advanced functionality in a multimedia environment. For example, if this XML were used to make a web page, the URL text could become a hyperlink that automatically takes users to that site when they click on it. Unless the URL is separated out from the rest of the text, this is not possible.
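As a rough illustration of the hyperlink idea, the following Python sketch turns a mixed ‘Paragraph’ element into an HTML paragraph. It is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree and html modules, the function name paragraph_to_html is hypothetical, the paragraph text is abbreviated, and it assumes that URLs in the content omit the ‘http://’ prefix so one has to be added.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened version of the second paragraph (the body text is abbreviated).
paragraph_xml = (
    "<Paragraph>Two partly obese cats tastes one slightly irascible chrysanthemum. "
    "<EMail>neverland@linsoft.com</EMail> "
    "he quite bourgeois sheep incinerated two Macintoshes "
    "<URL>www.linesoft.com</URL> "
    "umpteen irascible botulisms.</Paragraph>"
)

def paragraph_to_html(paragraph):
    """Render a mixed-content Paragraph as an HTML <p> with live links."""
    parts = [escape(paragraph.text or "")]  # text before the first child
    for child in paragraph:
        label = escape(child.text or "")
        if child.tag == "URL":
            # Assumption: URLs in the content omit the scheme, so add one.
            parts.append(f'<a href="http://{label}">{label}</a>')
        elif child.tag == "EMail":
            parts.append(f'<a href="mailto:{label}">{label}</a>')
        else:
            parts.append(label)  # unknown child: keep its text unstyled
        parts.append(escape(child.tail or ""))  # text following this child
    return "<p>" + "".join(parts) + "</p>"

html_output = paragraph_to_html(ET.fromstring(paragraph_xml))
print(html_output)
```

The point of the sketch is that the program never has to guess which words are an address: the ‘URL’ and ‘EMail’ tags say so, and the surrounding text is passed through untouched.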

Summary: The design of a page allows a user to infer structure from it. XML is a mechanism through which the structure can be represented separately from its presentation. Because XML content does not rely on presentation for its structure, it can be repurposed easily. XML tags are placed around content in order to indicate what that content is.