20.5 Special characters and entities

One of the key aims of XML is to allow content to be moved between media, which inevitably means moving it between different types of devices and different computer operating systems. Computers cannot read; they only perform calculations on numbers. They store letters and other characters by assigning a number to each one: A=65, B=66, and so on. If you’ve ever opened a Macintosh text document on a PC, or vice versa, you will know that the numbers used to store these characters, known as the character set, can vary significantly between operating systems. The core letters and numbers (A-Z, 0-9) are usually consistent between devices and operating systems, but other characters, such as accented letters, can differ, causing text to be displayed incorrectly when it is transferred between systems.
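The effect can be reproduced with a short Python sketch. The sample word and the two character sets chosen here (Mac Roman for the Macintosh, Windows-1252 for the PC) are assumptions made purely for illustration.

```python
# Illustration only: the same word stored under two legacy character sets.
text = "café"

mac_bytes = text.encode("mac_roman")  # classic Macintosh character set
pc_bytes = text.encode("cp1252")      # Windows "Western" character set

# The core letters are stored as the same numbers on both systems...
print(ord("A"), ord("B"))             # 65 66
print(mac_bytes[:3] == pc_bytes[:3])  # True: "caf" is identical

# ...but the accented letter is stored as a different number on each.
print(hex(mac_bytes[3]), hex(pc_bytes[3]))  # 0x8e 0xe9

# Reading the Macintosh bytes with the PC character set garbles the accent.
garbled = mac_bytes.decode("cp1252")
print(garbled)
```

Only the last letter changes, which is why this kind of damage often goes unnoticed until accented text or symbols are involved.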

However, because there is a core range of characters common to most character sets, XML that is limited to these characters can safely be transported between different systems and devices. But if you tried to represent your content entirely in these common characters, you would have a very limited vocabulary with no accented characters, symbols or letters outside the English alphabet, making XML difficult to use in English and almost impossible in other languages.

Fortunately, there are solutions to this issue. They centre on a standard known as Unicode. Unicode is a huge character set which assigns a unique number to every character, no matter what platform it’s being used on or what language it’s displaying.

There are two methods of using the Unicode character set within an XML file. The first is character encoding. You’ll notice that an XML file normally begins with a special line which starts with the characters <? and looks like a processing instruction. This line is known as the XML declaration. It identifies the file as XML, and although XML 1.0 allows it to be left out, it is good practice to include it, not least because it is where the character encoding is stated.

<?xml version="1.0" encoding="UTF-8"?>

The first item of this declaration identifies the XML version. At the time of writing, XML is still at version 1.0, so this value should always read “1.0”. The second item defines the character encoding. There are several different encoding standards, but the Unicode ones all share the same aim: representing each numeric Unicode value as a sequence of one or more bytes.
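As a rough illustration of how the declared encoding and the stored bytes go together, the sketch below serialises one element twice using Python’s standard xml.etree.ElementTree module. The choice of Python is an assumption for illustration only (the xml_declaration argument needs Python 3.8 or later), and the element name is borrowed from the examples in this section.

```python
import xml.etree.ElementTree as ET

# Build a small element containing curly quotes (U+201C, U+2019, U+201D).
elem = ET.Element("PullQuote")
elem.text = "\u201cRW\u2019s cast invokes very little sympathy\u201d"

# Serialise with two different declared encodings.
utf8_doc = ET.tostring(elem, encoding="UTF-8", xml_declaration=True)
ascii_doc = ET.tostring(elem, encoding="US-ASCII", xml_declaration=True)

print(utf8_doc)
print(ascii_doc)

# Both byte sequences parse back to exactly the same text.
same_text = ET.fromstring(utf8_doc).text == ET.fromstring(ascii_doc).text
print(same_text)  # True
```

In the UTF-8 version the curly quotes are written as raw bytes. US-ASCII cannot represent them, so this serialiser falls back to numeric references such as &#8220;, a form covered later in this section.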

<PullQuote>‚ÄúRW‚Äôs cast invokes very little sympathy‚Äù</PullQuote>

This example shows how some smart (curly) quotes appear when text encoded with the UTF-8 encoding system is displayed one byte at a time. You’ll note that each curly quote is represented by three characters, because UTF-8 stores each of them as three bytes. In combination, these bytes mathematically represent the Unicode character set values for the curly quote characters. There are other encoding systems, UTF-16, ISO-8859 and US-ASCII to name just a few, but UTF-8 is probably the most commonly used.
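The three-bytes-per-quote behaviour can be checked with a few lines of Python. This is a minimal sketch: decoding the bytes as Mac Roman is an assumed way of imitating a program that shows one character per byte, not something this guide prescribes.

```python
# The pull quote with real curly quotes (U+201C, U+2019, U+201D).
quote = "\u201cRW\u2019s cast invokes very little sympathy\u201d"
utf8_bytes = quote.encode("utf-8")

# Each curly quote becomes three bytes in UTF-8.
for ch in ("\u201c", "\u2019", "\u201d"):
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")
# U+201C -> e2 80 9c
# U+2019 -> e2 80 99
# U+201D -> e2 80 9d

# Showing the same bytes one character per byte (here via Mac Roman)
# turns every curly quote into three unrelated characters.
misread = utf8_bytes.decode("mac_roman")
print(misread)

# Decoding with the correct encoding recovers the original text.
print(utf8_bytes.decode("utf-8") == quote)  # True
```

The plain letters survive the misreading untouched because UTF-8 stores the core ASCII characters as single bytes with their usual numbers.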

However, as you can see from the PullQuote example above, text encoded this way can be rather difficult for humans to read when it is viewed as raw bytes (although it makes perfect sense to machines). This brings us to the second method of using Unicode with XML: character entities. An entity in XML is a name which stands for a character or a piece of text, and is filled in automatically as the XML is interpreted.

<PullQuote>&ldquo;RW&apos;s cast invokes very little sympathy&rdquo; </PullQuote>

You can recognise an entity in XML because it is always preceded by an ampersand and followed by a semicolon. In the above example, the text &ldquo; will be recognised when the XML is interpreted and replaced with a left double-quote character. For this to happen, however, the entity must be declared in the DTD (as with elements and attributes).

<!ENTITY ldquo "&#x201C;">
<!ENTITY rdquo "&#x201D;">
<!ENTITY apos "&#x0027;">

Now this might appear weird, because each entity declared in this example seems just to be a usage of another entity, and you’d be right: that’s exactly what it is. However, these are a special form of entity known as character references, which refer directly to Unicode character numbers. They do not need to be defined any further, as any system which understands XML will automatically replace them with the Unicode character to which they correspond. It would be quite possible to enter these values directly into the XML, but that would be just as confusing as the UTF-8 encoding.

<PullQuote>&#x201C;RW&#x0027;s cast invokes very little sympathy&#x201D; </PullQuote>
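To see that the named entities and the numeric values produce the same result, both forms can be run through a parser. The sketch below uses Python’s standard xml.etree.ElementTree module and, as an assumption for the sake of a self-contained example, places the entity declarations in an internal DTD subset inside the document rather than in a separate DTD file. The apos entity is left undeclared here because it is one of the five entities predefined by XML.

```python
import xml.etree.ElementTree as ET

# Named entities, declared in an internal DTD subset (assumed for this sketch).
WITH_ENTITIES = """<?xml version="1.0"?>
<!DOCTYPE PullQuote [
  <!ENTITY ldquo "&#x201C;">
  <!ENTITY rdquo "&#x201D;">
]>
<PullQuote>&ldquo;RW&apos;s cast invokes very little sympathy&rdquo;</PullQuote>"""

# The same content with the numeric values entered directly; no DTD needed.
WITH_NUMERIC_REFS = """<?xml version="1.0"?>
<PullQuote>&#x201C;RW&#x0027;s cast invokes very little sympathy&#x201D;</PullQuote>"""

text_from_entities = ET.fromstring(WITH_ENTITIES).text
text_from_refs = ET.fromstring(WITH_NUMERIC_REFS).text

print(text_from_entities)
print(text_from_entities == text_from_refs)  # True
```

By the time an application sees the text, the entities have already been replaced, so it cannot tell which of the two forms the author typed.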

If you were to include every entity declaration in a DTD, for all the characters which may or may not be used in XML files which are compliant with this DTD, then it would rapidly become a large file which is difficult to maintain - especially if you’re maintaining multiple DTDs simultaneously.