Chapter 15 Extracting Styling Information

15.1 Styling elements and attributes

Atomik Xport uses what is known as a Mixed Content Model to include formatting information in the XML. Here is the textbook definition of mixed content in XML.

Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences.

In Atomik Xport, this is achieved by defining DTD Elements that Atomik Xport uses to indicate text styling. There are twelve styling attributes in total, all of which can be included in the XML file. To make this as easy as possible, both for the person creating the DTD and for the person configuring the Atomik Xport rulesets, the names of the required DTD Elements have been predefined as follows:

Styling Attribute  DTD Element
Bold  B
Italic  I
Underline  U
Superscript  sup
Subscript  sub
All Caps  allcaps
Small Caps  smallcaps
Superior  superior
Shadow  shadow
Outline  outline
Strikethru  strikethru
Word Underline  WU

If you would like Atomik Xport to include the styling information for any of these styling attributes in the XML, you will need to define the corresponding DTD Element for each required attribute in your DTD. As you will know from the previous chapters, the DTD Elements that contain text content (as opposed to image references) generally use the definition #PCDATA (parsed character data). For example, if you were defining a Paragraph element in your DTD as in Tutorial1, it would normally look as follows:

<!ELEMENT Paragraph (#PCDATA)> 

Atomik Xport in this case will include only character data in the Paragraph element and not any styling. Using the mixed content model, your paragraph element could be defined as follows so that the styling information is included in the XML extraction:

 <!ELEMENT Paragraph (#PCDATA | B | I | U | sup | sub)*> 

In this case Atomik Xport will not only include the character data but will also include the tags for B (Bold), I (Italic), U (Underline), sup (superscript) and sub (subscript) if these styling changes occur within the Paragraph.

Note: You could include all twelve styling DTD Elements in this definition if you wish.

As you have included these new formatting DTD Elements in your Paragraph DTD Element, you also need to define the styling DTD Elements in your DTD. Therefore here is a full definition of the Paragraph DTD Element with the defined styling DTD Elements.

<!ELEMENT Paragraph (#PCDATA | B | I | U | sup | sub)*> 
<!ELEMENT B (#PCDATA | I | U | sup | sub)*>
<!ELEMENT I (#PCDATA | B | U | sup | sub)*>
<!ELEMENT U (#PCDATA | B | I | sup | sub)*>
<!ELEMENT sup (#PCDATA | B | I | U | sub)*>
<!ELEMENT sub (#PCDATA | B | I | U | sup)*>
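
To illustrate these definitions, here is a minimal sketch of an XML instance that is valid against them. The declarations are repeated in an internal DTD subset only so that the example is self-contained. The paragraph text is invented for illustration, and the exact output that Atomik Xport produces for a given document may differ.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: the text content is invented, not Atomik Xport output. -->
<!DOCTYPE Paragraph [
  <!-- Paragraph and styling declarations as given above -->
  <!ELEMENT Paragraph (#PCDATA | B | I | U | sup | sub)*>
  <!ELEMENT B (#PCDATA | I | U | sup | sub)*>
  <!ELEMENT I (#PCDATA | B | U | sup | sub)*>
  <!ELEMENT U (#PCDATA | B | I | sup | sub)*>
  <!ELEMENT sup (#PCDATA | B | I | U | sub)*>
  <!ELEMENT sub (#PCDATA | B | I | U | sup)*>
]>
<!-- Character data interspersed with styling elements, in any order and number -->
<Paragraph>Water is H<sub>2</sub>O; the area is 9 m<sup>2</sup>. This is <B>bold</B>, this is <I>italic</I> and this is <U>underlined</U>.</Paragraph>
```

A validating parser can be used to check such a file against its DTD, for example `xmllint --valid --noout` followed by the file name.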

It is important to note that not only have the styling DTD Elements been defined as #PCDATA, they have also been defined as having mixed content themselves. This is because it is entirely possible that you could have styling changes embedded within other styling changes as in the example below:

A character goes into a Wonderland trance to bring her back to her childhood and free up her inhibitions so she can solve her psychological problems.

In the above text, everything from the word ‘goes’ onward is in italics; within that run, three words are in italics and bold and one word is in italics, bold and small caps. From an XML perspective, therefore, the italic tags also need to contain bold tags, and the bold tags need to contain small caps tags. Each formatting DTD element must therefore be defined as being able to contain other formatting DTD elements as well as standard #PCDATA.
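
The nesting described above can be sketched as follows. This is a hypothetical instance: the sentence is a placeholder rather than the sample text, and the internal DTD subset declares only the three styling elements involved (I, B and smallcaps), each allowing the others as children.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example of styling nested inside other styling. -->
<!DOCTYPE Paragraph [
  <!ELEMENT Paragraph (#PCDATA | B | I | smallcaps)*>
  <!-- Each styling element may contain text and the other styling elements -->
  <!ELEMENT B (#PCDATA | I | smallcaps)*>
  <!ELEMENT I (#PCDATA | B | smallcaps)*>
  <!ELEMENT smallcaps (#PCDATA | B | I)*>
]>
<!-- Italic contains bold, and bold contains small caps -->
<Paragraph>Plain text, <I>italic text, <B>italic bold text and <smallcaps>italic bold small caps</smallcaps></B>, then italic only</I>.</Paragraph>
```

If B were declared as plain (#PCDATA), the smallcaps element inside it would make this document invalid, which is why every styling element is given a mixed content definition.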

The Atomik Xport CD includes a sample DTD ‘Easy_Magazine2.dtd’ which includes the formatting DTD elements for the paragraph DTD element. This will be a useful starting point for you to see how to implement these tags within your own DTDs.
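
For a DTD that needs every styling attribute, the same pattern extends to all twelve predefined element names. The following is a sketch only and is not a copy of ‘Easy_Magazine2.dtd’: it assumes a single Paragraph element, and follows the pattern used earlier in this chapter of letting each styling element contain every styling element except itself.

```xml
<!-- Sketch of a DTD fragment: Paragraph with all twelve styling elements. -->
<!ELEMENT Paragraph  (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>

<!-- Each styling element allows text plus the other eleven styling elements -->
<!ELEMENT B          (#PCDATA | I | U | sup | sub | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT I          (#PCDATA | B | U | sup | sub | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT U          (#PCDATA | B | I | sup | sub | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT sup        (#PCDATA | B | I | U | sub | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT sub        (#PCDATA | B | I | U | sup | allcaps | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT allcaps    (#PCDATA | B | I | U | sup | sub | smallcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT smallcaps  (#PCDATA | B | I | U | sup | sub | allcaps | superior | shadow | outline | strikethru | WU)*>
<!ELEMENT superior   (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | shadow | outline | strikethru | WU)*>
<!ELEMENT shadow     (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | superior | outline | strikethru | WU)*>
<!ELEMENT outline    (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | superior | shadow | strikethru | WU)*>
<!ELEMENT strikethru (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | superior | shadow | outline | WU)*>
<!ELEMENT WU         (#PCDATA | B | I | U | sup | sub | allcaps | smallcaps | superior | shadow | outline | strikethru)*>
```

Only the styling elements you actually want extracted need to appear; any name left out of a content model must also be left out of the content models that reference it.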