Chapter 10: Working With DTDs
Possibly the single most important issue in getting the most out of Atomik Xport is preparing Document Type Definitions (DTDs) that are appropriately structured to hold the content being extracted as XML from InDesign documents. This is essential to achieving an optimal level of automation with Atomik Xport.
This chapter is intended to provide guidelines to help you create DTDs that will work best in Atomik Xport and your InDesign documents.
It is important to note that constructing DTDs is similar in principle to constructing database schemas. This is because a DTD defines the structure of a document in which structured information is stored, in the same way that a database schema defines the structure of the database in which structured information is stored.
Therefore, it is highly recommended that the task of DTD creation be handled only by practitioners who have knowledge of DTDs and of structuring information.
A DTD used in Atomik Xport should be viewed as a tool for getting content out of InDesign, rather than the definitive structure in which that content will ultimately be stored.
By viewing the Atomik Xport extraction DTD as a simple extraction tool, rather than the final destination of content, you are free to be more flexible with the rules and structure of the DTD. This can have significant benefits in raising the proportion of content that can be extracted automatically.
Try a simple DTD as the basis for a Ruleset, and see if all the content is extracted automatically. Do not expect everything to extract perfectly the first time.
InDesign is a design tool that is extremely flexible. This flexibility means that there are many factors that can affect how a document is extracted.
If there is content that has not been extracted, examine what has not been extracted and see if:
• A style mapping has been omitted in the Ruleset
• The ordering of the DTD is too rigid for the ordering of the content in InDesign
• The alignment of text/image boxes is causing the Box Ordering to ‘miss’ text boxes (see the Box Ordering sections in the Chapter Working with Rulesets)
On a printed page, it is normally quite clear in which order we read different pieces of content. For example, in a newspaper, we know to read the headline first, then the by-line, and then the bodytext of the article. This is typically because the headline comes at the top of the box that contains the article, with the by-line just below, and then the bodytext follows on afterwards.
As you have seen in previous chapters, Atomik Xport follows specific patterns to move from one text or image box to the next (such as Default Box Ordering and Reverse Box Ordering).
You can maximise the performance of automated extraction by creating the DTD so that the order of the elements matches the order in which the content appears on the InDesign page (bearing in mind the Box Ordering Preference you will be using with your Ruleset).
In other words, rather than trying to map each piece of content from the InDesign page into disparate elements of a DTD which may contain a wide variety of elements, keep the DTD closely aligned with the content (and its ordering) in the InDesign document.
While this might impose a restriction on the complexity and ordering of elements in a DTD used for extracting content from InDesign, it will greatly improve automated extraction performance.
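As a minimal sketch of this alignment, assume a newspaper article laid out as headline, then by-line, then bodytext. The element names and sample text below are illustrative assumptions, not taken from a DTD shipped with Atomik Xport. The declaration simply lists the elements in the order they appear on the page:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: element names and sample text are assumptions. -->
<!DOCTYPE Article [
  <!-- The sequence mirrors the reading order on the page. -->
  <!ELEMENT Article  (Headline, Byline, Bodytext+)>
  <!ELEMENT Headline (#PCDATA)>
  <!ELEMENT Byline   (#PCDATA)>
  <!ELEMENT Bodytext (#PCDATA)>
]>
<Article>
  <Headline>Council approves new bridge</Headline>
  <Byline>By A. Reporter</Byline>
  <Bodytext>First paragraph of the story.</Bodytext>
  <Bodytext>Second paragraph of the story.</Bodytext>
</Article>
```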
Most publications can be classified into generic types, for example newspapers, newsletters, magazines, catalogues etc. It is recommended that, where possible, a generic DTD for that type of publication is used as a basis for creating a DTD for a specific publication. This makes the task of customising the DTD simpler. You’ll find some sample DTDs on the Atomik Xport CD and on www.easypress.com.
A common issue when conducting automated extraction is that the DTD being used specifies a sequence of elements. While this is useful in that it creates a well-ordered list of XML elements, it can cause pieces of content not to be extracted if any irregularity of ordering occurs.
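To illustrate the kind of irregularity meant here, the sketch below pairs a deliberately rigid content model with content that arrives in a slightly different order. The element names and text are assumptions chosen for illustration. The document is well-formed XML, but it is not valid against its own DTD because the Byline precedes the Standfirst:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: a deliberately rigid sequence of elements. -->
<!DOCTYPE Article [
  <!ELEMENT Article    (Headline, Standfirst, Byline, Bodytext+)>
  <!ELEMENT Headline   (#PCDATA)>
  <!ELEMENT Standfirst (#PCDATA)>
  <!ELEMENT Byline     (#PCDATA)>
  <!ELEMENT Bodytext   (#PCDATA)>
]>
<!-- NOT valid: the sequence requires Standfirst before Byline,
     but on this page the by-line was placed first. -->
<Article>
  <Headline>Ten walks for the weekend</Headline>
  <Byline>By A. Writer</Byline>
  <Standfirst>Our pick of routes for every level of fitness.</Standfirst>
  <Bodytext>First paragraph of the feature.</Bodytext>
</Article>
```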
In this instance, it is useful to create the DTD so that whatever Atomik Xport encounters on the InDesign page can be extracted, and so that the ordering of XML elements is driven by the order of content on the InDesign page.
To explain by way of example:
Say an InDesign document contains a section of a magazine. The section contains multiple articles. Each article can contain:
• Headline
• Standfirst
• Byline
• Bodytext
• Subhead
• Images
• Tables
• Boxouts
Select the item(s) that define the start of an article. This is typically something like a Headline. We can therefore define an Article at this point as a Headline followed by a choice of any of the remaining elements:
<!ELEMENT Article (Headline, (Standfirst | Byline | Bodytext | Subhead | Images | Table | Boxout)+)>
What the above element declaration says is that an Article consists of a Headline followed by “one or more choices of Standfirst or Byline or Bodytext etc.”
The result of structuring the DTD in this way is that Atomik Xport is now indifferent to the order in which any of the possible elements within an article occur – it will just create new XML elements for whatever it finds, in whatever order it finds them.
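The following sketch shows the Article declaration in a complete document, with content in an irregular order that is still valid. The child element declarations (all plain text, including Images, which here simply holds a file name) and the sample content are assumptions made so that the example is self-contained:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: child declarations and content are assumptions. -->
<!DOCTYPE Article [
  <!-- A Headline, then one or more of the other elements in any order. -->
  <!ELEMENT Article (Headline, (Standfirst | Byline | Bodytext | Subhead | Images | Table | Boxout)+)>
  <!ELEMENT Headline   (#PCDATA)>
  <!ELEMENT Standfirst (#PCDATA)>
  <!ELEMENT Byline     (#PCDATA)>
  <!ELEMENT Bodytext   (#PCDATA)>
  <!ELEMENT Subhead    (#PCDATA)>
  <!ELEMENT Images     (#PCDATA)>
  <!ELEMENT Table      (#PCDATA)>
  <!ELEMENT Boxout     (#PCDATA)>
]>
<Article>
  <Headline>Ten walks for the weekend</Headline>
  <Images>walks-opener.jpg</Images>
  <Byline>By A. Writer</Byline>
  <Bodytext>First paragraph of the feature.</Bodytext>
  <Subhead>Coastal routes</Subhead>
  <Bodytext>Second paragraph of the feature.</Bodytext>
  <Boxout>What to pack</Boxout>
  <Standfirst>Our pick of routes for every level of fitness.</Standfirst>
</Article>
```

Because the choice group repeats, the same element (Bodytext here) can appear several times, and the Standfirst can appear late without breaking validity.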
However, each time Atomik Xport encounters a Headline style in the InDesign document, it will create a new Article.
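As a sketch of the resulting structure, assume a hypothetical Section root element that holds one or more Articles. Section is not named in the declaration above, and the sample content is invented. For a page carrying two headlines, the output would be expected to resemble two sibling Article elements:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: Section is a hypothetical wrapper element. -->
<!DOCTYPE Section [
  <!ELEMENT Section (Article+)>
  <!ELEMENT Article (Headline, (Standfirst | Byline | Bodytext | Subhead | Images | Table | Boxout)+)>
  <!ELEMENT Headline   (#PCDATA)>
  <!ELEMENT Standfirst (#PCDATA)>
  <!ELEMENT Byline     (#PCDATA)>
  <!ELEMENT Bodytext   (#PCDATA)>
  <!ELEMENT Subhead    (#PCDATA)>
  <!ELEMENT Images     (#PCDATA)>
  <!ELEMENT Table      (#PCDATA)>
  <!ELEMENT Boxout     (#PCDATA)>
]>
<Section>
  <!-- The first Headline opens the first Article. -->
  <Article>
    <Headline>Ten walks for the weekend</Headline>
    <Bodytext>Text of the first article.</Bodytext>
  </Article>
  <!-- The next Headline closes that Article and opens a new one. -->
  <Article>
    <Headline>Where to eat afterwards</Headline>
    <Byline>By A. Critic</Byline>
    <Bodytext>Text of the second article.</Bodytext>
  </Article>
</Section>
```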
This approach can be used across the whole DTD, or can be restricted to subsections where ordering of content may be less structured than other sections.
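One possible way to mix the two styles, offered as an assumption rather than a recommended Atomik Xport structure, is to keep a strict sequence for the opening of the article and reserve the free choice for the body, where ordering tends to vary most:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: one possible mix of strict and flexible ordering. -->
<!DOCTYPE Article [
  <!-- Strict: Headline, then an optional Standfirst, then an optional Byline.
       Flexible: the body accepts the remaining elements in any order. -->
  <!ELEMENT Article (Headline, Standfirst?, Byline?,
                     (Bodytext | Subhead | Images | Table | Boxout)+)>
  <!ELEMENT Headline   (#PCDATA)>
  <!ELEMENT Standfirst (#PCDATA)>
  <!ELEMENT Byline     (#PCDATA)>
  <!ELEMENT Bodytext   (#PCDATA)>
  <!ELEMENT Subhead    (#PCDATA)>
  <!ELEMENT Images     (#PCDATA)>
  <!ELEMENT Table      (#PCDATA)>
  <!ELEMENT Boxout     (#PCDATA)>
]>
<Article>
  <Headline>Ten walks for the weekend</Headline>
  <Byline>By A. Writer</Byline>
  <Bodytext>First paragraph of the feature.</Bodytext>
  <Images>walks-map.jpg</Images>
  <Bodytext>Second paragraph of the feature.</Bodytext>
  <Table>Distances and grades</Table>
</Article>
```

The trade-off is the one described earlier in this chapter: the stricter opening gives a more predictable structure, but a by-line placed above the standfirst on the page would no longer fit.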