Christian Roth

C R

Revision History
Revision 1, Sun, 08 Nov 2009 02:28:00 CET

RTF to DocBook v4.5 Pipeline Documentation

1. General

Though this pipeline aims to convert arbitrary RTF documents to DocBook, the conversion quality can be greatly increased if a certain amount of authoring discipline can be achieved. DocBook is a big DTD with many semantic elements for which no uniquely identifying typographical equivalents exist.

1.1. Named styles

The pipeline therefore relies on a default, basic set of named paragraph and character styles which it uses to detect semantic elements in the source document. These names can be configured if desired by editing pipeline module properties. The set of supported style names will be discussed later.

1.2. Root elements: book, article

The pipeline implementation supports two DocBook root elements: book and article. Since a book has more structuring layers before reaching the recursive nesting of sections, the pipeline's mapping of styles to the respective container elements may need to be adjusted with respect to your actual documents. The hierarchical mapping relies on an ordered set of named paragraph styles for the respective (required) title elements in the DocBook DTD. The pipeline infers the level of a title (and as a result the required boxing of container elements) from these named paragraph styles.

For each of the two root elements, book and article, there is a separate default set of named paragraph styles. You can override the effective set of styles (resulting from the chosen root element) with your own if needed.
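The way a title level can drive the nesting of container elements can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the function name build_article and the simplified style table LEVELS (only the "heading N" names of the article defaults) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed style-to-level table (article root, "heading N" only).
LEVELS = {f"heading {n}": n for n in range(1, 10)}

def build_article(paragraphs):
    """Nest (style_name, text) pairs into article/section elements."""
    root = ET.Element("article")
    stack = [(0, root)]  # (level, container); the root sits at level 0
    for style, text in paragraphs:
        level = LEVELS.get(style)
        if level is None:
            # Ordinary paragraph: goes into the innermost open container.
            ET.SubElement(stack[-1][1], "para", role=style).text = text
            continue
        # Close all containers at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root
```

A "heading 2" paragraph following a "heading 1" paragraph thus opens a section inside the section opened by the first title, while a later "heading 1" closes both.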

1.3. Translation of Tables

DocBook supports the use of both the CALS and the HTML table model. In the CALS table model, the pipeline does not support nested tables. If your further workflow requires CALS tables and you can be sure that your source documents do not contain nested tables, then using the CALS table model should be safe.

In any other case, it is highly recommended to use the HTML table model for conversion so that table structures are properly expressed in DocBook and no information is lost: when outputting to CALS, any nested table is dropped from the output completely.

1.4. The role attribute

The names of paragraph styles in the source document will be mirrored in the role attribute of the resulting para element.

1.5. Captions

Many elements in DocBook (tables, images) can be of the informal variant, i.e. they do not have a name or caption. Such elements are translated as expected and result in informaltable or inlinemediaobject elements.

When you want a table or image to have a caption or title, you must use the named paragraph style "caption" for it in your source document, and you must ensure that it immediately precedes or follows the table or image it belongs to. To resolve ambiguous situations when mixing the positions (before, after) within a single document, you can specify a bias parameter that tells the pipeline whether it should prefer captions before or after elements when unclear. However, it is recommended that you decide on one placement option (either before or after the element) throughout the whole document to avoid false caption/element combinations.
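To illustrate how a bias can resolve an ambiguous caption, here is a minimal sketch. It shows one plausible two-pass strategy, not the pipeline's actual algorithm; the function name attach_captions and the block representation as (kind, text) pairs are assumptions made for this example.

```python
OBJECTS = {"table", "image"}

def attach_captions(blocks, bias="after"):
    """Map the index of each table/image to the text of its caption.

    blocks is a list of (kind, text) pairs in document order, where kind
    is "caption", "table", "image" or anything else for ordinary content.
    """
    # Try the preferred side for every object first, then the other side.
    order = (1, -1) if bias == "after" else (-1, 1)
    used, titles = set(), {}
    for step in order:
        for i, (kind, _) in enumerate(blocks):
            if kind not in OBJECTS or i in titles:
                continue
            j = i + step
            if 0 <= j < len(blocks) and blocks[j][0] == "caption" and j not in used:
                titles[i] = blocks[j][1]
                used.add(j)
    return titles
```

For a caption sandwiched between two tables, a bias of "after" assigns it to the first table and a bias of "before" assigns it to the second, which is exactly the kind of situation a consistent placement avoids.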

2. Configuration parameters

Fig. DocBook Article Pipeline Configuration

You can set the following options:

  • Source file:

    Lets you choose the source RTF or Word file to convert.

  • Destination folder:

    Lets you choose the folder where the final DocBook XML file will be written to.

  • Options:

    • Include revision information

      When checked, revisions to the source document are marked up using the phrase element with attribute revisionflag set to added or deleted, respectively.

    • Use PIs for generated text content

      When checked, processing instructions are generated for Word fields. They look like this:

      <?uc-gentext type="…" data="…" value="…"?>

      with the "attribute" values taken directly from the respective gentext element attributes.

    • Suppress default role attribute

      When checked, the default roles "Normal" (for paragraphs) and "Default Paragraph Font" (for character formatting) are suppressed in para and phrase elements, respectively.

    • Generate <emphasis> elements also for named character styles

      When checked, the pipeline will explicitly create nested emphasis elements with a role attribute value of bold and underline, respectively, even for named character styles which create corresponding phrase elements. This allows those post-processing systems unaware of the typographical rendering of those named phrase roles to at least pick up the mentioned traits italic (no role attribute), bold (role is bold) and underline (role is underline).

  • Search for captions:

    Lets you set where the pipeline should look for element captions first: directly after the respective element (table, image) or before the element.

  • Table model:

    Choose the table model to use in the resulting DocBook file, either HTML or CALS. As described earlier, it is highly recommended that you use the HTML table model for accurate and complete conversion results.

  • DocBook structure:

    Lets you choose between two root elements, book and article, and the resulting document structures therefrom.

  • Style to level mapping:

    Lets you override the default style to level mapping that is chosen based on the setting of the DocBook structure parameter. The details of this mechanism and the syntax of this parameter are described in section 4, Style to level mapping.

    Note that there are 9 lines to enter your mappings, each line corresponding to one structural level.

The pipeline implementation allows you to post-process the resulting DocBook XML file with another processing pipeline. The template comes with a sample pipeline configured to translate the DocBook file to HTML for viewing in a web browser. As a matter of fact, the HTML documentation you are currently reading has been generated from the sample RTF document distributed in /data/source/rtf/ using the settings you see in the screenshot above, utilizing the DocBook XSLs in the sub-pipeline. Parameters are:

  • Post-processing pipeline:

    The path to the pipeline to use for post-processing. If the field is empty, no post-processing takes place.

  • Pipeline parameters:

    Lets you specify additional parameters that should be passed to the sub-pipeline called. Use the same syntax as you would within an External Pipeline Processor's Parameters field.

3. Conversion feature details

This section describes how certain Word or RTF features in the source will be automatically converted to DocBook.

3.1. Paragraph and Character Styles

Both paragraph and character styles are supported, with special handling where appropriate. As a general rule, the name of the style is preserved in the role attribute of the destination element.

Consecutive paragraphs whose paragraph style names are contained in one of the following pipeline parameters are handled specially by grouping them into an element of the respective name:

  • dbBlockquoteStyles

  • dbAddressStyles

  • dbLiterallayoutStyles

  • dbProgramlistingStyles

  • dbScreenStyles

  • dbEpigraphStyles

  • dbHighlightsStyles

  • dbTipStyles

  • dbNoteStyles

  • dbCautionStyles

  • dbImportantStyles

  • dbWarningStyles

  • dbPersonnameStyles

Character styles with names listed in the pipeline's parameter dbInlineClasses create an inline element of the respective name in the result.

You can modify the default settings (i.e. the style names) for all of the above variables by editing the Custom pipeline variables parameter in the PVAR: set semantic style names module.
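The grouping of consecutive paragraphs into a container element can be sketched like this. It is a simplified illustration, not the pipeline's code: the dictionary GROUPING_STYLES and its style names are assumed stand-ins for parameters such as dbWarningStyles and dbNoteStyles, whose real default values are defined in the pipeline itself.

```python
from itertools import groupby
import xml.etree.ElementTree as ET

# Assumed values standing in for dbWarningStyles and dbNoteStyles.
GROUPING_STYLES = {
    "warning": {"warning"},
    "note": {"note"},
}

def container_for(style):
    """Return the container element name for a style, or None."""
    for element, styles in GROUPING_STYLES.items():
        if style in styles:
            return element
    return None

def group_paragraphs(paragraphs):
    """Group runs of (style_name, text) pairs into container elements."""
    root = ET.Element("section")
    for element, run in groupby(paragraphs, key=lambda p: container_for(p[0])):
        # Runs without a container are appended directly to the section.
        parent = root if element is None else ET.SubElement(root, element)
        for style, text in run:
            # The style name is mirrored in the role attribute.
            ET.SubElement(parent, "para", role=style).text = text
    return root
```

Two consecutive paragraphs with the style "warning" end up as two para children of a single warning element, each carrying role="warning".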

3.2. Manual formatting

The manual inline formattings italic, bold and underline are converted to emphasis elements with no role attribute, a role attribute of bold and a role attribute of underline, respectively.

Superscripted and subscripted text is automatically converted to superscript and subscript elements based on its layout traits.
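The mapping of the three manual formatting traits to emphasis elements can be written down as a small sketch. The function name emphasis and the table ROLE_FOR_TRAIT are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# italic has no role attribute; bold and underline use their own name.
ROLE_FOR_TRAIT = {"italic": None, "bold": "bold", "underline": "underline"}

def emphasis(trait, text):
    """Serialize an emphasis element for one manual formatting trait."""
    elem = ET.Element("emphasis")
    role = ROLE_FOR_TRAIT[trait]
    if role is not None:
        elem.set("role", role)
    elem.text = text
    return ET.tostring(elem, encoding="unicode")
```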

Warning

The pipeline does not check whether an element is valid at a specific location or not. If it isn't, the resulting document will not be valid. This is intended behaviour so you can correct source documents which do not map into the DocBook structure.

(By the way, this is an example of creating a warning from two consecutive paragraphs with the style name "warning".)

3.3. Lists

Lists are converted in a best-effort manner. The pipeline supports both normal and compact lists:

  1. This is a normal list. The list marker is outdented.

  2. A second item of the list, decimal numbering.

  1. This is a compact list. The list marker is placed inside the list item's block.

  2. This is a second item, which has a subordinate normal list with a different numbering scheme.

    1. First item of normal subordinate list with long first item.

    2. The list has lower roman numbering.

  3. This is a third list item.

3.4. Pagebreaks, Columnbreaks

Pagebreaks and column breaks in the original document are marked using XML comments.

3.5. Tables

The pipeline supports both the HTML and CALS[1] table model with full support for horizontal and vertical spans. Colors and other layout traits are not preserved.

Important

Nested tables are not supported in CALS table model output; they are dropped from the output completely (leaving an XML comment in the XML source where the table would have occurred).

Unless you can be sure that the source document does not contain nested tables, use of the HTML table model is preferred.
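One way to find out whether a document is safe for CALS output is to convert it once with the HTML table model and check the result for nested tables. The following is a minimal sketch of such a check; the function name has_nested_table is hypothetical, and the check assumes a result file without namespaces.

```python
import xml.etree.ElementTree as ET

def has_nested_table(root, table_tags=("table", "informaltable")):
    """Return True if any table element contains another table element."""
    for table in root.iter():
        if table.tag not in table_tags:
            continue
        for inner in table.iter():
            # iter() also yields the table itself, so skip it.
            if inner is not table and inner.tag in table_tags:
                return True
    return False
```

If the check returns False for all of your documents, switching the Table model parameter to CALS should not drop any table.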

Feature

Column 2

Wide Column 3 in repeating Header Row

Narrow Column 4

Colspan 1

yellow

Colspan 2

lavender

Rowspan 1

light green

Row+Colspan

light brown[2]

blue

dark gray

light gray

3.6. Unicode and Special Character Support

upCast RT comes with full Unicode support, including special font encodings like Zapf Dingbats, Symbol and (as far as possible) Wingdings. Encodings for custom fonts can easily be added. Here are some examples:

❤ (Zapf Dingbats hearts), © (copyright symbol), … (ellipsis).

3.7. Generated Content

RTF has the concept of automatically generated content at the time of publishing (i.e. printing) a document. Generated content can be output either by using the phrase element with a role attribute of GEN-contenttype:

<phrase role="GEN-TIME">21:39</phrase>

or by using a uc-gentext Processing Instruction:

<?uc-gentext type="TIME" value="21:39"?>

Generation can be controlled by setting the respective pipeline parameter.
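The two output forms carry the same information, so a post-processing step can turn one into the other. The following sketch converts uc-gentext processing instructions into the phrase form using Python's xml.dom.minidom; the function names are hypothetical, and the pseudo-attribute parsing is an assumption based on the examples shown above.

```python
import re
from xml.dom import minidom

# Pseudo-attributes inside the PI data, e.g. type="TIME" value="21:39".
PSEUDO_ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def parse_gentext(data):
    """Split the data part of a uc-gentext PI into a dictionary."""
    return dict(PSEUDO_ATTR.findall(data))

def gentext_pis_to_phrases(xml_text):
    """Replace uc-gentext PIs by phrase elements with a GEN-... role."""
    doc = minidom.parseString(xml_text)

    def walk(node):
        for child in list(node.childNodes):
            if (child.nodeType == child.PROCESSING_INSTRUCTION_NODE
                    and child.target == "uc-gentext"):
                attrs = parse_gentext(child.data)
                phrase = doc.createElement("phrase")
                phrase.setAttribute("role", "GEN-" + attrs.get("type", ""))
                phrase.appendChild(doc.createTextNode(attrs.get("value", "")))
                node.replaceChild(phrase, child)
            else:
                walk(child)

    walk(doc)
    return doc.documentElement.toxml()
```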

Examples of generated content are the current date, the current time or the source document's filename.

A second reference to an earlier footnote[1] is handled specially and creates a footnoteref element.

3.8. Cross References, Internal Links, External Links

Cross references to earlier sections (see Configuration parameters) are automatically generated using the xref element.

Links to interesting targets (like Lists) within the same document create link elements.

Hyperlinks to interesting external destinations create ulink elements.

3.9. Captions

RTF has no concept of titles for specific elements. Instead, they are formatted as standard paragraphs in the immediate vicinity of the object that should get a title and use a special paragraph style name, caption (or its localized equivalent):

Fig. upCast Product Logo

The DocBook pipeline tries to make a good guess for the matching title of an image or table it finds. You can set a bias as to whether the caption should be searched first before an element or after it by setting the corresponding pipeline parameter.

3.10. Index terms

This document has been annotated with index terms using Word's tools for marking index entries. Index entries are converted to indexterm elements up to tertiary level.
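As an illustration of the three supported levels, the sketch below builds an indexterm element from an index entry text in which the levels are separated by colons, as in Word's index entry fields. The function name is hypothetical, and silently ignoring levels beyond the third is an assumption of this sketch, not documented pipeline behaviour.

```python
import xml.etree.ElementTree as ET

LEVEL_TAGS = ("primary", "secondary", "tertiary")

def indexterm_from_entry(entry):
    """Build an indexterm element from an entry like "main:sub:subsub"."""
    term = ET.Element("indexterm")
    parts = [part.strip() for part in entry.split(":")]
    # zip() stops after the tertiary level; deeper levels are ignored here.
    for tag, text in zip(LEVEL_TAGS, parts):
        ET.SubElement(term, tag).text = text
    return term
```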

3.11. Floating Boxes

Floating textboxes are converted to sidebar elements.

3.12. Annotations

Word annotations or comments are converted to remark elements. If author information is available, it is included. [Remark by Christian Roth (CR): Should be revised to include more detail on authorname markup.]

3.13. Table of Contents, Object Lists

Generating a table of contents and its placement is not supported and is left to the final publishing engine. The same holds true for other lists (LoT, LoF, …).

Objects currently are not marked for inclusion in a list of objects. To include them, you need to edit the corresponding templates in the processing sheet.

4. Style to level mapping

Creating the hierarchical structure of the result DocBook document relies on a defined paragraph style (for titles) to level mapping. This mapping is different for the two root elements book and article since book has additional structuring layers (the part and chapter elements).

The pipeline template allows you to either use the default mappings if they suit your source documents, or to define your own style name to level mapping overlay.

4.1. book default mapping

The default mapping for the book root element looks like this:

book

Style Names                                           DocBook Element
Title, part-title                                     part
heading 1, Heading 1, Chapter title, chapter-title    chapter
heading 2, Heading 2, sect1, section1-title           section (level 1)
heading 3, Heading 3, sect2, section2-title           section (level 2)
heading 4, Heading 4, sect3, section3-title           section (level 3)
heading 5, Heading 5, sect4, section4-title           section (level 4)
heading 6, Heading 6, sect5, section5-title           section (level 5)
heading 7, Heading 7, sect6, section6-title           section (level 6)
heading 8, Heading 8, sect7, section7-title           section (level 7)

4.2. article default mapping

The default mapping for the article root element looks like this:

article

Style Names                                    DocBook Element
heading 1, Heading 1, sect1, section1-title    section (level 1)
heading 2, Heading 2, sect2, section2-title    section (level 2)
heading 3, Heading 3, sect3, section3-title    section (level 3)
heading 4, Heading 4, sect4, section4-title    section (level 4)
heading 5, Heading 5, sect5, section5-title    section (level 5)
heading 6, Heading 6, sect6, section6-title    section (level 6)
heading 7, Heading 7, sect7, section7-title    section (level 7)
heading 8, Heading 8, sect8, section8-title    section (level 8)
heading 9, Heading 9, sect9, section9-title    section (level 9)

4.3. Creating a custom overlay

Using the Style to level mapping parameter, you can define a custom overlay. This parameter is of type list and can have up to 9 elements. In the graphical UI, each line corresponds to a list element. Each list element in turn consists of a list of style names which map onto the respective DocBook element's title child (and thereby implicitly create the container nesting).

As an example, have a look at the following style to level mapping overlay:

As you see, for the list elements 1 and 3, two paragraph style names have been defined each. For syntax, enclose the style names in double quotes and separate them by whitespace or commas.

When using the article DocBook root element and combining this with the default settings, this results in the following final style to level mapping table:

As you see, for each level where there is no overlay defined (empty line), the default settings are used, whereas for the levels with non-empty definitions, those replace the default settings. When you do not want any of the default settings to apply, you must define non-empty entries for all 9 level mapping slots.
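The overlay rule can be expressed in a few lines. The sketch below is an illustration only: the function names and the overlay style names in the usage example are hypothetical, while ARTICLE_DEFAULTS reproduces the article default mapping listed in section 4.2.

```python
import re

# Default article mapping from section 4.2, one list per level 1..9.
ARTICLE_DEFAULTS = [
    [f"heading {n}", f"Heading {n}", f"sect{n}", f"section{n}-title"]
    for n in range(1, 10)
]

def parse_overlay_line(line):
    """Extract the double-quoted style names from one overlay line."""
    return re.findall(r'"([^"]*)"', line)

def effective_mapping(overlay_lines, defaults=ARTICLE_DEFAULTS):
    """Combine overlay lines with the defaults, level by level."""
    # Missing trailing lines are treated like empty lines.
    lines = list(overlay_lines) + [""] * (len(defaults) - len(overlay_lines))
    result = []
    for default, line in zip(defaults, lines):
        names = parse_overlay_line(line)
        # A non-empty overlay line replaces the default for that level.
        result.append(names if names else default)
    return result
```

With the hypothetical overlay lines '"My Title" "Big Heading"', an empty line and '"Sub A", "Sub B"', levels 1 and 3 use the custom names while level 2 and levels 4 to 9 keep their defaults.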

5. DocBook XML Catalog

upCast RT supports the use of XML Catalogs. The default installation of the application for Windows and Mac OS X includes the DocBook 4.5 DTD and a corresponding entry to its catalog file in the application-global catalog setup. The DTD is required by the pipeline for validation of the final DocBook XML result file, and in its default setting, it will use that application-global catalog setup.

However, if you are using a non-default installation of upCast RT (e.g. running it via the API in a server-like environment or from the commandline), you should definitely make sure that you add the DocBook DTD's catalog file in the Catalog tab of the pipeline settings.

Tip

We highly recommend keeping a local copy of the DocBook 4.5 DTD to be independent of internet access and to reduce net traffic and load on the DocBook servers.

6. Customizing the pipeline

The RTF to DocBook conversion pipeline is a standard upCast RT pipeline program. It relies both on some UPL code and tree processing and on a set of XSLTs (a core XSLT for common processing of book and article roots which gets included by dedicated entry-point XSLTs defining root-specific overrides and additions). To apply your own modifications, make a copy of the complete template (use the File > New from Template > Word to DocBook > create independent copy… command for this) and edit it as desired.

Most of the code (XSLT and UPL) is self-explanatory for a medium-experienced XSLT or UPL programmer or includes inline comments. A good grasp of upCast RT's concepts and architecture is certainly very helpful, as the pipeline implementation does use quite a few features of the application, some of them rather advanced ones (like custom module initialization code). However, most of the time, you will only want to modify some default parameters or tweak the XSLT processing sheets, which does not require in-depth upCast RT knowledge.

If you are unsure of how to implement a requirement, have found a bug or have an idea or suggestions to make on how to improve the pipeline implementation, please do contact us at support@infinity-loop.de.

7. Index

This is the place where we have Word generate its automatic index.



[1] CALS is an acronym for "Continuous Acquisition & Life-Cycle Support". (And this is a footnote.)

[2] Tables and lists are also supported in footnotes:

First table cell.

Second table cell

  • List item 1

  • List item 2

Dropped nested table in CALS table model export:

N-1

N-2

Some text after the table.