RTF to DocBook v4.5 Pipeline Documentation
Although this pipeline aims to convert arbitrary RTF documents to DocBook, conversion quality improves considerably when authors follow some discipline in the source document. DocBook is a large DTD with many semantic elements that have no uniquely identifying typographical equivalent.
The pipeline therefore relies on a default, basic set of named paragraph and character styles which it uses to detect semantic elements in the source document. These names can be configured if desired by editing pipeline module properties. The set of supported style names will be discussed later.
The pipeline implementation supports two DocBook root elements: book and article. Since a book has more structuring layers before reaching the recursive nesting of sections, the pipeline's mapping of styles to the respective container elements may need to be adjusted to fit your actual documents. The hierarchical mapping relies on an ordered set of named paragraph styles for the respective (required) title elements in the DocBook DTD. The pipeline infers the level of a title (and, as a result, the required nesting of container elements) from these named paragraph styles.
For each of the two root elements, book and article, there is a separate default set of named paragraph styles. You can override the effective set of styles (resulting from the chosen root element) with your own if needed.
DocBook supports the use of both the CALS and the HTML table model. In the CALS table model, the pipeline does not support nested tables. If your further workflow requires CALS tables and you can be sure that your source documents do not contain nested tables, then using the CALS table model should be safe.
In any other case, it is highly recommended to use the HTML table model for conversion. It expresses table structures properly in DocBook without losing information, whereas CALS output drops every nested table completely.
The names of paragraph styles in the source document will be mirrored in the role attribute of the resulting para element.
Many elements in DocBook (tables, images) come in an informal variant, i.e. one without a title or caption. Translation of such elements works as expected and creates informaltable or inlinemediaobject elements.
When you want a table or image to have a caption or title, you must use the named paragraph style "caption" for it in your source document, and it must immediately precede or follow the table or image it belongs to. To resolve ambiguous situations when a single document mixes both positions (before, after), you can specify a bias parameter that tells the pipeline whether to prefer captions before or after elements when the assignment is unclear. However, it is recommended that you decide on one placement (either before or after the element) throughout the whole document to avoid false caption/element combinations.
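The effect of the bias parameter can be illustrated with a minimal sketch. It is not the pipeline's actual implementation; the Block type, the function name assign_captions and the bias values "before" and "after" are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str       # "table", "image", "caption" or "para"
    text: str = ""


def assign_captions(blocks, bias="after"):
    """Map each table/image index to the index of its caption paragraph.

    Only directly adjacent caption paragraphs are considered, and each
    caption is used at most once. The bias decides which neighbour is
    tried first when both sides carry a caption.
    """
    order = (1, -1) if bias == "after" else (-1, 1)
    used = set()
    result = {}
    for i, block in enumerate(blocks):
        if block.kind not in ("table", "image"):
            continue
        for offset in order:
            j = i + offset
            if 0 <= j < len(blocks) and blocks[j].kind == "caption" and j not in used:
                result[i] = j
                used.add(j)
                break
    return result


# An image surrounded by two caption paragraphs is ambiguous without a bias.
doc = [Block("caption", "Fig. A"), Block("image"), Block("caption", "Fig. B")]
print(assign_captions(doc, bias="after"))
print(assign_captions(doc, bias="before"))
```

With captions on both sides of the image, the result depends entirely on the bias, which is why a consistent placement throughout the document is the safer choice.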
Fig. DocBook Article Pipeline Configuration
You can set the following options:
Source file:
Lets you choose the source RTF or Word file to convert.
Destination folder:
Lets you choose the folder where the final DocBook XML file will be written to.
Options:
When checked, revisions to the source document are marked up using the phrase element with attribute revisionflag set to added or deleted, respectively.
When checked, processing instructions are generated for Word fields. They look like this:
<?uc-gentext type="…" data="…" value="…"?>
with the "attribute" values taken directly from the respective gentext element attributes.
When checked, the default roles "Normal" (for paragraphs) and "Default Paragraph Font" (for character formatting) are suppressed in para and phrase elements.
When checked, the pipeline will explicitly create nested emphasis elements with a role attribute value of bold and underline, respectively, even for named character styles which create corresponding phrase elements. This allows those post-processing systems unaware of the typographical rendering of those named phrase roles to at least pick up the mentioned traits italic (no role attribute), bold (role is bold) and underline (role is underline).
Lets you set where the pipeline should look for element captions first: directly after the respective element (table, image) or before the element.
Table model:
Choose the table model to use in the resulting DocBook file, either HTML or CALS. As described earlier, it is highly recommended that you use the HTML table model for accurate and complete conversion results.
DocBook structure:
Lets you choose between the two root elements, book and article, and the document structures that result from them.
Style to level mapping:
Lets you override the default style to level mapping that is chosen based on the setting of the DocBook structure parameter. The details of this mechanism and the syntax of this parameter are described later in this document.
Note that there are 9 lines to enter your mappings, each line corresponding to one structural level.
The pipeline implementation allows you to post-process the resulting DocBook XML file with another processing pipeline. The template comes with a sample pipeline configured to translate the DocBook file to HTML for viewing in a web browser. As a matter of fact, the HTML documentation you are currently reading has been generated from the sample RTF document distributed in /data/source/rtf/ using the settings you see in the screenshot above, utilizing the DocBook XSLs in the sub-pipeline. Parameters are:
Post-processing pipeline:
The path to the pipeline to use for post-processing. If the field is empty, no post-processing takes place.
Pipeline parameters:
Lets you specify additional parameters that should be passed to the sub-pipeline called. Use the same syntax as you would within an External Pipeline Processor's Parameters field.
This section describes how certain Word or RTF features in the source are automatically converted to DocBook.
Both paragraph and character styles are supported, with special handling where appropriate. As a general rule, the name of the style is preserved in the role attribute of the destination element.
Consecutive paragraphs whose paragraph style names are listed in the following pipeline parameters are handled specially by grouping them into an element of the respective name:
dbBlockquoteStyles
dbAddressStyles
dbLiterallayoutStyles
dbProgramlistingStyles
dbScreenStyles
dbEpigraphStyles
dbHighlightsStyles
dbTipStyles
dbNoteStyles
dbCautionStyles
dbImportantStyles
dbWarningStyles
dbPersonnameStyles
Character styles with names listed in the pipeline's parameter dbInlineClasses create an inline element of the respective name in the result.
You can modify the default setting (=style names) for all of the above variables by editing the Custom pipeline variables parameter in the module.
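The grouping of consecutive paragraphs described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the style names in WRAPPER_STYLES are hypothetical stand-ins for the values of parameters such as dbWarningStyles or dbNoteStyles, and the function names are invented for this example.

```python
from itertools import groupby

# Hypothetical parameter values; the real defaults are defined in the
# pipeline module and may differ.
WRAPPER_STYLES = {
    "warning": {"warning"},
    "note": {"note"},
    "blockquote": {"blockquote", "quote"},
}


def wrapper_for(style):
    """Return the wrapper element name for a paragraph style, or None."""
    for element, styles in WRAPPER_STYLES.items():
        if style in styles:
            return element
    return None


def group_paragraphs(paragraphs):
    """Group runs of consecutive paragraphs that share a wrapper element.

    paragraphs is a list of (style_name, text) tuples. The result is a
    list of (wrapper_or_None, [texts]); paragraphs without a wrapper are
    kept as single-item entries.
    """
    grouped = []
    for wrapper, run in groupby(paragraphs, key=lambda p: wrapper_for(p[0])):
        texts = [text for _, text in run]
        if wrapper is None:
            grouped.extend((None, [t]) for t in texts)
        else:
            grouped.append((wrapper, texts))
    return grouped


paras = [("Normal", "a"), ("warning", "b"), ("warning", "c"), ("Normal", "d")]
print(group_paragraphs(paras))
```

In this sketch, adjacent paragraphs with different style names that map to the same wrapper element also end up in one group; whether the pipeline behaves the same way is not stated here.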
The manual inline formattings italic, bold and underline are converted to emphasis elements with no role attribute, a role attribute of bold and a role attribute of underline, respectively.
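A post-processing step can recover the typographical trait from the emphasis elements described above. The following is a minimal sketch using Python's standard library; the function name emphasis_traits and the sample paragraph are made up for this example.

```python
import xml.etree.ElementTree as ET


def emphasis_traits(para_xml):
    """Return (trait, text) for every emphasis element in a DocBook fragment.

    An emphasis element without a role attribute is read as italic;
    otherwise the role value (bold, underline) names the trait.
    """
    root = ET.fromstring(para_xml)
    return [
        (em.get("role", "italic"), "".join(em.itertext()))
        for em in root.iter("emphasis")
    ]


sample = (
    '<para>Some <emphasis>slanted</emphasis>, '
    '<emphasis role="bold">heavy</emphasis> and '
    '<emphasis role="underline">underscored</emphasis> text.</para>'
)
print(emphasis_traits(sample))
```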
Superscripted and subscripted text is automatically converted to superscript and subscript elements based on its layout traits.
The pipeline does not check whether an element is valid at a specific location or not. If it isn't, the resulting document will not be valid. This is intended behaviour so you can correct source documents which do not map into the DocBook structure.
(Incidentally, this is an example of creating a warning element from two consecutive paragraphs with the style name "warning".)
The pipeline supports both the HTML and CALS[1] table models with full support for horizontal and vertical spans. Colors and other layout traits are not preserved.
Nested tables are not supported in CALS table model output. They are dropped from the output completely (leaving an XML comment in the source code where the table would have occurred).
Unless you can be sure that the source document does not contain nested tables, using the HTML table model is preferred.
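One way to find out whether a document is safe for CALS output is to convert it once with the HTML table model and count the nested tables in the result. The sketch below assumes a DocBook 4.5 result without XML namespaces and without undeclared entity references; the function name count_nested_tables is invented for this example.

```python
import xml.etree.ElementTree as ET

TABLE_TAGS = {"table", "informaltable"}


def count_nested_tables(docbook_xml):
    """Count table elements that occur inside another table element."""
    root = ET.fromstring(docbook_xml)
    nested = 0

    def walk(elem, inside_table):
        nonlocal nested
        is_table = elem.tag in TABLE_TAGS
        if is_table and inside_table:
            nested += 1
        for child in elem:
            walk(child, inside_table or is_table)

    walk(root, False)
    return nested


sample = (
    "<article><informaltable><tr><td>"
    "<informaltable><tr><td>inner</td></tr></informaltable>"
    "</td></tr></informaltable></article>"
)
print(count_nested_tables(sample))
```

A count of zero suggests that switching to the CALS table model will not drop any table from this particular document.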
Feature | Column 2 | Wide Column 3 in repeating Header Row | Narrow Column 4 |
Colspan 1 | yellow | ||
Colspan 2 | lavender | ||
Rowspan 1 | light green | ||
Row+Colspan | light brown[2] | ||
blue | dark gray | ||
light gray | |||
upCast RT comes with full Unicode support, including special font encodings like Zapf Dingbats, Symbol and (as far as possible) Wingdings. Encodings for custom fonts can easily be added. Here are some examples:
❤ (Zapf Dingbats hearts), © (copyright symbol), … (ellipsis).
RTF has the concept of content that is generated automatically at the time of publishing (i.e. printing) a document. Generated content can be output either by using the phrase element with a role attribute of GEN-contenttype:
<phrase role="GEN-TIME">21:39</phrase>
or by using a uc-gentext processing instruction:
<?uc-gentext type="TIME" value="21:39"?>
Generation can be controlled by setting the respective pipeline parameter.
Examples of generated content are the current date, the current time or the source document's filename.
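Downstream tools that want to read the uc-gentext processing instructions need to parse the attribute-like pairs themselves, because XML treats the content of a processing instruction as plain text. The following is a minimal sketch using Python's standard library; the function name collect_gentext and the sample paragraph are assumptions made for this example.

```python
import re
from xml.dom import minidom

# Matches name="value" pairs inside the processing instruction data.
PSEUDO_ATTR = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')


def collect_gentext(xml_text):
    """Return one dict of pseudo-attributes per uc-gentext instruction."""
    found = []

    def walk(node):
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "uc-gentext"):
            found.append(dict(PSEUDO_ATTR.findall(node.data)))
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found


sample = '<para>Printed at <?uc-gentext type="TIME" value="21:39"?>.</para>'
print(collect_gentext(sample))
```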
A second reference to an earlier footnote[1] is handled specially and creates a footnoteref element.
Cross references to earlier sections (see Configuration parameters) are automatically generated using the xref element.
Links to interesting targets (like Lists) within the same document create link elements.
Hyperlinks to interesting external destinations create ulink elements.
RTF has no concept of titles for specific elements. Instead, they are formatted as standard paragraphs in the immediate vicinity of the object that should get a title and use a special paragraph style name, caption (or its localized equivalent):
Fig. upCast Product Logo
The DocBook pipeline tries to make a good guess for the matching title of an image or table it finds. You can set a bias as to whether the caption should be searched first before an element or after it by setting the corresponding pipeline parameter.
This document has been annotated with index terms using Word's tools for marking index entries. Index entries are converted to indexterm elements up to tertiary level.
Word annotations or comments are converted to remark elements. If author information is available, it is included. [Remark: Should be revised to include more detail on authorname markup.]
Generating a table of contents and its placement is not supported and is left to the final publishing engine. The same holds true for other lists (LoT, LoF, …).
Objects currently are not marked for inclusion in a list of objects. To include them, you need to edit the corresponding templates in the processing sheet.
Creating the hierarchical structure of the resulting DocBook document relies on a defined mapping of paragraph styles (for titles) to levels. This mapping is different for the two root elements book and article since book has additional structuring layers (the part and chapter elements).
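How a flat sequence of title levels turns into nested containers can be sketched with a simple stack. This is an illustration of the general technique for the article root, not the pipeline's own code; the function name nest_sections and the sample titles are invented for this example.

```python
import xml.etree.ElementTree as ET


def nest_sections(titles):
    """Build nested section elements from (level, text) pairs.

    Levels start at 1. A title at level n closes every open section at
    level n or deeper and opens a new section inside the nearest
    shallower container.
    """
    root = ET.Element("article")
    stack = [(0, root)]  # (level, container element); the root is level 0
    for level, text in titles:
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root


titles = [(1, "A"), (2, "A.1"), (2, "A.2"), (1, "B")]
print(ET.tostring(nest_sections(titles), encoding="unicode"))
```

In this sketch a skipped level (for example a level 3 title directly after a level 1 title) is nested directly under the nearest open section; the pipeline may treat such cases differently.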
The pipeline template allows you to either use the default mappings if they suit your source documents, or to define your own style name to level mapping overlay.
The default mapping for the book root element looks like this:
book | |
Style Names | DocBook Element |
Title, part-title | part |
heading 1, Heading 1, Chapter title, chapter-title | chapter |
heading 2, Heading 2, sect1, section1-title | section (level 1) |
heading 3, Heading 3, sect2, section2-title | section (level 2) |
heading 4, Heading 4, sect3, section3-title | section (level 3) |
heading 5, Heading 5, sect4, section4-title | section (level 4) |
heading 6, Heading 6, sect5, section5-title | section (level 5) |
heading 7, Heading 7, sect6, section6-title | section (level 6) |
heading 8, Heading 8, sect7, section7-title | section (level 7) |
The default mapping for the article root element looks like this:
article | |
Style Names | DocBook Element |
heading 1, Heading 1, sect1, section1-title | section (level 1) |
heading 2, Heading 2, sect2, section2-title | section (level 2) |
heading 3, Heading 3, sect3, section3-title | section (level 3) |
heading 4, Heading 4, sect4, section4-title | section (level 4) |
heading 5, Heading 5, sect5, section5-title | section (level 5) |
heading 6, Heading 6, sect6, section6-title | section (level 6) |
heading 7, Heading 7, sect7, section7-title | section (level 7) |
heading 8, Heading 8, sect8, section8-title | section (level 8) |
heading 9, Heading 9, sect9, section9-title | section (level 9) |
Using the Style to level mapping parameter, you can define a custom overlay. This parameter is of type list and can have up to 9 elements. In the graphical UI, each line corresponds to one list element. Each list element is in turn a list of style names that map onto the title child of the respective DocBook element (implicitly creating the container nesting).
As an example, have a look at the following style to level mapping overlay:
As you see, for the list elements 1 and 3, two paragraph style names have been defined each. For syntax, enclose the style names in double quotes and separate them by whitespace or commas.
When using the article DocBook root element, combining this with the default settings results in the following final style to level mapping table:
As you see, for each level where no overlay is defined (empty line), the default settings are used, whereas for the levels with non-empty definitions, those replace the default settings. When you do not want any of the default settings to apply, you must define non-empty entries for all 9 level mapping slots.
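The overlay rule described above (an empty line keeps the default, a non-empty line replaces it) can be expressed in a few lines. The sketch below rebuilds the article defaults from the table shown earlier; the overlay style names "Main Title", "Doc Heading" and "Sub Sub" as well as all function names are hypothetical and serve only as an illustration.

```python
import re

# Article defaults as listed in the default mapping table (levels 1 to 9).
ARTICLE_DEFAULTS = [
    [f"heading {n}", f"Heading {n}", f"sect{n}", f"section{n}-title"]
    for n in range(1, 10)
]


def parse_overlay_line(line):
    """Extract the double-quoted style names from one overlay line."""
    return re.findall(r'"([^"]+)"', line)


def effective_mapping(defaults, overlay_lines):
    """Replace a level's default styles wherever the overlay line is non-empty."""
    merged = []
    for level, default_styles in enumerate(defaults):
        line = overlay_lines[level] if level < len(overlay_lines) else ""
        custom = parse_overlay_line(line)
        merged.append(custom if custom else list(default_styles))
    return merged


def level_of(style, mapping):
    """Return the 1-based level of a style name, or None if it is unmapped."""
    for level, styles in enumerate(mapping, start=1):
        if style in styles:
            return level
    return None


# Hypothetical overlay: levels 1 and 3 are overridden, level 2 is left empty.
overlay = ['"Main Title", "Doc Heading"', "", '"Sub Sub"']
mapping = effective_mapping(ARTICLE_DEFAULTS, overlay)
print(mapping[:3])
print(level_of("Heading 2", mapping), level_of("heading 1", mapping))
```

Note that in this sketch "heading 1" no longer maps to any level once level 1 has an overlay, which matches the statement that non-empty definitions replace the defaults rather than extend them.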
upCast RT supports the use of XML Catalogs. The default installation of the application for Windows and Mac OS X includes the DocBook 4.5 DTD and a corresponding entry to its catalog file in the application-global catalog setup. The DTD is required by the pipeline for validation of the final DocBook XML result file, and in its default setting, it will use that application-global catalog setup.
However, if you are using a non-default installation of upCast RT (e.g. running it via the API in a server-like environment or from the commandline), you should definitely make sure that you add the DocBook DTD's catalog file in the Catalog tab of the pipeline settings.
We highly recommend keeping a local copy of the DocBook 4.5 DTD to be independent of internet access and to reduce net traffic and load on the DocBook servers.
The RTF to DocBook conversion pipeline is a standard upCast RT pipeline. It relies on UPL code and tree processing as well as on a set of XSLTs (a core XSLT for common processing of book and article roots, which is included by dedicated entry-point XSLTs that define root-specific overrides and additions). To apply your own modifications, make a copy of the complete template (use the > > > command for this) and edit it as desired.
Most of the code (XSLT and UPL) is self-explanatory for a moderately experienced XSLT or UPL programmer or includes inline comments. A good grasp of upCast RT's concepts and architecture is certainly very helpful, as the pipeline implementation uses quite a few features of the application, some of them rather advanced (like custom module initialization code). However, most of the time you will only want to modify some default parameters or tweak the XSLT processing sheets, which does not require in-depth upCast RT knowledge.
If you are unsure of how to implement a requirement, have found a bug or have an idea or suggestions to make on how to improve the pipeline implementation, please do contact us at support@infinity-loop.de.
This is the place where we had Word generate its automatic index.