This schema documents the internal document structure that the
RTF Importer module creates as its result.
uci
schema design
Physical separation of out-of-flow from in-flow content
During development of the RTF Importer
module, the following fundamental design decision was made:
out-of-flow elements are located in a separate, out-of-flow
container. This is different from most other common schemas
like DocBook or HTML. Why did we do this?
One of upCast RT's main goals is to offer operators for bringing
legacy, often unstructured word-processing content into a useful
structure. For this, we often need to work on the raw text and its
formatting, for example to apply regular expressions or to determine
the styling of a run of text. If we kept out-of-flow elements like
footnotes, textboxes or annotations inline in the text flow, those
operations would become much more complicated to use (or they would
need to be equipped with some auto-magic). Let's have a look at an
example:
<par>Java<footnote>A programming language
formerly called Oak.</footnote>-based
applications</par>
Suppose the user wanted
"Java-based applications" to be a heading, but did not use a dedicated
style for this and used local formatting overrides instead. Now, we
need to detect this - but how? A straightforward approach would be to
define the following rule for headings:
If a paragraph consists
only of at most 35 characters that are 16pt bold, consider it a
heading.
If you applied that rule to the paragraph above, you would
find that
string-length( string(self::par) )
is greater
than 35 because it also counts the characters in descendants.
Additionally, the characters in the footnote would not be 16pt and
bold, so the second condition would not match either.
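The effect is easy to reproduce outside upCast RT. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (not an upCast RT API); joining itertext() plays the role of the XPath string(self::par) call, and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# The example paragraph with the footnote kept inline in the text flow.
par = ET.fromstring(
    "<par>Java<footnote>A programming language formerly called Oak."
    "</footnote>-based applications</par>"
)

# Counterpart of string(self::par): text of the element and all descendants.
full_text = "".join(par.itertext())

# Text that is actually in the flow: the footnote subtree is skipped by hand.
flow_text = (par.text or "") + "".join(child.tail or "" for child in par)

print(len(full_text))  # 66, so the "at most 35 characters" test fails
print(len(flow_text))  # 23, the length the heading rule should have seen
```

Note that the hand-written exclusion only covers direct children of the paragraph; handling arbitrary nesting is exactly the kind of extra logic the design tries to avoid.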
If you have
ever had to deal with this kind of problem (excluding certain sub-trees
in operations on elements), you know that this can become a nightmare.
You must define logic to exclude those items at any level, keep the
list of elements to exclude up-to-date, etc. Since these operations
are fundamental, common operations in legacy document conversion, we
decided to move logical out-of-flow content physically out of the
flow as well. This means that the above turns conceptually into a
structure like
<body><par>Java<contentref
idref="id1"/>-based
applications</par></body>
<extflow><footnote
id="id1">A programming language formerly called
Oak.</footnote></extflow>
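The transformation itself can be sketched in a few lines. This is an illustration of the idea, not the RTF Importer's actual implementation: it uses Python's xml.etree.ElementTree, the un-prefixed element names from the conceptual example, and a hypothetical OUT_OF_FLOW set and separate_out_of_flow helper.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: only footnotes count as out-of-flow content.
OUT_OF_FLOW = {"footnote"}

def separate_out_of_flow(body):
    """Move out-of-flow elements into a new extflow container.

    Each moved element is replaced in place by a contentref that points
    to it, so the text flow of the parent stays intact.
    """
    extflow = ET.Element("extflow")
    counter = 0
    for parent in list(body.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag not in OUT_OF_FLOW:
                continue
            counter += 1
            ref_id = f"id{counter}"
            ref = ET.Element("contentref", idref=ref_id)
            # The text following the footnote belongs to the flow.
            ref.tail, child.tail = child.tail, None
            child.set("id", ref_id)
            parent[index] = ref
            extflow.append(child)
    return extflow

body = ET.fromstring(
    "<body><par>Java<footnote>A programming language formerly called Oak."
    "</footnote>-based applications</par></body>"
)
extflow = separate_out_of_flow(body)
par = body.find("par")
print("".join(par.itertext()))  # Java-based applications
```

After the separation, the plain text of the paragraph is 23 characters long, so a length-based heading rule can be applied without any exclusion logic.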
Layout properties exposed as namespaced attributes
You will often want to access
certain layout properties on elements. These CSS properties are
exposed in the tree as synthesized attributes. The fact that they
are synthesized dynamically at query time is a technical detail you
can ignore for most operations, except that you cannot set them. They
only become real element attributes when you serialize the internal
tree with the XML Exporter module (and they are not filtered by its
attribute filter settings). The properties are exposed on elements in
the three semantic namespaces css, csso and cssc. See the upCast RT
manual for details on the semantics of these namespaces.
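As an illustration, the heading rule from the previous section could be checked against serialized output roughly as follows. This is a sketch under several assumptions: the namespace URI is a placeholder (the real URIs are given in the upCast RT manual), the attribute names font-size and font-weight are assumed to mirror the CSS property names, and the properties are assumed to sit on the paragraph itself rather than on inline runs.

```python
import xml.etree.ElementTree as ET

# Placeholder URI for the css namespace; NOT the real upCast RT URI.
CSS_NS = "urn:example:css"

# Hypothetical serialized fragment with layout properties as attributes.
SERIALIZED = (
    f'<body xmlns:css="{CSS_NS}">'
    '<par css:font-size="16pt" css:font-weight="bold">'
    "Java-based applications</par>"
    '<par css:font-size="10pt">Ordinary body text.</par>'
    "</body>"
)

def looks_like_heading(par, max_chars=35):
    """Apply the 'at most 35 characters, 16pt bold' rule to one paragraph."""
    text = "".join(par.itertext())
    return (
        len(text) <= max_chars
        and par.get(f"{{{CSS_NS}}}font-size") == "16pt"
        and par.get(f"{{{CSS_NS}}}font-weight") == "bold"
    )

body = ET.fromstring(SERIALIZED)
for par in body.findall("par"):
    print(looks_like_heading(par), "".join(par.itertext()))
```

The sketch only reads the attributes, which matches the constraint that the synthesized properties cannot be set.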
Document structure
Basic structure
The basic document structure looks as follows:
<uci:document>
<uci:head>...</uci:head>
<uci:body>...</uci:body>
<uci:extflow>...</uci:extflow>
</uci:document>
The uci:body element contains all content that is
within the regular document flow.
The uci:extflow element contains all out-of-flow element
definitions like footnotes, endnotes, annotations, page headers and
footers, binary objects, textboxes etc.
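Following a reference from the body to its definition then amounts to a lookup by id. The sketch below shows one way to do that on a serialized document with Python's xml.etree.ElementTree. The namespace URI is a placeholder, the uci-prefixed names uci:par, uci:contentref and uci:footnote are assumed from the conceptual example earlier in this document, and resolve_references is a hypothetical helper that only indexes direct children of uci:extflow.

```python
import xml.etree.ElementTree as ET

# Placeholder URI for the uci namespace; NOT the real one.
UCI_NS = "urn:example:uci"
NS = {"uci": UCI_NS}

DOC = (
    f'<uci:document xmlns:uci="{UCI_NS}">'
    "<uci:head/>"
    '<uci:body><uci:par>Java<uci:contentref idref="id1"/>'
    "-based applications</uci:par></uci:body>"
    '<uci:extflow><uci:footnote id="id1">A programming language '
    "formerly called Oak.</uci:footnote></uci:extflow>"
    "</uci:document>"
)

def resolve_references(document):
    """Map each idref used in uci:body to its element in uci:extflow."""
    definitions = {
        element.get("id"): element
        for element in document.find("uci:extflow", NS)
        if element.get("id") is not None
    }
    body = document.find("uci:body", NS)
    return {
        ref.get("idref"): definitions.get(ref.get("idref"))
        for ref in body.iter(f"{{{UCI_NS}}}contentref")
    }

document = ET.fromstring(DOC)
resolved = resolve_references(document)
print(resolved["id1"].text)  # A programming language formerly called Oak.
```

A reference without a matching definition maps to None, which makes dangling references easy to spot.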
Tables
Tables are represented as a
rectangular, regular grid of uci:cell objects, organized in uci:row
objects, which are contained in a uci:table object. There are no
elements to physically represent table headers, table footers or row
groups. Instead, this information is attached to uci:cell and uci:row
objects as attributes. This simple, generic grid structure allows us to
- easily change row-grouping structures programmatically by simply
setting attributes instead of having to physically juggle table
elements (creation, moving, etc.)
- be table-model independent: the XML Exporter takes care of
serializing the generic internal table structure in one of
(currently) three target models, based solely on the attribute
settings: HTML, CALS and native.
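The benefit of the flat grid can be illustrated with a small sketch. It is not the XML Exporter's implementation: the attribute name rowgroup, its values, the un-prefixed element names and the to_html_table helper are all hypothetical and serve only to show how a row-group structure can be derived from attributes at serialization time.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical attribute name and values; the real schema defines its own.
GROUP_ATTR = "rowgroup"
HTML_GROUP_TAG = {"header": "thead", "body": "tbody", "footer": "tfoot"}

def to_html_table(table):
    """Serialize a flat row/cell grid into an HTML-style table.

    Consecutive rows with the same group attribute end up in one
    thead, tbody or tfoot element; rows without the attribute are body rows.
    """
    html = ET.Element("table")
    row_group = lambda row: row.get(GROUP_ATTR, "body")
    for group, rows in groupby(table, key=row_group):
        section = ET.SubElement(html, HTML_GROUP_TAG[group])
        for row in rows:
            tr = ET.SubElement(section, "tr")
            for cell in row:
                cell_tag = "th" if group == "header" else "td"
                ET.SubElement(tr, cell_tag).text = cell.text
    return html

table = ET.fromstring(
    "<table>"
    "<row><cell>r1c1</cell><cell>r1c2</cell></row>"
    "<row><cell>r2c1</cell><cell>r2c2</cell></row>"
    "<row><cell>r3c1</cell><cell>r3c2</cell></row>"
    "</table>"
)

print([section.tag for section in to_html_table(table)])  # ['tbody']

# Turning the first row into a header row is a single attribute change.
table[0].set(GROUP_ATTR, "header")
print([section.tag for section in to_html_table(table)])  # ['thead', 'tbody']
```

No element is created, moved or deleted in the internal grid; only the serializer decides how the grouping appears in the target model.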