Office documents: a ZIP of namespaced parts
The .docx on your disk and the .odt next to it are not single XML files. Each
is a ZIP archive of many XML parts, wired together by relationship files.
This page looks at both big families — Microsoft's OOXML (.docx, .xlsx,
.pptx) and OASIS's OpenDocument / ODF (.odt, .ods, .odp) — because
together they show what happens when a vocabulary gets large: dozens of
namespaces, split across files, referencing each other.
Look inside one yourself
A .docx is just a ZIP. unzip -l report.docx lists the parts;
[Content_Types].xml, word/document.xml, word/styles.xml, and
word/_rels/document.xml.rels are the load-bearing ones. ODF is the same idea:
content.xml, styles.xml, meta.xml, META-INF/manifest.xml.
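The same listing can be done from Python's standard library. The sketch below builds a tiny in-memory stand-in for report.docx (the part contents are placeholders, not a valid Word document) and lists its members; the helper names `build_demo_package` and `list_parts` are illustrative assumptions, not part of any Office tooling.

```python
import io
import zipfile

# Hypothetical stand-in for report.docx: the part names follow the ones
# listed above, but the contents are placeholders, not real WordprocessingML.
PART_NAMES = [
    "[Content_Types].xml",
    "word/document.xml",
    "word/styles.xml",
    "word/_rels/document.xml.rels",
]


def build_demo_package() -> bytes:
    """Write the placeholder parts into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in PART_NAMES:
            zf.writestr(name, "<placeholder/>")
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the member names of a ZIP-based office package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.namelist()


parts = list_parts(build_demo_package())
for name in parts:
    print(name)
```

For a file on disk, `zipfile.ZipFile("report.docx").namelist()` gives the same kind of list without the in-memory detour.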
OOXML: the main document part
word/document.xml holds the text. WordprocessingML's vocabulary lives almost
entirely under one prefix, w:, with r: (relationships) reaching out of the
part to images and hyperlinks.
- w:p is a paragraph; w:pPr is its properties (paragraph style, spacing). OOXML's pattern is rigid: a container element whose optional first child is a …Pr properties element, followed by the content. Once you see it, the whole format reads.
- w:r is a run, a span of text with uniform formatting, and w:rPr is its run properties (w:b = bold). Text always lives inside a w:t.
- r:id="rId4" does not contain the URL. It is a relationship id that points into a separate part, word/_rels/document.xml.rels, where rId4 maps to an actual target. The r: namespace exists purely to express these cross-part links.
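To make the pieces concrete, here is a minimal sketch that parses a small WordprocessingML fragment with Python's ElementTree. The fragment is an assumed illustration written for this example (the heading style name, the link text, and the exact nesting are hypothetical); only the two namespace URIs and the element names come from the format itself.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Assumed, hand-written fragment: one paragraph with a bold run and a hyperlink.
FRAGMENT = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Bold </w:t></w:r>
      <w:hyperlink r:id="rId4"><w:r><w:t>link text</w:t></w:r></w:hyperlink>
    </w:p>
  </w:body>
</w:document>"""

ns = {"w": W, "r": R}
root = ET.fromstring(FRAGMENT)

# Paragraph properties sit in w:pPr, the first child of w:p.
paragraph_style = root.find(".//w:p/w:pPr/w:pStyle", ns).get(f"{{{W}}}val")

# Text lives in w:t; a run is bold when its w:rPr contains w:b.
texts = [t.text for t in root.iterfind(".//w:t", ns)]
bold_runs = [r for r in root.iterfind(".//w:r", ns)
             if r.find("w:rPr/w:b", ns) is not None]

# The hyperlink carries only a relationship id, never the URL itself.
link_ids = [h.get(f"{{{R}}}id") for h in root.iterfind(".//w:hyperlink", ns)]

# xml:space needs no declaration; the xml: prefix is always bound.
space_attr = root.find(".//w:t", ns).get(f"{{{XML_NS}}}space")

print(paragraph_style, texts, link_ids, space_attr)
```

Attribute names are namespaced too, which is why `w:val` and `r:id` are looked up with the `{uri}local` form rather than by prefix.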
Note xml:space="preserve" on the bold run: the xml: prefix is the one
namespace every XML document gets for free (bound to
http://www.w3.org/XML/1998/namespace), used here so the trailing space in
"Bold " survives.
The relationships part
The id from r:id="rId4" is resolved here. This is OOXML's answer to "how does
one XML part point at another file in the package":
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId4" Type="http://…/hyperlink"
                Target="https://example.org" TargetMode="External"/>
  <Relationship Id="rId5" Type="http://…/image"
                Target="media/logo.png"/>
</Relationships>
The indirection (rId4 → real target) keeps document.xml stable when targets
move, and lets binary assets (images) live as separate package members rather
than being base64-stuffed into the markup.
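Resolving an id is a plain dictionary lookup once the .rels part is parsed. The sketch below reuses the relationships XML shown above verbatim (including its abbreviated Type values, which are left as they appear); `load_relationships` is a hypothetical helper name, and treating a missing TargetMode as "Internal" follows the packaging convention that internal is the default.

```python
import xml.etree.ElementTree as ET

PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

# The sample from the text; Type values stay abbreviated as in the original.
RELS_XML = f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId4" Type="http://…/hyperlink"
                Target="https://example.org" TargetMode="External"/>
  <Relationship Id="rId5" Type="http://…/image"
                Target="media/logo.png"/>
</Relationships>"""


def load_relationships(rels_xml: str) -> dict[str, tuple[str, str]]:
    """Map each relationship Id to (Target, TargetMode)."""
    root = ET.fromstring(rels_xml)
    table = {}
    for rel in root.iterfind(f"{{{PKG_REL}}}Relationship"):
        # TargetMode is optional; an absent value means a part inside the package.
        mode = rel.get("TargetMode", "Internal")
        table[rel.get("Id")] = (rel.get("Target"), mode)
    return table


rels = load_relationships(RELS_XML)
print(rels["rId4"])  # the hyperlink's external URL
print(rels["rId5"])  # the image, relative to the word/ folder
```

Internal targets such as media/logo.png are resolved relative to the part that owns the .rels file, so here the image lives at word/media/logo.png inside the ZIP.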
ODF: the same problem, OASIS's answer
OpenDocument splits content the same way but makes a different namespace choice:
instead of one big w:, it uses many small, semantic namespaces — office:,
text:, table:, style:, draw:, fo: — so each prefix names a domain.
Two philosophies, same data model
OOXML funnels almost everything through w:/x:/p: (one prefix per app:
Word, Excel, PowerPoint). ODF spreads it across text:/table:/style: (one
prefix per concept). Neither is wrong — they are two legitimate ways to carve
up a large vocabulary with namespaces, and seeing them side by side is the
lesson. ODF even reuses XSL-FO property names under its fo: prefix, which is
bound to an OASIS "XSL-FO compatible" namespace (see the next-but-one page),
rather than inventing its own formatting vocabulary.
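A short sketch shows the one-prefix-per-concept layout in practice. The content.xml fragment below is an assumed, hand-written illustration (the style name T1 and the text are hypothetical); the four namespace URIs are the ones ODF defines for these prefixes. The code tallies elements by prefix and reads a formatting property from the fo: namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}
DECLS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())

# Assumed fragment in the shape of an .odt content.xml.
CONTENT = f"""<office:document-content {DECLS}>
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:h text:outline-level="1">Quarterly report</text:h>
      <text:p><text:span text:style-name="T1">Bold </text:span>plain text</text:p>
    </office:text>
  </office:body>
</office:document-content>"""

root = ET.fromstring(CONTENT)
URI_TO_PREFIX = {uri: prefix for prefix, uri in NS.items()}


def prefix_of(tag: str) -> str:
    """Turn ElementTree's '{uri}local' tag into the conventional ODF prefix."""
    uri = tag[1:].split("}", 1)[0]
    return URI_TO_PREFIX[uri]


# Each prefix names a domain: structure, text content, styling.
element_counts = Counter(prefix_of(el.tag) for el in root.iter())
paragraphs = ["".join(p.itertext()) for p in root.iterfind(".//text:p", NS)]
weight = root.find(".//style:text-properties", NS).get(f"{{{NS['fo']}}}font-weight")

print(dict(element_counts), paragraphs, weight)
```

Note that fo: shows up only on an attribute here, so it does not appear in the element tally even though the document depends on it.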
Editing a document by hand
Because the package is just a ZIP of text, you can change a document with nothing
but unzip, an editor, and zip — no Office libraries at all. This is the most
direct way to feel what the format really is:
unzip report.docx -d report/ # explode the package
unxml report/word/document.xml # read the part compactly (or edit it)
# … change "Quarterly report" to "Annual report" in word/document.xml …
cd report && zip -r ../edited.docx . && cd .. # repackage
# edited.docx opens in Word with the edited heading
Three things make this work — and one that quietly breaks it:
- The edit is text in, text out. As long as you keep document.xml well-formed and do not touch the [Content_Types].xml that names each part, Word reads your hand-edited package without complaint.
- Binary assets (the media/logo.png from the .rels) are copied through the zip untouched; you are only ever editing the XML parts.
- Mail-merge and report generation in the wild are often exactly this: unzip a template, string-replace tokens in document.xml, rezip. No COM automation, no headless Word.
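The unzip, replace, rezip loop from the list above can be sketched in a few lines of Python. Everything here is an assumption for illustration: `replace_in_part` is a hypothetical helper, the template is a tiny in-memory package rather than a real Word file, and the part is assumed to be UTF-8 encoded.

```python
import io
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def replace_in_part(package: bytes, part: str, old: str, new: str) -> bytes:
    """Copy a ZIP package, string-replacing text inside one XML part."""
    out = io.BytesIO()
    src = zipfile.ZipFile(io.BytesIO(package))
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part:
                # Only the named part is edited (assumes UTF-8 XML).
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Every other member, binary or XML, is copied byte for byte.
            dst.writestr(info.filename, data)
    return out.getvalue()


# Hypothetical template package with one heading and one binary asset.
template = io.BytesIO()
with zipfile.ZipFile(template, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
        "<w:t>Quarterly report</w:t></w:r></w:p></w:body></w:document>",
    )
    zf.writestr("word/media/logo.png", b"\x89PNG placeholder bytes")

edited = replace_in_part(
    template.getvalue(), "word/document.xml", "Quarterly report", "Annual report"
)

with zipfile.ZipFile(io.BytesIO(edited)) as zf:
    edited_xml = zf.read("word/document.xml").decode("utf-8")
    edited_logo = zf.read("word/media/logo.png")
    edited_names = zf.namelist()

print(edited_xml)
```

A plain string replace only works when the token sits inside a single w:t; Word may split visible text across several runs, in which case a template needs tokens that were typed in one go or a run-aware replacement.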
ODF's mimetype must be stored first and uncompressed
.odt/.ods files carry a mimetype entry that must be the very first
member of the archive and stored without compression — it lets a tool
identify the file type by reading the first few bytes, before parsing any XML.
A blind zip -r compresses it and reorders the entries, and some readers then
refuse the file. The fix is to add it first, stored: zip -X0 out.odt mimetype
then zip -rX out.odt . -x mimetype. OOXML has no such rule — its type map
lives in [Content_Types].xml inside the package instead.
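The mimetype-first rule is easy to honor when the archive is written programmatically. This is a minimal sketch under stated assumptions: `pack_odf` is a hypothetical helper, and the parts passed in are placeholders rather than a complete, openable .odt.

```python
import io
import zipfile

ODT_MIMETYPE = b"application/vnd.oasis.opendocument.text"


def pack_odf(parts: dict[str, bytes], mimetype: bytes = ODT_MIMETYPE) -> bytes:
    """Write an ODF-style ZIP with 'mimetype' first and uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # First member, stored: the type string is then readable as raw bytes.
        zf.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
        for name, data in parts.items():
            if name == "mimetype":
                continue  # already written above
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


# Placeholder parts; a real document needs full content and manifest XML.
package = pack_odf({
    "content.xml": b"<placeholder/>",
    "META-INF/manifest.xml": b"<placeholder/>",
})

with zipfile.ZipFile(io.BytesIO(package)) as zf:
    first = zf.infolist()[0]

print(first.filename, first.compress_type == zipfile.ZIP_STORED)
# The 30-byte local file header is followed by the name, then the raw type string.
print(package[30:38], package[38:38 + len(ODT_MIMETYPE)])
```

Because the entry is stored and comes first, a type sniffer can check a fixed byte range at the start of the file without decompressing anything.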
Querying Office XML
Both formats are heavily namespaced, so, exactly as with SVG, //p finds no
paragraphs. You must bind the prefixes in your XPath host and query
//w:p or //text:p. The wrinkle that surprises people: the prefix you use in
your query need not match the document's; only the namespace URI matters.
A document could declare xmlns:foo="…wordprocessingml…" and your //w:p would
still match, as long as your host binds w to that URI. Conversely, //foo:p
fails unless you bind foo to that URI in the host too; the document's own
declarations do not carry over to the query.
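The prefix-versus-URI point can be checked directly with ElementTree, whose `findall` takes the prefix bindings as a dictionary. The two-paragraph document below is an assumed toy example; ElementTree supports only a subset of XPath, so `.//` stands in for `//`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Toy document: two empty paragraphs, declared with the usual w: prefix.
DOC = f'<w:document xmlns:w="{W}"><w:body><w:p/><w:p/></w:body></w:document>'
root = ET.fromstring(DOC)

# No namespace in the query: matches elements named 'p' in no namespace, i.e. none.
unprefixed = root.findall(".//p")

# Prefix bound in the host to the right URI: matches.
bound_w = root.findall(".//w:p", {"w": W})

# A different prefix bound to the same URI matches just as well.
bound_foo = root.findall(".//foo:p", {"foo": W})

# A prefix the host never bound is an error, even though the document declares it.
try:
    root.findall(".//w:p")
    unbound_error = None
except SyntaxError as exc:
    unbound_error = str(exc)

print(len(unprefixed), len(bound_w), len(bound_foo), unbound_error)
```

The last case is the one that trips people up: the xmlns:w declaration inside the document is invisible to the query engine, which only knows the bindings it was given.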
Things to note
- A "file" can be a package of XML parts; the interesting structure is across parts, not within one.
- Relationship indirection (r:id → .rels) keeps markup stable and keeps binary assets out of the XML.
- A large vocabulary can be organized one-prefix-per-app (OOXML) or one-prefix-per-concept (ODF): two namespace strategies, with different trade-offs.
- xml:space and xml:lang ride the always-available xml: namespace.
Next: Atom and feed extensions, where the lesson flips — instead of one huge vocabulary, a tiny core that everyone extends.