APIs: Python (lxml)
Python has two XML libraries worth knowing: the standard-library
xml.etree.ElementTree (always available, minimal) and lxml (a binding to
libxml2/libxslt that adds full XPath, XSLT
1.0, XSD and Schematron). For real work, use lxml. This page
works through the five tasks on the shared
invoice.xml.
Install
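Run pip install lxml. The API mirrors ElementTree (lxml.etree is largely a
drop-in superset), so most snippets below work on stdlib ElementTree too.
The exceptions are lxml-only: the .xpath() method (stdlib find()/findall()
accept only a small XPath subset), the tag= filter and sibling cleanup used
with iterparse, XSLT, and schema validation.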
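To make the "drop-in superset" claim concrete, here is a minimal sketch that prefers lxml and falls back to the standard library. The inline XML string is a hypothetical stand-in for the shared invoice.xml, and the brace-style tag name it uses is the notation explained in the next section.

```python
# Prefer lxml, but fall back to the stdlib module with the same core API.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for invoice.xml (assumed structure).
XML = """<inv:invoice xmlns:inv="urn:example:invoice">
  <inv:total currency="EUR">100.00</inv:total>
</inv:invoice>"""

root = etree.fromstring(XML)

# Only calls that exist in both libraries are used here.
total = root.find("{urn:example:invoice}total")
print(root.tag)
print(total.text, total.get("currency"))
```

Code written against this common subset runs unchanged on either library; anything that calls .xpath(), XSLT or XMLSchema needs the lxml import to succeed.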
1. Parse: the element tree
from lxml import etree
doc = etree.parse("invoice.xml") # ElementTree
root = doc.getroot() # <inv:invoice>
Python's model has one defining quirk: element tags are stored in James Clark
notation, {namespace-uri}localname:
print(root.tag) # {urn:example:invoice}invoice (1)
total = root.find("{urn:example:invoice}total")
print(total.text) # 100.00
print(total.get("currency")) # EUR -- unprefixed attr, so no namespace (2)
1. The prefix inv from the file no longer appears in the tag by the time you have a tree: it is replaced by the URI in braces. This is Python making the namespace rule physical: an element's identity is {uri}local, and the prefix was only ever a serialization detail. (lxml still exposes the original prefix as el.prefix, but nothing matches on it.)
2. get("currency") with a bare name targets the no-namespace attribute, exactly like .NET. A namespaced attribute would be get("{uri}currency").
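The two notes above can be checked with a short sketch. It assumes an inline document standing in for invoice.xml, and the x:rounded attribute in the urn:example:ext namespace is invented purely to show a namespaced attribute lookup. etree.QName is lxml's helper for splitting a {uri}local name.

```python
from lxml import etree

# Hypothetical document; x:rounded is an invented attribute for illustration.
XML = """<inv:invoice xmlns:inv="urn:example:invoice"
             xmlns:x="urn:example:ext">
  <inv:total currency="EUR" x:rounded="true">100.00</inv:total>
</inv:invoice>"""

root = etree.fromstring(XML)
total = root[0]                      # first child element

qname = etree.QName(total)           # splits "{uri}local" into its parts
print(qname.namespace)               # urn:example:invoice
print(qname.localname)               # total

print(total.get("currency"))                   # EUR (no-namespace attribute)
print(total.get("{urn:example:ext}rounded"))   # true (namespaced attribute)
print(total.get("rounded"))                    # None: the bare name does not match
```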
2. Navigate: XPath with a namespace map
{uri}local is verbose, so for real navigation use .xpath() with a
namespaces= dict — lxml's prefix → URI map:
ns = {"i": "urn:example:invoice", # (1)!
"p": "urn:example:party"}
total = doc.xpath("/i:invoice/i:total/text()", namespaces=ns)[0] # "100.00"
ccy = doc.xpath("string(/i:invoice/i:total/@currency)", namespaces=ns) # "EUR"
name = doc.xpath("//p:name/text()", namespaces=ns)[0] # "Acme Records"
1. i is our prefix bound to the document's URI. As everywhere, the file's own prefix (inv) is irrelevant: only the URI matches. lxml raises XPathEvalError if you use a prefix in the query that is not in namespaces, which is friendlier than silently matching nothing.
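Both halves of that note can be demonstrated in a few lines. This is a sketch against an assumed inline document (the p:seller wrapper element is hypothetical): two different prefix choices bound to the same URI return the same result, and an unbound prefix raises instead of returning an empty list.

```python
from lxml import etree

# Hypothetical stand-in for invoice.xml; the seller wrapper is assumed.
XML = """<inv:invoice xmlns:inv="urn:example:invoice"
             xmlns:pty="urn:example:party">
  <pty:seller><pty:name>Acme Records</pty:name></pty:seller>
  <inv:total currency="EUR">100.00</inv:total>
</inv:invoice>"""

root = etree.fromstring(XML)
PARTY = "urn:example:party"

# The query prefix is ours to choose; only the URI has to match.
a = root.xpath("//p:name/text()", namespaces={"p": PARTY})
b = root.xpath("//party:name/text()", namespaces={"party": PARTY})
print(a == b)

# A prefix that is not in the map is an error, not an empty result.
error = None
try:
    root.xpath("//p:name")           # no namespaces= given
except etree.XPathEvalError as exc:
    error = str(exc)
print(error)
```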
There is no default-prefix shortcut in XPath 1.0
A common trap: you cannot map the empty prefix "" to a URI in
namespaces= and then write //total — XPath 1.0 forbids it (the same rule
as the SVG page). You must give
the namespace a non-empty prefix in your map and use it. (libxml2 supports
XPath 1.0 only.)
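A minimal sketch of the trap and the workaround, assuming a variant of the invoice that uses a default namespace. The prefix i is an arbitrary choice. The last step shows one way to reuse root.nsmap, which stores the default namespace under the key None, a key that .xpath() will not accept.

```python
from lxml import etree

# Hypothetical variant of the invoice that uses a default namespace.
XML = """<invoice xmlns="urn:example:invoice">
  <total currency="EUR">100.00</total>
</invoice>"""

root = etree.fromstring(XML)

# An unprefixed name in XPath 1.0 means "no namespace", so this finds nothing.
unprefixed = root.xpath("//total")

# Bind the URI to any non-empty prefix and use that prefix in the query.
prefixed = root.xpath("//i:total", namespaces={"i": "urn:example:invoice"})

# root.nsmap is {None: "urn:example:invoice"}; rename the None key before reuse.
safe_ns = {prefix or "i": uri for prefix, uri in root.nsmap.items()}
via_nsmap = root.xpath("//i:total", namespaces=safe_ns)

print(len(unprefixed), len(prefixed), len(via_nsmap))
```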
1b. Stream large files: iterparse
lxml's iterparse is the pull/streaming model:
it yields elements as they finish parsing, and you free them to keep memory flat:
INV = "urn:example:invoice"
for _, el in etree.iterparse("big.xml", tag=f"{{{INV}}}total"):  # (1)!
    print(el.text, el.get("currency"))
    el.clear()                                                   # (2)!
    while el.getprevious() is not None:
        del el.getparent()[0]                                    # (3)!
1. tag= filters to just the elements you care about. Note the triple braces: in an f-string, {{{INV}}} produces {urn:example:invoice}, which is then followed by the local name.
2. el.clear() drops the element's children and text once you are done with it.
3. The getprevious()/del loop also removes already-processed siblings: the canonical lxml idiom for processing a huge file in constant memory.
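The same idiom can be tried without a large file on disk. The sketch below feeds iterparse an in-memory stream; the inv:batch wrapper, the three totals and the sum_totals helper are all invented for illustration.

```python
import io
from decimal import Decimal

from lxml import etree

INV = "urn:example:invoice"

# Hypothetical batch document: three totals of 1.50, 2.50 and 3.50.
rows = "".join(f'<inv:total currency="EUR">{n}.50</inv:total>' for n in range(1, 4))
data = f'<inv:batch xmlns:inv="{INV}">{rows}</inv:batch>'.encode()


def sum_totals(source):
    """Sum every inv:total in the stream, freeing elements as we go."""
    grand = Decimal("0")
    for _, el in etree.iterparse(source, tag=f"{{{INV}}}total"):
        grand += Decimal(el.text)
        el.clear()                           # drop this element's content
        while el.getprevious() is not None:  # drop already-processed siblings
            del el.getparent()[0]
    return grand


grand_total = sum_totals(io.BytesIO(data))
print(grand_total)  # 7.50
```

Read el.text before calling el.clear(): once cleared, the element's text is gone.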
3. Validate against XSD
schema = etree.XMLSchema(etree.parse("invoice.xsd")) # (1)!
doc = etree.parse("invoice.xml")
if not schema.validate(doc):
    for err in schema.error_log:                     # (2)!
        print(f"line {err.line}: {err.message}")
1. XMLSchema resolves xs:import/xs:include relative to the schema file automatically. lxml also has etree.Schematron and etree.RelaxNG for the Schematron and RELAX NG layers.
2. error_log holds every failure with line numbers: the first layer of the validation pipeline, in three lines.
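For a self-contained run of the same pattern, the sketch below uses an inline schema and two inline documents. The schema is a deliberately tiny assumption (a single decimal total with a required currency attribute) and is not the real invoice.xsd.

```python
from lxml import etree

# Hypothetical miniature schema, not the real invoice.xsd.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:invoice"
           elementFormDefault="qualified">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="total">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:decimal">
                <xs:attribute name="currency" type="xs:string" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

TEMPLATE = ('<inv:invoice xmlns:inv="urn:example:invoice">'
            '<inv:total currency="EUR">{amount}</inv:total></inv:invoice>')

schema = etree.XMLSchema(etree.fromstring(XSD))
good = etree.fromstring(TEMPLATE.format(amount="100.00"))
bad = etree.fromstring(TEMPLATE.format(amount="abc"))   # not a decimal

good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# error_log reflects the most recent validation run.
errors = [(err.line, err.message) for err in schema.error_log]
for line, message in errors:
    print(f"line {line}: {message}")
```

If an exception is more convenient than a boolean, schema.assertValid(doc) raises etree.DocumentInvalid on failure.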
4. Transform: XSLT
libxslt is XSLT 1.0. For 1.0 work, lxml is excellent and — the key habit — lets you compile the stylesheet once and reuse it:
transform = etree.XSLT(etree.parse("to-fo.xsl")) # compile once
result = transform(etree.parse("invoice.xml")) # run per input
result.write("invoice.fo") # -> XSL-FO, then Apache FOP
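The compile-once habit is easier to see with more than one input. The following sketch assumes a tiny text-output stylesheet and two inline invoices, none of which are the real to-fo.xsl or invoice.xml.

```python
from lxml import etree

# Hypothetical stylesheet: prints "<currency> <amount>" as plain text.
XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:i="urn:example:invoice">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/i:invoice/i:total/@currency"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="/i:invoice/i:total"/>
  </xsl:template>
</xsl:stylesheet>"""

TEMPLATE = ('<inv:invoice xmlns:inv="urn:example:invoice">'
            '<inv:total currency="{ccy}">{amount}</inv:total></inv:invoice>')
inputs = [TEMPLATE.format(ccy="EUR", amount="100.00"),
          TEMPLATE.format(ccy="USD", amount="42.50")]

transform = etree.XSLT(etree.fromstring(XSL))    # compile once

# Reuse the compiled transform for every input document.
lines = [str(transform(etree.fromstring(xml))) for xml in inputs]
for line in lines:
    print(line)
```

Calling str() on the result serializes it according to the stylesheet's xsl:output, which is why the text output method works here.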
Need XSLT 2.0 / 3.0 in Python?
libxslt stops at 1.0. For the modern XSLT this site
teaches (grouping, xsl:function, JSON), call Saxon: the
saxonche (SaxonC-HE) Python package runs XSLT 3.0 from Python.
5. Data binding
Python has no built-in XML data binding, but xsdata (and the older
generateDS) generate dataclasses from an XSD:
from xsdata.formats.dataclass.parsers import XmlParser
from model import Invoice
inv = XmlParser().parse("invoice.xml", Invoice) # XML -> dataclass
print(inv.total) # 100.00
For loosely-structured or extension-heavy documents, though, the
lxml element/XPath API above is usually the better fit than binding.
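When generated bindings are more than a document needs, a hand-written mapping from XPath results into a small dataclass is often enough. The sketch below is one possible shape, not an xsdata feature: InvoiceSummary, summarize and the inline document are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal

from lxml import etree

NS = {"i": "urn:example:invoice", "p": "urn:example:party"}

# Hypothetical stand-in for invoice.xml; the seller wrapper is assumed.
XML = """<inv:invoice xmlns:inv="urn:example:invoice"
             xmlns:pty="urn:example:party">
  <pty:seller><pty:name>Acme Records</pty:name></pty:seller>
  <inv:total currency="EUR">100.00</inv:total>
</inv:invoice>"""


@dataclass
class InvoiceSummary:
    """Only the fields this (hypothetical) consumer cares about."""
    seller: str
    total: Decimal
    currency: str


def summarize(root):
    """Pull a few values out with XPath and ignore everything else."""
    return InvoiceSummary(
        seller=str(root.xpath("string(//p:name)", namespaces=NS)),
        total=Decimal(root.xpath("string(/i:invoice/i:total)", namespaces=NS)),
        currency=str(root.xpath("string(/i:invoice/i:total/@currency)", namespaces=NS)),
    )


summary = summarize(etree.fromstring(XML))
print(summary)
```

Unknown or extension elements simply never show up in the summary, which is the flexibility the paragraph above is pointing at.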
Python cheat-sheet
| Task | API |
|---|---|
| Tree parse | etree.parse (lxml, or stdlib ElementTree) |
| Names | {uri}local (James Clark notation) |
| Streaming | etree.iterparse (+ clear() to free memory) |
| XPath | .xpath(expr, namespaces={...}) (lxml only) |
| Validate | etree.XMLSchema (also Schematron, RelaxNG) |
| XSLT 1.0 | etree.XSLT |
| XSLT 2.0 / 3.0 | Saxon (saxonche) |
| Binding | xsdata / generateDS |