Concept · The ContentDocument AST

One typed document tree. Every source format.

A platform that has to read PDFs, Word documents, spreadsheets, HTML pages, and EDGAR filings cannot afford a different output shape for each source. KAOS uses one typed document tree — a Python data model called ContentDocument — that every extractor produces and every retriever, language-model program, and citation verifier reads. The grammar is borrowed from Pandoc, the source-location metadata from Docling, and the runtime types from Pydantic v2.

The grammar

The grammar is the same one Pandoc has used since 2006. A document is a list of block-level elements (paragraphs, headings, quotations, code blocks, lists, tables, figures), and text-bearing blocks contain inline elements: runs of text, emphasis, code spans, links, images, footnote references. The content rules are enforced at the type level: a block holds either other blocks or inlines, never a mix of the two at the same level. Generic containers cover anything the closed grammar does not already name.
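The block/inline split can be sketched in a few lines. This is an illustration only: it uses frozen stdlib dataclasses rather than Pydantic so it runs without dependencies, and the class names (Text, Emphasis, Link, Paragraph, BlockQuote) and the plain_text helper are hypothetical, not the ContentDocument API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple, Union


# Inline elements: text runs and the spans that wrap them.
@dataclass(frozen=True)
class Text:
    value: str


@dataclass(frozen=True)
class Emphasis:
    children: Tuple[Inline, ...]


@dataclass(frozen=True)
class Link:
    target: str
    children: Tuple[Inline, ...]


Inline = Union[Text, Emphasis, Link]


# Block elements: each one holds inlines only, or blocks only.
@dataclass(frozen=True)
class Paragraph:
    inlines: Tuple[Inline, ...]

    def __post_init__(self) -> None:
        for child in self.inlines:
            if not isinstance(child, (Text, Emphasis, Link)):
                raise TypeError(f"Paragraph takes inlines, got {type(child).__name__}")


@dataclass(frozen=True)
class BlockQuote:
    blocks: Tuple[Block, ...]

    def __post_init__(self) -> None:
        for child in self.blocks:
            if not isinstance(child, (Paragraph, BlockQuote)):
                raise TypeError(f"BlockQuote takes blocks, got {type(child).__name__}")


Block = Union[Paragraph, BlockQuote]


def plain_text(node: Union[Inline, Block]) -> str:
    """Flatten any node to plain text by walking its children."""
    if isinstance(node, Text):
        return node.value
    if isinstance(node, (Emphasis, Link)):
        return "".join(plain_text(child) for child in node.children)
    if isinstance(node, Paragraph):
        return "".join(plain_text(child) for child in node.inlines)
    return "\n".join(plain_text(block) for block in node.blocks)


para = Paragraph((Text("Either party may "), Emphasis((Text("terminate"),)), Text(" on notice.")))
quote = BlockQuote((para,))
print(plain_text(quote))
```

Putting a Paragraph inside another Paragraph raises TypeError in this sketch, which is the runtime analogue of the content rule described above.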

The result is an immutable Pydantic data model. It round-trips losslessly to JSON, serializes cleanly to Markdown, HTML, or plain text, and survives chunking, search, and language-model extraction without losing the link back to the page and offset where each piece of text originated.
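As a rough illustration of what immutable and lossless round trip mean in practice, the sketch below tags each block with its type when writing JSON and rebuilds the same objects when reading it back. It again assumes frozen stdlib dataclasses in place of Pydantic, and Heading, Paragraph, Document, to_json, from_json, and to_markdown are hypothetical names chosen for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Heading:
    level: int
    text: str


@dataclass(frozen=True)
class Paragraph:
    text: str


@dataclass(frozen=True)
class Document:
    blocks: Tuple[Union[Heading, Paragraph], ...]


# Registry used to rebuild the right class from the "type" tag.
_TYPES = {"Heading": Heading, "Paragraph": Paragraph}


def to_json(doc: Document) -> str:
    """Serialize each block with an explicit type tag."""
    payload = [{"type": type(b).__name__, **asdict(b)} for b in doc.blocks]
    return json.dumps({"blocks": payload}, sort_keys=True)


def from_json(raw: str) -> Document:
    """Rebuild the typed tree from the tagged JSON."""
    blocks = []
    for item in json.loads(raw)["blocks"]:
        fields = dict(item)
        cls = _TYPES[fields.pop("type")]
        blocks.append(cls(**fields))
    return Document(tuple(blocks))


def to_markdown(doc: Document) -> str:
    """Render headings as ATX headings and paragraphs as plain lines."""
    lines = []
    for block in doc.blocks:
        if isinstance(block, Heading):
            lines.append("#" * block.level + " " + block.text)
        else:
            lines.append(block.text)
    return "\n\n".join(lines)


doc = Document((Heading(2, "Termination"), Paragraph("Either party may terminate on notice.")))
restored = from_json(to_json(doc))
print(restored == doc)
print(to_markdown(doc))
```

Because the dataclasses are frozen, assigning to doc.blocks after construction raises an error; any change has to produce a new tree.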

From source file to a grounded answer

The document tree is the carrier. Source-location metadata rides on every element from the moment a parser produces it through to the language-model output that points back at a specific page and bounding box.

[Figure: the pipeline in five stages. 01 Parse: PDF parser, Office-doc parser, web-page parser. 02 Annotate: source location, annotation layer, stable identifiers. 03 View: pages and sections, flat paragraph list, outline. 04 Search: BM25 index, pointer to element, page + bounding box. 05 Cite: cited answer, citation verifier. Outcome: a grounded answer.]

Source location on every element

Every paragraph, heading, table cell, and inline run carries a typed source-location record: which document it came from, which page it sat on for paginated formats, the bounding box where it rendered, the character offsets in the original source string, the extractor's confidence score, and which extractor — name and version — produced it.
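A source-location record with the fields listed above might look roughly like this. The field names, the BoundingBox type, and the assumption that confidence falls between 0 and 1 are all illustrative guesses; the actual KAOS record may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class SourceLocation:
    document_id: str
    extractor_name: str
    extractor_version: str
    char_start: int  # offset into the original source string
    char_end: int    # exclusive end offset
    confidence: float
    page: Optional[int] = None          # only set for paginated formats
    bbox: Optional[BoundingBox] = None  # only set when the source has layout

    def __post_init__(self) -> None:
        # Assumption for this sketch: confidence is a value in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.char_start > self.char_end:
            raise ValueError("char_start must not exceed char_end")

    def source_text(self, source: str) -> str:
        """Return the slice of the original source this record points at."""
        return source[self.char_start:self.char_end]


source = "12.1 Termination. Either party may terminate on notice."
loc = SourceLocation(
    document_id="msa-acme-2024",
    extractor_name="example-pdf-extractor",  # hypothetical extractor name
    extractor_version="0.0.0",               # hypothetical version
    char_start=5,
    char_end=16,
    confidence=0.97,
    page=14,
    bbox=BoundingBox(72.0, 540.0, 310.0, 556.0),
)
print(loc.source_text(source), "on page", loc.page)
```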

Every element also carries a stable identifier and a JSON-pointer reference to its position in the tree, so a downstream consumer can refer back to one specific paragraph of one specific document without re-parsing the source.
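To make the identifier-plus-pointer idea concrete, here is a small sketch that resolves a JSON pointer (RFC 6901 syntax) against a JSON-shaped tree and builds an index from element identifiers to pointers. The tree layout, the "id" key, and both function names are assumptions made for the example.

```python
from typing import Any, Dict


def resolve_pointer(tree: Any, pointer: str) -> Any:
    """Follow an RFC 6901 JSON pointer through nested dicts and lists."""
    if pointer == "":
        return tree
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON pointer must start with '/'")
    node = tree
    for raw in pointer[1:].split("/"):
        # Unescape in the order the RFC requires: ~1 first, then ~0.
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def index_by_id(tree: Any, pointer: str = "") -> Dict[str, str]:
    """Map every element's "id" to the pointer of its position in the tree."""
    found: Dict[str, str] = {}
    if isinstance(tree, dict):
        if "id" in tree:
            found[tree["id"]] = pointer
        for key, value in tree.items():
            escaped = key.replace("~", "~0").replace("/", "~1")
            found.update(index_by_id(value, f"{pointer}/{escaped}"))
    elif isinstance(tree, list):
        for position, value in enumerate(tree):
            found.update(index_by_id(value, f"{pointer}/{position}"))
    return found


document = {
    "id": "doc-1",
    "blocks": [
        {"id": "h-1", "type": "heading", "text": "Termination"},
        {"id": "p-7", "type": "paragraph", "text": "Either party may terminate on notice."},
    ],
}
pointers = index_by_id(document)
print(pointers["p-7"])
print(resolve_pointer(document, pointers["p-7"])["text"])
```

The identifier stays the same if the tree is reordered, while the pointer gives direct access to the element's current position; carrying both lets a consumer use whichever suits the lookup.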

A separate annotation layer sits alongside the tree for things text alone can't carry: redactions, defined terms, cross-references, external citations, amendments, named entities, tracked changes, and the bidirectional anchors that link an extracted spreadsheet cell back to the source span that justified it. Annotations target elements by identifier and may carry a character-offset range when they apply to a substring rather than a whole element.
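A minimal sketch of such a layer, assuming annotations are stored apart from the tree and keyed by element identifier. The Annotation fields, the kind strings, and the redact helper are hypothetical, and the helper assumes redaction ranges on one element do not overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    kind: str                    # e.g. "redaction", "named_entity", "amendment"
    target_id: str               # identifier of the element it applies to
    start: Optional[int] = None  # character range within the element text,
    end: Optional[int] = None    # or None when it covers the whole element

    def covered_text(self, element_text: str) -> str:
        """Return the substring the annotation applies to."""
        if self.start is None or self.end is None:
            return element_text
        return element_text[self.start:self.end]


def redact(element_id: str, text: str, annotations: List[Annotation]) -> str:
    """Mask every redacted range on one element (ranges assumed disjoint)."""
    redactions = [a for a in annotations if a.kind == "redaction" and a.target_id == element_id]
    if any(a.start is None or a.end is None for a in redactions):
        return "[REDACTED]"  # a whole-element redaction masks everything
    # Work from the end of the string so earlier offsets stay valid.
    for a in sorted(redactions, key=lambda a: a.start, reverse=True):
        text = text[:a.start] + "[REDACTED]" + text[a.end:]
    return text


elements = {"p-7": "Acme Corp may terminate on notice."}
annotations = [
    Annotation("named_entity", "p-7", 0, 9),
    Annotation("redaction", "p-7", 0, 9),
    Annotation("amendment", "p-7"),
]
print(annotations[0].covered_text(elements["p-7"]))
print(redact("p-7", elements["p-7"], annotations))
```

Keeping annotations outside the tree means the element text is never rewritten; a redacted view is derived on demand.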

What it looks like in code

Parse a signed MSA from disk, walk the typed tree to find every paragraph that mentions termination, then search the document for the change-of-control language a partner asked about. Each hit carries a citation anchor that points at one specific paragraph of one specific page.

from kaos_pdf import parse_pdf
from kaos_content import DocumentView, search_document

# Parse the PDF into a ContentDocument tree.
doc = parse_pdf("deal-room/msa-acme-2024.pdf")

# Walk the paragraph view and keep the paragraphs that mention termination.
view = DocumentView(doc)
termination = [p for p in view.paragraphs if "termination" in p.text.lower()]
print(len(termination), "termination paragraphs found")

# Search the document; check for an empty result list before indexing into it.
hits = search_document(doc, "change of control").results
if hits:
    top = hits[0]
    print(f"page {top.element.provenance.page}, score {top.score:.3f}")
    print(f"citation anchor: {top.element.block_ref}")
    print(top.element.text[:160])
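This page names a BM25 index in the Search stage. For readers unfamiliar with the ranking, the sketch below is textbook Okapi BM25 over paragraphs keyed by a pointer string. It is not the implementation behind search_document; the function names, the tokenizer, and the k1 and b defaults are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(
    paragraphs: Dict[str, str], query: str, k1: float = 1.5, b: float = 0.75
) -> List[Tuple[str, float]]:
    """Rank paragraphs by Okapi BM25; return (pointer, score) pairs, best first."""
    docs = {pointer: tokenize(text) for pointer, text in paragraphs.items()}
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n

    # Document frequency: in how many paragraphs each term appears.
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for pointer, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scored.append((pointer, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


paragraphs = {
    "/blocks/3": "Either party may terminate this Agreement on thirty days notice.",
    "/blocks/9": "A change of control of the Supplier requires the Customer's prior written consent.",
    "/blocks/12": "Fees are payable within thirty days of invoice.",
}
hits = bm25_search(paragraphs, "change of control")
print(hits[0][0])
```

Returning the pointer rather than the text is what lets a hit be turned back into a citation: the caller can look up the element and read its page and bounding box.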

Read next

The provenance concept shows how source-location metadata carries through retrieval, the language-model call, and citation verification. The extraction page covers the parsers that produce the document tree; the runtime page covers the registry it lives in.

On learn-kaos: one-document-model · build-a-document.