Surface · Search & Retrieval

Fast keyword search and meaning-based search, behind one interface.

Two ways to find a passage in a legal corpus. Keyword search (BM25) is fast, exact, and explainable. Meaning-based search (embeddings) catches paraphrase and concept matches. Both speak the same Python interface, so you can run them alone, mix them, and add a re-ranker on top — without rewriting the calling code.

pip install kaos-nlp-core kaos-nlp-transformers

A measured benchmark, not a marketing number

A platform that runs over the US Code, EDGAR filings, and PACER dockets cannot afford a pure-Python search loop. kaos-nlp-core puts the inner loops in Rust (with SIMD-accelerated string ops and rayon-parallelized scoring), hands typed results back to Python, and keeps text offsets honest by converting byte offsets to character offsets at the language boundary.
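To illustrate why the byte-to-character conversion matters, here is a minimal pure-Python sketch of the idea. It is not kaos-nlp-core's actual boundary code, which runs on the Rust side. The function name and the sample text are assumptions made for illustration.

```python
def byte_spans_to_char_spans(text, byte_spans):
    """Translate UTF-8 byte spans (as a Rust core would report them)
    into character spans usable for Python string slicing."""
    # Map each byte position where a character starts to that character's index.
    char_index_at_byte = {}
    byte_pos = 0
    for char_index, ch in enumerate(text):
        char_index_at_byte[byte_pos] = char_index
        byte_pos += len(ch.encode("utf-8"))
    char_index_at_byte[byte_pos] = len(text)  # position just past the last character

    # A span that starts or ends inside a multi-byte character raises KeyError.
    return [(char_index_at_byte[s], char_index_at_byte[e]) for s, e in byte_spans]


text = "§ 1983 civil action"
# "§" occupies two bytes in UTF-8, so byte and character offsets diverge after it.
start = text.encode("utf-8").find(b"1983")
span = byte_spans_to_char_spans(text, [(start, start + 4)])[0]
matched = text[span[0]:span[1]]
```

Slicing the Python string with the raw byte offsets would return the wrong substring here, which is the kind of drift the conversion at the language boundary prevents.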

The headline number comes from a published, verified benchmark: typical 2-term BM25 queries complete in under 600 microseconds over 69,000 US Code documents, or about 1,600 queries per second on a single process. The benchmarks are published in the kaos-nlp-core repository. There is no Python interpreter contention and no per-call overhead in the hot loop, and the indexes pickle cleanly, so multi-process workflows scale.
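For orientation on what each of those queries computes, here is textbook Okapi BM25 in plain Python. It illustrates the scoring formula only and is not kaos-nlp-core's Rust implementation. The parameter values (k1 = 1.5, b = 0.75), the IDF variant, and the toy documents are common defaults and examples assumed here, not taken from the library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each query term appear?
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["work", "product", "doctrine"],
    ["attorney", "client", "privilege"],
    ["work", "schedule"],
]
scores = bm25_scores(["work", "product"], docs)
```

The document matching both terms scores highest, the one matching only "work" comes second, and the one matching neither scores zero.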

The dense half lives in kaos-nlp-transformers. Load a model, call .embed(texts), get back a NumPy array. The default is BAAI/bge-small-en-v1.5 (MIT, 384 dimensions, pinned to a specific commit). The model registry is license-audited and refuses to load non-commercial or ambiguously-licensed families, so a model someone added in a notebook does not become a license issue at deployment time.
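Once texts are embedded, meaning-based search reduces to comparing vectors. The sketch below ranks documents by cosine similarity to a query vector. Plain Python lists of three numbers stand in for the 384-dimensional NumPy rows so that it runs without dependencies. The function names and toy vectors are assumptions for illustration, not part of kaos-nlp-transformers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, sim) for sim, i in scored[:top_k]]


# Toy vectors; real ones would come from the model's .embed(texts) call.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # nearly parallel to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
]
ranked = rank_by_cosine(query_vec, doc_vecs, top_k=2)
```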

Combine keyword and meaning, then re-rank

Both packages implement the same Retriever interface, so wiring keyword and dense search together is a one-liner: HybridRetriever(sparse=BM25Retriever, dense=EmbeddingRetriever). Reciprocal Rank Fusion merges the two result lists; an optional cross-encoder re-ranker on top tightens the head of the list.
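As a rough illustration of how a shared interface and Reciprocal Rank Fusion fit together, here is a self-contained sketch. The `Retriever` protocol signature, the `StaticRetriever` and `ToyHybrid` classes, and the constant k = 60 are assumptions made for this example. The real KAOS interface and fusion parameters may differ.

```python
from typing import Protocol


class Retriever(Protocol):
    """Hypothetical shared interface: ranked document IDs for a query."""

    def search(self, query: str, top_k: int) -> list: ...


class StaticRetriever:
    """Toy stand-in for a BM25 or embedding retriever with a fixed ranking."""

    def __init__(self, ranked_ids):
        self.ranked_ids = ranked_ids

    def search(self, query, top_k):
        return self.ranked_ids[:top_k]


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


class ToyHybrid:
    """Queries both retrievers and fuses their ranked lists with RRF."""

    def __init__(self, sparse: Retriever, dense: Retriever):
        self.sparse = sparse
        self.dense = dense

    def search(self, query, top_k):
        fused = rrf_fuse([
            self.sparse.search(query, top_k),
            self.dense.search(query, top_k),
        ])
        return [doc_id for doc_id, _ in fused[:top_k]]


hybrid = ToyHybrid(
    sparse=StaticRetriever(["A", "B", "C"]),
    dense=StaticRetriever(["B", "D", "A"]),
)
results = hybrid.search("work product", top_k=3)
```

Because RRF uses only ranks, the BM25 scores and the embedding similarities never need to be put on a common scale. A document ranked well by both lists ("B" here) rises above one ranked first by only a single list.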

Pipeline diagram: BM25 (kaos-nlp-core, <600 µs per 2-term query) and Dense (kaos-nlp-tx, bge-small on CPU) feed Fuse (Hybrid retriever, RRF), followed by Rerank (CrossEncoder, [torch] extra).

A taste

Picture a privilege reviewer searching a 30,000-document production for the controlling-precedent passage on attorney work product. A two-term BM25 query over 69K US Code documents runs in under 600 µs; the same loop runs over a discovery production with the production's Bates IDs preserved.

from kaos_nlp_core.documents import DocumentCollection
from kaos_nlp_core.search import Searcher

# iter_records is not defined in this snippet; it stands for your own
# generator that yields one record per document in the production.
# Each record carries a Bates ID through to the result via external_id.
docs = DocumentCollection.from_records(
    iter_records("production-001/"),
    id_field="doc_id",
    text_field="ocr_text",
    external_id_field="bates",
)
searcher = Searcher.from_collection(docs)

# Hickman v. Taylor work-product test, against the production.
hits = searcher.search("work product anticipation of litigation", top_k=10)
for hit in hits:
    print(f"{hit.score:.2f} {hit.external_id} {hit.text[:100]}")

How it compares

vs. LangChain retrievers, llama-index, raw FAISS. These Python-side libraries layer abstractions over a Python BM25 loop or a vector DB. KAOS puts the BM25 inner loop in Rust (SIMD, rayon-parallelized) and ships a typed Retriever protocol, so hybrid composition stays readable. The verified performance number requires no vector DB, and the hot loop carries no per-call Python overhead.

vs. proprietary search (Westlaw, Lexis, Harvey Vault). These platforms ship their own search over their own corpora. KAOS searches the corpus you own: no platform lock-in, no per-query licensing fee, and full Rust performance over local data.

License posture: the model registry refuses non-commercial / ambiguous-license families like jinaai/jina-embeddings-v3, nvidia/NV-Embed-v1/v2, Qwen/Qwen3-Embedding-*. BAAI/bge-small-en-v1.5 (MIT) is the default; BAAI/bge-reranker-base is the reranker default. Both are pinned to specific commits.
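To make the refusal behaviour concrete, here is a minimal sketch of a license gate built from the families named above. It is an illustration only. The pattern list, the `LicenseRefused` exception, and the `check_model_allowed` function are hypothetical, and the real registry may well use an audited allowlist rather than a deny list.

```python
from fnmatch import fnmatchcase

# Families named in this section as refused; the trailing "*" covers every size variant.
REFUSED_PATTERNS = (
    "jinaai/jina-embeddings-v3",
    "nvidia/NV-Embed-v1",
    "nvidia/NV-Embed-v2",
    "Qwen/Qwen3-Embedding-*",
)


class LicenseRefused(ValueError):
    """Raised when a model ID belongs to a refused family (hypothetical)."""


def check_model_allowed(model_id):
    """Return model_id unchanged if it passes the gate, else raise LicenseRefused."""
    for pattern in REFUSED_PATTERNS:
        if fnmatchcase(model_id, pattern):
            raise LicenseRefused(f"{model_id} matches refused family {pattern!r}")
    return model_id


allowed = check_model_allowed("BAAI/bge-small-en-v1.5")

try:
    check_model_allowed("Qwen/Qwen3-Embedding-0.6B")
    refused = False
except LicenseRefused:
    refused = True
```

Failing at load time, before any weights are fetched, is what keeps a notebook experiment from turning into a deployment-time license problem.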

See /compare.

Get started

See the quickstart, browse all 18 packages, or read the docs.