Surface · Search & Retrieval

Fast keyword search and meaning-based search, behind one interface.

There are two ways to find a passage in a legal corpus. Keyword search (BM25) is fast, exact, and explainable. Meaning-based search (embeddings) catches paraphrases and concept matches. Both expose the same Python interface, so you can run either alone, combine them, or add a re-ranker on top without rewriting the calling code.

pip install kaos-nlp-core kaos-nlp-transformers

A measured benchmark, not a marketing number

A platform that runs over the US Code, EDGAR filings, and PACER dockets cannot afford a pure-Python search loop. kaos-nlp-core puts the inner loops in Rust (with SIMD-accelerated string ops and rayon-parallelized scoring), hands typed results back to Python, and keeps text offsets honest by converting byte offsets to character offsets at the language boundary.
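To make the offset conversion concrete, here is a minimal pure-Python sketch of mapping a UTF-8 byte offset to a character offset. It is an illustration only, not the kaos-nlp-core implementation; the function names `build_char_starts` and `byte_to_char` are hypothetical.

```python
from bisect import bisect_right


def build_char_starts(text: str) -> list[int]:
    """Return the UTF-8 byte offset where each character starts, plus an end sentinel."""
    starts = []
    pos = 0
    for ch in text:
        starts.append(pos)
        pos += len(ch.encode("utf-8"))
    starts.append(pos)  # sentinel: one past the last byte
    return starts


def byte_to_char(starts: list[int], byte_offset: int) -> int:
    """Map a UTF-8 byte offset to the index of the character that contains it."""
    if not 0 <= byte_offset <= starts[-1]:
        raise ValueError("byte offset out of range")
    return bisect_right(starts, byte_offset) - 1


text = "§ 1983 claim"
starts = build_char_starts(text)

# "§" occupies two bytes in UTF-8, so byte and character offsets diverge after it.
byte_start = text.encode("utf-8").find(b"1983")
char_start = byte_to_char(starts, byte_start)
print(byte_start, char_start, text[char_start:char_start + 4])
```

Slicing the Python string with the raw byte offset would land one character too far to the right here, which is the class of bug the conversion at the language boundary avoids.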

The benchmark is measured and published in the kaos-nlp-core repository: a typical 2-term BM25 query completes in under 600 microseconds over 69,000 US Code documents, or about 1,600 queries per second on a single process. There is no Python interpreter contention and no per-call overhead in the hot loop, and the indexes pickle cleanly, so multi-process workflows scale.
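For readers who want to see what a BM25 score is, the following is a small pure-Python sketch of the standard Okapi BM25 formula. It is exactly the kind of Python loop the Rust core replaces, so treat it as a reference for the math and not as the library's code. The function name `bm25_scores`, the whitespace tokenization, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "work product prepared in anticipation of litigation".split(),
    "quarterly revenue report for the fiscal year".split(),
    "attorney work product doctrine".split(),
]
scores = bm25_scores(["work", "product"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Each score decomposes into per-term contributions, which is why keyword results are easy to explain to a reviewer.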

The dense half lives in kaos-nlp-transformers. Load a model, call .embed(texts), get back a NumPy array. The default is BAAI/bge-small-en-v1.5 (MIT, 384 dimensions, pinned to a specific commit). The model registry is license-audited and refuses to load non-commercial or ambiguously-licensed families, so a model someone added in a notebook does not become a license issue at deployment time.
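The license gate can be pictured as an allowlist check that runs before any weights are loaded. The sketch below is an assumption about the general shape of such a check, not the kaos-nlp-transformers registry: `ModelEntry`, `resolve`, `LicenseRefused`, the allowlist contents, the `example/research-only-model` entry, and the `<pinned-commit>` placeholders are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist of SPDX identifiers treated as safe for commercial use.
PERMISSIVE_LICENSES = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"})


class LicenseRefused(Exception):
    """Raised when a model is unaudited or its license is not allowlisted."""


@dataclass(frozen=True)
class ModelEntry:
    name: str
    license: Optional[str]  # None models an unknown or ambiguous license
    revision: str           # pinned commit (placeholder values below)


def resolve(registry: dict, name: str) -> ModelEntry:
    """Return the audited entry for a model, or refuse before anything is loaded."""
    entry = registry.get(name)
    if entry is None:
        raise LicenseRefused(f"{name}: not in the audited registry")
    if entry.license not in PERMISSIVE_LICENSES:
        raise LicenseRefused(f"{name}: license {entry.license!r} is not allowlisted")
    return entry


registry = {
    "BAAI/bge-small-en-v1.5": ModelEntry(
        "BAAI/bge-small-en-v1.5", "MIT", "<pinned-commit>"),
    # Made-up entry standing in for a non-commercial model family.
    "example/research-only-model": ModelEntry(
        "example/research-only-model", "CC-BY-NC-4.0", "<pinned-commit>"),
}

default_model = resolve(registry, "BAAI/bge-small-en-v1.5")

try:
    resolve(registry, "example/research-only-model")
    refused = False
except LicenseRefused:
    refused = True
```

Failing closed on unknown names is the design point: a model that was never audited is treated the same way as one with a restrictive license.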

Combine keyword and meaning, then re-rank

Both packages implement the same Retriever interface, so wiring keyword and dense search together is a one-liner: HybridRetriever(sparse=BM25Retriever, dense=EmbeddingRetriever). Reciprocal Rank Fusion merges the two result lists; an optional cross-encoder re-ranker on top tightens the head of the list.
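Reciprocal Rank Fusion itself is short enough to sketch in plain Python. The code below shows the standard formula, in which each document earns 1 / (k + rank) from every list it appears in. The `Retriever` protocol shape, `StaticRetriever`, `rrf_fuse`, `hybrid_search`, and the constant `k=60` are assumptions made for this illustration and are not taken from the KAOS API.

```python
from collections import defaultdict
from typing import Protocol


class Retriever(Protocol):
    """Hypothetical shared interface: return document IDs, best first."""

    def retrieve(self, query: str, top_k: int) -> list[str]: ...


class StaticRetriever:
    """Stand-in retriever that returns a fixed ranking, used only for the demo."""

    def __init__(self, ranking: list[str]) -> None:
        self._ranking = ranking

    def retrieve(self, query: str, top_k: int) -> list[str]:
        return self._ranking[:top_k]


def rrf_fuse(rankings: list[list[str]], k: int = 60,
             top_k: int = 10) -> list[tuple[str, float]]:
    """Merge ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def hybrid_search(sparse: Retriever, dense: Retriever, query: str,
                  top_k: int = 10) -> list[tuple[str, float]]:
    """Query both retrievers and fuse their rankings."""
    return rrf_fuse(
        [sparse.retrieve(query, top_k), dense.retrieve(query, top_k)],
        top_k=top_k,
    )


sparse = StaticRetriever(["doc-a", "doc-b", "doc-c"])
dense = StaticRetriever(["doc-b", "doc-d", "doc-a"])
fused = hybrid_search(sparse, dense, "work product")
print([doc_id for doc_id, _ in fused])
```

Because the fusion uses ranks and ignores raw scores, BM25 scores and cosine similarities never need to be put on a common scale.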

A taste

Consider a privilege reviewer searching a 30,000-document production for the controlling-precedent passage on attorney work product. Two-term BM25 over 69K US Code documents runs in under 600 µs; the same search loop runs over a discovery production with each document's Bates ID preserved.

from kaos_nlp_core.documents import DocumentCollection
from kaos_nlp_core.search import Searcher

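# iter_records is not defined in this snippet; supply your own loader that
# yields records with doc_id, ocr_text, and bates fields.
# Each record carries a Bates ID through to the result via external_id.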
docs = DocumentCollection.from_records(
    iter_records("production-001/"),
    id_field="doc_id",
    text_field="ocr_text",
    external_id_field="bates",
)
searcher = Searcher.from_collection(docs)

# Hickman v. Taylor work-product test, against the production.
hits = searcher.search("work product anticipation of litigation", top_k=10)
for hit in hits:
    print(f"{hit.score:.2f}  {hit.external_id}  {hit.text[:100]}")

Packages in this group

kaos-nlp-core

10 MCP tools. Rust + PyO3 — BM25 + TF-IDF, Punkt, Aho-Corasick, MinHash, CTPH, distance metrics.

kaos-nlp-transformers

Dense embeddings via fastembed (CPU) or sentence-transformers (GPU). License-audited model registry.

How it compares

vs. LangChain retrievers, llama-index, raw FAISS. The Python-side libraries layer abstractions over a Python BM25 loop or a vector DB. KAOS puts the BM25 inner loop in Rust (SIMD, rayon-parallelized) and ships a typed Retriever protocol, so hybrid composition stays readable. The verified performance number requires no vector DB and carries no per-call Python overhead.

vs. proprietary search (Westlaw, Lexis, Harvey). Those platforms search their own corpora. KAOS runs over the corpus you own and host, with no per-query licensing and full Rust performance over local data. It is a different model, not a replacement for licensed primary-law content.

License posture: the model registry refuses non-commercial / ambiguous-license families like jinaai/jina-embeddings-v3, nvidia/NV-Embed-v1/v2, Qwen/Qwen3-Embedding-*. BAAI/bge-small-en-v1.5 (MIT) is the default; BAAI/bge-reranker-base is the reranker default. Both are pinned to specific commits.

See /compare.

Get started

See the quickstart, browse all 18 packages, or read the docs.