Chat over Documents (RAG)

Approach A: stuff the whole corpus

For small-to-medium corpora (internal handbooks, product FAQs, a few PDFs), the simplest solution is to put the entire corpus in the system prompt, provided it fits in the model's context window. No retrieval, no vector store, no embedding model.

from openai import OpenAI

client = OpenAI(base_url="https://api.aiand.com/v1", api_key="sk-...")

with open("handbook.md", encoding="utf-8") as f:
    handbook = f.read()

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="qwen/qwen3.6-27b",  # 262K context
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer based only on the handbook below. "
                    "If the answer isn't in it, say so.\n\n"
                    f"---\n{handbook}\n---"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What's the policy on remote work?"))

Pick a 262K-context model from the Catalog — qwen/qwen3.6-27b is free for prototyping.
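
Before committing to this approach, it helps to check that the corpus actually fits. The following is a minimal sketch, assuming roughly 4 characters per token for English text (a rule of thumb, not the model's real tokenizer) and treating the 262K window as 262,000 tokens. The names `estimate_tokens` and `fits_in_context` and the 8,000-token reserve for the question and the reply are illustrative choices, not part of any API.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    text: str,
    context_tokens: int = 262_000,  # assumed size of a "262K" window
    reserve_tokens: int = 8_000,    # assumed headroom for question + answer
) -> bool:
    """True if the estimated prompt size leaves the reserved headroom free."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens


small_corpus = "word " * 10_000        # 50,000 characters
large_corpus = "word " * 400_000       # 2,000,000 characters

small_ok = fits_in_context(small_corpus)
large_ok = fits_in_context(large_corpus)
```

If the check fails, or passes only narrowly, retrieval (Approach B) is likely the safer choice. For an exact count, use the tokenizer that matches the model.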

Approach B: retrieval with local embeddings

When the corpus is too large to fit in the model's context window, compute embeddings yourself and retrieve only the top-K most similar chunks at query time. The example below runs embeddings locally with sentence-transformers:

pip install openai sentence-transformers numpy

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI(base_url="https://api.aiand.com/v1", api_key="sk-...")
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

CHUNKS = [
    "Refunds are processed within 5 business days of approval.",
    "All employees are entitled to 20 days of paid annual leave.",
    "Expense reports must be submitted within 30 days of the expense.",
    "Remote work is supported for engineering and design roles.",
    # ... thousands more in practice
]
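# Normalized embeddings have unit length, so a dot product equals cosine similarity.
embeds = embedder.encode(CHUNKS, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeds @ q  # one cosine score per chunk
    return [CHUNKS[i] for i in np.argsort(-scores)[:k]]  # sort descending, keep top k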

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="qwen/qwen3.6-27b",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer based only on the snippets below. "
                    "If the answer isn't in them, say so.\n\n"
                    f"{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What's the policy on remote work?"))
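
The ranking step relies on the embeddings being normalized. The toy example below is a sketch with hand-picked 2-D vectors rather than real embeddings, and `normalize` and `top_k` are hypothetical helper names. It shows how a raw dot product can favor a long vector, while the normalized version ranks by direction only.

```python
import numpy as np


def normalize(rows: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


def top_k(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int) -> list[int]:
    """Indices of the k highest dot-product scores, best first."""
    scores = doc_vecs @ query_vec
    return [int(i) for i in np.argsort(-scores)[:k]]


# Three toy "document" vectors and one query vector.
docs = np.array([
    [3.0, 0.0],   # long vector, 45 degrees away from the query
    [1.0, 1.0],   # same direction as the query
    [-1.0, 0.5],  # points mostly away from the query
])
query = np.array([1.0, 1.0])

# Raw dot products are [3.0, 2.0, -0.5]: the long vector wins.
raw_order = top_k(docs, query, k=3)

# After normalization the scores are cosines: the aligned vector wins.
cos_order = top_k(normalize(docs), normalize(query[None, :])[0], k=3)
```

Here `raw_order` puts document 0 first, while `cos_order` puts document 1 first. This is why both the chunks and the query are encoded with `normalize_embeddings=True`.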

Scaling up

When you outgrow a numpy array:

Vector store: swap in pgvector, Pinecone, Qdrant, or Weaviate.
Chunking: 200–500 tokens per chunk with some overlap. Don’t embed entire pages.
Re-ranking: pull top-K with dense retrieval, then re-rank with a cross-encoder or an LLM. This is often one of the highest-leverage changes for answer accuracy.
Citations: ask the model to quote chunk IDs alongside the answer. Use Structured Outputs for reliability.
Embeddings provider: if running embeddings locally isn’t a fit, call any external embeddings API from the same script.
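
The chunking and citation items above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it counts whitespace-separated words as a stand-in for tokens (an assumption, since real token counts depend on the tokenizer), and the `chunk_words` and `label_chunks` helpers and the `[chunk-N]` ID format are hypothetical.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words.

    Consecutive windows share `overlap` words, so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def label_chunks(chunks: list[str]) -> str:
    """Prefix each chunk with a bracketed ID the model can cite."""
    return "\n\n".join(f"[chunk-{i}] {c}" for i, c in enumerate(chunks))


# Tiny demo: 10 "words", windows of 4 with 1 word of overlap.
demo_text = " ".join(str(n) for n in range(10))
demo_chunks = chunk_words(demo_text, size=4, overlap=1)
demo_context = label_chunks(demo_chunks)
```

The labeled string can replace the plain `context` in the system prompt, with an added instruction to cite the IDs it relied on. In a real pipeline the IDs would usually be stable identifiers stored with each chunk, not list positions.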