Chat over Documents (RAG)

For small-to-medium corpora — internal handbooks, product FAQs, a few PDFs — the simplest solution is to put everything in the system prompt. No retrieval, no vector store, no embedding model.

from openai import OpenAI

client = OpenAI(base_url="https://api.aiand.com/v1", api_key="sk-...")

with open("handbook.md") as f:
    handbook = f.read()

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="qwen/qwen3.5-27b",  # 262K context
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer based only on the handbook below. "
                    "If the answer isn't in it, say so.\n\n"
                    f"---\n{handbook}\n---"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What's the policy on remote work?"))

Pick a 262K-context model from the Catalog; qwen/qwen3.5-9b is free for prototyping.
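
Before relying on this approach, it can help to check that the document plausibly fits in the context window. The sketch below is a rough estimate only: it assumes about 4 characters per token (a common heuristic for English text, not the model's actual tokenizer), treats the 262K figure as approximate, and reserves a hypothetical amount of headroom for the question and the answer. The names `estimate_tokens` and `fits_in_context` are illustrative, not part of any API.

```python
CONTEXT_WINDOW = 262_000   # approximate, from the "262K context" figure above
CHARS_PER_TOKEN = 4        # assumption: rough heuristic for English text
RESERVED_TOKENS = 8_000    # assumption: headroom for the question and the answer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer will give different counts."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(
    text: str,
    window: int = CONTEXT_WINDOW,
    reserved: int = RESERVED_TOKENS,
) -> bool:
    """Return True if the estimated token count leaves the reserved headroom free."""
    return estimate_tokens(text) <= window - reserved


# Example with an in-memory stand-in for the handbook text.
sample_handbook = "Remote work is supported for engineering and design roles. " * 100
if fits_in_context(sample_handbook):
    print("Small enough to put in the system prompt.")
else:
    print("Too large; consider retrieval instead.")
```

If the estimate lands close to the limit, treat the result as inconclusive and measure with the model's real tokenizer.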

Approach B: retrieval with local embeddings

When the corpus is too large to fit in the context window, compute embeddings yourself and retrieve the top-K chunks at query time. The example below runs embeddings locally with sentence-transformers:

pip install openai sentence-transformers numpy

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI(base_url="https://api.aiand.com/v1", api_key="sk-...")
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

CHUNKS = [
    "Refunds are processed within 5 business days of approval.",
    "All employees are entitled to 20 days of paid annual leave.",
    "Expense reports must be submitted within 30 days of the expense.",
    "Remote work is supported for engineering and design roles.",
    # ... thousands more in practice
]

embeds = embedder.encode(CHUNKS, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeds @ q
    return [CHUNKS[i] for i in np.argsort(-scores)[:k]]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="qwen/qwen3.5-27b",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer based only on the snippets below. "
                    "If the answer isn't in them, say so.\n\n"
                    f"{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What's the policy on remote work?"))

When you outgrow a numpy array:

  • Vector store: swap in pgvector, Pinecone, Qdrant, or Weaviate.
  • Chunking: 200–500 tokens per chunk with some overlap. Don’t embed entire pages.
  • Re-ranking: pull top-K with dense retrieval, then re-rank with a cross-encoder or an LLM. This is a high-leverage step for accuracy.
  • Citations: ask the model to quote chunk IDs alongside the answer. Use Structured Outputs for reliability.
  • Embeddings provider: if running embeddings locally isn’t a fit, call any external embeddings API from the same script.
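
Two of the items above, chunking with overlap and citing chunk IDs, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it counts whitespace-separated words as a stand-in for tokens (a real tokenizer will split text differently), and the helper names `chunk_words` and `format_context` are hypothetical.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk. Words approximate tokens here (assumption)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def format_context(chunks: list[str]) -> str:
    """Prefix each chunk with a bracketed ID so the model can cite it."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))


# Tiny demo with small numbers so the overlap is easy to see.
demo_text = " ".join(str(n) for n in range(10))
demo_chunks = chunk_words(demo_text, size=4, overlap=1)
print(format_context(demo_chunks))
```

The string returned by `format_context` can replace the plain `"\n\n".join(...)` context in `answer`, with an extra instruction in the system prompt asking the model to list the bracketed IDs it relied on.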