Document RAG (Retrieval-Augmented Generation)

This example demonstrates how to build a classic RAG system using Synalinks. RAG combines document retrieval with language model generation to answer questions based on your own documents.

How RAG Works

graph LR
    subgraph Indexing
        A[Documents] --> B[Embeddings]
        B --> C[(KnowledgeBase)]
    end
    subgraph Query Time
        D[Question] --> E[RetrieveKnowledge]
        C --> E
        E --> F[Relevant Docs]
        F --> G[Generator]
        G --> H[Answer]
    end

  1. Index: Store documents in a knowledge base along with their embeddings.
  2. Retrieve: When a question is asked, find the most relevant documents.
  3. Generate: Use the retrieved context to generate an accurate answer.
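
To make the three steps concrete, here is a minimal standard-library sketch. It does not use the Synalinks API: the bag-of-words `embed` function stands in for a real embedding model, the sample documents are made up for illustration, and `build_prompt` only assembles the text a language model would receive instead of calling one.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding' (assumption): a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# 1. Index: store each (illustrative) document next to its vector.
documents = {
    "doc1": "Documents are stored in a knowledge base with embeddings.",
    "doc2": "The retrieved context is passed to the generator.",
    "doc3": "Hybrid search combines keyword and vector search.",
}
index = {doc_id: embed(text) for doc_id, text in documents.items()}


def retrieve(question: str, k: int = 2) -> list[str]:
    """2. Retrieve: return the ids of the k most similar documents."""
    q = embed(question)
    ranked = sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, doc_ids: list[str]) -> str:
    """3. Generate: assemble the context a language model would receive."""
    context = "\n".join(f"- {documents[d]}" for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"


top = retrieve("What is hybrid search?", k=1)
prompt = build_prompt("What is hybrid search?", top)
print(prompt)
```

In the Synalinks version that follows, the knowledge base takes the place of `index`, `RetrieveKnowledge` takes the place of `retrieve`, and `Generator` produces the answer from the retrieved context.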

Creating a Document Store

import synalinks

# Simplified schema; the full example (see API References) also defines a `source` field.
class Document(synalinks.DataModel):
    id: str = synalinks.Field(description="Document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

# `embedding_model` is assumed to have been created earlier in the script.
knowledge_base = synalinks.KnowledgeBase(
    uri="duckdb://./documents.db",
    data_models=[Document],
    embedding_model=embedding_model,  # For semantic search
)

Building the RAG Pipeline

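# The calls below use `await`, so they must run inside an async function.
# `Query` and `Answer` are the data models listed under API References;
# `language_model` is assumed to have been created earlier in the script.
inputs = synalinks.Input(data_model=Query)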

# Retrieve relevant documents
retrieved = await synalinks.RetrieveKnowledge(
    knowledge_base=knowledge_base,
    language_model=language_model,
    search_type="hybrid",
    k=3,
)(inputs)

# Generate answer from retrieved context
answer = await synalinks.Generator(
    data_model=Answer,
    language_model=language_model,
    instructions="Answer based on the retrieved documents.",
)(retrieved)
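
The `search_type="hybrid"` option merges a keyword ranking with a semantic ranking. This page does not say how Synalinks fuses the two, so the sketch below uses reciprocal rank fusion (RRF), a common technique swapped in purely for illustration. The function name, the `rrf_k` constant, and the document ids are all hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (rrf_k + rank) per list it appears in;
    rrf_k dampens the influence of top ranks (60 is a conventional default).
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a keyword (BM25) search and a vector search.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused[:3])  # keep the top results, analogous to k=3 above
```

RRF only needs ranks, not raw scores, which is convenient because BM25 scores and vector similarities are on different scales.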

Key Takeaways

  • Hybrid Search: Combines keyword (BM25) and semantic (vector) search for better retrieval accuracy.
  • Chunking: Split large documents into smaller chunks so that retrieval can target the relevant passage instead of a whole document.
  • Context Window: Retrieved documents are passed as context to the LM for grounded generation.
  • Trainable: The retrieval and generation modules can be optimized using Synalinks training.
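
The chunking point can be sketched with a small word-based splitter. This is one possible approach, not a Synalinks utility: the `chunk_words` helper, its parameters, and the overlap strategy are assumptions chosen for illustration. Each chunk could then be stored as its own `Document` entry.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Illustrative input: twelve placeholder words, w0 .. w11.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Splitting on words keeps the sketch short; splitting on sentences or tokens follows the same pattern with a different unit.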

Program Visualization

[Figure: program graph of document_rag]

API References

Answer

Bases: DataModel

An answer generated from retrieved documents.

Source code in examples/13_document_rag.py
class Answer(synalinks.DataModel):
    """An answer generated from retrieved documents."""

    answer: str = synalinks.Field(
        description="The answer to the question based on retrieved documents",
    )
    sources: str = synalinks.Field(
        description="The document titles used to generate the answer",
    )

Document

Bases: DataModel

A document stored in the knowledge base.

Source code in examples/13_document_rag.py
class Document(synalinks.DataModel):
    """A document stored in the knowledge base."""

    id: str = synalinks.Field(
        description="Unique document identifier",
    )
    title: str = synalinks.Field(
        description="Document title",
    )
    content: str = synalinks.Field(
        description="The main text content of the document",
    )
    source: str = synalinks.Field(
        description="Source or category of the document",
    )

Query

Bases: DataModel

A user question.

Source code in examples/13_document_rag.py
class Query(synalinks.DataModel):
    """A user question."""

    query: str = synalinks.Field(
        description="The user's question",
    )