Knowledge Base

A Knowledge Base in Synalinks is a structured storage system that enables your LM applications to retrieve and reason over external data. Unlike simply injecting raw text into the prompt, a Knowledge Base provides semantic search capabilities, automatic chunking, and efficient retrieval - the foundation for building Retrieval-Augmented Generation (RAG) systems.

Why Knowledge Bases Matter

Language models have a knowledge cutoff and limited context windows. A Knowledge Base solves both problems:

graph LR
    subgraph Without Knowledge Base
        A[Query] --> B[LLM]
        B --> C[Hallucination Risk]
    end
    subgraph With Knowledge Base
        D[Query] --> E[Retrieve Relevant Docs]
        E --> F[LLM + Context]
        F --> G[Grounded Answer]
    end

Knowledge Bases provide:

  1. Grounded Responses: Answers based on actual data, not hallucinations
  2. Unlimited Knowledge: Store documents beyond context limits
  3. Up-to-Date Information: Add new data without retraining
  4. Source Attribution: Track where answers come from

Architecture

The Synalinks Knowledge Base is built on DuckDB, an embedded database that provides both full-text and vector indexing:

graph TD
    A[DataModels] --> B[KnowledgeBase]
    B --> C[DuckDB Storage]
    B --> D[Full-Text Index]
    B --> E[Vector Index]
    F[Search Query] --> G{Search Type}
    G -->|fulltext| D
    G -->|similarity| E
    G -->|hybrid| H[Combine Both]
    D --> I[Results]
    E --> I
    H --> I

Creating a Knowledge Base

Define DataModels for your documents, then create the Knowledge Base:

import synalinks

class Document(synalinks.DataModel):
    """A document in the knowledge base."""
    id: str = synalinks.Field(description="Unique document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

# Create the knowledge base
kb = synalinks.KnowledgeBase(
    uri="duckdb://my_database.db",    # Storage location
    data_models=[Document],            # What types to store
    embedding_model=embedding_model,   # For vector search (optional)
    metric="cosine",                   # Similarity metric
    wipe_on_start=False,               # Preserve existing data
)

Key Parameters

Parameter        Description
uri              Database connection string (e.g., duckdb://path.db)
data_models      List of DataModel classes to store
embedding_model  EmbeddingModel for vector search (optional)
metric           Similarity metric: cosine, l2, or ip
wipe_on_start    Clear database on initialization
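
The embedding_model argument in the example above is never constructed. A minimal sketch of creating one, assuming EmbeddingModel accepts a provider/model string the same way LanguageModel does later in this guide (the model name here is purely illustrative):

# Assumption: EmbeddingModel follows the same "provider/model" convention as LanguageModel
embedding_model = synalinks.EmbeddingModel(
    model="openai/text-embedding-3-small",  # illustrative model name
)

Since the embedding model is optional and used only for vector search, full-text search works without it, while similarity and hybrid search presumably require it.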

Search Methods

Full-Text Search (BM25)

Uses the BM25 algorithm for traditional keyword-based search:

results = await kb.fulltext_search(
    "machine learning neural networks",
    data_models=[Document.to_symbolic_data_model()],
    k=10,           # Number of results
    threshold=None, # Minimum score (optional)
)

Best for:

  • Exact keyword matching
  • When users search with specific terms
  • Quick, lightweight search

Similarity Search (Vector)

Uses embedding vectors for semantic search:

results = await kb.similarity_search(
    "how do computers learn",  # Semantically matches "machine learning"
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    threshold=0.7,  # Minimum similarity score
)

Best for:

  • Semantic meaning matching
  • Natural language queries
  • Finding conceptually related content

Hybrid Search

Combines both methods for the best results:

results = await kb.hybrid_search(
    "machine learning basics",
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    bm25_weight=0.5,    # Weight for BM25 scores
    vector_weight=0.5,  # Weight for vector scores
)

Best for:

  • Production RAG systems
  • When you need both exact and semantic matching
  • Complex queries that benefit from both approaches

CRUD Operations

Create/Update

The update method performs an upsert (insert or update). The first field declared in your DataModel is used as the primary key:

doc = Document(
    id="doc1",
    title="Introduction to AI",
    content="Artificial intelligence is...",
)

await kb.update(doc.to_json_data_model())
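
Calling update again with the same primary key value overwrites the existing record rather than creating a duplicate. A short sketch reusing the Document class and kb from above:

# Same id ("doc1"), so this updates the existing row instead of inserting a new one
revised = Document(
    id="doc1",
    title="Introduction to AI (revised)",
    content="Artificial intelligence is the study of...",
)

await kb.update(revised.to_json_data_model())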

Read by ID

result = await kb.get(
    "doc1",  # Primary key value
    data_models=[Document.to_symbolic_data_model()],
)

List All

all_docs = await kb.getall(
    Document.to_symbolic_data_model(),
    limit=100,
    offset=0,
)
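
For collections larger than a single page, limit and offset can be combined into a pagination loop. A sketch, assuming getall returns an empty (falsy) collection once the offset moves past the last record:

page_size = 100
offset = 0
all_docs = []

while True:
    page = await kb.getall(
        Document.to_symbolic_data_model(),
        limit=page_size,
        offset=offset,
    )
    if not page:  # assumed: empty result once no records remain
        break
    all_docs.extend(page)
    offset += page_size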

Delete

await kb.delete(
    "doc1",
    data_models=[Document.to_symbolic_data_model()],
)

Raw SQL

For complex queries, use raw SQL:

results = await kb.query(
    "SELECT id, title FROM Document WHERE title LIKE ?",
    params=["%Learning%"],
)
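
The same interface works for aggregates. For example, a sketch counting stored documents (the table name follows the DataModel class, as in the query above; the exact shape of the returned rows is not shown here):

# Count how many documents are stored
count_result = await kb.query("SELECT COUNT(*) AS n FROM Document")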

Knowledge Modules

Synalinks provides modules for integrating Knowledge Bases into programs:

RetrieveKnowledge

Retrieves relevant documents using LM-generated search queries:

graph LR
    A[Input] --> B[Generate Query]
    B --> C[Search KB]
    C --> D[Context + Input]

retrieved = await synalinks.RetrieveKnowledge(
    knowledge_base=kb,
    language_model=lm,
    search_type="hybrid",  # fulltext, similarity, or hybrid
    k=10,
    return_inputs=True,    # Include original input in output
)(inputs)

UpdateKnowledge

Stores DataModels in the Knowledge Base:

stored = await synalinks.UpdateKnowledge(
    knowledge_base=kb,
)(extracted_data)

EmbedKnowledge

Generates embeddings for DataModels:

embedded = await synalinks.EmbedKnowledge(
    embedding_model=embedding_model,
    in_mask=["content"],  # Which fields to embed
)(inputs)
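
EmbedKnowledge and UpdateKnowledge can be chained into an ingestion program, mirroring the functional style used for the RAG pipeline below. A sketch reusing the kb and embedding_model from earlier, and assuming the output of EmbedKnowledge can be passed directly to UpdateKnowledge:

inputs = synalinks.Input(data_model=Document)

# Embed the content field, then upsert the embedded document into the knowledge base
embedded = await synalinks.EmbedKnowledge(
    embedding_model=embedding_model,
    in_mask=["content"],
)(inputs)

stored = await synalinks.UpdateKnowledge(
    knowledge_base=kb,
)(embedded)

ingestion = synalinks.Program(
    inputs=inputs,
    outputs=stored,
    name="ingestion_pipeline",
)

# Usage: await ingestion(Document(id="doc1", title="...", content="..."))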

Building a RAG Pipeline

A complete RAG system combines retrieval with generation:

graph LR
    A[Query] --> B[RetrieveKnowledge]
    B --> C[Context + Query]
    C --> D[Generator]
    D --> E[Grounded Answer]

import asyncio
from dotenv import load_dotenv
import synalinks

class Query(synalinks.DataModel):
    query: str = synalinks.Field(description="User question")

class Answer(synalinks.DataModel):
    answer: str = synalinks.Field(description="Answer based on context")

async def main():
    load_dotenv()
    synalinks.clear_session()

    lm = synalinks.LanguageModel(model="openai/gpt-4.1-mini")

    # Assume kb is already populated
    kb = synalinks.KnowledgeBase(
        uri="duckdb://knowledge.db",
        data_models=[Document],
    )

    inputs = synalinks.Input(data_model=Query)

    # Retrieve relevant documents
    retrieved = await synalinks.RetrieveKnowledge(
        knowledge_base=kb,
        language_model=lm,
        search_type="fulltext",
        k=5,
        return_inputs=True,
    )(inputs)

    # Generate answer using retrieved context
    outputs = await synalinks.Generator(
        data_model=Answer,
        language_model=lm,
    )(retrieved)

    rag = synalinks.Program(
        inputs=inputs,
        outputs=outputs,
        name="rag_pipeline",
    )

    result = await rag(Query(query="What is machine learning?"))
    print(result["answer"])

if __name__ == "__main__":
    asyncio.run(main())

Key Takeaways

  • DuckDB Backend: Fast, embedded database with full-text and vector search capabilities. No external services required.

  • Three Search Types: Full-text (BM25) for keywords, similarity for semantics, hybrid for best of both.

  • DataModel as Schema: Your DataModels define the structure of stored documents. The first field is the primary key.

  • RetrieveKnowledge Module: Automates query generation and retrieval for RAG pipelines. Combines seamlessly with Generator.

  • Upsert Semantics: The update method inserts new records or updates existing ones based on the primary key.

  • Raw SQL Access: For complex queries, you can use raw SQL directly.

API References

Answer

Bases: DataModel

Answer based on retrieved context.

Source code in guides/6_knowledge_base.py
class Answer(synalinks.DataModel):
    """Answer based on retrieved context."""

    answer: str = synalinks.Field(description="Answer based on the context provided")

Document

Bases: DataModel

A document in the knowledge base.

Source code in guides/6_knowledge_base.py
class Document(synalinks.DataModel):
    """A document in the knowledge base."""

    id: str = synalinks.Field(description="Unique document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

Query

Bases: DataModel

User query.

Source code in guides/6_knowledge_base.py
class Query(synalinks.DataModel):
    """User query."""

    query: str = synalinks.Field(description="User question")