Knowledge Base

A Knowledge Base in Synalinks is a structured storage system that enables your LM applications to retrieve and reason over external data. Unlike simply injecting raw text into the prompt, a Knowledge Base provides semantic search capabilities, automatic chunking, and efficient retrieval - the foundation for building Retrieval-Augmented Generation (RAG) systems.

Why Knowledge Bases Matter

Language models have a knowledge cutoff and limited context windows. A Knowledge Base solves both problems:

graph LR
    subgraph Without Knowledge Base
        A[Query] --> B[LLM]
        B --> C[Hallucination Risk]
    end
    subgraph With Knowledge Base
        D[Query] --> E[Retrieve Relevant Docs]
        E --> F[LLM + Context]
        F --> G[Grounded Answer]
    end

Knowledge Bases provide:

  1. Grounded Responses: Answers based on actual data, not hallucinations
  2. Unlimited Knowledge: Store documents beyond context limits
  3. Up-to-Date Information: Add new data without retraining
  4. Source Attribution: Track where answers come from

Architecture

The Synalinks Knowledge Base is built on DuckDB, an embedded database that provides both full-text and vector indexing:

graph TD
    A[DataModels] --> B[KnowledgeBase]
    B --> C[DuckDB Storage]
    B --> D[Full-Text Index]
    B --> E[Vector Index]
    F[Search Query] --> G{Search Type}
    G -->|fulltext| D
    G -->|similarity| E
    G -->|hybrid| H[Combine Both]
    D --> I[Results]
    E --> I
    H --> I

Creating a Knowledge Base

Define DataModels for your documents, then create the Knowledge Base:

import synalinks

class Document(synalinks.DataModel):
    """A document in the knowledge base."""
    id: str = synalinks.Field(description="Unique document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

# Create the knowledge base
kb = synalinks.KnowledgeBase(
    uri="duckdb://my_database.db",    # Storage location
    data_models=[Document],            # What types to store
    embedding_model=embedding_model,   # For vector search (optional)
    metric="cosine",                   # Similarity metric
    wipe_on_start=False,               # Preserve existing data
)

Key Parameters

Parameter        Description
uri              Database connection string (e.g., duckdb://path.db)
data_models      List of DataModel classes to store
embedding_model  EmbeddingModel for vector search (optional)
metric           Similarity metric: cosine, l2, or ip
wipe_on_start    Clear database on initialization
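
The embedding_model argument in the example above is never constructed. A minimal sketch of creating one, assuming EmbeddingModel accepts a provider/model string the same way LanguageModel does later in this guide (the model name here is purely illustrative):

# Assumption: EmbeddingModel follows the same "provider/model" convention as LanguageModel
embedding_model = synalinks.EmbeddingModel(
    model="openai/text-embedding-3-small",  # illustrative model name
)

Since the embedding model is optional and used only for vector search, full-text search works without it, while similarity and hybrid search presumably require it.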

Search Methods

Full-Text Search (BM25)

Uses the BM25 algorithm for traditional keyword-based search:

results = await kb.fulltext_search(
    "machine learning neural networks",
    data_models=[Document.to_symbolic_data_model()],
    k=10,           # Number of results
    threshold=None, # Minimum score (optional)
)

Best for:

  • Exact keyword matching
  • When users search with specific terms
  • Quick, lightweight search

Similarity Search (Vector)

Uses embedding vectors for semantic search:

results = await kb.similarity_search(
    "how do computers learn",  # Semantically matches "machine learning"
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    threshold=0.7,  # Minimum similarity score
)

Best for:

  • Semantic meaning matching
  • Natural language queries
  • Finding conceptually related content

Hybrid Search

Combines both methods for the best results:

results = await kb.hybrid_search(
    "machine learning basics",
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    bm25_weight=0.5,    # Weight for BM25 scores
    vector_weight=0.5,  # Weight for vector scores
)

Best for:

  • Production RAG systems
  • When you need both exact and semantic matching
  • Complex queries that benefit from both approaches

CRUD Operations

Create/Update

The update method performs an upsert (insert or update). The first field declared in your DataModel is used as the primary key:

doc = Document(
    id="doc1",
    title="Introduction to AI",
    content="Artificial intelligence is...",
)

await kb.update(doc.to_json_data_model())
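
Calling update again with the same primary key value overwrites the existing record rather than creating a duplicate. A short sketch reusing the Document class and kb from above:

# Same id ("doc1"), so this updates the existing row instead of inserting a new one
revised = Document(
    id="doc1",
    title="Introduction to AI (revised)",
    content="Artificial intelligence is the study of...",
)

await kb.update(revised.to_json_data_model())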

Read by ID

result = await kb.get(
    "doc1",  # Primary key value
    data_models=[Document.to_symbolic_data_model()],
)

List All

all_docs = await kb.getall(
    Document.to_symbolic_data_model(),
    limit=100,
    offset=0,
)
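
For collections larger than a single page, limit and offset can be combined into a pagination loop. A sketch, assuming getall returns an empty (falsy) collection once the offset moves past the last record:

page_size = 100
offset = 0
all_docs = []

while True:
    page = await kb.getall(
        Document.to_symbolic_data_model(),
        limit=page_size,
        offset=offset,
    )
    if not page:  # assumed: empty result once no records remain
        break
    all_docs.extend(page)
    offset += page_size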

Delete

await kb.delete(
    "doc1",
    data_models=[Document.to_symbolic_data_model()],
)

Raw SQL

For complex queries, use raw SQL:

results = await kb.query(
    "SELECT id, title FROM Document WHERE title LIKE ?",
    params=["%Learning%"],
)
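
The same interface works for aggregates. For example, a sketch counting stored documents (the table name follows the DataModel class, as in the query above; the exact shape of the returned rows is not shown here):

# Count how many documents are stored
count_result = await kb.query("SELECT COUNT(*) AS n FROM Document")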

Knowledge Modules

Synalinks provides modules for integrating Knowledge Bases into programs:

RetrieveKnowledge

Retrieves relevant documents using LM-generated search queries:

graph LR
    A[Input] --> B[Generate Query]
    B --> C[Search KB]
    C --> D[Context + Input]

retrieved = await synalinks.RetrieveKnowledge(
    knowledge_base=kb,
    language_model=lm,
    search_type="hybrid",  # fulltext, similarity, or hybrid
    k=10,
    return_inputs=True,    # Include original input in output
)(inputs)

UpdateKnowledge

Stores DataModels in the Knowledge Base:

stored = await synalinks.UpdateKnowledge(
    knowledge_base=kb,
)(extracted_data)

EmbedKnowledge

Generates embeddings for DataModels:

embedded = await synalinks.EmbedKnowledge(
    embedding_model=embedding_model,
    in_mask=["content"],  # Which fields to embed
)(inputs)
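
EmbedKnowledge and UpdateKnowledge can be chained into an ingestion program, mirroring the functional style used for the RAG pipeline below. A sketch reusing the kb and embedding_model from earlier, and assuming the output of EmbedKnowledge can be passed directly to UpdateKnowledge:

inputs = synalinks.Input(data_model=Document)

# Embed the content field, then upsert the embedded document into the knowledge base
embedded = await synalinks.EmbedKnowledge(
    embedding_model=embedding_model,
    in_mask=["content"],
)(inputs)

stored = await synalinks.UpdateKnowledge(
    knowledge_base=kb,
)(embedded)

ingestion = synalinks.Program(
    inputs=inputs,
    outputs=stored,
    name="ingestion_pipeline",
)

# Usage: await ingestion(Document(id="doc1", title="...", content="..."))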

Building a RAG Pipeline

A complete RAG system combines retrieval with generation:

graph LR
    A[Query] --> B[RetrieveKnowledge]
    B --> C[Context + Query]
    C --> D[Generator]
    D --> E[Grounded Answer]

import asyncio
from dotenv import load_dotenv
import synalinks

class Query(synalinks.DataModel):
    query: str = synalinks.Field(description="User question")

class Answer(synalinks.DataModel):
    answer: str = synalinks.Field(description="Answer based on context")

async def main():
    load_dotenv()
    synalinks.clear_session()

    lm = synalinks.LanguageModel(model="openai/gpt-4.1-mini")

    # Assume kb is already populated
    kb = synalinks.KnowledgeBase(
        uri="duckdb://knowledge.db",
        data_models=[Document],
    )

    inputs = synalinks.Input(data_model=Query)

    # Retrieve relevant documents
    retrieved = await synalinks.RetrieveKnowledge(
        knowledge_base=kb,
        language_model=lm,
        search_type="fulltext",
        k=5,
        return_inputs=True,
    )(inputs)

    # Generate answer using retrieved context
    outputs = await synalinks.Generator(
        data_model=Answer,
        language_model=lm,
    )(retrieved)

    rag = synalinks.Program(
        inputs=inputs,
        outputs=outputs,
        name="rag_pipeline",
    )

    result = await rag(Query(query="What is machine learning?"))
    print(result["answer"])

if __name__ == "__main__":
    asyncio.run(main())

Key Takeaways

  • DuckDB Backend: Fast, embedded database with full-text and vector search capabilities. No external services required.

  • Three Search Types: Full-text (BM25) for keywords, similarity for semantics, hybrid for best of both.

  • DataModel as Schema: Your DataModels define the structure of stored documents. The first field is the primary key.

  • RetrieveKnowledge Module: Automates query generation and retrieval for RAG pipelines. Combines seamlessly with Generator.

  • Upsert Semantics: The update method inserts new records or updates existing ones based on the primary key.

  • Raw SQL Access: For complex queries, you can use raw SQL directly.

API References

Answer

Bases: DataModel

Answer based on retrieved context.

Source code in guides/6_knowledge_base.py
class Answer(synalinks.DataModel):
    """Answer based on retrieved context."""

    answer: str = synalinks.Field(description="Answer based on the context provided")

Document

Bases: DataModel

A document in the knowledge base.

Source code in guides/6_knowledge_base.py
class Document(synalinks.DataModel):
    """A document in the knowledge base."""

    id: str = synalinks.Field(description="Unique document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

Query

Bases: DataModel

User query.

Source code in guides/6_knowledge_base.py
class Query(synalinks.DataModel):
    """User query."""

    query: str = synalinks.Field(description="User question")