# Knowledge Base
A Knowledge Base in Synalinks is a structured storage system that enables your LM applications to retrieve and reason over external data. Unlike simple prompt injection, a Knowledge Base provides semantic search capabilities, automatic chunking, and efficient retrieval - the foundation for building Retrieval-Augmented Generation (RAG) systems.
## Why Knowledge Bases Matter
Language models have a knowledge cutoff and limited context windows. A Knowledge Base solves both problems:
```mermaid
graph LR
    subgraph Without Knowledge Base
        A[Query] --> B[LLM]
        B --> C[Hallucination Risk]
    end
    subgraph With Knowledge Base
        D[Query] --> E[Retrieve Relevant Docs]
        E --> F[LLM + Context]
        F --> G[Grounded Answer]
    end
```
Knowledge Bases provide:
- **Grounded Responses**: Answers based on actual data, not hallucinations
- **Unlimited Knowledge**: Store documents beyond context limits
- **Up-to-Date Information**: Add new data without retraining
- **Source Attribution**: Track where answers come from
## Architecture

The Synalinks Knowledge Base is built on DuckDB, providing:
```mermaid
graph TD
    A[DataModels] --> B[KnowledgeBase]
    B --> C[DuckDB Storage]
    B --> D[Full-Text Index]
    B --> E[Vector Index]
    F[Search Query] --> G{Search Type}
    G -->|fulltext| D
    G -->|similarity| E
    G -->|hybrid| H[Combine Both]
    D --> I[Results]
    E --> I
    H --> I
```
## Creating a Knowledge Base
Define DataModels for your documents, then create the Knowledge Base:
```python
import synalinks

class Document(synalinks.DataModel):
    """A document in the knowledge base."""
    id: str = synalinks.Field(description="Unique document ID")
    title: str = synalinks.Field(description="Document title")
    content: str = synalinks.Field(description="Document content")

# Create the knowledge base
kb = synalinks.KnowledgeBase(
    uri="duckdb://my_database.db",    # Storage location
    data_models=[Document],           # What types to store
    embedding_model=embedding_model,  # For vector search (optional)
    metric="cosine",                  # Similarity metric
    wipe_on_start=False,              # Preserve existing data
)
```
### Key Parameters
| Parameter | Description |
|---|---|
| `uri` | Database connection string (e.g., `duckdb://path.db`) |
| `data_models` | List of `DataModel` classes to store |
| `embedding_model` | `EmbeddingModel` for vector search (optional) |
| `metric` | Similarity metric: `cosine`, `l2`, or `ip` |
| `wipe_on_start` | Clear the database on initialization |
## Search Methods

### Full-Text Search (BM25)
Uses the BM25 algorithm for traditional keyword-based search:
```python
results = await kb.fulltext_search(
    "machine learning neural networks",
    data_models=[Document.to_symbolic_data_model()],
    k=10,            # Number of results
    threshold=None,  # Minimum score (optional)
)
```
Best for:
- Exact keyword matching
- When users search with specific terms
- Quick, lightweight search
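To build intuition for how BM25 ranks documents, here is a simplified, self-contained sketch of the scoring formula. This is illustrative only; the actual implementation lives inside DuckDB's full-text index, and the tokenization and parameter defaults there may differ.

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        score = 0.0
        for t in query_terms:
            f = d.count(t)  # term frequency in this document
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "machine learning with neural networks".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores("machine learning".split(), docs)
# The first document contains the query terms, so it scores higher
```

Documents that never mention a query term contribute nothing for that term, which is why BM25 excels at exact keyword matching but misses paraphrases.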
### Similarity Search (Vector)
Uses embedding vectors for semantic search:
```python
results = await kb.similarity_search(
    "how do computers learn",  # Semantically matches "machine learning"
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    threshold=0.7,  # Minimum similarity score
)
```
Best for:
- Semantic meaning matching
- Natural language queries
- Finding conceptually related content
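The `cosine` metric behind this search measures the angle between embedding vectors rather than their magnitude. A minimal sketch of the computation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

A `threshold` like `0.7` filters out results whose embeddings point in a substantially different direction from the query's.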
### Hybrid Search
Combines both methods for best results:
```python
results = await kb.hybrid_search(
    "machine learning basics",
    data_models=[Document.to_symbolic_data_model()],
    k=10,
    bm25_weight=0.5,    # Weight for BM25 scores
    vector_weight=0.5,  # Weight for vector scores
)
```
Best for:
- Production RAG systems
- When you need both exact and semantic matching
- Complex queries that benefit from both approaches
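Conceptually, hybrid search blends the two score lists using the given weights. The sketch below uses min-max normalization followed by a weighted sum; it is illustrative only, and the exact fusion strategy inside Synalinks may differ.

```python
def hybrid_scores(bm25, vector, bm25_weight=0.5, vector_weight=0.5):
    """Blend min-max-normalized BM25 and vector scores per document."""
    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    nb, nv = normalize(bm25), normalize(vector)
    return [bm25_weight * b + vector_weight * v for b, v in zip(nb, nv)]

# Doc 0 wins on keywords, doc 2 on semantics; doc 1 is strong on both
combined = hybrid_scores([9.0, 4.0, 1.0], [0.2, 0.9, 0.6])
# Doc 1 ranks first because it scores well on both signals
```

This is why hybrid search suits production RAG: a document only needs to be strong on one signal to surface, but documents strong on both rise to the top.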
## CRUD Operations

### Create/Update

The `update` method performs an upsert (insert or update). The first field in your `DataModel` is used as the primary key:
```python
doc = Document(
    id="doc1",
    title="Introduction to AI",
    content="Artificial intelligence is...",
)
await kb.update(doc.to_json_data_model())
```
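The upsert behavior itself is standard SQL. The sketch below demonstrates it with Python's stdlib `sqlite3` purely for illustration (Synalinks stores its data in DuckDB, not SQLite): writing twice with the same primary key leaves one row, carrying the latest values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (id TEXT PRIMARY KEY, title TEXT, content TEXT)")

# First call inserts; the second call with the same primary key updates in place
upsert = """
    INSERT INTO Document (id, title, content) VALUES (?, ?, ?)
    ON CONFLICT (id) DO UPDATE SET title = excluded.title, content = excluded.content
"""
conn.execute(upsert, ("doc1", "Introduction to AI", "Artificial intelligence is..."))
conn.execute(upsert, ("doc1", "Intro to AI (revised)", "AI is..."))

rows = conn.execute("SELECT id, title FROM Document").fetchall()
# Still exactly one row, with the updated title
```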
### Read by ID
```python
result = await kb.get(
    "doc1",  # Primary key value
    data_models=[Document.to_symbolic_data_model()],
)
```
### List All
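One straightforward way to list every stored record is the raw-SQL `query` method documented below; whether a dedicated listing helper exists is not covered here, so treat this as a sketch built only on the documented `query` API:

```python
results = await kb.query("SELECT * FROM Document")
```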
### Delete
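Similarly, records can be removed with a raw-SQL delete keyed on the primary key. This sketch relies only on the documented `query` method; a dedicated delete helper, if one exists, is not shown in this guide:

```python
await kb.query("DELETE FROM Document WHERE id = ?", params=["doc1"])
```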
### Raw SQL
For complex queries, use raw SQL:
```python
results = await kb.query(
    "SELECT id, title FROM Document WHERE title LIKE ?",
    params=["%Learning%"],
)
```
## Knowledge Modules
Synalinks provides modules for integrating Knowledge Bases into programs:
### RetrieveKnowledge
Retrieves relevant documents using LM-generated search queries:
```mermaid
graph LR
    A[Input] --> B[Generate Query]
    B --> C[Search KB]
    C --> D[Context + Input]
```
```python
retrieved = await synalinks.RetrieveKnowledge(
    knowledge_base=kb,
    language_model=lm,
    search_type="hybrid",  # fulltext, similarity, or hybrid
    k=10,
    return_inputs=True,    # Include original input in output
)(inputs)
```
### UpdateKnowledge
Stores DataModels in the Knowledge Base:
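A sketch of wiring it into a program, assuming a call signature analogous to `RetrieveKnowledge` (only the `knowledge_base` parameter is implied by this guide; other parameters may exist):

```python
stored = await synalinks.UpdateKnowledge(
    knowledge_base=kb,
)(inputs)
```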
### EmbedKnowledge
Generates embeddings for DataModels:
```python
embedded = await synalinks.EmbedKnowledge(
    embedding_model=embedding_model,
    in_mask=["content"],  # Which fields to embed
)(inputs)
```
## Building a RAG Pipeline
A complete RAG system combines retrieval with generation:
```mermaid
graph LR
    A[Query] --> B[RetrieveKnowledge]
    B --> C[Context + Query]
    C --> D[Generator]
    D --> E[Grounded Answer]
```
```python
import asyncio

from dotenv import load_dotenv

import synalinks

class Query(synalinks.DataModel):
    query: str = synalinks.Field(description="User question")

class Answer(synalinks.DataModel):
    answer: str = synalinks.Field(description="Answer based on context")

async def main():
    load_dotenv()
    synalinks.clear_session()

    lm = synalinks.LanguageModel(model="openai/gpt-4.1-mini")

    # Assume kb is already populated
    kb = synalinks.KnowledgeBase(
        uri="duckdb://knowledge.db",
        data_models=[Document],
    )

    inputs = synalinks.Input(data_model=Query)

    # Retrieve relevant documents
    retrieved = await synalinks.RetrieveKnowledge(
        knowledge_base=kb,
        language_model=lm,
        search_type="fulltext",
        k=5,
        return_inputs=True,
    )(inputs)

    # Generate answer using retrieved context
    outputs = await synalinks.Generator(
        data_model=Answer,
        language_model=lm,
    )(retrieved)

    rag = synalinks.Program(
        inputs=inputs,
        outputs=outputs,
        name="rag_pipeline",
    )

    result = await rag(Query(query="What is machine learning?"))
    print(result["answer"])

if __name__ == "__main__":
    asyncio.run(main())
```
## Key Takeaways

- **DuckDB Backend**: Fast, embedded database with full-text and vector search capabilities. No external services required.
- **Three Search Types**: Full-text (BM25) for keywords, similarity for semantics, hybrid for the best of both.
- **DataModel as Schema**: Your DataModels define the structure of stored documents. The first field is the primary key.
- **RetrieveKnowledge Module**: Automates query generation and retrieval for RAG pipelines. Combines seamlessly with `Generator`.
- **Upsert Semantics**: The `update` method inserts new records or updates existing ones based on the primary key.
- **Raw SQL Access**: For complex queries, you can use raw SQL directly.
## API References

### Answer

### Document

Bases: `DataModel`

A document in the knowledge base.