Knowledge Extraction and Storage

Synalinks provides a knowledge base for extracting, storing, and retrieving structured knowledge. This example extracts structured records from invoices and other documents, stores them in the knowledge base, and queries them later.

graph LR
    subgraph Extraction
        A[Document] --> B[Generator]
        B --> C[Structured Data]
    end
    subgraph Storage
        C --> D[UpdateKnowledge]
        D --> E[(KnowledgeBase)]
    end
    subgraph Retrieval
        F[Query] --> G[RetrieveKnowledge]
        E --> G
        G --> H[Results]
    end

Creating a Knowledge Base

The KnowledgeBase uses DuckDB as the underlying storage engine, providing full-text search and optional vector similarity search:

import synalinks

# Define your data model (simplified here; the full Invoice model under
# API References adds date and currency, and names the amount total_amount)
class Invoice(synalinks.DataModel):
    invoice_number: str = synalinks.Field(description="Invoice number")
    vendor: str = synalinks.Field(description="Vendor name")
    total: float = synalinks.Field(description="Total amount")
    description: str = synalinks.Field(description="Description of items")

# Create a knowledge base
# (embedding_model is assumed to have been created earlier)
knowledge_base = synalinks.KnowledgeBase(
    uri="duckdb://./invoices.db",
    data_models=[Invoice],
    embedding_model=embedding_model,  # Optional, for similarity search
)

Extracting Information with Generator

Use a Generator to extract structured information from unstructured text:

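# Run inside an async function, since module calls are awaited.
# language_model is assumed to have been created earlier.
inputs = synalinks.Input(data_model=DocumentText)
extracted = await synalinks.Generator(
    data_model=Invoice,
    language_model=language_model,
)(inputs)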
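
To make "structured" concrete, the sketch below shows the kind of check that typed extraction implies: model output, taken here as a JSON string, is parsed and compared against a field-to-type schema. It uses only the Python standard library and is not how Synalinks implements Generator. The names `INVOICE_SCHEMA` and `parse_structured` are hypothetical, and the schema mirrors the simplified Invoice model shown earlier.

```python
import json

# Hypothetical schema mirroring the simplified Invoice data model above.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "vendor": str,
    "total": float,
    "description": str,
}


def parse_structured(raw_json, schema):
    """Parse JSON text and check it against a {field name: type} schema."""
    payload = json.loads(raw_json)
    record = {}
    for name, expected in schema.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        value = payload[name]
        # JSON has one number type, so accept an int where a float is expected.
        if expected is float and type(value) is int:
            value = float(value)
        if not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}")
        record[name] = value
    return record


# Illustrative output, as a language model might return it.
raw = (
    '{"invoice_number": "INV-001", "vendor": "Acme", '
    '"total": 120, "description": "Printer paper"}'
)
invoice = parse_structured(raw, INVOICE_SCHEMA)
```

A record that passes this check has exactly the fields the data model declares, which is what lets the later storage step treat it as a row.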

Storing Data with UpdateKnowledge

The UpdateKnowledge module stores data models in the knowledge base:

stored = await synalinks.UpdateKnowledge(
    knowledge_base=knowledge_base,
)(extracted)
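
This page describes UpdateKnowledge as inserting or upserting records with the first field as the primary key. The sketch below illustrates that upsert behaviour only. It uses SQLite from the Python standard library as a stand-in for DuckDB, and the table layout and the `upsert_invoice` helper are assumptions made for illustration, not Synalinks internals.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the DuckDB file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice ("
    "invoice_number TEXT PRIMARY KEY, "  # first field acts as the key
    "vendor TEXT, total REAL, description TEXT)"
)


def upsert_invoice(conn, record):
    """Insert a row, or replace the existing row with the same primary key."""
    conn.execute(
        "INSERT OR REPLACE INTO invoice VALUES "
        "(:invoice_number, :vendor, :total, :description)",
        record,
    )


upsert_invoice(conn, {"invoice_number": "INV-001", "vendor": "Acme",
                      "total": 100.0, "description": "Paper"})
upsert_invoice(conn, {"invoice_number": "INV-002", "vendor": "Globex",
                      "total": 50.0, "description": "Toner"})
# Same invoice_number as the first row: it is replaced, not duplicated.
upsert_invoice(conn, {"invoice_number": "INV-001", "vendor": "Acme",
                      "total": 120.0, "description": "Paper, corrected"})

rows = conn.execute(
    "SELECT invoice_number, total FROM invoice ORDER BY invoice_number"
).fetchall()
```

The practical consequence is that re-extracting the same invoice updates the stored record instead of creating a second copy, so the first field of a data model should be a stable identifier.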

Retrieving Data with RetrieveKnowledge

The RetrieveKnowledge module has the language model generate a search query, then runs a hybrid search (full-text plus vector) to find relevant records:

results = await synalinks.RetrieveKnowledge(
    knowledge_base=knowledge_base,
    language_model=language_model,
    search_type="hybrid",
    k=5,
)(query)
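
Hybrid search has to merge two ranked result lists, one from full-text search and one from vector similarity. This page does not say how Synalinks combines them. The sketch below uses reciprocal rank fusion, a common general-purpose technique, purely as an illustration. The function name, the `rrf_constant` parameter, and the sample invoice IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, rrf_constant=60):
    """Merge ranked ID lists; each list adds 1 / (rrf_constant + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_constant + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
full_text_hits = ["INV-003", "INV-001", "INV-007"]
vector_hits = ["INV-001", "INV-004", "INV-003"]

fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
top_results = fused[:2]  # keep only the best few, as a top-k cutoff would
```

Records that appear in both lists, such as INV-001 and INV-003 here, rise above records found by only one search method, which is the main reason to combine the two.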

Key Takeaways

  • KnowledgeBase: Unified interface for storing and searching structured data using DuckDB with full-text and vector search capabilities.
  • UpdateKnowledge: Module for inserting/upserting data models into the knowledge base using the first field as primary key.
  • RetrieveKnowledge: Module for intelligent retrieval using LM-generated search queries with hybrid search (full-text + vector).
  • Structured Extraction: Use Generators to extract typed data from unstructured text like invoices, receipts, or documents.

Program Visualizations

Invoice Extraction Pipeline

[Figure: invoice_extraction program visualization]

Business Q&A System

[Figure: business_qa program visualization]

API References

Answer

Bases: DataModel

An answer based on retrieved information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Answer(synalinks.DataModel):
    """An answer based on retrieved information."""

    answer: str = synalinks.Field(
        description="The answer to the user's question based on retrieved data",
    )

Customer

Bases: DataModel

Extracted customer information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Customer(synalinks.DataModel):
    """Extracted customer information."""

    customer_id: str = synalinks.Field(
        description="Unique customer identifier",
    )
    name: str = synalinks.Field(
        description="Customer name (person or company)",
    )
    email: str = synalinks.Field(
        description="Customer email address",
    )
    description: str = synalinks.Field(
        description="Additional notes about the customer",
    )

DocumentText

Bases: DataModel

Raw document text to extract information from.

Source code in examples/12_knowledge_extraction_and_storage.py
class DocumentText(synalinks.DataModel):
    """Raw document text to extract information from."""

    text: str = synalinks.Field(
        description="The raw text content of the document",
    )

Invoice

Bases: DataModel

Extracted invoice information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Invoice(synalinks.DataModel):
    """Extracted invoice information."""

    invoice_number: str = synalinks.Field(
        description="The unique invoice number or ID",
    )
    vendor: str = synalinks.Field(
        description="The name of the vendor or supplier",
    )
    date: str = synalinks.Field(
        description="The invoice date (YYYY-MM-DD format)",
    )
    total_amount: float = synalinks.Field(
        description="The total amount due",
    )
    currency: str = synalinks.Field(
        description="The currency (e.g., USD, EUR)",
    )
    description: str = synalinks.Field(
        description="A brief description of the invoice items or services",
    )

Query

Bases: DataModel

A user query for searching the knowledge base.

Source code in examples/12_knowledge_extraction_and_storage.py
class Query(synalinks.DataModel):
    """A user query for searching the knowledge base."""

    query: str = synalinks.Field(
        description="The search query or question",
    )