Knowledge Extraction and Storage

Synalinks provides a knowledge base system for extracting, storing, and retrieving structured knowledge. This example extracts structured information from invoices and other documents, stores it in a knowledge base, and queries it later.

graph LR
    subgraph Extraction
        A[Document] --> B[Generator]
        B --> C[Structured Data]
    end
    subgraph Storage
        C --> D[UpdateKnowledge]
        D --> E[(KnowledgeBase)]
    end
    subgraph Retrieval
        F[Query] --> G[RetrieveKnowledge]
        E --> G
        G --> H[Results]
    end

Creating a Knowledge Base

The KnowledgeBase uses DuckDB as the underlying storage engine, providing full-text search and optional vector similarity search:

import synalinks

# Define your data model
class Invoice(synalinks.DataModel):
    invoice_number: str = synalinks.Field(description="Invoice number")
    vendor: str = synalinks.Field(description="Vendor name")
    total: float = synalinks.Field(description="Total amount")
    description: str = synalinks.Field(description="Description of items")

# Embedding model, only needed for similarity search
embedding_model = synalinks.EmbeddingModel(
    model="gemini/text-embedding-004",
)

# Create a knowledge base
knowledge_base = synalinks.KnowledgeBase(
    uri="duckdb://./invoices.db",
    data_models=[Invoice],
    embedding_model=embedding_model,  # Optional, for similarity search
)
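The full example at the end of this page passes metric="cosine" when it creates the knowledge base. To make that metric concrete, here is a minimal pure-Python sketch of cosine similarity. It does not show how Synalinks or DuckDB compute it internally; the function name and the toy three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical numbers, not real model output).
query = [1.0, 0.0, 1.0]
docs = {
    "cloud hosting": [1.0, 0.0, 1.0],
    "office chair": [0.0, 1.0, 0.0],
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, docs[name]),
    reverse=True,
)
```

Because cosine similarity depends only on the direction of the vectors and not their length, scaling an embedding leaves its score unchanged.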

Extracting Information with Generator

Use a Generator to extract structured information from unstructured text:

inputs = synalinks.Input(data_model=DocumentText)
extracted = await synalinks.Generator(
    data_model=Invoice,
    language_model=language_model,
)(inputs)

Storing Data with UpdateKnowledge

The UpdateKnowledge module stores data models in the knowledge base:

stored = await synalinks.UpdateKnowledge(
    knowledge_base=knowledge_base,
)(extracted)
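The Key Takeaways on this page state that UpdateKnowledge upserts records using the first field as the primary key. The sketch below illustrates that rule with a plain dictionary standing in for a table. It is an assumption-laden illustration and not Synalinks code: InvoiceRow, upsert, and the corrected total of 310.00 are all hypothetical.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class InvoiceRow:
    # The first declared field plays the role of the primary key.
    invoice_number: str
    vendor: str
    total: float


def upsert(table, record):
    """Insert `record`, or overwrite the row sharing its first-field value."""
    key_field = fields(record)[0].name
    table[getattr(record, key_field)] = asdict(record)


table = {}
upsert(table, InvoiceRow("INV-2024-001", "TechSupply Co.", 307.84))
upsert(table, InvoiceRow("INV-2024-002", "Cloud Services Inc.", 149.00))
# Same invoice_number as the first row, so this replaces it instead of adding a row.
upsert(table, InvoiceRow("INV-2024-001", "TechSupply Co.", 310.00))
```

One practical consequence of this rule is that field order in a data model matters: putting a non-unique field first would cause unrelated records to overwrite each other.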

Retrieving Data with RetrieveKnowledge

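The RetrieveKnowledge module uses the language model to generate a search query, then runs that query against the knowledge base. With search_type="hybrid", it combines full-text and vector search:

inputs = synalinks.Input(data_model=Query)
results = await synalinks.RetrieveKnowledge(
    knowledge_base=knowledge_base,
    language_model=language_model,
    search_type="hybrid",
    k=5,
)(inputs)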
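A comment in the full source on this page describes hybrid search as vector plus BM25 full-text results fused with RRF (reciprocal rank fusion). The following is a minimal sketch of that fusion step in plain Python, not the library's implementation. The function name, the rrf_k constant of 60 (a commonly used default), and the two hit lists are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse several ranked lists of ids into a single list, best first.

    Each id scores the sum of 1 / (rrf_k + rank) over every list it
    appears in, where rank starts at 1 for the top hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by descending score; break ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two searches for the same query.
fulltext_hits = ["INV-2024-003", "INV-2024-001"]
vector_hits = ["INV-2024-003", "INV-2024-002", "INV-2024-001"]

fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
```

RRF only looks at positions, so it can merge BM25 scores and vector distances without having to normalize them onto a common scale. A record that ranks well in both lists ends up ahead of one that appears in only one.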

Key Takeaways

  • KnowledgeBase: Unified interface for storing and searching structured data using DuckDB with full-text and vector search capabilities.
  • UpdateKnowledge: Module for inserting/upserting data models into the knowledge base using the first field as primary key.
  • RetrieveKnowledge: Module for intelligent retrieval using LM-generated search queries with hybrid search (full-text + vector).
  • Structured Extraction: Use Generators to extract typed data from unstructured text like invoices, receipts, or documents.

Program Visualizations

Invoice Extraction Pipeline

[Figure: plot of the invoice_extraction program]

Business Q&A System

[Figure: plot of the business_qa program]

API References

Answer

Bases: DataModel

An answer based on retrieved information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Answer(synalinks.DataModel):
    """An answer based on retrieved information."""

    answer: str = synalinks.Field(
        description="The answer to the user's question based on retrieved data",
    )

Customer

Bases: DataModel

Extracted customer information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Customer(synalinks.DataModel):
    """Extracted customer information."""

    customer_id: str = synalinks.Field(
        description="Unique customer identifier",
    )
    name: str = synalinks.Field(
        description="Customer name (person or company)",
    )
    email: str = synalinks.Field(
        description="Customer email address",
    )
    description: str = synalinks.Field(
        description="Additional notes about the customer",
    )

DocumentText

Bases: DataModel

Raw document text to extract information from.

Source code in examples/12_knowledge_extraction_and_storage.py
class DocumentText(synalinks.DataModel):
    """Raw document text to extract information from."""

    text: str = synalinks.Field(
        description="The raw text content of the document",
    )

Invoice

Bases: DataModel

Extracted invoice information.

Source code in examples/12_knowledge_extraction_and_storage.py
class Invoice(synalinks.DataModel):
    """Extracted invoice information."""

    invoice_number: str = synalinks.Field(
        description="The unique invoice number or ID",
    )
    vendor: str = synalinks.Field(
        description="The name of the vendor or supplier",
    )
    date: str = synalinks.Field(
        description="The invoice date (YYYY-MM-DD format)",
    )
    total_amount: float = synalinks.Field(
        description="The total amount due",
    )
    currency: str = synalinks.Field(
        description="The currency (e.g., USD, EUR)",
    )
    description: str = synalinks.Field(
        description="A brief description of the invoice items or services",
    )

Query

Bases: DataModel

A user query for searching the knowledge base.

Source code in examples/12_knowledge_extraction_and_storage.py
class Query(synalinks.DataModel):
    """A user query for searching the knowledge base."""

    query: str = synalinks.Field(
        description="The search query or question",
    )

Source

import asyncio
import os

from dotenv import load_dotenv

import synalinks


# =============================================================================
# Define the data models for invoice extraction
# =============================================================================


class DocumentText(synalinks.DataModel):
    """Raw document text to extract information from."""

    text: str = synalinks.Field(
        description="The raw text content of the document",
    )


class Invoice(synalinks.DataModel):
    """Extracted invoice information."""

    invoice_number: str = synalinks.Field(
        description="The unique invoice number or ID",
    )
    vendor: str = synalinks.Field(
        description="The name of the vendor or supplier",
    )
    date: str = synalinks.Field(
        description="The invoice date (YYYY-MM-DD format)",
    )
    total_amount: float = synalinks.Field(
        description="The total amount due",
    )
    currency: str = synalinks.Field(
        description="The currency (e.g., USD, EUR)",
    )
    description: str = synalinks.Field(
        description="A brief description of the invoice items or services",
    )


class Customer(synalinks.DataModel):
    """Extracted customer information."""

    customer_id: str = synalinks.Field(
        description="Unique customer identifier",
    )
    name: str = synalinks.Field(
        description="Customer name (person or company)",
    )
    email: str = synalinks.Field(
        description="Customer email address",
    )
    description: str = synalinks.Field(
        description="Additional notes about the customer",
    )


class Query(synalinks.DataModel):
    """A user query for searching the knowledge base."""

    query: str = synalinks.Field(
        description="The search query or question",
    )


class Answer(synalinks.DataModel):
    """An answer based on retrieved information."""

    answer: str = synalinks.Field(
        description="The answer to the user's question based on retrieved data",
    )


async def main():
    load_dotenv()

    # Enable observability for tracing (expects a tracking server at tracking_uri)
    synalinks.enable_observability(
        tracking_uri="http://localhost:5000",
        experiment_name="knowledge_extraction",
    )

    # Initialize models
    language_model = synalinks.LanguageModel(
        model="gemini/gemini-3.1-flash-lite-preview",
    )

    embedding_model = synalinks.EmbeddingModel(
        model="gemini/text-embedding-004",
    )

    # Clean up any existing database
    db_path = "./examples/business_data.db"
    if os.path.exists(db_path):
        os.remove(db_path)

    # ==========================================================================
    # Example 1: Create a Knowledge Base for Business Data
    # ==========================================================================
    print("Example 1: Creating a Knowledge Base")
    print("=" * 50)

    knowledge_base = synalinks.KnowledgeBase(
        uri=f"duckdb://{db_path}",
        data_models=[Invoice, Customer],
        embedding_model=embedding_model,
        metric="cosine",
    )

    print(f"Knowledge base created at: {db_path}")
    tables = [m.get_schema()["title"] for m in knowledge_base.get_symbolic_data_models()]
    print(f"Tables: {tables}")

    # ==========================================================================
    # Example 2: Extract and Store Invoices
    # ==========================================================================
    print("\nExample 2: Extracting and Storing Invoices")
    print("=" * 50)

    # Sample invoice texts (simulating OCR output or email content)
    invoice_texts = [
        """
        INVOICE #INV-2024-001
        From: TechSupply Co.
        Date: 2024-01-15

        Items:
        - 10x USB-C Cables @ $12.99 each
        - 5x Wireless Mouse @ $29.99 each

        Subtotal: $279.85
        Tax: $27.99
        Total Due: $307.84 USD

        Payment due within 30 days.
        """,
        """
        Invoice Number: INV-2024-002
        Vendor: Cloud Services Inc.
        Invoice Date: January 20, 2024

        Monthly subscription for cloud hosting services
        - Basic Plan (January 2024)
        - Storage: 500GB
        - Bandwidth: Unlimited

        Amount: EUR 149.00
        """,
        """
        BILL
        Invoice: INV-2024-003
        Office Furniture Ltd.
        02/01/2024

        Standing Desk - Adjustable Height: $599.00
        Ergonomic Chair - Premium: $449.00
        Desk Lamp - LED: $79.00

        Total: $1,127.00 USD
        """,
    ]

    # Create extraction and storage program
    inputs = synalinks.Input(data_model=DocumentText)
    extracted_invoice = await synalinks.Generator(
        data_model=Invoice,
        language_model=language_model,
        instructions="Extract invoice information from the document text. Use YYYY-MM-DD format for dates.",
    )(inputs)
    stored_invoice = await synalinks.UpdateKnowledge(
        knowledge_base=knowledge_base,
    )(extracted_invoice)

    invoice_program = synalinks.Program(
        inputs=inputs,
        outputs=stored_invoice,
        name="invoice_extraction",
        description="Extract and store invoice data",
    )

    synalinks.utils.plot_program(
        invoice_program,
        to_folder="examples",
        show_module_names=True,
        show_schemas=True,
        show_trainable=True,
    )

    # Process invoices
    print("\nExtracting invoices...")
    for text in invoice_texts:
        result = await invoice_program(DocumentText(text=text))
        print(
            f"  - {result.get('invoice_number')}: {result.get('vendor')} - {result.get('total_amount')} {result.get('currency')}"
        )

    # ==========================================================================
    # Example 3: Extract and Store Customers
    # ==========================================================================
    print("\nExample 3: Extracting and Storing Customers")
    print("=" * 50)

    customer_texts = [
        """
        New Customer Registration:
        ID: CUST-001
        Company: Acme Corporation
        Contact: john.doe@acme.com
        Notes: Enterprise client, interested in bulk orders
        """,
        """
        Customer Profile Update
        Customer Number: CUST-002
        Name: Jane Smith
        Email Address: jane.smith@startup.io
        Remarks: Small business owner, prefers monthly billing
        """,
    ]

    inputs = synalinks.Input(data_model=DocumentText)
    extracted_customer = await synalinks.Generator(
        data_model=Customer,
        language_model=language_model,
        instructions="Extract customer information from the document text.",
    )(inputs)
    stored_customer = await synalinks.UpdateKnowledge(
        knowledge_base=knowledge_base,
    )(extracted_customer)

    customer_program = synalinks.Program(
        inputs=inputs,
        outputs=stored_customer,
        name="customer_extraction",
        description="Extract and store customer data",
    )

    print("\nExtracting customers...")
    for text in customer_texts:
        result = await customer_program(DocumentText(text=text))
        print(
            f"  - {result.get('customer_id')}: {result.get('name')} ({result.get('email')})"
        )

    # ==========================================================================
    # Example 4: Search the Knowledge Base
    # ==========================================================================
    print("\nExample 4: Searching the Knowledge Base")
    print("=" * 50)

    # Full-text search for invoices
    print("\nSearch for 'cloud' in invoices:")
    results = await knowledge_base.fulltext_search(
        "cloud", table_name="Invoice", k=5
    )
    for r in results:
        print(f"  Found: {r}")

    # Hybrid search (vector + BM25 fulltext, fused with RRF)
    print("\nSearch for 'office equipment purchase':")
    results = await knowledge_base.hybrid_fts_search(
        "office equipment purchase", table_name="Invoice", k=5
    )
    for r in results:
        print(f"  Found: {r}")

    # ==========================================================================
    # Example 5: Build a Q&A System with RetrieveKnowledge
    # ==========================================================================
    print("\nExample 5: Q&A System with RetrieveKnowledge")
    print("=" * 50)

    inputs = synalinks.Input(data_model=Query)
    retrieved = await synalinks.RetrieveKnowledge(
        knowledge_base=knowledge_base,
        language_model=language_model,
        search_type="hybrid",
        k=5,
        return_inputs=True,
        return_query=True,
    )(inputs)
    answer = await synalinks.Generator(
        data_model=Answer,
        language_model=language_model,
        instructions="Answer the question based on the retrieved business data. Be specific with numbers and dates.",
    )(retrieved)

    qa_program = synalinks.Program(
        inputs=inputs,
        outputs=answer,
        name="business_qa",
        description="Answer questions about business data",
    )

    synalinks.utils.plot_program(
        qa_program,
        to_folder="examples",
        show_module_names=True,
        show_schemas=True,
        show_trainable=True,
    )

    # Test questions
    questions = [
        "What is the total amount of the invoice from TechSupply?",
        "Which invoice is for cloud services?",
        "What is Jane Smith's email?",
        "How much was the standing desk invoice?",
    ]

    print("\nAsking questions:")
    for question in questions:
        print(f"\nQ: {question}")
        result = await qa_program(Query(query=question))
        print(f"A: {result.get('answer')}")

    # ==========================================================================
    # Example 6: List All Stored Records
    # ==========================================================================
    print("\n\nExample 6: Listing All Stored Records")
    print("=" * 50)

    # Get data models from knowledge base
    data_models = knowledge_base.get_symbolic_data_models()

    for dm in data_models:
        table_name = dm.get_schema()["title"]
        records = await knowledge_base.getall(
            table_name=table_name, limit=10
        )
        print(f"\n{table_name} ({len(records)} records):")
        for record in records:
            json_data = record.get_json()
            # Print a summary of each record
            if table_name == "Invoice":
                print(
                    f"  - {json_data['invoice_number']}: {json_data['vendor']} - {json_data['total_amount']} {json_data['currency']}"
                )
            elif table_name == "Customer":
                print(f"  - {json_data['customer_id']}: {json_data['name']}")

    print("\nDone!")


if __name__ == "__main__":
    asyncio.run(main())
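The extraction instructions above ask the language model for dates in YYYY-MM-DD format, while the sample documents use other formats such as "January 20, 2024" and "02/01/2024". Since model output is not guaranteed to follow instructions, a defensive check on the date field before relying on it may be useful. The sketch below is one way to do that with the standard library; is_iso_date is a hypothetical helper and not part of Synalinks.

```python
from datetime import datetime


def is_iso_date(value: str) -> bool:
    """Return True if `value` is a valid calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates missing zero padding (e.g. "2024-1-5"),
    # so require the value to round-trip unchanged.
    return parsed.strftime("%Y-%m-%d") == value


# Values of the kind an extracted record might carry in its date field.
candidates = ["2024-01-15", "January 20, 2024", "02/01/2024"]
valid = [value for value in candidates if is_iso_date(value)]
```

Such a check could run on result.get('date') inside the extraction loop, for example to log or re-process records whose date did not come back in the requested format.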